
Significance of Product Association Recommender in Recommender Systems

 

Mariam Jamal

Software Engineer 

August 8, 2019

Recommender system

A recommender system or a recommendation system (sometimes replacing “system” with a synonym such as platform or engine) is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item.

The Product Association recommender falls under the category of 'non-personalized recommender systems': non-personalized because it relies neither on users' preferences or historical behavioral patterns, nor on items' own characteristics or attribute information, to make recommendations; instead, recommendations are based on overlapping customer groups and product categories.

Product Association is often confused with Collaborative Filtering, but the two techniques address the problem with different approaches. The main difference lies in the 'session' on which a recommendation is based. Product Association is ephemeral in nature: it relies only on the current context, e.g. what the customer just added to the cart, and the next recommended item depends entirely on the item just added.

Collaborative Filtering, on the other hand, falls under the category of 'persistent' recommenders. Recommendations are made based on a user's inclination towards a particular genre, the items that belong to that genre, and other people who have shown interest in that genre. Even if a user logs out of their account on, say, Goodreads, the next time they log in they will see more or less the same recommendations as in the last session, assuming Goodreads follows a Collaborative Filtering (or any other persistent recommendation) approach. Where Product Association asks 'What products are frequently found together in a cart?', Collaborative Filtering asks 'What products do users with interests similar to yours like?'.

Product Association is based on two entities: the item-item affinity score and the factor that binds the items together. In the typical case of an e-store, the binding factor is the common customer base for a pair of items.

Product Association Recommender based on Bayes' Theorem

The first idea that comes to mind for finding how frequently two items are bought together (or how closely they can be associated) is to find the percentage of customers buying both items. For example, if we say 75% of buyers buy 'tissue rolls' whenever they buy 'eggs', it suggests a high association between the two products. However, this is not a true association, since 'eggs' are themselves a very common product and people will buy eggs whether they buy tissues or not.

To avoid such spurious associations, a modification of Bayes' statistical theorem is used, which says that the association between two products must be measured relatively, i.e. the probability of buying 'Y' and 'X' together should take into account the probability of buying product 'Y' alone:
affinity score(X → Y) = P(Y | X) / P(Y)
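
For instance (with hypothetical numbers), suppose 80% of customers who buy eggs also buy tissue rolls, but 70% of all customers buy tissue rolls anyway; the score is then 0.80 / 0.70 ≈ 1.14, indicating only a weak association despite the high raw percentage.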

With the intuition understood, let's move on to create a basic recommender system in Python based on Bayes' Theorem. Our recommender will generate movie recommendations based on which movie a viewer just watched, using the MovieLens dataset. Download the dataset and let's get started!

The downloaded dataset has four different CSV files. We will use 'ratings.csv' to track which viewers watched which subset of movies (by tracking the movies they have rated) and 'movies.csv' to map movie titles to movieIds. First, load the files using the pandas library and merge the movie titles onto the ratings.

import pandas as pd
import numpy as np

ratings_fields = ['userId', 'movieId', 'rating']
movies_fields = ['movieId', 'title']

ratings = pd.read_csv("ratings.csv", encoding="ISO-8859-1", usecols=ratings_fields)
movies = pd.read_csv("movies.csv", encoding="ISO-8859-1", usecols=movies_fields)
ratings = pd.merge(ratings, movies, on='movieId')

Next we will build a DataFrame of movie-movie affinity scores, computing the affinity of the referenced movie against every other movie based on common viewers.

Following the Bayes' Theorem modification, the affinity score is the probability of an item Y given item X, divided by the overall probability of item Y.

Translating the equation to our scenario: for the reference movie the viewer just watched, we compute an affinity score for every candidate movie by finding the candidate's distinct viewers and taking the fraction of them who also watched the reference movie. Since the reference movie is fixed, dividing by its overall probability (as the formula would) only scales all scores by the same constant and does not change the ranking, so we omit it. Thus,

affinity(reference → candidate) = |viewers(candidate) ∩ viewers(reference)| / |viewers(candidate)|

# accumulate movie-movie affinity scores as a list of rows
affinity_rows = []

# get unique movies
distinct_movies = np.unique(ratings['movieId'])

# movieId of the movie the viewer just watched
ref_movie = 10
ref_data = ratings[ratings['movieId'] == ref_movie]
ref_viewers = np.unique(ref_data['userId'])
ref_title = ref_data['title'].iloc[0]

# compare the reference movie with every other movie in distinct_movies
for movie in distinct_movies:

  if movie == ref_movie:
    continue

  # distinct viewers of the candidate movie
  movie_data = ratings[ratings['movieId'] == movie]
  movie_viewers = np.unique(movie_data['userId'])

  # viewers who watched both the candidate and the reference movie
  common_viewers = len(np.intersect1d(movie_viewers, ref_viewers))

  # fraction of the candidate's viewers who also watched the reference movie
  affinity_score = float(common_viewers) / float(len(movie_viewers))

  # collect the row for the affinity DataFrame
  affinity_rows.append({
      "base_movieId": ref_movie,
      "base_movieTitle": ref_title,
      "associated_movieId": movie,
      "associated_movieTitle": movie_data['title'].iloc[0],
      "affinity_score": affinity_score
  })

# build the affinity DataFrame and sort by score
movie_affinity = pd.DataFrame(affinity_rows)
movie_affinity = movie_affinity.sort_values(['affinity_score'], ascending=False)

# For better recommendations, set an affinity score threshold
similar_movies = movie_affinity[movie_affinity['affinity_score'] > 0.6]

similar_movies.head(10)

Running the above association program lists the movies most often watched together with the referenced movie, sorted by affinity score.

The results indicate that the higher the affinity score, the larger the share of viewers who have watched the two movies together.

Another method of finding associations is the 'lift' method: the probability of the items being bought together divided by the product of the individual probabilities of the items.

Mathematically,

lift(X, Y) = P(X ∩ Y) / (P(X) × P(Y))
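
As a rough sketch (not part of the original post), the lift between the reference movie and a single candidate movie could be computed from the same ratings DataFrame loaded above; candidate_movie below is an arbitrary movieId chosen purely for illustration.

# Minimal lift sketch, reusing the `ratings` DataFrame and `ref_movie` from above.
# `candidate_movie` is an arbitrary movieId chosen for illustration.
candidate_movie = 356

total_viewers = ratings['userId'].nunique()
ref_viewers = set(ratings.loc[ratings['movieId'] == ref_movie, 'userId'])
cand_viewers = set(ratings.loc[ratings['movieId'] == candidate_movie, 'userId'])

p_ref = len(ref_viewers) / total_viewers
p_cand = len(cand_viewers) / total_viewers
p_both = len(ref_viewers & cand_viewers) / total_viewers

lift = p_both / (p_ref * p_cand)  # lift > 1 suggests a positive association
print(lift)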

We just implemented a basic Product Association Recommender using a modification of Bayes' Theorem. The source code and the Jupyter notebook are also available on our GitHub repo. Keep following to learn about more sophisticated recommenders!

Start Growing with Folio3 AI Today.

We are the Pioneers in the Cognitive Arena – Do you want to become a pioneer yourself?
Get In Touch

Please feel free to reach out to us if you have any questions, or if you need any help with the development, installation, integration, upgrade or customization of your Business Solutions. We have expertise in Deep Learning, Computer Vision, Predictive Learning, CNN, HOG and NLP.

Connect with us for more information at Contact@folio3.ai


Apache Spark in Big Data and Data Science Ecosystem

Mariam Jamal 
Software Engineer
April 04, 2019

Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions, are referred to as "Big Data".

This blog post is dedicated to one of the most popular Big Data tools: Apache Spark, which as of today has over 500 contributors from more than 200 organizations. Technically speaking, Apache Spark is a cluster computing framework for large distributed datasets that promotes general-purpose computing at high speed. However, to truly appreciate and utilize the potential of this fast-growing Big Data framework, it is vital to understand the concept behind Spark's creation and the reason for its crowned place in Data Science.

Data Science itself is an interdisciplinary field that deals with processes and systems that extract inferences and insights from large volumes of data referred to as Big Data. Some of the methods used in data science for the purpose of analysis of data include machine learning, data mining, visualizations, and computer programming among others.

 

Before Spark

With the boom of Big Data came great challenges of data discovery, storage and analytics on gigantic amounts of data. Apache Hadoop appeared in the picture as one of the most comprehensive frameworks for addressing these challenges, with its own Hadoop Distributed File System (HDFS) to deal with storage and parallelize data across a cluster, YARN to manage application runtimes, and MapReduce, the algorithm that makes seamless parallel processing of batch data possible. Hadoop MapReduce set the Big Data wheel rolling by taking care of batch data processing.

However, other use cases for data analysis, like visualizing and streaming big data, still needed a practical solution. Additionally, the constant I/O operations in HDFS made latency another major concern in data processing with Hadoop.

To support other methods of data analysis, Apache took a leading role and introduced a variety of specialized frameworks, each targeting a particular processing style.

 

 

However, despite having multiple robust frameworks to aid in data processing, the industry had no unified, powerful engine able to process multiple types of data. There was also considerable room for improvement in the I/O latency of Hadoop's batch processing.

 

Enter Spark

The Apache Spark project entered the Big Data world as a unified, general-purpose data processing engine addressing real-time processing, interactive processing, graph processing, in-memory processing as well as batch processing, all under one umbrella.
Spark aims at speed, ease of use, extensibility and interactive analytics, with the flexibility to run standalone or in an existing cluster environment. It offers high-level APIs in Java, Scala, Python, and R in its core, along with four other components empowering big data processing.

 

 

Components of Apache Spark 

Spark Core API

Spark Core is the underlying general execution engine for the Spark platform on top of which all other functionality is built. It contains the basic functionality of Spark, including job scheduling, fault recovery, memory management and storage system interactions. Spark Core is also where the API for resilient distributed datasets (RDDs) is defined, which serves as Spark's basic programming abstraction. An RDD in Spark is an immutable, distributed collection of records.
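
As a minimal illustration (not from the original post), the PySpark sketch below distributes a local collection as an RDD and runs a lazy transformation followed by an action; the app name and local master are arbitrary choices.

# Minimal RDD sketch, assuming a local PySpark installation.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

numbers = sc.parallelize(range(1, 11))       # distribute a local collection as an RDD
squares = numbers.map(lambda x: x * x)       # transformation (lazy)
print(squares.reduce(lambda a, b: a + b))    # action triggers the computation

sc.stop()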

Spark SQL and Dataframes

Keeping in mind the heavy reliance of technical users on SQL queries for data manipulation, Spark introduces a module, Spark SQL, for structured data processing which supports many data sources, including Hive tables, Parquet, and JSON.

It also provides a programming abstraction called DataFrames and can act as a distributed SQL query engine. Along with the SQL interface, Spark allows SQL queries to be intermixed with programmatic data manipulation on its RDDs in Python, Java, Scala and R.
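
A minimal sketch (assuming a hypothetical people.json input file) of reading structured data and mixing SQL with DataFrame operations might look like this:

# Minimal Spark SQL sketch; `people.json` is a hypothetical input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.read.json("people.json")          # JSON is one of several supported sources
df.createOrReplaceTempView("people")

# intermix a SQL query with programmatic DataFrame manipulation
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.groupBy("age").count().show()

spark.stop()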

  

Spark Streaming

Spark Streaming is the component that brings the power of real-time data processing to the Spark framework, enabling programmers to work with applications that deal with data stored in memory, on disk or arriving in real time. Running atop Spark Core, Spark Streaming inherits its ease of use and fault-tolerance characteristics.
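
As a rough sketch (not part of the original post), the classic DStream word count over a socket shows the programming model; the host, port and batch interval below are arbitrary choices.

# Minimal Spark Streaming (DStream) sketch; host, port and batch interval are arbitrary.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stream-demo")
ssc = StreamingContext(sc, 5)                      # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # text arriving over a socket
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's word counts

ssc.start()
ssc.awaitTermination()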

  

MLlib

Machine learning has emerged as a critical element in mining Big Data for actionable insights. Spark comes with a library of common machine learning (ML) functionality, called MLlib, delivering both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (much faster than MapReduce). It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm, all with the ability to scale out across multiple nodes.
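
For a flavour of the API (a sketch that is not part of the original post), the snippet below fits a logistic regression model with Spark's DataFrame-based ML API; the LIBSVM file path is an assumed example.

# Minimal MLlib sketch; the LIBSVM data path is an assumed example file.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)

spark.stop()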

  

GraphX

GraphX is a graph manipulation engine built on top of Spark, enabling users to interactively build and transform graph data at scale and to perform graph-parallel computations. GraphX comes with a complete library of common graph algorithms (e.g. PageRank and triangle counting) and various operators for graph manipulation (e.g. subgraph and mapVertices).

 

How has Spark sped up data processing?

Beyond providing a general-purpose unified engine, the main innovation of Spark over classical big data tools is its capability for 'in-memory data processing'. Unlike Hadoop MapReduce, which persists the entire dataset to disk after running each job, Spark takes a more holistic view of the job pipeline, feeding the output of one operation directly into the next without writing it to persistent storage. Along with in-memory data processing, Spark introduced an 'in-memory caching abstraction' that allows multiple operations to work with the same dataset, so that it does not need to be re-read from disk for every single operation. Hence it is billed as a 'lightning-fast' analytics engine on its official site.
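
To make the caching idea concrete, here is a small illustrative sketch (not from the original post, with a hypothetical logs.txt input): the dataset is cached after the first read, so the two subsequent counts reuse the in-memory copy instead of re-reading from disk.

# Minimal caching sketch; `logs.txt` is a hypothetical input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

logs = spark.read.text("logs.txt").cache()                    # keep the dataset in memory

errors = logs.filter(logs.value.contains("ERROR")).count()    # first action materializes the cache
warnings = logs.filter(logs.value.contains("WARN")).count()   # reuses the cached data

print(errors, warnings)
spark.stop()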

 

What filesystem does Spark use?

Apache Spark entered the Big Data ecosystem as a tool that enhances existing frameworks without reinventing the wheel. Unlike Hadoop, Spark does not come with its own file system; instead, it can be integrated with many storage systems, including Hadoop's HDFS, MongoDB, and Amazon's S3. By providing a common platform for multiple types of data processing, and by replacing MapReduce with in-memory data processing that supports iterative programming, Spark is gaining considerable momentum in data analytics.

 

Has Spark replaced Hadoop?

Lastly, a common misconception worth mentioning is that Apache Spark is a replacement for Hadoop. It is not. Although Spark provides many features Hadoop does not, such a comprehensive framework is not necessarily the best choice for every use case. Due to its in-memory data processing, Spark demands a lot of RAM and can become a bottleneck when it comes to cost-efficient processing of big data. Furthermore, Spark is not designed for a multi-user environment and hence lacks strong support for concurrent execution. It is therefore important to be fully familiar with the use case at hand before deciding which big data tool to work with.

 

This was a brief introduction to Apache Spark's place in the Big Data and Data Science ecosystem. For a deeper understanding of Apache Spark programming and Big Data analytics, follow the blogs on folio3.ai.

 

Please feel free to reach out to us if you have any questions, or if you need any help with the development, installation, integration, upgrade or customization of your Business Solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@folio3.ai


A look into ETA Problem using Regression in Python – Machine Learning

Farman Shah
Senior Software Engineer
Feb 22, 2019

The term "ETA" usually means "Estimated Time of Arrival", but in the technology realm it generally refers to the "Estimated Completion Time" of a computational process. In particular, this problem is about estimating the completion time of a batch of long-running scripts executing in parallel.

Problem

A number of campaigns run together in parallel to process data and prepare some lists. The running time of each campaign varies from 10 minutes to perhaps 10 hours or more, depending on the data. A batch of campaigns is considered complete when the execution of all campaigns has finished and the results have been re-sorted so that the lists contain mutually exclusive data.

We will build a solution that can accurately estimate the completion time of campaigns based on past data.

Data

We have very limited data available per campaign from past executions of these campaigns:

campaign_id  start_time           end_time             uses_recommendations  query_count
698          2018-07-31 10:20:02  2018-08-01 02:05:48  0                     48147
698          2018-07-24 11:10:02  2018-07-25 05:42:37  0                     45223
699          2018-07-31 11:05:03  2018-08-01 07:23:16  0                     121898
699          2018-07-24 12:00:04  2018-07-25 10:21:48  0                     116721
700          2018-07-31 10:50:03  2018-08-01 06:54:53  0                     400325
700          2018-07-24 11:45:03  2018-07-25 09:53:03  0                     353497
811          2018-07-31 15:20:03  2018-08-01 01:54:51  1                     2601500
811          2018-07-24 11:00:02  2018-07-25 05:36:30  1                     2609112

 

Feature Engineering / Preprocessing

These are the campaigns for which past data is available. A batch can consist of one or many campaigns from the above list. The uses_recommendations feature was created during feature engineering. It helps the machine differentiate between campaigns that depend on an over-the-network API and those that do not, so the model can implicitly account for network lag.

Is this a Time Series Forecasting problem?

It could have been, but analysis shows that the time of year does not impact the data that much, so this problem can be tackled as a regression problem instead of a time series forecasting problem.

How did it turn into a regression problem?

The absolute difference in seconds between the start time and end time of a campaign is a numeric variable that can be made the target variable. This is what we are going to estimate using regression.

Regression

Our input X is the available data, and the output y is the time difference between the start time and the end time. Now let's import the dataset and start processing.

import pandas as pd

dataset = pd.read_csv('batch_data.csv')

As can be seen, our data has no missing entries at the moment, but since this may become an automated process, we had better handle NA values. The following command fills NA values with the column mean.

dataset = dataset.fillna(dataset.mean(numeric_only=True))

 

The output y is the difference between start time and end time of the campaign. Let’s set up our output variable.

start = pd.to_datetime(dataset['start_time'])
process_end = pd.to_datetime(dataset['end_time'])
y = (process_end - start).dt.total_seconds()

 

y has been taken out of the dataset, and we won't be needing the start_time and end_time columns in X.

X = dataset.drop(['start_time', 'end_time'], axis=1)

 

You might ask how the machine would differentiate between the campaign_ids here in particular, or any such categorical data in general.

A quick recap in case you already know the concept of one-hot encoding: it is a method that creates an indicator variable for each categorical value, so that the variable is 1 for the rows belonging to that value and 0 for all other rows.

If you don't already know the one-hot encoding concept, it is highly recommended to read up on it online and come back to continue. We'll use OneHotEncoder from the sklearn library.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode campaign_id and pass the remaining columns through unchanged
ct = ColumnTransformer(
    [('campaign', OneHotEncoder(handle_unknown='ignore'), ['campaign_id'])],
    remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)

# Avoiding the Dummy Variable Trap is part of one hot encoding
X = X[:, 1:]

 

Now that the input data is ready, one final thing to do is to separate out some data to later test how well our algorithm performs. We are separating out 20% of the data at random.

 

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

 

After trying Linear Regression, SVR (Support Vector Regression), XGBoost and Random Forest on the data, it turned out that the linear and SVR models don't fit the data well, while the performance of XGBoost and Random Forest was close for this data. With only a slight difference between the two, let's move forward with Random Forest.

 

# Fitting Regression to the dataset
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=300, random_state=1, criterion='absolute_error')
regressor.fit(X_train, y_train)

 

The regression has been performed and a regressor has been fitted to our data. What follows next is checking how good the fit is.

 

Performance Measure

We'll use Root Mean Square Error (RMSE) as our measure of goodness. The lower the RMSE, the better the regression.

Let's ask our regressor to make predictions on the training data, i.e. 80% of the total data we had. This will give a glimpse of training accuracy. Later we'll make predictions on the test data, the remaining 20%, which will tell us about the performance of this regressor on unseen data.

If the performance on training data is very good and the performance on unseen data is poor, then our model is overfitting. Ideally, the performance on unseen data should be close to that on the training data.

from sklearn.metrics import mean_squared_error
from math import sqrt
training_predictions = regressor.predict(X_train)

training_mse = mean_squared_error(y_train, training_predictions)
training_rmse = sqrt(training_mse) / 60 # Divide by 60 to turn it into minutes

 

We now have the training RMSE; print it to see how many minutes it deviates, on average, from the actual values.

Now, let’s get the test RMSE.

 

test_predictions = regressor.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)

test_rmse = sqrt(test_mse) / 60

 

Compare test_rmse with training_rmse to see how well the regression performs on seen and unseen data.

What's next for you is to try fitting XGBoost, SVR and any other regression models that you think should fit this data well, and see how the performance of the different models compares.
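
As a starting point (a sketch that is not part of the original post, using sklearn's GradientBoostingRegressor as a stand-in for XGBoost), you could fit another regressor on the same split and compare its test RMSE:

# Sketch: fit an alternative model on the same train/test split and compare test RMSE.
# Reuses X_train, y_train, X_test, y_test, mean_squared_error and sqrt from above.
from sklearn.ensemble import GradientBoostingRegressor

gb_regressor = GradientBoostingRegressor(n_estimators=300, random_state=1)
gb_regressor.fit(X_train, y_train)

gb_rmse = sqrt(mean_squared_error(y_test, gb_regressor.predict(X_test))) / 60
print("Gradient boosting test RMSE (minutes):", gb_rmse)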

Please feel free to reach out to us if you have any questions, or if you need any help with the development, installation, integration, upgrade or customization of your Business Solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@folio3.ai
