
# Collaborative Filtering

In the introductory post on recommendation engines, we saw why recommendation engines are needed in everyday life and why they matter online, and finally we discussed three approaches to building one:

1) Collaborative filtering

2) Content-based filtering

3) Hybrid Recommendation Systems

Today we are going to implement the collaborative filtering approach. Before that, I want to explain a few key points about recommendation engines that were left out of the Introduction to recommendation engine post.

### Today's learning topics:

1) What is the Long Tail phenomenon in recommendation engines?

2) The basic idea of collaborative filtering

3) Understanding collaborative filtering through a friends' movie recommendation engine

4) Implementing the collaborative filtering approach to a recommendation engine

### What is the Long Tail phenomenon in recommendation engines?

Recommendation engines have become popular over the last 10 to 20 years. The reason is that we have moved from a state of scarcity to a state of abundance. Don’t be put off by the words scarcity and abundance. The next paragraphs explain why I use them to describe the rise of recommendation engines.

**Scarcity to Abundance:**

Let’s understand **scarcity to abundance** with bookstores. Imagine a physical bookstore like Crossword with thousands of books. Almost all the popular books are on its shelves, but shelf space is limited. To add more shelves the retailer needs more floor space, and that costs a lot of money. This is why physical stores leave out so many good books that are less popular, and we end up seeing only the popular ones.

Online bookstores, on the other hand, have practically unlimited shelf space, so we can find almost any book, popular or not. This is because the web enables near-zero-cost dissemination of information about products (books, news articles, mobiles, etc.). This near-zero-cost dissemination of information gives rise to a phenomenon called the **“Long Tail”**.

##### Long Tail phenomenon:

The graph above shows the Long Tail phenomenon. On the X-axis, products are ranked by popularity: the most popular products are on the left, and less popular products are further to the right. On the Y-axis is the popularity itself, meaning how many times a product was purchased or viewed in a given interval, such as one week or one month.

At the top of the Y-axis, the orange curve is only a short distance from the axis itself, which means there are very few highly popular products. The curve falls steeply at first. As the rank increases along the X-axis, the fall becomes more and more gradual, and the curve never reaches the X-axis.

The interesting thing is that the products to the right of the cutoff point are far less popular and are purchased hardly once or twice in a week or month. We don’t find these products in a physical store, because stocking such unpopular items is a waste of money and a good retailer will not keep them. Popular products are available in physical stores as well as online, so the part of the graph to the left of the cutoff point, the Head, is served by both physical and online stores. The less popular products to the right of the cutoff point are available only online. This part of the curve is called the **Long Tail**.

If you look at the area under the curve of the Long Tail, you will notice that it holds many good products that are simply less popular. Finding them online is hard for a user, so there is a need for a system that can recommend these good but unpopular products by considering some metrics. That system is nothing but a recommendation system.

**Basic idea of collaborative filtering:**

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set whose tastes are similar to the user’s. It looks at the other things those people like and combines them to create a ranked list of suggestions, which it then shows to the user. To understand collaborative filtering better, let’s think about our own friends’ movie recommendation system.

##### Understanding collaborative filtering with a friends' movie recommendation engine:

Let’s understand the collaborative filtering approach through a friends’ movie recommendation engine. To explain it, I want to share my own story.

Like most people, I love to watch movies on weekends. There are usually so many movies on my hard disk that it is hard to pick one. So when I want to watch a movie, I look for friends whose taste in movies is similar to mine and ask them to recommend a movie I might like (one they have seen and I haven’t). They recommend a movie that they liked and that I have never seen. This may have happened to you too.

In other words, you have implemented your own movie recommendation engine by treating your friends as the group of people to search. Over time you have learned, by observation, which of them usually like the same things you do. So you pick the friends whose taste is most similar to yours and expect good movie recommendations from them. Applying this technique to build a recommendation engine is called collaborative filtering.

I hope the idea behind collaborative filtering is clear now. Let’s get our hands dirty by implementing it in Python.

##### Implementing the collaborative filtering approach to a recommendation engine:

**Data set for implementing the collaborative filtering recommendation engine:**

To implement collaborative filtering we first need a data set of rated preferences (how much the people in the data set like a given set of items). I am taking this data set from one of my favorite books, **Programming Collective Intelligence** by **Toby Segaran**.

First I store this data set in a Python dictionary. For a huge data set we would generally store these preferences in a database.

File name: **recommendation_data.py**

    #!/usr/bin/env python
    # Collaborative filtering data set taken from the Collective Intelligence book.

    dataset = {
        'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                      'Just My Luck': 3.0, 'Superman Returns': 3.5,
                      'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
        'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                         'Just My Luck': 1.5, 'Superman Returns': 5.0,
                         'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
        'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                             'Superman Returns': 3.5, 'The Night Listener': 4.0},
        'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                         'The Night Listener': 4.5, 'Superman Returns': 4.0,
                         'You, Me and Dupree': 2.5},
        'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                         'Just My Luck': 2.0, 'Superman Returns': 3.0,
                         'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
        'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                          'The Night Listener': 3.0, 'Superman Returns': 5.0,
                          'You, Me and Dupree': 3.5},
        'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
                 'Superman Returns': 4.0},
    }

Now the data set is ready, so let’s start implementing the recommendation engine. Create a new file named **collaborative_filtering.py**, and make sure it is in the same directory as **recommendation_data.py**. First let’s import the data set into collaborative_filtering.py and check that we can read it properly by answering the questions below.

1) What rating did Lisa Rose and Michael Phillips give to Lady in the Water?

2) What are the movie ratings of Jack Matthews?

File name: **collaborative_filtering.py**

    #!/usr/bin/env python
    # Implementation of collaborative filtering recommendation engine

    from recommendation_data import dataset

    # print(...) with a single argument behaves the same on Python 2 and Python 3
    print("Lisa Rose rating on Lady in the water: {}\n".format(dataset['Lisa Rose']['Lady in the Water']))
    print("Michael Phillips rating on Lady in the water: {}\n".format(dataset['Michael Phillips']['Lady in the Water']))

    print('**************Jack Matthews ratings**************')
    print(dataset['Jack Matthews'])

**Script Output:**

    Lisa Rose rating on Lady in the water: 2.5

    Michael Phillips rating on Lady in the water: 2.5

    **************Jack Matthews ratings**************
    {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'You, Me and Dupree': 3.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0}

Now that we can read the data, we need to find similar people by comparing each person with every other person. We do this by calculating a similarity score between them. To know more about similarity, you can read the similarity score post. So let’s write a function to find the similarity between two people.

**Similar Persons:**

First let’s use the Euclidean distance to find the similarity between two people. To do that we need to write a Euclidean distance function. Let’s add this function to **collaborative_filtering.py** and find the Euclidean distance score between Lisa Rose and Jack Matthews. Before that, let’s recall what the Euclidean distance is.

The Euclidean distance is the most commonly used distance. In most cases, when people talk about distance they mean the Euclidean distance, which is why it is also known simply as distance. It is a good proximity measure when the data is dense or continuous. The Euclidean distance between two points is the length of the straight line connecting them, and it is given by the Pythagorean theorem.
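For reference, the formula can be written out as below. The notation is an assumption made for illustration: $x_i$ and $y_i$ stand for the ratings that two people gave to the $i$-th of the $n$ items they have both rated.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad\qquad
\text{score}(x, y) = \frac{1}{1 + d(x, y)}
```

The second expression is the inverted form that the similarity function in this post returns, so that a smaller distance gives a score closer to 1.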

    #!/usr/bin/env python
    # Implementation of collaborative filtering recommendation engine
    from recommendation_data import dataset
    from math import sqrt

    def similarity_score(person1,person2):

        # Returns ratio Euclidean distance score of person1 and person2

        both_viewed = {}  # To get both rated items by person1 and person2

        for item in dataset[person1]:
            if item in dataset[person2]:
                both_viewed[item] = 1

        # Conditions to check they both have an common rating items
        if len(both_viewed) == 0:
            return 0

        # Finding Euclidean distance
        sum_of_eclidean_distance = []

        for item in dataset[person1]:
            if item in dataset[person2]:
                sum_of_eclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item],2))
        sum_of_eclidean_distance = sum(sum_of_eclidean_distance)

        return 1/(1+sqrt(sum_of_eclidean_distance))

    print(similarity_score('Lisa Rose','Jack Matthews'))

**Script Output:**

    0.340542426583

This means the Euclidean distance score of Lisa Rose and Jack Matthews is 0.34054.

**Code Explanation:**

Lines 3-4:

- We import the recommendation data set and the sqrt function.

Lines 6-28:

- The similarity_score function takes two parameters, person1 and person2.
- First we check whether person1 and person2 have rated any items in common. This is done in lines 12 to 14.
- Once both_viewed is built, we check its length. If it is zero, the two people have nothing in common, so the function returns 0 straight away.
- Then we sum the squared differences of the ratings over the items rated by both person1 and person2.
- Finally we return the inverted Euclidean distance. The plain Euclidean distance is smaller for people who are more similar, but we want a higher value for people who are more similar. We get this by adding 1 to the distance (so that we never divide by zero) and inverting it.

Is this a good way to measure the similarity between two users? Let’s take an example. Suppose we have two users, X and Y. X rates a good movie 4, an average movie 2, and a bad movie 1. Y rates a good movie 5, an average movie 4, and a bad movie 3.

If we calculate the similarity between these two users, they will come out as somewhat similar, but we are missing an important point. Take a movie that both X and Y rated 4. The Euclidean distance formula treats this as a sign that X and Y are very similar, yet for X it is a good movie and for Y only an average one. So the Euclidean distance can mislead us here, and we need another way to measure the similarity between two users. That approach is the Pearson correlation.
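The X and Y example can be checked with a few lines of code. This is a minimal sketch written for Python 3. The two rating lists and the helper names `euclidean_score` and `pearson_score` are made up for illustration and are not part of the post's files. It scores the same three movies (good, average, bad) with both measures.

```python
from math import sqrt

# Hypothetical ratings for the same three movies (good, average, bad),
# following the X and Y rating habits described in the text.
x_ratings = [4.0, 2.0, 1.0]
y_ratings = [5.0, 4.0, 3.0]


def euclidean_score(a, b):
    """Inverted Euclidean distance: 1 / (1 + distance)."""
    distance = sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return 1 / (1 + distance)


def pearson_score(a, b):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    spread_a = sqrt(sum((p - mean_a) ** 2 for p in a))
    spread_b = sqrt(sum((q - mean_b) ** 2 for q in b))
    if spread_a == 0 or spread_b == 0:
        return 0
    return covariance / (spread_a * spread_b)


euclidean = euclidean_score(x_ratings, y_ratings)
pearson = pearson_score(x_ratings, y_ratings)
print("Euclidean score: {:.3f}".format(euclidean))  # 0.250
print("Pearson score:   {:.3f}".format(pearson))    # 0.982
```

X and Y agree completely on which movie is better than which, and Y is simply more generous with ratings. The Euclidean score stays low because the raw numbers differ, while the Pearson score is close to 1 because it compares how the ratings move together rather than their absolute values.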

**Pearson Correlation Score:**

A slightly more sophisticated way to determine the similarity between people’s interests is to use the Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula is more complicated than the Euclidean distance score, but it tends to give better results when the data isn’t well normalized, as in our present data set.

The implementation of the Pearson correlation score first finds the items rated by both users. It then calculates the sums and the sums of the squares of the ratings for both users, and the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient. Unlike the distance metric, this formula is not intuitive, but it tells you how much the variables change together divided by the product of how much they vary individually.
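Written out, the steps described above correspond to the following computational form of the coefficient. The symbols are an assumption made for illustration: $x_i$ and $y_i$ are the two users' ratings of the $i$-th item they have both rated, and $n$ is the number of such items.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i \;-\; \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}
         {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)
                \left(\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right)}}
```

The numerator is the "change together" part and the denominator is the product of the individual variations. When the denominator is zero (one user gave the same rating to every shared item), the coefficient is undefined, and returning 0 in that case is a reasonable convention.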

Let’s create this function in the same **collaborative_filtering.py** file.

    # Implementation of collaborative filtering recommendation engine
    from recommendation_data import dataset
    from math import sqrt

    def similarity_score(person1,person2):

        # Returns ratio Euclidean distance score of person1 and person2

        both_viewed = {}  # To get both rated items by person1 and person2

        for item in dataset[person1]:
            if item in dataset[person2]:
                both_viewed[item] = 1

        # Conditions to check they both have an common rating items
        if len(both_viewed) == 0:
            return 0

        # Finding Euclidean distance
        sum_of_eclidean_distance = []

        for item in dataset[person1]:
            if item in dataset[person2]:
                sum_of_eclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item],2))
        sum_of_eclidean_distance = sum(sum_of_eclidean_distance)

        return 1/(1+sqrt(sum_of_eclidean_distance))

    def pearson_correlation(person1,person2):

        # To get both rated items
        both_rated = {}
        for item in dataset[person1]:
            if item in dataset[person2]:
                both_rated[item] = 1

        number_of_ratings = len(both_rated)

        # Checking for number of ratings in common
        if number_of_ratings == 0:
            return 0

        # Add up all the preferences of each user
        person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
        person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

        # Sum up the squares of preferences of each user
        person1_square_preferences_sum = sum([pow(dataset[person1][item],2) for item in both_rated])
        person2_square_preferences_sum = sum([pow(dataset[person2][item],2) for item in both_rated])

        # Sum up the product value of both preferences for each item
        product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

        # Calculate the pearson score
        numerator_value = product_sum_of_both_users - (person1_preferences_sum*person2_preferences_sum/number_of_ratings)
        denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum,2)/number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum,2)/number_of_ratings))

        # A zero denominator means at least one user gave identical ratings to all shared items
        if denominator_value == 0:
            return 0
        else:
            r = numerator_value/denominator_value
            return r

    print(pearson_correlation('Lisa Rose','Gene Seymour'))

**Script Output:**

    0.396059017191

The pearson_correlation function returns a value between -1 and 1. A value of 1 means the two users have the same taste in almost all cases.

**Ranking similar users for a user:**

Now that we have functions for comparing two people, we can create a function that scores everyone against a given person and finds the closest matches. Let’s add this function to the **collaborative_filtering.py** file to get an ordered list of people with tastes similar to the specified person.

    def most_similar_users(person,number_of_users):
        # returns the number_of_users (similar persons) for a given specific person.
        scores = [(pearson_correlation(person,other_person),other_person) for other_person in dataset if other_person != person]

        # Sort the similar persons so that highest scores person will appear at the first
        scores.sort()
        scores.reverse()
        return scores[0:number_of_users]

    print(most_similar_users('Lisa Rose',3))

**Script Output:**

    [(0.9912407071619299, 'Toby'), (0.7470178808339965, 'Jack Matthews'), (0.5940885257860044, 'Mick LaSalle')]

So far we have only found the people who are most similar to a given person. To recommend a movie, we could simply take the most similar person and pick a movie they liked that our person has not seen yet, but that would be too permissive. Such an approach could accidentally turn up reviewers who haven’t rated some of the movies that the person might like. It could also return a reviewer who strangely liked a movie that got bad reviews from all the other people returned by the **most_similar_users** function.

To solve these issues, you need to score the items by producing a weighted score that ranks the reviewers. Take the votes of all the other people and multiply how similar they are to the person by the score they gave each movie.

The image below shows how to do that.

This image shows the correlation score for each person and the ratings they gave to three movies that Toby hasn’t rated: The Night Listener, Lady in the Water, and Just My Luck. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to Toby contributes more to the overall score than a person who is different from Toby. The Total row shows the sum of all these numbers.

We could just use the totals to calculate the rankings, but then a movie reviewed by more people would have a big advantage. To correct for this, you need to divide by the sum of all the similarities of the people who reviewed that movie (the Sim.Sum row in the table). Because The Night Listener was reviewed by everyone, its total is divided by the sum of all the similarities. Lady in the Water, however, was not reviewed by Puig, so its total is divided by the sum of the similarities of the remaining reviewers. The last row shows the results of this division.
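Before the full implementation, the arithmetic for one movie at a time can be spelled out as below. This is a minimal, self-contained sketch written for Python 3. The helper name `weighted_score` is hypothetical and not part of the post's files, and the ratings are copied from the data set above so that the snippet runs on its own. It assumes, as the recommendation function in this post does, that reviewers with a similarity of zero or lower are ignored.

```python
from math import sqrt

# Ratings copied from the data set used in this post.
dataset = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
    'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                         'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                     'The Night Listener': 4.5, 'Superman Returns': 4.0,
                     'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                     'Just My Luck': 2.0, 'Superman Returns': 3.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                      'The Night Listener': 3.0, 'Superman Returns': 5.0,
                      'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}


def pearson_correlation(person1, person2):
    """Pearson correlation over the items both people have rated."""
    both_rated = [item for item in dataset[person1] if item in dataset[person2]]
    n = len(both_rated)
    if n == 0:
        return 0
    sum1 = sum(dataset[person1][item] for item in both_rated)
    sum2 = sum(dataset[person2][item] for item in both_rated)
    sum1_sq = sum(dataset[person1][item] ** 2 for item in both_rated)
    sum2_sq = sum(dataset[person2][item] ** 2 for item in both_rated)
    product_sum = sum(dataset[person1][item] * dataset[person2][item]
                      for item in both_rated)
    numerator = product_sum - sum1 * sum2 / n
    denominator = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if denominator == 0:
        return 0
    return numerator / denominator


def weighted_score(person, movie):
    """Similarity-weighted average rating of one movie for one person."""
    total = 0.0     # sum of similarity * rating (the Total row)
    sim_sum = 0.0   # sum of similarities (the Sim.Sum row)
    for other in dataset:
        # Skip the person themselves and anyone who has not rated the movie
        if other == person or movie not in dataset[other]:
            continue
        sim = pearson_correlation(person, other)
        # Ignore reviewers with zero or negative similarity
        if sim <= 0:
            continue
        total += sim * dataset[other][movie]
        sim_sum += sim
    return total / sim_sum if sim_sum else 0.0


for movie in ['The Night Listener', 'Lady in the Water', 'Just My Luck']:
    print("{:<20} {:.2f}".format(movie, weighted_score('Toby', movie)))
```

Dividing by the sum of similarities rather than by the number of reviewers keeps the result on the original rating scale, and it stops a movie from ranking higher just because more people reviewed it.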

Let’s implement that now.

    def user_reommendations(person):

        # Gets recommendations for a person by using a weighted average of every other user's rankings
        totals = {}
        simSums = {}
        for other in dataset:
            # don't compare me to myself
            if other == person:
                continue
            sim = pearson_correlation(person,other)

            # ignore scores of zero or lower
            if sim <= 0:
                continue
            for item in dataset[other]:

                # only score movies i haven't seen yet
                if item not in dataset[person] or dataset[person][item] == 0:

                    # Similarity * score
                    totals.setdefault(item,0)
                    totals[item] += dataset[other][item] * sim
                    # sum of similarities
                    simSums.setdefault(item,0)
                    simSums[item] += sim

        # Create the normalized list
        rankings = [(total/simSums[item],item) for item,total in totals.items()]
        rankings.sort()
        rankings.reverse()
        # returns the recommended items
        recommendataions_list = [recommend_item for score,recommend_item in rankings]
        return recommendataions_list

    print("Recommendations for Toby")
    print(user_reommendations('Toby'))

**Script Output:**

    Recommendations for Toby
    ['The Night Listener', 'Lady in the Water', 'Just My Luck']

That completes the recommendation engine. Change 'Toby' to any other person in the data set and check the recommended items.

**To get the complete code, you can clone our GitHub project:**

**https://github.com/saimadhu-polamuri/CollaborativeFiltering**

**To get all the code from the dataaspirant blog, you can clone the GitHub repository below:**

https://github.com/saimadhu-polamuri/DataAspirant_codes

**Reference Book:**

Programming Collective Intelligence by Toby Segaran


I hope you liked today’s post. If you have any questions, or want to share your experience or opinions, feel free to comment below. If you want me to write about a specific topic, tell me in the comments.