Collaborative filtering recommendation engine implementation in Python

Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter

Collaborative Filtering


In the introduction post on recommendation engines we saw the need for recommendation engines in real life, their importance for online businesses, and finally we discussed three types of recommendation engines. They are:

1) Collaborative filtering

2) Content-based filtering

3) Hybrid Recommendation Systems

So today we are going to implement the collaborative filtering approach of a recommendation engine. Before that, I want to explain some key things about recommendation engines which were missed in the Introduction to recommendation engine post.

Today's learning topics:

1) What is the long tail phenomenon in recommendation engines?

2) The basic idea behind collaborative filtering

3) Understanding the collaborative filtering approach with a friends movie recommendation engine

4) Implementing the collaborative filtering approach of a recommendation engine

What is the long tail phenomenon in recommendation engines?

Recommendation engines have become popular in the last 10 to 20 years. The reason is that we moved from a state of scarcity to a state of abundance. Don't be frightened by these two words, scarcity and abundance; the coming paragraphs will make clear why they matter for the popularity of recommendation engines.

Scarcity to Abundance:


Let's understand scarcity to abundance with book stores. Imagine a physical book store like Crossword with thousands of books. We can see almost all the popular books on the store's shelves, but shelf space is limited; to add more shelves, the retailer needs more floor space, which costs a huge amount of money. This is why we miss so many good books that are less popular and end up getting only the popular ones.

On the other hand, online book stores effectively have unlimited shelves with unlimited shelf space, so we can get almost any book, popular or unpopular. This is because the web enables near-zero-cost dissemination of information about products (books, news articles, mobiles, etc.). This near-zero-cost dissemination of information on the web gives rise to a phenomenon called the "Long Tail".

Long Tail phenomenon:

[Figure: the Long Tail curve — products ranked by popularity on the X-axis, popularity (purchases or views per interval) on the Y-axis]

The above graph represents the Long Tail phenomenon. On the X-axis we have products ranked by popularity (the most popular products on the left, less popular products further to the right). Here popularity means the number of times a product was purchased or viewed. On the Y-axis is the popularity itself: how many times a product was purchased or viewed in some interval of time, like one week or one month.

If you look at the top of the graph, near the Y-axis, the orange curve sits well away from the axis: there are only a few very popular products. The curve initially falls very steeply; as we move right and the product rank grows, popularity keeps falling, then less and less steeply, and it never quite reaches the X-axis. The interesting thing is that products to the right of the cutoff point are very much less popular, hardly purchased once or twice in a week or month. We don't find these products in physical retail stores, because stocking such less popular items is a waste of money, so a sensible retailer doesn't stock them. Popular products can be stored in physical retail stores and can also be found online, so the head of the curve, to the left of the cutoff point, is served by both physical retail stores and online stores. To the right of the cutoff point, the less popular products are available only online. This tail of less popular products to the right of the cutoff point is called the Long Tail.

If you look at the area under the curve in the Long Tail, you can notice there are many good products that are simply less popular. Finding them online is a hard task for the user, so there is a need for a system that can recommend these good but unpopular products to the user by considering some metrics. This system is nothing but a recommendation system.

Basic idea about collaborative filtering:

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to the user's. It looks at the other things those people like and combines them to create a ranked list of suggestions, which is finally shown to the user. For a better understanding of collaborative filtering, let's think about our own friends movie recommendation system.
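The idea above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the post's actual implementation: the tiny `ratings` dictionary and the crude overlap-based similarity are placeholders just to show the shape of the algorithm (score every other user's similarity, then rank unseen items by similarity-weighted ratings).

```python
# Toy data: each user maps item names to ratings (placeholder values).
ratings = {
    'alice': {'A': 5.0, 'B': 3.0, 'C': 4.0},
    'bob':   {'A': 5.0, 'B': 3.5, 'D': 4.5},
    'carol': {'A': 1.0, 'D': 2.0},
}

def overlap_similarity(u, v):
    # Crude similarity: count the co-rated items where the two
    # ratings are within one star of each other.
    common = set(ratings[u]) & set(ratings[v])
    return sum(1 for item in common
               if abs(ratings[u][item] - ratings[v][item]) <= 1.0)

def recommend(user):
    # Weight every other user's ratings of unseen items by similarity.
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = overlap_similarity(user, other)
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest weighted score first.
    return sorted(scores, key=scores.get, reverse=True)

print(recommend('alice'))  # ['D'] — bob is alice's closest match, so his 'D' wins
```

The real implementation below follows the same shape, but with a proper similarity measure and normalization.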

Understanding the collaborative filtering approach with a friends movie recommendation engine:


Let's understand the collaborative filtering approach through a friends movie recommendation engine. To explain it, I want to share my own story.

Like most people, I love to watch movies on weekends. There are usually so many movies on my hard disk that it's hard to select one. So when I want to watch a movie, I look for friends whose movie taste is similar to mine and ask them to recommend a movie I might like (one I haven't seen but they have). They recommend a movie they liked and that I have never seen. It may have happened to you too.

This suggests how to implement your own movie recommendation engine, with your friends as the group of people. Over time you have learned, by observation, which of them usually like the same things as you. So you select a group of your friends, find the one who is most similar to you, and then you can expect a good movie recommendation from that friend. Applying this scenario as a technique to implement a recommendation engine is called collaborative filtering.

I hope the idea behind collaborative filtering is clear. So let's get our hands wet by implementing collaborative filtering in the Python programming language.

Implementing the collaborative filtering approach of the recommendation engine:

Data set for implementing the collaborative filtering recommendation engine:

To implement collaborative filtering, we first need a data set of rated preferences (how much the people in the data set like some set of items). I am taking this data set from one of my favorite books, Programming Collective Intelligence by Toby Segaran.

First I am storing this data set in a Python dictionary. For a huge data set we would generally store these preferences in a database.

File name : recommendation_data.py


#!/usr/bin/env python
# Collaborative filtering data set taken from the Programming Collective Intelligence book.

dataset={
         'Lisa Rose': {'Lady in the Water': 2.5,
                       'Snakes on a Plane': 3.5,
                       'Just My Luck': 3.0,
                       'Superman Returns': 3.5,
                       'You, Me and Dupree': 2.5,
                       'The Night Listener': 3.0},
         'Gene Seymour': {'Lady in the Water': 3.0,
                          'Snakes on a Plane': 3.5,
                          'Just My Luck': 1.5,
                          'Superman Returns': 5.0,
                          'The Night Listener': 3.0,
                          'You, Me and Dupree': 3.5},

        'Michael Phillips': {'Lady in the Water': 2.5,
                             'Snakes on a Plane': 3.0,
                             'Superman Returns': 3.5,
                             'The Night Listener': 4.0},
        'Claudia Puig': {'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0,
                         'The Night Listener': 4.5,
                         'Superman Returns': 4.0,
                         'You, Me and Dupree': 2.5},
        'Mick LaSalle': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'Just My Luck': 2.0,
                         'Superman Returns': 3.0,
                         'The Night Listener': 3.0,
                         'You, Me and Dupree': 2.0},
       'Jack Matthews': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'The Night Listener': 3.0,
                         'Superman Returns': 5.0,
                         'You, Me and Dupree': 3.5},
      'Toby': {'Snakes on a Plane':4.5,
               'You, Me and Dupree':1.0,
               'Superman Returns':4.0}}

Now we are ready with the data set, so let's start implementing the recommendation engine. Create a new file named collaborative_filtering.py, taking care to create it in the same directory where you created the recommendation_data.py file. First let's import our recommendation data set into collaborative_filtering.py and check whether we are getting the data properly by answering the questions below.

1) What were the ratings for Lady in the Water by Lisa Rose and Michael Phillips?

2) What are the movie ratings of Jack Matthews?

File name : collaborative_filtering.py


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset

print "Lisa Rose rating on Lady in the water: {}\n".format(dataset['Lisa Rose']['Lady in the Water'])
print "Michael Phillips rating on Lady in the water: {}\n".format(dataset['Michael Phillips']['Lady in the Water'])

print '**************Jack Matthews ratings**************'
print dataset['Jack Matthews']

Script Output:

Lisa Rose rating on Lady in the water: 2.5

Michael Phillips rating on Lady in the water: 2.5

**************Jack Matthews ratings**************
{'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'You, Me and Dupree': 3.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0}
[Finished in 0.1s]

After getting the data, we need to find similar people by comparing each person with every other person, calculating a similarity score between them. To know more about similarity, you can view the similarity score post. So let's write a function to find the similarity between two persons.

Similar Persons:

First let's use Euclidean distance to find the similarity between two people. To do that we need to write a function that measures Euclidean distance. Let's add this function to collaborative_filtering.py and find the Euclidean distance between Lisa Rose and Jack Matthews. Before that, let's recall the Euclidean distance formula.

d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

Euclidean distance is the most commonly used distance. In most cases when people talk about distance, they refer to Euclidean distance, which is also known simply as distance. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them, and is given by the Pythagorean theorem.
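As a quick sanity check on the formula, here is a toy two-point example in plain Python (not the movie data): the distance between (0, 0) and (3, 4) should be the familiar 3-4-5 triangle hypotenuse.

```python
from math import sqrt

# Distance between the points (0, 0) and (3, 4): sqrt(3^2 + 4^2) = 5.0
p = (0.0, 0.0)
q = (3.0, 4.0)
distance = sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
print(distance)  # 5.0
```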

 


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):

    # Returns the Euclidean distance score of person1 and person2

    both_viewed = {}  # To get the items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # Check whether they have any common rated items
    if len(both_viewed) == 0:
        return 0

    # Finding Euclidean distance
    sum_of_euclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_euclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_euclidean_distance = sum(sum_of_euclidean_distance)

    return 1/(1+sqrt(sum_of_euclidean_distance))

print similarity_score('Lisa Rose','Jack Matthews')

Script Output:

0.340542426583
[Finished in 0.0s]

This means the Euclidean distance score between Lisa Rose and Jack Matthews is 0.34054.

Code Explanation:

Line 3-4:

  • Imported the recommendation data set and the sqrt function.

Line 6-28:

  • The similarity_score function takes two parameters, person1 and person2.
  • The first thing we need to do is check whether person1 and person2 rated any common items; we do this with the first for loop.
  • Once we have both_viewed, we check its length. If it is zero there is no need to compute a similarity score, because it will be zero.
  • Then we find the sum of squared differences, considering only the items rated by both person1 and person2.
  • Finally we return the inverted value of the Euclidean distance. The reason for inverting it is that Euclidean distance returns the distance between the two users: a smaller distance means they are more similar, but we want a higher value for people who are similar. This is done by adding 1 to the Euclidean distance (so you don't get a division-by-zero error) and inverting it.
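The inverted-distance trick can be checked in isolation. The sketch below is standalone (the two users' ratings are copied inline from the data set above, rather than importing the recommendation_data module):

```python
from math import sqrt

# Ratings copied from the data set above for the two users compared in the post.
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
jack = {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'The Night Listener': 3.0,
        'Superman Returns': 5.0, 'You, Me and Dupree': 3.5}

# Squared differences over the commonly rated movies only.
squared_diffs = [(lisa[m] - jack[m]) ** 2 for m in lisa if m in jack]
distance = sqrt(sum(squared_diffs))   # plain Euclidean distance
score = 1 / (1 + distance)            # inverted: higher means more similar

print(round(score, 5))  # 0.34054, matching the script output above
```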

 

Do you think the approach we used here is a good one to find the similarity between two users? Let's consider an example to get a clear idea. Suppose we have two users, X and Y. If X feels a movie is good he will rate it 4, if he feels it's an average movie he will rate it 2, and if he feels it's not a good movie he will rate it 1. In the same way, Y will rate 5 for a good movie, 4 for an average movie, and 3 for the worst movie.

Rating scale    User X    User Y
Good movie      4         5
Average movie   2         4
Bad movie       1         3

If we calculate the similarity between these two users they will come out somewhat similar, but we are missing a key point here. Suppose a movie is rated by both X and Y: X rates it 4 and Y rates it 4. The Euclidean distance formula then says X and Y agree exactly, but this movie is a good one for user X and only an average one for Y. So if we use Euclidean distance our approach will be wrong, and we have to use some other approach to find the similarity between two users. This approach is Pearson correlation.

Pearson correlation score:

A slightly more sophisticated way to determine the similarity between people's interests is to use the Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized, like our present data set.

The implementation of the Pearson correlation score first finds the items rated by both users. It then calculates the sums and the sums of squares of the ratings for the two users, and the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient. Unlike the distance metric, this formula is not intuitive, but it does tell you how much the variables change together, divided by the product of how much they vary individually.

r = ( sum(xy) - sum(x)sum(y)/n ) / sqrt( (sum(x^2) - sum(x)^2/n) * (sum(y^2) - sum(y)^2/n) )

Let’s create this function in the same collaborative_filtering.py file.


# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):

    # Returns the Euclidean distance score of person1 and person2

    both_viewed = {}  # To get the items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # Check whether they have any common rated items
    if len(both_viewed) == 0:
        return 0

    # Finding Euclidean distance
    sum_of_euclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_euclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_euclidean_distance = sum(sum_of_euclidean_distance)

    return 1/(1+sqrt(sum_of_euclidean_distance))

def pearson_correlation(person1, person2):

    # To get the items rated by both users
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1

    number_of_ratings = len(both_rated)

    # Checking for number of ratings in common
    if number_of_ratings == 0:
        return 0

    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item], 2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item], 2) for item in both_rated])

    # Sum up the product of both users' preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum*person2_preferences_sum/number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum, 2)/number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum, 2)/number_of_ratings))
    if denominator_value == 0:
        return 0
    else:
        r = numerator_value/denominator_value
        return r

print pearson_correlation('Lisa Rose','Gene Seymour')

Script Output:

0.396059017191
[Finished in 0.0s]

Generally this pearson_correlation function returns a value between -1 and 1. A value of 1 means both users have the same taste in almost all cases.

Ranking similar users for a user:

Now that we have functions for comparing two people, we can create a function that scores everyone against a given person and finds the closest matches. Let's add this function to the collaborative_filtering.py file to get an ordered list of people with similar tastes to the specified person.


def most_similar_users(person, number_of_users):
    # Returns the number_of_users most similar persons for a given person.
    scores = [(pearson_correlation(person, other_person), other_person) for other_person in dataset if other_person != person]

    # Sort the similar persons so that the highest scoring person appears first
    scores.sort()
    scores.reverse()
    return scores[0:number_of_users]

print most_similar_users('Lisa Rose',3)

Script Output:

[(0.9912407071619299, 'Toby'), (0.7470178808339965, 'Jack Matthews'), (0.5940885257860044, 'Mick LaSalle')]
[Finished in 0.0s]

What we have done so far is just find the persons most similar to a given person; now we have to recommend some movies to that person. But simply copying the closest match's favorites would be too permissive: such an approach could accidentally turn up reviewers who haven't rated some of the movies that particular person might like, and it could also return a reviewer who strangely liked a movie that got bad reviews from all the other persons returned by the most_similar_users function.

To solve these issues, you need to score the items by producing a weighted score that ranks the items. Take the ratings of all the other persons and multiply how similar they are to the particular person by the score they gave each movie. The following shows how we have to do that.

Recommendations for Toby (values rounded; blank cells mean the critic didn't rate that movie):

Critic          Similarity   Night   S.xNight   Lady   S.xLady   Luck   S.xLuck
Rose            0.99         3.0     2.97       2.5    2.48      3.0    2.97
Seymour         0.38         3.0     1.14       3.0    1.14      1.5    0.57
Puig            0.89         4.5     4.02                        3.0    2.68
LaSalle         0.92         3.0     2.77       3.0    2.77      2.0    1.85
Matthews        0.66         3.0     1.99       3.0    1.99
Total                                12.89             8.38             8.07
Sim.Sum                              3.84              2.95             3.18
Total/Sim.Sum                        3.35              2.83             2.53

 

This shows the correlation score for each person and the ratings they gave for the three movies Toby hasn't rated: The Night Listener, Lady in the Water, and Just My Luck. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to Toby contributes more to the overall score than a person who is different from Toby. The Total row shows the sum of these numbers.

We could just use the totals to calculate the rankings, but then a movie reviewed by more people would have a big advantage. To correct for this, you need to divide each total by the sum of the similarities of the persons who reviewed that movie (the Sim.Sum row in the table). Because The Night Listener was reviewed by everyone, its total is divided by the sum of all the similarities; Lady in the Water, however, was not reviewed by Puig, so its total is divided by the sum of the other four similarities. The last row shows the results of this division.
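The Night Listener column can be checked by hand with a standalone sketch, using the rounded similarity values from the table above (the small differences from the table's totals come from that rounding):

```python
# Rounded Toby-similarities and Night Listener ratings for the five critics,
# in the order Rose, Seymour, Puig, LaSalle, Matthews.
similarities = [0.99, 0.38, 0.89, 0.92, 0.66]
ratings = [3.0, 3.0, 4.5, 3.0, 3.0]

total = sum(s * r for s, r in zip(similarities, ratings))  # the S.x column, summed
sim_sum = sum(similarities)                                # the Sim.Sum row

print(round(total / sim_sum, 2))  # 3.35, the weighted score for The Night Listener
```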

Let’s implement that now.


def user_recommendations(person):

    # Gets recommendations for a person by using a weighted average of every other user's rankings
    totals = {}
    simSums = {}
    for other in dataset:
        # Don't compare me to myself
        if other == person:
            continue
        sim = pearson_correlation(person, other)

        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:

            # Only score movies I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:

                # Similarity * score
                totals.setdefault(item, 0)
                totals[item] += dataset[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total/simSums[item], item) for item, total in totals.items()]
    rankings.sort()
    rankings.reverse()

    # Return the recommended items
    recommendations_list = [recommend_item for score, recommend_item in rankings]
    return recommendations_list

print "Recommendations for Toby"
print user_recommendations('Toby')

Script Output:

Recommendations for Toby

['The Night Listener', 'Lady in the Water', 'Just My Luck']
[Finished in 0.0s]

We are done with the recommendation engine; just change the person and check the recommended items.

To get total code you can clone our github project :

 https://github.com/saimadhu-polamuri/CollaborativeFiltering

To get all codes of dataaspirant blog you can clone the below github link:

 https://github.com/saimadhu-polamuri/DataAspirant_codes

Reference Book:

Programming Collective Intelligence by Toby Segaran


I hope you liked today's post. If you have any questions, feel free to comment below. If you want me to write on a specific topic, tell me in the comments below.

If you want to share your experience or opinions, you can say hello to hello@dataaspirant.com.

THANKS FOR READING…..


14 thoughts on "collaborative filtering recommendation engine implementation in python"

  1. Hi Saimadhu, I ran your code on the Amazon review dataset, but the users do not rate the same products, so the Pearson correlation is always 0 and it won't recommend any items. Is there any remedy for that?


    • Hi Sachin,
      If two users don't have any common rated products, Pearson correlation will always return zero, as there are no common rated items between the users. I will try to find a way around this condition.
      Thanks,
      Saimadhu


  2. Excellent! But how would you evaluate your model? How would you know whether a ranked list of recommended movies is suitable or not?


    • If we use Jaccard similarity in this case, it will judge similarity only by whether two users rated a movie or not; it won't consider the ratings themselves, so we would misjudge the similarity between two users. That's why we don't use Jaccard similarity.


  3. Hi Saimadhu,
    Amazingly explained example with step by step approach to Recommender system! I must say it is the simplified ever blog on RS I have come across, so nicely explained the fundamentals !!!
    As I work in DS field, I understand the importance of conceptual understanding of techniques and there are very very few who have made things simple to understand for learners. You have hit the nail.
    I have become fan of your writing, Saimadhu.
    Keep it up! Can’t wait to read your next post on content based, hybrid filter RS examples.
    Regards,
    Pravin Bhosale

