Collaborative Filtering Recommendation Engine Implementation in Python


Collaborative Filtering


In the introductory post on recommendation engines we saw the need for recommendation engines in real life, their importance in online services, and finally the three main approaches to building them:

1) Collaborative filtering

2) Content-based filtering

3) Hybrid Recommendation Systems

So today we are going to implement the collaborative filtering approach to a recommendation engine. Before that, I want to explain some key things about recommendation engines which were missed in the Introduction to recommendation engine post.

Today's learning topics:

1) What is the Long Tail phenomenon in recommendation engines?

2) What is the basic idea behind collaborative filtering?

3) Understanding the collaborative filtering approach with a friends movie recommendation engine.

4) Implementing the collaborative filtering approach of the recommendation engine.

What is the Long Tail phenomenon in recommendation engines?

Recommendation engines have become popular over the last 10 to 20 years. The reason is that we have moved from a state of scarcity to a state of abundance. Don't be frightened by the two words scarcity and abundance; in the coming paragraphs I will make clear why I use them to explain the popularity of recommendation engines.

Scarcity to Abundance:


Let's understand scarcity to abundance with book stores. Imagine a physical book store, like Crossword, with thousands of books. We can see almost all the popular books on the store's shelves, but shelf space is limited: to increase the number of shelves the retailer needs more floor space, and for that the retailer has to spend a huge amount of money. This is why we miss out on so many good books that are less popular, and end up getting only the popular ones.

On the other hand, if we look at online book stores, we have unlimited shelves with unlimited shelf space, so we can get almost any book, popular or unpopular. This is because the web enables near-zero-cost dissemination of information about products (books, news articles, mobiles, etc.). This near-zero-cost dissemination of information on the web gives rise to a phenomenon called the "Long Tail".

Long Tail phenomenon:

[Figure: the Long Tail curve, product popularity plotted against popularity rank]

The above graph represents the Long Tail phenomenon. On the X-axis we have products ranked by popularity: the most popular products are on the left, and less popular products extend to the right. Here popularity means the number of times a product was purchased or viewed in an interval of time, such as one week or one month, and that count is what the Y-axis measures.

Notice that at the top of the Y-axis the curve sits well away from the axis itself: there are only a few truly popular products. The curve initially falls very steeply as product rank increases, then at a certain cutoff point the fall becomes less and less steep, and the curve never quite reaches the X-axis. The interesting thing is that products to the right of the cutoff point are very unpopular; they are hardly purchased once or twice in a week or month. Physical retail stores don't stock these products, because storing less popular items is a waste of money, so a sensible retailer doesn't keep them. Popular products, to the left of the cutoff point, form the Head of the curve and can be found both in physical stores and online. Products to the right of the cutoff can only be found online, and this tail of less popular products is called the Long Tail.

If you look at the area under the curve of the Long Tail, you will notice there are many good products that are simply less popular. Finding them online is a hard task for the user, so there is a need for a system that can recommend these good but unpopular products to users by considering some metrics. That system is nothing but a recommendation system.

Basic idea behind collaborative filtering:

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set whose tastes are similar to a given user's. It looks at the other things they like and combines them to create a ranked list of suggestions, which is finally shown to the user. For a better understanding of collaborative filtering, let's think about our own friends as a movie recommendation system.

Understanding the collaborative filtering approach with a friends movie recommendation engine:


Let's understand the collaborative filtering approach through a friends movie recommendation engine. To explain it, I want to share my own story.

Like most people, I love to watch movies on weekends. Generally there are so many movies on my hard disk that it's hard to select just one. That's why, when I want to watch a movie, I search for friends whose movie taste is similar to mine and ask them to recommend a movie I might like (one I haven't seen but they have). They recommend a movie they liked and which I have never seen. It may have happened to you too.

This is how to implement your own movie recommendation engine, by considering your friends as the group of people. You have learned over time, by observing, whether they usually like the same things as you. So you select a group of your friends, find the one who is most similar to you, and then you can expect movie recommendations from that friend. Applying this technique to implement a recommendation engine is called collaborative filtering.

Hope I have made the idea behind collaborative filtering clear. So let's wet our hands by implementing it in the Python programming language.

Implementing the collaborative filtering approach of the recommendation engine:

Data set for implementing collaborative filtering recommendation engine:

To implement collaborative filtering we first need a data set of rated preferences (how much the people in the data set like some set of items). I am taking this data set from one of my favorite books, Programming Collective Intelligence by Toby Segaran.

First I am storing this data set in a Python dictionary. For a huge data set we would generally store these preferences in a database.

File name : recommendation_data.py


#!/usr/bin/env python
# Collaborative filtering data set taken from the Programming Collective Intelligence book.

dataset={
         'Lisa Rose': {'Lady in the Water': 2.5,
                       'Snakes on a Plane': 3.5,
                       'Just My Luck': 3.0,
                       'Superman Returns': 3.5,
                       'You, Me and Dupree': 2.5,
                       'The Night Listener': 3.0},
         'Gene Seymour': {'Lady in the Water': 3.0,
                          'Snakes on a Plane': 3.5,
                          'Just My Luck': 1.5,
                          'Superman Returns': 5.0,
                          'The Night Listener': 3.0,
                          'You, Me and Dupree': 3.5},

        'Michael Phillips': {'Lady in the Water': 2.5,
                             'Snakes on a Plane': 3.0,
                             'Superman Returns': 3.5,
                             'The Night Listener': 4.0},
        'Claudia Puig': {'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0,
                         'The Night Listener': 4.5,
                         'Superman Returns': 4.0,
                         'You, Me and Dupree': 2.5},
        'Mick LaSalle': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'Just My Luck': 2.0,
                         'Superman Returns': 3.0,
                         'The Night Listener': 3.0,
                         'You, Me and Dupree': 2.0},
       'Jack Matthews': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'The Night Listener': 3.0,
                         'Superman Returns': 5.0,
                         'You, Me and Dupree': 3.5},
      'Toby': {'Snakes on a Plane':4.5,
               'You, Me and Dupree':1.0,
               'Superman Returns':4.0}}

Now we are ready with the data set, so let's start implementing the recommendation engine. Create a new file named collaborative_filtering.py; be careful to create it in the same directory where you created the recommendation_data.py file. First let's import our recommendation dataset into collaborative_filtering.py and check whether we are getting the data properly by answering the questions below.

1) What rating did Lisa Rose and Michael Phillips give to Lady in the Water?

2) What are the movie ratings of Jack Matthews?

File name : collaborative_filtering.py


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset

print "Lisa Rose rating on Lady in the water: {}\n".format(dataset['Lisa Rose']['Lady in the Water'])
print "Michael Phillips rating on Lady in the water: {}\n".format(dataset['Michael Phillips']['Lady in the Water'])

print '**************Jack Matthews ratings**************'
print dataset['Jack Matthews']

Script Output:

Lisa Rose rating on Lady in the water: 2.5

Michael Phillips rating on Lady in the water: 2.5

**************Jack Matthews ratings**************
{'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'You, Me and Dupree': 3.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0}
[Finished in 0.1s]

After getting the data, we need to find similar people by comparing each person with every other person; we do this by calculating a similarity score between them. To know more about similarity, you can view the similarity score post. So let's write a function to find the similarity between two persons.

Similar Persons:

First let's use the Euclidean distance to find the similarity between two people. To do that we need to write a Euclidean distance function. Let's add this function to collaborative_filtering.py and find the Euclidean distance between Lisa Rose and Jack Matthews. Before that, let's recall the Euclidean distance formula.

d(p, q) = √( (p1 - q1)² + (p2 - q2)² + ... + (pn - qn)² )

Euclidean distance is the most common measure of distance. In most cases when people talk about distance, they are referring to Euclidean distance; it is also known simply as distance. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them, and is given by the Pythagorean theorem.

 


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):
    # Returns the inverted Euclidean distance score of person1 and person2
    both_viewed = {}  # To get the items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # Check whether they have any common rated items
    if len(both_viewed) == 0:
        return 0

    # Finding Euclidean distance
    sum_of_euclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_euclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_euclidean_distance = sum(sum_of_euclidean_distance)

    return 1 / (1 + sqrt(sum_of_euclidean_distance))

print similarity_score('Lisa Rose', 'Jack Matthews')

Script Output:

0.340542426583
[Finished in 0.0s]

This means the Euclidean distance score of Lisa Rose and Jack Matthews is 0.34054.

Code Explanation:

Lines 4-5:

  • Import the recommendation data set and the sqrt function.

Lines 7-27:

  • The similarity_score function takes two parameters, person1 and person2.
  • The first thing we need to do is check whether person1 and person2 have rated any common items; we do this in lines 11 to 13.
  • Once we have both_viewed, we check its length. If it is zero, there is no need to compute a similarity score, because it will be zero.
  • Then we find the Euclidean distance sum by considering only the items rated by both person1 and person2.
  • Finally we return the inverted value of the Euclidean distance. The reason for inverting is that the plain Euclidean distance measures how far apart two users are: a smaller distance means more similar, but we want a higher value for more similar people. We get this by adding 1 to the distance (so we don't get a division-by-zero error) and taking the reciprocal.
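A couple of quick sanity checks make the inversion concrete (run these in the same file, after the function; the second value is rounded):

print similarity_score('Lisa Rose', 'Lisa Rose')     # 1.0, zero distance maps to a perfect score
print similarity_score('Lisa Rose', 'Gene Seymour')  # ~0.29, a larger distance gives a lower score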

 

Do you think the approach we used here is a good one for finding the similarity between two users? Let's consider an example to get a clear idea. Suppose we have two users, X and Y. If X feels a movie is good he will rate it 4, if he feels it's an average movie he will rate it 2, and if he feels it's not a good movie he will rate it 1. In the same way, Y rates 5 for a good movie, 4 for an average movie and 3 for a worst movie.

[Table: rating scales of users X and Y]

If we calculate the similarity between these users they will look only somewhat similar, but we are missing one great point here. According to the Euclidean distance, if we consider a movie rated by both X and Y, say X rated it 4 and Y rated it 4, then the formula says X and Y are very similar; but that movie is a good one for user X and merely an average one for Y. Users who rate on different scales break the Euclidean approach, so we have to use some other measure to find the similarity between two users. This measure is the Pearson correlation; the sketch below shows the problem in numbers.
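Here is a minimal standalone sketch with the hypothetical users X and Y from above (they are not in our dataset). Their opinions agree on every movie, only their rating scales differ; the inverted Euclidean score calls them fairly dissimilar, while the Pearson correlation (computed with the same algebra as the pearson_correlation function below) calls them almost perfectly similar.

#!/usr/bin/env python
# Hypothetical users: X rates good/average/bad as 4/2/1, Y rates the same movies 5/4/3

from math import sqrt

x = [4, 2, 1]
y = [5, 4, 3]
n = float(len(x))

# Inverted Euclidean score, as in similarity_score above
euclidean_score = 1 / (1 + sqrt(sum(pow(a - b, 2) for a, b in zip(x, y))))

# Pearson correlation, same algebra as the pearson_correlation function below
numerator = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
denominator = sqrt((sum(a * a for a in x) - pow(sum(x), 2) / n) * (sum(b * b for b in y) - pow(sum(y), 2) / n))

print euclidean_score          # 0.25  -> looks dissimilar
print numerator / denominator  # ~0.98 -> almost perfectly correlated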

Pearson Correlation Score:

A slightly more sophisticated way to determine the similarity between people's interests is to use the Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized, like our present data set.

The implementation of the Pearson correlation score first finds the items rated by both users. It then calculates the sums and the sums of the squares of the ratings for the two users, and calculates the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient. Unlike the distance metric, this formula is not intuitive, but it does tell you how much the variables change together, divided by the product of how much they vary individually.

r = ( Σ XY - (Σ X)(Σ Y)/n ) / √( (Σ X² - (Σ X)²/n) (Σ Y² - (Σ Y)²/n) )

where X and Y are the two users' ratings over the n items they have both rated.

Let’s create this function in the same collaborative_filtering.py file.


# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):
    # Returns the inverted Euclidean distance score of person1 and person2
    both_viewed = {}  # To get the items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # Check whether they have any common rated items
    if len(both_viewed) == 0:
        return 0

    # Finding Euclidean distance
    sum_of_euclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_euclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_euclidean_distance = sum(sum_of_euclidean_distance)

    return 1 / (1 + sqrt(sum_of_euclidean_distance))

def pearson_correlation(person1, person2):
    # To get both rated items
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1

    number_of_ratings = len(both_rated)

    # Checking for number of ratings in common
    if number_of_ratings == 0:
        return 0

    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item], 2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item], 2) for item in both_rated])

    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum * person2_preferences_sum / number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum, 2) / number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum, 2) / number_of_ratings))
    if denominator_value == 0:
        return 0
    else:
        r = numerator_value / denominator_value
        return r

print pearson_correlation('Lisa Rose', 'Gene Seymour')

Script Output:

0.396059017191
[Finished in 0.0s]

The pearson_correlation function returns a value between -1 and 1. A value of 1 means the two users have the same taste in almost all cases.

Ranking similar users for a user:

Now that we have functions for comparing two people, we can create a function that scores everyone against a given person and finds the closest matches. Let's add this function to the collaborative_filtering.py file to get an ordered list of people with tastes similar to the specified person.


def most_similar_users(person, number_of_users):
    # Returns the number_of_users most similar persons for a given person
    scores = [(pearson_correlation(person, other_person), other_person) for other_person in dataset if other_person != person]

    # Sort the similar persons so that the highest-scoring person appears first
    scores.sort()
    scores.reverse()
    return scores[0:number_of_users]

print most_similar_users('Lisa Rose', 3)

Script Output:

[(0.9912407071619299, 'Toby'), (0.7470178808339965, 'Jack Matthews'), (0.5940885257860044, 'Mick LaSalle')]
[Finished in 0.0s]

What we have done so far is simply look at the persons who are most similar to someone, and now we have to recommend some movies to that person. But that would be too permissive: such an approach could accidentally turn up reviewers who haven't rated some of the movies that this person would like. It could also return a reviewer who strangely liked a movie that got bad reviews from all the other persons returned by the most_similar_users function.

To solve these issues, you need to score the items by producing a weighted score that ranks the users. Take the ratings of all the other persons and multiply how similar they are to the particular person by the score they gave to each movie. The table below shows how we have to do that.

[Table: each critic's similarity to Toby and their ratings for the three movies Toby hasn't seen]

 

This table shows the correlation scores for each person and the ratings they gave to the three movies Toby hasn't rated: The Night Listener, Lady in the Water, and Just My Luck. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to Toby will contribute more to the overall score than a person who is different from Toby. The Total row shows the sum of all these numbers.

We could just use the totals to calculate the rankings, but then a movie reviewed by more people would have a big advantage. To correct for this, you need to divide each total by the sum of the similarities of the persons who reviewed that movie (the Sim.Sum row in the table). Because The Night Listener was reviewed by everyone, its total is divided by the sum of all the similarities; Lady in the Water, however, was not reviewed by Puig, so her similarity is left out of that movie's divisor. The last row shows the results of this division.

Let’s implement that now.


def user_recommendations(person):
    # Gets recommendations for a person by using a weighted average of every other user's rankings
    totals = {}
    simSums = {}
    for other in dataset:
        # Don't compare me to myself
        if other == person:
            continue
        sim = pearson_correlation(person, other)

        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:
            # Only score movies I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:
                # Similarity * score
                totals.setdefault(item, 0)
                totals[item] += dataset[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    rankings.sort()
    rankings.reverse()

    # Return the recommended items
    recommendations_list = [recommend_item for score, recommend_item in rankings]
    return recommendations_list

print "Recommendations for Toby"
print user_recommendations('Toby')

Script Output:

Recommendations for Toby

['The Night Listener', 'Lady in the Water', 'Just My Luck']
[Finished in 0.0s]

We are done with the recommendation engine; just change the person to check the recommended items for other users, as shown below.
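For example, to see the picks for any other user in the dataset (the returned list contains only the movies that user hasn't rated, best first):

print user_recommendations('Michael Phillips')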

To get the complete code you can clone our GitHub project:

 https://github.com/saimadhu-polamuri/CollaborativeFiltering

To get all the code from the dataaspirant blog you can clone the GitHub repository below:

 https://github.com/saimadhu-polamuri/DataAspirant_codes

Reference book:

Programming Collective Intelligence by Toby Segaran


I hope you liked today's post. If you have any questions then feel free to comment below. If you want me to write on one specific topic then do tell me in the comments below.

If you want to share your experience or opinions, you can say hello to hello@dataaspirant.com.

THANKS FOR READING…..


Five Most Popular Similarity Measures Implementation in Python



The buzz term similarity distance measure has a wide variety of definitions among math and data mining practitioners. As a result, those terms, concepts and their usage can go way over the head of a beginner encountering them for the very first time. So today I am writing this post to give simplified and intuitive definitions of similarity, and I will walk through the five most popular similarity measures and their implementations.

Before explaining the different similarity distance measures, let me explain the key term similarity in data mining. Similarity is the very basic building block for activities such as recommendation engines, clustering, classification and anomaly detection.

Similarity:

Similarity is the measure of how alike two data objects are. Similarity in a data mining context is usually described as a distance with dimensions representing features of the objects. If this distance is small there is a high degree of similarity, whereas a large distance means a low degree of similarity. Similarity is subjective and highly dependent on the domain and application; for example, two fruits may be considered similar because of color or size or taste. Care should be taken when calculating distance across dimensions/features that are unrelated: the relative values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity is measured in the range 0 to 1, [0,1].

Two main considerations about similarity:

  • Similarity = 1 if X = Y         (Where X, Y are two objects)
  • Similarity = 0 if X ≠ Y

That's all about similarity; let's move on to the five most popular similarity distance measures.

Euclidean distance:

d(x, y) = √( (x1 - y1)² + (x2 - y2)² + ... + (xn - yn)² )

Euclidean distance is the most common measure of distance. In most cases when people talk about distance, they are referring to Euclidean distance; it is also known simply as distance. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them, and is given by the Pythagorean theorem.

Euclidean distance implementation in python:


#!/usr/bin/env python

from math import*

def euclidean_distance(x,y):

  return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))

print euclidean_distance([0,3,4,5],[7,6,3,-1])

Script Output:

9.74679434481
[Finished in 0.0s]

 

Manhattan distance:


Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the absolute sum of the differences between the x-coordinates and y-coordinates. Suppose we have two points A and B: to find the Manhattan distance between them, we just sum up the absolute x-axis and y-axis variation, meaning we find how much the two points A and B vary along the X-axis and the Y-axis. In a more mathematical way of saying, the Manhattan distance is the distance between two points measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2).

Manhattan distance = |x1 – x2| + |y1 – y2|

The Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or taxi cab metric.

Manhattan distance implementation in python:


#!/usr/bin/env python

from math import*

def manhattan_distance(x,y):

  return sum(abs(a-b) for a,b in zip(x,y))

print manhattan_distance([10,20,10],[10,20,20])

Script Output:

10
[Finished in 0.0s]

 

Minkowski distance:

[Figure: Minkowski distances of different orders between two objects with three variables, shown in a coordinate system with x, y, z axes]

The Minkowski distance is a generalized metric form of Euclidean distance and Manhattan distance.

d(i, j) = ( |xi1 - xj1|^λ + |xi2 - xj2|^λ + ... + |xin - xjn|^λ )^(1/λ)

In the equation, d(i, j) is the Minkowski distance between the data records i and j, k is the index of a variable, n is the total number of variables, and λ is the order of the Minkowski metric. Although it is defined for any λ > 0, it is rarely used for values other than 1, 2 and ∞.

The Minkowski metric of different orders measures distance in different ways between two objects with three variables (in the figure above they are displayed in a coordinate system with x, y, z axes).

Synonyms of Minkowski:
Different names for the Minkowski distance or Minkowski metric arise from the order:

  • λ = 1 is the Manhattan distance. Synonyms are L1-norm, Taxicab or City-Block distance. For two vectors of ranked ordinal variables, the Manhattan distance is sometimes called Foot-ruler distance.
  • λ = 2 is the Euclidean distance. Synonyms are L2-norm or Ruler distance. For two vectors of ranked ordinal variables, the Euclidean distance is sometimes called Spearman distance.
  • λ = ∞ is the Chebyshev distance. Synonyms are Lmax-norm or Chessboard distance.

Minkowski distance implementation in python:


#!/usr/bin/env python

from math import*
from decimal import Decimal

def nth_root(value, n_root):

 root_value = 1/float(n_root)
 return round (Decimal(value) ** Decimal(root_value),3)

def minkowski_distance(x,y,p_value):

 return nth_root(sum(pow(abs(a-b),p_value) for a,b in zip(x, y)),p_value)

print minkowski_distance([0,3,4,5],[7,6,3,-1],3)

Script Output:

8.373
[Finished in 0.0s]
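As a quick sanity check (assuming the manhattan_distance and minkowski_distance functions from this post are loaded in the same session), order 1 reproduces the Manhattan distance and order 2 the Euclidean distance, rounded to three decimals:

print minkowski_distance([0,3,4,5], [7,6,3,-1], 1)  # 17.0, same as manhattan_distance([0,3,4,5], [7,6,3,-1])
print minkowski_distance([0,3,4,5], [7,6,3,-1], 2)  # 9.747, the euclidean_distance result rounded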

Cosine similarity:


The cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we are effectively trying to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.

Cosine similarity implementation in python:


#!/usr/bin/env python

from math import*

def square_rooted(x):

   return round(sqrt(sum([a*a for a in x])),3)

def cosine_similarity(x,y):

 numerator = sum(a*b for a,b in zip(x,y))
 denominator = square_rooted(x)*square_rooted(y)
 return round(numerator/float(denominator),3)

print cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15])

Script Output:

0.972
[Finished in 0.1s]
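Two edge cases follow directly from the definition (assuming the cosine_similarity function above is in scope):

print cosine_similarity([1, 2, 3], [2, 4, 6])  # 1.0, same orientation, different magnitude
print cosine_similarity([1, 0], [0, 1])        # 0.0, vectors at 90 degrees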

Jaccard similarity:


So far we have discussed metrics for finding the similarity between objects that are points or vectors. With Jaccard similarity the objects are sets, so first let's learn some very basic things about sets.

Sets:

A set is an (unordered) collection of objects, e.g. {a, b, c}. We use the notation of elements separated by commas inside curly brackets { }. Sets are unordered, so {a, b} = {b, a}.

Cardinality:

The cardinality of A, denoted |A|, counts how many elements are in A.

Intersection:

The intersection of two sets A and B, denoted A ∩ B, reveals all items which are in both sets.

Union:

Union between two sets A and B is denoted A ∪ B and reveals all items which are in either set.
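These operations map directly onto Python's built-in set type; a small illustration:

A = set(['a', 'b', 'c', 'd'])
B = set(['c', 'd', 'e'])

print len(A)  # cardinality |A| -> 4
print A & B   # intersection A ∩ B -> the elements c and d
print A | B   # union A ∪ B -> all five elements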

 


Now going back to Jaccard similarity. The Jaccard similarity measures similarity between finite sample sets, and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. If you want to find the Jaccard similarity between two sets A and B, it is the ratio of the cardinality of A ∩ B to that of A ∪ B:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard similarity implementation:


#!/usr/bin/env python

from math import*

def jaccard_similarity(x,y):

 intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
 union_cardinality = len(set.union(*[set(x), set(y)]))
 return intersection_cardinality/float(union_cardinality)

print jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9])

Script Output:

0.375
[Finished in 0.0s]
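The boundary behaviour matches the two considerations about similarity stated at the top of this post (assuming the jaccard_similarity function above is in scope):

print jaccard_similarity([1, 2, 3], [1, 2, 3])  # 1.0, identical sets
print jaccard_similarity([1, 2], [3, 4])        # 0.0, disjoint sets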

Implementation of all 5 similarity measures in one Similarity class:

file_name : similaritymeasures.py


#!/usr/bin/env python

from math import *
from decimal import Decimal

class Similarity():

    """ Five similarity measures function """

    def euclidean_distance(self, x, y):
        """ return euclidean distance between two lists """
        return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))

    def manhattan_distance(self, x, y):
        """ return manhattan distance between two lists """
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski_distance(self, x, y, p_value):
        """ return minkowski distance between two lists """
        return self.nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

    def nth_root(self, value, n_root):
        """ return the n_root of a value """
        root_value = 1 / float(n_root)
        return round(Decimal(value) ** Decimal(root_value), 3)

    def cosine_similarity(self, x, y):
        """ return cosine similarity between two lists """
        numerator = sum(a * b for a, b in zip(x, y))
        denominator = self.square_rooted(x) * self.square_rooted(y)
        return round(numerator / float(denominator), 3)

    def square_rooted(self, x):
        """ return the square root of the sum of squares, rounded to 3 decimals """
        return round(sqrt(sum([a * a for a in x])), 3)

    def jaccard_similarity(self, x, y):
        """ return the jaccard similarity between two lists """
        intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
        union_cardinality = len(set.union(*[set(x), set(y)]))
        return intersection_cardinality / float(union_cardinality)

Using the Similarity class:


#!/usr/bin/env python

from similaritymeasures import Similarity

def main():

    """ main function to create Similarity class instance and get use of it """

    measures = Similarity()

    print measures.euclidean_distance([0,3,4,5],[7,6,3,-1])
    print measures.jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9])

if __name__ == "__main__":
    main()
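The remaining measures work the same way through the class instance. For example, these lines could be added inside main(); the expected values match the standalone runs earlier in this post:

    print measures.manhattan_distance([10,20,10],[10,20,20])          # 10
    print measures.minkowski_distance([0,3,4,5],[7,6,3,-1],3)         # 8.373
    print measures.cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15])  # 0.972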

 

The code for this post can be found at this GitHub link:

 https://github.com/saimadhu-polamuri/DataAspirant_codes/tree/master/Similarity_measures

 


I hope you liked today's post. If you have any questions then feel free to comment below. If you want me to write on one specific topic then do tell me in the comments below.

THANKS FOR READING…..


Recommendation Engine Part-1


Recommendation Engine Introduction 


Today we are going to start our exploration of data mining by looking at recommendation engines. People refer to this concept by different names, such as recommendation engine or recommendation system.

What we will learn:

To begin the tour of recommendation engines, I am going to answer four basic questions about them:

  1. What is a Recommendation Engine?

  2. What is the difference between a real-life recommendation engine and an online recommendation engine?

  3. Why should we use recommendation engines?
  4. What are the different types of Recommendation Engines?

What is a Recommendation Engine?

Wiki definition:

Recommendation engines are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' that a user would give to an item.

dataaspirant definition:

A recommendation engine is a black box which analyzes a set of users and shows the items which a single user may like.

What is the difference between a real-life recommendation engine and an online recommendation engine?

Before summarizing the difference between the two, let's see them individually.

Real-life Recommendation Engines:

Your friend as a movie recommendation engine:


Most of the time we ask our friends to recommend some good movies to watch, and in most cases the movies they recommend turn out to be good ones.

Your sister/mother/father/brother/friend as a dress recommendation engine:


Selecting a dress from thousands of models is a little bit hard. That's why, when we go to buy a good dress for a birthday or a festival, we ask our sister/mother/father/brother/friend to select a good dress for us.

Your instructor as a book recommendation engine:


When we want to read one good book to better understand a particular concept, we ask our instructor to recommend some good books on it.

Your elder brother or elder sister as a career recommendation engine:


We always consider our elder sister's or brother's suggestions in our career planning.

Note:

In all these cases, the person who recommends things to you knows you well and knows about the things being recommended.

Online Recommendation Engines:

Facebook People You May Know:


People You May Know suggests people on Facebook that you might know. It shows you people based on mutual friends, work and education information, networks you're part of, contacts you've imported and many other factors.

Netflix Other Movies You Might Enjoy:


Netflix offers thousands of titles to stream. When you fill out your Taste Preferences or rate movies and TV shows, you're helping Netflix filter through the thousands of selections to get a better idea of what you would like to watch. The Netflix recommendation algorithm takes certain factors into consideration when recommending movies to you, such as:

  • The genres of movies and TV shows available.
  • Your streaming history, and previous ratings you’ve made.
  • The combined ratings of all Netflix members who have similar tastes in titles to you.

LinkedIn Jobs You May Be Interested In:


The Jobs You May Be Interested In feature shows jobs posted on LinkedIn that match your profile in some way. These recommendations are shown based on the titles and descriptions in your previous experience. If you search for jobs in your field and save your search, you'll receive alerts whenever a new job is posted within your search rules. That might help you find jobs you're looking for without altering your profile information.

 

Amazon Customers Who Bought This Item Also Bought:


Amazon uses its recommendation engine to recommend products to customers. The Customers Who Bought This Item Also Bought feature has played a key role in increasing Amazon's sales.

Let's summarize these two kinds of recommendation engines. In a real-life recommendation engine, the main theme is: I will like the thing which you believe I will like. In an online recommendation engine, it is: I may like the things which you like, if you and I are similar persons. Don't be confused about similar persons and all this stuff; in the very next post I will explain these things much more clearly.

Why should we use recommendation engines?

There is one famous quote about customer relationships, the summary of which goes like this: customers don't know what they want until we show them. If we succeed in showing something which customers may like, business profits will skyrocket.

So recommendation engines help customers find information, products and services they might not have thought of. Recommendation applications can be found in a wide variety of industries and businesses; some we have seen before, and some are listed below.

  • Travel
  • Financial service
  • Music/Online radio
  • TV and Videos
  • Online publications
  • Retail
  • and countless others….

What are the different types of Recommendation Engines?

Recommendation engines are mainly of 2 types, plus one hybrid type:

1) Collaborative filtering

2) Content-based filtering

3) Hybrid Recommendation Systems

Collaborative Filtering:

People with similar tastes to you like the things you like.

Collaborative filtering methods are based on collecting and analyzing a large amount of information on users' behaviors, activities or preferences, and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, and is therefore capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Many algorithms have been used to measure user similarity or item similarity in recommender systems, for example the k-nearest neighbor (k-NN) approach and the Pearson correlation.

Content-based filtering:

People who liked this also liked these as well.

Content-based filtering methods are based on a description of the item and a profile of the user's preferences. In a content-based recommendation system, keywords are used to describe the items; besides, a user profile is built to indicate the type of items this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research.

Hybrid Recommendation Systems:

Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering could be more effective in some cases. Hybrid approaches can be implemented in several ways, by making content-based and collaborative-based predictions separately and then combining them, by adding content-based capabilities to a collaborative-based approach (and vice versa), or by unifying the approaches into one model. Several studies empirically compare the performance of the hybrid with the pure collaborative and content-based methods and demonstrate that the hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommendation systems such as cold start and the sparsity problem.

Netflix is a good example of hybrid systems. They make recommendations by comparing the watching and searching habits of similar users (i.e. collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

Upcoming posts:

In the upcoming posts we are going to look at these 3 recommendation systems in the big picture, and we are going to learn how to implement them.


I hope you liked today's post. If you have any questions then feel free to comment below. If you want me to write on one specific topic then do tell me in the comments below.

Thanks for reading….
