Scala and PySpark specialization certification courses started

 

Scala & Spark Specialization

 

Data science is a promising field where you have to continuously update your skill set by learning new techniques, algorithms, and newly created tools. As the learning journey never ends, we are always looking for the best resources to start learning these new skills. We should be thankful for great MOOC providers like Coursera, edX, and Udacity, whose main intention is to provide high-quality content that explains the core concepts in a standardized way, so that learners feel they are moving step by step toward mastering those skills.

In this particular post, we are going to share with you two popular data science specialization certification courses offered by edX and Coursera.

  1. Data Science and Engineering with Apache Spark
  2. Functional Programming in Scala

These two specializations are each a series of courses that start from the basics and go up to an advanced level. Generally, it takes somewhere around 5 to 6 months to work through a specialization completely. All the course videos and reference materials are free of cost, but if you intend to get the specialization certificate it will cost you a decent number of dollars.

Data Science and Engineering with Apache Spark

Image Credit: edx.org

The Data Science and Engineering with Spark XSeries, created in partnership with Databricks, will teach students how to perform data science and data engineering at scale using Spark, a cluster computing system well-suited for large-scale machine learning tasks. It will also present an integrated view of data processing by highlighting the various components of data analysis pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. Students will gain hands-on experience building and debugging Spark applications. Internal details of Spark and distributed machine learning algorithms will be covered, which will provide students with intuition about working with big data and developing code for a distributed environment.

This XSeries requires a programming background and experience with Python (or the ability to learn it quickly). All exercises will use PySpark (the Python API for Spark), but previous experience with Spark or distributed computing is NOT required. Familiarity with basic machine learning concepts and exposure to algorithms, probability, linear algebra, and calculus are prerequisites for two of the courses in this series.

1. Introduction to Apache Spark

About this course:

Spark is rapidly becoming the compute engine of choice for big data. Spark programs are more concise and often run 10-100 times faster than Hadoop MapReduce jobs. As companies realize this, Spark developers are becoming increasingly valued.

This statistics and data analysis course will teach you the basics of working with Spark and will provide you with the necessary foundation for diving deeper into Spark. You’ll learn about Spark’s architecture and programming model, including commonly used APIs. After completing this course, you’ll be able to write and debug basic Spark applications. This course will also explain how to use Spark’s web user interface (UI), how to recognize common coding errors, and how to proactively prevent errors. The focus of this course will be Spark Core and Spark SQL.

This course covers advanced undergraduate-level material. It requires a programming background and experience with Python (or the ability to learn it quickly). All exercises will use PySpark (the Python API for Spark), but previous experience with Spark or distributed computing is NOT required. Students should take this Python mini-quiz before the course and take this Python mini-course if they need to learn Python or refresh their Python knowledge.

What you’ll learn:

  • Basic Spark architecture
  • Common operations
  • How to avoid coding mistakes
  • How to debug your Spark program
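To give a feel for what a basic Spark program looks like, here is a minimal PySpark sketch. It is illustrative only and not taken from the course, and it assumes a recent local PySpark installation; the application name and the tiny in-memory data set are made up. It counts word frequencies with the RDD API (Spark Core) and then runs the same aggregation through the DataFrame API (Spark SQL).


from pyspark.sql import SparkSession

# Start a local Spark session (illustrative settings)
spark = SparkSession.builder.master("local[*]").appName("intro-spark-sketch").getOrCreate()
sc = spark.sparkContext

# Spark Core: distribute a small collection and apply transformations and an action
words = sc.parallelize(["spark", "python", "spark", "data"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('python', 1), ('data', 1), ('spark', 2)]

# Spark SQL: the same data as a DataFrame, aggregated with a relational operator
df = spark.createDataFrame([(w,) for w in ["spark", "python", "spark", "data"]], ["word"])
df.groupBy("word").count().show()

spark.stop()

Running this with a plain Python interpreter (with PySpark installed) prints the counts twice, once from the RDD API and once from the DataFrame API.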

2. Distributed Machine Learning with Apache Spark

About this course:

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to bleeding edge applications like self-driving cars and personalized medicine. In the age of ‘big data’, with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, machine learning techniques are fast becoming a core component of large-scale data processing pipelines.

This statistics and data analysis course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Spark, a cluster computing system well-suited for large-scale machine learning tasks, and its packages spark.ml and spark.mllib. You will implement distributed algorithms for fundamental statistical models (linear regression, logistic regression, principal component analysis) while tackling key problems from domains such as online advertising and cognitive neuroscience.

What you’ll learn:

  • The underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines
  • Exploratory data analysis, feature extraction, supervised learning, and model evaluation
  • Application of these principles using Spark
  • How to implement distributed algorithms for fundamental statistical models.
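As a rough sketch of the kind of pipeline described above (illustrative only, not course material; the toy feature values, column names, and parameters are invented), here is a minimal spark.ml logistic regression example:


from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("ml-pipeline-sketch").getOrCreate()

# A made-up training set: a label column and two numeric features
train = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 0.2, 2.3), (0.0, 1.5, 0.4), (1.0, 0.3, 1.9)],
    ["label", "f1", "f2"])

# Feature extraction followed by supervised learning, wrapped in a single pipeline
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(train)

# Model evaluation would normally use a held-out set; here we just inspect the predictions
model.transform(train).select("label", "prediction").show()

spark.stop()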

3. Big Data Analysis with Apache Spark

About this course:

Organizations use their data to support and influence decisions and build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term ‘data science’.

This statistics and data analysis course will attempt to articulate the expected output of data scientists and then teach students how to use PySpark (part of Spark) to deliver against these expectations. The course assignments include log mining, textual entity recognition, and collaborative filtering exercises that teach students how to manipulate data sets using parallel processing with PySpark.

This course covers advanced undergraduate-level material. It requires a programming background and experience with Python (or the ability to learn it quickly). All exercises will use PySpark (the Python API for Spark), and previous experience with Spark, equivalent to Introduction to Apache Spark, is required.

What you’ll learn:

  • How to use Apache Spark to perform data analysis
  • How to use parallel programming to explore data sets
  • Apply log mining, textual entity recognition and collaborative filtering techniques to real-world data questions.
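For a flavour of the log-mining style of exercise, here is a tiny, self-contained PySpark sketch. It is illustrative only; the log lines are made up, and a real assignment would read a file with sc.textFile instead of parallelizing a small list.


from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("log-mining-sketch").getOrCreate()
sc = spark.sparkContext

# In practice this would be sc.textFile("path/to/server.log")
logs = sc.parallelize([
    "INFO starting job",
    "ERROR disk full",
    "INFO job done",
    "ERROR timeout",
])

# Filter and count the error lines in parallel
errors = logs.filter(lambda line: "ERROR" in line)
print(errors.count())                                      # 2
print(errors.map(lambda line: line.split()[1]).collect())  # ['disk', 'timeout']

spark.stop()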

4. Advanced Apache Spark for Data Science and Data Engineering

About this course:

Gain a deeper understanding of Spark by learning about its APIs, architecture, and common use cases.  This statistics and data analysis course will cover material relevant to both data engineers and data scientists.  You’ll learn how Spark efficiently transfers data across the network via its shuffle, details of memory management, optimizations to reduce the compute costs, and more.  Learners will see several use cases for Spark and will work to solve a variety of real-world problems using public datasets.  After taking this course, you should have a thorough understanding of how Spark works and how you can best utilize its APIs to write efficient, scalable code.  You’ll also learn about a wide variety of Spark’s APIs, including the APIs in Spark Streaming.

What you’ll learn:

  • Common use cases for Spark
  • Details of internals like the shuffle, Spark SQL’s Catalyst Optimizer, and Project Tungsten
  • A deep architectural overview
  • Spark Streaming
  • Spark ML

5. Advanced Distributed Machine Learning with Apache Spark

About this course:

Building on the core ideas presented in Distributed Machine Learning with Spark, this course covers advanced topics for training and deploying large-scale learning pipelines. You will study state-of-the-art distributed algorithms for collaborative filtering, ensemble methods (e.g., random forests), clustering and topic modeling, with a focus on model parallelism and the crucial tradeoffs between computation and communication.

After completing this course, you will have a thorough understanding of the statistical and algorithmic principles required to develop and deploy distributed machine learning pipelines. You will further have the expertise to write efficient and scalable code in Spark, using MLlib and the spark.ml package in particular.

What you’ll learn:

  • Training and deploying large-scale learning pipelines for various supervised and unsupervised settings
  • Model parallelism and tradeoffs between computation and communication in distributed settings
  • Collaborative filtering, decision trees, random forests, clustering, topic modeling, hyperparameter tuning
  • Application of these principles using Spark, focusing on the spark.ml package.

Functional Programming in Scala

Image Credit: coursera.org

This Specialization provides a hands-on introduction to functional programming using the widespread programming language Scala. It begins with the basic building blocks of the functional paradigm, first showing how to use these blocks to solve small problems, before building up to combining these concepts to architect larger functional programs. You'll see how the functional paradigm facilitates parallel and distributed programming, and through a series of hands-on examples and programming assignments you'll learn how to analyze data sets from small to large, from parallel programming on multicore architectures to distributed programming on a cluster using Apache Spark. A final capstone project will allow you to apply the skills you learned by building a large data-intensive application using real-world data.

1. Functional Programming Principles in Scala

About this course:

Functional programming is becoming increasingly widespread in industry. This trend is driven by the adoption of Scala as the main programming language for many applications. Scala fuses functional and object-oriented programming in a practical package. It interoperates seamlessly with both Java and JavaScript. Scala is the implementation language of many important frameworks, including Apache Spark, Kafka, and Akka. It provides the core infrastructure for sites such as Twitter, Tumblr, and Coursera.

In this course you will discover the elements of the functional programming style and learn how to apply them usefully in your daily programming tasks. You will also develop a solid foundation for reasoning about functional programs, by touching upon proofs of invariants and the tracing of execution symbolically. The course is hands-on; most units introduce short programs that serve as illustrations of important concepts and invite you to play with them, modifying and improving them. The course is complemented by a series of programming projects as homework assignments.

Learning Outcomes:

  • understand the principles of functional programming,
  • write purely functional programs, using recursion, pattern matching, and higher-order functions,
  • combine functional programming with objects and classes,
  • design immutable data structures,
  • reason about properties of functions,
  • understand generic types for functional programs

2. Functional Program Design in Scala

About this course:

In this course you will learn how to apply the functional programming style in the design of larger applications. You’ll get to know important new functional programming concepts, from lazy evaluation to structuring your libraries using monads. We’ll work on larger and more involved examples, from state space exploration to random testing to discrete circuit simulators. You’ll also learn some best practices on how to write good Scala code in the real world.

Several parts of this course deal with the question of how functional programming interacts with mutable state. We will explore the consequences of combining functions and state. We will also look at purely functional alternatives to mutable state, using infinite data structures or functional reactive programming.

Learning Outcomes:

  • recognize and apply design principles of functional programs,
  • design functional libraries and their APIs,
  • competently combine functions and state in one program,
  • understand reasoning techniques for programs that combine functions and state,
  • write simple functional reactive applications.

3. Parallel programming

About this course:

With every smartphone and computer now boasting multiple processors, the use of functional ideas to facilitate parallel programming is becoming increasingly widespread. In this course, you'll learn the fundamentals of parallel programming, from task parallelism to data parallelism. In particular, you'll see how many familiar ideas from functional programming map perfectly to the data parallel paradigm. We'll start with the nuts and bolts of how to effectively parallelize familiar collections operations, and we'll build up to parallel collections, a production-ready data parallel collections library available in the Scala standard library. Throughout, we'll apply these concepts through several hands-on examples that analyze real-world data, using popular algorithms like k-means clustering.

Learning Outcomes:

  • reason about task and data parallel programs,
  • express common algorithms in a functional style and solve them in parallel,
  • competently microbenchmark parallel code,
  • write programs that effectively use parallel collections to achieve performance

4. Big Data Analysis with Scala and Spark

About this course:

Manipulating big data distributed over a cluster using functional concepts is rampant in the industry, and is arguably one of the first widespread industrial uses of functional ideas. This is evidenced by the popularity of MapReduce and Hadoop, and most recently Apache Spark, a fast, in-memory distributed collections framework written in Scala. In this course, we’ll see how the data parallel paradigm can be extended to the distributed case, using Spark throughout. We’ll cover Spark’s programming model in detail, being careful to understand how and when it differs from familiar programming models, like shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we’ll learn when important issues related to the distribution like latency and network communication should be considered and how they can be addressed effectively for improved performance.

Learning Outcomes:

  • read data from persistent storage and load it into Apache Spark,
  • manipulate data with Spark and Scala,
  • express algorithms for data analysis in a functional style,
  • recognize how to avoid shuffles and recomputation in Spark

5. Functional Programming in Scala Capstone

About this course:

In the final capstone project you will apply the skills you learned by building a large data-intensive application using real-world data.

 

Follow us:

FACEBOOK | QUORA | TWITTER | REDDIT | FLIPBOARD | LINKEDIN | MEDIUM | GITHUB

 

I hope you liked today’s post. If you have any questions then feel free to comment below.  If you want me to write on one specific topic then do tell it to me in the comments below.

If you want to share your experience or opinions, you can say hello to hello@dataaspirant.com

THANKS FOR READING…..

Home | About | Data scientists Interviews | For beginners | Join us |  Monthly newsletter

DataAspirant April-2016 newsletter



Blog Posts:

1. Step by step kaggle competition tutorial:

In this article we are going to see how to go through a Kaggle competition step by step. The contest explored here is the San Francisco Crime Classification contest. The goal is to classify a crime occurrence knowing the time and place it happened.

Read Complete Post:  datanice Blog

2. Introduction to machine learning: 

It is an attempt to make things more intelligent. Most of us have come across terms like “Artificial Neural Networks”, it is an attempt to replicate the working of the human brain. Even something like this is not necessarily always complex. At its heart, it is just multiplication and differentiation. Yes, Maths at it again but it’s rather what you learned at school, no different (This coming from a guy who is petrified of maths)

Read Complete Post: medium.com/@mukulmalik

3. Baidu research chief Andrew NG fixed on self-taught computers self-driving cars:

Artificial-intelligence whiz Andrew Ng hangs his hat these days at a nondescript building in Sunnyvale that serves as the Silicon Valley outpost of the Chinese search giant Baidu.

Read Complete Post: Seattle times

4. Misleading modelling overfitting cross-validation and the bias-variance trade-off:

In this post you will get to grips with what is perhaps the most essential concept in machine learning: the bias-variance trade-off. The main idea here is that you want to create models that are as good at prediction as possible but that are still applicable to new data (i.e. they are generalizable). The danger is that you can easily create models that overfit to the local noise in your specific dataset, which isn’t too helpful and leads to poor generalizability since the noise is random and therefore different in each dataset. Essentially, you want to create models that capture only the useful components of a dataset. On the other hand, models that generalize very well but are too inflexible to generate good predictions are the other extreme you want to avoid (this is called underfitting).

Read Complete Post: Cambridge coding

5. Association rules and the apriori algorithm:

When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in several ways.

Read Complete Post: Annalyzin

6. Churn prediction pyspark using mllib and ml packages:

Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.

The prediction process is heavily data driven and often utilizes advanced machine learning techniques. In this post, we’ll take a look at what types of customer data are typically used, do some preliminary analysis of the data, and generate churn prediction models – all with PySpark and its machine learning frameworks. We’ll also discuss the differences between two Apache Spark version 1.6.0 frameworks, MLlib and ML.

Read Complete Post: mapr

7. Data scientist keeps ranking top every best jobs list:

Data scientist is at, or near, the top of just about every “best jobs” survey, report, or study released in the past few years. Harvard Business Review named it the sexiest job of the 21st century. And with a median base salary of $96,000, data scientist and some engineering specialties are in a very small group of high-paying jobs that don’t require a medical or law degree. However, you’ll likely need more than a bachelor’s degree, as you’ll find out later in this article.

Read Complete Post: goodcall

8. Understand machine learning data descriptive statistics python: 

Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better preprocess and handle the data for machine learning tasks.

Read Complete Post: machinelearningmastery

9. Time series interventions and contribution:

This article illustrates principles of an analysis of the President George W. Bush’s job approval from January 2001 through Sep 2004 with disposable income excluded from the statistical model. To see a version complete with code and its description, visit bicorner.com.

Presidents with a job approval rating of less than 50 percent are unlikely to be re-elected. During June, Bush’s job approval rating averaged 47 percent in five major polls.

Read Complete Post: gladwinanalytics

10. Deep neural networks creative deep learning art:

Are deep neural networks creative? It seems like a reasonable question. Google’s “Inceptionism” technique transforms images, iteratively modifying them to enhance the activation of specific neurons in a deep net. The images appear trippy, transforming rocks into buildings or leaves into insects. Another neural generative model, introduced by Leon Gatys of the University of Tubingen in Germany, can extract the style from one image (say a painting by Van Gogh), and apply it to the content of another image (say a photograph).

Read Complete Post: kdnuggets

Video Courses:

  1. Introduction to python for data science
  2. Data exploration with kaggle scripts

That's all for the April-2016 newsletter. Please leave your suggestions on the newsletter in the comment section. To get all dataaspirant newsletters, you can visit the monthly newsletter page.

Follow us:

FACEBOOK | QUORA | TWITTER | REDDIT | FLIPBOARD | LINKEDIN | MEDIUM | GITHUB

Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter

Python data mining packages virtual environment setup in Ubuntu



Virtual Environment:

A virtual environment is an isolated Python environment which allows you to use different Python modules and versions without messing up the system installation.

Let's understand virtual environments and the need for them in project development. In the world of Python projects it's common to use different Python libraries (wrappers) to get our work done, but it won't be a happy ending all the time. Most of the time we face environment issues where our Python application won't run on new machines (computer systems). This is because of dependency issues with the Python libraries on the new machine (on which we try to run our Python application). For better understanding, suppose that during development our Python application used a pandas (the Python data analysis library) 0.18.0 function which was not there in pandas 0.17.1, and the new machine has pandas 0.17.1 installed. The application won't run on the new machine because of the version difference.
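A quick way to see which version a particular machine has, assuming pandas is installed there, is to ask the interpreter directly:


# Check which pandas version the current Python interpreter sees
import pandas as pd

print(pd.__version__)  # e.g. '0.18.0' on the development machine, '0.17.1' on the new one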

To overcome this, we need to create a Python environment which contains everything that a Python project (application) needs in order to run in an organised, isolated fashion.

So using a virtual environment is the recommended way of working with Python projects, regardless of how many you might be busy with.

In Ubuntu, creating a virtual environment is easy using virtualenv (a tool to create isolated Python environments).

About virtualenv:

Virtualenv helps solve project dependency conflicts by creating isolated environments which can contain all the goodies Python programmers need to develop their projects. A virtual environment created using this tool includes a fresh copy of the Python binary itself as well as a copy of the entire Python standard library.

Installing virtualenv:


$ sudo pip install virtualenv

Now that we have successfully installed virtualenv, let's create a folder (environment) where we will install the Python data mining packages.

Create virtual environment:


$ virtualenv dataaspirant_venv

virtualenv dataaspirant_venv will create a folder in the current directory which will contain the Python executable files, and a copy of the pip library which you can use to install other packages. The name of the virtual environment (in this case, it was dataaspirant_venv) can be anything; omitting the name will place the files in the current directory instead.

This creates a copy of Python in whichever directory you ran the command in, placing it in a folder named dataaspirant_venv.

Using the virtual environment:

To use the virtual environment, it needs to be activated:


$ source dataaspirant_venv/bin/activate

The name of the current virtual environment will now appear on the left of the prompt (e.g. (dataaspirant_venv)Your-Computer:your_project UserName$) to let you know that it’s active. From now on, any package that you install using pip will be placed in the dataaspirant_venv folder, isolated from the global Python installation.
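A simple sanity check, assuming the environment is activated, is to ask Python where it is running from; sys.prefix should point inside the dataaspirant_venv folder rather than at the system installation:


# Confirm we are running inside the virtual environment, not the system Python
import sys

print(sys.prefix)  # e.g. /home/username/dataaspirant_venv when the venv is active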

Python Data mining packages Installation:

So let's start installing the data mining Python packages one by one.

Numpy installation Command:


$ pip install numpy

Scipy Installation Command:


$ pip install scipy

Matplotlib Installation Command:


$ pip install matplotlib

Ipython Installation Command:


$ pip install ipython[all]

Pandas Installation Command:


$ pip install pandas

Statsmodel Installation Command:


$ pip install statsmodels

Scikit-learn Installation Command:


$ pip install scikit-learn
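Once the installations finish, a short sketch like the one below, run inside the activated environment, confirms that every package imports cleanly and shows which versions ended up in the virtual environment:


# Verify that the data mining packages installed into the virtual environment
import numpy
import scipy
import matplotlib
import pandas
import statsmodels
import sklearn

for module in (numpy, scipy, matplotlib, pandas, statsmodels, sklearn):
    print(module.__name__ + " " + module.__version__)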

Running Script File:

When the virtualenv is active, you simply go to the directory where your Python script file is and run the script in the usual way.


$ python script_file.py

Deactivate the virtual environment:

If you are done working in the virtual environment for the moment, you can deactivate it:


$ deactivate

This puts you back to the system’s default Python interpreter with all its installed libraries.

To delete a virtual environment, just delete its folder. (In this case, it would be rm -rf dataaspirant_venv.)

Reference Links:

[ 1 ] http://docs.python-guide.org/en/latest/dev/virtualenvs/

[ 2 ] https://en.wikipedia.org/wiki/Virtual_environment_software/

Follow us:

FACEBOOK | QUORA | TWITTER | REDDIT | FLIPBOARD | LINKEDIN | MEDIUM | GITHUB

I hope you liked today's post. If you have any questions, feel free to comment below. If you want me to write on one specific topic, then do tell it to me in the comments below.

If you want to share your experience or opinions, you can say hello to hello@dataaspirant.com

THANKS FOR READING…..

Home | About | Data scientists Interviews | For beginners | Join us |  Monthly newsletter

DataAspirant Sept-Oct2015 newsletter



 

Hi dataaspirant lovers, we are sorry for not publishing the dataaspirant September newsletter, so the October newsletter comes with the September ingredients too. We rounded up the best blogs for anyone interested in learning more about data science. Whatever your experience level in data science, even if you have only just heard of the field, these blogs provide enough detail and context for you to understand what you're reading. We also collected some videos. Hope you enjoy the October dataaspirant newsletter.

 

Blog Posts:

1. How to do a Logistic Regression in R:

Regression is the statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of regression, like simple linear, multiple linear, polynomial, logistic, Poisson, etc.

Read Complete post on: datavinci

2. Introduction of Markov State Modeling:

Modeling and prediction problems occur in different domain and data situations. One type of situation involves sequence of events.

For instance, you may want to model the behaviour of customers on your website, looking at the pages they land on or enter by, the links they click, and so on. You may want to do this to understand common issues and needs, and may redesign your website to address them. You may, on the other hand, want to promote certain sections or products on the website and want to understand the right page architecture and layout. In another example, you may be interested in predicting the next medical visit of a patient based on previous visits, or the next product a customer will purchase based on previous products.

Read Complete post on: edupristine

3. Five ways to improve the way you use Hadoop:

Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular service from Amazon that is used by developers trying to avoid the burden of set up and administration, and concentrate on working with their data.

Read Complete post on: cloudacademy

4. What is deep learning and why is it getting so much attention:

Deep learning is probably one of the hottest topics in Machine learning today, and it has shown significant improvement over some of its counterparts. It falls under a class of unsupervised learning algorithms and uses multi-layered neural networks to achieve these remarkable outcomes.

Read Complete post on: analyticsvidhya

5. Facebook data collection and photo network visualization with Gephi and R:

The first thing to do is get the Facebook data. Before being allowed to pull it from R, you’ll need to make a quick detour to developers.facebook.com/apps, register as a developer, and create a new app. Name and description are irrelevant, the only thing you need to do is go to Settings → Website → Site URL and fill in http://localhost:1410/ (that’s the port we’re going to be using). The whole process takes ~5 min and is quite painless

Read Complete post on: kateto

6. Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data:

The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster, easier to work with, and more secure.

Read Complete post on: cloudera

7. Rapid Development & Performance in Spark For Data Scientists:

Spark is a cluster computing framework that can significantly increase the efficiency and capabilities of a data scientist’s workflow when dealing with distributed data. However, deciding which of its many modules, features and options are appropriate for a given problem can be cumbersome. Our experience at Stitch Fix has shown that these decisions can have a large impact on development time and performance. This post will discuss strategies at each stage of the data processing workflow which data scientists new to Spark should consider employing for high productivity development on big data.

Read Complete post on: multithreaded

8. NoSQL: A Dog with Different Fleas:

The NoSQL movement is about providing performance, scale, and flexibility, where cost is sometimes part of the reasoning (e.g. the Oracle Tax). Yet databases like MySQL, which provide all the Oracle features, are often considered before choosing NoSQL. And with respect to NoSQL flexibility, this can also be a Pandora's box. In other words, schema-less modeling has been shown to be a serious complication to data management. I was at the MongoDB Storage Engine Summit this year and the number one ask to the storage engine providers was "how to discover schema in a schema-less architecture?" In other words, managing models over time is a serious matter to consider too.

Read Complete post on: deepis

9. Apache Spark: Sparkling star in big data firmament:

The underlying data needed to be used to gain right outcomes for all above tasks is comparatively very large. It cannot be handled efficiently (in terms of both space and time) by traditional systems. These are all big data scenarios. To collect, store and do computations on this kind of voluminous data we need a specialized cluster computing system. Apache Hadoop has solved this problem for us.

Read Complete post on: edupristine

10. Sqoop vs. Flume – Battle of the Hadoop ETL tools:

Apache Hadoop is synonymous with big data for its cost-effectiveness and its scalability for processing petabytes of data. Data analysis using Hadoop is just half the battle won; getting data into the Hadoop cluster plays a critical role in any big data deployment. Data ingestion is important in any big data project because the volume of data is generally in petabytes or exabytes. Hadoop Sqoop and Hadoop Flume are the two tools in Hadoop which are used to gather data from different sources and load it into HDFS. Sqoop is mostly used to extract structured data from databases like Teradata, Oracle, etc., while Flume is used to source data stored in various places and deals mostly with unstructured data.

Read Complete post on: dezyre

 

Videos:

1. Spark and Spark Streaming at Uber:

2. How To Stream Twitter Data Into Hadoop Using Apache Flume:

 

That's all for the October 2015 newsletter. Please leave your suggestions on the newsletter in the comment box. To get all dataaspirant newsletters, you can visit the monthly newsletter page. Please subscribe to our blog so that every month you get our newsletter in your inbox.

 

Follow us:

FACEBOOK | QUORA | TWITTER | REDDIT | FLIPBOARD | LINKEDIN | MEDIUM | GITHUB

Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter

Collaborative filtering recommendation engine implementation in Python


Collaborative Filtering


 

In the introductory post on recommendation engines we saw the need for recommendation engines in real life, as well as their importance online, and finally we discussed three approaches to building a recommendation engine. They are:

1) Collaborative filtering

2) Content-based filtering

3) Hybrid Recommendation Systems

So today we are going to implement the collaborative filtering approach to a recommendation engine. Before that, I want to explain some key things about recommendation engines which were missed in the Introduction to recommendation engine post.

Today's learning topics:

1) What is the Long Tail phenomenon in recommendation engines?

2) What is the basic idea behind collaborative filtering?

3) Understanding the collaborative filtering approach with a friends movie recommendation engine.

4) Implementing the collaborative filtering approach to a recommendation engine.

What is the Long Tail phenomenon in recommendation engines?

Usage of recommendation engines has become popular in the last 10 to 20 years. The reason behind this is that we moved from a state of scarcity to a state of abundance. Don't be frightened by these two words, scarcity and abundance; in the coming paragraphs I will make clear why I use them when talking about the popularity of recommendation engines.

Scarcity to Abundance:


Let's understand scarcity to abundance with book stores. Imagine a physical book store, like Crossword, with thousands of books. We can see almost all the popular books on the store's shelves, but shelf space is limited; to increase the number of shelves the retailer needs more space, and for that the retailer has to spend a huge amount of money. This is the reason why we miss so many good but less popular books, and end up only getting the popular ones.

On the other hand, if we look at online book stores, we have unlimited shelves with unlimited shelf space, so we can get almost any book, popular or unpopular. This is because the web enables near-zero-cost dissemination of information about products (books, news articles, mobiles, etc.). This near-zero-cost dissemination of information on the web gives rise to a phenomenon called the "Long Tail" phenomenon.

Long Tail phenomenon:

[Figure: the Long Tail curve, product popularity rank (X-axis) vs. popularity (Y-axis)]

The above graph represents the Long Tail phenomenon. On the X-axis we have products ordered by popularity rank (the most popular products are on the left, and less popular products are further to the right). Here popularity means the number of times a product was purchased or viewed. The Y-axis shows that popularity, i.e. how many times products were purchased or viewed in an interval of time such as a week or a month.

If you look at the top of the graph, the orange curve stays close to the Y-axis, which means there are only a few very popular products. The curve initially falls very steeply: as product rank increases along the X-axis, popularity drops sharply. At a certain point the fall becomes less and less steep, and the curve never quite reaches the X-axis. The interesting thing is that the products to the right of that cutoff point are very unpopular; they are hardly purchased once or twice in a week or a month. These products are not kept in physical retail stores, because storing less popular items is a waste of money, so a sensible retailer doesn't stock them. Popular products can be found both in physical retail stores and online, so the Head part of the curve, to the left of the cutoff point, is served by both. The less popular products to the right of the cutoff point can only be found online, and this tail is called the Long Tail.

If you look at the area under the Long Tail, you can see there are many good products which are simply less popular. Finding them online is a hard task for the user, so there is a need for a system which can recommend these good but unpopular products to the user by considering some metrics. This system is nothing but a recommendation system.

Basic idea about collaborative filtering:

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours. It looks at the other things they like and combines them to create a ranked list of suggestions, and finally shows the suggestions to the user. For a better understanding of collaborative filtering, let's think about our own friends movie recommendation system.

Understanding the collaborative filtering approach with a friends movie recommendation engine:


Let's understand the collaborative filtering approach through a friends movie recommendation engine. To explain it, I want to share my own story.

Like most people, I love to watch movies on weekends. Generally there are so many movies on my hard disk that it's hard to select one. That's why, when I want to watch a movie, I search for friends whose movie taste is similar to mine and ask them to recommend a movie which I may like (one I haven't seen but they have). They recommend a movie which they like and which I have never seen. It may have happened to you too.

This is essentially implementing your own movie recommendation engine by considering your friends as a group of people whose tastes you have learned over time, by observing whether they usually like the same things as you. You select a group of your friends, find the one who is most similar to you, and then you can expect movie recommendations from that friend. Applying this kind of technique to implement a recommendation engine is called collaborative filtering.

Hope that makes the idea behind collaborative filtering clear. So let's get our hands dirty by implementing collaborative filtering in the Python programming language.

Implementing the collaborative filtering approach to a recommendation engine:

Data set for implementing collaborative filtering recommendation engine:

To implement collaborative filtering, first we need a data set of rated preferences (how much the people in the data set like some set of items). I am taking this data set from one of my favorite books, Programming Collective Intelligence, written by Toby Segaran.

First, I store this data set in a Python dictionary. For a huge data set we would generally store these preferences in a database.

File name : recommendation_data.py


#!/usr/bin/env python
# Collabrative Filtering data set taken from Collective Intelligence book.

dataset={
         'Lisa Rose': {'Lady in the Water': 2.5,
                       'Snakes on a Plane': 3.5,
                       'Just My Luck': 3.0,
                       'Superman Returns': 3.5,
                       'You, Me and Dupree': 2.5,
                       'The Night Listener': 3.0},
         'Gene Seymour': {'Lady in the Water': 3.0,
                          'Snakes on a Plane': 3.5,
                          'Just My Luck': 1.5,
                          'Superman Returns': 5.0,
                          'The Night Listener': 3.0,
                          'You, Me and Dupree': 3.5},

        'Michael Phillips': {'Lady in the Water': 2.5,
                             'Snakes on a Plane': 3.0,
                             'Superman Returns': 3.5,
                             'The Night Listener': 4.0},
        'Claudia Puig': {'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0,
                         'The Night Listener': 4.5,
                         'Superman Returns': 4.0,
                         'You, Me and Dupree': 2.5},
        'Mick LaSalle': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'Just My Luck': 2.0,
                         'Superman Returns': 3.0,
                         'The Night Listener': 3.0,
                         'You, Me and Dupree': 2.0},
       'Jack Matthews': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'The Night Listener': 3.0,
                         'Superman Returns': 5.0,
                         'You, Me and Dupree': 3.5},
      'Toby': {'Snakes on a Plane':4.5,
               'You, Me and Dupree':1.0,
               'Superman Returns':4.0}}

Now we are ready with the data set, so let's start implementing the recommendation engine. Create a new file with the name collaborative_filtering.py; be careful to create this file in the same directory where you created recommendation_data.py. First, let's import our recommendation data set into collaborative_filtering.py and check whether we are getting the data properly by answering the questions below.

1) What rating did Lisa Rose and Michael Phillips give Lady in the Water?

2) What are Jack Matthews' movie ratings?

File name : collaborative_filtering.py


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset

print "Lisa Rose rating on Lady in the water: {}\n".format(dataset['Lisa Rose']['Lady in the Water'])
print "Michael Phillips rating on Lady in the water: {}\n".format(dataset['Michael Phillips']['Lady in the Water'])

print '**************Jack Matthews ratings**************'
print dataset['Jack Matthews']

Script Output:

Lisa Rose rating on Lady in the water: 2.5

Michael Phillips rating on Lady in the water: 2.5

**************Jack Matthews ratings**************
{'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'You, Me and Dupree': 3.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0}
[Finished in 0.1s]

After getting the data, we need to find similar people by comparing each person with every other person, and we do this by calculating a similarity score between them. To know more about similarity, you can view the similarity score post. So let's write a function to find the similarity between two persons.

Similar Persons:

First, let's use Euclidean distance to find the similarity between two people. To do that we need to write a Euclidean distance function. Let's add this function to collaborative_filtering.py and find the Euclidean distance between Lisa Rose and Jack Matthews. Before that, let's recall the Euclidean distance formula.

[Figure: Euclidean distance formula, d(x, y) = sqrt( Σ (xi - yi)² )]

Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance they are referring to Euclidean distance; it is also known simply as distance. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them, and it is given by the Pythagorean theorem.

 


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):

    # Returns a similarity score based on the inverted Euclidean distance between person1 and person2

    both_viewed = {}  # To get the items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # Check whether they have any rated items in common
    if len(both_viewed) == 0:
        return 0

    # Finding the Euclidean distance
    sum_of_eclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_eclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_eclidean_distance = sum(sum_of_eclidean_distance)

    return 1 / (1 + sqrt(sum_of_eclidean_distance))

print similarity_score('Lisa Rose', 'Jack Matthews')

Script Output:

0.340542426583
[Finished in 0.0s]

This means the Euclidean distance score of Lisa Rose and Jack Matthews is 0.34054.

Code Explanation:

Imports:

  • Import the recommendation data set and the sqrt function from the math module.

The similarity_score function:

  • The similarity_score function takes two parameters, person1 and person2.
  • The first thing we do is check whether person1 and person2 have rated any items in common; that is what the first loop building both_viewed does.
  • Once we have both_viewed, we check its length. If it is zero there is no need to compute a similarity score, because the score would be zero.
  • Then we compute the sum of squared differences for the items rated by both person1 and person2.
  • Finally, we return the inverted Euclidean distance. The Euclidean distance itself measures how far apart two users are: a smaller distance means they are more similar, but we want a higher value for more similar people. So we add 1 to the distance (which also avoids a division-by-zero error) and take the reciprocal.

 

Do you think the approach we used here is a good way to find the similarity between two users? Let's consider an example to get a clearer idea. Suppose we have two users, X and Y. If X feels a movie is good he will rate it 4; if he feels it's an average movie he will rate it 2; and if he feels it's not a good movie he will rate it 1. In the same way, Y will rate 5 for a good movie, 4 for an average movie and 3 for a bad movie.

[Table: example ratings given by users X and Y]

If we calculate the similarity between the two users they will come out somewhat similar, but we are missing one important point. According to Euclidean distance, if we consider a movie rated by both X and Y, say X rated it 4 and Y rated it 4, the formula says X and Y are very similar; but this is a good movie for user X and only an average movie for Y. So if we use Euclidean distance our approach can be misleading, and we have to use some other approach to find the similarity between two users. This approach is Pearson correlation.
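To make this concrete, here is a small sketch with made-up ratings: X and Y agree on which movie is good, average and bad, yet the inverted Euclidean score looks weak, while the Pearson correlation comes out close to 1.


from math import sqrt

# Hypothetical ratings: the same three movies, rated good, average, bad by both users
x = [4, 2, 1]  # user X uses the low end of the scale
y = [5, 4, 3]  # user Y uses the high end of the scale

# Inverted Euclidean distance, as in similarity_score above
euclidean_similarity = 1 / (1 + sqrt(sum(pow(a - b, 2) for a, b in zip(x, y))))
print(euclidean_similarity)  # 0.25, which looks like weak similarity

# Pearson correlation of the same two rating vectors
def pearson(p, q):
    n = float(len(p))
    sum_p, sum_q = sum(p), sum(q)
    sum_pq = sum(a * b for a, b in zip(p, q))
    sum_p2, sum_q2 = sum(a * a for a in p), sum(b * b for b in q)
    numerator = sum_pq - sum_p * sum_q / n
    denominator = sqrt((sum_p2 - pow(sum_p, 2) / n) * (sum_q2 - pow(sum_q, 2) / n))
    return numerator / denominator if denominator else 0

print(pearson(x, y))  # about 0.98, so the two tastes are recognised as very similar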

Pearson Correlation:

Pearson Correlation Score:

A slightly more sophisticated way to determine the similarity between people's interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized, like our present data set.

The implementation of the Pearson correlation score first finds the items rated by both users. It then calculates the sums and the sums of squares of the ratings for the two users, and calculates the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient. Unlike the distance metric, this formula is not intuitive, but it does tell you how much the variables change together, divided by the product of how much they vary individually.

[Figure: Pearson correlation formula, r = Σ(x - x̄)(y - ȳ) / sqrt( Σ(x - x̄)² Σ(y - ȳ)² )]

 

 

Let’s create this function in the same collaborative_filtering.py file.


# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):

    # Returns a similarity score based on the inverted Euclidean distance between person1 and person2

    both_viewed = {}  # To get the items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # Check whether they have any rated items in common
    if len(both_viewed) == 0:
        return 0

    # Finding the Euclidean distance
    sum_of_eclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_eclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_eclidean_distance = sum(sum_of_eclidean_distance)

    return 1 / (1 + sqrt(sum_of_eclidean_distance))

def pearson_correlation(person1, person2):

    # To get the items rated by both users
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1

    number_of_ratings = len(both_rated)

    # Checking for the number of ratings in common
    if number_of_ratings == 0:
        return 0

    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

    # Sum up the squares of the preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item], 2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item], 2) for item in both_rated])

    # Sum up the product of both users' preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

    # Calculate the Pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum * person2_preferences_sum / number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum, 2) / number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum, 2) / number_of_ratings))
    if denominator_value == 0:
        return 0
    else:
        r = numerator_value / denominator_value
        return r

print pearson_correlation('Lisa Rose', 'Gene Seymour')

Script Output:

0.396059017191
[Finished in 0.0s]

Generally, the pearson_correlation function returns a value between -1 and 1. A value of 1 means that the two users have almost exactly the same taste.

Ranking similar users for a user:

Now that we have functions for comparing two people, we can create a function that scores everyone against a given person and finds the closest matches. Let's add this function to the collaborative_filtering.py file to get an ordered list of people with tastes similar to the specified person.


def most_similar_users(person, number_of_users):
    # Returns the number_of_users most similar persons for the given person
    scores = [(pearson_correlation(person, other_person), other_person) for other_person in dataset if other_person != person]

    # Sort the similar persons so that the highest scoring person appears first
    scores.sort()
    scores.reverse()
    return scores[0:number_of_users]

print most_similar_users('Lisa Rose', 3)

Script Output:

[(0.9912407071619299, 'Toby'), (0.7470178808339965, 'Jack Matthews'), (0.5940885257860044, 'Mick LaSalle')]
[Finished in 0.0s]

So far we have only found the people who are most similar to a given person; now we have to recommend some movies to that person. But simply relying on the most similar person would be too permissive: such an approach could accidentally turn up reviewers who haven't rated some of the movies that the person would like, and it could also return a reviewer who strangely likes a movie that got bad reviews from all the other people returned by the most_similar_users function.

To solve these issues, you need to score the items by producing a weighted score that ranks the users: take the ratings of all the other persons and multiply how similar they are to the given person by the score they gave to each movie.
The image below shows how to do that.

[Table: weighted scores used to compute recommendations for Toby]

 

This image shows the correlation scores for each person and the ratings they gave for the three movies The Night Listener, Lady in the Water, and Just My Luck, which Toby hasn't rated. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to Toby contributes more to the overall score than a person who is different from Toby. The Total row shows the sum of all these numbers.

We could just use the totals to calculate the rankings, but then a movie reviewed by more people would have a big advantage. To correct for this, you need to divide each total by the sum of the similarities of the persons that reviewed that movie (the Sim.Sum row in the table). Because The Night Listener was reviewed by everyone, its total is divided by the sum of all the similarities; Lady in the Water, however, was not reviewed by Puig, so its total is divided by the sum of the remaining similarities. The last row shows the results of this division.

Let’s implement that now.


def user_reommendations(person):

    # Gets recommendations for a person by using a weighted average of every other user's rankings
    totals = {}
    simSums = {}
    for other in dataset:
        # Don't compare me to myself
        if other == person:
            continue
        sim = pearson_correlation(person, other)

        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:

            # Only score movies the person hasn't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:

                # Similarity * score
                totals.setdefault(item, 0)
                totals[item] += dataset[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    rankings.sort()
    rankings.reverse()

    # Return the recommended items
    recommendataions_list = [recommend_item for score, recommend_item in rankings]
    return recommendataions_list

print "Recommendations for Toby"
print user_reommendations('Toby')

Script Output:

Recommendations for Toby

['The Night Listener', 'Lady in the Water', 'Just My Luck']
[Finished in 0.0s]

We are done with the recommendation engine; just change the person to any other user and check the recommended items.

To get the complete code, you can clone our GitHub project:

 https://github.com/saimadhu-polamuri/CollaborativeFiltering

To get all the code from the dataaspirant blog, you can clone the GitHub repository below:

 https://github.com/saimadhu-polamuri/DataAspirant_codes

Reference Book:

Programming Collective Intelligence, by Toby Segaran

Follow us:

FACEBOOK | QUORA | TWITTER | REDDIT | FLIPBOARD | LINKEDIN | MEDIUM | GITHUB

I hope you liked today's post. If you have any questions, feel free to comment below. If you want me to write on one specific topic, then do tell it to me in the comments below.

If you want to share your experience or opinions, you can say hello to hello@dataaspirant.com

THANKS FOR READING…..


Five most popular similarity measures implementation in python



The buzz term similarity distance measure has a wide variety of definitions among math and data mining practitioners. As a result, those terms, concepts, and their usage can go way over the head of a beginner who is trying to understand them for the very first time. So in this post I give simpler, more intuitive definitions of similarity, and then dive into the five most popular similarity measures and their implementations.

Before explaining the different similarity distance measures, let me explain the key term similarity in data mining. Similarity is the basic building block for activities such as recommendation engines, clustering, classification, and anomaly detection.

Similarity:

Similarity is the measure of how much alike two data objects are. Similarity in a data mining context is usually described as a distance, with dimensions representing features of the objects. If this distance is small, there is a high degree of similarity; if the distance is large, the degree of similarity is low. Similarity is subjective and highly dependent on the domain and application. For example, two fruits may be considered similar because of their color, size, or taste. Care should be taken when calculating distance across dimensions/features that are unrelated: the relative values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity is usually measured in the range 0 to 1 [0,1]; a small sketch after the two points below shows one common way to map a distance onto this range.

Two main considerations about similarity:

  • Similarity = 1 if X = Y         (Where X, Y are two objects)
  • Similarity = 0 if X ≠ Y
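Since distances grow without bound while similarity scores are usually kept in [0, 1], one common trick is to squash a distance into that range. The snippet below is just an illustrative sketch (the distance_to_similarity helper is not used later in this post):

def distance_to_similarity(distance):
    # Identical objects (distance 0) map to 1.0; larger distances approach 0
    return 1.0 / (1.0 + distance)

print distance_to_similarity(0)     # prints 1.0
print distance_to_similarity(9.75)  # prints roughly 0.093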

That's all about similarity; let's dive into the five most popular similarity distance measures.

Euclidean distance:


Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance, they mean Euclidean distance; it is also known simply as distance. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them, and it is given by the Pythagorean theorem.

Euclidean distance implementation in python:


#!/usr/bin/env python

from math import*

def euclidean_distance(x,y):

  return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))

print euclidean_distance([0,3,4,5],[7,6,3,-1])

Script Output:

9.74679434481
[Finished in 0.0s]
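As a quick sanity check of the Pythagorean-theorem claim, a classic 3-4-5 right triangle should give a distance of exactly 5 (this reuses the euclidean_distance function defined above):

# sqrt(3^2 + 4^2) = 5
print euclidean_distance([0, 0], [3, 4])  # prints 5.0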

 

Manhattan distance:


Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the absolute sum of the differences between the x-coordinates and the y-coordinates. Suppose we have two points A and B; to find the Manhattan distance between them, we just sum up the absolute x-axis and y-axis variation, i.e. how much the two points A and B vary along the x-axis and the y-axis. In more mathematical terms, the Manhattan distance between two points is measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2).

Manhattan distance = |x1 – x2| + |y1 – y2|

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or the taxicab metric.

Manhattan distance implementation in python:


#!/usr/bin/env python

from math import*

def manhattan_distance(x,y):

  return sum(abs(a-b) for a,b in zip(x,y))

print manhattan_distance([10,20,10],[10,20,20])

Script Output:

10
[Finished in 0.0s]

 

Minkowski distance:


The Minkowski distance is a generalized metric form of Euclidean distance and Manhattan distance.

d^MKD(i, j) = ( Σ_k |x_ik – x_jk|^λ )^(1/λ),  with the sum taken over the k = 1, …, n variables

In the equation, d^MKD is the Minkowski distance between the data records i and j, k is the index of a variable, n is the total number of variables, and λ is the order of the Minkowski metric. Although it is defined for any λ > 0, it is rarely used for values other than 1, 2, and ∞.

The Minkowski metric measures the distance between two objects differently depending on the order: picture two points with three variables plotted in a coordinate system with x, y, and z axes.

Synonyms of Minkowski:
Different names for the Minkowski distance or Minkowski metric arise from the order:

  • λ = 1 is the Manhattan distance. Synonyms are L1-Norm, Taxicab or City-Block distance. For two vectors of ranked ordinal variables the Manhattan distance is sometimes called Foot-ruler distance.
  • λ = 2 is the Euclidean distance. Synonyms are L2-Norm or Ruler distance. For two vectors of ranked ordinal variables the Euclidean distance is sometimes called Spearman distance.
  • λ = ∞ is the Chebyshev distance. Synonyms are Lmax-Norm or Chessboard distance.

 Minkowski distance implementation in python:


#!/usr/bin/env python

from math import*
from decimal import Decimal

def nth_root(value, n_root):

 root_value = 1/float(n_root)
 return round (Decimal(value) ** Decimal(root_value),3)

def minkowski_distance(x,y,p_value):

 return nth_root(sum(pow(abs(a-b),p_value) for a,b in zip(x, y)),p_value)

print minkowski_distance([0,3,4,5],[7,6,3,-1],3)

Script Output:

8.373
[Finished in 0.0s]
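Since λ = 1 and λ = 2 reduce to the Manhattan and Euclidean distances, a quick sanity check is to compare the three implementations on the same pair of points (this assumes the euclidean_distance, manhattan_distance, and minkowski_distance functions defined above are in scope):

x = [0, 3, 4, 5]
y = [7, 6, 3, -1]

# order 1 should agree with the Manhattan distance
print minkowski_distance(x, y, 1), manhattan_distance(x, y)   # 17.0 and 17
# order 2 should agree with the Euclidean distance (up to rounding)
print minkowski_distance(x, y, 2), euclidean_distance(x, y)   # 9.747 and 9.74679434481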

Cosine similarity:


The cosine similarity metric finds the normalized dot product of the two attribute vectors. By determining the cosine similarity, we are effectively trying to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors.

Cosine similarity implementation in python:


#!/usr/bin/env python

from math import*

def square_rooted(x):

   return round(sqrt(sum([a*a for a in x])),3)

def cosine_similarity(x,y):

 numerator = sum(a*b for a,b in zip(x,y))
 denominator = square_rooted(x)*square_rooted(y)
 return round(numerator/float(denominator),3)

print cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15])

Script Output:

0.972
[Finished in 0.1s]
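Two quick sanity checks on the geometry described above: orthogonal vectors should score 0, and vectors pointing in the same direction should score 1 (this reuses the cosine_similarity and square_rooted functions from the snippet above):

print cosine_similarity([1, 0], [0, 1])   # vectors at 90 degrees -> 0.0
print cosine_similarity([1, 2], [2, 4])   # same orientation, different length -> 1.0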

Jaccard similarity:


So far we have discussed metrics that find the similarity between objects, where the objects are points or vectors. With Jaccard similarity, the objects are sets. So first let's cover some very basic facts about sets.

Sets:

A set is an (unordered) collection of objects, e.g. {a, b, c}. We write sets as elements separated by commas inside curly brackets { }. Sets are unordered, so {a, b} = {b, a}.

Cardinality:

The cardinality of A, denoted |A|, counts how many elements are in A.

Intersection:

The intersection of two sets A and B, denoted A ∩ B, contains all items which are in both A and B.

Union:

Union between two sets A and B is denoted A ∪ B and reveals all items which are in either set.
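Python's built-in set type already provides all three of these operations, so a tiny sketch makes the definitions concrete:

A = set(['a', 'b', 'c'])
B = set(['b', 'c', 'd'])

print len(A)              # cardinality of A -> 3
print A.intersection(B)   # elements in both A and B ('b' and 'c')
print A.union(B)          # elements in either A or B ('a', 'b', 'c', 'd')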

 


Now, going back to Jaccard similarity: the Jaccard similarity measures the similarity between finite sample sets, and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. Suppose you want to find the Jaccard similarity between two sets A and B; it is the ratio of the cardinality of A ∩ B to that of A ∪ B.

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard similarity implementation:


#!/usr/bin/env python

from math import*

def jaccard_similarity(x,y):

 intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
 union_cardinality = len(set.union(*[set(x), set(y)]))
 return intersection_cardinality/float(union_cardinality)

print jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9])

Script Output:

0.375
[Finished in 0.0s]
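To see where 0.375 comes from: the two lists share the elements 0, 2, and 5, so the intersection has 3 elements, while the union {0, 1, 2, 3, 5, 6, 7, 9} has 8 elements, and 3/8 = 0.375.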

Implementation of all 5 similarity measures in one Similarity class:

file_name : similaritymeasures.py


#!/usr/bin/env python

from math import*
from decimal import Decimal

class Similarity():

    """ Five similarity measures function """

    def euclidean_distance(self, x, y):
        """ return euclidean distance between two lists """
        return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))

    def manhattan_distance(self, x, y):
        """ return manhattan distance between two lists """
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski_distance(self, x, y, p_value):
        """ return minkowski distance between two lists """
        return self.nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

    def nth_root(self, value, n_root):
        """ return the n_root of a value """
        root_value = 1 / float(n_root)
        return round(Decimal(value) ** Decimal(root_value), 3)

    def cosine_similarity(self, x, y):
        """ return cosine similarity between two lists """
        numerator = sum(a * b for a, b in zip(x, y))
        denominator = self.square_rooted(x) * self.square_rooted(y)
        return round(numerator / float(denominator), 3)

    def square_rooted(self, x):
        """ return the square root of the sum of squares, rounded to 3 decimals """
        return round(sqrt(sum([a * a for a in x])), 3)

    def jaccard_similarity(self, x, y):
        """ return the jaccard similarity between two lists """
        intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
        union_cardinality = len(set.union(*[set(x), set(y)]))
        return intersection_cardinality / float(union_cardinality)

Using the Similarity class:


#!/usr/bin/env python

from similaritymeasures import Similarity

def main():

    """ main function to create a Similarity class instance and make use of it """

    measures = Similarity()

    print measures.euclidean_distance([0,3,4,5],[7,6,3,-1])
    print measures.jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9])

if __name__ == "__main__":
    main()

 

The complete code for this post can be found at this GitHub link:

 https://github.com/saimadhu-polamuri/DataAspirant_codes/tree/master/Similarity_measures

 


 

Linear Regression Implementation in Python


Linear Regression

In this post I am going to get your hands wet with some coding too. Before we dive in, let me show what types of examples we are going to solve today.

1) Predicting house price for ZooZoo.


  • ZooZoo is going to buy a new house, so we have to find out how much a particular house will cost.

2) Predicting which Television Show will have more viewers for next week


 

The Flash and Arrow are my favorite television (TV) shows. I want to find out which TV show will get more viewers in the upcoming week. Frankly, I am excited to know which show will get more viewers.

 3) Replacing missing values using linear Regression

You have to solve this problem yourself; I will explain everything you have to do, so please try to solve it.

So let's dive into the coding part.

I believe you have installed all the required packages which I specified in my previous post. If not, please take some time and install all the packages listed in this post: python packages for datamining. It would also be better to go through the Linear Regression post first.

1) Predicting cost price of  a house for ZooZoo.


ZooZoo has the following data set:

 

No.   square_feet   price
1     150           6450
2     200           7450
3     250           8450
4     300           9450
5     350           11450
6     400           15450
7     600           18450

 

 

 

About data set:

  • Square feet is the  Area of house.
  • Price is the corresponding cost of  that house.

Steps to Follow :

  • As we learned in linear regression, we have to find the best-fitting line for this data so that we can get θ0 and θ1.
  • As you remember, our hypothesis equation looks like this:

hθ(x) = θ0 + θ1 * x

where:

  • hθ(x) is nothing but the price value (which we are going to predict) for a particular square_feet value (meaning price is a linear function of square_feet)
  • θ0 is a constant
  • θ1 is the regression coefficient

As we clear what we have to do, let’s start coding.

STEP – 1 :

  • First open your favorite text editor and name the file predict_house_price.py.
  • We are going to use the packages below in our program, so copy them into your predict_house_price.py file.

# Required Packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

  • Just run your code once. If your program is error free, then most of the job is done. If you are facing any errors, it means you missed some packages, so please go to the packages page.
  • Install all the packages listed in that blog post and run your code once again. This time you will most probably not face any problems.
  • Now your program is error free, so we can go to STEP – 2.

STEP – 2

  • I stored our data set in a csv file named input_data.csv.
  • So let's write a function to get our data into X values (square_feet) and Y values (price).

# Function to get data
def get_data(file_name):
    data = pd.read_csv(file_name)
    X_parameter = []
    Y_parameter = []
    for single_square_feet, single_price_value in zip(data['square_feet'], data['price']):
        X_parameter.append([float(single_square_feet)])
        Y_parameter.append(float(single_price_value))
    return X_parameter, Y_parameter

Line 3:

Reading csv data to pandas DataFrame.

Line 6-9:

Converts the pandas DataFrame data into the X_parameter and Y_parameter lists and returns them.

So let’s print our X_parameters and Y_parameters


X,Y = get_data('input_data.csv')
print X
print Y

Script Output

[[150.0], [200.0], [250.0], [300.0], [350.0], [400.0], [600.0]]
[6450.0, 7450.0, 8450.0, 9450.0, 11450.0, 15450.0, 18450.0]
[Finished in 0.7s]

Step – 3

We converted the data into X_parameters and Y_parameters, so let's fit them to a linear regression model.

We are going to write a function which takes X_parameters, Y_parameters, and the value we want to predict as input, and returns θ0, θ1, and the predicted value.

 


# Function for Fitting our data to Linear model
def linear_model_main(X_parameters, Y_parameters, predict_value):

    # Create linear regression object
    regr = linear_model.LinearRegression()
    regr.fit(X_parameters, Y_parameters)
    predict_outcome = regr.predict([[predict_value]])  # scikit-learn expects a 2D array of inputs
    predictions = {}
    predictions['intercept'] = regr.intercept_
    predictions['coefficient'] = regr.coef_
    predictions['predicted_value'] = predict_outcome
    return predictions

Line 5-6:

First we create a linear regression model and then train it with our X_parameters and Y_parameters.

Line 8-12:

We create a dictionary named predictions, store θ0, θ1, and the predicted value in it, and return the predictions dictionary as output.

So let's call our function with 700 as the value to predict.


X,Y = get_data('input_data.csv')
predictvalue = 700
result = linear_model_main(X,Y,predictvalue)
print "Intercept value " , result['intercept']
print "coefficient" , result['coefficient']
print "Predicted value: ",result['predicted_value']

 

Script Output:

Intercept value 1771.80851064
coefficient [ 28.77659574]
Predicted value: [ 21915.42553191]
[Finished in 0.7s]

Here the intercept value is nothing but the θ0 value, and the coefficient value is nothing but the θ1 value.

We got the predicted value 21915.4255, which means we have done our job of predicting the house price.
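As a quick sanity check, you can plug these numbers back into the hypothesis equation: 1771.80851064 + 28.77659574 * 700 ≈ 1771.81 + 20143.62 ≈ 21915.43, which matches the predicted value printed above.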

For checking purposes, we should see how well our data fits the linear regression. So let's write a function which takes X_parameters and Y_parameters as input and shows the fitted line for our data.

 


# Function to show the results of linear fit model
def show_linear_line(X_parameters, Y_parameters):
    # Create linear regression object
    regr = linear_model.LinearRegression()
    regr.fit(X_parameters, Y_parameters)
    plt.scatter(X_parameters, Y_parameters, color='blue')
    plt.plot(X_parameters, regr.predict(X_parameters), color='red', linewidth=4)
    plt.xticks(())
    plt.yticks(())
    plt.show()

So let's call our show_linear_line function:


show_linear_line(X,Y)

Script Output:

[Plot: the house data points (square_feet vs. price) with the fitted regression line]

 

2) Predicting which Television Show will have more viewers


About The FLASH Tv show:


The Flash is an American television series developed by writer/producers Greg Berlanti, Andrew Kreisberg and Geoff Johns, airing on The CW. It is based on the DC Comics character Flash (Barry Allen), a costumed superhero crime-fighter with the power to move at superhuman speeds, who was created by Robert Kanigher, John Broome and Carmine Infantino. It is a spin-off from Arrow, existing in the same universe. The pilot for the series was written by Berlanti, Kreisberg and Johns, and directed by David Nutter. The series premiered in North America on October 7, 2014, where the pilot became the most watched telecast for The CW.

About Arrow Tv Show:


Arrow is an American television series developed by writer/producers Greg Berlanti, Marc Guggenheim, and Andrew Kreisberg. It is based on the DC Comics character Green Arrow, a costumed crime-fighter created by Mort Weisinger and George Papp. It premiered in North America on The CW on October 10, 2012, with international broadcasting taking place in late 2012. Primarily filmed in Vancouver, British Columbia, Canada, the series follows billionaire playboy Oliver Queen, portrayed by Stephen Amell, who, after five years of being stranded on a hostile island, returns home to fight crime and corruption as a secret vigilante whose weapon of choice is a bow and arrow. Unlike in the comic books, Queen does not initially go by the alias “Green Arrow”.

As both of these are my best-loved TV shows, whenever I am watching them I wonder which show has more viewers, and I am curious to guess which show will have more viewers next week.

So let's write a program which guesses (predicts) which TV show will have more viewers.

To feed our program, we need a dataset containing the viewers of each episode of both shows. Luckily I got this data from Wikipedia and prepared a csv file. It looks like this.

flash_episode   flash_us_viewers   arrow_episode   arrow_us_viewers
1               4.83               1               2.84
2               4.27               2               2.32
3               3.59               3               2.55
4               3.53               4               2.49
5               3.46               5               2.73
6               3.73               6               2.6
7               3.47               7               2.64
8               4.34               8               3.92
9               4.66               9               3.06

About data set:

The viewer numbers are US viewers, in millions.

Steps to solving this problem:

  • First we have to convert our data into X_parameters and Y_parameters, but here we have two sets of them, so let's name them flash_x_parameter, flash_y_parameter, arrow_x_parameter, and arrow_y_parameter.
  • Then we have to fit our data to two different linear regression models: one for Flash and the other for Arrow.
  • Then we have to predict the number of viewers of the next episode for both TV shows.
  • Then we can compare the results and guess which TV show will have more viewers.

Let's dive into coding this interesting problem.

Step-1

We have to import our packages


# Required Packages
import csv
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

Step-2

Let's convert our data into flash_x_parameter, flash_y_parameter, arrow_x_parameter, and arrow_y_parameter. So let's write a function which takes our data set as input and returns the flash_x_parameter, flash_y_parameter, arrow_x_parameter, and arrow_y_parameter values.


# Function to get data
def get_data(file_name):
    data = pd.read_csv(file_name)
    flash_x_parameter = []
    flash_y_parameter = []
    arrow_x_parameter = []
    arrow_y_parameter = []
    for x1, y1, x2, y2 in zip(data['flash_episode_number'], data['flash_us_viewers'], data['arrow_episode_number'], data['arrow_us_viewers']):
        flash_x_parameter.append([float(x1)])
        flash_y_parameter.append(float(y1))
        arrow_x_parameter.append([float(x2)])
        arrow_y_parameter.append(float(y2))
    return flash_x_parameter, flash_y_parameter, arrow_x_parameter, arrow_y_parameter

Now we have flash_x_parameter, flash_y_parameter, arrow_x_parameter, and arrow_y_parameter. Let's write a function which takes these parameters as input and tells us which show will have more viewers.


# Function to know which Tv show will have more viewers
def more_viewers(x1, y1, x2, y2):
    regr1 = linear_model.LinearRegression()
    regr1.fit(x1, y1)
    # scikit-learn expects a 2D array, so the episode number is wrapped in [[ ]]
    predicted_value1 = regr1.predict([[9]])
    regr2 = linear_model.LinearRegression()
    regr2.fit(x2, y2)
    predicted_value2 = regr2.predict([[9]])
    # print predicted_value1, predicted_value2  # uncomment to inspect the raw predictions
    if predicted_value1 > predicted_value2:
        print "The Flash Tv Show will have more viewers for next week"
    else:
        print "Arrow Tv Show will have more viewers for next week"

So let's put everything in one file. Open your editor, name the file prediction.py, and copy this complete code into the prediction.py file.


# Required Packages
import csv
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

# Function to get data
def get_data(file_name):
    data = pd.read_csv(file_name)
    flash_x_parameter = []
    flash_y_parameter = []
    arrow_x_parameter = []
    arrow_y_parameter = []
    for x1, y1, x2, y2 in zip(data['flash_episode_number'], data['flash_us_viewers'], data['arrow_episode_number'], data['arrow_us_viewers']):
        flash_x_parameter.append([float(x1)])
        flash_y_parameter.append(float(y1))
        arrow_x_parameter.append([float(x2)])
        arrow_y_parameter.append(float(y2))
    return flash_x_parameter, flash_y_parameter, arrow_x_parameter, arrow_y_parameter

# Function to know which Tv show will have more viewers
def more_viewers(x1, y1, x2, y2):
    regr1 = linear_model.LinearRegression()
    regr1.fit(x1, y1)
    # scikit-learn expects a 2D array, so the episode number is wrapped in [[ ]]
    predicted_value1 = regr1.predict([[9]])
    regr2 = linear_model.LinearRegression()
    regr2.fit(x2, y2)
    predicted_value2 = regr2.predict([[9]])
    # print predicted_value1, predicted_value2  # uncomment to inspect the raw predictions
    if predicted_value1 > predicted_value2:
        print "The Flash Tv Show will have more viewers for next week"
    else:
        print "Arrow Tv Show will have more viewers for next week"

x1,y1,x2,y2 = get_data('input_data.csv')
#print x1,y1,x2,y2
more_viewers(x1,y1,x2,y2)

Run this program and see which Tv show will have more viewers.

 3) Replacing missing values using linear Regression

Sometimes we have to analyse data which contains missing values. Some people remove these missing values and continue the analysis, and some people replace them with the minimum, maximum, or mean value. Replacing a missing value with the mean is often reasonable, but sometimes it is not the right approach, so instead we can use linear regression to replace those missing values very effectively.

This approach goes something like this.

First we find the column in which we are going to replace missing values, and we find the columns on which that column depends the most. Then we remove the rows with missing values, treat the missing-value column as the Y_parameters and the columns it depends on as the X_parameters, and fit this data to a linear regression model. Now we predict the missing values in that column from the columns it depends on.

Once all this is completed, we get data without any missing values, and we are free to analyse it.
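A minimal sketch of this workflow is shown below; the dataframe and the column names height and weight are hypothetical and purely for illustration, with weight assumed to be the column containing missing values:

# Hypothetical example: fill missing 'weight' values using 'height'
import pandas as pd
from sklearn import linear_model

def fill_missing_with_regression(data, x_column, y_column):
    # Rows where the target column is present are used for training
    known = data[data[y_column].notnull()]
    unknown = data[data[y_column].isnull()]
    if len(unknown) == 0:
        return data
    regr = linear_model.LinearRegression()
    regr.fit(known[[x_column]], known[y_column])
    # Predict the missing values and write them back into the dataframe
    data.loc[data[y_column].isnull(), y_column] = regr.predict(unknown[[x_column]])
    return data

data = pd.DataFrame({'height': [150, 160, 170, 180, 190],
                     'weight': [50, 56, None, 68, None]})
print fill_missing_with_regression(data, 'height', 'weight')

Running this fills the two missing weights with the values predicted from height (roughly 62 and 74 for this made-up data).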

For practice I leave this problem to you, so please find a dataset with missing values online and solve it. Leave a comment once you have completed it; I would be happy to see your results.

 Small personal note:

I want to share my personal experience with data mining. I remember that in my introductory data mining classes, the instructor started slowly and explained some interesting areas where we can apply data mining, along with some very basic concepts, so my friends and I understood everything and grew more interested in the subject. Then suddenly the difficulty level skyrocketed. This made a lot of my friends in the class feel extremely frustrated and intimidated by the course, and ultimately they lost interest in data mining. I want to avoid this in my blog posts. I want to keep things easygoing, which is only possible when I explain concepts through interesting examples; moreover, I want my readers to be comfortable learning without getting bored, and that is the spirit that leads me to use these examples.

 


 Thanks For Reading….
