Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter


Blog Posts:

1. Great resources for learning data mining concepts and techniques:

With today’s tools, anyone can collect data from almost anywhere, but not everyone can pull the important nuggets out of that data. Whacking your data into Tableau is an OK start, but it’s not going to give you the business-critical insights you’re looking for. To truly make your data come alive you need to mine it. Dig deep. Play around. And tease out the diamond in the rough.

Read Complete Post on:

2.  Interactive Data Science with R in Apache Zeppelin Notebook:

The objective of this blog post is to help you get started with Apache Zeppelin notebook for your R data science requirements. Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, Shell and more.

Read Complete Post on: sparkiq-labs

3. How to install Apache Hadoop 2.6.0 in Ubuntu:

Let’s get started towards setting up a fresh Multinode Hadoop (2.6.0) cluster.

Read Complete Post on: pingax

4. Running scalable Data Science on Cloud with R & Python:

So, why do we even need to run data science on the cloud? You might ask: if a laptop can pack 64 GB of RAM, do we even need the cloud for data science? The answer is a big YES, for a variety of reasons. Here are a few of them.

Read Complete Post on: analyticsvidhya

5. How to Choose Between Learning Python or R First:

If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.

When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.

Read Complete Post on: Udacity Blog

LinkedIn Posts:

  1. 5 Best Machine Learning APIs for Data Science
  2. Big Data Top Trends In 2016
  3. Big Data: 4 Things You Can Do With It, And 3 Things You Can’t


1. Machine Learning: Going Deeper with Python and Theano

2. Current State of Recommendation Systems

3. Pandas From The Ground Up


TensorFlow Google Machine Learning Library:

About TensorFlow:

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Get Started TensorFlow
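To make the data flow graph idea concrete, here is a minimal sketch in plain Python. It is not the TensorFlow API: the `Node` class and the `placeholder`, `add` and `scale` helpers are hypothetical names invented for this illustration. Nodes hold operations, and the lists passed between them play the role of tensors flowing along the edges.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)

    def evaluate(self, feed):
        # Values supplied from outside the graph short-circuit evaluation.
        if self in feed:
            return feed[self]
        # Otherwise evaluate the upstream nodes first, then apply this operation.
        args = [node.evaluate(feed) for node in self.inputs]
        return self.op(*args)


def placeholder():
    """An input node that must be given a value at evaluation time."""
    def missing():
        raise ValueError("placeholder was not fed a value")
    return Node(missing)


def add(a, b):
    """Element-wise addition of two list-valued nodes."""
    return Node(lambda x, y: [xi + yi for xi, yi in zip(x, y)], (a, b))


def scale(a, factor):
    """Multiply every element of a list-valued node by a constant."""
    return Node(lambda x: [factor * xi for xi in x], (a,))


# Build the graph first; nothing is computed yet.
x = placeholder()
y = placeholder()
z = scale(add(x, y), 2.0)

# Run the graph by feeding concrete data into the placeholders.
result = z.evaluate({x: [1.0, 2.0], y: [3.0, 4.0]})
print(result)
```

The point of the sketch is the separation between describing the computation (building `z`) and running it (calling `evaluate`), which is what lets a real library place the same graph on different devices.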


That’s all for the November 2015 newsletter. Please leave your suggestions for the newsletter in the comment section. To get all dataaspirant newsletters, you can visit the monthly newsletter page. Please subscribe to our blog so that you get our newsletter in your inbox every month.


DataAspirant Sept-Oct 2015 newsletter



Hi dataaspirant lovers, we are sorry for not publishing the dataaspirant September newsletter. To make up for it, the October newsletter includes the September ingredients too. We rounded up the best blogs for anyone interested in learning more about data science. Whether you are experienced in data science or have only just heard of the field, these blogs provide enough detail and context for you to understand what you’re reading. We also collected some videos. Hope you enjoy the October dataaspirant newsletter.


Blog Posts:

1 . How to do a Logistic Regression in R:

Regression is the statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of regression, such as simple linear, multiple linear, polynomial, logistic and Poisson.

Read Complete post on: datavinci
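The linked post works in R; as a rough companion, here is a minimal sketch of logistic regression in plain Python, fitted with batch gradient descent on a single feature. The data set (hours studied versus pass/fail), the learning rate and the number of epochs are made-up assumptions for illustration, not values from the post.

```python
import math


def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)   # estimated pass probability after 1 hour
p_high = sigmoid(w * 4.5 + b)  # estimated pass probability after 4.5 hours
print(round(p_low, 3), round(p_high, 3))
```

Unlike linear regression, the model outputs a probability, so the dependent variable can be a yes/no outcome.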

2 . Introduction of Markov State Modeling:

Modeling and prediction problems occur in different domains and data situations. One type of situation involves a sequence of events.

For instance, you may want to model the behaviour of customers on your website, looking at the pages they land on or enter by, the links they click, and so on. You may want to do this to understand common issues and needs, and then redesign your website to address them. On the other hand, you may want to promote certain sections or products on the website and want to understand the right page architecture and layout. In another example, you may be interested in predicting the next medical visit of a patient based on previous visits, or the next product a customer will purchase based on previous purchases.

Read Complete post on: edupristine
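As a small illustration of the website example, the sketch below estimates first-order Markov transition probabilities from page-visit sequences: for each page, the share of visits that moved on to each next page. The session data and the function name `transition_probabilities` are made up for this example and do not come from the linked post.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count every consecutive (current, next) pair in the session.
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[state] = {nxt: c / total for nxt, c in next_counts.items()}
    return probs


# Hypothetical click streams, one list per visitor session.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "home"],
    ["home", "search", "product", "cart"],
    ["search", "product", "cart", "checkout"],
]

probs = transition_probabilities(sessions)
most_likely_after_product = max(probs["product"], key=probs["product"].get)
print(probs["product"], most_likely_after_product)
```

Predicting the next event then amounts to looking up the current state and picking the most probable successor.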

3 . Five ways to improve the way you use Hadoop:

Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular service from Amazon that is used by developers trying to avoid the burden of set up and administration, and concentrate on working with their data.

Read Complete post on: cloudacademy

4. What is deep learning and why is it getting so much attention:

Deep learning is probably one of the hottest topics in machine learning today, and it has shown significant improvement over some of its counterparts. It is a class of machine learning algorithms, used for both supervised and unsupervised learning, that relies on multi-layered neural networks to achieve these remarkable outcomes.

Read Complete post on: analyticsvidhya

5. Facebook data collection and photo network visualization with Gephi and R:

The first thing to do is get the Facebook data. Before being allowed to pull it from R, you’ll need to make a quick detour to, register as a developer, and create a new app. Name and description are irrelevant, the only thing you need to do is go to Settings → Website → Site URL and fill in http://localhost:1410/ (that’s the port we’re going to be using). The whole process takes ~5 min and is quite painless

Read Complete post on: kateto

6. Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data:

The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster, easier to work with, and more secure.

Read Complete post on: cloudera

7. Rapid Development & Performance in Spark For Data Scientists:

Spark is a cluster computing framework that can significantly increase the efficiency and capabilities of a data scientist’s workflow when dealing with distributed data. However, deciding which of its many modules, features and options are appropriate for a given problem can be cumbersome. Our experience at Stitch Fix has shown that these decisions can have a large impact on development time and performance. This post will discuss strategies at each stage of the data processing workflow which data scientists new to Spark should consider employing for high productivity development on big data.

Read Complete post on: multithreaded

8. NoSQL: A Dog with Different Fleas:

The NoSQL movement is about providing performance, scale, and flexibility, where cost is sometimes part of the reasoning (e.g. Oracle Tax). Yet databases like MySQL, which provide all the Oracle features, are often considered before choosing NoSQL. And with respect to NoSQL flexibility, this can also be a Pandora’s box. In other words, schema-less modeling has been shown to be a serious complication to data management. I was at the MongoDB Storage Engine Summit this year and the number one ask to the storage engine providers was “how to discover schema in a schema-less architecture?” In other words, managing models over time is a serious matter to consider too.

Read Complete post on: deepis

9. Apache Spark: Sparkling star in big data firmament:

The underlying data needed to get the right outcomes for all of the above tasks is comparatively very large. It cannot be handled efficiently (in terms of both space and time) by traditional systems. These are all big data scenarios. To collect, store and do computations on this kind of voluminous data, we need a specialized cluster computing system. Apache Hadoop has solved this problem for us.

Read Complete post on: edupristine

10. Sqoop vs. Flume – Battle of the Hadoop ETL tools:

Apache Hadoop is synonymous with big data for its cost-effectiveness and its scalability for processing petabytes of data. Data analysis using Hadoop is just half the battle won. Getting data into the Hadoop cluster plays a critical role in any big data deployment. Data ingestion is important in any big data project because the volume of data is generally in petabytes or exabytes. Sqoop and Flume are the two Hadoop tools used to gather data from different sources and load it into HDFS. Sqoop is mostly used to extract structured data from databases like Teradata, Oracle, etc., while Flume is used to collect data stored in various sources and deals mostly with unstructured data.

Read Complete post on: dezyre



1. Spark and Spark Streaming at Uber :

2. How To Stream Twitter Data Into Hadoop Using Apache Flume:


That’s all for the October 2015 newsletter. Please leave your suggestions for the newsletter in the comment box. To get all dataaspirant newsletters, you can visit the monthly newsletter page. Please subscribe to our blog so that you get our newsletter in your inbox every month.



Four Coursera data science Specializations start this month



Starting is the biggest step toward achieving your dreams. This is 200% true for people who want to learn data science. The very first question that comes to mind for data science beginners is where to start. If you are trying to find the answer to this question, you’re on the right track. You can find your answer in Coursera Specializations.

What Coursera Specializations offer:

Coursera Data science Specializations and courses teach the fundamentals of interpreting data, performing analyses, and understanding and communicating actionable insights. Topics of study for beginning and advanced learners include qualitative and quantitative data analysis, tools and methods for data manipulation, and machine learning algorithms.


Big Data Specialization


About This Specialization:

In this Specialization, you will develop a robust set of skills that will allow you to process, analyze, and extract meaningful information from large amounts of complex data. You will install and configure Hadoop with MapReduce, use Spark, Pig and Hive, perform predictive modelling with open source tools, and leverage graph analytics to model problems and perform scalable analytical tasks. In the final Capstone Project, developed in partnership with data software company Splunk, you’ll apply the skills you learned by building your own tools and models to analyze big data in the context of retail, sports, current events, or another area of your choice.

COURSE 1:  Introduction to Big Data

Course Started on : Oct 26   Ends on: Nov 23

About the Course:

What’s the “hype” surrounding the Big Data phenomenon? Who are these mysterious data scientists everyone is talking about? What kinds of problem-solving skills and knowledge should they have? What kinds of problems can be solved by Big Data technology? After this short introductory course you will have answers to all these questions. Additionally, you will start to become proficient with the key technical terms and big data tools and applications to prepare you for a deep dive into the rest of the courses in the Big Data specialization. Each day, our society creates 2.5 quintillion bytes of data (that’s 2.5 followed by 18 zeros). With this flood of data the need to unlock actionable value becomes more acute, rapidly increasing demand for Big Data skills and qualified data scientists.
Hands-On Assignment Hardware and Software Requirements:
-Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+ or CentOS 6+
-VirtualBox 5+, VMWare Workstation 9+ or VMWare Fusion 7+
-Quad Core Processor (VT-x or AMD-V support recommended)
-8 GB RAM
-20 GB disk free

COURSE 2:  Hadoop Platform and Application Framework

Course Started on : Oct 20   Ends on: Nov 30

About the Course:

Are you looking for hands-on experience processing big data? After completing this course, you will be able to install, configure and implement an Apache Hadoop stack ranging from basic “Big Data” components to MapReduce and Spark execution frameworks. Moreover, in the exercises for this course you will solve fundamental problems that would require more computing power than a single computer. You will apply the most important Hadoop concepts in your solutions and use distributed/parallel processing in the Hadoop application framework. Get ready to be empowered to manipulate and analyze the significance of big data!

Course Link:  Hadoop Platform and Application Framework

COURSE 3: Introduction to Big Data Analytics

Course Starts : November 2015

About the Course:

Do you have specific business questions you want answered? Need to learn how to interpret results through analytics? This course will help you answer these questions by introducing you to HBase, Pig and Hive. In this course, you will take a real Twitter data set, clean it, bring it into an analytics engine, and create summary charts and drill-down dashboards. After completing this course, you will be able to utilize BigTable, distributed data store, columnar data, noSQL, and more!

Course Link: Introduction to Big Data Analytics


COURSE 4: Machine Learning With Big Data

Course Starts : December 2015

About the Course:

Want to learn the basics of large-scale data processing? Need to make predictive models but don’t know the right tools? This course will introduce you to open source tools you can use for parallel, distributed and scalable machine learning. After completing this course’s hands-on projects with MapReduce, KNIME and Spark, you will be able to train, evaluate, and validate basic predictive models. By the end of this course, you will be building a Big Data platform and utilizing several different tools and techniques.

Course Link:  Machine Learning With Big Data


COURSE 5:  Introduction to Graph Analytics

Course Starts : January 2016

About the Course:

Want to understand your data network structure and how it changes under different conditions? Curious to know how to identify closely interacting clusters within a graph? Have you heard of the fast-growing area of graph analytics and want to learn more? This course gives you a broad overview of the field of graph analytics so you can learn new ways to model, store, retrieve and analyze graph-structured data. After completing this course, you will be able to model a problem into a graph database and perform analytical tasks over the graph in a scalable manner. Better yet, you will be able to apply these techniques to understand the significance of your data sets for your own projects.

Course Link:  Introduction to Graph Analytics


Machine Learning Specialization


About This Specialization:

This Specialization provides a case-based introduction to the exciting, high-demand field of machine learning. You’ll learn to analyze large and complex datasets, build applications that can make predictions from data, and create systems that adapt and improve over time. In the final Capstone Project, you’ll apply your skills to solve an original, real-world problem through implementation of machine learning algorithms.

COURSE 1: Machine Learning Foundations: A Case Study Approach

Course Started on: Oct 26 Ends on: Dec 14

About the Course:
Do you have data and wonder what it can tell you? Do you need a deeper understanding of the core ways in which machine learning can improve your business? Do you want to be able to converse with specialists about anything from regression and classification to deep learning and recommender systems? In this course, you will get hands-on experience with machine learning from a series of practical case-studies. At the end of the first course you will have studied how to predict house prices based on house-level features, analyze sentiment from user reviews, retrieve documents of interest, recommend products, and search for images. Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains. This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms. Together, these pieces form the machine learning pipeline, which you will use in developing intelligent applications.
Learning Outcomes: By the end of this course, you will be able to:
-Identify potential applications of machine learning in practice.
-Describe the core differences in analyses enabled by regression, classification, and clustering.
-Select the appropriate machine learning task for a potential application.
-Apply regression, classification, clustering, retrieval, recommender systems, and deep learning.
-Represent your data as features to serve as input to machine learning models.
-Assess the model quality in terms of relevant error metrics for each task.
-Utilize a dataset to fit a model to analyze new data.
-Build an end-to-end application that uses machine learning at its core.
-Implement these techniques in Python.

COURSE 2: Regression

Starts November 2015
About the Course:
Case Study: Predicting Housing Prices
In our first case study, predicting house prices, you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…). This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression. In this course, you will explore regularized linear regression models for the task of prediction and feature selection. You will be able to handle very large sets of features and select between models of various complexity. You will also analyze the impact of aspects of your data, such as outliers, on your selected models and predictions. To fit these models, you will implement optimization algorithms that scale to large datasets.
Learning Outcomes: By the end of this course, you will be able to:
-Describe the input and output of a regression model.
-Compare and contrast bias and variance when modeling data.
-Estimate model parameters using optimization algorithms.
-Tune parameters with cross validation.
-Analyze the performance of the model.
-Describe the notion of sparsity and how LASSO leads to sparse solutions.
-Deploy methods to select between models.
-Exploit the model to form predictions.
-Build a regression model to predict prices using a housing dataset.
-Implement these techniques in Python.
Course Link: Regression
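One of the outcomes above is understanding how LASSO leads to sparse solutions. The following is a minimal sketch of LASSO fitted by coordinate descent in plain Python, under several simplifying assumptions: no intercept, unscaled features, a fixed number of sweeps, and a tiny made-up data set in which the second feature barely matters. It is an illustration, not the course’s own implementation.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2n) * sum((y - Xw)^2) + lam * sum(|w_j|), one weight at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the residual that ignores w[j]
            z = 0.0    # sum of squares of feature j
            for row, target in zip(X, y):
                partial = sum(w[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * (target - partial)
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Hypothetical data: the target depends strongly on feature 0, weakly on feature 1.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.4], [5.0, 0.1]]
y = [2.0 * a + 0.05 * b for a, b in X]

dense = lasso_coordinate_descent(X, y, lam=0.0)   # ordinary least squares
sparse = lasso_coordinate_descent(X, y, lam=0.1)  # L1 penalty switched on
print(dense, sparse)
```

With the penalty switched on, the weak feature’s weight is set to exactly zero rather than merely made small, which is why LASSO doubles as a feature selection method.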

COURSE 3: Classification

Starts December 2015
About the Course:
Case Study: Analyzing Sentiment
In our second case study, analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). This task is an example of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification. In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks. You will become familiar with some of the most successful techniques, including logistic regression, boosted decision trees and kernelized support vector machines. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale. You will implement these techniques on real-world, large-scale machine learning tasks.
Learning Objectives: By the end of this course, you will be able to:
-Describe the input and output of a classification model.
-Tackle both binary and multiclass classification problems.
-Implement a logistic regression model for large-scale classification.
-Create a non-linear model using decision trees.
-Improve the performance of any model using boosting.
-Construct non-linear features using kernels.
-Describe the underlying decision boundaries.
-Build a classification model to predict sentiment in a product review dataset.
-Implement these techniques in Python.
Course Link: Classification

COURSE 4: Clustering & Retrieval

Starts February 2016
About the Course:
Case Study: Finding Similar Documents
A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover? In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.
Learning Outcomes: By the end of this course, you will be able to:
-Create a document retrieval system using k-nearest neighbors.
-Describe how k-nearest neighbors can also be used for regression and classification.
-Identify various similarity metrics for text data.
-Cluster documents by topic using k-means.
-Perform mixed membership modeling using latent Dirichlet allocation (LDA).
-Describe how to parallelize k-means using MapReduce.
-Examine mixtures of Gaussians for density estimation.
-Fit a mixture of Gaussian model using expectation maximization (EM).
-Compare and contrast initialization techniques for non-convex optimization objectives.
-Implement these techniques in Python.
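To give a flavour of the clustering outcomes listed above, here is a minimal k-means sketch in plain Python. It assumes small 2-D points instead of real document vectors, and it initializes the centroids with the first k points purely to keep the run deterministic; the course itself compares initialization techniques, so treat that choice as a placeholder.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Alternate between assigning points to centroids and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

For documents, each point would be a feature vector such as word counts or TF-IDF weights, and the distance function would be swapped for a similarity metric suited to text.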

COURSE 5:  Recommender Systems & Dimensionality Reduction

Starts March 2016
About the Course:
Case Study: Recommending Products
How does Amazon recommend products you might be interested in purchasing? How does Netflix decide which movies or TV shows you might want to watch? What if you are a new user, should Netflix just recommend the most popular movies? Who might you form a new link with on Facebook or LinkedIn? These questions are endemic to most service-based industries, and underlie the notion of collaborative filtering and the recommender systems deployed to solve these problems. In this fourth case study, you will explore these ideas in the context of recommending products based on customer reviews. In this course, you will explore dimensionality reduction techniques for modeling high-dimensional data. In the case of recommender systems, your data is represented as user-product relationships, with potentially millions of users and hundreds of thousands of products. You will implement matrix factorization and latent factor models for the task of predicting new user-product relationships. You will also use side information about products and users to improve predictions.
Learning Outcomes: By the end of this course, you will be able to:
-Create a collaborative filtering system.
-Reduce dimensionality of data using SVD, PCA, and random projections.
-Perform matrix factorization using coordinate descent.
-Deploy latent factor models as a recommender system.
-Handle the cold start problem using side information.
-Examine a product recommendation application.
-Implement these techniques in Python.

Data Science at Scale Specialization


About This Specialization:
This Specialization covers intermediate topics in data science. You will gain hands-on experience with scalable SQL and NoSQL data management solutions, data mining algorithms, and practical statistical and machine learning concepts. You will also learn to visualize data and communicate results, and you’ll explore legal and ethical issues that arise in working with big data. In the final Capstone Project, developed in partnership with the digital internship platform Coursolve, you’ll apply your new skills to a real-world data science project.

COURSE 1:  Data Manipulation at Scale: Systems and Algorithms

Upcoming session: Oct 26 — Nov 30
About the Course:
Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making — we are drowning in it. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales. In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are coming on the horizon. Cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays will be covered. You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project. At the end of this course, you will be able to:
Learning Goals:
1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, mapreduce, and other data flow models.
3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics
4. Evaluate key-value stores and NoSQL systems, describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.
5. “Think” in MapReduce to effectively write algorithms for systems including Hadoop and Spark. You will understand their limitations, design details, their relationship to databases, and their associated ecosystem of algorithms, extensions, and languages, and you will write programs in Spark.
6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams.
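As a rough picture of what “thinking in MapReduce” means, the sketch below runs the classic word count as three explicit phases in plain Python on one machine. The function names and sample documents are made up for illustration; a real Hadoop or Spark job would distribute the same phases across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data big ideas", "Data science at scale", "big scale"]

# Each document can be mapped independently, which is what makes the map step parallel.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because the map calls share no state and each reduce call sees only one key, both phases can be spread over many machines without changing the logic.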

COURSE 2:  Practical Predictive Analytics: Models and Methods

Upcoming session: Oct 26 — Nov 30
About the Course:
Statistical experiment design and analytics are at the heart of data science. In this course you will design statistical experiments and analyze the results using modern methods. You will also explore the common pitfalls in interpreting statistical arguments, especially those associated with big data. Collectively, this course will help you internalize a core set of practical and effective machine learning methods and concepts, and apply them to solve some real world problems.
Learning Goals: After completing this course, you will be able to:
1. Design effective experiments and analyze the results
2. Use resampling methods to make clear and bulletproof statistical arguments without invoking esoteric notation
3. Explain and apply a core set of classification methods of increasing complexity (rules, trees, random forests), and associated optimization methods (gradient descent and variants)
4. Explain and apply a set of unsupervised learning concepts and methods
5. Describe the common idioms of large-scale graph analytics, including structural query, traversals and recursive queries, PageRank, and community detection.
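The resampling methods mentioned in the learning goals above can be illustrated with a percentile bootstrap. This is a minimal sketch assuming a small made-up sample, 2000 resamples and a fixed random seed for repeatability; none of these values come from the course.

```python
import random


def bootstrap_mean_ci(data, resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(data)
    means = []
    for _ in range(resamples):
        # Resample the data with replacement and record the resample's mean.
        sample = rng.choices(data, k=n)
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * resamples)]
    upper = means[int((1.0 - tail) * resamples) - 1]
    return lower, upper


# Hypothetical measurements.
data = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]
sample_mean = sum(data) / len(data)
lower, upper = bootstrap_mean_ci(data)
print(sample_mean, lower, upper)
```

The interval comes directly from the spread of the resampled means, so the argument needs no distributional formula, only the assumption that the sample is representative.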
About the Course:
Producing numbers is not enough; effective data scientists know how to interpret the numbers and communicate findings accurately to stakeholders to inform business decisions. Visualization is a relatively recent field of research in computer science that links perception, cognition, and algorithms to exploit the enormous bandwidth of the human visual cortex. In this course you will design effective visualizations and develop skills in recognizing and avoiding poor visualizations. Just because you can get the answer using big data doesn’t mean you should. In this course you will have the opportunity to explore the ethical considerations around big data and how these considerations are beginning to influence policy and practice.
Learning Goals: After completing this course, you will be able to:
1. Design and critique visualizations
2. Explain the state-of-the-art in privacy, ethics, governance around big data and data science
3. Explain the role of open data and reproducibility in data science.

Data Analysis and Interpretation Specialization

About This Specialization:
The Data Analysis and Interpretation Specialization takes you from data novice to data analyst in just four project-based courses. You’ll learn to apply basic data science tools and techniques, including data visualization, regression modeling, and machine learning. Throughout the Specialization, you will analyze research questions of your choice and summarize your insights. In the final Capstone Project, you will use real data to address an important issue in society, and report your findings in a professional-quality report. These instructors are here to create a warm and welcoming place at the table for everyone. Everyone can do this, and we are building a community to show the way.

COURSE 1:  Data Management and Visualization

Upcoming session: Oct 26 — Nov 30
About the Course:
Have you wanted to describe your data in more meaningful ways? Interested in making visualizations from your own data sets? After completing this course, you will be able to manage, describe, summarize and visualize data. You will choose a research question based on available data and engage in the early decisions involved in quantitative research. Based on a question of your choosing, you will describe variables and their relationships through frequency tables, calculate statistics of center and spread, and create graphical representations.
By the end of this course, you will be able to:
-Use a data codebook to decipher a data set.
-Identify questions or problems that can be tackled by a particular data set.
-Determine the data management steps that are needed to prepare data for analysis.
-Write code to execute a variety of data management and data visualization techniques.

Course Link:  Data Management and Visualization

COURSE 2: Data Analysis Tools

Current session: Oct 22 — Nov 30
About the Course:
Do you want to answer questions with data? Interested in discovering simple methods for answering these questions? Hypothesis testing is the tool for you! After completing this course, you will be able to: – identify the right statistical test for the questions you are asking – apply and carry out hypothesis tests – generalize the results from samples to larger populations – use Analysis of Variance, Chi-Square Test of Independence and Pearson correlation – present your findings using statistical language.
Course Link: Data Analysis Tools

COURSE 3:  Regression Modeling in Practice

Starts November 2015
About the Course:
What kinds of statistical tools can you use to test your research question in more depth? In this course, you will go beyond basic data analysis tools to develop multiple linear regression and logistic regression models to address your research question more thoroughly. You will examine multiple predictors of your outcome and identify confounding variables. In this course you will be introduced to additional Python libraries for regression modeling. You will learn the assumptions underlying regression analysis, how to interpret regression coefficients, and how to use regression diagnostic plots and other tools to evaluate residual variability. Finally, through blogging, you will present the story of your regression model using statistical language.

COURSE 4: Machine Learning for Data Analysis

Starts January 2016
About the Course:
Are you interested in predicting future outcomes using your data? This course helps you do just that! Machine learning is the process of developing, testing, and applying predictive algorithms to achieve this goal. Make sure to familiarize yourself with course 3 of this specialization before diving into these machine learning concepts. Building on Course 3, which introduces students to integral supervised machine learning concepts, this course will provide an overview of many additional concepts, techniques, and algorithms in machine learning, from basic classification to decision trees and clustering. By completing this course, you will learn how to apply, test, and interpret machine learning algorithms as alternative methods for addressing your research questions.

Follow us:


I hope you liked today's post. If you have any questions, feel free to comment below. If you want me to write on a specific topic, tell me in the comments below.

If you want to share your experience or opinions, you can do so in the comments as well.


Collaborative filtering recommendation engine implementation in Python


Collaborative Filtering



In the introduction post on recommendation engines we saw why recommendation engines are needed in real life and why they matter online, and we discussed three methods for building a recommendation engine. They are:

1) Collaborative filtering

2) Content-based filtering

3) Hybrid Recommendation Systems

So today we are going to implement the collaborative filtering approach to a recommendation engine. Before that, I want to explain some key things about recommendation engines that were missed in the Introduction to recommendation engine post.

Today's learning topics:

1) What is the long tail phenomenon in recommendation engines?

2) The basic idea of collaborative filtering

3) Understanding the collaborative filtering approach with a friends' movie recommendation engine

4) Implementing the collaborative filtering approach of a recommendation engine

What is the long tail phenomenon in recommendation engines?

Recommendation engines have become popular over the last 10 to 20 years. The reason is that we have moved from a state of scarcity to a state of abundance. Don't be frightened by the two words scarcity and abundance. The coming paragraphs explain why I use them to describe the popularity of recommendation engines.

Scarcity to Abundance:


Let's understand scarcity to abundance with book stores. Imagine a physical book store like Crossword with thousands of books. We can see almost all the popular books on its shelves, but shelf space is limited. To add more shelves the retailer needs more space, and for that the retailer has to spend a huge amount of money. This is the reason why we miss out on so many good books that are less popular and end up getting only the popular ones.

On the other hand, online book stores have unlimited shelf space, so we can get almost all books, whether popular or unpopular. This is because the web enables near-zero-cost dissemination of information about products (books, news articles, mobiles, etc.). This near-zero-cost dissemination of information on the web gives rise to a phenomenon called the “Long Tail” phenomenon.

Long Tail phenomenon:


The long tail graph represents this phenomenon clearly. On the X-axis we have products ranked by popularity (the most popular products are on the left and less popular products are further to the right). On the Y-axis we have the actual popularity, meaning how many times a product was purchased or viewed in an interval of time such as one week or one month.

At the top of the Y-axis, the orange curve sits only a short distance from the axis itself, which means that very few products are highly popular. The curve initially falls very steeply. As we move to the right and the product rank grows, the fall becomes less and less steep, and the curve never reaches the X-axis.

The interesting thing is that the products to the right of the cutoff point are much less popular and are hardly purchased once or twice in a week or month. We don't find these products in a physical retail store, because storing less popular items is a waste of money, so a good retailer will not stock them. Popular products are available in physical stores as well as online, so the part of the graph to the left of the cutoff point, the Head, is covered by both physical and online stores. The less popular products to the right of the cutoff point are available only online. This part of the curve, to the right of the cutoff point, is called the Long Tail.

If you look at the area under the curve of the Long Tail, you will notice that there are many good products that are less popular. Finding them online is a hard task for the user, so there is a need for a system that can recommend these good but unpopular products to the user by considering some metrics.
This system is nothing but a recommendation system.
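To make the head-versus-tail split concrete, here is a minimal sketch. It assumes, purely for illustration, that the product at popularity rank r is purchased in proportion to 1/r (a Zipf-like curve). The catalogue size, the cutoff rank and the function name tail_share are all made up and do not come from real sales data.

```python
# Hypothetical long-tail illustration: popularity of rank r is assumed to be 1/r.

def tail_share(total_products, cutoff_rank):
    """Return the fraction of all purchases that falls to the right of cutoff_rank."""
    popularity = [1.0 / rank for rank in range(1, total_products + 1)]
    head = sum(popularity[:cutoff_rank])   # products a physical store would stock
    tail = sum(popularity[cutoff_rank:])   # products only an online store can carry
    return tail / (head + tail)

# Assumed numbers: 100,000 products online, 1,000 of them on physical shelves.
share = tail_share(100000, 1000)
print("Share of purchases in the long tail: {:.0%}".format(share))
```

Under these assumed numbers a bit over a third of all purchases come from products that a physical store would never stock, which is why helping users find them is worthwhile.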

Basic idea about Collaborative filtering:

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to the user's. It looks at the other things they like and combines them to create a ranked list of suggestions. Finally, it shows the suggestions to the user. For a better understanding of collaborative filtering, let's think about our own friends' movie recommendation system.

Understanding the Collaborative filtering approach with a friends' movie recommendation engine:


Let's understand the collaborative filtering approach through a friends' movie recommendation engine. To explain it, I want to share my own story.

Like most people, I love to watch movies on weekends. Generally there are so many movies on my hard disk that it's hard to select one. That's why, when I want to watch a movie, I look for friends whose movie taste is similar to mine and ask them to recommend a movie I may like (one I haven't seen but they have). They recommend a movie which they liked and which I have never seen. This may have happened to you too.

In effect, you are implementing your own movie recommendation engine, with your friends as the group of people. You know whose taste matches yours because you have learned it over time by observing whether they usually like the same things as you. So you select a group of your friends, find the one who is most similar to you, and expect a movie recommendation from that friend. Applying this technique to implement a recommendation engine is called collaborative filtering.

I hope I have made the idea of collaborative filtering clear. So let's get our hands dirty by implementing collaborative filtering in the Python programming language.

Implementing the Collaborative filtering approach of a recommendation engine:

Data set for implementing the collaborative filtering recommendation engine:

To implement collaborative filtering, we first need a data set of rated preferences (how much the people in the data set like a set of items). I am taking this data set from one of my favorite books, Programming Collective Intelligence, written by Toby Segaran.

First I am storing this data set in a Python dictionary. For a huge data set we would generally store these preferences in a database.
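For a larger data set, the same preferences could live in a database and be loaded into a dictionary on demand. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name ratings, its columns and the helper load_preferences are assumptions made for illustration and are not part of the original code.

```python
import sqlite3

def load_preferences(connection):
    """Build a {person: {movie: rating}} dictionary from a ratings table."""
    preferences = {}
    rows = connection.execute("SELECT person, movie, rating FROM ratings")
    for person, movie, rating in rows:
        preferences.setdefault(person, {})[movie] = rating
    return preferences

# In-memory database so the example leaves no file behind.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE ratings (person TEXT, movie TEXT, rating REAL)")
connection.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("Toby", "Snakes on a Plane", 4.5),
     ("Toby", "You, Me and Dupree", 1.0),
     ("Toby", "Superman Returns", 4.0)],
)

prefs = load_preferences(connection)
print(prefs["Toby"]["Superman Returns"])
connection.close()
```

The resulting dictionary has the same nested shape as the data set used in the rest of this post, so the similarity functions would not need to change.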

File name : recommendation_data.py

#!/usr/bin/env python
# Collaborative filtering data set taken from the Collective Intelligence book.

# Nested dictionary: person -> {movie title -> rating}
dataset = {
    'Lisa Rose': {'Lady in the Water': 2.5,
                  'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0,
                  'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5,
                  'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0,
                     'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5,
                     'Superman Returns': 5.0,
                     'The Night Listener': 3.0,
                     'You, Me and Dupree': 3.5},
    'Michael Phillips': {'Lady in the Water': 2.5,
                         'Snakes on a Plane': 3.0,
                         'Superman Returns': 3.5,
                         'The Night Listener': 4.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5,
                     'Just My Luck': 3.0,
                     'The Night Listener': 4.5,
                     'Superman Returns': 4.0,
                     'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Lady in the Water': 3.0,
                     'Snakes on a Plane': 4.0,
                     'Just My Luck': 2.0,
                     'Superman Returns': 3.0,
                     'The Night Listener': 3.0,
                     'You, Me and Dupree': 2.0},
    'Jack Matthews': {'Lady in the Water': 3.0,
                      'Snakes on a Plane': 4.0,
                      'The Night Listener': 3.0,
                      'Superman Returns': 5.0,
                      'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5,
             'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0}}

Now we are ready with the data set, so let's start implementing the recommendation engine. Create a new file, and be careful to create it in the same directory as the data set file, because the script imports the data set as recommendation_data. First, let's import our recommendation data set into this file and check whether we are getting the data properly by answering the questions below.

1) What ratings did Lisa Rose and Michael Phillips give to Lady in the Water?

2) What are the movie ratings of Jack Matthews?

File name :

#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset

# print() with a single argument behaves the same in Python 2 and Python 3
print("Lisa Rose rating on Lady in the water: {}\n".format(dataset['Lisa Rose']['Lady in the Water']))
print("Michael Phillips rating on Lady in the water: {}\n".format(dataset['Michael Phillips']['Lady in the Water']))

print('**************Jack Matthews ratings**************')
print(dataset['Jack Matthews'])

Script Output:

Lisa Rose rating on Lady in the water: 2.5

Michael Phillips rating on Lady in the water: 2.5

**************Jack Matthews ratings**************
{'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'You, Me and Dupree': 3.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0}

After getting the data, we need to find similar people by comparing each person with every other person. We do this by calculating a similarity score between them. To know more about similarity you can view the similarity score post. So let's write a function to find the similarity between two persons.

Similar Persons:

First let's use Euclidean distance to find the similarity between two people. To do that we need to write a Euclidean distance function. Let's add this function to our file and find the Euclidean distance between Lisa Rose and Jack Matthews. Before that, let's recall the Euclidean distance formula.


Euclidean distance is the most commonly used distance measure. In most cases when people talk about distance, they are referring to Euclidean distance, which is also known simply as distance. When data is dense or continuous, this is often the best proximity measure. The Euclidean distance between two points is the length of the path connecting them. This distance between two points is given by the Pythagorean theorem.
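Written out for two people p and q who have both rated the same n items, the distance and the similarity score derived from it look like this. The symbols p_i, q_i and n are my own labels for this sketch.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2},
\qquad
\text{similarity}(p, q) = \frac{1}{1 + d(p, q)}
```

Here p_i and q_i are the ratings the two people gave to item i. The similarity is 1 when the ratings are identical (distance 0) and moves towards 0 as the distance grows.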


#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):

    # Returns the Euclidean-distance-based similarity score of person1 and person2

    both_viewed = {}  # Items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # If they have no rated items in common, the score is zero
    if len(both_viewed) == 0:
        return 0

    # Collect the squared rating differences for the items both have rated
    sum_of_euclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_euclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    # Sum the squared differences only after the loop has finished
    sum_of_euclidean_distance = sum(sum_of_euclidean_distance)

    return 1 / (1 + sqrt(sum_of_euclidean_distance))

print(similarity_score('Lisa Rose', 'Jack Matthews'))

Script Output:

The Euclidean distance score of Lisa Rose and Jack Matthews is approximately 0.34054.

Code Explanation:

Imports:

  • We import the recommendation data set and the sqrt function.

similarity_score function:

  • The similarity_score function takes two parameters, person1 and person2.
  • The first thing we need to check is whether person1 and person2 have rated any common items. The first for loop does this by filling both_viewed.
  • Once we have both_viewed, we check its length. If it is zero, there is no need to compute the similarity score, because it will be zero.
  • Then we find the sum of the squared differences, considering only the items rated by both person1 and person2.
  • Then we return the inverted value of the Euclidean distance. The reason for inverting is that Euclidean distance gives the distance between the two users: a smaller distance means the users are more similar, but we want a higher value for people who are similar. We get this by adding 1 to the Euclidean distance (so you don't get a division by zero error) and inverting it.


Do you think the approach we used here is a good one for finding the similarity between two users? Let's consider an example to get a clear idea. Suppose we have two users, X and Y. If X feels a movie is good he rates it 4, if he feels it's an average movie he rates it 2, and if he feels it's not a good movie he rates it 1. In the same way, Y rates a good movie 5, an average movie 4 and a bad movie 3.


If we calculate the similarity between these two users, they come out somewhat similar, but we are missing one important point. Consider a movie rated by both X and Y, and suppose X rated it 4 and Y rated it 4. The Euclidean distance formula says X and Y are very similar on this movie, but it is a good movie for X and only an average movie for Y. So if we use Euclidean distance our approach will be wrong, and we have to use some other approach to find the similarity between two users. This approach is Pearson Correlation.
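The X and Y example can be checked with a few lines of code. This is a minimal sketch: the three movie labels are made up, and the two helper functions are self-contained stand-ins written for this example rather than the functions from our recommendation file.

```python
from math import sqrt

# Hypothetical ratings following the example above: three movies that
# X and Y both consider good, average and bad respectively.
x_ratings = {'good movie': 4, 'average movie': 2, 'bad movie': 1}
y_ratings = {'good movie': 5, 'average movie': 4, 'bad movie': 3}

def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) over the items both users rated."""
    common = [item for item in a if item in b]
    distance = sqrt(sum((a[item] - b[item]) ** 2 for item in common))
    return 1 / (1 + distance)

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = [item for item in a if item in b]
    n = len(common)
    mean_a = sum(a[item] for item in common) / float(n)
    mean_b = sum(b[item] for item in common) / float(n)
    covariance = sum((a[item] - mean_a) * (b[item] - mean_b) for item in common)
    spread_a = sqrt(sum((a[item] - mean_a) ** 2 for item in common))
    spread_b = sqrt(sum((b[item] - mean_b) ** 2 for item in common))
    return covariance / (spread_a * spread_b)

euclidean_score = euclidean_similarity(x_ratings, y_ratings)
pearson_score = pearson(x_ratings, y_ratings)
print("Euclidean similarity: {:.2f}".format(euclidean_score))
print("Pearson correlation:  {:.2f}".format(pearson_score))
```

With these made-up ratings the Euclidean score is low (0.25) even though X and Y rank the movies in the same order, while the Pearson correlation is close to 1 (about 0.98), because it looks at how the ratings move together rather than at their absolute values.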

Pearson Correlation:

Pearson Correlation Score:

A slightly more sophisticated way to determine the similarity between people's interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized, like our present data set.

The implementation of the Pearson correlation score first finds the items rated by both users. It then calculates the sums and the sums of the squares of the ratings for both users, and calculates the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient. Unlike the distance metric, this formula is not intuitive, but it does tell you how much the variables change together divided by the product of how much they vary individually.
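In symbols, with x_i and y_i standing for the two users' ratings of the n items they both rated (my own notation, chosen to match the sums described above), the coefficient can be written as:

```latex
r = \frac{\sum_{i} x_i y_i - \frac{\sum_{i} x_i \sum_{i} y_i}{n}}
         {\sqrt{\left(\sum_{i} x_i^2 - \frac{\left(\sum_{i} x_i\right)^2}{n}\right)
                \left(\sum_{i} y_i^2 - \frac{\left(\sum_{i} y_i\right)^2}{n}\right)}}
```

The numerator measures how much the two sets of ratings change together, and the denominator is the product of how much each set varies on its own.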




Let’s create this function in the same file.

# Implementation of collaborative filtering recommendation engine

from recommendation_data import dataset
from math import sqrt

def similarity_score(person1, person2):

    # Returns the Euclidean-distance-based similarity score of person1 and person2

    both_viewed = {}  # Items rated by both person1 and person2

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # If they have no rated items in common, the score is zero
    if len(both_viewed) == 0:
        return 0

    # Collect the squared rating differences for the items both have rated
    sum_of_euclidean_distance = []

    for item in dataset[person1]:
        if item in dataset[person2]:
            sum_of_euclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item], 2))
    sum_of_euclidean_distance = sum(sum_of_euclidean_distance)

    return 1 / (1 + sqrt(sum_of_euclidean_distance))

def pearson_correlation(person1, person2):

    # To get both rated items
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1

    number_of_ratings = len(both_rated)

    # Checking for number of ratings in common
    if number_of_ratings == 0:
        return 0

    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item], 2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item], 2) for item in both_rated])

    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum * person2_preferences_sum / number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum, 2) / number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum, 2) / number_of_ratings))
    # A zero denominator means one user's ratings do not vary at all
    if denominator_value == 0:
        return 0
    r = numerator_value / denominator_value
    return r

print(pearson_correlation('Lisa Rose', 'Gene Seymour'))

Script Output:

For Lisa Rose and Gene Seymour the script prints a correlation of roughly 0.396.

In general, the pearson_correlation function returns a value between -1 and 1. A value of 1 means the two users' ratings rise and fall together perfectly, that is, they have the same taste.

Ranking similar users for a user:

Now that we have functions for comparing two people, we can create a function that scores everyone against a given person and finds the closest matches. Let's add this function to our file to get an ordered list of people with similar tastes to the specified person.

def most_similar_users(person, number_of_users):
    # Returns the number_of_users most similar persons for a given person
    scores = [(pearson_correlation(person, other_person), other_person)
              for other_person in dataset if other_person != person]

    # Sort the similar persons so that the highest-scoring person appears first
    scores.sort()
    scores.reverse()
    return scores[0:number_of_users]

print(most_similar_users('Lisa Rose', 3))

Script Output:

[(0.9912407071619299, 'Toby'), (0.7470178808339965, 'Jack Matthews'), (0.5940885257860044, 'Mick LaSalle')]

What we have done so far is find the persons who are most similar to a given person. Now we have to recommend some movies to that person. We could simply look at the most similar person and pick a movie they liked that the given person has not seen yet, but that would be too permissive. Such an approach could accidentally turn up reviewers who haven't rated some of the movies that the person might like. It could also return a reviewer who strangely liked a movie that got bad reviews from all the other persons returned by the most_similar_users function.

To solve these issues, you need to score the items by producing a weighted score that ranks the reviewers. Take the votes of all the other persons and multiply how similar they are to the particular person by the score they gave each movie.
The image below shows how to do that.



This image shows the correlation scores for each person and the ratings they gave for three movies, The Night Listener, Lady in the Water, and Just My Luck, that Toby hasn't rated. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to Toby contributes more to the overall score than a person who is different from Toby. The Total row shows the sum of all these numbers.

We could just use the totals to calculate the rankings, but then a movie reviewed by more people would have a big advantage. To correct for this, you need to divide by the sum of all the similarities for the persons that reviewed that movie (the Sim.Sum row in the table). Because The Night Listener was reviewed by everyone, its total is divided by the sum of all the similarities. Lady in the Water, however, was not reviewed by Puig, so its total is divided by the sum of all the other similarities. The last row shows the results of this division.
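The arithmetic for a single movie can be sketched as follows. The reviewer names, similarity values and ratings below are invented for illustration and are not the numbers from the table.

```python
# Hypothetical similarities to the target user and ratings for one unseen movie.
similarity = {'Reviewer A': 0.9, 'Reviewer B': 0.5, 'Reviewer C': 0.2}
rating = {'Reviewer A': 4.0, 'Reviewer B': 2.0}   # Reviewer C has not rated it

# Only reviewers who rated the movie contribute to either sum.
total = sum(similarity[name] * rating[name] for name in rating)   # Total row
sim_sum = sum(similarity[name] for name in rating)                # Sim.Sum row

weighted_score = total / sim_sum
print("Total: {:.2f}, Sim.Sum: {:.2f}, score: {:.2f}".format(total, sim_sum, weighted_score))
```

Reviewer C does not appear in either sum, so a movie is neither rewarded nor punished for the people who skipped it. The score stays on the same scale as the original ratings.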

Let’s implement that now.

def user_reommendations(person):

    # Gets recommendations for a person by using a weighted average of every other user's rankings
    totals = {}
    simSums = {}
    for other in dataset:
        # don't compare me to myself
        if other == person:
            continue
        sim = pearson_correlation(person, other)

        # ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:

            # only score movies I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:

                # Similarity * score
                totals.setdefault(item, 0)
                totals[item] += dataset[other][item] * sim
                # sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    # Sort so that the highest-scoring item comes first
    rankings.sort()
    rankings.reverse()
    # returns the recommended items
    recommendataions_list = [recommend_item for score, recommend_item in rankings]
    return recommendataions_list

print("Recommendations for Toby")
print(user_reommendations('Toby'))

Script Output:

Recommendations for Toby

['The Night Listener', 'Lady in the Water', 'Just My Luck']

We have now built the recommendation engine. Just change the person's name to any other person in the data set and check the recommended items.

To get the complete code you can clone our github project :

To get all the code of the dataaspirant blog you can clone the below github link:

Reference Book:

Collective intelligence book



Ancient story of Datamining

In the 1960s, statisticians used terms like “Data Fishing” or “Data Dredging” to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term “Data Mining” appeared around 1990 in the database community.

Data mining in technical terms

Data mining is a process of extracting specific information from data and presenting relevant and usable information that can be used to solve problems. There are different kinds of services in the process like text mining, web mining, audio and video mining, pictorial data mining and social network data mining.

Why is data mining a hot topic for this generation?

Data mining is a young and promising field for the present generation because of its wide range of applications. Generally speaking, it has attracted a great deal of attention in the information industry and in society, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration. This is the reason why data mining is also called knowledge discovery from data.

Understanding data mining with an apple-buying example


Before explaining data mining with these fresh apples, let me share some interesting facts about apples.

Nutrition:  According to the United States Department of Agriculture, a typical apple serving weighs 242 grams and contains 126 calories with significant dietary fiber and modest vitamin C content, with otherwise a generally low content of essential nutrients.

Toxicity of apple seeds: The seeds of apples contain small amounts of amygdalin, a sugar and cyanide compound known as a cyanogenic glycoside. Ingesting small amounts of apple seeds will cause no ill effects, but in extremely large doses can cause adverse reactions. There is only one known case of fatal cyanide poisoning from apple seeds; in this case the individual chewed and swallowed one cup of seeds. It may take several hours before the poison takes effect, as cyanogenic glycosides must be hydrolyzed before the cyanide ion is released.

Now we will step into our example.

Suppose your family members want to visit someone who is suffering from pancreatic cancer. It has been reported that eating apples may help reduce the risk of pancreatic cancer by up to 23 percent. So your father asks you to bring apples home from a nearby shop. Your father also teaches you how to buy apples by giving you a set of rules.

Rules for buying apples

  • Big apples are less tasty than small apples.
  • Dark red apples are not fresh.
  • Light red apples are fresh.
  • Green apples are good for health.

On seeing this list of rules you can pick the apples you want to buy. Your family members want to give these apples to an unhealthy person, so you obviously pick green apples. So when you go to the shop you pick small apples that are green in color. That is the end of the story of selecting apples that are good for health.

Non data mining algorithm

Apple selection algorithm:
if ( selected_apple is small (in size) ) {
     if ( selected_apple is green (in color) ) {
            select apple
     } else {
            don't select apple
     }
} else {
     don't select apple
}

Comparing with data mining

  • The apples you randomly select from the shop are the training data.
  • The table of all the physical characteristics of each apple, like color and size, gives the features.
  • Whether an apple is tasty or good for health is the output variable.
  • The apples you buy when you go to another shop are the test data.

You can now buy apples with great confidence, without worrying about the details of how to choose the best apples. What's more, you can improve your algorithm over time (reinforcement learning), so that its accuracy improves as it reads more training data and it modifies itself when it makes a wrong prediction. But the best part is that you can use the same algorithm to train different models, one each for predicting the quality of apples, oranges, bananas, grapes, cherries and watermelons, and keep all your loved ones happy.

This type of learning is called supervised learning in data mining. In the next post I will give you a clear picture of the difference between supervised learning and unsupervised learning with real-life examples.
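The training data, features, output variable and test data from the apple story can be sketched in a few lines of Python. This is only an illustration under assumed data: the apples, their labels and the helper names train and predict are made up, and the "model" simply remembers the most frequent label seen for each combination of features.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (size, color) features with a known label.
training_data = [
    (('small', 'green'), 'healthy'),
    (('small', 'green'), 'healthy'),
    (('big', 'dark red'), 'not fresh'),
    (('small', 'light red'), 'fresh'),
    (('big', 'light red'), 'fresh'),
]

def train(examples):
    """Count how often each label was seen for each feature combination."""
    model = defaultdict(Counter)
    for features, label in examples:
        model[features][label] += 1
    return model

def predict(model, features, default='unknown'):
    """Return the most frequent label seen for these features during training."""
    if features not in model:
        return default
    return model[features].most_common(1)[0][0]

model = train(training_data)

# Test data: apples from another shop that were not in the training set.
print(predict(model, ('small', 'green')))
print(predict(model, ('big', 'green')))
```

The first test apple matches something seen during training, so it gets the label healthy. The second combination was never seen, so this simple model can only answer unknown, which is where real learning algorithms that generalize from the features come in.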

