Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter


Blog Posts:

1. Great resources for learning data mining concepts and techniques:

With today’s tools, anyone can collect data from almost anywhere, but not everyone can pull the important nuggets out of that data. Whacking your data into Tableau is an OK start, but it’s not going to give you the business critical insights you’re looking for. To truly make your data come alive you need to mine it. Dig deep. Play around. And tease out the diamond in the rough.

Read Complete Post on:

2.  Interactive Data Science with R in Apache Zeppelin Notebook:

The objective of this blog post is to help you get started with the Apache Zeppelin notebook for your R data science requirements. Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, Shell and more.

Read Complete Post on: sparkiq-labs

3. How to install Apache Hadoop 2.6.0 in Ubuntu:

Let’s get started towards setting up a fresh Multinode Hadoop (2.6.0) cluster.

Read Complete Post on: pingax

4. Running scalable Data Science on Cloud with R & Python:

So, why do we even need to run data science on the cloud? You might ask: if a laptop can pack 64 GB of RAM, do we even need the cloud for data science? The answer is a big YES, for a variety of reasons. Here are a few of them.

Read Complete Post on: analyticsvidhya

5. How to Choose Between Learning Python or R First:

If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.

When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.

Read Complete Post on: Udacity Blog

LinkedIn Posts:

  1. 5 Best Machine Learning APIs for Data Science
  2. Big Data Top Trends In 2016
  3. Big Data: 4 Things You Can Do With It, And 3 Things You Can’t


1. Machine Learning: Going Deeper with Python and Theano

2. Current State of Recommendation Systems

3. Pandas From The Ground Up


TensorFlow Google Machine Learning Library:

About TensorFlow:

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Get Started TensorFlow
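
To make the "data flow graph" idea above a bit more concrete, here is a minimal sketch in plain Python. It is not the TensorFlow API: the `Node` class, the `run` function and the node names are all made up for illustration, and the values flowing along the edges are plain floats rather than multidimensional arrays. It only shows the general shape of the idea: nodes hold operations, edges carry values between them.

```python
# Toy data flow graph in plain Python (illustrative only; NOT the TensorFlow API).
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """A graph node: one operation plus the names of the nodes feeding it."""
    name: str
    op: Callable[..., float]
    inputs: List[str] = field(default_factory=list)


def run(graph: Dict[str, Node], target: str, feeds: Dict[str, float]) -> float:
    """Evaluate `target` by first evaluating every node it depends on."""
    cache: Dict[str, float] = dict(feeds)  # fed values act as the graph inputs

    def evaluate(name: str) -> float:
        if name not in cache:
            node = graph[name]
            # Each input name is an edge: a value flowing into this node.
            args = [evaluate(dep) for dep in node.inputs]
            cache[name] = node.op(*args)
        return cache[name]

    return evaluate(target)


# Build the graph for (x + y) * y.
graph = {
    "sum": Node("sum", lambda a, b: a + b, ["x", "y"]),
    "product": Node("product", lambda a, b: a * b, ["sum", "y"]),
}

result = run(graph, "product", {"x": 2.0, "y": 3.0})
print(result)  # (2 + 3) * 3 = 15.0
```

Because the graph is described separately from its evaluation, the same description could in principle be handed to different back ends, which is the flexibility the paragraph above refers to.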


That’s all for the November 2015 newsletter. Please leave your suggestions about the newsletter in the comment section. To get all DataAspirant newsletters, you can visit the monthly newsletter page. Please subscribe to our blog so that you get our newsletter in your inbox every month.


Four Coursera data science Specializations start this month


Starting is the biggest step toward achieving your dreams. This is 200% true for people who want to learn data science. The very first question that comes to mind for data science beginners is where to start. If you are trying to find the answer to this question, you’re on the right track. You can find your answer in Coursera Specializations.

What Coursera Specializations offer:

Coursera data science Specializations and courses teach the fundamentals of interpreting data, performing analyses, and understanding and communicating actionable insights. Topics of study for beginning and advanced learners include qualitative and quantitative data analysis, tools and methods for data manipulation, and machine learning algorithms.


Big Data Specialization


About This Specialization:

In this Specialization, you will develop a robust set of skills that will allow you to process, analyze, and extract meaningful information from large amounts of complex data. You will install and configure Hadoop with MapReduce, use Spark, Pig and Hive, perform predictive modelling with open source tools, and leverage graph analytics to model problems and perform scalable analytical tasks. In the final Capstone Project, developed in partnership with data software company Splunk, you’ll apply the skills you learned by building your own tools and models to analyze big data in the context of retail, sports, current events, or another area of your choice.

COURSE 1:  Introduction to Big Data

Course Started on: Oct 26   Ends on: Nov 23

About the Course:

What’s the “hype” surrounding the Big Data phenomenon? Who are these mysterious data scientists everyone is talking about? What kinds of problem-solving skills and knowledge should they have? What kinds of problems can be solved by Big Data technology? After this short introductory course you will have answers to all these questions. Additionally, you will start to become proficient with the key technical terms and big data tools and applications to prepare you for a deep dive into the rest of the courses in the Big Data specialization. Each day, our society creates 2.5 quintillion bytes of data (that’s 2.5 followed by 18 zeros). With this flood of data the need to unlock actionable value becomes more acute, rapidly increasing demand for Big Data skills and qualified data scientists.
Hands-On Assignment Hardware and Software Requirements:
-Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+ or CentOS 6+
-VirtualBox 5+, VMware Workstation 9+ or VMware Fusion 7+
-Quad Core Processor (VT-x or AMD-V support recommended)
-8 GB RAM
-20 GB disk free

COURSE 2:  Hadoop Platform and Application Framework

Course Started on: Oct 20   Ends on: Nov 30

About the Course:

Are you looking for hands-on experience processing big data? After completing this course, you will be able to install, configure and implement an Apache Hadoop stack ranging from basic “Big Data” components to MapReduce and Spark execution frameworks. Moreover, in the exercises for this course you will solve fundamental problems that would require more computing power than a single computer. You will apply the most important Hadoop concepts in your solutions and use distributed/parallel processing in the Hadoop application framework. Get ready to be empowered to manipulate and analyze the significance of big data!

Course Link:  Hadoop Platform and Application Framework

COURSE 3: Introduction to Big Data Analytics

Course Starts: November 2015

About the Course:

Do you have specific business questions you want answered? Need to learn how to interpret results through analytics? This course will help you answer these questions by introducing you to HBase, Pig and Hive. In this course, you will take a real Twitter data set, clean it, bring it into an analytics engine, and create summary charts and drill-down dashboards. After completing this course, you will be able to utilize BigTable, distributed data store, columnar data, noSQL, and more!

Course Link: Introduction to Big Data Analytics


COURSE 4: Machine Learning With Big Data

Course Starts: December 2015

About the Course:

Want to learn the basics of large-scale data processing? Need to make predictive models but don’t know the right tools? This course will introduce you to open source tools you can use for parallel, distributed and scalable machine learning. After completing this course’s hands-on projects with MapReduce, KNIME and Spark, you will be able to train, evaluate, and validate basic predictive models. By the end of this course, you will be building a Big Data platform and utilizing several different tools and techniques.

Course Link:  Machine Learning With Big Data


COURSE 5:  Introduction to Graph Analytics

Course Starts: January 2016

About the Course:

Want to understand your data network structure and how it changes under different conditions? Curious to know how to identify closely interacting clusters within a graph? Have you heard of the fast-growing area of graph analytics and want to learn more? This course gives you a broad overview of the field of graph analytics so you can learn new ways to model, store, retrieve and analyze graph-structured data. After completing this course, you will be able to model a problem into a graph database and perform analytical tasks over the graph in a scalable manner. Better yet, you will be able to apply these techniques to understand the significance of your data sets for your own projects.

Course Link:  Introduction to Graph Analytics


Machine Learning Specialization


About This Specialization:

This Specialization provides a case-based introduction to the exciting, high-demand field of machine learning. You’ll learn to analyze large and complex datasets, build applications that can make predictions from data, and create systems that adapt and improve over time. In the final Capstone Project, you’ll apply your skills to solve an original, real-world problem through implementation of machine learning algorithms.

COURSE 1: Machine Learning Foundations: A Case Study Approach

Course Started on: Oct 26 Ends on: Dec 14

About the Course:
Do you have data and wonder what it can tell you? Do you need a deeper understanding of the core ways in which machine learning can improve your business? Do you want to be able to converse with specialists about anything from regression and classification to deep learning and recommender systems? In this course, you will get hands-on experience with machine learning from a series of practical case-studies. At the end of the first course you will have studied how to predict house prices based on house-level features, analyze sentiment from user reviews, retrieve documents of interest, recommend products, and search for images. Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains. This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms. Together, these pieces form the machine learning pipeline, which you will use in developing intelligent applications.
Learning Outcomes: By the end of this course, you will be able to:
-Identify potential applications of machine learning in practice.
-Describe the core differences in analyses enabled by regression, classification, and clustering.
-Select the appropriate machine learning task for a potential application.
-Apply regression, classification, clustering, retrieval, recommender systems, and deep learning.
-Represent your data as features to serve as input to machine learning models.
-Assess the model quality in terms of relevant error metrics for each task.
-Utilize a dataset to fit a model to analyze new data.
-Build an end-to-end application that uses machine learning at its core.
-Implement these techniques in Python.

COURSE 2: Regression

Starts November 2015
About the Course:
Case Study: Predicting Housing Prices
In our first case study, predicting house prices, you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms, …). This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression. In this course, you will explore regularized linear regression models for the task of prediction and feature selection. You will be able to handle very large sets of features and select between models of various complexity. You will also analyze the impact of aspects of your data, such as outliers, on your selected models and predictions. To fit these models, you will implement optimization algorithms that scale to large datasets.
Learning Outcomes: By the end of this course, you will be able to:
-Describe the input and output of a regression model.
-Compare and contrast bias and variance when modeling data.
-Estimate model parameters using optimization algorithms.
-Tune parameters with cross validation.
-Analyze the performance of the model.
-Describe the notion of sparsity and how LASSO leads to sparse solutions.
-Deploy methods to select between models.
-Exploit the model to form predictions.
-Build a regression model to predict prices using a housing dataset.
-Implement these techniques in Python.
Course Link: Regression

COURSE 3: Classification

Starts December 2015
About the Course:
Case Study: Analyzing Sentiment
In our second case study, analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information, …). This task is an example of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification. In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks. You will become familiar with some of the most successful techniques, including logistic regression, boosted decision trees and kernelized support vector machines. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale. You will implement these techniques on real-world, large-scale machine learning tasks.
Learning Objectives: By the end of this course, you will be able to:
-Describe the input and output of a classification model.
-Tackle both binary and multiclass classification problems.
-Implement a logistic regression model for large-scale classification.
-Create a non-linear model using decision trees.
-Improve the performance of any model using boosting.
-Construct non-linear features using kernels.
-Describe the underlying decision boundaries.
-Build a classification model to predict sentiment in a product review dataset.
-Implement these techniques in Python.
Course Link: Classification

COURSE 4: Clustering & Retrieval

Starts February 2016
About the Course:
Case Study: Finding Similar Documents
A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover? In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.
Learning Outcomes: By the end of this course, you will be able to:
-Create a document retrieval system using k-nearest neighbors.
-Describe how k-nearest neighbors can also be used for regression and classification.
-Identify various similarity metrics for text data.
-Cluster documents by topic using k-means.
-Perform mixed membership modeling using latent Dirichlet allocation (LDA).
-Describe how to parallelize k-means using MapReduce.
-Examine mixtures of Gaussians for density estimation.
-Fit a mixture of Gaussian model using expectation maximization (EM).
-Compare and contrast initialization techniques for non-convex optimization objectives.
-Implement these techniques in Python.

COURSE 5:  Recommender Systems & Dimensionality Reduction

Starts March 2016
About the Course:
Case Study: Recommending Products
How does Amazon recommend products you might be interested in purchasing? How does Netflix decide which movies or TV shows you might want to watch? What if you are a new user, should Netflix just recommend the most popular movies? Who might you form a new link with on Facebook or LinkedIn? These questions are endemic to most service-based industries, and underlie the notion of collaborative filtering and the recommender systems deployed to solve these problems. In this fourth case study, you will explore these ideas in the context of recommending products based on customer reviews. In this course, you will explore dimensionality reduction techniques for modeling high-dimensional data. In the case of recommender systems, your data is represented as user-product relationships, with potentially millions of users and hundreds of thousands of products. You will implement matrix factorization and latent factor models for the task of predicting new user-product relationships. You will also use side information about products and users to improve predictions.
Learning Outcomes: By the end of this course, you will be able to:
-Create a collaborative filtering system.
-Reduce dimensionality of data using SVD, PCA, and random projections.
-Perform matrix factorization using coordinate descent.
-Deploy latent factor models as a recommender system.
-Handle the cold start problem using side information.
-Examine a product recommendation application.
-Implement these techniques in Python.

Data Science at Scale Specialization


About This Specialization:
This Specialization covers intermediate topics in data science. You will gain hands-on experience with scalable SQL and NoSQL data management solutions, data mining algorithms, and practical statistical and machine learning concepts. You will also learn to visualize data and communicate results, and you’ll explore legal and ethical issues that arise in working with big data. In the final Capstone Project, developed in partnership with the digital internship platform Coursolve, you’ll apply your new skills to a real-world data science project.

COURSE 1:  Data Manipulation at Scale: Systems and Algorithms

Upcoming session: Oct 26 — Nov 30
About the Course:
Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making — we are drowning in it. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales. In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are coming on the horizon. Cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays will be covered. You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project. At the end of this course, you will be able to:
Learning Goals:
1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, mapreduce, and other data flow models.
3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics.
4. Evaluate key-value stores and NoSQL systems, describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.
5. “Think” in MapReduce to effectively write algorithms for systems including Hadoop and Spark. You will understand their limitations, design details, their relationship to databases, and their associated ecosystem of algorithms, extensions, and languages. You will also write programs in Spark.
6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams.

COURSE 2:  Practical Predictive Analytics: Models and Methods

Upcoming session: Oct 26 — Nov 30
About the Course:
Statistical experiment design and analytics are at the heart of data science. In this course you will design statistical experiments and analyze the results using modern methods. You will also explore the common pitfalls in interpreting statistical arguments, especially those associated with big data. Collectively, this course will help you internalize a core set of practical and effective machine learning methods and concepts, and apply them to solve some real world problems.
Learning Goals: After completing this course, you will be able to:
1. Design effective experiments and analyze the results
2. Use resampling methods to make clear and bulletproof statistical arguments without invoking esoteric notation
3. Explain and apply a core set of classification methods of increasing complexity (rules, trees, random forests), and associated optimization methods (gradient descent and variants)
4. Explain and apply a set of unsupervised learning concepts and methods
5. Describe the common idioms of large-scale graph analytics, including structural query, traversals and recursive queries, PageRank, and community detection.
About the Course:
Producing numbers is not enough; effective data scientists know how to interpret the numbers and communicate findings accurately to stakeholders to inform business decisions. Visualization is a relatively recent field of research in computer science that links perception, cognition, and algorithms to exploit the enormous bandwidth of the human visual cortex. In this course you will design effective visualizations and develop skills in recognizing and avoiding poor visualizations. Just because you can get the answer using big data doesn’t mean you should. In this course you will have the opportunity to explore the ethical considerations around big data and how these considerations are beginning to influence policy and practice.
Learning Goals: After completing this course, you will be able to:
1. Design and critique visualizations
2. Explain the state-of-the-art in privacy, ethics, governance around big data and data science
3. Explain the role of open data and reproducibility in data science.

Data Analysis and Interpretation Specialization


About This Specialization:
The Data Analysis and Interpretation Specialization takes you from data novice to data analyst in just four project-based courses. You’ll learn to apply basic data science tools and techniques, including data visualization, regression modeling, and machine learning. Throughout the Specialization, you will analyze research questions of your choice and summarize your insights. In the final Capstone Project, you will use real data to address an important issue in society, and report your findings in a professional-quality report. These instructors are here to create a warm and welcoming place at the table for everyone. Everyone can do this, and we are building a community to show the way.

COURSE 1:  Data Management and Visualization

Upcoming session: Oct 26 — Nov 30
About the Course:
Have you wanted to describe your data in more meaningful ways? Interested in making visualizations from your own data sets? After completing this course, you will be able to manage, describe, summarize and visualize data. You will choose a research question based on available data and engage in the early decisions involved in quantitative research. Based on a question of your choosing, you will describe variables and their relationships through frequency tables, calculate statistics of center and spread, and create graphical representations.
By the end of this course, you will be able to:
-Use a data codebook to decipher a data set.
-Identify questions or problems that can be tackled by a particular data set.
-Determine the data management steps that are needed to prepare data for analysis.
-Write code to execute a variety of data management and data visualization techniques.

Course Link:  Data Management and Visualization

COURSE 2: Data Analysis Tools

Current session: Oct 22 — Nov 30
About the Course:
Do you want to answer questions with data? Interested in discovering simple methods for answering these questions? Hypothesis testing is the tool for you!
After completing this course, you will be able to:
-Identify the right statistical test for the questions you are asking.
-Apply and carry out hypothesis tests.
-Generalize the results from samples to larger populations.
-Use Analysis of Variance, Chi-Square, Test of Independence and Pearson correlation.
-Present your findings using statistical language.
Course Link: Data Analysis Tools

COURSE 3:  Regression Modeling in Practice

Starts November 2015
About the Course:
What kinds of statistical tools can you use to test your research question in more depth? In this course, you will go beyond basic data analysis tools to develop multiple linear regression and logistic regression models to address your research question more thoroughly. You will examine multiple predictors of your outcome and identify confounding variables. In this course you will be introduced to additional Python libraries for regression modeling. You will learn the assumptions underlying regression analysis, how to interpret regression coefficients, and how to use regression diagnostic plots and other tools to evaluate residual variability. Finally, through blogging, you will present the story of your regression model using statistical language.

COURSE 4: Machine Learning for Data Analysis

Starts January 2016
About the Course:
Are you interested in predicting future outcomes using your data? This course helps you do just that! Machine learning is the process of developing, testing, and applying predictive algorithms to achieve this goal. Make sure to familiarize yourself with course 3 of this specialization before diving into these machine learning concepts. Building on Course 3, which introduces students to integral supervised machine learning concepts, this course will provide an overview of many additional concepts, techniques, and algorithms in machine learning, from basic classification to decision trees and clustering. By completing this course, you will learn how to apply, test, and interpret machine learning algorithms as alternative methods for addressing your research questions.

I hope you liked today’s post. If you have any questions, feel free to comment below. If you want me to write about a specific topic, tell me in the comments below.

If you want to share your experience or opinions, please do.


DataAspirant August 2015 newsletter

DataAspirant newsletter for August

For the August newsletter, we rounded up the best blogs for anyone interested in learning more about data science. Whether you are experienced in data science or have only just heard of the field, these blogs provide enough detail and context for you to understand what you’re reading. We also collected some videos. We hope you enjoy the August DataAspirant newsletter.

Blog Posts:

1. Python Machine Learning war:

Oh god, another one of those subjective, pointedly opinionated click-bait headlines? Yes! Why did I bother writing this? Well, here is one of the most trivial yet life-changing insights and worldly wisdoms from my former professor that has become my mantra ever since: “If you have to do this task more than 3 times just write a script and automate it.”

Read Complete post on: sebastianraschka Blog

2. A Neural Network in 11 lines of Python:

I learn best with toy code that I can play with. This tutorial teaches backpropagation via a very simple toy example, a short python implementation.

Read Complete post on:  iamtrask Blog

3. Neural Network Part 2:

The takeaway here is that backpropagation doesn’t optimize! It moves the error information from the end of the network to all the weights inside the network so that a different algorithm can optimize those weights to fit our data. We actually have a plethora of different nonlinear optimization methods that we could use with backpropagation.

Read Complete post on:  iamtrask Blog
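
The split described in the Neural Network Part 2 excerpt, where backpropagation only delivers error information and a separate algorithm does the optimizing, can be sketched in a few lines. This is a minimal illustration and not the code from the linked post: the tiny two-weight network, the function names and the learning rate are all assumptions picked for readability, and plain gradient descent stands in for "a different algorithm".

```python
import math


def forward(w1, w2, x):
    """Tiny network: hidden unit h = tanh(w1 * x), output y_hat = w2 * h."""
    h = math.tanh(w1 * x)
    return h, w2 * h


def loss(w1, w2, x, y):
    """Squared error: 0.5 * (y_hat - y)**2."""
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2


def backprop(w1, w2, x, y):
    """Carry the output error back to each weight. No weight is changed here."""
    h, y_hat = forward(w1, w2, x)
    error = y_hat - y                      # dLoss/dy_hat
    grad_w2 = error * h                    # dLoss/dw2
    grad_h = error * w2                    # error passed back to the hidden unit
    grad_w1 = grad_h * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)**2
    return grad_w1, grad_w2


def gradient_descent_step(w1, w2, grads, lr=0.1):
    """The optimizer: a separate step that consumes the gradients."""
    g1, g2 = grads
    return w1 - lr * g1, w2 - lr * g2


# Assumed toy data: one input, one target.
w1, w2 = 0.5, 0.5
x, y = 1.0, 0.8

initial_loss = loss(w1, w2, x, y)
for _ in range(200):
    grads = backprop(w1, w2, x, y)                 # step 1: get error information
    w1, w2 = gradient_descent_step(w1, w2, grads)  # step 2: optimize
final_loss = loss(w1, w2, x, y)

print(initial_loss, final_loss)
```

Swapping `gradient_descent_step` for another update rule would leave `backprop` untouched, which is the point of keeping the two apart.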

4. Interactive Data Visualization using Bokeh:

Bokeh is a Python library for interactive visualization that targets web browsers for representation. This is the core difference between Bokeh and other visualization libraries. Look at the snapshot below, which explains the process flow of how Bokeh helps to present data to a web browser.

Read Complete post on : analyticsvidhya Blog

5. Artificial Neural Networks Linear Regression:

Artificial neural networks (ANNs) were originally devised in the mid-20th century as a computational model of the human brain. Their use waned because of the limited computational power available at the time, and some theoretical issues that weren’t solved for several decades (which I will detail at the end of this post). However, they have experienced a resurgence with the recent interest and hype surrounding Deep Learning. One of the more famous examples of Deep Learning is the “YouTube Cat” paper by Andrew Ng et al.

Read Complete post on: briandolhansky Blog

6. Artificial Neural Networks Linear Classification: 

So far we’ve covered using neural networks to perform linear regression. What if we want to perform classification using a single-layer network?  In this post, I will cover two methods: the perceptron algorithm and using a sigmoid activation function to generate a likelihood.

Read Complete post on: briandolhansky Blog
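
For readers who want to see the two ideas named in the linear classification excerpt side by side, here is a rough sketch, assuming the classic perceptron update rule and a toy logical-AND data set. It is not the code from the linked post. In particular, the sigmoid here is only applied to the activations of the already trained perceptron to show how a raw score can be squashed into a value between 0 and 1; the post may well train with the sigmoid directly.

```python
import math


def activation(weights, bias, x):
    """Weighted sum of the inputs plus a bias term."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def predict(weights, bias, x):
    """Perceptron output: class 1 if the activation is non-negative, else 0."""
    return 1 if activation(weights, bias, x) >= 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: nudge the weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Toy, linearly separable data (an assumption for this sketch): logical AND.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
scores = [sigmoid(activation(weights, bias, x)) for x in samples]

print(predictions)
print(scores)
```

The hard threshold gives only a class label, while the sigmoid score also says how far a point sits from the decision boundary.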

7. List of Machine Learning Certifications and Best Data Science Bootcamps:

Everyone has a different style of learning. Hence, there are multiple ways to become a data scientist. You can learn from tutorials, blogs, books, hackathons, videos and what not! I personally like self-paced learning aided by help from a community – it works best for me. What works best for you?

If your answer to the above question was classroom or instructor-led certifications, you should check out machine learning certifications and data science bootcamps. They offer a great way to learn and prepare you for the role and expectations of a data scientist.

Read Complete post on: analyticsvidhya Blog

8. Prediction intervals for Random Forests:

An aspect that is important but often overlooked in applied machine learning is intervals for predictions, be it confidence or prediction intervals. For classification tasks, beginning practitioners quite often conflate probability with confidence: probability of 0.5 is taken to mean that we are uncertain about the prediction, while a prediction of 1.0 means we are absolutely certain in the outcome. But there are two concepts being mixed up here. A prediction of 0.5 could mean that we have learned very little about a given instance, due to observing no or only a few data points about it. Or it could be that we have a lot of data, and the response is fundamentally uncertain, like flipping a coin.

Read Complete post on: datadive Blog 
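
The difference between "0.5 because we know very little" and "0.5 because the outcome really is a coin flip" can be shown with a small sketch. This is not the random forest method from the linked post. It swaps in a simpler technique, the Wilson score interval for a binomial proportion, and the counts used below are made up for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
        / denom
    )
    return center - half_width, center + half_width


# Both cases have an estimated probability of 0.5 (hypothetical counts).
few_lo, few_hi = wilson_interval(2, 4)            # only 4 observations
many_lo, many_hi = wilson_interval(5000, 10000)   # 10,000 observations

few_width = few_hi - few_lo
many_width = many_hi - many_lo
```

With 4 observations the interval spans roughly 0.15 to 0.85, so the 0.5 tells us very little. With 10,000 observations it narrows to roughly 0.49 to 0.51, so the 0.5 is a well-supported estimate of a coin-flip-like outcome.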

9. Computational Statistics in Python:

We will be using Python a fair amount in this class. Python is a high-level scripting language that offers an interactive programming environment. We assume programming experience, so this lecture will focus on the unique properties of Python.

Programming languages generally have the following common ingredients: variables, operators, iterators, conditional statements, functions (built-in and user defined) and higher-order data structures. We will look at these in Python and highlight qualities unique to this language.

Read Complete post on: Here
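
As a quick illustration of the ingredients listed above, here is a small sketch that touches each of them once. It is not taken from the linked lecture notes, and the names and numbers in it are made up.

```python
# Variables and operators
radius = 2.0
area = 3.14159 * radius ** 2

# A higher-order data structure: a dictionary that maps names to lists
scores = {"alice": [82, 91], "bob": [67, 74]}


# A user-defined function built on the built-in functions sum() and len()
def mean(values):
    return sum(values) / len(values)


# Iteration over the dictionary combined with a conditional statement
grades = {}
for name, marks in scores.items():
    if mean(marks) >= 80:
        grades[name] = "merit"
    else:
        grades[name] = "pass"

# The iterator protocol used directly: iter() creates the iterator,
# next() pulls one value at a time
numbers = iter([10, 20, 30])
first = next(numbers)
second = next(numbers)

# A list comprehension: iteration, a condition and an expression in one line
even_squares = [n * n for n in range(5) if n % 2 == 0]
```

Comprehensions and the explicit iterator protocol are among the features that tend to feel most Python-specific to people coming from other languages.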

10. Beyond the k-Means – Prepping the Data:

This is the second post in a three-part series that takes a deep dive into k-Means clustering. While k-Means is a simple and popular clustering solution, the analyst must not be deceived by its simplicity and lose sight of the nuances of implementation. In the previous blog post, we discussed various approaches to selecting the number of clusters for k-Means clustering. This post will discuss aspects of data pre-processing before running the k-Means algorithm.

Read Complete post on: edupristine Blog
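
The excerpt above does not say which pre-processing steps the post covers, so the following is only a sketch of one common step, assumed here for illustration: standardizing each feature to zero mean and unit variance. k-Means relies on Euclidean distance, so a feature measured in large units can dominate the clustering. The customer data and function names are hypothetical.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit (population) variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return [
        # A constant column has zero spread; map it to 0.0 to avoid
        # dividing by zero.
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance, the measure k-Means uses to assign points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical customers: (annual income in dollars, age in years)
customers = [(30000, 25), (32000, 60), (90000, 27), (88000, 58)]
scaled = standardize(customers)

# Customers 0 and 1 differ mainly in age; customers 0 and 2 mainly in income.
raw_age_gap = euclidean(customers[0], customers[1])
raw_income_gap = euclidean(customers[0], customers[2])
scaled_age_gap = euclidean(scaled[0], scaled[1])
scaled_income_gap = euclidean(scaled[0], scaled[2])
```

On the raw data the income difference dwarfs the age difference simply because dollars are bigger numbers than years. After standardizing, the two gaps are of similar size, so both features can influence the clusters.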

LinkedIn Posts:

1. 100 open source Big Data architecture papers for data professionals:

Big Data technology has been extremely disruptive with open source playing a dominant role in shaping its evolution. While on one hand it has been disruptive, on the other it has led to a complex ecosystem where new frameworks, libraries and tools are being released pretty much every day, creating confusion as technologists struggle and grapple with the deluge.

If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend some serious time deeply understanding the architecture of key systems to appreciate their evolution. Understanding the architectural components and subtleties would also help you choose and apply the appropriate technology for your use case. In my journey over the last few years, some literature has helped me become a better educated data professional. My goal here is not only to share the literature but also to use the opportunity to put some sanity into the labyrinth of open source systems.

Read Complete post on : Linkedin Posts

2. The importance of a Journal Club in Data Science:

A while back, I was asked: “What is your favorite office activity?” Without a doubt, it is Data Science Tuesday, where the members of the team (and anyone from the company) get together to discuss research papers on a variety of topics from Data Science, Computer Science, Software Engineering, Social Networks, Psychology, Sociology, Neuroscience and even personality assessment.

I believe most of the greatest ideas that nurture the R&D projects, the product and the vision of the Data Science team, came during these collaborative times. We could enjoy a good lunch and brainstorm the heck out of cutting-edge research papers.

In my own experience as a Data Scientist, I have grown technically and professionally by discussing ideas from several different topics. In addition, I firmly believe it can boost the performance of any Data Science team.

Read Complete post on : Linkedin Posts

3. A beautiful dawn in the universe of Big Data & Analytics:

It’s a beautiful dawn out there, in the universe of Big Data & Analytics. With help from a nice weekend morning, good coffee, and the motivation from fellow bloggers on this forum, I am sharing what I see and experience in this space.

Read Complete post on : Linkedin Posts

Videos:

1. Hadoop with Python

2. Bayesian Statistical Analysis using Python

3. Introduction to NumPy | SciPy 2015 Tutorial

4. Statistical inference with computational methods

5. Machine Learning with Scikit-Learn

That’s all for the August 2015 newsletter. Please leave your suggestions on the newsletter in the comment box so that we can improve next month’s newsletter. To get all dataaspirant newsletters, you can visit the monthly newsletter page. Subscribe to our blog so that you get our newsletter in your inbox every month.


Interview with Data science expert Kai Xin Thia, Data scientist at Lazada, Co-Founder DataScience SG

Interview with Kai Xin Thia


We are excited to interview Kai Xin Thia as the first data scientist for our dataaspirant blog readers. He has shared some interesting things about data science, so let us see what he has to say.

Hi Kai Xin Thia, we are delighted to interview you, and thanks a lot for your time. Before going to the interview, let me introduce Kai Xin Thia.


Kai Xin is a data scientist at Lazada. He specializes in behavioral analytics and has an interest in large recommendation systems. He has been building behavioral models for 3 years and is in the top 1% on Kaggle, which is an international data science competition portal. He is also the Co-Founder of DataScience SG (the largest data science community in Singapore) and a volunteer at DataKind SG (an NGO that helps other NGOs through data science).

Hi Thiakx! Let me start by asking about your background. Can you describe your background for our data science enthusiasts?

Hi, I began my data journey at Singapore Management University, where I graduated with a degree in information systems, business intelligence and analytics. I then spent time working at SAS and EMC, building my foundation. I moved on to focus on healthcare analytics at Khoo Teck Puat Hospital and I am currently at Lazada, working on retail and behavioral analytics.

That’s great; now we know about your healthcare analytics work. What is your definition of data science?

Sure. Generally, data science is the use of hacking skills, math & stats and domain expertise to generate useful insights for business, and you can see some reference like.

I am personally excited to know more about you. How did you get started with data science, and what inspired you most towards data science?

When I first got started, it was all about business intelligence & business analytics. It was pretty much about generating reports to understand the current performance of businesses. Things started to get interesting when I started on Kaggle, building predictive models based on historical data.

Can you share your experience with data science? (Especially regarding your projects and your startup “Foxhole”.)

I would say there is a growing interest among companies in Singapore (and probably Asia in general) regarding the use of data science in their operations, but we are still behind our US counterparts.

You have been building behavioral models for 3 years. Can you give us an introduction to behavioral models and some insights about them?

Behavioral modeling, as its name suggests, is about understanding why people behave or respond in a certain way and how we can encourage them to adjust their behavior using data models. Beyond data models, there is a lot to learn in this field, for example, how predictably irrational most people are:

You have participated in data science competitions. You were the 2nd place winner in the “Unilever Prediction Challenge on consumer preference” and “Singapore’s Data in the City Visualization Challenge on education”. Can you share your experience of those?

Data in the City was interesting, as we took the chance to research and understand Singapore’s education journey and how we grew from a third-world, impoverished country into a developed city with an education system that attracts students from all around the region. In the Unilever challenge, we had the opportunity to present to management and learn from them what truly matters: sometimes it is not just about building the most complex models but rather about balancing model accuracy with the ease of deploying the models into production.

You studied information systems at Singapore Management University. How has information systems helped you in your career? What would be your recommendation for data science enthusiasts regarding this?

University is the best time to pick up technical skills. If you are interested in trying out or entering the data science industry, don’t be afraid to sign up for some difficult mathematics / statistics / machine learning modules. Use this opportunity to make mistakes and learn from them.

What is your opinion about online courses for Data science? Which are your recommended online courses for Data science enthusiasts?

Coursera / edX / Stanford Online are fantastic platforms for learning. Here is what I recommend:

*Johns Hopkins’ data science specialization is not worth the money but is alright for a quick introduction to data science.

Can you share your favorite list of data science books for us?


What do you think are the prerequisites for a data science fresher who is starting from zero?

Not giving up. Most smart, rational peeps give up after 3-6 months because it is too hard / too boring / not earning them money. It takes years to train a doctor; it takes at least as long to train a data scientist.

What are the best programming languages for data science, and which one is your favorite?

Learn R for quick prototyping, Python to deal with larger datasets and Apache Spark for enterprise-level work.

What are the primary questions that are asked in data scientist interviews?

See the list from Quora, quite accurate.

I would like to add one question that I was asked before: “Describe to me the greatest data project that you have worked on so far.”

What is the present scope of data science, and how do you see it changing in the future?

Current popular data science tools (R/Python) are limited to single machines, while enterprise software tools (SAS / Teradata) are expensive and unwieldy. Next-generation tools like Spark will bridge the gap, bringing enterprise-level scalability to popular data science tools (R/Python).

Final question. Can you share your opinion on our Blog?

It will be really interesting if you can interview more data scientists 🙂

Sure. We have more interviews coming up 🙂

Thank you so much for this enlightening interview. It will definitely add value for our readers. Once again, thank you.

I hope you liked today’s interview. If you have any questions, feel free to comment. If you want to ask a data scientist a question, let us know in the comments so that we can ask it in upcoming data scientist interviews. You can find the link for comments just below the title.

