Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter


Blog Posts:

1. Cricket survial Analysis:

Survival Analysis is used in areas where the time duration of a sample of observations is analysed until an event of death occurs. Survival analysis is applied to mechanical engineering to predict systems failures and in medical sciences to predict patient outcomes.

In this post I’ll be using Survival Analysis for a more lighthearted application–to analyze the career lengths of Cricket players.

Read Complete Post on:

2. Complete tutorial learn data science python scratch:

Python was originally a general purpose language. But, over the years, with strong community support, this language got dedicated library for data analysis and predictive modeling.

Due to lack of resource on python for data science, I decided to create this tutorial to help many others to learn python faster. In this tutorial, we will take bite sized information about how to use Python for Data Analysis, chew it till we are comfortable and practice it at our own end.

Read Complete Post on: analyticsvidhya

3. Markov Chains Explained Visually:

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating”, “sleeping,” and “crying” as states, which together with other behaviors could form a ‘state space’: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probabilitiy of hopping, or “transitioning,” from one state to any other state—e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.

Read Complete Post on: setosa

4. Data scientist Role:

The age of big data is upon us, and it’s here to stay. With more data being collected than ever before, extracting value from this data is only going to become more intricate and demanding as time goes on. The logic behind the big data economy is shaping our personal lives in ways that we probably can’t even conceive or predict; every electronic move that we make produces a statistic and insight into our life.

As participants in the consumer economy, we are mined for data when we connect to any website or electronic service, and a data scientist is there to collect, clean, analyse and predict the data that we provide by using a combination of computer science, statistical analysis and intricate business knowledge.

As we can see, this responsibility is a combination of multiple skill sets and expertise compared to a typical Big Data Developer or Business Analyst.

Read Complete Post on: globalbigdataconference

5. An introduction to deep learning from perceptrons to deep networks:

In recent years, there’s been a resurgence in the field of Artificial Intelligence. It’s spread beyond the academic world with major players like Google, Microsoft, and Facebook creating their own research teams and making some impressive acquisitions.

Some this can be attributed to the abundance of raw data generated by social network users, much of which needs to be analyzed, as well as to the cheap computational power available via GPGPUs.

Read Complete Post on: toptal

6. Books for statistics mathematics data science:

The selection process of data scientists at Google gives higher priority to candidates with strong background in statistics and mathematics. Not just Google, other top companies (Amazon, Airbnb, Uber etc) in the world also prefer candidates with strong fundamentals rather than mere know-how in data science.

If you too aspire to work for such top companies in future, it is essential for you to develop a mathematical understanding of data science. Data science is simply the evolved version of statistics and mathematics, combined with programming and business logic. I’ve met many data scientists who struggle to explain predictive models statistically.

More than just deriving accuracy, understanding & interpreting every metric, calculation behind that accuracy is important. Remember, every single ‘variable’ has a story to tell. So, if not anything else, try to become a great story explorer!

Read Complete Post on: Analyticsvidhya

7. Hadoop installation using Ambari:

Apache Hadoop has become a de-facto software framework for reliable, scalable, distributed and large scale computing.  Unlike other computing system, it brings computation to data rather than sending data to computation. Hadoop was created in 2006 at Yahoo by Doug Cutting based on paper published by Google. As Hadoop has matured, over the years many new components and tools were added to its ecosystem to enhance its usability and functionality. Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop etc. to name a few.

Read Complete Post on: Edupristine


  1. How to use data to make a hit tv show
  2. Solving  Big Problems with Julia
  3. Best Data visualization

That’s all for March-2016 newsletter. Please leave your suggestions on newsletter in the comment section. To get all  dataaspirant newsletters you can visit monthly newsletter page.

Follow us:


Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter


Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter


Blog Posts:

1. Great resources for learning data mining concepts and techniques:

With today’s tools, anyone can collect data from almost anywhere, but not everyone can pull the important nuggets out of that data. Whacking your data into Tableau is an OK start, but it’s not going to give you the business critical insights you’re looking for. To truly make your data come alive you need to mine it. Dig deep. Play around. And tease out the diamond in the rough.

Read Complete Post on:

2.  Interactive Data Science with R in Apache Zeppelin Notebook:

The objective of this blog post is to help you get started with Apache Zeppelin notebook for your R data science requirements. Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown, Shell and more.

Read Complete Post on: sparkiq-labs

3. How to install Apache Hadoop 2.6.0 in Ubuntu:

Let’s get started towards setting up a fresh Multinode Hadoop (2.6.0) cluster.

Read Complete Post on: pingax

4. Running scalable Data Science on Cloud with R & Python:

So, why do we even need to run data science on cloud? You might raise this question that if a laptop can pack 64 GB RAM, do we even need cloud for data science? And the answer is a big YES for a variety of reasons. Here are a few of them.

Read Complete Post on: analyticsvidhya

5. How to Choose Between Learning Python or R First:

If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.

When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.

Read Complete Post on: Udacity Blog

LinkedIn Posts:

  1. 5 Best Machine Learning APIs for Data Science
  2. Big Data Top Trends In 2016
  3. Big Data: 4 Things You Can Do With It, And 3 Things You Can’t


1. Machine Learning: Going Deeper with Python and Theano

2. Current State of Recommendation Systems

3. Pandas From The Ground Up


TensorFlow Google Machine Learning Library:

About TensorFlow:

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Get Started TensorFlow


That’s all for November 2015 newsletter. Please leave your suggestions on newsletter in the comment section. To get all  dataaspirant newsletters you can visit monthly newsletter page. Do please Subscribe to our blog so that every month you get our news letter in your inbox.

Follow us:


Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter

DataAspirant Sept-Oct2015 newsletter

Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter

Data scientist


Hi dataaspirant lovers we are sorry for not publishing dataaspirant September  newsletter. So for October newsletter we come up with September newsletter ingredients  too. We rounded up the best blogs for anyone interested in learning more about data science. Whatever your experience level in data science or someone who’s just heard of the field,  these blogs provide enough detail and context for you to understand what you’re reading. We also collected some videos too. Hope you  enjoy October  dataaspirant newsletter.


Blog Posts:

1 . How to do a Logistic Regression in R:

Regression is the statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of it like simple linear, multiple linear, polynomial, logistic, poisson etc

Read Complete post on: datavinci

2 . Introduction of Markov State Modeling:

Modeling and prediction problems occur in different domain and data situations. One type of situation involves sequence of events.

For instance, you may want to model behaviour of customers on your website, looking at pages they land or enter by, links they click, and so on. You may want to do this to understand common issues and needs and may redesign your website to address that. You may, on the other hand, may want to promote certain sections or products on website and want to understand right page architecture and layout. In other example, you may be interested in predicting next medical visit of patient based on previous visits or next purchase product of customer based on previous products.

Read Complete post on: edupristine

3 . Five ways to improve the way you use Hadoop:

Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular service from Amazon that is used by developers trying to avoid the burden of set up and administration, and concentrate on working with their data.

Read Complete post on: cloudacademy

4. What is deep learning and why is it getting so much attention:

Deep learning is probably one of the hottest topics in Machine learning today, and it has shown significant improvement over some of its counterparts. It falls under a class of unsupervised learning algorithms and uses multi-layered neural networks to achieve these remarkable outcomes.

Read Complete post on: analyticsvidhya

5. Facebook data collection and photo network visualization with Gephi and R:

The first thing to do is get the Facebook data. Before being allowed to pull it from R, you’ll need to make a quick detour to, register as a developer, and create a new app. Name and description are irrelevant, the only thing you need to do is go to Settings → Website → Site URL and fill in http://localhost:1410/ (that’s the port we’re going to be using). The whole process takes ~5 min and is quite painless

Read Complete post on: kateto

6. Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data:

The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster, easier to work with, and more secure.

Read Complete post on: cloudera

7. Rapid Development & Performance in Spark For Data Scientists:

Spark is a cluster computing framework that can significantly increase the efficiency and capabilities of a data scientist’s workflow when dealing with distributed data. However, deciding which of its many modules, features and options are appropriate for a given problem can be cumbersome. Our experience at Stitch Fix has shown that these decisions can have a large impact on development time and performance. This post will discuss strategies at each stage of the data processing workflow which data scientists new to Spark should consider employing for high productivity development on big data.

Read Complete post on: multithreaded

8. NoSQL: A Dog with Different Fleas:

The NoSQL movement is around providing performance, scale, and flexibility; where cost is sometimes part of the reasoning (e.g. Oracle Tax). Yet databases like MySQL, which provide all the Oracle features, is often considered before choosing NoSQL. And with respects to NoSQL flexibility. This also can be Pandora’s box. In other words, schema-less modeling has been shown to be a serious complication to data management. I was at the MongoDB Storage Engine Summit this year and the number one ask to the storage engine providers is “how to discover schema in a schema-less architecture?” In other words, managing models over time is a serious matter to consider too.

Read Complete post on: deepis

9. Apache Spark: Sparkling star in big data firmament:

The underlying data needed to be used to gain right outcomes for all above tasks is comparatively very large. It cannot be handled efficiently (in terms of both space and time) by traditional systems. These are all big data scenarios. To collect, store and do computations on this kind of voluminous data we need a specialized cluster computing system. Apache Hadoop has solved this problem for us.

Read Complete post on: edupristine

10. Sqoop vs. Flume – Battle of the Hadoop ETL tools:

Apache Hadoop is synonymous with big data for its cost-effectiveness and its attribute of scalability for processing petabytes of data. Data analysis using hadoop is just half the battle won. Getting data into the Hadoop cluster plays a critical role in any big data deployment. Data ingestion is important in any big data project because the volume of data is generally in petabytes or exabytes. Hadoop Sqoop and Hadoop Flume are the two tools in Hadoop which is used to gather data from different sources and load them into HDFS. Sqoop in Hadoop is mostly used to extract structured data from databases like Teradata, Oracle, etc., and Flume in Hadoop is used to sources data which is stored in various sources like and deals mostly with unstructured data.

Read Complete post on: dezyre



1. Spark and Spark Streaming at Uber :

2. How To Stream Twitter Data Into Hadoop Using Apache Flume:


That’s all for October 2015 newsletter. Please leave your suggestions on newsletter in the comment box. To get all  dataaspirant newsletters you can visit monthly newsletter page. Do please Subscribe to our blog so that every month you get our news letter in your inbox.


Follow us:


Home | About | Data scientists Interviews | For beginners | Join us | Monthly newsletter