# Linear Regression

In this post I am going to get your hands wet with the coding part too. Before we dive further, let me show what type of examples we are going to solve today.

## 1) Predicting house price for ZooZoo.

• ZooZoo is going to buy a new house, so we have to find how much a particular house will cost.

## 2) Predicting which television show will have more viewers next week

The Flash and Arrow are my favorite television (TV) shows. I want to find which TV show will get more viewers in the upcoming week. Frankly, I am very excited to know which show will get more viewers.

## 3) Replacing missing values using linear regression

You will solve this problem yourself; I will explain everything you need to do, so please try to solve it.

## So let’s dive into the coding part

I believe you have installed all the required packages which I specified in my previous post. If not, please take some time and install all the packages listed in my post Python Packages for Datamining. It will be better once you go through that post.

### 1) Predicting the price of a house for ZooZoo

ZooZoo has the following data set:

| No. | square_feet | price |
| --- | ----------- | ----- |
| 1   | 150         | 6450  |
| 2   | 200         | 7450  |
| 3   | 250         | 8450  |
| 4   | 300         | 9450  |
| 5   | 350         | 11450 |
| 6   | 400         | 15450 |
| 7   | 600         | 18450 |

• Square feet is the area of the house.
• Price is the corresponding cost of that house.

• As we are learning linear regression, we know that we have to fit a straight line to this data so that we can get θ0 and θ1.
• As you remember, our hypothesis equation looks like this:

hθ(x) = θ0 + θ1x

where:

• hθ(x) is nothing but the value of price (which we are going to predict) for a particular square_feet (meaning price is a linear function of square_feet)
• θ0 is a constant
• θ1 is the regression coefficient

Now that we are clear about what we have to do, let’s start coding.

STEP – 1 :

• First open your favorite text editor and create a file named predict_house_price.py.
• We are going to use the packages below in our program, so copy them into your predict_house_price.py file.

# Required Packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model


• Just run your code once. If your program is error free, then most of the job is done. If you are facing any errors, it means you missed some packages, so please go back to the packages post.
• Install all the packages from that blog post and run your code once again. This time you will most probably not face any problem.
• That means your program is error free now, so we can go to STEP – 2.

STEP – 2

• I stored our data set in a csv file named input_data.csv.
• So let’s write a function to get our data into X values (square_feet) and Y values (price).

# Function to get data
def get_data(file_name):
    data = pd.read_csv(file_name)
    X_parameter = []
    Y_parameter = []
    for single_square_feet, single_price_value in zip(data['square_feet'], data['price']):
        X_parameter.append([float(single_square_feet)])
        Y_parameter.append(float(single_price_value))
    return X_parameter, Y_parameter


Line 3:

Reads the csv data into a pandas DataFrame.

Line 6-9:

Converts the pandas DataFrame data into X_parameter and Y_parameter and returns them.

So let’s print our X_parameters and Y_parameters


X, Y = get_data('input_data.csv')
print(X)
print(Y)


Script Output

[[150.0], [200.0], [250.0], [300.0], [350.0], [400.0], [600.0]]
[6450.0, 7450.0, 8450.0, 9450.0, 11450.0, 15450.0, 18450.0]
[Finished in 0.7s]

Step – 3

We converted the data into X_parameters and Y_parameters, so let’s fit them to a linear regression model.

We are going to write a function which takes X_parameters, Y_parameters and the value you want to predict as input, and returns θ0, θ1 and the predicted value.


# Function for Fitting our data to Linear model
def linear_model_main(X_parameters, Y_parameters, predict_value):

    # Create linear regression object
    regr = linear_model.LinearRegression()
    regr.fit(X_parameters, Y_parameters)
    predict_outcome = regr.predict(predict_value)
    predictions = {}
    predictions['intercept'] = regr.intercept_
    predictions['coefficient'] = regr.coef_
    predictions['predicted_value'] = predict_outcome
    return predictions


Line 5-6:

First we create a linear model and then train it with our X_parameters and Y_parameters.

Line 8-12:

We create a dictionary named predictions, store the θ0, θ1 and predicted values in it, and return it as the output.

So let’s call our function with 700 as the value to predict:


X, Y = get_data('input_data.csv')
predictvalue = [[700]]
result = linear_model_main(X, Y, predictvalue)
print("Intercept value ", result['intercept'])
print("coefficient", result['coefficient'])
print("Predicted value: ", result['predicted_value'])


Script Output:

Intercept value 1771.80851064
coefficient [ 28.77659574]
Predicted value: [ 21915.42553191]
[Finished in 0.7s]

Here the intercept value is nothing but the θ0 value, and the coefficient value is nothing but the θ1 value.

We got the predicted value as 21915.4255, which means we have done our job of predicting the house price.
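As a quick sanity check (a sketch, not part of the original script), the same θ0 and θ1 can be recovered with NumPy's closed-form least-squares fit:

```python
import numpy as np

# The same data as in the table above
square_feet = np.array([150, 200, 250, 300, 350, 400, 600], dtype=float)
price = np.array([6450, 7450, 8450, 9450, 11450, 15450, 18450], dtype=float)

# A degree-1 polyfit returns [theta1, theta0]
theta1, theta0 = np.polyfit(square_feet, price, 1)
print(theta0, theta1)         # intercept and coefficient
print(theta0 + theta1 * 700)  # prediction for 700 square feet
```

The printed values should match the intercept (about 1771.81), coefficient (about 28.78) and prediction (about 21915.43) from the scikit-learn output above.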

For checking purposes we should see how our data fits the linear regression, so let’s write a function which takes X_parameters and Y_parameters as input and shows the linear line fitted to our data.


# Function to show the results of linear fit model
def show_linear_line(X_parameters, Y_parameters):
    # Create linear regression object
    regr = linear_model.LinearRegression()
    regr.fit(X_parameters, Y_parameters)
    plt.scatter(X_parameters, Y_parameters, color='blue')
    plt.plot(X_parameters, regr.predict(X_parameters), color='red', linewidth=4)
    plt.xticks(())
    plt.yticks(())
    plt.show()



So let’s call our show_linear_line function:


show_linear_line(X,Y)



Script Output: a scatter plot of the data points (blue) with the fitted regression line (red).

### About The Flash TV show:

The Flash is an American television series developed by writer/producers Greg Berlanti, Andrew Kreisberg and Geoff Johns, airing on The CW. It is based on the DC Comics character Flash (Barry Allen), a costumed superhero crime-fighter with the power to move at superhuman speeds, who was created by Robert Kanigher, John Broome and Carmine Infantino. It is a spin-off from Arrow, existing in the same universe. The pilot for the series was written by Berlanti, Kreisberg and Johns, and directed by David Nutter. The series premiered in North America on October 7, 2014, where the pilot became the most watched telecast for The CW.

Arrow is an American television series developed by writer/producers Greg Berlanti, Marc Guggenheim, and Andrew Kreisberg. It is based on the DC Comics character Green Arrow, a costumed crime-fighter created by Mort Weisinger and George Papp. It premiered in North America on The CW on October 10, 2012, with international broadcasting taking place in late 2012. Primarily filmed in Vancouver, British Columbia, Canada, the series follows billionaire playboy Oliver Queen, portrayed by Stephen Amell, who, after five years of being stranded on a hostile island, returns home to fight crime and corruption as a secret vigilante whose weapon of choice is a bow and arrow. Unlike in the comic books, Queen does not initially go by the alias “Green Arrow”.

As both of these are my best-loved TV shows, whenever I watch them I ask myself which show has more viewers, and I am very interested in guessing which show will have more viewers.

So let’s write a program which guesses (predicts) which TV show will have more viewers.

For a test drive of our program we need a dataset containing both shows’ viewers for each episode. Luckily I got this data from Wikipedia and prepared a csv file. It looks like this:

| flash_episode | flash_us_viewers | arrow_episode | arrow_us_viewers |
| ------------- | ---------------- | ------------- | ---------------- |
| 1             | 4.83             | 1             | 2.84             |
| 2             | 4.27             | 2             | 2.32             |
| 3             | 3.59             | 3             | 2.55             |
| 4             | 3.53             | 4             | 2.49             |
| 5             | 3.46             | 5             | 2.73             |
| 6             | 3.73             | 6             | 2.60             |
| 7             | 3.47             | 7             | 2.64             |
| 8             | 4.34             | 8             | 3.92             |
| 9             | 4.66             | 9             | 3.06             |

US viewers (millions)

### Steps to solve this problem:

• First we have to convert our data into X_parameters and Y_parameters, but here we have two sets of them, so let’s name them flash_x_parameter, flash_y_parameter, arrow_x_parameter and arrow_y_parameter.
• Then we have to fit our data to two different linear regression models: the first for The Flash and the other for Arrow.
• Then we have to predict the number of viewers of the next episode for both of the TV shows.
• Then we can compare the results and guess which TV show will have more viewers.
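Before writing the full program, the four steps above can be sketched compactly with plain NumPy, using the viewer numbers from the table (a quick sanity check; the real script below uses scikit-learn and reads the data from a csv file):

```python
import numpy as np

# Viewer numbers (millions) from the table above, episodes 1-9
episodes = np.arange(1, 10)
flash = np.array([4.83, 4.27, 3.59, 3.53, 3.46, 3.73, 3.47, 4.34, 4.66])
arrow = np.array([2.84, 2.32, 2.55, 2.49, 2.73, 2.60, 2.64, 3.92, 3.06])

# Fit one straight line per show and extrapolate to episode 10 (next week)
flash_pred = np.polyval(np.polyfit(episodes, flash, 1), 10)
arrow_pred = np.polyval(np.polyfit(episodes, arrow, 1), 10)

winner = "The Flash" if flash_pred > arrow_pred else "Arrow"
print(winner, round(flash_pred, 2), round(arrow_pred, 2))
```

On this data the straight-line trend favors The Flash, which is also what the scikit-learn version of the program prints.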

Let’s dive into coding this interesting problem.

Step-1

We have to import our packages


# Required Packages
import csv
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model



Step-2

Let’s convert our data into flash_x_parameter, flash_y_parameter, arrow_x_parameter and arrow_y_parameter. We will write a function which takes our data set as input and returns these four values.


# Function to get data
def get_data(file_name):
    data = pd.read_csv(file_name)
    flash_x_parameter = []
    flash_y_parameter = []
    arrow_x_parameter = []
    arrow_y_parameter = []
    for x1, y1, x2, y2 in zip(data['flash_episode_number'], data['flash_us_viewers'], data['arrow_episode_number'], data['arrow_us_viewers']):
        flash_x_parameter.append([float(x1)])
        flash_y_parameter.append(float(y1))
        arrow_x_parameter.append([float(x2)])
        arrow_y_parameter.append(float(y2))
    return flash_x_parameter, flash_y_parameter, arrow_x_parameter, arrow_y_parameter


Now we have flash_x_parameters, flash_y_parameters, arrow_x_parameters and arrow_y_parameters. Let’s write a function which takes these parameters as input and tells us which show will have more viewers.


# Function to know which Tv show will have more viewers
def more_viewers(x1, y1, x2, y2):
    regr1 = linear_model.LinearRegression()
    regr1.fit(x1, y1)
    predicted_value1 = regr1.predict([[10]])  # episode 10 is next week's episode
    print(predicted_value1)
    regr2 = linear_model.LinearRegression()
    regr2.fit(x2, y2)
    predicted_value2 = regr2.predict([[10]])
    #print(predicted_value1)
    #print(predicted_value2)
    if predicted_value1 > predicted_value2:
        print("The Flash Tv Show will have more viewers for next week")
    else:
        print("Arrow Tv Show will have more viewers for next week")


So let’s put everything in one file. Open your editor, create a file named prediction.py, and copy this complete code into it.


# Required Packages
import csv
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

# Function to get data
def get_data(file_name):
    data = pd.read_csv(file_name)
    flash_x_parameter = []
    flash_y_parameter = []
    arrow_x_parameter = []
    arrow_y_parameter = []
    for x1, y1, x2, y2 in zip(data['flash_episode_number'], data['flash_us_viewers'], data['arrow_episode_number'], data['arrow_us_viewers']):
        flash_x_parameter.append([float(x1)])
        flash_y_parameter.append(float(y1))
        arrow_x_parameter.append([float(x2)])
        arrow_y_parameter.append(float(y2))
    return flash_x_parameter, flash_y_parameter, arrow_x_parameter, arrow_y_parameter

# Function to know which Tv show will have more viewers
def more_viewers(x1, y1, x2, y2):
    regr1 = linear_model.LinearRegression()
    regr1.fit(x1, y1)
    predicted_value1 = regr1.predict([[10]])  # episode 10 is next week's episode
    print(predicted_value1)
    regr2 = linear_model.LinearRegression()
    regr2.fit(x2, y2)
    predicted_value2 = regr2.predict([[10]])
    #print(predicted_value1)
    #print(predicted_value2)
    if predicted_value1 > predicted_value2:
        print("The Flash Tv Show will have more viewers for next week")
    else:
        print("Arrow Tv Show will have more viewers for next week")

x1, y1, x2, y2 = get_data('input_data.csv')
#print(x1, y1, x2, y2)
more_viewers(x1, y1, x2, y2)



Run this program and see which TV show will have more viewers.

## 3) Replacing missing values using linear Regression

Sometimes we have to analyse data which contains missing values. Some people remove the rows with missing values and continue their analysis, while others replace the missing values with the min, max or mean value. Replacing a missing value with the mean is often good, but sometimes it is not the right way, and in those cases we can use linear regression to replace missing values very effectively.

This approach goes something like this:

• First find the column whose missing values we are going to replace, and find the columns on which that column depends most.
• Remove the rows with missing values from the training data.
• Consider the missing-values column as Y_parameters and the columns it depends on as X_parameters, and fit this data to a linear regression model.
• Now predict the missing values in that column from the columns it depends on.

Once all this is completed, we will get data without any missing values, so we are free to analyse it.
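To make the steps concrete, here is a minimal sketch on a tiny made-up table (the column names and numbers are invented just to show the mechanics; the approach is the one described above):

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical data: 'price' has one missing value,
# and it depends strongly on 'square_feet'
df = pd.DataFrame({'square_feet': [150, 200, 250, 300, 350],
                   'price': [6450, 7450, np.nan, 9450, 11450]})

known = df[df['price'].notnull()]    # rows where price is present
missing = df[df['price'].isnull()]   # rows where price must be predicted

# Fit on the complete rows only
regr = linear_model.LinearRegression()
regr.fit(known[['square_feet']], known['price'])

# Replace the missing values with the model's predictions
df.loc[df['price'].isnull(), 'price'] = regr.predict(missing[['square_feet']])
print(df)
```

In a real dataset you would pick the X columns by checking which columns relate most strongly to the column you are filling (for example with df.corr()).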

For practice I leave this problem to you, so please kindly get some data with missing values from online and solve it. Leave your comments once you have completed it; I will be so happy to view them.

#### Small personal note:

I want to share my personal experience with data mining. I remember that in my introductory datamining classes the instructor started slow, explaining some interesting areas where we can apply datamining along with some very basic concepts, so my friends and I understood everything and grew more interested in datamining. Then suddenly the difficulty level skyrocketed. That left a lot of my friends in the class feeling extremely frustrated and intimidated by the course, and ultimately they lost interest in datamining. I want to avoid that in my blog posts. I want to keep things easygoing, which is only possible when I explain things with interesting examples and make my readers comfortable learning without any boredom, and that spirit is what leads me to use these examples.

I hope you liked today’s post. If you have any questions, feel free to comment below. If you want me to write on one specific topic, then do tell it to me in the comments below.

# Python Packages For Datamining

Just because you have a “hammer”, doesn’t mean that every problem you come across will be a “nail”.

The intelligent thing is to use that same hammer to solve whatever problem you come across. In the same way, when we set out to solve a datamining problem we will face many issues, but we can solve them by using python in an intelligent way.

In the very next post I am going to get your hands wet solving one interesting datamining problem using the python programming language. So in this post I am going to explain some powerful python weapons (packages).

Before stepping directly into the python packages, let me clear up a doubt which may be rotating in your mind right now: why python?

## Why Python ?

We all know that python is a powerful programming language, but what does that mean, exactly? What makes python a powerful programming language?

### Python is Easy

Python has gained a universal reputation because it is easy to learn. The syntax of the python programming language is designed to be easily readable. Python also has significant popularity in scientific computing; the people working in this field are scientists first and programmers second.

### Python is Efficient

Nowadays we work on bulk amounts of data, popularly known as BIG DATA. The more data you have to process, the more important it becomes to manage the memory you use. Here python works very efficiently.

### Python is Fast

We all know Python is an interpreted language, so we may think it is slow, but some amazing work has been done over the past years to improve Python’s performance. My point is that if you want to do high-performance computing, Python is a viable option today.

## NumPy

NumPy is the fundamental package for scientific computing with Python. NumPy is an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications.
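A tiny example of the array math mentioned above (a sketch, assuming NumPy is installed):

```python
import numpy as np

# Element-wise math on a whole array at once, no Python loop needed
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a * 10)         # broadcasting a scalar over the array
print(np.sqrt(a))     # a high-level mathematical function applied element-wise
print(a.sum(axis=0))  # column sums
```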

NumPy at a glance: community project (original author Travis Oliphant); first released as Numeric in 1995 and as NumPy in 2006; stable release 1.9.0 (7 September 2014); written in Python and C; cross-platform; technical computing; BSD-new license; www.numpy.org.

Installing numpy:

I strongly believe that python is already installed on your computer; if it is not, please install it first.

Installing numpy in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-numpy

Sample numpy code

Sample numpy code using the reshape function:


from numpy import arange
a = arange(12)
a = a.reshape(3, 2, 2)
print(a)



Script output

[[[ 0 1]
[ 2 3]]

[[ 4 5]
[ 6 7]]

[[ 8 9]
[10 11]]]

### SciPy

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world’s leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, SciPy is one of the packages you will need most.
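For example, the numerical integration routines mentioned above can be used like this (a small sketch):

```python
from scipy import integrate

# Numerically integrate x^2 from 0 to 3; the exact answer is 9
result, error_estimate = integrate.quad(lambda x: x ** 2, 0, 3)
print(result)
```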

SciPy at a glance: community library project (original authors Travis Oliphant, Pearu Peterson and Eric Jones); stable release 0.14.0 (3 May 2014); technical computing; BSD-new license; www.scipy.org.

Installing SciPy in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-scipy

Sample SciPy code


from scipy import special, optimize
from numpy import linspace
from matplotlib.pyplot import plot, savefig

f = lambda x: -special.jv(3, x)
sol = optimize.minimize(f, 1.0)
x = linspace(0, 10, 5000)
plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')
savefig('plot.png', dpi=96)


Script output: the figure saved to plot.png, showing the Bessel function curve with the computed extremum marked.

### Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Pandas is well suited for many different kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.
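For instance, here is a tiny sketch of the first case, labeled tabular data with heterogeneously-typed columns (the column names and numbers are made up for illustration):

```python
import pandas as pd

# A small labeled table: one string column, one integer column, one float column
df = pd.DataFrame({'show': ['The Flash', 'Arrow'],
                   'episode': [1, 1],
                   'us_viewers': [4.83, 2.84]})
print(df)
print(df['us_viewers'].mean())
```

Each column keeps its own dtype, and rows and columns both carry labels, which is what makes selecting and aggregating data so convenient.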

Installing Pandas in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-pandas



Sample Pandas code

Sample Pandas code using a pandas Series:


import numpy as np
import pandas as pd

values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print(ser)


Script output

0     2.0000
1     1.0000
2     5.0000
3     0.9700
4     3.0000
5    10.0000
6     0.0599
7     8.0000

### Matplotlib

matplotlib is a plotting library for the Python programming language and its NumPy numerical mathematics extension. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB. SciPy makes use of matplotlib.

matplotlib at a glance: original authors John Hunter, Michael Droettboom, et al.; stable release 1.4.2 (26 October 2014); written in Python; cross-platform; plotting; matplotlib license; matplotlib.org.

Installing Matplotlib in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-matplotlib


Sample Matplotlib code

Sample Matplotlib code to create a histogram:


import numpy as np
import matplotlib.pyplot as plt

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(10000)

num_bins = 50
# the histogram of the data
n, bins, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.5)
# add a 'best fit' line (the normal probability density function)
y = (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
plt.plot(bins, y, 'r--')
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'Histogram of IQ: $\mu=100$, $\sigma=15$')

# Tweak spacing to prevent clipping of ylabel
plt.subplots_adjust(left=0.15)
plt.show()



Script output: a green histogram of the 10000 samples with the red dashed best-fit normal curve.

### Ipython

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. IPython currently provides the following features:

• Powerful interactive shells (terminal and Qt-based).
• A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
• Support for interactive data visualization and use of GUI toolkits.
• Flexible, embeddable interpreters to load into one’s own projects.
• Easy to use, high performance tools for parallel computing.

IPython at a glance: original authors Fernando Perez and others; stable release 2.3 (1 October 2014); cross-platform shell; BSD license; www.ipython.org.

Installing Ipython in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo pip install ipython

Sample Ipython code

This piece of code plots a demonstration of the integral as the area under a curve:


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

def func(x):
    return (x - 3) * (x - 5) * (x - 7) + 85

a, b = 2, 9  # integral limits
x = np.linspace(0, 10)
y = func(x)

fig, ax = plt.subplots()
plt.plot(x, y, 'r', linewidth=2)
plt.ylim(bottom=0)

# Shade the region under the curve between a and b
ix = np.linspace(a, b)
iy = func(ix)
verts = [(a, 0)] + list(zip(ix, iy)) + [(b, 0)]
poly = Polygon(verts, facecolor='0.9', edgecolor='0.5')
ax.add_patch(poly)

plt.text(0.5 * (a + b), 30, r"$\int_a^b f(x)\mathrm{d}x$",
         horizontalalignment='center', fontsize=20)

plt.figtext(0.9, 0.05, '$x$')
plt.figtext(0.1, 0.9, '$y$')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.xaxis.set_ticks_position('bottom')

ax.set_xticks((a, b))
ax.set_xticklabels(('$a$', '$b$'))
ax.set_yticks([])

plt.show()



Script output: the curve y = f(x) with the integral region between a and b shaded.

### scikit-learn

The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a “SciKit” (SciPy Toolkit), a separately-developed and distributed third-party extension to SciPy. The original codebase was later extensively rewritten by other developers. Of the various scikits, scikit-learn as well as scikit-image were described as “well-maintained and popular” in November 2012.

scikit-learn at a glance: original author David Cournapeau; initial release June 2007; stable release 0.15.1 (1 August 2014); written in Python, Cython, C and C++; machine-learning library; BSD license; scikit-learn.org.

Installing Scikit-learn in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-sklearn


Sample Scikit-learn code


import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis]
diabetes_X_temp = diabetes_X[:, :, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X_temp[:-20]
diabetes_X_test = diabetes_X_temp[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()



Script output

Coefficients:
[ 938.23786125]
Residual sum of squares: 2548.07
Variance score: 0.47

I have explained the packages which we are going to use in coming posts to solve some interesting problems.

Please leave a comment if I should add any other python datamining packages to this list.

I hope you liked today’s post. If you have any questions, feel free to comment below. If you want me to write on one specific topic, then do tell me in the comments below.