Inside Ai

5 Types of Regression in 45 Lines of Code

Rupak (Bob) Roy - II
15 min read · Jun 21, 2021

Full Guide to Linear Regression, Polynomial Regression, Support Vector Regression, Decision Trees Regression, Random Forest Regression and more.

Python Programming

Hi, how are you doing, I hope it's great……….

Today let's understand and perform all of these types of regression, and we will also compare the performance of each to see how accurately it predicts.

Let’s get started. We will use the Abalone dataset from the UCI Machine Learning Repository.

The original owners of the dataset:
Marine Resources Division
Marine Research Laboratories — Taroona
Department of Primary Industry and Fisheries, Tasmania

And it contains the following attributes:

Sex / nominal / — / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / — / +1.5 gives the age in years

Abalone Data set

Let’s get started with our commonly used regression methods:

1.) Multiple Linear Regression, then we will use

2.)Polynomial Regression

3.) Support Vector Regression

4.) Decision Tree Regression

5.) Random Forest Regression

Any other regression? Let me know in the comments below.

#Multiple Linear Regression
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('abalone.csv',header=None)
# as we don’t have column names
X = dataset.iloc[:, :-1].values
# We define the independent variables X in the format [rows, columns];
# [:, :-1] selects every row and all columns except the last one
y = dataset.iloc[:, 8].values  # the last column (index 8)
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
#Gender column
ct = ColumnTransformer([("Gender", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
#to avoid dummy variable trap
X = X[:, 1:]

#Remember, whenever you have more than 2 categories, say 5, you have to keep one less than 5, i.e. (n-1) = 5-1 = 4 dummy columns; otherwise you will get N/A (multicollinear) outputs. Think of it this way: when you convert a two-category variable like gender (Male/Female), the output will have 1 column containing 0 and 1 values, 0 for male or 1 for female.

The same goes if we have 5 classes/categories: we will get 5 columns as output, but remember each column holds only a 0/1 (yes/no) response. Thus we remove 1 column so that the remaining 4 columns (say the 2nd to the 5th) are interpreted with respect to the 1st one.

In other words, the information of the 1st category is already implied by the other 4 columns.

#So to avoid the dummy variable trap we remove one (the 1st) dummy column, or any one of the dummy columns (imagine a multiclass column ‘country’ with 5 country classes); of course ‘any’ does not mean a column outside the set of country dummies.
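As a side note, here is a minimal sketch (assuming scikit-learn 0.21 or newer): OneHotEncoder can drop the first dummy column for you, so the manual X = X[:, 1:] step above isn't needed.

# Alternative sketch: let OneHotEncoder handle the dummy variable trap itself.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct_alt = ColumnTransformer(
    [("Gender", OneHotEncoder(drop='first'), [0])],  # drops one dummy per feature
    remainder='passthrough')
X_alt = ct_alt.fit_transform(dataset.iloc[:, :-1].values)  # same 9 columns as X above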

# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#Predicting the model accuracy on Test data set
y_pred = regressor.predict(X_test)
#To get the intercept:
print(regressor.intercept_)
#To view the coefficient values
print(regressor.coef_)

# I believe you are aware of the regression equation: y = mx + c, where m is the coefficient, x is the independent variable, and c is the intercept. Alternatively, we can also define this equation as

Linear Regression Equation

So the equation for our example will be Y = m1*X1 + m2*X2 + m3*X3 + … + m9*X9 + c

It means that for every one unit of change in X3 (original column name from X_test: Length of abalone) there will be a negative impact of -0.306 on Y (the dependent variable, originally the Rings column); likewise, for every one unit of change in X4 (original column name: Diameter of abalone) there will be a positive impact of 11.25 on Y, and so on for the rest of the columns.

However, I would like to interpret column 1, Gender, which got transformed into 3 columns; we then removed one column (X = X[:, 1:]) to avoid the dummy variable trap.

So for every one unit of change in X1 (the Infant dummy, measured with respect to Female), Y goes down by 0.83, and for every one unit of change in X2 (the Male dummy, with respect to Female), Y goes up by 0.075.

We can put all of this together in one equation to get the final value of Y:

Y = m1*X1 + m2*X2 + m3*X3 + … + m9*X9 + c

Y = -0.83*X1 + 0.075*X2 + (-0.306)*X3 + … + 8.33*X9 + 4.01

I know it sounds a bit dizzying… for that I would request you to practice the example a few times and analyze the values, or you can even visit my other posts for hands-on practice.
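To make the interpretation easier, here is a small sketch that pairs each coefficient with a readable name. The names are my own labels, assuming the encoding order above (OneHotEncoder sorts the categories F, I, M and we dropped the F dummy), followed by the 7 numeric measurement columns.

# Sketch: print each coefficient next to an assumed feature name.
feature_names = ['Sex_I (Infant)', 'Sex_M (Male)', 'Length', 'Diameter', 'Height',
                 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight']
for name, coef in zip(feature_names, regressor.coef_):
    print(f'{name:22s} {coef: .3f}')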

#We can also compare the actual versus prediction
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df
Predicted Vs Actual

#we can observe it's quite close, which suggests good accuracy

#we can also visualize the actual vs predicted

df_1 = df.head(25)
df_1.plot(kind='bar', figsize=(16, 10))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()
Actual Vs Predicted (abalone)

The final step is to evaluate the performance of the algorithm using evaluation metrics like

1. Mean Absolute Error (MAE) is the mean of the absolute values of the errors.

Mean Absolute Error

2. Mean Squared Error (MSE) is the mean of the squared errors.

Mean Squared Error

3. Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors (see the quick numpy sketch below).
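To make the three definitions concrete, here is a quick numpy sketch of what they compute, using the y_test and y_pred arrays from above.

# Manual computation of the three metrics (same definitions as above).
import numpy as np

errors = y_test.ravel() - y_pred.ravel()
mae = np.mean(np.abs(errors))    # Mean Absolute Error
mse = np.mean(errors ** 2)       # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error
print(mae, mse, rmse)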

Luckily we don’t have to perform these calculations manually.

#evaluation Metrics 
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

and we should get results something like this:

Mean Absolute Error: 1.5815705715179207

Mean Squared Error: 5.0068785487799286

Root Mean Squared Error: 2.237605539137747

The LOWER the values, the BETTER the model is.

Great!

We have successfully built our FIRST REGRESSION MODEL.

Bonus:

#save the model in the disk
import pickle
# save the model to disk
filename = 'reg_model.sav'
pickle.dump(regressor, open(filename, 'wb'))
# load the model from disk
filename1 = 'reg_model.sav'
loaded_model = pickle.load(open(filename1, 'rb'))
#another method using joblib
'''Pickle the model as a file using joblib: joblib is a replacement for pickle, as it is more efficient on objects that carry large numpy arrays. '''
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn versions
# Save the model as a pickle in a file
joblib.dump(regressor, 'regressor.pkl')

# Load the model from the file
loaded_model2 = joblib.load('regressor.pkl')

# Use the loaded model to make predictions
loaded_model2.predict(X_test)

The whole code will look something like this:

Multiple Linear Regression

I hope you enjoyed. Next is Polynomial Regression.

Stay tuned! Enjoy Machine Learning.

In linear regression, we need a linear relationship between the Target / Dependent variable and the independent variable.

Regression Image Wikipedia
Non-Linear Data

And what if we don’t have a linear relationship? Then we need something to fit the non-linear relationship between the target/dependent variable and the independent variable. This is where polynomial regression comes into the picture.

Polynomial regression is a special case of linear regression where we fit a polynomial equation to data with a curvilinear relationship between the target variable and the independent variable.

In polynomial regression, we have a polynomial equation of degree n represented as:

Polynomial Linear Regression Equation

where n is the degree, which turns the features (columns) into polynomial features of degree 2, 3, 4 and so on (x², x³, x⁴, …).

Linear Regression Equation

Let’s understand this with the help of an example:

#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('abalone.csv', header = None)
#Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
#Gender column
ct = ColumnTransformer([("Gender", OneHotEncoder(), [0])], remainder = 'passthrough')
dataset = ct.fit_transform(dataset)
# Anyway, to keep things simple we will use just one pair of columns from the
# encoded abalone data: the 'Rings' column (index 10 after encoding) as X and
# the 'Length' column (index 3 after encoding) as y.
# The reason for choosing only 2 columns is to make it easy to compare the
# performance of both algorithms using plots.
X = dataset[:,10:]
y = dataset[:,3]

Now, what we will do here is first define degree = 2, which transforms X into 3 columns: the constant (bias) term, the original feature, and its square.

In one word, we can call these ‘poly features’.
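Here is a tiny illustration (toy values, not from the abalone data) of what PolynomialFeatures with degree 2 produces for a single input column: [1, x, x²] for each row.

# Toy demo of PolynomialFeatures(degree=2) on a single-feature array.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[2.0], [3.0]])
print(PolynomialFeatures(degree=2).fit_transform(toy))
# [[1. 2. 4.]
#  [1. 3. 9.]]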

# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)
#this transforms X into [1, x, x^2]: the bias term, the original feature and its square
#To view X_poly, the poly features
X_poly
#build up the regression(poly) model
poly_reg.fit(X_poly, y)

# Then we fit our poly features with linear regression… done! How simple is that.

All we need to do is create poly features and fit them with our regular regression function, that's it!

from sklearn.linear_model import LinearRegression
linear_reg2 = LinearRegression()
linear_reg2.fit(X_poly, y)

Let’s also create a simple linear regression alongside it to compare the results.

# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X, y)

Now it's time to VISUALIZE THE Regression results.

# Visualizing the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, linear_reg.predict(X), color = 'blue')
plt.title('Predicting the age of abalone from physical measurements.')
plt.xlabel('Rings')
plt.ylabel('Length')
plt.show()
# Visualizing the Polynomial Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, linear_reg2.predict(poly_reg.fit_transform(X)), color = 'blue')
plt.title('Predicting the age of abalone from physical measurements.')
plt.xlabel('Rings')
plt.ylabel('Length')
plt.show()
Right | Simple Linear Regression, Left | Polynomial Regression
Polynomial Regression with smoother curve
# Visualizing the Polynomial Regression results (for a smoother curve)
X_grid = np.arange(min(X), max(X), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid,linear_reg2.predict(poly_reg.fit_transform(X_grid)), color = 'blue')
plt.title('Predicting abalone from physical measurements.')
plt.xlabel('Rings')
plt.ylabel('Length')
plt.show()

We can observe that our polynomial regression is capturing more of the data than before. Thus we conclude that polynomial regression is better at capturing and predicting the data than simple linear regression when our data is non-linear in nature.

TIME TO PREDICT THE TRUTH!!!!!

# Predicting a new result with Linear Regression
linear_reg.predict([[10]])
# Predicting a new result with Polynomial Regression
linear_reg2.predict(poly_reg.fit_transform([[10]]))

We will have output similar to this

Linear Regression: array([0.52536725])

Polynomial Regression: array([0.55452649])

So the polynomial regression output is more accurate than linear regression, because it captures more of the data than simple linear regression, as shown in the plot.
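If you want a numeric check rather than eyeballing the plots, here is an optional sketch (not part of the original walkthrough) that compares the two fits with R² on the same data they were trained on.

# Compare the linear and polynomial fits with R^2.
from sklearn.metrics import r2_score

print('Linear R^2    :', r2_score(y, linear_reg.predict(X)))
print('Polynomial R^2:', r2_score(y, linear_reg2.predict(poly_reg.fit_transform(X))))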

The whole code will look something like this:

Polynomial Regression

Congratulations!

We can now successfully apply regression for our non-linear data

I hope you enjoyed it… Next is Support Vector Regression (SVR)

SUPPORT VECTOR REGRESSION

Support Vector Machine

What is SVR?

Support Vector Regression works on the principle of the Support Vector Machine (SVM).

In brief, the working principle of SVM is to find a hyperplane that maximizes its distance to the nearest data points of either class. This distance is called the margin.
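For regression, the same idea becomes an epsilon-insensitive "tube" around the fitted curve. Here is a minimal sketch of the main SVR knobs; the values shown are just illustrative defaults, not tuned for this data.

# Sketch of the main SVR hyperparameters.
from sklearn.svm import SVR

svr = SVR(kernel='rbf',   # non-linear (radial basis function) kernel
          C=1.0,          # penalty for points falling outside the tube
          epsilon=0.1)    # half-width of the no-penalty tube around the fit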

Let's understand this with the help of an example.

We will use a simple Air Pressure dataset, which has records of air pressure at different temperatures.

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
data = pd.read_csv('AirPressure.csv')
data
#dividing the dataset into X and y
X = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values
# Feature Scaling for SVR
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()

Feature scaling is a must for support vector machines, as the whole algorithm is based on calculating distances between data points. Feature scaling reduces the spread of the data, which helps the algorithm run more swiftly and efficiently.

X = sc_X.fit_transform(X.reshape(-1,1))
y = sc_y.fit_transform(y.reshape(-1,1))

reshape helps us transform a 1D array into a 2D array. For example:

If we have a list of values where X = np.arange(0, 100), then X.ndim gives 1 and X.shape gives (100,), meaning it is simply a list of 100 numbers with no row/column structure. Therefore we use X.reshape(-1, 1), where -1 means "keep all the rows as they are" and 1 means "add one column dimension". Now X.shape gives (100, 1) and X.ndim gives 2.
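Here is that reshape behaviour as a quick runnable demo.

# Quick reshape demo matching the explanation above.
import numpy as np

x = np.arange(0, 100)    # 1D: shape (100,), ndim 1
x2 = x.reshape(-1, 1)    # 2D: shape (100, 1), ndim 2; -1 means "infer the number of rows"
print(x.shape, x.ndim, x2.shape, x2.ndim)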

In the same way as before, we will perform both simple linear regression and SVR to compare their effectiveness.

# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X, y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)

It's time to visualize the output:

# Visualising the Linear Regression results
plt.scatter(X, y, color = 'blue')
plt.plot(X, lin.predict(X), color = 'red')
plt.title('Linear Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
# Visualising the SVR results
plt.scatter(X, y, color = 'blue')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Support Vector Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
#Predicting a new result with Linear Regression
lin.predict([[150.0]])
#Predicting a new result (pressure) with Support Vector Regression
y_Pressure = regressor.predict([[55]])
Linear Regression Vs Support Vector Machine

Alright, we can observe that the SVR curve passes close to more of the data points than linear regression.

Thus SVR is more suitable for non-linear data.
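One caveat on the predictions above: since X and y were scaled, raw inputs like 55 and the predicted values are in scaled units. Here is a small sketch (using the sc_X and sc_y scalers fitted earlier) of the round trip back to original units.

# Scale a raw temperature, predict, and map the prediction back to real pressure units.
new_temp_scaled = sc_X.transform([[55.0]])
pred_scaled = regressor.predict(new_temp_scaled)
pred_pressure = sc_y.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred_pressure)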

The whole SVR code will look something like this

Support Vector Regressor (SVR)

Congratulations again!

We can now perform SVR, which is one of the more advanced methods for non-linear data.

Next is Decision Tree

DECISION TREES Regression

What are Decision Trees?

Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules derived from the data features.

The decision rules are generally in the form of if-then-else statements. The deeper the tree, the more complex the decision rules and the closer the model fits the training data.

A decision tree gives its output as a tree-like graph of nodes. Take this graph as an example; it is beautifully explained.

Decision Trees | Graph Credit ~ TDS

Let’s get hands-on experience on how to perform Decision trees.

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
data = pd.read_csv('AirPressure.csv')
data
#dividing the dataset into X and y
X = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values
# Feature Scaling for SVR
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X.reshape(-1,1))
y = sc_y.fit_transform(y.reshape(-1,1))

Everything till here is the same as before. Now we will create the models: linear regression, SVR, and of course decision trees! Then we will compare the results.

#Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X, y)
#Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
#Fitting Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state = 0)
dt_model.fit(X, y)
#Visualizing the Linear Regression results
plt.scatter(X, y, color = 'blue')
plt.plot(X, lin.predict(X), color = 'red')
plt.title('Linear Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
#Visualizing the SVR results
plt.scatter(X, y, color = 'blue')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Support Vector Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
#Visualizing the Decision Trees Regression results
plt.scatter(X, y, color = 'blue')
plt.plot(X, dt_model.predict(X), color = 'red')
plt.title('Decision Trees Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()

Let’s compare them all: who performs BETTER for non-linear data?

Linear Regression Vs Decision Trees
Support Vector Vs Decision Trees

We can see that in Decision Trees the line passes right through the blue points (thus a better-looking fit). However, remember that each and every method has disadvantages in different scenarios. Decision trees commonly face the issue of overfitting, where the tree tries to fit every data point as closely as possible, which later results in a model that does not generalize well. The solution to that is pruning.

There is a whole separate chapter on optimizing decision trees, so for that I would request you to visit another blog; a quick pruning sketch is shown below. Meanwhile, I will give you an intuition of when to use decision trees: DT is used when we are more concerned about the interpretation of our data, as we saw in the if-else graph, or about generating rules, rather than raw accuracy.
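Here is a minimal pruning sketch (not part of the original walkthrough; the max_depth and ccp_alpha values are illustrative only) showing two simple ways to rein in an overfitting tree.

# Sketch: limit the depth and apply cost-complexity pruning.
from sklearn.tree import DecisionTreeRegressor

pruned_dt = DecisionTreeRegressor(max_depth=3,      # cap how deep the rules can go
                                  ccp_alpha=0.01,   # cost-complexity pruning strength
                                  random_state=0)
pruned_dt.fit(X, y)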

We can also predict new results using predict() just like before

#Predicting a new result with Linear Regression
lin.predict([[55.0]])
#Predicting a new result(pressure) with Decision Tree Regression
y_Pressure = dt_model.predict([[55]])

It's time to visualize the decision tree.

# import export_graphviz
from sklearn.tree import export_graphviz
# export the decision tree to a tree.dot file
#for visualizing the plot easily anywhere
export_graphviz(dt_model, out_file = 'e:/tree.dot', feature_names = ['Temperature'])

The tree is finally exported and we can visualize using http://www.webgraphviz.com/ by copying the data from the ‘tree.dot’ file.
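If you don't want to leave Python at all, newer scikit-learn versions (0.21+) can print the same if-then-else rules as plain text; a small sketch:

# Print the fitted tree's decision rules as plain text.
from sklearn.tree import export_text

print(export_text(dt_model, feature_names=['Temperature']))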

Decision Tree with http://www.webgraphviz.com/

Putting all these together, the whole code looks something like this:

Decision Tree Regressor

Here we are; we have finished learning how to apply decision trees to non-linear data.

NEXT RANDOM FOREST

Random Forest For Regression

What is a random forest?

Random Forest is an upgraded version of decision trees. The name itself suggests that it consists of a large number of individual decision trees that operate as an ensemble. Thus we combine the predictive power of several decision trees to get more accuracy.

Random Forest Graphical Representation

Let’s get started with the help of an example

# Importing the libraries 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
data = pd.read_csv('AirPressure.csv')
data
#dividing the dataset into X and y
X = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values
# Feature Scaling for SVR
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X.reshape(-1,1))
y = sc_y.fit_transform(y.reshape(-1,1))

Till here it's the same as before. Now we will fit the random forest to the dataset. We will also fit a decision tree so that we can compare their performance later.

#Fitting Decision Tree Regression 
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state = 0)
dt_model.fit(X, y)

#Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators = 500, random_state = 0)
rf_model.fit(X, y)
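Since a random forest is just an ensemble of decision trees, we can peek at the individual fitted trees and check that the forest's prediction is their average. A small sketch (the input 0.5 is an arbitrary value in scaled units, used only for illustration):

# The forest prediction equals the mean of its trees' predictions.
import numpy as np

per_tree = np.array([tree.predict([[0.5]]) for tree in rf_model.estimators_])
print(per_tree.mean(), rf_model.predict([[0.5]]))  # the two numbers should match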

Done…. now it's time to visualize and compare.

# Visualizing the Decision Trees Regression results 
plt.scatter(X, y, color = 'blue')
plt.plot(X, dt_model.predict(X), color = 'red')
plt.title('Decision Trees Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
# Visualizing the Random Forest results
plt.scatter(X, y, color = 'blue')
plt.plot(X, rf_model.predict(X), color = 'red')
plt.title('Random Forest Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
# Visualizing the Random Forest results more precisely (finer grid)
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, rf_model.predict(X_grid), color = 'blue')
plt.title('Random Forest Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
#Predicting a new result(pressure) with Random Forest Regression
rf_model.predict([[55]])
#Predicting a new result(pressure) with Decision Tree Regression
dt_model.predict([[55]])
Decision Tree Vs Random Forest
Random Forest

Well, we can observe that at the last data point the decision tree's fitted line touches it, whereas the random forest's line does not.

Thus the decision tree here appears to perform better than the random forest, because the random forest requires more data to build its trees. See, I already told you every algorithm has its pros and cons: a random forest is far less affected by outliers and overfitting, whereas a single decision tree is.

The only thing left for you is to make a note of the pros and cons of each algorithm; that will help you understand where to use each one and where not to.
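As a quick, more robust comparison than eyeballing a single plot, here is a cross-validation sketch (not in the original walkthrough); the fold count and scores are only illustrative and will vary with the dataset.

# 5-fold cross-validated R^2 for the two models on the scaled data.
from sklearn.model_selection import cross_val_score

for name, model in [('Decision Tree', dt_model), ('Random Forest', rf_model)]:
    scores = cross_val_score(model, X, y.ravel(), cv=5, scoring='r2')
    print(name, scores.mean())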

Putting all the above codes together:

Random Forest Regressor

Finally, we are done……….we have learned how to apply different types of regression using Python. I will also be making another version in R.

I hope you enjoyed it. Stay tuned for my next release: Classification (Logistic, SVM, KNN, Kernel SVM, Naive Bayes, Decision Trees, Random Forest and more) using Python.
