Normality Tests in 11 Ways

Rupak (Bob) Roy - II
6 min read · Mar 22, 2022


An A-Z guide to checking whether our data distribution follows a Gaussian distribution

From Puri Holiday Trip

Hi there, I hope you are doing well! Today I bring you an article that covers almost all the types/ways to check the normality of our data, i.e., whether it follows a Gaussian (normal) distribution or not.

So let’s buckle up!

We all know why we want our data to follow the normal distribution. Of course, in real life it’s hard to get data that exactly follows a Gaussian distribution, but we still want it close to, or almost matching, the bell curve. If you are not familiar with the concepts “Why do we need the normal distribution?” or “What is the Central Limit Theorem?”, there are tons of articles already available on Google. Just do some Googling! Our main focus here is to understand the various ways we can identify/detect whether the data is normally distributed or not.

First, a quick comparison to validate “how normal distribution helps to improve accuracy.”

Dataset: https://www.kaggle.com/datasets/rupakroy/credit-data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# import the dataset
dataset = pd.read_csv("credit_data.csv", sep=",")
dataset.dropna(inplace=True)

# check the skewness of each column
import seaborn as sns
for i in dataset.columns:
    print(dataset[i])
    sns.displot(dataset[i])

# Repeat the whole set with skewness correction ##########
# sns.displot(dataset["loan"])
# dataset["loan"] = np.sqrt(dataset["loan"])

# define X and y
X = dataset.iloc[:, 1:4]
y = dataset.iloc[:, 4].values

# handle the class imbalance
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=1)
X, y = smt.fit_resample(X, y)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC
svc_model = SVC(random_state=0)
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

from sklearn import metrics
print("classification report", metrics.classification_report(y_test, y_pred))

Run the block once as-is and once with the skewness-correction lines uncommented; you will see the difference in accuracy. Good luck!
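If you’d rather toggle the correction programmatically, here is a minimal sketch that wraps the same pipeline in a helper function (run_pipeline is my own hypothetical wrapper, not from the snippet above) so both variants can be scored side by side:

import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn import metrics

def run_pipeline(df, correct_skew=False):
    df = df.copy()
    if correct_skew:
        # square-root transform to reduce the positive skew in "loan"
        df["loan"] = np.sqrt(df["loan"])
    X = df.iloc[:, 1:4]
    y = df.iloc[:, 4].values
    X, y = SMOTETomek(random_state=1).fit_resample(X, y)  # balance classes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    model = SVC(random_state=0).fit(X_train, y_train)
    return metrics.accuracy_score(y_test, model.predict(X_test))

dataset = pd.read_csv("credit_data.csv").dropna()
print("Accuracy without correction:", run_pipeline(dataset))
print("Accuracy with correction:   ", run_pipeline(dataset, correct_skew=True))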

Now let’s get back to our normality tests.

1. Graph for normality tests
#graph for normality tests
import numpy as np
import pylab
import scipy.stats as stats
#generate a dummy dataset
measurements = np.random.normal(loc=20, scale=5, size=150)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
#Alternatively -----------------------

import numpy as np
import statsmodels.api as sm
import pylab as py

# np.random generates different random numbers
# whenever the code is executed
# Note: when you execute the same code,
# the graph may look different than shown below.

# Random data points generated
data_points = np.random.normal(0, 1, 100)

sm.qqplot(data_points, line ='45')
py.show()
qqplot

2. Boxplot

import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sns.boxplot(data_points)
normality test using boxplot

3. Histogram

import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sns.displot(data_points)
normality test using histogram

I personally use sns.displot(): quick and easy to apply, with a lot of information about the data distribution pattern.
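For example, a handy variation (kde=True is a standard seaborn option, nothing specific to this post) overlays a kernel density estimate on the histogram so the bell shape is easier to judge:

import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
# kde=True overlays a kernel density estimate on top of the histogram
sns.displot(data_points, kde=True)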

4. Skewness

import numpy as np
import seaborn as sns
from scipy.stats import skew

skewness_threshold = 0.5
for i in df[skew_feat_list]:
    s_v1 = abs(skew(df[i]))
    print("Numeric Variable Detected")
    if s_v1 >= skewness_threshold:
        print("Skewness Before Correction")
        sns.displot(df[i])
        print("The Skewness Value", s_v1)
        print("Skewness Detected")
        # square-root transform to pull in the long tail
        df[i] = np.sqrt(df[i])
        print("Skewness After Correction")
        sns.displot(df[i])
        print("Skewness Value After", skew(df[i]))
    else:
        pass
# replace skew_feat_list with your targeted columns

Anything beyond the 0.5 threshold, whether in minus (-) or plus (+), is considered to be highly skewed. As a rule of thumb, skew values between -1 and +1 are acceptable, and the sign tells you whether the variable is negatively or positively skewed.

In a real-life scenario we won’t get perfectly normally distributed data. Thus anything that is far too highly skewed, whether negative or positive, we can correct using my snippet above. However, do make sure the datatypes are properly formatted; if not, simply modify my snippet with something like df.select_dtypes(include="int"), as sketched below.
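As a rough sketch of that modification (assuming you simply want every numeric column as a candidate), the column list for the skewness loop can be built automatically:

# build skew_feat_list from the numeric dtypes only,
# so the sqrt correction never touches strings or dates
skew_feat_list = df.select_dtypes(include=["number"]).columns.tolist()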

And this is what I use when it comes to automation.

5. Shapiro-Wilk Test — said to be one of the most powerful tests for checking the normality of a variable. It was proposed in 1965 by Samuel Sanford Shapiro and Martin Wilk.

import numpy as np
from scipy.stats import shapiro

data = np.random.normal(loc=20, scale=5, size=150)
stat, p = shapiro(data)
print("stat = %.3f, p = %.3f \n" % (stat, p))
if p > 0.05:
    print("Probability Gaussian")
else:
    print("Probability Not Gaussian")
# If the p-value ≤ 0.05, then we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
# If the p-value > 0.05, then we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.

6. D’Agostino’s K-squared test — checks the normality of a variable based on its skewness and kurtosis.

import numpy as np
from scipy.stats import normaltest

data = np.random.normal(loc=20, scale=5, size=150)
stat, p = normaltest(data)
print("stat = %.3f, p = %.3f \n" % (stat, p))
if p > 0.05:
    print("Probability Gaussian")
else:
    print("Probability Not Gaussian")

Well, we can also perform much the same check using skewness, as mentioned above. In fact, scipy exposes the two building blocks of this test separately, as sketched below.
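D’Agostino’s K-squared statistic is built from scipy’s separate skewness and kurtosis z-tests; here is a minimal sketch of the two pieces, useful if you want to see which one drives the result:

import numpy as np
from scipy.stats import skewtest, kurtosistest

data = np.random.normal(loc=20, scale=5, size=150)
# normaltest() combines these two z-scores into a single K^2 statistic
print("skewtest:     stat = %.3f, p = %.3f" % skewtest(data))
print("kurtosistest: stat = %.3f, p = %.3f" % kurtosistest(data))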

7. Anderson-Darling Normality Test — another normality test, developed in 1952 by Theodore Anderson and Donald Darling.

import numpy as np
from scipy.stats import anderson

data = np.random.normal(loc=20, scale=5, size=150)
result = anderson(data)
print("stat=%.3f" % result.statistic)
for i in range(len(result.critical_values)):
    sig_lev, crit_val = result.significance_level[i], result.critical_values[i]
    if result.statistic < crit_val:
        print(f"Probability Gaussian: {crit_val} critical value at {sig_lev}% level of significance")
    else:
        print(f"Probability NOT Gaussian: {crit_val} critical value at {sig_lev}% level of significance")

8. Chi-Square Normality Test

import numpy as np
from scipy.stats import chisquare

data = np.random.normal(loc=20, scale=5, size=150)
# Note: chisquare() compares observed frequencies against a uniform
# expectation by default; for a strict normality test you would first
# bin the data and pass the expected normal-bin frequencies as f_exp.
statistic, pvalue = chisquare(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print("Probability Not Gaussian")

9. Lilliefors Test for Normality — normality test based on the Kolmogorov–Smirnov test. It is named after Hubert Lilliefors, professor of statistics at George Washington University.

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = lilliefors(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print("Probability Not Gaussian")

10. Jarque–Bera test for Normality — tests normality using the skewness and kurtosis of the data. It relies on a large-sample approximation, so it is usually recommended only when the sample size is greater than 2000.

import numpy as np
from scipy.stats import jarque_bera

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = jarque_bera(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print("Probability Not Gaussian")

11. Kolmogorov-Smirnov test for Normality — apart from the qq-plot, this is the next best option if we want to quantify the test.

A qq-plot will show you visually when there is more mass in the tails and the normal distribution is really not satisfied, but it does not give you a number.

The quantitative way: the Kolmogorov-Smirnov test compares the empirical distribution function of the sample to the cumulative distribution function of, in this case, the normal distribution.

The Kolmogorov-Smirnov test statistic is simply the maximum distance between the two curves.

In other words, it performs a goodness-of-fit test, using one sample or two samples, of the distribution F(x) of an observed random variable against a given distribution G(x), i.e. a normal distribution.
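To make the “maximum distance” idea concrete, here is a minimal sketch (my own illustration, not from any library) that computes the statistic by hand and checks it against scipy:

import numpy as np
from scipy.stats import norm, kstest

data = np.sort(np.random.normal(loc=20, scale=5, size=150))
n = len(data)
# theoretical normal CDF evaluated with the sample's own mean and std
cdf = norm.cdf(data, loc=data.mean(), scale=data.std())
# the ECDF steps from (i-1)/n up to i/n at each sorted point; the KS
# statistic is the largest vertical gap on either side of each step
D_plus = np.max(np.arange(1, n + 1) / n - cdf)
D_minus = np.max(cdf - np.arange(0, n) / n)
D = max(D_plus, D_minus)
print("hand-computed D = %.4f" % D)
# matches scipy when the same distribution parameters are supplied
print("scipy kstest D  = %.4f" % kstest(data, "norm", args=(data.mean(), data.std()))[0])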

import numpy as np
from scipy.stats import kstest

data = np.random.normal(loc=20, scale=5, size=150)
# kstest(data, "norm") compares against the STANDARD normal N(0, 1),
# so pass the sample's own mean and std (or standardize the data first)
statistic, pvalue = kstest(data, "norm", args=(data.mean(), data.std()))
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print("Probability Not Gaussian")

Now this interpretation applies to all of the tests above.

# If the p-value ≤ 0.05, then we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
# If the p-value > 0.05, then we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
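And since every test above shares that decision rule, here is a small convenience sketch (my own wrapper, not part of any library) that runs several of them on one variable at once:

import numpy as np
from scipy.stats import shapiro, normaltest, jarque_bera
from statsmodels.stats.diagnostic import lilliefors

data = np.random.normal(loc=20, scale=5, size=150)
tests = {
    "Shapiro-Wilk": shapiro,
    "D'Agostino K^2": normaltest,
    "Jarque-Bera": jarque_bera,
    "Lilliefors": lilliefors,
}
for name, test in tests.items():
    stat, p = test(data)  # each test returns (statistic, p-value)
    verdict = "Probability Gaussian" if p > 0.05 else "Probability Not Gaussian"
    print("%-15s stat = %.3f, p = %.3f -> %s" % (name, stat, p, verdict))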

Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo. https://medium.com/@bobrupakroy

Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.

Konark ~ The Sun Temple


Rupak (Bob) Roy - II

Things I write about frequently on Medium: Data Science, Machine Learning, Deep Learning, NLP, and many other random topics of interest. ~ Let’s stay connected!