Normality Tests in 11 Ways
An A-Z guide to checking whether your data distribution follows a Gaussian distribution
Hi there, I hope you're doing well! Today I bring you an article that covers almost all the ways to check the normality of our data, i.e. whether it follows a Gaussian (normal) distribution or not.
So let’s buckle up!
We all know why we want our data to follow the normal distribution. Of course, in real life it's hard to get perfectly Gaussian data, but we still want something close to the bell curve. If you are not familiar with the concepts of why we want normality, or what the Central Limit Theorem is, there are tons of articles already available on Google, so just do a quick search! Our main focus here is to understand the various ways we can identify whether the data is normally distributed or not.
First, a quick comparison to validate the claim that a normal distribution helps to improve accuracy.
Dataset: https://www.kaggle.com/datasets/rupakroy/credit-data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# import the dataset
dataset = pd.read_csv("credit_data.csv", sep=",")
dataset.dropna(inplace=True)

# check the skewness of each column
for i in dataset.columns:
    print(dataset[i])
    sns.displot(dataset[i])

# Repeat the whole set with skewness correction ##########
# sns.displot(dataset["loan"])
# dataset["loan"] = np.sqrt(dataset["loan"])

# define X and y
X = dataset.iloc[:, 1:4]
y = dataset.iloc[:, 4].values

# handle the class imbalance
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=1)
X, y = smt.fit_resample(X, y)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC
svc_model = SVC(random_state=0)
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

from sklearn import metrics
print("classification report\n", metrics.classification_report(y_test, y_pred))
You will see the difference in accuracy. Good luck!
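If you want a concrete starting point for that re-run, here is a minimal sketch following the commented-out lines above ("loan" is the column those comments target):

import numpy as np

# square-root transform pulls in the long right tail of a
# positively skewed column; apply it before defining X and y
dataset["loan"] = np.sqrt(dataset["loan"])

# ...then repeat the SMOTETomek / train-test split / scaling / SVC
# steps above and compare the classification reports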
Now let's get back to our normality tests.
1. Q-Q Plot (graph for normality tests)
# graph for normality tests
import numpy as np
import pylab
import scipy.stats as stats

# generate a dummy dataset
measurements = np.random.normal(loc=20, scale=5, size=150)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()

# Alternatively -----------------------
import numpy as np
import statsmodels.api as sm
import pylab as py

# np.random generates different random numbers each time the code is
# executed, so the graph will look slightly different on each run
# (set np.random.seed(...) if you want reproducible output)

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line='45')
py.show()
2. Boxplot
import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sns.boxplot(data_points)
3. Histogram
import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sns.displot(data_points)
I personally use sns.displot(): it's quick, easy to apply, and gives a lot of information about the data's distribution pattern.
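For example, one small addition on top of the snippet above: passing kde=True overlays a kernel density estimate, which makes the bell shape (or lack of it) easier to judge.

import numpy as np
import seaborn as sns

data_points = np.random.normal(0, 1, 100)
# kde=True draws a smooth density curve over the histogram
sns.displot(data_points, kde=True)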
4. Skewness
from scipy.stats import skew
import numpy as np
import seaborn as sns

skewness_threshold = 0.5

for i in df[skew_feat_list]:
    s_v1 = abs(skew(df[i]))
    print("Numeric Variable Detected")
    if s_v1 >= skewness_threshold:
        print("Skewness Before Correction", sns.displot(df[i]))
        print("The Skewness Value", s_v1)
        print("Skewness Detected")
        df[i] = np.sqrt(df[i])
        print("Skewness After Correction", sns.displot(df[i]))
        print("Skewness Value After", skew(df[i]))
    else:
        pass

# replace df[skew_feat_list] with your targeted columns
Anything beyond the 0.5 threshold, whether negative (-) or positive (+), is considered highly skewed; the skew value roughly ranges between -1 (negatively skewed) and +1 (positively skewed).
In a real-life scenario, we won't be getting perfectly normally distributed data. Thus, anything that is far too highly skewed, whether negative or positive, we will correct using the snippet above. However, do make sure the datatypes are properly formatted; if they're not, simply modify my snippet with something like df.select_dtypes(include="int").
And this is what I use when it comes to automation.
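If you want to build skew_feat_list automatically, here is a minimal sketch using select_dtypes (the dtype list and the data source are my assumptions; adjust them to your schema):

import pandas as pd

# hypothetical source; replace with your own DataFrame
df = pd.read_csv("credit_data.csv").dropna()

# keep only the numeric columns so skew() never hits strings
skew_feat_list = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
print(skew_feat_list)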
5. Shapiro-Wilk Test is said to be one of the most powerful tests to check the normality of a variable. It was proposed in 1965 by Samuel Sanford Shapiro and Martin Wilk.
from scipy.stats import shapiro
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
stat, p = shapiro(data)
print("stat = %.3f, p = %.3f \n" % (stat, p))
if p > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')

# If the p-value ≤ 0.05, then we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
# If the p-value > 0.05, then we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
6. D’Agostino’s K-squared test — checks the normality of a variable based on its skewness and kurtosis.
from scipy.stats import normaltest
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
stat, p = normaltest(data)
print("stat = %.3f, p = %.3f \n" % (stat, p))
if p > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
Well, we can also perform a similar check using the skewness directly, as mentioned in the skewness section above.
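For reference, SciPy also exposes the two components that D’Agostino’s K² combines; a minimal sketch of my own (not from the original article):

import numpy as np
from scipy.stats import skewtest, kurtosistest

data = np.random.normal(loc=20, scale=5, size=150)

# D'Agostino's K^2 statistic combines these two z-scores
stat_s, p_s = skewtest(data)      # is the skewness consistent with normal?
stat_k, p_k = kurtosistest(data)  # is the kurtosis consistent with normal?
print("skewtest:     stat = %.3f, p = %.3f" % (stat_s, p_s))
print("kurtosistest: stat = %.3f, p = %.3f" % (stat_k, p_k))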
7. Anderson-Darling Normality Test — another normality test developed in 1952 by Theodore Anderson and Donald Darling.
from scipy.stats import anderson
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
result = anderson(data)
print('stat=%.3f' % result.statistic)
for i in range(len(result.critical_values)):
    sig_lev, crit_val = result.significance_level[i], result.critical_values[i]
    if result.statistic < crit_val:
        print(f'Probability Gaussian: {crit_val} critical value at {sig_lev} level of significance')
    else:
        print(f'Probability NOT Gaussian: {crit_val} critical value at {sig_lev} level of significance')
8. Chi-Square Normality Test
from scipy.stats import chisquare
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = chisquare(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
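One caveat: scipy.stats.chisquare compares observed frequencies against expected frequencies, so feeding it raw continuous values (as above) is not a textbook chi-square normality test. Here is a minimal sketch of the binned version, with the bin count (10) being my own choice:

import numpy as np
from scipy.stats import chisquare, norm

data = np.random.normal(loc=20, scale=5, size=150)

# bin the data into observed frequency counts
observed, edges = np.histogram(data, bins=10)

# expected counts under a normal distribution fitted to the sample
mu, sigma = data.mean(), data.std(ddof=1)
expected = len(data) * np.diff(norm.cdf(edges, loc=mu, scale=sigma))
expected *= observed.sum() / expected.sum()  # chisquare needs matching totals

# ddof=2 because two parameters (mu, sigma) were estimated from the data
stat, p = chisquare(observed, f_exp=expected, ddof=2)
print("stat = %.3f, p = %.3f" % (stat, p))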
9. Lilliefors Test for Normality — a normality test based on the Kolmogorov–Smirnov test that corrects for the fact that the normal distribution's mean and variance are estimated from the data. It is named after Hubert Lilliefors, professor of statistics at George Washington University.
from statsmodels.stats.diagnostic import lilliefors
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = lilliefors(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
10. Jarque–Bera test for Normality — tests normality using the skewness and kurtosis of the sample. The test is generally recommended only for large samples (commonly quoted as n > 2000), since its chi-squared approximation is unreliable for small ones.
from scipy.stats import jarque_bera
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = jarque_bera(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
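To see what the test actually measures, here is a minimal sketch (my own illustration) computing the Jarque-Bera statistic by hand from the sample's skewness and kurtosis:

import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.normal(loc=20, scale=5, size=150)
n = len(data)

S = skew(data)                    # sample skewness (0 for a normal)
K = kurtosis(data, fisher=False)  # Pearson kurtosis (3 for a normal)

# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)
jb = (n / 6) * (S ** 2 + ((K - 3) ** 2) / 4)
print("hand-computed JB = %.3f" % jb)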
11. Kolmogorov-Smirnov Test for Normality — apart from the Q-Q plot, this is the next best option if we want to quantify the comparison.
The Kolmogorov-Smirnov test is useful when you have more mass in the tails and the normal assumption is really not satisfied.
The quantitative way: the Kolmogorov-Smirnov test compares the empirical distribution function of the data to the cumulative distribution function of, in this case, the normal distribution. The Kolmogorov-Smirnov test statistic is simply the maximum distance between the two curves.
In other words, it performs a goodness-of-fit test of the distribution F(x) of an observed random variable against a given distribution G(x), i.e. a normal distribution, using one or two samples.
from scipy.stats import kstest
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)

# 'norm' means the *standard* normal, so standardize the data first
# (estimating loc/scale from the sample is exactly what the
# Lilliefors test above corrects for)
statistic, pvalue = kstest((data - data.mean()) / data.std(ddof=1), 'norm')
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
Now, this applies to all of the p-value based tests above:
If the p-value ≤ 0.05, then we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
If the p-value > 0.05, then we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
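And if you want them all in one place, here is a minimal convenience sketch of my own (not from the article) that runs every p-value based test above on the same sample:

import numpy as np
from scipy.stats import shapiro, normaltest, jarque_bera, kstest
from statsmodels.stats.diagnostic import lilliefors

data = np.random.normal(loc=20, scale=5, size=150)

tests = {
    "Shapiro-Wilk": lambda d: shapiro(d),
    "D'Agostino K^2": lambda d: normaltest(d),
    "Jarque-Bera": lambda d: jarque_bera(d),
    "Lilliefors": lambda d: lilliefors(d),
    "Kolmogorov-Smirnov": lambda d: kstest((d - d.mean()) / d.std(ddof=1), "norm"),
}

for name, test in tests.items():
    stat, p = test(data)
    verdict = "Probability Gaussian" if p > 0.05 else "Probability Not Gaussian"
    print("%-20s stat = %.3f, p = %.3f -> %s" % (name, stat, p, verdict))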
Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.
Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy
Let me know if you need anything. Talk Soon.