Normality Tests in 11 Ways
An A-Z guide to checking whether your data distribution follows a Gaussian distribution
Hi there, I hope you're doing well! Today I bring you an article that covers almost all the ways to check the normality of our data, i.e. whether it follows a Gaussian (normal) distribution or not.
So let’s buckle up!
We all know why we want our data to follow the normal distribution. Of course, in real life it's hard to get perfectly Gaussian data, but we still want something close to the bell curve. If you are not familiar with the concepts of why we want normality, or what the Central Limit Theorem is, there are tons of articles already available on Google, so just do a quick search! Our main focus here is to understand the various ways we can identify whether the data is normally distributed or not.
First, a quick comparison to validate the claim that a normal distribution helps to improve accuracy.
Dataset: https://www.kaggle.com/datasets/rupakroy/credit-data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# import the dataset
dataset = pd.read_csv("credit_data.csv", sep=",")
dataset.dropna(inplace=True)

# check the skewness of each column
for i in dataset.columns:
    print(dataset[i])
    sns.displot(dataset[i])

# Repeat the whole set with skewness correction ##########
# sns.displot(dataset["loan"])
# dataset["loan"] = np.sqrt(dataset["loan"])

# define X and y
X = dataset.iloc[:, 1:4]
y = dataset.iloc[:, 4].values

# handle the class imbalance
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=1)
X, y = smt.fit_resample(X, y)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC
svc_model = SVC(random_state=0)
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

from sklearn import metrics
print("classification report\n", metrics.classification_report(y_test, y_pred))
You will see the difference in accuracy. Good luck!
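If you want a concrete starting point for that re-run, here is a minimal sketch following the commented-out lines above ("loan" is the column those comments target):

import numpy as np

# square-root transform pulls in the long right tail of a
# positively skewed column; apply it before defining X and y
dataset["loan"] = np.sqrt(dataset["loan"])

# ...then repeat the SMOTETomek / train-test split / scaling / SVC
# steps above and compare the classification reports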
Now let's get back to our normality tests.
1. Q-Q Plot (graph for normality tests)
# graph for normality tests
import numpy as np
import pylab
import scipy.stats as stats

# generate a dummy dataset
measurements = np.random.normal(loc=20, scale=5, size=150)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()

# Alternatively -----------------------
import numpy as np
import statsmodels.api as sm
import pylab as py

# np.random generates different random numbers each time the code is
# executed, so the graph will look slightly different on each run
# (set np.random.seed(...) if you want reproducible output)

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line='45')
py.show()
2. Boxplot
import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sns.boxplot(data_points)
3. Histogram
import numpy as np
import seaborn as sns

# Random data points generated
data_points = np.random.normal(0, 1, 100)
sns.displot(data_points)
I personally use sns.displot(): it's quick, easy to apply, and gives a lot of information about the data's distribution pattern.
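For example, one small addition on top of the snippet above: passing kde=True overlays a kernel density estimate, which makes the bell shape (or lack of it) easier to judge.

import numpy as np
import seaborn as sns

data_points = np.random.normal(0, 1, 100)
# kde=True draws a smooth density curve over the histogram
sns.displot(data_points, kde=True)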
4. Skewness
from scipy.stats import skew
import numpy as np
import seaborn as sns

skewness_threshold = 0.5

for i in df[skew_feat_list]:
    s_v1 = abs(skew(df[i]))
    print("Numeric Variable Detected")
    if s_v1 >= skewness_threshold:
        print("Skewness Before Correction", sns.displot(df[i]))
        print("The Skewness Value", s_v1)
        print("Skewness Detected")
        df[i] = np.sqrt(df[i])
        print("Skewness After Correction", sns.displot(df[i]))
        print("Skewness Value After", skew(df[i]))
    else:
        pass

# replace df[skew_feat_list] with your targeted columns
Anything beyond the 0.5 threshold, whether negative (-) or positive (+), is considered highly skewed; the skew value roughly ranges between -1 (negatively skewed) and +1 (positively skewed).
In a real-life scenario, we won't be getting perfectly normally distributed data. Thus, anything that is far too highly skewed, whether negative or positive, we will correct using the snippet above. However, do make sure the datatypes are properly formatted; if they're not, simply modify my snippet with something like df.select_dtypes(include="int").
And this is what I use when it comes to automation.
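If you want to build skew_feat_list automatically, here is a minimal sketch using select_dtypes (the dtype list and the data source are my assumptions; adjust them to your schema):

import pandas as pd

# hypothetical source; replace with your own DataFrame
df = pd.read_csv("credit_data.csv").dropna()

# keep only the numeric columns so skew() never hits strings
skew_feat_list = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
print(skew_feat_list)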
5. Shapiro-Wilk Test is said to be one of the most powerful tests to check the normality of a variable. It was proposed in 1965 by Samuel Sanford Shapiro and Martin Wilk.
from scipy.stats import shapiro
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
stat, p = shapiro(data)
print("stat = %.3f, p = %.3f \n" % (stat, p))
if p > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')

# If the p-value ≤ 0.05, then we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
# If the p-value > 0.05, then we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
6. D’Agostino’s K-squared test — checks the normality of a variable based on its skewness and kurtosis.
from scipy.stats import normaltest
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
stat, p = normaltest(data)
print("stat = %.3f, p = %.3f \n" % (stat, p))
if p > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
Well, we can also perform a similar check using the skewness directly, as mentioned in the skewness section above.
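For reference, SciPy also exposes the two components that D’Agostino’s K² combines; a minimal sketch of my own (not from the original article):

import numpy as np
from scipy.stats import skewtest, kurtosistest

data = np.random.normal(loc=20, scale=5, size=150)

# D'Agostino's K^2 statistic combines these two z-scores
stat_s, p_s = skewtest(data)      # is the skewness consistent with normal?
stat_k, p_k = kurtosistest(data)  # is the kurtosis consistent with normal?
print("skewtest:     stat = %.3f, p = %.3f" % (stat_s, p_s))
print("kurtosistest: stat = %.3f, p = %.3f" % (stat_k, p_k))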
7. Anderson-Darling Normality Test — another normality test developed in 1952 by Theodore Anderson and Donald Darling.
from scipy.stats import anderson
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
result = anderson(data)
print('stat=%.3f' % result.statistic)
for i in range(len(result.critical_values)):
    sig_lev, crit_val = result.significance_level[i], result.critical_values[i]
    if result.statistic < crit_val:
        print(f'Probability Gaussian: {crit_val} critical value at {sig_lev} level of significance')
    else:
        print(f'Probability NOT Gaussian: {crit_val} critical value at {sig_lev} level of significance')
8. Chi-Square Normality Test
from scipy.stats import chisquare
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = chisquare(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
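One caveat: scipy.stats.chisquare compares observed frequencies against expected frequencies, so feeding it raw continuous values (as above) is not a textbook chi-square normality test. Here is a minimal sketch of the binned version, with the bin count (10) being my own choice:

import numpy as np
from scipy.stats import chisquare, norm

data = np.random.normal(loc=20, scale=5, size=150)

# bin the data into observed frequency counts
observed, edges = np.histogram(data, bins=10)

# expected counts under a normal distribution fitted to the sample
mu, sigma = data.mean(), data.std(ddof=1)
expected = len(data) * np.diff(norm.cdf(edges, loc=mu, scale=sigma))
expected *= observed.sum() / expected.sum()  # chisquare needs matching totals

# ddof=2 because two parameters (mu, sigma) were estimated from the data
stat, p = chisquare(observed, f_exp=expected, ddof=2)
print("stat = %.3f, p = %.3f" % (stat, p))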
9. Lilliefors Test for Normality — a normality test based on the Kolmogorov–Smirnov test that corrects for the fact that the normal distribution's mean and variance are estimated from the data. It is named after Hubert Lilliefors, professor of statistics at George Washington University.
from statsmodels.stats.diagnostic import lilliefors
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = lilliefors(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
10. Jarque–Bera test for Normality — tests normality using the skewness and kurtosis of the sample. The test is generally recommended only for large samples (commonly quoted as n > 2000), since its chi-squared approximation is unreliable for small ones.
from scipy.stats import jarque_bera
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)
statistic, pvalue = jarque_bera(data)
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
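To see what the test actually measures, here is a minimal sketch (my own illustration) computing the Jarque-Bera statistic by hand from the sample's skewness and kurtosis:

import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.normal(loc=20, scale=5, size=150)
n = len(data)

S = skew(data)                    # sample skewness (0 for a normal)
K = kurtosis(data, fisher=False)  # Pearson kurtosis (3 for a normal)

# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)
jb = (n / 6) * (S ** 2 + ((K - 3) ** 2) / 4)
print("hand-computed JB = %.3f" % jb)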
11. Kolmogorov-Smirnov Test for Normality — apart from the Q-Q plot, this is the next best option if we want to quantify the comparison.
The Kolmogorov-Smirnov test is useful when you have more mass in the tails and the normal assumption is really not satisfied.
The quantitative way: the Kolmogorov-Smirnov test compares the empirical distribution function of the data to the cumulative distribution function of, in this case, the normal distribution. The Kolmogorov-Smirnov test statistic is simply the maximum distance between the two curves.
In other words, it performs a goodness-of-fit test of the distribution F(x) of an observed random variable against a given distribution G(x), i.e. a normal distribution, using one or two samples.
from scipy.stats import kstest
import numpy as np

data = np.random.normal(loc=20, scale=5, size=150)

# 'norm' means the *standard* normal, so standardize the data first
# (estimating loc/scale from the sample is exactly what the
# Lilliefors test above corrects for)
statistic, pvalue = kstest((data - data.mean()) / data.std(ddof=1), 'norm')
print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))
if pvalue > 0.05:
    print("Probability Gaussian")
else:
    print('Probability Not Gaussian')
Now, this applies to all of the p-value based tests above:
If the p-value ≤ 0.05, then we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
If the p-value > 0.05, then we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
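And if you want them all in one place, here is a minimal convenience sketch of my own (not from the article) that runs every p-value based test above on the same sample:

import numpy as np
from scipy.stats import shapiro, normaltest, jarque_bera, kstest
from statsmodels.stats.diagnostic import lilliefors

data = np.random.normal(loc=20, scale=5, size=150)

tests = {
    "Shapiro-Wilk": lambda d: shapiro(d),
    "D'Agostino K^2": lambda d: normaltest(d),
    "Jarque-Bera": lambda d: jarque_bera(d),
    "Lilliefors": lambda d: lilliefors(d),
    "Kolmogorov-Smirnov": lambda d: kstest((d - d.mean()) / d.std(ddof=1), "norm"),
}

for name, test in tests.items():
    stat, p = test(data)
    verdict = "Probability Gaussian" if p > 0.05 else "Probability Not Gaussian"
    print("%-20s stat = %.3f, p = %.3f -> %s" % (name, stat, p, verdict))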
Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.
Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy
Let me know if you need anything. Talk Soon.