# Normality Tests in 11 ways

An A-Z guide to checking whether our data follows a Gaussian distribution

Hi there, I hope you are doing well! Today I will walk you through almost all the ways to check the normality of our data, that is, whether it follows a Gaussian (normal) distribution or not.

So let’s buckle up!

We all know why we want our data to follow the normal distribution. Of course, in real life it is hard to get data that is perfectly normally distributed, but we still want something close to the bell curve. If you are not familiar with the concepts **“Why do we need a normal distribution?”** and **“What is the Central Limit Theorem?”**, there are plenty of articles already available; a quick Google search will help. Our main focus here is to understand the various ways to detect whether the data is normally distributed or not.

A quick comparison to validate *“how a normal distribution helps to improve accuracy”*:

Dataset: https://www.kaggle.com/datasets/rupakroy/credit-data

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # import dataset
    dataset = pd.read_csv("credit_data.csv", sep=",")
    dataset.dropna(inplace=True)

    # check the skewness
    for i in dataset.columns:
        print(dataset[i])
        sns.displot(dataset[i])
    plt.show()

    # Repeat the whole run with skewness correction by uncommenting:
    # sns.displot(dataset["loan"])
    # dataset["loan"] = np.sqrt(dataset["loan"])

    # define X and y
    X = dataset.iloc[:, 1:4]
    y = dataset.iloc[:, 4].values

    # class imbalance
    from imblearn.combine import SMOTETomek
    smt = SMOTETomek(random_state=1)
    X, y = smt.fit_resample(X, y)

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    from sklearn.svm import SVC
    svc_model = SVC(random_state=0)
    svc_model.fit(X_train, y_train)
    y_pred = svc_model.predict(X_test)

    from sklearn import metrics
    print("classification report\n", metrics.classification_report(y_test, y_pred))

*Run it with and without the skewness correction and you will see the difference in accuracy. Good luck!*

Now let’s get back to our normality tests.

**1. Q-Q plot (graph for normality tests)**

    # graph for normality tests
    import numpy as np
    import pylab
    import scipy.stats as stats

    # generate a dummy dataset
    measurements = np.random.normal(loc=20, scale=5, size=150)
    stats.probplot(measurements, dist="norm", plot=pylab)
    pylab.show()

    # Alternatively -----------------------
    import numpy as np
    import statsmodels.api as sm
    import pylab as py

    # np.random generates different random numbers on every run,
    # so your graph will look slightly different each time.
    data_points = np.random.normal(0, 1, 100)

    # line="45" draws the y = x reference line, which suits standard-normal data
    sm.qqplot(data_points, line="45")
    py.show()
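
To put a number on how straight the Q-Q plot is, one option is to read the correlation coefficient that `scipy.stats.probplot` returns together with the fitted line. Below is a minimal sketch; the seed, the exponential comparison sample and the helper name `qq_correlation` are illustrative assumptions, not part of the article's data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # assumed seed, only for reproducibility
normal_sample = rng.normal(loc=20, scale=5, size=150)
skewed_sample = rng.exponential(scale=5, size=150)  # deliberately non-normal


def qq_correlation(sample):
    """Correlation r between the ordered sample and the theoretical normal quantiles."""
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    return r


r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"r (normal sample) = {r_normal:.4f}")
print(f"r (skewed sample) = {r_skewed:.4f}")
```

The closer r is to 1, the closer the points lie to the straight reference line.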

*2. Boxplot*

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Random data points generated
    data_points = np.random.normal(0, 1, 100)
    sns.boxplot(x=data_points)
    plt.show()
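
When reading a boxplot for normality, we usually look for a median near the middle of the box and only a few points beyond the whiskers. The sketch below computes the same numbers a boxplot draws, assuming the usual 1.5 x IQR whisker rule (the seaborn default); the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed for reproducibility
data_points = rng.normal(0, 1, 100)

# Quartiles and interquartile range: the box of the boxplot
q1, median, q3 = np.percentile(data_points, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR: points beyond them are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data_points[(data_points < lower_fence) | (data_points > upper_fence)]

print(f"Q1 = {q1:.3f}, median = {median:.3f}, Q3 = {q3:.3f}")
print(f"points outside the fences: {len(outliers)} of {len(data_points)}")
```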

*3. Histogram*

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Random data points generated
    data_points = np.random.normal(0, 1, 100)
    sns.displot(data_points)
    plt.show()

**I personally use sns.displot(): it is quick and easy to apply and gives a lot of information about the data distribution pattern.**

**4. Skewness**

    import numpy as np
    import seaborn as sns
    from scipy.stats import skew

    skewness_threshold = 0.5

    # replace df[skew_feat_list] with your targeted columns
    for i in df[skew_feat_list]:
        s_v1 = abs(skew(df[i]))
        print("Numeric Variable Detected")
        # s_v1 is an absolute value, so one comparison covers both directions
        if s_v1 >= skewness_threshold:
            print("Skewness Detected")
            print("The Skewness Value", s_v1)
            sns.displot(df[i])  # skewness before correction
            df[i] = np.sqrt(df[i])
            sns.displot(df[i])  # skewness after correction
            print("Skewness Value After", skew(df[i]))

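Anything beyond the 0.5 threshold, whether negative (-) or positive (+), is treated here as highly skewed. A negative value means the data is negatively (left) skewed and a positive value means it is positively (right) skewed. As a common rule of thumb, values between -0.5 and +0.5 are roughly symmetric; note that skewness itself is not bounded to the range -1 to +1.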
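
To see how the threshold behaves in practice, here is a small sketch that measures skewness before and after two common corrections. The lognormal sample and the `log1p` alternative are assumptions added for illustration; the article's own snippet uses only the square root.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)  # assumed seed for reproducibility
# Assumed example: a positively skewed, strictly positive variable
values = rng.lognormal(mean=0.0, sigma=0.8, size=1000)

candidates = {
    "original": values,
    "sqrt": np.sqrt(values),    # the correction used in the snippet above
    "log1p": np.log1p(values),  # a stronger alternative, log(1 + x)
}

skews = {name: skew(v) for name, v in candidates.items()}
for name, s in skews.items():
    flag = "skewed" if abs(s) >= 0.5 else "ok"
    print(f"{name:>8}: skew = {s:+.3f} ({flag})")
```

Both transforms are only defined for non-negative values, so they target positive skew.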

In a real-life scenario, we won't get perfectly normally distributed data. So anything that is too highly skewed, negative or positive, can be corrected using the snippet above. Do make sure the datatypes are properly formatted; if not, simply restrict the loop to numeric columns, for example with `df.select_dtypes(include="int")`. Note that the square-root correction only works for non-negative values.

And this is what I use when it comes to automation.
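
A minimal sketch of that datatype filter, assuming a hypothetical DataFrame (the column names `income`, `age` and `city` are made up for illustration). It uses `include="number"` so that both integer and float columns are picked up.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical frame for illustration only
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1.0, size=500),  # strongly right skewed
    "age": rng.integers(18, 70, size=500),                   # roughly symmetric
    "city": rng.choice(["A", "B", "C"], size=500),           # non-numeric, must be skipped
})

# Keep only numeric columns, then flag the ones beyond the skewness threshold
numeric_cols = df.select_dtypes(include="number").columns.tolist()
skewed_cols = [c for c in numeric_cols if abs(skew(df[c])) >= 0.5]

print("numeric columns:", numeric_cols)
print("columns to correct:", skewed_cols)
```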

**5. Shapiro-Wilk Test** is said to be one of the most powerful tests to check the normality of a variable.

**It was proposed in 1965 by Samuel Sanford Shapiro and Martin Wilk.**

    import numpy as np
    from scipy.stats import shapiro

    data = np.random.normal(loc=20, scale=5, size=150)
    stat, p = shapiro(data)
    print("stat = %.3f, p = %.3f \n" % (stat, p))

    if p > 0.05:
        print("Probably Gaussian")
    else:
        print("Probably Not Gaussian")

    # If the p-value ≤ 0.05, we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.
    # If the p-value > 0.05, we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
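
It helps to see the test reject something. The sketch below runs Shapiro-Wilk on a normal sample and on a clearly non-normal one; the exponential sample and the seed are assumptions chosen only for the demonstration.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)  # assumed seed for reproducibility
samples = {
    "normal": rng.normal(loc=20, scale=5, size=200),
    "exponential": rng.exponential(scale=5, size=200),  # deliberately non-normal
}

p_values = {}
for name, sample in samples.items():
    stat, p = shapiro(sample)
    p_values[name] = p
    verdict = "Probably Gaussian" if p > 0.05 else "Probably Not Gaussian"
    print(f"{name:>12}: stat = {stat:.3f}, p = {p:.4f} -> {verdict}")
```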

**6. D’Agostino’s K-squared test**: checks the normality of a variable based on skewness and kurtosis.

    import numpy as np
    from scipy.stats import normaltest

    data = np.random.normal(loc=20, scale=5, size=150)
    stat, p = normaltest(data)
    print("stat = %.3f, p = %.3f \n" % (stat, p))

    if p > 0.05:
        print("Probably Gaussian")
    else:
        print("Probably Not Gaussian")

Well, we can also perform the same by using skewness as mentioned above.
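
To make the link to skewness and kurtosis concrete, the sketch below rebuilds the K-squared statistic from SciPy's separate `skewtest` and `kurtosistest` z-scores. The seed and sample are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import normaltest, skewtest, kurtosistest

rng = np.random.default_rng(5)  # assumed seed for reproducibility
data = rng.normal(loc=20, scale=5, size=150)

z_skew, p_skew = skewtest(data)      # z-score for skewness
z_kurt, p_kurt = kurtosistest(data)  # z-score for kurtosis
k2, p_k2 = normaltest(data)          # combined K-squared test

# K^2 is the sum of the two squared z-scores
k2_manual = z_skew**2 + z_kurt**2
print(f"skew z = {z_skew:.3f}, kurtosis z = {z_kurt:.3f}")
print(f"K^2 from normaltest = {k2:.3f}, rebuilt by hand = {k2_manual:.3f}")
```

The p-value then comes from a chi-squared distribution with 2 degrees of freedom.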

**7. Anderson-Darling Normality Test**: another normality test, developed in 1952 by Theodore Anderson and Donald Darling.

    import numpy as np
    from scipy.stats import anderson

    data = np.random.normal(loc=20, scale=5, size=150)
    result = anderson(data)
    print("stat=%.3f" % (result.statistic))

    # compare the statistic with the critical value at each significance level (given in %)
    for i in range(len(result.critical_values)):
        sig_lev, crit_val = result.significance_level[i], result.critical_values[i]
        if result.statistic < crit_val:
            print(f"Probably Gaussian: {crit_val} critical value at {sig_lev}% level of significance")
        else:
            print(f"Probably NOT Gaussian: {crit_val} critical value at {sig_lev}% level of significance")
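
If you only care about one significance level, you can pick its critical value directly instead of looping. A minimal sketch, assuming the 5% level and a deliberately non-normal (exponential) sample so that the rejection is visible:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(11)  # assumed seed for reproducibility
data = rng.exponential(scale=5, size=150)  # deliberately non-normal example

result = anderson(data, dist="norm")

# significance_level is given in percent, e.g. 15, 10, 5, 2.5, 1
levels = list(result.significance_level)
crit_5 = result.critical_values[levels.index(5.0)]

# Unlike the p-value tests, we reject when the statistic EXCEEDS the critical value
reject_at_5 = result.statistic > crit_5
print(f"stat = {result.statistic:.3f}, 5% critical value = {crit_5:.3f}")
print("Probably Not Gaussian" if reject_at_5 else "Probably Gaussian")
```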

*8. Chi-Square Normality Test*

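    import numpy as np
    from scipy.stats import chisquare, norm

    data = np.random.normal(loc=20, scale=5, size=150)

    # chisquare() compares observed with expected frequencies, so bin the data first:
    # 8 bins that are equally likely under a normal fitted to the sample
    n_bins = 8
    inner_edges = norm.ppf(np.linspace(0, 1, n_bins + 1)[1:-1], loc=data.mean(), scale=data.std(ddof=1))
    observed = np.bincount(np.searchsorted(inner_edges, data), minlength=n_bins)
    expected = np.full(n_bins, len(data) / n_bins)

    # ddof=2 because the mean and standard deviation were estimated from the data
    statistic, pvalue = chisquare(observed, expected, ddof=2)
    print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))

    if pvalue > 0.05:
        print("Probably Gaussian")
    else:
        print("Probably Not Gaussian")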

**9. Lilliefors Test for Normality**: a normality test based on the Kolmogorov-Smirnov test. It is named after Hubert Lilliefors, professor of statistics at George Washington University.

    import numpy as np
    from statsmodels.stats.diagnostic import lilliefors

    data = np.random.normal(loc=20, scale=5, size=150)
    statistic, pvalue = lilliefors(data)
    print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))

    if pvalue > 0.05:
        print("Probably Gaussian")
    else:
        print("Probably Not Gaussian")

**10. Jarque-Bera Test for Normality**: tests whether the sample skewness and kurtosis match those of a normal distribution. The test is only reliable for large samples (more than about 2000 observations).

    import numpy as np
    from scipy.stats import jarque_bera

    # use a large sample, since the test needs more than about 2000 observations
    data = np.random.normal(loc=20, scale=5, size=2500)
    statistic, pvalue = jarque_bera(data)
    print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))

    if pvalue > 0.05:
        print("Probably Gaussian")
    else:
        print("Probably Not Gaussian")
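
To show how skewness and kurtosis enter the test, the sketch below rebuilds the statistic by hand as JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the excess kurtosis, and compares it with SciPy's result. The seed and sample size are assumptions for the demonstration.

```python
import numpy as np
from scipy.stats import jarque_bera, skew, kurtosis, chi2

rng = np.random.default_rng(2)  # assumed seed for reproducibility
data = rng.normal(loc=20, scale=5, size=2500)

n = len(data)
S = skew(data)      # sample skewness (0 for a normal distribution)
K = kurtosis(data)  # excess kurtosis (0 for a normal distribution)

jb_manual = n / 6 * (S**2 + K**2 / 4)
p_manual = chi2.sf(jb_manual, df=2)  # chi-squared with 2 degrees of freedom

jb_scipy, p_scipy = jarque_bera(data)
print(f"by hand: JB = {jb_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:   JB = {jb_scipy:.4f}, p = {p_scipy:.4f}")
```

The chi-squared approximation only holds asymptotically, which is why small samples are a problem for this test.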

**11. Kolmogorov-Smirnov Test for Normality**: apart from the Q-Q plot, this is the next best option if we want to quantify the check.

The Kolmogorov-Smirnov test helps when you suspect that the data departs from the normal shape, for example with more mass in the tails, so that normality is not really satisfied.

Quantitatively, it compares the **empirical distribution function** of the sample with the cumulative distribution function of a reference distribution, in this case the normal distribution.

The Kolmogorov-Smirnov test statistic is simply the maximum distance between the two.

In other words, it performs a goodness-of-fit test, using one sample or two samples, of the distribution F(x) of an observed random variable against a given distribution G(x), i.e. a normal distribution.
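
The "maximum distance" idea can be checked by hand. The sketch below builds the empirical distribution function of a sample, compares it with the normal CDF and confirms that the largest gap matches the statistic from `scipy.stats.kstest`. It assumes the reference mean (20) and standard deviation (5) are known in advance; the seed is arbitrary.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(8)  # assumed seed for reproducibility
data = rng.normal(loc=20, scale=5, size=150)

x = np.sort(data)
n = len(x)
cdf = norm.cdf(x, loc=20, scale=5)  # reference CDF G(x) at each sorted point

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so the gap has to be checked on both sides of every jump
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

d_scipy, p_scipy = kstest(data, "norm", args=(20, 5))
print(f"by hand: D = {d_manual:.4f}")
print(f"scipy:   D = {d_scipy:.4f}, p = {p_scipy:.4f}")
```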

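    import numpy as np
    from scipy.stats import kstest

    data = np.random.normal(loc=20, scale=5, size=150)

    # "norm" alone means the standard normal N(0, 1), so pass the mean and
    # standard deviation of the distribution we want to compare against
    statistic, pvalue = kstest(data, "norm", args=(20, 5))
    print("stat = %.3f, p = %.3f \n" % (statistic, pvalue))

    if pvalue > 0.05:
        print("Probably Gaussian")
    else:
        print("Probably Not Gaussian")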

This decision rule applies to all of the p-value based tests above.

If the p-value ≤ 0.05, we reject the null hypothesis, i.e. we assume the distribution of our variable is not normal/Gaussian.

If the p-value > 0.05, we fail to reject the null hypothesis, i.e. we assume the distribution of our variable is normal/Gaussian.
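
Since the rule is the same every time, it can be wrapped in one helper. This is a minimal sketch rather than a definitive implementation: the function name `normality_report`, the choice of three SciPy tests and the `alpha` default are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def normality_report(sample, alpha=0.05):
    """Run several p-value based normality tests and return {test name: (p-value, verdict)}."""
    tests = {
        "Shapiro-Wilk": stats.shapiro,
        "D'Agostino K^2": stats.normaltest,
        "Jarque-Bera": stats.jarque_bera,
    }
    report = {}
    for name, test in tests.items():
        _, p = test(sample)  # each test returns (statistic, p-value)
        verdict = "Probably Gaussian" if p > alpha else "Probably Not Gaussian"
        report[name] = (p, verdict)
    return report


rng = np.random.default_rng(21)  # assumed seed for reproducibility
skewed = rng.exponential(scale=5, size=500)  # deliberately non-normal sample
report = normality_report(skewed)
for name, (p, verdict) in report.items():
    print(f"{name:>15}: p = {p:.4f} -> {verdict}")
```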

*Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo:* **https://medium.com/@bobrupakroy**

**Some of my alternative internet presences** Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd and more.

**Also available on Quora** @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.