Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics.
With EDA, we can understand a dataset more easily, find patterns, identify outliers, and explore the relationships between variables using both non-graphical and graphical techniques.
EDA also helps us choose which features to use in a machine learning model (feature selection).
5 Steps:
- Understanding Business Case
What is the business problem?
- Variable Description
Have a clear understanding of the variables in the data set.
- Data Understanding
Import data with read_csv, read_sql, or read_excel.
df.head(), df.info(), df.shape, df.columns, df.describe()
- Data Cleaning
Missing Values, Duplicated Values
df.isnull().sum()
df[df.duplicated(keep='first')]
df.drop_duplicates(keep='first', inplace=True)
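The Data Understanding and Data Cleaning steps can be sketched roughly as follows. This is a minimal example under assumptions: the small in-memory CSV and its column names (id, age, city) are made up for illustration, and in practice read_csv would point at a real file. Note that shape is an attribute, so it is read without parentheses.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file path.
csv_text = """id,age,city
1,34,Paris
2,41,Berlin
2,41,Berlin
3,,Madrid
4,29,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Data Understanding: first rows, dtypes, dimensions, column names, statistics.
print(df.head())
df.info()
print(df.shape)             # attribute, not a method: (rows, columns)
print(df.columns.tolist())
print(df.describe())

# Data Cleaning: count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that repeat an earlier row (the first occurrence is not flagged).
duplicate_rows = df[df.duplicated(keep="first")]
print(duplicate_rows)

# Drop the repeats; returning a new frame avoids mutating df in place.
df_clean = df.drop_duplicates(keep="first")
print(df_clean.shape)
```

Returning a new DataFrame instead of passing inplace=True is a stylistic choice here; both forms remove the same rows.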
- Data Visualization
A picture is worth a thousand words.
Perform univariate, bivariate, and multivariate analysis to see the distribution of, and relationships between, variables.
a. Univariate analysis is used to understand the distribution of values for a single variable.
We can perform univariate analysis with 3 options: Summary Statistics, Frequency Distribution Tables, and Charts (Boxplot, Histogram, Barplot, Pie Chart).
b. Bivariate analysis is used to find relationships between two variables.
Use a boxplot (categorical vs numerical), a scatterplot (numerical vs numerical), or a contingency table (categorical vs categorical).
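As a rough illustration of the non-graphical side of univariate and bivariate analysis, the sketch below uses a toy DataFrame. The data and the column names (plan, churned, tenure, spend) are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "plan":    ["basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": ["no", "no", "yes", "no", "yes", "no"],
    "tenure":  [3, 24, 2, 36, 5, 18],
    "spend":   [10.0, 55.0, 8.0, 80.0, 12.0, 40.0],
})

# Univariate: summary statistics for a single numeric variable.
tenure_summary = df["tenure"].describe()
print(tenure_summary)

# Univariate: frequency distribution table for a single categorical variable.
plan_counts = df["plan"].value_counts()
print(plan_counts)

# Bivariate, categorical vs categorical: contingency table.
plan_vs_churn = pd.crosstab(df["plan"], df["churned"])
print(plan_vs_churn)

# Bivariate, categorical vs numerical: per-group summary,
# i.e. the numbers a boxplot of spend by plan would display.
spend_by_plan = df.groupby("plan")["spend"].describe()
print(spend_by_plan)
```

Assuming matplotlib is installed, the matching charts could be drawn with pandas' own plotting helpers, for example df["tenure"].plot.hist(), df.boxplot(column="spend", by="plan"), or df.plot.scatter(x="tenure", y="spend").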
c. Multivariate Analysis: Correlation
Correlation is used to test relationships between quantitative variables or categorical variables. It's a measure of how things are related. A heatmap, such as the one drawn by seaborn's heatmap() function, shows us the relationships between numeric variables. There are different methods to calculate the correlation coefficient: 1. Pearson 2. Kendall 3. Spearman
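The three correlation methods can be compared with pandas' DataFrame.corr. This is a minimal sketch on made-up data (the columns x, y, z are hypothetical); the Kendall call is wrapped in a try block on the assumption that some pandas installations need SciPy for it.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],  # increases with x, but not linearly
    "z": [5, 4, 3, 2, 1],    # decreases linearly with x
})

# Pearson measures linear association.
pearson = df.corr(method="pearson")
print(pearson)

# Spearman measures monotonic association, using ranks.
spearman = df.corr(method="spearman")
print(spearman)

# Kendall is also rank-based; assumed here to possibly require SciPy.
try:
    kendall = df.corr(method="kendall")
    print(kendall)
except ImportError:
    kendall = None
    print("Kendall correlation skipped: SciPy is not available")
```

Because y grows monotonically but not linearly with x, the Spearman coefficient for that pair is 1 while the Pearson coefficient is slightly below 1. Assuming seaborn is installed, a matrix such as pearson could be passed to sns.heatmap(pearson, annot=True) to draw the heatmap.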