Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach to analyzing datasets in order to summarize their main characteristics.

EDA helps us understand a dataset, find patterns, identify outliers, and explore relationships between variables, using both non-graphical and graphical techniques.

EDA also helps us choose which features to use in a machine learning model (feature selection).

5 Steps:

Understanding Business Case
    What is the business problem?
Variable Description
    Have a clear understanding of the variables in the data set.
Data Understanding
    Import the data with the pandas functions read_csv, read_sql, or read_excel.
    Inspect it with df.head(), df.info(), df.shape, df.columns, df.describe().
    (shape and columns are attributes, so they take no parentheses.)
Data Cleaning
    Check for missing values and duplicated rows.
    df.isnull().sum()
    df[df.duplicated(keep='first')]
    df.drop_duplicates(keep='first', inplace=True)
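
The data understanding and data cleaning steps can be sketched end to end as follows. This is a minimal sketch, not a fixed recipe: the CSV content, the column names (order_id, region, amount), and the variable names are all made up for illustration, and the data is read from an in-memory string so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so the sketch needs no file on disk.
csv_text = """order_id,region,amount
1,North,120.5
2,South,
3,East,87.0
3,East,87.0
4,North,45.25
"""

df = pd.read_csv(io.StringIO(csv_text))

# Data understanding: first rows, column types, size, column names, statistics.
print(df.head())
df.info()
print(df.shape)             # shape is an attribute, not a method
print(df.columns.tolist())
print(df.describe())

# Data cleaning: count missing values per column and list duplicated rows.
missing_per_column = df.isnull().sum()
duplicate_rows = df[df.duplicated(keep="first")]

# Drop the repeats, keeping the first occurrence of each row.
cleaned = df.drop_duplicates(keep="first")
print(missing_per_column)
print(duplicate_rows)
print(cleaned.shape)
```

Assigning the result of drop_duplicates to a new variable, instead of using inplace=True, keeps the raw DataFrame around for comparison. Either style works.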

Data Visualization
    A picture is worth a thousand words.
    Perform univariate, bivariate, and multivariate analysis to see the distribution
    of each variable and the relationships between variables.

a. Univariate analysis looks at the distribution of values for a single variable.

We can perform univariate analysis in three ways:
    Summary Statistics
    Frequency Distribution Table
    Charts (Boxplot, Histogram, Bar Plot, Pie Chart)
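
The three options above can be sketched with pandas alone. The DataFrame, its column names (age, segment), and the bin edges are hypothetical. The last step computes the counts a histogram would draw rather than drawing the chart itself, so the sketch does not depend on a plotting library.

```python
import pandas as pd

# Hypothetical data: one numerical and one categorical variable.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29, 35, 41],
    "segment": ["A", "B", "A", "C", "B", "A", "A", "B"],
})

# 1. Summary statistics for a numerical variable.
age_summary = df["age"].describe()

# 2. Frequency distribution table for a categorical variable,
#    as raw counts and as proportions.
segment_counts = df["segment"].value_counts()
segment_share = df["segment"].value_counts(normalize=True)

# 3. The numbers behind a histogram: how many ages fall into each bin.
age_bins = pd.cut(df["age"], bins=[20, 30, 40, 50, 60])
age_hist = age_bins.value_counts().sort_index()

print(age_summary)
print(segment_counts)
print(segment_share)
print(age_hist)
```

If matplotlib is installed, df["age"].plot(kind="hist") or df["age"].plot(kind="box") should draw the corresponding charts from the same data.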

b. Bivariate analysis looks for relationships between two variables.
Use a boxplot (categorical vs. numerical), a scatterplot (numerical vs. numerical),
or a contingency table (categorical vs. categorical).
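
As a rough illustration of these pairings, the sketch below uses a made-up DataFrame whose column names (plan, churned, monthly_spend, support_tickets) are assumptions. It computes the per-group statistics a boxplot would display and builds a contingency table. The plotting calls are left as comments because they need matplotlib.

```python
import pandas as pd

# Hypothetical data: two categorical and two numerical variables.
df = pd.DataFrame({
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic"],
    "churned": ["yes", "no", "no", "no", "yes", "yes"],
    "monthly_spend": [10.0, 12.0, 30.0, 28.0, 35.0, 9.0],
    "support_tickets": [3, 1, 0, 1, 4, 5],
})

# Categorical vs. numerical: quartiles, min, and max per group,
# i.e. the numbers a boxplot draws.
spend_by_plan = df.groupby("plan")["monthly_spend"].describe()
# df.boxplot(column="monthly_spend", by="plan")   # needs matplotlib

# Numerical vs. numerical: a scatterplot of one column against the other.
# df.plot.scatter(x="support_tickets", y="monthly_spend")   # needs matplotlib

# Categorical vs. categorical: contingency table of counts.
plan_vs_churn = pd.crosstab(df["plan"], df["churned"])

print(spend_by_plan)
print(plan_vs_churn)
```
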
c. Multivariate analysis: correlation

Correlation measures how strongly two variables are related. It applies to
quantitative variables and, with rank-based methods, to ordered categorical variables.
A heatmap of the correlation matrix, for example with seaborn's heatmap() function,
shows the relationships between numeric variables at a glance.

There are different methods to calculate the correlation coefficient:
     1. Pearson
     2. Kendall
     3. Spearman
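
One way to compare the three methods is the method argument of pandas' DataFrame.corr. The data and column names below are invented for illustration, and the sketch assumes SciPy is installed, since pandas relies on it for the Kendall coefficient.

```python
import pandas as pd

# Hypothetical data: exam_score rises with hours_studied but not linearly,
# and hours_gaming mostly falls as hours_studied rises.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 90],
    "hours_gaming": [9, 7, 8, 4, 3, 1],
})

# Pearson measures linear association between the raw values.
pearson = df.corr(method="pearson")

# Kendall and Spearman are rank-based, so they measure monotonic association.
kendall = df.corr(method="kendall")     # assumes SciPy is available
spearman = df.corr(method="spearman")

print(pearson.round(3))
print(kendall.round(3))
print(spearman.round(3))
```

Because exam_score increases every time hours_studied does, the two rank-based coefficients for that pair come out as 1, while Pearson stays below 1 because the increase is not perfectly linear. Any of these matrices can be passed to seaborn's heatmap() to draw the heatmap mentioned above.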
