Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach to analyzing datasets in order to summarize their main characteristics.

With EDA we can get to know a dataset quickly: find patterns, identify outliers, and explore the relationships between variables, using both non-graphical and graphical techniques.

EDA also helps us decide which features to use in a machine learning model (feature selection).

5 Steps:

  1. Understanding Business Case
    What is the business problem?
  2. Variable Description
    Have a clear understanding of what each variable in the dataset means.
  3. Data Understanding
    Import data with pandas: read_csv, read_sql, read_excel
    df.head(), df.info(), df.shape, df.columns, df.describe()
    (df.shape and df.columns are attributes, so they take no parentheses)
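
The calls in step 3 can be tried on a small table. This is only a minimal sketch: the CSV text, the column names and the values are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,age,city,salary
Ana,34,Lisbon,52000
Ben,29,Porto,48000
Cara,41,Lisbon,61000
Dan,29,Faro,
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())       # first rows (5 by default)
df.info()              # dtypes and non-null counts; prints by itself
print(df.shape)        # (rows, columns); an attribute, not a method
print(df.columns)      # column labels
print(df.describe())   # summary statistics for the numeric columns
```

The empty salary field in the last row is read as a missing value, which is the kind of thing the next step looks for.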
  4. Data Cleaning
    Missing Values, Duplicated Values
    df.isnull().sum()
    df[df.duplicated(keep='first')]
    df.drop_duplicates(keep='first', inplace=True)
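
A minimal sketch of the cleaning checks in step 4, on a made-up table with one missing value and one repeated row. The data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 repeats row 1, and one age is missing.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara"],
    "age": [34, 29, 29, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Rows that repeat an earlier row (the first occurrence is not flagged).
duplicate_rows = df[df.duplicated(keep="first")]

# Drop the repeats, keeping the first occurrence of each row.
deduped = df.drop_duplicates(keep="first")

print(missing_per_column)
print(duplicate_rows)
print(deduped)
```

Here the result is assigned to a new variable instead of using inplace=True, so the original table stays available for comparison; either style removes the same rows.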
  5. Data Visualization
    A picture is worth a thousand words
    Perform univariate, bivariate and multivariate analysis to see the distribution
    of each variable and the relationships between variables.

    a. Univariate analysis: understand the distribution of values for a single variable.

    We can perform univariate analysis in 3 ways:
        Summary Statistics
        Frequency Distribution Table
        Charts (Boxplot, Histogram, Barplot, Pie Chart)
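
The first two options can be sketched with pandas alone. The table below is invented for illustration; describe() gives the summary statistics of a numeric column and value_counts() gives the frequency table of a categorical one.

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical variable.
df = pd.DataFrame({
    "age": [22, 25, 25, 31, 38, 44, 52],
    "city": ["Lisbon", "Porto", "Lisbon", "Faro", "Lisbon", "Porto", "Lisbon"],
})

# Summary statistics: count, mean, std, min, quartiles, max.
age_summary = df["age"].describe()

# Frequency distribution table: absolute counts and relative shares.
city_counts = df["city"].value_counts()
city_share = df["city"].value_counts(normalize=True)

print(age_summary)
print(city_counts)
print(city_share)
```

For the chart option, the same columns could be passed to a plotting library, for example df["age"].plot(kind="hist") or df["age"].plot(kind="box") when matplotlib is installed.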

    b. Bivariate analysis: find the relationship between two variables.
    Use a boxplot (categorical vs numerical), a scatterplot (numerical vs numerical),
    or a contingency table (categorical vs categorical).
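
A minimal sketch of two of these pairings on made-up data. pd.crosstab builds the contingency table for two categorical variables. For categorical vs numerical, a per-group median is shown as a numeric stand-in for what a boxplot would draw (an illustrative substitute, not the chart itself).

```python
import pandas as pd

# Hypothetical data: two categorical variables and one numerical variable.
df = pd.DataFrame({
    "city": ["Lisbon", "Porto", "Lisbon", "Porto", "Lisbon", "Porto"],
    "owns_car": ["yes", "no", "no", "yes", "yes", "no"],
    "salary": [52000, 48000, 61000, 45000, 58000, 50000],
})

# Categorical vs categorical: counts for each combination of values.
contingency = pd.crosstab(df["city"], df["owns_car"])

# Categorical vs numerical: the median that a boxplot marks per group.
salary_by_city = df.groupby("city")["salary"].median()

print(contingency)
print(salary_by_city)
```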
    c. Multivariate analysis: correlation

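    Correlation measures how strongly variables are related. It is most often
    computed between quantitative variables; rank-based coefficients also work
    for ordered categories.
    A heatmap of the correlation matrix, for example seaborn's heatmap() applied
    to df.corr(), shows the relationships between numeric variables at a glance.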
    
    There are different methods to calculate the correlation coefficient:
        1. Pearson
        2. Kendall
        3. Spearman
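
A minimal sketch of how the choice of method changes the result, using pandas' DataFrame.corr on invented data where score grows with hours but not in a straight line. Pearson measures linear association, while Spearman works on ranks and so only looks at whether the relationship is monotonic. method="kendall" is accepted the same way; it is left out of the sketch only to avoid an extra dependency on SciPy, which pandas appears to rely on for that method.

```python
import pandas as pd

# Hypothetical data: score is the square of hours (monotonic, not linear).
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [1, 4, 9, 16, 25],
})

# Correlation matrices: one row and one column per numeric variable.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson)
print(spearman)
```

On this data the Spearman coefficient is 1 because the ranks agree perfectly, while the Pearson coefficient is slightly below 1 because the points do not lie on a straight line. Either matrix is what a heatmap would take as input.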