Quick and Easy Data Analysis in Python

Fahad Patel
4 min read · Nov 4, 2020

Exploratory Data Analysis (EDA) is the first and one of the most important steps in any data project: it is where we discover patterns and trends in the dataset. In EDA we analyze the dataset and summarize its main characteristics, often using visual methods. It matters because if you are not familiar with the dataset you are working on, you won’t be able to infer much from it. However, EDA generally takes a lot of time. Today, I am going to show you a quick and easy way to do it with just a few lines of Python.

In this article, we will automate EDA using two libraries:

  1. Sweetviz
  2. Pandas Profiling

These are Python libraries that generate beautiful, high-density visualizations to kick-start your EDA. Let us first explore Sweetviz in detail, and later we will move on to Pandas Profiling.

Installing Sweetviz

Like any other Python library, Sweetviz can be installed with the pip install command given below.

pip install sweetviz

Analyzing Dataset

In this article, we will be using an advertising dataset that contains 4 attributes and 200 rows. First, we need to load the dataset using pandas.

import pandas as pd
df = pd.read_csv('Advertising.csv')
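Before generating any report, it can help to confirm that the file loaded as expected. Here is a minimal sketch of such a check using plain pandas. The helper name quick_overview and the column names in the stand-in frame are assumptions made for illustration; they are not taken from Advertising.csv.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its dtype and number of distinct non-null values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "unique": df.nunique(),
    })


# Tiny stand-in frame; the column names are hypothetical, not read from Advertising.csv
sample = pd.DataFrame({"TV": [230.1, 44.5, 17.2], "sales": [22.1, 10.4, 10.4]})

print(sample.shape)            # (rows, columns)
print(quick_overview(sample))  # dtype and distinct-value count per column
```

On the real data, calling quick_overview(df) and checking df.shape against the expected 200 rows and 4 attributes is a quick way to catch a wrong file or a bad delimiter.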

Sweetviz has a function named analyze() which analyzes the whole dataset and provides a detailed report with visualizations.

Let’s analyze our dataset using the code given below.

# import sweetviz
import sweetviz as sv
# analyze the dataset
advert_report = sv.analyze(df)
# save the report as HTML and open it in the browser
advert_report.show_html('Advertising.html')

And there we go: our EDA report is ready and contains a lot of information for every attribute. It’s easy to understand and takes just 3 lines of code.
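Sweetviz can also compare two dataframes side by side through its compare() function, which goes a little beyond what this article covers. The sketch below is only one way to try it: split_frame and compare_report are hypothetical helper names, the 80/20 split is an arbitrary choice, and the output file name is an assumption.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, frac: float = 0.8, seed: int = 0):
    """Randomly split a frame into two parts (hypothetical helper; assumes a unique index)."""
    first = df.sample(frac=frac, random_state=seed)
    second = df.drop(first.index)
    return first, second


def compare_report(df: pd.DataFrame, target=None, path="Advertising_compare.html"):
    """Build a Sweetviz comparison report of the two parts and save it as HTML."""
    # Imported here so that split_frame can be used without Sweetviz installed
    import sweetviz as sv

    first, second = split_frame(df)
    report = sv.compare([first, "Train"], [second, "Test"], target_feat=target)
    report.show_html(path)
    return report
```

With the dataframe loaded earlier, compare_report(df) would produce a single report showing both parts next to each other. Passing the name of a column as target highlights how the other attributes relate to it.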

Now, let us move on to Pandas Profiling.

Installing Pandas Profiling

As we did for Sweetviz, we need to install pandas-profiling using the pip install command given below.

pip install pandas-profiling

Now let’s use this library on a Kaggle dataset (cervical cancer risk classification) and walk through the output. Using the code snippet below, I generated a detailed report of the data with pandas-profiling’s ProfileReport class.

# import the libraries
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
# load the data from Kaggle
train1 = pd.read_csv("/kaggle/input/cervical-cancer-risk-classification/kag_risk_factors_cervical_cancer.csv")
# data cleaning: replace the '?' placeholders with NaN
train2 = train1.replace('?', np.nan)
# create the profile report
report = ProfileReport(train2, title='Profile Report', html={'style': {'full_width': True}})
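One thing to keep in mind about the cleaning step: replacing '?' with NaN does not by itself change a column's dtype, so a column of numbers stored as text may still be treated as text. Here is a minimal sketch of one way to handle that with plain pandas. The helper name clean_question_marks and the sample column names are assumptions for illustration, not part of the Kaggle notebook.

```python
import numpy as np
import pandas as pd


def clean_question_marks(df: pd.DataFrame) -> pd.DataFrame:
    """Replace '?' with NaN, then convert each column to numeric if all its values parse."""
    cleaned = df.replace("?", np.nan)
    for col in cleaned.columns:
        try:
            cleaned[col] = pd.to_numeric(cleaned[col])
        except (ValueError, TypeError):
            pass  # leave genuinely non-numeric columns unchanged
    return cleaned


# Stand-in data with hypothetical column names
raw = pd.DataFrame({"Age": ["18", "?", "34"], "Note": ["a", "?", "b"]})
tidy = clean_question_marks(raw)
print(tidy.dtypes)
```

The cleaned frame can then be passed to ProfileReport in place of train2.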

Here is a snapshot of the output:

Overview of Dataset

As you can see from the snapshot, you get all the important inferences about the data in one go. This is just the Overview tab. You can dig deeper into each variable’s characteristics by clicking the Variables tab.

Variables of Dataset

Here we get a description of the data and its distribution, given separately for each variable. Next is the Correlations tab. Five types of correlations are provided for the variables. You can analyse each correlation to understand the relationship between the target and the independent variables.

Correlations Tab
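If you want to check a few of these numbers by hand, pandas can compute correlation matrices directly. This is a minimal sketch, not what the profile report does internally: the helper name correlation_tables is an assumption, and it covers only the Pearson and Spearman methods on numeric columns.

```python
import pandas as pd


def correlation_tables(df: pd.DataFrame, methods=("pearson", "spearman")) -> dict:
    """Return one correlation matrix per method, computed on the numeric columns only."""
    numeric = df.select_dtypes(include="number")
    return {method: numeric.corr(method=method) for method in methods}


# Small made-up frame for illustration
sample = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

tables = correlation_tables(sample)
print(tables["pearson"].round(2))
print(tables["spearman"].round(2))
```

Pearson measures linear relationships, while Spearman works on ranks and so also picks up monotonic relationships that are not linear.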

The next tab is for missing values. The missing value analysis is shown in several output formats. The Count bar chart provides a quick look at the number of missing values for each variable. There are also Matrix, Heatmap, and Dendrogram views that provide a nice pictorial representation of all the missing values in the data.

Missing values Tab
Heatmap
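The same counts that the Count bar chart displays can be obtained as a table with plain pandas. Here is a minimal sketch under stated assumptions: the helper name missing_summary and the sample columns are hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of all rows."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)


# Small made-up frame for illustration
sample = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [np.nan, 2, 3, 4],
    "c": [1, 2, 3, 4],
})

print(missing_summary(sample))
```

A table like this is handy when you need the exact numbers, for example to decide which columns to drop or impute.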

The last tab in the profile report provides a Sample of the first and last few rows of the data set.

Sample Tab

Overall, both libraries are excellent and reduce the effort involved in data exploration, as all the key EDA outcomes are part of the generated report. I would suggest running both libraries on the same dataset and comparing the results. Based on these reports, further data exploration can be performed.

Before You Go

Thanks for reading! If you want to get in touch with me, feel free to reach me at fahadpatel1403@gmail.com or on my LinkedIn profile. You can also view the code and data I have used on my GitHub.
