EDA with Python: Techniques & Tools

The document provides a comprehensive guide on Exploratory Data Analysis (EDA) fundamentals, including its significance, steps, and software tools. It includes practical experiments for downloading datasets, installing necessary Python libraries, and performing data manipulation using pandas and numpy. Additionally, it covers various visualization techniques and case studies to enhance data analysis skills.

Uploaded by

kataruraj9
Copyright: © All Rights Reserved

UNIT-I

Exploratory Data Analysis Fundamentals: Understanding data science, The significance of EDA,
Steps in EDA, Making sense of data, Numerical data, Categorical data, Measurement scales,
Comparing EDA with classical and Bayesian analysis, Software tools available for EDA, Getting
started with EDA.

Sample Experiments:
1. a) Download Dataset from Kaggle using the following link:
https://www.kaggle.com/datasets/sukhmanibedi/cars4u
b) Install Python libraries required for Exploratory Data Analysis (numpy, pandas,
matplotlib, seaborn)
2. Perform NumPy array basic operations and explore NumPy built-in functions.
3. Loading Dataset into pandas dataframe
4. Selecting rows and columns in the dataframe

1. a) Download Dataset from Kaggle using the following link:

https://www.kaggle.com/datasets/sukhmanibedi/cars4u

Method 1:
Manually Download from Kaggle
1. Go to the dataset link (Cars4U Dataset).
2. Click the Download button.
3. Extract the downloaded ZIP file to access the CSV file(s).

Method 2:
Use Kaggle API in Python
If you prefer downloading via Python, follow these steps:

Step 1: Install Kaggle API


If you haven’t installed the Kaggle API yet, run:
pip install kaggle

Step 2: Get Kaggle API Credentials

1. Go to your Kaggle Account Settings (Kaggle API Section).
2. Click on Create New API Token to download kaggle.json.
3. Move kaggle.json to the directory where the Kaggle client looks for it (~/.kaggle/ on
Linux/Mac or C:\Users\YourUsername\.kaggle\ on Windows).

Exploratory Data Analysis with Python 1


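Step 3: Download the Dataset

Run the following command in a Jupyter Notebook cell. The leading ! passes the line to the
shell; in a terminal, run the same command without the !.

# Download the dataset and unzip it into the current directory
!kaggle datasets download -d sukhmanibedi/cars4u --unzip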

1. b) Install Python libraries required for Exploratory Data Analysis (numpy, pandas,
matplotlib, seaborn)

Step 1:

Check if Python is Installed

Before installing the required libraries, ensure that Python is installed on your system.

• Open the terminal (Linux/Mac) or command prompt (Windows).
• Type the following command to check the installed Python version:
python --version (OR) python3 --version
If Python is not installed, download and install it from Python's official website.

Step 2:
Install pip (If Not Installed)
pip is the package manager for Python. Most modern Python installations come with pip
pre-installed. To check if pip is installed, run:

pip --version

If pip is missing, install it with:

python -m ensurepip --default-pip

To upgrade pip to the latest version, run:

python -m pip install --upgrade pip

Step 3:

Install Required Libraries (NumPy, Pandas, Matplotlib, Seaborn)

To install the required libraries, run the following command:

pip install numpy pandas matplotlib seaborn

If you are using Jupyter Notebook, run this inside a code cell:

!pip install numpy pandas matplotlib seaborn

Step 4:

Verify the Installation

After installation, verify that the libraries are correctly installed by running the following
command:

import numpy

import pandas

import matplotlib

import seaborn

print("All libraries installed successfully!")

If you do not get any errors, the libraries are successfully installed.

2. Perform Numpy Array basic operations and Explore Numpy Built-in functions
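
For the NumPy part of this experiment, a minimal sketch of basic array operations and a few built-in functions might look like the following. The array values are made up for illustration and are not taken from the Cars4U dataset.

```python
import numpy as np

# Illustrative arrays; the values are made up for this sketch.
a = np.array([1, 2, 3, 4, 5, 6])
m = a.reshape(2, 3)                      # 2 rows x 3 columns

# Basic attributes
print(m.shape, m.ndim, m.dtype)

# Element-wise arithmetic and broadcasting
doubled = m * 2
shifted = m + np.array([10, 20, 30])     # the row vector is broadcast across both rows

# Indexing and slicing
first_row = m[0]                         # first row
last_column = m[:, -1]                   # last column of every row

# A few built-in functions
total = m.sum()                          # sum of all elements
col_means = m.mean(axis=0)               # mean of each column
evens = a[a % 2 == 0]                    # boolean mask selects the even values
zeros = np.zeros((2, 2))                 # 2 x 2 array of zeros
steps = np.linspace(0, 1, 5)             # 5 evenly spaced values from 0 to 1
```

The same pattern (create, reshape, slice, aggregate) carries over to the numeric columns of a pandas dataframe, which are backed by NumPy arrays.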

class Disease:
    def __init__(self, disease='Depression'):
        self.type = disease

    def getName(self):
        print("Mental Health Diseases: {0}".format(self.type))

d1 = Disease('Social Anxiety Disorder')
d1.getName()

OUTPUT

Mental Health Diseases: Social Anxiety Disorder

# Try/except block
# The try block raises a NameError because y is not defined:

try:
    print(y)
except NameError:
    print("Well, the variable y is not defined")
except:
    print("OMG, Something else went wrong")

OUTPUT

Well, the variable y is not defined

try:
    Value = int(input("Type a number between 1 and 10:"))
except ValueError:
    print("You must type a number between 1 and 10!")
else:
    if (Value > 0) and (Value <= 10):
        print("You typed: ", Value)
    else:
        print("The value you typed is incorrect!")

OUTPUT

Type a number between 1 and 10:333


The value you typed is incorrect!

3. Loading Dataset into pandas dataframe

PROGRAM

import pandas as pd

# The file has no header row, so the column names are supplied explicitly
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'ethnicity',
           'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
           'country_of_origin', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 names=columns)
df.head(10)

4. Selecting rows and columns in the data frame

# Selects the row at position 10 (the 11th row)
df.iloc[10]

# Selects the first 10 rows
df.iloc[0:10]

# Selects a range of rows (positions 10 to 14)
df.iloc[10:15]

# Selects the last 2 rows
df.iloc[-2:]

# Selects every other row, restricted to the columns at positions 3 and 4
df.iloc[::2, 3:5].head()
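
The calls above select by integer position. Columns can also be selected by name, and rows by label or by condition. The sketch below shows these variants on a small hypothetical frame (the `people` frame and its values are assumptions made only to keep the example self-contained).

```python
import pandas as pd

# Hypothetical three-column frame, used only to keep the sketch self-contained.
people = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th'],
    'hours_per_week': [40, 13, 40, 40],
})

one_column = people['age']                          # a single column, returned as a Series
two_columns = people[['age', 'education']]          # a list of names returns a DataFrame
by_label = people.loc[1:2, ['age', 'education']]    # loc slices include the end label
by_position = people.iloc[1:3, 0:2]                 # iloc slices exclude the end position
over_forty = people[people['age'] > 40]             # boolean filtering on a condition
```

Note that `loc[1:2]` and `iloc[1:3]` return the same two rows here, because label slices are inclusive while position slices are not.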

import pandas as pd
import numpy as np

np.random.seed(24)
df = pd.DataFrame({'F': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 5), columns=list('EDCBA'))],
axis=1)
df.iloc[0, 2] = np.nan
df

# Define a function that colors values less than 0 red, values greater than 0 black,
# and everything else (zero or NaN) green
def colorNegativeValueToRed(value):
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'black'
    else:
        color = 'green'

    return 'color: %s' % color

# Note: pandas 2.1 and later rename Styler.applymap to Styler.map
s = df.style.applymap(colorNegativeValueToRed, subset=['A','B','C','D','E'])
s

# Let us highlight the max value in each column with an orange background and the min value
# with a green background
def highlightMax(s):
    isMax = s == s.max()
    return ['background-color: orange' if v else '' for v in isMax]

def highlightMin(s):
    isMin = s == s.min()
    return ['background-color: green' if v else '' for v in isMin]

# Note: pandas versions before 1.5 spell this argument null_color='red'
df.style.apply(highlightMax).apply(highlightMin).highlight_null(color='red')

import seaborn as sns

cm = sns.light_palette("pink", as_cmap=True)

s = df.style.background_gradient(cmap=cm)
s

UNIT-II
Visual Aids for EDA: Technical requirements, Line chart, Bar charts, Scatter plot using seaborn,
Polar chart, Histogram, Choosing the best chart
Case Study: EDA with Personal Email, Technical requirements, Loading the dataset, Data
transformation, Data cleansing, Applying descriptive statistics, Data refactoring, Data analysis.

Sample Experiments:

1. Apply different visualization techniques using sample dataset
a) Line Chart b) Bar Chart c) Scatter Plots d) Bubble Plot
2. Generate Scatter Plot using seaborn library for iris dataset
3. Apply following visualization Techniques for a sample dataset
a) Area Plot b) Stacked Plot c) Pie chart d) Table Chart

1. Apply different visualization techniques using sample dataset

In this chapter we learn about different visualization techniques using a simple data set.

# radar and Faker are third-party packages: pip install radar Faker
import datetime
import math
import pandas as pd
import random
import radar
from faker import Faker

fake = Faker()

# Build n random (date, price) pairs in August 2019 and average the prices per day
def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        date = radar.random_datetime(start='2019-08-1', stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    df = df.groupby(by='Date').mean()

    return df

a) Line Chart

df = generateData(50)
df.head(10)
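
The listing above stops after building the data frame. A minimal sketch of the plotting step is given below. To stay runnable on its own it does not call generateData(); instead it builds a stand-in frame with pandas and the standard library, so the `prices` frame and the figure labels are assumptions of this sketch.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for generateData(): one random price per day in August 2019.
random.seed(0)
dates = pd.date_range('2019-08-01', '2019-08-30', freq='D')
prices = pd.DataFrame({'Price': [round(random.uniform(900, 1000), 4) for _ in dates]},
                      index=dates)
prices.index.name = 'Date'

# Draw the line chart: dates on the x-axis, prices on the y-axis.
fig, ax = plt.subplots(figsize=(10, 4))
line, = ax.plot(prices.index, prices['Price'])
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.set_title('Price per day')
fig.autofmt_xdate()    # tilt the date labels so they do not overlap
# In a notebook, drop the matplotlib.use("Agg") line and call plt.show() here.
```

With the frame returned by generateData(50), the same `ax.plot(df.index, df['Price'])` call should apply, since that frame is also indexed by date.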

b) Bar Chart

# Let us import the required libraries
import random
import numpy as np
import calendar
import matplotlib.pyplot as plt

# Step 1: Set up the data. Remember that the range stopping parameter is exclusive: if you
# generate a range from (1, 13), the last item, 13, is not included.
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

# Step 2: Specify the layout of the figure and allocate space.
figure, axis = plt.subplots()

# Step 3: On the X-axis, we would like to display the names of the months.
plt.xticks(months, calendar.month_name[1:13], rotation=20)

# Step 4: Plot the graph
plot = axis.bar(months, sold_quantity)

# Step 5: This step is optional, depending on whether you want to display the data
# value on the head of the bar.
# Showing the actual number of sold items on the bar itself gives the chart more meaning.
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height, '%d' % int(height),
              ha='center', va='bottom')

# Step 6: Display the graph on the screen.
plt.show()

# Horizontal bar chart: the same kind of data drawn with barh()

# Step 1: Set up the data. Remember that the range stopping parameter is exclusive: if you
# generate a range from (1, 13), the last item, 13, is not included.
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

# Step 2: Specify the layout of the figure and allocate space.
figure, axis = plt.subplots()

# Step 3: On the Y-axis, we would like to display the names of the months.
plt.yticks(months, calendar.month_name[1:13], rotation=20)

# Step 4: Plot the graph
plot = axis.barh(months, sold_quantity)

# Step 5: This step is optional, depending on whether you want to display the data
# value at the end of the bar.
# Showing the actual number of sold items on the bar itself gives the chart more meaning.
for rectangle in plot:
    width = rectangle.get_width()
    axis.text(width + 2.5, rectangle.get_y() + 0.38, '%d' % int(width), ha='center', va='bottom')

# Step 6: Display the graph on the screen.
plt.show()

c) Scatter Plots

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

headers_cols = ['age', 'min_recommended', 'max_recommended', 'may_be_appropriate_min',
                'may_be_appropriate_max', 'min_not_recommended', 'max_not_recommended']

# One entry per age class: label, first month, last month (inclusive), followed by the six
# sleep-hour values in the order of headers_cols[1:]. Ages are in months.
ageClasses = [
    ('newborns(0-3)',                  0,   3, 14, 17, 11, 13, 11, 19),
    ('infants(4-11)',                  4,  11, 12, 15, 10, 11, 10, 18),
    ('toddlers(12-24)',               12,  24, 11, 14,  9, 10,  9, 16),
    ('preschoolers(36-60)',           36,  60, 10, 13,  8,  9,  8, 14),
    ('school-aged-children(72-156)',  72, 156,  9, 11,  7,  8,  7, 12),
    ('teenagers(168-204)',           168, 204,  8, 10,  7, 11,  7, 11),
    ('young-adults(216-300)',        216, 300,  7,  9,  6, 11,  6, 11),
    ('adults(312-768)',              312, 768,  7,  9,  6, 10,  6, 10),
    # The original listing fills months 769 to 779 for this class despite the ">=780" label.
    ('older-adults(>=780)',          769, 779,  7,  8,  5,  6,  5,  9),
]

# Emit one row per month of age, repeating the values of the class it belongs to
sleep = []
for label, first, last, *hours in ageClasses:
    for i in range(first, last + 1):
        sleep.append([i, *hours])

sleepDf = pd.DataFrame(sleep, columns=headers_cols)
sleepDf.head(10)
sleepDf.to_csv(r'sleep_vs_age.csv')

sns.set()

# A regular scatter plot (age converted from months to years)
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf['max_recommended'])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()

d) Bubble Plot

# Load the Iris dataset


df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1, "virginica": 2})

# Create bubble plot


plt.scatter(df.petal_length, df.petal_width,
s=50*df.petal_length*df.petal_width,
c=df.species,
alpha=0.3
)

# Create labels for the axes


plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()

2. Generate Scatter Plot using seaborn library for iris dataset

df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1, "virginica": 2})


sns.scatterplot(x=df["sepal_length"], y=df["sepal_width"], hue=df.species, data=df)

3. Apply following visualization Techniques for a sample dataset
a) Area Plot and Stacked Plot

houseLoanMortage = [9000, 9000, 8000, 9000,
                    8000, 9000, 9000, 9000,
                    9000, 8000, 9000, 9000]
utilitiesBills = [4218, 4218, 4218, 4218,
                  4218, 4218, 4219, 2218,
                  3218, 4233, 3000, 3000]
transportation = [782, 900, 732, 892,
                  334, 222, 300, 800,
                  900, 582, 596, 222]
carMortage = [700, 701, 702, 703,
              704, 705, 706, 707,
              708, 709, 710, 711]

import matplotlib.pyplot as plt
import seaborn as sns

months = [x for x in range(1, 13)]

sns.set()
# Empty line plots only register the legend entries with matching colors
plt.plot([], [], color='sandybrown', label='houseLoanMortage')
plt.plot([], [], color='tan', label='utilitiesBills')
plt.plot([], [], color='bisque', label='transportation')
plt.plot([], [], color='darkcyan', label='carMortage')

plt.stackplot(months, houseLoanMortage, utilitiesBills, transportation, carMortage,
              colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])
plt.legend()

plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')

plt.show()

c) Pie chart

# URL of the CSV file (alternatively this can be a file path)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'

# Load the CSV file into a data frame, using the 'type' column as the index
pokemon = pd.read_csv(url, index_col='type')
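
The listing loads the data but stops before drawing the chart. A minimal sketch of the plotting step follows. Because the column layout of pokemonByType.csv is not shown here, the sketch uses hypothetical counts in a Series named `counts`; both the name and the numbers are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts per type; the real values would come from the loaded CSV file.
counts = pd.Series({'Water': 105, 'Normal': 93, 'Grass': 66, 'Bug': 63, 'Fire': 47},
                   name='amount')

fig, ax = plt.subplots(figsize=(6, 6))
# autopct prints each wedge's share as a percentage with one decimal place
wedges, labels, pct_labels = ax.pie(counts.values, labels=list(counts.index),
                                    autopct='%1.1f%%', startangle=90)
ax.set_title('Share of each type')
ax.axis('equal')       # equal aspect ratio keeps the pie circular
# In a notebook, drop the matplotlib.use("Agg") line and call plt.show() here.
```

A pie chart works best with a handful of categories; with many small slices a bar chart is usually easier to read.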

d) Table Chart

# Years under consideration
years = ["2010", "2011", "2012", "2013", "2014"]

# Available watt
columns = ['4.5W', '6.0W', '7.0W', '8.5W', '9.5W', '13.5W', '15W']
unitsSold = [
    [65, 141, 88, 111, 104, 71, 99],
    [85, 142, 89, 112, 103, 73, 98],
    [75, 143, 90, 113, 89, 75, 93],
    [65, 144, 91, 114, 90, 77, 92],
    [55, 145, 92, 115, 88, 79, 93],
]

# Define the range and scale for the y axis
values = np.arange(0, 600, 100)

colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))
index = np.arange(len(columns)) + 0.3
bar_width = 0.7

y_offset = np.zeros(len(columns))
fig, ax = plt.subplots()

cell_text = []

n_rows = len(unitsSold)
for row in range(n_rows):
    # Stack this year's bars on top of the running totals of the previous years
    plot = plt.bar(index, unitsSold[row], bar_width, bottom=y_offset,
                   color=colors[row])
    y_offset = y_offset + unitsSold[row]
    cell_text.append(['%1.1f' % (x) for x in y_offset])
    i = 0
    # Each iteration of this inner loop labels one bar with the running total for the given year
    for rect in plot:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, y_offset[i], '%d' % int(y_offset[i]),
                ha='center', va='bottom')
        i = i + 1

# Add a table to the bottom of the axes
the_table = plt.table(cellText=cell_text, rowLabels=years,
                      rowColours=colors, colLabels=columns, loc='bottom')
plt.ylabel("Units Sold")
plt.xticks([])
plt.title('Number of LED Bulb Sold/Year')
plt.show()

UNIT-III
Data Transformation: Merging database-style data frames, Concatenating along with an axis,
Merging on index, Reshaping and pivoting, Transformation techniques, Handling missing data,
Mathematical operations with NaN, Filling missing values, Discretization and binning, Outlier
detection and filtering, Permutation and random sampling, Benefits of data
Transformation, Challenges.
Sample Experiments:
1. Perform the following operations
a) Merging Data frames
import pandas as pd
import numpy as np

dataFrame1 = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                           'Score': [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                           'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

# Combine the two frames row-wise with the pandas concat() method;
# ignore_index=True renumbers the rows of the result from 0.
dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe
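
concat() stacks frames on top of each other. Database-style merging, which joins rows on a shared key, is done with pd.merge(). The sketch below uses two small hypothetical frames (`maths` and `physics`, with made-up scores) to show the inner, left and outer variants.

```python
import pandas as pd

# Hypothetical frames: scores in two subjects for overlapping sets of students.
maths = pd.DataFrame({'StudentID': [1, 3, 5, 7], 'ScoreMaths': [89, 39, 50, 97]})
physics = pd.DataFrame({'StudentID': [3, 5, 7, 9], 'ScorePhysics': [45, 61, 72, 88]})

inner = pd.merge(maths, physics, on='StudentID', how='inner')   # IDs present in both frames
left = pd.merge(maths, physics, on='StudentID', how='left')     # every row of maths is kept
outer = pd.merge(maths, physics, on='StudentID', how='outer')   # union of the IDs
```

Rows without a partner in the other frame get NaN in the missing columns, which is why merges are a common source of missing data.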

b) Reshaping with Hierarchical Indexing
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim',
'Stavanger', 'Kristiansand'])
dframe1
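
The frame above is the starting point for reshaping. A minimal sketch of the reshaping step itself, assuming the usual stack() and unstack() pair, is shown below; the frame is rebuilt so that the example runs on its own.

```python
import numpy as np
import pandas as pd

data = np.arange(15).reshape((3, 5))
dframe1 = pd.DataFrame(data, index=['Rainfall', 'Humidity', 'Wind'],
                       columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])

# stack() pivots the columns into the inner level of a hierarchical (two-level) row index
stacked = dframe1.stack()

# unstack() moves the inner index level back into columns
restored = stacked.unstack()

# Individual values are addressed with a (row label, column label) pair
humidity_oslo = stacked.loc[('Humidity', 'Oslo')]
```

The stacked result is a Series with one entry per (measurement, city) pair, which is often the more convenient shape for grouping and plotting.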

c) Data Deduplication
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4, 'column 2': [10, 10, 22,
23, 23, 24, 24]})
frame3
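
With the frame in place, duplicates can be flagged and dropped. The following sketch rebuilds frame3 and applies duplicated() and drop_duplicates(); the variable names for the results are illustrative.

```python
import pandas as pd

frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})

is_duplicate = frame3.duplicated()        # True for rows that repeat an earlier row
deduplicated = frame3.drop_duplicates()   # keeps the first occurrence of each row

# Deduplicate on one column only, keeping the last occurrence of each value
keep_last = frame3.drop_duplicates(subset=['column 2'], keep='last')
```

By default both methods compare all columns; the subset argument narrows the comparison to the listed columns.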

d) Replacing Values
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332.,
3332. ], 'column 2': range(9)})
replaceFrame.replace(to_replace =-786, value= np.nan)

2. Apply different Missing Data handling techniques

data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx

dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
# Use .loc for the assignment; chained indexing (dfx['store4']['apple'] = 20.) may not update dfx
dfx.loc['apple', 'store4'] = 20.
dfx

a)NaN values in mathematical Operations
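
A minimal sketch of how NaN behaves in arithmetic and in pandas aggregations. It uses a small stand-in frame named `sales` (an assumption of this sketch, smaller than dfx) so that the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with one missing value per column.
sales = pd.DataFrame({'store1': [15.0, 18.0, np.nan], 'store2': [16.0, np.nan, 22.0]},
                     index=['apple', 'banana', 'kiwi'])

nan_sum = np.nan + 1                        # any arithmetic with NaN yields NaN
missing_per_column = sales.isnull().sum()   # count of NaN values per column
column_totals = sales.sum()                 # pandas skips NaN by default (skipna=True)
strict_totals = sales.sum(skipna=False)     # NaN propagates when skipna is disabled
column_means = sales.mean()                 # NaN entries are left out of the average
```

The default skipping is convenient but can hide gaps, so counting the missing values first is a useful habit.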

b) Filling in missing data
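
A minimal sketch of filling missing values with fillna(), again on a small stand-in frame named `sales` (an assumption of this sketch). It shows a single constant, the per-column mean, and a different constant per column.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({'store1': [15.0, 18.0, np.nan], 'store2': [16.0, np.nan, 22.0]},
                     index=['apple', 'banana', 'kiwi'])

filled_zero = sales.fillna(0)                                   # one constant everywhere
filled_mean = sales.fillna(sales.mean())                        # each column's own mean
filled_per_column = sales.fillna({'store1': 0, 'store2': -1})   # a constant per column
```

fillna() returns a new frame and leaves `sales` unchanged, so the three variants can be compared side by side.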

c) Forward and Backward filling of missing values
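
A minimal sketch of forward and backward filling on a hypothetical Series `s`. Forward filling copies the last valid value downwards, backward filling copies the next valid value upwards.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # each gap takes the previous valid value
backward = s.bfill()         # each gap takes the next valid value; the last one has none
limited = s.ffill(limit=1)   # fill at most one consecutive gap
```

Values at the very start (for ffill) or the very end (for bfill) stay missing because there is nothing to copy from.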

d) Filling with index values

e) Interpolation of missing values
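
A minimal sketch of interpolation on a hypothetical Series `s`. With the default linear method, pandas places the missing points on a straight line between the neighbouring valid values.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, np.nan, 9.0])

# Linear interpolation: the two gaps are spaced evenly between 3.0 and 9.0
linear = s.interpolate()
```

Interpolation suits ordered data such as time series; for unordered records a mean or constant fill is usually the safer choice.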

3. Apply different data transformation techniques
a) Renaming axis indexes
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim',
'Stavanger', 'Kristiansand'])
dframe1

# Say, you want to transform the index terms to capital letter.


dframe1.index = dframe1.index.map(str.upper)
dframe1

dframe1.rename(index=str.title, columns=str.upper)

b) Discretization and Binning
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

bins = [118, 125, 135, 160, 200]

category = pd.cut(height, bins)

category

# Count the values in each bin (pd.value_counts() is deprecated in recent pandas versions)
category.value_counts()

category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)

category2

c) Permutation and Random Sampling
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)

df

sampler = np.random.permutation(10)
sampler

df.take(sampler)

d) Dummy variables

df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'], 'votes':


range(6, 12, 1)})
df

pd.get_dummies(df['gender'])

UNIT-IV
Descriptive Statistics: Distribution function, Measures of central tendency, Measures of
dispersion, Types of kurtosis, Calculating percentiles, Quartiles, Grouping Datasets, Correlation,
Understanding univariate, bivariate, multivariate analysis, Time Series Analysis
Sample Experiments:
1. Study the following Distribution Techniques on a sample data
a) Uniform Distribution
b) Normal Distribution
c) Gamma Distribution
d) Exponential Distribution
e) Poisson Distribution
f) Binomial Distribution
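
A minimal sketch of drawing samples from the six distributions listed above with NumPy's random generator. The parameter values (bounds, rates, shape and so on) are arbitrary choices for illustration, not values prescribed by the experiment.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seeded so the draws are repeatable
n = 10_000                             # number of samples per distribution

samples = {
    'uniform': rng.uniform(low=0, high=10, size=n),
    'normal': rng.normal(loc=0, scale=1, size=n),
    'gamma': rng.gamma(shape=5, scale=1, size=n),
    'exponential': rng.exponential(scale=1, size=n),
    'poisson': rng.poisson(lam=3, size=n),
    'binomial': rng.binomial(n=10, p=0.8, size=n),
}

# Compare the sample mean and variance of each distribution
for name, values in samples.items():
    print(f"{name:12s} mean={values.mean():.3f}  var={values.var():.3f}")
```

Passing any of these arrays to a histogram function such as plt.hist() or sns.histplot() shows the characteristic shape of the distribution.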

2. Perform Data Cleaning on a sample dataset.

# df is assumed to hold the automobile dataset, with missing values recorded as '?'

# Find out the number of values which are not numeric
df['price'].str.isnumeric().value_counts()

# List out the values which are not numeric
df['price'].loc[df['price'].str.isnumeric() == False]

# Set the missing values to the mean of price and convert the datatype to integer
price = df['price'].loc[df['price'] != '?']
pmean = price.astype(str).astype(int).mean()
df['price'] = df['price'].replace('?', pmean).astype(int)
df['price'].head()

# Clean the horsepower field in the same way
df['horsepower'].str.isnumeric().value_counts()
horsepower = df['horsepower'].loc[df['horsepower'] != '?']
hpmean = horsepower.astype(str).astype(int).mean()
df['horsepower'] = df['horsepower'].replace('?', hpmean).astype(int)
df['horsepower'].head()

3. Compute measure of Central Tendency on a sample dataset
a) Mean
b)Median
c)Mode
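
A minimal sketch of the three measures of central tendency using pandas. The `heights` values are hypothetical; with the automobile dataset the same methods would be called on a column such as df["height"].

```python
import pandas as pd

# Hypothetical sample; chosen so the results are easy to verify by hand.
heights = pd.Series([150, 152, 152, 155, 160, 160, 160, 171])

mean_height = heights.mean()       # arithmetic average
median_height = heights.median()   # middle value of the sorted data
mode_height = heights.mode()       # most frequent value(s), returned as a Series
```

mode() returns a Series because a sample can have more than one most frequent value.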

4. Explore Measures of Dispersion on a sample dataset


a) Variance
# Variance of every numeric column using the var() function
# (numeric_only=True skips the text columns)
variance = df.var(numeric_only=True)
print(variance)
# Variance of a specific column
var_height = df.loc[:, "height"].var()
print(var_height)

b) Standard Deviation

c) Skewness
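
A minimal sketch of standard deviation and skewness with pandas, on a hypothetical `heights` sample. With the automobile dataset the same methods would be called on a column such as df["height"].

```python
import pandas as pd

# Hypothetical sample with one large value that stretches the right tail.
heights = pd.Series([150, 152, 152, 155, 160, 160, 160, 171])

std_height = heights.std()     # sample standard deviation (divides by n - 1)
skew_height = heights.skew()   # positive when the right tail is longer

# A perfectly symmetric sample has zero skewness
symmetric_skew = pd.Series([1, 2, 3, 4, 5]).skew()
```

The standard deviation is the square root of the variance, so it is expressed in the same unit as the data itself.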

d) Kurtosis
# Kurtosis of every numeric column using the kurt() function
kurtosis = df.kurt(numeric_only=True)
print(kurtosis)

# Kurtosis of a specific column
kurt_height = df.loc[:, "height"].kurt()
print(kurt_height)

4.a) Calculating percentiles on sample dataset

# Calculating the 50th percentile (the median) of heights in the dataset
height = df["height"]
percentile = np.percentile(height, 50)
print(percentile)

b) Calculate Inter Quartile Range (IQR) and Visualize using Box Plots

Quartiles divide the data set into four equal parts.
First quartile = 25th percentile
Second quartile = 50th percentile (Median)
Third quartile = 75th percentile

Based on the quartiles, there is another measure, called the inter-quartile range, that also measures
the variability in the dataset. It is defined as:

IQR = Q3 - Q1

Because it depends only on the middle 50% of the data, the IQR is not affected by extreme outliers.

price = df.price.sort_values()
Q1 = np.percentile(price, 25)
Q2 = np.percentile(price, 50)
Q3 = np.percentile(price, 75)

IQR = Q3 - Q1
IQR

df["normalized-losses"].describe()
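
The box plot part of this experiment can be sketched as follows. The `prices` array is hypothetical, and the 1.5 × IQR fences are the common rule of thumb for flagging outliers rather than something prescribed by the listing above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical prices with one extreme value.
prices = np.array([5118, 6295, 7775, 9095, 10295, 12945, 16500, 18150, 45400])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are commonly treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

# The box spans Q1 to Q3; by default the whiskers extend up to 1.5 * IQR
fig, ax = plt.subplots()
ax.boxplot(prices)
ax.set_ylabel('Price')
# In a notebook, drop the matplotlib.use("Agg") line and call plt.show() here.
```

In the resulting plot the extreme price appears as a separate point above the upper whisker.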

5. Perform the following analysis on automobile dataset.
a) Bivariate analysis b)Multivariate analysis
6. Perform Time Series Analysis on Open Power systems dataset

UNIT-V

Model Development and Evaluation: Unified machine learning workflow, Data preprocessing,
Data preparation, Training sets and corpus creation, Model creation and training, Model
evaluation, Best model selection and evaluation, Model deployment
Case Study: EDA on Wine Quality Data Analysis

Sample Experiments:
1. Perform hypothesis testing using statsmodels library
a) Z-Test
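
A minimal sketch of a one-sample Z-test using the ztest function from statsmodels.stats.weightstats. The sample is generated at random, so its values, the hypothesised mean of 175 and the 0.05 significance level are all assumptions of this sketch.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Hypothetical sample of 200 heights in cm, drawn around a true mean of 177.
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=177, scale=5, size=200)

# Null hypothesis: the population mean equals 175
z_stat, p_value = ztest(sample, value=175)

print("Z statistic = {0:.3f}".format(z_stat))
print("P-value = {}".format(p_value))

if p_value < 0.05:
    print("We reject the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")
```

A Z-test is normally reserved for large samples such as this one; for small samples the T-test is the usual choice.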

b) T-Test
import numpy as np
from scipy.stats import ttest_1samp

height = np.array([172, 184, 174, 168, 174, 183, 173, 173, 184, 179, 171, 173, 181, 183, 172,
                   178, 170, 182, 181, 172, 175, 170, 168, 178, 170, 181, 180, 173, 183, 180,
                   177, 181, 171, 173, 171, 182, 180, 170, 172, 175, 178, 174, 184, 177, 181,
                   180, 178, 179, 175, 170, 182, 176, 183, 179, 177])
height

height_average = np.mean(height)
print("Average height is = {0:.3f}".format(height_average))

# One-sample t-test of the null hypothesis that the mean height is 175
tstat, pval = ttest_1samp(height, 175)

print("P-value = {}".format(pval))

if pval < 0.05:
    print("We are rejecting the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")

2. Develop model and Perform Model Evaluation using different metrics such as prediction
score, R2 Score, MAE Score, MSE Score.
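
A minimal sketch of the evaluation step, assuming a simple linear model fitted with NumPy on hypothetical data. The metrics are computed from their definitions; scikit-learn provides the same quantities as mean_absolute_error, mean_squared_error and r2_score in sklearn.metrics.

```python
import numpy as np

# Hypothetical data: a noisy straight line y = 3x + 2.
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# Hold out every fifth point as the test set
test_mask = np.arange(x.size) % 5 == 0
slope, intercept = np.polyfit(x[~test_mask], y[~test_mask], deg=1)   # fit on the training set

y_true = y[test_mask]
y_pred = slope * x[test_mask] + intercept

mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)   # R2 score

print("MAE = {0:.3f}, MSE = {1:.3f}, R2 = {2:.3f}".format(mae, mse, r2))
```

Evaluating on held-out points rather than on the training data gives a more honest picture of how the model behaves on unseen data.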
