EDA with Python: Techniques & Tools

The document provides a comprehensive guide on Exploratory Data Analysis (EDA) fundamentals, including its significance, steps, and software tools. It includes practical experiments for downloading datasets, installing necessary Python libraries, and performing data manipulation using pandas and numpy. Additionally, it covers various visualization techniques and case studies to enhance data analysis skills.

Uploaded by

kataruraj9

UNIT-I

Exploratory Data Analysis Fundamentals: Understanding data science, The significance of EDA,
Steps in EDA, Making sense of data, Numerical data, Categorical data, Measurement scales,
Comparing EDA with classical and Bayesian analysis, Software tools available for EDA, Getting
started with EDA.

Sample Experiments:
1. a) Download Dataset from Kaggle using the following link:
https://www.kaggle.com/datasets/sukhmanibedi/cars4u
b) Install Python libraries required for Exploratory Data Analysis (numpy, pandas,
matplotlib, seaborn)
2. Perform NumPy Array basic operations and Explore NumPy Built-in functions.
3. Loading Dataset into pandas dataframe
4. Selecting rows and columns in the dataframe

1 A) Download Dataset from Kaggle using the following link:

https://www.kaggle.com/datasets/sukhmanibedi/cars4u

Method 1:
Manually Download from Kaggle
1. Go to the dataset link: Cars4U Dataset.
2. Click the Download button.
3. Extract the downloaded ZIP file to access the CSV file(s).

Method 2:
Use Kaggle API in Python
If you prefer downloading via Python, follow these steps:

Step 1: Install Kaggle API

If you haven’t installed the Kaggle API yet, run:
pip install kaggle

Step 2: Get Kaggle API Credentials

1. Go to your Kaggle Account Settings (Kaggle API Section).
2. Click on Create New API Token to download kaggle.json.
3. Move kaggle.json to the directory where the Kaggle API looks for it (~/.kaggle/ on Linux/Mac or
C:\Users\YourUsername\.kaggle\ on Windows).

Exploratory Data Analysis with Python 1

Step 3: Download the Dataset

Run the following command in a Jupyter Notebook cell (in a terminal, drop the leading "!"):

# Download and unzip the dataset
!kaggle datasets download -d sukhmanibedi/cars4u --unzip

1.b) Install Python libraries required for Exploratory Data Analysis (numpy, pandas,
matplotlib, seaborn)

Step 1:

Check if Python is Installed

Before installing the required libraries, ensure that Python is installed on your system.

- Open the terminal (Linux/Mac) or command prompt (Windows).

- Type the following command to check the installed Python version:
python --version (OR) python3 --version
If Python is not installed, download and install it from Python's official website.

Step 2:
Install pip (If Not Installed)
pip is the package manager for Python. Most modern Python installations come with pip pre-installed. To check if pip is installed, run:

pip --version

If pip is not installed, install it with:

python -m ensurepip --default-pip

To upgrade pip to the latest version, run:

python -m pip install --upgrade pip

Step 3:

Install Required Libraries (NumPy, Pandas, Matplotlib, Seaborn)

To install the required libraries, run the following command:

pip install numpy pandas matplotlib seaborn

If you are using Jupyter Notebook, run this inside a code cell:

!pip install numpy pandas matplotlib seaborn


Step 4:

Verify the Installation

After installation, verify that the libraries are correctly installed by running the following
Python code:

import numpy
import pandas
import matplotlib
import seaborn

print("All libraries installed successfully!")

If you do not get any errors, the libraries are successfully installed.

2. Perform Numpy Array basic operations and Explore Numpy Built-in functions
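The listings that follow in this experiment cover classes and exception handling rather than NumPy itself. As a minimal sketch of the array operations and built-in functions the experiment title refers to (the array values are chosen only for illustration):

```python
import numpy as np

# Create arrays from a list and with built-in constructors
a = np.array([1, 2, 3, 4, 5, 6])
zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
steps = np.arange(0, 10, 2)       # values 0, 2, 4, 6, 8
grid = np.linspace(0, 1, 5)       # 5 evenly spaced points from 0 to 1

# Inspect basic attributes
print(a.shape, a.ndim, a.dtype)

# Reshape, index, and slice
m = a.reshape(2, 3)               # rows [1 2 3] and [4 5 6]
print(m[1, 2])                    # element in row 1, column 2
print(m[:, 0])                    # first column

# Element-wise arithmetic and broadcasting
print(m * 2)
print(m + np.array([10, 20, 30]))  # the 1-D array is added to every row

# Built-in aggregate and mathematical functions
print(m.sum(), m.mean(), m.max(), m.min())
print(m.sum(axis=0))              # column sums
print(np.sqrt(a))
```

Operations such as `m * 2` work element by element, which is the main difference from Python lists, where `*` repeats the list.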

class Disease:
    def __init__(self, disease='Depression'):
        self.type = disease

    def getName(self):
        print("Mental Health Diseases: {0}".format(self.type))

d1 = Disease('Social Anxiety Disorder')
d1.getName()

OUTPUT

Mental Health Diseases: Social Anxiety Disorder


# Try/except block
# The try block raises a NameError because y is not defined:

try:
    print(y)
except NameError:
    print("Well, the variable y is not defined")
except:
    print("OMG, Something else went wrong")

OUTPUT

Well, the variable y is not defined

try:
    Value = int(input("Type a number between 1 and 10:"))
except ValueError:
    print("You must type a number between 1 and 10!")
else:
    if (Value > 0) and (Value <= 10):
        print("You typed: ", Value)
    else:
        print("The value you typed is incorrect!")

OUTPUT

Type a number between 1 and 10:333

The value you typed is incorrect!


3. Loading Dataset into pandas dataframe

PROGRAM

import pandas as pd

# Column names for the UCI Adult data set, which ships without a header row
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'ethnicity',
           'gender', 'capital_gain', 'capital_loss', 'hours_per_week', 'country_of_origin', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 names=columns)
df.head(10)

4. Selecting rows and columns in the data frame

# Selects the row at position 10
df.iloc[10]

# Selects the first 10 rows
df.iloc[0:10]

# Selects the rows at positions 10 to 14
df.iloc[10:15]

# Selects the last 2 rows
df.iloc[-2:]

# Selects every other row, for the columns at positions 3 and 4
df.iloc[::2, 3:5].head()
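The listing above selects by position only. The sketch below adds selection by column name, by label with loc, and by a boolean condition. The small `cars` frame and its column names are hypothetical and serve only to illustrate the syntax.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the selection syntax
cars = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "year": [2012, 2015, 2018, 2020],
        "price": [3.5, 5.0, 7.2, 9.9],
    },
    index=["r1", "r2", "r3", "r4"],
)

one_column = cars["price"]                          # a single column (Series)
two_columns = cars[["name", "price"]]               # several columns (DataFrame)
by_label = cars.loc["r2":"r3", ["year", "price"]]   # label slices include the end label
by_position = cars.iloc[1:3, 1:3]                   # position slices exclude the end position
recent = cars[cars["year"] >= 2018]                 # boolean mask on the rows

print(by_label)
print(recent)
```

Here `by_label` and `by_position` pick the same cells, which shows the difference between the inclusive label slice of loc and the exclusive position slice of iloc.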


import pandas as pd
import numpy as np

np.random.seed(24)
df = pd.DataFrame({'F': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 5), columns=list('EDCBA'))],
axis=1)
df.iloc[0, 2] = np.nan
df


# Define a function that colors negative values red, positive values black,
# and anything else (zero or NaN) green
def colorNegativeValueToRed(value):
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'black'
    else:
        color = 'green'

    return 'color: %s' % color

# Note: in pandas 2.1 and later, Styler.applymap is deprecated in favor of Styler.map
s = df.style.applymap(colorNegativeValueToRed, subset=['A','B','C','D','E'])
s


# Let us highlight the max value in each column with an orange background and the min value
# with a green background
def highlightMax(s):
    isMax = s == s.max()
    return ['background-color: orange' if v else '' for v in isMax]

def highlightMin(s):
    isMin = s == s.min()
    return ['background-color: green' if v else '' for v in isMin]

# highlight_null takes color= in pandas 1.5 and later (the older null_color= was removed in pandas 2.0)
df.style.apply(highlightMax).apply(highlightMin).highlight_null(color='red')


import seaborn as sns

cm = sns.light_palette("pink", as_cmap=True)

s = df.style.background_gradient(cmap=cm)
s


UNIT-II
Visual Aids for EDA: Technical requirements, Line chart, Bar charts, Scatter plot using seaborn,
Polar chart, Histogram, Choosing the best chart
Case Study: EDA with Personal Email, Technical requirements, Loading the dataset, Data
transformation, Data cleansing, Applying descriptive statistics, Data refactoring, Data analysis.

Sample Experiments:

1. Apply different visualization techniques using sample dataset

a) Line Chart b) Bar Chart c) Scatter Plots d) Bubble Plot
2. Generate Scatter Plot using seaborn library for iris dataset
3. Apply following visualization Techniques for a sample dataset
a) Area Plot b) Stacked Plot c) Pie chart d) Table Chart

1. Apply different visualization techniques using sample dataset

In this section, we apply different visualization techniques to a simple, randomly generated
data set.

import datetime
import math
import pandas as pd
import random
import radar
from faker import Faker

fake = Faker()

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        date = radar.random_datetime(start='2019-08-1', stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    df = df.groupby(by='Date').mean()

    return df

a) Line Chart

df = generateData(50)
df.head(10)
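The listing above only prepares the data. A minimal sketch of the line chart itself is given here. To stay self-contained it builds a stand-in frame with the standard random module instead of calling generateData(), and the names `stock` and `line_chart.png` are assumptions.

```python
import random

import matplotlib.pyplot as plt
import pandas as pd

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-in for generateData(): one random price per day in August 2019
dates = pd.date_range("2019-08-01", "2019-08-30", freq="D")
prices = [round(random.uniform(900, 1000), 4) for _ in dates]
stock = pd.DataFrame({"Price": prices}, index=dates)
stock.index.name = "Date"

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(stock.index, stock["Price"])   # dates on the x-axis, prices on the y-axis
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("Price over time")
fig.autofmt_xdate()                    # tilt the date labels so they do not overlap
fig.savefig("line_chart.png")          # or plt.show() in an interactive session
```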


b) Bar Chart

# Let us import the required libraries
import numpy as np
import calendar
import random
import matplotlib.pyplot as plt

# Step 1: Set up the data. Remember that the range stop parameter is exclusive, meaning that if you
# generate a range from (1, 13), the last item, 13, is not included.
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

# Step 2: Specify the layout of the figure and allocate space.
figure, axis = plt.subplots()

# Step 3: On the X-axis, we would like to display the names of the months.
plt.xticks(months, calendar.month_name[1:13], rotation=20)

# Step 4: Plot the graph
plot = axis.bar(months, sold_quantity)

# Step 5: This step is optional, depending on whether you want to display the data value
# on the head of each bar.
# Showing the actual number of sold items on the bar itself makes the chart easier to read.
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height, '%d' % int(height),
              ha='center', va='bottom')

# Step 6: Display the graph on the screen.
plt.show()

# The same data as a horizontal bar chart.
# Step 2: Specify the layout of the figure and allocate space.
figure, axis = plt.subplots()

# Step 3: On the Y-axis, we would like to display the names of the months.
plt.yticks(months, calendar.month_name[1:13], rotation=20)

# Step 4: Plot the graph
plot = axis.barh(months, sold_quantity)

# Step 5: This step is optional, depending on whether you want to display the data value
# at the end of each bar.
# Showing the actual number of sold items on the bar itself makes the chart easier to read.
for rectangle in plot:
    width = rectangle.get_width()
    axis.text(width + 2.5, rectangle.get_y() + 0.38, '%d' % int(width), ha='center', va='bottom')

# Step 6: Display the graph on the screen.
plt.show()

c) Scatter Plots

age = list(range(0, 65))

sleep = []

classBless = ['newborns(0-3)', 'infants(4-11)', 'toddlers(12-24)', 'preschoolers(36-60)',
              'school-aged-children(72-156)', 'teenagers(168-204)', 'young-adults(216-300)',
              'adults(312-768)', 'older-adults(>=780)']
headers_cols = ['age', 'min_recommended', 'max_recommended', 'may_be_appropriate_min',
                'may_be_appropriate_max', 'min_not_recommended', 'max_not_recommended']

# One entry per age group, with ages in months:
# (range start, range stop, min_recommended, max_recommended, may_be_appropriate_min,
#  may_be_appropriate_max, min_not_recommended, max_not_recommended)
# The range stop is exclusive.
sleepGroups = [
    (0, 4, 14, 17, 11, 13, 11, 19),      # Newborn (0-3)
    (4, 12, 12, 15, 10, 11, 10, 18),     # infants(4-11)
    (12, 25, 11, 14, 9, 10, 9, 16),      # toddlers(12-24)
    (36, 61, 10, 13, 8, 9, 8, 14),       # preschoolers(36-60)
    (72, 157, 9, 11, 7, 8, 7, 12),       # school-aged-children(72-156)
    (168, 204, 8, 10, 7, 11, 7, 11),     # teenagers(168-204)
    (216, 301, 7, 9, 6, 11, 6, 11),      # young-adults(216-300)
    (312, 769, 7, 9, 6, 10, 6, 10),      # adults(312-768)
    (769, 780, 7, 8, 5, 6, 5, 9),        # older-adults(>=780)
]

# Append one row per month of age, repeating the recommended hours of its group
for start, stop, *hours in sleepGroups:
    for i in range(start, stop):
        sleep.append([i, *hours])

sleepDf = pd.DataFrame(sleep, columns=headers_cols)

sleepDf.head(10)
sleepDf.to_csv(r'sleep_vs_age.csv')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# A regular scatter plot

plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf['max_recommended'])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()


d) Bubble Plot

# Load the Iris dataset

df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1, "virginica": 2})

# Create bubble plot

plt.scatter(df.petal_length, df.petal_width,
s=50*df.petal_length*df.petal_width,
c=df.species,
alpha=0.3
)

# Create labels for the axes

plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()

2. Generate Scatter Plot using seaborn library for iris dataset

df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1, "virginica": 2})

# Column names are looked up in the frame passed as data
sns.scatterplot(x="sepal_length", y="sepal_width", hue="species", data=df)
plt.show()


3. Apply following visualization Techniques for a sample dataset
a) Area Plot and Stacked Plot

houseLoanMortage = [9000, 9000, 8000, 9000,
                    8000, 9000, 9000, 9000,
                    9000, 8000, 9000, 9000]
utilitiesBills = [4218, 4218, 4218, 4218,
                  4218, 4218, 4219, 2218,
                  3218, 4233, 3000, 3000]
transportation = [782, 900, 732, 892,
                  334, 222, 300, 800,
                  900, 582, 596, 222]
carMortage = [700, 701, 702, 703,
              704, 705, 706, 707,
              708, 709, 710, 711]

import matplotlib.pyplot as plt
import seaborn as sns

months = [x for x in range(1, 13)]

sns.set()

# Empty plots that only create the legend entries for the stacked areas
plt.plot([], [], color='sandybrown', label='houseLoanMortage')
plt.plot([], [], color='tan', label='utilitiesBills')
plt.plot([], [], color='bisque', label='transportation')
plt.plot([], [], color='darkcyan', label='carMortage')

plt.stackplot(months, houseLoanMortage, utilitiesBills, transportation, carMortage,
              colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])
plt.legend()

plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')

plt.show()

c) Pie chart

# URL of the CSV file (alternatively this can be a file path)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'

# Load the CSV file into a data frame, using the 'type' column as the index
pokemon = pd.read_csv(url, index_col='type')
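The plotting step for the pie chart is not shown above. As a minimal sketch of how counts per category can be drawn as a pie chart, the example below uses hypothetical counts; they are not the contents of pokemonByType.csv, and the names `counts` and `pie_chart.png` are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts per type (not the contents of pokemonByType.csv)
counts = pd.Series({"Water": 30, "Fire": 20, "Grass": 25, "Electric": 25}, name="amount")

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(counts.values, labels=counts.index, autopct="%1.1f%%", startangle=90)
ax.set_title("Share by type")
ax.axis("equal")               # keep the pie circular
fig.savefig("pie_chart.png")   # or plt.show() in an interactive session
```

Each wedge is proportional to its value divided by the total, so the chart reads best when the categories are few and the shares differ visibly.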

d) Table Chart

# Years under consideration
years = ["2010", "2011", "2012", "2013", "2014"]

# Available watt
columns = ['4.5W', '6.0W', '7.0W', '8.5W', '9.5W', '13.5W', '15W']
unitsSold = [
    [65, 141, 88, 111, 104, 71, 99],
    [85, 142, 89, 112, 103, 73, 98],
    [75, 143, 90, 113, 89, 75, 93],
    [65, 144, 91, 114, 90, 77, 92],
    [55, 145, 92, 115, 88, 79, 93],
]

# Define the range and scale for the y axis
values = np.arange(0, 600, 100)

colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))
index = np.arange(len(columns)) + 0.3
bar_width = 0.7

y_offset = np.zeros(len(columns))
fig, ax = plt.subplots()

cell_text = []

n_rows = len(unitsSold)
for row in range(n_rows):
    plot = plt.bar(index, unitsSold[row], bar_width, bottom=y_offset,
                   color=colors[row])
    y_offset = y_offset + unitsSold[row]
    cell_text.append(['%1.1f' % (x) for x in y_offset])
    i = 0
    # Each iteration of this for loop labels each bar with the corresponding value for the given year
    for rect in plot:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width() / 2, y_offset[i], '%d' % int(y_offset[i]),
                ha='center', va='bottom')
        i = i + 1

# Add a table to the bottom of the axes
the_table = plt.table(cellText=cell_text, rowLabels=years,
                      rowColours=colors, colLabels=columns, loc='bottom')
plt.ylabel("Units Sold")
plt.xticks([])
plt.title('Number of LED Bulb Sold/Year')
plt.show()


UNIT-III
Data Transformation: Merging database-style data frames, Concatenating along with an axis,
Merging on index, Reshaping and pivoting, Transformation techniques, Handling missing data,
Mathematical operations with NaN, Filling missing values, Discretization and binning, Outlier
detection and filtering, Permutation and random sampling, Benefits of data
Transformation, Challenges.
Sample Experiments:
1. Perform the following operations
a) Merging Data frames
import pandas as pd
import numpy as np

dataFrame1 = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                           'Score': [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                           'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

# Both data frames have the same columns, so we can combine their rows into one data frame
# by using the pandas concat() method.
dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)

dataframe
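concat() stacks rows of frames that share the same columns. When two frames instead share a key column and hold different attributes, a database-style merge is the usual tool. The sketch below uses two small hypothetical tables (`scores` and `names`) to show the inner, left, and outer variants of pd.merge.

```python
import pandas as pd

# Hypothetical tables that share the key column StudentID
scores = pd.DataFrame({"StudentID": [1, 2, 3, 4], "Score": [89, 98, 39, 93]})
names = pd.DataFrame({"StudentID": [2, 3, 4, 5], "Name": ["Asha", "Ben", "Chen", "Dev"]})

inner = pd.merge(scores, names, on="StudentID", how="inner")   # keys present in both tables
left = pd.merge(scores, names, on="StudentID", how="left")     # every row of scores
outer = pd.merge(scores, names, on="StudentID", how="outer")   # keys from either table

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side get NaN in the columns that come from that side, which is why the left and outer results contain missing values.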


b) Reshaping with Hierarchical Indexing
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim',
'Stavanger', 'Kristiansand'])
dframe1
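The listing above only builds the data frame. Reshaping with a hierarchical index is commonly done with stack() and unstack(); the following sketch rebuilds the same frame so that it runs on its own and applies both.

```python
import numpy as np
import pandas as pd

data = np.arange(15).reshape((3, 5))
dframe1 = pd.DataFrame(
    data,
    index=["Rainfall", "Humidity", "Wind"],
    columns=["Bergen", "Oslo", "Trondheim", "Stavanger", "Kristiansand"],
)

stacked = dframe1.stack()       # columns become an inner index level (a Series with a MultiIndex)
print(stacked)
print(stacked.loc[("Humidity", "Oslo")])   # look up one value by (row label, column label)

restored = stacked.unstack()    # the inner index level becomes columns again
print(restored)
```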


c) Data Deduplication
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4, 'column 2': [10, 10, 22,
23, 23, 24, 24]})
frame3
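The frame above contains repeated rows. A minimal sketch of how they can be detected and removed with duplicated() and drop_duplicates(), rebuilding frame3 so the example runs on its own:

```python
import pandas as pd

frame3 = pd.DataFrame({
    "column 1": ["Looping"] * 3 + ["Functions"] * 4,
    "column 2": [10, 10, 22, 23, 23, 24, 24],
})

print(frame3.duplicated())                        # True for each repeat of an earlier row
deduped = frame3.drop_duplicates()                # keep the first copy of each row
print(deduped)

last_kept = frame3.drop_duplicates(keep="last")   # keep the last copy instead
by_col = frame3.drop_duplicates(subset=["column 1"])   # compare only 'column 1'
print(by_col)
```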


d) Replacing Values
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332.,
3332. ], 'column 2': range(9)})
replaceFrame.replace(to_replace =-786, value= np.nan)


2. Apply different Missing Data handling techniques

data = np.arange(15, 30).reshape(5, 3)

dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx

dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
# Assign with a single .loc call; chained indexing such as dfx['store4']['apple'] may not update dfx
dfx.loc['apple', 'store4'] = 20.
dfx


a) NaN values in mathematical operations
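A minimal sketch of how NaN behaves in arithmetic and in reductions. It uses a small hypothetical frame rather than the full dfx above so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with missing entries
dfx = pd.DataFrame(
    {"store1": [15.0, 18.0, np.nan], "store2": [16.0, np.nan, np.nan]},
    index=["apple", "banana", "oranges"],
)

print(np.nan + 1)                          # arithmetic with NaN gives NaN
print(dfx["store1"] + dfx["store2"])       # NaN wherever either operand is NaN
print(dfx["store1"].sum())                 # reductions skip NaN by default
print(dfx["store1"].mean())                # mean of the two non-missing values
print(dfx["store1"].sum(skipna=False))     # NaN once skipping is turned off
print(dfx.isnull().sum())                  # number of missing values per column
```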


b) Filling in missing data
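A minimal sketch of fillna(), again on a small hypothetical frame: filling with one constant, with a different constant per column, and with each column's mean.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with missing entries
dfx = pd.DataFrame(
    {"store1": [15.0, 18.0, np.nan], "store2": [16.0, np.nan, np.nan]},
    index=["apple", "banana", "oranges"],
)

filled_zero = dfx.fillna(0)                                   # one constant everywhere
filled_per_column = dfx.fillna({"store1": 0, "store2": -1})   # a constant per column
filled_mean = dfx.fillna(dfx.mean())                          # each column's own mean

print(filled_zero)
print(filled_per_column)
print(filled_mean)
```

fillna() returns a new frame and leaves dfx unchanged, so the three results can be compared side by side.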


c) Forward and Backward filling of missing values
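Forward filling copies the last valid value downwards, and backward filling copies the next valid value upwards. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with gaps
dfx = pd.DataFrame(
    {"store1": [15.0, np.nan, np.nan, 24.0], "store2": [np.nan, 19.0, np.nan, 25.0]},
    index=["apple", "banana", "kiwi", "grapes"],
)

forward = dfx.ffill()           # propagate the last valid value downwards
backward = dfx.bfill()          # propagate the next valid value upwards
limited = dfx.ffill(limit=1)    # fill at most one consecutive gap

print(forward)
print(backward)
print(limited)
```

A leading NaN has nothing above it to copy, so it survives a forward fill; the same holds for a trailing NaN and a backward fill.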


d) Filling with index values


e) Interpolation of missing values
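Interpolation estimates missing values from their neighbours. A minimal sketch with the default linear method on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with three missing values between two known points
ser = pd.Series([100, np.nan, np.nan, np.nan, 292])

linear = ser.interpolate()   # default method is linear: equal steps between 100 and 292
print(linear)
```

With four equal steps between 100 and 292, each step is 48, so the gaps are filled with 148, 196, and 244.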


3. Apply different data transformation techniques
a) Renaming axis indexes
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim',
'Stavanger', 'Kristiansand'])
dframe1

# Suppose you want to convert the index labels to uppercase.

dframe1.index = dframe1.index.map(str.upper)
dframe1

# rename() returns a new data frame with title-case index labels and uppercase column names
dframe1.rename(index=str.title, columns=str.upper)


b) Discretization and Binning
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

bins = [118, 125, 135, 160, 200]

category = pd.cut(height, bins)

category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)

category2


c) Permutation and Random Sampling
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)

sampler = np.random.permutation(10)
sampler


df.take(sampler)

d) Dummy variables

df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
                   'votes': range(6, 12, 1)})
df

pd.get_dummies(df['gender'])


UNIT-IV
Descriptive Statistics: Distribution function, Measures of central tendency, Measures of
dispersion, Types of kurtosis, Calculating percentiles, Quartiles, Grouping Datasets, Correlation,
Understanding univariate, bivariate, multivariate analysis, Time Series Analysis
Sample Experiments:
1. Study the following Distribution Techniques on a sample data
a) Uniform Distribution
b) Normal Distribution
c) Gamma Distribution
d) Exponential Distribution
e) Poisson Distribution
f) Binomial Distribution
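As a minimal sketch of the six distributions listed above, the example below draws samples from each with NumPy's random generator and prints the sample mean and variance. The parameter values and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator so runs are repeatable
n = 10_000                        # number of samples per distribution

samples = {
    "uniform": rng.uniform(low=0, high=10, size=n),
    "normal": rng.normal(loc=0, scale=1, size=n),
    "gamma": rng.gamma(shape=2.0, scale=1.5, size=n),
    "exponential": rng.exponential(scale=2.0, size=n),
    "poisson": rng.poisson(lam=3, size=n),
    "binomial": rng.binomial(n=10, p=0.5, size=n),
}

# Compare each sample mean and variance with the theoretical values of the distribution
for name, values in samples.items():
    print("{0:12s} mean={1:.3f} var={2:.3f}".format(name, values.mean(), values.var()))
```

With these parameters the theoretical means are 5, 0, 3, 2, 3, and 5 respectively, so the printed sample means should land close to those values. Passing any of the arrays to a histogram shows the shape of the distribution.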

2. Perform Data Cleaning on a sample dataset.

# Find out how many values are numeric and how many are not
df['price'].str.isnumeric().value_counts()

# List the values which are not numeric
df['price'].loc[df['price'].str.isnumeric() == False]

# Set the missing values ('?') to the mean price and convert the column to integer
price = df['price'].loc[df['price'] != '?']
pmean = price.astype(str).astype(int).mean()
df['price'] = df['price'].replace('?', pmean).astype(int)
df['price'].head()

# Clean the horsepower field in the same way
df['horsepower'].str.isnumeric().value_counts()
horsepower = df['horsepower'].loc[df['horsepower'] != '?']
hpmean = horsepower.astype(str).astype(int).mean()
df['horsepower'] = df['horsepower'].replace('?', hpmean).astype(int)
df['horsepower'].head()
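An alternative to replacing the '?' placeholder by hand is to let pandas coerce anything non-numeric to NaN and then fill it. The sketch below shows this on a small hypothetical column, not the actual dataset used above.

```python
import pandas as pd

# Hypothetical column with '?' placeholders for missing prices
raw = pd.DataFrame({"price": ["13495", "16500", "?", "13950", "?"]})

price = pd.to_numeric(raw["price"], errors="coerce")    # '?' becomes NaN
print(price.isna().sum())                                # number of missing prices
raw["price"] = price.fillna(price.mean()).astype(int)    # fill with the mean, then cast to int
print(raw["price"].tolist())
```

This approach also handles placeholders other than '?', because every value that cannot be parsed as a number is turned into NaN.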


3. Compute measure of Central Tendency on a sample dataset
a) Mean
b) Median
c) Mode
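A minimal sketch of the three measures on a small hypothetical frame. The column names `make` and `height` and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: one text column and one numeric column
df = pd.DataFrame({
    "make": ["audi", "bmw", "bmw", "honda", "honda", "honda", "volvo"],
    "height": [50, 52, 52, 54, 55, 56, 59],
})

mean_height = df["height"].mean()         # arithmetic average
median_height = df["height"].median()     # middle value of the sorted data
mode_height = df["height"].mode()         # most frequent value(s), returned as a Series
mode_make = df["make"].mode()             # mode also works for categorical data

print(mean_height, median_height)
print(mode_height.tolist(), mode_make.tolist())
print(df.mean(numeric_only=True))         # mean of every numeric column
```

mode() returns a Series because a data set can have more than one most frequent value.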

4. Explore Measures of Dispersion on a sample dataset

a) Variance
# Variance of every numeric column using the var() function
# (numeric_only=True skips text columns, which raise an error in pandas 2.0 and later)
variance = df.var(numeric_only=True)
print(variance)
# Variance of a specific column
var_height = df.loc[:, "height"].var()
print(var_height)

b) Standard Deviation

c) Skewness
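A minimal sketch for items b) and c), standard deviation and skewness, on a hypothetical series of heights:

```python
import pandas as pd

heights = pd.Series([50, 52, 52, 54, 55, 56, 59], name="height")   # hypothetical values

std_sample = heights.std()              # sample standard deviation (ddof=1 by default)
std_population = heights.std(ddof=0)    # population standard deviation
skewness = heights.skew()               # positive when the right tail is longer

print(std_sample, std_population)
print(skewness)
```

The standard deviation is the square root of the variance, so it is expressed in the same unit as the data. A skewness near zero indicates a roughly symmetric distribution.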

d) Kurtosis
# Kurtosis of every numeric column using the kurt() function
kurtosis = df.kurt(numeric_only=True)
print(kurtosis)

# Kurtosis of a specific column
kurt_height = df.loc[:, "height"].kurt()
print(kurt_height)


4.a) Calculating percentiles on sample dataset

# Calculating the 50th percentile (the median) of the heights in the dataset

height = df["height"]
percentile = np.percentile(height, 50)
print(percentile)

b) Calculate Inter Quartile Range (IQR) and Visualize using Box Plots

Quartiles divide the data set into four equal parts.
First quartile = 25th percentile
Second quartile = 50th percentile (Median)
Third quartile = 75th percentile

Based on the quartiles, there is another measure called the inter-quartile range that also measures
the variability in the dataset. It is defined as:

IQR = Q3 - Q1

Because it depends only on the middle 50% of the data, the IQR is not affected by extreme values at either end.

price = df.price.sort_values()
Q1 = np.percentile(price, 25)
Q2 = np.percentile(price, 50)
Q3 = np.percentile(price, 75)

IQR = Q3 - Q1
IQR

df["normalized-losses"].describe()
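The box plot itself is not shown above. The sketch below computes the quartiles, applies the common 1.5 × IQR rule to flag outliers, and draws a box plot. The price values and the file name `box_plot.png` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical prices; 40 is far from the rest of the data
price = np.array([5, 7, 8, 9, 10, 11, 12, 13, 14, 40])

q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1

# Common rule of thumb: points beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = price[(price < lower) | (price > upper)]
print(q1, q3, iqr, outliers)

fig, ax = plt.subplots()
ax.boxplot(price)              # box spans Q1 to Q3, the line inside marks the median
ax.set_ylabel("Price")
fig.savefig("box_plot.png")    # or plt.show() in an interactive session
```

By default the whiskers of a matplotlib box plot extend to the most extreme points within 1.5 × IQR of the box, so the flagged value is drawn as a separate marker.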


5. Perform the following analysis on automobile dataset.
a) Bivariate analysis b) Multivariate analysis
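One common starting point for both analyses is correlation: a single coefficient for a pair of variables (bivariate) and a correlation matrix for several variables at once (multivariate). The sketch below uses a few hypothetical rows, not the real automobile dataset, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical rows, not the real automobile dataset
auto = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "price": [7000, 9500, 12000, 18000, 30000],
    "city_mpg": [38, 31, 27, 22, 17],
})

# Bivariate: Pearson correlation between two variables
r_price = auto["horsepower"].corr(auto["price"])
r_mpg = auto["horsepower"].corr(auto["city_mpg"])
print(r_price, r_mpg)

# Multivariate: pairwise correlations between all numeric columns
corr_matrix = auto.corr()
print(corr_matrix)
```

A scatter plot of each pair, or a heat map of the correlation matrix, is the usual visual companion to these numbers.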
6. Perform Time Series Analysis on Open Power systems dataset
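A minimal sketch of the basic time-series operations in pandas: slicing by date, resampling, and rolling averages. It uses a hypothetical daily series with an assumed column name `Consumption`, not the Open Power Systems file itself.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values for four weeks, indexed by date
idx = pd.date_range("2017-01-01", periods=28, freq="D")
power = pd.DataFrame({"Consumption": np.arange(28, dtype=float)}, index=idx)

second_week = power.loc["2017-01-08":"2017-01-14"]              # slice rows by date labels
weekly = power["Consumption"].resample("7D").mean()             # mean of each 7-day block
rolling = power["Consumption"].rolling(window=7).mean()         # 7-day moving average

print(second_week)
print(weekly)
print(rolling.head(8))
```

Resampling reduces the series to one value per block, while the rolling mean keeps one value per day and is undefined (NaN) until a full window is available.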


UNIT-V

Model Development and Evaluation: Unified machine learning workflow, Data preprocessing,
Data preparation, Training sets and corpus creation, Model creation and training, Model
evaluation, Best model selection and evaluation, Model deployment
Case Study: EDA on Wine Quality Data Analysis

Sample Experiments:
1. Perform hypothesis testing using statsmodels library
a) Z-Test
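statsmodels provides a ready-made ztest function in statsmodels.stats.weightstats. To show what a one-sample z-test computes, the sketch below works it out by hand with NumPy and the standard math module. The sample is randomly generated and the hypothesized mean of 175 is an assumption.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=176, scale=5, size=100)   # hypothetical heights in cm
mu0 = 175                                         # mean under the null hypothesis

n = len(sample)
standard_error = sample.std(ddof=1) / math.sqrt(n)
z = (sample.mean() - mu0) / standard_error
p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value from the standard normal

print("z = {0:.3f}, p-value = {1:.4f}".format(z, p_value))
if p_value < 0.05:
    print("We are rejecting the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")
```

The z-test relies on the normal approximation, so it is usually applied to larger samples; for small samples the t-test in part b) is the safer choice.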

b) T-Test
import numpy as np
from scipy.stats import ttest_1samp

height = np.array([172, 184, 174, 168, 174, 183, 173, 173, 184, 179, 171, 173, 181, 183, 172,
                   178, 170, 182, 181, 172, 175, 170, 168, 178, 170, 181, 180, 173, 183, 180, 177, 181, 171, 173,
                   171, 182, 180, 170, 172, 175, 178, 174, 184, 177, 181, 180, 178, 179, 175, 170, 182, 176, 183,
                   179, 177])
height

height_average = np.mean(height)
print("Average height is = {0:.3f}".format(height_average))

# One-sample t-test of the null hypothesis that the mean height is 175
tset, pval = ttest_1samp(height, 175)

print("P-value = {}".format(pval))

if pval < 0.05:
    print("We are rejecting the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")

2. Develop model and Perform Model Evaluation using different metrics such as prediction
score, R2 Score, MAE Score, MSE Score.
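As a minimal sketch of this experiment, the example below fits a straight line to hypothetical data with NumPy and computes the MAE, MSE, and R2 scores from their definitions. scikit-learn offers the same metrics in sklearn.metrics (mean_absolute_error, mean_squared_error, r2_score); the data, the 75/25 split, and the variable names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)   # hypothetical linear data with noise

# Hold out the last 25% of the rows as a test set
split = int(0.75 * len(x))
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Model development: least-squares fit of a straight line on the training set
slope, intercept = np.polyfit(x_train, y_train, deg=1)
y_pred = slope * x_test + intercept

# Model evaluation on the held-out test set
mae = np.mean(np.abs(y_test - y_pred))             # mean absolute error
mse = np.mean((y_test - y_pred) ** 2)              # mean squared error
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                           # share of variance explained

print("slope = {0:.3f}, intercept = {1:.3f}".format(slope, intercept))
print("MAE = {0:.3f}, MSE = {1:.3f}, R2 = {2:.3f}".format(mae, mse, r2))
```

Evaluating on rows the model has not seen gives a fairer estimate of how it would perform on new data than scoring it on the training set.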
