UNIT-I
Exploratory Data Analysis Fundamentals: Understanding data science, The significance of EDA,
Steps in EDA, Making sense of data, Numerical data, Categorical data, Measurement scales,
Comparing EDA with classical and Bayesian analysis, Software tools available for EDA, Getting
started with EDA.
Sample Experiments:
1. a) Download Dataset from Kaggle using the following link :
https://www.kaggle.com/datasets/sukhmanibedi/cars4u
b) Install python libraries required for Exploratory Data Analysis (numpy, pandas,
matplotlib,seaborn)
2. Perform Numpy Array basic operations and Explore Numpy Built-in functions.
3. Loading Dataset into pandas dataframe
4. Selecting rows and columns in the dataframe
1 A) Download Dataset from Kaggle using the following link :
https://www.kaggle.com/datasets/sukhmanibedi/cars4u
Method 1:
Manually Download from Kaggle
1. Go to the dataset link above (Cars4U Dataset).
2. Click the Download button.
3. Extract the downloaded ZIP file to access the CSV file(s).
Method 2:
Use Kaggle API in Python
If you prefer downloading via Python, follow these steps:
Step 1: Install Kaggle API
If you haven’t installed the Kaggle API yet, run:
pip install kaggle
Step 2: Get Kaggle API Credentials
1. Go to your Kaggle Account Settings (Kaggle API Section).
2. Click on Create New API Token to download kaggle.json.
3. Move kaggle.json to the directory where the Kaggle API looks for it (~/.kaggle/ for
Linux/Mac or C:\Users\YourUsername\.kaggle\ for Windows).
Exploratory Data Analysis with Python 1
Step 3: Download the Dataset
Run the following command in a terminal (in a Jupyter Notebook cell, prefix it with !):
# Download the dataset and unzip it into the current directory
kaggle datasets download -d sukhmanibedi/cars4u --unzip
1.b) Install python libraries required for Exploratory Data Analysis (numpy, pandas,
matplotlib,seaborn)
Step 1:
Check if Python is Installed
Before installing the required libraries, ensure that Python is installed on your system.
Open the terminal (Linux/Mac) or command prompt (Windows).
Type the following command to check the installed Python version:
python --version (OR) python3 --version
If Python is not installed, download and install it from Python's official website.
Step 2:
Install pip (If Not Installed)
pip is the package manager for Python. Most modern Python installations come with pip pre-
installed. To check if pip is installed, run:
pip --version
If pip is missing or outdated, install and upgrade it with:
python -m ensurepip --default-pip
python -m pip install --upgrade pip
Step 3:
Install Required Libraries (NumPy, Pandas, Matplotlib, Seaborn)
To install the required libraries, run the following command:
pip install numpy pandas matplotlib seaborn
If you are using Jupyter Notebook, run this inside a code cell:
!pip install numpy pandas matplotlib seaborn
Step 4:
Verify the Installation
After installation, verify that the libraries are correctly installed by running the following
Python code:
import numpy
import pandas
import matplotlib
import seaborn
print("All libraries installed successfully!")
If you do not get any errors, the libraries are successfully installed.
2. Perform Numpy Array basic operations and Explore Numpy Built-in functions
class Disease:
    def __init__(self, disease='Depression'):
        self.type = disease

    def getName(self):
        print("Mental Health Diseases: {0}".format(self.type))

d1 = Disease('Social Anxiety Disorder')
d1.getName()
OUTPUT
Mental Health Diseases: Social Anxiety Disorder
# Try/except block
# The try block will raise a NameError because y is not defined:
try:
    print(y)
except NameError:
    print("Well, the variable y is not defined")
except:
    print("OMG, Something else went wrong")
OUTPUT
Well, the variable y is not defined
try:
    Value = int(input("Type a number between 1 and 10:"))
except ValueError:
    print("You must type a number between 1 and 10!")
else:
    if (Value > 0) and (Value <= 10):
        print("You typed: ", Value)
    else:
        print("The value you typed is incorrect!")
OUTPUT
Type a number between 1 and 10:333
The value you typed is incorrect!
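The title of this experiment refers to NumPy arrays, so a minimal sketch of basic array operations and a few built-in functions may be useful here. The array values and variable names are chosen only for illustration.

```python
import numpy as np

# Create arrays from a list, a range, and evenly spaced values
a = np.array([1, 2, 3, 4, 5, 6])
b = np.arange(6)              # 0, 1, ..., 5
c = np.linspace(0, 1, 5)      # 5 evenly spaced values from 0 to 1 inclusive

# Basic attributes
print(a.shape, a.ndim, a.dtype)

# Element-wise arithmetic
print(a + b)
print(a * 2)

# Reshaping, indexing, and slicing
m = a.reshape(2, 3)
print(m[0, 1])      # row 0, column 1
print(m[:, 1:])     # all rows, columns 1 onward

# Built-in aggregate functions
print(a.sum(), a.mean(), a.min(), a.max(), a.std())
print(m.sum(axis=0))   # column sums
print(m.sum(axis=1))   # row sums

# Other built-ins
print(np.sqrt(a))
print(np.zeros((2, 2)))
print(np.dot(m, m.T))  # matrix product of m with its transpose
```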
3. Loading Dataset into pandas dataframe
PROGRAM
import pandas as pd
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'ethnicity',
           'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
           'country_of_origin', 'income']
df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=columns)
df.head(10)
4. Selecting rows and columns in the data frame
# Selects a row
df.iloc[10]
# Selects 10 rows
df.iloc[0:10]
# Selects a range of rows
df.iloc[10:15]
# Selects the last 2 rows
df.iloc[-2:]
# Selects every other row, columns at positions 3 and 4 (the end position 5 is excluded)
df.iloc[::2, 3:5].head()
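The iloc examples above select by integer position. As a complement, the following sketch shows selection by column name, by label with loc, and by a boolean condition. The toy frame and its values are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"age": [25, 38, 52],
     "education": ["Bachelors", "HS-grad", "Masters"],
     "hours_per_week": [40, 50, 45]},
    index=["a", "b", "c"])

# A single column selected by name is returned as a Series
ages = toy["age"]

# A list of column names returns a DataFrame
subset = toy[["age", "hours_per_week"]]

# .loc selects by label; both end points of a label slice are included
by_label = toy.loc["a":"b", ["age", "education"]]

# Boolean mask: rows where hours_per_week exceeds 40
long_hours = toy.loc[toy["hours_per_week"] > 40]

print(ages)
print(subset)
print(by_label)
print(long_hours)
```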
import pandas as pd
import numpy as np
np.random.seed(24)
df = pd.DataFrame({'F': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 5), columns=list('EDCBA'))],
axis=1)
df.iloc[0, 2] = np.nan
df
# Define a function that colors negative values red, positive values black,
# and everything else (zero or NaN) green
def colorNegativeValueToRed(value):
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'black'
    else:
        color = 'green'
    return 'color: %s' % color

# Note: pandas 2.1 and later deprecate Styler.applymap in favor of Styler.map
s = df.style.applymap(colorNegativeValueToRed, subset=['A','B','C','D','E'])
s
# Let us highlight the max value in each column with an orange background and the min value
# with a green background
def highlightMax(s):
    isMax = s == s.max()
    return ['background-color: orange' if v else '' for v in isMax]

def highlightMin(s):
    isMin = s == s.min()
    return ['background-color: green' if v else '' for v in isMin]

# The color is passed positionally because the keyword was renamed from
# null_color to color in newer pandas releases
df.style.apply(highlightMax).apply(highlightMin).highlight_null('red')
import seaborn as sns
cm = sns.light_palette("pink", as_cmap=True)
s = df.style.background_gradient(cmap=cm)
s
UNIT-II
Visual Aids for EDA: Technical requirements, Line chart, Bar charts, Scatter plot using seaborn,
Polar chart, Histogram, Choosing the best chart
Case Study: EDA with Personal Email, Technical requirements, Loading the dataset, Data
transformation, Data cleansing, Applying descriptive statistics, Data refactoring, Data analysis.
Sample Experiments:
1. Apply different visualization techniques using sample dataset
a) Line Chart b) Bar Chart c) Scatter Plots d) Bubble Plot
2. Generate Scatter Plot using seaborn library for iris dataset
3. Apply following visualization Techniques for a sample dataset
a) Area Plot b) Stacked Plot c) Pie chart d) Table Chart
1. Apply different visualization techniques using sample dataset
In this section we look at different visualization techniques using a simple, randomly
generated data set.
import datetime
import math
import random

import pandas as pd
import radar
from faker import Faker

fake = Faker()

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        # Pick a random date in August 2019 and a random price between 900 and 1000
        date = radar.random_datetime(start='2019-08-1', stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Average the prices that fall on the same date
    df = df.groupby(by='Date').mean()
    return df
a) Line Chart
df = generateData(50)
df.head(10)
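The listing above prepares the data but does not draw the chart. A minimal sketch of the line chart itself follows. To keep it self-contained it does not call generateData(); instead it builds a hypothetical stand-in frame with one random price per day, so the names prices and rng are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for generateData(): one random price per day in August 2019
rng = np.random.default_rng(0)
dates = pd.date_range("2019-08-01", "2019-08-30", freq="D")
prices = pd.DataFrame({"Price": rng.uniform(900, 1000, size=len(dates))}, index=dates)
prices.index.name = "Date"

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(prices.index, prices["Price"])
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("Price over time")
fig.autofmt_xdate()   # rotate the date labels so they do not overlap
plt.show()
```

With the frame returned by generateData(50), the same ax.plot call should work on df.index and df["Price"].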
b) Bar Chart
# Let us import the required libraries
import random
import calendar
import numpy as np
import matplotlib.pyplot as plt
# Step 1: Set up the data. Remember that the stop parameter of range is exclusive:
# range(1, 13) generates 1 through 12, and 13 is not included.
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
# Step 2: Specify the layout of the figure and allocate space.
figure, axis = plt.subplots()
# Step 3: On the x-axis, we would like to display the names of the months.
plt.xticks(months, calendar.month_name[1:13], rotation=20)
# Step 4: Plot the graph
plot = axis.bar(months, sold_quantity)
# Step 5: This step is optional, depending on whether you want to display the data
# value on top of each bar. Showing the actual number of items sold on the bar itself
# makes the chart easier to read.
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height, '%d' % int(height),
              ha='center', va='bottom')
# Step 6: Display the graph on the screen.
plt.show()
# Horizontal bar chart
# Step 1: Set up the data. Remember that the stop parameter of range is exclusive:
# range(1, 13) generates 1 through 12, and 13 is not included.
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
# Step 2: Specify the layout of the figure and allocate space.
figure, axis = plt.subplots()
# Step 3: On the y-axis, we would like to display the names of the months.
plt.yticks(months, calendar.month_name[1:13], rotation=20)
# Step 4: Plot the graph
plot = axis.barh(months, sold_quantity)
# Step 5: This step is optional, depending on whether you want to display the data
# value at the end of each bar. Showing the actual number of items sold on the bar itself
# makes the chart easier to read.
for rectangle in plot:
    width = rectangle.get_width()
    axis.text(width + 2.5, rectangle.get_y() + 0.38, '%d' % int(width), ha='center', va='bottom')
# Step 6: Display the graph on the screen.
plt.show()
c) Scatter Plots
age = list(range(0, 65))
sleep = []
classBless = ['newborns(0-3)', 'infants(4-11)', 'toddlers(12-24)', 'preschoolers(36-60)',
              'school-aged-children(72-156)', 'teenagers(168-204)', 'young-adults(216-300)',
              'adults(312-768)', 'older-adults(>=780)']
headers_cols = ['age', 'min_recommended', 'max_recommended', 'may_be_appropriate_min',
                'may_be_appropriate_max', 'min_not_recommended', 'max_not_recommended']
# Ages are in months; sleep values are hours per day.
# Newborn (0-3)
for i in range(0, 4):
    min_recommended = 14
    max_recommended = 17
    may_be_appropriate_min = 11
    may_be_appropriate_max = 13
    min_not_recommended = 11
    max_not_recommended = 19
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# infants(4-11)
for i in range(4, 12):
    min_recommended = 12
    max_recommended = 15
    may_be_appropriate_min = 10
    may_be_appropriate_max = 11
    min_not_recommended = 10
    max_not_recommended = 18
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# toddlers(12-24)
for i in range(12, 25):
    min_recommended = 11
    max_recommended = 14
    may_be_appropriate_min = 9
    may_be_appropriate_max = 10
    min_not_recommended = 9
    max_not_recommended = 16
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# preschoolers(36-60)
for i in range(36, 61):
    min_recommended = 10
    max_recommended = 13
    may_be_appropriate_min = 8
    may_be_appropriate_max = 9
    min_not_recommended = 8
    max_not_recommended = 14
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# school-aged-children(72-156)
for i in range(72, 157):
    min_recommended = 9
    max_recommended = 11
    may_be_appropriate_min = 7
    may_be_appropriate_max = 8
    min_not_recommended = 7
    max_not_recommended = 12
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# teenagers(168-204); the stop value is 205 so that month 204 is included
for i in range(168, 205):
    min_recommended = 8
    max_recommended = 10
    may_be_appropriate_min = 7
    may_be_appropriate_max = 11
    min_not_recommended = 7
    max_not_recommended = 11
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# young-adults(216-300)
for i in range(216, 301):
    min_recommended = 7
    max_recommended = 9
    may_be_appropriate_min = 6
    may_be_appropriate_max = 11
    min_not_recommended = 6
    max_not_recommended = 11
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# adults(312-768)
for i in range(312, 769):
    min_recommended = 7
    max_recommended = 9
    may_be_appropriate_min = 6
    may_be_appropriate_max = 10
    min_not_recommended = 6
    max_not_recommended = 10
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
# older-adults(>=780)
for i in range(769, 780):
    min_recommended = 7
    max_recommended = 8
    may_be_appropriate_min = 5
    may_be_appropriate_max = 6
    min_not_recommended = 5
    max_not_recommended = 9
    sleep.append([i, min_recommended, max_recommended, may_be_appropriate_min,
                  may_be_appropriate_max, min_not_recommended, max_not_recommended])
sleepDf = pd.DataFrame(sleep, columns=headers_cols)
sleepDf.head(10)
sleepDf.to_csv(r'sleep_vs_age.csv')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
# A regular scatter plot
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf['max_recommended'])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
d) Bubble Plot
# Load the Iris dataset
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, "versicolor": 1, "virginica": 2})
# Create bubble plot
plt.scatter(df.petal_length, df.petal_width,
s=50*df.petal_length*df.petal_width,
c=df.species,
alpha=0.3
)
# Create labels for the axes
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
2. Generate Scatter Plot using seaborn library for iris dataset
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, "versicolor": 1, "virginica": 2})
sns.scatterplot(x=df["sepal_length"], y=df["sepal_width"], hue=df.species, data=df)
3. Apply following visualization Techniques for a sample dataset
a) Area Plot and Stacked Plot
houseLoanMortage = [9000, 9000, 8000, 9000,
8000, 9000, 9000, 9000,
9000, 8000, 9000, 9000]
utilitiesBills = [4218, 4218, 4218, 4218,
4218, 4218, 4219, 2218,
3218, 4233, 3000, 3000]
transportation = [782, 900, 732, 892,
334, 222, 300, 800,
900, 582, 596, 222]
carMortage = [700, 701, 702, 703,
704, 705, 706, 707,
708, 709, 710, 711]
import matplotlib.pyplot as plt
import seaborn as sns
months= [x for x in range(1,13)]
sns.set()
# Plot empty lines with labels so that each stacked area gets a legend entry
plt.plot([],[], color='sandybrown', label='houseLoanMortage')
plt.plot([],[], color='tan', label='utilitiesBills')
plt.plot([],[], color='bisque', label='transportation')
plt.plot([],[], color='darkcyan', label='carMortage')
# Stack the four expense series on top of each other
plt.stackplot(months, houseLoanMortage, utilitiesBills, transportation, carMortage,
              colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])
plt.legend()
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
plt.show()
c) Pie chart
# URL of the CSV file (alternatively, this can be a file path)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'
# Load the CSV file into a data frame, using the 'type' column as the index
pokemon = pd.read_csv(url, index_col='type')
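The listing above only loads the data; the plotting step is not shown. The sketch below draws a pie chart with matplotlib from a small set of hypothetical counts, since the column names of the downloaded CSV are not given here. The labels and amounts are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical counts per category, made up for illustration only
labels = ["Water", "Normal", "Grass", "Bug", "Fire"]
amounts = [105, 93, 66, 63, 47]

fig, ax = plt.subplots(figsize=(6, 6))
# autopct prints each slice's share of the total as a percentage
wedges, texts, autotexts = ax.pie(amounts, labels=labels, autopct="%1.1f%%", startangle=90)
ax.set_title("Share of items by type")
ax.axis("equal")   # keep the pie circular
plt.show()
```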
d) Table Chart
# Years under consideration
years = ["2010", "2011", "2012", "2013", "2014"]
# Available watt
columns = ['4.5W', '6.0W', '7.0W','8.5W','9.5W','13.5W','15W']
unitsSold = [
[65, 141, 88, 111, 104, 71, 99],
[85, 142, 89, 112, 103, 73, 98],
[75, 143, 90, 113, 89, 75, 93],
[65, 144, 91, 114, 90, 77, 92],
[55, 145, 92, 115, 88, 79, 93],
]
# Define the range and scale for the y axis
values = np.arange(0, 600, 100)
colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))
index = np.arange(len(columns)) + 0.3
bar_width = 0.7
y_offset = np.zeros(len(columns))
fig, ax = plt.subplots()
cell_text = []
n_rows = len(unitsSold)
for row in range(n_rows):
    plot = plt.bar(index, unitsSold[row], bar_width, bottom=y_offset,
                   color=colors[row])
    y_offset = y_offset + unitsSold[row]
    cell_text.append(['%1.1f' % (x) for x in y_offset])
    i = 0
    # Each iteration of this inner loop labels one bar with the cumulative value up to the given year
    for rect in plot:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, y_offset[i], '%d' % int(y_offset[i]),
                ha='center', va='bottom')
        i = i+1
# Add a table to the bottom of the axes
the_table = plt.table(cellText=cell_text, rowLabels=years,
rowColours=colors, colLabels=columns, loc='bottom')
plt.ylabel("Units Sold")
plt.xticks([])
plt.title('Number of LED Bulb Sold/Year')
plt.show()
UNIT-III
Data Transformation: Merging database-style data frames, Concatenating along with an axis,
Merging on index, Reshaping and pivoting, Transformation techniques, Handling missing data,
Mathematical operations with NaN, Filling missing values, Discretization and binning, Outlier
detection and filtering, Permutation and random sampling, Benefits of data
Transformation, Challenges.
Sample Experiments:
1. Perform the following operations
a) Merging Data frames
import pandas as pd
import numpy as np
dataFrame1 = pd.DataFrame({ 'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
'Score' : [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})
# Combine the two data frames row-wise using the pandas concat() method.
dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe
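The concat() call above stacks rows. For database-style merging on a key column, pandas provides merge(). The following is a minimal sketch; the two toy frames and the names in them are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames sharing a StudentID key, for illustration only
scores = pd.DataFrame({"StudentID": [1, 3, 5, 7], "Score": [89, 39, 50, 97]})
names = pd.DataFrame({"StudentID": [1, 3, 5, 9], "Name": ["Asha", "Ben", "Chen", "Dana"]})

# Inner join (the default): keep only StudentIDs present in both frames
inner = pd.merge(scores, names, on="StudentID")

# Left join: keep every row of scores; unmatched names become NaN
left = pd.merge(scores, names, on="StudentID", how="left")

# Outer join: keep every StudentID from either frame
outer = pd.merge(scores, names, on="StudentID", how="outer")

print(inner)
print(left)
print(outer)
```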
b) Reshaping with Hierarchical Indexing
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim',
'Stavanger', 'Kristiansand'])
dframe1
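The frame above is the starting point for reshaping. A short sketch of the reshaping step itself, assuming stack() and unstack() are the operations intended here:

```python
import numpy as np
import pandas as pd

data = np.arange(15).reshape((3, 5))
dframe1 = pd.DataFrame(data, index=['Rainfall', 'Humidity', 'Wind'],
                       columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])

# stack() pivots the column labels into an inner index level,
# giving a Series with a two-level (hierarchical) index
stacked = dframe1.stack()
print(stacked)

# unstack() pivots the inner index level back into columns
restored = stacked.unstack()
print(restored)
```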
c) Data Deduplication
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4, 'column 2': [10, 10, 22,
23, 23, 24, 24]})
frame3
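The frame above contains repeated rows. One way to detect and remove them, sketched with the standard pandas methods duplicated() and drop_duplicates():

```python
import pandas as pd

frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})

# duplicated() marks a row True when an identical row appeared earlier
print(frame3.duplicated())

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = frame3.drop_duplicates()
print(deduped)

# Restrict the comparison to one column and keep the last occurrence instead
last_per_topic = frame3.drop_duplicates(subset=['column 1'], keep='last')
print(last_per_topic)
```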
d) Replacing Values
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332.,
3332. ], 'column 2': range(9)})
replaceFrame.replace(to_replace =-786, value= np.nan)
2.Apply different Missing Data handling techniques
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'], columns=['store1',
'store2', 'store3'])
dfx
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.
dfx
a)NaN values in mathematical Operations
b) Filling in missing data
c) Forward and Backward filling of missing values
d) Filling with index values
e) Interpolation of missing values
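The code for the sub-experiments above is not shown. The sketch below illustrates parts (a), (b), (c) and (e) on the dfx frame from experiment 2; part (d) is not covered. The variable names total, filled, forward, backward and interp are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Rebuild the dfx frame used in experiment 2
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.

# a) NaN in mathematical operations: aggregations skip NaN by default
total = dfx['store1'].sum()
print(total, dfx['store1'].mean())
print(dfx['store1'].sum(skipna=False))   # NaN, because one entry is missing

# b) Filling missing data with a constant
filled = dfx.fillna(0)
print(filled)

# c) Forward and backward filling along the rows of one column
forward = dfx['store4'].ffill()     # carry the last valid value downward
backward = dfx['store4'].bfill()    # pull the next valid value upward
print(forward)
print(backward)

# e) Linear interpolation between the known values
interp = dfx['store4'].interpolate()
print(interp)
```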
3. Apply different data transformation techniques
a) Renaming axis indexes
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim',
'Stavanger', 'Kristiansand'])
dframe1
# Say, you want to transform the index terms to capital letter.
dframe1.index = dframe1.index.map(str.upper)
dframe1
dframe1.rename(index=str.title, columns=str.upper)
b) Discretization and Binning
import pandas as pd
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)
category
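# Count how many heights fall into each bin
# (the top-level pd.value_counts function is deprecated in recent pandas releases)
pd.Series(category).value_counts()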
category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)
category2
c) Permutation and Random Sampling
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)
df
sampler = np.random.permutation(10)
sampler
df.take(sampler)
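The listing above covers permutation. For the random sampling part, one option is the DataFrame.sample method, sketched below. The random_state values are arbitrary and serve only to make the draws repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(80).reshape(10, 8))

# Sample 3 rows without replacement
subset = df.sample(n=3, random_state=42)

# Sample with replacement (bootstrap-style): rows may repeat
boot = df.sample(n=10, replace=True, random_state=42)

# Shuffle all rows; similar in effect to df.take(np.random.permutation(10))
shuffled = df.sample(frac=1, random_state=42)

print(subset)
print(boot)
print(shuffled)
```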
d) Dummy variables
df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'], 'votes':
range(6, 12, 1)})
df
pd.get_dummies(df['gender'])
UNIT-IV
Descriptive Statistics: Distribution function, Measures of central tendency, Measures of
dispersion, Types of kurtosis, Calculating percentiles, Quartiles, Grouping Datasets, Correlation,
Understanding univariate, bivariate, multivariate analysis, Time Series Analysis
Sample Experiments:
1. Study the following Distribution Techniques on a sample data
a) Uniform Distribution
b) Normal Distribution
c) Gamma Distribution
d) Exponential Distribution
e) Poisson Distribution
f) Binomial Distribution
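As a starting point for experiment 1, the sketch below draws samples from each of the six distributions with NumPy's random Generator and prints their sample mean and standard deviation. The parameter values (such as lam=3 or p=0.8) and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seed chosen arbitrarily for repeatability
n = 10_000

samples = {
    "uniform": rng.uniform(low=0, high=10, size=n),
    "normal": rng.normal(loc=0, scale=1, size=n),
    "gamma": rng.gamma(shape=5, scale=1, size=n),
    "exponential": rng.exponential(scale=1, size=n),
    "poisson": rng.poisson(lam=3, size=n),
    "binomial": rng.binomial(n=10, p=0.8, size=n),
}

for name, values in samples.items():
    print(f"{name:12s} mean={values.mean():.3f}  std={values.std():.3f}")
```

Each array can then be passed to a histogram function to inspect the shape of the distribution.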
2. Perform Data Cleaning on a sample dataset.
# Find out the number of values which are not numeric
df['price'].str.isnumeric().value_counts()
# List out the values which are not numeric
df['price'].loc[df['price'].str.isnumeric() == False]
# Set the missing values to the mean of price and convert the data type to integer
price = df['price'].loc[df['price'] != '?']
pmean = price.astype(str).astype(int).mean()
df['price'] = df['price'].replace('?',pmean).astype(int)
df['price'].head()
# Cleaning the horsepower field
df['horsepower'].str.isnumeric().value_counts()
horsepower = df['horsepower'].loc[df['horsepower'] != '?']
hpmean = horsepower.astype(str).astype(int).mean()
df['horsepower'] = df['horsepower'].replace('?',hpmean).astype(int)
df['horsepower'].head()
3. Compute measure of Central Tendency on a sample dataset
a) Mean
b)Median
c)Mode
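A minimal sketch of the three measures using pandas. The height values are hypothetical and stand in for a column of the sample dataset.

```python
import pandas as pd

# Hypothetical heights in cm, for illustration only
height = pd.Series([120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132, 125])

print("Mean:  ", height.mean())
print("Median:", height.median())
# mode() returns a Series because several values can tie for most frequent
print("Mode:  ", height.mode().tolist())
```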
4. Explore Measures of Dispersion on a sample dataset
a) Variance
# variance of every numeric column using the var() function
variance = df.var(numeric_only=True)
print(variance)
# variance of the specific column
var_height=df.loc[:,"height"].var()
print(var_height)
b) Standard Deviation
c) Skewness
d) Kurtosis
# Kurtosis of every numeric column using the kurt() function
kurtosis = df.kurt(numeric_only=True)
print(kurtosis)
# Kurtosis of a specific column
kurt_height = df.loc[:,"height"].kurt()
print(kurt_height)
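Parts (b) and (c) follow the same pattern as variance and kurtosis. A short sketch, using a hypothetical two-column frame in place of the sample dataset:

```python
import pandas as pd

# Hypothetical numeric columns, for illustration only
df = pd.DataFrame({"height": [48.8, 50.2, 52.4, 54.3, 53.1, 55.7, 59.8],
                   "width": [64.1, 65.5, 64.0, 66.2, 66.4, 71.4, 68.9]})

# Standard deviation (sample version, ddof=1 by default in pandas)
print(df.std())
print(df["height"].std())

# Skewness: positive values indicate a longer right tail
print(df.skew())
print(df["height"].skew())
```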
4.a) Calculating percentiles on sample dataset
# calculating the 50th percentile (median) of heights in the dataset
height = df["height"]
percentile = np.percentile(height, 50)
print(percentile)
b) Calculate Inter Quartile Range (IQR) and Visualize using Box Plots
Quartiles divide the data set into four equal parts:
First quartile = 25th percentile
Second quartile = 50th percentile (Median)
Third quartile = 75th percentile
Based on the quartiles, there is another measure called the inter-quartile range that also
measures the variability in the dataset. It is defined as:
IQR = Q3 - Q1
The IQR is largely unaffected by the presence of outliers.
price = df.price.sort_values()
Q1 = np.percentile(price, 25)
Q2 = np.percentile(price, 50)
Q3 = np.percentile(price, 75)
IQR = Q3 - Q1
IQR
df["normalized-losses"].describe()
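The visualization step can be sketched as follows. The price values are hypothetical, with one deliberately extreme value, and the 1.5 * IQR fences are the usual box plot convention rather than something specified in this manual.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical prices with one deliberately extreme value
price = np.array([5118, 6095, 6575, 7295, 7775, 8495, 9279, 10295, 12945, 16500, 45400])

q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the box are conventionally treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = price[(price < lower_fence) | (price > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr, "Outliers:", outliers)

fig, ax = plt.subplots()
ax.boxplot(price)
ax.set_ylabel("Price")
ax.set_title("Box plot of price")
plt.show()
```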
5. Perform the following analysis on automobile dataset.
a) Bivariate analysis b)Multivariate analysis
6. Perform Time Series Analysis on Open Power systems dataset
UNIT-V
Model Development and Evaluation: Unified machine learning workflow, Data preprocessing,
Data preparation, Training sets and corpus creation, Model creation and training, Model
evaluation, Best model selection and evaluation, Model deployment
Case Study:EDA on Wine Quality Data Analysis
Sample Experiments:
1. Perform hypothesis testing using statsmodels library
a) Z-Test
b)T-Test
import numpy as np
from scipy.stats import ttest_1samp

height = np.array([172, 184, 174, 168, 174, 183, 173, 173, 184, 179, 171, 173, 181, 183, 172,
                   178, 170, 182, 181, 172, 175, 170, 168, 178, 170, 181, 180, 173, 183, 180, 177, 181, 171, 173,
                   171, 182, 180, 170, 172, 175, 178, 174, 184, 177, 181, 180, 178, 179, 175, 170, 182, 176, 183,
                   179, 177])
height
height_average = np.mean(height)
print("Average height is = {0:.3f}".format(height_average))
# One-sample t-test; null hypothesis: the population mean height is 175
tset, pval = ttest_1samp(height, 175)
print("P-value = {}".format(pval))
if pval < 0.05:
    print("We are rejecting the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")
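For part (a), no code is given. The sketch below computes a one-sample z-test by hand with the Python standard library so that the formula stays visible; it does not use statsmodels. The sample is the first 30 values of the height array from the t-test, and the hypothesized mean mu0 = 175 mirrors that example.

```python
import math
import statistics

# First 30 heights (cm) from the t-test example
sample = [172, 184, 174, 168, 174, 183, 173, 173, 184, 179,
          171, 173, 181, 183, 172, 178, 170, 182, 181, 172,
          175, 170, 168, 178, 170, 181, 180, 173, 183, 180]
mu0 = 175   # hypothesized population mean

n = len(sample)
mean = statistics.mean(sample)
std = statistics.stdev(sample)          # sample standard deviation (ddof=1)
z = (mean - mu0) / (std / math.sqrt(n))

# Two-sided p-value from the standard normal distribution:
# P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print("z = {:.3f}, p-value = {:.4f}".format(z, p_value))
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

A z-test is usually applied when the sample is reasonably large (a common rule of thumb is n of about 30 or more) or the population standard deviation is known; here the sample standard deviation is used as an estimate.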
2. Develop model and Perform Model Evaluation using different metrics such as prediction
score, R2 Score, MAE Score, MSE Score.
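A minimal sketch of the evaluation metrics, computed directly with NumPy so the formulas are explicit. The data is synthetic, the model is a straight-line fit, and the train/test split (every fifth point held out) is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# Hold out every fifth point for testing
test_mask = np.arange(x.size) % 5 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

# Fit a straight line by least squares and predict on the held-out points
slope, intercept = np.polyfit(x_train, y_train, deg=1)
y_pred = slope * x_test + intercept

errors = y_test - y_pred
mae = np.mean(np.abs(errors))                       # mean absolute error
mse = np.mean(errors ** 2)                          # mean squared error
ss_res = np.sum(errors ** 2)                        # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)      # total sum of squares
r2 = 1.0 - ss_res / ss_tot                          # coefficient of determination

print("MAE = {:.3f}, MSE = {:.3f}, R2 = {:.3f}".format(mae, mse, r2))
```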