MSBD5002 (Micromaster)
Data Mining and Knowledge Discovery
Overview
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse
MSBD5002 1
Course Details
Reference books/materials:
Papers
Data Mining: Concepts and Techniques. Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers (3rd edition)
Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)
Major Topics
1. Association
2. Clustering
3. Classification
4. Data Warehouse
5. Web Databases
1. Association
Customer   Apple   Orange   Milk
Raymond    Apple   Orange
Ada                Orange   Milk
Grace      Apple   Orange
…          …       …        …
We are interested in the items/itemsets with frequency >= 2.
Items/Itemsets     Frequency
Apple              2   Frequent Pattern (or Frequent Item)
Orange             3   Frequent Pattern (or Frequent Item)
Milk               1
{Apple, Orange}    2   Frequent Pattern (or Frequent Itemset)
{Orange, Milk}     1
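The frequency counting on this slide can be sketched in a few lines of Python. This is a brute-force illustration that enumerates every itemset of size 1 or 2, not a specific mining algorithm from the course; the names `transactions`, `count_itemsets`, and `frequent_itemsets` are hypothetical, and the "…" row of the table is left out.

```python
from collections import Counter
from itertools import combinations

# Transactions copied from the slide's table (the "…" row is omitted).
transactions = {
    "Raymond": {"Apple", "Orange"},
    "Ada": {"Orange", "Milk"},
    "Grace": {"Apple", "Orange"},
}

def count_itemsets(transactions, max_size=2):
    """Count how many transactions contain each itemset of size 1..max_size."""
    counts = Counter()
    for items in transactions.values():
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[frozenset(itemset)] += 1
    return counts

def frequent_itemsets(transactions, min_frequency=2, max_size=2):
    """Keep only the itemsets whose frequency is >= min_frequency."""
    counts = count_itemsets(transactions, max_size)
    return {s: c for s, c in counts.items() if c >= min_frequency}

frequent = frequent_itemsets(transactions)
for itemset, freq in sorted(frequent.items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), freq)
```

With the threshold of 2 from the slide, the sketch keeps Apple, Orange, and {Apple, Orange}, and drops Milk and {Orange, Milk}.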
1. Association
Customer   Apple   Orange   Milk
Raymond    Apple   Orange
Ada                Orange   Milk
Grace      Apple   Orange
…          …       …        …
We are interested in the items/itemsets with frequency >= 2.
Items/Itemsets     Frequency
Apple              2
Orange             3
Milk               1
{Apple, Orange}    2
{Orange, Milk}     1
Association Rule:
1. Apple → Orange, 100% (customers who buy apple will probably buy orange.)
2. Orange → Apple, 2/3 = 67% (customers who buy orange will probably buy apple.)
Problem: to find all frequent patterns and association rules
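The two percentages can be reproduced from the frequency table. The sketch below assumes the percentage of a rule is the frequency of the combined itemset divided by the frequency of the left-hand side, a ratio usually called the confidence of the rule (the slide itself does not name it). The names `frequency` and `confidence` are hypothetical.

```python
# Frequencies copied from the Items/Itemsets table on the slide.
frequency = {
    frozenset({"Apple"}): 2,
    frozenset({"Orange"}): 3,
    frozenset({"Milk"}): 1,
    frozenset({"Apple", "Orange"}): 2,
    frozenset({"Orange", "Milk"}): 1,
}

def confidence(lhs, rhs, frequency):
    """Fraction of the transactions containing lhs that also contain rhs."""
    both = frozenset(lhs) | frozenset(rhs)
    return frequency[both] / frequency[frozenset(lhs)]

apple_to_orange = confidence({"Apple"}, {"Orange"}, frequency)
orange_to_apple = confidence({"Orange"}, {"Apple"}, frequency)

print(f"Apple -> Orange: {apple_to_orange:.0%}")
print(f"Orange -> Apple: {orange_to_apple:.0%}")
```

Apple appears in 2 transactions and both contain Orange, giving 2/2; Orange appears in 3 transactions, of which 2 contain Apple, giving 2/3.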
2. Clustering
           Computer   History
Raymond    100        40
Louis      90         45
Wyman      20         95
…          …          …
[Figure: plot of the students with axes Computer and History]
Cluster 1 (e.g. High Score in Computer and Low Score in History)
Cluster 2 (e.g. High Score in History and Low Score in Computer)
Problem: to find all clusters
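One way to find the two clusters in the score table is sketched below. The slide does not name a clustering algorithm, so the choice of k-means here is an assumption, as are the starting centroids and the names `scores` and `k_means`.

```python
import math

# (Computer, History) scores copied from the slide's table.
scores = {
    "Raymond": (100, 40),
    "Louis": (90, 45),
    "Wyman": (20, 95),
}

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    assignment = {}
    for _ in range(iterations):
        assignment = {
            name: min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            for name, p in points.items()
        }
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids

# Hypothetical starting centroids: one "high Computer" corner and
# one "high History" corner.
assignment, centroids = k_means(scores, [(100, 0), (0, 100)])
print(assignment)
print(centroids)
```

Index 0 in the result corresponds to Cluster 1 on the slide (Raymond and Louis) and index 1 to Cluster 2 (Wyman).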
3. Classification
Suppose there is a person.
Race    Income   Child   Insurance
white   high     no      ?
Decision tree
root
  child=yes: 100% Yes, 0% No
  child=no
    Income=high: 100% Yes, 0% No
    Income=low: 0% Yes, 100% No
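Classifying the person with the decision tree amounts to walking from the root to a leaf. The sketch below encodes the tree as a nested dictionary, assuming the layout read from the slide (the child=yes branch is a leaf, and the child=no branch splits on Income). The names `tree` and `classify` are hypothetical.

```python
# Decision tree as read from the slide: inner nodes test an attribute,
# leaves hold the percentage of Yes/No for Insurance.
tree = {
    "attribute": "child",
    "branches": {
        "yes": {"Yes": 1.0, "No": 0.0},
        "no": {
            "attribute": "income",
            "branches": {
                "high": {"Yes": 1.0, "No": 0.0},
                "low": {"Yes": 0.0, "No": 1.0},
            },
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record until a leaf is reached,
    then return the label with the highest percentage."""
    while "attribute" in node:
        node = node["branches"][record[node["attribute"]]]
    return max(node, key=node.get)

# The person from the slide; Race is not tested by this tree.
person = {"race": "white", "income": "high", "child": "no"}
prediction = classify(tree, person)
print(prediction)
```

For this person the walk goes root, child=no, Income=high, so the predicted Insurance value is Yes.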
4. Data Warehouse
Users → Query → Databases
Need to wait for a long time (e.g., 1 day to 1 week)
Databases → Data Warehouse → Users
Pre-computed results
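The contrast between querying the databases directly and reading pre-computed results can be sketched as follows. This is a toy illustration only: the purchase rows reuse the customers and items from the association example, and the function names are hypothetical, not part of any real warehouse system.

```python
from collections import Counter

# Toy "database": one row per purchase, reusing the association example.
purchases = [
    ("Raymond", "Apple"), ("Raymond", "Orange"),
    ("Ada", "Orange"), ("Ada", "Milk"),
    ("Grace", "Apple"), ("Grace", "Orange"),
]

def query_database(rows, item):
    """Answer on demand by scanning every row (slow on a large database)."""
    return sum(1 for _, bought in rows if bought == item)

def build_warehouse(rows):
    """Pre-compute the per-item counts once, ahead of any query."""
    return Counter(bought for _, bought in rows)

def query_warehouse(warehouse, item):
    """Answer by looking up the pre-computed result."""
    return warehouse[item]

warehouse = build_warehouse(purchases)
print(query_database(purchases, "Orange"))
print(query_warehouse(warehouse, "Orange"))
```

Both queries return the same count; the difference is that the warehouse pays the scanning cost once, when it is built, so each later query is a lookup rather than a full scan.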
5. Web Databases
Raymond Wong
How to rank the webpages?