ASSOCIATION RULE MINING AND APRIORI ALGORITHM By, AMBOOJ ANAM IQBAL HINA FIRDAUS M.Tech CSE 3rd Semester Guided By, Dr. Harleen Kaur
WHAT IS FREQUENT PATTERN ANALYSIS?
• Frequent itemsets and association rule mining were first proposed by Agrawal, Imielinski, and Swami.
• Motivation: finding inherent regularities in data.
• Pattern mining algorithms can be applied to various types of data, such as:
  • transaction databases
  • sequence databases
  • streams, graphs, etc.
• Pattern mining algorithms can be designed to discover various types of patterns: subgraphs, associations, indirect associations, trends, periodic patterns, sequential rules, lattices, sequential patterns, high-utility patterns, etc.
WHY IS FREQUENT PATTERN MINING IMPORTANT?
• Frequent pattern: an intrinsic and important property of datasets.
• Foundation for many essential data mining tasks:
  • association, correlation, and causality analysis
  • sequential and structural (e.g., sub-graph) patterns
  • pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • classification: discriminative frequent pattern analysis
  • cluster analysis: frequent pattern-based clustering
  • data warehousing: iceberg cube and cube-gradient
  • semantic data compression: fascicles
• Broad applications
WHAT IS ASSOCIATION MINING?
• Proposed by Agrawal et al. in 1993.
• Association rule mining is a procedure for finding frequent patterns, correlations, associations, or causal structures in data sets stored in various kinds of databases, such as relational databases, transactional databases, and other data repositories.
• It is an important data mining model studied extensively by the database and data mining community.
APPLICATIONS OF ASSOCIATION
• Market basket analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently purchased together.
• Telecommunications: each customer is a transaction containing the set of phone calls made.
• Credit cards / banking services: each card/account is a transaction containing the set of the customer's payments.
• Medical treatments: each patient is represented as a transaction containing the ordered set of diseases.
• Basketball game analysis: each game is represented as a transaction containing the ordered set of ball passes.
MARKET BASKET ANALYSIS
• INPUT: list of purchases by purchaser (no customer names)
• Identify purchase patterns:
  • what items tend to be purchased together (obvious: steak and potatoes; beer and pretzels)
  • what items are purchased sequentially (obvious: house then furniture; car then tires)
  • what items tend to be purchased by season
CONTINUE…
• Categorize customer purchase behavior:
  • identify actionable information
  • purchase profiles
  • profitability of each purchase profile
  • use for marketing: layouts or catalogs, selecting products for promotion, space allocation, product placement
• Market basket benefits:
  • selection of promotions and merchandising strategy
  • price sensitivity: Italian entrees, pizza, pies, Oriental entrees, orange juice
  • uncover consumer spending patterns
  • correlations: orange juice & waffles
  • joint promotional opportunities
POSSIBLE MARKET BASKETS
Customer 1: diapers, baby lotion, grapefruit juice, baby food, milk
Customer 2: soda, potato chips, milk
Customer 3: soup, beer, milk, ice cream
Customer 4: soda, coffee, milk, bread
Customer 5: beer, potato chips
CO-OCCURRENCE TABLE
LIMITATIONS
• takes over 18 months to implement
• market basket analysis only identifies hypotheses, which need to be tested (neural network, regression, decision tree analyses)
• measurement of impact needed
• difficult to identify product groupings
• complexity grows exponentially
BENEFITS
• simple computations
• can be undirected (no hypotheses needed before analysis)
• different data forms can be analyzed
ASSOCIATION RULES
• Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
• Customers who purchase maintenance agreements are very likely to purchase large appliances (author experience)
• When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner (author experience)
• So what…
WHAT IS ASSOCIATION RULE MINING?
• Association analysis is used for discovering interesting relationships hidden in large data sets.
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and data mining community.
• All data are assumed to be categorical.
• Initially used for market basket analysis, to find how items purchased by customers are related.
CONTINUED…
• Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases, relational databases, and other information repositories.
• Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository.
• Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control.
• Programmers use association rules to build programs capable of machine learning.
• The following rule can be extracted from the data set shown in Table 1:
  {Diapers} → {Beer}
• The rule suggests that a strong relationship exists between the sale of diapers and beer, because many customers who buy diapers also buy beer.
• Retailers can use this type of rule to identify new opportunities for cross-selling products to their customers.
• Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection.
• An association rule has two parts, an antecedent (if) and a consequent (then).
• An antecedent is an item found in the data.
• A consequent is an item found in combination with the antecedent.
• Rule form: Antecedent → Consequent
• Given:
  (1) a database of transactions,
  (2) each transaction being a list of items purchased by a customer in a visit.
• Find: all rules that correlate the presence of one set of items (itemset) with that of another set of items.
• E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
THE MODEL: RULES
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
• An itemset is a set of items.
  • E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
  • E.g., {milk, bread, cereal} is a 3-itemset.
ASSOCIATION RULES
• Association rule types:
  • Actionable rules – contain high-quality, actionable information
  • Trivial rules – information already well known by those familiar with the business
  • Inexplicable rules – have no explanation and do not suggest action
• Trivial and inexplicable rules occur most often.
HOW GOOD IS AN ASSOCIATION RULE?

POS transactions:
Customer | Items purchased
1        | OJ, soda
2        | Milk, OJ, window cleaner
3        | OJ, detergent
4        | OJ, detergent, soda
5        | Window cleaner, soda

Co-occurrence of products:
               | OJ | Window cleaner | Milk | Soda | Detergent
OJ             | 4  | 1              | 1    | 2    | 2
Window cleaner | 1  | 2              | 1    | 1    | 0
Milk           | 1  | 1              | 1    | 0    | 0
Soda           | 2  | 1              | 0    | 3    | 1
Detergent      | 2  | 0              | 0    | 1    | 2
HOW GOOD IS AN ASSOCIATION RULE?

               | OJ | Window cleaner | Milk | Soda | Detergent
OJ             | 4  | 1              | 1    | 2    | 2
Window cleaner | 1  | 2              | 1    | 1    | 0
Milk           | 1  | 1              | 1    | 0    | 0
Soda           | 2  | 1              | 0    | 3    | 1
Detergent      | 2  | 0              | 0    | 1    | 2

Simple patterns:
1. OJ and soda are more likely to be purchased together than any other two items.
2. Detergent is never purchased with milk or window cleaner.
3. Milk is never purchased with soda or detergent.
HOW GOOD IS AN ASSOCIATION RULE?
• What is the confidence of this rule: if a customer purchases soda, then the customer also purchases OJ?
  • 2 out of 3 soda purchases also include OJ, so 67%.
• What about the confidence of this rule reversed?
  • 2 out of 4 OJ purchases also include soda, so 50%.
• Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items.

POS transactions:
Customer | Items purchased
1        | OJ, soda
2        | Milk, OJ, window cleaner
3        | OJ, detergent
4        | OJ, detergent, soda
5        | Window cleaner, soda
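The two confidences above can be re-derived in a few lines of Python (a minimal sketch; the five POS transactions are copied from the table, and `confidence` is an illustrative helper, not a library function):

```python
# A minimal sketch: re-deriving the soda/OJ confidences above.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(antecedent, consequent, db):
    """conf(X => Y) = count of transactions with X and Y / count with X."""
    both = sum(1 for t in db if (antecedent | consequent) <= t)
    ante = sum(1 for t in db if antecedent <= t)
    return both / ante

print(round(confidence({"soda"}, {"OJ"}, transactions), 2))   # 0.67
print(round(confidence({"OJ"}, {"soda"}, transactions), 2))   # 0.5
```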
CREATING ASSOCIATION RULES
1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items
ASSOCIATION RULES – SUPPORT
• "The support is the percentage of transactions that demonstrate the rule."
• Example: database with transactions (customer_# : item_a1, item_a2, …)
  1: 1, 3, 5.
  2: 1, 8, 14, 17, 12.
  3: 4, 6, 8, 12, 9, 104.
  4: 2, 1, 8.
• support({8, 12}) = 2 (or 50%: 2 of 4 customers)
• support({1, 5}) = 1 (or 25%: 1 of 4 customers)
• support({1}) = 3 (or 75%: 3 of 4 customers)
• An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value – the support threshold.
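The support counts above can be checked with a short sketch (the `support` helper is illustrative, not from any library; the four transactions are copied from the example):

```python
# Support counts for the four-transaction example above.
db = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

def support(itemset, db):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

print(support({8, 12}, db))  # 2  (50%: 2 of 4 customers)
print(support({1, 5}, db))   # 1  (25%)
print(support({1}, db))      # 3  (75%)
```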
ASSOCIATION RULES – CONFIDENCE
• The confidence is the conditional probability that, given X present in a transaction, Y will also be present.
• An association rule is of the form X => Y: if someone buys X, he also buys Y.
• Confidence measure, by definition:
  confidence(X => Y) = support(X, Y) / support(X)
EXAMPLE
Database with transactions (customer_# : item_a1, item_a2, …)
1: 3, 5, 8.    2: 2, 6, 8.    3: 1, 4, 7, 10.    4: 3, 8, 10.    5: 2, 5, 8.
6: 1, 5, 6.    7: 4, 5, 6, 8.    8: 2, 3, 4.    9: 1, 5, 7, 8.    10: 3, 8, 9, 10.

Conf({5} => {8})?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4, so
conf({5} => {8}) = 4/5 = 0.8, or 80%.
EXAMPLE
(same ten-transaction database)

Conf({5} => {8})? 80%. Done.
Conf({8} => {5})?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4, so
conf({8} => {5}) = 4/7 ≈ 0.57, or 57%.
EXAMPLE
(same ten-transaction database)

Conf({9} => {3})?
supp({9}) = 1, supp({3}) = 4, supp({3,9}) = 1, so
conf({9} => {3}) = 1/1 = 1.0, or 100%. OK?
EXAMPLE
Conf({9} => {3}) = 100%. Done.
Notice: high confidence, but low support (supp({3,9}) = 1).
→ The rule {9} => {3} is not meaningful.
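The confidences from the last three slides can be re-derived with a short sketch over the same ten transactions (the `supp` helper is illustrative, not a library function):

```python
# Re-deriving the confidences from the ten-transaction example above.
db = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def supp(itemset):
    """Absolute support: how many transactions contain the itemset."""
    return sum(1 for t in db if itemset <= t)

print(supp({5, 8}) / supp({5}))            # conf({5} => {8}) = 4/5 = 0.8
print(round(supp({5, 8}) / supp({8}), 2))  # conf({8} => {5}) = 4/7 ≈ 0.57
print(supp({3, 9}) / supp({9}))            # conf({9} => {3}) = 1.0, supp only 1
```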
WHAT ARE ASSOCIATION RULE MINING ALGORITHMS?
• There are a large number of them!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same: given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined.
• Some of the proposed algorithms are:
  • AIS algorithm
  • SETM algorithm
  • Apriori algorithm *
  • AprioriHybrid algorithm
  • AprioriTid algorithm
  • FP-Growth algorithm
APRIORI ALGORITHM
• In computer science and data mining, Apriori is a classic algorithm for learning association rules.
• Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).
• The algorithm attempts to find itemsets that occur in at least a minimum number of transactions (the minimum support threshold).
• Apriori uses a "bottom-up" approach, where frequent itemsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
• The algorithm terminates when no further successful extensions are found.
• Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently.
APRIORI ALGORITHM PSEUDOCODE
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++)
    Ck+1 = GenerateCandidates(Lk)
    for each transaction t in database do
        increment the count of those candidates in Ck+1 that are contained in t
    endfor
    Lk+1 = candidates in Ck+1 with support ≥ min_sup
endfor
return ⋃k Lk;
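The pseudocode above can be sketched as runnable Python. Representing itemsets as frozensets and measuring support as an absolute transaction count are implementation choices made here, not mandated by the slide:

```python
# A runnable sketch of the Apriori pseudocode above.
from itertools import combinations

def apriori(db, min_sup):
    """Return all itemsets appearing in at least min_sup transactions."""
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_sup}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk)))
        # candidate generation: self-join Lk, then prune by the subset test
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one database pass: keep candidates meeting min_sup
        Lk = {c for c in Ck
              if sum(1 for t in db if c <= t) >= min_sup}
        frequent |= Lk
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(db, min_sup=2))))  # {2,3,5} is among them
```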
FREQUENT ITEMSET PROPERTY
Any subset of a frequent itemset is itself frequent (equivalently, any superset of an infrequent itemset is infrequent). This is the property Apriori exploits to prune candidate itemsets.
GENERATE CANDIDATES
• Assume the items in Lk are listed in an order (e.g., alphabetical).
• Step 1: self-joining Lk (in SQL):
  insert into Ck+1
  select p.item1, p.item2, …, p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1, …, p.itemk-1 = q.itemk-1, p.itemk < q.itemk
• Step 2: pruning:
  forall itemsets c in Ck+1 do
      forall k-subsets s of c do
          if (s is not in Lk) then delete c from Ck+1
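The two steps above can be sketched in Python instead of SQL; here L2 is a small example set of frequent 2-itemsets (chosen for illustration):

```python
# Self-join and prune, illustrated on L2 = {{1,3},{2,3},{2,5},{3,5}}.
from itertools import combinations

L2 = {frozenset(s) for s in [{1, 3}, {2, 3}, {2, 5}, {3, 5}]}
k = 2

# Step 1: self-join -- union pairs that together form a (k+1)-itemset
C3 = {a | b for a in L2 for b in L2 if len(a | b) == k + 1}
# Step 2: prune -- drop any candidate with an infrequent k-subset
C3 = {c for c in C3 if all(frozenset(s) in L2 for s in combinations(c, k))}

print(sorted(map(sorted, C3)))  # [[2, 3, 5]]
```

The join produces {1,2,3}, {1,3,5}, and {2,3,5}; pruning removes the first two because {1,2} and {1,5} are not in L2.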
STEPS TO PERFORM THE APRIORI ALGORITHM
1. Scan the database to count the support of each item; keep those meeting min_sup as L1.
2. Generate candidate (k+1)-itemsets Ck+1 from Lk by self-join and pruning.
3. Scan the database to count the support of each candidate in Ck+1.
4. Keep the candidates meeting min_sup as Lk+1.
5. Repeat from step 2 until no new frequent itemsets are found.
FORMULAS TO NOTE
• min_sup count = minimum support percentage × total number of transactions in the database
• A rule is strong when its confidence percentage is at least the minimum confidence percentage.
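As a tiny worked instance of the two formulas (the numbers are taken from the example that follows; rounding a fractional count up is an assumption, the slide just multiplies):

```python
# min_sup count and the strong-rule check, on the 50% / 4-transaction example.
from math import ceil

min_sup_pct, n_transactions = 0.50, 4
min_sup_count = ceil(min_sup_pct * n_transactions)  # rounding up is assumed
print(min_sup_count)  # 2

min_conf = 0.50
rule_conf = 2 / 3  # e.g. conf(2 -> 3^5) from the worked example
print(rule_conf >= min_conf)  # True: the rule is strong
```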
APRIORI ALGORITHM EXAMPLE
Given minimum support 50% and minimum confidence 50% in database D, illustrate the Apriori algorithm for finding the frequent itemsets in D.

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5
CONTINUE…
Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (sup ≥ 2): {1}:2, {2}:3, {3}:3, {5}:3

C2 (join L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (sup ≥ 2): {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (join L2): {2 3 5}
Scan D → L3: {2 3 5}:2
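The scan results above can be double-checked by brute force, counting the support of every itemset directly (no Apriori pruning, just enumeration; a sketch, not the deck's procedure):

```python
# Brute-force check of the candidate/frequent sets for database D, min_sup = 2.
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
items = sorted({i for t in D for i in t})

for k in (1, 2, 3):
    counts = {c: sum(1 for t in D if set(c) <= t)
              for c in combinations(items, k)}
    Lk = {c: s for c, s in counts.items() if s >= 2}
    print(f"L{k}:", Lk)
```

The printed L1, L2, and L3 match the scan results above, ending with L3 = {(2, 3, 5): 2}.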
CONTINUE…
Association rules from {2, 3, 5}:

Rule     | Support | Confidence | Confidence (%)
2 → 3^5  | 2       | 2/3 = 0.66 | 66.6%
3 → 2^5  | 2       | 2/3 = 0.66 | 66.6%
5 → 2^3  | 2       | 2/3 = 0.66 | 66.6%
2^3 → 5  | 2       | 2/2 = 1.0  | 100%
2^5 → 3  | 2       | 2/3 = 0.66 | 66.6%
3^5 → 2  | 2       | 2/2 = 1.0  | 100%
CONTINUE…
With the confidence threshold set to 50%, the strong association rules are (sorted by confidence):
1. 2^3 → 5 (1.0)
2. 3^5 → 2 (1.0)
3. 2 → 3^5 (0.66)
4. 3 → 2^5 (0.66)
5. 5 → 2^3 (0.66)
6. 2^5 → 3 (0.66)
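The rule confidences above can be re-derived from the frequent itemset {2, 3, 5}: every non-empty proper subset becomes an antecedent, and the remaining items the consequent (a sketch of the rule-generation step, with `supp` as an illustrative helper):

```python
# Generating the six rules from the frequent itemset {2, 3, 5} in database D.
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
F = {2, 3, 5}

def supp(s):
    """Absolute support of itemset s in D."""
    return sum(1 for t in D if s <= t)

for r in (1, 2):  # antecedent sizes: 1-item and 2-item
    for ante in combinations(sorted(F), r):
        X = set(ante)
        Y = F - X
        conf = supp(F) / supp(X)  # conf(X => Y) = supp(X ∪ Y) / supp(X)
        print(f"{sorted(X)} -> {sorted(Y)}: {conf:.2f}")
```

Running this reproduces the table: the 2-item antecedents {2,3} and {3,5} give confidence 1.0, the rest 0.67.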
PRACTICE PROBLEM
Trace the results of using the Apriori algorithm on the grocery store example with support threshold s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are generated, highlight the strong ones, and sort them by confidence.
SOLUTION
• Support threshold = 33.34% ⇒ the support count threshold is at least 2 transactions.
• Applying Apriori:
• Note that {HotDogs, Buns, Coke} and {HotDogs, Buns, Chips} are not candidates at k = 3, because their subsets {Buns, Coke} and {Buns, Chips} are not frequent.
• Note also that there is normally no need to go to k = 4, since the longest transaction has only 3 items.
• All frequent itemsets: {HotDogs}, {Buns}, {Ketchup}, {Coke}, {Chips}, {HotDogs, Buns}, {HotDogs, Coke}, {HotDogs, Chips}, {Coke, Chips}, {HotDogs, Coke, Chips}.
SOLUTION
Association rules: with the confidence threshold set to 60%, the strong association rules are (sorted by confidence):
APRIORI ADVANTAGES / DISADVANTAGES
• Advantages:
  • uses the frequent (large) itemset property
  • easily parallelized
  • easy to implement
• Disadvantages:
  • assumes the transaction database is memory-resident
  • requires many database scans