CSE 347/447 Data Mining: Homework 2
Due on 11:59 PM, March 29, 2021
Standard and General Requirements
• Work it by yourself, and try to write as perfect a solution as you can. Discussion is allowed at the level
of technical conversation only. Students are expected to abide by Lehigh Academic Integrity Policy.
If you do, however, give proper acknowledgments.
• Typed solutions are encouraged, especially if your handwriting is messy. However, there will be no
• Partial credit will be given for partial solutions, but not for long off-topic discussion that leads nowhere.
Overall, think before you write, and try to give concise and crisp answers.
• Late policy: You can be at most 3 days late; for every late date you lose 10% of your grade, unless
some other arrangement is agreed to before the due date.
Exercise 1 (25 points): For each of the following questions, provide an example of an association rule from
the market basket data in Table 1 that satisfies the following conditions. Also, describe whether such rules are
subjectively interesting.
(a) A rule that has high support and high confidence.
(b) A rule that has reasonably high support but low confidence.
(c) A rule that has low support and high confidence.
(d) A rule that has low support and low confidence.
Exercise 2 (25 points): Consider the following frequent 3-sequences in Table 2:
(a) List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.
(b) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming
no timing constraints). Note: When there is no timing constraints, all subsequences of a candidate must be
frequent.
Exercise 3 (25 points): Question 9 in 5.10 Exercises (page 441 of the book of “Introduction to Data Mining”
(2nd edition)), i.e.,
The Apriori algorithm uses a generate-and-count strategy for deriving fre-quent itemsets. Candidate itemsets
of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation
step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step.
Suppose the Apriori algorithm is applied to the data set shown in Table 3 with minsup = 30%, i.e., any itemset
occurring in less than 3 transactions is considered to be infrequent.
1
Table 2: Sequence Database
Frequent 3-sequences
〈{1, 2, 3}〉
〈{1, 2} {3}〉
〈{1} {2, 3}〉
〈{1, 2} {4}〉
〈{1, 3} {4}〉
〈{1, 2, 4}〉
〈{2, 3} {3}〉
〈{2, 3} {4}〉
〈{2} {3} {3}〉
〈{2} {3} {4}〉
Table 3: A Transaction Database
(a) Draw an itemset lattice representing the data set given in Table 3. Label each node in the lattice with the
following letter(s):
• N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two
reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the
candidate generation step, or (2) it is generated during the candidate generation step but is subsequently
removed during the candidate pruning step because one of its subsets is found to be infrequent.
• F: If the candidate itemset is found to be frequent by the Apriori algorithm.
• I: If the candidate itemset is found to be infrequent after support counting.
(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
(c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage
of itemsets not considered to be a candidate because (1) they are not generated during candidate generation
or (2) they are pruned during the candidate pruning step.)
(d) What is the false alarm rate (i.e, percentage of candidate itemsets that are found to be infrequent after
performing support counting)?
Exercise 4 (25 points): Consider the data set shown in Table 4.
(a) For each combination of rules given below, specify the rule that has the highest confidence.
2
Table 4: An Internet User Database
• 15 < A < 25→ 10 < B < 20, 10 < A < 25→ 10 < B < 20, and 15 < A < 35→ 10 < B < 20.
• 15 < A < 25→ 10 < B < 20, 15 < A < 25→ 5 < B < 20, and 15 < A < 25→ 5 < B < 30.
(b) Suppose we are interested in finding the average number of hours spent online per week by Internet users
between the age of 15 and 35. Write the corresponding statistics-based association rule to characterize the
segment of users. To compute the average number of hours spent online, approximate each interval by its
midpoint value (e.g., use B = 7.5 to represent the interval 5 < B < 10)?
3

Email:51zuoyejun

@gmail.com