CSE 347/447 Data Mining: Homework 2

Due on 11:59 PM, March 29, 2021

Standard and General Requirements

• Work on it by yourself, and try to write as complete a solution as you can. Discussion is allowed at the level of technical conversation only; if you do discuss the problems with others, give proper acknowledgment. Students are expected to abide by the Lehigh Academic Integrity Policy.

• Typed solutions are encouraged, especially if your handwriting is messy. However, there will be no extra marks for typed answers.

• Partial credit will be given for partial solutions, but not for long off-topic discussion that leads nowhere. Overall, think before you write, and try to give concise and crisp answers.

• Late policy: You can be at most 3 days late; for every late day you lose 10% of your grade, unless some other arrangement is agreed to before the due date.

• Submission: Please return your answers as a .PDF file to CourseSite.

Exercise 1 (25 points): For each of the following questions, provide an example of an association rule from the market basket data in Table 1 that satisfies the stated condition. Also, describe whether such rules are subjectively interesting.

(a) A rule that has high support and high confidence.

(b) A rule that has reasonably high support but low confidence.

(c) A rule that has low support and high confidence.

(d) A rule that has low support and low confidence.

Table 1: Market Basket Data.
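As a reminder of how the two measures in (a)–(d) are computed, here is a minimal sketch. Since the contents of Table 1 are not reproduced in this text-only copy, the basket list and item names below are hypothetical stand-ins, not data from the table.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """support(antecedent U consequent) / support(antecedent)."""
    joint = set(antecedent) | set(consequent)
    return support(baskets, joint) / support(baskets, antecedent)

# Hypothetical baskets standing in for Table 1 (not the actual table data):
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "beer"},
]
print(support(baskets, {"milk", "bread"}))       # support of {milk, bread}
print(confidence(baskets, {"milk"}, {"bread"}))  # confidence of milk -> bread
```

A rule for (a) would be one where both printed numbers are high; for (c), one where the first is low but the second is high, and so on.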

Exercise 2 (25 points): Consider the frequent 3-sequences shown in Table 2:

(a) List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.

(b) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints). Note: When there are no timing constraints, all subsequences of a candidate must be frequent.
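The pruning check described in the note can be mechanized. A rough sketch, representing a sequence as a tuple of frozensets (one per element): deleting one item at a time yields all (k−1)-subsequences, and a candidate survives only if every one of them is frequent. This illustrates the mechanism generically; it is not the answer to the exercise.

```python
def drop_one_item(seq):
    """All subsequences obtained by deleting a single item from a sequence
    (a tuple of frozensets). With no timing constraints, GSP prunes any
    candidate for which one of these subsequences is infrequent."""
    subs = set()
    for i, element in enumerate(seq):
        for item in element:
            rest = element - {item}
            new = list(seq)
            if rest:
                new[i] = rest
            else:
                del new[i]          # an emptied element disappears entirely
            subs.add(tuple(new))
    return subs

def survives_pruning(candidate, frequent):
    """Keep the candidate only if every drop-one subsequence is frequent."""
    return drop_one_item(candidate) <= frequent

# The 3-sequence <{1,2} {3}> has three 2-subsequences:
for sub in drop_one_item((frozenset({1, 2}), frozenset({3}))):
    print(sub)
```

Deleting 1, 2, and 3 in turn gives 〈{2} {3}〉, 〈{1} {3}〉, and 〈{1, 2}〉 respectively.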

Exercise 3 (25 points): Question 9 in 5.10 Exercises (page 441 of the book “Introduction to Data Mining” (2nd edition)), i.e.,

The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step.

Suppose the Apriori algorithm is applied to the data set shown in Table 3 with minsup = 30%, i.e., any itemset occurring in fewer than 3 transactions is considered to be infrequent.
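The generate-then-prune loop described above can be sketched as follows. This is a minimal illustration of joining frequent k-itemsets that share their first k−1 items and then pruning by the subset test, run on made-up 2-itemsets rather than on Table 3.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Candidate generation (join itemsets sharing their first k-1 items)
    followed by pruning (every k-subset of a candidate must be frequent)."""
    freq = {tuple(sorted(s)) for s in frequent_k}
    k = len(next(iter(freq)))
    candidates = set()
    for a in freq:
        for b in freq:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:   # join step
                cand = a + (b[-1],)
                if all(tuple(sorted(sub)) in freq    # prune step
                       for sub in combinations(cand, k)):
                    candidates.add(frozenset(cand))
    return candidates

# Made-up frequent 2-itemsets: {a,b}, {a,c}, {b,c}, {a,d}.
# {a,b,c} survives; {a,b,d} and {a,c,d} are pruned ({b,d}, {c,d} infrequent).
f2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("ad")}
print(apriori_gen(f2))
```

Surviving candidates would still need a support-counting pass over the transactions before being declared frequent (the F and I labels in part (a) below).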


Table 2: Sequence Database

Frequent 3-sequences
〈{1, 2, 3}〉
〈{1, 2} {3}〉
〈{1} {2, 3}〉
〈{1, 2} {4}〉
〈{1, 3} {4}〉
〈{1, 2, 4}〉
〈{2, 3} {3}〉
〈{2, 3} {4}〉
〈{2} {3} {3}〉
〈{2} {3} {4}〉

Table 3: A Transaction Database

(a) Draw an itemset lattice representing the data set given in Table 3. Label each node in the lattice with the following letter(s):

• N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.

• F: If the candidate itemset is found to be frequent by the Apriori algorithm.

• I: If the candidate itemset is found to be infrequent after support counting.

(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?

(c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)

(d) What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?

Exercise 4 (25 points): Consider the data set shown in Table 4.

(a) For each combination of rules given below, specify the rule that has the highest confidence.


Table 4: An Internet User Database

• 15 < A < 25 → 10 < B < 20, 10 < A < 25 → 10 < B < 20, and 15 < A < 35 → 10 < B < 20.

• 15 < A < 25 → 10 < B < 20, 15 < A < 25 → 5 < B < 20, and 15 < A < 25 → 5 < B < 30.

(b) Suppose we are interested in finding the average number of hours spent online per week by Internet users between the ages of 15 and 35. Write the corresponding statistics-based association rule to characterize this segment of users. To compute the average number of hours spent online, approximate each interval by its midpoint value (e.g., use B = 7.5 to represent the interval 5 < B < 10).
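The averaging step in (b) reduces to a weighted mean of interval midpoints. A sketch with hypothetical per-interval user counts (Table 4's actual contents are not reproduced in this copy):

```python
# Hypothetical counts of users aged 15-35 per B-interval: (lo, hi) -> count.
counts = {(5, 10): 2, (10, 15): 3, (15, 20): 1}

total = sum(counts.values())
# Approximate each interval by its midpoint, e.g. B = 7.5 for 5 < B < 10,
# then take the count-weighted mean of the midpoints.
avg_hours = sum((lo + hi) / 2 * n for (lo, hi), n in counts.items()) / total
print(round(avg_hours, 2))
```

The resulting rule would then read 15 < A < 35 → B̄ = avg_hours, with the mean computed from the actual counts in Table 4.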

