B565-Data Mining Homework #4

Due on Thursday, March 7, 2023, 11:59 p.m.

February 21, 2023

1

Student Name B565-Data Mining

HW 4

k-means Algorithm in Theory

This part is provided to help you implement k-means clustering algorithm.

1: ALGORITHM k-means

2: INPUT (data ∆, distance d : ∆2 → R≥0, centoid number k, threshold τ)

3: OUTPUT(Setofcentoids{c1,c2,...,ck})

4:

5: *** Dom(∆) denotes domain of data. 6:

7: *** Assume centroid is structure c = (v ∈ DOM(∆),B ⊆ ∆)

8: *** c.v is the centroid value and c.B is the set of nearest points.

9: *** ci means centroid at ith iteration.

10:

11: i=0

12: *** Initialize Centroids

13: forj=1,kdo

14: cij .v ← random(Dom(∆))

15: cij.B ← ∅

16: end for

17:

18: repeat

19: i←i+1

20: *** Assign data point to nearest centroid

21: for δ ∈ ∆ do

22: cij .B ← c.B ∪ {δ}, where mincij {d(δ, cij .v)}

23: end for

24: for j = 1, k do

25: *** Get size of centroid

26: n ← |cij.B|

27: *** Update centroid with average

28: cij.v ← (1/n)Pδ∈cij.B δ

29: *** Remove data from centroid

30: cij.B ← ∅

31: end for

32: *** Calculate scalar product (abuse notation and structure slightly)

33: *** See notes

34: until ((1/k) Pk ||ci−1 − ci ||) < τ j=1 j j

35: return({ci1,ci2,...,cik})

k-means on a tiny data set.

Here are the inputs:

∆ = d((x1, y1), (x2, y2)) =

{(2, 5), (1, 5), (22, 55), (42, 12), (15, 16)}

(1)

[(x1 − x2)2 + (y1 − y2)2)]1/2

k=2 (3) τ = 10 (4)

Observe that Dom(∆) = R2. We now work through k-means. We ignore the uninformative assignments. We remind the reader that T means transpose.

1: i←0

(2)

Page 2 of 9

Student Name B565-Data Mining HW 4

2: *** Randomly assign value to first centroid.

3: c01.v←random(Dom(∆))=(16,19)

4: *** Randomly assign value to second centroid.

5: c02.v←random(Dom(∆))=(2,5)

6: i←i+1

7: *** Associate each datum with nearest centroid

8: c1.B={(22,55),(42,12),(15,16)}

9: c12 .B = {(2, 5), (1, 5)}

10: *** Update centroids

11: c1.v←(26.3,27.7)=(1/3)((22,55)+(42,12)+(15,16))

12: c12.v←(1.5,5)=(1/2)((2,5)+(1,5))

13: *** The convergence condition is split over the next few lines to explicitly show the calculations

16: Since the threshold is met (6.9 < 10), k-means stops, returning {(26.3, 27.7), (1.5, 5)}

14: (1/k)Pk ||ci−1−ci||=(1/2)(||c0−c1||+||c0−c1||)=(1/2)(||?2?−?1.5?||+||?16?−?26.3?||) j=1jj 1122 551927.7

15: = (1/2)[(?.5?T ?.5?)(1/2) + (?−9.7?T ?−9.7?)(1/2) )] = (1/2)(√.5 + √169.7) ∼ (1/2)(13.7) = 6.9 0 0 −8.7 −8.7

Page 3 of 9

Student Name B565-Data Mining HW 4

Problem 1

For this problem we are going to use a diabetes data set collected from many US hospitals on the purpose of analyzing the factors causing readmission of diabetic patients. You can download the data from here [link]. The web-page comes with a downloadable link along with the description of the data.

Answer the following questions [20 points]:

1. Briefly describe this data set–what is its purpose? How should it be used? What are the kinds of data it’s using?

Discussion of data

Answer here. . .

2. Using R/Python, show code that answers the following questions:

(a) How many entries are in the data set?

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

(b) How many unknown or missing data are in the data set?

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

(c) Create histograms for attributes {age, num lab procedures, num medications}. Label the plots properly. Discuss the distribution of values e.g., are uniform, skewed, normal. Place images of these histograms into the document. Show the Python or R code that you used below and discussion below that.

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . .

Problem 1 continued on next page. . .

Page 4 of 9

Student Name B565-Data Mining HW 4

Plots

Place images here with suitable captions.

3. Make a scatter plots of [time in hospital, num medications] and [num medications, num lab procedures] variables and color the data points with the class variable [readmitted]. Discuss the plots, i.e., do

you observe any relationships between variables? Show the Python or R code that you used below and

discussion below that.

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . . Plots

Place images here with suitable captions.

Problem 2

The pseudo-code for k-means and a running example of k-means on a small data set are provided above. Answer the following questions [10 points]:

2.1 Does k-means always converge? Given your answer, a bound on the iterate must be included. How is its value determined?

2.2 What is the run-time of this algorithm?

Problem 3

Implement Lloyd’s algorithm for k-means (see algorithm k-means above) in R or Python and call this program Ck. As you present your code explain your protocol for [20 points]

1. initializing centroids

2. maintaining k centroids 3. deciding ties

4. stopping criteria

Problem 3 continued on next page. . .

Page 5 of 9

Student Name B565-Data Mining HW 4

R/Python Code

subsectionR/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Initialization of Centroids

Answer here. . .

Discussion of Maintaining k Centroids

Answer here. . .

Discussion of Deciding Ties

Answer here. . .

Discussion of Stopping Criteria

Answer here. . . Problem 4

In this question, you are asked to run your program, Ck, against the Diabetes data set from Problem 1 and New York Times Comments data set [link](use the file nyt-comments-2020.csv as your data set). Upon stopping, you will calculate the quality of the centroids and of the partition. For each centroid ci, form two counts:

yi ← X [δ.C = “y”], readmitted/selected δ∈ci.B

ni ← X [δ.C == “n”], not readmitted/selected δ∈ci.B

where[x=y]returns1ifTrue,0otherwise. Forexample,[2=3]+[0=0]+[34=34]=2

The centroid ci is classified as readmitted/selected if yi > ni and not readmitted/selected otherwise. We can now calculate a simple error rate. Assume ci is good. Then the error is:

ni ni + yi

k

= X error(ci)

i=1

error(ci) =

We can find the total error rate easily:

Error({c1, c2, . . . , ck})

Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly, i.e., box plot. Discuss your results.

Note: The error calculation method mentioned above is generalized for both data sets where the first data set asks if a patient was readmitted or not and the second data set asks if a comment was selected by editor

Problem 4 continued on next page. . . Page 6 of 9

Student Name B565-Data Mining HW 4

or not.

Data preparation: Like any other data mining problem, before feeding your data to the clustering algo- rithms, you will have to perform data cleaning, feature engineering, and feature selection on both data sets. For the second data set (NY Times Comments data set), you will be using commentBody column as your input feature (you have to build a corpus and perform an appropriate vector-space representation technique) [20 points].

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . . Plots

Place images here with suitable captions.

Problem 5

The k-means algorithm provided above stops when centroids become stable (Line 34). In theory, k-means converges once SSE is minimized

k

SSE=X X||x−cj.v||2

j x∈cj .B

In this question, you are asked to use SSE as stopping criterion. Run your program, CkSSE , against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Moreover, compare k-means and kmeans++’s run time for k = 2, . . . 5 for 20 runs using both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly, i.e., box plot. Discuss your results [20 points].

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . .

Plots

Place images here with suitable captions.

Page 7 of 9

Student Name B565-Data Mining HW 4

Problem 6

Traditional k-means initialization is based on choosing values from a uniform distribution. In this question, you are asked to improve k-means through initialization. k-means ++ is an extended k-means clustering algorithm and induces non-uniform distributions over the data that serve as the initial centroids. Read the paper and discuss the idea in a paragraph. Implement this idea to improve your k-means program. Run your program, Ck++, against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Moreover, compare Ck, CkSSE and Ck++’s run time for k = 2, . . . 5 for 20 runs using both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly, i.e., box plot. Discuss your results [20 points].

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . . Plots

Place images here with suitable captions.

Problem 7

In this question, you are asked to make use of the R/Python libraries for k-means. The elbow technique is used to determine optimal cluster number. Find the optimal cluster number for the Diabetes and New York Times Comments data sets using elbow method (for 2 ≤ k ≤ 15). Provide plots that show the total SSE for each k. Discuss your results [20 points].

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . .

Plots

Place images here with suitable captions.

Page 8 of 9

Student Name B565-Data Mining HW 4

Extra credit

This part is optional.

1. Ball-k-means calculates the distance of a data point from the centroid to find the annular region in which the data point resides. The annular region helps determine which neighbor centroids should be included in distance computations. This improves the run-time over earlier approaches by avoid- ing expensive computations. Run your program, Ckball , against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Moreover, compare Ck, CkSSE , Ck++ and Ckball ’s run time for k = 2, . . . 5 for 20 runs using both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly, i.e., box plot. Discuss your results [30 points].

R/Python script

# Sample R Script With Highlighting

# Sample Python Script With Highlighting

Discussion of Findings

Answer here. . . Plots

Place images here with suitable captions.

2. The student who has the fastest implementation of all four clustering algorithms will receive extra 20

points [20 points].

Submission

You must use LATEX to turn in your assignments. Please submit the following two files via Canvas:

1. A .pdf with the name yourname-hw4-everything.pdf which you will get after compiling your

.tex file.

2. A .zip file with the name yourname-hw4.zip which should contain your .tex, .pdf, codes(.py, .ipynb, .R, or .Rmd), and a README file. The README file should contain information about dependencies and how to run your codes.

Page 9 of 9