CSCI567: Machine Learning USC, Fall 2022 Homework 4 Instructor: Vatsal Sharan Due: November 16 by 2:00 pm PST We would like to thank Gregory Valiant (Stanford) for kindly sharing some of the problems with us. A reminder on collaboration policy and academic integrity: Our goal is to maintain an optimal learning envi- ronment. You can discuss the homework problems at a high level with other groups, but you should not look at any other group’s solutions. Trying to find solutions online or from any other sources for any homework or project is prohibited, will result in zero grade and will be reported. To prevent any future plagiarism, uploading any material from the course (your solutions, quizzes etc.) on the internet is prohibited, and any violations will also be reported. Please be considerate, and help us help everyone get the best out of this course. Please remember the Student Conduct Code (Section 11.00 of the USC Student Guidebook). General principles of academic honesty include the concept of respect for the intellectual property of others, the expectation that individual work will be submitted unless otherwise allowed by an instructor, and the obligations both to protect one’s own academic work from misuse by others as well as to avoid using another’s work as one’s own. All students are expected to understand and abide by these principles. Students will be referred to the Office of Student Judicial Affairs and Community Standards for further review, should there be any suspicion of academic dishonesty. Notes on notation: • Unless stated otherwise, scalars are denoted by small letter in normal font, vectors are denoted by small letters in bold font and matrices are denoted by capital letters in bold font. • ∥.∥ means L2-norm unless specified otherwise, i.e., ∥.∥ = ∥.∥2. Instructions We recommend that you use LaTeX to write up your homework solution. However, you can also scan handwritten notes. The homework will need to be submitted on Gradescope. 1 Theory-based Questions Problem 1: Decision Trees (12pts) Consider a binary dataset with 400 examples, where half of them belong to class A and the rest belong to class B. Next, consider two decision stumps (i.e. trees with depth 1) T1 and T2, each with two children. For T1, the left child has 150 examples in class A and 50 examples in class B. For T2, the left child has 0 examples in class A and 100 examples in class B. (You can infer the number of examples in the right child using the total number of examples.) 1.1 (6 pts) In class, we discussed entropy and Gini impurity as measures of uncertainty at a leaf. Another possible metric is the classification error at the leaf, assuming that the prediction at the leaf is the majority class among all examples that belong to that leaf. For each leaf of T1 and T2, compute the entropy (base e), Gini impurity and classi- fication error. You can either exactly express the final numbers in terms of fractions and logarithms, or round them to two decimal places. 1.2 (6 pts) Compare the quality of T1 and T2 (that is, the two different splits of the root) based on conditional entropy (base e), weighted Gini impurity and total classification error. Intuitively, which of T1 or T2 appears to be a better split to you (there may not necessarily be one correct answer to this)? Based on your conditional entropy, Gini impurity and classification error calculations, which of the metrics appear to be more suitable choices to decide which variable to split on? 2 Problem 2: Gaussian Mixture Model and EM (10pts + 5 pts bonus) In class, we applied EM to learn Gaussian Mixture Models (GMMs) and showed the M-Step without a proof. Now, it is time that you prove it. Consider a GMM with the following PDF of xi: p(xi) = k∑ j=1 πjN(xi | µj ,Σj) = k∑ j=1 πj ( √ 2π)d|Σj | 12 exp ( −1 2 (xi − µj)TΣ−1j (xi − µj) ) where k is the number of Gaussian components, d is dimension of a data point xi and N is the usual Gaussian pdf (|Σ| in the pdf denotes the determinant of matrix Σ). This GMM has k tuples of model parameters {(µj , Σj , πj)}kj=1, where the parameters represent the mean vector, covariance matrix, and component weight of the j-th Gaussian com- ponent. For simplicity, we further assume that all components are isotropic Gaussian, i.e., Σj = σ2j I . 2.1 (10 pts) Find the MLE of the expected complete log-likelihood. Equivalently, find the optimal solution to the following optimization problem. argmax πj ,µj ,Σj ∑ i ∑ j γij lnπj + ∑ i ∑ j γij lnN(xi | µj ,Σj) s.t. πj ≥ 0 k∑ j=1 πj = 1 where γij is the posterior of latent variables computed from the E-Step. You can use the following fact: Given a1, . . . , ak ∈ R+, the solution to the following optimization problem over q1, . . . , qk: argmax qj k∑ j=1 aj ln qj , s.t. qj ≥ 0, k∑ j=1 qj = 1. is given by: q∗j = aj∑ k′ ak′ . 2.2 (Bonus) (5 pts) The posterior probability of z in GMM can be seen as a soft assignment to the clusters; in contrast, k-means assigns each data point to one cluster at each iteration (hard assignment). Show that if we set {σj , πj}kj=1 in a particular way in the GMM model, then the cluster assignments given by the GMM reduce in the limit to the k-means cluster assignments (where the cluster centers {µj}kj=1 remain the same for both the models). To verify your answer, you should derive p(zi = j|xi) for your choice. 3 Programming-based Questions As in previous homeworks, you need to have your coding environment setup for this part. We use python3 (version ≥ 3.7) in our programming-based questions. There are multiple ways you can install python3, for example: • You can use conda to configure a python3 environment for all programming assignments. • Alternatively, you can also use virtualenv to configure a python3 environment for all programming assignments After you have a python3 environment, you will need to install the following python packages: • numpy • matplotlib (for plotting figures) Note: You are not allowed to use other packages such as tensorflow, pytorch, keras, scikit-learn, scipy, etc. for 3.1-3.2. If you have other package requests, please ask first before using them. You are allowed to use any packages for 3.3-3.5. Download the files for the programming part from https://vatsalsharan.github.io/fall22/hw4. zip. Problem 3: Exploring Decision Trees and Random Forests (12pts) In this question, we will observe the effect of different hyperparameters in training decision trees and random forests, and also visualize what features the random forest model is using to make predictions. We will do this on a Colab notebook HW4-Exploring-Random-Forests.ipynb. Instructions to run the notebook: Upload this file to your USC Google Drive. Then, add Google Colab to your Google App by New → More → Connect more apps → Type in Google Colab and install it, as in Fig. 1. After that, you can run the notebook with your browser. Figure 1: Screenshots showing how to install Colab. We use a variant of the MNIST dataset for this problem. As mentioned in HW 3, the MNIST dataset contains images of handwritten digits (0, 1, 2, ..., 9) and is generally used for the (10) digit classification problem. Here, we work with a binary classfication task of predicting whether the digit is less than 5 or not. Fig. 2 shows some samples images from the dataset with original and binary labels, respectively. You should go through the code we provide and understand what’s happening, but for the purpose of answering the questions you will mainly have to run the code and understand the results. All the models (decision trees, random forests, etc.) are imported from the sklearn library. You should feel free to explore the role of other hyperparameters and other methods (such as bagging, boosting), and go through the documentation to better understand these things, but for this question, we will focus on decision trees and random forests. 4 Figure 2: Some samples images from the MNIST dataset with original (left) and binary (right) labels, respectively. We will look at the effect of 5 parameters: • max_depth: The maximum depth of the decision tree. • n_estimators: The number of decision trees in the forest. • min_samples_leaf: The minimum number of samples required to be a leaf node. • max_samples: The number of samples to draw from the training set to train each decision tree in the forest. • max_features: The number of features to consider when looking for the best split for any node. To observe the effect of a parameter, we look at train and test accuracy for different values of that parameter while keeping the rest of the parameters fixed. 3.1 (2 pts) The first set of plots shows the train and test accuracy for different values of max_depth using a deci- sion tree and a random forest, respectively. How do the accuracies and the generalization gap vary with max_depth in these cases? For a particular value of max_depth, how does the generalization gap for the decision tree compare with that of the random forest? Explain your observations. 3.2 (2 pts) The next plot shows the train and test accuracy for different values of n_estimators for a random forest. Comment on your observations regarding the accuracies and the generalization gap. What value(s) (range or approximate values are enough) would you prefer to use for this parameter? Give reasons why. Hint: Are there any drawbacks to using very high values of n_estimators? For the next three parts of this question, we will also look at the how the size of the training set influences the accuracy trends for a given parameter. In each case, for the first plot, the training set consists of 4000 samples whereas for the second plot, it contains 1000 samples. 3.3 (2 pts) The next set of plots shows the train and test accuracy for different values of min_samples_leaf for a random forest. Taking into account the behaviours for different training set sizes, explain your observations for very low and very high values of min_samples_leaf. What do you conclude from this trend? What could be the reasons for such a behaviour? 3.4 (2 pts) The next set of plots shows the train and test accuracy for different values of max_samples for a random forest. What do you observe for very low and very high values of max_samples? Would you prefer to use 5 low, intermediate or high values for this parameter in both the cases? Give reasons why. Hint: How does the size of the training set influence the choice of this parameter? 3.5 (2 pts) The next set of plots shows the train and test accuracy for different values of max_features for a random forest. Comment on your observations regarding the accuracies and the generalization gap for the two training set sizes. What is the best range of values for this parameter in both the cases? Is it similar/different? Explain your observations. 3.6 (2 pts) In class, we discussed how ensembles are usually not as interpretable as a single decision tree. While this is true, there are still ways to explore which features are used the most by our ensemble. We will explore one such technique in this part. We visualize the feature importances of a random forest model trained for the binary classification task on the MNIST dataset. Intuitively, features with higher importance are the pixels which are used more often in the decision trees in the forest and which lead to better splits, i.e. which contribute more in improving the performance of the model. For more details, you can see Section 18.6.1 of the PML book. The last plot shows the importances of different pixels/portions of the image for a trained random forest model to make its predictions. What portions of the image does the model seem to be focusing on? In other words, can you think of reasons why the pixels with higher importance are indeed important for the prediction task (classifying whether the digit is smaller than 5 or not)? As is usually true for such open-ended questions, there can be multiple correct answers here and we’re looking more for your reasoning than a specific answer. 6 Problem 4: PCA for Learning Word Embeddings(30pts+10pts Bonus) This question is about word embeddings. We saw word embeddings in class in lecture 7. As we discussed then, a word embedding is simply a vector space representation of words which captures some of the semantic and syntactic structures in the language—for example, words similar in meaning have representations which are close to each other in the vector space. Word embeddings have taken natural language processing (NLP) by storm in the past decade or so, and have become the backbone for numerous NLP tasks such as question answering and machine translation. There are neural approaches to learning word embeddings, but in this question we will study a simple PCA-based scheme which does a surprisingly good job at learning word embeddings. We have created a word co-occurrence matrix M of the 10000 most frequent words from a Wikipedia corpus with 1.5 billion words. The co-occurrences were obtained by using a sliding window of length 5 across the Wikipedia corpus. Entry Mij of the matrix denotes the number of times words i and j occur in the corpus within the same sliding window. The file co_occur.csv contains the symmetric co-occurence matrix. dictionary.txt contains the dictionary for interpreting this matrix, the ith row of the dictionary is the word corresponding to the ith row or column of M. The dictionary is sorted according to the word frequencies. Therefore the first word in the dictionary—“the” is the most common word in the corpus and the first row or column of M contains the co-occurrence counts of “the” with every other word in the dictionary. We provide some starter code in the file hw4-pca.py with some useful functions (e.g. to read files, generate plots, etc.) which can be used directly, and instructions on how to complete the functions required for this problem. 4.1 (6 pts) First, read the co-occurrance matrix and the list of all words from the given files. Let the matrix M be the n × d (n = d = 10000) matrix of word co-occurrences. As we discussed in class, a suitable normalization or scaling is often very helpful to get PCA to work well. In light of the power law distribution of word occurrences, in this case we will work with the normalized matrix M˜ such that each entry M˜ij = log(1 +Mij). We regard the i-th row of M as the datapoint for the i-th word. We will use PCA to find the first 100 principal components of the data. Let M˜c be the centered version of M˜. Use the PCA function from the sklearn library (refer https://scikit-learn.org/stable/modules/ generated/sklearn.decomposition.PCA.html) to get V, i.e. the set of first 100 principal components or eigenvectors of M˜c. Note that you can directly use the fit method with M˜c as the input. Also get the eigenvalues of the covariance matrix (check the documentation of the function) and plot all the 100 eigenvalues. Do the eigenvalues seem to decay? What percent of the variance in the data is explained by the first 100 eigenvalues we calculated (note that there are 10,000 eigenvalues in total)? 4.2 (6 pts) In this question, we find embeddings for all 10,000 words in the dictionary using the princpal compo- nents V. Then, we will use word embeddings to find word(s) which are ‘similar’ to a given word. To obtain the embeddings of the words, we will project the datapoint corresponding to each word (its co-occurences with every other word) onto the 100 dimensional space spanned by the first 100 principal components, similar to the general approach we laid down in class. Here are the two steps you should follow. Recall that we regard the ith row of M˜c as the datapoint for the ith word. Given the 100 PCs (V), we now project each datapoint (row of M˜c) onto these PCs. Denote this n×k (10000×100) matrix as P (you should write P as a matrix operation using M˜c and V). Next, to ensure that each PC gets equal importance, we normalize the vector of the projections of all the words onto the jth PC (i.e. the jth column of P) to have unit norm, for all j = {1, ..., 100}. Denote this n × k (10000 × 100) matrix as E. Finally, normalize the rows of E such that each row has unit ℓ2 norm, to get a new matrix E˜. We regard the ith row of E˜ as the embedding of the ith word. Next, we will define a similarity metric for the word embeddings. We will use the cosine-similarity as the similarity metric. As all the vectors have unit ℓ2 norm, the cosine similarity between two words i and j with embeddingswi andwj is equal to the inner product ⟨wi,wj⟩. Now that we have a similarity metric defined, we can have some fun with these embeddings by querying for the closest word to any word we like! Try finding the closest words to some common words, such as “learning”, “university”, “california”, and comment on your observations. 7 4.3 (6 pts) We’ll now interpret the principal components/eigenvectors (columns of V). For any i, denote vi as the eigenvector corresponding to the ith largest eigenvalue. Note that the entries of this vector correspond to the 10000 words in our dictionary, we’ll call these our 10000 variables. By sorting the entries of vi by absolute value, and observing what the top 10 variables and their (signed) entries are, we can infer what information the ith eigenvector roughly captures. Can you find 5 interesting eigenvectors, and point out what semantic or syntactic structures they capture? Can you do this for all 100 eigenvectors? Hint: What do you observe about PCs or eigenvectors with small eigenvalues? 4.4 (12 pts) In this question, we will explore a curious property of the word embeddings—that certain directions in the embedded space correspond to specific syntactic or semantic concepts. Let w1 be the word embedding for “woman” and w2 be the word embedding for “man”. Let w = w1 −w2, and wˆ = w∥w∥ . 4.4.1 (6 pts) Project the embeddings of the following words onto wˆ: boy, girl, brother, sister, king, queen, he, she, john, mary, wall, tree. Present a plot of projections of the embeddings of these words marked on a line. For example, if the projection of the embedding for “girl” onto wˆ is 0.1, then you should label 0.1 on the line with “girl”. What do you observe? 4.4.2 (6 pts) Present a similar plot of the projections of the embeddings of the following words onto wˆ: math, history, nurse, doctor, pilot, teacher, engineer, science, arts, literature, bob, alice. What do you observe? Why do you think this is the case? Do you see a potential problem with this? Remember that word embeddings are extensively used across NLP. Suppose LinkedIn used such word embeddings to find suitable candidates for a job or to find candidates who best match a search term or job description. What might be the result of this? If you want to learn more about this, you might find it interesting to read the original paper1 which pointed out this issue in word embeddings. 4.5 (Bonus) (10 pts) In this question, we will explore the property that that directions in the embedded space cor- respond to semantic or syntactic concepts in more depth. Because word embeddings capture semantic and syntactic concepts, they can be used to solve word analogy tasks. For example, consider an analogy question— “man is to woman as king is to ”, where the goal is to fill in the blank space. This can be solved by finding the word whose embedding is closest to wwoman − wman + wking in cosine similarity. You can do this by a nearest neighbor search across the entire dictionary—excluding the three words man, woman, king which already appear in the analogy as they cannot be valid answers to the question. Here wi represents the word embedding for the word i. Refer to Fig. 3 for why this makes sense in light of the fact that directions in the embedded space correspond to semantic/syntactic concepts. Figure 3: Figure denoting how word embeddings might encode gender, tense or country-capital relationships. Image source. 1https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf 8 We have provided a dataset, analogy_task.txt, which tests the ability of word embeddings to answer anal- ogy questions, such as those represented in Fig. 3. Using the cosine similarity metric, find and report the accuracy of the word embeddings you have constructed on the word analogy task. Look at the incorrect/correct answers of the approach and comment on the results. For example, what types of analogy questions seem to be harder to answer correctly for this approach? Deliverables: Discussions for all the parts. Plots for parts 4.1, 4.4.1, 4.4.2. Code for all the parts as a separate Python file hw4-pca.py. 9 Problem 5: Project Launchpad (16pts + 10pts bonus) In this part, you will implement a simple linear regression model for the final project. At the same time, you will also explore the use of the Pandas package, a very popular Python library for data analysis. Here is a tutorial for the Pandas package. You should do this part together with your project team (each team will make one submission). Early bird bonus We encourage you to start early on the project. The first 5 teams among all teams in CSCI-567 class at the HW4 submission deadline can get 10 bonus points. To get this bonus, at least one of the team members will have to give a 5-minute presentation on their approach in the discussion session on November 17. We will consider the final ranking on the public leaderboard as of 2pm PST on Nov 16 for this part of bonus points. Setup We will complete this part of assignment on a jupyter notebook project-starter.ipynb. To load the notebook into Kaggle, follow these steps: (a) In the “Code” tag of the competition, click button “New Notebook”. (b) Then, select “File/Import Notebook”, and import regression.ipynb into Kaggle. Deliverables Notebook with completed code, output, and the discussion for 5.4. 10 Submission Instructions Each project team (not homework team) will make one submission for this problem (there will be a separate Gradescope link to submit solutions for this problem). Please submit your notebook along with the output for grading. If you are using Kaggle notebook, you can export the notebook by “File/Download Notebook”, see figure below: General Tip 1: before getting started, please read the dataset description carefully. Make sure you understand the datatype of each entries and the meanings behind them. General Tip 2: all of the tasks in this part can be done within three lines of code. You should spend most of your time reading documentations to understand certain methods provided by the Pandas package, instead of implementing the functionalities yourself. Part I - Feature Engineering (12 pts) One of the most important steps in developing a machine learning model is feature engineering. In this part, we will walk you through the process of constructing a simple set of features for the final project. 5.1 Moving average of oil prices (6 pts) In this task, you are asked to compute the moving average of the oil prices with window size 7. 5.1.1 (2 pts) In Section 1 of the notebook, after loading the oil price into data oil, compute data oil["ma oil"] as the moving average of past seven recorded oil prices (note that the oil prices of some dates are missing, and we will skip them in the computation). 5.1.2 (4 pts) After computing data oil["ma oil"], you should now deal with the missing dates in data oil["ma oil"]. A DataFrame calendar with contiguous dates from 2013-01-01 to 2017-08-31 is created for you. You are asked to complete the following: (a) Merge DataFrames calendar and data oil. The merged DataFrame should have contiguous dates from 2013-01-01 to 2017-08-3 as indexes, and the same columns as data oil, where missing values are denoted by NaN. (b) Fill in the missing values of data oil["ma oil"]. There are many ways to deal with missing values, such as fill it with some default value, fill it with the last valid observation, or fill it with the next valid observation. Here, you are asked to fill in each missing value by the last valid observation. Perform this operation in place on data oil["ma oil"]. 5.2 Workday feature (3 pts) Next, we will create an extra feature indicating whether each date is a workday. Here, we only make use of the fact that Monday to Friday are usually workdays, and we ignore all the holidays information in holidays events.csv (which is left for you to explore). In Section 2 of the notebook, your task is to create a column calendar["wd"] such that a date between Monday and Friday has value True, and a date that is Saturday or Sunday has value False. 11 5.3 Create trend feature (3 pts) Now we are ready to load all training and testing data into the notebook (Section 3). The last step before training our model is to create the trend feature: note that a linear regression model cannot read the dates, and we need to map the dates into certain features. Here, you are asked to implement the simplest trend feature: given a range of dates, map them into a range of integers starting from 1. You need to create a DataFrame X as the trend feature, which is then extended with the moving average of oil prices and workday indicators to form the final training features. Part II - Train Your Model (4 pts) We are now ready to train the model! Here, we will train a simple linear regression model, which only takes a few seconds. 5.4 Display training error by the type of product (2 pts) Let’s show the training error to get a sense of how well the model is fitting the training data. The model is making sales predictions for each store and type of product. You are asked to show the training error within each type of product. Which family of product does the model has the most difficulty in predicting the sales? 5.5 Make predictions on the test set (2 pts) Now you can use your trained model to make predictions on the test set. To do this, you need to first create the features for the test set (denoted by X test) similar to the training set. Fill in your steps for computing X test in the notebook. 5.6 Make a submission You are now ready to make a submission. Simply open the “Competitions” tag on the right, and click the submit button to make the submission. You should get a score around 0.45. 12
欢迎咨询51作业君