Projects
In the course project, groups of three students will work together to create classifiers for an in-class
Kaggle prediction competition. The competition training data is available from the uci-cs178-win21
Kaggle site (https://www.kaggle.com/c/uci-cs178-win21/). To give your Kaggle account permission to join the in-class competition and upload results, use the URL posted on Piazza (http://piazza.com/uci/winter2021/compsci178leca).
Kaggle Competition
The Problem
Our competition data are satellite-based measurements of cloud temperature (infrared imaging), used to predict the presence or absence of rainfall at a particular location. The data are
courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing
(http://chrs.web.uci.edu/), and have been pre-processed to extract features corresponding to a model they actively use for predicting rainfall across the globe. Each data point corresponds to a particular
lat-long location where the model thinks there might be rain; the extracted features include
information such as IR temperature at that location, and information about the corresponding cloud
(area, average temperature, etc.). The target value is a binary indicator of whether there was rain
(measured by radar) at that location; you will notice that the data are slightly imbalanced (positives
make up about 30% of the training data).
The Evaluation
Scoring of predictions is done using AUC (https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve), the area under the ROC (receiver operating characteristic) curve. This averages your learner's performance across varying levels of sensitivity to the positive class. You will therefore likely do better if, instead of simply predicting the target class, you also report your confidence in that prediction, so that the ROC curve can be evaluated at different levels of specificity. To do so, report your confidence that it is raining (class +1) as a real number for each test point; your predictions will then be sorted by confidence and the ROC curve evaluated.
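For intuition, here is a minimal sketch of the metric itself, using scikit-learn's roc_auc_score (an illustrative choice of library; the labels and scores below are made-up placeholders):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Made-up labels and real-valued confidence scores: AUC depends only on
    # how the scores rank the points, not on their absolute values.
    y_true = np.array([0, 1, 1, 0, 1])
    scores = np.array([0.2, 0.9, 0.6, 0.4, 0.8])

    print("AUC:", roc_auc_score(y_true, scores))  # 1.0: every positive outranks every negative

Note that any monotone rescaling of your scores leaves the AUC unchanged, so the scores need not be calibrated probabilities.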
Using Kaggle
Download the training features X_train, the training category labels Y_train, and the test features
X_test. You will learn classifiers using the training data, make predictions based on the test features,
and upload your predictions to Kaggle for evaluation. Kaggle will then score your predictions, and
report your performance on a random subset of the test data to place your team on the public
leaderboard. After the competition, the score on the remainder of the test data will be used to
determine your final standing; this ensures that your scores are not affected by overfitting to the
leaderboard data.
Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every candidate classifier and check its leaderboard score. You will need to do your own validation, for example by splitting the training data into multiple folds, to tune the parameters of your learning algorithms before uploading predictions for your top models. The competition closes (uploads will no longer be accepted or scored) on March 17, 2021 at 11:59pm Pacific time.
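For example, a minimal hold-out validation sketch, assuming scikit-learn and whitespace-delimited data files named X_train.txt and Y_train.txt (both the file names and the logistic-regression baseline are illustrative assumptions):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Load the training data (file names assumed; adjust to the Kaggle downloads).
    X = np.genfromtxt("X_train.txt", delimiter=None)
    Y = np.genfromtxt("Y_train.txt", delimiter=None)

    # Hold out 25% of the training data as a validation set.
    Xtr, Xva, Ytr, Yva = train_test_split(X, Y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(Xtr, Ytr)
    print("Validation AUC:", roc_auc_score(Yva, model.predict_proba(Xva)[:, 1]))

A K-fold split (e.g. sklearn.model_selection.cross_val_score with scoring="roc_auc") gives a less noisy estimate at the cost of K training runs.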
Submission Format
Your submission must be a file containing two columns separated by a comma. The first column
should be the instance number (a positive integer), and the second column should be the score for that instance (the probability that it belongs to class +1). The first line of the file should be “ID,Prob1”, the names of the two columns. We have released a sample submission file containing random predictions, named Y_random.txt.
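A minimal sketch of writing a file in this format, assuming scores is a 1-D array holding each test point's class +1 probability in test-set order (the output file name is arbitrary):

    import numpy as np

    # scores: predicted probability of class +1 for each test instance,
    # e.g. scores = model.predict_proba(X_test)[:, 1]
    scores = np.array([0.73, 0.12, 0.55])  # placeholder values

    with open("Y_submit.txt", "w") as f:
        f.write("ID,Prob1\n")
        # Instance numbers assumed to start at 1; check Y_random.txt to confirm.
        for i, p in enumerate(scores, start=1):
            f.write(f"{i},{p}\n")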
Forming a Project Team
Students will work in teams of three to complete the project. We encourage you to start looking for teammates now; one option is to use the "Search for Teammates!" page on Piazza (http://piazza.com/uci/winter2021/compsci178leca). In exceptional circumstances, if you are not able to form a team of three, smaller teams are allowed. However, the same grading standards
are applied to all teams, so smaller teams should expect a larger workload.
Once you've identified your teammates, use the Team tab on Kaggle to merge into a single team. (We know that merging may make your individual HW4 score disappear
from the leaderboard, and will not penalize you for this when grading.) You are required to form a
merged team, and report the team members to the course staff, by March 4, 2021. After this
date, you may not use individual Kaggle accounts to submit predictions for evaluation, only your
merged team account.
To receive credit for forming your team on time, you must submit the "Group Project Team" assignment on Gradescope. One team member should complete this assignment, and Gradescope will then allow that person to select the other team members. Use the "View or edit group" option on Gradescope to be sure this is done correctly. Do not complete the assignment multiple times; only one team member should submit.
Project Requirements
Each project team will learn several different classifiers for the Kaggle data, as well as an ensemble
“blend” of them, to try to predict class labels as accurately as possible. We expect you to experiment with at least three (more is good) different types of classification models; illustrative code sketches for several of these suggestions appear after the list. Suggestions include:
1. K-Nearest Neighbors. KNN models for this data will need to overcome two issues: the large
number of training & test examples, and the data dimension. As noted in class, distance-based
methods often do not work well in high dimensions, so you may need to perform some kind of
feature selection process to decide which features are most important. Also, computing distances
between all pairs of training and test instances may be too slow; you may need to reduce the
number of training examples somehow (for example by clustering), or use more efficient
algorithms to find nearest neighbors. Finally, the right “distance” for prediction may not be
Euclidean in the original feature scaling (these are raw numbers); you may want to experiment
with scaling features differently.
2. Linear models. Since you have relatively few input features but a large amount of training data,
you will probably need to define non-linear features for top performance, for example using
polynomials or radial basis functions.
3. Kernel methods. libSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is one efficient
implementation of SVM training algorithms. But like KNN classifiers, SVMs (with non-linear
kernels) can be challenging to learn from large datasets, and some data pre-processing or
subsampling may be required.
4. Random forests. You will explore decision tree classifiers for this data on homework 4, and
random forests would be a natural way to improve accuracy.
5. Boosted learners. Use AdaBoost, gradient boosting, or another boosting algorithm to train a
boosted ensemble of some base learner (perceptrons, shallow decision trees, Gaussian naive
Bayes models, etc.).
6. Neural networks. The key to learning a good NN model on these data will be to ensure that your
training algorithm does not become trapped in poor local optima. You should monitor
its performance across backpropagation iterations on training/validation data, and verify that
predictive performance improves to reasonable values. Start with a few layers (2-3) and moderate numbers of hidden nodes (100-1000) per layer, and verify improvements over baseline linear
models.
7. Other. You tell us! Apply another class of learners, or a variant or combination of methods like
the above. You can use existing libraries or modify course code. The only requirement is that you
understand the model you are applying, and can clearly explain its properties in the project
report.
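To make suggestion 1 concrete, a minimal KNN sketch that standardizes features before computing distances, reusing the Xtr/Xva/Ytr/Yva split from the validation sketch above (the neighbor count and subsample size are illustrative guesses to tune, not recommendations):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import roc_auc_score

    # Rescale each feature to zero mean / unit variance so no raw feature
    # dominates the Euclidean distance, then classify by nearest neighbors.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=50))
    knn.fit(Xtr[:10000], Ytr[:10000])  # subsample training points if distances are too slow
    print("KNN AUC:", roc_auc_score(Yva, knn.predict_proba(Xva)[:, 1]))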
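For suggestion 2, a sketch of a linear classifier over a non-linear (polynomial) feature expansion; degree 2 is an arbitrary starting point:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Expand the raw features with all degree-2 products, then fit a
    # logistic regression on the expanded representation.
    lin = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2),
                        LogisticRegression(max_iter=1000))
    lin.fit(Xtr, Ytr)
    print("Poly-logistic AUC:", roc_auc_score(Yva, lin.predict_proba(Xva)[:, 1]))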
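For suggestion 3, an RBF-kernel SVM sketch; scikit-learn's SVC wraps libSVM, and the 5,000-point subsample reflects the training-cost caveat above:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    # Kernel SVM training scales poorly with the number of points, so fit
    # on a subsample; probability=True enables predict_proba for AUC scoring.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    svm.fit(Xtr[:5000], Ytr[:5000])
    print("SVM AUC:", roc_auc_score(Yva, svm.predict_proba(Xva)[:, 1]))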
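For suggestion 4, a random forest sketch (200 trees is an arbitrary starting point to tune by validation):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # An ensemble of decorrelated decision trees; averaging their votes
    # gives the class +1 probability used for AUC scoring.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(Xtr, Ytr)
    print("RF AUC:", roc_auc_score(Yva, rf.predict_proba(Xva)[:, 1]))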
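For suggestion 5, a gradient-boosting sketch over shallow trees (all hyperparameters shown are illustrative values to tune):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    # Sequentially fit shallow trees, each correcting the current ensemble's errors.
    gb = GradientBoostingClassifier(n_estimators=300, max_depth=3, learning_rate=0.1)
    gb.fit(Xtr, Ytr)
    print("Boosting AUC:", roc_auc_score(Yva, gb.predict_proba(Xva)[:, 1]))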
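For suggestion 6, a small feed-forward network sketch; the architecture is an illustrative guess, and early_stopping holds out part of the training data to monitor validation performance across iterations, as recommended above:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score

    # Two hidden layers of 100 units each; stop when the internal
    # validation score stops improving rather than at a fixed iteration.
    nn = make_pipeline(StandardScaler(),
                       MLPClassifier(hidden_layer_sizes=(100, 100),
                                     early_stopping=True, max_iter=500,
                                     random_state=0))
    nn.fit(Xtr, Ytr)
    print("NN AUC:", roc_auc_score(Yva, nn.predict_proba(Xva)[:, 1]))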
For each learner, you should do enough work to make sure that it achieves “reasonable”
performance, with accuracy similar to (or better than) baselines like logistic regression or decision
trees. Then, take your best learned models, and combine them using a blending or stacking
technique. This could be done via a simple average/vote, or a weighted vote based on another
learning algorithm. Feel free to experiment and see what performance gains are possible.
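As one concrete possibility, a minimal blend that simply averages the predicted probabilities of the fitted models from the sketches above (uniform weights are the simplest assumption; validation-weighted averages or a stacked second-level learner are natural refinements):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Average the class +1 probabilities of several fitted models.
    models = [knn, lin, rf, gb, nn]  # fitted models from the sketches above
    blend = np.mean([m.predict_proba(Xva)[:, 1] for m in models], axis=0)
    print("Blend AUC:", roc_auc_score(Yva, blend))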
Project Report
By March 18, 2021, each team must submit a single 2-page PDF document describing your learned
classifiers and overall prediction ensemble. Please include:
1. A table listing each model, as well as your best blended/stacked model ensembles, and their performance on training, validation, and leaderboard data. Also include your final
performance on the Private Leaderboard, which becomes visible after the Kaggle
competition closes, and your Kaggle team name.
2. For each model, a paragraph or two describing: what features you gave it (raw inputs, selected
inputs, non-linear feature expansions, etc.); how it was trained (learning algorithm and software
source); and key hyperparameter settings (plus your approach to choosing those settings).
3. A paragraph or two describing your overall prediction ensemble: how did you combine the
individual models, and why did you pick that technique?
4. A conclusion paragraph highlighting the methods/algorithms that you think worked particularly
well for this data, the methods/algorithms that worked poorly, and your hypotheses as to why.
Your project grade will be based mostly on the quality of your written report; groups whose final prediction accuracy is mediocre may still receive a high grade if their results are described and analyzed carefully. However, some additional points will also be given to the teams at the top of the leaderboard.
One team member should upload your PDF to the Gradescope site, and Gradescope will then allow that person to select the other team members. Use the "View or edit group" option on Gradescope to be sure this is done correctly. Do not upload multiple copies of the project report; only one team member should upload.
