CS 663 - Machine Learning Spring, 2020

Data Challenge

The purpose of this challenge is to test your ability to write software and models to collect,
normalise, store, analyse and visualise “real world” data. This challenge is designed to mimic
those you may receive upon applying to positions as Data Scientist or Machine Learning
Engineer. You may draw on your work in the lab assignments. The challenge is designed to
take about two hours, but it is not timed. Please deliver your results by the due date.

You may also use any tools or software on your computer, or that are freely available on the
Internet, as long as the tool works with a Jupyter notebook. We prefer that you use simpler tools
over more complex ones and that you are “lazy” in the sense of using third-party APIs and libraries
as much as possible. The use of obscure, undocumented “black box” libraries is discouraged.

Do as much as you can, as well as you can. Prefer efficient, elegant solutions. Prefer scripted
analysis to unrepeatable use of GUI tools. For data security and transfer time reasons, you have
been given a relatively small data file. Prefer solutions that do not require the full data set to be
stored in memory.
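
As a minimal sketch of the "not fully in memory" preference, the snippet below computes summary statistics for one numeric column in a single pass, holding only one row at a time. It uses only the standard library; the column name `inspection_score` and the sample rows are hypothetical, not taken from the provided files. With pandas, the `chunksize` argument of `read_csv` offers a similar streaming pattern.

```python
import csv
import io
import math


def stream_column_stats(lines, column):
    """Return (count, mean, std) of a numeric CSV column in a single pass.

    `lines` is any iterable of text lines (an open file works), so only one
    row is held in memory at a time. Blank or non-numeric cells are skipped.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for row in csv.DictReader(lines):
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, short row, or non-numeric cell
        if math.isnan(value):
            continue  # treat literal "nan" cells as missing too
        # Welford's online update for the mean and sum of squared deviations
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    std = math.sqrt(m2 / count) if count else float("nan")  # population std
    return count, mean, std


# Tiny in-memory stand-in for a real file; the column name is hypothetical.
sample = io.StringIO("business_id,inspection_score\n1,90\n2,\n3,80\n4,100\n")
count, mean, std = stream_column_stats(sample, "inspection_score")
print(count, round(mean, 2), round(std, 2))
```

For a file on disk, pass `open(path, newline="")` in place of the `io.StringIO` object.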

Finally, we are also interested in your ability to work on a team, which means considering how
to package and deliver your results in a way that makes it easy for us to review them. This does
NOT mean you are allowed to discuss the challenge with others or use their work, including those enrolled in
this or similar courses. It does mean that undocumented code and data dumps are virtually
useless; commented code and a clear writeup with elegant visuals are ideal. Also consider how
asking targeted questions to members of our team may allow you to get more done in less time.

Background

Health Inspectors from the Health Department of the City and County of San Francisco routinely
conduct inspections of restaurants (“facilities”). After conducting an inspection of a facility, a
Health Inspector calculates a score based on the violations observed. Violations can fall into:
● High risk category: records specific violations that directly relate to the transmission of
foodborne illnesses, the adulteration of food products and the contamination of
food-contact surfaces.
● Moderate risk category: records specific violations that are of a moderate risk to the
public health and safety.
● Low risk category: records violations that are low risk or have no immediate risk to the
public health and safety.
These violations may also be graded — i.e. converted to an inspection score — and posted, for
example, on the windows of the facilities. By design, some inspections do not contain violations
or inspection scores.
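
One way to handle the risk categories and the rows without scores is sketched below: the three categories are mapped to an ordinal feature, and rows are separated by whether a score is present. The field names (`risk_category`, `inspection_score`) and the label strings are assumptions for illustration; check them against the actual CSV header before relying on this.

```python
# Hypothetical label strings; unrecognised or empty labels map to 0.
RISK_ORDER = {"low risk": 1, "moderate risk": 2, "high risk": 3}


def encode_risk(label):
    """Map a risk label to an ordinal value; 0 means no recognised violation."""
    if label is None:
        return 0
    return RISK_ORDER.get(label.strip().lower(), 0)


def split_by_score(rows, score_field="inspection_score"):
    """Split row dicts into (with_score, without_score), preserving order."""
    with_score, without_score = [], []
    for row in rows:
        raw = (row.get(score_field) or "").strip()
        (with_score if raw else without_score).append(row)
    return with_score, without_score


rows = [
    {"risk_category": "High Risk", "inspection_score": "72"},
    {"risk_category": "", "inspection_score": ""},
    {"risk_category": "Low Risk", "inspection_score": "96"},
]
scored, unscored = split_by_score(rows)
print(len(scored), len(unscored))
print([encode_risk(r["risk_category"]) for r in rows])
```

Whether rows without a score should be dropped or kept for other purposes is a modelling decision that this sketch leaves open.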

Data

With these instructions, we have provided two CSV files:
● facility_scores_known.csv (9 MB): 43,199 facility records, plus 1 header
● facility_scores_unknown.csv (2 MB): 10,774 facility records, plus 1 header

Requirements (Process)

There are two (2) parts for this challenge:
1. Predict inspection scores.
2. Explain inspection scores.

Predict inspection scores

You may use the training data to create a model for predicting the inspection score of a facility. The
inspection score for each facility is missing from the test set. You must use a model to predict
the inspection score for each instance in this set.

You will submit your predicted inspection scores, which the grader will compare against the
actual values using MSE (mean squared error). Your prediction file must be named “preds.csv”
and be in CSV format with one field: inspection score. This file must have one prediction for
each facility appearing in the test.csv file, in order. For example:
76
17
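
The output format and the metric can be sketched as follows. This assumes, following the example above, that the file has no header row and one value per line; the helper names are illustrative only. MSE is the mean of the squared differences between actual and predicted scores.

```python
import csv
import io


def mean_squared_error(actual, predicted):
    """Mean of squared differences between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def write_predictions(predictions, out):
    """Write one prediction per line, in input order, with no header row."""
    writer = csv.writer(out, lineterminator="\n")
    for value in predictions:
        writer.writerow([value])


# In-memory demonstration; for the real file use
# open("preds.csv", "w", newline="") instead of io.StringIO().
buffer = io.StringIO()
write_predictions([76, 17], buffer)
print(buffer.getvalue())

# Made-up validation values, only to show the arithmetic: (36 + 9) / 2
print(mean_squared_error([76, 17], [70, 20]))  # 22.5
```

Computing MSE on a held-out slice of the training data gives an estimate of the number the grader will compute on the hidden scores.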

Explain inspection scores

Once your model has been created, you must provide an explanation of what factors best
predict the facility’s score.
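
One common, model-agnostic way to approach this is permutation importance: shuffle one feature column at a time and measure how much the error grows. The sketch below is a plain-Python illustration under stated assumptions, not a prescribed method; `toy_predict` is a hypothetical stand-in for a fitted model, and the data are synthetic.

```python
import random


def mse(actual, predicted):
    """Mean squared error of two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled.

    `predict` takes a list of feature rows (lists) and returns predictions.
    """
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    baseline = mse(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, predict(shuffled)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(rows):
    """Hypothetical model: the score depends only on feature 0."""
    return [100 - 5 * row[0] for row in rows]


X = [[i % 4, (i * 7) % 3] for i in range(40)]
y = toy_predict(X)
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Here feature 1 gets an importance of zero because the toy model ignores it, while feature 0 gets a positive value. Correlated features can share or mask importance, so results of this kind are best read alongside domain knowledge.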

Submission

Submit the following to GitHub (starter link):
● Your Jupyter notebook for the implementation of the above
● Your model’s prediction of inspection scores as a CSV file
● A PDF document with an explanation of your process, findings, visualisations, etc.
The submission deadline is 11:59 PM PDT on 13th May, 2020. Late submissions will not be
accepted.

Grading

Each submission will be graded as follows:
● 50% Performance: The competitive accuracy (as measured by MSE) of your model, as
executed on a neutral system (see footnote 1).
● 20% Explanation of Model: The degree to which your process for finding explanatory features
follows a reasonable process. In addition, the correct identification of explanatory features
according to the following:
    20% = Reasonable process and correct features derived
    14% = Not following reasonable process or incorrect features
    7% = Not following reasonable process and incorrect features
● 15% Code Quality: The degree to which your solution is modular, easy to run, easy to
read and contains comments helpful to a peer or other person with skills similar to yours.
    15% = Completely
    9% = Partially
    3% = Poorly
● 10% Process: The degree to which your solution follows a reasonable process
and has documented this process.
    10% = Completely
    7% = Partially: missing process details / module documentation
    3% = Poorly: missing several major details / most documentation
● 5% Execution Time: The competitive wall clock execution time of your model as
executed on a neutral, CPU-based system.

1 For competitive grading, the submissions with the top performance get a full-credit score (e.g. 50/50 on
Performance). Other submissions that do not yield top performance are ranked and graded accordingly.
Your model will be executed once, so be wary of models with varying / random performance.
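
Because the model is run only once and wall clock time is also graded, it may help to fix every source of randomness and to time the full run locally. The sketch below assumes a hypothetical `train_and_predict` function standing in for a real pipeline; the seed value is arbitrary.

```python
import random
import time


def train_and_predict(seed):
    """Hypothetical stand-in for a pipeline with a random component."""
    rng = random.Random(seed)  # local, explicitly seeded generator
    return [round(rng.gauss(85, 5), 3) for _ in range(5)]


start = time.perf_counter()
first = train_and_predict(seed=663)
second = train_and_predict(seed=663)
elapsed = time.perf_counter() - start

print(first == second)  # identical seeds give identical output
print(f"wall clock: {elapsed:.4f} s")
```

Libraries with their own random state usually need their own seed as well, so a single seed set in one place does not necessarily cover the whole pipeline.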
