CSE142-Fall 2019
Project
Predicting Star Ratings from User Reviews
Handed out Date: Nov 6, 2019
Evaluation deadline: Dec 4, 2019
Report and code due date: Dec 6, 2019
• The project has to be done in groups of 3.
• 10% of the points are for the write-up on the group’s diversity.
• We recommend implementing all the code in Python 3.
• You are allowed to use any external library, including Machine Learning and Natural Language Processing
libraries.
• One (and only one) member of the group has to submit the project report using his/her account on Canvas.
All group members will get points for that submission.
• How to submit your solutions: Your project report must be typed up separately (in at least an 11-point
font) and submitted on the Canvas website as a PDF file. The code and related files should be submitted
in a .zip file.
• Your report should clearly mention your group’s number (from the shared project sign-up sheet), and the
names, email addresses and student ids of all group members.
• You are very strongly encouraged to format your report in LaTeX. You can use other software, but handwritten reports are not acceptable.
• The Computer Science and Engineering Department of UCSC has a zero-tolerance policy for any incident
of academic dishonesty. If cheating occurs, consequences within the context of the course may range from
getting zero on a particular assignment, to failing the course. In addition, every case of academic dishonesty
will be referred to the student’s college Provost, who sets in motion an official disciplinary process.
Cheating in any part of the course may lead to failing the course and suspension or dismissal from the
university.
1 Course Project [100 points]
The growth of e-commerce has brought a significant rise in the importance of customer reviews. There are hundreds
of review sites online and massive amounts of reviews for every product. The ability to successfully decide whether
a review will be helpful to other customers and thus give the product more exposure is vital to companies that
support these reviews.
This project is about automatically identifying the appropriate rating for a given review. Specifically, the
Machine Learning classification task is as follows: given an input text (review), you have to predict the
corresponding rating (from 1 to 5).
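As a starting point, a bag-of-words classifier already gives a reasonable baseline for this kind of text classification task. The sketch below implements a tiny multinomial Naive Bayes classifier in plain Python on made-up reviews; the toy data and function names are illustrative only, not part of the assignment.

```python
import math
from collections import Counter, defaultdict

def train_nb(reviews, ratings):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized reviews."""
    class_counts = Counter(ratings)          # how many reviews per star rating
    word_counts = defaultdict(Counter)       # rating -> word frequencies
    vocab = set()
    for text, star in zip(reviews, ratings):
        for word in text.lower().split():
            word_counts[star][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the rating with the highest (smoothed) log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for star, n in class_counts.items():
        lp = math.log(n / total)             # log prior
        denom = sum(word_counts[star].values()) + len(vocab)
        for word in text.lower().split():
            # Laplace (add-one) smoothing for unseen words
            lp += math.log((word_counts[star][word] + 1) / denom)
        if lp > best_lp:
            best, best_lp = star, lp
    return best

# Toy data (illustrative only)
reviews = ["great food and great service", "terrible food slow service",
           "great service", "terrible slow terrible"]
ratings = [5, 1, 5, 1]
model = train_nb(reviews, ratings)
print(predict_nb(model, "great food"))        # → 5
print(predict_nb(model, "terrible service"))  # → 1
```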
1.1 Dataset
The training dataset provided to you is a modified version of the Yelp Open Dataset. The dataset consists of reviews and
their respective ratings in JSON format.
You will be provided:
• data_train.json:
– This dataset has around 330,000 entries.
– Each entry consists of a review in multiple sentences, its corresponding rating, and also the usefulness of
the review.
– There are five fields: ‘stars’, ‘useful’, ‘funny’, ‘text’
– You need to predict ‘stars’ (the rating) from ‘text’ (the review)
– If you think it helps, you may also use the ‘useful’ and ‘funny’ attributes of the reviews!
– You can download this file from this link
• data_test.json:
– This contains only the reviews and their attributes. The ratings will not be provided for this dataset.
The test set will be provided shortly before the project evaluation deadline.
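The handout does not pin down the exact JSON layout of these files, so the loader below assumes one JSON object per line (JSON Lines) with the fields listed above; if the file turns out to be a single JSON array instead, swap the loop for a single `json.load(f)`. All names here are a sketch, not required by the assignment.

```python
import json

def load_reviews(path):
    """Load review entries, assuming one JSON object per line (JSON Lines)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

# Example with a made-up entry matching the fields listed above
sample = '{"stars": 4, "useful": 2, "funny": 0, "text": "Good pizza, slow service."}'
entry = json.loads(sample)
texts, labels = [entry["text"]], [entry["stars"]]
```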
1.2 Evaluation
Your trained model will be evaluated on a held-out and hidden test set. As mentioned above, the goal at test
time is to predict the rating of each review in the test set. The test set will be provided to you on the day of
the evaluation. Your code should take as input the test file with no labels and output predictions in a .csv file.
We will evaluate your predictions against the ground truth (hidden from you at all times) using the following
performance measures: Accuracy, Precision, Recall, and F1-score.
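For reference, these four measures can be computed from scratch; the sketch below macro-averages precision, recall, and F1 over the five rating classes (a library implementation would work equally well, and the averaging scheme here is one common choice, not necessarily the one the graders use).

```python
def evaluate(y_true, y_pred, classes=(1, 2, 3, 4, 5)):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

acc, p, r, f1 = evaluate([5, 5, 1, 3], [5, 1, 1, 3])
print(acc)  # → 0.75
```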
Your system should be able to accept such a file as input. Note that the file is in JSON format. Also,
the ‘text’ entry of each data point represents a review, and it can contain punctuation marks including commas and
quotation marks. Your predictions file (output) should contain only one column (predicted rating). The file
should be comma-separated (it should not contain any other punctuation marks like quotation marks). Outputs
that do not conform to this format will not be evaluated.
Please see the template file “data_test_wo_label_template.json”. This is just a template test file with 5 entries.
The actual test file that we will release near the end will have around 50,000 entries. We also provide the
template for the output prediction file. See “predictions.csv”. There should be one column with the header ‘predictions’.
The prediction numbers should NOT be quoted. Please make sure your prediction file’s format matches the
template provided.
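Based on the format described above (a single unquoted column under a ‘predictions’ header), a minimal writer might look like this; the function name and example values are illustrative only.

```python
def write_predictions(preds, path="predictions.csv"):
    """Write one prediction per line under a 'predictions' header,
    with no quoting, as the template requires."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("predictions\n")
        for p in preds:
            f.write(f"{int(p)}\n")

write_predictions([5, 3, 4, 1, 2], "predictions.csv")
```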
1.3 Report
Page Limit: 3 pages
You are also expected to write a short report on your findings. The report will describe the details of your
approach like the data cleaning/pre-processing, feature extraction, model details, and experiments done to build
your model. The first section of the report should be titled ‘Tools used’ and should list all the tools/libraries that
you use for the project. In the report, indicate whether you wrote code for a particular step or used a library. For
example, if you try Logistic Regression, indicate when describing your approach whether you used a library or
coded the algorithm yourself. The first page should also contain a small paragraph on diversity. Diversity of the
group can be based on a variety of factors, and as mentioned in class, you don’t have to limit yourself to
race/gender. Talk to
your teammates and find out how you might be different from them. You will be evaluated on your description of
diversity. A typical report would have the following components/sections, but feel free to customize the suggested
components according to your project.
Required Components:
1. Title
2. Group details (full names, email addresses and student ids of all group members)
3. Tools Used (including a short 1-2 sentence description of what they were used for)
4. Diversity
Suggested Components:
1. Abstract (1 paragraph summary of your approach and key findings)
2. Data Pre-processing
3. Feature Extraction
4. Approach(es)
5. Experimental Set-up
6. Results
7. Conclusion
8. Ideas for future work
1.4 What to Submit:
1. Report (.pdf file) to be submitted on Canvas.
2. 〈Names〉_code.zip: This file should contain any code that you write for the project. It should contain a
README, and the code should be properly documented. This file should be submitted on Canvas with your
report.
3. 〈Names〉_predictions.csv containing the predictions of your model on the provided test set. Please note that
we will not be able to evaluate your predictions if your predictions file is not in the correct format.
In the above description, 〈Names〉 should be replaced by the last names of all group members in alphabetical
order. For example, if there were two members in the group, named Joe Smith and Mary Johnson, then the zip
file would be named JohnsonSmith_code.zip.