2021/4/6 Assignment 3 | COMP9321 21T1 | WebCMS3
https://webcms3.cse.unsw.edu.au/COMP9321/21T1/resources/59350 1/5
Assignment 3
Introduction
In this assignment you will use the provided movie dataset and the machine learning algorithms you have learned in this course to find out, knowing only things you could know before a film was released, what the rating and revenue of the film would be. The rationale here is that your client is a movie theater that would like to decide how long it should reserve a theater to show a movie when it is released.
Datasets
In this assignment, you will be given two datasets: training.csv (https://github.com/mysilver/COMP9321-Data-Services/raw/master/20t1/assign3/training.csv) and validation.csv (https://github.com/mysilver/COMP9321-Data-Services/raw/master/20t1/assign3/validation.csv).
You can use the training dataset (but not the validation dataset) for training machine learning models, and the validation dataset to evaluate your solutions and avoid over-fitting.
Please Note:
This assignment specification is deliberately left open to encourage students to submit innovative
solutions.
You can only use Scikit-learn to train your machine learning models
Your model will be evaluated against a third dataset (available for tutors, but not for students)
You must submit your code and a report
The due date is 20/04/2021 18:00
Part-I: Regression (10 Marks)
In the first part of the assignment, you are asked to predict the "revenue" of movies based on the information in
the provided dataset. More specifically, you need to predict the revenue of a movie based on a subset (or all)
of the following attributes (**make sure you DO NOT use rating** ):
cast,crew,budget,genres,homepage,keywords,original_language,original_title,overview,production_companies,
production_countries,release_date,runtime,spoken_languages,status,tagline
Part-II: Classification (10 Marks)
Using the same datasets, you must predict the rating of a movie based on a subset (or all) of the following
attributes (**make sure you DO NOT use revenue** ):
cast,crew,budget,genres,homepage,keywords,original_language,original_title,overview,production_companies,
production_countries,release_date,runtime,spoken_languages,status,tagline
Submission
You must submit two files:
A python script z{id}.py
A report named z{id}.pdf
Python Script and Expected Output files
Your code must be executable on CSE machines using the following command, which takes two path arguments:
$ python3 z{id}.py path1 path2
path1 : indicates the path for the dataset which should be used for training the model (e.g.,
~/training.csv)
path2 : indicates the path for the dataset which should be used for reporting the performance of the
trained model (e.g., ~/validation.csv); we may use different datasets for evaluation
For example, the following command will train your models for the first part of the assignment and use the
validation dataset to report the performance:
$ python3 YOUR_ZID.py training.csv validation.csv
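A minimal sketch of how the script might read the two path arguments (the function and variable names here are illustrative, not required by the spec; pandas is assumed available, as in the lab activities):

```python
import sys

import pandas as pd  # used throughout the lab activities


def load_datasets(argv):
    """Return (train_df, eval_df) from the two positional path arguments."""
    if len(argv) != 3:
        raise SystemExit(f"usage: python3 {argv[0]} training.csv validation.csv")
    # argv[1] is used only for training; argv[2] only for reporting performance
    return pd.read_csv(argv[1]), pd.read_csv(argv[2])


if __name__ == "__main__":
    train_df, eval_df = load_datasets(sys.argv)
    # ... train models on train_df only, then predict on eval_df ...
```

Reading the paths from `sys.argv` (rather than hard-coding file names) matters here, since different datasets may be substituted at marking time.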
Your program should create 4 files in the same directory as the script:
z{id}.PART1.summary.csv
z{id}.PART1.output.csv
z{id}.PART2.summary.csv
z{id}.PART2.output.csv
For the first part of the assignment:
" z{id}.PART1.summary.csv " contains the evaluation metrics (MSR, correlation) for the model trained for the first part of the assignment. Use the given validation dataset to compute the metrics. The file should be formatted exactly as follows:
zid,MSR,correlation
YOUR_ZID,6.13,0.73
MSR : the mean_squared_error in the regression problem
correlation : The Pearson correlation coefficient in the regression problem (a floating number
between -1 and 1)
" z{id}.PART1.output.csv " stores the predicted revenues for all of the movies in the evaluation dataset (not the training dataset), and the file should be formatted exactly as follows:
movie_id,predicted_revenue
1,7655555
2,75875765
...
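These two files could be produced along the following lines (a sketch; `write_part1_files`, `y_true`, and `y_pred` are illustrative names, where `y_true`/`y_pred` are the real and predicted revenues on the evaluation dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def write_part1_files(zid, movie_ids, y_true, y_pred):
    # MSR: mean squared error between real and predicted revenues
    msr = mean_squared_error(y_true, y_pred)
    # Pearson correlation coefficient between predictions and real values
    correlation = np.corrcoef(y_true, y_pred)[0, 1]
    with open(f"{zid}.PART1.summary.csv", "w") as f:
        f.write("zid,MSR,correlation\n")
        f.write(f"{zid},{msr:.2f},{correlation:.2f}\n")  # 2 decimal places, per the FAQ
    with open(f"{zid}.PART1.output.csv", "w") as f:
        f.write("movie_id,predicted_revenue\n")
        for mid, revenue in zip(movie_ids, y_pred):
            f.write(f"{mid},{revenue}\n")
    return msr, correlation
```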
For the second part of the assignment:
" z{id}.PART2.summary.csv " contains the evaluation metrics (average_precision, average_recall, accuracy; the averages are unweighted means) for the model trained for the second part of the assignment. Use the given validation dataset to compute the metrics. The file should be formatted exactly as follows:
zid,average_precision,average_recall,accuracy
YOUR_ZID,0.69,0.71,0.89
2021/4/6 Assignment 3 | COMP9321 21T1 | WebCMS3
https://webcms3.cse.unsw.edu.au/COMP9321/21T1/resources/59350 3/5
average_precision : the average precision for all classes in the classification problem (a number
between 0 and 1)
average_recall : the average recall for all classes in the classification problem (a number between 0
and 1)
" z{id}.PART2.output.csv " stores the predicted ratings for all of the movies in the evaluation dataset (not the training dataset), and it should be formatted exactly as follows:
movie_id,predicted_rating
1,1
2,4
...
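A sketch of how the Part 2 summary metrics could be computed with scikit-learn, using macro (unweighted) averaging as the FAQ requires (`part2_summary` and the argument names are illustrative):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def part2_summary(zid, y_true, y_pred):
    # 'macro' = unweighted mean of the per-class scores
    precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
    recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
    accuracy = accuracy_score(y_true, y_pred)
    with open(f"{zid}.PART2.summary.csv", "w") as f:
        f.write("zid,average_precision,average_recall,accuracy\n")
        f.write(f"{zid},{precision:.2f},{recall:.2f},{accuracy:.2f}\n")
    return precision, recall, accuracy
```

`zero_division=0` avoids warnings when a rating class is never predicted; whether that situation arises depends on your model.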
Marking Criteria
For EACH of the parts, you will be marked based on:
(3 marks) Your code must run and perform the designated tasks on CSE machines without problems
and create the expected files.
(3 marks) How well your model (trained on the training dataset) performs on the test dataset
(2 marks) You must correctly calculate the evaluation metrics (e.g., average_precision - 2 decimal
places ) in the output files (e.g., z{id}.PART2.summary.csv)
(2 marks) One page report containing:
Performance of your model on the validation dataset and how you evaluated the performance and
improved it (e.g., relying on feature selection, switching from one machine learning model to a
more suitable one,...etc.)
Problems you have faced in predicting (e.g., JSON formatted columns, keywords, missing data)
and how you tried to solve the problem.
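On the JSON-formatted columns mentioned above: columns such as genres and cast are stored as JSON-like strings, and one common way to turn them into usable features is to parse out the names. This is a sketch, assuming the single-quoted list-of-dicts format seen in such movie CSVs:

```python
import ast


def extract_names(cell):
    """Turn a JSON-like cell, e.g. "[{'id': 28, 'name': 'Action'}]",
    into a list of names; returns [] for missing or malformed cells."""
    try:
        # ast.literal_eval handles the single-quoted, Python-literal style
        # that json.loads would reject
        return [entry["name"] for entry in ast.literal_eval(cell)]
    except (ValueError, SyntaxError, TypeError):
        return []
```

The resulting lists of names could then be one-hot encoded, for example with scikit-learn's MultiLabelBinarizer.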
The minimum coefficient value in the regression model is 0.3 in the test dataset (not validation). As listed above, you will be marked on different aspects (e.g., the report), and your submission will be compared to the rest of the students' to adjust marks and be fair to all. Do your best to improve your models, and make sure you do not overfit, because you will be marked based on a third dataset, called the "test dataset".
In the classification problem, your accuracy should be higher than a baseline. The baseline model labels all movies with the most frequent class (e.g., assuming all movie ratings are 3).
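The baseline described above corresponds to scikit-learn's most-frequent dummy classifier, which you could use to sanity-check your own model (a sketch; the function name is illustrative):

```python
from sklearn.dummy import DummyClassifier


def baseline_accuracy(X_train, y_train, X_eval, y_eval):
    # A most-frequent baseline ignores the features entirely and always
    # predicts the majority class seen in the training labels.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    return baseline.score(X_eval, y_eval)
```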
You will be penalized if your models take more than 3 minutes to train and generate outputs
Your assignment will not be marked (zero marks) if any of the following occur:
If it generates hard-coded predictions
If it uses the second dataset (test/validation) to train the model
If it does not run on CSE machines with the given command (e.g., python3 zid.py
training_dataset.csv test_dataset.csv)
Do not hard-code the dataset names
FAQ
Can we define our own feature set?
Yes, you can define any features; make sure your features do not rely on the validation (or test) datasets
What is the difference between validation and test datasets?
The validation dataset is provided for you to tune your models; the test dataset will not be provided to
students, instead, it will be used to evaluate your model.
For the average precision/recall functions, are we to use the unweighted ('macro') mean or the weighted mean?
Use the unweighted ('macro') mean.
Should we calculate metrics to 1 decimal place?
2 decimal places.
Can we use any machine learning algorithm?
Yes, as long as it is provided in sklearn.
What Python modules can we use for developing our solutions?
You can use any modules presented in the lab activities; if a module is not there, you may get permission by asking ...
How should we calculate the Pearson correlation coefficient?
It is calculated between your predictions and the real values for the validation (or test) dataset.
Resource created Sunday 28 March 2021, 07:53:59 AM; last modified Tuesday 06 April 2021, 04:08:22 PM.
Plagiarism
This is an individual assignment. The work you submit must be your own. Submitting work partially or completely derived from any other person, or jointly written with any other person, is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course, and possibly other disciplinary actions. Assignment submissions will be checked using plagiarism detection tools for both code and the report, and then examined manually.
Do not provide or show your assignment work to any other person - apart from the teaching staff of this course.
If you knowingly provide or show your assignment work to another person for any reason, and work derived
from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.
Note that it is also your duty to protect your code artifacts. If you are using an online service to store your code (e.g., GitHub), make sure to keep the repository private and do not share access with anyone.
Reminder: Plagiarism is defined as (https://student.unsw.edu.au/plagiarism) using the words or ideas of others
and presenting them as your own. UNSW and CSE treat plagiarism as academic misconduct, which means
that it carries penalties as severe as being excluded from further study at UNSW. There are several online
sources to help you understand what plagiarism is and how it is dealt with at UNSW:
Plagiarism and Academic Integrity (https://student.unsw.edu.au/plagiarism)
UNSW Plagiarism Procedure (https://www.gs.unsw.edu.au/policy/documents/plagiarismprocedure.pdf)
Make sure that you read and understand this. Ignorance is not accepted as an excuse for plagiarism. In
particular, you are also responsible for ensuring that your assignment files are not accessible by anyone but
you by setting the correct permissions in your CSE directory and code repository, if using one (e.g., Github and
similar). Note also that plagiarism includes paying or asking another person to do a piece of work for you and
then submitting it as your own work.
UNSW has an ongoing commitment to fostering a culture of learning informed by academic integrity. All UNSW
staff and students have a responsibility to adhere to this principle of academic integrity. Plagiarism undermines
academic integrity and is not tolerated at UNSW.
Comments
Runqi Liu (Tue Apr 06 2021 20:43:28 AEST; last modified Tue Apr 06 2021 21:31:33 AEST):
1. Could we import the sys module to pass in parameters?
2. Could we import numpy?
3. In addition, the requirement asks us to output predicted revenues for all of the movies in the evaluation dataset, but where could we find the evaluation dataset?
Thanks

Di Wu (Tue Apr 06 2021 19:07:13 AEST):
Can you provide download links for these two CSV files?

Vishal Bondwal (Tue Apr 06 2021 20:20:16 AEST):
I could download them by right-clicking and selecting 'Save Link As...'. You can also click to open them, then right-click the page and choose 'Save Page As...'.

Daniel Fan (Tue Apr 06 2021 17:01:36 AEST):
Hi,
What is the meaning of this sentence in the Marking Criteria section: "The minimum coefficient value in the regression model is 0.3 in the test dataset (not validation)"?
- Doesn't the minimum coefficient value depend on what features we choose? E.g. if we scale a feature, won't it potentially be much higher?
- Is this a restriction imposed on our model? But how can we even determine this if we can't train on the test set?
Thanks in advance