
 COM6012 Assignment - Deadline: 13:00 Friday 03 May 2024

Assignment Brief

Please read the assignment brief carefully before starting the assignment.

Release Status:

Q1 - 14 marks

Q2 - 12 marks

Q3 - 12 marks

Q4 - 12 marks

An FAQ will be updated when questions are raised for important clarifications/tips.

How and what to submit

A. Create a folder YOUR_USERNAME-COM6012 containing the following:

1) AS_report.pdf: A report in PDF containing answers (including all figures and tables) to ALL questions at the root of the zipped folder (like readme.txt in the lab solutions). If an answer to a question is not found in this PDF file, you will lose the respective mark. The report should be concise. You may include appendices/references for additional information but marking will focus on the main body of the report.

2) Code, script, and output files: All files used to generate the answers for individual questions in the report above, except the data, should be included. These files should be named properly, starting with the question number, with separate files for each question: for example, your Python code as Q1_code.py and Q2_code.py, your HPC script as Q1_script.sh and Q2_script.sh, and your output files on HPC as Q1_output.txt and Q2_output.txt (and Q1_figB2.jpg, etc.). The results must be generated on the HPC, not on your local machine. We will apply a penalty of 25% for each of these files that is missing. Double-check that these files are included by downloading the zipped file on another machine and opening it to verify.

B. When you have finished ALL the questions, zip your folder YOUR_USERNAME-COM6012 to include the above (one single report plus code, script, and output files for all questions, properly named) and upload this YOUR_USERNAME-COM6012.zip file to Blackboard before the deadline.

C. NO DATA UPLOAD: Please do not upload the data files used. Instead, use the relative file path in your code, assuming data files are downloaded (and unzipped if needed) under the folder ‘Data’, as in the lab.

D. Code and output: 1) Use PySpark 3.5.0 and Python 3.11.7 as covered in the lecture and lab sessions to complete the tasks; 2) Submit your PySpark job to HPC with sbatch to obtain the output.

Assessment Criteria (Scope: Sessions 1 to 9; Total: 50 marks)

1. Being able to use PySpark to analyse big data to answer data analytic questions.

2. Being able to perform tasks covered in Sessions 1 to 9 on large-scale data.

3. Being able to make useful observations and explain obtained results clearly.

Late submissions: We follow the Department's guidelines about late submissions, i.e., “If you submit work to be marked after the deadline you will incur a deduction of 5% of the mark each working day the work is late after the deadline, up to a maximum of 5 working days” but NO late submission will be marked after the maximum of 5 working days because we will release a solution by then. Please see this link.

Use of unfair means: "Any form of unfair means is treated as a serious academic offence and action may be taken under the Discipline Regulations." (from the MSc Handbook). Please carefully read this link on what constitutes Unfair Means if not sure.

 

 

 Question 1. Log Mining and Analysis [14 marks, set by Shuo Zhou]

You need to finish Lab 1 and Lab 2 before solving this question.

Data: Use wget to download the NASA access log July 1995 data (using the hyperlink ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz) to the “Data” folder. The data description is the same as in Lab 2 Task 4 Question 1 so please review it to understand the data before completing the tasks below.

Tasks:

A. Find out the total number of requests for 1) all hosts from Germany ending with “.de”, 2) all hosts from Canada ending with “.ca”, and 3) all hosts from Singapore ending with “.sg”. Report these three numbers and visualise them using a graph of your choice. [2 marks]

B. For each of the three countries in Question A (Germany, Canada, and Singapore), find the number of unique hosts and the top 9 most frequent hosts among them. You need to report three numbers and 3 x 9 = 27 hosts in total. [3 marks]

C. For each country, visualise the percentage (with respect to the total in that country) of requests by each of the top 9 most frequent hosts and the rest (i.e. 10 proportions in total) using a graph of your choice with the 9 hosts clearly labelled on the graph. Three graphs need to be produced. [3 marks]

D. For the most frequent host from each of the three countries, produce a heatmap plot with day as the x-axis (the range of x-axis should cover the range of days available in the log file. If there are 31 days, it runs from 1st to 31st. If it starts from 5th and ends on 25th, it runs from 5th to 25th), the hour of visit as the y-axis (0 to 23, as recorded on the server), and the number of visits indicated by the colour. Three x-y heatmap plots need to be produced with the day and hour clearly labelled. [3 marks]

E. Discuss two most interesting observations from A to D above, each with three sentences: 1) What is the observation? 2) What are the possible causes of the observation? 3) How useful is this observation to NASA? [2 marks]

F. Your report must be clearly written and your code must be well documented so that it is clear what each step is doing. [1 mark]
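The counting steps behind Tasks A, C, and D can be sketched in plain Python on a handful of made-up lines. This is only an illustration of the logic, not the required PySpark solution: the sample lines, the regular expression, and all function names are assumptions, and the real log may contain malformed lines that need extra care. In PySpark the same suffix filter can be expressed with `Column.endswith`, and the counts with `groupBy` and `count`.

```python
import re
from collections import Counter
from datetime import datetime

# Host is the first field; the timestamp sits inside square brackets.
LOG_RE = re.compile(r"^(\S+) .*?\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def parse_line(line):
    """Return (host, day, hour) for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    # The hour is kept exactly as recorded on the server (no timezone shift).
    stamp = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S")
    return match.group(1).lower(), stamp.day, stamp.hour


def requests_for_suffix(records, suffix):
    """Keep the records whose host ends with a country suffix such as '.de'."""
    return [r for r in records if r[0].endswith(suffix)]


def top_shares(records, n=9):
    """Percentage of requests for the n most frequent hosts, plus the rest."""
    counts = Counter(host for host, _, _ in records)
    total = sum(counts.values())
    if total == 0:
        return []
    top = counts.most_common(n)
    shares = [(host, 100.0 * c / total) for host, c in top]
    shares.append(("rest", 100.0 * (total - sum(c for _, c in top)) / total))
    return shares


def heatmap_grid(records, host, first_day, last_day):
    """24 x n_days grid of visit counts: grid[hour][day - first_day]."""
    grid = [[0] * (last_day - first_day + 1) for _ in range(24)]
    for h, day, hour in records:
        if h == host:
            grid[hour][day - first_day] += 1
    return grid


# Hypothetical lines in the log's format, for illustration only.
SAMPLE = [
    'a.uni.de - - [01/Jul/1995:00:00:01 -0400] "GET / HTTP/1.0" 200 1024',
    'a.uni.de - - [02/Jul/1995:13:45:09 -0400] "GET /shuttle/ HTTP/1.0" 200 2048',
    'a.uni.de - - [02/Jul/1995:13:59:59 -0400] "GET /images/ HTTP/1.0" 200 512',
    'b.firm.de - - [03/Jul/1995:08:15:30 -0400] "GET / HTTP/1.0" 304 0',
    'pc7.example.ca - - [01/Jul/1995:09:00:00 -0400] "GET / HTTP/1.0" 200 100',
    'not a valid line',
]
records = [r for r in map(parse_line, SAMPLE) if r is not None]
german = requests_for_suffix(records, ".de")
print(len(german), top_shares(german, n=1))

# The day range comes from the whole log, not from the single host.
grid = heatmap_grid(records, "a.uni.de", first_day=1, last_day=3)
print(grid[13])
```

The grid is indexed as `grid[hour][day - first_day]`, which matches the requested layout of day on the x-axis and hour on the y-axis when passed to a plotting routine.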

Question 2. Liability Claim Prediction [set by Shuo and Robert - 12 marks]

You need to finish Lab 3 and Lab 4 before solving this question.

Data: The dataset you will use is the French Motor Claims Dataset, freMTPL2freq, which comprises risk features and claim numbers collected for 677,991 motor third-party liability policies observed over a year [1]. In total, it contains 12 columns:

● IDpol: The policy ID (used to link with the claims dataset).

● ClaimNb: Number of claims during the exposure period.

● Exposure: The exposure period.

● Area: The area code.

● VehPower: The power of the car (ordered categorical).

● VehAge: The vehicle age, in years.

● DrivAge: The driver's age, in years (in France, people can drive a car at 18).

● BonusMalus: Bonus/malus, between 50 and 350: <100 means bonus, >100 means malus in France.

● VehBrand: The car brand (unknown categories).

● VehGas: Whether the car is gas or Diesel.

● Density: The density of inhabitants (number of inhabitants per km²) in the city the driver of the car lives in.

● Region: The policy regions in France (based on a standard French classification).

Full descriptions of the dataset are available on Kaggle. It can be downloaded as a Pandas dataframe using sklearn.datasets API:

Python

   from sklearn.datasets import fetch_openml

   df_freq = fetch_openml(data_id=41214, as_frame=True).data

● If you have not done so already, you will need to install the scikit-learn package in your conda environment before running this code.

You will use logistic regression and generalised linear models for this question.

Tasks:

A. Pre-processing [2 marks].

a. Convert the data to a PySpark dataframe. Create a new column, hasClaim, indicating the presence or absence of a claim. The value equals 1 if ClaimNb>0, and 0 otherwise. [1 mark]

b. Split the dataset into training (70%) and test (30%) sets (use the last five digits of your registration number on your UCard as the seed to split the dataset). Please use a stratified split on hasClaim for this imbalanced dataset. You may find the pyspark.sql.DataFrame.sampleBy API useful for this task. [1 mark]
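The idea of a stratified split can be sketched in plain Python as follows. This is a minimal illustration on made-up rows, not the PySpark solution: the function names, the toy data, and the seed 12345 are all assumptions. It splits each hasClaim group separately so both sets keep the same class proportions. Note that `sampleBy` works differently in detail: it samples each row independently with the fraction given for its stratum, so its proportions are approximate rather than exact.

```python
import random
from collections import defaultdict


def add_has_claim(rows):
    """Add hasClaim = 1 when ClaimNb > 0, and 0 otherwise."""
    return [dict(row, hasClaim=1 if row["ClaimNb"] > 0 else 0) for row in rows]


def stratified_split(rows, label_key, train_fraction, seed):
    """Shuffle each label group on its own and cut it at train_fraction."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)
    train, test = [], []
    for label in sorted(strata):
        group = strata[label][:]
        rng.shuffle(group)
        cut = round(train_fraction * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy data: 90 policies without a claim, 10 with at least one.
rows = add_has_claim(
    [{"IDpol": i, "ClaimNb": 0} for i in range(90)]
    + [{"IDpol": 90 + i, "ClaimNb": 1 + i % 2} for i in range(10)]
)
# The seed is a placeholder; the brief asks for digits of your own registration number.
train, test = stratified_split(rows, "hasClaim", 0.7, seed=12345)
print(len(train), len(test))
```

Because each class is cut separately, the 10% share of positive rows is preserved in both the training and the test set.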

B. Train predictive models with ten features: Exposure, Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrand, VehGas, Density, and Region. Standardise numeric features and use one-hot encoding to transform categorical features. [7 marks]

a. [3 marks] Sample a small subset from the training set (e.g. 10%), and use cross-validation to determine the best values of regParam (out of [0.001, 0.01, 0.1, 1, 10]) for:

i. Modeling the number of claims (ClaimNb) conditionally on the input features via Poisson regression. [1.5 marks]

ii. Modeling the relationship between hasClaim and the input features via Logistic regression, with L1 and L2 regularisation respectively. [1.5 marks]

 

 b. [4 marks] Utilize the optimal hyperparameters, and train your models on the full dataset using four cores on Stanage. Report the RMSE or accuracy for the test set, along with the model coefficients for each predictive model obtained from the following tasks:

i. Modeling the number of claims (ClaimNb) conditioned on the input features via Poisson regression. [2 marks]

ii. Modeling the relationship between hasClaim and the input features via Logistic regression, with L1 and L2 regularisation respectively. [2 marks]
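The two feature transformations named at the start of Task B, standardising numeric features and one-hot encoding categorical ones, can be illustrated in plain Python. This is a sketch under stated assumptions: the helper names and sample values are hypothetical, and the population standard deviation is used here for simplicity. Library transformers may follow other conventions (for example the sample standard deviation, or dropping one category), so check the defaults of whichever PySpark transformers you use.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 indicator vector per value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        encoded.append(vec)
    return categories, encoded


# Hypothetical values for one numeric and one categorical column.
driv_age = [20.0, 40.0, 60.0]
veh_gas = ["Diesel", "Regular", "Diesel"]

print(standardise(driv_age))
print(one_hot(veh_gas))
```

Standardising puts features such as Density and DrivAge on a comparable scale, which matters because a single regParam penalises all coefficients equally.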

C. Analyse the performance in Q2.B and the coefficients across L1 and L2 regularization obtained in Q2.B.ii, and discuss at least three observations (e.g., anything interesting), with two to three sentences for each observation. If you need to, you can run additional experiments that help you to provide these observations. [3 marks]

[1] A. Noll, R. Salzmann and M.V. Wuthrich, Case Study: French Motor Third-Party Liability Claims (November 8, 2018). doi:10.2139/ssrn.3164764

Question 3. Searching for exotic particles in high-energy physics using ensemble methods [set by Tahsin - 12 marks]

You need to finish Labs 5 and 6 before solving this question.

Data: In this question, you will explore the use of supervised classification algorithms to identify Higgs bosons from particle collisions, like the ones produced in the Large Hadron Collider. In particular, you will use the HIGGS dataset.

Use wget to download the data using the direct link: [http://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz]. You would then need to unzip the dataset first. For this purpose, you can use a tool like gzip.

You will apply Random Forests, Gradient boosting and (shallow) Neural networks over a subset of the dataset in part A and over the full dataset in part B. As performance measures use classification accuracy and area under the curve.

A. Use pipelines and cross-validation to find the best configuration of parameters for each model (8 marks).

a. For finding the best configuration of parameters, use 1% of the data chosen randomly from the whole set. Hint: think of proper class balancing while picking your randomly chosen subset of data. Pick three parameters for each of the two models and use a sensible grid of three options for each of those parameters (6 marks).

b. Use the same splits of training and test data when comparing performances among the algorithms (2 marks).

Please use batch mode to work on this. Although the dataset is not that large, batch mode allows jobs to be queued and lets the cluster allocate resources more effectively.
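To see how large a search over three parameters with three options each becomes, the grid can be enumerated in plain Python. This is only a sketch: the parameter names and values below are illustrative choices for a random forest, not a recommended configuration, and in PySpark the equivalent grid would normally be built with `ParamGridBuilder` and evaluated with `CrossValidator`.

```python
from itertools import product

# Hypothetical grid for one model: three parameters, three options each.
GRID = {
    "maxDepth": [5, 10, 15],
    "numTrees": [10, 20, 50],
    "subsamplingRate": [0.5, 0.8, 1.0],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


configs = expand_grid(GRID)
print(len(configs))
print(configs[0])
```

Each configuration is fitted once per cross-validation fold, so the number of model fits is the grid size multiplied by the number of folds. This is one reason the brief asks for tuning on a 1% subset.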

 

B. Working with the larger dataset. Once you have found the best parameter configurations for each algorithm in the smaller subset of the data, use the full dataset to compare the performance of the three algorithms in the cluster (4 marks). Remember to use the batch mode to work on this.

a. Use the best parameters found for each model in the smaller dataset of the previous step, for the models used in this step (2 marks).

b. Once again, use the same splits of training and test data when comparing performances between the algorithms (2 marks).

Question 4. Movie Recommendation and Cluster Analysis [set by Robert - 12 marks]

You need to finish Lab 7 and Lab 8 before solving this question.

Data: Use wget to download the MovieLens 20M Dataset to the “Data” folder and unzip it there. Please read the dataset description to understand the data before completing the following tasks.

Tasks:

A. Time-split Recommendation [5 marks]

1) Perform time-split recommendation using ALS-based matrix factorisation in PySpark on the rating data in ratings.csv: [2 marks]

● sort all data by the timestamp,

● perform splitting according to the sorted timestamp. Earlier times (the past) should be used for training and later times (the future) should be used for testing, which is a more realistic setting than random split. Consider three such splits with three training data sizes: 40%, 60%, and 80%.

2) For each of the three splits above, study two versions (settings) of ALS using your student number (keeping only the digits) as the seed for the following [2 marks]

● Setting 1: The same ALS setting you used in Lab 7 except the random seed

● Setting 2: Based on the results (see the next step 3 below) from the first ALS setting, choose another different ALS setting that can potentially improve the results. Provide at least a one-sentence justification to explain why you think the chosen setting can potentially improve the results. [This is to imagine a real scenario. You need to think about how the performance might be improved, provide a justification, and then make changes. This implies that failing to improve the results is acceptable, but we expect you to provide a good justification when you make changes aiming to improve the results, and that your justification is sound.]

3) For each split and each version of ALS, compute three metrics: the Root Mean Square Error (RMSE), Mean Square Error (MSE), and Mean Absolute Error (MAE). Put these RMSE, MSE and MAE results for each of the three splits in one Table for the two ALS settings in the report. You need to report 3 metrics x 3 splits x 2 ALS settings = 18 numbers. Visualise these 18 numbers in ONE single figure. [1 mark]
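The time-based split in step 1 and the three metrics in step 3 can be sketched in plain Python on toy data. This is an illustration of the logic only: the function names and sample values are assumptions, and in PySpark the metrics are available through `RegressionEvaluator` with `metricName` set to "rmse", "mse", or "mae".

```python
import math


def time_split(ratings, train_fraction):
    """Sort by timestamp; the earliest train_fraction of rows form the training set."""
    ordered = sorted(ratings, key=lambda r: r["timestamp"])
    cut = round(train_fraction * len(ordered))
    return ordered[:cut], ordered[cut:]


def regression_metrics(actual, predicted):
    """RMSE, MSE and MAE between true ratings and predictions."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}


# Hypothetical ratings, deliberately listed newest first.
ratings = [
    {"userId": i % 3, "movieId": i, "rating": 3.0, "timestamp": 2000 - i}
    for i in range(10)
]
split_sizes = {}
for fraction in (0.4, 0.6, 0.8):
    train, test = time_split(ratings, fraction)
    split_sizes[fraction] = (len(train), len(test))
print(split_sizes)

# Hypothetical true ratings and predictions for a held-out set.
metrics = regression_metrics([4.0, 3.0, 5.0, 2.0], [3.0, 3.0, 4.0, 4.0])
print(metrics)
```

RMSE is the square root of MSE, so the two always rank the settings identically. MAE weights large errors less heavily and can rank them differently.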

B. User Analysis [4 marks]

 

 1) After ALS, each user is modelled by a vector of factors. For each of the three time-splits, use k-means in PySpark with k=25 to cluster all the users based on the user factors learned with ALS Setting 2 above, and find the top five largest user clusters. As before, use the digits of your student number as the random seed for initializing the clusters. Report the size of (i.e. the number of users in) each of the top five clusters in one Table, in total 3 splits x 5 clusters = 15 numbers. Visualise these 15 numbers in ONE single figure. [2 marks]

2) For each of the three splits in Q4-A1, consider only the largest user cluster in Q4-B1, and do the following on the training set only: [2 marks]

● [1 mark] Considering all users in the largest user cluster, find all the movies that have been rated by these users and their respective average ratings (over users in this cluster), and name this collection as movies_largest_cluster. Find those movies in movies_largest_cluster with an average rating greater than or equal to 4 (>=4), and name these as top_movies.

● [1 mark] Use movies.csv to find the genres for all of the top_movies and report the top ten most popular genres. Each movie may have multiple genres, separated by the character ‘|’. Here “most popular” means the genres assigned to the largest number of top_movies. Report these 3 splits x 10 genres = 30 genres in one Table.
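The chain of steps in Part B (cluster sizes, average ratings within the largest cluster, then genre counts) can be sketched in plain Python. All data below is made up, and the function names are assumptions. The k-means clustering itself is not shown, so `user_cluster` stands in for the cluster assignments it would produce.

```python
from collections import Counter, defaultdict


def largest_clusters(user_cluster, top=5):
    """(cluster id, size) pairs for the largest clusters, biggest first."""
    return Counter(user_cluster.values()).most_common(top)


def top_rated_movies(ratings, users, threshold=4.0):
    """Movies whose average rating over the given users is >= threshold."""
    totals = defaultdict(lambda: [0.0, 0])
    for user, movie, rating in ratings:
        if user in users:
            totals[movie][0] += rating
            totals[movie][1] += 1
    return sorted(m for m, (s, n) in totals.items() if s / n >= threshold)


def popular_genres(movie_genres, movies, top=10):
    """Count how many of the given movies carry each '|'-separated genre."""
    counts = Counter()
    for movie in movies:
        counts.update(movie_genres[movie].split("|"))
    return counts.most_common(top)


# Hypothetical toy data: user -> cluster id, (user, movie, rating), movie -> genres.
user_cluster = {1: 7, 2: 7, 3: 2, 4: 7, 5: 2, 6: 0}
ratings = [(1, 10, 5.0), (2, 10, 4.0), (1, 20, 3.0), (2, 20, 4.0), (3, 30, 5.0), (4, 40, 4.0)]
movie_genres = {10: "Action|Sci-Fi", 20: "Drama", 30: "Comedy", 40: "Action|Thriller"}

sizes = largest_clusters(user_cluster)
biggest_id = sizes[0][0]
members = {u for u, c in user_cluster.items() if c == biggest_id}
top_movies = top_rated_movies(ratings, members)
genres = popular_genres(movie_genres, top_movies)
print(sizes, top_movies, genres)
```

Splitting on ‘|’ before counting means a movie with several genres contributes once to each of them, which matches the definition of “most popular” given above.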

C. Discuss your two most interesting observations from A & B above, each with three sentences: 1) What is the observation? 2) What are the possible causes of the observation? 3) How useful is this observation to a movie website such as Netflix? Your report must be clearly written and your code must be well documented so that it is clear what each step is doing. [3 marks]

The END of the Assignment

FAQ

Q1: Can we use libraries other than PySpark to generate the results?

A1: For functionalities available in PySpark, you should use PySpark, particularly for the core computational part. If functionalities are not available in PySpark, you may use other Python libraries.

Q2: Can we use interactive mode for the assignment?

A2: You are required to complete all assignment questions using batch mode.

Q3: Are the graphs/figures required to be generated by Python code?

A3: Yes, all results, including figures, should be generated by Python code. You will lose marks if your submitted script does not include code for creating figures.

 

 
