Practical 1: Credit Approval
CS5014 Machine Learning
Due date: Fri 5th March (Week 6) 21:00
40% of the coursework grade
Aims
The aim of this practical is to gain experience in applying machine learning methodology to
a real dataset. The focus is on good understanding and justification of steps. A successful
submission will demonstrate the understanding of:
• how to load, clean, and process a dataset;
• how to train a standard algorithm;
• how to report and interpret the results; and
• how to write clear, concise, and re-usable research code.
Dataset
The dataset for this practical is provided on studres in the directory named data. The
directory contains two files: crx.names describes the features in the dataset, while
crx.data contains the actual data in CSV format. The data are anonymised records of
individual credit card applications, together with the final outcome (successful or
unsuccessful). Read crx.names to familiarise yourself with the values. You will notice that
some of the values are numerical (also referred to as ‘continuous’), and some are categorical.
With categorical inputs, the description will list all valid labels. The target variable (outcome
of the application) uses ‘+’ for successful and ‘-’ for unsuccessful applications.
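As an illustration of the format described above, the following is a minimal sketch of loading such a file with Pandas. It assumes the file has no header row and that missing values are written as ‘?’; check both assumptions against crx.names. The rows and column names below are made up for the example and are not taken from crx.data.

```python
import io

import pandas as pd

# Made-up rows in a comma-separated layout; NOT real records from crx.data.
SAMPLE = """b,30.83,0,u,+
a,?,4.46,y,-
b,24.50,0.5,?,+
"""

# Hypothetical column names for this sketch only.
COLUMNS = ["cat_a", "num_a", "num_b", "cat_b", "outcome"]


def load_applications(source):
    """Read a header-less CSV, treating '?' as missing (an assumption)."""
    frame = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # Map the target symbols to 1 (successful) and 0 (unsuccessful).
    frame["outcome"] = frame["outcome"].map({"+": 1, "-": 0})
    return frame


frame = load_applications(io.StringIO(SAMPLE))
print(frame.dtypes)        # inspect which columns were read as numbers
print(frame.isna().sum())  # count missing values per column
```

Passing a file path instead of the StringIO object would read from disk in the same way.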
Task
You will create a machine learning model that can predict credit card approvals from the
described dataset. You will create two deliverables: the source code for your solution, and a brief
report which answers specific questions about your solution. You should follow the steps
outlined in this spec and use the questions to guide your progress. Both code and the report are
important: you must specifically answer each question listed below and will be evaluated on
how well your answers demonstrate understanding of the topics covered in lectures.
In this practical, you will be marked based on your use of logistic regression. There are
no extra points for using more advanced algorithms. If you experiment with other classifiers,
please separate this work into new Python files and clearly identify them to ease marking.
Part 1: Data Processing
Start by loading the dataset using Pandas. You may want to drop or clean some of the values,
change the encoding, or apply scaling. You will also need to separate the dataset into a training
and testing set. In your report you should clearly answer the following questions about your
data processing:
(a) How did you load and clean the data, and why?
(b) How did you split the data into training and test sets, and why?
(c) How did you process the data including encoding, conversion, and scaling, and why?
(d) How did you ensure that there is no data leakage?
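One common source of leakage is computing scaling statistics on the full dataset before splitting. The sketch below shows one possible way to avoid this, using only the standard library: the split is made first, and the mean and standard deviation are learned from the training rows only. The feature values and helper names are hypothetical, and this is not the only valid approach.

```python
import random
import statistics


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices once with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def fit_scaler(values):
    """Learn standardisation statistics from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std


def apply_scaler(values, mean, std):
    """Standardise values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


# Made-up numeric feature, for illustration only.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
train_idx, test_idx = split_indices(len(feature))
train = [feature[i] for i in train_idx]
test = [feature[i] for i in test_idx]

mean, std = fit_scaler(train)                # training rows only
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)  # reuses the training statistics
```

The fixed seed keeps the split repeatable between runs, which also supports the code quality criteria in this spec.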
Part 2: Training
After loading the data, you should train a logistic regression classifier to predict the output
from the inputs. You should use the scikit-learn LogisticRegression class for this. Make sure
you are familiar with all the parameters offered by this implementation, and what they mean.
In this part, make sure to set the penalty parameter to ‘none’ to get the basic, unregularised
version of logistic regression. In your report you should clearly explain how you performed
each of the following tasks during the training process:
(a) Train using penalty=‘none’ and class_weight=None. What is the best and worst
classification accuracy you can expect and why?
(b) Explain each of the following parameters used by LogisticRegression in your own words:
penalty, tol, max_iter.
(c) Train using balanced class weights (setting class_weight=‘balanced’). What does
this do and why is it useful?
(d) LogisticRegression provides three functions to obtain classification results. Explain the
relationship between the vectors returned by predict(), decision_function(), and
predict_proba().
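To make questions (c) and (d) more concrete, here is a small standard-library sketch for the binary case. It assumes the usual logistic model, in which the positive-class probability is the sigmoid of a signed score and the predicted label is the positive class when that score is above zero. The weighting function follows the heuristic documented by scikit-learn for ‘balanced’ weights, n_samples / (n_classes * count of each class). The function names are hypothetical and the scores are made up.

```python
import math


def sigmoid(score):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def outputs_from_score(score, classes=("-", "+")):
    """Derive the probability pair and the label from one signed score."""
    p_positive = sigmoid(score)
    probabilities = (1.0 - p_positive, p_positive)  # ordered like `classes`
    label = classes[1] if score > 0 else classes[0]
    return probabilities, label


def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class count)."""
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


for score in (-2.0, 0.5, 3.0):  # made-up scores
    print(score, outputs_from_score(score))

# Made-up imbalanced labels: the rarer class receives the larger weight.
print(balanced_weights(["+"] * 2 + ["-"] * 6))
```

A probability above 0.5 for the positive class corresponds exactly to a positive score, so the three outputs carry the same ranking information at different levels of detail.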
Part 3: Evaluation
After successfully training your model on the training data, you should evaluate your model
on the testing data. It is fine to use built-in sklearn functionality like accuracy_score, but
you will have to understand what such functions do. In your report, you must clearly explain
the following:
(a) What is the classification accuracy of your model and how is it calculated? Give the
formula.
(b) What is the balanced accuracy of your model and how is it calculated? Give the formula.
(c) Show the confusion matrix for your classifier for both unbalanced (2a) and balanced (2c)
cases. Discuss any differences.
(d) Plot the precision-recall curve and report the Average Precision (AP) for your algorithm.
What is the relationship between AP and the PR curve?
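As a sketch of what the built-in metric functions compute in the binary case, the example below derives accuracy and balanced accuracy from the four confusion matrix counts. Accuracy is (TP + TN) / (TP + TN + FP + FN), and balanced accuracy is the mean of the true positive rate and the true negative rate. The labels are made up and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="+"):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of the per-class recalls (TPR and TNR)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Made-up labels with an imbalanced class distribution.
y_true = ["+", "+", "-", "-", "-", "-", "-", "-"]
y_pred = ["+", "-", "-", "-", "-", "-", "-", "+"]

counts = confusion_counts(y_true, y_pred)
print("TP, FP, TN, FN =", counts)
print("accuracy =", accuracy(*counts))
print("balanced accuracy =", balanced_accuracy(*counts))
```

On imbalanced data the two numbers can differ noticeably, because balanced accuracy gives each class the same influence regardless of its size.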
Part 4: Advanced Tasks
Once you have successfully completed Parts 1-3, you can try the advanced tasks listed below.
These are required for a grade of 17 and higher, but they cannot make up for poor
performance in previous tasks.
(a) Set the penalty parameter in LogisticRegression to ‘l2’. Give the equation of the
cost function used by LogisticRegression as the result. Derive the gradient of this l2-
regularised cost.
(b) Implement a 2nd degree polynomial expansion on the dataset. Explain how many
dimensions this produces and why.
(c) Compare the results of regularised and unregularised classifiers on the expanded data
and explain any differences.
(d) This question requires you to read the relevant research papers and combine them with what
you have learnt to form an answer. Note that the available solvers to train the model include
“newton-cg”, “lbfgs”, and “sag”. “newton-cg” stands for Newton Conjugate Gradient, “lbfgs”
stands for limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) [1], and “sag” stands for
Stochastic Average Gradient [2]. Read the relevant references and answer: which of the three
methods are first-order optimisation methods and which are second-order? What are the
differences between first-order and second-order optimisation? Why does the package suggest
that “sag” is faster for large datasets?
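For task (b), the size of a 2nd degree expansion can be checked by enumerating the terms directly. The sketch below assumes the expansion contains an optional constant term, every original feature, and every product of two features (squares included), which gives C(n + 2, 2) terms for n input features when the constant is counted. The helper names are hypothetical and the input row is made up.

```python
from itertools import combinations_with_replacement
from math import comb


def degree2_terms(n_features, include_bias=True):
    """List the index tuples of all monomials of degree 0, 1 and 2."""
    terms = [()] if include_bias else []
    for degree in (1, 2):
        terms.extend(combinations_with_replacement(range(n_features), degree))
    return terms


def expand(row, include_bias=True):
    """Evaluate each monomial on one row of feature values."""
    expanded = []
    for term in degree2_terms(len(row), include_bias):
        value = 1.0
        for index in term:
            value *= row[index]
        expanded.append(value)
    return expanded


row = [2.0, 3.0]  # made-up row with two features
print(expand(row))  # constant, x0, x1, x0*x0, x0*x1, x1*x1

for n in (2, 3, 6):
    # The enumerated count matches the closed form C(n + 2, 2).
    print(n, len(degree2_terms(n)), comb(n + 2, 2))
```

Whether the constant term is counted, and how categorical features are encoded before expansion, both change the final number, so the report should state these choices.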
Code Quality
Your code will evolve as you tackle individual parts of this practical. At the end, you will have
code that produced all your results. This is research code so you should focus on the code
quality aspects that support research. Your code should be:
(a) correct,
(b) clean and understandable,
(c) concise and elegant, and
(d) repeatable and easy to modify.
You will not need to write a lot of code and should avoid overcomplicating. Focus on how
easy it is for someone else to take your code, understand it, reproduce your results, and make
modifications to support further experiments. Our marking will be based on how well your
code meets these criteria. We encourage you to keep these factors in mind from the beginning,
but it is also OK to focus on correctness first and clean up the code later.
Submission
Hand in via MMS, by the deadline of 9pm on Friday of Week 6:
• The source code of your application which works in the Python3 virtual environment
set up in the school labs. This must be in the form of human-readable .py files, not the
binary .ipynb notebook format!
• A brief report in the PDF format. The report must contain sections which correspond to
the four parts described in this specification and it must address each of the questions
associated with each part.
Create a single .zip file containing all of these and submit this to MMS. Do not include the
dataset, your Python virtual environment, or your Git repository.
Marking and Extensions
This practical will be marked according to the guidelines at
https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html
It will be based on the quality of your answers to the questions and the quality of your code.
The report is the most important part of the submission: your answers should be brief, but
they have to demonstrate understanding of the underlying algorithms. The code will be
evaluated based on the criteria listed in the Code Quality section.
Some examples of submissions in various bands are:
• A basic implementation in the 11–13 grade band will complete Parts 1-3, but with
significant weaknesses. Examples include a messy implementation, unexplained differences
between the code and the description in the report, or incorrect or incomplete answers to
questions in Parts 1-3.
• An implementation in the 14–16 range should complete all parts of the basic
specification comprising Parts 1-3, including answers to all associated questions. The code
should be of good quality, and the answers should be mostly correct and insightful, and
demonstrate understanding of lecture materials.
• An implementation in the 17–18 range must include a high-quality solution to Parts
1-3, and some work on Part 4. Excellent answers to all attempted questions are strictly
required for this grade band.
• A grade of 19 and higher requires an excellent solution to all four parts with
exceptionally clear code and insightful answers to questions which evidence deep
understanding and independent study.
Note that the goal is solid machine learning methodology and understanding rather than a collection
of extensions – a good scientific approach and analysis are difficult, whereas running many
different scikit-learn algorithms on the same data is easy. Also note that:
• We will not focus on software engineering practice and advanced Python techniques
when marking, but your code should be sensibly organised, commented, and easy to
follow, as described above.
• Standard lateness penalties apply as outlined in the student handbook at
https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/assessment.html
• You must reference any external sources used. Guidelines for good academic practice
are outlined in the student handbook at
https://info.cs.st-andrews.ac.uk/student-handbook/academic/gap.html
References
[1] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm
for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208,
1995. URL: http://users.iems.northwestern.edu/~nocedal/lbfgsb.html.
[2] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the
stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017. URL:
https://hal.inria.fr/hal-00860051/document.