School of Computing Sciences
Module: CMP-6002B Machine Learning
Assignment: Classification with Decision Trees
Set by: Anthony Bagnall ([email protected])
Checked by: Jason Lines ([email protected])
Date set: Tuesday 19th March 2020
Value: 50%
Date due: Wednesday 19th May 2021, 3pm
Returned by: Wednesday 30th June 2021
Submission: Blackboard
Learning outcomes
The students will learn about implementing variants of decision trees and ensemble techniques, and the general issues involved with implementing any classifier. They will understand better how to evaluate and compare classifiers on a range of data and for a specific problem. They will appreciate the difficulty in presenting technical results in a way to support or refute a hypothesis. The exercises will improve the following transferable skills: programming in Java; analysis of data and results; writing clearly and concisely; presenting data coherently. The students will also learn the basics of how to use GitHub to work with open source software.
Specification
Overview
The task for this assignment is to implement components of decision trees and ensembles, then to evaluate and compare them on a range of real-world data.
Part 1: Building a Tree by Hand (10%)
The purpose of this exercise is to make sure you understand how the basic algorithms work and to provide you with a bench check for later implementations. Table 1 shows a dataset describing patients admitted to hospital presenting symptoms of meningitis, and their subsequent diagnosis.
Headache   Spots   Stiff Neck   Diagnosis
yes        yes     yes          positive
no         yes     yes          positive
no         yes     no           positive
yes        no      yes          positive
no         yes     yes          positive
yes        no      yes          positive
no         yes     no           negative
no         no      yes          negative
no         yes     no           negative
yes        no      yes          negative
yes        no      no           negative
no         no      no           negative

Table 1: Symptoms of patients presenting symptoms of meningitis and their diagnosis.
Construct decision trees for this data using
1. the chi-squared statistic as the splitting criterion;
2. the Gini index as the splitting criterion.
The stopping criterion for both trees should be to stop at a pure node (a node of just one class) or when there are no more attributes to use.
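As a bench check, the counts for the attribute Headache over the whole dataset are: Headache = yes gives 3 positive and 2 negative cases; Headache = no gives 3 positive and 4 negative. A worked sketch of the two measures for this one split is given below; follow the exact definitions from the lecture in your own answer.

    Gini(root)           = 1 - (6/12)^2 - (6/12)^2 = 0.5
    Gini(Headache = yes) = 1 - (3/5)^2 - (2/5)^2   = 0.48
    Gini(Headache = no)  = 1 - (3/7)^2 - (4/7)^2   ≈ 0.490

    Weighted Gini after the split = (5/12)(0.48) + (7/12)(0.490) ≈ 0.486,
    so the Gini gain is roughly 0.5 - 0.486 = 0.014.

    Chi-squared, with expected counts 2.5, 2.5, 3.5 and 3.5:
    chi^2 = (3-2.5)^2/2.5 + (2-2.5)^2/2.5 + (3-3.5)^2/3.5 + (4-3.5)^2/3.5 ≈ 0.343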
Part 2: Implement Variants of ID3 and Ensembles (50%)
This exercise involves implementing, debugging and testing variants of decision tree classifiers. A Java codebase, tsml, is provided. To get started, you need to clone the ml6002b-coursework branch of the tsml GitHub repository. There is guidance on how to do this on Blackboard. Please follow the method and class names specified in the instructions. Failure to do so may result in a marks penalty.
Static functions to help assess attribute quality (15%)
Implement a class called AttributeMeasures in the package ml_6002b_coursework that contains four static methods, each of which measures the quality of an attribute split at a node. They should all take a two-dimensional array of integers as an argument and return a double. You can assume that the rows represent the different values of the attribute being assessed, and the columns the class counts. The methods are:
1. measureInformationGain returns the information gain for the contingency table.
2. measureGini returns the Gini measure for the contingency table.
3. measureChiSquared returns the chi-squared statistic for the contingency table.
4. measureChiSquaredYates returns the chi-squared statistic after applying the Yates correction. We did not cover this feature in the lecture, so it is up to you to find out how to do it.
The formal definitions for information gain, Gini and chi-squared are given in the lecture on Decision Trees, and I want you to follow those formulas. The Yates correction is a very simple modification. Your methods should handle all possible inputs without crashing. Comment the code to indicate how you have dealt with any edge cases. Include a main method test harness that tests each function by finding each measure for the attribute headache in terms of the diagnosis. Print each measure to the console, in the form “<measure> for headache splitting diagnosis = <value>”.
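A minimal sketch of the shape this class might take is given below, showing two of the four measures (the other two follow the same pattern). The interpretation of measureGini as the weighted impurity of the split, and the particular edge-case handling, are illustrative assumptions rather than the required implementation.

    package ml_6002b_coursework;

    /**
     * Sketch of static attribute-quality measures computed from a
     * contingency table: rows = attribute values, columns = class counts.
     */
    public class AttributeMeasures {

        // Weighted Gini impurity of the children (assumed interpretation; lower is better).
        public static double measureGini(int[][] table) {
            int total = 0;
            for (int[] row : table)
                for (int count : row)
                    total += count;
            if (total == 0) return 0;                     // edge case: empty table
            double weighted = 0;
            for (int[] row : table) {
                int rowTotal = 0;
                for (int count : row) rowTotal += count;
                if (rowTotal == 0) continue;              // edge case: unused attribute value
                double impurity = 1;
                for (int count : row) {
                    double p = count / (double) rowTotal;
                    impurity -= p * p;
                }
                weighted += (rowTotal / (double) total) * impurity;
            }
            return weighted;
        }

        // Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
        public static double measureChiSquared(int[][] table) {
            int total = 0;
            int[] rowSums = new int[table.length];
            int[] colSums = new int[table[0].length];
            for (int r = 0; r < table.length; r++)
                for (int c = 0; c < table[r].length; c++) {
                    rowSums[r] += table[r][c];
                    colSums[c] += table[r][c];
                    total += table[r][c];
                }
            if (total == 0) return 0;                     // edge case: empty table
            double chi = 0;
            for (int r = 0; r < table.length; r++)
                for (int c = 0; c < table[r].length; c++) {
                    double expected = rowSums[r] * (double) colSums[c] / total;
                    if (expected > 0)                     // edge case: empty row or column
                        chi += Math.pow(table[r][c] - expected, 2) / expected;
                }
            return chi;
        }

        public static void main(String[] args) {
            // Contingency table for headache vs diagnosis from Table 1:
            // rows = {yes, no}, columns = {positive, negative}.
            int[][] headache = {{3, 2}, {3, 4}};
            System.out.println("measure gini for headache splitting diagnosis = "
                    + measureGini(headache));
            System.out.println("measure chi-squared for headache splitting diagnosis = "
                    + measureChiSquared(headache));
        }
    }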
Alternative attribute selection methods for ID3 (15%)
Your task is to facilitate using the IG, Gini and chi-squared attribute selection mechanisms within the ID3Coursework classifier provided. The class ID3Coursework in the package ml_6002b_coursework is a clone of the Weka Id3 classifier with minor cosmetic changes. It is advisable to revisit the lab sheet on decision trees and to look at the original source for Id3. There is an interface called AttributeSplitMeasure that is used in ID3Coursework. You need to implement variants of this interface and use them within ID3Coursework. The interface contains a single abstract method, computeAttributeQuality, that takes an Instances and an Attribute and should return the quality of the attribute passed. There is also a default method, splitData, to split Instances by the Attribute.
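For orientation, the interface presumably has roughly the shape sketched below; the body of splitData given here is an illustrative assumption based on the standard Weka Id3 split, not the provided source.

    package ml_6002b_coursework;

    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;

    public interface AttributeSplitMeasure {

        // Returns a measure of the quality of splitting data on att.
        double computeAttributeQuality(Instances data, Attribute att) throws Exception;

        // Splits data into one Instances object per value of the (nominal) attribute.
        default Instances[] splitData(Instances data, Attribute att) {
            Instances[] splits = new Instances[att.numValues()];
            for (int i = 0; i < splits.length; i++)
                splits[i] = new Instances(data, data.numInstances());
            for (Instance inst : data)
                splits[(int) inst.value(att)].add(inst);
            return splits;
        }
    }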
1. Implement and test the skeleton class IGAttributeSplitMeasure so that the split is performed using information gain. I strongly advise you to look at how the original class Id3 does this, and you have my permission to copy the code from Id3, but if you do, please attribute it in the comments.
2. Implement and test a class ChiSquaredAttributeSplitMeasure that implements AttributeSplitMeasure and measures the quality using the chi-squared statistic. This class should be configurable to use the Yates correction.
3. Implement and test a class GiniAttributeSplitMeasure that implements AttributeSplitMeasure and measures the quality using the Gini index statistic.
4. Configure ID3Coursework so that it can be used with any of the three attribute split measures you have implemented. Adjust ID3Coursework so that the split criterion can be set through setOptions.
5. Currently, ID3Coursework can only be used with nominal attributes. Implement a method in AttributeSplitMeasure called splitDataOnNumeric that randomises the mechanism for handling continuous attributes. This should involve selecting a random attribute value between the minimum and maximum for that attribute, then making a binary split of instances into those below the value and those above it. This should be done prior to measuring the attribute quality. A sketch of one way to do this follows this list.
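A minimal sketch of splitDataOnNumeric as a further default method on AttributeSplitMeasure; the use of Math.random() and the <= comparison for the lower branch are illustrative choices, not requirements.

    // Added to AttributeSplitMeasure: a random binary split on a continuous attribute.
    default Instances[] splitDataOnNumeric(Instances data, Attribute att) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (Instance inst : data) {                 // find the attribute's range
            min = Math.min(min, inst.value(att));
            max = Math.max(max, inst.value(att));
        }
        double splitValue = min + Math.random() * (max - min);   // random threshold
        Instances[] splits = { new Instances(data, data.numInstances()),
                               new Instances(data, data.numInstances()) };
        for (Instance inst : data)                   // those below go left, the rest right
            splits[inst.value(att) <= splitValue ? 0 : 1].add(inst);
        return splits;
    }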
The majority of the marks will be for correctness. However, some consideration will be given to efficiency and code quality. You do not have to use the methods defined in Part 1. Include a main method in all three AttributeSplitMeasure classes that prints out the split criterion value for Headache, Spots and Stiff Neck for the whole of the data from Section 1. Print in the form “<measure> for attribute <attribute> splitting diagnosis = <value>”. Add a main method to the class C45Coursework that loads the problem optdigits (all nominal attributes) from the directory test_data (do not use absolute file paths for this), performs a random split, then builds classifiers using the IG, chi-squared and Gini split criteria, outputting the test accuracy of each to the console in the form “Id3 using <measure> on <problem> has test accuracy = <value>”. Repeat this for the data set Chinatown (all continuous attributes).
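By way of illustration, such a main method might look like the following. The "-S" option string, the 70/30 split and the harness class name are assumptions; use whatever option handling you built into setOptions.

    import java.io.FileReader;
    import java.util.Random;

    import weka.core.Instance;
    import weka.core.Instances;

    public class SplitExperiment {   // hypothetical harness, not a required class
        public static void main(String[] args) throws Exception {
            Instances all = new Instances(new FileReader("test_data/optdigits.arff"));
            all.setClassIndex(all.numAttributes() - 1);
            all.randomize(new Random(0));
            int trainSize = (int) (all.numInstances() * 0.7);
            Instances train = new Instances(all, 0, trainSize);
            Instances test = new Instances(all, trainSize, all.numInstances() - trainSize);
            for (String measure : new String[]{"ig", "chi", "gini"}) {
                ID3Coursework c = new ID3Coursework();
                c.setOptions(new String[]{"-S", measure});   // assumed option flag
                c.buildClassifier(train);
                int correct = 0;
                for (Instance inst : test)
                    if (c.classifyInstance(inst) == inst.classValue())
                        correct++;
                System.out.println("Id3 using " + measure + " on optdigits has test accuracy = "
                        + correct / (double) test.numInstances());
            }
        }
    }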
Implement an Ensemble Classifier (20%)
The task is to implement an ensemble classifier that can be used with variants of your enhanced ID3 classifier. You must implement this classifier from scratch rather than use any existing classifiers in tsml. Implement a classifier, TreeEnsemble, that extends AbstractClassifier and consists of an ensemble of ID3Coursework classifiers stored in an array or List. Set the default ensemble size to 50. TreeEnsemble should be in the package ml_6002b_coursework.
1. The method buildClassifier should construct a new set of instances for each element of the ensemble by selecting a random subset (without replacement) of the attributes. The proportion of attributes to select should be a parameter (defaulting to 50%). It should then build a separate classifier on each Instances object. The TreeEnsemble will need to store which attributes are used with which classifier, in order to recreate the attribute selections in classifyInstance and distributionForInstance.
2. Further diversity should be injected into the ensemble by randomising the decision tree parameters. This diversification should include the attribute selection mechanisms you have implemented, but can also involve other tree parameters.
3. Implement classifyInstance so that it returns the majority vote of the ensemble, i.e. classify a new instance with each of the decision trees, count how many classifiers predict each class value, then return the class that receives the largest number of votes.
4. Implement distributionForInstance so that it returns the proportion of votes for each class.
5. Implement an option where, rather than counting votes, the ensemble can average the probability distributions of the base classifiers.
6. Implement a main method in the class TreeEnsemble that loads the problem optdigits and prints out the test accuracy, but also prints out the probability estimates for the first five test cases. A sketch of the core of such an ensemble follows this list.
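One plausible shape for the core of TreeEnsemble is sketched below. The use of Weka's Remove filter to record each tree's attribute subset, the "-S" option string and all field names are assumptions; treat this as an outline, not the implementation.

    package ml_6002b_coursework;

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import weka.classifiers.AbstractClassifier;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class TreeEnsemble extends AbstractClassifier {
        private int numTrees = 50;                   // default ensemble size
        private double attProportion = 0.5;          // default attribute proportion
        private boolean averageDistributions = false;
        private ID3Coursework[] trees;
        private Remove[] filters;                    // remembers each tree's attribute subset
        private final Random rand = new Random();

        @Override
        public void buildClassifier(Instances data) throws Exception {
            trees = new ID3Coursework[numTrees];
            filters = new Remove[numTrees];
            int classIndex = data.classIndex();
            for (int t = 0; t < numTrees; t++) {
                // Choose a random attribute subset (without replacement), keeping the class.
                List<Integer> atts = new ArrayList<>();
                for (int a = 0; a < data.numAttributes(); a++)
                    if (a != classIndex) atts.add(a);
                Collections.shuffle(atts, rand);
                int keep = Math.max(1, (int) (atts.size() * attProportion));
                int[] selected = new int[keep + 1];
                for (int i = 0; i < keep; i++) selected[i] = atts.get(i);
                selected[keep] = classIndex;
                filters[t] = new Remove();
                filters[t].setAttributeIndicesArray(selected);
                filters[t].setInvertSelection(true);         // keep, not remove, the selection
                filters[t].setInputFormat(data);
                Instances subset = Filter.useFilter(data, filters[t]);
                trees[t] = new ID3Coursework();
                // Inject diversity by randomising the split measure (assumed option flag).
                String[] measures = {"ig", "chi", "gini"};
                trees[t].setOptions(new String[]{"-S", measures[rand.nextInt(measures.length)]});
                trees[t].buildClassifier(subset);
            }
        }

        @Override
        public double[] distributionForInstance(Instance inst) throws Exception {
            double[] votes = new double[inst.numClasses()];
            for (int t = 0; t < numTrees; t++) {
                filters[t].input(inst);                      // recreate the attribute selection
                Instance filtered = filters[t].output();
                if (averageDistributions) {                  // option: average probabilities
                    double[] d = trees[t].distributionForInstance(filtered);
                    for (int c = 0; c < votes.length; c++) votes[c] += d[c];
                } else {                                     // default: count votes
                    votes[(int) trees[t].classifyInstance(filtered)]++;
                }
            }
            for (int c = 0; c < votes.length; c++) votes[c] /= numTrees;
            return votes;
        }

        @Override
        public double classifyInstance(Instance inst) throws Exception {
            double[] d = distributionForInstance(inst);      // majority vote via the distribution
            int best = 0;
            for (int c = 1; c < d.length; c++)
                if (d[c] > d[best]) best = c;
            return best;
        }
    }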
Evaluation of Decision Tree and Ensemble Classifiers (40%)
Your task is to perform a series of classification experiments and write them up as a research paper. Your experiments will address the questions of whether the variations you have implemented for ID3 improve performance over a range of problems, and whether your ensemble is a good approach for a specific case study data set. You have been assigned a specific data set (see Blackboard) and a link to further information. Please note that for this part of the coursework we mark the paper itself, not the code used to generate the results. We advise you to reserve significant time for reading and checking your paper, and recommend you ask someone else to read it before submission. Imagine you are writing a paper for a wide audience, not just for the marker. Aim for a tight focus on the specific questions the paper addresses. We have provided two sets of datasets to perform these experiments: these are in the files UCIDiscrete.zip and UCIContinuous.zip. A list of the problems in these files is given in the otherwise empty class DatasetLists.java. You will be assigned a completely different problem from timeseriesclassification.com for the case study. There are four issues to investigate:
1. Decision Trees: test whether there is any difference in average accuracy between the attribute selection methods on the classification problems we have provided. Compare your versions of ID3 to the Weka ID3 and J48 classifiers.
2. Tree Ensemble vs Tree Tuning: test whether tuning ID3Coursework, including choosing the attribute selection criterion, is better than ensembling. It is up to you to select the ranges of values to tune and ensemble over, but you should ensure that the range of parameters you tune over is the same as the one you use to ensemble. Perform this experiment with the proportion of attributes set to 100%, then repeat the experiment with the proportion set to 50%. You can use any code available in tsml to help you do the tuning.
3. Compare your ensemble against a range of built-in Weka classifiers, including other ensembles, on the provided classification problems.
4. Perform a case study on your assigned data set to propose which classifier, from those you have used, would be best for this particular problem.
Experiments
You should compare classifiers based on accuracy (or, equivalently, error) and any other metrics described in the evaluation lecture that you feel are appropriate. You can use the built-in classifier evaluation tools provided in tsml, or use any other mechanism. You should think about the experimental design for each experiment, including deciding on a sampling method (train/test split, cross-validation or resampling) and performance measures. The choice of classifiers for part 3 is up to you, but we recommend you include random forest and rotation forest in the list.
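For example, Weka's Evaluation class gives a compact way to run a cross-validation (a sketch; substitute whichever classifier, data file and sampling method your design calls for):

    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.core.Instances;

    public class EvaluationSketch {   // hypothetical harness
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new FileReader("test_data/optdigits.arff"));
            data.setClassIndex(data.numAttributes() - 1);
            TreeEnsemble classifier = new TreeEnsemble();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifier, data, 10, new Random(1)); // 10-fold CV
            System.out.println("accuracy = " + eval.pctCorrect() + "%");
            System.out.println(eval.toSummaryString());
        }
    }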
Include a table of all parameter settings of all classifiers in the paper. You can also compare the time each algorithm takes to build a model. You can use both the discrete data and the continuous data, although it may be advisable to split the analysis into two parts if you use both. For the case study, the data you have been given has all continuous attributes. You should experiment with the data in this format, and compare results to those obtained by first discretizing the data so that it is converted to nominal attributes. The conversion can be done using the Weka filter Discretize or any other means.
The Write Up
You should write a paper in LaTeX using the style file available on Blackboard. The paper should be called “An experimental assessment of decision trees and decision tree ensembles”. It should be around 4 pages, but there is no maximum or minimum limit. Your paper should include the following sections:
1. Introduction: start with the aims of the paper, a description of the hypotheses you are testing and an overview of the structure. Include in this your prior beliefs as to what you think the outcome of the tests will be, with some rationalisation as to why. So, for example, do you think attribute evaluation will make a difference? Can you find any published literature that claims to answer any of the questions?
2. Data Description: an overview describing the data, including data characteristics summarised in a table (number of attributes, number of train/test cases, number of classes, class distribution). For your case study data set you should give more detail, including a description of the source of the data and the nature of the problem the data describes. Look for references for similar types of problem.
3. Classifier Description: include a description of your C45Coursework classifier, including details of the design choices you made and the data structures you employed. You should include an example of how to use the classifier with the refinements you have included. Also provide an overview of the other classifiers you use in the case study, including references.
4. Results: a description of the experimental procedure you used and details of the results. Remember, graphs are good. There should be a subsection for each hypothesis. The case study section should go into greater depth and remember, accuracy is not the only criterion.
5. Conclusions: how do you answer the questions posed? Are there any biases in your experiment and, if so, how would you improve/refine them?
Presenting the output of experiments is a key transferable skill in any technical area. I stress again that for this section we mark the paper, not the experimental code. Avoid writing a description of what you did. We do not need to know about any blind alleys you went down or problems you encountered, unless they are relevant to the experiments (e.g. memory constraints). Please also take care to include references where appropriate, and to check the formatting of references.

These experiments will require a reasonable amount of computation. We suggest you store the results of experiments in the format used on the module, to avoid data loss in the case of computer failure. You can remotely log on to the lab machines and run experiments on this hardware if yours is limited.
Submission Requirements
You should submit a single zipped file containing your solutions to parts 1, 2 and 3 to the Blackboard submission point. For part 1, submit a single PDF with your worked solution. For part 2, submit a zipped IntelliJ project with your code. For part 3, submit a single PDF of your paper. A template format is here:
https://www.overleaf.com/read/ycgpvxbgwmkf
Feel free to copy this project. The Blackboard submission portal will go live one week before the deadline.
Plagiarism and collusion: Stack Overflow and other similar forums (such as the module discussion board) are valuable resources for learning and completing formative work. However, for assessed work such as this assignment, you may not post questions relating to coursework solutions in such forums. Using code from other sources/written by others without acknowledgement is plagiarism, which is not allowed (General Regulation 18).
Marking Scheme
1. Part 1. Decision trees by hand (10 marks)
2. Part 2. Implement tree variants (50 marks)
3. Part 3. Evaluate Classifiers (40 marks)
Total: 100 Marks, 50% of the overall module marks
