辅导案例-COSC 051

COSC 051 Page 1 of 4
ANLY 555: Data Science Python Toolbox
Deliverable 5: Advanced Topics

Technical Resources:
See canvas.
Background
Throughout this course you will be designing and implementing a Data Science Toolbox using Python. There will be
5 major deliverables and required, supporting discussion posts. The first deliverable will be focused on designing
the class hierarchy, building the basic coding infrastructure, and beginning the documentation process.
Overview of Deliverables
1. Design
2. Implement DataSets Class and subclasses
3. Implement ClassifierAlgorithms Class, kdTreekNNClassifier subclass and Experiment Class
4. Implement
a. ROC method for Experiment Class,
b. simpleTreeClassifier
5. Advanced Topics:
a. kdTreeKNNClassifier
b. lshKNNclassifier
c. HeterogeneousDataSet
Software Design Requirements Overview
The toolbox will be implemented using OOP practices and will take advantage of inheritance and polymorphism.
Specifically, the toolbox will consist of 3 main classes some of which have subclasses and member methods as
noted below. You will also submit a demo script for each submission that tests the capabilities of your newly
created toolbox.
1. Class Hierarchy
a. DataSet
i. TimeSeriesDataSet
ii. TextDataSet
iii. QuantDataSet
iv. QualDataSet
v. TransactionDataSet
vi. HeterogeneousDataSet
b. ClassifierAlgorithm
i. simplekNNClassifier
ii. decisionTreeClassifier
iii. kdTreeKNNclassifier
COSC 051 Page 2 of 4
iv. lshKNNclassifier
c. Experiment

2. Member Methods for each Super Class (subclasses will have more specified members as well)
a. DataSet
i. __init__(self, filename)
ii. __readFromCSV(self, filename)
iii. __load(self, filename)
iv. clean(self)
v. explore(self)
b. ClassifierAlgorithm
i. __init__(self)
ii. train(self)
iii. test(self)
c. Experiment
i. runCrossVal(self, k)
ii. score(self)
iii. __confusionMatrix(self)
iv. ROC



Details for Deliverable #5. Advanced Topics.
There are 3 main components for your deliverable this week.
1. Using Python, you will implement 1 of the following (If you implement 2, you can received up to 25 bonus
points, and if you implement all 3 you will receive up to 75 bonus points):
a. kdTreeKNNclassifier use the supplemental journal article as reference.
i. The train method for kdTreeKNNClassifier will have input parameters trainingData and
true labels. Both training and testing methods fill follow specifications detailed in the kd-
tree knn paper.
ii. The test method will have parameters testData and k, and will find the k closest training
samples and will compute and return labels for the test data as specified in the in the
kd-tree knn paper.
iii. Perform formal computational Complexity Analysis on the following methods. Include a
space count S(n) and step count T(n) function (where n is the size of the input) as well as
a tight-fit upperbound using Big-O notation. Assume worst case and justify your
analyses.
1. Both kdTreeKNNClassifier train and test methods
2. Compare to simpleKNNClassifier methods previously analyzed.

b. lshKNNclassifier. LSH (Locality Sensitive Hashing -- try Algorithm 2 from paper). Build a hash table
to find nearest neighbors and implement as a kNN classifier.
i. Perform formal computational Complexity Analysis on the following methods. Include a
space count S(n) and step count T(n) function (where n is the size of the input) as well as
a tight-fit upperbound using Big-O notation. Assume worst case and justify your
analyses.
COSC 051 Page 3 of 4
1. Both lshKNNclassifier train and test methods
2. Compare time complexity AND accuracy to simpleKNNClassifier method
previously analyzed.

c. HeterogenousData via Composition of Objects. HeterogeneousDataSet Class have member
attribute which is a list of DataSets. It will likely need other attributes to help maintain state.
i. Just like other DataSets, it will have the following member methods / functionality
1. Initializing and Loading the data
a. __init__(self, filename)
b. __load(self, filename)
You can consider 2 use cases here (Pick one).
1) You will build HeterogeneousDataSets from a collection of other data
sets. For example you can instantiate some quantData and qualData
and then instantiate the HeterogeneousDataSet using a constructor:
HeterogeneousDataSet( list(quantData1, qualData1) )
2) You can prompt the user to help guide the initialization of the data,
reading the data from specific files. Etc.
2. clean(self)
a. This will simply call the clean methods from each of the individual
datasets
3. explore(self)
a. This will simply call the explore methods from each of the individual
datasets
ii. You will add the following member select( self, … ), which will select one of the
constituent data sets. (Use case. You may wish to train or test on only a selected
constituent dataset.)


2. Using Python you will implement a demo script that tests the functionality of your code. You will test to
the full functionality of the new code submitted for this deliverable. You will test all constructors and
methods.
3. Using Doxygen (or another UML-like documentation tool), UPDATE your documentation which
illustratively describes the class hierarchy, member attributes, and member methods. The description
should include structural and functional details.
COSC 051 Page 4 of 4
Academic Integrity
This is an individual project and all work must be your own. Refer to the guidelines specified in the Academic
Honesty section of this course syllabus or contact me if you have any questions.
Include the following comments at the start of your source code files:

/*
* .
*
* ANLY 5555
* Project <>
*
* Due on:
* Author:
*
*
* In accordance with the class policies and Georgetown's
* Honor Code, I certify that, with the exception of the
* class resources and those items noted below, I have neither
* given nor received any assistance on this project other than
* the TAs, *professor, textbook and teammates.
*
* References not otherwise commented within the program source code.
* Note that you should not mention any help from the TAs, the professor,
* or any code taken from the class textbooks.
*/

These comments must appear exactly as shown above. The only difference will be values that you replace where
there are "place holders" within angle brackets such as which should be replaced by your own netID.

Submission Details
Upload (as instructed by your professor) a zip folder containing ALL files (.py, .pdf, and/or .html
files). Use the following folder name: P5. For example, I would
create a folder named jeremyBoltonP5 which contained all files. I would then
zip this folder creating file jeremyBoltonP5.zip . I would then submit this
zip file. Late submissions will be penalized heavily. If you are late you may turn in the project to receive
feedback but the grade may be zero. In general, requests for extensions will not be considered.



欢迎咨询51作业君
51作业君 51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: ITCSdaixie