IFN645 Large Scale Data Mining
Assignment
Unit Contribution: 40 Marks
Released: 9/09/2020
Due In: 14/10/2020

Introduction

This assignment is intended to allow you to display your knowledge and understanding developed in
your practicals and lectures. The purpose of this assignment is to give you (1) an understanding that
various methods can be applied to a data set and (2) the benefits of applying data analytics
techniques to a data domain.

Instructions

1. The assignment is due on 14/10/2020. It is a firm deadline.

2. You should submit the assignment via Blackboard Assignment.

3. The assignment is split into 3 parts.
   1. A set of materials which can be run in the Weka GUI. This is in sections 1 – 3 of the
   description. It is worth 15 marks.
   2. A reflective statement. This is in sections 4 – 5 of the description. It is worth 10 marks.
   3. A set of materials which must be written in Java. This is section 6 of the description. It is
   worth 15 marks.

4. Submit a report via online submission that answers each question of the case study. The report
must include an introduction, a conclusion and references. Beyond these, it should contain only
your responses to the questions set in the case study. Some answers may require screenshots; use
them as needed. Include tables detailing your results/outcomes where required. This point is
repeated throughout the description: provide all evidence. Write down the important points and
attach the relevant screenshots to show that you have thought the matter through.

5. You also need to submit your code. Please zip it and upload it to Blackboard.

6. The datasets required for this assignment can be found on Blackboard in the file named
assignment-data.zip. It includes three datasets:
a. a text dataset – to perform text analysis
b. a soil dataset – to perform feature selection
c. an electricity dataset – to perform evaluation.

7. Your work must be your own. No collaboration or borrowing from others is permitted.

8. Read the Assessment Policies on Blackboard or QUT Website.

Description

Section 1 - Weka

Text
Download the text.arff file and analyse it in Weka using the text analysis tools that you have
learned. We would like to match the scores to the text, i.e. use the scores as the class. Write about
your approach and describe which techniques worked and which did not. Provide all evidence. In
particular, you should answer:
1. How did the number of Correctly Classified Instances differ across the text options that
you tried?
2. What is the highest number of Correctly Classified Instances for your best configuration?
3. Did you disregard any frequent terms?
4. Describe the other techniques that you used.
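
As background to question 3, the following is a minimal sketch in plain Java of what disregarding frequent terms means when text is turned into word counts. It is not Weka code and not the required method: the class name, the method name and the very short stop list are all hypothetical and chosen only for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Weka: counts the terms in one document
// after lower-casing it and dropping words found in a small stop list.
class TermCountSketch {
    // Illustrative stop list only (an assumption); a real list is much longer.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "is", "of", "to");

    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on every run of characters that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty tokens and frequent (stop) words
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termCounts("The film is good, and the cast is very good."));
    }
}
```

Comparing the counts with and without the stop list shows how much of the vocabulary consists of frequent terms that carry little information about the class.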

Feature Selection
Download the soil dataset. Run the filter on the dataset using the following approaches: J48,
NaiveBayes, SMO, OneR, IBk.
For each of the approaches, answer the following. Provide all evidence.
1. Run the Ranker method using InfoGainAttributeEval and GainRatioAttributeEval for all of the
attributes. Provide a table that shows the attributes and their values.
2. Provide a graph that shows which attributes gave the highest number of Correctly Classified
Instances.
3. Describe whether there was a difference in how long the approaches above took. Which approach
took the longest and which was the quickest? How do the approaches compare in terms of
efficiency versus effectiveness?
4. What would be the 3 best attributes? Justify your answer.
5. How would you use the attributes in your downstream data mining tasks?
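
To make the two ranking measures in question 1 concrete, here is a hedged sketch in plain Java of information gain, H(Class) - H(Class | Attribute), and gain ratio, which divides that gain by H(Attribute). It works on a table of counts for one nominal attribute. The class and method names are hypothetical, and the counts in `main` come from the well-known weather toy dataset, not from the soil dataset.

```java
// Hypothetical sketch, not Weka's implementation: information gain and
// gain ratio for one nominal attribute, computed from a contingency table.
class InfoGainSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Shannon entropy in bits of a vector of non-negative counts.
    static double entropy(double[] counts) {
        double total = 0.0;
        for (double c : counts) total += c;
        if (total == 0.0) return 0.0;
        double h = 0.0;
        for (double c : counts) {
            if (c > 0.0) {
                double p = c / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // Row sums: how many instances take each attribute value.
    static double[] valueTotals(double[][] table) {
        double[] totals = new double[table.length];
        for (int v = 0; v < table.length; v++)
            for (double c : table[v]) totals[v] += c;
        return totals;
    }

    // table[v][k] = number of instances with attribute value v and class k.
    static double infoGain(double[][] table) {
        double[] classTotals = new double[table[0].length];
        double n = 0.0;
        for (double[] row : table)
            for (int k = 0; k < row.length; k++) {
                classTotals[k] += row[k];
                n += row[k];
            }
        double[] perValue = valueTotals(table);
        double conditional = 0.0; // H(Class | Attribute)
        for (int v = 0; v < table.length; v++)
            conditional += (perValue[v] / n) * entropy(table[v]);
        return entropy(classTotals) - conditional;
    }

    static double gainRatio(double[][] table) {
        double split = entropy(valueTotals(table)); // H(Attribute)
        return split == 0.0 ? 0.0 : infoGain(table) / split;
    }

    public static void main(String[] args) {
        // Weather toy data, attribute "outlook": rows sunny, overcast, rainy;
        // columns play = yes, play = no.
        double[][] outlook = {{2, 3}, {4, 0}, {3, 2}};
        System.out.printf("info gain = %.3f, gain ratio = %.3f%n",
                infoGain(outlook), gainRatio(outlook));
    }
}
```

Gain ratio penalises attributes with many distinct values, so the two measures can rank the same attributes in a different order.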

Evaluation
Download the electricity dataset. Choose one of the standard approaches (e.g. J48, NaiveBayes, SMO
etc.) and record the following in relation to your evaluation. Provide all evidence.
1. Explain how your evaluation approach avoids the problem of overfitting.
2. Calculate the average and the standard deviation for 10 holdout runs using different random
number seeds and for cross-validation. Comment on which approach you should use based on these
figures.
3. Calculate the learning curve in terms of accuracy, in increments of 5%, for the dataset. At
which point would the amount of training data have been sufficient?
4. Imagine that there is a cost of 10 associated with DOWN. Calculate the Total Cost value.
Comment on the result.
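
The arithmetic behind questions 2 and 4 can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the required solution: the class and method names are hypothetical, the standard deviation is the sample form (dividing by n - 1), both matrices are assumed to be laid out as [actual][predicted], and every number in `main` is a made-up placeholder rather than a result from the electricity dataset.

```java
// Hypothetical sketch: summary statistics over repeated runs and the total
// cost of a classifier given a confusion matrix and a cost matrix.
class EvaluationStatsSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least 2 values.
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / (xs.length - 1));
    }

    // Assumption: confusion[a][p] counts instances of actual class a predicted
    // as class p, and cost[a][p] is the cost of one such prediction.
    static double totalCost(int[][] confusion, double[][] cost) {
        double total = 0.0;
        for (int a = 0; a < confusion.length; a++)
            for (int p = 0; p < confusion[a].length; p++)
                total += confusion[a][p] * cost[a][p];
        return total;
    }

    public static void main(String[] args) {
        // Placeholder accuracies (percent) for 10 runs; not real results.
        double[] accuracies = {75.1, 74.8, 75.6, 74.9, 75.3, 75.0, 74.7, 75.4, 75.2, 75.0};
        System.out.println("mean = " + mean(accuracies));
        System.out.println("std dev = " + sampleStdDev(accuracies));

        // Class order {UP, DOWN}. One possible reading of the cost question:
        // an actual DOWN predicted as UP costs 10, the opposite error costs 1.
        int[][] confusion = {{50, 5}, {3, 42}};   // placeholder counts
        double[][] cost = {{0, 1}, {10, 0}};
        System.out.println("total cost = " + totalCost(confusion, cost));
    }
}
```

A small standard deviation across runs suggests the accuracy estimate is stable, and a cost matrix can make a classifier with higher accuracy look worse if its errors fall on the expensive class.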

Section 2 – Reflective Statement

Ethics
Choose a common big data platform (e.g. Twitter, Google, Facebook). For the platform, explain the
following:
1. Outline how privacy, security, fairness and justice could potentially be violated by
big data.
2. Outline the common ethical issues that your platform raises under each of the following
frameworks: virtue ethics, consequentialist/utilitarian ethics and deontological ethics.
3. Outline the downstream risks associated with your platform.
4. Outline how different ethical challenges should be managed with respect to big data.
5. If you were designing a company that uses such data, how would you define and promote the
values of transparency, autonomy and trustworthiness?

Parallel Processing
Select one of your big data solutions provided in Section 1. Explain how you would transform the
approach from a single-threaded application to a multi-threaded application in Weka using the
Hadoop/Spark paradigm. In particular, explain:
1. How processing the dataset would become more efficient if you moved to a MapReduce scheme.
2. How you would divide the tasks into key and value pairs.
3. How you would divide the task into Map and Reduce tasks.
4. How you would split the source dataset to work with the distributed application.
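
The key/value, Map/Reduce and splitting ideas in the questions above can be sketched as a single-process simulation in plain Java. It uses no Hadoop or Spark API, and all names are hypothetical. It assumes each record is a comma-separated line whose last field is the class label, and it only counts instances per class, which is a far simpler job than training a classifier.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process simulation of map, shuffle and reduce.
class MapReduceSketch {
    // Map step: one input record -> one (key, value) pair = (class label, 1).
    // Assumption: the class label is the last comma-separated field.
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return new SimpleEntry<>(fields[fields.length - 1].trim(), 1);
    }

    // Reduce step: all values collected for one key -> a single sum.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Driver: maps every record of every split, groups the pairs by key
    // (the shuffle), then reduces each group.
    static Map<String, Integer> run(List<List<String>> splits) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (List<String> split : splits) { // each split could go to its own worker
            for (String record : split) {
                Map.Entry<String, Integer> kv = map(record);
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two made-up splits of a made-up dataset.
        List<List<String>> splits = List.of(
                List.of("0.1,0.2,UP", "0.3,0.4,DOWN"),
                List.of("0.5,0.6,UP"));
        System.out.println(run(splits));
    }
}
```

Because `map` looks at one record at a time and `reduce` at one key at a time, the splits can be processed independently, which is what allows the work to be distributed.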

Section 3 - Java

Java
1. Write a Java program that uses the three datasets to compare a traditional approach (e.g.
J48, NaiveBayes, SMO etc.) with MOA's HoeffdingTree.
2. Determine the Correctly Classified Instances for both. Describe why one approach is better.
3. Determine the time taken for both approaches to run. Describe why one approach is faster
than the other.
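
As background for questions 2 and 3, a Hoeffding tree decides when it has seen enough instances to split a node by using the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below computes that bound in plain Java. It does not call MOA; the class and method names are hypothetical, and the parameter values in `main` are examples only.

```java
// Hypothetical sketch of the Hoeffding bound; it does not use the MOA API.
class HoeffdingBoundSketch {
    // range: range R of the split measure (e.g. log2(c) for information gain
    //        with c classes)
    // delta: allowed probability that the true mean lies outside the bound
    // n:     number of instances observed so far at the node
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // Example values only: two classes (R = 1) and delta = 1e-7.
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.println(n + " instances -> epsilon = " + hoeffdingBound(1.0, 1e-7, n));
        }
    }
}
```

The bound shrinks as n grows, so the tree can commit to a split after one pass over a limited number of instances instead of re-scanning the whole dataset. This is one reason an incremental learner may run faster than a batch learner.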

Weka
Grade 7: The material presented in Weka is very strong and well supported.
Grade 6: The Weka material is generally strong but there are some errors in it.
Grade 5: The Weka material is generally ok but has some issues relating to its explanation.
Grade 4: There is a lack of structure to the Weka material.
Grade 1-3: The Weka material is poor.

Reflective statement
Grade 7: The reflective statement is very strong.
Grade 6: The reflective statement is generally strong but there are some errors in it.
Grade 5: The reflective statement is generally ok but has some issues relating to its explanation.
Grade 4: The reflective statement is sound but has some issues.
Grade 1-3: The reflective statement is poor.

Java
Grade 7: Java works and has a near perfect design.
Grade 6: Java works and has a few things wrong with the design.
Grade 5: Java works but there the coding design has some troubles.
Grade 4: Java works but there the coding design has serious troubles.
Grade 1-3: Did not get Java working.
