IFN645 Large Scale Data Mining
Assignment

Unit Contribution: 40 Marks
Released: 9/09/2020
Due: 14/10/2020

Introduction

This assignment is intended to allow you to demonstrate the knowledge and understanding developed in your practicals and lectures. Its purpose is to give you (1) an understanding that various methods can be applied to a data set and (2) an appreciation of the benefits of applying data analytics techniques to a data domain.

Instructions

1. The assignment is due on 14/10/2020. This is a firm deadline.
2. Submit the assignment via Blackboard Assignment.
3. The assignment is split into 3 parts:
   1. A set of materials which can be run in the Weka GUI. This is Section 1 of the description (Text, Feature Selection and Evaluation). It is worth 15 marks.
   2. A reflective statement. This is Section 2 of the description (Ethics and Parallel Processing). It is worth 10 marks.
   3. A set of material which needs to be written in Java. This is Section 3 of the description. It is worth 15 marks.
4. Submit a report online that answers each question of the case study. The report must include an introduction, a conclusion and references; beyond those, it should contain only responses to the questions set in the case study. Some answers may require screenshots. Use them as needed. Include tables detailing your results and outcomes where required. We will repeat this several times throughout: provide all evidence. The important thing is to write down the important points and attach the important screenshots, to show that you have thought the matter through.
5. You also need to submit your code. Please zip it and upload it to Blackboard.
6. The datasets required for this assignment can be found on Blackboard in the file named assignment-data.zip. It includes three datasets:
   a. a text dataset, to perform text analysis
   b. a soil dataset, to perform feature selection
   c. an electricity dataset, to perform evaluation.
7. Your work must be your own. No collaboration or borrowing from others is permitted.
8. Read the Assessment Policies on Blackboard or the QUT website.

Description

Section 1 - Weka

Text

Download the text.arff file and analyse it in Weka using the text analysis tools that you have learned. We would like to match the scores to the text, i.e. use the scores as the class. Write about your approach and describe which techniques worked and which did not. Provide all evidence. In particular, you should answer:

1. How did the number of Correctly Classified Instances differ across the text options you tried?
2. What is the highest number of Correctly Classified Instances for your best run?
3. Did you disregard any frequent terms?
4. Which other techniques did you use? Describe them.

Feature Selection

Download the soil dataset. Run the filter on the dataset using the following approaches: J48, NaiveBayes, SMO, OneR, IBk. For each of the approaches answer the following. Provide all evidence.

1. Run the Ranker method using InfoGainAttributeEval and GainRatioAttributeEval for all of the attributes. Provide a table that shows the attributes and their values.
2. Provide a graph that shows the attributes which gave the highest number of Correctly Classified Instances.
3. Describe whether there was a difference in how long the approaches above took. Which approach took the longest and which was the quickest? How do the approaches compare in terms of efficiency versus effectiveness?
4. What would be the 3 best attributes? Justify your answer.
5. How would you use the attributes in your downstream data mining tasks?

Evaluation

Download the electricity dataset. Choose one of the standard approaches (e.g. J48, NaiveBayes, SMO etc.) and record the following in relation to your evaluation. Provide all evidence.

1. Explain how your evaluation approach avoids the problem of overfitting.
2. Calculate the average and the standard deviation for 10 holdout runs using different random number seeds, and for cross-validation. Comment on which approach you should use based on these figures.
3. Calculate the learning curve in terms of accuracy, in increments of 5%, for the dataset. At which point on the learning curve would the amount of data have been sufficient?
4. Imagine that there is a cost of 10 associated with DOWN. Calculate the Total Cost value. Comment on the result.

Section 2 - Reflective Statement

Ethics

Choose a common big data platform (e.g. Twitter, Google, Facebook). For the platform explain the following:

1. Outline how privacy, security, fairness and justice could potentially be violated by the big data.
2. Outline the common ethical issues in your platform posed by the following frameworks: virtue ethics, consequential/utilitarian ethics and deontological ethics.
3. Outline the downstream risks associated with your platform.
4. Outline how different ethical challenges should be managed with respect to big data.
5. If you were setting up a company that made use of such data, how would you define and promote the values of transparency, autonomy and trustworthiness?

Parallel Processing

Select one of your big data solutions provided in Section 1. Explain how you would transform the approach from a single-threaded application to a multi-threaded application in Weka using the Hadoop/Spark paradigm. Explain:

1. How processing the dataset would become more efficient if you moved to a MapReduce scheme.
2. How you would divide the tasks into key and value pairs.
3. How you would divide the task into Map and Reduce tasks.
4. How you would split the source dataset to work with the distributed application.

Section 3 - Java

Java

1. Write a Java programme that uses the three datasets to compare a traditional approach (e.g. J48, NaiveBayes, SMO etc.) with MOA's HoeffdingTree.
2. Determine the Correctly Classified Instances for both. Describe why one approach is better.
3. Determine the time taken for both approaches to run. Describe why one approach is faster than the other.

| | 7 | 6 | 5 | 4 | 1-3 |
|---|---|---|---|---|---|
| Weka | The material presented in Weka is very strong and well supported. | The Weka material is generally strong but there are some errors in it. | The Weka material is generally ok but has some issues relating to its explanation. | There is a lack of structure to the Weka material. | The Weka material is poor. |
| Reflective statement | The reflective statement is very strong. | The reflective statement is generally strong but there are some errors in it. | The reflective statement is generally ok but has some issues relating to its explanation. | The reflective statement is sound but has some issues. | The reflective statement is poor. |
| Java | Java works and has a near perfect design. | Java works and has a few things wrong with the design. | Java works but the coding design has some troubles. | Java works but the coding design has serious troubles. | Did not get Java working. |
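
For the arithmetic behind Evaluation questions 2 and 4 in Section 1, the following is a minimal Java sketch, not a model answer. It assumes that the per-run accuracies have already been recorded from Weka, that the sample standard deviation (dividing by n - 1) is the one wanted, and that the cost of 10 applies when an actual DOWN instance is misclassified, with matrix rows as actual classes and columns as predicted classes. The class name EvaluationMath, its method names and every number in main are hypothetical placeholders, not results from the electricity dataset.

```java
/** Arithmetic helpers for summarising repeated evaluation runs (illustrative only). */
final class EvaluationMath {

    /** Arithmetic mean of the given values. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation: square root of the squared deviations summed and divided by n - 1. */
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            double d = v - m;
            sumSquares += d * d;
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }

    /**
     * Total cost: each confusion-matrix count multiplied by the cost of that cell, summed.
     * Assumed convention: rows are actual classes, columns are predicted classes.
     */
    static double totalCost(int[][] confusion, double[][] cost) {
        double total = 0.0;
        for (int actual = 0; actual < confusion.length; actual++) {
            for (int predicted = 0; predicted < confusion[actual].length; predicted++) {
                total += confusion[actual][predicted] * cost[actual][predicted];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up accuracies (percent) standing in for 10 holdout runs with different seeds.
        double[] holdout = {75.1, 76.3, 74.8, 75.9, 76.0, 75.4, 74.9, 76.2, 75.7, 75.5};
        System.out.printf("mean = %.3f, std dev = %.3f%n", mean(holdout), sampleStdDev(holdout));

        // Made-up confusion matrix; class order assumed to be {UP, DOWN}.
        int[][] confusion = {{50, 5}, {3, 42}};
        // Assumption: an actual DOWN predicted as UP costs 10; the opposite error costs 1.
        double[][] cost = {{0, 1}, {10, 0}};
        System.out.printf("total cost = %.1f%n", totalCost(confusion, cost));
    }
}
```

The same two statistics can be computed once for the holdout accuracies and once for the cross-validation accuracies, which gives the pair of figures that question 2 asks you to compare. If the intended reading of the DOWN cost is the opposite error direction, only the position of the 10 in the cost matrix changes.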