Tutoring case: BUSS6002
BUSS6002 2020S1

BUSS6002 Group Assignment

Due Date: Wednesday 3 June 2020
Value: 25% of the total mark

Rationale

This group assignment has been designed to allow students to apply their data science skills to a real-world problem in a business domain, as well as to help students develop collaborative skills when working in a team.

Instructions

1. Required submission items via Canvas:
   1. ONE written report (PDF format).
      • Assignments > Report Submission
   2. ONE Jupyter Notebook (.ipynb).
      • Assignments > Upload Your Code File
   3. ONE csv file of test results.
      • Assignments > Submit Your Test Results
2. The assignment is due at 17:00 AEST on Wednesday, 3 June 2020. The late penalty for the assignment is 5% of the assigned mark per day, starting after 17:00 on the due date. The closing date, Wednesday, 10 June 2020, 17:00 AEST, is the last date on which an assessment will be accepted for marking.
3. As per the anonymous marking policy, please include the Group ID and the Student IDs of all group members. Do NOT include names. The name of the report and code file must follow: GroupID_BUSS6002_2020S1, and the name of the test results file must follow: GroupID_Test_Results.csv.
4. Your analyses and answers should be provided as a final report that gives a full explanation and interpretation of any results you obtain. Output without explanation will receive zero marks. You are also required to submit code that can reproduce your reported results, as reproducibility is a key component of data science. Not submitting your code will lead to a loss of 50% of the mark.
5. Be warned that plagiarism between individuals is always obvious to the markers of the assignment and can be easily detected by Turnitin.
6. Presentation of the assignment is part of the assignment. 10% of the marks are allocated to the presentation of your final report and/or code.
7. Numbers with decimals should be reported to three decimal places.

Meeting Minutes and Peer Review

1. Each group is required to submit at least 3 sets of meeting minutes as an appendix attached to the final report. A template will be provided for preparing meeting minutes. You may use the template provided or a template you choose.
2. We may ask for a peer review from each student within a group. The instructions about how to do this will be released later.
3. Each group will be awarded a group mark as per the marking criteria. Individual adjustments to grades may be made if there is a dispute in a group or if the quality or quantity of contributions made by individuals differs significantly. In such a case the unit coordinator will seek meeting minutes and peer review reports from individuals within a group to decide on individual marks.
4. If you encounter any issues with your group members, please report and discuss them with your unit coordinator as early as possible.

Group Competition

A competition will be run among groups to rank the performance of your models on the test data provided. The top 5 groups will be awarded bonus marks to top up their overall assignment mark: the top 3 groups will receive an extra 5 marks, and the 4th and 5th groups will receive an extra 3 marks.

Project Description and Dataset

Nowadays, e-commerce has revolutionized the way companies do business and consumers make purchasing decisions. It has become common practice for consumers to use online reviews to inform their decision making and to give opinions about their buying experience. Companies and individuals are increasingly using such data to better understand their audience and make better decisions. By analyzing consumer opinions of their products, companies can develop comprehensive insights into customers’ experience, and use these to improve their offering, build a better brand and improve their business. Individual consumers can check the opinions of existing users of a product to help them make wiser purchase decisions.
Suppose you are now working in a Data Science Team for an online clothing retailer. The company has noticed a recent decline in its net promoter score, which measures the share of customers who would recommend the company to a friend or colleague. Management suspects that this is the result of a recent change in the procurement strategy for some of its departments, and has tasked you with understanding what customers think about the current collection. To facilitate this, you have been provided with a dataset that consists of detailed product descriptions and classifications of recently sold items and the reviews written by customers. Your team is tasked with analyzing this dataset and reporting your findings to assist the company in improving its appeal to consumers, with the following research objectives:

• Describe how recommendation and rating patterns are affected across departments and product types.
• Understand the shopping behavior of consumers and assess how age affects buying and reviewing behavior.
• Conduct an analysis and build a predictive model to understand what influences a customer’s decision to recommend a product.

There are two data files provided: product_train.csv and product_test.csv. Only product_train.csv contains the target variable, Recommended, where 1 indicates that the customer recommends the product and 0 indicates that he/she does not recommend the product. The details of the features presented in the above datasets are given in dictionary.csv. As it may not be feasible to directly use some of these features (in particular, reviews represented as raw text) to build a model, one of your tasks is to carefully extract or construct meaningful features as input to your analysis.

Tasks

Data Understanding: Conduct a thorough EDA to gain a better understanding of the given data and business objectives. This includes, but is not limited to: checking for and dealing with missing data and outliers, if any; the most popular items sold and their characteristics; recommendation and rating patterns across departments and product types; buying and reviewing behavior of different age groups, etc. Carefully present your analysis and findings in your report.

Build a Benchmark Model to Predict Recommendation: Build a simple logistic regression model to assess the feasibility of recommendation prediction and establish a baseline model. For this task, you are required to build your baseline model using a bag of words of the review text only. Use scikit-learn’s logistic regression model with “solver” set to ‘liblinear’ and all other parameters set to default. Use scikit-learn’s CountVectorizer with “max_features” set to 500 and all other parameters set to default. You need to choose appropriate evaluation metrics and model evaluation strategies to validate your model. Present your analysis and discuss your findings.

Improving Your Benchmark Model: You are required to make attempts to improve the performance of your benchmark model as much as you can. You should consider using more advanced feature engineering techniques and adding extra features to rebuild your model. Your decisions should be justified by evidence from the data and accompanied by a detailed explanation. You must properly validate your model and optimize the hyperparameters that apply. Simply building a model without any consideration of validation and optimisation does not meet the minimum requirements. You should demonstrate evidence of your efforts, and you will be assessed on the depth of your exploration. Provide a summary of what has worked and what has not. Report on your improved models and make comparisons with the benchmark model. Note: You must use logistic regression; no other models are allowed for this task.
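The benchmark configuration named in the task above (CountVectorizer with max_features set to 500, logistic regression with solver set to 'liblinear', everything else at defaults) could be wired together roughly as follows. This is a minimal sketch, not a prescribed solution: the toy reviews, the column name `review_text`, the 3-fold stratified split and the choice of accuracy, F1 and ROC AUC as metrics are all illustrative assumptions, since the handout leaves the evaluation strategy to each group and the real column names are defined in dictionary.csv.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline


def build_benchmark():
    """Bag-of-words + logistic regression with the settings the handout specifies."""
    return make_pipeline(
        CountVectorizer(max_features=500),       # all other parameters left at defaults
        LogisticRegression(solver="liblinear"),  # all other parameters left at defaults
    )


# Toy stand-in for product_train.csv; "review_text" is a hypothetical column name.
toy = pd.DataFrame({
    "review_text": [
        "love this dress fits perfectly",
        "great quality and very comfortable",
        "beautiful color would buy again",
        "perfect fit and soft fabric",
        "so flattering and comfortable love it",
        "great dress lovely fabric",
        "poor quality returned it",
        "too small and itchy fabric",
        "cheap material very disappointed",
        "runs large and looks shapeless",
        "disappointed the color faded returned",
        "itchy and poor fit",
    ],
    "Recommended": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

bench = build_benchmark()

# Stratified folds keep the share of recommended reviews similar in every fold.
# Passing the whole pipeline means the vocabulary is refit on each training fold,
# so no information from the held-out fold leaks into the features.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_validate(
    bench,
    toy["review_text"],
    toy["Recommended"],
    cv=cv,
    scoring=["accuracy", "f1", "roc_auc"],  # assumed metric choices
)

for name in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(name, round(scores[name].mean(), 3))
```

Keeping the vectorizer inside the pipeline is a design choice rather than a requirement of the handout; it also means the same object can later be passed to a hyperparameter search when improving on the benchmark.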
Interpreting Results: Decide on your best model and provide an analysis and interpretation of its behavior. For example, you may report on the features associated with positive/negative recommendation. For your interpretation, you should focus on identifying general rules that might be useful for the company to improve its business in the future.

Final Test Results: Finally, apply your best model to the test data. You are asked to report the classification results on the test data. Save your results into a csv file containing two columns, one for the Review Index (ID from product_test.csv) and the other, Recommended, for the predicted labels (1’s or 0’s). An example file of test results, test_results_example.csv, is also provided. Name your file GroupID_Test_Results.csv. The results on the test data will be assessed to decide your group's performance among the entire class (group competition!).

Presentation

• The assignment material to be submitted will consist of a final report that:
  1) Takes the form of a research article, with sections such as introduction, methodology, experiment results, findings/interpretation, and conclusion. All references should be properly cited in a full bibliographical format. Here are a few examples:
     http://cs229.stanford.edu/proj2015/007_report.pdf
     http://cs229.stanford.edu/proj2015/188_report.pdf
     http://cs229.stanford.edu/proj2015/031_report.pdf
  2) Details ALL steps and decisions taken by the group regarding the requirements above.
  3) Demonstrates an understanding of the problem being addressed and the relevant principles of the data science techniques used.
  4) Clearly and appropriately presents any relevant graphs and tables.
• The report should NOT be more than 20 pages with a font size no smaller than 11pt, including everything such as text, figures, tables, small sections of inserted code, etc., but excluding the cover page and the appendix containing the meeting minutes. Think about the best and most structured way to present your work, summarise the procedures implemented, support your results/findings and prove the originality of your work.
• Your code submission has no length limit; however, make sure your code is as concise as possible and add comments where necessary to explain the functionality of your code segments.
• Your group is required to submit at least 3 sets of meeting minutes. Your group may use the provided template for preparing meeting minutes. Documentation should include attendance, discussion points, actions decided, etc. You may use your own form or find something online.
• You, as a member of a group, may also be required to submit your peer review. Please use the provided criteria sheet for this purpose. You will be advised how to use an online form when it becomes available.
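The two-column results file described under Final Test Results could be assembled along these lines. This is only a sketch under stated assumptions: the header `ID` for the review index column, the helper name `make_results_frame` and the sample IDs and labels are hypothetical, and the exact header should be checked against test_results_example.csv.

```python
import pandas as pd


def make_results_frame(review_ids, predictions):
    """Pair each review ID with its predicted label (hypothetical helper)."""
    labels = [int(p) for p in predictions]
    if len(labels) != len(review_ids):
        raise ValueError("one prediction is needed per review ID")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("predicted labels must be 0 or 1")
    # "ID" is an assumed header; "Recommended" is the column named in the handout.
    return pd.DataFrame({"ID": list(review_ids), "Recommended": labels})


# Made-up IDs and labels standing in for product_test.csv and the model output.
results = make_results_frame([101, 102, 103], [1, 0, 1])

# With no path, to_csv returns the file content as a string; index=False keeps
# the DataFrame's own row index out of the file so only two columns are written.
csv_text = results.to_csv(index=False)
print(csv_text)

# To write the actual submission file:
# results.to_csv("GroupID_Test_Results.csv", index=False)
```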