The University of Sydney

COMP5310: Principles of Data Science
W1: Introduction
Presented by Dr Ali Anaissi
School of Information Technologies

Curriculum at a glance
Whirlwind tour of:
– Data Exploration
– Data Engineering
– Data Mining & Machine Learning
– Making Decisions from Data
Focus on key activities of a data scientist

Perspectives and communication
Diverse cohort in this unit with:
– Honours degrees in non-quantitative disciplines
– Bachelor's degrees in quantitative disciplines or IT
– Years of experience in industry
Doing data science requires:
– Understanding application domain
– Learning, collaborating, communicating
– Product thinking
Chance to build key soft skills as well as technical skills

Questions and suggestions
We are very excited to be teaching this for the fifth year. Thank you for joining us!
Please feel free to:
– Ask questions (we should know the answer or someone who does)
– Share thoughts and suggestions on how we can improve
Questions about the MDS degree program or enrolments?
– Keiko Narushima (MDS admin officer), SIT Building, room 2E-229
– phone: 02 8627 0872 email:
[email protected]

UNIT ARRANGEMENTS

Introducing Team
Lecturer: Dr Ali Anaissi
Unit Coordinator: Dr Ali Anaissi, SIT Building J12, Level 2
[email protected]
Tutors: Seid Miad Zandavi, Omid Tavallaie, Hossein Moeinzadeh

Resources
Google Sheets for spreadsheet exercises [week 2]
– Please create a Google account if you don’t already have one!
Jupyter Hub accounts for Python/SQL exercises
– We will provide account details in week 3
– But we recommend you download Anaconda and the PostgreSQL database on your PC

Textbooks and readings
Data Science from Scratch. Grus. O’Reilly Media. 2015.
– Available electronically through library.
Doing Data Science. O’Neil and Schutt. O’Reilly Media. 2013.
– Available electronically through library.

Learn Python and SQL with Grok
– Exercises will use Python from week 3
– We provide self-guided Python learning through Grok
– Grok learning modules are available now in Canvas under the Assignments folder
– Please complete (sooner is better, week 5 at latest)

Find everything on Canvas
– The web site for this unit is on Canvas
– Use it to access contacts, schedule, readings, slides, etc
– Participate in Q&A with instructors and classmates
https://canvas.sydney.edu.au

ASSESSMENTS

Assessment
– 10%: Participation
– 10%: Project stage 1
– 15%: Project stage 2
– 5%: Project stage 3
– 60%: Final exam

Participation
Objective: Ensure everybody is keeping up.
Requirements:
– Submit code at end of each exercise
– Complete Grok exercises (not marked)
Output: Code/spreadsheets from exercises
Marking: 10% of overall mark

Project stage 1: Explore, Clean, Pitch
Objective: Explore a data set and define a research question based on research/business requirement.
Activities:
– Choose a data set
– Explore, summarise and prepare data
– Define problem, specify requirements
Output: 2-page report summarising problem analysis and proposal (plus code)
Marking: 10% of overall mark (report and code)

Project stage 2 and 3: Experiment, Quantify, Report
Objective: Define an experimental framework and complete analysis/visualisation, data mining, machine learning, etc.
Activities:
– Define experimental framework
– Perform analysis or build tool
– Describe evaluation and conclusions
Output:
– 4-page report describing framework, analysis and conclusions (plus code)
– Presentation (2-3/3-4 mins)
Marking: 20% of overall mark
– 15% report and code
– 5% presentation

Final exam
Objective: Assess understanding of unit material, ability to frame data problems scientifically, and critical thinking about claims made based on data
Activities:
– Answer questions about lecture materials
– Practical exercises and SQL queries
– Describe an approach to answering a question with data
– Critique a claim made based on data
Format:
– Written examination
– Must get 40% on exam to pass unit per SIT policy
Marking:
– 60% of overall mark
– Cap on final mark, which cannot exceed exam mark by more than 10 marks

Lecture plan
– W1: Introductions and housekeeping
– W2: Data exploration (spreadsheets)
– W3: Data exploration (Python)
– W4: Cleaning and storing data
– W5: Querying and summarising data
– W6: Hypothesis testing (Project stage 1 due)
– W7: Data Mining – Association Rules and Dimensionality Reduction
– W8: Data Mining – Clustering
– W9: Machine Learning – Regression
– W10: Machine Learning – Classification
– W11: Unstructured Data
– W12: Review (Project stage 2 and 3 due)
– Exam

LATENESS AND PLAGIARISM

Recipe for success
– Attend scheduled classes except for illness, emergency, etc
– Plan 6-9 hours per week for preparation, practice, project,
etc
– Participate in classes and forums with respect and humility
– Submit assessments on time
– Let us know if you have any concerns, e.g., if you are falling behind

Special consideration (University policy)
– If your performance on assessments is affected by illness or misadventure
– Follow proper bureaucratic procedures
– Have professional practitioner sign special USyd form
– Submit application for special consideration online, upload scans
– Note you have only a quite short deadline for applying
– http://sydney.edu.au/current_students/special_consideration/
– Notify us by email as soon as anything begins to go wrong
– There is a similar process if you need special arrangements for religious observance, military service, representative sports, etc

Penalty for lateness
– If you have not been granted special consideration
– Penalty is 5% of awarded marks per day (as the examples show, any part of a day counts as a full day)
– Maximum 10 days late, then 0 points
– Examples:
  – Work would have scored 60% and is 1 hour late: 57% (one day: 5% of 60 = 3 marks deducted)
  – Work would have scored 70% and is 28 hours late: 63% (two days: 2 × 5% of 70 = 7 marks deducted)
– Recommendation: submit early; submit often

Academic integrity (University policy)
“The University of Sydney is unequivocally opposed to, and intolerant of, plagiarism and academic dishonesty. Academic dishonesty means seeking to obtain or obtaining academic advantage for oneself or for others (including in the assessment or publication of work) by dishonest or unfair means.
Plagiarism means presenting another person’s work as one’s own work by presenting, copying or reproducing it without appropriate acknowledgement of the source.”
http://sydney.edu.au/elearning/student/EI/index.shtml

Academic integrity (University policy)
– Submitted work is compared against other work
– Turnitin for textual tasks (through eLearning)
– other systems for code
– Penalties for academic dishonesty or plagiarism can be severe
– Complete required self-education AHEM1001

INTRODUCTIONS AND BACKGROUNDS

Exercise: Survey of skills and interests
https://goo.gl/BgVnjR (link on Canvas)
Survey – Individual Responses
What kind of role would you like (Data Engineer/Scientist, Analyst, etc)?
What are the three most important data analytics skills?
We’ll explore this data in week 2 exercises!

WHAT IS DATA SCIENCE?

Data Scientists build intelligent systems to derive knowledge from data.

Data Science skills
http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined/
Data scientists help organisations:
– understand their data,
– ask meaningful questions,
– derive transformative insights,
– lead empirically grounded decision making.
Cross Industry Standard Process for Data Mining (CRISP-DM)
By Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610

Business Understanding Phase
– Business objective
– Understand business processes
– Associated costs/pain
– Assess situation
– Define the success criteria
– Data science goals
– Project plan
– List assumptions and risk (technical/financial/business/organizational) factors

Some example goals
– Farmer wants advice on what fertilizer to use, to maximize crop yield
– Bank wants to automatically flag some credit card purchases as potentially fraudulent, to delay payment till checks have been made
– Biologist wants to be able to find out which species of micro-organism are present in a location, given a list of protein fragments found in an environmental sample

Some example goals (cont’d)
– Doctor wants to determine whether a patient is likely to have a particular disease, given results of tests (none of which is perfect)
– Designer wants a car that brakes automatically when a pedestrian steps in front

Data Understanding Phase
– Collect Data
  – What are the data sources?
    • Original sources (these all will contain errors!):
      – sensors (measure the world)
      – surveys (ask people)
      – digital logs (track IT activities)
    • Secondary sources
      – other scholars, organizations, etc
      – data may already be summarized, transformed, cleaned, etc

Examples of datasets
– Census
  – raw data has individual level demographics etc
  – available summaries combine these into counts in a suburb etc
– Crop observations
  – many plantings, with many features (seed type, date, weather, soil, fertilizer etc), and resulting crop yields
– Credit card histories
  – lots of transactions of many users, with many features, some transactions were reported as fraudulent
– Medical records
  – lots of patients, their test results, diagnoses

Data Understanding Phase
– Data Description
  – Document data quality issues
  – Compute basic statistics
– Data Exploration
  – How is it structured? What is the meaning of the different features?
    • e.g. is temperature the daily maximum, monthly average, at some specific time? is income measured in actual dollars or inflation-adjusted ones?
  – Simple univariate data plots/distributions
  – Investigate attribute interactions
    • Can you find patterns connecting different features?
– Data Quality Issues

Data Preparation Phase
– Integrate Data
  – Joining multiple data tables
  – Summarisation/aggregation of data
– Select Data
  – Attribute subset selection
    • Rationale for Inclusion/Exclusion
  – Data sampling
    • Training/Validation and Test sets

Data Preparation Phase (cont’d)
– Data Transformation
  – Using functions such as log
  – Factor/Principal Components analysis
  – Normalization/Discretization/Binarization
– Clean Data
  – Handling missing values/Outliers
– Data Construction
  – Derived Attributes

The Modelling Phase
– Select the appropriate modelling technique
  – Dependent on
    • Data mining problem type
    • Output requirements
– Develop a testing regime
  – Sampling
    • Verify samples have similar characteristics and are representative of the population

The Modelling Phase (cont’d)
– Build Model
  – Choose initial parameter settings
  – Study model behaviour
    • Sensitivity analysis
– Assess the model
  – Beware of over-fitting
  – Investigate the error distribution
    • Identify segments of the state space where the model is less effective
  – Iteratively adjust parameter settings
    • Document reasons for these changes

Examples of Models
– Model to predict the purity of the environment based on carbon level (Regression prediction model)
– Model to classify whether a person is cheating on their tax return (Classification prediction model)
– Model to find hidden patterns and association rules in market basket analysis (Clustering or association rules)
– Model to detect anomalies or outliers such as spam emails (Classification prediction model)
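
The sampling and transformation steps named on the Data Preparation Phase slides can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the unit's prescribed method: the function names (split_rows, log_transform, fit_standardiser, standardise), the 60/20/20 split and the crop-yield numbers are all assumptions made up for the example. It shows one common convention, which is to learn the normalization parameters from the training set only and then reuse them on the other sets.

```python
import math
import random
import statistics


def split_rows(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and cut them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def log_transform(values):
    """Apply log(1 + x), which compresses large positive values."""
    return [math.log1p(v) for v in values]


def fit_standardiser(train_values):
    """Learn the mean and standard deviation from the training values only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def standardise(values, mean, stdev):
    """Rescale values to z-scores using previously learned parameters."""
    return [(v - mean) / stdev for v in values]


# Hypothetical crop yields, invented purely for illustration.
yields = [float(v) for v in range(1, 21)]
train_set, validation_set, test_set = split_rows(yields)

# Fit on the training set, then apply the same parameters to the test set.
mean, stdev = fit_standardiser(log_transform(train_set))
train_z = standardise(log_transform(train_set), mean, stdev)
test_z = standardise(log_transform(test_set), mean, stdev)
```

Fitting the mean and standard deviation on the training set alone keeps information about the test set out of the model-building process, which is one practical guard against the over-fitting warned about on the Modelling Phase slide.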
The Evaluation Phase
– Validate Model
  – Human evaluation of results by domain experts
  – Evaluate usefulness of results from business perspective
    • Define control groups
    • Expected Return on Investment
– Review Process
– Determine next steps
  – Potential for deployment
  – Metrics for success of deployment

The Deployment Phase
– Knowledge Deployment is specific to objectives
– Knowledge Presentation
– Automated pre-processing of live data feeds
– Generation of a report
  • Online/Offline
– Monitoring and evaluation of effectiveness

DATA SCIENCE PROJECTS

Example: Reducing costs through route optimisation
http://www.bloomberg.com/news/articles/2013-10-30/ups-uses-big-data-to-make-routes-more-efficient-save-gas
– Use customer, vehicle and delivery data
– 1 mile less per day for every driver saves $50 million p.a. in fuel, maintenance and time
– Less idling, e.g., by avoiding left turns, saved 1.6 million gallons of fuel in 2012

Example: Structural Health Monitoring (SHM)
Time-based maintenance:
• Preventative maintenance schedules
• Too early or too late
SHM:
• Condition-based maintenance using sensors
• Data-driven approach establishes model from data, using machine learning techniques.

Example: Preventative policing
http://arxiv.org/pdf/1508.03965v1.pdf
– Given social network from arrest records, geographic, temporal data
– Predict whether a person is likely to be involved in crime
– Chicago police are using it to issue preemptive warnings: “We’re watching you”

Example: Road Condition Assessment from Vehicle-mounted Sensor
[Figure: diagram with the labels Data Acquisition, Machine Learning Analysis, Feature Extraction, Road Health Score, Excitation]

WHERE DO I GET DATA?
Source Example: UCI Machine Learning Repository Datasets
About: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
URL: https://archive.ics.uci.edu/ml/datasets.html
Data sets:
– Classification
  – Breast Cancer
  – Diabetes
  – Letter Recognition, etc.
– Regression
  – Forest Fires
  – Buzz in social media, etc.
– Clustering
  – Bag of Words
  – Sponge, etc.

Source Example: Kaggle Datasets
About: Kaggle is an online platform for data science competitions. Some data sets are publicly available.
URL: https://www.kaggle.com/datasets
Data sets:
– Amazon fine food reviews
– Health insurance marketplace
– World food facts
– Ocean ship logbooks
– Reddit comments
– Hillary Clinton’s emails
– GOP debate Twitter sentiment
– NIPS 2015 papers

Source Example: AIHW Data
About: The Australian Institute of Health & Welfare collects data that provide insight into the health and wellbeing of the multifaceted Australian population.
URL: http://www.aihw.gov.au/data-by-subject/
Data sets:
– Alcohol, Tobacco & Drugs
– Cancer
– Children’s health
– Height & weight
– Hospitals
– Indigenous health
– Mental health
– Lots more!

Source Example: Reddit comments
About: Reddit is a social news web site that functions like an online bulletin board.
URL: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
Data sets:
– 1.7 billion public comments

REVIEW

W1 Review: Introductions and housekeeping
Objective: Housekeeping; Learn about backgrounds and goals; Define data science.
Lecture:
– Welcome, introductions
– Unit overview, assessment, resources
– Learning Python with Grok
– Discuss definitions/scope of data science
Readings:
– Data Science from Scratch: Ch 1
– Is being a data scientist really the best job in America?
– 8 skills you need to be a data scientist
Exercises:
– Introductions / interviews
– Interests / definitions
TODO in W1:
– Grok Python modules 1-3
– Fill out & submit background survey
– Choose possible project data

Formulating a COMP5310 project (Stage 1 & 2)
– By next week:
  – Identify possible problems and data sets
  – Think about questions the data can answer
– Other possible data sets…

Source Example: Yahoo Webscope
About: The Yahoo Webscope program is a reference library of data sets for non-commercial use by academics.
URL: http://webscope.sandbox.yahoo.com/
Data sets:
– 13.5 TB of user interaction data
– Search engine query logs
– Q&A forum data
– Query entity disambiguation

Source Example: GovHack Data
About: GovHack is an annual event that brings people together to innovate with open government data. They list many data sets from Australia and New Zealand.
URL: http://portal.govhack.org/datasets.html https://data.gov.au/
Data sets:
– ABC news and TV archives
– Australian census data
– Labour, industry, transport data
– Health and welfare data
– Various CSIRO data sets
– Finance, IP, geoscience, archives, etc

NEXT TIME

Next week: Data exploration with spreadsheets
Objective: Use interactive tools to explore a new data set quickly.
Lecture:
– Data types, cleaning, preprocessing
– Descriptive statistics, e.g., mean, stddev, median
– Descriptive visualisation, e.g., scatterplots, histograms
Readings:
– Data Science from Scratch: Ch 2-3
Exercises:
– Google Sheets: Visualisation
– Google Sheets: Descriptive stats
TODO for W2:
– Grok Python modules 1-3
– Make sure you answered today's background survey
– Explore project data
– GET YOUR GOOGLE ACCOUNT!

Thanks
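
As a small preview of the descriptive statistics listed for next week (mean, stddev, median), the same quantities can be computed with Python's standard statistics module. The week 2 exercises themselves use Google Sheets, so this is only a hedged sketch for comparison; the ages list is a hypothetical sample invented for the example.

```python
import statistics

# Hypothetical sample of ages, made up for illustration.
ages = [23, 25, 25, 29, 31, 35, 42]

summary = {
    "mean": statistics.mean(ages),      # arithmetic average
    "median": statistics.median(ages),  # middle value of the sorted data
    "stddev": statistics.stdev(ages),   # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Note that statistics.stdev divides by n − 1 (sample standard deviation), while statistics.pstdev divides by n (population standard deviation); spreadsheet tools typically offer both variants as well, so it is worth checking which one a given function computes.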