MIE 1628 Big Data Science – Fall 2020
Final Project
Due Date: TBA

1 About the Data
1.1 Predict the Quality Rating of Stack Overflow Questions
The dataset contains 60,000 Stack Overflow questions collected from 2016 to 2020 and classified into three
categories:
1. HQ: High-quality posts with a score of 30 or more and without a single edit.
2. LQ_EDIT: Low-quality posts with a negative score and multiple community edits.
However, they remain open after the edits.
3. LQ_CLOSE: Low-quality posts that were closed by the community without a single edit.
In addition:
1. The question body is in HTML format.
2. All dates are in UTC.
1.1.1 Features and Target
• Id: The ID of the post
• Title: The title of the Stack Overflow question
• Body: Question body in HTML
• Tags: Question tags
• CreationDate: Creation date in UTC
• Target: Quality rating (the prediction target)
1.2 Identify Potentially Hazardous Asteroids
The dataset contains physical parameters and measurements for over 900,000 asteroids.
Machine learning is increasingly applied to problems in astronomy and astrophysics.
The dataset is officially maintained by the Jet Propulsion Laboratory of the California Institute of
Technology, an organization under NASA, and it covers a wide range of asteroid data.
1.2.1 Features and Target
• SPK-ID: Object primary SPK-ID
• Object ID: Object internal database ID
• Object fullname: Object full name/designation
• pdes: Object primary designation
• name: Object IAU name
• NEO: Near-Earth Object (NEO) flag
• PHA: Potentially Hazardous Asteroid (PHA) flag (the prediction target)
• H: Absolute magnitude parameter
• Diameter: Object diameter from equivalent sphere (km)
• Albedo: Geometric albedo
• Diameter_sigma: 1-sigma uncertainty in object diameter (km)
• Orbit_id: Orbit solution ID
• Epoch: Epoch of osculation in modified Julian day form
• Equinox: Equinox of reference frame
• e: Eccentricity
• a: Semi-major axis (au)
• q: Perihelion distance (au)
• i: Inclination, the angle with respect to the x-y ecliptic plane
• tp: Time of perihelion passage (TDB)
• moid_ld: Earth Minimum Orbit Intersection Distance (au)
1.3 Improve the algorithm that classifies drugs based on their biological activity
This project is an ongoing Kaggle competition. Details about the competition are available at
https://www.kaggle.com/c/lish-moa/overview
The Connectivity Map, a project within the Broad Institute of MIT and Harvard, the Laboratory
for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated
Network-Based Cellular Signatures (LINCS) present this challenge with the goal of advancing
drug development through improvements to mechanism of action (MoA) prediction algorithms.
The prize money is $30,000.
2 Learning Objectives
2.1 Data Cleaning
There may be missing values in the dataset. Handle them however you see fit and
justify your approach. Provide some insight into why you think the values are missing and how this
might affect your overall analysis. Text data may also contain special characters. Handle these
as you see fit as well, and justify your approach in the report.
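
As a minimal sketch of the two cleaning steps described above, the functions below count missing values per column and strip HTML markup from a question body. The sample rows, the function names, and the choice to treat empty strings as missing are assumptions made for illustration. Plain Python is used here only to show the logic; for the project the same steps would need to run in PySpark or Scala (for example, with the text function wrapped in a UDF).

```python
import html
import re

# Hypothetical sample rows shaped like the Stack Overflow data (values are made up).
rows = [
    {"Id": 1, "Title": "How to sort a list?",
     "Body": "<p>I have a <code>list</code> &amp; want it sorted.</p>"},
    {"Id": 2, "Title": None, "Body": "<p>Why does this fail?</p>"},
    {"Id": 3, "Title": "Null pointer", "Body": ""},
]

def missing_counts(records, columns):
    """Count values that are None or blank strings (assumed here to mean 'missing')."""
    counts = {col: 0 for col in columns}
    for record in records:
        for col in columns:
            value = record.get(col)
            if value is None or (isinstance(value, str) and value.strip() == ""):
                counts[col] += 1
    return counts

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(body):
    """Remove tags, then decode entities such as &amp;, then collapse whitespace."""
    text = TAG_RE.sub(" ", body or "")
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

missing = missing_counts(rows, ["Id", "Title", "Body"])
cleaned = [strip_html(row["Body"]) for row in rows]
```

Tags are removed before entities are decoded so that escaped characters inside code snippets (for example `&lt;`) are not mistaken for markup. A regular expression is a crude HTML stripper, so whichever approach is chosen should be justified in the report.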
2.2 Feature Engineering
Use the original features in the data to create additional, innovative features. The goal
is to derive new features from existing ones that boost the final model's
performance. Please support your new features with plots and/or statistical analysis.
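
One possible set of derived features for the Stack Overflow data can be sketched as follows. The feature names, the assumed tag format (`<python><sorting>`), and the assumed timestamp format (`YYYY-MM-DD HH:MM:SS`) are illustrative guesses rather than facts about the dataset, and the plain-Python function only demonstrates the logic that would be written in PySpark or Scala for the project.

```python
import re
from datetime import datetime

def engineer_features(title, body_text, tags, creation_date):
    """Derive simple numeric features from one question (all names are illustrative)."""
    # Assumed timestamp format; adjust to whatever the real CreationDate column uses.
    created = datetime.strptime(creation_date, "%Y-%m-%d %H:%M:%S")
    # Assumed tag format: each tag wrapped in angle brackets.
    tag_list = re.findall(r"<([^<>]+)>", tags)
    return {
        "title_len": len(title),
        "title_is_question": int(title.rstrip().endswith("?")),
        "body_word_count": len(body_text.split()),
        "tag_count": len(tag_list),
        "hour_utc": created.hour,
        "day_of_week": created.weekday(),  # Monday = 0
    }

# Made-up example question.
features = engineer_features(
    title="How to sort a list?",
    body_text="I have a list & want it sorted.",
    tags="<python><sorting>",
    creation_date="2016-01-01 00:21:59",
)
```

Whether any of these features actually helps is an empirical question, which is why each one should be backed by a plot or a statistical test against the target.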
2.3 Exploratory Data Analysis
Present at least 4 graphs that show trends in the data (these may include plots from 2.2, but new,
unique plots would be a plus), and explain how those trends might be helpful for your machine
learning algorithm. All graphs must be appropriately labelled (axes and titles).
2.4 Feature Selection
Analyze and visualize the importance of your final features. Based on this analysis, select the
features for your model, either manually or through a feature selection algorithm. Using a
feature selection technique and justifying its choice would be a plus.
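
As a rough illustration of a filter-style selection step, the sketch below ranks numeric features by the absolute Pearson correlation with a binary target. This is only one of many possible techniques and is not prescribed by the assignment; the toy values, the feature names, and the function names are made up, and the project version would be written in PySpark or Scala.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Return (feature, |correlation|) pairs sorted from strongest to weakest."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up toy data: the target is 1 for "hazardous" and 0 otherwise.
columns = {
    "moid_ld": [0.01, 0.02, 0.30, 0.40, 0.50, 0.03],
    "albedo": [0.10, 0.30, 0.20, 0.10, 0.30, 0.20],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [1, 1, 0, 0, 0, 1]
ranking = rank_features(columns, target)
```

Correlation only captures linear relationships with the target, so a model-based importance measure may rank the same features differently.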
2.5 Model Implementation and Tuning
Build at least 3 machine learning models to predict the target. Choose appropriate evaluation
metrics to evaluate your models, and provide graphs and statistical data to support your results.
Draw conclusions about the strengths and weaknesses of each model and select a winning model based
on them.
Tune the hyperparameters of the winning model and compare the tuned and untuned models
using an appropriate evaluation metric. Use k-fold cross-validation to get a better sense of the model's
performance.
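
The mechanics of k-fold cross-validation can be sketched as follows. The example scores a majority-class baseline as a stand-in for a real model, uses accuracy as the metric, and runs on made-up labels; all function names are hypothetical. It is written in plain Python purely to show how the folds are formed, and the project itself would rely on PySpark or Scala tooling.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once, then yield (train, validation) index lists per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_validate(labels, k):
    """Score a majority-class baseline on each fold (a stand-in for a real model)."""
    scores = []
    for train, validation in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        predictions = [majority] * len(validation)
        scores.append(accuracy([labels[i] for i in validation], predictions))
    return scores

# Made-up labels using the three quality categories.
labels = ["HQ"] * 6 + ["LQ_EDIT"] * 3 + ["LQ_CLOSE"] * 3
scores = cross_validate(labels, k=3)
mean_score = sum(scores) / len(scores)
```

Each row is used for validation exactly once, so the mean and spread of the per-fold scores give a steadier estimate than a single train/test split. A baseline like this one also provides a floor against which the three real models can be compared.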
2.6 Model Testing and Discussion
One week before the project submission deadline, a new dataset will be released. Apply your
winning model to this data and analyze whether the model is overfitting.
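
One simple way to frame the overfitting check is to compare the score on the training data with the score on the newly released data. The sketch below does this arithmetic; the 0.05 tolerance and the example scores are arbitrary assumptions, not values from the course, and the function names are hypothetical.

```python
def generalization_gap(train_score, new_data_score):
    """Difference between the training score and the score on unseen data."""
    return train_score - new_data_score

def looks_overfit(train_score, new_data_score, tolerance=0.05):
    """Flag a model whose score drops by more than `tolerance` on unseen data.

    The tolerance is an arbitrary illustrative threshold, not a standard value.
    """
    return generalization_gap(train_score, new_data_score) > tolerance

# Made-up scores for two hypothetical models (same metric, higher is better).
model_a_overfit = looks_overfit(train_score=0.95, new_data_score=0.80)
model_b_overfit = looks_overfit(train_score=0.90, new_data_score=0.88)
```

A large gap suggests the model has memorized patterns specific to the training set, while a small gap with a low score on both sets points to underfitting instead.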
3 Note:
1. The project must be done using PySpark and/or Scala. Plain Python may be used only
for data visualization.
2. No tool or software other than those mentioned in item 1 may be used to modify the data
files. For instance, using Microsoft Excel to clean the data is not allowed.
4 Submission Format
4.1 Report
We do not prescribe a specific format for the report, but your report should include everything
discussed in Section 2.
4.2 Group Presentation
In addition to presenting your work, each team member will be asked questions about
the code, as well as on-the-fly questions about big data and machine learning (drawn only from material
taught throughout the course).
4.3 Code Notebook
The code must be well organized and must include appropriate comments and explanations, as
described in the learning objectives.
