ELEC0136 Data Acquisition and Processing Systems 20/21 Assignment

Objective summary

You will be asked to solve a real-life data acquisition and processing problem. You will need to design solutions for each step of the way, from finding the data, to acquiring and storing it, to cleaning and preprocessing it, to finally analysing it. You should use the methods we have seen in the lectures and in the labs, such as acquiring data via APIs, storing it in data structures or databases, visualisation techniques, standard statistical analysis to clean and describe the data, etc. You are also free to use any additional methods you find well suited to the problem. You will be asked to write a report (in the form of an academic paper) in which you justify your methodology choices. You are allowed to discuss ideas with peers, but your code, experiments and report must be based solely on your own work.

1. Problem Description

Assume you are a junior Data Scientist in an investment company and your project manager provides you with a list of the following public companies:
● Apple Inc. (AAPL)
● Microsoft Corp. (MSFT)
● American Airlines Group Inc. (AAL)
● Zoom Video Communications Inc. (ZM)
Your task is to assess the financial potential of the company of your choice. In particular, you have been instructed to select one of the listed companies and decide when and whether the firm you work for should buy its stock. Your project manager was kind enough to advise you on the ML process you should follow for this challenge:
1. Select a company and acquire stock data from the beginning of the fiscal year 2017 up to now. Decide on and collect any other data on events (e.g., climate changes, a pandemic, seasons, etc.) that might have an impact on the stock.
2. Propose a way to store this type of data so it can be easily accessed by the ML processes you will develop next.
3. Preprocess your data.
4. Visualize and explore patterns in your data. If needed, repeat step 3.
5.
Train a model to predict the closing stock price.

The technical objectives of the new project that has been assigned to you are described in Section 2.

2. Tasks Description

2.1 Data Acquisition

You will first have to acquire the necessary data for conducting your study. One essential type of data that you will need is the stock price of each company from April 2017 until now, as described in Section 1. Since these companies are public, the data is made available online. Your first task is to search for and collect this data, finding the best way to access and download it. A good place to look is platforms that provide free stock-market data, such as Google Finance or Yahoo! Finance. Bear in mind that there are many other valuable sources of information for analysing the stock market. In addition to time series data depicting the evolution of the stock price over time, you can also use other sources such as:
a) Social media, e.g., Twitter: this can be used to uncover the public's sentiment towards the stock market.
b) Financial reports: these can help explain which factors are likely to affect the stock market the most.
c) News: this can be used to draw links between current affairs and the stock market.
d) Climate data: sometimes weather data is directly correlated with some companies' stock prices and should therefore be taken into account in financial analysis.
e) Others: anything that can justifiably support your analysis.

2.2 Data Storage

Once you have collected the relevant data, you need to decide on the best way of storing it for easy and efficient access throughout the study. You are expected to store the data locally on your computer in the format of your choice (csv, pkl, numpy files). The data should be organized/structured according to the type of information gathered, so it can be easily understood and accessed efficiently. Additionally, you can set up your own cloud database to be accessed via an API.
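As a minimal sketch of the local-storage options above (csv and pkl), assuming a pandas workflow: synthetic closing prices stand in for data that would in practice be downloaded from, e.g., Yahoo! Finance, and the ticker and file names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: business days with a random-walk closing price,
# standing in for real AAPL prices downloaded from a finance platform.
dates = pd.bdate_range("2017-04-03", periods=100)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 150 + rng.normal(0, 1, size=len(dates)).cumsum()},
    index=dates,
)
prices.index.name = "Date"

# Store in two of the suggested formats: csv (human-readable, portable)
# and pkl (preserves dtypes and the DatetimeIndex exactly).
prices.to_csv("AAPL_close.csv")
prices.to_pickle("AAPL_close.pkl")

# Reloading: csv needs the index parsed back into dates; pkl does not.
from_csv = pd.read_csv("AAPL_close.csv", index_col="Date", parse_dates=True)
from_pkl = pd.read_pickle("AAPL_close.pkl")
assert from_csv.index.equals(prices.index)
assert from_pkl.equals(prices)
```

The same idea extends to one file per ticker and data source (e.g., a `stocks/` folder and a `news/` folder), which keeps the data organised by the type of information gathered.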
A cloud-based database is not mandatory but will earn you extra marks on the assignment (see Section 4 for the marking criteria).

2.3 Data Preprocessing

Now that you have the data stored, you can start preprocessing it. Think about which features to keep and which ones to transform, combine or discard. Make sure your data is clean and consistent (e.g., are there many outliers? Any missing values?). You are expected to (1) clean, (2) visualize and (3) transform your data (e.g., using normalization, dimensionality reduction, etc.).

2.4 Data Exploration

After ensuring that the data is well preprocessed, it is time to start exploring it to form hypotheses and build intuition about possible patterns that might be inferred. Depending on the data, different EDA techniques can be applied, and a large amount of information can be extracted. For example, you could do the following analysis:
● Time series data is normally a combination of several components:
○ Trend represents the overall tendency of the data to increase or decrease over time.
○ Seasonality is related to the presence of recurrent patterns that appear after regular intervals (like seasons).
○ Random noise is often hard to explain and represents all those changes in the data that seem unexpected. Sometimes sudden changes are related to fixed or predictable events (e.g., public holidays).
● Feature correlation provides additional insight into the data structure. Scatter plots and boxplots are useful tools for spotting relevant information.
● Explain unusual behaviour.
● Explore the correlation between the stock price data and other external data that you can collect (as listed in Section 2.1).
● Use hypothesis testing to better understand the composition of your dataset and its representativeness.
At the end of this step, provide key insights on the data. This data exploration procedure should inform the subsequent data analysis/inference procedure, allowing one to establish a predictive relationship between variables.
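To illustrate the trend/seasonality/noise decomposition above, here is a minimal sketch using only pandas and numpy on a synthetic daily series; a library routine such as statsmodels' `seasonal_decompose` would be the more standard choice, and the series, window length and weekly period here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: linear trend + weekly seasonality + noise.
days = pd.date_range("2017-04-01", periods=365, freq="D")
t = np.arange(len(days))
rng = np.random.default_rng(1)
series = pd.Series(
    0.05 * t + 2 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, len(t)),
    index=days,
)

# Trend: centred rolling mean over one seasonal period (7 days).
trend = series.rolling(window=7, center=True).mean()

# Seasonality: average detrended value for each day of the week.
detrended = series - trend
seasonal = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Residual (random noise): whatever trend and seasonality do not explain.
residual = series - trend - seasonal

# The residual should carry much less variation than the raw series.
assert residual.std() < series.std()
```

Plotting `series`, `trend`, `seasonal` and `residual` on stacked axes is a quick way to check visually whether the chosen seasonal period actually matches the data.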
2.5 Data Inference

You are expected to train a model to predict the closing stock price on each day, using the data you have already collected, stored, preprocessed and explored in the previous steps. The data must span April 2017 to April 2020. You should develop two separate models:
1. A model that predicts the closing stock price on each day over a 1-month time window (until the end of May 2020), using only the company's stock data.
2. A model that predicts the closing stock price on each day over a 1-month time window (until the end of May 2020), using the company's stock data and external data.
Which model performs better? How do you measure performance, and why? How could you further improve the performance? Are the models capable of predicting closing stock prices far into the future?

[IMPORTANT NOTE] For these tasks, you are not expected to compare model architectures, but to examine and analyse the differences when training the same model with multiple data attributes and information sources. Therefore, you should decide on a single model suitable for time series data to solve the tasks described above. Please see the lecture slides for tips on model selection, and feel free to experiment before selecting one.

The following would help you evaluate your approach and pinpoint any mistakes or wrong choices you made in this or previous phases:
1. Evaluate the performance of your model using metrics such as Mean Squared Error, Mean Absolute Error or R-squared.
2. Use ARIMA and Facebook Prophet to explore the uncertainty in your model's predicted values by employing confidence bands.
3. Result visualization: create joint plots showing marginal distributions to understand the correlation between actual and predicted values.
4. Finding the mean, median and skewness of the residual distribution might provide additional insight into the predictive capability of the model.

3. Deliverables

3.1.
Report

The report should be written in the form of an academic paper using the ICML format (https://icml.cc/Conferences/2020/StyleAuthorInstructions). The paper should be at most 8 pages long excluding references, with an additional maximum of 2 pages for references. The paper must include the following sections:
● Abstract. This section should be a short paragraph (4-5 sentences) that provides a brief overview of the methodology and results presented in the report.
● Introduction. This section describes the problem, with an emphasis on the motivation and the end goal.
● Data description. This section details the data that was used for this study. For each data set, it should clearly describe the content, size and format of the data. The reason for selecting each data set should also be provided in this section.
● Data acquisition. This section presents the data acquisition process, explaining how each data set was acquired and why you chose the specific acquisition method.
● Data storage. This section explains and justifies your data storage strategies.
● Data preprocessing. This section should describe in detail all the preprocessing steps that were applied to the data. A justification for each step should also be provided. In case no or very little preprocessing was done, this section should clearly justify why.
● Data exploration. This section should summarize any data exploration task you resorted to in order to find particular patterns within the data.
● Data inference. This section should first describe the inference problem, then explain and justify the methodology used to approach the problem, and finally present the results.
● Conclusion. This last section summarises the findings, highlights any challenges or limitations that were encountered during the study and provides directions for potential improvements.
Please make sure you complement your discussion in each section with relevant equations, diagrams or figures as you see fit.

3.2.
Code

In addition to the report, you should also provide all the code that was used for your study. You can choose between one of these two options for sharing your code:
● Python notebook(s). If you wish to conduct all of your preprocessing, analysis and experiments in Python notebook(s), you are welcome to do so. You should make sure that the notebook(s) are easily readable. Divide your code into cells that match the report sections. You should also make use of text cells to accompany your code.
● Python files. If you prefer working with Python files, you are welcome to do so. Make sure to document your code well in the form of comments. You should provide one main file that contains the runnable script used to produce all the analysis and results presented in your report (anyone should be able to reproduce all your results by running the main file only). Optionally, you can add a README text file that details instructions on how to run your code.
Either way, the code you submit must:
● Be readable and well documented. Each class and function should be accompanied by comments describing its use. Additionally, any block of code that implements a complex part of a function should be commented.
● Compile and run. We reserve the right to test the code.

4. Marking Scheme (+ Submission Instructions)

4.1 Marking Scheme

The mark will be decided based on both the report (70% of the final mark) and the corresponding code (30% of the final mark).
In particular, we will mark based on the following scheme (REPORT: 70%, CORRESPONDING CODE: 30%):
● Abstract — report: 5%
● Introduction — report: 5%
● Data Description — report: 15%
● Data Acquisition — report: 15%; code: stock prices over time 10%, external data* 10%
● Data Storage — report: 10%; code: data locally stored on your computer in a tidy format of your choice (csv, pkl, numpy files) 15%, a cloud-based database 5%
● Data Preprocessing — report: 15%; code: data cleaning 7%, data visualization 7%, data transformation 6%
● Data Exploration — report: 20%; code: EDA 10%, hypothesis testing 10%
● Data Inference — report: 10%; code: development of a model using stocks 5%, development of a model using stocks and other data sources 10%, evaluation metrics implementation 5%
● Conclusion — report: 5%

* You are expected to find at least one extra data source to earn the 10%. You can additionally earn 5% for each extra dataset you acquire, store, preprocess, explore and use in the second model of the inference step.

4.2 Submission

You will be asked to submit both a compressed file containing the code and report, and the report in PDF format only. The latter submission is needed to check for plagiarism.

Compressed file submission
Create a folder called \code and put all of your .ipynb or .py files in it. If you have chosen to work with Python files, make sure that your runnable file is called 'main.py'. Create another folder called \DAPS_assignment and add the report in PDF format as well as the \code folder. Compress \DAPS_assignment into one single file: this is your submission file. Submit this file under the "ELEC0136 Assignment (Compressed) Submission" item on Moodle.

Report submission
Submit only the report in PDF format. Submit this file under the "ELEC0136 Assignment (Report) Submission" item on Moodle.