ELEC0136 Data Acquisition and Processing Systems 20/21 Assignment

Objective summary

You will be asked to solve a real-life data acquisition and processing problem. You will need to design solutions for each step of the way, from finding the data, to acquiring and storing it, to cleaning and preprocessing it, to finally analysing it. You should use the methods we have seen in the lectures and in the labs, such as acquiring data via APIs, storing it in data structures or databases, visualisation techniques, standard statistical analysis to clean and describe the data, etc. You are also free to use any additional methods you find well suited to the problem. You will be asked to write a report (in the form of an academic paper) in which you justify your methodology choices. You are allowed to discuss ideas with peers, but your code, experiments and report must be based solely on your own work.

1. Problem Description

Assume you are a junior Data Scientist in an investment company and your project manager provides you with a list of the following public companies:
● Apple Inc. (AAPL)
● Microsoft Corp. (MSFT)
● American Airlines Group Inc. (AAL)
● Zoom Video Communications Inc. (ZM)
Your task is to assess the financial potential of the company of your choice. In particular, you have been instructed to select one of the listed companies and decide when and whether the firm you work for should buy its stock. Your project manager was kind enough to advise you on the ML process you should follow for this challenge:
1. Select a company and acquire stock data from the beginning of the fiscal year 2017 up to now. Decide on and collect any other data on events (e.g., climate changes, a pandemic, seasons, etc.) that might have an impact on the stock.
2. Propose a way to store this type of data so it can be easily accessed by the ML processes you will develop next.
3. Preprocess your data.
4. Visualize and explore patterns in your data. If needed, repeat step 3.
5.
Train a model to predict the closing stock price.

The technical objectives of the new project that has been assigned to you are described in Section 2.

2. Tasks Description

2.1 Data Acquisition

You will first have to acquire the necessary data for conducting your study. One essential type of data that you will need is the stock price of each company from April 2017 until now, as described in Section 1. Since these companies are public, the data is made available online. Your first task is to search for and collect this data, finding the best way to access and download it. A good place to look is platforms that provide free stock-market data, such as Google Finance or Yahoo! Finance. Bear in mind that there are many other valuable sources of information for analysing the stock market. In addition to time series data depicting the evolution of the stock price over time, you can also use other sources such as:
a) Social media, e.g., Twitter: this can be used to uncover the public's sentiment towards the stock market.
b) Financial reports: these can help explain which factors are likely to affect the stock market the most.
c) News: this can be used to draw links between current affairs and the stock market.
d) Climate data: sometimes weather data is directly correlated with some companies' stock prices and should therefore be taken into account in financial analysis.
e) Others: anything that can justifiably support your analysis.

2.2 Data Storage

Once you have collected the relevant data, you need to decide on the best way of storing it for easy and efficient access throughout the study. You are expected to store the data locally on your computer in the format of your choice (csv, pkl, numpy files). The data should be organized/structured according to the type of information gathered, so it can be easily understood and accessed efficiently. Additionally, you can set up your own cloud database to be accessed via an API.
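As a minimal sketch of the local-storage options above (csv and pkl), assuming a pandas workflow: synthetic closing prices stand in for data that would in practice be downloaded from, e.g., Yahoo! Finance, and the ticker and file names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: business days with a random-walk closing price,
# standing in for real AAPL prices downloaded from a finance platform.
dates = pd.bdate_range("2017-04-03", periods=100)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 150 + rng.normal(0, 1, size=len(dates)).cumsum()},
    index=dates,
)
prices.index.name = "Date"

# Store in two of the suggested formats: csv (human-readable, portable)
# and pkl (preserves dtypes and the DatetimeIndex exactly).
prices.to_csv("AAPL_close.csv")
prices.to_pickle("AAPL_close.pkl")

# Reloading: csv needs the index parsed back into dates; pkl does not.
from_csv = pd.read_csv("AAPL_close.csv", index_col="Date", parse_dates=True)
from_pkl = pd.read_pickle("AAPL_close.pkl")
assert from_csv.index.equals(prices.index)
assert from_pkl.equals(prices)
```

The same idea extends to one file per ticker and data source (e.g., a `stocks/` folder and a `news/` folder), which keeps the data organised by the type of information gathered.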
A cloud-based database is not mandatory but will earn you extra marks on the assignment (see Section 4 for the marking criteria).

2.3 Data Preprocessing

Now that you have the data stored, you can start preprocessing it. Think about which features to keep and which ones to transform, combine or discard. Make sure your data is clean and consistent (e.g., are there many outliers? Any missing values?). You are expected to (1) clean, (2) visualize and (3) transform your data (e.g., using normalization, dimensionality reduction, etc.).

2.4 Data Exploration

After ensuring that the data is well preprocessed, it is time to start exploring it to form hypotheses and build intuition about possible patterns that might be inferred. Depending on the data, different EDA techniques can be applied, and a large amount of information can be extracted. For example, you could do the following analysis:
● Time series data is normally a combination of several components:
○ Trend represents the overall tendency of the data to increase or decrease over time.
○ Seasonality is related to the presence of recurrent patterns that appear after regular intervals (like seasons).
○ Random noise is often hard to explain and represents all those changes in the data that seem unexpected. Sometimes sudden changes are related to fixed or predictable events (e.g., public holidays).
● Feature correlation provides additional insight into the data structure. Scatter plots and boxplots are useful tools for spotting relevant information.
● Explain unusual behaviour.
● Explore the correlation between the stock price data and other external data that you can collect (as listed in Section 2.1).
● Use hypothesis testing to better understand the composition of your dataset and its representativeness.
At the end of this step, provide key insights on the data. This data exploration procedure should inform the subsequent data analysis/inference procedure, allowing one to establish a predictive relationship between variables.
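To illustrate the trend/seasonality/noise decomposition above, here is a minimal sketch using only pandas and numpy on a synthetic daily series; a library routine such as statsmodels' `seasonal_decompose` would be the more standard choice, and the series, window length and weekly period here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: linear trend + weekly seasonality + noise.
days = pd.date_range("2017-04-01", periods=365, freq="D")
t = np.arange(len(days))
rng = np.random.default_rng(1)
series = pd.Series(
    0.05 * t + 2 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, len(t)),
    index=days,
)

# Trend: centred rolling mean over one seasonal period (7 days).
trend = series.rolling(window=7, center=True).mean()

# Seasonality: average detrended value for each day of the week.
detrended = series - trend
seasonal = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Residual (random noise): whatever trend and seasonality do not explain.
residual = series - trend - seasonal

# The residual should carry much less variation than the raw series.
assert residual.std() < series.std()
```

Plotting `series`, `trend`, `seasonal` and `residual` on stacked axes is a quick way to check visually whether the chosen seasonal period actually matches the data.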
2.5 Data Inference

You are expected to train a model to predict the closing stock price on each day, using the data you have already collected, stored, preprocessed and explored in the previous steps. The data must span April 2017 to April 2020. You should develop two separate models:
1. A model that predicts the closing stock price on each day over a 1-month time window (until the end of May 2020), using only the company's stock data.
2. A model that predicts the closing stock price on each day over a 1-month time window (until the end of May 2020), using the company's stock data and external data.
Which model performs better? How do you measure performance, and why? How could you further improve the performance? Are the models capable of predicting closing stock prices far into the future?

[IMPORTANT NOTE] For these tasks, you are not expected to compare model architectures, but to examine and analyse the differences when training the same model with multiple data attributes and information sources. Therefore, you should decide on a single model suitable for time series data to solve the tasks described above. Please see the lecture slides for tips on model selection, and feel free to experiment before selecting one.

The following would help you evaluate your approach and pinpoint any mistakes or wrong choices you made in this or previous phases:
1. Evaluate the performance of your model using metrics such as Mean Squared Error, Mean Absolute Error or R-squared.
2. Use ARIMA and Facebook Prophet to explore the uncertainty in your model's predicted values by employing confidence bands.
3. Result visualization: create joint plots showing marginal distributions to understand the correlation between actual and predicted values.
4. Finding the mean, median and skewness of the residual distribution might provide additional insight into the predictive capability of the model.

3. Deliverables

3.1.
Report

The report should be written in the form of an academic paper using the ICML format (https://icml.cc/Conferences/2020/StyleAuthorInstructions). The paper should be at most 8 pages long excluding references, with an additional maximum of 2 pages for references. The paper must include the following sections:
● Abstract. This section should be a short paragraph (4-5 sentences) that provides a brief overview of the methodology and results presented in the report.
● Introduction. This section describes the problem, with an emphasis on the motivation and the end goal.
● Data description. This section details the data that was used for this study. For each data set, it should clearly describe the content, size and format of the data. The reason for selecting each data set should also be provided in this section.
● Data acquisition. This section presents the data acquisition process, explaining how each data set was acquired and why you chose the specific acquisition method.
● Data storage. This section explains and justifies your data storage strategies.
● Data preprocessing. This section should describe in detail all the preprocessing steps that were applied to the data. A justification for each step should also be provided. In case no or very little preprocessing was done, this section should clearly justify why.
● Data exploration. This section should summarize any data exploration task you resorted to in order to find particular patterns within the data.
● Data inference. This section should first describe the inference problem, then explain and justify the methodology used to approach the problem, and finally present the results.
● Conclusion. This last section summarises the findings, highlights any challenges or limitations that were encountered during the study and provides directions for potential improvements.
Please make sure you complement your discussion in each section with relevant equations, diagrams or figures as you see fit.

3.2.
Code

In addition to the report, you should also provide all the code that was used for your study. You can choose between one of these two options for sharing your code:
● Python notebook(s). If you wish to conduct all of your preprocessing, analysis and experiments in Python notebook(s), you are welcome to do so. You should make sure that the notebook(s) are easily readable. Divide your code into cells that match the report sections. You should also make use of text cells to accompany your code.
● Python files. If you prefer working with Python files, you are welcome to do so. Make sure to document your code well in the form of comments. You should provide one main file that contains the runnable script used to produce all the analysis and results presented in your report (anyone should be able to reproduce all your results by running the main file only). Optionally, you can add a README text file that details instructions on how to run your code.
Either way, the code you submit must:
● Be readable and well documented. Each class and function should be accompanied by comments describing its use. Additionally, any block of code that implements a complex part of a function should be commented.
● Compile and run. We reserve the right to test the code.

4. Marking Scheme (+ Submission Instructions)

4.1 Marking Scheme

The mark will be decided based on both the report (70% of the final mark) and the corresponding code (30% of the final mark).
In particular, we will mark based on the following scheme (REPORT: 70%, CORRESPONDING CODE: 30%):
● Abstract — report: 5%
● Introduction — report: 5%
● Data Description — report: 15%
● Data Acquisition — report: 15%; code: stock prices over time 10%, external data* 10%
● Data Storage — report: 10%; code: data locally stored on your computer in a tidy format of your choice (csv, pkl, numpy files) 15%, a cloud-based database 5%
● Data Preprocessing — report: 15%; code: data cleaning 7%, data visualization 7%, data transformation 6%
● Data Exploration — report: 20%; code: EDA 10%, hypothesis testing 10%
● Data Inference — report: 10%; code: development of a model using stocks 5%, development of a model using stocks and other data sources 10%, evaluation metrics implementation 5%
● Conclusion — report: 5%

* You are expected to find at least one extra data source to earn the 10%. You can additionally earn 5% for each extra dataset you acquire, store, preprocess, explore and use in the second model of the inference step.

4.2 Submission

You will be asked to submit both a compressed file containing the code and report, and the report in PDF format only. The latter submission is needed to check for plagiarism.

Compressed file submission
Create a folder called \code and put all of your .ipynb or .py files in it. If you have chosen to work with Python files, make sure that your runnable file is called 'main.py'. Create another folder called \DAPS_assignment and add the report in PDF format as well as the \code folder. Compress \DAPS_assignment into one single file: this is your submission file. Submit this file under the "ELEC0136 Assignment (Compressed) Submission" item on Moodle.

Report submission
Submit only the report in PDF format. Submit this file under the "ELEC0136 Assignment (Report) Submission" item on Moodle.