Assignment 2: Developing your Data Pipelines
Due Sep 9 by 3am
Points 100
Submitting a website url
PREREQUISITES: Review the Data Engineering, Feature Engineering and Dataset Design lectures,
and the numpy and scikit-learn package tutorials.
OBJECTIVES: Based on the insights you discovered during Assignment 1, implement the
raw_data_handler, feature_extractor and dataset_design modules. Place all files in your provisioned
repository under the directory securebank/ (e.g., securebank/modules/raw_data_handler.py). All saved
artifacts must be written to a directory called securebank/storage/ (e.g.,
securebank/storage/raw_data/). NOTE: Do not add data
Task 1: In a python module called modules/raw_data_handler.py, write a Raw_Data_Handler Class that
is responsible for extracting data from the data sources (customers.csv, transactions.parquet,
fraud_information.json) and transforming noisy data into clean, usable data for general machine learning
purposes (i.e., not only for our fraud detection use case).
This class will have at least FOUR methods:
extract() reads the data sources
arguments:
customer_information_filename: str (e.g. customers.csv)
transaction_filename: str (e.g. transactions.parquet)
fraud_information_filename: str (e.g., fraud_information.json)
returns:
customer_information: pandas.DataFrame
transaction_information: pandas.DataFrame
fraud_information: pandas.DataFrame
transform() merges, standardizes, and cleans columns and rows, etc. from the three data sources.
arguments:
customer_information: pandas.DataFrame
transaction_information: pandas.DataFrame
fraud_information: pandas.DataFrame
returns:
raw_data: pandas.DataFrame.
describe() computes the significant quality metrics of the transformed dataset.
arguments:
2024/9/19 20:44Assignment 2: Developing your Data Pipelines
https://jhu.instructure.com/courses/82966/assignments/877009?return_to=https%3A%2F%2Fjhu.instructure.com%2Fcalendar%23view_name%3Dmo...1/4
*args, **kwargs
returns:
description: Dict which is structured in this manner:
{
`version`: version_name: str,
`storage`: storage_path: str,
`description`: dictionary_of_important_dataset_description: Dict()
}
load() saves data into storage in a parquet format.
arguments:
output_filename: str
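The four methods above could be laid out roughly as in the sketch below. It is a minimal illustration, not a reference solution: the join keys (cc_num, trans_num), the is_fraud column, the table-like layout of the JSON file, the constructor arguments, and the metrics chosen for describe() are all assumptions that the assignment does not specify, so adapt them to what you found in Assignment 1.

```python
from pathlib import Path
from typing import Dict, Tuple

import pandas as pd


class Raw_Data_Handler:
    """Sketch of the extract/transform/describe/load interface from Task 1."""

    def __init__(self, storage_dir: str = "securebank/storage/raw_data", version: str = "v1"):
        # storage_dir and version are hypothetical constructor arguments.
        self.storage_dir = Path(storage_dir)
        self.version = version
        self.raw_data = None

    def extract(self, customer_information_filename: str, transaction_filename: str,
                fraud_information_filename: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        customer_information = pd.read_csv(customer_information_filename)
        transaction_information = pd.read_parquet(transaction_filename)
        # Assumption: the JSON file is table-like (e.g., a list of records).
        fraud_information = pd.read_json(fraud_information_filename)
        return customer_information, transaction_information, fraud_information

    def transform(self, customer_information: pd.DataFrame, transaction_information: pd.DataFrame,
                  fraud_information: pd.DataFrame, customer_key: str = "cc_num",
                  transaction_key: str = "trans_num") -> pd.DataFrame:
        # Left joins keep every transaction, even unlabeled or unmatched ones.
        raw = transaction_information.merge(fraud_information, on=transaction_key, how="left")
        raw = raw.merge(customer_information, on=customer_key, how="left")
        raw = raw.drop_duplicates().reset_index(drop=True)
        # Standardize column names so downstream modules can rely on them.
        raw.columns = [str(c).strip().lower() for c in raw.columns]
        self.raw_data = raw
        return raw

    def describe(self, *args, **kwargs) -> Dict:
        if self.raw_data is None:
            raise ValueError("call transform() before describe()")
        data = self.raw_data
        return {
            "version": self.version,
            "storage": str(self.storage_dir),
            "description": {
                "n_rows": int(len(data)),
                "n_columns": int(data.shape[1]),
                "n_duplicate_rows": int(data.duplicated().sum()),
                "missing_fraction": {c: float(v) for c, v in data.isna().mean().items()},
            },
        }

    def load(self, output_filename: str) -> None:
        # Writing parquet needs a parquet engine such as pyarrow installed.
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        self.raw_data.to_parquet(self.storage_dir / output_filename)
```

Using left joins from the transaction table is one possible design choice: it preserves unlabeled transactions so that the decision to drop or keep them can be made later, in the dataset design step.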
Task 2: In a python module called modules/dataset_design.py, write a Dataset_Designer Class that is
responsible for partitioning the data.
This class will have at least FOUR methods:
extract() reads the parquet raw data file
arguments:
raw_dataset_filename: str
returns:
raw_dataset: pandas.DataFrame
sample() partitions the data into training dataset, test dataset, etc.
arguments:
raw_dataset: pandas.DataFrame
returns:
partitioned_data: List[pandas.DataFrame,]
describe() computes the significant quality metrics of the transformed dataset.
arguments:
*args, **kwargs
returns:
description: Dict which is structured in this manner:
{
`version`: version_name: str,
`storage`: storage_path: str,
`description`: dictionary_of_important_dataset_description: Dict()
}
load() saves data into storage in a parquet format.
arguments:
output_filename: str
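One way the Dataset_Designer interface might look is sketched below. The stratified split on a binary label is only one possible partitioning strategy (a time-based split is another), and the is_fraud label column, the test_fraction and seed parameters, the constructor arguments, and the per-partition file naming in load() are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List

import pandas as pd


class Dataset_Designer:
    """Sketch of the extract/sample/describe/load interface from Task 2."""

    def __init__(self, storage_dir: str = "securebank/storage/datasets", version: str = "v1"):
        # storage_dir and version are hypothetical constructor arguments.
        self.storage_dir = Path(storage_dir)
        self.version = version
        self.label_column = None
        self.partitions: Dict[str, pd.DataFrame] = {}

    def extract(self, raw_dataset_filename: str) -> pd.DataFrame:
        return pd.read_parquet(raw_dataset_filename)

    def sample(self, raw_dataset: pd.DataFrame, label_column: str = "is_fraud",
               test_fraction: float = 0.2, seed: int = 0) -> List[pd.DataFrame]:
        # A unique index is needed so that test rows can be dropped from train.
        raw_dataset = raw_dataset.reset_index(drop=True)
        # Stratified split: draw the same fraction from each label value so the
        # rare class keeps a similar share in both partitions.
        # Rows with a missing label are not sampled and therefore stay in train.
        test = raw_dataset.groupby(label_column).sample(frac=test_fraction, random_state=seed)
        train = raw_dataset.drop(index=test.index)
        self.label_column = label_column
        self.partitions = {"train": train, "test": test}
        return [train, test]

    def describe(self, *args, **kwargs) -> Dict:
        # Assumes a 0/1 label, so the mean is the positive-class rate.
        summary = {
            name: {
                "n_rows": int(len(part)),
                "positive_rate": float(part[self.label_column].mean()),
            }
            for name, part in self.partitions.items()
        }
        return {"version": self.version, "storage": str(self.storage_dir), "description": summary}

    def load(self, output_filename: str) -> None:
        # Assumption: one file per partition, prefixed with the partition name.
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        for name, part in self.partitions.items():
            part.to_parquet(self.storage_dir / f"{name}_{output_filename}")
```

Reporting the positive rate per partition in describe() makes it easy to confirm that the split preserved the class balance.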
Task 3: In a python module called modules/feature_extractor.py, write a Feature_Extractor Class that is
responsible for extracting and formatting features from the data produced by your Dataset_Designer
module. Your Feature_Extractor class will be used for developing your fraud detection models.
This class will have at least THREE methods:
extract() reads the data provided
arguments:
training_dataset_filename: str
testing_dataset_filename: str
etc.
returns:
training_dataset: pandas.DataFrame
testing_dataset: pandas.DataFrame
etc.
transform() converts the dataset into features useful for training.
arguments:
training_dataset: pandas.DataFrame
testing_dataset: pandas.DataFrame
etc.
returns:
partitioned_data: List[pandas.DataFrame,]
describe() computes the significant quality metrics of the transformed dataset.
arguments:
*args, **kwargs
returns:
description: Dict which is structured in this manner:
{
`version`: version_name: str,
`storage`: storage_path: str,
`description`: dictionary_of_important_dataset_description: Dict()
}
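A possible shape for the Feature_Extractor is sketched below. The feature choice (log-scaled, standardized numeric columns) is purely illustrative, and the column names amt and is_fraud, along with the constructor arguments, are assumptions. The point the sketch tries to show is that any scaling statistics are computed on the training partition only and then reused on the test partition, which avoids leaking test information into training.

```python
from typing import Dict, List, Sequence, Tuple

import numpy as np
import pandas as pd


class Feature_Extractor:
    """Sketch of the extract/transform/describe interface from Task 3."""

    def __init__(self, numeric_columns: Sequence[str] = ("amt",), label_column: str = "is_fraud",
                 storage_dir: str = "securebank/storage/features", version: str = "v1"):
        # All constructor arguments are hypothetical.
        self.numeric_columns = list(numeric_columns)
        self.label_column = label_column
        self.storage_dir = storage_dir
        self.version = version
        self.features: Dict[str, pd.DataFrame] = {}

    def extract(self, training_dataset_filename: str,
                testing_dataset_filename: str) -> Tuple[pd.DataFrame, pd.DataFrame]:
        return pd.read_parquet(training_dataset_filename), pd.read_parquet(testing_dataset_filename)

    def transform(self, training_dataset: pd.DataFrame,
                  testing_dataset: pd.DataFrame) -> List[pd.DataFrame]:
        cols = self.numeric_columns
        # log1p compresses heavy-tailed values; assumes they are non-negative.
        train_x = np.log1p(training_dataset[cols].astype(float))
        test_x = np.log1p(testing_dataset[cols].astype(float))
        # Fit the scaling on the training partition only.
        mean = train_x.mean()
        std = train_x.std(ddof=0).replace(0.0, 1.0)  # guard constant columns
        train_f = (train_x - mean) / std
        test_f = (test_x - mean) / std
        # Carry the label along so each partition is ready for model training.
        train_f[self.label_column] = training_dataset[self.label_column].to_numpy()
        test_f[self.label_column] = testing_dataset[self.label_column].to_numpy()
        self.features = {"train": train_f, "test": test_f}
        return [train_f, test_f]

    def describe(self, *args, **kwargs) -> Dict:
        return {
            "version": self.version,
            "storage": self.storage_dir,
            "description": {
                "feature_names": list(self.numeric_columns),
                "n_rows": {name: int(len(part)) for name, part in self.features.items()},
            },
        }
```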
Task 4: In a markdown file called Data_Pipeline_Design.md, explain the design decisions you used for
each of the three modules, and argue why you made these decisions (e.g., what is it about the data, the
nature of the problem, etc. that made you decide on a certain design?). Please use proper formatting
(i.e., headers, etc.) for easy readability.
SUBMISSION: You will need to check in the following four files and any supporting python modules:
securebank/modules/raw_data_handler.py
securebank/modules/dataset_design.py
securebank/modules/feature_extractor.py
securebank/Data_Pipeline_Design.md
Provide the GitHub URL link to this markdown file via Canvas to get credit for this submission.
Assignment 2: Developing your Data Pipelines
Criteria    Pts
Task 1      25 pts
Task 2      25 pts
Task 3      25 pts
Task 4      25 pts
Total Points: 100