Assignment 2: Developing your Data Pipelines

Due Sep 9 by 3am

Points 100

Submitting a website url

PREREQUISITES: Review the Data Engineering, Feature Engineering and Dataset Design lectures, and the numpy and scikit-learn package tutorials.

OBJECTIVES: Based on the insights you discovered during Assignment 1, implement the raw_data_handler, feature_extractor and dataset_design modules. Place all files in your provisioned repository under the directory securebank/ (e.g., securebank/modules/raw_data_handler.py). All saved artifacts must be written to a directory called securebank/storage/ (e.g., securebank/storage/raw_data/). NOTE: Do not add data

Task 1: In a Python module called modules/raw_data_handler.py, write a Raw_Data_Handler class that is responsible for extracting data from the data sources (customers.csv, transactions.parquet, fraud_information.json) and transforming noisy data into clean, usable data for general machine learning purposes (i.e., not only for our fraud detection use case).

This class will have at least FOUR methods:

extract() reads the data sources

arguments:

customer_information_filename: str (e.g. customers.csv)

transaction_filename: str (e.g. transactions.parquet)

fraud_information_filename: str (e.g., fraud_information.json)

returns:

customer_information: pandas.DataFrame

transaction_information: pandas.DataFrame

fraud_information: pandas.DataFrame
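One way extract() could be approached is sketched below, written as free functions for brevity. The helper name read_table is hypothetical, and the sketch assumes fraud_information.json has a layout that pandas can parse directly (for example, a list of records), which the assignment does not state.

```python
from pathlib import Path
from typing import Tuple

import pandas as pd


def read_table(path: str) -> pd.DataFrame:
    """Read one data source, choosing the pandas reader from the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".parquet":
        # Needs a parquet engine (pyarrow or fastparquet) to be installed.
        return pd.read_parquet(path)
    if suffix == ".json":
        # Assumption: the file is directly parseable by pandas. A nested file
        # would need json.load plus pd.json_normalize instead.
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


def extract(
    customer_information_filename: str,
    transaction_filename: str,
    fraud_information_filename: str,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Return the three sources as DataFrames, in the order given in the spec."""
    return (
        read_table(customer_information_filename),
        read_table(transaction_filename),
        read_table(fraud_information_filename),
    )
```

Dispatching on the suffix keeps the three readers in one place, so adding a new source format later only touches read_table.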

transform() merges, standardizes, and cleans columns and rows, etc. from the three data sources.

arguments:

customer_information: pandas.DataFrame

transaction_information: pandas.DataFrame

fraud_information: pandas.DataFrame

returns:

raw_data: pandas.DataFrame.
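The merge-and-clean step might look like the following minimal sketch. The join keys cc_num and trans_num are hypothetical placeholders, since the assignment does not name the actual columns; the cleaning shown (lower-casing column names, dropping duplicate transactions) is only an illustration of the kind of standardization meant, not a complete recipe.

```python
import pandas as pd


def transform(
    customer_information: pd.DataFrame,
    transaction_information: pd.DataFrame,
    fraud_information: pd.DataFrame,
    customer_key: str = "cc_num",        # hypothetical join key
    transaction_key: str = "trans_num",  # hypothetical join key
) -> pd.DataFrame:
    """Merge the three sources into one table with one row per transaction."""
    frames = []
    for frame in (customer_information, transaction_information, fraud_information):
        frame = frame.copy()
        # Standardize column names so the keys match across sources.
        frame.columns = [str(c).strip().lower() for c in frame.columns]
        frames.append(frame)
    customers, transactions, fraud = frames

    # Left joins keep every transaction, even without a customer or label match.
    raw_data = transactions.merge(customers, on=customer_key, how="left")
    raw_data = raw_data.merge(fraud, on=transaction_key, how="left")

    # Drop repeated transactions, keeping the first occurrence.
    raw_data = raw_data.drop_duplicates(subset=transaction_key)
    return raw_data.reset_index(drop=True)
```

Using left joins from the transaction table means unlabeled transactions survive with a missing label, which leaves the decision about how to treat them to later stages.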

describe() computes the significant quality metrics of the transformed dataset.

arguments:

2024/9/19 20:44Assignment 2: Developing your Data Pipelines

https://jhu.instructure.com/courses/82966/assignments/877009?return_to=https%3A%2F%2Fjhu.instructure.com%2Fcalendar%23view_name%3Dmo...1/4

*args, **kwargs

returns:

description: Dict which is structured in this manner:

{
    `version`: version_name: str,
    `storage`: storage_path: str,
    `description`: dictionary_of_important_dataset_description: Dict()
}
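As an illustration of the dictionary structure above, here is a small sketch of a describe() helper. The specific metrics (row count, duplicate rows, missing fraction per column) are assumptions about what might count as "significant quality metrics"; the assignment leaves that choice to you. The explicit version and storage_path parameters are also hypothetical stand-ins for whatever is passed through *args, **kwargs.

```python
from typing import Any, Dict

import pandas as pd


def describe(dataset: pd.DataFrame, version: str, storage_path: str) -> Dict[str, Any]:
    """Build a description dictionary in the version/storage/description layout."""
    quality = {
        "n_rows": int(len(dataset)),
        "n_columns": int(dataset.shape[1]),
        "n_duplicate_rows": int(dataset.duplicated().sum()),
        # Fraction of missing values per column, as plain Python floats.
        "missing_fraction": {
            str(col): float(dataset[col].isna().mean()) for col in dataset.columns
        },
    }
    return {"version": version, "storage": storage_path, "description": quality}
```

Casting to int and float keeps the result free of numpy scalar types, so it can be serialized with the standard json module if you decide to store it.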

load() saves data into storage in a parquet format.

arguments:

output_filename: str
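A possible shape for load() is sketched below. The helper resolve_output_path and the storage_dir parameter are hypothetical; the only facts taken from the assignment are the securebank/storage/ directory and the parquet format. Writing parquet with pandas requires an engine such as pyarrow or fastparquet to be installed.

```python
from pathlib import Path

import pandas as pd

STORAGE_DIR = Path("securebank/storage")  # directory named in the assignment


def resolve_output_path(output_filename: str, storage_dir=STORAGE_DIR) -> Path:
    """Return the target path under storage, creating parent directories."""
    path = Path(storage_dir) / output_filename
    if path.suffix != ".parquet":
        # Append rather than replace, so "raw.v1" becomes "raw.v1.parquet".
        path = path.with_name(path.name + ".parquet")
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


def load(raw_data: pd.DataFrame, output_filename: str, storage_dir=STORAGE_DIR) -> Path:
    """Save the DataFrame as parquet and return where it was written."""
    path = resolve_output_path(output_filename, storage_dir)
    raw_data.to_parquet(path, index=False)
    return path
```

Returning the written path is a design choice, not a requirement; it makes it easy to record the location in the `storage` field produced by describe().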

Task 2: In a Python module called modules/dataset_design.py, write a Dataset_Designer class that is responsible for partitioning the data.

This class will have at least FOUR methods:

extract() reads the parquet raw data file

arguments:

raw_dataset_filename: str

returns:

raw_dataset: pandas.DataFrame

sample() partitions the data into training dataset, test dataset, etc.

arguments:

raw_dataset: pandas.DataFrame

returns:

partitioned_data: List[pandas.DataFrame,]
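One option for sample() is a stratified random split, sketched here with pandas only. The label column name is_fraud, the 80/20 ratio, and the choice of a random rather than time-based split are all assumptions made for illustration; your own analysis from Assignment 1 should drive the real design.

```python
from typing import List

import pandas as pd


def sample(
    raw_dataset: pd.DataFrame,
    label_column: str = "is_fraud",  # hypothetical label column
    test_fraction: float = 0.2,
    seed: int = 0,
) -> List[pd.DataFrame]:
    """Split into [training, test], keeping the label ratio similar in both.

    Assumes the DataFrame index is unique, since rows are removed by index.
    """
    # Draw the same fraction from every label group (stratified sampling).
    test = raw_dataset.groupby(label_column).sample(
        frac=test_fraction, random_state=seed
    )
    # Everything not drawn for the test set becomes training data.
    train = raw_dataset.drop(index=test.index)
    return [train, test]
```

Stratifying on the label keeps rare classes represented in both partitions, which matters when one class is much smaller than the other. Rows with a missing label are skipped by groupby and therefore end up in the training partition, so it may be worth handling them before splitting.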

describe() computes the significant quality metrics of the transformed dataset.

arguments:

*args, **kwargs

returns:

description: Dict which is structured in this manner:

{
    `version`: version_name: str,
    `storage`: storage_path: str,
    `description`: dictionary_of_important_dataset_description: Dict()
}

load() saves data into storage in a parquet format.

arguments:

output_filename: str

Task 3: In a Python module called modules/feature_extractor.py, write a Feature_Extractor class that is responsible for extracting and formatting features from the data produced by your Dataset_Designer module. Your Feature_Extractor class will be used for developing your fraud detection models.

This class will have at least THREE methods:

extract() reads the data provided

arguments:

training_dataset_filename: str

testing_dataset_filename: str

etc.

returns:

training_dataset: pandas.DataFrame

testing_dataset: pandas.DataFrame

etc.

transform() converts the datasets into features useful for training.

arguments:

training_dataset: pandas.DataFrame

testing_dataset: pandas.DataFrame

etc.

returns:

partitioned_data: List[pandas.DataFrame,]
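A point worth illustrating for transform() is that any statistics used to build features should come from the training partition only and then be applied unchanged to the test partition. The sketch below shows this with simple standardization; the numeric_columns parameter is hypothetical, and scaling is just one example of a feature transformation, not a prescribed one.

```python
from typing import List

import pandas as pd


def transform(
    training_dataset: pd.DataFrame,
    testing_dataset: pd.DataFrame,
    numeric_columns: List[str],  # hypothetical: columns chosen for scaling
) -> List[pd.DataFrame]:
    """Standardize numeric columns using training-set statistics only."""
    mean = training_dataset[numeric_columns].mean()
    # Population standard deviation; constant columns get a divisor of 1.
    std = training_dataset[numeric_columns].std(ddof=0).replace(0, 1.0)

    partitioned_data = []
    for dataset in (training_dataset, testing_dataset):
        features = dataset.copy()
        # The same training mean and std are applied to both partitions.
        features[numeric_columns] = (dataset[numeric_columns] - mean) / std
        partitioned_data.append(features)
    return partitioned_data
```

Computing the mean and standard deviation on the training data alone avoids leaking information from the test set into the features.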

describe() computes the significant quality metrics of the transformed dataset.

arguments:

*args, **kwargs

returns:

description: Dict which is structured in this manner:

{
    `version`: version_name: str,
    `storage`: storage_path: str,
    `description`: dictionary_of_important_dataset_description: Dict()
}

Task 4: In a markdown file called Data_Pipeline_Design.md, explain the design decisions you made for each of the three modules, and argue why you made them (e.g., what is it about the data, the nature of the problem, etc. that led you to a certain design?). Please use proper formatting (i.e., headers, etc.) for easy readability.

SUBMISSION: You will need to check in the following four files and any supporting Python modules:

securebank/modules/raw_data_handler.py
securebank/modules/dataset_design.py
securebank/modules/feature_extractor.py
securebank/Data_Pipeline_Design.md

Provide the GitHub URL of this markdown file via Canvas to get credit for this submission.

Criteria    Pts
Task 1      25 pts
Task 2      25 pts
Task 3      25 pts
Task 4      25 pts

Total Points: 100
