FIT5196-S2-2024 Assessment 2

This is a group assessment and is worth 40% of your total mark for FIT5196.

Due date: Friday 18 October 2024, 11:55 pm

Task 1: Data Cleansing (50%)

For this assessment, you are required to write Python code to analyse your dataset and to find and fix the problems in the data. The input and output of this task are shown below:

Table 1. The input and output of task 1

Input files:
● Group<group_id>_dirty_data.csv
● Group<group_id>_outlier_data.csv
● Group<group_id>_missing_data.csv
● warehouse.csv

Submission, output files:
● Group<group_id>_dirty_data_solution.csv
● Group<group_id>_outlier_data_solution.csv
● Group<group_id>_missing_data_solution.csv

Submission, other deliverables:
● Group<group_id>_ass2_task1.ipynb
● Group<group_id>_ass2_task1.py

Note 1: All files must be zipped into a file named Group<group_id>_ass2.zip (please use zip, not rar, 7z, tar, etc.).

Note 2: Replace <group_id> with your group id (do not include <>).

Note 3: You can find your three input files from the folder with your group number here. Using the wrong files will result in zero marks.

Note 4: Please strictly follow the instructions in the appendix to generate the .ipynb and .py files.

Exploring and understanding the data is one of the most important parts of the data wrangling process. You are required to perform graphical and/or non-graphical EDA methods to understand the data first and then find the data problems. In this assessment, you have been provided with three data inputs along with the additional file warehouse.csv here. Due to an unexpected scenario, a portion of the data is missing or contains anomalous values. Thus, before moving to the next step in data analysis, you are required to perform the following tasks:

1. Detect and fix errors in <group_id>_dirty_data.csv
2. Impute the missing values in <group_id>_missing_data.csv
3. Detect and remove outlier rows in <group_id>_outlier_data.csv
○ (w.r.t. the delivery_charges attribute only)
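The brief does not prescribe an outlier rule, so the following is only a minimal sketch of one common choice, Tukey's 1.5 × IQR fences, applied to a single numeric attribute. The function names and the sample values are hypothetical. Because delivery charges depend on other attributes, the same fences could equally be applied to residuals from a fitted model of the charges rather than to the raw values.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a sequence of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(rows, key="delivery_charges", k=1.5):
    """Keep only the rows whose `key` value lies inside the fences."""
    lower, upper = iqr_bounds([row[key] for row in rows], k)
    return [row for row in rows if lower <= row[key] <= upper]


# Made-up charges for illustration; the last value sits far from the rest.
sample = [{"delivery_charges": c}
          for c in (60.0, 62.5, 65.0, 67.5, 70.0, 72.5, 75.0, 200.0)]
kept = drop_outliers(sample)
```

The multiplier k is a judgment call; whichever rule is used, the EDA in the notebook should justify it.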

Project Background

As a starting point, here is what we know about the dataset in hand:

The dataset contains transactional retail data from an online electronics store (DigiCO) located in Melbourne, Australia¹. The store operation is exclusively online, and it has three warehouses around Melbourne from which goods are delivered to customers.

Each instance of the data represents a single order from the DigiCO store. The description of each data column is shown in Table 2.

Table 2. Description of the columns

COLUMN: DESCRIPTION
order_id: A unique id for each order
customer_id: A unique id for each customer
date: The date the order was made, given in YYYY-MM-DD format
nearest_warehouse: A string denoting the name of the nearest warehouse to the customer
shopping_cart: A list of tuples representing the order items: the first element of the tuple is the item ordered, and the second element is the quantity ordered for that item
order_price: A float denoting the order price in AUD. The order price is the price of items before any discounts and/or delivery charges are applied.
customer_lat: Latitude of the customer’s location
customer_long: Longitude of the customer’s location
coupon_discount: An integer denoting the percentage discount to be applied to the order_price.
distance_to_nearest_warehouse: A float representing the arc distance, in kilometres, between the customer and the nearest warehouse to him/her (radius of earth: 6378 km).
delivery_charges: A float representing the delivery charges of the order
order_total: A float denoting the total of the order in AUD after all discounts and/or delivery charges are applied.
season: A string denoting the season in which the order was placed. Refer to this link for details about how seasons are defined.
is_expedited_delivery: A boolean denoting whether the customer has requested an expedited delivery
latest_customer_review: A string representing the latest customer review on his/her most recent order
is_happy_customer: A boolean denoting whether the customer is a happy customer or had an issue with his/her last order.

¹ The dataset is fictional.
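Two of the column definitions above can be recomputed from other columns, which is one way to cross-check a row. The sketch below is a plain reading of those definitions, not an official reference implementation. The function names are hypothetical, and the order_total formula assumes the discount applies to order_price only, consistent with the note that delivery charges are never discounted.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6378  # radius stated in the column description


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle (arc) distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius * asin(sqrt(a))


def expected_order_total(order_price, coupon_discount, delivery_charges):
    """Apply the percentage discount to order_price, then add delivery charges."""
    return order_price * (100 - coupon_discount) / 100 + delivery_charges


# Made-up coordinates and amounts, for illustration only.
distance = haversine_km(-37.81, 144.96, -37.80, 144.97)
total = expected_order_total(order_price=100.0, coupon_discount=10, delivery_charges=5.0)
```

When comparing against the stored columns, allow for rounding, since the precision of the stored values is not stated in the brief.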

Notes:

1. The output csv files must have the exact same columns as the respective input files. Any misspelling or mismatch will lead to a malfunction of the auto-marker, which will in turn lead to losing marks.

2. In the file Group<group_id>_dirty_data.csv, any row can carry no more than one anomaly (i.e. there can only be up to one issue in a single row).

3. All anomalies in dirty data have one and only one possible fix.

4. There are no data anomalies in the file Group<group_id>_outlier_data.csv except for outliers. Similarly, there are only coverage data anomalies (i.e. no other data anomalies) in Group<group_id>_missing_data.csv.

5. The retail store has three different warehouses in Melbourne (see warehouse.csv for their locations).

6. The retail store focuses only on 10 branded items and sells them at competitive prices.

7. In order to get the item unit price, a useful Python package to solve multivariable equations is numpy.linalg.

8. The distance is calculated as Haversine distance (with radius of earth = 6378 km), like here.

9. The store has different business rules depending on the seasons to match the different demands of each season. For example, the delivery charge is calculated using a linear model which differs depending on the season. The model depends linearly (but in different ways for each season) on:
○ Distance between customer and nearest warehouse
○ Whether the customer wants an expedited delivery
○ Whether the customer was happy with his/her last purchase (if no previous purchase, it is assumed that the customer is happy)

10. It is recommended to use sklearn.linear_model.LinearRegression for solving the linear model as demonstrated in the tutorials.

11. Using proper data for model training is crucial to have a good linear model (i.e. R² score over 0.97 and very close to 1) to validate the delivery charges. The better your model is, the more accurate your result will be.

12. To check whether a customer is happy with their last order, the customer's latest review is classified using a sentiment analysis classifier. SentimentIntensityAnalyzer from nltk.sentiment.vader is used to obtain the polarity score. A sentiment is considered positive if it has a 'compound' polarity score of 0.05 or higher and is considered negative otherwise. Refer to this link for more details on how to use this module.

13. If the customer provided a coupon during purchase, the coupon discount percentage will be applied to the order price before adding the delivery charges (i.e. the delivery charges will never be discounted).

14. The below columns are error-free (i.e. don’t look for any errors in dirty data for them):
○ coupon_discount
○ delivery_charges
○ The ordered quantity values in the shopping_cart attribute
○ order_id
○ customer_id
○ latest_customer_review

15. For missing data imputation, you are recommended to try all possible methods to impute missing values and keep the most appropriate one that could provide the best performance.

16. As EDA is part of this assessment, no further information will be given publicly regarding the data. However, you can brainstorm with the teaching team during tutorials or on the Ed forum.

17. No libraries/packages restriction.
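Notes 7 and 9 to 11 both describe least-squares problems, so a minimal sketch of each is given together here. It assumes the shopping carts have already been turned into a quantity matrix with one column per item. All function names, item prices and coefficients are made up for illustration. numpy.linalg.lstsq is used for both fits to keep the sketch to one dependency; sklearn.linear_model.LinearRegression, which note 10 recommends, fits the same ordinary least-squares model.

```python
import numpy as np


def solve_unit_prices(quantities, order_prices):
    """Least-squares solution of quantities @ unit_prices = order_prices.

    quantities: (n_orders, n_items) matrix built from shopping_cart.
    order_prices: length n_orders vector of pre-discount order prices.
    """
    prices, *_ = np.linalg.lstsq(np.asarray(quantities, dtype=float),
                                 np.asarray(order_prices, dtype=float),
                                 rcond=None)
    return prices


def fit_delivery_model(distance, expedited, happy, charges):
    """Fit charges ~ b0 + b1*distance + b2*expedited + b3*happy.

    Returns (coefficients, r_squared). Call once per season subset.
    """
    X = np.column_stack([np.ones(len(charges)), distance, expedited, happy]).astype(float)
    y = np.asarray(charges, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    centred = y - y.mean()
    r_squared = 1 - (residual @ residual) / (centred @ centred)
    return coef, r_squared


# Hypothetical data: three items priced 100, 250 and 40 AUD.
quantities = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]
order_prices = [180, 290, 450, 390]
prices = solve_unit_prices(quantities, order_prices)

# Hypothetical season: charges = 50 + 2*distance + 15*expedited - 5*happy.
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
expedited = np.array([0, 1, 0, 1, 1, 0])
happy = np.array([1, 1, 0, 0, 1, 0])
charges = 50 + 2 * distance + 15 * expedited - 5 * happy
coef, r_squared = fit_delivery_model(distance, expedited, happy, charges)
```

In practice the rows used for fitting matter more than the solver: training on rows that still contain anomalies would pull the coefficients away from the true business rule, which is the point note 11 makes about the R² score.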

Methodology (10%)

The report <group_id>_ass2_task1.ipynb should demonstrate the methodology (including all steps) to achieve the correct results.

You need to demonstrate your solution using correct steps.

● Your solution should be presented in a proper way including all required steps.
● You need to select and use the appropriate Python functions for input, process and output.
● Your solution should be an efficient one without redundant operations and unnecessary reading and writing of the data.

Task 2: Data Reshaping (15%)

You need to complete task 2 with the suburb_info.xlsx file ONLY. With the given property and suburb related data, you need to study the effect of different normalisation/transformation methods (e.g. standardisation, min-max normalisation, log, power, Box-Cox transformation) on these columns: number_of_houses, number_of_units, population, aus_born_perc, median_income, median_house_price. You need to observe and explain their effect assuming we want to develop a linear model to predict the “median_house_price” using the other 5 attributes mentioned above.

When reshaping the data, we normally have two main criteria.

● First, we want our features to be on the same scale; and
● Second, we want our features to have as much linear relationship as possible with the target variable (i.e., median_house_price).

You need to first explore the data to see if any scaling or transformation is necessary (if yes, why? and if not, also why?) and then perform appropriate actions and document your results and observations. Please note that the aim of this task is to prepare the data for a linear regression model; it is not to build the linear model. That is, you need to record all your steps from loading the raw data to completing all the required transformations, if any.
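As a minimal sketch of how the two criteria can be examined, the block below implements three of the listed methods in plain Python and uses the Pearson correlation as a rough linearity check. The function names and the toy data are hypothetical and do not come from suburb_info.xlsx. The Box-Cox parameter is passed in by hand here, whereas a library routine would normally estimate it from the data.

```python
from math import log, sqrt
from statistics import mean, pstdev


def standardise(xs):
    """Z-score scaling: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Rescale to the [0, 1] interval."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def box_cox(xs, lam):
    """Box-Cox transform for strictly positive data; lam = 0 is the natural log."""
    return [log(x) if lam == 0 else (x ** lam - 1) / lam for x in xs]


def pearson(xs, ys):
    """Pearson correlation, used here as a rough linearity check."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# Toy feature that grows exponentially while the target grows linearly.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature = [2.0 ** t for t in target]
raw_r = pearson(feature, target)
log_r = pearson(box_cox(feature, 0), target)
scaled_r = pearson(standardise(feature), target)
```

Standardisation and min-max scaling are linear maps, so they address the first criterion (scale) but leave the correlation with the target unchanged; only a non-linear transformation such as log, power or Box-Cox can change the second.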

Input files: suburb_info.xlsx
Submission: Group<group_id>_ass2_task2.ipynb

You could consider the scenario of task 2 to be an open exploratory project: Jackie and Kiara have got some funding to do an exploratory consulting project on the property market. We wish to understand any interesting insights from the relevant features in different suburbs of Melbourne. Before we step into the final linear regression modelling stage, we wish to hire you to prepare the data for us and tell us whether any transformation/normalisation is required. Will those data satisfy the assumptions of linear regression? How could we make our data more suitable for the later modelling stage?

As an exploratory task, you only need to put your journey of exploration in proper documentation in your .ipynb file; no other output file is to be submitted for task 2. We will mark based on the .ipynb content for task 2.

Table 3. Description of the suburb_info.xlsx file.

suburb: The suburb name, which serves as the index of the data
number_of_houses: The number of houses in the property suburb
number_of_units: The number of units in the property suburb
municipality: The municipality of the property suburb
aus_born_perc: The percentage of the Australian-born population in the property suburb
median_income: The median income of the population in the property suburb
median_house_price: The median ‘house’ price in the property suburb
population: The population in the property suburb

Task 3: Project Reflective Report (15%)

Input files: N/A
Submission: Group<group_id>_report.pdf

3.1 Feedback Session During Week 10 Applied Session

Tasks:

Please attend the week 10 applied session and present your working progress to your TA for some feedback. You need to:

1. Present your current progress
2. Describe any future planning you wish to undertake
3. Record/note the TA’s suggestions
4. Continue your work with tailored solutions/follow-ups based on the suggestions

Details:

● Time/Date: Week 10, during your allocated Applied sessions
● Duration: Approximately 5-8 minutes per group
● Location: Normal location of allocated applied sessions in your Allocate+ records
● Criterion: Please refer to the A2 marking rubrics

3.2 Group Reflection Presentation (Hurdle)

There will be a reflective presentation for your A2. The aim of the presentation is to check your understanding of your A2 project and make sure all submissions are compliant with the academic integrity requirements of Monash.

Details:

● Time/Date: Week 12, during your allocated Applied sessions
● Duration: Approximately 5-10 minutes per group
● Location: Normal location of allocated applied sessions in your Allocate+ records
● Arrangement: We will provide a time schedule for every group during their allocated session; please arrive at your allocated time slot. If you arrive earlier, please wait patiently outside the room.
● Content: Please briefly describe your methodology/logic of A2 (at least for 80% of A2; for detailed subtasks please refer to the A2 marking rubrics) and answer questions, if any
● Criterion: Please refer to the A2 marking rubrics

Attendance Requirement:

Mandatory attendance (HURDLE)

Consequences of Non-Attendance:

Failure to attend the presentation or inability to satisfactorily demonstrate your work will result in not meeting the hurdle requirements for Assignment A2. Consequently, you will receive ZERO for Assessment 2.

The following excuses will not be accepted:

● Forgetting to come to the tutorial
● Forgetting to prepare for the presentation, i.e. forgetting your own solution
● Having too limited time to prepare your presentation
● Being too nervous to talk in English
● Directly using online resources without proper referencing

3.3 Reflective Report

In this task, you are asked to provide a reflective report based on the suggestions you get from the week 10 feedback sessions, your tailored solutions/follow-ups for the suggestions and any actions/findings related to the A2 methodology. Your solutions and justifications need to come together with a comprehensive exploratory data analysis (EDA), any investigations you’ve done after week 10, and any future improvements/works for A2. The goal is to uncover interesting insights that can be useful for further analysis or decision-making.

Documentation (10%)

The cleaning task must be explained in a well-formatted report (with appropriate sections and subsections). Please remember that the report must explain the complete EDA to examine the data, your methodology to find the data anomalies and the suggested approach to fix those anomalies.

The report should be organised in a proper structure to present your solutions with clear and meaningful titles for sections and subsections, or sub-subsections if needed.

● Each step in your solution should be clearly described and justified. For example, you can write to explain your idea of the solution, any specific settings, and the reason for using a particular function, etc.
● Explanation of your results including all intermediate steps is required. This can help the marking team to understand your solution and give partial marks if the final results are not fully correct.
● All your code needs proper (but not excessive) commenting.

Submission Requirements

You need to submit the following 7 files:

● A file named Group<group_id>_dirty_data_solution.csv: This file should contain the data records with all identified errors fixed.
● A file named Group<group_id>_missing_data_solution.csv: This file must contain the data records with all missing values imputed.
● A file named Group<group_id>_outlier_data_solution.csv: This file should include the remaining data records with all outliers removed.

● A Python notebook named Group<group_id>_ass2_task1.ipynb: In this notebook, provide a well-documented report that comprehensively demonstrates your solutions for the 'dirty', 'missing' and 'outlier' data files of Assessment 2. You need to clearly present your methodology through a step-by-step process, accompanied by relevant comments and explanations. Your approach can vary, but clarity is paramount. Please keep this notebook easy to read, as you will lose marks if we cannot understand it. (Make sure the cell outputs are NOT cleared.)
● A file named Group<group_id>_ass2_task1.py: This file will be used for the plagiarism check. (Make sure the cell outputs are cleared before exporting.)

● A Python notebook named Group<group_id>_ass2_task2.ipynb: In this notebook, provide a well-documented report that comprehensively demonstrates your solutions for task 2. You need to clearly present your investigation, your EDA and other methodology through a well-organised reporting format, accompanied by relevant justifications, explanations and references (if any). As this is an open exploratory task, we expect to see you reach out to check the dataset, the model assumptions and relevant research to enrich the report. Please keep this notebook easy to read, as you will lose marks if we cannot understand it. (Make sure the cell outputs are NOT cleared.)
● A file named Group<group_id>_report.pdf: This is the reflective report you have from task 3.

Appendix

To generate a .py file, you need to clear all the cell outputs, and then download it.

In Jupyter Notebook: (screenshot not reproduced here)

In Google Colab: (screenshot not reproduced here)

Submission Checklist

● Please zip all the submission files for assessment 2 into a single file with the name Group<group_id>_ass2.zip (any other format, e.g. rar or 7z, will be penalised).
● There are 7 files in your compressed zip file.
● <group_id> should be replaced with your group id (without <>); it is zero-padded, i.e. 001, 010, ...
● Make sure BOTH members of your group click the ‘Submit’ button on Moodle.
● Please strictly follow the file naming standard. Any misnamed file will carry a penalty.
● Please make sure that your .ipynb file contains printed output, while your .py file does not include any output.
● Please ensure that all your files are parsable and readable. You can achieve this by re-reading all your generated files back into Python (e.g. using read_csv for CSV files). These checks are only sanity checks and hence should not be added to your final submission.
● Make sure to attend your allocated applied sessions in Week 12 to meet the HURDLE requirements.
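The parsability check in the list above, together with the requirement that output files keep the exact columns of their inputs, can be sketched as follows. The checklist mentions read_csv; this sketch swaps in the standard-library csv module so that it has no dependencies. The function names are hypothetical.

```python
import csv


def read_header_and_rows(path):
    """Parse a CSV file fully and return (header, number_of_data_rows)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = 0
        for row in reader:
            if len(row) != len(header):
                raise ValueError(
                    f"{path}: line {reader.line_num} has {len(row)} fields, "
                    f"expected {len(header)}"
                )
            row_count += 1
    return header, row_count


def same_columns(input_path, output_path):
    """True when the output file has exactly the input file's columns, in order."""
    return read_header_and_rows(input_path)[0] == read_header_and_rows(output_path)[0]
```

A call such as same_columns("in.csv", "out.csv") would be run once per input/output pair in a scratch cell, not in the submitted notebook.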

Note: All submissions will be put through plagiarism detection software which automatically checks for their similarity with respect to other submissions. Any plagiarism found will trigger the Faculty’s relevant procedures and may result in severe penalties, up to and including exclusion from the university.
