FIT5196-S2-2024 Assessment 2
This is a group assessment and is worth 40% of your total mark for FIT5196.
Due date: Friday 18 October 2024, 11:55 pm
Task 1. Data Cleansing (50%)
For this assessment, you are required to write Python code to analyse your dataset, and to find and fix
the problems in the data. The input and output of this task are shown below:
Table 1. The input and output of task 1
Input files
Submission
Output files
Other Deliverables
Group
.csv
Group
ta.csv
Group
ata.csv
warehouse.csv
Group
ution.csv
Group
olution.csv
Group
solution.csv
Group
1.ipynb
Group
1.py
Note 1: All files must be zipped into a file named
Group
not rar, 7z, tar, etc.)
Note 2: Replace
Note 3: You can find your three input files from the folder with your group number here.
Using the wrong files will result in zero marks.
Note 4: Please strictly follow the instructions in the appendix to generate the .ipynb and
.py files.
Exploring and understanding the data is one of the most important parts of the data wrangling
process. You are required to perform graphical and/or non-graphical EDA methods to understand
the data first and then find the data problems. In this assessment, you have been provided with
three data inputs along with the additional file
warehouse.csv
here. Due to an unexpected
scenario, a portion of the data is missing or contains anomalous values. Thus, before moving to
the next step in data analysis, you are required to perform the following tasks:
1. Detect and fix errors in
2. Impute the missing values in
3. Detect and remove outlier rows in
○ (w.r.t. the delivery_charges attribute only)
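One possible way to approach step 3 is to flag rows whose delivery_charges lie far from what a fitted linear model predicts. The sketch below is only an illustration under stated assumptions: it fits ordinary least squares with numpy.linalg.lstsq (a stand-in for any linear model), uses a 1.5 x IQR fence on the residuals (this threshold is an assumption, not something the specification prescribes), and runs on synthetic numbers rather than the real group files. The function name flag_residual_outliers is hypothetical.

```python
import numpy as np


def flag_residual_outliers(features, target, k=1.5):
    """Return a boolean mask that is True for rows whose residual from an
    ordinary least-squares fit lies outside the k * IQR fences.

    The default k = 1.5 is an illustrative assumption, not a rule from the spec.
    """
    y = np.asarray(target, dtype=float)
    # Prepend a column of ones so the fit includes an intercept.
    X = np.column_stack([np.ones(len(y)), np.asarray(features, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    return (residuals < q1 - k * iqr) | (residuals > q3 + k * iqr)


# Synthetic data (invented for illustration): charge grows linearly with distance.
rng = np.random.default_rng(0)
distance = rng.uniform(0.1, 3.0, size=200)
charges = 50.0 + 10.0 * distance + rng.normal(0.0, 0.5, size=200)
charges[5] += 40.0  # plant one obvious outlier

mask = flag_residual_outliers(distance.reshape(-1, 1), charges)
print("rows flagged:", int(mask.sum()))
```

Because the specification states that the delivery charge model differs by season, a check like this would presumably be run separately within each season rather than on the whole file at once.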
Project Background
As a starting point, here is what we know about the dataset in hand:
The dataset contains transactional retail data from an online electronics store (DigiCO) located in
Melbourne, Australia (see footnote 1). The store's operation is exclusively online, and it has three warehouses
around Melbourne from which goods are delivered to customers.
Each instance of the data represents a single order from the DigiCO store. The description of each
data column is shown in Table 2.
Table 2. Description of the columns
COLUMN: DESCRIPTION
order_id: A unique id for each order
customer_id: A unique id for each customer
date: The date the order was made, given in YYYY-MM-DD format
nearest_warehouse: A string denoting the name of the nearest warehouse to the customer
shopping_cart: A list of tuples representing the order items: the first element of the tuple is the item ordered, and the second element is the quantity ordered for that item
order_price: A float denoting the order price in AUD. The order price is the price of items before any discounts and/or delivery charges are applied.
customer_lat: Latitude of the customer’s location
customer_long: Longitude of the customer’s location
coupon_discount: An integer denoting the percentage discount to be applied to the order_price.
distance_to_nearest_warehouse: A float representing the arc distance, in kilometres, between the customer and the nearest warehouse to him/her (radius of earth: 6378 km).
delivery_charges: A float representing the delivery charges of the order
order_total: A float denoting the total of the order in AUD after all discounts and/or delivery charges are applied.
season: A string denoting the season in which the order was placed. Refer to this link for details about how seasons are defined.
is_expedited_delivery: A boolean denoting whether the customer has requested an expedited delivery
latest_customer_review: A string representing the latest customer review on his/her most recent order
is_happy_customer: A boolean denoting whether the customer is a happy customer or had an issue with his/her last order.
Footnote 1: The dataset is fictional.
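The distance_to_nearest_warehouse column is described as an arc distance computed with an earth radius of 6378 km. A minimal sketch of the Haversine formula with that radius is shown below. The warehouse names and all coordinates are invented for illustration (the real locations are in warehouse.csv, whose column layout is not assumed here), and the helper names haversine_km and nearest_warehouse are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6378.0  # radius stated in the specification


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


def nearest_warehouse(customer, warehouses):
    """Return (name, distance_km) of the closest warehouse.

    customer is a (lat, lon) pair; warehouses maps name -> (lat, lon).
    """
    name, coords = min(
        warehouses.items(), key=lambda kv: haversine_km(*customer, *kv[1])
    )
    return name, haversine_km(*customer, *coords)


# Hypothetical warehouse and customer coordinates, for illustration only.
warehouses = {"A": (-37.80, 144.95), "B": (-37.81, 144.99), "C": (-37.82, 144.97)}
name, dist = nearest_warehouse((-37.815, 144.96), warehouses)
print(name, round(dist, 4))
```

Recomputing both the nearest warehouse and the distance from the customer coordinates in this way gives a reference value against which the recorded columns can be compared.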
Notes:
1. The output csv files must have the exact same columns as the respective input files. Any
misspelling or mismatch will lead to a malfunction of the auto-marker, which will in turn lead to
losing marks.
2. In the file Group
(i.e. there can only be up to one issue in a single row.)
3. All anomalies in dirty data have one and only one possible fix.
4. There are no data anomalies in the file Group
Similarly, there are only coverage data anomalies (i.e. no other data anomalies) in
Group
5. The retail store has three different warehouses in Melbourne (see warehouse.csv for their
locations)
6. The retail store focuses only on 10 branded items and sells them at competitive prices
7. In order to get the item unit price, a useful Python package to solve multivariable equations is
numpy.linalg
8. The distance is calculated as Haversine Distance (with radius of earth = 6378 km) like here.
9. The store has different business rules depending on the seasons to match the different
demands of each season. For example, delivery charge is calculated using a linear model
which differs depending on the season. The model depends linearly (but in different ways for
each season) on:
○ Distance between customer and nearest warehouse
○ Whether the customer wants an expedited delivery
○ Whether the customer was happy with his/her last purchase (if no previous purchase,
it is assumed that the customer is happy)
10. It is recommended to use sklearn.linear_model.LinearRegression for solving the linear
model as demonstrated in the tutorials.
11. Using proper data for model training is crucial to have a good linear model (i.e. R² score over
0.97 and very close to 1) to validate the delivery charges. The better your model is, the more
accurate your result will be.
12. To check whether a customer is happy with their last order, the customer's latest review is
classified using a sentiment analysis classifier. SentimentIntensityAnalyzer from
nltk.sentiment.vader is used to obtain the polarity score. A sentiment is considered positive if it
has a 'compound' polarity score of 0.05 or higher and is considered negative otherwise. Refer
to this link for more details on how to use this module.
13. If the customer provided a coupon during purchase, the coupon discount percentage will be
applied to the order price before adding the delivery charges (i.e. the delivery charges will
never be discounted).
14. The below columns are error-free (i.e. don’t look for any errors in dirty data for them):
○ coupon_discount
○ delivery_charges
○ The ordered quantity values in the shopping_cart attribute
○ order_id
○ customer_id
○ latest_customer_review
15. For missing data imputation, you are recommended to try all possible methods to impute
missing values and keep the most appropriate one that could provide the best performance.
16. As EDA is part of this assessment, no further information will be given publicly regarding the
data. However, you can brainstorm with the teaching team during tutorials or on the Ed forum.
17. No libraries/packages restriction.
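Notes 7 and 13 can be illustrated together. Each order gives one linear equation (the sum of quantity times unit price equals order_price), so unit prices can be recovered from several orders with numpy.linalg. The sketch below is a hedged example only: the item names, carts and prices are invented, it uses numpy.linalg.lstsq rather than numpy.linalg.solve because there are usually more orders than items, and the rounding to two decimals is an assumption. The helper order_total is hypothetical and simply encodes the rule that the coupon applies to the order price and never to the delivery charges.

```python
import numpy as np

# Hypothetical carts (item, quantity) and their order prices; all values invented.
carts = [
    [("itemA", 1), ("itemB", 2)],
    [("itemB", 1), ("itemC", 3)],
    [("itemA", 2), ("itemC", 1)],
    [("itemA", 1), ("itemB", 1), ("itemC", 1)],
]
order_prices = [500.0, 950.0, 450.0, 550.0]

# One column per distinct item, one row per order.
items = sorted({name for cart in carts for name, _ in cart})
column = {name: j for j, name in enumerate(items)}

quantities = np.zeros((len(carts), len(items)))
for i, cart in enumerate(carts):
    for name, qty in cart:
        quantities[i, column[name]] += qty

# Least-squares solution of quantities @ prices = order_prices.
prices, *_ = np.linalg.lstsq(quantities, np.array(order_prices), rcond=None)
unit_price = {name: float(round(p, 2)) for name, p in zip(items, prices)}
print(unit_price)


def order_total(order_price, coupon_discount, delivery_charges):
    """Apply the percentage coupon to the order price, then add delivery charges."""
    return round(order_price * (1 - coupon_discount / 100) + delivery_charges, 2)
```

With unit prices in hand, order_price can be recomputed from shopping_cart and order_total from order_price, coupon_discount and delivery_charges, which helps decide which of the three fields is the inconsistent one in a given row.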
Methodology (10%)
The report <group_id>_ass2_task1.ipynb should demonstrate the methodology (including all
steps) to achieve the correct results.
You need to demonstrate your solution using correct steps.
● Your solution should be presented in a proper way including all required steps.
● You need to select and use the appropriate Python functions for input, process and
output.
● Your solution should be an efficient one without redundant operations and
unnecessary reading and writing of the data.
Task 2: Data Reshaping (15%)
You need to complete task 2 with the suburb_info.xlsx file ONLY. With the given property and
suburb related data, you need to study the effect of different normalisation/transformation (e.g.
standardisation, min-max normalisation, log, power, box-cox transformation) methods on these
columns: number_of_houses, number_of_units, population, aus_born_perc,
median_income, median_house_price. You need to observe and explain their effect assuming
we want to develop a linear model to predict the “median_house_price” using the 5 attributes
mentioned above.
When reshaping the data, we normally have two main criteria.
● First, we want our features to be on the same scale; and
● Second, we want our features to have as much linear relationship as possible with the
target variable (i.e., median_house_price).
You need to first explore the data to see if any scaling or transformation is necessary (if yes, why?
and if not, also why?) and then perform appropriate actions and document your results and
observations. Please note that the aim of this task is to prepare the data for a linear
regression model; it is not to build the linear model. That is, you need to record all your steps,
from loading the raw data to completing all the required transformations, if any.
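The two criteria above behave differently under different operations, which a small experiment can make concrete. The sketch below is an illustration on synthetic data, not on suburb_info.xlsx: the feature and target are generated so that the target depends on the logarithm of a right-skewed feature, which is an assumption chosen purely to show the effect. The helper names standardise, min_max and pearson_r are hypothetical.

```python
import numpy as np


def standardise(x):
    """Z-score scaling: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def min_max(x):
    """Rescale to the [0, 1] interval."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic, strictly positive, right-skewed feature; the target is linear in its log.
rng = np.random.default_rng(1)
feature = rng.lognormal(mean=9.0, sigma=1.0, size=300)
target = 2.0 * np.log(feature) + rng.normal(0.0, 0.3, size=300)

for label, values in [
    ("raw", feature),
    ("standardised", standardise(feature)),
    ("min-max", min_max(feature)),
    ("log", np.log(feature)),
]:
    r = pearson_r(values, target)
    print(f"{label:>12}: min={values.min():.3g}, max={values.max():.3g}, r={r:.3f}")
```

Standardisation and min-max scaling are linear maps, so they change the scale of a feature but leave its correlation with the target unchanged; only a non-linear transformation such as log, power or Box-Cox can change how linear the relationship is. Whether any of them is appropriate for a given column still has to be judged from the real data.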
Input files    Submission
suburb_info.xlsx    Group
You could consider the scenario of task 2 to be an open exploratory project: Jackie and Kiara
have got some funding to do an exploratory consulting project on the property market. We wish to
understand any interesting insights from the relevant features in different suburbs of Melbourne.
Before we step into the final linear regression modelling stage, we wish to hire you to prepare the
data for us and tell us whether any transformation/normalisation is required. Will the data satisfy the
assumptions of linear regression? How could we make our data more suitable for the later
modelling stage?
As an exploratory task, you only need to put your journey of exploration in proper
documentation in your .ipynb file; no other output file is to be submitted for task 2. We will mark
based on the .ipynb content for task 2.
Table 3. Description of the suburb_info.xlsx file.
suburb: The suburb name, which serves as the index of the data
number_of_houses: The number of houses in the property suburb
number_of_units: The number of units in the property suburb
municipality: The municipality of the property suburb
aus_born_perc: The percentage of the Australian-born population in the property suburb
median_income: The median income of the population in the property suburb
median_house_price: The median ‘house’ price in the property suburb
population: The population in the property suburb
Task 3: Project Reflective Report (15%)
Input files    Submission
N/A    Group
3.1 Feedback Session During Week 10 Applied Session
Tasks:
Please attend the week 10 applied session and present your working progress to
your TA for some feedback. You need to:
1. Present your current progress
2. Present any future planning you wish to undertake
3. Record/note the TA’s suggestions
4. Continue your work with tailored solutions/follow-ups based on the suggestions
Details:
● Time/Date: Week 10, during your allocated Applied sessions
● Duration: Approximately 5-8 minutes per group
● Location: Normal location of allocated applied sessions in your Allocate+ records
● Criterion: Please refer to A2 marking rubrics
3.2 Group Reflection Presentation (Hurdle)
There will be a reflective presentation for your A2. The aim of the presentation is to check
your understanding of your A2 project and make sure all submissions are compliant with the
academic integrity requirements of Monash.
Details:
● Time/Date: Week 12, during your allocated Applied sessions
● Duration: Approximately 5-10 minutes per group
● Location: Normal location of allocated applied sessions in your Allocate+ records
● Arrangement: We will provide a time schedule for every group during their
allocated session; please arrive at your allocated time slot. If you arrive earlier,
please wait patiently outside the room.
● Content: Please briefly describe your methodology/logic of A2 (at least for 80% of
A2; for detailed subtasks please refer to A2 marking rubrics) and answer questions if
any
● Criterion: Please refer to A2 marking rubrics
Attendance Requirement:
Mandatory attendance (HURDLE)
Consequences of Non-Attendance:
Failure to attend the presentation or inability to satisfactorily demonstrate your work will
result in not meeting the hurdle requirements for Assignment A2. Consequently, you will
receive ZERO for Assessment 2.
The following excuses will not be accepted:
● Forgetting to come to the tutorial
● Forgetting to prepare for the presentation, i.e. forgetting your own solution
● Having too limited time to prepare your presentation
● Being too nervous to talk in English
● Directly using online resources without proper referencing
3.3 Reflective Report
In this task, you are asked to provide a reflective report based on the suggestions you get
from the week 10 feedback sessions, your tailored solutions/follow-ups for the suggestions and
any actions/findings related to the A2 methodology. Your solutions and justifications need to come
together with a comprehensive exploratory data analysis (EDA), any investigations you’ve
done after week 10, and any future improvements/work for A2. The goal is to uncover
interesting insights that can be useful for further analysis or decision-making.
Documentation (10%)
The cleaning task must be explained in a well-formatted report (with appropriate sections and
subsections). Please remember that the report must explain the complete EDA to examine the
data, your methodology to find the data anomalies and the suggested approach to fix those
anomalies.
The report should be organised in a proper structure to present your solutions with clear and
meaningful titles for sections and subsections or sub-subsections if needed.
● Each step in your solution should be clearly described and justified. For example, you
can write to explain your idea of the solution, any specific settings, and the reason for
using a particular function, etc.
● Explanation of your results including all intermediate steps is required. This can help
the marking team to understand your solution and give partial marks if the final results
are not fully correct.
● All your code needs proper (but not excessive) commenting.
Submission Requirements
You need to submit the following 7 files:
● A file named
Group
contain the data records with all identified errors fixed.
● A file named
Group
contain the data records with all missing values imputed.
● A file named
Group
include the remaining data records with all outliers removed.
● A Python notebook named
Group
_ass2_task1.ipynb: In this
notebook, provide a well-documented report that comprehensively demonstrates
your solutions for the 'dirty', 'missing' and 'outlier' data files of Assessment 2. You
need to clearly present your methodology through a step-by-step process,
accompanied by relevant comments and explanations. Your approach can vary,
but clarity is paramount. Please keep this notebook easy to read, as you will lose
marks if we cannot understand it. (Make sure the cell outputs are NOT cleared.)
● A file named Group
_ass2_task1.py: This file will be used for the plagiarism
check.
(Make sure the cell outputs are cleared before exporting.)
● A Python notebook named
Group
_ass2_task2.ipynb: In this
notebook, provide a well-documented report that comprehensively demonstrates
your solutions for task 2. You need to clearly present your investigation, your EDA
and other methodology through a well-organised reporting format, accompanied
by relevant justifications, explanations and references (if any). As an open
exploratory task, we expect to see you reach out to check the dataset, the model
assumptions and relevant research to enrich the report. Please keep this notebook
easy to read, as you will lose marks if we cannot understand it. (Make sure the
cell outputs are NOT cleared.)
● A file named Group
.pdf: This is the reflective report you have from
task 3.
Appendix
To generate a .py file, you need to clear all the cell outputs, and then download it.
In Jupyter notebook:
In Google Colab:
Submission Checklist
Please zip all the submission files for assessment 2 into a single file with the
name Group
penalised)
There are 7 files in your compressed zip file
paddings ie. 001, 010...).’
Make sure BOTH members of your group click the ‘Submit’ button on Moodle
Please strictly follow the file naming standard. Any misnamed file will carry a
penalty.
Please make sure that your .ipynb file contains printed output, while your .py
file does not include any output.
Please ensure that all your files are parsable and readable. You can achieve this
by re-reading all your generated files back into Python (e.g. using read_csv for
CSV files). These checks are only sanity checks and hence should not be added
to your final submission.
Make sure to attend your allocated applied sessions in Week 12 to meet the
HURDLE requirements
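The re-reading check mentioned in the checklist can be combined with the requirement that output files keep exactly the same columns as their inputs. The sketch below is one possible sanity check, assuming plain UTF-8 CSV files; it uses the standard-library csv module instead of pandas read_csv so that it has no dependencies, and the function names read_header and check_output are hypothetical.

```python
import csv


def read_header(path):
    """Return the header row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


def check_output(input_path, output_path):
    """Re-read an output CSV and compare its columns with the input file.

    Raises ValueError on a column mismatch or a record with the wrong
    number of fields; returns the number of data records otherwise.
    """
    expected = read_header(input_path)
    with open(output_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        actual = next(reader)
        if actual != expected:
            raise ValueError(f"column mismatch: expected {expected}, got {actual}")
        n_records = 0
        for record_no, row in enumerate(reader, start=1):
            if len(row) != len(expected):
                raise ValueError(
                    f"record {record_no}: {len(row)} fields, expected {len(expected)}"
                )
            n_records += 1
    return n_records


# Example call with placeholder file names (not the real submission names):
# check_output("input_file.csv", "output_file.csv")
```

As the checklist says, a check like this is for your own verification only and should stay out of the final submission.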
Note: All submissions will be put through plagiarism detection software which automatically
checks for their similarity with respect to other submissions. Any plagiarism found will trigger
the Faculty’s relevant procedures and may result in severe penalties, up to and including
exclusion from the university.