Monash University
FIT5202 - Data processing for Big Data (SSB 2025)
Assignment 1: Analysing Food Delivery Data
Due Date: 23:55 Friday 17/Jan/2025 (End of week 3)
Weight: 10% of the final marks
Background
Food delivery services have become an integral part of modern society,
revolutionizing the way we consume meals and interact with the food industry. These
platforms, accessible through websites and mobile apps, provide a convenient bridge
between restaurants and consumers, allowing users to browse menus, place orders,
and have food delivered directly to their doorstep with just a few taps. In today's
fast-paced world, where time is a precious commodity, food delivery services offer an
invaluable solution, catering to busy lifestyles, limited mobility, and the ever-present
desire for convenience. They empower individuals to enjoy a diverse range of
cuisines without leaving their homes or offices, support local restaurants by
expanding their reach, and have even become a crucial lifeline during times of crisis,
such as lockdowns and emergencies, ensuring access to essential sustenance and
supporting the economy. As a result of its convenience, and the increasing
preference for on-demand services, food delivery has become a very important part
of modern life, impacting everything from our daily routines to the broader economic
landscape.
In the food delivery industry, accurate on-time delivery prediction is paramount. Big
data processing allows companies to achieve this by analyzing vast datasets
encompassing order details, driver performance, real-time traffic, and even weather.
Sophisticated algorithms leverage this data to build predictive models. These models
learn from historical trends, for example, a restaurant's longer preparation times
during peak hours or a driver's faster navigation in specific areas. Real-time data, like
driver GPS location and live traffic, further refine these predictions, enabling dynamic
adjustments to estimated delivery times.
The benefits are substantial. Firstly, customer satisfaction improves with reliable
delivery estimates and transparent communication regarding delays. Secondly,
operational efficiency increases through optimized driver scheduling and route
planning, leading to reduced costs and faster deliveries. Furthermore, accurate
predictions empower proactive measures to mitigate delays. The system can alert
customers of potential issues, offer compensation, and trigger interventions like
expediting order preparation. If an order is not delivered on time, quality
after-service should follow, such as offering refunds, providing future discounts,
or simply offering a sincere apology.
By mastering on-time delivery prediction through big data, food delivery companies
gain a crucial competitive edge. They can meet and exceed customer expectations,
foster loyalty, and drive sustainable growth in a demanding market. As the industry
evolves, leveraging big data for accurate delivery forecasting will remain a key
differentiator for success.
This series of assignments will immerse you in the world of big data analytics,
specifically within the context of a modern, data-driven application: food delivery
services. We will explore the entire lifecycle of data processing, from analyzing
historical information to building and deploying real-time machine learning models.
Each assignment builds upon the last, culminating in a comprehensive understanding
of how big data technologies can be leveraged to optimize performance and enhance
user experience.
In the first assignment (A1), we will delve into historical datasets, performing data
analysis to uncover key trends and patterns related to delivery times, order volumes,
and other crucial metrics. This foundational understanding will pave the way for
assignment 2A, where we will harness the power of Apache Spark's MLlib to
construct and train machine learning models, focusing on predicting delivery times
with accuracy and efficiency. Finally, assignment 2B will elevate our analysis to the
real-time domain, utilizing Apache Spark Structured Streaming to process live data
streams and dynamically adjust predictions, providing a glimpse into the cutting-edge
techniques driving modern, responsive applications. Through this hands-on journey,
you will gain practical experience with industry-standard tools and develop a strong
conceptual understanding of how big data powers the dynamic world of on-demand
services.
In A1, we will perform historical data analysis using Apache Spark. We will use the RDD,
DataFrame and SQL APIs learnt in topics 1-4.
The Dataset
The dataset can be downloaded from Moodle.
You will find the following files after extracting the zip file:
1) delivery_order.csv: Contains food order records.
2) geolocation.csv: Contains geographical information about restaurants and
delivery locations.
3) delivery_person.csv: Contains basic driver information, their rating and vehicle
information.
The metadata of the dataset can be found in the appendix at the end of this document.
(Note: The dataset is a mixture of real-life and synthetic data, therefore some
anomalies may exist in the dataset. Data cleansing is not mandatory in this
assignment.)
Assignment Information
The assignment consists of three parts: Working with RDDs, Working with
DataFrames, and Comparison of three forms of Spark abstractions. In this
assignment, you are required to implement various solutions based on RDDs and
DataFrames in PySpark for the given queries related to food delivery data analysis.
Getting Started
● Download your dataset from Moodle.
● Download a template file for submission purposes:
  ● A1_template.ipynb file in Jupyter notebook to write your
  solution. Rename it into the required format (for example:
  A1_xxx0000.ipynb, where xxx0000 is your authcate). This file
  contains your code solution.
● For this assignment, you will use Python 3+ and PySpark 3.5.0. (The
environment is provided as a Docker image, the same one you use in
labs.)
Part 1: Working with RDDs (30%)
In this section, you need to create RDDs from the given datasets, perform
partitioning in these RDDs and use various RDD operations to answer the queries.
1.1 Data Preparation and Loading (5%)
1. Write the code to create a SparkContext object using SparkSession. To create
a SparkSession, you first need to build a SparkConf object that contains
information about your application. Use Melbourne time as the session
timezone. Give your application an appropriate name and run Spark locally
with 4 cores on your machine.
2. Load the CSV files into multiple RDDs.
3. For each RDD, remove the header rows and display the total count and first
10 records.
4. Drop records with invalid information (NaN or Null) in any column.
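As an illustration of steps 3 and 4, the sketch below shows the row-level logic in plain Python, without Spark. The sample lines, the `NULL_TOKENS` set and the helper names `parse_line` and `is_valid` are all hypothetical; the real files may spell missing values differently. In a Spark job, predicates like these could be passed to `rdd.map` and `rdd.filter`.

```python
import csv

# Assumed spellings of "missing" values; adjust after inspecting the real data.
NULL_TOKENS = {"", "nan", "null", "none"}


def parse_line(line):
    """Split one CSV line into fields, honouring quoted commas."""
    return next(csv.reader([line]))


def is_valid(fields):
    """True only if no field is empty, NaN or Null (case-insensitive)."""
    return all(f.strip().lower() not in NULL_TOKENS for f in fields)


# Hypothetical sample lines standing in for a loaded text file.
lines = [
    "order_id,type_of_order,time_taken",
    "o1,Meal,25",
    "o2,NaN,30",
    "o3,Snack,",
    "o4,Drinks,18",
]

header = lines[0]
rows = [parse_line(line) for line in lines if line != header]  # drop header
clean = [r for r in rows if is_valid(r)]                       # drop invalid rows

print(len(rows), len(clean))
```

Filtering on the header string only works if no data row is identical to the header, which is assumed here.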
1.2 Data Partitioning in RDD (15%)
1. For each RDD, using Spark’s default partitioning, print out the total number of
partitions and the number of records in each partition (5%).
2. Answer the following questions:
a. How many partitions do the above RDDs have?
b. How is the data in these RDDs partitioned by default, when we do not
explicitly specify any partitioning strategy? Can you explain why it is
split into this number of partitions?
c. Assuming we are querying the dataset based on order timestamp,
can you think of a better strategy for partitioning the data based on
your available hardware resources?
Write your explanation in Markdown cells. (5%)
3. Create a user-defined function (UDF) to transform a timestamp to ISO
format (YYYY-MM-DD HH:mm:ss), then call the UDF to transform two
timestamps (order_ts and ready_ts) to order_datetime and
ready_datetime (5%)
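The timestamp conversion in item 3 can be sketched as a plain Python function. This is a minimal sketch that assumes order_ts and ready_ts are Unix epoch seconds, which the specification does not state; the function name `to_iso` is hypothetical. UTC is the default here only to keep the example deterministic.

```python
from datetime import datetime, timezone


def to_iso(ts, tz=timezone.utc):
    """Format epoch seconds (number or numeric string) as 'YYYY-MM-DD HH:mm:ss'."""
    return datetime.fromtimestamp(float(ts), tz=tz).strftime("%Y-%m-%d %H:%M:%S")


print(to_iso(0))             # 1970-01-01 00:00:00
print(to_iso("1700000000"))  # 2023-11-14 22:13:20
```

To render Melbourne time instead, a `zoneinfo.ZoneInfo("Australia/Melbourne")` object from the standard library (Python 3.9+) could be passed as `tz`, provided the time zone database is available on the machine.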
1.3 Query/Analysis (10%)
For this part, write relevant RDD operations to answer the following questions.
1. Extract weekday (Monday-Sunday) information from orders and print the total
number of orders each weekday. (5%)
2. Show a list of type_of_order and average preparation time in minutes
(ready_ts - order_ts) (5%)
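Both questions reduce to a per-key aggregation. The plain Python sketch below uses hypothetical records and assumes the timestamps are epoch seconds; it keeps a (sum, count) pair per order type, which is the same shape of value one would combine per key in an RDD, for example with `reduceByKey`.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical records: (order_ts, ready_ts, type_of_order), epoch seconds assumed.
orders = [
    (0, 600, "Meal"),          # Thursday 1970-01-01, 10 minutes of preparation
    (86400, 87600, "Meal"),    # Friday, 20 minutes
    (90000, 90300, "Drinks"),  # Friday, 5 minutes
]

# Question 1: number of orders per weekday name.
weekday_counts = Counter(
    datetime.fromtimestamp(order_ts, tz=timezone.utc).strftime("%A")
    for order_ts, _, _ in orders
)

# Question 2: average preparation time in minutes per order type,
# accumulated as [sum_of_minutes, count] and divided at the end.
totals = defaultdict(lambda: [0.0, 0])
for order_ts, ready_ts, kind in orders:
    totals[kind][0] += (ready_ts - order_ts) / 60
    totals[kind][1] += 1
avg_prep = {kind: total / n for kind, (total, n) in totals.items()}

print(dict(weekday_counts))
print(avg_prep)
```

Averaging via (sum, count) pairs rather than averaging averages keeps the result correct when partial results from different partitions are merged.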
Part 2: Working with DataFrames (45%)
In this section, you need to load the given datasets into PySpark
DataFrames and use DataFrame functions to answer the queries.
2.1 Data Preparation and Loading (5%)
1. Load the CSV files into separate dataframes. When you create your
dataframes, please refer to the metadata file and think about the appropriate
data type for each column.
2. Display the schema of the dataframes.
When the dataset is large, do you need all columns? How can you optimize memory
usage? Do you need a customized data partitioning strategy? (Note: Think about
these questions, but you don’t need to answer them.)
2.2 Query/Analysis (40%)
Implement the following queries using dataframes. You need to be able to perform
operations like transforming, filtering, sorting, joining and group by using the functions
provided by the DataFrame API.
1. Write a function to encode/transform weather conditions to integers and drop
the original string. You can decide your own encoding scheme (e.g. Sunny = 0,
Cloudy = 1, Fog = 2, etc.). (5%)
2. Calculate the number of orders for each hour. Show the results in a table and
plot a bar chart. (5%)
3. Join the delivery_order dataframe with the geolocation dataframe, calculate the distance
between a restaurant and the delivery location, and store the distance in a new
column named delivery_distance. (hint: You may need to install an additional
library like GeoPandas to calculate the distance between two points). (5%)
4. Using the data from 3, find the top 10 drivers travelling the longest distance.
(5%)
5. For each type of order, plot a histogram of meal preparation time. The plot can
be done with multiple legends or sub-plots. (note: you can decide your bin
size). (10%)
6. (Open Question) Explore the dataset and use a delivery person’s rating as a
performance indicator. Is a lower rating usually correlated to a longer delivery
time? What might be the contributing factors to drivers' low ratings? Please
include one plot and discussion based on your observation (no word limit but
please keep it concise). (10%)
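For the distance calculation in query 3, the hint suggests a library such as GeoPandas. As an alternative illustration, the sketch below uses the haversine great-circle formula in plain Python; this is a swapped-in technique, not the method the assignment prescribes. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions, and the inputs are taken to be latitude and longitude in decimal degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111.19 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```

The result is a straight-line distance over the sphere, so it will generally underestimate the distance actually driven on roads.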
Part 3: RDDs vs DataFrame vs Spark SQL (25%)
Implement the following query using RDDs, DataFrame and Spark SQL separately.
Log the time taken for each query in each approach using the “%%time” built-in
magic command in Jupyter Notebook and discuss the performance difference
between these 3 approaches.
(Complex Query) Calculate the time taken on the road (defined as the total time
taken minus restaurants’ order preparation time, i.e., total time - (ready_ts -
order_ts)). For each road_condition, using a 10-minute bucket size of time on
the road (e.g. 0-10, 10-20, 20-30, etc.), show the percentage of each bucket.
(note: You can reuse the loaded data/variables from part 1 & 2.)
(hint: You may create intermediate RDD/dataframes for this query.)
1) Implement the above query using RDDs, DataFrame and SQL separately
and print the results. (Note: The three different approaches should have the
same results). (15%)
2) Which one is the easiest to implement in your opinion? Log the time
taken for each query and observe the query execution time. Among
RDD, DataFrame, and Spark SQL, which is the fastest and why? Please
include proper references. (Maximum 500 words.) (10%)
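The bucketing logic of the complex query can be sketched independently of any Spark API. The plain Python example below uses hypothetical records and assumes order_ts and ready_ts are epoch seconds while time_taken is in minutes (as the appendix states); the helper names `road_minutes` and `bucket_label` are illustrative only.

```python
import math
from collections import Counter, defaultdict


def road_minutes(order_ts, ready_ts, time_taken):
    """Total minutes minus preparation minutes (timestamps assumed in seconds)."""
    return time_taken - (ready_ts - order_ts) / 60


def bucket_label(minutes, size=10):
    """Map a duration to a half-open bucket label such as '10-20'."""
    low = int(math.floor(minutes / size)) * size
    return f"{low}-{low + size}"


# Hypothetical records: (order_ts, ready_ts, time_taken, road_condition).
records = [
    (0, 600, 25, "Low"),   # 25 - 10 = 15 minutes on the road -> 10-20
    (0, 300, 12, "Low"),   # 12 - 5  = 7  -> 0-10
    (0, 600, 28, "Low"),   # 28 - 10 = 18 -> 10-20
    (0, 1200, 55, "Jam"),  # 55 - 20 = 35 -> 30-40
]

# Count orders per (road_condition, bucket), then normalise within each condition.
counts = defaultdict(Counter)
for order_ts, ready_ts, time_taken, road in records:
    counts[road][bucket_label(road_minutes(order_ts, ready_ts, time_taken))] += 1

percentages = {
    road: {bucket: 100 * n / sum(c.values()) for bucket, n in c.items()}
    for road, c in counts.items()
}
print(percentages)
```

Percentages here are computed within each road_condition, which is one reading of "the percentage of each bucket"; anomalous rows with negative time on the road would land in a negative bucket such as "-10-0" rather than being dropped.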
Submission
You should submit your final version of the assignment solution online via Moodle. You
must submit the files created:
- Your Jupyter notebook file (e.g., A1_authcate.ipynb).
- A PDF file saved from the Jupyter notebook with all output, following the file
naming format: A1_authcate.pdf
Note that both submitted files (Jupyter notebook and PDF) will be scanned using
plagiarism detection software. Students with the highest similarity scores
may be interviewed to prove the originality of their work.
Assignment Marking Rubric
For each task individually, you’ll be marked based on the quality of your work on a 3-level
scale (0%, 50% and 100%).
● 0%: No attempt or incorrect answer with poor attempt;
● 50%: Partial mark for a good attempt but incorrect result;
● 100%: Full mark for correct attempt.
In your submission, the Jupyter notebook file should contain the code and its output. The
code should follow programming standards and be readable and well organized. Please
find the PEP 8 -- Style Guide for Python Code for your reference. Here is the link:
https://peps.python.org/pep-0008/ A penalty of up to 10% applies if your code is hard to
understand with insufficient comments.
Late submissions
Late assignments or extensions will not be accepted unless you submit a special
consideration form. ALL Special Consideration, including within the semester, is now to be
submitted centrally. This means that students MUST submit an online Special Consideration
form via Monash Connect. For more details, please refer to the Unit Information section in
Moodle.
A late submission is subject to a 5% penalty per day, including weekends. The cut-off date is
7 days after the due date. No submission will be accepted after the cut-off date (i.e. 0 marks)
unless you have a special consideration.
Mark Release and Review
● Marks will be released within 10 business days after the submission deadline.
● Reviews and disputes regarding the mark will be accepted a maximum of 7 days after
the release date (including weekends).
Other Information
Where to get help
You can ask questions about the assignment in the Assignments section in the
Ed Forum accessible on the unit's Moodle Forum page. This is the preferred venue
for assignment clarification-type questions. You should check this forum regularly, as
the responses of the teaching staff are "official" and can constitute amendments or
additions to the assignment specification. Also, you can attend scheduled
consultation sessions if the problem or confusion is still not resolved.
Plagiarism and collusion
Plagiarism and collusion are serious academic offences at Monash University.
Students must not share their work with any other students. Students should consult
the policy linked below for more information.
https://www.monash.edu/students/academic/policies/academic-integrity
See also the video linked on the Moodle page under the Assignment
block.
Students involved in collusion or plagiarism will be subject to disciplinary
penalties, which can include:
● The work not being assessed
● A zero grade for the unit
● Suspension from the University
● Exclusion from the University
Generative AI Statement
As per the University’s policy on the guidelines and practices pertaining to the usage
of Generative AI:
AI & Generative AI tools may be used SELECTIVELY within this assessment.
Where used, AI must be used responsibly, clearly documented and appropriately
acknowledged (see Learn HQ).
Any work submitted for a mark must:
1) Represent a sincere demonstration of your human efforts, skills and subject
knowledge that you will be accountable for.
2) Adhere to the guidelines for AI use set for the assessment task.
3) Reflect the University’s commitment to academic integrity and ethical
behaviour.
Inappropriate AI use and/or AI use without acknowledgement will be
considered a breach of academic integrity.
The teaching team encourages students to apply their own critical thinking and
reasoning skills when working on the assessments with assistance from GenAI.
Generative AI tools may produce inaccurate content, which could negatively impact
students’ comprehension of big data topics.
Data source acknowledgement:
The dataset is a remix based on several real-world datasets. All name, age, dob,
salary etc. values are randomly generated synthetic data.
Appendix: Metadata of the Dataset Schema
geolocation.csv
geoid      UUID of geolocation
latitude   Latitude, Decimal(8,6)
longitude  Longitude, Decimal(8,6)
district   Whether a location is considered an urban or metropolitan area (String)
loc        Geometry object of the geolocation
delivery_person.csv
person_id          Unique identifier of delivery person/driver (String)
age                Delivery driver’s age (Integer)
rating             Overall rating of the driver (float, 0-5 scale)
vehicle_condition  A driver’s vehicle condition (from 0-Good, 1-Fair, 2-Poor)
type_of_vehicle    Motorcycle, Scooter, electric_scooter, etc. (String)
delivery_order.csv
order_id            Unique identifier of an order
delivery_person_id  Unique ID of the driver delivering an order
order_ts            Timestamp when an order is placed
ready_ts            Timestamp when a restaurant finishes preparing an order, i.e. ready for the delivery driver to pick up
weather_condition   Weather condition at the time of order (Sunny, Windy, Storm, etc.) (String)
road_condition      Road traffic conditions (Low, Medium, Jam, etc.)
type_of_order       Snacks, Meal, Drinks, etc. (String)
is_festival         Whether the current date is a festival (note: during a festival, restaurants are busier than usual and it may take them longer to prepare an order)
time_taken          Total time taken for an order (i.e. from order placement to delivery, measured in minutes)
restaurant_geoid    Geolocation of a restaurant
delivery_geoid      Geolocation of a delivery