Monash University

FIT5202 - Data processing for Big Data (SSB 2025)

Assignment 1: Analysing Food Delivery Data

Due Date: 23:55 Friday 17/Jan/2025 (End of week 3)

Weight: 10% of the final marks

Background

Food delivery services have become an integral part of modern society, revolutionizing the way we consume meals and interact with the food industry. These platforms, accessible through websites and mobile apps, provide a convenient bridge between restaurants and consumers, allowing users to browse menus, place orders, and have food delivered directly to their doorstep with just a few taps. In today's fast-paced world, where time is a precious commodity, food delivery services offer an invaluable solution, catering to busy lifestyles, limited mobility, and the ever-present desire for convenience. They empower individuals to enjoy a diverse range of cuisines without leaving their homes or offices, support local restaurants by expanding their reach, and have even become a crucial lifeline during times of crisis, such as lockdowns and emergencies, ensuring access to essential sustenance and supporting the economy. As a result of its convenience and the increasing preference for on-demand services, food delivery has become a very important part of modern life, impacting everything from our daily routines to the broader economic landscape.

In the food delivery industry, accurate on-time delivery prediction is paramount. Big data processing allows companies to achieve this by analyzing vast datasets encompassing order details, driver performance, real-time traffic, and even weather. Sophisticated algorithms leverage this data to build predictive models. These models learn from historical trends, for example, a restaurant's longer preparation times during peak hours or a driver's faster navigation in specific areas. Real-time data, like driver GPS location and live traffic, further refine these predictions, enabling dynamic adjustments to estimated delivery times.

The benefits are substantial. Firstly, customer satisfaction improves with reliable delivery estimates and transparent communication regarding delays. Secondly, operational efficiency increases through optimized driver scheduling and route planning, leading to reduced costs and faster deliveries. Furthermore, accurate predictions empower proactive measures to mitigate delays. The system can alert customers of potential issues, offer compensation, and trigger interventions like expediting order preparation. If an order is not delivered on time, quality after-sales service should follow, such as offering refunds, providing future discounts, or simply offering a sincere apology.

By mastering on-time delivery prediction through big data, food delivery companies gain a crucial competitive edge. They can meet and exceed customer expectations, foster loyalty, and drive sustainable growth in a demanding market. As the industry evolves, leveraging big data for accurate delivery forecasting will remain a key differentiator for success.

This series of assignments will immerse you in the world of big data analytics, specifically within the context of a modern, data-driven application: food delivery services. We will explore the entire lifecycle of data processing, from analyzing historical information to building and deploying real-time machine learning models. Each assignment builds upon the last, culminating in a comprehensive understanding of how big data technologies can be leveraged to optimize performance and enhance user experience.

In the first assignment (A1), we will delve into historical datasets, performing data analysis to uncover key trends and patterns related to delivery times, order volumes, and other crucial metrics. This foundational understanding will pave the way for assignment 2A, where we will harness the power of Apache Spark's MLlib to construct and train machine learning models, focusing on predicting delivery times with accuracy and efficiency. Finally, assignment 2B will elevate our analysis to the real-time domain, utilizing Apache Spark Structured Streaming to process live data streams and dynamically adjust predictions, providing a glimpse into the cutting-edge techniques driving modern, responsive applications. Through this hands-on journey, you will gain practical experience with industry-standard tools and develop a strong conceptual understanding of how big data powers the dynamic world of on-demand services.

In A1, we will perform historical data analysis using Apache Spark. We will use the RDD, DataFrame and SQL APIs learnt from topics 1-4.

The Dataset

The dataset can be downloaded from Moodle.

You will find the following files after extracting the zip file:

1) delivery_order.csv: Contains food order records.
2) geolocation.csv: Contains geographical information about restaurants and delivery locations.
3) delivery_person.csv: Contains basic driver information, their rating and vehicle information.

The metadata of the dataset can be found in the appendix at the end of this document.

(Note: The dataset is a mixture of real-life and synthetic data, therefore some anomalies may exist in the dataset. Data cleansing is not mandatory in this assignment.)

Assignment Information

The assignment consists of three parts: Working with RDD, Working with Dataframes, and Comparison of three forms of Spark abstractions. In this assignment, you are required to implement various solutions based on RDDs and DataFrames in PySpark for the given queries related to eCommerce data analysis.

Getting Started

● Download your dataset from Moodle.
● Download a template file for submission purposes:
    ● A1_template.ipynb file in Jupyter notebook to write your solution. Rename it following the format A1_xxx0000.ipynb, where xxx0000 is your authcate. This file contains your code solution.
● For this assignment, you will use Python 3+ and PySpark 3.5.0. (The environment is provided as a Docker image, the same one you use in labs.)

Part 1: Working with RDDs (30%)

In this section, you need to create RDDs from the given datasets, perform partitioning in these RDDs and use various RDD operations to answer the queries.

1.1 Data Preparation and Loading (5%)

1. Write the code to create a SparkContext object using SparkSession. To create a SparkSession, you first need to build a SparkConf object that contains information about your application. Use Melbourne time as the session timezone. Give your application an appropriate name and run Spark locally with 4 cores on your machine.
2. Load the CSV files into multiple RDDs.
3. For each RDD, remove the header rows and display the total count and first 10 records.
4. Drop records with invalid information (NaN or Null) in any column.

1.2 Data Partitioning in RDD (15%)

1. For each RDD, using Spark’s default partitioning, print out the total number of partitions and the number of records in each partition. (5%)
2. Answer the following questions:
    a. How many partitions do the above RDDs have?
    b. How is the data in these RDDs partitioned by default, when we do not explicitly specify any partitioning strategy? Can you explain why it is split into this number of partitions?
    c. Assuming we are querying the dataset based on order timestamp, can you think of a better strategy for partitioning the data based on your available hardware resources?
   Write your explanation in Markdown cells. (5%)
3. Create a user-defined function (UDF) to transform a timestamp to ISO format (YYYY-MM-DD HH:mm:ss), then call the UDF to transform two timestamps (order_ts and ready_ts) to order_datetime and ready_datetime. (5%)
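
The formatting step in item 3 can be tried in plain Python before it is wrapped for Spark. This is a minimal sketch under assumptions that this specification does not confirm: that order_ts and ready_ts are Unix epoch seconds, and that UTC is acceptable for illustration (section 1.1 asks for Melbourne time as the session timezone). The function name to_iso_datetime is hypothetical.

```python
from datetime import datetime, timezone


def to_iso_datetime(ts):
    """Format a Unix epoch timestamp (seconds) as 'YYYY-MM-DD HH:MM:SS'.

    Assumption: the input is epoch seconds, given as a number or a numeric
    string (as it would be when read from a CSV line). UTC is used only
    for illustration.
    """
    moment = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(to_iso_datetime(0))
print(to_iso_datetime("86400"))
```

With RDDs, a function like this could be applied inside a map over the parsed records; with DataFrames, it could be wrapped with pyspark.sql.functions.udf.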

1.3 Query/Analysis (10%)

For this part, write relevant RDD operations to answer the following questions.

1. Extract weekday (Monday-Sunday) information from orders and print the total number of orders each weekday. (5%)
2. Show a list of type_of_order and average preparation time in minutes (ready_ts - order_ts). (5%)

Part 2: Working with DataFrames (45%)

In this section, you need to load the given datasets into PySpark DataFrames and use DataFrame functions to answer the queries.

2.1 Data Preparation and Loading (5%)

1. Load the CSV files into separate dataframes. When you create your dataframes, please refer to the metadata file and think about the appropriate data type for each column.
2. Display the schema of the dataframes.

When the dataset is large, do you need all columns? How can memory usage be optimized? Do you need a customized data partitioning strategy? (Note: Think about these questions, but you don’t need to answer them.)

2.2 Query/Analysis (40%)

Implement the following queries using dataframes. You need to be able to perform operations like transforming, filtering, sorting, joining and group by using the functions provided by the DataFrame API.

1. Write a function to encode/transform weather conditions to Integers and drop the original string. You can decide your own encoding scheme. (e.g. Sunny = 0, Cloudy = 1, Fog = 2, etc.) (5%)
2. Calculate the number of orders for each hour. Show the results in a table and plot a bar chart. (5%)
3. Join the delivery_order with geolocation dataframe, calculate the distance between a restaurant and the delivery location, and store the distance in a new column named delivery_distance. (hint: You may need to install an additional library like GeoPandas to calculate the distance between two points). (5%)
4. Using the data from 3, find the top 10 drivers travelling the longest distance. (5%)
5. For each type of order, plot a histogram of meal preparation time. The plot can be done with multiple legends or sub-plots. (note: you can decide your bin size). (10%)
6. (Open Question) Explore the dataset and use a delivery person’s rating as a performance indicator. Is a lower rating usually correlated to a longer delivery time? What might be the contributing factors to the low ratings of drivers? Please include one plot and discussion based on your observation (no word limit but please keep it concise). (10%)
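
For the distance calculation in item 3 above, one library-free option is the haversine great-circle formula. The sketch below is an illustration only and not the required method (the hint suggests a library such as GeoPandas). The function name haversine_km and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude along the equator is roughly 111.19 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```

The haversine formula treats the Earth as a sphere, so results can differ slightly from a geodesic distance computed by a geospatial library.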

Part 3: RDDs vs DataFrame vs Spark SQL (25%)

Implement the following query using RDDs, DataFrames and Spark SQL separately. Log the time taken for each query in each approach using the “%%time” built-in magic command in Jupyter Notebook and discuss the performance difference between these 3 approaches.

(Complex Query) Calculate the time taken on the road (defined as the total time taken minus restaurants’ order preparation time, i.e., total time - (ready_ts - order_ts)). For each road_condition, using a 10-minute bucket size of time on the road (e.g. 0-10, 10-20, 20-30, etc.), show the percentage of each bucket.

(note: You can reuse the loaded data/variables from part 1 & 2.)

(hint: You may create intermediate RDD/dataframes for this query.)

1) Implement the above query using RDDs, DataFrame and SQL separately and print the results. (Note: The three different approaches should have the same results). (15%)
2) Which one is the easiest to implement in your opinion? Log the time taken for each query and observe the query execution time. Among RDD, DataFrame, and Spark SQL, which is the fastest and why? Please include proper references. (Maximum 500 words.) (10%)
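
The bucketing logic of the complex query can be checked on a few rows in plain Python before it is written three times in Spark. The sketch below uses made-up records and assumes, without confirmation from the metadata, that order_ts and ready_ts are epoch seconds while time_taken is in minutes. The function name bucket_percentages is hypothetical.

```python
from collections import Counter, defaultdict


def bucket_percentages(records, bucket_size=10):
    """Return {road_condition: {bucket_label: percentage}} for time on the road.

    Each record is (road_condition, time_taken_min, order_ts, ready_ts).
    Assumption: order_ts and ready_ts are epoch seconds.
    """
    counts = defaultdict(Counter)
    for road_condition, time_taken, order_ts, ready_ts in records:
        prep_minutes = (ready_ts - order_ts) / 60
        on_road = time_taken - prep_minutes
        # Floor to the lower edge of the bucket, e.g. 18 -> 10 for size 10.
        start = int(on_road // bucket_size) * bucket_size
        counts[road_condition][f"{start}-{start + bucket_size}"] += 1

    result = {}
    for road_condition, buckets in counts.items():
        total = sum(buckets.values())
        result[road_condition] = {
            label: round(100 * n / total, 2) for label, n in buckets.items()
        }
    return result


# Made-up sample rows, for illustration only.
sample = [
    ("Low", 25, 0, 600),   # 25 - 10 = 15 minutes on the road
    ("Low", 40, 0, 900),   # 40 - 15 = 25 minutes on the road
    ("Low", 28, 0, 600),   # 28 - 10 = 18 minutes on the road
    ("Jam", 50, 0, 1200),  # 50 - 20 = 30 minutes on the road
]
print(bucket_percentages(sample))
```

A small reference like this gives a result to compare against, since the RDD, DataFrame and SQL versions are expected to agree with each other.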

Submission

You should submit your final version of the assignment solution online via Moodle. You must submit the files created:

- Your jupyter notebook file (e.g., A1_authcate.ipynb).
- A pdf file saved from jupyter notebook with all output, following the file naming format: A1_authcate.pdf

Note that both submitted (jupyter and pdf) files will be scanned using plagiarism detection software. Students with the highest similarity scores may be interviewed to prove the originality of their work.

Assignment Marking Rubric

For each task individually, you’ll be marked based on the quality of your work on a 3-level scale (0%, 50% and 100%).

● 0%: No attempt or incorrect answer with poor attempt;
● 50%: Partial mark for a good attempt but incorrect result;
● 100%: Full mark for correct attempt.

In your submission, the jupyter notebook file should contain the code and its output. It should follow programming standards, with readable and well-organized code. Please refer to PEP 8 -- Style Guide for Python Code. Here is the link: https://peps.python.org/pep-0008/ A penalty of up to 10% applies if your code is hard to understand and has insufficient comments.

Late submissions

Late assignments or extensions will not be accepted unless you submit a special consideration form. ALL Special Consideration, including within the semester, is now to be submitted centrally. This means that students MUST submit an online Special Consideration form via Monash Connect. For more details, please refer to the Unit Information section in Moodle.

A late submission is subject to a 5% penalty per day, including weekends. The cut-off date is 7 days after the due date. No submission will be accepted after the cut-off date (i.e. it receives a mark of 0) unless you have a special consideration.

Mark Release and Review

● Marks will be released within 10 business days after the submission deadline.
● Reviews and disputes regarding the mark will be accepted for a maximum of 7 days after the release date (including weekends).

Other Information

Where to get help

You can ask questions about the assignment in the Assignments section in the Ed Forum accessible on the unit's Moodle Forum page. This is the preferred venue for assignment clarification-type questions. You should check this forum regularly, as the responses of the teaching staff are "official" and can constitute amendments or additions to the assignment specification. Also, you can attend scheduled consultation sessions if the problem and the confusion are still not solved.

Plagiarism and collusion

Plagiarism and collusion are serious academic offences at Monash University. Students must not share their work with any other students. Students should consult the policy linked below for more information.

https://www.monash.edu/students/academic/policies/academic-integrity

See also the video linked on the Moodle page under the Assignment block.

Students involved in collusion or plagiarism will be subject to disciplinary penalties, which can include:

● The work not being assessed
● A zero grade for the unit
● Suspension from the University
● Exclusion from the University

Generative AI Statement

As per the University’s policy on the guidelines and practices pertaining to the usage of Generative AI:

AI & Generative AI tools may be used SELECTIVELY within this assessment.

Where used, AI must be used responsibly, clearly documented and appropriately acknowledged (see Learn HQ).

Any work submitted for a mark must:

1) Represent a sincere demonstration of your human efforts, skills and subject knowledge that you will be accountable for.
2) Adhere to the guidelines for AI use set for the assessment task.
3) Reflect the University’s commitment to academic integrity and ethical behaviour.

Inappropriate AI use and/or AI use without acknowledgement will be considered a breach of academic integrity.

The teaching team encourages students to apply their own critical thinking and reasoning skills when working on the assessments with assistance from GenAI. Generative AI tools may produce inaccurate content, which could negatively impact students’ comprehension of big data topics.

Data source acknowledgement:

The dataset is a remix based on several real-world datasets. All name, age, dob, salary, etc. values are randomly generated synthetic data.

Appendix: Metadata of the Dataset Schema

geolocation.csv

geoid       UUID of geolocation
latitude    Latitude, Decimal(8,6)
longitude   Longitude, Decimal(8,6)
district    Whether a location is considered an urban or metropolitan area (String)
loc         Geometry object of the geolocation

delivery_person.csv

person_id          Unique identifier of delivery person/driver (String)
age                Delivery driver’s age (Integer)
rating             Overall rating of the driver (float, 0-5 scale)
vehicle_condition  A driver’s vehicle condition (from 0-Good, 1-Fair, 2-Poor)
type_of_vehicle    Motorcycle, Scooter, electric_scooter, etc. (String)

delivery_order.csv

order_id            Unique identifier of an order
delivery_person_id  Unique ID of the driver delivering an order
order_ts            Timestamp when an order is placed
ready_ts            Timestamp when a restaurant finishes preparing an order, i.e. ready for the delivery driver to pick up
weather_condition   Weather condition at the time of order (Sunny, Windy, Storm, etc.) (String)
road_condition      Road traffic conditions (Low, Medium, Jam, etc.)
type_of_order       Snacks, Meal, Drinks, etc. (String)
is_festival         Whether the current date is a festival (note: during a festival, restaurants are busier than usual and it may take them longer to prepare an order)
time_taken          Total time taken for an order (i.e. from order placement to delivered, measured in minutes)
restaurant_geoid    Geolocation of a restaurant
delivery_geoid      Geolocation of a delivery
