FIT5196-S2-2024 Assessment 1 (35%)
This is a group assessment and worth 35% of your total mark for
FIT5196.
Due date: 23:55, Friday, August 30, 2024
Text documents, such as those derived from web crawling, typically consist of topically
coherent content. Within each segment of topically coherent data, word usage exhibits more
consistent lexical distributions compared to the entire dataset. For text analysis tasks, such
as passage retrieval in information retrieval (IR), document summarization, recommender
systems, and learning-to-rank methods, a linear partitioning of texts into topic segments is
effective. In this assessment, your group is required to successfully complete all five tasks
listed below to achieve full marks.
Task 1: Parsing Raw Files (8/35)
This task touches on the very first step of analysing textual data, i.e., extracting data from
semi-structured text files.
Allowed libraries: re, json, pandas, datetime, os
Input Files / Output Files (submission)
group
group
task1_
task1_
task1_
task1_
(the
ie.001,010...)
Your group is provided with review information on Google Maps from California (ratings, text,
images, etc.). Please use the input data files with your group_number, i.e.
group
(student_data).
Note: Using a wrong input dataset will result in ZERO marks for ‘Output’ in the marking
rubric.
Your dataset contains a subset of records of the Google Maps reviews. The Google Maps
reviews are recorded with a set of attributes:
● user_id - ID of the reviewer
● name - name of the reviewer
● time - time of the review (unix time)
● rating - rating of the business
● text - text of the review
● pics - pictures of the review
● resp - business response to the review including unix time and text of the
response
● gmap_id - ID of the business
Please check with the sample input files (sample_input) for all the available attributes.
Your task is to extract the data from all of your input files (including an Excel file and 15 .txt
files following a mis-structured XML format). You are asked to extract and transform the data
into a CSV file and a JSON file.
For the CSV file, you are required to produce an output file with the following columns:
● gmap_id: the ID of the business.
● review_count: the total number of reviews for a business.
● review_text_count: the number of reviews that contain text.
● response_count: the number of responses from a business.
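The three counts can be gathered in a single pass over the parsed reviews. The following is a minimal sketch only: it assumes each parsed review is a dict with the hypothetical keys `gmap_id`, `text` and `resp` (named after the attribute list above), and that a missing text or response is stored as `None`. How your own parser represents missing values is up to you.

```python
from collections import defaultdict

def count_reviews(reviews):
    """Aggregate per-business counts from a list of parsed review dicts."""
    counts = defaultdict(lambda: {"review_count": 0,
                                  "review_text_count": 0,
                                  "response_count": 0})
    for review in reviews:
        row = counts[review["gmap_id"]]
        row["review_count"] += 1
        # Assumption: a missing review text or response is stored as None.
        if review.get("text") is not None:
            row["review_text_count"] += 1
        if review.get("resp") is not None:
            row["response_count"] += 1
    return dict(counts)

# Invented sample records, for illustration only.
sample = [
    {"gmap_id": "a1", "text": "great food", "resp": None},
    {"gmap_id": "a1", "text": None, "resp": {"time": 0, "text": "thanks"}},
    {"gmap_id": "b2", "text": "ok", "resp": None},
]
result = count_reviews(sample)
```

Since pandas is an allowed library, a dictionary of this shape could then be turned into a table with `pandas.DataFrame.from_dict(result, orient="index")` before writing the CSV file.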
For the JSON file, you are required to produce an output file with the following fields:
● gmap_id: the ID of the business
● reviews: a root element with one or more reviews, containing the fields:
○ user_id: ID of the reviewer.
○ time: the time of the review. (output format: UTC time in YYYY-MM-DD
tt:hh:ss)
○ review_rating: rating of the business.
○ review_text: the English review text. If a review is in another language, only
extract the English translation version. Emojis must be removed before output,
based on the emoji list in the Task 1 guidelines, and the output text should be
normalised to lowercase. Place “None” as the value, if there is no review.
○ If_pic: whether the reviewer included pictures. (output format: Y/N)
○ pic_dim: the dimensions of the pictures in a list of tuples, e.g. [[h,w],[h,w]...]. Place
[] as the value, if there is no picture.
○ If_response: whether the review has a response (output format: Y/N)
● earliest_review_date: the earliest review date for a given business in the given data
subset. (output format: UTC time in YYYY-MM-DD tt:hh:ss)
● latest_review_date: the latest review date for a given business in the given data
subset. (output format: UTC time in YYYY-MM-DD tt:hh:ss)
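The time fields all require a unix timestamp rendered as a UTC string. A minimal sketch is given below under two assumptions that you should verify against your own data and the sample output: that the timestamps are in milliseconds (the `unit` parameter is hypothetical and lets you switch to seconds), and that the required layout corresponds to the `strftime` pattern `%Y-%m-%d %H:%M:%S`.

```python
from datetime import datetime, timezone

def to_utc_string(unix_time, unit="ms"):
    """Format a unix timestamp as a UTC date-time string."""
    # Assumption: timestamps are in milliseconds unless unit="s" is passed.
    seconds = unix_time / 1000 if unit == "ms" else unix_time
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Assumption: the required layout maps to this strftime pattern.
    return moment.strftime("%Y-%m-%d %H:%M:%S")

def review_dates(unix_times, unit="ms"):
    """Return (earliest, latest) review dates for one business."""
    return (to_utc_string(min(unix_times), unit),
            to_utc_string(max(unix_times), unit))

earliest, latest = review_dates([1_600_000_000_000, 0])
```

Passing `tz=timezone.utc` explicitly avoids silently converting to the local time zone of the machine that runs the notebook.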
VERY IMPORTANT NOTE:
1. All the tag names are case-sensitive in the output JSON file. You can refer to the
sample output for the correct JSON file structure.
2. The sample output files are just for you to understand the structure of the required
output and the correctness of their content in Task 1 is not guaranteed. So please do
not try to reverse engineer the outputs as it will fail to generate the correct content.
Task 1 Guidelines
To complete the above task, please follow the steps below:
Step 0: Study the sample files
● Open and check your input .txt file and try to find any ‘potential interesting’ patterns
for different data elements
Step 1: Txt file parsing and Excel file parsing
● Load the input file
● Use Regex to extract the required attributes and their values as listed from the txt file
● Extract necessary data from the Excel file
● Combine all data together
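As an illustration of the regex step, the sketch below pulls one field out of a record while tolerating inconsistent tag case and stray whitespace. The sample fragment and the kinds of irregularity in it are invented for this example; the actual patterns in your input files may differ and are for you to discover in Step 0.

```python
import re

def extract_field(record, tag):
    """Return the text between <tag> and </tag>, or None if absent."""
    pattern = re.compile(
        rf"<\s*{re.escape(tag)}\s*>(.*?)<\s*/\s*{re.escape(tag)}\s*>",
        flags=re.IGNORECASE | re.DOTALL,  # any tag case, multi-line values
    )
    match = pattern.search(record)
    return match.group(1).strip() if match else None

# Invented fragment: tag case and spacing vary on purpose.
sample_record = """
<record>
  <User_ID> 1234 </user_id>
  <time>1600000000000</TIME>
  <rating>5</rating>
  <text>Great food
and friendly staff</text>
</record>
"""
fields = {tag: extract_field(sample_record, tag)
          for tag in ("user_id", "time", "rating", "text", "resp")}
```

The non-greedy `(.*?)` stops at the first closing tag, and `re.DOTALL` lets a value span several lines. A field that is absent from the record comes back as `None`.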
Step 2: Further process the extracted text from Step 1
● Remove any duplicates if needed
● Further process the extracted data
● Note for review texts: they should be transformed into lowercase and contain no emojis
○ To remove emojis, make sure your text data is in utf-8 format
○ The list of emojis to remove is:
"["
"\U0001F600-\U0001F64F"
"\U0001F300-\U0001F5FF"
"\U0001F680-\U0001F6FF"
"\U0001F1E0-\U0001F1FF"
"\U00002702-\U000027B0"
"\U000024C2-\U0001F251"
"]+"
Step 3: File output
● Output the required files based on the specified structures provided above, and make
sure your data is utf-8 encoded.
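The emoji list in Step 2 reads as the pieces of a single regular-expression character class. The sketch below compiles those pieces as given and applies them together with lowercasing. The function name `clean_review_text` is hypothetical, and returning the string "None" for a missing review follows the JSON field description above.

```python
import re

# The ranges are copied from the emoji list in Step 2.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+",
    flags=re.UNICODE,
)

def clean_review_text(text):
    """Remove listed emojis and lowercase; a missing review becomes "None"."""
    if text is None:
        return "None"
    return EMOJI_PATTERN.sub("", text).lower()

cleaned = clean_review_text("Great Food \U0001F600!")
```

Adjacent string literals are concatenated by Python, so the pattern is one character class followed by `+`.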
Submission Requirements
You need to submit 4 files:
● A task1_<group_number>.json file containing the correct review information with all
the elements listed above.
● A task1_<group_number>.csv file containing the correct review information with all
the elements listed above.
● A Python notebook named task1_<group_number>.ipynb containing a
well-documented report that demonstrates your solution to Task 1. You need to
clearly present the methodology, that is, the entire step-by-step process of your
solution with appropriate comments and explanations. You can follow the suggested
steps in the guideline above. Please keep this notebook easy to read, as you will
lose marks if we cannot understand it (make sure you PRINT OUT your cell output).
● A task1_<group_number>.py file. This file will be used for the plagiarism check (make
sure you clear your cell output before exporting).
Requirements on the Python notebook (report)
● Methodology - 35%
○ You need to demonstrate your solution using correct regular expressions.
Showing results from each step could help to demonstrate your solution better and
make it easier to understand.
○ You should present your solution in a proper way including all required
steps. Skipping any steps will incur a grade penalty.
○ You need to select and use the appropriate Python functions for input,
process and output.
○ Your solution should be an efficient one without redundant operations and
unnecessary reading and writing of the data.
● Report organisation and writing - 15%
○ The report should be organised in a proper structure to present your
solutions to Task 1 with clear and meaningful titles for sections and
subsections or sub-subsections if needed.
○ Each step in your solution should be clearly described. For example, you
can write to explain your idea of the solution, any specific settings, and the
reason for using a particular function, etc.
○ Explanation of your results including all intermediate steps is required.
This can help the marking team to understand your solution and give
partial marks if the final results are not fully correct.
○ All your code needs proper (but not excessive) commenting.
○ You can refer to the notebook templates provided as a guideline for a
properly formatted notebook report.
Task 2: Text Pre-Processing (12/35)
This task involves the next step in textual data analysis: converting extracted text into a
numerical representation for downstream modelling tasks. You are required to write Python
code to preprocess Google Reviews text from Task 1 and transform it into numerical
representations. These numerical representations are the standard format for text data,
suitable for input into NLP systems such as recommender systems, information retrieval
algorithms, and machine translation. The most fundamental step in natural language
processing (NLP) tasks is converting words into numbers to enable machines to understand
and decode patterns within a language. This step, although iterative, is crucial in determining
the features for your machine learning models and algorithms.
Allowed libraries: ALL
Input Files / Output Files (submission)
task1_
task1_
task2_
task2_
In this task you are required to continue working with the data from Task 1.
You are asked to use the review fields in all the reviews from all the businesses that have at
least 70 text reviews. Then pre-process the review text and generate a vocabulary list and
numerical representation for the corresponding text, which will be used in the model training
by your colleagues. The information regarding output files is listed below:
●
alphabetically, presented in the format of token:token_index
●
organised by gmap_id and token index, following the format gmap_id,
token_index:frequency.
Carefully examine the sample output files (here) for detailed information about the output
structure.
VERY IMPORTANT NOTE: The sample outputs are just for you to understand the structure
of the required output and the correctness of their content in Task 2 is not guaranteed. So
please do not try to reverse engineer the outputs as it will fail to generate the correct content.
Task 2 Guideline
To complete the above task, please follow the steps below:
Step 1: Text extraction
● You are required to extract the review text from the output of Task 1.
● Validate your review text data: the text data should be all in English and in lowercase
with no emojis.
● You are only required to extract the vocab and countvec lists for reviews from
businesses that have at least 70 text reviews
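Selecting the qualifying businesses could look like the sketch below. It assumes, hypothetically, that the Task 1 JSON has been loaded into a dict keyed by `gmap_id`, that each value holds a `reviews` list, and that a review without text carries the string "None" as described in Task 1. Adapt the key names to your actual output structure.

```python
def businesses_with_enough_text(task1_data, minimum=70):
    """Keep the review texts of businesses with at least `minimum` text reviews."""
    selected = {}
    for gmap_id, business in task1_data.items():
        # Assumption: reviews without text carry the string "None".
        texts = [review["review_text"] for review in business["reviews"]
                 if review["review_text"] != "None"]
        if len(texts) >= minimum:
            selected[gmap_id] = texts
    return selected

# Invented toy data; the threshold is lowered to 2 so the example stays small.
toy = {
    "a1": {"reviews": [{"review_text": "good"}, {"review_text": "None"},
                       {"review_text": "nice place"}]},
    "b2": {"reviews": [{"review_text": "None"}]},
}
picked = businesses_with_enough_text(toy, minimum=2)
```

Note that the threshold applies to reviews that contain text, not to the total number of reviews.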
Step 2: Generate the unigram and bigram lists and output as vocab.txt
● The following steps must be performed (not necessarily in the same order) to
complete the assessment. Please note that the order of preprocessing matters and
will result in different vocabulary and hence different count vectors. It is part of the
assessment to figure out the correct order of preprocessing which makes the
most sense as we learned in the unit. You are encouraged to ask questions and
discuss them with the teaching team if in doubt.
a. The word tokenization must use the following regular expression, "[a-zA-Z]+"
b. The context-independent and context-dependent stop words must be removed
from the vocabulary.
■ For context-independent stop words, the provided context-independent stop
words list (i.e., stopwords_en.txt) must be used.
■ For context-dependent stop words, you must set the threshold to
words that appear in more than 95% of the businesses that have at
least 70 text reviews.
c. Tokens should be stemmed using the Porter stemmer.
d. Rare tokens must be removed from the vocab (with the threshold set to be
words that appear in less than 5% of the businesses that have at least 70
text reviews).
e. Tokens with a length less than 3 should be removed from the vocab.
f. The first 200 meaningful bigrams (i.e., collocations) must be included in the
vocab using the PMI measure; make sure the collocations are
collocated within the same review.
g. Calculate the vocabulary containing both unigrams and bigrams.
● Combine the unigrams and bigrams, sort the list alphabetically in ascending order
and output as vocab.txt
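Two of the steps above rest on small calculations that are easy to get subtly wrong: the share of businesses a token appears in (used by the 95% and 5% thresholds) and the PMI score of a bigram. The sketch below illustrates both with the standard library only. It deliberately leaves out stop-word removal, stemming and the ordering of steps, which remain part of the assessment. The PMI here is one common definition, log2 of P(x, y) / (P(x) P(y)) estimated from raw counts; a library implementation such as NLTK's `BigramAssocMeasures.pmi` may normalise slightly differently. Function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-zA-Z]+")  # the tokenization regex required above

def business_fraction(tokens_by_business):
    """Fraction of businesses in which each token appears at least once."""
    doc_freq = Counter()
    for tokens in tokens_by_business.values():
        doc_freq.update(set(tokens))  # count each business once per token
    total = len(tokens_by_business)
    return {token: n / total for token, n in doc_freq.items()}

def pmi_scores(reviews):
    """PMI of adjacent token pairs; pairs never span two reviews."""
    unigrams, bigrams = Counter(), Counter()
    for review in reviews:
        tokens = TOKEN_RE.findall(review)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        pair: math.log2((count / n_bi)
                        / ((unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni)))
        for pair, count in bigrams.items()
    }

# Invented toy data, for illustration only.
fractions = business_fraction({"a1": ["ice", "cream", "good"],
                               "b2": ["good", "ice"],
                               "c3": ["good"]})
scores = pmi_scores(["ice cream ice cream", "good good"])
```

Building bigrams review by review is what keeps a pair such as ("cream", "good") from being formed across a review boundary. A token would then count as a context-dependent stop word when its fraction exceeds 0.95, and as rare when it falls below 0.05.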
Step 3: Generate the sparse numerical representation and output as countvec.txt
1. Generate the sparse representation by using scikit-learn's CountVectorizer OR directly
count the frequency using NLTK's FreqDist().
2. Output the sparse numerical representation into a txt file with the following format:
gmap_id1,token1_index:token1_frequency,token2_index:token2_frequency,
token3_index:token3_frequency,...
gmap_id2,token2_index:token2_frequency,token5_index:token5_frequency,
token7_index:token7_frequency,...
gmap_id3,token6_index:token6_frequency,token9_index:token9_frequency,
token12_index:token12_frequency,...
Note: the token_index comes from vocab.txt, and make sure you are counting
bigrams
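The relationship between vocab.txt and countvec.txt can be sketched as follows. Several details are assumptions to check against the sample output: that indices start at 0, that a bigram is stored as two words joined by an underscore, and that fields are separated by a bare comma. The helper names are hypothetical.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct token to its index in ascending alphabetical order."""
    # Assumption: indices start at 0.
    return {token: index for index, token in enumerate(sorted(set(tokens)))}

def countvec_line(gmap_id, tokens, vocab):
    """Format one business as 'gmap_id,index:frequency,...' without zero counts."""
    counts = Counter(token for token in tokens if token in vocab)
    pairs = sorted((vocab[token], freq) for token, freq in counts.items())
    return ",".join([gmap_id] + [f"{index}:{freq}" for index, freq in pairs])

# Invented tokens; "ice_cream" stands in for a bigram (assumed representation).
vocab = build_vocab(["pizza", "ice_cream", "staff"])
vocab_lines = [f"{token}:{index}" for token, index in vocab.items()]
line = countvec_line("a1", ["staff", "pizza", "staff", "ice_cream", "unknown"], vocab)
```

Because only tokens that actually occur are counted, entries with zero frequency never appear in the output line, and tokens outside the vocabulary are ignored.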
Submission Requirements
You need to submit 4 files:
1. A
the following format, token:token_index. Words in the vocabulary must be sorted in
alphabetical order.
2. A
representations of one of the channel in the following format:
gmap_id1,token1_index:token1_wordcount,token2_index:token2_wordcount,...
Please note: the tokens with zero word count should NOT be included in the sparse
representation.
3. A task2_
and the methodology. (make sure you PRINT OUT your cell outputs)
4. A task2_
cell outputs)
Requirements on the Python notebook (report)
● Methodology - 35%
○ You need to demonstrate your solution using correct regular expressions.
○ You should present your solution in a proper way including all required
steps.
○ You need to select and use the appropriate Python functions for input,
process and output.
○ Your solution should be an efficient one without redundant operations and
unnecessary reading and writing of the data.
● Report organisation and writing - 15%
○ The report should be organised in a proper structure to present your
solutions to Task 2 with clear and meaningful titles for sections and
subsections or sub-subsections if needed.
○ Each step in your solution should be clearly described. For example, you
can write to explain your idea of the solution, any specific settings, and the
reason for using a particular function, etc.
○ Explanation of your results including all intermediate steps is required.
This can help the marking team to understand your solution and give
partial marks if the final results are not fully correct.
○ All your code needs proper (but not excessive) commenting.
○ You can refer to the notebook templates provided as a guideline for a
properly formatted notebook report.
Task 3: Data Exploratory Analysis (12/35)
In this task, you are asked to conduct a comprehensive exploratory data analysis (EDA) on
the provided Google Review data. The goal is to uncover interesting insights that can be
useful for further analysis or decision-making.
Output Files (submission)
task3_
task3_
Task 3 Guideline
To complete the above task, please follow the steps below:
Step 1: Understand the Sample Data:
● Review the data provided from the previous exercise.
● Summarise the key features and variables included in the dataset.
● Identify any initial patterns or trends.
Step 2: Understand the Auxiliary Dataset: Metadata (optional):
● Understand the auxiliary metadata dataset.
● Evaluate the usefulness of the metadata in conjunction with the main dataset.
● Decide whether to incorporate the metadata into your analysis.
Note: you need to perform this step for an HD grade.
Step 3: Data Analysis:
● You can choose to use either the main Google Review data, the auxiliary metadata,
or both.
● Perform an exploratory data analysis to investigate and uncover interesting insights.
● Insights should be data-driven and substantiated by visualisations and/or statistical
summaries.
● You are required to investigate 2-5 interesting insights from the data.
● For an HD grade, you need to have at least 4 meaningful insights with in-depth
discussions (and at least two of them need to come from the combination of
metadata and the review data).
Submission Requirements
You need to submit 2 files:
1. A task3_
and the methodology. (make sure you PRINT OUT your cell outputs)
2. A task3_
cell outputs)
Task 4: Video presentation for Task 3 (2/35)
Create a video presentation (5-8 minutes) to effectively communicate the findings from your
exploratory data analysis (EDA) on the Google Review data. The goal is to present your
methodology and insights in a clear, concise, and engaging manner.
Output Files (submission)
task4_
Submission Requirements
Here are the key components you need to include in your submission:
Introduction:
● Briefly introduce yourself and provide context for the analysis.
● Explain the purpose of the EDA and the datasets used (main Google Review data
and auxiliary metadata, if applicable).
Methodology:
● Describe the steps taken during the data analysis process.
Insights:
● Present 2-5 interesting insights uncovered from the analysis.
● Use visual aids such as charts, graphs, or tables to support your insights.
● Explain the significance of each insight and how it can be applied or interpreted.
Conclusion:
● Summarise the key findings and their potential implications.
● Discuss any limitations of the analysis and suggest areas for further research.
Task 5: Development History (1/35)
For this task, your group is required to provide a comprehensive development history of your
assignment, showcasing incremental progress over at least three different time points.
The purpose of this task is to demonstrate your ability to manage and document the
evolution of your project, including changes made, challenges faced, and collaborative
efforts with your groupmates.
Output Files (submission)
task5_
Requirement:
This task must be completed (HURDLE)
Consequences of missing development history:
Failure to submit your development history for a particular task will result in not meeting the
hurdle requirements for Assignment A2. Consequently, you will receive ZERO for the task.
Submission Requirements
Here are the key components you need to include in your submission (example):
1. Development Timeline: Provide a detailed timeline highlighting at least three
significant time points in the development of your assignment. Each time point should
be accompanied by a description of the changes made and the rationale behind
those changes.
2. Version Screenshots: Include screenshots or snapshots of different versions of
your assignment at each time point. This should clearly illustrate the incremental
development and any modifications made to the project.
3. Collaborative Effort (if you are doing the assignment with another student):
Document the collaborative effort with your groupmates. This can be a description of
the contributions made by each team member or screenshots of proof showcasing
the collaborative effort with your groupmates.
Note: We only require brief descriptions here; they don’t have to be long, as long as they
reflect the key components we listed above. If you use Google Colab, it provides a
comprehensive version history.
Note: to avoid losing important work, you need to pin the version history that you
wish to save. This keeps the notebook version saved permanently.
Instructions - Sharing ipynb link
1. Click on the Share button in the top right corner
2. Make sure that, under the General access section, the permission is set to ‘Monash
University’ and ‘Editor’, then click on `Copy link`
3. Create a markdown cell at the end of your assignment. Paste the share link / create a
hyperlink object.
4. Double check the link to make sure it is working.
Submission Checklist:
Please zip all the submission files for task 1 and 2 into a single file with the name
will be penalised)
There are 12 files in your compressed zip file
0 paddings, i.e. 001, 010 ...)
Make sure both members of your group click the ‘Submit’ button on Moodle
Please strictly follow the file naming standard. Any misnamed file will carry a penalty
Please make sure that your .ipynb file contains printed output, while your .py file
does not include any output
Please ensure that all your files are parsable and readable. You can achieve this by
re-reading all your generated files back into Python (e.g. using read_csv for CSV
files or the json module for JSON). These checks are only sanity checks and hence
should not be added to your final submission
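A parsability check of the kind described above might look like the sketch below. It uses the standard-library `csv` module rather than pandas' `read_csv` only to stay dependency-free. The file names, sample rows and function name are invented, and the check is meant to be run separately from the submitted notebook.

```python
import csv
import json
import os
import tempfile

def roundtrip_ok(rows, records, folder):
    """Write a CSV and a JSON file as UTF-8, then read both back."""
    csv_path = os.path.join(folder, "check.csv")
    json_path = os.path.join(folder, "check.json")

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False)

    with open(csv_path, newline="", encoding="utf-8") as handle:
        csv_back = list(csv.DictReader(handle))
    with open(json_path, encoding="utf-8") as handle:
        json_back = json.load(handle)

    # csv returns every value as a string, so compare on string form.
    csv_same = csv_back == [{k: str(v) for k, v in row.items()} for row in rows]
    return csv_same and json_back == records

# Invented sample content, for illustration only.
rows = [{"gmap_id": "a1", "review_count": 2}]
records = {"a1": {"reviews": [{"review_text": "café was great"}]}}
with tempfile.TemporaryDirectory() as folder:
    ok = roundtrip_ok(rows, records, folder)
```

Opening every file with `encoding="utf-8"` and passing `ensure_ascii=False` to `json.dump` keeps non-ASCII characters readable in the output instead of escaping them.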
Note: All submissions will be put through plagiarism detection software which automatically
checks for their similarity with respect to other submissions. Any plagiarism found will trigger
the Faculty’s relevant procedures and may result in severe penalties, up to and including
exclusion from the university.