
FIT5196-S2-2024 Assessment 1 (35%)

This is a group assessment and is worth 35% of your total mark for FIT5196.

Due date: 23:55, Friday, August 30, 2024

Text documents, such as those derived from web crawling, typically consist of topically coherent content. Within each segment of topically coherent data, word usage exhibits more consistent lexical distributions compared to the entire dataset. For text analysis tasks, such as passage retrieval in information retrieval (IR), document summarization, recommender systems, and learning-to-rank methods, a linear partitioning of texts into topic segments is effective. In this assessment, your group is required to successfully complete all five tasks listed below to achieve full marks.

Task 1: Parsing Raw Files (8/35)

This task touches on the very first step of analysing textual data, i.e., extracting data from semi-structured text files.

Allowed libraries: re, json, pandas, datetime, os

Input Files: group.txt, group.xlsx
Output Files (submission): task1_<group_number>.json, task1_<group_number>.csv, task1_<group_number>.ipynb, task1_<group_number>.py
(the <group_number> has 0 paddings, i.e. 001, 010...)

Your group is provided with review information on Google Maps from California (ratings, text, images, etc.). Please use the input data files with your group_number, i.e. group_x.txt and group.xlsx in the Google Drive folder (student_data).

Note: Using a wrong input dataset will result in ZERO marks for ‘Output’ in the marking rubric.

Your dataset contains a subset of records of the Google Maps reviews. The Google Maps reviews are recorded with a set of attributes:

● user_id - ID of the reviewer
● name - name of the reviewer
● time - time of the review (unix time)
● rating - rating of the business
● text - text of the review
● pics - pictures of the review
● resp - business response to the review, including unix time and text of the response
● gmap_id - ID of the business

Please check with the sample input files (sample_input) for all the available attributes.

Your task is to extract the data from all of your input files (including an Excel file and 15 .txt files following a mis-structured XML format). You are asked to extract and transform the data into a CSV file and a JSON file.

For the CSV file, you are required to produce an output file with the following columns:

● gmap_id: the ID of the business.
● review_count: the total number of reviews for a business.
● review_text_count: the number of reviews that contain text.
● response_count: the number of responses from a business.
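The three counts can be sketched as a single pass over the parsed reviews. This is a minimal illustration, not the required implementation: the `reviews` list, its key names and its values are invented, and it assumes a review counts as having text or a response whenever the corresponding field is non-empty. Plain dictionaries are used so the sketch runs without dependencies; the same aggregation can be done with pandas.

```python
# Hypothetical parsed records, one dict per review (invented values).
reviews = [
    {"gmap_id": "g1", "text": "Great tacos", "resp": {"time": 1, "text": "Thanks!"}},
    {"gmap_id": "g1", "text": None, "resp": None},
    {"gmap_id": "g2", "text": "Slow service", "resp": None},
]

def count_by_business(records):
    """Return {gmap_id: [review_count, review_text_count, response_count]}."""
    counts = {}
    for r in records:
        row = counts.setdefault(r["gmap_id"], [0, 0, 0])
        row[0] += 1                 # every record is one review
        if r.get("text"):           # assumption: empty or missing text is not counted
            row[1] += 1
        if r.get("resp"):           # assumption: empty or missing response is not counted
            row[2] += 1
    return counts

def to_csv_lines(counts):
    """Render the counts as CSV lines (assumes gmap_id contains no commas)."""
    lines = ["gmap_id,review_count,review_text_count,response_count"]
    for gmap_id in sorted(counts):
        lines.append(",".join([gmap_id] + [str(n) for n in counts[gmap_id]]))
    return lines

counts = count_by_business(reviews)
csv_lines = to_csv_lines(counts)
```

Whether duplicate reviews have already been removed before counting changes the numbers, so this step belongs after the de-duplication described in the guidelines.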

For the JSON file, you are required to produce an output file with the following fields:

● gmap_id: the ID of the business
● reviews: a root element with one or more reviews, containing the fields:
  ○ user_id: ID of the reviewer.
  ○ time: the time of the review. (output format: UTC time in YYYY-MM-DD tt:hh:ss)
  ○ review_rating: rating of the business.
  ○ review_text: the English review text. If a review is in another language, only extract the English translation. Emojis must be removed before output, based on the emoji list in the Task 1 guidelines, and the output text should be normalised to lowercase. Place “None” as the value if there is no review.
  ○ If_pic: whether the reviewer included pictures. (output format: Y/N)
  ○ pic_dim: the dimensions of the pictures in a list of tuples, e.g. [[h,w],[h,w]...]. Place [] as the value if there is no picture.
  ○ If_response: whether the review has a response. (output format: Y/N)
● earliest_review_date: the earliest review date for a given business in the given data subset. (output format: UTC time in YYYY-MM-DD tt:hh:ss)
● latest_review_date: the latest review date for a given business in the given data subset. (output format: UTC time in YYYY-MM-DD tt:hh:ss)
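As a rough illustration of the fields above, the sketch below builds the entry for one business. Several things here are assumptions rather than part of the specification: the timestamps are taken to be in milliseconds, the date pattern is read as `%Y-%m-%d %H:%M:%S`, the `pics` value is taken to be a list of (height, width) pairs, and the nesting of the returned dictionary is only a guess. The sample output remains the reference for the real structure.

```python
from datetime import datetime, timezone

def unix_to_utc(ts, in_milliseconds=True):
    """Format a unix timestamp as a UTC string (assumed pattern, see lead-in)."""
    seconds = ts / 1000 if in_milliseconds else ts
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

def build_business_entry(gmap_id, raw_reviews):
    """Assemble the JSON fields for one business from hypothetical parsed reviews."""
    reviews = []
    for r in raw_reviews:
        pics = r.get("pics") or []
        reviews.append({
            "user_id": r["user_id"],
            "time": unix_to_utc(r["time"]),
            "review_rating": r["rating"],
            "review_text": r.get("text") or "None",
            "If_pic": "Y" if pics else "N",
            "pic_dim": [[h, w] for h, w in pics],
            "If_response": "Y" if r.get("resp") else "N",
        })
    times = [r["time"] for r in raw_reviews]
    return {
        "gmap_id": gmap_id,
        "reviews": reviews,
        "earliest_review_date": unix_to_utc(min(times)),
        "latest_review_date": unix_to_utc(max(times)),
    }

# Invented example data.
raw = [
    {"user_id": "u1", "time": 1612137600000, "rating": 4,
     "text": "nice view", "pics": [(300, 400)], "resp": None},
    {"user_id": "u2", "time": 1609459200000, "rating": 5,
     "text": None, "pics": None, "resp": {"time": 1609545600000, "text": "thanks"}},
]
entry = build_business_entry("g1", raw)
```

Passing `tz=timezone.utc` explicitly keeps the result independent of the time zone of the machine running the notebook.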

VERY IMPORTANT NOTE:

1. All the tag names are case-sensitive in the output JSON file. You can refer to the sample output for the correct JSON file structure.
2. The sample output files are just for you to understand the structure of the required output, and the correctness of their content in Task 1 is not guaranteed. So please do not try to reverse engineer the outputs, as it will fail to generate the correct content.

Task 1 Guidelines

To complete the above task, please follow the steps below:

Step 0: Study the sample files
● Open and check your input .txt file and try to find any ‘potential interesting’ patterns for different data elements

Step 1: Txt file parsing and Excel file parsing
● Load the input file
● Use Regex to extract the required attributes and their values as listed from the txt file
● Extract necessary data from the Excel file
● Combine all data together
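The regex extraction in Step 1 might look roughly like the sketch below. Everything about the input here is hypothetical: the `<record>` wrapper, the tag spellings and the sample string are invented to show the idea of tolerating inconsistent tag names and letter case. The real patterns have to come from studying your own input files in Step 0.

```python
import re

# Hypothetical record wrapper; the real files may delimit reviews differently.
RECORD_RE = re.compile(r"<record>(.*?)</record>", re.DOTALL | re.IGNORECASE)

def field(block, names):
    """Return the text between an opening and closing tag whose name matches
    the regex alternation `names`, or None when no such tag is present."""
    pattern = rf"<\s*(?:{names})\s*>(.*?)<\s*/\s*(?:{names})\s*>"
    m = re.search(pattern, block, re.DOTALL | re.IGNORECASE)
    return m.group(1).strip() if m else None

def parse_records(raw):
    """Extract one dict per record, accepting several spellings per attribute."""
    rows = []
    for block in RECORD_RE.findall(raw):
        rows.append({
            "user_id": field(block, r"user_?id"),
            "time": field(block, r"time|date"),
            "rating": field(block, r"rating|rate"),
            "text": field(block, r"text|review"),
        })
    return rows

# Invented sample with mismatched case and tag spellings.
sample = (
    "<record><user_id>u1</user_id><Time>1609459200000</time>"
    "<rating>5</rating><text>Great food</text></record>\n"
    "<record><UserId>u2</UserId><time> 1612137600000 </time>"
    "<Rate>3</Rate></record>\n"
)
rows = parse_records(sample)
```

The non-greedy `(.*?)` together with `re.DOTALL` lets a value span several lines without swallowing the following tag.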

Step 2: Further process the extracted text from Step 1
● Remove any duplicates if needed
● Further process the extracted data
● Note for review texts: they should be transformed into lowercase and with no emojis
  ○ To remove emojis, make sure your text data is in utf-8 format
  ○ The list of emojis to remove is:
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"

Step 3: File output
● Output the required files based on the specified structures provided above, and make sure your data is utf-8 encoded.
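The text cleaning in Step 2 and the UTF-8 output in Step 3 can be sketched together as follows. The emoji ranges are copied from the list in Step 2; the function names, the demo string and the file name `task1_000.json` (with `000` standing in for a group number) are illustrative assumptions. `tempfile` is used only so that the demo does not write into the working directory.

```python
import json
import os
import re
import tempfile

# Emoji ranges copied from the list in Step 2.
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+",
    flags=re.UNICODE,
)

def clean_review_text(text):
    """Remove the listed ranges and lowercase; return "None" when there is no text."""
    if not text:
        return "None"
    return EMOJI_RE.sub("", text).lower()

def save_json(data, path):
    """Write JSON as UTF-8; ensure_ascii=False keeps non-ASCII characters as-is."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=4)

# Demo with invented data.
cleaned = clean_review_text("Great Café 😀")
out_path = os.path.join(tempfile.mkdtemp(), "task1_000.json")
save_json({"review_text": cleaned}, out_path)
with open(out_path, encoding="utf-8") as fh:
    round_trip = json.load(fh)
```

Note that the last range runs from U+24C2 to U+1F251, so it removes far more than emojis (CJK characters fall inside it, for example). Whitespace left behind by a removed emoji is kept here; whether to strip it is a decision the specification does not state.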

Submission Requirements

You need to submit 4 files:

● A task1_<group_number>.json file that contains the correct review information with all the elements listed above.
● A task1_<group_number>.csv file that contains the correct review information with all the elements listed above.
● A Python notebook named task1_<group_number>.ipynb that contains a well-documented report demonstrating your solution to Task 1. You need to clearly present the methodology, that is, the entire step-by-step process of your solution with appropriate comments and explanations. You can follow the suggested steps in the guideline above. Please keep this notebook easy to read, as you will lose marks if we cannot understand it (make sure you PRINT OUT your cell output).
● A task1_<group_number>.py file. This file will be used for plagiarism check (make sure you clear your cell output before exporting).

In Google Colab: [screenshot]

Requirements on the Python notebook (report)

● Methodology - 35%
  ○ You need to demonstrate your solution using correct regular expressions. Results from each step could help to demonstrate your solution better and make it easier to understand.
  ○ You should present your solution in a proper way, including all required steps. Skipping any step will incur a grade penalty.
  ○ You need to select and use the appropriate Python functions for input, process and output.
  ○ Your solution should be an efficient one, without redundant operations or unnecessary reading and writing of the data.
● Report organisation and writing - 15%
  ○ The report should be organised in a proper structure to present your solutions to Task 1, with clear and meaningful titles for sections and subsections, or sub-subsections if needed.
  ○ Each step in your solution should be clearly described. For example, you can write to explain your idea of the solution, any specific settings, and the reason for using a particular function, etc.
  ○ Explanation of your results, including all intermediate steps, is required. This can help the marking team to understand your solution and give partial marks if the final results are not fully correct.
  ○ All your code needs proper (but not excessive) commenting.
  ○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.

Task 2: Text Pre-Processing (12/35)

This task involves the next step in textual data analysis: converting extracted text into a numerical representation for downstream modelling tasks. You are required to write Python code to preprocess Google Reviews text from Task 1 and transform it into numerical representations. These numerical representations are the standard format for text data, suitable for input into NLP systems such as recommender systems, information retrieval algorithms, and machine translation. The most fundamental step in natural language processing (NLP) tasks is converting words into numbers to enable machines to understand and decode patterns within a language. This step, although iterative, is crucial in determining the features for your machine learning models and algorithms.

Allowed libraries: ALL

Input Files: task1_<group_number>.json OR task1_<group_number>.csv
Output Files (submission): <group_number>_vocab.txt, <group_number>_countvec.txt, task2_<group_number>.ipynb, task2_<group_number>.py

In this task you are required to continue working with the data from Task 1.

You are asked to use the review fields in all the reviews from all the businesses that have at least 70 text reviews. Then pre-process the review text and generate a vocabulary list and numerical representation for the corresponding text, which will be used in the model training by your colleagues. The information regarding output files is listed below:

<group_number>_vocab.txt comprises unique stemmed tokens sorted alphabetically, presented in the format token:token_index

<group_number>_countvec.txt includes numerical representations of all tokens, organised by gmap_id and token index, following the format gmap_id,token_index:frequency.

Carefully examine the sample output files (here) for detailed information about the output structure.

VERY IMPORTANT NOTE: The sample outputs are just for you to understand the structure of the required output, and the correctness of their content in Task 2 is not guaranteed. So please do not try to reverse engineer the outputs, as it will fail to generate the correct content.

Task 2 Guideline

To complete the above task, please follow the steps below:

Step 1: Text extraction
● You are required to extract the review text from the output of Task 1.
● Validate your review text data: the text data should be all in English and in lowercase with no emojis.
● You are only required to extract the vocab and countvec lists for reviews from businesses that have at least 70 text reviews

Step 2: Generate the unigram and bigram lists and output as vocab.txt

The following steps must be performed (not necessarily in the same order) to complete the assessment. Please note that the order of preprocessing matters and will result in a different vocabulary and hence different count vectors. It is part of the assessment to figure out the correct order of preprocessing which makes the most sense as we learned in the unit. You are encouraged to ask questions and discuss them with the teaching team if in doubt.

a. The word tokenization must use the following regular expression: "[a-zA-Z]+"
b. The context-independent and context-dependent stop words must be removed from the vocabulary.
  ■ For context-independent stop words, the provided context-independent stop words list (i.e., stopwords_en.txt) must be used.
  ■ For context-dependent stop words, you must set the threshold to words that appear in more than 95% of the businesses that have at least 70 text reviews.
c. Tokens should be stemmed using the Porter stemmer.
d. Rare tokens must be removed from the vocab (with the threshold set to words that appear in less than 5% of the businesses that have at least 70 text reviews).
e. Tokens with a length less than 3 should be removed from the vocab.
f. The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure; then make sure the collocations can be collocated within the same review.
g. Calculate the vocabulary containing both unigrams and bigrams.
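The tokenizer in item a and the filters in items b, d and e can be sketched as below. This only illustrates the filters themselves: it does not settle the order of the steps, which is part of the assessment, and it leaves out Porter stemming (item c) and bigrams (item f). The function names, the toy reviews and the three-word stop list are invented; the real stop list is stopwords_en.txt, and the sketch treats all text of one business as a single document.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-zA-Z]+")  # the regular expression required by item a

def tokenize(text):
    """Split text into alphabetic tokens."""
    return TOKEN_RE.findall(text)

def document_frequency(docs_tokens):
    """Count, for each token, how many businesses contain it at least once."""
    df = Counter()
    for tokens in docs_tokens.values():
        df.update(set(tokens))
    return df

def filter_vocab(docs_tokens, stopwords, min_len=3, low=0.05, high=0.95):
    """Keep tokens that are not stop words, are long enough, and appear in
    at least `low` and at most `high` of the businesses."""
    n = len(docs_tokens)
    df = document_frequency(docs_tokens)
    return {
        tok for tok, count in df.items()
        if tok not in stopwords
        and len(tok) >= min_len
        and low <= count / n <= high
    }

# Toy data: one concatenated review string per business (invented).
docs = {
    "g1": "the pizza was great and the staff friendly",
    "g2": "great burgers, slow staff",
    "g3": "great coffee ok",
}
docs_tokens = {gmap_id: tokenize(text) for gmap_id, text in docs.items()}
kept = filter_vocab(docs_tokens, stopwords={"the", "was", "and"})
```

In this toy example "great" is dropped because it appears in every business, and "ok" is dropped for being shorter than three characters.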

Combine the unigrams and bigrams, sort the list alphabetically in ascending order and output as vocab.txt
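A hand-written sketch of PMI scoring and of the combined, sorted vocabulary is given below, using PMI(a, b) = log2(P(a, b) / (P(a) P(b))) with probabilities estimated from raw counts. Bigrams are formed inside each review only, so no pair spans two reviews. Joining the two words with an underscore and starting the index at 0 are assumptions, as are the function names and toy data; the sample output decides the real conventions. NLTK's collocation tools offer a ready-made alternative to the manual scoring.

```python
import math
from collections import Counter

def pmi_bigrams(reviews_tokens, top_n=200):
    """Rank bigrams by PMI; bigrams never cross a review boundary."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in reviews_tokens:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    # Highest PMI first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda bg: (-scores[bg], bg))
    return ranked[:top_n]

def build_vocab(unigrams, bigrams):
    """Merge unigrams and bigrams, sort ascending, and assign indices."""
    tokens = sorted(set(unigrams) | {"_".join(bg) for bg in bigrams})
    return {tok: idx for idx, tok in enumerate(tokens)}

# Toy data: each inner list is one tokenized review (invented).
toy_reviews = [["ice", "cream", "shop"], ["ice", "cream"], ["shop", "staff", "ice"]]
top = pmi_bigrams(toy_reviews, top_n=1)
vocab = build_vocab(["pizza", "staff"], [("ice", "cream")])
vocab_lines = [f"{tok}:{idx}" for tok, idx in vocab.items()]
```

PMI rewards pairs whose members rarely occur apart, which is why the single occurrence of ("shop", "staff") outranks the more frequent ("ice", "cream") in this toy corpus. A minimum frequency filter is a common way to temper that effect, although the specification does not ask for one.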

Step 3: Generate the sparse numerical representation and output as countvec.txt

1. Generate the sparse representation by using CountVectorizer OR by directly counting the frequency using FreqDist().
2. Output the sparse numerical representation into a txt file with the following format:
   gmap_id1,token1_index:token1_frequency,token2_index:token2_frequency,token3_index:token3_frequency,...
   gmap_id2,token2_index:token2_frequency,token5_index:token5_frequency,token7_index:token7_frequency,...
   gmap_id3,token6_index:token6_frequency,token9_index:token9_frequency,token12_index:token12_frequency,...

Note: the token_index comes from vocab.txt; make sure you are counting bigrams
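One line of countvec.txt can be produced along the lines of the sketch below, which counts directly with a `Counter` rather than a vectorizer class. Two choices here are assumptions: bigrams are stored in the vocabulary joined by an underscore, and the words inside a matched bigram are still counted as unigrams when they are in the vocabulary. The function name and toy data are invented.

```python
from collections import Counter

def count_vector_line(gmap_id, reviews_tokens, vocab):
    """Build 'gmap_id,index:count,...' for one business from its tokenized reviews."""
    counts = Counter()
    for tokens in reviews_tokens:
        # Unigrams that survived into the vocabulary.
        counts.update(tok for tok in tokens if tok in vocab)
        # Bigrams are looked up within a single review only.
        for a, b in zip(tokens, tokens[1:]):
            bigram = f"{a}_{b}"
            if bigram in vocab:
                counts[bigram] += 1
    # Only tokens that occurred are present, so zero counts never appear.
    pairs = sorted((vocab[tok], n) for tok, n in counts.items())
    return gmap_id + "," + ",".join(f"{idx}:{n}" for idx, n in pairs)

# Toy vocabulary and reviews (invented).
toy_vocab = {"ice_cream": 0, "pizza": 1, "staff": 2}
line = count_vector_line(
    "g1",
    [["ice", "cream", "pizza"], ["pizza", "staff"]],
    toy_vocab,
)
```

Sorting by token index keeps the output stable between runs, which makes it easier to compare against the sample file.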

Submission Requirements

You need to submit 4 files:

1. A <group_number>_vocab.txt file that contains the unigram and bigram tokens in the following format: token:token_index. Words in the vocabulary must be sorted in alphabetical order.
2. A <group_number>_countvec.txt file, in which each line contains the sparse representation of one business in the following format:
   gmap_id1,token1_index:token1_wordcount,token2_index:token2_wordcount,...
   Please note: the tokens with zero word count should NOT be included in the sparse representation.
3. A task2_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)
4. A task2_<group_number>.py file for plagiarism checks. (make sure you clear your cell outputs)

Requirements on the Python notebook (report)

● Methodology - 35%
  ○ You need to demonstrate your solution using correct regular expressions.
  ○ You should present your solution in a proper way, including all required steps.
  ○ You need to select and use the appropriate Python functions for input, process and output.
  ○ Your solution should be an efficient one, without redundant operations or unnecessary reading and writing of the data.
● Report organisation and writing - 15%
  ○ The report should be organised in a proper structure to present your solutions to Task 2, with clear and meaningful titles for sections and subsections, or sub-subsections if needed.
  ○ Each step in your solution should be clearly described. For example, you can write to explain your idea of the solution, any specific settings, and the reason for using a particular function, etc.
  ○ Explanation of your results, including all intermediate steps, is required. This can help the marking team to understand your solution and give partial marks if the final results are not fully correct.
  ○ All your code needs proper (but not excessive) commenting.
  ○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.

Task 3: Data Exploratory Analysis (12/35)

In this task, you are asked to conduct a comprehensive exploratory data analysis (EDA) on the provided Google Review data. The goal is to uncover interesting insights that can be useful for further analysis or decision-making.

Output Files (submission): task3_<group_number>.ipynb, task3_<group_number>.py

Task 3 Guideline

To complete the above task, please follow the steps below:

Step 1: Understand the Sample Data:
● Review the data provided from the previous exercise.
● Summarise the key features and variables included in the dataset.
● Identify any initial patterns, trends

Step 2: Understand the Auxiliary Dataset: Metadata (optional):
● Understand the auxiliary metadata dataset.
● Evaluate the usefulness of the metadata in conjunction with the main dataset.
● Decide whether to incorporate the metadata into your analysis.
Note: you need to perform this step for an HD grading.

Step 3: Data Analysis:
● You can choose to use either the main Google Review data, the auxiliary metadata, or both.
● Perform an exploratory data analysis to investigate and uncover interesting insights.
● Insights should be data-driven and substantiated by visualisations and/or statistical summaries
● You are required to investigate 2-5 interesting insights from the data.
● For an HD grading, you need to have at least 4 meaningful insights with in-depth discussions (and at least two of them need to come from the combination of metadata and the review data).

Submission Requirements

You need to submit 2 files:

1. A task3_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)
2. A task3_<group_number>.py file for plagiarism check. (make sure you clear your cell outputs)

Task 4: Video presentation for Task 3 (2/35)

Create a video presentation (5-8 minutes) to effectively communicate the findings from your exploratory data analysis (EDA) on the Google Review data. The goal is to present your methodology and insights in a clear, concise, and engaging manner.

Output Files (submission): task4_<group_number>.mp4

Submission Requirements

Here are the key components you need to include in your submission:

Introduction:
● Briefly introduce yourself and provide context for the analysis.
● Explain the purpose of the EDA and the datasets used (main Google Review data and auxiliary metadata, if applicable).

Methodology:
● Describe the steps taken during the data analysis process.

Insights:
● Present 2-5 interesting insights uncovered from the analysis.
● Use visual aids such as charts, graphs, or tables to support your insights.
● Explain the significance of each insight and how it can be applied or interpreted.

Conclusion:
● Summarise the key findings and their potential implications.
● Discuss any limitations of the analysis and suggest areas for further research.

Task 5: Development History (1/35)

For this task, your group is required to provide a comprehensive development history of your assignment, showcasing incremental progress over at least three different time points. The purpose of this task is to demonstrate your ability to manage and document the evolution of your project, including changes made, challenges faced, and collaborative efforts with your groupmates.

Output Files (submission): task5_<group_number>.pdf

Requirement:

This task must be completed (HURDLE)

Consequences of missing development history:

Failure to submit your development history for a particular task will result in not meeting the hurdle requirements for Assignment A2. Consequently, you will receive ZERO for the task.

Submission Requirements

Here are the key components you need to include in your submission (example):

1. Development Timeline: Provide a detailed timeline highlighting at least three significant time points in the development of your assignment. Each time point should be accompanied by a description of the changes made and the rationale behind those changes.
2. Version Screenshots: Include screenshots or snapshots of different versions of your assignment at each time point. This should clearly illustrate the incremental development and any modifications made to the project.
3. Collaborative Effort (if you are doing the assignment with another student): Document the collaborative effort with your groupmates. This can be a description of the contributions made by each team member or screenshots of proof showcasing the collaborative effort with your groupmates.

Note: We only require brief descriptions here; they don’t have to be long, as long as they reflect the key components we listed above. If you use Google Colab, it provides a comprehensive version history.

Note: to avoid losing important work, you need to pin the version history that you wish to save. This keeps the notebook version saved permanently.

Instructions - Sharing ipynb link

1. Click on the Share button on the top right corner
2. Make sure that under the General access section, you have the permission set to ‘Monash University’ and ‘Editor’, then click on `Copy link`
3. Create a markdown cell at the end of your assignment. Paste the share link/create a hyperlink object.
4. Double check the link to make sure it is working.

Submission Checklist:

● Please zip all the submission files for task 1 and 2 into a single file with the name <group_number>_ass1.zip. (any other format, e.g. rar or 7z, or other zip file names will be penalised)
● There are 12 files in your compressed zip file
● <group_number> should be replaced with your group id. (the <group_number> has 0 paddings, i.e. 001, 010...)
● Make sure both members of your group click the ‘Submit’ button on Moodle
● Please strictly follow the file naming standard. Any misnamed file will carry a penalty
● Please make sure that your .ipynb file contains printed output, while your .py file does not include any output
● Please ensure that all your files are parsable and readable. You can achieve this by re-reading all your generated files back into Python (e.g. using read_csv for CSV files or the json module for JSON). These checks are only sanity checks and hence should not be added to your final submission

Note: All submissions will be put through plagiarism detection software which automatically checks for their similarity with respect to other submissions. Any plagiarism found will trigger the Faculty’s relevant procedures and may result in severe penalties, up to and including exclusion from the university.
