Wednesday, 20 May, 09:15 BST
(24 hour open online assessment – Indicative duration 1.5 hours)

DEGREES of MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

TEXT AS DATA (M)
COMPSCI 5096

(Answer 3 out of 4 questions)

This examination paper is worth a total of 60 marks.

1. This question is about tokenisation and similarity.

(a) This part concerns processing text. Consider the input string:

[He didn’t like the U.S. movie “Snakes on a train, revenge of Viper-man!”, now playing in the U.K.]

(i) Provide a tokenised form of the above string. Identify and discuss two elements of the above string that present ambiguities. Justify your tokenisation decision for each. [3]

(ii) Compare and contrast ‘standard’ word-based tokenisation with the tokenisation method used by BERT. Illustrate key differences using the example provided. Analyse and discuss why they differ and their relative advantages and disadvantages. (Hint: Recall we used BERT’s tokeniser in Lab 1 and in the in-class embedding exercise.) [4]

(b) Consider the two tokenised documents:

S1: [a, woman, is, under, a, mayan, curse]
S2: [a, woman, sees, a, mayan, shaman, to, lift, the, curse]

Create a Dictionary from the two documents above (S1 and S2) with appropriate ordering. Give your answer in the form of a table with ID and token. Discuss the following properties of the dictionary and provide reasons for the decision: 1) what is included in the dictionary and 2) the order of the dictionary. [3]

(c) Critically evaluate the Bag-of-Words (BoW) model as a term weighting feature model for documents. Discuss its strengths, give three weaknesses of the model, and propose a modification that addresses each weakness. You should relate each to scikit-learn vectorizers and their important parameters. [4]

(d) You are measuring the similarity between two molecular compounds for drug discovery research. They have been processed to create a series of unique structural ‘fingerprints’ and a one-hot encoding of the compounds is created.
A compound has tens of thousands of fingerprints on average and all the compounds are approximately the same size. Also, most of the compounds in the dataset share more than 90% of fingerprints in common. A lab partner suggests using Jaccard overlap to measure the similarity between compounds. First, critically discuss why Jaccard is or is not appropriate for this task and the challenges it presents. Second, propose and justify a change to both the representation and similarity measure to address them. [6]

2. This question is about language modelling and classification.

(a)