CE314/887 Assignment 1

Issued date: November 5th 2020
FASER due date: November 20th 2020
Latest date for no-penalty submission: November 27th 2020

You should provide code for Parts 1 and 2; no coding is needed for Part 3.

Part 1: Regular expressions (40%)
(You can store your code and output in part1_regularexpression_studentID.py)

• (20%) Write a regular expression that can find all amounts of money in a text. Your expression should be able to deal with different formats and currencies, for example £50,000 and £117.3m as well as 30p, 500m euro, 338bn euros, $15bn and $92.88. Make sure that you can at least detect amounts in Pounds, Dollars and Euros. For full marks, include the output of a Python program that applies your regular expression to the following BBC News web page: https://www.bbc.co.uk/news/business-41779341 (An illustrative starter sketch is given after Part 2.)

• (20%) Write a regular expression that matches all of the phone numbers listed below. (You can write a Python program to check the matching results; see the sketch after Part 2.)

555.123.4565
+1-(800)-545-2468
2-(800)-545-2468
3-800-545-2468
555-123-3456
555 222 3342
(234) 234 2442
(243)-234-2342
1234567890
123.456.7890
123.4567
123-4567
1234567900
12345678900

Part 2: NLTK (10%)

• Find the 50 highest-frequency words in the Wall Street Journal corpus in nltk.book (text7), with all punctuation removed and all words lowercased. Submit your code as part2_NLTK_studentID.py. (A minimal sketch is given below.)
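For illustration only, here is a minimal sketch of how a money regex could be applied to the BBC page. This is not a model answer: the use of the requests library is an assumption, and the pattern shown is a deliberately incomplete starting point.

# Starter sketch only: fetch the page and apply an (incomplete) money regex.
import re
import requests

url = "https://www.bbc.co.uk/news/business-41779341"
html = requests.get(url).text

# Starting-point pattern: a currency symbol followed by digits, commas and an
# optional decimal part, plus an optional m/bn multiplier. Extending it to the
# remaining formats (e.g. 30p, 500m euro, 338bn euros) is part of the exercise.
money_pattern = re.compile(r"[£$€]\d[\d,]*(?:\.\d+)?(?:m|bn)?")

for match in money_pattern.findall(html):
    print(match)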
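Similarly, a sketch of how a phone-number regex could be checked against the list in Part 1. The candidate pattern below is a placeholder assumption, not a correct answer; it will not match every number as given.

# Sketch: test a candidate phone-number regex against the numbers from Part 1.
import re

numbers = [
    "555.123.4565", "+1-(800)-545-2468", "2-(800)-545-2468", "3-800-545-2468",
    "555-123-3456", "555 222 3342", "(234) 234 2442", "(243)-234-2342",
    "1234567890", "123.456.7890", "123.4567", "123-4567",
    "1234567900", "12345678900",
]

# Placeholder pattern: refine it until every number above is matched in full.
candidate = re.compile(r"\+?\d{0,2}[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

for n in numbers:
    result = "match" if candidate.fullmatch(n) else "no match"
    print(f"{n:20s} {result}")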
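For Part 2, a minimal sketch of one possible approach. It assumes the NLTK book collection has already been downloaded (e.g. via nltk.download('book')) and uses str.isalpha() as one simple way of dropping punctuation tokens; note that this also drops numeric tokens.

# Sketch for Part 2: 50 most frequent words in text7 (Wall Street Journal),
# lowercased, with punctuation tokens filtered out.
from nltk import FreqDist
from nltk.book import text7  # requires the NLTK "book" collection

# Keep only purely alphabetic tokens, lowercased, so punctuation is excluded.
words = [w.lower() for w in text7 if w.isalpha()]

fdist = FreqDist(words)
for word, count in fdist.most_common(50):
    print(word, count)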
Part 3: Language modeling (50%; paper-based, no coding needed for this part)

Exercise 1

Consider the following toy example.

Training data:
I am Sam
Sam I am
Sam I like
Sam I do like
do I like Sam

Assume that we use a bigram language model based on the above training data.

1. What is the most probable next word predicted by the model for each of the following word sequences? (10%)
(a) Sam ...
(b) Sam I do ...
(c) Sam I am Sam ...
(d) do I like ...

2. Which of the following sentences is better, i.e., gets a higher probability with this model? (10%)
(a) Sam I do I like
(b) Sam I am
(c) I do like Sam I am

Exercise 2

Consider again the same training data and the same bigram model. Compute the perplexity of the following sequence: (10%)

I do like

(The standard bigram and perplexity formulas are recalled in the reference note at the end of this sheet.)

Exercise 3

Take again the same training data. This time, we use a bigram LM with Laplace smoothing.

1. Give the following bigram probabilities estimated by this model: (10%)

P(do|<s>)   P(do|Sam)   P(Sam|<s>)   P(Sam|do)
P(I|Sam)    P(I|do)     P(like|I)

Note that for each word w_{n-1}, we count one additional bigram for each possible continuation w_n. Consequently, we have to take into consideration not only the words but also the symbol </s>.

2. Calculate the probabilities of the following sequences according to this model:
(a) do Sam I like
(b) Sam do I like

Which of the two sequences is more probable according to our LM (language model)? (10%)
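Reference note (a reminder of standard definitions only, not additional assignment text): Exercises 1 and 2 rely on the maximum-likelihood bigram estimate and on perplexity, as defined in standard textbooks such as Jurafsky and Martin, with w_0 = <s> as the sentence-start symbol:

  P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}

  PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} \approx \left( \prod_{i=1}^{N} P(w_i \mid w_{i-1}) \right)^{-1/N}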
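Likewise, Exercise 3 relies on the add-one (Laplace) smoothed bigram estimate, where V is the vocabulary size (which, as noted in Exercise 3, includes the symbol </s>):

  P_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}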