CE314/887 Assignment 1

Issue date: November 5th, 2020
FASER due date: November 20th, 2020
Latest date for no-penalty submission: November 27th, 2020

You should provide code for Part 1 and Part 2; no coding is needed for Part 3.


Part 1: Regular expression (40%) (You can store your code in the output file
part1_regularexpression_studentID.py)

• (20%) Write a regular expression that can find all amounts of money in a text. Your expression
should be able to deal with different formats and currencies, for example £50,000 and
£117.3m as well as 30p, 500m euro, 338bn euros, $15bn and $92.88. Make sure that you can
at least detect amounts in Pounds, Dollars and Euros.

For full marks: include the output of a Python program that applies your regular expression
to the following BBC News Web site:

https://www.bbc.co.uk/news/business-41779341
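As a starting point for this task, the following is a minimal sketch, not a model answer. It covers only the example formats listed above; the names NUMBER, SCALE, WORDS and MONEY_RE, the set of magnitude suffixes, and the sample sentence are all assumptions made for illustration. Fetching and cleaning the BBC page is not shown.

```python
import re

# Illustrative building blocks (not prescribed by the assignment).
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"               # 50,000  117.3  92.88
SCALE = r"(?:\s?(?:billion|bn|million|m|k))?"      # optional magnitude suffix
WORDS = r"(?:euros?|pounds?|dollars?)"             # currency name after the number

MONEY_RE = re.compile(
    rf"[£$€]{NUMBER}{SCALE}\b"          # symbol first: £50,000, $15bn
    rf"|\b{NUMBER}{SCALE}\s?{WORDS}\b"  # name last: 500m euro, 338bn euros
    rf"|\b\d+p\b",                      # pence: 30p
    re.IGNORECASE,
)

# Made-up sample sentence containing the formats from the task description.
sample = (
    "Profits rose from £50,000 to £117.3m; a coffee costs 30p, "
    "the fund holds 500m euro of 338bn euros, and fees were $15bn and $92.88."
)
amounts = MONEY_RE.findall(sample)
print(amounts)
```

Because every group is non-capturing, findall returns the full matched strings. Real news text is likely to contain formats this sketch misses (for example "US$" prefixes or ranges), so the pattern would need extending after inspecting the page.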


• (20%) Write a regular expression that matches all the phone numbers listed below. (You
can write a Python program to check the matching results.)

555.123.4565
+1-(800)-545-2468
2-(800)-545-2468
3-800-545-2468
555-123-3456
555 222 3342
(234) 234 2442
(243)-234-2342
1234567890
123.456.7890
123.4567
123-4567
1234567900
12345678900
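One possible way to check a candidate pattern against the list above is sketched here. The pattern itself is an assumption, not the expected answer: it treats a number as an optional single country digit, an optional three-digit area code (with or without parentheses), and a 3+4 digit local number, separated by an optional dot, hyphen or space. The names PHONE_RE and NUMBERS are illustrative.

```python
import re

PHONE_RE = re.compile(
    r"(?:\+?\d[-.\s]?)?"        # optional country digit, e.g. "+1-" or "3-"
    r"(?:\(?\d{3}\)?[-.\s]?)?"  # optional area code, e.g. "(800)-" or "555."
    r"\d{3}[-.\s]?\d{4}"        # local number, e.g. "545-2468" or "1234567"
)

NUMBERS = [
    "555.123.4565", "+1-(800)-545-2468", "2-(800)-545-2468", "3-800-545-2468",
    "555-123-3456", "555 222 3342", "(234) 234 2442", "(243)-234-2342",
    "1234567890", "123.456.7890", "123.4567", "123-4567",
    "1234567900", "12345678900",
]

# fullmatch requires the whole string to be a phone number, not just a part of it.
unmatched = [n for n in NUMBERS if not PHONE_RE.fullmatch(n)]
print("unmatched:", unmatched)
```

This sketch is deliberately permissive: it would also accept an unbalanced parenthesis such as "(234 234 2442", so a stricter pattern may be preferable if false positives matter.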

Part 2: NLTK (10%)

Find the 50 most frequent words in the Wall Street Journal corpus in nltk.book (text7), with
all punctuation removed and all words lowercased. Submit your code as
part2_NLTK_studentID.py.
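The counting step can be sketched independently of NLTK, as below. The helper name top_words is hypothetical, and "punctuation removed" is interpreted here as dropping every token that is not purely alphabetic, which also discards numbers and tokens such as "n't"; stripping punctuation characters from inside tokens would be an equally defensible reading.

```python
from collections import Counter


def top_words(tokens, n=50):
    """Return the n most common lowercased tokens as (word, count) pairs.

    Assumption: only purely alphabetic tokens count as words.
    """
    words = [t.lower() for t in tokens if t.isalpha()]
    return Counter(words).most_common(n)


# Tiny made-up token list, used only to show the call.
demo = ["The", "the", ",", "Dog", "dog", "the", "."]
print(top_words(demo, 2))

# With NLTK installed and its "book" data downloaded, the same helper
# could be applied to the corpus:
#     from nltk.book import text7
#     top_words(text7, 50)
```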


Part 3: Language modeling (50%; paper-based, no coding is needed for this part)

Exercise 1 Consider the following toy example.

Training data:
I am Sam
Sam I am
Sam I like
Sam I do like
do I like Sam

Assume that we use a bigram language model based on the above training data.
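Although this part is paper-based, hand-computed counts can be cross-checked with a short script. The sketch below assumes the usual maximum-likelihood estimate P(w | prev) = C(prev w) / C(prev) and assumes each training sentence is padded with a start symbol <s> and an end symbol </s>; the function names are illustrative.

```python
from collections import Counter

SENTENCES = ["I am Sam", "Sam I am", "Sam I like", "Sam I do like", "do I like Sam"]


def bigram_counts(sentences):
    """Count context unigrams and bigrams over padded sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]  # assumed padding
        unigrams.update(tokens[:-1])            # </s> never acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def mle(prev, word, unigrams, bigrams):
    """Unsmoothed estimate; raises ZeroDivisionError for an unseen context."""
    return bigrams[(prev, word)] / unigrams[prev]


unigrams, bigrams = bigram_counts(SENTENCES)
print(mle("I", "am", unigrams, bigrams))  # "I" occurs 5 times, "I am" twice
```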

1. What is the most probable next word predicted by the model for the following word
sequences? (10%)

(a) Sam ...
(b) Sam I do ...
(c) Sam I am Sam ...
(d) do I like ...

2. Which of the following sentences is better, i.e., gets a higher probability with this model?
(10%)

(a) Sam I do I like
(b) Sam I am
(c) I do like Sam I am

Exercise 2 Consider again the same training data and the same bigram model. Compute the
perplexity of (10%)

I do like
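For reference, perplexity is commonly defined as the inverse probability of the sequence normalised by its length N, that is PP = (p_1 * p_2 * ... * p_N) ** (-1/N). A minimal helper under that definition is sketched below. Whether the bigrams involving <s> and </s> are included in N is a convention the exercise does not state, so the caller decides which probabilities to pass in; the function name is illustrative.

```python
import math


def perplexity(probs):
    """Perplexity of a sequence given its N conditional probabilities.

    Computed in log space to avoid underflow on long sequences.
    """
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)


# Made-up probabilities: two events of probability 0.5 give perplexity 2.
print(perplexity([0.5, 0.5]))
```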

Exercise 3 Take again the same training data. This time, we use a bigram LM with Laplace
smoothing.

1. Give the following bigram probabilities estimated by this model:

P(do|<s>)
P(do|Sam)
P(Sam|<s>)
P(Sam|do)
P(I|Sam)
P(I|do)
P(like|I)

Note that for each word w_{n-1}, we count an additional bigram for each possible continuation
w_n. Consequently, we have to take the words into consideration and also the symbol
</s>. (10%)


2. Calculate the probabilities of the following sequences according to this model:

(a) do Sam I like
(b) Sam do I like

Which of the two sequences is more probable according to our language model (LM)?
(10%)
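The Laplace (add-one) estimate used in Exercise 3 can be sketched as P(w | prev) = (C(prev w) + 1) / (C(prev) + V), where V is the number of possible continuations. The sketch assumes sentences are padded with <s> and </s>, and that V counts the five distinct words plus the end symbol </s>, giving V = 6; the names below are illustrative.

```python
from collections import Counter

SENTENCES = ["I am Sam", "Sam I am", "Sam I like", "Sam I do like", "do I like Sam"]

unigrams, bigrams = Counter(), Counter()
for sentence in SENTENCES:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # assumed padding
    unigrams.update(tokens[:-1])                    # contexts only
    bigrams.update(zip(tokens, tokens[1:]))

# Possible continuations: every word type plus </s> (but not <s>).
VOCAB = sorted({w for s in SENTENCES for w in s.split()} | {"</s>"})


def laplace(prev, word):
    """Add-one smoothed bigram probability."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(VOCAB))


print(len(VOCAB))            # 6 continuations
print(laplace("I", "am"))    # (2 + 1) / (5 + 6)
```

A quick sanity check is that the smoothed probabilities of all continuations of a given context sum to 1.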




