辅导案例-COMP 614 F20

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 1/10
Keyword Query Engine
As with all assignments, make sure that you read, understand, and follow the course honor
code policy (https://canvas.rice.edu/courses/35911/assignments/syllabus) for assignments.
This assignment is due at 8PM on Friday, December 4, 2020. Late assignments will be subject to
the course grading policies.
The goal of this homework is to apply all the knowledge you've learned so far to build a keyword
query engine for the Wikipedia articles we've been working with. Specifically, you will gain more
familiarity with:
1. Designing and implementing algorithms with less specific, more abstract instructions.
2. Experimenting with new techniques in algorithms.
3. Testing.
Be sure to read the entire assignment before beginning.
Preliminaries
We will be using the same dataset as in the previous assignments. To make sure everyone is on the
same page, please place the following items in a folder titled "datasets":
articles_index_20199.pkl (https://canvas.rice.edu/courses/35911/files/2319074/download?wrap=1)
pagerank_results_20199.pkl (https://canvas.rice.edu/courses/35911/files/2435737/download?
wrap=1)
top10000.npy (https://canvas.rice.edu/courses/35911/files/2435718/download?wrap=1)
transform_matrix.npy (https://canvas.rice.edu/courses/35911/files/2435732/download?wrap=1)
projected.npy (https://canvas.rice.edu/courses/35911/files/2436461/download?wrap=1)
The Problem
We have been working with Wikipedia articles for quite some time now. In previous assignments, we
learned how to extract the "bag of words" from Wikipedia archives; we learned how to use Principal
Component Analysis (PCA) to find the "most important" words, and we learned how to compute
PageRank using a graph of Wikipedia articles.
To put everything together, in this final assignment, our goal is to build a keyword query engine. The
engine takes a handful of words as a query and then produces a list of Wikipedia article titles as the
result.
Our keyword query engine builds upon two important metrics we've already obtained: the PCA-space
representation and the PageRank of each Wikipedia article. The PCA-space representation captures
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 2/10
the similarities between articles, while the PageRank measures the relative importance. The goal of
our query will be to find important articles closely related to the keywords.
The Assignment
A. Revisiting Bag of Words [10 points]
In Assignment 4, you used regular expressions to parse through XML files and represent articles as
vectors of word counts. Your first task in this assignment is to use a similar concept to represent a
query, i.e., a list of keywords, as a vector of integers. In other words, imagine that a list of keywords
collectively makes up a small article which you need to convert to a vector.
To make things easier, we have provided you with a list of the top 10,000 words for the large dataset
you will be using in this assignment. Create a function, create_vector(keywords) , which takes in the
query as a list of keyword strings and returns a 10,000 element vector where the number in the
position is the count of the word in the top 10,000 word list. Assume that each input string is a
single word in lower case and that the top 10,000 words have already been converted to lowercase.
Run the following code, and move on once you feel confident in your function:
print(create_vector(['the', 'and', 'ideals']))
The reference solution returns [1. 0. 1. ... 1. 0. 0.].
B. Querying Based on PCA-Space Similarity [20 points]
Now that we've represented our query keywords as a vector, your next task is to find the nearest
neighbors to this vector. Specifically, your task is to write a function, find_closest_neighbors(vector, k) ,
which takes in the vector you created in Part A and an integer representing the number of nearest
neighbors you want to find. This function should return a list of the titles of the nearest neighbors to
the input vector.
To calculate these nearest neighbors, you will use PCA to project the keyword vector with a
dimension of 10,000 into a vector space with a dimension of 1820. You will then need to iterate
through every vector in the provided dataset of 20,199 projected vectors and calculate the Euclidean
distance between the keyword vector and the current vector. You will then want to return a list of the
titles of the vectors with the smallest distance values. You may find it helpful to use a dictionary,
Counter, and/or Priority Queue to complete this task.
We've provided you with a few files to help you out with this process:
projected.npy - This contains a a list of 20,199 projected article vectors. Each vector has a
dimension of 1820. This is the result of running Bag of Words and PCA on our large dataset.
th
th
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 3/10
transform_matrix.npy - This contains 10,000 vectors of size 1820. We computed this matrix using
the code from your PCA assignment. 1820 was the cutoff that gave us a 60% variance. Similar to
what you did in Assignment 5, use the transpose of this matrix to project vectors from a space
with a dimension of 10,000 into a space with a dimension of 1820.
articles_index_20199.pkl - This contains a list of 20,199 article titles in the same order as the
vectors in word_data.npy.
Once you have finished implementing this function, create a function, search(keywords, k) , which
takes in a list of keywords, an integer representing the number of articles to find, and returns a list of
the article titles most relevant to the keywords. This is essentially a driver function which should
call the previous two functions that you wrote.
Try out these three test cases:
print(search(["dna", "sequencing", "chromosome", "genome"], 50))
print(search(["north", "democracy", "states"], 50))
print(search(["sony"], 50))
The reference solution prints out
['dna', 'dna profiling', 'dna repair', 'eukaryotic dna replication', 'nucleoid', 'dna sequencing', 'dna m
ethylation', 'francis crick', 'origin of replication', 'epigenetics', 'human genome', 'carcinogenesis',
'rosalind franklin', 'dna virus', 'sheep', 'timeline of science fiction', 'sutton, london', 'dna doe proj
ect', 'mitochondrion', 'health effects of bisphenol a', 'crispr', 'synthetic biology', 'hayes, hillingdo
n', 'darren osborne', 'preston, lancashire', 'rutgers university', 'billy joel', 'divorce', 'malaysian is
lamic party', 'hulk hogan', 'kuwait', 'aggression', 'duke lacrosse case', 'rudy giuliani', 'jay lethal',
'yahoo!', 'catalog of paintings in the louvre museum', 'mercedes mcqueen', 'idaho', 'comparison of chines
e transcription systems', 'eric holder', 'john bolton', 'che guevara in popular culture', 'yule marble',
'2009 lincolnshire county council election', 'merit (buddhism)', 'amnesty international', 'ken barlow',
'shanghai', 'john a. macdonald']
['north korea–united states relations', 'timeline of labour issues and events', 'north korea and weapons
of mass destruction', 'foreign relations of north korea', 'north dakota', '2018 north korea–united states
singapore summit', '2017–2018 north korea crisis', '1001 movies you must see before you die', 'human righ
ts in north korea', 'north carolina', 'north bergen, new jersey', 'carolina–duke rivalry', 'north korea',
'history of north carolina', 'north america', '2007 uttar pradesh legislative assembly election', 'motown
discography', 'north sea', 'public facilities privacy & security act', 'honorific nicknames in popular
music', 'concealed carry in the united states', '2015 in australian television', 'new york asian film fes
tival', 'japan–united states relations', 'economy of north korea', 'democratic peace theory', 'communist
state', 'national popular vote interstate compact', 'foreign policy of the united states', "timeline of w
omen's education", 'treaty on the non-proliferation of nuclear weapons', 'battle of coral–balmoral', 'for
eign relations of the united states', '2018 pennsylvania house of representatives election', 'north china
craton', '2008 pennsylvania house of representatives election', 'united states v. wong kim ark', '2016 ge
orgia state elections', 'kim jong-il', '2014 pennsylvania house of representatives election', '2012 penns
ylvania house of representatives election', '2000 pennsylvania house of representatives election', '2016
pennsylvania house of representatives election', 'idaho', '2010 pennsylvania house of representatives ele
ction', 'northeast india', 'aggravated felony', '2008 attacks on uttar pradeshi and bihari migrants in ma
harashtra', 'charlotte, north carolina', 'gene kelly awards']
['playstation', 'playstation 3', 'sheep', 'timeline of science fiction', 'sutton, london', 'health effect
s of bisphenol a', 'preston, lancashire', 'billy joel', 'hayes, hillingdon', 'darren osborne', 'divorce',
'rutgers university', 'rudy giuliani', 'malaysian islamic party', 'aggression', 'kuwait', 'hulk hogan',
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 4/10
'jay lethal', 'yahoo!', 'che guevara in popular culture', 'eric holder', 'catalog of paintings in the lou
vre museum', 'mercedes mcqueen', 'idaho', 'john bolton', 'comparison of chinese transcription systems',
'shanghai', '2009 lincolnshire county council election', 'yule marble', 'john a. macdonald', 'merit (budd
hism)', 'tampa, florida', 'milwaukee', 'richard wagner', 'cattle', 'vaccination schedule', 'paramount pic
tures', 'amnesty international', "international meeting of communist and workers' parties", 'bernadette p
eters', 'playstation 4', 'opinion polling for the 2019 ukrainian parliamentary election', 'labour party
(uk) election results (1929–1945)', 'wound healing', 'diamond', 'royal dutch shell', 'brighton', 'ken bar
low', 'kiss (band)', 'dot cotton']
If you order these titles based on Page Rank, which you will do in Part E, you should get the
following:
['amnesty international', 'shanghai', 'dna', 'yahoo!', 'kuwait', 'idaho', 'rudy giuliani', 'rutgers unive
rsity', 'mitochondrion', 'sheep', 'dna repair', 'dna sequencing', 'francis crick', 'epigenetics', 'dna me
thylation', 'divorce', 'human genome', 'preston, lancashire', 'duke lacrosse case', 'billy joel', 'rosali
nd franklin', 'hulk hogan', 'eric holder', 'aggression', 'carcinogenesis', 'crispr', 'john a. macdonald',
'merit (buddhism)', 'dna profiling', 'nucleoid', 'dna virus', 'jay lethal', 'synthetic biology', 'origin
of replication', 'hayes, hillingdon', 'ken barlow', 'sutton, london', 'john bolton', 'malaysian islamic p
arty', 'che guevara in popular culture', 'comparison of chinese transcription systems', 'yule marble', 't
imeline of science fiction', '2009 lincolnshire county council election', 'darren osborne', 'mercedes mcq
ueen', 'eukaryotic dna replication', 'dna doe project', 'health effects of bisphenol a', 'catalog of pain
tings in the louvre museum']
['north america', 'north korea', 'north sea', 'north carolina', 'charlotte, north carolina', 'north dakot
a', 'idaho', 'communist state', 'kim jong-il', 'honorific nicknames in popular music', 'foreign policy of
the united states', 'treaty on the non-proliferation of nuclear weapons', 'northeast india', 'north korea
and weapons of mass destruction', 'democratic peace theory', '2018 north korea–united states singapore su
mmit', 'north bergen, new jersey', 'public facilities privacy & security act', 'concealed carry in the un
ited states', 'united states v. wong kim ark', 'japan–united states relations', 'human rights in north ko
rea', 'north korea–united states relations', 'economy of north korea', 'new york asian film festival', 'f
oreign relations of the united states', '1001 movies you must see before you die', 'history of north caro
lina', 'aggravated felony', 'north china craton', 'national popular vote interstate compact', 'carolina–d
uke rivalry', 'foreign relations of north korea', '2007 uttar pradesh legislative assembly election', '20
08 attacks on uttar pradeshi and bihari migrants in maharashtra', '2016 pennsylvania house of representat
ives election', 'battle of coral–balmoral', '2014 pennsylvania house of representatives election', '2018
pennsylvania house of representatives election', 'timeline of labour issues and events', '2017–2018 north
korea crisis', 'motown discography', '2015 in australian television', "timeline of women's education", '2
008 pennsylvania house of representatives election', '2016 georgia state elections', '2012 pennsylvania h
ouse of representatives election', '2000 pennsylvania house of representatives election', '2010 pennsylva
nia house of representatives election', 'gene kelly awards']
['amnesty international', 'shanghai', 'yahoo!', 'paramount pictures', 'kuwait', 'playstation 3', 'playsta
tion 4', 'tampa, florida', 'milwaukee', 'richard wagner', 'idaho', 'rudy giuliani', 'royal dutch shell',
'rutgers university', 'brighton', 'cattle', 'sheep', 'divorce', 'preston, lancashire', 'diamond', 'playst
ation', 'kiss (band)', 'billy joel', 'hulk hogan', 'eric holder', 'aggression', 'john a. macdonald', 'mer
it (buddhism)', 'jay lethal', 'wound healing', 'bernadette peters', 'hayes, hillingdon', 'ken barlow', 's
utton, london', 'john bolton', 'malaysian islamic party', 'dot cotton', "international meeting of communi
st and workers' parties", 'che guevara in popular culture', 'comparison of chinese transcription system
s', 'yule marble', 'timeline of science fiction', '2009 lincolnshire county council election', 'vaccinati
on schedule', 'darren osborne', 'mercedes mcqueen', 'opinion polling for the 2019 ukrainian parliamentary
election', 'health effects of bisphenol a', 'catalog of paintings in the louvre museum', 'labour party (u
k) election results (1929–1945)'].
C. Higher-Order Functions and Cosine Similarity [10 points]
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 5/10
Although we saw some relevant articles in each of the three queries in Part B, the results can
definitely be improved. In this section, you will need to write find_nearest_neighbors(vector, k,
compute_similarity) so that it takes in a function which accepts two projected vectors as its arguments
and returns the result of evaluating that function on those two vectors. The first two parameters
should be the same as the two parameters for find_closest_neighbors . You should then treat
find_nearest_neighbors as a higher-order function and pass to it a function that computes the
Euclidean distance between two vectors. Make use of the function you wrote in Part B and verify that
your results have not changed from Part B.
Once you have developed this higher order framework, create a new function which utilizes cosine
similarity to calculate the closeness of vectors instead of Euclidean distance. The formula for cosine
similarity is:
. In this case, A and B would be two projected vectors.
Try out the same three test cases from Part B. However, use the cosine similarity function that you
wrote rather than your Euclidean distance formula. The reference solution outputs the following:
['dna', 'dna profiling', 'dna repair', 'eukaryotic dna replication', 'nucleoid', 'dna sequencing', 'dna m
ethylation', 'francis crick', 'origin of replication', 'epigenetics', 'carcinogenesis', 'human genome',
'rosalind franklin', 'dna virus', 'dna doe project', 'crispr', 'duke lacrosse case', 'mitochondrion', 'br
ca1', 'glossary of genetics', 'synthetic biology', 'virus', 'genetic studies on jews', 'chemotherapy', 'g
enetic history of europe', 'genetic history of east asians', 'genetic engineering', 'life', 'chloroplas
t', 'helicobacter pylori', 'gene therapy', 'ultraviolet', 'ageing', 'microrna', 'cancer', 'panspermia',
'human papillomavirus infection', 'hiv', 'haplogroup r1a', 'genetically modified organism', 'o. j. simpso
n murder case', 'p53', 'linus pauling', 'origin of the albanians', 'scientific method', 'bacteria', 'hapl
ogroup q-m242', 'facioscapulohumeral muscular dystrophy', 'archaea', 'genetically modified crops']
['north korea–united states relations', 'timeline of labour issues and events', 'north korea and weapons
of mass destruction', 'foreign relations of north korea', 'north dakota', '2018 north korea–united states
singapore summit', '2017–2018 north korea crisis', '1001 movies you must see before you die', 'human righ
ts in north korea', 'north carolina', 'carolina–duke rivalry', 'north bergen, new jersey', 'north korea',
'history of north carolina', 'north america', 'motown discography', '2007 uttar pradesh legislative assem
bly election', 'north sea', 'public facilities privacy & security act', '2015 in australian televisio
n', 'honorific nicknames in popular music', 'new york asian film festival', 'japan–united states relation
s', 'economy of north korea', 'concealed carry in the united states', 'democratic peace theory', 'foreign
policy of the united states', 'communist state', "timeline of women's education", 'national popular vote
interstate compact', 'treaty on the non-proliferation of nuclear weapons', 'foreign relations of the unit
ed states', '2018 pennsylvania house of representatives election', '2008 pennsylvania house of representa
tives election', '2014 pennsylvania house of representatives election', '2012 pennsylvania house of repre
sentatives election', '2000 pennsylvania house of representatives election', '2016 pennsylvania house of
representatives election', '2016 georgia state elections', '2010 pennsylvania house of representatives el
ection', 'battle of coral–balmoral', 'kim jong-il', 'aggravated felony', '2018 new hampshire house of rep
resentatives election', '2020 pennsylvania house of representatives election', 'pakistan–united states re
lations', 'citizenship of the united states', '2019 north korea–united states hanoi summit', 'northeast i
ndia', 'united states v. wong kim ark']
['playstation', 'playstation 3', 'playstation 4', 'spider-man in film', 'history of video game consoles',
'playstation 3 models', '2009 in home video', '2010 in home video', 'sega saturn', '2011 in home video',
'2016 in home video', '2013 in home video', '2008 in home video', 'spider-man', '2012 in home video', '20
17 in home video', '2007 in home video', '2006 in home video', '2015 in home video', '2018 in home vide
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 6/10
o', 'dreamcast', 'playstation portable', 'spider-man: homecoming', '2019 in home video', "uncharted 3: dr
ake's deception", 'spider-man 3', '2020 in home video', 'spider-man: far from home', '2005 in home vide
o', 'j. jonah jameson', 'spider-man: into the spider-verse', '2014 in home video', 'eighth generation of
video game consoles', 'seventh generation of video game consoles', 'spider-man (2018 video game)', 'sony
pictures universe of marvel characters', 'plácido domingo discography', 'the amazing spider-man (2012 fil
m)', 'doctor octopus', 'blu-ray', "tony hawk's (series)", 'venom (2018 film)', '2004 in home video', 'lit
tlebigplanet (2008 video game)', 'eddie brock', 'god of war (franchise)', 'the amazing spider-man 2', 'no
rman osborn', 'high-definition remasters for playstation consoles', 'oled']
Once again, if you order these articles based on Page Rank, you should get the following:
['dna', 'cancer', 'bacteria', 'hiv', 'ultraviolet', 'chemotherapy', 'virus', 'scientific method', 'archae
a', 'mitochondrion', 'chloroplast', 'dna repair', 'dna sequencing', 'francis crick', 'life', 'genetic eng
ineering', 'epigenetics', 'dna methylation', 'human genome', 'duke lacrosse case', 'microrna', 'linus pau
ling', 'p53', 'genetically modified organism', 'rosalind franklin', 'brca1', 'genetically modified crop
s', 'carcinogenesis', 'gene therapy', 'crispr', 'helicobacter pylori', 'ageing', 'dna profiling', 'nucleo
id', 'dna virus', 'panspermia', 'haplogroup q-m242', 'human papillomavirus infection', 'synthetic biolog
y', 'origin of replication', 'genetic history of europe', 'o. j. simpson murder case', 'haplogroup r1a',
'genetic studies on jews', 'glossary of genetics', 'origin of the albanians', 'facioscapulohumeral muscul
ar dystrophy', 'genetic history of east asians', 'eukaryotic dna replication', 'dna doe project']
['north america', 'north korea', 'north sea', 'north carolina', 'north dakota', 'communist state', 'kim j
ong-il', 'honorific nicknames in popular music', 'foreign policy of the united states', 'treaty on the no
n-proliferation of nuclear weapons', 'northeast india', 'citizenship of the united states', 'north korea
and weapons of mass destruction', 'democratic peace theory', '2018 north korea–united states singapore su
mmit', 'north bergen, new jersey', 'public facilities privacy & security act', 'concealed carry in the un
ited states', 'united states v. wong kim ark', '2019 north korea–united states hanoi summit', 'japan–unit
ed states relations', 'human rights in north korea', 'north korea–united states relations', 'economy of n
orth korea', 'new york asian film festival', 'foreign relations of the united states', 'pakistan–united s
tates relations', '1001 movies you must see before you die', 'history of north carolina', 'aggravated fel
ony', 'national popular vote interstate compact', 'carolina–duke rivalry', 'foreign relations of north ko
rea', '2007 uttar pradesh legislative assembly election', '2020 pennsylvania house of representatives ele
ction', '2016 pennsylvania house of representatives election', 'battle of coral–balmoral', '2014 pennsylv
ania house of representatives election', '2018 pennsylvania house of representatives election', '2018 new
hampshire house of representatives election', 'timeline of labour issues and events', '2017–2018 north ko
rea crisis', 'motown discography', '2015 in australian television', "timeline of women's education", '200
8 pennsylvania house of representatives election', '2012 pennsylvania house of representatives election',
'2000 pennsylvania house of representatives election', '2016 georgia state elections', '2010 pennsylvania
house of representatives election']
['playstation 3', 'playstation 4', 'blu-ray', 'playstation portable', 'spider-man', 'playstation', 'ole
d', 'sega saturn', 'dreamcast', 'spider-man: homecoming', 'seventh generation of video game consoles', 's
pider-man: far from home', 'eighth generation of video game consoles', 'doctor octopus', 'spider-man 3',
'j. jonah jameson', "uncharted 3: drake's deception", 'norman osborn', 'the amazing spider-man (2012 fil
m)', 'the amazing spider-man 2', 'eddie brock', 'spider-man in film', 'spider-man: into the spider-vers
e', 'god of war (franchise)', 'venom (2018 film)', 'spider-man (2018 video game)', 'littlebigplanet (2008
video game)', 'sony pictures universe of marvel characters', 'high-definition remasters for playstation c
onsoles', 'plácido domingo discography', "tony hawk's (series)", '2006 in home video', 'history of video
game consoles', 'playstation 3 models', '2009 in home video', '2010 in home video', '2011 in home video',
'2016 in home video', '2013 in home video', '2008 in home video', '2012 in home video', '2017 in home vid
eo', '2007 in home video', '2015 in home video', '2018 in home video', '2019 in home video', '2020 in hom
e video', '2005 in home video', '2014 in home video', '2004 in home video']
D. Lambdas and NumPy [5 points]
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 7/10
The euclidean distance and cosine similarity functions that you wrote could have been written in one
line using functions from NumPy (https://numpy.org/doc/) . First, change these two functions so that
they are written as concisely as possible using NumPy. These changes are meant to speed up your
query engine since NumPy functions tend to run very efficiently.
Many computer scientists would argue that writing an entire function was not necessary for
something that could have been written in one line. We will now turn our attention to lambdas. A
lambda is an unnamed, anonymous function, normally used as an argument to a higher-order
function. Please visit the CodeSkulptor Documentation
(https://rice.codeskulptor.org/docs.html#lambda-return) for more information about the syntax of
lambda in Python.
Once you have read more about lambdas, rewrite these one-line functions as lambdas. Do not delete
the original euclidean distance and cosine similarity functions that you wrote. However, pass in the
lambda versions of these functions to find_nearest_neighbors .
Side note: Even after you complete this class, you are encouraged to revisit the NumPy and
CodeSkulptor documentation pages.
E. Using Pagerank [15 points]
Now that we have a PCA-space query engine, we will use PageRank to make our results not only
relevant but also significant. There are two ways we can join the two metrics together:
Treat the PCA space similarity as the primary metric: When the user demands articles, obtain
the top- (where ) similar articles in the PCA space. From those, rank them by PageRank
and output the top .
Directly combine the two metrics: rank all our articles based on a new metric derived from the
PCA-space similarity and the PageRank. For example, the new metric can be the weighted
average of the two or the harmonic mean of the two.
We will specifically guide you through the first option, but you may choose to explore the second
option in Part H. For this task, create a function rank_nodes(article_titles, k) which takes in a list of
article titles and returns the top articles in terms of PageRank. You may assume that these articles
exist in the provided pagerank_results_20199.pkl file.
Once you have finished implementing this function, update your search function so that its new
header is search(keywords, t, k) , where is the number of nearest neighbors that you want to find
(previously was written as in Part B), and is the total number of articles that you want your
search query to return. This function should project the keywords as a vector, find the nearest
neighbors to that vector, and return a list of the article titles with the highest PageRank from those
articles. Here are the reference results to the three previous test cases after combining PageRank
into this search process:
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 8/10
print(search(["dna", "sequencing", "chromosome", "genome"], 50, 10))
print(search(["north", "democracy", "states"], 50, 10))
print(search(["sony"], 50, 10))
['dna', 'cancer', 'bacteria', 'hiv', 'ultraviolet', 'chemotherapy', 'virus', 'scientific method', 'archae
a', 'mitochondrion']
['north america', 'north korea', 'north sea', 'north carolina', 'north dakota', 'communist state', 'kim j
ong-il', 'honorific nicknames in popular music', 'foreign policy of the united states', 'treaty on the no
n-proliferation of nuclear weapons']
['playstation 3', 'playstation 4', 'blu-ray', 'playstation portable', 'spider-man', 'playstation', 'ole
d', 'sega saturn', 'dreamcast', 'spider-man: homecoming']
F. Tuning the parameters [10 points]
Although our results look a lot better, they can still be improved a bit. For example, you may have
expected to see 'united states' as one of the top results for our second query.
You may have noticed that our choice of k = 50 was very arbitrary. Changing may give different, if
not better, results. If we consider our engine as a 'model', the words as the 'inputs' and the result
articles as the 'output', then k is a 'parameter' of our model. In this section, you will experiment with
tuning your parameter. You could consider two approaches:
By domain knowledge: If you are an expert in Wikipedia articles, PCA, PageRank, etc., you may
use expertise in designing this query engine. If you know already, from your experience, what the
best handful of candidates for the parameters are, you can manually test your candidates and
find the optimal one.
Using a computer to search for the parameters: If you are not yet confident with your expertise in
this very specific problem, you may wish to let the computer itself "figure the parameters out".
That is, to write a program that determines the best parameter. To make the computer "learn" how
to find good results, these algorithms usually require a large amount of training data, that is
(input, output) pairs that we wish to fit onto the model.
You are free to explore the first option, but your task in this part is to implement a very basic
algorithm to automatically determine parameter k (for this part, you can keep n as 10). We will call
this algorithm "grid search" and it will consist of 3 steps:
1. Determine a finite search space for the parameter of interest.
2. Using the training data, evaluate the model outputs for all choices in the search space.
3. Compare model outputs to desired outputs using some metric to determine which parameter
value is the best option.
In this part, follow the suggestions below:
1. A reasonable search space for this problem can be k in 10,20,30,...200.
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 9/10
2. Use the following examples as your training data. Note that in real-world problems, we would
need a much larger training data set to obtain good results. This is just a toy example.
["dna", "sequencing", "chromosome", "genome"] -> "dna"
["north", "democracy", "states"] -> "united states"
["sony"] -> "playstation 3"
3. We want our desired article to be the first article in the output list. Define 'error' as the positional
distance between the desired article and the first article in the output. for example, if "united
states" comes in the third position in our second training case, the error is 2. The training error is
the sum of errors of all 3 training cases. We want the training error to be as small as possible.
4. Use the cosine similarity method as your primary metric.
Often as a last step of training a model, we wish to know if the model generalizes well to data it has
never seen before. For the purpose, we usually have a separate testing dataset. We should not use
this test set during training, but only to evaluate the effectiveness of the training model. We define
'error' the same way as we did above, but this time the sum of errors of all testing cases is called the
testing error. If the testing error is high, we sometimes need to change the training algorithm or make
other adjustments. Here is a toy testing set that has only one entry:
["president", "42", "arkansas"] -> "bill Clinton"
In your writeup, include the training error, testing error, and the optimal value of k you found.
G. Testing [10 points]
You are required to test your code for this assignment. You must write Python unit tests for all the
functions you write for this assignment, and you must use a Mock or a Spy in at least 2 of your unit
tests. Include your tests with your Python submission. Refer to the video lectures on testing. You
must determine what and how to test and the effectiveness of your tests will be considered in the
grading of this part.
H. Improving the Engine [10 points]
This homework only covered a fraction of the topics we could explore about the keyword query
engine. There are many other interesting directions you can explore to improve the engine, either
with a slight change in some mathematics or an improvement in the key metrics. Some attempts may
have positive effects, while others may lead to negative ones. In any case, this part is not about how
good your improvement is (it does not even have to have a precise definition of 'good'). It encourages
you to experiment, and in your writeup, document what you've tried and what your findings are. This
is an open question.
Some ideas you can explore:
2020/12/1 Keyword Query Engine: COMP 614 F20
https://canvas.rice.edu/courses/35911/pages/keyword-query-engine 10/10
Drawn from Part C: there are many other ways to compare the "difference" of two vectors. We
only demonstrated Euclidian distance and cosine similarity.
Drawn from Part E: remember that we only explore option 1, option 2 is still open.
Drawn from Part H: Option 2 in Part E has even more parameters to tune.
Drawn from Part A: Usually, the words in the title of an article have more significance than the
words in the body. Our bag of words extraction neglects this fact.
Drawn from Part B: We used the PCA space of the bag of words representations as the metric for
similarity in the query engine. Of course, we can replace it with other metrics. TF-IDF
(http://www.tfidf.com/ (http://www.tfidf.com/) ) is one of them.
Important: Do not delete or modify the previous functions that you wrote. Create a new
functions for your changes, so we can still test your code for Parts A-G.
Note: If you need the original, unprojected word vectors for this part, please reference the
word_data.npy file from Homework 5.
I. Code Style [10 points]
Please follow all code style guidelines (https://canvas.rice.edu/courses/35911/pages/coding-style-
and-standards) .
Submission
This assignment should be submitted in two parts. All of the code should be submitted in one file:
hw9.py. All of the written questions should be submitted as a file (plain text or PDF - do not submit
any other format) with written answers for all of the parts that require that.
You will find two assignments in Canvas, one for each part, that are clearly labeled (Python) and
(Writeup) to indicate which part is which. Be sure to submit both assignments!
The table below indicates what you should submit for each part:
Keyword Query Engine (Writeup) your writeup in txt or pdf format
Keyword Query Engine (Python) hw9.py
Please refer to the syllabus for the late policy.
You may submit as many times as you like. However, only your last Python submission and your last
Writeup submission will count towards your assignment grade and be used to determine if your
assignment is late.

欢迎咨询51作业君