
Search Engine
Homework #5
DATA 1050 Fall 2020
Due 23:59 EST,​ Wednesday, October 28, 2020
Optional: ​Coding Problem 3 Resubmission​ due by 23:59 EST, Monday, ​October 19, 2020

Setup & Submission

1. Use ​this​ GitHub Assignment Invitation ​Link: ​https://classroom.github.com/a/Vbl7I9d9
2. Run pip install --user pytest pytest-timeout flake8 black nltk (or the equivalent conda install command) in a
terminal to install all required packages.
3. Push to master as much as you want before the homework deadline.
4. If your final push will be late, remember to use the ​Late Day Form *twice*, once before the deadline, and once
again after your final push.
5. When you are ready to submit, visit Gradescope and submit your repo. Your report should be included in the repo
as report.md
6. Be sure to also include a copy and paste of the test results and coverage information in coverage.md in each of
your stencil directories. Two students created a helpful utility script to do this for you.
7. For Coding Problem 3 Resubmission, please only include the coding problems you want to be
re-graded in a stencil_CodingProblems3 subdirectory of your hw5 repo and submit the entire repo via
the corresponding Gradescope assignment “Coding Problems3 Resubmission”.

Overview
Search engines retrieve a list of documents (typically web pages) that match a given query. Since multiple documents
may satisfy the query, most search engines also attempt to list more relevant documents first. In this project, you will use
the TF-IDF and PageRank algorithms to create part of a web search engine.
Operationally, each searchable document must be analyzed in advance (in a process called indexing) to gather and
integrate statistics about its content. This data is used to form an index (like the one found at the end of a book) that
allows user search queries to be performed in real-time without the need to revisit all of the original documents.
Document term count information is enough to perform simple query matching since it allows one to find all documents
that contain a particular set of terms. It also allows one to rank documents relative to one another based on their term
frequencies. TF-IDF, otherwise known as term frequency–inverse document frequency, is one way to do this.
Web search engines use “spiders” to “crawl” the web in order to automatically visit and index web pages. At each page,
a count of the number of times each word (“term”) appears in the document is recorded (i.e., the term histogram is
recorded), a list of outbound links (outlinks) is collected, and then the spider moves on.
When present, link information defines a document graph, which can be used for document ranking. For example,
documents that are referred to by many other documents may be more important than ones with very few incoming links
(inlinks). Today, Google and most other web search engines use PageRank on link data to help determine web page
importance.


The PageRank algorithm was created by ​Larry Page and ​Sergey Brin (the founders of Google) while they were still
graduate students at Stanford. While Page’s ​original technical report suggested using it to rank web page importance
based on link information, it has since been used to analyze many other networks. Identifying social media influencers
(based on follower networks), and gauging the importance of scientific papers (via citation networks) are two examples.
A Note on Vocabulary
In this project handout,
● A “document” refers to a Wikipedia page
● A “​term​” refers to a word (or alphanumeric sequence containing no white space)
Details
We used the Wikipedia ​download interface to download some of its web pages. These documents (web pages) are
stored as XML records. In order to create your search engine, you will write code to process these XML files.
In the world of search and natural language processing, a collection of texts is called a ​corpus​. A ​corpus (plural ​corpora​) is
a collection of written texts (web pages with links in this case). Thus the corpora you will be working with are subsets of
the pages of the English-language Wikipedia.
This project has four main parts, which are described in detail in the sections below:
1. Calculating TF-IDF
2. Calculating PageRanks
3. Processing the XML files
4. Search Engine and user interface
Note: We suggest you implement your solution in the sequence listed above. TF-IDF and PageRank are both
somewhat math-heavy, and it is important you get them working with artificial data for which you are sure of the answers
before you start processing real data. Specifically, this data will be used to create ​analytic test cases​, some of which we
provide, and some of which you will provide. We encourage you to make git commits once a subcomponent is working.
This makes it easier for you to revert to a previous version.
Proceeding in this order will also help the course staff know which parts of the problem you have completed if you have
questions.
Stencil files provided:
● TFIDF.py, PageRank.py: Task: Implement the TF-IDF and PageRank algorithms.
● processing.py​: Task: index the Wikipedia corpus
● SearchEngine.py​: Task: Implement an API that takes search queries and returns search results ranked by the
chosen algorithm
● app.py: A simple command-line interface (CLI) we supply that will allow you to use your search engine
interactively
● report.md​: Task: Write your report here
Datasets provided:
(note that all datasets have been zipped together into wiki_data.zip​. You’ll have to extract them into your stencil
directory before you can proceed)
● MiniWiki.xml, SmallWiki.xml​: XML files with 2 and 107 Wikipedia articles respectively. As you work on
adding functionality to your search engine, we suggest you start by testing with these!


● MedWiki.xml: An XML file containing 871 Wikipedia articles. You can try testing with this file once you get your code
working with SmallWiki.
● PageWiki.xml​: An XML file containing 100 documents with titles 1 to 100. Each document contains nothing
except for a link to document 100. You can use this file to test your implementation of PageRank. Document 100
should have a much higher PageRank than the other documents.
Using VSCode​  
Jupyter notebooks are often used for visualization and analysis, and they work well for one-time coding tasks. In this
project and more generally in DATA 1050, we will not be coding in Jupyter notebooks but rather, in Python files. To help
with this we recommend you try using the Visual Studio Code Integrated Development Environment (IDE) with the
Microsoft Python Development extension.
Please read the ​VS Code Instructions​ document to learn how to do the following:
1. Start the VS Code IDE on data.jupyter.brown.edu
2. Install the Python Development Extension
3. Install and configure other helpful packages and extensions
4. Run the debugger
VS Code is an immensely powerful software development tool. Learning how to use its many features takes a long time.
However, because it also provides integrated ​terminal window support, you can continue to use all of the command line
based python development skills you already have.
Since this is the first time we have used VS Code in this course, we will update the VS Code Instructions as needed. We
suggest you turn on notifications for changes to the VS Code Instructions document.
Coding Style Requirements
Using a consistent coding style is a critical part of code quality. For this project, make your style consistent with PEP 8,
the style guide for Python developers from Guido himself. There are many tools for checking whether your code is styled
correctly. You can periodically run flake8 *.py in a terminal under your code directory. You can also install a flake8
extension to give you PEP 8 style warnings as you code. Either way, resolve all style warnings you see to avoid receiving
point deductions.
An auto-formatting tool may also come in handy. From the terminal you can run black -l100 on a file to
automatically reformat your code (e.g., running black -l100 PageRank.py will reformat the code in your PageRank.py
file). Various VS Code extensions provide similar functionality and allow you to select regions of text to reformat.
Part 5.1 TF-IDF
As the name suggests, TF-IDF has two parts: TF (​term frequency​) and IDF (​inverse document frequency​). TF-IDF allows
us to capture the following two ideas:
● A term that appears many times within a document is likely to be more relevant to that document than a term that
appears fewer times (this is captured by TF). We calculate the term frequency as

TF(term, doc) = (# of times the term appears in the document) / (total # of terms in the document) = P(term | doc)


● A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance compared to a
common term like “the” (this is captured by IDF). A rare term often tells more about the document. The inverse
document frequency of a term is defined as

IDF(term) = log(# of documents in the corpus / # of documents in which the term appears)

Please use the natural log (which is what Python’s math.log does).

● TFIDF(term, doc) = TF(term, doc) × IDF(term)

Examples
Consider a document containing 100 terms wherein the term ​cat appears 3 times. The term frequency (i.e.,
tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the term ​cat appears
in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as ln(10,000,000 /
1,000) = 9.210. Thus, the tf-idf weight for the term ​cat in this document is the product of these quantities:
0.03 * 9.210 = 0.2763.
You may also find a more sophisticated example ​here​.
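The same arithmetic in Python, as a quick check of the numbers above (math.log is the natural log):

```python
import math

tf = 3 / 100                         # "cat" appears 3 times in a 100-term document
idf = math.log(10_000_000 / 1_000)   # ln(10,000) ≈ 9.210
print(round(tf * idf, 4))            # 0.2763
```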
Complete the following tasks in​ TFIDF.py.
Task 5.1
● calc_term_frequency​: For each document and each term, calculate term frequency. Store the results in a
dictionary that maps from each ​term to a secondary dict. The secondary dict maps document title ​doc to
TF(term, doc). Indexing works like this: looking up [term] and then [a document that contains the term] yields the
frequency of that term in that document. Pay close attention to the docstring. See the provided function test_tfidf_basic
for example inputs and outputs (and the illustration after this task list).
● calc_tf_idf: Iterate through the dictionary created for TF and multiply each term frequency by the term’s
inverse document frequency. Note that the returned dictionary can also serve as a “reverse index,” (i.e., given a
term, you can use a single dictionary lookup to get the documents that contain this term)
● search: Given a list of keywords, return a list of matching documents paired with their tf-idf scores. Each
returned document should contain at least 1 of the keywords. The returned list should be sorted by document
score. When a limit is specified, only return that many ​(doc, tf-idf-score) pairs (e.g., if limit = 10, you  
would only return the top 10 matches). Compute tf-idf score by summing the tf-idf values of each query word
found in the document.
score(doc, keywords) = Σ_{term ∈ keywords} TFIDF(term, doc)
● Include your pytest compatible tests in ​TFIDF.py​.
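As a rough illustration of the shapes involved (the dictionary contents below are made up, and the real signatures are
specified in the stencil docstrings), the TF table, the TF-IDF step, and the scoring step look roughly like this:

```python
# Hypothetical values, for illustration only -- follow the stencil docstrings.
tf = {
    "cat": {"Doc A": 0.03, "Doc B": 0.10},   # tf["cat"]["Doc A"] == TF("cat", "Doc A")
    "hat": {"Doc A": 0.01},
}
idf = {"cat": 0.5, "hat": 1.2}

# calc_tf_idf: multiply each TF entry by that term's IDF.
tf_idf = {term: {doc: freq * idf[term] for doc, freq in docs.items()}
          for term, docs in tf.items()}

# search: sum the tf-idf values of every query word found in each document,
# then sort the (doc, score) pairs by score, highest first.
def score(keywords):
    scores = {}
    for term in keywords:
        for doc, value in tf_idf.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + value
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

print(score(["cat", "hat"]))  # Doc B outranks Doc A here
```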


Notes
● Read carefully through the docstrings (prose and parameter descriptions). The docstrings most precisely specify the
behavior of each method.
● Use​ test_tfidf_basic​ as a starting point to help you understand the expected input and output.
● Some methods are labeled as @staticmethod. This means they do not depend on instance or class data; notice that
they do not take a self parameter. @staticmethods can be invoked without a class instance, as illustrated in
test_tfidf_basic. Their output does not depend on instance or class variables (i.e., they are regular stateless
functions). This is sometimes helpful because each method can be tested separately. (For interested readers, see here
for a discussion of the difference between @classmethods, which can depend on class variables, and @staticmethods.)
Report Q&A: Ordering and Logs
Hw5.1 ​Your boss has been reviewing your code and wants you to do the following:

“store word frequencies the way you compute”, i.e. s/he wants you to use a dict of dicts which is indexed via
{document:{word:word frequency}}

Think carefully about this suggestion and give an appropriate response.

Part 5.2 PageRank
So far we have scored documents based on their relevance to a given query only. This is how early search engines like
AltaVista worked. But then, in the late 90s, the founders of Google came up with an idea for ranking documents based on
their position in the web itself. Their PageRank algorithm gives an overall importance score and it pays no attention
whatsoever to the documents' content. Rather, the rankings it computes are based entirely on the link structure between
documents: i.e., which documents link to which. Documents with high scores (i.e.,
high PageRanks) have links from other documents that also have high ranks (which are linked to other documents with
high ranks). This is a kind of crazy circular idea, but when converted into the appropriate mathematics the algorithm works
quite well.
There are many variants of the PageRank algorithm. In this project, we will implement one of the most basic. Key
principles underlying the design of PageRank:
1. The more documents that link to a page, the more authoritative it should be.
○ For example, if my personal blog post on “Providence Internet Outages" is linked to by 5 different web
pages (5 different incoming links, aka 5 inlinks), and the Wikipedia document on "Providence, Rhode
Island" is linked to by 1000s of web pages (1000s of inlinks), then it makes sense to consider the
Wikipedia document more authoritative.
○ Many early search engines used incoming link counts, and at first this worked well, but over time people
realized they could promote a page by making many fake pages that pointed to it.
○ The PageRank innovation was to also consider the authority score of incoming pages. Authoritative
pages should themselves be linked to by other authoritative pages.
2. The ability of an authoritative page to promote other pages should be constrained by the total number of outgoing
links it has; fewer links provide stronger references.
○ Assume "Research at Brown" is linked to by an ​NYTimes document which links to only 10 other pages,
and my blog "About Me" page is linked to by "Rhode Island Technology Bloggers", which links to 2000
other pages. Because the “NYTimes” page has only a few links, and “Rhode Island Technology Bloggers” has many
links, a link from the “NYTimes” page lends a larger fraction of its importance than a link from the “Rhode
Island Technology Bloggers”, leading us to attribute more authority to “Research at Brown” than to “About
Me”.
Let u be a document, let B_u be the set of documents that have a link to u, and let L(u) be the number of outgoing
edges from u. The PageRank score for document u is defined as

PageRank(u) = Σ_{v ∈ B_u} PageRank(v) / L(v)

The trouble with this equation is that it is circular: in order to calculate the weight of document u, we have to know the
weights of all documents linking to u. How do we begin calculating the document weights? In practice, we use an iterative
approach and stop once the document weights have converged. Here is a suggested approach to implementing
PageRank (a code sketch that follows these steps appears after the notes below):
● Create a dict ​weights​. Initialize weights for all documents to 1/N​, where ​N is the number of documents in
the corpus
● While maximum # of iterations not reached
○ Create a dict ​new_weights ​with weights for all documents set to 0
○ Iterate through each document ​x​.
For each document ​y​ that document ​x​ links to, update
​new_weights[y] += weights[x] / (# of documents x links to)
○ Calculate L2-distance between ​weights ​and ​new_weights​.
○ Break out of the loop if​ distance < eps * N
○ Set​ weights = new_weights
Note​:
● L2-distance is ​Euclidean distance
● If you sum up the PageRank value for each document in a corpus, you should get ​1.
● Links from a document to itself are ignored. ​Links to documents outside the corpus are ignored. You may
need to filter such links.
● Documents that link to nothing (i.e. sinks, like B below) are considered linked (once) to all documents ​including
itself. Therefore, each document receives​ (weight of this sink) / N​.
● Multiple links from one document to another are treated as a single link.
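A minimal sketch of the iterative scheme above, under the simplifying assumption that self-links and links outside the
corpus have already been filtered out (the helper name and input format here are ours, not the stencil's; calc_page_rank
may be organized differently):

```python
import math

def page_rank(links, eps=1e-4, max_iterations=100):
    """links maps each document title to the set of titles it links to.

    Sinks are treated as linking once to every document (including themselves),
    and iteration stops when the L2-distance between successive weight dicts
    falls below eps * N.
    """
    docs = list(links)
    n = len(docs)
    weights = {doc: 1 / n for doc in docs}
    for _ in range(max_iterations):
        new_weights = {doc: 0.0 for doc in docs}
        for doc, outlinks in links.items():
            targets = outlinks if outlinks else docs          # a sink "links" to everything
            for target in targets:
                new_weights[target] += weights[doc] / len(targets)
        distance = math.sqrt(sum((weights[d] - new_weights[d]) ** 2 for d in docs))
        weights = new_weights
        if distance < eps * n:
            break
    return weights

# The A/B/C corpus from the example below: A links to B and C, B is a sink, C links to A.
print(page_rank({"A": {"B", "C"}, "B": set(), "C": {"A"}}))  # roughly {'A': 0.4, 'B': 0.3, 'C': 0.3}
```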
Examples

Let’s say that we have a corpus consisting of the documents (a.k.a. pages) A, B, and C, where A
links to B and C, and C links to A. The table below shows the results of running the PageRank algorithm on this
corpus in order to determine the weight (a.k.a. rank) of each document. Remember that since B does not link to
anything, we treat it as if it links to every document including itself.
Iteration Weight of A Weight of B Weight of C
Formula B / 3 + C / 1 A / 2 + B / 3 A / 2 + B / 3
Initial 1/3 = .333 1/3 = .333 1/3 = .333
1 .333 / 3 + .333 / 1= .444 .333 / 2 + .333 / 3 = .277 .333 / 2 + .333 / 3= .277
2 .277 / 3 + .277 / 1 = .370 .444 / 2 + .277 / 3 = .314 .444 / 2 + .277 / 3 = .314
3 .314 / 3 + .314 / 1 = .419 .370 / 2 + .314 / 3 = .290 .370 / 2 + .314 / 3 = .290
4 .290 / 3 + .290 / 1 = .386 .419 / 2 + .290 / 3 = .306 .419 / 2 + .290 / 3 = .306

Task 5.2
● calc_page_rank​: Given links between the documents, calculate the PageRank values for each page as
specified above.
● search: Search for documents containing at least one of the query words, sort them by PageRank value, and
return up to `limit` of them.
● Write unit tests for​ PageRank.py​.
● Use​ test_page_rank_basic​ as a starting point to ​help you understand the expected input and output​.
● Add file docstring and class ​docstring​ in​ PageRank.py​ and remove all TODOs.
Report Q&A: Page Rank Understanding
Hw5.2 There is a Markov chain interpretation of the PageRank algorithm in which the PageRank value represents the
long-run frequency with which web surfers arrive at a specific web page by following hyperlinks found on other pages.

Included in the model is the idea that a user can also visit a page without clicking a link (e.g., they can visit via a bookmark
or a typed URL). This random jumping process can be included in an iterative solution via a damping factor d. Explain the
following equation in terms of probability notation if PR(X) is the probability of arriving at node X, and L(X) is the number
of outbound links from X. (i.e., What does d represent, what does (1 - d)/N represent, and what does the
sum represent?)
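For reference, the damped form of the update given in the Wikipedia article linked below is

PR(X) = (1 - d)/N + d × Σ_{Y ∈ B_X} PR(Y) / L(Y)

where B_X is the set of pages that link to X and N is the total number of pages in the corpus.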



If 32% of traffic to a webpage comes from non-inlink sources, what should the value of d be?

See ​https://en.wikipedia.org/wiki/PageRank​ if you are stuck.


Part 5.3 Processing the XML file
TF-IDF is computed from word counts. In order to obtain these, irrelevant data (like punctuation and word spelling
variations) must be removed. The TF-IDF calculation is then done with respect to this simplified corpus. PageRank
requires all the links in the corpus.

It is now time to process the Wikipedia dump XML to extract each type of data.
Introduction to XML files
The Wikipedia corpora used for this project are stored as ​XML files​. ​XML stands for eXtensible Markup Language. It is
similar to HTML in that elements are surrounded by matching opening and closing tags. An opening tag looks like
<title>. The corresponding closing tag includes a slash, so it looks like </title>.
Open SmallWiki.xml in a text editor and get familiar with the structure of XML files.
Task 5.3.1 Processing Text
You should work with process_text in processing.py.
● Tokenizing
Tokenization refers to the process of splitting a string into tokens. For our purposes, tokens are sequences of characters
with no punctuation or whitespace. See the docstring for more guidance.
Ex: `Are you there?` should become `Are`, `you`, `there`
● Stop Words
After tokenizing, you will need to remove stop words. Stop words are words that appear too often to be meaningful for
searching (e.g. "the", "of"). In building a search engine, it makes sense to discard stop words. Say you search the cat
in the hat; the shorter query cat hat gets at the essence of the longer query. The Python Natural Language Toolkit
(NLTK) package includes a list of typical English stop words that you should filter out.
● Stemming
The next issue to overcome when building a search engine is that we want to calculate a count for the stem of each word
in each document, as opposed to each of its variants. walking, walked, and walks all essentially mean the same thing, so
we should reduce all of them down to walk. This is what stemming accomplishes. In the stencil code, we instantiate a
PorterStemmer for you. Note that its stem() method also lowercases all letters, which is desirable behavior.
Here is a suggested approach for the three steps mentioned above (a sketch appears after this list):
1. Use sent_tokenize and word_tokenize from nltk.tokenize for tokenizing. If the first execution of
these methods gives you an error, follow the printed instructions and fix it. Here are some examples.
2. Strip punctuation at both ends of each word; str.strip might be useful.
3. Stem words with the `stemmer` above.
4. Filter stop words and empty words. We've created a set of stop words for you using the NLTK toolkit in the
_english_stop_words variable; please filter using this variable. Doing so will let you (and us) test out different sets of
stop words.
Use test_processing_basic as a starting point to help you understand the expected input and output.
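A minimal sketch of this pipeline, assuming the required NLTK data has already been downloaded (the stencil's
process_text has its own signature and uses the provided _english_stop_words set; the names below are ours):

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

_stemmer = PorterStemmer()
_stop_words = set(stopwords.words("english"))   # stand-in for the stencil's _english_stop_words

def process_text(text):
    """Tokenize, strip punctuation, stem, and drop stop words and empty words."""
    tokens = []
    for sentence in sent_tokenize(text):
        for word in word_tokenize(sentence):
            word = word.strip(string.punctuation)   # strip punctuation at both ends
            word = _stemmer.stem(word)              # stem() also lowercases
            if word and word not in _stop_words:
                tokens.append(word)
    return tokens

print(process_text("The cats are walking home."))  # ['cat', 'walk', 'home']
```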
Task 5.3.2 Processing Links
The extract_links method is used for this part.
In the XML files you will be working with, links can take the following two forms:
1. [[Impeached Presidents]]: This is a normal link. Impeached Presidents is the text displayed on the page,
and if you click on it, you are directed to the Impeached Presidents Wikipedia page.
2. [[Washington|Presidents]]: The text that comes before the | character is the text displayed on the page (we
will call this "display text"), but the page that is linked to is the text that comes after the |. In this case we would
see Washington displayed on the page, but when we click on it, we would be redirected to the Presidents
Wikipedia page.
○ Note: When calculating TF-IDF, Washington should be considered as a word on the page, but
Presidents should not.
○ Note: If a link contains more than one | (e.g. [[File:Essendon logo 2010.png|center|150px]]), then do not
treat it as a link. The words should also not be counted as words on the page.
To isolate links that appear in the text, it is helpful to use a regex (regular expression). The Python package that provides
regex functionality is called re, which has been imported at the top of the stencil code.
Here is a useful regex: r'\[\[[^\[]+?\]\]'
● Meaning: Match two left brackets (\[\[) and two right brackets (\]\]), making sure there's something in the
middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another
link was starting inside this link ([^[]+?).
If you want to play around with regex functions, take a look at this page.
Feel free to take a look at the documentation for more information!
The code for extract_links has been given to you, so no coding is required for this part; however, we expect you to
fully understand it.

Report Q&A: extract_links
Hw5.3 In report.md, explain how extract_links operates and what it returns. In addition, what would be different if
we don't return the second return value (words extracted from display texts)?
Task 5.3.3 Working with XML files in Python
You should work with build_corpus in this part.
You will notice that at the top of processing.py, there is a line that says from xml.etree import ElementTree.
This is a Python package that makes it easy to work with XML files. Take a look at the documentation to learn how to use
this package.
You should iterate through all the <page> nodes in the given XML file. We are interested in the <title> and <text>
of each page. Methods you may find useful are root.iter and node.find(). Apply process_text and
extract_links on the text (or the text concatenated with the title and the display text from links). The first return value
is a corpus (a list where each item is a tuple of document title and processed text) used as the input to TF-IDF. The
second return value is the links that will be used as input to PageRank.
Use test_processing_basic as a starting point to help you understand the expected input and output.
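A minimal sketch of this iteration (the exact nesting of <text> in the provided files and the exact return format of
extract_links are assumptions here; follow the stencil docstrings for the real signatures):

```python
from xml.etree import ElementTree

def build_corpus(xml_path):
    """Return (corpus, links) for the pages in the given XML file.

    corpus: list of (title, processed_text) tuples -- the input to TF-IDF.
    links:  dict mapping each title to the set of pages it links to -- the input to PageRank.
    """
    root = ElementTree.parse(xml_path).getroot()
    corpus = []
    links = {}
    for page in root.iter("page"):                         # visit every <page> node
        title = page.find("title").text.strip()
        text = page.find("text").text or ""                # adjust the path if <text> is nested deeper
        link_targets, display_text = extract_links(text)   # assumed: (targets, display-text string)
        tokens = process_text(" ".join([title, text, display_text]))
        corpus.append((title, tokens))
        links[title] = set(link_targets)
    return corpus, links
```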
Part 5.4 SearchEngine: Optimizing Search Results
class SearchEngine (located in SearchEngine.py) allows users to query the corpus with various ranking schemes.

Command Line Interface (CLI)
Once you have implemented enough SearchEngine functionality, you can run python app.py to start an interactive
command-line interface for your search engine (your results will not necessarily match this screenshot). The available
commands will be displayed when you run it. (The TAs will also use the CLI to evaluate your implementation as part of
the grading process.)
Tasks
● Read through __init__ and search. Make sure you understand them.
● Implement the 'smart' search mode in smart_search. You should take both TF-IDF and PageRank into
account. This is where you can get creative! (A sketch of one possible combination appears after this list.)

Keep track of how you develop your smart search. You will need to explain your approach to implementing
the 'smart' search mode in report.md. You will also include 2 sample searches you performed (and their
results) that demonstrate that your approach returns reasonable results. In addition, include a screenshot in that
section of the report of you using the CLI to query your smart engine.
● Add file, class and method docstrings in SearchEngine.py and remove all TODOs.
● Be sure that the Command Line Interface (described above) is working for you, since it will also be used to grade
your project.
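One possible way to combine the two signals (this is only a sketch of one idea, not the expected solution; the weighting,
normalization, and function signature are all up to you):

```python
def smart_search(tf_idf_results, page_ranks, limit=10, alpha=0.7):
    """Blend TF-IDF relevance with PageRank importance.

    tf_idf_results: list of (doc, tf_idf_score) pairs from the TF-IDF search.
    page_ranks:     dict mapping each document title to its PageRank value.
    alpha:          weight given to TF-IDF (the remaining 1 - alpha goes to PageRank).
    """
    if not tf_idf_results:
        return []
    max_tfidf = max(score for _, score in tf_idf_results) or 1.0
    max_pr = max(page_ranks.values(), default=0.0) or 1.0
    combined = []
    for doc, tfidf_score in tf_idf_results:
        # Normalize both signals to [0, 1] before mixing so neither one dominates.
        relevance = tfidf_score / max_tfidf
        importance = page_ranks.get(doc, 0.0) / max_pr
        combined.append((doc, alpha * relevance + (1 - alpha) * importance))
    return sorted(combined, key=lambda pair: pair[1], reverse=True)[:limit]
```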
Report Q&A: process_text
Hw5.4 In report.md, explain why process_text is also applied to the query string.
Report Q&A: Incremental Updates
Hw5.5 In report.md: Web spiders continuously find new web pages and revisit old web pages. When a new page is
added or an old page is revisited, an incremental update to the TF-IDF and PageRank values must be made. Explain how
you might implement an incremental SearchEngine update() method.
Testing
Thoroughly testing your project should involve a combination of unit testing and integration testing. Use the type of
testing that fits the situation. Include tests that confirm the calculations computed by your search engine are correct and
that the other important pieces are functioning as you expect. Once you have a working search engine, also verify that
search results are reasonable. You should test various queries, keeping in mind that your search engine should handle
all inputs gracefully. We encourage you to create your own XML files for testing.
You will only receive credit for testing which is clearly documented in your report. Test coverage is not the only goal (but
you should still check coverage). Tell us why we should be confident that your code survives any user input or corpus.
Report
Write your project report in the file called report.md. Include the following under each section heading:
# Search Engine Report
## Overview
Overview of how your code works
## Known Bugs
Description and example inputs and outputs for any known bugs.
## Report Questions & Answers
Hw5.1 Respond to your boss's suggestion to "store word frequencies the way you compute". (See the original
question above for more detail.)

Hw5.2 Explain the following equation in terms of probability notation if PR(X) is the probability of arriving at node
X, and L(X) is the number of outbound links from X. (i.e., What does d represent, what does (1 - d)/N
represent, and what does the sum represent?)

If 32% of traffic to a webpage comes from non-inlink sources, what should the value of d be?

Hw5.3 Explain how extract_links operates and what it returns. In addition, what would be different if we
don't return the second return value (words extracted from display texts)?

Hw5.4 Explain why process_text is also applied to query strings.

Hw5.5 Web spiders continuously find new web pages and revisit old web pages. When a new page is added or
an old page is revisited, an incremental update to the TF-IDF and PageRank values must be made. Explain how
you might implement an incremental SearchEngine update() method.
## Smart Search
Description of how you implemented the 'smart' search mode. Include at least two examples that show that it
makes sense, and explain why it can potentially be better than just PageRank or TF-IDF.
## Testing
Description of how you tested your code, including details on what you did beyond running the basic tests
included with the stencil.
Include a screenshot from using the CLI.
Rubric
● Coding Style: 5%
○ Coding style is a critical part of code quality. Run flake8 *.py in a terminal under your code directory.
Resolve all style warnings to avoid point deductions.
● TF-IDF: 25%
● PageRank: 25%
● SearchEngine & CLI (with Smart Search): 15%
● Processing XML file: 10%
● Report: 10%
● Testing: 10%

Special thanks to CS18 for letting us borrow from their course materials.
href="/programCase.html?tagId=16">C语言代写</a> </li> <li><a href="/programCase.html?tagId=17">程序代写</a> </li> <li><a href="/programCase.html?tagId=18">英国代写</a> </li> <li><a href="/programCase.html?tagId=19">计算机代写</a> </li> <li><a href="/programCase.html?tagId=20">英文代写</a> </li> <li><a href="/programCase.html?tagId=21">代写Python</a> </li> <li><a href="/programCase.html?tagId=22">It代写</a> </li> <li><a href="/programCase.html?tagId=23">留学生</a> </li> <li><a href="/programCase.html?tagId=24">温度分析</a> </li> <li><a href="/programCase.html?tagId=25">python代写</a> </li> <li><a href="/programCase.html?tagId=26">Assignment代写</a> </li> <li><a href="/programCase.html?tagId=27">chess game</a> </li> <li><a href="/programCase.html?tagId=28">游戏代写</a> </li> <li><a href="/programCase.html?tagId=29">加拿大代写</a> </li> <li><a href="/programCase.html?tagId=30">lab代写</a> </li> <li><a href="/programCase.html?tagId=31">机器学习</a> </li> <li><a href="/programCase.html?tagId=32">汇编</a> </li> </ul> </aside> </div> </div> <footer id="about"> <div class="container"> <div class="content"> <div class="tips"> <span>联系方式</span> </div> <ul> <li><i class="email"></i> 51zuoyejun@gmail.com</li> <!-- <li><i class="www"></i>官方旗舰店:<a target="_blank" href="http://t.cn/EVz6MXf">http://t.cn/EVz6MXf</a></li> --> <li><i class="addr"></i>3551 Trousdale Pkwy,University Park,Los Angeles,CA</li> </ul> <div class="qrcode"> <ul> <li> <img src="/reception3/images/qr2.jpg" alt="客服二"> <p>微信客服:abby12468</p> </li> <li> <img src="/reception3/images/qr1.jpg" alt="客服一"> <p>微信客服:Fudaojun0228</p> </li> </ul> </div> <p>温馨提示:如果您使用手机请先保存二维码,微信识别;或者直接搜索客服微信号添加好友,如果用电脑,请直接掏出手机果断扫描。</p> </div> </div> <div class="bottom"> <div class="main"> <div class="logo"> <img src="/reception3/images/footer-logo.png" alt="51作业君"> </div> <div class="pages"> <ul> <li><a href="index.html">首页</a></li> <li><a href="/program.html">程序辅导</a></li> <li><a href="/paper.html">论文辅导</a></li> <li><a href="#evalute">客户好评</a></li> </ul> <ul> <li>友情链接:</li> <li><a href="https://www.hddaixie.com" target="_blank">HD代写</a></li> <li><a href="https://sanyangcoding.com" target="_blank">三洋技术团队</a></li> <li><a href="http://apluscode.net" target="_blank">apluscode代写辅导</a> </li> <li><a href="https://www.aplusdx.com" target="_blank">Aplus代写</a> </li> </ul> <ul> <li><a href="#case">客户案例</a></li> <li><a href="#about">联系我们</a></li> </ul> <ul> <li>keywords:</li> <li><a href="https://51zuoyejun.com/paper.html" title="论文辅导" target="_blank">论文辅导</a></li> <li><a href="https://51zuoyejun.com/paper.html" title="论文润色" target="_blank">论文润色</a></li> <li><a href="/paper.html" title="论文代写" target="_blank">论文代写</a> <li><a href="/program.html" title="程序辅导" target="_blank">程序辅导</a></li> <li><a href="https://51zuoyejun.com/sitemap.html" title="论文辅导" target="_blank">sitemap</a></li> </ul> </div> </div> </div> </footer> <div class="H5Link"> <ul> <li> <a href="#about"> <img src="/reception3/img/wechat.png" alt="51作业君"> <p>官方微信</p> </a> </li> <li> <a href="/index.html"> <img src="/reception3/img/arrow-up.png" alt="51作业君"> <p>TOP</p> </a> </li> </ul> </div> <div id="code"> <div class="code"> <img src="/reception3/images/qr1.jpg" alt="51作业君"> <p>Email:51zuoyejun</p> <p>@gmail.com</p> </div> </div> <div id="aside"> 添加客服微信: <b>abby12468</b> </div> </body> <script src="/reception3/js/jq-session.js"></script> <!-- <script src="./js/getDetail.js"></script> --> <script src="https://cdn.bootcdn.net/ajax/libs/jquery/3.5.1/jquery.min.js"></script> <script> function change(lang) { $.ajax({ type: 'post', url: '/changeLang', dataType: 'json', 
data: { lang: lang }, success: function (data) { if (data == "success") { location.reload() } }, err: function (XMLHttpRequest, textStatus, errorThrown) { alert("error") } }); } /** * header */ $('header .nav a').click(function () { var eq = $(this).index() $(this).addClass('light').siblings('a').removeClass('light') }) $("#Menu").click(function () { $("header .nav").css("right", "0"); $("body").css("overflow-y", "hidden"); $(".bg").show() }) $("header .nav").on("click", function () { $("header .nav").css("right", "-3.4rem"); $(".bg").hide(); $("body").css("overflow-y", "auto"); }) $(".bg").on("click", function () { $("header .nav").css("right", "-3.4rem"); $(".bg").hide(); $("body").css("overflow-y", "auto"); }) </script> <!-- Cloudflare Web Analytics --><script defer src='https://static.cloudflareinsights.com/beacon.min.js' data-cf-beacon='{"token": "960d64b0a58f4a6f96f6ee60f14c3d14"}'></script><!-- End Cloudflare Web Analytics --><script>(function(){if (!document.body) return;var js = "window['__CF$cv$params']={r:'8794472b5e3e2d94',t:'MTcxMzk0MjY3My40MDIwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})();</script><script src="/cdn-cgi/scripts/7d0fa10a/cloudflare-static/rocket-loader.min.js" data-cf-settings="ed5a1d3c8c926d48dc2f0006-|49" defer></script>