COM6115: Text Processing (2019/20)

Assignment: Document Retrieval

Task: Complete the implementation of a basic document retrieval system in Python.

Submission: Submit your assignment work electronically via the module's MOLE unit, so check you have access to this, and notify the module lecturer if not. Precise instructions for what files to submit are given later in this document.

SUBMISSION DEADLINE: 3pm, Friday, Week 9 (29 November, 2019)

Penalties: Standard departmental penalties apply for late hand-in and use of unfair means.

Materials Provided

Download the file Document Retrieval Assignment Files.zip from the module homepage. It unzips to a folder that contains a number of code and data files, for use in the assignment.

Data files: The materials provided include a file documents.txt, which contains a collection of documents that record publications in the CACM (Communications of the Association for Computing Machinery). Each document is a short record of a CACM paper, including its title, author(s), and abstract, although one or other of these (especially the abstract) may be absent for a given document. The file queries.txt contains a set of IR queries for use against this collection. (These are 'old-style' queries, where users might write an entire paragraph describing their interest.) The file cacm_gold_std.txt is a 'gold standard' identifying the documents that have been judged relevant to each query. These three files together constitute a standard test set that has been used for evaluating IR systems (although it is now somewhat dated, not least by being very small by modern standards).

As discussed in class, standard IR systems create an inverted index over a document collection (such as documents.txt), to allow efficient identification of the documents relevant to a query. Various choices are made in preprocessing documents before indexation (e.g. whether a stoplist is used, whether terms are stemmed, etc.), with various consequences (e.g.
for the effectiveness of retrieval, the size of the index, etc.). To simplify the task, the files provided include several pre-computed index files for the document collection, which were generated according to different preprocessing choices, i.e. whether a stoplist was used or not, and whether (Porter) stemming was applied or not (giving files such as index_nostoplist_nostemming.txt, and so on). Correspondingly preprocessed versions of the queries are also provided (such as the file queries_nostoplist_nostemming.txt, and so on). (As such, the original 'non-preprocessed' files documents.txt and queries.txt are provided only for information/inspection. They are not required for the work you must do, and should not be accessed by your code.)

Code files: The materials provided include the code file ir_engine.py, which is the 'outer shell' of a retrieval engine: it loads an index and a preprocessed query set, and then 'batch processes' the queries, i.e. uses the index to compute the 10 best-ranking documents for each query, which it prints to a results file. Run this program with its help option (-h) for information on its command line options. These include flags for whether stoplisting and/or stemming are applied during preprocessing (which are used to determine which of the index and query files to load). Another option allows the user to set the name of the file to which results are written. A final option allows the user to select the term weighting scheme used during retrieval, with a choice of binary, tf (term frequency) and tfidf modes.

The Python script eval_ir.py calculates system performance scores, by comparing the collection gold standard (cacm_gold_std.txt) to a system results file (which lists the ids of the documents the system returns for each query). Execute the script with its help option (-h) for instructions on use.
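The three weighting modes selectable with the "-w" flag can be illustrated with small helper functions. These are a sketch only: the function names and signatures below are hypothetical, not taken from the assignment code.

```python
import math

# Illustrative (hypothetical) weighting functions. Here tf is a raw term
# count within a document, df is the number of documents containing the
# term, and num_docs is the total number of documents in the collection.

def binary_weight(tf):
    """Binary scheme: 1 if the term occurs at all, else 0."""
    return 1 if tf > 0 else 0

def tf_weight(tf):
    """Term-frequency scheme: the raw count itself."""
    return tf

def tfidf_weight(tf, df, num_docs):
    """TF.IDF scheme: the count scaled by inverse document frequency."""
    return tf * math.log(num_docs / df)
```

Whatever names you use, the point is that the same similarity computation can be run with any of these weightings substituted in, which is what makes the "-w" flag easy to support.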
The program ir_engine.py can be executed to generate a results file, but you will find that it scores zero for retrieval performance. The program does implement various aspects of the required functionality, i.e. it processes the command line, loads the selected index file into a suitable data structure (a two-level dictionary), loads the preprocessed queries, runs a batch process over the queries, and prints the results to file. However, it does not include a sensible implementation of the functionality for computing the most relevant documents for a given query, based on the index. This functionality is to be provided by the class Retrieve, which ir_engine.py imports from the file my_retriever.py, but the current definition provided in that file is just a 'stub' which returns the same result for every query (a list of the numbers 1 to 10, as if these were the ids of the documents selected as relevant).

Task Description

Your task is to complete the definition of the Retrieve class, so that the overall IR system performs retrieval based on the vector space model. Ideally, your implementation should allow retrieval under alternative term weighting schemes, as selected using the "-w" command line flag, i.e. under binary, term frequency and TF.IDF schemes. You should then evaluate the performance of the system over the CACM test collection under alternative configurations, arising from alternative preprocessing choices and the available term weighting schemes.

What to Submit

Your assignment work is to be submitted electronically using MOLE, and should include:

1. Your Python code, as a modified version of the file my_retriever.py. Do NOT submit any other code files. Your implementation should not depend on any changes made to the file ir_engine.py. (Any such dependency will cause your code to fail at testing time, as this will involve placing your code alongside a 'fresh' copy of the other files that are needed.)
Your code file should not open any other files at all. Rather, it should take its inputs and return its results solely through its interactions with the code in ir_engine.py.

2. A short report (as a pdf file), which should NOT EXCEED 2 PAGES IN LENGTH (excluding a title page, should you wish to have one). The report may include a brief description of the extent of the implementation achieved (this is only really important if you have not completed a sufficient implementation for performance testing), and should present the performance results you have collected under different configurations, and any conclusions you draw from your analysis of these results. Graphs/tables may be used in presenting your results, to aid exposition.

Assessment Criteria

A total of 25 marks are available for the assignment, and will be assigned based on the following criteria.

Implementation and Code Style (15 marks)

How many of the alternative weighting schemes have been correctly implemented? How efficient is the implementation (i.e. how quickly are results returned)? Have appropriate Python constructs been used? Is the code comprehensible and clearly commented?

Report (10 marks)

Is the report a clear and accurate description of the implementation? How complete and accurate is the discussion of the performance of the system under a range of configurations? What inferences can be drawn about the performance of the IR system from these results?

Guidance on Use of Python Libraries

The use of certain low-level libraries is fine (e.g. math to compute sqrt). The use of intermediate-level libraries like numpy and pandas is discouraged. Our experience is that students using these libraries do not use them effectively, and end up producing code that is less clear and less efficient than that of students who simply implement from the ground up, and thus retain clear control and understanding of what they are doing.
The use of high-level libraries that implement some of the core functionality you are asked to produce (e.g. scikit-learn or whoosh or other implementations of the vector space model or aspects of it, computing IDF weights, etc.) will be seriously penalised: this is the stuff you are meant to do yourself! If in doubt about whether to use any 3rd-party code, ask.

Notes and Comments

1. Study the code in the file ir_engine.py, particularly with a view to understanding: (i) how the retrieval index is stored within the program (as a two-level dictionary structure, mapping terms to doc-ids to counts), (ii) the representation of individual queries (as a dictionary mapping query-terms to counts), and (iii) how the code of the Retrieve class, which you are to complete, is called from the main program.

2. When retrieving relevant documents for an individual query, the set of candidate documents to be considered for retrieval are those containing at least one of the terms from the query, i.e. the candidate set is the union of the document sets for the individual query terms. Having computed this set, similarity scores can be computed for each candidate, and used to rank the candidates, so that the top 10 can be returned.

3. The vector space model views documents as vectors of term weights, and computes similarity between documents as the cosine of the angle between their vectors. Stated in terms of the comparison of a document and a query, this is calculated as:

   sim(q, d) = cos(q, d) = (Σ_{i=1}^{n} q_i d_i) / (√(Σ_{i=1}^{n} q_i²) · √(Σ_{i=1}^{n} d_i²))

   Note, however, that when we compute scores to rank the candidate documents for a single query (so that the top N can be returned), the component √(Σ_{i=1}^{n} q_i²) (the size of the query vector) is constant across comparisons, and so can be dropped without affecting how candidates are ranked.
4. Although the vector space model envisages documents as vectors with term weight values for every term of the collection, we do not actually need to construct these vectors. In practice, only terms with non-zero weights will contribute. For example, in computing the product Σ_{i=1}^{n} q_i d_i, we need only consider the terms that are present in the query; for all other terms q_i is zero, and so also is q_i d_i. (HOWEVER, when we compute the size of document vectors, all terms with non-zero weights should be considered.)

5. Computing required values: Various numeric values that derive from the document collection are required for calculating term weights and similarity scores. These values can be computed from the inverted index. The required values are:

   a. The total number of documents in the collection (|D|), which can be computed by gathering together the full set of document identifiers that are present in the index.

   b. The document frequency df_w of each term w, which is easily computed, as the index maps each term to the documents that contain it.

   c. The inverse document frequency log(|D|/df_w) of each term w, computed from the above.

   d. The size of each document vector, |d| = √(Σ_{i=1}^{n} d_i²), i.e. the square root of the sum of squared weights for the terms appearing in the document. This can be computed for all documents at the same time, in a single pass over the index. Where TF.IDF term weighting is used, the IDF values must be computed before the document vector sizes are calculated.
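The quantities listed in note 5 can all be derived from the two-level index. A minimal sketch, using a small made-up index (the real index, as note 1 says, maps terms to doc-ids to counts) and assuming TF.IDF weights for the vector sizes:

```python
import math

# A tiny made-up index: term -> {doc_id: count}, as described in note 1.
index = {
    "parsing":   {1: 2, 3: 1},
    "compiler":  {1: 1, 2: 3},
    "algorithm": {2: 1, 3: 2},
}

# (a) |D|: gather every doc-id that appears anywhere in the index.
doc_ids = set()
for docs in index.values():
    doc_ids.update(docs)
num_docs = len(doc_ids)

# (b) Document frequency of each term: how many documents contain it.
doc_freq = {term: len(docs) for term, docs in index.items()}

# (c) Inverse document frequency of each term.
idf = {term: math.log(num_docs / df) for term, df in doc_freq.items()}

# (d) Squared vector size of each document, accumulated in a single pass
#     over the index. Weights here are TF.IDF, so the IDF values must
#     already be available before this pass, as note 5d warns.
doc_size_sq = {}
for term, docs in index.items():
    for doc_id, count in docs.items():
        weight = count * idf[term]
        doc_size_sq[doc_id] = doc_size_sq.get(doc_id, 0) + weight ** 2

doc_size = {d: math.sqrt(s) for d, s in doc_size_sq.items()}
```

Computing these once, when the index is first loaded, avoids repeating the work for every query.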
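Notes 2-4 together suggest the shape of the per-query scoring. The following sketch uses made-up data and plain term-frequency weights for simplicity; as note 3 allows, the query-vector norm is dropped from the cosine:

```python
import math

# Made-up two-level index (term -> {doc_id: count}) and a preprocessed
# query (term -> count), in the representations described in note 1.
index = {
    "sorting":  {1: 3, 2: 1},
    "networks": {2: 2, 3: 4},
    "hardware": {4: 5},
}
query = {"sorting": 1, "networks": 2}

# Note 2: candidates are the union of the document sets of the query terms.
candidates = set()
for term in query:
    if term in index:
        candidates.update(index[term])

# Note 5d: document vector sizes over ALL terms in each document (note 4),
# here with plain term-frequency weights.
doc_size_sq = {}
for term, docs in index.items():
    for doc_id, count in docs.items():
        doc_size_sq[doc_id] = doc_size_sq.get(doc_id, 0) + count ** 2

# Notes 3-4: the cosine numerator need only range over the query terms;
# divide by the document vector size; the query norm is dropped because
# it is constant across one query's comparisons.
scores = {}
for doc_id in candidates:
    dot = sum(q_count * index[term].get(doc_id, 0)
              for term, q_count in query.items() if term in index)
    scores[doc_id] = dot / math.sqrt(doc_size_sq[doc_id])

# Rank candidates by score, best first (the system returns the top 10).
ranked = sorted(candidates, key=lambda d: scores[d], reverse=True)
```

Here document 4 is never scored at all, since it shares no terms with the query, which is exactly the saving that note 2's candidate set provides.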