School of Computer Science
Page 1 of 3
COMP5425/4425 Multimedia Retrieval 2025S1
Homework
Homework is an individual assessment and worth 15% of the total assessment of this unit
of study. It is due in Week 08.
You need to
1)
Complete a programming task, write a report of your implementation and results,
produce a video to explain your implementation and results.
2)
make online submission of your source codes, your report, and your video in a zip
file named with your unikey.
Due: Tue, 15/04/2025 (Week 08)
Note that this homework aims to provide students opportunity to learn and implement a
text retrieval algorithm of a specific application and you must complete the task by yourself.
Using Artificial Intelligence (AI) tools is allowed and needs to be acknowledged, if it
complies with the University’s policies.
Some codes of this task may have been available publicly (e.g., on GitHub) or the task has
been implemented as a set of function calls in some programming libraries. You must not
use such resources or complete the project directly with those function calls, while you can
still use programming libraries for components which are not directly relevant to the task to
be completed.
What you can ONLY use:
- Python language itself
- NLTK for tokenization, lemmatization.
- NumPy, SciPy.
- built-in library from python (as listed https://docs.python.org/3/library/)
- Any IDE/Editor
Task – Passage Retrieval/Ranking
Your need to implement a simple passage retrieval/ranking system. Your system needs to
accept a text as query and retrieve a list of documents.
Inputs:
1. query(str): a textual query, it can be:
- A simple word, phrase, sentence OR
- Boolean expression.
A Boolean expression combines keywords with Boolean operators (i.e. "AND", "OR",
"NOT"). An example query is: `"Australia AND (Housing OR Property) NOT Car"`
The words (or keywords) can be exact words or wildcards, you only need to handle trailing
wildcards (e.g. Austra* matches all words starting with Austra with at least 0 any trailing
characters). You do not need to implement all the query parsing, refer to the requirements
paragraph for details.
COMP5425 Multimedia Retrieval PROJECT
School of Computer Science
Page 2 of 3
2. K(integer): the number of top K documents in your ranking/retrieval.
Outputs:
1. Your output needs to contain top K documents id.
2. Your code can get the documents content with documents id.
3. You results needs to be in JSON format
An example output is:
Dataset:
You can choose any retrieval dataset and sub-sample them to obtain your own dataset.
Some examples are MS Marco(https://microsoft.github.io/msmarco/Datasets.html), TREC- CAR (https://github.com/BRIGHT-benchmark/BRIGHT). You can choose your own
appropriate processing methods.
Requirements:
You can choose several features for your query system.
1. Queries and documents processing
1. You need to process the text appropriately, such as tokenization,
lemmization, etc.
2. You need at least to implement: simple key words query.
3. You can implement Boolean expression [3, 4], wildcard query for higher
scores.
2. Ranking criteria/score
1. You need to implement ranking criteria to determine the similarity between
your query and passages.
2. You need to implement TD-IDF score.
3. You can implement BM25 (Best Matching 25) [1] for higher scores.
3. UI
1. You can implement Command Line Interface (CLI).
Marking Scheme (15 marks)
1. [12 marks] Algorithm implementation
Up to 6 marks
Implement basic function. Exact keywords matching. TF-IDF score implemented.
COMP5425 Multimedia Retrieval PROJECT
School of Computer Science
Page 3 of 3
Up to 9 marks
In addition to above features, implemented inverted-index, wildcard query.
Up to 12 marks
In addition to above features, implemented advanced scoring (e.g. BM25), Boolean query.
[3 marks] Report and Video
Please prepare a brief report up to 4 pages to describe your method and results. It could
contain key steps or details of the algorithm implemented, experimental settings, and
experimental results (or text cases), and several key references. A clear report will allow
readers to reproduce your work.
Please produce a video up to 3 minutes and 30 MB in MP4 format to explain and
demonstrate the implementation of your project. You may be asked to demonstrate and
explain your implementation to your tutor, if the tutor needs more information. Marks may
be deducted if you are not able to explain your work clearly.
References
1. Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework:
BM25 and Beyond (PDF). NOW Publishers, Inc.
2. Yuanhua Lv and ChengXiang Zhai. Lower-bounding term frequency normalization. In
Proceedings of CIKM'2011, pages 7-16
3. Aho, Alfred Vaino; Lam, Monica Sin-Ling; Sethi, Ravi; Ullman, Jeffrey David (2006).
Compilers: Principles, Techniques, and Tools (2 ed.). Boston, Massachusetts, US:
Addison-Wesley. ISBN 0-321-48681-1. OCLC 70775643
4. Reverse Polish Notation https://mathworld.wolfram.com/ReversePolishNotation.html
51作业君版权所有