605.646 Natural Language Processing: Class Project
Overview
The class project allows you the opportunity to investigate a particular topic in greater depth
than we can cover in the classroom. Projects are individual endeavors that require advance
planning and progressive effort to complete successfully. Most projects involve writing or
using software to conduct an experiment or process a textual dataset. It is acceptable and
encouraged to use open-source toolkits and source code (with citation) for components of
your project. Successful projects usually have the following characteristics:
● involve working with two or more distinct NLP techniques (not necessarily ones
specifically covered in the class lectures)
● work with language data
● include meaningful experiments and report quantitative results
● are scoped appropriately for completion in a one-semester course
We provide below specifications for three diverse projects: (a) answering factoid questions,
(b) cross-language information retrieval, and (c) detecting adverse drug reactions; you may
choose any one of them. If there is a different project that interests you, perhaps from a
hobby or professional interest, you may propose to do that instead. Projects focused on
topics like sentiment analysis, text retrieval, information extraction, authorship attribution,
spam filtering, detecting fake reviews or fake news, dialog systems, large language models,
or translation are all reasonable ideas.
To earn a grade of A- or higher in the course, students must complete and submit a project;
however, completing a project does not guarantee receiving an A- or higher. Students may
opt out of submitting a project, in which case the other coursework will determine the final
grade, as discussed in the course syllabus.
Grading Criteria
Project grades are based on the work performed and documented in the written report
(70%) and on a presentation to the class (30%), which is a shared, pre-recorded video.
Criteria that we use to score presentations are:
1. Were the project’s goals and motivation sufficiently explained? (1-10)
2. Was suitable and meaningful background information presented (e.g., prior work)? (1-10)
3. Did the presentation provide sufficient technical detail (1-5) and articulate a contribution or
insight? (1-5)
4. Clarity of the presentation, quality of slides or materials, appropriate length (1-10).
5. Was the work well thought out? Did conclusions follow from the argument or experimental
results? (1-10)
Proposal
Irrespective of whether you are doing one of the projects provided by the instructors or one
of your own devising, you must submit a written proposal in Canvas for approval by the
instructors no later than the end of Module 6. The proposal should have a title, must identify the
project topic, should briefly motivate why this is an interesting or important natural language
problem, identify some relevant scientific literature for the problem of interest, identify
sources of data, and outline planned work for the project. Sufficient details about data,
experimental design, and evaluation methodology should be provided. (For the instructor-
provided projects, some of this information will be easy to compile.) Proposals are usually
less than a page in length. If you have a project topic that interests you, but you have
questions or are not sure how to proceed, you are strongly encouraged to contact us
informally for ideas or feedback in advance of submitting the proposal.
Written Report
The written report is the most significant project deliverable – it is where you document the
work that you have performed, and it counts for most of the project grade. Reports should
be scientifically oriented and should include an abstract, an introduction to the problem, a
brief review of related work, details about experimental design (e.g., how training/dev/test
data is used, what evaluation metrics are reported, etc.), experimental results with analysis,
findings supported by the work, and appropriate references. You have flexibility in the style
of formatting; however, do include headings, and use a font between 10 and 12 points.
Suitable tables and figures are highly encouraged.
You should take care to clearly communicate the scale and quality of your work. We leave
the length of the report up to you, but as a rough guideline, five pages is probably too short,
and over 10 pages is getting long. You do not need to include source code (but details about
the amount of code you wrote, or which packages you used can be informative). And we
repeat that tables, charts, sample data, and figures that help explain experimental results
and observations are valued.
Reports are due on the last day of the final Module and should be submitted in Canvas as a
single PDF file.
Presentation
You will share an approximately ten-minute video presentation about your project during the
last Module. A suggested format is voice-annotated slides created using PowerPoint,
Keynote, OpenOffice, etc. The presentation should focus on describing the problem, your
approach, any difficulties encountered, qualitative and quantitative results, and any
interesting observations and findings. Experiments are not always successful, and you can
achieve a good score on the project, even with negative results; however, your design should
be good and you need to articulate what was learned.
Schedule
Module 6: By Day 7 of Module 6, select a topic and submit a proposal in Canvas (as
PDF). One page is enough. Earlier is okay. You are welcome to contact the
instructors ahead of time to informally discuss ideas.
Anytime: You are welcome to contact the instructors for advice if you have questions
about projects. We are available during office hours or by email.
Module 11: There will be no new lecture material or assigned readings this week. We
will set up times on a calendar when students may meet with the
instructors for individual project consulting. If you prefer not to meet over
Zoom to discuss your project, please email a brief status update to both
instructors by the end of the Module. We mainly want to
know if you are making progress, and if you discover any serious
impediments to the project that might require a late change in plans.
Module 14: By Day 1 of Module 14, create a discussion post titled "Project video: BRIEF
TITLE" with an attached video or a link to your video online.
Module 14: By Day 7 of Module 14, upload your written report as a PDF file in Canvas.
Literature
Numerous resources are available to you. Research papers can be found via Google
Scholar, the ACL Anthology, arXiv, CiteSeer, or websites for various conferences. JHU
libraries can provide access to the ACM and IEEE digital libraries. (You may have to be VPN'd
into the JHU network to use some of these resources.)
Datasets
There are several shared tasks with available datasets. Data from Kaggle or HuggingFace
may be a good starting place for some projects. The computational linguistics community
also runs many shared tasks at conferences. One of the more popular evaluation
workshops is SemEval, which has run tasks for many years. The websites for the most
recent completed campaigns are:
https://semeval.github.io/SemEval2024/
https://semeval.github.io/SemEval2023/
https://semeval.github.io/SemEval2022/
https://semeval.github.io/SemEval2021/
Citation
The source of any code not written by you must be cited. This includes online tutorials, code
completion software, other students, etc.
Project A: Answering Factoid Questions
This project revolves around building and evaluating a system that attempts to automatically answer
a question whose answer is generally a short noun phrase. The question should be answered based
on information from the documents in the provided collection, not from general world knowledge,
an existing knowledge graph, or Internet sources.
We are providing a small collection of ~227k English sentences. The sentence collection is based
on news articles written by the Southeast European Times, a now-defunct news portal that closed in
2015. The site published content covering the Balkans. We are not providing you with any training
data; however, we are providing an evaluation set of 50 questions that predominantly seek a person
or a location as the response. Because the expected answers are short noun phrases, you should
generate one response per question that is no longer than 100 characters in length. You can score
a set of responses using the provided ScoreAnswers script on a file of answers, one per line.
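As a minimal sketch of preparing such a file, the helper below keeps each response on a single line and enforces the 100-character limit. The function names and the UTF-8 encoding are assumptions for illustration; the handout specifies only one answer per line, so check the provided ScoreAnswers script for any further format requirements.

```python
MAX_ANSWER_CHARS = 100  # limit stated in the project description


def format_answer(answer: str, limit: int = MAX_ANSWER_CHARS) -> str:
    """Collapse internal whitespace (including newlines) and enforce the length limit."""
    one_line = " ".join(answer.split())
    return one_line[:limit].rstrip()


def write_answers(answers, path):
    """Write one formatted answer per line, in question order (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for answer in answers:
            f.write(format_answer(answer) + "\n")
```

Truncation is a blunt way to meet the limit; a system that ranks short candidate noun phrases will rarely need it.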
The factoid QA task was popularized at the NIST TREC-8 evaluation in 1999. With the more recent
advent of deep learning, additional datasets have become available such as NewsQA and SQuAD.
Many early QA systems followed the same general architecture, which consists of three or four
primary components in a pipeline:
• Analyze the question (and determine the answer type)
• Document / Passage search
• Candidate answer extraction (possibly exploiting NER)
• Optional validation and selection of the top-ranked response
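The pipeline above can be sketched in a few dozen lines. This is a toy baseline under stated assumptions, not a recommended design: the stopword list, the IDF-weighted term-overlap ranking, and the capitalized-span regex (a crude stand-in for NER) are all illustrative choices, and the sentences and names are invented.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "where",
             "what", "which", "did", "does", "to", "and"}


def tokenize(text):
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def answer_type(question):
    """Stage 1: guess the expected answer type from the question word."""
    words = question.strip().split()
    first = words[0].lower() if words else ""
    return {"who": "PERSON", "where": "LOCATION"}.get(first, "OTHER")


def rank_sentences(question, sentences, k=3):
    """Stage 2: rank sentences by IDF-weighted overlap with the question terms."""
    q_terms = set(tokenize(question))
    tokenized = [tokenize(s) for s in sentences]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(sentences)
    scored = []
    for sent, toks in zip(sentences, tokenized):
        score = sum(math.log(1 + n / df[t]) for t in set(toks) & q_terms)
        scored.append((score, sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for score, sent in scored[:k] if score > 0]


def extract_candidates(sentence, question):
    """Stage 3: capitalized spans that are not just a restatement of the question."""
    q_terms = set(tokenize(question))
    spans = re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", sentence)
    return [s for s in spans if not set(tokenize(s)) <= q_terms]


def answer(question, sentences):
    """Stage 4: pick the most frequent candidate across the top-ranked sentences."""
    counts = Counter()
    for sent in rank_sentences(question, sentences):
        counts.update(extract_candidates(sent, question))
    return counts.most_common(1)[0][0] if counts else ""


toy_sentences = [
    "Ana Kovac is the mayor of Exampleville.",
    "The council met in Sampletown on Monday.",
    "Exampleville hosted a film festival.",
]
print(answer_type("Who is the mayor of Exampleville?"))
print(answer("Who is the mayor of Exampleville?", toy_sentences))
```

Note that the sketch computes the answer type but does not yet use it to filter candidates; connecting the two (for example, via an NER tagger) is one natural experiment.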
A more recent approach is based on using document search and LLMs to perform question
answering. This method is briefly described in J&M Chapter 14.
We expect that you will implement a baseline approach, evaluate performance quantitatively, and
then conduct experiments to try to improve performance on the task. Your work should make use of
at least two distinct HLT technologies; a solution based solely on LLM extraction (if you attempt that)
is not enough, but a combination of LLM with RAG would be. It is acceptable to use general purpose
NLP tools; however, you should not merely run a QA system that has been created by others. (It is
permissible to run a QA system of others for the purpose of comparing your performance to
previous results.) Your analysis could include a comparison of different approaches, a measure of
the benefit of data augmentation or fine-tuning, or other areas you find of interest.
You should measure and report system performance using the script and test data that we provide.
However, you may use other datasets to help develop your system or to conduct your experiments.
NLPProgress.com hosts a QA leaderboard page that is a good place to look for English language QA
resources.
Project B: Cross-Language Information Retrieval
Finding information in a language that you speak is usually straightforward. For example, if you speak
English and are trying to find Web documents you can use a variety of search engines, such as Bing,
Google, or DuckDuckGo, to find information on a topic of interest, then read those documents
directly. But what if the information you are seeking is only published in a language that you can’t
read? This used to be a rare use case, but with the advent of high-quality machine translation it
makes much more sense. A Cross-language Information Retrieval (CLIR) system takes queries in one
language and returns relevant documents that are written in a different language.
You are to build a CLIR system. Your system will:
• Index a large document collection in Chinese, Russian, or Persian
• Take English queries as input
• Use machine translation to translate queries into the language of the documents
• Use your index to retrieve the top 1000 most relevant documents for each query
• Evaluate your system using nDCG@10, recall, and average precision
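To make the flow of these steps concrete, here is a small self-contained sketch: a toy BM25 index, a pluggable translation function, and a search call that returns the top-k documents. Everything here is an assumption for illustration. The glossary lookup stands in for a real machine translation system, the BM25 parameters are common defaults rather than tuned values, and the exhaustive scoring loop would be far too slow for a NeuCLIR-sized collection, where one of the retrieval toolkits named in this handout is the practical choice.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Whitespace/punctuation tokenization; Chinese text needs a real segmenter.
    return re.findall(r"\w+", text.lower())


class BM25Index:
    """Toy in-memory BM25 index (illustrative, not built for large collections)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avg_len = sum(self.length.values()) / max(len(docs), 1)
        self.df = Counter()
        for counts in self.tf.values():
            self.df.update(counts.keys())

    def search(self, query, k=1000):
        n = len(self.tf)
        scores = Counter()
        for term in set(tokenize(query)):
            if term not in self.df:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            for doc_id, counts in self.tf.items():
                f = counts.get(term, 0)
                if f:
                    norm = f + self.k1 * (1 - self.b + self.b * self.length[doc_id] / self.avg_len)
                    scores[doc_id] += idf * f * (self.k1 + 1) / norm
        return scores.most_common(k)  # list of (doc_id, score), best first


def clir_search(english_query, translate, index, k=1000):
    """Translate the English query into the document language, then retrieve."""
    return index.search(translate(english_query), k=k)


# Hypothetical stand-in for an MT system: a two-word English-to-Russian glossary.
TOY_GLOSSARY = {"cat": "кошка", "milk": "молоко"}


def toy_translate(query):
    return " ".join(TOY_GLOSSARY.get(word, word) for word in tokenize(query))


toy_docs = {"d1": "кошка пьёт молоко", "d2": "собака спит"}
toy_index = BM25Index(toy_docs)
print(clir_search("cat milk", toy_translate, toy_index))
```

Keeping translation behind a single function argument makes it easy to swap MT systems and compare them as one of your experiments.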
Data
You will use a CLIR dataset from the TREC NeuCLIR collection (use of a different collection is
possible with permission of the instructors). These datasets include:
• Documents in a non-English language, either Chinese, Russian, or Persian
• Topics. These are English expressions of a user information need that you will convert to a
query provided as input to your retrieval engine. NeuCLIR topics include a two- or three-word
title and a sentence-length description. Here is an example:
title: Iranian female athletes refugees
description: I am looking for stories about Iranian female athletes who seek asylum in other countries.
• Judgments. Usually called qrels for historical reasons, these are decisions about the
relevance of given documents to the topics. Each qrels entry includes a query ID query_id, a
document ID doc_id, and a relevance judgment relevance. NeuCLIR has three relevance levels:
‘3’ meaning the document is highly relevant to the topic, ‘1’ meaning it’s somewhat relevant,
and ‘0’ meaning it’s not relevant.
The NeuCLIR collections are available from https://ir-datasets.com/neuclir.html. This site hosts
many information retrieval datasets with easy-to-use Python interfaces. You will need to pip install
ir_datasets.
Retrieval Systems
You may use any monolingual information retrieval system you like, including one of your own
construction if you so choose. Options you might consider include: Terrier, Anserini, Lucene,
ColBERT, etc. We recommend starting with a statistical system, as such systems are fast, reliable, and do not
typically require training. Then if you have time you are welcome to experiment with a neural system.
Translation Systems
You may use a machine translation system of your choice. Options you might consider include
EasyNMT, TranslateShell (which can call Bing or Google APIs), or NLLB.
Evaluation
For evaluation you should report nDCG@10, recall, and average precision. These measures are all
easily available through IR Measures. You will need to pip install ir_measures.
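For intuition about what these three measures compute, the sketch below evaluates a single query by hand. It is not a replacement for ir_measures, whose numbers are the ones to report. The formulas are assumptions chosen to follow the common trec_eval conventions (linear gain with a log2 rank discount for nDCG, binary relevance for recall and average precision), and the ranked list is assumed to contain each document at most once.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, qrels, k=10):
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, qrels, k=1000):
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def average_precision(ranked_ids, qrels):
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant document
    return total / len(relevant)


# Toy judgments for one query, using the NeuCLIR relevance levels 3, 1, and 0.
toy_qrels = {"d1": 3, "d2": 1, "d3": 0}
toy_run = ["d3", "d1", "d2"]  # system ranking, best first
print(ndcg_at_k(toy_run, toy_qrels), recall_at_k(toy_run, toy_qrels),
      average_precision(toy_run, toy_qrels))
```

Per-query scores like these are then averaged over all topics to obtain the collection-level figures.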
Tasks
We expect you to create a baseline system and assess its performance on your chosen NeuCLIR
collection. Then you should attempt to improve your baseline system and report the results of your
enhancement(s) compared to the baseline approach. For example, you might focus on better
tokenization of the document language, improved query translation, query expansion, or a different
retrieval algorithm.
To perform error analysis, you might choose to use machine translation to convert some of the top
ranked documents to English and display them. This requires translation in the opposite direction
from the queries, but the mechanism should be the same.
Your project should conform to the requirements in the Class Project handout. Reminder: the source
of any code not written by you must be cited. This includes online tutorials, code completion
software, other students, etc.
Project C: Detecting Adverse Drug Reactions
You are to build a system that takes in portions of English text (i.e., sentences or short paragraphs)
and extracts any tuples (drug-or-intervention, adverse-consequence) that are supported by the text.
For example, given the passage: “Children are not permitted to take aspirin because of Reye’s
syndrome”, the system should extract (aspirin, Reye’s syndrome). You will be given some training
and evaluation data for this task, and you should evaluate your system performance using precision,
recall, and a composite F1 score.
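One way to compute these scores at the tuple level is sketched below. It rests on assumptions you should state in your report if you adopt them: a predicted tuple counts as correct only if it exactly matches a gold tuple for the same passage after lowercasing and trimming, and scores are micro-averaged over all passages. The passage identifiers and the drug name `DrugA` are placeholders.

```python
def normalize(item):
    """Lowercase and trim the drug and effect strings; keep the passage id as is."""
    text_id, drug, effect = item
    return (text_id, drug.strip().lower(), effect.strip().lower())


def precision_recall_f1(gold, predicted):
    """Micro-averaged scores over (text_id, drug, effect) triples, exact match."""
    gold_set = {normalize(item) for item in gold}
    pred_set = {normalize(item) for item in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


toy_gold = [(0, "aspirin", "Reye's syndrome"), (1, "DrugA", "rash")]
toy_pred = [(0, "Aspirin", "reye's syndrome"), (1, "DrugA", "nausea")]
print(precision_recall_f1(toy_gold, toy_pred))
```

Exact matching is strict; a relaxed variant that credits partial span overlap is a reasonable second metric, provided the report says which one is used.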
We are providing training and evaluation data using the ade_corpus_v2 benchmark, a version of
which is described in a paper by Gurulingappa et al., and which can be found on Hugging Face. We
have created an 80%/10%/10% partition for you to use. The data are in TSV format with three
columns: text, drug, and effect.
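A short sketch of loading data in this layout follows. Whether the provided files include a header row is not stated here, so the reader below tolerates either case; that handling, the function name, and the decision to group pairs by passage are all assumptions. Grouping matters because one passage may support several (drug, effect) tuples.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["text", "drug", "effect"]


def read_ade_tsv(handle):
    """Return {text: set of (drug, effect)} from a three-column TSV file handle."""
    pairs_by_text = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3 or row == COLUMNS:
            continue  # skip a header row or a malformed line
        text, drug, effect = row
        pairs_by_text[text].add((drug, effect))
    return dict(pairs_by_text)


# Invented sample rows; DrugA is a placeholder name.
toy_tsv = (
    "text\tdrug\teffect\n"
    "DrugA caused a rash and nausea.\tDrugA\trash\n"
    "DrugA caused a rash and nausea.\tDrugA\tnausea\n"
)
print(read_ade_tsv(io.StringIO(toy_tsv)))
```

For a file on disk, pass a handle opened with `open(path, encoding="utf-8", newline="")`.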
Some ideas:
• Detecting ADRs in natural text is a well-studied problem. You may want to review the existing
literature for suggestions about auxiliary data sources or techniques. Searching the ACL
Anthology is a good place to start.
• You may benefit from utilizing an existing list of drug names, or adverse reactions, or
automatically learning them yourself from unlabeled corpora.
• There are some databases of known drug/side-effect pairs which could potentially be of use,
for example:
o https://www.canada.ca/en/health-canada/services/drugs-health-products/medeffect-canada/adverse-reaction-database.html
o https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html
And there may be other databases that are easier to use.
• A variant of the problem you could explore would be to take the English language training
data that we have provided, and using translation or multilingual embeddings, try to adapt
this to extract ADRs on non-English data (e.g., Russian: https://github.com/cimm-kzn/RuDReC
or Spanish: https://github.com/isegura/ADR).
We expect that you will create a baseline approach, evaluate performance quantitatively, and then
conduct experiments to try to improve performance on the task. Your work should make use of at
least two distinct HLT technologies. It is acceptable to use general purpose NLP tools; however, you
should not merely run an ADR system that has been created by others. (It is permissible to run an
ADR system of others for the purpose of comparing your performance to previous results.) Your
analysis could include a comparison of different machine learning approaches, the utility of different
features, measuring the benefit of data augmentation, or other areas you find of interest.
Some approaches you might consider exploring include:
• First detect drug names (or other medical interventions), and adverse reactions using regular
expressions, gazetteers, or sequence taggers. Then use supervised machine learning to
extract possible pairs given the drugs/reactions.
• Try various prompts and call an LLM such as ChatGPT to take a passage of text and extract
desired pairs.
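The first approach above might begin with a candidate generator like the following sketch. The gazetteers here are tiny invented lists and the matching is case-insensitive whole-phrase lookup, both assumptions for illustration. Every drug mention is paired with every reaction mention in the passage, and a supervised classifier (not shown) would then decide which candidate pairs the text actually supports.

```python
import re
from itertools import product


def find_mentions(text, gazetteer):
    """Return gazetteer terms that occur in the text as whole phrases (case-insensitive)."""
    found = []
    for term in sorted(gazetteer):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def candidate_pairs(text, drugs, effects):
    """All (drug, effect) combinations mentioned in one passage."""
    return list(product(find_mentions(text, drugs), find_mentions(text, effects)))


# Toy gazetteers; real lists could come from the resources mentioned earlier.
toy_drugs = {"aspirin", "ibuprofen"}
toy_effects = {"Reye's syndrome", "rash"}
passage = "Children are not permitted to take aspirin because of Reye's syndrome"
print(candidate_pairs(passage, toy_drugs, toy_effects))
```

Pairing everything with everything favors recall over precision, which is why the classification step matters; it also gives you a natural baseline to compare against a sequence tagger or an LLM prompt.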