Assignment 8: Integrating the Generation and Retrieval Service
Due Nov 11 by 3am
Points 100
Submitting a website URL
PREREQUISITES: Review the Retrieval-Augmented Generation lectures.
REQUIRED PYTHON PACKAGES (in 'requirements.txt'):
mistralai
transformers
sentence_transformers
nltk
qa_metrics
faiss-cpu
torch
REQUIRED RESOURCES:
qa_resoruces/questions.csv
storage/corpus/*.txt.clean
The student-resources repository provides the dataset we will use for this case study and also contains
the code we will use to implement this RAG system.
OBJECTIVES: You will implement the logic to preprocess the corpus and communicate with the
Mistral API. Review the code found in the modules/ directory. The modules include runners
(if __name__ == "__main__") as examples of how to run each module. Review the provided
datasets. For this assignment, update your repository by adding a directory called 'textwave' in your
project root directory.
To complete this assignment, you must install the Python packages listed in requirements.txt.
The generator/question_answering.py module leverages the Mistral API. You can configure this class by
specifying three class arguments:
The api_key argument takes your unique API key (a string), which is provided once you have
registered for a Mistral account.
11/4/24, 12:24 AMAssignment 8: Integrating the Generation and Retrieval Service
https://jhu.instructure.com/courses/82966/assignments/897476?module_item_id=42102981/3
Go to https://mistral.ai/ and register for a new account. You will need to
follow the authentication process to complete this.
Once registered, log in to your account and create a new workspace.
Go to the "La Plateforme" menu -> "Billing" -> "Go to billings plans page." Select "Experiment for
free" and subscribe to the plan. You will need to complete the authentication process.
In the "La Plateforme" menu, select "API Keys." You can view your API key here.
In a terminal, run the command: export MISTRAL_API_KEY=<your_api_key>
Run this command each time you open a terminal to run this code. Optionally, add this line at the bottom
of your ~/.bashrc file.
The temperature controls the randomness of the model's responses.
The generator_model argument specifies the model (e.g., mistral-{small|medium|large}-latest).
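The three arguments above could be collected as follows before constructing the class in generator/question_answering.py. This is a minimal sketch under stated assumptions: GeneratorConfig and load_generator_config are hypothetical names, and the default temperature of 0.3 is an arbitrary placeholder. Only the MISTRAL_API_KEY variable and the mistral-small-latest model name come from the instructions above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical container for the three generator arguments."""
    api_key: str
    temperature: float
    generator_model: str


def load_generator_config(temperature=0.3, generator_model="mistral-small-latest"):
    """Read the API key from the environment and validate the settings."""
    # The key is read from the variable exported in the terminal (or ~/.bashrc).
    api_key = os.environ.get("MISTRAL_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MISTRAL_API_KEY is not set; export it in your terminal first.")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return GeneratorConfig(api_key, float(temperature), generator_model)
```

Reading the key from the environment keeps it out of the repository you submit; failing early with a clear message is easier to debug than an authentication error from the API.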
Task 1: Search nearest neighbors
In the pipeline.py module, modify your Pipeline class (from the previous homework) to add:
a method called __encode(query), which returns the embedding vector for a
preprocessed user text query. Define/configure your embedding strategy in Pipeline's
__init__().
a method called search_neighbors(query_embedding, k=10), which returns the k nearest
neighbors. Define/configure your index and search strategy in Pipeline's __init__().
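The shape of these two methods can be sketched as below. This is a toy illustration, not the expected solution: it swaps in a bag-of-words embedding and an exhaustive cosine search so that it runs with the standard library alone, where your implementation would configure a sentence_transformers model and a faiss index in __init__(). The _tokenize helper and the public encode() wrapper are hypothetical additions.

```python
import math
import re


def _tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (stand-in preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Pipeline:
    def __init__(self, chunks):
        self.chunks = list(chunks)
        # Embedding strategy (assumption): bag-of-words over the corpus vocabulary.
        vocab = sorted({tok for chunk in self.chunks for tok in _tokenize(chunk)})
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        # Index strategy (assumption): a plain list of vectors, searched exhaustively.
        self.index = [self.__encode(chunk) for chunk in self.chunks]

    def __encode(self, query):
        """Return an L2-normalized embedding; tokens outside the vocabulary are ignored."""
        vec = [0.0] * len(self.vocab)
        for tok in _tokenize(query):
            if tok in self.vocab:
                vec[self.vocab[tok]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def encode(self, query):
        # Python mangles __encode to _Pipeline__encode, so code outside the class
        # cannot call pipeline.__encode(...); this hypothetical wrapper exposes it.
        return self.__encode(query)

    def search_neighbors(self, query_embedding, k=10):
        """Return up to k (score, chunk) pairs, highest cosine similarity first."""
        scored = [
            (sum(q * d for q, d in zip(query_embedding, doc_vec)), chunk)
            for doc_vec, chunk in zip(self.index, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

In a notebook, a call would look like pipeline.search_neighbors(pipeline.encode("Who was Abraham Lincoln?"), k=15). When k exceeds the number of indexed chunks, this sketch returns every chunk.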
In a notebook called notebooks/context_answering_analysis.ipynb, demonstrate the output of
the search_neighbors() function:
query = "Who was Abraham Lincoln?", k = 15
query = "Who was Abraham Adams?", k = 15
query = "Did Abraham Lincoln live in the Frontier?", k = 1
query = "Did Abraham Lincoln live in the Frontier?", k = 10
query = "Did Abraham Lincoln live in the Frontier?", k = 20
query = "Did Abraham Lincoln live in the Frontier?", k = 50
query = "How did Fillmore ascend to the presidency?", k = ?
query = "What is the capital of France?", k = ?
Discuss your observations.
Task 2: Generate answers
In the pipeline.py module, modify your Pipeline class to add:
a method called generate_answer(query, context, rerank=True), which returns an answer given
the query and the retrieved context. Define/configure your re-ranker and question_answering
strategy in Pipeline's __init__().
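One possible shape for generate_answer() is sketched below. Everything here is an assumption made for illustration: rerank_by_overlap is a hypothetical token-overlap re-ranker standing in for whichever re-ranker you configure, build_prompt is a hypothetical prompt template, and generate_fn is a hypothetical callable (prompt in, answer string out) that would wrap the class in generator/question_answering.py.

```python
import re


def _tokens(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_by_overlap(query, context):
    """Order chunks by how many distinct query tokens they contain (hypothetical re-ranker)."""
    query_tokens = _tokens(query)
    return sorted(context, key=lambda chunk: len(query_tokens & _tokens(chunk)), reverse=True)


def build_prompt(query, context):
    """Number the context chunks and append the question (hypothetical template)."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )


class Pipeline:
    def __init__(self, generate_fn, reranker=rerank_by_overlap):
        # Re-ranker and question-answering strategy are configured here.
        self.generate_fn = generate_fn
        self.reranker = reranker

    def generate_answer(self, query, context, rerank=True):
        """Optionally re-rank the retrieved chunks, then ask the generator for an answer."""
        ordered = self.reranker(query, context) if rerank else list(context)
        return self.generate_fn(build_prompt(query, ordered))
```

Passing the generator in as a callable lets you exercise the re-ranking and prompt logic with a stub, without spending API calls while you debug.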
In a notebook called notebooks/context_answering_analysis.ipynb, demonstrate the output of
generate_answer() with the following:
query = "Who was Abraham Lincoln?", k = 15, rerank = {True|False}
query = "Who was Abraham Adams?", k = 15, rerank = {True|False}
query = "How did Fillmore ascend to the presidency?", k = {1|5|10|20|...}, rerank = {True|False}
query = "What trail did Lincoln use a Farmers' Almanac in?", k = {1|5|10|20|...}, rerank = {True|False}
query = "What is the capital of France?", k = 15, rerank = {True|False}
Discuss your observations.
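For the queries that sweep over k and rerank, a small helper can produce one row per combination so the answers are easy to compare side by side. This is only a sketch: answer_grid is a hypothetical helper, and answer_fn is a hypothetical callable that you would write to chain encoding, search_neighbors(k) and generate_answer(rerank) for a single query.

```python
from itertools import product


def answer_grid(answer_fn, query, k_values, rerank_values=(True, False)):
    """Return one (k, rerank, answer) row for every combination of settings."""
    return [
        (k, rerank, answer_fn(query, k, rerank))
        for k, rerank in product(k_values, rerank_values)
    ]
```

In the notebook, the rows can be printed in a loop or loaded into a table, which makes it easier to see where the answer changes as k grows or re-ranking is switched off.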
SUBMISSION: You will need to check in the following files and any supporting Python modules:
textwave/pipeline.py
textwave/notebooks/context_answering_analysis.ipynb
Provide the GitHub URL of your textwave/notebooks/context_answering_analysis.ipynb file
via Canvas to get credit for this submission.