程序辅导案例 > Program >

代写辅导接单-HW3--

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Submission through Canvas: upload a zip file with the name HW3-yourNETID-code.zip before the deadline (above) Penalty for wrong formatting & file naming: 5% grade deduction LSTM & TRANSFORMERS FOR TEXT MINING, TOPIC MINING [+ LEARNING TO RANK] (100 points) For this homework we will continue working with the News dataset from previous homework. https://github.com/mhjabreel/CharCnn Keras/blob/master/data/ag news csv/test.csv 1 Word Prediction with Word2Vec and LSTM (50 points) For this question, you will need the test.csv file of the News dataset from both previous homework. You will also need to install both TensorFlow (pip install tensorflow==2.0.0-alpha0) and Keras (pip install keras) – alternatively you can use Keras from TensorFlow. After installation of the libraries create a Jupyter notebook that does the following: 1) Download the news data file above (test.csv) 2) Load it with Pandas and run the usual data preprocessing we have done so far (transform to lowercase, remove numbers, punctuation, NAs, etc. stemming is optional) and select the 3rd column as the document content 3) Modify the attached LSTM code to use a 100-dimensial time series (instead of 3-dimensional one) and predict the next two words as follows. Create a 100-dimensional time-series, for which you will use a word2vec encoding and create vector words of dimension 100. Predict the next two word vectors. For the final prediction, you have to indicate the actual words (in English) that were predicted. Thus, find the words that are closer in meaning to the 2 word-vectors you predicted. Specifically: 3.1) For word2vec feel free to use a pre-trained word2vec (training is only needed for LSTM) and any Python library but I suggest you use gensim. Once you encoded the collection using word2vec, you will have to modify the code to use these 100- dimensional time series which come from the word vectors transitions from one word to the next in the sentences of the news collection. Notice that the example code stacks the data horizontally. This is because each time series is built separately one series at a time. However, since our “time” series come from vector sequences, and thus we are building each element of the 100 sequences at once, you will have to stack the data vertically (vstack) and iteratively one-word vector at a time. 3.2) Modify the code to predict the next value of the skipgram word-vector sequence. To do that you will need to modify the code to use the word-vectors from the dataset. Then, you will have to partition the dataset with the entries corresponding to the known words (e.g. first n − 2 words of a sentence) for training and the unknown words (e.g. last 2 words) for prediction. 3.3) Plot a histogram of the root mean squared error (RMSE) values of the predictions for both the tanh activation function and the ReLU activation function – you can plot them side by side using different colors for each histogram. This will provide an estimate of performance. 1 Up to this point, you are experimenting with the various documents in the collection of documents, points 3.4 and 3.5 need to be done for a selected document as indicated next: 3.4) Select the activation function (tanh or ReLU) with better distribution of RMSE and predict the last 2 word- vectors of one document. The doc you will choose for this is the one with most words in the collection. 3.5) Find the words that are closer in meaning to the word-vectors you predicted. If you are using gensim, you can use the function model.wv.similar by vector to obtain the words that are closer in meaning to a given word vector(s). 4) Add a boolean variable to alternatively include the stop words in the code. 5) Run the code (Word2Vec encoding and LSTM prediction) but include the stop words when the flag of the previous step is set to True. How do the RMSE of the predicted words compare with and without stop words (you can alternatively plot the histogram of F-scores of the matched closest words)? How does the quality of the prediction compare? Write your answers in a README file. 6) Set the flag to the option (without or with stop words) that performed better for the code you submit. Note. To simplify this problem, you can select only one sentence from each document instead of the full document. That is, you can parse the sentences (before removing punctuation) and work only with the sentence with the most words in each document. # multivariate lstm code based on Brownlee (2020) from numpy import array from numpy import hstack from keras.models import Sequential from keras.layers import LSTM from keras.layers import Dense from keras.layers import RepeatVector from keras.layers import TimeDistributed # split a multivariate sequence into samples def split_sequences(sequences , n_steps_in , n_steps_out): X, y = list(), list() for i in range(len(sequences)): # find the end of this pattern end_ix = i + n_steps_in out_end_ix = end_ix + n_steps_out # check if we are beyond the dataset if out_end_ix > len(sequences): break # gather input and output parts of the pattern seq_x , seq_y = sequences[i:end_ix , :], sequences[end_ix:out_end_ix , :] X.append(seq_x) y.append(seq_y) return array(X), array(y) # define input sequence in_seq1 = array([10 , 20 , 30 , 40 , 50 , 60 , 70, 80, 90]) in_seq2 = array([15 , 25 , 35 , 45 , 55 , 65 , 75, 85, 95]) in_seq3 = array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))]) # convert to [rows , columns] structure in_seq1 = in_seq1.reshape ((len(in_seq1), 1)) in_seq2 = in_seq2.reshape ((len(in_seq2), 1)) in_seq3 = in_seq3.reshape ((len(in_seq3), 1)) # horizontally stack columns dataset = hstack ((in_seq1 , in_seq2 , in_seq3)) 2 # choose a number of time steps n_steps_in , n_steps_out = 3, 2 # covert into input/output X, y = split_sequences(dataset , n_steps_in , n_steps_out) # the dataset knows the number of features , e.g. 2 n_features = X.shape[2] # define model model = Sequential () model.add(LSTM(200 , activation=’relu’, input_shape=(n_steps_in , n_features))) model.add(RepeatVector(n_steps_out)) model.add(LSTM(200 , activation=’relu’, return_sequences=True)) model.add(TimeDistributed(Dense(n_features))) model.compile(optimizer=’adam’, loss=’mse’) # fit model model.fit(X, y, epochs=300 , verbose=0) # demonstrate prediction x_input = array([[60 , 65 , 125], [70 , 75 , 145], [80, 85, 165]]) x_input = x_input.reshape ((1, n_steps_in , n_features)) yhat = model.predict(x_input , verbose=0) print(yhat) 2 Word Prediction with Transformers - (30 points) For this problem, you will use the transformers library (!pip install transformers) from Hug- ging Face. You will also use Torch and Numpy. Similarly to Problem 1, you will use the test.csv file of the News dataset from both previous homework. Unlike LSTM you will use pre-trained models, BERT and GPT2, and thus, no training is needed for this problem (no partitioning of the data in training and test sets, there will be only test sets). 1) Download the news data file above 2) Load it with Pandas and run the usual data preprocessing we have done so far (transform to lowercase, remove numbers, punctuation, NAs, etc. stemming is optional) and select the 3rd column as document content 3) You will use BERT and GPT2 to predict the next (one) word as follows. 3.1) Use the sample code below to create 2 functions (1 for BERT and 1 for GPT3) that use the pre-trained models to forecast the next word given a prefix sequence. 3.2) Create a function or process that iterates over the documents in the collection and use the first n − 1 words of a sentence as a “prompt” or test sequence and the unknown word (e.g. the last word) for prediction. 3.3) Compare the predicted word with the actual word in the sentence to compute 2 performance estimates: 1 mandatory, the accuracy, and one optional, the cosine-similarity distributions with pseudo-feedback (extra credit, detailed below). Compute the accuracy for both models (number of correctly predicted words/total predictions) - this number should be low for the default BERT and a little higher for GTP2. Now add back the points (the punctuation) at the end of the sentences and run your predictions again - the performance now should be a little higher for BERT. Note. To simplify this problem, you can select only one sentence from each document instead of the full document. That is, you can parse the sentences (before removing punctuation) and work only with the sentence with the most words in each document. EXTRA CREDIT (5 points): Run 1 additional evaluation metric, namely, a pseudo-feedback cosine- similarity of a reference embedding. The process is as follows: use a pre-trained word2vec model to encode both the real word and the predicted word (of both BERT and GPT2) and compute the cosine similarity of their word-vectors. Plot the distribution/histogram of cosine similarities for both BERT and GPT2; plot the histograms in the same plot using different colors for each model. Which one is better? Write your answer in the README file. 3 Sample code to predict next word with pre-trained BERT from transformers import pipeline model = pipeline(’fill -mask’, model=’bert -base -uncased ’) pred = model("Forecasts of Presidential [MASK]") print("Predicted next word: ") pred[0][’token_str ’] Sample code to predict next word with pre-trained GPT2 import torch from transformers import AutoModelForCausalLM , AutoTokenizer import numpy as np tokenizer = AutoTokenizer.from_pretrained("gpt2") model2 = AutoModelForCausalLM.from_pretrained("gpt2") seq = "Forecasts of Presidential" inputs = tokenizer(seq , return_tensors="pt") input_ids = inputs["input_ids"] for id in input_ids[0]: word = tokenizer.decode(id) with torch.no_grad (): logits = model2( ** inputs).logits[:, -1, :] pred_id = torch.argmax(logits).item() pred_word = tokenizer.decode(pred_id) print("Predicted next word: ") print(pred_word) 3 Topic Modeling - Multinomial PMMModel and LDA(20 points) For this problem, you can use the test.csv file from the News dataset. Each row corresponds to a document and the third column is the content of the documents. For this problem, you will use the gensim library to identify topics in the collection. The following code shows how to use LdaModel: # Create the object for LDA model lda1 = gensim.models.ldamodel.LdaModel # Train the LDA model using the document term matrix. ldamodel = lda1(matrix_of_doc_term , num_topics=10, id2word = D1 , passes=100) matrix of doc term can be computed with the doc2bow function by providing a document from the corpus to the function. The matrix is built by computing doc2bow for every document in the corpus. The steps are: 1. create the dictionary D1 with gensim.corpora.Dictionary (use the collection) 2. compute the matrix_of_doc_term by computing D1.doc2bow for every document in the collection 4 Finally, use ldamodel to plot: a. The 5 most relevant terms for one of the topics b. Plot a matrix comparing the distance/difference of the 10 topics. Use the Kullback-Leibler diver- gence and 50 words. e.g. ldamodel.diff(ldamodel, distance=’kullback leibler’, num words=50). You can use any library for the plots but Matplotlib or ggplot are recommended. 4 EXTRA CREDIT - Logistic Regression for Learning to Rank with PageRank, HITS, and TF-IDF (25 points) Create a function that trains a learning-to-rank algorithm for classifying web-pages as relevant or not, using logistic regression, as follows. Use the set of pages you downloaded with the web crawler you build in HW2 to build a graph of URL links. Out of the ten pages, select a query or topic for which approximately only half of the pages will be relevant and half irrelevant. If the pages you downloaded for HW2 are not easy to label this way, choose and download a dif- ferent set. You will use this manual identification of relevant and non-relevant pages as your ground truth of labels as indicated below. Use the content of the HTML page as content of the document itself, index the pages as you wish (e.g. titles, file prefixes, numbers from 1 to # of pages, etc.) but number them (for the matrix representations of PageRank and HITS) and use the numbers as the indices of the nodes, and use the hyperlinks as edges in the graph. Specifi- cally, use the page id (e.g. a number or similar) as the node and the hyperlink () as an edge between two pages. For instance if the page www.nytimes.com links to your last page, say https://www.nytimes.com/2023/11/10/technology/personalized-ai-agents.html you can index the first as node 1 and the second as node 10 and so on, then, you could assign a link between nodes 1 and 10 (alternatively, 0 and 9). You will use the labeling of relevant pages you created manually (indicated above) and build a dataset as follows: Relevant? PageRankScore HITSScore TF IDF y(1) x (1) 1 x (1) 2 x (1) 3 y(2) x (2) 1 x (2) 2 x (2) 3 . . . y(10) x (10) 1 x (10) 2 x (10) 3 Table 1: Dataset Compute both HITS and PageRank scores (you don’t have to implement either PageRank nor HITS and you can use any library of your choice), in addition to the TF-IDF score of the pages. Arrange the three scores for each page and for the cases neded consider a set of words of your interest as shown in Table 1. For instance, you can use: [reuters, stocks, friday, investment, market, prices] as words that are relevant for the topic financial markets. Then, use Scikit-learn (link) to apply a logistic regression classifier (no coding from scratch is expected for logistic regression) to classify documents that are relevant from those that are irrelevant using the dataset you just created. Thus, you are building a learning-to-rank program that will consider not only the content of the pages but the hierarchy of importance of the documents to rank them. Partition your data into training and test sets by using 60% pages as training and 40% as testing. Run the prediction of relevance for your pages and report the precision and recall 5 curve and F1-scores (for both training and test sets ). Make sure that the partitions have an equal representation of each of the two types of pages in it (relevant and not relevant). WHAT TO TURN IN Mandatory (Graded) elements/files within the zip file: 1) The Jupyter notebook with the solutions to the problems (please, annotate & comment your code) and 2) a README file with the answers to the questions in the problems 3) (Extra Credit) For the Learning-To-Rank question: In addition to the notebook (can be a separate notebook but it is better to use the same), submit a PDF with a plot of the results and a one-paragraph description of what you found most interesting. Optional (no points will be awarded for this!): You can also include in the README file a URL link to your Colab file where you can show what you did in case that helps check what you did. However, this is just to verify outputs, and the code in your Collab repositoty will not be considered for grading. Only the submitted code will be graded. REFERENCES Brownlee, J. How to Develop LSTM Models for Time Series Forecasting. (2020). 6 51作业君版权所有