COMP5329 assignment2
Guanghao Huang (a), Peiqiao Zhang (b), and Yuben Yang (c)
(a) SID: 490498066
(b) SID: 500439825
(c) SID: 500376139
1 INTRODUCTION
This assignment addresses a multi-label classification task. Each sample has two inputs, an image and a caption, so a different feature extraction method is applied to each.
For the images, a convolutional neural network (CNN) extracts the features. More specifically, the popular ResNet-50 architecture is used with minor changes.
For the captions, Global Vectors for Word Representation (GloVe), an unsupervised learning algorithm, is used to obtain vector representations of the words. An LSTM model then processes these vectors to extract the caption features.
Finally, the feature vectors extracted by the CNN and the LSTM are concatenated and passed through a fully-connected layer to form the classification model. The assignment gives a better understanding of two main deep learning structures, the CNN and the RNN, and is a good opportunity to use both in a practical application.
2 RELATED WORKS
Much research has addressed multi-label image classification.
Wang et al. (2017) tackled a task similar to this assignment. They used a spatial transformer to find the attentional regions in the convolutional feature maps and an LSTM (Long Short-Term Memory) sub-network to sequentially predict semantic labelling scores, then combined the two for the classification task. Hua et al. (2019) proposed a novel end-to-end network, the class-wise attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM), for multi-label classification. Zhao-Min et al. (2019) used Graph Convolutional Networks (GCN) to model the dependencies between labels on top of CNN image features. Furthermore, they proposed a novel re-weighting scheme to generate an effective label correlation matrix, which guides information propagation among the nodes in the GCN. Shang-Fu et al. (2017) proposed jointly learning attention and recurrent neural network (RNN) models for multi-label classification, based on image captioning and pre-defined label sequences. More importantly, their LSTM model can jointly exploit label co-occurrence information (an idea adopted in our structure).
The literature review shows that, although the approaches differ in detail, most of them follow the same pattern. A CNN or a modified CNN handles the image information, and the attentional regions are an important part of image feature extraction. For the text or caption information, both a text processing method and an LSTM model are needed. Combining these structures gives the final classification model.
3 TECHNIQUES
The introduction gave a brief description of the classifier structure. This section describes the three main methods in detail: Global Vectors for Word Representation (GloVe) for text processing, the ResNet model for image feature extraction, and the LSTM for caption feature extraction.
3.1 Global Vectors for Word Representation
The purpose of the GloVe model (Pennington et al., 2014) is to turn words into numerical vectors that carry as much semantic and grammatical information as possible. The input is the text variable and the output is a numerical vector for each word. The process has several steps.
• Generate the co-occurrence matrix X. Each element X_{i,j} is the number of times word i and word j appear together within a window (whose length can be configured) in the caption dataset.
• Compute Ratio_{j,k}. This is the conditional probability that one word appears given that another word is already present. It ranges from zero to one; the closer the value is to one, the more likely words j and k are to appear together.
• Generate the numerical vectors. Suppose a numerical vector has already been assigned to each word. If a function G of these vectors reproduces Ratio_{j,k}, then G is correct and the vectors are a valid numerical representation of the text. The loss function is therefore defined as the difference between G and Ratio_{j,k}.
The weight in the loss function also needs to be larger for words with a higher frequency of co-occurrence. Once the loss function is defined, the numerical vectors are updated by minimising it.
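The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names, the toy captions, and the window of 2 are assumptions, and counts are left unweighted by distance. The weighting function follows the form given by Pennington et al. (2014), with their suggested defaults x_max = 100 and alpha = 0.75.

```python
from collections import defaultdict


def cooccurrence_counts(captions, window=2):
    """Count X[i][j]: how often word j appears within `window` words of word i."""
    counts = defaultdict(lambda: defaultdict(float))
    for caption in captions:
        tokens = caption.lower().split()
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[word][tokens[ctx_pos]] += 1.0
    return counts


def conditional_probability(counts, j, k):
    """Ratio_{j,k} = X[j][k] / (sum of row j of X), a value between 0 and 1."""
    total = sum(counts[j].values())
    return counts[j][k] / total if total > 0 else 0.0


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Loss weight that grows with the co-occurrence count and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Toy captions (hypothetical, for illustration only).
captions = ["a dog runs on the grass", "a dog sits on the sofa"]
X = cooccurrence_counts(captions, window=2)
print(X["dog"]["a"])                           # 2.0
print(conditional_probability(X, "dog", "a"))  # 2 of the 6 context words of "dog"
print(glove_weight(10.0), glove_weight(500.0))
```

Frequent pairs receive a weight closer to 1, so they contribute more to the loss, while the cap stops very common pairs from dominating.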
3.2 Deep residual network model structure
In this project, a CNN is used as the feature extraction method for images. Because the traditional CNN structure has some disadvantages, the ResNet model is used instead.
ResNet was created to solve the degradation problem of deep networks. As the number of layers grows, accuracy first increases and then decreases, and the training time also grows. This may be because the gradient is hard to back-propagate through too many layers.
According to the figure, the residual block sets F(x) = H(x) - x, where x comes from the shallow layer, H(x) comes from the deeper layer, and F(x) is the residual between H(x) and x. When the features x from the shallow layer are already mature enough that any change to them would increase the loss, F(x) automatically tends to 0 and the features continue to be passed along the identity path.
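The relation H(x) = F(x) + x can be illustrated with a toy residual block. This is only a sketch on plain Python lists: `zero_residual` and `small_residual` are hypothetical stand-ins for the convolutional branch, not layers of the actual network.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn computes F(x) with the same length as x."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A residual branch that has learned F(x) = 0."""
    return [0.0 for _ in x]


def small_residual(x):
    """A hypothetical residual branch that applies a small correction to x."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # [1.0, -2.0, 3.0]: x passes through unchanged
print(residual_block(x, small_residual))  # x plus a small correction
```

When the branch outputs zero, the block reduces to the identity mapping, which is why adding such blocks should not make a deep network worse than its shallower counterpart.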
The most important function of the residual module is to change how information is transmitted forward and backward, which greatly helps the optimization of the network. The residual module noticeably reduces the values of the parameters in the module, so the parameters in the network respond more sensitively to the back-propagated loss. Although it does not solve the problem of the back-propagated loss being small, it reduces the parameter values, strengthens the effect of the back-propagated loss, and also produces a certain regularization effect.
In this assignment the ResNet-50 model is selected; 50 means that there are 50 layers in total. More specifically, there is one 7x7 convolution with 64 output channels at the input, followed by 3 + 4 + 6 + 3 = 16 building blocks of three layers each, which gives 16 x 3 = 48 layers. Finally there is one FC layer (for classification), so there are 1 + 48 + 1 = 50 layers.
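The layer count can be checked with a short calculation. The variable names are illustrative; the numbers are the ones quoted in the paragraph above.

```python
blocks_per_stage = [3, 4, 6, 3]  # building blocks in the four stages
layers_per_block = 3             # each block has three layers
stem_conv_layers = 1             # the initial 7x7 convolution
fc_layers = 1                    # the final fully-connected layer

block_layers = sum(blocks_per_stage) * layers_per_block
total_layers = stem_conv_layers + block_layers + fc_layers
print(sum(blocks_per_stage), block_layers, total_layers)  # 16 48 50
```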
3.3 LSTM model
In this project, an LSTM, a type of RNN, is used as the feature extraction method for the caption information. After the Global Vectors for Word Representation step, the captions have been converted into numerical vectors that the LSTM model can process.
The LSTM performs better on long sequences than the plain RNN. Unlike the traditional RNN, which has only one transmitted state, the LSTM has two: the cell state and the hidden state.
There are three stages in an LSTM. The first is the forgetting stage, which selectively forgets the input from the previous node. Specifically, the computed value z^f (f for forget) acts as the forget gate and controls which information in the previous state is kept and which is forgotten. The second is the selective memory stage, which selects and memorizes the input under the control of z^i (i for information). The results of these two stages are added together to form the new cell state, from which the hidden state is computed. The last stage is the output stage, controlled by z^o.
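The three stages can be written out for a single LSTM unit with scalar states. This is a sketch of the standard LSTM update, not the project's code: the parameter values are arbitrary and each gate is given one input weight, one recurrent weight, and one bias for brevity.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM with scalar input and states."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    z_f = sigmoid(pre("forget"))     # forget gate z^f: how much of c_prev to keep
    z_i = sigmoid(pre("input"))      # input gate z^i: how much new content to store
    z = math.tanh(pre("candidate"))  # candidate content computed from the input
    z_o = sigmoid(pre("output"))     # output gate z^o: how much of the cell to expose

    c = z_f * c_prev + z_i * z       # forgetting and selective memory are added together
    h = z_o * math.tanh(c)           # output stage produces the hidden state
    return h, c


# Arbitrary illustrative parameters: (input weight, recurrent weight, bias) per gate.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x_t in [0.5, -1.0, 0.25]:
    h, c = lstm_step(x_t, h, c, params)
    print(round(h, 4), round(c, 4))
```

In a real layer x, h, and c are vectors and the weights are matrices, but the gating logic is the same element by element.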
4 EXPERIMENTS AND RESULTS
The experiments were run in the Google Colab environment with 12.72 GB of RAM and a Tesla P100 GPU. One training epoch took 800 s to 900 s and the best epoch was the second, so the total training time for the model was about 30 minutes.
Various deep learning architectures were explored. Some achieved relatively satisfactory results, while others showed little promise for the required multi-label classification task. This section introduces the details of the experiments and the representative results.
All architectures use the same baseline preprocessing to prepare the input data. The original images are rescaled to 3 x 256 x 256. For the text data, the words that appear in the captions are collected and processed with GloVe (Pennington et al., 2014) to construct the word embedding matrix. The original captions are then truncated or padded to a fixed length of 10 and transformed into numeric vectors of 25 dimensions through the embedding matrix.
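The caption preprocessing can be sketched as follows. The fixed length of 10 and the 25 dimensions come from the text; the padding token, the toy embedding table, and the choice to map padding and unknown words to zero vectors are assumptions made for illustration.

```python
MAX_LEN = 10    # fixed caption length stated in the text
EMBED_DIM = 25  # embedding dimension stated in the text
PAD = "<pad>"   # hypothetical padding token


def pad_or_truncate(tokens, max_len=MAX_LEN, pad=PAD):
    """Cut the token list to max_len, or extend it with padding tokens."""
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))


def embed(tokens, table, dim=EMBED_DIM):
    """Look up each token; padding and unknown words map to a zero vector (an assumption)."""
    zero = [0.0] * dim
    return [table.get(tok, zero) for tok in tokens]


# Toy embedding table standing in for the GloVe embedding matrix.
table = {"a": [0.1] * EMBED_DIM, "dog": [0.2] * EMBED_DIM}

tokens = pad_or_truncate("a dog runs".split())
vectors = embed(tokens, table)
print(tokens)
print(len(vectors), len(vectors[0]))  # 10 25
```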
All architectures also share the same final FC layer, which uses sigmoid activation with 1100 inputs and 18 outputs. The 18 outputs correspond to the probabilities that each of the 18 classes is present. They are connected to BCELoss, which evaluates the binary cross-entropy. For training, the popular Adam optimizer is used on mini-batches. The optimisation hyperparameters are summarised in Table 1.
Hyperparameter       Optimal value
Learning rate        0.01
Momentum             0.9
Weight decay rate    1e-4
Batch size           10
Table 1. Optimisation hyperparameters. This table shows the optimisation hyperparameters used for the whole neural network.
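The final layer and its loss can be illustrated on a small scale. The sketch below uses 4 inputs and 3 outputs as stand-ins for the 1100 inputs and 18 outputs, with arbitrary weights; the clamping constant `eps` is an assumption added for numerical safety, and the loss is averaged over labels.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(features, weights, biases):
    """Fully-connected layer: one weight row and one bias per output class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def bce_loss(probs, targets, eps=1e-7):
    """Binary cross-entropy averaged over all labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


n_in, n_out = 4, 3  # small stand-ins for 1100 inputs and 18 outputs
features = [0.5, -1.0, 2.0, 0.0]
weights = [[0.1] * n_in, [-0.2] * n_in, [0.3] * n_in]
biases = [0.0] * n_out

probs = [sigmoid(z) for z in linear(features, weights, biases)]
targets = [1.0, 0.0, 1.0]  # each class is present (1) or absent (0) independently
print(probs)
print(bce_loss(probs, targets))
```

Because each output passes through its own sigmoid, the classes are scored independently, which is what allows several labels to be predicted for one sample.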
The final architecture that we submitted, which achieved the best mean F1 score, combines the pre-trained ResNeXt-50 and LSTM modules. The ResNeXt module extracts information from the image inputs, producing features of 1000 dimensions, and the LSTM, which contains one hidden layer, transforms the embedding vectors into features of 100 dimensions. These two feature vectors are concatenated as the input to the final FC layer. To fully exploit the potential of this architecture, an ablation study of several training tricks was conducted. The tricks and the F1 scores achieved on our validation data are listed in Table 2. The table shows that random flipping for data augmentation benefits model performance, so this strategy was adopted in our final submission.
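A random flip can be sketched on an image stored as rows of pixels. The report does not state the flip direction or probability, so the horizontal direction and p = 0.5 below are assumptions.

```python
import random


def horizontal_flip(image):
    """Flip an image stored as a list of pixel rows by reversing every row."""
    return [row[::-1] for row in image]


def random_horizontal_flip(image, p=0.5, rng=random):
    """Flip with probability p (p = 0.5 is an assumed value, not from the report)."""
    return horizontal_flip(image) if rng.random() < p else image


image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
print(random_horizontal_flip(image))
```

The labels are left unchanged by the flip, so the augmentation adds image variety without needing new annotations.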
Tricks                                  F1 score
ResNeXt-50 + LSTM                       0.83
ResNeXt-50 + LSTM + Random flip         0.89
ResNeXt-50 + LSTM + Dropout             0.80
ResNeXt-50 + LSTM + Random rotation     0.79
ResNeXt-50 + LSTM + Whitening           0.81
ResNeXt-50 + LSTM + Random apply        0.85
ResNeXt-50 + LSTM + Enlarge and crop    0.82
ResNeXt-50 + LSTM + GCN                 cannot converge
Table 2. Ablation study for ResNeXt-50 + LSTM. This table shows the results of different combinations of methods.
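For reference, one common way to compute a mean F1 score for multi-label predictions is to compute F1 per sample and average over samples. The report does not state which averaging was used or how probabilities were turned into labels, so both the sample-wise averaging and the 0.5 threshold below are assumptions.

```python
def f1_for_sample(predicted, actual):
    """F1 between the predicted and true label sets of one sample (0.0 if no overlap)."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(all_predicted, all_actual):
    """Average the per-sample F1 scores."""
    scores = [f1_for_sample(p, a) for p, a in zip(all_predicted, all_actual)]
    return sum(scores) / len(scores)


def labels_from_probs(probs, threshold=0.5):
    """Keep the class indices whose probability reaches the (assumed) threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


pred = [labels_from_probs([0.9, 0.2, 0.7]), labels_from_probs([0.1, 0.8, 0.3])]
true = [[0, 2], [1, 2]]
print(pred)                 # [[0, 2], [1]]
print(mean_f1(pred, true))  # average of 1.0 and 2/3
```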
Before the best architecture was identified, we tried different CNN-based modules for extracting the image features. These modules include a basic CNN and pre-trained models such as AlexNet, VGG16, and ResNet. The performance of these choices, together with the best architecture, is displayed in Table 3. These results suggest that the more advanced pre-trained networks proposed in academic papers are beneficial for this task: they either extract the features better or converge more easily, owing to their sophisticated structures and transfer learning.
Architecture          F1 score
CNN + LSTM            0.75
VGG16 + LSTM          0.79
ResNeXt-50 + LSTM     0.83
Table 3. CNN-based modules comparison. This table shows the results of the comparison of different CNN-based modules.
5 DISCUSSION AND CONCLUSION
The multi-label classification task was accomplished with reasonable results. To improve our score in the competition, a number of architectures and techniques were put into practice. In the course of this assignment, we proposed a solution that combines different input sources, the image and the text data. It can serve as an example whenever a problem requires diverse data sources. Experiments were conducted for the ablation study and the model comparisons. The results highlight the power of advanced CNN-based architectures and transfer learning, which is worth exploiting further in our future engineering projects.
In general, this assignment offered a great opportunity to apply our knowledge of deep learning to a relatively complex computer vision task. Such an experience will encourage us to further explore the world of deep learning and perhaps develop an advanced technique in the future.
REFERENCES
Hua, Y., Mou, L., and Zhu, X. X. (2019). Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification. ISPRS Journal of Photogrammetry and Remote Sensing, 149:188–199.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Shang-Fu, C., Yi-Chen, C., and Yu-Chiang, F. (2017). Order-free RNN with visual attention for multi-label classification. arXiv.org.
Wang, Z., Chen, T., Li, G., Xu, R., and Lin, L. (2017). Multi-label image recognition by recurrently discovering attentional regions. arXiv.org.
Zhao-Min, C., Xiu-Shen, W., Wang, P., and Guo, Y. (2019). Multi-label image recognition with graph convolutional networks. arXiv.org.
A APPENDIX
Instructions to run the code:
1. Copy the image data folder into the Input folder, so that the images are stored in the directory /Input/data/.
2. Copy train.csv and test.csv into the Input folder.
3. Run 5329as2.ipynb in the Algorithm folder from top to bottom.
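Before running the notebook, the expected layout can be checked with a short script. This helper is hypothetical and not part of the submitted code; it assumes the Input folder sits directly under the project root from which the script is run.

```python
from pathlib import Path


def missing_inputs(root="."):
    """Return the expected input paths that do not exist under `root`."""
    input_dir = Path(root) / "Input"
    expected = [input_dir / "data", input_dir / "train.csv", input_dir / "test.csv"]
    return [str(p) for p in expected if not p.exists()]


missing = missing_inputs(".")
print("All inputs found." if not missing else f"Missing: {missing}")
```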