
Data Mining

October 3rd, 2023

 

1 Problem 1: Testing linear-attention Transformers

Compare several variants of the Performer architecture with the regular Transformer model as well as the Linformer architecture [Wang et al., 2020] on small image classification tasks (MNIST, CIFAR-10, and CIFAR-100). You can choose the depth of your Transformer model (the number of layers). Test at least the following Performer variants: (a) with positive random features (run ablations over various numbers of RFs), (b) using a deterministic mechanism with ReLU and exp nonlinearities. Apply two training strategies to the Performers and Linformers: (1) training from scratch, and (2) initializing from checkpoints already learned via regular Transformer training, and compare both variants. Conduct speed tests of all trained models. Include the code that can be used to reproduce all presented results. Document all the design decisions (number of heads, query-key dimensionality per head, etc.).
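For the positive random-feature variant, a minimal single-head NumPy sketch of the core linear-attention computation is given below; the function names, shapes, and scalings are illustrative assumptions rather than a prescribed implementation.

import numpy as np

def positive_random_features(x, omega):
    # Performer-style positive random features: phi(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m),
    # so that E[phi(q)^T phi(k)] = exp(q^T k), the un-normalized softmax kernel.
    m = omega.shape[0]
    sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ omega.T - sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, omega):
    # O(n) attention: the (n, n) attention matrix is never materialized.
    q_prime = positive_random_features(q, omega)   # (n, m)
    k_prime = positive_random_features(k, omega)   # (n, m)
    kv = k_prime.T @ v                             # (m, d_v)
    normalizer = q_prime @ k_prime.sum(axis=0)     # (n,) row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]

# Toy usage: n tokens, head dimension d, m random features.
rng = np.random.default_rng(0)
n, d, m = 128, 64, 256
q, k, v = (rng.standard_normal((n, d)) / d ** 0.25 for _ in range(3))  # 1/sqrt(d) scaling split between q and k
omega = rng.standard_normal((m, d))                # Gaussian projections; redraw with different m for the RF ablation
print(linear_attention(q, k, v, omega).shape)      # (128, 64)

For the speed tests, the same routine can be timed against an explicit softmax(QK^⊤/√d)V baseline at growing sequence lengths n.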

2 Problem 2: Learnable attention kernel

Propose an attention model where the entries A_{i,j} of the (not row-normalized) attention matrices A are obtained from neural network computational blocks (e.g. MLPs) that take the query and key vectors as inputs and output their similarity scores. Apply such an attention mechanism within a larger Transformer model for a downstream application of your choice and compare with the regular Transformer architecture. Describe in detail the neural network block that you use to learn the attention kernel. Can you propose a particular instantiation leveraging a structural inductive bias (e.g. recovering the regular softmax kernel at initialization)? Conduct speed tests of all trained models. Include the code that can be used to reproduce all presented results. Document all the design decisions (number of heads, query-key dimensionality per head, etc.).
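One possible (hypothetical) parameterization of such a learnable kernel is sketched below in NumPy: a small ReLU MLP applied to each concatenated query-key pair, with an exp output so that the un-normalized scores stay positive, as in the softmax kernel. The layer sizes, initialization scale, and choice of exp are illustrative assumptions.

import numpy as np

def mlp_attention_scores(q, k, W1, b1, W2, b2):
    # Learnable attention kernel: A[i, j] = exp(MLP([q_i ; k_j])).
    # In a real model W1, b1, W2, b2 are trained end-to-end with the rest of the Transformer.
    n_q, n_k = q.shape[0], k.shape[0]
    pairs = np.concatenate(
        [np.repeat(q[:, None, :], n_k, axis=1),    # (n_q, n_k, d)
         np.repeat(k[None, :, :], n_q, axis=0)],   # (n_q, n_k, d)
        axis=-1)                                   # (n_q, n_k, 2d)
    h = np.maximum(pairs @ W1 + b1, 0.0)           # ReLU hidden layer
    scores = (h @ W2 + b2).squeeze(-1)             # raw similarity logits
    return np.exp(scores)                          # positive, un-normalized kernel values

rng = np.random.default_rng(0)
n, d, hidden = 8, 16, 32
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
W1 = 0.1 * rng.standard_normal((2 * d, hidden))
b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((hidden, 1))
b2 = np.zeros(1)

A = mlp_attention_scores(q, k, W1, b1, W2, b2)     # (n, n) un-normalized attention matrix
attention = A / A.sum(axis=1, keepdims=True)       # row-normalize, as in softmax attention

One way to build in the inductive bias mentioned above would be to add q_i^⊤ k_j / √d directly to the MLP logits: with the MLP weights initialized near zero, the mechanism then starts from the regular softmax kernel.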

3 Problem 3: Product-kernel attention

Assume that the entries of the (not row-normalized) attention matrices A are of the following form:

A_{i,j} = ∏_{t=1}^{T} K_t(q_i, k_j)    (1)

for a sequence of kernels K_1, ..., K_T : R^d × R^d → R. Assume that each K_t for t = 1, ..., T can be linearized, i.e. there exist (randomized) mappings φ_t : R^d → R^{m_t} for some m_1, ..., m_T such that:

K_t(x, y) = E[φ_t(x)^⊤ φ_t(y)]    (2)

Propose an efficient linear-attention mechanism for the product kernel ∏_{t=1}^{T} K_t(q_i, k_j) as an attention kernel defining A. Does your mechanism provide unbiased estimation? Complement it with concentration results. Apply your method to the polynomial kernel K(x, y) = (x^⊤ y)^p, where p ∈ N. Can you force your mechanism to produce nonnegative (random) features defining the linearization of the polynomial kernel? Propose a method to use your algorithm for polynomial-kernel linearization to approximately linearize the softmax kernel (not necessarily unbiasedly; extra points are given for providing an unbiased mechanism). Test it on synthetic tasks involving softmax-kernel matrix estimation (you can sample queries and keys from fixed probability distributions, e.g. a multivariate Gaussian).
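As a concrete reference point, the NumPy sketch below linearizes the polynomial kernel (x^⊤ y)^p by writing it as a product of p copies of the linear kernel x^⊤ y = E[(ω^⊤ x)(ω^⊤ y)], ω ~ N(0, I_d), and multiplying p independent sets of linear features element-wise (one slice of the Kronecker product of the individual feature maps). The shapes, sample sizes, and the sanity check are illustrative assumptions.

import numpy as np

def product_kernel_features(x, omegas):
    # Random features for the polynomial kernel (x^T y)^p: each factor x^T y is
    # linearized by phi_t(x) = omega_t^T x, and the p independent copies are
    # multiplied element-wise, so E[phi(x)^T phi(y)] = (x^T y)^p (unbiased).
    # omegas has shape (p, m, d): p independent sets of m Gaussian projections.
    p, m, d = omegas.shape
    feats = np.ones((x.shape[0], m))
    for t in range(p):
        feats *= x @ omegas[t].T
    return feats / np.sqrt(m)

# Synthetic sanity check against the exact polynomial kernel matrix.
rng = np.random.default_rng(0)
n, d, p, m = 32, 16, 3, 20000
x = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal((n, d)) / np.sqrt(d)
omegas = rng.standard_normal((p, m, d))

approx = product_kernel_features(x, omegas) @ product_kernel_features(y, omegas).T
exact = (x @ y.T) ** p
print(np.abs(approx - exact).mean())               # shrinks as m grows

Note that these features can be negative, which is exactly the gap the nonnegativity question above asks you to close.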


4 Problem 4: Approximate softmax-attention with pseudo-Gaussian distributions

Consider a modification of the positive random feature map mechanism for the linearization of the softmax kernel, where the Gaussian projections ω_1, ..., ω_m are replaced with vectors ω̂_1, ..., ω̂_m whose entries are sampled independently at random from {−1, +1}. Note that such a mechanism provides a biased estimation of the softmax kernel, but leads to a bounded estimator. Provide concentration results for this estimator by computing upper bounds on the probabilities of deviations from the mean by a given value. Can you estimate the bias of the estimator (i.e. the gap between its expectation and the true softmax-kernel value)? In the standard setting with Gaussian projections ω_1, ..., ω_m, the so-called orthogonal mechanism, in which different projections are exactly orthogonal yet their marginal distributions are N(0, I_d), turns out to provably reduce the variance of the estimation. Can you propose an analogous version, providing orthogonality while maintaining the marginal distributions, for the {−1, +1} variant considered here? Test the quality of the approximation of the regular softmax-kernel attention by the {−1, +1} variants on synthetic tasks, with queries and keys sampled from fixed probability distributions, e.g. multivariate Gaussians.
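A minimal synthetic comparison along these lines might look like the NumPy sketch below, which reuses the positive random feature map and swaps the Gaussian projections for i.i.d. {−1, +1} entries; the sizes, scalings, and error metric are illustrative assumptions.

import numpy as np

def positive_features(x, omega):
    # Positive feature map phi(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m);
    # with Gaussian omega, E[phi(x)^T phi(y)] = exp(x^T y) (unbiased).
    m = omega.shape[0]
    return np.exp(x @ omega.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m = 64, 32, 1024
q = rng.standard_normal((n, d)) / d ** 0.25        # queries and keys from a fixed Gaussian
k = rng.standard_normal((n, d)) / d ** 0.25

omega_gauss = rng.standard_normal((m, d))          # standard Gaussian projections
omega_sign = rng.choice([-1.0, 1.0], size=(m, d))  # i.i.d. {-1, +1} projections (biased but bounded)

exact = np.exp(q @ k.T)                            # exact un-normalized softmax kernel
for name, omega in [("gaussian", omega_gauss), ("{-1,+1}", omega_sign)]:
    approx = positive_features(q, omega) @ positive_features(k, omega).T
    print(name, np.mean(np.abs(approx - exact) / exact))   # mean relative error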

References

[Wang et al., 2020] Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768.
