Slide 29
Predicting Molecular Properties 1st Place Solution
https://www.kaggle.com/competitions/champs-scalar-coupling/discussion/106575
Impressive already back in 2019!
The source behind the source
Following the standard transformer architectures, at each layer of the network we use a
self-attention layer that mixes the embeddings between the nodes. The "standard"
scaled self-attention layer from the transformer paper would be something like (forgive
the LaTeX-esque notation formatted as code … I'm entirely unprepared to describe model
architectures without being able to write some form of equation):
Z' = W_1 Z softmax(Z^T W_2^T W_3 Z)
where W_1, W_2, and W_3 are the weights of the layer. However, following the general
practice of graph transformer architectures, we instead use
Z' = W_1 Z softmax(Z^T W_2^T W_3 Z - gamma*D)
where D is a distance matrix defined by the graph.