new transformer and attention layers

Matthew Leigh requested to merge matt_dev into main

This merge brings a new definition of the TransformerEncoder, which is made up of TransformerEncoderLayers (similar to PyTorch). These layers are based on the GPT style with the NormFormer placement of LayerNorms and residual connections: https://arxiv.org/pdf/2110.09456.pdf
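For reference, here is a minimal sketch of what such a layer looks like. The class name, defaults, and hyperparameters are illustrative, not necessarily the actual salt API; the point is the NormFormer norm placement (pre-norm plus extra norms after the attention output and after the FFN activation):

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """Sketch of a GPT-style encoder layer with NormFormer norm placement.

    NormFormer (arXiv:2110.09456) is pre-LN with two extra norms: one after
    the self-attention output and one after the feed-forward activation.
    """

    def __init__(self, dim: int, num_heads: int, ff_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pre_attn_norm = nn.LayerNorm(dim)
        self.post_attn_norm = nn.LayerNorm(dim)  # extra NormFormer norm
        self.pre_ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_mult * dim),
            nn.GELU(),
            nn.LayerNorm(ff_mult * dim),  # extra NormFormer norm after activation
            nn.Linear(ff_mult * dim, dim),
        )

    def forward(self, x, key_padding_mask=None):
        # Residual connections and norms are always on (no config to disable them)
        a = self.pre_attn_norm(x)
        a, _ = self.attn(a, a, a, key_padding_mask=key_padding_mask)
        x = x + self.post_attn_norm(a)
        # Feed-forward sublayer with its own residual connection
        return x + self.ff(self.pre_ff_norm(x))
```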

To make the code neater I removed the options to turn off the residual connections and normalisations. The transformer design space is already heavily explored and optimised, and essentially all performant variants keep both the normalisation and the residual connections. NormFormer seems to be the most stable and best-performing variant in the literature, so I advocate that we stick to it and reduce the complexity of our setup somewhat. https://wandb.ai/dalle-mini/dalle-mini/reports/An-Evaluation-of-Transformer-Variants--VmlldzoxNjk4MTIw

Additionally, I made changes to the dense network to allow it to take an extra input (turned off by default). This lets us broadcast high-level jet features into the message passing of the graph.
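A sketch of the mechanism, assuming the extra input is a per-jet context vector concatenated onto every node (names and layer sizes here are illustrative, not salt's actual implementation):

```python
import torch
import torch.nn as nn

class DenseNetwork(nn.Module):
    """Sketch of a dense network with an optional extra (context) input.

    The context (e.g. high-level jet features) is broadcast across the node
    dimension and concatenated onto each node's features before the MLP.
    """

    def __init__(self, inpt_dim: int, outp_dim: int, ctxt_dim: int = 0):
        super().__init__()
        self.ctxt_dim = ctxt_dim  # 0 disables the extra input (the default)
        self.net = nn.Sequential(
            nn.Linear(inpt_dim + ctxt_dim, 128),
            nn.ReLU(),
            nn.Linear(128, outp_dim),
        )

    def forward(self, x, ctxt=None):
        # x: (batch, nodes, inpt_dim); ctxt: (batch, ctxt_dim) or None
        if self.ctxt_dim:
            ctxt = ctxt.unsqueeze(1).expand(-1, x.shape[1], -1)  # broadcast over nodes
            x = torch.cat([x, ctxt], dim=-1)
        return self.net(x)
```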

I also added a set of utils in salt.utils.torch.py. These are not being used yet, but we might want to consider using pass_with_mask at some point. This operation saves memory and compute by only passing the non-padded elements of a tensor through a module. It is not a catch-all function, and it makes some assumptions about how the input and the mask are used, but we may find it useful going forward.
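The idea looks roughly like this (a sketch of the technique, assuming a boolean mask that is True for real elements and a module that can safely receive a flattened batch; the actual utility in salt.utils.torch.py may differ):

```python
import torch
import torch.nn as nn

def pass_with_mask(x: torch.Tensor, mask: torch.Tensor, module: nn.Module) -> torch.Tensor:
    """Pass only the non-padded elements of x through module.

    x:    (batch, nodes, features) padded tensor
    mask: (batch, nodes) boolean, True for real (non-padded) elements
    """
    # Run the module on the flattened set of real elements only,
    # skipping all the padded positions entirely
    valid_out = module(x[mask])
    # Scatter the results back into a zero-padded output tensor
    out = x.new_zeros(*mask.shape, valid_out.shape[-1])
    out[mask] = valid_out
    return out
```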

Finally, I also tweaked the code in attention.py:

  • Changed the method used to combine weights in the GATv2 layer to a broadcast rather than repeat/repeat_interleave: identical output, but quicker and neater (see the first sketch after this list)
  • Added functionality for an attention mask, which is transformer-speak for an adjacency matrix. This is turned off by default since we are using fully connected graphs, but the network can now accommodate sparse graphs
  • Added functionality for an attention bias. This was used in ParticleTransformer (https://arxiv.org/abs/2202.03772) to incorporate edge features by adding embedded edge features as a bias term to the attention scores (see the second sketch below)
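On the broadcast point: instead of repeating the attention weights with repeat/repeat_interleave to match a flattened head dimension, the features can be reshaped to expose the head axis so a batched matmul combines them directly, never materialising the repeated tensor. The shapes below are illustrative only, not the actual GATv2 code in attention.py:

```python
import torch

batch, nodes, heads, head_dim = 2, 10, 4, 16
attn = torch.rand(batch, heads, nodes, nodes)       # attention weights per head
feats = torch.rand(batch, nodes, heads * head_dim)  # node features, heads flattened

# Expose the head axis and let the batched matmul broadcast over it,
# rather than repeat_interleave-ing attn up to the flattened head dim
v = feats.view(batch, nodes, heads, head_dim).transpose(1, 2)  # (B, H, N, D)
out = (attn @ v).transpose(1, 2).reshape(batch, nodes, heads * head_dim)
```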
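And on the mask and bias: both slot into the attention score computation before the softmax, the bias additively (as in ParticleTransformer) and the mask by zeroing out non-edges. A sketch under assumed shapes (the function name and signatures are hypothetical, not the actual attention.py code):

```python
import math
import torch

def attention_scores(q, k, adj_mask=None, attn_bias=None):
    """Scaled dot-product attention scores with optional mask and bias.

    q, k:      (batch, heads, nodes, head_dim)
    adj_mask:  (batch, nodes, nodes) boolean adjacency; True = edge exists
    attn_bias: (batch, heads, nodes, nodes), e.g. embedded edge features,
               added to the raw scores as in ParticleTransformer (2202.03772)
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if attn_bias is not None:
        scores = scores + attn_bias
    if adj_mask is not None:
        # Disconnected pairs get -inf so the softmax assigns them zero weight
        # (with the fully connected default, no mask is applied at all)
        scores = scores.masked_fill(~adj_mask.unsqueeze(1), float("-inf"))
    return scores.softmax(dim=-1)
```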
