Transformer for All Data Types

Firiuza
5 min read · May 22, 2022


In 2017 Google published one of the most cited papers, «Attention is all you need». It described a new network architecture for NLP tasks built around the self-attention mechanism.

Since then, many articles have explained the Transformer architecture (the most popular source), so here I want to highlight just one thing that I find important for understanding the Transformer before moving on.

So what does the Transformer do?

Scaled Dot-Product Attention
  1. The given sequence is turned into a Key, a Query and a Value. Before becoming a Key, Query or Value, the sequence goes through a Linear Layer that performs a linear transformation and embeds it into the latent space.
  2. Key and Query are passed to a matrix multiplication. What does that actually mean? Each sample in the sequence is multiplied by the other samples in the sequence, so we get a new representation of each sample based on the representations of the other samples. Thus, the Transformer corrects the embedding of each sample depending on the context.
  3. After that, for each sample we get an importance vector through Softmax, which gives us a probability vector where the Transformer explicitly shows the importance of the features.
  4. By understanding which features are more or less important for each sample in the sequence, the Transformer changes the Value, i.e. it corrects the embedded samples in the sequence according to the self-attention mechanism (see the sketch after this list).
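Here is a minimal sketch of this scaled dot-product self-attention in PyTorch. The names (SelfAttention, d_model) are mine, not from the paper's code, and multi-head attention and masking are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Linear layers that project the input sequence into Query, Key and Value
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Each sample attends to every other sample in the sequence
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)    # importance weights per position
        return weights @ v                     # context-corrected embeddings

x = torch.randn(2, 10, 64)                     # 2 sequences, 10 steps, 64 features
out = SelfAttention(64)(x)
print(out.shape)                               # torch.Size([2, 10, 64])
```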

So, the Transformer represents the sequence in the latent space.

But nowadays the Transformer is used not only for NLP tasks but also in other areas. Below I want to show how the Transformer architecture is applied to different data types.

Multivariate Time Series data

Transformer for MTS data

In this paper we can find out how to apply the Transformer architecture to multivariate time series data. Time series data is a sequence of feature vectors, where each item in the feature vector can represent a specific physical feature.

  1. Before going to the encoder, a linear layer is applied to the entire sequence to embed each sample in the sequence.
  2. Then, as in the original paper, a sinusoidal positional encoding can be added to the embedded sequence. This paper argues that a fully trainable positional encoding works better, but the authors also admit that positional encoding is not strictly needed given the numerical information in the time series. They hypothesize that the positional encoding tries «to occupy a different, approximately orthogonal, subspace to the one in which the projected time series samples reside. This approximate orthogonality condition is much easier to satisfy in high dimensional spaces». So with more learnable parameters (the ones used for positional encoding) it is easier for the network to figure out how to allocate embeddings in the latent space.
  3. After that, the embedded input goes into the Transformer Encoder and is processed as in the vanilla Transformer (a sketch follows this list).
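A rough PyTorch sketch of these three steps, assuming a learnable positional encoding; the module and parameter names (MTSEncoder, n_features, d_model, max_len) are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class MTSEncoder(nn.Module):
    def __init__(self, n_features: int, d_model: int = 64, max_len: int = 512):
        super().__init__()
        # 1. Linear layer embeds each time step (feature vector) into d_model dims
        self.embed = nn.Linear(n_features, d_model)
        # 2. Fully learnable positional encoding instead of the sinusoidal one
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        # 3. Standard Transformer encoder stack
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        z = self.embed(x) + self.pos[:, : x.size(1)]
        return self.encoder(z)                 # (batch, seq_len, d_model)

out = MTSEncoder(n_features=8)(torch.randn(4, 100, 8))
print(out.shape)                               # torch.Size([4, 100, 64])
```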

As we can see, there is no big difference from the NLP task, where we work with embedded words. But there is NO Decoder phase, simply because it’s not necessary for time series data.

What could be more interesting at the very beginning: instead of a Linear Layer that embeds each sample in the sequence, we can apply a 1D Convolution and then move on to the positional encoding and so on. Some experiments show that this approach outperforms the Linear Layer.
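A hedged sketch of that alternative, simply swapping the linear embedding for a Conv1d; the kernel size and channel counts here are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

# Conv1d replaces the per-step linear embedding; kernel_size > 1 mixes neighbors
conv_embed = nn.Conv1d(in_channels=8, out_channels=64, kernel_size=3, padding=1)

x = torch.randn(4, 100, 8)                          # (batch, seq_len, n_features)
z = conv_embed(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, seq_len)
print(z.shape)                                      # torch.Size([4, 100, 64])
```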

The Transformer gives us encoded representations of the data, but for classification or regression tasks we need a single value at the end. So the question arises of how to further process this encoded data.

  1. MLP layer: simply flatten the sequence into a single vector and pass it to a Linear Layer (there can be multiple Linear Layers with different hidden sizes); see the sketch after this list.
  2. RNN layer: pass the encoded sequence to RNN layers.
  3. CNN layer: it is also common practice to process sequential data with convolution layers; this works mostly with 1D convolutions, but it can be done with a 2D layer as well. The choice depends on the capacity of the device, because a 2D layer requires more memory to process the data.
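A minimal sketch of the first option, the MLP head, assuming the encoder output has shape (batch, seq_len, d_model); all sizes below are arbitrary.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, n_classes = 4, 100, 64, 3
encoded = torch.randn(batch, seq_len, d_model)   # output of the Transformer encoder

mlp_head = nn.Sequential(
    nn.Flatten(),                                # flatten the whole sequence into one vector
    nn.Linear(seq_len * d_model, 128),
    nn.ReLU(),
    nn.Linear(128, n_classes),                   # one logit per class
)
print(mlp_head(encoded).shape)                   # torch.Size([4, 3])
```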

At the beginning of this post I showed that the representation of each sample is based on the full context of its sequence, and this is the key difference from RNNs and CNNs. When processing each sample sequentially, an RNN doesn’t look at samples that come later; only previously processed samples are taken into account (a bidirectional RNN partially solves this problem). A CNN also has a sliding window, and only with a deeper architecture or a larger kernel size can we increase the receptive field, so in practice the first layers do not see the whole picture. The Transformer approach really can help to get a better representation, but we still need a next step that processes the data along the time axis.

Tabular data

In the paper “TabTransformer” the authors show that the Transformer can be used for categorical features. If the data contains categorical features, they can be stacked to represent sequential data and passed to the Encoder.

After obtaining new representations of the categorical features through the Transformer Encoder, they are stacked together with the continuous features and used as the input for the next tabular data processing step (a simplified sketch follows).
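A simplified sketch of this idea, not the authors' code: embed each categorical column, contextualize the embeddings with an encoder, then concatenate them with the continuous features. All names and sizes are mine.

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    def __init__(self, cardinalities, n_cont, d_model=32, n_classes=2):
        super().__init__()
        # One embedding table per categorical column
        self.embeds = nn.ModuleList([nn.Embedding(c, d_model) for c in cardinalities])
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(len(cardinalities) * d_model + n_cont, n_classes)

    def forward(self, x_cat, x_cont):       # x_cat: (batch, n_cat) ints, x_cont: (batch, n_cont)
        tokens = torch.stack([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        ctx = self.encoder(tokens).flatten(1)            # contextual categorical embeddings
        return self.head(torch.cat([ctx, x_cont], dim=1))

model = TabTransformerSketch(cardinalities=[10, 5], n_cont=3)
logits = model(torch.randint(0, 5, (4, 2)), torch.randn(4, 3))
print(logits.shape)                                      # torch.Size([4, 2])
```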

In this post I wrote about Neural Networks for Tabular data.

Images

  1. The key idea is to divide the image into fixed-size patches, i.e. small squares.
  2. Each patch is flattened and then passed to a Linear Layer to get patch embeddings.
  3. After obtaining the linear projection of the flattened patches, position embeddings are added to them: for each patch embedding another 1D embedding is added. These position embeddings are learnable too. This is similar to the multivariate time series approach.
  4. After that, the embedded patches go into the standard Transformer Encoder.
  5. The output of the Encoder goes to an MLP layer (see the sketch after this list).
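A rough sketch of the input part of this pipeline (steps 1–3), assuming PyTorch; the names and sizes are illustrative, and the class token is omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, channels=3, d_model=768):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # Cut the image into fixed-size patches and flatten each one
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)
        # Linear projection of the flattened patches
        self.proj = nn.Linear(channels * patch * patch, d_model)
        # Learnable 1D position embeddings, one per patch
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, d_model))

    def forward(self, img):                          # img: (batch, 3, 224, 224)
        patches = self.unfold(img).transpose(1, 2)   # (batch, n_patches, 3*16*16)
        return self.proj(patches) + self.pos         # embedded patches for the encoder

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                  # torch.Size([2, 196, 768])
```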

The authors also suggest that at the very beginning the input image could be passed through a Convolution Layer, and after that the proposed pipeline follows.

What do they highlight:

  1. Position embeddings do encode the distance between patches and reflect the spatial arrangement of the original image. 1D embeddings also capture dependencies along the row and column axes, which is why 2D embeddings are not needed.
  2. «Self-attention allows ViT to integrate information across the entire image even in the lowest layers». This means that from the very beginning the network has a representation based on the entire image.

There is also a very interesting paper, “Do Vision Transformers See Like Convolutional Neural Networks?”, because it’s not so obvious to me why ViT works so well. This paper conducts research to understand the key features of ViT.

There are other areas where Transformer can be used. Please take a look at the Stanford course where other areas are mentioned: https://web.stanford.edu/class/cs25/
