Deep Neural Networks for Tabular Data

Firiuza
7 min read · Jan 25, 2021

Spoiler: TabNet and TabTransformer will be here…

When we talk about tabular data and look for an appropriate model, it surprises nobody that gradient boosting is usually considered the method of choice. But recently, particularly in 2020, a big step has been made towards Deep Learning and adapting neural networks to match or beat boosting on tabular data.

I want to dedicate this article to different types of neural networks that can genuinely match or beat GBDT on tabular data.

Good Old MLP (+ Embedding Layer)

(Image source: https://www.mdpi.com/1996-1073/10/1/3)

The story begins with the simplest neural network architecture built from fully connected layers; the only difference is that all categorical variables are first mapped through an embedding layer. In my business cases I have seen MLP results that beat boosting on tabular data; the only difficult part is choosing an appropriate number of layers and number of neurons per hidden layer (assuming, of course, that the usual training tricks and rules are known and applied). The only reliable way is to run many experiments and find the best settings, but there are rules of thumb that help narrow the search space (a minimal code sketch follows the lists below).

Rules of Thumb for number of hidden layers:

  1. In the early papers that introduced the MLP and studied its approximation ability, the assumption is that one hidden layer with enough neurons already gives a “universal approximator” (Hornik, Stinchcombe and White 1989; Hornik 1993; Bishop 1995, p. 130; Ripley 1996, pp. 173–180). But this result gives no practical guidance on how many neurons are needed or how easy such a network is to train.
  2. If the loss surface tends to have hills, it may be better to use two hidden layers, since a network with only one hidden layer can get stuck and converge slowly (Chester, D.L. (1990), “Why Two Hidden Layers are Better than One”).
  3. And what we should concretely understand is that additional layers can learn more complex decision regions, so if the given task requires them, try more layers.

Rules of Thumb for number of neurons in the hidden layer:

  1. The number of neurons should lie somewhere between the input size and the output size (Blum, A. (1992), Neural Networks in C++, NY: Wiley).
  2. (Number of inputs + outputs) * (2/3)
  3. The number of hidden neurons should be less than twice the size of the input layer (Swingler, K. (1996), Applying Neural Networks: A Practical Guide, London: Academic Press).
  4. Based on my experience, I choose the size of the hidden layer depending on the input size. I never make it smaller than the input size, because shrinking the latent space can lose information, and at the same time I do not make it too large, to avoid overfitting.
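To make this concrete, here is a minimal PyTorch sketch of an MLP with one embedding layer per categorical feature; the class name, embedding size and hidden sizes are illustrative choices, not a recipe. Categorical columns are assumed to be integer-encoded (0 … cardinality - 1):

```python
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    # cat_cardinalities: number of categories per categorical column
    # n_cont: number of continuous columns
    def __init__(self, cat_cardinalities, n_cont, emb_dim=8, hidden=128, n_out=1):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Embedding(c, emb_dim) for c in cat_cardinalities])
        n_in = emb_dim * len(cat_cardinalities) + n_cont
        self.net = nn.Sequential(
            nn.Linear(n_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_out),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_cat) integer codes; x_cont: (batch, n_cont) floats
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        return self.net(torch.cat(embs + [x_cont], dim=1))
```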

Factorization Machines

FMs were proposed as an improvement over Support Vector Machines (SVMs). They capture first- and second-order feature interactions via a factorized parametrization, which lets them handle sparse data well while still accepting any real-valued feature vector.

Model equation for a factorization machine of degree d = 2:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j

where:
  • w_0 is the global bias.
  • w_i models the strength of the i-th variable.
  • w_i,j := <v_i , v_j> models the interaction between the i-th and j-th variable. Instead of using its own model parameter w_i,j ∈ R for each interaction, the FM models the interaction by factorizing it.

Complexity

Factorization Machines can be evaluated in linear time, O(kn). The paper shows how to rewrite the pairwise interaction term so that it no longer requires iterating over all pairs of features.
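For reference, this is the reformulation from the paper (Rendle, 2010) that removes the explicit loop over pairs; each factor dimension f is handled independently, which is what brings the cost down to O(kn):

```latex
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
  = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]
```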

FMs can be viewed as a neural network: the model is just a differentiable function trained by gradient descent. We are used to seeing neural networks described in terms of layers and filters, but a neural network is ultimately a composition of functions, each neuron being a function, and the FM model shows this as plainly as possible.
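Seen that way, a degree-2 FM for dense inputs is only a few lines of PyTorch. This is a minimal sketch (the class name and factor size k are illustrative), using the linear-time reformulation above for the pairwise term:

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    # minimal degree-2 FM for dense inputs; k is the size of the factor vectors v_i
    def __init__(self, n_features, k=8):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))                     # global bias w_0
        self.w = nn.Linear(n_features, 1, bias=False)              # first-order weights w_i
        self.v = nn.Parameter(torch.randn(n_features, k) * 0.01)   # factor matrix V

    def forward(self, x):
        # pairwise term in O(kn): 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
        sum_sq = (x @ self.v) ** 2                    # (batch, k)
        sq_sum = (x ** 2) @ (self.v ** 2)             # (batch, k)
        pairwise = 0.5 * (sum_sq - sq_sum).sum(dim=1, keepdim=True)
        return self.w0 + self.w(x) + pairwise         # (batch, 1)
```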

Deep NN + FMs

The architecture of DeepFM

To capture both low- and high-order feature interactions, Factorization Machines and a Multi-Layer Perceptron can be combined and trained jointly in an end-to-end manner.

The DeepFM paper shows this approach applied to building a recommendation system.

They propose to create two branches:

  1. FMs to learn first- and second-order feature interactions
  2. MLP to learn high-order feature interactions

The outputs of the two branches are then combined for the final prediction (in the DeepFM paper they are summed and passed through a sigmoid).

This method has shown good results in my practice too, and I reuse some of the intuitions from this approach in other tasks.
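As a rough illustration of the two-branch idea, the FM sketch above can be paired with a small MLP. This is not the paper's exact implementation (the real DeepFM shares feature embeddings between the two branches); it only shows the overall shape:

```python
import torch
import torch.nn as nn

class DeepFMSketch(nn.Module):
    # illustrative two-branch model: FM branch + MLP branch over the same input;
    # reuses the FactorizationMachine sketch from the previous section
    def __init__(self, n_features, k=8, hidden=64):
        super().__init__()
        self.fm = FactorizationMachine(n_features, k)
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # the DeepFM paper sums the branch outputs and applies a sigmoid
        return torch.sigmoid(self.fm(x) + self.mlp(x))
```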

TabNet

Another network that can give good results on tabular data is TabNet from Google.

They propose sequential attention that selects features according to their importance at each decision step. This mechanism also provides the interpretability that other neural network architectures lack.

Feature Transformer

They stack fully connected (linear) layers and use the Gated Linear Unit (GLU) instead of the ubiquitous ReLU as the activation function. For a good explanation of GLU, please read the paper: https://arxiv.org/pdf/1612.08083.pdf
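For intuition, a single GLU block can be sketched as follows. This is a simplification: the real Feature Transformer stacks several such blocks, shares some of them across decision steps, and uses ghost batch normalization:

```python
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    # one gated linear unit block: a linear layer whose output is split in two,
    # with one half gating the other through a sigmoid
    def __init__(self, n_in, n_out):
        super().__init__()
        self.fc = nn.Linear(n_in, 2 * n_out)
        self.bn = nn.BatchNorm1d(2 * n_out)   # stand-in for ghost batch norm

    def forward(self, x):
        h = self.bn(self.fc(x))
        a, b = h.chunk(2, dim=-1)             # split into value and gate halves
        return a * torch.sigmoid(b)           # GLU(a, b) = a * sigmoid(b)
```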

Attentive Transformer

This attention mechanism has a tricky architecture that is not obvious at first sight. They create trainable weights that learn the importance and contribution of each feature, and they use Sparsemax to weight the features. Sparsemax is analogous to Softmax; the main difference is that it can produce a sparse distribution, assigning exactly zero probability to some features. The paper describing this function covers it in detail and shows its very interesting properties. At the final stage they add Prior Scales, a coefficient that takes into account how frequently each feature was used at previous decision steps.
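Sparsemax itself is short to implement. Following Martins & Astudillo (2016), a minimal version for 2-D inputs could look like this:

```python
import torch

def sparsemax(z):
    # z: (batch, n) logits; returns probabilities that sum to 1 per row,
    # with exact zeros for low-scoring entries (Martins & Astudillo, 2016)
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    support = (1 + k * z_sorted) > cumsum         # which sorted entries stay in the support
    k_z = support.sum(dim=-1, keepdim=True)       # support size
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z  # threshold
    return torch.clamp(z - tau, min=0.0)

# e.g. sparsemax(torch.tensor([[2.0, 1.0, -1.0]])) -> tensor([[1., 0., 0.]])
```

In the Attentive Transformer, the logits are multiplied by the prior scales before Sparsemax, so features that were heavily used at earlier steps get progressively down-weighted.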

Split output

The main thing to understand is how they split and combine intermediate outputs into the final result.

After the features pass through the Feature Transformer, its output is split into two parts. The first part goes to the aggregation step, where it is successively added to the corresponding outputs of the other decision steps to form the final output tensor of the network. The second part goes to the Attentive Transformer, which creates the mask that will be used at the next decision step to filter features from the original input data.
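A heavily simplified, self-contained sketch of this decision-step loop is shown below, with plain linear layers standing in for the Feature and Attentive Transformers and softmax standing in for Sparsemax; all names and sizes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTabNet(nn.Module):
    # toy version of TabNet's decision-step loop
    def __init__(self, n_features, n_d=8, n_a=8, n_steps=3, gamma=1.3, n_out=1):
        super().__init__()
        self.n_d, self.n_steps, self.gamma = n_d, n_steps, gamma
        self.feat = nn.Linear(n_features, n_d + n_a)   # stand-in for the Feature Transformer
        self.att = nn.ModuleList(                      # stand-ins for the Attentive Transformers
            [nn.Linear(n_a, n_features) for _ in range(n_steps)])
        self.out = nn.Linear(n_d, n_out)

    def forward(self, x):
        prior = torch.ones_like(x)                     # prior scales start at 1
        mask = torch.ones_like(x)                      # first step sees all features
        agg = 0.0
        for step in range(self.n_steps):
            h = self.feat(mask * x)                    # process the masked features
            d, a = h[:, :self.n_d], h[:, self.n_d:]    # split: decision part / attention part
            agg = agg + F.relu(d)                      # aggregate decision outputs across steps
            logits = self.att[step](a) * prior         # prior scales weight the attention logits
            mask = torch.softmax(logits, dim=-1)       # the paper uses Sparsemax here
            prior = prior * (self.gamma - mask)        # down-weight features already used
        return self.out(agg)
```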

Mask

Via the attention masks they can obtain feature importance, similar to what boosting algorithms provide, but here it is an instance-wise mask. The paper does not go into detail about how to build a global feature-importance measure that does not depend on each given sample; in practice, the instance-wise masks can be aggregated over a dataset. The mask is the output of the Attentive Transformer.

As we can see, this architecture is inspired by NLP network architectures, but the next method is even closer to NLP models.

TabTransformer

This network uses the Transformer architecture to process categorical features and an MLP for the final prediction. It also makes the embeddings more interpretable: highly correlated features end up close to each other in Euclidean space. In addition, TabTransformer is robust to noisy and missing data.

The architecture of TabTransformer.

Overall, the following steps are required:

  1. For each categorical feature, create an embedding that maps it into a Euclidean space of dimension D.
  2. Pass the parametric embeddings through the Transformer layers. As the authors write: “Each parametric embedding is transformed into contextual embedding when outputted from the top layer Transformer, through successive aggregation of context from other embeddings”.
  3. The contextual embeddings coming out of the Transformer layers are concatenated with the continuous features.
  4. The concatenated features are passed to an MLP to get the final prediction.

A Transformer layer (Vaswani et al. 2017) contains a multi-head self-attention layer followed by a position-wise feed-forward layer, with element-wise addition and layer normalization after each of them. There is no positional encoding, since in tabular data the order of features doesn't matter.
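Putting the four steps together, a minimal TabTransformer-style model can be sketched with PyTorch's built-in Transformer encoder; the class name, dimensions and MLP head below are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    # cat_cardinalities: number of categories per categorical column
    def __init__(self, cat_cardinalities, n_cont, d_model=32, n_heads=8,
                 n_layers=6, n_out=1):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Embedding(c, d_model) for c in cat_cardinalities])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        # no positional encoding: the order of tabular features doesn't matter
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(
            nn.Linear(d_model * len(cat_cardinalities) + n_cont, 64),
            nn.ReLU(),
            nn.Linear(64, n_out),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_cat) integer codes; x_cont: (batch, n_cont) floats
        tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)], dim=1)
        ctx = self.transformer(tokens)       # contextual embeddings, (batch, n_cat, d_model)
        flat = ctx.flatten(1)                # (batch, n_cat * d_model)
        return self.mlp(torch.cat([flat, x_cont], dim=1))
```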

In the paper they show comparisons with MLP, TabNet and GBDT; on almost all datasets TabTransformer outperforms them.

I have implemented all of them except the last one, TabTransformer, so if you have any questions, feel free to ask.

There is also another paper, not covered here, that takes an extraordinary approach to tabular data using a CNN: https://arxiv.org/pdf/1903.06246.pdf
