ROI pooling vs. ROI align

Firiuza
4 min read · Jan 8, 2020

Computer vision has many interesting problems, and two of them are object detection and image segmentation.

Object detection predicts a bounding box for each object whose class is represented in the dataset, together with a score (the confidence in the object's class). Segmentation predicts the boundaries of the object as a mask.

I want to explain how to extract a Region of Interest (ROI) with the pooling and align operations, and what the difference between the two is.

To begin, let’s recall the whole pipeline for object detection and segmentation tasks:

  1. Pass the image into a backbone network (e.g. ResNet or VGG).
  2. Extract a feature map (or several feature maps from a Feature Pyramid Network).
  3. Pass the feature map to the Region Proposal Network (RPN).
  4. Using the proposals from the RPN, take each Region of Interest (ROI) and return a fixed-size feature map via the pooling or align operation.
  5. Pass the fixed-size feature map from ROI pooling (or align) into R-CNN to get bounding box predictions and class scores.
  6. Pass the fixed-size feature map from ROI pooling (or align) into a CNN to get the segmentation mask.

Let’s look at step 4.

ROI pooling

When the RPN returns region proposals, each proposal is a set of offsets relative to an anchor. Applying these offsets to the anchors gives the proposed bounding boxes, whose coordinates are expressed in the original image size.

Next we have to take the region of interest by cropping the predicted bounding box out of the feature map. But how do we do that if the box coordinates are based on the original image size?

The feature map is k times smaller than the original image (it was downsampled by the convolutions of the backbone). This means each coordinate can also be divided by k.

  1. First, ROI pooling divides each coordinate by k and keeps only the integer part: [x / k].

After that we have new coordinates relative to the feature map size. The part of the feature map that corresponds to the proposed object is cropped using these new coordinates.

  2. Quantization: to get a fixed-size output from ROI pooling, the cropped part is divided into an n x n grid of bins. The size of the cropped part is generally not divisible by n, so the bin boundaries are rounded to whole cells as well. The maximum or average value is then taken from each bin.

Input feature map for ROI pooling.
The region taken with the proposal (updated coordinates) is divided into a fixed-size grid.
Output of ROI pooling based on the max pool operation.
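
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code of any particular framework: the function name `roi_pool`, the 8 x 8 toy feature map, the stride of 16 and the box coordinates are all made up for the example, and the bin edges use one common convention (floor for the start of a bin, ceil for its end).

```python
import math

def roi_pool(feature_map, box, stride, output_size):
    """Max-pool one ROI of a 2D feature map into an output_size x output_size grid.

    feature_map: list of rows (H x W); box: (x1, y1, x2, y2) in image pixels;
    stride: the downsampling factor k. All names here are hypothetical.
    """
    h, w = len(feature_map), len(feature_map[0])
    # Step 1: divide by k and keep the integer part.
    x1, y1, x2, y2 = (math.floor(c / stride) for c in box)
    roi_h = max(y2 - y1 + 1, 1)  # assumption: the end cell is inclusive
    roi_w = max(x2 - x1 + 1, 1)
    pooled = []
    for i in range(output_size):
        # Step 2: quantized bin edges, clipped to the feature map.
        r0 = min(max(y1 + math.floor(i * roi_h / output_size), 0), h)
        r1 = min(max(y1 + math.ceil((i + 1) * roi_h / output_size), 0), h)
        row = []
        for j in range(output_size):
            c0 = min(max(x1 + math.floor(j * roi_w / output_size), 0), w)
            c1 = min(max(x1 + math.ceil((j + 1) * roi_w / output_size), 0), w)
            cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(max(cells) if cells else 0)  # an empty bin yields 0
        pooled.append(row)
    return pooled

# Toy 8 x 8 feature map where the cell at (row, col) holds row * 8 + col.
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled = roi_pool(feature_map, box=(20, 35, 100, 110), stride=16, output_size=2)
print(pooled)  # [[35, 38], [51, 54]]
```

Here the box (20, 35, 100, 110) becomes cells (1, 2) to (6, 6) after flooring, so the fractional parts of 1.25, 2.1875, 6.25 and 6.875 are simply thrown away.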

ROI align

ROI align has the same goal as ROI pooling: take the Region of Interest given by a proposal. But the two steps are slightly different.

  1. ROI align divides each coordinate by k, giving x / k, and does NOT take the integer part.
ROI align operation (Source: https://arxiv.org/pdf/1703.06870.pdf)

This means the region no longer lines up with whole cells of the feature map, so there is no definite cell to read a value from, because the new coordinates are float values.

  2. Nevertheless, the cropped part is still divided into a grid. To get concrete values for each bin, ROI align picks 4 regularly spaced sampling points inside the bin and computes the value at each of them by bilinear interpolation (as shown in the picture above). The maximum or average of these 4 values is then taken for each bin.

The input feature map is divided into a grid using the float coordinate values (proposals).
How to interpolate each point (red point within the grid): compute the area of each piece of the original cell that the red point divides. Example for one of them.
Bilinear interpolation for the red point from the (1, 1) cell of the red grid.
Output fixed-size feature map after max pooling.
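
The same toy setup can be used to sketch ROI align. Again this is a minimal illustration rather than a reference implementation: the names `bilinear` and `roi_align`, the `samples` and `reduce` parameters, and the box are hypothetical, and the sketch assumes that feature values sit at integer coordinates (real implementations differ in their pixel-center conventions).

```python
import math

def bilinear(feature_map, y, x):
    """Bilinearly interpolate a 2D feature map at the float location (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp so that the four neighbouring cells stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Each neighbour is weighted by the area of the opposite sub-rectangle.
    return (feature_map[y0][x0] * (1 - dy) * (1 - dx)
            + feature_map[y0][x1] * (1 - dy) * dx
            + feature_map[y1][x0] * dy * (1 - dx)
            + feature_map[y1][x1] * dy * dx)

def roi_align(feature_map, box, stride, output_size, samples=2, reduce=max):
    """Pool one ROI without rounding; samples=2 gives 4 points per bin."""
    # Step 1: divide by k and keep the float values.
    x1, y1, x2, y2 = (c / stride for c in box)
    bin_h = (y2 - y1) / output_size
    bin_w = (x2 - x1) / output_size
    out = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            values = []
            # Step 2: regularly spaced sampling points inside the bin.
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature_map, y, x))
            row.append(reduce(values))  # max by default; pass a mean to average
        out.append(row)
    return out

# Toy 8 x 8 feature map where the cell at (row, col) holds row * 8 + col.
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]
aligned = roi_align(feature_map, box=(20, 35, 100, 110), stride=16, output_size=2)
print(aligned)
```

Passing `reduce=statistics.mean` instead of the default `max` would average the 4 sampled values in each bin.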

That’s how the ROI pooling and ROI align operations return a new feature map of fixed size.

But why do we need ROI align? As you can see, the pooling operation is coarse. When we predict bounding box coordinates, a small offset of about 2 pixels is not a big problem, because we still detect the object on the image correctly. But for the segmentation task, unnecessary offsets are a big problem when predicting the mask. The Mask R-CNN paper shows that accuracy is noticeably higher with ROI align than with ROI pooling.
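
To get a feeling for the size of these offsets, here is a small sketch of how far the first rounding step of ROI pooling can move a coordinate, measured in original image pixels. The helper `quantization_shift`, the box and the stride of 16 are assumptions chosen for illustration, and the second quantization (the bin edges) is not counted.

```python
def quantization_shift(coord, stride):
    """Shift, in original-image pixels, caused by flooring coord / stride."""
    quantized = (coord // stride) * stride  # start of the floored cell, in image pixels
    return coord - quantized

box = (20, 35, 100, 110)  # hypothetical box in image pixels
stride = 16               # hypothetical downsampling factor k
shifts = [quantization_shift(c, stride) for c in box]
print(shifts)  # [4, 3, 4, 14]
```

With a stride of k, a single coordinate can move by up to k - 1 pixels, which matters much more for a pixel-level mask than for a box.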
