Vantage Point Tree: does careful selection of vantage point make sense?

Firiuza
5 min readMar 25, 2018

--

Vantage Point Tree is very old algorithm for making fast search within tree. Off course nowadays we have so many tools for working with big data, but sometimes we face with problem where big data storage cannot provide us with good performance or another solution is cheaper.

I used this algorithm for find k nearest embeddings that represent image itself. Practically this algorithm showed good result and persuaded that with little modification in performance it can work in production.

I want to consider different approaches for selecting vantage point and show that simple approach is working very well comparing with proposed in paper that more time consuming.

Briefly describe this algorithm: in paper they talk about tree that we can use for fast searching. Every node of tree is so called vantage point that split the metric space into two subspaces — left and right. The first subspace contains closest points to vantage point and the second one is the most distant. For every point we construct subtree that also divide space into two subspaces. At the end we get the tree where we can find the nearest points for the given one.

They propose selection of vantage point from random sample that will spread the space better. The bigger second moment defines the best spread. What does it mean?

  • Firstly, they calculate distance between points for understanding their positions in metric space.
  • Then they do not work with mean distance, they find the median of distances. Median allows to create balance tree: if distance from vantage point to another one is less then median it goes to the left leaf otherwise to the right.
  • And the main thing that second moment defines the better spread. Second moment is a variance — hence we have to find point that has the maximum deviation from the median. Theoretically, we get a vantage point that has many closest points around itself and at the same time has points that are so far from it. Then we can easy split the space into left and right subspaces.

In paper they also propose heuristic search algorithm. But I want to consider different approaches that use people for selecting vantage point. Also I tested search within constructed tree for finding tree nearest neighbors.

For testing I used 50 embeddings (dimension is 128) and euclidean distance for calculating distance between embeddings.

I used T-SNE for visualization, hence all cases will have a little bit different view. But for what I want to pay attention:

1. Building tree time.

2. Finding 3 nearest neighbors time and the distance from given point to neighbors.

1. Select from random sample of points.

This approach has mathematical justification that described in paper.

a) For the first case every iteration vantage point was selected from all points that weren’t chosen as vantage points yet.

Off course time for building tree will be the greatest.

Tree building time: 0.224 sec

Finding 3 nearest time: 0.001 sec

Distances from given point to neighbors:

0.8029372528422773

0.8701960738954612

0.888163262947272

As I already said, tree is balanced. So left and right branch contains the same number nodes.

As we see selected vantage point really has the farthest point and more the closest ones.

b) Select from random sample (size of random sample if 30 points or the size of remaining points)

Tree building time: 0.150 sec

Finding 3 nearest time: 0.002 sec

Distances from given point to neighbors:

0.8029372528422773

0.8701960738954612

0.888163262947272

c) Select from random sample (5 points)

Tree building time: 0.044 sec

Finding 3 nearest time: 0.001 sec

Distances from given point to neighbors:

0.8029372528422773

0.8701960738954612

0.888163262947272

2. Select always the first point from remaining ones as vantage point.

Tree building time: 0.010 sec

Finding 3 nearest time: 0.001 sec

Distances from given point to neighbors:

0.8029372528422773

0.8701960738954612

0.888163262947272

3. Select random vantage point.

Tree building time: 0.008 sec

Finding 3 nearest time: 0.001 sec

Distances from given point to neighbors:

0.8029372528422773

0.8701960738954612

0.888163262947272

Conclusion

The goal of this tree is providing the fast and precise search. For getting this result we need accurate tree that will correctly split the metric space.

As we see we always get the same nearest neighbors for the given point. And time execution is also the same. For this toy example I can conclude that we can work with random selecting: a) building time is the least b) give the same result as others.

At practice I had about 2,5 million embeddings and building tree time is very important, because I can’t spend time for building big tree too much. And search within this tree was about 8 sec for finding 4 nearest neighbors. The neighbors were really closest. Searching time also can be improved with parallelizing or with another approach.

Sometimes such accurate approach doesn’t give us good performance and better result, just because maybe in this algorithm selecting of vantage point wasn’t the main idea — the main idea is to split space into left and right subspaces. That’s why we can neglect approach that is described in paper and work with easiest one.

https://github.com/Firyuza

https://twitter.com/Firiuza1

--

--

No responses yet