<h1>A Dive into Geospatial Nearest Neighbor Search</h1> <p><em>Vibhu Agrawal, 2021-08-04</em></p> <p>Say you have a list of locations with their corresponding latitudes and longitudes, and you want to put this data to use. Maybe you are a food delivery service that wants to find all Korean restaurants in a 10 kilometer radius. Or maybe you are advertising a job opening in New York but you want to publish it to nearby cities as well. Or maybe you are a dating app that wants to calculate the distance between potential matches to an absurd degree of accuracy.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p> <p>While it is very tempting to use the latitude/longitude data as coordinates in a 2-D space and use the L2-norm to compute the “distance” between two locations, this approach fails for three reasons:</p> <ol> <li><strong>The earth is not flat</strong>: While it is reasonable to consider the Earth to be flat for short distances, its curvature causes noticeable errors in any distance measurements over a few kilometers.</li> <li><strong>Equal distances in the geo-coordinate space may or may not be equal in the physical world</strong>: The distance covered by moving 1° of longitude along the equator is much larger than the distance covered by moving 1° of longitude near the North Pole, because the meridians converge towards the poles (<em>Figure 1</em>). This makes querying for nearest-neighbors unreliable, as some points that appear to be far apart in latitude/longitude coordinate space may actually be close together in real life.</li> <li><strong>Latitude/longitude are not continuous at the boundaries of the geo-coordinate space</strong>: Longitude values <em>jump</em> from +180° to -180° at the antimeridian (the 180° meridian, opposite the prime meridian), resulting in wrong distance calculations. 
Similar problems are faced at the North and South Poles where “stepping-over” the poles results in wildly different longitude values.</li> </ol> <p><mark style="background-color: lightpink">In this article, we look at a method to compute the distance between two points on the surface of the Earth and extend it to query a geospatial dataset for nearest neighbors.</mark></p> <p> </p> <p><img src="/blog/assets/images/mercator_projection.png" alt="Mercator projection" /> <em>Figure 1. Mercator Projection is a cylindrical map projection of the Earth. Because the spherical Earth is projected onto a cylinder to get a “flat” coordinate system, the regions near the poles appear much larger than they actually are. Nonetheless, this scheme has been used extensively in the past (for example, Google Maps ditched Mercator projection view for a globe view only in 2018). The scale at which this phenomenon is observed can be interactively experienced at <a href="https://thetruesize.com/" target="_blank">thetruesize.com</a>. [<a href="https://gisgeography.com/cylindrical-projection/">source</a>]</em></p> <p> </p> <h2 id="a-better-distance-function">A better distance function</h2> <blockquote> <p><a id="#note_1" style="color: inherit;"><strong>Note 1:</strong></a> A key assumption from here onwards is that the Earth is a sphere. This is not strictly true: the Earth’s shape is closer to an ellipsoid, with an equatorial radius of ≈6378 km and a polar radius of ≈6357 km. Because the difference between the two radii is small, the resulting error is generally small enough to be safely ignored.</p> </blockquote> <p>The shortest distance between any two points on a sphere is the distance between the two points along the great circle passing through both the points. A great circle is a circle drawn on a sphere with the same radius as the sphere, and centred at the centre of the sphere. 
Of the infinitely many great circles possible, we are now concerned with the one that passes through the two points in question.</p> <p>The solution to finding the distance between two points along the Earth’s surface lies in using the latitude/longitude values to compute the central angle between them. The central angle $$\theta$$ is defined as:</p> $\begin{equation} \theta = \frac{d}{r} \tag{1}\label{eq:1} \end{equation}$ <p>where $$d$$ is the distance between the two points along the great circle, and $$r$$ is the radius of the sphere. Because we already know the mean radius of the Earth (6371008.7714 m<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>), in order to find the value of $$d$$, we need to compute $$\theta$$.</p> <p> </p> <p>Let’s put this knowledge to use.</p> <p>Let points A and B be two points on the surface of the Earth with latitudes $$\phi_1$$ and $$\phi_2$$ and longitudes $$\lambda_1$$ and $$\lambda_2$$ respectively, and let C be the North Pole. a, b, and c are the lengths of the arcs created by BC, AC and AB respectively on the surface of the sphere. (Figure 2)</p> <p><img src="/blog/assets/images/haversine_derivation_2.png" alt="Haversine derivation 2" /> <em>Figure 2. A, B, and C are three points on the Earth’s surface, and O is the center of the Earth.</em></p> <p>If we consider the sphere in <em>Figure 2</em> to be of unit radius, then using $$eq. 
\eqref{eq:1}$$, the $$\angle AOB$$ is the same as $$c$$, $$\angle AOC$$ is the same as $$b$$, and $$\angle BOC$$ is the same as $$a$$.</p> <p>Now, we have $$c = \theta$$, $$b = \frac{\pi}{2} - \phi_1$$, $$a = \frac{\pi}{2} - \phi_2$$, and $$C = \lambda_2 - \lambda_1$$.</p> <p>Using the <a href="https://en.wikipedia.org/wiki/Spherical_law_of_cosines">spherical cosine rule</a>:</p> $\begin{equation} \cos c = \cos b \cos a + \sin b \sin a \cos C \end{equation}$ <p>we get:</p> <div style="overflow-x: scroll"> $$\cos \theta = \cos (\frac{\pi}{2} - \phi_1) \cos (\frac{\pi}{2} - \phi_2) + \sin (\frac{\pi}{2} - \phi_1) \sin (\frac{\pi}{2} - \phi_2) \cos( \lambda_2 - \lambda_1 )$$ </div> <p> </p> <p>Simplifying with $$\cos(\frac{\pi}{2} - x) = \sin x$$ and $$\sin(\frac{\pi}{2} - x) = \cos x$$, replacing $$\theta$$ with $$d$$ (the two are equal on a unit sphere), and using $$\sin^2(\frac{x}{2}) = \frac{1}{2} (1 - \cos x)$$, we get</p> <div style="overflow-x: scroll"> \begin{align*} \sin^2\bigl(\tfrac{d}{2}\bigr) &amp;=\tfrac{1}{2}\bigl(1 - \cos \phi_1 \cos \phi_2 \cos( \lambda_2 - \lambda_1 ) - \sin \phi_1 \sin \phi_2\bigr) \\ &amp;=\tfrac{1}{2}\bigl(1 - \cos\phi_1\cos\phi_2\bigl(1-2\sin^2\bigl(\frac{\lambda_2-\lambda_1}{2}\bigr)\bigr) -\sin\phi_1\sin\phi_2\bigr)\\ &amp;=\tfrac{1}{2}\bigl(1 - \cos\phi_1\cos\phi_2 + 2\cos\phi_1\cos\phi_2\sin^2\bigl(\frac{\lambda_2-\lambda_1}{2}\bigr) -\sin\phi_1\sin\phi_2\bigr)\\ &amp;=\tfrac{1}{2}\bigl(1 - \cos(\phi_2-\phi_1) + 2\cos\phi_1\cos\phi_2\sin^2\bigl(\frac{\lambda_2-\lambda_1}{2}\bigr)\bigr)\\ &amp;=\tfrac{1}{2}\bigl(2\sin^2\bigl(\frac{\phi_2-\phi_1}{2}\bigr) + 2\cos\phi_1\cos\phi_2\sin^2\bigl(\frac{\lambda_2-\lambda_1}{2}\bigr)\bigr)\\ &amp;=\sin^2\left(\frac{\phi_2-\phi_1}{2}\right) + \cos\phi_1\cos\phi_2\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right) \end{align*} </div> <p> </p> <p>For a sphere with radius $$R$$, the above equation changes to</p> <div style="overflow-x: scroll"> $$\begin{equation} \sin^2\bigl(\tfrac{d}{2R}\bigr)=\sin^2\left(\frac{\phi_2-\phi_1}{2}\right) + \cos\phi_1\cos\phi_2\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right) 
\tag{2}\label{eq:2} \end{equation}$$ </div> <p> </p> <p>This formula has been around for a long time, but it is cumbersome to evaluate by hand. It requires multiple trigonometric lookups, calculating squares, and even a square root. To simplify the calculations, we use a lesser-known trigonometric function, the <em>haversine</em>, which is defined as:</p> <div style="overflow-x: scroll"> $$\mathop{\mathrm{haversin}}(x) = \sin^2{\left(\frac{x}{2}\right)}$$ </div> <p> </p> <p>Using this definition in $$eq. \eqref{eq:2}$$, we get what is called the <strong>Haversine formula</strong>:</p> <div style="overflow-x: scroll"> $$\begin{equation} \mathop{\mathrm{haversin}} \left(\frac{d}{R}\right) = \mathop{\mathrm{haversin}} \left(\phi_2-\phi_1\right) + \cos\phi_1\cos\phi_2\mathop{\mathrm{haversin}}\left(\lambda_2-\lambda_1\right)\tag{3}\label{eq:3} \end{equation}$$ </div> <p>In the past few hundred years, the Haversine formula has been used extensively by sailors in planning their voyages.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> Instead of the multiple difficult operations required by $$eq.\eqref{eq:2}$$, this formula makes the calculation simple by requiring only a few lookups in a haversine table.</p> <p> </p> <p><strong>Implementation in Python</strong></p> <p>Let’s compute some distances! We will be using an openly available dataset of US Zip-codes and their corresponding latitude and longitude values. 
Also, we will be using $$eq.\eqref{eq:2}$$ instead of $$eq.\eqref{eq:3}$$ because we aren’t nineteenth century sailors, and also because computing sin, cos, squares, and square roots has become trivial.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="n">R</span> <span class="o">=</span> <span class="mf">6371008.7714</span> <span class="k">def</span> <span class="nf">haversine_distance</span><span class="p">(</span><span class="n">lat_1</span><span class="p">,</span> <span class="n">lon_1</span><span class="p">,</span> <span class="n">lat_2</span><span class="p">,</span> <span class="n">lon_2</span><span class="p">):</span> <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">R</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arcsin</span><span class="p">((</span><span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">((</span><span class="n">lat_2</span> <span class="o">-</span> <span class="n">lat_1</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> \ <span class="n">np</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">lat_1</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">lat_2</span><span class="p">)</span> <span class="o">*</span> \ <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">((</span><span 
class="n">lon_2</span> <span class="o">-</span> <span class="n">lon_1</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span><span class="p">))</span> <span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span> <span class="c1"># list of US zip codes along with their respective latitudes and longitudes </span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span> <span class="s">"https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/download/?format=csv&amp;timezone=Asia/Kolkata&amp;lang=en&amp;use_labels_for_header=true&amp;csv_separator=%3B"</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">";"</span><span class="p">)</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Timezone"</span><span class="p">,</span> <span class="s">"Daylight savings time flag"</span><span class="p">,</span> <span class="s">"geopoint"</span><span class="p">])</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"Zip"</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span> <span class="c1"># convert degrees to radians </span> <span class="n">df</span><span class="p">.</span><span class="n">Latitude</span> <span class="o">=</span> <span class="n">df</span><span 
class="p">.</span><span class="n">Latitude</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">radians</span><span class="p">)</span> <span class="n">df</span><span class="p">.</span><span class="n">Longitude</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Longitude</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">radians</span><span class="p">)</span> <span class="c1"># compute distance between 68460 (Waco, NE) and 80741 (Merino, CO) </span> <span class="n">lat_1</span><span class="p">,</span> <span class="n">lon_1</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">68460</span><span class="p">][[</span><span class="s">"Latitude"</span><span class="p">,</span> <span class="s">"Longitude"</span><span class="p">]]</span> <span class="n">lat_2</span><span class="p">,</span> <span class="n">lon_2</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">80741</span><span class="p">][[</span><span class="s">"Latitude"</span><span class="p">,</span> <span class="s">"Longitude"</span><span class="p">]]</span> <span class="n">distance</span> <span class="o">=</span> <span class="n">haversine_distance</span><span class="p">(</span><span class="n">lat_1</span><span class="p">,</span> <span class="n">lon_1</span><span class="p">,</span> <span class="n">lat_2</span><span class="p">,</span> <span class="n">lon_2</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The distance between 68460 (Waco, NE) and 80741 (Merino, CO) is </span><span class="si">{</span><span 
class="n">distance</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> km."</span><span class="p">)</span> <span class="c1"># compute distance between 50049 (Chariton, IA) and 51063 (Whiting, IA) </span> <span class="n">lat_1</span><span class="p">,</span> <span class="n">lon_1</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">50049</span><span class="p">][[</span><span class="s">"Latitude"</span><span class="p">,</span> <span class="s">"Longitude"</span><span class="p">]]</span> <span class="n">lat_2</span><span class="p">,</span> <span class="n">lon_2</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">51063</span><span class="p">][[</span><span class="s">"Latitude"</span><span class="p">,</span> <span class="s">"Longitude"</span><span class="p">]]</span> <span class="n">distance</span> <span class="o">=</span> <span class="n">haversine_distance</span><span class="p">(</span><span class="n">lat_1</span><span class="p">,</span> <span class="n">lon_1</span><span class="p">,</span> <span class="n">lat_2</span><span class="p">,</span> <span class="n">lon_2</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The distance between 50049 (Chariton, IA) and 51063 (Whiting, IA) is </span><span class="si">{</span><span class="n">distance</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> km."</span><span class="p">)</span></code></pre></figure> <p>The script produces the following output:</p> <div class="language-plaintext highlighter-rouge"><div 
class="highlight"><pre class="highlight"><code> City State Latitude Longitude Zip 38732 Cleveland MS 33.749149 -90.713290 47872 Rockville IN 39.758142 -87.175400 50049 Chariton IA 41.028910 -93.298570 48463 Otisville MI 43.167457 -83.525420 51063 Whiting IA 42.137272 -96.166480 ... ... ... ... ... 68460 Waco NE 40.897974 -97.450720 28731 Flat Rock NC 35.270682 -82.415150 74362 Pryor OK 36.292495 -95.222792 37049 Cross Plains TN 36.548569 -86.679070 80741 Merino CO 40.508131 -103.418150 [43191 rows x 4 columns] The distance between 68460 (Waco, NE) and 80741 (Merino, CO) is 504.80 km. The distance between 50049 (Chariton, IA) and 51063 (Whiting, IA) is 268.47 km. </code></pre></div></div> <p>We can compare our results against the results obtained from Google Maps:</p> <table> <tbody> <tr> <td><img src="/blog/assets/images/waco_to_merino.png" alt="waco_to_merino" /><em>Distance from Waco, NE to Merino, CO using Google Maps</em></td> <td><img src="/blog/assets/images/chariton_to_whiting.png" alt="chariton_to_whiting" /><em>Distance from Chariton, IA to Whiting, IA using Google Maps</em></td> </tr> </tbody> </table> <h2 id="the-query-mechanism">The query mechanism</h2> <p>Now that we have a reliable distance metric in place, let’s shift our focus to querying our data for nearest-neighbors. We are primarily interested in two kinds of queries:</p> <ol> <li>finding $$k$$ nearest neighbors (kNN) of a given point, and</li> <li>finding all neighbors in radius $$r$$ around a given point</li> </ol> <p> </p> <h3 style="margin-block-end:0.33em"> The brute-force method </h3> <p>The most obvious way to perform queries of these types is the brute force approach. For any given point, we simply iterate over all other points in our dataset and compute each point’s distance from the given query point. 
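</p> <p>The brute-force approach described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a tuned implementation: the helper names <code>brute_force_knn</code> and <code>brute_force_radius</code> and the toy coordinates are assumptions made for this example, and all angles are expected in radians.</p>

```python
import numpy as np

R = 6371008.7714  # mean radius of the Earth in metres (same value as the article)


def haversine_distance(lat_1, lon_1, lat_2, lon_2):
    """Great-circle distance in metres; every angle is in radians."""
    a = (np.sin((lat_2 - lat_1) / 2) ** 2
         + np.cos(lat_1) * np.cos(lat_2) * np.sin((lon_2 - lon_1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))


def brute_force_knn(points, query, k):
    """Indices and distances of the k points nearest to `query`, nearest first."""
    dist = haversine_distance(query[0], query[1], points[:, 0], points[:, 1])
    nearest = np.argsort(dist)[:k]  # sort all N distances, keep the k smallest
    return nearest, dist[nearest]


def brute_force_radius(points, query, r):
    """Indices and distances of every point within `r` metres of `query`."""
    dist = haversine_distance(query[0], query[1], points[:, 0], points[:, 1])
    inside = np.flatnonzero(r >= dist)  # keep points no farther than r
    return inside, dist[inside]


# Hypothetical toy data: (latitude, longitude) pairs in degrees, converted to radians.
points = np.radians([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [10.0, 10.0]])
query = np.radians([0.0, 0.4])

knn_idx, knn_dist = brute_force_knn(points, query, k=2)
rad_idx, rad_dist = brute_force_radius(points, query, r=100_000)
print(knn_idx, knn_dist / 1000)  # indices and distances (km) of the 2 nearest points
print(rad_idx, rad_dist / 1000)  # indices and distances (km) within 100 km
```

<p>Both helpers compute the distance from the query to every point in the dataset, which is exactly the cost that the tree-based structures discussed later try to avoid.</p> <p>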
Under the standard assumption that the query point may not be present in the dataset, the brute-force approach has a time complexity of $$O(D \cdot N)$$ per query, where $$N$$ is the number of points and $$D$$ is the dimensionality of the data (2 in our case). Iterating over every data point for each query is computationally expensive, especially when the dataset is large. To speed up queries, we now look into preprocessing algorithms that take advantage of the inherent structure of the data.</p> <!-- To find $$k$$ nearest neighbors, we can either sort all points based on their distance from the given point ($$O(N\log(N))$$ time complexity) and take the select the first $$k$$ points, or we can use an algorithm like [quickselect](https://en.wikipedia.org/wiki/Quickselect) to select $$k$$ smallest elements ($$O(N)$$ average time complexity, $$O(N^2)$$ worst-case time complexity). If we have to perform multiple queries, it is more efficient to precompute the distances between every two points at the cost of extra space ($$O(N^2)$$ space and time complexity) and sort the list of points for each point based on the distance. This enables look-up during runtime at $$O(\log(N))$$ time complexity. --> <p> </p> <h3 style="margin-block-end:0.33em">The (problematic) k-d Tree </h3> <p>A standard way to improve on the brute-force approach is to use a specialized tree data structure called a <a href="https://en.wikipedia.org/wiki/K-d_tree">k-d tree</a>. A k-d tree, or a k-dimensional tree, is a binary tree which recursively splits the space into two by building <a href="https://en.wikipedia.org/wiki/Hyperplane">hyperplanes</a>. A hyperplane is defined as a subspace whose dimension is one less than the dimension of its ambient space (a hyperplane in 2D is a 1D line, a hyperplane in 3D is a 2D plane, etc.). Each node is associated with an axis and represents a hyperplane that splits the subtree’s search space perpendicular to its associated axis. 
All points on one side of this hyperplane form the left subtree, and all points on the other side form the right subtree. All nodes at the same level in a k-d tree are associated with the same axis.</p> <p>The splitting axes are cycled through as we move down the levels of the tree. For example, in a 2-dimensional space, we would alternate between splitting perpendicular to the x-axis and to the y-axis at each level (Video 1)<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. In a 3-dimensional space, the root would split perpendicular to the x-axis (that is, with a plane perpendicular to both the x-y and x-z planes), the children perpendicular to the y-axis, the grandchildren perpendicular to the z-axis, the great-grandchildren again perpendicular to the x-axis, and so on.</p> <video id="KDTree" width="100%" frameborder="0" controls=""> <source src="/blog/assets/videos/KDTreeExampleVideo.mp4" type="video/mp4" /> <!-- <source src="/blog/assets/videos/KDTreeExampleVideo.webm" type="video/webm"/> --> </video> <p class="caption">Video 1. <b>k-d tree in 2-dimensions:</b> While constructing a k-d tree, we alternate between splitting perpendicular to the x-axis and perpendicular to the y-axis at each level of the tree. Each split is made at the median of all the points in the subtree along the associated axis of that level. </p> <p>Once the tree has been constructed, nearest-neighbor and range queries are quite simple to execute. For either type of query, we traverse the tree from the root until we reach the leaf node that would contain the query point. 
Then:</p> <ul> <li><strong>For k-nearest neighbor search:</strong> <ol> <li>All points in the leaf node’s search space become the candidate points for k-nearest neighbors and each point’s distance from the query point is calculated.</li> <li>If the maximum distance $$m$$ amongst the $$min(num(candidate\_points), k)$$ nearest candidate points exceeds the distance from the query point to the boundary of any neighboring subspace (or, equivalently, if a hypersphere of radius $$m$$ centered at the query point intersects any splitting hyperplane), then that neighboring subspace’s points are also added to the candidate points set.</li> <li>This is repeated until we have at least $$k$$ candidate points and the distance to the k-th nearest candidate point is less than the distance to the nearest unexplored neighboring subspace. Finally, the k-nearest candidate points are selected as the result.</li> </ol> </li> <li><strong>For range queries within a radius $$r$$ about a query point:</strong> <ol> <li>All points in this search space are added to the candidate points set, and points from all neighboring subspaces which are at a distance less than $$r$$ from the query point are also added to the candidate points set.</li> <li>All points that are at a distance less than $$r$$ from the query point are selected as the result.</li> </ol> </li> </ul> <p>The tree construction has a time complexity of $$O(N\log(N))$$. The kNN query has an average time complexity of $$O(\log(N))$$, and the range query has a worst-case time complexity of $$O\left(D \cdot N ^ {1 - \frac{1}{D}}\right)$$, where $$N$$ is the number of points and $$D$$ is the dimension of the data, which in our case is 2.</p> <p>As k-d trees can split the space only along the axes, they only work with <a href="https://en.wikipedia.org/wiki/Minkowski_distance">Minkowski distances</a> such as Manhattan distance (L1 norm) and Euclidean distance (L2 Norm). 
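</p> <p>The construction and search steps described above can be sketched for plain Euclidean points as follows. This is a simplified toy under stated assumptions, not a production k-d tree: it stores one point per node instead of buckets of points in the leaves, represents nodes as nested dicts, and answers only the single-nearest-neighbor query. The names <code>build_kdtree</code> and <code>nearest</code> are made up for this example.</p>

```python
import numpy as np


def build_kdtree(points, depth=0):
    """Recursively build a toy k-d tree as nested dicts (one point per node)."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]                # cycle through the axes level by level
    points = points[np.argsort(points[:, axis])]  # order the points along that axis
    median = len(points) // 2                     # split at the median
    return {
        "point": points[median],
        "axis": axis,
        "left": build_kdtree(points[:median], depth + 1),
        "right": build_kdtree(points[median + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query` (Euclidean)."""
    if node is None:
        return best
    dist = float(np.linalg.norm(node["point"] - query))
    if best is None or best[0] > dist:
        best = (dist, node["point"])
    # Signed distance from the query to this node's splitting hyperplane.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff > 0:
        near, far = node["right"], node["left"]
    else:
        near, far = node["left"], node["right"]
    best = nearest(near, query, best)
    # Visit the other side only if the ball of radius best[0] crosses the hyperplane.
    if best[0] > abs(diff):
        best = nearest(far, query, best)
    return best


# Hypothetical toy data in a flat 2-D plane (not latitude/longitude).
points = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 6.0],
                   [4.0, 7.0], [8.0, 1.0], [7.0, 2.0]])
tree = build_kdtree(points)
distance, point = nearest(tree, np.array([9.0, 2.0]))
print(distance, point)
```

<p>The pruning test compares the current best distance with the distance to an axis-aligned splitting plane, which is only a valid lower bound for distances that behave like the L1 or L2 norm along each axis.</p> <p>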
<strong>Unfortunately for us, the Haversine distance metric is not a Minkowski distance.</strong> We now build upon the basic concepts of the k-d tree, and look at a data structure that does not depend on the explicit coordinates of each point, but only on the distance metric.</p> <p> </p> <h3 style="margin-block-end:0.33em">The Ball Tree </h3> <p>A ball tree is similar to a k-d tree in that it too is a space partitioning tree data structure, but instead of using hyperplanes to partition the data it uses hyperspheres (or “balls”; a hypersphere in 2-dimensions is a circle, and a hypersphere in 3-dimensions is a sphere or a ball). A ball tree works with any distance metric $$d$$ that respects the triangle inequality:</p> $d(x, z) \leq d(x, y) + d(y, z)$ <p>At each node of the tree, we select the two points in the data that have the maximum pairwise distance between them and treat them as centers. Using the same distance metric, every other point in the subtree is assigned to the nearer of these two centers. Around each center, we create a hypersphere that spans the points associated with it. Each of the newly formed hyperspheres may now be further divided into two hyperspheres each, and this process continues recursively until a stopping condition is reached (a certain depth is reached, or each leaf node has fewer than a specified minimum number of data points, etc).</p> <video id="BallTree" width="100%" frameborder="0" controls=""> <source src="/blog/assets/videos/BallTreeExampleVideo.mp4" type="video/mp4" /> <!-- <source src="/blog/assets/videos/BallTreeExampleVideo.webm" type="video/webm"/> --> </video> <p class="caption">Video 2. <b>Ball tree in 2-dimensions:</b> The two points with the largest pair-wise distance are selected, and each is considered the center of a hypersphere. All points are assigned to the center that is closest to them, and two hyperspheres are created, each with a radius equal to the distance from its center to its farthest associated point. 
These hyperspheres are further broken down into two hyperspheres each, and this process continues recursively. </p> <blockquote> <p><a id="note_2" style="color: inherit"><strong>Note 2</strong></a>: Instead of considering the two farthest points to be the centers of the hyperspheres, we can also use the centroid of each cluster. This approach results in tighter spheres and more efficient queries, but it cannot be used with non-Minkowski distances, as computing a centroid under them is an ill-defined problem.</p> </blockquote> <p>Nearest neighbor search and range query are very similar to those on a k-d tree: we traverse the tree and reach the leaf node that would contain the query point. If the query point lies outside of both hyperspheres at any level, we assign it to the hypersphere whose center is at the least distance from the query point. Then,</p> <ul> <li><strong>For k-nearest neighbor search:</strong> <ol> <li>All points in the leaf hypersphere are now considered to be the candidate points.</li> <li>A minimum bounding hypersphere for the $$min(num(candidate\_points), k)$$ nearest candidate points is considered, and the points from all other hyperspheres that intersect with the minimum bounding hypersphere are added to the candidate points set.</li> <li>As with a k-d tree, this process is repeated until there are at least $$k$$ candidate points, and the distance to the k-th nearest candidate point is less than the distance to the boundary of the nearest unexplored neighboring hypersphere. 
The $$k$$ nearest candidate points are then selected as the result.</li> </ol> </li> <li><strong>For range queries within a radius $$r$$</strong>: <ol> <li>All points in the leaf node’s search space are added to the candidate points set, and points from all hyperspheres that intersect the hypersphere of radius $$r$$ centered at the query point are also added to the candidate points set.</li> <li>The points that are at a distance less than $$r$$ from the query point are selected as the result.</li> </ol> </li> </ul> <p>Distance from a point to a hypersphere’s surface is very simple to calculate. We just calculate the distance from the point to the hypersphere’s center and subtract the radius. More formally, for a hypersphere of radius $$r$$ centered at point $$center$$, the distance from a point $$query\_point$$ is defined as:</p> <div style="overflow-x: scroll"> $$dist(query\_point, hypersphere) = dist(query\_point, center) - r$$ </div> <p>Hyperspheres thus make for a very efficient query mechanism as computing distance to a hypersphere is significantly simpler than computing distance to any other n-dimensional geometric structure. This property holds true for non-Minkowski distances as well.</p> <p>The time complexity for tree creation for a general ball tree using a non-Minkowski distance metric is $$O(N^2)$$. The kNN and range queries have the same time complexities as those of a k-d tree, i.e., $$O(\log(N))$$ and $$O\left(D \cdot N ^ {1 - \frac{1}{D}}\right)$$ respectively.</p> <blockquote> <p><a id="#note_3" style="color: inherit"><strong>Note 3</strong></a>: Using a Minkowski distance metric and the method described in <a href="#note_2">Note 2</a>, the tree construction time complexity drops to $$O(N\log(N))$$</p> </blockquote> <p> </p> <p><strong>Implementation in Python</strong><br /> Let’s put everything together in a few lines of Python code. 
The scikit-learn library has a nice implementation of BallTree with support for Haversine distance right out of the box.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">BallTree</span> <span class="n">R</span> <span class="o">=</span> <span class="mf">6371008.7714</span> <span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span> <span class="s">"https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/download/?format=csv&amp;timezone=Asia/Kolkata&amp;lang=en&amp;use_labels_for_header=true&amp;csv_separator=%3B"</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">";"</span><span class="p">)</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Timezone"</span><span class="p">,</span> <span class="s">"Daylight savings time flag"</span><span class="p">,</span> <span class="s">"geopoint"</span><span class="p">])</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"Zip"</span><span class="p">)</span> <span class="c1"># convert degrees to radians </span> <span 
class="n">df</span><span class="p">.</span><span class="n">Latitude</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Latitude</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">radians</span><span class="p">)</span> <span class="n">df</span><span class="p">.</span><span class="n">Longitude</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Longitude</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">radians</span><span class="p">)</span> <span class="c1"># construct a ball tree with haversine distance as the metric </span> <span class="n">tree</span> <span class="o">=</span> <span class="n">BallTree</span><span class="p">(</span><span class="n">df</span><span class="p">[[</span><span class="s">'Latitude'</span><span class="p">,</span> <span class="s">'Longitude'</span><span class="p">]].</span><span class="n">values</span><span class="p">,</span> <span class="n">leaf_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">metric</span><span class="o">=</span><span class="s">'haversine'</span><span class="p">)</span> <span class="c1"># lets find the 10 nearest zipcodes to a 68460, a zipcode in Waco, NE </span> <span class="n">query_zipcode</span> <span class="o">=</span> <span class="mi">68460</span> <span class="n">query_point</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">query_zipcode</span><span class="p">][[</span><span class="s">"Latitude"</span><span class="p">,</span> <span class="s">"Longitude"</span><span class="p">]].</span><span class="n">values</span> <span class="n">distances</span><span class="p">,</span> <span 
class="n">indices</span> <span class="o">=</span> <span class="n">tree</span><span class="p">.</span><span class="n">query</span><span class="p">([</span><span class="n">query_point</span><span class="p">],</span> <span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    <span class="n">result_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">indices</span><span class="p">[</span><span class="mi">0</span><span class="p">]].</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">result_df</span><span class="p">[</span><span class="s">'Distance (in km)'</span><span class="p">]</span> <span class="o">=</span> <span class="n">distances</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">R</span><span class="o">/</span><span class="mi">1000</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The 10 closest zipcodes to </span><span class="si">{</span><span class="n">query_zipcode</span><span class="si">}</span><span class="s"> are:"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">result_df</span><span class="p">)</span>
    <span class="k">print</span><span class="p">()</span>

    <span class="c1"># let's find all the zipcodes in a 25 km radius of 68460, a zipcode in Waco, NE</span>
    <span class="n">indices</span><span class="p">,</span> <span class="n">distances</span> <span class="o">=</span> <span class="n">tree</span><span class="p">.</span><span class="n">query_radius</span><span class="p">([</span><span class="n">query_point</span><span class="p">],</span> <span class="n">r</span><span class="o">=</span><span class="mi">25</span><span class="o">/</span><span class="p">(</span><span class="n">R</span><span class="o">/</span><span class="mi">1000</span><span class="p">),</span> <span
class="n">return_distance</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sort_results</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">result_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">indices</span><span class="p">[</span><span class="mi">0</span><span class="p">]].</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">result_df</span><span class="p">[</span><span class="s">'Distance (in km)'</span><span class="p">]</span> <span class="o">=</span> <span class="n">distances</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">R</span><span class="o">/</span><span class="mi">1000</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"The zipcodes in a 25 km radius of </span><span class="si">{</span><span class="n">query_zipcode</span><span class="si">}</span><span class="s"> are:"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">result_df</span><span class="p">)</span></code></pre></figure> <p>The script produces the following output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The 10 closest zipcodes to 68460 are:
                  City  State  Latitude  Longitude  Distance (in km)
Zip
68460             Waco     NE  0.713804  -1.700836          0.000000
68456            Utica     NE  0.713810  -1.698569         10.916424
68467             York     NE  0.713233  -1.703247         12.169037
68367          Gresham     NE  0.716272  -1.699893         16.363419
68316         Benedict     NE  0.715839  -1.703610         18.605159
68313  Beaver Crossing     NE  0.711776  -1.697655         20.047743
68401  McCool Junction     NE  0.711132  -1.703144         20.340898
68364          Goehner     NE  0.712664  -1.696812         20.701203
68330          Cordova     NE  0.710632  -1.699116         21.845466
68439      Staplehurst     NE  0.715517  -1.696551         23.330794

The zipcodes in a 25 km radius of 68460 are:
                  City  State  Latitude  Longitude  Distance (in km)
Zip
68460             Waco     NE  0.713804  -1.700836          0.000000
68456            Utica     NE  0.713810  -1.698569         10.916424
68467             York     NE  0.713233  -1.703247         12.169037
68367          Gresham     NE  0.716272  -1.699893         16.363419
68316         Benedict     NE  0.715839  -1.703610         18.605159
68313  Beaver Crossing     NE  0.711776  -1.697655         20.047743
68401  McCool Junction     NE  0.711132  -1.703144         20.340898
68364          Goehner     NE  0.712664  -1.696812         20.701203
68330          Cordova     NE  0.710632  -1.699116         21.845466
68439      Staplehurst     NE  0.715517  -1.696551         23.330794
</code></pre></div></div> <p> </p> <h2 id="further-reading">Further reading</h2> <p>If you found the article and its premise interesting, here are some related resources:</p> <ol> <li><a href="https://tech.instacart.com/dont-let-the-crow-guide-your-routes-f24c96daedba">Replacing Haversine distance with a simple machine learning model</a> (Instacart blog).</li> <li>We assumed that the Earth is a perfect sphere, but we know that it isn’t. <a href="https://en.wikipedia.org/wiki/Vincenty%27s_formulae">Vincenty’s method</a> models the Earth as an oblate spheroid and provides a more accurate distance metric.</li> <li><a href="https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/">Benchmarking Nearest Neighbor Searches in Python</a></li> </ol> <h2 id="footnotes">Footnotes</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1" role="doc-endnote"> <p><a href="https://robertheaton.com/2018/07/09/how-tinder-keeps-your-location-a-bit-private/">How Tinder keeps your exact location (a bit) private</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2" role="doc-endnote"> <p><a href="https://en.wikipedia.org/wiki/Earth_radius#Published_values">A list of accepted Earth radii</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3" role="doc-endnote"> <p><a href="https://en.wikipedia.org/wiki/James_Inman">Inman,
James</a> (1835). <a href="https://books.google.co.in/books?id=-fUOnQEACAAJ&amp;redir_esc=y">Navigation and Nautical Astronomy: For the Use of British Seamen</a> (3 ed.). London, UK: W. Woodward, C. &amp; J. Rivington. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4" role="doc-endnote"> <p>All videos created using <a href="https://www.manim.community/">Manim Community</a>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>