Building your own geolocation API


One cool project I worked on was building up a geolocation API. This service obtains a latitude and longitude via web request and responds back with the closest city to that location. Well, to be more accurate, it provides more than just the city; it provides the country, the province, the primary language and a whole bunch of other bits of metadata.

There are two ways to go at this that I know of if you want to build something in-house. The first way (and the one we didn't use) is to use OpenStreetMaps. OSM provides location information in the form of a Postgresql database. You can host your own instance with docker and use the database to find the closest location to a given point through standard queries. The second way is using GeoNames.

GeoNames is a free database available under creative commons. They deserve a lot more attention than what they get. They provide CSV files of geographical data, including the coordinates of the locations, the name, what the point is, population and more. It's an amazing data set that can be used for a lot of purposes; especially when you consider how rare it is to find geological resources that aren't proprietary or heavily restricted. You would think information about our world would be available to the community, but it sure as heck ain't.

CSV is a relatively easy format to parse; the tricky part is figuring out how to use the data efficiently. These files are several gigabytes in size; doing a brute force for-loop through the dataset isn't going to be nearly fast enough. One solution is to build a very specific kind of datastructure; a k-dimensional tree.

KD Tree

I'm not going to go too computer sciency on you. At each level of the tree, you compare either the X dimension or Y dimensions of your data and navigate down one of the branches based on some easy rules.

Starting at the root node:

current_best = nodes.root
current_node = root

while True:
    if request.x < current_node.x:
        current_node = current_node.left
    else:
        current_node = current_node.right

    if distance(current_node, request) < distance(current_node.parent):
        current_best = current_node
    if current_node is None:
        break

    if request.y < current_node.y:
        current_node = current_node.left
    else:
        current_node = current_node.right

    if distance(current_node, request) < distance(current_node.parent):
        current_best = current_node
    if current_node is None:
        break

As you can see, we have this repeating pattern alternating X and Y checks. At each level, you're choosing to navigate down the left or right branch by comparing your requested position against the node stored at that particular level. If your value is less than the node, you go left; otherwise, you go right. As you navigate down the levels, you're also doing comparisons to try to keep track of the closest node to your location. This pattern continues until you reach the terminating leaf node.

There is a certain level of complexity involved in building up the tree. There's no hard rule on what node belongs to what level, meaning you can potentially end up with an unbalanced tree. A solution to this is to take on some additional complexity when building the tree and choose the midpoints based on median values. If the middle node is the median, you should expect to see an equal depth level on both sides of the tree.

The downside to this approach is that if you start changing it after construction by adding more nodes and erasing other nodes, you're likely going to need to recalculate your entire tree to make it balanced again. You also might not want a balanced tree; perhaps one type of search is way more common than the other.

Lookups are incredibly fast at O(log n) efficiency if you can get this working. You can easily request thousands of points in bulk with the vast amount of your overhead being the transport medium. (ie: TCP being slow.)

Wikipedia offers an example in Python on how to implement the algorithm. Alternatively, you can probably find a nice library that implements it. At any rate, once you've loaded the data into the k-d tree, you're basically done! Just wrap up your search method in a web or socket API and call it a day.