Ravi Shekhar's Technical Blog

A Technical Blog of the Data Science Process

Geospatial Operations at Scale with Dask and Geopandas


Note: This post has interactive Bokeh graphics which may not render well on mobile devices. Try viewing the Jupyter notebook which underlies this post on NBViewer.

Part 1: A Gentle Introduction to the Spatial Join

One problem I came across when analyzing the New York City Taxi Dataset is that from 2009 through June 2016, both the starting and stopping locations of taxi trips were given as longitude and latitude points. Since July 2016, to provide a degree of anonymity when releasing data to the public, the Taxi and Limousine Commission (TLC) has provided only the starting and ending "taxi zones" of a trip, along with a shapefile that specifies the zone boundaries, available here. Let's load this shapefile in Geopandas and set the coordinate system to 'epsg:4326', which expresses coordinates as latitude and longitude.
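A minimal sketch of this loading step, under a few assumptions: the shapefile sits at a hypothetical local path (`taxi_zones/taxi_zones.shp`), and the helper name `to_lon_lat` is mine rather than anything from the original notebook. `to_crs` reprojects from whatever coordinate system the file declares. If the file declares none, the sketch assumes the coordinates are already longitude and latitude and only labels them.

```python
import geopandas as gpd


def to_lon_lat(zones: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Return a copy of `zones` expressed in EPSG:4326 (longitude/latitude)."""
    if zones.crs is None:
        # ASSUMPTION: with no declared CRS, treat the coordinates as
        # already being lon/lat and just attach the label.
        return zones.set_crs("EPSG:4326")
    # Otherwise reproject from the declared CRS.
    return zones.to_crs("EPSG:4326")


# Usage, with a hypothetical path to the downloaded TLC shapefile:
# zones = to_lon_lat(gpd.read_file("taxi_zones/taxi_zones.shp"))
# zones.head()
```

Keeping the reprojection in a small function makes it easy to check on a toy GeoDataFrame before running it on the real zones.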

Out[1]:
   LocationID  borough        geometry                                            zone
0  1           EWR            POLYGON ((-74.18445299999996 40.6949959999999,...   Newark Airport
1  2           Queens         (POLYGON ((-73.82337597260663 40.6389870471767...   Jamaica Bay
2  3           Bronx          POLYGON ((-73.84792614099985 40.87134223399993...   Allerton/Pelham Gardens
3  4           Manhattan      POLYGON ((-73.97177410965318 40.72582128133706...   Alphabet City
4  5           Staten Island  POLYGON ((-74.17421738099989 40.5625680859999,...   Arden Heights

We see that the geometry column consists of polygons (from Shapely) whose vertices are defined by longitude and latitude points. Let's plot the zones using Bokeh, in order of ascending LocationID.
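To make the structure of the geometry column concrete, here is a rough sketch of pulling the vertex coordinates out of each zone in ascending LocationID order. The function name `polygon_outlines` is hypothetical. It assumes the frame has a `LocationID` column and that every geometry is a Shapely Polygon or MultiPolygon. It keeps only exterior rings, so any holes are ignored.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon


def polygon_outlines(zones: gpd.GeoDataFrame):
    """Collect exterior-ring coordinates, one entry per polygon part.

    Returns (xs, ys, ids): xs[i] and ys[i] are the longitudes and latitudes
    of one outline, and ids[i] is the LocationID that outline belongs to.
    """
    ordered = zones.sort_values("LocationID")
    xs, ys, ids = [], [], []
    for location_id, geom in zip(ordered["LocationID"], ordered.geometry):
        # A MultiPolygon (e.g. a zone made of several islands) contributes
        # one outline per part; a plain Polygon contributes exactly one.
        parts = geom.geoms if isinstance(geom, MultiPolygon) else [geom]
        for part in parts:
            x, y = part.exterior.coords.xy
            xs.append(list(x))
            ys.append(list(y))
            ids.append(int(location_id))
    return xs, ys, ids
```

Lists of lists like `xs` and `ys` are the shape that Bokeh's `patches` glyph method accepts, so one plausible way to draw the zones is `p.patches(xs, ys)` on a `bokeh.plotting.figure`.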

[Interactive Bokeh plot of the taxi zones]