How to Work with Overture Maps Data in Snowflake
Visualize and analyze very large geospatial datasets with the power of the Snowflake Data Cloud platform
The first Overture Maps release includes four data layers:
- Buildings
- Places of Interest
- Transportation
- Administrative Boundaries
These layers follow a specific data schema and have been created with developers in mind, making it easy to build geospatial applications on top of them.
The layers include worldwide data and are huge. In this article, we are going to show how to load and process the Buildings layer, which contains over 780 million building footprints (polygons).
Working with datasets at this scale requires a different approach from traditional GIS workflows, where most of the data is processed locally on a desktop machine. With a single computer, even a powerful one, it is not possible to apply common GIS processing tools, like spatial predicates, to data of this size.
Fortunately, next-generation data platforms like Snowflake are well equipped to cope with these challenges thanks to their massively parallel processing (MPP) engines. Snowflake has been adding geospatial features continuously since the GEOGRAPHY data type became generally available in 2022.
We are going to show how to ingest a dataset of this size, perform some processing, and then visualize it efficiently, taking advantage of the scalability provided by the Snowflake Data Cloud platform.
The first thing we are going to do is ingest the data into the Snowflake database. The layers are provided as Parquet files stored in AWS and Azure cloud storage services. We are going to load the data from the S3 bucket, but the process works in a similar way with Azure.
Loading data from an external cloud storage service into Snowflake usually requires creating a stage, specifying the file format, creating the table with the needed columns, and finally using the COPY INTO statement to load the data. Let's walk through each of these tasks step by step:
- Creating the stage. The stage will point to the S3 bucket containing the Parquet files.
CREATE STAGE overture_stage
    URL = 's3://overturemaps-us-west-2/release/<release-id>/theme=buildings/type=building/';
- Specifying the file format. In addition to the type, it is important to set the BINARY_AS_TEXT option to FALSE so the binary data is not converted to text while loading.
CREATE OR REPLACE FILE FORMAT overture_buildings_format
TYPE = parquet
BINARY_AS_TEXT = FALSE;
- Creating the table. Here we take advantage of the INFER_SCHEMA function that is able to get the column details from the Parquet files.
CREATE OR REPLACE TABLE OVERTURE.BUILDINGS
USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
        LOCATION => '@overture_stage',
        FILE_FORMAT => 'overture_buildings_format'
    ))
);
- Loading the data. Finally, we use the COPY INTO statement to perform the actual data loading. There are ~120 files, most of them around 1 GB each, so this loading step can take a considerable amount of time. In my tests, the data was loaded in ~32 minutes using a Small warehouse.
COPY INTO OVERTURE.BUILDINGS
FROM @overture_stage
FILE_FORMAT = (FORMAT_NAME = 'overture_buildings_format')
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
At the end of the ingestion process, we will have a large table with 785 million rows and 115 GB of data. The shapes of the building footprints are stored in a BINARY column using the WKB (well-known binary) format. In this processing step, we are going to do two things: convert the shape data to the native Snowflake GEOGRAPHY data type and generate vector tiles so we can visualize the layer efficiently.
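As a side note, the WKB layout itself is simple to inspect. The following Python sketch (standard library only, with hand-built sample bytes, not Snowflake code) encodes and decodes a small WKB polygon, the same binary representation that ST_GEOGFROMWKB parses on the Snowflake side:

```python
import struct

def parse_wkb_polygon(buf: bytes):
    """Decode a WKB Polygon into a list of rings (lists of (x, y) pairs)."""
    endian = "<" if buf[0] == 1 else ">"   # first byte: 1 = little-endian
    geom_type = struct.unpack_from(endian + "I", buf, 1)[0]
    assert geom_type == 3, "not a WKB Polygon"
    (num_rings,) = struct.unpack_from(endian + "I", buf, 5)
    offset, rings = 9, []
    for _ in range(num_rings):
        (num_points,) = struct.unpack_from(endian + "I", buf, offset)
        offset += 4
        coords = struct.unpack_from(endian + f"{2 * num_points}d", buf, offset)
        offset += 16 * num_points          # two 8-byte doubles per point
        rings.append(list(zip(coords[0::2], coords[1::2])))
    return rings

def make_wkb_polygon(ring):
    """Encode one closed ring as a little-endian WKB Polygon (for the demo)."""
    buf = struct.pack("<BII", 1, 3, 1) + struct.pack("<I", len(ring))
    for x, y in ring:
        buf += struct.pack("<2d", x, y)
    return buf

# A unit-square footprint as a closed ring (first point == last point)
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
print(parse_wkb_polygon(make_wkb_polygon(square)))
```

Real building footprints follow exactly this structure, possibly with additional interior rings for courtyards.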
To perform the conversion, we first need to add a new GEOGRAPHY column to our table:
ALTER TABLE OVERTURE.BUILDINGS
ADD COLUMN geom GEOGRAPHY;
Then we can use the ST_GEOGFROMWKB function to populate the new column (it is important to allow invalid shapes because there seem to be some self-intersecting polygons in this first release):
UPDATE OVERTURE.BUILDINGS
SET geom = ST_GEOGFROMWKB("geometry", TRUE);
This is a compute-intensive process. In my tests, it took ~45 minutes to execute the statement above with the same warehouse used for the data loading. This time can easily be reduced by using a bigger warehouse, and the total cost will be approximately the same.
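The cost claim follows from simple arithmetic: each step up in warehouse size doubles the credits consumed per hour but, for a parallelizable job like this, roughly halves the runtime. A quick sketch (the per-hour credit rates are Snowflake's published figures for standard warehouses; the perfect-scaling assumption is an idealization):

```python
# Credits per hour for standard Snowflake warehouse sizes (published rates)
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def estimate_cost(baseline_size, baseline_minutes, target_size):
    """Assume runtime scales inversely with warehouse size (ideal parallelism)."""
    speedup = CREDITS_PER_HOUR[target_size] / CREDITS_PER_HOUR[baseline_size]
    minutes = baseline_minutes / speedup
    credits = CREDITS_PER_HOUR[target_size] * minutes / 60
    return minutes, credits

# The ST_GEOGFROMWKB conversion took ~45 minutes on a Small warehouse
for size in ["S", "M", "L", "XL"]:
    minutes, credits = estimate_cost("S", 45, size)
    print(f"{size}: ~{minutes:.1f} min, ~{credits:.1f} credits")
```

Under this assumption, every size costs the same 1.5 credits; the bigger warehouse simply finishes sooner. Real jobs scale somewhat less than perfectly, so the cost rises slightly at larger sizes.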
We can check shape validity using the ST_ISVALID function. Only 5 shapes were found to be invalid:
SELECT COUNT(*)
FROM OVERTURE.BUILDINGS
WHERE ST_ISVALID(geom) = FALSE;
Now that we have loaded the data into a proper GEOGRAPHY column, we need to solve the problem of visualizing such a large amount of data. The current standard approach for visualizing medium and large geospatial datasets consists of creating vector tiles, processed differently for each zoom level.
With this approach, we avoid sending the full dataset to the browser when visualizing the whole world. The generation of vector tiles includes, among other operations, the spatial intersection of the dataset with the bounding box of each tile and a geometric simplification that depends on the zoom level.
If the dataset is not very large (e.g. fewer than 100K polygons), we can generate the vector tiles on demand, an approach sometimes called dynamic tiling. But if the dataset is larger, this approach is not feasible because the operations above cannot be performed in the timeframe required by interactive maps.
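To make the tiling scheme concrete, here is the standard Web Mercator (slippy map) tile math in Python, a generic sketch rather than any particular tiler's implementation. Note that the number of tiles quadruples with each zoom level, which is why pre-generating deep zoom levels is so compute-intensive:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing a WGS84 coordinate at a zoom level."""
    n = 2 ** zoom                      # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

def tile_lon_bounds(x, zoom):
    """West and east longitude covered by tile column x at a zoom level."""
    n = 2 ** zoom
    return x / n * 360.0 - 180.0, (x + 1) / n * 360.0 - 180.0

# The tile over central Madrid at zoom 14, and the tile count per zoom level
print(lonlat_to_tile(-3.70, 40.42, 14))
print([4 ** z for z in range(5)])   # 1, 4, 16, 64, 256 tiles at zooms 0-4
```

At zoom 14 there are 4^14, roughly 268 million, candidate tiles worldwide, although only tiles that actually intersect data need to be generated and stored.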
For larger datasets, tiles are usually pre-generated. This is also a compute-intensive process that benefits from the parallelization capabilities of modern data warehouses. In this case, we are going to use the functions provided by the CARTO Analytics Toolbox for Snowflake (disclaimer: at the time of writing, I work as a Solution Engineer at CARTO).
CARTO provides different stored procedures for pre-generating tilesets in Snowflake. In this case, we are going to use the CREATE_SIMPLE_TILESET procedure, which takes a table or a query as input and creates a new output table with one row per tile.
CALL CARTO.CARTO.CREATE_SIMPLE_TILESET(
    'SELECT "id", GEOM, "height", "numfloors" FROM OVERTURE.BUILDINGS',
    '<output_tileset_table>',
    '<options_json>'
);
We also specify what the tiler should do when there is a large number of features in a given tile. In this case, we tell it to store no more than 100K vertices per tile and drop the remaining features. By default, the tiler orders the polygons by descending area, ensuring the bigger polygons are not the ones dropped.
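The dropping strategy can be sketched in a few lines of Python (a simplified illustration of the idea, not the actual tiler code):

```python
def apply_vertex_budget(features, max_vertices=100_000):
    """Keep the largest features in a tile until the vertex budget is used up.

    Each feature is a dict with an 'area' and a 'vertices' count; sorting by
    descending area means small footprints are dropped before big ones.
    """
    kept, used = [], 0
    for feat in sorted(features, key=lambda f: f["area"], reverse=True):
        if used + feat["vertices"] > max_vertices:
            continue  # drop this feature, the budget is exhausted
        kept.append(feat)
        used += feat["vertices"]
    return kept

# Toy example with a tiny budget: the large footprints survive, the shed is dropped
tile = [
    {"id": "shed", "area": 12.0, "vertices": 5},
    {"id": "warehouse", "area": 5200.0, "vertices": 9},
    {"id": "house", "area": 95.0, "vertices": 6},
]
print([f["id"] for f in apply_vertex_budget(tile, max_vertices=15)])
```

At low zoom levels this policy keeps the visually dominant buildings, which is exactly what you want when the whole world fits on one screen.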
This process benefits from a bigger warehouse. In my tests, with a Large warehouse, it took around 8 hours to generate the tiles for zoom levels 0 to 14. We can reduce this time further with an even bigger warehouse: in general, the processing time decreases roughly in proportion to the warehouse size, so my recommendation is to use a big warehouse, because the total cost will be approximately the same.
Now that we have pre-generated the tiles, we can visualize them. Tilesets created with the CARTO Analytics Toolbox can be visualized easily within the platform (and shared outside it) using the Builder tool, or we can use the open-source deck.gl visualization library to integrate them into our own applications.
In our tileset we have included the height property for each building, so we can take advantage of it to extrude the building footprints and easily create a 3D visualization:
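As an illustration, the deck.gl side of such a visualization boils down to a handful of layer properties. The sketch below expresses them as a plain Python dict following deck.gl's declarative JSON conventions (the property names come from deck.gl's PolygonLayer API, the `@@=` prefix is deck.gl's JSON expression notation, and the `height` field is the one selected in our tileset query):

```python
import json

# Sketch of the deck.gl layer properties needed for extruded 3D buildings.
# In a real app this would live in a JS application or a Builder map setting.
buildings_layer = {
    "type": "PolygonLayer",
    "extruded": True,              # turn flat footprints into 3D volumes
    "wireframe": False,
    "getPolygon": "@@=geom",       # polygon outline from the tile geometry
    "getElevation": "@@=height",   # extrude each footprint by its height
    "getFillColor": [245, 133, 5],
}
print(json.dumps(buildings_layer, indent=2))
```

Setting `extruded` to true and wiring `getElevation` to the height property is all it takes conceptually; deck.gl handles the 3D rendering.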
Conclusions and Further Work
Visualizing the Overture Maps buildings layer is just one simple example of the things you can accomplish with a cloud-native approach to geospatial data. The value of geospatial data lies not only in visualization, but in being able to spatially analyze the information. You start getting more value when you combine these datasets with your own information to discover additional insights.
Performing spatial intersections with these layers or applying advanced spatial statistics algorithms to understand spatial autocorrelation are complex processes that require a good amount of computing power.
The approach of cloud-native computing, with compute-storage separation and the ability to scale the hardware resources on demand, allows us to perform complex spatial analysis on these large datasets.
The combination of the Snowflake data platform with advanced spatial analysis can help unlock the hidden value in these open datasets by building applications that take advantage of both.