In this section we turn to raster data, specifically satellite imagery. To get started, we must select an “area of interest” to study: in our case, the bounding box around the redlining polygons of whichever city we have chosen. Given a geopandas data frame of our city, we can extract the bounding box using the .total_bounds attribute.¹
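As a minimal sketch of this step, the helper below reads the bounding box from a geopandas data frame. The names `city_bbox` and `load_city_bbox` and the `path` argument are hypothetical, introduced only for illustration. Reprojecting to longitude/latitude (EPSG:4326) is an assumption made here because catalog searches are typically expressed in those coordinates, and it only works if the file declares a coordinate reference system.

```python
def city_bbox(city):
    """Return [minx, miny, maxx, maxy] for a GeoDataFrame as plain floats.

    `total_bounds` is an attribute holding a NumPy array, so it is read
    without parentheses.
    """
    minx, miny, maxx, maxy = (float(v) for v in city.total_bounds)
    return [minx, miny, maxx, maxy]


def load_city_bbox(path):
    """Hypothetical helper: read a vector file and return its bounding box."""
    # Imported here so that city_bbox() itself has no hard dependency.
    import geopandas as gpd

    city = gpd.read_file(path)
    # Assumption: the file has a CRS defined, so it can be reprojected
    # to longitude/latitude before taking the bounds.
    return city_bbox(city.to_crs(epsg=4326))
```

The four numbers come back in the order west, south, east, north when the data are in longitude/latitude.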
Working with raster data takes us beyond the tabular data model we have been using so far with ibis into another data model, known in some communities as “raster” data and in others as “N-dimensional array data”. Consequently, we will be introducing a suite of new packages for working with this kind of data. To help associate the new Python packages with their roles in this data pipeline, I will add the import commands as we go along, rather than all at once.
We begin by searching for relevant satellite imagery using a SpatioTemporal Asset Catalog (STAC). A STAC catalog lets us search for all data in a particular collection that corresponds to a certain place and time, and potentially other criteria noted in the catalog description. The Python module pystac_client can automate this search process for us. Here, we search for all data assets in the “Sentinel-2 Level-2A” collection of publicly hosted data on Amazon Web Services (AWS) which:
fall within our bounding box
occur in the most recent summer months (we want to measure greenness when leaves are on the trees)
have less than 20% of the image obscured by clouds
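The three criteria above can be sketched as a pystac_client search. Several details here are assumptions rather than part of the text: the Earth Search endpoint URL, the collection id `sentinel-2-l2a`, the June to September date window (a northern-hemisphere summer), and the helper names `summer_search_params` and `search_items`. Treat it as one plausible way to express the search, not the definitive one.

```python
def summer_search_params(bbox, year, max_cloud=20):
    """Build keyword arguments for a STAC search matching the three criteria."""
    return {
        # Assumed collection id for Sentinel-2 Level-2A.
        "collections": ["sentinel-2-l2a"],
        # [minx, miny, maxx, maxy] in longitude/latitude.
        "bbox": list(bbox),
        # Assumed summer window: 1 June to 1 September of the given year.
        "datetime": f"{year}-06-01/{year}-09-01",
        # Keep only scenes with less than `max_cloud` percent cloud cover.
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
    }


def search_items(params, url="https://earth-search.aws.element84.com/v1"):
    """Run the search against a STAC API (the URL is an assumed endpoint)."""
    # Imported here so that building the parameters needs no extra packages.
    from pystac_client import Client

    client = Client.open(url)
    # Only catalog metadata is fetched; no imagery is downloaded.
    return client.search(**params).item_collection()
```

For example, `search_items(summer_search_params(bbox, 2024))` would return the matching catalog items for a bounding box `bbox`, where the year is chosen by the reader.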
At this stage, we are only reading the STAC data catalog using pystac_client, not touching any of the actual data files. Our goal is to narrow down to the URLs of just those assets we need, rather than downloading lots of data that won’t end up in our analysis. This approach is a core element of a ‘cloud native’ workflow.
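To illustrate what narrowing down to URLs can look like, the sketch below walks over catalog items and collects only the links to the bands of interest. It assumes `items` is an iterable of STAC items such as a pystac_client search returns, each with `id`, `properties`, and `assets` attributes. The asset keys `"red"` and `"nir"` are an assumption about how this catalog names its bands, and `asset_urls` is a hypothetical helper name.

```python
def asset_urls(items, keys=("red", "nir")):
    """Collect the id, cloud cover, and selected asset URLs for each item."""
    rows = []
    for item in items:
        row = {
            "id": item.id,
            "cloud_cover": item.properties.get("eo:cloud_cover"),
        }
        # Keep only the requested assets; keys missing from an item are skipped.
        for key in keys:
            if key in item.assets:
                row[key] = item.assets[key].href
        rows.append(row)
    return rows
```

Nothing in this step opens the image files themselves. The result is a small table of links that later steps can read from selectively.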