ibis Table Joins

Learning Goals

  • use join() to combine two tables on a key column

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

Last time we started getting comfortable with lazy evaluation (head() and execute()) in ibis, and began to learn how to select() (subset columns) and filter() (subset rows), as well as looking at distinct values. Today we will continue to draw on these skills as we go deeper into the fisheries data in search of the evidence of the North Atlantic Cod collapse. In the process, we shall pick up some new methods as well.

As before, let’s start with reading in data. Rather than focus on the metrics table, this time we will connect to several tables at the same time. Note how we can reuse the base_url to avoid extra typing, but take care that we are reading the right CSV file in each case! As before, we also explicitly set the nullstr value to ensure missing value codes are correctly interpreted.

base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")

Fish ‘stocks’

Like most real-world data science problems, understanding these tables requires both a bit of background in fisheries science and a lot of spelunking into the data. For our purposes, one of the key things you should know is that fisheries are divided into “stocks”, which you can think of as a particular species of fish in a particular area of the ocean. Let’s explore this idea a bit more, beginning with a peek at the stock table:

stock
DatabaseTable: ibis_read_csv_5sjnrevlzzaijnrmxgnnonpheq
  stockid         string
  tsn             int64
  scientificname  string
  commonname      string
  areaid          string
  stocklong       string
  region          string
  primary_country string
  primary_FAOarea int64
  ISO3_code       string
  GRSF_uuid       string
  GRSF_areaid     string
  inmyersdb       int64
  myersstockid    string
  state           string

Ah! commonname looks like as good a place as any to go looking for Atlantic cod. Of course, if we knew (or looked up) the scientific name of the species, that might be even better – after all, common names are not always as precise. Let’s see what we can find:

(stock
 .filter(_.commonname == "Atlantic cod")
 .select(_.stockid, _.scientificname, _.commonname, 
         _.areaid, _.region, _.primary_country, _.ISO3_code)
 .head()
 .execute()
)
stockid scientificname commonname areaid region primary_country ISO3_code
0 COD1ABCDE Gadus morhua Atlantic cod multinational-ICES-1ABCDE Canada East Coast Greenland GRL
1 COD1F-XIV Gadus morhua Atlantic cod multinational-ICES-1F-XIV Europe non EU Greenland GRL
2 COD1IN Gadus morhua Atlantic cod multinational-ICES-1IN Canada East Coast Greenland GRL
3 COD2J3KL Gadus morhua Atlantic cod Canada-DFO-2J3KL Canada East Coast Canada CAN
4 COD3M Gadus morhua Atlantic cod multinational-NAFO-3M Canada East Coast Portugal PRT

Lots of stocks of Atlantic cod! Each row begins with a unique stockid. A column that uniquely identifies each row in a given table is often referred to as the “primary key” for that table (and is often, but not necessarily, listed first). The columns that follow give us some sense of what defines a “stock” as a species in an area: we see a few different identifiers for the species (commonname, scientificname), as well as information about the area the stock occurs in, such as areaid, region, and primary_country. (For display purposes we selected only a subset of columns.)
While we have found the Cod, we haven’t yet found any data about the cod catch over time! For that we will need to look in the timeseries data. Let’s see how it is organized:

timeseries.head().execute()
assessid stockid stocklong tsid tsyear tsvalue
0 ABARES-BGRDRSE-1960-2011-CHING BGRDRSE Blue grenadier Southeast Australia CdivMEANC-ratio 1960 NaN
1 ABARES-BGRDRSE-1960-2011-CHING BGRDRSE Blue grenadier Southeast Australia CdivMEANC-ratio 1961 NaN
2 ABARES-BGRDRSE-1960-2011-CHING BGRDRSE Blue grenadier Southeast Australia CdivMEANC-ratio 1962 NaN
3 ABARES-BGRDRSE-1960-2011-CHING BGRDRSE Blue grenadier Southeast Australia CdivMEANC-ratio 1963 NaN
4 ABARES-BGRDRSE-1960-2011-CHING BGRDRSE Blue grenadier Southeast Australia CdivMEANC-ratio 1964 NaN

We again have a column called stockid. While we no longer have columns such as commonname or scientificname to tell us what species each row in the timeseries is measuring, we now know that we can look up that information in the stock table using the stockid. Such a column is often called a “foreign key”, because it matches the primary key of a separate table. (It appears the timeseries data has no “primary key” of its own: no single column has a unique value for each row.) Rather than switching back and forth between two tables, we can join the two tables on stockid:

(stock
 .filter(_.commonname == "Atlantic cod")
 .join(timeseries, "stockid")
 .head()
 .select(_.stockid, _.scientificname, _.tsid, _.tsyear, 
         _.tsvalue, _.stocklong, _.stocklong_right) # subset of columns to keep display narrow
 .execute()
)
stockid scientificname tsid tsyear tsvalue stocklong stocklong_right
0 COD4VsW Gadus morhua CdivMEANC-ratio 1958 0.997804 Atlantic cod Eastern Scotian Shelf Atlantic cod Eastern Scotian Shelf
1 COD4VsW Gadus morhua CdivMEANC-ratio 1959 1.706091 Atlantic cod Eastern Scotian Shelf Atlantic cod Eastern Scotian Shelf
2 COD4VsW Gadus morhua CdivMEANC-ratio 1960 1.308003 Atlantic cod Eastern Scotian Shelf Atlantic cod Eastern Scotian Shelf
3 COD4VsW Gadus morhua CdivMEANC-ratio 1961 1.713846 Atlantic cod Eastern Scotian Shelf Atlantic cod Eastern Scotian Shelf
4 COD4VsW Gadus morhua CdivMEANC-ratio 1962 1.685411 Atlantic cod Eastern Scotian Shelf Atlantic cod Eastern Scotian Shelf

Effectively, all this has done is take our timeseries table and, for each stockid, add extra columns explaining what the stock table tells us about that stockid: species names, areas, and so on. The join has made our data much wider than before, since we now have all the columns from both tables. (Note that both tables happened to have one column with the same name, stocklong. A truly tidy database would not have done this, since this information clearly belongs in the stock table. Because our database cannot assume these columns are the same when we join, it has renamed the one on the “right” (from timeseries) as “stocklong_right” to distinguish them.) Because each stockid was repeated in the timeseries table, now all this other information is repeated too. This is not as inefficient as it may sound, thanks to internal optimizations in the database.

While it is clear even from this head() preview that we have the columns from both tables, what about the rows? Our stock table was already filtered to a subset of rows containing only Cod stocks. This join (technically called an “inner join”) has kept only those stockids, so we now have timeseries only about Cod! In fact, we could have instead joined the full tables for all stock ids, and then applied the filter for commonname.

Exercise

Try further exploring this resulting table using select() and distinct() to get a better sense of what rows are here. You will notice additional “*id” columns, like assessid or areaid, matching other tables in the data. Explore filtering and joining with these tables as well.