We will be focusing on how to use the ibis package, a successor to the popular pandas package, for manipulating tabular data. We begin by importing the ibis package. (We include two additional imports from the package which are commonly referred to using their short names, the table placeholder _ instead of ibis._, and the selectors methods as s instead of the verbose ibis.selectors. We will see these in action later).

Learning Goals

  • establish a connection with duckdb.connect()
  • use head() and excute() to preview large data
  • use select(), distinct(), filter() to explore data.

Getting started

To use ibis, we must also select a backend. We will always be using the quite new and very powerful duckdb backend for all of our tasks. We select a backend by creating a “connection”. The details here are not important for us, we can treat this first block as “boilerplate” starting code.

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

We are now ready to read in our data. We will begin by reading the metrics table from the direct access link, as indicated in the URL below. con.read_csv() is quite similar to the pandas.read_csv() we saw in module 1, though the optional arguments get some different names and are not quite as flexibile. One important option for our purposes will be the how to indicate missing values. In the past, we’ve seen negative values like -99 be used to indicate missing values. That convention reflects limitations of early software, which had no natural concept of “missing”. More modern conventions indicating missing values as “NULL” or “NA”. We indicate the data has chosen the latter:

metrics_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/tsmetrics.csv"
tsmetrics = con.read_csv(metrics_url, nullstr="NA")

Previewing data: head() and execute()

Let’s take a look at our new table:

tsmetrics
DatabaseTable: ibis_read_csv_lnkh36a47bbdhi4vbfj7tetgry
  tscategory   string
  tsshort      string
  tslong       string
  tsunitsshort string
  tsunitslong  string
  tsunique     string

This doesn’t look like a pretty pandas table! Where are the values? Actually, as we become more familiar with ibis we learn to appreciate the display choice here. ibis is designed for working with very big data. An important part of this is something called lazy evaluation. Even downloading a very large fle might take a long time, and trying to load a large dataset into python all at once can exceed available RAM and crash the kernel. Instead, ibis merely “peeks” at the data over the remote connection – without even downloading it! It tells us the names of each column and the data type (e.g. string, or numeric, etc) that the read_csv method has ‘guessed’ for the data. As we will see, this is often the most useful information anway.

If we we do want to see a few example rows, we can use the method head() on the table, tsmetrics.head(), to say we want only want to see the top of the data frame. Optionally we can specify how many rows we want to preview, e.g. tsmetrics.head(10) to see 10 (the default is 5). Let’s try it:

tsmetrics.head()
r0 := DatabaseTable: ibis_read_csv_lnkh36a47bbdhi4vbfj7tetgry
  tscategory   string
  tsshort      string
  tslong       string
  tsunitsshort string
  tsunitslong  string
  tsunique     string

Limit[r0, n=5]

That’s not the top of 5 rows! Once again, ibis is being lazy. We see the same definition of the table as before, only this time it has a name r0, and we see a “plan of execution”, that ibis will return the first 5 rows Limit[r0, 5]. We can force it to execute this plan with execute() :

tsmetrics.head().execute()
tscategory tsshort tslong tsunitsshort tsunitslong tsunique
0 OTHER TIME SERIES DATA AQ Aquaculture MT metric tons AQ-MT
1 OTHER TIME SERIES DATA ASP Annual surplus production MT Metric tons ASP-MT
2 TOTAL BIOMASS BdivBmgtpref General biomass time series preferentially rel... dimensionless dimensionless BdivBmgtpref-dimensionless
3 TOTAL BIOMASS BdivBmgttouse General biomass time series relative to manage... dimensionless dimensionless BdivBmgttouse-dimensionless
4 TOTAL BIOMASS BdivBmsypref General biomass time series preferentially rel... dimensionless dimensionless BdivBmsypref-dimensionless

At last, we are starting to see what the data really looks like. Data tables can quickly become much to large to explore by simply trying to eyeball every row. For instance, we notice the first column, tscategory, shows a few different possible categories for the various metrics in the database. So, how many distinct categories are there?

select() and distinct()

To answer this, we will introduce a few more methods of data table manipulation. select() selects one or more columns of a given table, while distinct() returns only distinct (unique) rows of the table. Note that both of these methods share a common pattern – they both apply to a table (not some piece of a table, like a row or column or cell), and they both return a new table as well that is some subset of the old table. table in, table out. This design is very intentional – by having methods designed specificially to operate on tables and return tables, we can easily stack or chain these together, (also true of head() and execute()`. So let’s try and see distinct categories:

(tsmetrics
 .select("tscategory")
 .distinct()
 .head(10)
 .execute()
)
 
tscategory
0 OTHER TIME SERIES DATA
1 SPAWNING STOCK BIOMASS or CPUE
2 PRODUCTION
3 TOTAL BIOMASS
4 RECRUITS (NOTE: RECRUITS ARE OFFSET IN TIME SE...
5 TIME UNITS
6 CATCH or LANDINGS
7 FISHING MORTALITY

Note that we have stacked these methods together with each step on it’s own line by wrapping the whole thing inside () parentheses. This can make a long “chain” of commands easier to read. While we have asked for no more that 10 values, we have gotten back only 8 – so we now know there are only 8 categories.

filter()

What are the different unit types within, say, the “CATCH or LANDINGS” category? This requires a subset of rows (a filter), rather than a subset of columns (select()):

(tsmetrics
 .filter(_.tscategory == "CATCH or LANDINGS")
 .distinct()
 .head(14)
 .execute()
)
tscategory tsshort tslong tsunitsshort tsunitslong tsunique
0 CATCH or LANDINGS CdivMSY-conv Catch divided by MSY calculated from converted... ratio ratio CdivMSY-conv-ratio
1 CATCH or LANDINGS CdivMSY Catch divided by MSY ratio ratio CdivMSY-ratio
2 CATCH or LANDINGS TAC-A2 Total allowable catch In subarea MT Metric tons TAC-A2-MT
3 CATCH or LANDINGS TC-3 Total catch. Use only when there is more than ... E03 Thousands TC-3-E03
4 CATCH or LANDINGS TL-1 Total landings. Use only when there is more th... E03MT Thousands of metric tons TL-1-E03MT
5 CATCH or LANDINGS TL-A1 Total landings in subarea MT Metric tons TL-A1-MT
6 CATCH or LANDINGS TL-A3 Total landings in subarea MT Metric tons TL-A3-MT
7 CATCH or LANDINGS CdivMSY-est Catch divided by MSY calculated from model-est... ratio ratio CdivMSY-est-ratio
8 CATCH or LANDINGS TAC Total allowable catch MT Metric tons TAC-MT
9 CATCH or LANDINGS TC-2 Total catch (i.e. landings + discards. Add lan... MT Metric tons TC-2-MT
10 CATCH or LANDINGS TC Total catch (i.e. landings + discards. Add lan... E06 Millions TC-E06
11 CATCH or LANDINGS TL-2 Total landings. Use only when there is more th... MT Metric tons TL-2-MT
12 CATCH or LANDINGS TL-A4 Total landings in subarea MT Metric tons TL-A4-MT
13 CATCH or LANDINGS CdivMSY-dvmb Catch divided by MSY calculated from reference... ratio ratio CdivMSY-dvmb-ratio

This syntax to subset rows (filter) is more complicated than columns (select) – to find rows containing “CATCH or LANDINGS” we have to indicate which column to look for.

column selection and .

For python to know that we are looking for the column called “tscategory”, we use the column selection _.tscategory. This is actually a shorthand for the pattern tsmetrics.tscategory – the _ is a placeholder for “the current table” in our chain. Extracting a single column with . is itself something of a shorthand, it is equivalent to using the selector [, as tsmetrics["category"]. When a column name is also the name of a table method, we may need to fall back on the square bracket convention. So why use a dot at all? In addition to taking two less characters to write, the . method allows “tab completion” of the column name, which helps us avoid typos. Note that our select() method recognizes either syntax, you can do: tsmetrics.select(_.tscategory). This looks slightly more cryptic, but benefits from autocomplete and matches the sytnax of other functions.

== not =

Another common mistake is to use a single = sign rather than == in filter. Recall that = is used in variable assignment, a = 1 sets the value of a as 1. Double-equals is a “boolean operator”, that tests if the statement is True or False:

a = 1
a == 1
True

Other boolean operators include >, >=, != (not equal) and so forth. The important thing is to know that we can do boolean comparisons, this syntax is easy to look up.

Next steps

Explore the datasets in this collection using select(), distinct() and filter(). We will return to this list of Catch units after we become more aquainted with the remaining tables.

This syntax is harder than select() – we can’t just filter for “CATCH or LANDINGS” without indicating which column we are looking in. To signal that tscategory is a column name and not just a piece of text, we use the subsetting notation, _.tscategory. This is merely a shorthand for the more verbose: