import ibis
from ibis import _
import ibis.selectors as s
= ibis.duckdb.connect() con
We will be focusing on how to use the ibis
package, a successor to the popular pandas
package, for manipulating tabular data. We begin by importing the ibis
package. (We include two additional imports from the package which are commonly referred to using their short names, the table placeholder _
instead of ibis._
, and the selectors methods as s
instead of the verbose ibis.selectors
. We will see these in action later).
Learning Goals
- establish a connection with
duckdb.connect()
- use
head()
andexcute()
to preview large data - use
select()
,distinct()
,filter()
to explore data.
Getting started
To use ibis
, we must also select a backend. We will always be using the quite new and very powerful duckdb
backend for all of our tasks. We select a backend by creating a “connection”. The details here are not important for us, we can treat this first block as “boilerplate” starting code.
We are now ready to read in our data. We will begin by reading the metrics table from the direct access link, as indicated in the URL below. con.read_csv()
is quite similar to the pandas.read_csv()
we saw in module 1, though the optional arguments get some different names and are not quite as flexibile. One important option for our purposes will be the how to indicate missing values. In the past, we’ve seen negative values like -99
be used to indicate missing values. That convention reflects limitations of early software, which had no natural concept of “missing”. More modern conventions indicating missing values as “NULL” or “NA”. We indicate the data has chosen the latter:
= "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/tsmetrics.csv"
metrics_url = con.read_csv(metrics_url, nullstr="NA") tsmetrics
Previewing data: head()
and execute()
Let’s take a look at our new table:
tsmetrics
DatabaseTable: ibis_read_csv_lnkh36a47bbdhi4vbfj7tetgry tscategory string tsshort string tslong string tsunitsshort string tsunitslong string tsunique string
This doesn’t look like a pretty pandas table! Where are the values? Actually, as we become more familiar with ibis
we learn to appreciate the display choice here. ibis
is designed for working with very big data. An important part of this is something called lazy evaluation. Even downloading a very large fle might take a long time, and trying to load a large dataset into python all at once can exceed available RAM and crash the kernel. Instead, ibis
merely “peeks” at the data over the remote connection – without even downloading it! It tells us the names of each column and the data type (e.g. string, or numeric, etc) that the read_csv method has ‘guessed’ for the data. As we will see, this is often the most useful information anway.
If we we do want to see a few example rows, we can use the method head()
on the table, tsmetrics.head()
, to say we want only want to see the top of the data frame. Optionally we can specify how many rows we want to preview, e.g. tsmetrics.head(10)
to see 10 (the default is 5). Let’s try it:
tsmetrics.head()
r0 := DatabaseTable: ibis_read_csv_lnkh36a47bbdhi4vbfj7tetgry tscategory string tsshort string tslong string tsunitsshort string tsunitslong string tsunique string Limit[r0, n=5]
That’s not the top of 5 rows! Once again, ibis
is being lazy. We see the same definition of the table as before, only this time it has a name r0
, and we see a “plan of execution”, that ibis will return the first 5 rows Limit[r0, 5]
. We can force it to execute this plan with execute()
:
tsmetrics.head().execute()
tscategory | tsshort | tslong | tsunitsshort | tsunitslong | tsunique | |
---|---|---|---|---|---|---|
0 | OTHER TIME SERIES DATA | AQ | Aquaculture | MT | metric tons | AQ-MT |
1 | OTHER TIME SERIES DATA | ASP | Annual surplus production | MT | Metric tons | ASP-MT |
2 | TOTAL BIOMASS | BdivBmgtpref | General biomass time series preferentially rel... | dimensionless | dimensionless | BdivBmgtpref-dimensionless |
3 | TOTAL BIOMASS | BdivBmgttouse | General biomass time series relative to manage... | dimensionless | dimensionless | BdivBmgttouse-dimensionless |
4 | TOTAL BIOMASS | BdivBmsypref | General biomass time series preferentially rel... | dimensionless | dimensionless | BdivBmsypref-dimensionless |
At last, we are starting to see what the data really looks like. Data tables can quickly become much to large to explore by simply trying to eyeball every row. For instance, we notice the first column, tscategory
, shows a few different possible categories for the various metrics in the database. So, how many distinct categories are there?
select()
and distinct()
To answer this, we will introduce a few more methods of data table manipulation. select()
selects one or more columns of a given table, while distinct()
returns only distinct (unique) rows of the table. Note that both of these methods share a common pattern – they both apply to a table (not some piece of a table, like a row or column or cell), and they both return a new table as well that is some subset of the old table. table in, table out. This design is very intentional – by having methods designed specificially to operate on tables and return tables, we can easily stack or chain these together, (also true of head()
and execute()`. So let’s try and see distinct categories:
(tsmetrics"tscategory")
.select(
.distinct()10)
.head(
.execute()
)
tscategory | |
---|---|
0 | OTHER TIME SERIES DATA |
1 | SPAWNING STOCK BIOMASS or CPUE |
2 | PRODUCTION |
3 | TOTAL BIOMASS |
4 | RECRUITS (NOTE: RECRUITS ARE OFFSET IN TIME SE... |
5 | TIME UNITS |
6 | CATCH or LANDINGS |
7 | FISHING MORTALITY |
Note that we have stacked these methods together with each step on it’s own line by wrapping the whole thing inside ()
parentheses. This can make a long “chain” of commands easier to read. While we have asked for no more that 10 values, we have gotten back only 8 – so we now know there are only 8 categories.
filter()
What are the different unit types within, say, the “CATCH or LANDINGS” category? This requires a subset of rows (a filter), rather than a subset of columns (select()
):
(tsmetricsfilter(_.tscategory == "CATCH or LANDINGS")
.
.distinct()14)
.head(
.execute() )
tscategory | tsshort | tslong | tsunitsshort | tsunitslong | tsunique | |
---|---|---|---|---|---|---|
0 | CATCH or LANDINGS | CdivMSY-conv | Catch divided by MSY calculated from converted... | ratio | ratio | CdivMSY-conv-ratio |
1 | CATCH or LANDINGS | CdivMSY | Catch divided by MSY | ratio | ratio | CdivMSY-ratio |
2 | CATCH or LANDINGS | TAC-A2 | Total allowable catch In subarea | MT | Metric tons | TAC-A2-MT |
3 | CATCH or LANDINGS | TC-3 | Total catch. Use only when there is more than ... | E03 | Thousands | TC-3-E03 |
4 | CATCH or LANDINGS | TL-1 | Total landings. Use only when there is more th... | E03MT | Thousands of metric tons | TL-1-E03MT |
5 | CATCH or LANDINGS | TL-A1 | Total landings in subarea | MT | Metric tons | TL-A1-MT |
6 | CATCH or LANDINGS | TL-A3 | Total landings in subarea | MT | Metric tons | TL-A3-MT |
7 | CATCH or LANDINGS | CdivMSY-est | Catch divided by MSY calculated from model-est... | ratio | ratio | CdivMSY-est-ratio |
8 | CATCH or LANDINGS | TAC | Total allowable catch | MT | Metric tons | TAC-MT |
9 | CATCH or LANDINGS | TC-2 | Total catch (i.e. landings + discards. Add lan... | MT | Metric tons | TC-2-MT |
10 | CATCH or LANDINGS | TC | Total catch (i.e. landings + discards. Add lan... | E06 | Millions | TC-E06 |
11 | CATCH or LANDINGS | TL-2 | Total landings. Use only when there is more th... | MT | Metric tons | TL-2-MT |
12 | CATCH or LANDINGS | TL-A4 | Total landings in subarea | MT | Metric tons | TL-A4-MT |
13 | CATCH or LANDINGS | CdivMSY-dvmb | Catch divided by MSY calculated from reference... | ratio | ratio | CdivMSY-dvmb-ratio |
This syntax to subset rows (filter) is more complicated than columns (select) – to find rows containing “CATCH or LANDINGS” we have to indicate which column to look for.
column selection and .
For python to know that we are looking for the column called “tscategory”, we use the column selection _.tscategory
. This is actually a shorthand for the pattern tsmetrics.tscategory
– the _
is a placeholder for “the current table” in our chain. Extracting a single column with .
is itself something of a shorthand, it is equivalent to using the selector [
, as tsmetrics["category"]
. When a column name is also the name of a table method, we may need to fall back on the square bracket convention. So why use a dot at all? In addition to taking two less characters to write, the .
method allows “tab completion” of the column name, which helps us avoid typos. Note that our select()
method recognizes either syntax, you can do: tsmetrics.select(_.tscategory)
. This looks slightly more cryptic, but benefits from autocomplete and matches the sytnax of other functions.
==
not =
Another common mistake is to use a single =
sign rather than ==
in filter. Recall that =
is used in variable assignment, a = 1
sets the value of a
as 1. Double-equals is a “boolean operator”, that tests if the statement is True or False:
= 1
a == 1 a
True
Other boolean operators include >
, >=
, !=
(not equal) and so forth. The important thing is to know that we can do boolean comparisons, this syntax is easy to look up.
Next steps
Explore the datasets in this collection using select()
, distinct()
and filter()
. We will return to this list of Catch units after we become more aquainted with the remaining tables.
This syntax is harder than select()
– we can’t just filter for “CATCH or LANDINGS” without indicating which column we are looking in. To signal that tscategory
is a column name and not just a piece of text, we use the subsetting notation, _.tscategory
. This is merely a shorthand for the more verbose: