`ibis` mutates and aggregates

Learning Goals

use group_by() and aggregate() patterns to summarize data
use order_by() to arrange rows by one or more columns.

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")

Stock “assessments”

The next thing we need to know is a stock assessment.

assessment

DatabaseTable: ibis_read_csv_inv2bebb7vhktagb64s6mhgzaa
  assessid       string
  assessorid     string
  stockid        string
  stocklong      string
  recorder       string
  daterecorded   timestamp(6)
  dateloaded     timestamp(6)
  assessyear     string
  assesssource   string
  contacts       string
  notes          string
  pdffile        string
  assess         int64
  refpoints      int64
  assessmethod   string
  assesscomments string
  xlsfilename    string
  mostrecent     int64

Are some stockids assessed multiple times? One intuitive idea is to filter() for a single stockid, and see if we get back multiple rows (multiple assessments). Let’s take a look at “COD2J3KL”:

(assessment
 .filter(_.stockid == "COD2J3KL")
 .select(_.assessid, _.assessorid, _.stockid,
         _.daterecorded, _.assessyear) # pick a subset of columns to focus on
 .execute()
)

	assessid	assessorid	stockid	daterecorded	assessyear
0	DFO-COD2J3KL-1959-2014-WATSON	DFO	COD2J3KL	2016-02-25	1959-2014
1	DFO-NFLD-COD2J3KL-1850-2011-CHING	DFO-NFLD	COD2J3KL	2013-10-22	1850-2011
2	DFO-NFLD-COD2J3KL-1959-2018-ASHBROOK	DFO-NFLD	COD2J3KL	2019-06-20	1959-2018
3	DFO-NFLD-COD2J3KL-1959-2021-HIVELY	DFO-NFLD	COD2J3KL	2023-11-08	1959-2021

Indeed, it looks like their are four assessments of this stock, each conducted in different years and spanning different periods in time! filter()ing for each possible stockid would be tedious though. These four assessments that correspond to this stockid

(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .execute()
)

	stockid	n
0	CSALMAKPSESCD	1
1	CSALMANVIKR	1
2	CSALMNORTONSD1	1
3	CSALMURSUSCL	1
4	CSALMYUKONRSR	1
...	...	...
1507	NPOUTVIa	2
1508	PLAIC7d	10
1509	PLAICIIIa	1
1510	POLLNS-VI-IIIa	11
1511	WHITIIIa	4

1512 rows × 2 columns

`order_by()`

Which stockids have the most assessments? We can re-order the rows by different columns using the order_by(). (Changing the row order does not alter any individual row itself – that would mess up the data. Each row is moved as a unit). By default, order is always increasing, smallest to largest, A to Z. While that might be intuitive for dates or names, if we want to see which stocks have the most assessments, we need n to be in descending order. We indicate this by appending the .desc() method to the column:

(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .order_by(_.n.desc())
 .execute()
)

	stockid	n
0	CODIIIaW-IV-VIId	12
1	SEELNSSA2	11
2	SPRAT22-32	11
3	PLAICNS	11
4	HERRNIRS	11
...	...	...
1507	CSALMSKAGR	1
1508	PSALMHHAMMAHC	1
1509	PSALMNISQPS	1
1510	PSALMSNOHPS	1
1511	SARDVII-VIIIabd	1

1512 rows × 2 columns

ibis mutates and aggregates

Learning Goals

Stock “assessments”

order_by()

`ibis` mutates and aggregates

`order_by()`