ibis mutates and aggregates

Learning Goals

  • use group_by() and aggregate() patterns to summarize data
  • use order_by() to arrange rows by one or more columns.
import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()
base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")

Stock “assessments”

The next thing we need to know is a stock assessment.

assessment
DatabaseTable: ibis_read_csv_inv2bebb7vhktagb64s6mhgzaa
  assessid       string
  assessorid     string
  stockid        string
  stocklong      string
  recorder       string
  daterecorded   timestamp(6)
  dateloaded     timestamp(6)
  assessyear     string
  assesssource   string
  contacts       string
  notes          string
  pdffile        string
  assess         int64
  refpoints      int64
  assessmethod   string
  assesscomments string
  xlsfilename    string
  mostrecent     int64

Are some stockids assessed multiple times? One intuitive idea is to filter() for a single stockid, and see if we get back multiple rows (multiple assessments). Let’s take a look at “COD2J3KL”:

(assessment
 .filter(_.stockid == "COD2J3KL")
 .select(_.assessid, _.assessorid, _.stockid,
         _.daterecorded, _.assessyear) # pick a subset of columns to focus on
 .execute()
)
assessid assessorid stockid daterecorded assessyear
0 DFO-COD2J3KL-1959-2014-WATSON DFO COD2J3KL 2016-02-25 1959-2014
1 DFO-NFLD-COD2J3KL-1850-2011-CHING DFO-NFLD COD2J3KL 2013-10-22 1850-2011
2 DFO-NFLD-COD2J3KL-1959-2018-ASHBROOK DFO-NFLD COD2J3KL 2019-06-20 1959-2018
3 DFO-NFLD-COD2J3KL-1959-2021-HIVELY DFO-NFLD COD2J3KL 2023-11-08 1959-2021

Indeed, it looks like their are four assessments of this stock, each conducted in different years and spanning different periods in time! filter()ing for each possible stockid would be tedious though. These four assessments that correspond to this stockid

(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .execute()
)
stockid n
0 CSALMAKPSESCD 1
1 CSALMANVIKR 1
2 CSALMNORTONSD1 1
3 CSALMURSUSCL 1
4 CSALMYUKONRSR 1
... ... ...
1507 NPOUTVIa 2
1508 PLAIC7d 10
1509 PLAICIIIa 1
1510 POLLNS-VI-IIIa 11
1511 WHITIIIa 4

1512 rows × 2 columns

order_by()

Which stockids have the most assessments? We can re-order the rows by different columns using the order_by(). (Changing the row order does not alter any individual row itself – that would mess up the data. Each row is moved as a unit). By default, order is always increasing, smallest to largest, A to Z. While that might be intuitive for dates or names, if we want to see which stocks have the most assessments, we need n to be in descending order. We indicate this by appending the .desc() method to the column:

(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .order_by(_.n.desc())
 .execute()
)
stockid n
0 CODIIIaW-IV-VIId 12
1 SEELNSSA2 11
2 SPRAT22-32 11
3 PLAICNS 11
4 HERRNIRS 11
... ... ...
1507 CSALMSKAGR 1
1508 PSALMHHAMMAHC 1
1509 PSALMNISQPS 1
1510 PSALMSNOHPS 1
1511 SARDVII-VIIIabd 1

1512 rows × 2 columns