import ibis
from ibis import _
import ibis.selectors as s
= ibis.duckdb.connect() con
ibis
mutates and aggregates
Learning Goals
- use
group_by()
andaggregate()
patterns to summarize data - use
order_by()
to arrange rows by one or more columns.
= "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"
base_url
= con.read_csv(base_url + "stock.csv", nullstr="NA")
stock = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "assessment.csv", nullstr="NA") assessment
Stock “assessments”
The next thing we need to know is a stock assessment.
assessment
DatabaseTable: ibis_read_csv_inv2bebb7vhktagb64s6mhgzaa assessid string assessorid string stockid string stocklong string recorder string daterecorded timestamp(6) dateloaded timestamp(6) assessyear string assesssource string contacts string notes string pdffile string assess int64 refpoints int64 assessmethod string assesscomments string xlsfilename string mostrecent int64
Are some stockid
s assessed multiple times? One intuitive idea is to filter()
for a single stockid, and see if we get back multiple rows (multiple assessments). Let’s take a look at “COD2J3KL”:
(assessmentfilter(_.stockid == "COD2J3KL")
.
.select(_.assessid, _.assessorid, _.stockid,# pick a subset of columns to focus on
_.daterecorded, _.assessyear)
.execute() )
assessid | assessorid | stockid | daterecorded | assessyear | |
---|---|---|---|---|---|
0 | DFO-COD2J3KL-1959-2014-WATSON | DFO | COD2J3KL | 2016-02-25 | 1959-2014 |
1 | DFO-NFLD-COD2J3KL-1850-2011-CHING | DFO-NFLD | COD2J3KL | 2013-10-22 | 1850-2011 |
2 | DFO-NFLD-COD2J3KL-1959-2018-ASHBROOK | DFO-NFLD | COD2J3KL | 2019-06-20 | 1959-2018 |
3 | DFO-NFLD-COD2J3KL-1959-2021-HIVELY | DFO-NFLD | COD2J3KL | 2023-11-08 | 1959-2021 |
Indeed, it looks like their are four assessments of this stock, each conducted in different years and spanning different periods in time! filter()
ing for each possible stockid would be tedious though. These four assessments that correspond to this stockid
(assessment
.group_by(_.stockid)=_.count())
.agg(n
.execute() )
stockid | n | |
---|---|---|
0 | CSALMAKPSESCD | 1 |
1 | CSALMANVIKR | 1 |
2 | CSALMNORTONSD1 | 1 |
3 | CSALMURSUSCL | 1 |
4 | CSALMYUKONRSR | 1 |
... | ... | ... |
1507 | NPOUTVIa | 2 |
1508 | PLAIC7d | 10 |
1509 | PLAICIIIa | 1 |
1510 | POLLNS-VI-IIIa | 11 |
1511 | WHITIIIa | 4 |
1512 rows × 2 columns
order_by()
Which stockid
s have the most assessments? We can re-order the rows by different columns using the order_by()
. (Changing the row order does not alter any individual row itself – that would mess up the data. Each row is moved as a unit). By default, order
is always increasing, smallest to largest, A to Z. While that might be intuitive for dates or names, if we want to see which stocks have the most assessments, we need n
to be in descending order. We indicate this by appending the .desc()
method to the column:
(assessment
.group_by(_.stockid)=_.count())
.agg(n
.order_by(_.n.desc())
.execute() )
stockid | n | |
---|---|---|
0 | CODIIIaW-IV-VIId | 12 |
1 | SEELNSSA2 | 11 |
2 | SPRAT22-32 | 11 |
3 | PLAICNS | 11 |
4 | HERRNIRS | 11 |
... | ... | ... |
1507 | CSALMSKAGR | 1 |
1508 | PSALMHHAMMAHC | 1 |
1509 | PSALMNISQPS | 1 |
1510 | PSALMSNOHPS | 1 |
1511 | SARDVII-VIIIabd | 1 |
1512 rows × 2 columns