Note: This syllabus is a living document and subject to change as the course progresses.

Instructor: Carl Boettiger
Location: Room 166 - Social Sciences Building
Time: M/W 10:00 - 11:30

Course Overview

This course explores the cutting edge of data science for environmental problem-solving, with a focus on reproducible, collaborative, and scalable workflows. In Spring 2026, we adopt an AI-augmented, agent-first methodology: the emphasis shifts from memorizing syntax to architecting, prompting, auditing, and deploying solutions.

The curriculum is structured around 5 Core Modules over 15 weeks, giving us time to explore technical stacks and theoretical frameworks in depth. We will work primarily in the R ecosystem, integrating modern tools for high-performance computing and AI.

Before we begin, please ensure you have completed the Compute Setup.

The Agent-First Methodology

  • Architect: Design data flows and system architecture.
  • Prompt: Direct AI implementation effectively.
  • Audit: Verify outputs and ensure reproducibility.
  • Deploy: Ship robust, scalable solutions.

Course Modules

Module 1: The AI-Data Analyst

Duration: Weeks 1-4
Focus: AI-assisted coding, high-performance tabular data, visualization.

  • Core Concepts:
    • Reproducible environments (DevContainers, Codespaces).
    • High-performance queries with DuckDB and Parquet via duckdbfs within the tidyverse (see the sketch after this list).
    • Interactive dashboards with Shiny.
    • Data cleaning and visualization pipelines.
  • Deep Dive: Advanced SQL strategies, data-pipeline optimization, and rigorous data validation frameworks built with pointblank or similar R tools.
  • Project: Build an interactive emissions dashboard.
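
To make the duckdbfs workflow concrete, here is a minimal sketch of a lazy Parquet query inside a tidyverse pipeline. The file name (emissions.parquet) and its columns (sector, year, co2_tons) are hypothetical; the query only runs in DuckDB when collect() is called.

```r
library(duckdbfs)
library(dplyr)

# Lazily register a (hypothetical) Parquet file; nothing is read into memory yet.
emissions <- open_dataset("emissions.parquet")

emissions |>
  filter(year >= 2010) |>
  group_by(sector, year) |>
  summarise(total_co2 = sum(co2_tons, na.rm = TRUE), .groups = "drop") |>
  collect()  # DuckDB executes the query and returns a tibble
```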

Module 2: Spatial Data & Environmental Justice

Duration: Weeks 5-7
Focus: Geospatial analysis, cloud-native formats, EJ lens.

  • Core Concepts:
    • Vector Data: Analysis with DuckDB Spatial (illustrated in the sketch below).
    • Raster Data: Scalable processing with gdalcubes.
    • Cloud-native geospatial formats (COG, GeoParquet).
    • Mapping environmental impacts (e.g., data centers, biodiversity).
    • Introduction to Geospatial Foundation Models.
  • Deep Dive: Handling massive geospatial datasets, working with custom coordinate reference systems, and critical data studies applied to Environmental Justice.
  • Project: Map and analyze environmental impacts of infrastructure.
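
As an illustration of driving the DuckDB spatial extension from R, here is a minimal sketch using plain SQL over DBI. The GeoPackage facilities.gpkg and its name column are hypothetical, and the distances are planar, in the units of the layer's coordinate reference system (ST_Read exposes the geometry column as geom by default).

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")

# Five facilities nearest to a reference point (hypothetical file and columns).
dbGetQuery(con, "
  SELECT name,
         ST_Distance(geom, ST_Point(-122.27, 37.87)) AS dist
  FROM ST_Read('facilities.gpkg')
  ORDER BY dist
  LIMIT 5
")

dbDisconnect(con, shutdown = TRUE)
```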

Module 3: Working with LLMs & Unstructured Data

Duration: Weeks 8-10
Focus: R-based LLM tools, Structured Outputs, Unstructured data extraction.

  • Core Concepts:
    • The leading edge of LLM tooling in R, using Posit PBC's packages:
      • ellmer for programmatic LLM interaction.
      • mcptools for Model Context Protocol integration.
      • ragnar for retrieval-augmented generation (RAG).
      • vitals (https://vitals.tidyverse.org/) for model evaluation.
    • Extracting structured data (JSON) from unstructured text (PDFs, reports), as sketched below.
  • Deep Dive: Comparing local vs. cloud models, fine-tuning considerations, and evaluating LLM outputs for scientific rigor.
  • Project: Extract and analyze data from corporate sustainability reports.
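
Here is a minimal sketch of the structured-output idea with ellmer. The model choice, report snippet, and field names are hypothetical; the method is chat_structured() in recent ellmer releases (earlier versions exposed it as extract_data()), and an OpenAI API key is assumed to be available in the environment.

```r
library(ellmer)

chat <- chat_openai(model = "gpt-4o-mini")  # hypothetical model; assumes OPENAI_API_KEY is set

# Pull typed fields (JSON-like structured data) out of free text from a report.
chat$chat_structured(
  "In FY2024 Acme Corp reported Scope 1 emissions of 1.2 Mt CO2e and Scope 2 of 0.8 Mt CO2e.",
  type = type_object(
    company        = type_string(),
    fiscal_year    = type_integer(),
    scope1_mt_co2e = type_number(),
    scope2_mt_co2e = type_number()
  )
)
```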

Module 4: Scaling & Cloud Computing

Duration: Weeks 11-12
Focus: Container orchestration, Kubernetes, Cloud architecture.

  • Core Concepts:
    • Introduction to Cloud Infrastructure (AWS/Azure/GCP).
    • Containerization beyond local Docker.
    • Kubernetes (K8s): Concepts (Pods, Deployments, Services) and practice.
    • Scaling R-based data analysis workflows on the cloud (see the example following this list).
  • Project: Deploy a scalable data application or analysis pipeline to a Kubernetes cluster.
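
As a local stand-in for cloud-scale execution, the sketch below fans a computation out across parallel workers using the future and furrr packages, which are not part of the syllabus's named stack; on cloud or Kubernetes-backed infrastructure the same code would target a cluster plan rather than local sessions.

```r
library(future)
library(furrr)   # assumes the future/furrr packages; not named in the syllabus

plan(multisession, workers = 4)  # swap for a cluster backend when running remotely

# Fan a placeholder per-site computation out across the workers.
site_means <- future_map_dbl(1:100, function(site) {
  mean(rnorm(1e5, mean = site))
})

plan(sequential)  # release the workers
```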

Module 5: Capstone Project

Duration: Weeks 13-15
Focus: Synthesis, Innovation, Deployment.

  • Goal: Students work individually or in small teams to conceptualize, build, and deploy a significant data science project applying techniques from the previous modules.
  • Stages:
    • Scope: Proposal and architectural design.
    • Sprint: Agile development of a minimum viable product (MVP).
    • Refine: User experience and documentation.
    • Demo: Live presentation of the deployed application.

Tech Stack

  • IDE: VS Code / Antigravity
  • Language: R (primary)
  • Data: DuckDB, duckdbfs, GeoParquet, gdalcubes
  • AI: ellmer, mcptools, ragnar, vitals
  • Compute: GitHub Codespaces, Kubernetes
  • Web: Quarto, Shiny

Evaluation

The course grade is based on 5 Core Modules. Each module is worth 20 points, for a total of 100 points. Participation and engagement are assessed as part of the score for each module.

Component                                         Points
Module 1: The AI-Data Analyst                         20
Module 2: Spatial Data & Environmental Justice        20
Module 3: Working with LLMs & Unstructured Data       20
Module 4: Scaling & Cloud Computing                   20
Module 5: Capstone Project                            20
Total                                                100

See Policies for full details on grading, participation, and attendance.