Chapter 7 · Scale

Working with Large Datasets

When CSVs and Stata stop scaling. Parquet, DuckDB, and the metadata pattern.

You may not need this page on day one. Come back here when you encounter a dataset that does not fit comfortably in Stata or pandas.

The Problem (and the New Tradeoff)

Most economists work with CSVs and .dta files. For small datasets, this is fine. But when data grows to 10, 50, or 70 gigabytes, the familiar workflow breaks down. Stata crawls. Pandas runs out of memory and crashes. Excel cannot open the file at all, because a worksheet tops out at 1,048,576 rows.

The typical response is to make do: split the data into pieces, load one year at a time, save intermediate files everywhere, lose track of which version is which. You end up with a folder full of data_v2_final_FINAL.csv files and no documentation.

There is a better approach: convert your data to a compressed columnar format (Parquet), store it in a queryable database (DuckDB), and document it with metadata tables. Until recently, learning these tools was a week-long investment that was hard to justify for any single project.

With Claude Code, you describe what you want in plain English and get a professional-grade pipeline in an afternoon. The pattern works for any large administrative dataset: HMDA mortgage data, Medicare claims, patent filings, trade transaction records, Census microdata.

Parquet: Better Storage

Parquet stores data column by column instead of row by row. One consequence: it compresses dramatically. In one project, 70 GB of raw CSVs became about 6 GB of Parquet files, roughly 12x smaller. Another consequence: DuckDB can read just the columns it needs without loading the full file into memory. If you want the mean of one variable across 200 million rows, it reads that one column. Your laptop handles it fine.

Your message

/prompt Convert all the CSVs in data/raw/ to parquet files.
Save them in data/parquet/.

DuckDB: Your Research Database

DuckDB is like SQLite: a single .duckdb file, zero configuration, no server, no database administrator. But it handles hundreds of millions of rows on a laptop. You describe what you want in English, Claude translates to SQL, DuckDB runs it in seconds. You never write SQL yourself unless you want to.

Your message

/prompt Build a DuckDB database from the parquet files in data/parquet/.
Then create a county-by-year aggregate table with:
- total loan originations
- total dollar volume
- number of active lenders
- denial rate
- median loan amount

One SQL query, 291 million rows, a few seconds. The database is a single file you can hand to a coauthor.

The old way

"I have CSVs everywhere. Let me load them one year at a time in Stata, save intermediate files, and hope I remember which version is current."

The new way

"Build a DuckDB database from these parquet files. Include a metadata table that documents every column."

Metadata Tables: Context Engineering for Data

When your data is too big to browse, put the documentation inside the database itself. A metadata table records the name, type, description, valid values, and year availability for every column:

SQL query

SELECT * FROM metadata WHERE column_name = 'action_taken';

-- Returns:
-- column_name: action_taken
-- type: integer
-- description: Loan application outcome
-- values: 1=originated, 2=approved not accepted,
--         3=denied, 4=withdrawn, 5=file closed
-- available_years: 2007-2024

Next time you, a coauthor, or Claude in a new session opens this database, SELECT * FROM metadata explains everything. No one calls you to ask what action_taken = 3 means. Claude can write correct queries without ever scanning the raw data. The data documents itself.

You would probably never build this by hand: the cost is paid now and the payoff comes later, so it never happens. With Claude, the cost is close to zero: describe what you want documented and it writes the metadata for you.

The Pipeline Pattern

Most large-data projects follow the same five-step pattern:

1. Download (with resume capability)

↓

2. Convert CSV to parquet

↓

3. Harmonize formats and eras

↓

4. Document with metadata table

↓

5. Query and analyze

Ask Claude to build this as a set of scripts, not as inline commands. The scripts are the reproducible artifact. Here is what a well-organized project looks like:

Project structure

project/
├── data/
│   ├── raw/              (original CSVs)
│   ├── parquet/          (converted, compressed)
│   └── database.duckdb   (queryable database)
├── code/
│   ├── download.py
│   ├── convert.py
│   ├── harmonize.py
│   └── analyze.py
└── figures/
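"Resume capability" in step 1 usually means asking the server for only the bytes you do not have yet. A minimal sketch of what download.py might contain, using only the Python standard library: the helper names and the .part suffix are assumptions, and the approach relies on the server supporting HTTP Range requests (if it does not, the sketch falls back to downloading the whole file again).

```python
import urllib.error
import urllib.request
from pathlib import Path


def partial_path(dest: Path) -> Path:
    """Name of the temporary file that holds an unfinished download."""
    return dest.with_name(dest.name + ".part")


def resume_offset(part: Path) -> int:
    """Number of bytes already on disk from an earlier attempt."""
    return part.stat().st_size if part.exists() else 0


def download_with_resume(url: str, dest: Path, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, continuing a partial download if one exists."""
    if dest.exists():  # finished on a previous run
        return
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = partial_path(dest)
    offset = resume_offset(part)
    request = urllib.request.Request(url)
    if offset:
        request.add_header("Range", f"bytes={offset}-")
    try:
        with urllib.request.urlopen(request) as response:
            # HTTP 206 means the server honored the Range header, so append.
            # Anything else means the full file is being sent, so start over.
            resumed = getattr(response, "status", None) == 206
            with open(part, "ab" if resumed else "wb") as out:
                while chunk := response.read(chunk_size):
                    out.write(chunk)
    except urllib.error.HTTPError as err:
        if err.code != 416:  # 416: nothing left to fetch beyond the offset
            raise
    # Rename only after the transfer ends, so dest is never a partial file.
    part.replace(dest)
```

Writing to a .part file and renaming at the end means a crash mid-download never leaves something that looks like a complete CSV in data/raw/.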

One thing to watch for in complex pipeline tasks: Claude may spawn sub-agents for research or exploration. These sub-agents do not always inherit all your constraints. If you told Claude "do not look in the parent directory" but it delegates exploration to a sub-agent, that instruction may be lost. Re-state important constraints if you notice Claude exploring places it should not.

Verification for Large Data

You cannot eyeball millions of rows. Verification at scale looks different.

Sanity checks: Ask Claude basic questions after building the database. Do the top counties, firms, or categories match expectations? Does the time series show known patterns (the 2008 crisis, the COVID spike, seasonal variation)? If Los Angeles is not in the top 5 counties for mortgage volume, something broke.

Quality reports: Ask Claude for a quality report: how many files parsed successfully, distributions of key variables, suspiciously short or empty extractions. These catch problems that sanity checks miss.

The 95/5 pattern: Automated pipelines get you 95% of the way; the last 5% requires judgment. In one project, a parser extracted the target section from 119 out of 120 filings. The one failure had an unusual formatting quirk. For a quick analysis, 119 out of 120 is fine. For a published paper, you go back and fix it. Either way, the pipeline did the bulk of the work.
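Both kinds of check are easy to turn into small functions that run after every rebuild. The sketch below uses only the standard library and works on results you have already pulled out of the database; the function names and the 200-character threshold for a "suspiciously short" extraction are assumptions to adjust for your data.

```python
from dataclasses import dataclass, field


def missing_from_top(totals: dict[str, float], expected: list[str],
                     k: int = 5) -> list[str]:
    """Sanity check: which expected names are NOT among the k largest?"""
    top = sorted(totals, key=totals.get, reverse=True)[:k]
    return [name for name in expected if name not in top]


@dataclass
class QualityReport:
    total: int
    parsed: int
    flagged: list[str] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return self.parsed / self.total if self.total else 0.0


def quality_report(extractions: dict[str, str],
                   min_chars: int = 200) -> QualityReport:
    """Flag files whose extracted text is empty or suspiciously short."""
    flagged = sorted(
        name for name, text in extractions.items()
        if len(text.strip()) < min_chars
    )
    return QualityReport(
        total=len(extractions),
        parsed=len(extractions) - len(flagged),
        flagged=flagged,
    )
```

An empty list from missing_from_top means the sanity check passed. The flagged list in the report is the 5% that needs your judgment.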

Try It: CSV to Parquet to DuckDB

Try the entire pattern right now with the same data/examples/UNRATE.csv from the Core Workflow page. The file is tiny, but the workflow is identical at any scale.

Step 1: Ask Claude:

Your message

/prompt Convert data/examples/UNRATE.csv to a parquet file.
Then create a DuckDB database from it with a metadata table
that documents every column: name, type, description, and value range.

Step 2: Claude produces a .parquet file and a .duckdb file. (On a file this tiny the Parquet version may be barely smaller than the CSV; the savings show up at scale.) Check the metadata:

Your message

/prompt Show me the metadata table. Then compute the average
unemployment rate by decade.

Step 3: Check the results. You now have a self-documenting, queryable dataset from a plain CSV. Under two minutes, start to finish.