Great writeup! I hit similar memory walls running data pipelines on a Mac Mini with 64GB unified memory. The lazy evaluation pattern in Polars was a game-changer for me too — especially when processing hundreds of JSON files from multiple API sources simultaneously.
One thing I found helpful was combining Polars with streaming writes, so I never hold more than one chunk in memory at a time. Did you experiment with scan_parquet for the initial reads, or were you always loading from raw CSVs?
Also curious whether you benchmarked Polars' streaming mode against a regular collect — in my case, streaming cut peak memory by ~60% on larger datasets.