The STAR method structure made this incredibly easy to follow. The Float32 downcast is such an underrated optimization — I've seen similar wins when processing financial transaction data on constrained hardware (Mac Mini with 64GB). Moving from eager Pandas to lazy evaluation completely changed how I think about memory budgets.
One thing I'm curious about: did you benchmark the np.einsum call specifically? In my experience that's often where the memory spike happens before Polars even gets involved. I'm wondering whether chunking the einsum operands themselves (before creating LazyFrames) gave additional savings.
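To make the chunking idea concrete, here's a sketch of what I have in mind for a plain matrix contraction (shapes and the helper name are hypothetical, just for illustration):

```python
import numpy as np

def chunked_einsum(a: np.ndarray, b: np.ndarray, chunk: int = 1024) -> np.ndarray:
    """Contract (n, k) with (k, m), processing `a` in row slabs so only
    one (chunk, m) output slab is being materialized at a time."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    for start in range(0, a.shape[0], chunk):
        stop = start + chunk
        # each einsum call only touches a slab of `a`, not the whole array
        out[start:stop] = np.einsum("ik,kj->ij", a[start:stop], b)
    return out

rng = np.random.default_rng(0)
a = rng.random((5000, 64)).astype(np.float32)
b = rng.random((64, 32)).astype(np.float32)
res = chunked_einsum(a, b)
```

The peak intermediate allocation scales with `chunk` instead of `n`, which is exactly the kind of headroom that matters before the data ever reaches a LazyFrame.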
Great write-up — bookmarking this as a reference for anyone still hitting OOM with Pandas.