Our checkout service had a query joining orders, line_items, products, and inventory. no indexes on the foreign keys. at 2pm traffic spiked, query went from 40ms to 8+ seconds, connections pooled up, everything locked. classic death spiral.
The query was dead simple in dev. ran fine for months. nobody was looking at explain analyze output. we were just shipping it.
what i'd do differently: make explain analyze part of the deploy checklist. seriously. before any schema change or new query hits production, run it with realistic data volume. we learned postgres can't read minds about what you actually need indexed.
the fix took 20 minutes once we knew what to look at. added indexes on order_id and product_id in the joins:
CREATE INDEX idx_line_items_order_id ON line_items(order_id);
CREATE INDEX idx_products_inventory_product_id ON inventory(product_id);
ran the query again, dropped to 35ms. the bigger lesson was ignoring our own dashboards. we had slow query logs enabled but nobody was reviewing them. that's the real mistake.
No responses yet.