04-23 14:00 TPE: baseline jumps 300→400, etl autoscale_max drops from 150-200 → 50-100. After this point, default pending p95 spikes repeatedly to 200-600s.

| Date | CUD baseline | AS slot-hrs | Used slot-hrs | AS $ | CUD $ | Total $ |
|---|---|---|---|---|---|---|
| 2026-04-21 | 300 | 3,592 | 10,524 | $249.6 | $298.1 | $547.7 |
| 2026-04-22 | 300 | 3,699 | 10,677 | $257.1 | $298.1 | $555.2 |
| 2026-04-23 (split day) | 342 | 3,032 | 10,470 | $210.7 | $339.5 | $550.2 |
| 2026-04-24 (15h partial) | 400 | 1,346 | 6,478 | $93.5 | $397.4 | $490.9 |
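
The unit rates behind the cost table can be recovered by dividing each dollar column by its slot-hours. The following is a minimal sketch, assuming the CUD charge covers baseline × 24 slot-hours per day and that the 04-23 baseline is a time-weighted average of 14 h at 300 and 10 h at 400; both are assumptions, since the table does not state them.

```python
# Rows copied from the cost table: (date, CUD baseline slots, AS slot-hrs, AS $, CUD $).
rows = [
    ("2026-04-21", 300, 3592, 249.6, 298.1),
    ("2026-04-22", 300, 3699, 257.1, 298.1),
    ("2026-04-23", 342, 3032, 210.7, 339.5),
    ("2026-04-24", 400, 1346, 93.5, 397.4),
]

HOURS_PER_DAY = 24  # assumption: CUD is billed for the full day regardless of usage


def implied_rates(row):
    """Return (AS $/slot-hr, CUD $/slot-hr) implied by one table row."""
    _, baseline, as_slot_hrs, as_cost, cud_cost = row
    return as_cost / as_slot_hrs, cud_cost / (baseline * HOURS_PER_DAY)


for row in rows:
    as_rate, cud_rate = implied_rates(row)
    print(f"{row[0]}: AS ${as_rate:.4f}/slot-hr, CUD ${cud_rate:.4f}/slot-hr")

# Split day: 14 h at baseline 300, then 10 h at 400 (change at 14:00 TPE).
split_day_baseline = (14 * 300 + 10 * 400) / HOURS_PER_DAY
print(f"time-weighted baseline on 04-23: {split_day_baseline:.0f}")
```

Under these assumptions every row implies roughly $0.0695 per AS slot-hour and $0.0414 per CUD slot-hour, and the split-day baseline comes out at about 342.
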

| Source | avg pre | avg post | Δ× | p95 pre | p95 post | Δ× | p99 pre | p99 post |
|---|---|---|---|---|---|---|---|---|

default reservation (after 4/23 14:00)

| TPE hour | jobs | pending avg | pending p95 | pending p99 | duration p95 |
|---|---|---|---|---|---|

CUD 400 is committed → $397.4/day fixed. To save $20/day vs pre-4/23 ($555.2), the AS budget is ≤ $137.8/day, i.e. about 83 slots on average. Current AS avg ≈ 85, so spend is already roughly at budget, but it is allocated badly (starves default). The path to $20 savings plus a degradation fix requires re-shaping AS allocation, not just adding capacity.
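
The budget arithmetic above can be reproduced in a few lines. This is a sketch only: the `AS_RATE` constant is an assumption derived from the cost table (AS $ divided by AS slot-hrs), and the pre-change total uses the 2026-04-22 row.

```python
PRE_CHANGE_TOTAL = 555.2  # $/day, 2026-04-22 total from the cost table
SAVINGS_TARGET = 20.0     # $/day
CUD_FIXED = 397.4         # $/day at baseline 400
AS_RATE = 0.0695          # $/slot-hr, assumed from AS $ / AS slot-hrs in the table
CURRENT_AS_AVG = 85       # avg slots, as quoted in the text

# Dollars per day left for autoscale once the target and the fixed CUD are subtracted.
as_budget = PRE_CHANGE_TOTAL - SAVINGS_TARGET - CUD_FIXED
# Average slots that budget buys over a 24 h day.
budget_avg_slots = as_budget / AS_RATE / 24
# What the current average usage costs per day at the same rate.
current_as_cost = CURRENT_AS_AVG * 24 * AS_RATE

print(f"AS budget: ${as_budget:.1f}/day = {budget_avg_slots:.0f} avg slots")
print(f"current AS spend at {CURRENT_AS_AVG} avg slots: ${current_as_cost:.1f}/day")
```

At the assumed rate, 85 average slots costs about $142/day, a few dollars above the $137.8 budget, so "in budget" holds only approximately.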
default has no baseline buffer and depends on cross-reservation idle borrowing to survive peaks (V1/V2 already made this fragile).

Plan A:

| Action | Δ AS avg slots | Δ cost /day | Notes |
|---|---|---|---|
| default AS_max 50 floor all hours, 100 at 8-14 TPE | +20 | +$33 | core of the fix — direct relief |
| etl AS_max 50 → 80 normal, 100 at 02am | +12 | +$20 | rebuild idle pool for default to borrow |
| ka AS_max 50 → 35 | -1 | -$1.5 | p95 ka rarely exceeds 30 |
| etl-backfill AS_max 50 → 25 | ~0 | -$0.5 | avg usage is 0, just a cap reduction |
| Net | +31 | +$51 | |

→ Total cost: ~$555 + (net AS) = roughly $545-555/day, which saves $0-10/day vs pre. default p95 pending: 260s → ~40s ✓

Plan B:

| Action | Δ AS avg slots | Δ cost /day | Notes |
|---|---|---|---|
| default AS_max 50 only at 8-14 TPE (7h) | +8 | +$9.7 | fix worst-hour pending; off-peak stays weak |
| etl AS_max 50 → 70 at peak hours 07-12 only | +4 | +$5.5 | targeted idle pool recovery |
| ka AS_max 50 → 35 | -1 | -$1.5 | safe |
| etl-backfill AS_max 50 → 25 | ~0 | -$0.5 | safe |
| Net change vs current | +11 | +$13 | |

→ Current saves ~$16/day; adding $13 back leaves ~$3-8/day. Compared against the 4/24 full-day extrapolation, where savings are $16, savings after the fixes become ~$10-15/day. default business-hours p95: 260s → ~60s; off-peak Airflow bursts stay at ~200s.

Plan C:

| Action | Δ AS avg slots | Δ cost /day | Notes |
|---|---|---|---|
| default AS_max 25 at 9-13 TPE only (5h) | +4 | +$5.8 | lite version of fix |
| etl AS_max cut off-peak (0-6 TPE) to 30 | -5 | -$8.3 | risk: etl peak 02am may queue if demand > 180 |
| ka AS_max 50 → 30 | -1.5 | -$2.5 | mild peak risk |
| etl-backfill AS_max 50 → 25 | ~0 | -$0.5 | safe |
| Net change vs current | -2.5 | -$5.5 | |

→ Current saves ~$16/day; the additional $5.5 brings savings to ~$20-22/day ✓ (hits target). But default Redash/User Query p95 only improves to ~120s, and the etl 02am peak may queue. Analysts still feel it.
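
The net rows of the three action tables can be cross-checked by summing the per-action deltas. A minimal sketch, assuming the tables correspond to Plans A, B and C in the order they appear, and treating the "~0" entries as 0 slots:

```python
# (Δ AS avg slots, Δ cost $/day) per action, copied from the three plan tables.
plans = {
    "A": [(20, 33.0), (12, 20.0), (-1, -1.5), (0, -0.5)],
    "B": [(8, 9.7), (4, 5.5), (-1, -1.5), (0, -0.5)],
    "C": [(4, 5.8), (-5, -8.3), (-1.5, -2.5), (0, -0.5)],
}


def net(actions):
    """Sum slot and cost deltas for one plan; cost is rounded to one decimal."""
    slots = sum(s for s, _ in actions)
    cost = sum(c for _, c in actions)
    return slots, round(cost, 1)


for name, actions in plans.items():
    slots, cost = net(actions)
    print(f"Plan {name}: net {slots:+g} avg slots, {cost:+.1f} $/day")
```

The sums agree with the stated net rows (+31 / +$51, +11 / about +$13, -2.5 / -$5.5), so any remaining disagreement is in how those nets are converted into daily savings, not in the tables themselves.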

- Plan A if analyst experience matters more than the $20 target: fixes default cleanly and still saves up to ~$10/day.
- Plan B is the pragmatic middle: fixes the visible pain (business hours) and saves ~$10-15/day.
- Plan C meets the $20 target but leaves Redash/User Query queue times above 100s. Choose it only if cost is a hard budget constraint.

If the $20 target is firm AND degradation must be fully fixed, reconsider cutting core-realtime AS 50→25: avg usage is 6.6 slots, p95 = 24, provisioned 31, so roughly 4× headroom over average usage remains. That alone frees $25/day and closes the gap with zero new risk elsewhere.

- mart__bq_usage__source_summary__hourly: pending p95 by reservation × source
- RESERVATIONS_TIMELINE: verify autoscale caps match the new config
- gcp_billing_for_report: verify daily cost matches the Plan estimate within $2/day
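
The cap and cost verifications could be automated once the relevant values have been pulled from those tables. The sketch below is illustrative only: `within_tolerance` and `caps_mismatches` are hypothetical helper names, the expected caps are taken from the ka 50 → 30 and etl-backfill 50 → 25 actions, and the configured caps and cost figures are made-up placeholders, not real readings.

```python
def within_tolerance(actual_cost, plan_estimate, tolerance=2.0):
    """True if the billed daily cost is within `tolerance` $/day of the plan estimate."""
    return abs(actual_cost - plan_estimate) <= tolerance


def caps_mismatches(configured, expected):
    """Return {reservation: (configured, expected)} for every autoscale cap that differs."""
    return {
        name: (configured.get(name), cap)
        for name, cap in expected.items()
        if configured.get(name) != cap
    }


# Placeholder inputs: in practice these would come from RESERVATIONS_TIMELINE
# and gcp_billing_for_report; the numbers here are invented for illustration.
expected_caps = {"ka": 30, "etl-backfill": 25}
configured_caps = {"ka": 30, "etl-backfill": 50}

print(caps_mismatches(configured_caps, expected_caps))
print(within_tolerance(actual_cost=536.0, plan_estimate=534.5))
```

Keeping the comparison logic separate from the queries means the same helpers work whichever plan is chosen; only the expected caps and the plan estimate change.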