BQ Slot 4/23 Config Change — Degradation Analysis

Period: 2026-04-21 → 2026-04-24 · Config cliff: 2026-04-23 14:00 TPE · etl baseline 50→150, etl autoscale_max -100 (150-200 → 50-100), CUD 300→400
VERDICT: Current savings ~$8-16/day vs pre-4/23, but Redash/User Query p95 pending grew 10-100×. Target: save $20/day while fixing default degradation. Constraint: keep core-realtime AS=50 (do not cut). See Recommendation tab for 3 plans with explicit cost/latency tradeoffs.
Cost saved / day: $8-16 (vs pre-4/23 baseline)
Autoscale cap Δ: -42% (3699 → 2154 slot-hrs/day)
default p95 pending: 5s → 230s (46× regression)
Redash p95 pending: 27s → 292s (10.8×, user-facing)
Job usage Δ: -3% (workload essentially unchanged)
etl duration p95: +5% (204s → 215s, minor)
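The headline ratios above can be re-derived from the before/after pairs shown on the cards. The following is a minimal sketch; the helper names `ratio` and `pct_change` are illustrative and not part of the dashboard.

```python
def ratio(before, after):
    """Multiplicative change, e.g. 5s -> 230s is a 46x regression."""
    return after / before


def pct_change(before, after):
    """Relative change in percent, negative when the value dropped."""
    return (after - before) / before * 100


# Before/after pairs copied from the KPI cards.
default_p95_ratio = ratio(5, 230)        # default p95 pending, seconds
redash_p95_ratio = ratio(27, 292)        # Redash p95 pending, seconds
autoscale_change = pct_change(3699, 2154)  # AS slot-hrs/day
etl_p95_change = pct_change(204, 215)    # etl duration p95, seconds

print(f"default p95 pending: {default_p95_ratio:.0f}x")
print(f"Redash p95 pending:  {redash_p95_ratio:.1f}x")
print(f"autoscale slot-hrs:  {autoscale_change:+.0f}%")
print(f"etl duration p95:    {etl_p95_change:+.0f}%")
```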

Total capacity (baseline + autoscale) · 4/22 → now

Cliff at 04-23 14:00 TPE: baseline jumps 300→400, etl autoscale_max drops from 150-200 → 50-100.

Per-reservation autoscale (shows which reservations shrunk)

Pending time percentile — by reservation

Vertical dashed line marks config change (04-23 14:00 TPE). After this point, default pending p95 spikes repeatedly to 200-600s.

Daily summary — allocation vs cost

Date | CUD baseline | AS slot-hrs | Used slot-hrs | AS $ | CUD $ | Total $
2026-04-21 | 300 | 3,592 | 10,524 | $249.6 | $298.1 | $547.7
2026-04-22 | 300 | 3,699 | 10,677 | $257.1 | $298.1 | $555.2
2026-04-23 (split day) | 342 | 3,032 | 10,470 | $210.7 | $339.5 | $550.2
2026-04-24 (15h partial) | 400 | 1,346 | 6,478 | $93.5 | $397.4 | $490.9
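Because 4/24 covers only 15 hours, its totals are not directly comparable to a full day. The sketch below scales the partial-day autoscale figures to 24 hours. It makes two assumptions that the table does not state: autoscale usage continues at the same average rate for the rest of the day, and the $397.4 CUD figure is already a full-day charge (it matches 400 slots at the per-slot rate implied by the 300-slot days).

```python
HOURS_OBSERVED = 15  # 4/24 is a 15h partial day per the table

# Figures copied from the daily summary table.
pre_total_usd = 555.2         # 4/22 total, the pre-change reference day
partial_as_slot_hrs = 1346    # 4/24 autoscale slot-hours (15h)
partial_as_usd = 93.5         # 4/24 autoscale cost (15h)
cud_usd = 397.4               # 4/24 CUD cost, assumed to be a full-day charge

# Linear extrapolation to 24h (assumption: same average rate all day).
scale = 24 / HOURS_OBSERVED
full_day_as_slot_hrs = partial_as_slot_hrs * scale
full_day_as_usd = partial_as_usd * scale
full_day_total_usd = cud_usd + full_day_as_usd
saving_vs_pre_usd = pre_total_usd - full_day_total_usd

print(f"AS slot-hrs/day (extrapolated): {full_day_as_slot_hrs:.0f}")
print(f"Total $/day (extrapolated):     {full_day_total_usd:.1f}")
print(f"Saving vs 4/22:                 {saving_vs_pre_usd:.1f}")
```

Under these assumptions the extrapolated autoscale figure reproduces the 2154 slot-hrs/day shown in the summary cards, and the saving lands at the low end of the $8-16/day range.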

default reservation — pending by source (p95 over time)

All consumer sources of default are degraded. Redash/User Query spikes are user-facing.

Before (4/22) vs After (4/24) — default reservation only

Source | avg pre | avg post | Δ× | p95 pre | p95 post | Δ× | p99 pre | p99 post

Top-15 worst hours on default reservation (after 4/23 14:00)

Hours sorted by pending p95. These are the moments where analysts / DAGs hit the wall.
TPE hour | jobs | pending avg | pending p95 | pending p99 | duration p95
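A ranking like the one described above can be produced by grouping jobs by hour, taking the p95 of pending time within each hour, and sorting descending. The sketch below is illustrative only: the sample records are hypothetical, and the nearest-rank percentile is an assumption, since the source does not say which percentile method the dashboard uses.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile (q in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def worst_hours(jobs, top_n=15):
    """Return (hour, job_count, pending_p95) rows, worst p95 first."""
    by_hour = defaultdict(list)
    for hour, pending_s in jobs:
        by_hour[hour].append(pending_s)
    rows = [(hour, len(v), percentile(v, 95)) for hour, v in by_hour.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top_n]


# Hypothetical (TPE hour, pending seconds) records, not real dashboard data.
sample_jobs = [
    ("2026-04-23 16:00", 12),
    ("2026-04-23 16:00", 20),
    ("2026-04-23 15:00", 4),
    ("2026-04-23 15:00", 310),
]

ranked = worst_hours(sample_jobs)
for hour, count, p95 in ranked:
    print(hour, count, p95)
```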

Math reality (constraint: don't cut core-realtime)

CUD 400 is committed, so $397.4/day is fixed. To save $20/day vs pre-4/23 ($555/day), the AS budget must be ≤ $137.8/day, which is an average of about 83 slots. Current AS avg ≈ 85, so spend is already close to that budget, but it is allocated badly (it starves default). Reaching $20/day savings while fixing the degradation requires re-shaping the AS allocation, not just adding capacity.
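The budget arithmetic above can be reproduced as follows. This is a sketch under one assumption: the autoscale price per slot-hour is not stated in the source, so it is inferred here from the 4/22 row of the daily summary ($257.1 for 3,699 slot-hrs).

```python
# Figures from the daily summary and the target stated above.
pre_total_usd = 555.2     # 4/22 total cost per day
target_saving_usd = 20.0  # desired saving per day
cud_usd = 397.4           # committed CUD 400 cost per day

# Assumption: AS rate inferred from the 4/22 row, not quoted in the source.
as_rate_usd_per_slot_hr = 257.1 / 3699

as_budget_usd = pre_total_usd - target_saving_usd - cud_usd
as_budget_slot_hrs = as_budget_usd / as_rate_usd_per_slot_hr
as_budget_avg_slots = as_budget_slot_hrs / 24

print(f"AS budget:     ${as_budget_usd:.1f}/day")
print(f"AS slot-hours: {as_budget_slot_hrs:.0f}/day")
print(f"AS avg slots:  {as_budget_avg_slots:.0f}")
```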

Root cause recap

Three plans (pick one)

All plans keep CUD 400 and core-realtime AS=50. Costs are estimates relative to pre-4/23 baseline $555/day.

Plan A — Safest · fix default, accept $10/day savings

Action | Δ AS avg slots | Δ cost /day | Notes
default AS_max 50 floor all hours, 100 at 8-14 TPE | +20 | +$33 | core of the fix: direct relief
etl AS_max 50 → 80 normal, 100 at 02am | +12 | +$20 | rebuild idle pool for default to borrow
ka AS_max 50 → 35 | -1 | -$1.5 | p95 ka rarely exceeds 30
etl-backfill AS_max 50 → 25 | ~0 | -$0.5 | avg usage is 0, just a cap reduction
Net | +31 | +$51 |

→ Total cost: ~$555 + (net AS) = roughly $545-555/day. vs pre: saves $0-10/day. default p95 pending: 260s → ~40s ✓

Plan B — Balanced · save ~$15/day, fix default business hours only

Action | Δ AS avg slots | Δ cost /day | Notes
default AS_max 50 only at 8-14 TPE (7h) | +8 | +$9.7 | fix worst-hour pending; off-peak stays weak
etl AS_max 50 → 70 at peak hours 07-12 only | +4 | +$5.5 | targeted idle pool recovery
ka AS_max 50 → 35 | -1 | -$1.5 | safe
etl-backfill AS_max 50 → 25 | ~0 | -$0.5 | safe
Net change vs current | +11 | +$13 |

→ Current saves ~$16/day; after adding $13 back: save ~$3-8/day. But if we compare to 4/24 full-day extrapolation where savings are $16, after fixes savings become ~$10-15/day. default business hours p95: 260s → ~60s · off-peak Airflow bursts still ~200s.

Plan C — Aggressive · hits $20/day, weaker default fix, adds etl off-peak risk

Action | Δ AS avg slots | Δ cost /day | Notes
default AS_max 25 at 9-13 TPE only (5h) | +4 | +$5.8 | lite version of fix
etl AS_max cut off-peak (0-6 TPE) to 30 | -5 | -$8.3 | risk: etl peak 02am may queue if demand > 180
ka AS_max 50 → 30 | -1.5 | -$2.5 | mild peak risk
etl-backfill AS_max 50 → 25 | ~0 | -$0.5 | safe
Net change vs current | -2.5 | -$5.5 |

→ Current saves ~$16; additional $5.5 → save ~$20-22/day ✓ (hits target). But default Redash/User Query p95 only improves to ~120s, and etl 02am peak may queue. Analysts still feel it.
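The Net rows of the three plan tables can be checked by summing the per-action deltas. The sketch below copies the deltas from the tables and treats each "~0" slot entry as exactly 0, which is an assumption. It checks only the Net rows, not the total-cost estimates that follow each table.

```python
# (delta AS avg slots, delta cost $/day) per action, copied from the tables.
# The "~0" entries for etl-backfill are treated as 0 (assumption).
plans = {
    "A": [(20, 33.0), (12, 20.0), (-1, -1.5), (0, -0.5)],
    "B": [(8, 9.7), (4, 5.5), (-1, -1.5), (0, -0.5)],
    "C": [(4, 5.8), (-5, -8.3), (-1.5, -2.5), (0, -0.5)],
}


def net_change(actions):
    """Sum slot and cost deltas, rounded to one decimal place."""
    slots = round(sum(s for s, _ in actions), 1)
    usd = round(sum(c for _, c in actions), 1)
    return slots, usd


nets = {name: net_change(actions) for name, actions in plans.items()}

for name, (slots, usd) in nets.items():
    print(f"Plan {name}: {slots:+} avg slots, {usd:+} $/day vs current")
```

The sums agree with the Net rows above; Plan B's +$13.2 appears as +$13 after rounding.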

Recommendation

Plan A if analyst experience matters more than $20 target — fixes default cleanly, saves ~$10/day anyway.

Plan B is the pragmatic middle — fixes the visible pain (business hours) and saves ~$10-15/day.

Plan C meets the $20 target but leaves Redash/User Query queue times above 100s. Only choose this if cost is a hard budget constraint.

If the $20 target is firm AND degradation must be fully fixed → reconsider cutting core-realtime AS 50→25: avg usage is 6.6 slots, p95=24, provisioned 31. A cap of 25 still leaves roughly 4× headroom over average usage. That change alone frees about $25/day and closes the gap with zero new risk elsewhere.

Monitoring after apply

After any plan, monitor for 48 hours: