BQ Slot 4/23 Config Change — Degradation Analysis

Metric	Value	Detail
Cost saved / day	$8-16	vs pre-4/23 baseline
Autoscale cap Δ	-42%	3699 → 2154 slot-hrs/day
default p95 pending	5s → 230s	46× regression
Redash p95 pending	27s → 292s	10.8×, user-facing
Job usage Δ	-3%	Workload essentially unchanged
etl duration p95	+5%	204s → 215s, minor
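
The multipliers and percentages in the summary figures above can be re-derived from the before/after pairs. The following is a minimal sketch; the dictionary keys and the `change` helper are illustrative names, not part of the playground.

```python
# Before/after values copied from the summary figures above.
cards = {
    "default p95 pending (s)": (5, 230),
    "Redash p95 pending (s)": (27, 292),
    "etl duration p95 (s)": (204, 215),
    "autoscale slot-hrs/day": (3699, 2154),
}


def change(before, after):
    """Return (ratio, percent change) for a before/after pair."""
    return after / before, (after - before) / before * 100


for name, (before, after) in cards.items():
    ratio, pct = change(before, after)
    print(f"{name}: {ratio:.1f}x ({pct:+.0f}%)")
```

The ratios for the two pending metrics come out at 46× and about 10.8×, and the autoscale and etl changes round to -42% and +5%, matching the figures quoted above.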

Total capacity (baseline + autoscale) · 4/22 → now

Cliff at 04-23 14:00 TPE: baseline jumps 300→400, etl autoscale_max drops from 150-200 → 50-100.

Per-reservation autoscale (shows which reservations shrank)

Pending time percentile — by reservation

Vertical dashed line marks config change (04-23 14:00 TPE). After this point, default pending p95 spikes repeatedly to 200-600s.

Daily summary — allocation vs cost

Date	CUD baseline	AS slot-hrs	Used slot-hrs	AS $	CUD $	Total $
2026-04-21	300	3,592	10,524	$249.6	$298.1	$547.7
2026-04-22	300	3,699	10,677	$257.1	$298.1	$555.2
2026-04-23 (split day)	342	3,032	10,470	$210.7	$339.5	$550.2
2026-04-24 (15h partial)	400	1,346	6,478	$93.5	$397.4	$490.9
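
The table rows imply fairly stable per-slot-hour rates, and the 15h partial day can be scaled to a full day for comparison. The sketch below rests on two assumptions that the table does not state: that the CUD charge covers the baseline slots for a full 24 hours, and that the remaining hours of 4/24 resemble the observed ones. The function names are illustrative.

```python
# Rows copied from the daily summary table:
# (date, CUD baseline slots, AS slot-hrs, AS $, CUD $)
rows = [
    ("2026-04-21", 300, 3592, 249.6, 298.1),
    ("2026-04-22", 300, 3699, 257.1, 298.1),
    ("2026-04-23", 342, 3032, 210.7, 339.5),
    ("2026-04-24", 400, 1346, 93.5, 397.4),
]


def implied_rates(row):
    """Return ($ per autoscale slot-hr, $ per CUD slot-hr) implied by a row.

    Assumption: the CUD charge covers the baseline slots for 24 hours.
    """
    _, cud_slots, as_slot_hrs, as_cost, cud_cost = row
    return as_cost / as_slot_hrs, cud_cost / (cud_slots * 24)


def extrapolate_partial_day(row, hours_observed):
    """Scale autoscale usage and cost linearly to 24h; keep CUD cost as-is.

    Assumption: the unobserved hours look like the observed ones.
    """
    _, _, as_slot_hrs, as_cost, cud_cost = row
    scale = 24 / hours_observed
    return as_slot_hrs * scale, as_cost * scale + cud_cost


for row in rows:
    as_rate, cud_rate = implied_rates(row)
    print(f"{row[0]}: AS ${as_rate:.4f}/slot-hr, CUD ${cud_rate:.4f}/slot-hr")

full_as, full_total = extrapolate_partial_day(rows[3], hours_observed=15)
pre_total = rows[1][3] + rows[1][4]  # 4/22 total, the last full pre-change day
print(f"4/24 extrapolated: {full_as:.0f} AS slot-hrs, ${full_total:.1f} total, "
      f"${pre_total - full_total:.1f} below 4/22")
```

Under these assumptions the extrapolated autoscale usage is about 2154 slot-hrs/day, the figure quoted in the summary, and the extrapolated total lands near the low end of the $8-16/day savings range.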

default reservation — pending by source (p95 over time)

All consumer sources of default are degraded. Redash/User Query spikes are user-facing.

Before (4/22) vs After (4/24) — default reservation only

Source	avg pre	avg post	Δ×	p95 pre	p95 post	Δ×	p99 pre	p99 post

Top-15 worst hours on `default` reservation (after 4/23 14:00)

Hours sorted by pending p95. These are the moments where analysts / DAGs hit the wall.

TPE hour	jobs	pending avg	pending p95	pending p99	duration p95

Three plans (pick one)

All plans keep CUD 400 and core-realtime AS=50. Costs are estimates relative to pre-4/23 baseline $555/day.

Plan A — Safest · fix default, accept $10/day savings

Action	Δ AS avg slots	Δ cost /day	Notes
default AS_max 50 floor all hours, 100 at 8-14 TPE	+20	+$33	core of the fix — direct relief
etl AS_max 50 → 80 normal, 100 at 02am	+12	+$20	rebuild idle pool for default to borrow
ka AS_max 50 → 35	-1	-$1.5	p95 ka rarely exceeds 30
etl-backfill AS_max 50 → 25	~0	-$0.5	avg usage is 0, just a cap reduction
Net	+31	+$51

→ Estimated total cost: roughly $545-555/day against the ~$555/day pre-4/23 baseline, i.e. savings of $0-10/day. Expected default p95 pending: 260s → ~40s ✓

Plan B — Balanced · save ~$15/day, fix default business hours only

Action	Δ AS avg slots	Δ cost /day	Notes
default AS_max 50 only at 8-14 TPE (7h)	+8	+$9.7	fix worst-hour pending; off-peak stays weak
etl AS_max 50 → 70 at peak hours 07-12 only	+4	+$5.5	targeted idle pool recovery
ka AS_max 50 → 35	-1	-$1.5	safe
etl-backfill AS_max 50 → 25	~0	-$0.5	safe
Net change vs current	+11	+$13

→ Against current savings of ~$16/day, adding $13 back leaves ~$3-8/day. Measured against the 4/24 full-day extrapolation (savings of $16/day), post-fix savings come to ~$10-15/day. Expected default p95 during business hours: 260s → ~60s; off-peak Airflow bursts stay around 200s.

Plan C — Aggressive · hits $20/day, weaker default fix, adds etl off-peak risk

Action	Δ AS avg slots	Δ cost /day	Notes
default AS_max 25 at 9-13 TPE only (5h)	+4	+$5.8	lite version of fix
etl AS_max cut off-peak (0-6 TPE) to 30	-5	-$8.3	risk: etl peak 02am may queue if demand > 180
ka AS_max 50 → 30	-1.5	-$2.5	mild peak risk
etl-backfill AS_max 50 → 25	~0	-$0.5	safe
Net change vs current	-2.5	-$5.5

→ Current saves ~$16; additional $5.5 → save ~$20-22/day ✓ (hits target). But default Redash/User Query p95 only improves to ~120s, and etl 02am peak may queue. Analysts still feel it.
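
The Net rows of the three plan tables can be checked by summing the per-action deltas. This is a minimal sketch; the `plans` structure and `net` helper are illustrative, and the "~0" entries are treated as exactly 0.

```python
# (Δ AS avg slots, Δ cost $/day) per action, copied from the plan tables.
# Assumption: "~0" slot deltas are treated as 0.
plans = {
    "A": [(20, 33), (12, 20), (-1, -1.5), (0, -0.5)],
    "B": [(8, 9.7), (4, 5.5), (-1, -1.5), (0, -0.5)],
    "C": [(4, 5.8), (-5, -8.3), (-1.5, -2.5), (0, -0.5)],
}


def net(actions):
    """Sum slot and cost deltas across a plan's actions."""
    slots = sum(s for s, _ in actions)
    cost = sum(c for _, c in actions)
    return slots, cost


for name, actions in plans.items():
    slots, cost = net(actions)
    print(f"Plan {name}: net {slots:+.1f} AS avg slots, {cost:+.1f} $/day")
```

The sums agree with the Net rows: +31 slots and +$51 for Plan A, +11 and about +$13 for Plan B, -2.5 and -$5.5 for Plan C.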

Recommendation

Plan A if analyst experience matters more than the $20 target: it fixes default cleanly and still saves up to ~$10/day.

Plan B is the pragmatic middle — fixes the visible pain (business hours) and saves ~$10-15/day.

Plan C meets the $20 target but leaves Redash/User Query queue times above 100s. Only choose this if cost is a hard budget constraint.

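If the $20 target is firm and the degradation must be fully fixed, reconsider cutting core-realtime AS 50→25, which the three plans above leave untouched: avg usage is 6.6 slots, p95=24, provisioned 31. A cap of 25 still sits above p95 and leaves roughly 4× headroom over average usage. That alone frees about $25/day and closes the gap without adding risk to the other reservations.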

Monitoring after apply

After any plan, monitor for 48 hours:

mart__bq_usage__source_summary__hourly — pending p95 by reservation × source

RESERVATIONS_TIMELINE — verify autoscale caps match the new config

gcp_billing_for_report — verify daily cost matches the Plan estimate within $2/day

Re-run this playground after 48h to validate. Success means default p95 pending stays under 40s during business hours (the Plan A target; Plan B expects ~60s and Plan C ~120s).
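
The two pass/fail checks above (pending p95 during business hours, and daily cost within $2 of the estimate) can be sketched as small helpers. The input shapes are hypothetical: the code assumes hourly rows have already been pulled from the monitoring tables as `(tpe_hour, pending_p95_seconds)` pairs, and it treats "8-14 TPE" as the seven hours 8 through 14 inclusive.

```python
# Assumption: business hours are 8-14 TPE inclusive (7 hours).
BUSINESS_HOURS = range(8, 15)


def pending_violations(hourly_p95, threshold_s=40, hours=BUSINESS_HOURS):
    """Return (hour, p95) pairs inside the window that exceed the threshold.

    hourly_p95: iterable of (tpe_hour, pending_p95_seconds) for the
    default reservation (hypothetical shape, not a table schema).
    """
    return [(h, p) for h, p in hourly_p95 if h in hours and p > threshold_s]


def cost_within_tolerance(actual, estimate, tolerance=2.0):
    """True if the actual daily cost is within `tolerance` $ of the estimate."""
    return abs(actual - estimate) <= tolerance


# Made-up sample values, for illustration only.
sample = [(3, 200), (9, 35), (10, 55), (14, 38)]
print(pending_violations(sample))
print(cost_within_tolerance(548.5, 547.0))
```

An empty list from `pending_violations` over the full 48-hour window would correspond to the success condition; pass a different `threshold_s` when validating a plan with a looser target.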