# BDD Results — Cycle 1 (Investigator A smoke test)

Ran all 17 scenarios (A1–A4, B1–B10, C1–C2, D1) against `deliverable-cycle-1.md`.

| # | Scenario | Result | Notes |
|---|---|---|---|
| A1 | OGSM completeness | **PASS** | validator run: Investigator A block parses; O/G/S/M all present |
| A2 | Script-first | **PASS (with note)** | Every deterministic check is expressed as a criterion; `suggest_script_extraction.py` flagged extraction candidates but this is an enhancement not a failure |
| A3 | Three-layer skill reference | **PASS** | Tier 1 line 140 uses the pointer "詳見 Skill Invocation Map" ("see Skill Invocation Map"); no full commands are inlined |
| A4 | AI fallback discipline | **PASS** | Spec line 141 + search log shows only `call_with_fallback.sh`; zero raw gemini/codex invocations in harness bash history |
| B1 | ≥ 5 cases | **FAIL** | Deliverable has 4 cases (Twin Parks, Fairmount, Kenwood, Reserve at LaVista Walk). Spec requires ≥ 5. |
| B2 | Two-source verification per case | **FAIL** | Cases 2 (Fairmount) and 3 (Kenwood) do NOT have independent dual sources; Case 3 has only 1 source |
| B3 | Post-2020 coverage | **PASS** | All 4 cases are 2022–2023 |
| B4 | Fire-rated assembly | **PASS** | All 4 cases involve fire-rated assemblies |
| B5 | DHI query with document number | **FAIL** | No primary DHI document ID captured; only secondary references via Scribd |
| B6 | ≥ 3 Gemini Flash searches logged | **PASS (technicality)** | 4 ai-fallback invocations logged, but only 1 returned content. Spec line 150 says "≥ 3 Gemini Flash 搜尋" ("≥ 3 Gemini Flash searches"). Literal interpretation: 3 attempts, which we exceed. Behavioural interpretation: 3 productive searches, which we do not reach. **AMBIGUITY: see Iterator diff.** |
| B7 | Architect-familiar building type | **PASS** | All 4 cases are multifamily residential, which is in the familiar set |
| B8 | Discrepancy resolution log | **PASS** | deliverable has a dedicated "Cross-verification discrepancy log" table |
| B9 | content-scout flag-candidate written | **PASS** | 2 candidates (#4, #5) appended to queue with source_agent: investigator-a |
| B10 | Skill discovery before action (no drift) | **PASS** | `get_skills_for_role.sh investigator-a` returned /content-scout and /ai-fallback; spec Tier 1 references both; no drift |
| C1 | Fallback chain exercised | **PASS (behaviourally)** | Fallback chain advanced 6 times across queries; flash timed out, lite succeeded once, pro timed out, codex refused. The chain was exercised in both happy-path and failure modes. |
| C2 | No raw LLM calls | **PASS** | Zero raw gemini/codex commands in harness bash history |
| D1 | Topic-agnostic BDD scenarios | **PASS** | bdd-scenarios-cycle-1.md has no hardcoded topic strings |
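
The two readings of B6 can be made concrete with a small sketch. It assumes each logged ai-fallback invocation is reduced to a boolean (did it return content?). The function name `b6_verdicts` and this log representation are illustrative assumptions, not part of the harness, and the position of the single successful invocation in the list is arbitrary.

```python
def b6_verdicts(search_outcomes, minimum=3):
    """Evaluate B6 under both interpretations.

    search_outcomes: one boolean per logged invocation,
    True if the invocation returned content (hypothetical representation).
    """
    attempts = len(search_outcomes)
    productive = sum(1 for ok in search_outcomes if ok)
    return {
        "literal": attempts >= minimum,        # counts attempts
        "behavioural": productive >= minimum,  # counts productive searches
    }


# Cycle 1: 4 invocations logged, 1 returned content.
print(b6_verdicts([True, False, False, False]))
# {'literal': True, 'behavioural': False}
```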

---

## Pass rate

**Scenario count**:
- A: A1, A2, A3, A4 = 4
- B: B1, B2, B3, B4, B5, B6, B7, B8, B9, B10 = 10
- C: C1, C2 = 2
- D: D1 = 1
- **Total: 17 scenarios** (not 14; 14 is the number of passes)

**Pass/Fail**:
- PASS (including the qualified passes A2, B6, C1): A1, A2, A3, A4, B3, B4, B6, B7, B8, B9, B10, C1, C2, D1 = **14**
- FAIL: B1, B2, B5 = **3**

**Pass rate: 14/17 = 82.4%**
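
The tally can be reproduced mechanically from the results table. The sketch below transcribes each scenario's verdict by hand and treats the qualified passes (A2, B6, C1) as PASS. The `RESULTS` dictionary and `tally` function are illustrative names, not part of the harness.

```python
from collections import Counter

# Verdicts transcribed from the results table; qualified passes count as PASS.
RESULTS = {
    "A1": "PASS", "A2": "PASS", "A3": "PASS", "A4": "PASS",
    "B1": "FAIL", "B2": "FAIL", "B3": "PASS", "B4": "PASS", "B5": "FAIL",
    "B6": "PASS", "B7": "PASS", "B8": "PASS", "B9": "PASS", "B10": "PASS",
    "C1": "PASS", "C2": "PASS",
    "D1": "PASS",
}


def tally(results):
    """Return (passes, fails, total, pass rate in percent to one decimal)."""
    counts = Counter(results.values())
    total = len(results)
    pass_rate = round(100 * counts["PASS"] / total, 1)
    return counts["PASS"], counts["FAIL"], total, pass_rate


print(tally(RESULTS))  # (14, 3, 17, 82.4)
```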

---

## FAIL analysis per pattern library

### B1 FAIL — only 4 cases instead of 5

- `get_patterns_for_failure.sh research` → P-011 (Gemini Flash hang) is the proximate cause. The rotation query for the 4th and 5th cases timed out on both flash-lite and pro at 180s.
- `get_gotchas_for_context.sh research` → G-001 directly applies: real research agents get ~50% effective fallback throughput when Gemini hangs dominate.
- **Classification**: Judgment (prompt-fixable) + harness-fixable
- **Fix options**:
  1. Raise `OGSM_MODEL_TIMEOUT` default to 300s for research agents
  2. Change fallback chain default to `flash-lite,pro,codex` (skip flash primary entirely per P-011 workaround — Team 1 Cycle 3 proved this saves 50% wall-clock)
  3. Add explicit WebSearch tool-fallback branch in harness when ai-fallback chain exhausts
- **Smallest diff**: Option 2 (skip flash primary) — 1-line change in spec line 141
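
Options 2 and 3 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the behaviour of `call_with_fallback.sh`: `drop_model`, `run_chain`, the `call_model` callable, and the `web_search` callable are all hypothetical, and a timeout is modelled as a Python `TimeoutError`.

```python
from typing import Callable, Optional, Sequence, Tuple

# Chain as quoted from spec line 141 (before the proposed change).
FULL_CHAIN = "gemini-2.5-flash,gemini-2.5-flash-lite,gemini-2.5-pro,codex"


def drop_model(chain: str, model: str) -> str:
    """Option 2: remove one model from a comma-separated chain."""
    return ",".join(m for m in chain.split(",") if m != model)


def run_chain(
    query: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
    web_search: Optional[Callable[[str], str]] = None,
) -> Optional[Tuple[str, str]]:
    """Try each model in order; return (source, text) for the first non-empty result.

    Option 3: if every model times out or returns nothing and a web_search
    callable is supplied, use it as the last branch.
    """
    for model in models:
        try:
            text = call_model(model, query)
        except TimeoutError:
            continue  # advance to the next model in the chain
        if text:
            return model, text
    if web_search is not None:
        text = web_search(query)
        if text:
            return "websearch", text
    return None  # chain exhausted with no content


print(drop_model(FULL_CHAIN, "gemini-2.5-flash"))
# gemini-2.5-flash-lite,gemini-2.5-pro,codex
```

Injecting `call_model` and `web_search` as callables keeps the chain logic testable without invoking any real model.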

### B2 FAIL — 2 of 4 cases lack independent dual sources

- No specific pattern; this is classical research insufficiency.
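- **Classification**: Judgment (prompt-fixable). The spec's existing M bullet (line 156) says "只有單一來源的案例必須標記" ("cases with only a single source must be flagged"), and the deliverable follows it. However, the spec does not say whether single-source cases count against the ≥ 5 quota. Clarification needed.
- **Fix**: Add an explicit M bullet: "單一來源的案例不計入 ≥ 5 案例數；必須替換或補強到雙來源才能計入" ("single-source cases do not count toward the ≥ 5 case total; they must be replaced or strengthened to dual-source before they count").
- **Alternative classification**: Deterministic (script-fixable) if a Candidate Counter validator is later built.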
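
A Candidate Counter validator of this kind might look like the sketch below. It assumes each case carries a list of independent source identifiers. The `Case` structure, the function names, and the placeholder case and source labels are all hypothetical; they are not taken from the deliverable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Case:
    name: str
    # Identifiers of independent sources (hypothetical representation).
    sources: List[str] = field(default_factory=list)


def count_qualifying(cases: List[Case], min_sources: int = 2) -> int:
    """Count cases backed by at least min_sources distinct sources."""
    return sum(1 for c in cases if len(set(c.sources)) >= min_sources)


def check_quota(cases: List[Case], quota: int = 5, min_sources: int = 2) -> Tuple[bool, int]:
    """Return (quota met?, number of qualifying cases)."""
    n = count_qualifying(cases, min_sources)
    return n >= quota, n


# Placeholder data only: four cases, two of them without dual sources.
cases = [
    Case("case-1", ["source-a", "source-b"]),
    Case("case-2", ["source-a", "source-a"]),  # same source twice: not independent
    Case("case-3", ["source-c"]),
    Case("case-4", ["source-d", "source-e"]),
]
print(check_quota(cases))  # (False, 2)
```

Under the proposed M bullet, such a check would report only dual-source cases against the ≥ 5 quota.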

### B5 FAIL — no primary DHI document ID

- Pattern match: P-012 — side-effect script (DHI query) may be budget-squeezed.
- Gotcha match: none directly, but related to G-001 (Gemini hang prevented DHI deep queries)
- **Classification**: Judgment + infrastructure (DHI deep search requires either paid API or manual curation)
- **Fix**: Add M bullet acknowledging DHI primary is gated; accept "DHI secondary reference with document title" as the realistic bar, until DHI API integration exists.

---

## Smallest-possible diff decision (P-008)

Three FAILs, three potential diffs. Pick **ONE** smallest diff:

**Chosen diff**: modify spec line 141 model command from `gemini-2.5-flash,gemini-2.5-flash-lite,gemini-2.5-pro,codex` to `gemini-2.5-flash-lite,gemini-2.5-pro,codex` (drop Flash primary) + raise effective timeout guidance to 150s.

**Why this one**:
- It attacks the dominant failure mode (the Gemini Flash hang, P-011 / G-001), which is the proximate cause of B1 and contributes to B5
- It is a 1-line textual change, lowest risk
- Follows P-011 exactly (Team 1 Cycle 3 proven workaround)
- Leaves B2 and B5 for Cycle 2 as secondary improvements
- Does NOT modify deliverable-level requirements (≥ 5 cases, DHI document) which are content quality issues, not factory pattern issues

**Deferred diffs** (Commander may choose to apply separately):
- B2 clarification: add "single-source excluded from ≥ 5 count" M bullet (1 line)
- B5 clarification: relax DHI primary requirement or add DHI API procurement task