Failure Mode Runbook

Diagnostic Procedures & Root Cause Analysis


Purpose

Systematic troubleshooting framework for deployment integrity failures, applying RME diagnostic methodology to software infrastructure.

Common Failure Patterns (Indexed)

Pattern ID | Pattern               | Primary Signal               | Typical Root Cause
FM-001     | Hash Mismatch         | make audit fails             | Uncommitted changes / CDN lag
FM-002     | Root 404              | Homepage returns 404         | Routing misconfiguration
FM-003     | Remote Fetch Failure  | curl fails on deployment URL | DNS propagation / CDN issue
FM-004     | 401 Protection Active | Unauthorized access errors   | Vercel password protection enabled
FM-005     | Git Rejection         | Push fails with error        | Diverged branches / auth failure

Immediate Containment (Halt + Rollback)

When to halt: Any integrity breach (hash mismatch), unknown deployment state, or 4xx/5xx at root.
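
The halt criteria above can be written as a small shell check. This is a minimal sketch under stated assumptions: should_halt is a hypothetical helper (not part of the existing Makefile), and the caller is assumed to pass the exit code of make audit and the HTTP status observed at the root URL, leaving either one empty when the state could not be determined.

```shell
# should_halt AUDIT_EXIT HTTP_STATUS
# Returns 0 (halt) on an integrity breach, unknown state, or 4xx/5xx at root.
# Hypothetical helper for illustration only.
should_halt() {
    audit_exit=$1
    http_status=$2

    # Unknown deployment state: missing or non-numeric inputs.
    case "$audit_exit" in ''|*[!0-9]*) return 0 ;; esac
    case "$http_status" in ''|*[!0-9]*) return 0 ;; esac

    # Integrity breach: make audit exited non-zero.
    [ "$audit_exit" -ne 0 ] && return 0

    # 4xx/5xx at root.
    [ "$http_status" -ge 400 ] && return 0

    return 1
}
```

Example use: should_halt "$audit_rc" "$root_status" && echo "HALT: stop all deployments"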

Emergency Response Procedure:

  • Stop all deployments immediately
  • Document current state (git status -sb, make audit output)
  • Identify last known good commit: git log --oneline -5
  • Execute rollback: git reset --hard [LAST_GOOD_COMMIT] (this discards uncommitted local changes, so complete the documentation step first)
  • Verify integrity: make audit must pass before proceeding
  • Escalate if rollback fails (see Escalation Triggers section)
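
The documentation and rollback steps above could be scripted roughly as follows. This is a sketch, not the project's actual tooling: snapshot_state and rollback_to are hypothetical names, and the only assumption about make audit is that it exits non-zero on failure.

```shell
# snapshot_state FILE
# Records the current state before anything is changed.
snapshot_state() {
    {
        echo "== git status =="
        git status -sb
        echo "== recent commits =="
        git log --oneline -5
        echo "== make audit =="
        # Keep going if the audit fails; the failure itself is what we record.
        make audit 2>&1 || echo "audit exit code: $?"
    } > "$1"
}

# rollback_to COMMIT
# Resets to the last known good commit, then re-runs the audit.
# git reset --hard discards uncommitted changes, so snapshot first.
rollback_to() {
    if [ -z "$1" ]; then
        echo "usage: rollback_to <commit>" >&2
        return 2
    fi
    git reset --hard "$1" || return 1
    make audit
}
```

Example use: snapshot_state incident-state.txt, then rollback_to [LAST_GOOD_COMMIT]. A non-zero return from rollback_to means the rollback or the follow-up audit failed, which is the point at which the procedure calls for escalation.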

Escalation Triggers (Involve Senior Engineers)

Escalate immediately if any of the following occur:

Escalation Handoff Checklist:

  • Current git commit SHA and status
  • Full make audit output
  • Vercel deployment logs (last 3 deployments)
  • Timeline of failure onset and containment actions taken

FM-001: Hash Mismatch

Symptom

make audit reports local hash differs from remote hash.

Observable Indicators

Decision Tree

Hash mismatch?
├─ Local hash matches HEAD?
│  ├─ yes → CDN propagation delay (wait 5 min, retry audit)
│  └─ no → Uncommitted changes detected
│     ├─ Expected changes? → commit and deploy
│     └─ Unexpected changes? → investigate (git diff)
└─ Remote fetch fails?
   ├─ DNS issue → verify domain resolution
   └─ CDN cache → force invalidate or wait
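
The first branch of the tree (does the local file match HEAD?) can be checked with git diff. A minimal sketch, assuming the audited file is index.html and that it is tracked by git; classify_mismatch is a hypothetical helper name.

```shell
# classify_mismatch [FILE]
# Walks the "local hash matches HEAD?" branch for one tracked file
# (default: index.html, an assumption about what make audit hashes).
classify_mismatch() {
    file=${1:-index.html}
    # git diff --quiet exits 0 when the working copy matches HEAD.
    if git diff --quiet HEAD -- "$file"; then
        echo "local matches HEAD: likely CDN propagation delay; wait 5 min and retry make audit"
    else
        echo "uncommitted changes in $file: review with git diff, then commit and deploy if expected"
    fi
}
```

Note that git diff does not report untracked files, so a file that was never added to the repository would fall into the first branch here.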

Root Cause Isolation

  1. Check local state: git status -sb
  2. Verify HEAD commit: git log --oneline -1
  3. Test remote fetch: curl -I https://timothywheels.com
  4. Compare hashes manually: sha256sum index.html vs. curl -s https://timothywheels.com | sha256sum
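
The manual hash comparison in step 4 might look like the sketch below. It assumes the audit compares the SHA-256 of the local index.html with the body served at the site root; the actual make audit recipe is not shown in this runbook, so treat that as an assumption. The function names are hypothetical.

```shell
# hash_of_file PATH: print the SHA-256 of a local file.
hash_of_file() {
    sha256sum "$1" | cut -d' ' -f1
}

# hash_of_url URL: download to a temp file first so that a failed fetch
# is reported instead of silently hashing an empty body.
hash_of_url() {
    body=$(mktemp) || return 2
    if curl -fsS "$1" -o "$body"; then
        sha256sum "$body" | cut -d' ' -f1
        rm -f "$body"
    else
        rm -f "$body"
        return 2
    fi
}

# compare_hashes PATH URL: 0 = match, 1 = mismatch, 2 = read or fetch failure.
compare_hashes() {
    local_hash=$(hash_of_file "$1")
    [ -n "$local_hash" ] || return 2
    remote_hash=$(hash_of_url "$2") || return 2
    echo "local:  $local_hash"
    echo "remote: $remote_hash"
    [ "$local_hash" = "$remote_hash" ]
}
```

Example use: compare_hashes index.html https://timothywheels.com/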

Resolution

FM-002: Root 404

Symptom

Homepage (/) returns 404 Not Found.

Observable Indicators

Root Cause Isolation

  1. Verify index.html exists in repo root
  2. Check Vercel project settings (root directory configuration)
  3. Review vercel.json routing rules
  4. Check deployment logs for file upload errors
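
Steps 1 and 3 above can be partly automated from a local checkout. The following is a rough sketch: check_root_index is a hypothetical helper, and the list of vercel.json keys it searches for is only a starting point for manual review, not a full validation of the routing configuration.

```shell
# check_root_index [REPO_DIR]
# Confirms index.html exists at the repo root and is tracked by git, then
# lists routing-related keys found in vercel.json for manual review.
check_root_index() {
    repo=${1:-.}
    if [ ! -f "$repo/index.html" ]; then
        echo "index.html missing from repo root"
        return 1
    fi
    if ! git -C "$repo" ls-files --error-unmatch index.html >/dev/null 2>&1; then
        echo "index.html exists but is not tracked by git"
        return 1
    fi
    echo "index.html present and tracked"
    if [ -f "$repo/vercel.json" ]; then
        # Print matching keys with line numbers; no match is not an error.
        grep -nE '"(rewrites|redirects|routes|cleanUrls|trailingSlash)"' "$repo/vercel.json" || true
    fi
    return 0
}
```

The Vercel project settings (step 2) and deployment logs (step 4) still need to be checked in the dashboard.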

Resolution

FM-003: Remote Fetch Failure

Symptom

curl request to deployment URL fails or times out.

Observable Indicators

Decision Tree

Remote fetch fails?
├─ DNS resolves?
│  ├─ no → DNS propagation issue (check registrar)
│  └─ yes → CDN/routing issue
│     ├─ Vercel shows "Ready"? → CDN cache problem
│     └─ Deployment failed? → check build logs
└─ Timeout vs immediate fail?
   ├─ Timeout → Network/firewall issue
   └─ Immediate → SSL/cert problem

Root Cause Isolation

  1. Test DNS: nslookup timothywheels.com
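  2. Test direct IP: curl -I --resolve timothywheels.com:443:[VERCEL_IP] https://timothywheels.com (pins the hostname to the IP so the TLS certificate and Host header still match)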
  3. Check Vercel deployment status in dashboard
  4. Review deployment logs for errors
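
curl's exit code already separates several branches of the decision tree (DNS failure, timeout, TLS problem). A minimal sketch that maps the common codes to those branches; classify_curl_exit and probe are hypothetical names, and the 10-second limit is an arbitrary choice.

```shell
# classify_curl_exit CODE: map a curl exit code to a branch of the tree.
classify_curl_exit() {
    case "$1" in
        0)     echo "fetch succeeded" ;;
        6)     echo "DNS: could not resolve host; check registrar and propagation" ;;
        7)     echo "connect: could not connect; check deployment status and firewall" ;;
        28)    echo "timeout: network or firewall issue" ;;
        35|60) echo "TLS: SSL handshake or certificate problem" ;;
        *)     echo "unclassified curl exit code $1; check build and deployment logs" ;;
    esac
}

# probe URL: send a HEAD request with a 10 s limit and classify the result.
probe() {
    rc=0
    curl -sS -I --max-time 10 -o /dev/null "$1" || rc=$?
    classify_curl_exit "$rc"
    return "$rc"
}
```

Example use: probe https://timothywheels.com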

Resolution

Recovery Verification (Post-Fix Validation)

Use this checklist after any fix or rollback:

  1. git status -sb shows clean working tree
  2. make audit passes with matching hashes
  3. curl -I https://timothywheels.com returns 200 OK
  4. Visual inspection: homepage loads correctly in browser
  5. Verification: key documents (RME packet, runbooks) accessible

Gate: If any step fails, return to decision tree. Do not proceed with new deployments until all checks pass.
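
The first three checks lend themselves to a script; the browser and document checks remain manual. A sketch under the assumption that make audit exits non-zero on a hash mismatch; verify_recovery is a hypothetical name.

```shell
# verify_recovery URL
# Runs the automatable checks in order and stops at the first failure.
verify_recovery() {
    url=$1

    # 1. Clean working tree: porcelain output is empty when nothing changed.
    if [ -n "$(git status --porcelain)" ]; then
        echo "FAIL: working tree is not clean"
        return 1
    fi

    # 2. Integrity audit.
    if ! make audit >/dev/null 2>&1; then
        echo "FAIL: make audit did not pass"
        return 1
    fi

    # 3. HTTP status at the root (curl prints 000 when no response arrived).
    status=$(curl -s -o /dev/null -I -w '%{http_code}' "$url") || true
    if [ "$status" != "200" ]; then
        echo "FAIL: $url returned ${status:-no response}"
        return 1
    fi

    echo "PASS: automated checks passed; continue with manual checks 4 and 5"
}
```

Example use: verify_recovery https://timothywheels.com || echo "Gate closed: return to decision tree"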

5-Why Examples (Root Cause Discipline)

Example A: Root 404 on Production

  1. Why is the root returning 404?
    Because index.html isn't being served.
  2. Why isn't it being served?
    Because Vercel routing rules aren't configured correctly.
  3. Why aren't the rules configured?
    Because vercel.json was edited without testing.
  4. Why was it edited without testing?
    Because there's no staging environment validation step.
  5. Why is there no staging validation?
    Because the deployment workflow doesn't include pre-production checks.

Corrective Action: Add a staging environment with mandatory validation before each production deploy. Update the Makefile to include a make test-staging step.

Example B: Audit Hash Mismatch

  1. Why do the hashes not match?
    Because the local file differs from the remote.
  2. Why does the local file differ?
    Because changes were made but not committed.
  3. Why weren't they committed?
    Because a developer made a quick fix without following the commit protocol.
  4. Why did they bypass the protocol?
    Because there's no automated pre-commit integrity check.
  5. Why is there no pre-commit check?
    Because git hooks weren't configured in the repository.

Corrective Action: Install a git pre-commit hook that runs make audit before allowing commits. This keeps unaudited changes from reaching the deployment pipeline.
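
One way to set up such a hook is sketched below. install_audit_hook is a hypothetical helper, and the hook body assumes only that make audit exits non-zero on failure.

```shell
# install_audit_hook [REPO_DIR]
# Writes a pre-commit hook that blocks the commit when make audit fails.
install_audit_hook() {
    hook="${1:-.}/.git/hooks/pre-commit"
    cat > "$hook" <<'EOF'
#!/bin/sh
# Abort the commit if the integrity audit fails.
if ! make audit; then
    echo "pre-commit: make audit failed; commit aborted" >&2
    exit 1
fi
EOF
    chmod +x "$hook"
}
```

Run install_audit_hook once from the repository root. Hooks under .git/hooks are not versioned, so each clone needs its own install, and git commit --no-verify bypasses the check. If make audit compares against the live site, it may also fail for intended changes that are not yet deployed, in which case a pre-push hook could be a better fit.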

RME Analogy: Vercel Routing = Conveyor Jam Logic

RME Analogy: 404/401 = QC Rejection / Access Control Failure

RME Analogy: Hash Verification = PM Inspection Checkpoints

RME Analogy: Deployment Pipeline = Production Line Flow

Preventive Maintenance