SousChefAI/implementation_plan.md
pulipakaa24 193a825899
video understanding
2026-04-29 11:50:44 -05:00


Reel Recipe App — Implementation Plan

This plan covers a native iOS app that turns shared Instagram Reels and TikToks into structured, searchable recipes, backed by a Python/FastAPI worker running on your apartment desktop. Voice search and dietary intelligence are first-class features alongside reel ingestion.


1. Scope & Goals

In scope for v1

  • iOS app (SwiftUI, iOS 18+)
  • Share extension accepting reels/TikToks from Instagram, TikTok, and any URL sharing surface
  • Video-understanding pipeline: caption parsing → ASR → frame OCR → VLM → fusion LLM
  • Voice search with on-device Whisper transcription, backend web search + LLM structuring
  • Dietary preferences and allergies with per-ingredient swap suggestions on violation
  • Apple Sign In authentication
  • Self-hosted Postgres on apartment desktop, accessed via Cloudflare Tunnel

Out of scope for v1

  • Android
  • Fridge scanning / computer vision of ingredients (only the design language carries over from the SousChefAI codebase)
  • Full recipe regeneration on dietary conflict (ingredient swaps only for now)
  • Community / shared recipe features
  • Grocery list integrations beyond Apple Reminders

Non-goals / philosophy

  • Do not scrape Instagram or TikTok from the server. All ingestion is user-initiated through the share sheet. Backend may fetch the public video via yt-dlp, but only after the user has explicitly shared it.
  • Do not embed third-party API keys in the iOS binary. All LLM/VLM inference runs on the backend.
  • Do not require a network round-trip for things that can reasonably happen on-device (voice transcription for search, UI state, cache).

2. Architecture Overview

┌─────────────────────────────────────────────┐
│               iOS App (iOS 18+)              │
│  ┌────────────────┐  ┌─────────────────┐    │
│  │ Share Extension│  │   Main App      │    │
│  │  (thin, <50MB) │  │  (SwiftUI)      │    │
│  └────────┬───────┘  └────────┬────────┘    │
│           │                   │             │
│  ┌────────┴───────────────────┴────────┐    │
│  │  App Group: pending-jobs, cache     │    │
│  └────────────────┬────────────────────┘    │
│                   │                         │
│  ┌────────────────┴────────────────────┐    │
│  │  WhisperKit (on-device, voice only) │    │
│  └─────────────────────────────────────┘    │
└───────────────────┬─────────────────────────┘
                    │ HTTPS (Cloudflare Tunnel)
                    │ Bearer JWT
┌───────────────────┴─────────────────────────┐
│        Apartment Desktop (Ubuntu/WSL2)       │
│  ┌─────────────────────────────────────┐    │
│  │  FastAPI app (uvicorn)              │    │
│  │  • /auth  /me  /jobs  /recipes      │    │
│  │  • Pushes jobs onto arq queue       │    │
│  └────────────┬────────────────────────┘    │
│               │                             │
│  ┌────────────┴────────────┐                │
│  │  Redis (job queue)      │                │
│  └────────────┬────────────┘                │
│               │                             │
│  ┌────────────┴────────────────────┐        │
│  │  arq Worker(s)                  │        │
│  │  • yt-dlp / ffmpeg               │       │
│  │  • faster-whisper (GPU)          │       │
│  │  • PySceneDetect + PaddleOCR     │       │
│  │  • Qwen2.5-VL-7B (GPU)           │       │
│  │  • Qwen2.5-14B fusion (GPU)      │       │
│  └────────────┬────────────────────┘        │
│               │                             │
│  ┌────────────┴────────────────────┐        │
│  │  Postgres 16 + pgvector          │       │
│  └─────────────────────────────────┘        │
└─────────────────────────────────────────────┘

The share extension is deliberately tiny: it reads the shared URL and any caption iOS hands over, writes a pending-job record into the App Group container, posts it to the backend, and exits. It never downloads the video, never runs ML, and never exceeds ~50MB of memory. Staying that small keeps it well clear of iOS's roughly 120 MB share extension memory ceiling.

The main app is the primary surface. It reads pending jobs from the App Group on launch, polls job status, renders recipes, handles voice search, and manages preferences. The main app has no meaningful memory limit.

The desktop backend does all heavy work. Two processes: the FastAPI request handler (lightweight, non-blocking) and one or more arq workers that pull jobs from Redis and run the video pipeline. This split matters — you don't want Whisper blocking a health check.


3. Tech Stack

iOS (SwiftUI, iOS 18+)

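  • SwiftUI with translucent materials for the polished liquid-glass card UI (Apple's Liquid Glass APIs shipped with iOS 26, so on iOS 18 this means the standard SwiftUI Material backgrounds)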
  • Apple Sign In for auth
  • URLSession + async/await for networking
  • SwiftData for local cache of recipes (read-through, write-through to backend)
  • Share Extension target sharing an App Group container with the main app
  • WhisperKit for on-device voice transcription (voice search only; not used for reels)
  • APNs for job-ready notifications

Backend (Python 3.12)

  • FastAPI with uvicorn
  • SQLAlchemy 2.x async + Alembic migrations
  • Postgres 16 with pgvector (for future semantic recipe search)
  • Redis 7 + arq for job queue
  • PyJWT + cryptography for Apple identity token verification and own session JWT issuance
  • Pydantic v2 for schemas
  • httpx for outbound HTTP

ML / media stack (all on the desktop 3060 Ti + WSL2 Ubuntu)

  • yt-dlp (pinned, with a scheduled pull for updates)
  • ffmpeg
  • faster-whisper (large-v3-turbo, CTranslate2 backend)
  • PySceneDetect for keyframe selection
  • PaddleOCR (GPU) for on-screen text extraction
  • Qwen2.5-VL-7B-Instruct for frame captioning (via transformers or vLLM)
  • Qwen2.5-14B-Instruct for the fusion step and dietary swap suggestions
  • Escape hatch: config flag to route fusion to Gemini Flash if local model is struggling or machine is offline

Infrastructure

  • Cloudflare Tunnel to apartment desktop (existing setup)
  • GitHub Actions for CI (tests, lint, migration linting)
  • Eventually: Oracle Free ARM VM as a WireGuard relay (matches your existing roadmap)

4. iOS Application

4.1 Share Extension

The extension has exactly one job: take the URL the user shared, post it to the backend, and get out of the way.

Flow on share:

  1. Read the shared NSExtensionItem; extract the URL and any caption text iOS provides
  2. Check that a user session JWT exists in the App Group Keychain. If not, show an "open the app to sign in" message and bail
  3. POST /jobs/ingest-reel with {source_url, caption?} and the bearer token
  4. Show a brief "Recipe queued" confirmation. Dismiss
  5. If the network call fails, write the job to the App Group's pending-jobs directory. The main app will retry on next launch

No ML, no heavy dependencies, no video processing. This keeps the extension well under the memory limit and makes the share sheet experience feel instant.

4.2 Main App

Five primary surfaces:

Home / Inbox. Shows recipes with their processing status. A just-shared recipe appears immediately as "Processing…" with a shimmer, then transitions to a full card when the backend finishes. This is where the liquid-glass card styling lives — frosted backgrounds, subtle depth, match-score badges carried over from SousChefAI.

Recipe Detail. Title, description, ingredients list with provenance badges ("heard in narration", "seen on-screen", "from caption"), step-by-step instructions, servings/time metadata, missing-ingredient section if any, and any dietary flags inline. If the user has an allergy to something in the recipe, that ingredient appears with a red badge and a suggested swap right next to it.

Voice Search. A prominent mic button opens a full-screen capture surface. WhisperKit transcribes on-device as the user speaks, showing interim results. On confirmation, the transcript goes to /jobs/voice-search and the user sees a liquid-glass grid of structured results streaming in.

Saved / Library. Their personal collection. Filter by dietary match, by source (reel vs. voice vs. manual), by time.

Profile / Preferences. Apple Sign In account info, dietary restrictions, allergies (explicitly separated from preferences in the UI — allergies get an "important safety info" framing, preferences are soft), nutrition goals, pantry staples.

4.3 On-Device Services

Only two pieces of real logic run on the device:

WhisperKit transcription for voice search. Model downloaded on first use (Whisper-small distilled, ~150MB — usable quality for short queries, modest size). Runs locally, no audio leaves the device for voice search.

SwiftData local cache. Read-through cache of the user's recipes so the app opens to populated content even when the backend is unreachable. Writes go to the backend first; local cache updates on success.

4.4 Service Layer (Protocol-Based)

Carry over the protocol-based architecture from the SousChefAI codebase. RecipeService, AuthService, JobService, VoiceSearchService are protocols. Concrete implementations hit the backend. Mock implementations are used in previews and tests. This was already done well in the starter code — the Gemini-direct services get removed and replaced with thin clients that hit /recipes, /jobs, etc.


5. Backend (FastAPI)

5.1 Process Model

Two long-running processes:

  • API process: uvicorn app.main:app. Handles HTTP. Does not do long-running work; every heavy operation is enqueued.
  • Worker process(es): arq app.worker.WorkerSettings. Pulls jobs from Redis, runs the video pipeline. Start with one worker; adjust concurrency based on GPU utilization.

Both run under systemd on the desktop. Redis runs as a third systemd unit. Postgres is also systemd-managed on the same box.

5.2 Module Layout

app/
  main.py                 # FastAPI app factory, middleware
  config.py               # Pydantic settings
  db.py                   # SQLAlchemy engine, session
  auth/
    apple.py              # Verify Apple identity tokens
    jwt.py                # Issue/verify own session JWTs
    deps.py               # FastAPI dependencies (current_user, etc)
  models/                 # SQLAlchemy models
  schemas/                # Pydantic request/response schemas
  routers/
    auth.py
    me.py
    jobs.py
    recipes.py
  services/
    ingest.py             # High-level reel ingestion orchestration
    voice_search.py
    dietary.py            # Violation detection + swap suggestion
  pipeline/
    download.py           # yt-dlp wrapper
    audio.py              # ffmpeg + faster-whisper
    frames.py             # PySceneDetect + OCR
    vlm.py                # Qwen2.5-VL client
    fusion.py             # Final structured-recipe generation
  worker.py               # arq worker settings + job functions
migrations/               # Alembic
tests/

5.3 API Surface

All endpoints require a Bearer session JWT except /auth/apple and health checks.

POST   /auth/apple             Exchange Apple identity token → session JWT + refresh
POST   /auth/refresh           Refresh session JWT
GET    /me                     Current user + preferences
PUT    /me/preferences         Update dietary restrictions, allergies, goals

POST   /jobs/ingest-reel       Body: {source_url, caption?}. Returns job_id.
POST   /jobs/voice-search      Body: {query}. Returns job_id.
POST   /jobs/scale-recipe      Body: {recipe_id, limiting_ingredient, quantity}
GET    /jobs/{id}              Poll job status

GET    /recipes                List current user's recipes (paginated)
GET    /recipes/{id}           Recipe detail with ingredients + steps
PUT    /recipes/{id}           Edit a recipe
DELETE /recipes/{id}
POST   /recipes/{id}/save      Explicit save to library (vs. inbox)

Job-based endpoints always return quickly with a job_id. The client polls (or receives an APNs push) and then fetches the final object. This pattern matters because the video pipeline can take 15-40 seconds, and holding a single request open that long through Cloudflare Tunnel and a mobile connection is unreliable.
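The polling half of this pattern can be sketched as follows. This is a minimal illustration in Python (handy for a backend smoke-test script; the iOS client would do the equivalent in Swift). The function name poll_job, the delay values, and the 90-second timeout are assumptions, not part of the plan; the terminal statuses come from the jobs table.

```python
import time
from typing import Callable

TERMINAL_STATES = {"done", "failed"}  # terminal values of jobs.status


def poll_job(fetch_status: Callable[[], str],
             initial_delay: float = 1.0,
             max_delay: float = 8.0,
             timeout: float = 90.0,
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status until the job is done/failed, backing off between calls.

    fetch_status is a hypothetical callable that wraps GET /jobs/{id}
    and returns the job's status string.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # exponential backoff with a cap
```

Injecting sleep keeps the loop testable without real waiting. With the defaults, a 30-second job costs about six status requests.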

5.4 Auth Flow (Apple Sign In)

  1. iOS app performs Apple Sign In, receives an identity token (a JWT signed by Apple)
  2. App POSTs the token to /auth/apple
  3. Backend verifies the token: fetches Apple's public keys from appleid.apple.com/auth/keys, validates the signature, and checks the issuer (https://appleid.apple.com), the audience (your app's bundle ID), and the expiration
  4. Extract the stable sub (Apple's opaque user ID). Look up or create a users row keyed on apple_user_id
  5. Issue a session JWT (15 min) and a refresh token (30 days). Return both
  6. Client stores them in the App Group Keychain so the share extension can access them

The session JWT contains user_id and exp. Every subsequent request is authenticated against it. The refresh endpoint issues a new session JWT given a valid refresh token.
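The claim checks in step 3 can be sketched in isolation. The function below assumes the signature has already been verified against Apple's published keys and the payload decoded into a dict (in the real backend PyJWT would do that, and can enforce audience and issuer itself). validate_apple_claims and the example bundle ID are hypothetical names for illustration.

```python
import time
from typing import Optional

APPLE_ISSUER = "https://appleid.apple.com"


def validate_apple_claims(claims: dict, bundle_id: str,
                          now: Optional[float] = None) -> str:
    """Check iss/aud/exp on an ALREADY signature-verified payload.

    Returns the stable Apple user ID (the sub claim) or raises ValueError.
    This does not verify the signature; that step must happen first.
    """
    now = time.time() if now is None else now
    if claims.get("iss") != APPLE_ISSUER:
        raise ValueError("unexpected issuer")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if bundle_id not in audiences:
        raise ValueError("token was issued for a different app")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise ValueError("token expired")
    sub = claims.get("sub")
    if not sub:
        raise ValueError("missing sub claim")
    return sub
```

The returned sub is what gets stored in users.apple_user_id.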

Important: the share extension does not initiate sign-in. If there's no valid JWT when a user shares, the extension tells them to open the main app once. This is fine — they'll have signed in during onboarding.


6. Database Schema

CREATE TABLE users (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  apple_user_id   TEXT UNIQUE NOT NULL,
  email           TEXT,
  display_name    TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE user_preferences (
  user_id               UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
  dietary_restrictions  TEXT[] NOT NULL DEFAULT '{}',  -- e.g. vegan, vegetarian, halal
  allergies             TEXT[] NOT NULL DEFAULT '{}',  -- e.g. peanut, shellfish, dairy
  nutrition_goals       TEXT,
  pantry_staples        TEXT[] NOT NULL DEFAULT '{}',
  updated_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE recipes (
  id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id          UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  title            TEXT NOT NULL,
  description      TEXT,
  source_type      TEXT NOT NULL,    -- reel, voice_search, manual, web
  source_url       TEXT,
  source_platform  TEXT,             -- instagram, tiktok, web
  caption          TEXT,
  transcript       TEXT,
  servings         INT,
  estimated_time   TEXT,
  status           TEXT NOT NULL,    -- pending, processing, ready, failed
  error_message    TEXT,
  created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_recipes_user_status ON recipes(user_id, status);

CREATE TABLE ingredients (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  recipe_id         UUID NOT NULL REFERENCES recipes(id) ON DELETE CASCADE,
  display_order     INT NOT NULL,
  name              TEXT NOT NULL,
  quantity          NUMERIC,
  unit              TEXT,
  raw_text          TEXT,            -- original phrasing from source
  provenance        TEXT NOT NULL,   -- caption, transcript, overlay, vlm_inferred, user
  confidence        REAL NOT NULL DEFAULT 1.0,
  violation_type    TEXT,            -- allergy, restriction, null
  violation_label   TEXT,            -- peanut, dairy, etc.
  suggested_swap    TEXT
);

CREATE INDEX idx_ingredients_recipe ON ingredients(recipe_id);

CREATE TABLE recipe_steps (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  recipe_id       UUID NOT NULL REFERENCES recipes(id) ON DELETE CASCADE,
  step_order      INT NOT NULL,
  instruction     TEXT NOT NULL,
  timer_seconds   INT
);

CREATE INDEX idx_steps_recipe ON recipe_steps(recipe_id, step_order);

CREATE TABLE jobs (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id       UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  type          TEXT NOT NULL,       -- ingest_reel, voice_search, scale_recipe
  status        TEXT NOT NULL,       -- queued, processing, done, failed
  payload       JSONB NOT NULL,
  result        JSONB,
  recipe_id     UUID REFERENCES recipes(id) ON DELETE SET NULL,
  error         TEXT,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  started_at    TIMESTAMPTZ,
  completed_at  TIMESTAMPTZ
);

CREATE INDEX idx_jobs_user_status ON jobs(user_id, status);

Allergies are stored separately from restrictions deliberately — the UI treats them with different severity, the dietary check weights them differently, and this separation scales cleanly to medical-severity flagging later if needed.

The provenance field on ingredients is the product moat. Pestle and similar apps output flat JSON. We can show "3 tbsp olive oil (seen on screen)" vs. "pinch of salt (inferred)" with visual distinction, giving users calibrated trust.
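As a rough sketch of how provenance could surface in display text, the snippet below maps the provenance values from the ingredients table to the user-facing labels used in this plan. The Ingredient dataclass and format_ingredient are illustrative names, and the "added by you" label for user-entered rows is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Keys match ingredients.provenance; the "user" label is an assumed wording.
PROVENANCE_LABELS = {
    "caption": "from caption",
    "transcript": "heard in narration",
    "overlay": "seen on screen",
    "vlm_inferred": "inferred",
    "user": "added by you",
}


@dataclass
class Ingredient:
    name: str
    provenance: str
    quantity: Optional[float] = None
    unit: Optional[str] = None


def format_ingredient(ing: Ingredient) -> str:
    """Render e.g. '3 tbsp olive oil (seen on screen)'."""
    parts = []
    if ing.quantity is not None:
        parts.append(f"{ing.quantity:g}")  # 3.0 -> '3', 0.5 -> '0.5'
    if ing.unit:
        parts.append(ing.unit)
    parts.append(ing.name)
    label = PROVENANCE_LABELS.get(ing.provenance, ing.provenance)
    return f"{' '.join(parts)} ({label})"
```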


7. Pipelines

7.1 Reel Ingestion

When the worker picks up an ingest_reel job, it runs roughly this sequence. Stages short-circuit as soon as we have confident structured output.

Stage 1 — Download. yt-dlp fetches the video and extracts metadata. The caption is usually populated by yt-dlp from the post description. Typical reel: 5-15 MB, 30-90 seconds. Store in /tmp/jobs/{job_id}/.

Stage 2 — Caption-first extraction. Feed the caption to the fusion LLM with a structured-output prompt asking for a recipe JSON. If the model returns a recipe with all required fields populated and confidence above a threshold, skip the rest and go straight to dietary check. This is the Pestle path, and it handles probably 50-70% of reels at near-zero latency and compute cost.
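The short-circuit decision in Stage 2 might look like the sketch below. The required field names, the confidence key, and the 0.8 threshold are all assumptions chosen for illustration; the real values depend on the fusion prompt's output schema.

```python
# Assumed shape of the caption-only LLM output: a dict with these keys
# plus a self-reported "confidence" in [0, 1].
REQUIRED_FIELDS = ("title", "ingredients", "steps")


def caption_recipe_is_sufficient(recipe: dict, min_confidence: float = 0.8) -> bool:
    """True when the caption-only parse is good enough to skip stages 3-8."""
    for field in REQUIRED_FIELDS:
        if not recipe.get(field):  # missing, None, or empty list/string
            return False
    return float(recipe.get("confidence", 0.0)) >= min_confidence
```

A missing confidence value is treated as zero, so an under-specified model response falls through to the full pipeline rather than being trusted.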

Stage 3 — Caption link extraction. If the caption contains an external URL (common — many creators link their blog), fetch it, look for schema.org/Recipe JSON-LD, and use that if present. This handles another chunk of cases cleanly.
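A stdlib-only sketch of the JSON-LD lookup is shown below; the same helper would serve the voice-search fetch step. It assumes the page HTML has already been fetched (with httpx in the real backend). find_recipe_jsonld is a hypothetical name, and the handling of @graph wrappers and list-valued @type reflects common blog markup rather than anything this plan specifies.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buf))
            self._in_jsonld = False


def _is_recipe(node) -> bool:
    if not isinstance(node, dict):
        return False
    node_type = node.get("@type")
    types = node_type if isinstance(node_type, list) else [node_type]
    return "Recipe" in types


def find_recipe_jsonld(html: str):
    """Return the first schema.org Recipe object in the page, or None."""
    parser = _JsonLdCollector()
    parser.feed(html)
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block; try the next one
        nodes = data if isinstance(data, list) else [data]
        expanded = []
        for node in nodes:
            expanded.append(node)
            if isinstance(node, dict) and isinstance(node.get("@graph"), list):
                expanded.extend(node["@graph"])
        for node in expanded:
            if _is_recipe(node):
                return node
    return None
```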

Stage 4 — Audio + ASR. ffmpeg extracts mono 16kHz WAV. faster-whisper with large-v3-turbo transcribes with word-level timestamps. Typical runtime: 3-6 seconds for a 60s clip on a 3060 Ti.
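The audio extraction step reduces to one ffmpeg invocation. The sketch below only builds the argument list, so it can be checked without ffmpeg installed; the helper name and file names are illustrative, and the worker would pass the result to subprocess.run(args, check=True).

```python
from pathlib import Path


def ffmpeg_extract_audio_args(video_path: Path, wav_path: Path) -> list:
    """Build the ffmpeg argv for a mono, 16 kHz, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",          # overwrite output if it already exists
        "-i", str(video_path),
        "-vn",                   # drop the video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        "-c:a", "pcm_s16le",     # uncompressed 16-bit PCM
        str(wav_path),
    ]
```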

Stage 5 — Frame sampling. PySceneDetect identifies scene changes — reels cut exactly when something important happens, so this outperforms uniform sampling. Cap at 12 frames. ffmpeg extracts them as JPEGs.
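The plan fixes the cap at 12 frames but does not say which scenes to keep when a reel has more cuts than that. One plausible policy, sketched here as an assumption, is to keep frames spread evenly across the detected scene timestamps so that both the start and the end of the video stay represented.

```python
def cap_frames(timestamps: list, max_frames: int = 12) -> list:
    """Keep at most max_frames timestamps, spread evenly over the scene list."""
    ordered = sorted(timestamps)
    if len(ordered) <= max_frames:
        return ordered
    if max_frames == 1:
        return [ordered[0]]
    # step > 1 here, so the rounded indices are strictly increasing (no duplicates)
    step = (len(ordered) - 1) / (max_frames - 1)
    return [ordered[round(i * step)] for i in range(max_frames)]
```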

Stage 6 — OCR. PaddleOCR on each frame. On-screen text in reels is often the most reliable source of exact ingredient quantities ("2 tbsp olive oil" flashed over a pan shot). Keep detected text with per-frame timestamps.

Stage 7 — VLM captioning. Qwen2.5-VL-7B describes each frame with a prompt focused on cooking context ("What ingredients, tools, and cooking actions are visible? What stage of preparation does this show?"). Keep the short descriptions. Typical: 2-4 seconds per frame in batched mode.

Stage 8 — Fusion. Qwen2.5-14B gets: caption, transcript (with timestamps), OCR text (with timestamps), VLM frame descriptions (with timestamps). The prompt asks for structured JSON with a per-field provenance tag. This is the stage that distinguishes "we confirmed from the overlay" from "we inferred from the voiceover".
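One way to hand the fusion model its four inputs is a single time-ordered evidence list, so an overlay and a voiceover line from the same moment sit next to each other. The sketch below is only an assumed layout for the prompt context; the Evidence type, the source names, and the line format are illustrative, and the actual prompt wording is left to the fusion prompt work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    t: float      # seconds from the start of the video
    source: str   # "transcript", "overlay", or "vlm" (assumed names)
    text: str


def build_fusion_context(caption: str, evidence: list) -> str:
    """Merge all timestamped evidence into one chronologically ordered block."""
    lines = [f"CAPTION: {caption.strip() or '(none)'}", "TIMELINE:"]
    for ev in sorted(evidence, key=lambda e: e.t):
        lines.append(f"[{ev.t:.1f}s] {ev.source}: {ev.text}")
    return "\n".join(lines)
```

Keeping the source name on every line is what lets the model attach a per-field provenance tag to its output.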

Stage 9 — Dietary check. See §7.3.

Stage 10 — Persist. Write recipe, ingredients, steps rows. Mark job done. Send APNs push if the user is opted in.

End-to-end latency is roughly 15 to 40 seconds, depending on how many stages were needed. Caption-only path: 2-4 seconds.

7.2 Voice Search

On-device (iOS). The user holds the mic while WhisperKit transcribes, with interim results shown live. On confirm, the app posts the final transcript to /jobs/voice-search.

Backend.

  1. LLM interprets the query into structured search terms, extracting any implicit constraints ("dinner tonight, quick, high protein" → {meal_type: dinner, max_time: 30m, dietary_emphasis: high_protein})
  2. Constraints are merged with the user's stored preferences and allergies
  3. Backend performs a web search (Brave Search API or Google Custom Search — pick one at implementation time, Brave is cheaper with a better API) for recipe URLs matching the terms
  4. Fetch the top ~5 results with httpx. For each, look for JSON-LD @type: Recipe — most food blogs have this. For pages without JSON-LD, feed the HTML through a structuring LLM call
  5. Filter results against the user's allergies as a hard gate, and score against soft preferences
  6. Return the top 3-5 as structured recipes
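Steps 5 and 6 above can be sketched as a hard filter followed by a soft ranking. The result shape (dicts with a tags list), the tag-overlap score, and the injected has_allergen predicate are all assumptions for illustration; in the real backend the predicate would come from the dietary dictionary in §7.3.

```python
from typing import Callable, Iterable


def rank_results(results: list,
                 has_allergen: Callable[[dict], bool],
                 preferred_tags: Iterable,
                 limit: int = 5) -> list:
    """Drop anything that trips an allergy, then order by soft-preference overlap."""
    wanted = {t.lower() for t in preferred_tags}
    safe = [r for r in results if not has_allergen(r)]  # hard gate, no exceptions

    def score(recipe: dict) -> int:
        tags = {t.lower() for t in recipe.get("tags", [])}
        return len(tags & wanted)

    # sorted() is stable, so ties keep the search engine's original order
    return sorted(safe, key=score, reverse=True)[:limit]
```

Allergies never enter the score: a recipe that fails the gate cannot be ranked back in by matching many preferences.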

Voice search recipes are not auto-saved — they appear in the result view and the user taps to save to their library.

7.3 Dietary Intelligence

Two severity tiers internally, visually distinct in the UI:

Allergies (hard). Ingredient match triggers a red warning. The fusion LLM is also asked to propose an ingredient-level swap in the same call that generates the recipe, so it's stored alongside the ingredient row. If a safe swap cannot be produced ("this recipe is fundamentally built around peanuts"), the field is left null and the UI shows "no safe substitute — consider skipping this recipe".

Restrictions (soft). Vegetarian, vegan, gluten-free, etc. Orange warning, always accompanied by a swap suggestion. These are treated as preferences, not safety issues.

Detection is name matching plus synonym expansion, maintained as a small dictionary: "peanut" matches peanut, peanut butter, peanut oil, groundnut, arachis. Keep this dictionary in the backend repo, version-controlled. The LLM is involved only in generating the swap text, not in detection, which needs to be reliable and fast.
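A minimal sketch of dictionary-based detection follows. The two dictionary entries are taken from examples in this plan and are nowhere near complete; the whole-word regex with optional plural endings is an assumed matching strategy, not a settled design.

```python
import re

# Tiny illustrative dictionary; the real one needs a curated audit pass.
ALLERGEN_SYNONYMS = {
    "peanut": ["peanut", "peanut butter", "peanut oil", "groundnut", "arachis"],
    "dairy": ["milk", "butter", "cream", "cheese", "yogurt", "whey", "casein"],
}


def _compile(terms):
    # Longest terms first so multi-word synonyms win over their prefixes.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    # Whole-word match with an optional plural ending.
    return re.compile(rf"\b(?:{alternation})(?:s|es)?\b", re.IGNORECASE)


_PATTERNS = {label: _compile(terms) for label, terms in ALLERGEN_SYNONYMS.items()}


def detect_violations(ingredient_name: str, user_allergies: list) -> list:
    """Return the subset of the user's allergies this ingredient matches."""
    return [a for a in user_allergies
            if a in _PATTERNS and _PATTERNS[a].search(ingredient_name)]
```

Whole-word matching avoids flagging "butternut squash" as dairy, but this naive version still over-flags ("peanut butter" matches the dairy term "butter") and under-flags ("buttermilk" matches nothing). Cases like these are what the dictionary audit has to resolve.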

The check runs on the fused recipe rather than on raw caption or transcript text, so swap suggestions are informed by the full recipe context (the LLM knows what role the ingredient plays).


8. Infrastructure & Deployment

Desktop (primary dev/prod for v1). Ubuntu 24.04 on WSL2 as you currently have it. Four systemd units: postgres, redis, recipe-api (uvicorn), recipe-worker (arq). Logging stays simple: journalctl is fine for now.

Cloudflare Tunnel exposes api.wahwa.com (or similar subdomain) to the FastAPI port. Same pattern as your existing setup. No special handling needed for the video pipeline since videos don't traverse the tunnel — they're downloaded server-side.

Secrets live in a .env file that systemd loads via EnvironmentFile=. Apple keys, DB credentials, search API key, optional Gemini fallback key.

Migrations. Alembic. alembic upgrade head runs as an ExecStartPre= step on the API unit so deploys apply migrations automatically.

CI (GitHub Actions). Lint, type-check, unit tests, migration check (Alembic can detect schema drift). Not doing CD to the apartment box initially — you'll deploy by SSH and git pull + restart units. If that becomes annoying, wire up a webhook.

iOS distribution. TestFlight for internal testing. Use your existing paid Apple Developer account. Standard provisioning.

Monitoring. For v1, a simple /health endpoint on the API and a daily cron emailing you the count of jobs completed and failed in the last 24h. If you want nicer dashboards, add Grafana + Loki later, but that's polish.


9. Scaling Roadmap

The v1 setup has known bottlenecks. Here's the progression if/when usage demands it.

Phase 1 (now): apartment desktop, everything on one box. Fine for you alone and a handful of beta testers. Caveats: a power outage or ISP glitch takes the service down; desktop-off when you travel is an outage; all GPU throughput is shared with whatever else you're running locally.

Phase 2: Oracle Free ARM VM as edge relay. The Oracle VM terminates Cloudflare Tunnel traffic and forwards over WireGuard to your desktop. This matches your existing roadmap for Gitea. Benefit: the public endpoint stays up even when your desktop is rebooting; you can fall back to a "still processing — desktop is restarting" state rather than a hard 502. Desktop does all real work.

Phase 3: managed Postgres. Neon or Supabase Postgres when you start caring about automated backups, PITR, read replicas, or just not babysitting a DB. The schema is vanilla Postgres plus pgvector, which both providers support, so the move is a dump and restore followed by a DSN change. Do this before scaling users, not after.

Phase 4: detach API from worker. The API moves to a cheap VPS (Fly, Hetzner). Workers stay on the GPU machine (your apartment now, or a colocated box, or RunPod/Lambda GPU). Communication is over Redis still; Redis moves to the VPS side. The GPU machine only handles arq jobs and needs no inbound public traffic.

Phase 5: multi-region or managed GPU. If latency for Asia/Europe users matters, or if your apartment GPU is constantly saturated, move workers to a managed GPU provider (RunPod has pay-per-second GPU; Modal and Beam are similar). The fusion/VLM stage is where cost lives; swap to Gemini Flash at that point if per-call cost undercuts self-hosted amortization.

Phase 6: if the app takes off. Object storage for thumbnails and cached video frames (S3/R2). CDN for recipe images. A separate analytics pipeline. User-generated content moderation. Most of this is generic scaling and not worth pre-designing.

At each phase, the code shouldn't change much — the whole point of putting ML behind a worker queue and keeping API stateless is that you can redraw the deployment topology without refactoring.


10. Build Order / Milestones

Rough weekly targets assuming focused part-time work. Collapse where you're fast, expand where something fights back.

M0 — Scaffolding (week 1). New iOS project scaffolded with SwiftUI + share extension target + App Group. FastAPI skeleton with /health, Alembic baseline migration, Postgres running, Redis running, arq hello-world. Cloudflare Tunnel pointed at the API. Apple Sign In round-trip working end-to-end: iOS app signs in, backend verifies the token, issues session JWT, iOS stores and sends it. No real features yet; just prove the whole loop closes.

M1 — Caption-first reel ingestion (weeks 2-3). Share extension writes job and POSTs. Worker runs yt-dlp, extracts caption, runs caption-only LLM parse, writes recipe to DB. iOS main app shows recipe list and detail view with the liquid-glass card design ported from SousChefAI. This alone is a working product for maybe 60% of reels and a satisfying demo.

M2 — Full video understanding (weeks 4-6). Add stages 4-8 from §7.1. faster-whisper integration first (clean visible improvement). Then frames + OCR. Then VLM. Then fusion prompt engineering — this is where real time gets spent; the prompt for fusion is the heart of the app's quality. Per-field provenance rendered in the UI.

M3 — Voice search (week 7). WhisperKit integration. /jobs/voice-search endpoint. Web search integration (Brave). HTML-to-recipe extraction. Results UI.

M4 — Dietary intelligence (week 8). Preferences onboarding flow. Allergy dictionary. Violation detection during fusion. Swap-suggestion prompt. UI treatment for allergy vs. restriction severity.

M5 — Polish and reliability (weeks 9-10). APNs push on job completion. Offline-queue handling in the share extension. Error states throughout. Recipe scaling feature from the SousChef codebase (already works, just needs the backend route). Oracle ARM relay setup for Phase 2 resilience. TestFlight build for beta testers.

Roughly ten weeks of part-time work, plus or minus, to reach a TestFlight-ready beta. The v1 defined here is larger than Pestle's feature set on the reel-understanding side specifically, matches or exceeds on dietary handling, and adds voice search as a net-new differentiator.


11. Open Risks & Decisions Deferred

yt-dlp breakage. Instagram periodically changes things that break yt-dlp for days at a time. Mitigation: pin a known-good version, monitor, have a manual-update playbook, keep an eye on the yt-dlp issue tracker. Longer-term: if breakage becomes frequent enough to hurt UX, consider a fallback where the iOS share extension actually downloads the video from the reel before POSTing — iOS gets the video as part of the share payload sometimes, though this depends on Instagram's share sheet contract and isn't guaranteed.

Local model quality vs. cost. Qwen2.5-VL-7B and Qwen2.5-14B are good but not Gemini-3-Pro good. If the fusion stage is producing poor recipes, the escape hatch is to route fusion to Gemini 2.5 Flash or 3 Flash (order of a cent per reel, still cheaper than Pestle's on-device compute amortized over dev time). Build the fusion layer with a clean model-swap interface from day one so this is a config change.
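The model-swap interface could be as small as a protocol plus a registry keyed by a config value. Everything named below (FusionModel, register_backend, get_fusion_model, the "echo" backend) is hypothetical; real backends would wrap the local Qwen server and the Gemini client behind the same generate method.

```python
from typing import Callable, Dict, Protocol


class FusionModel(Protocol):
    def generate(self, prompt: str) -> str:
        """Return the model's raw JSON text for a fusion prompt."""
        ...


_REGISTRY: Dict[str, Callable[[], FusionModel]] = {}


def register_backend(name: str, factory: Callable[[], FusionModel]) -> None:
    _REGISTRY[name] = factory


def get_fusion_model(backend_name: str) -> FusionModel:
    """Look up the backend named in config (e.g. a FUSION_BACKEND setting)."""
    factory = _REGISTRY.get(backend_name)
    if factory is None:
        raise ValueError(f"unknown fusion backend: {backend_name!r}")
    return factory()


class EchoFusion:
    """Stand-in backend for tests; performs no inference."""

    def generate(self, prompt: str) -> str:
        return '{"title": "stub", "prompt_chars": %d}' % len(prompt)


register_backend("echo", EchoFusion)
```

Pipeline code only ever calls get_fusion_model(settings_value).generate(prompt), so switching from the local model to a hosted one touches configuration and one registration line.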

Desktop uptime. Your desktop going down while you're out of town means no service. Phase 2 of the scaling roadmap mitigates but doesn't eliminate this. For beta testers, be upfront that the service is best-effort during prototyping.

Share extension storage of refresh tokens. The refresh token lives in the App Group Keychain, accessible to both the extension and the main app. This is a standard pattern but it's worth double-checking the Keychain access group configuration at build time — if you get it wrong the extension can't read what the main app wrote, and the failure mode is silent.

Dietary dictionary completeness. The allergy/restriction synonym dictionary needs careful curation. "Dairy" is a category, not an ingredient — it has to match milk, butter, cream, cheese, yogurt, whey, casein, and so on. Getting this right affects the core safety feature. Plan an explicit audit pass on the dictionary before exposing the allergy feature to beta testers.

WhisperKit model size on first use. First-time voice search will download ~150 MB. Either pre-download on app install (adds to the install size) or show a clear first-use spinner. Pre-downloading is nicer UX but makes the app heavier; showing a spinner on first use is fine if it's clearly communicated.

Fusion prompt engineering. This is the dominant quality lever and it's the least predictable piece of work. Budget roughly twice what you think it'll take. Keep a corpus of 20-30 test reels representing different recipe types (quick snacks, long-form cooking, baking, drinks) and evaluate changes against that set.
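To make "evaluate changes against that set" concrete, one simple starting metric is ingredient precision and recall per test reel. This is an assumed metric, not one the plan prescribes; exact name matching after lowercasing is deliberately crude and will undercount near-matches such as "olive oil" vs. "extra virgin olive oil".

```python
def ingredient_scores(expected: list, predicted: list) -> tuple:
    """Return (precision, recall) of predicted ingredient names vs. a hand-labeled list."""
    exp = {e.strip().lower() for e in expected}
    pred = {p.strip().lower() for p in predicted}
    hits = len(exp & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(exp) if exp else 0.0
    return precision, recall
```

Averaging these two numbers over the test corpus before and after a prompt change gives a quick regression signal, even if the final judgment stays manual.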