2026-04-29 11:50:44 -05:00
# Reel Recipe App — Implementation Plan
This plan covers a native iOS app that turns shared Instagram Reels and TikToks into structured, searchable recipes, backed by a Python/FastAPI worker running on your apartment desktop. Voice search and dietary intelligence are first-class features alongside reel ingestion.
---
## 1. Scope & Goals
### In scope for v1
- iOS app (SwiftUI, iOS 18+)
- Share extension accepting reels/TikToks from Instagram, TikTok, and any URL sharing surface
- Video-understanding pipeline: caption parsing → ASR → frame OCR → VLM → fusion LLM
- Voice search with on-device Whisper transcription, backend web search + LLM structuring
- Dietary preferences and allergies with per-ingredient swap suggestions on violation
- Apple Sign In authentication
- Self-hosted Postgres on apartment desktop, accessed via Cloudflare Tunnel
### Out of scope for v1
- Android
- Fridge scanning / computer vision of ingredients (only the design language carries over from the SousChefAI codebase)
- Full recipe regeneration on dietary conflict (ingredient swaps only for now)
- Community / shared recipe features
- Grocery list integrations beyond Apple Reminders
### Non-goals / philosophy
- Do not scrape Instagram or TikTok from the server. All ingestion is user-initiated through the share sheet. The backend may fetch the public video via yt-dlp, but only after the user has explicitly shared it.
- Do not embed third-party API keys in the iOS binary. All LLM/VLM inference runs on the backend.
- Do not require a network round-trip for things that can reasonably happen on-device (voice transcription for search, UI state, cache).
---
## 2. Architecture Overview
```
┌─────────────────────────────────────────────┐
│              iOS App (iOS 18+)              │
│   ┌────────────────┐  ┌─────────────────┐   │
│   │ Share Extension│  │    Main App     │   │
│   │ (thin, <50MB)  │  │    (SwiftUI)    │   │
│   └────────┬───────┘  └────────┬────────┘   │
│            │                   │            │
│   ┌────────┴───────────────────┴────────┐   │
│   │ App Group: pending-jobs, cache      │   │
│   └────────────────┬────────────────────┘   │
│                    │                        │
│   ┌────────────────┴────────────────────┐   │
│   │ WhisperKit (on-device, voice only)  │   │
│   └─────────────────────────────────────┘   │
└───────────────────┬─────────────────────────┘
                    │ HTTPS (Cloudflare Tunnel)
                    │ Bearer JWT
┌───────────────────┴─────────────────────────┐
│       Apartment Desktop (Ubuntu/WSL2)       │
│   ┌─────────────────────────────────────┐   │
│   │ FastAPI app (uvicorn)               │   │
│   │  • /auth /me /jobs /recipes         │   │
│   │  • Pushes jobs onto arq queue       │   │
│   └────────────┬────────────────────────┘   │
│                │                            │
│   ┌────────────┴────────────┐               │
│   │ Redis (job queue)       │               │
│   └────────────┬────────────┘               │
│                │                            │
│   ┌────────────┴────────────────────┐       │
│   │ arq Worker(s)                   │       │
│   │  • yt-dlp / ffmpeg              │       │
│   │  • faster-whisper (GPU)         │       │
│   │  • PySceneDetect + PaddleOCR    │       │
│   │  • Qwen2.5-VL-7B (GPU)          │       │
│   │  • Qwen2.5-14B fusion (GPU)     │       │
│   └────────────┬────────────────────┘       │
│                │                            │
│   ┌────────────┴────────────────────┐       │
│   │ Postgres 16 + pgvector          │       │
│   └─────────────────────────────────┘       │
└─────────────────────────────────────────────┘
```
The share extension is deliberately tiny: it reads the shared URL and any caption iOS hands over, writes a pending-job record into the App Group container, posts it to the backend, and exits. It never downloads the video, never runs ML, never exceeds ~50MB of memory. iOS's 120-ish-MB share extension ceiling is not a constraint here because we designed around it.
The main app is the primary surface. It reads pending jobs from the App Group on launch, polls job status, renders recipes, handles voice search, and manages preferences. The main app has no meaningful memory limit.
The desktop backend does all heavy work. Two processes: the FastAPI request handler (lightweight, non-blocking) and one or more arq workers that pull jobs from Redis and run the video pipeline. This split matters — you don't want Whisper blocking a health check.
---
## 3. Tech Stack
**iOS (SwiftUI, iOS 18+)**
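- SwiftUI with Liquid Glass-style materials for the polished card UI (Apple's system Liquid Glass APIs ship with iOS 26; on iOS 18, approximate the look with SwiftUI's built-in materials)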
- Apple Sign In for auth
- URLSession + async/await for networking
- SwiftData for local cache of recipes (read-through, write-through to backend)
- Share Extension target sharing an App Group container with the main app
- WhisperKit for on-device voice transcription (voice search only; not used for reels)
- APNs for job-ready notifications
**Backend (Python 3.12)**
- FastAPI with uvicorn
- SQLAlchemy 2.x async + Alembic migrations
- Postgres 16 with pgvector (for future semantic recipe search)
- Redis 7 + arq for job queue
- PyJWT + cryptography for Apple identity token verification and own session JWT issuance
- Pydantic v2 for schemas
- httpx for outbound HTTP
**ML / media stack (all on the desktop 3060 Ti + WSL2 Ubuntu)**
- yt-dlp (pinned, with a scheduled pull for updates)
- ffmpeg
- faster-whisper (large-v3-turbo, CTranslate2 backend)
- PySceneDetect for keyframe selection
- PaddleOCR (GPU) for on-screen text extraction
- Qwen2.5-VL-7B-Instruct for frame captioning (via transformers or vLLM)
- Qwen2.5-14B-Instruct for the fusion step and dietary swap suggestions
- Escape hatch: config flag to route fusion to Gemini Flash if local model is struggling or machine is offline
**Infrastructure**
- Cloudflare Tunnel to apartment desktop (existing setup)
- GitHub Actions for CI (tests, lint, migration linting)
- Eventually: Oracle Free ARM VM as a WireGuard relay (matches your existing roadmap)
---
## 4. iOS Application
### 4.1 Share Extension
The extension has exactly one job: take the URL the user shared, post it to the backend, and get out of the way.
Flow on share:
1. Read the shared NSExtensionItem; extract the URL and any caption text iOS provides
2. Check that a user session JWT exists in the App Group Keychain. If not, show an "open the app to sign in" message and bail
3. POST `/jobs/ingest-reel` with `{source_url, caption?}` and the bearer token
4. Show a brief "Recipe queued" confirmation. Dismiss
5. If the network call fails, write the job to the App Group's `pending-jobs` directory. The main app will retry on next launch
No ML, no heavy dependencies, no video processing. This keeps the extension well under the memory limit and makes the share sheet experience feel instant.
### 4.2 Main App
Five primary surfaces:
**Home / Inbox.** Shows recipes with their processing status. A just-shared recipe appears immediately as "Processing…" with a shimmer, then transitions to a full card when the backend finishes. This is where the liquid-glass card styling lives — frosted backgrounds, subtle depth, match-score badges carried over from SousChefAI.
**Recipe Detail.** Title, description, ingredients list with provenance badges ("heard in narration", "seen on-screen", "from caption"), step-by-step instructions, servings/time metadata, missing-ingredient section if any, and any dietary flags inline. If the user has an allergy to something in the recipe, that ingredient appears with a red badge and a suggested swap right next to it.
**Voice Search.** A prominent mic button opens a full-screen capture surface. WhisperKit transcribes on-device as the user speaks, showing interim results. On confirmation, the transcript goes to `/jobs/voice-search` and the user sees a liquid-glass grid of structured results streaming in.
**Saved / Library.** Their personal collection. Filter by dietary match, by source (reel vs. voice vs. manual), by time.
**Profile / Preferences.** Apple Sign In account info, dietary restrictions, allergies (explicitly separated from preferences in the UI — allergies get an "important safety info" framing, preferences are soft), nutrition goals, pantry staples.
### 4.3 On-Device Services
Only two pieces of real logic run on the device:
**WhisperKit transcription for voice search.** Model downloaded on first use (Whisper-small distilled, ~150MB — usable quality for short queries, modest size). Runs locally, no audio leaves the device for voice search.
**SwiftData local cache.** Read-through cache of the user's recipes so the app opens to populated content even when the backend is unreachable. Writes go to the backend first; local cache updates on success.
### 4.4 Service Layer (Protocol-Based)
Carry over the protocol-based architecture from the SousChefAI codebase. `RecipeService`, `AuthService`, `JobService`, `VoiceSearchService` are protocols. Concrete implementations hit the backend. Mock implementations are used in previews and tests. This was already done well in the starter code — the Gemini-direct services get removed and replaced with thin clients that hit `/recipes`, `/jobs`, etc.
---
## 5. Backend (FastAPI)
### 5.1 Process Model
Two long-running processes:
- **API process** — `uvicorn app.main:app`. Handles HTTP. Does not do long-running work; every heavy operation is enqueued.
- **Worker process(es)** — `arq app.worker.WorkerSettings`. Pulls jobs from Redis, runs the video pipeline. Start with one worker; adjust concurrency based on GPU utilization.
Both run under systemd on the desktop. Redis runs as a third systemd unit. Postgres is also systemd-managed on the same box.
### 5.2 Module Layout
```
app/
  main.py              # FastAPI app factory, middleware
  config.py            # Pydantic settings
  db.py                # SQLAlchemy engine, session
  auth/
    apple.py           # Verify Apple identity tokens
    jwt.py             # Issue/verify own session JWTs
    deps.py            # FastAPI dependencies (current_user, etc)
  models/              # SQLAlchemy models
  schemas/             # Pydantic request/response schemas
  routers/
    auth.py
    me.py
    jobs.py
    recipes.py
  services/
    ingest.py          # High-level reel ingestion orchestration
    voice_search.py
    dietary.py         # Violation detection + swap suggestion
  pipeline/
    download.py        # yt-dlp wrapper
    audio.py           # ffmpeg + faster-whisper
    frames.py          # PySceneDetect + OCR
    vlm.py             # Qwen2.5-VL client
    fusion.py          # Final structured-recipe generation
  worker.py            # arq worker settings + job functions
migrations/            # Alembic
tests/
```
### 5.3 API Surface
All endpoints require a Bearer session JWT except `/auth/apple` and health checks.
```
POST   /auth/apple          Exchange Apple identity token → session JWT + refresh
POST   /auth/refresh        Refresh session JWT
GET    /me                  Current user + preferences
PUT    /me/preferences      Update dietary restrictions, allergies, goals
POST   /jobs/ingest-reel    Body: {source_url, caption?}. Returns job_id.
POST   /jobs/voice-search   Body: {query}. Returns job_id.
POST   /jobs/scale-recipe   Body: {recipe_id, limiting_ingredient, quantity}
GET    /jobs/{id}           Poll job status
GET    /recipes             List current user's recipes (paginated)
GET    /recipes/{id}        Recipe detail with ingredients + steps
PUT    /recipes/{id}        Edit a recipe
DELETE /recipes/{id}
POST   /recipes/{id}/save   Explicit save to library (vs. inbox)
```
Job-based endpoints always return quickly with a `job_id`. The client polls (or receives an APNs push) and then fetches the final object. This pattern matters because the video pipeline can take 15-40 seconds (§7.1), and holding a single request open that long is fragile on mobile networks and through Cloudflare Tunnel.
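
The polling side of this pattern can be sketched as follows. This is a minimal illustration, not the app's client (which is Swift): `poll_job`, its delay values, and the injected `fetch_status` callable are all assumptions made for the example. Only the terminal statuses `done` and `failed` come from the `jobs.status` column in §6.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"done", "failed"}  # terminal values of jobs.status


def poll_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    *,
    initial_delay: float = 1.0,   # assumed starting interval
    max_delay: float = 8.0,       # assumed backoff cap
    timeout: float = 120.0,       # assumed overall budget
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll GET /jobs/{id} until the job is done/failed or the timeout elapses."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() + delay > deadline:
            raise TimeoutError(f"job {job_id} still {job['status']!r} after {timeout}s")
        sleep(delay)
        # Exponential backoff keeps short jobs snappy without hammering the API.
        delay = min(delay * 2, max_delay)
```

Passing `sleep` and `clock` in as parameters keeps the loop testable without real waiting. The same shape ports directly to an async Swift loop, with the APNs push acting as an early wake-up.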
### 5.4 Auth Flow (Apple Sign In)
1. iOS app performs Apple Sign In, receives an identity token (a JWT signed by Apple)
2. App POSTs the token to `/auth/apple`
3. Backend verifies the token: fetches Apple's public keys from `appleid.apple.com/auth/keys`, validates the signature, and checks the issuer (`https://appleid.apple.com`), audience (your app's bundle ID), and expiration
4. Extract the stable `sub` (Apple's opaque user ID). Look up or create a `users` row keyed on `apple_user_id`
5. Issue a session JWT (15 min) and a refresh token (30 days). Return both
6. Client stores them in the App Group Keychain so the share extension can access them
The session JWT contains `user_id` and `exp`. Every subsequent request is authenticated against it. The refresh endpoint issues a new session JWT given a valid refresh token.
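
To make the session-token mechanics concrete, here is a minimal sketch that issues and verifies a token carrying `user_id` and `exp`. It is deliberately written against the standard library so it runs anywhere; the plan itself names PyJWT for this job. The HS256 shared-secret scheme and the function names are assumptions for illustration. Verifying Apple's identity token is a separate step that checks Apple's public keys instead.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SESSION_TTL_SECONDS = 15 * 60  # 15-minute session lifetime from step 5


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_session_jwt(user_id: str, secret: bytes, now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"user_id": user_id, "exp": int(now) + SESSION_TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_session_jwt(token: str, secret: bytes, now: Optional[float] = None) -> str:
    """Return the user_id if signature and expiry check out; raise ValueError otherwise."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] <= now:
        raise ValueError("token expired")
    return payload["user_id"]
```

In `auth/deps.py` a `current_user` dependency would call something like `verify_session_jwt` on the bearer token and turn a `ValueError` into a 401.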
Important: the share extension does not initiate sign-in. If it finds neither a valid session JWT nor a refresh token it can exchange for one when a user shares, it tells them to open the main app once. This is fine: they'll have signed in during onboarding.
---
## 6. Database Schema
```sql
CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    apple_user_id   TEXT UNIQUE NOT NULL,
    email           TEXT,
    display_name    TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE user_preferences (
    user_id               UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    dietary_restrictions  TEXT[] NOT NULL DEFAULT '{}',  -- e.g. vegan, vegetarian, halal
    allergies             TEXT[] NOT NULL DEFAULT '{}',  -- e.g. peanut, shellfish, dairy
    nutrition_goals       TEXT,
    pantry_staples        TEXT[] NOT NULL DEFAULT '{}',
    updated_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE recipes (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id          UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    title            TEXT NOT NULL,
    description      TEXT,
    source_type      TEXT NOT NULL,  -- reel, voice_search, manual, web
    source_url       TEXT,
    source_platform  TEXT,           -- instagram, tiktok, web
    caption          TEXT,
    transcript       TEXT,
    servings         INT,
    estimated_time   TEXT,
    status           TEXT NOT NULL,  -- pending, processing, ready, failed
    error_message    TEXT,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_recipes_user_status ON recipes(user_id, status);

CREATE TABLE ingredients (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    recipe_id        UUID NOT NULL REFERENCES recipes(id) ON DELETE CASCADE,
    display_order    INT NOT NULL,
    name             TEXT NOT NULL,
    quantity         NUMERIC,
    unit             TEXT,
    raw_text         TEXT,                       -- original phrasing from source
    provenance       TEXT NOT NULL,              -- caption, transcript, overlay, vlm_inferred, user
    confidence       REAL NOT NULL DEFAULT 1.0,
    violation_type   TEXT,                       -- allergy, restriction, null
    violation_label  TEXT,                       -- peanut, dairy, etc.
    suggested_swap   TEXT
);
CREATE INDEX idx_ingredients_recipe ON ingredients(recipe_id);

CREATE TABLE recipe_steps (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    recipe_id      UUID NOT NULL REFERENCES recipes(id) ON DELETE CASCADE,
    step_order     INT NOT NULL,
    instruction    TEXT NOT NULL,
    timer_seconds  INT
);
CREATE INDEX idx_steps_recipe ON recipe_steps(recipe_id, step_order);

CREATE TABLE jobs (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id       UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    type          TEXT NOT NULL,   -- ingest_reel, voice_search, scale_recipe
    status        TEXT NOT NULL,   -- queued, processing, done, failed
    payload       JSONB NOT NULL,
    result        JSONB,
    recipe_id     UUID REFERENCES recipes(id) ON DELETE SET NULL,
    error         TEXT,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    started_at    TIMESTAMPTZ,
    completed_at  TIMESTAMPTZ
);
CREATE INDEX idx_jobs_user_status ON jobs(user_id, status);
```
Allergies are stored separately from restrictions deliberately — the UI treats them with different severity, the dietary check weights them differently, and this separation scales cleanly to medical-severity flagging later if needed.
The `provenance` field on `ingredients` is the product moat. Pestle and similar apps output flat JSON. We can show "3 tbsp olive oil (seen on screen)" vs. "pinch of salt (inferred)" with visual distinction, giving users calibrated trust.
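
As a rough illustration of how a provenance tag could turn into the user-facing line, the sketch below maps each `provenance` value from the schema to a badge label. The label wording, the `Ingredient` dataclass, and `display_line` are assumptions for the example (the real rendering lives in the SwiftUI layer).

```python
from dataclasses import dataclass
from typing import Optional

# One label per value allowed in ingredients.provenance.
# The wording is an assumption, loosely based on the badge text in section 4.2.
PROVENANCE_LABELS = {
    "caption": "from caption",
    "transcript": "heard in narration",
    "overlay": "seen on screen",
    "vlm_inferred": "inferred",
    "user": "added by you",
}


@dataclass
class Ingredient:
    name: str
    provenance: str
    quantity: Optional[float] = None
    unit: Optional[str] = None
    raw_text: Optional[str] = None


def display_line(ing: Ingredient) -> str:
    """Build the ingredient text plus its provenance badge."""
    if ing.quantity is None and ing.raw_text:
        # No parsed quantity: fall back to the source's own phrasing.
        text = ing.raw_text
    else:
        parts = []
        if ing.quantity is not None:
            parts.append(f"{ing.quantity:g}")  # 3.0 -> "3", 0.5 -> "0.5"
        if ing.unit:
            parts.append(ing.unit)
        parts.append(ing.name)
        text = " ".join(parts)
    label = PROVENANCE_LABELS.get(ing.provenance, "unverified")
    return f"{text} ({label})"
```

For example, `display_line(Ingredient("olive oil", "overlay", 3, "tbsp"))` yields `3 tbsp olive oil (seen on screen)`.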
---
## 7. Pipelines
### 7.1 Reel Ingestion
When the worker picks up an `ingest_reel` job, it runs roughly this sequence. Stages short-circuit as soon as we have confident structured output.
**Stage 1 — Download.** yt-dlp fetches the video and extracts metadata. The caption is usually populated by yt-dlp from the post description. Typical reel: 5-15 MB, 30-90 seconds. Store in `/tmp/jobs/{job_id}/`.
**Stage 2 — Caption-first extraction.** Feed the caption to the fusion LLM with a structured-output prompt asking for a recipe JSON. If the model returns a recipe with all required fields populated and confidence above a threshold, skip the rest and go straight to dietary check. This is the Pestle path, and it handles probably 50-70% of reels at near-zero latency and compute cost.
**Stage 3 — Caption link extraction.** If the caption contains an external URL (common — many creators link their blog), fetch it, look for `schema.org/Recipe` JSON-LD, and use that if present. This handles another chunk of cases cleanly.
**Stage 4 — Audio + ASR.** ffmpeg extracts mono 16kHz WAV. faster-whisper with large-v3-turbo transcribes with word-level timestamps. Typical runtime: 3-6 seconds for a 60s clip on a 3060 Ti.
**Stage 5 — Frame sampling.** PySceneDetect identifies scene changes — reels cut exactly when something important happens, so this outperforms uniform sampling. Cap at 12 frames. ffmpeg extracts them as JPEGs.
**Stage 6 — OCR.** PaddleOCR on each frame. On-screen text in reels is often the most reliable source of exact ingredient quantities ("2 tbsp olive oil" flashed over a pan shot). Keep detected text with per-frame timestamps.
**Stage 7 — VLM captioning.** Qwen2.5-VL-7B describes each frame with a prompt focused on cooking context ("What ingredients, tools, and cooking actions are visible? What stage of preparation does this show?"). Keep the short descriptions. Typical: 2-4 seconds per frame in batched mode.
**Stage 8 — Fusion.** Qwen2.5-14B gets: caption, transcript (with timestamps), OCR text (with timestamps), VLM frame descriptions (with timestamps). The prompt asks for structured JSON with a per-field provenance tag. This is the stage that distinguishes "we confirmed from the overlay" from "we inferred from the voiceover".
**Stage 9 — Dietary check.** See §7.3.
**Stage 10 — Persist.** Write recipe, ingredients, steps rows. Mark job done. Send APNs push if the user is opted in.
End-to-end latency runs from roughly 15 to 40 seconds, depending on how many stages were needed. The caption-only path takes 2-4 seconds.
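
The short-circuit behaviour described at the top of this section can be sketched as a loop over stages ordered cheapest-first. Everything named here is hypothetical: `Candidate`, `JobContext`, the required-field list, and the 0.8 threshold are placeholders to show the control flow, not values the plan has settled on.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

REQUIRED_FIELDS = ("title", "ingredients", "steps")  # assumed meaning of "all required fields"
CONFIDENCE_THRESHOLD = 0.8                           # assumed; tune against real reels


@dataclass
class Candidate:
    recipe: dict
    confidence: float


@dataclass
class JobContext:
    source_url: str
    caption: Optional[str] = None
    stages_run: List[str] = field(default_factory=list)


Stage = Callable[[JobContext], Optional[Candidate]]


def is_confident(candidate: Optional[Candidate]) -> bool:
    if candidate is None or candidate.confidence < CONFIDENCE_THRESHOLD:
        return False
    # Every required field must be present and non-empty.
    return all(candidate.recipe.get(name) for name in REQUIRED_FIELDS)


def run_pipeline(ctx: JobContext, stages: List[Tuple[str, Stage]]) -> Optional[Candidate]:
    """Run stages in order and stop at the first confident result."""
    best: Optional[Candidate] = None
    for name, stage in stages:
        ctx.stages_run.append(name)
        candidate = stage(ctx)
        if is_confident(candidate):
            return candidate  # short-circuit: later, costlier stages never run
        if candidate is not None and (best is None or candidate.confidence > best.confidence):
            best = candidate
    return best  # nothing cleared the bar; caller decides whether to persist or fail
```

In practice stages 4-8 would probably be one composite stage, since fusion needs the transcript, OCR text, and frame descriptions together. The dietary check and persist steps run on whatever this function returns.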
### 7.2 Voice Search
**On-device (iOS).** User holds the mic, WhisperKit transcribes with interim results shown live. On confirm, the app posts the final transcript to `/jobs/voice-search`.
**Backend.**
1. LLM interprets the query into structured search terms, extracting any implicit constraints ("dinner tonight, quick, high protein" → `{meal_type: dinner, max_time: 30m, dietary_emphasis: high_protein}`)
2. Constraints are merged with the user's stored preferences and allergies
3. Backend performs a web search (Brave Search API or Google Custom Search — pick one at implementation time, Brave is cheaper with a better API) for recipe URLs matching the terms
4. Fetch the top ~5 results with httpx. For each, look for JSON-LD `@type: Recipe` — most food blogs have this. For pages without JSON-LD, feed the HTML through a structuring LLM call
5. Filter results against the user's allergies as a hard gate, and score against soft preferences
6. Return the top 3-5 as structured recipes
Voice search recipes are not auto-saved — they appear in the result view and the user taps to save to their library.
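
Step 4's JSON-LD lookup (also used by Stage 3 of reel ingestion) can be sketched with nothing but the standard library. The helper names are hypothetical, and this version handles only the common shapes: a top-level object, a list, or an `@graph` array. Real blog markup is messier, so treat it as a starting point.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _walk(node):
    """Yield every dict nested inside a JSON-LD value, including @graph members."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def _is_recipe(node: dict) -> bool:
    node_type = node.get("@type")
    types = node_type if isinstance(node_type, list) else [node_type]
    return "Recipe" in types


def find_recipe_jsonld(html: str) -> Optional[dict]:
    """Return the first schema.org Recipe object found in the page, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD happens in the wild; skip the block
        for node in _walk(data):
            if _is_recipe(node):
                return node
    return None
```

A `None` result is the signal to fall back to the structuring LLM call on the page HTML.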
### 7.3 Dietary Intelligence
Two severity tiers internally, visually distinct in the UI:
**Allergies (hard).** Ingredient match triggers a red warning. The fusion LLM is also asked to propose an ingredient-level swap in the same call that generates the recipe, so it's stored alongside the ingredient row. If a safe swap cannot be produced ("this recipe is fundamentally built around peanuts"), the field is left null and the UI shows "no safe substitute — consider skipping this recipe".
**Restrictions (soft).** Vegetarian, vegan, gluten-free, etc. Orange warning, always accompanied by a swap suggestion. These are treated as preferences, not safety issues.
Detection is straightforward name matching plus synonym expansion maintained as a small dictionary: "peanut" matches peanut, peanut butter, peanut oil, groundnut, arachis. Keep this dictionary in the backend repo, version-controlled. The LLM is involved only in generating the swap text, not in detection (detection needs to be reliable and fast).
This runs after the fusion stage so the swap suggestions can be informed by the full recipe context (the LLM knows what role the ingredient plays).
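
A minimal sketch of the dictionary-based detection might look like this. The two dictionary entries are a tiny excerpt for illustration. The `KNOWN_SAFE_PHRASES` exclusion list is an assumption that the plan does not mention; it is added here because plain word matching would otherwise flag "peanut butter" as dairy via "butter". Any such exclusion is safety-relevant and needs the same audit as the synonyms.

```python
import re
from typing import Dict, List

# Illustrative excerpt only; the real dictionary needs the curation pass noted in section 11.
ALLERGEN_SYNONYMS: Dict[str, List[str]] = {
    "peanut": ["peanut", "peanut butter", "peanut oil", "groundnut", "arachis"],
    "dairy": ["milk", "butter", "cream", "cheese", "yogurt", "whey", "casein"],
}

# Assumed exclusion list: phrases that contain a synonym but are not the allergen.
KNOWN_SAFE_PHRASES: Dict[str, List[str]] = {
    "dairy": ["peanut butter", "coconut milk", "cocoa butter"],
}


def _matches(term: str, text: str) -> bool:
    # Whole-word match with an optional plural "s": "peanuts" matches "peanut",
    # but "butternut" does not match "butter".
    return re.search(r"\b" + re.escape(term) + r"s?\b", text) is not None


def find_violations(ingredient_name: str, allergies: List[str]) -> List[str]:
    """Return the subset of the user's allergies that this ingredient triggers."""
    text = ingredient_name.lower()
    hits = []
    for allergen in allergies:
        candidate = text
        for phrase in KNOWN_SAFE_PHRASES.get(allergen, []):
            candidate = candidate.replace(phrase, " ")
        # Unknown allergens fall back to matching their own label.
        terms = ALLERGEN_SYNONYMS.get(allergen, [allergen])
        if any(_matches(term, candidate) for term in terms):
            hits.append(allergen)
    return hits
```

The same function works for restrictions by swapping in a restriction dictionary; only the severity attached to a hit differs.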
---
## 8. Infrastructure & Deployment
**Desktop (primary dev/prod for v1).** Ubuntu 24 on WSL2 as you currently have it. Four systemd units: `postgres`, `redis`, `recipe-api` (uvicorn), `recipe-worker` (arq). A simple unified logging setup — journalctl is fine for now.
**Cloudflare Tunnel** exposes `api.wahwa.com` (or similar subdomain) to the FastAPI port. Same pattern as your existing setup. No special handling needed for the video pipeline since videos don't traverse the tunnel — they're downloaded server-side.
**Secrets** live in a `.env` file sourced by systemd. Apple keys, DB credentials, search API key, optional Gemini fallback key.
**Migrations.** Alembic. `alembic upgrade head` runs as an `ExecStartPre=` step on the API unit so deploys apply migrations automatically.
**CI (GitHub Actions).** Lint, type-check, unit tests, migration check (Alembic can detect schema drift). Not doing CD to the apartment box initially — you'll deploy by SSH and `git pull` + restart units. If that becomes annoying, wire up a webhook.
**iOS distribution.** TestFlight for internal testing. Use your existing paid Apple Developer account. Standard provisioning.
**Monitoring.** For v1, a simple `/health` endpoint on the API and a daily cron emailing you the count of jobs completed and failed in the last 24h. If you want nicer dashboards, add Grafana + Loki later, but that's polish.
---
## 9. Scaling Roadmap
The v1 setup has known bottlenecks. Here's the progression if/when usage demands it.
**Phase 1 (now): apartment desktop, everything on one box.** Fine for you alone and a handful of beta testers. Caveats: a power outage or ISP glitch takes the service down; desktop-off when you travel is an outage; all GPU throughput is shared with whatever else you're running locally.
**Phase 2: Oracle Free ARM VM as edge relay.** The Oracle VM terminates Cloudflare Tunnel traffic and forwards over WireGuard to your desktop. This matches your existing roadmap for Gitea. Benefit: the public endpoint stays up even when your desktop is rebooting; you can fall back to a "still processing — desktop is restarting" state rather than a hard 502. Desktop does all real work.
**Phase 3: managed Postgres.** Neon or Supabase Postgres when you start caring about automated backups, PITR, read replicas, or just not babysitting a DB. The schema is vanilla Postgres, so on the application side the move is a DSN change; the existing data still has to be copied over (for example with `pg_dump` and `pg_restore`). Do this before scaling users, not after.
**Phase 4: detach API from worker.** The API moves to a cheap VPS (Fly, Hetzner). Workers stay on the GPU machine (your apartment now, or a colocated box, or RunPod/Lambda GPU). Communication is over Redis still; Redis moves to the VPS side. The GPU machine only handles arq jobs and needs no inbound public traffic.
**Phase 5: multi-region or managed GPU.** If latency for Asia/Europe users matters, or if your apartment GPU is constantly saturated, move workers to a managed GPU provider (RunPod has pay-per-second GPU; Modal and Beam are similar). The fusion/VLM stage is where cost lives; swap to Gemini Flash at that point if per-call cost undercuts self-hosted amortization.
**Phase 6: if the app takes off.** Object storage for thumbnails and cached video frames (S3/R2). CDN for recipe images. A separate analytics pipeline. User-generated content moderation. Most of this is generic scaling and not worth pre-designing.
At each phase, the code shouldn't change much — the whole point of putting ML behind a worker queue and keeping API stateless is that you can redraw the deployment topology without refactoring.
---
## 10. Build Order / Milestones
Rough weekly targets assuming focused part-time work. Collapse where you're fast, expand where something fights back.
**M0 — Scaffolding (week 1).** New iOS project scaffolded with SwiftUI + share extension target + App Group. FastAPI skeleton with `/health`, Alembic baseline migration, Postgres running, Redis running, arq hello-world. Cloudflare Tunnel pointed at the API. Apple Sign In round-trip working end-to-end: iOS app signs in, backend verifies the token, issues session JWT, iOS stores and sends it. No real features yet; just prove the whole loop closes.
**M1 — Caption-first reel ingestion (weeks 2-3).** Share extension writes job and POSTs. Worker runs yt-dlp, extracts caption, runs caption-only LLM parse, writes recipe to DB. iOS main app shows recipe list and detail view with the liquid-glass card design ported from SousChefAI. This alone is a working product for maybe 60% of reels and a satisfying demo.
**M2 — Full video understanding (weeks 4-6).** Add stages 4-8 from §7.1. faster-whisper integration first (clean visible improvement). Then frames + OCR. Then VLM. Then fusion prompt engineering — this is where real time gets spent; the prompt for fusion is the heart of the app's quality. Per-field provenance rendered in the UI.
**M3 — Voice search (week 7).** WhisperKit integration. `/jobs/voice-search` endpoint. Web search integration (Brave). HTML-to-recipe extraction. Results UI.
**M4 — Dietary intelligence (week 8).** Preferences onboarding flow. Allergy dictionary. Violation detection during fusion. Swap-suggestion prompt. UI treatment for allergy vs. restriction severity.
**M5 — Polish and reliability (weeks 9-10).** APNs push on job completion. Offline-queue handling in the share extension. Error states throughout. Recipe scaling feature from the SousChef codebase (already works, just needs the backend route). Oracle ARM relay setup for Phase 2 resilience. TestFlight build for beta testers.
About ten weeks of part-time work, plus or minus, to reach a TestFlight-ready beta. The v1 defined here is larger than Pestle's feature set on the reel-understanding side specifically, matches or exceeds it on dietary handling, and adds voice search as a net-new differentiator.
---
## 11. Open Risks & Decisions Deferred
**yt-dlp breakage.** Instagram periodically changes things that break yt-dlp for days at a time. Mitigation: pin a known-good version, monitor, have a manual-update playbook, keep an eye on the yt-dlp issue tracker. Longer-term: if breakage becomes frequent enough to hurt UX, consider a fallback where the iOS share extension actually downloads the video from the reel before POSTing — iOS gets the video as part of the share payload sometimes, though this depends on Instagram's share sheet contract and isn't guaranteed.
**Local model quality vs. cost.** Qwen2.5-VL-7B and Qwen2.5-14B are good but not Gemini-3-Pro good. If the fusion stage is producing poor recipes, the escape hatch is to route fusion to Gemini 2.5 Flash or 3 Flash (order of a cent per reel, still cheaper than Pestle's on-device compute amortized over dev time). Build the fusion layer with a clean model-swap interface from day one so this is a config change.
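
One way the model-swap interface could look, as a sketch: a small protocol plus a name-to-backend registry selected by configuration. The class names, the `FUSION_BACKEND` environment variable, and the registry itself are hypothetical; the two real backends are left as stubs because their inference code is outside the scope of this example.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Protocol


@dataclass
class FusionInput:
    caption: str
    transcript: str
    ocr_text: str
    frame_descriptions: str


class FusionBackend(Protocol):
    def fuse(self, inputs: FusionInput) -> dict: ...


class LocalQwenFusion:
    """Placeholder for the local Qwen2.5-14B call."""

    def fuse(self, inputs: FusionInput) -> dict:
        raise NotImplementedError("wire up local inference here")


class GeminiFlashFusion:
    """Placeholder for the hosted fallback."""

    def fuse(self, inputs: FusionInput) -> dict:
        raise NotImplementedError("wire up the hosted API call here")


class StubFusion:
    """Deterministic stand-in for tests and local development."""

    def fuse(self, inputs: FusionInput) -> dict:
        return {"title": inputs.caption[:40], "ingredients": [], "steps": [], "backend": "stub"}


BACKENDS: Dict[str, Callable[[], FusionBackend]] = {
    "local": LocalQwenFusion,
    "gemini": GeminiFlashFusion,
    "stub": StubFusion,
}


def get_fusion_backend(name: Optional[str] = None) -> FusionBackend:
    """Pick the backend by explicit name, else by the (assumed) FUSION_BACKEND env var."""
    name = name or os.environ.get("FUSION_BACKEND", "local")
    try:
        factory = BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown fusion backend {name!r}") from None
    return factory()
```

With this shape, `pipeline/fusion.py` only ever calls `get_fusion_backend().fuse(...)`, and switching providers is a one-line change in `.env`.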
**Desktop uptime.** Your desktop going down while you're out of town means no service. Phase 2 of the scaling roadmap mitigates but doesn't eliminate this. For beta testers, be upfront that the service is best-effort during prototyping.
**Share extension storage of refresh tokens.** The refresh token lives in the App Group Keychain, accessible to both the extension and the main app. This is a standard pattern but it's worth double-checking the Keychain access group configuration at build time — if you get it wrong the extension can't read what the main app wrote, and the failure mode is silent.
**Dietary dictionary completeness.** The allergy/restriction synonym dictionary needs careful curation. "Dairy" is a category, not an ingredient — it has to match milk, butter, cream, cheese, yogurt, whey, casein, and so on. Getting this right affects the core safety feature. Plan an explicit audit pass on the dictionary before exposing the allergy feature to beta testers.
**WhisperKit model size on first use.** First-time voice search will download ~150 MB. Either pre-download on app install (adds to the install size) or show a clear first-use spinner. Pre-downloading is nicer UX but makes the app heavier; showing a spinner on first use is fine if it's clearly communicated.
**Fusion prompt engineering.** This is the dominant quality lever and it's the least predictable piece of work. Budget roughly twice what you think it'll take. Keep a corpus of 20-30 test reels representing different recipe types (quick snacks, long-form cooking, baking, drinks) and evaluate changes against that set.
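
A small harness makes "evaluate changes against that set" repeatable. The sketch below scores extracted ingredient names against hand-labeled expectations using F1. The metric, the corpus record shape (`id`, `source_url`, `expected_ingredients`), and the function names are all assumptions; the plan does not prescribe how to score a prompt change.

```python
from typing import Callable, Dict, List


def _normalize(name: str) -> str:
    return " ".join(name.lower().split())


def ingredient_f1(predicted: List[str], expected: List[str]) -> float:
    """F1 over normalized ingredient names (exact match after lowercasing)."""
    pred = {_normalize(n) for n in predicted}
    exp = {_normalize(n) for n in expected}
    if not pred and not exp:
        return 1.0
    true_positives = len(pred & exp)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(exp)
    return 2 * precision * recall / (precision + recall)


def evaluate(corpus: List[dict], extract: Callable[[str], List[str]]) -> Dict[str, object]:
    """Run `extract` over every test reel and report per-case and mean F1."""
    scores = {
        case["id"]: ingredient_f1(extract(case["source_url"]), case["expected_ingredients"])
        for case in corpus
    }
    mean = sum(scores.values()) / len(scores) if scores else 0.0
    return {"mean_f1": mean, "per_case": scores}
```

Exact name matching is crude (it treats "scallion" and "green onion" as different), so the per-case scores matter more for spotting regressions between prompt versions than as an absolute quality number.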