Mentor Zero — Architecture & Deployment Guide (current)

This guide keeps the layout of earlier versions: (1) Product summary (~1 page), (2) Technical architecture (~10 pages), and (3) Deployment & AWS environment (~2 pages). It reflects the current dev stack: Flask API and scheduler on EC2 (Docker Compose), Caddy for TLS, Postgres in Docker, SES/Lambda for inbound mail, SSM-backed secrets, ECR images, static pages served by Caddy/app, and GitHub Actions for build/push plus SSM-driven bootstrap. Recent updates: GitHub OIDC IAM role, build-and-push workflow, SSM/S3 deploy of `run_bootstrap.sh`, prompt logging toggle, topic tagging via subject/header, admin config UI, verbose error logging, dedupe by `event_log` message_id.

---

1) Product Summary (~1 page)

What it is: Email-first AI mentor. One calm letter per topic per day, voiced like noted philosophers/scientists. Email is the deliberate medium; replying is the only interaction.
Personalization: Uses user context, streak/progress, notes, recent letter metadata to avoid repetition and tune tone.
Auth: Passwordless magic links (one-time tokens), sessions in DB, `mz_session` cookie. Logout revokes the session. Admins are listed in `app_admin`.
Delivery: SMTP (SES or Gmail). Scheduler sends daily. Footer carries login link (one-time token). Send-now per topic.
Scheduling: Per-user/topic send time + timezone; scheduler enqueues next job after each send; disabling a topic clears pending jobs.
Inbound replies: IMAP or SES→Lambda→webhook. Topic encoded via `[mz:<code>]` subject tag and `X-MZ-Topic` header. AI classifies done/note/unsubscribe; updates progress or disables topic; idempotent by Message-ID.
Admin: Prompt editor, users, footer editor, metrics dashboard, simulate reply, app-config toggle (prompt logging).
Observability: Structured logs; `event_log` for replies/sends; `/health` and `/metrics` (jobs + auth/webhook counters); verbose errors on webhook/login-link/scheduler; prompt logging optional.

---

2) Technical Architecture (~10 pages)

2.1 High-Level Flow

Enrollment: Landing page → `/api/login-link` sends magic link → `/login` consumes token, sets session cookie.
Settings: `/api/topics` saves enabled flag, context_note, send time, timezone; upserts schedule; enqueues next job; disabling clears pending jobs.
Send loop: Scheduler polls `email_job` (pending & due), marks sending, generates/sends, persists letter/metadata/progress, enqueues next run, updates `next_run_at_utc`.
Send-now: `/api/send-now` enqueues immediate job if user/topic active/enabled.
Replies: IMAP or SES/Lambda → webhook; AI marks done/note/unsubscribe; updates progress or disables topic; idempotent by Message-ID via `event_log`.
Admin: Prompt/templates, users, footer, metrics, simulate reply, config toggle for prompt logging; Admin link shown when `is_admin`.

2.2 Backend (Flask, `app.py`)

Serves SPA/static: `web/app.html`, `web/landing/index.html`, `web/architecture.html`, `web/user_guide.html`.
Auth: `/api/login-link`, `/api/signup`, `/login`, `/logout`, `/api/logout`, `/api/me`.
Settings/actions: `/api/topics` (GET/POST), `/api/complete`, `/api/send-now`.
Admin APIs/pages: prompts, users, footer, metrics, simulate reply, config toggle (`/api/admin/config`, `admin_config.html`).
Health/metrics: `/health`, `/metrics` (job/auth/webhook counters).
Webhooks: `/api/webhook/reply` (HMAC `X-MZ-Signature`), `/api/maildev-webhook` (local).
Middleware: request IDs; simple rate limits (login-link/webhook); host-aware cookies (Secure toggled by env).
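The "simple rate limits" on login-link and webhook can be pictured with the sketch below. It is only an illustration under assumptions: the class name `SlidingWindowLimiter`, the per-key sliding window, and the in-memory storage are hypothetical and not taken from `app.py`. The actual middleware may use a different algorithm or limits.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical in-memory limiter: at most `limit` hits per `window_seconds` per key."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)  # key -> timestamps of accepted hits

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would typically answer with HTTP 429
        hits.append(now)
        return True
```

A natural key would be the client IP or the requested email address. State held in process memory is lost on restart and is not shared between gunicorn workers, which is acceptable only for a coarse limit.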

2.3 Core Modules

`config.py`: env+SSM config, OpenAI client, SMTP config, logging, defaults, magic link secret, `DEFAULT_ADMIN_EMAILS`.
`db.py` (re-exports helpers): users/topics/templates/context; letters/metadata/prompts; schedules (`user_topic_schedule`), jobs (`email_job`); `event_log`; login tokens/sessions; app_config; footer; quotes; queue helpers.
`progress.py`: streak/done/missed; update completion/note.
`prompt.py`: build per-topic prompt; OpenAI JSON-mode; subject normalization + topic tag.
`mail.py`: SMTP send; IMAP ingestion; AI classify (done/note/unsubscribe); idempotent by Message-ID; sets `X-MZ-Topic` and `[mz:<code>]` subject tag; footer login links with DB token.
`scheduler.py`: compute next run, enqueue, poll pending, dispatch worker, persist letter/metadata/progress, enqueue next job; send-now uses same queue.
`webhook.py`: SES/Lambda handler; HMAC verify; AI unsubscribe; idempotent by Message-ID; logs payload/analysis.
`health.py`: queue metrics.
`cli.py`: manual runs.
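The HMAC check that `webhook.py` performs on `X-MZ-Signature` could look roughly like the sketch below. The exact scheme is an assumption: this guide does not state the hash function, the encoding, or what is signed, so the sketch assumes a hex-encoded HMAC-SHA256 over the raw request body. The helper names `sign_body` and `verify_signature` are hypothetical.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme; the sender must use the same one)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value) -> bool:
    """Return True only if the X-MZ-Signature header value matches the body."""
    if not header_value:
        return False
    expected = sign_body(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), header_value.strip().encode())
```

The signature has to be computed over the raw bytes as received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and break the comparison.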

2.4 Data Model (key tables)

Users/admin/auth: `app_user`, `app_admin`, `login_token`, `app_session`
Topics/prompts: `topic`, `prompt_template`, `philosopher`
User-topic: `user_topic` (enabled/context), `user_topic_schedule` (timezone, send_time_local, next_run_at_utc)
Queue: `email_job` (run_at_utc, status pending/sending/sent/error, schedule_id, letter_id)
Letters: `letter`, `letter_metadata`, `letter_prompt` (optional prompt logging)
Progress: `user_progress` (completed, note, streak_at_time, letter_id)
Events: `event_log` (LETTER_SENT, REPLY_PROCESSED, UNSUBSCRIBE, LOGIN, etc.)
Quotes/footer/config: `bottom_quote`, `bottom_quote_history`, `email_footer`, `app_config`
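To make the queue table more concrete, the sketch below shows an `email_job` upsert keyed on `(user_id, topic_id, run_at_utc)`, the key named under Reliability. It is an illustration under assumptions: SQLite stands in for Postgres so the example is self-contained, the column list is reduced to what the upsert needs, and "resetting" a duplicate is assumed to mean setting its status back to `pending`.

```python
import sqlite3

# Reduced, hypothetical version of the email_job table.
SCHEMA = """
CREATE TABLE email_job (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    topic_id INTEGER NOT NULL,
    run_at_utc TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    UNIQUE (user_id, topic_id, run_at_utc)
);
"""

# Same ON CONFLICT ... DO UPDATE form that Postgres accepts.
UPSERT = """
INSERT INTO email_job (user_id, topic_id, run_at_utc, status)
VALUES (?, ?, ?, 'pending')
ON CONFLICT (user_id, topic_id, run_at_utc)
DO UPDATE SET status = 'pending';
"""


def enqueue_job(conn, user_id, topic_id, run_at_utc):
    """Insert a pending job, or reset the existing row for the same slot."""
    conn.execute(UPSERT, (user_id, topic_id, run_at_utc))
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    enqueue_job(conn, 1, 7, "2024-01-02T13:00:00Z")
    conn.execute("UPDATE email_job SET status = 'error'")
    enqueue_job(conn, 1, 7, "2024-01-02T13:00:00Z")  # same slot: row is reset
    enqueue_job(conn, 1, 7, "2024-01-03T13:00:00Z")  # different slot: new row
    return conn.execute(
        "SELECT run_at_utc, status FROM email_job ORDER BY run_at_utc"
    ).fetchall()
```

In this sketch the unique constraint guarantees at most one row per slot. Whether the real code also resets rows that are already `sent` is not stated in this guide.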

2.5 Email Generation, Scheduling, Send-Now

`/api/topics` save: upsert schedule, compute `next_run_at_utc`, insert pending job; disabling clears pending jobs and pauses schedule.
Scheduler loop (~60s): fetch due pending jobs, mark sending, build letter, send SMTP, mark sent/error, enqueue next job, update schedule.
Send-now: enqueue immediate job if active/enabled; uses same pipeline.
Subjects normalized and tagged; footer renders `{{login_link}}` using DB one-time token (48h).
Prompt logging optional via `app_config.log_prompts` (admin toggle) → `letter_prompt`.
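Computing `next_run_at_utc` from a local send time and timezone can be sketched as follows. The function name `compute_next_run_utc` and its signature are hypothetical, and the rule "today if the slot is still ahead, otherwise tomorrow" is an assumption about what the scheduler does.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def compute_next_run_utc(send_time_local: time, tz_name: str, now_utc: datetime) -> datetime:
    """Next occurrence of `send_time_local` in `tz_name`, returned in UTC.

    `now_utc` must be timezone-aware.
    """
    tz = ZoneInfo(tz_name)
    today_local = now_utc.astimezone(tz).date()
    candidate = datetime.combine(today_local, send_time_local, tzinfo=tz)
    # Compare in UTC so the check stays correct across DST changes.
    if candidate.astimezone(timezone.utc) <= now_utc:
        candidate = datetime.combine(
            today_local + timedelta(days=1), send_time_local, tzinfo=tz
        )
    return candidate.astimezone(timezone.utc)
```

Storing the local time plus zone name and converting at enqueue time, rather than storing a fixed UTC offset, keeps an 08:00 letter at 08:00 local time across daylight-saving changes.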

2.6 Prompting

OpenAI JSON-mode: subject/body/summary/themes/tone/advice_focus/variation_tags.
Inputs: prompt template (DB or fallback), recent metadata hints, progress context (streak/done/missed/notes/context), user/topic context.
Subject normalization + topic tag; resilient to malformed JSON responses.
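One way to be "resilient to malformed JSON responses" is sketched below. It is not the actual `prompt.py` logic. The function name `parse_letter_response`, the fallback of using the raw text as the body, and the choice of `themes` and `variation_tags` as list fields are all assumptions.

```python
import json

FIELDS = ("subject", "body", "summary", "themes", "tone", "advice_focus", "variation_tags")
LIST_FIELDS = ("themes", "variation_tags")  # assumed to be lists of strings


def parse_letter_response(raw):
    """Parse a model response into the expected fields, tolerating bad output."""
    text = raw or ""
    # Try the whole text first, then the outermost {...} span (handles
    # responses wrapped in prose).
    candidates = [text]
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])

    data = None
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            data = parsed
            break

    result = {}
    for field in FIELDS:
        value = (data or {}).get(field)
        if field in LIST_FIELDS:
            result[field] = [str(v) for v in value] if isinstance(value, list) else []
        else:
            result[field] = value.strip() if isinstance(value, str) else ""
    if data is None:
        # Nothing parseable: keep the raw text so the letter is not lost.
        result["body"] = text.strip()
    return result
```

The caller always receives every key with a predictable type, so the subject normalization and metadata persistence steps do not need their own defensive checks.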

2.7 Inbound Replies & Unsubscribe

IMAP or SES/Lambda → webhook (`/api/webhook/reply` HMAC). Topic from `X-MZ-Topic` header or `[mz:<code>]` tag. AI decides done/note/unsubscribe; unsubscribe disables topic; idempotent by Message-ID via `event_log`.
Maildev webhook for local; admin simulate uses same logic.
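Resolving the topic of an inbound reply from the `X-MZ-Topic` header or the `[mz:<code>]` subject tag can be sketched as below. The function name `extract_topic_code`, the header-first precedence, and the characters allowed in a topic code are assumptions.

```python
import re
from email import message_from_string

# Assumed code alphabet: letters, digits, underscore, hyphen.
TOPIC_TAG = re.compile(r"\[mz:([A-Za-z0-9_-]+)\]")


def extract_topic_code(raw_message: str):
    """Return the topic code from the header, else from the subject tag, else None."""
    msg = message_from_string(raw_message)
    header = (msg.get("X-MZ-Topic") or "").strip()
    if header:
        return header
    match = TOPIC_TAG.search(msg.get("Subject") or "")
    return match.group(1) if match else None
```

Carrying the code in both places gives a fallback: if the custom header is missing from the inbound message, the tag can still be recovered from the subject line.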

2.8 Frontend

`landing/index.html`: calm copy, magic-link form, topic selector, contact mailto, links to Architecture/User Guide docs.
`app.html`: topic toggles, context_note, send time/TZ, send-now buttons, admin link when `is_admin`, logout.
Admin pages: prompts, users, footer, metrics, reply simulator, config toggle.
Docs: `architecture.html`, `user_guide.html`. Served by Flask/Caddy.

2.9 Configuration & Secrets

Env/SSM keys: `DATABASE_URL[_SSM]`, `OPENAI_API_KEY[_SSM]`, `MAGIC_LINK_SECRET[_SSM]`, `WEBHOOK_SECRET[_SSM]`, SMTP (`SMTP_HOST/PORT/USE_TLS/REQUIRE_AUTH/USERNAME/PASSWORD` or `_SSM`), `SENDER_EMAIL`, `DEFAULT_ADMIN_EMAILS`, `APP_BASE_URL`, `AWS_REGION`.
`_get_config_value`: env → SSM → default; required keys raise.
User-data writes `.env` with SSM param names (no secrets); app fetches via SSM at runtime using instance role.
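The env → SSM → default lookup order can be sketched as follows. It is not the real `_get_config_value`: the signature, the `ConfigError` type, and the injected `fetch_ssm` callable are hypothetical. In the running app, `fetch_ssm` would presumably wrap a boto3 SSM client call such as `get_parameter(Name=..., WithDecryption=True)`.

```python
import os

_MISSING = object()  # sentinel: distinguishes "no default" from default=None


class ConfigError(RuntimeError):
    """Raised when a required key cannot be resolved (hypothetical type)."""


def get_config_value(name, fetch_ssm, default=_MISSING, env=None):
    """Resolve `name` from env, then from SSM via `<name>_SSM`, then the default."""
    env = os.environ if env is None else env

    value = env.get(name)
    if value:
        return value

    # `.env` holds the SSM parameter *name*, never the secret itself.
    param_name = env.get(f"{name}_SSM")
    if param_name:
        value = fetch_ssm(param_name)
        if value:
            return value

    if default is not _MISSING:
        return default
    raise ConfigError(f"required config key {name!r} is not set")
```

Injecting the fetcher keeps the lookup order testable without AWS credentials, and local development can set the plain environment variable to skip SSM entirely.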

2.10 Observability & Logging

Structured logs; request IDs; verbose errors on webhook/login-link/scheduler; AI analysis logged in `event_log.metadata`.
`/health`, `/metrics` (job counts, auth/webhook counters, oldest pending, reply/unsubscribe counters).
Admin metrics UI; prompt logging toggle; event_log for sends/replies/unsubscribes/login.
Gaps: no remote log sink, no latency histograms, no alerts.

2.11 Reliability & Idempotency

Jobs: single-instance polling with no distributed locks; enqueue upserts on `(user_id, topic_id, run_at_utc)`, so a duplicate for the same slot resets the existing row rather than adding a second one.
Replies: idempotent by Message-ID via `event_log`; HMAC on webhook; AI unsubscribe.
Gaps: send idempotency guard missing; no retries/backoff/DLQ; no multi-instance coordination.
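Reply idempotency by Message-ID via `event_log` can be illustrated with a unique constraint, as in the sketch below. This is an assumption about the mechanism: the real code may check for an existing row instead of relying on a constraint. The reduced schema, the SQLite stand-in for Postgres, and the helper names are hypothetical.

```python
import sqlite3


def open_event_log():
    """In-memory stand-in for the event_log table (reduced, hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE event_log (
            id INTEGER PRIMARY KEY,
            event_type TEXT NOT NULL,
            message_id TEXT,
            UNIQUE (event_type, message_id)
        )
        """
    )
    return conn


def claim_reply(conn, message_id):
    """Return True the first time a Message-ID is seen, False on any repeat."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO event_log (event_type, message_id) "
                "VALUES ('REPLY_PROCESSED', ?)",
                (message_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A handler would call `claim_reply` before updating progress or disabling a topic and return early on `False`, so a webhook retry or an IMAP re-read of the same message has no further effect.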

2.12 Security & Auth

Magic links (one-time tokens), sessions in DB, logout revokes; cookies HttpOnly, env-aware Secure, SameSite=Lax.
Rate limits on login-link/webhook; HMAC on webhook; SSM-backed secrets; TLS via Caddy/Let’s Encrypt.
Logs currently contain PII; mask it before production use. Admin access is controlled via `app_admin`.
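The one-time behaviour of magic-link tokens can be sketched as follows. It is an illustration only: the class `LoginTokenStore` is hypothetical, a dict stands in for the `login_token` table, and storing a SHA-256 digest instead of the raw token is an assumption. The 48h default mirrors the footer token lifetime mentioned in 2.5 and may differ for login-link tokens.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone


class LoginTokenStore:
    """Hypothetical in-memory stand-in for the login_token table."""

    def __init__(self, ttl=timedelta(hours=48)):
        self.ttl = ttl
        self._rows = {}  # token digest -> {"email", "expires_at", "used"}

    @staticmethod
    def _digest(token):
        # Only the digest is stored, so a leaked table does not yield usable links.
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, email, now=None):
        now = now or datetime.now(timezone.utc)
        token = secrets.token_urlsafe(32)
        self._rows[self._digest(token)] = {
            "email": email,
            "expires_at": now + self.ttl,
            "used": False,
        }
        return token  # embedded in the emailed link

    def consume(self, token, now=None):
        """Return the email once for a valid token; None if unknown, used, or expired."""
        now = now or datetime.now(timezone.utc)
        row = self._rows.get(self._digest(token))
        if row is None or row["used"] or now >= row["expires_at"]:
            return None
        row["used"] = True
        return row["email"]
```

On a successful `consume`, `/login` would create the DB session and set the `mz_session` cookie. In a real database the used-flag update should be a single conditional `UPDATE` so that two concurrent requests cannot both succeed.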

2.13 Extensibility

Topics: add rows to `topic` and `prompt_template`; prompt renderer is topic-code driven.
Prompts: editable via admin; per-topic model/temperature tunable in code.
Queue: can move to managed queue + retry/backoff.
Static/docs easily extended; admin config can grow (feature flags).

2.14 Testing

Unit: prompt parsing, streak calc, scheduler time calc, webhook signature.
Integration: progress flows, reply processing (IMAP/webhook), letter metadata persistence.
Gaps: E2E SMTP/IMAP, multi-instance scheduler, UI automation, load/alerting tests.

---

3) Deployment & AWS Environment (~2 pages)

3.1 Build & Publish

Image: Docker (gunicorn CMD), linux/amd64; Dockerfile installs app + alembic.
ECR: `${env_prefix}-app` (default repo). Build tags `latest` and git SHA.
GitHub Actions: `build-and-push.yml` (OIDC role) builds/pushes to ECR; `deploy-bootstrap.yml` triggers after successful build.

3.2 Infra Topology

Network: VPC (10.10.0.0/16), public subnet, IGW, SG allows 80/443/22.
EC2: Ubuntu, IAM role (SSM core, ECR read, SSM GetParameter, optional S3 deploy bucket). User-data installs Docker/compose/awscli, logs into ECR, writes compose/Caddy/.env, starts stack, waits for Postgres, seeds DB, runs alembic, ensures AWS_REGION in .env. Root volume 20 GB gp3.
Compose: `app` (gunicorn), `scheduler` (`python -m mentorzero.scheduler`), `db` (Postgres), `caddy` (TLS). Volumes: Postgres data, Caddy certs/config.
TLS: Caddy obtains Let’s Encrypt certificates for `www.<root>` and `api.dev.<root>`. (CloudFront removed.)
DNS: Route53 A records for `www` and `api.dev` point to the EC2 IP.
Inbound email: SES receipt rule → S3 → Lambda → `/api/webhook/reply` with HMAC; secret in SSM; Lambda logs payload to CloudWatch; topic via header/subject tag.
Secrets: SSM params for DB URL, DB password, OpenAI key, magic link secret, SMTP creds, webhook secret; `.env` holds SSM param names; app fetches from SSM at runtime.
Static: served by Flask/Caddy (`web/landing`, `web/*.html`, docs).

3.3 CI/CD & Bootstrap

CI: `build-and-push` builds/pushes on main; uses OIDC role with ECR/SSM/EC2 perms.
Deploy: `deploy-bootstrap` runs on successful build and uploads `run_bootstrap.sh` to S3; SSM then executes it on EC2, where it pulls the latest compose config from S3 and runs the bootstrap (Docker up, alembic, seeds).
Manual: `run_bootstrap.sh` can be run on host; `docker compose pull && docker compose up -d` to pick a new tag.
Terraform: sets DNS, SES, S3 inbound, Lambda, EC2/SG/IAM/SSM params, GitHub OIDC provider/role, deploy artifact bucket policy, app secrets placeholders (ignore_changes on values).

3.4 User-Data / Bootstrap Steps

Install docker/compose/awscli; log in to ECR.
Write compose/Caddy/.env (SSM param names, config; no secrets).
`docker compose up -d`.
Wait for Postgres; extract SQL from image; run create/seed/backfill; run alembic with SSM-fetched DATABASE_URL.
Ensure AWS_REGION in .env for SSM client.

3.5 Runbook (dev)

`terraform apply -var "app_image=<ECR_URI:tag>" -var "deploy_artifact_bucket=<bucket>"`.
Populate SSM params once (OpenAI, magic link, SMTP, webhook, DB URL/password); `ignore_changes` prevents TF overwrite.
Push code → build-and-push → deploy-bootstrap runs; or connect via SSH/SSM and run `docker compose pull && docker compose up -d`.
Logs: `docker logs mentorzero-app-1`, `mentorzero-scheduler-1`, `mentorzero-caddy-1`, `mentorzero-db-1`.
Health: `GET /health`; metrics: `GET /metrics`; webhook: confirm that requests to `/api/webhook/reply` pass the `X-MZ-Signature` HMAC check.
DB seeds in `/opt/mentorzero/sql`; rerun with `docker exec -i mentorzero-db-1 psql -U postgres -d postgres < file.sql`.

3.6 Gaps / Next Steps

Add job retry/backoff + idempotency guard; multi-instance scheduler coordination or managed queue.
RDS with backups/encryption; ALB/WAF if scaling.
Remote log sink, latency metrics, alerts on webhook/job failures/backlog.
Secret rotation/reload; CI deploy approval gates; pin TLS/email sender policies.