Mentor Zero — Architecture & Deployment Guide (current)
This guide keeps the same layout: (1) Product summary (~1 page), (2) Technical architecture (~10 pages), and (3) Deployment & AWS environment (~2 pages). It reflects the current dev stack: a Flask API and scheduler on EC2 (Docker Compose), Caddy for TLS, Postgres in Docker, SES/Lambda for inbound mail, SSM-backed secrets, ECR images, static pages served by Caddy and the app, and GitHub Actions for build/push plus SSM-driven bootstrap. Recent updates: GitHub OIDC IAM role, build-and-push workflow, deploy of `run_bootstrap.sh` via S3 and SSM, prompt logging toggle, topic tagging via subject tag and header, admin config UI, verbose error logging, and reply dedupe by `event_log` message_id.
---
1) Product Summary (~1 page)
- What it is: Email-first AI mentor. One calm letter per topic per day, voiced like noted philosophers/scientists. Email is the deliberate medium; replying is the only interaction.
- Personalization: Uses user context, streak/progress, notes, recent letter metadata to avoid repetition and tune tone.
- Auth: Passwordless magic links (one-time tokens), sessions in DB, `mz_session` cookie. Logout revokes the session. Admins are the users listed in `app_admin`.
- Delivery: SMTP (SES or Gmail). Scheduler sends daily. Footer carries login link (one-time token). Send-now per topic.
- Scheduling: Per-user/topic send time + timezone; scheduler enqueues next job after each send; disabling a topic clears pending jobs.
- Inbound replies: IMAP or SES→Lambda→webhook. Topic encoded via `[mz:<code>]` subject tag and `X-MZ-Topic` header. AI classifies done/note/unsubscribe; updates progress or disables topic; idempotent by Message-ID.
- Admin: Prompt editor, users, footer editor, metrics dashboard, simulate reply, app-config toggle (prompt logging).
- Observability: Structured logs; `event_log` for replies/sends; `/health` and `/metrics` (jobs + auth/webhook counters); verbose errors on webhook/login-link/scheduler; prompt logging optional.
---
2) Technical Architecture (~10 pages)
2.1 High-Level Flow
- Enrollment: Landing page → `/api/login-link` sends magic link → `/login` consumes token, sets session cookie.
- Settings: `/api/topics` saves enabled flag, context_note, send time, timezone; upserts schedule; enqueues next job; disabling clears pending jobs.
- Send loop: Scheduler polls `email_job` (pending & due), marks sending, generates/sends, persists letter/metadata/progress, enqueues next run, updates `next_run_at_utc`.
- Send-now: `/api/send-now` enqueues immediate job if user/topic active/enabled.
- Replies: IMAP or SES/Lambda → webhook; AI marks done/note/unsubscribe; updates progress or disables topic; idempotent by Message-ID via `event_log`.
- Admin: Prompt/templates, users, footer, metrics, simulate reply, config toggle for prompt logging; Admin link shown when `is_admin`.
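The enrollment step above depends on one-time login tokens. The following is a minimal sketch of how issuing and consuming such a token could work, not the project's actual code. The in-memory `_tokens` dict stands in for the `login_token` table, and the function names, the hashed storage, and the default TTL are assumptions made for illustration.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory stand-in for the login_token table.
_tokens = {}  # token_hash -> {"email": str, "expires_at": datetime, "used": bool}


def _hash(token: str) -> str:
    # Assumption: only a hash is stored, so a leaked table exposes no usable tokens.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_token(email: str, ttl: timedelta = timedelta(hours=48), now=None) -> str:
    """Create a one-time token; the raw value goes into the emailed link."""
    now = now or datetime.now(timezone.utc)
    token = secrets.token_urlsafe(32)
    _tokens[_hash(token)] = {"email": email, "expires_at": now + ttl, "used": False}
    return token


def consume_token(token: str, now=None):
    """Return the email for a valid token and mark it used; otherwise return None."""
    now = now or datetime.now(timezone.utc)
    row = _tokens.get(_hash(token))
    if row is None or row["used"] or row["expires_at"] <= now:
        return None
    row["used"] = True  # a second use of the same link is rejected
    return row["email"]
```

In this sketch `/login` would call `consume_token` and create a session only when it returns an email.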
2.2 Backend (Flask, `app.py`)
- Serves SPA/static: `web/app.html`, `web/landing/index.html`, `web/architecture.html`, `web/user_guide.html`.
- Auth: `/api/login-link`, `/api/signup`, `/login`, `/logout`, `/api/logout`, `/api/me`.
- Settings/actions: `/api/topics` (GET/POST), `/api/complete`, `/api/send-now`.
- Admin APIs/pages: prompts, users, footer, metrics, simulate reply, config toggle (`/api/admin/config`, `admin_config.html`).
- Health/metrics: `/health`, `/metrics` (job/auth/webhook counters).
- Webhooks: `/api/webhook/reply` (HMAC `X-MZ-Signature`), `/api/maildev-webhook` (local).
- Middleware: request IDs; simple rate limits (login-link/webhook); host-aware cookies (Secure toggled by env).
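The webhook check listed above can be sketched as follows. The guide only states that `/api/webhook/reply` verifies an HMAC carried in `X-MZ-Signature`; the hex-encoded HMAC-SHA256 over the raw request body and the function names here are assumptions.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signature format)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, header_value) -> bool:
    """Compare the X-MZ-Signature header with the expected digest in constant time."""
    if not header_value:
        return False
    expected = sign_body(secret, body).encode("ascii")
    # Compare bytes so that a header with unexpected characters cannot raise.
    return hmac.compare_digest(expected, header_value.strip().encode("utf-8"))
```

The sender (the Lambda, in this stack) would compute `sign_body` with the shared secret from SSM, and the Flask handler would call `verify_signature` on the raw bytes before parsing any JSON.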
2.3 Core Modules
- `config.py`: env+SSM config, OpenAI client, SMTP config, logging, defaults, magic link secret, `DEFAULT_ADMIN_EMAILS`.
- `db.py` (re-exports helpers): users/topics/templates/context; letters/metadata/prompts; schedules (`user_topic_schedule`), jobs (`email_job`); `event_log`; login tokens/sessions; app_config; footer; quotes; queue helpers.
- `progress.py`: streak/done/missed; update completion/note.
- `prompt.py`: build per-topic prompt; OpenAI JSON-mode; subject normalization + topic tag.
- `mail.py`: SMTP send; IMAP ingestion; AI classify (done/note/unsubscribe); idempotent by Message-ID; sets `X-MZ-Topic` and `[mz:<code>]` subject tag; footer login links with DB token.
- `scheduler.py`: compute next run, enqueue, poll pending, dispatch worker, persist letter/metadata/progress, enqueue next job; send-now uses same queue.
- `webhook.py`: SES/Lambda handler; HMAC verify; AI unsubscribe; idempotent by Message-ID; logs payload/analysis.
- `health.py`: queue metrics; `cli.py` for manual runs.
2.4 Data Model (key tables)
- Users/admin/auth: `app_user`, `app_admin`, `login_token`, `app_session`
- Topics/prompts: `topic`, `prompt_template`, `philosopher`
- User-topic: `user_topic` (enabled/context), `user_topic_schedule` (timezone, send_time_local, next_run_at_utc)
- Queue: `email_job` (run_at_utc, status pending/sending/sent/error, schedule_id, letter_id)
- Letters: `letter`, `letter_metadata`, `letter_prompt` (optional prompt logging)
- Progress: `user_progress` (completed, note, streak_at_time, letter_id)
- Events: `event_log` (LETTER_SENT, REPLY_PROCESSED, UNSUBSCRIBE, LOGIN, etc.)
- Quotes/footer/config: `bottom_quote`, `bottom_quote_history`, `email_footer`, `app_config`
2.5 Email Generation, Scheduling, Send-Now
- `/api/topics` save: upsert schedule, compute `next_run_at_utc`, insert pending job; disabling clears pending jobs and pauses schedule.
- Scheduler loop (~60s): fetch due pending jobs, mark sending, build letter, send SMTP, mark sent/error, enqueue next job, update schedule.
- Send-now: enqueue immediate job if active/enabled; uses same pipeline.
- Subjects normalized and tagged; footer renders `{{login_link}}` using DB one-time token (48h).
- Prompt logging optional via `app_config.log_prompts` (admin toggle) → `letter_prompt`.
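Computing `next_run_at_utc` from a local send time and timezone is the step most prone to off-by-one-day and DST mistakes. This is a sketch under assumptions, not the scheduler's actual code; the function name and signature are hypothetical. It resolves the UTC offset separately for each candidate date, so a daylight-saving change between today and tomorrow is handled.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def compute_next_run_utc(send_time_local: time, tz_name: str, now_utc: datetime) -> datetime:
    """Return the next UTC instant at which the local clock reads send_time_local.

    now_utc must be timezone-aware.
    """
    tz = ZoneInfo(tz_name)
    today_local = now_utc.astimezone(tz).date()

    def slot(day_offset: int) -> datetime:
        # Build the local wall-clock time for that date, then convert to UTC.
        local_dt = datetime.combine(
            today_local + timedelta(days=day_offset), send_time_local, tzinfo=tz
        )
        return local_dt.astimezone(timezone.utc)

    candidate = slot(0)
    if candidate <= now_utc:
        # Today's slot has already passed; schedule tomorrow's.
        candidate = slot(1)
    return candidate
```

For a user in `America/New_York` with an 08:00 send time, a call at 12:00 UTC on 15 January 2024 yields 13:00 UTC the same day, and a call at 14:00 UTC yields 13:00 UTC on 16 January.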
2.6 Prompting
- OpenAI JSON-mode: subject/body/summary/themes/tone/advice_focus/variation_tags.
- Inputs: prompt template (DB or fallback), recent metadata hints, progress context (streak/done/missed/notes/context), user/topic context.
- Subject normalization + topic tag; resilient to malformed JSON responses.
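One way to be resilient to malformed JSON responses is to try a strict parse, then the outermost `{...}` span, and finally fall back to treating the reply as plain body text. The sketch below illustrates that idea; the function name, the fallback subject, and the default values are assumptions, and the field names follow the list above.

```python
import json


def parse_letter_response(raw: str, fallback_subject: str = "Your letter") -> dict:
    """Parse a model reply into letter fields, tolerating stray text around the JSON."""
    text = raw.strip()
    data = None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span (covers code fences and chatter).
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            try:
                data = json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                data = None
    if not isinstance(data, dict):
        # Last resort: keep the letter by using the whole reply as the body.
        data = {"subject": fallback_subject, "body": text}
    defaults = {
        "subject": fallback_subject,
        "body": "",
        "summary": "",
        "themes": [],
        "tone": "",
        "advice_focus": "",
        "variation_tags": [],
    }
    return {**defaults, **data}
```

The design choice here is to never drop a letter because of formatting: a degraded letter with empty metadata is still sendable, whereas an exception would leave the job in an error state.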
2.7 Inbound Replies & Unsubscribe
- IMAP or SES/Lambda → webhook (`/api/webhook/reply` HMAC). Topic from `X-MZ-Topic` header or `[mz:<code>]` tag. AI decides done/note/unsubscribe; unsubscribe disables topic; idempotent by Message-ID via `event_log`.
- Maildev webhook for local; admin simulate uses same logic.
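Topic resolution for an inbound reply can be sketched as below: prefer the `X-MZ-Topic` header, then fall back to the `[mz:<code>]` subject tag. The function names and the set of characters allowed in a topic code are assumptions.

```python
import re
from typing import Mapping, Optional

# Matches the [mz:<code>] subject tag; the allowed code characters are an assumption.
_TAG_RE = re.compile(r"\[mz:([A-Za-z0-9_-]+)\]", re.IGNORECASE)


def extract_topic_code(headers: Mapping[str, str], subject: str) -> Optional[str]:
    """Prefer the X-MZ-Topic header; fall back to the subject tag."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-mz-topic" and value.strip():
            return value.strip()
    match = _TAG_RE.search(subject or "")
    return match.group(1) if match else None


def tag_subject(subject: str, code: str) -> str:
    """Append the tag exactly once, so repeated normalization does not stack tags."""
    cleaned = _TAG_RE.sub("", subject).strip()
    return f"{cleaned} [mz:{code}]"
```

The subject tag matters because many mail clients drop custom headers on reply but keep the subject line, usually with a `Re:` prefix.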
2.8 Frontend
- `landing/index.html`: calm copy, magic-link form, topic selector, contact mailto, links to Architecture/User Guide docs.
- `app.html`: topic toggles, context_note, send time/TZ, send-now buttons, admin link when `is_admin`, logout.
- Admin pages: prompts, users, footer, metrics, reply simulator, config toggle.
- Docs: `architecture.html`, `user_guide.html`. Served by Flask/Caddy.
2.9 Configuration & Secrets
- Env/SSM keys: `DATABASE_URL[_SSM]`, `OPENAI_API_KEY[_SSM]`, `MAGIC_LINK_SECRET[_SSM]`, `WEBHOOK_SECRET[_SSM]`, SMTP (`SMTP_HOST/PORT/USE_TLS/REQUIRE_AUTH/USERNAME/PASSWORD` or `_SSM`), `SENDER_EMAIL`, `DEFAULT_ADMIN_EMAILS`, `APP_BASE_URL`, `AWS_REGION`.
- `_get_config_value`: env → SSM → default; required keys raise.
- User-data writes `.env` with SSM param names (no secrets); app fetches via SSM at runtime using instance role.
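The env → SSM → default lookup can be sketched as follows. This is not the real `_get_config_value`: the signature, the injected `fetch_ssm` callable, and the error type are assumptions chosen so the example runs without AWS access.

```python
import os
from typing import Callable, Mapping, Optional

_MISSING = object()  # sentinel: distinguishes "no default" from a default of None


def get_config_value(
    key: str,
    default=_MISSING,
    env: Optional[Mapping[str, str]] = None,
    fetch_ssm: Optional[Callable[[str], Optional[str]]] = None,
):
    """Resolve key from the environment, then SSM (via KEY_SSM), then the default."""
    env = os.environ if env is None else env
    value = env.get(key)
    if value:
        return value
    # KEY_SSM holds the parameter *name*, never the secret itself.
    param_name = env.get(f"{key}_SSM")
    if param_name and fetch_ssm is not None:
        value = fetch_ssm(param_name)
        if value:
            return value
    if default is _MISSING:
        raise RuntimeError(f"Missing required config value: {key}")
    return default
```

On the instance, `fetch_ssm` would presumably wrap a boto3 SSM client call such as `get_parameter(Name=name, WithDecryption=True)` and return the `["Parameter"]["Value"]` field, using the instance role for credentials.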
2.10 Observability & Logging
- Structured logs; request IDs; verbose errors on webhook/login-link/scheduler; AI analysis logged in `event_log.metadata`.
- `/health`, `/metrics` (job counts, auth/webhook counters, oldest pending, reply/unsubscribe counters).
- Admin metrics UI; prompt logging toggle; event_log for sends/replies/unsubscribes/login.
- Gaps: no remote log sink, no latency histograms, no alerts.
2.11 Reliability & Idempotency
- Jobs: single-instance polling; no distributed locks; upsert on `(user_id, topic_id, run_at_utc)` to reset duplicates.
- Replies: idempotent by Message-ID via `event_log`; HMAC on webhook; AI unsubscribe.
- Gaps: no send idempotency guard; no retries, backoff, or DLQ; no multi-instance coordination.
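The job upsert and the Message-ID dedupe described above can both be expressed with `ON CONFLICT`. The sketch below uses an in-memory SQLite database so it is self-contained; the production database is Postgres, which accepts the same `ON CONFLICT` clauses. The reduced table definitions and the function names are assumptions, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE email_job (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        topic_id INTEGER NOT NULL,
        run_at_utc TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        UNIQUE (user_id, topic_id, run_at_utc)
    );
    CREATE TABLE event_log (
        id INTEGER PRIMARY KEY,
        event_type TEXT NOT NULL,
        message_id TEXT UNIQUE
    );
    """
)


def enqueue_job(user_id: int, topic_id: int, run_at_utc: str) -> None:
    """Insert a pending job, or reset the existing row for the same slot."""
    conn.execute(
        """
        INSERT INTO email_job (user_id, topic_id, run_at_utc, status)
        VALUES (?, ?, ?, 'pending')
        ON CONFLICT (user_id, topic_id, run_at_utc) DO UPDATE SET status = 'pending'
        """,
        (user_id, topic_id, run_at_utc),
    )


def record_reply_once(message_id: str) -> bool:
    """Return True only the first time a Message-ID is recorded."""
    cur = conn.execute(
        """
        INSERT INTO event_log (event_type, message_id)
        VALUES ('REPLY_PROCESSED', ?)
        ON CONFLICT (message_id) DO NOTHING
        """,
        (message_id,),
    )
    return cur.rowcount == 1
```

A reply handler would call `record_reply_once` first and skip processing when it returns `False`, which makes redelivered webhooks and re-read IMAP messages harmless.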
2.12 Security & Auth
- Magic links (one-time tokens), sessions in DB, logout revokes; cookies HttpOnly, env-aware Secure, SameSite=Lax.
- Rate limits on login-link/webhook; HMAC on webhook; SSM-backed secrets; TLS via Caddy/Let’s Encrypt.
- PII: logs currently include PII; masking is recommended for prod. Admin access is granted via `app_admin`.
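The log masking recommended above could start with email addresses, which are the most common PII in this app's logs. The following is an illustrative sketch using the standard `logging` module; the class and function names and the masking format are assumptions, and other PII (names, reply text) would need separate handling.

```python
import logging
import re

_EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def _mask_match(match: re.Match) -> str:
    # Keep the first character and the domain so logs stay useful for debugging.
    local, domain = match.group(1), match.group(2)
    return f"{local[0]}***@{domain}"


def mask_pii(message: str) -> str:
    """Replace every email address in message with a masked form."""
    return _EMAIL_RE.sub(_mask_match, message)


class PiiMaskingFilter(logging.Filter):
    """Logging filter that masks email addresses in the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so addresses passed as %-style args are masked too.
        record.msg = mask_pii(record.getMessage())
        record.args = ()
        return True
```

Attaching the filter with `handler.addFilter(PiiMaskingFilter())` would apply it to every record that handler emits.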
2.13 Extensibility
- Topics: add rows to `topic` and `prompt_template`; prompt renderer is topic-code driven.
- Prompts: editable via admin; per-topic model/temperature tunable in code.
- Queue: can move to managed queue + retry/backoff.
- Static/docs easily extended; admin config can grow (feature flags).
2.14 Testing
- Unit: prompt parsing, streak calc, scheduler time calc, webhook signature.
- Integration: progress flows, reply processing (IMAP/webhook), letter metadata persistence.
- Gaps: E2E SMTP/IMAP, multi-instance scheduler, UI automation, load/alerting tests.
---
3) Deployment & AWS Environment (~2 pages)
3.1 Build & Publish
- Image: Docker (gunicorn CMD), linux/amd64; Dockerfile installs app + alembic.
- ECR: `${env_prefix}-app` (default repo). Build tags `latest` and git SHA.
- GitHub Actions: `build-and-push.yml` (OIDC role) builds/pushes to ECR; `deploy-bootstrap.yml` triggers after successful build.
3.2 Infra Topology
- Network: VPC (10.10.0.0/16), public subnet, IGW, SG allows 80/443/22.
- EC2: Ubuntu, IAM role (SSM core, ECR read, SSM GetParameter, optional S3 deploy bucket). User-data installs Docker/compose/awscli, logs into ECR, writes compose/Caddy/.env, starts stack, waits for Postgres, seeds DB, runs alembic, ensures AWS_REGION in .env. Root volume 20 GB gp3.
- Compose: `app` (gunicorn), `scheduler` (`python -m mentorzero.scheduler`), `db` (Postgres), `caddy` (TLS). Volumes: Postgres data, Caddy certs/config.
- TLS: Caddy obtains Let’s Encrypt certificates for `www.<root>` and `api.dev.<root>`. (CloudFront removed.)
- DNS: Route53 A records for `www` and `api.dev` point to the EC2 IP.
- Inbound email: SES receipt rule → S3 → Lambda → `/api/webhook/reply` with HMAC; secret in SSM; Lambda logs payload to CloudWatch; topic via header/subject tag.
- Secrets: SSM params for DB URL, DB password, OpenAI key, magic link secret, SMTP creds, webhook secret; `.env` holds SSM param names; app fetches from SSM at runtime.
- Static: served by Flask/Caddy (`web/landing`, `web/*.html`, docs).
3.3 CI/CD & Bootstrap
- CI: `build-and-push` builds/pushes on main; uses OIDC role with ECR/SSM/EC2 perms.
- Deploy: `deploy-bootstrap` runs on successful build; uploads `run_bootstrap.sh` to S3, SSM executes on EC2, pulls latest compose config from S3, runs bootstrap (Docker up, alembic, seeds).
- Manual: `run_bootstrap.sh` can be run on the host; run `docker compose pull && docker compose up -d` to pick up a new tag.
- Terraform: sets DNS, SES, S3 inbound, Lambda, EC2/SG/IAM/SSM params, GitHub OIDC provider/role, deploy artifact bucket policy, app secrets placeholders (ignore_changes on values).
3.4 User-Data / Bootstrap Steps
- Install docker/compose/awscli; login to ECR.
- Write compose/Caddy/.env (SSM param names, config; no secrets).
- `docker compose up -d`.
- Wait for Postgres; extract SQL from image; run create/seed/backfill; run alembic with SSM-fetched DATABASE_URL.
- Ensure AWS_REGION in .env for SSM client.
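The "wait for Postgres" step is a bounded polling loop. The actual bootstrap is a shell script, so the Python below only illustrates the pattern; the function names, timeouts, and the TCP-level probe are assumptions.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat probe errors as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller might use `wait_until_ready(lambda: tcp_probe("db", 5432))` before running migrations. An open port does not guarantee that Postgres is accepting queries, so a probe built on `pg_isready` or a trivial `SELECT 1` would be a stricter check.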
3.5 Runbook (dev)
- `terraform apply -var "app_image=<ECR_URI:tag>" -var "deploy_artifact_bucket=<bucket>"`.
- Populate SSM params once (OpenAI, magic link, SMTP, webhook, DB URL/password); `ignore_changes` prevents TF overwrite.
- Push code → build-and-push → deploy-bootstrap runs; or connect via SSH/SSM and run `docker compose pull && docker compose up -d`.
- Logs: `docker logs <container>` for `mentorzero-app-1`, `mentorzero-scheduler-1`, `mentorzero-caddy-1`, or `mentorzero-db-1`.
- Health: `GET /health`; metrics: `/metrics`; webhook: if requests are rejected, check the HMAC signature (`X-MZ-Signature`) and the webhook secret.
- DB seeds in `/opt/mentorzero/sql`; rerun with `docker exec -i mentorzero-db-1 psql -U postgres -d postgres < file.sql`.
3.6 Gaps / Next Steps
- Add job retry/backoff + idempotency guard; multi-instance scheduler coordination or managed queue.
- RDS with backups/encryption; ALB/WAF if scaling.
- Remote log sink, latency metrics, alerts on webhook/job failures/backlog.
- Secret rotation/reload; CI deploy approval gates; pin TLS/email sender policies.