
Groups Digest

Overhauled the email digest pipeline — from unreliable cron jobs to priority queues, dedicated Redis infrastructure, and proper observability.

Redis Terraform Nomad PHP Prometheus

Freelancer Groups is a social feature serving 25M+ users — communities where freelancers post updates, share work, and collaborate. Each group generates activity that members receive as digest emails summarising what they missed.

Email digests sound simple. At scale they’re not. By early 2025 the existing pipeline had reliability problems: worker OOM crashes on large groups, no separation between high-traffic and low-traffic groups, cron jobs with no log capture, and no visibility into whether emails were actually being delivered on time.

The overhaul covered the application code, the infrastructure underneath it, and the observability on top.

Pipeline

The digest processor was reworked to split jobs into priority and standard queues, so high-engagement groups get processed first rather than waiting behind the full backlog. An OOM guard filters inactive users (not seen in 90+ days) from large group digests — these were the jobs that had been crashing workers. Redis key fetching was narrowed to only the fields needed, and persistent RabbitMQ connections replaced per-job reconnects.
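To make the queue split and the OOM guard concrete, here is a minimal Python sketch. It is not the production code (the pipeline itself is PHP). The thresholds, class names, and function names are hypothetical and chosen only for illustration. The one value taken from the description above is the 90-day inactivity window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, for illustration only.
PRIORITY_ACTIVITY_THRESHOLD = 100      # recent posts/comments in a group
LARGE_GROUP_MEMBER_THRESHOLD = 10_000  # member count at which the guard applies
INACTIVE_AFTER = timedelta(days=90)    # the 90-day window described above


@dataclass
class Member:
    user_id: int
    last_seen: datetime


@dataclass
class Group:
    group_id: int
    recent_activity: int
    members: list[Member]


def choose_queue(group: Group) -> str:
    """Route high-engagement groups to the priority queue, the rest to standard."""
    if group.recent_activity >= PRIORITY_ACTIVITY_THRESHOLD:
        return "priority"
    return "standard"


def digest_recipients(
    group: Group,
    now: datetime,
    large_group_threshold: int = LARGE_GROUP_MEMBER_THRESHOLD,
) -> list[Member]:
    """Drop members not seen for 90+ days, but only for large groups."""
    if len(group.members) < large_group_threshold:
        return list(group.members)
    cutoff = now - INACTIVE_AFTER
    # A member last seen exactly 90 days ago counts as inactive.
    return [m for m in group.members if m.last_seen > cutoff]
```

Applying the filter only above a size threshold keeps small groups untouched while bounding the recipient list for the jobs most likely to exhaust worker memory.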

The whole pipeline is covered by a functional test suite that exercises the full path from job creation through to email output.

Infrastructure

The priority queue needed its own Redis cluster, provisioned from scratch via Terraform with security groups and HAProxy routing. All keys carry TTLs to prevent unbounded memory growth. Cron execution was migrated to Nomad with full stderr/stdout capture, replacing fire-and-forget cron jobs that failed silently.
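One way to guarantee that every key carries a TTL is to funnel all writes through a single helper that refuses to store a key without one. The Python sketch below illustrates the idea under stated assumptions: `write_digest_key`, `InMemoryStore`, the key name, and the one-day default are all hypothetical. The stand-in store mirrors only the `set(key, value, ex=seconds)` shape of redis-py so that the example runs without a server.

```python
import time

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # hypothetical default: one day


class InMemoryStore:
    """Tiny stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ex):
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ex)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Lazily evict the key once its TTL has elapsed.
            del self._data[key]
            return None
        return value


def write_digest_key(store, key, value, ttl_seconds=DEFAULT_TTL_SECONDS):
    """Single write path: rejects any write that lacks a positive TTL."""
    if ttl_seconds <= 0:
        raise ValueError("every key must carry a positive TTL")
    store.set(key, value, ex=ttl_seconds)


store = InMemoryStore()
write_digest_key(store, "digest:group:42", "payload", ttl_seconds=3600)
```

With a real client the helper body would stay the same. The point of the design is that no caller can write a key that lives forever.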

Observability

Groups Digest Grafana dashboard — queue health, Redis metrics, cron execution, error logs

A Grafana dashboard tracks the full pipeline: queue completion times, pending jobs, processing rates, cron execution, Redis memory and key counts, and error logs. Priority and standard queues are tracked independently.

A Prometheus and Alertmanager stack monitors the full pipeline — queue depth gauges, processing rate counters, cron success/failure tracking, Redis memory pressure, and key expiry rates. Alerts page on-call when queues stall or error rates spike.
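As a rough illustration of what per-queue gauges and counters look like on the wire, the Python sketch below renders two metrics in the Prometheus text exposition format, labelled by queue so that priority and standard can be graphed and alerted on separately. The metric names (`digest_queue_depth`, `digest_jobs_processed_total`) and the sample numbers are hypothetical, not the names or values used in the real stack.

```python
def render_metrics(queue_depth: dict[str, int], jobs_processed: dict[str, int]) -> str:
    """Render per-queue metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP digest_queue_depth Jobs currently waiting in each digest queue.",
        "# TYPE digest_queue_depth gauge",
    ]
    for queue, depth in sorted(queue_depth.items()):
        # A gauge can go up or down as jobs are enqueued and drained.
        lines.append(f'digest_queue_depth{{queue="{queue}"}} {depth}')

    lines += [
        "# HELP digest_jobs_processed_total Digest jobs completed since process start.",
        "# TYPE digest_jobs_processed_total counter",
    ]
    for queue, total in sorted(jobs_processed.items()):
        # A counter only increases; rates are derived from it at query time.
        lines.append(f'digest_jobs_processed_total{{queue="{queue}"}} {total}')

    return "\n".join(lines) + "\n"


print(render_metrics(
    {"priority": 12, "standard": 340},
    {"priority": 5000, "standard": 20000},
))
```

A stall then shows up as a queue whose depth stays above zero while its processed counter stops increasing, which is the kind of condition an alert rule can be written against.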