The journey

Homelab

From a full Google Drive to a production-grade self-hosted stack. Each step was forced by a real problem. Nothing was planned upfront.

9.6TB

vs 100 GB on Google

24

services running

£437/yr

subscriptions cancelled

~4yrs

to break even

The maths works eventually. But that's not really why you do it.

You do it because a broken WireGuard tunnel at 11pm is more interesting than Netflix. Because understanding your own attack surface matters to you. Because "it just works" isn't satisfying when you don't know why it works.

If that doesn't sound like you — close the tab and pay Google. Genuinely. It's the right call for most people.

Skip the journey? Here's where I ended up.
24 services across 7 categories

Routing & Proxy

Traefik WireGuard nginx (VPS) Tailscale

DNS & Security

Pi-hole CrowdSec Authentik

Media

Plex Immich

Automation

Home Assistant Zigbee2MQTT

Observability

Prometheus Grafana Loki Promtail cAdvisor Node Exporter SNMP Exporter

Web

gread.uk rosiealsopphotography peragaadventures.com

Management

Portainer Uptime Kuma Homepage

9.6 TB storage

Photo library with ML search

Media server, anywhere

Home automation, no cloud

Websites for others

Full observability

Hindsight

Set up logs from day one, not just metrics.

Host something for someone else early. It changes your reliability mindset.

Buy a UPS before you lose data, not after.

Timeline

The Trigger

beginner

Problem

Google Drive running out of space. Next tier meant paying more for storage we were outgrowing. Streaming services pushing ads on paid plans.

Solution

Self-host. Take control of storage and media.

The cost calculus shifts faster than you expect once you start adding up subscriptions.

cost privacy storage
More context

Google Drive hit its 100 GB limit. The next tier was £24.99/year for 200 GB, and the photo library was growing faster than we could trim it. Rosie had the same problem compounded across her personal Google account and GSuite. Streaming services were getting worse, with ads appearing on already-paid subscriptions.

The cost calculus shifted. Self-hosting went from “someday” to “this weekend.”

Hardware

beginner

Problem

Need a device that can store files, run Docker, and not require Linux expertise to get started.

Solution

Synology DS923+: turnkey OS, package ecosystem, Docker support.

SHR handles mixed drive sizes. NVMe cache is nice-to-have, not essential. Free RAM from an old laptop is the best kind of RAM.

storage hardware
More context

Synology DS923+ with SHR (Synology Hybrid RAID) across three drives of different sizes: 2x 8 TB Toshiba MG08ADA800E and 1x 2 TB (used, free). SHR creates multiple md arrays internally to maximise usable space, giving 8.71 TB total on Volume 1 (btrfs, encrypted).
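A rough way to see how the mixed sizes turn into usable space is to slice the drives into equal-height layers, each with one drive's worth of redundancy. This is a simplified sketch, not Synology's actual algorithm; the function name is hypothetical and sizes are raw terabytes.

```python
def shr_usable(drive_sizes_tb):
    """Estimate single-redundancy SHR capacity by slicing drives into layers.

    Simplified model (an assumption, not Synology's implementation): each
    layer spans every drive that still has free space, and one drive's
    share of each layer goes to redundancy.
    """
    remaining = sorted(size for size in drive_sizes_tb if size > 0)
    usable = 0
    layers = []  # (drives in layer, slice size per drive, usable in layer)
    while len(remaining) >= 2:
        slice_size = remaining[0]            # smallest remaining capacity
        drives = len(remaining)
        layer_usable = slice_size * (drives - 1)
        layers.append((drives, slice_size, layer_usable))
        usable += layer_usable
        # Drop drives that are now fully allocated.
        remaining = [r - slice_size for r in remaining if r - slice_size > 0]
    # Space left on a single drive cannot be made redundant, so it is ignored.
    return usable, layers


usable, layers = shr_usable([8, 8, 2])
print(usable, layers)
```

For 2x 8 TB + 1x 2 TB this gives a 2 TB layer across all three drives plus a 6 TB mirrored layer across the two large ones, about 10 TB raw. The 8.71 TB that DSM reports is lower, presumably because of binary units and filesystem overhead.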

Two M.2 NVMe drives in RAID1 for read/write cache. 32 GB RAM swapped in from an old laptop.

A second volume (1 TB SSD) was added later to handle media and Docker containers quietly.

Replacing Google

beginner

Problem

Need cloud sync, local backups, and file sharing without recurring cloud costs.

Solution

Synology Drive for sync, Time Machine for Mac backups, SMB for LAN, Hyper Backup for offsite.

Encryption key management is critical. Keep recovery keys in multiple safe places before you need them.

storage privacy cost
More context

The immediate wins, things that work out of the box on DSM:

  • Synology Drive handles file sync. Google Drive stays for Docs, Sheets, and Mail, but bulk storage moved off it.
  • Time Machine as a LAN backup target for MacBooks.
  • SMB shares for LAN file access from any device.
  • Btrfs snapshots with hourly/daily retention on shared folders.
  • Offsite backup via Hyper Backup to a used 2 TB HDD in a UGREEN enclosure, kept in a shed. Encrypted.

Media Server

beginner

Problem

Want to watch media on any device in the house.

Solution

Plex via Synology Package Center. LAN-only at this stage.

The DS923+ uses an AMD Ryzen R1600 with no Intel Quick Sync, so no hardware transcoding. Remote access via relay or QuickConnect was essentially unusable.

media
More context

Plex installed via Synology Package Center. Media playable on phones and TV over local network.

The DS923+ has an AMD Ryzen R1600 processor which lacks Intel Quick Sync, the hardware transcoding engine Plex relies on. Without it, Plex can only direct-play original files or attempt software transcoding (far too slow on a NAS CPU for real-time playback).

This mattered because both Plex’s built-in relay and Synology’s QuickConnect reduce stream quality to save bandwidth, which triggers Plex to transcode. No hardware transcoding meant those remote access options were effectively broken, either unwatchably slow or stuck buffering. On LAN it was fine because clients could direct-play at full quality.

This constraint made the later tunnel critical: with enough raw bandwidth, clients direct-play the original files remotely, with no transcoding needed at all.
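The logic can be reduced to a small decision, sketched below. It is a deliberately simplified model and not how Plex actually decides; the function name, the headroom factor, and the example bitrates are all assumptions. Only the 450 Mbps figure comes from the tunnel numbers reported later.

```python
def playback_mode(file_bitrate_mbps, link_mbps, client_supports_codec=True, headroom=1.5):
    """Return 'direct-play' when the link can carry the original file.

    Simplified model: real clients also consider container, subtitles,
    and audio formats. The headroom factor is an assumed safety margin.
    """
    needed = file_bitrate_mbps * headroom
    if client_supports_codec and link_mbps >= needed:
        return "direct-play"
    # Anything else asks the server to transcode, which this NAS cannot do in real time.
    return "transcode"


# Hypothetical 40 Mbps file over a bandwidth-capped relay vs. the tunnel.
print(playback_mode(40, 2))     # capped link forces a transcode
print(playback_mode(40, 450))   # enough bandwidth to play the original
```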

Home Automation

comfortable

Problem

Five different manufacturer apps to control lights, heating, air quality, and media. Cloud-dependent, fragmented, unreliable.

Solution

Home Assistant OS as a VM with Zigbee2MQTT. One dashboard, no cloud dependencies.

Battery Zigbee devices don't route, but mains-powered ones like IKEA plugs do. An IKEA plug in the hallway fixed the mesh.

home-automation privacy
More context

Home Assistant OS installed as a VM on Synology VMM. Zigbee2MQTT with a Dongle Plus MG24 (~£30) connecting everything. No cloud required.

Smart TRVs on every radiator (5 rooms) with weekly schedules. Sonoff SNZB-06P human presence sensor in the study (USB-powered). Dyson fan for air quality monitoring. Govee lights, IKEA remote, IKEA plug as a Zigbee router, leak sensors in bathroom and kitchen. LG CX TV and Chromecast integration.

Five apps (Govee, Dyson, LG ThinQ, Google Home, TRV manufacturer) became one dashboard.

Living with a NAS

beginner

Problem

The NAS lives in the living room next to the router. HDD seek noise during media playback was audible over the TV.

Solution

Added a 1 TB SSD as a separate volume for media and Docker data. SSDs are silent.

Not everything needs RAID. Media is re-downloadable. Separating data by recovery importance saves money, noise, and complexity.

storage hardware
More context

The NAS stays in the living room; it needs a wired ethernet connection to the router. Moving it wasn’t an option.

The fix was separating what lives on spinning disks from what doesn’t need to. A 1 TB SSD became Volume 2 (ext4, encrypted) for media and Docker container runtime. The SHR array on the HDDs holds everything that needs RAID protection and snapshots.

The deeper lesson: not all data has the same recovery cost. Photos and documents need redundancy. Movies don’t, because you can re-download them. Building volumes around that distinction is cheaper and quieter than putting everything on the same drives.

External Access: Cloudflare Tunnels

networking

Problem

CGNAT means no inbound connections. A static IPv4 address would cost ~£60/month vs the £22/month ISP plan.

Solution

Cloudflare Tunnels: outbound-only, punches through CGNAT.

CF Tunnels work fine for dashboards and APIs but hit a wall for media: 100 MB body limit and video streaming ToS.

networking vpn cost
More context

Cloudflare Tunnels were the first attempt at external access. They work for lightweight use cases like admin dashboards and small file transfers. But two limitations made them a dead end for media and file hosting:

  1. 100 MB HTTP request body limit on free/Pro plans. Large uploads fail.
  2. Terms of Service violation: Cloudflare prohibits proxying video/streaming on standard plans.

Before giving up on tunnels for media I tried a side route: skip the tunnel on the streaming hostname and serve it directly over IPv6 with an AAAA record, while keeping everything else routed through the tunnel. The DNS rules said no. A tunnel route needs a CNAME pointing at <uuid>.cfargotunnel.com, and RFC 1034 forbids a CNAME from sharing a hostname with any other record. So the same name couldn’t be both “tunnel for clients” and “direct for v6 clients”; it was one or the other without running split-DNS. That’s when the VPS + WireGuard idea took over.
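The CNAME rule is easy to check mechanically. A minimal sketch, assuming records are plain (name, type, value) tuples; the function name and the example.com hostnames are hypothetical, and the tunnel target is left as the same placeholder used above.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return hostnames where a CNAME shares the name with another record type."""
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


zone = [
    # Hypothetical streaming hostname: tunnel CNAME plus a direct AAAA record.
    ("stream.example.com", "CNAME", "<uuid>.cfargotunnel.com"),
    ("stream.example.com", "AAAA", "2001:db8::10"),
    # A tunnel-only hostname is fine on its own.
    ("photos.example.com", "CNAME", "<uuid>.cfargotunnel.com"),
]
print(cname_conflicts(zone))
```

Only the streaming hostname is flagged, which is exactly the combination I was trying to publish.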

Pi-hole

comfortable

Problem

Ads everywhere. DNS queries leaking to ISP.

Solution

Pi-hole in Docker. Router DNS pointed at it.

~20% of DNS queries blocked. Pi-hole is a single point of failure, so set a fallback DNS on your router.

privacy containers dns
More context

The first Docker container. Router’s DNS pointed at Pi-hole on port 53. Immediately ~20% of all DNS queries blocked at the network edge.

Important caveat: Pi-hole is a single point of failure for DNS. If the container goes down, nothing on the network can resolve domain names. Set a fallback DNS on the router.

Split-Horizon DNS

networking

Problem

Accessing services by IP and port number: 192.168.1.16:32400 for Plex, :8123 for Home Assistant, :3000 for Grafana. Impossible to remember once you have more than a few.

Solution

Pi-hole custom DNS records so *.gread.uk resolves to the NAS on the LAN. Same URL works from home and remotely.

Split-horizon DNS is the single biggest quality-of-life improvement in the whole journey. One URL, works everywhere.

dns networking
More context

Pi-hole doesn’t just block ads; it can serve custom DNS records. Point photos.gread.uk at 192.168.1.16 locally, while public DNS points it at the VPS for external users.

This is split-horizon DNS: the same domain resolves to different IPs depending on where you are. On your LAN, traffic goes straight to the NAS. Outside, it goes through the VPS tunnel. The user doesn’t notice the difference; they just type the same URL.
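As a toy model, the two views can be written as two lookup tables. This is only an illustration: in practice the split comes from which resolver a client asks (Pi-hole on the LAN, public DNS elsewhere), and the client IP is used here as a stand-in for that. The VPS address is a hypothetical documentation-range IP, and the /24 is the LAN subnet mentioned in the Tailscale section.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.1.0/24")

# Pi-hole's local view: names point straight at the NAS.
LOCAL_RECORDS = {"photos.gread.uk": "192.168.1.16"}
# Public DNS view: names point at the VPS (hypothetical address).
PUBLIC_RECORDS = {"photos.gread.uk": "203.0.113.10"}


def resolve(hostname, client_ip):
    """Answer from the local view for LAN clients, the public view otherwise."""
    on_lan = ipaddress.ip_address(client_ip) in LAN
    view = LOCAL_RECORDS if on_lan else PUBLIC_RECORDS
    return view.get(hostname)


print(resolve("photos.gread.uk", "192.168.1.50"))   # LAN client, goes to the NAS
print(resolve("photos.gread.uk", "198.51.100.7"))   # outside client, goes to the VPS
```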

Before this, every service was a bookmark with an IP and port number. After, everything was a clean URL that worked from any device, anywhere.

TLS / Traefik

networking

Problem

Browsers complaining about invalid certificates on *.gread.uk domains over LAN.

Solution

Traefik reverse proxy with wildcard cert via DNS-01 challenge.

DNS-01 is required for wildcard certs. DSM's nginx must be patched off ports 80/443, and the patch re-applied after every DSM update.

tls networking containers
More context

Traefik handles TLS termination and routing for every service. Let’s Encrypt wildcard cert via Cloudflare DNS-01 challenge means zero per-service certificate management.

Two route categories: *.internal.gread.uk for LAN-only services, *.gread.uk for internet-facing.

DSM’s built-in nginx binds to 80/443 by default, and must be patched to 82/444. This breaks on every DSM update.

How it all connects

graph TB
  subgraph ext["Internet"]
    User(["External User"])
    VPS["Hetzner VPS: nginx TCP proxy"]
  end

  subgraph lan["Home Network"]
    Router["Router"]

    subgraph nas["Synology NAS (DS923+)"]
      WireGuard["WireGuard"]
      Traefik["Traefik :443"]
      CrowdSec["CrowdSec"]
      PiHole["Pi-hole :53"]

      subgraph containers["Containers"]
        Authentik["Authentik"]
        Grafana["Grafana"]
        Prometheus["Prometheus"]
        Portainer["Portainer"]
        Uptime["Uptime Kuma"]
        Immich["Immich"]
      end
    end

    subgraph havm["Home Assistant VM"]
      HA["Home Assistant"]
    end

    Client(["LAN Client"])
  end

  User -- HTTPS --> VPS
  VPS -- WireGuard tunnel --> WireGuard
  WireGuard --> Traefik
  Traefik -- stream --> CrowdSec

  Client -- DNS --> PiHole
  Client -- HTTPS --> Traefik

  Router -- DNS --> PiHole

  Traefik --> Authentik
  Traefik --> Grafana
  Traefik --> Prometheus
  Traefik --> Portainer
  Traefik --> Uptime
  Traefik --> Immich
  Traefik --> HA

WireGuard + Hetzner VPS

linux

Problem

Cloudflare Tunnel limitations blocking media hosting. Need real external access without paying for a static IP.

Solution

Hetzner CX22 VPS as a WireGuard endpoint and TCP proxy. TLS terminates on the NAS, not the VPS.

WireGuard userspace on Synology works but needs tuning (MTU, UDP buffers). Rate limiting prevents TCP congestion collapse.

vpn networking cost
More context

A Hetzner CX22 (£3.19/month) runs nginx as a dumb TCP proxy. Traffic flows through a WireGuard tunnel to the NAS where Traefik terminates TLS. The VPS never sees plaintext.

No request body limits. No ToS restrictions. You control the whole path.

This also solved the transcoding problem from the Media Server phase. The DS923+’s AMD CPU can’t hardware-transcode, but with 450 Mbps of tunnel bandwidth, remote clients direct-play original files at full quality, so no transcoding is needed.

Synology’s kernel (4.4.x) has no WireGuard module, so it uses wireguard-go in userspace with host networking. After MTU tuning and UDP buffer fixes, throughput went from 5 Mbps to 450 Mbps.
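The MTU half of that tuning is mostly arithmetic: the tunnel's MTU has to leave room for the outer IP header, the UDP header, and WireGuard's own framing. The sketch below only shows that calculation; it does not claim which value I ended up with, and the function name is hypothetical.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
# WireGuard data message: 4 type/reserved + 4 receiver index + 8 counter + 16 auth tag.
WIREGUARD_OVERHEAD = 32


def wireguard_mtu(path_mtu=1500, outer_ipv6=False):
    """Largest inner packet that fits in one outer packet without fragmenting."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return path_mtu - outer_ip - UDP_HEADER - WIREGUARD_OVERHEAD


print(wireguard_mtu(1500))                    # IPv4 underlay
print(wireguard_mtu(1500, outer_ipv6=True))   # IPv6 underlay
print(wireguard_mtu(1492))                    # e.g. a PPPoE link
```

If the path between the NAS and the VPS has a smaller MTU than the interface assumes, packets fragment or get dropped, which is one plausible cause of single-digit throughput.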

Does it pay for itself?

Upfront

£1137

Net saving /mo

£22.70

Break-even

...

...but do you value your time?

Savings line assumes ~5% annual price increases across all subscriptions.
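The break-even figure can be reproduced with a few lines. This is a sketch under a simplifying assumption: the 5% annual increase is applied to the whole net saving, stepped once a year, rather than to each subscription separately. The function name is hypothetical.

```python
def break_even_months(upfront, monthly_saving, annual_growth=0.0, max_months=600):
    """Months until cumulative savings cover the upfront cost, or None."""
    cumulative = 0.0
    for month in range(1, max_months + 1):
        years_elapsed = (month - 1) // 12
        # Assumption: the saving steps up once per year by annual_growth.
        cumulative += monthly_saving * (1 + annual_growth) ** years_elapsed
        if cumulative >= upfront:
            return month
    return None


flat = break_even_months(1137, 22.70)
growing = break_even_months(1137, 22.70, annual_growth=0.05)
print(flat, growing)
```

With flat prices that is 51 months; with 5% annual increases it drops to 47, which is where the "~4yrs" headline comes from.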

Tailscale

networking

Problem

Admin tools like Grafana, Portainer, and Prometheus are intentionally LAN-only. When I'm away from home and something breaks, I can't check dashboards or fix anything.

Solution

Tailscale with subnet routing: personal remote access to the entire LAN without exposing internal services publicly.

Tailscale and the VPS tunnel serve different purposes. The tunnel is for public services (Plex, Immich, websites). Tailscale is for private admin access.

vpn networking
More context

The VPS tunnel solved external access for public services. But internal services (Grafana, Portainer, Pi-hole, Prometheus) are behind an IP allowlist for a reason. They shouldn’t be on the internet.

Tailscale gives a private overlay network. With subnet routing (--advertise-routes=192.168.1.0/24), my phone or laptop can reach the entire home LAN from anywhere, as if I were sitting on the couch. Internal service URLs just work.

Two tunnels, two purposes: the VPS is for the world, Tailscale is for me.

Immich

beginner

Problem

Family photos scattered across Google Photos, iCloud, and Synology Photos. Each with its own storage limits, its own app, and no good way to share albums across the family.

Solution

Immich: self-hosted photo library with ML face/object search, family accounts, and no storage limits.

This is the feature that gets the most daily use from non-technical family members. If the mobile app isn't good, nobody will use it, and Immich's is good. Google Photos still gets used for editing, since Immich's editing tools haven't matured yet.

media containers cost
More context

Synology Photos technically worked but the mobile app was slow, search was basic, and sharing was awkward. Meanwhile, Google Photos and iCloud were each eating into their own storage tiers across multiple family members.

Immich replaced all of it. ML-powered face and object recognition, timeline view, family accounts with shared albums, and a mobile app that people actually want to use. Photos live on the NAS SSD (Volume 2) with no per-GB cost.

Rosie and family have their own accounts. Shared links work without authentication for albums you choose to make public. Synology Photos was fully decommissioned.

Observability: Metrics

linux

Problem

10+ services running with no way to know if they're healthy unless someone complains. Finding out something's down by getting a text from Rosie isn't a monitoring strategy.

Solution

Prometheus for metrics collection, Grafana for dashboards and alerting, plus cAdvisor, Node Exporter, and SNMP Exporter for hardware visibility.

Grafana alerting via email means you find out before your users do. SNMP Exporter for NAS hardware (fan speeds, disk temps) catches physical problems before they become data problems.

observability containers
More context

Once other people depended on services, “it seems fine” wasn’t good enough. Prometheus scrapes 10 targets every 15 seconds: container health, host metrics, NAS hardware, Home Assistant state, Traefik routing, and more.

Grafana turns that into dashboards and alerts. An email lands if a service goes down, if a disk temperature spikes, or if WireGuard throughput drops.

cAdvisor is pinned to v0.46.0 because v0.47+ requires containerd, which Synology doesn’t expose. This is the kind of thing you only learn by debugging a container that won’t start.

Observability: Logs

linux

Problem

Metrics told me something was wrong. They couldn't tell me why. A spike in Traefik 5xx errors is useless without the actual error message.

Solution

Loki for log aggregation, Promtail for collection. Docker containers, Traefik access logs, host syslogs, and DSM system logs all in one place.

Set up logs from day one, not months after metrics. You will need to answer 'why' much sooner than you expect.

observability containers
More context

Prometheus can tell you that error rates spiked at 3am. It can’t tell you which request failed, what the error message was, or which client triggered it. That’s what logs are for.

Loki aggregates logs from every container (via Docker auto-discovery), Traefik access logs, host system logs, and DSM syslog over UDP. GeoIP enrichment on Traefik logs shows which country traffic originates from, which is useful for spotting scanners.

This was added months after metrics and should have been there from the start. Every “I wonder what happened” before Loki was answered with “I don’t know, I didn’t have logs.”

Security Hardening

devops

Problem

Internal services (Grafana, Portainer, Pi-hole) were reachable from the public internet via the VPS TCP proxy.

Solution

Three-layer defence: VPS filter, Traefik IP allowlist, strict SNI matching.

DNS isn't the only way to reach a service. Defence-in-depth means each layer catches what the previous misses.

security networking
More context

Discovered that *.internal.gread.uk services were publicly reachable; the VPS nginx forwarded any TCP connection blindly. Built a three-layer fix:

  1. VPS nginx SNI filter: drops connections to internal hostnames at TCP level
  2. Traefik IP allowlist: blocks non-LAN IPs on internal routes
  3. sniStrict: true: rejects mismatched TLS SNI

Plus CrowdSec for IP reputation, Synology Firewall for default-deny, and weekly automated regression tests via GitHub Actions.
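How the three layers back each other up is easier to see as code. This is a model of the decision logic only, not the real nginx or Traefik configuration. The route table, the Grafana hostname, and the treatment of a missing SNI are assumptions, and client_ip stands for whatever address Traefik actually sees.

```python
import ipaddress

INTERNAL_SUFFIX = ".internal.gread.uk"
LAN = ipaddress.ip_network("192.168.1.0/24")
# Hypothetical route table: one public and one internal hostname.
KNOWN_HOSTS = {"photos.gread.uk", "grafana.internal.gread.uk"}


def vps_allows(sni):
    """Layer 1: the VPS proxy drops internal hostnames before the tunnel."""
    return sni is not None and not sni.lower().endswith(INTERNAL_SUFFIX)


def traefik_allows(sni, client_ip):
    """Layers 2 and 3: strict SNI matching, then an IP allowlist on internal routes."""
    if sni is None or sni.lower() not in KNOWN_HOSTS:
        return False  # models sniStrict rejecting unknown or missing SNI
    if sni.lower().endswith(INTERNAL_SUFFIX):
        return ipaddress.ip_address(client_ip) in LAN
    return True


def reachable(sni, client_ip, via_vps):
    """A connection must pass every layer on its path."""
    if via_vps and not vps_allows(sni):
        return False
    return traefik_allows(sni, client_ip)


print(reachable("photos.gread.uk", "198.51.100.7", via_vps=True))
print(reachable("grafana.internal.gread.uk", "198.51.100.7", via_vps=True))
# Even if layer 1 were bypassed, the allowlist still refuses a non-LAN address.
print(reachable("grafana.internal.gread.uk", "198.51.100.7", via_vps=False))
```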

What 100 GB became

Google One 100 GB £1.59/mo
Google One 2 TB £7.99/mo
Google One 10 TB £480/yr
Self-hosted 9.6 TB

The equivalent Google One tier (10 TB) would cost £480/yr, which alone pays off the NAS in under 2 years.

Authentik SSO

devops

Problem

Per-service authentication, with different logins for Grafana, Immich, Pi-hole, Portainer. Annoying, especially remotely.

Solution

Authentik as centralised identity provider with Google OAuth upstream. Log in once, access everything.

Authentik's config has many moving parts. Session expiry needs tuning per-service. Always keep a break-glass path.

security containers
More context

Centralised identity provider. Google OAuth upstream, so no new credentials. Native OIDC for Grafana, Immich, Portainer. Traefik forwardAuth for Pi-hole, Homepage, Traefik dashboard.

If Authentik goes down, forwardAuth services become inaccessible externally. Pi-hole keeps its own password for LAN break-glass. Emergency recovery key stored offline.

Hosting for Others

devops

Problem

Rosie paying £140/year for Squarespace. Elia wanted a site but the hosting costs didn't make sense for a new venture.

Solution

Self-hosted Astro static sites behind nginx + Traefik. Cost: £0.

Hosting for others changes your reliability mindset overnight. When it's just for you, it doesn't matter. When others depend on it, everything matters.

cost containers networking
More context

The maturity inflection point. Rosie’s photography portfolio was costing £140/year on Squarespace for a static site with minimal traffic. Elia wanted an adventure tourism site but the hosting costs didn’t make sense for a brand-new venture.

Both now self-hosted as Astro static sites behind nginx + Traefik. Adding a new site is a compose file + Traefik labels.

This forced proper monitoring, proper deployment, and proper security. The production mindset came from having real users, not from discipline alone.

Infrastructure as Code

devops

Problem

SSH in, edit a file, restart a container, forget what changed. A month later, something breaks and you can't remember whether the config on the NAS matches what you intended.

Solution

Everything in a git repo. A post-merge hook deploys configs, substitutes secrets, and rebuilds sites. git pull is the deployment mechanism.

The moment you can see a diff of what changed and when, debugging goes from 'what did I do?' to 'this commit broke it.' Dependabot and GitHub Actions handle the rest.

containers
More context

The homelab repo is the source of truth. Every config file (Traefik routes, Prometheus scrape targets, Grafana dashboards, Loki pipeline, CrowdSec scenarios) lives in git.

The post-merge hook on the NAS is where the real work happens. It’s not just git pull && docker compose up. It:

  • Copies configs to /volume2/docker/ (the NAS paths that each container mounts)
  • Injects secrets from a .secrets file at deploy time, so credentials never touch the repo
  • Uses checksum comparison to decide whether to rebuild Astro static sites: if the source hasn’t changed, the build is skipped
  • Manages container state: starts new services, restarts changed ones, leaves untouched containers alone

git pull on the NAS is the entire deployment process. No manual SSH, no “I think I changed that file”, no config drift.
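The checksum step is the least obvious part, so here is a minimal sketch of the idea. It is not the actual hook: the function names and the stamp-file approach are assumptions, and the stamp file has to live outside the source tree so it doesn't change the hash.

```python
import hashlib
from pathlib import Path


def tree_checksum(source_dir):
    """Hash every file's relative path and contents in a stable order."""
    digest = hashlib.sha256()
    root = Path(source_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_rebuild(source_dir, stamp_file):
    """True when there is no recorded checksum or the source has changed."""
    stamp = Path(stamp_file)
    return not stamp.exists() or stamp.read_text().strip() != tree_checksum(source_dir)


def record_build(source_dir, stamp_file):
    """Call only after a successful build, so a failed build is retried."""
    Path(stamp_file).write_text(tree_checksum(source_dir))
```

A hook would call needs_rebuild() for each site, run the build if it returns True, and then record_build() once the build succeeds.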

Pre-commit hooks enforce formatting. Dependabot opens PRs for container image updates. GitHub Actions run weekly security regression tests against the live stack. It’s not Kubernetes; it doesn’t need to be. It’s version-controlled, auditable, and recoverable from any commit.

UPS

beginner

Problem

A power cut could corrupt volumes, kill running Docker containers, and lose unsaved database state.

Solution

CyberPower CP900EPFCLCD-UK UPS protecting the NAS, fibre ONT, and router. DSM manages safe shutdown via USB.

At ~55W load the UPS gives ~30 minutes runtime. Full shutdown and auto-recovery tested and working. All services come back.

hardware reliability
More context

A sudden power loss can corrupt btrfs metadata, break Docker containers mid-write, and damage databases. The CyberPower CP900EPFCLCD-UK (900VA / 540W, pure sine wave) protects the entire network path: NAS, fibre ONT, and WiFi router.

Connected to the NAS via USB. DSM detects it and manages the shutdown sequence:

  1. Power fails, UPS switches to battery instantly (no interruption)
  2. After a configured timeout, DSM enters safe mode
  3. Docker containers stop, VM shuts down, volumes unmount cleanly (~3 min)
  4. NAS sends shutdown signal to the UPS, cutting WiFi and network to preserve battery during prolonged outages
  5. When mains power restores, UPS powers on, NAS auto-boots, all services recover

Immich is the one exception: its encrypted volume mount requires manual intervention on each reboot. This is intentional: the photo library stays locked until someone physically approves access.

Nominal draw is ~55-60W. Battery life expectancy: 3-5 years (replacement part: RBP0051).

Load Testing

devops

Problem

Before sharing the site publicly, I had no idea where the limits were or what failure would look like under real traffic.

Solution

Systematic load testing with wrk: inside out, then external, with live probe monitoring of critical services throughout.

Traefik is the bottleneck, not nginx or memory. CrowdSec costs almost nothing per request. Critical services (DSM, Authentik) degrade gracefully but never fail. A single VPS-level connlimit rule stops naive floods before they reach the stack.

devops networking observability
More context

The site was solid on the inside. But before pointing anyone else at it, I wanted to know exactly where it would fall over.

Load testing went inside-out: first from within the LAN to isolate the NAS and Traefik, then through the WireGuard tunnel from the VPS, then fully external. Uptime Kuma probes ran throughout so any degradation showed up in the logs, not just anecdotally.

A few things that weren’t obvious until the numbers came in:

  • Traefik saturates before anything else. nginx on the VPS handles more than expected; memory on the NAS is barely touched. The bottleneck is Traefik’s request routing under concurrent load.
  • CrowdSec is cheap. At 440 RPS the per-request cost barely registers. It’s not the place to optimise.
  • Critical services degrade gracefully. DSM and Authentik slow down under pressure but don’t fail. Session state survives.
  • A single connlimit rule does most of the flood protection. One iptables rule at the VPS level drops naive connection floods before they reach the internal stack. The more sophisticated CrowdSec scenarios handle the rest.

The result isn’t “it can handle anything”; it’s “I know exactly where it breaks, and it breaks cleanly.”

What I'd do differently

Since early 2026, a £19/month AI subscription can carry most technical people through this entire setup, as long as you question what it tells you. I didn't always, and ended up exposing my internal services to the internet.
Skip Cloudflare Tunnels for media hosting: the 100 MB body limit and video ToS make it a dead end. Fine for dashboards though.
SHR was the right call. Mixed drive sizes (2x 8 TB + 1x 2 TB free) just work.
Set up Loki from day one. Prometheus gives you 'what', Loki gives you 'why'.
Start hosting something for someone else early. It changes your reliability mindset overnight.
Set a fallback DNS on your router. If Pi-hole goes down, you want the network to still work.
Buy a UPS before you lose data, not after.
