The journey
Homelab
From a full Google Drive to a production-grade self-hosted stack. Each step was forced by a real problem. Nothing was planned upfront.
9.6 TB
vs 100 GB on Google ↓
24
services running
£437/yr
subscriptions cancelled
~4yrs
to break even ↓
The maths works eventually. But that's not really why you do it.
You do it because a broken WireGuard tunnel at 11pm is more interesting than Netflix. Because understanding your own attack surface matters to you. Because "it just works" isn't satisfying when you don't know why it works.
If that doesn't sound like you — close the tab and pay Google. Genuinely. It's the right call for most people.
Skip the journey? Here's where I ended up.
24 services across 7 categories
Routing & Proxy
DNS & Security
Media
Automation
Observability
Web
Management
9.6 TB storage · vs 100 GB on Google
Photo library with ML search · replaces Google Photos
Media server, anywhere · cancelled streaming subs
Home automation, no cloud · replaced 5 apps
Pi-hole · ~20% queries blocked
Websites for others · £0 recurring
Full observability · metrics, logs, alerts
Authentik SSO · one login, all services
Hindsight
Set up logs from day one, not just metrics.
Host something for someone else early. It changes your reliability mindset.
Buy a UPS before you lose data, not after.
Timeline
The Trigger
beginner
Problem
Google Drive running out of space. Next tier meant paying more for storage we were outgrowing. Streaming services pushing ads on paid plans.
Solution
Self-host. Take control of storage and media.
The cost calculus shifts faster than you expect once you start adding up subscriptions.
cost privacy storage
More context
Google Drive hit its 100 GB limit. The next tier was £24.99/year for 200 GB, and the photo library was growing faster than we could trim it. Rosie had the same problem compounded across her personal Google account and GSuite. Streaming services were getting worse, with ads appearing on already-paid subscriptions.
The cost calculus shifted. Self-hosting went from “someday” to “this weekend.”
Hardware
beginner
Problem
Need a device that can store files, run containers, and not require Linux expertise to get started.
Solution
Synology DS923+: turnkey OS, package ecosystem, Docker support.
SHR handles mixed drive sizes. NVMe cache is nice-to-have, not essential. Free RAM from an old laptop is the best kind of RAM.
storage hardware
More context
Synology DS923+ with SHR (Synology Hybrid RAID) across three drives of different sizes: 2x 8 TB Toshiba MG08ADA800E and 1x 2 TB (used, free). SHR creates multiple md arrays internally to maximise usable space, giving 8.71 TB total on Volume 1 (btrfs, encrypted).
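How SHR arrives at its usable space can be sketched with a simplified layering model. This is an approximation of the idea rather than Synology's actual implementation: drives are cut into layers at each distinct size, and every layer that spans two or more drives gives up one slice to redundancy. The function name is hypothetical.

```python
def shr_usable_tb(drive_sizes_tb):
    """Estimate usable capacity for SHR with one-drive fault tolerance.

    Simplified model: a layer spanning n >= 2 drives yields (n - 1) slices
    of usable space (mirror for n == 2, parity for n >= 3). A layer that
    exists on a single drive cannot be made redundant and is not counted.
    """
    usable = 0.0
    previous = 0.0
    for size in sorted(set(drive_sizes_tb)):
        height = size - previous  # thickness of this layer in TB
        members = sum(1 for d in drive_sizes_tb if d >= size)
        if members >= 2:
            usable += height * (members - 1)
        previous = size
    return usable


raw_tb = shr_usable_tb([8, 8, 2])  # the drives described above
print(f"{raw_tb:.0f} TB raw, {raw_tb * 1e12 / 2**40:.2f} TiB")
```

For 2x 8 TB + 1x 2 TB the model gives 10 TB before overhead, which is roughly 9.1 TiB. The 8.71 TB that DSM reports is lower; I assume the gap is system partitions and filesystem overhead, which this sketch does not model.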
Two M.2 NVMe drives in RAID1 for read/write cache. 32 GB RAM swapped in from an old laptop.
A second volume (1 TB SSD) was added later to handle media and Docker containers quietly.
Replacing Google
beginner
Problem
Need cloud sync, local backups, and file sharing without recurring cloud costs.
Solution
Synology Drive for sync, Time Machine for Mac backups, SMB for LAN, Hyper Backup for offsite.
Encryption key management is critical. Keep recovery keys in multiple safe places before you need them.
storage privacy cost
More context
The immediate wins, things that work out of the box on DSM:
- Synology Drive handles file sync. Google Drive stays for Docs, Sheets, and Mail, but bulk storage moved off it.
- Time Machine as a LAN backup target for MacBooks.
- SMB shares for LAN file access from any device.
- Btrfs snapshots with hourly/daily retention on shared folders.
- Offsite backup via Hyper Backup to a used 2 TB HDD in a UGREEN enclosure, kept in a shed. Encrypted.
Media Server
beginner
Problem
Want to watch media on any device in the house.
Solution
Plex via Synology Package Center. LAN-only at this stage.
The DS923+ uses an AMD Ryzen R1600 with no Intel Quick Sync, so no hardware transcoding. Remote access via relay or QuickConnect was essentially unusable.
media
More context
Plex installed via Synology Package Center. Media playable on phones and TV over local network.
The DS923+ has an AMD Ryzen R1600 processor which lacks Intel Quick Sync, the hardware transcoding engine Plex relies on. Without it, Plex can only direct-play original files or attempt software transcoding (far too slow on a NAS CPU for real-time playback).
This mattered because both Plex’s built-in relay and Synology’s QuickConnect reduce stream quality to save bandwidth, which triggers Plex to transcode. No hardware transcoding meant those remote access options were effectively broken, either unwatchably slow or stuck buffering. On LAN it was fine because clients could direct-play at full quality.
This constraint made the later WireGuard tunnel critical: with enough raw bandwidth, clients direct-play the original files remotely, with no transcoding needed at all.
Home Automation
comfortable
Problem
Five different manufacturer apps to control lights, heating, air quality, and media. Cloud-dependent, fragmented, unreliable.
Solution
Home Assistant OS as a VM with Zigbee2MQTT. One dashboard, no cloud dependencies.
Battery Zigbee devices don't route, but mains-powered ones like IKEA plugs do. An IKEA plug in the hallway fixed the mesh.
home-automation privacy
More context
Home Assistant OS installed as a VM on Synology VMM. Zigbee2MQTT with a Dongle Plus MG24 (~£30) connecting everything. No cloud required.
Smart TRVs on every radiator (5 rooms) with weekly schedules. Sonoff SNZB-06P human presence sensor in the study (USB-powered). Dyson fan for air quality monitoring. Govee lights, IKEA remote, IKEA plug as a Zigbee router, leak sensors in bathroom and kitchen. LG CX TV and Chromecast integration.
Five apps (Govee, Dyson, LG ThinQ, Google Home, TRV manufacturer) became one dashboard.
Living with a NAS
beginner
Problem
The NAS lives in the living room next to the router. HDD seek noise during media playback was audible over the TV.
Solution
Added a 1 TB SSD as a separate volume for media and Docker data. SSDs are silent.
Not everything needs RAID. Media is re-downloadable. Separating data by recovery importance saves money, noise, and complexity.
storage hardware
More context
The NAS stays in the living room; it needs a wired ethernet connection to the router. Moving it wasn’t an option.
The fix was separating what lives on spinning disks from what doesn’t need to. A 1 TB SSD became Volume 2 (ext4, encrypted) for media and Docker container runtime. The SHR array on the HDDs holds everything that needs RAID protection and snapshots.
The deeper lesson: not all data has the same recovery cost. Photos and documents need redundancy. Movies don’t, because you can re-download them. Building volumes around that distinction is cheaper and quieter than putting everything on the same drives.
External Access: Cloudflare Tunnels
networking
Problem
CGNAT means no inbound connections. A static IPv4 address would cost ~£60/month, versus the £22/month ISP plan.
Solution
Cloudflare Tunnels: outbound-only, punches through CGNAT.
CF Tunnels work fine for dashboards and APIs but hit a wall for media: 100 MB body limit and video streaming ToS.
networking vpn cost
More context
Cloudflare Tunnels were the first attempt at external access. They work for lightweight use cases like admin dashboards and small file transfers. But two limitations made them a dead end for media and file hosting:
- 100 MB HTTP request body limit on free/Pro plans. Large uploads fail.
- Terms of Service violation: Cloudflare prohibits proxying video/streaming on standard plans.
Before giving up on tunnels for media, I tried a side route: skip the tunnel on the streaming hostname and serve it directly over IPv6 with an AAAA record, while keeping everything else routed through the tunnel. The DNS rules said no. A tunnel route needs a CNAME pointing at <uuid>.cfargotunnel.com, and RFC 1034 forbids a CNAME from sharing a hostname with any other record. So the same name couldn’t be both “tunnel for IPv4 clients” and “direct for IPv6 clients”; it was one or the other unless I ran split DNS. That’s when the VPS + WireGuard idea took over.
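The CNAME rule is easy to check mechanically. Here is a minimal sketch of a zone lint that flags any hostname where a CNAME sits alongside another record type. The hostname `stream.gread.uk` and the IPv6 address are hypothetical placeholders, and the check ignores edge cases such as DNSSEC records.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return hostnames where a CNAME coexists with another record type.

    `records` is a list of (name, type, value) tuples.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone: the side route I attempted, plus one valid tunnel name.
zone = [
    ("stream.gread.uk", "CNAME", "<uuid>.cfargotunnel.com"),
    ("stream.gread.uk", "AAAA", "2001:db8::10"),  # documentation prefix
    ("photos.gread.uk", "CNAME", "<uuid>.cfargotunnel.com"),
]
print(cname_conflicts(zone))
```

Only the name carrying both a CNAME and an AAAA record is reported, which is exactly the combination the side route needed.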
Pi-hole
comfortable
Problem
Ads everywhere. DNS queries leaking to ISP.
Solution
Pi-hole in Docker. Router DNS pointed at it.
~20% of DNS queries blocked. Pi-hole is a single point of failure, so set a fallback DNS on your router.
privacy containers dns
More context
The first Docker container. Router’s DNS pointed at Pi-hole on port 53. Immediately ~20% of all DNS queries blocked at the network edge.
Important caveat: Pi-hole is a single point of failure for DNS. If the container goes down, nothing on the network can resolve domain names. Set a fallback DNS on the router.
Split-Horizon DNS
networking
Problem
Accessing services by IP and port number: 192.168.1.16:32400 for Plex, :8123 for Home Assistant, :3000 for Grafana. Impossible to remember once you have more than a few.
Solution
Pi-hole custom DNS records so *.gread.uk resolves to the NAS on the LAN. Same URL works from home and remotely.
Split-horizon DNS is the single biggest quality-of-life improvement in the whole journey. One URL, works everywhere.
dns networking
More context
Pi-hole doesn’t just block ads; it can serve custom DNS records. Point photos.gread.uk at 192.168.1.16 locally, while public DNS points it at the VPS for external users.
This is split-horizon DNS: the same domain resolves to different IPs depending on where you are. On your LAN, traffic goes straight to the NAS. Outside, it goes through the VPS tunnel. The user doesn’t notice the difference; they just type the same URL.
Before this, every service was a bookmark with an IP and port number. After, everything was a clean URL that worked from any device, anywhere.
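The effect of split-horizon DNS can be modelled in a few lines. In reality the answer depends on which DNS server the client asks (Pi-hole on the LAN, public DNS elsewhere), not on an explicit IP check, so treat this as a sketch of the outcome rather than the mechanism. The VPS address is a placeholder from the documentation range, not the real one.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.1.0/24")
NAS_IP = "192.168.1.16"   # LAN address of the NAS, from the text
VPS_IP = "203.0.113.10"   # placeholder (documentation range), not the real VPS


def resolve(hostname, client_ip):
    """Return the address a client would get for a *.gread.uk name."""
    if not hostname.lower().endswith(".gread.uk"):
        raise LookupError(f"{hostname} is not served by this sketch")
    on_lan = ipaddress.ip_address(client_ip) in LAN
    return NAS_IP if on_lan else VPS_IP


print(resolve("photos.gread.uk", "192.168.1.50"))   # a device at home
print(resolve("photos.gread.uk", "198.51.100.7"))   # a device elsewhere
```

Same hostname, two answers: the home device goes straight to the NAS, the remote one goes to the VPS.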
TLS / Traefik
networking
Problem
Browsers complaining about invalid certificates on *.gread.uk domains over LAN.
Solution
Traefik reverse proxy with wildcard cert via DNS-01 challenge.
DNS-01 is required for wildcard certs. DSM's nginx must be patched off ports 80/443, and the patch re-applied after every DSM update.
tls networking containers
More context
Traefik handles TLS termination and routing for every service. Let’s Encrypt wildcard cert via Cloudflare DNS-01 challenge means zero per-service certificate management.
Two route categories: *.internal.gread.uk for LAN-only services, *.gread.uk for internet-facing.
DSM’s built-in nginx binds to 80/443 by default, so it must be patched to 82/444. Every DSM update overwrites the patch, so it has to be re-applied.
How it all connects
graph TB
  subgraph ext["Internet"]
    User(["External User"])
    VPS["Hetzner VPS: nginx TCP proxy"]
  end
  subgraph lan["Home Network"]
    Router["Router"]
    subgraph nas["Synology NAS (DS923+)"]
      WireGuard["WireGuard"]
      Traefik["Traefik :443"]
      CrowdSec["CrowdSec"]
      PiHole["Pi-hole :53"]
      subgraph containers["Containers"]
        Authentik["Authentik"]
        Grafana["Grafana"]
        Prometheus["Prometheus"]
        Portainer["Portainer"]
        Uptime["Uptime Kuma"]
        Immich["Immich"]
      end
    end
    subgraph havm["Home Assistant VM"]
      HA["Home Assistant"]
    end
    Client(["LAN Client"])
  end
  User -- HTTPS --> VPS
  VPS -- WireGuard tunnel --> WireGuard
  WireGuard --> Traefik
  Traefik -- stream --> CrowdSec
  Client -- DNS --> PiHole
  Client -- HTTPS --> Traefik
  Router -- DNS --> PiHole
  Traefik --> Authentik
  Traefik --> Grafana
  Traefik --> Prometheus
  Traefik --> Portainer
  Traefik --> Uptime
  Traefik --> Immich
  Traefik --> HA
WireGuard + Hetzner VPS
linux
Problem
Cloudflare Tunnel limitations blocking media hosting. Need real external access without paying for a static IP.
Solution
Hetzner CX22 VPS as a WireGuard endpoint and TCP proxy. TLS terminates on the NAS, not the VPS.
WireGuard userspace on Synology works but needs tuning (MTU, UDP buffers). Rate limiting prevents TCP congestion collapse.
vpn networking cost
More context
A Hetzner CX22 (£3.19/month) runs nginx as a dumb TCP proxy. Traffic flows through a WireGuard tunnel to the NAS, where Traefik terminates TLS. The VPS never sees plaintext.
No request body limits. No ToS restrictions. You control the whole path.
This also solved the transcoding problem from the Media Server step. The DS923+’s AMD CPU can’t hardware-transcode, but with 450 Mbps of tunnel bandwidth, remote clients direct-play original files at full quality, so no transcoding is needed.
Synology’s kernel (4.4.x) has no WireGuard module, so it uses wireguard-go in userspace with host networking. After MTU tuning and UDP buffer fixes, throughput went from 5 Mbps to 450 Mbps.
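The MTU side of that tuning comes down to header arithmetic. The sketch below shows the general calculation, not the exact values used on this setup: each WireGuard packet adds an outer IP header, a UDP header, and 32 bytes of WireGuard framing, so the tunnel MTU has to sit that far below the path MTU. The path MTU values passed in are assumptions.

```python
# Per-packet overhead: outer IP header + UDP header + WireGuard framing.
WG_OVERHEAD_IPV4 = 20 + 8 + 32  # 60 bytes when the tunnel runs over IPv4
WG_OVERHEAD_IPV6 = 40 + 8 + 32  # 80 bytes when the tunnel runs over IPv6


def tunnel_mtu(path_mtu, outer_ipv6=False):
    """Largest inner packet that fits in one outer packet without fragmenting."""
    overhead = WG_OVERHEAD_IPV6 if outer_ipv6 else WG_OVERHEAD_IPV4
    return path_mtu - overhead


def tcp_mss(mtu):
    """MSS for IPv4 TCP inside the tunnel: MTU minus 20-byte IP and 20-byte TCP headers."""
    return mtu - 40


for path_mtu in (1500, 1492):  # plain Ethernet, and a PPPoE-style link (assumed values)
    mtu = tunnel_mtu(path_mtu)
    print(path_mtu, mtu, tcp_mss(mtu))
```

WireGuard's common default of 1420 is the IPv6 case on a 1500-byte path. If the real path MTU is smaller than assumed, packets fragment or get dropped, which is one way to end up at single-digit Mbps.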
Does it pay for itself?
Upfront
£1137
Net saving /mo
£22.70
Break-even
...
Stop overthinking it and just start building the thing.
Savings line assumes ~5% annual price increases across all subscriptions.
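The break-even arithmetic can be sketched as follows. This is a simplified model, not the calculation behind the chart: it applies the ~5% annual increase to the whole net monthly saving, stepped once a year, which is an assumption on my part. The function name is hypothetical.

```python
def break_even_months(upfront, monthly_saving, annual_increase=0.0, max_months=600):
    """Months until cumulative savings cover the upfront cost.

    The monthly saving is stepped up once every 12 months by `annual_increase`.
    Returns None if the cost is not recovered within `max_months`.
    """
    recovered = 0.0
    for month in range(1, max_months + 1):
        years_elapsed = (month - 1) // 12
        recovered += monthly_saving * (1 + annual_increase) ** years_elapsed
        if recovered >= upfront:
            return month
    return None


flat = break_even_months(1137, 22.70)
stepped = break_even_months(1137, 22.70, annual_increase=0.05)
print(flat, stepped)
```

Both cases land close to the ~4 years quoted at the top, with the price increases pulling the date in by a few months.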
Tailscale
networking
Problem
Admin tools like Grafana, Portainer, and Prometheus are intentionally LAN-only. When I'm away from home and something breaks, I can't check dashboards or fix anything.
Solution
Tailscale with subnet routing: personal remote access to the entire LAN without exposing internal services publicly.
Tailscale and the VPS tunnel serve different purposes. The tunnel is for public services (Plex, Immich, websites). Tailscale is for private admin access.
vpn networking
More context
The VPS tunnel solved external access for public services. But internal services (Grafana, Portainer, Pi-hole, Prometheus) are behind an IP allowlist for a reason. They shouldn’t be on the internet.
Tailscale gives a private overlay network. With subnet routing (--advertise-routes=192.168.1.0/24), my phone or laptop can reach the entire home LAN from anywhere, as if I were sitting on the couch. Internal service URLs just work.
Two tunnels, two purposes: the VPS is for the world, Tailscale is for me.
Immich
beginner
Problem
Family photos scattered across Google Photos, iCloud, and Synology Photos. Each with its own storage limits, its own app, and no good way to share albums across the family.
Solution
Immich: self-hosted photo library with ML face/object search, family accounts, and no storage limits.
This is the feature that gets the most daily use from non-technical family members. If the mobile app isn't good, nobody will use it, and Immich's is good. Google Photos still gets used for editing, since Immich's editing tools haven't matured yet.
media containers cost
More context
Synology Photos technically worked but the mobile app was slow, search was basic, and sharing was awkward. Meanwhile, Google Photos and iCloud were each eating into their own storage tiers across multiple family members.
Immich replaced all of it. ML-powered face and object recognition, timeline view, family accounts with shared albums, and a mobile app that people actually want to use. Photos live on the NAS SSD (Volume 2) with no per-GB cost.
Rosie and family have their own accounts. Shared links work without authentication for albums you choose to make public. Synology Photos was fully decommissioned.
Observability: Metrics
linux
Problem
10+ services running with no way to know if they're healthy unless someone complains. Finding out something's down by getting a text from Rosie isn't a monitoring strategy.
Solution
Prometheus for metrics collection, Grafana for dashboards and alerting, plus cAdvisor, Node Exporter, and SNMP Exporter for hardware visibility.
Grafana alerting via email means you find out before your users do. SNMP Exporter for NAS hardware (fan speeds, disk temps) catches physical problems before they become data problems.
observability containers
More context
Once other people depended on services, “it seems fine” wasn’t good enough. Prometheus scrapes 10 targets every 15 seconds: container health, host metrics, NAS hardware, Home Assistant state, Traefik routing, and more.
Grafana turns that into dashboards and alerts. An email lands if a service goes down, if a disk temperature spikes, or if WireGuard throughput drops.
cAdvisor is pinned to v0.46.0 because v0.47+ requires containerd, which Synology doesn’t expose. This is the kind of thing you only learn by debugging a container that won’t start.
Observability: Logs
linux
Problem
Metrics told me something was wrong. They couldn't tell me why. A spike in Traefik 5xx errors is useless without the actual error message.
Solution
Loki for log aggregation, Promtail for collection. Docker containers, Traefik access logs, host syslogs, and DSM system logs all in one place.
Set up logs from day one, not months after metrics. You will need to answer 'why' much sooner than you expect.
observability containers
More context
Prometheus can tell you that error rates spiked at 3am. It can’t tell you which request failed, what the error message was, or which client triggered it. That’s what logs are for.
Loki aggregates logs from every container (via Docker auto-discovery), Traefik access logs, host system logs, and DSM syslog over UDP. GeoIP enrichment on Traefik logs shows which country traffic originates from, which is useful for spotting scanners.
This was added months after metrics and should have been there from the start. Every “I wonder what happened” before Loki was answered with “I don’t know, I didn’t have logs.”
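To make the “what” versus “why” distinction concrete, here is a minimal sketch that answers “which requests failed?” from raw access logs. It assumes Traefik is writing JSON access logs with its usual `DownstreamStatus`, `RequestHost`, and `RequestPath` fields. The sample lines are invented for illustration, and in practice Loki does this job with a query rather than a script.

```python
import json
from collections import Counter


def summarise_5xx(lines):
    """Count 5xx responses per (host, path) from JSON access-log lines."""
    failures = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        status = entry.get("DownstreamStatus", 0)
        if isinstance(status, int) and 500 <= status <= 599:
            failures[(entry.get("RequestHost"), entry.get("RequestPath"))] += 1
    return failures


# Invented sample lines, for illustration only.
sample = [
    '{"DownstreamStatus": 200, "RequestHost": "photos.gread.uk", "RequestPath": "/"}',
    '{"DownstreamStatus": 502, "RequestHost": "photos.gread.uk", "RequestPath": "/api/assets"}',
    '{"DownstreamStatus": 502, "RequestHost": "photos.gread.uk", "RequestPath": "/api/assets"}',
    "not a json line",
]
for (host, path), count in summarise_5xx(sample).most_common():
    print(count, host, path)
```

A metric would only have said “two 5xx responses”; the log lines say which host and path produced them.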
Security Hardening
devops
Problem
Internal services (Grafana, Portainer, Pi-hole) were reachable from the public internet via the VPS TCP proxy.
Solution
Three-layer defence: VPS filter, Traefik IP allowlist, strict SNI matching.
DNS isn't the only way to reach a service. Defence-in-depth means each layer catches what the previous misses.
security networking
More context
Discovered that *.internal.gread.uk services were publicly reachable; the VPS nginx forwarded any TCP connection blindly. Built a three-layer fix:
- VPS nginx SNI filter: drops connections to internal hostnames at TCP level
- Traefik IP allowlist: blocks non-LAN IPs on internal routes
- sniStrict: true: rejects mismatched TLS SNI
Plus CrowdSec for IP reputation, Synology Firewall for default-deny, and weekly automated regression tests via GitHub Actions.
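The three layers can be modelled as a chain of checks. This is a sketch of the logic only, not the real nginx or Traefik configuration. It assumes wildcard certificates for `*.gread.uk` and `*.internal.gread.uk`, and that strict SNI rejects a handshake with a missing or uncovered server name. All function names are hypothetical.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.1.0/24")
INTERNAL_SUFFIX = ".internal.gread.uk"
CERT_BASES = ("internal.gread.uk", "gread.uk")  # assumed wildcard certificates


def cert_covers(sni):
    """True if a wildcard cert (one label deep) matches the server name."""
    for base in CERT_BASES:
        suffix = "." + base
        label = sni[: -len(suffix)]
        if sni.endswith(suffix) and label and "." not in label:
            return True
    return False


def admit(sni, host, client_ip, via_vps=True):
    """Walk a connection through the layers and report where it stops."""
    sni = (sni or "").lower()
    host = host.lower()
    # Layer 1: the VPS drops internal hostnames before forwarding.
    if via_vps and sni.endswith(INTERNAL_SUFFIX):
        return "dropped at VPS (internal SNI)"
    # Layer 3 runs during the TLS handshake, so it comes before routing.
    if not cert_covers(sni):
        return "rejected by strict SNI"
    # Layer 2: internal routes only accept LAN clients.
    if host.endswith(INTERNAL_SUFFIX) and ipaddress.ip_address(client_ip) not in LAN:
        return "blocked by IP allowlist"
    return "allowed"


print(admit("grafana.internal.gread.uk", "grafana.internal.gread.uk", "198.51.100.7"))
print(admit("photos.gread.uk", "grafana.internal.gread.uk", "198.51.100.7"))
print(admit(None, "photos.gread.uk", "198.51.100.7"))
```

The second case is the reason a single layer is not enough: a public SNI gets past the VPS filter, and the allowlist is what stops a request whose Host header names an internal service.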
What 100 GB became
The equivalent Google One tier (10 TB) would cost £480/yr, which alone pays off the NAS in under 2 years.
Authentik SSO
devops
Problem
Per-service authentication, with different logins for Grafana, Immich, Pi-hole, Portainer. Annoying, especially remotely.
Solution
Authentik as centralised identity provider with Google OAuth upstream. Log in once, access everything.
Authentik's config has many moving parts. Session expiry needs tuning per-service. Always keep a break-glass path.
security containers
More context
Centralised identity provider. Google OAuth upstream, so no new credentials. Native OIDC for Grafana, Immich, Portainer. Traefik forwardAuth for Pi-hole, Homepage, Traefik dashboard.
If Authentik goes down, forwardAuth services become inaccessible externally. Pi-hole keeps its own password for LAN break-glass. Emergency recovery key stored offline.
Hosting for Others
devops
Problem
Rosie paying £140/year for Squarespace. Elia wanted a site but the hosting costs didn't make sense for a new venture.
Solution
Self-hosted Astro static sites behind nginx + Traefik. Cost: £0.
Hosting for others changes your reliability mindset overnight. When it's just for you, it doesn't matter. When others depend on it, everything matters.
cost containers networking
More context
The maturity inflection point. Rosie’s photography portfolio was costing £140/year on Squarespace for a static site with minimal traffic. Elia wanted an adventure tourism site but the hosting costs didn’t make sense for a brand-new venture.
Both now self-hosted as Astro static sites behind nginx + Traefik. Adding a new site is a compose file + Traefik labels.
This forced proper monitoring, proper deployment, and proper security. The production mindset came from having real users, not from discipline alone.
Infrastructure as Code
devops
Problem
SSH in, edit a file, restart a container, forget what changed. A month later, something breaks and you can't remember whether the config on the NAS matches what you intended.
Solution
Everything in a git repo. A post-merge hook deploys configs, substitutes secrets, and rebuilds sites. git pull is the deployment mechanism.
The moment you can see a diff of what changed and when, debugging goes from 'what did I do?' to 'this commit broke it.' Dependabot and GitHub Actions handle the rest.
containers
More context
The homelab repo is the source of truth. Every config file (Traefik routes, Prometheus scrape targets, Grafana dashboards, Loki pipeline, CrowdSec scenarios) lives in git.
The post-merge hook on the NAS is where the real work happens. It’s not just git pull && docker compose up. It:
- Copies configs to /volume2/docker/ (the NAS paths that each container mounts)
- Injects secrets from a .secrets file at deploy time, so credentials never touch the repo
- Uses checksum comparison to decide whether to rebuild Astro static sites: if the source hasn’t changed, the build is skipped
- Manages container state: starts new services, restarts changed ones, leaves untouched containers alone
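The checksum step in that list can be sketched like this. It is an illustration of the technique, not the actual hook: the function names and the stamp-file approach are assumptions. The idea is to hash every file under the site's source directory, compare against the hash recorded after the last successful build, and skip the build when they match.

```python
import hashlib
from pathlib import Path


def tree_checksum(source_dir):
    """SHA-256 over the relative path and contents of every file under source_dir."""
    digest = hashlib.sha256()
    root = Path(source_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rebuild(source_dir, stamp_file):
    """True if the source differs from the checksum stored at the last build."""
    stamp = Path(stamp_file)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != tree_checksum(source_dir)


def record_build(source_dir, stamp_file):
    """Store the current checksum; call this only after a successful build."""
    Path(stamp_file).write_text(tree_checksum(source_dir) + "\n")
```

Keeping the stamp file outside the source directory matters, otherwise writing it would change the checksum it records. Recording the stamp only after a successful build means a failed build is retried on the next pull.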
git pull on the NAS is the entire deployment process. No manual SSH, no “I think I changed that file”, no config drift.
Pre-commit hooks enforce formatting. Dependabot opens PRs for container image updates. GitHub Actions run weekly security regression tests against the live stack. It’s not Kubernetes; it doesn’t need to be. It’s version-controlled, auditable, and recoverable from any commit.
UPS
beginner
Problem
A power cut could corrupt volumes, kill running Docker containers, and lose unsaved database state.
Solution
CyberPower CP900EPFCLCD-UK UPS protecting the NAS, fibre ONT, and router. DSM manages safe shutdown via USB.
At ~55W load the UPS gives ~30 minutes runtime. Full shutdown and auto-recovery tested and working. All services come back.
hardware reliability
More context
A sudden power loss can corrupt btrfs metadata, break Docker containers mid-write, and damage databases. The CyberPower CP900EPFCLCD-UK (900VA / 540W, pure sine wave) protects the entire network path: NAS, fibre ONT, and WiFi router.
Connected to the NAS via USB. DSM detects it and manages the shutdown sequence:
- Power fails, UPS switches to battery instantly (no interruption)
- After a configured timeout, DSM enters safe mode
- Docker containers stop, VM shuts down, volumes unmount cleanly (~3 min)
- NAS sends shutdown signal to the UPS, cutting WiFi and network to preserve battery during prolonged outages
- When mains power restores, UPS powers on, NAS auto-boots, all services recover
Immich is the one exception: its encrypted volume mount requires manual intervention on each reboot. This is intentional: the photo library stays locked until someone physically approves access.
Nominal draw is ~55-60W. Battery life expectancy: 3-5 years (replacement part: RBP0051).
Load Testing
devops
Problem
Before sharing the site publicly, no idea where the limits were or what failure looks like under real traffic.
Solution
Systematic load testing with wrk: inside out, then external, with live probe monitoring of critical services throughout.
Traefik is the bottleneck, not nginx or memory. CrowdSec costs almost nothing per request. Critical services (DSM, Authentik) degrade gracefully but never fail. A single VPS-level connlimit rule stops naive floods before they reach the stack.
devops networking observability
More context
The site was solid on the inside. But before pointing anyone else at it, I wanted to know exactly where it would fall over.
Load testing went inside-out: first from within the LAN to isolate the NAS and Traefik, then through the WireGuard tunnel from the VPS, then fully external. Uptime Kuma probes ran throughout so any degradation showed up in the logs, not just anecdotally.
A few things that weren’t obvious until the numbers came in:
- Traefik saturates before anything else. nginx on the VPS handles more than expected; memory on the NAS is barely touched. The bottleneck is Traefik’s request routing under concurrent load.
- CrowdSec is cheap. At 440 RPS the per-request cost barely registers. It’s not the place to optimise.
- Critical services degrade gracefully. DSM and Authentik slow down under pressure but don’t fail. Session state survives.
- A single connlimit rule does most of the flood protection. One iptables rule at the VPS level drops naive connection floods before they reach the internal stack. The more sophisticated CrowdSec scenarios handle the rest.
The result isn’t “it can handle anything”; it’s “I know exactly where it breaks, and it breaks cleanly.”
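Why one connection-limit rule goes so far against naive floods can be seen in a toy model. This is a sketch of the behaviour a per-source limit produces, not the iptables rule itself, and the limit of 3 is an arbitrary value chosen for the example.

```python
from collections import Counter


class ConnLimiter:
    """Reject new connections from a source that already holds `limit` open ones."""

    def __init__(self, limit):
        self.limit = limit
        self.open = Counter()

    def connect(self, source_ip):
        """Return True if the connection is accepted."""
        if self.open[source_ip] >= self.limit:
            return False
        self.open[source_ip] += 1
        return True

    def close(self, source_ip):
        """Release one connection slot for the source."""
        if self.open[source_ip] > 0:
            self.open[source_ip] -= 1


limiter = ConnLimiter(limit=3)  # arbitrary limit for the example
flood = [limiter.connect("198.51.100.7") for _ in range(5)]
print(flood)
print(limiter.connect("203.0.113.20"))  # a different client is unaffected
```

A single noisy source is capped after its third connection while everyone else carries on, and none of the excess ever reaches Traefik.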
What I'd do differently
Since early 2026, a £19/month AI subscription can carry most technical people through this entire setup, as long as you question what it tells you. I didn't always, and ended up exposing my internal services to the internet.
Skip Cloudflare Tunnels for media hosting: the 100 MB body limit and video ToS make it a dead end. Fine for dashboards though.
SHR was the right call. Mixed drive sizes (2x 8 TB + 1x 2 TB free) just work.
Set up logs from day one. Prometheus gives you 'what', Loki gives you 'why'.
Start hosting something for someone else early. It changes your reliability mindset overnight.
Set a fallback DNS on your router. If Pi-hole goes down, you want the network to still work.
Buy a UPS before you lose data, not after.
Related reading
The NAS that kept us awake
Building JS projects in your home folder on a btrfs NAS generates tens of thousands of indexed files and hundreds of Docker layers. Read-write NVMe cache meant every Docker build burned through laptop SSDs that weren't rated for it.
WireGuard throughput: from 5 Mbps to saturating the link
Investigating why remote users were getting 5 Mbps through a WireGuard tunnel despite 400+ Mbps available on both ends. MTU fragmentation, TCP congestion collapse, and UDP buffer exhaustion.
Deploying Authentik SSO across the homelab
Replacing per-service authentication with centralised SSO backed by Google OAuth. OIDC for Grafana, Immich, Portainer; forwardAuth for everything else.
How my internal services were exposed to the internet
Found that my *.internal.gread.uk services were publicly reachable via the VPS TCP proxy. Built a three-layer defence to fix it.