De-risking the Papercut→GKE migration

Papercut's Redis, moved live
Zero downtime, zero data loss

Moved to our own HA Redis — first step of Papercut's move to GKE.

Keys moved 14k · Downtime zero · Status done

01 / the question

Can we move Papercut's Redis off an aging, end-of-life instance — with zero downtime and zero lost data?
While the app keeps serving customers.

before

Shared, unpatchable Redis
One EOL instance (2016), no auth.

after

Dedicated HA Redis
3 nodes + 2 proxies, in our cluster.

First de-risking step of the bigger Papercut→GKE migration. Everything after this inherits an already-proven Redis.

02 / where we started

Where we started, what we built

before · 2 VMs

Old stack, shared Redis

Two aging VMs (pc-x-1, pc-x-2) shared ONE Redis: version 3.2.13, end-of-life since 2016. It held durable state: feature flags, invoice locks, counters.

retired

after · in-cluster

Dedicated HA Redis

A new highly available Redis inside our GKE cluster (3 data servers, 2 proxies), fully separate from our main app's Redis, sized and tuned for Papercut's real workload.

shipped

Redis moved first — the riskiest shared piece of the bigger Papercut→GKE cutover.

03 / why redis first

Redis is the riskiest shared state in the whole cutover. Doing it first means the app move inherits an already-proven Redis.

— the logic behind sequencing this migration

04 / reaching the new redis

How the old VMs reach it
Two networks, joined on purpose

Old VMs

pc-x-1, pc-x-2 — a different network entirely

▲ VPC peering ▼

Private internal address

Locked to just the two Papercut VMs — no public exposure

▲ connects to ▼

New HA Redis

3 data servers + 2 proxies, inside the GKE cluster

Two networks connected on purpose, access locked down to exactly what needs it.

05 / the migration, step by step

The migration, step by step

1 Build & warm

Stood up the new HA Redis, proved it worked, then copied all 14k keys across — no cold-start slowdown.

2 Canary: pc-x-2

Drained pc-x-2 from live traffic, repointed it alone, and verified it worked in isolation before touching anything else.

3 Prove shared state

pc-x-2 wrote a key; pc-x-1 read it and wrote a new value back; pc-x-2 saw the change. Both servers truly share one Redis.

4 Repoint pc-x-1

Repointed pc-x-1 behind the load balancer while pc-x-2 kept serving, so nobody ever saw an interruption.

5 Re-enable & verify

Turned background workers back on for both servers and verified everything end to end.
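Step 1's warm copy could look roughly like the sketch below. It assumes the copy went through Redis's DUMP and RESTORE commands via redis-py style client methods (`scan_iter`, `dump`, `pttl`, `restore`); the deck does not say which tool was actually used. `copy_keys` and `InMemoryRedis` are hypothetical names, and the in-memory class is only a stand-in so the sketch runs without a server.

```python
class InMemoryRedis:
    """Tiny stand-in (hypothetical) for a Redis client; not a real server."""

    def __init__(self):
        self.blobs = {}  # key -> serialized value
        self.ttls = {}   # key -> remaining milliseconds (absent = no expiry)

    def scan_iter(self):
        return list(self.blobs)

    def dump(self, name):
        return self.blobs.get(name)

    def pttl(self, name):
        if name not in self.blobs:
            return -2  # Redis reports -2 for a missing key
        return self.ttls.get(name, -1)  # -1 means "no expiry"

    def restore(self, name, ttl, value, replace=False):
        if name in self.blobs and not replace:
            raise ValueError("BUSYKEY target key already exists")
        self.blobs[name] = value
        if ttl > 0:
            self.ttls[name] = ttl
        else:
            self.ttls.pop(name, None)


def copy_keys(source, target):
    """Copy every key from source to target, preserving TTLs.

    Returns the number of keys copied.
    """
    copied = 0
    for key in source.scan_iter():
        blob = source.dump(key)
        remaining_ms = source.pttl(key)
        if blob is None or remaining_ms == -2:
            continue  # key expired or was deleted while we were scanning
        # PTTL uses -1 for "no expiry"; RESTORE expects 0 for that case.
        ttl = remaining_ms if remaining_ms > 0 else 0
        target.restore(key, ttl, blob, replace=True)
        copied += 1
    return copied
```

With real servers, `source` and `target` would be two redis-py clients pointed at the old and new instances; the function only relies on the four methods named above.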

06 / how we protected it

Zero downtime, zero data loss

downtime

Always one healthy server

The load balancer always had one healthy server. We drained the one being changed, repointed it, verified it, then let it rejoin, and repeated the same sequence for the other.
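The drain, repoint, verify, rejoin sequence can be sketched as a loop that refuses to drain the last server in rotation. This is an illustration only: `rolling_repoint` and its `repoint` and `verify` callbacks are hypothetical names, and the real load balancer calls are left out.

```python
from typing import Callable, List, Set


def rolling_repoint(
    servers: List[str],
    repoint: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Repoint servers one at a time, never draining the last healthy one.

    Returns a log of the actions taken, in order.
    """
    in_rotation: Set[str] = set(servers)
    log: List[str] = []
    for server in servers:
        # Invariant: at least one other server must keep serving traffic.
        if len(in_rotation - {server}) < 1:
            raise RuntimeError(f"draining {server} would leave no healthy server")
        in_rotation.discard(server)
        log.append(f"drain {server}")
        repoint(server)
        log.append(f"repoint {server}")
        if not verify(server):
            # Stop here with the server still drained; the rest keep serving.
            raise RuntimeError(f"{server} failed verification; left drained")
        log.append(f"verify {server}")
        in_rotation.add(server)
        log.append(f"rejoin {server}")
    return log
```

Calling it with `["pc-x-2", "pc-x-1"]` follows the canary-first order described in the steps: pc-x-2 is drained and rejoined before pc-x-1 is touched.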

data loss

Paused writes, verified match

Background workers were paused during the final copy so the source couldn't change mid-flight. The copy matched key-for-key. The old Redis was only ever read from, never modified, so it stayed available as an instant rollback.
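A key-for-key check like the one described here could read each key as a real value, by type, and compare those values rather than serialized bytes. The sketch assumes redis-py style methods (`type`, `get`, `hgetall`, `lrange`, `smembers`, `zrange`); `read_value`, `find_mismatches`, and `FakeRedis` are hypothetical names, and the fake exists only so the example runs without a server.

```python
class FakeRedis:
    """In-memory stand-in (hypothetical); maps Python types to Redis types."""

    _KINDS = {bytes: "string", dict: "hash", list: "list", set: "set"}

    def __init__(self, data):
        self.data = data

    def type(self, key):
        if key not in self.data:
            return "none"
        return self._KINDS[type(self.data[key])]

    def get(self, key):
        return self.data[key]

    def hgetall(self, key):
        return self.data[key]

    def lrange(self, key, start, end):
        return self.data[key]

    def smembers(self, key):
        return self.data[key]


def read_value(client, key):
    """Read a key as a comparable Python value, dispatching on its type."""
    kind = client.type(key)
    if isinstance(kind, bytes):  # redis-py returns bytes by default
        kind = kind.decode()
    if kind == "none":
        return None  # key does not exist
    if kind == "string":
        return client.get(key)
    if kind == "hash":
        return client.hgetall(key)
    if kind == "list":
        return client.lrange(key, 0, -1)
    if kind == "set":
        return set(client.smembers(key))  # order must not matter
    if kind == "zset":
        return client.zrange(key, 0, -1, withscores=True)
    raise ValueError(f"unhandled type {kind!r} for key {key!r}")


def find_mismatches(source, target, keys):
    """Return the keys whose values differ between source and target."""
    return [k for k in keys if read_value(source, k) != read_value(target, k)]
```

An empty result from `find_mismatches` over every source key is what "matched key-for-key" would mean under this sketch.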

07 / key decisions

Key decisions

separation

Dedicated, not shared

A shared Redis would let Papercut evict our main app's keys — sharing would couple two systems that should stay independent.

eviction

Only cache keys can be evicted

Tuned so the no-expiry durable keys — feature flags, invoice locks — are always safe.
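The deck does not name the eviction policy. Redis's `volatile-*` family is the group of `maxmemory-policy` values that restricts eviction to keys with an expiry set, so the sketch below assumes one of those was chosen. The function names and key names are hypothetical, and TTLs follow the PTTL convention where -1 means no expiry.

```python
# Redis maxmemory-policy values that only ever evict keys with an expiry set.
VOLATILE_POLICIES = {
    "volatile-lru",
    "volatile-lfu",
    "volatile-random",
    "volatile-ttl",
}


def evicts_only_expiring_keys(maxmemory_policy: str) -> bool:
    """True if the policy can evict keys, but only those with a TTL."""
    return maxmemory_policy in VOLATILE_POLICIES


def eviction_candidates(ttls_ms: dict) -> list:
    """Keys a volatile-* policy is allowed to evict.

    ttls_ms maps key -> remaining TTL in ms as PTTL reports it (-1 = no expiry).
    """
    return sorted(key for key, ttl in ttls_ms.items() if ttl >= 0)
```

One consequence worth knowing: under a `volatile-*` policy, if memory fills up and no expiring keys are left to evict, Redis rejects writes rather than touching the no-expiry keys.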

security

Added a password

The old Memorystore had no auth at all — this is a real security upgrade.

rollback

Kept the old Redis alive

The old Redis stays running for a few days as an instant rollback, and only then gets deleted.

08 / what we learned

What we learned along the way

EOL Old but compatible. The old Redis is 3.2.13 (end-of-life since 2016). Backward compatibility meant the copy still worked, but its age confirmed how urgent moving off it was.

BYTES Compare values, not bytes. Comparing raw copied blobs across Redis versions gives false mismatches, because the same data can serialize differently from one version to the next. Compare the actual values instead.

1-NOT-2 One Redis, not two. It turned out both servers shared ONE instance, not two as first assumed.

MODULE Start from proper HA. The reusable module we started from was single-pod, not truly HA — so we built real high-availability from our main app's trusted chart.

SIZE Measure before sizing. Papercut's must-keep data was only ~8 MB of a ~600 MB footprint — measure the durable-vs-cache split first.
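The durable-vs-cache measurement from the last lesson could be approximated by bucketing keys on whether they have an expiry. This sketch uses the length of each key's DUMP payload as a rough size proxy, since MEMORY USAGE requires Redis 4.0 or later and the old instance was 3.2.13. `sample_keys` and `split_footprint` are hypothetical names; how the ~8 MB and ~600 MB figures were actually measured is not stated.

```python
def sample_keys(client):
    """Yield (ttl_ms, size_bytes) per key using redis-py style calls."""
    for key in client.scan_iter():
        blob = client.dump(key)
        if blob is not None:  # skip keys that vanished mid-scan
            yield client.pttl(key), len(blob)


def split_footprint(samples):
    """Sum sizes into (durable_bytes, cache_bytes).

    A TTL of -1 means "no expiry" (durable); a TTL >= 0 means the key
    expires (cache). Anything else (such as -2, key gone) is ignored.
    """
    durable = cache = 0
    for ttl_ms, size in samples:
        if ttl_ms == -1:
            durable += size
        elif ttl_ms >= 0:
            cache += size
    return durable, cache
```

Serialized size is not the same as in-memory size, so the result is best read as a ratio between the two buckets rather than an exact footprint.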

09 / where we are now

Where we are now
Verified end to end

shipped Both Papercut servers live on the new HA Redis.

shipped Health checks green through the real load balancer.

shipped Background jobs processing on the new Redis.

shipped Cross-server read/write consistency proven.

shipped Zero downtime, zero data loss.
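The cross-server consistency check listed above (one server writes, the other reads and overwrites, the first sees the change) could be sketched as follows. It assumes each server's connection exposes redis-py style `set`, `get`, and `delete`; `prove_shared_state`, the probe key name, and `DictClient` are hypothetical, and the dict-backed client is only there so the sketch runs without a server.

```python
class DictClient:
    """Stand-in client (hypothetical) backed by a plain dict."""

    def __init__(self, store):
        self.store = store

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)


def prove_shared_state(client_a, client_b, key=b"migration:probe"):
    """Return True only if both clients read and write the same Redis.

    Values are bytes because redis-py returns bytes by default.
    """
    try:
        client_a.set(key, b"from-a")
        if client_b.get(key) != b"from-a":
            return False  # b cannot see a's write
        client_b.set(key, b"from-b")
        return client_a.get(key) == b"from-b"  # a must see b's overwrite
    finally:
        # Clean up the probe key; on separate stores b's copy is left behind.
        client_a.delete(key)
```

Two clients on the same store pass the check, while two clients on separate stores fail at the first read, which is the signal that a server is still pointed at a different Redis.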

10 / what's next

What's next

1 Retire the old Redis

Keep the old Memorystore ~2–3 days as a safety net, then delete it for the monthly savings.

planning

2 Continue the bigger migration

Move on with the broader Papercut→GKE migration. The Redis cutover is de-risked and rehearsed.

planning
