# Restore drill
A backup you’ve never restored isn’t a backup. The first time you find out the dump is corrupt, the cron stopped running three weeks ago, or the IAM role can’t actually read the bucket should not be at 3am during an incident.
This page is the checklist. Run it quarterly. Sign off in the team’s runbook with the date and the issues you found.
## The drill, top to bottom

### 0. Pick a target
| You’re testing | Use |
|---|---|
| The full restore path (DB + blobs) | A fresh staging environment with its own Mongo + its own bucket. |
| Just the DB backup | A scratch cluster (Atlas’s “restore to new cluster”, DocumentDB’s `restore-db-cluster-to-point-in-time`, or a local `docker run mongo`). |
Never restore to production during a drill. The point is to prove the backup works, not to find out the hard way that the `--drop` flag was the wrong choice.
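For a DB-only drill, the scratch cluster from the table above can be a throwaway Docker container. A minimal sketch — the container name and port are illustrative, not part of dAvePi:

```sh
# Start a disposable Mongo 7 for the drill; --rm discards it on stop.
docker run -d --rm --name drill-mongo -p 27017:27017 mongo:7

# Wait until mongod answers a ping before restoring into it.
until docker exec drill-mongo mongosh --quiet --eval 'db.runCommand({ ping: 1 })' >/dev/null 2>&1; do
  sleep 1
done

# ...restore and run checks here, then:
docker stop drill-mongo   # container and its data disappear
```

Because of `--rm`, tear-down in step 8 is a single `docker stop`.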
### 1. Locate the backup

- Confirm the latest scheduled backup completed within the expected window (last 24h for a daily cron; last hour for hourly).
- Open the off-site copy. If you can’t list the bucket, the drill stops here — fix the access path before continuing.
- Note the timestamp and size. A backup that’s suddenly half the usual size is a red flag.
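The freshness and size checks above can be scripted. A sketch with hard-coded example values — in practice you would read them from `stat -c%s`, `aws s3api head-object`, or your backup tool’s manifest; the threshold of “half the previous size” is an illustrative choice, not a dAvePi convention:

```sh
#!/bin/sh
# Illustrative values; substitute real ones from the archive's metadata.
latest_size=120000000      # bytes, today's archive
previous_size=250000000    # bytes, recorded at the last drill
latest_age_hours=30        # hours since the archive's timestamp

# Daily cron: anything older than 24h means the schedule broke.
[ "$latest_age_hours" -le 24 ] || echo "WARN: backup is ${latest_age_hours}h old"

# A sudden shrink (here: less than half the previous size) is the red flag.
if [ "$latest_size" -lt $(( previous_size / 2 )) ]; then
  echo "WARN: backup shrank from $previous_size to $latest_size bytes"
fi
```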
### 2. Restore the database

- Download or restore the archive into the scratch cluster.
- Run a Mongo-side sanity check:

  ```sh
  docker run --rm -it mongo:7 \
    mongosh "<restored-uri>" --eval '
      db = db.getSiblingDB("davepi");
      print("collections:", db.getCollectionNames().length);
      print("audit_log:", db.audit_log.countDocuments({}));
      print("_migrations:", db._migrations.countDocuments({}));
    '
  ```

- Confirm the collection count is sensible (matches production’s `/_describe` schema count plus framework collections).
- Confirm `_migrations` is present — without it, the migration runner thinks the DB is fresh and may re-run everything.
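The “download or restore the archive” step depends on how the backup was taken. If it came from `mongodump`’s archive mode, the restore might look like the sketch below — the URI and archive name are placeholders, and the flags are standard `mongorestore` options; adjust to however your pipeline actually writes the dump:

```sh
# Restore only the davepi database from a gzipped archive into the scratch cluster.
mongorestore --uri "<restored-uri>" \
  --archive=davepi-backup.archive --gzip \
  --nsInclude 'davepi.*'
```

Restricting with `--nsInclude` avoids accidentally restoring namespaces from other databases that happen to share the archive.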
### 3. Boot dAvePi against the restored DB

- Set `MONGO_URI` to the restored cluster.
- Boot the framework. Watch the log for:
  - `listening on port <N>` — the basic boot worked.
  - Any error-level log lines during the schema loop — a schema file that depends on a migration that didn’t apply will fail here.
- Hit `GET /_describe`. The response should list every schema you expect to see. If a schema is missing, the restore didn’t include the right collections.
- Hit `GET /api/v1/<known-resource>` with a real JWT. A list of records should come back.
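The two HTTP checks can be scripted against the freshly booted instance. A sketch — the base URL, token, and resource name are placeholders; the port is whatever your drill instance logged at boot:

```sh
BASE="http://localhost:3000"   # wherever the drill instance is listening
JWT="<real-token>"

# /_describe should succeed; -f makes curl exit non-zero on HTTP errors.
curl -fsS "$BASE/_describe" || echo "FAIL: /_describe unreachable"

# A known resource should return records for a valid JWT.
curl -fsS -H "Authorization: Bearer $JWT" \
  "$BASE/api/v1/<known-resource>" || echo "FAIL: resource list failed"
```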
### 4. Verify migrations

- Run `npx davepi migrate status`. Every entry should read `succeeded`. Anything `pending` means the restore is missing migration records — likely a partial-restore bug.
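To make this check scriptable rather than eyeballed, grep the status output for any line that is not `succeeded`. This assumes the CLI prints one migration per line with its status; verify against your version’s actual output format:

```sh
# Fail loudly if any migration record is missing or still pending after the restore.
if npx davepi migrate status | grep -qv succeeded; then
  echo "FAIL: not every migration reads 'succeeded'; possible partial restore"
  exit 1
fi
```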
### 5. Spot-check a state machine

If you have schemas with state machines:
- Read a record’s `availableTransitions`. It should match the schema’s declared `transitions[<current state>]`.
- Attempt an invalid transition — it should return 400 `INVALID_TRANSITION` with the correct `allowed` list. If it succeeds, the schema’s state machine wasn’t restored correctly.
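A quick way to assert the 400 without reading the response by hand. The transition endpoint path below is a placeholder (the page doesn’t specify it); substitute however your schema exposes transitions:

```sh
# Expect HTTP 400 + INVALID_TRANSITION when forcing a disallowed transition.
code=$(curl -s -o /tmp/drill-body -w '%{http_code}' -X POST \
  -H "Authorization: Bearer <real-token>" \
  "http://localhost:3000/api/v1/<resource>/<id>/<invalid-transition>")
if [ "$code" = "400" ] && grep -q INVALID_TRANSITION /tmp/drill-body; then
  echo "state machine OK"
else
  echo "state machine FAILED (HTTP $code)"
fi
```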
### 6. Verify file blobs

- Pick a record with a `type: 'File'` field set.
- Call the fetch endpoint (`GET /api/v1/<path>/:id/<field>`). It should return the URL or stream the blob.
- If the response is “blob not found” but the `FileMeta` sub-document is present, the DB and storage backups were taken at different points. See File-storage backup.
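The blob check reduces to a status-code test. A sketch with placeholder path segments and token:

```sh
# 200 = blob served; 404 with FileMeta present = DB/storage backups drifted.
code=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: Bearer <real-token>" \
  "http://localhost:3000/api/v1/<path>/<id>/<field>")
case "$code" in
  200) echo "blob OK" ;;
  404) echo "blob missing: check DB vs storage backup timestamps" ;;
  *)   echo "unexpected HTTP $code" ;;
esac
```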
### 7. Verify auth + ACL

- Issue a new JWT via `POST /login`. If this fails, the `user` collection didn’t restore or the `TOKEN_KEY` env var doesn’t match what was used to sign the existing tokens.
- If your schemas use `acl.list` or field-level `acl.read`, hit an endpoint as both an admin and a regular user. Both should match the documented per-role projection.
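Issuing the fresh JWT can be scripted too. The request body shape below is an assumption; adjust it to your actual login schema:

```sh
# A 401 here points at the user collection or a TOKEN_KEY mismatch.
code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
  -H 'Content-Type: application/json' \
  -d '{"email":"<drill-user>","password":"<password>"}' \
  "http://localhost:3000/login")
[ "$code" = "200" ] && echo "login OK" || echo "login FAILED (HTTP $code)"
```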
### 8. Tear down

- Drop the scratch cluster / staging environment.
- If you’re using Atlas’s “restore to new cluster” UI, double-check you’re deleting the restored cluster, not production. The names look similar mid-drill.
## What to write down

After the drill, capture:
- Date of the drill.
- Backup timestamp restored.
- Total time from “decide to drill” to “scratch cluster serving `GET /_describe`”.
- Issues found — even small ones. (“Backup archive name format changed; alerts based on the old name didn’t fire.”)
- Action items — fix the issues before the next drill.
The total time is your real RTO. If it’s higher than your stated recovery target, you have work to do before you actually need the backup.
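One low-tech way to capture that total is to timestamp the drill script itself; the real values come from `date`, and the arithmetic is just seconds to minutes:

```sh
# Record wall-clock start/end of the drill and report minutes elapsed.
start=$(date +%s)
# ... run steps 0 through 8 ...
end=$(date +%s)
echo "measured RTO: $(( (end - start) / 60 )) minutes"
```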
## When the drill fails

Common failure modes and what they mean:
| Failure | Likely cause |
|---|---|
| Can’t list the off-site bucket | IAM role / credentials drifted from what’s set on the host. |
| Archive download is incomplete / corrupted | Backup didn’t complete; cron failure, OOM kill, disk-full. |
| `mongorestore` fails partway through | Source archive is corrupt, or the target cluster’s storage filled up. |
| Restored DB has no collections | Restore command pointed at the wrong DB name (`--db davepi` vs default). |
| `/_describe` shows fewer schemas than production | Schema files weren’t deployed alongside the restored data. The schemas live in source code, not the DB; restore alone doesn’t bring them back. |
| Auth returns 401 for known credentials | Restored DB but the host’s `TOKEN_KEY` doesn’t match what signed the existing tokens. (Use the same `TOKEN_KEY` across deploys, or accept that restores invalidate existing sessions.) |
| `FileMeta` rows exist but blobs are 404 | DB and storage backups drifted in time. |
## Frequency

- Quarterly at minimum. Calendar it.
- After any change to the backup pipeline (new bucket, new cron schedule, new region). The first drill after a change is the high-value one.
- Before launching a new product surface that depends on durable data. “We can restore from yesterday” only matters if it’s true.
## See also

- Per-platform backup guides
- File-storage backup — including the `FileMeta` ↔ blob consistency check.