Backup Mechanics

The backup pipeline is where most of Velero's complexity lives.

Understanding this path is essential for debugging, performance tuning, and writing plugins.

Backup Lifecycle State Machine

New ──► InProgress ──► Completed
                   ├──► PartiallyFailed
                   └──► Failed
                             └──► Deleting

The BackupController drives this state machine. Once a Backup enters InProgress, a second controller instance will not reconcile it again: leader election and the phase check both prevent this.
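The diagram can be written down as a small transition table. This is a minimal sketch for illustration, not Velero's Go implementation: `Phase`, `ALLOWED`, `can_transition`, and `should_reconcile` are hypothetical names, and treating Deleting as reachable from every terminal phase is an assumption about how to read the diagram.

```python
from enum import Enum


class Phase(str, Enum):
    NEW = "New"
    IN_PROGRESS = "InProgress"
    COMPLETED = "Completed"
    PARTIALLY_FAILED = "PartiallyFailed"
    FAILED = "Failed"
    DELETING = "Deleting"


# Phases in which the backup run itself is over.
TERMINAL = {Phase.COMPLETED, Phase.PARTIALLY_FAILED, Phase.FAILED}

# Hypothetical transition table derived from the diagram.
# Assumption: any terminal phase may move to Deleting.
ALLOWED = {
    Phase.NEW: {Phase.IN_PROGRESS},
    Phase.IN_PROGRESS: set(TERMINAL),
    Phase.COMPLETED: {Phase.DELETING},
    Phase.PARTIALLY_FAILED: {Phase.DELETING},
    Phase.FAILED: {Phase.DELETING},
    Phase.DELETING: set(),
}


def can_transition(current: Phase, target: Phase) -> bool:
    """Return True if the table permits moving from current to target."""
    return target in ALLOWED[current]


def should_reconcile(phase: Phase) -> bool:
    """Phase check: only a New backup is picked up for execution."""
    return phase is Phase.NEW
```

The phase check is what makes a repeated reconcile harmless: a Backup that is already InProgress is simply not picked up again.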

Step-by-Step Backup Execution

1. BackupController Picks Up a New Backup

An informer event triggers the reconciler. The controller sets status.phase = InProgress and records status.startTimestamp.

In HA deployments, a distributed lock (Kubernetes lease) prevents concurrent execution.

2. Resource Discovery and Collection

The item collector uses the API server's discovery API to enumerate all resource types. For each resource type that passes the include/exclude filters, it lists objects via the dynamic client.

Discovery is done concurrently with goroutines per resource group.

Key file: pkg/backup/item_collector.go
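The filter-then-list flow described above can be sketched as follows. This is a simplified illustration, not the code in item_collector.go: `should_include` and `collect` are hypothetical helpers, `list_objects` stands in for the dynamic client, and the filter semantics (excludes win over includes, an empty include list or `*` matches everything) are stated here as assumptions.

```python
from typing import Callable, Dict, Iterable, List


def should_include(name: str, includes: Iterable[str], excludes: Iterable[str]) -> bool:
    """Assumed semantics: excludes take precedence; empty includes or '*' match all."""
    includes, excludes = set(includes), set(excludes)
    if name in excludes:
        return False
    return not includes or "*" in includes or name in includes


def collect(
    resource_types: Iterable[str],
    includes: Iterable[str],
    excludes: Iterable[str],
    list_objects: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """List objects only for the resource types that pass the filters."""
    includes, excludes = list(includes), list(excludes)
    collected: Dict[str, List[str]] = {}
    for rtype in resource_types:
        if should_include(rtype, includes, excludes):
            # list_objects is a stand-in for a dynamic-client list call.
            collected[rtype] = list_objects(rtype)
    return collected
```

Filtering before listing matters for performance: excluded resource types never cost a list call against the API server.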

3. BackupItemAction Plugins

For each collected item, Velero runs every registered BackupItemAction plugin whose AppliesTo() matches the item's resource type. A plugin can:

- Mutate the item (e.g. strip sensitive annotations)
- Add additional items to the backup graph (e.g. the built-in pod-action adds the PVC when a Pod is backed up, ensuring PVC/PV pairs are consistent)
- Set skip flags to exclude an item

This is where most custom business logic lives. See Plugin System.
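The way actions are matched and chained can be illustrated with a small sketch. It is not the real plugin interface (which is defined in Go): `Action`, `run_actions`, the two example actions, and the `example.com/secret-token` annotation are all hypothetical, and items are modeled as plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

# An action returns the (possibly mutated) item plus (resource, name)
# identifiers of additional items that should join the backup.
ActionResult = Tuple[dict, List[Tuple[str, str]]]


@dataclass
class Action:
    name: str
    applies_to: Set[str]                  # resource types this action handles
    execute: Callable[[dict], ActionResult]


def strip_annotation(item: dict) -> ActionResult:
    """Example mutation: drop a (hypothetical) sensitive annotation."""
    updated = dict(item)
    annotations = dict(updated.get("annotations", {}))
    annotations.pop("example.com/secret-token", None)
    updated["annotations"] = annotations
    return updated, []


def add_pvcs(item: dict) -> ActionResult:
    """Example of pulling related items in: a pod brings its PVCs along."""
    extra = [("persistentvolumeclaims", claim) for claim in item.get("claims", [])]
    return item, extra


def run_actions(resource: str, item: dict, actions: List[Action]) -> ActionResult:
    """Run every matching action in order, threading the item through."""
    additional: List[Tuple[str, str]] = []
    for action in actions:
        if resource in action.applies_to:
            item, extra = action.execute(item)
            additional.extend(extra)
    return item, additional
```

Each action receives the output of the previous one, so the order in which actions run affects the final item.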

4. PVC → Volume Backup Decision

For each PVC, Velero decides the volume backup method in priority order:

  1. Skip: if snapshotVolumes: false, or if the pod mounting the volume lists it in the opt-out annotation backup.velero.io/backup-volumes-excludes
  2. CSI VolumeSnapshot: if the CSI plugin is enabled and a matching VolumeSnapshotClass exists
  3. Cloud provider snapshot: if a VolumeSnapshotter plugin is registered for the storage class
  4. Kopia file-level copy: if defaultVolumesToFsBackup: true, or if the pod mounting the volume lists it in the opt-in annotation backup.velero.io/backup-volumes
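The priority order above amounts to a first-match decision function. The sketch below only restates that order: `VolumeContext` and `choose_method` are hypothetical names, the boolean fields are simplified stand-ins for the real checks, and the final `"none"` fallback is an assumption because the text does not say what happens when no rule matches.

```python
from dataclasses import dataclass


@dataclass
class VolumeContext:
    snapshot_volumes: bool = True         # backup-level snapshotVolumes setting
    opted_out: bool = False               # opt-out annotation covers this volume
    csi_enabled: bool = False             # CSI plugin enabled
    has_snapshot_class: bool = False      # matching VolumeSnapshotClass exists
    has_volume_snapshotter: bool = False  # VolumeSnapshotter plugin registered
    default_fs_backup: bool = False       # defaultVolumesToFsBackup setting
    opted_in: bool = False                # opt-in annotation covers this volume


def choose_method(ctx: VolumeContext) -> str:
    """Return the first method whose condition holds, in the documented order."""
    if not ctx.snapshot_volumes or ctx.opted_out:
        return "skip"
    if ctx.csi_enabled and ctx.has_snapshot_class:
        return "csi"
    if ctx.has_volume_snapshotter:
        return "cloud-snapshot"
    if ctx.default_fs_backup or ctx.opted_in:
        return "kopia"
    # Assumption: the source does not describe the no-match case.
    return "none"
```

Because the first match wins, a volume that qualifies for both a CSI snapshot and a file-level copy resolves to the CSI snapshot in this reading.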

5. Pre-Backup Hooks

Before a pod's volume data is captured, Velero executes pre-backup hooks by exec'ing commands in the pod's containers. Hooks are typically used to quiesce databases, flush caches, or sync filesystems. See Hooks.

6. Volume Snapshot / Data Upload

CSI / Kopia path

Velero creates DataUpload custom resources. The DataUploadController in node-agent picks these up and runs Kopia to upload data directly from the PVC mount on the node. The Velero server polls DataUpload.status for completion.

Cloud snapshot path

Velero calls the VolumeSnapshotter plugin synchronously. The plugin calls the cloud provider API and returns a snapshot ID that Velero stores in the backup metadata.
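The difference between the two paths is that the first is asynchronous (create a resource, then poll its status) while the second is a blocking call. A minimal sketch of the polling side is shown below. Everything in it is hypothetical: `wait_for_upload`, the `get_phase` callback (a stand-in for reading DataUpload.status), the set of terminal phase names, and the timeout and interval defaults.

```python
import time
from typing import Callable

# Assumed terminal phase names, for illustration only.
TERMINAL_PHASES = {"Completed", "Failed", "Canceled"}


def wait_for_upload(
    get_phase: Callable[[], str],
    timeout_s: float = 3600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_phase until it reports a terminal phase or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"upload still in phase {phase!r} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.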

7. Post-Backup Hooks

After volume data is captured, Velero runs post-backup hooks to un-quiesce the workload (e.g. UNLOCK TABLES). Post hooks run even if pre hooks fail, unless onError: Fail caused the backup to abort.
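The ordering guarantee can be sketched as a small wrapper around the capture step. This is an illustration of the rule stated above, not Velero's hook executor: `run_with_hooks` and `BackupAborted` are hypothetical names, hooks are modeled as plain callables, and `"Continue"` as the non-aborting mode is an assumption.

```python
from typing import Callable, List

Hook = Callable[[], None]


class BackupAborted(Exception):
    """Raised when a pre hook fails and on_error is 'Fail'."""


def run_with_hooks(
    pre_hooks: List[Hook],
    capture: Callable[[], None],
    post_hooks: List[Hook],
    on_error: str = "Continue",
) -> List[str]:
    """Run pre hooks, capture volume data, then post hooks; return collected errors."""
    errors: List[str] = []
    for hook in pre_hooks:
        try:
            hook()
        except Exception as exc:
            if on_error == "Fail":
                # Abort: neither the capture nor the post hooks run.
                raise BackupAborted(str(exc)) from exc
            errors.append(f"pre hook failed: {exc}")
    capture()
    for hook in post_hooks:
        try:
            hook()
        except Exception as exc:
            errors.append(f"post hook failed: {exc}")
    return errors
```

In the non-aborting mode a failed pre hook is recorded as an error, and the post hooks still get their chance to un-quiesce the workload.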

8. Serialization and Upload

All collected, plugin-processed items are serialized to JSON and written into a gzipped tarball ({backup-name}.tar.gz). A {backup-name}-results.gz file captures warnings and errors per item. Both are streamed to the object store via the ObjectStore plugin.

Key file: pkg/backup/backup.go

9. Metadata Upload

A velero-backup.json metadata file is written to the BackupStorageLocation (BSL). The BackupSyncController reads this file to reconstruct Backup objects in a new cluster, which enables cross-cluster restores without re-creating Backup resources manually.

Object Store Layout

{bucket}/{prefix}/
  backups/
    {backup-name}/
      velero-backup.json                       # Backup CRD spec + status
      {backup-name}.tar.gz                     # All K8s resources (JSON per item)
      {backup-name}-logs.gz                    # Velero server logs during backup
      {backup-name}-results.gz                 # Warnings and errors per item
      {backup-name}-csi-volumesnapshots.json.gz  # CSI snapshot metadata
      {backup-name}-volumesnapshots.json.gz    # Legacy VSL snapshot metadata
  restores/
    {restore-name}/
      restore-{restore-name}-logs.gz
      restore-{restore-name}-results.gz
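When scripting against the bucket directly, the layout above can be turned into object keys with a small helper. This is a convenience sketch, not part of Velero: `backup_object_keys` and its dictionary labels are hypothetical, and only the file names shown in the layout are used.

```python
from typing import Dict


def backup_object_keys(backup_name: str, prefix: str = "") -> Dict[str, str]:
    """Build the object keys for one backup, relative to the bucket root."""
    cleaned = prefix.strip("/")
    base = f"{cleaned}/" if cleaned else ""
    root = f"{base}backups/{backup_name}/"
    return {
        "metadata": root + "velero-backup.json",
        "contents": root + f"{backup_name}.tar.gz",
        "logs": root + f"{backup_name}-logs.gz",
        "results": root + f"{backup_name}-results.gz",
        "csi_snapshots": root + f"{backup_name}-csi-volumesnapshots.json.gz",
        "volume_snapshots": root + f"{backup_name}-volumesnapshots.json.gz",
    }
```

For example, `backup_object_keys("nightly", prefix="prod")["contents"]` yields `prod/backups/nightly/nightly.tar.gz`.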

Tar Archive Structure

resources/
  deployments/
    namespaces/
      default/
        my-deployment.json
  persistentvolumeclaims/
    namespaces/
      default/
        my-pvc.json
  persistentvolumes/
    cluster/                    # cluster-scoped resources live here
      pvc-abc123.json
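The path rule in this listing (namespaced items under namespaces/{namespace}, cluster-scoped items under cluster) can be sketched together with a toy archive writer. This is an illustration built on Python's standard tarfile module, not Velero's serializer: `item_path` and `write_archive` are hypothetical names, and items are assumed to be dictionaries with a metadata.name field.

```python
import io
import json
import tarfile
from typing import Iterable, Optional, Tuple


def item_path(resource: str, name: str, namespace: Optional[str] = None) -> str:
    """Map an item to its path inside the archive."""
    scope = f"namespaces/{namespace}" if namespace else "cluster"
    return f"resources/{resource}/{scope}/{name}.json"


def write_archive(items: Iterable[Tuple[str, Optional[str], dict]]) -> bytes:
    """Serialize (resource, namespace, object) triples into a gzipped tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for resource, namespace, obj in items:
            data = json.dumps(obj).encode("utf-8")
            info = tarfile.TarInfo(item_path(resource, obj["metadata"]["name"], namespace))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Passing `namespace=None` is how this sketch marks a cluster-scoped resource such as a PersistentVolume.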

Useful Debug Techniques

# Watch backup progress
kubectl get backup my-backup -n velero -o yaml -w

# Stream velero server logs during a backup
kubectl logs -n velero deployment/velero -f --since=5m

# Inspect what's in the tar archive
velero backup download my-backup --output /tmp/my-backup.tar.gz
tar -tzf /tmp/my-backup.tar.gz | head -50

# See per-item warnings/errors
velero backup describe my-backup --details
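Once the archive has been downloaded, a short script can summarize it by resource type, which is often quicker than reading the raw tar listing. The sketch below is a hypothetical helper (`count_by_resource` is not a Velero tool) and assumes the resources/{type}/... path layout described in Tar Archive Structure.

```python
import tarfile
from collections import Counter


def count_by_resource(archive_path: str) -> Counter:
    """Count archived items per resource type in a downloaded backup tarball."""
    counts: Counter = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            parts = member.name.split("/")
            # Only regular files under resources/{type}/... are items.
            if member.isfile() and len(parts) >= 3 and parts[0] == "resources":
                counts[parts[1]] += 1
    return counts
```

Calling `count_by_resource("/tmp/my-backup.tar.gz").most_common(10)` would list the ten resource types with the most items, which helps spot a type that dominates the backup.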

Performance Note

Backup speed is bound by API server list throughput and object store upload bandwidth. For large clusters (10k+ objects), the list phase dominates.

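Each list call reflects the cluster state at the moment it is made, so a backup is not an atomic point-in-time snapshot: items added after their resource type has been listed may be missing from the backup.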

Next Up

Restore Mechanics