Backup Mechanics

The backup pipeline is where most of Velero's complexity lives.

Understanding this path is essential for debugging, performance tuning, and writing plugins.

Backup Lifecycle State Machine

New ──► InProgress ─┬─► Completed
                    ├─► PartiallyFailed
                    └─► Failed ──► Deleting

BackupController drives this state machine. Once a Backup enters InProgress, a second controller instance will not reconcile it again: leader election and the phase check both prevent this.
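The phase check can be pictured as a small transition table. The sketch below is an illustrative model of the diagram above, not Velero's code: the table, `can_transition`, and `should_process` are hypothetical names, and treating an empty phase like New is an assumption.

```python
# Illustrative model of the phase diagram; all names here are hypothetical.
TRANSITIONS = {
    "New": {"InProgress"},
    "InProgress": {"Completed", "PartiallyFailed", "Failed"},
    "Completed": set(),
    "PartiallyFailed": set(),
    "Failed": {"Deleting"},  # as drawn in the diagram
    "Deleting": set(),
}


def can_transition(current, target):
    """Return True if the diagram allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def should_process(phase):
    """Phase check: only a Backup that has not started yet is picked up.

    Assumption: an unset phase ("") is treated the same as "New".
    """
    return phase in ("", "New")
```

With this check in place, a reconciler that sees a Backup already in InProgress simply returns without doing any work.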

Step-by-step Backup Execution

1. BackupController Picks Up a New Backup

An informer event triggers the reconciler. The controller then sets status.phase = InProgress and records status.startTimestamp.

In HA deployments, a distributed lock (Kubernetes lease) prevents concurrent execution.

2. Resource Discovery and Collection

The item collector uses the API server's discovery API to enumerate all resource types. For each resource type that matches the include/exclude filters, it lists objects via the dynamic client.

Discovery is done concurrently with goroutines per resource group.

Key file: pkg/backup/item_collector.go
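As a rough picture of the filter-then-list loop, here is a minimal sketch. It assumes glob-style patterns where excludes take precedence and an empty include list means "everything"; `should_include`, `collect`, and the `list_objects` callback are hypothetical stand-ins for the discovery and dynamic clients.

```python
from fnmatch import fnmatch


def should_include(name, includes, excludes):
    """Apply include/exclude filters to one resource type name.

    Assumptions: excludes win over includes, and an empty include
    list means every resource type is included.
    """
    if any(fnmatch(name, pattern) for pattern in excludes):
        return False
    if not includes:
        return True
    return any(fnmatch(name, pattern) for pattern in includes)


def collect(resource_types, list_objects, includes, excludes):
    """List objects for every resource type that passes the filters.

    list_objects is a hypothetical callback standing in for a
    dynamic-client list call; it returns the objects of one type.
    """
    items = []
    for resource in resource_types:
        if should_include(resource, includes, excludes):
            items.extend((resource, obj) for obj in list_objects(resource))
    return items
```

The sketch runs sequentially for clarity; the concurrency described above would wrap the per-type list calls.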

3. BackupItemAction Plugins

For each collected item, Velero runs every registered BackupItemAction plugin whose AppliesTo() matches the resource type. These plugins can:

  • Mutate the item (e.g. strip sensitive annotations)
  • Add additional items to the backup graph (e.g. the built-in pod-action adds the PVC when a Pod is backed up, ensuring PVC/PV pairs are consistent)
  • Set skip flags to exclude an item

This is where most custom business logic lives. See Plugin System.
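The plugin loop can be sketched as follows. This is a simplified Python model, not the real Go plugin interface: `StripAnnotation`, `AddPVCs`, `run_actions`, and the `example.com/secret-token` annotation are all hypothetical, and skip flags are left out.

```python
import copy


class StripAnnotation:
    """Hypothetical action that removes one annotation from pods."""

    def applies_to(self):
        return {"pods"}

    def execute(self, item):
        item = copy.deepcopy(item)  # never mutate the caller's object
        annotations = item.get("metadata", {}).get("annotations", {})
        annotations.pop("example.com/secret-token", None)
        return item, []


class AddPVCs:
    """Hypothetical action that pulls a pod's PVCs into the backup."""

    def applies_to(self):
        return {"pods"}

    def execute(self, item):
        namespace = item["metadata"]["namespace"]
        extra = [
            ("persistentvolumeclaims", namespace, vol["persistentVolumeClaim"]["claimName"])
            for vol in item.get("spec", {}).get("volumes", [])
            if "persistentVolumeClaim" in vol
        ]
        return item, extra


def run_actions(resource, item, actions):
    """Run each matching action in order; collect additional items."""
    additional = []
    for action in actions:
        if resource in action.applies_to():
            item, extra = action.execute(item)
            additional.extend(extra)
    return item, additional
```

Each action receives the output of the previous one, so the order of registration matters whenever two actions touch the same fields.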

4. PVC → Volume Backup Decision

For each PVC, Velero decides the volume backup method in priority order:

  1. Skip: if snapshotVolumes: false, or if the pod mounting the PVC lists the volume in its opt-out annotation backup.velero.io/backup-volumes-excludes
  2. CSI VolumeSnapshot: if the CSI plugin is enabled and a matching VolumeSnapshotClass exists
  3. Cloud provider snapshot: if a VolumeSnapshotter plugin is registered for the storage class
  4. Kopia file-level copy: if defaultVolumesToFsBackup: true, or if the pod mounting the PVC lists the volume in its opt-in annotation backup.velero.io/backup-volumes
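The priority order above reads naturally as a chain of early returns. The function below is a sketch that mirrors the list as written; its name, parameters, and return labels are hypothetical, and the final "none" fallback for a volume that matches no rule is an assumption.

```python
def choose_volume_method(
    *,
    snapshot_volumes=True,
    opted_out=False,
    csi_enabled=False,
    has_snapshot_class=False,
    has_volume_snapshotter=False,
    default_fs_backup=False,
    opted_in=False,
):
    """Return the first volume backup method whose condition holds."""
    if not snapshot_volumes or opted_out:
        return "skip"
    if csi_enabled and has_snapshot_class:
        return "csi-snapshot"
    if has_volume_snapshotter:
        return "provider-snapshot"
    if default_fs_backup or opted_in:
        return "fs-backup"
    return "none"  # assumption: no rule matched, no volume data captured
```

Because the checks run in order, a volume that qualifies for both a CSI snapshot and a file-level copy gets the CSI snapshot.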

5. Pre-Backup Hooks

Before capturing a pod's volume data, Velero executes any pre-backup hooks by exec'ing into the pod's containers. Hooks are typically used to quiesce databases, flush caches, or sync filesystems. See Hooks.

6. Volume Snapshot / Data Upload

The work in this step depends on the method chosen in step 4.

For volumes whose data is uploaded by node-agent, Velero creates DataUpload custom resources. The DataUploadController in node-agent picks these up and runs Kopia to upload data directly from the PVC mount on the node. The Velero server polls DataUpload.status for completion.

For cloud provider snapshots, Velero calls the VolumeSnapshotter plugin synchronously. The plugin calls the cloud provider API and returns a snapshot ID, which Velero stores in the backup metadata.
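Polling a DataUpload for completion amounts to a loop with a deadline. The sketch below assumes the terminal phases are named Completed, Failed, and Canceled; `wait_for_data_upload` and the `get_phase` callback are hypothetical, with `get_phase` standing in for a read of DataUpload.status.phase.

```python
import time

# Assumption: these are the phases after which no further change happens.
TERMINAL_PHASES = {"Completed", "Failed", "Canceled"}


def wait_for_data_upload(get_phase, interval=1.0, timeout=600.0):
    """Poll get_phase() until a terminal phase or the timeout is reached.

    get_phase is a hypothetical callback that returns the current
    phase string of one DataUpload.
    """
    deadline = time.monotonic() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"data upload still in phase {phase!r}")
        time.sleep(interval)
```

The snapshot path needs no such loop, since the plugin call returns the snapshot ID directly.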

7. Post-Backup Hooks

After volume data is captured, Velero runs post-backup hooks to un-quiesce the workload (e.g. UNLOCK TABLES). Post hooks run even if pre hooks fail, unless onError: Fail caused the backup to abort.
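The ordering of steps 5 to 7 can be sketched as one function. This is a simplified model of the behavior described above, not Velero's hook executor: `run_with_hooks` is a hypothetical name, hooks are plain callables, and running post hooks when the capture itself raises is an assumption.

```python
def run_with_hooks(pre_hooks, capture, post_hooks, on_error="Fail"):
    """Run pre hooks, capture volume data, then run post hooks.

    pre_hooks and post_hooks are lists of (name, callable) pairs.
    Returns a list of event strings describing what happened.
    """
    events = []
    for name, hook in pre_hooks:
        try:
            hook()
            events.append(f"pre:{name}:ok")
        except Exception:
            events.append(f"pre:{name}:failed")
            if on_error == "Fail":
                # Abort: no capture, and post hooks do not run.
                events.append("aborted")
                return events
    try:
        capture()
        events.append("capture:ok")
    finally:
        # Assumption: post hooks also run when capture() raises.
        for name, hook in post_hooks:
            try:
                hook()
                events.append(f"post:{name}:ok")
            except Exception:
                events.append(f"post:{name}:failed")
    return events
```

With on_error set to "Continue", a failed freeze hook is recorded and the thaw hook still runs after the capture.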

8. Serialization and Upload

All collected, plugin-processed items are serialized to JSON and written into a tarball ({backup-name}.tar.gz). A separate {backup-name}-results.gz file captures warnings and errors per item. Both are streamed to the object store via the ObjectStore plugin.

Key file: pkg/backup/backup.go
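To make the serialization step concrete, here is a minimal sketch that writes items into a gzip-compressed tar in memory, using the path convention shown under Tar Archive Structure. `item_path` and `write_backup_tarball` are hypothetical helpers, and the real archive may contain additional files that this sketch does not produce.

```python
import io
import json
import tarfile


def item_path(resource, name, namespace=None):
    """Build the in-archive path for one item."""
    scope = f"namespaces/{namespace}" if namespace else "cluster"
    return f"resources/{resource}/{scope}/{name}.json"


def write_backup_tarball(items):
    """Serialize (resource, object) pairs to JSON inside a tar.gz.

    Returns the archive as bytes; a real implementation would stream
    it to the object store instead of buffering it in memory.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for resource, obj in items:
            meta = obj["metadata"]
            data = json.dumps(obj, sort_keys=True).encode("utf-8")
            info = tarfile.TarInfo(item_path(resource, meta["name"], meta.get("namespace")))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Cluster-scoped objects have no namespace in their metadata, which is what routes them to the cluster/ directory.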

9. Metadata Upload

A velero-backup.json metadata file is written to the BackupStorageLocation (BSL). The BackupSyncController reads this file to reconstruct Backup objects in a new cluster, which enables cross-cluster restores without re-creating Backup resources manually.
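The sync idea boils down to a set difference: backups that have a metadata file in the bucket but no Backup object in the cluster. The sketch below is a hypothetical illustration of that comparison, assuming the key layout backups/{backup-name}/velero-backup.json under an optional prefix; it is not the controller's actual logic.

```python
METADATA_SUFFIX = "/velero-backup.json"


def backups_missing_from_cluster(object_keys, cluster_backups, prefix=""):
    """Return names of backups present in the bucket but not the cluster.

    object_keys is an iterable of object store keys; cluster_backups
    is an iterable of Backup names that already exist in the cluster.
    """
    root = "/".join(p for p in (prefix.strip("/"), "backups") if p) + "/"
    found = set()
    for key in object_keys:
        if key.startswith(root) and key.endswith(METADATA_SUFFIX):
            name = key[len(root):-len(METADATA_SUFFIX)]
            if name and "/" not in name:
                found.add(name)
    return sorted(found - set(cluster_backups))
```

Each returned name is a candidate for which the metadata file would be downloaded and a Backup object created.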

Object Store Layout

{bucket}/{prefix}/
  backups/
    {backup-name}/
      velero-backup.json                       # Backup resource spec + status
      {backup-name}.tar.gz                     # All K8s resources (JSON per item)
      {backup-name}-logs.gz                    # Velero server logs during backup
      {backup-name}-results.gz                 # Warnings and errors per item
      {backup-name}-csi-volumesnapshots.json.gz  # CSI snapshot metadata
      {backup-name}-volumesnapshots.json.gz    # Legacy VSL snapshot metadata
  restores/
    {restore-name}/
      restore-{restore-name}-logs.gz
      restore-{restore-name}-results.gz
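When scripting against the bucket, it can help to compute these keys rather than hard-code them. The helper below is a hypothetical sketch that reproduces the backup part of the layout above; the function name and dictionary keys are illustrative only.

```python
def backup_object_keys(backup_name, prefix=""):
    """Return the object store keys for one backup, per the layout above."""
    base = "/".join(p for p in (prefix.strip("/"), "backups", backup_name) if p)
    return {
        "metadata": f"{base}/velero-backup.json",
        "contents": f"{base}/{backup_name}.tar.gz",
        "logs": f"{base}/{backup_name}-logs.gz",
        "results": f"{base}/{backup_name}-results.gz",
        "csi_snapshots": f"{base}/{backup_name}-csi-volumesnapshots.json.gz",
        "volume_snapshots": f"{base}/{backup_name}-volumesnapshots.json.gz",
    }
```

For example, `backup_object_keys("my-backup", prefix="prod")["contents"]` yields `prod/backups/my-backup/my-backup.tar.gz`.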

Tar Archive Structure

resources/
  deployments/
    namespaces/
      default/
        my-deployment.json
  persistentvolumeclaims/
    namespaces/
      default/
        my-pvc.json
  persistentvolumes/
    cluster/                    # cluster-scoped resources live here
      pvc-abc123.json

Useful Debug Techniques

# Watch backup progress
kubectl get backup my-backup -n velero -o yaml -w

# Stream velero server logs during a backup
kubectl logs -n velero deployment/velero -f --since=5m

# Inspect what's in the tar archive
velero backup download my-backup --output /tmp/my-backup.tar.gz
tar -tzf /tmp/my-backup.tar.gz | head -50

# See per-item warnings/errors
velero backup describe my-backup --details
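For a quick overview of a downloaded archive, a short script can count items per resource type. This sketch relies only on the resources/{type}/... path convention described under Tar Archive Structure; `count_items_by_resource` is a hypothetical helper, and entries outside resources/ are ignored.

```python
import tarfile
from collections import Counter


def count_items_by_resource(tar):
    """Count JSON items per resource type in an open backup tarball."""
    counts = Counter()
    for name in tar.getnames():
        parts = name.split("/")
        # Expected shape: resources/<type>/.../<name>.json
        if len(parts) >= 3 and parts[0] == "resources" and name.endswith(".json"):
            counts[parts[1]] += 1
    return counts
```

To use it on the file downloaded above, open the archive with `tarfile.open("/tmp/my-backup.tar.gz")` and pass the resulting object to the function.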

Performance Note

Backup speed is bound by API server list throughput and object store upload bandwidth. For large clusters (10k+ objects), the list phase dominates.

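A backup is not an atomic snapshot of the cluster. Each resource type is captured as of the moment it is listed, so objects created or changed after their type has been listed may be missing from the backup.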
