How to Set Up a Persistent Environment on a Cloud GPU (and Keep It When You Switch Providers)

Team Aquanode

Ansh Saxena

JULY 3, 2026

Forty to ninety minutes. That is how long a normal ComfyUI or PyTorch setup takes to rebuild from scratch on a fresh cloud GPU, every single time the box goes away. I build the infrastructure underneath this stuff, not the models on top of it, but I spent a week reading r/LocalLLaMA, r/comfyui, and a stack of provider docs, and the same complaint keeps surfacing. People rent a GPU, burn the first hour reinstalling everything, work for a couple of hours, kill the box, and lose all of it. Then they do it again tomorrow. A persistent environment on a cloud GPU is supposed to end that loop. Most setups only end half of it.

TL;DR: A cloud GPU is stateless by design, so anything you install on the box disappears when it is terminated. The common fix is a persistent volume, which survives restarts but stays locked to one provider's datacenter and keeps billing while the GPU is off. To keep a persistent environment across providers, you have to snapshot the whole box (code, packages, and models) and restore it wherever you rent next.

Why does your cloud GPU environment disappear every session?

Because the disk your environment lives on is ephemeral on purpose. On RunPod, Vast.ai, and most marketplaces, the container disk is wiped the moment a pod is terminated, and only a separately attached network volume persists independently of the compute (Markaicode's writeup on RunPod's model paths lays out exactly which disk survives what). Your models, your virtualenv, and your custom nodes all sit on the disk that gets erased.

This is where the reinstall tax comes from. It is not one big cost; it is a small tax you pay on every cold start: clone the repo, re-download 6 to 24GB per checkpoint from Hugging Face or Civitai, re-clone custom nodes, pin them to the commits that do not throw red errors, resolve dependencies, restart. There is an entire genre of "one-command ComfyUI setup" scripts (Prompting Pixels ships a generator for exactly this), and the only reason that genre exists is that the base experience deletes your environment on every teardown. One practitioner put the real cost plainly:

"The single biggest time sink in self-hosting is this: every time you start a new GPU instance, you have to reinstall ComfyUI, download every model, install every custom node, and restore your config. Doing this manually takes 30 to 90 minutes per instance. Do it weekly and you have lost a workday a month." (Ricardo Ghekiere, DEV Community)

A workday a month, for a single person. Teams report losing multiple engineer-weeks per quarter to the same pattern before they automate it.
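To put numbers on that quote, here is a minimal sketch of the arithmetic. The function name and the 52/12 weeks-per-month factor are my own choices for illustration; the 30 and 90 minute figures come from the quote above.

```python
# Rough monthly cost of rebuilding an environment by hand.
WEEKS_PER_MONTH = 52 / 12  # about 4.33


def monthly_reinstall_hours(minutes_per_rebuild: float, rebuilds_per_week: float) -> float:
    """Hours per month spent on rebuilds at the given pace."""
    return minutes_per_rebuild * rebuilds_per_week * WEEKS_PER_MONTH / 60


low = monthly_reinstall_hours(30, 1)   # fastest rebuild from the quote, once a week
high = monthly_reinstall_hours(90, 1)  # slowest rebuild from the quote, once a week
print(f"{low:.1f} to {high:.1f} hours per month")
```

At one rebuild a week this comes to roughly 2 to 6.5 hours a month, so the "workday a month" figure sits at the upper end of the range. Rebuild daily and the number scales linearly.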

What a persistent cloud GPU environment actually needs

Most people treat "persistent" as one thing. It is really three layers, and they fail independently:

  1. Your data and models. The 40 to 90GB of checkpoints, LoRAs, VAEs, and outputs that take the longest to move.
  2. Your environment. The Python packages, the custom nodes pinned to exact commits, the driver and CUDA versions, the config and environment variables. This is the fragile part, because it is a tree of dependencies that only works at specific versions.
  3. Portability. The ability to take layers 1 and 2 to whatever box is cheapest or actually in stock today, on any provider.
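Layer 2 is the easiest one to lose track of, so it helps to write it down. The following is a rough sketch, not a complete inventory: it records the Python version, the installed package pins, and the commit of each git checkout under a custom-nodes directory. The function names and the `/workspace/ComfyUI/custom_nodes` path are assumptions for illustration, and it does not capture the driver, CUDA version, or environment variables.

```python
import json
import platform
import subprocess
from importlib import metadata
from pathlib import Path


def installed_packages() -> list[str]:
    """Return 'name==version' pins for every distribution in this interpreter."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def pinned_commits(nodes_dir: Path) -> dict[str, str]:
    """Map each git checkout directly under nodes_dir to its current commit."""
    commits: dict[str, str] = {}
    if not nodes_dir.is_dir():
        return commits
    for repo in sorted(p for p in nodes_dir.iterdir() if (p / ".git").exists()):
        result = subprocess.run(
            ["git", "-C", str(repo), "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        commits[repo.name] = result.stdout.strip()
    return commits


def capture_environment(nodes_dir: Path) -> dict:
    """Collect the parts of layer 2 that Python can see on its own."""
    return {
        "python": platform.python_version(),
        "packages": installed_packages(),
        "custom_nodes": pinned_commits(nodes_dir),
    }


if __name__ == "__main__":
    # Hypothetical path; point this at wherever your custom nodes live.
    manifest = capture_environment(Path("/workspace/ComfyUI/custom_nodes"))
    print(json.dumps(manifest, indent=2))
```

A manifest like this does not restore anything by itself, but it makes the fragile part visible: you can diff two of them to see which package or node moved between a working box and a broken one.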

Almost every "persistent" setup you will read about solves layer 1, sometimes layer 2, and quietly ignores layer 3. That third layer is the one that decides whether you are locked to a provider or free to chase price and availability. Here is the ladder, from the cheapest fix to the one that actually moves.

Step 1: Put your models and code on a persistent volume

Attach a network volume, mount it at a stable path like /workspace, and keep your models, code, and outputs there instead of on the ephemeral container disk. Most providers mount the volume automatically and it survives pod restarts, so a 70B model that takes 30 minutes to download is already sitting on the volume at the next boot and never has to be fetched again.
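One common way to do this without reconfiguring the application is to keep the real directories on the volume and symlink them into place. A minimal sketch, assuming a Linux box; `link_to_volume` and the directory names in the usage comment are hypothetical, not part of any provider's tooling.

```python
import shutil
from pathlib import Path


def link_to_volume(app_dir: Path, volume_dir: Path, names: list[str]) -> None:
    """Keep each named directory on the volume and symlink it into app_dir."""
    for name in names:
        target = volume_dir / name
        link = app_dir / name
        target.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()  # drop a stale link; it is recreated below
        elif link.is_dir():
            # Move files already on the ephemeral disk onto the volume.
            for item in link.iterdir():
                if not (target / item.name).exists():
                    shutil.move(str(item), str(target / item.name))
            link.rmdir()  # raises OSError if a name clash left files behind
        link.parent.mkdir(parents=True, exist_ok=True)
        link.symlink_to(target, target_is_directory=True)


# Example call (paths are assumptions about your layout):
# link_to_volume(Path("/opt/ComfyUI"), Path("/workspace"), ["models", "output", "custom_nodes"])
```

Running it on every boot is safe because existing links are simply recreated. The application keeps reading its usual paths while the bytes live on the disk that survives a restart.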

This is the right first move and it kills the worst of the re-download tax. But it has two costs that are easy to miss. The volume keeps billing while the GPU is off, usually around $0.10/GB per month, so a 200GB library quietly runs a tab even when you are doing nothing. And the volume is pinned to one datacenter region, so you cannot attach it to a cheaper box on a different provider (Next Diffusion's RunPod tutorial walks through the volume setup and its region lock). You have solved persistence across restarts. You have not solved persistence across providers.
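The idle cost is simple arithmetic, sketched here with the rough $0.10/GB per month rate quoted above as the default. The function name is mine, and your provider's actual rate may differ.

```python
def idle_volume_cost(size_gb: float, usd_per_gb_month: float = 0.10, months: float = 1.0) -> float:
    """Storage charge for a volume that sits attached to nothing."""
    return size_gb * usd_per_gb_month * months


print(f"${idle_volume_cost(200):.2f} per month")           # the 200GB library from above
print(f"${idle_volume_cost(200, months=12):.2f} per year")
```

At that rate the 200GB library costs about $20 a month, or $240 a year, whether or not a GPU is ever attached to it.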

Step 2: Bake the slow, stable layers into an image

The parts of your setup that rarely change (the NVIDIA driver, the CUDA toolkit, your framework wheels, your pinned custom nodes) can be baked into a custom image once and reused. A fresh node spends 15 to 25 minutes on first boot installing drivers and pulling framework wheels; launching from a pre-baked image cuts that to 3 to 6 minutes (E2E Cloud's bake-and-reuse guide has the timing breakdown).

Bake one image per card family and framework version, and re-bake when you upgrade. Where this breaks: saved GPU images can reach hundreds of GB, they incur storage charges for their full size for as long as you keep them, and each image is usually tied to one provider's registry and format. It is a great accelerator for the stable layers. It is not a way to carry your live, mutating environment from box to box.

Step 3: Snapshot the whole box so you can move it

The volume and the image each capture a slice. The thing that actually restores a working environment is the whole box: the code at its exact commit, the virtualenv with its resolved dependencies, the custom nodes, the config, and the model files, all at their original paths. Snapshot all of it, and a restore is not "set up ComfyUI again," it is "put the machine back exactly as it was."

A good snapshot captures state below the level of a single tool, so it does not care whether you are running Stable Diffusion, a fine-tuning job, or a Jupyter notebook. ComfyUI is just the loudest example because its setup is so heavy; the same tax hits anyone with a tuned environment. The reason this matters more than a volume is portability, which is the next question everyone eventually asks.
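To make "restores at the original paths" concrete, here is a deliberately naive file-level sketch using Python's standard `tarfile` module. It is an illustration of the idea only, not how ogre or any provider implements snapshots. It assumes a Linux box, it ignores drivers and running processes, and the `snapshot` and `restore` names are hypothetical. Keep the archive outside the directories being packed.

```python
import os
import tarfile
from pathlib import Path


def snapshot(paths: list[Path], archive: Path) -> None:
    """Pack each path into a gzip tarball, keyed by its absolute path."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            absolute = os.path.abspath(path)  # keep the path as given, symlinks unresolved
            # Store names relative to "/" so a restore can target any root.
            tar.add(absolute, arcname=absolute.lstrip("/"))


def restore(archive: Path, root: Path = Path("/")) -> None:
    """Unpack under root; root="/" puts every file back at its original path."""
    with tarfile.open(archive, "r:gz") as tar:
        # Where extraction filters exist, "tar" keeps symlinks (such as a
        # virtualenv's python link) but still blocks paths escaping root.
        # Only restore archives you created yourself.
        kwargs = {"filter": "tar"} if hasattr(tarfile, "tar_filter") else {}
        tar.extractall(root, **kwargs)


# Example calls (paths are assumptions about your layout):
# snapshot([Path("/workspace/ComfyUI"), Path("/workspace/venv")], Path("/mnt/backup/box.tar.gz"))
# restore(Path("/mnt/backup/box.tar.gz"))
```

Because the archive stores root-relative names, the same file restores to `/` on a new machine or to a scratch directory for inspection. A real tool also has to deal with size, deduplication, and hardware differences between hosts, which this sketch does not attempt.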

Can you move a persistent environment to a different provider?

Not with a network volume. A volume is pinned to one provider's datacenter region, so you cannot attach it to a cheaper or more available box somewhere else. To move a persistent environment across providers, you need a portable snapshot of the whole box that restores by path on any host, independent of the storage backend.

This is the layer the "which provider should I commit to" debate keeps skipping. If your state is portable, the question stops being "which provider do I marry" and becomes "which box is cheapest and in stock right now." Practitioners already treat multi-cloud as a survival strategy rather than an optimization, scanning five to ten providers before a job. Portable state is what makes that scan actually cheap to act on, because switching no longer means rebuilding.
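Once state is portable, the provider scan reduces to a filter and a minimum. A small sketch of that decision follows; the `Offer` type, the provider names, and the prices are all made up for illustration and are not real quotes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu: str
    usd_per_hour: float
    in_stock: bool


def cheapest_available(offers: list[Offer], gpu: str) -> Optional[Offer]:
    """Return the lowest-priced in-stock offer for the given GPU, if any."""
    candidates = [o for o in offers if o.gpu == gpu and o.in_stock]
    return min(candidates, key=lambda o: o.usd_per_hour, default=None)


# Placeholder data: hypothetical providers and invented prices.
offers = [
    Offer("provider-a", "A100-80GB", 1.89, True),
    Offer("provider-b", "A100-80GB", 1.49, False),  # cheapest, but out of stock
    Offer("provider-c", "A100-80GB", 1.64, True),
]

best = cheapest_available(offers, "A100-80GB")
print(best.provider if best else "nothing in stock")  # prints "provider-c"
```

The code is trivial on purpose. The hard part is not picking the box; it is being able to act on the answer without another hour of setup.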

Where each approach breaks

| Approach | Survives restart | Moves across providers | Bills while idle |
| --- | --- | --- | --- |
| Ephemeral disk only | No | No | No |
| Network volume | Yes | No (region-locked) | Yes |
| Baked custom image | Partly (stable layers) | Rarely (registry-tied) | Yes (image storage) |
| Full-box snapshot | Yes | Yes | Only the stored snapshot |

The pattern is clear once it is in a table. Each rung up the ladder solves a layer the one below it missed, and portability is the rung almost nobody ships.

What I'd actually do

For bursty, single-GPU work, use a volume for the models inside a working session so you are not re-downloading mid-project, but do not treat the volume as your source of truth. Treat the whole box as something you can snapshot and restore, so you are never married to one provider's region or one provider's prices. When a host reclaims your machine with no warning, and on cheap spot capacity it will, the recovery should be one restore command, not another hour of setup.

This is the exact gap we are building ogre to close, in the open: an open-source tool that snapshots a full GPU box and restores it on whatever provider you rent next, so your environment is portable instead of region-locked. We are proving the cross-provider restore in public rather than asking you to take the claim on faith, because "git for GPU boxes" only means something if you can watch it move a real environment between providers. If that is a problem you feel every week, that is the one we are trying to end.

About the author

I am Ansh, and I work on the infrastructure layer at Aquanode, the plumbing under the models rather than the models themselves. I have spent the last stretch deep in how CLI-native GPU renters actually work, reading the provider docs and the r/LocalLLaMA and r/comfyui threads where the reinstall tax gets complained about, and building the snapshot layer that would make it go away. I would rather version-control a GPU box than rebuild one by hand.

Sources

  • Ricardo Ghekiere, "ComfyUI Deploy: Choosing Between Self-Host, Serverless, and Managed (2026)". DEV Community
  • Prompting Pixels, "One-command ComfyUI on Cloud GPUs: A Practical, Repeatable Setup". DEV Community
  • Mark, "RunPod ModelNotFoundError: 4 Verified Fixes for Serverless and GPU Pods (2026)". Markaicode
  • "Bake and Reuse a GPU Image". E2E Cloud docs
  • "After Re-downloading 150GB of Models for the Fourth Time, I Learned Something". SynpixCloud
  • "How to run ComfyUI on RunPod". Next Diffusion
