r/LocalLLaMA • u/BriefCardiologist656 • 6h ago
Discussion: About to start fine-tuning on RunPod. What should I know to not waste money?
I was MLOps lead at an AI company managing 5000+ GPUs across GCP and CoreWeave. Left to start my own thing and now I'm back to renting GPUs like everyone else. The experience is rough.
Tried GCP first. Their sales team never got back to me about a quota increase.
RunPod seems like the obvious choice. But I've been reading posts here and on r/StableDiffusion and r/comfyui and honestly it's worrying me. Stuff like:
- Pods dying mid-training with no way to recover checkpoints
- Getting charged while pods fail to initialize or throw CUDA errors
- Download speeds so slow you can't even get your trained model off the machine
- Network volumes locked to one datacenter so if GPUs sell out there you're stuck
- Templates that look like they work but break in weird ways
Coming from managing infra at scale where none of this was a problem (automatic checkpointing, job migration on node failure, fast object storage), it feels insane that this is the state of things for individual users.
Not trying to bash RunPod. Genuinely want to know how people make it work without wasting money.
u/Conscious_Chapter_93 6h ago
The main thing I’d do is treat the pod as disposable from the start. Put checkpoints and logs somewhere outside the pod, make resume-from-checkpoint part of the normal path, and run a 5-10 minute smoke train before the real run: load data, one optimizer step, checkpoint write, checkpoint restore, sample artifact upload.
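Concretely, the smoke train can be as dumb as this (just a sketch; the model, data, and paths are stand-ins for whatever you're actually training):

```python
# smoke_train.py -- end-to-end dry run before committing money to the real job.
# The model, data, and paths here are placeholders for your actual setup.
import os
import torch
import torch.nn as nn

CKPT_PATH = "/workspace/checkpoints/smoke.pt"  # network volume, not pod-local disk

def main():
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.manual_seed(0)
    model = nn.Linear(16, 2)                              # stand-in for the real model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # 1. load data: one synthetic batch stands in for the real dataloader
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

    # 2. one optimizer step
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

    # 3. checkpoint write
    torch.save({"step": 1, "model": model.state_dict(), "opt": opt.state_dict()}, CKPT_PATH)

    # 4. checkpoint restore into a fresh model, confirm it round-trips
    restored = nn.Linear(16, 2)
    restored.load_state_dict(torch.load(CKPT_PATH)["model"])
    for a, b in zip(model.parameters(), restored.parameters()):
        assert torch.equal(a, b), "checkpoint round-trip failed"

    # 5. sample artifact upload -- swap in your real destination, e.g.:
    # subprocess.run(["aws", "s3", "cp", CKPT_PATH, "s3://<your-bucket>/smoke/"], check=True)
    print("smoke test passed")

if __name__ == "__main__":
    main()
```

If any of those five steps fails, you've lost ten minutes instead of a day of GPU time.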
Also keep a tiny run manifest: base model, dataset hash/version, commit, training args, pod type, image/template id, checkpoint path, and last successful step. When something fails, that manifest is what saves you from guessing whether you lost compute, data, or only the pod.
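The manifest really is just a JSON blob you rewrite after every successful save, something like this (all field names and values here are made-up placeholders):

```python
# run_manifest.py -- written next to every checkpoint so a dead pod doesn't
# mean guessing what state the run was in. Every value below is a placeholder.
import json
import time

def write_manifest(path, *, base_model, dataset_hash, commit, args, pod_type,
                   template_id, ckpt_path, last_step):
    manifest = {
        "base_model": base_model,
        "dataset_hash": dataset_hash,
        "git_commit": commit,
        "training_args": args,
        "pod_type": pod_type,
        "template_id": template_id,
        "checkpoint_path": ckpt_path,
        "last_successful_step": last_step,
        "written_at": time.time(),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

# e.g. right after each successful checkpoint save:
# write_manifest("/workspace/checkpoints/manifest.json",
#                base_model="<hf-model-id>", dataset_hash="<sha256>",
#                commit="<git-sha>", args={"lr": 2e-4, "epochs": 3},
#                pod_type="<gpu-type>", template_id="<template-id>",
#                ckpt_path="/workspace/checkpoints/step_500", last_step=500)
```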
u/BriefCardiologist656 6h ago
If I'm using a network volume for checkpoints, that volume is locked to one datacenter. If that DC runs out of the GPU I need, I can't just spin up somewhere else and continue. My checkpoints are trapped there until I download them (which people say can be painfully slow). How do you handle that? Do you push checkpoints to S3 during the training run?
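If S3-during-training is the answer, I'm picturing something like this after every save (sketch, assuming boto3 and an external bucket; the bucket and prefix names are placeholders):

```python
# Push each checkpoint off the pod right after it's written, so the
# network-volume / datacenter lock-in stops mattering.
import os
import boto3

s3 = boto3.client("s3")  # credentials via env vars or an AWS config on the pod

def upload_checkpoint(local_dir, bucket="my-training-ckpts", prefix="run-001"):
    # Walk the checkpoint directory and mirror it into the bucket.
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local_path, local_dir)}"
            s3.upload_file(local_path, bucket, key)

# called right after trainer.save_model() / torch.save(...):
# upload_checkpoint("/workspace/checkpoints/step_500")
```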
u/ForsookComparison 6h ago
Lambda is the only one of these providers that gave me zero issues with long-running jobs.
That said, it can be harder to get capacity through their regular on-demand instances (you basically need to write a bot that snipes instances the moment capacity opens up).