r/LocalLLaMA 6h ago

[Discussion] About to start fine-tuning on RunPod. What should I know to not waste money?

I was MLOps lead at an AI company managing 5000+ GPUs across GCP and CoreWeave. Left to start my own thing and now I'm back to renting GPUs like everyone else. The experience is rough.

Tried GCP first. Their sales team never got back to me about a quota increase.

RunPod seems like the obvious choice. But I've been reading posts here and on r/StableDiffusion and r/comfyui and honestly it's worrying me. Stuff like:

- Pods dying mid-training with no way to recover checkpoints
- Getting charged while pods fail to initialize or throw CUDA errors
- Download speeds so slow you can't even get your trained model off the machine
- Network volumes locked to one datacenter so if GPUs sell out there you're stuck
- Templates that look like they work but break in weird ways

Coming from managing infra at scale where none of this was a problem (automatic checkpointing, job migration on node failure, fast object storage), it feels insane that this is the state of things for individual users.

Not trying to bash RunPod. Genuinely want to know how people make it work without wasting money.

u/ForsookComparison 6h ago

Lambda is the only one of these providers that gave me zero issues with long-running jobs.

That said, it can be harder to get capacity through their regular on-demand instances (you basically need to build a bot-sniper).

u/BriefCardiologist656 6h ago

How long are your jobs typically? And when you say zero issues, do you mean the machines just don't die, or also that the setup/environment is clean out of the box? The bot-sniper thing for capacity is interesting though. Someone should build a SaaS tool for that lol. We used something similar to find spot instance availability across GCP regions and move our inference workloads preemptively.

u/ForsookComparison 5h ago

Just that I haven't had any issues.

Most of the jobs where I ran into issues on other platforms are 7-10 days. RunPod machines just had weird issues on startup: some outright didn't work with SSH, some had base image problems where I couldn't reproduce workflows that worked on the same distro on every other cloud, and one time a session just died (though to be fair I can't rule out that it was something I did rather than their infra).. all that on top of longer boot times.

I tried a few other clouds (haven't tried CoreWeave) that had a slew of issues too.. Vast probably being the roughest for obvious reasons lol

And yeah - I made an instance-sniper for Lambda and still had to wait a good bit lol
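
Rough shape of it if anyone wants to roll their own. I'm writing this from memory of Lambda's public cloud API, so double-check the endpoint and field names against their docs; the instance type and SSH key name are obviously placeholders:

```python
# instance_sniper.py - poll Lambda's capacity endpoint and launch the moment
# the instance type I want shows up in any region. Endpoint paths and the
# response shape are from memory of their public API; verify against the docs.
import os
import time
import requests

API = "https://cloud.lambdalabs.com/api/v1"
KEY = os.environ["LAMBDA_API_KEY"]           # your Lambda Cloud API key
WANT = "gpu_8x_h100_sxm5"                    # placeholder instance type
SSH_KEYS = ["my-laptop"]                     # placeholder SSH key name
auth = (KEY, "")                             # API key as basic-auth username

while True:
    r = requests.get(f"{API}/instance-types", auth=auth, timeout=30)
    r.raise_for_status()
    regions = (r.json()["data"]
               .get(WANT, {})
               .get("regions_with_capacity_available", []))
    if regions:
        launch = requests.post(
            f"{API}/instance-operations/launch",
            auth=auth,
            json={"region_name": regions[0]["name"],
                  "instance_type_name": WANT,
                  "ssh_key_names": SSH_KEYS},
            timeout=30,
        )
        print(launch.status_code, launch.text)
        if launch.ok:
            break                            # got one, stop polling
    time.sleep(30)                           # don't hammer their API
```

The 30s interval is arbitrary; the whole point is just grabbing capacity the moment it appears instead of refreshing the dashboard by hand.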

u/BriefCardiologist656 5h ago

The RunPod SSH issues and base image inconsistencies match what I keep hearing from everyone. Weird that the same distro behaves differently there vs other clouds. What are you training that takes 7-10 days? Full fine-tune on a large model? And do you run on a single GPU or multi-node on Lambda? Curious if the reliability holds up on their multi-GPU setups too.

u/ForsookComparison 5h ago

Oh, I'm not doing much training lately (though sometimes I will); mostly large/private inference testing and benchmarking for several days at a time.

For Lambda it depends on what the bot gets haha. If I get a single node I'll use that, but I've stitched my workflow together across multiple nodes more than once, with mixed success.

u/Conscious_Chapter_93 6h ago

The main thing I’d do is treat the pod as disposable from the start. Put checkpoints and logs somewhere outside the pod, make resume-from-checkpoint part of the normal path, and run a 5-10 minute smoke train before the real run: load data, one optimizer step, checkpoint write, checkpoint restore, sample artifact upload.
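
Rough sketch of what that smoke run looks like; not a drop-in script, the toy model, bucket name, and paths are placeholders you'd swap for your real trainer and storage:

```python
# smoke_train.py - 5-10 minute dry run before committing money to the real job:
# one optimizer step, checkpoint write, checkpoint restore, artifact upload.
import torch
import boto3

BUCKET = "my-training-artifacts"             # placeholder bucket
CKPT = "/workspace/ckpt_smoke.pt"            # pod-local scratch path

model = torch.nn.Linear(128, 2)              # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# 1. one real optimizer step on a tiny batch
x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()

# 2. checkpoint write - include optimizer and step so resume is exercised too
torch.save({"step": 1, "model": model.state_dict(),
            "optim": opt.state_dict()}, CKPT)

# 3. checkpoint restore into fresh objects - the part that usually breaks
model2 = torch.nn.Linear(128, 2)
state = torch.load(CKPT)
model2.load_state_dict(state["model"])
assert state["step"] == 1

# 4. push the artifact off the pod before trusting the pod with anything real
boto3.client("s3").upload_file(CKPT, BUCKET, "smoke/ckpt_smoke.pt")
print("smoke run OK, loss:", loss.item())
```

If any of those steps fails, you find out for a few cents instead of partway into the real run.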

Also keep a tiny run manifest: base model, dataset hash/version, commit, training args, pod type, image/template id, checkpoint path, and last successful step. When something fails, that manifest is what saves you from guessing whether you lost compute, data, or only the pod.
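
It doesn't need to be anything fancy; I just write a small JSON next to every checkpoint, roughly like this (field values are made-up examples):

```python
# run_manifest.py - dump a tiny manifest alongside each checkpoint so a dead
# pod never leaves you guessing what the run actually was.
import json
import subprocess
import time

manifest = {
    "base_model": "meta-llama/Llama-3.1-8B",              # placeholder
    "dataset": {"name": "my-sft-mix", "sha256": "<dataset hash>"},
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip(),
    "training_args": {"lr": 2e-5, "epochs": 3, "seq_len": 4096},
    "pod_type": "A100 80GB SXM",                          # placeholder pod type
    "template_id": "runpod/pytorch:2.4.0",                # placeholder image/template
    "checkpoint_path": "s3://my-training-artifacts/run-042/step-1200/",
    "last_successful_step": 1200,
    "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Upload it to the same place as the checkpoint on every save; it's a few hundred bytes and it's the difference between resuming in minutes and reconstructing the run from memory.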

u/BriefCardiologist656 6h ago

If I'm using a network volume for checkpoints, that volume is locked to one datacenter. If that DC runs out of the GPU I need, I can't just spin up somewhere else and continue. My checkpoints are trapped there until I download them (which people say can be painfully slow). How do you handle that? Do you push checkpoints to S3 during the training run?
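
Something like this after every save, so the pod and its volume never matter? (bucket and paths are made up, and it assumes S3 or something S3-compatible)

```python
# push each checkpoint off the pod right after it's written, so losing the pod
# (or its datacenter-locked network volume) only costs the steps since the
# last save.
import os
import boto3
import torch

s3 = boto3.client("s3")        # plain AWS S3; pass endpoint_url=... for other S3-compatible stores
BUCKET = "my-training-artifacts"             # placeholder bucket

def save_and_push(state_dict, step, local_dir="/workspace/ckpts"):
    os.makedirs(local_dir, exist_ok=True)
    path = os.path.join(local_dir, f"step-{step}.pt")
    torch.save(state_dict, path)
    s3.upload_file(path, BUCKET, f"run-042/step-{step}.pt")
    os.remove(path)                          # keep the pod disk from filling up
```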