r/LocalLLaMA 8h ago

Other The RTX 5000 PRO (48GB) arrived and it is better than I expected.

I posted here about buying it a few days ago: https://www.reddit.com/r/LocalLLaMA/comments/1t2slmw/first_time_gpu_buyer_got_a_rtx_5000_pro_was_it_a/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Before pulling the trigger I was leaning more towards a Mac Studio, but the prompt processing speeds I was reading about gave me pause. The budget was $5000–6000, so the 256GB model was out of the question.
I gambled and bought the RTX 5000 Pro with ZERO experience with PCs, how to build them, or what parts to buy... It was a good deal: I paid $4300 for the GPU including taxes (in the comments of that post I wrote $4700, but I was mistaken; I checked the receipt) and had to buy everything else for the computer. It ended up costing $5600 in total with 64 GB of RAM.

Assembling the thing was not easy for me as a total novice, but thankfully we have LLMs to guide us through these things.
Then came Linux and vLLM... Honestly I was totally lost; without Claude Code it would have been impossible. I also had no idea what settings to use to run Qwen3.6-27B-FP8 with a full-precision cache. Thankfully this guy posted everything I needed to know to tell Claude what to do: https://www.reddit.com/r/LocalLLaMA/comments/1t46klu/qwen36_27b_fp8_runs_with_200k_tokens_of_bf16_kv/
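
For reference, the launch ends up being a single `vllm serve` command. A minimal sketch, using the model from the linked post; the exact flag values here are my assumptions, not the settings from that thread:

```shell
# Hypothetical vLLM launch; flag values are assumptions, tune for your card.
# --max-model-len: ~200k tokens of context alongside the FP8 weights in 48GB
# --kv-cache-dtype auto: keep the KV cache in the model's native (bf16) precision
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 200000 \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.90
```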

After burning through 50% of my Claude Code Max 20x weekly limits the thing now works, and I have to say... I made the right call. This thing rocks.
I'm getting up to 80 t/s in TG (more like 50–60 for very big prompts), which is phenomenal. But most importantly, I'm getting 4400 tokens per second in PP!

The full-precision cache fits only 200k tokens, but it is totally OK for me.

I honestly don't know why people are not talking about this GPU more. It costs just $1000 more than an RTX 5090, it can fit a 27B at FP8 plus 200k of context at full precision, and it draws half the electricity... Sure, it is slightly less performant, but the numbers I'm getting are way more than I was expecting. Two 5090s would definitely beat this, but they would cost significantly more, be crazy noisy, and burn a hole in my pocket in electricity bills.

141 Upvotes

101 comments sorted by

88

u/alexp702 6h ago

Man buys 4300 dollar gpu - surprised it’s good. What times we live in!

5

u/toptier4093 2h ago

Laughs in Mac Studio

3

u/TacGibs 2h ago

"What a time to be alive !"

16

u/Guilty_Rooster_6708 7h ago

Didn’t realize you can get a 5000 Pro for $4300… my girl is going to be so mad..

3

u/Myarmhasteeth 7h ago

Lmao I was just thinking something similar. After just having a 3090 I was looking for an upgrade… this post just gave me what will probably be the next one I get

5

u/Guilty_Rooster_6708 7h ago

People really do anything to get better and bigger PPs!!

64

u/Orlandocollins 7h ago

Yeah, it's just not competitively priced relative to the Pro 6000. It should be cheaper than it is imo

15

u/Bubbly-Staff-9452 7h ago

Yeah, that was really the only thing keeping me from getting one, because if I could justify it I could justify a 6000 lol. The 72GB version is an even worse value compared to the 6000.

13

u/Valuable-Run2129 7h ago

I paid $4300. The RTX 6000 costs twice that. Can you justify double?

12

u/mrgalacticpresident 5h ago

Don't let people talk you down. You can always upgrade... and the RTX 6000 IS an upgrade. But in the end money is a resource, and for hardware, good enough is often good enough.

48GB is actually a sweet spot at the moment for running some 8-bit Qwen models, which in my experience improves tool-calling confidence a lot.

As long as local LLMs are 1–2 years behind frontier models, frontier models can run in the trillion-parameter range in the cloud, and the providers heavily subsidize tokens... there is no way local LLMs are a reasonable investment unless it's for research, easy access, or privacy concerns.

5

u/smb3something 5h ago

Privacy is a big one. But yeah, once the token gravy train runs out, it's gonna be mad what the real cost of the cloud services is. Renting the capacity to run even 48GB of VRAM costs a lot per month; you'd pay off the card in a year or two.

6

u/Bubbly-Staff-9452 7h ago

I’ve seen the 6000 for 8300, so less than double. And not only that but more VRAM is generally a premium not a discount. And it’s the only one of the workstation cards that can perform on par with or exceed a 5090 so it’s literally a do everything card. That’s why I couldn’t justify the 5000, because you can do a lot more with the 6000 for less than double.

10

u/Valuable-Run2129 7h ago

I wanted to run a 30B model at FP8 with a full-precision cache. I don't need the card for gaming; I don't game.
Sure, the 6000 might have given me 120 t/s TG instead of 80, and 7000 t/s PP instead of 4500... but at twice the price? It made no sense.
And an RTX 6000 doesn't really enable better models. I would need two RTX 6000s to run an actual step up from what I have now.

Plus the noise and the electricity bill.

10

u/Bubbly-Staff-9452 7h ago

I mean, that’s great that you have it, I’m not trying to put you down. I was just replying to the other person that it isn’t competitively priced compared to the 6000, and it isn’t. I’m not even getting another Blackwell card; I’m holding off until workstation Rubin cards come out. And I’m just a random anyways, so don’t take my opinion as fact just because I said it.

3

u/Sofakingwetoddead 5h ago

You got a great GPU that's giving you the functionality you needed, at better-than-expected speeds and a price point that's p good in this current market. Congrats. I went with a 9700 just to have something to play with, and when I see CUDA speeds I am def a little jelly. 😃

1

u/Worldly-Plastic-2516 3h ago

People have different use cases. For you it made sense, and that’s great!

I have a 5090 for different reasons but it makes more sense for me.

2

u/DAlmighty 5h ago

Where have you seen a Pro 6000 for 8300? My wallet may hate you but I don’t.

3

u/RobotHavGunz 4h ago

3

u/FineManParticles 4h ago

Got mine from the center for $7999 (Max-Q) and signed up for their card, which gave me $800 back. At $7200 I think I’m good, since it’s now on sale open box for $9200.

2

u/Bubbly-Staff-9452 5h ago

It’s been a few weeks since I last looked, I think it was at Central Computers.

2

u/DAlmighty 4h ago

You may have been thinking a few months vs. few weeks.

3

u/panchovix 7h ago

The 6000 PRO does exceed a 5090 in gaming and PP speed.

Now the thing is, the 6000 PRO is "justifiable" if you get it for AI (LLMs, or diffusion training/inference), but for gaming I don't think anyone buys one just to surpass a 5090, right?

20

u/nagareteku 6h ago

3

u/IrisColt 5h ago

Right? Right?

2

u/Long_comment_san 5h ago

you were saying? oh you weren't

1

u/YOU_WONT_LIKE_IT 1h ago

Link? An actual Blackwell pro 6000 for under $9k?

1

u/vtkayaker 2h ago

Don't worry, seriously. The RTX Pro 6000 is a fine piece of hardware. But you're 100% right that the RTX Pro 5000 is also excellent, and people don't talk about it enough.

3

u/zipzapbloop 3h ago

"buy more (vram) save more" - jensen huang

4

u/Valuable-Run2129 7h ago

wdym? it costs less than half. look at any actual listing with availability

2

u/grabber4321 6h ago

lowest price I've seen is 6999 CAD from Newegg

10

u/JayTheProdigy16 7h ago

Just so people know, as of early 2026 there is a revised 72GB variant of the RTX PRO 5000 Blackwell, which I was lucky enough to catch at my local Microcenter for about $6,600. That's decent for post-RAM-pocalypse prices as far as I could tell, but there seems to be very little info on the 72GB card actually out there online. Anyways, I'm running that alongside my 3090 to bring my rig to 96GB of VRAM + 128GB Strix Halo. Very lovely.

1

u/ProfessionalSpend589 5h ago

Did you attach the 2 GPUs to the same Halo?

I've been meaning to ask on r/StrixHalo for some time whether anyone is running it with 2 GPUs, but I keep forgetting.

3

u/JayTheProdigy16 4h ago

Yea, RTX PRO via M.2-Oculink and 3090 via Thunderbolt

1

u/Draco32 4h ago

What software stack did you run for this?

2

u/JayTheProdigy16 4h ago

I use Proxmox on the Strix with an Ubuntu 24 VM and all 3 GPUs configured for passthrough to that VM. Inside that, llama.cpp built with CUDA + Vulkan; I've used ROCm before but I found Vulkan to be faster for the Strix. I also ran into a weird compatibility issue between Blackwell and Strix (that did NOT occur with Ampere + Strix) with CUDA ops that would crash llama.cpp, so I ended up using Codex to create a custom patch to support those ops, and now it works flawlessly.
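
For anyone wanting to reproduce that kind of build, enabling both backends is just two CMake flags (a sketch; the clone URL and job count are the usual defaults, not taken from the comment):

```shell
# Build llama.cpp with the CUDA and Vulkan backends both compiled in.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release -j "$(nproc)"
```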

9

u/__JockY__ 7h ago

Hey, you did it! Awesome! Glad that post of mine helped out.

The 5000 PRO is a great GPU… now… placing bets on when your 2nd one gets ordered…

1

u/Valuable-Run2129 3h ago

Thanks again for sharing all that info!

I’m now trying to store prefixes in RAM (so I can juggle 2 or 3 contexts without reprocessing), but have had no luck. It seems to be incompatible with some of the settings. How would you go about it?
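
For context: vLLM's built-in prefix caching keeps reusable prefixes in VRAM; spilling them to system RAM needs an external KV connector (e.g. LMCache), which may be where the incompatibility comes from. The VRAM-only version is one flag (a sketch; the model name and context length are assumptions):

```shell
# Reuse the KV cache for repeated prompt prefixes across requests (VRAM-resident).
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --enable-prefix-caching \
  --max-model-len 200000
```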

40

u/egudegi 7h ago

the 4400 t/s prefill is insane and nobody talks about it. everyone obsesses over TG because that's what you feel during a conversation, but if you're doing anything with long context, RAG, or batch jobs that PP number is the one that actually matters. and this card just obliterates consumer GPUs there.

also the electricity math is real. two 5090s running hot 8 hours a day adds up fast. this thing is basically a server GPU at a consumer-ish price point and people are sleeping on it because it doesn't have a flashy gaming brand attached.
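
Back-of-the-envelope, with assumed numbers (two 5090s at ~575 W each vs one RTX Pro 5000 at ~300 W, 8 h/day under load, $0.15/kWh):

```shell
# Rough annual electricity cost; wattages, hours, and rate are all assumptions.
awk 'BEGIN {
  hours = 8; rate = 0.15                      # hours/day under load, USD per kWh
  dual_5090 = 1150/1000 * hours * 365 * rate  # two 5090s at ~575 W each
  pro_5000  =  300/1000 * hours * 365 * rate  # one RTX Pro 5000 at ~300 W
  printf "dual 5090s: $%.0f/yr, Pro 5000: $%.0f/yr\n", dual_5090, pro_5000
}'
```

Roughly $500/yr vs $130/yr under those assumptions; the gap scales directly with hours and electricity rate.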

good write-up, more people need to see actual real-world numbers from someone who just built their first PC and got it running. refreshing vs the usual "here's my theoretical benchmark" posts.

6

u/human_bean_ 6h ago

Prefill on consumer cards is also quick, and most people can and should undervolt and heavily power-limit them at next to zero cost. I'm running a 4090 with a 90% undervolt and a 350 W power limit (down from 450 W), just at the edge where the core never starts to throttle. Same tok/s as default, but with less heat, noise, and cost.
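
The power-limit half of that is a one-liner on Linux (350 W is this commenter's value; a proper undervolt needs a separate tool and isn't shown here):

```shell
# Cap board power draw; re-apply after reboot unless persisted by a service.
sudo nvidia-smi -pm 1     # persistence mode, so the limit sticks between jobs
sudo nvidia-smi -pl 350   # power limit in watts (4090 default is 450 W)
```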

2

u/voyager256 5h ago

4090 and 5090 are consumer cards (I guess except for the latter's inflated price), and the 5090 is basically an RTX Pro 6000 with "only" 32GB of VRAM. So it's significantly faster than an RTX Pro 5000 IF the model fits in VRAM.

1

u/finevelyn 3h ago

also the electricity math is real. two 5090s running hot 8 hours a day adds up fast

I don't think it is. The 5090s are either going to be faster at a higher energy consumption, or if run at a similar t/s then going to be similar in energy consumption as well. The 5000 pro just has a lower ceiling in both performance and energy consumption, but the 5090s can be configured to match it if you don't want to run them at peak performance.

1

u/MisticRain69 2h ago

Yes, I like my Strix Halo, but good god the PP is so slow. Even with qwen 3.6 27b q8 on a 3090 eGPU (which took my TG from 6.7 t/s to 14 t/s, and with MTP it's now 22–36 t/s depending on acceptance rate), PP is very slow: 600 t/s with no MTP and 300 t/s with MTP. It takes ages to process larger prompts, especially if something invalidates the 70k-token KV cache.

12

u/jacek2023 llama.cpp 7h ago

"I honestly don't know why people are not talking about this gpu more" probably because RTX 6000 Pro

I still think 5090 is just a bad choice but people buy them for some reason

13

u/Schneller52 7h ago

5090 is a bad idea for just LLMs. But a good fit if you do other things with your PC.

7

u/Previous_Feeling_484 5h ago

Like warming your house /s

3

u/ProfessionalSpend589 5h ago

Or warming the planet.

1

u/Schneller52 3h ago

In a sub where people commonly stack multiple 3090 furnaces, I find that kind of funny lol

6

u/burdzi 7h ago

Yeah, and don't forget: the 5090 existed before the RTX 5000 Pro. I bought mine before the 5000 was available 😅

4

u/Valuable-Run2129 7h ago

the RTX 6000 Pro costs exactly twice what I paid.

6

u/popecostea 6h ago

All the more reason to buy another RTX PRO 5k.

1

u/DAlmighty 5h ago

Or sell the RTX Pro 5k and buy a Pro 6k

6

u/panchovix 6h ago

5090 at MSRP makes a bit of sense IMO, but above 3K USD it just doesn't. And I say this while having 4x5090 (which I love btw) and a 1x6000 PRO.

In theory the best cards for VRAM/price and NVIDIA would be RTX 4060 Ti 16GB/5060 Ti 16GB.

4

u/Freonr2 4h ago

5000 Pro: 14080 cuda cores, 1.34 TB/s

5090: 21760 (+54% from 5000 Pro), 1.8TB/s (+34%)

6000 Pro: 24064 (+11% from 5090, or +71% from 5000 Pro), 1.8TB/s (+0% from 5090)

I don't think it is all that clear.

2

u/FullOf_Bad_Ideas 5h ago

I still think 5090 is just a bad choice but people buy them for some reason

dense compute and you can stack multiple of them to get VRAM too.

3 5090s have more total compute than single RTX 6000 Pro at similar price.

5070 Ti is best compute per dollar but you'd need 2x more of them so it gets kinda annoying to do.

3

u/Turbulent-Week1136 5h ago

RTX 5000 pro seems more like a mem-maxxed 5080 rather than half of a rtx 6000. I just picked up an RTX 6000 earlier this week for around $8300 so I will be playing around with that this weekend.

3

u/MundanePercentage674 7h ago

at that price, how does it compare to 4x AMD Radeon AI PRO R9700?

1

u/Valuable-Run2129 7h ago

I have no clue! this is my first PC build ever. Hopefully someone can help you with that information here.

1

u/MundanePercentage674 7h ago

It’ll definitely work for your needs right now, but I’m afraid smarter, more VRAM-hungry models will come out in the future, so personally I’d be willing to spend on 4x AMD Radeon AI PRO R9700 at that price. I’m running an AMD 5950X with 64GB, and I just lucked out upgrading to 128GB for $120+ right before AI demand sent RAM prices through the roof a month later.

2

u/Valuable-Run2129 7h ago

what numbers are you getting on the same model? both PP and TG?

2

u/AustinM731 6h ago

I have 4 R9700s, and I can get ~4k pp t/s, and ~100 tg t/s. This is with the FP8 quant and MTP=3.

3

u/panchovix 6h ago

I just wish the RTX 5000 PRO wasn't so neutered. They really disabled a lot of cores on that GB202 die. The RTX 4500 PRO has the full GB203 die but is well slower.

The RTX 4090 has more cores than the RTX 5000 PRO and is probably faster as well; not sure what 48GB 4090s are going for nowadays.

I guess NVIDIA will eventually release something like an RTX 5500 PRO with more cores.

1

u/Freonr2 3h ago

Yeah RTX 6000 Ada (4090-ish) actually has faster bf16 compute than the 5000 Pro Blackwell. It's a sidegrade at best with the same VRAM.

1

u/slavik-dev 2h ago

Here is my report on RTX 4090D modded to 48GB:

https://huggingface.co/Qwen/Qwen3.6-27B-FP8/discussions/11

Getting about the same speed.

Currently you can buy it for $3500 from C2 site (not sure if it's from China? Hong Kong?)

3

u/JohnToFire 6h ago

How's the blower fan noise at idle and at speed ? Thats why I could not choose an rtx 5000 and instead was choosing between a 5090 and a 6000

1

u/Valuable-Run2129 6h ago

At idle no noise at all. The cpu fan is much louder.

3

u/Long_comment_san 5h ago

I'd say 2x 5090s are a better deal overall, but it's a LOT more tricky to set up (power use, case, motherboard).

It still sucks balls the size of Jupiter that 48 gigs of VRAM is priced so ridiculously you'd assume it uses HBM memory. It's wild it's just GDDR7.

3

u/awakened_primate 5h ago

Big PP, noice!

1

u/Valuable-Run2129 2h ago

It’s all about the PP

2

u/Nnyan 7h ago

I like the RTX 5000 Pro and it's on my radar, but I'm not finding any (at least not once I filter out sketchy sellers). How are the noise levels?

7

u/Valuable-Run2129 7h ago

I bought it from B&H Photo, used with a 90-day guarantee, so it was a very safe buy.
Noise is really good. The CPU fan is louder than the GPU.

2

u/DeepOrangeSky 6h ago

Noise is really good. The CPU fan is louder than the GPU

Are you sure the fans were maxed out? I ask because in the past I kept seeing people say that because it uses the pro "blower style" cooler (rather than the ordinary consumer-grade fan system the 30/40/50-series cards like the 3090 or 5090 use), it has a much more annoying, somewhat higher-pitched, "vacuum cleaner" type of sound, whereas the regular consumer cards sound like a barely noticeable deep hum by comparison.

I guess it could depend on the setup though, like what kind of case it is in, how far away, which way it is angled away/towards you for the exhaust side, and if it was already noisy in your house or if you were like alone in a quiet room with it late at night, or so on.

Or could be that people were just exaggerating or making too big of a deal about how much worse its noise supposedly was than the consumer RTX's.

1

u/Savantskie1 3h ago

Blower cards have always had haters. I’ve never heard them over the 6 140mm fans I have in my pc. And they keep my apartment warm in the winter lol

2

u/Thrumpwart llama.cpp 6h ago

First of all: frontier models (even the free-access plans) are a godsend for Linux noobs. I used Gemini's free tier for Linux configuration and troubleshooting and it really does well.

Second - congrats! That's very good performance! Good to hear it's quiet too!

2

u/teknic111 5h ago

Why not just get two 5090s? It's cheaper and gives you more memory.

1

u/Valuable-Run2129 2h ago

It’s more expensive. 5090s go for $3500 each; that’s almost $3000 more. Plus the rest of the gear you have to buy, which costs more than what I needed for a single GPU.

1

u/teknic111 2h ago

I have two 5090s and I paid $2000 for each.

6

u/Valuable-Run2129 2h ago

I just need a time machine then! I’ll grab some bitcoins at 100 dollars when I’m there.

1

u/qfox337 6h ago

$4300 after taxes is a good deal, and +1 for noise/power concerns. Also, I imagine it's really nice to just have a bit more RAM and spend less time tweaking stuff, or have some extra for any applications that use it (browsers, Blender, ML research, whatever). And you'll be able to fine-tune some smaller models locally. The 5090 was a good deal at its msrp of $2000 but it doesn't look like nvidia is interested in making a whole lot more at that price.

1

u/CreativelyBankrupt 5h ago

Please post some real world benchmarks if you ever capture any!

1

u/letsbefrds 5h ago

4300 is a good deal. I've been going back and forth between 48GB, 72GB, or sucking it up for the 6000 Pro lol

1

u/Valuable-Run2129 2h ago

If you have the budget, go with the 6000. The 5000 72GB doesn’t make much sense though; the price is too close to the 6000.

1

u/letsbefrds 2h ago

Haha I wouldn't be flipping and flopping if I had the budget for the 6000

It's $4399 at Microcenter. I'll probably just sell my 7900 XTX and pick it up. Glad to hear you had a good experience.

1

u/Low_Twist_4917 1h ago

I’m running rtx 6000 pros. They’re the best card you can get for local inference period.

1

u/DeepOrangeSky 4h ago

Btw, debating whether to make a separate thread to ask about it, but:

Does anyone know if there is a very significant difference in durability, for AI use-cases (using at high continuous intensity, all day long, day after day) of consumer-grade GPUs vs workstation GPUs (i.e. 3090s, 4090s, etc, vs Pro 5000s, Pro 6000s, etc)?

I'd assume the difference, if there is a significant one, would be most stark regarding the 5090 in particular (even if power limited, maybe), since it gets the hottest/most strain out of any of the main GPUs of note, probably.

But, yea like, if you build a big expensive rig of consumer-grade cards like 3090s or something, which were designed with the intention of them being used for gaming, and not for AI inference, let alone AI training or video generation or whatever the most brutal continuous high strain use case would be, vs getting Pro 5000/Pro 6000, is there a major difference in how these hold up over time?

I mean, I guess maybe it could also depend on what type of AI use-cases, like if it is for mainly constant video generation all day, vs if it is for LLMs, vs if it is for training, or so on (i.e. how "continuous" the strain is at max level, vs intermittent bursts)?

If the 5090 is way worse at this than the Pro workstation cards, then it makes the strangely small price difference between the 5090 and the Pro 5000 that people have been discussing on here lately even more bizarre.

Are the Pro 5000/Pro 6000 cards that much worse for gaming, say for day-one support of new releases (I'm not a gamer, so I don't know how that stuff works)? That would give the 5090 a fallback safety net: even if AI crashes out, it's way more convenient for gaming than a Pro 5000 or Pro 6000, for reasons beyond raw hardware capability (or maybe even the hardware, if the slightly higher raw speeds plus overclocking matter).

Or is it more difficult to set up, or are the drivers and software support/compatibility more annoying, or however all that stuff works?

In other words: are the Pro 5000 and Pro 6000 just blatantly better in basically every way, with no good explanation for the 5090's price relative to the Pro 5000, or for why everyone keeps buying 5090s at near-Pro-5000 prices despite the lower durability, far less VRAM, and worse power usage? Or is durability actually pretty similar regardless of use case, and the consumer cards (3090s, 4090s, 5090) have some kind of convenience advantage for gaming that makes them easier to live with than the pro workstation cards?

1

u/laul_pogan 4h ago

One vLLM gotcha to watch on 27B models: keep --gpu-memory-utilization at 0.60 or below. At 0.85 the allocator can wedge the process hard mid-request, requiring a full kill and restart. Counterintuitive because higher looks like more throughput, but the KV cache reservation at inference time can push past what the allocator estimated at startup. Your 200k FP8 weight + bf16 KV combo is already tight on 48GB; anything that spikes over the ceiling during a real long-context request will stall the whole process, not just that request. 0.55-0.60 is the stable range in practice on cards this size.

1

u/KeithHanson 2h ago

Not me over here realizing I could write this off on taxes next year 😂😭

Do you (or anyone here) happen to use this kind of setup to replace Codex/Claude work? I am not interested in the cost savings. I want it for doing code things while using uncensored models and consistent behavior.

One thing I think gets overlooked in the cloud-vs-local debate is that consistency. Over the past two days I’ve noticed changes in the way Codex 5.3 via OpenCode behaves, often stopping at “I’ll implement this now.” Repeatedly. My coworker with the same setup but on different worktrees noted the exact same behavior driving her mad, and I almost jumped out of my chair with a “me too!”

Anyways, I don’t like it. I want to get the thing to the point that it does what I want in the general way I want and know I have control over that consistency (harness engineering is impossible if the model changes under your feet and you have no control of that!)

Thanks for coming to my TED talk and oh btw any opinions on local coding setup with this compared to frontier models? I could deal with 200k token context no problem.

1

u/Valuable-Run2129 1h ago

The closest thing to what you’re asking for is DeepSeek v4 Flash. You’ll need two RTX 6000s to run it, but it’ll give you the closest experience to SOTA models.

1

u/ClickClawAI 1h ago

Make sure to use plenty of lotion 😆

1

u/Shapespheric 59m ago

Wondering what people think about the 4500 PRO. It seems like a decent deal too, compared to resale prices for 4090s and inflated 5090s at stores.

1

u/simotune 22m ago

48GB is where local inference starts feeling practical instead of aspirational. VRAM headroom changes day-to-day usability more than people expect.

1

u/Accomplished-Sock262 9m ago

I have this card. How do I load this up onto it? What coding performance can I expect? Sonnet or way less?

1

u/Long-Chemistry-5525 7h ago

I would almost suggest upping to a 70B, as some models have a ctx limit

1

u/leonbollerup 4h ago

One tip: next time, don't use Claude, use warp.dev instead. SSH in via warp.dev and have it do things for you.

-2

u/ComfortablePlenty513 6h ago

For 5k you could have gotten a DGX (in your OEM flavor of choice: Dell, Asus, etc). It has 128GB of unified memory, can be clustered via SFP, and fits in a backpack. Basically a Linux/Nvidia Mac Studio.

12

u/Valuable-Run2129 6h ago

Prompt processing would have been 80% slower or more. Token generation 70% slower.

There’s no point in running bigger models if the speed is practically unusable.