
ThreadRippers 2990WX & 2950X on the bench: Just a little bit of history repeating?


I'm the first to admit that I'm a little late to the table with this write-up. The original 2990WX sample arrived whilst I was on leave and was quickly placed into a video rig and sent out for review, meaning I've had to locate another one at a later date. Along with that, I'm honestly a little overwhelmed by how much interest this £1700 workstation-grade CPU has generated with the public in recent weeks, as I really didn't expect this level of attention for a chip at this sort of price point.

I've also approached this with a little trepidation due to earlier testing. As someone noted over on the GS forum, the 2990WX might not prove all that interesting for audio due to the design layout of the cores and the limitations we've seen previously with memory addressing inside studio-based systems. They were certainly right there, as the first generation failed to blow me away, and I still have a number of reservations about the underlying design of this technology that could potentially be amplified by this new release. During the initial testing of the 2990WX this time around, the 1950X replacement also arrived with us in the shape of the 2950X, and given some of the results of the 2990WX I thought throwing it into the mix might prove a handy comparison.

Why bring all this up at all? Well, because everything I discussed back then is still completely relevant. In fact, I'm going to go as far as to suggest that anyone who doesn't understand what I'm referring to at this point should head over to last year's 1950X coverage and bring themselves up to speed before venturing any further.

Back again? Up to speed?

Then I shall begin.

The 2990WX is the new flagship within the AMD consumer range and features a 32 core / 64 thread design. It has a base clock of 3GHz, a maximum twin-core turbo of 4.2GHz and an advised power draw of 250W TDP.

I won’t split hairs. It’s a beast… something I’m sure most people reading this are well aware of given the past week or so’s publicity. 

In fact, for offline rendering, I could close the article right there. If you’re a video editor on this page and don’t happen to care about audio (hello… you might be lost, but welcome regardless) then you should feel secure in picking up one of these right now if you have the resources and the need for more power in your workshop.

But as was proven with the release of the 1950X, the requirements for a smooth running audio PC for a lot of users are largely pinned on how great it is for real-time rendering, which is a whole different ballgame.

In the 1950X article I linked up top, I went into a great deal of detail regarding where the performance holes existed. I found that low latency response was sluggish and resulted in a loss of performance overhead that left it in a less than ideal place for audio-orientated systems. I had a theory that NUMA load optimization aimed at offline workloads was leaving the whole setup poorly suited to real-time workloads like ASIO-based audio handling.

In the weeks following that article, we saw AMD release BIOS updates and application tweaks to try and resolve the NUMA addressing latency I had discussed in the original article, largely to no avail as far as the average audio user was concerned. In AMD’s defence, they were optimizing it further for tasks that didn’t include the sort of demands that real-time audio places upon it, so whilst I understand the improvements were successful in the markets they were designed to help, few of those happened to be audio-centric.  

At the time it was just a theory, but my conclusion was largely that if this behaviour was as integral to the design as I thought it might be, then it would take a whole architecture redesign to reduce the latency that was occurring to levels that would keep us rather demanding pro users happy.

The 2990WX we see here today is not the architecture change that would require. Where the 1950X has 2 dies in one chip, the 2990WX is now running a 4-die configuration, which has the potential to amplify any of those previous design choices. If I was right about hard NUMA being the root of the lag in the first generation, then on paper it looks like we can expect this to only get worse this time around, due to the extra data paths and the potential extra distance the internal data routing might have to cope with.

The 2950X, by comparison, is an update to the older 1950X and maintains 2 functional dies, with tweaks to the chip's performance. Given the similar architecture, I would expect this to perform similarly to the older chip, although it should make gains from the process refinements and tweaks enacted within this newer model. I'll note that all-core overclocking is improved this time around and a stable 4GHz was quick and easy to achieve.

OK, so let’s run through the standard benchmarking and see what’s going on.

2990WX CPU-Z Report
2950X CPU-Z

As normal I've locked both of the chips off at an all-core turbo. As with a lot of these higher core count chips, I've not managed to hit a stable all-core max turbo clock, which would have been 4.2GHz, rather settling for 3.8GHz on the 2990WX and 4GHz on the 2950X, both of which perform fine with air cooling.

I've spoken to our video team about this and they managed to hit a stable 4.1GHz on the 2990WX using a Corsair H100, so it looks like you can eke out a bit more performance if noise is less of a consideration in your environment.

If you're not aware from previous coverage why I do this: when you're running a turbo with a large spread between the maximum and minimum clock speeds, the problem with real-time audio is that when one core falls over, they all fall over. So, whilst you might have 2 cores running at 4.2GHz, the moment one of the cores still running at 3.2GHz fails to keep up, the whole lot comes tumbling down with it. Locking the cores off gives you a smoother operating experience overall and I'm always keen to find a stable level of performance like this when doing this sort of testing.
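To put some rough numbers on that, here's a minimal sketch of the idea; the 48kHz sample rate and the per-core processing times are assumed figures purely for illustration, not measurements from these chips. Each ASIO buffer has a fixed deadline, and the callback only makes it out in time if every core finishes its share of the work within that window.

# Toy illustration (not part of the benchmark): the audio callback only
# completes once the slowest core has finished its share of the work,
# so the worst-case core sets the ceiling, not the fastest turbo core.

SAMPLE_RATE = 48_000  # Hz, assumed purely for this example

def buffer_deadline_ms(buffer_size):
    """Time available to process one ASIO buffer before the next is due."""
    return buffer_size / SAMPLE_RATE * 1000

def buffer_survives(per_core_work_ms, buffer_size):
    """The buffer only goes out on time if EVERY core meets the deadline."""
    return max(per_core_work_ms) <= buffer_deadline_ms(buffer_size)

# Two cores boosting to 4.2GHz don't help if one slower core runs long:
work_ms = [0.9, 0.9, 1.5, 1.0]        # hypothetical per-core processing times
print(buffer_deadline_ms(64))          # ~1.33ms of headroom at a 64 buffer
print(buffer_survives(work_ms, 64))    # False - the 1.5ms core sinks the lot
print(buffer_survives(work_ms, 128))   # True - ~2.67ms gives enough slack

With a 64 sample buffer at 48kHz you only have around 1.33ms per cycle, so one straggling core is all it takes, which is exactly why a locked, uniform clock gives a more predictable ceiling than a couple of cores boosting high while the rest sit lower.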

2990WX CPU-Z Benchmark
2950X CPU-Z Benchmark

I don't always remember to run this benchmark, although this time I've made the effort as Geekbench doesn't appear to support this many cores at this point. Handily enough, I did at least run it over the 1950X last time, which returned results of 428 on the single-core and 9209 on the multi-core test at the time.

Given that the 2990WX is pulling twice the performance and physically has twice the number of cores, it all looks to be scaling rather well at this point. The 2950X, on the other hand, sees around a 10%-15% gain on the single and multi-core scores over the previous generation.

Moving onwards and the first test result here is the SGA DAWBench DSP test. 

DAWBench SGA1156 Test – Click To Enlarge

This initial test is very promising, as was the older 1950X testing. Raw performance wise we're talking about it by the bucketload; I really can't stress that enough, with both chips performing well in what is essentially a very CPU-centric test.

At the lowest buffer we see it being exceeded by the older chip, so what is going on there? Well, we're seeing a repeat of the pattern exhibited by the 1950X, where there is an impact on performance at tighter buffers, and it does appear that at the very tightest buffer setting we're seeing some additional inefficiency caused by the additional dies, although this does resolve itself when we move up a buffer setting.

Last time we scaled up from 70% of the load being accessible at a 64 buffer; this time, I imagine due to the extra dies being used, we see the lowest setting corrupting around the 65% load level and then scaling up by 10% every time we double the buffer.

ASIO Buffer    CPU Load
64             65%
128            75%
256            85%
512            95%

As a note, when I pulled that 512 buffer result this time around, it returned 529 instances.

The 2950X, by comparison, returned load handling of around 85% at a 64 buffer, rising to 95% at a 256. That's an improvement on the first look we took at the original 1950X chip, although I'll note I was also seeing this improved handling when I did the 1950X retest a few months ago using the newer SGA1156 test that has replaced the classic DSP test, so this might be down to the change in benchmarks over the last year, or it could also be down to the BIOS-level changes they've made since the original generation launch.

So far, so reasonable. A lot of users, even those with the most demanding of latency requirements, can get away with a 128 buffer on the better audio interfaces, and the performance levels seen at a 128 buffer, at least in this test, are easily the highest single-chip results that I've seen so far.

In fact, knowing we're losing 40% of the overhead on the 2990WX is really frustrating when you understand the sort of performance that we could be seeing otherwise. But even with that in mind, if you wanted to go all out and grab the most powerful option that you can, then wouldn't this still make sense?

Well, that test measures pure CPU performance, and in the 1950X testing the irregularities started to really manifest themselves in the DAWBench Kontakt test, where performance starts to depend equally on the memory addressing side of things.

Normally I would insert a chart here to show how that testing panned out.

But I can’t.

It started off pretty well. I fired it up with a 64 buffer and started adding load to the project. I made it up to around 70% CPU load on the first attempt before the whole project collapsed on me and started to overload. I slackened it off by muting the tracks and took it back down to around 35% load where it stabilised, but from this point onwards I couldn’t take it above 35% without it overloading, not until I restarted the project. 

I then tried again at each buffer setting up to 512 and it repeated the pattern each time.

I proceeded to talk this one through with Vin, the creator of the various DAWBench suites, and a number of other ideas were kicked about, some of which I've dived into further.

One line of thought concerned the fact that I was still using Cubase, and the last 8.5 build specifically, precisely for the reason that C9 has a load balancing problem with high core count CPUs that is currently being worked upon. The older C8.5 build is noted as not having the same issue manifest due to a difference in the engine, and during testing this time Windows itself was showing a fairly balanced load mapped across all of the cores whilst I was looking at the performance meter. Even so, historically, exceeding 32 cores has always been questionable inside many of the DAW clients.

So, to counter this concern, I went and ran the same tests under Reaper and saw much the same result. I could push projects to maybe 65%-70% and then the audio would distort as the chip overloaded, and this wouldn't resolve itself until the sequencer was closed and reloaded.

So what is going on there? If I was to speculate, the NUMA memory addressing is designed to allocate the nearest RAM channel to its nearest physical core and not to use the other RAM channels until a core's local channel is full.

Knowing that, I suspect the outcome here is that it maintains the optimal handling up until that 70% level, and then once it figures out that the local RAM channel is overloaded it starts allocating data on the fly as it sees fit. Reallocating that data to one of the other 3 dies would result in it being buffered and then allocated to the secondary memory location, adding latency when the data is recalled in a later buffer cycle, which would result in audio being lost when the buffer cycle completes before the data can be recalled.
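To make that theory a little more concrete, here's a toy model of it; every figure in it is invented purely to illustrate the shape of the problem, and none of the fetch costs or capacities are measured from the 2990WX.

# Toy model of the theory above: all numbers are hypothetical and only
# exist to show how a project can look fine right up to a threshold and
# then collapse once allocations start spilling onto a remote die.

LOCAL_FETCH_US  = 15    # assumed cost per voice streamed from local RAM
REMOTE_FETCH_US = 60    # assumed cost when the data sits on another die
LOCAL_CAPACITY  = 70    # assumed number of voices the local node holds

def fetch_time_us(voices):
    """Rough per-buffer fetch cost once the local channel overflows."""
    local = min(voices, LOCAL_CAPACITY)
    remote = max(0, voices - LOCAL_CAPACITY)   # overflow lands on a far die
    return local * LOCAL_FETCH_US + remote * REMOTE_FETCH_US

def deadline_us(buffer_size, sample_rate=48_000):
    """Time available before the next ASIO buffer is due."""
    return buffer_size / sample_rate * 1_000_000

for voices in (60, 70, 80, 90):
    spent = fetch_time_us(voices)
    status = "fine" if spent <= deadline_us(64) else "overload"
    print(f"{voices} voices: {spent}us -> {status}")

With these made-up numbers the 64 buffer copes comfortably right up to the local capacity and then blows its deadline the moment the overflow traffic appears, which mirrors the cliff-edge behaviour the Kontakt test exhibited here.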

In short, we’re seeing the same outcome as the first generation 1950X but amplified by the additional resources that now need to be managed.

This way of working is the whole point of hard NUMA addressing and indeed is the optimal design for most workstation workloads where multiple chips (or die clusters in this case) need to be managed. It's a superb way of optimizing many workloads, from database servers through to offline render farms, but for anything requiring super-tight real-time memory allocation handling it remains a poor way of doing things.
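For anyone curious what NUMA management can look like in practice, one long-standing tactic is simply to confine a real-time process to a single die so that its threads, and ideally its allocations, stay local. The sketch below is only an illustration of that idea and not something I've validated against these chips; the core numbering is an assumption about how the OS enumerates the first die's logical processors, so check your own topology before borrowing it.

# Rough sketch: restrict a process to the first die so its threads stay on
# one node. FIRST_DIE_THREADS is an assumption about CPU enumeration, not a
# verified mapping for the 2990WX.
import psutil

FIRST_DIE_THREADS = list(range(0, 16))   # hypothetical: logical CPUs 0-15

def pin_to_first_die(pid):
    """Restrict the given process to the first die's logical processors."""
    proc = psutil.Process(pid)
    proc.cpu_affinity(FIRST_DIE_THREADS)
    return proc.cpu_affinity()

if __name__ == "__main__":
    # Example: pin this script itself; substitute your DAW's PID in practice.
    print(pin_to_first_die(psutil.Process().pid))

Of course, doing that throws away the performance of the other dies, which rather defeats the point of buying a 32 core chip in the first place, so it's a workaround rather than a fix.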

As I've said previously, this is nothing new for anyone who deals with multi-CPU workstations, where NUMA management has been a topic of interest to designers for decades now. There has always been a performance hit for dealing with multiple CPUs in our type of workflow and it's largely why I've always shied away from multiple-chip Xeon based systems, as they too exhibit this to a certain extent.

Much like the first generation 1950X with its 2 dies, we see similar memory addressing latency when we use 2 separate Xeons, and this has always been the case. I would never use 4 of those together in a cluster for this sort of work simply due to that latency, and so the overall outcome with 4 dies being used in this fashion isn't all that surprising.

I also tried retesting with SMT turned off, so it could only access the 32 physical cores, in order to rule out a multi-threading problem. The CPU usage didn't quite double at each buffer, instead settling around the 70% total usage mark, but the total number of usable tracks remained the same, and once again going over this led to the audio collapsing quite rapidly.

So, much like the first generation, the handling of VST instruments, and especially those which are memory heavy, looks like it may not be the best sort of workload for this arrangement. This ultimately remains a shame, especially as one of the other great concerns from last time, heat, has been addressed to quite some degree. Running the 2990WX even with an overclock didn't really see it get much above 70 degrees, and that was on air. Given that the advised TDP here is 250W at stock, rising quickly when overclocked even to the point of doubling the power draw, the temperatures for a core count this huge are rather impressive. I think there is a lot here for Intel to pay attention to in regards to thermals, and the news that the forthcoming i9s are finally going to be soldered again makes a whole load of sense given what we've seen here with the AMD solutions. If anything it's just a shame it took the competition pulling this out of the hat before they took notice of all the requests from their own customers over recent years for it to be brought back.

Still, that's the great thing about a competitive marketplace and very much what we like to see. Going forward I don't really see these performance quirks changing within the Threadripper range, in much the same way that I never expect them to change within the Xeon ecosystem. Both chip ranges are designed for certain tasks and optimized in certain ways, which ultimately makes them largely unsuitable for low latency audio work, no matter how much they excel in other segments.

There is some argument here for users who may not require ultra-tight real-time performance. It's been brought to my attention in the past that users such as mastering engineers could have a lot of scope for using the performance available here, and if they are doing video production work too, well, that only strengthens the argument.

On paper that all makes sense and although I haven't tested along those lines specifically, the results seem to indicate that even the trickiest of loads for these CPUs stabilise at a 512 buffer and above, with 80%+ of the CPU being accessible even in the worst case scenario. I have to wonder how it would stand up in mixed media scenarios, although I would hope that ultimately in any situation where you render offline you should be able to leverage the maximum overhead from these chips.

I suspect the other upshot of this testing might be one of revisiting the total CPU core count that each DAW package can access these days. The last time I did a group test was about half a decade ago and certainly all the packages look to have upped their game since then. Even so, I doubt anyone working on a sequencer engine even 3 years ago would have envisioned a core count such as the one offered by the 2950X here, let alone the monstrous core count found in the 2990WX.

AMD's Zen core IPC gains this generation, as we've already seen with the Ryzen refresh earlier in the year, were around the 12% mark, and that looks to have translated faithfully into the Threadripper series with the 2950X model. One of AMD's big shouting points at launch was just how scalable the Zen package is simply by upping the die count, and that's clear from the raw performance offered by the 2990WX; they really have proven just how effective this platform can be when dealing with the workloads it's designed for.

One day I just hope they manage to find a way of making it applicable to the more demanding of us studio users too.