Another month and another chip round-up, with them still coming thick and fast and hitting the shelves at an almost unprecedented rate.
AMD’s Ryzen range arrived with us towards the end of Q1 this year, and its impact upon the wider market sent shockwaves through the computer industry for the first time in well over a decade for AMD.
Although well received at launch, the Ryzen platform did have the sort of early teething problems that you would expect from any first generation implementation of a new chipset range. Its strength was that it was great for any software that could effectively leverage the processing performance on offer across the multitude of cores being made available. Whilst perfect for a great many tasks across any number of market segments, the platform also had inherent weaknesses which would crop up in various scenarios, and one field where its design limitations were apparent was real-time audio.
Getting to the core of the problem.
The one bit of well meaning advice that drives system builders up the wall is the “clocks over cores” wisdom that has been offered up by DAW software firms since what feels like the dawn of time. It’s a double edged sword in that it tries to simplify a complicated issue without ever explaining why or in what situations it truly matters.
To give a bit of crucial background information as to why this might be, we need to start from the point of view that your DAW software is pretty lousy at parallelization.
That’s it, the dirty secret. The one thing computers are good at is breaking down complex chains of data for quick and easy processing, except in this instance, not so much.
Audio works with real-time buffers. Your ASIO drivers have those 64/128/256 buffer settings which are nothing more than chunks of time where the data is captured entering the system and held in a buffer until it is full, before being passed over to the CPU to do its magic and get the work done.
If the workload is processed before the next buffer is full then life is great and everything is working as intended. If however the buffer becomes full prior to the previous batch of information being dealt with, then data is lost and this translates to your ears as clicks and pops in the audio.
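To put some rough numbers on that deadline, here is a minimal sketch of how long each buffer setting gives the CPU to get its work done. The 44.1kHz sample rate is my assumption for illustration, as are the function names; your own session may well run at a different rate.

```python
# Assumed sample rate for illustration only; the text above does not specify one.
SAMPLE_RATE_HZ = 44_100


def buffer_duration_ms(buffer_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Time it takes to fill one buffer, in milliseconds."""
    return buffer_samples / sample_rate_hz * 1000.0


def meets_deadline(processing_ms: float, buffer_samples: int,
                   sample_rate_hz: int = SAMPLE_RATE_HZ) -> bool:
    """True if the work finishes before the next buffer is full (no clicks or pops)."""
    return processing_ms <= buffer_duration_ms(buffer_samples, sample_rate_hz)


# Show the time budget for the common buffer settings.
for size in (64, 128, 256):
    print(f"{size:>4} samples -> {buffer_duration_ms(size):.2f} ms per buffer")
```

The smaller the buffer, the less time there is to finish the job, which is why low latency settings are so much harder on the system.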
Now with a single core system, this is straightforward. Say you’re working with 1 track of audio to process with some effects. The whole track would be sent to the CPU, the CPU processes the chain and spits out some audio for you to hear.
So far so easy.
Now say you have 2 or 3 tracks of audio and 1 core. These tracks will be processed on the available core one at a time and assuming all the tracks in the pile are processed prior to the buffer reset then we’re still good. In this instance by having a faster core to work on, more of these chains can be processed within the buffer time that has been allocated and more speed certainly means more processing being done in this example.
So now we consider systems with 2 or more cores. The channel chains are passed to the cores as they become available and once more the whole channel chain is processed on a single core.
This is because splitting a channel over more than one core would require us to divide up the workload and then recombine it all again post processing, which for real-time audio would leave other components in the chain waiting for the data to be shuttled back and forth between the cores. All this lag means we’d lose processing cycles as that data is ferried about, and we’d continue to lose more performance with each and every added core, something I will often refer to as processing overhead.
Now the upshot of this is that lower clocked chips can often be less efficient than higher clocked chips, especially with newer, more demanding software.
So, just for an admittedly extreme example, say that you have the two following chips.
CPU 1 has 12 cores running at 2GHz
CPU 2 has 4 cores running at 4GHz
The maths looks simple, 2 X 12 beats 4 X 4 on paper, but in this situation, it comes down to software and processing chain complexity. If you have a particularly demanding plugin chain that is capable of overloading one of those 2GHz CPU cores, then the resulting glitching will proceed to ruin the output from the other 11 cores.
In this situation the more overhead you have to play with overall on each core, the less chance there is that an overly demanding plugin is going to be able to sink the lot.
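The two example chips can be run through a toy model to show the effect. This is only a sketch under stated assumptions: the chain costs, the 5.8ms deadline (roughly a 256 buffer at 44.1kHz) and the greedy "largest chain to the least loaded core" assignment are all invented for illustration, and no real DAW is claimed to schedule its channels this way. The one rule taken from the discussion above is that a whole channel chain stays on a single core.

```python
# Hypothetical workload: one heavy plugin chain plus eleven light ones.
# Costs are in millions of CPU cycles per buffer (made-up figures).
CHAINS = [16.0] + [2.0] * 11

# Assumed deadline: roughly a 256 sample buffer at 44.1kHz.
DEADLINE_MS = 5.8


def core_times_ms(chain_costs, cores, clock_ghz):
    """Assign whole chains to cores (largest first, least loaded core wins)
    and return the processing time per core in milliseconds."""
    loads = [0.0] * cores
    for cost in sorted(chain_costs, reverse=True):
        idx = loads.index(min(loads))  # least loaded core so far
        loads[idx] += cost             # a chain is never split across cores
    # 1 GHz works through 1 million cycles per millisecond.
    return [load / clock_ghz for load in loads]


def glitches(chain_costs, cores, clock_ghz, deadline_ms):
    """True if any single core misses the buffer deadline."""
    return max(core_times_ms(chain_costs, cores, clock_ghz)) > deadline_ms


print("12 cores @ 2GHz glitches:", glitches(CHAINS, 12, 2.0, DEADLINE_MS))
print(" 4 cores @ 4GHz glitches:", glitches(CHAINS, 4, 4.0, DEADLINE_MS))
```

With these made-up figures the heavy chain needs 8ms on a 2GHz core but only 4ms on a 4GHz core, so the 12 core chip misses the deadline despite having most of its cores sat nearly idle.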
This is also one of the reasons we tend to steer clear of single server CPUs with high core counts and low clock speeds, and it is largely what the general advice is referring to.
On the other hand when we talk about 4 core CPUs at 4GHz vs 8 core CPUs at 3.5GHz, in this example the difference between them in clock speeds isn’t going to be enough to cause problems with even the busiest of chains, and once that is the case then more cores on a single chip tend to become more attractive propositions as far as getting out the best performance is concerned.
So with that covered, we’ll quickly cover the other problematic issue with working with server chips which is the data exchange process between memory banks.
Dual chip systems are capable of offering the ultimate levels of performance, this much is true, but we have to remember that returns on your investment diminish quickly as we move through the models.
Not only do we have the concerns outlined above about cores and clocks, but when you move to dealing with more than one CPU you have to start to consider “NUMA” (Non-uniform memory access) overheads caused by using multiple processors.
CPUs can exchange data between themselves via high-speed connections and in AMD’s case, this is done via an extension to the Infinity Fabric design that allows the quick exchange of data between the cores both on and off the chip(s). The memory holds data until it’s needed, and in order to ensure the best performance from a CPU the system tries to store that data on the physical RAM stick nearest to the physical core using it. By keeping the distance between them as short as possible, it ensures the least amount of lag between information being requested and it being received.
This is fine when dealing with 1 CPU and in the event that a bank of RAM is full, then moving and rebalancing the data across other memory banks isn’t going to add too much lag to the data being retrieved. However when you add a second CPU to the setup and an additional set of memory banks, then you suddenly find yourself trying to manage the data being sent and called between the chips as well as the memory banks attached. In this instance when a RAM bank is full then it might end up bouncing the data to free space on a bank connected to the other CPU in the system, meaning the data may have to travel that much further across the board when being accessed.
As we discussed in the previous section, any wait for data to be called can cause inefficiencies where the CPU has to wait for the data to arrive. All this happens in microseconds, but if this ends up happening hundreds of thousands of times every second our ASIO meter ends up looking like it’s overloading due to lagged data being dropped everywhere, whilst our CPU performance meter may look like it’s only being half used at the same time.
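A quick back-of-the-envelope sketch shows how those tiny waits add up. The 1 microsecond wait and the 300,000 stalls per second are hypothetical figures picked purely to illustrate the arithmetic, not measurements from any system, and the function names are my own.

```python
def stalled_fraction(stalls_per_second: float, wait_us: float) -> float:
    """Fraction of every second a core spends waiting on data rather than working."""
    return stalls_per_second * wait_us / 1_000_000.0


def usable_ms(buffer_ms: float, stalls_per_second: float, wait_us: float) -> float:
    """Time left in one buffer for real processing once the stalls are paid for."""
    return buffer_ms * (1.0 - stalled_fraction(stalls_per_second, wait_us))


# Hypothetical figures: 300,000 stalls a second, each costing 1 microsecond.
fraction = stalled_fraction(300_000, 1.0)
print(f"Time lost to waiting: {fraction:.0%}")
print(f"Usable time in a 5.8 ms buffer: {usable_ms(5.8, 300_000, 1.0):.2f} ms")
```

The core isn’t doing useful work while it waits, so the deadline gets missed even though the CPU meter never shows the chip as being fully loaded.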
This means that we do tend to expect there to be an overhead when dealing with dual chip systems. Exactly how much depends entirely on what’s being run on each channel and how much data is being exchanged internally between those chips, but the take home is that we expect to have to pay a lot more for server grade solutions that can match the high-end enthusiast class chips that we see in the consumer market, at least when it comes to situations where real-time related workloads are crucial, like dealing with ASIO based audio. It’s a completely different scenario when you deal with another task like off line rendering for video, where the processor and RAM are being system managed on their own time and working to their own rules. Server grade CPU options here make a lot of sense and are very, very efficient.
To server and protect
So why all the server background when we’re looking at desktop chips today? Indeed Threadripper has been positioned as AMD’s answer to Intel’s enthusiast range of chips and largely a direct response to the i7 and i9 7800X, 7820X and 7900X chips that launched just last month with AMD’s Epyc server grade chips still sat in waiting.
An early de-lidding of the Threadripper series chips quickly showed us that the basis of the new chips is two Zen CPUs connected together. The “Infinity Fabric” core interconnect design makes it easy for them to add more cores and expand these chips up through the range; indeed their server solution EPYC is based on the same “Zen” building blocks at its heart as both Ryzen and Threadripper, with just more cores piled in there.
Knowing this before testing gave me certain expectations going in that I wanted to examine. The first was Ryzen’s previously inefficient core handling when dealing with low latency workloads, where we established in the earlier coverage that the efficiency of the processor at lower buffer settings would suffer.
This I suspected was an example of data transference lag between cores, and at the time of that last look we weren’t certain how constant this might have proven to be across the range. Without having more experience of the platform we didn’t know if this was something inherent to the design or if perhaps it might be solved in a later update. As we’ve seen since its launch, and having checked over other CPUs in testing, this performance scaling seems to be a constant across all the chips we’ve seen so far and something that can certainly be consistently replicated.
Given that it’s a known constant to us now in how it behaves, we’re happy that there aren’t further hidden underlying concerns here. If the CPU performs as you require at the buffer setting that you need it to handle, then that is more than good enough for most end users. It balances out around the 192 buffer level on Ryzen, where we see 95% of the CPU power being leveraged, which means that for plenty of users who don’t have the same concerns with low latency performance, such as those mastering guys who work at higher buffer settings, this could still be a good fit in the studio.
However knowing about this constant performance response at certain buffer settings made me wonder if this would carry across to Threadripper. The announcement that this was going to be 2 CPU’s connected together on one chip then raised my concerns that this was going to experience the same sort of problems that we see with Xeon server chips as we’d take a further performance hit through NUMA overheads.
So with all that in mind, on with the benchmarks…
On your marks
I took a look at the two Threadripper CPU’s available to us at launch.
The flagship 1950X features 16 cores and a total of 32 threads and has a base clock of 3.4GHz and a potential turbo of 4GHz.
Along with that I also took a look at the 1920X, a 12 core chip with 24 threads which has a base clock speed of 3.5GHz and an advised potential turbo clock of 4GHz.
First impressions weren’t too dissimilar to when we looked at the Intel i9 launch last month. These chips have a reported 180W TDP at stock settings placing them above the i9 7900X with its purported 140W TDP.
Also much like the i9’s we’ve seen previously it fast became apparent that as soon as you start placing these chips under stressful loads you can expect that power usage to scale up quickly, which is something you need to keep in mind with either platform where the real term power usage can rapidly increase when a machine is being pushed heavily.
History shows us that every time a CPU war starts, the first casualty is often your system temperatures, as the easiest way to increase a CPU’s performance quickly is to simply ramp the clock speeds, although this will often also cause a disproportionate amount of extra heat to be dumped into the system. We’ve seen a lot of discussion in recent years about the “improve and refine” product cycles with CPUs, where a new tech in the shape of a die shrink is introduced and then refined over the next generation or two as temperatures and power usage are reduced again, before starting the whole cycle again.
What this means is that with the first generation of any CPU we don’t always expect a huge overclock out of it, and this is certainly the case here. Once again for contrast, the 1950X, much like the i9 7900X, is running hot enough at stock clock settings that even with a great cooler it’s struggling to reach the limit of its advised potential overclock.
Running with a Corsair H110i cooler, the chip only seems to hold a stable clock around the 3.7GHz level without any problems. The board itself ships with a default 4GHz setting which when tried would reset the system whilst running the relatively lightweight Geekbench test routine. I tried to set up a working overclock around that level, but the P-states would quickly throttle me back once it went above 3.8GHz, leaving me to fall back to the 3.7GHz point. This is technically an overclock from the base clock but doesn’t meet the suggested turbo max of 4GHz, so the take home is that you should make sure that you invest in great cooling when working with one of these chips.
Speaking of Geekbench, it’s time to break that one out.
I must admit to having expected more from the multi-core score, especially on the 1950X, even to the point of double checking the results a number of times. I did take a look at the published results on launch day and I saw that my own scores were pretty much in line with the other results there at the time. Even now a few days later they still appear to be within 10% of the best published results for the chip, which says to me that some people do look to have got a bit of an overclock going on with their new setups, but we’re certainly not going to be seeing anything extreme anytime soon.
When comparing the Geekbench results to other scores from recent chip coverage it’s all largely as we’d expect with the single core scores. It’s a welcome improvement over the Ryzen 1700X: they’ve clearly done some fine tuning to the tech under the hood, as the single core score has seen gains of around 10% even whilst running at a slightly slower per core clock.
One thing I will note at this point is that I was running with 3200MHz memory this time around. There were reports after the Ryzen launch that running with higher clocked memory could help improve the performance of the CPUs in some scenarios, and it’s possible that the single core score jump we’re seeing might prove to be down as much to the increase in memory clocks as anything else. A number of people have asked me if this impacts audio performance at all, and I’ve done some testing with the production run 1800Xs and 1700Xs in the months since but haven’t seen any benefits to raising the memory clock speeds for real time audio handling.
We did suspect this would be the outcome as we headed into testing, as memory for audio has been faster than it needs to be for a long time now, although admittedly it was great to revisit it once more and make sure. As long as the system RAM is fast enough to deal with that ASIO buffer, then raising the memory clock speed isn’t going to improve the audio handling in a measurable fashion.
The multicore results show the new AMD chips slotted in between the current and last generation Intel top end models. Whilst the AMD chips have made solid performance gains over earlier generations, it has still been widely reported that their IPC scores (instructions per clock cycle) are behind the sort of results returned by the Intel chips.
Going back to our earlier discussion about how much code you can action on any given CPU core within an ASIO buffer cycle, the key to this is the IPC capability. The quicker the code can be actioned, the more efficiently your audio gets processed and so the more you can do overall. This is perhaps the biggest source of confusion when people quote “clocks over cores”, as rarely are any two CPUs comparable on clock speeds alone, and a chip that has better IPC performance can often outperform other CPUs with higher quoted clock frequencies but a lower IPC score.
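As a rough illustration of why the quoted clock speed alone doesn’t settle things, per core throughput can be thought of as IPC multiplied by clock speed. The two chips and their IPC figures below are entirely hypothetical, and real world IPC shifts with the workload, so treat this as a sketch of the arithmetic rather than a benchmark.

```python
def per_core_throughput(ipc: float, clock_ghz: float) -> float:
    """Billions of instructions per second a single core can retire."""
    return ipc * clock_ghz


# Hypothetical chips: A has the higher clock, B has the better IPC.
chip_a = per_core_throughput(ipc=1.00, clock_ghz=4.0)
chip_b = per_core_throughput(ipc=1.25, clock_ghz=3.6)

print(f"Chip A: {chip_a:.2f} billion instructions per second")
print(f"Chip B: {chip_b:.2f} billion instructions per second")
print("The lower clocked chip wins:", chip_b > chip_a)
```

With these invented numbers the 3.6GHz chip gets more done per core inside each buffer than the 4GHz one, which is exactly the situation that the “clocks over cores” shorthand glosses over.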
So lengthy explanations aside, we get to the crux of it all.
Much like the Ryzen tests before it, the Threadrippers hold up well in the older DawBench DSP testing run.
Both of the chips show gains over the Intel flagship i9 7900X and given this test uses a single plugin with stacked instances of it and a few channels of audio, what we end up measuring here is raw processor performance by simply stacking them high and letting it get on with it.
There is no disputing here that there is a sizable slice of performance to be had. Much like our previous coverage, however, it starts to show up some performance irregularities when you examine other scenarios such as the more complex Kontakt based test DawBenchVI.
The earlier scaling at low buffer settings is still apparent this time around, although it looks to have been compounded by the hard NUMA addressing that is in place due to the multi-die in one package design that is in use. It once more scales upwards as the buffer is slackened off, but even at the 512 buffer setting which I tested, it could only achieve 90% of CPU use under load.
That, to be fair to it, is very much what I would expect from any server CPU based system. In fact, just on its own, the memory addressing here seems pretty capable when compared to some of the other options I’ve seen over the years. It’s just a shame that the other performance response amplifies the symptoms when the system is stressed.
AMD, to their credit, are perfectly aware of the pitfalls of trying to market what is essentially a server CPU setup to an enthusiast market. Their Windows overclocking tool has various options to set up some control and optimize how it deals with NUMA and memory addressing, as you can see below.
I did have a fiddle around with some of the settings here and the creators mode did give me some marginal gains over the other options thanks to it appearing to arrange the memory in a well organized and easy to address logical group, but ultimately the performance dips we’re seeing are down to a physical addressing issue, in that data has to be moved from X to Y in a given time frame and no amount of software magic will be able to resolve this for us I suspect.
I think the verdict is pretty straightforward if you need to be running below a 256 ASIO buffer, where the low latency scaling counts against it, although there are certainly some arguments in its favour for mastering guys who don’t need that sort of response.
Much like the Intel i9s before it, however, there is a strong suggestion that you really do need to consider your cooling carefully here. The normal low noise high-end air coolers that I tend to favour for testing were largely overwhelmed once I placed these on the bench, and once the heat started to climb the water cooler I was using had both fans screaming.
Older readers with long memories might have a clear recollection of the CPU wars that gave us P4s, Prescotts, Athlon FXs and 64s. We saw both of these firms in a CPU arms race that only really ended when the i7s arrived with the X58 chipset. Over the years this took place we saw ever rising clock speeds, a rapid release schedule of CPUs and constant gains, although at the cost of heat and ultimately noise levels. In the years since we’ve had refinement and a vast reduction of heat and noise, but little in the way of performance advancements, at least over the last 5 or 6 generations.
We finally have some really great choices from both firms, and depending on your exact needs and the price point you’re working at there could be arguments in each direction. Personally, I wouldn’t consider server class chips to be the ultimate solution in the studio from either firm currently, not unless you’re prepared to spend the sort of money that the tag “ultimate” tends to reflect, in which case you really won’t get anything better.
In this instance, if you’re doing a load of multimedia work alongside mastering for audio, this platform could fit your requirements well, but for writing and editing some music I’d be looking towards one of the other better value solutions unless this happens to fit your niche.