So, from a technical perspective in terms of utilising VRAM, is spanning multiple cards also a bit of a hardware bottleneck?
You know what I'm going to say, don't you? 🤣 "It depends".
For the purposes of this reply, I'll talk about GPUs we can relate to (as opposed to data centre GPUs). I'm also talking about CUDA compute cores here as opposed to the likes of Tensor cores (or equivalent).
There are a number of factors that ultimately determine the performance of a CUDA-based system and workload, and whether or not bottlenecks come into play. It might be something as obvious as the hardware itself; i.e. we know that modern-era GPUs have significant uplifts in compute capability from one generation to the next. With increased performance comes new tech, new features and new solutions. There are no real surprises there - technology continues to march onward as it always has. Other factors include how the GPU(s) has/have been configured. Distribution and dispatching of work on even a single GPU can be incredibly sensitive to setup, a point I will touch upon later. As more GPUs are introduced into the mix, this adds to the complexity in terms of balancing and distributing work optimally across the compute devices.

The actual nature of the data and the calculations on that data can also have a massive impact on performance (yes, I know it sounds like I'm stating the obvious here). For example, in my path tracer, if I'm tracing a light ray for pixel N to try and determine the colour for that pixel, there is no guarantee that pixel N+1 (i.e. an adjacent pixel) is going to take the same amount of time and computing effort to complete. Why not? That's the nature of path tracing. The light path for pixel N might not hit anything and simply head straight out towards infinity, in which case I would simply colour that pixel with a sky colour, or whatever I am using for my background (such as an HDRI). There's hardly any computational effort needed, that thread of work is complete and the pixel colour is known.

However, the light ray for pixel N+1 next door might hit a dielectric surface (such as a glass sphere). At that hit point, due to the nature of the surface, the light ray can be reflected or refracted (as we know, glass is both reflective and refractive), so we have additional work to determine the reflected or refracted direction of the light bounce, along with calculations of other values that are needed so that the light path can continue to be traced. This light path might then be totally internally reflected within the sphere, or it will hit the other side of the sphere (the exit point). Again, the light might be reflected internally or it might be refracted again depending on the indices of refraction of the entry/exit materials (air/glass, for example). The trace then continues until a pre-set number of bounces (hits) is reached (or the path is terminated early at random if using 'Russian Roulette'), or the light ray hits nothing else and shoots off to infinity. At this point, we have our pixel colour, but it has taken on the order of millions more compute instructions to calculate, and a compute time orders of magnitude greater than its neighbouring pixel. For reasons I'll touch on later, this can cause a stall.
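If it helps to picture it, here's a heavily simplified, hypothetical sketch of the idea (a toy kernel written purely for illustration, not code from my actual renderer) - one thread per pixel, a single hard-coded sphere as the whole 'scene', a quick exit for rays that miss, and a loop standing in for all of the extra bounce maths when a ray hits something:

```cuda
#include <cuda_runtime.h>

// Hypothetical, heavily simplified sketch (not code from my actual renderer):
// one thread per pixel, with a single hard-coded sphere as the entire "scene".
__global__ void tracePixels(float3* framebuffer, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Build a camera ray through this pixel (camera at the origin, looking down -z).
    float u = (2.0f * x / width) - 1.0f;
    float v = (2.0f * y / height) - 1.0f;
    float3 dir = make_float3(u, v, -1.0f);
    float invLen = rsqrtf(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    dir = make_float3(dir.x * invLen, dir.y * invLen, dir.z * invLen);

    // Ray/sphere test against a unit sphere at (0, 0, -3).
    float3 oc = make_float3(0.0f, 0.0f, 3.0f);              // ray origin - sphere centre
    float b = oc.x * dir.x + oc.y * dir.y + oc.z * dir.z;
    float c = oc.x * oc.x + oc.y * oc.y + oc.z * oc.z - 1.0f;
    float discriminant = b * b - c;

    float3 colour;
    if (discriminant < 0.0f)
    {
        // Miss: hardly any work at all - a quick sky gradient and the thread is done.
        float t = 0.5f * (dir.y + 1.0f);
        colour = make_float3(1.0f - 0.5f * t, 1.0f - 0.3f * t, 1.0f);
    }
    else
    {
        // Hit: this loop is a stand-in for the reflection/refraction bounce maths -
        // orders of magnitude more instructions than the miss case above.
        float accum = 0.0f;
        for (int i = 0; i < 5000; ++i)
            accum += sinf(b + i * 0.001f) * 0.0002f;
        colour = make_float3(0.8f + accum, 0.8f, 0.9f);
    }
    framebuffer[y * width + x] = colour;
}
```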
Before I go off too far on a tangent (pun intended), if a workload is relatively simple, computationally balanced, and fits on a single GPU, it will typically be quicker than if the work were spread over multiple GPUs. Even though modern CUDA devices can be treated as a single big device (unified memory), there is, of course, work going on under the hood to distribute work across multiple devices and into different memory address spaces. This can add contention on system buses (it's easy to exhaust PCIe lanes on multi-GPU setups). Of course, there's the likes of NVLink, but you are often limited to connecting two (or maybe 4? Can't remember...) devices and for the rest, well, you're a slave to the PCIe bandwidth as that is the only way to get data to/from the cards.
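As an aside, the CUDA runtime will happily tell you whether the cards in a system can talk to each other directly. A minimal sketch (nothing beyond the standard runtime API) that checks peer-to-peer access between every pair of GPUs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // For each pair of GPUs, ask the driver whether one can read the other's
    // memory directly (peer-to-peer over NVLink or PCIe). Where it can't, data
    // has to take the slow route through host memory over the PCIe bus.
    for (int a = 0; a < deviceCount; ++a)
    {
        for (int b = 0; b < deviceCount; ++b)
        {
            if (a == b) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, a, b);
            printf("GPU %d -> GPU %d peer access: %s\n", a, b, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```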
As already alluded to, it's not always quite so straightforward though. What if you can fit and execute your workload on a single GPU, but the computation is incredibly heavy? The computation might be so heavy that bus bandwidth limits and the like aren't really the big issue, because the time taken to process the workload far outweighs everything else - and performance suffers as a result. In these cases, it might be better to split the workload across two GPUs because the overhead of doing so is minimal, yet you now have twice the number of compute units running in parallel. How about splitting it across 4 GPUs? 4 times the compute units working on it now... It's all a balancing act, with many factors trying to upset the apple cart.
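To illustrate how cheap the 'just split it' approach can be when the computation dominates, here's a bare-bones, hypothetical sketch that carves a big buffer into equal slices, one per GPU - real renderers balance things far more carefully than this, of course:

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

__global__ void processChunk(float* data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = data[i] * data[i];   // stand-in for the heavy computation
}

int main()
{
    const int totalItems = 1 << 24;
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) return 0;

    // Naive split: give every GPU an equal slice. The per-slice overhead is tiny,
    // so if the computation itself dominates, the extra cards are almost pure gain.
    int chunk = (totalItems + deviceCount - 1) / deviceCount;
    std::vector<float*> buffers(deviceCount, nullptr);

    for (int d = 0; d < deviceCount; ++d)
    {
        int count = std::min(chunk, totalItems - d * chunk);
        if (count <= 0) break;
        cudaSetDevice(d);
        cudaMalloc(&buffers[d], count * sizeof(float));
        // (Each GPU's slice of the input data would be copied in here.)
        processChunk<<<(count + 255) / 256, 256>>>(buffers[d], count);
    }

    // Wait for every GPU to finish its slice, then tidy up.
    for (int d = 0; d < deviceCount; ++d)
    {
        if (buffers[d] == nullptr) continue;
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(buffers[d]);
    }
    return 0;
}
```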
Older architectures (pre-Pascal, if I remember right) did not have proper hardware support for unified memory, whilst later generations do. You can use newer versions of the CUDA API to treat older hardware as having unified memory, but the API is tricking you and has to perform a whole host of memory paging operations and swaps between GPU (device) and CPU (host) memory across the system bus, and, naturally, this can cause some massive bottlenecks. Things got better with Pascal's on-demand page migration and continue to improve with modern architectures. There is still work going on under the hood, but it's a lot easier to work with and a LOT more performant in comparison.
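For anyone curious, using unified (managed) memory really is only a couple of API calls these days. A minimal sketch - the explicit prefetch is optional on Pascal-or-newer hardware, it just saves a flurry of page faults when the kernel first touches the data:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;

    // Managed (unified) memory: one pointer, usable from both host and device.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // touched on the host first

    // On Pascal and newer, pages migrate on demand; an explicit prefetch just
    // avoids a flurry of page faults when the kernel first touches the data.
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```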
Using my epic man-maths, the 8Pack system would have a total of 168GB of VRAM within it, spread over the seven cards. If you had a complete render including all the assets at say 20GB that could sit upon one card - would that have a clear performance benefit over a 40GB render that had to be split over at least two cards?
Or.... a bit like RAID with multiple drives - does the software see the entire GPU stack as a single device with little concern that there are seven GPUs and that the total memory is a cluster of 7x 24GB 'chunks'?
Yes, the 8Pack system GPU array could be treated as a single big compute device with a unified memory of 168GB of VRAM. That would be quite nice... 🤣
I have already answered the question above in many ways. Depending on the complexity of the render calculations for the light paths (assuming a ray/path tracing scenario here), there may be benefits to running on a single GPU in the array, or it may be better to split across two or even all of the GPUs in the array. In reality, modern rendering software will generally be written to take hold of all the compute power it can get its hands on, and analysis of the dataset is then carried out to determine the optimal way of processing it on the available devices. It's quite a big area and field of research; organising the data sensibly can often yield significant performance gains, as opposed to just throwing more work at more GPUs with little thought.
I mentioned above that compute devices can be sensitive to the data they are processing and the calculations they are carrying out on that data, piece by piece. This is due to how modern GPU compute devices work. I won't go into too much detail here as it could get boring and incredibly lengthy, but I will hopefully give enough information to explain why massively parallel computation can be so finicky. 🤣
The 4090 GPU (let's stick with this GPU) is not the full Ada Lovelace AD102, but still has 128 SMs (streaming multiprocessors) with 16,384 CUDA cores (128 CUDA cores per SM). Each SM contains those 128 CUDA cores, 1 RT core, 4 Tensor cores, and a multitude of other things like texture units and 128KB of combined L1/shared memory (the split of which can be configured depending on the workload). Each SM can have up to 1536 threads resident on it.
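Incidentally, you don't have to memorise those figures - the CUDA runtime will report them for whatever card it finds. A quick sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

    printf("%s\n", prop.name);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```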
When configuring work for the GPU (or GPUs), you work in grids and blocks. It is up to the developer to determine the best strategy in terms of organising the overarching grid layout, within which the blocks are structured. The grid and block data structures can be treated as 1D, 2D or 3D. Each block is assigned a number of threads (making it a thread block). This might help (it depicts a 2D grid of thread blocks, with each block having a 3D structure of threads within it):
It is up to the developer to determine the best arrangement of the above in order to get the best performance from the available hardware. It can be quite a tricky and complex thing to get right.
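To give a flavour of what that arrangement looks like in code, here's a minimal, made-up example using a 2D grid of 2D blocks to cover an image - the 16x16 block shape is purely a choice I've made for the illustration:

```cuda
#include <cuda_runtime.h>

// A made-up example kernel: each thread works out which pixel it owns from its
// block and thread indices, then writes a colour to it.
__global__ void shadePixel(uchar4* image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    image[y * width + x] = make_uchar4((unsigned char)(x % 256),
                                       (unsigned char)(y % 256), 0, 255);
}

int main()
{
    const int width = 1920, height = 1080;
    uchar4* image = nullptr;
    cudaMalloc(&image, width * height * sizeof(uchar4));

    // A 2D grid of 2D blocks: 16x16 = 256 threads per block, with enough blocks
    // to cover the whole image. The shape of this layout is entirely the
    // developer's choice.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    shadePixel<<<grid, block>>>(image, width, height);
    cudaDeviceSynchronize();

    cudaFree(image);
    return 0;
}
```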
Let's dig a bit deeper.
Each SM, as already mentioned, can have up to 1536 resident threads. A thread block can have a maximum of 1024 threads. Threads are grouped into warps of 32. A warp is scheduled and controlled by the owning SM and, once scheduled, it executes the same instruction across all 32 of its threads in parallel. That warp remains resident until execution of all the threads in the warp is complete. Do you see a potential issue here? Cast your mind back to my example about pixel N and its neighbour, pixel N+1...
(This is massively simplified so, any CUDA developers reading this, remember I'm trying to keep it simple and give a flavour, ok?) 🤣
If a thread in the warp hits some sort of condition, it might have to go in a different direction to another thread in the same warp. They were executing the same instruction in parallel, but now they need to go their separate ways. Pixel N follows one line of rendering logic; pixel N+1 takes a different code path because a whole lot of different computation is required. This is referred to as thread divergence. It means the threads in the warp can no longer all run the same instruction in parallel, so the divergent code paths are executed one after the other, with the threads not on the currently-executing path masked off (effectively put to sleep) until the paths reconverge.

That can be a problem. 31 of the threads in the warp might have very little computation and execute exactly the same code as for pixel N. However, that one thread that has diverged causes them all to be put on hold whilst it carries out its incredibly complex (in relative terms) calculation. The other 31 threads have to wait for the pixel N+1 thread to finish and so sit in a wait state. That's a waste, of course. And whilst that warp is resident and executing, further warps might not be able to be scheduled if all of the SM's resources have been exhausted. That's a bottleneck. This can quickly escalate when you consider the sheer number of threads and blocks involved, and it's why GPU compute workloads can be so sensitive to data and configuration.
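In code, the pixel N / pixel N+1 situation boils down to something like this (an artificial kernel I've made up purely to show the shape of the problem):

```cuda
// Artificial example: a single data-dependent branch inside a kernel.
__global__ void divergentWork(const float* input, float* output, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float value = input[i];
    if (value < 0.5f)
    {
        // Cheap path ("pixel N"): a couple of instructions and done.
        output[i] = value * 2.0f;
    }
    else
    {
        // Expensive path ("pixel N+1"). If even one thread in a warp lands here,
        // the two paths are serialised and the threads on the cheap path sit
        // masked off until this finishes.
        float accum = value;
        for (int k = 0; k < 10000; ++k)
            accum = sinf(accum) * cosf(accum) + value;
        output[i] = accum;
    }
}
```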
Of course, there are other reasons too, such as resource contention, where threads might be trying to access a GPU resource but have to wait until another thread has relinquished its hold on it. Memory divergence, divergent data, sync points, latency hiding, occupancy... many factors contribute to potential performance issues, and they are far beyond the scope of this (already too long) narrative!
Occupancy tripped me up on my path tracer recently (due to misconfiguration). A few days ago, I found a bug in my code and, when looking into it, suddenly gained around an 8% performance uplift. Occupancy is a metric that measures the utilisation of the GPU's multiprocessor resources during the execution of a CUDA kernel (code running on the GPU). It's a bit difficult to explain my exact situation, so I'll frame it in the context of the information I've already provided above.
Remember I said that an SM has 1536 threads available? And that the maximum number of threads per block was 1024? Well, let's assume we go balls to the wall and specify the maximum number of threads for our blocks - so our blocks have 1024 threads. Maximum performance, yeah?
Well, not quite...
Remember that an SM has a capacity for 1536 threads, so we can only run a single 1024-thread block per SM. Two blocks don't fit, as that would need 1024 x 2 (= 2048) threads and the SM only supports 1536. So that means our SM has 512 thread slots sat doing nothing. Hmm...
So, how about changing our block size to 512 threads? This can work out much better. The SM can now run three thread blocks (3 x 512 = 1536 threads), and we have utilised all of the SM's available thread capacity.
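You don't even have to do that arithmetic by hand - the runtime has an occupancy API that will report how many blocks of a given size can be resident on an SM. A rough sketch using a trivial stand-in kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data, int n)   // trivial stand-in kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime how many blocks of each size can be resident on one SM,
    // then compare resident threads against the SM's limit (1536 on Ada).
    const int blockSizes[] = { 1024, 512, 256 };
    for (int blockSize : blockSizes)
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
        int residentThreads = blocksPerSM * blockSize;
        printf("Block size %4d: %d block(s) per SM, %4d/%d threads (%.0f%% occupancy)\n",
               blockSize, blocksPerSM, residentThreads, prop.maxThreadsPerMultiProcessor,
               100.0 * residentThreads / prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```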
That was a lot to take in and I've kind of lost my focus but hopefully, it makes some sense. I blame the meds.
I remember the time when I went slightly baller and had two Nvidia GTX 690s in my PC. In theory, that gave me four GPUs with a total of 8GB of VRAM across them. On paper, the performance should have been superb - even the likes of GTAV today, with all the bells and whistles enabled @ 2560x1440, doesn't use 3GB of VRAM according to the options screen.
In reality, however, it was very much hit and miss. Some well-optimised games such as Warframe split the load evenly across all of the GPUs and played well. The majority, however, just hammered GPU0 and only had access to the 2GB of VRAM associated with it. Some games even outright lied with their claims within the options menu. Ashes of the Singularity had a specific tickbox to enable 'multi-GPU processing', which was music to the ears of any SLI or Crossfire system owner. Yet with the hardware monitoring tool running in the background, once again the game was solely using GPU0 and its small amount of VRAM.
Yeah, multi-GPU for gaming never really took off. It looked quite promising for some time but, ultimately, proved too difficult to get right. Developers didn't really have the time or resources to invest (such is the nature of game dev), and getting a balanced, performant workload distribution just wasn't feasible. Gaming workloads don't always map so well onto using GPUs for data computation; game logic is very dynamic - lots of branching, uncertainty, etc.
On the rendering/rasterisation side, the GPUs running in SLI (or equivalent) each needed their own copies of the required data in their local VRAM as well. There was overhead in syncing that up, and then in shuttling data to and fro across the bus to the GPUs, causing stalls and bottlenecks. It was just easier to run on a single GPU and maybe farm a little bit of work out to another GPU on the system; multi-GPU was ultimately too expensive, too time-consuming, and too much of a niche market to be worthwhile.
I assume that a fair amount of coding work is understanding your hardware and getting the most out of it - such as the CUDA requirement you mentioned above?
Exactly!