ClioSport.net

Realtime Graphics / Game Engines



Mr Squashie

CSF Harvester
ClioSport Club Member
  Clio 182
It downgrades it.

They recommend uploading 4K 60fps with a max bitrate of 68 Mbps, so rendering anything over and above that is a waste of time.
Oh I know it still downgrades it, but the bitrate for a 4K YouTube video is higher than for a 1080p one, isn't it?
 

SharkyUK

ClioSport Club Member
I've tried re-uploading the recent video I made - this time in 4K (I had to upscale from the original 1080p lossless source, as trying to path trace a native 4K image and record at the same time is just not happening!). I think the video quality is a little better but it's not great. As Mark says, it's basically due to the low bitrate limits that YouTube imposes to keep file sizes down, made worse by the fact that path-traced visuals don't play well with video encoding algorithms.

 

Mr Squashie

CSF Harvester
ClioSport Club Member
  Clio 182
I've tried re-uploading the recent video I made - this time in 4K (I had to upscale from the original 1080p lossless source, as trying to path trace a native 4K image and record at the same time is just not happening!). I think the video quality is a little better but it's not great. As Mark says, it's basically due to the low bitrate limits that YouTube imposes to keep file sizes down, made worse by the fact that path-traced visuals don't play well with video encoding algorithms.


I know it's still not as good as it looks on your screen, or in still images, but it's certainly a massive improvement over the 1080p video. Thanks for that 👍
I think my computer would melt if it tried to render a frame of that 😂
 

SharkyUK

ClioSport Club Member
A few more renders as I dip in and out of this ongoing project. (y)

52657223071_994fa87ead_h.jpg

SIPT Test Render (Mecabricks) by Andy Eder, on Flickr

52662111470_ae9f1bf8ef_h.jpg

SIPT Test Render (Mecabricks) by Andy Eder, on Flickr

52661669326_1b2d282c5d_h.jpg

SIPT Test Render (Mecabricks) by Andy Eder, on Flickr

52737861492_78322d48a9_h.jpg

SIPT Render by Andy Eder, on Flickr

52738635289_7da4f82f03_h.jpg

SIPT Render by Andy Eder, on Flickr

52775216065_3f5bdae2dd_h.jpg

SIPT Render - Mazda 787B by Andy Eder, on Flickr
 

Clart

ClioSport Club Member
Was going to post this. Looks incredible. Wonder what sort of hardware you’d need to run that?
 

Mr Squashie

CSF Harvester
ClioSport Club Member
  Clio 182
Was going to post this. Looks incredible. Wonder what sort of hardware you’d need to run that?
Probably nothing too extreme; the developer was talking about the possibility of porting it over to current-gen consoles, so presumably an equally powerful PC would run it fine. I don't think he's done anything towards optimisation yet though.
 

SharkyUK

ClioSport Club Member
As much as I watch Unrecord, I'm finding it really difficult to get excited by it. It appears to be using a lot of simple photogrammetry with an excessively oversaturated primary global light source, and there are a few things that don't quite work as I'd expect. Of course, the novel use of "bodycam" filter effects helps mask some of the things whilst giving the look and feel of authenticity (which works well). But graphically, I'm not convinced it's all that, if I'm honest.

EDIT: I'd like to see some other dev videos with perhaps the same environments but with different primary lighting conditions and times of the day (for example).
 

Mr Squashie

CSF Harvester
ClioSport Club Member
  Clio 182
As much as I watch Unrecord, I'm finding it really difficult to get excited by it. It appears to be using a lot of simple photogrammetry with an excessively oversaturated primary global light source, and there are a few things that don't quite work as I'd expect. Of course, the novel use of "bodycam" filter effects helps mask some of the things whilst giving the look and feel of authenticity (which works well). But graphically, I'm not convinced it's all that, if I'm honest.

EDIT: I'd like to see some other dev videos with perhaps the same environments but with different primary lighting conditions and times of the day (for example).
I know exactly what you mean. Everyone is talking about it looking hyper realistic, but really the impressive thing isn't the model details or textures, it's how the dev has absolutely nailed the video look. When you see the camera moving around freely in the environment before the playthrough begins, it looks good but not exceptional; once the gameplay starts and the filters are applied, though, it really does look just like a video. The movement animations are a big part of that, along with the filters.
It's also very different to most games where they try to make things look as they do to the naked eye, whereas this has really tried to capture what things look like when viewed through a camera lens.
 

SharkyUK

ClioSport Club Member
I know exactly what you mean. Everyone is talking about it looking hyper realistic, but really the impressive thing isn't the model details or textures, it's how the dev has absolutely nailed the video look.

Yup, one million percent that mate - it's the presentation of the "video look". It's a long way from hyper-realistic (at least from my admittedly critical eye). I couldn't have said it better myself. :p
 

SharkyUK

ClioSport Club Member
Sharky, have you come across 3D Gaussian Splatting?

Yes mate, although I've not done any work/research in that area specifically. It's similar(ish) to some light probe baking stuff I did way back, where we used spherical harmonics and probe/point captures to capture light samples within a given 3D environment. 3D Gaussian splatting uses a somewhat novel technique to present the image, though: each captured point is represented with a blob (or Gaussian particle) - each blob has a position in 3D space, covariance parameters (how it is stretched and scaled), alpha (how see-through it is), and a colour. The blobs are rendered depth-sorted and can give cool results. As long as the scene is static! Of course, I have massively simplified things here as there is quite a bit more to it than rendering blobs to the screen - such as the point cloud data capture and the "training" that takes the data and synthesises it into something visually meaningful.

It's a very different algorithm and technique to existing rasterisation render pipelines, though - so it's not particularly trivial to shoehorn into existing setups and generally needs a bespoke renderer creating. There are several examples available online, though. It will be interesting to see where it goes in the future.
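
If it helps to picture it, the core data structure and render order boil down to something like this - just a sketch of the general idea, not code from any particular implementation:

Code:
// Heavily simplified sketch of the 3D Gaussian splatting idea (illustrative only).
// Each captured point is a "blob" with a position, a shape (covariance), an alpha and a colour;
// the blobs are depth-sorted and composited back-to-front.
#include <algorithm>
#include <vector>

struct Splat {
    float position[3];    // centre of the Gaussian in 3D space
    float covariance[6];  // symmetric 3x3 covariance - how the blob is stretched/scaled
    float alpha;          // opacity - how see-through the blob is
    float colour[3];      // RGB colour
};

void renderSplats(std::vector<Splat>& splats, const float viewDir[3])
{
    // Depth-sort along the view direction, furthest first (back-to-front compositing).
    std::sort(splats.begin(), splats.end(),
              [&](const Splat& a, const Splat& b) {
                  float da = a.position[0]*viewDir[0] + a.position[1]*viewDir[1] + a.position[2]*viewDir[2];
                  float db = b.position[0]*viewDir[0] + b.position[1]*viewDir[1] + b.position[2]*viewDir[2];
                  return da > db;
              });

    for (const Splat& s : splats) {
        // Project the 3D Gaussian to a 2D screen-space blob and alpha-blend it over the image.
        // (The real work - covariance projection, tile binning, blending - lives here.)
        (void)s;
    }
}

The real implementations do the projection, tile binning and blending on the GPU, but that's the gist of it.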


(y)
 

Robbie Corbett

ClioSport Club Member
Yes mate, although I've not done any work/research in that area specifically. It's similar(ish) to some light probe baking stuff I did way back, where we used spherical harmonics and probe/point captures to capture light samples within a given 3D environment. 3D Gaussian splatting uses a somewhat novel technique to present the image, though: each captured point is represented with a blob (or Gaussian particle) - each blob has a position in 3D space, covariance parameters (how it is stretched and scaled), alpha (how see-through it is), and a colour. The blobs are rendered depth-sorted and can give cool results. As long as the scene is static! Of course, I have massively simplified things here as there is quite a bit more to it than rendering blobs to the screen - such as the point cloud data capture and the "training" that takes the data and synthesises it into something visually meaningful.

It's a very different algorithm and technique to existing rasterisation render pipelines, though - so it's not particularly trivial to shoehorn into existing setups and generally needs a bespoke renderer creating. There are several examples available online, though. It will be interesting to see where it goes in the future.


(y)
Cheers mate, some of the examples on that link are mental.

Seems hot stuff right now with some pretty wild applications.
 

SharkyUK

ClioSport Club Member
Yes, it seems that many people are jumping on the bandwagon and either nicking other people's implementations or writing their own interpretations of the algorithm! :ROFLMAO: It could indeed prove very useful in some industries, definitely. It can very easily fall apart, though - and it has some pretty hefty processing and storage requirements for acquisition and data training. That said, the final result can look stunning (as long as you don't look too closely). Maybe some clever AI coupling down the line will see it able to handle animation/moving scenes and reflections (rather than snapshots in time of static scenes as with the current implementation).
 

Beauvais Motorsport

ClioSport Club Member
Do you do any map creation or use game engines? I love map creation and used the Far Cry editor, but commitments stopped me from continuing or getting into Unreal Engine or the like.

This was something I did a while ago in the Far Cry 5 editor; I was annoyed they didn't include the editor in FC6 :(

2.png
 

SharkyUK

ClioSport Club Member
Do you do any map creation or use game engines? I love map creation and used the Far Cry editor, but commitments stopped me from continuing or getting into Unreal Engine or the like.

Me, mate? Not really, although I am familiar with many of the game engines. My interest is more in the tech/software that drives it, and I've had the "pleasure" (I use the term loosely) of working on quite a few, including Unreal, Unity and CryEngine (amongst others). When I say working, I mean as in working on the underlying code and systems that comprise the engines rather than from the point of view of using them to create content. A good friend, and former studio boss of mine, used to be the director of tech for Crytek's CryEngine. :cool:
 

Beauvais Motorsport

ClioSport Club Member
Me, mate? Not really, although I am familiar with many of the game engines. My interest is more in the tech/software that drives it, and I've had the "pleasure" (I use the term loosely) of working on quite a few, including Unreal, Unity and CryEngine (amongst others). When I say working, I mean as in working on the underlying code and systems that comprise the engines rather than from the point of view of using them to create content. A good friend, and former studio boss of mine, used to be the director of tech for Crytek's CryEngine. :cool:

Very cool!! Perhaps I should have continued with what I was doing re maps/engines and I could have bought a GT4 instead of a Clio :LOL:
 

SharkyUK

ClioSport Club Member
I've not had much time to work on my path tracer research of late but I found the unedited footage of some stuff I did aaaaaages ago. Which I'd forgotten about. So I decided to throw a quick video together to "get it out there" for my vast army of subscribers and followers... 🤣

It's just a few simple animations that have been rendered out using my path tracer. They aren't particularly spectacular; it was more an exercise in implementing a basic animation import into my engine. It is what it is. Whilst it won't be giving Cyberpunk worries over artistic content, it does have some pretty funky stuff going on under the hood. For a "homebrew" project, it has led to some interesting professional discussions, work opportunities and offers to work with the likes of AMD, nVidia and Pixar. Not that I'm bragging or anything. :ROFLMAO:🤐🫣

 

Beauvais Motorsport

ClioSport Club Member
@SharkyUK Do you have any knowledge of the id Tech 3 engine? The Infinity Ward engine was based on id Tech 3, and COD4 (2007) on the PC with the ProMod mod was my favourite FPS of all time. No game I've ever played can compare to the physics of that game with ProMod.
 

Beauvais Motorsport

ClioSport Club Member
Also, if I had to compare playing COD4 ProMod to COD: Modern Warfare (2019), the last COD game I had the GREAT pleasure of playing, I could only describe it as the difference between a rather nice lass coming up to you and doing all the work to get you back to the bedroom, and standing in a fresh steaming pile of XL Bully s**t.
 

SharkyUK

ClioSport Club Member
@SharkyUK Do you have any knowledge of the id Tech 3 engine? The Infinity Ward engine was based on id Tech 3, and COD4 (2007) on the PC with the ProMod mod was my favourite FPS of all time. No game I've ever played can compare to the physics of that game with ProMod.

Yes mate, insofar as I've poked around in the idTech3 ("Quake 3 Arena" engine) code and am familiar with the tech. It was a significant rewrite of idTech2 and, as you say, provided a base for IW's engine. I've not done much else with it other than write my own BSP level viewers, MD3 viewers, etc. (but that was a LONG time ago!)
 

Beauvais Motorsport

ClioSport Club Member
Yes mate, insofar as I've poked around in the idTech3 ("Quake 3 Arena" engine) code and am familiar with the tech. It was a significant rewrite of idTech2 and, as you say, provided a base for IW's engine. I've not done much else with it other than write my own BSP level viewers, MD3 viewers, etc. (but that was a LONG time ago!)

When I have to search the code terms and such that you write 😄 Thanks for sharing the info though, mate!

231231321.png


Me right now...🙄

32232123.png
 

Beauvais Motorsport

ClioSport Club Member
When I did my media studies final project in college, I knew my way around the game's (COD4's) built-in editing 'client', I think it was called. I used a green screen but was able to keep the animations of the players and incorporate them into my video - like running with the gun, zooming in with the sniper scope, shooting someone and seeing them die. The idea was that the game took over my mind and I couldn't tell what was reality, so at the end I jumped ('fell') off my garage and died in real life (my mate accidentally failed to catch the camera, and it turned out that was the perfect shot as it looked realistic). :LOL:
 

SharkyUK

ClioSport Club Member
Here's another video I've just uploaded that shows the path tracer in action. Again, it's a bit boring, but it's kinda difficult showing what is happening underneath that makes it all possible! :ROFLMAO:

(The black background on the section with the multiple male heads is an encoding glitch and not caused by my software! I can't be bothered to go back and redo the video now though).



I'll post a few screenshots later. Possibly. If I remember!
 

SharkyUK

ClioSport Club Member
As mentioned in my last post, here are a few updated images straight from my project. Nothing too exciting I'm afraid (but then again, I've not had much time to devote to it recently!)

Mech thing:

53280987113_4951825387_h.jpg

SIPT Test Render by Andy Eder, on Flickr

53280996138_4acb52b8f3_h.jpg

SIPT Test Render by Andy Eder, on Flickr

Glass thing:

53281169490_6af28d2feb_h.jpg

SIPT Test Render by Andy Eder, on Flickr

(More) Car things:

53280713441_56053d9c5c_h.jpg

SIPT Test Render by Andy Eder, on Flickr

53281180555_66fde30811_h.jpg

SIPT Test Render by Andy Eder, on Flickr

Walkman thing:

53280994848_a2d8716749_h.jpg

SIPT Test Render by Andy Eder, on Flickr

53279819922_c6bce41991_h.jpg

SIPT Test Render by Andy Eder, on Flickr

Nasty Xeno-thing:

53281177860_eb9f6c2b92_h.jpg

SIPT Test Render by Andy Eder, on Flickr

Muddy thing:

53281174105_87242620bc_h.jpg

SIPT Test Render by Andy Eder, on Flickr

Antique dress thing:

53280715386_b26b1b36a4_h.jpg

SIPT Test Render by Andy Eder, on Flickr

I'll try and have some more interesting content next time! :ROFLMAO:

(On a side note... in the antique dress scene above, the textures alone are taking up 630MB - and that's compressed!) 🤓
 

SharkyUK

ClioSport Club Member
I haven't touched my realtime path tracing project for a while and decided to update it to use a newer version of the underlying APIs / frameworks - for example, CUDA. That was a fun few hours... which resulted in a completely broken and non-functional executable. In fact, I'm lying. I couldn't even get the software to compile/build due to a major version update in the CUDA libraries - from version 11.x to 12.x. In the end, I rolled back to the very last version of 11.x and will stick with that until I can find the time to get 12.x working.

I also spent a few hours experimenting with a foliage system.

53792044939_b03b764005_o.png

GPU Path Tracing by Andy Eder, on Flickr

53790786592_0d17284602_o.png

GPU Path Tracing by Andy Eder, on Flickr

And rendered a few more images to ensure I hadn't broken anything.

53790781302_011f6b83a9_o.png

GPU Path Tracing by Andy Eder, on Flickr

53791732431_325f8a90bf_o.png

GPU Path Tracing by Andy Eder, on Flickr

53792039869_0ac6efeb7c_o.png

GPU Path Tracing by Andy Eder, on Flickr

53790786587_2a6e576185_o.png

GPU Path Tracing by Andy Eder, on Flickr

53792044934_97f2501ba0_o.png

GPU Path Tracing by Andy Eder, on Flickr

I wish I had time to work on this as I have so much I want to experiment with and research further!
 

SharkyUK

ClioSport Club Member
Talking of experimenting and research, I did decide to drop a new feature into the project - realtime denoising!

I am using Intel's Open Image Denoise technology, which uses AI learning to denoise rendered images. It runs on the GPU and is performant enough to be used in realtime in the main render loop, rather than purely as a post-process effect. It's pretty cool (well, I think it is!). It also plays nicely with my rendering code and it was relatively painless to get it in and working. I still need to tidy up the code and make a few changes, but it proved that it worked, so I'll take that for now.
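
For anyone wondering what the hookup actually looks like, it's roughly along these lines - a simplified sketch based on Intel's documented C++ API rather than my actual integration code:

Code:
// Simplified sketch of wiring up Intel Open Image Denoise (OIDN), based on the
// library's documented C++ API - not my actual render-loop integration.
#include <OpenImageDenoise/oidn.hpp>
#include <cstdio>
#include <vector>

void denoise(int width, int height,
             std::vector<float>& colour,  // noisy beauty pass, RGB floats per pixel
             std::vector<float>& output)  // denoised result, same layout
{
    oidn::DeviceRef device = oidn::newDevice();  // picks a suitable device (GPU/CPU)
    device.commit();

    oidn::FilterRef filter = device.newFilter("RT");  // generic ray/path tracing denoise filter
    filter.setImage("color",  colour.data(), oidn::Format::Float3, width, height);
    filter.setImage("output", output.data(), oidn::Format::Float3, width, height);
    filter.set("hdr", true);  // the beauty image is HDR
    filter.commit();

    filter.execute();

    const char* errorMessage;
    if (device.getError(errorMessage) != oidn::Error::None)
        printf("OIDN error: %s\n", errorMessage);
}

In practice you also feed it albedo and normal AOVs (and keep the buffers on the GPU) for better quality and speed, but that's the shape of it.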

What is denoising? It's the process of removing "noise" from a rendered image. A path tracer typically generates noisy images due to the way in which the light calculations are performed. The more light paths traced for a given scene, the better the image becomes, as it converges over time to a less noisy result. However, this process is EXTREMELY expensive and is the reason why ray-traced and path-traced games have such a huge performance hit on even the most powerful home PC. To offset the performance cost, the paths traced are kept to a minimum... but that also means the resulting image is going to be full of unsightly noise.

This is where the denoiser comes in. It takes the rendered image, complete with the noise, and applies a clever "filtering" to remove as much of the noise as possible without sacrificing scene detail or introducing too many inaccuracies into the resulting image. Here's an example taken directly from my path tracer...

This first image has a low sample rate, meaning a relatively low number of light paths were traced and a noisy image is the result.

53791739356_cd189a5f5a_o.png

GPU Path Tracing by Andy Eder, on Flickr

Here is exactly the SAME image a few ms later after the denoiser has done its magic.

53792153860_8fd565fe15_o.png

GPU Path Tracing by Andy Eder, on Flickr

No additional rendering effort was required to produce the second image above; it is a direct result of passing the noisy first image through the denoiser. Whilst the results are not perfect, they can certainly be passable and the denoiser can save a huge amount of calculation effort and time.

Here is another example - the noisy image followed by the denoised image...

53790788377_f890300ace_o.png

GPU Path Tracing by Andy Eder, on Flickr

53792046714_b2f7236a3e_o.png
GPU Path Tracing by Andy Eder, on Flickr

I created a short video of this new denoiser in action.

 

Darren S

ClioSport Club Member
@SharkyUK - I know the 8Pack system builds have been out for years, but would they necessarily be good for your area of work?


Just thinking where else a stack of seven 4090 cards would be of much use? Engineering? Simulation?
 

SharkyUK

ClioSport Club Member
@SharkyUK - I know the 8Pack system builds have been out for years, but would they necessarily be good for your area of work?


Just thinking where else a stack of seven 4090 cards would be of much use? Engineering? Simulation?

Yes, and no! 🤣

It's quite a difficult question to answer because it depends on the workload (i.e. what the ask is for utilising all that GPU performance) and how data associated with that workload is handled/managed.

In the case of my path tracer, it is a pure CUDA solution (hence it will only run on nVidia GPUs - more specifically, Kepler microarchitecture nVidia GPUs or later due to some rather specific performance optimisations I made in the source code). In theory, the more CUDA hardware available to throw at the problem (i.e. rendering the scene), the better. Why? Simply because the GPU can perform massively parallel processing. The more CUDA cores available, the more work I can run in parallel. In this case, the work is calculating the colour of a given screen pixel. With more CUDA cores, I can calculate more pixels in parallel, leading to improved rendering performance. The thinking is that, if I have 1x GPU and can calculate, say, 100 pixels at a time, with 2x GPUs I can calculate 200 pixels at a time, with 4x GPUs I can calculate 400 pixels at a time, and so on.

Up to a point.
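
Before I get to the "up to a point" bit - in code terms, the basic idea really is that simple. Something like the below (a bare-bones sketch with made-up names, not my actual kernel): one thread per pixel, and the hardware runs as many of those threads in parallel as it has cores for.

Code:
// Bare-bones "one thread per pixel" sketch (made-up kernel, not my path tracer).
#include <cuda_runtime.h>

__global__ void shadePixels(float3* framebuffer, int pixelCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique pixel index for this thread
    if (i >= pixelCount) return;

    // In a real path tracer, this is where the light path for pixel i would be traced.
    framebuffer[i] = make_float3(0.f, 0.f, 0.f);
}

int main()
{
    const int pixelCount = 1920 * 1080;
    float3* framebuffer = nullptr;
    cudaMalloc(&framebuffer, pixelCount * sizeof(float3));

    int threadsPerBlock = 256;
    int blocks = (pixelCount + threadsPerBlock - 1) / threadsPerBlock;  // cover every pixel
    shadePixels<<<blocks, threadsPerBlock>>>(framebuffer, pixelCount);
    cudaDeviceSynchronize();

    cudaFree(framebuffer);
    return 0;
}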

And... this is where it gets a bit tricky to explain because it becomes very technical, complex and boring (if you aren't into development and how to write software for such architectures!) 🤣 I'm trying to think how I can explain this in a way that doesn't cause folks to switch off through boredom or through not understanding the subject matter. That is not a slur on anyone who might read this; I know there are some incredibly clever and technical folk on here! But I would not expect the majority of them to have specialist knowledge in this area/domain (in the same way that I wouldn't have a clue in their line of work)! 🤣

Would multiple 4090s be good for my work in terms of my path tracer? Yes, absolutely... provided I am careful to limit myself to ensure that I don't exceed the amount of VRAM available on a GPU. Once VRAM is exhausted, performance can tank. Especially in this case where optimisations have been made such that the path tracer runs wholly on the GPU. If I start having to pull information from main system RAM, I have to send/pull data over the system bus and this is orders of magnitude slower than having everything available in high-speed local VRAM.

Ok... but the 4090 has 24GB of VRAM. That's plenty of memory, right? For gaming and video editing, driving a few monitors, etc. it is more than ample. It is overkill for most things. But, for some workloads, it just doesn't cut it. Let's continue using my path tracer as an example. I have 24GB of VRAM to play with, which sounds like a lot, but my path tracer requires everything to be available on the GPU itself. This is to ensure optimal performance. So, I have to have the following in VRAM:
  • the CUDA executable code (i.e. the code I have written that performs all the calculations to render a scene using the data that is in VRAM)
  • a 3D scene representation organised into an efficient structure to improve performance (sometimes referred to as a BVH) against which light paths are tested as part of the rendering process
  • all 3D scene assets - polygonal models, animation data, textures, material definitions, lighting information, camera information, render settings, G-buffer (memory buffers for storing colour information, depth information, etc).
  • ...and other stuff!

That's potentially a lot of data to store on the GPU. Especially if you are looking at high-fidelity 3D graphics rendering to the level we see in TV and movies. It's surprisingly easy to exhaust that memory. The last screenshot I posted above (of the female/male, zombie head, Lego car) is a relatively simple scene, yet it consumes just over 10GB of VRAM. You read that correctly - 10 gigabytes! And that is why the answer to Darren's question is a lot more complex than it may seem.
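
(That 10GB figure isn't a guess, by the way - the CUDA runtime will happily tell you what's left at any point. A trivial check along these lines:)

Code:
// Quick check of free vs. total VRAM on the current CUDA device.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("VRAM: %.2f GB free of %.2f GB total\n",
           freeBytes  / (1024.0 * 1024.0 * 1024.0),
           totalBytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}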

The GPU's ability to process vast amounts of data and carry out massively parallel computations is staggering, provided the data is organised and fed to the GPU streaming processors in a friendly way, and the code executing on the GPU is very "focused" on the task it is designed to carry out (i.e. number-crunching and not having to deal with excessive branching and uncertainty in terms of code execution direction/flow). It also has to have that data available in VRAM, whether it persists in there for the lifetime of the parent process or has to be brought in and swapped in/out from system memory (or somewhere else). And therein lies the crux of the problem. Keeping the beast fed.

I could throw my path tracer at a system with multiple GPUs but, if I run out of memory and can't feed them, I lose any benefits and performance gains. Hence, the problem becomes not one of throwing more GPUs at it, but one of how to intelligently and efficiently work with the data in order to keep multiple GPUs happy and working optimally. Sadly, I don't have deep enough pockets to purchase racks of nVidia GPUs that are linked with high-speed fibre optic nodes and interconnects, connected to silly-fast solid state devices with capacities in the hundreds of terabytes and petabytes. Shame, really. 🤣

As said, and with due respect, it's kind of hard to explain without a VERY lengthy discussion that can quickly become very technical. When you have situations that call for serious levels of GPU compute capability, it is often more about the supporting infrastructure than the GPUs themselves. Both at the hardware and software level.

The 8Pack system linked would definitely be useful in the fields of engineering, fluid simulation, 3D rendering, etc. for sure. There is no doubt about it. The GPUs are perfectly suited to many workloads found in those fields. I would love to drop 40 grand on a system and have it at home for playing with and for work and research. Yet, it would be woefully inadequate for some of the work I've been involved with(*). For example, you wouldn't be able to use that 8Pack system to render Pixar's latest and greatest movie! That needs vast render farms. However, having a multi-GPU setup sat on an individual Pixar artist's desk would be ideal and, in fact, often essential for them to work on 3D content for such a movie. It again boils down to what the GPU(s) is/are being utilised for, and then determining how to best manage any associated data - whether the scale is at, say, an artist level working on a 3D character for a movie or the scale is render-farm level and the entire movie data set has to be available to allow thousands upon thousands of GPUs to work in unison to render the entire movie in a sensible time frame.

I hope that makes some sense. Probably not.

* I once worked on a dataset that contained only partial information for a 3D scene in a movie that was released several years ago. It didn't contain animation data, and many other data items were absent, too. However, that partial dataset still weighed in at 1.8 terabytes. It's a far cry from dealing with files on local hard drives that might run into a few hundred megabytes or a few gigabytes. 🤣
 

Darren S

ClioSport Club Member
Thanks for that Andy.

So, from a technical perspective in terms of utilising VRAM, is spanning multiple cards also a bit of a hardware bottleneck?

Using my epic man-maths, the 8Pack system would have a total of 168GB of VRAM within it, spread over the seven cards. If you had a complete render including all the assets at say 20GB that could sit upon one card - would that have a clear performance benefit over a 40GB render that had to be split over at least two cards?

Or.... a bit like RAID with multiple drives - does the software see the entire GPU stack as a single device with little concern that there are seven GPUs and that the total memory is a cluster of 7x 24GB 'chunks'?

I remember from my time when I went slightly baller and had two nVidia GTX 690s in my PC. In theory, that gave me four GPUs with a total of 8GB across them. On paper, the performance should have been superb - even the likes of GTAV today, with all the bells and whistles enabled @ 2560x1440, doesn't use 3GB of VRAM according to the options screen.

In reality, however, it was very much hit & miss. Some well-optimised games such as Warframe split the load evenly across all of the GPUs and played well. The majority, however, just hammered GPU0 and only had access to the 2GB of VRAM allocated to it. Some games even outright lied with their claims within the options menu. Ashes of the Singularity had a specific tickbox to enable 'multi-GPU processing', which was music to the ears of any SLI or Crossfire system owner. Yet when running the hardware tool in the background, once again the game was solely using GPU0 and its small amount of VRAM.

I assume that a fair amount of coding work is understanding your hardware and getting the most out of it - such as the CUDA requirement you mentioned above?
 

SharkyUK

ClioSport Club Member
So, from a technical perspective in terms of utilising VRAM, is spanning multiple cards also a bit of a hardware bottleneck?

You know what I'm going to say, don't you? 🤣 "It depends".

For the purposes of this reply, I'll talk about GPUs we can relate to (as opposed to data centre GPUs). I'm also talking about CUDA compute cores here as opposed to the likes of Tensor cores (or equivalent).

There are a number of factors that can ultimately determine the performance of a CUDA-based system and workload, and whether bottlenecks become a factor. It might be something as obvious as the hardware itself; i.e. we know that modern-era GPUs have significant uplifts in compute capability from one generation to the next. With increased performance comes new tech, new features and new solutions. There are no real surprises there - technology continues to march onward as it always has. Other factors include how the GPU(s) has/have been configured. Distribution and dispatching of work on even a single GPU can be incredibly sensitive to setup, a point I will touch upon later. As more GPUs are introduced into the mix, this can add to the complexity in terms of balancing and distributing work optimally across the compute devices. The actual nature of the data and the calculations on that data can also have a massive impact on performance (yes, I know it sounds like I'm stating the obvious here).

For example, in my path tracer, if I'm tracing a light ray for pixel N to try and determine the colour for that pixel, there is no guarantee that pixel N+1 (i.e. an adjacent pixel) is going to take the same amount of time and computing effort to complete. Why not? That's the nature of path tracing. The light path for pixel N might not hit anything and simply head straight out towards infinity, in which case I would simply colour that pixel with a sky colour, or whatever I am using for my background (such as an HDRI). There's hardly any computation effort needed, and that thread of work is complete and the pixel colour known. For pixel N+1 next door, however, the light ray might hit a dielectric surface (such as a glass sphere). At that hit point, due to the nature of the surface, the light ray can be reflected or refracted (as we know, glass is both reflective and refractive), so we have additional work to determine the reflected or refracted direction of the light bounce, along with calculations of other values that are needed so that the light path can continue to be traced. This light path might then be totally internally reflected within the sphere, or it will hit the other side of the sphere (exit point). Again, the light might be reflected internally or it might be refracted again, depending on the index of refraction of the entry/exit material properties (air/glass, for example). The light path trace then continues until a pre-calculated number of bounces (hits) is reached (or a random number, if using 'Russian Roulette'), or the light ray hits nothing else and shoots off to infinity. At this point we have our pixel colour, but it has taken on the order of millions more compute instructions to calculate and a compute time orders of magnitude greater than its neighbouring pixel. For reasons I'll touch on later, this can cause a stall.
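
To put the pixel N / pixel N+1 story into (pseudo-)code, the bounce loop looks roughly like this - just the shape of it, with the real intersection and shading work stubbed out and the names made up:

Code:
// The shape of the per-pixel bounce loop only - intersection/shading are stubbed out
// and the names are made up. This is why neighbouring pixels can cost wildly different amounts.
#include <cuda_runtime.h>
#include <curand_kernel.h>

struct Ray { float3 origin, dir; };
struct Hit { bool valid; bool isDielectric; /* ...surface data... */ };

// Stub - the real version traverses the BVH and returns the nearest intersection.
__device__ Hit intersectScene(const Ray&) { Hit h; h.valid = false; h.isDielectric = false; return h; }

__device__ float3 tracePixel(Ray ray, curandState* rng, int maxBounces)
{
    float3 colour = make_float3(0.f, 0.f, 0.f);

    for (int bounce = 0; bounce < maxBounces; ++bounce)
    {
        Hit hit = intersectScene(ray);

        if (!hit.valid) {
            // "Pixel N": the ray flies off to infinity - return the sky/HDRI colour. Almost free.
            return make_float3(0.5f, 0.7f, 1.0f);
        }

        if (hit.isDielectric) {
            // "Pixel N+1": glass hit - reflect or refract, maybe bounce around inside the
            // object several times. A lot more work for this particular thread.
            // ray = reflectOrRefract(ray, hit, rng);   // (omitted)
        }

        // 'Russian Roulette': randomly terminate the path so long paths don't run forever.
        if (curand_uniform(rng) < 0.1f)
            break;
    }
    return colour;
}

Pixel N exits on the first iteration; pixel N+1 might go round that loop many, many times.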

Before I go off too far on a tangent (pun intended), if a workload is relatively simple, computationally balanced, and fits on a single GPU, it will typically be quicker than if the work was spread over multiple GPUs. Even though modern CUDA devices can be treated as one big single device (unified memory), there is, of course, work going on under the hood to distribute work across multiple devices and into different memory address spaces. This can add contention on system buses (it's easy to exhaust PCIe lanes on multi-GPU setups). Of course, there are the likes of NVLink, but you are often limited to connecting two (or maybe 4? Can't remember...) devices and, for the rest, well, you're a slave to the PCIe bandwidth as that is the only way to get data to/from the cards.

As already alluded to, it's not always quite so straightforward though. What if you can fit and execute your workload on a single GPU, but the computation is incredibly heavy? The computation might be so heavy that bus bandwidth limits, etc. aren't really a big issue as the time taken to process the workload far outweighs anything else, resulting in poor performance. In these cases, it might be better to split the workload across two GPUs because the overhead of doing so is minimal, yet you now have twice the number of compute units running in parallel. How about splitting it across 4 GPUs? 4 times the compute units working on it now... It's all a balancing act, with many factors trying to upset the apple cart. :)

Kepler and earlier architectures (I think) did not support unified memory, whilst later generations do. You can use newer versions of the CUDA API to treat older hardware as having unified memory, but the API is tricking you and has to perform a whole host of memory paging operations and swaps between GPU (device) and CPU (host) memory across the system bus and, naturally, this can cause some massive bottlenecks. Things got better and continue to do so with modern architectures. There is still work going on under the hood but it's a lot easier to work with and a LOT more performant in comparison.
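
(For the devs: this is the managed/unified memory stuff - cudaMallocManaged and friends. A minimal example of the idea, nothing to do with my path tracer specifically:)

Code:
// Minimal managed (unified) memory example - one pointer usable from both CPU and GPU,
// with the CUDA runtime migrating pages as needed.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1024;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));     // visible to host AND device

    for (int i = 0; i < n; ++i) data[i] = float(i);  // write on the CPU...

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.f);   // ...process on the GPU...
    cudaDeviceSynchronize();

    printf("data[10] = %f\n", data[10]);             // ...read back on the CPU, no explicit copies
    cudaFree(data);
    return 0;
}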

Using my epic man-maths, the 8Pack system would have a total of 168GB of VRAM within it, spread over the seven cards. If you had a complete render including all the assets at say 20GB that could sit upon one card - would that have a clear performance benefit over a 40GB render that had to be split over at least two cards?

Or.... a bit like RAID with multiple drives - does the software see the entire GPU stack as a single device with little concern that there are seven GPUs and that the total memory is a cluster of 7x 24GB 'chunks'?

Yes, the 8Pack system GPU array could be treated as a single big compute device with a unified memory of 168GB of VRAM. That would be quite nice... 🤣

I have already answered the question above in many ways. Depending on the complexity of the render calculations for the light paths (assuming a ray/path tracing scenario here), there may be benefits to running on a single GPU in the array, or it may be better to split across two or even all of the GPUs in the array. In reality, modern rendering software will generally be written to take hold of all the compute power it can get its hands on, and then analysis of the dataset is carried out in order to determine the most optimal way of processing it on the available devices. It's quite a big area and field of research; organising the data sensibly can often yield significant performance gains, as opposed to just throwing more work at more GPUs with little thought.

I mentioned above that compute devices can be sensitive to the data they are processing and the calculations they are carrying out on that data, piece by piece. This is due to how modern GPU compute devices work. I won't go into too much detail here as it could get boring and incredibly lengthy, but I will hopefully give enough information to explain why massively parallel computation can be so finicky. 🤣

The 4090 GPU (let's stick with this GPU) is not the full Ada Lovelace AD102, but still has 128 SMs (streaming multiprocessors) with 16,384 CUDA cores (128 CUDA cores per SM). Each SM has those 128 CUDA cores, 1 RT core, 4 Tensor cores, and a multitude of other things like texture units and a 128KB L1/shared memory cache (which can be configured with different splits depending on the workload). Each SM has 1536 threads available to it.

When configuring the GPU (or GPUs) you work in grids and blocks. It is up to the developer to determine the best strategy in terms of organising the overarching grid layout, within which the blocks are structured. The grid and block data structure can be treated as a 1D, 2D or 3D structure. The blocks are assigned several threads (a thread block). This might help (depicts a 2D grid structure of thread blocks, with each block having a 3D structure of threads within it):

CUDA-grid-blocks-threads.jpg


It is up to the developer to determine the best arrangement of the above in order to get the best performance from the available hardware. It can be quite a tricky and complex thing to get right.
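
In code, that arrangement is literally just the launch configuration you give the kernel - for example, a 2D grid of 2D blocks (numbers plucked out of the air purely for illustration):

Code:
// Illustrative launch configuration only - a 2D grid of 2D thread blocks covering an image.
#include <cuda_runtime.h>

__global__ void doWork(int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // ... per-pixel / per-element work goes here ...
}

int main()
{
    const int width = 3840, height = 2160;

    dim3 block(16, 8, 1);                          // 16 x 8 x 1 = 128 threads per block
    dim3 grid((width  + block.x - 1) / block.x,    // enough blocks to cover the whole domain
              (height + block.y - 1) / block.y,
              1);

    doWork<<<grid, block>>>(width, height);
    cudaDeviceSynchronize();
    return 0;
}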

Let's dig a bit deeper.

Each SM, as already mentioned, can have 1536 threads. A thread block can have a maximum of 1024 threads. Each group of 32 threads is grouped into a warp. A warp is scheduled and controlled by the owning SM and, once scheduled, a warp will execute the same instruction across all 32 threads in that warp in parallel. That warp remains resident until execution of those threads in the warp is complete. Do you see a potential issue here? Cast your mind back to my example about pixel N and its neighbour, pixel N+1...

(This is massively simplified so, any CUDA developers reading this, remember I'm trying to keep it simple and give a flavour, ok?) 🤣

If a thread in the warp hits some sort of condition, it has to go in a direction that might be different to another thread in the same warp. They were executing the same instruction in parallel, but now they need to go in different directions. Pixel N is following this line of rendering logic, pixel N+1 takes a different code path as a whole lot of different computation is required. This is referred to as thread divergence. It also means that the parallel-running threads are no longer able to run the same instruction in parallel so threads are put to sleep whilst each thread is then run serially until completion. That can be a problem. 31 of the threads in the warp might have very little computation and execute exactly the same code as for pixel N. However, that one thread that has diverged has caused them all to be put on hold whilst it carries out its incredibly complex (in relative terms) calculation. The other 31 threads have to wait for the pixel N+1 thread to finish and hence sit in a wait state. That's a waste, of course. Whilst that warp is resident and executing, further warps might not be able to be scheduled if all resources have been exhausted. This has caused a bottleneck. This can quickly escalate when you consider the sheer number of threads and blocks involved and is why GPU compute workloads can be so sensitive to data and configuration.
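
A deliberately contrived sketch of the problem (made-up kernel, not real rendering code) - one "expensive" lane per warp is enough to hold the other 31 up:

Code:
// Contrived example of warp divergence - one "expensive" lane per warp of 32.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void divergentKernel(const float* workAmount, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (workAmount[i] < 0.5f) {
        // "Pixel N": cheap path - 31 of the 32 lanes in each warp go this way.
        out[i] = 0.f;
    } else {
        // "Pixel N+1": expensive path - the odd lane out drags the whole warp with it,
        // because the other lanes sit masked off while this branch executes.
        float acc = 0.f;
        for (int k = 0; k < 10000; ++k)
            acc += sinf(acc + workAmount[i]);
        out[i] = acc;
    }
}

int main()
{
    const int n = 1 << 20;
    float *work = nullptr, *out = nullptr;
    cudaMallocManaged(&work, n * sizeof(float));
    cudaMallocManaged(&out,  n * sizeof(float));
    for (int i = 0; i < n; ++i) work[i] = (i % 32 == 0) ? 1.f : 0.f;  // one slow lane per warp

    divergentKernel<<<(n + 255) / 256, 256>>>(work, out, n);
    cudaDeviceSynchronize();
    printf("done (out[0] = %f)\n", out[0]);

    cudaFree(work);
    cudaFree(out);
    return 0;
}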

Of course, there are also other reasons. Such as resource contention where threads might be trying to access another GPU resource but have to wait until another thread has relinquished its hold on that resource. Memory divergence, divergent data, sync points, latency hiding, occupancy... many factors contribute to potential performance issues and they are far beyond the scope of this (already too long) narrative!

Occupancy tripped me up on my path tracer recently (due to misconfiguration). A few days ago, I found a bug in my code and suddenly gained around an 8% performance uplift when looking into it. Occupancy is a metric that measures the utilisation of the GPU's multiprocessor resources during the execution of a CUDA kernel (code running on the GPU). It's a bit difficult to explain my situation so I'll explain it in the context of the information I've already provided above.

Remember I said that an SM has 1536 threads available? And that the maximum number of threads per block was 1024? Well, let's assume we go balls to the wall and specify the maximum number of threads for our blocks - so our blocks have 1024 threads. Maximum performance, yeah? :unsure: Well, not quite...

Remember that an SM has a capacity for 1536 threads, so we can only run a single block per SM. Two blocks don't fit as that would need 1024x2 (=2048) threads and the SM only supports 1536. So that means our SM has an unused 512 threads sat doing nothing. Hmm...

So, how about changing our block size to have 512 threads? This can work out much better. The SM can now run 3 thread blocks (3 x 512 = 1536 threads). We have utilised all the threads across all blocks within the SM.
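
You don't have to work that out with pen and paper either - the CUDA runtime will report it for a given kernel and block size. A quick sketch using the occupancy API (illustrative kernel only):

Code:
// Ask the CUDA runtime how many blocks of a given size can be resident on one SM.
// On an SM with a 1536-thread limit: 1024-thread blocks -> 1 resident block (1024 threads used),
// 512-thread blocks -> 3 resident blocks (all 1536 threads used).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float* data)   // illustrative kernel, never actually launched here
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.f;
}

void report(int blockSize)
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
    printf("block size %4d -> %d resident block(s) per SM (%d threads)\n",
           blockSize, blocksPerSM, blocksPerSM * blockSize);
}

int main()
{
    report(1024);
    report(512);
    return 0;
}

(The real numbers also depend on how many registers and how much shared memory the kernel uses per thread, which is exactly the sort of thing that tripped me up.)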

That was a lot to take in and I've kind of lost my focus but hopefully, it makes some sense. I blame the meds.

I remember from my time when I went slightly baller and had two nVidia GTX 690s in my PC. In theory, that gave me four GPUs with a total of 8GB across them. On paper, the performance should have been superb - even the likes of GTAV today, with all the bells and whistles enabled @ 2560x1440, doesn't use 3GB of VRAM according to the options screen.

In reality, however, it was very much hit & miss. Some well-optimised games such as Warframe split the load evenly across all of the GPUs and played well. The majority, however, just hammered GPU0 and only had access to the 2GB of VRAM allocated to it. Some games even outright lied with their claims within the options menu. Ashes of the Singularity had a specific tickbox to enable 'multi-GPU processing', which was music to the ears of any SLI or Crossfire system owner. Yet when running the hardware tool in the background, once again the game was solely using GPU0 and its small amount of VRAM.

Yeah, multi-GPU for gaming never really took off. It looked quite promising for some time but, ultimately, proved too difficult to get right. Developers didn't really have the time or resources to invest (such is the nature of game dev), and getting a balanced, performant workload distribution just wasn't feasible. Gaming workloads don't always fit so well when it comes to using GPUs for data computation; game logic is very dynamic, with lots of branching, uncertainty, etc.

On the rendering/rasterisation side, the GPUs running in SLI (or equivalent) needed to have their own copies of the required data in their local VRAM as well. There was overhead for syncing that up, and then for shuttling data to and fro across the bus to the GPUs, causing stalls and bottlenecks. It was just easier to run on a single GPU and maybe farm a little bit of work out to another GPU on the system; multi-GPU support was ultimately too expensive, too time-consuming, and too niche a market to be worthwhile.

I assume that a fair amount of coding work is understanding your hardware and getting the most out of it - such as the CUDA requirement you mentioned above?

Exactly!
 

