Revisions to What are the common rendering optimization techniques for the geometry pass in a deferred shading renderer? [closed]

added 2 characters in body

edited Dec 9, 2013 at 5:29

concept3d

Avoid branching in shaders.
Try different vertex layouts, for example {VNT} interleaved in a single array, or {V}, {N}, {T} in separate arrays.
Draw the scene front to back, so that the depth test can reject occluded fragments early.
Turn off Z testing where it isn't needed, for example for an image that doesn't need Z testing.
Use compressed textures.
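
As a rough illustration of the "draw front to back" tip, here is a minimal sketch that sorts opaque draw items by their distance from the camera before submission. The `DrawItem` struct and its fields are hypothetical names made up for this example, not part of any real API; how the depth value is computed is left to the engine.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw item: only the fields needed for sorting are shown.
struct DrawItem {
    int   meshId;     // illustrative handle, not a real API object
    float viewDepth;  // distance from the camera along the view direction
};

// Sort opaque items nearest-first, so that closer surfaces fill the depth
// buffer early and farther fragments can be rejected by the depth test.
inline void sortFrontToBack(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```

This ordering only makes sense for opaque geometry; transparent objects usually need the opposite order.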

Use inline functions for small functions.
Use SIMD (Single instruction multiple data) when possible.
Avoid cache unfriendly memory jumps.

Use VBOs with the "right" amount of data (this depends on your hardware); usually, fewer draw calls are better.

added more details regarding deferred shading

edited Nov 26, 2013 at 11:47

concept3d

Deferred shading is only a technique to "defer" the actual shading operation to later stages. This can greatly reduce the number of passes needed; for example, rendering 10 lights would otherwise need 10 passes. My point is that, regardless of the rendering technique you are using, there are certain rendering optimizations that reduce the number of objects (vertices, normals, etc.) that your rendering pipeline needs to process.

Deferred rendering tries to solve the problem that arises when the number of lights increases, which in forward rendering might make the number of passes explode.

Those techniques do not directly optimize the deferred shading part, but according to your description, the deferred shading part is NOT your problem. Your problem is that you are submitting the whole scene to the rendering pipeline. So your engine has to process, for example, all 100 million vertices in your scene just to be able to submit the result to the G-buffer, while most of these 100 million vertices can trivially be culled away and never submitted to the vertex and fragment processing passes.

In a forward renderer, the N vertices are processed by the vertex stage a total of (vertex count × light count) times and by the fragment stage a total of (fragment count × light count) times. Deferred shading effectively reduces this to only the vertex count for the vertex stage and the fragment count for the fragment stage, before resolving the actual shading. But N can still be too much to process, especially when most of the vertices can be trivially culled.
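
To make the counting concrete, here is a small sketch of that arithmetic under the simplified model described above (one full geometry pass per light for forward rendering, a single geometry pass for deferred). The type and function names are made up for illustration, and the model ignores the later lighting pass of a deferred renderer.

```cpp
#include <cstdint>

// Number of shader invocations per stage in this simplified model.
struct StageWork {
    std::uint64_t vertexInvocations;
    std::uint64_t fragmentInvocations;
};

// Multi-pass forward rendering: the geometry is re-submitted once per light.
inline StageWork forwardMultiPass(std::uint64_t vertices,
                                  std::uint64_t fragments,
                                  std::uint64_t lights) {
    return {vertices * lights, fragments * lights};
}

// Deferred geometry pass: one pass over the geometry, whatever the light count.
inline StageWork deferredGeometryPass(std::uint64_t vertices,
                                      std::uint64_t fragments) {
    return {vertices, fragments};
}
```

Either way, the work scales with the number of vertices submitted, which is why culling them before submission still pays off.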

This makes culling more effective in the case of forward rendering with multiple passes. But keep in mind that most engines use a dual rendering approach, because deferred shading alone cannot handle transparent objects. This makes those optimizations a must; I don't know of any commercial engine that doesn't do all of them.

Occlusion culling solves this by doing some early tests to cull occluded objects that are inside the view frustum. One practical implementation of occlusion culling uses point-based queries to check whether certain objects are visible from a specific point of view. This can also be used to cull lights that do not contribute to the final image, which is especially useful in a deferred renderer.

Use inline functions for small functions.
Use SIMD (Single instruction multiple data) when possible.
Avoid cache unfriendly memory jumps.

But what if my bottleneck is in the deferred shading?

In this case, since deferred shading is mostly concerned with lights, the most obvious step is to optimize the actual shading calculations. Some points to keep an eye on:

Render only the lights that actually affect the final image. In other words, cull the lights that don't contribute. This can be implemented effectively using the occlusion culling I mentioned before.

Does this light need the specular or some other components? Maybe not.

Does this light cast shadows? Some lights don't need to cast shadows.

Can this light's contribution be pre-computed? If the light is not moving, some aspects can probably be pre-computed.
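
A minimal sketch of the light-culling point, with one loud caveat: it uses a plain bounding-sphere-versus-plane test instead of the occlusion culling mentioned above, simply because that fits in a few lines. All types and names are hypothetical, the plane normals are assumed to be normalized and to point into the visible volume, and the `castsShadow` and `needsSpecular` flags only show where the other questions in the list could be recorded per light.

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0, with n pointing into the visible volume.
struct Plane { Vec3 n; float d; };

struct Light {
    Vec3  position;
    float radius;         // range beyond which the light contributes nothing
    bool  castsShadow;    // false lets the renderer skip the shadow pass
    bool  needsSpecular;  // false lets the shader skip the specular term
};

inline float signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// A light is kept unless its bounding sphere lies fully behind some plane.
inline bool lightAffectsView(const Light& l, const std::vector<Plane>& planes) {
    for (const Plane& p : planes) {
        if (signedDistance(p, l.position) < -l.radius) return false;
    }
    return true;
}

// Returns only the lights that may contribute to the final image.
inline std::vector<Light> cullLights(const std::vector<Light>& lights,
                                     const std::vector<Plane>& planes) {
    std::vector<Light> visible;
    for (const Light& l : lights) {
        if (lightAffectsView(l, planes)) visible.push_back(l);
    }
    return visible;
}
```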

fixed some spelling and structure errors

edited Nov 26, 2013 at 9:24

concept3d

Deferred shading is only a technique to "defer" the actual shading operation to later stages. This can greatly reduce the number of passes needed; for example, rendering 10 lights would otherwise need 10 passes. My point is that, regardless of the rendering technique you are using, there are certain rendering optimizations that reduce the number of objects (vertices, normals, etc.) that your rendering pipeline needs to process.

Even when using deferred rendering, those techniques apply and are widely used.

Only objects that are fully or partially inside the view frustum ever need to be submitted to the rendering pipeline. This is the basic concept of frustum culling. Unfortunately, checking whether a mesh is inside or outside the view frustum can be an expensive operation, so engine designers instead use an approximate bounding volume such as an axis-aligned bounding box (AABB) or a bounding sphere. Even though this is not as accurate as using the actual mesh, the accuracy difference isn't worth the trouble of checking against the actual mesh.

This is a good and simple technique for a smaller engine, and it is used in almost every engine I have ever used. I recommend using "normal" bounding volume/frustum checking without hierarchies if your engine does not need to render very complex scenes.
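
Here is a minimal sketch of such a flat (non-hierarchical) AABB-versus-frustum check. The types and names are hypothetical, and the six frustum planes are assumed to be given with normals pointing inward; extracting them from a view-projection matrix is not shown. The test is conservative: it may keep a box that is slightly outside, but it never rejects a visible one.

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0, with n pointing into the frustum.
struct Plane { Vec3 n; float d; };

struct AABB { Vec3 minCorner, maxCorner; };

// True when the whole box lies on the negative (outside) side of the plane.
inline bool aabbOutsidePlane(const AABB& box, const Plane& p) {
    // Pick the box corner that is farthest along the plane normal.
    const Vec3 v{
        p.n.x >= 0.0f ? box.maxCorner.x : box.minCorner.x,
        p.n.y >= 0.0f ? box.maxCorner.y : box.minCorner.y,
        p.n.z >= 0.0f ? box.maxCorner.z : box.minCorner.z};
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d < 0.0f;
}

// The box is submitted for rendering unless some plane fully rejects it.
inline bool aabbIntersectsFrustum(const AABB& box,
                                  const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum) {
        if (aabbOutsidePlane(box, p)) return false;
    }
    return true;
}
```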

This is a must: why draw faces that won't be visible anyway? Rendering APIs provide an interface to turn back-face culling on and off. Unless you have a strong reason not to turn it on, as in some CAD applications that need to draw back faces in certain circumstances, this is a must-do.
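
In OpenGL, for instance, this is switched on with `glEnable(GL_CULL_FACE)` and configured with `glCullFace(GL_BACK)`; the GPU then does the test for you. Purely to illustrate the idea, here is a sketch of the equivalent test done by hand on the CPU, assuming counter-clockwise winding for front faces. The helper names are made up for this example.

```cpp
struct Vec3 { float x, y, z; };

inline Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

inline Vec3 cross(const Vec3& u, const Vec3& v) {
    return {u.y * v.z - u.z * v.y,
            u.z * v.x - u.x * v.z,
            u.x * v.y - u.y * v.x};
}

inline float dot(const Vec3& u, const Vec3& v) {
    return u.x * v.x + u.y * v.y + u.z * v.z;
}

// Triangle (a, b, c) with counter-clockwise winding when seen from the front.
// Returns true when the triangle faces away from the eye position.
inline bool isBackFacing(const Vec3& a, const Vec3& b, const Vec3& c,
                         const Vec3& eye) {
    const Vec3 normal = cross(sub(b, a), sub(c, a));
    const Vec3 toEye  = sub(eye, a);
    return dot(normal, toEye) <= 0.0f;
}
```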

Using the Z-buffer you can resolve visibility determination. But the problem is that the Z-buffer isn't always great in terms of performance: since the Z-buffer can only be resolved at later stages of the pipeline, occluded objects still have to be rasterized and might be written to the Z-buffer and the color buffer before failing the Z test.

A great real-world example of such a technique is in GTA5, where the skyscrapers strategically placed at the center of the city are not only decorations but also work as occluders, effectively occluding the rest of the city and preventing it from being rasterized.

Level of detail is a widely used technique. The idea is to use a simpler version of the mesh when the mesh contributes less to the scene. There are two common implementations. One simply switches the mesh for a simpler one when it no longer contributes much; the mesh is selected based on factors such as the distance and the number of pixels (area on the screen) the mesh occupies. The other dynamically tessellates the mesh; this is widely used in terrain rendering.
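
A minimal sketch of the first variant (switching meshes), using only the distance as the selection factor. The function name and the threshold list are assumptions made for this example; a real engine might use the projected screen area instead, or add hysteresis to avoid visible popping.

```cpp
#include <cstddef>
#include <vector>

// maxDistances[i] is the largest camera distance at which LOD i is still used.
// The list is assumed to be sorted in ascending order. Index 0 is the most
// detailed mesh; the returned index equals maxDistances.size() when the object
// is beyond every threshold, which selects the coarsest mesh.
inline std::size_t selectLod(float distance,
                             const std::vector<float>& maxDistances) {
    std::size_t lod = 0;
    while (lod < maxDistances.size() && distance > maxDistances[lod]) {
        ++lod;
    }
    return lod;
}
```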

The first thing you need to do is profile your application using a graphics profiler and determine where the bottleneck is. Keep in mind that the bottleneck may change as the content being rendered changes. Bottlenecks might also be in the code running on the CPU, so you need to measure that too.

After that, you need to do some optimizations on the bottleneck. Keep in mind that there is no single right answer for this; it will differ from one piece of hardware to another.

Some common GPU optimization tricks: