xchangefull.blogg.se - Shift shader 3.0

Shift shader 3.0 for free

In the past the hardware had a lot of global device state. Things sat in registers that the hardware units read.

Interpolators used to be free for shaders with a sufficiently large ratio of instructions to interpolators, although short shaders or shaders with many interpolators could easily become interpolator-bound. On DX11 hardware, where interpolation is an ALU instruction, you basically pay for the interpolation cost. Shaders that were previously interpolator-bound could now become ALU-bound and likely run substantially faster, whereas shaders that weren't previously interpolator-bound would now see a slowdown due to the additional interpolation cost.
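The trade-off can be made concrete with a toy cost model. This is only a sketch: the function names and the per-interpolant costs (4 cycles on the fixed-function path, 2 ALU instructions on the ALU path) are made-up assumptions chosen for illustration, not measured hardware numbers.

```python
def cost_fixed_function(alu_instructions, interpolants, cycles_per_interpolant=4):
    """Older model: interpolators run beside the ALUs, so the slower unit sets the pace.

    cycles_per_interpolant is a hypothetical figure, not a hardware spec.
    """
    return max(alu_instructions, interpolants * cycles_per_interpolant)


def cost_alu_interpolation(alu_instructions, interpolants, alu_ops_per_interpolant=2):
    """DX11-style model: every interpolant costs extra ALU instructions.

    alu_ops_per_interpolant is likewise a hypothetical figure.
    """
    return alu_instructions + interpolants * alu_ops_per_interpolant


# Two hypothetical shaders: a short one with many interpolants, a long one with few.
short_shader = dict(alu_instructions=8, interpolants=8)
long_shader = dict(alu_instructions=100, interpolants=4)

for name, shader in (("short", short_shader), ("long", long_shader)):
    print(name, cost_fixed_function(**shader), cost_alu_interpolation(**shader))
```

Under these assumed numbers the short shader was interpolator-bound and gets cheaper (32 versus 24), while the long shader pays a small extra cost (100 versus 108).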


A side effect of this trend is that things that previously were more or less for free could now come at a moderate cost in terms of ALU instructions.

Vertex fetch has been done by the shader for a long time. Export conversion has been handled by the ALUs since GCN. Gradients have moved a bit back and forth. Since gradients are needed by the texture units anyway, it made sense in the past to let them handle it; however, now that GCN has a generic lane swizzle, the ALUs have all the tools to do the work themselves, so now it's done in the ALUs again.
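To see why a lane swizzle is all the ALUs need for gradients, here is a minimal CPU-side sketch. It assumes a 2x2 pixel quad laid out as lanes 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right; the helper names `quad_swizzle` and `fine_gradients` are hypothetical and do not correspond to real hardware instructions.

```python
def quad_swizzle(values, pattern):
    """Return a new quad where lane i reads the value held by lane pattern[i]."""
    return [values[src] for src in pattern]


def fine_gradients(values):
    """Per-lane horizontal and vertical differences within one 2x2 quad.

    Assumed lane layout: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
    """
    left = quad_swizzle(values, (0, 0, 2, 2))    # left neighbour in the same row
    right = quad_swizzle(values, (1, 1, 3, 3))   # right neighbour in the same row
    top = quad_swizzle(values, (0, 1, 0, 1))     # top neighbour in the same column
    bottom = quad_swizzle(values, (2, 3, 2, 3))  # bottom neighbour in the same column
    ddx = [r - l for l, r in zip(left, right)]
    ddy = [b - t for t, b in zip(top, bottom)]
    return ddx, ddy


# One value per lane, e.g. a texture coordinate scaled to integers.
ddx, ddy = fine_gradients([10, 13, 20, 27])
print(ddx, ddy)
```

Each gradient is just a subtraction between two swizzled copies of the same register, which is ordinary ALU work once lanes can read each other's values.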

The general trend is that more and more fixed-function units move over to the shader cores. This makes a lot of sense from a transistor budget point of view and is something that has been going on for a long time. Interpolators, for example, became ALU instructions with DX11 hardware.

This presentation will assume that you already know what things like "MAD-form" mean. Refer to the slide deck from last year ("Low-level Thinking in High-level Shading Languages") for details on these optimizations.

Back in the DX9 era a cubemap lookup was still a single sample instruction, and the same was true for projective textures (tex2Dproj). In DX10 direct support for projective textures was removed, with the expectation that shaders that need projective texturing will simply do the division by w manually. This reflected the fact that no hardware did the division by w in the texture unit anymore, so there was no need to pretend it did.

Cubemaps have gone the same way in hardware. Obviously we still need fast sampling, so cubemaps are still a first-class citizen in the API and will likely remain that way; however, the coordinate normalization is not something that we want to spend an awful lot of transistors on when those transistors could rather be used to add more general ALU cores instead. The cost of this fixed-function hardware could no longer be justified when we have so many ALU units that would be perfectly capable of doing this math. Consequently, this is handled by the ALUs these days. The D3D bytecode still treats sampling a cubemap as a simple sample instruction. However, it may surprise you what this expands to in native hardware instructions.

The shader that actually runs in the end consists of several different types of instructions. Vector ALU instructions (VALU) are your typical math instructions and operate on wide SIMD vectors across all threads/pixels/vertices, while scalar instructions (SALU) operate on things that are common to all threads. There is also an image instruction (IMG) that does the actual sampling, and finally an export instruction (EXP) that writes out the final output data, which in the case of a pixel shader is what lands in your framebuffer. More on these instructions later in this presentation.
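As a rough illustration of the math that moved from the texture unit into the ALUs, the sketch below models the projective divide and the cubemap coordinate normalization on the CPU. It is not the native instruction sequence: the function names are hypothetical, and the face numbering and sign conventions are assumed to follow the common +X, -X, +Y, -Y, +Z, -Z ordering used by the OpenGL specification.

```python
def projective_uv(u, v, w):
    """Manual replacement for a projective lookup: divide by w before sampling."""
    return u / w, v / w


def cube_face_coords(x, y, z):
    """Map a direction vector to (face, s, t) with s and t in [0, 1].

    Assumed face order: 0 = +X, 1 = -X, 2 = +Y, 3 = -Y, 4 = +Z, 5 = -Z.
    A zero vector has no major axis and raises ZeroDivisionError.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        face, sc, tc, ma = (0, -z, -y, ax) if x >= 0 else (1, z, -y, ax)
    elif ay >= az:
        face, sc, tc, ma = (2, x, z, ay) if y >= 0 else (3, x, -z, ay)
    else:
        face, sc, tc, ma = (4, x, -y, az) if z >= 0 else (5, -x, -y, az)
    inv_ma = 1.0 / ma  # one reciprocal, then a multiply-add per coordinate
    return face, 0.5 * (sc * inv_ma + 1.0), 0.5 * (tc * inv_ma + 1.0)


print(projective_uv(2.0, 4.0, 8.0))
print(cube_face_coords(2.0, 1.0, -1.0))
```

Even in this simplified form, a single cubemap sample implies a major-axis selection, a reciprocal, and a couple of multiply-adds before the 2D lookup can happen, which is the kind of hidden ALU cost the text refers to.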