Sets the stream source frequency divider value. This may be used to draw several instances of geometry.
HRESULT SetStreamSourceFreq( UINT StreamNumber, UINT FrequencyParameter );
If the method succeeds, the return value is D3D_OK. If the method fails, the return value can be D3DERR_INVALIDCALL.
There are two constants defined in d3d9types.h that are designed to use with SetStreamSourceFreq: D3DSTREAMSOURCE_INDEXEDDATA and D3DSTREAMSOURCE_INSTANCEDATA. To see how to use the constants, see "Efficiently Drawing Multiple Instances of Geometry (Direct3D 9)".
Header: Declared in D3D9.h.
Library: Use D3D9.lib.
IDirect3DDevice9::GetStreamSourceFreq
Given a scene that contains many objects that use the same geometry, you can draw many instances of that geometry at different orientations, sizes, colors, and so on with dramatically better performance by reducing the amount of data you need to supply to the renderer.
This can be accomplished through the use of two techniques: the first for drawing indexed geometry and the second for non-indexed geometry. Both techniques use two vertex buffers: one to supply geometry data and one to supply per-object instance data. The instance data can be a wide variety of information such as a transform, color data, or lighting data - basically anything that you can describe in a vertex declaration. Drawing many instances of geometry with these techniques can dramatically reduce the amount of data sent to the renderer.
The vertex buffer contains per-vertex data that is defined by a vertex declaration. Suppose that part of each vertex contains geometry data, and part of each vertex contains per-object instance data. This would look like this:
Figure 1: Indexed Geometry Vertex Buffer Layout
This technique requires a device that supports the 3_0 vertex shader model. This technique works with any programmable shader but not with the fixed function pipeline.
For the vertex buffers shown above, here are the corresponding vertex buffer declarations:
const D3DVERTEXELEMENT9 g_VBDecl_Geometry[] = { { 0 , 0 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 }, { 0 , 12 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL, 0 }, { 0 , 24 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TANGENT, 0 }, { 0 , 36 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BINORMAL, 0 }, { 0 , 48 , D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 }, D3DDECL_END() };
const D3DVERTEXELEMENT9 g_VBDecl_InstanceData[] = { { 1 , 0 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 }, { 1 , 16 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 2 }, { 1 , 32 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 3 }, { 1 , 48 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 4 }, { 1 , 64 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 0 }, D3DDECL_END() };
These declarations define two vertex buffers. The first declaration (for stream 0, indicated by the zeros in column 1) defines the geometry data which consists of: position, normal, tangent, binormal, and texture coordinate data.
The second declaration (for stream 1, indicated by the ones in column 1) defines the per-object instance data. Each instance is defined by four four-component floating point numbers, and a four-component color. The first four values could be used to initialize a 4x4 matrix, which means that this data will uniquely size, position, and rotate each instance of the geometry. The first four components use a texture coordinate semantic which, in this case, means "this is a general four-component number." When you use arbitrary data in a vertex declaration, use a texture coordinate semantic to mark it. The last element in the stream is used for color data. This could be applied in the vertex shader to give each instance a unique color.
Before rendering, you need to call IDirect3DDevice9::SetStreamSourceFreq to bind the vertex buffer streams to the device. Here is an example that binds both vertex buffers:
// Set up the geometry data stream pd3dDevice -> SetStreamSourceFreq( 0 , (D3DSTREAMSOURCE_INDEXEDDATA | g_numInstancesToDraw )); pd3dDevice -> SetStreamSource( 0 , g_VB_Geometry, 0 , D3DXGetDeclVertexSize( g_VBDecl_Geometry, 0 )); // Set up the instance data stream pd3dDevice -> SetStreamSourceFreq( 1 ,(D3DSTREAMSOURCE_INSTANCEDATA | 1 )); pd3dDevice -> SetStreamSource( 1 , g_VB_InstanceData, 0 ,D3DXGetDeclVertexSize( g_VBDecl_InstanceData, 1 ));
IDirect3DDevice9::SetStreamSourceFreq uses D3DSTREAMSOURCE_INDEXEDDATA to identify the indexed geometry data. In this case, stream 0 contains the indexed data that describes the object geometry. This value is logically combined with the number of instances of the geometry to draw.
Note that D3DSTREAMSOURCE_INDEXEDDATA and the number of instances to draw must always be set in stream zero.
In the second call, IDirect3DDevice9::SetStreamSourceFreq uses D3DSTREAMSOURCE_INSTANCEDATA to identify the stream containing the instance data. This value is logically combined with 1 since each vertex contains one set of instance data.
The last two calls to IDirect3DDevice9::SetStreamSource bind the vertex buffer pointers to the device.
When you are finished rendering the instance data, be sure to reset the vertex stream frequency back to its default state (which does not use instancing). Because this example used two streams, set both streams as shown below:
pd3dDevice->SetStreamSourceFreq(0,1);
pd3dDevice->SetStreamSourceFreq(1,1);
While it is not possible to make a single conclusion about how much this technique could reduce the render time in every application, consider the difference in the amount of data streamed into the runtime and the number of state changes that will be reduced if you use the instancing technique. This render sequence takes advantage of drawing multiple instances of the same geometry:
if ( SUCCEEDED( pd3dDevice -> BeginScene() ) ) { // Set up the geometry data stream pd3dDevice -> SetStreamSourceFreq( 0 ,( D3DSTREAMSOURCE_INDEXEDDATA | g_numInstancesToDraw)); pd3dDevice -> SetStreamSource( 0 , g_VB_Geometry, 0 , D3DXGetDeclVertexSize( g_VBDecl_Geometry, 0 ));
// Set up the instance data stream pd3dDevice -> SetStreamSourceFreq( 1 , ( D3DSTREAMSOURCE_INSTANCEDATA | 1 )); pd3dDevice -> SetStreamSource( 1 , g_VB_InstanceData, 0 , D3DXGetDeclVertexSize( g_VBDecl_InstanceData, 1 )); pd3dDevice -> SetVertexDeclaration( ); pd3dDevice -> SetVertexShader( ); pd3dDevice -> SetIndices( ); pd3dDevice -> DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0 , 0 , g_dwNumVertices, 0 , g_dwNumIndices / 3 );
pd3dDevice -> EndScene(); }
Notice that the render loop is called once, the geometry data is streamed once, and n instances are streamed once. This next render sequence is identical in functionality, but does not take advantage of instancing【熊猫注:功能上相同,但没有用实例化。我靠!循环这么多次,几个物件就循环几次,这多耗CPU啊。不比不知道,一比吓一跳!硬件实例化的好处是刚刚的!】:
if ( SUCCEEDED( pd3dDevice -> BeginScene() ) ) { for ( int i = 0 ; i < g_numObjects; i ++ ) { pd3dDevice -> SetStreamSource( 0 , g_VB_Geometry, 0 , D3DXGetDeclVertexSize( g_VBDecl_Geometry, 0 )); pd3dDevice -> SetVertexDeclaration( ); pd3dDevice -> SetVertexShader( ); pd3dDevice -> SetIndices( ); pd3dDevice -> DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0 , 0 , g_dwNumVertices, 0 , g_dwNumIndices / 3 ); } pd3dDevice -> EndScene(); }
Notice that the entire render loop is wrapped by a second loop to draw each object. Now the geometry data is streamed into the renderer n times (instead of once) and any pipeline states may also be set redundantly for each object drawn. This render sequence is very likely to be significantly slower. Notice also that the parameters to IDirect3DDevice9::DrawIndexedPrimitive have not changed between the two render loops.
In Drawing Indexed Geometry, vertex buffers were configured to draw multiple instances of indexed geometry with greater efficiency. You can also use IDirect3DDevice9::SetStreamSourceFreq to draw non-indexed geometry. This requires a different vertex buffer layout and has different constraints. To draw non-indexed geometry, prepare your vertex buffers like this:
Figure 2: Non-Indexed Geometry Vertex Buffer Layout
This technique is not supported by hardware acceleration on any device. It is only supported by software vertex processing and will work only with vs_3_0 shaders.
Because this technique works with non-indexed geometry, there is no index buffer. As the diagram shows, the vertex buffer that contains geometry contains n copies of the geometry data. For each instance drawn, the geometry data is read from the first vertex buffer and the instance data is read from the second vertex buffer.
Here are the corresponding vertex buffer declarations:
const D3DVERTEXELEMENT9 g_VBDecl_Geometry[] = { { 0 , 0 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 }, { 0 , 12 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL, 0 }, { 0 , 24 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TANGENT, 0 }, { 0 , 36 , D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BINORMAL, 0 }, { 0 , 48 , D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 }, D3DDECL_END() };
const D3DVERTEXELEMENT9 g_VBDecl_InstanceData[] = { { 1 , 0 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 }, { 1 , 16 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 2 }, { 1 , 32 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 3 }, { 1 , 48 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 4 }, { 1 , 64 , D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 0 }, D3DDECL_END() };
These declarations are identical to the declarations made in the indexed geometry example. Once again, the first declaration (for stream 0) defines the geometry data and the second declaration (for stream 1) defines the per-object instance data. When you create the first vertex buffer, be sure to load it with the number of instances of the geometry data that you will be drawing.
Before rendering, you need to set up the divider which tells the runtime how to divide up the first vertex buffer into n instances. Then set the divider using IDirect3DDevice9::SetStreamSourceFreq like this:
【熊猫注:把流绑到顶点缓冲上,把送水带通到水池子里;
顶点缓冲是流(stream)的源头(source),池子是水的源头;
数据从顶点缓冲里跑到流里,水从池子里跑到送水带里;
数据再从流跑到renderer里,水从送水带跑到庄稼地里;
写代码跟种庄稼真像啊!农民程序员,哇咔咔!】
The first call to IDirect3DDevice9::SetStreamSourceFreq says that stream 0 contains n instances of m vertices. IDirect3DDevice9::SetStreamSourceFreq then binds stream 0 to the geometry vertex buffer.
In the second call, IDirect3DDevice9::SetStreamSourceFreq identifies stream 1 as the source of the instance data. The second parameter is the number of vertices in each object (m). Remember that the instance data stream must always be declared as the second stream. IDirect3DDevice9::SetStreamSourceFreq then binds stream 1 to the vertex buffer that contains the instance data.
When you are finished rendering the instance data, be sure to reset the vertex stream frequency back to its default state. Because this example used two streams, set both streams as shown below:
pd3dDevice->SetStreamSourceFreq(0,1);
pd3dDevice->SetStreamSourceFreq(1,1);
The major advantage of this instancing style is that it can be used on non-indexed geometry. While it is not possible to make a single conclusion about how much this technique could reduce the render time in every application, consider the difference in the amount of data streamed into the runtime, and the number of state changes that will be reduced for the following render sequence:
if ( SUCCEEDED( pd3dDevice -> BeginScene() ) ) { // Set the divider pd3dDevice -> SetStreamSourceFreq( 0 , 1 ); pd3dDevice -> SetStreamSource( 0 , g_VB_Geometry, 0 ,D3DXGetDeclVertexSize( g_VBDecl_Geometry, 0 )); // Set up the instance data stream pd3dDevice -> SetStreamSourceFreq( 1 , verticesPerInstance)); pd3dDevice -> SetStreamSource( 1 , g_VB_InstanceData, 0 , D3DXGetDeclVertexSize( g_VBDecl_InstanceData, 1 )); pd3dDevice -> SetVertexDeclaration( ); pd3dDevice -> SetVertexShader( ); pd3dDevice -> SetIndices( ); pd3dDevice -> DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0 , 0 , g_dwNumVertices, 0 , g_dwNumIndices / 3 ); pd3dDevice -> EndScene(); }
Notice that the render loop is called once. The geometry data is streamed once although there are n instances of the geometry being streamed. The data from the instance vertex buffer is streamed once. This next render sequence is identical in functionality, but does not take advantage of instancing:
if ( SUCCEEDED( pd3dDevice -> BeginScene() ) ) { for ( int i = 0 ; i < g_numObjects; i ++ ) { pd3dDevice -> SetStreamSource( 0 , g_VB_Geometry, 0 ,D3DXGetDeclVertexSize( g_VBDecl_Geometry, 0 )); pd3dDevice -> SetVertexDeclaration( ); pd3dDevice -> SetVertexShader( ); pd3dDevice -> SetIndices( ); pd3dDevice -> DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0 , 0 , g_dwNumVertices, 0 , g_dwNumIndices / 3 ); } pd3dDevice -> EndScene(); }
Without instancing, the render loop needs to be wrapped by a second loop to draw each object. By eliminating the second render loop, you should expect better performance due to fewer render state changes that are called inside the loop.
Overall, it is reasonable to expect the indexed technique (Drawing Indexed Geometry) to perform better than the non-indexed technique (Drawing Non-Indexed Geometry) because the indexed technique only streams one copy of the geometry data. Notice that the parameters to IDirect3DDevice9::DrawIndexedPrimitive have not changed for any of the render sequences.
This sample demonstrates the instancing feature available with Direct3D 9. A vs_3_0 device is required for the hardware version of this feature. The sample also shows alternate ways of achieving results similar to hardware instancing, but for adapters that do not support vs_3_0. The shader instancing technique shows the benefits of efficient batching of primitives.
Note On graphics hardware that does not support vs_2_0, the sample will run as a reference device.
Source | SDK root\Samples\C++\Direct3D\Instancing |
---|---|
Executable | SDK root\Samples\C++\Direct3D\Bin\x86 or x64\Instancing.exe |
The sample demonstrates four different rendering techniques to achieve the same result: to render many nearly identical boxes (objects with small numbers of polygons) in the scene. The boxes differ by their position and color.
The user can vary the number of boxes in the scene between one and 1000. Use the sample to monitor performance as the number of boxes increases. As you change the number of boxes, the vertex and index buffer resources are recreated by OnCreateBuffers and OnDestroyBuffers.
Hardware instancing requires a vs_3_0-capable device. The instance-specific data is stored in a second vertex buffer. The rendering is implemented in the sample application's OnRenderHWInstancing method. IDirect3DDevice9::SetStreamSourceFreq is used with D3DSTREAMSOURCE_INDEXEDDATA to specify the number of boxes and with D3DSTREAMSOURCE_INSTANCEDATA to specify the frequency of the instance data (in this case, frequency equals one).
Drawing the scene is accomplished as follows:
pd3dDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0, 0, 4 * 6, 0, 6 * 2 );
This is the most efficient technique that does not use the hardware to perform the instancing. One call to IDirect3DDevice9::DrawIndexedPrimitive handles multiple primitives at once. The instance data is stored in a system memory array, which is then copied into the vertex shader's constants at draw time. Since vs_2_0 only guarantees 256 constants, it is not possible for the entire array of instance data to fit at once. This sample's .fx file can only batch-process 120 constants (one float4 is required for box position, and one float4 is required for box color). Rendering is performed in the sample's OnRenderShaderInstancing method.
int nRenderBoxes = min( nRemainingBoxes, g_nNumBatchInstance );
pd3dDevice -> DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0 , 0 , nRenderBoxes * 4 * 6 , 0 , nRenderBoxes * 6 * 2 );
This method would also work on a vs_1_1 part, although not as efficiently because vs_1_1 only guarantees 96 constants. This means that batching is nearly three times more efficient on vs_2_0 hardware than on vs_1_1 hardware, but even on vs_1_1 hardware, batching still outperforms no batching.
This technique is somewhat similar to Technique 2 without the batching (batch size = 1). It will render one box at a time, by first setting its position and color and then calling DrawIndexedPrimitive. The technique is implemented in the sample's OnRenderConstantsInstancing method.
As in Technique 1, this technique uses a second vertex buffer to store instance data. Instead of setting vertex shader constants to update the position and color for each box, this technique changes the offset in the instance stream buffer and draws one box at a time. The technique is implemented in the sample's OnRenderStreamInstancing method.
When the sample renders a large number of boxes, only Techniques 1 and 2 are GPU-bound, while Techniques 3 and 4 waste many CPU cycles by making numerous calls per frame to DrawIndexedPrimitive.【熊猫注:果然是像之前思考的,减少DP调用,节省了CPU资源,减少了GPU Idle,当物件足够多,即使用HWInstancing,fps也很低,是因为这时GPU成为了瓶颈。浅显!】
The performance numbers for rendering a large number of boxes show that only the first two techniques are GPU-bound, while the last two techniques waste a lot of CPU cycles by making numerous calls to DrawIndexedPrimitive per frame. To better illustrate the penalty from being CPU-bound in a real world application, the sample has the Goal option which simulates a secondary task (in a game, this could be AI or physics) competing for CPU. The sample queries the time elapsed since the last frame and if there is time remaining inside the goal time/frame, the sample allows time to pass to represent game logic. It reports the amount of time used in the Remaining for logic statistic. This is the percentage of CPU cycles that are available to spend on non-rendering algorithms such as game logic or physics; the higher the percentage, the better. If the scene is loaded with 1000 boxes, it was observed that the efficient techniques have 2x better CPU efficiency than the CPU-bound techniques. In other words, using Techniques 3 and 4 to render many instances will starve other tasks of CPU time.
The following table summarizes hardware and memory requirements for different rendering techniques:
Instancing
This sample uses Microsoft's Direct3D9 Instancing Group to render thousands of meshes on the screen with only a handful of draw calls. This significantly reduces the CPU overhead of submitting many separate draw calls and is a great technique for rendering trees, rocks, grass, RTS units and other groups of similar (but necessarily identical) objects.
Gamebryo Lightspeed
MeshInstancing 一堆齿轮的那个例子