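1. Merge render batches
Merge the particles emitted by the same emitter (they are guaranteed to share a material);
Merge effect nodes that use the same material. (Transparent nodes are merged without depth sorting; the visual result is acceptable, so there is no need for techniques such as order-independent transparency.)
Problem encountered: merging effect nodes drawn as triangle lists is straightforward, but what about triangle strips? Use degenerate triangles!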
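A minimal sketch of the triangle-list case, assuming each effect node provides a vertex array and a list of indices into it. The function name `merge_triangle_lists` and the `(vertices, indices)` layout are illustrative assumptions, not part of any particular engine. The only work needed is to offset each batch's indices by the number of vertices already merged.

```python
def merge_triangle_lists(batches):
    """Merge (vertices, indices) batches that share one material into one batch.

    Hypothetical helper: each batch is a pair of a vertex list and a
    triangle-list index list that refers to that batch's own vertices.
    """
    merged_vertices = []
    merged_indices = []
    for vertices, indices in batches:
        base = len(merged_vertices)  # index offset for this batch
        merged_vertices.extend(vertices)
        merged_indices.extend(base + i for i in indices)
    return merged_vertices, merged_indices


# Two quads, each made of two triangles, merged into a single draw call.
quad_a = ([(0, 0), (1, 0), (0, 1), (1, 1)], [0, 1, 2, 2, 1, 3])
quad_b = ([(5, 5), (6, 5), (5, 6), (6, 6)], [0, 1, 2, 2, 1, 3])
vertices, indices = merge_triangle_lists([quad_a, quad_b])
```

Because a triangle list has no implicit connectivity between consecutive triangles, no bridging geometry is needed; that is what makes this case easier than triangle strips.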
http://realtimecollisiondetection.net/blog/?p=91
Optimizing the rendering of a particle system
There are many things that can kill the frame rate in a modern game, and particles are up near the top of the list of causes. A key contributing factor is that particles are subject to a lot of overdraw that is not present in your opaque geometry.
The reason for the increase in overdraw is that for particles we tend to have lots of individual primitives (usually quads) that are overlapped, perhaps to mimic effects like fire or smoke. Normally each particle primitive is translucent (alpha-blended) so the z-buffer is not updated as pixels are written and we end up rendering to pixels multiple times. (In contrast, for opaque geometry we do write to the z-buffer, so between a possible z-prepass, sorting objects front-to-back, hierarchical z-culling on the GPU, and normal depth testing, the result is that we have very little overdraw.)
Overdraw, in turn, leads to increased uses of both fillrate (how many pixels the hardware can render to per second) and bandwidth (how much data we can transfer to/from the GPU per second), both of which can be scarce resources.
Of course, overdraw is not the only reason particles can slow your frame rate to a crawl. We can also get bitten by other problems, like setting too much state for each particle or particle system.
What to do?
OK, let’s say we all agree that particles can cause a lot of problems. What to do? Fortunately there are lots of things that can be done to optimize the rendering side of a particle system. For whatever reason, few of these are discussed in books or articles, so I thought I’d list out a few things we can do. Feel free to add more suggestions as comments.
- Use opaque particles. For example, make smoke effects really thick so (some or all of) the particle billboards can be opaque, with cutout alpha. For some particles, like shrapnel, rocks, or similar, use lightweight geometry particles instead of sprites with alpha borders.
- Use richer particles. Put more oomph in a single particle sprite so we need fewer of them. Use flipbook textures for creating billowing in e.g. fire and smoke, rather than stacking sprites.
- Reduce dead space around cutout-alpha particles. Use texkill to not process the transparent regions of the sprite. Alternatively, trim away the transparent areas around the particle, using an n-gon fan instead of just a quad sprite. (But beware of lower quad (2×2 pixel block) utilization as the triangle count increases, and of becoming vertex bound in the distance; LOD from the fan to a quad sprite for distant particles.)
- Cap the total number of particles. Use hardware counters on the graphics card to count how many particle pixels have been rendered and stop emitting or drawing particles when passing a certain limit (which can be set dynamically).
- Use frequency divider to reduce data duplication. We can reduce bandwidth and memory requirements by sharing data across particle vertices using the frequency divider, instead of through data duplication across vertices. (Arseny Kapoulkine described this well in Particle rendering revisited.)
- Reduce state changes. Share shaders between particles. We can make this happen by e.g. dropping features for distant particles (such as dropping the normal map as soon as possible).
- Reduce the need for sorting. Rely on additively or subtractively blended particles where possible. E.g. additive particles can be drawn in any order, so we can sort them to e.g. reduce state changes instead of sorting on depth (as is likely needed for normal alpha-blended particles).
- Draw particles after AA-resolve. Most games today use multisampled antialiasing (MSAA), drawing to a 2x or 4x MSAA buffer. These buffers must be resolved (with an AA-resolve pass) into a non-MSAA buffer before display. Due to the way MSAA works, we still run the pixel shader equally many times whether we draw particles before or after the AA-resolve, but by drawing particles after the AA-resolve we drastically reduce frame buffer reads and writes, ROP costs, etc.
- Draw particles into a smaller-resolution buffer. We can also draw particles into a separate, smaller-resolution buffer (smaller than the frame buffer or the AA-resolved buffer). The exact details vary depending on e.g. whether you use RGBA or FP16 frame buffers, but the basic idea is the following. First we draw the opaque geometry into our standard frame buffer. We then shrink the resulting corresponding z-buffer down to 1/4 or 1/16 size, then draw particles to 1/4 or 1/16 size frame buffer using the smaller z-buffer. After we’re done, we scale the smaller frame buffer back up and composite it onto the original frame buffer. (Some interesting details I’ve left out are how exactly to perform the z-buffer shrinking and the composite steps.)
- Use MSAA trick to run pixel shader less. On consoles (and on PC hardware if drivers allowed it) you can tell the GPU to treat, say, an 800×600 frame buffer as if it were a 400×300 (ordered) 4xMSAA buffer. Due to the way MSAA works, this has the effect of running the pixel shader only once per 2×2 pixels of the original buffer, at the cost of blurring your particles equivalently much. (Though you still get the benefit of antialiasing at the edges of the particles.)
- Generate particles “on chip.” On newer (PC) hardware we can use geometry shaders to generate particles on the GPU instead of sending the vertex data from the CPU. This saves memory and bandwidth.
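To illustrate the "reduce the need for sorting" item above, here is a small CPU-side sketch that compares additive blending with normal alpha blending under every draw order. The function names are hypothetical, the colors are unclamped floats, and the sample values were chosen so the arithmetic is exact; real hardware clamps to the render target's range.

```python
from itertools import permutations


def blend_additive(dst, src_rgb, src_a):
    # dst + src * alpha: addition commutes, so draw order does not matter.
    return tuple(d + s * src_a for d, s in zip(dst, src_rgb))


def blend_alpha(dst, src_rgb, src_a):
    # src * alpha + dst * (1 - alpha): the result depends on draw order.
    return tuple(s * src_a + d * (1.0 - src_a) for d, s in zip(dst, src_rgb))


def draw(particles, blend, background=(0.0, 0.0, 0.0)):
    color = background
    for rgb, alpha in particles:
        color = blend(color, rgb, alpha)
    return color


particles = [
    ((1.0, 0.0, 0.0), 0.5),
    ((0.0, 1.0, 0.0), 0.25),
    ((0.0, 0.0, 1.0), 0.5),
]
additive_results = {draw(p, blend_additive) for p in permutations(particles)}
alpha_results = {draw(p, blend_alpha) for p in permutations(particles)}
```

Here `additive_results` holds a single color while `alpha_results` holds several, which is why additive particles can be sorted by state instead of by depth.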
We can also attempt some more esoteric stuff, like:
- Compose particles front-to-back premultiplied-alpha style. Using premultiplied alpha (which is associative) we can blend particles front-to-back instead of the normal back-to-front ordering. The idea here is to use the front-to-back drawing to fill in depth or stencil buffer when alpha has become (near) solid and ultimately stop drawing particles altogether (when they no longer contribute much to the visual scene, off in the distance).
- Group particles together into one particle entity. Instead of drawing two overlapping particles individually, we can form a single (larger) particle that encompasses the two particles and performs the blending of the two particles directly in the shader. This tends to reduce the amount of frame buffer reads we do (as we now only have to blend one particle) but it can also increase it (if the single particle covers much more area than the union of the two original particles).
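The front-to-back idea can be checked on the CPU with a short sketch. It assumes colors are premultiplied `(r, g, b, a)` tuples and models only the blending arithmetic; the function names and the `opaque_alpha` threshold are illustrative assumptions, and the depth/stencil fill mentioned above is reduced here to a simple early exit.

```python
def over(front, back):
    """Premultiplied-alpha 'over': front + (1 - front.a) * back."""
    k = 1.0 - front[3]
    return tuple(f + k * b for f, b in zip(front, back))


def composite_back_to_front(layers_front_first, background):
    color = background
    for layer in reversed(layers_front_first):
        color = over(layer, color)
    return color


def composite_front_to_back(layers_front_first, background, opaque_alpha=0.999):
    acc = (0.0, 0.0, 0.0, 0.0)  # accumulated premultiplied color, nearest first
    drawn = 0
    for layer in layers_front_first:
        if acc[3] >= opaque_alpha:
            break  # accumulated alpha is (near) solid: skip farther layers
        acc = over(acc, layer)
        drawn += 1
    return over(acc, background), drawn


layers = [(0.5, 0.0, 0.0, 0.5), (0.0, 0.25, 0.0, 0.5), (0.0, 0.0, 0.75, 0.75)]
white = (1.0, 1.0, 1.0, 1.0)
reference = composite_back_to_front(layers, white)
result, drawn = composite_front_to_back(layers, white)
```

Both orders give the same color because `over` is associative for premultiplied colors; the front-to-back version can additionally stop early once the accumulated alpha is close to 1.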
My original intent was to categorize all these items in terms of what they were saving (overdraw, fillrate, ROP, pixel shader ALU, etc.) but creating the 2D matrix of feature vs. savings was a little more effort than I had time for. Hopefully it’s still clear what each bulleted task would achieve.
All in all, the above’s a long list, but I’m sure I left something out, so please comment. Also, what’s your favorite “trick?”
Introduction to degenerate triangles
http://www.cppblog.com/lovedday/archive/2008/03/02/43557.html
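Triangle strips
A triangle strip is a list of triangles in which each triangle shares an edge with the previous one. Figure 14.2 shows an example of a triangle strip.
Note that the vertices are listed in an order such that every three consecutive vertices form a triangle. For example:
(1) Vertices 1, 2, and 3 form the first triangle.
(2) Vertices 2, 3, and 4 form the second triangle.
(3) Vertices 3, 4, and 5 form the third triangle.
In Figure 14.2 the vertices are numbered in the order in which they build the strip. "Index" information is no longer needed, because the vertex order implicitly defines the triangles. Usually either the vertex count is given at the start of the list, or a special code at the end marks "end of list."
Note that the vertex order alternates between clockwise and counterclockwise (see Figure 14.3). On some platforms the winding order of the first triangle must be specified; on others it is fixed.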
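The rule "every three consecutive vertices form a triangle" and the alternating winding can be sketched as follows. The function name `strip_to_triangles` and the `fix_winding` flag are illustrative assumptions; swapping the first two vertices of every odd-numbered triangle is one common way to give all triangles the same orientation as the first.

```python
def strip_to_triangles(strip, fix_winding=True):
    """Expand a triangle strip into a list of (a, b, c) triangles."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        if fix_winding and i % 2 == 1:
            a, b = b, a  # undo the winding flip of every second triangle
        triangles.append((a, b, c))
    return triangles


raw = strip_to_triangles([1, 2, 3, 4, 5], fix_winding=False)
consistent = strip_to_triangles([1, 2, 3, 4, 5])
```

With `fix_winding=False` the output matches the numbered list above exactly; with the default, the second triangle is emitted as (3, 2, 4) so that its orientation matches the first.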
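In the best case, a triangle strip stores n faces using n+2 vertices. For large n, that averages about one vertex sent per triangle. Unfortunately, this is only the best case. In practice, many meshes cannot be expressed as a single strip, and vertices shared by more than three triangles still have to be sent to the graphics card multiple times. Put another way, a strip sends at least one vertex per triangle. With a vertex cache, however, it is possible to bring the number of vertices sent per triangle below one. A vertex cache does need extra bookkeeping (indices and cache management data), and although that overhead is relatively large per vertex and slows processing somewhat, the system that sends the fewest vertices is the fastest on certain platforms.
Assuming a straightforward way of generating triangle strips, representing a triangle mesh with strips requires t+2s vertices, where t is the number of triangles and s is the number of strips. The first triangle of each strip costs three vertices, and each subsequent triangle costs one. Because we want to minimize the number of vertices sent to the graphics card, the number of strips should be as small as possible; that is, the longer the strips, the better. The STRIPE method produces a number of strips close to the theoretical lower bound.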
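As a quick worked example of the t+2s count, the following sketch compares it with an unindexed triangle list, which sends three vertices per triangle. The function names and the mesh size of 100 triangles are arbitrary choices for illustration.

```python
def list_vertex_count(t):
    """Vertices sent for t triangles as an unindexed triangle list."""
    return 3 * t


def strip_vertex_count(t, s):
    """Vertices sent for t triangles split across s strips: t + 2s."""
    return t + 2 * s


t = 100
as_list = list_vertex_count(t)               # 300
as_one_strip = strip_vertex_count(t, 1)      # 102
as_ten_strips = strip_vertex_count(t, 10)    # 120
as_single_tris = strip_vertex_count(t, t)    # 300, every strip is one triangle
```

The last case shows the formula degrading to the triangle-list cost when every strip has length one, which is why fewer, longer strips are preferred.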
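Another reason to reduce the number of strips is that setting up each strip takes additional time. In other words, rendering two strips of length n separately takes longer than rendering one strip of length 2n, even if that single strip contains more triangles than the two separate strips combined. For this reason we often join multiple strips with degenerate triangles so that the whole mesh fits in one continuous strip; "degenerate" means having zero area. Figure 14.4 shows how repeating vertices merges two triangle strips into one.
The meaning of Figure 14.4 is not obvious at first: four degenerate triangles are used to connect the two strips while preserving the correct clockwise/counterclockwise ordering. The edge between vertices 7 and 8 actually contains two degenerate triangles. Figure 14.5 lists the triangles contained in Figure 14.4.
Degenerate triangles have zero area and are not rasterized, so they have little impact on performance. The vertices actually sent to the graphics card are still just those in the first column: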
1,2,3,4,5,6,7,8,9,10,11,12,13
This follows our convention that every three consecutive vertices define a triangle.
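A minimal sketch of joining two strips with degenerate triangles, assuming vertices are identified by integer IDs so that a repeated ID means a zero-area triangle. The names `join_strips` and `strip_triangles` are hypothetical. Repeating the last vertex of the first strip and the first vertex of the second produces the four degenerate triangles described above; the extra repeat for an odd-length first strip is an added assumption, a common way to keep the second strip's winding intact.

```python
def join_strips(a, b):
    """Join two triangle strips into one using repeated (degenerate) vertices."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    bridge = [a[-1], b[0]]       # yields four zero-area triangles
    if len(a) % 2 == 1:
        bridge.insert(0, a[-1])  # one more repeat keeps b's winding parity
    return list(a) + bridge + list(b)


def strip_triangles(strip):
    """Expand a strip, skipping degenerate triangles and fixing winding."""
    triangles = []
    for i in range(len(strip) - 2):
        v0, v1, v2 = strip[i], strip[i + 1], strip[i + 2]
        if v0 == v1 or v1 == v2 or v0 == v2:
            continue  # zero area: nothing to draw
        triangles.append((v0, v1, v2) if i % 2 == 0 else (v1, v0, v2))
    return triangles


a = [0, 1, 2, 3]
b = [10, 11, 12, 13]
joined = join_strips(a, b)
```

Expanding `joined` gives the same visible triangles, with the same winding, as expanding `a` and `b` separately, at the cost of two extra vertices in the stream.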
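Some hardware (such as the GS on the PS2) can skip triangles in a strip: a flag bit on a vertex indicates that the triangle should not be drawn. This gives us an efficient way to start a new strip at any point without repeating vertices or using degenerate triangles. For example, the two strips in Figure 14.4 can be joined as shown in Figure 14.6, where gray marks vertices flagged "do not draw."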
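The flag-based approach can be modeled in a few lines. This is a simplified, hypothetical model and not the actual PS2 GS interface: it assumes each vertex carries a boolean `draw` flag that controls whether the triangle completed by that vertex is drawn, and it ignores winding order.

```python
def concat_with_skip_flags(a, b):
    """Concatenate two strips, flagging the first two vertices of b as no-draw."""
    flagged = [(v, True) for v in a]
    flagged += [(v, i >= 2) for i, v in enumerate(b)]
    return flagged


def drawn_triangles(flagged):
    """Emit the triangle completed by each vertex whose draw flag is set."""
    triangles = []
    for i in range(2, len(flagged)):
        vertex, draw = flagged[i]
        if draw:
            triangles.append((flagged[i - 2][0], flagged[i - 1][0], vertex))
    return triangles


stream = concat_with_skip_flags([1, 2, 3, 4], [5, 6, 7, 8])
triangles = drawn_triangles(stream)
```

The two triangles that would bridge the strips, (3, 4, 5) and (4, 5, 6), are suppressed by the flags, so the stream contains exactly the eight original vertices.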
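Triangle fans
Triangle fans are similar to triangle strips but less flexible, so they are rarely used. Figure 14.7 shows a triangle fan.
A triangle fan stores n faces using n+2 vertices, the same as a triangle strip. However, the first vertex must be shared by all of the triangles, so in practice there are few opportunities to use large fans. In addition, fans cannot be joined the way strips can. Fans therefore suit only special cases; for general use, strips are more flexible.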
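For comparison with strips, a fan can be expanded as sketched below. The function name `fan_to_triangles` is an illustrative assumption; the first vertex acts as the shared hub, and each further pair of consecutive vertices adds one triangle.

```python
def fan_to_triangles(fan):
    """Expand a triangle fan: every triangle shares the first vertex."""
    hub = fan[0] if fan else None
    return [(hub, fan[i], fan[i + 1]) for i in range(1, len(fan) - 1)]


fan = [0, 1, 2, 3, 4]
triangles = fan_to_triangles(fan)
```

Five vertices yield three triangles, matching the n+2 count, and unlike a strip the winding does not alternate.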