direct to video

March 15, 2012

get my slides from GDC2012.

As promised, they’re here! I’m afraid I had to delete all the videos, but apparently the recording of the full thing should be in the GDC Vault at some point.

[PDF here]


Yes I am aware that SlideShare managed to crop the bejesus out of my presentation

To everyone who showed up to my talk – thanks for coming! Here are the slides as a memento of the occasion!
To anyone who couldn’t make it and wants to read the slides, here they are! Good luck making sense of them!
To anyone who was at GDC but went to something else instead – here’s what you missed!

If you did see the presentation live, I was supposed to ask you to fill out the evaluation forms (only if you liked it, obviously – I don’t want to get 100 forms back saying “bat shit mental”). Oh, and I was also supposed to ask you to turn off your mobile phone, no flash photography, no video cameras, and that there are two exits at the back and to file out in row order in case of emergency, but I forgot. Apparently we all made it out alive.
Please do tell me what you thought on here too.

February 14, 2012

come see me talk about directX11 at gdc 2012.

Filed under: demoscene, directx 11, fluid dynamics, particles, realtime rendering — Tags: — directtovideo @ 2:59 pm

Quiet around here, isn’t it?
That’s because I’m going to be speaking at GDC 2012 about advanced procedural rendering in DirectX 11! (So I’m saving all the good material for that. Sorry.)

I’ll be talking about how we’ve used D3D11’s features to handle things like mesh generation and fluid dynamics for our upcoming demos, to give us huge advancements over our old DX9 engine – and in a way that you might consider practical enough to start thinking about for future game titles.

For those who are just starting on DX11 or are only thinking about it I’ll also try and give an overview of building blocks you really need to know about for tackling problems with compute efficiently, like stream compaction and prefix sums, and where they fit into actual real-world problems like implementing marching cubes, smoothed particle hydrodynamics and mesh smoothing.

Or you could just come and look at the pictures.

GDC2012: Advanced Procedural Rendering with DirectX 11

Thursday March 8th, 4:00- 5:00pm Room 2009, West Hall, 2nd Floor. Be there. We’re going to be doing shots off the front of the stage after every other slide, so bring some salt.

May 3, 2011

numb res.

Filed under: demoscene, fluid dynamics, particles, realtime rendering — directtovideo @ 4:39 pm

Numb Res by CNCD & Fairlight

numb res. get it?

pouet exe version video video (anaglyph 3d) youtube vimeo


It was easter. We made a new demo for The Gathering 2011.Yea, that’s right – in Norway, not in Germany. I really wanted to do a new demo because I’ve been collecting new routines all winter, and it was high time they got into the wild. So about 3 weeks before easter Jani and I started bouncing ideas around (“something with fluids” was the sumtotal of that I think). Then we went on the hunt for music. As some may know, we don’t have an active musician we work with regularly in Fairlight or CNCD anymore; we have to outsource. So I dropped a message on facebook half-jokingly asking if anyone had a spare soundtrack. I’m not sure whether that was a good idea or not but I spoke to Ruairi (RC55), who put me in touch with Tom Wright (aka Stereo Wildlife). He’s produced a beautiful new album and agreed to let us use one of the tracks – and even did a bit of remixing to make it fit the demo. So, music was ready from day 1. This is such a huge bonus when making a demo; it meant we could completely design around it, plan out what scenes we wanted straight away and know they’d fit.

The demo was envisaged as a “small project” – a relatively low budget production. Low budget meaning less development time, fewer resources. Weeks to make by a small team. Frameranger for example is a very “high budget” demo – lots of people, over a year in the making, tonnes of art assets and specifically made effects, and lots and lots of wasted work. This one is very different; there’s only one hand-modelled mesh in the whole thing that’s “rendered” properly (the head at the start and end), although there’s lots of meshes used for other things in the demo. We wanted an effect-led production. The first thing that happened was that Jani designed the numbers scene in Lightwave: creating meshes for each number, placing them in the scene, timing them and making a camera path for the whole lot. Meanwhile I was working on effect development. Then Jani developed the introduction part with the head more or less on his own, and modelled and tweaked the tracks for the fluid parts while I worked on fleshing out the numbers scene with elements and effects. Then we integrated and worked together to finish. With a week or so to go there was a touch of panic and it looked like we weren’t going to get there; but in the end we found ourselves more or less done 5 days before the competition. For once we had time to polish, tweak and optimise. Hope it shows..

As an aside: the Gathering was a great event for us not least because they also held the Awards, which recognises the best demoscene productions from last year. We got 11 nominations and after a very rock & roll ceremony full of glitz and fireworks came away with 4 awards: Ceasefire for best music, Agenda Circling Forth for best effects, technical achievement and the cherry on the cake: best demo of 2010. Ooooh. Apparently we just missed out on Public’s Choice by a few points – but hey, no accounting for taste.. ๐Ÿ˜‰

32. Particles. Again?

I’ve realised over time that I’m not really a traditional “democoder”. I’m a graphics researcher who happens to prefer to show his new work off in whatever demo we make next. That probably goes some way to explaining why I do things the way I do: researching and improving on certain areas (like particle systems or fluid dynamics. but not ribbons. bitches.). Some would say that fluids or particles are effects: you “do” fluids for a scene in a demo, then you go “do” something completely different. I don’t subscribe to that. For me the achievement in a demo like this is not to implement fluids: we first used fluid dynamics in a demo 5 years ago. The challenge is to move the field on – to do something new with it that nobody else has managed to do in realtime yet, or not on the same scale. Of course there’s a point where this gets lost on the viewer, and maybe it does just become “nice particles” to the uninitiated.

Although the natural reaction of some people will be “oh, particles again – nothing new!” – this is probably the biggest technical leap we’ve made for a demo since Blunderbuss. Instead of concentrating on the amount of particles and simply using them to render 3D scenes with a few modifiers on top, we concentrated on the cleverness of the particles: the simulation itself and the rendering/shading. In this demo the particles are smart. They’re going somewhere.

Particles are just a primitive like polygons or lines – not interesting in themselves. Creating and rendering a lot of them is easy. Making them do something interesting and look good is a completely different kettle of fish.

So lets talk about what we did this time to make particles do something interesting and look good..

93. Smoothed Particle Hydrodynamics (SPH)

SPH is a form of fluid dynamics which uses particles for storing the fluid and the transport of the forces/densities, rather than a grid. This allows you to represent more detail at higher resolution than a grid would allow given the same memory / performance limitations, it’s not limited to a certain area of space, and it makes collisions more practical and it’s a better fit for liquid effects. It’s the scheme used in professional offline packages like Realflow, used for all those nice liquid splashy effects you see in ads and movies – which take hours to simulate, let alone render. Good SPH is for me one of those holy grails ofย  effects development (like realtime radiosity). The thing is, the quality and scope of effects you can do with it is directly dependent on the number of particles – and so is the difficulty in pulling it off. If you have a few thousand you can make some droplet effects; with 10s of thousands you can make some nice splashes; and with 100s of thousands or millions, you can start to make really amazing running water simulations.

Early tests with SPH fluids

Early tests with SPH fluids

Early tests with SPH fluids - with environment

Early tests with SPH fluids - with environment

The problem with SPH in realtime is it’s really really hard. The simple explanation of the algorithm is: “take all the particles near my particle and perform some force exchange between them”. The force exchange is easy; the “all the particles near my particle” is a bitch. On GPU it’s even more of a bitch; and in 3D it becomes an order of magnitude more of a bitch.

Other demos have featured SPH before; FR-063 performed it on the CPU with (what looks like) between 1000-10000 particles. The current bleeding edge for 3D SPH in realtime is around 250,000 particles, working on a top end GPU using CUDA and with simple point rendering (and no effects or anything else on top). The current bleeding edge for 3D SPH on DX9 – i.e. with no compute shader / CUDA – is erm.. I dont actually think it’s been done.

The problem is simply the neighbourhood search. You end up with a variable amount of fast-moving particles affecting each particle, where it’s hard to pick an upper bound – so the spatial database is hard to construct. If you solve the neighbourhood search, you can solve SPH.

The demo features up to 500,000 particles running under 3D SPH in realtime on the GPU, with surface tension and viscosity terms; this is in combination with collisions, meshing, high end effects like MLAA and depth of field, and plenty of lighting effects. On DirectX9. It’s fast. Almost impossibly fast. How? We found a new approach to SPH where we can re-form the neighbourhood search term to something much easier to solve on a GPU. Meaning we can, honestly, get very close to what a program like Realflow can do over hours of simulation – but in realtime. And that, for me, is what demo coding (and realtime graphics) is all about.

There are 4 scenes which are directly showing “fluids” in the demo; a couple more using SPH in places for the great quality it has that it makes the particles spread out really nicely rather than bunch together randomly. In each of the fluid scenes it’s basically a load of particles dropped at the top of a very long track, and left to get on with it. The camera captures only a part of the action at any time – the great battle of “design vs showing off code” resulted in something that probably doesn’t completely sell the effect, but it does make something more enjoyable to watch. And that too is what democoding is about..

I thought it’d be nice to show it in isolation, so I put a couple of screenshots and a video above. Aside from that one embedded video – apparently wordpress is a little bitch and won’t let me embed more than one video link into a blog post – you can also check the reverse angles here and here. Those and the above screenshots show an initial test shot we did with 3D SPH – we drop 250,000 particles, and let them run with SPH and collisions against a mesh (handled as a signed distance field). Look, it splashes about and shit like that. All completely in realtime. Oooooooh. If nothing else, being able to run it in realtime makes it a lot easier to tweak. You get instant results – you don’t have to wait for any simulations to calculate. In these days of youtube and the prevalence of netbooks, perhaps high end realtime graphics doesn’t have the same relevance to the audience that it did 15 years ago – but it sure matters a huge amount when you’re actually making something. The benefit to the workflow is huge.

12. Signed Distance Fields

I touched on this for Ceasefire, but it was this production where we finally got them working and used them in anger: the use of signed distance fields for arbitrary collisions (and attraction) with particles. We take polygon meshes, convert them into signed distance fields using distance to triangle measurements and place the results in a volume texture, giving us the means for fast collision ray tests. This is absolutely invaluable when using fluid dynamics because otherwise the particles fly off merrily into space. So we have particles flowing around a head; particles flowing down a track carried by SPH; and particles being blown by a 3d fluid effect into the form of a word. All using signed distance fields.

We used them for a lot more besides particle effects, though. They’ve become an integral part of our rendering pipeline. That will become more apparent the next time we do something featuring a lot of solid 3D.. but they’ve opened up a lot of doors.

One clear example of SDF usage comes in the first “fluid” scene – falling drops collide with invisible words. This also neatly demonstrates the “art vs code” issue – we’re simulating 250,000 particles under SPH running down a long 3D track, and the camera shows a small subsection of those. The collision with the words actually uses two affectors: we used a collision node to make the particles bounce off the 3D words (using an SDF version of the mesh), which worked great – but it means you only see the top of the words. ๐Ÿ™‚ So we added a second affector – a low weighted mesh attractor which pulls the particles towards points on the faces of the mesh. This helped the particles slowly run down and also pulls them in from 3d space towards the words. It also added to the surface tension effect by keeping them attracted to the words even after they fall off the end.

65. Particle Shading

In my original post on my particle system a year or more ago I talked about how weย  had support for opacity shadow maps for self shadowing on particles. Since Blunderbuss we didn’t actually use that much – we’ve mainly got away with unlit particles, using the shading and lighting from the source meshes. But I’ve been working on some new techniques and had to make use of them..

The major problem with opacity shadow maps is depth aliasing – you only have a limited set of depth samples (16 in my case) for which to represent the scene, and it’s not enough. They tend not to be spread evenly across the particles either. So I tried a few new methods:

252. Volume Shading

This method borrows heavily from slice-wise volume rendering: the particles are sorted in light space by depth, nearest to furthest, and rendered in slices to composite the image. In this case though we only care about the shadow result: the values are written into the per-particle shading buffer used in the final particle render.

The sorted particles are rendered into the shadow map in batches – typically we used 64 batches per particle system. Per batch we additively render the batch particles into the shadow map, then project the shadow map onto the particles into the next batch: the value read from the shadow map is considered the amount of shadow on that particle from particles closer to the light.

opacity shadow map version

Rendering using an opacity shadow map

Rendering using volumetric shadowing

Rendering using volumetric shadowing

This clever bit is, this method doesn’t care about the actual depth of the particle : it only cares about the position of the particle in the sorted sequence. No depth writes are required and transparency is supported without any problems. One additional benefit of the technique is that we can blur the shadow map a bit after each batch, giving a scattering effect. If one had the power to do it and could render one particle per batch, it’d give a perfect shadowing result. As it is, the batch sizes give some slice aliasing.

Unfortunately the slice aliasing was too much of a problem with large sytems and the technique is also a bit too slow – and generates a lot of render target swaps. So I came up with something better..

15. “Stochastic” Shadow Mapping

This isn’t the same as the stochastic shadow mapping paper that was recently presented, but the name makes a certain amount of sense for the effect anyway. ๐Ÿ™‚ The basic idea is something I’ve tried a few times on and off since 2009. The idea is that if your particles don’t overlap pixels in view space, you could render them as solid – using regular shadowmapping and lighting techniques. Of course this is rarely the case in a render – because particle systems rely on lots of small elements overlapping and blendingย  to look solid and nice. However, what if you do render them as single pixels and make them not overlap, and then perform a full screen 2D operation to upscale each point and make them overlap and blend?

We applied that approach to shadow maps generated from particles. The particles are rendered as single points to a very large shadow map; this gives us a reasonable chance that the particles won’t overlap. It’s just like a spatial hash – with a very simple hashing function and no collision handling.. Then, when sampling, we read from the map using a large kernel and sum up the amount of filled pixels which pass the shadow map test to give a shadowing result.

Stochastic shadowing in action, on something that is definately not a semen cell.

Stochastic shadowing in action, on something that is definately not an artistic interpretation of a sperm cell.

But there’s a twist: in order to improve the quality, cope with hash collisions and reduce aliasing, we perform a temporal reprojection step. When writing the shadow map each frame a random sub-pixel offset is applied to each particle which varies every frame; this means we get a different set of collisions, so different particles become visible each frame. Then when sampling the shadow map we blend the result with the previous frame, so the results adjust smoothly over time. By combining these two things we get a very nice, soft, reasonably alias-free shadow solution which is also efficient to render. No sorting required. The final shadow value per particle is written into a buffer and used at particle render time.

I also experimented with the technique for the actual rendering of the particles to the main frame – rendering single points with Z test and blurring the buffer out, with some per-pixel sorting during the composite, to create softened particles but without the need for a full particle sort. Unfortunately it didn’t give us the visual fidelity we needed; we relied on the blending of particles, the variable sizes and the sprites used. Could be more applicable in a future project though.

536. Meshing (Marching Cubes)

I suppose it’s the obvious step, isn’t it. Democoders love metaballs. Being able to render particles as meshes using metaballs is something we’ve wanted to do for ages because it moves us towards the “liquid” look – the Realflow-style look. We’ve been here before: in Frameranger we rendered around 50,000 metaballs in realtime by generating a potential field, converting it into a signed distance field and raymarching it. Results were promising but not perfect: being able to generate an actual triangle mesh has some side benefits, like being able to post process the mesh and adjust it with tension – something we really wanted to do to get closer to that Realflow look I keep going on about.

Marching cubes gives two issues to solve: generating the potentials, and then triangulating them. We already worked out how to generate the potentials some time ago for Frameranger, although a bit of work was required to scale it up to 250,000 particles. The second part is more difficult: you need to generate an arbitrary amount of geometry data from that potential field with triangle and vertex counts that change every frame. Naturally, we could quite easily make an implementation which just generates the worst case: treat every cell in the volume as if it was contributing triangles, then write degenerates for the invalid ones. That actually works – but it’s prohibitive for large volumes. One cell can contribute up to 5 triangles, and with a 128^3 volume we’d be looking at 10 million triangles – which isn’t great. 256^3 volumes would effectively be impossible. What we need is a way to only process and send triangles for the cells that are active.

This is problematic because we can’t generate index or vertex buffers on the GPU, we can’t generate drawcalls on the GPU (so we can’t vary how many primitives are rendered on the GPU) and we can’t use the CPU – because the potential field is on the GPU and it’d be far too slow to get it back to CPU. And even if we could, the CPU probably isn’t up to the task of generating the geometry fast enough anyway. And even if it was, we’d have to send all the triangle data back to the GPU again. So we’re stuck with the GPU – and yet we don’t have a way to vary the number of cells we render triangles for.

metaballs in numb res

It seems impossible. However, Gernot Ziegler came up with a nice solution a while ago: histopyramids. This is a way of performing stream compaction on the GPU: it takes a big sparse buffer, and moves all the filled elements to the start of the buffer. A bit like a sort, but much more efficient. This gives us exactly what we need: we generate the (sparse) potential grid and use histopyramid compaction to move all the filled elements to the start. Then we use an occlusion query to count the number of active cells and use the CPU to generate batches which give enough triangles for the count to generate. The actual vertices are generated using a pixel shader and vertex texture fetch is used to read them.


4. Bokeh Glows

I’ve had this effect on the back burner for a few years but finally got to actually finishing it up.. Bokeh is the term relating to the effect of circular or shaped highlights in a depth of field effect, caused by inaccuracies in the shape of the lens of a camera. Or something. They make DOF look really nice. I’ve tried before by using a really big circular kernel for a regular DOF effect with an HDR input and leaving it at that and it actually does work, but I wanted to see if I could get some shaped bokehs and really overblow it. So I tried something with point sprites.


bokeh, innit. turned up to max, of course.

The basic idea is to work out where on screen bokehs would happen, and render point sprites at those points. I did this using the following method:

– Bilinear downsample the screen (in several steps), storing the 2d position (UV) of the brightest point of the 4 values of the quad that were read to a render target.

– Use those 2d positions to read a blurred version of the original frame. Perform some thresholding to pick out the points which pass. Generate colour values for the points.

– Temporally smooth positions and colours using positions from last frame, apply some attack and decay.

– Render a load of point sprites using vertex texture fetch to read the positions and colours, rendering the sprites to the screen. (With some additional magic to make it look good.)

72. Post Process Antialiasing (MLAA)

This is the first demo since 2009 (Frameranger, in fact) that we’ve released which actually features polygons being rendered as polygons. Happily, time has moved on, and so has our renderer. One of the major bugbears I had with the deferred renderer is lack of antialiasing – but fortunately a whole bunch of post process antialiasing techniques got invented in the last couple of years. MLAA is the technique du jour, and we use an implementation in our renderer. It’s great.

We do two little twists in our version to make it cool: firstly we use a lot of stencil optimisation so only the active edges get the big-ass shader applied to do the actual MLAA (or in fact get any of the process after the edge detect applied). And secondly.. there’s an ugly problem with MLAA in that it actually cocks up quite badly in a certain case. The technique relies on checking for horizontal or vertical edges. But where you have a pixel which is both a horizontal and vertical edge, it messes it up. Which breaks about 1/4 of the diagonal edges you have to deal with, so its pretty noticable. Our oh so clever technique for fixing that is.. do the MLAA twice. ๐Ÿ™‚ The second time we flip the whole image in x and y, then MLAA it and flip it back. Genius huh? .. no? Well, it makes the polygonal scenes look good, and fortunately the stenciled version is so fast the extra hit isnt really noticable.

42. Stereoscopic 3D

We really wanted to do something with 3D for a while, but sadly we dont have any true 3D hardware (*cough* donations please *cough*). We decided quite early on that we were going to go for a pretty much black & white look – so it would actually be feasible to use the good old red / cyan anaglyph method. 3D isn’t as easy as just turning it on, though. It takes some effort to make it work well, give a good effect and not strain your eyes. We tuned it quite carefully and the setup of the scenes really helps – the first scene is slow and quite static so it lets your eyes adjust, the camera movements are quite smooth and in a single direction so they’re easy to track, and so on and so on.

Do watch the demo in 3D, it’s really made for it. We’re going to make a proper HD 3D video with left & right splits soon for those with real 3d setups.


I guess what’s interesting for me about this demo is that it was so much easier to make than many we’ve done. It just kind of came together; we started early enough, we got the music at the start, weย  didn’t have any major problems, nobody disappeared or dropped out, everything showed up on time, we didn’t completely overstretch ourselves and come up with some ideas that couldn’t be done, and we had time at the end to go over it and tweak and polish things, and we’re really happy with how it turned out. It’s like the way it’s supposed to go but never does. It doesn’t work for everyone (not very bombastic, you see) but it seems the people who got it really got it and like it, which is what matters. Maybe we’ve actually cracked it.. or maybe next time’ll be a royal screwup.ย  Have to wait and see..

An amusing realisation hit me the other day. We’ve unintentionally managed to make a demo which is entirely full of sexual references. There’s a load of massive sperm cells; there’ what looks like a female gender symbol, made up of little sperm cells; there’s a load of sperm falling down and colliding off things; and then there’s a big river of .. well, it’s not much of a stretch in context to call that fluid “spunk”, is it? It only dawned on me after Dixan commented that it was “finally a good demo about semen” on pouet, and I started thinking about it.


October 6, 2009

a thoroughly modern particle system.

Filed under: demoscene, fluid dynamics, realtime rendering — directtovideo @ 3:30 pm

particles in blunderbuss

During the making of Frameranger, I spent some time looking into making a “modern particle system”. Particles have been around for ever and ever, and by and large they haven’t changed that much in demos over the last 5-10 years. You simulate the particles (around 1000 – 100,000 of them) on the CPU, animating them using a mix of simple physics, morphs and hardcoded magic; sort them back to front if necessary, and then upload the vertex buffers to the GPU where they get rendered as textured quads or point sprites. The CPU gets hammered by simulation and sorting, and the GPU has to cope with filling all of the alpha blended, textured pixels.

However, particles in the offline rendering / film world have changed a lot. Counts in the millions, amazing rendering, fluid dynamics controlling the motion. Renderers like Krakatoa have produced some amazing images and animations. I spent some time looking around on the internet at all sorts of references and tried to nail down what those renderers had but I didn’t – and therefore needed. This is something I do a lot when developing new effects or demos. Why bother looking at what’s currently done realtime? That’s already been done. ๐Ÿ™‚

I decided on the following key things I needed:
1. Particle count. I want more. I want to be able to render sand or smoke or dust with particles. That means millions. 1 million would be a good start.
2. Spawning. Instead of just spawning from a simple emitter, I want to be able to spawn them using images or meshes.
3. Movement. I want to apply fluid dynamics to the particles to make them behave more like smoke or dust. And I want to morph them into things, like meshes or images – not just use the usual attractors and forces.
4. Shading. To look better the particles really need some form of lighting – to look like millions of little things forming a single solid-ish whole, not millions of little things moving randomly and independently.
5. Sorting. Good shading implies not additive blending, which implies sorting.

The problem with simulating particles on the CPU is that no matter how fast the simulation code on the CPU is, you’re going to hit two bottlenecks sooner or later: 1. you have to get that vertex data to the GPU – and that can make you bandwidth limited; and 2. you need to sort the particles back to front if you want to shade them nicely, which gets progressively slower the more you have. Fortunately given shader model 3 and up, it’s quite doable to make a particle system simulate on the GPU. You make big render targets for the particle positions, colours and so on; simulate in a pixel shader; and use vertex texture fetch to read from that texture in the vertex shader and give you an output position. Easy. Not quite – simulating on the GPU brings it’s own set of problems, but on to that later. Modern GPUs are sufficiently fast to easily be able to perform the operations to simulate millions of particles in the pixel shader, and outputting 1 million point sprites from the vertex shader is doable.

Shading is the biggest problem here, and the shading problem is mainly a lighting problem. Lighting for solid objects means a mix of – diffuse+specular reflection; shadows; and global illumination. Computing diffuse and specular reflection requires a normal, something which particles do not really have, unless we fake it. So that was my first line of attack – generate a normal for the particle. It would need to be consistent with the shape of the system, locally and globally, if it was going to give a good lighting approximation. I tried to use the position of the particle to generate a normal. It turns out that’s rather difficult if you’ve got something other than a load of static particles in the shape of a sphere or box. Then I tried to use a mesh as an emitter and use the underlying normal from the mesh for the particle. It did work, but of course once the particle moves away from it’s spawn point it becomes less and less accurate.

The image here shows particles generated from the car mesh in Frameranger, matching the shading and lighting.
particles mapped to the car from frameranger

I needed a better reference, so I looked away from solid objects and had a look at how you would light a volumetric object – e.g. a cloud. Which in real life is actually millions of millions of little particles, so maybe it makes quite a good match to lighting, well.. particles. It works out as a model of scattering and absorbtion. You cast light rays into the volume, and ray march through it. Whenever the ray hits a cell that isnt empty, a bit of the light gets absorbed by the cell and a bit of it scattered along secondary rays in different directions, and the rest passes on to the next cell. The cell’s brightness is the amount of light remaining on the ray when it gets to that cell. Scattering properly is hideously slow and expensive so we’ll just completely ignore it, and instead add a global constant to fake it (a good old “ambient” term). That just leaves us with marching rays through the volume and subtracting a small amount per cell, scaled by the amount of stuff in the cell. This actually works great, and I’ve used it for shading realtime smoke simulations – with a few additional constraints, like fixing to directional lights only and from a fixed direction, you can do it pretty efficiently. It looks superb too.

The problem is that the particles are not in a format that is appropriate for ray marching (like a volume texture). But the look is great – we just need a way of achieving it for particles. What we’re dealing with is semi-transparent things casting shadows, so it makes sense to research how to handle that. The efficient way of handling shadows for things nowadays is to use shadow maps. But shadow maps only work for opaque things – they give you the depth of the closest thing at each point in a 2D projection of light space. For alpha things you need more information than that, because otherwise the shadows will be solid.

Or do you? The first thing I tried was very simple – to use exponential shadow maps. Exponential shadowmaps have a great artefact / bug where the shadow seems to fade in close to the caster, and this is usually annoying – but for semi-transparent stuff we can use it to our advantage. Yep, plain old exponential shadowmaps actually work pretty well as shadowmaps for translucent objects – as long as those translucent objects aren’t all that translucent (e.g. smoke volumes). The blur step also makes small casters soften with those around them. It’s pretty fast too, and it almost drops into your regular lighting pipeline. But, for properly transparent (low alpha) stuff like particles, it’s not quite good enough.

The really nice high end offline way is to use deep shadow maps. That basically gives you a function or curve that gives you the shadow intensity at a given depth value. It’s usually generated by buffering up all the values written to each pixel in the map (depth and alpha), sorting them, and fitting a curve to them which is stored. Unfortunately it doesn’t map too well to pixel shader hardware. However there is a discrete version which is much simpler – opacity shadow maps. For this you divide depth into a series of layers and sum up the alpha value sums at each layer for the stuff written with a depth greater than that layer. On modern GPUs that’s actually pretty easy – you can fit 16 layers into 4 MRTs of 4 channels each, and render them in one pass! Unfortunately it’s not expandable beyond that without adding more passes, but it’s good enough to be getting on with – as long as you don’t need to cover a really large depth range and the layers are too spaced out. But this gives us nice shadows which work with semi-transparent stuff properly. You could even do coloured shadows if you didn’t mind less layers or multiple passes.

The next issue is how to apply that shadow information to the particles – it requires samping from 4 maps plus a bunch of maths, and isn’t all that quick. If we did it for every pixel rendered for the particles, it’d hammer the already stressed pixelshader. If the particles are small enough we could just sample it once per particle in the vertex shader – but it’s too many textures to sample. Fortunately the solution is easy – just calculate a colour buffer using the fragment shader, with all the lighting and shading information per particle in it, and sample that in the vertex shader. The great thing about that is it’s really similar in concept to the deferred renderer I’ve already got for solid geometry. You have a buffer containing positions and other information; you perform the lighting in multiple passes, one per light, blending into a composite buffer; then sample that composite buffer to get the particle colour when rendering to the screen. It’s so similar in fact to the deferred rendering pipeline that I can use the almost same lighting code, and even the same shadow maps from solid geometry to apply to the particles too – so particles can cast shadows on geometry, and geometry can cast shadows on particles.

particle lighting 01particle lighting 02

This shading pipeline – compositing first to a buffer, one pixel per particle – opens up new options. We can do all the same tricks we do in deferred rendering, like indexing a lookup table which contains material parameters for example. Or apply environment maps as well as lights. Or perform more complicated operations like using the particle’s life to index a colour lookup texture and change colour over the life of the particle – make it glow at first then fade down. It allows multiple operations to be glued together as separate passes rather than making many combinations of one shader pass.

So, we have a particle colour in a buffer. The next job is to render the particles to the screen. We’ve gone to all this effort to colour them well that we need to consider sorting – back to front – so it actually looks right. This could be problematic – we’ve got 1 million+ particles to sort, all moving independently and potentially quite quickly and randomly, and it has to be done on the GPU not the CPU – we can’t be pulling them back to CPU just to sort.

I had read some papers on sorting on the GPU but I decided it looked totally evil, so I ignored them. My first sorting approach was basically a bucket sort on GPU. I created a series of “buckets” – between 16 and 64 slices the size of the screen, laid out on a 2d texture (which was massive, by the way), with z values from the near to the far plane. Then I rendered the particles to that slice target, and in the vertex shader I worked out which slice fit the particle’s viewspace z value, and offset the output position to be in that slice. So, in one pass I had rendered all the particles to their correct “buckets” – all I had to do was to blend the buckets to the main screen from back to front, and I got a nicely sorted particle render which rendered efficiently – not much slower than not sorting at all. Unfortunately it had some problems – it used an awful lot of VRAM for the slice target, and the granularity of the slices was poor – they were too spread out, so sometimes all the particles would end up in one slice and not be sorted at all. I improved the Z ranges of the slices to fit the approximate (i.e. guessed) bound box of the particle system, but it still didn’t have great precision. In the end on Frameranger the VRAM requirements were simply too high, and I had to drop the effect. It turns out that the layers method is very useful for other things though, like rendering particles into volumes or arbitrary-layered opacity shadow maps.

When I revisted the particle effect, I knew the sorting had to be fixed. I looked back at the papers on GPU sorting, specifically the one in GPU Gems. They seemed very heavyweight – a sort of a 1024×1024 buffer (i.e. 1 million particles) would require 210 passes over that buffer per frame, which is completely unfeasible on a current high end GPU. But there was one line which caught my attention – “This will allow us to use intermediate results of the algorithm that converge to the correct sequence while we do more passes incrementally”. One of the sorting techniques would work over multiple frames – i.e. for each iteration of the algorithm, the results would be more sorted than the previous iteration – it would not give randomly changing orders, but converge on a sorted order. Perfect – we could split the sort over N frames, and it would get better and better each frame. That’s exactly what I did, and it actually worked great. It used much less memory than the bucket sort method and gave better accuracy too – and the performance requirements could be scaled as necessary in exchange for more frames needed to sort.

There are some irritations with simulating particles on GPU. Each particle must be treated independently and you have to perform a whole pass on all the particles simultaneously. It makes things which are trivial on CPU, like counting how many particles you emitted so far that frame, very difficult or not feasible at all on GPU. But it’s a rather important thing to solve – you often need to be able to emit particles slowly over time, rather than all at once. The first way I tried to solve that was to use the location of the particle in the position buffer. I would for example emit the particles in the y range 0 to 0.1 on the first frame, than 0.1 to 0.2 on the next, and so on. It worked to a point, but fell down when I started randomising the particle’s lifetime – I needed to emit different particles at different times. Then I realised something useful. If you’re dealing with loads and loads of something – like a near infinite amount – then doing things randomly is as good as doing things correctly. I.e. I dont need to correctly emit say 100 particles this frame – I just need to try to emit e.g. roughly 1% of particles this frame and if I’ve got enough particles in the first place, it’ll look alright. The trick is that those 1% is the right amount of random.

I’ll explain. The update goes like this: 1. generate a buffer of new potential spawn positions for particles. 2. Update the particle position buffer by reading the old positions, applying the particle velocities to them, and reducing the life; then if the life is less than 0, pick the corresponding value from the spawn position buffer and write that out instead. So, each frame I generate a whole set of spawn positions for the particles, but they only get used if the particle dies. But how to control the emission? Clearly if I put a value in the spawn buffer which has an initial life of less than 0 and it gets used, it’ll get killed by the renderer anyway and the next frame around it’ll respawn again – i.e. the particle never gets rendered and doesn’t really get spawned either. So if I want to control the number of particles emitted I just limit the number of values in the spawn buffer each frame that have an initial life greater than 0.

How do I choose which spawn values have valid lives? It needs to be a good spread, because the emission life is also randomised – some particles die earlier than others and need respawning. If I simply use a rolling window it’s not random enough and particles stop being spawned properly. If I actually randomly choose, it’s too random – it becomes dependent on framerate, and on a fast machine the particles just all get spawned – the randomness makes it run through the buffer too fast. So, what I did was a compromise between them – a random value that slowly changes in a time-dependent way.

The other nice thing about this spawn buffer was that it made it easy to combine multiple emitters. I could render some of the spawn buffer from one emitter, some from other, and it would “just work”. One of the first emitters I tried was a mesh emitter. The obvious way would be to emit particles from the vertices but this only worked well for some meshes – so instead I generated a texture of random positions on the mesh surface. I did this by firstly determining the total area of all the triangles in the mesh; then for each triangle spawning a number of particles, which was the total number of particles * (triangle area / total area). To spawn random positions I just used a random barycentric coordinate.

Here’s an early test case with particles generated for a logo mesh and being affected by fluid.
particle logo 01particle logo 02particle logo 03particle logo 04

Finally I needed some affectors. Of course I did the usual forces, but I wanted fluid dynamics. The obvious idea was to use a 3d grid solver and drive the particles by the velocities. Well, that wasn’t great. The main problem was that the grid was limited to a small area, and the particles could go anywhere. Besides, the fluid solver was quite slow to update for a decent resolution. So I used a much simpler method that generated much better results – procedural fluid flows (thanks Mr Bridson). Essentially this fakes up a velocity field by using differentials of a perlin-style noise field to generate fluid-like eddies – “curl noise”. By layering several of these on top, combined with some simple velocities, it looked very much like fluid.

The one remaining affector was something to attract particles to images. To do this I generated a texture from the image where each pixel contained the position of the closest filled pixel in the source image – a bit like a distance field but storing the closest position rather than the distance. Then, in the shader I projected the particle into image space, looked up the closest pixel and used that to calculate a velocity, weighted by the distance from the pixel. With a bit of randomness and adjustment to stop it affecting very new or very old particles, it worked a charm.

particles running under a fluid sim and attracted to an image

And there we have it – a “modern” particle system that works on DirectX9 – no CUDA required! I’m sure this will develop over time. With better GPUs the particle counts will go up fast – between 4 and 16 million is workable already on a top end Geforce, and it’ll go up and up with future hardware generations. In fact I have a host of other renderers for the particles besides this simple one – things to do metaballs, volume renders and clouds, for example – and a load of other improvements, but that can wait for another demo..

By the way, there’s a nice thing about GPU particles that maybe isn’t immediately obvious. You’re writing all the behaviour code (emission, affectors..) in shaders, right? And you can probably reload your shaders on the fly in your working environment. All of a sudden it makes development a lot easier. You don’t need to recompile and reload the executable every time you change the code, you can simply edit and reload the shader in the live environment. Great eh?

September 29, 2009

fluid dynamics #1: introduction, and the smokebox

Filed under: demoscene, fluid dynamics — Tags: , , , , , , , — directtovideo @ 10:57 am

One effect I was asked for a lot for demos by artists was “fluids”. Fluid dynamics make for great visuals – they look complicated, with minute details, yet they make sense to the human eye on a larger scale thanks to their physical basis. Unfortunately they are rather difficult to do. But if it were easy it wouldn’t be fun, right? Fluid dynamics and rendering has become an area of research I’ve kept coming back to over the years, and my journey started at the end of 2006. I immediately discounted simple fakes and decided to try and do something pretty “real” for our demos in 2007.

The first problem is that fluids take some serious computation power to calculate. To do it “properly”, what you have to do is simulate the flow of velocities and densities through space using a set of equations – “Navier Stokes”. I’m no mathematician, but fortunately these equations don’t look half as nasty in code as they do on paper. The rough point of them is that the velocites/densities at a certain point are interacted with other points nearby in space in the right way.

The basic approach follows two paths, depending on how you want to model space – you can use particles (SPH), or a grid. Grids are bounded in the area they cover, and their memory requirements depend on their size, but the equations are simpler and tend to be faster to solve – with grids, you automatically know what points are nearby. densities/velocities tend to dissipate and you lose details. Their rendering suits gas-like fluids well, as it’s typically visualised using a ray march through the volume – which suits transparent media. Particles can suit liquids better, and are unbounded – they can go where ever they want – and you don’t lose details in the same way as grids. You need a lot of particles, and the equations are a bit uglier and slower to solve because you have to work out which particles are near each other particle – it actually has a lot of similarities with flocking behaviours. Rendering particle fluids usually involves some sort of implicit surface formed from the particles and visualised using a raytracer or polygonised with marching cubes.

Modern offline fluid solvers tend to support one or both methods of fluid calculation. Fluids are big business – movies, adverts, all sorts of media use them. Offline renderers support hundreds of thousands of particles or huge grids, and can take hours to compute a few seconds of animation. So it’s going to be hard to compete with realtime.

I started off with grid solvers. They’re easier to understand and there are good examples and tutorials/papers out there that explain how to do it – and they aren’t packed full of magic numbers. The navier stokes equation breaks down into the following steps:
Update velocity grid:
– take the previous frame’s velocity grid
– perform a diffusion step – a bit like a blur
– perform a projection step – makes it mass-conserving, and its what gives the effect that swirly quality. In reality that means a linear solver with 10-20 steps, meaning looping over the whole grid 10-20 times. It’s the slow bit.
– perform an advect step, which is like doing “position = position + velocity” in reverse – to pull the velocities from the previous frame’s grid into the new grid.
Update density/colour grid:
– perform a diffusion step
– advect using the velocity grid.
You’ll soon realise you can cut out the diffusion steps – you can lose them without hurting the final result.

The first attempt I made was to do a 2D grid solver. That’s pretty easy – there’s a good example released by Jos Stam (probably the “father” of realtime fluid dynamics) some years ago which is easy to follow, although it needs optimising. The nice thing about 2D grid fluids is that they map very simply to the GPU – even back in the days of shaders 2.0. The grid goes into a 2D texture, and the algorithm becomes a multi-pass process on that texture. That worked out great – a few days of work and it was usable in a demo. It made it into halfsome in 2007. The resolution was good – we could have one fluid solver running at 512×512 well, or several at 256×256 – which proved to be ample. It was quite easy to control, too – we could simply initialise the density grid with an image, the velocity grid with random blobs, and the image was pushed around by the fluid.

2D fluids are nice, but 3D is better. But 3D brings a whole new set of problems. It was quite easy to extend a 2D solver to 3D on both CPU and GPU. Sadly at the time and on DirectX9, GPUs could not render to volume textures – so the problem could not be extended simply by changing the texture from 2D to 3D. Instead I had to lay out the “volume” as a series of slices on a 2D texture. That was hard to get right, but apart from that it extended easily. The problem was it was rather slow. The GPU at the time (GF6800 was the card du jour) just didn’t have the performance to handle it, when the extra overhead from fixing up and sampling from the 2D slice texture came into account. So the next option was to go CPU – I spent quite a long time hand optimising the code in SSE2 intrinsics, and then wrote out a volume texture in the end for rendering. Unfortunately the algorithm is in parts heavily memory/cache limited – the advect stage in the equation jumps around in memory almost randomly. In fact, in some parts of the solver the GPU was far faster thanks to more efficient memory access, and in other parts the CPU won out thanks to raw calculation performance and being able to re-use previous cell values. (Note, this is back in 2006-2007, Core Duo vs a GF6800.)

Finally I had a solver. Now, how to render the results? Raymarching the “volume texture” was quite slow – rendering the slices as a series of quads worked out better. Now the fun stuff started. I realised the key to a good looking was lighting – and for a semi-transparent thing like smoke/gas, that means shadows with absorbtion along the shadow ray. Ray marching in the shader or on CPU was out of the question, but fortunately there was an easy fake – assume a light that was only from the top down and do a “ray march” which added the value of the current grid cell to a rolling sum, and wrote the current sum back to the grid cell as a shadow value, then moved to the next cell immediately below it. That could be made even easier on my CPU solver by flipping the grid around so that the Y axis of the fluid was the “x axis” of the grid – and the sum could be rolled into the output stage of the copy to volume texture – so the whole shadow “ray march” was almost completely free.

After some time experimenting I discovered that the largest grid resolution I could get away with performance-wise was 64x32x32. Unfortunately it looks pretty rough at that. I tried a few things with octrees to avoid empty space but it just didn’t work – with my small grid, the whole space got filled quickly. In the end I simply doubled the res of the density grid and interpolated the velocities – so the slow part of the equation, updating the velocities, ran on a grid with 1/8 the number of cells as the density grid – which is the one where you really notice the blockiness.

It worked. It ran at framerates of at least 30fps realtime, with a grid res of 128x64x64 for the densities and 64x32x32 for the velocities. At the time of release it was probably the fastest and most powerful CPU grid solver ever made. It was even compared to the nvidia 8800 demo – which was running at a significantly higher resolution, but on vastly superior hardware and without lighting. We used it in media error, although we were so rushed with the demo that it only got used in a rather boring way – as “smoke in a box”. And it made me want to do something bigger and better.

smokebox in media error

Create a free website or blog at