Wednesday, October 24, 2012

CryEngine 3 Integration

This week we're on a multimedia frenzy and we're presenting another video of Coherent UI :).

This one is about our experimental integration in CryEngine 3. It's experimental because the integration uses the free SDK, which doesn't expose enough resources for maximal performance (e.g. the DirectX device is not easily accessible). We wanted to keep it as clean as possible (no hacky stuff!) to understand the problems one might have when trying to integrate Coherent UI in an existing engine. While this led to sub-optimal performance (technical details below the video), it was still very good and we believe it's worth showing.

The video demonstrates some exciting features of Coherent UI:
  • Displaying any website on any surface
  • Support for HTML5/CSS3, SSL
  • Social integration with Facebook, Twitter, Google+ (well, this isn't exactly demonstrated, but we have it :))
  • JavaScript binding (for engine variables and methods)
  • Live editing/debugging the interface
Without further ado (and I realize it's been quite an ado :)) I present to you Coherent UI in CryEngine 3!

Coherent UI in CryEngine 3

Coherent UI in CryEngine 3 (short version of the above)

Developer tidbits


Here's the story of the hurdles we had to overcome. Let me get this out of the way first: the lack of access to the DirectX device was extremely annoying. So, now that we've got that clear, we started our integration... and immediately ran into a problem.

There was no easy way to create an empty texture with a specified size, let alone create a shared texture for Windows Vista/7. This presented the first small inefficiency in the integration, since we had to use shared memory as the image transport mechanism from Coherent UI to CryEngine 3, which involves some memory copying (using shared textures doesn't). Even access to the device wouldn't have helped much here, since we want a valid engine handle for the texture, which means we can't bypass the engine. After some experimenting, we settled on creating a dummy material with the editor, assigning a placeholder diffuse texture and getting its ID with the following code:
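Roughly, the lookup went like this (a sketch - the interface and member names below are written from memory and may differ slightly between versions of the free SDK, and the material path is just an example):

    // Load the dummy material created in the editor and take the ID of its
    // placeholder diffuse texture, so we can update that texture at runtime.
    IMaterial* pMaterial =
        gEnv->p3DEngine->GetMaterialManager()->LoadMaterial("Materials/CoherentUIView");
    int viewTextureId = -1;
    if (pMaterial)
    {
        SShaderItem& shaderItem = pMaterial->GetShaderItem();
        SEfResTexture* pDiffuse = shaderItem.m_pShaderResources->GetTexture(EFTT_DIFFUSE);
        if (pDiffuse && pDiffuse->m_Sampler.m_pITex)
        {
            viewTextureId = pDiffuse->m_Sampler.m_pITex->GetTextureID();
        }
    }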

Having the ID, we can update the texture at runtime using IRenderer::UpdateTextureInVideoMemory. This approach comes with its own set of problems, however. An obvious one is that you need a unique dummy material and diffuse texture for each Coherent UI View, which is annoying. Another problem is that this texture is not resizable, so we kept an array of textures with common sizes and allowed only those when resizing a View. The least obvious problem was that if the material's texture had a mip-chain, IRenderer::UpdateTextureInVideoMemory did not automatically generate the lower mip levels, which produced strange artifacts because of the trilinear filtering. It didn't perform any kind of stretching either, which is why we only allowed preset View resolutions. You can see the mips problem here:
Problematic trilinear texture filtering

The placeholder texture

It took some time to figure out since, at first, we didn't have fancy placeholder textures, but only solid color ones. The solution was to simply assign a texture that has only one surface (i.e. no mips). This presented another small inefficiency.

Ok, we have a texture now and we can update it, but how do we draw it on top of everything so it acts as a UI? After some digging in the SDK, we found that the CBitmapUI class (and more precisely, IGameFramework's IUIDraw) should be able to solve this, having various methods for drawing full-screen quads. The color and alpha channel weights were messed up, however, so we had to call IUIDraw::DrawImage beforehand, which takes the weights as parameters, so we could reset them to 1.0. We just drew a dummy image outside the viewport to reset these values - yet another small inefficiency.

Moving on to the biggest inefficiency of all - Coherent UI provides color values with premultiplied alpha, meaning transparency is already taken into account. When drawing the fullscreen quad, the blending modes in CryEngine are set to SourceAlpha/1-SourceAlpha for the source and destination colors, respectively, so the source alpha would be taken into account again. What we had to do is "post-divide" the alpha value, so that when DirectX multiplies it back we get the correct result. We had to do this for each pixel, involving both bitwise and floating point operations - imagine the slowdown of doing that on a 1280x720 or even 1920x1080 image. If we had device access, all that would be fixed with a single call to set the blend mode but, alas, we don't. Also, if we used the DirectX 11 renderer, we'd have to do another pass over the pixels to swap their red and blue channels, because the component ordering has changed since DirectX 10!
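To give an idea of the cost, the fix-up boils down to something like this (a simplified sketch, not the actual code):

    #include <algorithm>

    // Coherent UI hands us pixels with premultiplied alpha, but the engine blends
    // the quad with SrcAlpha / InvSrcAlpha, so we divide the color channels by
    // alpha up-front to cancel the extra multiplication done by the blend stage.
    void PostDivideAlpha(unsigned char* pixels, int width, int height)
    {
        for (int i = 0; i < width * height; ++i)
        {
            unsigned char* px = pixels + i * 4;
            const unsigned char a = px[3];
            if (a == 0 || a == 255)
                continue; // nothing to fix for fully transparent or opaque pixels
            const float scale = 255.0f / a;
            px[0] = static_cast<unsigned char>(std::min(255.0f, px[0] * scale));
            px[1] = static_cast<unsigned char>(std::min(255.0f, px[1] * scale));
            px[2] = static_cast<unsigned char>(std::min(255.0f, px[2] * scale));
        }
    }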

Next on the list was input forwarding - we wanted a way to stop player input (so we don't walk or lean or anything while typing) and redirect it to Coherent UI, so we could interact with the Views. This wasn't really a problem but it was rather tedious - we had to register our own IInputEventListener that forwards input events to the focused View, if any. The tedious part was creating the gigantic mappings for converting CryEngine 3 events to Coherent UI events. Stopping player input when interacting with a View was easy, too - we just had to disable the "player" action map using the IActionMapManager. We also needed a free cursor while in-game, so you can move the mouse when browsing, which was just a matter of calling the Windows API ShowCursor.
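The listener itself is tiny - roughly something like this sketch (the event-conversion tables are omitted and the exact registration call is from memory):

    // Forwards CryEngine input events to the focused Coherent UI View, if any.
    class UIInputListener : public IInputEventListener
    {
    public:
        virtual bool OnInputEvent(const SInputEvent& event)
        {
            if (!m_pFocusedView)
                return false;                          // no View focused - let the game handle it
            ForwardToView(m_pFocusedView, event);      // uses the big mapping tables
            return true;                               // swallow the event
        }
    private:
        void ForwardToView(Coherent::UI::View* pView, const SInputEvent& event);
        Coherent::UI::View* m_pFocusedView;
    };

    // registered once during initialization:
    // gEnv->pInput->AddEventListener(&g_UIInputListener);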

The final problem was getting the texture coordinates of the mouse position projected onto the surface below it. I tried using the physics system, which provides raycasts that I got working, but I couldn't do a refined trace on the actual geometry, nor obtain that geometry to do the tracing myself. And even if I had managed that, I couldn't find any way to get the texture coordinates using the free CryEngine 3 SDK. That's why I just exported the interesting objects to .obj files using the CryEngine Editor, put the geometry into a KD-Tree and did the raycasting myself after all. For correct results, we first trace using the physics system so we know that no object is obstructing the View. Then we trace in the KD-Tree and get the texture coordinates, which can be translated to View coordinates.
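Once the KD-Tree trace returns the hit triangle and the barycentric coordinates of the intersection, getting the texture coordinates is a simple interpolation (sketch):

    struct Vec2 { float x, y; };

    // Interpolate the per-vertex UVs of the hit triangle with the barycentric
    // coordinates (u, v) returned by the ray/triangle intersection.
    Vec2 TexCoordsAtHit(Vec2 uv0, Vec2 uv1, Vec2 uv2, float u, float v)
    {
        const float w = 1.0f - u - v;
        Vec2 result;
        result.x = w * uv0.x + u * uv1.x + v * uv2.x;
        result.y = w * uv0.y + u * uv1.y + v * uv2.y;
        return result;
    }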

On the upside, there are some things that just worked - the JavaScript binding, debug highlights... pretty much anything that didn't rely on the CryEngine 3 renderer :).

In conclusion, it worked pretty well, although if Crytek gave us access to the rendering device we could have made the integration much more efficient - but then again, we used the free version, so that's what we get. I was thinking of ways to get the device, like scanning the memory around gEnv->pRenderer for IUnknowns (by memcmp-ing the first 3 virtual table entries) and then querying the interface for a D3D device, or making a proxy DLL that exports the same functions as d3d9/11.dll and installing hooks on the relevant calls, but I don't have time for such fun now.

Now that we've seen how far we can go using the free CryEngine 3 SDK, next on the agenda is full Unity 3D integration (we have device access there!). Be on the lookout for it next month!

Monday, October 22, 2012

Introducing on-demand views in Coherent UI

Coherent UI is designed as a multi-process, multi-threaded module to take advantage of modern processors and GPUs. Up until now it supported what we call 'buffered' views. All UI rendering is performed in a dedicated rendering process, which allows sandboxing the interface's operations; hence all commands are executed asynchronously.

This kind of view allows for perfectly smooth animations and user experience, and is the natural choice for dynamic UIs and in-game browser views. However, if you need interface elements to correlate per-frame with in-game entities, buffered views might not be suitable.

Take for instance enemy players in an MMO - their nameplates must always be perfectly snapped in every frame over their heads. The same applies to RTS games - health indicators must be glued on the units and never lag behind.

On-demand views

Coherent UI is now the only HTML5-based solution that solves all these problems. We have created what we call 'on-demand' views. They allow exact synchronization between the game frame and the out-of-process UI rendering without sacrificing any performance. Everything is still asynchronous, but we make strong guarantees about the content of the frames you receive at any point.
With on-demand views the game can explicitly request the rendering of UI frames and is guaranteed that all events prior to the request will be executed in that frame.

The typical frame of a game that uses on-demand views looks like this:
  • update frame (move players, AI, etc.)
  • trigger UI events (i.e. set the new nameplates positions, player health, etc.)
  • request UI frame 
  • draw game frame
  • fetch UI frame
  • compose UI on the game frame
This flow ensures that the game is in perfect sync with the UI, and while the game renders its frame the UI gets simultaneously drawn in Coherent UI's internals.
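In code, the loop might look roughly like this (the method names are placeholders for illustration, not the actual Coherent UI API):

    // Illustrative frame loop for an on-demand view - names are placeholders.
    void Game::Frame()
    {
        UpdateWorld();                        // move players, AI, etc.
        m_HUDView->TriggerEvent("UpdateNameplates", m_NameplatePositions);
        m_HUDView->RequestUIFrame();          // ask Coherent UI to render this frame
        RenderWorld();                        // the UI renders in parallel with this
        m_HUDView->FetchUIFrame();            // returns once the requested frame is ready
        ComposeUIOverFrame();                 // blend the UI texture over the scene
    }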
This video shows the new view in action. Please, do not mind the programmer art.


As you can see, buffered views might introduce some delay that is noticeable when interface elements are tied to in-game entities such as the position of the units. On-demand views always remain in sync.

Buffered views will remain part of Coherent UI as they are very easy to use and should be the default choice when no frame-perfect visual synchronization is required between the game and the UI. For instance, if you have an FPS game and the UI only shows a mini-map, player health and ammo, you should probably use buffered views, as no delay will ever be noticeable on the elements you are showing. The same applies to in-game browsers.

In other cases, however, on-demand views come to the rescue. They will be available in the next version of Coherent UI.

Friday, October 19, 2012

A high level shader construction syntax - Part II

Enhanced shader syntax

As explained in A high level shader construction syntax - Part I the proposed shader syntax is an extension over SM4 and SM5. It is simple enough to be parsed with custom regex-based code.

One of the requirements I had when designing this is that vanilla HLSL shader code going through the translator should remain unchanged.

Usually a shader is a mini-pipeline with predefined steps that only vary slightly. Take for instance the pixel shader that populates a GBuffer with per-pixel depth and normal. It has three distinct steps: take the depth of the pixel, take the normal of the pixel and output both. Here comes the per-material branching: some materials might have normal maps while others might use the interpolated normals from the vertices. However, the shader still just has to complete these three steps - there is no difference in how you get the normal.

Nearly all shader code can be broken down into such simple steps. So here I came up with the idea of what I called 'polymorphics'. They are placeholders for functions that perform a specific operation (e.g. fetch the normal) and can be varied per-material.

The code for a simple GBuffer pixel shader could look like this:
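In the extended syntax it was roughly along these lines (a sketch - the exact spelling of the polymorphic declarations and the GBufferOut struct are illustrative, not the final syntax):

    pixel_shader GBufferOut GBufferPS(interface context)
    {
        // placeholders ('polymorphics') the translator substitutes with
        // concrete atoms chosen per-material
        polymorphic DEPTH    MakeDepth;
        polymorphic NORMAL_W GetWorldNormal;   // e.g. NormalFromMap or NormalFromInput

        GBufferOut output;
        output.Depth  = MakeDepth(context);
        output.Normal = float4(GetWorldNormal(context), 0);
        return output;
    }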

The keyword 'pixel_shader' is required so that the translator knows what type of function it is working on. We declare two polymorphics - MakeDepth and GetWorldNormal - and the functions that can substitute them are called 'atoms'.

If the material has a normal map, after the translation process this shader looks like this:

Much more code is generated by the translator - the polymorphics have been substituted by 'atoms', i.e. functions that perform the required task: "NormalFromMap" is an atom that fetches the normal vector from a map, while "NormalFromInput" fetches it as an interpolated value from the vertices. If the material whose shader we are creating has no normal map, we simply tell the translator to use "NormalFromInput" for the polymorphic "GetWorldNormal".

All these atoms are defined elsewhere and could form an entire library. They look like this:
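For example (a sketch - the atom bodies and the NORMAL_INTERP semantic below are illustrative; only the declaration form of the first atom is quoted exactly later in the post):

    atom NORMAL_W NormalFromMap(interface context) needs NORMAL_O, UV, TBN, MAP_NORMAL, SAMPLER_POINT
    {
        // fetch the tangent-space normal from the map and bring it to world space
        float3 normalTS = MAP_NORMAL.Sample(SAMPLER_POINT, context.UV).xyz * 2 - 1;
        return normalize(mul(normalTS, context.TBN));
    }

    atom NORMAL_W NormalFromInput(interface context) needs NORMAL_INTERP
    {
        // the world-space normal arrives interpolated from the vertices
        return normalize(context.NORMAL_INTERP);
    }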

There are many new keywords here. The all-caps words are called 'semantics'; they are declared in a dedicated file and specify the type of the placeholder name and an HLSL semantic name used in case they have to be interpolated between shading stages or come as input in the vertex shader. Semantics are essentially the variables that the shader translation system knows about.

A sample semantic file looks like this:

Of course, if we just substituted parts of the code with snippets we'd be in trouble, as different atoms require different data to work with. If we use the "NormalFromMap" atom, we need a normal map, a sampler and UV coordinates. If we use "NormalFromInput", we just need a normal vector as shader input. All functions with input - that is, atoms and the vertex/pixel shader main functions - have a 'needs' clause where all semantics needed for the computation are enumerated.

The declaration/definition (they are the same) of a sample atom is as follows:

atom NORMAL_W NormalFromMap(interface context) needs NORMAL_O, UV, TBN, MAP_NORMAL, SAMPLER_POINT

'atom' is required to flag the function; then come the return semantic and the name of the atom. 'interface context' is required. Atoms are not substituted by function calls but are inlined in the shader code - to avoid name clashes with vanilla code in the shader that is not dependent on the translation system, all computed semantics (variables) are put in a special structure called 'context'. In the atom declaration the keyword 'interface' is reserved for eventual future use. Strictly speaking, 'interface context' is currently not needed, but it makes the atom resemble a real function and reminds you that all input comes from the context. After the closing parenthesis there is an optional 'needs' clause, after which all required semantics are enumerated.

Sometimes the needed semantics are straightforward to procure - for instance, if a normal map is required, the system simply declares a texture variable before the shader's main code. However, some computations are much more involved - like computing the TBN matrix. Here comes the third type of resource used in the translation process - 'combinators'.

When the translator encounters a needed semantic, it first checks whether it has already been computed and put in the context (I remind you that all data is saved in the context). If it's a new semantic, it checks all combinators for one that can calculate it. Combinators, like atoms, are functions - their declarations are almost the same as those of atoms:
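For instance, a combinator producing the TBN matrix might be declared like this (the TANGENT_O and WORLD semantics and the body are illustrative):

    combinator TBN ComputeTBN(interface context) needs NORMAL_O, TANGENT_O, WORLD
    {
        // build the world-space tangent basis from the object-space normal and tangent
        float3 n = normalize(mul(context.NORMAL_O,  (float3x3)context.WORLD));
        float3 t = normalize(mul(context.TANGENT_O, (float3x3)context.WORLD));
        return float3x3(t, cross(n, t), n);
    }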

The only difference is the keyword 'combinator' instead of 'atom'. Combinators encapsulate the code to compute a complicated semantic from simpler ones.
If no combinator is found for a needed semantic, it is assumed to come as an interpolant or vertex shader input. Needed-semantic searches are always conducted, so combinators can depend on other combinators.


To recap, the building blocks of the shader translation process are:
  • semantics
  • atoms
  • combinators
While it might seem complicated at first, the system simplifies shader authoring a lot. The shaders themselves become much more readable, with no per-material-type branches in their logic - so no #ifdef. An atom and combinator library is trivial to build after writing a few shaders - later on, operations get reused. The translation process guarantees that only needed data is computed, interpolated or required as vertex input. The 'context' structure used to hold the data incurs no performance penalty, as it is easily handled by the HLSL compiler. For convenience, expanded atoms and combinators are flagged with comments in the output HLSL code and enclosed in scopes to avoid name clashes between local variables.

In the next post I'll explain some compile-time conditions supported by the translator as well as how the translation process works.

Monday, October 15, 2012

A twist on PImpl

PImpl is a well-known pattern for reducing dependencies in a C++ project. The classic implementation is:
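For an HTTPServer class it boils down to something like this (sketch):

    // HTTPServer.h - clients see only the facade and a forward declaration
    class HTTPServerImpl;

    class HTTPServer
    {
    public:
        HTTPServer();
        ~HTTPServer();
        void Start();
    private:
        HTTPServerImpl* m_Impl;   // allocated on the heap in the .cpp
    };

    // HTTPServer.cpp - the implementation is completely hidden here
    class HTTPServerImpl
    {
    public:
        void Start() { /* sockets, threads, ... */ }
    };

    HTTPServer::HTTPServer() : m_Impl(new HTTPServerImpl) {}
    HTTPServer::~HTTPServer() { delete m_Impl; }
    void HTTPServer::Start()  { m_Impl->Start(); }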

It has two drawbacks:
  • inhibits function inlining
  • has an extra heap allocation and pointer chase
The extra heap allocation leads to a whole new set of drawbacks - creating an instance is more expensive, it fragments the heap memory and the address space, adds an extra pointer chase and reduces cache locality.

This allocation can be avoided by a simple trade-off with the PImpl idiom. Why allocate the HTTPServerImpl instance on the heap instead of storing it in the facade object? Because C++ requires the definition of HTTPServerImpl to be visible in order to store it by value. But we can store a C++ object in any memory chunk that is large enough to hold its data and respects its alignment requirements. So instead of storing a HTTPServerImpl pointer in the facade, we can store a memory chunk that is interpreted as an instance of HTTPServerImpl. This concept can easily be generalized in a reusable template:
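A minimal sketch of such a template (the class and member names are ours for illustration; a real implementation would also verify the size and alignment against the actual type):

    #include <cstddef>
    #include <type_traits>   // std::aligned_storage

    // Reserves properly aligned raw storage for a T inside the enclosing object,
    // without requiring T's definition to be visible in the header.
    template <typename T, std::size_t Size, std::size_t Alignment>
    class ImplStorage
    {
    public:
        T* Get()             { return reinterpret_cast<T*>(&m_Storage); }
        const T* Get() const { return reinterpret_cast<const T*>(&m_Storage); }
    private:
        typename std::aligned_storage<Size, Alignment>::type m_Storage;
    };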

And the HTTPServer becomes:
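Something along these lines (the size and alignment values are made up - they have to be kept in sync with the real HTTPServerImpl, which is what the static_assert in the .cpp is for):

    // HTTPServer.h
    class HTTPServerImpl;

    class HTTPServer
    {
    public:
        HTTPServer();
        ~HTTPServer();
        void Start();
    private:
        ImplStorage<HTTPServerImpl, 128, 8> m_Impl;   // lives inside the facade
    };

    // HTTPServer.cpp
    #include <new>
    // ... definition of HTTPServerImpl, exactly as in the classic version ...
    static_assert(sizeof(HTTPServerImpl) <= 128, "enlarge the storage in HTTPServer.h");

    HTTPServer::HTTPServer()  { new (m_Impl.Get()) HTTPServerImpl(); }
    HTTPServer::~HTTPServer() { m_Impl.Get()->~HTTPServerImpl(); }
    void HTTPServer::Start()  { m_Impl.Get()->Start(); }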
This is definitely not a new technique and it is declared "deplorable" in GotW #28. It has its drawbacks, but I consider some of them acceptable trade-offs. What is more:
  • The alignment problems are mitigated by C++11 support for alignment
  • Writing operator= is not harder than writing it in general
  • The extra memory consumption is acceptable for a small number of instances, given the better cache coherency.
So, does this technique really eliminate the extra pointer chase?

The classical implementation looks like:
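(illustrated here as the loads a member getter compiles down to, rather than raw disassembly; GetPort and m_Port are made-up names)

    // classic PImpl: two dependent loads - first the m_Impl pointer,
    // then the field it points to
    int HTTPServer::GetPort() const
    {
        return m_Impl->m_Port;
    }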
And the "twisted" one:

Seems like it does.

Of course, this technique breaks the PImpl idiom and might be considered a hack. Every time HTTPServerImpl grows beyond the hard-coded size or its alignment requirements change, we have to change the definition of the facade and recompile all the source files depending on HTTPServer.h - but given the advantages, this is an acceptable trade-off in many situations.

Literature:
  • John Lakos; Large-Scale C++ Software Design; Addison-Wesley Longman, 1996
  • Herb Sutter; Exceptional C++: 47 Engineering Puzzles, Programming Problems, and Solutions; Addison-Wesley Longman, 2000

Thursday, October 11, 2012

A high level shader construction syntax - Part I

Shader construction

A challenge in modern graphics programming is the management of complicated shaders. The huge number of materials, lights and assorted conditions leads to a combinatorial explosion of shader code-paths.

There are many ways to cope with this problem and a lot of techniques have been developed.

Some engines, like Unreal, have taken the path led by 3D modelling applications and allow designers to 'compose' shaders from pre-created nodes that they link in shade trees. An extensive description of the technique can be found in the paper "Abstract Shade Trees" by McGuire et al. With this approach the "Material editor" of the application usually has to be some sort of tree editor, and shaders generated this way might have performance issues if the designer didn't pay attention - but of course this is the approach that gives the most freedom to said artist.

Another technique is building shaders on-the-fly from C++ code, as shown in "Shader Metaprogramming" by McCool et al. I've never tried such a shader definition, although I find it very compelling, mostly due to its technical implementation. You'd have to rebuild and relink C++ code on the fly to allow for interactive iteration when developing or debugging, which is not very difficult to achieve but seems a bit awkward to me. The gains in code portability, however, should not be underestimated.

Über-shaders and SuperShaders usually build upon the preprocessor and enable/disable parts of the code via defines. The major drawback is that the 'main' shader in the end always becomes a giant unreadable mess of #ifdefs that is particularly unpleasant to debug.

A small variant of the SuperShader approach is to use 'static const' variables injected by the native code and plain 'if's on them in the shader. All compilers I've seen are smart enough to compile out the branching, so the static const variables essentially work as preprocessor macros, with the added bonus that an 'if' looks better than an #ifdef and the code is a bit easier to read. On complex code all the SuperShader problems remain.
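For illustration, the native code prepends the constant to the generated shader source and the shader simply branches on it (a sketch):

    // injected by the native code in front of the per-material shader source
    static const bool HasNormalMap = true;

    Texture2D    NormalMap;
    SamplerState LinearSampler;

    float3 GetWorldNormal(float3 interpolatedNormal, float2 uv)
    {
        // the constant condition is folded away by the compiler, so this
        // behaves exactly like an #ifdef but reads as normal code
        if (HasNormalMap)
            return normalize(NormalMap.Sample(LinearSampler, uv).xyz * 2 - 1);
        return normalize(interpolatedNormal);
    }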

Dynamic shader linking, introduced in Shader Model 5, lets you have interfaces and a form of virtual method calls in your shaders and makes for very elegant code.

I'd like to share an idea and a sample implementation of an enhanced syntax over HLSL SM4 and SM5. It is heavily influenced by the idea of dynamic linking, ASTs and "Automated Combination of Real-Time Shader Programs", with some additional features, and was originally developed to support DirectX 10+-level hardware. Although the sample application works only on SM4 and SM5, it could relatively easily be ported to any modern shading language. On SM5 you could just use the built-in dynamic linkage feature.

In essence the program translates the 'enhanced' shader into plain HLSL. The translator works like a preprocessor, so no AST is built for the code.

In the following posts I'll explain the syntax and what I tried to achieve with it as well as the implementation of the translator.

Update: A high level shader construction syntax - Part II

Monday, October 8, 2012

Converting Adobe Photoshop ACV to LUT for color grading

So, what does this even mean and what's its purpose? First, ACV is the Adobe Photoshop curves file format, which stores color mapping information. An artist can easily create an .acv preset which can drastically change the mood and color tone of an image. Second, a LUT is just a look-up table - in our case a small 3D table which maps an RGB color to another RGB color. LUTs can be used very effectively for nice and cheap post-processing.

An identity LUT

Basically, ACV and LUT are both the same thing, represented differently. Only the latter is usable in real-time rendering applications at negligible costs, though. Now that we understand the problem we’re solving, let’s take a look at

The theory

An .acv preset (generally) contains curves for each color component - red, green and blue. A curve is just a function f : X -> X, where the domain (and co-domain) is the set of integers in the interval [0, 255]. The curve is defined by a number of control points which change its shape. Here's a slightly more graphical explanation (you can find the curve editor in Image -> Adjustments -> Curves):

Photoshop curve editor

This image shows the mapping of the blue channel. For example, every color that has a blue component of 210 will be transformed to a color with blue component of 116. You can adjust the curves for the red and green channels, too, giving you full control. There is also a master RGB curve which is applied after the individual channel curves. For example, if the master curve maps a value of 116 to 80, the cumulative result of the blue curve and the master curve will be a mapping of 210 to 80 for the blue channel.

Before we do anything further, we should first understand the information stored in the preset. Adobe has been kind enough to publish the specifications of the .acv file format.

Length          | Description
2               | Version (= 1 or = 4)
2               | Count of curves in the file

The following is the data for each curve specified by the count above:

2               | Count of points in the curve (short integer from 2 to 19)
point count * 4 | Curve points. Each curve point is a pair of short integers where the first number is the output value (vertical coordinate on the Curves dialog graph) and the second is the input value. All coordinates have a range of 0 to 255.

Pretty straight-forward.

The question now is “how do we convert that curve to a lookup table”. The curve is defined completely by a finite set of points, so we can paraphrase the question as “how do we obtain a polynomial that passes through a number of predefined points”. And that’s where numerical analysis comes in handy. There are various methods for doing that, such as cubic spline interpolation or interpolation using Lagrange polynomials. They both work fine and I chose to use spline interpolation since it came to mind first. I’ll spare you the details, but you can find an explanation on cubic spline interpolation in any numerical analysis book or in the Wikipedia article.

The implementation

Now that we have our polynomial, we have to build the LUT we've been talking about. But how big should it be? 256^3? That's 64MB for 32-bit color - totally unacceptable! As stated by Kaplanyan in his CryEngine 3 talk at SIGGRAPH 2010, 16^3 seems to be enough, and from my experiments I tend to agree with that claim. That brings the memory footprint down to just 16KB for a single LUT!

The code for generating the small LUT cube is nothing special, we just take discrete samples of the curve:
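A sketch of that sampling loop (Curve and EvaluateCurve stand in for whatever curve representation and spline evaluator you built from the .acv control points; EvaluateCurve is assumed to clamp its result to [0, 255]):

    // int EvaluateCurve(const Curve& curve, int x);   // assumed helper

    void GenerateLUT(const Curve& red, const Curve& green, const Curve& blue,
                     const Curve& master, unsigned char* outRGBA /* 16*16*16*4 bytes */)
    {
        const int N = 16;
        for (int b = 0; b < N; ++b)
            for (int g = 0; g < N; ++g)
                for (int r = 0; r < N; ++r)
                {
                    // map the cell index to the 0..255 domain the curves are defined on
                    const int inR = r * 255 / (N - 1);
                    const int inG = g * 255 / (N - 1);
                    const int inB = b * 255 / (N - 1);
                    unsigned char* texel = outRGBA + 4 * (r + g * N + b * N * N);
                    // per-channel curve first, then the master RGB curve on top
                    texel[0] = (unsigned char)EvaluateCurve(master, EvaluateCurve(red,   inR));
                    texel[1] = (unsigned char)EvaluateCurve(master, EvaluateCurve(green, inG));
                    texel[2] = (unsigned char)EvaluateCurve(master, EvaluateCurve(blue,  inB));
                    texel[3] = 255;
                }
    }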

We’re pretty much done here, so I’ll demonstrate how this "hard" work pays off with a simple sample based on, well, SimpleSample11 from the DirectX SDK (it runs only on Windows Vista SP2+). The program uses a simple well-known trick for drawing a full-screen triangle with no actual vertex data for displaying a selected image and the shader modifies the color based on the LUT. When creating the LUT itself, the loading options are set so no mips are created as they are not needed (although even if you do create mips, the shader uses SampleGrad to sample the top surface). There isn't much more to the implementation than that. Except that by default DXUT creates a backbuffer with sRGB format, which makes us do the gamma correction in the shader (or create the resource view with the appropriate format). You can find the code in the github repository at the end of the post.

One important note that’s worth mentioning: you should use a sRGB color profile in Photoshop to match the colors of the generated LUT. You can check the profile used in Edit -> Color Settings.

Finally, here are some sample images I made:


Original

Color negative

Cross process

Dark

Vintage

Download ACV/LUT convertor, sample application and presets from github

Monday, October 1, 2012

Documenting JavaScript with Doxygen

As you already know (see the Coherent UI announcement), we are developing a large C++ and JavaScript project. We have documentation for both programming languages. The main requirements for the documentation are:
  • Application Programming Interface (API) references and general documentation such as quick start and detailed guides
  • cross references between the API references and the guides
  • accessible online and off line
  • easy markup language
There are a lot of documentation tools for each language - Doxygen and Sandcastle for C++, YUIDoc and JSDuck for JavaScript. Our project API is primarily in C++, so we chose Doxygen. It is great for C++ projects, but it doesn't support JavaScript. There are some scripts that work around this by converting JavaScript to C++ or Java. Unfortunately, they either do not support the module pattern or have inconvenient syntax for the documentation. Our JavaScript API consists mostly of modules, so we wrote a simple doxygen filter for our documentation. A doxygen filter is a program that is invoked with the name of a file, and its output is used by doxygen to create the documentation for that file. To enable a filter for a specific file extension, add a FILTER_PATTERNS entry for that extension in the doxygen configuration file. Let's say we want to document the following module:
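A stripped-down example of such a module (the real API is larger and the comment markup here is simplified):

    /**
     * Client-side synchronization helpers.
     */
    var Sync = (function () {
        var module = {};

        /**
         * Loads the given resource and invokes the callback when it is ready.
         * @param url the resource to load
         * @param callback invoked with the loaded data
         */
        module.load = function (url, callback) {
            // implementation omitted
        };

        return module;
    })();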
The filtered output looks like:
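Roughly, the filter rewrites the module as a namespace with free functions so that doxygen can parse it, keeping the documentation comments intact (a sketch, not the exact output of doxygen.js):

    namespace Sync {
        /**
         * Loads the given resource and invokes the callback when it is ready.
         * @param url the resource to load
         * @param callback invoked with the loaded data
         */
        void load(url, callback);
    };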
A nice surprise is that when you want to link to Sync.load you can simply write `Sync.load`. The only annoying C++ artifacts in the JavaScript documentation are the "Sync namespace" wording and the use of "::" as a resolution operator, but they can be fixed by a simple find/replace script. The doxygen.js filter is available at https://gist.github.com/3767879.