Monday, September 24, 2012

Announcing Coherent UI

Finally! We, the Coherent Labs team, are very proud to announce our first product - Coherent UI.


After a mammoth effort (that is, of course, still ongoing), I can finally talk openly about the exciting new technology we are building.

Coherent UI is a user interface middleware aimed at game development companies. It greatly increases UI quality and reduces production costs for UI development.
HUD, in-game browser and a game-in-the-game; all integrated through Coherent UI

The biggest news - you can write the UI for ANY type of game on ANY platform with HTML5. I am a big fan of using the right tool for the job, and I think HTML5 is exactly the kind of tech game companies have been lacking in their development ecosystem.

Now that I can talk about it, I'll be able to write much more about the technology we are creating and how we achieved many of our goals. For a quick-start I'll list some of the tech features we had in mind when we started and that are now available:

  • Full-featured HTML5 and CSS3 rendering (3D elements in your UI + canvas + WebGL!)
  • GPU acceleration
  • Multi-platform
  • Full browsing support (you can have a fully featured browser embedded in your game)
    •   SSL
    •   plugins
    •   cookies
    •   local storage
    •   proxies
    •   etc.
  • Fast JavaScript (yes, it's usually V8)
  • Super fast and powerful binding (native <-> JavaScript = FAST)
  • Debugging and profiling (you can debug JS code with breakpoints, watches etc.; performance profiling on JS and rendering)
  • Built-in support for click-through queries (I've seen unbelievable hacks in the past dealing with this and couldn't stand it anymore) 
  • Proper composition of ClearType text on transparent background (it's amazing how few people get this one right)
  • Easy to use and clean API (it's more difficult than it sounds)
A sample game menu made with Coherent UI

This is just a high-level overview of what we have now and continue to improve.


Stay tuned: I plan to post many of my thoughts on how we achieved all this, what mistakes we made (and probably are still making), and what went really right. I hope you'll enjoy it.

You can check out Coherent UI on our site for free. 

Monday, September 10, 2012

Debugging undebuggable applications with PIX

Developers can ask DirectX 9 not to allow PIX to debug their application by calling D3DPERF_SetOptions(1). I knew that and had encountered several commercial applications using it. One day, I was fooling around with Portal 2 and wanted to feed my curiosity about how some stuff is done, but when I started PIX all I got was “Direct3D Analysis Disabled”, and I knew it was time to find a way to circumvent this little peculiarity. So, let’s see how we can convince DirectX to ignore the request of the said developers.

I started with a simple application:



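The screenshot of the sample is not preserved here; a minimal sketch of what such a test application could have looked like (a hypothetical reconstruction; everything except the D3DPERF_SetOptions call is omitted, and it assumes the d3d9 headers and library are available):

```cpp
// Hypothetical reconstruction of the test app; the only call that matters
// for this experiment is D3DPERF_SetOptions(1), which asks Direct3D to
// disable analysis tools such as PIX for this process.
#include <d3d9.h>

int main()
{
    D3DPERF_SetOptions(1); // 1 = disable profiling/analysis tools

    // ... device creation and rendering would follow here ...
    return 0;
}
```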
Let’s check out the disassembly of the D3DPERF_SetOptions(1) call:

_D3DPERF_SetOptions@4:
72EC7402 8B FF             mov     edi,edi  
72EC7404 55               push     ebp  
72EC7405 8B EC             mov     ebp,esp  
72EC7407 83 EC 18         sub     esp,18h  
72EC740A A1 50 92 FC 72   mov     eax,dword ptr [___security_cookie (72FC9250h)]  
72EC740F 33 C5             xor     eax,ebp  
72EC7411 89 45 FC         mov     dword ptr [ebp-4],eax  
72EC7414 A1 54 74 EC 72   mov     eax,dword ptr [string "DirectX Direct3D SO" (72EC7454h)]  
72EC7419 89 45 E8         mov     dword ptr [ebp-18h],eax  
72EC741C 8B 0D 58 74 EC 72 mov     ecx,dword ptr ds:[72EC7458h]  
72EC7422 89 4D EC         mov     dword ptr [ebp-14h],ecx  
72EC7425 8B 15 5C 74 EC 72 mov     edx,dword ptr ds:[72EC745Ch]  
72EC742B 89 55 F0         mov     dword ptr [ebp-10h],edx  
72EC742E A1 60 74 EC 72   mov     eax,dword ptr ds:[72EC7460h]  
72EC7433 89 45 F4         mov     dword ptr [ebp-0Ch],eax  
72EC7436 8B 0D 64 74 EC 72 mov     ecx,dword ptr ds:[72EC7464h]  
72EC743C 89 4D F8         mov     dword ptr [ebp-8],ecx  
72EC743F C6 45 ED 44       mov     byte ptr [ebp-13h],44h  
72EC7443 8B 4D FC         mov     ecx,dword ptr [ebp-4]  
72EC7446 33 CD             xor     ecx,ebp  
72EC7448 E8 D3 A1 F5 FF   call     @__security_check_cookie@4 (72E21620h)  
72EC744D 8B E5             mov     esp,ebp  
72EC744F 5D               pop     ebp  
72EC7450 C2 04 00         ret     4

Some movs, xors, a runtime security check and that’s it; nothing that uses the actual value we passed to D3DPERF_SetOptions... well, that was a big nothing.

Ok, take two - let’s first start the application and then attach PIX to it.
We’ll have to add some code to give us time to attach:



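The code screenshot is lost from this copy; a simple sketch of such a wait (a hypothetical helper of my own, not the original code) could be:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Spin for a fixed number of seconds so there is time to attach PIX to the
// already-running process before any Direct3D calls are made. (On Windows,
// a plain Sleep() or a loop on IsDebuggerPresent() works just as well.)
void WaitForAttach(int seconds)
{
    for (int i = seconds; i > 0; --i) {
        std::printf("Attach now... %d\n", i);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```

Calling WaitForAttach(10) at the top of main(), before the first Direct3D call, leaves plenty of time to attach.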
Use something like this, or just a Sleep() for enough time. Now what do we have with the new setup:

_D3DPERF_SetOptions@4:
72EC7402 E9 46 77 AA E8   jmp     HookedD3DPERF_SetOptions (5B96EB4Dh) 

All right, the sneaky PIX has modified d3d9.dll's memory and now there is a jmp at the beginning! The function it now jumps to takes us inside PIXHelper.dll:

HookedD3DPERF_SetOptions:
5BF6EB4D 8B FF             mov     edi,edi  
5BF6EB4F 55               push     ebp  
5BF6EB50 8B EC             mov     ebp,esp  
5BF6EB52 83 7D 08 01       cmp     dword ptr [ebp+8],1  
5BF6EB56 75 0B             jne     HookedD3DPERF_SetOptions+16h (5BF6EB63h)

We’ve only had one push so far (for the value we passed), and then we push ebp, so that’s two pushes. After "mov ebp,esp", ebp equals the stack pointer, so dword ptr [ebp+8] is exactly the value we passed to D3DPERF_SetOptions. It is compared to 1 and, if equal, some procedures are invoked that stop the execution. If it isn’t equal, we follow the jump specified by jne. What we have to do is make that jump unconditional, i.e. always taken regardless of the value passed.

We don’t care about the code that pops up the message for disabled analysis, so we have plenty of bytes to play with. However, we don’t even need them: as the “Intel® 64 and IA-32 Architectures Software Developer Manuals” show, the EB cb variant of jmp takes exactly as many bytes as the jne instruction used here. Now all that’s left is to open PIXHelper.dll with a hex editor (I used Notepad++ with the hex-editor plugin), search for some of the bytes (try “8B EC 83 7D 08 01 75 0B”; I found it only once) and change the 75 to EB. Voila! Now you won’t see that annoying warning anymore.
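Patching by hand in a hex editor works fine; for reference, the byte-level operation can also be sketched in plain C++ (an illustrative patcher I am adding here with made-up names, not a tool from the original workflow):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Search `data` for `pattern` and flip the jne (0x75) at `jneOffset`
// within the match to an unconditional short jmp (0xEB); both opcodes
// are followed by the same one-byte displacement, so sizes match.
// Returns true only if exactly one occurrence was found and patched.
bool PatchJneToJmp(std::vector<std::uint8_t>& data,
                   const std::vector<std::uint8_t>& pattern,
                   std::size_t jneOffset)
{
    std::size_t found = data.size(); // sentinel: no match yet
    for (std::size_t i = 0; i + pattern.size() <= data.size(); ++i) {
        if (std::equal(pattern.begin(), pattern.end(), data.begin() + i)) {
            if (found != data.size())
                return false; // ambiguous: pattern occurs more than once
            found = i;
        }
    }
    if (found == data.size())
        return false; // pattern not present
    data[found + jneOffset] = 0xEB; // jne -> jmp
    return true;
}
```

For the sequence above, the pattern would be {0x8B, 0xEC, 0x83, 0x7D, 0x08, 0x01, 0x75, 0x0B} with the jne at offset 6; the bytes would come from reading PIXHelper.dll into the vector and writing it back afterwards.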

Friday, September 7, 2012

Building boost 1.51 with MSVC for Windows with debug symbols

This post is just a quick “note to self” for future reference. We wanted to update our boost source to the latest version, so I started building it with the usual “bjam --build-type=complete debug-symbols=on debug-store=database”. For some reason, however, there were a lot of .pdbs in the boost root folder and many libraries couldn’t be built as DLLs. For example, building boost::thread failed with “...failed compile-c-c++ bin.v2\libs\thread\build\msvc-10.0\debug\debug-store-database\threading-multi\win32\thread.obj...”, and there were many other errors like it.

I tried building the 1.49 version and everything was fine, so I started digging in the jam files, specifically the one for MSVC - tools\build\v2\tools\msvc.jam. After a lot of wasted time, I was able to locate the problem - in the rule “compile-c-c++”, “PDB_NAME on $(<) = $(<:S=.pdb) ;” (that’s line 383 in the file) was behaving strangely. Reading the boost build guide (http://www.boost.org/boost-build2/doc/userman.pdf), I worked out that $(<) means the first argument, $(<:S) selects the suffix, and $(<:S=.pdb) would replace the suffix with “.pdb” - exactly what we want.

However, for some strange reason the replacement converted “bin.v2\libs\thread\build\msvc-10.0\debug\debug-store-database\threading-multi\win32\thread.obj” into “win32\thread.pdb”. $(<) was the whole string, but the replacement trimmed the beginning. There was no “win32” folder in the root boost directory, so when bjam tried to output some file it failed.

I played around a little with the jam file but didn’t get any results, so I took the easy way out and just created all the folders that are needed. Here’s the full list:
    converter
    cpplexer\re2clex
    encoding
    gregorian
    object
    shared
    std
    util
    win32
I also couldn’t build boost::python; the error message was “No python installation configured and autoconfiguration failed”, even though my Python was conveniently placed in C:\Python27. I didn’t want to spend a lot of time figuring this out, so I tried a solution I found in this thread http://stackoverflow.com/questions/1704046/boostpython-windows-7-64-bit and it worked, so I didn’t dig deeper. The quoting is supposed to fix paths that contain whitespace; however, it apparently breaks paths that don’t. So, just locate the line “python-cmd = \"$(python-cmd)\" ;” in the “if [ version.check-jam-version 3 1 17 ] || ( [ os.name ] != NT )” section of the file tools\build\v2\tools\python.jam and comment it out with a # at the beginning.

Tuesday, August 28, 2012

Building a shared memory IPC implementation - Part II

Memory management

This post is a follow-up on Building a shared memory IPC implementation - Part I. I'd like to discuss the way shared memory is handled and allocated for the inter-process queue.

As in all memory management systems, I try to minimize wasted memory on bookkeeping and the number of shared memory allocations, as they might be slow.

For the shared memory IPC queue I opted for the paged memory allocation scheme sketched below:


Pages are the ones that get requested from the OS as shared memory regions. For the sake of simplicity and performance they are a fixed number. This incurs a usage limitation, as the queue can run out of memory, but simplifies the synchronization and management mechanism. The limit should be set reasonably high; reaching it usually means a more serious problem has occurred, for instance the consumer stopped working or is too slow.

As you can see in the picture, all shared page handles are put in an array. Only the used pages are requested from the OS – if 3 pages are used from a maximum of 16, then only they will be allocated.
Nodes (messages) are always allocated within a single page. If there is not enough room left in a page to accommodate a node, a new page is requested with size = max(newNodeSize*2, DEFAULT_PAGE_SIZE). This formula ensures that pages are never smaller than a preset minimal limit and also allows for huge nodes. Pages are of variable size, which is very handy.
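The growth formula itself is a one-liner; something along these lines (the names follow the text, but the default value here is my assumption for illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Assumed value for illustration; the real default is implementation-specific.
static const std::size_t DEFAULT_PAGE_SIZE = 64 * 1024;

// New pages are never smaller than DEFAULT_PAGE_SIZE, and a node larger
// than that gets a page with room to spare (twice the node's size).
std::size_t NewPageSize(std::size_t newNodeSize)
{
    return std::max(newNodeSize * 2, DEFAULT_PAGE_SIZE);
}
```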

The pair (OwnerID, Offset) uniquely identifies a node.

NB: The identifier of the node is a pair of values and usually it can't be updated atomically. Special care should be taken to ensure that operator= (assignment) and operator!= are atomic with respect to each other, as specified in Building a shared memory IPC implementation - Part I.
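As a sketch, the node address could look like this (the names are mine, not from the actual implementation; the atomicity caveat above is only noted in a comment here, not solved):

```cpp
#include <cstdint>

// Identifies a node by the page that owns it and the byte offset inside
// that page. NB: in the real queue, assignment and comparison of this pair
// must be made atomic with respect to each other, e.g. by packing both
// fields into a single 64-bit word updated with atomic operations.
struct NodeRef
{
    std::uint32_t ownerId; // index of the owning page
    std::uint32_t offset;  // byte offset within that page
};

inline bool operator!=(const NodeRef& a, const NodeRef& b)
{
    return a.ownerId != b.ownerId || a.offset != b.offset;
}
```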

A node in the queue knows its position but also has the coordinates of its successor.
If a node grows too much and has to expand, but there is not enough room in the page, it is moved to another free one and holes remain. Node allocation always goes forward; there is no attempt to squeeze nodes into empty spaces. Node de-allocation always goes forward too (it's a queue), so free space will be effectively reclaimed as the node just before it is reclaimed:

 
As soon as Node K is freed, the whole Page 2 will be marked as free.

I never return the allocated memory to the OS. When a page is left empty, it is added to a collection of free pages and reused as soon as a new page is required. In my typical usage the pattern is very predictable and stable, and just marking a page as reusable allows us to skip costly OS calls when consumption oscillates between M and M+1 pages. If memory is tight or usage spikes are expected, you could free the pages and then reallocate them from the OS.

As pages are freed and reused, nodes will have successors in pages that are not directly next to them. This is not a problem with the described addressing scheme.

When a new node is requested, I first check if there is enough room in the current tail's owner page; if not, AllocateFirstNodeInPage(size_t size) is called:

The method reclaims all pages that are now left empty, checks the empty pages for enough room to accommodate the node and, as a last resort, allocates a new one from the OS. The NodesOwned field of a page is incremented/decremented with atomic operations.
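The allocation path just described can be sketched roughly like this (only AllocateFirstNodeInPage and NodesOwned come from the text; all other names are mine, the OS shared-memory request is modeled with plain new, and the atomics are left out):

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

static const std::size_t kDefaultPageSize = 64 * 1024; // assumed minimum

struct Page
{
    std::size_t size = 0; // capacity of the page
    std::size_t used = 0; // bytes currently allocated in it
    int nodesOwned = 0;   // "NodesOwned"; updated atomically in the real queue
};

struct Queue
{
    std::vector<Page*> pages;    // all pages requested so far
    std::deque<Page*> freePages; // empty pages kept around for reuse

    // Stand-in for the shared-memory request to the OS.
    Page* AllocateFromOS(std::size_t size)
    {
        Page* p = new Page;
        p->size = std::max(size * 2, kDefaultPageSize);
        pages.push_back(p);
        return p;
    }

    // Called when the current tail page has no room for a node of `size` bytes.
    Page* AllocateFirstNodeInPage(std::size_t size)
    {
        // 1. Reclaim pages whose nodes have all been consumed.
        for (Page* p : pages)
            if (p->nodesOwned == 0 && p->used != 0) {
                p->used = 0;
                freePages.push_back(p);
            }
        // 2. Reuse an empty page that is big enough.
        for (auto it = freePages.begin(); it != freePages.end(); ++it)
            if ((*it)->size >= size) {
                Page* page = *it;
                freePages.erase(it);
                return page;
            }
        // 3. Last resort: request a brand new page from the OS.
        return AllocateFromOS(size);
    }
};
```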

New pages are always allocated by the producer. The consumer also has the same vector of pages but just maps the already allocated memory when it reaches a node in a not-yet-mapped page.

Saturday, August 11, 2012

DLL export quirks

Can you spot an error in this code (compiled with MSVC 2010 SP1 running on Win7; TestDLL is just a simple DLL project and an executable imports and uses it as described):




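The code screenshots are missing from this archived copy; based on the description, the setup would have looked roughly like this (my reconstruction, with invented macro and class names):

```cpp
// TestDLL.h (shared between the DLL and the EXE)
#ifdef TESTDLL_EXPORTS
#  define TESTDLL_API __declspec(dllexport)
#else
#  define TESTDLL_API __declspec(dllimport)
#endif

// Only the individual methods are exported, not the whole class.
class MyClass
{
public:
    TESTDLL_API MyClass();
    TESTDLL_API virtual ~MyClass();
};

// In the EXE:
int main()
{
    MyClass* p = new MyClass; // operator new of the EXE
    delete p;                 // virtual call lands in the DLL's deleting
                              // destructor, which calls the DLL's
                              // operator delete
    return 0;
}
```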
Although innocuous-looking, the code results in undefined behavior. What happens is that operator new is called in the EXE while operator delete is called in the DLL.

A little playing around in the disassembly shows the reason. When you have a virtual destructor, its address is of course put in the vtable of the object created. However, when the compiler sees a class like the one illustrated, it creates two destructors: one that works just as the programmer would expect, destroying all the members etc., and another that does the same things but also calls operator delete on the object (the 'deleting destructor'). This second destructor is the one set in the vtable of the object and is responsible for the behavior.

A fix for this problem is exporting the whole class, as pointed out by Microsoft themselves -


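The missing snippet presumably showed the export specifier moving onto the class itself; roughly (again a reconstruction, shown from the DLL side, where the client side would use dllimport instead):

```cpp
// The whole class is exported, not just individual methods.
class __declspec(dllexport) MyClass
{
public:
    MyClass();
    virtual ~MyClass();
};
```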
In this case the compiler creates a 'scalar deleting destructor' for the class in the exe, one that calls the vanilla destructor and the operator delete of the executable, and puts it in the vtable, so everything works as expected.

Checking out the assembly shows that in the first case the constructor of MyClass sets the address of the destructor to MyClass::`vector deleting destructor' in the DLL (the one that calls delete) and nothing more.
In the export-the-whole-class case, however, the compiler generates a 'local vftable' and overwrites the one created in the DLL.

As it turns out, before version 5.0(!) of VC++ only the first case used to work, creating all the said problems. So in 5.0 they changed the behavior to the current one, which also has its drawbacks (like those around calling FreeLibrary, as explained nicely in this thread).

If you __declspec(dllimport) a whole class with a virtual destructor, the compiler creates a new virtual table and redirects the destructor to a local version in order to preserve new/delete correctness. This appears to be the ONLY case where it does this, so in all other situations the programmer must be careful.

It is very tempting to just export the needed methods for a task and leave the rest hidden. However one must be aware of this quirk, that if left creeping unnoticed, might bring many headaches. In those cases the best solution is to rely on pure interfaces and factory functions like COM does it. This appears to be the most portable solution too. You could also override new and delete for the exported classes that has the advantage of not forcing you to use factories and can be easily be done with a common base class.