Tuesday, August 28, 2012

Building a shared memory IPC implementation - Part II

Memory management

This post is a follow-up to Building a shared memory IPC implementation - Part I. I'd like to discuss the way shared memory is handled and allocated for the inter-process queue.

As in any memory management system, I try to minimize the memory wasted on bookkeeping and the number of shared memory allocations, since they can be slow.

For the shared memory IPC queue I opted for the paged memory allocation scheme sketched below:


Pages are what gets requested from the OS as shared memory regions. For the sake of simplicity and performance their number is fixed. This imposes a usage limitation, as the queue can run out of memory, but it simplifies the synchronization and management mechanism. The limit should be set reasonably high; reaching it usually means a more serious problem has occurred, for instance the consumer has stopped working or is too slow.

As you can see in the picture, all shared page handles are kept in an array. Only the used pages are requested from the OS: if 3 pages are in use out of a maximum of 16, then only those 3 are allocated.
Nodes (messages) are always allocated within a single page. If there is not enough room left in a page to accommodate a node, a new page is requested with size = max(newNodeSize*2, DEFAULT_PAGE_SIZE). This formula ensures that pages are never smaller than a preset minimum while still allowing for huge nodes. Pages are of variable size, which is very handy.
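
Expressed as code, the sizing rule is simply the following (the concrete value of DEFAULT_PAGE_SIZE here is an assumption):

#include <algorithm>
#include <cstddef>

const size_t DEFAULT_PAGE_SIZE = 64 * 1024; // illustrative value

// Never return a page smaller than the preset minimum, yet still
// accommodate arbitrarily large nodes with room to spare.
size_t ComputePageSize(size_t newNodeSize)
{
    return std::max(newNodeSize * 2, DEFAULT_PAGE_SIZE);
}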

The pair (OwnerID, Offset) uniquely identifies a node.

NB: The identifier of the node is a pair of values, and usually it can't be updated atomically. Special care should be taken to ensure that operator= (assignment) and operator!= are atomic with respect to each other, as specified in Building a shared memory IPC implementation - Part I.
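
For illustration, here is one way to get that guarantee when OwnerID and Offset each fit in 32 bits: pack the pair into a single 64-bit atomic, so a whole identifier is always read or written in one operation. This is a sketch, not the actual queue code:

#include <atomic>
#include <cstdint>

// (OwnerID, Offset) packed into one 64-bit word so that reads and
// writes of the whole identifier are single atomic operations.
class NodeId
{
public:
    void Set(uint32_t ownerId, uint32_t offset)
    {
        m_Packed.store((uint64_t(ownerId) << 32) | offset,
                       std::memory_order_release);
    }

    // Mutually atomic with Set: a half-written identifier can never be
    // observed, because the load is a single atomic 64-bit read.
    bool IsNot(uint64_t other) const
    {
        return m_Packed.load(std::memory_order_acquire) != other;
    }

private:
    std::atomic<uint64_t> m_Packed;
};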

A node in the queue knows its own position and also holds the coordinates of its successor.
If a node grows too much and has to expand, but there is not enough room left in its page, it is moved to another free page and a hole remains. Node allocation always goes forward; there is no attempt to squeeze nodes into empty spaces. Node de-allocation always goes forward too (it's a queue), so free space is effectively reclaimed as soon as the node just before it is reclaimed:

As soon as Node K is freed, the whole Page 2 will be marked as free.

I never return the allocated memory to the OS. When a page becomes empty, it is added to a collection of free pages and reused as soon as a new page is required. In my typical usage the pattern is very predictable and stable, and just marking a page as reusable lets us skip costly OS calls when consumption oscillates between M and M+1 pages. If memory is tight or usage spikes are expected, you could free the pages and later reallocate them from the OS.

As pages are freed and reused, nodes will have successors in pages that are not directly next to them. This is not a problem with the described addressing scheme.

When a new node is requested, I first check if there is enough room in the current tail's owner page; if not, AllocateFirstNodeInPage(size_t size) is called:
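
The original listing is not shown here, but based on the description below it could look roughly like this reconstruction (the Page layout and NewSharedPage are assumed names):

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch; the real queue keeps this data in shared memory.
struct Page
{
    char*  base;       // start of the shared memory region
    size_t size;       // page sizes vary
    size_t used;       // bump pointer - allocation only moves forward
    long   nodesOwned; // maintained with atomic ops in the real queue
};

const size_t DEFAULT_PAGE_SIZE = 64 * 1024; // illustrative value

struct PageSet
{
    std::vector<Page*> pages;

    Page* NewSharedPage(size_t bytes); // wraps CreateFileMapping/shm_open

    void* AllocateFirstNodeInPage(size_t size)
    {
        // 1. Reclaim pages whose nodes have all been consumed.
        for (size_t i = 0; i < pages.size(); ++i)
            if (pages[i]->nodesOwned == 0)
                pages[i]->used = 0; // kept around for reuse, not freed

        // 2. Look for an already-allocated free page that is big enough.
        for (size_t i = 0; i < pages.size(); ++i)
            if (pages[i]->used == 0 && pages[i]->size >= size) {
                pages[i]->used = size;
                return pages[i]->base;
            }

        // 3. Last resort: request a fresh page from the OS.
        Page* p = NewSharedPage(std::max(size * 2, DEFAULT_PAGE_SIZE));
        p->used = size;
        pages.push_back(p);
        return p->base;
    }
};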

The method reclaims all pages that are now left empty, checks the free pages for enough room to accommodate the node, and as a last resort allocates a new page from the OS. The NodesOwned field of a page is incremented/decremented with atomic operations.

New pages are always allocated by the producer. The consumer also has the same vector of pages but just maps the already allocated memory when it reaches a node in a not-yet-mapped page.
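
On Windows, the consumer-side lazy mapping could look something like this sketch (the naming scheme for the mapping objects is an assumption; OpenFileMapping and MapViewOfFile are the real APIs):

#include <windows.h>
#include <cstdio>

void* MapPageForConsumer(unsigned pageId, size_t pageSize)
{
    char name[64];
    // The producer and consumer must agree on a naming scheme.
    sprintf_s(name, "Local\\IPCQueuePage_%u", pageId);

    HANDLE mapping = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, name);
    if (!mapping)
        return NULL; // the producer has not created this page (yet)

    void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, pageSize);
    CloseHandle(mapping); // the view keeps the mapping alive
    return view;
}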

Saturday, August 11, 2012

DLL export quirks

Can you spot the error in this code (compiled with MSVC 2010 SP1, running on Win7; TestDLL is just a simple DLL project, and an executable imports and uses it as described):
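
The original snippet is missing; here is a hypothetical reconstruction matching the description below, where only individual methods are exported, not the whole class:

// TestDLL.h - per-method export; the class itself is not exported
#ifdef TESTDLL_EXPORTS
#  define TESTDLL_API __declspec(dllexport)
#else
#  define TESTDLL_API __declspec(dllimport)
#endif

class MyClass
{
public:
    TESTDLL_API MyClass();
    TESTDLL_API virtual ~MyClass();
};

// main.cpp in the EXE, linked against TestDLL's import library
int main()
{
    MyClass* p = new MyClass(); // operator new of the EXE allocates...
    delete p;                   // ...but virtual dispatch lands in the DLL's
                                // 'deleting destructor', so the DLL's
                                // operator delete frees the memory
    return 0;
}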

Although innocuous-looking, the code results in undefined behavior. What happens is that operator new is called in the EXE while operator delete is called in the DLL.

A little playing around in the disassembly shows the reason. When you have a virtual destructor, its address is of course put in the vtable of the object created. However, when the compiler sees a class like the one illustrated, it creates two destructors: one that works just as the programmer would expect, destroying all the members etc., and another that does the same things but also calls operator delete on the object (the 'deleting destructor'). This second destructor is the one set in the vtable of the object and is responsible for the behavior.

A fix for this problem is exporting the whole class, as pointed out by Microsoft themselves:
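
Again a sketch rather than the original snippet - the whole class is exported with a single __declspec, using the same TESTDLL_API macro as above:

class TESTDLL_API MyClass
{
public:
    MyClass();
    virtual ~MyClass();
};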


In this case the compiler creates a 'scalar deleting destructor' for the class in the EXE, which calls the vanilla destructor and the operator delete of the executable, and puts it in the vtable, so everything works as expected.

Checking out the assembly shows that, in the first case, the constructor of MyClass sets the address of the destructor to MyClass::`vector deleting destructor' in the DLL (the one that calls delete) and nothing more.
However, in the export-the-whole-class case, the compiler generates a 'local vftable' and overwrites the one created in the DLL.

As it turns out, before version 5.0(!) of VC++ only the first case used to work, which created all the said problems. So in 5.0 they changed the behavior to the current one, which also has its drawbacks (like calling FreeLibrary, as explained nicely in this thread).

If you dllimport a whole class with a virtual destructor, the compiler creates a new virtual table and redirects the destructor to a local version in order to preserve new/delete correctness. This appears to be the ONLY case where it does this, so in all other situations the programmer must be careful.

It is very tempting to export just the methods needed for a task and leave the rest hidden. However, one must be aware of this quirk which, if left creeping unnoticed, might bring many headaches. In those cases the best solution is to rely on pure interfaces and factory functions, the way COM does it. This appears to be the most portable solution too. You could also override new and delete for the exported classes, which has the advantage of not forcing you to use factories and can easily be done with a common base class.
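
A minimal sketch of that last approach, assuming the TESTDLL_API macro from above (the names are illustrative):

class TESTDLL_API ExportedBase
{
public:
    virtual ~ExportedBase() {}

    // Both operators are compiled into the DLL, so allocation and
    // deallocation always use the same heap, no matter which module
    // calls new/delete on a derived object.
    static void* operator new(size_t size);
    static void  operator delete(void* ptr);
};

// In one of the DLL's .cpp files:
void* ExportedBase::operator new(size_t size) { return ::operator new(size); }
void  ExportedBase::operator delete(void* ptr) { ::operator delete(ptr); }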

Building a shared memory IPC implementation - Part I

Overview


Inter-process communication has become a widespread pattern in many modern applications. The nice sandboxing that a multi-process architecture provides is a major contributor to its popularity.

Google Chrome, for instance, spawns a process for every tab the user opens, and there are additional system processes running as well. Obviously, a fast IPC system is a must.

After some research on different transports for an IPC system, the best candidates were:
  • shared memory
  • pipes
  • TCP
After some experimentation on Windows, it was clear that a good shared memory implementation outperformed the others (not difficult to guess), but we found out that TCP actually has tremendous throughput when it works on loopback and the data packets are reasonably big. Message pipes are very convenient, but the need for a cross-platform solution meant that we'd have to accommodate both Linux and Windows pipes in our system.

In this post I'd like to share an implementation based on shared memory that we use in a commercial project where the messages are very numerous but small and the per-message overhead must be minimal. I'll explain the interfaces, algorithm and choices we made using a mix of C++ and pseudo-code. A shared memory transport is more difficult to use, but it gives us more control and speed, and although its behavior differs between OSs, the differences are very easy to deal with.

I gathered the following requirements for the queue that would be the backbone of the IPC system:

- single producer - single consumer
- lock-free
- variable-sized messages
- minimal number of allocations
- stream-like message queuing

The system is based on a message queue that is SPSC, so to actually have a two-way communication channel we simply use two queues, with the producer and consumer roles exchanged between the two processes.

Of all those requirements, I have to say that "variable-sized messages" and "stream-like message queuing" are the ones that make the implementation more complicated. By "stream-like message queuing" I mean this: the user queuing a message does not know its size when she begins queuing it. An obvious solution would be for the user to just compose the message in a buffer and then send the whole buffer - the size would be known at the enqueue moment - but that would conflict with the "minimal number of allocations" requirement. We need a way to let the user write directly IN the queue and expand the nodes when necessary.

The relevant part of the public interface for the producer looks like this:
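
The original listing is missing; the following is a plausible reconstruction from the description below - the exact signatures are guesses:

class SMDataQueueProducer : public SMDataQueueBase
{
public:
    // Reserve room for a new message; the caller writes directly
    // into the returned memory.
    void* RequestNode(size_t size);

    // Grow the node currently being composed; already written data
    // is preserved (moved to a new page if needed).
    void* EnqueueMore(size_t newSize);

    // Make the composed message visible to the consumer.
    void CommitNode();
};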


The user requests a node of some size and writes directly into the void* returned by the queue; if she realizes that more memory is needed, she calls EnqueueMore with the new size and again gets memory directly in the queue. The previously written memory is transferred, so the user can safely continue adding data. When done, CommitNode() makes the message available to the consumer.

The Consumer interface has just one method:
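
The original listing is missing here too; a guess at what it could look like:

class SMDataQueueConsumer : public SMDataQueueBase
{
public:
    // Returns a pointer to the payload of the next committed message
    // and fills outSize, or NULL when the queue is empty.
    const void* Dequeue(size_t& outSize);
};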


The lock-free implementation of the queue is based on a well-known algorithm (it was designed for multi-threading but applies to inter-process communication too):

We never allow the producer and the consumer to reach each other, thanks to a dummy divider node. The producer is in charge of freeing all consumed nodes, so minimal sharing of concurrent variables is needed. For a detailed description of the algorithm, take a look at Marginean's "Lock-Free Queues". His algorithm is correct, but the implementation is flawed and not thread-safe, as noted and explained by Sutter in a follow-up, "Writing Lock-Free Code: A Corrected Queue". Our queue is a correct one, and although I was unaware of Sutter's article at the time of writing it, the resulting implementation is very close to his.

The implementation is divided into three classes: SMDataQueueBase, SMDataQueueProducer and SMDataQueueConsumer. The base class provides all the shared bookkeeping structures and methods; the producer and consumer classes expose just the methods highlighted before.

The process of en-queuing works like this:
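
The original listing is missing; below is a sketch in the spirit of Sutter's corrected queue, collapsing the stream-like RequestNode/CommitNode API into a single call for brevity (field names are assumptions):

#include <cstring>

// Producer-side enqueue; only the producer writes 'first' and 'tail',
// only the consumer advances 'head' (the divider), so no locks are needed.
void Produce(const void* data, size_t size)
{
    // 1. Lazily reclaim everything the consumer is done with: every
    //    node strictly before 'head' has already been consumed.
    while (m_First != m_Head) { // m_Head is read atomically in the real code
        Node* old = m_First;
        m_First = m_First->next;
        FreeNode(old); // may drop a page's NodesOwned counter to zero
    }

    // 2. Compose the new node; the consumer cannot see it yet.
    Node* node = AllocateNode(size); // uses the paging scheme from Part II
    memcpy(node->Payload(), data, size);
    node->next = NULL;

    // 3. Publish: link after the current tail (a release store in the
    //    real code), then advance the producer-private tail pointer.
    m_Tail->next = node;
    m_Tail = node;
}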


As you might have noticed, head always points to a node that has already been consumed (or is invalid at start). When the queue is empty head == tail, but there is always a 'guard node' in the structure. In this way the producer and the consumer are each responsible only for their respective pointers, and in fact the only point where they touch is the head != tail check. It is very important that operator= and operator!= are mutually atomic: operator!= should never return true if the object is in mid-copy and thus invalid.

In the next posts I'll cover the memory management part of the implementation.

Update: Building a shared memory IPC implementation - Part II