Saturday, August 11, 2012

Building a shared memory IPC implementation - Part I

Overview


Inter-process communication has become a very wide-spread pattern in many modern applications. The nice sandboxing that a multi-process architecture provides is a major contributor to its popularity.

Google Chrome for instance spawns a process for every tab the user opens and there are more system processes running also. It is obvious that a fast IPC system is a must.

After a some research on different transports for an IPC system the best candidates were:
  • shared memory
  • pipes
  • TCP
After some experimentation on Windows, it was clear that a good shared memory implementation outperformed the other ones (not difficult to guess) but we found out that TCP actually has tremendous throughput when it works on loopback and the data packets are reasonably big. Message pipes are very convenient but the need for a cross-platform solution meant that we'd need to accommodate both Linux and Win pipes in our system.

In this post I'd like to share an implementation based on shared memory that we use in a commercial project where the messages themselves are very numerous but small and the overhead per-message must be minimal. I'll explain the interfaces, algorithm and choices we made using a mix of C++ and pseudo-code. A shared memory transport is more difficult to use but provides us with more control and speed and although there are differences in it's behavior between OSs they are very easy to deal with.

I gathered the following requirements for the queue that would be the backbone of the IPC system:

- single producer - single consumer
- lock-free
- variable-sized messages
- minimal number of allocations
- stream-like message queuing

The system is based on a message queue that is SPSC, so to actually have a two-way communication channel we simply use two queues for which the producer and the consumer are exchanged in the
two processes.

Of all those requirements, I have to say that the "variable-sized messages" and the "stream-like message queuing" are the ones that make the implementation more complicated. By "stream-like message queuing" I mean this: the user that queues a message does not know the size of the messages when she begins queuing it. An obvious solution would be for the user to just compose the message in a buffer and then send the whole buffer - the size would be known at the en-queue moment but would conflict with the "minimal number of allocations" requirement. We need a way to
let the user write directly IN the queue and expand the nodes when necessary.

The relevant part of the public interface for the producer looks like this:


The user requests a Node with some size, she writes directly in the void* returned by the queue, if she realizes that there is need for more memory, calls EnqueueMore with the new size and
gets again memory directly in the queue. The previously written-on memory gets transferred so the user can continue safely adding data. When done, CommitNode() makes the message available to
the consumer.

The Consumer interface has just one method:


The lock-free implementation of the queue is based on a well-known algorithm(it is for multi-threading but applies to inter-process too):

We never allow for the producer and the consumer to reach each other by having a dummy divider node. The producer is in charge of freeing all consumed nodes so minimal sharing of concurrent
variables is needed. For a detailed description of the algorithm take a look at Marginean's "Lock-FreeQueues". His algorithm is correct, but the implementation is flawed and not thread-safe as noted and explained by Sutter in a follow-up "WritingLock-Free Code: A Corrected Queue". Our queue
is a correct one and although at the time of writing it I was unaware of Sutter's article, the resulting implementation is very close to his.

The Implementation is divided in 3 classes: SMDataQueueBase, SMDataQueueProducer, SMDataQueueConsumer. The base class provides all the shared bookkeeping structures and methods. The producer and the consumer classes expose just the methods highlighted before.

The process of en-queuing works like this:


As you might have noticed, head always points to a Node that has already been consumed (or is invalid on start). When the queue is empty head == tail but there always is a 'guard node' in the structure. In this way the producer and the consumer are always responsible only for their respective pointers and in fact the only point where they touch is the check head != tail. It is very important that operator= and operator!= are mutually atomic - never should operator!= return true if the object is in mid-copy and thus invalid.

In the next posts I'll cover the memory management part of the implementation.

Update: Building a shared memory IPC implementation - Part II

No comments:

Post a Comment