In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson, capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometimes you need to leave your JVM and go out of process. Inter Process Communication (IPC) is a different problem from inter-thread communication; can it be cracked by the same approach?
IPC, what's the problem?
Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:
- Faster than socket communication.
- An out of process integration option with applications written in other languages.
- A means of splitting large VMs into smaller ones, improving performance by allowing GC and JIT specialization.
Off the heap we go!
To do IPC we must go off heap. This has several implications for the queue, most importantly that references are not supported. Also note that a copy to and from the queue is required, though one could extend my implementation to support a zero-copy interaction where a struct is acquired, written and committed instead of using the offer method, and similarly acquired, read and finally released instead of using the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.
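As a rough sketch (the interface and method names here are hypothetical, not part of the current implementation), such a zero-copy API might look something like this:

```java
// Hypothetical zero-copy API sketch; not what the queue currently offers.
public interface ZeroCopySpscQueue {
    long acquireWrite();          // address of a writable slot, or -1 if the queue is full
    void commitWrite(long slot);  // publish the written slot to the consumer

    long acquireRead();           // address of a readable slot, or -1 if the queue is empty
    void releaseRead(long slot);  // hand the slot back to the producer
}
```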
My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory; there is no intention to use the persisted values, though further development in that direction may prove interesting to some. So now that I'd got me some shared memory, I had to put the queue in it.
I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:
- aligned access means writing data types to addresses which are a multiple of their size.
- unaligned access is not atomic, which is bad for concurrency :(
- unaligned access is slow, which is bad for performance :(
- unaligned access may not work, depending on OS and architecture. Not working is very bad :(
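To make the alignment rule concrete, here is a minimal sketch of the arithmetic involved (the class and method names are mine, for illustration only):

```java
public class AlignmentSketch {
    // Round an arbitrary address up to the next 64-byte cache line boundary.
    static long alignToCacheLine(long address) {
        return (address + 63) & ~63L;
    }

    // True if a value of the given size (a power of 2) written at this address
    // does not straddle a boundary, i.e. its reads/writes can be atomic.
    static boolean isAligned(long address, int size) {
        return (address & (size - 1)) == 0;
    }
}
```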
This is how the memory layout worked out. To illustrate in glorious ASCII graphics (each - is a byte), this is what it looks like when broken into cache lines:
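(The ordering of the counters shown below is illustrative; the point is that each counter and cached counter sits alone on its own 64-byte cache line, with the data buffer following.)

```
|--------|--------------------------------------------------------| <- head index  : 8 bytes + 56 bytes padding
|--------|--------------------------------------------------------| <- tail cache  : 8 bytes + 56 bytes padding
|--------|--------------------------------------------------------| <- tail index  : 8 bytes + 56 bytes padding
|--------|--------------------------------------------------------| <- head cache  : 8 bytes + 56 bytes padding
|----|----|----|----|----| ...                                      <- data        : 4 bytes per int, capacity slots
```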
I played around with mixing off-heap counters with an on-heap buffer, but in the interest of brevity I'll summarize and say the JVM does not like that very much, and the end result performance is not as good as either the all-heap or all-off-heap solutions. The code is available with everything else.
Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in its current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it. Let me summarize the recipe at this point (a rough code sketch follows the list):
- Create a memory mapped file large enough to hold:
- 4 cache lines for counters/cached counters.
- 4 bytes (per integer) * queue capacity (must be a power of 2).
- 1 spare cache line to ensure you can align the above to the cache line.
- Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
- Steal the address and get the contained aligned byte buffer.
- Set up pointers to the counters and to the beginning of the buffer.
- Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
- Replace use of array with use of offset pointers into buffer and Unsafe access.
- Test and debug until you work out the kinks...
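To make the recipe concrete, here is a minimal sketch of the setup steps. The file name, counter ordering and constants are mine for illustration; the real implementation is in the published code:

```java
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import sun.misc.Unsafe;

// Sketch of the recipe above; illustrative, not the published implementation.
public class OffHeapQueueSetupSketch {
    private static final int CACHE_LINE = 64;
    private static final Unsafe UNSAFE = getUnsafe();

    public static void main(String[] args) throws Exception {
        int capacity = 1 << 15;                        // must be a power of 2
        long size = 4 * CACHE_LINE                     // counters + cached counters
                  + 4L * capacity                      // 4 bytes per int element
                  + CACHE_LINE;                        // spare line to allow alignment

        // 1. Create/map a file large enough to hold the whole layout.
        try (RandomAccessFile file = new RandomAccessFile("/tmp/ipc-queue", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);

            // 2. Steal the native address of the mapped (direct) buffer.
            Field addressField = java.nio.Buffer.class.getDeclaredField("address");
            addressField.setAccessible(true);
            long address = addressField.getLong(buffer);

            // 3. Align to the next cache line, burning part of the spare line.
            long aligned = (address + CACHE_LINE - 1) & ~(CACHE_LINE - 1L);

            // 4. Pointers: one cache line per counter, then the data buffer.
            long headAddress      = aligned;
            long tailCacheAddress = aligned + CACHE_LINE;
            long tailAddress      = aligned + 2L * CACHE_LINE;
            long headCacheAddress = aligned + 3L * CACHE_LINE;
            long arrayBase        = aligned + 4L * CACHE_LINE;

            // 5. Counters are read/written via Unsafe through the pointers...
            long tail = UNSAFE.getLongVolatile(null, tailAddress);
            // ...and elements via offsets into the buffer instead of an array.
            UNSAFE.putInt(arrayBase + ((tail & (capacity - 1)) << 2), 777);
            UNSAFE.putOrderedLong(null, tailAddress, tail + 1);
        }
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```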
| Implementation/Affinity | Same core | Cross core | Cross socket |
|---|---|---|---|
Confused? Let me explain. The first line shows the measurements taken for the original queue. They are similar to what was presented in the previous post, though I saw a slight improvement in the results after increasing the compile threshold to 100000.
The second line is my off-heap implementation of the same algorithm. It is significantly faster. This is not IPC yet, this is in-process. The reason it is faster is that the data is inlined in the queue, which means that by loading an entry in the queue we get the data itself as opposed to a reference to the data. A reference is all you get when you have an Object array: the array holds the references and the data is elsewhere, which seems to become more painful the further the consumer gets from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive-array-backed queue, to compare performance like for like. As you can see it displays very similar results to the off-heap implementation, supporting the theory that data inlining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is expensive business, further amplified by the distance between the producing CPU and the consuming CPU.
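To make the distinction concrete, here is a toy comparison (illustrative only, not the benchmark code):

```java
import java.nio.ByteBuffer;

// Illustrative contrast between reference-based and inlined queue slots.
public class InliningVsReferences {
    public static void main(String[] args) {
        // Object[]-backed slot: the array holds a reference and the Integer
        // payload lives elsewhere on the heap, so the consumer pays for an
        // extra load (and potential cache miss) to chase it.
        Integer[] referenceSlots = new Integer[1024];
        referenceSlots[42] = 777;
        int viaReference = referenceSlots[42];

        // Inlined slot: the int itself is stored in the buffer, one load away.
        ByteBuffer inlinedSlots = ByteBuffer.allocateDirect(1024 * 4);
        inlinedSlots.putInt(42 * 4, 777);
        int viaInlining = inlinedSlots.getInt(42 * 4);

        System.out.println(viaReference + " == " + viaInlining);
    }
}
```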
The off-heap queue can also offer an alternative to native code integration, as the consuming thread (now potentially in another process) may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark, adapted to use a memory mapped file as the backing for the off-heap queue's DirectByteBuffer, we get:
- same core - ops/sec=135M
- across cores - ops/sec=98M
- across sockets - ops/sec=25M
JOY! A pure Java IPC queue that gives you 135M messages per second offers more throughput than you'd get from most commercial products out there. This is still not as fast as the same queue in-process, and I admit I'm not sure what the source of the performance difference is. Still, I am quite happy with it.
A few notes/observations from the experimentation process:
- I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
- The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
- Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
- Moving the queue allocation into the test loop changes the performance profile dramatically.
- I've not had time to fully explore the size of the queue as a variable in the experiment, but the little I've done suggests it makes a difference; choose the right size for your application.