<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book>
<title>The ZeroMQ Guide - for PHP Developers</title>
<bookinfo>
  <isbn><!-- ISBN goes here --></isbn>
</bookinfo>
<dedication>
<para><emphasis role="bold">By Pieter Hintjens</emphasis></para>

<para>Please use the <ulink url="https://github.com/imatix/zguide/issues">issue tracker</ulink> for all comments and errata. This version covers the latest stable release of &Oslash;MQ (3.2) and was published on Mon 12 November, 2012. If you are using older versions of &Oslash;MQ then some of the examples and explanations won't be accurate.</para>

<para>The Guide is originally <ulink url="http://zguide.zeromq.org/page:all">in C</ulink>, but also in <ulink url="http://zguide.zeromq.org/php:all">PHP</ulink>, <ulink url="http://zguide.zeromq.org/py:all">Python</ulink>, <ulink url="http://zguide.zeromq.org/lua:all">Lua</ulink>, and <ulink url="http://zguide.zeromq.org/hx:all">Haxe</ulink>. We've also translated most of the examples into C++, C#, CL, Delphi, Erlang, F#, Felix, Haskell, Java, Objective-C, Ruby, Ada, Basic, Clojure, Go, Haxe, Node.js, ooc, Perl, and Scala.</para>

</dedication>
<preface>
<title>Preface</title>
<sect1>
<title>&Oslash;MQ in a Hundred Words</title>
<para>&Oslash;MQ (also seen as ZeroMQ, 0MQ, zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fanout, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems. &Oslash;MQ is from <ulink url="http://www.imatix.com">iMatix</ulink> and is LGPLv3 open source.</para>

</sect1>
<sect1>
<title>The Zen of Zero</title>
<para>The &Oslash; in &Oslash;MQ is all about tradeoffs. On the one hand this strange name lowers &Oslash;MQ's visibility on Google and Twitter. On the other hand it annoys the heck out of some Danish folk who write us things like "&Oslash;MG r&oslash;tfl", and "&Oslash; is not a funny looking zero!" and "<emphasis>R&oslash;dgr&oslash;d med Fl&oslash;de!</emphasis>", which is apparently an insult that means "may your neighbours be the direct descendants of Grendel!"  Seems like a fair trade.</para>

<para>Originally the zero in &Oslash;MQ was meant as "zero broker" and (as close to) "zero latency" (as possible). In the meantime it has come to cover different goals: zero administration, zero cost, zero waste. More generally, "zero" refers to the culture of minimalism that permeates the project. We add power by removing complexity rather than exposing new functionality.</para>

</sect1>
<sect1>
<title>How the Guide Happened</title>
<para>In the summer of 2010, &Oslash;MQ was still a little-known niche library described by its rather terse man pages and a living but sparse wiki. Martin Sustrik and I (Pieter Hintjens) were sitting in the bar of the Hotel Kyjev in Bratislava plotting how to make &Oslash;MQ more widely popular. Martin had written most of the &Oslash;MQ code, and I'd put up the funding and organized the community. Over some Zlaty Bazants, we agreed that &Oslash;MQ needed a new, simpler web site, and a basic guide for new users.</para>

<para>Martin collected some ideas for topics to explain. I'd never written a line of &Oslash;MQ code before this, so it became a live learning documentary. As I worked through simple examples to more complex ones I tried to answer many of the questions I'd seen on the mailing list. Since I've been building large-scale architectures for 30 years, there were a lot of problems I was keen to throw &Oslash;MQ at. Amazingly the results were mostly simple and elegant, even working in C. I felt a pure joy learning &Oslash;MQ and using it to solve real problems, which brought me back to programming after a few years' pause. And often, not knowing how it was "supposed" to be done, we improved &Oslash;MQ as we went along.</para>

<para>From the start I wanted the Guide to be a community project, so I put it onto GitHub and let others contribute with pull requests. This was considered a radical, even vulgar approach by some. We came to a division of labor: I'd do the writing and make the original C examples, and others would help fix the text and translate the examples into other languages.</para>

<para>This worked better than I dared hope: you can now find all the examples in several languages and many in a dozen languages. It's a kind of programming language Rosetta stone and a valuable outcome in itself. We set up a high score: reach 80% translation and your language got its own Guide. PHP, Python, Lua, and Haxe reached this goal. People asked for PDFs, and we made those. People asked for ebooks, and got those. About a hundred people have contributed to the Guide to date.</para>

<para>The Guide achieved its goal of popularizing &Oslash;MQ. The style pleases most and annoys some, which is how it should be. In December 2010 my work on &Oslash;MQ and the Guide stopped, as I found myself going through late-stage cancer, heavy surgery, and six months of chemotherapy. When I picked up work again in mid-2011, it was to start using &Oslash;MQ in anger for one of the largest use-cases imaginable: on the mobile phones and tablets of the world's biggest electronics company.</para>

<para>But the goal of the Guide was, from the start, a printed book. So it was exciting to get an email from Bill Lubanovic in January 2012 introducing me to his editor Andy Oram at O'Reilly, suggesting a &Oslash;MQ book. Of course! Where do I sign? How much do I have to pay? Oh, I <emphasis>get money</emphasis> for this? All I have to do is finish it?</para>

<para>Of course as soon as O'Reilly announced a &Oslash;MQ book, other publishers started sending out emails to potential authors. You'll probably see a rash of &Oslash;MQ books coming out next year. That's good: our niche library has hit the mainstream and deserves its six inches of shelf space. My apologies to the other &Oslash;MQ authors: we've set the bar horribly high, and my advice is to make your books complementary. Perhaps focus on a specific language, platform, or pattern.</para>

<para>This is the magic and power of communities: be the first community in a space, stay healthy, and you own that space for ever.</para>

</sect1>
<sect1>
<title>Audience</title>
<para>This book is written for professional programmers who want to learn how to make the massively distributed software that will dominate the future of computing. We assume you can read C code, because most of the examples here are in C even though &Oslash;MQ is used in many languages. We assume you care about scale, because &Oslash;MQ solves that problem above all others. We assume you need the best possible results with the least possible cost, because otherwise you won't appreciate the trade-offs that &Oslash;MQ makes. Other than that basic background, we try to present all the concepts in networking and distributed computing you will need to use &Oslash;MQ.</para>

</sect1>
<sect1>
<title>Acknowledgements</title>
<para>Thanks to Andy Oram for making <ulink url="http://shop.oreilly.com/product/0636920026136.do">this happen at O'Reilly</ulink> and editing the book.</para>

<para>Thanks to Bill Desmarais, Brian Dorsey, Daniel Lin, Eric Desgranges, Gonzalo Diethelm, Guido Goldstein, Hunter Ford, Kamil Shakirov, Martin Sustrik, Mike Castleman, Naveen Chawla, Nicola Peduzzi, Oliver Smith, Olivier Chamoux, Peter Alexander, Pierre Rouleau, Randy Dryburgh, John Unwin, Alex Thomas, Mihail Minkov, Jeremy Avnet, Michael Compton, Kamil Kisiel, Mark Kharitonov, Guillaume Aubert, Ian Barber, Mike Sheridan, Faruk Akgul, Oleg Sidorov, Lev Givon, Allister MacLeod, Alexander D'Archangel, Andreas Hoelzlwimmer, Han Holl, Robert G. Jakabosky, Felipe Cruz, Marcus McCurdy, Mikhail Kulemin, Dr. Gerg&ouml; &Eacute;rdi, Pavel Zhukov, Alexander Else, Giovanni Ruggiero, Rick "Technoweenie", Daniel Lundin, Dave Hoover, Simon Jefford, Benjamin Peterson, Justin Case, Devon Weller, Richard Smith, Alexander Morland, Wadim Grasza, Michael Jakl, Uwe Dauernheim, Sebastian Nowicki, Simone Deponti, Aaron Raddon, Dan Colish, Markus Schirp, Benoit Larroque, Jonathan Palardy, Isaiah Peng, Arkadiusz Orzechowski, Umut Aydin, Matthew Horsfall, Jeremy W. Sherman, Eric Pugh, Tyler Sellon, John E. Vincent, Pavel Mitin, Min RK, Igor Wiedler, Olof &Aring;kesson, Patrick Lucas, Heow Goodman, Senthil Palanisami, John Gallagher, Tomas Roos, Stephen McQuay, Erik Allik, Arnaud Cogolu&egrave;gnes, Rob Gagnon, Dan Williams, Edward Smith, James Tucker, Kristian Kristensen, Vadim Shalts, Martin Trojer, Tom van Leeuwen, Pandya Hiten, Harm Aarts, Marc Harter, Iskren Ivov Chernev, Jay Han, Sonia Hamilton, Nathan Stocks, Naveen Palli, and Zed Shaw for their contributions.</para>

<para>Thanks to Stathis Sideris for <ulink url="http://www.ditaa.org">Ditaa</ulink>.</para>

</sect1>
</preface>
<chapter id="basics">
<title>Basics</title>
<sect1>
<title>Fixing the World</title>
<para>How to explain &Oslash;MQ? Some of us start by saying all the wonderful things it does. <emphasis>It's sockets on steroids. It's like mailboxes with routing. It's fast!</emphasis>  Others try to share their moment of enlightenment, that zap-pow-kaboom satori paradigm-shift moment when it all became obvious. <emphasis>Things just become simpler. Complexity goes away. It opens the mind.</emphasis>  Others try to explain by comparison. <emphasis>It's smaller, simpler, but still looks familiar.</emphasis>  Personally, I like to remember why we made &Oslash;MQ at all, because that's most likely where you, the reader, still are today.</para>

<para>Programming is a science dressed up as art, because most of us don't understand the physics of software, and it's rarely if ever taught. The physics of software is not algorithms, data structures, languages and abstractions. These are just tools we make, use, throw away. The real physics of software is the physics of people.</para>

<para>Specifically, our limitations when it comes to complexity, and our desire to work together to solve large problems in pieces. This is the science of programming: make building blocks that people can understand and use <emphasis>easily</emphasis>, and people will work together to solve the very largest problems.</para>

<para>We live in a connected world, and modern software has to navigate this world. So the building blocks for tomorrow's very largest solutions are connected and massively parallel. It's not enough for code to be "strong and silent" any more. Code has to talk to code. Code has to be chatty, sociable, well-connected. Code has to run like the human brain, trillions of individual neurons firing off messages to each other, a massively parallel network with no central control, no single point of failure, yet able to solve immensely difficult problems. And it's no accident that the future of code looks like the human brain, because the endpoints of every network are, at some level, human brains.</para>

<para>If you've done any work with threads, protocols, or networks, you'll realize this is pretty much impossible. It's a dream. Even connecting a few programs across a few sockets is plain nasty, when you start to handle real-life situations. Trillions? The cost would be unimaginable. Connecting computers is so difficult that software and services to do this are a multi-billion dollar business.</para>

<para>So we live in a world where the wiring is years ahead of our ability to use it. We had a software crisis in the 1980s, when leading software engineers like Fred Brooks believed <ulink url="http://en.wikipedia.org/wiki/No_Silver_Bullet">there was no "Silver Bullet"</ulink> to "promise even one order of magnitude of improvement in productivity, reliability, or simplicity".</para>

<para>Brooks missed free and open source software, which solved that crisis, enabling us to share knowledge efficiently. Today we face another software crisis, but it's one we don't talk about much. Only the largest, richest firms can afford to create connected applications. There is a cloud, but it's proprietary. Our data, our knowledge is disappearing from our personal computers into clouds that we cannot access, cannot compete with. Who owns our social networks? It is like the mainframe-PC revolution in reverse.</para>

<para>We can leave the political philosophy <ulink url="http://swsi.info">for another book</ulink>. The point is that while the Internet offers the potential of massively connected code, the reality is that this is out of reach for most of us, and so, large interesting problems (in health, education, economics, transport, and so on) remain unsolved because there is no way to connect the code, and thus no way to connect the brains that could work together to solve these problems.</para>

<para>There have been many attempts to solve the challenge of connected software. There are thousands of IETF specifications, each solving part of the puzzle. For application developers, HTTP is perhaps the one solution to have been simple enough to work, but it arguably makes the problem worse, by encouraging developers and architects to think in terms of big servers and thin, stupid clients.</para>

<para>So today people are still connecting applications using raw UDP and TCP, proprietary protocols, HTTP, WebSockets. It remains painful, slow, hard to scale, and essentially centralized. Distributed P2P architectures are mostly for play, not work. How many applications use Skype or BitTorrent to exchange data?</para>

<para>Which brings us back to the science of programming. To fix the world, we needed to do two things. One, to solve the general problem of "how to connect any code to any code, anywhere". Two, to wrap that up in the simplest possible building blocks that people could understand and use <emphasis>easily</emphasis>.</para>

<para>It sounds ridiculously simple. And maybe it is. That's kind of the whole point.</para>

</sect1>
<sect1>
<title>Audience for This Book</title>
<para>We assume you are using the latest 3.2 release of &Oslash;MQ. We assume you are using a Linux box or something similar. We assume you can read C code, more or less, since that's the default language for the examples. We assume that when we write constants like PUSH or SUBSCRIBE you can imagine they are really called <literal>ZMQ_PUSH</literal> or <literal>ZMQ_SUBSCRIBE</literal> if the programming language needs it.</para>

</sect1>
<sect1>
<title>Getting the Examples</title>
<para>The Guide examples live in the Guide's <ulink url="https://github.com/imatix/zguide">git repository</ulink>. The simplest way to get all the examples is to clone this repository:</para>

<screen>git clone --depth=1 git://github.com/imatix/zguide.git
</screen>

<para>And then browse the examples subdirectory. You'll find examples by language. If there are examples missing in a language you use, you're encouraged to <ulink url="http://zguide.zeromq.org/main:translate">submit a translation</ulink>. This is how the Guide became so useful, thanks to the work of many people. All examples are licensed under MIT/X11.</para>

</sect1>
<sect1>
<title>Ask and Ye Shall Receive</title>
<para>So let's start with some code. We start of course with a Hello World example. We'll make a client and a server. The client sends "Hello" to the server, which replies with "World" (<xref linkend="figure-1"/>). Here's the server in C, which opens a &Oslash;MQ socket on port 5555, reads requests on it, and replies with "World" to each request:</para>

<example id="hwserver-c">
<title>Hello World server (hwserver.c)</title>
<programlisting language="c">
//
//  Hello World server
//  Binds REP socket to tcp://*:5555
//  Expects "Hello" from client, replies with "World"
//
#include &lt;zmq.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;
#include &lt;string.h&gt;

int main (void)
{
    void *context = zmq_ctx_new ();

    //  Socket to talk to clients
    void *responder = zmq_socket (context, ZMQ_REP);
    zmq_bind (responder, "tcp://*:5555");

    while (1) {
        //  Wait for next request from client
        zmq_msg_t request;
        zmq_msg_init (&amp;request);
        zmq_msg_recv (&amp;request, responder, 0);
        printf ("Received Hello\n");
        zmq_msg_close (&amp;request);

        //  Do some 'work'
        sleep (1);

        //  Send reply back to client
        zmq_msg_t reply;
        zmq_msg_init_size (&amp;reply, 5);
        memcpy (zmq_msg_data (&amp;reply), "World", 5);
        zmq_msg_send (&amp;reply, responder, 0);
        zmq_msg_close (&amp;reply);
    }
    //  We never get here but if we did, this would be how we end
    zmq_close (responder);
    zmq_ctx_destroy (context);
    return 0;
}
</programlisting>

</example>
<figure id="figure-1">
    <title>Request-Reply</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig1.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The REQ-REP socket pair is in lockstep. The client issues <literal>zmq_msg_send()</literal> and then <literal>zmq_msg_recv()</literal>, in a loop (or once if that's all it needs). Doing any other sequence (e.g., sending two messages in a row) will result in a return code of -1 from the <literal>send</literal> or <literal>recv</literal> call. Similarly, the service issues <literal>zmq_msg_recv()</literal> and then <literal>zmq_msg_send()</literal> in that order, as often as it needs to.</para>
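<para>The lockstep rule can be pictured as a two-state machine. The sketch below is a toy model in plain C, not libzmq code: the names <literal>req_model_t</literal>, <literal>req_model_send</literal>, and <literal>req_model_recv</literal> are hypothetical, and the model only tracks which call is legal next. A real REQ socket reports this kind of misuse by returning -1 and setting <literal>errno</literal> to <literal>EFSM</literal>.</para>

<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

//  Toy model of a REQ socket: it only remembers whose turn it is.
//  All names here are hypothetical; this is not the libzmq API.
typedef struct {
    bool awaiting_reply;    //  True after a send, until the reply is read
} req_model_t;

//  Returns 0 if a send is legal now, -1 if a reply is still pending
int
req_model_send (req_model_t *self)
{
    assert (self);
    if (self->awaiting_reply)
        return -1;          //  Two sends in a row break the lockstep
    self->awaiting_reply = true;
    return 0;
}

//  Returns 0 if a receive is legal now, -1 if nothing was sent yet
int
req_model_recv (req_model_t *self)
{
    assert (self);
    if (!self->awaiting_reply)
        return -1;          //  There is no outstanding request
    self->awaiting_reply = false;
    return 0;
}
```
]]></programlisting>

<para>Starting from a fresh model, send then receive succeeds any number of times, while a second send before the receive returns -1.</para>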

<para>&Oslash;MQ uses C as its reference language and this is the main language we'll use for examples. If you're reading this on-line, the link below the example takes you to translations into other programming languages. Let's compare the same server in C++:</para>

<example id="hwserver-cpp">
<title>Hello World server (hwserver.cpp)</title>
<programlisting language="cpp">
//
//  Hello World server in C++
//  Binds REP socket to tcp://*:5555
//  Expects "Hello" from client, replies with "World"
//
#include &lt;zmq.hpp&gt;
#include &lt;string&gt;
#include &lt;cstring&gt;
#include &lt;iostream&gt;
#include &lt;unistd.h&gt;

int main () {
    //  Prepare our context and socket
    zmq::context_t context (1);
    zmq::socket_t socket (context, ZMQ_REP);
    socket.bind ("tcp://*:5555");

    while (true) {
        zmq::message_t request;

        //  Wait for next request from client
        socket.recv (&amp;request);
        std::cout &lt;&lt; "Received Hello" &lt;&lt; std::endl;

        //  Do some 'work'
        sleep (1);

        //  Send reply back to client
        zmq::message_t reply (5);
        memcpy ((void *) reply.data (), "World", 5);
        socket.send (reply);
    }
    return 0;
}
</programlisting>

</example>
<para>You can see that the &Oslash;MQ API is similar in C and C++. In a language like PHP, we can hide even more and the code becomes even easier to read:</para>

<example id="hwserver-php">
<title>Hello World server (hwserver.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Hello World server
 *  Binds REP socket to tcp://*:5555
 *  Expects "Hello" from client, replies with "World"
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext(1);

//  Socket to talk to clients
$responder = new ZMQSocket($context, ZMQ::SOCKET_REP);
$responder-&gt;bind("tcp://*:5555");

while(true) {
	//  Wait for next request from client
	$request = $responder-&gt;recv();
	printf ("Received request: [%s]\n", $request);

	//  Do some 'work'
	sleep (1);

	//  Send reply back to client
	$responder-&gt;send("World");
}
</programlisting>

</example>
<para>Here's the client code:</para>

<example id="hwclient-php">
<title>Hello World client (hwclient.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Hello World client
 *  Connects REQ socket to tcp://localhost:5555
 *  Sends "Hello" to server, expects "World" back
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to talk to server
echo "Connecting to hello world server...\n";
$requester = new ZMQSocket($context, ZMQ::SOCKET_REQ);
$requester-&gt;connect("tcp://localhost:5555");

for($request_nbr = 0; $request_nbr != 10; $request_nbr++) {
	printf ("Sending request %d...\n", $request_nbr);
	$requester-&gt;send("Hello");
	
	$reply = $requester-&gt;recv();
	printf ("Received reply %d: [%s]\n", $request_nbr, $reply);
}
</programlisting>

</example>
<para>Now this looks too simple to be realistic, but a &Oslash;MQ socket is what you get when you take a normal TCP socket, inject it with a mix of radioactive isotopes stolen from a secret Soviet atomic research project, bombard it with 1950-era cosmic rays, and put it into the hands of a drug-addled comic book author with a badly-disguised fetish for bulging muscles clad in spandex (<xref linkend="figure-2"/>). Yes, &Oslash;MQ sockets are the world-saving superheroes of the networking world.</para>

<figure id="figure-2">
    <title>A terrible accident...</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig2.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>You could throw thousands of clients at this server, all at once, and it would continue to work happily and quickly. For fun, try starting the client and <emphasis>then</emphasis> starting the server, see how it all still works, then think for a second what this means.</para>

<para>Let us explain briefly what these two programs are actually doing. They create a &Oslash;MQ context to work with, and a socket. Don't worry what the words mean. You'll pick it up. The server binds its REP (reply) socket to port 5555. The server waits for a request, in a loop, and responds each time with a reply. The client sends a request and reads the reply back from the server.</para>

<para>If you kill the server (Ctrl-C) and restart it, the client won't recover properly. Recovering from crashing processes isn't quite that easy. Making a reliable request-reply flow is complex enough that we won't cover it until Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/>.</para>

<para>There is a lot happening behind the scenes but what matters to us programmers is how short and sweet the code is, and how often it doesn't crash, even under heavy load. This is the request-reply pattern, probably the simplest way to use &Oslash;MQ. It maps to RPC and the classic client-server model.</para>

</sect1>
<sect1>
<title>A Minor Note on Strings</title>
<para>&Oslash;MQ doesn't know anything about the data you send except its size in bytes. That means you are responsible for formatting it safely so that applications can read it back. Doing this for objects and complex data types is a job for specialized libraries like Protocol Buffers. But even for strings you need to take care.</para>

<para>In C and some other languages, strings are terminated with a null byte. We could send a string like "Hello" with that extra null byte:</para>

<programlisting language="c">
zmq_msg_init_data (&amp;request, "Hello", 6, NULL, NULL);
</programlisting>

<para>However if you send a string from another language it probably will not include that null byte. For example, when we send that same string in Python, we do this:</para>

<programlisting language="python">
socket.send ("Hello")
</programlisting>

<para>Then what goes onto the wire is a length (one byte for shorter strings) and the string contents, as individual characters (<xref linkend="figure-3"/>).</para>

<figure id="figure-3">
    <title>A &Oslash;MQ string</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig3.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>And if you read this from a C program, you will get something that looks like a string, and might by accident act like a string (if by luck the five bytes find themselves followed by an innocently lurking null), but isn't a proper string. When your client and server don't agree on the string format, you will get weird results.</para>

<para>When you receive string data from &Oslash;MQ, in C, you simply cannot trust that it's safely terminated. Every single time you read a string you should allocate a new buffer with space for an extra byte, copy the string, and terminate it properly with a null.</para>

<para>So let's establish the rule that <emphasis role="bold">&Oslash;MQ strings are length-specified, and are sent on the wire <emphasis>without</emphasis> a trailing null</emphasis>. In the simplest case (and we'll do this in our examples) a &Oslash;MQ string maps neatly to a &Oslash;MQ message frame, which looks like the above figure, a length and some bytes.</para>
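<para>To make the rule concrete, here is a small sketch in plain C that models the layout in the figure, a one-byte length followed by the characters, and turns it back into a proper C string. The helper names <literal>frame_pack</literal> and <literal>frame_unpack</literal> are hypothetical, and the layout is a simplification for illustration, not the exact bytes that libzmq puts on the wire.</para>

<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

//  Pack a C string as [length][bytes...], with no trailing null.
//  Hypothetical helper; handles only strings shorter than 256 bytes.
//  Returns the number of bytes written to 'frame', or 0 on error.
size_t
frame_pack (const char *string, unsigned char *frame, size_t capacity)
{
    assert (string && frame);
    size_t length = strlen (string);
    if (length > 255 || capacity < length + 1)
        return 0;
    frame [0] = (unsigned char) length;
    memcpy (frame + 1, string, length);
    return length + 1;
}

//  Unpack a frame into a freshly allocated, null-terminated C string.
//  The caller must free the result. Returns NULL on error.
char *
frame_unpack (const unsigned char *frame, size_t frame_size)
{
    assert (frame);
    if (frame_size == 0 || (size_t) frame [0] + 1 > frame_size)
        return NULL;
    size_t length = frame [0];
    char *string = malloc (length + 1);
    if (!string)
        return NULL;
    memcpy (string, frame + 1, length);
    string [length] = 0;    //  Add the terminator the frame did not carry
    return string;
}
```
]]></programlisting>

<para>Packing "Hello" this way yields six bytes: the length 5 followed by the five characters. Unpacking allocates six bytes again, but this time the sixth is the null terminator that C needs.</para>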

<para>Here is what we need to do, in C, to receive a &Oslash;MQ string and deliver it to the application as a valid C string:</para>

<programlisting language="c">
//  Receive 0MQ string from socket and convert into C string
static char *
s_recv (void *socket) {
    zmq_msg_t message;
    zmq_msg_init (&amp;message);
    int size = zmq_msg_recv (&amp;message, socket, 0);
    if (size == -1)
        return NULL;
    char *string = malloc (size + 1);
    memcpy (string, zmq_msg_data (&amp;message), size);
    zmq_msg_close (&amp;message);
    string [size] = 0;
    return (string);
}
</programlisting>

<para>This makes a very handy helper function and in the spirit of making things we can reuse profitably, let's write a similar <literal>s_send</literal> function that sends strings in the correct &Oslash;MQ format, and package this into a header file we can reuse.</para>

<para>The result is <literal>zhelpers.h</literal>, which lets us write sweeter and shorter &Oslash;MQ applications in C. It is a fairly long source, and only fun for C developers, so <ulink url="https://github.com/imatix/zguide/blob/master/examples/C/zhelpers.h">read it at leisure</ulink>.</para>

</sect1>
<sect1>
<title>Version Reporting</title>
<para>&Oslash;MQ does come in several versions and quite often, if you hit a problem, it'll be something that's been fixed in a later version. So it's a useful trick to know <emphasis>exactly</emphasis> what version of &Oslash;MQ you're actually linking with. Here is a tiny program that does that:</para>

<example id="version-php">
<title>&Oslash;MQ version reporting (version.php)</title>
<programlisting language="php">
&lt;?php
/* Report 0MQ version
 *
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

if(class_exists("ZMQ") &amp;&amp; defined("ZMQ::LIBZMQ_VER")) {
	echo ZMQ::LIBZMQ_VER, PHP_EOL;
}
</programlisting>

</example>
</sect1>
<sect1>
<title>Getting the Message Out</title>
<para>The second classic pattern is one-way data distribution, in which a server pushes updates to a set of clients. Let's see an example that pushes out weather updates consisting of a zip code, temperature, and relative humidity. We'll generate random values, just like the real weather stations do.</para>

<para>Here's the server. We'll use port 5556 for this application:</para>

<example id="wuserver-php">
<title>Weather update server (wuserver.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Weather update server
 *  Binds PUB socket to tcp://*:5556
 *  Publishes random weather updates
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  Prepare our context and publisher
$context = new ZMQContext();
$publisher = $context-&gt;getSocket(ZMQ::SOCKET_PUB);
$publisher-&gt;bind("tcp://*:5556");
$publisher-&gt;bind("ipc://weather.ipc");

while (true) {
	//  Get values that will fool the boss
	$zipcode     = mt_rand(0, 99999);
	$temperature = mt_rand(-80, 135);
	$relhumidity = mt_rand(10, 60);

	//  Send message to all subscribers
	$update = sprintf ("%05d %d %d", $zipcode, $temperature, $relhumidity);
	$publisher-&gt;send($update);
}
</programlisting>

</example>
<para>There's no start and no end to this stream of updates; it's like a never-ending broadcast (<xref linkend="figure-4"/>).</para>

<figure id="figure-4">
    <title>Publish-Subscribe</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig4.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Here is the client application, which listens to the stream of updates and grabs anything to do with a specified zip code, by default New York City because that's a great place to start any adventure:</para>

<example id="wuclient-php">
<title>Weather update client (wuclient.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Weather update client
 *  Connects SUB socket to tcp://localhost:5556
 *  Collects weather updates and finds avg temp in zipcode
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to talk to server
echo "Collecting updates from weather server...", PHP_EOL;
$subscriber = new ZMQSocket($context, ZMQ::SOCKET_SUB);
$subscriber-&gt;connect("tcp://localhost:5556");

//  Subscribe to zipcode, default is NYC, 10001
$filter = $_SERVER['argc'] &gt; 1 ? $_SERVER['argv'][1] : "10001";
$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, $filter);

//  Process 100 updates
$total_temp = 0;
for ($update_nbr = 0; $update_nbr &lt; 100; $update_nbr++) {
	$string = $subscriber-&gt;recv();
	sscanf ($string, "%d %d %d", $zipcode, $temperature, $relhumidity);
	$total_temp += $temperature;
}
printf ("Average temperature for zipcode '%s' was %dF\n",
    $filter, (int) ($total_temp / $update_nbr));
</programlisting>

</example>
<para>Note that when you use a SUB socket you <emphasis role="bold">must</emphasis> set a subscription using <literal>zmq_setsockopt()</literal> and SUBSCRIBE, as in this code. If you don't set any subscription, you won't get any messages. It's a common mistake for beginners. The subscriber can set many subscriptions, which are added together. That is, if an update matches ANY subscription, the subscriber receives it. The subscriber can also cancel specific subscriptions. A subscription is often but not necessarily a printable string. See <literal>zmq_setsockopt()</literal> for how this works.</para>
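<para>Subscriptions are matched as prefixes of the message body, and an empty subscription matches everything. The following plain C sketch models how several subscriptions combine. The function names <literal>prefix_matches</literal> and <literal>matches_any</literal> are hypothetical, and the real filtering happens inside &Oslash;MQ, not in your code.</para>

<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

//  True if 'message' begins with the bytes of 'prefix'.
//  An empty prefix matches every message.
bool
prefix_matches (const char *message, size_t message_size,
                const char *prefix, size_t prefix_size)
{
    if (prefix_size > message_size)
        return false;
    return memcmp (message, prefix, prefix_size) == 0;
}

//  True if the message matches ANY of the subscriptions.
//  With no subscriptions at all, nothing is delivered.
bool
matches_any (const char *message,
             const char *subscriptions [], size_t count)
{
    assert (message);
    size_t message_size = strlen (message);
    for (size_t index = 0; index < count; index++) {
        const char *prefix = subscriptions [index];
        if (prefix_matches (message, message_size, prefix, strlen (prefix)))
            return true;
    }
    return false;
}
```
]]></programlisting>

<para>With the subscriptions "10001" and "90210", an update starting with either zip code is delivered and one for "20500" is dropped. With zero subscriptions, every update is dropped, which is the beginner's mistake described above.</para>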

<para>The PUB-SUB socket pair is asynchronous. The client does <literal>zmq_msg_recv()</literal>, in a loop (or once if that's all it needs). Trying to send a message to a SUB socket will cause an error. Similarly the service does <literal>zmq_msg_send()</literal> as often as it needs to, but must not do <literal>zmq_msg_recv()</literal> on a PUB socket.</para>

<para>In theory with &Oslash;MQ sockets, it does not matter which end connects, and which end binds. However in practice there are undocumented differences that I'll come to later. For now, bind the PUB and connect the SUB, unless your network design makes that impossible.</para>

<para>There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, <emphasis role="bold">the subscriber will always miss the first messages that the publisher sends</emphasis>. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.</para>

<para>This "slow joiner" symptom hits enough people, often enough, that we're going to explain it in detail. Remember that &Oslash;MQ does asynchronous I/O, i.e. in the background. Say you have two nodes doing this, in this order:</para>

<itemizedlist>
  <listitem><para>Subscriber connects to an endpoint and receives and counts messages.</para></listitem>
  <listitem><para>Publisher binds to an endpoint and immediately sends 1,000 messages.</para></listitem>
</itemizedlist>
<para>Then the subscriber will most likely not receive anything. You'll blink, check that you set a correct filter, and try again, and the subscriber will still not receive anything.</para>

<para>Making a TCP connection involves to and fro handshaking that takes several milliseconds depending on your network and the number of hops between peers. In that time, &Oslash;MQ can send very many messages. For the sake of argument, assume it takes 5 msecs to establish a connection, and that the same link can handle 1M messages per second. During the 5 msecs that the subscriber is connecting to the publisher, it takes the publisher only 1 msec to send out those 1K messages.</para>
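
<para>The arithmetic can be spelled out in a few lines of plain PHP. The figures are the for-the-sake-of-argument numbers from the paragraph above, not measurements, and the variable names are only illustrative:</para>

<programlisting language="php">
&lt;?php
//  Assumed figures from the text, not measurements
$connect_msec    = 5;          //  time to establish the TCP connection
$msgs_per_second = 1000000;    //  what the link can carry
$batch_size      = 1000;       //  messages the publisher sends at once

//  Time the publisher needs to push out the whole batch
$send_msec = $batch_size / $msgs_per_second * 1000;

//  Messages already sent before the subscriber is connected
$missed = min($batch_size, $msgs_per_second * $connect_msec / 1000);

printf("Batch sent in %.1f msec, subscriber ready after %d msec\n",
    $send_msec, $connect_msec);
printf("Messages missed: %d of %d\n", $missed, $batch_size);
</programlisting>

<para>With these numbers the batch is gone after 1 msec, so all 1,000 messages are sent before the subscriber has finished connecting.</para>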

<para>In Sockets and Patterns<xref linkend="sockets-and-patterns"/> we'll explain how to synchronize a publisher and subscribers so that you don't start to publish data until the subscriber(s) really are connected and ready. There is a simple and stupid way to delay the publisher, which is to sleep. Don't do this in a real application, though, because it is extremely fragile as well as inelegant and slow. Use sleeps to prove to yourself what's happening, and then wait for Sockets and Patterns<xref linkend="sockets-and-patterns"/> to see how to do this right.</para>
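
<para>For experimenting only, the sleep could look like the sketch below. It assumes the php-zmq binding; the endpoint, the one-second delay, and the message text are arbitrary choices made for this illustration:</para>

<programlisting language="php">
&lt;?php
//  Demonstration only: sleeping is fragile, do not do this in a real application
$context = new ZMQContext();
$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$publisher-&gt;bind("tcp://*:5556");

//  Give subscribers time to connect (one second is a guess, not a guarantee)
sleep(1);

for ($update_nbr = 0; $update_nbr &lt; 1000; $update_nbr++) {
    $publisher-&gt;send(sprintf("10001 update %d", $update_nbr));
}
</programlisting>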

<para>The alternative to synchronization is to simply assume that the published data stream is infinite and has no start, and no end. One also assumes that the subscriber doesn't care what transpired before it started up. This is how we built our weather client example.</para>

<para>So the client subscribes to its chosen zip code and collects a thousand updates for that zip code. That means about ten million updates from the server, if zip codes are randomly distributed. You can start the client, and then the server, and the client will keep working. You can stop and restart the server as often as you like, and the client will keep working. When the client has collected its thousand updates, it calculates the average, prints it, and exits.</para>

<para>Some points about the publish-subscribe pattern:</para>

<itemizedlist>
  <listitem><para>A subscriber can connect to more than one publisher, using one 'connect' call each time. Data will then arrive and be interleaved ("fair-queued") so that no single publisher drowns out the others.</para></listitem>
  <listitem><para>If a publisher has no connected subscribers, then it will simply drop all messages.</para></listitem>
  <listitem><para>If you're using TCP, and a subscriber is slow, messages will queue up on the publisher. We'll look at how to protect publishers against this, using the "high-water mark" later.</para></listitem>
  <listitem><para>From &Oslash;MQ 3.x, filtering happens at the publisher side, when using a connected protocol (<literal>tcp://</literal> or <literal>ipc://</literal>). Using the <literal>epgm://</literal> protocol, filtering happens at the subscriber side. In &Oslash;MQ/2.x, all filtering happened at the subscriber side.</para></listitem>
</itemizedlist>
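
<para>The first point can be sketched as follows, assuming the php-zmq binding and two hypothetical publishers on ports 5556 and 5566. The <literal>getEndpoints</literal> method is used here only to show what the socket is connected to:</para>

<programlisting language="php">
&lt;?php
//  Sketch: both endpoints are hypothetical publishers
$context = new ZMQContext();
$subscriber = new ZMQSocket($context, ZMQ::SOCKET_SUB);

//  One connect call per publisher; incoming updates are fair-queued
$subscriber-&gt;connect("tcp://localhost:5556");
$subscriber-&gt;connect("tcp://localhost:5566");
$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "10001");

//  List the endpoints this socket is connected to
$endpoints = $subscriber-&gt;getEndpoints();
foreach ($endpoints['connect'] as $endpoint) {
    echo "Connected to ", $endpoint, PHP_EOL;
}

//  From here on, $subscriber-&gt;recv() returns the next matching update,
//  whichever publisher it came from
</programlisting>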
<para>This is how long it takes to receive and filter 10M messages on my laptop, which is a 2011-era Intel i7, fast but nothing special:</para>

<screen>ph@nb201103:~/work/git/zguide/examples/c$ time wuclient
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 28F

real    0m4.470s
user    0m0.000s
sys     0m0.008s
</screen>

</sect1>
<sect1>
<title>Divide and Conquer</title>
<para>As a final example (you are surely getting tired of juicy code and want to delve back into philological discussions about comparative abstractive norms), let's do a little supercomputing. Then coffee. Our supercomputing application is a fairly typical parallel processing model(<xref linkend="figure-5"/>). We have:</para>

<itemizedlist>
  <listitem><para>A ventilator that produces tasks that can be done in parallel</para></listitem>
  <listitem><para>A set of workers that process tasks</para></listitem>
  <listitem><para>A sink that collects results back from the worker processes</para></listitem>
</itemizedlist>
<para>In reality, workers run on superfast boxes, perhaps using GPUs (graphics processing units) to do the hard math. Here is the ventilator. It generates 100 tasks, each a message telling the worker to sleep for some number of milliseconds:</para>

<example id="taskvent-php">
<title>Parallel task ventilator (taskvent.php)</title>
<programlisting language="php">
&lt;?php 
/*
 *  Task ventilator
 *  Binds PUSH socket to tcp://localhost:5557
 *  Sends batch of tasks to workers via that socket
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to send messages on
$sender = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
$sender-&gt;bind("tcp://*:5557");

echo "Press Enter when the workers are ready: ";
$fp = fopen('php://stdin', 'r');
$line = fgets($fp, 512);
fclose($fp);
echo "Sending tasks to workers...", PHP_EOL;

//  The first message is "0" and signals start of batch
$sender-&gt;send(0);
    
//  Send 100 tasks
$total_msec = 0;     //  Total expected cost in msecs
for ($task_nbr = 0; $task_nbr &lt; 100; $task_nbr++) {
	//  Random workload from 1 to 100msecs
	$workload = mt_rand(1, 100);
	$total_msec += $workload;
	$sender-&gt;send($workload);
	
}

printf ("Total expected cost: %d msec\n", $total_msec);
sleep (1);              //  Give 0MQ time to deliver
</programlisting>

</example>
<figure id="figure-5">
    <title>Parallel Pipeline</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig5.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Here is the worker application. It receives a message, sleeps for that number of milliseconds, then signals that it's finished:</para>

<example id="taskwork-php">
<title>Parallel task worker (taskwork.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Task worker
 *  Connects PULL socket to tcp://localhost:5557
 *  Collects workloads from ventilator via that socket
 *  Connects PUSH socket to tcp://localhost:5558
 *  Sends results to sink via that socket
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to receive messages on
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;connect("tcp://localhost:5557");

//  Socket to send messages to
$sender = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
$sender-&gt;connect("tcp://localhost:5558");

//  Process tasks forever
while (true) {
	$string = $receiver-&gt;recv();

	//  Simple progress indicator for the viewer
	echo $string, PHP_EOL;

	//  Do the work (the workload is a number of milliseconds)
	usleep($string * 1000);

	//  Send results to sink
	$sender-&gt;send("");
}
</programlisting>

</example>
<para>Here is the sink application. It collects the 100 tasks, then calculates how long the overall processing took, so we can confirm that the workers really were running in parallel, if there are more than one of them:</para>

<example id="tasksink-php">
<title>Parallel task sink (tasksink.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Task sink
 *  Binds PULL socket to tcp://localhost:5558
 *  Collects results from workers via that socket
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  Prepare our context and socket
$context = new ZMQContext();
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;bind("tcp://*:5558");

//  Wait for start of batch
$string = $receiver-&gt;recv();

//  Start our clock now
$tstart = microtime(true);

//  Process 100 confirmations
$total_msec = 0;     //  Total calculated cost in msecs
for ($task_nbr = 0; $task_nbr &lt; 100; $task_nbr++) {
	$string = $receiver-&gt;recv();
	if($task_nbr % 10 == 0) {
		echo ":";
	} else {
		echo ".";
	}
}

$tend = microtime(true);

$total_msec = ($tend - $tstart) * 1000;
echo PHP_EOL;
printf ("Total elapsed time: %d msec", $total_msec);
echo PHP_EOL;
</programlisting>

</example>
<para>The average cost of a batch is 5 seconds. When we start 1, 2, or 4 workers we get results like this from the sink:</para>

<screen>#   1 worker
Total elapsed time: 5034 msec
#   2 workers
Total elapsed time: 2421 msec
#   4 workers
Total elapsed time: 1018 msec
</screen>
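
<para>The 5-second figure follows from the ventilator code: 100 tasks, each with a workload drawn from <literal>mt_rand(1, 100)</literal>. A back-of-the-envelope check, assuming perfectly even distribution over the workers and ignoring messaging overhead, might look like this:</para>

<programlisting language="php">
&lt;?php
$tasks = 100;
$mean_workload_msec = (1 + 100) / 2;          //  mt_rand(1, 100) averages 50.5
$batch_msec = $tasks * $mean_workload_msec;   //  5050 msec for the whole batch

//  Idealized elapsed time if the work splits evenly over the workers
foreach (array(1, 2, 4) as $workers) {
    printf("%d worker(s): about %d msec\n", $workers, $batch_msec / $workers);
}
</programlisting>

<para>Each run draws different random workloads, so the measured times will wander around these idealized values.</para>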

<para>Let's look at some aspects of this code in more detail:</para>

<itemizedlist>
  <listitem><para>The workers connect upstream to the ventilator, and downstream to the sink. This means you can add workers arbitrarily. If the workers bound to their endpoints, you would need (a) more endpoints and (b) to modify the ventilator and/or the sink each time you added a worker. We say that the ventilator and sink are 'stable' parts of our architecture and the workers are 'dynamic' parts of it.</para></listitem>
  <listitem><para>We have to synchronize the start of the batch with all workers being up and running. This is a fairly common gotcha in &Oslash;MQ and there is no easy solution. The 'connect' method takes a certain time. So when a set of workers connect to the ventilator, the first one to successfully connect will get a whole load of messages in that short time while the others are also connecting. If you don't synchronize the start of the batch somehow, the system won't run in parallel at all. Try removing the wait, and see.</para></listitem>
  <listitem><para>The ventilator's PUSH socket distributes tasks to workers (assuming they are all connected <emphasis>before</emphasis> the batch starts going out) evenly. This is called <emphasis>load-balancing</emphasis> and it's something we'll look at again in more detail.</para></listitem>
  <listitem><para>The sink's PULL socket collects results from workers evenly. This is called <emphasis>fair-queuing</emphasis>(<xref linkend="figure-6"/>).</para></listitem>
</itemizedlist>
<figure id="figure-6">
    <title>Fair Queuing</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig6.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The pipeline pattern also exhibits the "slow joiner" syndrome, leading to accusations that PUSH sockets don't load balance properly. If you are using PUSH and PULL, and one of your workers gets way more messages than the others, it's because that PULL socket has joined faster than the others, and grabs a lot of messages before the others manage to connect.</para>

</sect1>
<sect1>
<title>Programming with &Oslash;MQ</title>
<para>Having seen some examples, you're eager to start using &Oslash;MQ in some apps. Before you start that, take a deep breath, chillax, and reflect on some basic advice that will save you stress and confusion.</para>

<itemizedlist>
  <listitem><para>Learn &Oslash;MQ step by step. It's just one simple API but it hides a world of possibilities. Take the possibilities slowly, master each one.</para></listitem>
  <listitem><para>Write nice code. Ugly code hides problems and makes it hard for others to help you. You might get used to meaningless variable names, but people reading your code won't. Use names that are real words, that say something other than "I'm too careless to tell you what this variable is really for". Use consistent indentation, clean layout. Write nice code and your world will be more comfortable.</para></listitem>
  <listitem><para>Test what you make as you make it. When your program doesn't work, you should know what five lines are to blame. This is especially true when you do &Oslash;MQ magic, which just <emphasis>won't</emphasis> work the first few times you try it.</para></listitem>
  <listitem><para>When you find that things don't work as expected, break your code into pieces, test each one, and see which one is not working. &Oslash;MQ lets you make essentially modular code; use that to your advantage.</para></listitem>
  <listitem><para>Make abstractions (classes, methods, whatever) as you need them. If you copy/paste a lot of code you're going to copy/paste errors too.</para></listitem>
</itemizedlist>
<para>To illustrate, here is a fragment of code someone asked me to help fix:</para>

<programlisting language="c">
//  NOTE: do NOT reuse this example code!
static char *topic_str = "msg.x|";

void* pub_worker(void* arg){
    void *ctx = arg;
    assert(ctx);

    void *qskt = zmq_socket(ctx, ZMQ_REP);
    assert(qskt);

    int rc = zmq_connect(qskt, "inproc://querys");
    assert(rc == 0);

    void *pubskt = zmq_socket(ctx, ZMQ_PUB);
    assert(pubskt);

    rc = zmq_bind(pubskt, "inproc://publish");
    assert(rc == 0);

    uint8_t cmd;
    uint32_t nb;
    zmq_msg_t topic_msg, cmd_msg, nb_msg, resp_msg;

    zmq_msg_init_data(&amp;topic_msg, topic_str, strlen(topic_str) , NULL, NULL);

    fprintf(stdout,"WORKER: ready to receive messages\n");
    //  NOTE: do NOT reuse this example code, It's broken.
    //  e.g. topic_msg will be invalid the second time through
    while (1){
    zmq_msg_send(pubskt, &amp;topic_msg, ZMQ_SNDMORE);

    zmq_msg_init(&amp;cmd_msg);
    zmq_msg_recv(qskt, &amp;cmd_msg, 0);
    memcpy(&amp;cmd, zmq_msg_data(&amp;cmd_msg), sizeof(uint8_t));
    zmq_msg_send(pubskt, &amp;cmd_msg, ZMQ_SNDMORE);
    zmq_msg_close(&amp;cmd_msg);

    fprintf(stdout, "received cmd %u\n", cmd);

    zmq_msg_init(&amp;nb_msg);
    zmq_msg_recv(qskt, &amp;nb_msg, 0);
    memcpy(&amp;nb, zmq_msg_data(&amp;nb_msg), sizeof(uint32_t));
    zmq_msg_send(pubskt, &amp;nb_msg, 0);
    zmq_msg_close(&amp;nb_msg);

    fprintf(stdout, "received nb %u\n", nb);

    zmq_msg_init_size(&amp;resp_msg, sizeof(uint8_t));
    memset(zmq_msg_data(&amp;resp_msg), 0, sizeof(uint8_t));
    zmq_msg_send(qskt, &amp;resp_msg, 0);
    zmq_msg_close(&amp;resp_msg);

    }
    return NULL;
}
</programlisting>

<para>This is what I rewrote it to, as part of finding the bug:</para>

<programlisting language="c">
static void *
worker_thread (void *arg) {
    void *context = arg;
    void *worker = zmq_socket (context, ZMQ_REP);
    assert (worker);
    int rc;
    rc = zmq_connect (worker, "ipc://worker");
    assert (rc == 0);

    void *broadcast = zmq_socket (context, ZMQ_PUB);
    assert (broadcast);
    rc = zmq_bind (broadcast, "ipc://publish");
    assert (rc == 0);

    while (1) {
        char *part1 = s_recv (worker);
        char *part2 = s_recv (worker);
        printf ("Worker got [%s][%s]\n", part1, part2);
        s_sendmore (broadcast, "msg");
        s_sendmore (broadcast, part1);
        s_send     (broadcast, part2);
        free (part1);
        free (part2);

        s_send (worker, "OK");
    }
    return NULL;
}
</programlisting>

<para>In the end, the problem was that the application was passing sockets between threads, which crashes weirdly. Sockets are not threadsafe. It became legal behavior to migrate sockets from one thread to another in &Oslash;MQ/2.1, but this remains dangerous unless you use a "full memory barrier". If you don't know what that means, don't attempt socket migration.</para>

<sect2>
<title>Getting the Context Right</title>
<para>&Oslash;MQ applications always start by creating a <emphasis>context</emphasis>, and then using that for creating sockets. In C, it's the <literal>zmq_ctx_new[3]</literal> call. You should create and use exactly one context in your process. Technically, the context is the container for all sockets in a single process, and acts as the transport for <literal>inproc</literal> sockets, which are the fastest way to connect threads in one process. If at runtime a process has two contexts, these are like separate &Oslash;MQ instances. If that's explicitly what you want, OK, but otherwise remember:</para>

<para><emphasis role="bold">Do one <literal>zmq_ctx_new[3]</literal> at the start of your main line code, and one <literal>zmq_ctx_destroy[3]</literal> at the end.</emphasis></para>

<para>If you're using the <literal>fork()</literal> system call, each process needs its own context. Do <literal>zmq_ctx_new[3]</literal> after the <literal>fork()</literal>, at the start of the child process code: <literal>fork()</literal> copies only the calling thread, so a context created in the parent does not bring its background I/O threads along into the child. In general you want to do the interesting stuff in the child processes, and just manage these from the parent process.</para>
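
<para>In PHP, one conservative way to make sure each process has its own context is to create it in the child, after forking. The sketch below assumes the pcntl and php-zmq extensions and the command-line SAPI; the sink endpoint and the message text are hypothetical:</para>

<programlisting language="php">
&lt;?php
//  Sketch: assumes the pcntl and zmq extensions, run from the command line
$pid = pcntl_fork();
if ($pid == -1) {
    die("Could not fork" . PHP_EOL);
}
if ($pid == 0) {
    //  Child process: create the context here, after the fork
    $context = new ZMQContext();
    $sender = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
    //  Do not wait forever at exit if the (hypothetical) sink is absent
    $sender-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 1000);
    $sender-&gt;connect("tcp://localhost:5558");
    $sender-&gt;send("hello from child " . getmypid());
    exit(0);
}

//  Parent process: no 0MQ work here, it only waits for the child
pcntl_waitpid($pid, $status);
echo "Child $pid finished", PHP_EOL;
</programlisting>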

</sect2>
<sect2>
<title>Making a Clean Exit</title>
<para>Classy programmers share the same motto as classy hit men: always clean up when you finish the job. When you use &Oslash;MQ in a language like Python, stuff gets automatically freed for you. But when using C you have to carefully free objects when you're finished with them, or you get memory leaks, unstable applications, and generally bad karma.</para>

<para>Memory leaks are one thing, but &Oslash;MQ is quite finicky about how you exit an application. The reasons are technical and painful but the upshot is that if you leave any sockets open, the <literal>zmq_ctx_destroy[3]</literal> function will hang forever. And even if you close all sockets, <literal>zmq_ctx_destroy[3]</literal> will by default wait forever if there are pending connects or sends, unless you set the LINGER to zero on those sockets before closing them.</para>

<para>The &Oslash;MQ objects we need to worry about are messages, sockets, and contexts. Luckily it's quite simple, at least in simple programs:</para>

<itemizedlist>
  <listitem><para>Always close a message the moment you are done with it, using <literal>zmq_msg_close[3]</literal>.</para></listitem>
  <listitem><para>If you are opening and closing a lot of sockets, that's probably a sign you need to redesign your application.</para></listitem>
  <listitem><para>When you exit the program, close your sockets and then call <literal>zmq_ctx_destroy[3]</literal>. This destroys the context.</para></listitem>
</itemizedlist>
<para>This is at least for C development. In a language with automatic object destruction, sockets and contexts will be destroyed as you leave the scope. If you use exceptions you'll have to do the clean-up in something like a "final" block, the same as for any resource.</para>
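
<para>In PHP this might look like the sketch below. It assumes the php-zmq binding, PHP 5.5 or later for <literal>finally</literal>, and a hypothetical sink on port 5558; the one-second LINGER is an arbitrary choice:</para>

<programlisting language="php">
&lt;?php
$context = new ZMQContext();
$sender = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
$sender-&gt;connect("tcp://localhost:5558");

try {
    $sender-&gt;send("last result");
} finally {
    //  Wait at most one second for pending messages, then give up on them
    $sender-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 1000);
    //  Drop our reference; the binding is expected to close the socket
    //  when the object is destroyed
    unset($sender);
}
</programlisting>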

<para>If you're doing multithreaded work, it gets rather more complex than this. We'll get to multithreading in the next chapter, but because some of you will, despite warnings, try to run before you can safely walk, below is the quick and dirty guide to making a clean exit in a <emphasis>multithreaded</emphasis> &Oslash;MQ application.</para>

<para>First, do not try to use the same socket from multiple threads. No, don't explain why you think this would be excellent fun, just please don't do it. Next, you need to shut down each socket that has ongoing requests. The proper way is to set a low LINGER value (1 second), then close the socket. If your language binding doesn't do this for you automatically when you destroy a context, I'd suggest sending a patch.</para>

<para>Finally, destroy the context. This will cause any blocking receives or polls or sends in attached threads (i.e. which share the same context) to return with an error. Catch that error, and then set LINGER, and close sockets in <emphasis>that</emphasis> thread, and exit. Do not destroy the same context twice. The <literal>zmq_ctx_destroy[3]</literal> call in the main thread will block until all sockets it knows about are safely closed.</para>

<para>Voila! It's complex and painful enough that any language binding author worth his or her salt will do this automatically and make the socket closing dance unnecessary.</para>

</sect2>
</sect1>
<sect1>
<title>Why We Needed &Oslash;MQ</title>
<para>Now that you've seen &Oslash;MQ in action, let's go back to the "why".</para>

<para>Many applications these days consist of components that stretch across some kind of network, either a LAN or the Internet. So many application developers end up doing some kind of messaging. Some developers use message queuing products, but most of the time they do it themselves, using TCP or UDP. These protocols are not hard to use, but there is a great difference between sending a few bytes from A to B, and doing messaging in any kind of reliable way.</para>

<para>Let's look at the typical problems we face when we start to connect pieces using raw TCP. Any reusable messaging layer would need to solve all or most of these:</para>

<itemizedlist>
  <listitem><para>How do we handle I/O? Does our application block, or do we handle I/O in the background? This is a key design decision. Blocking I/O creates architectures that do not scale well. But background I/O can be very hard to do right.</para></listitem>
  <listitem><para>How do we handle dynamic components, i.e. pieces that go away temporarily? Do we formally split components into "clients" and "servers" and mandate that servers cannot disappear? What then if we want to connect servers to servers? Do we try to reconnect every few seconds?</para></listitem>
  <listitem><para>How do we represent a message on the wire? How do we frame data so it's easy to write and read, safe from buffer overflows, efficient for small messages, yet adequate for the very largest videos of dancing cats wearing party hats?</para></listitem>
  <listitem><para>How do we handle messages that we can't deliver immediately? Particularly, if we're waiting for a component to come back on-line? Do we discard messages, put them into a database, or into a memory queue?</para></listitem>
  <listitem><para>Where do we store message queues? What happens if the component reading from a queue is very slow, and causes our queues to build up? What's our strategy then?</para></listitem>
  <listitem><para>How do we handle lost messages? Do we wait for fresh data, request a resend, or do we build some kind of reliability layer that ensures messages cannot be lost? What if that layer itself crashes?</para></listitem>
  <listitem><para>What if we need to use a different network transport, say multicast instead of TCP unicast, or IPv6? Do we need to rewrite the applications, or is the transport abstracted in some layer?</para></listitem>
  <listitem><para>How do we route messages? Can we send the same message to multiple peers? Can we send replies back to an original requester?</para></listitem>
  <listitem><para>How do we write an API for another language? Do we re-implement a wire-level protocol or do we repackage a library? If the former, how can we guarantee efficient and stable stacks? If the latter, how can we guarantee interoperability?</para></listitem>
  <listitem><para>How do we represent data so that it can be read between different architectures? Do we enforce a particular encoding for data types? How far is this the job of the messaging system rather than a higher layer?</para></listitem>
  <listitem><para>How do we handle network errors? Do we wait and retry, ignore them silently, or abort?</para></listitem>
</itemizedlist>
<para>Take a typical open source project like <ulink url="http://hadoop.apache.org/zookeeper/">Hadoop Zookeeper</ulink> and read the C API code in <ulink url="http://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c">src/c/src/zookeeper.c</ulink>. As I write this, in 2010, the code is 3,200 lines of mystery and in there is an undocumented, client-server network communication protocol. I see it's efficient because it uses poll() instead of select(). But really, Zookeeper should be using a generic messaging layer and an explicitly documented wire level protocol. It is incredibly wasteful for teams to be building this particular wheel over and over.</para>

<para>But how to make a reusable messaging layer? Why, when so many projects need this technology, are people still doing it the hard way, by driving TCP sockets in their code, and solving the problems in that long list, over and over(<xref linkend="figure-7"/>)?</para>

<figure id="figure-7">
    <title>Messaging as it Starts</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig7.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>It turns out that building reusable messaging systems is really difficult, which is why few FOSS projects ever tried, and why commercial messaging products are complex, expensive, inflexible, and brittle. In 2006 iMatix designed <ulink url="http://www.amqp.org">AMQP</ulink> which started to give FOSS developers perhaps the first reusable recipe for a messaging system. AMQP works better than many other designs <ulink url="http://www.imatix.com/articles:whats-wrong-with-amqp">but remains relatively complex, expensive, and brittle</ulink>. It takes weeks to learn to use, and months to create stable architectures that don't crash when things get hairy.</para>

<para>Most messaging projects, like AMQP, that try to solve this long list of problems in a reusable way do so by inventing a new concept, the "broker", that does addressing, routing, and queuing. This results in a client-server protocol or a set of APIs on top of some undocumented protocol, that let applications speak to this broker. Brokers are an excellent thing in reducing the complexity of large networks. But adding broker-based messaging to a product like Zookeeper would make it worse, not better. It would mean adding an additional big box, and a new single point of failure. A broker rapidly becomes a bottleneck and a new risk to manage. If the software supports it, we can add a second, third, fourth broker and make some fail-over scheme. People do this. It creates more moving pieces, more complexity, more things to break.</para>

<para>And a broker-centric set-up needs its own operations team. You literally need to watch the brokers day and night, and beat them with a stick when they start misbehaving. You need boxes, and you need backup boxes, and you need people to manage those boxes. It is only worth doing for large applications with many moving pieces, built by several teams of people, over several years.</para>

<para>So small to medium application developers are trapped. Either they avoid network programming, and make monolithic applications that do not scale. Or they jump into network programming and make brittle, complex applications that are hard to maintain. Or they bet on a messaging product, and end up with scalable applications that depend on expensive, easily broken technology. There has been no really good choice, which is maybe why messaging is largely stuck in the last century and stirs strong emotions. Negative ones for users, gleeful joy for those selling support and licenses(<xref linkend="figure-8"/>).</para>

<figure id="figure-8">
    <title>Messaging as it Becomes</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig8.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>What we need is something that does the job of messaging but does it in such a simple and cheap way that it can work in any application, with close to zero cost. It should be a library that you just link with, without any other dependencies. No additional moving pieces, so no additional risk. It should run on any OS and work with any programming language.</para>

<para>And this is &Oslash;MQ: an efficient, embeddable library that solves most of the problems an application needs to become nicely elastic across a network, without much cost.</para>

<para>Specifically:</para>

<itemizedlist>
  <listitem><para>It handles I/O asynchronously, in background threads. These communicate with application threads using lock-free data structures, so concurrent &Oslash;MQ applications need no locks, semaphores, or other wait states.</para></listitem>
  <listitem><para>Components can come and go dynamically and &Oslash;MQ will automatically reconnect. This means you can start components in any order. You can create "service-oriented architectures" (SOAs) where services can join and leave the network at any time.</para></listitem>
  <listitem><para>It queues messages automatically when needed. It does this intelligently, pushing messages as close as possible to the receiver before queuing them.</para></listitem>
  <listitem><para>It has ways of dealing with over-full queues (called "high water mark"). When a queue is full, &Oslash;MQ automatically blocks senders, or throws away messages, depending on the kind of messaging you are doing (the so-called "pattern").</para></listitem>
  <listitem><para>It lets your applications talk to each other over arbitrary transports: TCP, multicast, in-process, inter-process. You don't need to change your code to use a different transport.</para></listitem>
  <listitem><para>It handles slow/blocked readers safely, using different strategies that depend on the messaging pattern.</para></listitem>
  <listitem><para>It lets you route messages using a variety of patterns such as request-reply and publish-subscribe. These patterns are how you create the topology, the structure of your network.</para></listitem>
  <listitem><para>It lets you create proxies to queue, forward, or capture messages with a single call. Proxies can reduce the interconnection complexity of a network.</para></listitem>
  <listitem><para>It delivers whole messages exactly as they were sent, using a simple framing on the wire. If you write a 10k message, you will receive a 10k message.</para></listitem>
  <listitem><para>It does not impose any format on messages. They are blobs of zero to gigabytes large. When you want to represent data you choose some other product on top, such as Google's protocol buffers, XDR, and others.</para></listitem>
  <listitem><para>It handles network errors intelligently. Sometimes it retries, sometimes it tells you an operation failed.</para></listitem>
  <listitem><para>It reduces your carbon footprint. Doing more with less CPU means your boxes use less power, and you can keep your old boxes in use for longer. Al Gore would love &Oslash;MQ.</para></listitem>
</itemizedlist>
<para>Actually &Oslash;MQ does rather more than this. It has a subversive effect on how you develop network-capable applications. Superficially it's a socket-inspired API on which you do <literal>zmq_msg_recv[3]</literal> and <literal>zmq_msg_send[3]</literal>. But message processing rapidly becomes the central loop, and your application soon breaks down into a set of message processing tasks. It is elegant and natural. And it scales: each of these tasks maps to a node, and the nodes talk to each other across arbitrary transports. Two nodes in one process (node is a thread), two nodes on one box (node is a process), or two boxes on one network (node is a box) - it's all the same, with no application code changes.</para>

</sect1>
<sect1>
<title>Socket Scalability</title>
<para>Let's see &Oslash;MQ's scalability in action. Here is a shell script that starts the weather server and then a bunch of clients in parallel:</para>

<screen>wuserver &amp;
wuclient 12345 &amp;
wuclient 23456 &amp;
wuclient 34567 &amp;
wuclient 45678 &amp;
wuclient 56789 &amp;
</screen>

<para>As the clients run, we take a look at the active processes using 'top', and we see something like (on a 4-core box):</para>

<screen>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7136 ph        20   0 1040m 959m 1156 R  157 12.0  16:25.47 wuserver
 7966 ph        20   0 98608 1804 1372 S   33  0.0   0:03.94 wuclient
 7963 ph        20   0 33116 1748 1372 S   14  0.0   0:00.76 wuclient
 7965 ph        20   0 33116 1784 1372 S    6  0.0   0:00.47 wuclient
 7964 ph        20   0 33116 1788 1372 S    5  0.0   0:00.25 wuclient
 7967 ph        20   0 33072 1740 1372 S    5  0.0   0:00.35 wuclient
</screen>

<para>Let's think for a second about what is happening here. The weather server has a single socket, and yet here we have it sending data to five clients in parallel. We could have thousands of concurrent clients. The server application doesn't see them, doesn't talk to them directly. So the &Oslash;MQ socket is acting like a little server, silently accepting client requests and shoving data out to them as fast as the network can handle it. And it's a multithreaded server, squeezing more juice out of your CPU.</para>

</sect1>
<sect1>
<title>Upgrading from &Oslash;MQ/2.2 to &Oslash;MQ/3.2</title>
<para>In early 2012, &Oslash;MQ/3.2 became stable enough for live use and by the time you're reading this, it's what you really should be using. If you are still using 2.2, here's a quick summary of the changes, and how to migrate your code.</para>

<para>The main change in 3.x is that PUB-SUB works properly, as in, the publisher only sends subscribers stuff they actually want. In 2.x, publishers send everything and the subscribers filter. Simple, but not ideal for performance on a TCP network.</para>
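<para>To make the filtering concrete: a &Oslash;MQ subscription is a prefix, and a message matches if its body starts with that prefix. The helper below is a minimal sketch of that test, not library code. The name <literal>matches_subscription</literal> is hypothetical, and where the test runs (in the publisher for 3.x, in the subscriber for 2.x) is decided inside libzmq, not in your application.</para>

<programlisting language="c"><![CDATA[
#include <stddef.h>
#include <string.h>

//  Hypothetical helper, not part of libzmq: returns 1 if the message
//  body starts with the subscription prefix, else 0. An empty prefix
//  matches every message.
static int
matches_subscription (const void *body, size_t body_size,
                      const void *prefix, size_t prefix_size)
{
    if (prefix_size > body_size)
        return 0;
    return memcmp (body, prefix, prefix_size) == 0;
}
]]></programlisting>

<para>With the weather example in mind, a subscriber interested in one zip code would supply that zip code as the prefix, and every update for other zip codes would fail the test.</para>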

<para>Most of the API is backwards compatible, except a few changes that went into 3.0 with little regard to the cost of breaking existing code. The syntax of <literal>zmq_send[3]</literal> and <literal>zmq_recv[3]</literal> changed, and <literal>ZMQ_NOBLOCK</literal> got rebaptized to <literal>ZMQ_DONTWAIT</literal>. So although I'd love to say, "you just recompile your code with the latest libzmq and everything will work", that's not how it is. For what it's worth, we banned such API breakage afterwards.</para>

<para>So the minimal change for C/C++ apps that use the low-level libzmq API is to replace all calls to <literal>zmq_send[3]</literal> with <literal>zmq_msg_send[3]</literal>, and <literal>zmq_recv[3]</literal> with <literal>zmq_msg_recv[3]</literal>. In other languages, your binding author may have done the work already. Note that these two functions now return -1 in case of error, and zero or more according to how many bytes were sent or received.</para>

<para>Other parts of the libzmq API became more consistent. We deprecated <literal>zmq_init[3]</literal> and <literal>zmq_term[3]</literal>, replacing them with <literal>zmq_ctx_new[3]</literal> and <literal>zmq_ctx_destroy[3]</literal>. We added <literal>zmq_ctx_set[3]</literal> to let you configure a context before starting to work with it.</para>

<para>Finally, we added context monitoring via the <literal>zmq_ctx_set_monitor[3]</literal> call, which lets you track connections and disconnections, and other events on sockets.</para>

</sect1>
<sect1>
<title>Warning - Unstable Paradigms!</title>
<para>Traditional network programming is built on the general assumption that one socket talks to one connection, one peer. There are multicast protocols but these are exotic. When we assume "one socket = one connection", we scale our architectures in certain ways. We create threads of logic where each thread works with one socket, one peer. We place intelligence and state in these threads.</para>

<para>In the &Oslash;MQ universe, sockets are doorways to fast little background communications engines that manage a whole set of connections automagically for you. You can't see, work with, open, close, or attach state to these connections. Whether you use blocking send or receive, or poll, all you can talk to is the socket, not the connections it manages for you. The connections are private and invisible, and this is the key to &Oslash;MQ's scalability.</para>

<para>This is because your code, talking to a socket, can then handle any number of connections across whatever network protocols are around, without change. A messaging pattern sitting in &Oslash;MQ can scale more cheaply than a messaging pattern sitting in your application code.</para>

<para>So the general assumption no longer applies. As you read the code examples, your brain will try to map them to what you know. You will read "socket" and think "ah, that represents a connection to another node". That is wrong. You will read "thread" and your brain will again think, "ah, a thread represents a connection to another node", and again your brain will be wrong.</para>

<para>If you're reading this Guide for the first time, realize that until you actually write &Oslash;MQ code for a day or two (and maybe three or four days), you may feel confused, especially by how simple &Oslash;MQ makes things for you, and you may try to impose that general assumption on &Oslash;MQ, and it won't work. And then you will experience your moment of enlightenment and trust, that <emphasis>zap-pow-kaboom</emphasis> satori paradigm-shift moment when it all becomes clear.</para>

</sect1>
</chapter>
<chapter id="sockets-and-patterns">
<title>Sockets and Patterns</title>
<para>In Basics<xref linkend="basics"/> we took &Oslash;MQ for a drive, with some basic examples of the main &Oslash;MQ patterns: request-reply, publish-subscribe, and pipeline. In this chapter we're going to get our hands dirty and start to learn how to use these tools in real programs.</para>

<para>We'll cover:</para>

<itemizedlist>
  <listitem><para>How to create and work with &Oslash;MQ sockets.</para></listitem>
  <listitem><para>How to send and receive messages on sockets.</para></listitem>
  <listitem><para>How to build your apps around &Oslash;MQ's asynchronous I/O model.</para></listitem>
  <listitem><para>How to handle multiple sockets in one thread.</para></listitem>
  <listitem><para>How to handle fatal and non-fatal errors properly.</para></listitem>
  <listitem><para>How to handle interrupt signals like Ctrl-C.</para></listitem>
  <listitem><para>How to shut down a &Oslash;MQ application cleanly.</para></listitem>
  <listitem><para>How to check a &Oslash;MQ application for memory leaks.</para></listitem>
  <listitem><para>How to send and receive multi-part messages.</para></listitem>
  <listitem><para>How to forward messages across networks.</para></listitem>
  <listitem><para>How to build a simple message queuing broker.</para></listitem>
  <listitem><para>How to write multithreaded applications with &Oslash;MQ.</para></listitem>
  <listitem><para>How to use &Oslash;MQ to signal between threads.</para></listitem>
  <listitem><para>How to use &Oslash;MQ to coordinate a network of nodes.</para></listitem>
  <listitem><para>How to create and use message envelopes for publish-subscribe.</para></listitem>
  <listitem><para>Using the high-water mark (HWM) to protect against memory overflows.</para></listitem>
</itemizedlist>
<sect1>
<title>The Socket API</title>
<para>To be perfectly honest, &Oslash;MQ does a kind of switch-and-bait on you, for which we don't apologize. It's for your own good and it hurts us more than it hurts you. It presents a familiar socket-based API, behind which we go to great effort to hide a bunch of message-processing engines. However, the result will slowly fix your world-view about how to design and write distributed software.</para>

<para>Sockets are the de-facto standard API for network programming, as well as being useful for stopping your eyes from falling onto your cheeks. One thing that makes &Oslash;MQ especially tasty to developers is that it uses sockets and messages instead of some other arbitrary set of concepts. Kudos to Martin Sustrik for pulling this off. It turns "Message Oriented Middleware", a phrase guaranteed to send the whole room off to Catatonia, into "Extra Spicy Sockets!" which leaves us with a strange craving for pizza, and a desire to know more.</para>

<para>Like a favorite dish, &Oslash;MQ sockets are easy to digest. Sockets have a life in four parts, just like BSD sockets:</para>

<itemizedlist>
  <listitem><para>Creating and destroying sockets, which go together to form a karmic circle of socket life (see <literal>zmq_socket[3]</literal>, <literal>zmq_close[3]</literal>).</para></listitem>
  <listitem><para>Configuring sockets by setting options on them and checking them if necessary (see <literal>zmq_setsockopt[3]</literal>, <literal>zmq_getsockopt[3]</literal>).</para></listitem>
  <listitem><para>Plugging sockets onto the network topology by creating &Oslash;MQ connections to and from them (see <literal>zmq_bind[3]</literal>, <literal>zmq_connect[3]</literal>).</para></listitem>
  <listitem><para>Using the sockets to carry data by writing and receiving messages on them (see <literal>zmq_msg_send[3]</literal>, <literal>zmq_msg_recv[3]</literal>).</para></listitem>
</itemizedlist>
<para>Note that sockets are always void pointers, and messages (which we'll come to very soon) are structures. So in C you pass sockets as-such, but you pass addresses of messages in all functions that work with messages, like <literal>zmq_msg_send[3]</literal> and <literal>zmq_msg_recv[3]</literal>. As a mnemonic, realize that "in &Oslash;MQ all your sockets are belong to us", but messages are things you actually own in your code.</para>

<para>Creating, destroying, and configuring sockets works as you'd expect for any object. But remember that &Oslash;MQ is an asynchronous, elastic fabric. This has some impact on how we plug sockets into the network topology, and how we use the sockets after that.</para>

<sect2>
<title>Plugging Sockets Into the Topology</title>
<para>To create a connection between two nodes you use <literal>zmq_bind[3]</literal> in one node, and <literal>zmq_connect[3]</literal> in the other. As a general rule of thumb, the node which does <literal>zmq_bind[3]</literal> is a "server", sitting on a well-known network address, and the node which does <literal>zmq_connect[3]</literal> is a "client", with unknown or arbitrary network addresses. Thus we say that we "bind a socket to an endpoint" and "connect a socket to an endpoint", the endpoint being that well-known network address.</para>

<para>&Oslash;MQ connections are somewhat different from old-fashioned TCP connections. The main notable differences are:</para>

<itemizedlist>
  <listitem><para>They go across an arbitrary transport (<literal>inproc</literal>, <literal>ipc</literal>, <literal>tcp</literal>, <literal>pgm</literal> or <literal>epgm</literal>). See <literal>zmq_inproc[7]</literal>, <literal>zmq_ipc[7]</literal>, <literal>zmq_tcp[7]</literal>, <literal>zmq_pgm[7]</literal>, and <literal>zmq_epgm[7]</literal>.</para></listitem>
  <listitem><para>One socket may have many outgoing and many incoming connections.</para></listitem>
  <listitem><para>There is no <literal>zmq_accept()</literal> method. When a socket is bound to an endpoint it automatically starts accepting connections.</para></listitem>
  <listitem><para>The network connection itself happens in the background, and &Oslash;MQ will automatically re-connect if the network connection is broken (e.g. if the peer disappears and then comes back).</para></listitem>
  <listitem><para>Your application code cannot work with these connections directly; they are encapsulated under the socket.</para></listitem>
</itemizedlist>
<para>Many architectures follow some kind of client-server model, where the server is the component that is most static, and the clients are the components that are most dynamic, i.e. they come and go the most. There are sometimes issues of addressing: servers will be visible to clients, but not necessarily vice-versa. So mostly it's obvious which node should be doing <literal>zmq_bind[3]</literal> (the server) and which should be doing <literal>zmq_connect[3]</literal> (the client). It also depends on the kind of sockets you're using, with some exceptions for unusual network architectures. We'll look at socket types later.</para>

<para>Now, imagine we start the client <emphasis>before</emphasis> we start the server. In traditional networking we get a big red Fail flag. But &Oslash;MQ lets us start and stop pieces arbitrarily. As soon as the client node does <literal>zmq_connect[3]</literal> the connection exists and that node can start to write messages to the socket. At some stage (hopefully before messages queue up so much that they start to get discarded, or the client blocks), the server comes alive, does a <literal>zmq_bind[3]</literal> and &Oslash;MQ starts to deliver messages.</para>

<para>A server node can bind to many endpoints (that is, a combination of protocol and address) and it can do this using a single socket. This means it will accept connections across different transports:</para>

<programlisting language="c">
zmq_bind (socket, "tcp://*:5555");
zmq_bind (socket, "tcp://*:9999");
zmq_bind (socket, "inproc://somename");
</programlisting>
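<para>Since an endpoint is just a transport name and an address joined by "://", it can help to see one taken apart. The following is a small sketch for illustration only: <literal>split_endpoint</literal> is a hypothetical helper, not a libzmq function, and libzmq does its own (stricter) parsing when you bind or connect.</para>

<programlisting language="c"><![CDATA[
#include <stddef.h>
#include <string.h>

//  Hypothetical helper: split "transport://address" into its two parts.
//  Returns 0 on success, -1 if the separator is missing or an output
//  buffer is too small.
static int
split_endpoint (const char *endpoint,
                char *transport, size_t transport_size,
                char *address, size_t address_size)
{
    const char *separator = strstr (endpoint, "://");
    if (!separator)
        return -1;
    size_t transport_length = (size_t) (separator - endpoint);
    const char *rest = separator + 3;
    if (transport_length >= transport_size
    ||  strlen (rest) >= address_size)
        return -1;
    memcpy (transport, endpoint, transport_length);
    transport [transport_length] = 0;
    strcpy (address, rest);
    return 0;
}
]]></programlisting>

<para>Applied to the three endpoints above, this would yield the transports <literal>tcp</literal>, <literal>tcp</literal>, and <literal>inproc</literal>, which is why a single socket ends up listening across different transports.</para>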

<para>With most transports you cannot bind to the same endpoint twice, unlike for example in UDP. The <literal>ipc</literal> transport does however let one process bind to an endpoint already used by a first process. It's meant to allow a process to recover after a crash.</para>

<para>Although &Oslash;MQ tries to be neutral about which side binds, and which side connects, there are differences. We'll see these in more detail later. The upshot is that you should usually think in terms of "servers" as static parts of your topology, that bind to more-or-less fixed endpoints, and "clients" as dynamic parts that come and go and connect to these endpoints. Then, design your application around this model. The chances that it will "just work" are much better like that.</para>

<para>Sockets have types. The socket type defines the semantics of the socket, its policies for routing messages inwards and outwards, queuing, etc. You can connect certain types of socket together, e.g. a publisher socket and a subscriber socket. Sockets work together in "messaging patterns". We'll look at this in more detail later.</para>

<para>It's the ability to connect sockets in these different ways that gives &Oslash;MQ its basic power as a message queuing system. There are layers on top of this, such as proxies, which we'll get to later. But essentially, with &Oslash;MQ you define your network architecture by plugging pieces together like a child's construction toy.</para>

</sect2>
<sect2>
<title>Using Sockets to Carry Data</title>
<para>To send and receive messages you use the <literal>zmq_msg_send[3]</literal> and <literal>zmq_msg_recv[3]</literal> methods. The names are conventional but &Oslash;MQ's I/O model is different enough from the TCP model(<xref linkend="figure-9"/>) that you will need time to get your head around it.</para>

<figure id="figure-9">
    <title>TCP sockets are 1 to 1</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig9.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Let's look at the main differences between TCP sockets and &Oslash;MQ sockets when it comes to working with data:</para>

<itemizedlist>
  <listitem><para>&Oslash;MQ sockets carry messages, like UDP, rather than a stream of bytes as TCP does. A &Oslash;MQ message is length-specified binary data. We'll come to messages shortly; their design is optimized for performance and so a little tricky.</para></listitem>
  <listitem><para>&Oslash;MQ sockets do their I/O in a background thread. This means that messages arrive in local input queues, and are sent from local output queues, no matter what your application is busy doing.</para></listitem>
  <listitem><para>&Oslash;MQ sockets have one-to-N routing behavior built-in, according to the socket type.</para></listitem>
</itemizedlist>
<para>The <literal>zmq_msg_send[3]</literal> method does not actually send the message to the socket connection(s). It queues the message so that the I/O thread can send it asynchronously. It does not block except in some exceptional cases. So the message is not necessarily sent when <literal>zmq_msg_send[3]</literal> returns to your application. If you created a message using <literal>zmq_msg_init_data[3]</literal> you cannot reuse the data or free it, otherwise the I/O thread will rapidly find itself writing overwritten or unallocated garbage. This is a common mistake for beginners. We'll see a little later how to properly work with messages.</para>

</sect2>
<sect2>
<title>Unicast Transports</title>
<para>&Oslash;MQ provides a set of unicast transports (<literal>inproc</literal>, <literal>ipc</literal>, and <literal>tcp</literal>) and multicast transports (<literal>epgm</literal>, <literal>pgm</literal>). Multicast is an advanced technique that we'll come to later. Don't even start using it unless you know that your fanout ratios will make 1-to-N unicast impossible.</para>

<para>For most common cases, use <emphasis role="bold"><literal>tcp</literal></emphasis>, which is a <emphasis>disconnected TCP</emphasis> transport. It is elastic, portable, and fast enough for most cases. We call this 'disconnected' because &Oslash;MQ's <literal>tcp</literal> transport doesn't require that the endpoint exists before you connect to it. Clients and servers can connect and bind at any time, can go and come back, and it remains transparent to applications.</para>

<para>The inter-process <literal>ipc</literal> transport is disconnected, like <literal>tcp</literal>. It has one limitation: it does not yet work on Windows. By convention we use endpoint names with an ".ipc" extension to avoid potential conflict with other file names. On UNIX systems, if you use <literal>ipc</literal> endpoints you need to create these with appropriate permissions otherwise they may not be shareable between processes running under different user IDs. You must also make sure all processes can access the files, e.g. by running in the same working directory.</para>

<para>The inter-thread transport, <emphasis role="bold"><literal>inproc</literal></emphasis>, is a connected signaling transport. It is much faster than <literal>tcp</literal> or <literal>ipc</literal>. This transport has a specific limitation compared to <literal>ipc</literal> and <literal>tcp</literal>: <emphasis role="bold">the server must issue a bind before any client issues a connect</emphasis>. This is something future versions of &Oslash;MQ may fix, but at present this defines how you use <literal>inproc</literal> sockets. We create and bind one socket, then start the child threads, which create and connect the other sockets.</para>

</sect2>
<sect2>
<title>&Oslash;MQ is Not a Neutral Carrier</title>
<para>A common question that newcomers to &Oslash;MQ ask (it's one I asked myself) is, "how do I write an XYZ server in &Oslash;MQ?" For example, "how do I write an HTTP server in &Oslash;MQ?" The implication is that if we use normal sockets to carry HTTP requests and responses, we should be able to use &Oslash;MQ sockets to do the same, only much faster and better.</para>

<para>The answer used to be "this is not how it works". &Oslash;MQ is not a neutral carrier: it imposes a framing on the transport protocols it uses. This framing is not compatible with existing protocols, which tend to use their own framing. For example, compare an HTTP request(<xref linkend="figure-10"/>), and a &Oslash;MQ request(<xref linkend="figure-11"/>), both over TCP/IP.</para>

<para>The HTTP request uses CR-LF as its simplest framing delimiter, whereas &Oslash;MQ uses a length-specified frame.</para>

<figure id="figure-10">
    <title>HTTP On the Wire</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig10.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<figure id="figure-11">
    <title>&Oslash;MQ On the Wire</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig11.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>So you could write a HTTP-like protocol using &Oslash;MQ, using for example the request-reply socket pattern. But it would not be HTTP.</para>

<para>Since &Oslash;MQ/3.3, however, the <literal>ZMQ_ROUTER_RAW</literal> socket option lets you read and write TCP bytes without the &Oslash;MQ framing. Hardeep Singh contributed this change so that he could connect to Telnet servers from his &Oslash;MQ application. This is still, at time of writing, somewhat experimental, but it shows how &Oslash;MQ keeps evolving to solve new problems. Maybe the next patch will be yours.</para>
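<para>To make the difference between delimiter framing and length-specified framing concrete, here is a deliberately simplified sketch. It is not the actual &Oslash;MQ wire encoding: it assumes a single length byte followed by the body, and the names <literal>frame_encode</literal> and <literal>frame_decode</literal> are made up for this illustration. The point is only that the reader learns the body size up front and never has to scan the data for a terminator such as CR-LF.</para>

<programlisting language="c"><![CDATA[
#include <stddef.h>
#include <string.h>

//  Simplified illustration only; this is not the actual ZMTP encoding.
//  Writes one length byte followed by the body. Returns bytes written,
//  or 0 if the body is over 255 bytes or the buffer is too small.
static size_t
frame_encode (unsigned char *buffer, size_t buffer_size,
              const void *body, size_t body_size)
{
    if (body_size > 255 || buffer_size < body_size + 1)
        return 0;
    buffer [0] = (unsigned char) body_size;
    memcpy (buffer + 1, body, body_size);
    return body_size + 1;
}

//  Reads one frame. Returns the body size and points *body at the
//  data, or returns -1 if the input does not yet hold a whole frame.
static int
frame_decode (const unsigned char *input, size_t input_size,
              const unsigned char **body)
{
    if (input_size < 1 || input_size < (size_t) input [0] + 1)
        return -1;
    *body = input + 1;
    return input [0];
}
]]></programlisting>

<para>Because the body is never inspected, it can hold any bytes at all, including CR-LF, which is one reason a length-specified frame and a delimiter-based protocol do not mix on the same connection.</para>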

</sect2>
<sect2>
<title>I/O Threads</title>
<para>We said that &Oslash;MQ does I/O in a background thread. One I/O thread (for all sockets) is sufficient for all but the most extreme applications. When you create a new context it starts with one I/O thread. The general rule of thumb is to allow one I/O thread per gigabyte of data in or out per second. To raise the number of I/O threads, use the <literal>zmq_ctx_set[3]</literal> call <emphasis>before</emphasis> creating any sockets:</para>

<programlisting language="c">
int io_threads = 4;
void *context = zmq_ctx_new ();
zmq_ctx_set (context, ZMQ_IO_THREADS, io_threads);
assert (zmq_ctx_get (context, ZMQ_IO_THREADS) == io_threads);
</programlisting>
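<para>The rule of thumb is simple arithmetic, sketched below. The function name <literal>suggest_io_threads</literal> is hypothetical and the result is only a starting point to measure against, not a figure libzmq computes for you. This sketch never returns less than one thread.</para>

<programlisting language="c"><![CDATA[
//  Hypothetical sizing helper: one I/O thread per gigabyte per second
//  of traffic, rounded up, and never less than one.
static int
suggest_io_threads (double gigabytes_per_second)
{
    int threads = (int) gigabytes_per_second;
    if ((double) threads < gigabytes_per_second)
        threads++;
    return threads < 1? 1: threads;
}
]]></programlisting>

<para>For an application expecting around 3.5 gigabytes per second in or out, this suggests four I/O threads, the value used in the snippet above.</para>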

<para>We've seen that one socket can handle many (dozens, thousands of) connections at once. This has a fundamental impact on how you write applications. A traditional networked application has one process or one thread per remote connection, and that process or thread handles one socket. &Oslash;MQ lets you collapse this entire structure into a single process, and then break it up as necessary for scaling.</para>

<para>If you are using &Oslash;MQ for inter-thread communications only, i.e. a multithreaded application that does no external socket I/O, you can set the I/O threads to zero. It's not a significant optimization though, more of a curiosity.</para>

</sect2>
</sect1>
<sect1>
<title>Messaging Patterns</title>
<para>Underneath the brown paper wrapping of &Oslash;MQ's socket API lies the world of messaging patterns. If you have a background in enterprise messaging, or know UDP well, these will be vaguely familiar. But to most &Oslash;MQ newcomers they are a surprise, we're so used to the TCP paradigm where a socket maps one-to-one to another node.</para>

<para>Let's recap briefly what &Oslash;MQ does for you. It delivers blobs of data (messages) to nodes, quickly and efficiently. You can map nodes to threads, processes, or boxes. &Oslash;MQ gives your applications a single socket API to work with, no matter what the actual transport (like in-process, inter-process, TCP, or multicast). It automatically reconnects to peers as they come and go. It queues messages at both sender and receiver, as needed. It manages these queues carefully to ensure processes don't run out of memory, overflowing to disk when appropriate. It handles socket errors. It does all I/O in background threads. It uses lock-free techniques for talking between nodes, so there are never locks, waits, semaphores, or deadlocks.</para>

<para>But cutting through that, it routes and queues messages according to precise recipes called <emphasis>patterns</emphasis>. It is these patterns that provide &Oslash;MQ's intelligence. They encapsulate our hard-earned experience of the best ways to distribute data and work. &Oslash;MQ's patterns are hard-coded but future versions may allow user-definable patterns.</para>

<para>&Oslash;MQ patterns are implemented by pairs of sockets with matching types. In other words, to understand &Oslash;MQ patterns you need to understand socket types and how they work together. Mostly this just takes study, there is little that is obvious at this level.</para>

<para>The built-in core &Oslash;MQ patterns are:</para>

<itemizedlist>
  <listitem><para><emphasis role="bold">Request-reply</emphasis>, which connects a set of clients to a set of services. This is a remote procedure call and task distribution pattern.</para></listitem>
  <listitem><para><emphasis role="bold">Publish-subscribe</emphasis>, which connects a set of publishers to a set of subscribers. This is a data distribution pattern.</para></listitem>
  <listitem><para><emphasis role="bold">Pipeline</emphasis>, which connects nodes in a fan-out / fan-in pattern that can have multiple steps, and loops. This is a parallel task distribution and collection pattern.</para></listitem>
</itemizedlist>
<para>We looked at each of these in the first chapter. There's one more pattern that people tend to try to use when they still think of &Oslash;MQ in terms of traditional TCP sockets: <emphasis role="bold">Exclusive pair</emphasis>, which connects two sockets exclusively. This is a pattern you should use only to connect two threads in a process. We'll see an example at the end of this chapter.</para>

<para>The <literal>zmq_socket[3]</literal> man page is fairly clear about the patterns; it's worth reading several times until it starts to make sense. These are the socket combinations that are valid for a connect-bind pair (either side can bind):</para>

<itemizedlist>
  <listitem><para>PUB and SUB</para></listitem>
  <listitem><para>REQ and REP</para></listitem>
  <listitem><para>REQ and ROUTER</para></listitem>
  <listitem><para>DEALER and REP</para></listitem>
  <listitem><para>DEALER and ROUTER</para></listitem>
  <listitem><para>DEALER and DEALER</para></listitem>
  <listitem><para>ROUTER and ROUTER</para></listitem>
  <listitem><para>PUSH and PULL</para></listitem>
  <listitem><para>PAIR and PAIR</para></listitem>
</itemizedlist>
<para>You'll also see references to XPUB and XSUB sockets, which we'll come to later (they're like raw versions of PUB and SUB). Any other combination will produce undocumented and unreliable results and future versions of &Oslash;MQ will probably return errors if you try them. You can and will of course bridge other socket types <emphasis>via code</emphasis>, i.e. read from one socket type and write to another.</para>
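<para>The list of valid combinations can be captured as a small lookup table, which is handy if you want your own code to reject a bad pairing early. This is a sketch under that assumption only: the table and <literal>is_valid_pair</literal> are hypothetical and libzmq exposes no such check. Socket types are given as plain strings here purely for readability.</para>

<programlisting language="c"><![CDATA[
#include <stddef.h>
#include <string.h>

//  The valid connect-bind pairs from the list above
static const char *valid_pairs [][2] = {
    { "PUB", "SUB" },       { "REQ", "REP" },
    { "REQ", "ROUTER" },    { "DEALER", "REP" },
    { "DEALER", "ROUTER" }, { "DEALER", "DEALER" },
    { "ROUTER", "ROUTER" }, { "PUSH", "PULL" },
    { "PAIR", "PAIR" }
};

//  Returns 1 if the two socket types may be connected, in either
//  order (since either side can bind), else 0.
static int
is_valid_pair (const char *first, const char *second)
{
    size_t count = sizeof (valid_pairs) / sizeof (valid_pairs [0]);
    for (size_t index = 0; index < count; index++) {
        const char **pair = valid_pairs [index];
        if ((strcmp (first, pair [0]) == 0 && strcmp (second, pair [1]) == 0)
        ||  (strcmp (first, pair [1]) == 0 && strcmp (second, pair [0]) == 0))
            return 1;
    }
    return 0;
}
]]></programlisting>

<para>XPUB and XSUB are left out of the table on purpose, since they are covered later.</para>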

<sect2>
<title>High-level Messaging Patterns</title>
<para>These four core patterns are cooked-in to &Oslash;MQ. They are part of the &Oslash;MQ API, implemented in the core C++ library, and guaranteed to be available in all fine retail stores.</para>

<para>On top, we add <emphasis>high-level patterns</emphasis>. We build these high-level patterns on top of &Oslash;MQ and implement them in whatever language we're using for our application. They are not part of the core library, do not come with the &Oslash;MQ package, and exist in their own space, as part of the &Oslash;MQ community. For example the Majordomo pattern, which we explore in Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/>, sits in the GitHub Majordomo project in the ZeroMQ organization.</para>

<para>One of the things we aim to provide you with in this guide is a set of such high-level patterns, from small (how to handle messages sanely) to large (how to make a reliable publish-subscribe architecture).</para>

</sect2>
<sect2>
<title>Working with Messages</title>
<para>On the wire, &Oslash;MQ messages are blobs of any size from zero upwards, fitting in memory. You do your own serialization using protobufs, msgpack, JSON, or whatever else your applications need to speak. It's wise to choose a data representation that is portable and fast, but you can make your own decisions about trade-offs.</para>

<para>In memory, &Oslash;MQ messages are <literal>zmq_msg_t</literal> structures (or classes depending on your language). Here are the basic ground rules for using &Oslash;MQ messages in C:</para>

<itemizedlist>
  <listitem><para>You create and pass around <literal>zmq_msg_t</literal> objects, not blocks of data.</para></listitem>
  <listitem><para>To read a message you use <literal>zmq_msg_init[3]</literal> to create an empty message, and then you pass that to <literal>zmq_msg_recv[3]</literal>.</para></listitem>
  <listitem><para>To write a message from new data, you use <literal>zmq_msg_init_size[3]</literal> to create a message and at the same time allocate a block of data of some size. You then fill that data using <literal>memcpy</literal>, and pass the message to <literal>zmq_msg_send[3]</literal>.</para></listitem>
  <listitem><para>To release (not destroy) a message you call <literal>zmq_msg_close[3]</literal>. This drops a reference, and eventually &Oslash;MQ will destroy the message.</para></listitem>
  <listitem><para>To access the message content you use <literal>zmq_msg_data[3]</literal>. To know how much data the message contains, use <literal>zmq_msg_size[3]</literal>.</para></listitem>
  <listitem><para>Do not use <literal>zmq_msg_move[3]</literal>, <literal>zmq_msg_copy[3]</literal>, or <literal>zmq_msg_init_data[3]</literal> unless you read the man pages and know precisely why you need these.</para></listitem>
</itemizedlist>
<para>Here is a typical chunk of code working with messages, which should be familiar if you have been paying attention. This is from the <literal>zhelpers.h</literal> file we use in all the examples:</para>

<programlisting language="c">
//  Receive 0MQ string from socket and convert into C string.
//  Caller must free the returned string. Returns NULL on error.
static char *
s_recv (void *socket) {
    zmq_msg_t message;
    zmq_msg_init (&amp;message);
    int size = zmq_msg_recv (&amp;message, socket, 0);
    if (size == -1) {
        //  Release the message on the error path too
        zmq_msg_close (&amp;message);
        return NULL;
    }
    char *string = malloc (size + 1);
    if (string) {
        memcpy (string, zmq_msg_data (&amp;message), size);
        string [size] = 0;
    }
    zmq_msg_close (&amp;message);
    return (string);
}

//  Convert C string to 0MQ string and send to socket.
//  Returns the number of bytes sent, or -1 on error.
static int
s_send (void *socket, char *string) {
    zmq_msg_t message;
    zmq_msg_init_size (&amp;message, strlen (string));
    memcpy (zmq_msg_data (&amp;message), string, strlen (string));
    int size = zmq_msg_send (&amp;message, socket, 0);
    zmq_msg_close (&amp;message);
    return (size);
}
</programlisting>

<para>You can easily extend this code to send and receive blobs of arbitrary length.</para>

<para>NOTE:</para>

<para>After you pass a message to <literal>zmq_msg_send[3]</literal>, &Oslash;MQ will clear the message, i.e., set the size to zero. You cannot send the same message twice, and you cannot access the message data after sending it.</para>

<para>If you want to send the same message more than once, create a second message, initialize it using <literal>zmq_msg_init[3]</literal> and then use <literal>zmq_msg_copy[3]</literal> to create a copy of the first message. This does not copy the data but the reference. You can then send the message twice (or more, if you create more copies) and the message will only be finally destroyed when the last copy is sent or closed.</para>
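<para>The idea of copying a reference rather than the data can be shown with a toy model. This is not how <literal>zmq_msg_t</literal> is implemented (that structure is opaque, and a real implementation must also be thread-safe). The type <literal>content_t</literal> and its three functions are hypothetical names used only to illustrate the counting.</para>

<programlisting language="c"><![CDATA[
#include <stdlib.h>
#include <string.h>

//  Toy model of shared message content; all names are hypothetical
typedef struct {
    int references;         //  How many handles share this content
    size_t size;
    char *data;
} content_t;

//  Allocates new content with one reference; returns NULL on failure
static content_t *
content_new (const char *data, size_t size)
{
    content_t *self = malloc (sizeof (content_t));
    if (!self)
        return NULL;
    self->data = malloc (size? size: 1);
    if (!self->data) {
        free (self);
        return NULL;
    }
    memcpy (self->data, data, size);
    self->size = size;
    self->references = 1;
    return self;
}

//  "Copying" only takes another reference to the same content
static content_t *
content_copy (content_t *self)
{
    self->references++;
    return self;
}

//  Drops one reference and frees the content when the last one goes.
//  Returns the number of references that remain.
static int
content_release (content_t *self)
{
    int remaining = --self->references;
    if (remaining == 0) {
        free (self->data);
        free (self);
    }
    return remaining;
}
]]></programlisting>

<para>In this model, sending or closing a message corresponds to <literal>content_release</literal>, so the data survives until the last holder lets go of it.</para>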

<para>&Oslash;MQ also supports <emphasis>multi-part</emphasis> messages, which let you send or receive a list of frames as a single on-the-wire message. This is widely used in real applications and we'll look at that later in this chapter and in Advanced Request-Reply Patterns<xref linkend="advanced-request-reply"/>.</para>

<para>Frames (also called "message parts" in the &Oslash;MQ reference manual pages) are the basic wire format for &Oslash;MQ messages. A frame is a length-specified block of data. The length can be zero upwards. If you've done any TCP programming you'll appreciate why frames are a useful answer to the question "how much data am I supposed to read off this network socket now?"</para>

<para>There is a wire-level <ulink url="http://rfc.zeromq.org/spec:15">protocol called ZMTP</ulink> that defines how &Oslash;MQ reads and writes frames on a TCP connection. If you're interested in how this works, the spec is quite short, just a few pages.</para>

<para>Originally, a &Oslash;MQ message was one frame, like UDP. We later extended this with "multipart" messages, which are quite simply series of frames with a "more" bit set to one, followed by one with that bit set to zero. The &Oslash;MQ API then lets you write messages with a "more" flag, and when you read messages, lets you check if there's "more".</para>

<para>In the low-level &Oslash;MQ API and the reference manual, therefore, there's some fuzziness about messages vs. frames. So here's a useful lexicon:</para>

<itemizedlist>
  <listitem><para>A message can be one or more parts.</para></listitem>
  <listitem><para>These parts are also called "frames".</para></listitem>
  <listitem><para>Each part is a <literal>zmq_msg_t</literal> object.</para></listitem>
  <listitem><para>You send and receive each part separately, in the low-level API.</para></listitem>
  <listitem><para>Higher-level APIs provide wrappers to send entire multi-part messages.</para></listitem>
</itemizedlist>
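<para>To make the "more" flag concrete, here is a toy model of a series of frames and a function that finds where the first message ends. It is an illustration only: <literal>frame_t</literal> and <literal>message_length</literal> are hypothetical names, and this is neither the libzmq API nor the wire representation.</para>

<programlisting language="c"><![CDATA[
#include <stddef.h>

//  Toy frame: a "more" flag plus a body. Illustrative only.
typedef struct {
    int more;               //  1 if another frame follows in this message
    const char *body;
} frame_t;

//  Returns how many frames make up the first message in the series,
//  or 0 if the series ends before a frame with more == 0 arrives.
static size_t
message_length (const frame_t *frames, size_t count)
{
    for (size_t index = 0; index < count; index++)
        if (!frames [index].more)
            return index + 1;
    return 0;
}
]]></programlisting>

<para>A single-part message is simply the case where the very first frame has its flag set to zero, so the function returns one.</para>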
<para>Some other things that are worth knowing about messages:</para>

<itemizedlist>
  <listitem><para>You may send zero-length messages, e.g. for sending a signal from one thread to another.</para></listitem>
  <listitem><para>&Oslash;MQ guarantees to deliver all the parts (one or more) for a message, or none of them.</para></listitem>
  <listitem><para>&Oslash;MQ does not send the message (single or multi-part) right away, but at some indeterminate later time, so it has to hold the whole message in memory until then.</para></listitem>
  <listitem><para>For that reason a message (single or multi-part) must fit in memory. If you want to send files of arbitrary sizes, you should break them into pieces and send each piece as a separate single-part message.</para></listitem>
  <listitem><para>You must call <literal>zmq-msg-close[3]</literal> when finished with a message, in languages that don't automatically destroy objects when a scope closes.</para></listitem>
</itemizedlist>
<para>And to be necessarily repetitive, do not use <literal>zmq-msg-init-data[3]</literal>, yet. This is a zero-copy method and guaranteed to create trouble for you. There are far more important things to learn about &Oslash;MQ before you start to worry about shaving off microseconds.</para>

</sect2>
<sect2>
<title>Handling Multiple Sockets</title>
<para>In the examples so far, the main loop has usually been:</para>

<orderedlist>
  <listitem><para>wait for message on socket</para></listitem>
  <listitem><para>process message</para></listitem>
  <listitem><para>repeat</para></listitem>
</orderedlist>
<para>What if we want to read from multiple endpoints at the same time? The simplest way is to connect one socket to all the endpoints and get &Oslash;MQ to do the fan-in for us. This is legal if the remote endpoints are in the same pattern but it would be wrong to e.g. connect a PULL socket to a PUB endpoint.</para>

<para>To actually read from multiple sockets at once, use <literal>zmq-poll[3]</literal>. An even better way might be to wrap <literal>zmq-poll[3]</literal> in a framework that turns it into a nice event-driven <emphasis>reactor</emphasis>, but it's significantly more work than we want to cover here.</para>

<para>Let's start with a dirty hack, partly for the fun of not doing it right, but mainly because it lets me show you how to do non-blocking socket reads. Here is a simple example of reading from two sockets using non-blocking reads. This rather confused program acts both as a subscriber to weather updates, and a worker for parallel tasks:</para>

<example id="msreader-php">
<title>Multiple socket reader (msreader.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Reading from multiple sockets
 *  This version uses a simple recv loop
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  Prepare our context and sockets
$context = new ZMQContext();

//  Connect to task ventilator
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;connect("tcp://localhost:5557");

//  Connect to weather server
$subscriber = new ZMQSocket($context, ZMQ::SOCKET_SUB);
$subscriber-&gt;connect("tcp://localhost:5556");
$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "10001");

//  Process messages from both sockets
//  We prioritize traffic from the task ventilator
while(true) {
	//  Process any waiting tasks
	try {
		//  A non-blocking recv returns false when nothing is waiting
		while(($task = $receiver-&gt;recv(ZMQ::MODE_NOBLOCK)) !== false) {
			// process task
		}
	} catch (ZMQSocketException $e) {
		// do nothing 
	}
	
	try {
		//  Process any waiting weather updates
		while(($update = $subscriber-&gt;recv(ZMQ::MODE_NOBLOCK)) !== false) {
			// process weather update
		}
	} catch (ZMQSocketException $e) {
		// do nothing 
	}
		
	//  No activity, so sleep for 1 msec (usleep takes microseconds)
	usleep(1000);
}
</programlisting>

</example>
<para>The cost of this approach is some additional latency on the first message (the sleep at the end of the loop, when there are no waiting messages to process). This would be a problem in applications where sub-millisecond latency was vital. Also, you need to check the documentation for usleep() (nanosleep() in C) or whatever function you use to make sure it does not busy-loop.</para>

<para>You can treat the sockets fairly by reading first from one, then the second rather than prioritizing them as we did in this example.</para>

<para>Now let's see the same little senseless application done right, using <literal>zmq-poll[3]</literal>:</para>

<example id="mspoller-php">
<title>Multiple socket poller (mspoller.php)</title>
<programlisting language="php">
&lt;?php 
/*
 *  Reading from multiple sockets
 *  This version uses zmq_poll()
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Connect to task ventilator
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;connect("tcp://localhost:5557");

//  Connect to weather server
$subscriber = new ZMQSocket($context, ZMQ::SOCKET_SUB);
$subscriber-&gt;connect("tcp://localhost:5556");
$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "10001");

//  Initialize poll set
$poll = new ZMQPoll();
$poll-&gt;add($receiver, ZMQ::POLL_IN);
$poll-&gt;add($subscriber, ZMQ::POLL_IN);

$readable = $writeable = array();

//  Process messages from both sockets
while(true) {
	$events = $poll-&gt;poll($readable, $writeable);
	if($events &gt; 0) {
		foreach($readable as $socket) {
			if($socket === $receiver) {
				$message = $socket-&gt;recv();
				// Process task
			} 
			else if($socket === $subscriber) {
				$message = $socket-&gt;recv();
				// Process weather update
			}
		}
	}
}
   
//  We never get here
</programlisting>

</example>
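<para>In the C API, each item that you pass to <literal>zmq-poll[3]</literal> is a structure with these four members (the PHP <literal>ZMQPoll</literal> class manages them for you):</para>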

<programlisting language="c">
typedef struct {
    void *socket;       //  0MQ socket to poll on
    int fd;             //  OR, native file handle to poll on
    short events;       //  Events to poll on
    short revents;      //  Events returned after poll
} zmq_pollitem_t;
</programlisting>

</sect2>
<sect2>
<title>Multi-part Messages</title>
<para>&Oslash;MQ lets us compose a message out of several frames, giving us a "multi-part message". Realistic applications use multi-part messages heavily, both for wrapping messages with address information, and for simple serialization. We'll look at reply envelopes later.</para>

<para>What we'll learn now is simply how to safely (but blindly) read and write multi-part messages in any application (like a proxy) that needs to forward messages without inspecting them.</para>

<para>When you work with multi-part messages, each part is a <literal>zmq_msg</literal> item. E.g. if you are sending a message with five parts, you must construct, send, and destroy five <literal>zmq_msg</literal> items. You can do this in advance (and store the <literal>zmq_msg</literal> items in an array or structure), or as you send them, one by one.</para>

<para>Here is how we send the frames in a multi-part message (we receive each frame into a message object):</para>

<programlisting language="c">
zmq_msg_send (&amp;message, socket, ZMQ_SNDMORE);
...
zmq_msg_send (&amp;message, socket, ZMQ_SNDMORE);
...
zmq_msg_send (&amp;message, socket, 0);
</programlisting>

<para>Here is how we receive and process all the parts in a message, be it single part or multi-part:</para>

<programlisting language="c">
while (1) {
    zmq_msg_t message;
    zmq_msg_init (&amp;message);
    zmq_msg_recv (&amp;message, socket, 0);
    //  Process the message frame
    zmq_msg_close (&amp;message);
    int more;
    size_t more_size = sizeof (more);
    zmq_getsockopt (socket, ZMQ_RCVMORE, &amp;more, &amp;more_size);
    if (!more)
        break;      //  Last message frame
}
</programlisting>
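
<para>For PHP readers, the two C fragments above translate roughly as follows. This is a sketch under assumptions: it needs the php-zmq extension, the <literal>inproc://multipart-demo</literal> endpoint is a made-up name, and the PAIR sockets are there only so that sender and receiver can live in one script.</para>

<programlisting language="php">
&lt;?php
//  Assumes the php-zmq extension; the endpoint name is hypothetical
$context = new ZMQContext();

$sink = new ZMQSocket($context, ZMQ::SOCKET_PAIR);
$sink-&gt;bind("inproc://multipart-demo");

$source = new ZMQSocket($context, ZMQ::SOCKET_PAIR);
$source-&gt;connect("inproc://multipart-demo");

//  Send three frames; only the last one goes out without SNDMORE
$source-&gt;send("frame 1", ZMQ::MODE_SNDMORE);
$source-&gt;send("frame 2", ZMQ::MODE_SNDMORE);
$source-&gt;send("frame 3");

//  Receive frames until RCVMORE reports that no more follow
$frames = array();
do {
    $frames[] = $sink-&gt;recv();
    $more = $sink-&gt;getSockOpt(ZMQ::SOCKOPT_RCVMORE);
} while ($more);

printf("Received %d frames%s", count($frames), PHP_EOL);
</programlisting>

<para>PHP strings are released automatically, so there is no counterpart here to the explicit close call in the C version.</para>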

<para>Some things to know about multi-part messages:</para>

<itemizedlist>
  <listitem><para>When you send a multi-part message, the first part (and all following parts) are only actually sent on the wire when you send the final part.</para></listitem>
  <listitem><para>If you are using <literal>zmq-poll[3]</literal>, when you receive the first part of a message, all the rest has also arrived.</para></listitem>
  <listitem><para>You will receive all parts of a message, or none at all.</para></listitem>
  <listitem><para>Each part of a message is a separate <literal>zmq_msg</literal> item.</para></listitem>
  <listitem><para>You will receive all parts of a message whether or not you check the RCVMORE option.</para></listitem>
  <listitem><para>On sending, &Oslash;MQ queues message frames in memory until the last is received, then sends them all.</para></listitem>
  <listitem><para>There is no way to cancel a partially sent message, except by closing the socket.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Intermediaries and Proxies</title>
<para>&Oslash;MQ aims for decentralized intelligence but that doesn't mean your network is empty space in the middle. It's filled with message-aware infrastructure and quite often, we build that infrastructure with &Oslash;MQ. The &Oslash;MQ plumbing can range from tiny pipes to full-blown service-oriented brokers. The messaging industry calls this "intermediation", meaning that the stuff in the middle deals with either side. In &Oslash;MQ we call these proxies, queues, forwarders, devices, or brokers, depending on the context.</para>

<para>This pattern is extremely common in the real world and is why our societies and economies are filled with intermediaries who have no other real function than to reduce the complexity and scaling costs of larger networks. Real-world intermediaries are typically called wholesalers, distributors, managers, etc.</para>

</sect2>
<sect2>
<title>The Dynamic Discovery Problem</title>
<para>One of the problems you will hit as you design larger distributed architectures is discovery. That is, how do pieces know about each other? It's especially difficult if pieces come and go, thus we can call this the "dynamic discovery problem".</para>

<para>There are several solutions to dynamic discovery. The simplest is to entirely avoid it by hard-coding (or configuring) the network architecture so discovery is done by hand. That is, when you add a new piece, you reconfigure the network to know about it.</para>

<para>In practice this leads to increasingly fragile and hard-to-manage architectures. Let's say you have one publisher and a hundred subscribers. You connect each subscriber to the publisher by configuring a publisher endpoint in each subscriber. That's easy(<xref linkend="figure-12"/>). Subscribers are dynamic, the publisher is static. Now say you add more publishers. Suddenly it's not so easy any more. If you continue to connect each subscriber to each publisher, the cost of avoiding dynamic discovery gets higher and higher.</para>

<figure id="figure-12">
    <title>Small-scale Pub-Sub Network</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig12.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>There are quite a few answers to this but the very simplest answer is to add an intermediary, that is, a static point in the network to which all other nodes connect. In classic messaging, this is the job of the "message broker". &Oslash;MQ doesn't come with a message broker as such, but it lets us build intermediaries quite easily.</para>

<para>You might wonder, if all networks eventually get large enough to need intermediaries, why don't we simply have a message broker in place for all applications? For beginners, it's a fair compromise. Just always use a star topology, forget about performance, and things will usually work. However message brokers are greedy things; in their role as central intermediaries, they become too complex, too stateful, and eventually a problem.</para>

<para>It's better to think of intermediaries as simple stateless message switches. The best analogy is an HTTP proxy; it's there but doesn't have any special role. Adding a pub-sub proxy solves the dynamic discovery problem in our example. We set the proxy in the "middle" of the network(<xref linkend="figure-13"/>). The proxy opens an XSUB socket, an XPUB socket, and binds each to well-known IP addresses and ports. Then all other processes connect to the proxy, instead of to each other. It becomes trivial to add more subscribers or publishers.</para>

<figure id="figure-13">
    <title>Pub-Sub Network with a Proxy</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig13.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>We need XPUB and XSUB sockets because &Oslash;MQ does subscription forwarding: SUB sockets actually send subscriptions to PUB sockets as special messages. The proxy has to forward these as well, by reading them from the XPUB socket and writing them to the XSUB socket. This is the main use-case for XSUB and XPUB(<xref linkend="figure-14"/>).</para>

<figure id="figure-14">
    <title>Extended Publish-Subscribe</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig14.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>
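
<para>Here is one way the forwarding described above might look in PHP. Treat it as a sketch, not a reference implementation: it assumes a php-zmq build that exposes the <literal>ZMQ::SOCKET_XSUB</literal> and <literal>ZMQ::SOCKET_XPUB</literal> constants, and the function names and port numbers are invented for the example.</para>

<programlisting language="php">
&lt;?php
//  Assumptions: php-zmq built against a libzmq that has XSUB/XPUB;
//  function names and ports below are hypothetical

//  Copy one whole message (all of its frames) from one socket to another
function forward_message(ZMQSocket $from, ZMQSocket $to) {
    do {
        $frame = $from-&gt;recv();
        $more = $from-&gt;getSockOpt(ZMQ::SOCKOPT_RCVMORE);
        $to-&gt;send($frame, $more ? ZMQ::MODE_SNDMORE : 0);
    } while ($more);
}

function run_pubsub_proxy($frontend_endpoint, $backend_endpoint) {
    $context = new ZMQContext();

    //  Publishers connect here
    $frontend = new ZMQSocket($context, ZMQ::SOCKET_XSUB);
    $frontend-&gt;bind($frontend_endpoint);

    //  Subscribers connect here
    $backend = new ZMQSocket($context, ZMQ::SOCKET_XPUB);
    $backend-&gt;bind($backend_endpoint);

    $poll = new ZMQPoll();
    $poll-&gt;add($frontend, ZMQ::POLL_IN);
    $poll-&gt;add($backend, ZMQ::POLL_IN);
    $readable = $writeable = array();

    while (true) {
        $poll-&gt;poll($readable, $writeable);
        foreach ($readable as $socket) {
            if ($socket === $frontend) {
                //  Published data flows towards the subscribers
                forward_message($frontend, $backend);
            } else if ($socket === $backend) {
                //  Subscriptions flow back towards the publishers
                forward_message($backend, $frontend);
            }
        }
    }
}

//  Runs until the process is killed:
//  run_pubsub_proxy("tcp://*:5557", "tcp://*:5558");
</programlisting>
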

</sect2>
<sect2>
<title>Shared Queue (DEALER and ROUTER sockets)</title>
<para>In the Hello World client-server application we have one client that talks to one service. However in real cases we usually need to allow multiple services as well as multiple clients. This lets us scale up the power of the service (many threads or processes or nodes rather than just one). The only constraint is that services must be stateless, all state being in the request or in some shared storage such as a database.</para>

<para>There are two ways to connect multiple clients to multiple servers. The brute-force way is to connect each client socket to multiple service endpoints. One client socket can connect to multiple service sockets, and the REQ socket will then load-balance requests among these services. Let's say you connect a client socket to three service endpoints, A, B, and C. The client makes requests R1, R2, R3, R4. R1 and R4 go to service A, R2 goes to B, and R3 goes to service C(<xref linkend="figure-15"/>).</para>

<figure id="figure-15">
    <title>Load-balancing of Requests</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig15.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>
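
<para>The distribution in this example can be reproduced with a few lines of plain PHP. This is only a simulation of the round-robin order described above, with no sockets involved; the variable names are made up.</para>

<programlisting language="php">
&lt;?php
//  Simulation only: mimics the order in which requests R1..R4 are
//  spread over services A, B, and C; it does not open any sockets
$services = array('A', 'B', 'C');
$assignments = array();
for ($request = 1; $request &lt;= 4; $request++) {
    $service = $services[($request - 1) % count($services)];
    $assignments["R$request"] = $service;
    printf("R%d goes to service %s%s", $request, $service, PHP_EOL);
}
</programlisting>
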

<para>This design lets you add more clients cheaply. You can also add more services. Each client will load-balance its requests to the services. But each client has to know the service topology. If you have 100 clients and then you decide to add three more services, you need to reconfigure and restart 100 clients in order for the clients to know about the three new services.</para>

<para>That's clearly not the kind of thing we want to be doing at 3am when our supercomputing cluster has run out of resources and we desperately need to add a couple of hundred new service nodes. Too many static pieces are like liquid concrete: knowledge is distributed and the more static pieces you have, the more effort it is to change the topology. What we want is something sitting in between clients and services that centralizes all knowledge of the topology. Ideally, we should be able to add and remove services or clients at any time without touching any other part of the topology.</para>

<para>So we'll write a little message queuing broker that gives us this flexibility. The broker binds to two endpoints, a frontend for clients and a backend for services. It then uses <literal>zmq-poll[3]</literal> to monitor these two sockets for activity and when it has some, it shuttles messages between its two sockets. It doesn't actually manage any queues explicitly -- &Oslash;MQ does that automatically on each socket.</para>

<para>When you use REQ to talk to REP you get a strictly synchronous request-reply dialog. The client sends a request. The service reads the request and sends a reply. The client then reads the reply. If either the client or the service tries to do anything else (e.g. sending two requests in a row without waiting for a response) it will get an error.</para>

<para>But our broker has to be non-blocking. Obviously we can use <literal>zmq-poll[3]</literal> to wait for activity on either socket, but we can't use REP and REQ.</para>

<para>Luckily there are two sockets called DEALER and ROUTER that let you do non-blocking request-response. You'll see in Advanced Request-Reply Patterns<xref linkend="advanced-request-reply"/> how DEALER and ROUTER sockets let you build all kinds of asynchronous request-reply flows. For now, we're just going to see how DEALER and ROUTER let us extend REQ-REP across an intermediary, that is, our little broker.</para>

<para>In this simple extended request-reply pattern, REQ talks to ROUTER and DEALER talks to REP. In between the DEALER and ROUTER we have to have code (like our broker) that pulls messages off the one socket and shoves them onto the other(<xref linkend="figure-16"/>).</para>

<figure id="figure-16">
    <title>Extended Request-reply</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig16.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The request-reply broker binds to two endpoints, one for clients to connect to (the frontend socket) and one for workers to connect to (the backend). To test this broker, you will want to change your workers so they connect to the backend socket. Here is a client that shows what I mean:</para>

<example id="rrclient-php">
<title>Request-reply client (rrclient.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Hello World client
 * Connects REQ socket to tcp://localhost:5559
 * Sends "Hello" to server, expects "World" back
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to talk to server
$requester = new ZMQSocket($context, ZMQ::SOCKET_REQ);
$requester-&gt;connect("tcp://localhost:5559");

for($request_nbr = 0; $request_nbr &lt; 10; $request_nbr++) {
	$requester-&gt;send("Hello");
	$string = $requester-&gt;recv();
	printf ("Received reply %d [%s]%s", $request_nbr, $string, PHP_EOL);
}
</programlisting>

</example>
<para>Here is the worker:</para>

<example id="rrworker-php">
<title>Request-reply worker (rrworker.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Hello World server
 * Connects REP socket to tcp://localhost:5560
 * Expects "Hello" from client, replies with "World"
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to talk to clients
$responder = new ZMQSocket($context, ZMQ::SOCKET_REP);
$responder-&gt;connect("tcp://localhost:5560");

while(true) {
	//  Wait for next request from client
	$string = $responder-&gt;recv();
	printf ("Received request: [%s]%s", $string, PHP_EOL);
	
	// Do some 'work'
	sleep(1);
	
	//  Send reply back to client
	$responder-&gt;send("World");
}
</programlisting>

</example>
<para>And here is the broker, which properly handles multi-part messages:</para>

<example id="rrbroker-php">
<title>Request-reply broker (rrbroker.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Simple request-reply broker
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  Prepare our context and sockets
$context = new ZMQContext();
$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$backend = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
$frontend-&gt;bind("tcp://*:5559");
$backend-&gt;bind("tcp://*:5560");

//  Initialize poll set
$poll = new ZMQPoll();
$poll-&gt;add($frontend, ZMQ::POLL_IN);
$poll-&gt;add($backend, ZMQ::POLL_IN);
$readable = $writeable = array();

//  Switch messages between sockets
while(true) {
	$events = $poll-&gt;poll($readable, $writeable);
	
	foreach($readable as $socket) {
		if($socket === $frontend) {
			//  Forward all parts of the request to the backend
			while(true) {
				$message = $socket-&gt;recv();
				//  Multipart detection
				$more = $socket-&gt;getSockOpt(ZMQ::SOCKOPT_RCVMORE);
				$backend-&gt;send($message, $more ? ZMQ::MODE_SNDMORE : 0);
				if(!$more) {
					break; //  Last message part
				}
			}
		} 
		else if($socket === $backend) {
			//  Forward all parts of the reply to the frontend
			while(true) {
				$message = $socket-&gt;recv();
				//  Multipart detection
				$more = $socket-&gt;getSockOpt(ZMQ::SOCKOPT_RCVMORE);
				$frontend-&gt;send($message, $more ? ZMQ::MODE_SNDMORE : 0);
				if(!$more) {
					break; //  Last message part
				}
			}
		}
	}
}
</programlisting>

</example>
<para>Using a request-reply broker makes your client-server architectures easier to scale since clients don't see workers, and workers don't see clients. The only static node is the broker in the middle(<xref linkend="figure-17"/>).</para>

<figure id="figure-17">
    <title>Request-reply Broker</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig17.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

</sect2>
<sect2>
<title>&Oslash;MQ's Built-in Proxy Function</title>
<para>It turns out that the core loop in the previous section's rrbroker is very useful, and reusable. It lets us build pub-sub forwarders and shared queues and other little intermediaries, with very little effort. &Oslash;MQ wraps this up in a single method, <literal>zmq-proxy[3]</literal>:</para>

<programlisting language="c">
zmq_proxy (frontend, backend, capture);
</programlisting>

<para>The two sockets (or three, if we want to capture data) must be properly connected, bound, and configured. When we call the <literal>zmq_proxy</literal> method it's exactly like starting the main loop of rrbroker. Let's rewrite the request-reply broker to call <literal>zmq_proxy</literal>, and re-badge this as an expensive-sounding "message queue" (people have charged houses for code that did less):</para>

<example id="msgqueue-php">
<title>Message queue broker (msgqueue.php)</title>
<programlisting language="php">
&lt;?php

/*
 *  Simple message queuing broker
 *  Same as request-reply broker but using QUEUE device
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket facing clients
$frontend = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$frontend-&gt;bind("tcp://*:5559");

//  Socket facing services
$backend = $context-&gt;getSocket(ZMQ::SOCKET_DEALER);
$backend-&gt;bind("tcp://*:5560");

//  Start built-in device; run() blocks while it forwards messages
$device = new ZMQDevice($frontend, $backend);
$device-&gt;run();

//  We never get here...
</programlisting>

</example>
<para>If you're like most &Oslash;MQ users, at this stage your mind is starting to think, "what kind of evil stuff can I do if I plug random socket types into the proxy?"  The short answer is: try it and work out what is happening. In practice you would usually stick to ROUTER/DEALER, XSUB/XPUB, or PULL/PUSH.</para>

</sect2>
<sect2>
<title>Transport Bridging</title>
<para>A frequent request from &Oslash;MQ users is "how do I connect my &Oslash;MQ network with technology X?" where X is some other networking or messaging technology. The simple answer is to build a "bridge". A bridge is a small application that speaks one protocol at one socket, and converts to/from a second protocol at another socket. A protocol interpreter, if you like. A common bridging problem in &Oslash;MQ is to bridge two transports or networks.</para>

<para>As an example, we're going to write a little proxy that sits in between a publisher and a set of subscribers, bridging two networks. The frontend socket (SUB) faces the internal network, where the weather server is sitting, and the backend (PUB) faces subscribers on the external network. It subscribes to the weather service on the frontend socket, and republishes its data on the backend socket(<xref linkend="figure-18"/>).</para>

<example id="wuproxy-php">
<title>Weather update proxy (wuproxy.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Weather proxy device
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  This is where the weather server sits
$frontend = new ZMQSocket($context, ZMQ::SOCKET_SUB);
$frontend-&gt;connect("tcp://192.168.55.210:5556");

//  This is our public endpoint for subscribers
$backend = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$backend-&gt;bind("tcp://10.1.1.0:8100");

//  Subscribe on everything
$frontend-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "");

//  Shunt messages out to our own subscribers
while(true) {
	while(true) {
		//  Process all parts of the message
		$message = $frontend-&gt;recv();
		$more = $frontend-&gt;getSockOpt(ZMQ::SOCKOPT_RCVMORE);
		$backend-&gt;send($message, $more ? ZMQ::MODE_SNDMORE : 0);
		if(!$more) {
			break; // Last message part
		}
	}
}
</programlisting>

</example>
<para>It looks very similar to the earlier proxy example but the key part is that the frontend and backend sockets are on two different networks. We can use this model for example to connect a multicast network (<literal>pgm</literal> transport) to a TCP publisher.</para>

<figure id="figure-18">
    <title>Pub-Sub Forwarder Proxy</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig18.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

</sect2>
</sect1>
<sect1>
<title>Handling Errors and ETERM</title>
<para>&Oslash;MQ's error handling philosophy is a mix of fail-fast and resilience. Processes, we believe, should be as vulnerable as possible to internal errors, and as robust as possible against external attacks and errors. To give an analogy, a living cell will self-destruct if it detects a single internal error, yet it will resist attack from the outside by all means possible.</para>

<para>Assertions, which pepper the &Oslash;MQ code, are absolutely vital to robust code, they just have to be on the right side of the cellular wall. And there should be such a wall. If it is unclear whether a fault is internal or external, that is a design flaw to be fixed. In C/C++, assertions stop the application immediately with an error. In other languages you may get exceptions or halts.</para>

<para>When &Oslash;MQ detects an external fault it returns an error to the calling code. In some rare cases it drops messages silently, if there is no obvious strategy for recovering from the error.</para>

<para>In most of the C examples we've seen so far there's been no error handling. <emphasis role="bold">Real code should do error handling on every single &Oslash;MQ call</emphasis>. If you're using a language binding other than C, the binding may handle errors for you. In C you do need to do this yourself. There are some simple rules, starting with POSIX conventions:</para>

<itemizedlist>
  <listitem><para>Methods that create objects return NULL if they fail.</para></listitem>
  <listitem><para>Methods that process data may return the number of bytes processed, or -1 on an error or failure.</para></listitem>
  <listitem><para>Other methods return 0 on success and -1 on an error or failure.</para></listitem>
  <listitem><para>The error code is provided in <literal>errno</literal> or <literal>zmq-errno[3]</literal>.</para></listitem>
  <listitem><para>A descriptive error text for logging is provided by <literal>zmq-strerror[3]</literal>.</para></listitem>
</itemizedlist>
<para>For example:</para>

<programlisting language="c">
void *context = zmq_ctx_new ();
assert (context);
void *socket = zmq_socket (context, ZMQ_REP);
assert (socket);
int rc = zmq_bind (socket, "tcp://*:5555");
if (rc != 0) {
    printf ("E: bind failed: %s\n", strerror (errno));
    return -1;
}
</programlisting>
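
<para>From PHP you will normally see these failures as exceptions rather than return codes. The sketch below assumes the php-zmq extension, where a failed <literal>bind()</literal> throws a <literal>ZMQSocketException</literal>; the <literal>bogus://nowhere</literal> endpoint is deliberately invalid and exists only to provoke the error.</para>

<programlisting language="php">
&lt;?php
//  Assumes the php-zmq extension; the endpoint is intentionally invalid
$context = new ZMQContext();
$socket = new ZMQSocket($context, ZMQ::SOCKET_REP);

$bound = true;
try {
    //  "bogus" is not a transport that 0MQ knows, so bind() must fail
    $socket-&gt;bind("bogus://nowhere");
} catch (ZMQSocketException $e) {
    $bound = false;
    printf("E: bind failed: %s (errno %d)%s", $e-&gt;getMessage(), $e-&gt;getCode(), PHP_EOL);
}
</programlisting>
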

<para>There are two main exceptional conditions that you may want to handle as non-fatal:</para>

<itemizedlist>
  <listitem><para>When a thread calls <literal>zmq-msg-recv[3]</literal> with the <literal>ZMQ_DONTWAIT</literal> option and there is no waiting data. &Oslash;MQ will return -1 and set <literal>errno</literal> to <literal>EAGAIN</literal>.</para></listitem>
  <listitem><para>When a thread calls <literal>zmq-ctx-destroy[3]</literal> and other threads are doing blocking work. The <literal>zmq-ctx-destroy[3]</literal> call closes the context and all blocking calls exit with -1, and errno set to <literal>ETERM</literal>.</para></listitem>
</itemizedlist>
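<para>The first of these conditions looks a little different from PHP. Assuming the php-zmq binding behaves as its documentation describes, a non-blocking <literal>recv()</literal> that finds no waiting data returns <literal>false</literal> instead of raising an error, so you test the return value. The endpoint name below is made up.</para>

<programlisting language="php">
&lt;?php
//  Assumes the php-zmq extension; the endpoint name is hypothetical
$context = new ZMQContext();
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;bind("inproc://eagain-demo");

//  Nothing has been sent, so a non-blocking read has nothing to return
$message = $receiver-&gt;recv(ZMQ::MODE_NOBLOCK);
if ($message === false) {
    echo "No message waiting, carry on with other work", PHP_EOL;
}
</programlisting>

<para>Use a strict comparison here, since an empty message is a perfectly valid thing to receive and is not the same as <literal>false</literal>.</para>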
<para>In C/C++, asserts can be removed entirely in optimized code, so don't make the mistake of wrapping the whole &Oslash;MQ call in an assert(). It looks neat, then the optimizer removes all the asserts and the calls you want to make, and your application breaks in impressive ways.</para>

<para>Let's see how to shut down a process cleanly. We'll take the parallel pipeline example from the previous section. If we've started a whole lot of workers in the background, we now want to kill them when the batch is finished. Let's do this by sending a kill message to the workers. The best place to do this is the sink, since it really knows when the batch is done.</para>

<para>How do we connect the sink to the workers? The PUSH/PULL sockets are one-way only. The standard &Oslash;MQ answer is: create a new socket flow for each type of problem you need to solve. We'll use a publish-subscribe model to send kill messages to the workers(<xref linkend="figure-19"/>):</para>

<itemizedlist>
  <listitem><para>The sink creates a PUB socket on a new endpoint.</para></listitem>
  <listitem><para>Workers connect their input socket to this endpoint.</para></listitem>
  <listitem><para>When the sink detects the end of the batch it sends a kill to its PUB socket.</para></listitem>
  <listitem><para>When a worker detects this kill message, it exits.</para></listitem>
</itemizedlist>
<para>It doesn't take much new code in the sink:</para>

<programlisting language="c">
void *control = zmq_socket (context, ZMQ_PUB);
zmq_bind (control, "tcp://*:5559");
...
//  Send kill signal to workers
//  "KILL" is a static string, so no free function or hint is needed
zmq_msg_t message;
zmq_msg_init_data (&amp;message, "KILL", 5, NULL, NULL);
zmq_msg_send (&amp;message, control, 0);
zmq_msg_close (&amp;message);
</programlisting>

<figure id="figure-19">
    <title>Parallel Pipeline with Kill Signaling</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig19.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Here is the worker process, which manages two sockets (a PULL socket getting tasks, and a SUB socket getting control commands) using the <literal>zmq-poll[3]</literal> technique we saw earlier:</para>

<example id="taskwork2-php">
<title>Parallel task worker with kill signaling (taskwork2.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Task worker - design 2
 *  Adds pub-sub flow to receive and respond to kill signal
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to receive messages on
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;connect("tcp://localhost:5557");

//  Socket to send messages to
$sender = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
$sender-&gt;connect("tcp://localhost:5558");

//  Socket for control input
$controller = new ZMQSocket($context, ZMQ::SOCKET_SUB);
$controller-&gt;connect("tcp://localhost:5559");
$controller-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "");

//  Process messages from receiver and controller
$poll = new ZMQPoll();
$poll-&gt;add($receiver, ZMQ::POLL_IN);
$poll-&gt;add($controller, ZMQ::POLL_IN);
$readable = $writeable = array();

//  Process messages from both sockets
while (true) {
	$events = $poll-&gt;poll($readable, $writeable);
	if($events &gt; 0) {
		foreach($readable as $socket) {
			if($socket === $receiver) {
				$message = $socket-&gt;recv();
				//  Simple progress indicator for the viewer
				echo $message, PHP_EOL;

				//  Do the work
				usleep($message * 1000);

			   //  Send results to sink
				$sender-&gt;send("");
			}
			//  Any waiting controller command acts as 'KILL'
			else if($socket === $controller) {
				exit();
			}
		}
	}
}
</programlisting>

</example>
<para>Here is the modified sink application. When it's finished collecting results it broadcasts a kill message to all workers:</para>

<example id="tasksink2-php">
<title>Parallel task sink with kill signaling (tasksink2.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Task design 2
 *  Adds pub-sub flow to send kill signal to workers
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  Socket to receive messages on
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver-&gt;bind("tcp://*:5558");

//  Socket for worker control
$controller = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$controller-&gt;bind("tcp://*:5559");

//  Wait for start of batch
$string = $receiver-&gt;recv();

//  Process 100 confirmations
$tstart = microtime(true);
$total_msec = 0;     //  Total calculated cost in msecs
for ($task_nbr = 0; $task_nbr &lt; 100; $task_nbr++) {
	$string = $receiver-&gt;recv();
    
	if($task_nbr % 10 == 0) {
		echo ":";
	} else {
		echo ".";
	}
}

$tend = microtime(true);

$total_msec = ($tend - $tstart) * 1000;
echo PHP_EOL;
printf ("Total elapsed time: %d msec", $total_msec);
echo PHP_EOL;

//  Send kill signal to workers
$controller-&gt;send("KILL");

//  Finished
sleep (1);              //  Give 0MQ time to deliver
</programlisting>

</example>
</sect1>
<sect1>
<title>Handling Interrupt Signals</title>
<para>Realistic applications need to shut down cleanly when interrupted with Ctrl-C or another signal such as SIGTERM. By default, these simply kill the process, meaning messages won't be flushed, files won't be closed cleanly, etc.</para>

<para>Here is how we handle a signal in various languages:</para>

<example id="interrupt-php">
<title>Handling Ctrl-C cleanly (interrupt.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>The program provides <literal>s_catch_signals()</literal>, which traps Ctrl-C (<literal>SIGINT</literal>) and <literal>SIGTERM</literal>. When either of these signals arrives, the <literal>s_catch_signals()</literal> handler sets the global variable <literal>s_interrupted</literal>. Thanks to your signal handler, your application will not die automatically. Instead, you have a chance to clean up and exit gracefully. You now have to check explicitly for an interrupt and handle it properly. Do this by calling <literal>s_catch_signals()</literal> (copy this from <literal>interrupt.c</literal>) at the start of your main code. This sets up the signal handling. The interrupt will affect &Oslash;MQ calls as follows:</para>

<itemizedlist>
  <listitem><para>If your code is blocking in <literal>zmq_msg_recv[3]</literal>, <literal>zmq_poll[3]</literal>, or <literal>zmq_msg_send[3]</literal> when a signal arrives, the call will return with <literal>EINTR</literal>.</para></listitem>
  <listitem><para>Wrappers like <literal>s_recv()</literal> return NULL if they are interrupted.</para></listitem>
</itemizedlist>
<para>So check for an <literal>EINTR</literal> return code, a NULL return, and/or <literal>s_interrupted</literal>.</para>

<para>Here is a typical code fragment:</para>

<screen>s_catch_signals ();
client = zmq_socket (...);
while (!s_interrupted) {
    char *message = s_recv (client);
    if (!message)
        break;          //  Ctrl-C used
}
zmq_close (client);
</screen>

<para>If you call <literal>s_catch_signals()</literal> and don't test for interrupts, then your application will become immune to Ctrl-C and SIGTERM, which may be useful, but is usually not.</para>
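
<para>Since <literal>interrupt.php</literal> has not been translated yet, here is only a rough sketch of the same flag-based idea in PHP. It assumes the <literal>pcntl</literal> and <literal>posix</literal> extensions (PHP CLI on a POSIX system). The function name and the flag variable simply mirror the C helper and are not part of any library. The loop body is a stand-in for real work such as a poll with a timeout, and the script signals itself so that it ends on its own:</para>

<programlisting language="php">
&lt;?php
//  Sketch only: flag-based signal handling in the spirit of s_catch_signals()
//  Assumes the pcntl and posix extensions (PHP CLI on a POSIX system)

function s_catch_signals(&amp;$interrupted) {
	//  The handler only sets a flag; the main loop decides what to do
	$handler = function ($signo) use (&amp;$interrupted) {
		$interrupted = true;
	};
	pcntl_signal(SIGINT, $handler);
	pcntl_signal(SIGTERM, $handler);
}

$s_interrupted = false;
s_catch_signals($s_interrupted);

$iterations = 0;
while (!$s_interrupted) {
	//  Stand-in for real work, e.g. polling sockets with a timeout
	$iterations++;
	if ($iterations == 3) {
		//  Simulate Ctrl-C by signaling ourselves
		posix_kill(posix_getpid(), SIGINT);
	}
	//  Run any pending signal handlers
	pcntl_signal_dispatch();
}

echo "Interrupted after $iterations iterations, cleaning up", PHP_EOL;
</programlisting>

<para>In a real application the clean-up after the loop is where you would close sockets and flush files before exiting.</para>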

</sect1>
<sect1>
<title>Detecting Memory Leaks</title>
<para>Any long-running application has to manage memory correctly, or eventually it'll use up all available memory and crash. If you use a language that handles this automatically for you, congratulations. If you program in C or C++ or any other language where you're responsible for memory management, here's a short tutorial on using valgrind, which among other things will report on any leaks your programs have.</para>

<itemizedlist>
  <listitem><para>To install valgrind, e.g. on Ubuntu or Debian issue:</para></listitem>
</itemizedlist>
<screen>sudo apt-get install valgrind
</screen>

<itemizedlist>
  <listitem><para>By default, &Oslash;MQ will cause valgrind to complain a lot. To remove these warnings, create a file called <literal>valgrind.supp</literal> that contains this:</para></listitem>
</itemizedlist>
<screen>{
   &lt;socketcall-sendto&gt;
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   ...
}
{
   &lt;socketcall-sendto&gt;
   Memcheck:Param
   socketcall.send(msg)
   fun:send
   ...
}
</screen>

<itemizedlist>
  <listitem><para>Fix your applications to exit cleanly after Ctrl-C. For any application that exits by itself, that's not needed, but for long-running applications, this is essential, otherwise valgrind will complain about all currently allocated memory.</para></listitem>
  <listitem><para>Build your application with -DDEBUG, if it's not your default setting. That ensures valgrind can tell you exactly where memory is being leaked.</para></listitem>
  <listitem><para>Finally, run valgrind thus:</para></listitem>
</itemizedlist>
<screen>valgrind --tool=memcheck --leak-check=full --suppressions=valgrind.supp someprog
</screen>

<para>And after fixing any errors it reported, you should get the pleasant message:</para>

<screen>==30536== ERROR SUMMARY: 0 errors from 0 contexts...
</screen>

</sect1>
<sect1>
<title>Multithreading with &Oslash;MQ</title>
<para>&Oslash;MQ is perhaps the nicest way ever to write multithreaded (MT) applications. Whereas &Oslash;MQ sockets require some readjustment if you are used to traditional sockets, &Oslash;MQ multithreading will take everything you know about writing MT applications, throw it into a heap in the garden, pour gasoline over it, and set it alight. It's a rare book that deserves burning, but most books on concurrent programming do.</para>

<para>To make utterly perfect MT programs (and I mean that literally) <emphasis role="bold">we don't need mutexes, locks, or any other form of inter-thread communication except messages sent across &Oslash;MQ sockets.</emphasis></para>

<para>By "perfect" MT programs I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns.</para>

<para>If you've spent years learning tricks to make your MT code work at all, let alone rapidly, with locks and semaphores and critical sections, you will be disgusted when you realize it was all for nothing. If there's one lesson we've learned from 30+ years of concurrent programming it is: <emphasis>just don't share state</emphasis>. It's like two drunkards trying to share a beer. It doesn't matter if they're good buddies. Sooner or later they're going to get into a fight. And the more drunkards you add to the table, the more they fight each other over the beer. The tragic majority of MT applications look like drunken bar fights.</para>

<para>The list of weird problems that you need to fight as you write classic shared-state MT code would be hilarious if it didn't translate directly into stress and risk, as code that seems to work suddenly fails under pressure. A large firm with world-beating experience in buggy code released its list of "11 Likely Problems In Your Multithreaded Code", which covers forgotten synchronization, incorrect granularity, read and write tearing, lock-free reordering, lock convoys, two-step dance, and priority inversion.</para>

<para>Yeah, we also counted seven problems, not eleven. That's not the point though. The point is, do you really want that code running the power grid or stock market to start getting two-step lock convoys at 3pm on a busy Thursday? Who cares what the terms actually mean. This is not what turned us on to programming, fighting ever more complex side-effects with ever more complex hacks.</para>

<para>Some widely used models, despite being the basis for entire industries, are fundamentally broken, and shared state concurrency is one of them. Code that wants to scale without limit does it like the Internet does, by sending messages and sharing nothing except a common contempt for broken programming models.</para>

<para>You should follow some rules to write happy multithreaded code with &Oslash;MQ:</para>

<itemizedlist>
  <listitem><para>You must not access the same data from multiple threads. Using classic MT techniques like mutexes is an anti-pattern in &Oslash;MQ applications. The only exception to this is a &Oslash;MQ context object, which is threadsafe.</para></listitem>
  <listitem><para>You must create a &Oslash;MQ context for your process, and pass that to all threads that you want to connect via <literal>inproc</literal> sockets.</para></listitem>
  <listitem><para>You may treat threads as separate tasks, with their own context, but these threads cannot communicate over <literal>inproc</literal>. However they will be easier to break into standalone processes afterwards.</para></listitem>
  <listitem><para>You must not share &Oslash;MQ sockets between threads. &Oslash;MQ sockets are not threadsafe. Technically it's possible to do this, but it demands semaphores, locks, or mutexes. This will make your application slow and fragile. The only place where it's remotely sane to share sockets between threads is in language bindings that need to do magic like garbage collection on sockets.</para></listitem>
</itemizedlist>
<para>If you need to start more than one proxy in an application, for example, you will want to run each in its own thread. It is easy to make the error of creating the proxy frontend and backend sockets in one thread, and then passing the sockets to the proxy in another thread. This may appear to work but will fail randomly. Remember: <emphasis>Do not use or close sockets except in the thread that created them.</emphasis></para>

<para>If you follow these rules, you can quite easily split threads into separate processes, when you need to. Application logic can sit in threads, processes, nodes: whatever your scale needs.</para>

<para>&Oslash;MQ uses native OS threads rather than virtual "green" threads. The advantage is that you don't need to learn any new threading API, and that &Oslash;MQ threads map cleanly to your operating system. You can use standard tools like Intel's ThreadChecker to see what your application is doing. The disadvantages are that your code, when it for instance starts new threads, won't be portable, and that if you have a huge number of threads (thousands), some operating systems will get stressed.</para>

<para>Let's see how this works in practice. We'll turn our old Hello World server into something more capable. The original server ran in a single thread. If the work per request is low, that's fine: one &Oslash;MQ thread can run at full speed on a CPU core, with no waits, doing an awful lot of work. But realistic servers have to do non-trivial work per request. A single core may not be enough when 10,000 clients hit the server all at once. So a realistic server must start multiple worker threads. It then accepts requests as fast as it can, and distributes these to its worker threads. The worker threads grind through the work, and eventually send their replies back.</para>

<para>You can of course do all this using a proxy broker and external worker processes, but often it's easier to start one process that gobbles up sixteen cores than sixteen processes, each gobbling up one core. Further, running workers as threads will cut out a network hop, latency, and network traffic.</para>

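<para>The MT version of the Hello World service basically collapses the broker and workers into a single process. The C original uses pthreads because it's the most widespread standard for multithreading. The PHP translation below forks worker processes with <literal>pcntl_fork()</literal> instead, since PHP lacks threads, and so connects them over <literal>ipc</literal> rather than <literal>inproc</literal>:</para>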

<example id="mtserver-php">
<title>Multithreaded service (mtserver.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Multithreaded Hello World server. Uses processes due
 * to PHP's lack of threads!
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

function worker_routine() {
	$context = new ZMQContext();
	// Socket to talk to dispatcher
	$receiver = new ZMQSocket($context, ZMQ::SOCKET_REP);
	$receiver-&gt;connect("ipc://workers.ipc");
	
	while(true) {
		$string = $receiver-&gt;recv();
		printf ("Received request: [%s]%s", $string, PHP_EOL);
		
		// Do some 'work'
		sleep(1);
		
		// Send reply back to client
		$receiver-&gt;send("World");
	}
}

//  Launch pool of worker threads
for($thread_nbr = 0; $thread_nbr != 5; $thread_nbr++) {	
	$pid = pcntl_fork();
	if($pid == 0) {
		worker_routine();
		exit();
	}
}

//  Prepare our context and sockets
$context = new ZMQContext();

//  Socket to talk to clients
$clients = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$clients-&gt;bind("tcp://*:5555");

//  Socket to talk to workers
$workers = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
$workers-&gt;bind("ipc://workers.ipc");

//  Connect work threads to client threads via a queue
$device = new ZMQDevice($clients, $workers);
$device-&gt;run ();
</programlisting>

</example>
<para>All the code should be recognizable to you by now. How it works:</para>

<itemizedlist>
  <listitem><para>The server starts a set of worker threads. Each worker thread creates a REP socket and then processes requests on this socket. Worker threads are just like single-threaded servers. The only differences are the transport (<literal>inproc</literal> instead of <literal>tcp</literal>), and the bind-connect direction.</para></listitem>
  <listitem><para>The server creates a ROUTER socket to talk to clients and binds this to its external interface (over <literal>tcp</literal>).</para></listitem>
  <listitem><para>The server creates a DEALER socket to talk to the workers and binds this to its internal interface (over <literal>inproc</literal>).</para></listitem>
  <listitem><para>The server starts a proxy that connects the two sockets. The proxy pulls incoming requests fairly from all clients, and distributes those out to workers. It also routes replies back to their origin.</para></listitem>
</itemizedlist>
<para>Note that creating threads is not portable in most programming languages. The POSIX library is pthreads, but on Windows you have to use a different API. In the C version of this example, the <literal>pthread_create</literal> call starts up a new thread running the <literal>worker_routine</literal> function we defined. We'll see in Advanced Request-Reply Patterns<xref linkend="advanced-request-reply"/> how to wrap this in a portable API.</para>

<para>Here the 'work' is just a one-second pause. We could do anything in the workers, including talking to other nodes. This is what the MT server looks like in terms of &Oslash;MQ sockets and nodes. Note how the request-reply chain is <literal>REQ-ROUTER-queue-DEALER-REP</literal>(<xref linkend="figure-20"/>).</para>

<figure id="figure-20">
    <title>Multithreaded Server</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig20.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

</sect1>
<sect1>
<title>Signaling between Threads (PAIR sockets)</title>
<para>When you start making multithreaded applications with &Oslash;MQ, you'll encounter the question of how to coordinate your threads. Though you might be tempted to insert 'sleep' statements, or use multithreading techniques such as semaphores or mutexes, <emphasis role="bold">the only mechanism that you should use is &Oslash;MQ messages</emphasis>. Remember the story of The Drunkards and the Beer Bottle.</para>

<para>Let's make three threads that signal each other when they are ready(<xref linkend="figure-21"/>). In this example we use PAIR sockets over the <literal>inproc</literal> transport:</para>

<example id="mtrelay-php">
<title>Multithreaded relay (mtrelay.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Multithreaded relay. Actually using processes due to a lack 
 * of PHP threads.  
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

function step1() {
	$context = new ZMQContext(); 
	// Signal downstream to step 2
	$sender = new ZMQSocket($context, ZMQ::SOCKET_PAIR);
	$sender-&gt;connect("ipc://step2.ipc");
	$sender-&gt;send("");
}

function step2() {
	$pid = pcntl_fork();
	if($pid == 0) {
		step1();
		exit();
	}
	
	$context = new ZMQContext(); 
	//  Bind to ipc: endpoint, then start upstream thread
	$receiver = new ZMQSocket($context, ZMQ::SOCKET_PAIR);
	$receiver-&gt;bind("ipc://step2.ipc");
	
	// Wait for signal
	$receiver-&gt;recv();

	// Signal downstream to step 3
	$sender = new ZMQSocket($context, ZMQ::SOCKET_PAIR);
	$sender-&gt;connect("ipc://step3.ipc");
	$sender-&gt;send("");
}


// Start upstream thread then bind to ipc: endpoint
$pid = pcntl_fork();
if($pid == 0) {
	step2();
	exit();
}

$context = new ZMQContext();
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PAIR);
$receiver-&gt;bind("ipc://step3.ipc");


// Wait for signal
$receiver-&gt;recv();

echo "Test successful!", PHP_EOL;
</programlisting>

</example>
<figure id="figure-21">
    <title>The Relay Race</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig21.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>This is a classic pattern for multithreading with &Oslash;MQ:</para>

<orderedlist>
  <listitem><para>Two threads communicate over <literal>inproc</literal>, using a shared context.</para></listitem>
  <listitem><para>The parent thread creates one socket, binds it to an inproc:// endpoint, and <emphasis>then</emphasis> starts the child thread, passing the context to it.</para></listitem>
  <listitem><para>The child thread creates the second socket, connects it to that inproc:// endpoint, and <emphasis>then</emphasis> signals to the parent thread that it's ready.</para></listitem>
</orderedlist>
<para>Note that multithreading code using this pattern is not scalable out to processes. If you use <literal>inproc</literal> and socket pairs, you are building a tightly-bound application, i.e. one where your threads are structurally interdependent. Do this when low latency is really vital. The other design pattern is a loosely-bound application, where threads have their own context and communicate over <literal>ipc</literal> or <literal>tcp</literal>. You can easily break loosely-bound threads into separate processes.</para>

<para>This is the first time we've shown an example using PAIR sockets. Why use PAIR? Other socket combinations might seem to work but they all have side-effects that could interfere with signaling:</para>

<itemizedlist>
  <listitem><para>You can use PUSH for the sender and PULL for the receiver. This looks simple and will work, but remember that PUSH will load-balance messages to all available receivers. If you by accident start two receivers (e.g. you already have one running and you start a second), you'll "lose" half of your signals. PAIR has the advantage of refusing more than one connection; the pair is <emphasis>exclusive</emphasis>.</para></listitem>
  <listitem><para>You can use DEALER for the sender and ROUTER for the receiver. ROUTER however wraps your message in an "envelope", meaning your zero-size signal turns into a multi-part message. If you don't care about the data, and treat anything as a valid signal, and if you don't read more than once from the socket, that won't matter. If however you decide to send real data, you will suddenly find ROUTER providing you with "wrong" messages. DEALER also load-balances, giving the same risk as PUSH.</para></listitem>
  <listitem><para>You can use PUB for the sender and SUB for the receiver. This will correctly deliver your messages exactly as you sent them and PUB does not load-balance as PUSH or DEALER do. However you need to configure the subscriber with an empty subscription, which is annoying. Worse, the reliability of the PUB-SUB link is timing dependent and messages can get lost if the SUB socket is connecting while the PUB socket is sending its messages.</para></listitem>
</itemizedlist>
<para>For these reasons, PAIR makes the best choice for coordination between pairs of threads.</para>

</sect1>
<sect1>
<title>Node Coordination</title>
<para>When you want to coordinate nodes, PAIR sockets won't work well any more. This is one of the few areas where the strategies for threads and nodes are different. Principally nodes come and go whereas threads are static. PAIR sockets do not automatically reconnect if the remote node goes away and comes back.</para>

<para>The second significant difference between threads and nodes is that you typically have a fixed number of threads but a more variable number of nodes. Let's take one of our earlier scenarios (the weather server and clients) and use node coordination to ensure that subscribers don't lose data when starting up.</para>

<para>This is how the application will work:</para>

<itemizedlist>
  <listitem><para>The publisher knows in advance how many subscribers it expects. This is just a magic number it gets from somewhere.</para></listitem>
  <listitem><para>The publisher starts up and waits for all subscribers to connect. This is the node coordination part. Each subscriber subscribes and then tells the publisher it's ready via another socket.</para></listitem>
  <listitem><para>When the publisher has all subscribers connected, it starts to publish data.</para></listitem>
</itemizedlist>
<para>In this case we'll use a REQ-REP socket flow to synchronize subscribers and publisher(<xref linkend="figure-22"/>). Here is the publisher:</para>

<example id="syncpub-php">
<title>Synchronized publisher (syncpub.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Synchronized publisher
 *
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  We wait for 10 subscribers
define("SUBSCRIBERS_EXPECTED", 10);

$context = new ZMQContext();

//  Socket to talk to clients
$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$publisher-&gt;bind("tcp://*:5561");

//  Socket to receive signals
$syncservice = new ZMQSocket($context, ZMQ::SOCKET_REP);
$syncservice-&gt;bind("tcp://*:5562");

//  Get synchronization from subscribers
$subscribers = 0;
while ($subscribers &lt; SUBSCRIBERS_EXPECTED) {
	//  - wait for synchronization request
	$string = $syncservice-&gt;recv();
	//  - send synchronization reply
	$syncservice-&gt;send("");
	$subscribers++;
}

//  Now broadcast exactly 1M updates followed by END
for ($update_nbr = 0; $update_nbr &lt; 1000000; $update_nbr++) {
	$publisher-&gt;send("Rhubarb");
}

$publisher-&gt;send("END");

sleep (1);              //  Give 0MQ/2.0.x time to flush output
</programlisting>

</example>
<figure id="figure-22">
    <title>Pub-Sub Synchronization</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig22.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>And here is the subscriber:</para>

<example id="syncsub-php">
<title>Synchronized subscriber (syncsub.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Synchronized subscriber
 *
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();

//  First, connect our subscriber socket
$subscriber = $context-&gt;getSocket(ZMQ::SOCKET_SUB);
$subscriber-&gt;connect("tcp://localhost:5561");
$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "");

//  0MQ is so fast, we need to wait a while for the SUB connection to complete
sleep (1);

//  Second, synchronize with publisher
$syncclient = $context-&gt;getSocket(ZMQ::SOCKET_REQ);
$syncclient-&gt;connect("tcp://localhost:5562");

//  - send a synchronization request
$syncclient-&gt;send("");

//  - wait for synchronization reply
$string = $syncclient-&gt;recv();

//  Third, get our updates and report how many we got
$update_nbr = 0;
while (true) {
	$string = $subscriber-&gt;recv();
	if($string == "END") {
		break;
	}
	$update_nbr++;
}
printf ("Received %d updates %s", $update_nbr, PHP_EOL);
</programlisting>

</example>
<para>This Bash shell script will start ten subscribers and then the publisher:</para>

<screen>echo "Starting subscribers..."
for ((a=0; a&lt;10; a++)); do
    php syncsub.php &amp;
done
echo "Starting publisher..."
php syncpub.php
</screen>

<para>Which gives us this satisfying output:</para>

<screen>Starting subscribers...
Starting publisher...
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
Received 1000000 updates
</screen>

<para>We can't assume that the SUB connect will be finished by the time the REQ/REP dialog is complete. There are no guarantees that outbound connects will finish in any order whatsoever, if you're using any transport except <literal>inproc</literal>. So, the example does a brute-force sleep of one second between subscribing, and sending the REQ/REP synchronization.</para>

<para>A more robust model could be:</para>

<itemizedlist>
  <listitem><para>Publisher opens PUB socket and starts sending "Hello" messages (not data).</para></listitem>
  <listitem><para>Subscribers connect SUB socket and when they receive a Hello message they tell the publisher via a REQ/REP socket pair.</para></listitem>
  <listitem><para>When the publisher has had all the necessary confirmations, it starts to send real data.</para></listitem>
</itemizedlist>
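
<para>The publisher side of that more robust model might look roughly like the following. This is a sketch under assumptions, not a tested replacement for <literal>syncpub.php</literal>: the <literal>HELLO</literal> string and the 100 msec poll timeout are arbitrary choices, and the timeout unit is milliseconds only in version 1.0.0 and later of the PHP binding:</para>

<programlisting language="php">
&lt;?php
//  Sketch only: publisher side of the "Hello" handshake described above
define("SUBSCRIBERS_EXPECTED", 10);

$context = new ZMQContext();

$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$publisher-&gt;bind("tcp://*:5561");

$syncservice = new ZMQSocket($context, ZMQ::SOCKET_REP);
$syncservice-&gt;bind("tcp://*:5562");

$poll = new ZMQPoll();
$poll-&gt;add($syncservice, ZMQ::POLL_IN);
$readable = $writeable = array();

$subscribers = 0;
while ($subscribers &lt; SUBSCRIBERS_EXPECTED) {
	//  Keep announcing ourselves until everyone has answered
	$publisher-&gt;send("HELLO");

	//  Wait briefly for one confirmation (timeout value is an assumption)
	$events = $poll-&gt;poll($readable, $writeable, 100);
	if ($events &gt; 0) {
		$syncservice-&gt;recv();
		$syncservice-&gt;send("");
		$subscribers++;
	}
}

//  All subscribers have confirmed; real data can follow from here
</programlisting>

<para>Each subscriber would wait for its first <literal>HELLO</literal>, confirm once over its REQ socket, and then ignore any further <literal>HELLO</literal> messages that arrive before the real data.</para>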
</sect1>
<sect1>
<title>Zero Copy</title>
<para>&Oslash;MQ's message API lets you send and receive messages directly from and to application buffers without copying data. We call this "zero-copy", and it can improve performance in some applications. Like all optimizations, use this when you know it helps, and measure before and after. Zero-copy makes your code more complex.</para>

<para>To do zero-copy you use <literal>zmq_msg_init_data[3]</literal> to create a message that refers to a block of data already allocated on the heap with <literal>malloc()</literal>, and then you pass that to <literal>zmq_msg_send[3]</literal>. When you create the message you also pass a function that &Oslash;MQ will call to free the block of data, when it has finished sending the message. This is the simplest example, assuming 'buffer' is a block of 1000 bytes allocated on the heap:</para>

<programlisting language="c">
void my_free (void *data, void *hint) {
    free (data);
}
//  Send message from buffer, which we allocate and 0MQ will free for us
zmq_msg_t message;
zmq_msg_init_data (&amp;message, buffer, 1000, my_free, NULL);
zmq_msg_send (socket, &amp;message, 0);
</programlisting>

<para>There is no way to do zero-copy on receive: &Oslash;MQ delivers you a buffer that you can store as long as you wish but it will not write data directly into application buffers.</para>

<para>On writing, &Oslash;MQ's multi-part messages work nicely together with zero-copy. In traditional messaging you need to marshal different buffers together into one buffer that you can send. That means copying data. With &Oslash;MQ, you can send multiple buffers coming from different sources as individual message frames. Send each field as a length-delimited frame. To the application it looks like a series of send and recv calls. But internally the multiple parts get written to the network and read back with single system calls, so it's very efficient.</para>

</sect1>
<sect1>
<title>Pub-Sub Message Envelopes</title>
<para>In the pub-sub pattern we can split the key into a separate message frame that we call an "envelope". If you want to use pub-sub envelopes, make them yourself. It's optional, and in previous pub-sub examples we didn't do this. Using a pub-sub envelope is a little more work for simple cases but it's cleaner especially for real cases, where the key and the data are naturally separate things.</para>

<para>Here is what a publish-subscribe message with an envelope looks like:</para>

<figure id="figure-23">
    <title>Pub-Sub Envelope with Separate Key</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig23.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Recall that subscriptions do a prefix match. That is, "all messages starting with XYZ". The obvious question is: how to delimit keys from data so that the prefix match doesn't accidentally match data. The best answer is to use an envelope, since the match won't cross a frame boundary.</para>
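
<para>To see why the frame boundary matters, here is a small simulation of the prefix rule in plain PHP. It does not touch &Oslash;MQ at all: <literal>subscription_matches()</literal> is a hypothetical helper that applies a byte-wise prefix comparison against the first frame only, and the keys and data are made up for the illustration:</para>

<programlisting language="php">
&lt;?php
//  Simulation only: mimics a prefix match against the first frame
//  subscription_matches() is a hypothetical helper, not part of the ZMQ extension

function subscription_matches($subscription, array $frames) {
	//  An empty subscription matches everything
	return strncmp($frames[0], $subscription, strlen($subscription)) === 0;
}

//  Key "B" glued to its data in one frame: "Bl" matches by accident
var_dump(subscription_matches("Bl", array("B" . "log entry")));  //  bool(true)

//  Key "B" in its own envelope frame: the match cannot run into the data
var_dump(subscription_matches("Bl", array("B", "log entry")));   //  bool(false)

//  The intended subscription still works
var_dump(subscription_matches("B", array("B", "log entry")));    //  bool(true)
</programlisting>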

<para>Here is a minimalist example of how pub-sub envelopes look in code. This publisher sends messages of two types, A and B. The envelope holds the message type:</para>

<example id="psenvpub-php">
<title>Pub-Sub envelope publisher (psenvpub.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Pubsub envelope publisher
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  Prepare our context and publisher
$context = new ZMQContext();
$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$publisher-&gt;bind("tcp://*:5563");

while (true) {
	//  Write two messages, each with an envelope and content
	$publisher-&gt;send("A", ZMQ::MODE_SNDMORE);
	$publisher-&gt;send("We don't want to see this");
	$publisher-&gt;send("B", ZMQ::MODE_SNDMORE);
	$publisher-&gt;send("We would like to see this");
	sleep (1);
}

//  We never get here
</programlisting>

</example>
<para>The subscriber wants only messages of type B:</para>

<example id="psenvsub-php">
<title>Pub-Sub envelope subscriber (psenvsub.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Pubsub envelope subscriber
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

//  Prepare our context and subscriber
$context = new ZMQContext();
$subscriber = new ZMQSocket($context, ZMQ::SOCKET_SUB);
$subscriber-&gt;connect("tcp://localhost:5563");
$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "B");

while (true) {
	//  Read envelope with address
	$address = $subscriber-&gt;recv();
	//  Read message contents
	$contents = $subscriber-&gt;recv();
	printf ("[%s] %s%s", $address, $contents, PHP_EOL);
}
//  We never get here
</programlisting>

</example>
<para>When you run the two programs, the subscriber should show you this:</para>

<screen>[B] We would like to see this
[B] We would like to see this
[B] We would like to see this
[B] We would like to see this
...
</screen>

<para>This example shows that the subscription filter rejects or accepts the entire multi-part message (key plus data). You won't get part of a multi-part message, ever.</para>

<para>If you subscribe to multiple publishers and you want to know their address so that you can send them data via another socket (and this is a typical use-case), create a three-part message:</para>

<figure id="figure-24">
    <title>Pub-Sub Envelope with Sender Address</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig24.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>
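
<para>A publisher for such a three-part message could be sketched as below, using the same calls as the envelope publisher above. The frame order (key, sender address, data), the port, and the address string are assumptions made for illustration:</para>

<programlisting language="php">
&lt;?php
//  Sketch only: publisher sending key, sender address, and data as three frames
$context = new ZMQContext();
$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$publisher-&gt;bind("tcp://*:5563");

//  Hypothetical endpoint where this publisher accepts data in return
$my_address = "tcp://localhost:5564";

while (true) {
	$publisher-&gt;send("B", ZMQ::MODE_SNDMORE);          //  Frame 1: key
	$publisher-&gt;send($my_address, ZMQ::MODE_SNDMORE);  //  Frame 2: sender address
	$publisher-&gt;send("We would like to see this");     //  Frame 3: data
	sleep (1);
}
</programlisting>

<para>The subscriber then reads three frames per message, in the same order, and still filters on the key alone.</para>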

</sect1>
<sect1>
<title>High Water Marks</title>
<para>When you can send messages rapidly from process to process, you soon discover that memory is a precious resource, and one that can be trivially filled up. A few seconds of delay somewhere in a process can turn into a backlog that blows up a server, unless you understand the problem and take precautions.</para>

<para>The problem is this: if you have process A sending messages to process B, which suddenly gets very busy (garbage collection, CPU overload, whatever), then what happens to the messages that process A wants to send? Some will sit in B's network buffers. Some will sit on the Ethernet wire itself. Some will sit in A's network buffers. And the rest will accumulate in A's memory. If you don't take some precaution, A can easily run out of memory and crash. It is a consistent, classic problem with message brokers.</para>

<para>What are the answers? One is to pass the problem upstream. A is getting the messages from somewhere else. So tell that process, "stop!" And so on. This is called "flow control". It sounds great, but what if you're sending out a Twitter feed? Do you tell the whole world to stop tweeting while B gets its act together?</para>

<para>Flow control works in some cases but not in others. The transport layer can't tell the application layer to "stop" any more than a subway system can tell a large business, "please keep your staff at work another half an hour, I'm too busy".</para>

<para>The answer for messaging is to set limits on the size of buffers, and then when we reach those limits, take some sensible action. In most cases (not for a subway system, though), the answer is to throw away messages. In a few others, it's to wait.</para>

<para>&Oslash;MQ uses the concept of "high water mark" or HWM to define the capacity of its internal pipes. Each connection out of a socket or into a socket has its own pipe, and HWM capacity.</para>

<para>In &Oslash;MQ/2.x the HWM was infinite by default. In &Oslash;MQ/3.x it's set to 1,000 by default, which is more sensible. If you're still using &Oslash;MQ/2.x you should always set a HWM on your sockets, be it 1,000 to match &Oslash;MQ/3.x or another figure that takes into account your message sizes.</para>

<para>The high water mark affects both the transmit and receive buffers of a single socket. Some sockets (PUB, PUSH) only have transmit buffers. Some (SUB, PULL, REQ, REP) only have receive buffers. Some (DEALER, ROUTER, PAIR) have both transmit and receive buffers.</para>

<para>When your socket reaches its high-water mark, it will either block or drop data depending on the socket type. PUB and ROUTER sockets will drop data if they reach their high-water mark, while other socket types will block.</para>

<para>Over the <literal>inproc</literal> transport, the sender and receiver share the same buffers, so the real HWM is the sum of the HWM set by both sides. This means in effect that if one side does not set a HWM, there is no limit to the buffer size.</para>
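
<para>A minimal sketch of setting a high-water mark from PHP follows. It assumes the PHP binding exposes <literal>ZMQ::SOCKOPT_SNDHWM</literal> and <literal>ZMQ::SOCKOPT_RCVHWM</literal> when built against &Oslash;MQ/3.x and <literal>ZMQ::SOCKOPT_HWM</literal> when built against &Oslash;MQ/2.x; the endpoint and the figure of 1,000 are placeholders.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: cap a socket's queues at 1,000 messages
$context = new ZMQContext();
$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);

if (defined('ZMQ::SOCKOPT_SNDHWM')) {
    //  0MQ/3.x: separate limits for sending and receiving
    $publisher-&gt;setSockOpt(ZMQ::SOCKOPT_SNDHWM, 1000);
    $publisher-&gt;setSockOpt(ZMQ::SOCKOPT_RCVHWM, 1000);
} else {
    //  0MQ/2.x: one limit, infinite unless we set it
    $publisher-&gt;setSockOpt(ZMQ::SOCKOPT_HWM, 1000);
}

//  Set the option before binding or connecting so that it applies
//  to every pipe the socket creates
$publisher-&gt;bind("tcp://*:5556");
</programlisting>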

</sect1>
<sect1>
<title>Missing Message Problem Solver</title>
<para>As you build applications with &Oslash;MQ you will come across this problem more than once: losing messages that you expect to receive. We have put together a diagram(<xref linkend="figure-25"/>) that walks through the most common causes for this.</para>

<figure id="figure-25">
    <title>Missing Message Problem Solver</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig25.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>If you're using &Oslash;MQ in a context where failures are expensive, then you want to plan properly. First, build prototypes that let you learn and test the different aspects of your design. Stress them until they break, so that you know exactly how strong your designs are. Second, invest in testing. This means building test frameworks, ensuring you have access to realistic setups with sufficient computer power, and getting time or help to actually test seriously. Ideally, one team writes the code, a second team tries to break it. Lastly, do get your organization to <ulink url="http://www.imatix.com/contact">contact iMatix</ulink> to discuss how we can help to make sure things work properly, and can be fixed rapidly if they break.</para>

<para>In short: if you have not proven an architecture works in realistic conditions, it will most likely break at the worst possible moment.</para>

</sect1>
</chapter>
<chapter id="advanced-request-reply">
<title>Advanced Request-Reply Patterns</title>
<para>In Sockets and Patterns<xref linkend="sockets-and-patterns"/> we worked through the basics of using &Oslash;MQ by developing a series of small applications, each time exploring new aspects of &Oslash;MQ. We'll continue this approach in this chapter, as we explore advanced patterns built on top of &Oslash;MQ's core request-reply pattern.</para>

<para>We'll cover:</para>

<itemizedlist>
  <listitem><para>How the request-reply mechanisms work.</para></listitem>
  <listitem><para>How to combine REQ, REP, DEALER, and ROUTER sockets.</para></listitem>
  <listitem><para>How ROUTER sockets work, in detail.</para></listitem>
  <listitem><para>The load-balancing pattern.</para></listitem>
  <listitem><para>Building a simple load-balancing message broker.</para></listitem>
  <listitem><para>Designing a high-level API for &Oslash;MQ.</para></listitem>
  <listitem><para>Building an asynchronous request-reply server.</para></listitem>
  <listitem><para>A detailed inter-broker routing example.</para></listitem>
</itemizedlist>
<sect1>
<title>The Request-Reply Mechanisms</title>
<para>We already looked briefly at multi-part messages. Let's now look at a major use case, which is <emphasis>reply message envelopes</emphasis>. An envelope is a way of safely packaging up data with an address, without touching the data itself. By separating reply addresses into an envelope we make it possible to write general-purpose intermediaries such as APIs and proxies that create, read, and remove addresses no matter what the message payload or structure.</para>

<para>In the request-reply pattern, the envelope holds the return address for replies. It is how a &Oslash;MQ network with no state can create round-trip request-reply dialogs.</para>

<para>When you use REQ and REP sockets you don't even see envelopes; these sockets deal with them automatically. But for most of the interesting request-reply patterns, you'll want to understand envelopes and particularly ROUTER sockets. We'll work through this step-by-step.</para>

<sect2>
<title>The Simple Reply Envelope</title>
<para>A request-reply exchange consists of a <emphasis>request</emphasis> message, and an eventual <emphasis>reply</emphasis> message. In the simple request-reply pattern there's one reply for each request. In more advanced patterns, requests and replies can flow asynchronously. However, the reply envelope always works the same way.</para>

<para>The &Oslash;MQ reply envelope formally consists of zero or more reply addresses, followed by an empty frame (the envelope delimiter), followed by the message body (zero or more frames). The envelope is created by multiple sockets working together in a chain. We'll break this down.</para>

<para>We'll start by sending "Hello" through a REQ socket. The REQ socket creates the simplest possible reply envelope, which has no addresses, just an empty delimiter frame and the message frame containing the "Hello" string. This is a two-frame message(<xref linkend="figure-26"/>).</para>

<figure id="figure-26">
    <title>Request with Minimal Envelope</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig26.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The REP socket does the matching work: it strips off the envelope, up to and including the delimiter frame, saves the whole envelope, and passes the "Hello" string up to the application. Thus our original Hello World example used request-reply envelopes internally, but the application never saw them.</para>

<para>If you spy on the network data flowing between hwclient and hwserver, this is what you'll see: every request and every reply is in fact two frames, an empty frame and then the body. It doesn't seem to make much sense for a simple REQ-REP dialog. However, you'll see the reason when we explore how ROUTER and DEALER sockets handle envelopes.</para>

</sect2>
<sect2>
<title>The Extended Reply Envelope</title>
<para>Now let's extend the REQ-REP pair with a ROUTER-DEALER proxy in the middle and see how this affects the reply envelope. This is the <emphasis>extended request-reply pattern</emphasis> we already saw in Sockets and Patterns<xref linkend="sockets-and-patterns"/>. We can in fact insert any number of proxy steps(<xref linkend="figure-27"/>). The mechanics are the same.</para>

<figure id="figure-27">
    <title>Extended Request-Reply Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig27.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The proxy does this, in pseudo-code:</para>

<screen>prepare context, frontend and backend sockets
while true:
    poll on both sockets
    if frontend had input:
        read all frames from frontend
        send to backend
    if backend had input:
        read all frames from backend
        send to frontend
</screen>
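
<para>In PHP, that loop might look roughly like the sketch below. The endpoints are placeholders, and <literal>recvMulti()</literal> and <literal>sendmulti()</literal> are used here to move all frames of a message in one step.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: ROUTER-DEALER proxy loop
$context  = new ZMQContext();
$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$backend  = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
$frontend-&gt;bind("tcp://*:5559");
$backend-&gt;bind("tcp://*:5560");

//  Poll on both sockets for incoming messages
$poll = new ZMQPoll();
$poll-&gt;add($frontend, ZMQ::POLL_IN);
$poll-&gt;add($backend, ZMQ::POLL_IN);

while (true) {
    $readable = $writable = array();
    $poll-&gt;poll($readable, $writable);   //  Blocks until there is input
    foreach ($readable as $socket) {
        //  Read all frames, then forward them unchanged to the other side
        $frames = $socket-&gt;recvMulti();
        if ($socket === $frontend) {
            $backend-&gt;sendmulti($frames);
        } else {
            $frontend-&gt;sendmulti($frames);
        }
    }
}
</programlisting>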

<para>The ROUTER socket, unlike other sockets, tracks every connection it has, and tells the caller about these. The way it tells the caller is to stick the connection <emphasis>identity</emphasis> in front of each message received. An identity, sometimes called an <emphasis>address</emphasis>, is just a binary string with no meaning except "this is a unique handle to the connection". Then, when you send a message via a ROUTER socket, you first send an identity frame.</para>

<para>The <literal>zmq_socket()</literal> man page describes it thus: <emphasis>when receiving messages a ZMQ_ROUTER socket shall prepend a message part containing the identity of the originating peer to the message before passing it to the application. Messages received are fair-queued from among all connected peers. When sending messages a ZMQ_ROUTER socket shall remove the first part of the message and use it to determine the identity of the peer the message shall be routed to.</emphasis></para>

<para>As a historical note, &Oslash;MQ/2.2 and earlier use UUIDs as identities, and &Oslash;MQ/3.0 and later use short integers. There's some impact on network performance but only when you use multiple proxy hops, which is rare.</para>

<para>It's a difficult concept to understand but essential if you want to become a &Oslash;MQ expert. The ROUTER socket <emphasis>invents</emphasis> a random identity for each connection it works with. If there are three REQ sockets connected to a ROUTER socket, it will invent three random identities, one for each REQ socket.</para>

<para>So if we continue our worked example, let's say the REQ socket has identity <literal>02</literal>. Internally, this means the ROUTER socket keeps a hash table where it can search for <literal>02</literal> and find the TCP connection for the REQ socket.</para>

<para>When we receive the message off the ROUTER socket, we get three frames(<xref linkend="figure-28"/>).</para>

<figure id="figure-28">
    <title>Request with One Address</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig28.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The core of the proxy loop is 'read from one socket, write to the other', so we literally send these three frames out on the DEALER socket. If you now sniffed the network traffic, you would see these three frames flying from the DEALER socket to the REP socket. The REP socket does as before, strips off the whole envelope including the new reply address, and once again delivers the "Hello" to the caller.</para>

<para>Incidentally the REP socket can only deal with one request-reply exchange at a time, which is why if you try to read multiple requests or send multiple replies without sticking to a strict recv-send cycle, it gives an error.</para>

<para>You should now be able to visualize the return path. When the hwserver sends "World" back, the REP socket wraps that with the envelope it saved, and sends a three-frame reply message across the wire to the DEALER socket(<xref linkend="figure-29"/>).</para>

<figure id="figure-29">
    <title>Reply with One Address</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig29.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Now the DEALER reads these three frames, and sends all three out via the ROUTER socket. The ROUTER takes the first frame for the message, which is the <literal>02</literal> identity, and looks up the connection for this. If it finds that, it then pumps the next two frames out onto the wire(<xref linkend="figure-30"/>).</para>

<figure id="figure-30">
    <title>Reply with Minimal Envelope</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig30.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The REQ socket picks this message up, and checks that the first frame is the empty delimiter, which it is. The REQ socket discards that frame and passes "World" to the calling application, which prints it out to the amazement of the younger us looking at &Oslash;MQ for the first time.</para>
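
<para>To make the round trip concrete, here is a small model in plain PHP that mimics what each socket does to the frames. It uses no &Oslash;MQ sockets at all; a message is simply an array of frames, the function names are invented for this sketch, and <literal>02</literal> stands in for the identity the ROUTER would generate.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: a pure-PHP model of the reply envelope at each hop

//  REQ, sending: prepend the empty delimiter frame
function req_send($body) {
    return array("", $body);
}

//  ROUTER, receiving: prepend the identity of the connection
function router_recv($identity, array $frames) {
    array_unshift($frames, $identity);
    return $frames;
}

//  REP, receiving: split off the envelope, up to and including the delimiter
function rep_recv(array $frames) {
    $pos = array_search("", $frames, true);
    return array(
        array_slice($frames, 0, $pos + 1),   //  Saved envelope
        array_slice($frames, $pos + 1),      //  Body for the application
    );
}

//  REP, sending: wrap the reply in the envelope it saved
function rep_send(array $envelope, $reply) {
    return array_merge($envelope, array($reply));
}

//  ROUTER, sending: pop the identity frame to choose the connection
function router_send(array $frames) {
    $identity = array_shift($frames);
    return array($identity, $frames);
}

$request = router_recv("02", req_send("Hello"));   //  "02", "", "Hello"
list($envelope, $body) = rep_recv($request);       //  Envelope: "02", ""
$reply = rep_send($envelope, "World");             //  "02", "", "World"
list($peer, $onwire) = router_send($reply);        //  Peer "02" gets "", "World"
echo json_encode($onwire), PHP_EOL;
</programlisting>

<para>The script prints <literal>["","World"]</literal>, the two frames that travel back to the REQ socket, which then drops the delimiter and hands "World" to the application.</para>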

</sect2>
<sect2>
<title>What's This Good For?</title>
<para>To be honest the use cases for strict request-reply or extended request-reply are somewhat limited. For one thing, there's no easy way to recover from common failures like the server crashing due to buggy application code. We'll see more about this in Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/>. However once you grasp the way these four sockets deal with envelopes, and how they talk to each other, you can do very useful things. We saw how ROUTER uses the reply envelope to decide which client REQ socket to route a reply back to. Now let's express this another way:</para>

<itemizedlist>
  <listitem><para>Each time ROUTER gives you a message it tells you what peer that came from, as an identity.</para></listitem>
  <listitem><para>You can use this with a hash table (with the identity as key) to track new peers as they arrive.</para></listitem>
  <listitem><para>ROUTER will route messages asynchronously to any peer connected to it, if you prefix the identity as the first frame of the message.</para></listitem>
</itemizedlist>
<para>ROUTER sockets don't care about the whole envelope. They don't know anything about the empty delimiter. All they care about is that one identity frame that lets them figure out which connection to send a message to.</para>

</sect2>
<sect2>
<title>Recap of Request-Reply Sockets</title>
<para>Let's recap this:</para>

<itemizedlist>
  <listitem><para>The REQ socket sends, to the network, an empty delimiter frame in front of the message data. REQ sockets are synchronous. REQ sockets always send one request and then wait for one reply. REQ sockets talk to one peer at a time. If you connect a REQ socket to multiple peers, requests are distributed to and replies expected from each peer one turn at a time.</para></listitem>
  <listitem><para>The REP socket reads and saves all identity frames up to and including the empty delimiter, then passes the following frame or frames to the caller. REP sockets are synchronous and talk to one peer at a time. If you connect a REP socket to multiple peers, requests are read from peers in fair fashion, and replies are always sent to the same peer that made the last request.</para></listitem>
  <listitem><para>The DEALER socket is oblivious to the reply envelope and handles this like any multi-part message. DEALER sockets are asynchronous and like PUSH and PULL combined. They distribute sent messages among all connections, and fair-queue received messages from all connections.</para></listitem>
  <listitem><para>The ROUTER socket is oblivious to the reply envelope, like DEALER. It creates identities for its connections, and passes these identities to the caller as a first frame in any received message. Conversely, when the caller sends a message, it uses the first message frame as an identity to look up the connection to send to. ROUTER sockets are asynchronous.</para></listitem>
</itemizedlist>
</sect2>
</sect1>
<sect1>
<title>Request-Reply Combinations</title>
<para>We have four request-reply sockets, each with a certain behavior. We've seen how they connect in simple and extended request-reply patterns. But these sockets are building blocks that you can use to solve many problems.</para>

<para>These are the legal combinations:</para>

<itemizedlist>
  <listitem><para>REQ to REP</para></listitem>
  <listitem><para>DEALER to REP</para></listitem>
  <listitem><para>REQ to ROUTER</para></listitem>
  <listitem><para>DEALER to ROUTER</para></listitem>
  <listitem><para>DEALER to DEALER</para></listitem>
  <listitem><para>ROUTER to ROUTER</para></listitem>
</itemizedlist>
<para>And these combinations are invalid (and I'll explain why):</para>

<itemizedlist>
  <listitem><para>REQ to REQ</para></listitem>
  <listitem><para>REQ to DEALER</para></listitem>
  <listitem><para>REP to REP</para></listitem>
  <listitem><para>REP to ROUTER</para></listitem>
</itemizedlist>
<para>Here are some tips for remembering the semantics. DEALER is like an asynchronous REQ socket, and ROUTER is like an asynchronous REP socket. Where we use a REQ socket we can use a DEALER; we just have to read and write the envelope ourselves. Where we use a REP socket we can stick a ROUTER; we just need to manage the identities ourselves.</para>

<para>Think of REQ and DEALER sockets as "clients" and REP and ROUTER sockets as "servers". Mostly you'll want to bind REP and ROUTER sockets, and connect REQ and DEALER sockets to them. It's not always going to be this simple, but it is a clean and memorable place to start.</para>

<sect2>
<title>The REQ to REP Combination</title>
<para>We've already covered a REQ client talking to a REP server, but let's look at one aspect: the REQ client <emphasis>must</emphasis> initiate the message flow. A REP server cannot talk to a REQ client that hasn't first sent it a request. Technically it's not even possible, and the API also returns an <literal>EFSM</literal> error if you try it.</para>

</sect2>
<sect2>
<title>The DEALER to REP Combination</title>
<para>Now, let's replace the REQ client with a DEALER. This gives us an asynchronous client that can talk to multiple REP servers. If we rewrote the hello world client using DEALER, we'd be able to send off any number of "Hello" requests without waiting for replies.</para>

<para>When we use a DEALER to talk to a REP socket, we <emphasis>must</emphasis> accurately emulate the envelope that the REQ socket would have sent, otherwise the REP socket will discard the message as invalid. So, to send a message, we:</para>

<itemizedlist>
  <listitem><para>send an empty message frame with the MORE flag set; then</para></listitem>
  <listitem><para>send the message body.</para></listitem>
</itemizedlist>
<para>And when we receive a message, we:</para>

<itemizedlist>
  <listitem><para>receive the first frame and, if it's not empty, discard the whole message;</para></listitem>
  <listitem><para>receive the next frame and pass that to the application.</para></listitem>
</itemizedlist>
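
<para>The send and receive steps above can be sketched as follows. To keep the fragment self-contained, both sockets live in one process and talk over a made-up <literal>inproc://dealer-rep</literal> endpoint; in a real application the REP server would be a separate program.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: a DEALER client emulating the REQ envelope
$context = new ZMQContext();

$server = new ZMQSocket($context, ZMQ::SOCKET_REP);
$server-&gt;bind("inproc://dealer-rep");

$client = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
$client-&gt;connect("inproc://dealer-rep");

//  Send: empty delimiter frame with the MORE flag set, then the body
$client-&gt;send("", ZMQ::MODE_SNDMORE);
$client-&gt;send("Hello");

//  The REP side sees only the body, just as it would with a REQ client
$request = $server-&gt;recv();
$server-&gt;send("World");

//  Receive: check the delimiter, then pass the body to the application
$frames = $client-&gt;recvMulti();
if (count($frames) == 2 &amp;&amp; $frames[0] === "") {
    printf ("Reply: %s%s", $frames[1], PHP_EOL);
} else {
    //  Not a valid reply envelope, so we discard the whole message
}
</programlisting>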
</sect2>
<sect2>
<title>The REQ to ROUTER Combination</title>
<para>In the same way as we can replace REQ with DEALER, we can replace REP with ROUTER. This gives us an asynchronous server that can talk to multiple REQ clients at the same time. If we rewrote the hello world server using ROUTER, we'd be able to process any number of "Hello" requests in parallel. We saw this in the Sockets and Patterns<xref linkend="sockets-and-patterns"/> mtserver example.</para>

<para>We can use ROUTER in two distinct ways:</para>

<itemizedlist>
  <listitem><para>As a proxy that switches messages between frontend and backend sockets.</para></listitem>
  <listitem><para>As an application that reads the message and acts on it.</para></listitem>
</itemizedlist>
<para>In the first case the ROUTER simply reads all frames including the artificial identity frame, and passes them on blindly. In the second case the ROUTER <emphasis>must</emphasis> know the format of the reply envelope it's being sent. As the other peer is a REQ socket, the ROUTER gets the identity frame, an empty frame, then the data frame.</para>
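
<para>For the second case, a sketch of a ROUTER acting as the application might look like this. Both sockets share one process and a hypothetical <literal>inproc://req-router</literal> endpoint purely so that the fragment runs on its own.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: a ROUTER server answering a REQ client directly
$context = new ZMQContext();

$server = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$server-&gt;bind("inproc://req-router");

$client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
$client-&gt;connect("inproc://req-router");
$client-&gt;send("Hello");

//  From a REQ peer the ROUTER receives: identity, empty delimiter, data
list($identity, $empty, $request) = $server-&gt;recvMulti();

//  Reply with the same envelope so that the REQ socket accepts it
$server-&gt;sendmulti(array($identity, "", "World"));

printf ("Client got: %s%s", $client-&gt;recv(), PHP_EOL);
</programlisting>

<para>Because the server holds the identity, it could just as well store it and send the reply later, after serving other clients.</para>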

</sect2>
<sect2>
<title>The DEALER to ROUTER Combination</title>
<para>Now, we can switch out both REQ and REP with DEALER and ROUTER to get the most powerful socket combination, which is DEALER talking to ROUTER. It gives us asynchronous clients talking to asynchronous servers, where both sides have full control over the message formats.</para>

<para>Since both DEALER and ROUTER can work with arbitrary message formats, if you hope to use these safely you have to become a little bit of a protocol designer. At the very least you must decide whether you wish to emulate the REQ/REP reply envelope. It depends on whether you actually need to send replies, or not.</para>

</sect2>
<sect2>
<title>The DEALER to DEALER Combination</title>
<para>You can swap a REP with a ROUTER, but you can also swap a REP with a DEALER, if the DEALER is talking to one and only one peer.</para>

<para>When you replace a REP with a DEALER, your worker can suddenly go full asynchronous, sending any number of replies back. The cost is that you have to manage the reply envelopes yourself, and get them right, or nothing at all will work. We'll see a worked example later. Let's just say for now that DEALER to DEALER is one of the trickier patterns to get right, and happily it's rare that we need it.</para>

</sect2>
<sect2>
<title>The ROUTER to ROUTER Combination</title>
<para>This sounds perfect for N-to-N connections but it's the most difficult combination to use. You should avoid it until you are well-advanced with &Oslash;MQ. We'll see one example of it in the Freelance pattern in Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/>, and an alternative DEALER to ROUTER design for peer-to-peer work in A Universe of Moving Pieces<xref linkend="moving-pieces"/>.</para>

</sect2>
<sect2>
<title>Invalid Combinations</title>
<para>Mostly, trying to connect clients to clients, or servers to servers, is a bad idea and won't work. However rather than give general vague warnings, I'll explain in detail:</para>

<itemizedlist>
  <listitem><para>REQ to REQ: both sides want to start by sending messages to each other, and this could only work if you timed things so that both peers exchanged messages at the same time. It hurts my brain to even think about it.</para></listitem>
  <listitem><para>REQ to DEALER: you could in theory do this, but it would break if you added a second REQ, since DEALER has no way of sending a reply to the original peer. Thus the REQ socket would get confused, and/or return messages meant for another client.</para></listitem>
  <listitem><para>REP to REP: both sides would wait for the other to send the first message.</para></listitem>
  <listitem><para>REP to ROUTER: the ROUTER socket can in theory initiate the dialog and send a properly-formatted request, if it knows the REP socket has connected <emphasis>and</emphasis> it knows the identity of that connection. It's messy and adds nothing over DEALER to ROUTER.</para></listitem>
</itemizedlist>
<para>The common thread in this valid vs. invalid breakdown is that a &Oslash;MQ socket connection is always biased towards one peer that binds to an endpoint, and another that connects to that. Further, which side binds and which side connects is not arbitrary, but follows natural patterns. The side which we expect to "be there" binds: it'll be a server, a broker, a publisher, a collector. The side that "comes and goes" connects: it'll be clients and workers. Remembering this will help you design better &Oslash;MQ architectures.</para>

</sect2>
</sect1>
<sect1>
<title>Exploring ROUTER Sockets</title>
<sect2>
<title>Identities and Addresses</title>
<para>The <emphasis>identity</emphasis> concept in &Oslash;MQ refers specifically to ROUTER sockets and how they identify the connections they have to other sockets. More broadly, identities are used as addresses in the reply envelope. In most cases the identity is arbitrary and local to the ROUTER socket: it's a lookup key in a hash table. Independently, a peer can have an address that is physical (a network endpoint like "tcp://192.168.55.117:5670") or logical (a UUID or email address or other unique key).</para>

<para>An application that uses a ROUTER socket to talk to specific peers can convert a logical address to an identity if it has built the necessary hash table. Since ROUTER sockets only announce the identity of a connection (to a specific peer) when that peer sends a message, you can only really reply to a message, not spontaneously talk to a peer.</para>

<para>This is true even if you flip the rules and make the ROUTER connect to the peer rather than wait for the peer to connect to the ROUTER.</para>

<para>However you can force the ROUTER socket to use a logical address in place of its identity. The <literal>zmq_setsockopt</literal> reference page calls this "setting the socket identity". It works as follows:</para>

<itemizedlist>
  <listitem><para>The peer application sets the <literal>ZMQ_IDENTITY</literal> option on its peer socket (DEALER or REQ), <emphasis>before</emphasis> binding or connecting.</para></listitem>
  <listitem><para>Usually the peer then connects to the already-bound ROUTER socket. But the ROUTER can also connect to the peer.</para></listitem>
  <listitem><para>At connection time, the peer socket tells the ROUTER socket, "please use this identity for this connection".</para></listitem>
  <listitem><para>If the peer socket doesn't say that, the ROUTER generates its usual arbitrary random identity for the connection.</para></listitem>
  <listitem><para>The ROUTER socket now provides this logical address to the application as a prefix identity frame for any messages coming in from that peer.</para></listitem>
  <listitem><para>The ROUTER also expects the logical address as the prefix identity frame for any outgoing messages.</para></listitem>
</itemizedlist>
<para>Here is a simple example of two peers that connect to a ROUTER socket, one of which imposes a logical address "PEER2":</para>

<example id="identity-php">
<title>Identity check (identity.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Demonstrate identities as used by the request-reply pattern.  Run this
 * program by itself.  Note that the utility functions s_ are provided by
 * zhelpers.php.  It gets boring for everyone to keep repeating this code.
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zhelpers.php";

$context = new ZMQContext();

$sink = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$sink-&gt;bind("inproc://example");

//  First allow 0MQ to set the identity
$anonymous = new ZMQSocket($context, ZMQ::SOCKET_REQ);
$anonymous-&gt;connect("inproc://example");
$anonymous-&gt;send("ROUTER uses a generated UUID");
s_dump ($sink);

//  Then set the identity ourselves
$identified = new ZMQSocket($context, ZMQ::SOCKET_REQ);
$identified-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, "PEER2");
$identified-&gt;connect("inproc://example");
$identified-&gt;send("ROUTER socket uses REQ's socket identity");
s_dump ($sink);
</programlisting>

</example>
<para>Here is what the program prints:</para>

<screen>----------------------------------------
[005] 006B8B4567
[000]
[028] ROUTER uses a generated UUID
----------------------------------------
[005] PEER2
[000]
[040] ROUTER socket uses REQ's socket identity
</screen>

</sect2>
<sect2>
<title>ROUTER Error Handling</title>
<para>ROUTER sockets do have a somewhat brutal way of dealing with messages they can't send anywhere: they drop them silently. It's an attitude that makes sense in working code, but makes debugging hard. The "send identity as first frame" is tricky enough that we get this wrong when we're learning, and the ROUTER's stony silence when we mess up isn't very constructive.</para>

<para>Since &Oslash;MQ/3.2 there's a socket option you can set to catch this error: <literal>ZMQ_ROUTER_MANDATORY</literal>. Set that on the ROUTER socket and then, when you provide an unroutable identity on a send call, the socket will signal an <literal>EHOSTUNREACH</literal> error.</para>
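
<para>A hedged sketch of how this could look from PHP follows. It assumes the binding exposes the option as <literal>ZMQ::SOCKOPT_ROUTER_MANDATORY</literal> and reports the failed send as a <literal>ZMQSocketException</literal>; check both against the version you have installed. The endpoint and the identity <literal>NOBODY</literal> are made up.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: make a ROUTER socket report unroutable messages
$context = new ZMQContext();
$router = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);

//  Assumption: the constant exists only in builds against 0MQ/3.2 or later
if (defined('ZMQ::SOCKOPT_ROUTER_MANDATORY')) {
    $router-&gt;setSockOpt(ZMQ::SOCKOPT_ROUTER_MANDATORY, 1);
}
$router-&gt;bind("inproc://mandatory");

try {
    //  No peer has this identity, so the message cannot be routed
    $router-&gt;sendmulti(array("NOBODY", "", "Hello?"));
    echo "No error reported; the message was dropped silently", PHP_EOL;
} catch (ZMQSocketException $e) {
    printf ("Send failed: %s%s", $e-&gt;getMessage(), PHP_EOL);
}
</programlisting>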

</sect2>
</sect1>
<sect1>
<title>The Load-balancing Pattern</title>
<para>Let's now look at some code. We'll see how to connect a ROUTER socket to a REQ socket, and then to a DEALER socket. These two examples follow the same logic, which is a <emphasis>load-balancing</emphasis> pattern. This pattern is our first exposure to using the ROUTER socket for deliberate routing, rather than simply acting as a reply channel.</para>

<para>The load-balancing pattern is very common and we'll see it several times in the Guide. It solves the main problem with simple round-robin routing (as PUSH and DEALER offer) which is that round-robin becomes inefficient if tasks do not all roughly take the same time.</para>

<para>It's the post office analogy. If you have one queue per counter, and you have some people buying stamps (a fast, simple transaction), and some people opening new accounts (a very slow transaction), then you will find stamp-buyers getting unfairly stuck in queues. Just as in a post office, if your messaging architecture is unfair, people will get annoyed.</para>

<para>The solution in the post office is to create a single queue so that even if one or two counters get 'stuck' with slow work, other counters will continue to serve clients on a first-come, first-served basis.</para>

<para>One reason PUSH and DEALER use the simplistic approach is sheer performance. If you arrive in any major US airport, you'll find long queues of people waiting at immigration. The border-patrol officials will send people in advance to queue up at each counter, rather than using a single queue. Having people walk fifty yards in advance saves a minute or two per passenger. And since every passport check takes roughly the same time, it's more or less fair. And this is the strategy for PUSH and DEALER: send work loads ahead of time so that there is less walking distance.</para>

<para>This is a recurring theme with &Oslash;MQ: the world's problems are diverse and you can really benefit from solving different problems each in the right way. The airport isn't the post-office and one size fits no-one, really well.</para>

<para>Back to a worker (DEALER or REQ) connected to a broker (ROUTER). The broker has to know when the worker is ready, and keep a list of workers so that it can take the <emphasis>least recently used</emphasis> worker each time.</para>

<para>The solution is really simple in fact: workers send a "Ready" message when they start, and after they finish each task. The broker reads these messages one by one. Each time it reads a message, that is from the last used worker. And since we're using a ROUTER socket, we get an identity that we can then use to send a task back to the worker.</para>

<para>It's a twist on request-reply because the task is sent with the reply, and any response for the task is sent as a new request. The following code examples should make it clearer.</para>

<sect2>
<title>ROUTER Broker and REQ Workers</title>
<para>Here is an example of the load-balancing pattern using a ROUTER broker talking to a set of REQ workers:</para>

<example id="rtreq-php">
<title>ROUTER-to-REQ (rtreq.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Custom routing: ROUTER broker to REQ workers
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

define("NBR_WORKERS", 10);

function worker_thread() {
	$context = new ZMQContext();
	$worker = new ZMQSocket($context, ZMQ::SOCKET_REQ);
	$worker-&gt;connect("ipc://routing.ipc");
	
	$total = 0;
	while(true) {
		//  Tell the router we're ready for work
		$worker-&gt;send("ready");
		
		//  Get workload from router, until finished
		$workload = $worker-&gt;recv();
		if($workload == 'END') {
			printf ("Completed: %d tasks%s", $total, PHP_EOL);
			break;
		}
		$total++;
		
		//  Do some random work
		usleep(mt_rand(1, 1000000));
	}
}

for ($worker_nbr = 0; $worker_nbr &lt; NBR_WORKERS; $worker_nbr++) {
	if(pcntl_fork() == 0) {
		worker_thread(); 
		exit();
	}
}

$context = new ZMQContext();
$client = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$client-&gt;bind("ipc://routing.ipc");

for ($task_nbr = 0; $task_nbr &lt; NBR_WORKERS * 10; $task_nbr++) {
	//  LRU worker is next waiting in queue
	$address = $client-&gt;recv();
	$empty = $client-&gt;recv();
	$read = $client-&gt;recv();
	
	$client-&gt;send($address, ZMQ::MODE_SNDMORE);
	$client-&gt;send("", ZMQ::MODE_SNDMORE);
	$client-&gt;send("This is the workload");
}

//  Now ask mamas to shut down and report their results
for ($task_nbr = 0; $task_nbr &lt; NBR_WORKERS; $task_nbr++) {
	//  LRU worker is next waiting in queue
	$address = $client-&gt;recv();
	$empty = $client-&gt;recv();
	$read = $client-&gt;recv();
	
	$client-&gt;send($address, ZMQ::MODE_SNDMORE);
	$client-&gt;send("", ZMQ::MODE_SNDMORE);
	$client-&gt;send("END");
}

sleep (1);              //  Give 0MQ/2.0.x time to flush output
</programlisting>

</example>
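<para>The C original of this example runs for five seconds; the PHP version above instead hands out a fixed 100 tasks (<literal>NBR_WORKERS * 10</literal>) and then tells the workers to stop, at which point each worker prints how many tasks it handled. If the routing worked, we'd expect a fair distribution of work. The sample output below is from the timed version, so its label and totals differ from what the PHP listing prints:</para>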

<screen>Completed: 20 tasks
Completed: 18 tasks
Completed: 21 tasks
Completed: 23 tasks
Completed: 19 tasks
Completed: 21 tasks
Completed: 17 tasks
Completed: 17 tasks
Completed: 25 tasks
Completed: 19 tasks
</screen>

<para>To talk to the workers in this example, we have to create a REQ-friendly envelope consisting of an identity plus an empty envelope delimiter frame (<xref linkend="figure-31"/>).</para>

<figure id="figure-31">
    <title>Routing Envelope for REQ</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig31.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>
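
<para>As a minimal sketch, the envelope can be modeled as a plain list of frames. The helper name <literal>req_envelope()</literal> is hypothetical (it is not part of the PHP binding), and the worker identity is made up for illustration:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper: build the frames a ROUTER has to send so that
//  a REQ peer accepts the message: [identity][empty delimiter][data]
function req_envelope($identity, $data) {
	return array($identity, "", $data);
}

$frames = req_envelope("WORKER-1", "This is the workload");
foreach ($frames as $index =&gt; $frame) {
	printf("Frame %d: [%s]%s", $index, $frame, PHP_EOL);
}

//  With a bound ROUTER socket in $broker, the frames could then go out
//  as one multipart message. This assumes the binding's sendmulti():
//      $broker-&gt;sendmulti($frames);
</programlisting>

<para>The ROUTER socket consumes the first frame to pick the connection, and the REQ socket at the other end strips the empty frame, so the worker application sees only the workload.</para>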

</sect2>
<sect2>
<title>ROUTER Broker and DEALER Workers</title>
<para>Anywhere you can use REQ, you can use DEALER. There are two specific differences:</para>

<itemizedlist>
  <listitem><para>The REQ socket always sends an empty delimiter frame before any data frames; the DEALER does not.</para></listitem>
  <listitem><para>The REQ socket will send only one message before it receives a reply; the DEALER is fully asynchronous.</para></listitem>
</itemizedlist>
<para>The synchronous vs. asynchronous behavior has no effect on our example since we're doing strict request-reply anyhow. It matters more when we have to recover from failures, which we'll come to in Reliable Request-Reply Patterns (<xref linkend="reliable-request-reply"/>).</para>

<para>Now let's look at a ROUTER socket talking to DEALER workers instead of REQ workers:</para>

<example id="rtdealer-php">
<title>ROUTER-to-DEALER (rtdealer.php)</title>
<programlisting language="php">
&lt;?php 
/*
 * Custom routing Router to Dealer
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */


//  We have two workers, here we copy the code, normally these would
//  run on different boxes...
function worker_a() {
	$context = new ZMQContext();
	$worker = $context-&gt;getSocket(ZMQ::SOCKET_DEALER);
	$worker-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, "A");
	$worker-&gt;connect("ipc://routing.ipc");

	$total = 0;
	while(true) {
		//  We receive one part, with the workload
		$request = $worker-&gt;recv();
		if($request == 'END') {
			printf ("A received: %d%s", $total, PHP_EOL);
			break;
		}
		$total++;
	}
}

function worker_b() {
	$context = new ZMQContext();
	$worker = $context-&gt;getSocket(ZMQ::SOCKET_DEALER);
	$worker-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, "B");
	$worker-&gt;connect("ipc://routing.ipc");
	
	$total = 0;
	while(true) {
		//  We receive one part, with the workload
		$request = $worker-&gt;recv();
		if($request == 'END') {
			printf ("B received: %d%s", $total, PHP_EOL);
			break;
		}
		$total++;
	}
}

$pid = pcntl_fork();
if($pid == 0) { worker_a(); exit(); }
$pid = pcntl_fork();
if($pid == 0) { worker_b(); exit(); }

$context = new ZMQContext();
$client = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$client-&gt;bind("ipc://routing.ipc");

//  Wait for threads to stabilize
sleep(1);

//  Send 10 tasks scattered to A twice as often as B
for ($task_nbr = 0; $task_nbr != 10; $task_nbr++) {
	//  Send two message parts, first the address...
	if(mt_rand(0, 2) &gt; 0) {
		$client-&gt;send("A", ZMQ::MODE_SNDMORE);
	} else {
		$client-&gt;send("B", ZMQ::MODE_SNDMORE);
	}
	//  And then the workload
	$client-&gt;send("This is the workload");
}

$client-&gt;send("A", ZMQ::MODE_SNDMORE);
$client-&gt;send("END");

$client-&gt;send("B", ZMQ::MODE_SNDMORE);
$client-&gt;send("END");

sleep (1);              //  Give 0MQ/2.0.x time to flush output
</programlisting>

</example>
<para>In the C original of this example, the code is almost identical to the REQ version, except that the worker uses a DEALER socket and reads and writes that empty frame before the data frame. This is the approach I'd use when I wanted to keep compatibility with REQ workers. The PHP listing above takes the simpler route: it addresses two workers by the fixed identities "A" and "B" and sends no empty delimiter frame at all.</para>

<para>However, remember the reason for that empty delimiter frame: it's to allow multihop extended requests that terminate in a REP socket, which uses that delimiter to split off the reply envelope, so it can hand the data frames to its application.</para>

<para>If we never need to pass the message along to a REP socket, we can simply drop the empty delimiter frame at both sides, which makes things simpler. This is usually the design I use for pure DEALER to ROUTER protocols.</para>
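
<para>If we do want a DEALER worker to stay interchangeable with REQ workers, the application has to handle the delimiter that a REQ socket would handle for it. The following is a sketch on plain frame arrays; the helper names <literal>dealer_unwrap()</literal> and <literal>dealer_wrap()</literal> are hypothetical:</para>

<programlisting language="php">
&lt;?php
//  A REQ socket adds and strips the empty delimiter by itself.
//  A DEALER socket does not, so the worker code has to do it.

//  Strip the leading empty delimiter from frames read off a DEALER
function dealer_unwrap(array $frames) {
	if (count($frames) &gt; 0 &amp;&amp; $frames[0] === "") {
		array_shift($frames);
	}
	return $frames;
}

//  Put the empty delimiter back before sending through a DEALER
function dealer_wrap(array $frames) {
	array_unshift($frames, "");
	return $frames;
}

//  The broker sends [identity][empty][workload]. Its ROUTER socket
//  consumes the identity, so the DEALER worker reads two frames.
$incoming = array("", "This is the workload");
$request = dealer_unwrap($incoming);
printf("Workload: %s%s", $request[0], PHP_EOL);

//  What the worker writes to announce that it is ready again
$outgoing = dealer_wrap(array("ready"));
printf("Frames to send: %d%s", count($outgoing), PHP_EOL);
</programlisting>

<para>Dropping the delimiter on both sides, as described above, simply means leaving out both helper calls.</para>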

</sect2>
<sect2>
<title>A Load-Balancing Message Broker</title>
<para>The previous example is half-complete. It can manage a set of workers with dummy requests and replies, but it has no way to talk to clients.</para>

<para>If we add a second <emphasis>frontend</emphasis> ROUTER socket that accepts client requests, and turn our example into a proxy that can switch messages from frontend to backend, we get a useful and reusable tiny load-balancing message broker (<xref linkend="figure-32"/>).</para>

<figure id="figure-32">
    <title>Load-Balancing Broker</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig32.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>What this broker does is:</para>

<itemizedlist>
  <listitem><para>Accepts connections from a set of clients.</para></listitem>
  <listitem><para>Accepts connections from a set of workers.</para></listitem>
  <listitem><para>Accepts requests from clients and holds these in a single queue.</para></listitem>
  <listitem><para>Sends these requests to workers using the load-balancing pattern.</para></listitem>
  <listitem><para>Receives replies back from workers.</para></listitem>
  <listitem><para>Sends these replies back to the original requesting client.</para></listitem>
</itemizedlist>
<para>The broker code is fairly long but worth understanding:</para>

<example id="lbbroker-php">
<title>Load-balancing broker (lbbroker.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Least-recently used (LRU) queue device
 *  Clients and workers are shown here as IPC as PHP
 *  does not have threads. 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
define("NBR_CLIENTS", 10);
define("NBR_WORKERS", 3);

//  Basic request-reply client using REQ socket
function client_thread() {
	$context = new ZMQContext();
	$client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
	$client-&gt;connect("ipc://frontend.ipc");
	
	//  Send request, get reply
	$client-&gt;send("HELLO");
	$reply = $client-&gt;recv();
	printf("Client: %s%s", $reply, PHP_EOL);
}

//  Worker using REQ socket to do LRU routing
function worker_thread () {
	$context = new ZMQContext();
	$worker = $context-&gt;getSocket(ZMQ::SOCKET_REQ);
	$worker-&gt;connect("ipc://backend.ipc");

    //  Tell broker we're ready for work
	$worker-&gt;send("READY");
	
	while(true) {
		//  Read and save all frames until we get an empty frame
        //  In this example there is only 1 but it could be more
		$address = $worker-&gt;recv();
		
		// Additional logic to clean up workers. 
		if($address == "END") {
			exit();
		}
		$empty = $worker-&gt;recv();
		assert(empty($empty));
		
		//  Get request, send reply
		$request = $worker-&gt;recv();
		printf ("Worker: %s%s", $request, PHP_EOL);
		
		$worker-&gt;send($address, ZMQ::MODE_SNDMORE);
		$worker-&gt;send("", ZMQ::MODE_SNDMORE);
		$worker-&gt;send("OK");
    }
}

function main() {
	for($client_nbr = 0; $client_nbr &lt; NBR_CLIENTS; $client_nbr++) {
		$pid = pcntl_fork();
		if($pid == 0) {
			client_thread();
			return;
		}
	}

	for($worker_nbr = 0; $worker_nbr &lt; NBR_WORKERS; $worker_nbr++) {
		$pid = pcntl_fork();
		if($pid == 0) {
			worker_thread();
			return;
		}
	}
	
	$context = new ZMQContext();
	$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$backend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$frontend-&gt;bind("ipc://frontend.ipc");
	$backend-&gt;bind("ipc://backend.ipc");
	
	//  Logic of LRU loop
    //  - Poll backend always, frontend only if 1+ worker ready
    //  - If worker replies, queue worker as ready and forward reply
    //    to client if necessary
    //  - If client requests, pop next worker and send request to it

    //  Queue of available workers
	$available_workers = 0;
	$worker_queue = array();
	$writeable = $readable = array();
	
	while($client_nbr &gt; 0) {
		$poll = new ZMQPoll();
		
		//  Poll front-end only if we have available workers
		if($available_workers &gt; 0) {
			$poll-&gt;add($frontend, ZMQ::POLL_IN);
		}
		
		//  Always poll for worker activity on backend
		$poll-&gt;add($backend, ZMQ::POLL_IN);
		$events = $poll-&gt;poll($readable, $writeable);

		if($events &gt; 0) {
			foreach($readable as $socket) {
				//  Handle worker activity on backend
				if($socket === $backend) {
					//  Queue worker address for LRU routing
					$worker_addr = $socket-&gt;recv();
					assert($available_workers &lt; NBR_WORKERS);
					$available_workers++;
					array_push($worker_queue, $worker_addr);
			
					//  Second frame is empty
					$empty = $socket-&gt;recv();
					assert(empty($empty));
				
					//  Third frame is READY or else a client reply address
					$client_addr = $socket-&gt;recv();
				
					if($client_addr != "READY") {
						$empty = $socket-&gt;recv();
						assert(empty($empty));
						$reply = $socket-&gt;recv();
						$frontend-&gt;send($client_addr, ZMQ::MODE_SNDMORE);
						$frontend-&gt;send("", ZMQ::MODE_SNDMORE);
						$frontend-&gt;send($reply);
						
						// exit after all messages relayed
						$client_nbr--;
					}
				} else if($socket === $frontend) {
					//  Now get next client request, route to LRU worker
					//  Client request is [address][empty][request]
					$client_addr = $socket-&gt;recv();
					$empty = $socket-&gt;recv();
					assert(empty($empty));
					$request = $socket-&gt;recv();

					$backend-&gt;send(array_shift($worker_queue), ZMQ::MODE_SNDMORE);
					$backend-&gt;send("", ZMQ::MODE_SNDMORE);
					$backend-&gt;send($client_addr, ZMQ::MODE_SNDMORE);
					$backend-&gt;send("", ZMQ::MODE_SNDMORE);
					$backend-&gt;send($request);

					$available_workers--; 
				} 
			}
		}
	}
	
	// Clean up our worker processes
	foreach($worker_queue as $worker) {
		$backend-&gt;send($worker, ZMQ::MODE_SNDMORE);
		$backend-&gt;send("", ZMQ::MODE_SNDMORE);
		$backend-&gt;send('END');
	}
	
	sleep(1);
}

main();
</programlisting>

</example>
<para>The difficult part of this program is (a) the envelopes that each socket reads and writes, and (b) the load-balancing algorithm. We'll take these in turn, starting with the message envelope formats.</para>

<para>Let's walk through a full request-reply chain from client to worker and back. The PHP listing above does not set identities on the client and worker sockets; it lets the ROUTER sockets invent identities for connections, which is what we'd do in reality. To make the message frames easier to trace, let's assume the client's identity is "CLIENT" and the worker's identity is "WORKER". The client application sends a single frame containing "HELLO" (<xref linkend="figure-33"/>).</para>

<figure id="figure-33">
    <title>Message that Client Sends</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig33.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Since the REQ socket adds its empty delimiter frame, and the ROUTER socket adds its connection identity, what the proxy reads off the frontend ROUTER socket are three frames: the client address, empty delimiter frame, and the data part (<xref linkend="figure-34"/>).</para>

<figure id="figure-34">
    <title>Message Coming in on Frontend</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig34.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The broker sends this to the worker, prefixed by the address of the chosen worker, plus an additional empty part to keep the REQ at the other end happy (<xref linkend="figure-35"/>).</para>

<figure id="figure-35">
    <title>Message Sent to Backend</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig35.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>This complex envelope stack gets chewed up first by the backend ROUTER socket, which removes the first frame. Then the REQ socket in the worker removes the empty part, and provides the rest to the worker application (<xref linkend="figure-36"/>).</para>

<figure id="figure-36">
    <title>Message Delivered to Worker</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig36.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The worker has to save the envelope (which is all the parts up to and including the empty message frame) and then it can do what's needed with the data part. Note that a REP socket would do this automatically but we're using the REQ-ROUTER pattern so we can get proper load-balancing.</para>

<para>On the return path the messages are the same as when they come in, i.e. the backend socket gives the broker a message in five parts, and the broker sends the frontend socket a message in three parts, and the client gets a message in one part.</para>
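
<para>The whole round trip can be traced with plain arrays, one entry per frame. This is only a sketch of the framing, using the assumed identities "CLIENT" and "WORKER" from the walkthrough; no sockets are involved and the <literal>show()</literal> helper is hypothetical:</para>

<programlisting language="php">
&lt;?php
//  Print a labeled frame list with its frame count
function show($label, array $frames) {
	printf("%-13s %d: [%s]%s", $label, count($frames),
		implode("][", $frames), PHP_EOL);
}

//  Request path
$client_app  = array("HELLO");
//  Client REQ adds the delimiter, frontend ROUTER adds the identity
$frontend_in = array_merge(array("CLIENT", ""), $client_app);
//  Broker prefixes the chosen worker's address and a delimiter
$backend_out = array_merge(array("WORKER", ""), $frontend_in);
//  Backend ROUTER consumes the address, worker REQ strips the delimiter
$worker_app  = array_slice($backend_out, 2);

//  Return path: the worker replays the envelope with a new body
$reply_app    = array("CLIENT", "", "OK");
//  Worker REQ adds the delimiter, backend ROUTER adds the identity
$backend_in   = array_merge(array("WORKER", ""), $reply_app);
//  Broker drops the worker envelope before using the frontend
$frontend_out = array_slice($backend_in, 2);
//  Frontend ROUTER consumes the address, client REQ strips the delimiter
$client_gets  = array_slice($frontend_out, 2);

show("frontend in",  $frontend_in);
show("backend out",  $backend_out);
show("worker app",   $worker_app);
show("backend in",   $backend_in);
show("frontend out", $frontend_out);
show("client gets",  $client_gets);
</programlisting>

<para>The counts match the description above: five frames on the backend, three on the frontend, and one for the client.</para>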

<para>Now let's look at the load-balancing algorithm. It requires that both clients and workers use REQ sockets, and that workers correctly store and replay the envelope on messages they get. The algorithm is:</para>

<itemizedlist>
  <listitem><para>Create a pollset which polls the backend always, and the frontend only if there are one or more workers available.</para></listitem>
  <listitem><para>Poll for activity with infinite timeout.</para></listitem>
  <listitem><para>If there is activity on the backend, we either have a "ready" message or a reply for a client. In either case we store the worker address (the first part) on our worker queue, and if the rest is a client reply we send it back to that client via the frontend.</para></listitem>
  <listitem><para>If there is activity on the frontend, we take the client request, pop the next worker (which is the least recently used), and send the request to the backend. This means sending the worker address, empty part, and then the three parts of the client request.</para></listitem>
</itemizedlist>
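
<para>Stripped of sockets, the bookkeeping in the steps above is just a first-in, first-out queue of worker addresses. Here is a minimal sketch; the class and method names are hypothetical and the worker addresses are made up:</para>

<programlisting language="php">
&lt;?php
//  Sketch of the broker's worker bookkeeping only (no sockets)
class WorkerQueue {
	private $ready = array();

	//  Called for every backend message, whether "ready" or a reply
	public function workerReady($address) {
		array_push($this-&gt;ready, $address);
	}

	//  True if the frontend should be added to the poll set
	public function hasWorkers() {
		return count($this-&gt;ready) &gt; 0;
	}

	//  Pop the worker that has been waiting longest
	public function nextWorker() {
		return array_shift($this-&gt;ready);
	}
}

$queue = new WorkerQueue();
$queue-&gt;workerReady("W1");
$queue-&gt;workerReady("W2");

$first = $queue-&gt;nextWorker();    //  W1 has waited longest
$queue-&gt;workerReady($first);      //  W1 replies and rejoins at the back
$second = $queue-&gt;nextWorker();   //  W2 is now least recently used
printf("%s then %s%s", $first, $second, PHP_EOL);
</programlisting>

<para>Because workers always rejoin at the back, the front of the queue is always the least recently used worker.</para>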
<para>You should now see that you can reuse and extend the load-balancing algorithm with variations based on the information the worker provides in its initial "ready" message. For example, workers might start up and do a performance self-test, then tell the broker how fast they are. The broker can then choose the fastest available worker rather than the oldest.</para>
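
<para>As a sketch of that variation, assume each worker's "ready" message carries a numeric speed score from its self-test, where higher means faster. The message format, the scores, and the function name <literal>pick_fastest()</literal> are all assumptions made for illustration:</para>

<programlisting language="php">
&lt;?php
//  $ready maps worker address =&gt; speed score (assumed format).
//  Removes and returns the fastest waiting worker, or null if none.
function pick_fastest(array &amp;$ready) {
	$best = null;
	foreach ($ready as $address =&gt; $score) {
		if ($best === null || $score &gt; $ready[$best]) {
			$best = $address;
		}
	}
	if ($best !== null) {
		unset($ready[$best]);
	}
	return $best;
}

$ready = array("W1" =&gt; 120, "W2" =&gt; 300, "W3" =&gt; 200);
$chosen = pick_fastest($ready);
printf("Chosen: %s, still waiting: %d%s", $chosen, count($ready), PHP_EOL);
</programlisting>

<para>Only the selection step changes; the rest of the broker loop stays as it is.</para>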

</sect2>
</sect1>
<sect1>
<title>A High-Level API for &Oslash;MQ</title>
<sect2>
<title>Making a Detour</title>
<para>We're going to push request-reply onto the stack and open a different area, which is the &Oslash;MQ API itself. There's a reason for this detour: as we write more complex examples, the low-level &Oslash;MQ API starts to look increasingly clumsy. Look at the core of the worker thread from the C version of our load-balancing broker:</para>

<programlisting language="c">
while (true) {
    //  Read and save all frames until we get an empty frame
    //  In this example there is only 1 but it could be more
    char *address = s_recv (worker);
    char *empty = s_recv (worker);
    assert (*empty == 0);
    free (empty);

    //  Get request, send reply
    char *request = s_recv (worker);
    printf ("Worker: %s\n", request);
    free (request);

    s_sendmore (worker, address);
    s_sendmore (worker, "");
    s_send     (worker, "OK");
    free (address);
}
</programlisting>

<para>That code isn't even reusable, because it can only handle one reply address in the envelope. And it already does some wrapping around the &Oslash;MQ API. If we used the libzmq API directly this is what we'd have to write:</para>

<programlisting language="c">
while (true) {
    //  Read and save all frames until we get an empty frame
    //  In this example there is only 1 but it could be more
    zmq_msg_t address;
    zmq_msg_init (&amp;address);
    zmq_msg_recv (&amp;address, worker, 0);

    zmq_msg_t empty;
    zmq_msg_init (&amp;empty);
    zmq_msg_recv (&amp;empty, worker, 0);

    //  Get request, send reply
    zmq_msg_t payload;
    zmq_msg_init (&amp;payload);
    zmq_msg_recv (&amp;payload, worker, 0);

    size_t char_nbr;
    printf ("Worker: ");
    for (char_nbr = 0; char_nbr &lt; zmq_msg_size (&amp;payload); char_nbr++)
        printf ("%c", ((char *) zmq_msg_data (&amp;payload)) [char_nbr]);
    printf ("\n");

    //  Release the request before reusing the message for the reply
    zmq_msg_close (&amp;payload);
    zmq_msg_init_size (&amp;payload, 2);
    memcpy (zmq_msg_data (&amp;payload), "OK", 2);

    zmq_msg_send (&amp;address, worker, ZMQ_SNDMORE);
    zmq_msg_close (&amp;address);
    zmq_msg_send (&amp;empty, worker, ZMQ_SNDMORE);
    zmq_msg_close (&amp;empty);
    zmq_msg_send (&amp;payload, worker, 0);
    zmq_msg_close (&amp;payload);
}
</programlisting>

<para>And when code is too long to write quickly, it's also too long to understand. Up to now, I've stuck to the native API because as &Oslash;MQ users we need to know that intimately. But when it gets in our way, we have to treat it as a problem to solve.</para>

<para>I'm not proposing changing the &Oslash;MQ API, which is a documented public contract that thousands of people have agreed to and depend on. What I'm proposing is to construct a higher-level API on top, based on our experience so far, and most specifically, our experience from writing more complex request-reply patterns.</para>

<para>What we want is an API that lets us receive and send an entire message in one shot, including the reply envelope with any number of reply addresses. One that lets us do what we want with the fewest possible lines of code.</para>
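
<para>To make the "any number of reply addresses" requirement concrete, here is a small sketch in PHP that works on a message held as an array of frames. The helper name <literal>split_envelope()</literal> and the hop identities are hypothetical; with real sockets the frame list would come from the binding's multipart receive call, which is assumed here rather than shown:</para>

<programlisting language="php">
&lt;?php
//  Split a multipart message into its reply envelope (every frame up
//  to and including the first empty frame) and its body (the rest)
function split_envelope(array $frames) {
	$delimiter = array_search("", $frames, true);
	if ($delimiter === false) {
		return array(array(), $frames);     //  No envelope at all
	}
	return array(
		array_slice($frames, 0, $delimiter + 1),
		array_slice($frames, $delimiter + 1)
	);
}

//  Two reply addresses, as after passing through two ROUTER hops
list($envelope, $body) = split_envelope(
	array("HOP2", "HOP1", "", "HELLO"));

//  Reply by reusing the envelope unchanged and replacing the body
$reply = array_merge($envelope, array("OK"));
printf("[%s]%s", implode("][", $reply), PHP_EOL);
</programlisting>

<para>The worker code never needs to know how many addresses the envelope holds, which is exactly what the frame-by-frame version cannot offer.</para>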

<para>Making a good message API is fairly difficult. We have a problem of terminology: &Oslash;MQ uses "message" to describe both multi-part messages, and individual message frames. We have a problem of expectations: sometimes it's natural to see message content as printable string data, sometimes as binary blobs. And we have technical challenges, especially if we want to avoid copying data around too much.</para>

<para>The challenge of making a good API affects all languages, though my specific use-case is C. Whatever language you use, think about how you could contribute to your language binding to make it as good (or better) than the C binding I'm going to describe.</para>

</sect2>
<sect2>
<title>Features of a Higher-Level API</title>
<para>My solution is to use three fairly natural and obvious concepts: <emphasis>string</emphasis> (already the basis for our <literal>s_send</literal> and <literal>s_recv</literal> helpers), <emphasis>frame</emphasis> (a message frame), and <emphasis>message</emphasis> (a list of one or more frames). Here is the worker code, rewritten onto an API using these concepts:</para>

<programlisting language="c">
while (true) {
    zmsg_t *msg = zmsg_recv (worker);
    zframe_reset (zmsg_last (msg), "OK", 2);
    zmsg_send (&amp;msg, worker);
}
</programlisting>

<para>Cutting the amount of code we need to read and write complex messages is great: the results are easy to read and understand. Let's continue this process for other aspects of working with &Oslash;MQ. Here's a wishlist of things I'd like in a higher-level API, based on my experience with &Oslash;MQ so far:</para>

<itemizedlist>
  <listitem><para><emphasis>Automatic handling of sockets.</emphasis> I find it cumbersome to have to close sockets manually, and to have to explicitly define the linger timeout in some (but not all) cases. It'd be great to have a way to close sockets automatically when I close the context.</para></listitem>
  <listitem><para><emphasis>Portable thread management.</emphasis> Every non-trivial &Oslash;MQ application uses threads, but POSIX threads aren't portable. So a decent high-level API should hide this under a portable layer.</para></listitem>
  <listitem><para><emphasis>Portable clocks.</emphasis> Even getting the time to a millisecond resolution, or sleeping for some milliseconds, is not portable. Realistic &Oslash;MQ applications need portable clocks, so our API should provide them.</para></listitem>
  <listitem><para><emphasis>A reactor to replace <literal>zmq_poll()</literal>.</emphasis> The poll loop is simple but clumsy. Writing a lot of these, we end up doing the same work over and over: calculating timers, and calling code when sockets are ready. A simple reactor with socket readers, and timers, would save a lot of repeated work.</para></listitem>
  <listitem><para><emphasis>Proper handling of Ctrl-C.</emphasis> We already saw how to catch an interrupt. It would be useful if this happened in all applications.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>The CZMQ High-Level API</title>
<para>Turning this wishlist into reality for the C language gives us <ulink url="http://zero.mq/c">CZMQ</ulink>, a &Oslash;MQ language binding for C. This high-level binding in fact developed out of earlier versions of the Guide. It combines nicer semantics for working with &Oslash;MQ with some portability layers, and (importantly for C but less for other languages) containers like hashes and lists. CZMQ also uses an elegant object model that leads to frankly lovely code.</para>

<para>Here is the load-balancing broker rewritten to use a higher-level API (CZMQ for the C case; the PHP version below uses the <literal>Zmsg</literal> class from <literal>zmsg.php</literal>):</para>

<example id="lbbroker2-php">
<title>Load-balancing broker using high-level API (lbbroker2.php)</title>
<programlisting language="php">
&lt;?php 
/*
 *  Least-recently used (LRU) queue device
 *  Demonstrates use of the zmsg class
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

define("NBR_CLIENTS", 10);
define("NBR_WORKERS", 3);

//  Basic request-reply client using REQ socket
function client_thread() {
	$context = new ZMQContext();
	$client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
	$client-&gt;connect("ipc://frontend.ipc");
	
	//  Send request, get reply
	$client-&gt;send("HELLO");
	$reply = $client-&gt;recv();
	printf("Client: %s%s", $reply, PHP_EOL);
}

//  Worker using REQ socket to do LRU routing
function worker_thread () {
	$context = new ZMQContext();
	$worker = $context-&gt;getSocket(ZMQ::SOCKET_REQ);
	$worker-&gt;connect("ipc://backend.ipc");

    //  Tell broker we're ready for work
	$worker-&gt;send("READY");
	
	while(true) {
		$zmsg = new Zmsg($worker);
		$zmsg-&gt;recv();

		// Additional logic to clean up workers. 
		if($zmsg-&gt;address() == "END") {
			exit();
		}
		
		printf ("Worker: %s\n", $zmsg-&gt;body());
		
		$zmsg-&gt;body_set("OK");
		$zmsg-&gt;send();
    }
}

function main() {
	for($client_nbr = 0; $client_nbr &lt; NBR_CLIENTS; $client_nbr++) {
		$pid = pcntl_fork();
		if($pid == 0) {
			client_thread();
			return;
		}
	}

	for($worker_nbr = 0; $worker_nbr &lt; NBR_WORKERS; $worker_nbr++) {
		$pid = pcntl_fork();
		if($pid == 0) {
			worker_thread();
			return;
		}
	}
	
	$context = new ZMQContext();
	$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$backend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$frontend-&gt;bind("ipc://frontend.ipc");
	$backend-&gt;bind("ipc://backend.ipc");
	
	//  Logic of LRU loop
    //  - Poll backend always, frontend only if 1+ worker ready
    //  - If worker replies, queue worker as ready and forward reply
    //    to client if necessary
    //  - If client requests, pop next worker and send request to it

    //  Queue of available workers
	$available_workers = 0;
	$worker_queue = array();
	$writeable = $readable = array();
	
	while($client_nbr &gt; 0) {
		$poll = new ZMQPoll();
		
		//  Poll front-end only if we have available workers
		if($available_workers &gt; 0) {
			$poll-&gt;add($frontend, ZMQ::POLL_IN);
		}
		
		//  Always poll for worker activity on backend
		$poll-&gt;add($backend, ZMQ::POLL_IN);
		$events = $poll-&gt;poll($readable, $writeable);

		if($events &gt; 0) {
			foreach($readable as $socket) {
				//  Handle worker activity on backend
				if($socket === $backend) {
					//  Queue worker address for LRU routing
					$zmsg = new Zmsg($socket);
					$zmsg-&gt;recv();
					assert($available_workers &lt; NBR_WORKERS);
					$available_workers++;
					array_push($worker_queue, $zmsg-&gt;unwrap());
			
					if($zmsg-&gt;body() != "READY") {
						$zmsg-&gt;set_socket($frontend)-&gt;send();
						
						// exit after all messages relayed
						$client_nbr--;
					}
				} else if($socket === $frontend) {
					$zmsg = new Zmsg($socket);
					$zmsg-&gt;recv();
					$zmsg-&gt;wrap(array_shift($worker_queue), "");
					$zmsg-&gt;set_socket($backend)-&gt;send();
					$available_workers--; 
				} 
			}
		}
	}
	
	// Clean up our worker processes
	foreach($worker_queue as $worker) {
		$zmsg = new Zmsg($backend);
		$zmsg-&gt;body_set('END')-&gt;wrap($worker, "")-&gt;send();
	}
	
	sleep(1);
}

main();
</programlisting>

</example>
<para>One thing CZMQ provides is clean interrupt handling. This means that Ctrl-C will cause any blocking &Oslash;MQ call to exit with a return code -1 and errno set to EINTR. The high-level recv methods will return NULL in such cases. So, you can cleanly exit a loop like this:</para>

<programlisting language="c">
while (true) {
    zstr_send (client, "HELLO");
    char *reply = zstr_recv (client);
    if (!reply)
        break;              //  Interrupted
    printf ("Client: %s\n", reply);
    free (reply);
    sleep (1);
}
</programlisting>

<para>Or, if you're calling <literal>zmq_poll()</literal>, test on the return code:</para>

<programlisting language="c">
if (zmq_poll (items, 2, 1000 * 1000) == -1)
    break;              //  Interrupted
</programlisting>

<para>The previous example still uses <literal>zmq_poll()</literal>. So how about reactors? The CZMQ <literal>zloop</literal> reactor is simple but functional. It lets you:</para>

<itemizedlist>
  <listitem><para>Set a reader on any socket, i.e. code that is called whenever the socket has input.</para></listitem>
  <listitem><para>Cancel a reader on a socket.</para></listitem>
  <listitem><para>Set a timer that goes off once or multiple times at specific intervals.</para></listitem>
  <listitem><para>Cancel a timer.</para></listitem>
</itemizedlist>
<para><literal>zloop</literal> of course uses <literal>zmq_poll()</literal> internally. It rebuilds its poll set each time you add or remove readers, and it calculates the poll timeout to match the next timer. Then, it calls the reader and timer handlers for each socket and timer that needs attention.</para>
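
<para>The timeout calculation is the part that is easy to get subtly wrong by hand. As a rough illustration (not the CZMQ implementation), a reactor step could work out its poll timeout like this. The function name is hypothetical, times are in milliseconds, and -1 stands for "wait forever":</para>

<programlisting language="php">
&lt;?php
//  Given the times at which timers fall due, work out how long the
//  next poll may block without missing the earliest timer
function next_poll_timeout(array $timers_due, $now) {
	if (count($timers_due) == 0) {
		return -1;                      //  No timers: block until input
	}
	$wait = min($timers_due) - $now;
	return $wait &gt; 0 ? $wait : 0;       //  Overdue timers fire at once
}

$now = 10000;
printf("%d%s", next_poll_timeout(array(10250, 10900), $now), PHP_EOL);
printf("%d%s", next_poll_timeout(array(9990), $now), PHP_EOL);
printf("%d%s", next_poll_timeout(array(), $now), PHP_EOL);
</programlisting>

<para>After the poll returns, the reactor would run the handlers for every readable socket and every timer whose due time has passed, then repeat.</para>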

<para>When we use a reactor pattern, our code turns inside out. The main logic looks like this:</para>

<programlisting language="c">
zloop_t *reactor = zloop_new ();
zloop_reader (reactor, self-&gt;backend, s_handle_backend, self);
zloop_start (reactor);
zloop_destroy (&amp;reactor);
</programlisting>

<para>The actual handling of messages sits inside dedicated functions or methods. You may not like the style; it's a matter of taste. What it does help with is mixing timers and socket activity. In the rest of this text we'll use <literal>zmq_poll()</literal> in simpler cases, and <literal>zloop</literal> in more complex examples.</para>

<para>Here is the load-balancing broker rewritten once again, this time to use <literal>zloop</literal>:</para>

<example id="lbbroker3-php">
<title>Load balancing broker using zloop (lbbroker3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Getting applications to shut down properly when you send them Ctrl-C can be tricky. If you use the <literal>zctx</literal> class it'll automatically set up signal handling, but your code still has to cooperate. You must break any loop if <literal>zmq_poll</literal> returns -1 or if any of the <literal>zstr_recv</literal>, <literal>zframe_recv</literal>, or <literal>zmsg_recv</literal> methods return NULL. If you have nested loops, it can be useful to make the outer ones conditional on <literal>!zctx_interrupted</literal>.</para>
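
<para>PHP has no <literal>zctx</literal> class, but the same cooperative style can be sketched with the pcntl extension that these examples already need for <literal>pcntl_fork()</literal>. This is an illustration under stated assumptions, not part of any &Oslash;MQ binding: the <literal>$interrupted</literal> flag is our own, and the script sends SIGINT to itself (using the posix extension) to stand in for Ctrl-C:</para>

<programlisting language="php">
&lt;?php
//  Flag set by the signal handler and tested by every loop
$interrupted = false;

pcntl_signal(SIGINT, function ($signo) use (&amp;$interrupted) {
	$interrupted = true;
});

$passes = 0;
while (!$interrupted) {
	//  ... one unit of work, for example a poll with a short timeout
	usleep(100000);
	$passes++;

	//  For the demonstration only: simulate Ctrl-C on the third pass
	if ($passes == 3) {
		posix_kill(posix_getpid(), SIGINT);
	}

	//  Hand any pending signals to the handler above
	pcntl_signal_dispatch();
}

printf("Stopped after %d passes, interrupted: %s%s",
	$passes, $interrupted ? "yes" : "no", PHP_EOL);
</programlisting>

<para>Nested loops would test the same flag, in the way the outer loops in C test <literal>zctx_interrupted</literal>.</para>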

</sect2>
</sect1>
<sect1>
<title>The Asynchronous Client-Server Pattern</title>
<para>In the ROUTER to DEALER example we saw a 1-to-N use case where one server talks asynchronously to multiple workers. We can turn this upside-down to get a very useful N-to-1 architecture where various clients talk to a single server, and do this asynchronously (<xref linkend="figure-37"/>).</para>

<figure id="figure-37">
    <title>Asynchronous Client-Server</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig37.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Here's how it works:</para>

<itemizedlist>
  <listitem><para>Clients connect to the server and send requests.</para></listitem>
  <listitem><para>For each request, the server sends 0 or more replies.</para></listitem>
  <listitem><para>Clients can send multiple requests without waiting for a reply.</para></listitem>
  <listitem><para>Servers can send multiple replies without waiting for new requests.</para></listitem>
</itemizedlist>
<para>Here's code that shows how this works:</para>

<example id="asyncsrv-php">
<title>Asynchronous client-server (asyncsrv.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Asynchronous client-to-server (DEALER to ROUTER)
 * 
 * While this example runs in a single process, that is just to make
 * it easier to start and stop the example. Each task has its own
 * context and conceptually acts as a separate process.
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

/* ---------------------------------------------------------------------
 * This is our client task
 * It connects to the server, and then sends a request once per second
 * It collects responses as they arrive, and it prints them out. We will
 * run several client tasks in parallel, each with a different random ID.
 */
function client_task() {
	$context = new ZMQContext();
	$client = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
	
	//  Generate printable identity for the client
	$identity = sprintf ("%04X", rand(0, 0xFFFF));
	$client-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $identity);
	$client-&gt;connect("tcp://localhost:5570");
	
	$read = $write = array();
	$poll = new ZMQPoll();
	$poll-&gt;add($client, ZMQ::POLL_IN);
	
	$request_nbr = 0;
	while(true) {
		//  Tick once per second, pulling in arriving messages
		for($centitick = 0; $centitick &lt; 100; $centitick++) {
			$events = $poll-&gt;poll($read, $write, 1000);
			$zmsg = new Zmsg($client);
			if($events) {
				$zmsg-&gt;recv();
				printf ("%s: %s%s", $identity, $zmsg-&gt;body(), PHP_EOL);
			}
		}
		$zmsg = new Zmsg($client);
		$zmsg-&gt;body_fmt("request #%d", ++$request_nbr)-&gt;send();
	}
}

/* ---------------------------------------------------------------------
 * This is our server task
 * It uses the multithreaded server model to deal requests out to a pool
 * of workers and route replies back to clients. One worker can handle
 * one request at a time but one client can talk to multiple workers at
 * once.
 */
function server_task() {
	
	//  Launch pool of worker threads, precise number is not critical
	for($thread_nbr = 0; $thread_nbr &lt; 5; $thread_nbr++) {
		$pid = pcntl_fork();
		if($pid == 0) {
			server_worker();
			exit();
		}
	}
	
	$context = new ZMQContext();
	
	//  Frontend socket talks to clients over TCP
	$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$frontend-&gt;bind("tcp://*:5570");
	
	//  Backend socket talks to workers over ipc
	$backend = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
	$backend-&gt;bind("ipc://backend");
	
	//  Connect backend to frontend via a queue device
	//  We could do this:
	//      $device = new ZMQDevice($frontend, $backend);
	//  But doing it ourselves means we can debug this more easily
    
	$read = $write = array();
	//  Switch messages between frontend and backend
	while(true) {
		$poll = new ZMQPoll();
		$poll-&gt;add($frontend, ZMQ::POLL_IN);
		$poll-&gt;add($backend, ZMQ::POLL_IN);
		
		$poll-&gt;poll($read, $write);
		foreach($read as $socket) {
			$zmsg = new Zmsg($socket);
			$zmsg-&gt;recv();
			if($socket === $frontend) {
				//echo "Request from client:";
				//echo $zmsg-&gt;__toString();
				$zmsg-&gt;set_socket($backend)-&gt;send();
			} else if($socket === $backend) {
				//echo "Reply from worker:";
				//echo $zmsg-&gt;__toString();
				$zmsg-&gt;set_socket($frontend)-&gt;send();
			}
		}		
	}
}

function server_worker() {
	$context = new ZMQContext();
	$worker = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
	$worker-&gt;connect("ipc://backend");
	$zmsg = new Zmsg($worker);
	
	while(true) {
		//  The DEALER socket gives us the address envelope and message
		$zmsg-&gt;recv();
		assert($zmsg-&gt;parts() == 2);
		
		// Send 0..4 replies back
		$replies = rand(0,4);
		for($reply = 0; $reply &lt; $replies; $reply++) {
			//  Sleep for some fraction of a second (usleep takes microseconds)
			usleep((rand(0, 999) + 1) * 1000);
			$zmsg-&gt;send(Zmsg::NOCLEAR);
		}

	}
}

/* The main process forks several clients and a server, and then returns.
 * The forked processes keep running until you kill them.
 */
function main() {
	for($num_clients = 0; $num_clients &lt; 3; $num_clients++) {
		$pid = pcntl_fork();
		if($pid == 0) {
			client_task(); 
			exit();
		}
	}
	
	$pid = pcntl_fork();
	if($pid == 0) {
		server_task(); 
		exit();
	}
	
}

main();
</programlisting>

</example>
<para>The example is started from one script, which forks separate processes to simulate a real multi-process architecture. When you run the example, you'll see three clients (each with a random ID), printing out the replies they get from the server. Look carefully and you'll see each client task gets 0 or more replies per request.</para>

<para>Some comments on this code:</para>

<itemizedlist>
  <listitem><para>The clients send a request once per second, and get zero or more replies back. To make this work using <literal>zmq_poll()</literal>, we can't simply poll with a 1-second timeout, or we'd end up sending a new request only one second <emphasis>after we received the last reply</emphasis>. So we poll at a high frequency (100 times at 1/100th of a second per poll), which is approximately accurate.</para></listitem>
  <listitem><para>The server uses a pool of worker processes, each processing one request synchronously. It connects these to its frontend socket using an internal queue. It connects the frontend and backend sockets by switching messages between them in a poll loop, which does the same job as a <literal>zmq_proxy()</literal> call but is easier to debug.</para></listitem>
</itemizedlist>
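<para>A variation on the centitick loop, shown only as a sketch and not as what <literal>asyncsrv.php</literal> does, is to remember when the next request is due and poll for just the time that remains. The helper name <literal>poll_timeout_ms</literal> is our own invention, and the timestamps are passed in by hand so that the arithmetic is easy to follow:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper, not part of asyncsrv.php: how long may we
//  block in poll() before the next request is due?
//  $deadline and $now are timestamps in seconds (floats).
function poll_timeout_ms($deadline, $now) {
	$remaining = (int) ceil(($deadline - $now) * 1000);
	//  Never return a negative value: a timeout of -1 would make
	//  the poll wait until there is activity
	return max(0, $remaining);
}

//  Simulated timeline: the next request is due at t = 10.0 seconds
$deadline = 10.0;
printf("%d%s", poll_timeout_ms($deadline, 9.25), PHP_EOL);   //  750
printf("%d%s", poll_timeout_ms($deadline, 10.5), PHP_EOL);   //  0
</programlisting>

<para>In a real client you would pass <literal>microtime(true)</literal> as the current time and move the deadline forward by one second after each send.</para>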
<figure id="figure-38">
    <title>Detail of Asynchronous Server</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig38.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Note that we're doing DEALER to ROUTER dialog between client and server, but internally between the server main thread and workers we're doing DEALER to DEALER. If the workers were strictly synchronous, we'd use REP. But since we want to send multiple replies, we need an async socket. We do <emphasis>not</emphasis> want to route replies; they always go to the single server thread that sent us the request.</para>

<para>Let's think about the routing envelope. The client sends a simple message. The server thread receives a two-part message (real message prefixed by client identity). We have two possible designs for the server-to-worker interface:</para>

<itemizedlist>
  <listitem><para>Workers get unaddressed messages, and we manage the connections from server thread to worker threads explicitly using a ROUTER socket as backend. This would require that workers start by telling the server they exist, which can then route requests to workers and track which client is 'connected' to which worker. This is the load-balancing pattern again.</para></listitem>
  <listitem><para>Workers get addressed messages, and they return addressed replies. This requires that workers can properly decode and recode envelopes but it doesn't need any other mechanisms.</para></listitem>
</itemizedlist>
<para>The second design is much simpler, so that's what we use:</para>

<screen>     client         frontend       backend       worker
   [ DEALER ]&lt;----&gt;[ ROUTER &lt;----&gt; DEALER &lt;----&gt; DEALER ]
             1 part         2 parts       2 parts
</screen>
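
<para>To make the second design concrete, here is a minimal sketch that models a message as a plain PHP array of frames. The function name <literal>worker_reply</literal> is hypothetical; the real example lets the <literal>Zmsg</literal> class carry the frames.</para>

<programlisting language="php">
&lt;?php
//  A message as the worker sees it on its DEALER socket:
//  frame 0 is the client identity added by the server's ROUTER,
//  frame 1 is the request body.
function worker_reply(array $request, $body) {
	if (count($request) != 2) {
		throw new InvalidArgumentException("expected identity + body");
	}
	//  Keep the address frame untouched, replace only the body
	return array($request[0], $body);
}

$request = array("A1B2", "request #1");
$reply = worker_reply($request, "reply to request #1");
printf("%s gets: %s%s", $reply[0], $reply[1], PHP_EOL);
</programlisting>

<para>Because the identity frame travels with every reply, the server's ROUTER socket can deliver it to the right client without the worker knowing anything about connections.</para>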

<para>When you build servers that maintain stateful conversations with clients, you will run into a classic problem. If the server keeps some state per client, and clients keep coming and going, eventually it will run out of resources. Even if the same clients keep connecting, if you're using default identities, each connection will look like a new one.</para>

<para>We cheat in the above example by keeping state only for a very short time (the time it takes a worker to process a request) and then throwing away the state. But that's not practical for many cases. To properly manage client state in a stateful asynchronous server you have to:</para>

<itemizedlist>
  <listitem><para>Do heartbeating from client to server. In our example we send a request once per second, which can reliably be used as a heartbeat.</para></listitem>
  <listitem><para>Store state using the client identity (whether generated or explicit) as key.</para></listitem>
  <listitem><para>Detect a stopped heartbeat. If there's no request from a client within, say, two seconds, the server can detect this and destroy any state it's holding for that client.</para></listitem>
</itemizedlist>
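<para>The three steps above could be sketched as a small table of client state. This is an illustration built on assumptions, not part of the example: the class name <literal>ClientTable</literal>, its methods, and the two-second limit are all ours, and timestamps are passed in explicitly to keep the logic easy to test.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical per-client state table, keyed by client identity
class ClientTable {
	private $clients = array();
	private $ttl;

	public function __construct($ttl_seconds) {
		$this-&gt;ttl = $ttl_seconds;
	}

	//  Call this for every request; the request doubles as heartbeat
	public function touch($identity, $now) {
		if (!isset($this-&gt;clients[$identity])) {
			$this-&gt;clients[$identity] = array('requests' =&gt; 0);
		}
		$this-&gt;clients[$identity]['requests']++;
		$this-&gt;clients[$identity]['last_seen'] = $now;
	}

	//  Drop state for clients that have been silent for too long
	//  and return the identities that were removed
	public function expire($now) {
		$expired = array();
		foreach ($this-&gt;clients as $identity =&gt; $state) {
			if ($now - $state['last_seen'] &gt; $this-&gt;ttl) {
				$expired[] = $identity;
				unset($this-&gt;clients[$identity]);
			}
		}
		return $expired;
	}

	public function count() {
		return count($this-&gt;clients);
	}
}

$table = new ClientTable(2.0);
$table-&gt;touch("A1B2", 100.0);
$table-&gt;touch("C3D4", 101.5);
//  At t = 102.5, A1B2 has been silent for 2.5s and C3D4 for 1.0s
$gone = $table-&gt;expire(102.5);
printf("expired: %s, remaining: %d%s",
	implode(",", $gone), $table-&gt;count(), PHP_EOL);
</programlisting>

<para>A server would call <literal>touch()</literal> with the identity frame of each incoming request and run <literal>expire()</literal> from its poll loop.</para>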
</sect1>
<sect1>
<title>Worked Example: Inter-Broker Routing</title>
<para>Let's take everything we've seen so far, and scale things up to a real application. We'll build this step by step over several iterations.</para>

<para>Our best client calls us urgently and asks for a design of a large cloud computing facility. He has this vision of a cloud that spans many data centers, each a cluster of clients and workers, and that works together as a whole.</para>

<para>Because we're smart enough to know that practice always beats theory, we propose to make a working simulation using &Oslash;MQ. Our client, eager to lock down the budget before his own boss changes his mind, and having read great things about &Oslash;MQ on Twitter, agrees.</para>

<sect2>
<title>Establishing the Details</title>
<para>Several espressos later, we want to jump into writing code but a little voice tells us to get more details before making a sensational solution to entirely the wrong problem. "What kind of work is the cloud doing?", we ask. The client explains:</para>

<itemizedlist>
  <listitem><para>Workers run on various kinds of hardware, but they are all able to handle any task. There are several hundred workers per cluster, and as many as a dozen clusters in total.</para></listitem>
  <listitem><para>Clients create tasks for workers. Each task is an independent unit of work and all the client wants is to find an available worker, and send it the task, as soon as possible. There will be a lot of clients and they'll come and go arbitrarily.</para></listitem>
  <listitem><para>The real difficulty is to be able to add and remove clusters at any time. A cluster can leave or join the cloud instantly, bringing all its workers and clients with it.</para></listitem>
  <listitem><para>If there are no workers in their own cluster, clients' tasks will go off to other available workers in the cloud.</para></listitem>
  <listitem><para>Clients send out one task at a time, waiting for a reply. If they don't get an answer within X seconds they'll just send out the task again. This isn't our concern; the client API does it already.</para></listitem>
  <listitem><para>Workers process one task at a time, they are very simple beasts. If they crash, they get restarted by whatever script started them.</para></listitem>
</itemizedlist>
<para>So we double check to make sure that we understood this correctly:</para>

<itemizedlist>
  <listitem><para>"There will be some kind of super-duper network interconnect between clusters, right?", we ask. The client says, "Yes, of course, we're not idiots."</para></listitem>
  <listitem><para>"What kind of volumes are we talking about?", we ask. The client replies, "Up to a thousand clients per cluster, each doing max. ten requests per second. Requests are small, and replies are also small, no more than 1K bytes each."</para></listitem>
</itemizedlist>
<para>So we do a little calculation and see that this will work nicely over plain TCP. Even at 2,500 clients, well above the figure the client quoted, 2,500 x 10/second x 1,000 bytes x 2 directions = 50MB/sec or 400Mb/sec, not a problem for a 1Gb network.</para>
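
<para>The estimate is easy to replay for other load figures. This sketch uses a hypothetical helper, <literal>traffic_mbits</literal>, and evaluates it for 2,500 clients and for the thousand clients per cluster that the client quoted:</para>

<programlisting language="php">
&lt;?php
//  Back-of-envelope traffic estimate. Returns megabits per second,
//  using decimal units (1MB = 1,000,000 bytes) as in the text.
function traffic_mbits($clients, $requests_per_sec, $bytes_per_msg) {
	//  Times two because every request has a reply of similar size
	$bytes_per_sec = $clients * $requests_per_sec * $bytes_per_msg * 2;
	return $bytes_per_sec * 8 / 1000000;
}

printf("%d Mb/sec%s", traffic_mbits(2500, 10, 1000), PHP_EOL);   //  400 Mb/sec
printf("%d Mb/sec%s", traffic_mbits(1000, 10, 1000), PHP_EOL);   //  160 Mb/sec
</programlisting>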

<para>It's a straightforward problem that requires no exotic hardware or protocols, just some clever routing algorithms and careful design. We start by designing one cluster (one data center) and then we figure out how to connect clusters together.</para>

</sect2>
<sect2>
<title>Architecture of a Single Cluster</title>
<para>Workers and clients are synchronous. We want to use the load-balancing pattern to route tasks to workers. Workers are all identical, our facility has no notion of different services. Workers are anonymous, clients never address them directly. We make no attempt here to provide guaranteed delivery, retry, etc.</para>

<para>For reasons we already looked at, clients and workers won't speak to each other directly. That would make it impossible to add or remove nodes dynamically. So our basic model consists of the request-reply message broker we saw earlier (<xref linkend="figure-39"/>).</para>

<figure id="figure-39">
    <title>Cluster Architecture</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig39.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

</sect2>
<sect2>
<title>Scaling to Multiple Clusters</title>
<para>Now we scale this out to more than one cluster. Each cluster has a set of clients and workers, and a broker that joins these together:</para>

<figure id="figure-40">
    <title>Multiple Clusters</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig40.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The question is: how do we get the clients of each cluster talking to the workers of the other cluster? There are a few possibilities, each with pros and cons:</para>

<itemizedlist>
  <listitem><para>Clients could connect directly to both brokers. The advantage is that we don't need to modify brokers or workers. But clients get more complex, and become aware of the overall topology. If we want to add, e.g., a third or fourth cluster, all the clients are affected. In effect we have to move routing and fail-over logic into the clients and that's not nice.</para></listitem>
  <listitem><para>Workers might connect directly to both brokers. But REQ workers can't do that; they can only reply to one broker. We might use REPs, but REPs don't give us customizable broker-to-worker routing like load balancing, only the built-in load balancing. That's a fail: if we want to distribute work to idle workers, load balancing is precisely what we need. One solution would be to use ROUTER sockets for the worker nodes. Let's label this "Idea #1".</para></listitem>
  <listitem><para>Brokers could connect to each other. This looks neatest because it creates the fewest additional connections. We can't add clusters on the fly but that is probably out of scope. Now clients and workers remain ignorant of the real network topology, and brokers tell each other when they have spare capacity. Let's label this "Idea #2".</para></listitem>
</itemizedlist>
<para>Let's explore Idea #1. In this model we have workers connecting to both brokers and accepting jobs from either (<xref linkend="figure-41"/>).</para>

<figure id="figure-41">
    <title>Idea 1 - Cross-connected Workers</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig41.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>It looks feasible. However it doesn't provide what we wanted, which was that clients get local workers if possible and remote workers only if it's better than waiting. Also workers will signal "ready" to both brokers and can get two jobs at once, while other workers remain idle. It seems this design fails because again we're putting routing logic at the edges.</para>

<para>So Idea #2 then. We interconnect the brokers and don't touch the clients or workers, which are REQs like we're used to (<xref linkend="figure-42"/>).</para>

<figure id="figure-42">
    <title>Idea 2 - Brokers Talking to Each Other</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig42.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>This design is appealing because the problem is solved in one place, invisible to the rest of the world. Basically, brokers open secret channels to each other and whisper, like camel traders, "Hey, I've got some spare capacity, if you have too many clients give me a shout and we'll deal".</para>

<para>It is in effect just a more sophisticated routing algorithm: brokers become subcontractors for each other. Other things to like about this design, even before we play with real code:</para>

<itemizedlist>
  <listitem><para>It treats the common case (clients and workers on the same cluster) as default and does extra work for the exceptional case (shuffling jobs between clusters).</para></listitem>
  <listitem><para>It lets us use different message flows for the different types of work. That means we can handle them differently, e.g. using different types of network connection.</para></listitem>
  <listitem><para>It feels like it would scale smoothly. Interconnecting three or more brokers doesn't get over-complex. If we find this to be a problem, it's easy to solve by adding a super-broker.</para></listitem>
</itemizedlist>
<para>We'll now make a worked example. We'll pack an entire cluster into one process. That is obviously not realistic but it makes it simple to simulate, and the simulation can accurately scale to real processes. This is the beauty of &Oslash;MQ: you can design at the micro level and scale that up to the macro level. Threads become processes, and then become boxes, and the patterns and logic remain the same. Each of our 'cluster' processes contains client threads, worker threads, and a broker thread.</para>

<para>We know the basic model well by now:</para>

<itemizedlist>
  <listitem><para>The client (REQ) threads create workloads and pass them to the broker (ROUTER).</para></listitem>
  <listitem><para>The worker (REQ) threads process workloads and return the results to the broker (ROUTER).</para></listitem>
  <listitem><para>The broker queues and distributes workloads using the load-balancing pattern.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Federation vs. Peering</title>
<para>There are several possible ways to interconnect brokers. What we want is to be able to tell other brokers, "we have capacity", and then receive multiple tasks. We also need to be able to tell other brokers "stop, we're full". It doesn't need to be perfect: sometimes we may accept jobs we can't process immediately, then we'll do them as soon as possible.</para>

<para>The simplest interconnect is <emphasis>federation</emphasis> in which brokers simulate clients and workers for each other. We would do this by connecting our frontend to the other broker's backend socket (<xref linkend="figure-43"/>). Note that it is legal to both bind a socket to an endpoint and connect it to other endpoints.</para>

<figure id="figure-43">
    <title>Cross-connected Brokers in Federation Model</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig43.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>This would give us simple logic in both brokers and a reasonably good mechanism: when there are no clients, tell the other broker 'ready', and accept one job from it. The trouble is that it is too simple for this problem. A federated broker would be able to handle only one task at once. If the broker emulates a lock-step client and worker, it is by definition also going to be lock-step, and if it has lots of available workers they won't be used. Our brokers need to be connected in a fully asynchronous fashion.</para>

<para>The federation model is perfect for other kinds of routing, especially service-oriented architectures (SOAs) which route by service name and proximity rather than load-balancing or round-robin. So don't dismiss it as useless, it's just not right for all use-cases.</para>

<para>Instead of federation, let's look at a <emphasis>peering</emphasis> approach in which brokers are explicitly aware of each other and talk over privileged channels. Let's break this down, assuming we want to interconnect N brokers. Each broker has (N - 1) peers, and all brokers are using exactly the same code and logic. There are two distinct flows of information between brokers:</para>

<itemizedlist>
  <listitem><para>Each broker needs to tell its peers how many workers it has available at any time. This can be fairly simple information, just a quantity that is updated regularly. The obvious (and correct) socket pattern for this is publish-subscribe. So every broker opens a PUB socket and publishes state information on that, and every broker also opens a SUB socket and connects that to the PUB socket of every other broker, to get state information from its peers.</para></listitem>
  <listitem><para>Each broker needs a way to delegate tasks to a peer and get replies back, asynchronously. We'll do this using ROUTER sockets; no other combination works. Each broker has two such sockets: one for tasks it receives, one for tasks it delegates. If we didn't use two sockets it would be more work to know whether we were reading a request or a reply each time. That would mean adding more information to the message envelope.</para></listitem>
</itemizedlist>
<para>And there is also the flow of information between a broker and its local clients and workers.</para>

</sect2>
<sect2>
<title>The Naming Ceremony</title>
<para>Three flows x two sockets for each flow = six sockets that we have to manage in the broker.  Choosing good names is vital to keeping a multi-socket juggling act reasonably coherent in our minds. Sockets <emphasis>do</emphasis> something and what they do should form the basis for their names. It's about being able to read the code several weeks later on a cold Monday morning before coffee, and not feeling pain.</para>

<para>Let's do a shamanistic naming ceremony for the sockets. The three flows are:</para>

<itemizedlist>
  <listitem><para>A <emphasis>local</emphasis> request-reply flow between the broker and its clients and workers.</para></listitem>
  <listitem><para>A <emphasis>cloud</emphasis> request-reply flow between the broker and its peer brokers.</para></listitem>
  <listitem><para>A <emphasis>state</emphasis> flow between the broker and its peer brokers.</para></listitem>
</itemizedlist>
<para>Finding meaningful names that are all the same length means our code will align nicely. It's not a big thing, but attention to details helps. For each flow the broker has two sockets that we can orthogonally call the "frontend" and "backend". We've used these names quite often. A frontend receives information or tasks. A backend sends those out to other peers. The conceptual flow is from front to back (with replies going in the opposite direction from back to front).</para>

<para>So in all the code we write for this tutorial, we will use these socket names:</para>

<itemizedlist>
  <listitem><para><emphasis>localfe</emphasis> and <emphasis>localbe</emphasis> for the local flow.</para></listitem>
  <listitem><para><emphasis>cloudfe</emphasis> and <emphasis>cloudbe</emphasis> for the cloud flow.</para></listitem>
  <listitem><para><emphasis>statefe</emphasis> and <emphasis>statebe</emphasis> for the state flow.</para></listitem>
</itemizedlist>
<para>For our transport and because we're simulating the whole thing on one box, we'll use <literal>ipc</literal> for everything. This has the advantage of working like <literal>tcp</literal> in terms of connectivity (i.e. it's a disconnected transport, unlike <literal>inproc</literal>), yet we don't need IP addresses or DNS names, which would be a pain here. Instead, we will use <literal>ipc</literal> endpoints called <emphasis>something</emphasis>-<literal>local</literal>, <emphasis>something</emphasis>-<literal>cloud</literal>, and <emphasis>something</emphasis>-<literal>state</literal>, where <emphasis>something</emphasis> is the name of our simulated cluster.</para>
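
<para>As a small illustration of this naming scheme, the sketch below builds the endpoint for one flow of one simulated cluster. The helper name <literal>flow_endpoint</literal> and the <literal>.ipc</literal> file suffix are assumptions of ours, not something the scheme requires:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper: build the ipc endpoint for one flow of one
//  simulated cluster, e.g. the "state" flow of cluster "DC1"
function flow_endpoint($cluster, $flow) {
	$flows = array("local", "cloud", "state");
	if (!in_array($flow, $flows, true)) {
		throw new InvalidArgumentException("unknown flow: $flow");
	}
	return sprintf("ipc://%s-%s.ipc", $cluster, $flow);
}

//  Prints one endpoint per flow for the cluster named DC1
foreach (array("local", "cloud", "state") as $flow) {
	echo flow_endpoint("DC1", $flow), PHP_EOL;
}
</programlisting>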

<para>You may be thinking that this is a lot of work for some names. Why not call them s1, s2, s3, s4, etc.? The answer is that if your brain is not a perfect machine, you need a lot of help when reading code, and we'll see that these names do help. It's easier to remember "three flows, two directions" than "six different sockets" (<xref linkend="figure-44"/>).</para>

<figure id="figure-44">
    <title>Broker Socket Arrangement</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig44.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Note that we connect the cloudbe in each broker to the cloudfe in every other broker, and likewise we connect the statebe in each broker to the statefe in every other broker.</para>

</sect2>
<sect2>
<title>Prototyping the State Flow</title>
<para>Since each socket flow has its own little traps for the unwary, we will test them in real code one by one, rather than try to throw the whole lot into code in one go. When we're happy with each flow, we can put them together into a full program. We'll start with the state flow (<xref linkend="figure-45"/>).</para>

<figure id="figure-45">
    <title>The State Flow</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig45.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Here is how this works in code:</para>

<example id="peering1-php">
<title>Prototype state flow (peering1.php)</title>
<programlisting language="php">
&lt;?php
/*
 *  Broker peering simulation (part 1)
 *  Prototypes the state flow
 */

//  First argument is this broker's name
//  Other arguments are our peers' names
if($_SERVER['argc'] &lt; 2) {
	echo "syntax: peering1 me {you}...", PHP_EOL;
    exit();
}
$self = $_SERVER['argv'][1];
printf ("I: preparing broker at %s... %s", $self, PHP_EOL);

//  Prepare our context and sockets
$context = new ZMQContext();

//  Bind statebe to endpoint
$statebe = $context-&gt;getSocket(ZMQ::SOCKET_PUB);
$endpoint = sprintf("ipc://%s-state.ipc", $self);
$statebe-&gt;bind($endpoint);

//  Connect statefe to all peers
$statefe = $context-&gt;getSocket(ZMQ::SOCKET_SUB);
$statefe-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "");

for ($argn = 2; $argn &lt; $_SERVER['argc']; $argn++) {
	$peer = $_SERVER['argv'][$argn];
	printf ("I: connecting to state backend at '%s'%s", $peer, PHP_EOL);
	$endpoint = sprintf("ipc://%s-state.ipc", $peer);
	$statefe-&gt;connect($endpoint);
}

$readable = $writeable = array();

//  Send out status messages to peers, and collect from peers
//  The zmq_poll timeout defines our own heartbeating
while (true) {
	//  Initialize poll set
	$poll = new ZMQPoll();
	$poll-&gt;add($statefe, ZMQ::POLL_IN);
	//  Poll for activity, or 1 second timeout
	$events = $poll-&gt;poll($readable, $writeable, 1000);
	
	if($events &gt; 0) {
		//  Handle incoming status message
		foreach($readable as $socket) {
			$address = $socket-&gt;recv();
			$body = $socket-&gt;recv();
			printf ("%s - %s workers free%s", $address, $body, PHP_EOL);
		}
	} 
	else {
		//  We stick our own address onto the envelope
		$statebe-&gt;send($self, ZMQ::MODE_SNDMORE);
		//  Send random value for worker availability
		$statebe-&gt;send(mt_rand(1, 10));
		
	}
}
//  We never get here
</programlisting>

</example>
<para>Notes about this code:</para>

<itemizedlist>
  <listitem><para>Each broker has an identity that we use to construct <literal>ipc</literal> endpoint names. A real broker would need to work with TCP and a more sophisticated configuration scheme. We'll look at such schemes later in this book but for now, using generated <literal>ipc</literal> names lets us ignore the problem of where to get TCP/IP addresses or names from.</para></listitem>
  <listitem><para>We use a <literal>zmq_poll()</literal> loop as the core of the program. This processes incoming messages and sends out state messages. We send a state message <emphasis>only</emphasis> if we did not get any incoming messages <emphasis>and</emphasis> we waited for a second. If we send out a state message each time we get one in, we'll get message storms.</para></listitem>
  <listitem><para>We use a two-part pubsub message consisting of sender address and data. Note that we will need to know the address of the publisher in order to send it tasks, and the only way is to send this explicitly as a part of the message.</para></listitem>
  <listitem><para>We don't set identities on subscribers, because if we did then we'd get out of date state information when connecting to running brokers.</para></listitem>
  <listitem><para>We don't set a HWM on the publisher, but if we were using &Oslash;MQ/2.x that would be a wise idea.</para></listitem>
</itemizedlist>
<para>We can run this little program three times to simulate three clusters. Let's call them DC1, DC2, and DC3 (the names are arbitrary). We run these three commands, each in a separate window:</para>

<screen>php peering1.php DC1 DC2 DC3  #  Start DC1 and connect to DC2 and DC3
php peering1.php DC2 DC1 DC3  #  Start DC2 and connect to DC1 and DC3
php peering1.php DC3 DC1 DC2  #  Start DC3 and connect to DC1 and DC2
</screen>

<para>You'll see each cluster report the state of its peers, and after a few seconds they will all happily be printing random numbers once per second. Try this and satisfy yourself that the three brokers all match up and synchronize to per-second state updates.</para>

<para>In real life we'd not send out state messages at regular intervals but rather whenever we had a state change, i.e. whenever a worker becomes available or unavailable. That may seem like a lot of traffic but state messages are small and we've established that the inter-cluster connections are super-fast.</para>

<para>If we wanted to send state messages at precise intervals we'd create a child thread and open the statebe socket in that thread. We'd then send irregular state updates to that child thread from our main thread, and allow the child thread to conflate them into regular outgoing messages. This is more work than we need here.</para>
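
<para>The conflation step on its own is simple enough to sketch without any sockets or threads. Everything here is hypothetical (the class name <literal>StateConflater</literal>, its methods, and the one-second interval); it only shows how irregular updates collapse into one regular outgoing value:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical conflater: accepts state updates at any time and
//  releases the latest known value at most once per interval
class StateConflater {
	private $interval;
	private $next_due;
	private $latest = null;

	public function __construct($interval_seconds, $start) {
		$this-&gt;interval = $interval_seconds;
		$this-&gt;next_due = $start + $interval_seconds;
	}

	//  Called whenever a worker becomes available or unavailable
	public function update($workers_free) {
		$this-&gt;latest = $workers_free;
	}

	//  Called often; returns the value to publish, or null if it is
	//  not time yet or no state has been reported so far
	public function tick($now) {
		if ($now &lt; $this-&gt;next_due || $this-&gt;latest === null) {
			return null;
		}
		//  Advance by whole intervals so the schedule does not drift
		while ($this-&gt;next_due &lt;= $now) {
			$this-&gt;next_due += $this-&gt;interval;
		}
		return $this-&gt;latest;
	}
}

$state = new StateConflater(1.0, 0.0);
$state-&gt;update(3);
$state-&gt;update(5);              //  Overwrites 3 before it is sent
var_dump($state-&gt;tick(0.5));    //  NULL, interval not yet elapsed
var_dump($state-&gt;tick(1.0));    //  int(5)
var_dump($state-&gt;tick(1.5));    //  NULL, next message is due at 2.0
var_dump($state-&gt;tick(2.0));    //  int(5), resent on schedule
</programlisting>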

</sect2>
<sect2>
<title>Prototyping the Local and Cloud Flows</title>
<para>Let's now prototype the flow of tasks via the local and cloud sockets (<xref linkend="figure-46"/>). This code pulls requests from clients and then distributes them to local workers and cloud peers on a random basis.</para>

<figure id="figure-46">
    <title>The Flow of Tasks</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig46.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Before we jump into the code, which is getting a little complex, let's sketch the core routing logic and break it down into a simple but robust design.</para>

<para>We need two queues, one for requests from local clients and one for requests from cloud clients. One option would be to pull messages off the local and cloud frontends, and pump these onto their respective queues. But this is kind of pointless because &Oslash;MQ sockets <emphasis>are</emphasis> queues already. So let's use the &Oslash;MQ socket buffers as queues.</para>

<para>This was the technique we used in the load-balancing broker, and it worked nicely. We only read from the two frontends when there is somewhere to send the requests. We can always read from the backends, since they give us replies to route back. As long as the backends aren't talking to us, there's no point in even looking at the frontends.</para>

<para>So our main loop becomes:</para>

<itemizedlist>
  <listitem><para>Poll the backends for activity. When we get a message, it may be "READY" from a worker or it may be a reply. If it's a reply, route back via the local or cloud frontend.</para></listitem>
  <listitem><para>If a worker replied, it became available, so we queue it and count it.</para></listitem>
  <listitem><para>While there are workers available, take a request, if any, from either frontend and route to a local worker, or randomly, a cloud peer.</para></listitem>
</itemizedlist>
<para>Randomly sending tasks to a peer broker rather than a worker simulates work distribution across the cluster. It's dumb but that is fine for this stage.</para>

<para>We use broker identities to route messages between brokers. Each broker has a name, which we provide on the command line in this simple prototype. As long as these names don't overlap with the &Oslash;MQ-generated UUIDs used for client nodes, we can figure out whether to route a reply back to a client or to a broker.</para>
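
<para>That routing decision comes down to one membership test. The sketch below is only an illustration: the function name <literal>reply_frontend</literal> is hypothetical, and the binary client identity is a made-up value standing in for a generated one.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper: decide which frontend a reply goes out on.
//  $address is the first frame of the reply envelope, $peers holds
//  the broker names given on the command line.
function reply_frontend($address, array $peers) {
	return in_array($address, $peers, true) ? "cloudfe" : "localfe";
}

$peers = array("DC2", "DC3");
echo reply_frontend("DC2", $peers), PHP_EOL;                    //  cloudfe
echo reply_frontend("\x00\x6b\x8b\x45\x67", $peers), PHP_EOL;   //  localfe
</programlisting>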

<para>Here is how this works in code. The interesting part starts around the comment "Interesting part".</para>

<example id="peering2-php">
<title>Prototype local and cloud flow (peering2.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Broker peering simulation (part 2)
 * Prototypes the request-reply flow
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

define("NBR_CLIENTS", 10);
define("NBR_WORKERS", 3);

// Request-reply client using REQ socket
function client_thread($self) {
	$context = new ZMQContext();
	$client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
	$endpoint = sprintf("ipc://%s-localfe.ipc", $self);
	$client-&gt;connect($endpoint);
	
	while(true) {
		//  Send request, get reply
		$client-&gt;send("HELLO");
		$reply = $client-&gt;recv();
		printf("I: client status: %s%s", $reply, PHP_EOL);
	}
}

//  Worker using REQ socket to do load-balancing
function worker_thread ($self) {
	$context = new ZMQContext();
	$worker = $context-&gt;getSocket(ZMQ::SOCKET_REQ);
	$endpoint = sprintf("ipc://%s-localbe.ipc", $self);
	$worker-&gt;connect($endpoint);

    //  Tell broker we're ready for work
	$worker-&gt;send("READY");
	
	while(true) {
		$zmsg = new Zmsg($worker);
		$zmsg-&gt;recv();
		
		sleep(1);
		$zmsg-&gt;body_fmt("OK - %04x", mt_rand(0, 0xFFFF));
		$zmsg-&gt;send();
    }
}

//  First argument is this broker's name
//  Other arguments are our peers' names
if($_SERVER['argc'] &lt; 2) {
	echo "syntax: peering2 me {you}...", PHP_EOL;
    exit();
}

$self = $_SERVER['argv'][1];

for($client_nbr = 0; $client_nbr &lt; NBR_CLIENTS; $client_nbr++) {
	$pid = pcntl_fork();
	if($pid == 0) {
		client_thread($self);
		return;
	} 
}

for($worker_nbr = 0; $worker_nbr &lt; NBR_WORKERS; $worker_nbr++) {
	$pid = pcntl_fork();
	if($pid == 0) {
		worker_thread($self);
		return;
	} 
}

printf ("I: preparing broker at %s... %s", $self, PHP_EOL);

//  Prepare our context and sockets
$context = new ZMQContext();

//  Bind cloud frontend to endpoint
$cloudfe = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$endpoint = sprintf("ipc://%s-cloud.ipc", $self);
$cloudfe-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $self);
$cloudfe-&gt;bind($endpoint);

//  Connect cloud backend to all peers
$cloudbe = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$cloudbe-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $self);

for ($argn = 2; $argn &lt; $_SERVER['argc']; $argn++) {
	$peer = $_SERVER['argv'][$argn];
	printf ("I: connecting to cloud backend at '%s'%s", $peer, PHP_EOL);
	$endpoint = sprintf("ipc://%s-cloud.ipc", $peer);
	$cloudbe-&gt;connect($endpoint);
}
    
//  Prepare local frontend and backend
$localfe = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$endpoint = sprintf("ipc://%s-localfe.ipc", $self);
$localfe-&gt;bind($endpoint);
$localbe = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$endpoint = sprintf("ipc://%s-localbe.ipc", $self);
$localbe-&gt;bind($endpoint);

//  Get user to tell us when we can start...
printf ("Press Enter when all brokers are started: ");
$fp = fopen('php://stdin', 'r');
$line = fgets($fp, 512);
fclose($fp);


//  Interesting part
//  -------------------------------------------------------------
//  Request-reply flow
//  - Poll backends and process local/cloud replies
//  - While worker available, route localfe to local or cloud

//  Queue of available workers
$capacity = 0;
$worker_queue = array();
$readable = $writeable = array();

while(true) {
	$poll = new ZMQPoll();
	$poll-&gt;add($localbe, ZMQ::POLL_IN);
	$poll-&gt;add($cloudbe, ZMQ::POLL_IN);
	$events = 0;
	
	//  If we have no workers anyhow, wait indefinitely
	try {
		$events = $poll-&gt;poll($readable, $writeable, $capacity ? 1000000 : -1);
	} catch(ZMQPollException $e) {
		break;
	}
	
	if($events &gt; 0) {
		foreach($readable as $socket) {
			$zmsg = new Zmsg($socket);
			//  Handle reply from local worker
			if($socket === $localbe) {
				$zmsg-&gt;recv();
				//  Use worker address for LRU routing
				$worker_queue[] = $zmsg-&gt;unwrap();
				$capacity++;
				if($zmsg-&gt;address() == "READY") {
					continue;
				}
			}
			//  Or handle reply from peer broker
			else if($socket === $cloudbe) {
				//  We don't use peer broker address for anything
				$zmsg-&gt;recv()-&gt;unwrap();
			}
			
			//  Route reply to cloud if it's addressed to a broker
			for($argn = 2; $argn &lt; $_SERVER['argc']; $argn++) {
				if($zmsg-&gt;address() == $_SERVER['argv'][$argn]) {
					$zmsg-&gt;set_socket($cloudfe)-&gt;send();
					$zmsg = null;
					break; //  Message is gone, stop checking peers
				}
			}
			
			//  Route reply to client if we still need to
			if($zmsg) {
				$zmsg-&gt;set_socket($localfe)-&gt;send();
			}
		}
	}
	
	//  Now route as many clients requests as we can handle
	while($capacity) {
		$poll = new ZMQPoll();
		$poll-&gt;add($localfe, ZMQ::POLL_IN);
		$poll-&gt;add($cloudfe, ZMQ::POLL_IN);
		$reroutable = false;
		$events = $poll-&gt;poll($readable, $writeable, 0);
		if($events &gt; 0) {
			foreach($readable as $socket) {
				$zmsg = new Zmsg($socket);
				$zmsg-&gt;recv();
				//  We'll do peer brokers first, to prevent starvation
				if($socket === $cloudfe) {
					$reroutable = false;
				} 
				else if($socket === $localfe) {
					$reroutable = true;
				}
				
				//  If reroutable, send to cloud 20% of the time
				//  Here we'd normally use cloud status information
				if($reroutable &amp;&amp; $_SERVER['argc'] &gt; 2 &amp;&amp; mt_rand(0, 4) == 0) {
					$zmsg-&gt;wrap($_SERVER['argv'][mt_rand(2, ($_SERVER['argc']-1))]);
					$zmsg-&gt;set_socket($cloudbe)-&gt;send();
				} 
				else {
					$zmsg-&gt;wrap(array_shift($worker_queue), "");
					$zmsg-&gt;set_socket($localbe)-&gt;send();
					$capacity--;
				}
			}
		} else {
			break; //  No work, go back to backends
		}
	}
}
</programlisting>

</example>
<para>Run this by, for instance, starting two instances of the broker in two windows:</para>

<screen>php peering2.php me you
php peering2.php you me
</screen>

<para>Some comments on this code:</para>

<itemizedlist>
  <listitem><para>In the C code at least, using the zmsg class makes life much easier, and our code much shorter. It's obviously an abstraction that works. If you build &Oslash;MQ applications in C, you should use CZMQ.</para></listitem>
  <listitem><para>Since we're not getting any state information from peers, we naively assume they are running. The code prompts you to confirm when you've started all the brokers. In the real case we'd not send anything to brokers who had not told us they exist.</para></listitem>
</itemizedlist>
<para>You can satisfy yourself that the code works by watching it run forever. If there were any misrouted messages, clients would end up blocking, and the brokers would stop printing trace information. You can prove that by killing either of the brokers. The other broker tries to send requests to the cloud, and one by one its clients block, waiting for an answer.</para>

</sect2>
<sect2>
<title>Putting it All Together</title>
<para>Let's put this together into a single package. As before, we'll run an entire cluster as one process. We're going to take the two previous examples and merge them into one properly working design that lets you simulate any number of clusters.</para>

<para>This code is the size of both previous prototypes together, at 270 LoC. That's pretty good for a simulation of a cluster that includes clients and workers and cloud workload distribution. Here is the code:</para>

<example id="peering3-php">
<title>Full cluster simulation (peering3.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Broker peering simulation (part 3)
 * Prototypes the full flow of status and tasks
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

define("NBR_CLIENTS", 10);
define("NBR_WORKERS", 3);

/* 
 * Request-reply client using REQ socket
 * To simulate load, clients issue a burst of requests and then
 * sleep for a random period.
 */
function client_thread($self) {
	$context = new ZMQContext();
	$client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
	$endpoint = sprintf("ipc://%s-localfe.ipc", $self);
	$client-&gt;connect($endpoint);
	
	$monitor = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
	$endpoint = sprintf("ipc://%s-monitor.ipc", $self);
	$monitor-&gt;connect($endpoint);
	$readable = $writeable = array();
	
	while(true) {
		sleep(mt_rand(0, 4));
		$burst = mt_rand(1, 14);
		while($burst--) {
			//  Send request with random hex ID
			$task_id = sprintf("%04X", mt_rand(0, 10000));
			$client-&gt;send($task_id);
			
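			//  Wait max ten seconds for a reply, then complain
			//  (the poll timeout is in microseconds before php-zmq 1.0.0;
			//  from 1.0.0 on it is in milliseconds, so use 10 * 1000 there)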
			$poll = new ZMQPoll();
			$poll-&gt;add($client, ZMQ::POLL_IN);
			$events = $poll-&gt;poll($readable, $writeable, 10 * 1000000);
			if($events &gt; 0) {
				foreach($readable as $socket) {
					$zmsg = new Zmsg($socket);
					$zmsg-&gt;recv();
					//  Worker is supposed to answer us with our task id
					assert($zmsg-&gt;body() == $task_id);
				}
			} else {
				$monitor-&gt;send(sprintf("E: CLIENT EXIT - lost task %s", $task_id));
				exit();
			}
		}
	}
}

//  Worker using REQ socket to do LRU routing
function worker_thread ($self) {
	$context = new ZMQContext();
	$worker = $context-&gt;getSocket(ZMQ::SOCKET_REQ);
	$endpoint = sprintf("ipc://%s-localbe.ipc", $self);
	$worker-&gt;connect($endpoint);

    //  Tell broker we're ready for work
	$worker-&gt;send("READY");
	
	while(true) {
		$zmsg = new Zmsg($worker);
		$zmsg-&gt;recv();
		
		sleep(mt_rand(0,2));
		$zmsg-&gt;send();
    }
}

//  First argument is this broker's name
//  Other arguments are our peers' names
if($_SERVER['argc'] &lt; 2) {
	echo "syntax: peering3 me {you}...", PHP_EOL;
    exit();
}

$self = $_SERVER['argv'][1];

for($client_nbr = 0; $client_nbr &lt; NBR_CLIENTS; $client_nbr++) {
	$pid = pcntl_fork();
	if($pid == 0) {
		client_thread($self);
		return;
	} 
}

for($worker_nbr = 0; $worker_nbr &lt; NBR_WORKERS; $worker_nbr++) {
	$pid = pcntl_fork();
	if($pid == 0) {
		worker_thread($self);
		return;
	} 
}

printf ("I: preparing broker at %s... %s", $self, PHP_EOL);

//  Prepare our context and sockets
$context = new ZMQContext();

//  Bind cloud frontend to endpoint
$cloudfe = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$endpoint = sprintf("ipc://%s-cloud.ipc", $self);
$cloudfe-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $self);
$cloudfe-&gt;bind($endpoint);

//  Connect cloud backend to all peers
$cloudbe = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$cloudbe-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $self);

for ($argn = 2; $argn &lt; $_SERVER['argc']; $argn++) {
	$peer = $_SERVER['argv'][$argn];
	printf ("I: connecting to cloud backend at '%s'%s", $peer, PHP_EOL);
	$endpoint = sprintf("ipc://%s-cloud.ipc", $peer);
	$cloudbe-&gt;connect($endpoint);
}

//  Bind state backend / publisher to endpoint
$statebe = new ZMQSocket($context, ZMQ::SOCKET_PUB);
$endpoint = sprintf("ipc://%s-state.ipc", $self);
$statebe-&gt;bind($endpoint);

//  Connect statefe to all peers
$statefe = $context-&gt;getSocket(ZMQ::SOCKET_SUB);
$statefe-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "");

for ($argn = 2; $argn &lt; $_SERVER['argc']; $argn++) {
	$peer = $_SERVER['argv'][$argn];
	printf ("I: connecting to state backend at '%s'%s", $peer, PHP_EOL);
	$endpoint = sprintf("ipc://%s-state.ipc", $peer);
	$statefe-&gt;connect($endpoint);
}

//  Prepare monitor socket
$monitor = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$endpoint = sprintf("ipc://%s-monitor.ipc", $self);
$monitor-&gt;bind($endpoint);

//  Prepare local frontend and backend
$localfe = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$endpoint = sprintf("ipc://%s-localfe.ipc", $self);
$localfe-&gt;bind($endpoint);
$localbe = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$endpoint = sprintf("ipc://%s-localbe.ipc", $self);
$localbe-&gt;bind($endpoint);


//  Interesting part
//  -------------------------------------------------------------
//  Publish-subscribe flow
//  - Poll statefe and process capacity updates
//  - Each time capacity changes, broadcast new value
//  Request-reply flow
//  - Poll primary and process local/cloud replies
//  - While worker available, route localfe to local or cloud

//  Queue of available workers
$local_capacity = 0;
$cloud_capacity = 0;
$worker_queue = array();
$readable = $writeable = array();

while(true) {
	$poll = new ZMQPoll();
	$poll-&gt;add($localbe, ZMQ::POLL_IN);
	$poll-&gt;add($cloudbe, ZMQ::POLL_IN);
	$poll-&gt;add($statefe, ZMQ::POLL_IN);
	$poll-&gt;add($monitor, ZMQ::POLL_IN);
	$events = 0;

	//  If we have no workers anyhow, wait indefinitely
	try {
		$events = $poll-&gt;poll($readable, $writeable, $local_capacity ? 1000000 : -1);
	} catch(ZMQPollException $e) {
		break;
	}

	//  Track if capacity changes during this iteration
	$previous = $local_capacity;

    foreach($readable as $socket) {
		$zmsg = new Zmsg($socket);
		
		//  Handle reply from local worker		
		if($socket === $localbe) {
			//  Use worker address for LRU routing
			$zmsg-&gt;recv();
			$worker_queue[] = $zmsg-&gt;unwrap();
			$local_capacity++;
			if($zmsg-&gt;body() == "READY") {
				$zmsg = null; //  Don't route it
			}
		}
		//  Or handle reply from peer broker
		else if($socket === $cloudbe) {
			//  We don't use peer broker address for anything
			$zmsg-&gt;recv()-&gt;unwrap();
		}
		//  Handle capacity updates
		else if($socket === $statefe) {
			$zmsg-&gt;recv();
			$cloud_capacity = (int) $zmsg-&gt;body();
			$zmsg = null;
		} 
		//  Handle monitor message
		else if($socket === $monitor) {
			$zmsg-&gt;recv();
			echo $zmsg-&gt;body(), PHP_EOL;
			$zmsg = null;
		}
	       
		if($zmsg) {
			//  Route reply to cloud if it's addressed to a broker
			for($argn = 2; $argn &lt; $_SERVER['argc']; $argn++) {
				if($zmsg-&gt;address() == $_SERVER['argv'][$argn]) {
					$zmsg-&gt;set_socket($cloudfe)-&gt;send();
					$zmsg = null;
					break; //  Message is gone, stop checking peers
				}
			}
		}
		
		//  Route reply to client if we still need to
		if($zmsg) {
			$zmsg-&gt;set_socket($localfe)-&gt;send();
		}
	}  	 

	//  Now route as many clients requests as we can handle
	//  - If we have local capacity we poll both localfe and cloudfe
	//  - If we have cloud capacity only, we poll just localfe
	//  - Route any request locally if we can, else to cloud
	while($local_capacity + $cloud_capacity) {
		$poll = new ZMQPoll();
		$poll-&gt;add($localfe, ZMQ::POLL_IN);
		if($local_capacity) {
			$poll-&gt;add($cloudfe, ZMQ::POLL_IN);
		}
		$reroutable = false;
		$events = $poll-&gt;poll($readable, $writeable, 0);
		if($events &gt; 0) {
			foreach($readable as $socket) {
				$zmsg = new Zmsg($socket);
				$zmsg-&gt;recv();
				
				if($local_capacity) {
					$zmsg-&gt;wrap(array_shift($worker_queue), "");
					$zmsg-&gt;set_socket($localbe)-&gt;send();
					$local_capacity--;
				} 
				else {
					//  Route to random broker peer
					printf ("I: route request %s to cloud...%s", $zmsg-&gt;body(), PHP_EOL);
					$zmsg-&gt;wrap($_SERVER['argv'][mt_rand(2, ($_SERVER['argc']-1))]);
					$zmsg-&gt;set_socket($cloudbe)-&gt;send();
				}
			}
		} else {
			break; //  No work, go back to backends
		}
	}
	
	if ($local_capacity != $previous) {
        //  Broadcast new capacity
        $zmsg = new Zmsg($statebe);
		$zmsg-&gt;body_set($local_capacity);
		//  We stick our own address onto the envelope
		$zmsg-&gt;wrap($self)-&gt;send();
    }
}
</programlisting>

</example>
<para>It's a non-trivial program and took about a day to get working. These are the highlights:</para>

<itemizedlist>
  <listitem><para>The client threads detect and report a failed request. They do this by polling for a response and if none arrives after a while (10 seconds), printing an error message.</para></listitem>
  <listitem><para>Client threads don't print directly, but instead send a message to a 'monitor' socket (PUSH) that the main loop collects (PULL) and prints off. This is the first case we've seen of using &Oslash;MQ sockets for monitoring and logging; this is a big use case we'll come back to later.</para></listitem>
  <listitem><para>Clients simulate varying loads to push the cluster to 100% load at random moments, so that tasks are shifted over to the cloud. The number of clients and workers, and the delays in the client and worker threads, control this. Feel free to play with them to see if you can make a more realistic simulation.</para></listitem>
  <listitem><para>The main loop uses two pollsets. It could in fact use three: information, backends, and frontends. As in the earlier prototype, there is no point in taking a frontend message if there is no backend capacity.</para></listitem>
</itemizedlist>
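<para>The last point can be reduced to a small decision function. This is only a sketch: <literal>frontends_to_poll()</literal> is a hypothetical helper, and the strings it returns stand in for the <literal>$localfe</literal> and <literal>$cloudfe</literal> sockets.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: choose which frontends are worth polling.
//  frontends_to_poll() is a hypothetical helper; the names it returns
//  stand in for the $localfe and $cloudfe sockets.
function frontends_to_poll($local_capacity, $cloud_capacity) {
	$sockets = array();
	if ($local_capacity + $cloud_capacity &gt; 0) {
		//  Local clients can be served by a worker or by a peer
		$sockets[] = 'localfe';
	}
	if ($local_capacity &gt; 0) {
		//  Peer requests are only ever handled by our own workers
		$sockets[] = 'cloudfe';
	}
	return $sockets;
}

print_r(frontends_to_poll(0, 0));  //  empty: poll no frontend
print_r(frontends_to_poll(0, 2));  //  localfe only
print_r(frontends_to_poll(3, 0));  //  localfe and cloudfe
</programlisting>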
<para>These are some of the problems that I hit during development of this program:</para>

<itemizedlist>
  <listitem><para>Clients would freeze, due to requests or replies getting lost somewhere. Recall that the &Oslash;MQ ROUTER socket drops messages it can't route. The first tactic here was to modify the client thread to detect and report such problems. Secondly, I put zmsg_dump() calls after every recv() and before every send() in the main loop, until it was clear what the problems were.</para></listitem>
  <listitem><para>The main loop was mistakenly reading from more than one ready socket. This caused the first message to be lost. Fixed that by reading only from the first ready socket.</para></listitem>
  <listitem><para>The zmsg class was not properly encoding UUIDs as C strings. This caused UUIDs that contain 0 bytes to be corrupted. Fixed by modifying zmsg to encode UUIDs as printable hex strings.</para></listitem>
</itemizedlist>
<para>This simulation does not detect disappearance of a cloud peer. If you start several peers and stop one, and it was broadcasting capacity to the others, they will continue to send it work even if it's gone. You can try this, and you will get clients that complain of lost requests. The solution is twofold: first, only keep the capacity information for a short time so that if a peer does disappear, its capacity is quickly set to 'zero'. Second, add reliability to the request-reply chain. We'll look at reliability in the next chapter.</para>
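<para>The first half of that solution might look like the following sketch. The function names and the three-second lifetime are assumptions made for illustration, and timestamps are passed in explicitly so that the logic can be followed without a running cluster.</para>

<programlisting language="php">
&lt;?php
//  Sketch only: remember each peer's advertised capacity with a timestamp
//  and ignore entries that have gone stale. The function names and the
//  3-second lifetime are assumptions for illustration.
define("CAPACITY_TTL", 3.0); //  seconds

function record_capacity(array &amp;$table, $peer, $capacity, $now) {
	$table[$peer] = array('capacity' =&gt; (int) $capacity, 'seen' =&gt; $now);
}

function live_capacity(array $table, $now) {
	$total = 0;
	foreach ($table as $entry) {
		//  A peer that has been silent for too long counts as zero
		if ($now - $entry['seen'] &lt;= CAPACITY_TTL) {
			$total += $entry['capacity'];
		}
	}
	return $total;
}

$table = array();
record_capacity($table, 'you', 2, 100.0);
record_capacity($table, 'them', 1, 102.0);
echo live_capacity($table, 102.5), PHP_EOL;  //  3: both peers are fresh
echo live_capacity($table, 104.0), PHP_EOL;  //  1: 'you' has expired
echo live_capacity($table, 110.0), PHP_EOL;  //  0: everyone has expired
</programlisting>

<para>For this to work, each broker would also have to rebroadcast its capacity at regular intervals, and not only when the value changes as the simulation does now.</para>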

</sect2>
</sect1>
</chapter>
<chapter id="reliable-request-reply">
<title>Reliable Request-Reply Patterns</title>
<para>Advanced Request-Reply Patterns (<xref linkend="advanced-request-reply"/>) covered advanced uses of &Oslash;MQ's request-reply pattern with working examples. This chapter looks at the general question of reliability and builds a set of reliable messaging patterns on top of &Oslash;MQ's core request-reply pattern.</para>

<para>In this chapter we focus heavily on user-space request-reply "patterns", reusable models that help you design your own &Oslash;MQ architectures:</para>

<itemizedlist>
  <listitem><para>The <emphasis>Lazy Pirate</emphasis> pattern: reliable request-reply from the client side</para></listitem>
  <listitem><para>The <emphasis>Simple Pirate</emphasis> pattern: reliable request-reply using load-balancing</para></listitem>
  <listitem><para>The <emphasis>Paranoid Pirate</emphasis> pattern: reliable request-reply with heartbeating</para></listitem>
  <listitem><para>The <emphasis>Majordomo</emphasis> pattern: service-oriented reliable queuing</para></listitem>
  <listitem><para>The <emphasis>Titanic</emphasis> pattern: disk-based / disconnected reliable queuing</para></listitem>
  <listitem><para>The <emphasis>Binary Star</emphasis> pattern: primary-backup server fail-over</para></listitem>
  <listitem><para>The <emphasis>Freelance</emphasis> pattern: brokerless reliable request-reply</para></listitem>
</itemizedlist>
<sect1>
<title>What is "Reliability"?</title>
<para>Most people who speak of "reliability" don't really know what they mean. We can only define reliability in terms of failure. That is, if we can handle a certain set of well-defined and understood failures, we are reliable with respect to those failures. No more, no less. So let's look at the possible causes of failure in a distributed &Oslash;MQ application, in roughly descending order of probability:</para>

<itemizedlist>
  <listitem><para>Application code is the worst offender. It can crash and exit, freeze and stop responding to input, run too slowly for its input, exhaust all memory, etc.</para></listitem>
  <listitem><para>System code--like brokers we write using &Oslash;MQ--can die for the same reasons as application code. System code <emphasis>should</emphasis> be more reliable than application code but it can still crash and burn, and especially run out of memory if it tries to queue messages for slow clients.</para></listitem>
  <listitem><para>Message queues can overflow, typically in system code that has learned to deal brutally with slow clients. When a queue overflows, it starts to discard messages. So we get "lost" messages.</para></listitem>
  <listitem><para>Networks can fail (e.g. WiFi gets switched off or goes out of range). &Oslash;MQ will automatically reconnect in such cases but in the meantime, messages may get lost.</para></listitem>
  <listitem><para>Hardware can fail and take with it all the processes running on that box.</para></listitem>
  <listitem><para>Networks can fail in exotic ways, e.g. some ports on a switch may die and those parts of the network become inaccessible.</para></listitem>
  <listitem><para>Entire data centers can be struck by lightning, earthquakes, fire, or more mundane power or cooling failures.</para></listitem>
</itemizedlist>
<para>To make a software system fully reliable against <emphasis>all</emphasis> of these possible failures is an enormously difficult and expensive job and goes beyond the scope of this modest guide.</para>

<para>Since the first five cases cover 99.9% of real world requirements outside large companies (according to a highly scientific study I just ran, which also told me that 78% of statistics are made up on the spot), that's what we'll look at. If you're a large company with money to spend on the last two cases, contact my company immediately! There's a large hole behind my beach house waiting to be converted into an executive pool.</para>

</sect1>
<sect1>
<title>Designing Reliability</title>
<para>So to make things brutally simple, reliability is "keeping things working properly when code freezes or crashes", a situation we'll shorten to "dies". However, the things we want to keep working properly are more complex than just messages. We need to take each core &Oslash;MQ messaging pattern and see how to make it work (if we can) even when code dies.</para>

<para>Let's take them one by one:</para>

<itemizedlist>
  <listitem><para>Request-reply - if the server dies (while processing a request), the client can figure that out because it won't get an answer back. Then it can give up in a huff, wait and try again later, find another server, etc. As for the client dying, we can brush that off as "someone else's problem" for now.</para></listitem>
  <listitem><para>Publish-subscribe - if the client dies (having gotten some data), the server doesn't know about it. Pubsub doesn't send any information back from client to server. But the client can contact the server out-of-band, e.g. via request-reply, and ask, "please resend everything I missed". As for the server dying, that's out of scope for here. Subscribers can also self-verify that they're not running too slowly, and take action (e.g., warn the operator, and die) if they are.</para></listitem>
  <listitem><para>Pipeline - if a worker dies (while working), the ventilator doesn't know about it. Pipelines, like pubsub, and the grinding gears of time, only work in one direction. But the downstream collector can detect that one task didn't get done, and send a message back to the ventilator saying, "hey, resend task 324!" If the ventilator or collector dies, whatever upstream client originally sent the work batch can get tired of waiting and resend the whole lot. It's not elegant, but system code should really not die often enough to matter.</para></listitem>
</itemizedlist>
<para>In this chapter we'll focus just on request-reply, which is the low-hanging fruit of reliable messaging.</para>

<para>The basic request-reply pattern (a REQ client socket doing a blocking send/recv to a REP server socket) scores low on handling the most common types of failure. If the server crashes while processing the request, the client just hangs forever. If the network loses the request or the reply, the client hangs forever.</para>

<para>Request-reply is still much better than TCP, thanks to &Oslash;MQ's ability to reconnect peers silently, to load-balance messages, and so on. But it's still not good enough for real work. The only case where you can really trust the basic request-reply pattern is between two threads in the same process where there's no network or separate server process to die.</para>

<para>However, with a little extra work this humble pattern becomes a good basis for real work across a distributed network, and we get a set of reliable request-reply (RRR) patterns I like to call the <emphasis>Pirate</emphasis> patterns (you'll eventually get the joke, I hope).</para>

<para>There are, in my experience, roughly three ways to connect clients to servers. Each needs a specific approach to reliability:</para>

<itemizedlist>
  <listitem><para>Multiple clients talking directly to a single server. Use case: a single well-known server that clients need to talk to. Types of failure we aim to handle: server crashes and restarts, network disconnects.</para></listitem>
  <listitem><para>Multiple clients talking to a broker proxy that distributes work to multiple workers. Use case: service oriented transaction processing. Types of failure we aim to handle: worker crashes and restarts, worker busy looping, worker overload, queue crashes and restarts, network disconnects.</para></listitem>
  <listitem><para>Multiple clients talking to multiple servers with no intermediary proxies. Use case: distributed services such as name resolution. Types of failure we aim to handle: service crashes and restarts, service busy looping, service overload, network disconnects.</para></listitem>
</itemizedlist>
<para>Each of these approaches has its trade-offs and often you'll mix them. We'll look at all three in detail.</para>

</sect1>
<sect1>
<title>Client-side Reliability (Lazy Pirate Pattern)</title>
<para>We can get very simple reliable request-reply with some changes to the client. We call this the Lazy Pirate pattern (<xref linkend="figure-47"/>). Rather than doing a blocking receive, we:</para>

<itemizedlist>
  <listitem><para>Poll the REQ socket and receive from it only when it's sure a reply has arrived.</para></listitem>
  <listitem><para>Resend a request, if no reply has arrived within a timeout period.</para></listitem>
  <listitem><para>Abandon the transaction if there is still no reply after several requests.</para></listitem>
</itemizedlist>
<figure id="figure-47">
    <title>The Lazy Pirate Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig47.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>
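<para>Stripped of all networking, the three steps above come down to a short loop. In this sketch, <literal>lazy_pirate_request()</literal> and the <literal>$attempt</literal> callable are hypothetical: the callable stands for "send the request, wait up to the timeout, and return the reply or null".</para>

<programlisting language="php">
&lt;?php
//  Sketch only: the Lazy Pirate retry logic with the network taken out.
//  $attempt is a hypothetical callable that sends the request, waits up
//  to the timeout, and returns the reply or null if none arrived.
function lazy_pirate_request($request, $attempt, $retries) {
	while ($retries-- &gt; 0) {
		$reply = $attempt($request);
		if ($reply !== null) {
			return $reply;  //  Got an answer; we're done
		}
		//  No answer in time: loop round and resend
	}
	return null;  //  Abandon the transaction
}

//  A fake server that ignores the first two attempts
$calls = 0;
$flaky = function ($request) use (&amp;$calls) {
	return ++$calls &lt; 3 ? null : "OK " . $request;
};

var_dump(lazy_pirate_request("42", $flaky, 3));  //  string(5) "OK 42"
</programlisting>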

<para>If you try to use a REQ socket in anything other than a strict send/receive fashion, you'll get an error (technically, the REQ socket implements a small finite-state machine to enforce the send/receive ping-pong, and so the error code is called "EFSM"). This is slightly annoying when we want to use REQ in a pirate pattern, because we may send several requests before getting a reply. The pretty good brute-force solution is to close and reopen the REQ socket after an error:</para>

<example id="lpclient-php">
<title>Lazy Pirate client (lpclient.php)</title>
<programlisting language="php">
&lt;?php
/* 
 * Lazy Pirate client
 * Use zmq_poll to do a safe request-reply
 * To run, start lpserver and then randomly kill/restart it
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

define("REQUEST_TIMEOUT", 2500); //  msecs, (&gt; 1000!)
define("REQUEST_RETRIES", 3); //  Before we abandon

/* 
 * Helper function that returns a new configured socket
 * connected to the Hello World server
 */
function client_socket(ZMQContext $context) {
	echo "I: connecting to server...", PHP_EOL;
	$client = new ZMQSocket($context,ZMQ::SOCKET_REQ);
	$client-&gt;connect("tcp://localhost:5555");

    //  Configure socket to not wait at close time
	$client-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 0);
    return $client;
}

$context = new ZMQContext();
$client = client_socket($context);

$sequence = 0; 
$retries_left = REQUEST_RETRIES;
$read = $write = array();

while($retries_left) {
	//  We send a request, then we work to get a reply
	$client-&gt;send(++$sequence);
	
	$expect_reply = true;
	while($expect_reply) {
		//  Poll socket for a reply, with timeout
		$poll = new ZMQPoll();
		$poll-&gt;add($client, ZMQ::POLL_IN);
		$events = $poll-&gt;poll($read, $write, REQUEST_TIMEOUT);
		
		//  If we got a reply, process it
		if($events &gt; 0) {
			//  We got a reply from the server, must match sequence
			$reply = $client-&gt;recv();
			if(intval($reply) == $sequence) {
				printf ("I: server replied OK (%s)%s", $reply, PHP_EOL);
				$retries_left = REQUEST_RETRIES;
				$expect_reply = false;
			} else {
				printf ("E: malformed reply from server: %s%s", $reply, PHP_EOL);
			}
		} else if(--$retries_left == 0) {
			echo "E: server seems to be offline, abandoning", PHP_EOL;
			break;
		} else {
			echo "W: no response from server, retrying...", PHP_EOL;
			//  Old socket will be confused; close it and open a new one
			$client = client_socket($context);
			//  Send request again, on new socket
			$client-&gt;send($sequence);
		}
	}
}
</programlisting>

</example>
<para>Run this together with the matching server:</para>

<example id="lpserver-php">
<title>Lazy Pirate server (lpserver.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Lazy Pirate server
 * Binds REP socket to tcp://*:5555
 * Like hwserver except:
 * - echoes request as-is
 * - randomly runs slowly, or exits to simulate a crash.
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

$context = new ZMQContext();
$server = new ZMQSocket($context, ZMQ::SOCKET_REP);
$server-&gt;bind("tcp://*:5555");

$cycles = 0;
while(true) {
	$request = $server-&gt;recv();
	$cycles++;
	
	//  Simulate various problems, after a few cycles
	if($cycles &gt; 3 &amp;&amp; rand(0, 3) == 0) {
		echo "I: simulating a crash", PHP_EOL;
		break;
	} else if($cycles &gt; 3 &amp;&amp; rand(0, 3) == 0) {
		echo "I: simulating CPU overload", PHP_EOL;
		sleep(5);
	}
	printf ("I: normal request (%s)%s", $request, PHP_EOL);
    sleep(1); // Do some heavy work
	$server-&gt;send($request);
}
</programlisting>

</example>
<para>To run this test case, start the client and the server in two console windows. The server will randomly misbehave after a few messages. You can check the client's response. Here is typical output from the server:</para>

<screen>I: normal request (1)
I: normal request (2)
I: normal request (3)
I: simulating CPU overload
I: normal request (4)
I: simulating a crash
</screen>

<para>And here is the client's response:</para>

<screen>I: connecting to server...
I: server replied OK (1)
I: server replied OK (2)
I: server replied OK (3)
W: no response from server, retrying...
I: connecting to server...
W: no response from server, retrying...
I: connecting to server...
E: server seems to be offline, abandoning
</screen>

<para>The client sequences each message and checks that replies come back exactly in order: that no requests or replies are lost, and no replies come back more than once, or out of order. Run the test a few times until you're convinced this mechanism actually works. You don't need sequence numbers in a production application; they just help us trust our design.</para>

<para>The client uses a REQ socket, and does the brute-force close/reopen because REQ sockets impose that strict send/receive cycle. You might be tempted to use a DEALER instead, but it would not be a good decision. First, it would mean emulating the secret sauce that REQ does with envelopes (if you've forgotten what that is, it's a good sign you don't want to have to do it). Second, it would mean potentially getting back replies that you didn't expect.</para>

<para>Handling failures only at the client works when we have a set of clients talking to a single server. It can handle a server crash, but only if recovery means restarting that same server. If there's a permanent error--e.g., a dead power supply on the server hardware--this approach won't work. Since the application code in servers is usually the biggest source of failures in any architecture, depending on a single server is not a great idea.</para>

<para>So, pros and cons:</para>

<itemizedlist>
  <listitem><para>Pro: simple to understand and implement.</para></listitem>
  <listitem><para>Pro: works easily with existing client and server application code.</para></listitem>
  <listitem><para>Pro: &Oslash;MQ automatically retries the actual reconnection until it works.</para></listitem>
  <listitem><para>Con: doesn't do fail-over to backup / alternate servers.</para></listitem>
</itemizedlist>
</sect1>
<sect1>
<title>Basic Reliable Queuing (Simple Pirate Pattern)</title>
<para>Our second approach extends the Lazy Pirate pattern with a queue proxy that lets us talk, transparently, to multiple servers, which we can more accurately call "workers". We'll develop this in stages, starting with a minimal working model, the Simple Pirate pattern.</para>

<para>In all these Pirate patterns, workers are stateless. If the application requires some shared state--e.g., a shared database--we don't know about it as we design our messaging framework. Having a queue proxy means workers can come and go without clients knowing anything about it. If one worker dies, another takes over. This is a nice, simple topology with only one real weakness, namely the central queue itself, which can become a problem to manage, and a single point of failure.</para>

<para>The basis for the queue proxy is the load-balancing broker from Advanced Request-Reply Patterns<xref linkend="advanced-request-reply"/>. What is the very <emphasis>minimum</emphasis> we need to do to handle dead or blocked workers? Turns out, it's surprisingly little. We already have a retry mechanism in the client. So using the load-balancing pattern will work pretty well. This fits with &Oslash;MQ's philosophy that we can extend a peer-to-peer pattern like request-reply by plugging naive proxies in the middle (<xref linkend="figure-48"/>).</para>

<figure id="figure-48">
    <title>The Simple Pirate Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig48.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>We don't need a special client; we're still using the Lazy Pirate client. Here is the queue, which is identical to the main task of the load-balancing broker:</para>

<example id="spqueue-php">
<title>Simple Pirate queue (spqueue.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Simple Pirate queue
 * This is identical to the LRU pattern, with no reliability mechanisms
 * at all. It depends on the client for recovery. Runs forever.
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include("zmsg.php");

define("MAX_WORKERS", 100);


//  Prepare our context and sockets
$context  = new ZMQContext();
$frontend = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$backend = $context-&gt;getSocket(ZMQ::SOCKET_ROUTER);
$frontend-&gt;bind("tcp://*:5555");    //  For clients
$backend-&gt;bind("tcp://*:5556");    //  For workers

//  Queue of available workers
$available_workers = 0;
$worker_queue = array();
$read = $write = array();

while(true) {
	$poll = new ZMQPoll();
	$poll-&gt;add($backend, ZMQ::POLL_IN);
	
	
	//  Poll frontend only if we have available workers
	if($available_workers) {
		$poll-&gt;add($frontend, ZMQ::POLL_IN);
	}
	
	$events = $poll-&gt;poll($read, $write);
	
	foreach($read as $socket) {
		$zmsg = new Zmsg($socket);
		$zmsg-&gt;recv();
		
		//  Handle worker activity on backend
		if($socket === $backend) {
			//  Use worker address for LRU routing
			assert($available_workers &lt; MAX_WORKERS);
			array_push($worker_queue, $zmsg-&gt;unwrap());
			$available_workers++;
			
			//  Return reply to client if it's not a READY
			if($zmsg-&gt;address() != "READY") {
				$zmsg-&gt;set_socket($frontend)-&gt;send();
			}
		} else if($socket === $frontend) {
			//  Now get next client request, route to next worker
			//  REQ socket in worker needs an envelope delimiter
			//  Dequeue and drop the next worker address
			$zmsg-&gt;wrap(array_shift($worker_queue), "");
			$zmsg-&gt;set_socket($backend)-&gt;send();
			$available_workers--;
		}
	}
	//  We never exit the main loop
}
</programlisting>

</example>
<para>Here is the worker, which takes the Lazy Pirate server and adapts it for the load-balancing pattern (using the REQ 'ready' signaling):</para>

<example id="spworker-php">
<title>Simple Pirate worker (spworker.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Simple Pirate worker
 * Connects REQ socket to tcp://localhost:5556
 * Implements worker part of LRU queueing
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

$context = new ZMQContext();
$worker = new ZMQSocket($context, ZMQ::SOCKET_REQ);

//  Set random identity to make tracing easier
$identity = sprintf ("%04X-%04X", rand(0, 0xFFFF), rand(0, 0xFFFF));
$worker-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $identity);
$worker-&gt;connect("tcp://localhost:5556");

//  Tell queue we're ready for work
printf ("I: (%s) worker ready%s", $identity, PHP_EOL);
$worker-&gt;send("READY");

$cycles = 0;
while(true) {
	$zmsg = new Zmsg($worker);
	$zmsg-&gt;recv();
	$cycles++;
	
	//  Simulate various problems, after a few cycles
	if($cycles &gt; 3 &amp;&amp; rand(0, 3) == 0) {
		printf ("I: (%s) simulating a crash%s", $identity, PHP_EOL);
		break;
	} else if($cycles &gt; 3 &amp;&amp; rand(0, 3) == 0) {
		printf ("I: (%s) simulating CPU overload%s", $identity, PHP_EOL);
		sleep(5);
	}
	printf ("I: (%s) normal reply - %s%s", $identity, $zmsg-&gt;body(), PHP_EOL);
	sleep(1); // Do some heavy work
	$zmsg-&gt;send();
}
</programlisting>

</example>
<para>To test this, start a handful of workers, a Lazy Pirate client, and the queue, in any order. You'll see that the workers eventually all crash and burn, and the client retries and then gives up. The queue never stops, and you can restart workers and clients ad nauseam. This model works with any number of clients and workers.</para>

</sect1>
<sect1>
<title>Robust Reliable Queuing (Paranoid Pirate Pattern)</title>
<para>The Simple Pirate Queue pattern works pretty well, especially since it's just a combination of two existing patterns, but it has some weaknesses:</para>

<itemizedlist>
  <listitem><para>It's not robust in the face of a queue crash and restart. The client will recover, but the workers won't. While &Oslash;MQ will reconnect workers' sockets automatically, as far as the newly started queue is concerned, the workers haven't signaled "READY", so don't exist. To fix this we have to do heartbeating from queue to worker, so that the worker can detect when the queue has gone away.</para></listitem>
  <listitem><para>The queue does not detect worker failure, so if a worker dies while idle, the queue can't remove it from its worker queue until the queue sends it a request. The client waits and retries for nothing. It's not a critical problem, but it's not nice. To make this work properly we do heartbeating from worker to queue, so that the queue can detect a lost worker at any stage.</para></listitem>
</itemizedlist>
<para>We'll fix these in a properly pedantic Paranoid Pirate Pattern.</para>

<para>We previously used a REQ socket for the worker. For the Paranoid Pirate worker we'll switch to a DEALER socket (<xref linkend="figure-49"/>). This has the advantage of letting us send and receive messages at any time, rather than the lock-step send/receive that REQ imposes. The downside of DEALER is that we have to do our own envelope management (re-read Advanced Request-Reply Patterns<xref linkend="advanced-request-reply"/> for background on this concept).</para>

<figure id="figure-49">
    <title>The Paranoid Pirate Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig49.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>We're still using the Lazy Pirate client. Here is the Paranoid Pirate queue proxy:</para>

<example id="ppqueue-php">
<title>Paranoid Pirate queue (ppqueue.php)</title>
<programlisting language="php">
&lt;?php
/* 
 * Paranoid Pirate queue
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

define("MAX_WORKERS", 100);
define("HEARTBEAT_LIVENESS", 3); //  3-5 is reasonable
define("HEARTBEAT_INTERVAL", 1); //  secs


class Queue_T implements Iterator{
	private $queue = array();
	
	/* Iterator functions */
	public function rewind() { return reset($this-&gt;queue); }
	public function valid() { return key($this-&gt;queue) !== null; }
	public function key() { return key($this-&gt;queue); }
	public function next() { return next($this-&gt;queue); }
	public function current() { return current($this-&gt;queue); }

	/*  
	 * Insert worker at end of queue, reset expiry
	 * Worker must not already be in queue
	 */
	public function s_worker_append($identity) {
		if(isset($this-&gt;queue[$identity])) {
			printf ("E: duplicate worker identity %s%s", $identity, PHP_EOL);
		} else {
			$this-&gt;queue[$identity] = microtime(true) + HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS;
		}
	}

	/*
	 * Remove worker from queue, if present
	 */
	public function s_worker_delete($identity) {
		unset($this-&gt;queue[$identity]);
	}

	/*
	 * Reset worker expiry, worker must be present
	 */
	function s_worker_refresh($identity) {
		if(!isset($this-&gt;queue[$identity])) {
			printf ("E: worker %s not ready\n", $identity);
		} else {
			$this-&gt;queue[$identity] = microtime(true) + HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS;
		}
	}
	
	/*
	 * Pop next available worker off queue, return identity
	 */
	public function s_worker_dequeue() {
		reset($this-&gt;queue);
		$identity = key($this-&gt;queue);
		unset($this-&gt;queue[$identity]);
		return $identity;
	}

	/* 
	 * Look for &amp; kill expired workers
	 */
	public function s_queue_purge() {
		foreach($this-&gt;queue as $id =&gt; $expiry) {
			if(microtime(true) &gt; $expiry) {
				unset($this-&gt;queue[$id]);
			}
		}
	}
	
	/*
	 * Return the size of the queue
	 */
	public function size() {
		return count($this-&gt;queue);
	}
}

//  Prepare our context and sockets
$context = new ZMQContext();
$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$backend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$frontend-&gt;bind("tcp://*:5555");    //  For clients
$backend-&gt;bind("tcp://*:5556");    //  For workers
$read = $write = array();

//  Queue of available workers
$queue = new Queue_T();

//  Send out heartbeats at regular intervals
$heartbeat_at = microtime(true) +  HEARTBEAT_INTERVAL;

while(true) {
	$poll = new ZMQPoll();
	$poll-&gt;add($backend, ZMQ::POLL_IN);
	
	//  Poll frontend only if we have available workers
	if($queue-&gt;size()) {
		$poll-&gt;add($frontend, ZMQ::POLL_IN);
	}
	
	$events = $poll-&gt;poll($read, $write, HEARTBEAT_INTERVAL * 1000 ); // milliseconds
	
	if($events &gt; 0) {
		foreach($read as $socket) {
			$zmsg = new Zmsg($socket);
			$zmsg-&gt;recv();
			
			//  Handle worker activity on backend
			if($socket === $backend) {
				$identity = $zmsg-&gt;unwrap();
				
				//  Return reply to client if it's not a control message
				if($zmsg-&gt;parts() == 1) {
					if($zmsg-&gt;address() == "READY") {
						$queue-&gt;s_worker_delete($identity);
						$queue-&gt;s_worker_append($identity);
 					} else if($zmsg-&gt;address() == 'HEARTBEAT') {
						$queue-&gt;s_worker_refresh($identity);
					} else {
						printf ("E: invalid message from %s%s%s", $identity, PHP_EOL, $zmsg-&gt;__toString());
					}
				} else {
					$zmsg-&gt;set_socket($frontend)-&gt;send();
					$queue-&gt;s_worker_append($identity);
				}
			} else {
				//  Now get next client request, route to next worker
				$identity = $queue-&gt;s_worker_dequeue();
				$zmsg-&gt;wrap($identity);
				$zmsg-&gt;set_socket($backend)-&gt;send();
			}
		}
	}

	//  Send heartbeats to idle workers if it's time, whether or not
	//  any messages arrived during this poll
	if(microtime(true) &gt; $heartbeat_at) {
		foreach($queue as $id =&gt; $expiry) {
			$zmsg = new Zmsg($backend);
			$zmsg-&gt;body_set("HEARTBEAT");
			$zmsg-&gt;wrap($id, NULL);
			$zmsg-&gt;send();
		}
		$heartbeat_at = microtime(true) + HEARTBEAT_INTERVAL;
	}
	$queue-&gt;s_queue_purge();
}
</programlisting>

</example>
<para>The queue extends the load-balancing pattern with heartbeating of workers. Heartbeating is one of those "simple" things that can be subtle to get right. I'll explain more about that in a second.</para>

<para>Here is the Paranoid Pirate worker:</para>

<example id="ppworker-php">
<title>Paranoid Pirate worker (ppworker.php)</title>
<programlisting language="php">
&lt;?php
/* 
 * Paranoid Pirate worker
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

define("HEARTBEAT_LIVENESS", 3); //  3-5 is reasonable
define("HEARTBEAT_INTERVAL", 1); //  secs
define("INTERVAL_INIT", 1000); //  Initial reconnect
define("INTERVAL_MAX", 32000); //  After exponential backoff

/* 
 * Helper function that returns a new configured socket
 * connected to the Paranoid Pirate queue
 */
function s_worker_socket($context) {
	$worker = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
	
	//  Set random identity to make tracing easier
	$identity = sprintf ("%04X-%04X", rand(0, 0xFFFF), rand(0, 0xFFFF));
	$worker-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $identity);
	$worker-&gt;connect("tcp://localhost:5556");

	//  Configure socket to not wait at close time
	$worker-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 0);
	
	//  Tell queue we're ready for work
	printf ("I: (%s) worker ready%s", $identity, PHP_EOL);
	$worker-&gt;send("READY");
	
	return array($worker, $identity);
}

$context = new ZMQContext();
list($worker, $identity) = s_worker_socket($context);

//  If liveness hits zero, queue is considered disconnected
$liveness = HEARTBEAT_LIVENESS;
$interval = INTERVAL_INIT;


//  Send out heartbeats at regular intervals
$heartbeat_at = microtime(true) + HEARTBEAT_INTERVAL;
$read = $write = array();
$poll = new ZMQPoll();
$poll-&gt;add($worker, ZMQ::POLL_IN);

$cycles = 0;
while(true) {
	$events = $poll-&gt;poll($read, $write, HEARTBEAT_INTERVAL * 1000);
	
	if($events) {
		//  Get message
		//  - 3-part envelope + content -&gt; request
		//  - 1-part "HEARTBEAT" -&gt; heartbeat
		$zmsg = new Zmsg($worker);
		$zmsg-&gt;recv();
		
		if($zmsg-&gt;parts() == 3) {
			//  Simulate various problems, after a few cycles
			$cycles++;
			if($cycles &gt; 3 &amp;&amp; rand(0, 5) == 0) {
				printf ("I: (%s) simulating a crash%s", $identity, PHP_EOL);
				break;
			} else if($cycles &gt; 3 &amp;&amp; rand(0, 5) == 0) {
				printf ("I: (%s) simulating CPU overload%s", $identity, PHP_EOL);
				sleep(5);
			}
			printf ("I: (%s) normal reply - %s%s", $identity, $zmsg-&gt;body(), PHP_EOL);
			$zmsg-&gt;send();
			$liveness = HEARTBEAT_LIVENESS;
			sleep(1); // Do some heavy work
		} else if($zmsg-&gt;parts() == 1 &amp;&amp; $zmsg-&gt;body() == 'HEARTBEAT'){
			$liveness = HEARTBEAT_LIVENESS;
		} else {
			printf ("E: (%s) invalid message%s%s", $identity, PHP_EOL, $zmsg-&gt;__toString());
		}
		$interval = INTERVAL_INIT;
	} else if(--$liveness == 0) {
		printf ("W: (%s) heartbeat failure, can't reach queue%s", $identity, PHP_EOL);
		printf ("W: (%s) reconnecting in %d msec...%s", $identity, $interval, PHP_EOL);
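		usleep ($interval * 1000);    //  $interval is in msecs, usleep() takes usecs

		if ($interval &lt; INTERVAL_MAX) {
			$interval *= 2;
		}
		//  Replace the dead socket and poll on the new one
		list($worker, $identity) = s_worker_socket ($context);
		$poll = new ZMQPoll();
		$poll-&gt;add($worker, ZMQ::POLL_IN);
		$liveness = HEARTBEAT_LIVENESS;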
	}
	
	//  Send heartbeat to queue if it's time
	if(microtime(true) &gt; $heartbeat_at) {
		$heartbeat_at = microtime(true) + HEARTBEAT_INTERVAL;
		printf ("I: (%s) worker heartbeat%s", $identity, PHP_EOL);
		$worker-&gt;send("HEARTBEAT");
	}
}
</programlisting>

</example>
<para>Some comments about this example:</para>

<itemizedlist>
  <listitem><para>The code includes simulation of failures, as before. This makes it (a) very hard to debug, and (b) dangerous to reuse. When you want to debug this, disable the failure simulation.</para></listitem>
  <listitem><para>The worker uses a reconnect strategy similar to the one we designed for the Lazy Pirate client, with two major differences: (a) it does an exponential back-off, and (b) it retries indefinitely (whereas the client retries a few times before reporting a failure).</para></listitem>
</itemizedlist>
<para>Try the client, queue, and workers together, using a script like this:</para>

<screen>php ppqueue.php &amp;
for i in 1 2 3 4; do
    php ppworker.php &amp;
    sleep 1
done
php lpclient.php &amp;
</screen>

<para>You should see the workers die, one by one, as they simulate a crash, and the client eventually give up. You can stop and restart the queue and both client and workers will reconnect and carry on. And no matter what you do to queues and workers, the client will never get an out-of-order reply: the whole chain either works, or the client abandons.</para>

</sect1>
<sect1>
<title>Heartbeating</title>
<para>Heartbeating solves the problem of knowing whether a peer is alive or dead. This is not an issue specific to &Oslash;MQ. TCP has a long timeout (30 minutes or so), which means that it can be impossible to know whether a peer has died, been disconnected, or gone on a weekend to Prague with a case of vodka, a redhead, and a large expense account.</para>

<para>It's not easy to get heartbeating right. When writing the Paranoid Pirate examples, it took about five hours to get the heartbeating working properly. The rest of the request-reply chain took perhaps ten minutes. It is especially easy to create "false failures", i.e., when peers decide that they are disconnected because the heartbeats aren't sent properly.</para>

<para>We'll look at the three main answers people use for heartbeating with &Oslash;MQ.</para>

<sect2>
<title>Shrugging It Off</title>
<para>The most common approach is to do no heartbeating at all and hope for the best. Many if not most &Oslash;MQ applications do this. &Oslash;MQ encourages this by hiding peers in many cases. What problems does this approach cause?</para>

<itemizedlist>
  <listitem><para>When we use a ROUTER socket in an application that tracks peers, as peers disconnect and reconnect, the application will leak memory (resources that the application holds for each peer) and get slower and slower.</para></listitem>
  <listitem><para>When we use SUB or DEALER-based data recipients, we can't tell the difference between good silence (there's no data) and bad silence (the other end died). When a recipient knows the other side died, it can for example switch over to a backup route.</para></listitem>
  <listitem><para>If we use a TCP connection that stays silent for a long while, it will, in some networks, just die. Sending something (technically, a "keep-alive" more than a heartbeat), will keep the network alive.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>One-Way Heartbeats</title>
<para>A second option is to send a heartbeat message from each node to its peers, every second or so. When one node hears nothing from another within some timeout (several seconds, typically), it will treat that peer as dead. Sounds good, right? Sadly, no. This works in some cases but has nasty edge cases in others.</para>

<para>For PUB-SUB, this does work, and it's the only model you can use. SUB sockets cannot talk back to PUB sockets, but PUB sockets can happily send "I'm alive" messages to their subscribers.</para>

<para>As an optimization, you can send heartbeats only when there is no real data to send. Furthermore, you can send heartbeats progressively slower and slower, if network activity is an issue (e.g. on mobile networks where activity drains the battery). As long as the recipient can detect a failure (sharp stop in activity), that's fine.</para>

<para>Now the typical problems with this design:</para>

<itemizedlist>
  <listitem><para>It can be inaccurate when we send large amounts of data, since heartbeats will be delayed behind that data. If heartbeats are delayed, you can get false timeouts and disconnections due to network congestion. Thus, always treat <emphasis>any</emphasis> incoming data as a heartbeat, whether or not the sender optimizes out heartbeats.</para></listitem>
  <listitem><para>While the PUB-SUB pattern will drop messages for disappeared recipients, PUSH and DEALER sockets will queue them. So, if you send heartbeats to a dead peer, and it comes back, it'll get all the heartbeats you sent. Which can be thousands. Whoa, whoa!</para></listitem>
  <listitem><para>This design assumes that heartbeat timeouts are the same across the whole network. But that won't be accurate. Some peers will want very aggressive heartbeating, to detect faults rapidly. And some will want very relaxed heartbeating, to let sleeping networks lie, and save power.</para></listitem>
</itemizedlist>
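
<para>The first point above, treating any incoming data as a heartbeat, can be sketched on the receiving side as follows. This is only a minimal illustration: <literal>PeerLiveness</literal> is a hypothetical helper class (it is not part of the &Oslash;MQ PHP binding), and the three-second timeout is an assumed value. Times are passed in explicitly so that the logic can be tested without a network:</para>

<programlisting language="php">
&lt;?php
/*
 * Sketch: one-way heartbeat bookkeeping for a single peer.
 * PeerLiveness is a hypothetical helper, not part of the ZMQ extension.
 */
class PeerLiveness {
	private $timeout;    //  Seconds of silence before we give up
	private $last_seen;  //  When we last heard anything from the peer

	public function __construct($timeout, $now) {
		$this-&gt;timeout = $timeout;
		$this-&gt;last_seen = $now;
	}

	//  Call this for ANY incoming message, heartbeat or real data
	public function heard($now) {
		$this-&gt;last_seen = $now;
	}

	//  True if the peer has been silent for longer than the timeout
	public function is_dead($now) {
		return ($now - $this-&gt;last_seen) &gt; $this-&gt;timeout;
	}
}

$peer = new PeerLiveness(3.0, microtime(true));

//  In the poll loop, on every message received from the peer:
$peer-&gt;heard(microtime(true));

//  And once per loop iteration:
if($peer-&gt;is_dead(microtime(true))) {
	printf ("W: peer is silent, treating it as dead%s", PHP_EOL);
}
</programlisting>

<para>Because <literal>heard()</literal> is called for every message, a sender that skips heartbeats while it has real data to send will not trigger a false timeout.</para>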
</sect2>
<sect2>
<title>Ping-Pong Heartbeats</title>
<para>The third option is to use a ping-pong dialog. One peer sends a ping command to the other, which replies with a pong command. Neither command has any payload. Pings and pongs are not correlated. Since the roles of "client" and "server" are arbitrary in some networks, we usually specify that either peer can in fact send a ping and expect a pong in response. However, since the timeouts depend on network topologies known best to dynamic clients, it is usually the client which pings the server.</para>

<para>This works for all ROUTER-based brokers. The same optimizations we used in the second model make this work even better: treat any incoming data as a pong, and only send a ping when not otherwise sending data.</para>
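
<para>As a rough sketch of that bookkeeping, the class below tracks when we last sent and last received anything, and answers two questions: should we ping now, and has the peer gone away? The class name <literal>PingPong</literal>, the command strings "PING" and "PONG", and the timing values are all assumptions made for illustration; a real protocol would define its own commands:</para>

<programlisting language="php">
&lt;?php
/*
 * Sketch: ping-pong heartbeat bookkeeping for one peer.
 * PingPong and the "PING"/"PONG" command strings are hypothetical.
 */
class PingPong {
	private $idle;       //  Seconds without sending before we ping
	private $timeout;    //  Seconds without receiving before peer is dead
	private $last_sent;  //  When we last sent anything
	private $last_recv;  //  When we last received anything

	public function __construct($idle, $timeout, $now) {
		$this-&gt;idle = $idle;
		$this-&gt;timeout = $timeout;
		$this-&gt;last_sent = $now;
		$this-&gt;last_recv = $now;
	}

	//  Call after sending anything; real data postpones the next ping
	public function sent($now) {
		$this-&gt;last_sent = $now;
	}

	//  Call after receiving anything; real data counts as a pong.
	//  Returns the command to send back, or null if none is needed.
	public function received($command, $now) {
		$this-&gt;last_recv = $now;
		return $command === "PING" ? "PONG" : null;
	}

	//  Only ping when we have not sent anything for a while
	public function should_ping($now) {
		return ($now - $this-&gt;last_sent) &gt;= $this-&gt;idle;
	}

	public function peer_is_dead($now) {
		return ($now - $this-&gt;last_recv) &gt; $this-&gt;timeout;
	}
}

$hb = new PingPong(1.0, 3.0, microtime(true));
$reply = $hb-&gt;received("PING", microtime(true));
printf ("I: reply to PING is %s%s", $reply, PHP_EOL);
</programlisting>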

</sect2>
<sect2>
<title>Heartbeating for Paranoid Pirate</title>
<para>For Paranoid Pirate we chose the second approach. It might not have been the simplest option: if designing this today, I'd probably try a ping-pong approach instead. However, the principles are similar. The heartbeat messages flow asynchronously in both directions, and either peer can decide the other is 'dead' and stop talking to it.</para>

<para>In the worker, this is how we handle heartbeats from the queue:</para>

<itemizedlist>
  <listitem><para>We calculate a <emphasis>liveness</emphasis> which is how many heartbeats we can still miss before deciding the queue is dead. It starts at 3 and we decrement it each time we miss a heartbeat.</para></listitem>
  <listitem><para>We wait, in the <literal>zmq_poll</literal> loop, for one second each time, which is our heartbeat interval.</para></listitem>
  <listitem><para>If there's any message from the queue during that time we reset our liveness to three.</para></listitem>
  <listitem><para>If there's no message during that time, we count down our liveness.</para></listitem>
  <listitem><para>If the liveness reaches zero, we consider the queue dead.</para></listitem>
  <listitem><para>If the queue is 'dead', we destroy our socket, create a new one, and reconnect.</para></listitem>
  <listitem><para>To avoid opening and closing too many sockets we wait for a certain <emphasis>interval</emphasis> before reconnecting, and we double the interval each time until it reaches 32 seconds.</para></listitem>
</itemizedlist>
<para>And this is how we handle heartbeats <emphasis>to</emphasis> the queue:</para>

<itemizedlist>
  <listitem><para>We calculate when to send the next heartbeat; this is a single variable since we're talking to one peer, the queue.</para></listitem>
  <listitem><para>In the <literal>zmq_poll</literal> loop, whenever we pass this time, we send a heartbeat to the queue.</para></listitem>
</itemizedlist>
<para>Here's the essential heartbeating code for the worker:</para>

<programlisting language="c">
#define HEARTBEAT_LIVENESS  3       //  3-5 is reasonable
#define HEARTBEAT_INTERVAL  1000    //  msecs
#define INTERVAL_INIT       1000    //  Initial reconnect
#define INTERVAL_MAX       32000    //  After exponential backoff

...
//  If liveness hits zero, queue is considered disconnected
size_t liveness = HEARTBEAT_LIVENESS;
size_t interval = INTERVAL_INIT;

//  Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
    zmq_pollitem_t items [] = { { worker,  0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);

    if (items [0].revents &amp; ZMQ_POLLIN) {
        //  Receive any message from queue
        liveness = HEARTBEAT_LIVENESS;
        interval = INTERVAL_INIT;
    }
    else
    if (--liveness == 0) {
        zclock_sleep (interval);
        if (interval &lt; INTERVAL_MAX)
            interval *= 2;
        zsocket_destroy (ctx, worker);
        ...
        liveness = HEARTBEAT_LIVENESS;
    }
    //  Send heartbeat to queue if it's time
    if (zclock_time () &gt; heartbeat_at) {
        heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        //  Send heartbeat message to queue
    }
}
</programlisting>

<para>The queue does the same, but manages an expiry time for each worker.</para>

<para>Here are some tips for your own heartbeating implementation:</para>

<itemizedlist>
  <listitem><para>Use <literal>zmq_poll</literal> or a reactor as the core of your application's main task.</para></listitem>
  <listitem><para>Start by building the heartbeating between peers, test it by simulating failures, and <emphasis>then</emphasis> build the rest of the message flow. Adding heartbeating afterwards is much trickier.</para></listitem>
  <listitem><para>Use simple tracing, i.e. print to console, to get this working. Some tricks to help you trace the flow of messages between peers: a dump method such as zmsg offers; number messages incrementally so you can see if there are gaps.</para></listitem>
  <listitem><para>In a real application, heartbeating must be configurable and usually negotiated with the peer. Some peers will want aggressive heartbeating, as low as 10 msecs. Other peers will be far away and want heartbeating as high as 30 seconds.</para></listitem>
  <listitem><para>If you have different heartbeat intervals for different peers, your poll timeout should be the lowest (shortest time) of these. Do not use an infinite timeout.</para></listitem>
  <listitem><para>Do heartbeating on the same socket as you use for messages, so your heartbeats also act as a <emphasis>keep-alive</emphasis> to stop the network connection from going stale (some firewalls can be unkind to silent connections).</para></listitem>
</itemizedlist>
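
<para>The tip about poll timeouts can be illustrated with a few lines of PHP. This is a sketch under assumptions: the function <literal>poll_timeout_msecs</literal>, the peer names, and the one-second fallback are invented for the example:</para>

<programlisting language="php">
&lt;?php
/*
 * Sketch: choose a poll timeout from per-peer heartbeat intervals.
 * poll_timeout_msecs() and the peer names are hypothetical.
 */
function poll_timeout_msecs(array $intervals, $default = 1000) {
	//  With no peers yet, fall back to a finite default, never -1
	if(count($intervals) == 0) {
		return $default;
	}
	//  The shortest interval decides how often we must wake up
	return min($intervals);
}

//  Heartbeat interval negotiated with each peer, in msecs
$intervals = array(
	"near-worker" =&gt; 10,
	"far-worker"  =&gt; 30000,
);
$timeout = poll_timeout_msecs($intervals);
printf ("I: poll timeout is %d msecs%s", $timeout, PHP_EOL);
</programlisting>

<para>The result would be passed as the timeout argument of <literal>ZMQPoll::poll()</literal>, in the same way that the Paranoid Pirate examples pass <literal>HEARTBEAT_INTERVAL * 1000</literal>.</para>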
</sect2>
</sect1>
<sect1>
<title>Contracts and Protocols</title>
<para>If you're paying attention you'll realize that Paranoid Pirate is not interoperable with Simple Pirate, because of the heartbeats. But how do we define "interoperable"? To guarantee interoperability we need a kind of contract, an agreement that lets different teams, in different times and places, write code that is guaranteed to work together. We call this a "protocol".</para>

<para>It's fun to experiment without specifications, but that's not a sensible basis for real applications. What happens if we want to write a worker in another language? Do we have to read code to see how things work? What if we want to change the protocol for some reason? The protocol may be simple but it's not obvious, and if it's successful, it must become more complex.</para>

<para>Lack of contracts is a sure sign of a disposable application. So, let's write a contract for this protocol. How do we do that?</para>

<para>There's a wiki at <ulink url="http://rfc.zeromq.org">rfc.zeromq.org</ulink> that we made especially as a home for public &Oslash;MQ contracts.</para>

<para>To create a new specification, register, and follow the instructions. It's straightforward, though technical writing is not for everyone.</para>

<para>It took me about fifteen minutes to draft the new <ulink url="http://rfc.zeromq.org/spec:6">Pirate Pattern Protocol</ulink>. It's not a big specification but it does capture enough to act as the basis for arguments ("your queue isn't PPP compatible, please fix it!").</para>

<para>Turning PPP into a real protocol would take more work:</para>

<itemizedlist>
  <listitem><para>There should be a protocol version number in the READY command so that it's possible to distinguish different versions of PPP.</para></listitem>
  <listitem><para>Right now, READY and HEARTBEAT are not entirely distinct from requests and replies. To make them distinct, we would want a message structure that includes a "message type" part.</para></listitem>
</itemizedlist>
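
<para>To make these two points concrete, here is one way such framing could look. Everything in it is hypothetical: the header string "PPP01", the one-byte command codes, and the <literal>ppp_encode</literal> and <literal>ppp_decode</literal> helpers are invented for this sketch and are not part of the PPP specification:</para>

<programlisting language="php">
&lt;?php
/*
 * Sketch: command framing with a version header and a message type.
 * The header and command values are hypothetical, not taken from PPP.
 */
define("PPP_VERSION",   "PPP01");  //  Frame 0: protocol name and version
define("PPP_READY",     "\001");   //  Frame 1: message type
define("PPP_HEARTBEAT", "\002");
define("PPP_REQUEST",   "\003");
define("PPP_REPLY",     "\004");

//  Build the frames for one command; $body holds any payload frames
function ppp_encode($type, array $body = array()) {
	return array_merge(array(PPP_VERSION, $type), $body);
}

//  Split received frames into type and body, or return null if the
//  message does not start with the version header we understand
function ppp_decode(array $frames) {
	if(count($frames) &lt; 2 || $frames[0] !== PPP_VERSION) {
		return null;
	}
	return array(
		"type" =&gt; $frames[1],
		"body" =&gt; array_slice($frames, 2),
	);
}

$frames = ppp_encode(PPP_REPLY, array("hello"));
$command = ppp_decode($frames);
printf ("I: decoded %d body frame(s)%s", count($command["body"]), PHP_EOL);
</programlisting>

<para>With a type frame in every message, a peer no longer has to guess from the number of frames whether it received a heartbeat or a request, and a peer speaking a different version is rejected at the first frame.</para>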
</sect1>
<sect1>
<title>Service-Oriented Reliable Queuing (Majordomo Pattern)</title>
<para>The nice thing about progress is how fast it happens when lawyers and committees aren't involved. Just a few sentences ago we were dreaming of a better protocol that would fix the world. And <ulink url="http://rfc.zeromq.org/spec:7">here we have it</ulink>.</para>

<para>This one-page specification turns PPP into something more solid (<xref linkend="figure-50"/>). This is how we should design complex architectures: start by writing down the contracts, and only <emphasis>then</emphasis> write software to implement them.</para>

<para>The Majordomo Protocol (MDP) extends and improves on PPP in one interesting way: it adds a "service name" to requests that the client sends, and asks workers to register for specific services. Adding service names turns our Paranoid Pirate queue into a service-oriented broker. The nice thing about MDP is that it came out of working code, a simpler ancestor protocol (PPP), and a precise set of improvements. This made it easy to draft.</para>

<figure id="figure-50">
    <title>The Majordomo Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig50.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>To implement Majordomo we need to write a framework for clients and workers. It's really not sane to ask every application developer to read the spec and make it work, when they could be using a simpler API built and tested just once.</para>

<para>So, while our first contract (MDP itself) defines how the pieces of our distributed architecture talk to each other, our second contract defines how user applications talk to the technical framework we're going to design.</para>

<para>Majordomo has two halves, a client side and a worker side. Since we'll write both client and worker applications, we will need two APIs. Here is a sketch for the client API, using a simple object-oriented approach:</para>

<programlisting language="c">
mdcli_t *mdcli_new     (char *broker);
void     mdcli_destroy (mdcli_t **self_p);
zmsg_t  *mdcli_send    (mdcli_t *self, char *service, zmsg_t **request_p);
</programlisting>

<para>That's it. We open a session to the broker, send a request message, get a reply message back, and eventually close the connection. Here's a sketch for the worker API:</para>

<programlisting language="c">
mdwrk_t *mdwrk_new     (char *broker, char *service);
void     mdwrk_destroy (mdwrk_t **self_p);
zmsg_t  *mdwrk_recv    (mdwrk_t *self, zmsg_t *reply);
</programlisting>

<para>It's more or less symmetrical, but the worker dialog is a little different. The first time a worker does a recv(), it passes a null reply. Thereafter, it passes the current reply, and gets a new request.</para>

<para>The client and worker APIs were fairly simple to construct, since they're heavily based on the Paranoid Pirate code we already developed. Here is the client API:</para>

<example id="mdcliapi-php">
<title>Majordomo client API (mdcliapi.php)</title>
<programlisting language="php">
&lt;?php
/* =====================================================================
 * mdcliapi.php
 * 
 * Majordomo Protocol Client API
 * Implements the MDP/Client spec at http://rfc.zeromq.org/spec:7.
 * 
 * ---------------------------------------------------------------------
 * Copyright (c) 1991-2011 iMatix Corporation &lt;www.imatix.com&gt;
 * Copyright other contributors as noted in the AUTHORS file.
 * 
 * This file is part of the ZeroMQ Guide: http://zguide.zeromq.org
 * 
 * This is free software; you can redistribute it and/or modify it under
 * the terms of the GNU Lesser General Public License as published by 
 * the Free Software Foundation; either version 3 of the License, or (at 
 * your option) any later version.
 * 
 * This software is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of 
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public 
 * License along with this program. If not, see 
 * &lt;http://www.gnu.org/licenses/&gt;.
 * =====================================================================
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include_once "zmsg.php";
include_once "mdp.php";

class MDCli {
	//  Structure of our class
	//  We access these properties only via class methods
	private $broker;
	private $context;
	private $client;	//  Socket to broker
	private $verbose;	//  Print activity to stdout
	private $timeout;	//  Request timeout
	private $retries;	//  Request retries
	
	/**
	 * Constructor
	 *
	 * @param string $broker 
	 * @param boolean $verbose 
	 */
	public function __construct($broker, $verbose = false) {
		$this-&gt;broker = $broker;
		$this-&gt;context = new ZMQContext();
		$this-&gt;verbose = $verbose;
		$this-&gt;timeout = 2500;           //  msecs
		$this-&gt;retries = 3;              //  Before we abandon
		$this-&gt;connect_to_broker();
	}
	
	/**
	 * Connect or reconnect to broker
	 */
	protected function connect_to_broker() {
		if($this-&gt;client) {
			unset($this-&gt;client);
		}
		$this-&gt;client = new ZMQSocket($this-&gt;context, ZMQ::SOCKET_REQ);
		$this-&gt;client-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 0);
		$this-&gt;client-&gt;connect($this-&gt;broker);
		if($this-&gt;verbose) {
			printf("I: connecting to broker at %s...%s", $this-&gt;broker, PHP_EOL);
		}
	}
	
	/**
	 * Set request timeout
	 *
	 * @param int $timeout (msecs)
	 */
	public function set_timeout($timeout) {
		$this-&gt;timeout = $timeout;
	}
	
	/**
	 * Set request retries
	 *
	 * @param int $retries 
	 */
	public function set_retries($retries) {
		$this-&gt;retries = $retries;
	}
	
	/**
	 * Send request to broker and get reply by hook or crook
	 * Takes ownership of request message and destroys it when sent.
	 * Returns the reply message or NULL if there was no reply.
	 *
	 * @param string $service 
	 * @param Zmsg $request 
	 * @return Zmsg
	 */
	public function send($service, Zmsg $request) {
		//  Prefix request with protocol frames
		//  Frame 1: "MDPCxy" (six bytes, MDP/Client x.y)
		//  Frame 2: Service name (printable string)
		$request-&gt;push($service);
		$request-&gt;push(MDPC_CLIENT);
		if ($this-&gt;verbose) {
			printf ("I: send request to '%s' service:%s", $service, PHP_EOL);
			echo $request-&gt;__toString();
		}
	
		$retries_left = $this-&gt;retries;
		$read = $write = array();
		while($retries_left) {
			//  Send a copy, so the original is still intact if we must retry
			$msg = clone $request;
			$msg-&gt;set_socket($this-&gt;client)-&gt;send();
			
			//  Poll socket for a reply, with timeout
			$poll = new ZMQPoll();
			$poll-&gt;add($this-&gt;client, ZMQ::POLL_IN);
			$events = $poll-&gt;poll($read, $write, $this-&gt;timeout);
			
			//  If we got a reply, process it
			if($events) {
				$msg-&gt;recv();
				if ($this-&gt;verbose) {
					echo "I: received reply:", $msg-&gt;__toString(), PHP_EOL;
				}
				//  Don't try to handle errors, just assert noisily
				assert ($msg-&gt;parts() &gt;= 3);
				
				$header = $msg-&gt;pop();
				assert($header == MDPC_CLIENT);
				
				$reply_service = $msg-&gt;pop();
				assert($reply_service == $service);
				
				return $msg; //  Success
			} else if(--$retries_left) {
				if($this-&gt;verbose) {
					echo "W: no reply, reconnecting...", PHP_EOL;
				}
				//  Reconnect; the request is resent at the top of the loop
				$this-&gt;connect_to_broker();
			} else {
				echo "W: permanent error, abandoning request", PHP_EOL;
				break;	//  Give up
			}
		}
		return null;	//  No reply
	}
}
</programlisting>

</example>
<para>Here is an example test program that does 100K request-reply cycles:</para>

<example id="mdclient-php">
<title>Majordomo client application (mdclient.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Majordomo Protocol client example
 * Uses the mdcli API to hide all MDP aspects
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include_once "mdcliapi.php";

$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == '-v';
$session = new MDCli("tcp://localhost:5555", $verbose);
for($count = 0; $count &lt; 100000; $count++) {
	$request = new Zmsg(); 
	$request-&gt;body_set("Hello world");
	$reply = $session-&gt;send("echo", $request);
	if(!$reply) {
		break; // Interrupt or failure
	}
}
printf ("%d requests/replies processed", $count);
echo PHP_EOL;
</programlisting>

</example>
<para>And here is the worker API:</para>

<example id="mdwrkapi-php">
<title>Majordomo worker API (mdwrkapi.php)</title>
<programlisting language="php">
&lt;?php
/* =====================================================================
 * mdwrkapi.php
 * 
 * Majordomo Protocol Worker API
 * Implements the MDP/Worker spec at http://rfc.zeromq.org/spec:7.
 * 
 * ---------------------------------------------------------------------
 * Copyright (c) 1991-2011 iMatix Corporation &lt;www.imatix.com&gt;
 * Copyright other contributors as noted in the AUTHORS file.
 * 
 * This file is part of the ZeroMQ Guide: http://zguide.zeromq.org
 * 
 * This is free software; you can redistribute it and/or modify it under
 * the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or (at
 * your option) any later version.
 * 
 * This software is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public
 * License along with this program. If not, see
 * &lt;http://www.gnu.org/licenses/&gt;.
 * =====================================================================
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

include_once "zmsg.php";
include_once "mdp.php";

//  Reliability parameters
define("HEARTBEAT_LIVENESS", 3); //  3-5 is reasonable


//  Structure of our class
//  We access these properties only via class methods
class MDWrk {
    private $ctx;           //  Our context
    private $broker;
    private $service;   
    private $worker;        //  Socket to broker
    private $verbose = false;       //  Print activity to stdout
    
    //  Heartbeat management
    private $heartbeat_at;  //  When to send HEARTBEAT
    private $liveness;      //  How many attempts left
    private $heartbeat;     //  Heartbeat delay, msecs
    private $reconnect;     //  Reconnect delay, msecs
    
    //  Internal state
    private $expect_reply = false;
    
    //  Return address, if any
    private $reply_to;

	/**
	 * Constructor
	 *
	 * @param string $broker
	 * @param string $service
	 * @param boolean $verbose 
	 */
    public function __construct($broker, $service, $verbose = false) {
        $this-&gt;ctx = new ZMQContext();
        $this-&gt;broker = $broker;
        $this-&gt;service = $service;
        $this-&gt;verbose = $verbose;
        $this-&gt;heartbeat = 2500; // msecs
        $this-&gt;reconnect = 2500; // msecs

        $this-&gt;connect_to_broker();
    }
    
    /**
	 * Send message to broker
     * If no msg is provided, creates one internally
	 *
	 * @param string $command
	 * @param string $option
	 * @param Zmsg $msg 
	 */
    public function send_to_broker($command, $option, $msg = null) {
        $msg = $msg ? $msg : new Zmsg();
        
        if($option) {
            $msg-&gt;push($option);
        }
        $msg-&gt;push($command);
        $msg-&gt;push(MDPW_WORKER);
        $msg-&gt;push("");
        
        if($this-&gt;verbose) {
            printf("I: sending %s to broker %s", $command, PHP_EOL);
            echo $msg-&gt;__toString();
        }
        
        $msg-&gt;set_socket($this-&gt;worker)-&gt;send();
    }
    
    /**
     * Connect or reconnect to broker
     */
    public function connect_to_broker() {
        $this-&gt;worker = new ZMQSocket($this-&gt;ctx, ZMQ::SOCKET_DEALER);
        $this-&gt;worker-&gt;connect($this-&gt;broker);
        if($this-&gt;verbose) {
            printf("I: connecting to broker at %s... %s", $this-&gt;broker, PHP_EOL);
        }
        
        //  Register service with broker
        $this-&gt;send_to_broker(MDPW_READY, $this-&gt;service, NULL);
        
        //  If liveness hits zero, queue is considered disconnected
        $this-&gt;liveness = HEARTBEAT_LIVENESS;
        $this-&gt;heartbeat_at = microtime(true) + ($this-&gt;heartbeat / 1000);
    }
    
    /**
	 * Set heartbeat delay
	 *
	 * @param int $heartbeat 
	 */
    public function set_heartbeat($heartbeat) {
        $this-&gt;heartbeat = $heartbeat;
    }
    
    /**
     * Set reconnect delay
	 *
	 * @param int $reconnect 
     */
    public function set_reconnect($reconnect) {
        $this-&gt;reconnect = $reconnect;
    }
    
    /**
     * Send reply, if any, to broker and wait for next request.
	 *
	 * @param Zmsg $reply 
	 * @return Zmsg Returns if there is a request to process
     */
    public function recv($reply = null) {
        //  Format and send the reply if we were provided one
        assert ($reply || !$this-&gt;expect_reply);
        if($reply) {
            $reply-&gt;wrap($this-&gt;reply_to);
            $this-&gt;send_to_broker(MDPW_REPLY, NULL, $reply);
        }
        $this-&gt;expect_reply = true;
        
        $read = $write = array();
        while(true) {
            $poll = new ZMQPoll();
            $poll-&gt;add($this-&gt;worker, ZMQ::POLL_IN);
            $events = $poll-&gt;poll($read, $write, $this-&gt;heartbeat);

            if($events) {
                $zmsg = new Zmsg($this-&gt;worker);
                $zmsg-&gt;recv();
                
                if($this-&gt;verbose) {
                    echo "I: received message from broker:", PHP_EOL;
                    echo $zmsg-&gt;__toString();
                }
                
                $this-&gt;liveness = HEARTBEAT_LIVENESS;
                
                //  Don't try to handle errors, just assert noisily
                assert ($zmsg-&gt;parts() &gt;= 3);

                $zmsg-&gt;pop();
                $header = $zmsg-&gt;pop();
                assert($header == MDPW_WORKER);
                
                $command = $zmsg-&gt;pop();
                if($command == MDPW_REQUEST) {
                    //  We should pop and save as many addresses as there are
                    //  up to a null part, but for now, just save one...
                    $this-&gt;reply_to = $zmsg-&gt;unwrap();
                    return $zmsg;//  We have a request to process
                } else if($command == MDPW_HEARTBEAT) {
                    // Do nothing for heartbeats
                } else if($command == MDPW_DISCONNECT) {
                    $this-&gt;connect_to_broker();
                } else {
                    echo "E: invalid input message", PHP_EOL;
                    echo $zmsg-&gt;__toString();
                }
            } else if(--$this-&gt;liveness == 0) { //  Poll ended on timeout, $events being false
                if($this-&gt;verbose) {
                    echo "W: disconnected from broker - retrying...", PHP_EOL;
                }
                usleep($this-&gt;reconnect*1000);
                $this-&gt;connect_to_broker();
            }
            
            // Send HEARTBEAT if it's time
            if(microtime(true) &gt; $this-&gt;heartbeat_at) {
                $this-&gt;send_to_broker(MDPW_HEARTBEAT, NULL, NULL);
                $this-&gt;heartbeat_at = microtime(true) + ($this-&gt;heartbeat/1000);
            }
        }
    }
}
</programlisting>

</example>
<para>Here is an example test program that implements an 'echo' service:</para>

<example id="mdworker-php">
<title>Majordomo worker application (mdworker.php)</title>
<programlisting language="php">
&lt;?php
/* 
 * Majordomo Protocol worker example
 * Uses the mdwrk API to hide all MDP aspects
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

include_once "mdwrkapi.php";

$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == "-v";

$mdwrk = new MDWrk("tcp://localhost:5555", "echo", $verbose);

$reply = NULL;
while(true) {
    $request = $mdwrk-&gt;recv($reply);
    $reply = $request;      //  Echo is complex... :-)
}
</programlisting>

</example>
<para>Notes on this code:</para>

<itemizedlist>
  <listitem><para>The APIs are single-threaded. This means, for example, that the worker won't send heartbeats in the background. Happily, this is exactly what we want: if the worker application gets stuck, heartbeats will stop and the broker will stop sending requests to the worker.</para></listitem>
  <listitem><para>The worker API doesn't do an exponential back-off; it's not worth the extra complexity.</para></listitem>
  <listitem><para>The APIs don't do any error reporting. If something isn't as expected, they raise an assertion (or exception depending on the language). This is ideal for a reference implementation, so any protocol errors show immediately. For real applications, the API should be robust against invalid messages.</para></listitem>
</itemizedlist>
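<para>As a rough idea of what "robust against invalid messages" could mean, here is a minimal sketch that checks a worker-side message and reports a problem instead of asserting. It is not part of the Majordomo API: check_worker_message is a hypothetical helper, the message is a plain array of frames rather than a Zmsg, and "MDPW01" is assumed to be the value of the MDPW_WORKER constant in mdp.php.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper, for illustration only. $frames stands in for the
//  parts of a message as the worker receives them, and "MDPW01" is assumed
//  to be the value of the MDPW_WORKER constant defined in mdp.php.
function check_worker_message(array $frames, $header = "MDPW01") {
    if (count($frames) &lt; 3) {
        return "too few frames";
    }
    if ($frames[0] !== "") {
        return "missing empty delimiter frame";
    }
    if ($frames[1] !== $header) {
        return "unexpected protocol header";
    }
    return null;    //  Message looks well-formed
}

//  A real application would log the problem and drop the message
$messages = array(
    array("", "MDPW01", "\002", "client", "", "Hello"),
    array("", "HTTP/1.1", "GET"),
);
foreach ($messages as $frames) {
    $error = check_worker_message($frames);
    echo $error === null ? "I: message accepted" : "E: $error", PHP_EOL;
}
</programlisting>

<para>The first message passes and the second is rejected for its header. Returning an error string keeps the caller in charge of what to do next, which is one possible design among several.</para>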
<para>You might wonder why the worker API is manually closing its socket and opening a new one, when &Oslash;MQ will automatically reconnect a socket if the peer disappears and comes back. Look back at the Simple Pirate and Paranoid Pirate workers to understand. Although &Oslash;MQ will automatically reconnect workers if the broker dies and comes back up, this isn't sufficient to re-register the workers with the broker. There are at least two solutions I know of. The simplest, which we use here, is for the worker to monitor the connection using heartbeats, and if it decides the broker is dead, to close its socket and start afresh with a new socket. The alternative is for the broker to challenge unknown workers (when it gets a heartbeat from the worker) and ask them to re-register. That would require protocol support.</para>
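<para>The heartbeat monitoring boils down to a small counter. The following is a minimal sketch of that counter with the sockets stripped away. The function count_reconnects is a hypothetical name, and its array of booleans stands in for the outcome of each poll in the worker's recv() loop.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper, for illustration only: replays a sequence of poll
//  outcomes (true = message arrived, false = timeout) and returns how many
//  times the worker would have reconnected to the broker.
function count_reconnects(array $poll_results, $max_liveness = 3) {
    $liveness = $max_liveness;
    $reconnects = 0;
    foreach ($poll_results as $got_message) {
        if ($got_message) {
            //  Any traffic from the broker proves it is alive
            $liveness = $max_liveness;
        } else if (--$liveness == 0) {
            //  Too many silent intervals: start afresh with a new socket
            $reconnects++;
            $liveness = $max_liveness;
        }
    }
    return $reconnects;
}

//  Two timeouts, a message, then three timeouts in a row
echo count_reconnects(array(false, false, true, false, false, false)), PHP_EOL;
</programlisting>

<para>Here the message in the middle resets the counter, so only the final run of three timeouts triggers a reconnect and the script prints 1.</para>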

<para>Now let's design the Majordomo broker. Its core structure is a set of queues, one per service. We will create these queues as workers appear (we could delete them as workers disappear, but forget that for now because it gets complex). Additionally, we keep a queue of workers per service.</para>
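<para>To make that structure concrete, here is a minimal sketch of the per-service queues on their own. It is an illustration under simplifying assumptions, not the broker code: requests and workers are plain strings, and the two functions are only named after the broker roles they imitate.</para>

<programlisting language="php">
&lt;?php
//  Illustrative only: requests and workers are plain strings here,
//  whereas the real broker holds Zmsg objects and worker records.

//  Locate the queues for a service, creating them on first use
function service_require(array &amp;$services, $name) {
    if (!isset($services[$name])) {
        $services[$name] = array('requests' =&gt; array(), 'waiting' =&gt; array());
    }
}

//  Pair queued requests with waiting workers, oldest first, and
//  return the (worker, request) assignments that were made
function service_dispatch(array &amp;$services, $name) {
    $assigned = array();
    while (count($services[$name]['waiting']) &amp;&amp; count($services[$name]['requests'])) {
        $worker = array_shift($services[$name]['waiting']);
        $request = array_shift($services[$name]['requests']);
        $assigned[] = array($worker, $request);
    }
    return $assigned;
}

$services = array();
service_require($services, "echo");
$services["echo"]['requests'][] = "request-1";   //  No worker yet, so it queues
$services["echo"]['requests'][] = "request-2";
$services["echo"]['waiting'][] = "worker-A";     //  A worker appears
foreach (service_dispatch($services, "echo") as $pair) {
    printf("%s gets %s%s", $pair[0], $pair[1], PHP_EOL);
}
</programlisting>

<para>With one worker and two requests, the oldest request is handed to the worker and the second stays queued until another worker becomes available.</para>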

<para>And here is the broker:</para>

<example id="mdbroker-php">
<title>Majordomo broker (mdbroker.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Majordomo Protocol broker
 * A minimal implementation of http://rfc.zeromq.org/spec:7 and spec:8
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include_once "zmsg.php";
include_once "mdp.php";

//  We'd normally pull these from config data
define("HEARTBEAT_LIVENESS", 3);    //  3-5 is reasonable
define("HEARTBEAT_INTERVAL", 2500); //  msecs
define("HEARTBEAT_EXPIRY", HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS);

/* Main broker work happens here */
$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == '-v';
$broker = new Mdbroker($verbose);
$broker-&gt;bind("tcp://*:5555");
$broker-&gt;listen();

class Mdbroker {
    private $ctx;               //  Our context
    private $socket;            //  Socket for clients &amp; workers
    private $endpoint;          //  Broker binds to this endpoint

    private $services = array();          //  Hash of known services
    private $workers = array();           //  Hash of known workers
    private $waiting = array();           //  List of waiting workers

    private $verbose = false;   //  Print activity to stdout

    //  Heartbeat management
    private $heartbeat_at;  //  When to send HEARTBEAT

    /**
     * Constructor
     *
     * @param boolean $verbose
     */
    public function __construct($verbose = false) {
        $this-&gt;ctx = new ZMQContext();
        $this-&gt;socket = new ZMQSocket($this-&gt;ctx, ZMQ::SOCKET_ROUTER);
        $this-&gt;verbose = $verbose;
        $this-&gt;heartbeat_at = microtime(true) + (HEARTBEAT_INTERVAL/1000);
    }

    /**
     * Bind broker to endpoint, can call this multiple times
     * We use a single socket for both clients and workers.
     *
     * @param string $endpoint
     */
    public function bind($endpoint) {
        $this-&gt;socket-&gt;bind($endpoint);
        if($this-&gt;verbose) {
            printf("I: MDP broker/0.1.1 is active at %s %s", $endpoint, PHP_EOL);
        }
    }

    /**
     * This is the main listen and process loop
     */
    public function listen() {
        $read = $write = array();

        //  Get and process messages forever or until interrupted
        while(true) {
            $poll = new ZMQPoll();
            $poll-&gt;add($this-&gt;socket, ZMQ::POLL_IN);

            $events = $poll-&gt;poll($read, $write, HEARTBEAT_INTERVAL);

            //  Process next input message, if any
            if($events) {
                $zmsg = new Zmsg($this-&gt;socket);
                $zmsg-&gt;recv();
                if($this-&gt;verbose) {
                    echo "I: received message:", PHP_EOL, $zmsg-&gt;__toString();
                }

                $sender = $zmsg-&gt;pop();
                $empty = $zmsg-&gt;pop();
                $header = $zmsg-&gt;pop();

                if($header == MDPC_CLIENT) {
                    $this-&gt;client_process($sender, $zmsg);
                } else if($header == MDPW_WORKER) {
                    $this-&gt;worker_process($sender, $zmsg);
                } else {
                    echo "E: invalid message", PHP_EOL, $zmsg-&gt;__toString();
                }
            }

            //  Disconnect and delete any expired workers
            //  Send heartbeats to idle workers if needed
            if(microtime(true) &gt; $this-&gt;heartbeat_at) {
                $this-&gt;purge_workers();
                foreach($this-&gt;workers as $worker) {
                    $this-&gt;worker_send($worker, MDPW_HEARTBEAT, NULL, NULL);
                }
                $this-&gt;heartbeat_at = microtime(true) + (HEARTBEAT_INTERVAL/1000);
            }
        }
    }

    /**
     * Delete any idle workers that haven't pinged us in a while.
     * We know that workers are ordered from oldest to most recent.
     */
    public function purge_workers() {
        foreach($this-&gt;waiting as $id =&gt; $worker) {
            if(microtime(true) &lt; $worker-&gt;expiry) {
                break;      //  Worker is alive, we're done here
            }
            if($this-&gt;verbose) {
                printf("I: deleting expired worker: %s %s",
                    $worker-&gt;identity, PHP_EOL);
            }
            $this-&gt;worker_delete($worker);
        }
    }

    /**
     * Locate or create new service entry
     *
     * @param string $name
     * @return stdClass
     */
    public function service_require($name) {
        $service = isset($this-&gt;services[$name]) ? $this-&gt;services[$name] : NULL;
        if($service == NULL) {
            $service = new stdClass();
            $service-&gt;name = $name;
            $service-&gt;requests = array();
            $service-&gt;waiting = array();
            $service-&gt;workers = 0;      //  Count of registered workers
            $this-&gt;services[$name] = $service;
        }

        return $service;
    }

    /**
     * Dispatch requests to waiting workers as possible
     *
     * @param type $service
     * @param type $msg
     */
    public function service_dispatch($service, $msg) {
        if($msg) {
            $service-&gt;requests[] = $msg;
        }

        $this-&gt;purge_workers();

        while(count($service-&gt;waiting) &amp;&amp; count($service-&gt;requests)) {
            $worker = array_shift($service-&gt;waiting);
            $msg = array_shift($service-&gt;requests);
            $this-&gt;worker_send($worker, MDPW_REQUEST, NULL, $msg);
        }
    }

    /**
     * Handle internal service according to 8/MMI specification
     *
     * @param string $frame
     * @param Zmsg $msg
     */
    public function service_internal($frame, $msg) {
        if($frame == "mmi.service") {
            $name = $msg-&gt;last();
            $service = isset($this-&gt;services[$name]) ? $this-&gt;services[$name] : NULL;
            $return_code = $service &amp;&amp; !empty($service-&gt;workers) ? "200" : "404";
        } else {
            $return_code = "501";
        }

        $msg-&gt;set_last($return_code);

        //  Remove &amp; save client return envelope and insert the
        //  protocol header and service name, then rewrap envelope
        $client = $msg-&gt;unwrap();
        $msg-&gt;push($frame);
        $msg-&gt;push(MDPC_CLIENT);
        $msg-&gt;wrap($client, "");
        $msg-&gt;set_socket($this-&gt;socket)-&gt;send();
    }

    /**
     * Creates worker if necessary
     *
     * @param string $address
     * @return stdClass
     */
    public function worker_require($address) {
        $worker = isset($this-&gt;workers[$address]) ? $this-&gt;workers[$address] : NULL;
        if($worker == NULL) {
            $worker = new stdClass();
            $worker-&gt;identity = $address;
            $worker-&gt;address = $address;
            if($this-&gt;verbose) {
                printf("I: registering new worker: %s %s", $address, PHP_EOL);
            }
            $this-&gt;workers[$address] = $worker;
        }
        return $worker;
    }

    /**
     * Remove a worker
     *
     * @param stdClass $worker
     * @param boolean $disconnect
     */
    public function worker_delete($worker, $disconnect = false) {
        if($disconnect) {
            $this-&gt;worker_send($worker, MDPW_DISCONNECT, NULL, NULL);
        }

        if(isset($worker-&gt;service)) {
            $this-&gt;worker_remove_from_array($worker, $worker-&gt;service-&gt;waiting);
            $worker-&gt;service-&gt;workers--;
        }
        $this-&gt;worker_remove_from_array($worker, $this-&gt;waiting);
        unset($this-&gt;workers[$worker-&gt;identity]);
    }

    private function worker_remove_from_array($worker, &amp;$array) {
        //  Strict search compares object identity rather than contents
        $index = array_search($worker, $array, true);
        if ($index !== false) {
            unset($array[$index]);
        }
    }

    /**
     * Process message sent to us by a worker
     *
     * @param string $sender
     * @param Zmsg $msg
     */
    public function worker_process($sender, $msg) {
        $command = $msg-&gt;pop();
        $worker_ready = isset($this-&gt;workers[$sender]);
        $worker = $this-&gt;worker_require($sender);
        if($command == MDPW_READY) {
            if($worker_ready) {
                $this-&gt;worker_delete($worker, true); //  Not first command in session
            } else if(strlen($sender) &gt;= 4      // Reserved service name
                    &amp;&amp; substr($sender, 0, 4) == 'mmi.') {
                $this-&gt;worker_delete($worker, true);
            } else {
                //  Attach worker to service and mark as idle
                $service_frame = $msg-&gt;pop();
                $worker-&gt;service = $this-&gt;service_require($service_frame);
                $worker-&gt;service-&gt;workers++;
                $this-&gt;worker_waiting($worker);
            }
        } else if($command == MDPW_REPLY) {
            if($worker_ready) {
                //  Remove &amp; save client return envelope and insert the
                //  protocol header and service name, then rewrap envelope.
                $client = $msg-&gt;unwrap();
                $msg-&gt;push($worker-&gt;service-&gt;name);
                $msg-&gt;push(MDPC_CLIENT);
                $msg-&gt;wrap($client, "");
                $msg-&gt;set_socket($this-&gt;socket)-&gt;send();
                $this-&gt;worker_waiting($worker);
            } else {
                $this-&gt;worker_delete($worker, true);
            }
        } else if($command == MDPW_HEARTBEAT) {
            if($worker_ready) {
                $worker-&gt;expiry = microtime(true) + (HEARTBEAT_EXPIRY/1000);
            } else {
                $this-&gt;worker_delete($worker, true);
            }
        } else if($command == MDPW_DISCONNECT) {
            $this-&gt;worker_delete($worker, true);
        } else {
            echo "E: invalid input message", PHP_EOL, $msg-&gt;__toString();
        }
    }

    /**
     * Send message to worker
     *
     * @param stdClass $worker
     * @param string $command
     * @param mixed $option
     * @param Zmsg $msg
     */
    public function worker_send($worker, $command, $option, $msg) {
        $msg = $msg ? $msg : new Zmsg();
        //  Stack protocol envelope to start of message
        if($option) {
            $msg-&gt;push($option);
        }
        $msg-&gt;push($command);
        $msg-&gt;push(MDPW_WORKER);

        //  Stack routing envelope to start of message
        $msg-&gt;wrap($worker-&gt;address, "");

        if($this-&gt;verbose) {
            printf("I: sending %s to worker %s",
               $command, PHP_EOL);
            echo $msg-&gt;__toString();
        }

        $msg-&gt;set_socket($this-&gt;socket)-&gt;send();
    }

    /**
     * This worker is now waiting for work
     *
     * @param stdClass $worker
     */
    public function worker_waiting($worker) {
        //  Queue to broker and service waiting lists
        $this-&gt;waiting[] = $worker;
        $worker-&gt;service-&gt;waiting[] = $worker;
        $worker-&gt;expiry = microtime(true) + (HEARTBEAT_EXPIRY/1000);
        $this-&gt;service_dispatch($worker-&gt;service, NULL);
    }

    /**
     * Process a request coming from a client
     *
     * @param string $sender
     * @param Zmsg $msg
     */
    public function client_process($sender, $msg) {
        $service_frame = $msg-&gt;pop();
        $service = $this-&gt;service_require($service_frame);

        //  Set reply return address to client sender
        $msg-&gt;wrap($sender, "");
        if(substr($service_frame, 0, 4) == 'mmi.') {
            $this-&gt;service_internal($service_frame, $msg);
        } else {
            $this-&gt;service_dispatch($service, $msg);
        }
    }
}
</programlisting>

</example>
<para>This is by far the most complex example we've seen. It's almost 500 lines of code. To write this and make it somewhat robust took two days. However, this is still a short piece of code for a full service-oriented broker.</para>

<para>Notes on this code:</para>

<itemizedlist>
  <listitem><para>The Majordomo Protocol lets us handle both clients and workers on a single socket. This is nicer for those deploying and managing the broker: it just sits on one &Oslash;MQ endpoint rather than the two that most proxies need.</para></listitem>
  <listitem><para>The broker implements all of MDP/0.1 properly (as far as I know), including disconnection if the broker sends invalid commands, heartbeating, and the rest.</para></listitem>
  <listitem><para>It can be extended to run multiple threads, each managing one socket and one set of clients and workers. This could be interesting for segmenting large architectures. The C code is already organized around a broker class to make this trivial.</para></listitem>
  <listitem><para>A primary-fail-over or live-live broker reliability model is easy, since the broker essentially has no state except service presence. It's up to clients and workers to choose another broker if their first choice isn't up and running.</para></listitem>
  <listitem><para>The examples use 2.5-second heartbeats, mainly to reduce the amount of output when you enable tracing. Realistic values would be lower for most LAN applications. However, any retry has to be slow enough to allow for a service to restart, say 10 seconds at least.</para></listitem>
</itemizedlist>
<para>We later improved and extended the protocol and the Majordomo implementation, which now sits in its own GitHub project. If you want a properly usable Majordomo stack, use the GitHub project.</para>

</sect1>
<sect1>
<title>Asynchronous Majordomo Pattern</title>
<para>The Majordomo implementation in the previous section is simple and stupid. The client is just the original Simple Pirate, wrapped up in a sexy API. When I fire up a client, broker, and worker on a test box, it can process 100,000 requests in about 14 seconds. That is partly due to the code, which cheerfully copies message frames around as if CPU cycles were free. But the real problem is that we're doing network round-trips. &Oslash;MQ disables <ulink url="http://en.wikipedia.org/wiki/Nagles_algorithm">Nagle's algorithm</ulink>, but round-tripping is still slow.</para>

<para>Theory is great in theory, but in practice, practice is better. Let's measure the actual cost of round-tripping with a simple test program. This sends a bunch of messages, first waiting for a reply to each message, and second as a batch, reading all the replies back as a batch. Both approaches do the same work, but they give very different results. We mock up a client, broker, and worker:</para>

<example id="tripping-php">
<title>Round-trip demonstrator (tripping.php)</title>
<programlisting language="php">
&lt;?php

/*
 * Round-trip demonstrator
 * 
 * While this example runs in a single process, that is just to make
 * it easier to start and stop the example. Each thread has its own
 * context and conceptually acts as a separate process.
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "zmsg.php";

function client_task() {
	$context = new ZMQContext();
	$client = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
	$client-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, "C");
	$client-&gt;connect("tcp://localhost:5555");
	
	echo "Setting up test...", PHP_EOL;
	usleep(10000);
	
	echo "Synchronous round-trip test...", PHP_EOL;
	$start = microtime(true);
	$text = "HELLO";
	for($requests = 0; $requests &lt; 10000; $requests++) {
		$client-&gt;send($text);
		$msg = $client-&gt;recv();
	}
	printf (" %d calls/second%s",
		(1000 * 10000) / (int) ((microtime(true) - $start) * 1000), 
		PHP_EOL);
	
	echo "Asynchronous round-trip test...", PHP_EOL;
	$start = microtime(true);
	for($requests = 0; $requests &lt; 100000; $requests++) {
		$client-&gt;send($text);
	}
	
	for($requests = 0; $requests &lt; 100000; $requests++) {
		$client-&gt;recv();
	}
	
	printf (" %d calls/second%s",
		(1000 * 100000) / (int) ((microtime(true) - $start) * 1000), 
		PHP_EOL);
}


function worker_task() {
	$context = new ZMQContext();
	$worker = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
	$worker-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, "W");
	$worker-&gt;connect("tcp://localhost:5556");
	while(true) {
		$zmsg = new Zmsg($worker);
		$zmsg-&gt;recv();
		$zmsg-&gt;send();
	}
}

function broker_task() {
	//  Prepare our context and sockets
	$context = new ZMQContext();
	$frontend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$backend = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
	$frontend-&gt;bind("tcp://*:5555");
	$backend-&gt;bind("tcp://*:5556");
	
	//  Initialize poll set
	$poll = new ZMQPoll();
	$poll-&gt;add($frontend, ZMQ::POLL_IN);
	$poll-&gt;add($backend, ZMQ::POLL_IN);
	$read = $write = array();
	while(true) {
		$events = $poll-&gt;poll($read, $write);
		foreach($read as $socket) {
			$zmsg = new Zmsg($socket);
			$zmsg-&gt;recv();
			if($socket === $frontend) {
				$zmsg-&gt;push("W");
				$zmsg-&gt;set_socket($backend)-&gt;send();
			} else if($socket === $backend) {
				$zmsg-&gt;pop();
				$zmsg-&gt;push("C");
				$zmsg-&gt;set_socket($frontend)-&gt;send();
			}
			
		}
	}
}

$wpid = pcntl_fork();
if($wpid == 0) {
	worker_task();
	exit;
}
$bpid = pcntl_fork();
if($bpid == 0) {
	broker_task();
	exit;
}
sleep(1);
client_task();
posix_kill($wpid, SIGKILL);
posix_kill($bpid, SIGKILL);
</programlisting>

</example>
<para>On my development box, this program says:</para>

<screen>Setting up test...
Synchronous round-trip test...
 9057 calls/second
Asynchronous round-trip test...
 173010 calls/second
</screen>

<para>Note that the client thread does a small pause before starting. This is to get around one of the "features" of the router socket: if you send a message with the address of a peer that's not yet connected, the message gets discarded. In this example we don't use the load-balancing mechanism, so without the sleep, if the worker thread is too slow to connect, it will lose messages, making a mess of our test.</para>

<para>As we see, round-tripping in the simplest case is about 20 times slower than the asynchronous, "shove it down the pipe as fast as it'll go" approach. Let's see if we can apply this to Majordomo to make it faster.</para>

<para>First, we modify the client API to have separate send and recv methods:</para>

<programlisting language="c">
mdcli_t *mdcli_new     (char *broker);
void     mdcli_destroy (mdcli_t **self_p);
int      mdcli_send    (mdcli_t *self, char *service, zmsg_t **request_p);
zmsg_t  *mdcli_recv    (mdcli_t *self);
</programlisting>

<para>It's literally a few minutes' work to refactor the synchronous client API to become asynchronous:</para>

<example id="mdcliapi2-php">
<title>Majordomo asynchronous client API (mdcliapi2.php)</title>
<programlisting language="php">
&lt;?php
/*  =====================================================================
 * mdcliapi2.php
 * 
 * Majordomo Protocol Client API (async version)
 * Implements the MDP/Client spec at http://rfc.zeromq.org/spec:7.
 * 
 * ---------------------------------------------------------------------
 * Copyright (c) 1991-2011 iMatix Corporation &lt;www.imatix.com&gt;
 * Copyright other contributors as noted in the AUTHORS file.
 * 
 * This file is part of the ZeroMQ Guide: http://zguide.zeromq.org
 * 
 * This is free software; you can redistribute it and/or modify it under
 * the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or (at
 * your option) any later version.
 * 
 * This software is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public
 * License along with this program. If not, see
 * &lt;http://www.gnu.org/licenses/&gt;.
 * =====================================================================
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */

include_once "zmsg.php";
include_once "mdp.php";

class MDCli {
	//  Structure of our class
	//  We access these properties only via class methods
	private $broker;
	private $context;
	private $client;	//  Socket to broker
	private $verbose;	//  Print activity to stdout
	private $timeout;	//  Request timeout
	private $retries;	//  Request retries
	
	/**
	 * Constructor
	 *
	 * @param string $broker 
	 * @param boolean $verbose 
	 */
	public function __construct($broker, $verbose = false) {
		$this-&gt;broker = $broker;
		$this-&gt;context = new ZMQContext();
		$this-&gt;verbose = $verbose;
		$this-&gt;timeout = 2500;           //  msecs
		$this-&gt;connect_to_broker();
	}
	
	/**
	 * Connect or reconnect to broker
	 */
	protected function connect_to_broker() {
		if($this-&gt;client) {
			unset($this-&gt;client);
		}
		$this-&gt;client = new ZMQSocket($this-&gt;context, ZMQ::SOCKET_DEALER);
		$this-&gt;client-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 0);
		$this-&gt;client-&gt;connect($this-&gt;broker);
		if($this-&gt;verbose) {
			printf("I: connecting to broker at %s...%s", $this-&gt;broker, PHP_EOL);
		}
	}

	/**
	 * Set request timeout
	 *
	 * @param int $timeout (msecs)
	 */
	public function set_timeout($timeout) {
		$this-&gt;timeout = $timeout;
	}
	
	/**
	 * Send request to broker
	 * Takes ownership of request message and destroys it when sent.
	 *
	 * @param string $service 
	 * @param Zmsg $request 
	 */
	public function send($service, Zmsg $request) {
	    //  Prefix request with protocol frames
        //  Frame 0: empty (REQ emulation)
        //  Frame 1: "MDPCxy" (six bytes, MDP/Client x.y)
        //  Frame 2: Service name (printable string)
        $request-&gt;push($service);
        $request-&gt;push(MDPC_CLIENT);
        $request-&gt;push("");
        if($this-&gt;verbose) {
            printf ("I: send request to '%s' service: %s", $service, PHP_EOL);
            echo $request-&gt;__toString();
        }
        $request-&gt;set_socket($this-&gt;client)-&gt;send();
	}
	
	/**
	 * Returns the reply message or NULL if there was no reply. Does not
	 * attempt to recover from a broker failure, this is not possible
	 * without storing all unanswered requests and resending them all...
	 *
	 */
	public function recv() {
	    $read = $write = array();
	    
	    //  Poll socket for a reply, with timeout
		$poll = new ZMQPoll();
		$poll-&gt;add($this-&gt;client, ZMQ::POLL_IN);
		$events = $poll-&gt;poll($read, $write, $this-&gt;timeout);
		
		//  If we got a reply, process it
		if($events) {
			$msg = new Zmsg($this-&gt;client);
			$msg-&gt;recv();
			if ($this-&gt;verbose) {
				echo "I: received reply:", $msg-&gt;__toString(), PHP_EOL;
			}
			//  Don't try to handle errors, just assert noisily
			assert ($msg-&gt;parts() &gt;= 4);
			
			$msg-&gt;pop(); // empty
			
			$header = $msg-&gt;pop();
			assert($header == MDPC_CLIENT);
			
			$reply_service = $msg-&gt;pop();
			
			return $msg; //  Success
		} else {
			echo "W: permanent error, abandoning request", PHP_EOL;
			return;	//  Give up
		}
	}
}
</programlisting>

</example>
<para>The differences are:</para>

<itemizedlist>
  <listitem><para>We use a DEALER socket instead of REQ, so we emulate REQ with an empty delimiter frame before each request and each response.</para></listitem>
  <listitem><para>We don't retry requests; if the application needs to retry, it can do this itself.</para></listitem>
  <listitem><para>We break the synchronous <literal>send</literal> method into separate <literal>send</literal> and <literal>recv</literal> methods.</para></listitem>
  <listitem><para>The <literal>send</literal> method is asynchronous and returns immediately after sending. The caller can thus send a number of messages before getting a response.</para></listitem>
  <listitem><para>The <literal>recv</literal> method waits for (with a timeout) one response and returns that to the caller.</para></listitem>
</itemizedlist>
<para>And here's the corresponding client test program, which sends 100,000 messages and then receives 100,000 back:</para>

<example id="mdclient2-php">
<title>Majordomo client application (mdclient2.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Majordomo Protocol client example - asynchronous
 * Uses the mdcli API to hide all MDP aspects
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include_once "mdcliapi2.php";

$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == '-v';
$session = new MDCli("tcp://localhost:5555", $verbose);
for($count = 0; $count &lt; 100000; $count++) {
	$request = new Zmsg();
	$request-&gt;body_set("Hello world");
	$session-&gt;send("echo", $request);
}

for($count = 0; $count &lt; 100000; $count++) {
    $reply = $session-&gt;recv();
	if(!$reply) {
		break; // Interrupt or failure
	}
}
printf ("%d replies received", $count);
echo PHP_EOL;
</programlisting>

</example>
<para>The broker and worker are unchanged, since we've not modified the protocol at all. We see an immediate improvement in performance. Here's the synchronous client chugging through 100K request-reply cycles:</para>

<screen>$ time mdclient
100000 requests/replies processed

real    0m14.088s
user    0m1.310s
sys     0m2.670s
</screen>

<para>And here's the asynchronous client, with a single worker:</para>

<screen>$ time mdclient2
100000 replies received

real    0m8.730s
user    0m0.920s
sys     0m1.550s
</screen>

<para>About 1.6 times as fast. Not bad, but let's fire up 10 workers and see how it handles the traffic:</para>

<screen>$ time mdclient2
100000 replies received

real    0m3.863s
user    0m0.730s
sys     0m0.470s
</screen>

<para>It isn't fully asynchronous since workers get their messages on a strict last-used basis. But it will scale better with more workers. On my PC, after eight or so workers it doesn't get any faster. Four cores only stretch so far. But we got close to a 4x improvement in throughput with just a few minutes' work. The broker is still unoptimized. It spends most of its time copying message frames around, instead of doing zero copy, which it could. But we're getting 25K reliable request/reply calls a second, with pretty low effort.</para>

<para>However, the asynchronous Majordomo pattern isn't all roses. It has a fundamental weakness, namely that it cannot survive a broker crash without more work. If you look at the mdcliapi2 code you'll see it does not attempt to reconnect after a failure. A proper reconnect would require:</para>

<itemizedlist>
  <listitem><para>A number on every request and a matching number on every reply, which would ideally require a change to the protocol to enforce.</para></listitem>
  <listitem><para>Tracking and holding onto all outstanding requests in the client API, i.e., those for which no reply has yet been received.</para></listitem>
  <listitem><para>In case of fail-over, for the client API to <emphasis>resend</emphasis> all outstanding requests to the broker.</para></listitem>
</itemizedlist>
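<para>The bookkeeping in that list can be sketched in a few lines of plain PHP. This is only an illustration under stated assumptions: the <literal>PendingRequests</literal> class and its methods are hypothetical names that do not exist in mdcliapi2, and the sequence number would still need a frame of its own, which MDP does not define.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper, not part of mdcliapi2.php: tracks requests that
//  have been sent but not yet answered, keyed by a sequence number.
class PendingRequests {
    private $next = 1;            //  Next sequence number to hand out
    private $pending = array();   //  Sequence number =&gt; request body

    //  Remember a request and return the number to stamp it with
    public function track($body) {
        $seq = $this-&gt;next++;
        $this-&gt;pending[$seq] = $body;
        return $seq;
    }

    //  A reply arrived: forget the matching request.
    //  Returns false for unknown or duplicate reply numbers.
    public function acknowledge($seq) {
        if (!isset($this-&gt;pending[$seq])) {
            return false;
        }
        unset($this-&gt;pending[$seq]);
        return true;
    }

    //  After a fail-over, everything still held here must be resent
    public function outstanding() {
        return $this-&gt;pending;
    }
}

$requests = new PendingRequests();
$first = $requests-&gt;track("Hello");
$second = $requests-&gt;track("World");
$requests-&gt;acknowledge($first);

//  Only the unanswered request remains to be resent
foreach ($requests-&gt;outstanding() as $seq =&gt; $body) {
    printf("resend #%d: %s%s", $seq, $body, PHP_EOL);
}
</programlisting>

<para>Keeping the request bodies in memory is what makes a resend possible, and it is also where the extra cost comes from: the client now holds state for every request in flight.</para>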
<para>It's not a deal breaker, but it does show that performance often means complexity. Is this worth doing for Majordomo? It depends on your use case. For a name lookup service you call once per session, no. For a web front-end serving thousands of clients, probably yes.</para>

</sect1>
<sect1>
<title>Service Discovery</title>
<para>So, we have a nice service-oriented broker, but we have no way of knowing whether a particular service is available or not. We know whether a request failed, but we don't know why. It is useful to be able to ask the broker, "is the echo service running?" The most obvious way would be to modify our MDP/Client protocol to add commands to ask this. But MDP/Client has the great charm of being simple. Adding service discovery to it would make it as complex as the MDP/Worker protocol.</para>

<para>Another option is to do what email does, and ask that undeliverable requests be returned. This can work well in an asynchronous world, but it also adds complexity. We need ways to distinguish returned requests from replies, and to handle these properly.</para>

<para>Let's try to use what we've already built, building on top of MDP instead of modifying it. Service discovery is, itself, a service. It might indeed be one of several management services, such as "disable service X", "provide statistics", and so on. What we want is a general, extensible solution that doesn't affect the protocol or existing applications.</para>

<para>So here's a small RFC that layers this on top of MDP: <ulink url="http://rfc.zeromq.org/spec:8">the Majordomo Management Interface (MMI)</ulink>. We already implemented it in the broker, though unless you read the whole thing you probably missed that. I'll explain how it works in the broker:</para>

<itemizedlist>
  <listitem><para>When a client requests a service that starts with "mmi.", instead of routing this to a worker, we handle it internally.</para></listitem>
  <listitem><para>We handle just one service in this broker, which is "mmi.service", the service discovery service.</para></listitem>
  <listitem><para>The payload for the request is the name of an external service (a real one, provided by a worker).</para></listitem>
  <listitem><para>The broker returns "200" (OK) or "404" (Not found) depending on whether there are workers registered for that service, or not.</para></listitem>
</itemizedlist>
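<para>As a rough sketch of that decision logic, assume the broker keeps a map from service names to the workers registered for them. The function and variable names below are hypothetical and are not taken from the broker source, and the "501" code for <literal>mmi.</literal> services the broker does not implement is an assumption.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical sketch: function and variable names are illustrative

//  True if the broker should handle this service name internally
function is_mmi_service($service) {
    return strpos($service, "mmi.") === 0;
}

//  Returns the status code for an MMI request
function mmi_reply_code($service, $payload, array $workers_by_service) {
    if ($service !== "mmi.service") {
        return "501";   //  Assumption: code for an unimplemented mmi.* service
    }
    //  The payload names the external service the client asks about
    return empty($workers_by_service[$payload]) ? "404" : "200";
}

//  Service name =&gt; identities of the workers registered for it
$workers_by_service = array(
    "echo"   =&gt; array("worker-1", "worker-2"),
    "coffee" =&gt; array(),    //  Registered earlier, no workers right now
);

foreach (array("echo", "coffee", "tea") as $name) {
    $code = mmi_reply_code("mmi.service", $name, $workers_by_service);
    printf("%s: %s%s", $name, $code, PHP_EOL);
}
</programlisting>

<para>In this sketch a service with an empty worker list is reported as "404", the same as a service the broker has never heard of.</para>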
<para>Here's how we use the service discovery in an application:</para>

<example id="mmiecho-php">
<title>Service discovery over Majordomo (mmiecho.php)</title>
<programlisting language="php">
&lt;?php
/*
 * MMI echo query example
 * 
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include "mdcliapi.php";

$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == '-v';
$session = new MDCli("tcp://localhost:5555", $verbose);

//  This is the service we want to look up
$request = new Zmsg();
$request-&gt;body_set("echo");

//  This is the service we send our request to
$reply = $session-&gt;send("mmi.service", $request);

if($reply) {
    $reply_code = $reply-&gt;pop();
    printf ("Lookup echo service: %s %s", $reply_code, PHP_EOL);
}
</programlisting>

</example>
<para>Try this with and without a worker running, and you should see the little program report "200" or "404" accordingly. The implementation of MMI in our example broker is flimsy. For example, if a worker disappears, services remain "present". In practice, a broker should remove services that have no workers after some configurable timeout.</para>

</sect1>
<sect1>
<title>Idempotent Services</title>
<para>Idempotency is not something you take a pill for. What it means is that it's safe to repeat an operation. Checking the clock is idempotent. Lending one's credit card to one's children is not. While many client-to-server use cases are idempotent, some are not. Examples of idempotent use cases include:</para>

<itemizedlist>
  <listitem><para>Stateless task distribution, i.e. a pipeline where the servers are stateless workers that compute a reply based purely on the state provided by a request. In such a case it's safe (though inefficient) to execute the same request many times.</para></listitem>
  <listitem><para>A name service that translates logical addresses into endpoints to bind or connect to. In such a case it's safe to make the same lookup request many times.</para></listitem>
</itemizedlist>
<para>And here are examples of non-idempotent use cases:</para>

<itemizedlist>
  <listitem><para>A logging service. One does not want the same log information recorded more than once.</para></listitem>
  <listitem><para>Any service that has impact on downstream nodes, e.g., sends on information to other nodes. If that service gets the same request more than once, downstream nodes will get duplicate information.</para></listitem>
  <listitem><para>Any service that modifies shared data in some non-idempotent way; e.g., a service that debits a bank account is definitely not idempotent.</para></listitem>
</itemizedlist>
<para>When our server applications are not idempotent, we have to think more carefully about when exactly they might crash. If an application dies when it's idle, or while it's processing a request, that's usually fine. We can use database transactions to make sure a debit and a credit are always done together, if at all. If the server dies while sending its reply, that's a problem, because as far as it's concerned, it has done its work.</para>

<para>If the network dies just as the reply is making its way back to the client, the same problem arises. The client will think the server died and will resend the request, and the server will do the same work twice. Which is not what we want.</para>

<para>To handle non-idempotent operations, use the fairly standard solution of detecting and rejecting duplicate requests. This means:</para>

<itemizedlist>
  <listitem><para>The client must stamp every request with a unique client identifier and a unique message number.</para></listitem>
  <listitem><para>The server, before sending back a reply, stores it using the combination of client id and message number as a key.</para></listitem>
  <listitem><para>The server, when getting a request from a given client, first checks whether it has a reply for that client id and message number. If so, it does not process the request but just resends the reply.</para></listitem>
</itemizedlist>
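<para>The three steps above can be sketched as a small in-memory server. This is a minimal illustration, not a production design: the <literal>DedupServer</literal> class is a hypothetical name, the "debit" stands in for any non-idempotent operation, and a real server would keep the reply cache in durable storage and expire old entries.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical in-memory sketch of duplicate request detection
class DedupServer {
    private $replies = array();   //  Client id =&gt; message number =&gt; reply
    private $balance = 100;       //  State changed by a non-idempotent debit

    public function handle($client_id, $msg_number, $amount) {
        //  Seen this request before? Resend the stored reply, do no work
        if (isset($this-&gt;replies[$client_id][$msg_number])) {
            return $this-&gt;replies[$client_id][$msg_number];
        }
        //  First time: do the work, then store the reply before sending it
        $this-&gt;balance -= $amount;
        $reply = "OK balance=" . $this-&gt;balance;
        $this-&gt;replies[$client_id][$msg_number] = $reply;
        return $reply;
    }
}

$server = new DedupServer();
$first = $server-&gt;handle("client-A", 1, 30);   //  Debits once
$retry = $server-&gt;handle("client-A", 1, 30);   //  Same reply, no second debit
$next  = $server-&gt;handle("client-A", 2, 30);   //  New request, debits again

echo $first, PHP_EOL, $retry, PHP_EOL, $next, PHP_EOL;
</programlisting>

<para>The retried request gets back exactly the reply that was stored the first time, so the client cannot tell a resend from a fresh answer, and the account is debited only once per message number.</para>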
</sect1>
<sect1>
<title>Disconnected Reliability (Titanic Pattern)</title>
<para>Once you realize that Majordomo is a "reliable" message broker, you might be tempted to add some spinning rust (that is, ferrous-based hard disk platters). After all, this works for all the enterprise messaging systems. It's such a tempting idea that it's a little sad to have to be negative toward it. But brutal cynicism is one of my specialties. So, some reasons you don't want rust-based brokers sitting in the center of your architecture are:</para>

<itemizedlist>
  <listitem><para>As you've seen, the Lazy Pirate client performs surprisingly well. It works across a whole range of architectures, from direct client-to-server to distributed queue proxies. It does tend to assume that workers are stateless and idempotent. But we can work around that limitation without resorting to rust.</para></listitem>
  <listitem><para>Rust brings a whole set of problems, from slow performance to additional pieces that you have to manage, repair, and handle 6am panics from, as they inevitably break at the start of daily operations. The beauty of the Pirate patterns in general is their simplicity. They won't crash. And if you're still worried about the hardware, you can move to a peer-to-peer pattern that has no broker at all. I'll explain later in this chapter.</para></listitem>
</itemizedlist>
<para>Having said this, however, there is one sane use case for rust-based reliability, which is an asynchronous disconnected network. It solves a major problem with Pirate, namely that a client has to wait for an answer in real-time. If clients and workers are only sporadically connected (think of email as an analogy), we can't use a stateless network between clients and workers. We have to put state in the middle.</para>

<para>So, here's the Titanic pattern (<xref linkend="figure-51"/>), in which we write messages to disk to ensure they never get lost, no matter how sporadically clients and workers are connected. As we did for service discovery, we're going to layer Titanic on top of MDP rather than extend it. It's wonderfully lazy because it means we can implement our fire-and-forget reliability in a specialized worker, rather than in the broker. This is excellent for several reasons:</para>

<itemizedlist>
  <listitem><para>It is <emphasis>much</emphasis> easier because we divide and conquer: the broker handles message routing and the worker handles reliability.</para></listitem>
  <listitem><para>It lets us mix brokers written in one language with workers written in another.</para></listitem>
  <listitem><para>It lets us evolve the fire-and-forget technology independently.</para></listitem>
</itemizedlist>
<para>The only downside is that there's an extra network hop between broker and hard disk. The benefits are easily worth it.</para>

<para>There are many ways to make a persistent request-reply architecture. We'll aim for one that is simple and painless. The simplest design I could come up with, after playing with this for a few hours, is a "proxy service". That is, Titanic doesn't affect workers at all. If a client wants a reply immediately, it talks directly to a service and hopes the service is available. If a client is happy to wait a while, it talks to Titanic instead and asks, "hey, buddy, would you take care of this for me while I go buy my groceries?"</para>

<figure id="figure-51">
    <title>The Titanic Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig51.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Titanic is thus both a worker and a client. The dialog between client and Titanic goes along these lines:</para>

<itemizedlist>
  <listitem><para>Client: please accept this request for me. Titanic: OK, done.</para></listitem>
  <listitem><para>Client: do you have a reply for me? Titanic: Yes, here it is. Or, no, not yet.</para></listitem>
  <listitem><para>Client: ok, you can wipe that request now, I'm happy. Titanic: OK, done.</para></listitem>
</itemizedlist>
<para>Whereas the dialog between Titanic and broker and worker goes like this:</para>

<itemizedlist>
  <listitem><para>Titanic: hey, broker, is there a Coffee service? Broker: uhm, yeah, seems like.</para></listitem>
  <listitem><para>Titanic: hey, Coffee service, please handle this for me.</para></listitem>
  <listitem><para>Coffee: sure, here you are.</para></listitem>
  <listitem><para>Titanic: sweeeeet!</para></listitem>
</itemizedlist>
<para>You can work through this, and the possible failure scenarios. If a worker crashes while processing a request, Titanic retries, indefinitely. If a reply gets lost somewhere, Titanic will retry. If the request gets processed but the client doesn't get the reply, it will ask again. If Titanic crashes while processing a request, or a reply, the client will try again. As long as requests are fully committed to safe storage, work can't get lost.</para>

<para>The handshaking is pedantic, but can be pipelined, i.e., clients can use the asynchronous Majordomo pattern to do a lot of work and then get the responses later.</para>

<para>We need some way for a client to request <emphasis>its</emphasis> replies. We'll have many clients asking for the same services, and clients disappear and reappear with different identities. So here is a simple, reasonably secure solution:</para>

<itemizedlist>
  <listitem><para>Every request generates a universally unique ID (UUID), which Titanic returns to the client after it has queued the request.</para></listitem>
  <listitem><para>When a client asks for a reply, it must specify the UUID for the original request.</para></listitem>
</itemizedlist>
<para>In a realistic case the client would want to store its request UUIDs safely, e.g. in a local database.</para>

<para>Before we jump off and write yet another formal specification (fun, fun!) let's consider how the client talks to Titanic. One way is to use a single service and send it three different request types. Another way, which seems simpler, is to use three services:</para>

<itemizedlist>
  <listitem><para><literal>titanic.request</literal> - store a request message, and return a UUID for the request.</para></listitem>
  <listitem><para><literal>titanic.reply</literal> - fetch a reply, if available, for a given request UUID.</para></listitem>
  <listitem><para><literal>titanic.close</literal> - confirm that a reply has been stored and processed.</para></listitem>
</itemizedlist>
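<para>To make the three services concrete, here is a rough in-memory model of how they relate to each other. It is a sketch under stated assumptions: the <literal>TitanicModel</literal> class and its <literal>complete</literal> method are hypothetical, nothing is written to disk, the identifier is a stand-in for a real UUID, and the status codes (200 OK, 300 pending, 400 unknown) are the ones used by the Titanic example in this chapter.</para>

<programlisting language="php">
&lt;?php
//  Hypothetical in-memory model of the three Titanic services
class TitanicModel {
    private $requests = array();   //  UUID =&gt; request body
    private $replies = array();    //  UUID =&gt; reply body

    //  titanic.request: store the request, return a UUID for it
    public function request($body) {
        $uuid = md5(uniqid(mt_rand(), true));   //  Stand-in for a real UUID
        $this-&gt;requests[$uuid] = $body;
        return array("200", $uuid);
    }

    //  titanic.reply: fetch the reply, if available, for a request UUID
    public function reply($uuid) {
        if (isset($this-&gt;replies[$uuid])) {
            return array("200", $this-&gt;replies[$uuid]);
        }
        if (isset($this-&gt;requests[$uuid])) {
            return array("300");   //  Pending
        }
        return array("400");       //  Unknown
    }

    //  titanic.close: the client is done, forget request and reply
    public function close($uuid) {
        unset($this-&gt;requests[$uuid], $this-&gt;replies[$uuid]);
        return array("200");
    }

    //  Stands in for a worker finishing the job (not a client-facing service)
    public function complete($uuid, $reply) {
        $this-&gt;replies[$uuid] = $reply;
    }
}

$titanic = new TitanicModel();
list($status, $uuid) = $titanic-&gt;request("Hello world");
$pending = $titanic-&gt;reply($uuid);
$titanic-&gt;complete($uuid, "Hello world");
$done = $titanic-&gt;reply($uuid);
$titanic-&gt;close($uuid);
$gone = $titanic-&gt;reply($uuid);

printf("%s %s %s%s", $pending[0], $done[0], $gone[0], PHP_EOL);
</programlisting>

<para>The client only ever needs to hold on to the UUID. Everything else lives on the Titanic side until the client explicitly closes the request.</para>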
<para>We'll just make a multithreaded worker, which as we've seen from our multithreading experience with &Oslash;MQ, is trivial. However let's first sketch what Titanic would look like in terms of &Oslash;MQ messages and frames. This gives us the <ulink url="http://rfc.zeromq.org/spec:9">Titanic Service Protocol (TSP)</ulink>.</para>

<para>Using TSP is clearly more work for client applications than accessing a service directly via MDP. Here's the shortest robust "echo" client example:</para>

<example id="ticlient-php">
<title>Titanic client example (ticlient.php)</title>
<programlisting language="php">
&lt;?php
/* Titanic client example
 * Implements client side of http://rfc.zeromq.org/spec:9
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
include_once "mdcliapi.php";

/**
 * Calls a TSP service                                          
 * Returns response if successful (status code 200 OK), else NULL
 *
 * @param MDCli $session 
 * @param string $service 
 * @param Zmsg $request 
 */
function s_service_call($session, $service, $request) {
    $reply = $session-&gt;send($service, $request);
    if($reply) {
        $status = $reply-&gt;pop();
        if($status == "200") {
            return $reply;
        } else if($status == "400") {
            echo "E: client fatal error, aborting", PHP_EOL;
            exit(1);
        } else if($status == "500") {
            echo "E: server fatal error, aborting", PHP_EOL;
            exit(1);
        }
    } else {
        exit(0);
    }
    
    return NULL; //  Didn't succeed, don't care why not
}

$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == '-v';
$session = new Mdcli("tcp://localhost:5555", $verbose);

//  1. Send 'echo' request to Titanic
$request = new Zmsg();
$request-&gt;push("Hello world");
$request-&gt;push("echo");
$reply = s_service_call($session, "titanic.request", $request);

$uuid = null;
if($reply) {
    $uuid = $reply-&gt;pop();
    printf("I: request UUID %s %s", $uuid, PHP_EOL);
}
    
//  2. Wait until we get a reply
while(true) {
    usleep(100000);
    $request = new Zmsg();
    $request-&gt;push($uuid);
    $reply = s_service_call ($session, "titanic.reply", $request);
    
    if($reply) {
        $reply_string = $reply-&gt;last();
        printf ("Reply: %s %s", $reply_string, PHP_EOL);
        
        //  3. Close request
        $request = new Zmsg();
        $request-&gt;push($uuid);
        $reply = s_service_call ($session, "titanic.close", $request);
        break;
    } else {
        echo "I: no reply yet, trying again...", PHP_EOL;
        usleep (5000000);     //  Try again in 5 seconds 
    }
}
</programlisting>

</example>
<para>Of course this can be, and should be, wrapped up in some kind of framework or API. It's not healthy to ask average application developers to learn the full details of messaging: it hurts their brains, costs time, and offers too many ways to make buggy complexity. Additionally, it makes it hard to add intelligence.</para>

<para>For example, this client blocks on each request whereas in a real application we'd want to be doing useful work while tasks are executed. This requires some non-trivial plumbing, to build a background thread and talk to that cleanly. It's the kind of thing you want to wrap in a nice simple API that the average developer cannot misuse. It's the same approach that we used for Majordomo.</para>

<para>Here's the Titanic implementation. This server handles the three services using three threads, as proposed. It does full persistence to disk using the most brute-force approach possible: one file per message. It's so simple it's scary. The only complex part is that it keeps a separate 'queue' of all requests to avoid reading the directory over and over:</para>

<example id="titanic-php">
<title>Titanic broker example (titanic.php)</title>
<programlisting language="php">
&lt;?php
/*
 * Titanic service
 * 
 * Implements server side of http://rfc.zeromq.org/spec:9
 * @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
 */
 
include_once "mdwrkapi.php";
include_once "mdcliapi.php";

/*  Return a new UUID as a printable character string */
function s_generate_uuid() {
    $uuid = sprintf('%04x%04x%04x%03x4%04x%04x%04x%04x',
        mt_rand(0, 65535), mt_rand(0, 65535), // 32 bits for "time_low"
        mt_rand(0, 65535), // 16 bits for "time_mid"
        mt_rand(0, 4095),  // 12 bits before the 0100 of (version) 4 for "time_hi_and_version"
        bindec(substr_replace(sprintf('%016b', mt_rand(0, 65535)), '01', 6, 2)),
            // 8 bits, the last two of which (positions 6 and 7) are 01, for "clk_seq_hi_res"
            // (hence, the 2nd hex digit after the 3rd hyphen can only be 1, 5, 9 or d)
            // 8 bits for "clk_seq_low"
        mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535) // 48 bits for "node" 
    ); 
    return $uuid;
}

define("TITANIC_DIR", ".titanic");

/**
 * Returns freshly allocated request filename for given UUID
 */
function s_request_filename($uuid) {
    return TITANIC_DIR . "/" . $uuid . ".req";
}


/**
 * Returns freshly allocated reply filename for given UUID
 */
function s_reply_filename($uuid) {
    return TITANIC_DIR . "/" . $uuid . ".rep";
}


/**
 * Titanic request service
 */
function titanic_request($pipe) {
    $worker = new Mdwrk("tcp://localhost:5555", "titanic.request");
    $reply = null;
    
    while(true) {
        //  Get next request from broker
        $request = $worker-&gt;recv($reply);
        
        //  Ensure message directory exists
        if(!is_dir(TITANIC_DIR)) {
            mkdir(TITANIC_DIR);
        }
        
        //  Generate UUID and save message to disk
        $uuid = s_generate_uuid();
        $filename = s_request_filename($uuid);
        
        $fh = fopen($filename, "w");
        $request-&gt;save($fh);
        fclose($fh);
        
        //  Send UUID through to message queue
        $reply = new Zmsg($pipe);
        $reply-&gt;push($uuid);
        $reply-&gt;send();
        
        //  Now send UUID back to client
        // - sent in the next loop iteration
        $reply = new Zmsg();
        $reply-&gt;push($uuid);
        $reply-&gt;push("200");
    }
}

/**
 * Titanic reply service
 */
function titanic_reply() {
    $worker = new Mdwrk( "tcp://localhost:5555", "titanic.reply", false);
    $reply = null;

    while(true) {
        $request = $worker-&gt;recv($reply);
        
        $uuid = $request-&gt;pop();
        $req_filename = s_request_filename($uuid);
        $rep_filename = s_reply_filename($uuid);
        
        if(file_exists($rep_filename)) {
            $fh = fopen($rep_filename, "r");
            assert($fh);
            $reply = new Zmsg();
            $reply-&gt;load($fh);
            $reply-&gt;push("200");
            fclose($fh);
        } else {
            $reply = new Zmsg();
            if(file_exists($req_filename)) {
                $reply-&gt;push("300"); // Pending
            } else {
                $reply-&gt;push("400"); // Unknown
            }
        }
    } 
}

/**
 * Titanic close service
 */
function titanic_close() {
    $worker = new Mdwrk("tcp://localhost:5555", "titanic.close", false);
    $reply = null;
    
    while(true) {
        $request = $worker-&gt;recv($reply);
        
        $uuid = $request-&gt;pop();
        $req_filename = s_request_filename($uuid);
        $rep_filename = s_reply_filename($uuid);
        
        unlink($req_filename);
        unlink($rep_filename);
        
        $reply = new Zmsg();
        $reply-&gt;push("200");
    }
}

/**
 * Attempt to process a single request, return true if successful
 *
 * @param Mdcli $client 
 * @param string $uuid 
 */
function s_service_success($client, $uuid) {
    //  Load request message, service will be first frame
    $filename = s_request_filename($uuid);
    $fh = fopen($filename, "r");
    
    //  If the client already closed request, treat as successful
    if(!$fh) {
        return true;
    }
    
    $request = new Zmsg();
    $request-&gt;load($fh);
    fclose($fh);
    $service = $request-&gt;pop();

    //  Use MMI protocol to check if service is available
    $mmi_request = new Zmsg();
    $mmi_request-&gt;push($service);
    $mmi_reply = $client-&gt;send("mmi.service", $mmi_request);
    $service_ok = $mmi_reply &amp;&amp; $mmi_reply-&gt;pop() == "200";
    
    if($service_ok) {
        $reply = $client-&gt;send($service, $request);
        //  No reply (timeout or broker gone): leave the request pending
        if($reply) {
            $filename = s_reply_filename($uuid);
            $fh = fopen($filename, "w");
            assert($fh);
            $reply-&gt;save($fh);
            fclose($fh);
            return true;
        }
    }
    
    return false;
}

$verbose = $_SERVER['argc'] &gt; 1 &amp;&amp; $_SERVER['argv'][1] == '-v';

$pid = pcntl_fork();
if($pid == 0) {
    titanic_reply();
    exit();
}

$pid = pcntl_fork(); 
if($pid == 0) {
    titanic_close();
    exit();
}

$pid = pcntl_fork(); 
if($pid == 0) {
    $pipe = new ZMQSocket(new ZMQContext(), ZMQ::SOCKET_PAIR);
    $pipe-&gt;connect("ipc://" . sys_get_temp_dir() . "/titanicpipe");
    titanic_request($pipe);
    exit();
}


//  Create MDP client session with short timeout
$client = new Mdcli("tcp://localhost:5555", $verbose);
$client-&gt;set_timeout(1000); // 1 sec
$client-&gt;set_retries(1);    // only 1 retry

$request_pipe = new ZMQSocket(new ZMQContext(), ZMQ::SOCKET_PAIR);
$request_pipe-&gt;bind("ipc://" . sys_get_temp_dir() . "/titanicpipe");
$read = $write = array();
//  Main dispatcher loop
while(true) {
    //  We'll dispatch once per second, if there's no activity
    $poll = new ZMQPoll();
    $poll-&gt;add($request_pipe, ZMQ::POLL_IN);
    $events = $poll-&gt;poll($read, $write, 1000);
    
    if($events) {
        //  Ensure message directory exists
        if(!is_dir(TITANIC_DIR)) {
            mkdir(TITANIC_DIR);
        }
        
        //  Append UUID to queue, prefixed with '-' for pending
        $msg = new Zmsg($request_pipe);
        $msg-&gt;recv();
        $fh = fopen(TITANIC_DIR . "/queue", "a");
        $uuid = $msg-&gt;pop();
        fprintf($fh, "-%s\n", $uuid);
        fclose($fh);
    }
    
    //  Brute-force dispatcher
    if(file_exists(TITANIC_DIR . "/queue")) {
        $fh = fopen(TITANIC_DIR . "/queue", "r+");
        while($fh &amp;&amp; $entry  = fread($fh, 33)) {
            //  UUID is prefixed with '-' if still waiting
            if($entry[0] == "-") {
                if($verbose) {
                    printf ("I: processing request %s%s", substr($entry, 1), PHP_EOL);
                }
                if(s_service_success($client, substr($entry, 1))) {
                    //  Mark queue entry as processed
                    fseek($fh, -33, SEEK_CUR);
                    fwrite ($fh, "+");
                    fseek($fh, 32, SEEK_CUR);
                }
            }
            //  Skip end of line, LF or CRLF
            if(fgetc($fh) == "\r") {
                fgetc($fh);
            }
        }
        if($fh) {
            fclose($fh);
        }
    }
}
     
</programlisting>

</example>
<para>To test this, start <literal>mdbroker</literal> and <literal>titanic</literal>, then run <literal>ticlient</literal>. Now start <literal>mdworker</literal> arbitrarily, and you should see the client getting a response and exiting happily.</para>

<para>Some notes about this code:</para>

<itemizedlist>
  <listitem><para>Note that some loops start by sending, others by receiving messages. This is because Titanic acts both as a client and a worker in different roles.</para></listitem>
  <listitem><para>We send requests only to services that appear to be running, using MMI. This works as well as the MMI implementation in the broker.</para></listitem>
  <listitem><para>We use an <literal>ipc</literal> connection between two PAIR sockets to send new request data from the <emphasis role="bold">titanic.request</emphasis> service through to the main dispatcher. This saves the dispatcher from having to scan the disk directory, load all request files, and sort them by date/time.</para></listitem>
</itemizedlist>
<para>The important thing about this example is not performance (which, although I haven't tested it, is surely terrible), but how well it implements the reliability contract. To try it, start the mdbroker and titanic programs. Then start the ticlient, and then start the mdworker echo service. You can run all four of these using the <literal>-v</literal> option to do verbose activity tracing. You can stop and restart any piece <emphasis>except</emphasis> the client and nothing will get lost.</para>

<para>If you want to use Titanic in real cases, you'll rapidly be asking "how do we make this faster?" Here's what I'd do, starting with the example implementation:</para>

<itemizedlist>
  <listitem><para>Use a single disk file for all data, rather than multiple files. Operating systems are usually better at handling a few large files than many smaller ones.</para></listitem>
  <listitem><para>Organize that disk file as a circular buffer so that new requests can be written contiguously (with very occasional wraparound). One thread, writing full speed to a disk file, can work rapidly.</para></listitem>
  <listitem><para>Keep the index in memory and rebuild the index at startup time, from the disk buffer. This saves the extra disk head flutter needed to keep the index fully safe on disk. You would want an fsync after every message, or every N milliseconds if you were prepared to lose the last M messages in case of a system failure.</para></listitem>
  <listitem><para>Use a solid-state drive rather than spinning iron oxide platters.</para></listitem>
  <listitem><para>Preallocate the entire file, or allocate it in large chunks allowing the circular buffer to grow and shrink as needed. This avoids fragmentation and ensures most reads and writes are contiguous.</para></listitem>
</itemizedlist>
<para>And so on. What I'd not recommend is storing messages in a database, not even a "fast" key/value store, unless you really like a specific database and don't have performance worries. You will pay a steep price for the abstraction, ten to a thousand times over a raw disk file.</para>
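
<para>As a rough illustration of the single-file and in-memory index ideas, here is a minimal sketch. It is not part of the Titanic example. It assumes a hypothetical record layout (a 4-byte length, a 32-character UUID, then the message body) appended to one log file, and the function names <literal>log_append</literal>, <literal>log_rebuild_index</literal>, and <literal>log_read</literal> are invented for this sketch. It does not implement the circular buffer or preallocation, and it only flushes PHP's own buffers rather than forcing data to disk:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical record layout: 4-byte big-endian body length,
//  32-character UUID, then the body itself.
define('UUID_SIZE', 32);
define('HEADER_SIZE', 4 + UUID_SIZE);

//  Append one record at the end of the file; returns its offset
function log_append($fh, $uuid, $body) {
    fseek($fh, 0, SEEK_END);
    $offset = ftell($fh);
    fwrite($fh, pack('N', strlen($body)) . $uuid . $body);
    //  Flushes PHP's buffer only; a real store would also sync to disk
    fflush($fh);
    return $offset;
}

//  Scan the file once at startup and map each UUID to its offset
function log_rebuild_index($fh) {
    $index = array();
    rewind($fh);
    while (true) {
        $offset = ftell($fh);
        $header = fread($fh, HEADER_SIZE);
        if ($header === false || strlen($header) &lt; HEADER_SIZE) {
            break;      //  End of file, or a truncated header
        }
        $fields = unpack('Nlength', $header);
        $index[substr($header, 4)] = $offset;
        fseek($fh, $fields['length'], SEEK_CUR);    //  Skip the body
    }
    return $index;
}

//  Read the body of the record stored at a given offset
function log_read($fh, $offset) {
    fseek($fh, $offset);
    $fields = unpack('Nlength', fread($fh, HEADER_SIZE));
    if ($fields['length'] == 0) {
        return '';      //  fread() does not accept a zero length
    }
    return fread($fh, $fields['length']);
}
</programlisting>

<para>With this layout the dispatcher would call <literal>log_rebuild_index</literal> once at startup and afterwards keep the index up to date from the offsets that <literal>log_append</literal> returns, so it never needs a separate index file on disk.</para>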

<para>If you want to make Titanic <emphasis>even more reliable</emphasis>, duplicate the requests to a second server, which you'd place in a second location just far enough to survive nuclear attack on your primary location, yet not so far that you get too much latency.</para>

<para>If you want to make Titanic <emphasis>much faster and less reliable</emphasis>, store requests and replies purely in memory. This will give you the functionality of a disconnected network, but requests won't survive a crash of the Titanic server itself.</para>

</sect1>
<sect1>
<title>High-availability Pair (Binary Star Pattern)</title>
<para>The Binary Star pattern puts two servers in a primary-backup high-availability pair (<xref linkend="figure-52"/>). At any given time, one of these (the active) accepts connections from client applications. The other (the passive) does nothing, but the two servers monitor each other. If the active disappears from the network, after a certain time the passive takes over as active.</para>

<para>Martin Sustrik and I developed the Binary Star pattern for the iMatix <ulink url="http://www.openamq.org">OpenAMQ server</ulink>. We designed it:</para>

<itemizedlist>
  <listitem><para>To provide a straightforward high-availability solution.</para></listitem>
  <listitem><para>To be simple enough to actually understand and use.</para></listitem>
  <listitem><para>To fail-over reliably when needed, and only when needed.</para></listitem>
</itemizedlist>
<figure id="figure-52">
    <title>High-availability Pair, Normal Operation</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig52.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Assuming we have a Binary Star pair running, here are the different scenarios that will result in a fail-over (<xref linkend="figure-53"/>):</para>

<itemizedlist>
  <listitem><para>The hardware running the primary server has a fatal problem (power supply explodes, machine catches fire, or someone simply unplugs it by mistake), and disappears. Applications see this, and reconnect to the backup server.</para></listitem>
  <listitem><para>The network segment on which the primary server sits crashes--perhaps a router gets hit by a power spike--and applications start to reconnect to the backup server.</para></listitem>
  <listitem><para>The primary server crashes or is killed by the operator and does not restart automatically.</para></listitem>
</itemizedlist>
<figure id="figure-53">
    <title>High-availability Pair During Failover</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig53.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Recovery from fail-over works as follows:</para>

<itemizedlist>
  <listitem><para>The operators restart the primary server and fix whatever problems were causing it to disappear from the network.</para></listitem>
  <listitem><para>The operators stop the backup server at a moment when it will cause minimal disruption to applications.</para></listitem>
  <listitem><para>When applications have reconnected to the primary server, the operators restart the backup server.</para></listitem>
</itemizedlist>
<para>Recovery (to using the primary server as active) is a manual operation. Painful experience teaches us that automatic recovery is undesirable. There are several reasons:</para>

<itemizedlist>
  <listitem><para>Failover creates an interruption of service to applications, possibly lasting 10-30 seconds. If there is a real emergency, this is much better than total outage. But if recovery creates a further 10-30 second outage, it is better that this happens off-peak, when users have gone off the network.</para></listitem>
  <listitem><para>When there is an emergency, it's a Good Idea to create predictability for those trying to fix things. Automatic recovery creates uncertainty for system administrators, who can no longer be sure which server is in charge without double-checking.</para></listitem>
  <listitem><para>Automatic recovery can create situations where networks fail over and then recover, placing operators in the difficult position of analyzing what happened. There was an interruption of service, but the cause isn't clear.</para></listitem>
</itemizedlist>
<para>Having said this, the Binary Star pattern will fail back to the primary server if this is running (again) and the backup server fails. In fact this is how we provoke recovery.</para>

<para>The shutdown process for a Binary Star pair is to either:</para>

<orderedlist>
  <listitem><para>Stop the passive server and then stop the active server at any later time, or</para></listitem>
  <listitem><para>Stop both servers in any order but within a few seconds of each other.</para></listitem>
</orderedlist>
<para>Stopping the active and then the passive server with any delay longer than the fail-over timeout will cause applications to disconnect, then reconnect, then disconnect again, which may disturb users.</para>

<sect2>
<title>Detailed Requirements</title>
<para>Binary Star is as simple as it can be, while still working accurately. In fact, the current design is the third complete redesign. We found each of the previous designs too complex, trying to do too much, and we stripped out functionality until we came to a design that was understandable, easy to use, and reliable enough to be worth using.</para>

<para>These are our requirements for a high-availability architecture:</para>

<itemizedlist>
  <listitem><para>The fail-over is meant to provide insurance against catastrophic system failures, such as hardware breakdown, fire, accident, etc. There are simpler ways to recover from ordinary server crashes and we already covered these.</para></listitem>
  <listitem><para>Failover time should be under 60 seconds and preferably under 10 seconds.</para></listitem>
  <listitem><para>Failover has to happen automatically, whereas recovery must happen manually. We want applications to switch over to the backup server automatically but we do not want them to switch back to the primary server except when the operators have fixed whatever problem there was, and decided that it is a good time to interrupt applications again.</para></listitem>
  <listitem><para>The semantics for client applications should be simple and easy for developers to understand. Ideally they should be hidden in the client API.</para></listitem>
  <listitem><para>There should be clear instructions for network architects on how to avoid designs that could lead to <emphasis>split brain syndrome</emphasis>, in which both servers in a Binary Star pair think they are the active server.</para></listitem>
  <listitem><para>There should be no dependencies on the order in which the two servers are started.</para></listitem>
  <listitem><para>It must be possible to make planned stops and restarts of either server without stopping client applications (though they may be forced to reconnect).</para></listitem>
  <listitem><para>Operators must be able to monitor both servers at all times.</para></listitem>
  <listitem><para>It must be possible to connect the two servers using a high-speed dedicated network connection. That is, fail-over synchronization must be able to use a specific IP route.</para></listitem>
</itemizedlist>
<para>We make these assumptions:</para>

<itemizedlist>
  <listitem><para>A single backup server provides enough insurance; we don't need multiple levels of backup.</para></listitem>
  <listitem><para>The primary and backup servers are equally capable of carrying the application load. We do not attempt to balance load across the servers.</para></listitem>
  <listitem><para>There is sufficient budget to cover a fully redundant backup server that does nothing almost all the time.</para></listitem>
</itemizedlist>
<para>We don't attempt to cover:</para>

<itemizedlist>
  <listitem><para>The use of an active backup server or load balancing. In a Binary Star pair, the backup server is inactive and does no useful work until the primary server goes off-line.</para></listitem>
  <listitem><para>The handling of persistent messages or transactions in any way. We assume a network of unreliable (and probably untrusted) servers or Binary Star pairs.</para></listitem>
  <listitem><para>Any automatic exploration of the network. The Binary Star pair is manually and explicitly defined in the network and is known to applications (at least in their configuration data).</para></listitem>
  <listitem><para>Replication of state or messages between servers. All server-side state must be recreated by applications when they fail over.</para></listitem>
</itemizedlist>
<para>Here is the key terminology we use in Binary Star:</para>

<itemizedlist>
  <listitem><para><emphasis>Primary</emphasis> - the server that is normally or initially 'active'.</para></listitem>
  <listitem><para><emphasis>Backup</emphasis> - the server that is normally 'passive'. It will become active if and when the primary server disappears from the network, and when client applications ask the backup server to connect.</para></listitem>
  <listitem><para><emphasis>Active</emphasis> - the server that accepts client connections. There is at most one active server.</para></listitem>
  <listitem><para><emphasis>Passive</emphasis> - the server that takes over if the active disappears. Note that when a Binary Star pair is running normally, the primary server is active, and the backup is passive. When a fail-over has happened, the roles are switched.</para></listitem>
</itemizedlist>
<para>To configure a Binary Star pair, you need to:</para>

<orderedlist>
  <listitem><para>Tell the primary server where the backup server is.</para></listitem>
  <listitem><para>Tell the backup server where the primary server is.</para></listitem>
  <listitem><para>Optionally, tune the fail-over response times, which must be the same for both servers.</para></listitem>
</orderedlist>
<para>The main tuning concern is how frequently you want the servers to check their peering status, and how quickly you want to activate fail-over. In our example, the fail-over timeout value defaults to 2,000 msec. If you reduce this, the backup server will take over as active more rapidly but may take over in cases where the primary server could recover. You may for example have wrapped the primary server in a shell script that restarts it if it crashes. In that case the timeout should be higher than the time needed to restart the primary server.</para>

<para>For client applications to work properly with a Binary Star pair, they must:</para>

<orderedlist>
  <listitem><para>Know both server addresses.</para></listitem>
  <listitem><para>Try to connect to the primary server, and if that fails, to the backup server.</para></listitem>
  <listitem><para>Detect a failed connection, typically using heartbeating.</para></listitem>
  <listitem><para>Try to reconnect to the primary, and then backup (in that order), with a delay between retries that is at least as high as the server fail-over timeout.</para></listitem>
  <listitem><para>Recreate all of the state they require on a server.</para></listitem>
  <listitem><para>Retransmit messages lost during a fail-over, if messages need to be reliable.</para></listitem>
</orderedlist>
<para>It's not trivial work, and we'd usually wrap this in an API that hides it from real end-user applications.</para>
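
<para>The client-side steps above can be sketched as a small retry loop. This is an illustration only, not the Guide's <literal>bstarcli</literal>: the function <literal>bstar_client_request</literal> and its parameters are hypothetical, and the <literal>$try_request</literal> callback stands in for whatever code sends one request to one endpoint and returns the reply, or <literal>null</literal> on timeout (for example a REQ socket polled with a timeout). Recreating server-side state is left out:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical helper: $servers lists the primary first, then the
//  backup. $try_request($endpoint, $request) returns a reply, or
//  null if the server did not answer in time.
function bstar_client_request(array $servers, $request, $try_request,
                              $settle_delay_ms = 2000, $max_attempts = 6) {
    $server_nbr = 0;    //  Always start with the primary
    for ($attempt = 0; $attempt &lt; $max_attempts; $attempt++) {
        $endpoint = $servers[$server_nbr];
        $reply = call_user_func($try_request, $endpoint, $request);
        if ($reply !== null) {
            return array($endpoint, $reply);
        }
        //  No reply: give the pair time to fail over, then resend
        //  the same request to the other server
        usleep($settle_delay_ms * 1000);
        $server_nbr = ($server_nbr + 1) % count($servers);
    }
    return null;        //  Neither server answered
}
</programlisting>

<para>The default delay of 2,000 msec is an assumption chosen to match the example fail-over timeout. A shorter delay would let the client ask the backup before it is willing to take over.</para>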

<para>These are the main limitations of the Binary Star pattern:</para>

<itemizedlist>
  <listitem><para>A server process cannot be part of more than one Binary Star pair.</para></listitem>
  <listitem><para>A primary server can have a single backup server, no more.</para></listitem>
  <listitem><para>Whichever server is passive is wasted.</para></listitem>
  <listitem><para>The backup server must be capable of handling full application loads.</para></listitem>
  <listitem><para>Failover configuration cannot be modified at runtime.</para></listitem>
  <listitem><para>Client applications must do some work to benefit from fail-over.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Preventing Split-Brain Syndrome</title>
<para><emphasis>Split-brain syndrome</emphasis> occurs when different parts of a cluster think they are active at the same time. It causes applications to stop seeing each other. Binary Star has an algorithm for detecting and eliminating split brain, based on a three-way decision mechanism (a server will not decide to become active until it gets application connection requests and it cannot see its peer server).</para>

<para>However it is still possible to (mis)design a network to fool this algorithm. A typical scenario would be a Binary Star pair distributed between two buildings, where each building also had a set of applications, and there was a single network link between both buildings. Breaking this link would create two sets of client applications, each with half of the Binary Star pair, and each fail-over server would become active.</para>

<para>To prevent split-brain situations, we must connect a Binary Star pair using a dedicated network link, which can be as simple as plugging them both into the same switch or, better, using a cross-over cable directly between the two machines.</para>

<para>We must not split a Binary Star architecture into two islands, each with a set of applications. While this may be a common type of network architecture, you should use federation, not high-availability fail-over, in such cases.</para>

<para>A suitably paranoid network configuration would use two private cluster interconnects, rather than a single one. Further, the network cards used for the cluster would be different from those used for message traffic, and possibly even on different PCI paths on the server hardware. The goal is to separate possible failures in the network from possible failures in the cluster. Network ports have a relatively high failure rate.</para>

</sect2>
<sect2>
<title>Binary Star Implementation</title>
<para>Without further ado, here is a proof-of-concept implementation of the Binary Star server. The primary and backup servers run the same code, and their roles are chosen by the invoker:</para>

<example id="bstarsrv-php">
<title>Binary Star server (bstarsrv.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here is the client:</para>

<example id="bstarcli-php">
<title>Binary Star client (bstarcli.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>To test Binary Star, start the servers and client in any order:</para>

<screen>bstarsrv -p     # Start primary
bstarsrv -b     # Start backup
bstarcli
</screen>

<para>You can then provoke fail-over by killing the primary server, and recovery by restarting the primary and killing the backup. Note how it's the client vote that triggers fail-over, and recovery.</para>

<para>Binary Star is driven by a finite state machine (<xref linkend="figure-54"/>). States in white accept client requests, states in grey refuse them. Events are the peer state, so "Peer Active" means the other server has told us it's active. "Client Request" means we've received a client request. "Client Vote" means we've received a client request AND our peer is inactive for two heartbeats.</para>

<figure id="figure-54">
    <title>Binary Star Finite State Machine</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig54.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>
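
<para>The state machine can be sketched as a pure PHP function. This is only an illustration based on the description above, not a translation of the Guide's <literal>bstarsrv</literal>: the constant names, the function <literal>bstar_next_state</literal>, and the choice to report dual-active and dual-passive peers as errors are assumptions of this sketch, and the transitions may differ in detail from the reference implementation. The <literal>$peer_expired</literal> flag stands for "our peer has been silent for two heartbeats", so a client request with that flag set is a client vote:</para>

<programlisting language="php">
&lt;?php
//  Hypothetical names for the four states and five events
define('STATE_PRIMARY', 'primary');     //  Starting up as primary
define('STATE_BACKUP',  'backup');      //  Starting up as backup
define('STATE_ACTIVE',  'active');
define('STATE_PASSIVE', 'passive');

define('PEER_PRIMARY',   'peer-primary');
define('PEER_BACKUP',    'peer-backup');
define('PEER_ACTIVE',    'peer-active');
define('PEER_PASSIVE',   'peer-passive');
define('CLIENT_REQUEST', 'client-request');

//  Returns array($next_state, $ok). $ok is false when a client
//  request must be refused or the pair is in an invalid combination.
function bstar_next_state($state, $event, $peer_expired = false) {
    switch ($state) {
    case STATE_PRIMARY:
        //  Waiting for the peer; client requests are accepted
        if ($event === PEER_BACKUP) {
            return array(STATE_ACTIVE, true);
        }
        if ($event === PEER_ACTIVE) {
            return array(STATE_PASSIVE, true);
        }
        return array($state, true);

    case STATE_BACKUP:
        if ($event === PEER_ACTIVE) {
            return array(STATE_PASSIVE, true);
        }
        //  Refuse clients while we wait for the primary
        return array($state, $event !== CLIENT_REQUEST);

    case STATE_ACTIVE:
        //  Two actives would mean split brain
        return array($state, $event !== PEER_ACTIVE);

    case STATE_PASSIVE:
        if ($event === PEER_PRIMARY || $event === PEER_BACKUP) {
            //  Peer is restarting, so we take over as active
            return array(STATE_ACTIVE, true);
        }
        if ($event === PEER_PASSIVE) {
            //  Two passives would leave nobody serving clients
            return array($state, false);
        }
        if ($event === CLIENT_REQUEST) {
            //  A client vote: take over only if the peer looks dead
            return $peer_expired
                ? array(STATE_ACTIVE, true)
                : array($state, false);
        }
        return array($state, true);
    }
    throw new InvalidArgumentException("unknown state: $state");
}
</programlisting>

<para>In this sketch a passive server refuses a client request while its peer is still sending heartbeats, and the same request makes it active once the peer has expired. That is the client vote at work.</para>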

<para>Note that the servers use PUB-SUB sockets for state exchange. No other socket combination will work here. PUSH and DEALER block if there is no peer ready to receive a message. PAIR does not reconnect if the peer disappears and comes back. ROUTER needs the address of the peer before it can send it a message.</para>

</sect2>
<sect2>
<title>Binary Star Reactor</title>
<para>Binary Star is useful and generic enough to package up as a reusable reactor class. The reactor then runs and calls our code whenever it has a message to process. This is much nicer than copying/pasting the Binary Star code into each server where we want that capability.</para>

<para>In C we wrap the CZMQ <literal>zloop</literal> class that we saw before. <literal>zloop</literal> lets you register handlers to react to socket and timer events. In the Binary Star reactor, we provide handlers for voters, and for state changes (active to passive, and vice-versa). Here is the <literal>bstar</literal> API:</para>

<programlisting language="c">
//  Create a new Binary Star instance, using local (bind) and
//  remote (connect) endpoints to set-up the server peering.
bstar_t *bstar_new (int primary, char *local, char *remote);

//  Destroy a Binary Star instance
void bstar_destroy (bstar_t **self_p);

//  Return underlying zloop reactor, for timer and reader
//  registration and cancelation.
zloop_t *bstar_zloop (bstar_t *self);

//  Register voting reader
int bstar_voter (bstar_t *self, char *endpoint, int type,
                 zloop_fn handler, void *arg);

//  Register main state change handlers
void bstar_new_active (bstar_t *self, zloop_fn handler, void *arg);
void bstar_new_passive (bstar_t *self, zloop_fn handler, void *arg);

//  Start the reactor, ends if a callback function returns -1, or the
//  process received SIGINT or SIGTERM.
int bstar_start (bstar_t *self);
</programlisting>

<para>And here is the class implementation:</para>

<example id="bstar-php">
<title>Binary Star core class (bstar.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Which gives us the following short main program for the server:</para>

<example id="bstarsrv2-php">
<title>Binary Star server, using core class (bstarsrv2.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
</sect2>
</sect1>
<sect1>
<title>Brokerless Reliability (Freelance Pattern)</title>
<para>It might seem ironic to focus so much on broker-based reliability, when we often explain &Oslash;MQ as "brokerless messaging". However in messaging, as in real life, the middleman is both a burden and a benefit. In practice, most messaging architectures benefit from a mix of distributed and brokered messaging. You get the best results when you can decide freely what trade-offs you want to make. This is why I can drive 10km to a wholesaler to buy five cases of wine for a party, but I can also walk 10 minutes to a corner store to buy one bottle for a dinner. Our highly context-sensitive relative valuations of time, energy, and cost are essential to the real world economy. And they are essential to an optimal message-based architecture.</para>

<para>Which is why &Oslash;MQ does not <emphasis>impose</emphasis> a broker-centric architecture, though it gives you the tools to build brokers, aka <emphasis>proxies</emphasis>, and we've built a dozen or so different ones so far, just for practice.</para>

<para>So we'll end this chapter by deconstructing the broker-based reliability we've built so far, and turning it back into a distributed peer-to-peer architecture I call the Freelance pattern. Our use case will be a name resolution service. This is a common problem with &Oslash;MQ architectures: how do we know the endpoint to connect to? Hard-coding TCP/IP addresses in code is insanely fragile. Using configuration files creates an administration nightmare. Imagine if you had to hand-configure your web browser, on every PC or mobile phone you used, to realize that "google.com" was "74.125.230.82".</para>

<para>A &Oslash;MQ name service (and we'll make a simple implementation) has to:</para>

<itemizedlist>
  <listitem><para>Resolve a logical name into at least a bind endpoint, and a connect endpoint. A realistic name service would provide multiple bind endpoints, and possibly multiple connect endpoints too.</para></listitem>
  <listitem><para>Allow us to manage multiple parallel environments, e.g. "test" versus "production", without modifying code.</para></listitem>
  <listitem><para>Be reliable, because if it is unavailable, applications won't be able to connect to the network.</para></listitem>
</itemizedlist>
<para>Putting a name service behind a service-oriented Majordomo broker is clever from some points of view. However it's simpler and much less surprising to just expose the name service as a server that clients can connect to directly. If we do this right, the name service becomes the <emphasis>only</emphasis> global network endpoint we need to hard-code in our code or configuration files.</para>

<para>The types of failure we aim to handle are server crashes and restarts, server busy looping, server overload, and network issues. To get reliability, we'll create a pool of name servers so if one crashes or goes away, clients can connect to another, and so on. In practice, two would be enough. But for the example, we'll assume the pool can be any size (<xref linkend="figure-55"/>).</para>

<figure id="figure-55">
    <title>The Freelance Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig55.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>In this architecture, a large set of clients connect to a small set of servers directly. The servers bind to their respective addresses. It's fundamentally different from a broker-based approach like Majordomo, where workers connect to the broker. Clients have a few options:</para>

<itemizedlist>
  <listitem><para>Use REQ sockets and the Lazy Pirate pattern. Easy, but would need some additional intelligence so clients don't stupidly try to reconnect to dead servers over and over.</para></listitem>
  <listitem><para>Use DEALER sockets and blast out requests (which will be load balanced to all connected servers) until they get a reply. Effective, but not elegant.</para></listitem>
  <listitem><para>Use ROUTER sockets so clients can address specific servers. But how does the client know the identity of the server sockets? Either the server has to ping the client first (complex), or each server has to use a hard-coded, fixed identity known to the client (nasty).</para></listitem>
</itemizedlist>
<para>We'll develop each of these in the following subsections.</para>

<sect2>
<title>Model One - Simple Retry and Failover</title>
<para>So our menu appears to offer: simple, brutal, complex, or nasty. Let's start with 'simple' and then work out the kinks. We take Lazy Pirate and rewrite it to work with multiple server endpoints. Start one or several servers first, specifying a bind endpoint as the argument:</para>

<example id="flserver1-php">
<title>Freelance server, Model One (flserver1.php)</title>
<programlisting language="php">
&lt;?php
/*
* Freelance server - Model 1
* Trivial echo service
*
* Author: Rob Gagnon &lt;rgagnon24(at)gmail(dot)com&gt;
*/

if (count($argv) &lt; 2) {
	printf("I: Syntax: %s &lt;endpoint&gt;\n", $argv[0]);
	exit;
}

$endpoint = $argv[1];
$context = new ZMQContext();
$server = $context-&gt;getSocket(ZMQ::SOCKET_REP);
$server-&gt;bind($endpoint);

printf("I: Echo service is ready at %s\n", $endpoint);
while(true) {
	$msg = $server-&gt;recvMulti();
	$server-&gt;sendMulti($msg);
}
</programlisting>

</example>
<para>Then start the client, specifying one or more connect endpoints as arguments:</para>

<example id="flclient1-php">
<title>Freelance client, Model One (flclient1.php)</title>
<programlisting language="php">
&lt;?php
/*
* Freelance Client - Model 1
* Uses REQ socket to query one or more services
*
* Author: Rob Gagnon &lt;rgagnon24(at)gmail(dot)com&gt;
*/

$request_timeout = 1000; // ms
$max_retries = 3; # Before we abandon

/**
* @param ZMQContext $ctx
* @param string $endpoint
* @param string $request
*/
function try_request($ctx, $endpoint, $request) {
	global $request_timeout;

	printf("I: Trying echo service at %s...\n", $endpoint);
	$client = $ctx-&gt;getSocket(ZMQ::SOCKET_REQ);
	$client-&gt;connect($endpoint);
	$client-&gt;send($request);

	$poll = new ZMQPoll();
	$poll-&gt;add($client, ZMQ::POLL_IN);
	$readable = $writable = array();

	$events = $poll-&gt;poll($readable, $writable, $request_timeout);
	$reply = null;
	foreach($readable as $sock) {
		if ($sock == $client) {
			$reply = $client-&gt;recvMulti();
		} else {
			$reply = null;
		}
	}

	$poll-&gt;remove($client);
	$poll = null;
	$client = null;
	return $reply;
}

$context = new ZMQContext();
$request = 'Hello world';
$reply = null;

$cmd = array_shift($argv);
$endpoints = count($argv);
if ($endpoints == 0) {
	printf("I: syntax: %s &lt;endpoint&gt; ...\n", $cmd);
	exit;
}

if ($endpoints == 1) {
	// For one endpoint, we retry N times
	$endpoint = $argv[0];
	for($retries = 0; $retries &lt; $max_retries; $retries++) {
		$reply = try_request($context, $endpoint, $request);
		if (isset($reply)) {
			break; // Success
		}
		printf("W: No response from %s, retrying\n", $endpoint);
	}
} else {
	// For multiple endpoints, try each at most once
	foreach($argv as $endpoint) {
		$reply = try_request($context, $endpoint, $request);
		if (isset($reply)) {
			break; // Success
		}
		printf("W: No response from %s\n", $endpoint);
	}
}

if (isset($reply)) {
	print "Service is running OK\n";
}
</programlisting>

</example>
<para>A sample run is:</para>

<screen>flserver1 tcp://*:5555 &amp;
flserver1 tcp://*:5556 &amp;
flclient1 tcp://localhost:5555 tcp://localhost:5556
</screen>

<para>Although the basic approach is Lazy Pirate, the client aims to just get one successful reply. It has two techniques, depending on whether you are running a single server, or multiple servers:</para>

<itemizedlist>
  <listitem><para>With a single server, the client will retry several times, exactly as for Lazy Pirate.</para></listitem>
  <listitem><para>With multiple servers, the client will try each server at most once, until it's received a reply, or has tried all servers.</para></listitem>
</itemizedlist>
<para>This solves the main weakness of Lazy Pirate, namely that it could not do fail-over to backup / alternate servers.</para>

<para>However, this design won't work well in a real application. If we're connecting many sockets, and our primary name server is down, we're going to experience this painful timeout each time.</para>

</sect2>
<sect2>
<title>Model Two - Brutal Shotgun Massacre</title>
<para>Let's switch our client to using a DEALER socket. Our goal here is to make sure we get a reply back within the shortest possible time, no matter whether a particular server is up or down. Our client takes this approach:</para>

<itemizedlist>
  <listitem><para>We set things up, connecting to all servers.</para></listitem>
  <listitem><para>When we have a request, we blast it out as many times as we have servers.</para></listitem>
  <listitem><para>We wait for the first reply, and take that.</para></listitem>
  <listitem><para>We ignore any other replies.</para></listitem>
</itemizedlist>
<para>What will happen in practice is that when all servers are running, &Oslash;MQ will distribute the requests so each server gets one request, and sends one reply. When any server is offline, and disconnected, &Oslash;MQ will distribute the requests to the remaining servers. So a server may in some cases get the same request more than once.</para>

<para>What's more annoying for the client is that we'll get multiple replies back, but there's no guarantee we'll get a precise number of replies. Requests and replies can get lost (e.g., if the server crashes while processing a request).</para>

<para>So we have to number requests, and ignore any replies that don't match the request number. Our Model One server will work, since it's an echo server, but coincidence is not a great basis for understanding. So we'll make a Model Two server that chews up the message and returns a correctly-numbered reply with the content "OK". We'll use messages consisting of two parts: a sequence number and a body.</para>
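<para>As a minimal sketch of that numbering scheme, the two helpers below build a numbered request and check a reply against the sequence number we are waiting for. The names <literal>fl_build_request</literal> and <literal>fl_match_reply</literal> are made up for this illustration and are not part of any library. The frame layout (empty envelope, sequence number, body) is what a DEALER socket has to send when it talks to a REP socket.</para>

<programlisting language="php">
&lt;?php
// Hypothetical helpers, for illustration only

// Build the frames a DEALER socket sends to a REP socket:
// empty envelope delimiter, sequence number, request body
function fl_build_request($sequence, $body) {
	return array('', (string) $sequence, $body);
}

// Return the reply body if the frames carry the expected sequence
// number, or null if the reply is malformed or belongs to an older request
function fl_match_reply(array $frames, $expected) {
	if (count($frames) != 3) {
		return null;
	}
	list($envelope, $sequence, $body) = $frames;
	if ($envelope !== '' || (string) $sequence !== (string) $expected) {
		return null;
	}
	return $body;
}

$request = fl_build_request(7, 'random name');
var_dump(fl_match_reply(array('', '6', 'OK'), 7)); // stale reply is ignored
var_dump(fl_match_reply(array('', '7', 'OK'), 7)); // matching reply is accepted
</programlisting>

<para>Comparing the sequence numbers as strings keeps the check independent of how the frames were converted on the wire.</para>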

<para>Start one or more servers, specifying a bind endpoint each time:</para>

<example id="flserver2-php">
<title>Freelance server, Model Two (flserver2.php)</title>
<programlisting language="php">
&lt;?php
/*
* Freelance server - Model 2
* Does some work, replies OK, with message sequencing
*
* Author: Rob Gagnon &lt;rgagnon24(at)gmail(dot)com&gt;
*/

if (count($argv) &lt; 2) {
	printf("I: Syntax: %s &lt;endpoint&gt;\n", $argv[0]);
	exit;
}

$endpoint = $argv[1];
$context = new ZMQContext();
$server = $context-&gt;getSocket(ZMQ::SOCKET_REP);
$server-&gt;bind($endpoint);

printf("I: service is ready at %s\n", $endpoint);
while(true) {
	$request = $server-&gt;recvMulti();
	if (count($request) != 2) {
		// Fail nastily if run against wrong client
		exit(-1);
	}

	// First frame is the sequence number; send it back with "OK"
	$sequence = $request[0];
	$reply = array($sequence, 'OK');
	$server-&gt;sendMulti($reply);
}
</programlisting>

</example>
<para>Then start the client, specifying the connect endpoints as arguments:</para>

<example id="flclient2-php">
<title>Freelance client, Model Two (flclient2.php)</title>
<programlisting language="php">
&lt;?php
/*
* Freelance Client - Model 2
* Uses DEALER socket to blast one or more services
*
* Author: Rob Gagnon &lt;rgagnon24(at)gmail(dot)com&gt;
*/

class FLClient {
	const GLOBAL_TIMEOUT = 2500; // ms

	private $servers = 0;
	private $sequence = 0;
	/** @var ZMQContext */
	private $context = null;
	/** @var ZMQSocket */
	private $socket = null;

	public function __construct() {
		$this-&gt;servers = 0;
		$this-&gt;sequence = 0;
		$this-&gt;context = new ZMQContext();
		$this-&gt;socket = $this-&gt;context-&gt;getSocket(ZMQ::SOCKET_DEALER);
	}

	public function __destruct() {
		$this-&gt;socket-&gt;setSockOpt(ZMQ::SOCKOPT_LINGER, 0);
		$this-&gt;socket = null;
		$this-&gt;context = null;
	}

	/**
	* @param string $endpoint
	*/
	public function connect($endpoint) {
		$this-&gt;socket-&gt;connect($endpoint);
		$this-&gt;servers++;
		printf("I: Connected to %s\n", $endpoint);
	}

	/**
	* @param string $request
	*/
	public function request($request) {
		// Prefix request with sequence number and empty envelope
		$this-&gt;sequence++;
		$msg = array('', $this-&gt;sequence, $request);

		// Blast the request to all connected servers
		for($server = 1; $server &lt;= $this-&gt;servers; $server++) {
			$this-&gt;socket-&gt;sendMulti($msg);
		}

		// Wait for a matching reply to arrive from anywhere
		// Since we can poll several times, recalculate the timeout each time
		$poll = new ZMQPoll();
		$poll-&gt;add($this-&gt;socket, ZMQ::POLL_IN);

		$reply = null;
		$endtime = microtime(true) + self::GLOBAL_TIMEOUT / 1000;
		while ($reply === null &amp;&amp; microtime(true) &lt; $endtime) {
			$readable = $writable = array();
			$timeout = max((int) (($endtime - microtime(true)) * 1000), 0);
			$events = $poll-&gt;poll($readable, $writable, $timeout);
			foreach($readable as $sock) {
				if ($sock == $this-&gt;socket) {
					$frames = $this-&gt;socket-&gt;recvMulti();
					if (count($frames) != 3) {
						exit;
					}
					// Accept only the reply that matches this request
					if ($frames[1] == $this-&gt;sequence) {
						$reply = $frames;
						break;
					}
				}
			}
		}

		return $reply;
	}
}

$cmd = array_shift($argv);
if (count($argv) == 0) {
	printf("I: syntax: %s &lt;endpoint&gt; ...\n", $cmd);
	exit;
}

// Create new freelance client object
$client = new FLClient();

foreach($argv as $endpoint) {
	$client-&gt;connect($endpoint);
}

$start = microtime(true);
for($requests = 0; $requests &lt; 10000; $requests++) {
	$request = "random name";
	$reply = $client-&gt;request($request);
	if (!isset($reply)) {
		print "E: name service not available, aborting\n";
		break;
	}
}

// Elapsed seconds * 1,000,000 / 10,000 requests = usec per request
printf("Average round trip cost: %d usec\n", (microtime(true) - $start) * 100);
$client = null;
</programlisting>

</example>
<para>Some notes on this code:</para>

<itemizedlist>
  <listitem><para>The client is structured as a nice little class-based API that hides the dirty work of creating &Oslash;MQ contexts and sockets and talking to the server. If a shotgun blast to the midriff can be called "talking".</para></listitem>
  <listitem><para>The client will abandon the chase if it can't find <emphasis>any</emphasis> responsive server within a few seconds.</para></listitem>
  <listitem><para>The client has to create a valid REP envelope, i.e. add an empty message frame to the front of the message.</para></listitem>
</itemizedlist>
<para>The client does 10,000 name resolution requests (fake ones, since our server does essentially nothing), and measures the average cost. On my test box, talking to one server, this requires about 60 microseconds. Talking to three servers, it's about 80 microseconds.</para>

<para>So pros and cons of our shotgun approach are:</para>

<itemizedlist>
  <listitem><para>Pro: it is simple, easy to make and easy to understand.</para></listitem>
  <listitem><para>Pro: it does the job of fail-over, and works rapidly, so long as there is at least one server running.</para></listitem>
  <listitem><para>Con: it creates redundant network traffic.</para></listitem>
  <listitem><para>Con: we can't prioritize our servers, i.e. Primary, then Secondary.</para></listitem>
  <listitem><para>Con: the server can do at most one request at a time, period.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Model Three - Complex and Nasty</title>
<para>The shotgun approach seems too good to be true. Let's be scientific and work through all the alternatives. We're going to explore the complex/nasty option, even if it's only to finally realize that we preferred brutal. Ah, the story of my life.</para>

<para>We can solve the main problems of the client by switching to a ROUTER socket. That lets us send requests to specific servers, avoid servers we know are dead, and in general be as smart as we want to make it. We can also solve the main problem of the server (single-threadedness) by switching to a ROUTER socket.</para>

<para>But doing ROUTER to ROUTER between two anonymous sockets (which haven't set an identity) is not possible. Both sides generate an identity (for the other peer) only when they receive a first message, and thus neither can talk to the other until it has first received a message. The only way out of this conundrum is to cheat, and use hard-coded identities in one direction. The proper way to cheat, in a client-server case, is to let the client 'know' the identity of the server. Doing it the other way around would be insane, on top of complex and nasty, because any number of clients should be able to arise independently. Insane, complex, and nasty are great attributes for a genocidal dictator, but terrible ones for software.</para>

<para>Rather than invent yet another concept to manage, we'll use the connection endpoint as identity. This is a unique string both sides can agree on without more prior knowledge than they already have for the shotgun model. It's a sneaky and effective way to connect two ROUTER sockets.</para>

<para>Remember how &Oslash;MQ identities work. The server ROUTER socket sets an identity before it binds its socket. When a client connects, they do a little handshake to exchange identities, before either side sends a real message. The client ROUTER socket, having not set an identity, sends a null identity to the server. The server generates a random UUID to designate the client, for its own use. The server sends its identity (which we've agreed is going to be an endpoint string) to the client.</para>

<para>This means our client can route a message to the server (i.e. send on its ROUTER socket, specifying the server endpoint as identity) as soon as the connection is established. That's not <emphasis>immediately</emphasis> after doing a <literal>zmq_connect()</literal>, but some random time thereafter. Herein lies one problem: we don't know when the server will actually be available and complete its connection handshake. If the server is actually online, it could be after a few milliseconds. If the server is down, and the sysadmin is out to lunch, it could be an hour.</para>
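<para>Here is a rough sketch of that ROUTER to ROUTER connection. It is not the Model Three code: both sockets live in one script to keep it short, the endpoint is an arbitrary choice, and a fixed sleep stands in for knowing when the handshake has completed. It assumes a php-zmq build that provides <literal>ZMQ::SOCKET_ROUTER</literal> and <literal>ZMQ::SOCKOPT_IDENTITY</literal>.</para>

<programlisting language="php">
&lt;?php
// Sketch only: the endpoint is an arbitrary choice for this illustration
$endpoint = "tcp://127.0.0.1:5555";
$context = new ZMQContext();

// Server: set the endpoint as identity before binding
$server = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$server-&gt;setSockOpt(ZMQ::SOCKOPT_IDENTITY, $endpoint);
$server-&gt;bind($endpoint);

// Client: anonymous ROUTER socket that knows only the endpoint
$client = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
$client-&gt;connect($endpoint);

// The handshake takes some time, and a ROUTER socket silently drops
// messages it cannot route yet, so this crude sketch just waits
usleep(100000);

// Client routes a message to the server, using the endpoint as address
$client-&gt;sendMulti(array($endpoint, "hello"));

// Server sees a generated client identity followed by the payload
list($identity, $payload) = $server-&gt;recvMulti();
$server-&gt;sendMulti(array($identity, "world"));

// Client sees the server identity (the endpoint) followed by the reply
list($from, $reply) = $client-&gt;recvMulti();
printf("I: %s replied %s\n", $from, $reply);
</programlisting>

<para>If the sleep is too short for the handshake, the first message is dropped and the server's <literal>recvMulti()</literal> call blocks forever, which is exactly the problem of not knowing when the connection is ready.</para>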

<para>There's a small paradox here. We need to know when servers become connected and available for work. In the Freelance pattern, unlike the broker-based patterns we saw earlier in this chapter, servers are silent until spoken to. Thus we can't talk to a server until it's told us it's on-line, which it can't do until we've asked it.</para>

<para>My solution is to mix in a little of the shotgun approach from Model Two, meaning we'll fire (harmless) shots at anything we can, and if anything moves, we know it's alive. We're not going to fire real requests, but rather a kind of ping-pong heartbeat.</para>

<para>This brings us to the realm of protocols again, so here's a <ulink url="http://rfc.zeromq.org/spec:10">short spec that defines how a Freelance client and server exchange PING-PONG commands and request-reply commands</ulink>.</para>

<para>It is short and sweet to implement as a server. Here's our echo server, Model Three, now speaking FLP:</para>

<example id="flserver3-php">
<title>Freelance server, Model Three (flserver3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>The Freelance client, however, has gotten large. For clarity, it's split into an example application and a class that does the hard work. Here's the top-level application:</para>

<example id="flclient3-php">
<title>Freelance client, Model Three (flclient3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here, almost as complex and large as the Majordomo broker, is the client API class:</para>

<example id="flcliapi-php">
<title>Freelance client API (flcliapi.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>This API implementation is fairly sophisticated and uses a couple of techniques that we've not seen before:</para>

<para><emphasis role="bold">Multithreaded API</emphasis></para>

<para>The client API consists of two parts, a synchronous 'flcliapi' class that runs in the application thread, and an asynchronous 'agent' class that runs as a background thread. Remember how &Oslash;MQ makes it easy to create multithreaded apps. The flcliapi and agent classes talk to each other with messages over an <literal>inproc</literal> socket. All &Oslash;MQ aspects (such as creating and destroying a context) are hidden in the API. The agent in effect acts like a mini-broker, talking to servers in the background, so that when we make a request, it can make a best effort to reach a server it believes is available.</para>

<para><emphasis role="bold">Tickless poll timer</emphasis></para>

<para>In previous poll loops we always used a fixed tick interval, e.g., 1 second, which is simple enough but not excellent on power-sensitive clients, such as notebooks or mobile phones, where waking the CPU costs power. For fun, and to help save the planet, the agent uses a 'tickless timer', which calculates the poll delay based on the next timeout we're expecting. A proper implementation would keep an ordered list of timeouts. We just check all timeouts and calculate the poll delay until the next one.</para>
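<para>The "check all timeouts" calculation could look something like the following sketch. The helper name <literal>tickless_delay</literal> and the example deadlines are invented for this illustration. It returns -1 when nothing is pending, on the assumption that the poll call treats -1 as "wait indefinitely".</para>

<programlisting language="php">
&lt;?php
// Hypothetical helper: given the absolute expiry times (in msecs) of all
// pending timers, return how long the next poll may sleep, in msecs.
// Returns -1 (wait indefinitely) when no timer is pending.
function tickless_delay(array $deadlines, $now) {
	if (count($deadlines) == 0) {
		return -1;
	}
	$delay = min($deadlines) - $now;
	return (int) max($delay, 0);
}

$now = (int) (microtime(true) * 1000);
$deadlines = array(
	$now + 2000,  // e.g. the next ping is due
	$now + 500,   // e.g. the current request expires
);
$timeout = tickless_delay($deadlines, $now);
printf("Next poll may sleep for %d msecs\n", $timeout);
</programlisting>

<para>The result would be passed as the timeout argument of the poll call, so the process wakes only when a message arrives or the nearest deadline expires.</para>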

</sect2>
</sect1>
<sect1>
<title>Conclusion</title>
<para>In this chapter we've seen a variety of reliable request-reply mechanisms, each with certain costs and benefits. The example code is largely ready for real use, though it is not optimized. Of all the different patterns, the two that stand out for production use are the Majordomo pattern, for broker-based reliability, and the Freelance pattern, for brokerless reliability.</para>

</sect1>
</chapter>
<chapter id="advanced-pub-sub">
<title>Advanced Publish-Subscribe Patterns</title>
<para>In Advanced Request-Reply Patterns<xref linkend="advanced-request-reply"/> and Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/> we looked at advanced use of &Oslash;MQ's request-reply pattern. If you managed to digest all that, congratulations. In this chapter we'll focus on publish-subscribe, and extend &Oslash;MQ's core pub-sub pattern with higher-level patterns for performance, reliability, state distribution, and monitoring.</para>

<para>We'll cover:</para>

<itemizedlist>
  <listitem><para>How to handle too-slow subscribers (the <emphasis>Suicidal Snail</emphasis> pattern).</para></listitem>
  <listitem><para>How to design high-speed subscribers (the <emphasis>Black Box</emphasis> pattern).</para></listitem>
  <listitem><para>How to build a shared key-value cache (the <emphasis>Clone</emphasis> pattern).</para></listitem>
  <listitem><para>How to use reactors to simplify complex servers.</para></listitem>
  <listitem><para>How to use the Binary Star pattern to add failover to a server.</para></listitem>
  <listitem><para>How to monitor a publish-subscribe network (the <emphasis>Espresso</emphasis> pattern).</para></listitem>
</itemizedlist>
<sect1>
<title>Slow Subscriber Detection (Suicidal Snail Pattern)</title>
<para>A common problem you will hit when using the pub-sub pattern in real life is the slow subscriber. In an ideal world, we stream data at full speed from publishers to subscribers. In reality, subscriber applications are often written in interpreted languages, or just do a lot of work, or are just badly written, to the extent that they can't keep up with publishers.</para>

<para>How do we handle a slow subscriber? The ideal fix is to make the subscriber faster, but that might take work and time. Some of the classic strategies for handling a slow subscriber are:</para>

<itemizedlist>
  <listitem><para><emphasis role="bold">Queue messages on the publisher</emphasis>. This is what Gmail does when I don't read my email for a couple of hours. But in high-volume messaging, pushing queues upstream has the thrilling but unprofitable result of making publishers run out of memory and crash. Especially if there are lots of subscribers and it's not possible to flush to disk for performance reasons.</para></listitem>
  <listitem><para><emphasis role="bold">Queue messages on the subscriber</emphasis>. This is much better, and it's what &Oslash;MQ does by default if the network can keep up with things. If anyone's going to run out of memory and crash, it'll be the subscriber rather than the publisher, which is fair. This is perfect for "peaky" streams where a subscriber can't keep up for a while, but can catch up when the stream slows down. However it's no answer to a subscriber that's simply too slow in general.</para></listitem>
  <listitem><para><emphasis role="bold">Stop queuing new messages after a while</emphasis>. This is what Gmail does when my mailbox overflows its 7.554GB, no 7.555GB of space. New messages just get rejected or dropped. This is a great strategy from the perspective of the publisher, and it's what &Oslash;MQ does when the publisher sets a high water mark or HWM. However it still doesn't help us fix the slow subscriber. Now we just get gaps in our message stream.</para></listitem>
  <listitem><para><emphasis role="bold">Punish slow subscribers with disconnect</emphasis>. This is what Hotmail does when I don't log in for two weeks, which is why I'm on my fifteenth Hotmail account. It's a nice brutal strategy that forces subscribers to sit up and pay attention, and would be ideal, but &Oslash;MQ doesn't do this, and there's no way to layer it on top since subscribers are invisible to publisher applications.</para></listitem>
</itemizedlist>
<para>None of these classic strategies fit. So we need to get creative. Rather than disconnect the publisher, let's convince the subscriber to kill itself. This is the Suicidal Snail pattern. When a subscriber detects that it's running too slowly (where "too slowly" is presumably a configured option that really means "so slowly that if you ever get here, shout really loudly because I need to know, so I can fix this!"), it croaks and dies.</para>

<para>How can a subscriber detect this? One way would be to sequence messages (number them in order), and use a HWM at the publisher. Now, if the subscriber detects a gap (i.e. the numbering isn't consecutive), it knows something is wrong. We then tune the HWM to the "croak and die if you hit this" level.</para>

<para>There are two problems with this solution. First, if we have many publishers, how do we sequence messages? The solution is to give each publisher a unique ID and add that to the sequencing. Second, if subscribers use <literal>ZMQ_SUBSCRIBE</literal> filters, they will get gaps by definition. Our precious sequencing will be for nothing.</para>
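<para>For the cases where sequencing does apply, gap detection with a per-publisher ID can be sketched as below. The class name <literal>GapDetector</literal> and the publisher IDs are invented for this illustration. It assumes each message carries the publisher's ID and a sequence number that increases by one per message.</para>

<programlisting language="php">
&lt;?php
// Hypothetical gap detector: remembers the last sequence number seen
// from each publisher and reports when one or more numbers were skipped
class GapDetector {
	private $last = array();

	// Returns the number of messages missed since the previous
	// message from this publisher (0 means no gap)
	public function observe($publisherId, $sequence) {
		$missed = 0;
		if (isset($this-&gt;last[$publisherId])) {
			$missed = $sequence - $this-&gt;last[$publisherId] - 1;
		}
		$this-&gt;last[$publisherId] = $sequence;
		return max($missed, 0);
	}
}

$detector = new GapDetector();
$detector-&gt;observe("pub-A", 1);
$detector-&gt;observe("pub-B", 1);
$detector-&gt;observe("pub-A", 2);
if ($detector-&gt;observe("pub-A", 5) &gt; 0) {
	echo "E: gap in stream from pub-A, aborting", PHP_EOL;
}
</programlisting>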

<para>Some use-cases won't use filters, and sequencing will work for them. But a more general solution is that the publisher timestamps each message. When a subscriber gets a message it checks the time, and if the difference is more than, say, one second, it does the "croak and die" thing. Possibly firing off a squawk to some operator console first.</para>

<para>The Suicidal Snail pattern works especially well when subscribers have their own clients and service-level agreements and need to guarantee certain maximum latencies. Aborting a subscriber may not seem like a constructive way to guarantee a maximum latency, but it's the assertion model. Abort today, and the problem will be fixed. Allow late data to flow downstream, and the problem may cause wider damage and take longer to appear on the radar.</para>

<para>So here is a minimal example of a Suicidal Snail:</para>

<example id="suisnail-php">
<title>Suicidal Snail (suisnail.php)</title>
<programlisting language="php">
&lt;?php
/* Suicidal Snail 
 *
 *  @author Ian Barber &lt;ian(dot)barber(at)gmail(dot)com&gt;
*/

/*  ---------------------------------------------------------------------
 * This is our subscriber
 * It connects to the publisher and subscribes to everything. It 
 * sleeps for a short time between messages to simulate doing too
 * much work. If a message is more than 1 second late, it croaks.
 */
define("MAX_ALLOWED_DELAY", 1000); // msecs

function subscriber() {
	$context = new ZMQContext();
	
	// Subscribe to everything
	$subscriber = new ZMQSocket($context, ZMQ::SOCKET_SUB);
	$subscriber-&gt;connect("tcp://localhost:5556");
	$subscriber-&gt;setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, "");

	//  Get and process messages
	while(true) {
		//  Message is the publisher's clock in msecs
		$clock = (float) $subscriber-&gt;recv();
		//  Suicide snail logic
		if(microtime(true) * 1000 - $clock &gt; MAX_ALLOWED_DELAY) {
			echo "E: subscriber cannot keep up, aborting", PHP_EOL;
			break;
		}
		
		//  Work for 1 msec plus some random additional time
		usleep(1000 + rand(0, 1000));
	}
}


/* ---------------------------------------------------------------------
 * This is our server task
 * It publishes a time-stamped message to its pub socket every 1ms.
 */
function publisher() {
	$context = new ZMQContext();
	
	//  Prepare publisher
	$publisher = new ZMQSocket($context, ZMQ::SOCKET_PUB);
	$publisher-&gt;bind("tcp://*:5556");
	
	while(true) {
		//  Send current clock (msecs) to subscribers
		$publisher-&gt;send(sprintf("%.0f", microtime(true) * 1000));
		usleep(1000); //  1msec wait
	}
}


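/*
 * This main process simply starts a client, and a server, and then
 * waits for the client to croak.
 */
$publisherPid = pcntl_fork();
if($publisherPid == 0) {
	publisher(); 
	exit();
}

$subscriberPid = pcntl_fork();
if($subscriberPid == 0) {
	subscriber(); 
	exit();
}

//  Wait for the subscriber to croak, then stop the publisher
pcntl_waitpid($subscriberPid, $status);
posix_kill($publisherPid, SIGTERM);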
</programlisting>

</example>
<para>Notes about this example:</para>

<itemizedlist>
  <listitem><para>The message here consists simply of the current system clock as a number of milliseconds. In a realistic application you'd have at least a message header with the timestamp, and a message body with data.</para></listitem>
  <listitem><para>The example forks the subscriber and publisher from a single script, as two child processes. In reality they would be separately launched programs, probably on different machines. Forking is just convenient for the demonstration.</para></listitem>
</itemizedlist>
</sect1>
<sect1>
<title>High-speed Subscribers (Black Box Pattern)</title>
<para>A common use-case for pub-sub is distributing large data streams. For example, 'market data' coming from stock exchanges. A typical set-up would have a publisher connected to a stock exchange, taking price quotes, and sending them out to a number of subscribers. If there are a handful of subscribers, we could use TCP. If we have a larger number of subscribers, we'd probably use reliable multicast, i.e. <literal>pgm</literal>.</para>

<para>Let's imagine our feed has an average of 100,000 100-byte messages a second. That's a typical rate, after filtering market data we don't need to send on to subscribers. Now we decide to record a day's data (maybe 250 GB in 8 hours), and then replay it to a simulation network, i.e. a small group of subscribers. While 100K messages a second is easy for a &Oslash;MQ application, we want to replay <emphasis>much faster</emphasis>.</para>

<para>So we set up our architecture with a bunch of boxes, one for the publisher, and one for each subscriber. These are well-specified boxes, eight cores, twelve for the publisher. (If you're reading this in 2015, which is when the Guide is scheduled to be finished, please add a zero to those numbers.)</para>

<para>And as we pump data into our subscribers, we notice two things:</para>

<orderedlist>
  <listitem><para>When we do even the slightest amount of work with a message, it slows down our subscriber to the point where it can't catch up with the publisher again.</para></listitem>
  <listitem><para>We're hitting a ceiling, at both publisher and subscriber, of around, say, 6M messages a second, even after careful optimization and TCP tuning.</para></listitem>
</orderedlist>
<para>The first thing we have to do is break our subscriber into a multithreaded design so that we can do work with messages in one set of threads, while reading messages in another. Typically we don't want to process every message the same way. Rather, the subscriber will filter some messages, perhaps by prefix key. When a message matches some criteria, the subscriber will call a worker to deal with it. In &Oslash;MQ terms this means sending the message to a worker thread.</para>

<para>So the subscriber looks something like a queue device. We could use various sockets to connect the subscriber and workers. If we assume one-way traffic, and workers that are all identical, we can use PUSH and PULL, and delegate all the routing work to &Oslash;MQ(<xref linkend="figure-56"/>). This is the simplest and fastest approach.</para>

<figure id="figure-56">
    <title>The Simple Black Box Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig56.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The subscriber talks to the publisher over TCP or PGM. The subscriber talks to its workers, which are all in the same process, over inproc.</para>

<para>Now to break that ceiling. What happens is that the subscriber thread hits 100% of CPU, and since it is one thread, it cannot use more than one core. A single thread will always hit a ceiling, be it at 2M, 6M, or more messages per second. We want to split the work across multiple threads that can run in parallel.</para>

<para>The approach used by many high-performance products, which works here, is <emphasis>sharding</emphasis>, meaning we split the work into parallel and independent streams. E.g. half of the topic keys are in one stream, half in another(<xref linkend="figure-57"/>). We could use many streams, but performance won't scale unless we have free cores.</para>

<para>So let's see how to shard into two streams:</para>

<figure id="figure-57">
    <title>Mad Black Box Pattern</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig57.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>With two streams, working at full speed, we would configure &Oslash;MQ as follows:</para>

<itemizedlist>
  <listitem><para>Two I/O threads, rather than one.</para></listitem>
  <listitem><para>Two network interfaces (NIC), one per subscriber.</para></listitem>
  <listitem><para>Each I/O thread bound to a specific NIC.</para></listitem>
  <listitem><para>Two subscriber threads, bound to specific cores.</para></listitem>
  <listitem><para>Two SUB sockets, one per subscriber thread.</para></listitem>
  <listitem><para>The remaining cores assigned to worker threads.</para></listitem>
  <listitem><para>Worker threads connected to both subscriber PUSH sockets.</para></listitem>
</itemizedlist>
<para>Ideally, we want no more threads in our architecture than we have cores. Once we create more threads than cores, we get contention between threads, and diminishing returns. There would be no benefit, for example, in creating more I/O threads.</para>
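<para>Returning to the sharding step itself, one simple way to split topic keys into independent streams is to hash each key and take the remainder. The sketch below is only one possible scheme: the function name <literal>shard_for</literal>, the choice of <literal>crc32()</literal>, and the sample keys are all assumptions made for this illustration.</para>

<programlisting language="php">
&lt;?php
// Hypothetical sharding function: map a topic key to one of
// $streams parallel streams, numbered from 0
function shard_for($key, $streams) {
	// crc32() can return a negative integer on 32-bit builds,
	// so mask off the sign bit before taking the remainder
	return (crc32($key) &amp; 0x7fffffff) % $streams;
}

foreach (array("AAPL", "GOOG", "MSFT", "ORCL") as $key) {
	printf("%s goes to stream %d\n", $key, shard_for($key, 2));
}
</programlisting>

<para>A hash spreads keys fairly evenly without any configuration, but it does not balance load if a few keys carry most of the traffic. In that case an explicit key-to-stream table may work better.</para>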

</sect1>
<sect1>
<title>A Shared Key-Value Cache (Clone Pattern)</title>
<para>Pub-sub is like a radio broadcast, you miss everything before you join, and then how much information you get depends on the quality of your reception. Surprisingly, for engineers who are used to aiming for "perfection", this model is useful and wide-spread, because it maps perfectly to real-world distribution of information. Think of Facebook and Twitter, the BBC World Service, and the sports results.</para>

<para>However, there are also a whole lot of cases where more reliable pub-sub would be valuable, if we could do it. As we did for request-reply, let's define 'reliability' in terms of what can go wrong. Here are the classic problems with pub-sub:</para>

<itemizedlist>
  <listitem><para>Subscribers join late, so miss messages the server already sent.</para></listitem>
  <listitem><para>Subscriber connections are slow, and can lose messages during that time.</para></listitem>
  <listitem><para>Subscribers go away, and lose messages while they are away.</para></listitem>
</itemizedlist>
<para>Less often, we see problems like these:</para>

<itemizedlist>
  <listitem><para>Subscribers can crash, and restart, and lose whatever data they already received.</para></listitem>
  <listitem><para>Subscribers can fetch messages too slowly, so queues build up and then overflow.</para></listitem>
  <listitem><para>Networks can become overloaded and drop data (specifically, for PGM).</para></listitem>
  <listitem><para>Networks can become too slow, so publisher-side queues overflow, and publishers crash.</para></listitem>
</itemizedlist>
<para>A lot more can go wrong but these are the typical failures we see in a realistic system.</para>

<para>We've already solved some of these, such as the slow subscriber, which we handle with the Suicidal Snail pattern. But for the rest, it would be nice to have a generic, reusable framework for reliable pub-sub.</para>

<para>The difficulty is that we have no idea what our target applications actually want to do with their data. Do they filter it, and process only a subset of messages? Do they log the data somewhere for later reuse? Do they distribute the data further to workers? There are dozens of plausible scenarios, and each will have its own ideas about what reliability means and how much it's worth in terms of effort and performance.</para>

<para>So we'll build an abstraction that we can implement once, and then reuse for many applications. This abstraction is a <emphasis role="bold">shared key-value cache</emphasis>, which stores a set of blobs indexed by unique keys.</para>

<para>Don't confuse this with <emphasis>distributed hash tables</emphasis>, which solve the wider problem of connecting peers in a distributed network, or with <emphasis>distributed key-value tables</emphasis>, which act like non-SQL databases. All we will build is a system that reliably clones some in-memory state from a server to a set of clients. We want to:</para>

<itemizedlist>
  <listitem><para>Let a client join the network at any time, and reliably get the current server state.</para></listitem>
  <listitem><para>Let any client update the key-value cache (inserting new key-value pairs, updating existing ones, or deleting them).</para></listitem>
  <listitem><para>Reliably propagate changes to all clients, and do this with minimum latency overhead.</para></listitem>
  <listitem><para>Handle very large numbers of clients, e.g. tens of thousands or more.</para></listitem>
</itemizedlist>
<para>The key aspect of the Clone pattern is that clients talk back to servers, which is more than we do in a simple pub-sub dialog. This is why I use the terms 'server' and 'client' instead of 'publisher' and 'subscriber'. We'll use pub-sub as the core of Clone but it is a bit more than that.</para>

<sect2>
<title>Distributing Key-Value Updates</title>
<para>We'll develop Clone in stages, solving one problem at a time. First, let's look at how to distribute key-value updates from a server to a set of clients. We'll take our weather server from Basics<xref linkend="basics"/> and refactor it to send messages as key-value pairs(<xref linkend="figure-58"/>). We'll modify our client to store these in a hash table.</para>

<figure id="figure-58">
    <title>Simplest Clone Model</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig58.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>This is the server:</para>

<example id="clonesrv1-php">
<title>Clone server, Model One (clonesrv1.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here is the client:</para>

<example id="clonecli1-php">
<title>Clone client, Model One (clonecli1.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Some notes about this code:</para>

<itemizedlist>
  <listitem><para>All the hard work is done in a <emphasis role="bold">kvmsg</emphasis> class. This class works with key-value message objects, which are multi-part &Oslash;MQ messages structured as three frames: a key (a &Oslash;MQ string), a sequence number (64-bit value, in network byte order), and a binary body (holds everything else).</para></listitem>
  <listitem><para>The server generates messages with a randomized 4-digit key, which lets us simulate a large but not enormous hash table (10K entries).</para></listitem>
  <listitem><para>The server does a 200 millisecond pause after binding its socket. This is to prevent "slow joiner syndrome" where the subscriber loses messages as it connects to the server's socket. We'll remove that in later models.</para></listitem>
  <listitem><para>We'll use the terms 'publisher' and 'subscriber' in the code to refer to sockets. This will help later when we have multiple sockets doing different things.</para></listitem>
</itemizedlist>
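<para>To make the three-frame layout concrete, here is a small sketch of how such a message could be encoded and decoded. It is not the kvmsg class: the function names <literal>kv_encode</literal> and <literal>kv_decode</literal> are invented for this illustration, and the code assumes a 64-bit PHP build so that an integer can hold the full sequence number.</para>

<programlisting language="php">
&lt;?php
// Hypothetical helpers, not the kvmsg class itself
// Assumes a 64-bit PHP build so that integers hold 64 bits

// Encode a key-value message as three frames:
// key, 8-byte sequence number in network byte order, body
function kv_encode($key, $sequence, $body) {
	$high = ($sequence &gt;&gt; 32) &amp; 0xFFFFFFFF;
	$low = $sequence &amp; 0xFFFFFFFF;
	return array($key, pack('NN', $high, $low), $body);
}

// Decode three frames back into key, sequence number, and body
function kv_decode(array $frames) {
	list($key, $packed, $body) = $frames;
	$parts = unpack('Nhigh/Nlow', $packed);
	$sequence = ($parts['high'] &lt;&lt; 32) | $parts['low'];
	return array($key, $sequence, $body);
}

$frames = kv_encode("4711", 42, "some value");
printf("Sequence frame is %d bytes\n", strlen($frames[1]));
list($key, $sequence, $body) = kv_decode($frames);
printf("%s = %s (update %d)\n", $key, $body, $sequence);
</programlisting>

<para>The encoded array has the shape that <literal>sendMulti()</literal> takes and <literal>recvMulti()</literal> returns, so the same helpers would work on both the publisher and subscriber sockets.</para>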
<para>Here is the kvmsg class, in the simplest form that works for now:</para>

<example id="kvsimple-php">
<title>Key-value message class (kvsimple.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>We'll make a more sophisticated kvmsg class later, for use in real applications.</para>

<para>Both the server and client maintain hash tables, but this first model only works properly if we start all clients before the server, and the clients never crash. That's not 'reliability'.</para>

</sect2>
<sect2>
<title>Getting a Snapshot</title>
<para>In order to allow a late (or recovering) client to catch up with a server it has to get a snapshot of the server's state. Just as we've reduced "message" to mean "a sequenced key-value pair", we can reduce "state" to mean "a hash table". To get the server state, a client opens a DEALER socket and asks for it explicitly(<xref linkend="figure-59"/>).</para>

<figure id="figure-59">
    <title>State Replication</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig59.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>To make this work, we have to solve the timing problem. Getting a state snapshot will take a certain time, possibly fairly long if the snapshot is large. We need to correctly apply updates to the snapshot. But the server won't know when to start sending us updates. One way would be to start subscribing, get a first update, and then ask for "state for update N". This would require the server storing one snapshot for each update, which isn't practical.</para>

<para>So we will do the synchronization in the client, as follows:</para>

<itemizedlist>
  <listitem><para>The client first subscribes to updates and then makes a state request. This guarantees that the state is going to be newer than the oldest update it has.</para></listitem>
  <listitem><para>The client waits for the server to reply with state, and meanwhile queues all updates. It does this simply by not reading them: &Oslash;MQ keeps them queued on the socket queue, since we don't set a HWM.</para></listitem>
  <listitem><para>When the client receives its state update, it begins once again to read updates. However, it discards any updates that are not newer than the state update. So if the state update includes updates up to 200, the client discards everything up to and including update 200, and applies updates from 201 onwards.</para></listitem>
  <listitem><para>The client then applies updates to its own state snapshot.</para></listitem>
</itemizedlist>
<para>It's a simple model that exploits &Oslash;MQ's own internal queues. Here's the server:</para>

<example id="clonesrv2-php">
<title>Clone server, Model Two (clonesrv2.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here is the client:</para>

<example id="clonecli2-php">
<title>Clone client, Model Two (clonecli2.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Some notes about this code:</para>

<itemizedlist>
  <listitem><para>The server uses two threads, for simpler design. One thread produces random updates, and the second thread handles state. The two communicate across PAIR sockets. You might like to use SUB sockets but you'd hit the "slow joiner" problem where the subscriber would randomly miss some messages while connecting. PAIR sockets let us explicitly synchronize the two threads.</para></listitem>
  <listitem><para>We set a HWM on the updates socket pair, since hash table insertions are relatively slow. Without this, the server runs out of memory. On <literal>inproc</literal> connections, the real HWM is the sum of the HWM of <emphasis>both</emphasis> sockets, so we set the HWM on each socket.</para></listitem>
  <listitem><para>The client is really simple. In C, under 60 lines of code. A lot of the heavy lifting is done in the kvmsg class, but still, the basic Clone pattern is easier to implement than it seemed at first.</para></listitem>
  <listitem><para>We don't use anything fancy for serializing the state. The hash table holds a set of kvmsg objects, and the server sends these, as a batch of messages, to the client requesting state. If multiple clients request state at once, each will get a different snapshot.</para></listitem>
  <listitem><para>We assume that the client has exactly one server to talk to. The server <emphasis role="bold">must</emphasis> be running; we do not try to solve the question of what happens if the server crashes.</para></listitem>
</itemizedlist>
<para>Right now, these two programs don't do anything real, but they correctly synchronize state. It's a neat example of how to mix different patterns: PAIR-over-inproc, PUB-SUB, and ROUTER-DEALER.</para>
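<para>The client-side synchronization rule described earlier in this section can be sketched without any sockets at all. The sketch below assumes the queued updates have already been read into an array. The function name <literal>clone_catch_up</literal> and the array field names are hypothetical:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch of the client-side rule: start from the snapshot, then replay
// the updates that queued up meanwhile, skipping any that the snapshot
// already covers. Function and field names are hypothetical.

function clone_catch_up(array $snapshot, $snapshotSequence, array $queuedUpdates)
{
    $state = $snapshot;              // key => value, as sent by the server
    $sequence = $snapshotSequence;   // last update included in the snapshot

    foreach ($queuedUpdates as $update) {
        if ($update['sequence'] <= $sequence) {
            continue;                // already reflected in the snapshot
        }
        $state[$update['key']] = $update['body'];
        $sequence = $update['sequence'];
    }
    return array($state, $sequence);
}

// The snapshot covers updates up to 200; updates 199 and 200 were
// queued on the subscriber socket while the snapshot was in flight.
$snapshot = array('alpha' => 'old', 'beta' => 'kept');
$queued = array(
    array('sequence' => 199, 'key' => 'beta',  'body' => 'stale'),
    array('sequence' => 200, 'key' => 'alpha', 'body' => 'old'),
    array('sequence' => 201, 'key' => 'alpha', 'body' => 'new'),
);
list($state, $sequence) = clone_catch_up($snapshot, 200, $queued);
]]></programlisting>

<para>After the call, only update 201 has been applied: <literal>alpha</literal> holds the new value, <literal>beta</literal> is untouched, and the client's sequence number is 201.</para>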

</sect2>
<sect2>
<title>Republishing Updates</title>
<para>In our second model, changes to the key-value cache came from the server itself. This is a centralized model, useful for example if we have a central configuration file we want to distribute, with local caching on each node. A more interesting model takes updates from clients, not the server. The server thus becomes a stateless broker. This gives us some benefits:</para>

<itemizedlist>
  <listitem><para>We're less worried about the reliability of the server. If it crashes, we can start a new instance, and feed it new values.</para></listitem>
  <listitem><para>We can use the key-value cache to share knowledge between dynamic peers.</para></listitem>
</itemizedlist>
<para>Updates from clients go via a PUSH-PULL socket flow from client to server (<xref linkend="figure-60"/>).</para>

<figure id="figure-60">
    <title>Republishing Updates</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig60.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Why don't we allow clients to publish updates directly to other clients? While this would reduce latency, it makes it impossible to assign ascending unique sequence numbers to messages. The server can do this. There's a more subtle second reason. In many applications it's important that updates have a single order, across many clients. Forcing all updates through the server ensures that they have the same order when they finally get to clients.</para>

<para>With unique sequencing, clients can detect the nastier failures - network congestion and queue overflow. If a client discovers that its incoming message stream has a hole, it can take action. It seems sensible that the client contact the server and ask for the missing messages, but in practice that isn't useful. If there are holes, they're caused by network stress, and adding more stress to the network will make things worse. All the client can really do is warn its users "Unable to continue", and stop, and not restart until someone has manually checked the cause of the problem.</para>
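<para>To make the idea of "a hole in the stream" concrete, here is a small sketch in which a server stamps updates with ascending sequence numbers and a client counts how many it missed. The helper <literal>detect_gap</literal> is a name invented for this sketch, and the lost update is simulated by removing it from an array:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch: the server stamps each update with the next sequence number,
// and the client checks that the numbers arrive without holes.
// detect_gap() is a hypothetical helper.

function detect_gap($lastSequence, $incomingSequence)
{
    // Number of updates missing between the last one seen and this one
    $missing = $incomingSequence - $lastSequence - 1;
    return $missing > 0 ? $missing : 0;
}

// Server side: assign ascending, unique sequence numbers
$serverSequence = 0;
$stream = array();
foreach (array('a', 'b', 'c', 'd') as $body) {
    $stream[] = array('sequence' => ++$serverSequence, 'body' => $body);
}
unset($stream[2]);   // simulate update 3 being lost to queue overflow

// Client side: track the last sequence seen and count the holes
$last = 0;
$lost = 0;
foreach ($stream as $update) {
    $lost += detect_gap($last, $update['sequence']);
    $last = $update['sequence'];
}
printf("last=%d lost=%d\n", $last, $lost);
]]></programlisting>

<para>What the client does once <literal>$lost</literal> is non-zero is a policy decision. Following the reasoning above, stopping and raising an alarm is probably safer than asking the server to resend.</para>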

<para>We'll now generate state updates in the client. Here's the server:</para>

<example id="clonesrv3-php">
<title>Clone server, Model Three (clonesrv3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here is the client:</para>

<example id="clonecli3-php">
<title>Clone client, Model Three (clonecli3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Some notes about this code:</para>

<itemizedlist>
  <listitem><para>The server has collapsed to a single task. It manages a PULL socket for incoming updates, a ROUTER socket for state requests, and a PUB socket for outgoing updates.</para></listitem>
  <listitem><para>The client uses a simple tickless timer to send a random update to the server once a second. In a real implementation we would drive updates from application code.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Clone Subtrees</title>
<para>A realistic key-value cache will get large, and clients will usually be interested only in parts of the cache. Working with a subtree is fairly simple. The client has to tell the server the subtree when it makes a state request, and it has to specify the same subtree when it subscribes to updates.</para>

<para>There are a couple of common syntaxes for trees. One is the "path hierarchy", and another is the "topic tree". These look like:</para>

<itemizedlist>
  <listitem><para>Path hierarchy: "/some/list/of/paths"</para></listitem>
  <listitem><para>Topic tree: "some.list.of.topics"</para></listitem>
</itemizedlist>
<para>We'll use the path hierarchy, and extend our client and server so that a client can work with a single subtree. Working with multiple subtrees is not much more difficult. We won't do that here, but it's a trivial extension.</para>
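<para>On the server side, answering a state request for a subtree comes down to a prefix match on the key. This is the same kind of match a SUB socket applies to its subscriptions. A minimal sketch, with hypothetical function names and made-up keys:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch: select the part of the hash table that belongs to a subtree.
// in_subtree() and subtree_snapshot() are hypothetical helpers.

function in_subtree($key, $subtree)
{
    // Plain prefix match; an empty subtree matches every key
    return strncmp($key, $subtree, strlen($subtree)) === 0;
}

function subtree_snapshot(array $state, $subtree)
{
    $result = array();
    foreach ($state as $key => $value) {
        if (in_subtree($key, $subtree)) {
            $result[$key] = $value;
        }
    }
    return $result;
}

$state = array(
    '/client/alpha'  => '1',
    '/client/beta'   => '2',
    '/server/status' => 'ok',
);
$view = subtree_snapshot($state, '/client/');
]]></programlisting>

<para>Note the trailing slash in <literal>/client/</literal>. Without it, a pure prefix match would also pick up keys such as <literal>/clients/...</literal>, which is rarely what you want.</para>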

<para>Here's the server, a small variation on Model Three:</para>

<example id="clonesrv4-php">
<title>Clone server, Model Four (clonesrv4.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here is the client:</para>

<example id="clonecli4-php">
<title>Clone client, Model Four (clonecli4.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
</sect2>
<sect2>
<title>Ephemeral Values</title>
<para>An ephemeral value is one that expires dynamically. If you think of Clone being used for a DNS-like service, then ephemeral values would let you do dynamic DNS. A node joins the network, publishes its address, and refreshes this regularly. If the node dies, its address eventually gets removed.</para>

<para>The usual abstraction for ephemeral values is to attach them to a "session", and delete them when the session ends. In Clone, sessions would be defined by clients, and would end if the client died.</para>

<para>The simpler alternative to using sessions is to define every ephemeral value with a "time to live" that tells the server when to expire the value. Clients then refresh values, and if they don't, the values expire.</para>

<para>I'm going to implement that simpler model because we don't know yet that it's worth making a more complex one. The difference is really in performance. If clients have a handful of ephemeral values, it's fine to set a TTL on each one. If clients use masses of ephemeral values, it's more efficient to attach them to sessions, and expire them in bulk.</para>

<para>First off, we need a way to encode the TTL in the key-value message. We could add a frame. The problem with using frames for properties is that each time we want to add a new property, we have to change the structure of our kvmsg class. It breaks compatibility. So let's add a 'properties' frame to the message, and code to let us get and put property values.</para>

<para>Next, we need a way to say, "delete this value". Up to now servers and clients have always blindly inserted or updated new values into their hash table. We'll say that if the value is empty, that means "delete this key".</para>
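<para>Here is one way the two rules could look when applied to a hash table: an empty value deletes the key, and a time to live turns into an expiry time. This is a sketch under assumptions. The function names are invented, the clock is passed in as a plain number of seconds so that the logic is easy to follow, and treating a TTL of zero as "never expires" is a choice made for this sketch:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch: an empty value deletes the key, and a 'ttl' (in seconds)
// gives the entry an expiry time. All names here are hypothetical.

function kv_store(array &$table, $key, $value, $ttl, $now)
{
    if ($value === '') {
        unset($table[$key]);         // empty value means "delete this key"
        return;
    }
    $table[$key] = array(
        'value'   => $value,
        // Assumption of this sketch: a ttl of 0 means "never expires"
        'expires' => $ttl > 0 ? $now + $ttl : null,
    );
}

function kv_flush_expired(array &$table, $now)
{
    $expired = array();
    foreach ($table as $key => $entry) {
        if ($entry['expires'] !== null && $now >= $entry['expires']) {
            $expired[] = $key;
        }
    }
    foreach ($expired as $key) {
        unset($table[$key]);
    }
    return $expired;                 // keys that were removed
}

$table = array();
kv_store($table, 'node-a', '10.0.0.1', 30, 1000);
kv_store($table, 'node-b', '10.0.0.2', 0, 1000);
kv_store($table, 'node-c', '10.0.0.3', 5, 1000);
kv_store($table, 'node-c', '', 0, 1001);      // explicit delete
$gone = kv_flush_expired($table, 1030);       // node-a's 30 seconds are up
]]></programlisting>

<para>A server built this way would also need to tell its clients about each expired key, for instance by publishing an update with an empty value, so that their copies of the table stay in step.</para>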

<para>Here's a more complete version of the kvmsg class, which implements a 'properties' frame (and adds a UUID frame, which we'll need later on). It also handles empty values by deleting the key from the hash, if necessary:</para>

<example id="kvmsg-php">
<title>Key-value message class - full (kvmsg.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
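<para>A properties frame needs some wire format. The sketch below assumes one <literal>name=value</literal> pair per line, which is simple to parse and to extend. Both the format and the function names <literal>props_encode</literal> and <literal>props_decode</literal> are assumptions made for illustration:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch: serialize message properties into a single frame.
// Assumed format: one "name=value" pair per line.

function props_encode(array $props)
{
    $frame = '';
    foreach ($props as $name => $value) {
        $frame .= $name . '=' . $value . "\n";
    }
    return $frame;
}

function props_decode($frame)
{
    $props = array();
    foreach (explode("\n", $frame) as $line) {
        if ($line === '') {
            continue;                // skip the empty tail after the last "\n"
        }
        $parts = explode('=', $line, 2);
        if (count($parts) === 2) {
            $props[$parts[0]] = $parts[1];
        }
    }
    return $props;
}

$frame = props_encode(array('ttl' => '30', 'origin' => 'client-1'));
$props = props_decode($frame);
]]></programlisting>

<para>The format has obvious limits: names cannot contain <literal>=</literal>, and neither names nor values can contain a newline. New properties can be added without touching the frame structure, which is the point of the exercise.</para>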
<para>The Model Five client is almost identical to Model Four. Diff is your friend. It uses the full kvmsg class instead of kvsimple, and sets a randomized 'ttl' property (measured in seconds) on each message:</para>

<programlisting language="c">
kvmsg_set_prop (kvmsg, "ttl", "%d", randof (30));
</programlisting>

<para>The Model Five server has totally changed. Instead of a poll loop, we're now using a reactor. This just makes it simpler to mix timers and socket events. Unfortunately in C the reactor style is more verbose. Your mileage will vary in other languages. But reactors seem to be a better way of building more complex &Oslash;MQ applications. Here's the server:</para>

<example id="clonesrv5-php">
<title>Clone server, Model Five (clonesrv5.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
</sect2>
<sect2>
<title>Clone Server Reliability</title>
<para>Clone models one to five are relatively simple. We're now going to get into unpleasantly complex territory that has me getting up for another espresso. You should appreciate that making "reliable" messaging is complex enough that you always need to ask, "do we actually need this?" before jumping into it. If you can get away with unreliable, or "good enough" reliability, you can make a huge win in terms of cost and complexity. Sure, you may lose some data now and then. It is often a good trade-off. Having said that, and (sips) since the espresso is really good, let's jump in!</para>

<para>As you play with model three, you'll stop and restart the server. It might look like it recovers, but of course it's applying updates to an empty state, instead of the proper current state. Any new client joining the network will get just the latest updates, instead of all of them. So let's work out a design for making Clone work despite server failures.</para>

<para>Let's list the failures we want to be able to handle:</para>

<itemizedlist>
  <listitem><para>Clone server process crashes and is automatically or manually restarted. The process loses its state and has to get it back from somewhere.</para></listitem>
  <listitem><para>Clone server machine dies and is off-line for a significant time. Clients have to switch to an alternate server somewhere.</para></listitem>
  <listitem><para>Clone server process or machine gets disconnected from the network, e.g. a switch dies. It may come back at some point, but in the meantime clients need an alternate server.</para></listitem>
</itemizedlist>
<para>Our first step is to add a second server. We can use the Binary Star pattern from Reliable Request-Reply Patterns (<xref linkend="reliable-request-reply"/>) to organize these into primary and backup. Binary Star is a reactor, so it's useful that we already refactored the last server model into a reactor style.</para>

<para>We need to ensure that updates are not lost if the primary server crashes. The simplest technique is to send them to both servers.</para>

<para>The backup server can then act as a client, and keep its state synchronized by receiving updates as all clients do. It'll also get new updates from clients. It can't yet store these in its hash table, but it can hold onto them for a while.</para>

<para>So, Model Six introduces these changes over Model Five:</para>

<itemizedlist>
  <listitem><para>We use a pub-sub flow instead of a push-pull flow for client updates (to the servers). The reasons: push sockets will block if there is no recipient, and they round-robin, so we'd need to open two of them. We'll bind the servers' SUB sockets and connect the clients' PUB sockets to them. This takes care of fanning out from one client to two servers.</para></listitem>
  <listitem><para>We add heartbeats to server updates (to clients), so that a client can detect when the primary server has died. It can then switch over to the backup server.</para></listitem>
  <listitem><para>We connect the two servers using the Binary Star <literal>bstar</literal> reactor class. Binary Star relies on the clients to 'vote' by making an explicit request to the server they consider "active". We'll use snapshot requests for this.</para></listitem>
  <listitem><para>We make all update messages uniquely identifiable by adding a UUID field. The client generates this, and the server propagates it back on re-published updates.</para></listitem>
  <listitem><para>The passive server keeps a "pending list" of updates that it has received from clients, but not yet from the active server, or that it has received from the active server, but not yet from clients. The list is ordered from oldest to newest, so that it is easy to remove updates off the head.</para></listitem>
</itemizedlist>
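<para>The pending list is the least obvious of these changes, so here is a sketch of how a passive server might maintain it. Updates are matched by UUID, whichever copy arrives first. All function names are hypothetical, and the server's state is reduced to a plain array with a pending list, a hash table, and a sequence number:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch of a passive server's pending list. All names are hypothetical;
// an update is an array with 'uuid', 'key', 'body' and 'sequence'.

// If the update is on the pending list, remove it and return true
function was_pending(array &$server, $uuid)
{
    foreach ($server['pending'] as $index => $held) {
        if ($held['uuid'] === $uuid) {
            unset($server['pending'][$index]);
            return true;
        }
    }
    return false;
}

// Update received directly from a client
function on_client_update(array &$server, array $update)
{
    if (!was_pending($server, $update['uuid'])) {
        $server['pending'][] = $update;   // wait for the active server's copy
    }
}

// Update republished by the active server, now carrying a sequence number
function on_active_update(array &$server, array $update)
{
    if (!was_pending($server, $update['uuid'])) {
        $server['pending'][] = $update;   // wait for the client's copy
    }
    if ($update['sequence'] > $server['sequence']) {
        $server['sequence'] = $update['sequence'];
        $server['table'][$update['key']] = $update['body'];
    }
}

// Fail-over: apply whatever is still pending, oldest first
function take_over(array &$server)
{
    while (count($server['pending']) > 0) {
        $update = array_shift($server['pending']);
        $server['sequence']++;
        $server['table'][$update['key']] = $update['body'];
    }
}

$server = array('pending' => array(), 'table' => array(), 'sequence' => 0);
$u1 = array('uuid' => 'u1', 'key' => 'a', 'body' => '1', 'sequence' => 0);
$u2 = array('uuid' => 'u2', 'key' => 'b', 'body' => '2', 'sequence' => 0);

on_client_update($server, $u1);                                    // held
on_active_update($server, array_merge($u1, array('sequence' => 1))); // cleared
on_client_update($server, $u2);       // held; the active server then dies
take_over($server);                   // u2 is applied as update number 2
]]></programlisting>

<para>One simplification to be aware of: on fail-over this sketch re-applies every held update, including any that arrived from the active server first and are already in the table. That is harmless for the values but does consume extra sequence numbers.</para>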
<para>It's useful to design the client logic as a finite state machine. The client cycles through three states:</para>

<itemizedlist>
  <listitem><para>The client opens and connects its sockets, and then requests a snapshot from the first server. To avoid request storms, it will ask any given server only twice. One request might get lost, that'd be bad luck. Two getting lost would be carelessness.</para></listitem>
  <listitem><para>The client waits for a reply (snapshot data) from the current server, and if it gets it, it stores it. If there is no reply within some timeout, it fails over to the next server.</para></listitem>
  <listitem><para>When the client has gotten its snapshot, it waits for and processes updates. Again, if it doesn't hear anything from the server within some timeout, it fails over to the next server.</para></listitem>
</itemizedlist>
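<para>The three states above map naturally onto a small transition table. The sketch below uses invented state and event names, and leaves out the "ask each server only twice" rule to stay short. It only shows how a timeout in either waiting state sends the client back to the start with the next server selected:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch of the client state machine. State and event names are
// assumptions made for this illustration.

define('STATE_INITIAL', 'initial');   // about to request a snapshot
define('STATE_SYNCING', 'syncing');   // waiting for the snapshot reply
define('STATE_ACTIVE',  'active');    // snapshot stored, processing updates

function clone_next_state($state, $event)
{
    $transitions = array(
        STATE_INITIAL => array('snapshot_requested' => STATE_SYNCING),
        STATE_SYNCING => array(
            'snapshot_complete' => STATE_ACTIVE,
            'timeout'           => STATE_INITIAL,   // fail over
        ),
        STATE_ACTIVE  => array('timeout' => STATE_INITIAL),   // fail over
    );
    if (!isset($transitions[$state][$event])) {
        return $state;                // events that don't apply are ignored
    }
    return $transitions[$state][$event];
}

function clone_next_server($current, $serverCount)
{
    return ($current + 1) % $serverCount;   // round-robin over known servers
}

$state = STATE_INITIAL;
$server = 0;                          // index of the server we talk to
foreach (array('snapshot_requested', 'snapshot_complete', 'timeout') as $event) {
    $next = clone_next_state($state, $event);
    if ($event === 'timeout' && $next === STATE_INITIAL) {
        $server = clone_next_server($server, 2);
    }
    $state = $next;
}
]]></programlisting>

<para>After this sequence of events the client is back in its initial state and pointing at the second server, ready to request a fresh snapshot.</para>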
<para>The client loops forever. It's quite likely during startup or fail-over that some clients may be trying to talk to the primary server while others are trying to talk to the backup server. The Binary Star state machine handles this (<xref linkend="figure-61"/>), hopefully accurately. (One of the joys of making designs like this is that we cannot prove they are right, we can only prove them wrong. So it's like a guy falling off a tall building. So far, so good... so far, so good...)</para>

<figure id="figure-61">
    <title>Clone Client Finite State Machine</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig61.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>Fail-over happens as follows:</para>

<itemizedlist>
  <listitem><para>The client detects that the primary server is no longer sending heartbeats, and concludes that it has died. The client connects to the backup server and requests a new state snapshot.</para></listitem>
  <listitem><para>The backup server starts to receive snapshot requests from clients, detects that the primary server has gone, and takes over as primary.</para></listitem>
  <listitem><para>The backup server applies its pending list to its own hash table, and then starts to process state snapshot requests.</para></listitem>
</itemizedlist>
<para>When the primary server comes back on-line, it will:</para>

<itemizedlist>
  <listitem><para>Start up as passive server, and connect to the backup server as a Clone client.</para></listitem>
  <listitem><para>Start to receive updates from clients, via its SUB socket.</para></listitem>
</itemizedlist>
<para>We make some assumptions:</para>

<itemizedlist>
  <listitem><para>That at least one server will keep running. If both servers crash, we lose all server state and there's no way to recover it.</para></listitem>
  <listitem><para>That multiple clients do not update the same hash keys, at the same time. Client updates will arrive at the two servers in a different order. So, the backup server may apply updates from its pending list in a different order than the primary server would or did. Updates from one client will always arrive in the same order on both servers, so that is safe.</para></listitem>
</itemizedlist>
<para>So the architecture for our high-availability server pair using the Binary Star pattern has two servers and a set of clients that talk to both servers (<xref linkend="figure-62"/>).</para>

<figure id="figure-62">
    <title>High-availability Clone Server Pair</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig62.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>As a first step to building this, we're going to refactor the client as a reusable class. This is partly for fun (writing asynchronous classes with &Oslash;MQ is like an exercise in elegance), but mainly because we want Clone to be really easy to plug into random applications. Since resilience depends on clients behaving correctly, it's much easier to guarantee this when there's a reusable client API. When we start to handle fail-over in clients, it does get a little complex (imagine mixing a Freelance client with a Clone client). So, reusability ahoy!</para>

<para>My usual design approach is to first design an API that feels right, then to implement that. So, we start by taking the clone client, and rewriting it to sit on top of some presumed class API called <emphasis role="bold">clone</emphasis>. Turning random code into an API means defining a reasonably stable and abstract contract with applications. For example, in Model Five, the client opened three separate sockets to the server, using endpoints that were hard-coded in the source. We could make an API with three methods, like this:</para>

<programlisting language="c">
//  Specify endpoints for each socket we need
clone_subscribe (clone, "tcp://localhost:5556");
clone_snapshot  (clone, "tcp://localhost:5557");
clone_updates   (clone, "tcp://localhost:5558");

//  Times two, since we have two servers
clone_subscribe (clone, "tcp://localhost:5566");
clone_snapshot  (clone, "tcp://localhost:5567");
clone_updates   (clone, "tcp://localhost:5568");
</programlisting>

<para>But this is both verbose and fragile. It's not a good idea to expose the internals of a design to applications. Today, we use three sockets. Tomorrow, two, or four. Do we really want to go and change every application that uses the clone class? So to hide these sausage factory details, we make a small abstraction, like this:</para>

<programlisting language="c">
//  Specify primary and backup servers
clone_connect (clone, "tcp://localhost", "5551");
clone_connect (clone, "tcp://localhost", "5561");
</programlisting>

<para>Which has the advantage of simplicity (one server sits at one endpoint) but has an impact on our internal design. We now need to somehow turn that single endpoint into three endpoints. One way would be to bake the knowledge "client and server talk over three consecutive ports" into our client-server protocol. Another way would be to get the two missing endpoints from the server. We'll take the simplest way, which is:</para>

<itemizedlist>
  <listitem><para>The server state router (ROUTER) is at port P.</para></listitem>
  <listitem><para>The server updates publisher (PUB) is at port P + 1.</para></listitem>
  <listitem><para>The server updates subscriber (SUB) is at port P + 2.</para></listitem>
</itemizedlist>
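<para>As a sketch, the port convention above turns one address and one service port into the three endpoints a client needs. The function name <literal>clone_endpoints</literal> and the array keys are invented for this illustration:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch: derive the three endpoints from one address and a base port P.
// clone_endpoints() and its array keys are hypothetical.

function clone_endpoints($address, $service)
{
    $port = (int) $service;
    return array(
        // Server's ROUTER socket, for snapshot requests
        'snapshot'          => sprintf('%s:%d', $address, $port),
        // Server's PUB socket; the client subscribes here
        'updates_from_server' => sprintf('%s:%d', $address, $port + 1),
        // Server's SUB socket; the client publishes its updates here
        'updates_to_server'   => sprintf('%s:%d', $address, $port + 2),
    );
}

$primary = clone_endpoints('tcp://localhost', '5551');
$backup  = clone_endpoints('tcp://localhost', '5561');
]]></programlisting>

<para>Baking the offsets into one helper keeps the convention in a single place, so changing the number of sockets later does not ripple out into applications.</para>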
<para>The clone class has the same structure as the flcliapi class from Reliable Request-Reply Patterns (<xref linkend="reliable-request-reply"/>). It consists of two parts:</para>

<itemizedlist>
  <listitem><para>An asynchronous clone agent that runs in a background thread. The agent handles all network I/O, talking to servers in real-time, no matter what the application is doing.</para></listitem>
  <listitem><para>A synchronous 'clone' class which runs in the caller's thread. When you create a clone object, that automatically launches an agent thread, and when you destroy a clone object, it kills the agent thread.</para></listitem>
</itemizedlist>
<para>The frontend class talks to the agent class over an <literal>inproc</literal> 'pipe' socket. In C, the CZMQ thread layer creates this pipe automatically for us as it starts an "attached thread". This is a natural pattern for multithreading over &Oslash;MQ.</para>

<para>Without &Oslash;MQ, this kind of asynchronous class design would be weeks of really hard work. With &Oslash;MQ, it was a day or two of work. The results are kind of complex, given the simplicity of the Clone protocol it's actually running. There are some reasons for this. We could turn this into a reactor, but that'd make it harder to use in applications. So the API looks a bit like a key-value table that magically talks to some servers:</para>

<programlisting language="c">
clone_t *clone_new (void);
void clone_destroy (clone_t **self_p);
void clone_connect (clone_t *self, char *address, char *service);
void clone_set (clone_t *self, char *key, char *value);
char *clone_get (clone_t *self, char *key);
</programlisting>

<para>So here is Model Six of the clone client, which has now become just a thin shell using the clone class:</para>

<example id="clonecli6-php">
<title>Clone client, Model Six (clonecli6.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here is the actual clone class implementation:</para>

<example id="clone-php">
<title>Clone class (clone.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Finally, here is the sixth and last model of the clone server:</para>

<example id="clonesrv6-php">
<title>Clone server, Model Six (clonesrv6.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>This main program is only a few hundred lines of code, but it took some time to get working. To be accurate, building Model Six took about a full week of "sweet god, this is just too complex for the Guide" hacking. We've assembled pretty much everything and the kitchen sink into this small application. We have fail-over, ephemeral values, subtrees, and so on. What surprised me was that the upfront design was pretty accurate. But the details of writing and debugging so many socket flows are something special. Here's how I made this work:</para>

<itemizedlist>
  <listitem><para>By using reactors (bstar, on top of zloop), which remove a lot of grunt-work from the code, and leave what remains simpler and more obvious. The whole server runs as one thread, so there's no inter-thread weirdness going on. Just pass a structure pointer ('self') around to all handlers, which can do their thing happily. One nice side-effect of using reactors is that code, being less tightly integrated into a poll loop, is much easier to reuse. Large chunks of Model Six are taken from Model Five.</para></listitem>
  <listitem><para>By building it piece by piece, and getting each piece working <emphasis role="bold">properly</emphasis> before going on to the next one. Since there are four or five main socket flows, that meant quite a lot of debugging and testing. I debug just by printing stuff to the console (e.g. dumping messages). There's no sense in actually opening a debugger for this kind of work.</para></listitem>
  <listitem><para>By always testing under Valgrind, so that I'm sure there are no memory leaks. In C this is a major concern, since you can't delegate to some garbage collector. Using proper and consistent abstractions like kvmsg and CZMQ helps enormously.</para></listitem>
</itemizedlist>
<para>I'm sure the code still has flaws which kind readers will spend weekends debugging and fixing for me. I'm happy enough with this model to use it as the basis for real applications.</para>

<para>To test the sixth model, start the primary server and backup server, and a set of clients, in any order. Then kill and restart one of the servers, randomly, and keep doing this. If the design and code are accurate, clients will continue to get the same stream of updates from whatever server is currently active.</para>

</sect2>
<sect2>
<title>Clone Protocol Specification</title>
<para>After this much work to build reliable pub-sub, we want some guarantee that we can safely build applications to exploit the work. A good start is to write up the protocol. This lets us make implementations in other languages and lets us improve the design on paper, rather than hands-deep in code.</para>

<para>Here, then, is the Clustered Hashmap Protocol, which "defines a cluster-wide key-value hashmap, and mechanisms for sharing this across a set of clients. CHP allows clients to work with subtrees of the hashmap, to update values, and to define ephemeral values."</para>

<itemizedlist>
  <listitem><para><ulink url="http://rfc.zeromq.org/spec:12">http://rfc.zeromq.org/spec:12</ulink></para></listitem>
</itemizedlist>
</sect2>
</sect1>
<sect1>
<title>The Espresso Pattern</title>
<para>Here is a fun little machine that exploits the <literal>zmq_proxy()</literal> function to show you what's happening on a pub-sub network. It's deceptively simple:</para>

<example id="espresso-php">
<title>Espresso Machine (espresso.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
</sect1>
<sect1>
<title>Last Value Caching</title>
<para>If you've used commercial publish-subscribe systems you may be used to some features that are missing in the fast and cheerful &Oslash;MQ pub-sub model. One of these is "last value caching". The problem this solves is how a new subscriber catches up when it joins the network. The theory is that publishers get notified when a new subscriber joins and subscribes to some specific topics. The publisher can then re-broadcast the last message for those topics.</para>

<para>First I'll explain why &Oslash;MQ does not do this, and then I'll show how to do it anyhow.</para>

<para>We don't do it because in large pub-sub systems the volumes of data make it pretty much impossible. To make really large-scale pub-sub networks you need a protocol like PGM that exploits an upscale Ethernet switch's ability to multicast data to thousands of subscribers. Trying to do a TCP unicast from the publisher to each of thousands of subscribers just doesn't scale. You get weird spikes, unfair distribution (some subscribers getting the message before others), network congestion, and general unhappiness.</para>

<para>PGM is a one-way protocol: the publisher sends a message to a multicast address at the switch, which then rebroadcasts that to all interested subscribers. The publisher never sees when subscribers join or leave: this all happens in the switch, which we don't really want to start reprogramming.</para>

<para>OK, but in a lower-volume network, with a few dozen subscribers, and a limited number of topics, how do we make an LVC using &Oslash;MQ? The answer is to create a proxy that sits between the publisher and subscribers; an analog for the switch, but one we can program ourselves.</para>

<para>I'll start by making a publisher and subscriber that highlight the worst case scenario. This publisher is pathological. It starts by immediately sending messages to each of a thousand topics, and then it sends one update a second to a random topic. A subscriber connects, and subscribes to a topic. Without LVC, a subscriber would have to wait an average of about 1,000 seconds to get any data, since each update has only a one in a thousand chance of hitting its topic. To add some drama, let's pretend there's an escaped convict called Roth threatening to rip the head off Roger the toy bunny if we can't fix that delay of nearly 17 minutes.</para>

<para>Here's the publisher code. Note that it has the command line option to connect to some address, but otherwise binds to an endpoint. We'll use this later to connect to our last value cache:</para>

<example id="pathopub-php">
<title>Pathologic Publisher (pathopub.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>And here's the subscriber:</para>

<example id="pathosub-php">
<title>Pathologic Subscriber (pathosub.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Try building and running these: first the subscriber, then the publisher. You'll see the subscriber reports getting "Save Roger" as you'd expect:</para>

<screen>./pathosub &amp;
./pathopub
</screen>

<para>It's when you run a second subscriber that you understand Roger's predicament. You have to leave it an awfully long time before it reports getting any data. So, here's our last value cache. As I promised, it's a proxy that binds to two sockets and then handles messages on both of them:</para>

<example id="lvcache-php">
<title>Last Value Caching Proxy (lvcache.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>Now, run the proxy, then the publisher:</para>

<screen>./lvcache &amp;
./pathopub tcp://localhost:5557
</screen>

<para>And now run as many instances of the subscriber as you want to try, each time connecting to the proxy on port 5558:</para>

<screen>./pathosub tcp://localhost:5558
</screen>

<para>Each subscriber happily reports "Save Roger", and Roth the Escaped Convict slinks back to his cell for dinner and a nice cup of hot milk, which is all he really wanted anyhow and could someone call his mum and tell her his clean socks are almost all up.</para>

<para>One note: the XPUB socket by default does not report duplicate subscriptions, which is what you want when you're naively connecting an XPUB to an XSUB. Our example sneakily gets around this by using random topics so the chance of it not working is one in a million. In a real LVC proxy you'll want to use the <literal>ZMQ_XPUB_VERBOSE</literal> option that we implement later in The &Oslash;MQ Community (<xref linkend="the-community"/>) as an exercise.</para>
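<para>The caching logic at the heart of such a proxy can be sketched without any sockets. An XPUB socket hands subscription events to the application as a frame whose first byte is 1 for subscribe or 0 for unsubscribe, followed by the topic. The class and method names below are invented for this illustration, and the sketch leaves the socket handling to the caller:</para>

<programlisting language="php"><![CDATA[
<?php
// Sketch of the last value cache itself, separate from any sockets.
// Class and method names are hypothetical.

class LastValueCache
{
    private $cache = array();

    // Call for every message that arrives from the publisher side
    public function remember($topic, $value)
    {
        $this->cache[$topic] = $value;
    }

    // Call for every subscription event read from the XPUB socket.
    // Returns array(topic, value) to send to subscribers, or null.
    public function replayFor($event)
    {
        if ($event === '' || ord($event[0]) !== 1) {
            return null;             // not a "subscribe" event
        }
        $topic = (string) substr($event, 1);
        if (!isset($this->cache[$topic])) {
            return null;             // nothing cached for this topic yet
        }
        return array($topic, $this->cache[$topic]);
    }
}

$lvc = new LastValueCache();
$lvc->remember('042', 'Off with his head!');
$lvc->remember('042', 'Save Roger');           // newer value replaces older
$replay = $lvc->replayFor("\x01" . '042');     // a late subscriber joins
]]></programlisting>

<para>In a proxy, <literal>$replay</literal> would be sent out on the XPUB socket as a two-frame message. Because XPUB filters by topic, only subscribers to that topic receive it, though existing subscribers to the same topic will see the value a second time.</para>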

</sect1>
</chapter>
<chapter id="the-human-scale">
<title>The Human Scale</title>
<para>If you've survived the first five chapters, congratulations. It was hard for me too. Happily the jokes and the code mostly write themselves, so we'll continue with our journey of exploring &Oslash;MQ. In this chapter I'm going to step back from the nuts and bolts of &Oslash;MQ's technical machinery, and look more at how to use &Oslash;MQ successfully in larger projects.</para>

<para>We'll cover:</para>

<itemizedlist>
  <listitem><para>What "software architecture" is really about.</para></listitem>
  <listitem><para>The Simplicity-Oriented Design process and its ugly cousins COD and TOD.</para></listitem>
  <listitem><para>How to use &Oslash;MQ to go from idea to working prototype safely.</para></listitem>
  <listitem><para>Different ways to serialize your data as &Oslash;MQ messages.</para></listitem>
  <listitem><para>How to code-generate binary serialization codecs.</para></listitem>
  <listitem><para>How to build custom code generators.</para></listitem>
  <listitem><para>How to write and license a protocol specification.</para></listitem>
  <listitem><para>How to do fast restartable file transfer over &Oslash;MQ.</para></listitem>
  <listitem><para>How to do credit-based flow control.</para></listitem>
  <listitem><para>How to build protocol servers and clients as state machines.</para></listitem>
  <listitem><para>How to make a secure protocol over &Oslash;MQ (yay!).</para></listitem>
  <listitem><para>A large-scale file publishing system (FileMQ).</para></listitem>
</itemizedlist>
<sect1>
<title>The Tale of Two Bridges</title>
<para>Two old engineers were talking of their lives and boasting of their greatest projects. One of the engineers explained how he had designed one of the greatest bridges ever made.</para>

<para>"We built it across a river gorge," he told his friend. "It was wide and deep. We spent two years studying the land, and choosing designs and materials. We hired the best engineers and designed the bridge, which took another five years. We contracted the largest engineering firms to build the structures, the towers, the tollbooths, and the roads that would connect the bridge to the main highways. Dozens died during the construction. Under the road level we had trains, and a special path for cyclists. That bridge represented years of my life."</para>

<para>The second man reflected for a while, then spoke. "One evening me and a friend got drunk on vodka, and we threw a rope across a gorge," he said. "Just a rope, tied to two trees. There were two villages, one at each side. At first, people pulled packages across that rope with a pulley and string. Then someone threw a second rope, and built a foot walk. It was dangerous, but the kids loved it. A group of men then rebuilt that, made it solid, and women started to cross, every day, with their produce. A market grew up on one side of the bridge, and slowly that became a large town, since there was a lot of space for houses. The rope bridge got replaced with a wooden bridge, to allow horses and carts to cross. Then the town built a real stone bridge, with metal beams. Later, they replaced the stone part with steel, and today there's a suspension bridge standing in that same spot."</para>

<para>The first engineer was silent. "Funny thing," he said, "my bridge was demolished about ten years after we built it. Turns out it was built in the wrong place and no-one wanted to use it. Some guys had thrown a rope across the gorge, a few miles further downstream, and that's where everyone went."</para>

</sect1>
<sect1>
<title>Code on the Human Scale</title>
<para>To write a poem that captures the heart, first learn the language. To use &Oslash;MQ successfully at scale you have to learn two languages. The first is &Oslash;MQ itself. This takes even the best of us time. It's a truism that if you try to port an old architecture onto &Oslash;MQ, the results are going to be weird. &Oslash;MQ's language is subtle and profound and when you master it you will find yourself removing old complexity, not converting it.</para>

<para>However the real challenge of using &Oslash;MQ is that old barriers fall away, and the size of the projects you can do increases hugely. Non-distributed code is often a single-person project. You can work in your corner, perhaps for years, like an author on a book. It's all about concentration. But distributed code is different. To quote my favorite author, it "has to talk to code, has to be chatty, sociable, well-connected".</para>

<para>Writing distributed code is like playing live music: it's all about other people. Concentration is worthless if you can't listen. No-one enjoys listening to an amazingly proficient musician who's out of time with the rest of the group and can't read the mood of the audience. A live jam is entrancing not because of the technical quality but because of the real-time creative energy.</para>

<para>And so it goes with distributed code. Real-time creative energy is what wins, not pure technical quality, and certainly not technical quality combined with inability to work with others.</para>

<para>All this is fine in theory. Here comes the catch: working with other people is <emphasis>plain hard</emphasis>. We can expect a musician to be naturally social. But software developers? We're the very caricature of anti-social tunnel-visioned hermits. Other people are hard work. They're slow, they make mistakes, they ask too many questions, they don't respect our code, they make wrong assumptions, they argue.</para>

<para>My response isn't very sympathetic. To succeed in the software industry as it turns into something more like a never-ending live jam, we have to learn to put away our egos, work successfully with others, worry less about our own skills and look more at others, put away our natural insolence and attitude, and to learn to like and trust other people.</para>

<para>So this is what this chapter is really about: writing code at scale by understanding ourselves much better. Of course these lessons apply to all large-scale applications. Using &Oslash;MQ we just hit the problem sooner than we'd expect.</para>

</sect1>
<sect1>
<title>Psychology of Software Development</title>
<para>Dirkjan Ochtman pointed me to <ulink url="http://en.wikipedia.org/wiki/Software_architecture">Wikipedia's definition of Software Architecture</ulink> as "the set of structures needed to reason about the system, which comprise software elements, relations among them, and properties of both". For me this vapid and circular jargon is a good example of how miserably little we understand about what actually makes a successful large-scale software architecture.</para>

<para>Architecture is the art and science of making large artificial structures for human use. If there is one thing I've learned and applied successfully in 30 years of making larger and larger software systems it is this: software is about people. Large structures in themselves are meaningless. It's how they function for <emphasis>human use</emphasis> that matters. And in software, human use starts with the programmers who make the software itself.</para>

<para>The core problems in software architecture are driven by human psychology, not technology. There are many ways our psychology affects our work. I could point to the way teams seem to get stupider as they get larger, or have to work across larger distances. Does that mean the smaller the team, the more effective? How then does a large global community like &Oslash;MQ manage to work successfully?</para>

<para>The &Oslash;MQ community wasn't accidental, it was a deliberate design, my contribution to the early days when the code came out of a cellar in Bratislava. The design was based on my pet science of "Social Architecture", which <ulink url="http://en.wikipedia.org/wiki/Social_architecture">Wikipedia defines</ulink> (what a coincidence!) as "the process, and the product, of planning, designing, and growing an on-line community."</para>

<para>One of the tenets of Social Architecture is that <emphasis>how we organize</emphasis> is more significant than <emphasis>who we are</emphasis>. The same group, organized differently, can produce entirely opposite results. We are like peers in a &Oslash;MQ network, and our communication patterns have dramatic impact on our performance. Ordinary people, well connected, can far outperform a team of experts working in the wrong patterns. If you're the architect of a larger &Oslash;MQ application, you're going to have to help others find the right patterns for working together. Do this right, and your project can succeed. Do it wrong, and your project will fail.</para>

<para>The two most important psychological elements are IMO that we're really bad at understanding complexity, and that we are so good at working together to divide and conquer large problems. We're highly social apes, and kind of smart, but only in the right kind of crowd.</para>

<para>So here is my short list of the Psychological Elements of Software Architecture:</para>

<itemizedlist>
  <listitem><para><emphasis role="bold">Stupidity</emphasis>: our mental bandwidth is limited, so we're all stupid at some point. The architecture has to be simple to understand. This is the number one rule: simplicity beats functionality, every single time. If you can't understand an architecture on a cold gray Monday morning before coffee, it is too complex.</para></listitem>
  <listitem><para><emphasis role="bold">Selfishness</emphasis>: we act only out of self-interest, so the architecture must create space and opportunity for selfish acts that benefit the whole. Selfishness is often indirect and subtle. For example I'll spend hours helping someone else understand something because that could be worth days to me later.</para></listitem>
  <listitem><para><emphasis role="bold">Laziness</emphasis>: we make lots of assumptions, many of which are wrong. We are happiest when we can spend the least effort to get a result, to test an assumption quickly, so the architecture has to make this possible. Specifically, that means it must be simple.</para></listitem>
  <listitem><para><emphasis role="bold">Jealousy</emphasis>: we're jealous of others, which means we'll overcome our stupidity and laziness to prove others wrong, and beat them in competition. The architecture thus has to create space for public competition based on fair rules that anyone can understand.</para></listitem>
  <listitem><para><emphasis role="bold">Reciprocity</emphasis>: we'll pay extra in terms of hard work, even money, to punish cheats and enforce fair rules. The architecture should be heavily rule-based, telling people how to work together, but not what to work on.</para></listitem>
  <listitem><para><emphasis role="bold">Pride</emphasis>: we're intensely aware of our social status, and we'll work hard to avoid looking stupid or incompetent in public. The architecture has to make sure every piece we make has our name on it, so we'll have sleepless nights stressing about what others will say about our work.</para></listitem>
  <listitem><para><emphasis role="bold">Greed</emphasis>: we're ultimately economic animals (see selfishness), so the architecture has to give us economic incentive to invest in making it happen. Maybe it's polishing our reputation as experts, maybe it's literally making money from some skill or component. It doesn't matter what it is, but there must be economic incentive. Think of architecture as a market place, not an engineering design.</para></listitem>
  <listitem><para><emphasis role="bold">Conformity</emphasis>: we're happiest to conform, out of fear and laziness, so the architecture should be strongly rule-based, and rules should be clear, accurate, well-documented, and enforced.</para></listitem>
  <listitem><para><emphasis role="bold">Fear</emphasis>: we're unwilling to take risks, especially if it makes us look stupid. Fear of failure is a major reason people conform and follow the group in mass stupidity. The architecture should make silent experimentation easy and cheap, giving people opportunity for success without punishing failure.</para></listitem>
</itemizedlist>
<para>These strategies work at large scale but also at small scale, within an organization or team.</para>

</sect1>
<sect1>
<title>The Bad, the Ugly, and the Delicious</title>
<para>Complexity is easy, it's simplicity that is hard. Whether our software is bad, ugly, or so delicious that it feels wrong to consume alone, doesn't depend so much on our individual skills as how we work together. That is, our processes.</para>

<para>There are many aspects to getting product-building teams and organizations to think wisely. You need diversity, freedom, challenge, resources, and so on. I discuss these in detail in <ulink url="http://swsi.info">Software and Silicon</ulink>. However, even if you have all the right ingredients, the default processes that skilled engineers and designers develop will result in complex, hard-to-use products.</para>

<para>The classic errors are: to focus on ideas, not problems; to focus on the wrong problems; to misjudge the value of solving problems; to not use one's own work; and in many other ways to misjudge the real market.</para>

<para>I'll propose a process called "Simplicity-Oriented Design", or SOD, which is as far as I can tell a reliable, repeatable way of developing simple and elegant products. This process organizes people into flexible supply chains that are able to navigate a problem landscape rapidly and cheaply. They do this by building, testing, and keeping or discarding minimal plausible solutions, called "patches". Living products consist of long series of patches, applied one atop the other. Yes, you may recognize the process by which we develop &Oslash;MQ.</para>

<para>Let's first look at the more common and less joyful processes, TOD and COD.</para>

<sect2>
<title>Trash-Oriented Design</title>
<para>The most popular design process in large businesses seems to be "Trash-Oriented Design", or TOD. TOD feeds off the belief that all we need to make money are great ideas. It's tenacious nonsense but a powerful crutch for people who lack imagination. The theory goes that ideas are rare, so the trick is to capture them. It's like non-musicians being awed by a guitar player, not realizing that great talent is so cheap it literally plays on the streets for coins.</para>

<para>The main outputs of TOD are expensive "ideations": concepts, design documents, and products that go straight into the trash can. It works as follows:</para>

<itemizedlist>
  <listitem><para>The Creative People come up with long lists of "we could do X and Y". I've seen endlessly detailed lists of everything amazing a product could do. Once the creative work of idea generation has happened, it's just a matter of execution, of course.</para></listitem>
  <listitem><para>So the managers and their consultants pass their brilliant, world-shattering ideas to designers who create acres of detailed, preciously refined design documents. The designers take the tens of ideas the managers came up with, and turn them into hundreds of amazing, world-changing designs.</para></listitem>
  <listitem><para>These designs get given to engineers who scratch their heads and wonder who the heck came up with such stupid nonsense. They start to argue back but the designs come from up high, and really, it's not up to engineers to argue with creative people and expensive consultants.</para></listitem>
  <listitem><para>So the engineers creep back to their cubicles, humiliated and threatened into building the gigantic but oh-so-elegant pile of junk. It is bone-breakingly hard work since the designs take no account of practical costs. Minor whims might take weeks of work to build. As the project gets delayed, the managers bully the engineers into giving up their evenings and weekends.</para></listitem>
  <listitem><para>Eventually, something resembling a working product makes it out of the door. It's creaky and fragile, complex and ugly. The designers curse the engineers for their incompetence and pay more consultants to put lipstick onto the pig, and slowly the product starts to look a little nicer.</para></listitem>
  <listitem><para>By this time, the managers have started to try to sell the product and they find, shockingly, that no-one wants it. Undaunted and courageous, they build million-dollar web sites and ad campaigns to explain to the public why they absolutely need this product. They do deals with other businesses to force the product on the lazy, stupid and ungrateful market.</para></listitem>
  <listitem><para>After twelve months of intense marketing, the product still isn't making profits. Worse, it suffers dramatic failures and gets branded in the press as a disaster. The company quietly shelves it, fires the consultants, buys a competing product from a small start-up and re-brands that as its own Version 2. Hundreds of millions of dollars end up in the trash.</para></listitem>
  <listitem><para>Meanwhile, another visionary manager, somewhere in the Organization, drinks a little too much tequila with some marketing people and has a Brilliant Idea.</para></listitem>
</itemizedlist>
<para>Trash-Oriented Design would be a caricature if it wasn't so common. Something like 19 out of 20 market-ready products built by large firms are failures (yes, 87% of statistics are made up on the spot). The remaining one in 20 probably only succeeds because the competitors are so bad and the marketing is so aggressive.</para>

<para>The main lessons of TOD are quite straightforward but hard to swallow. They are:</para>

<itemizedlist>
  <listitem><para>Ideas are cheap. No exceptions. There are no brilliant ideas. Anyone who tries to start a discussion with "oooh, we can do this too!" should be beaten down with all the passion one reserves for traveling evangelists. It is like sitting in a cafe at the foot of a mountain, drinking a hot chocolate and telling others, "hey, I have a great idea, we can climb that mountain! And build a chalet on top! With two saunas! And a garden! Hey, and we can make it solar powered! Dude, that's awesome! What color should we paint it? Green! No, blue! OK, go and make it, I'll stay here and make spreadsheets and graphics!"</para></listitem>
  <listitem><para>The starting point for a good design process is to collect real problems that confront real people. The second step is to evaluate these problems with the basic question, "how much is it worth to solve this problem?" Having done that, we can collect that set of problems that are worth solving.</para></listitem>
  <listitem><para>Good solutions to real problems will succeed as products. Their success will depend on how good and cheap the solution is, and how important the problem is (and sadly, how big the marketing budgets are). But their success will also depend on how much they demand in effort to use, in other words how simple they are.</para></listitem>
</itemizedlist>
<para>Hence after slaying the dragon of utter irrelevance, we attack the demon of complexity.</para>

</sect2>
<sect2>
<title>Complexity-Oriented Design</title>
<para>Really good engineering teams and small firms can usually build decent products. But the vast majority of products still end up being too complex and less successful than they might be. This is because specialist teams, even the best, often stubbornly apply a process I call "Complexity-Oriented Design", or COD, which works as follows:</para>

<itemizedlist>
  <listitem><para>Management correctly identifies some interesting and difficult problem with economic value. In doing so they already leapfrog over any TOD team.</para></listitem>
  <listitem><para>The team enthusiastically starts to build prototypes and core layers. These work as designed and, thus encouraged, the team goes off into intense design and architecture discussions, coming up with elegant schemas that look beautiful and solid.</para></listitem>
  <listitem><para>Management comes back and challenges the team with yet more difficult problems. We tend to equate value with cost, so the harder the problem, and the more expensive to solve, the more the solution should be worth, in their minds.</para></listitem>
  <listitem><para>The team, being engineers and thus loving to build stuff, build stuff. They build and build and build and end up with massive, perfectly-designed complexity.</para></listitem>
  <listitem><para>The products go to market, and the market scratches its head and asks, "seriously, is this the best you can do?" People do use the products, especially if they aren't spending their own money in climbing the learning curve.</para></listitem>
  <listitem><para>Management gets positive feedback from its larger customers, who share the same idea that high cost (in training and use) means high value, and so continues to push the process.</para></listitem>
  <listitem><para>Meanwhile somewhere across the world, a small team is solving the same problem using a better process, and a year later smashes the market to little pieces.</para></listitem>
</itemizedlist>
<para>COD is characterized by a team obsessively solving the wrong problems to the point of collective insanity. COD products tend to be large, ambitious, complex, and unpopular. Much open source software is the output of COD processes. It is insanely hard for engineers to <emphasis role="bold">stop</emphasis> extending a design to cover more potential problems. They argue, "what if someone wants to do X?" but never ask themselves, "what is the real value of solving X?"</para>

<para>A good example of COD in practice is Bluetooth, a complex, over-designed set of protocols that users hate. It continues to exist only because in a massively-patented industry there are no real alternatives. Bluetooth is perfectly secure, which is close to pointless for a proximity protocol. At the same time it lacks a standard API for developers, meaning it's really costly to use Bluetooth in applications.</para>

<para>On the #zeromq IRC channel, Wintre once wrote of how enraged he was many years ago when he "found that XMMS 2 had a working plugin system but could not actually play music."</para>

<para>COD is a form of large-scale "rabbit holing", in which designers and engineers cannot distance themselves from the technical details of their work. They add more and more features, utterly misreading the economics of their work.</para>

<para>The main lessons of COD are also simple but hard for experts to swallow. They are:</para>

<itemizedlist>
  <listitem><para>Making stuff that you don't immediately have a need for is pointless. Doesn't matter how talented or brilliant you are, if you just sit down and make stuff people are not actually asking for, you are most likely wasting your time.</para></listitem>
  <listitem><para>Problems are not equal. Some are simple, and some are complex. Ironically, solving the simpler problems often has more value to more people than solving the really hard ones. So if you allow engineers to just work on random things, they'll mostly focus on the most interesting but least worthwhile things.</para></listitem>
  <listitem><para>Engineers and designers love to make stuff and decoration, and this inevitably leads to complexity. It is crucial to have a "stop mechanism", a way to set short, hard deadlines that force people to make smaller, simpler answers to just the most crucial problems.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Simplicity-Oriented Design</title>
<para>Finally, we come to the rare but precious Simplicity-Oriented Design. This process starts with a realization: we do not know what we have to make until after we start making it. Coming up with ideas or large-scale designs isn't just wasteful, it's a direct hindrance to designing truly accurate solutions. The really juicy problems are hidden like far valleys, and any activity except active scouting creates a fog that hides those distant valleys. You need to keep mobile, pack light, and move fast.</para>

<para>SOD works as follows:</para>

<itemizedlist>
  <listitem><para>We collect a set of interesting problems (by looking at how people use technology or other products) and we line these up from simple to complex, looking for and identifying patterns of use.</para></listitem>
  <listitem><para>We take the simplest, most dramatic problem and we solve this with a minimal plausible solution, or "patch". Each patch solves exactly one genuine and agreed problem in a brutally minimal fashion.</para></listitem>
  <listitem><para>We apply one measure of quality to patches, namely "can this be done any simpler while still solving the stated problem?" We can measure complexity in terms of concepts and models that the user has to learn or guess in order to use the patch. The fewer, the better. A perfect patch solves a problem with zero learning required by the user.</para></listitem>
  <listitem><para>Our product development consists of a patch that solves the problem "we need a proof of concept" and then evolves in an unbroken line to a mature series of products, through hundreds or thousands of patches piled on top of each other.</para></listitem>
  <listitem><para>We do not do <emphasis>anything</emphasis> that is not a patch. We enforce this rule with formal processes that demand that every activity or task is tied to a genuine and agreed problem, explicitly enunciated and documented.</para></listitem>
  <listitem><para>We build our projects into a supply chain where each project can provide problems to its "suppliers" and receive patches in return. The supply chain creates the "stop mechanism" since when people are impatiently waiting for an answer, we necessarily cut our work short.</para></listitem>
  <listitem><para>Individuals are free to work on any projects, and provide patches at any place they feel it's worthwhile. No individuals "own" any project, except to enforce the formal processes. A single project can have many variations, each a collection of different, competing patches.</para></listitem>
  <listitem><para>Projects export formal and documented interfaces so that upstream (client) projects are unaware of change happening in supplier projects. Thus multiple supplier projects can compete for client projects, in effect creating a free and competitive market.</para></listitem>
  <listitem><para>We tie our supply chain to real users and external clients and we drive the whole process by rapid cycles so that a problem received from outside users can be analyzed, evaluated, and solved with a patch in a few hours.</para></listitem>
  <listitem><para>At every moment from the very first patch, our product is shippable. This is essential, because a large proportion of patches will be wrong (10-30%) and only by giving the product to users can we know which patches have become problems and themselves need solving.</para></listitem>
</itemizedlist>
<para>SOD is a form of "hill climbing algorithm", a reliable way of finding optimal solutions to the most significant problems in an unknown landscape. You don't need to be a genius to use SOD successfully, you just need to be able to see the difference between the fog of activity and the progress towards new real problems.</para>

<para>A really good designer with a good team can use SOD to build world-class products, rapidly and accurately. To get the most out of SOD, the designer has to use the product continuously, from day 1, and develop his or her ability to smell out problems such as inconsistency, surprising behavior, and other forms of friction. We naturally overlook many annoyances but a good designer picks these up, and thinks about how to patch them. Design is about removing friction in the use of a product.</para>

<para>In an open source setting, we do this work in public. There's no "let's open the code" moment. Projects that do this are in my view missing the point of open source, which is to engage your users in your exploration, and to build community around the seed of the architecture.</para>

</sect2>
</sect1>
<sect1>
<title>Message Oriented Pattern for Elastic Design</title>
<para>Now I'll introduce MOPED, which is a SOD pattern custom-designed for &Oslash;MQ architectures. It was either MOPED or BIKE, the Backronym-Induced Kinetic Effect. That's short for BICICLE, the Backronym-Inflated See if I Care Less Effect. In life, one learns to go with the least embarrassing choices.</para>

<para>Speaking of embarrassments, just as &Oslash;MQ lets us aim for really massive architectures, it also, like any technology that removes friction, opens the door to truly massive blunders. If &Oslash;MQ is the ACME rocket-propelled shoe of distributed software development, a lot of us are like Wile E. Coyote, slamming full speed into the proverbial desert cliff.</para>

<para>So MOPED is meant to save us from such mistakes. Partly it's about slowing down, partly it's about ensuring that when you move fast, you go - and this is essential, dear reader - in the <emphasis>right direction</emphasis>. It's my standard interview riddle: what's the rarest property of any software system, the absolute hardest thing to get right, the lack of which causes the slow or fast death of the vast majority of projects? The answer is not code quality, funding, performance, or even (though it's a close answer) popularity. The answer is "accuracy".</para>

<para>If you've read the Guide observantly you'll have seen MOPED in action already. The development of Majordomo in Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/> is a near-perfect case. But cute names are worth a thousand words.</para>

<para>The goal of MOPED is to define a process, a pattern by which we can take a rough use case for a new distributed application, and go from "hello world" to fully-working prototype in any language in under a week.</para>

<para>Using MOPED, you grow, more than build, a working &Oslash;MQ architecture from the ground up, with minimal risk of failure. By focusing on the contracts, rather than the implementations, you avoid the risk of premature optimization. By driving the design process through ultra-short test-based cycles, you can be more certain that what you have works, before you add more.</para>

<para>We can turn this into five real steps:</para>

<itemizedlist>
  <listitem><para>Step 1: internalize the &Oslash;MQ semantics.</para></listitem>
  <listitem><para>Step 2: draw a rough architecture.</para></listitem>
  <listitem><para>Step 3: decide on the contracts.</para></listitem>
  <listitem><para>Step 4: make a minimal end-to-end solution.</para></listitem>
  <listitem><para>Step 5: solve one problem and repeat.</para></listitem>
</itemizedlist>
<sect2>
<title>Step 1: Internalize the Semantics</title>
<para>To repeat myself: you must learn &Oslash;MQ's language. The only way to learn a language is to use it. There's no way to avoid this investment, no tapes you can play while you sleep, no chips you can plug in to magically become smarter. Read the Guide, work through the code examples, understand what's going on, and (most importantly) write some examples yourself, and then <emphasis>throw them away</emphasis>.</para>

<para>At a certain point you'll feel a clicking noise in your brain. Maybe you'll have a weird chili-induced dream where little &Oslash;MQ tasks run around trying to eat you alive. Maybe you'll just think "aaahh, so <emphasis>that's</emphasis> what it means!" If we did our work right, it should take 2-3 days. However long it takes, until you start thinking in terms of &Oslash;MQ sockets and patterns, you're not ready for step 2.</para>

</sect2>
<sect2>
<title>Step 2: Draw a Rough Architecture</title>
<para>Whiteboard time. Get a couple of colleagues and try to draw your architecture on a whiteboard. You want to draw boxes connected with arrows, showing the flow of work, data, results, etc. Since we live in a gravity well, it's best to draw the main arrows going down. Almost all architectures have a <emphasis>direction</emphasis>, and a certain symmetry, and what you want to do is capture that as simply and cleanly as you can.</para>

<para>Ignore anything that's not central to the core problem. Ignore logging, error handling, recovery from failures, etc. What you leave out is as important as what you capture: you can always add, but it's very hard to remove. When you have a simple, clean drawing, you're ready for step 3.</para>

</sect2>
<sect2>
<title>Step 3: Decide on the Contracts</title>
<para>Human scale depends on contracts, and the more explicit they are, the better things scale. You don't care <emphasis>how</emphasis> things happen, only the results. If I send an email, I don't care how it arrives at its destination, so long as the contract (it arrives within a few minutes, it's not modified, it doesn't get lost) is respected.</para>

<para>And to build a large system that works well, you must focus on the contracts, before the implementations. It may sound obvious but all too often, people forget and ignore this, or are just too shy to impose themselves. I wish I could say &Oslash;MQ had done this properly but for years our public contracts were second-rate afterthoughts instead of primary in-your-face pieces of work.</para>

<para>So what is a contract in a distributed system? There are, in my experience, two types of contract:</para>

<itemizedlist>
  <listitem><para>The APIs to client applications. Remember the Psychological Elements. The APIs need to be as absolutely <emphasis>simple</emphasis>, <emphasis>consistent</emphasis>, and <emphasis>familiar</emphasis> as possible. Yes, you can generate API documentation from code, but you must first design it, and designing an API is often hard.</para></listitem>
  <listitem><para>The protocols that connect the pieces. It sounds like rocket science, but it's really just a simple trick, and one that &Oslash;MQ makes particularly easy. In fact they're so simple to write, and need so little bureaucracy that I call them "unprotocols".</para></listitem>
</itemizedlist>
<para>You write minimal contracts that are mostly just place markers. Most messages and most API methods will be missing, or empty. You also want to write down any known technical requirements in terms of throughput, latency, reliability, etc. These are the criteria on which you will accept, or reject, any particular piece of work.</para>

</sect2>
<sect2>
<title>Step 4: Write a Minimal End-to-End Solution</title>
<para>The goal is to test out the overall architecture as rapidly as possible. Make skeleton applications that call the APIs, and skeleton stacks that implement both sides of every protocol. You want to get a working end-to-end "hello world" as soon as you can. You want to be able to test code as you write it, to weed out the broken assumptions and inevitable errors you make. Do not go off and spend six months writing a test suite! Instead, make a minimal bare-bones application that uses our still-hypothetical API.</para>

<para>If you design an API wearing the hat of the person who implements it, you'll start to think of performance, features, options, and so on. You'll make it more complex, more irregular, and more surprising than it should be. But, and here's the trick (it's a cheap one, was big in Japan), if you design an API while wearing the hat of the poor sucker who has to actually write apps that use it, you use all that laziness and fear to your advantage.</para>

<para>Write down the protocols, on a wiki or shared document, in such a way that you can explain every command clearly without too much detail. Strip off any real functionality, because it'll create inertia that just makes it harder to move stuff around. You can always add weight. Don't spend effort defining formal message structures: pass the minimum around, in the simplest possible fashion, using &Oslash;MQ's multi-part framing.</para>

<para>Our goal is to get the simplest test case working, without any avoidable functionality. Everything you can chop off the list of things to do, you chop. Ignore the groans from colleagues and bosses. I'll repeat this once again: you can <emphasis>always</emphasis> add functionality, that's relatively easy. But aim to keep the overall weight to a minimum.</para>

</sect2>
<sect2>
<title>Step 5: Solve One Problem and Repeat</title>
<para>You're now in the Happy Loop of issue-driven development where you can start to solve tangible problems instead of adding features. Write issues that state a clear problem, and propose a solution. Keep in mind, as you design the API, your standards for names, consistency, and behavior. Writing these down in prose often helps keep them sane.</para>

<para>From here, every single change you make to the architecture and code is now proven by running the test case, watching it not work, making the change, and then watching it work.</para>

<para>Now you go through the whole cycle (extending the test case, fixing the API, updating the protocol, extending the code, as needed), taking problems one at a time and testing the solutions individually. It should take about 10-30 minutes for each cycle, with the occasional spike due to random confusion.</para>

</sect2>
</sect1>
<sect1>
<title>Unprotocols</title>
<sect2>
<title>Why Unprotocols?</title>
<para>When this man thinks of protocols, this man thinks of massive documents written by committees, over years. This man thinks of the IETF, W3C, ISO, Oasis, regulatory capture, FRAND patent license disputes, and soon after, this man thinks of retirement to a nice little farm in northern Bolivia up in the mountains where the only other needlessly stubborn beings are the goats chewing up the coffee plants.</para>

<para>Now, I've nothing personal against committees. The useless folk need a place to sit out their lives with minimal risk of reproducing, after all, that only seems fair. But most committee protocols tend towards complexity (the ones that work), or trash (the ones we don't talk about). There are a few reasons for this. One is the amount of money at stake. More money means more people who want their particular prejudices and assumptions expressed in prose. Two is the lack of good abstractions on which to build. People have tried to build reusable protocol abstractions, like BEEP. Most did not stick, and those that did, like SOAP and XMPP, are on the complex side of things.</para>

<para>It used to be, decades ago, when the Internet was a young modest thing, that protocols were short and sweet. They weren't even "standards", but "requests for comments", which is as modest as you can get. It's been one of my goals since we started iMatix in 1995 to find a way for ordinary people like me to write small, accurate protocols without the overhead of the committees.</para>

<para>Now, &Oslash;MQ does appear to provide a living, successful protocol abstraction layer with its "we'll carry multi-part messages over random transports" way of working. Since &Oslash;MQ deals silently with framing, connections, and routing, it's surprisingly easy to write full protocol specs on top of &Oslash;MQ, and in Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/> and Advanced Publish-Subscribe Patterns<xref linkend="advanced-pub-sub"/> I showed how to do this.</para>

<para>Somewhere around mid-2007, I kicked off the Digital Standards Organization to define new simpler ways of producing little standards, protocols, specifications. In my defense, it was a quiet summer. At the time <ulink url="http://www.digistan.org/spec:1">I wrote that</ulink> a new specification should take "minutes to explain, hours to design, days to write, weeks to prove, months to become mature, and years to replace."</para>

<para>In 2010 we started calling such little specifications "unprotocols", which some people might mistake for a dastardly plan for world domination by a shadowy international organization, but which really just means, "protocols without the goats".</para>

</sect2>
<sect2>
<title>How to Write Unprotocols</title>
<para>When you start to write an unprotocol specification document, stick to a consistent structure so that your readers know what to expect. Here is the structure I use:</para>

<itemizedlist>
  <listitem><para>Cover section: with a 1-line summary, URL to the spec, formal name, version, who to blame.</para></listitem>
  <listitem><para>License for the text: absolutely needed for public specifications.</para></listitem>
  <listitem><para>The change process: i.e. how do I, as a reader, fix problems in the specification?</para></listitem>
  <listitem><para>Use of language: MUST, MAY, SHOULD, etc. with a reference to RFC 2119.</para></listitem>
  <listitem><para>Maturity indicator: is this an experimental, draft, stable, legacy, or retired specification?</para></listitem>
  <listitem><para>Goals of the protocol: what problems is it trying to solve?</para></listitem>
  <listitem><para>Formal grammar: prevents arguments due to different interpretation of the text.</para></listitem>
  <listitem><para>Technical explanation: semantics of each message, error handling, etc.</para></listitem>
  <listitem><para>Security discussion: explicitly, how secure the protocol is.</para></listitem>
  <listitem><para>References: to other documents, protocols, etc.</para></listitem>
</itemizedlist>
<para>Writing clear, expressive text is hard. Do avoid trying to describe implementations of the protocol. Remember that you're writing a contract. You describe in clear language the obligations and expectations of each party, the level of obligation, and the penalties for breaking the rules. You do not try to define <emphasis>how</emphasis> each party honors its part of the deal.</para>

<para>If you need reference material to start with, read the <ulink url="http://rfc.zeromq.org">rfc.zeromq.org</ulink> site, which has a bunch of unprotocols that you can copy/paste from.</para>

<para>Here are some key points about unprotocols:</para>

<itemizedlist>
  <listitem><para>As long as your process is open then you don't need a committee: just make clean minimal designs and make sure anyone is free to improve them.</para></listitem>
  <listitem><para>If you use an existing license then you don't have legal worries afterwards. I use GPLv3 for my public specifications and advise you to do the same. For in-house work, standard copyright is perfect.</para></listitem>
  <listitem><para>The formality is valuable. That is, learn to write a formal grammar such as ABNF (Augmented Backus-Naur Form) and use this to fully document your messages.</para></listitem>
  <listitem><para>Use a market-driven life-cycle process like <ulink url="http://www.digistan.org/spec:1">Digistan's COSS</ulink> so that people place the right weight on your specs as they mature (or don't).</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Why use the GPLv3 for Public Specifications?</title>
<para>The license you choose is particularly crucial for public specifications. Traditionally, protocols are published under custom licenses, where the authors own the text and derived works are forbidden. This sounds great (after all, who wants to see a protocol forked?) but it's in fact highly risky. A protocol committee is vulnerable to capture, and if the protocol is important and valuable, the incentive for capture grows.</para>

<para>Once captured, like some wild animals, an important protocol will often die. The real problem is there's no way to <emphasis>free</emphasis> a captive protocol published under a conventional license. The word "free" isn't just an adjective to describe speech or air, it's also a verb, and the right to fork a work, <emphasis>against the wishes of the owner</emphasis>, is essential to avoiding capture.</para>

<para>Let me explain this in shorter words. Imagine iMatix writes a protocol today, that's really amazing and popular. We publish the spec and many people implement it. Those implementations are fast and awesome, and free as in beer. And they start to threaten an existing business. Their expensive commercial product is slower and can't compete. So one day they come to our iMatix office in Maetang-Dong, South Korea, and offer to buy our firm. Since we're spending vast amounts on sushi and beer and GFEs, we accept gratefully. With evil laughter the new owners of the protocol stop improving the public version, and close the specification and add patented extensions. Their new products support this, and they take over the whole market.</para>

<para>When you contribute to an open source project, you really want to know your hard work won't be used against you by a closed-source competitor. Which is why the GPL beats the "more permissive" BSD/MIT/X11 licenses. These licenses give permission to cheat. This applies just as much to protocols as to source code.</para>

<para>When you implement a GPLv3 specification, your applications are of course yours, and licensed any way you like. But you can be sure and certain of two things. One, that specification will <emphasis>never</emphasis> be embraced and extended into proprietary forms. Any derived forms of the specification must also be GPLv3. Two, no-one who ever implements or uses the protocol will ever launch a patent attack on anything it covers.</para>

</sect2>
<sect2>
<title>Using ABNF</title>
<para>My advice when writing protocol specs is to learn, and use a formal grammar. It's just less hassle than allowing others to interpret what you mean, and then recover from the inevitable false assumptions. The target of your grammar is other people, engineers, not compilers.</para>

<para>My favorite grammar is ABNF, as defined by <ulink url="http://www.ietf.org/rfc/rfc2234.txt">RFC 2234</ulink>, because it is probably the simplest and most widely used formal language for defining bidirectional communications protocols. Most IETF (Internet Engineering Task Force) specifications use ABNF, which is good company to be in.</para>

<para>I'll give a 30-second crash course in writing ABNF. It may remind you of regular expressions. You write the grammar as rules. Each rule takes the form "name = elements". An element can be another rule (which you define below as another rule), or a pre-defined "terminal" like CRLF, OCTET, or a number. <ulink url="http://www.ietf.org/rfc/rfc2234.txt">The RFC</ulink> lists all the terminals. To define alternative elements, use "element / element". To define repetition, use "*" (read the RFC since it's not intuitive). To group elements, use parentheses.</para>

<para>I'm not sure if this extension is proper, but I then prefix elements with "C:" and "S:" to indicate whether they come from the client or server.</para>

<para>Here's a piece of ABNF for an unprotocol called NOM that we'll come back to later in this chapter:</para>

<screen>nom-protocol    = open-peering *use-peering

open-peering    = C:OHAI ( S:OHAI-OK / S:WTF )

use-peering     = C:ICANHAZ
                / S:CHEEZBURGER
                / C:HUGZ S:HUGZ-OK
                / S:HUGZ C:HUGZ-OK
</screen>

<para>I've actually used these keywords (OHAI, WTF) in commercial projects. They make developers giggly and happy. They confuse management. They're good in first drafts that you want to throw away later.</para>
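<para>Because the grammar is aimed at people, it helps to see how directly it maps onto code. The following is a minimal sketch, assuming that a conversation has been recorded as a list of strings such as "C:OHAI". The function name <literal>nom_is_valid</literal> is hypothetical, and the check follows the ABNF literally, so it accepts further commands even after an S:WTF:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Hypothetical helper: checks a recorded command sequence against the
// NOM grammar above. Commands are strings such as "C:OHAI" or "S:HUGZ-OK".
function nom_is_valid(array $commands): bool
{
    // open-peering = C:OHAI ( S:OHAI-OK / S:WTF )
    if (count($commands) < 2 || $commands[0] !== 'C:OHAI') {
        return false;
    }
    if (!in_array($commands[1], ['S:OHAI-OK', 'S:WTF'], true)) {
        return false;
    }
    // *use-peering: zero or more of the four alternatives
    $reply_for = ['C:HUGZ' => 'S:HUGZ-OK', 'S:HUGZ' => 'C:HUGZ-OK'];
    $index = 2;
    while ($index < count($commands)) {
        $command = $commands[$index];
        if ($command === 'C:ICANHAZ' || $command === 'S:CHEEZBURGER') {
            $index += 1;                // single-command alternatives
        } elseif (isset($reply_for[$command])
            && ($commands[$index + 1] ?? null) === $reply_for[$command]) {
            $index += 2;                // heartbeat and its acknowledgment
        } else {
            return false;               // anything else breaks the contract
        }
    }
    return true;
}

var_dump(nom_is_valid(['C:OHAI', 'S:OHAI-OK', 'C:ICANHAZ', 'S:CHEEZBURGER']));
var_dump(nom_is_valid(['C:OHAI', 'S:OHAI-OK', 'C:HUGZ']));
```
]]></programlisting>

<para>The first call describes a legal conversation; the second is rejected because the heartbeat is never acknowledged. A checker like this is probably most useful in test cases, where it can catch a stack that drifts away from the written contract.</para>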

</sect2>
</sect1>
<sect1>
<title>Serializing your Data</title>
<para>When we start to design a protocol, one of the first questions we face is how we encode data on the wire. There is, sadly, no universal answer. There are a half-dozen different ways to serialize data, each with pros and cons. We'll explore these.</para>

<para>However, there is a general lesson I've learned over a couple of decades of writing protocols small and large. I call this the "Cheap and Nasty" pattern: you can often split your work into two layers, and solve these separately, one using a "cheap" approach, the other using a "nasty" approach.</para>

<sect2>
<title>Cheap and Nasty</title>
<para>The key insight to making Cheap and Nasty work is to realize that many protocols mix a low-volume chatty part for control, and a high-volume asynchronous part for data. For instance, HTTP has a chatty dialog to authenticate and get pages, and an asynchronous dialog to stream data. FTP actually splits this over two ports; one port for control and one port for data.</para>

<para>Protocol designers who don't separate control from data tend to make awful protocols, because the trade-offs in the two cases are almost totally opposite. What is perfect for control is terrible for data, and what's ideal for data just doesn't work for control. It's especially true when we want high-performance at the same time as extensibility and good error checking.</para>

<para>Let's break this down using a classic client-server use-case. The client connects to the server, and authenticates. It then asks for some resource. The server chats back, then starts to send data back to the client. Eventually the client disconnects or the server finishes, and the conversation is over.</para>

<para>Now, before starting to design these messages, stop and think, and let's compare the control dialog, and the data flow:</para>

<itemizedlist>
  <listitem><para>The control dialog lasts a short time and involves very few messages. The data flow could last for hours or days, and involve billions of messages.</para></listitem>
  <listitem><para>The control dialog is where all the "normal" errors happen, e.g. not authenticated, not found, payment required, censored, etc. Any errors that happen during the data flow are exceptional (disk full, server crashed).</para></listitem>
  <listitem><para>The control dialog is where things will change over time, as we add more options, parameters, and so on. The data flow should barely change over time since the semantics of a resource are fairly constant over time.</para></listitem>
  <listitem><para>The control dialog is essentially a synchronous request/reply dialog. The data flow is essentially a 1-way asynchronous flow.</para></listitem>
</itemizedlist>
<para>These differences are critical. When we talk about performance, it applies <emphasis>only</emphasis> to data flows. It's pathological to design a one-time control dialog to be fast. When we talk about the cost of serialization, thus, this only applies to the data flow. The cost of encoding/decoding the control flow could be huge, and for many cases it would not change a thing. So, we encode control using "Cheap", and we encode data flows using "Nasty".</para>

<para>Cheap is essentially synchronous, verbose, descriptive, and flexible. A Cheap message is full of rich information that can change for each application. Your goal as designer is to make this information easy to encode and to parse, trivial to extend for experimentation or growth, and highly robust against change both forwards and backwards. The Cheap part of a protocol looks like this:</para>

<itemizedlist>
  <listitem><para>It uses a simple self-describing structured encoding for data, be it XML, JSON, HTTP-style headers, or some other. Any encoding is fine so long as there are standard simple parsers for it in your target languages.</para></listitem>
  <listitem><para>It uses a straight request-reply model where each request has a success/failure reply. This makes it trivial to write correct clients and servers for a Cheap dialog.</para></listitem>
  <listitem><para>It doesn't try, even marginally, to be fast. Performance doesn't matter when you do something once or a few times per session.</para></listitem>
</itemizedlist>
<para>A Cheap parser is something you take off the shelf, and throw data at. It shouldn't crash, shouldn't leak memory, should be highly tolerant, and should be relatively simple to work with. That's it.</para>
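<para>As a rough illustration of the Cheap style, here is a sketch of one side of a control dialog using PHP's built-in JSON functions. The command name "open", the field names, and the function <literal>cheap_handle</literal> are all invented for this example and belong to no real protocol:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Hypothetical Cheap control handler: the command names and fields are
// invented for illustration and belong to no real protocol.
function cheap_handle(string $request_json): string
{
    $request = json_decode($request_json, true);
    if (!is_array($request) || !isset($request['command'])) {
        return json_encode(['status' => 'error', 'reason' => 'malformed request']);
    }
    switch ($request['command']) {
        case 'open':
            // Read only the fields we know; unknown fields are ignored,
            // which leaves room for experimentation and growth.
            $vhost = $request['virtual-host'] ?? 'default';
            return json_encode(['status' => 'ok', 'virtual-host' => $vhost]);
        default:
            return json_encode(['status' => 'error', 'reason' => 'unknown command']);
    }
}

echo cheap_handle('{"command": "open", "virtual-host": "test-env", "x-trace": 1}'), "\n";
echo cheap_handle('not json at all'), "\n";
```
]]></programlisting>

<para>Every request gets exactly one success or failure reply, and the unknown "x-trace" field is tolerated without fuss. Nothing here is tuned for speed, which is the point.</para>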

<para>Nasty however is essentially asynchronous, terse, silent, and inflexible. A Nasty message carries minimal information that practically never changes. Your goal as designer is to make this information ultrafast to parse, and possibly even impossible to extend and experiment with. The ideal Nasty pattern looks like this:</para>

<itemizedlist>
  <listitem><para>It uses a hand-optimized binary layout for data, where every bit is precisely crafted.</para></listitem>
  <listitem><para>It uses a pure asynchronous model where one or both peers send data without acknowledgments (or if they do, they use sneaky asynchronous techniques like credit-based flow control).</para></listitem>
  <listitem><para>It doesn't try, even marginally, to be friendly. Performance is all that matters when you are doing something several million times per second.</para></listitem>
</itemizedlist>
<para>A Nasty parser is something you write by hand, which writes or reads bits, bytes, words, and integers individually and precisely. It rejects anything it doesn't like, does no memory allocations at all, and never crashes.</para>
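<para>A scripting language cannot show the allocation-free side of a Nasty parser, but it can show the fixed layout and the strict rejection. The sketch below assumes an invented 10-byte "tick" frame; the field names and sizes are hypothetical, and <literal>pack</literal> and <literal>unpack</literal> are PHP's standard binary functions:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Hypothetical Nasty data frame with a fixed 10-byte layout:
//   sequence  4 bytes, unsigned, network byte order
//   symbol    2 bytes, unsigned, network byte order
//   price     4 bytes, unsigned, network byte order (in cents)
const TICK_SIZE = 10;

function tick_encode(int $sequence, int $symbol, int $price): string
{
    return pack('NnN', $sequence, $symbol, $price);
}

function tick_decode(string $frame): ?array
{
    // Reject anything that is not exactly the expected size
    if (strlen($frame) !== TICK_SIZE) {
        return null;
    }
    return unpack('Nsequence/nsymbol/Nprice', $frame);
}

$frame = tick_encode(1, 42, 1999);
echo bin2hex($frame), "\n";
print_r(tick_decode($frame));
```
]]></programlisting>

<para>There are no field names, no version markers, and no room for extension on the wire. If the layout has to change, both peers change together, which is the trade-off Nasty accepts in return for speed.</para>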

<para>Cheap and Nasty isn't a universal pattern; not all protocols have this dichotomy. Also, how you use Cheap and Nasty will depend. In some cases, it can be two parts of a single protocol. In other cases it can be two protocols, one layered on top of the other.</para>

</sect2>
<sect2>
<title>&Oslash;MQ Framing</title>
<para>The simplest and most widely used serialization format for &Oslash;MQ applications is &Oslash;MQ's own multi-part framing. For example, here is how the <ulink url="http://rfc.zeromq.org/spec:7">Majordomo Protocol</ulink> defines a request:</para>

<screen>Frame 0: Empty frame
Frame 1: "MDPW01" (six bytes, representing MDP/Worker v0.1)
Frame 2: 0x02 (one byte, representing REQUEST)
Frame 3: Client address (envelope stack)
Frame 4: Empty (zero bytes, envelope delimiter)
Frames 5+: Request body (opaque binary)
</screen>

<para>To read and write this in code is easy. But this is a classic example of a control flow (the whole of MDP is, really, since it's a chatty request-reply protocol). When we came to improve MDP for the second version, we had to change this framing. Excellent, we broke all existing implementations!</para>
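<para>To make the "easy to read and write" claim concrete, here is a sketch that builds the frame list shown above as a plain PHP array. The helper <literal>mdp_request</literal> is hypothetical and opens no sockets; with the PHP binding, an array of this shape is the kind of value you would pass to <literal>ZMQSocket::sendmulti()</literal>:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Builds the worker-side request layout shown above as a plain PHP array
// of frames. mdp_request() is a hypothetical helper, not part of any library.
function mdp_request(array $client_envelope, array $body): array
{
    return array_merge(
        ['', 'MDPW01', "\x02"],    // frames 0 to 2: empty, protocol, REQUEST
        $client_envelope,          // frame 3: client address (envelope stack)
        [''],                      // frame 4: empty envelope delimiter
        $body                      // frames 5+: opaque request body
    );
}

$frames = mdp_request(['client-1'], ['hello', 'world']);
foreach ($frames as $index => $frame) {
    printf("Frame %d: %s\n", $index, bin2hex($frame));
}
```
]]></programlisting>

<para>Note that every field is identified purely by its position in the list. That keeps the code short, and it is also why any change to the framing breaks every peer that reads it.</para>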

<para>Backwards compatibility is hard, but using &Oslash;MQ framing for control flows <emphasis>does not help</emphasis>. Here's how I should have designed this protocol if I'd followed my own advice (and I'll fix this in the next version). It's split into a Cheap part and a Nasty part, and uses the &Oslash;MQ framing to separate these:</para>

<screen>Frame 0: "MDP/2.0" for protocol name and version
Frame 1: command header
Frame 2: command body
</screen>

<para>Where we'd expect to parse the command header in the various intermediaries (client API, broker, and worker API), and pass the command body untouched from application to application.</para>
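<para>A sketch of how that split might look in code follows. It assumes JSON for the command header, which is only one possible Cheap encoding, and the "service" field and both function names are hypothetical:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Hypothetical encoding of the three-frame layout above. The header is
// Cheap (JSON here, as one possible choice); the body is passed untouched.
function command_encode(array $header, string $body): array
{
    return ['MDP/2.0', json_encode($header), $body];
}

// What an intermediary (client API, broker, worker API) might do:
// check the protocol frame, parse the header, never look inside the body.
function command_route(array $frames): ?string
{
    if (count($frames) !== 3 || $frames[0] !== 'MDP/2.0') {
        return null;
    }
    $header = json_decode($frames[1], true);
    return is_array($header) ? ($header['service'] ?? null) : null;
}

$frames = command_encode(['command' => 'request', 'service' => 'echo'], "\x00\x01binary");
var_dump(command_route($frames));
```
]]></programlisting>

<para>New header fields can be added without touching the frame layout, and the body stays opaque to everything between the two applications.</para>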

</sect2>
<sect2>
<title>Serialization Languages</title>
<para>Serialization languages have their fashions. XML used to be big as in popular, then it got big as in over-engineered, and then it fell into the hands of "Enterprise Information Architects" and it's not been seen alive since. Today's XML is the epitome of "somewhere in that mess is a small, elegant language trying to escape".</para>

<para>Still, XML was way, way better than its predecessors, which included such monsters as the Standard Generalized Markup Language (SGML), which in turn was a cool breeze compared to mind-torturing beasts like EDIFACT. So the history of serialization languages seems to be of gradually emerging sanity, hidden by waves of revolting EIAs doing their best to hold onto their jobs.</para>

<para>JSON popped out of the JavaScript world as a quick-and-dirty "I'd rather resign than use XML here" way to throw data onto the wire and get it back again. JSON is just minimal XML expressed, sneakily, as JavaScript source code.</para>

<para>Here's a simple example of using JSON in a Cheap protocol:</para>

<screen>{
    "protocol": {
        "name": "MTL",
        "version": 1
    },
    "virtual-host": "test-env"
}
</screen>

<para>The same in XML would be (XML forces us to invent a single top-level entity):</para>

<screen>&lt;command&gt;
    &lt;protocol name = "MTL" version = "1" /&gt;
    &lt;virtual-host&gt;test-env&lt;/virtual-host&gt;
&lt;/command&gt;
</screen>

<para>And using plain-old HTTP-style headers:</para>

<screen>Protocol: MTL/1.0
Virtual-host: test-env
</screen>

<para>These are all pretty equivalent so long as you don't go overboard with validating parsers, schemas and such "trust us, this is all for your own good" nonsense. A Cheap serialization language gives you space for experimentation for free ("ignore any elements/attributes/headers that you don't recognize"), and it's simple to write generic parsers that e.g. thunk a command into a hash table, or vice-versa.</para>
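<para>Here is a sketch of such a generic parser for the HTTP-style headers case. The function names are hypothetical, and lowercasing the header names is a choice made for this example rather than anything the format requires:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Hypothetical generic parser: turns HTTP-style headers into a hash table
// keyed by lowercased header name, and back again.
function headers_parse(string $text): array
{
    $table = [];
    foreach (preg_split('/\r?\n/', $text) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {          // silently skip anything else
            $table[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $table;
}

function headers_format(array $table): string
{
    $text = '';
    foreach ($table as $name => $value) {
        $text .= ucfirst($name) . ': ' . $value . "\n";
    }
    return $text;
}

$command = headers_parse("Protocol: MTL/1.0\nVirtual-host: test-env\nX-Experimental: yes\n");
print_r($command);
echo headers_format($command);
```
]]></programlisting>

<para>The application simply never looks up headers it does not recognize, so an experimental header like the one above costs nothing to peers that predate it.</para>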

<para>However it's not all roses. While modern scripting languages support JSON and XML easily enough, older languages do not. If you use XML or JSON, you create non-trivial dependencies. It's also somewhat of a pain to work with tree-structured data in a language like C.</para>

<para>So you can drive your choice according to the languages you're aiming for. If your universe is a scripting language then go for JSON. If you are aiming to build protocols for wider system use, keep things simple for C developers and stick to HTTP-style headers.</para>

</sect2>
<sect2>
<title>Serialization Libraries</title>
<para>The msgpack.org site says, "It's like JSON. but fast and small. MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves."</para>

<para>I'm going to make the perhaps unpopular claim that "fast and small" are features that solve non-problems. The only real problem that serialization libraries solve is, as far as I can tell, the need to document the message contracts and actually serialize data to and from the wire.</para>

<para>Let's start with "fast and small". It's based on a two-part argument. First, that making your messages smaller, and reducing the CPU cost of encoding and decoding, will make a significant difference to your application's performance. Second, that this is equally valid across the board for all messages.</para>

<para>But most real applications tend to fall into one of two categories. Either the speed of serialization and size of encoding is marginal compared to other costs, such as database access or application code performance. Or, network performance really is critical, and then all significant costs occur in a few specific message types.</para>

<para>Thus, aiming for "fast and small" across the board is a false optimization. You neither get the easy flexibility of Cheap for your infrequent control flows, nor do you get the brutal efficiency of Nasty for your high-volume data flows. Worse, the assumption that all messages are equal in some way can corrupt your protocol design. Cheap and Nasty isn't only about serialization strategies, it's also about synchronous vs. asynchronous, error handling, and the cost of change.</para>

<para>My experience is that most performance problems in message-based applications can be solved by (a) improving the application itself and (b) hand-optimizing the high-volume data flows. And to hand-optimize your most critical data flows, you need to cheat, know and exploit facts about your data, which is something general-purpose serializers cannot do.</para>

<para>Now to documentation: the need to write our contracts explicitly and formally, not in code. This is a valid problem to solve, indeed one of the main ones if we're to build a long-lasting large-scale message-based architecture.</para>

<para>Here is how we describe a typical message using the MessagePack IDL:</para>

<screen>message Person {
  1: string surname
  2: string firstname
  3: optional string email
}
</screen>

<para>Now, the same message using the protobufs IDL:</para>

<screen>message Person {
  required string surname = 1;
  required string firstname = 2;
  optional string email = 3;
}
</screen>

<para>It works but in most practical cases, wins you little over a serialization language backed by decent specifications written by hand or produced mechanically (we'll come to this). The price you'll pay is an extra dependency, and quite probably, worse overall performance than if you used Cheap and Nasty.</para>

</sect2>
<sect2>
<title>Hand-written Binary Serialization</title>
<para>As you'll gather from this book, my preferred language for systems programming is C (upgraded to C99, with a constructor/destructor API model and generic containers). There are two reasons I like this modernized C language: firstly, I'm too weak-minded to learn a big language like C++. Life just seems filled with more interesting things to understand. Secondly, I find that this specific level of manual control lets me produce better results, and faster.</para>

<para>The point here isn't C vs. C++ but the value of manual control for high-end professional users. It's no accident that the best cars and cameras and espresso machines in the world have manual controls. That level of on-the-spot fine-tuning often makes the difference between world-class success, and second-best.</para>

<para>When you are really, truly, concerned about the speed of serialization and/or the size of the result (often these contradict each other), you need hand-written binary serialization, in other words, let's hear it for Mr. Nasty!</para>

<para>Your basic process for writing an efficient Nasty encoder/decoder (codec) is:</para>

<itemizedlist>
  <listitem><para>Build representative data sets and test applications that can stress-test your codec.</para></listitem>
  <listitem><para>Write a first dumb version of the codec.</para></listitem>
  <listitem><para>Test, measure, improve, and repeat until you run out of time and/or money.</para></listitem>
</itemizedlist>
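<para>As an idea of what a "first dumb version" can look like, here is a sketch of a codec for the Person message shown earlier. The wire layout is an assumption made purely for this example: each field is one length byte followed by that many bytes, and an empty email stands for "absent":</para>

<programlisting language="php"><![CDATA[
```php
<?php
// A first, dumb codec for the Person message used earlier. The wire
// layout is an assumption made for this sketch: each field is one length
// byte followed by that many bytes, and an empty email means "absent".
function person_encode(string $surname, string $firstname, string $email = ''): string
{
    $frame = '';
    foreach ([$surname, $firstname, $email] as $field) {
        if (strlen($field) > 255) {
            throw new InvalidArgumentException('field too long');
        }
        $frame .= chr(strlen($field)) . $field;
    }
    return $frame;
}

function person_decode(string $frame): ?array
{
    $fields = [];
    $offset = 0;
    for ($count = 0; $count < 3; $count++) {
        if ($offset >= strlen($frame)) {
            return null;                    // truncated frame
        }
        $length = ord($frame[$offset]);
        $field = substr($frame, $offset + 1, $length);
        if (strlen($field) !== $length) {
            return null;                    // field runs past end of frame
        }
        $fields[] = $field;
        $offset += 1 + $length;
    }
    return $offset === strlen($frame) ? $fields : null;   // no trailing bytes
}

$frame = person_encode('Hintjens', 'Pieter');
echo strlen($frame), " bytes\n";
print_r(person_decode($frame));
```
]]></programlisting>

<para>It is deliberately naive. The value of a version like this is that it gives you something to measure before you start improving it.</para>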
<para>Here are some of the techniques we use to make our codecs better:</para>

<itemizedlist>
  <listitem><para><emphasis>Use a profiler.</emphasis> There's simply no way to know what your code is doing until you've profiled it, for function counts and for CPU cost per function. Once you find your hot-spots, fix them.</para></listitem>
  <listitem><para><emphasis>Eliminate memory allocations.</emphasis> On a modern Linux kernel the heap is very fast, but it's still the bottleneck in most naive codecs. On older kernels the heap can be tragically slow. Use local variables (the stack) instead of the heap where you can.</para></listitem>
  <listitem><para><emphasis>Test on different platforms and with different compilers and compiler options.</emphasis> Apart from the heap, there are many other differences. You need to learn the main ones, and allow for these.</para></listitem>
  <listitem><para><emphasis>Use state to compress better.</emphasis> If you are concerned about codec performance, you are almost definitely sending the same kinds of data many times. There will be redundancy between instances of data. You can detect these, and use that to compress (e.g. a short value that means "same as last time").</para></listitem>
  <listitem><para><emphasis>Know your data.</emphasis> The best compression techniques (in terms of CPU cost for compactness) require knowing about the data. For example the techniques to compress a word list, a video, and a stream of stock market data are all different.</para></listitem>
  <listitem><para><emphasis>Be ready to break the rules.</emphasis> Do you really need to encode integers in big-endian network byte order? x86 and ARM account for almost all modern CPUs, yet use little-endian (ARM is actually bi-endian but Android, like Windows and iOS, is little-endian).</para></listitem>
</itemizedlist>
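<para>Two of these techniques, using state and breaking the byte-order rule, can be sketched in a few lines. The stream format below is invented for this example: a single zero byte means "same as last time", and any other value is sent as a marker byte followed by a little-endian 32-bit integer. The decoder trusts its input, which a production codec would not:</para>

<programlisting language="php"><![CDATA[
```php
<?php
// Hypothetical stateful encoder for a stream of prices (unsigned 32-bit).
// A single 0x00 byte means "same as last time"; otherwise 0x01 is followed
// by the value in little-endian order ('V'), breaking the network-order rule.
function prices_encode(array $prices): string
{
    $stream = '';
    $last = null;
    foreach ($prices as $price) {
        $stream .= ($price === $last) ? "\x00" : "\x01" . pack('V', $price);
        $last = $price;
    }
    return $stream;
}

function prices_decode(string $stream): array
{
    $prices = [];
    $last = null;
    $offset = 0;
    while ($offset < strlen($stream)) {
        if ($stream[$offset] === "\x01") {
            $last = unpack('V', $stream, $offset + 1)[1];
            $offset += 5;
        } else {
            $offset += 1;               // repeat marker: reuse $last
        }
        $prices[] = $last;
    }
    return $prices;
}

$prices = [1999, 1999, 1999, 2001, 2001];
$stream = prices_encode($prices);
printf("%d prices in %d bytes\n", count($prices), strlen($stream));
```
]]></programlisting>

<para>The five sample prices take 13 bytes instead of the 20 that plain 32-bit integers would need. How much you gain in practice depends entirely on how repetitive your real data is, which is why representative data sets come first.</para>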
</sect2>
<sect2>
<title>Code Generation</title>
<para>Reading the previous two sections, you might have wondered, "could I write my own IDL generator that was better than a general-purpose one?" If this thought wandered into your mind, it probably left pretty soon after, chased by dark calculations about how much work that actually involved.</para>

<para>What if I told you of a way to build custom IDL generators cheaply and quickly? A way to get perfectly documented contracts, code that is as evil and domain-specific as you need, and all you need to do is sign away your soul (<emphasis>who ever really used that, amirite?</emphasis>) right here...</para>

<para>At iMatix, until a few years ago, we used code generation to build ever larger and more ambitious systems until we decided the technology (GSL) was too dangerous for common use, and we sealed the archive and locked it, with heavy chains, in a deep dungeon. Well, we actually posted it on GitHub. If you want to try the examples that are coming up, grab <ulink url="https://github.com/imatix/gsl">the repository</ulink> and build yourself a <literal>gsl</literal> command. Typing "make" in the src subdirectory should do it (and if you're that guy who loves Windows, I'm sure you'll send a patch with project files).</para>

<para>This section isn't really about GSL at all, but about a little-known trick that's useful for ambitious architects who want to scale themselves, as well as their work. Once you learn what the trick is, you can whip up your own code generators in a short time. The code generators most software engineers know about come with a single hard-coded model. For instance, Ragel "compiles executable finite state machines from regular languages", i.e., Ragel's model is a regular language. This certainly works for a good set of problems, but it's far from universal. How do you describe an API in Ragel? Or a project makefile? Or even a finite-state machine like the one we used to design the Binary Star pattern in Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/>?</para>

<para>All these would benefit from code generation, but there's no universal model. So the trick is to design your own models as you need them, then make code generators as cheap compilers for those models. You need some experience in how to make good models, and you need a technology that makes it cheap to build custom code generators. Scripting languages like Perl and Python are a good option. However, we actually built GSL specifically for this, and that's what I prefer.</para>

<para>Let's take a simple example that ties into what we already know. We'll see more extensive examples later, because I really do believe that code generation is crucial knowledge for large-scale work. In Reliable Request-Reply Patterns<xref linkend="reliable-request-reply"/>, we developed the <ulink url="http://rfc.zeromq.org/spec:7">Majordomo Protocol (MDP)</ulink>, and wrote clients, brokers, and workers for that. Now could we generate those pieces mechanically, by building our own interface description language and code generators?</para>

<para>When we write a GSL model, we can use <emphasis>any</emphasis> semantics we like, in other words we can invent domain-specific languages on the spot. I'll invent a couple - see if you can guess what they represent:</para>

<screen>slideshow
    name = Cookery level 3
    page
        title = French Cuisine
        item = Overview
        item = The historical cuisine
        item = The nouvelle cuisine
        item = Why the French live longer
    page
        title = Overview
        item = Soups and salads
        item = Le plat principal
        item = Béchamel and other sauces
        item = Pastries, cakes, and quiches
        item = Soufflé - cheese to strawberry
</screen>

<para>How about this one:</para>

<screen>table
    name = person
    column
        name = firstname
        type = string
    column
        name = lastname
        type = string
    column
        name = rating
        type = integer
</screen>

<para>The first we could compile into a presentation. The second, into SQL to create and work with a database table. So for this exercise our domain language, our model, consists of "classes" that contain "messages" that contain "fields" of various types. It's deliberately familiar. Here is the MDP client protocol:</para>

<screen>&lt;class name = "mdp_client"&gt;
    MDP/Client
    &lt;header&gt;
        &lt;field name = "empty" type = "string" value = ""
            &gt;Empty frame&lt;/field&gt;
        &lt;field name = "protocol" type = "string" value = "MDPC01"
            &gt;Protocol identifier&lt;/field&gt;
    &lt;/header&gt;
    &lt;message name = "request"&gt;
        Client request to broker
        &lt;field name = "service" type = "string"&gt;Service name&lt;/field&gt;
        &lt;field name = "body" type = "frame"&gt;Request body&lt;/field&gt;
    &lt;/message&gt;
    &lt;message name = "reply"&gt;
        Response back to client
        &lt;field name = "service" type = "string"&gt;Service name&lt;/field&gt;
        &lt;field name = "body" type = "frame"&gt;Response body&lt;/field&gt;
    &lt;/message&gt;
&lt;/class&gt;
</screen>

<para>And here is the MDP worker protocol:</para>

<screen>&lt;class name = "mdp_worker"&gt;
    MDP/Worker
    &lt;header&gt;
        &lt;field name = "empty" type = "string" value = ""
            &gt;Empty frame&lt;/field&gt;
        &lt;field name = "protocol" type = "string" value = "MDPW01"
            &gt;Protocol identifier&lt;/field&gt;
        &lt;field name = "id" type = "octet"&gt;Message identifier&lt;/field&gt;
    &lt;/header&gt;
    &lt;message name = "ready" id = "1"&gt;
        Worker tells broker it is ready
        &lt;field name = "service" type = "string"&gt;Service name&lt;/field&gt;
    &lt;/message&gt;
    &lt;message name = "request" id = "2"&gt;
        Broker passes client request to worker
        &lt;field name = "client" type = "frame"&gt;Client address&lt;/field&gt;
        &lt;field name = "body" type = "frame"&gt;Request body&lt;/field&gt;
    &lt;/message&gt;
    &lt;message name = "reply" id = "3"&gt;
        Worker returns reply to broker
        &lt;field name = "client" type = "frame"&gt;Client address&lt;/field&gt;
        &lt;field name = "body" type = "frame"&gt;Reply body&lt;/field&gt;
    &lt;/message&gt;
    &lt;message name = "heartbeat" id = "4"&gt;
        Either peer tells the other it's still alive
    &lt;/message&gt;
    &lt;message name = "disconnect" id = "5"&gt;
        Either peer tells other the party is over
    &lt;/message&gt;
&lt;/class&gt;
</screen>

<para>GSL uses XML as its modeling language. XML has a poor reputation, having been dragged through too many enterprise sewers to smell sweet, but it has some strong positives, as long as you keep it simple. Any way to write a self-describing hierarchy of items and attributes would work.</para>
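<para>To illustrate that last point, here is a rough Python sketch that reads the indented "table" model shown earlier, with no XML involved, and compiles it to SQL. The four-space indent rule, the <literal>parse</literal> and <literal>to_sql</literal> names, and the mapping from model types to SQL types are all assumptions made for this example, not anything GSL does:</para>

<programlisting language="python">
MODEL = """\
table
    name = person
    column
        name = firstname
        type = string
    column
        name = lastname
        type = string
    column
        name = rating
        type = integer
"""

# Assumed mapping from model types to SQL column types
SQL_TYPES = {"string": "VARCHAR(255)", "integer": "INTEGER"}


def parse(text):
    """Turn indented items and 'key = value' lines into nested dicts."""
    root = {"item": "root", "attrs": {}, "children": []}
    stack = [root]                      # stack[n] is the open item at depth n
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // 4   # four spaces per level
        del stack[depth + 1:]           # close items deeper than this line
        parent = stack[depth]
        if "=" in line:
            key, value = line.split("=", 1)
            parent["attrs"][key.strip()] = value.strip()
        else:
            node = {"item": line.strip(), "attrs": {}, "children": []}
            parent["children"].append(node)
            stack.append(node)
    return root["children"][0]


def to_sql(table):
    """Compile a parsed 'table' item into a CREATE TABLE statement."""
    columns = ["    %s %s" % (col["attrs"]["name"], SQL_TYPES[col["attrs"]["type"]])
               for col in table["children"] if col["item"] == "column"]
    return "CREATE TABLE %s (\n%s\n);" % (table["attrs"]["name"],
                                          ",\n".join(columns))
</programlisting>

<para>Calling <literal>to_sql(parse(MODEL))</literal> gives a <literal>CREATE TABLE person</literal> statement with the three columns in model order. The parser knows nothing about tables or columns; only <literal>to_sql</literal> does, and that split between a generic hierarchy and a small, model-specific compiler is the whole trick.</para>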

<para>Now here is a short IDL generator written in GSL that turns our protocol models into documentation:</para>

<screen>.#  Trivial IDL generator (specs.gsl)
.#
.output "$(class.name).md"
## The $(string.trim (class.?''):left) Protocol
.for message
.   frames = count (class-&gt;header.field) + count (field)

A $(message.NAME) command consists of a multi-part message of $(frames)
frames:

.   for class-&gt;header.field
.       if name = "id"
* Frame $(item ()): 0x$(message.id:%02x) (1 byte, $(message.NAME))
.       else
* Frame $(item ()): "$(value:)" ($(string.length ("$(value)")) \
bytes, $(field.:))
.       endif
.   endfor
.   index = count (class-&gt;header.field) + 1
.   for field
* Frame $(index): $(field.?'') \
.       if type = "string"
(printable string)
.       elsif type = "frame"
(opaque binary)
.       else
.           echo "E: unknown field type: $(type)"
.       endif
.       index += 1
.   endfor
.endfor
</screen>

<para>The XML models and this script are in the subdirectory examples/models. To do the code generation I give this command:</para>

<screen>gsl -script:specs mdp_client.xml mdp_worker.xml
</screen>

<para>Here is the Markdown text we get for the worker protocol:</para>

<screen>## The MDP/Worker Protocol

A READY command consists of a multi-part message of 4
frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x01 (1 byte, READY)
* Frame 4: Service name (printable string)

A REQUEST command consists of a multi-part message of 5
frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x02 (1 byte, REQUEST)
* Frame 4: Client address (opaque binary)
* Frame 5: Request body (opaque binary)

A REPLY command consists of a multi-part message of 5
frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x03 (1 byte, REPLY)
* Frame 4: Client address (opaque binary)
* Frame 5: Reply body (opaque binary)

A HEARTBEAT command consists of a multi-part message of 3
frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x04 (1 byte, HEARTBEAT)

A DISCONNECT command consists of a multi-part message of 3
frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x05 (1 byte, DISCONNECT)
</screen>

<para>Which as you can see is close to what I wrote by hand in the original spec. Now, if you have cloned the Guide repository and you are looking at the code in examples/models, you can generate the MDP client and worker codecs. We pass the same two models to a different code generator:</para>

<screen>gsl -script:codec_c mdp_client.xml mdp_worker.xml
</screen>

<para>Which gives us mdp_client and mdp_worker classes. Actually MDP is so simple that it's barely worth the effort of writing the code generator. The profit comes when we want to change the protocol (which we did for the standalone Majordomo project). You modify the protocol, run the command, and out pops more perfect code.</para>

<para>The <literal>codec_c.gsl</literal> code generator is not short, but the resulting codecs are much better than the hand-written code I originally put together for Majordomo. For instance the hand-written code had no error checking, and would die if you passed it bogus messages.</para>
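<para>To give a feel for the kind of checking a model-driven codec can do, here is a small Python sketch. It is not the output of <literal>codec_c.gsl</literal>: it interprets the model at run time instead of generating code, and the <literal>MDP_WORKER</literal> dictionary is a hand transcription of the worker model above, made up for this illustration:</para>

<programlisting language="python">
# Hand-transcribed from the mdp_worker model (an assumption of this sketch)
MDP_WORKER = {
    "header": [("empty", b""), ("protocol", b"MDPW01")],
    "messages": {
        1: ("ready", ["service"]),
        2: ("request", ["client", "body"]),
        3: ("reply", ["client", "body"]),
        4: ("heartbeat", []),
        5: ("disconnect", []),
    },
}


def decode(model, frames):
    """Validate a list of byte frames against the model.

    Returns (message name, dict of fields) or raises ValueError, so that
    a bogus message is rejected instead of crashing the caller later.
    """
    expected = [value for _, value in model["header"]]
    count = len(expected)
    if list(frames[:count]) != expected:
        raise ValueError("bad or missing header frames: %r" % (frames[:count],))
    id_frames = frames[count:count + 1]
    if not id_frames or len(id_frames[0]) != 1:
        raise ValueError("missing or malformed message id frame")
    message_id = id_frames[0][0]
    if message_id not in model["messages"]:
        raise ValueError("unknown message id: %d" % message_id)
    name, fields = model["messages"][message_id]
    body = frames[count + 1:]
    if len(body) != len(fields):
        raise ValueError("%s expects %d body frames, got %d"
                         % (name.upper(), len(fields), len(body)))
    return name, dict(zip(fields, body))
</programlisting>

<para>Passing the four frames of a READY command returns the name <literal>ready</literal> and the service name. Anything with the wrong protocol identifier, an unknown message id, or the wrong number of frames raises an error. When the protocol changes, only the model changes.</para>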

<para>I'm now going to explain the pros and cons of GSL-powered model-oriented code generation. Power does not come for free and one of the greatest traps in our business is the ability to invent concepts out of thin air. GSL makes this particularly easy, so it can be a particularly dangerous tool.</para>

<para><emphasis>Do not invent concepts</emphasis>. The job of a designer is to remove problems, not to add features.</para>

<para>So, first, the advantages of model-oriented code generation:</para>

<itemizedlist>
  <listitem><para>You can create 'perfect' abstractions that map to your real world. So, our protocol model maps 100% to the 'real world' of Majordomo. This would be impossible without the freedom to tune and change the model in any way.</para></listitem>
  <listitem><para>You can develop these perfect models quickly and cheaply.</para></listitem>
  <listitem><para>You can generate <emphasis>any</emphasis> text output. From a single model you can create documentation, code in any language, test tools, literally any output you can think of.</para></listitem>
  <listitem><para>You can generate (and I mean this literally) <emphasis>perfect</emphasis> output since it's cheap to improve your code generators to any level you want.</para></listitem>
  <listitem><para>You get a single source that combines specifications and semantics.</para></listitem>
  <listitem><para>You can leverage a small team to a massive size. At iMatix we produced the million-line OpenAMQ messaging product out of perhaps 85K lines of input models, including the code generation scripts themselves.</para></listitem>
</itemizedlist>
<para>Now the disadvantages:</para>

<itemizedlist>
  <listitem><para>You add tool dependencies to your project.</para></listitem>
  <listitem><para>You may get carried away and create models for the pure joy of creating them.</para></listitem>
  <listitem><para>You may alienate newcomers to your work, who will see "strange stuff".</para></listitem>
  <listitem><para>You may give people a strong excuse to not invest in your project.</para></listitem>
</itemizedlist>
<para>Cynically, model-oriented abuse works great in environments where you want to produce huge amounts of perfect code that you can maintain with little effort, and which <emphasis>no-one can ever take away from you.</emphasis> Personally, I like to cross my rivers and move on. But if long-term job security is your thing, this is almost perfect.</para>

<para>So if you do use GSL and want to create open communities around your work, here is my advice:</para>

<itemizedlist>
  <listitem><para>Use only where you would otherwise be writing tiresome code by hand.</para></listitem>
  <listitem><para>Design natural models that are what people would expect to see.</para></listitem>
  <listitem><para>Write the code by hand first so you know what to generate.</para></listitem>
  <listitem><para>Do not overuse. Keep it simple! <emphasis>Do not get too meta!!</emphasis></para></listitem>
  <listitem><para>Introduce gradually into a project.</para></listitem>
  <listitem><para>Put the generated code into your repositories.</para></listitem>
</itemizedlist>
<para>We're already using GSL in some projects around &Oslash;MQ. For example, the high-level C binding, CZMQ, uses GSL to generate the socket options class (zsockopt). A 300-line code generator turns 78 lines of XML model into 1,500 lines of perfect but really boring code. That's a good win.</para>

</sect2>
</sect1>
<sect1>
<title>Transferring Files</title>
<para>Let's take a break from the lecturing and get back to our first love and the reason for doing all of this: code.</para>

<para>"How do I send a file?" is a common question on the &Oslash;MQ mailing lists. Not surprising, because file transfer is perhaps the oldest and most obvious type of messaging. Sending files around networks has lots of use-cases apart from annoying the copyright cartels. &Oslash;MQ is very good, out of the box, at sending events and tasks but less good at sending files.</para>

<para>I've promised, for a year or two, to write a proper explanation. Here's a gratuitous piece of information to brighten your morning: the word "proper" comes from the archaic French "propre" which means "clean". The dark age English common folk, not being familiar with hot water and soap, changed the word to mean "foreign" or "upper-class", as in "that's proper food!" but later the word meant just "real", as in "that's a proper mess you've gotten us into!"</para>

<para>So, file transfer. There are several reasons you can't just pick up a random file, blindfold it, and shove it whole into a message. The most obvious being that despite decades of determined growth in RAM sizes (and who among us old-timers doesn't fondly remember saving up for that 1,024-byte memory extension card?!), disk sizes obstinately remain much larger. Even if we could send a file with one instruction (say, using a system call like sendfile), we'd hit the reality that networks are not infinitely fast, nor perfectly reliable. After trying to upload a large file several times on a slow flaky network (WiFi, anyone?), you'll realize that a proper file transfer protocol needs a way to recover from failures. That is, a way to send only the part of a file that wasn't yet received.</para>

<para>Finally, after all this, if you build a proper file server, you'll notice that simply sending massive amounts of data to lots of clients creates that situation we like to call, in the technical parlance, "server went belly-up due to all available heap memory being eaten by a poorly-designed application". A proper file transfer protocol needs to pay attention to memory use.</para>

<para>We'll solve these problems properly, one by one, which should hopefully get us to a good and proper file transfer protocol running over &Oslash;MQ. First, let's generate a 1GB test file with random data (real power-of-two-giga-like-Von-Neumann-intended, not the fake silicon ones the memory industry likes to sell):</para>

<screen>dd if=/dev/urandom of=testdata bs=1M count=1024
</screen>

<para>This is large enough to be troublesome when we have lots of clients asking for the same file at once, and on many machines, 1GB is going to be too large to allocate in memory anyhow. As a base reference, let's measure how long it takes to copy this file from disk back to disk. This will tell us how much our file transfer protocol adds on top (including 'network' costs):</para>

<screen>$ time cp testdata testdata2

real    0m7.143s
user    0m0.012s
sys     0m1.188s
</screen>

<para>The 4-figure precision is misleading; expect variations of 25% either way. This is just an "order of magnitude" measurement.</para>

<para>Here's our first cut at the code, where the client asks for the test data and the server just sends it, without stopping for breath, as a series of messages, where each message holds one 'chunk':</para>

<example id="fileio1-php">
<title>File transfer test, model 1 (fileio1.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>It's pretty simple but we already run into a problem: if we send too much data to the ROUTER socket, we can easily overflow it. The simple but stupid solution is to put an infinite high-water mark on the socket. It's stupid because we now have no protection against exhausting the server's memory. Yet without an infinite HWM we risk losing chunks of large files.</para>

<para>Try this: set the HWM to 1,000 (in &Oslash;MQ/3.x this is the default) and then reduce the chunk size to 100K so we send 10K chunks in one go. Run the test, and you'll see it never finishes. As the <literal>zmq_socket()</literal> man page says with cheerful brutality, for the ROUTER socket: "ZMQ_HWM option action: Drop".</para>
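<para>Here is a toy model of that failure in Python. It is not how libzmq is implemented: the <literal>DroppingQueue</literal> class is invented for this sketch, and it assumes the worst case where the receiver drains nothing while the sender blasts away:</para>

<programlisting language="python">
from collections import deque


class DroppingQueue:
    """A bounded outgoing queue that drops messages once it is full."""

    def __init__(self, hwm):
        self.hwm = hwm
        self.queue = deque()
        self.dropped = 0

    def send(self, message):
        if len(self.queue) == self.hwm:
            self.dropped += 1       # silently discard: the "Drop" action
            return False
        self.queue.append(message)
        return True


def blast(chunks, hwm):
    """Send all chunks without pausing; return (queued, dropped)."""
    pipe = DroppingQueue(hwm)
    for number in range(chunks):
        pipe.send(number)
    return len(pipe.queue), pipe.dropped
</programlisting>

<para>With 10,000 chunks and a high-water mark of 1,000, <literal>blast(10000, 1000)</literal> returns <literal>(1000, 9000)</literal>. In real life the client drains some of the queue while the server is sending, so fewer chunks are lost, but a client that waits for every chunk will still wait forever.</para>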

<para>We have to control the amount of data the server sends up-front. There's no point in it sending more than the network can handle. Let's try sending one chunk at a time. In this version of the protocol, the client will explicitly say, "give me chunk N", and the server will fetch that specific chunk from disk and send it.</para>

<para>Here's the improved second model, where the client asks for one chunk at a time, and the server only sends one chunk for each request it gets from the client:</para>

<example id="fileio2-php">
<title>File transfer test, model 2 (fileio2.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>It is much slower now, because of the to-and-fro chatting between client and server. We pay about 300 microseconds for each request-reply round-trip, on a local loop connection (client and server on the same box). It doesn't sound like much but it adds up quickly:</para>

<screen>$ time ./fileio1
4296 chunks received, 1073741824 bytes

real    0m0.669s
user    0m0.056s
sys     0m1.048s

$ time ./fileio2
4295 chunks received, 1073741824 bytes

real    0m2.389s
user    0m0.312s
sys     0m2.136s
</screen>
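<para>A back-of-envelope calculation shows where the extra time goes. This Python sketch assumes a chunk size of 250,000 bytes, which is what the 4,295 chunks reported above imply for a 1GB file, and the function names are made up for the occasion:</para>

<programlisting language="python">
FILE_SIZE = 1024 ** 3        # the 1GB test file, in bytes
CHUNK_SIZE = 250000          # assumed chunk size, in bytes
ROUND_TRIP = 300e-6          # seconds per local request-reply round-trip


def chunk_count(file_size, chunk_size):
    """Number of chunks needed, counting a final partial chunk as one."""
    return -(-file_size // chunk_size)


def chatter_cost(file_size, chunk_size, round_trip):
    """Seconds spent purely on one round-trip per chunk."""
    return chunk_count(file_size, chunk_size) * round_trip
</programlisting>

<para>For the local loop this comes to 4,295 round-trips and about 1.29 seconds of pure chatter, the same order of magnitude as the gap between the two timings. Plug in a hypothetical 30-millisecond round-trip and the same transfer spends roughly 129 seconds just waiting.</para>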

<para>There are two valuable lessons here. First, while request-reply is easy, it's also too slow for high-volume data flows. Paying that 300 microseconds once would be fine. Paying it for every single chunk isn't acceptable, particularly on real networks with latencies perhaps 1,000 times higher.</para>

<para>The second point is something I've said before but will repeat: it's incredibly easy to experiment, measure, and improve our protocols over &Oslash;MQ. And when the cost of something comes way down, you can afford a lot more of it. Do learn to develop and prove your protocols in isolation: I've seen teams waste time trying to improve poorly-designed protocols that are too deeply embedded in applications to be easily testable or fixable.</para>

<para>Our model 2 file transfer protocol isn't so bad, apart from performance:</para>

<itemizedlist>
  <listitem><para>It completely eliminates any risk of memory exhaustion. To prove that, we set the high-water mark to 1 in both sender and receiver.</para></listitem>
  <listitem><para>It lets the client choose the chunk size, which is useful because if there's any tuning of the chunk size to be done, for network conditions, for file types, or to reduce memory consumption further, it's the client that should be doing this.</para></listitem>
  <listitem><para>It gives us fully restartable file transfers.</para></listitem>
  <listitem><para>It allows the client to cancel the file transfer at any point in time.</para></listitem>
</itemizedlist>
<para>If we just didn't have to do a request for each chunk, it'd be a usable protocol. What we need is a way for the server to send multiple chunks, without waiting for the client to request or acknowledge each one. What are the options?</para>

<itemizedlist>
  <listitem><para>The server could send 10 chunks at once, then wait for a single acknowledgment. That's exactly like multiplying the chunk size by 10, so pointless. And yes, it's just as pointless for all values of 10.</para></listitem>
  <listitem><para>The server could send chunks without any chatter from the client but with a slight delay between each send, so that it would send chunks only as fast as the network could handle them. This would require the server to know what's happening at the network layer, which sounds like hard work. It also breaks layering horribly. And what happens if the network is really fast but the client itself is slow? Where are chunks queued then?</para></listitem>
  <listitem><para>The server could try to spy on the sending queue, i.e. see how full it is, and send only when the queue isn't full. Well, &Oslash;MQ doesn't allow that because it doesn't work, for the same reason as throttling doesn't work. The server and network may be more than fast enough, but the client may be a slow little device.</para></listitem>
  <listitem><para>We could modify libzmq to take some other action on reaching HWM. Perhaps it could block? That would mean that a single slow client would block the whole server, so no thank you. Maybe it could return an error to the caller? Then the server could do something smart like... well, there isn't really anything it could do that's any better than dropping the message.</para></listitem>
</itemizedlist>
<para>Apart from being complex and variously unpleasant, none of these options would even work. What we need is a way for the client to tell the server, asynchronously and in the background, that it's ready for more. Some kind of asynchronous flow control. If we do this right, data should flow without interruption from the server to the client, but only as long as the client is reading it. Let's review our three protocols. This was the first one:</para>

<screen>C: fetch
S: chunk 1
S: chunk 2
S: chunk 3
....
</screen>

<para>And the second introduced a request for each chunk:</para>

<screen>C: fetch chunk 1
S: send chunk 1
C: fetch chunk 2
S: send chunk 2
C: fetch chunk 3
S: send chunk 3
C: fetch chunk 4
....
</screen>

<para>Now - waves hands mysteriously - here's a changed protocol that fixes the performance problem:</para>

<screen>C: fetch chunk 1
C: fetch chunk 2
C: fetch chunk 3
S: send chunk 1
C: fetch chunk 4
S: send chunk 2
S: send chunk 3
....
</screen>

<para>It looks suspiciously similar. In fact it's identical except that we send multiple requests without waiting for a reply for each one. This is a technique called "pipelining" and it works because our DEALER and ROUTER sockets are fully asynchronous.</para>

<para>Here's the third model of our file transfer test-bench, with pipelining. The client sends a number of requests ahead (the "credit") and then each time it processes an incoming chunk, it sends one more credit. The server will never send more chunks than the client has asked for:</para>

<example id="fileio3-php">
<title>File transfer test, model 3 (fileio3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
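<para>The credit scheme can also be sketched without any sockets at all. The following Python simulation is only a model of the idea, not the Guide's listing: two <literal>deque</literal> objects stand in for the two directions of the DEALER-ROUTER connection, the client and server take turns instead of running concurrently, and a short chunk is assumed to mark the end of the file:</para>

<programlisting language="python">
from collections import deque


def transfer(data, chunk_size, pipeline):
    """Simulate a pipelined fetch; return (bytes received, peak queue size)."""
    requests = deque()    # client to server: offsets of the chunks wanted
    chunks = deque()      # server to client: chunk bodies
    received = bytearray()
    offset = 0            # next offset the client will ask for
    credit = pipeline     # requests the client may still send
    peak = 0              # most chunks ever queued towards the client
    done = False
    while not done:
        # Client: spend all available credit on new chunk requests
        while credit:
            requests.append(offset)
            offset += chunk_size
            credit -= 1
        # Server: send exactly one chunk for each request it received
        while requests:
            start = requests.popleft()
            chunks.append(data[start:start + chunk_size])
        peak = max(peak, len(chunks))
        # Client: each chunk it processes earns one more credit
        while chunks:
            chunk = chunks.popleft()
            received += chunk
            credit += 1
            if len(chunk) != chunk_size:
                done = True       # a short chunk marks the end of the file
    return bytes(received), peak
</programlisting>

<para>However large <literal>data</literal> is, the second value returned never exceeds <literal>pipeline</literal>, because the server has nothing to send once it has answered every request it holds.</para>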
<para>What we've achieved here, with a little magic, is to take control of the end-to-end pipeline including all network buffers and &Oslash;MQ queues at sender and receiver, and then ensure that pipeline is always filled with data while never growing beyond a predefined limit. More than that, the client decides exactly when to send "credit" to the sender. It could be when it receives a chunk, or when it has fully processed a chunk. And this happens asynchronously, with no significant performance cost.</para>

<para>In the third model I chose a pipeline size of 10 messages (each message is a chunk). This will cost a maximum of 2.5MB memory per client. So with 1GB of memory we can handle at least 400 clients. We can try to calculate the ideal pipeline size. It takes about 0.7 seconds to send the 1GB file, which is about 160 microseconds for a chunk. A round trip is 300 microseconds, so the pipeline needs to be at least 3-5 to keep the server busy. In practice, I still got performance spikes with a pipeline of 5, probably because the credit messages sometimes get delayed by outgoing data. So at 10, it works consistently.</para>

<screen>$ time ./fileio3
4291 chunks received, 1072741824 bytes

real    0m0.777s
user    0m0.096s
sys     0m1.120s
</screen>
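<para>The sizing arithmetic above is easy to replay. This Python sketch takes the 250,000-byte chunk size as an assumption (2.5MB divided by 10 messages), and the function and key names are invented for the example:</para>

<programlisting language="python">
def pipeline_budget(chunk_size, pipeline, memory, send_time, chunks, round_trip):
    """Rough memory and latency figures for one pipelined file server."""
    per_client = pipeline * chunk_size        # bytes queued per client, worst case
    chunk_time = send_time / chunks           # seconds to push one chunk
    return {
        "per_client_bytes": per_client,
        "clients": memory // per_client,
        "chunk_time": chunk_time,
        # How many chunks the server can send during one round-trip
        "chunks_per_round_trip": round_trip / chunk_time,
    }


budget = pipeline_budget(
    chunk_size=250000,      # assumed, see the lead-in
    pipeline=10,
    memory=1024 ** 3,       # 1GB of server memory
    send_time=0.7,          # seconds to send the whole file
    chunks=4295,            # chunks in the 1GB file
    round_trip=300e-6,      # seconds
)
</programlisting>

<para>This gives 2,500,000 bytes per client, 429 clients in a power-of-two gigabyte, about 163 microseconds per chunk, and about 1.8 chunks sent per round-trip. That last figure is the bare minimum for the pipeline before any safety margin.</para>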

<para>Do measure rigorously. Your calculations may be good but the real world tends to have its own opinions.</para>

<para>What we've made is clearly not yet a real file transfer protocol, but it proves the pattern and I think it is the simplest plausible design. For a real working protocol we'd want to add some or all of:</para>

<itemizedlist>
  <listitem><para>Authentication and access controls, even without encryption: the point isn't to protect sensitive data but to catch errors like sending test data to production servers.</para></listitem>
  <listitem><para>A Cheap-style request including file path, optional compression, and other stuff we've learned is useful from HTTP (such as If-Modified-Since).</para></listitem>
  <listitem><para>A Cheap-style response, at least for the first chunk, that provides metadata such as file size (so the client can pre-allocate and avoid unpleasant disk-full situations).</para></listitem>
  <listitem><para>The ability to fetch a set of files in one go, otherwise the protocol becomes inefficient for large sets of small files.</para></listitem>
  <listitem><para>Confirmation from the client when it's fully received a file, to recover from chunks that might be lost if the client disconnects unexpectedly.</para></listitem>
</itemizedlist>
<para>So far, our semantic has been "fetch"; that is, the recipient knows (somehow) that they need a specific file, so they ask for it. The knowledge of which files exist, and where they are, is then passed out-of-band (e.g., in HTTP, by links in the HTML page).</para>

<para>How about a "push" semantic? There are two plausible use-cases for this. First, if we adopt a centralized architecture with files on a main "server" (not something I'm advocating, but people do sometimes like this), then it's very useful to allow clients to upload files to the server. Second, it lets us do a kind of pub-sub for files, where the client asks for all new files of some type; as the server gets these, it forwards them to the client.</para>

<para>A fetch semantic is synchronous, while a push semantic is asynchronous. Asynchronous is less chatty, so faster. Also, you can do cute things like "subscribe to this path", thus creating a publish-subscribe file transfer architecture. That is so obviously awesome that I shouldn't need to explain what problem it solves.</para>

<para>Still, here is the problem with the fetch semantic: that out-of-band route to tell clients what files exist. No matter how you do this, it ends up complex. Either clients have to poll, or you need a separate pub-sub channel to keep clients up to date, or you need user interaction.</para>

<para>Thinking this through a little more, though, we can see that fetch is just a special case of publish-subscribe. So we can get the best of both worlds. Here is the general design:</para>

<itemizedlist>
  <listitem><para>Fetch this path</para></listitem>
  <listitem><para>Here is credit (repeat)</para></listitem>
</itemizedlist>
<para>To make this work (and we will, my dear readers), we need to be a little more explicit about how we send credit to the server. The cute trick of treating a pipelined "fetch chunk" request as credit won't fly since the client doesn't know any longer what files actually exist, how large they are, anything. If the client says, "I'm good for 250,000 bytes of data", this should work equally for one file of 250K bytes, or 100 files of 2,500 bytes.</para>

<para>And this gives us "credit-based flow control", which effectively removes the need for HWMs, and any risk of memory overflow.</para>
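<para>As a rough illustration of byte-based credit, here is a Python sketch of the server side of such a scheme. The <literal>CreditedSender</literal> class and its <literal>grant</literal> and <literal>pump</literal> methods are names invented for this example, and a list of returned chunks stands in for the socket:</para>

<programlisting language="python">
class CreditedSender:
    """Sends file data only while the client's byte credit lasts."""

    def __init__(self, files, chunk_size):
        self.pending = list(files)      # (name, remaining bytes) pairs
        self.chunk_size = chunk_size
        self.credit = 0                 # bytes the client is ready to accept

    def grant(self, nbytes):
        """The client says it is good for nbytes more of data."""
        self.credit += nbytes

    def pump(self):
        """Return the (name, chunk) pairs that may be sent right now."""
        sent = []
        while self.pending and self.credit:
            name, data = self.pending[0]
            size = min(self.chunk_size, self.credit, len(data))
            sent.append((name, data[:size]))
            self.credit -= size
            rest = data[size:]
            if rest:
                self.pending[0] = (name, rest)
            else:
                self.pending.pop(0)     # file finished, move to the next one
        return sent
</programlisting>

<para>Grant this sender 250,000 bytes and it will deliver one 250K file in a few chunks, or 100 files of 2,500 bytes as 100 chunks, then stop until the client grants more. The client never needs to know in advance which files exist or how large they are.</para>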

</sect1>
<sect1>
<title>State Machines</title>
<para>Software engineers tend to treat (finite) state machines as a kind of intermediary interpreter. That is, you take a regular language and compile that into a state machine, then execute the state machine. The state machine itself is rarely visible to the developer: it's an internal representation, optimized, compressed, and bizarre.</para>

<para>However, it turns out that state machines are also valuable as first-class modeling languages for protocol handlers, i.e., &Oslash;MQ clients and servers. &Oslash;MQ makes it rather easy to design protocols, but we've never defined a good pattern for writing those clients and servers properly.</para>

<para>A protocol has at least two levels:</para>

<itemizedlist>
  <listitem><para>How we represent individual messages on the wire.</para></listitem>
  <listitem><para>How messages flow between peers, and the significance of each message.</para></listitem>
</itemizedlist>
<para>We've seen in this chapter how to produce codecs that handle serialization. That's a good start. But if we leave the second job to developers, that gives them a lot of room to interpret. As we make more ambitious protocols (file transfer + heart-beating + credit + authentication), it becomes less and less sane to try to implement clients and servers by hand.</para>

<para>Yes, people do this almost systematically. But the costs are high, and they're avoidable. I'll explain how to model protocols using state machines, and how to generate neat and solid code from those models.</para>

<para>My experience with using state machines as a software construction tool dates to 1985 and my first real job making tools for application developers. In 1991 I turned that knowledge into a free software tool called Libero, which spat out executable state machines from a simple text model.</para>

<para>The thing about Libero's model was that it was readable. That is, you described your program logic as named states, each accepting a set of events, each doing some real work. The resulting state machine hooked into your application code, driving it like a boss.</para>

<para>Libero was charmingly good at its job, fluent in many languages, and modestly popular given the enigmatic nature of state machines. We used Libero in anger in dozens of large distributed applications, one of which was finally switched off in 2011. State-machine driven code construction worked so well that it's somewhat impressive this approach never hit the mainstream of software engineering.</para>

<para>So in this section I'm going to explain Libero's model, and show how to use it to generate &Oslash;MQ clients and servers. We'll use GSL again but like I said, the principles are general and you can put together code generators using any scripting language.</para>

<para>As a worked example let's see how to carry on a stateful dialog with a peer on a ROUTER socket. We'll develop the server using a state machine (and the client by hand). We have a simple protocol that I'll call "NOM". I'm using the oh-so-very-serious <ulink url="http://unprotocols.org/blog:2">keywords for unprotocols</ulink> proposal:</para>

<screen>nom-protocol    = open-peering *use-peering

open-peering    = C:OHAI ( S:OHAI-OK / S:WTF )

use-peering     = C:ICANHAZ
                / S:CHEEZBURGER
                / C:HUGZ S:HUGZ-OK
                / S:HUGZ C:HUGZ-OK
</screen>

<para>I've not found a quick way to explain the true nature of state machine programming. In my experience, it invariably takes a few days of practice. After three or four days' exposure to the idea there is a near-audible 'click!' as something in the brain connects all the pieces together. We'll make it concrete by looking at the state machine for our NOM server.</para>

<para>A useful thing about state machines is that you can read them state by state. Each state has a unique descriptive name, and one or more <emphasis>events</emphasis>, which we list in any order. For each event we perform zero or more <emphasis>actions</emphasis>, and we then move to a <emphasis>next state</emphasis> (or stay in the same state).</para>

<para>In a &Oslash;MQ protocol server, we have a state machine instance <emphasis>per client</emphasis>. That sounds complex but it isn't, as we'll see. We describe our first state (Start) as having one valid event, "OHAI". We check the user's credentials and then arrive in the Authenticated state (<xref linkend="figure-63"/>).</para>

<figure id="figure-63">
    <title>The 'Start' State</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig63.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The Check Credentials action produces either an 'ok' or an 'error' event. It's in the Authenticated state that we handle these two possible events, by sending an appropriate reply back to the client (<xref linkend="figure-64"/>). If authentication failed, we return to the Start state where the client can try again.</para>

<figure id="figure-64">
    <title>The 'Authenticated' State</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig64.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>When authentication has succeeded, we arrive in the Ready state. Here we have three possible events: an ICANHAZ or HUGZ message from the client, or a heartbeat timer event (<xref linkend="figure-65"/>).</para>

<figure id="figure-65">
    <title>The 'Ready' State</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig65.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>There are a few more things about this state machine model that are worth knowing:</para>

<itemizedlist>
  <listitem><para>Events in upper case (like "HUGZ") are 'external events' that come from the client as messages.</para></listitem>
  <listitem><para>Events in lower case (like "heartbeat") are 'internal events', produced by code in the server.</para></listitem>
  <listitem><para>The "Send SOMETHING" actions are shorthand for sending a specific reply back to the client.</para></listitem>
  <listitem><para>Events that aren't defined in a particular state are silently ignored.</para></listitem>
</itemizedlist>
<para>Now, the original source for these pretty pictures is an XML model:</para>

<screen>&lt;class name = "nom_server" script = "server_c"&gt;

&lt;state name = "start"&gt;
    &lt;event name = "OHAI" next = "authenticated"&gt;
        &lt;action name = "check credentials" /&gt;
    &lt;/event&gt;
&lt;/state&gt;

&lt;state name = "authenticated"&gt;
    &lt;event name = "ok" next = "ready"&gt;
        &lt;action name = "send" message = "OHAI-OK" /&gt;
    &lt;/event&gt;
    &lt;event name = "error" next = "start"&gt;
        &lt;action name = "send" message = "WTF" /&gt;
    &lt;/event&gt;
&lt;/state&gt;

&lt;state name = "ready"&gt;
    &lt;event name = "ICANHAZ"&gt;
        &lt;action name = "send" message = "CHEEZBURGER" /&gt;
    &lt;/event&gt;
    &lt;event name = "HUGZ"&gt;
        &lt;action name = "send" message = "HUGZ-OK" /&gt;
    &lt;/event&gt;
    &lt;event name = "heartbeat"&gt;
        &lt;action name = "send" message = "HUGZ" /&gt;
    &lt;/event&gt;
&lt;/state&gt;
&lt;/class&gt;
</screen>
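
<para>To make the model concrete, here is roughly what it means in plain C. This is a hand-written sketch, not the output of the code generator: it leaves out &Oslash;MQ entirely, and the <literal>body</literal> and <literal>reply</literal> fields are hypothetical stand-ins (plain strings) for the request and reply messages. Note how the "check credentials" action raises an internal event that the machine then consumes in the next state, and how events that a state doesn't define simply fall through:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;string.h&gt;

typedef enum {
    start_state = 1, authenticated_state, ready_state
} state_t;

typedef enum {
    no_event = 0,
    OHAI_event, ICANHAZ_event, HUGZ_event,      //  External events
    ok_event, error_event, heartbeat_event      //  Internal events
} event_t;

typedef struct {
    state_t state;              //  Current state
    event_t next_event;         //  Next event
    const char *body;           //  Request body (hypothetical)
    const char *reply;          //  Last reply sent (hypothetical)
} client_t;

static void
check_credentials_action (client_t *self)
{
    if (self-&gt;body &amp;&amp; strcmp (self-&gt;body, "Joe") == 0)
        self-&gt;next_event = ok_event;
    else
        self-&gt;next_event = error_event;
}

static void
client_execute (client_t *self, event_t event)
{
    self-&gt;next_event = event;
    while (self-&gt;next_event) {
        event_t current = self-&gt;next_event;
        self-&gt;next_event = no_event;
        switch (self-&gt;state) {
            case start_state:
                if (current == OHAI_event) {
                    check_credentials_action (self);
                    self-&gt;state = authenticated_state;
                }
                break;
            case authenticated_state:
                if (current == ok_event) {
                    self-&gt;reply = "OHAI-OK";
                    self-&gt;state = ready_state;
                }
                else
                if (current == error_event) {
                    self-&gt;reply = "WTF";
                    self-&gt;state = start_state;
                }
                break;
            case ready_state:
                if (current == ICANHAZ_event)
                    self-&gt;reply = "CHEEZBURGER";
                else
                if (current == HUGZ_event)
                    self-&gt;reply = "HUGZ-OK";
                else
                if (current == heartbeat_event)
                    self-&gt;reply = "HUGZ";
                break;
        }
    }
}
</programlisting>

<para>Feeding this machine an OHAI with the body "Sleepy" leaves it back in the Start state with a WTF reply. An OHAI with the body "Joe" takes it through Authenticated to Ready, after which ICANHAZ produces CHEEZBURGER.</para>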

<para>The code generator is in examples/models/server_c.gsl. It is a fairly complete tool that I'll use and expand for more serious work later. It generates:</para>

<itemizedlist>
  <listitem><para>A server class in C (nom_server.c, nom_server.h) that implements the whole protocol flow.</para></listitem>
  <listitem><para>A selftest method that runs the selftest steps listed in the XML file.</para></listitem>
  <listitem><para>Documentation in the form of graphics (the pretty pictures).</para></listitem>
</itemizedlist>
<para>Here's a simple main program that starts the generated NOM server:</para>

<programlisting language="c">
#include "czmq.h"
#include "nom_server.h"

int main (int argc, char *argv [])
{
    printf ("Starting NOM protocol server on port 6000...\n");
    nom_server_t *server = nom_server_new ();
    nom_server_bind (server, "tcp://*:6000");
    nom_server_wait (server);
    nom_server_destroy (&amp;server);
    return 0;
}
</programlisting>

<para>The generated nom_server class is a fairly classic model. It accepts client messages on a ROUTER socket. The first frame on every request is the client's identity. The server manages a set of clients, each with state. As messages arrive, it feeds these as 'events' to the state machine. Here's the core of the state machine, as a mix of GSL commands and the C code we intend to generate:</para>

<programlisting language="c">
static void
client_execute (client_t *self, int event)
{
    self-&gt;next_event = event;
    while (self-&gt;next_event) {
        self-&gt;event = self-&gt;next_event;
        self-&gt;next_event = 0;
        switch (self-&gt;state) {
.for class.state
            case $(name:c)_state:
.   for event
.       if index () &gt; 1
                else
.       endif
                if (self-&gt;event == $(name:c)_event) {
.       for action
.           if name = "send"
                    zmsg_addstr (self-&gt;reply, "$(message:)");
.           else
                    $(name:c)_action (self);
.           endif
.       endfor
.       if defined (event.next)
                    self-&gt;state = $(next:c)_state;
.       endif
                }
.   endfor
                break;
.endfor
        }
        if (zmsg_size (self-&gt;reply) &gt; 1) {
            zmsg_send (&amp;self-&gt;reply, self-&gt;router);
            self-&gt;reply = zmsg_new ();
            zmsg_add (self-&gt;reply, zframe_dup (self-&gt;address));
        }
    }
}
</programlisting>

<para>Each client is held as an object with various properties, including the variables we need to represent a state machine instance:</para>

<programlisting language="c">
event_t next_event;         //  Next event
state_t state;              //  Current state
event_t event;              //  Current event
</programlisting>

<para>You will see by now that we are generating technically perfect code that has the precise design and shape we want. The only clue that the nom_server class isn't hand-written is that the code is <emphasis>too good</emphasis>. People who complain that code generators produce poor code are obviously used to poor code generators. It is trivial to extend our model as we need it. For example, here's how we generate the selftest code.</para>

<para>First, we add a "selftest" item to the state machine and write our tests. We're not using any XML grammar or validators so it really is just a matter of opening the editor and adding half-a-dozen lines of text:</para>

<screen>&lt;selftest&gt;
    &lt;step send = "OHAI" body = "Sleepy" recv = "WTF" /&gt;
    &lt;step send = "OHAI" body = "Joe" recv = "OHAI-OK" /&gt;
    &lt;step send = "ICANHAZ" recv = "CHEEZBURGER" /&gt;
    &lt;step send = "HUGZ" recv = "HUGZ-OK" /&gt;
    &lt;step recv = "HUGZ" /&gt;
&lt;/selftest&gt;
</screen>

<para>Designing on the fly, I decided that "send" and "recv" were a nice way to express "send this request, then expect this reply". Here's the GSL code that turns this model into real code:</para>

<screen>.for class-&gt;selftest.step
.   if defined (send)
    msg = zmsg_new ();
    zmsg_addstr (msg, "$(send:)");
.       if defined (body)
    zmsg_addstr (msg, "$(body:)");
.       endif
    zmsg_send (&amp;msg, dealer);

.   endif
.   if defined (recv)
    msg = zmsg_recv (dealer);
    assert (msg);
    command = zmsg_popstr (msg);
    assert (streq (command, "$(recv:)"));
    free (command);
    zmsg_destroy (&amp;msg);

.   endif
.endfor
</screen>

<para>Finally, one of the more tricky but absolutely essential parts of any state machine generator is <emphasis>how do I plug this into my own code?</emphasis> As a minimal example for this exercise I wanted to implement the "check credentials" action by accepting all OHAIs from my friend Joe (Hi Joe!) and rejecting everyone else's OHAIs. After some thought I decided to grab code directly from the state machine model. So in nom_server.xml, you'll see this:</para>

<screen>&lt;action name = "check credentials"&gt;
    char *body = zmsg_popstr (self-&gt;request);
    if (body &amp;&amp; streq (body, "Joe"))
        self-&gt;next_event = ok_event;
    else
        self-&gt;next_event = error_event;
    free (body);
&lt;/action&gt;
</screen>

<para>And the code generator grabs that custom code and inserts it into the generated nom_server.c file:</para>

<screen>.for class.action
static void
$(name:c)_action (client_t *self) {
$(string.trim (.):)
}
.endfor
</screen>

<para>And now we have something quite elegant: a single source file that describes my server state machine, and which also contains the native implementations for my actions. A nice mix of high-level and low-level that is about 90% smaller than the C code.</para>

<para>Beware, as your head spins with notions of all the amazing things you could produce with such leverage. While this approach gives you real power, it also moves you away from your peers, and if you go too far, you'll find yourself working alone.</para>

<para>By the way, this simple little state machine design exposes just three variables to our custom code:</para>

<itemizedlist>
  <listitem><para><literal>self->next_event</literal></para></listitem>
  <listitem><para><literal>self->request</literal></para></listitem>
  <listitem><para><literal>self->reply</literal></para></listitem>
</itemizedlist>
<para>In the Libero state machine model there are a few more concepts that we've not used here, but which we will need when we write larger state machines:</para>

<itemizedlist>
  <listitem><para>Exceptions, which let us write terser state machines. When an action raises an exception, further processing on the event stops. The state machine can then define how to handle exception events.</para></listitem>
  <listitem><para>Defaults state, where we can define default handling for events (especially useful for exception events).</para></listitem>
</itemizedlist>
</sect1>
<sect1>
<title>Authentication using SASL</title>
<para>When we designed AMQP in 2007, we chose <ulink url="http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer">SASL</ulink> for the authentication layer, one of the ideas we took from the BEEP protocol framework. SASL looks complex at first, but it's simple and fits very nicely into a &Oslash;MQ-based protocol. What I especially like about SASL is that it's scalable. You can start with anonymous access, or plain text authentication and no security, and grow to more secure mechanisms over time, without changing your protocol one bit.</para>

<para>I'm not going to give a deep explanation now, since we'll see SASL in action somewhat later. But I'll explain the principle so you're already somewhat prepared.</para>

<para>In the NOM protocol the client started with an OHAI command, which the server either accepted ("Hi Joe!") or rejected. This is simple but not scalable since server and client have to agree upfront what kind of authentication they're going to do.</para>

<para>What SASL introduced, and which is genius, is a fully abstracted and negotiable security layer that's still easy to implement at the protocol level. It works as follows:</para>

<itemizedlist>
  <listitem><para>The client connects.</para></listitem>
  <listitem><para>The server challenges the client, passing a list of security "mechanisms" that it knows about.</para></listitem>
  <listitem><para>The client chooses a security mechanism that it knows about, and answers the server's challenge with a blob of opaque data that (and here's the neat trick) some generic security library calculates and gives to the client.</para></listitem>
  <listitem><para>The server takes the security mechanism the client chose, and that blob of data, and passes it to its own security library.</para></listitem>
  <listitem><para>The library either accepts the client's answer, or the server challenges again.</para></listitem>
</itemizedlist>
<para>There are a number of free SASL libraries. When we come to real code, we'll implement just two mechanisms, ANONYMOUS and PLAIN, which don't need any special libraries.</para>

<para>To support SASL we have to add an optional challenge/response step to our "open-peering" flow. Here is what the resulting protocol grammar looks like (I'm modifying NOM to do this):</para>

<screen>secure-nom      = open-peering *use-peering

open-peering    = C:OHAI *( S:ORLY C:YARLY ) ( S:OHAI-OK / S:WTF )

ORLY            = 1*mechanism challenge
mechanism       = string
challenge       = *OCTET

YARLY           = mechanism response
response        = *OCTET
</screen>

<para>Where ORLY and YARLY contain a string (a list of mechanisms in ORLY, one mechanism in YARLY) and a blob of opaque data. Depending on the mechanism, the initial challenge from the server may be empty. We don't care a jot: we just pass this to the security library to deal with.</para>

<para>The SASL <ulink url="http://tools.ietf.org/html/rfc4422">RFC</ulink> goes into detail about other features (that we don't need), the kinds of ways SASL could be attacked, and so on.</para>

<para>Unless you're a security geek, all you should care about is the impact on the protocol, which is as simple as I've explained here.</para>
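
<para>To show how little the protocol needs to know, here is a minimal sketch of decoding a PLAIN response without any SASL library. It assumes the blob follows the layout defined for the PLAIN mechanism in RFC 4616: an optional authorization identity, a NUL octet, the login, a NUL octet, and the password. The <literal>plain_t</literal> structure, the <literal>plain_decode</literal> function, and the 64-byte field limits are invented for this illustration, and the real stack may encode things differently:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

typedef struct {
    char authzid [64];          //  Optional authorization identity
    char login [64];            //  Authentication identity
    char password [64];
} plain_t;

//  Splits a PLAIN response of the given size into its three fields.
//  Returns 0 if OK, -1 if the blob is malformed or a field is too long.
static int
plain_decode (const unsigned char *blob, size_t size, plain_t *plain)
{
    //  Find the two NUL separators
    const unsigned char *first = memchr (blob, 0, size);
    if (!first)
        return -1;
    size_t rest = size - (size_t) (first - blob) - 1;
    const unsigned char *second = memchr (first + 1, 0, rest);
    if (!second)
        return -1;

    size_t authzid_size = (size_t) (first - blob);
    size_t login_size = (size_t) (second - first) - 1;
    size_t password_size = size - (size_t) (second - blob) - 1;

    //  Reject empty logins, oversized fields, and stray NULs
    if (login_size == 0
    ||  authzid_size &gt;= sizeof (plain-&gt;authzid)
    ||  login_size &gt;= sizeof (plain-&gt;login)
    ||  password_size &gt;= sizeof (plain-&gt;password)
    ||  memchr (second + 1, 0, password_size))
        return -1;

    memcpy (plain-&gt;authzid, blob, authzid_size);
    plain-&gt;authzid [authzid_size] = 0;
    memcpy (plain-&gt;login, first + 1, login_size);
    plain-&gt;login [login_size] = 0;
    memcpy (plain-&gt;password, second + 1, password_size);
    plain-&gt;password [password_size] = 0;
    return 0;
}
</programlisting>

<para>The server would then compare the login and password against whatever accounts it knows about, and answer with OHAI-OK or WTF. Nothing else in the protocol changes when we swap in a stronger mechanism.</para>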

</sect1>
<sect1>
<title>Large-scale File Publishing</title>
<para>Let's put all these techniques together into a file distribution system that I'll call FileMQ. This is going to be a real product, living on <ulink url="https://github.com/hintjens/filemq">github.com</ulink>. What we'll make here is a first version of FileMQ, as a training tool. If the concept works, the real thing may eventually get its own Guide.</para>

<sect2>
<title>Why make FileMQ?</title>
<para>Why make a file distribution system? I already explained how to send large files over &Oslash;MQ, and it's really quite simple. But if you want to make messaging accessible to a million times more people than can use &Oslash;MQ, you need another kind of API. An API that my five-year-old son can understand. An API that is universal, requires no programming, and works with just about every single application.</para>

<para>Yes, I'm talking about the file system. It's the Dropbox pattern: chuck your files somewhere, and they get magically copied somewhere else, when the network connects again.</para>

<para>However what I'm aiming for is a fully decentralized architecture that looks more like git, that doesn't need any cloud services (though we could put FileMQ in the cloud), and which does multicast, i.e. can send files to many places at once.</para>

<para>FileMQ has to be secure(able), has to be easily hooked into random scripting languages, and has to be as fast as possible across our domestic and office networks.</para>

<para>I want to use it to back up photos from my mobile phone to my laptop, over WiFi. To share presentation slides in real-time across fifty laptops in a conference. To share documents with colleagues in a meeting. To send earthquake data from sensors to central clusters. To back up video from my phone as I take it, during protests or riots. To synchronize configuration files across a cloud of Linux servers.</para>

<para>A visionary idea, isn't it? Well, ideas are cheap. The hard part is making this, and making it simple.</para>

</sect2>
<sect2>
<title>Initial Design Cut - the API</title>
<para>Here's the way I see the first design. FileMQ has to be distributed, so every node can be a server and a client at the same time. But I don't want the protocol to be symmetric, because that seems forced. We have a natural flow of files from point A to point B, where A is the "server" and B is the "client". If files flow back the other way, we have two flows. So, FileMQ is <emphasis>not</emphasis> a synchronization protocol, though synchronizing two directories is going to be a common use case.</para>

<para>Thus, I'm going to build FileMQ as two pieces: a client, and a server. Then, I'll put these together in a main application (the "filemq" tool) that can act both as client and server. The two pieces will look quite similar to the nom_server, with the same kind of API:</para>

<programlisting language="c">
fmq_server_t *server = fmq_server_new ();
fmq_server_bind (server, "tcp://*:6000");
fmq_server_publish (server, "/home/ph/filemq/share", "/public");
fmq_server_publish (server, "/home/ph/photos/stream", "/photostream");

fmq_client_t *client = fmq_client_new ();
fmq_client_connect (client, "tcp://pieter.filemq.org:6000");
fmq_client_subscribe (client, "/public/", "/home/ph/filemq/share");
fmq_client_subscribe (client, "/photostream/", "/home/ph/photos/stream");

while (!zctx_interrupted)
    sleep (1);

fmq_server_destroy (&amp;server);
fmq_client_destroy (&amp;client);
</programlisting>

<para>If we wrap this C API in other languages, we can easily script FileMQ, embed it in applications, port it to smartphones, and so on.</para>

</sect2>
<sect2>
<title>Initial Design Cut - the Protocol</title>
<para>To start with we write down the protocol as an ABNF grammar. Our grammar starts with the flow of commands between the client and server. You should recognize these as a combination of the various techniques we've seen already:</para>

<screen>filemq-protocol = open-peering *use-peering [ close-peering ]

open-peering    = C:OHAI *( S:ORLY C:YARLY ) ( S:OHAI-OK / error )

use-peering     = C:ICANHAZ ( S:ICANHAZ-OK / error )
                / C:NOM
                / S:CHEEZBURGER
                / C:HUGZ S:HUGZ-OK
                / S:HUGZ C:HUGZ-OK

close-peering   = C:KTHXBAI / S:KTHXBAI

error           = S:SRSLY / S:RTFM
</screen>

<para>Here are the messages to and from the server:</para>

<screen>;   The client opens peering to the server
OHAI            = %x01 protocol version identity
protocol        = string        ; Must be "FILEMQ"
string          = size *VCHAR
size            = OCTET
version         = %x01
identity        = 16OCTET

;   The server challenges the client using the SASL model
ORLY            = %x02 mechanisms challenge
mechanisms      = size 1*mechanism
mechanism       = string
challenge       = *OCTET        ; 0MQ frame

;   The client responds with SASL authentication information
YARLY           = %x03 mechanism response
response        = *OCTET        ; 0MQ frame

;   The server grants the client access
OHAI-OK         = %x04

;   The client subscribes to a path
ICANHAZ         = %x05 path options
path            = string        ; Full or partial path
options         = dictionary
dictionary      = size *key-value
key-value       = string        ; Formatted as name=value

;   The server confirms the subscription
ICANHAZ-OK      = %x06

;   The client sends credit to the server
NOM             = %x07 credit
credit          = number
number          = 8OCTET        ; 64-bit integer, network order

;   The server sends a chunk of file data
CHEEZBURGER     = %x08 sequence operation filename
                  offset headers chunk
sequence        = number
operation       = OCTET
filename        = string
offset          = number
headers         = dictionary
chunk           = FRAME

;   Client or server sends a heartbeat, the other peer responds
HUGZ            = %x09
HUGZ-OK         = %x0A

;   Client or server closes the peering
KTHXBAI         = %x0B
</screen>

<para>And here are the different ways the server can tell the client things went wrong:</para>

<screen>;   Server error replies
S:SRSLY         = %x80 reason    ; Refused due to access rights
S:RTFM          = %x81 reason    ; Client sent an invalid command
</screen>

<para>The FILEMQ/1.0 protocol is specified on the <ulink url="http://rfc.zeromq.org/spec:19">&Oslash;MQ unprotocols website</ulink>.</para>
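
<para>To get a feel for how the grammar turns into bytes, here is a minimal sketch that encodes the <literal>string</literal> and <literal>number</literal> primitives, and uses them to build OHAI and NOM commands. It is not the generated codec: the function names are invented for this example, and everything is written into one flat buffer, whereas a real stack may put some fields into separate &Oslash;MQ frames:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

//  number = 8OCTET, a 64-bit integer in network (big-endian) order
static size_t
put_number (unsigned char *buffer, uint64_t value)
{
    int index;
    for (index = 0; index &lt; 8; index++)
        buffer [index] = (unsigned char) (value &gt;&gt; (56 - index * 8));
    return 8;
}

static uint64_t
get_number (const unsigned char *buffer)
{
    uint64_t value = 0;
    int index;
    for (index = 0; index &lt; 8; index++)
        value = (value &lt;&lt; 8) | buffer [index];
    return value;
}

//  string = size *VCHAR, where size is a single octet
static size_t
put_string (unsigned char *buffer, const char *string)
{
    size_t size = strlen (string);
    assert (size &lt;= 255);
    buffer [0] = (unsigned char) size;
    memcpy (buffer + 1, string, size);
    return size + 1;
}

//  OHAI = %x01 protocol version identity
static size_t
encode_ohai (unsigned char *buffer, const unsigned char *identity)
{
    size_t size = 0;
    buffer [size++] = 0x01;                         //  Command ID
    size += put_string (buffer + size, "FILEMQ");   //  Protocol name
    buffer [size++] = 0x01;                         //  Protocol version
    memcpy (buffer + size, identity, 16);           //  16-octet identity
    return size + 16;
}

//  NOM = %x07 credit
static size_t
encode_nom (unsigned char *buffer, uint64_t credit)
{
    buffer [0] = 0x07;
    return 1 + put_number (buffer + 1, credit);
}
</programlisting>

<para>With this layout an OHAI command takes 25 octets (1 + 7 + 1 + 16) and a NOM command takes 9. The caller is responsible for providing a buffer that is large enough.</para>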

</sect2>
<sect2>
<title>Building and Trying FileMQ</title>
<para>The FileMQ stack is <ulink url="https://github.com/hintjens/filemq">on github</ulink>. It works like a classic C/C++ project:</para>

<screen>git clone git://github.com/hintjens/filemq.git
cd filemq
./autogen.sh
./configure
make check
</screen>

<para>You want to be using the latest CZMQ master for this. Now try running the <literal>track</literal> command, which is a simple tool that uses FileMQ to track changes in one directory in another:</para>

<screen>cd src
./track ./fmqroot/send ./fmqroot/recv
</screen>

<para>And open two file navigator windows, one into <literal>src/fmqroot/send</literal> and one into <literal>src/fmqroot/recv</literal>. Drop files into the send folder and you'll see them arrive in the recv folder. The server checks once per second for new files. Delete files in the send folder, and they're deleted in the recv folder similarly.</para>

<para>I use track for things like updating my MP3 player mounted as a USB drive. As I add or remove files in my laptop's Music folder, the same changes happen on the MP3 player. FILEMQ isn't a full replication protocol, but it might become one.</para>

</sect2>
<sect2>
<title>Internal Architecture</title>
<para>To build FileMQ I used a lot of code generation, possibly too much. However, the code generators are all reusable in other stacks. They are an evolution of the set we saw earlier:</para>

<itemizedlist>
  <listitem><para>codec_c.gsl - generates a message codec for a given protocol.</para></listitem>
  <listitem><para>server_c.gsl - generates a server class for a protocol and state machine.</para></listitem>
  <listitem><para>client_c.gsl - generates a client class for a protocol and state machine.</para></listitem>
</itemizedlist>
<para>The best way to learn to use GSL code generation is to translate these into a language of your choice and make your own demo protocols and stacks. You'll find it fairly easy. FileMQ itself doesn't try to support multiple languages. It could but it'd make things needlessly complex.</para>

<para>The FileMQ architecture actually slices into two layers. There's a generic set of classes to handle chunks, directories, files, patches, SASL security, and configuration files. Then, there's the generated stack: messages, client, and server. If I was creating a new project I'd fork the whole FileMQ project, and go and modify the three models:</para>

<itemizedlist>
  <listitem><para>fmq_msg.xml - which defines the message formats.</para></listitem>
  <listitem><para>fmq_client.xml - which defines the client state machine, API, and implementation.</para></listitem>
  <listitem><para>fmq_server.xml - which does the same for the server.</para></listitem>
</itemizedlist>
<para>You'd want to rename things, to avoid confusion. Why didn't I make the reusable classes into a separate library? The answer is two-fold. First, no-one actually needs this (yet). Second, it'd make things more complex for you as you build and play with FileMQ. It's never worth adding complexity to solve a theoretical problem.</para>

<para>Although I wrote FileMQ in C, it's easy to map to other languages. It is quite amazing how nice C becomes when you add CZMQ's generic zlist and zhash containers, and class style. Let me go through the classes quickly:</para>

<itemizedlist>
  <listitem><para>fmq_sasl: encodes and decodes a SASL challenge. I only implemented the PLAIN mechanism, which is enough to prove the concept.</para></listitem>
  <listitem><para>fmq_chunk: works with variable-sized blobs. Not as efficient as &Oslash;MQ's messages but they do less weirdness and so are easier to understand. The chunk class has methods to read and write chunks from disk.</para></listitem>
  <listitem><para>fmq_file: works with files, which may or may not exist on disk. Gives you information about a file (like size), lets you read and write to files, remove files, check if a file exists, and check if a file is "stable" (more on that later).</para></listitem>
  <listitem><para>fmq_dir: works with directories, reading them from disk and comparing two directories to see what changed. When there are changes, returns a list of "patches".</para></listitem>
  <listitem><para>fmq_patch: works with one patch, which really just says "create this file" or "delete this file" (referring to a fmq_file item each time).</para></listitem>
  <listitem><para>fmq_config: works with configuration data. I'll come back to client and server configuration later.</para></listitem>
</itemizedlist>
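<para>The idea behind comparing two directories can be sketched in a few lines. This is a simplified illustration, not the real fmq_dir code: it assumes each directory snapshot is just a sorted array of file names, and it ignores file sizes and modification times, which a real implementation would also need to look at. The <literal>patch_t</literal> type and <literal>dir_diff</literal> function are invented for the example:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

typedef enum { patch_create = 1, patch_delete = 2 } patch_op_t;

typedef struct {
    patch_op_t op;              //  What to do with the file
    const char *filename;       //  Which file it applies to
} patch_t;

//  Compares two sorted lists of file names and stores up to max patches
//  that turn the older list into the newer one. Returns the number of
//  patches stored.
static size_t
dir_diff (const char **older, size_t older_size,
          const char **newer, size_t newer_size,
          patch_t *patches, size_t max)
{
    size_t old_nbr = 0, new_nbr = 0, count = 0;
    while ((old_nbr &lt; older_size || new_nbr &lt; newer_size)
    &amp;&amp;      count &lt; max) {
        int cmp;
        if (old_nbr == older_size)
            cmp = 1;            //  Only new files remain
        else
        if (new_nbr == newer_size)
            cmp = -1;           //  Only old files remain
        else
            cmp = strcmp (older [old_nbr], newer [new_nbr]);

        if (cmp &lt; 0) {
            //  File was in the old snapshot but is gone now
            patches [count].op = patch_delete;
            patches [count].filename = older [old_nbr++];
            count++;
        }
        else
        if (cmp &gt; 0) {
            //  File is in the new snapshot only
            patches [count].op = patch_create;
            patches [count].filename = newer [new_nbr++];
            count++;
        }
        else {
            //  Same name in both snapshots, nothing to do
            old_nbr++;
            new_nbr++;
        }
    }
    return count;
}
</programlisting>

<para>Comparing an old snapshot holding "a.mp3" and "b.mp3" with a new one holding "b.mp3" and "c.mp3" gives two patches: delete "a.mp3" and create "c.mp3".</para>
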
<para>Every class has a test method, and the main development cycle is "edit, test". These are mostly simple self tests but they make the difference between code I can trust, and code I know will still break. It's a safe bet that any code that isn't covered by a test case will have undiscovered errors. I'm not a fan of external test harnesses. But internal test code that you write as you write your functionality... that's like the handle on a knife.</para>

<para>You should, really, be able to read the source code and rapidly understand what these classes are doing. If you can't read the code happily, tell me. If you want to port the FileMQ implementation into other languages, start by forking the whole repository and later we'll see if it's possible to do this in one overall repo.</para>

</sect2>
<sect2>
<title>Public API</title>
<para>The public API consists of two classes (as we sketched earlier):</para>

<itemizedlist>
  <listitem><para>fmq_client: provides the client API, with methods to connect to a server, configure the client, and subscribe to paths.</para></listitem>
  <listitem><para>fmq_server: provides the server API, with methods to bind to a port, configure the server, and publish a path.</para></listitem>
</itemizedlist>
<para>If I was a keen young developer eager to use FileMQ in another language, I'd probably spend a happy weekend writing a binding for this public API, then stick it in a subdirectory of the filemq project called, say, "bindings/", and make a pull request.</para>

<para>The actual API methods come from the state machine description, like this (for the server):</para>

<screen>&lt;method name = "bind"&gt;
&lt;argument name = "endpoint" type = "string" /&gt;
zmq_bind (self-&gt;router, endpoint);
&lt;/method&gt;

&lt;method name = "publish"&gt;
&lt;argument name = "location" type = "string" /&gt;
&lt;argument name = "alias" type = "string" /&gt;
mount_t *mount = mount_new (location, alias);
zlist_append (self-&gt;mounts, mount);
&lt;/method&gt;

&lt;method name = "set anonymous"&gt;
&lt;argument name = "access" type = "number" /&gt;
//  Enable anonymous access without a config file
fmq_config_path_set (self-&gt;config, "security/anonymous", access? "1": "0");
&lt;/method&gt;
</screen>

</sect2>
<sect2>
<title>Design Notes</title>
<para>The hardest part of making FileMQ wasn't the protocol part, but maintaining accurate state internally. An FTP or HTTP server is essentially stateless. But a publish/subscribe server <emphasis>has</emphasis> to maintain subscriptions, at least.</para>

<para>So I'll go through some of the design aspects:</para>

<itemizedlist>
  <listitem><para>The client detects if the server has died by the lack of heartbeats (HUGZ) coming from the server. It then restarts its dialog by sending an OHAI. There's no timeout on the OHAI since the &Oslash;MQ DEALER socket will queue an outgoing message indefinitely.</para></listitem>
  <listitem><para>The server detects if a client has died by its lack of response (HUGZ-OK) to a heartbeat. In that case it deletes all state for the client including its subscriptions.</para></listitem>
  <listitem><para>The client API holds subscriptions in memory and replays them when it has connected successfully. This means the caller can subscribe at any time (and doesn't care when connections and authentication actually happen).</para></listitem>
  <listitem><para>The server and client use virtual paths, much like an HTTP or FTP server. You publish one or more "mount points" each corresponding to a directory on the server. Each of these maps to some virtual path, for instance "/" if you have only one mount point. Clients then subscribe to virtual paths, and files arrive in an inbox directory. We don't send physical file names across the network.</para></listitem>
  <listitem><para>There are some timing issues: if the server is creating its mount points while clients are connected and subscribing, the subscriptions won't attach to the right mount points. So, we bind the server port as the last thing.</para></listitem>
  <listitem><para>Clients can reconnect at any point; if the client sends OHAI, that signals the end of any previous conversation and the start of a new one. I might one day make subscriptions durable so that they survive a disconnection. The client stack, after reconnecting, replays any subscriptions the caller application already made.</para></listitem>
</itemizedlist>
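
<para>The virtual path idea from the list above can be sketched as follows. This is only an illustration under simple assumptions: a mount point is a pair of strings, a subscription matches by plain prefix comparison, and the names <literal>mount_virtual_name</literal> and <literal>subscription_matches</literal> are invented here. The mount_t structure in the real server holds more than this:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

typedef struct {
    const char *location;       //  Physical directory on the server
    const char *alias;          //  Virtual path that clients see
} mount_t;

//  Builds the virtual name for a physical file under a mount point.
//  Returns false if the file isn't under the mount's location, or if
//  the result doesn't fit into max bytes.
static bool
mount_virtual_name (const mount_t *mount, const char *physical,
                    char *virtual_name, size_t max)
{
    size_t size = strlen (mount-&gt;location);
    if (strncmp (physical, mount-&gt;location, size) != 0
    ||  physical [size] != '/')
        return false;

    //  A root alias ("/") adds nothing in front of the file name
    const char *prefix = strcmp (mount-&gt;alias, "/") == 0? "": mount-&gt;alias;
    int needed = snprintf (virtual_name, max, "%s%s", prefix, physical + size);
    return needed &gt;= 0 &amp;&amp; (size_t) needed &lt; max;
}

//  Returns true if a subscription path covers the given virtual name
static bool
subscription_matches (const char *subscription, const char *virtual_name)
{
    return strncmp (virtual_name, subscription, strlen (subscription)) == 0;
}
</programlisting>

<para>With a mount point that publishes "/home/ph/photos/stream" as "/photostream", the file "/home/ph/photos/stream/img1.jpg" becomes "/photostream/img1.jpg", which a subscription to "/photostream/" matches. The physical directory name never leaves the server.</para>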
</sect2>
<sect2>
<title>Reliability</title>
<para>As it stands, FileMQ implements the classic &Oslash;MQ publish-subscribe pattern. That is, clients receive a stream of updates but with no guarantees about overall consistency. To make FileMQ reliable we'd have to add some functionality:</para>

<itemizedlist>
  <listitem><para>A way for clients to request all patches since a certain time (possibly, all patches).</para></listitem>
  <listitem><para>A way for the server to store patches and subscriptions persistently.</para></listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Configuration</title>
<para>I've written many servers, like the Xitami web server that was popular in the late 90's, and the OpenAMQ messaging server. Making configuration easy and obvious was a large part of making these servers fun to use.</para>

<para>We typically aim to solve a number of problems:</para>

<itemizedlist>
  <listitem><para>Ship default configuration files with the product.</para></listitem>
  <listitem><para>Allow users to add custom configuration files that are never overwritten.</para></listitem>
  <listitem><para>Allow users to configure from the command-line.</para></listitem>
</itemizedlist>
<para>And then layer these one on the other, so command-line settings override custom settings, which override default settings. It can be a lot of work to do this right. For FileMQ I've taken a somewhat simpler tack: all configuration is done from the API.</para>

<para>So this is how we start and configure the server, for example:</para>

<programlisting language="c">
server = fmq_server_new ();
fmq_server_configure (server, "server_test.cfg");
fmq_server_publish (server, "./fmqroot/send", "/");
fmq_server_publish (server, "./fmqroot/logs", "/logs");
fmq_server_bind (server, "tcp://*:6000");
</programlisting>

<para>We do use a specific format for the config files, which is <ulink url="http://rfc.zeromq.org/spec:4">ZPL</ulink>, a minimalist syntax that we started using for &Oslash;MQ "devices" a few years ago, but which works well for any server:</para>

<screen>#   Configure server for plain access
#
server
    monitor = 1             #   Check mount points
    heartbeat = 1           #   Heartbeat to clients

publish
    location = ./fmqroot/logs
    virtual = /logs

security
    echo = I: use guest/guest to login to server
    #   These are SASL mechanisms we accept
    anonymous = 0
    plain = 1
        account
            login = guest
            password = guest
            group = guest
        account
            login = super
            password = secret
            group = admin
</screen>

<para>One cute thing (which seems useful) that the generated server code does is to parse this config file (when you use the fmq_server_configure() method) and execute any section that matches an API method. Thus the 'publish' section works as an fmq_server_publish() call.</para>
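
<para>As a rough illustration of executing a config section as an API method, here is a minimal sketch. It is not the generated FileMQ code: <literal>section_t</literal>, <literal>demo_server_t</literal>, <literal>demo_publish()</literal>, and <literal>demo_execute()</literal> are hypothetical stand-ins, and the sketch assumes the config file has already been parsed into a section name plus its properties:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

//  Hypothetical stand-in for one parsed config section;
//  this is not a FileMQ or CZMQ type
typedef struct {
    const char *name;           //  Section name, e.g. "publish"
    const char *location;       //  Value of the "location" property
    const char *virtual_path;   //  Value of the "virtual" property
} section_t;

//  Hypothetical stand-in for the server object
typedef struct {
    int mounts;                 //  Number of mount points published
    char last_mount [256];      //  Most recent "location -&gt; virtual"
} demo_server_t;

//  Plays the role of an API method such as fmq_server_publish()
static void
demo_publish (demo_server_t *self, const char *location,
              const char *virtual_path)
{
    assert (self);
    snprintf (self-&gt;last_mount, sizeof (self-&gt;last_mount),
              "%s -&gt; %s", location, virtual_path);
    self-&gt;mounts++;
}

//  Execute a section if its name matches an API method; returns 1
//  if the section was executed, 0 if it is just configuration data
static int
demo_execute (demo_server_t *self, const section_t *section)
{
    assert (self);
    assert (section);
    if (strcmp (section-&gt;name, "publish") == 0) {
        demo_publish (self, section-&gt;location, section-&gt;virtual_path);
        return 1;
    }
    return 0;
}
</programlisting>

<para>With this shape, a 'publish' section holding <literal>location = ./fmqroot/logs</literal> and <literal>virtual = /logs</literal> has the same effect as calling the publish method by hand, while a section such as 'security' is left alone as plain data.</para>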

</sect2>
</sect1>
<sect1>
<title>File Stability</title>
<para>It is quite common to poll a directory for changes and then do something 'interesting' with new files. But as one process is writing to a file, other processes have no idea when the file has been fully written. One solution is to add a second "indicator" file which we create after creating the first file. This is intrusive, however.</para>

<para>There is a neater way, which is to detect when a file is "stable", i.e. no-one is writing to it any longer. FileMQ does this by checking the modification time of the file. If it's more than a second old, then the file is considered stable, at least stable enough to be shipped off to clients. If a process comes along after five minutes and appends to the file, it'll be shipped off again.</para>

<para>For this to work, and this is a requirement for any application hoping to use FileMQ successfully, do not buffer more than a second's worth of data in memory before writing. If you use very large block sizes, the file may look stable when it's not.</para>
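
<para>The stability test itself is small. Here is a minimal sketch of the idea, assuming a POSIX system with <literal>stat()</literal>; the function names <literal>mtime_is_stable()</literal> and <literal>file_is_stable()</literal> are made up for this example and are not part of the FileMQ API:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;time.h&gt;
#include &lt;sys/stat.h&gt;

//  A file counts as stable if it was last modified more than
//  one second before 'now'
static bool
mtime_is_stable (time_t mtime, time_t now)
{
    return difftime (now, mtime) &gt; 1.0;
}

//  Check a file on disk; returns false if the file cannot be examined
static bool
file_is_stable (const char *path)
{
    assert (path);
    struct stat info;
    if (stat (path, &amp;info) != 0)
        return false;
    return mtime_is_stable (info.st_mtime, time (NULL));
}
</programlisting>

<para>Both <literal>st_mtime</literal> and <literal>time()</literal> count in whole seconds here, so a check like this can take up to two seconds after the last write before it reports the file as stable.</para>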

</sect1>
<sect1>
<title>Test Use Case</title>
<para>To properly test something like FileMQ we need a test case that plays with live data. One of my ongoing chores is to manage the MP3 tracks on my music player, which is a Sansa Clip reflashed with Rockbox (highly recommended). As I download tracks into my Music folder I want to copy these to my player, and as I find tracks that annoy me, I delete them in the Music folder and want those gone from my player too.</para>

<para>Using a powerful file distribution protocol for this is kind of overkill. I could write this using a bash or Perl script, but to be honest the hardest work in FileMQ was the directory comparison code, and I want to benefit from that. So I put together a simple tool called "track" which calls the FileMQ API. From the command line this runs with two arguments, the sending and the receiving directories:</para>

<screen>./track /home/ph/Music /media/3230-6364/MUSIC
</screen>

<para>The code is a neat example of how to use the FileMQ API to do local file distribution. Here is the full program, minus the license text (it's MIT/X11 licensed):</para>

<programlisting language="c">
#include "czmq.h"
#include "../include/fmq.h"

int main (int argc, char *argv [])
{
    if (argc &lt; 3) {
        puts ("usage: track original-directory tracking-directory");
        return 0;
    }
    //  Start a server that publishes the original directory as "/"
    fmq_server_t *server = fmq_server_new ();
    fmq_server_configure (server, "anonymous.cfg");
    fmq_server_publish (server, argv [1], "/");
    fmq_server_set_anonymous (server, true);
    fmq_server_bind (server, "tcp://*:6000");

    //  Start a client that mirrors "/" into the tracking directory
    fmq_client_t *client = fmq_client_new ();
    fmq_client_connect   (client, "tcp://localhost:6000");
    fmq_client_set_inbox (client, argv [2]);
    fmq_client_subscribe (client, "/");

    //  Nothing more to do in this thread; wait until we're interrupted
    while (!zctx_interrupted)
        sleep (1);
    puts ("interrupted");

    //  Shut down cleanly
    fmq_server_destroy (&amp;server);
    fmq_client_destroy (&amp;client);
    return 0;
}
</programlisting>

<para>Note how we work with physical paths in this tool. The server publishes the physical path "/home/ph/Music" and maps this to the virtual path "/". The client subscribes to "/" and receives all files into "/media/3230-6364/MUSIC". I could use any structure within the server directory, and it would be copied faithfully to the client's inbox.</para>
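
<para>The mapping between virtual and physical paths can be sketched as a prefix substitution. This is only an illustration under my own assumptions, not how FileMQ implements it; <literal>resolve_path()</literal> is a hypothetical helper that takes a physical location, the virtual path it is mounted at, and a virtual file path:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

//  Resolve a virtual path against a mount point; returns 0 on success,
//  -1 if the path is outside the mount or the buffer is too small
static int
resolve_path (const char *location, const char *mount,
              const char *virtual_path, char *physical, size_t size)
{
    assert (location &amp;&amp; mount &amp;&amp; virtual_path &amp;&amp; physical);
    size_t mount_len = strlen (mount);
    //  Treat "/" as an empty prefix so that every absolute path matches
    if (mount_len == 1 &amp;&amp; mount [0] == '/')
        mount_len = 0;
    if (strncmp (virtual_path, mount, mount_len) != 0)
        return -1;
    const char *rest = virtual_path + mount_len;
    //  Match must end on a path boundary: "/logs" must not match "/logsx"
    if (*rest != '/' &amp;&amp; *rest != 0)
        return -1;
    int length = snprintf (physical, size, "%s%s", location, rest);
    return (length &lt; 0 || (size_t) length &gt;= size)? -1: 0;
}
</programlisting>

<para>On the server side the location would be the published directory, such as "/home/ph/Music" mounted at "/". On the client side the same substitution, with the inbox as the location, gives the place where a received file lands.</para>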

</sect1>
</chapter>
<chapter id="the-community">
<title>The &Oslash;MQ Community</title>
<para>People sometimes ask what's special about &Oslash;MQ. My standard answer is, it's true that &Oslash;MQ is arguably the best answer we have to the vexing question of "how do we make the distributed software that the 21st century demands?" but more than that, &Oslash;MQ is special because of its community. This is ultimately what separates the wolves from the sheep.</para>

<para>There are three main open source patterns. One, the large firm dumping code to break the market for others. This is the Apache Foundation model. Two, tiny teams or small firms building their dream. This is the most common open source model, and it can be very successful commercially. Three, aggressive and diverse communities that swarm over a problem landscape. This is the Linux model, and the one we aspire to with &Oslash;MQ.</para>

<para>It's hard to over-emphasize the power and persistence of a working open source community. There really does not seem to be a better way of making software for the long term. Not only does the community choose the best problems to solve, it solves them minimally, carefully, and it then looks after these answers for years, decades, until they're no longer relevant, and then it quietly puts them away.</para>

<para>To really benefit from &Oslash;MQ, you need to understand the community. At some stage you'll want to submit a patch, an issue, an add-on. You might want to ask someone for help. You will probably want to bet a part of your business on &Oslash;MQ, and when I tell you that the community is much, much more important than the company which backs the product, even though I'm CEO of that company, this should be significant.</para>

<para>In this chapter I'm going to look at our community from several angles, and conclude by explaining in detail our contract for collaboration, which <ulink url="http://rfc.zeromq.org/spec:16">we call "C4"</ulink>. You should find the discussion useful for your own work. We've also adapted the &Oslash;MQ C4 process for closed source projects, with good success.</para>

<para>We'll cover:</para>

<itemizedlist>
  <listitem><para>The rough structure of &Oslash;MQ as a set of projects.</para></listitem>
  <listitem><para>Why we use the LGPL and not the BSD license.</para></listitem>
  <listitem><para>How we designed and grew the &Oslash;MQ community.</para></listitem>
  <listitem><para>The business that backs &Oslash;MQ.</para></listitem>
  <listitem><para>Who owns the &Oslash;MQ source code.</para></listitem>
  <listitem><para>How to make and submit a patch to &Oslash;MQ.</para></listitem>
  <listitem><para>Who has special rights to make commits to &Oslash;MQ.</para></listitem>
  <listitem><para>How we guarantee compatibility with old code.</para></listitem>
  <listitem><para>Why we don't use public git branches.</para></listitem>
  <listitem><para>Who decides on the &Oslash;MQ road-map.</para></listitem>
  <listitem><para>A worked example of a change to libzmq.</para></listitem>
</itemizedlist>
<sect1>
<title>Architecture of the &Oslash;MQ Community</title>
<para>You know that &Oslash;MQ is an LGPL-licensed project. In fact it's a collection of projects, built around the core library, <literal>libzmq</literal>. I'll visualize these projects as an expanding galaxy:</para>

<itemizedlist>
  <listitem><para>At the core, libzmq is the &Oslash;MQ core library. It's written in C++, with a low-level C API. The code is nasty, mainly because it's highly optimized but also because it's written in C++, a language that lends itself to subtle and deep nastiness. Martin Sustrik wrote the bulk of this code originally; today it has dozens of people who maintain different parts of it. Sustrik, incidentally, has said he'd use C if he did this again, and has started a rewrite in C called "nano".</para></listitem>
  <listitem><para>Around libzmq there are about 50 "bindings". These are individual projects that create higher-level APIs for &Oslash;MQ, or at least map the low-level API into other languages. The bindings vary in quality from experimental to utterly awesome. By far the most awesome binding is <ulink url="https://github.com/zeromq/pyzmq">PyZMQ</ulink>, which was one of the first community projects on top of &Oslash;MQ. If you are a binding author, you should really study PyZMQ and aspire to making your code and community as awesome.</para></listitem>
  <listitem><para>A lot of languages have multiple bindings (Erlang, Ruby, C#, at least) written by different people over time, or taking different approaches. We don't regulate these in any way. There are no "official" bindings. You vote by using one or the other, contributing to it, or ignoring it.</para></listitem>
  <listitem><para>On top of the bindings are a lot of projects that use &Oslash;MQ or build on it. See the "Labs" page on the wiki for a long list of projects and proto-projects that use &Oslash;MQ in some way. There are frameworks, web servers like Mongrel2, brokers like Majordomo, and enterprise open source tools like Storm.</para></listitem>
</itemizedlist>
<para>Libzmq, most of the bindings, and some of the outer projects sit in the <ulink url="https://github.com/organizations/zeromq">&Oslash;MQ community "organization"</ulink> on GitHub. This organization is "run" by a group consisting of the most senior binding authors. There's very little to run since it's almost all self-managing and there's zero conflict these days.</para>

<para>iMatix, my firm, plays a specific role in the community. We own the trademarks and enforce them discreetly, to make sure that if you download a package calling itself "ZeroMQ", you can trust what you are getting. People have on rare occasion tried to hijack the name, maybe believing that "free software" means there is no property at stake, or no-one willing to defend it. One thing you'll understand from this chapter is how seriously we take the process behind our software (and I mean "us" as a community, not a company). iMatix backs the community by enforcing that process on anything calling itself "ZeroMQ" or "&Oslash;MQ". We also put money and time into the software and packaging for reasons I'll explain later.</para>

<para>It is not a charity exercise. &Oslash;MQ is a for-profit project, and a very profitable one. The profits are just widely distributed among all those who invest in it. It's really that simple: take the time to become an expert in &Oslash;MQ, or build something useful on top of &Oslash;MQ, and you'll find your value as an individual, or team, or company increasing. iMatix enjoys the same benefits as everyone else in the community. It's win-win for everyone except our competitors, who find themselves facing a threat they can't beat and can't really escape. &Oslash;MQ dominates the future world of massively distributed software.</para>

<para>My firm doesn't just have the community's back, we also built the community. This was deliberate work; in the original &Oslash;MQ white paper from 2007 there were two projects. One was technical, how to make a better messaging system. The second was how to build a community that could take the software through to dominant success. Software dies, but community survives.</para>

</sect1>
<sect1>
<title>How to Make Really Large Architectures</title>
<para>There are, it has been said (at least by people reading this sentence out loud), two ways to make really large-scale software. Option One is to throw massive amounts of money and problems at empires of smart people, and hope that what emerges is not yet another career killer. If you're very lucky, and are building on lots of experience, and have kept your teams solid, and are not aiming for technical brilliance, and are furthermore incredibly lucky, it works.</para>

<para>But gambling with hundreds of millions of others' money isn't for everyone. For the rest of us who want to build large-scale software, there's Option Two, which is open source, and more specifically, <emphasis>free software</emphasis>. If you're asking how the choice of software license is relevant to the scale of the software you build, that's the right question.</para>

<para>The brilliant and visionary Eben Moglen once said, roughly, that a free software license is the contract on which a community builds. When I heard this, about ten years ago, the idea came to me, <emphasis>can we deliberately grow free software communities</emphasis>?</para>

<para>Ten years later, the answer is "yes", and there is almost a science to it. I say "almost" because we don't yet have enough evidence of people doing this deliberately with a documented, reproducible process. It is what I'm trying to do with <ulink url="http://softwareandsilicon.com/chapter:2#toc5">Social Architecture</ulink>. &Oslash;MQ came after Wikidot, after Digistan and after the Foundation for a Free Information Infrastructure (aka the FFII, an NGO that fights against software patents). Which came after a lot of less successful community projects like Xitami and Libero. My main takeaway from a long career of projects of every conceivable format is: if you want to build truly large-scale and long-lasting software, aim to build a free software community.</para>

<sect2>
<title>The Game</title>
<para>Software is my business: I code to eat, and to feed my kids and to make sure my wife doesn't leave me for someone nicer and better looking. You see the challenge. Of course I <emphasis>love</emphasis> coding and that sensation of sculpting perfect functionality out of the raw mass of possibility. But the world doesn't care how passionately we feel. We wrest our living from a market that is largely uninterested, and often hostile when it does notice. This is the Game, and the detailed strategy of how to win it is for another book, though if you make it to the end of this chapter I'll toss in a few cheap gimmicks.</para>

<para>But roughly: you identify and destroy your competitors, you develop active strategies, you learn rapidly, move quickly, and in general appreciate the Game as a constant competition in which every participant is an ally, an enemy, or the ground over which you are fighting. Friend, foe, or food.</para>

</sect2>
<sect2>
<title>The Contract</title>
<para>Here is a story, it happened to the eldest brother-in-law of the cousin of a friend of mine's colleague at work. His name was, and still is, Patrick.</para>

<para>Patrick was a computer scientist, with a PhD in advanced network topologies. He spent two years and his savings building a new product, and chose the BSD license because he believed that would get him more adoption. He worked in his attic, at great personal cost, and proudly published his work. People applauded, for it was truly fantastic, and his mailing lists were soon abuzz with activity and patches and happy chatter. Many companies told him how they were saving millions using his work. Some of them even paid him for consultancy and training. He was invited to speak at conferences and started collecting badges with his name on them. He started a small business, hired a friend to work with him, and dreamed of making it big.</para>

<para>Then one day, someone pointed him to a new project, GPL licensed, which had forked his work and was improving on it. He was irritated and upset, and asked how people -- fellow open sourcers, no less! -- would so shamelessly steal his code. There were long arguments on the list about whether it was even legal to relicense their BSD code as GPL code. Turned out, it was. He tried to ignore the new project but then he soon realized that new patches coming from that project <emphasis>couldn't even be merged back</emphasis> into his work!</para>

<para>Worse, the GPL project got popular and some of his core contributors made first small, and then larger patches to it. Again, he couldn't use those changes, and he felt abandoned. Patrick went into a depression, his girlfriend left him for an international currency dealer called, weirdly, Patrice, and he stopped all work on the project. He felt betrayed, and utterly miserable. He fired his friend, who took it rather badly and told everyone that Patrick was a closet banjo player. Finally Patrick took a job as a project manager for a cloud company, and by the age of forty, he had stopped programming even for fun.</para>

<para>Poor Patrick, I almost felt sorry for him. Then I asked him, "why didn't you choose the GPL?" "Because it's a restrictive viral license", he replied. I told him, you may have a PhD, and you may be the eldest brother-in-law of the cousin of a friend of my colleague, but you are an idiot and Monique was smart to leave you. You published your work saying, "please steal my code as long as you keep this 'please steal my code' statement in the resulting work", and when people did exactly that, you got upset. Worse, you were a hypocrite because when they did it in secret, you were happy, but when they did it openly, you felt betrayed.</para>

<para>Seeing your hard work captured by a smarter team and then used against you is enormously painful, so why even make that possible? Every proprietary project that uses BSD code is capturing it. A public GPL fork is perhaps more humiliating, but it's fully self-inflicted.</para>

<para>BSD is like food. It literally (and I mean that metaphorically) whispers "eat me" in the little voice one imagines a cube of cheese might use when it's sitting next to an empty bottle of Orval. The BSD license, like its near clone MIT/X11, was designed specifically by a university (Berkeley) with no profit motive, to leak work and effort. It is a way to push subsidized technology at below its cost price, a dumping of under-priced code in the hope that it will break the market for others. BSD is an <emphasis>excellent</emphasis> tool in the Game, but only if you're a large well-funded institution that can afford to use Option One. The Apache license is BSD in a suit.</para>

<para>For us small businesses who aim our investments like precious bullets, leaking work and effort is unacceptable. Breaking the market is great, but we cannot afford to subsidize our competitors. The BSD networking stack ended up putting Windows on the Internet. We cannot afford battles with those we should naturally be allies with. We cannot afford to make fundamental business errors because in the end, that means we have to fire people.</para>

<para>It comes down to behavioral economics and game theory. <emphasis>The license we choose modifies the economics of those who use our work</emphasis>. In the Game there are friends, foes, and food. BSD makes most people see us as lunch. Closed source makes most people see us as enemies (do you <emphasis>like</emphasis> paying people for software?). GPL, however, makes most people, with the exception of the Patricks of the world, our allies. Any fork of &Oslash;MQ is license compatible with &Oslash;MQ, to the point where we <emphasis>encourage</emphasis> forks as a valuable tool for experimentation. Yes, it can be weird to see someone try to run off with the ball but, and here's the secret, <emphasis>I can get it back any time I want.</emphasis></para>

</sect2>
<sect2>
<title>The Process</title>
<para>If you've accepted my thesis up to now, great! Now, I'll explain the rough process by which we actually build an open source community. This was how we built or grew or gently steered the &Oslash;MQ community into existence.</para>

<para>You recall the simplicity-oriented design process from The Human Scale<xref linkend="the-human-scale"/>, where I claimed that we build accurate software by successfully exploring the problem landscape, rather than by sheer intellectual effort. Keep this in mind. Now, see your community as a group exploring that landscape and sharing the results of their work.</para>

<para>Your goal as leader of a community is to motivate people to get out there and explore; to ensure they can do so safely and without disturbing others; to reward them when they make successful discoveries; and to ensure they share their knowledge with everyone else (and not because we ask them, not because they feel generous, but because it's The Law).</para>

<para>It is an iterative process. You make a small product, at your own cost, but in public view. You then build a small community around that product. If you have a small but real hit, the community then helps design and build the next version, and grows larger. And then that community builds the next version, and so on. It's evident that you remain part of the community, maybe even a majority contributor, but the more control you try to assert over the material results, the less people will want to participate. Plan your own retirement well before someone decides you are their next problem.</para>

</sect2>
<sect2>
<title>Crazy, Beautiful, and Easy</title>
<para>You need a goal that's crazy and simple enough to get people out of bed in the morning. Your community has to attract the very best people and that demands something special. With &Oslash;MQ, we said we were going to make "the Fastest. Messaging. Ever.", which qualifies as a good motivator. If we'd said, we're going to make "a smart transport layer that'll connect your moving pieces cheaply and flexibly across your enterprise", we'd have failed.</para>

<para>Then your work must be beautiful, immediately useful and attractive. Your contributors are users who want to explore just a little beyond where they are now. Make it simple, elegant, brutally clean. The experience when people run or use your work should be an emotional one. They should <emphasis>feel</emphasis> something, and if you accurately solved even just one big problem that until then they didn't quite realize they faced, you'll have a small part of their soul.</para>

<para>And then, easy to understand, use, and join. Too many projects have barriers to access: put yourself in the other person's mind and see all the reasons they come to your site, thinking "Uhm, interesting project, but..." and then leave. You want them to stay, try it, just once. Use GitHub and put the issue tracker right there.</para>

<para>If you do these things well, your community will be smart but more importantly, will be intellectually and geographically diverse. This is really important. A group of like-minded experts cannot explore the problem landscape well. They tend to make big mistakes. Diversity beats education any time.</para>

</sect2>
<sect2>
<title>Stranger, meet Stranger</title>
<para>How much up-front agreement do two people need to work together on something? In most organizations, a lot. But you can bring this cost down to near-zero, and then people can collaborate without ever having met, held a phone conference, or taken a business trip to discuss Roles and Responsibilities over way too many bottles of soju.</para>

<para>You need well-written rules that are designed by cynical people like me to force strangers into mutually beneficial collaboration instead of conflict. The GPL is a good start. GitHub and its fork/merge strategy is a good follow-up. And then you want something like our <ulink url="http://rfc.zeromq.org/spec:16">C4 rulebook</ulink> to control how work actually happens.</para>

<para>C4 (which I now use for every new open source project) has detailed and tested answers to a lot of common mistakes people make. For example, the sin of working off-line in a corner with others "because it's faster". Transparency is essential to get trust, which is essential to get scale. By forcing every single change through a single transparent process, you build real trust in the results.</para>

<para>Another cardinal sin that many open source developers commit is to place themselves above others. "I founded this project thus my intellect is superior to that of others". It's not just immodest and rude, and usually inaccurate, it's also poor business. The rules must apply equally to everyone, without distinction. You are part of the community. Your job, as founder of a project, is not to impose your vision of the product over others, but to make sure the rules are good, honest, and <emphasis>enforced</emphasis>.</para>

</sect2>
<sect2>
<title>Infinite Property</title>
<para>One of the saddest myths of the knowledge business is that ideas are a sensible form of property. It's medieval nonsense that should have been junked along with slavery, but sadly it's still making too many powerful people too much money.</para>

<para>Ideas are cheap. What does work sensibly as property is the hard work we do in building a market. "You eat what you kill" is the right model for encouraging people to work hard. Whether it's moral authority over a project, money from consulting, or the sale of a trademark to some large, rich firm: if you make it, you own it. But what you really own is "footfall", participants in your project, which ultimately defines your power in the Game.</para>

<para>To do this requires infinite free space. Thankfully, GitHub solved this problem for us, for which I will die a grateful person (there are many reasons to be grateful in life, which I won't list here because we only have a hundred or so pages left, but this is one of them).</para>

<para>You cannot scale a single project with many owners like you can scale a collection of many small projects, each with fewer owners. When we embrace forks, a person can become an "owner" with a single click. Now they just have to convince others to join, by demonstrating their unique value.</para>

<para>So in &Oslash;MQ we aimed to make it easy to write bindings on top of the core library, and we stopped trying to make those bindings ourselves. This created space for others to make those, become their owners, get that credit.</para>

</sect2>
<sect2>
<title>Care and Feeding</title>
<para>I wish a community could be 100% self-steering, and perhaps one day this will work, but today it's not the case. We're very close with &Oslash;MQ, but from my experience a community needs four types of care and feeding:</para>

<itemizedlist>
  <listitem><para>First, simply because most people are too nice, we need some kind of symbolic leadership or owners who provide ultimate authority in case of conflict. Usually it's the founders of the community. I've seen it work with self-elected groups of "elders", but old men like to talk a lot. I've seen communities split over the question "who is in charge?", and setting up legal entities, with boards, and such, seems to make such arguments worse, not better. Maybe because there seems to be more to fight over. One of the real benefits of free software is that it's always remixable, so instead of fighting over a pie, one simply forks the pie.</para></listitem>
  <listitem><para>Second, communities need living rules, and thus they need a lawyer able to formulate and write these down. Rules are critical; when done right, they remove friction. When done wrong, or neglected, we see real friction and argument that can drive away the nice majority, leaving the argumentative core in charge of the burning house. One thing I've tried to do with the &Oslash;MQ and previous communities is create reusable rules, which perhaps means we don't need lawyers as much.</para></listitem>
  <listitem><para>Third, communities need some kind of financial backing. This is the jagged rock that breaks most ships. If you starve a community, it becomes more creative but the core contributors burn out. If you pour too much money into it, you attract the professionals, who never say "no", and the community loses its diversity and creativity. If you create a fund for people to share, they will fight (bitterly) over it. With &Oslash;MQ we (iMatix) spend our time and money on marketing and packaging (like this book), and basic care - bug fixes, releases, website.</para></listitem>
  <listitem><para>Lastly, sales and commercial mediation. There is a natural market between expert contributors and customers, but both are somewhat incompetent at talking to each other. Customers assume that support is free or very cheap, since the software is free. Contributors are shy about asking a fair rate for their work. It makes for a difficult market. A growing part of my work and my firm's profits is simply connecting &Oslash;MQ users who want help, with experts from the community able to provide it, and ensuring both sides are happy with the results.</para></listitem>
</itemizedlist>
<para>I've seen communities of brilliant people with noble goals dying because the founders got some or all of these four things wrong. The core problem is that you can't expect consistently great leadership from any one company, person, or group. What works today often won't work tomorrow, yet structures become more solid, not more flexible, over time.</para>

<para>The best answer I can find is a mix of two things. One, the GPL and its guarantee of remixability. No matter how bad the authority, no matter how much they try to privatize and capture the community's work, if it's GPL licensed, that work can walk away and find a better authority. Before you say, "all open source offers this," think it through. I can kill a BSD-licensed project by hiring the core contributors, and not releasing any new patches. But even with a billion dollars I <emphasis>cannot</emphasis> kill a GPL-licensed project. Two, the philosophical anarchist model of authority: we choose it, it does not own us.</para>

</sect2>
</sect1>
<sect1>
<title>The &Oslash;MQ Process - C4</title>
<para>When we say &Oslash;MQ we sometimes mean libzmq, the core library. In early 2012 we synthesized the libzmq process into a formal protocol for collaboration that we called the <ulink url="http://rfc.zeromq.org/spec:16">Collective Code Construction Contract</ulink>, or C4. You can see this as a layer above the GPL. In fact libzmq doesn't quite stick to C4, since for historic reasons we use Jira instead of the GitHub issue tracker. But apart from that, these are our rules, and I'll explain the reasoning behind each one.</para>

<para>C4 is an evolution of the GitHub <ulink url="http://help.github.com/send-pull-requests/">Fork + Pull Model</ulink>. You may get the feeling I'm a fan of git and GitHub. This would be accurate: these two tools have had a hugely positive impact on our work over the last few years, especially when it comes to building community.</para>

<sect2>
<title>Language</title>
<blockquote>
  <para>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.</para>
</blockquote>
<para>By starting with the RFC 2119 language, the C4 text makes very clear its intention to act as a protocol rather than a randomly written set of recommendations. A protocol is a contract between parties, that defines the rights and obligations of each party. These can be peers in a network. They can be strangers working in the same project.</para>

<para>I think C4 is the first time anyone has attempted to codify a community's rulebook as a formal and reusable protocol spec. Before, our rules were spread out over several wiki pages, and quite specific to libzmq in many ways. But experience teaches us that the more formal and accurate and reusable the rules, the easier it is for strangers to collaborate up-front. And less friction means a more scalable community. At the time of C4, we also had some disagreement in the libzmq project over precisely what process we were using. Not everyone felt bound by the same rules. Let's just say some people felt they had a special status, which created friction with the rest of the community. So codification made things clear.</para>

<para>It's easy to use C4: just host your project on GitHub, get one other person to join, and open the floor to pull requests. In your README, put a link to C4 and that's it. We've done this in quite a few projects and it does seem to work. I've been pleasantly surprised a few times just applying these rules to my own work, like CZMQ. We are none of us, as I'll explain later in this chapter, so amazing that we can work without others.</para>

</sect2>
<sect2>
<title>Goals</title>
<blockquote>
  <para>C4 is meant to provide a reusable optimal collaboration model for open source software projects.</para>
</blockquote>
<para>The short-term reason for writing C4 was to end argument over the libzmq contribution process. The dissenters went off elsewhere. <ulink url="https://github.com/zeromq/libzmq/graphs/contributors">The &Oslash;MQ community blossomed</ulink>, smoothly and easily, as I'd predicted. Most people were surprised, but gratified. There have been no real criticisms of C4 except its branching policy, which I'll come to later since it deserves its own discussion.</para>

<para>There's a reason I'm bringing up history here: as founder of a community, you are asking people to invest in your property, trademark, branding. In return, and this is what we do with &Oslash;MQ, you can use that branding to set a bar for quality. When you download a product labeled as "&Oslash;MQ", you know that it's been produced to certain standards. It's a basic rule of quality: write down your process, otherwise you cannot improve it. Our processes aren't perfect, they can't ever be. But any flaw in them can be fixed, and tested.</para>

<para>Making C4 reusable is therefore really important. To learn more about the best possible process we need to get results from the widest range of projects.</para>

<blockquote>
  <para>It has these specific goals:</para>
  <para>To maximize the scale of the community around a project, by reducing the friction for new Contributors and creating a scaled participation model with strong positive feedbacks;</para>
</blockquote>
<para>The number one goal is size and health of the community. Not technical quality, not profits, not performance, not market share. Simply, the number of people who contribute to the project. The science here is simple: the larger the community, the more accurate the results.</para>

<blockquote>
  <para>To relieve dependencies on key individuals by separating different skill sets so that there is a larger pool of competence in any required domain;</para>
</blockquote>
<para>Perhaps the worst problem we faced in libzmq was dependence on people who could at the same time understand the code, manage GitHub branches, and make clean releases. It's like looking for athletes who can run marathons and sprints, swim, and also lift weights. We humans are really good at specialization. Asking us to be really good at two contradictory things just reduces the number of candidates sharply, which is a Bad Thing for any project. We had this problem severely in libzmq in 2009 or so, and fixed it by splitting the role of maintainer into two: one person makes patches, another person makes releases.</para>

<blockquote>
  <para>To allow the project to develop faster and more accurately, by increasing the diversity of the decision making process;</para>
</blockquote>
<para>This is theory, not fully proven but not falsified. The greater the diversity of the community and the number of people who can weigh in on discussions without fear of being criticized or dismissed, the faster and more accurately the software should develop. Speed is quite subjective here. Going very fast in the wrong direction is not just useless, it's actively damaging (and we suffered a lot of that in libzmq before we switched to C4).</para>

<blockquote>
  <para>To support the natural life-cycle of project versions from experimental through to stable, by allowing safe experimentation, rapid failure, and isolation of stable code;</para>
</blockquote>
<para>To be honest, this goal seems to be fading into irrelevance. It's quite an interesting effect of the process: <emphasis>the git master is almost always perfectly stable</emphasis>. This has to do with the size of changes, and their "latency", i.e. the time between someone writing the code, and someone actually using it fully. However, people still expect "stable" releases, so we'll keep this goal there for a while.</para>

<blockquote>
  <para>To reduce the internal complexity of project repositories, thus making it easier for Contributors to participate and reducing the scope for error;</para>
</blockquote>
<para>Curious observation: people who thrive in complex situations like to create complexity because it keeps their value high. It's the Cobra Effect (Google it). Git made branches easy and left us with the all too common syndrome of "git is easy once you understand that a git branch is just a folded five-dimensional lepton space that has a detached history with no intervening cache". Developers should not be made to feel stupid by their tools. I've seen too many top-class developers confused by repository structures to accept conventional wisdom on git branches. We'll come back to dispose of git branches shortly, dear reader.</para>

<blockquote>
  <para>To enforce collective ownership of the project, which increases economic incentive to Contributors and reduces the risk of hijack by hostile entities.</para>
</blockquote>
<para>Ultimately, we're economic creatures, and the sense that "we own this, and our work can never be used against us" makes it much easier for people to invest in an open source project like &Oslash;MQ. And it can't be just a feeling, it has to be real. There are a number of aspects to making collective ownership work; we'll see these one by one as we go through C4.</para>

</sect2>
<sect2>
<title>Preliminaries</title>
<blockquote>
  <para>The project SHALL use the git distributed revision control system.</para>
</blockquote>
<para>Git has its faults. Its command-line API is horribly inconsistent, and it has a complex, messy internal model that it shoves in your face at the slightest provocation. But despite doing its best to make its users feel stupid, git does its job really, really well. More pragmatically, I've found that if you stay away from certain areas (branches!), people learn git rapidly and don't make many mistakes. That data works for me.</para>

<blockquote>
  <para>The project SHALL be hosted on github.com or equivalent, herein called the "Platform".</para>
</blockquote>
<para>I'm sure one day some large firm will buy GitHub and break it, and another platform will rise in its place. GitHub serves up a near-perfect set of minimal, fast, simple tools. I've thrown hundreds of people at it, and they all stick, like flies stuck in a dish of honey.</para>

<blockquote>
  <para>The project SHALL use the Platform issue tracker.</para>
</blockquote>
<para>We made the mistake in libzmq of switching to Jira (mainly because a now departed person didn't like GitHub). Jira is a great example of how to turn something useful into a complex mess because the business depends on selling more "features". But even without criticizing Jira, keeping the issue tracker on the same platform means one less UI to learn, one less login, and integration between issues and patches.</para>

<blockquote>
  <para>The project SHOULD have clearly documented guidelines for code style.</para>
</blockquote>
<para>This is a protocol plug-in: insert code style guidelines here. If you don't document the code style you use, you have no basis except prejudice to reject patches.</para>

<blockquote>
  <para>A "Contributor" is a person who wishes to provide a patch, being a set of commits that solve some clearly identified problem.</para>
  <para>A "Maintainer" is a person who merges patches to the project. Maintainers are not developers; their job is to enforce process.</para>
</blockquote>
<para>Now to definitions of the parties, and the splitting of roles that saved us from the sin of structural dependency on rare individuals. This worked well in libzmq, but as you will see it depends on the rest of the process. C4 isn't a buffet; you will need the whole process (or something very like it), or it won't hold together.</para>

<blockquote>
  <para>Contributors SHALL NOT have commit access to the repository unless they are also Maintainers.</para>
  <para>Maintainers SHALL have commit access to the repository.</para>
</blockquote>
<para>What we wanted to avoid was people pushing their changes directly to master. This was the biggest source of trouble in libzmq historically: large masses of raw code that took months or years to fully stabilize. We eventually followed other &Oslash;MQ projects like PyZMQ in using pull requests. We went further, and stipulated that <emphasis>all</emphasis> changes had to follow the same path. No exceptions for "special people".</para>

<blockquote>
  <para>Everyone, without distinction or discrimination, SHALL have an equal right to become a Contributor under the terms of this contract.</para>
</blockquote>
<para>We had to state this explicitly. It used to be that the libzmq "maintainers" would reject patches simply because they didn't like them. Now, that may sound reasonable to the author of a library (though libzmq was not written by any one person) but let's remember our goal of creating a work that is owned by as many people as possible. Saying "I don't like your patch so I'm going to reject it" is equivalent to saying, "I claim to own this and I think I'm better than you, and I don't trust you". Those are toxic messages to give to others who are thinking of becoming your co-investors.</para>

<para>I think this fight between individual expertise and collective intelligence plays out in other areas. It defined Wikipedia, and still does, a decade after that work surpassed anything built by small groups of experts. For me, we make software by synthesizing knowledge, much as we make Wikipedia articles.</para>

</sect2>
<sect2>
<title>Licensing and Ownership</title>
<blockquote>
  <para>The project SHALL use the GPLv3 or a variant thereof (LGPL, AGPL).</para>
</blockquote>
<para>I've already explained how full remixability creates better scale and why the GPL and its variants seem the optimal contract for remixable software. If you're a large business aiming to dump code on the market, you won't want C4, but then you won't really care about community either.</para>

<blockquote>
  <para>All contributions to the project source code ("patches") SHALL use the same license as the project.</para>
</blockquote>
<para>This removes the need for any specific license or contribution agreement for patches. You fork the GPL code, you publish your remixed version on GitHub, you or anyone else can then submit that as a patch to the original code. BSD doesn't allow this. Any work that contains BSD code may also contain unlicensed proprietary code so you need explicit action from the author of the code before you can remix it.</para>

<blockquote>
  <para>All patches are owned by their authors. There SHALL NOT be any copyright assignment process.</para>
</blockquote>
<para>Here we come to the key reason people trust their investments in &Oslash;MQ: it's logistically impossible to buy the copyrights to create a closed-source competitor to &Oslash;MQ. iMatix can't do this either. And the more people send patches, the harder it becomes. &Oslash;MQ isn't just free and open today, this specific rule means it will remain so forever. Note that it's not the case in all GPL projects, many of which still ask for copyright transfer back to the maintainers.</para>

<blockquote>
  <para>The project SHALL be owned collectively by all its Contributors.</para>
</blockquote>
<para>This is perhaps redundant but worth saying: if everyone owns their patches, then the resulting whole is also owned by every contributor. There's no legal concept of owning lines of code: the "work" is at least a source file.</para>

<blockquote>
  <para>Each Contributor SHALL be responsible for identifying themselves in the project Contributor list.</para>
</blockquote>
<para>I.e. the maintainers are not karma accountants. Anyone who wants credit has to claim it themselves.</para>

</sect2>
<sect2>
<title>Patch Requirements</title>
<para>In this section we define the obligations of the contributor: specifically, what constitutes a "valid" patch, so maintainers have rules they can use to accept or reject patches.</para>

<blockquote>
  <para>Maintainers and Contributors MUST have a Platform account and SHOULD use their real names or a well-known alias.</para>
</blockquote>
<para>In the worst case scenario, where someone has submitted toxic code (patented, or owned by someone else), we need to be able to trace who and when, so we can remove the code. Asking for real names or a well-known alias is a theoretical strategy for reducing the risk of bogus patches. We don't know if this actually works because we haven't had the problem yet.</para>

<blockquote>
  <para>A patch SHOULD be a minimal and accurate answer to exactly one identified and agreed problem.</para>
</blockquote>
<para>Recall the Simplicity Oriented Design pattern from The Human Scale<xref linkend="the-human-scale"/>. This implements that. One clear problem, one minimal solution, apply, test, repeat.</para>

<blockquote>
  <para>A patch MUST adhere to the code style guidelines of the project if these are defined.</para>
</blockquote>
<para>This is just sanity: I've spent time cleaning up other people's patches because they insisted on putting the 'else' beside the 'if' instead of just below as Nature intended. Consistent code is healthier.</para>

<blockquote>
  <para>A patch MUST adhere to the "Evolution of Public Contracts" guidelines defined below.</para>
</blockquote>
<para>Ah, the pain, the pain. I'm not speaking of the time at age eight when I stepped on a plank with a 4-inch nail protruding from it. That was relatively OK. I'm speaking of 2010-2011 when we had multiple parallel releases of &Oslash;MQ, each with different <emphasis>incompatible</emphasis> APIs or wire protocols. It was an exercise in bad rules pointlessly enforced. The rule was "if you change the API or protocol, you SHALL create a new major version". Give me the nail through the foot, that hurt less.</para>

<para>One of the big changes we made with C4 was simply to ban, outright, this kind of sanctioned sabotage. Amazingly, it's not even hard. We just don't allow the breaking of existing public contracts, period, unless everyone agrees, in which case no period.</para>

<blockquote>
  <para>A patch SHALL NOT include non-trivial code from other projects unless the Contributor is the original author of that code.</para>
</blockquote>
<para>This rule has two effects. The first is that it forces people to make minimal solutions, since they cannot simply import swathes of existing code. In the cases where I've seen this happen to projects, it's always bad unless the imported code is very cleanly separated. The second is that it avoids license arguments. You write the patch, you are allowed to publish it as LGPL, and we can merge it back in. But if you find a 200-line code fragment on the web and try to paste that, we'll refuse.</para>

<blockquote>
  <para>A patch MUST compile cleanly on at least the most important target platforms.</para>
</blockquote>
<para>This is probably asking a lot since most contributors have only one platform to work on.</para>

<blockquote>
  <para>A "Correct Patch" is one that satisfies the above requirements.</para>
</blockquote>
<para>Just in case it wasn't clear, we're back to legalese and definitions.</para>

</sect2>
<sect2>
<title>Development Process</title>
<para>In this section we aim to describe the actual development process, step by step.</para>

<blockquote>
  <para>Change on the project SHALL be governed by the pattern of accurately identifying problems and applying minimal, accurate solutions to these problems.</para>
</blockquote>
<para>This is a blatant attempt to ram through thirty years' software design experience. It's a profoundly simple approach to design: make minimal, accurate solutions to real problems. Nothing more or less. Note the stress on "accuracy", a rare but essential ingredient. In &Oslash;MQ we don't have feature requests. Treating new features the same as bugs confuses newbies. But this process works, and not just in open source. Enunciating the problem we're trying to solve, with every single change, is key to deciding whether the change is worth making or not.</para>

<blockquote>
  <para>To initiate changes, a user SHALL log an issue on the project Platform issue tracker.</para>
</blockquote>
<para>This is meant to stop us going off-line and working in a ghetto, either by ourselves or with others. Although we tend to accept pull requests that have clear argumentation, this rule lets us say "stop" to confused or too-large patches.</para>

<blockquote>
  <para>The user SHOULD write the issue by describing the problem they face or observe.</para>
</blockquote>
<para>"Problem: we need feature X. Solution: make it" is not a good issue. "Problem: user cannot do common tasks A or B except by using a complex workaround. Solution: make feature X" is a decent explanation. Since everyone I've ever worked with has needed to learn this, it seems worth re-stating: document the real problem first, solution second.</para>

<blockquote>
  <para>The user SHOULD seek consensus on the accuracy of their observation, and the value of solving the problem.</para>
</blockquote>
<para>And since many apparent problems are illusory, by stating the problem explicitly we give others a chance to correct our logic. "You're only using A and B a lot because function C is unreliable. Solution: make function C work properly."</para>

<blockquote>
  <para>Users SHALL NOT log feature requests, ideas, suggestions, or any solutions to problems that are not explicitly documented and provable.</para>
</blockquote>
<para>There are several reasons for not logging ideas, suggestions, or feature requests. In our experience these just accumulate in the issue tracker until someone deletes them. But more profoundly, when we treat all change as problem solutions, we can prioritize trivially. Either the problem is real and someone wants to solve it, now, or it's not on the table. Thus, wish-lists are off the table.</para>

<blockquote>
  <para>Thus, the release history of the project SHALL be a list of meaningful issues logged and solved.</para>
</blockquote>
<para>I'd love the GitHub issue tracker to simply list all the issues we solved in each release. Today we still have to write that by hand. If one puts the issue number in each commit, and if one uses the GitHub issue tracker, which we sadly don't yet do for &Oslash;MQ, this release history is easier to produce mechanically.</para>
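
<para>As a rough illustration of why issue numbers in commits make the release history mechanical, the sketch below collects them from commit subject lines. It assumes commits refer to issues as <literal>#123</literal> (the GitHub convention); the function name <literal>issues_from_commits</literal> and the sample commit messages are invented for the example.</para>

<programlisting language="php"><![CDATA[<?php
//  Hypothetical helper: collect the issue numbers referenced in a list
//  of commit subject lines. Assumes the "#123" convention for issues.
function issues_from_commits(array $subjects): array
{
    $issues = [];
    foreach ($subjects as $subject) {
        if (preg_match_all('/#(\d+)/', $subject, $matches)) {
            foreach ($matches[1] as $number) {
                $issues[(int) $number] = true;   //  Keys drop duplicates
            }
        }
    }
    $numbers = array_keys($issues);
    sort($numbers);
    return $numbers;
}

//  Invented sample data; in practice this could come from
//  "git log --pretty=%s <previous-release>..HEAD"
$subjects = [
    "Problem: socket leaks on reconnect (#412)",
    "Fixed issue #398, test case added",
    "Problem: typo in man page",
    "Follow-up for #412",
];
foreach (issues_from_commits($subjects) as $number) {
    echo "Issue #$number\n";
}
]]></programlisting>

<para>With the sample data this lists issues 398 and 412 once each. Turning those numbers into readable release notes would still need the issue titles, which this sketch does not fetch.</para>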

<blockquote>
  <para>To work on an issue, a Contributor SHALL fork the project repository and then work on their forked repository.</para>
</blockquote>
<para>Here we explain the GitHub fork + pull request model so that newcomers only have to learn one process (C4) in order to contribute.</para>

<blockquote>
  <para>To submit a patch, a Contributor SHALL create a Platform pull request back to the project.</para>
</blockquote>
<para>GitHub has made this so simple that we don't need to learn git commands to do it, for which I'm deeply grateful. Sometimes, I'll tell people who I don't particularly like that command-line git is awesome and all they need to do is learn git's internal model in detail before trying to use it on real work. When I see them several months later they look... different.</para>

<blockquote>
  <para>A Contributor SHALL NOT commit changes directly to the project.</para>
</blockquote>
<para>Anyone who submits a patch is a contributor, and all contributors follow the same rules. No special privileges to the original authors, because otherwise we're not building a community, but boosting our egos.</para>

<blockquote>
  <para>To discuss a patch, people MAY comment on the Platform pull request, on the commit, or elsewhere.</para>
</blockquote>
<para>Randomly distributed discussions may be confusing if you're walking up for the first time, but GitHub solves this for all current participants by sending emails to those who need to follow what's going on. We had the same experience and the same solution in Wikidot, and it works. There's no evidence that discussing in different places has any negative effect.</para>

<blockquote>
  <para>To accept or reject a patch, a Maintainer SHALL use the Platform interface.</para>
</blockquote>
<para>Working via the GitHub web user interface means pull requests are logged as issues, with workflow and discussion. I'm sure there are more complex ways to work. Complexity is easy, it's simplicity that's incredibly hard.</para>

<blockquote>
  <para>Maintainers SHALL NOT accept their own patches.</para>
</blockquote>
<para>There was a rule we defined in the FFII years ago to stop people burning out: no less than two people on any project. One-person projects tend to end in tears, or at least bitter silence. We have quite a lot of data on burnout, why it happens, and how to prevent it (even cure it). I'll explore this later in the chapter, since if you work with or on open source you need to be aware of the risks. The "no merging your own patch" rule has two goals. First, if you want your project to be C4-certified, you have to get at least one other person to help. If no-one wants to help you, perhaps you need to rethink your project. Second, having a control for every patch makes it much more satisfying, keeps us more focused, and stops us breaking the rules because we're in a hurry, or just feeling lazy.</para>

<blockquote>
  <para>Maintainers SHALL NOT make value judgments on correct patches.</para>
</blockquote>
<para>We already said this but it's worth repeating: the role of Maintainer is not to judge a patch's substance, only its technical quality. The substantive worth of a patch only emerges over time: people use it, and like it, or they do not. And if no-one is using a patch, eventually it'll annoy someone else who will remove it, and no-one will complain.</para>

<blockquote>
  <para>Maintainers SHALL merge correct patches rapidly.</para>
</blockquote>
<para>There is a criterion I call "change latency" which is the round-trip time from identifying a problem to testing a solution. The faster the better. If maintainers cannot respond to pull requests as rapidly as people expect, they're not doing their job (or they need more hands).</para>

<blockquote>
  <para>The Contributor MAY tag an issue as "Ready" after making a pull request for the issue.</para>
</blockquote>
<para>GitHub by default offers the usual variety of issue labels but with C4 we don't use them. Instead we need just two labels, "Urgent" and "Ready". A contributor who wants another user to test an issue can then label it as "Ready".</para>

<blockquote>
  <para>The user who created an issue SHOULD close the issue after checking the patch is successful.</para>
</blockquote>
<para>When one person opens an issue, and another works on it, it's best to allow the original person to close the issue. That acts as a double-check that the issue was properly resolved.</para>

<blockquote>
  <para>Maintainers SHOULD ask for improvements to incorrect patches and SHOULD reject incorrect patches if the Contributor does not respond constructively.</para>
</blockquote>
<para>Initially I felt it was worth merging all patches no matter how poor. There's an element of trolling involved. Merging broken code to master might, I felt, pull in more contributors. But people were uncomfortable with this so we defined the "correct patch" rules, and the Maintainer's role in checking for quality. On the negative side I think we didn't take some interesting risks which could have paid off with more participants. On the positive side this has led to &Oslash;MQ master (and master in all projects that use C4) being practically production quality, practically all the time.</para>

<blockquote>
  <para>Any Contributor who has value judgments on a correct patch SHOULD express these via their own patches.</para>
</blockquote>
<para>In essence, the goal here is to allow users to try patches rather than to spend time arguing pros and cons. As easy as it is to make a patch, it's as easy to revert it with another patch. You might think this would lead to "patch wars" but that hasn't happened. We've had a handful of cases in libzmq where patches by one contributor were killed by another person who felt the experimentation wasn't going in the right direction. It is easier than seeking up-front consensus.</para>

<blockquote>
  <para>Maintainers MAY commit changes to non-source documentation directly to the project.</para>
</blockquote>
<para>This exit allows maintainers who are making release notes to push those without having to create an issue which would then affect the release notes, leading to stress on the space-time fabric and possibly involuntary rerouting backwards in the fourth dimension to before the invention of cold beer. Shudder. Simpler to agree that release notes aren't changes to the software.</para>

</sect2>
<sect2>
<title>Creating Stable Releases</title>
<para>For a production system we want some guarantee of stability. In the past this meant taking unstable code and then over months hammering out the bugs and faults until it was safe to trust. iMatix's job, for years, has been to do this to libzmq, turning raw code into packages by allowing only bug fixes, and no new code, into a "stabilization branch". It's surprisingly not as thankless as it sounds.</para>

<para>Now, since we went full speed with C4, we've found that git master of libzmq is mostly perfect, most of the time. This frees our time to do more interesting things, such as building new open source layers on top of libzmq. However, people still want that guarantee: many users will simply not install except from an "official" release. So a stable release today means two things. First, a snapshot of the master taken at a time when there were no new changes for a while, and no dramatic open bugs. Second, a way to fine-tune that snapshot to fix the critical issues remaining in it.</para>

<para>This is the process we explain in this section.</para>

<blockquote>
  <para>The project SHALL have one branch ("master") that always holds the latest in-progress version and SHOULD always build.</para>
</blockquote>
<para>This is redundant since every patch always builds but it's worth restating. If the master doesn't build (and pass its tests), someone needs waking up.</para>

<blockquote>
  <para>The project SHALL NOT use topic branches for any reason. Personal forks MAY use topic branches.</para>
</blockquote>
<para>I'll come to branches soon. In short (or "tl;dr", as they say on the webs) branches make the repository too complex and fragile, and require upfront agreement, all of which are expensive and avoidable.</para>

<blockquote>
  <para>To make a stable release someone SHALL fork the repository by copying it and thus become maintainer of this repository.</para>
  <para>Forking a project for stabilization MAY be done unilaterally and without agreement of project maintainers.</para>
</blockquote>
<para>It's free software. No-one has a monopoly on it. If you think the maintainers aren't producing stable releases right, fork the repository and do it yourself. Forking isn't a failure, it's an essential tool for competition. You can't BTW do this with branches, which means a branch-based release policy gives the project maintainers a monopoly. And that's bad because they'll become lazier and more arrogant than if real competition is chasing their heels.</para>

<blockquote>
  <para>Maintainers of the stabilization project SHALL maintain it through pull requests which MAY cherry-pick patches from the forked project.</para>
</blockquote>
<para>Perhaps the C4 process should just say that stabilization projects have maintainers and contributors like any project. That's all this rule means.</para>

<blockquote>
  <para>A patch to a repository declared "stable" SHALL be accompanied by a reproducible test case.</para>
</blockquote>
<para>Beware of a one-size-fits-all process. New code does not need the same paranoia as code which people are trusting for production use. In the normal development process we did not mention test cases. There's a reason for this. While I love testable patches, many changes aren't easily or at all testable. However, to stabilize a code base you want to fix only serious bugs, and you want to be 100% sure every change is accurate. This means before/after tests for every change.</para>

<blockquote>
  <para>A stabilization repository SHOULD progress through these phases: "unstable", "candidate", "stable", and then "legacy". That is, the default behavior of stabilization repositories is to die.</para>
</blockquote>
<para>This may be over-detailed. The key point here is that these forked stabilization repositories all die in the end, as master continues to evolve and continues to be forked off for production releases.</para>

</sect2>
<sect2>
<title>Evolution of Public Contracts</title>
<para>By "public contracts" I mean APIs and protocols. Up until the end of 2011, libzmq's naturally happy state was marred by broken promises and broken contracts. We stopped making promises (aka "roadmaps") for libzmq completely, and our dominant theory of change is now that it emerges carefully and accurately over time. At a 2012 Chicago meetup, Garrett Smith and Chuck Remes called this the "drunken stumble to greatness", which is how I think of it now.</para>

<para>We stopped breaking public contracts simply by banning the practice. Before then it had been "OK" (as in we did it, and everyone complained bitterly, and we ignored them) to break the API or protocol so long as we changed the major version number. Sounds fine, until you get &Oslash;MQ version 2.0, 3.0, and 4.0, all in development at the same time, and not speaking to each other.</para>

<blockquote>
  <para>All Public Contracts (APIs or protocols) SHOULD be documented.</para>
</blockquote>
<para>You'd think this was a given for professional software engineers but no, it's not. So, it's a rule. You want C4 certification for your project, you make sure your public contracts are documented. No "it's specified in the code" excuses. Code is not a contract. (Yes, I intend at some point to create a C4 certification process to act as a quality indicator for open source projects.)</para>

<blockquote>
  <para>All Public Contracts SHALL use Semantic Versioning.</para>
</blockquote>
<para>This rule is mainly here because people asked for it. I've no real love for it, since Semantic Versioning is what led to the so-called "why does &Oslash;MQ not speak to itself?!" debacle. I've never seen the problem that this solved. Something about runtime validation of library versions, or some-such.</para>
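
<para>For readers who haven't met it, Semantic Versioning numbers a contract as MAJOR.MINOR.PATCH and reserves a change of major number for incompatible changes. The "runtime validation of library versions" could look roughly like the sketch below. The function name <literal>is_compatible</literal> is hypothetical, and the rule is simplified: it ignores pre-release tags and the special treatment of 0.x versions.</para>

<programlisting language="php"><![CDATA[<?php
//  Hypothetical check: can an application built against $built run
//  against the installed version $installed of the same contract?
//  Simplified rule: same major number, and the installed version is
//  not older than the one we built against.
function is_compatible(string $built, string $installed): bool
{
    $built_major = (int) explode('.', $built)[0];
    $installed_major = (int) explode('.', $installed)[0];
    return $built_major === $installed_major
        && version_compare($installed, $built, '>=');
}

var_dump(is_compatible('3.2.0', '3.2.2'));   //  bool(true): patch upgrade
var_dump(is_compatible('3.2.0', '3.1.0'));   //  bool(false): installed is older
var_dump(is_compatible('2.2.0', '3.2.2'));   //  bool(false): major number changed
]]></programlisting>

<para>Note what the check does not do: it tells you that two versions are incompatible, but does nothing to stop them becoming so.</para>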

<blockquote>
  <para>All Public Contracts SHOULD have space for extensibility and experimentation.</para>
</blockquote>
<para>Now, the real thing is that public contracts <emphasis>do change</emphasis>. It's not about not changing them. It's about changing them safely. This means educating (especially protocol) designers to create that space up front.</para>
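
<para>One common way to leave that space in a protocol is to put a version number and some reserved bits in every frame header, and to have receivers ignore the bits they don't understand. The sketch below shows the idea with an invented four-byte header. The layout and the names <literal>encode_frame</literal>, <literal>decode_frame</literal>, and <literal>FLAG_MORE</literal> are made up for illustration and are not part of any &Oslash;MQ protocol.</para>

<programlisting language="php"><![CDATA[<?php
//  Invented frame layout, for illustration only:
//    1 byte   protocol version
//    1 byte   flags (bit 0 is defined, bits 1-7 reserved for later use)
//    2 bytes  body length, network byte order (so at most 65535 bytes)
//    N bytes  body
const FLAG_MORE = 0x01;
const KNOWN_FLAGS = FLAG_MORE;

function encode_frame(int $version, int $flags, string $body): string
{
    return pack('CCn', $version, $flags, strlen($body)) . $body;
}

function decode_frame(string $frame): array
{
    $header = unpack('Cversion/Cflags/nlength', $frame);
    return [
        'version' => $header['version'],
        //  Mask off reserved bits rather than rejecting the frame, so
        //  that newer senders can experiment without breaking us
        'flags'   => $header['flags'] & KNOWN_FLAGS,
        'body'    => substr($frame, 4, $header['length']),
    ];
}

//  A newer peer sets a reserved bit (0x80); an older receiver still
//  decodes the frame and simply does not see that bit
$frame = encode_frame(1, FLAG_MORE | 0x80, "hello");
$decoded = decode_frame($frame);
printf("version=%d flags=%d body=%s\n",
    $decoded['version'], $decoded['flags'], $decoded['body']);
//  Prints: version=1 flags=1 body=hello
]]></programlisting>

<para>The design choice that matters is in the receiver: tolerating unknown bits is what lets a later patch give one of them a meaning without breaking deployed peers.</para>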

<blockquote>
  <para>A patch that modifies a Public Contract SHOULD not break existing applications unless there is prior consensus on the value of doing this.</para>
</blockquote>
<para>Sometimes the patch is fixing a bad API that no-one is using. It's a freedom we need but it should be based on consensus, not one person's dogma. However, making random changes just 'because' is not good. In &Oslash;MQ/3.x, did we benefit from renaming <literal>ZMQ_NOBLOCK</literal> to <literal>ZMQ_DONTWAIT</literal>? Sure, it's closer to the POSIX socket <literal>recv()</literal> call, but is that worth breaking thousands of applications? No-one ever reported it as an issue. To misquote Stallman: <emphasis>your freedom to create an ideal world stops one inch from my application.</emphasis></para>

<blockquote>
  <para>A patch that introduces new features to a Public Contract SHOULD do so using new names.</para>
</blockquote>
<para>We had the experience in &Oslash;MQ once or twice of new features using old names (or worse, using names that were <emphasis>still in use</emphasis> elsewhere). &Oslash;MQ/3.0 had a newly introduced "ROUTER" socket that was totally different from the existing ROUTER socket in 2.x. Dear lord, you should be face-palming and asking why. The reason: apparently, even smart people sometimes need regulation to stop them doing silly things.</para>

<blockquote>
  <para>Old names SHOULD be deprecated in a systematic fashion by marking new names as "experimental" until they are stable, then marking the old names as "deprecated".</para>
</blockquote>
<para>This life-cycle notation has the great benefit of actually telling users what is going on, with a consistent direction. "Experimental" means "we have introduced this and intend to make it stable if it works". Not, "we have introduced this and will remove it at any time if we feel like it". One assumes that code which survives more than one patch cycle is meant to be there. "Deprecated" means "we have replaced this and intend to remove it".</para>

<blockquote>
  <para>When sufficient time has passed, old deprecated names SHOULD be marked "legacy" and eventually removed.</para>
</blockquote>
<para>In theory, this gives applications time to move to stable new contracts without risk. You can upgrade first, make sure things work, and then, over time, fix things up to remove dependencies on deprecated and legacy APIs and protocols.</para>

<blockquote>
  <para>Old names SHALL NOT be reused by new features.</para>
</blockquote>
<para>Ah, yes, the joy when &Oslash;MQ/3.x renamed the top-used API functions (<literal>zmq_send()</literal> and <literal>zmq_recv()</literal>) and then recycled the old names for new methods that were utterly incompatible (and which I suspect few people actually use). You should be slapping yourself in confusion again, but really, this is what happened and I was as guilty as anyone. After all, we did change the version number! Semantic Versioning FTW!! The only benefit of that experience was to get this rule.</para>

<blockquote>
  <para>When old names are removed, their implementations MUST provoke an exception (assertion) if used by applications.</para>
</blockquote>
<para>I've not tested this rule to be certain it makes sense. Perhaps what it means is "if you can't provoke a compile error because the API is dynamic, provoke an assertion".</para>
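<para>To make the life-cycle rules above a little more concrete, here is a minimal sketch of how a dynamic API might track the state of each public name and refuse removed ones. Everything in it is an assumption for illustration: the <literal>Lifecycle</literal> enum, the <literal>NameRegistry</literal> class, and the choice to throw a C++ exception rather than assert are mine, not anything libzmq does.</para>

<screen><![CDATA[
```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

//  Life-cycle states for a public name, following the C4 wording.
enum class Lifecycle { Experimental, Stable, Deprecated, Legacy, Removed };

//  Hypothetical registry of public names and their current state.
class NameRegistry {
public:
    void set (const std::string &name, Lifecycle state)
    {
        states_ [name] = state;
    }

    //  Looks up a name on behalf of an application. Unknown names and
    //  removed names throw. Deprecated and legacy names still work,
    //  but 'warn' is set so the caller can log a notice.
    Lifecycle use (const std::string &name, bool &warn) const
    {
        auto it = states_.find (name);
        if (it == states_.end ())
            throw std::invalid_argument ("unknown name: " + name);
        if (it->second == Lifecycle::Removed)
            throw std::logic_error ("removed name used: " + name);
        warn = it->second == Lifecycle::Deprecated
            || it->second == Lifecycle::Legacy;
        return it->second;
    }

private:
    std::map <std::string, Lifecycle> states_;
};
```
]]></screen>

<para>The point of the sketch is the direction of travel: a name only ever moves forward through the states, and once it reaches "removed" it fails loudly instead of being quietly recycled for something else.</para>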

<para>C4 is not perfect, few things are. The process for changing it (Digistan's COSS) is a little outdated now: it relies on a single-editor workflow with the ability to fork, but not merge. This seems to work but it could be better to use C4 for protocols like C4.</para>

</sect2>
</sect1>
<sect1>
<title>Worked Example</title>
<para>In <ulink url="http://lists.zeromq.org/pipermail/zeromq-dev/2012-October/018838.html">this email thread</ulink>, Dan Goes asks how to make a publisher that knows when a new client subscribes, and sends out previous matching messages. It's a standard pub-sub technique called "last value caching". Now over a 1-way transport like pgm (where subscribers literally send no packets back to publishers) this can't be done. But over TCP, it can, if we use an XPUB socket and if that socket didn't cleverly filter out duplicate subscriptions to reduce upstream traffic.</para>

<para>Though I'm not an expert contributor to libzmq, this seems a fun problem to solve. How hard could it be? I start by forking the libzmq repository to my own GitHub account, and then clone it to my laptop, where I build it:</para>

<screen>git clone git@github.com:hintjens/libzmq.git
cd libzmq
./autogen.sh
./configure
make
</screen>

<para>Since the libzmq code is neat and well-organized, it is quite easy to find the main files to change (xpub.cpp and xpub.hpp). Each socket type has its own source file and class. They inherit from socket_base.cpp, which has this hook for socket-specific options:</para>

<screen>//  First, check whether specific socket type overloads the option.
int rc = xsetsockopt (option_, optval_, optvallen_);
if (rc == 0 || errno != EINVAL)
    return rc;

//  If the socket type doesn't support the option, pass it to
//  the generic option parser.
return options.setsockopt (option_, optval_, optvallen_);
</screen>
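<para>If the hook isn't obvious from that fragment, the following stripped-down sketch shows the same pattern in isolation. It is not the libzmq source: the class names, the two option numbers, and the <literal>generic_setsockopt</literal> helper are hypothetical stand-ins. What it keeps is the convention that a socket type answers <literal>EINVAL</literal> to mean "not my option", at which point the base class tries the generic options.</para>

<screen><![CDATA[
```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

//  Hypothetical option numbers, for illustration only.
enum { OPT_GENERIC = 1, OPT_SPECIAL = 2 };

//  Simplified stand-in for the socket base class.
class socket_base {
public:
    virtual ~socket_base () {}

    int setsockopt (int option, const void *optval, std::size_t optvallen)
    {
        //  First, give the specific socket type a chance.
        int rc = xsetsockopt (option, optval, optvallen);
        if (rc == 0 || errno != EINVAL)
            return rc;
        //  It said "not mine", so try the generic options.
        return generic_setsockopt (option, optval, optvallen);
    }

    int generic_value = 0;

protected:
    //  Default hook: no socket-specific options at all.
    virtual int xsetsockopt (int, const void *, std::size_t)
    {
        errno = EINVAL;
        return -1;
    }

private:
    int generic_setsockopt (int option, const void *optval,
                            std::size_t optvallen)
    {
        if (option == OPT_GENERIC && optvallen == sizeof (int)) {
            generic_value = *static_cast <const int *> (optval);
            return 0;
        }
        errno = EINVAL;
        return -1;
    }
};

//  A socket type that understands one extra option.
class special_socket : public socket_base {
public:
    int special_value = 0;

protected:
    int xsetsockopt (int option, const void *optval,
                     std::size_t optvallen) override
    {
        if (option != OPT_SPECIAL || optvallen != sizeof (int)) {
            errno = EINVAL;
            return -1;
        }
        special_value = *static_cast <const int *> (optval);
        return 0;
    }
};
```
]]></screen>

<para>With this shape, adding a socket-specific option means touching only the one socket class, which is what makes a minimal patch possible.</para>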

<para>Then I check where the XPUB socket filters out duplicate subscriptions, in its xread_activated method:</para>

<screen>bool unique;
if (*data == 0)
    unique = subscriptions.rm (data + 1, size - 1, pipe_);
else
    unique = subscriptions.add (data + 1, size - 1, pipe_);

//  If the subscription is not a duplicate, store it so that it can be
//  passed to the user on the next recv call.
if (unique &amp;&amp; options.type != ZMQ_PUB)
    pending.push_back (blob_t (data, size));
</screen>

<para>At this stage I'm not too concerned with the details of how subscriptions.rm and .add work. The code seems obvious except that "subscription" also includes unsubscription, which confused me for a few seconds. If there's anything else weird in the rm and add methods, that's a separate issue to fix later. Time to make an issue for this change. I head over to the zeromq.jira.com site, log in, and create a new entry.</para>

<para>Jira kindly offers me the traditional choice between "bug" and "new feature" and I spend thirty seconds wondering where this counter-productive historical distinction came from. Presumably, the "we'll fix bugs for free but you pay for new features" commercial proposal, which stems from the "you tell us what you want and we'll make it for $X" model of software development, and which generally leads to "we spent three times $X and we got what?!" email Fists of Fury.</para>

<para>Putting such thoughts aside, I create <ulink url="https://zeromq.jira.com/browse/LIBZMQ-443">issue #443</ulink> and describe the problem and a plausible solution:</para>

<blockquote>
  <para>Problem: XPUB socket filters out duplicate subscriptions (deliberate design). However this makes it impossible to do subscription-based intelligence. See http://lists.zeromq.org/pipermail/zeromq-dev/2012-October/018838.html for a use-case.</para>
  <para>Solution: make this behaviour configurable with a socket option.</para>
</blockquote>
<para>Naming time. The API sits in include/zmq.h, so this is where I add the option name. When you invent a concept, in an API or anywhere, <emphasis>please</emphasis> take a moment to choose a name that is explicit and short and obvious. Don't fall back on generic names which need additional context to understand. You have one chance to tell the reader what your concept is, and does. A name like <literal>ZMQ_SUBSCRIPTION_FORWARDING_FLAG</literal> is terrible. It technically kind of aims in the right direction but is miserably long and obscure. I choose <literal>ZMQ_XPUB_VERBOSE</literal>: short and explicit and clearly an on/off switch with "off" being the default setting.</para>

<para>So, time to add a private property to the xpub class definition in xpub.hpp:</para>

<screen>// If true, send all subscription messages upstream, not just
// unique ones
bool verbose;
</screen>

<para>And then lift some code from router.cpp to implement the xsetsockopt method. Finally, change the xread_activated method to use this new option, and while at it, make that test on socket type more explicit too:</para>

<screen>//  If the subscription is not a duplicate store it so that it can be
//  passed to used on next recv call.
if (options.type == ZMQ_XPUB &amp;&amp; (unique || verbose))
    pending.push_back (blob_t (data, size));
</screen>
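<para>To see what that one-line change does to the traffic an application sees, here is a small self-contained model of the filter. It is a sketch under stated assumptions, not the libzmq implementation: the real code tracks subscriptions per pipe, whereas this <literal>xpub_model</literal> class (a hypothetical name) just keeps a reference count per topic, and it marks subscriptions and unsubscriptions with "+" and "-" instead of a leading byte.</para>

<screen><![CDATA[
```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

//  Simplified model of the XPUB subscription filter.
class xpub_model {
public:
    //  If true, pass all subscription messages to the application,
    //  not just unique ones.
    bool verbose = false;

    //  Messages waiting for the application to read.
    std::vector <std::string> pending;

    //  Handles one message from a subscriber; 'subscribe' false means
    //  an unsubscription. Returns true if the message was queued.
    bool on_subscription (const std::string &topic, bool subscribe)
    {
        bool unique;
        if (subscribe)
            unique = ++refs_ [topic] == 1;      //  First subscriber
        else {
            auto it = refs_.find (topic);
            if (it == refs_.end ())
                unique = false;                 //  Nothing to remove
            else {
                unique = --it->second == 0;     //  Last subscriber gone
                if (unique)
                    refs_.erase (it);
            }
        }
        assert (refs_.count (topic) == 0 || refs_ [topic] > 0);
        if (unique || verbose) {
            pending.push_back ((subscribe ? "+" : "-") + topic);
            return true;
        }
        return false;
    }

private:
    std::map <std::string, int> refs_;
};
```
]]></screen>

<para>With <literal>verbose</literal> off, a second subscription to the same topic is swallowed. With it on, every subscription reaches the application, which is exactly the hook a last value cache needs in order to resend the previous message to each newcomer.</para>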

<para>The thing builds nicely first time. Which makes me a little suspicious, but being lazy and jet-lagged I don't immediately make a test case to actually try out the change. The process doesn't demand that, even if usually I'd do it just to catch that inevitable 10% of mistakes we all make. I do however document this new option on the <literal>doc/zmq_setsockopt.txt</literal> man page. In the worst case I added a patch that wasn't really useful. But I certainly didn't break anything.</para>

<para>I don't implement a matching <literal>zmq_getsockopt</literal>, since "minimal" means what it says. There's no obvious use case for getting the value of an option that you presumably just set, in code. Symmetry isn't a valid reason to double the size of a patch. I did have to document the new option since the process says, "All Public Contracts SHOULD be documented."</para>

<para>Committing the code, I push the patch to my forked repository (the 'origin'):</para>

<screen>git commit -a -m "Fixed issue #443"
git push origin master
</screen>

<para>Switching to the GitHub web interface, I go to my libzmq fork, and press the big "Pull Request" button at the top. GitHub asks me for a title, so I enter "Added ZMQ_XPUB_VERBOSE option". I'm not sure why it asks this since I made a neat commit message but hey, let's go with the flow here.</para>

<para>This makes a nice little pull request with two commits. The one I'd made a month ago on the release notes, to prepare for the 3.2.1 release (a month passes so quickly when you spend most of it in airports), and my fix for issue #443 (37 new lines of code). GitHub lets you continue to make commits after you've kicked off a pull request. They get queued up, and merged in one go. That is easy but the maintainer may refuse the whole bundle based on one patch that doesn't look valid.</para>

<para>Since Dan is waiting (at least in my highly optimistic imagination) for this fix, I go back to the zeromq-dev list and tell him I've made the patch, with a link to the commit. The faster I get feedback, the better. It's 1am in South Korea as I make this patch, so early evening in Europe, and morning in the States. You learn to count timezones when you work with people across the world. Ian is at a conference, Mikko is getting on a plane, and Chuck is probably in the office, but three hours later, Ian merges the pull request.</para>

<para>After Ian merges the pull request, I re-synchronize my fork with the upstream libzmq repository. First, I add a 'remote' that tells git where this repository sits (I do this just once in the directory where I'm working):</para>

<screen>git remote add upstream git://github.com/zeromq/libzmq.git
</screen>

<para>And then I pull changes back from the upstream master and check the git log to double-check:</para>

<screen>git pull --rebase upstream master
git log
</screen>

<para>And that is pretty much it, in terms of how much git one needs to learn and use to contribute patches to libzmq. Six git commands and some clicking on web pages. Most importantly to me as a naturally lazy, stupid, and easily confused developer, I don't have to learn git's internal models, and never have to do anything involving those infernal engines of structural complexity we call "git branches". Next up, the attempted assassination of git branches. Let's live dangerously!</para>

</sect1>
<sect1>
<title>Git Branches Considered Harmful</title>
<para>One of git's most popular features is how easy it makes branches. Almost all projects that use git use branches, and the selection of the "best" branching strategy is like a rite of passage for an open source project. Vincent Driessen's <ulink url="http://nvie.com/posts/a-successful-git-branching-model/">git-flow</ulink> is maybe the best known. It has 'base' branches (master, develop), 'feature' branches, 'release' branches, 'hotfix' branches, and 'support' branches. Many teams have adopted git-flow, which even has git extensions to support it. I'm a great believer in popular wisdom, but sometimes you have to recognize mass delusion for what it is.</para>

<para>Here is a section of C4 that might have shocked you when you first read it:</para>

<blockquote>
  <para>The project SHALL NOT use topic branches for any reason. Personal forks MAY use topic branches.</para>
</blockquote>
<para>To be clear, it's <emphasis>public branches in shared repositories</emphasis> that I'm talking about. Using branches for private work, e.g. to work on different issues, appears to work well enough, though it's more complexity than I personally enjoy. To channel Stallman again: <emphasis>your freedom to create complexity ends one inch from our shared workspace.</emphasis></para>

<para>Like the rest of C4, the rules on branches are not accidental. They came from our experience making &Oslash;MQ, starting when Martin Sustrik and I rethought how to make stable releases. We both love and appreciate simplicity (some people seem to have a remarkable tolerance for complexity). We chatted for a while... I asked him, "I'm going to start making a stable release, would it be OK for me to make a branch in the git you're working in?" Martin didn't like the idea. "OK, if I fork the repository, I can move patches from your repo to that one". That felt much better to both of us.</para>

<para>The response from many in the &Oslash;MQ community was shock and horror. People felt we were being lazy and making contributors work harder to find the "right" repository. Still, this seemed simple, and indeed it worked smoothly. The best part was that we each worked as we wanted to. Whereas before, the &Oslash;MQ repository had felt horribly complex (and it wasn't even anything like git-flow), this felt simple. And it worked. The only downside was that we lost a single unified history. Now, perhaps historians will feel robbed, but I honestly can't see that the historical minutiae of who changed what, when, including every branch and experiment, are worth any significant pain or friction.</para>

<para>People have gotten used to the "multiple repositories" approach in ZeroMQ and we've started using that in other projects quite successfully. My own opinion is that history will judge git branches and patterns like git-flow as a complex solution to imaginary problems inherited from the days of Subversion and monolithic repositories.</para>

<para>More profoundly, and perhaps this is why the majority seems to be "wrong": I think the branches vs. forks argument is really a deeper design vs. evolve argument about how to make software optimally. I'll address that deeper argument in the next section. For now, I'll try to be scientific about my irrational hate of branches, by looking at a number of criteria and comparing branches and forks on each one.</para>

<sect2>
<title>Simplicity vs. Complexity</title>
<para><emphasis>The simpler, the better.</emphasis></para>

<para>There is no inherent reason branches are more complex than forks. However, git-flow uses <emphasis>five types</emphasis> of branch, whereas C4 uses two types of fork (development, and stable) and one branch (master). Circumstantial evidence is thus that branches lead to more complexity than forks. For new users, it is definitely easier (and we've measured this in practice) to learn to work with many repositories and no branches except master.</para>

</sect2>
<sect2>
<title>Change Latency</title>
<para><emphasis>The smaller and more rapid the delivery, the better.</emphasis></para>

<para>Development branches seem to correlate strongly with large, slow, risky deliveries. "Sorry, I have to merge this branch before we can test the new version" signals a breakdown in process. It's certainly not how C4 works, which is by focusing tightly on individual problems and their minimal solutions. Allowing branches in development raises change latency. Forks have a different outcome: it's up to the forker to ensure his changes merge cleanly, and to keep them simple so they won't be rejected.</para>

</sect2>
<sect2>
<title>Learning Curve</title>
<para><emphasis>The smoother the learning curve, the better.</emphasis></para>

<para>Evidence definitely shows that learning to use git branches is complex. For some people this is OK. For most developers, every cycle spent learning git is a cycle lost on more productive things. I've been told several times, by different people, that I do not like branches because I "never properly learned git". That is fair but it is a criticism of the tool, not the human.</para>

</sect2>
<sect2>
<title>Cost of Failure</title>
<para><emphasis>The lower the cost of failure, the better.</emphasis></para>

<para>Branches demand more perfection from developers since mistakes potentially affect others. This raises the cost of failure. Forks make failure extremely cheap since literally nothing that happens in a fork can affect others not using that fork.</para>

</sect2>
<sect2>
<title>Upfront Coordination</title>
<para><emphasis>The less need for upfront coordination, the better.</emphasis></para>

<para>You can do a hostile fork. You cannot do a hostile branch. Branches depend on upfront coordination, which is expensive and fragile. One person can veto the desires of a whole group. In the &Oslash;MQ community for example we were unable to agree on a git branching model for a year. We solved that by using forking instead. The problem went away.</para>

</sect2>
<sect2>
<title>Scalability</title>
<para><emphasis>The more you can scale a project, the better.</emphasis></para>

<para>The strong assumption in all branch strategies is that the repository <emphasis>is</emphasis> the project. But there is a limit to how many people you can get in agreement to work together in one repository. As I explained, the cost of upfront coordination can become fatal. A more realistic project scales by allowing anyone to start their own repositories, and ensuring these can work together. A project like &Oslash;MQ has dozens of repositories. Forking looks more scalable than branching.</para>

</sect2>
<sect2>
<title>Surprise and Expectations</title>
<para><emphasis>The less surprising, the better.</emphasis></para>

<para>People expect branches and find forks to be uncommon and thus confusing. This is the one aspect where branches win. If you use branches, a single patch will have the same commit hash tag, whereas across forks the patch will have different hash tags. That makes it harder to track patches as they cross forks, true. But seriously, <emphasis>having to track hexadecimal hash tags is not a feature</emphasis>. It's a bug. Sometimes better ways of working just are surprising at first.</para>

</sect2>
<sect2>
<title>Economics of Participation</title>
<para><emphasis>The more tangible the rewards, the better.</emphasis></para>

<para>People like to own their work, and get credit for it. This is much easier with forks than with branches. Forks create more competition, in a healthy way, while branches suppress competition and force people to collaborate and share credit. This sounds positive but in my experience it de-motivates people. A branch isn't a product you can "own", whereas a fork can be.</para>

</sect2>
<sect2>
<title>Robustness in Conflict</title>
<para><emphasis>The more a model can survive conflict, the better.</emphasis></para>

<para>Like it or not, people fight over ego, status, beliefs, and theories of the world. Challenge is a necessary part of science. If your organizational model depends on agreement, you won't survive the first real fight. Branches do not survive real arguments and fights. Whereas forks can be hostile, and still benefit all parties. And this is indeed how free software works.</para>

</sect2>
<sect2>
<title>Guarantees of Isolation</title>
<para><emphasis>The stronger the isolation between production code and experiment, the better.</emphasis></para>

<para>People make mistakes. I've seen experimental code pushed to mainline production by error. I've seen people make bad panic changes under stress. But the real fault is in allowing two entirely separate generations of product to exist in the same protected space. If you can push to random-branch-x you can push to master. Branches do not guarantee isolation of production critical code. Forks do.</para>

</sect2>
<sect2>
<title>Visibility</title>
<para><emphasis>The more visible our work, the better.</emphasis></para>

<para>Forks have watchers, issues, a README, a wiki. Branches have none of these. People try forks, build them, break them, patch them. Branches sit there until someone remembers to work on them. Forks have downloads and tarballs. Branches do not. When we look for self-organization, the more visible and declarative the problems, the faster and more accurately we can work.</para>

</sect2>
<sect2>
<title>Conclusions</title>
<para>In this section I've listed a series of arguments, most of which came from fellow team members. Here's how it seems to break down: git veterans insist that branches are the way to work, whereas newcomers tend to feel intimidated when asked to navigate git branches. Git is not an easy tool to master. What we've discovered, accidentally, is that when you stop using branches <emphasis>at all</emphasis>, git becomes trivial to use. It literally comes down to six commands (clone, remote, commit, log, push, and pull). Furthermore, a branch-free process actually works. We've used it for a couple of years now and have seen no visible downside, except surprise to the veterans and growth of "single" projects over multiple repositories.</para>

<para>If you can't use forks, perhaps because your firm doesn't trust GitHub's private repositories, then perhaps topic branches (one per issue) will work. You'll still suffer the costs of getting upfront consensus, low competitiveness, and risk of human error.</para>

</sect2>
</sect1>
<sect1>
<title>The Myth of Intelligent Design</title>
<para>The dominant theory of design is that you take smart, creative people and money, and produce amazing products. The smarter the people, the better the results. I'm going to claim that theory is bogus, a myth based on magical thinking that treats the "invention" as a product of individual "inventor" minds. As an alternative I'll present the Theory of Heuristic Innovation, which states roughly that we do not invent solutions, we discover them, and that discovery process can be highly automated.</para>

<para>Presenting &Oslash;MQ at the Mix-IT conference in Lyon in early 2012, I was asked several times for the "road-map". My answer was, There is no road-map, and road-maps are bad for several reasons. First, they make promises we can rarely keep, which causes problems for our users. Second, they claim territory and make it harder for others to participate. Lastly, they preempt the thinking process of the community. The audience didn't really like my answer. So un-French. Software engineers don't like the notion that powerful, effective solutions can come into existence without an intelligent designer actively thinking things through. And yet no-one in that room would question evolution. A strange irony, and one I wanted to explore further as it underpins the direction the &Oslash;MQ community has taken over the last year or so.</para>

<para>In the dominant theory, brilliant individuals reflect on large problem sets and then carefully and precisely create a solution. Sometimes they will have "eureka" moments where they "get" brilliantly simple answers to whole large problem sets. The inventor, and the process of invention are rare, precious, and can command a monopoly. History is full of such heroic individuals. We owe them our modern world.</para>

<para>Looking closer, however, the facts don't match. History doesn't show lone inventors. It shows lucky people who steal or claim ownership of ideas that are being worked on by many. It shows brilliant people striking lucky once, and then spending decades on fruitless and pointless quests. The best known large-scale inventors like Thomas Edison were in fact just very good at systematic broad research done by large teams. It's like claiming that Steve Jobs invented every device made by Apple. It is a nice myth, good for marketing, but utterly useless as practical science.</para>

<para>Recent history, much better documented and less easy to manipulate, shows this well. The Internet is surely one of the most innovative and fast-moving areas of technology, and one of the best documented. It has no inventor. Instead it has a massive economy of people who have carefully and progressively solved a long series of immediate problems, documented their answers, and made those available to all. The innovative nature of the Internet comes not from a small, select band of Einsteins. It comes from RFCs anyone can use and improve, made by hundreds and thousands of smart, but not uniquely smart, individuals. It comes from open source software anyone can use and improve. It comes from sharing, scale of community, and the continuous accretion of good solutions and disposal of bad ones.</para>

<para>Here thus is my "Theory of Heuristic Innovation":</para>

<orderedlist>
  <listitem><para>There is an infinite problem/solution terrain.</para></listitem>
  <listitem><para>This terrain changes over time according to external conditions.</para></listitem>
  <listitem><para>We can only accurately perceive problems we are close to.</para></listitem>
  <listitem><para>We can rank the cost/benefit economics of problems using a market for solutions.</para></listitem>
  <listitem><para>There is an optimal solution to any solvable problem.</para></listitem>
  <listitem><para>We can approach this optimal solution heuristically, and mechanically.</para></listitem>
  <listitem><para>Our intelligence can make this process faster but does not replace it.</para></listitem>
</orderedlist>
<para>It's an approximation. Feel free to send me patches. There are a few takeaways from this:</para>

<itemizedlist>
  <listitem><para><emphasis>Individual creativity matters less than process.</emphasis> Smarter people may work faster but they may work in the wrong direction. It's the collective vision of reality that keeps us honest and relevant.</para></listitem>
  <listitem><para><emphasis>We don't need road-maps if we have a good process.</emphasis> Functionality will emerge and evolve over time as solutions compete for market share.</para></listitem>
  <listitem><para><emphasis>We don't invent solutions, so much as discover them.</emphasis> All sympathies to the creative soul. It's just an information processing machine that likes to polish its own ego and collect karma.</para></listitem>
  <listitem><para><emphasis>Intelligence is a social effect, though it feels personal.</emphasis> A person cut-off from others eventually stops thinking. We can neither collect problems nor measure solutions without other people.</para></listitem>
  <listitem><para><emphasis>The size and diversity of the community is a key factor.</emphasis> Larger, more diverse communities collect more relevant problems, and solve them more accurately, and do this faster, than a small expert group.</para></listitem>
</itemizedlist>
<para>People have pointed out that hill-climbing algorithms (what this is, essentially) have known limitations. One gets stuck on local peaks, mainly. But this is nonetheless how life itself works: collecting tiny incremental improvements over long periods of time. There is no intelligent designer. We reduce the risk of local peaks by spreading out widely across the landscape but it is somewhat moot. The limitations aren't optional, they are physical laws. The theory says, <emphasis>this is how innovation really works, so better embrace it and work with it, than try to work on the basis of belief</emphasis>.</para>
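<para>For readers who haven't met hill-climbing as an algorithm, here is a toy version on a one-dimensional landscape. It is only a sketch to illustrate the two points above, getting stuck on a local peak and reducing that risk by starting from many places. The function names <literal>climb</literal> and <literal>climb_from_many</literal> and the representation of the terrain as a vector of heights are my own assumptions.</para>

<screen><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

//  Greedy hill-climbing over a one-dimensional landscape of heights.
//  From 'start', keep stepping to the higher neighbour until neither
//  neighbour is higher. Returns the index of the peak reached.
std::size_t climb (const std::vector <int> &heights, std::size_t start)
{
    assert (start < heights.size ());
    std::size_t pos = start;
    for (;;) {
        std::size_t best = pos;
        if (pos > 0 && heights [pos - 1] > heights [best])
            best = pos - 1;
        if (pos + 1 < heights.size () && heights [pos + 1] > heights [best])
            best = pos + 1;
        if (best == pos)
            return pos;     //  A peak, though perhaps only a local one
        pos = best;
    }
}

//  "Spreading out widely": climb from every given starting point and
//  keep the highest peak that any of the climbers found.
std::size_t climb_from_many (const std::vector <int> &heights,
                             const std::vector <std::size_t> &starts)
{
    assert (!starts.empty ());
    std::size_t best = climb (heights, starts [0]);
    for (std::size_t start : starts) {
        std::size_t peak = climb (heights, start);
        if (heights [peak] > heights [best])
            best = peak;
    }
    return best;
}
```
]]></screen>

<para>On the landscape 1, 3, 2, 1, 4, 9, 5, a single climber starting at the left edge stops on the small peak of height 3 and never sees the 9. Add a second climber starting in the valley and the group finds it. No climber got smarter; there were just more of them, in more places.</para>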

<para>And in fact once you see all innovation as more or less successful hill-climbing, you realize why some teams and companies and products get stuck in a never-never land of diminishing prospects. They simply don't have the diversity and collective intelligence to find better hills to climb. When Nokia killed their open source projects, they cut their own throat.</para>

</sect1>
<sect1>
<title>Burnout</title>
<para>The &Oslash;MQ community has been and still is heavily dependent on pro-bono individual efforts. I'd like to think that everyone was compensated in some way for their contributions, and I believe that with &Oslash;MQ, contributing means gaining expertise in an extraordinarily valuable technology, which means improved professional options.</para>

<para>However not all projects will be so lucky and if you work with or in open source you should understand the risk of burnout that volunteers face. This applies to all pro-bono communities. In this section I'll explain what causes burnout, how to recognize it, how to prevent it, and (if it happens) how to try to treat it. Disclaimer: I'm not a psychiatrist and this article is based on my own experiences of working in pro-bono contexts for the last 20 years, including free software projects, and NGOs such as the <ulink url="http://www.ffii.org">FFII</ulink>.</para>

<para>In a pro-bono context we're expected to work without direct or obvious economic incentive. That is, we sacrifice family life, professional advancement, free time, and health in order to accomplish some goal we have decided to accomplish. In any project, we need some kind of reward to make it worth continuing each day. In most pro-bono projects the rewards are very indirect, superficially not economical at all. Mostly, we do things because people say, "hey, great!" Karma is a powerful motivator.</para>

<para>However, we are economic beings, and sooner or later, if a project costs us a great deal and does not bring economic rewards of some kind (money, fame, a new job,...) we start to suffer. At a certain stage it seems our subconscious simply gets disgusted, and says, "enough is enough!" and refuses to go any further. If we try to force ourselves, we can literally get sick.</para>

<para>This is what I call "burnout", though the term is also used for other kinds of exhaustion. Too much investment on a project, with too little economic reward, for too long. We are great at manipulating ourselves, and others, and this is often part of the process that leads to burnout. We tell ourselves that it's for a good cause, that the other guy is doing OK, so we should be able to as well.</para>

<para>When I got burnt out on open source projects like Xitami, I remember clearly how I felt. I simply stopped working on it, refused to answer any more emails, and told people to forget about it. You can tell when someone's burnt out. They go off-line, and everyone starts saying, "he's acting strange... depressed, or tired..."</para>

<para>Diagnosis is simple. Has someone worked a lot on a project that was not paying back in any way? Did she make exceptional sacrifices? Did he lose or abandon his job or studies to do the project? If you're answering "yes", it's burnout.</para>

<para>There are three simple techniques I've developed over the years to reduce the risk of burnout in the teams I work with:</para>

<itemizedlist>
  <listitem><para><emphasis>No-one is irreplaceable.</emphasis> Working solo on a critical or popular project -- the concentration of responsibility on one person who cannot set their own limits -- is probably the main factor. It's a management truism: if someone in your organization is irreplaceable, get rid of him or her.</para></listitem>
  <listitem><para><emphasis>We need day jobs to pay the bills.</emphasis> This can be hard but seems necessary. Getting money from somewhere else makes it much easier to sustain a sacrificial project.</para></listitem>
  <listitem><para><emphasis>Teach people about burnout.</emphasis> This should IMO be a basic course in colleges and universities, as pro-bono work becomes a more common way for young people to experiment professionally.</para></listitem>
</itemizedlist>
<para>When someone is working alone on a critical project, you <emphasis>know</emphasis> they are going to blow their fuses sooner or later. It's actually fairly predictable: something like 18-36 months depending on the individual and how much economic stress they face in their private lives. I've not seen anyone burn out after half a year, nor last five years in an unrewarding project.</para>

<para>There is a simple cure for burnout which works in at least some cases: get paid decently for your work. However, this pretty much destroys the freedom of movement (across that infinite problem landscape) that the volunteer enjoys.</para>

</sect1>
<sect1>
<title>Patterns for The Game</title>
<para>I'll end this code-free chapter with a series of patterns for success in software engineering. They aim to capture the essence of what divides glorious success from tragic failure. They were described as "religious maniacal dogma" by a manager, and "anything else would be effing insane" by a colleague, in a single day. For me, they are science. But treat the Lazy Perfectionist and others as tools to use, sharpen, and throw away if something better comes along.</para>

<sect2>
<title>The Lazy Perfectionist</title>
<para><emphasis>Never design anything that's not a precise minimal answer to a problem we can identify and have to solve.</emphasis></para>

<para>The Lazy Perfectionist spends his idle time observing others and identifying problems that are worth solving. He looks for agreement on those problems, always asking, "what is the <emphasis>real</emphasis> problem". Then he moves, precisely and minimally, to build, or get others to build, a usable answer to one problem. He uses, or gets others to use, those solutions. And he repeats this until there are no problems left to solve, or time or money runs out.</para>

</sect2>
<sect2>
<title>The Benevolent Tyrant</title>
<para><emphasis>The control of a large force is the same principle as the control of a few men: it is merely a question of dividing up their numbers.</emphasis> -- Sun Tzu</para>

<para>The Benevolent Tyrant divides large problems into smaller ones and throws them at groups to focus on. He brokers contracts between these groups, in the form of APIs and unprotocols. The Benevolent Tyrant constructs a supply chain that starts with problems, and results in usable solutions. He is ruthless about how the supply chain works, but does not tell people what to work on, nor how to do their work.</para>

</sect2>
<sect2>
<title>The Earth and Sky</title>
<para><emphasis>The ideal team consists of two sides: one writing code, and one providing feedback.</emphasis></para>

<para>The Earth and Sky work together as a whole, in close proximity, but they communicate formally through an issue tracker. Sky seeks out problems, from others and from their own use of the product, and feeds these to Earth. Earth rapidly answers with testable solutions. Earth and Sky can work through dozens of issues in a day. Sky talks to other users, and Earth talks to other developers. Earth and Sky may be two people, or two small groups.</para>

</sect2>
<sect2>
<title>The Open Door</title>
<para><emphasis>The accuracy of knowledge comes from diversity.</emphasis></para>

<para>The Open Door accepts contributions from almost anyone. He does not argue quality or direction, instead allowing others to argue that and so get more engaged. He calculates that even a troll will bring more diverse opinion to the group. He lets the group form its opinion about what goes into stable code, and he enforces this opinion with the help of a Benevolent Tyrant.</para>

</sect2>
<sect2>
<title>The Laughing Clown</title>
<para><emphasis>Perfection precludes participation.</emphasis></para>

<para>The Laughing Clown, often acting as the Happy Failure, makes no claim to high competence. Instead his antics and bumbling attempts provoke others into rescuing him from his own tragedy. Somehow, however, he always identifies the right problems to solve. People are so busy proving him wrong they don't realize they're doing valuable work.</para>

</sect2>
<sect2>
<title>The Mindful General</title>
<para><emphasis>Make no plans. Set goals, develop strategies and tactics.</emphasis></para>

<para>The Mindful General operates in unknown territory, solving problems that are hidden until they are nearby. Thus he makes no plans, but seeks opportunities, then exploits them rapidly and accurately. He develops tactics and strategies in the field, and teaches these to his men so they can move independently, and together.</para>

</sect2>
<sect2>
<title>The Social Engineer</title>
<para><emphasis>If you know the enemy and know yourself, you need not fear the result of a hundred battles.</emphasis> -- Sun Tzu</para>

<para>The Social Engineer reads the hearts and minds of those he works with and for. He asks, of everyone, "what makes this person angry, insecure, argumentative, calm, happy?" He studies their moods and dispositions. With this knowledge he can encourage those who are useful, and discourage those who are not. The Social Engineer never acts on his own emotions.</para>

</sect2>
<sect2>
<title>The Constant Gardener</title>
<para><emphasis>He will win whose army is animated by the same spirit throughout all its ranks.</emphasis> -- Sun Tzu</para>

<para>The Constant Gardener grows a process from a small seed, step by step as more people come into the project. He makes every change for a precise reason, with agreement from everyone. He never imposes a process from above but lets others come to consensus, then he enforces that consensus. In this way everyone owns the process together and by owning it, they are attached to it.</para>

</sect2>
<sect2>
<title>The Rolling Stone</title>
<para><emphasis>After crossing a river, you should get far away from it.</emphasis> -- Sun Tzu</para>

<para>The Rolling Stone accepts his own mortality and transience. He has no attachment to his past work. He accepts that all that we make is destined for the trash can; it is just a matter of time. With precise, minimal investments, he can move rapidly away from the past and stay focused on the present and near future. Above all he has no ego and no pride to be hurt by the actions of others.</para>

</sect2>
<sect2>
<title>The Pirate Gang</title>
<para><emphasis>Code, like all knowledge, works best as collective -- not private -- property.</emphasis></para>

<para>The Pirate Gang organizes freely around problems. It accepts authority insofar as authority provides goals and resources. The Pirate Gang owns and shares all it makes: every work is fully remixable by others in the Pirate Gang. The gang moves rapidly as new problems emerge, and is quick to abandon old solutions if those stop being relevant. No persons or groups can monopolize any part of the supply chain.</para>

</sect2>
<sect2>
<title>The Flash Mob</title>
<para><emphasis>Water shapes its course according to the nature of the ground over which it flows.</emphasis> -- Sun Tzu</para>

<para>The Flash Mob comes together in space and time as needed, then disperses as soon as it can. Physical closeness is essential for high-bandwidth communications. But over time it creates technical ghettos, where Earth gets separated from Sky. The Flash Mob tends to collect a lot of frequent flier miles.</para>

</sect2>
<sect2>
<title>The Canary Watcher</title>
<para><emphasis>Pain is not, generally, a Good Sign.</emphasis></para>

<para>The Canary Watcher measures the quality of an organization by his own pain level, and the observed pain levels of those he works with. He brings new participants into existing organizations so they can express the raw pain of the innocent. He may use alcohol to get others to verbalize their pain points. He asks others, and himself, "are you happy in this process, and if not, why not?" When an organization causes pain in himself or others, he treats that as a problem to be fixed. People should feel joy in their work.</para>

</sect2>
<sect2>
<title>The Hangman</title>
<para><emphasis>Never interrupt others when they are making mistakes.</emphasis></para>

<para>The Hangman knows that we learn only by making mistakes, and he gives others copious rope with which to learn. He only pulls the rope gently, when it's time. A little tug to remind the other of their precarious position. Allowing others to learn by failure gives good people a reason to stay, and bad ones an excuse to leave. The Hangman is endlessly patient, because there is no shortcut to the learning process.</para>

</sect2>
<sect2>
<title>The Historian</title>
<para><emphasis>Keeping the public record may be tedious, but it's the only way to prevent collusion.</emphasis></para>

<para>The Historian forces discussion into the public view, to prevent collusion to own areas of work. The Pirate Gang depends on full and equal communications that do not depend on momentary presence. No-one really reads the archives, but the simple possibility stops most abuses. The Historian encourages the right tool for the job: email for transient discussions, IRC for chatter, wikis for knowledge, issue tracking for recording opportunities.</para>

</sect2>
<sect2>
<title>The Provocateur</title>
<para><emphasis>When a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully.</emphasis> -- Samuel Johnson</para>

<para>The Provocateur creates deadlines, enemies, and the occasional impossibility. Teams work best when they don't have time for the crap. Deadlines bring people together and focus the collective mind. An external enemy can move a passive team into action. The Provocateur never takes the deadline too seriously. The product is <emphasis>always</emphasis> ready to ship. But he gently reminds the team of the stakes: fail, and we all look for other jobs.</para>

</sect2>
<sect2>
<title>The Mystic</title>
<para><emphasis>When people argue or complain, just write them a Sun Tzu quotation</emphasis> -- Mikko Koppanen</para>

<para>The Mystic never argues directly. He knows that to argue with an emotional person only creates more emotion. Instead he side-steps the discussion. It's hard to be angry at a Chinese general, especially when he has been dead for 2,400 years. The Mystic plays Hangman when people insist on the right to get it wrong.</para>

</sect2>
</sect1>
</chapter>
<chapter id="moving-pieces">
<title>A Universe of Moving Pieces</title>
<para>So far in this book I've aimed to take you through a journey of understanding &Oslash;MQ in its many aspects. By now you may have started to build your own products using the techniques I explained, and others you've figured out yourself. You will start to face questions about how to make these products work in the real world.</para>

<para>But what is that "real world"? My view is that it is becoming a world of ever increasing numbers of moving pieces. Some people use the phrase the "Internet of Things", suggesting that we'll see a new category of devices that are more numerous but rather stupider than our current smart phones and tablets and laptops and servers. However, I don't think the data points this way at all. Yes, more and more devices, but they're not stupid at all. They're smart and powerful and getting more so all the time.</para>

<para>The mechanism at work is something I call "Cost Gravity" and it has the effect of cutting the cost of technology by half every 18-24 months. Or, put another way, our global computing capacity doubles every two years, over and over and over. The future is filled with trillions of devices that are fully powerful multi-core computers: they don't run some cut-down "operating system for things" but full operating systems and full applications.</para>
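<para>As a rough check on that arithmetic, here is a minimal Python sketch of the doubling claim. The function name <literal>growth_factor</literal> is purely illustrative, and the model assumes perfectly steady doubling, which real markets only approximate:</para>

<programlisting language="python"><![CDATA[
def growth_factor(years, doubling_months):
    """Capacity multiplier after `years`, assuming one doubling per period."""
    return 2 ** (years * 12 / doubling_months)

# One decade at the slow and the fast end of the 18-24 month range
print(growth_factor(10, 24))         # five doublings
print(round(growth_factor(10, 18)))  # six and two-thirds doublings
]]></programlisting>

<para>Under these assumptions a decade buys somewhere between a 32-fold and a roughly 100-fold increase.</para>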

<para>And this is the world we're aiming at with &Oslash;MQ. When we talk of "scale" we don't mean hundreds of computers, or even thousands. Think of clouds of tiny smart and perhaps self-replicating machines surrounding every person, filling every space, covering every wall, filling the cracks and eventually, becoming so much a part of us that we get them before birth and they follow us to death.</para>

<para>These clouds of tiny machines talk to each other, all the time, over short-range wireless links, using the Internet Protocol. They create mesh networks, pass information and tasks around like nervous signals. They augment our memory, vision, every aspect of our communications, and physical functions. And it's &Oslash;MQ that powers their conversations and events and exchanges of work and information.</para>

<para>Now, to make even a thin imitation of this come true today, we need to solve a set of technical problems (how do peers discover each other, how do they talk to existing networks like the Web, how do they protect the information they carry, how do we track and monitor them, to get some idea of what they're doing). Then we need to do what most engineers forget about: package this solution into a framework that is dead easy for ordinary developers to use.</para>

<para>This is what we'll attempt in this chapter: to build a framework for distributed applications, as an API, protocols, and implementations. It's not a small challenge but I've claimed often that &Oslash;MQ makes such problems simple, so let's see if that's still true.</para>

<para>We'll cover:</para>

<itemizedlist>
  <listitem><para>Building a basic framework for distributed computing.</para></listitem>
</itemizedlist>
<sect1>
<title>Design for The Real World</title>
<para>Whether we're connecting a roomful of mobile devices over WiFi, or a cluster of virtual boxes over simulated Ethernet, we will hit the same kinds of problems. These are:</para>

<itemizedlist>
  <listitem><para><emphasis>Discovery</emphasis> - how do we learn about other nodes on the network? Do we use a discovery service, centralized mediation, or some kind of broadcast beacon?</para></listitem>
  <listitem><para><emphasis>Presence</emphasis> - how do we track when other nodes come and go? Do we use some kind of central registration service, or heartbeating or beacons?</para></listitem>
  <listitem><para><emphasis>Connectivity</emphasis> - how do we actually connect one node to another? Do we use local networking, wide-area networking, or do we use a central message broker to do the forwarding?</para></listitem>
  <listitem><para><emphasis>Point-to-point messaging</emphasis> - how do we send a message from one node to another? Do we send this to the node's network address, or do we use some indirect addressing via a centralized message broker?</para></listitem>
  <listitem><para><emphasis>Group messaging</emphasis> - how do we send a message from one node to a group of others? Do we work via a centralized message broker, or do we use a publish-subscribe model like &Oslash;MQ?</para></listitem>
  <listitem><para><emphasis>Testing and simulation</emphasis> - how do we simulate large numbers of nodes so we can test performance properly? Do we have to buy two dozen Android tablets, or can we use pure software simulation?</para></listitem>
  <listitem><para><emphasis>Distributed Logging</emphasis> - how do we track what this cloud of nodes is doing so we can detect performance problems and failures? Do we create a main logging service, or do we allow every device to log the world around it?</para></listitem>
  <listitem><para><emphasis>File transfer</emphasis> - how do we send files from one node to another? Do we use server-centric protocols like FTP or HTTP, or do we use decentralized protocols like FILEMQ?</para></listitem>
  <listitem><para><emphasis>State synchronization</emphasis> - how do we ensure that many nodes receive the same unique stream of events? Do we send events via a single central point, or do we use a distributed eventually consistent model?</para></listitem>
  <listitem><para><emphasis>Security</emphasis> - how do we protect the confidentiality of information, and make sure people are who they claim to be? Do we use a centralized trust network, or do we use some kind of distributed key management?</para></listitem>
  <listitem><para><emphasis>Bridging</emphasis> - how do we connect our networks across the Internet? Do we use the cloud as our central point for all messaging or do we create bridges that join groups to other groups?</para></listitem>
</itemizedlist>
<para>If we can solve these eleven problems reasonably well, we get something like a framework for what I might call "Really Cool Distributed Applications", or as my grandkids call it, "the software our world runs on".</para>

<para>You should have guessed from my rhetorical questions that there are two broad directions we can go. One is to centralize everything. The other is to distribute everything. I'm going to bet on decentralization. If you want centralization, you don't really need &Oslash;MQ; there are other options you can use.</para>

<para>So very roughly, here's the story. One, the number of moving pieces increases exponentially over time (doubles every 24 months). Two, these pieces stop using wires since dragging cables everywhere gets <emphasis>really</emphasis> boring. Three, future applications run across clusters of these pieces using the Benevolent Tyrant pattern from The &Oslash;MQ Community<xref linkend="the-community"/>. Four, today it's really difficult, nay still rather impossible, to build such applications. Five, let's make it cheap and easy using all the techniques and tools we've built up. Six, partay!</para>

</sect1>
<sect1>
<title>The Secret Life of WiFi</title>
<para>The future is clearly wireless, and while many big businesses live by concentrating data in their clouds, the future doesn't look quite so centralized. The devices at the edges of our networks get smarter every year, not dumber. They're hungry for work and information to digest and profit from. And they don't drag cables around, except once a night for power. It's all wireless, and more and more, 802.11-branded WiFi of different alphabetical flavors.</para>

<sect2>
<title>Why Mesh isn't Here Yet</title>
<para>As such a vital part of our future, WiFi has a big problem that's not often discussed but which anyone betting on it needs to be aware of. The phone companies of the world have built themselves nice profitable mobile phone cartels in nearly every country with a functioning government, based on convincing governments that without monopoly rights to airwaves and ideas, the world would fall apart. Technically, we call this "regulatory capture" and "patents", but in fact it's just a form of blackmail and corruption. If you, the state, give me, a business, the right to overcharge and tax the market, and ban all real competitors, I'll give you 5%. Not enough? How about 10%? OK, 15% plus snacks. If you refuse, we pull service.</para>

<para>But WiFi snuck past this, borrowing unlicensed airspace and riding on the back of the open and unpatented and remarkably innovative Internet protocol stack. So today we have the curious situation where it costs me several Euro a minute to call from Seoul to Brussels if I use the state-backed infrastructure that we've subsidized over decades, but nothing at all if I can find an unregulated WiFi access point. Oh, and I can do video, and send files, and photos, and download entire home movies all for the same amazing price point of precisely zero point zero zero (in any currency you like). God help me if I try to send just one photo home using the service I actually pay for. That would cost me more than the camera I took it on.</para>

<para>It is the price we pay for having tolerated the "trust us, we're the experts" patent system for so long. But more than that, it's a massive economic incentive to chunks of the technology sector -- and especially chipset makers who own patents on the anti-Internet GSM, GPRS, 3G, and LTE stacks, and who treat the telcos as prime clients -- to actively throttle WiFi development. And of course it's these firms that bulk out the IEEE committees that define WiFi.</para>

<para>The reason for this rant against lawyer-driven "innovation" is to steer your thinking into "what if WiFi was really free?" Because this will happen one day, not too far off, and it's worth betting on. We'll see several things happen. First, much more aggressive use of airspace especially for near-distance communications where there is no risk of interference. Secondly, big capacity improvements as we learn to use more airspace in parallel. Thirdly, acceleration of the standardization process. Lastly, broader support in devices for really interesting connectivity.</para>

<para>Right now streaming a movie from your phone to your TV is considered "leading edge". This is ridiculous. Let's get truly ambitious. How about a stadium of people watching a game, sharing photos and HD video with each other in real time, creating an ad-hoc event that literally saturates the airspace with a digital frenzy? I should be able to collect terabytes of imagery from those around me, in an hour. Why does this have to go through Twitter or Facebook and that tiny expensive mobile data connection? How about a home with hundreds of devices all talking to each other over mesh, so when someone rings the doorbell, the porch lights stream video through to your phone or TV? How about a car that can talk to your phone and play your dubstep playlist <emphasis>without you plugging in wires</emphasis>?</para>

<para>Why, in 2012, and to get more serious, is our digital society in the hands of central points that are monitored, censored, logged, used to track who we talk to, collect evidence against us, and then shut down when the authorities decide we have too much free speech? The loss of privacy we're living through is only a problem when it's one-sided, but then the problem is calamitous. A truly wireless world would bypass all central censorship. It's how the Internet was designed, and it's quite feasible. Technically.</para>

</sect2>
<sect2>
<title>Some Physics</title>
<para>Naive developers of distributed software treat the network as infinitely fast and perfectly reliable. While this is approximately true for simple applications over Ethernet, WiFi rapidly proves the difference between magical thinking and science. That is, WiFi breaks so easily and dramatically under stress that I sometimes wonder how anyone would dare use it for real work. The ceiling moves up, as WiFi gets better, but never fast enough to stop us hitting it.</para>

<para>To understand how WiFi performs technically you need to understand a basic law of physics: the power required to connect two points increases according to the square of the distance. People who grow up in larger houses have exponentially louder voices, as I learned in Dallas. For a WiFi network this means that as two radios get further apart, they have to either use more power, or lower their signal rate.</para>

<para>There's only so much power you can pull out of a battery before users treat the device as hopelessly broken. So though a WiFi network may be rated at some speed, the real bit rate between the access point (AP) and a client depends on how far apart the two are. As you move your WiFi-enabled phone away from the AP, the two radios trying to talk to each other will first increase their power but then reduce their bit rate.</para>
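<para>To put numbers on the square law, here is a small Python sketch. It assumes idealized free-space propagation with no walls or interference, and the function names are illustrative only:</para>

<programlisting language="python"><![CDATA[
import math

def relative_power(distance, reference=1.0):
    """Power needed at `distance`, relative to the power needed at `reference`."""
    return (distance / reference) ** 2

def relative_power_db(distance, reference=1.0):
    """The same ratio expressed in decibels."""
    return 10 * math.log10(relative_power(distance, reference))

# Distances are in arbitrary units; only the ratio matters
for d in (1, 2, 4, 10):
    print(d, relative_power(d), round(relative_power_db(d), 1))
]]></programlisting>

<para>In this model, doubling the distance costs four times the power (about 6 dB), and ten times the distance costs a hundred times the power (20 dB). A radio that cannot afford that has to drop its bit rate instead.</para>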

<para>This effect has some consequences we need to be aware of if we want to build robust distributed applications that don't dangle wires behind them like puppets:</para>

<itemizedlist>
  <listitem><para>If you have a group of devices talking to an AP, when the AP is talking to the slowest device, the <emphasis>whole network has to wait</emphasis>. It's like having to repeat a joke at a party to the designated driver, who has no sense of humor, is still fully and tragically sober, and has in any case a poor grasp of the language.</para></listitem>
  <listitem><para>If you use unicast TCP and send a message to multiple devices, the AP must send the packets to each device separately. Yes, you knew this, it's also how Ethernet works. But now understand that one distant (or low-powered) device means everything waits for that slowest device to catch up.</para></listitem>
  <listitem><para>If you use multicast or broadcast (which work the same, in most cases), the AP will send single packets to the whole network at once, which is awesome, but it will do it at the slowest possible bit rate (usually 1Mbps). You can adjust this rate manually in some APs. That just reduces the reach of your AP. You can also buy more expensive APs that have a little more intelligence and will figure out the highest bit rate they can safely use. You can also use enterprise APs with IGMP (Internet Group Management Protocol) support and &Oslash;MQ's PGM transport to send only to subscribed clients. I'd not however bet on such APs being widely available, ever.</para></listitem>
</itemizedlist>
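<para>The cost of one slow device can be estimated with a back-of-envelope model. The following Python sketch ignores all protocol overhead, retransmissions, and contention, and the bit rates (54Mbps for nearby clients, 1Mbps for a distant one) are assumed figures chosen for illustration:</para>

<programlisting language="python"><![CDATA[
def airtime(bits, rate_bps):
    """Seconds of air time to send `bits` at `rate_bps`, ignoring overhead."""
    return bits / rate_bps

def unicast_airtime(bits, client_rates_bps):
    """The AP sends one copy per client, each at that client's own bit rate."""
    return sum(airtime(bits, rate) for rate in client_rates_bps)

def multicast_airtime(bits, base_rate_bps=1_000_000):
    """The AP sends a single copy to everyone at the lowest bit rate."""
    return airtime(bits, base_rate_bps)

MESSAGE = 1_000_000            # a one-megabit payload
near = [54_000_000] * 3        # three clients close to the AP
mixed = near + [1_000_000]     # the same three, plus one distant client

print(unicast_airtime(MESSAGE, near))
print(unicast_airtime(MESSAGE, mixed))
print(multicast_airtime(MESSAGE))
]]></programlisting>

<para>With these numbers the single distant client consumes eighteen times as much air time as the three nearby clients together, and multicast at the base rate costs about as much as unicasting to the whole mixed group. The channel is shared, so every other device waits during that time.</para>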
<para>As you try to put more devices onto an AP, performance rapidly gets worse to the point where adding one more device can break the whole network, for everyone. Many APs solve this by randomly disconnecting clients when they reach some limit, four to eight devices for a mobile hotspot, 30-50 devices for a consumer AP, perhaps 100 devices for an enterprise AP.</para>

</sect2>
<sect2>
<title>What's the Current Status?</title>
<para>Despite its uncomfortable role as enterprise technology that somehow escaped into the wild, WiFi is already useful for more than getting a free Skype call. It's not ideal but it works well enough to let us solve some interesting problems. Let me give you a rapid status report.</para>

<para>First, point-to-point versus access point-to-client. Traditional WiFi is all AP-client. Every packet has to go from client A to AP, thence to client B. You cut your bandwidth by 50% but that's only half the problem. I explained the inverse-square law earlier. If A and B are very close together but both far from the AP, they'll both be using a low bit rate. Imagine your AP is in the garage, and you're in the living room trying to stream video from your phone to your TV. Good luck!</para>
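<para>The two-hop penalty can be sketched in a few lines of Python. This is a simplified model that assumes both hops share one channel and ignores overhead; the 54Mbps and 6Mbps link rates are assumed values, not measurements:</para>

<programlisting language="python"><![CDATA[
def relayed_throughput(rate_a_bps, rate_b_bps):
    """End-to-end rate for A -> AP -> B when both hops share one channel.

    Every bit is on the air twice, once per hop, so the per-bit times add up.
    """
    return 1 / (1 / rate_a_bps + 1 / rate_b_bps)

# Both clients right next to the AP, in Mbps
print(relayed_throughput(54e6, 54e6) / 1e6)
# Both clients in the living room, AP in the garage, in Mbps
print(relayed_throughput(6e6, 6e6) / 1e6)
]]></programlisting>

<para>In this model two clients beside the AP get about 27Mbps between them, while two clients far from the AP get about 3Mbps, even if they are sitting next to each other.</para>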

<para>There is an old "ad-hoc" mode that lets A and B talk to each other but it's way too slow for anything fun, and of course, it's disabled on all mobile chipsets. Actually, it's disabled in the top-secret drivers that the chipset makers kindly provide to hardware makers. There is a new "Tunneled Direct Link Setup" (TDLS) protocol that lets two devices create a direct link, using an AP for discovery but not for traffic. And there's a "5G" WiFi standard (it's a marketing term, so goes in quotes) that boosts link speeds to a gigabit. TDLS and 5G together make HD movie streaming from your phone to your TV a plausible reality. I assume TDLS will be restricted in various ways so as to placate the telcos.</para>

<para>Lastly, we saw standardization of the 802.11s mesh protocol in 2012, after a remarkably speedy ten years or so of work. Mesh removes the access point completely, at least in the imaginary future where it exists and is widely used. Devices talk to each other directly, and maintain little routing tables of neighbors that let them forward packets. Imagine the AP software embedded into every device but smart enough (it's not as impressive as it sounds) to do multiple hops.</para>

<para>No-one who is making money from the mobile data extortion racket wants to see 802.11s available, because city-wide mesh is such a nightmare for the bottom line, so it's happening as slowly as possible. The only large organization with the power (and, I assume, the surface-to-surface missiles) to get mesh technology into wide use is the US Army. But mesh will emerge and I'd bet on 802.11s being widely available in consumer electronics by 2020 or so.</para>

<para>Second, if we don't have point-to-point, how far can we trust APs today? Well, if you go to a Starbucks in the USA and try the &Oslash;MQ Hello World example using two laptops connected via the free WiFi, you'll find they cannot connect. Why? Well, the answer is in the name: "attwifi". AT&amp;T is a good old incumbent telco that hates WiFi and presumably provides the service cheaply to Starbucks and others so that independents can't get into the market. But any access point you buy will support client-to-AP-to-client access, and outside the USA I've never found a public AP locked-down the AT&amp;T way.</para>

<para>Third, performance. The AP is clearly a bottleneck; you cannot get better than half of its advertised speed even if you put A and B literally beside the AP. Worse, if there are other APs in the same airspace, they'll shout each other out. In my home, WiFi barely works at all because the neighbors two houses down have an AP which they've amplified. Even on a different channel it interferes with our home WiFi. In the cafe where I'm sitting now there are over a dozen networks. Realistically, as long as we're dependent on AP-based WiFi, we're subject to random interference and unpredictable performance.</para>

<para>Fourth, battery life. There's no inherent reason that WiFi, when idle, is hungrier than Bluetooth, for example. They use the same radios and low-level framing. The main difference is in the tuning and in the protocols. For wireless power-saving to work well, devices have to mostly sleep, and beacon out to other devices only once every so often. For this to work, they need to synchronize their clocks. This happens properly for the mobile phone part, which is why my old flip phone can run five days on a charge. When WiFi is working, it will use more power. Current power amplifier technology is also inefficient, meaning you draw a lot more energy from your battery than you pump into the air (the waste turns into a hot phone). Power amplifiers are improving as people focus more on mobile WiFi.</para>

<para>Lastly, mobile access points. If we can't trust centralized APs, and if our devices are smart enough to run full operating systems, can't we make them work as APs? I'm <emphasis>so glad</emphasis> you asked that question. Yes, we can, and it works quite nicely. Especially since you can switch this on and off in software, on a modern OS like Android. Again, the villains of the peace are the US telcos, who mostly detest this feature and kill it or cripple it on the phones they control. Smarter telcos realize that it's a way to amplify their "last mile" and bring higher-value products to more users, but crooks don't compete on smarts.</para>

</sect2>
<sect2>
<title>Conclusions</title>
<para>WiFi is not Ethernet and although I believe future &Oslash;MQ applications will have a very important decentralized wireless presence, it's not going to be an easy road. Much of the basic reliability and capacity that you expect from Ethernet is missing. When you run a distributed application over WiFi you have to allow for frequent timeouts, random latencies, arbitrary disconnections, whole interfaces going down and coming up, and so on.</para>
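<para>What "allowing for" these failures looks like in code is, at minimum, a timeout and a retry around every exchange. Here is a minimal Python sketch using plain UDP sockets from the standard library rather than &Oslash;MQ. The function name, the timeout, and the retry count are illustrative assumptions, not recommended values:</para>

<programlisting language="python"><![CDATA[
import socket

def request_with_retries(address, payload, timeout=0.5, retries=3):
    """Send `payload` to `address` over UDP and wait for one reply.

    Returns the reply bytes, or None if every attempt failed.
    """
    for _ in range(retries):
        # A fresh socket per attempt, so a late reply to an earlier
        # attempt cannot be mistaken for the answer to this one
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, address)
            reply, _peer = sock.recvfrom(1024)
            return reply
        except OSError:
            # Covers timeouts as well as refused or unreachable peers
            # and interfaces that went away; just try again
            continue
        finally:
            sock.close()
    return None
]]></programlisting>

<para>A caller would use it as <literal>request_with_retries(("192.168.1.1", 9999), b"hello")</literal>, where the address and port are placeholders, and must be ready for a <literal>None</literal> result. Over WiFi that is an ordinary outcome, not an exceptional one.</para>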

<para>The technological evolution of wireless networking is best described as "slow and joyless". Applications and frameworks that try to exploit decentralized wireless are mostly absent or poor. The only existing open source framework for proximity networking is <ulink url="https://www.alljoyn.org">AllJoyn</ulink>, from Qualcomm. But with &Oslash;MQ we proved that the inertia and decrepit incompetence of existing players was no reason for us to sit still. When we accurately understand problems, we can solve them. What we imagine, we can make real.</para>

</sect2>
</sect1>
<sect1>
<title>Discovery</title>
<para>One of the great things about short-range wireless is the proximity. WiFi maps closely to the physical space, which maps closely to how we naturally organize. In fact the Internet is quite abstract and this confuses a lot of people who kind of "get" it but in fact don't really. With WiFi we have technical connectivity that is potentially super-tangible. You see what you get and you get what you see. Tangible means easy to understand and that should mean love from users instead of the typical frustration and quiet hate.</para>

<para>Proximity is the key. We have a bunch of WiFi radios in a room, happily beaconing to each other. For lots of applications it makes sense that they can find each other and start chatting, without any user input. After all, most real world data isn't private, it's just highly localized.</para>

<para>As a first step towards &Oslash;MQ-based proximity networking, let's look at how to do discovery. There exist libraries that do this. I don't like them. They seem too complex and too specific and somehow to date from a prehistoric era before people realized that distributed computing could be <emphasis>fundamentally simple</emphasis>.</para>

<sect2>
<title>Preemptive Discovery over Raw Sockets</title>
<para>I'm in a hotel room in Gangnam, Seoul, with a 4G wireless hotspot, a Linux laptop, and a couple of Android phones. The phones and laptop are talking to the hotspot. The <literal>ifconfig</literal> command says my IP address is 192.168.1.2. Let me try some <literal>ping</literal> commands. DHCP servers tend to dish out addresses in sequence, so my phones are probably close by, numerically speaking:</para>

<screen>$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_req=1 ttl=64 time=376 ms
64 bytes from 192.168.1.1: icmp_req=2 ttl=64 time=358 ms
64 bytes from 192.168.1.1: icmp_req=4 ttl=64 time=167 ms
^C
--- 192.168.1.1 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 2001ms
rtt min/avg/max/mdev = 358.077/367.522/376.967/9.445 ms
</screen>

<para>Found one! 150-300 msec round-trip latency... that's a surprisingly high figure, something to keep in mind for later. Now I ping myself, just to try to double check things:</para>

<screen>$ ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.054 ms
64 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.055 ms
64 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.061 ms
^C
--- 192.168.1.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.054/0.056/0.061/0.009 ms
</screen>

<para>The response time is a bit faster now, which is what we'd expect. Let's try the next couple of addresses:</para>

<screen>$ ping 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 56(84) bytes of data.
64 bytes from 192.168.1.3: icmp_req=1 ttl=64 time=291 ms
64 bytes from 192.168.1.3: icmp_req=2 ttl=64 time=271 ms
64 bytes from 192.168.1.3: icmp_req=3 ttl=64 time=132 ms
^C
--- 192.168.1.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 132.781/231.914/291.851/70.609 ms
</screen>

<para>That's the second phone, with the same kind of latency as the first one. Let's continue, see if there are any other devices connected to the hotspot:</para>

<screen>$ ping 192.168.1.4
PING 192.168.1.4 (192.168.1.4) 56(84) bytes of data.
^C
--- 192.168.1.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2016ms
</screen>

<para>And that is it. Now, <literal>ping</literal> uses raw IP sockets to send ICMP_ECHO messages. The useful thing about ICMP_ECHO is that it gets a response from any IP stack that has not deliberately had echo switched off. That's still a common practice on corporate websites that fear the old "ping of death" exploit, where malformed messages could crash the machine.</para>

<para>I call this <emphasis>preemptive discovery</emphasis> since it doesn't take any cooperation from the device. As long as the phones are not actively ignoring us, we can see them sitting there.</para>

<para>You might ask why this is useful. We don't know that the peers responding to ICMP_ECHO run &Oslash;MQ, that they are interested in talking to us, that they have any services we can use, or even what kind of device they are. However, knowing that there's <emphasis>something</emphasis> on address 192.168.1.3 is already useful. We also know how far away the device is, relatively, we know how many devices are on the network, and we know the rough state of the network (as in, good, poor, or terrible).</para>

<para>It isn't even hard to create ICMP_ECHO messages and send them. A few dozen lines of code, and we could use &Oslash;MQ multithreading to do this in parallel for addresses stretching out above and below our own IP address. Could be kind of fun.</para>
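
<para>To give a feel for how little is involved, here is a minimal sketch of building an echo request in a buffer. It only formats the 8-byte ICMP header and fills in the Internet checksum; it does not open the raw socket or send anything. The names <literal>s_inet_checksum</literal> and <literal>s_icmp_echo_build</literal> are my own inventions for this illustration, not part of any library:</para>

<programlisting language="c">
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define ECHO_REQUEST_TYPE   8   //  ICMP type for an echo request
#define ECHO_HEADER_SIZE    8   //  Type, code, checksum, id, sequence

//  Internet checksum (RFC 1071): one's-complement sum of 16-bit words
static uint16_t
s_inet_checksum (const uint8_t *data, size_t length)
{
    uint32_t sum = 0;
    size_t index;
    for (index = 0; index + 1 &lt; length; index += 2)
        sum += (uint32_t) (data [index] &lt;&lt; 8 | data [index + 1]);
    if (length &amp; 1)
        sum += (uint32_t) data [length - 1] &lt;&lt; 8;   //  Pad odd byte
    while (sum &gt;&gt; 16)
        sum = (sum &amp; 0xFFFF) + (sum &gt;&gt; 16);      //  Fold carries
    return (uint16_t) ~sum;
}

//  Writes an echo request header into buffer; returns its length,
//  or 0 if the buffer is too small
static size_t
s_icmp_echo_build (uint8_t *buffer, size_t capacity,
                   uint16_t id, uint16_t sequence)
{
    if (capacity &lt; ECHO_HEADER_SIZE)
        return 0;
    memset (buffer, 0, ECHO_HEADER_SIZE);
    buffer [0] = ECHO_REQUEST_TYPE;
    buffer [1] = 0;                         //  Code is always zero
    buffer [4] = (uint8_t) (id &gt;&gt; 8);
    buffer [5] = (uint8_t) (id &amp; 0xFF);
    buffer [6] = (uint8_t) (sequence &gt;&gt; 8);
    buffer [7] = (uint8_t) (sequence &amp; 0xFF);

    //  Checksum is computed with the checksum field set to zero
    uint16_t checksum = s_inet_checksum (buffer, ECHO_HEADER_SIZE);
    buffer [2] = (uint8_t) (checksum &gt;&gt; 8);
    buffer [3] = (uint8_t) (checksum &amp; 0xFF);
    return ECHO_HEADER_SIZE;
}
</programlisting>

<para>A receiver can verify such a packet by running the same checksum over the whole thing, checksum field included; a valid packet gives zero.</para>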

<para>However, sadly, there's a fatal flaw in my idea of using ICMP_ECHO to discover devices. To open a raw IP socket requires root privileges on a POSIX box. This restriction stops rogue programs getting data meant for others. We can get the power to open raw sockets on Linux by giving sudo privileges to our command (ping has the so-called 'setuid bit' set). On a mobile OS like Android, it requires root access, i.e. rooting the phone or tablet. That's out of the question for most people and so ICMP_ECHO is out of reach for most devices.</para>

<para><emphasis>Expletive deleted!</emphasis> Let's try something in user space. The next step most people take is UDP multicast or broadcast. Let's follow that trail.</para>

</sect2>
<sect2>
<title>Cooperative Discovery using UDP Broadcasts</title>
<para>Multicast tends to be seen as more modern and "better" than broadcast. In IPv6, broadcast doesn't work at all: you always have to use multicast. Nonetheless, all IPv4 local network discovery protocols end up using UDP broadcast anyhow. The reasons: broadcast and multicast end up working much the same, except broadcast is simpler and less risky. Multicast is seen by network admins as kind of dangerous, as it can leak over network segments.</para>

<para>If you've never used UDP, you'll discover it's quite a nice protocol. In some ways it reminds us of &Oslash;MQ, sending whole messages to peers using two different patterns: one-to-one, and one-to-many. The main problems with UDP are that (a) the POSIX socket API was designed for universal flexibility not simplicity, (b) UDP messages are limited for practical purposes to about 512 bytes, and (c) when you start to use UDP for real data, you find that a lot of messages get dropped, especially as infrastructure tends to favor TCP over UDP.</para>

<para>Here is a minimal ping program that uses UDP instead of ICMP_ECHO:</para>

<example id="udpping1-php">
<title>UDP discovery, model 1 (udpping1.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>This code uses a single socket to broadcast 1-byte messages and receive anything that other nodes are broadcasting. When I run it, it shows just one node, which is itself:</para>

<screen>Pinging peers...
Found peer 192.168.1.2:9999
Pinging peers...
Found peer 192.168.1.2:9999
</screen>

<para>If I switch off all networking and try again, sending a message fails, as I'd expect:</para>

<screen>Pinging peers...
sendto: Network is unreachable
</screen>
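
<para>For readers who have not touched the UDP socket API, the core of such a program looks roughly like the sketch below. This is not the udpping1 listing itself, only an illustration under my own assumptions: the helper names <literal>s_broadcast_socket</literal> and <literal>s_broadcast_ping</literal> are invented, and the one-byte payload is arbitrary. It opens one socket that is allowed to broadcast, binds it so it can also receive, and sends a single byte to 255.255.255.255:</para>

<programlisting language="c">
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;netinet/in.h&gt;
#include &lt;arpa/inet.h&gt;

//  Opens a UDP socket that may broadcast and that listens on port_nbr.
//  Returns the socket handle, or -1 on failure with errno set.
static int
s_broadcast_socket (int port_nbr)
{
    int handle = socket (AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (handle == -1)
        return -1;

    //  Ask the OS to let us send to a broadcast address
    int on = 1;
    if (setsockopt (handle, SOL_SOCKET, SO_BROADCAST,
                    &amp;on, sizeof (on)) == -1) {
        close (handle);
        return -1;
    }
    //  Listen on all interfaces for beacons sent to our port
    struct sockaddr_in address;
    memset (&amp;address, 0, sizeof (address));
    address.sin_family = AF_INET;
    address.sin_port = htons ((uint16_t) port_nbr);
    address.sin_addr.s_addr = htonl (INADDR_ANY);
    if (bind (handle, (struct sockaddr *) &amp;address,
              sizeof (address)) == -1) {
        close (handle);
        return -1;
    }
    return handle;
}

//  Broadcasts a single byte; returns bytes sent, or -1 with errno set
//  (for example ENETUNREACH when all networking is switched off)
static ssize_t
s_broadcast_ping (int handle, int port_nbr)
{
    struct sockaddr_in target;
    memset (&amp;target, 0, sizeof (target));
    target.sin_family = AF_INET;
    target.sin_port = htons ((uint16_t) port_nbr);
    target.sin_addr.s_addr = htonl (INADDR_BROADCAST);
    char ping = '!';
    return sendto (handle, &amp;ping, 1, 0,
                   (struct sockaddr *) &amp;target, sizeof (target));
}
</programlisting>

<para>A caller would open the socket once with the agreed port, then alternate between <literal>s_broadcast_ping</literal> and <literal>recvfrom</literal> on the same handle.</para>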

<para>Working on the basis of <emphasis>solve the problems currently aiming at your throat</emphasis>, let's fix the most urgent issues in this first model. These issues are:</para>

<itemizedlist>
  <listitem><para>Using the 255.255.255.255 broadcast address is a bit dubious. On the one hand, this broadcast address means precisely "send to all nodes on the local network, and don't forward". However, if you have several interfaces (wired Ethernet, WiFi) then broadcasts will go out on your default route only, and via just one interface. What we want to do is either send our broadcast on each interface's broadcast address, or find the WiFi interface and its broadcast address.</para></listitem>
  <listitem><para>Like many aspects of socket programming, getting information on network interfaces is not portable. Do we want to write non-portable code in our applications? No, this is better hidden in a library.</para></listitem>
  <listitem><para>There's no handling for errors except "abort", which is too brutal for transient problems like "your WiFi is switched off". The code should distinguish between soft errors (ignore and retry) and hard errors (assert).</para></listitem>
  <listitem><para>The code needs to know its own IP address and ignore beacons that it sent out. Like finding the broadcast address, this requires inspecting the available interfaces.</para></listitem>
</itemizedlist>
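<para>On the third point, the split between soft and hard errors can be as small as one function. The following is a sketch only: the name <literal>s_is_soft_error</literal> is hypothetical, and the set of errno values I treat as transient is my own guess at a reasonable starting list, not something taken from the example code:</para>

<programlisting language="c">
#include &lt;errno.h&gt;
#include &lt;stdbool.h&gt;

//  Returns true if a failed send or receive is worth retrying later,
//  false if the caller should treat it as a programming error
static bool
s_is_soft_error (int error)
{
    switch (error) {
        case EAGAIN:
#if defined (EWOULDBLOCK) &amp;&amp; EWOULDBLOCK != EAGAIN
        case EWOULDBLOCK:
#endif
        case EINTR:
        case ENETDOWN:
        case ENETUNREACH:       //  "Network is unreachable"
        case EHOSTUNREACH:
            return true;
        default:
            return false;
    }
}
</programlisting>

<para>After a failed <literal>sendto</literal>, the caller would check <literal>s_is_soft_error (errno)</literal> and either skip this beacon and try again on the next cycle, or assert.</para>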
<para>The simplest answer to these issues is to push the UDP code into a separate library that provides a clean API, like this:</para>

<programlisting language="c">
//  Constructor
static udp_t *
    udp_new (int port_nbr);

//  Destructor
static void
    udp_destroy (udp_t **self_p);

//  Returns UDP socket handle
static int
    udp_handle (udp_t *self);

//  Send message using UDP broadcast
static void
    udp_send (udp_t *self, byte *buffer, size_t length);

//  Receive message from UDP broadcast
static ssize_t
    udp_recv (udp_t *self, byte *buffer, size_t length);
</programlisting>
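
<para>To show the kind of non-portable code such a library has to hide, here is a sketch of finding each interface's broadcast address with <literal>getifaddrs</literal>, which is available on Linux, the BSDs, and OS X but is not part of POSIX. This is my own illustration, not the udplib code, and the function name <literal>s_list_broadcast_interfaces</literal> is invented:</para>

<programlisting language="c">
#include &lt;stdio.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;netinet/in.h&gt;
#include &lt;arpa/inet.h&gt;
#include &lt;net/if.h&gt;
#include &lt;ifaddrs.h&gt;

//  Prints each IPv4 interface that is up and supports broadcast, with
//  its own address and its broadcast address. Returns the number of
//  interfaces found, or -1 if the interface list could not be read.
static int
s_list_broadcast_interfaces (void)
{
    struct ifaddrs *interfaces;
    if (getifaddrs (&amp;interfaces) == -1)
        return -1;

    int count = 0;
    struct ifaddrs *ifa;
    for (ifa = interfaces; ifa; ifa = ifa-&gt;ifa_next) {
        //  Skip entries with no address, non-IPv4, down, or unable to
        //  broadcast (which also excludes the loopback interface)
        if (!ifa-&gt;ifa_addr
        ||  ifa-&gt;ifa_addr-&gt;sa_family != AF_INET
        ||  !(ifa-&gt;ifa_flags &amp; IFF_UP)
        ||  !(ifa-&gt;ifa_flags &amp; IFF_BROADCAST)
        ||  !ifa-&gt;ifa_broadaddr)
            continue;

        char address [INET_ADDRSTRLEN];
        char broadcast [INET_ADDRSTRLEN];
        struct sockaddr_in *in = (struct sockaddr_in *) ifa-&gt;ifa_addr;
        inet_ntop (AF_INET, &amp;in-&gt;sin_addr, address, sizeof (address));
        in = (struct sockaddr_in *) ifa-&gt;ifa_broadaddr;
        inet_ntop (AF_INET, &amp;in-&gt;sin_addr, broadcast, sizeof (broadcast));
        printf ("%s: address %s, broadcast %s\n",
                ifa-&gt;ifa_name, address, broadcast);
        count++;
    }
    freeifaddrs (interfaces);
    return count;
}
</programlisting>

<para>The same loop gives us both things the first model was missing: a broadcast address per interface, and our own addresses so that we can recognize and ignore our own beacons.</para>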

<para>Here is the refactored UDP ping program that calls this library, which is much cleaner and nicer:</para>

<example id="udpping2-php">
<title>UDP discovery, model 2 (udpping2.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>The library, udplib, hides a lot of the unpleasant code (which will become uglier as we make this work on more systems). I'm not going to print that code here. You can read it <ulink url="https://github.com/imatix/zguide/blob/master/examples/C/udplib.c">in the repository</ulink>.</para>

<para>Now, there are more problems sizing us up and wondering if they can make lunch out of us. First, IPv4 versus IPv6 and multicast vs. broadcast. In IPv6, broadcast doesn't exist at all; one uses multicast. From my experience with WiFi, IPv4 multicast and broadcast work identically except that multicast breaks in some situations where broadcast works fine. Some access points do not forward multicast packets. When you have a device (e.g. a tablet) that acts as a mobile AP, then it's possible it won't get multicast packets. Meaning, it won't see other peers on the network.</para>

<para>The simplest plausible solution is simply to ignore IPv6 for now, and use broadcast. A perhaps smarter solution would be to use multicast, and deal with asymmetric beacons if they happen.</para>

<para>We'll stick with stupid and simple for now. There's always time to make it more complex.</para>

</sect2>
<sect2>
<title>Multiple Nodes on One Device</title>
<para>So we can discover nodes on the WiFi network, as long as they're sending out beacons as we expect. So I try to test with two processes. But when I run udpping2 twice, the second instance complains "'Address already in use' on bind" and exits. Oh, right. UDP and TCP both return an error if you try to bind two different sockets to the same port. This is right. The semantics of two readers on one socket would be weird to say the least. Odd/even bytes? You get all the 1s, I get all the 0s?</para>

<para>However, a quick check of stackoverflow and some memory of a socket option called SO_REUSEADDR turns up gold. If I use that, I can bind several processes to the same UDP port, and they will all receive any message arriving on that port. It's almost as if the guys who designed this were reading my mind! (That's way more plausible than maybe I'm reinventing the wheel.)</para>

<para>A quick test shows that SO_REUSEADDR works as promised. This is great because the next thing I want to do is design an API and then start dozens of nodes to see them discovering each other. It would be really cumbersome to have to test each node on a separate device. And when we get to testing how real traffic behaves on a large, flaky network, the two alternatives are simulation or temporary insanity.</para>

<para>And I speak from experience: we were, this summer, testing on dozens of devices at once. It takes about an hour to set up a full test run, and you need a space shielded from WiFi interference if you want any kind of reproducibility (unless your test case is "prove that interference kills WiFi networks faster than Orval can kill a thirst").</para>
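
<para>The quick test amounts to something like the sketch below. The helper name <literal>s_shared_udp_socket</literal> is invented for this illustration. I'm assuming Linux behavior here; as far as I know, BSD-derived systems may want SO_REUSEPORT as well before they let two sockets share a UDP port:</para>

<programlisting language="c">
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;netinet/in.h&gt;
#include &lt;arpa/inet.h&gt;

//  Binds a UDP socket to port_nbr with SO_REUSEADDR set, so that
//  several sockets on one box can share the port. Returns the socket
//  handle, or -1 on failure with errno set.
static int
s_shared_udp_socket (int port_nbr)
{
    int handle = socket (AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (handle == -1)
        return -1;

    struct sockaddr_in address;
    memset (&amp;address, 0, sizeof (address));
    address.sin_family = AF_INET;
    address.sin_port = htons ((uint16_t) port_nbr);
    address.sin_addr.s_addr = htonl (INADDR_ANY);

    //  The option must be set before bind, and on every socket that
    //  wants to share the port, including the first one
    int on = 1;
    if (setsockopt (handle, SOL_SOCKET, SO_REUSEADDR,
                    &amp;on, sizeof (on)) == -1
    ||  bind (handle, (struct sockaddr *) &amp;address,
              sizeof (address)) == -1) {
        close (handle);
        return -1;
    }
    return handle;
}
</programlisting>

<para>Calling this twice with the same port number, from one process or from two, should give two valid handles instead of "Address already in use".</para>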

<para>If I was a whiz Android developer with a free weekend I'd immediately (as in, it would take me two days) port this code to my phone and get it sending beacons to my PC. But sometimes lazy is more profitable. I <emphasis>like</emphasis> my Linux laptop. I like being able to start a dozen threads from one process, and have each thread acting like an independent node. I like not having to work in a real Faraday cage when I can simulate one on my laptop.</para>

</sect2>
<sect2>
<title>Designing the API</title>
<para>I'm going to run N nodes on a device, and they are going to have to discover each other, and also discover a bunch of other nodes out there on the local network. I can use UDP for local discovery as well as remote discovery. It's arguably not as efficient as using, e.g., the &Oslash;MQ inproc:// transport, but has the great advantage that the exact same code will work in simulation and in real deployment.</para>

<para>If I have multiple nodes on one device, we clearly can't use the IP address and port number as node address. I need some logical node identifier. Arguably, the node identifier only has to be unique within the context of the device. My mind fills with complex stuff I could make, like supernodes that sit on real UDP ports and forward messages to internal nodes. I hit my head on the table until the idea of <emphasis>inventing new concepts</emphasis> leaves it.</para>

<para>Experience tells us that WiFi does things like disappear and reappear while applications are running. Users click on things, which does interesting things like change the IP address halfway through a session. We cannot depend on IP addresses, nor on established connections (in the TCP fashion). We need some long-lasting addressing mechanism that survives interfaces and connections being torn down, and then recreated.</para>

<para>Here's the simplest solution I can see: we give every node a UUID, and specify that nodes, represented by their UUIDs, can appear or reappear at certain IP address:port endpoints, and then disappear again. We'll deal with recovery from lost messages later. A UUID is 16 bytes. So if I have 100 nodes on a WiFi network, each sending one beacon a second, that's (double it for other random stuff) 3,200 bytes a second of beacon data that the air has to carry just for discovery and presence. Seems acceptable.</para>

<para>Back to concepts. We do need some names for our API. At the least we need a way to distinguish between the node object that is "us", and node objects that are our peers.  We'll be doing things like creating an "us" and then asking it how many peers it knows about and who they are. The term "peer" is clear enough.</para>

<para>From the developer point of view, a node (the application) needs a way to talk to the outside world. Let's borrow a term from networking and call this an "interface". The interface represents us to the rest of the world and presents the rest of the world to us, as a set of other peers. It automatically does whatever discovery it has to. When we want to talk to a peer, we get the interface to do that for us. And when a peer talks to us, it's the interface that delivers us the message.</para>

<para>This seems like a clean API design. How about the internals?</para>

<itemizedlist>
  <listitem><para>The interface has to be multithreaded, so that one thread can do I/O in the background, while the foreground API talks to the application. We used this design in the Clone and Freelance client APIs.</para></listitem>
  <listitem><para>The interface background thread does the discovery business; bind to the UDP port, send out UDP beacons, and receive beacons.</para></listitem>
  <listitem><para>We need to at least send UUIDs in the beacon message so that we can distinguish our own beacons from those of our peers.</para></listitem>
  <listitem><para>We need to track peers that appear, and that disappear. For this I'll use a hash table that stores all known peers, and expire peers after some timeout.</para></listitem>
  <listitem><para>We need a way to report peers and events to the caller. Here we get into a juicy question. How does a background I/O thread tell a foreground API thread that stuff is happening? Callbacks maybe? <emphasis>Heck no.</emphasis> We'll use &Oslash;MQ messages, of course.</para></listitem>
</itemizedlist>
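<para>The peer tracking in the fourth point can be sketched in a few dozen lines. To keep this self-contained I use a small fixed array with a linear search as a stand-in for the hash table; the names (<literal>peer_table_t</literal>, <literal>peer_table_seen</literal>, <literal>peer_table_expire</literal>) and the limit of 64 peers are assumptions for the illustration, not the real interface code. The caller passes in the current time in milliseconds:</para>

<programlisting language="c">
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define PEER_MAX      64        //  Hypothetical limit for this sketch
#define PEER_EXPIRY   5000      //  Msecs of silence before a peer expires
#define UUID_SIZE     16

typedef struct {
    uint8_t uuid [UUID_SIZE];
    int64_t expires_at;         //  Clock value, in msecs, when peer lapses
    int in_use;
} peer_t;

typedef struct {
    peer_t peers [PEER_MAX];    //  Must start out zeroed
} peer_table_t;

//  Records a beacon from uuid at time now. Returns 1 if the peer is
//  new, 0 if we already knew it, and -1 if the table is full.
static int
peer_table_seen (peer_table_t *self, const uint8_t *uuid, int64_t now)
{
    peer_t *vacant = NULL;
    int index;
    for (index = 0; index &lt; PEER_MAX; index++) {
        peer_t *peer = &amp;self-&gt;peers [index];
        if (peer-&gt;in_use
        &amp;&amp;  memcmp (peer-&gt;uuid, uuid, UUID_SIZE) == 0) {
            peer-&gt;expires_at = now + PEER_EXPIRY;
            return 0;
        }
        if (!peer-&gt;in_use &amp;&amp; !vacant)
            vacant = peer;
    }
    if (!vacant)
        return -1;
    memcpy (vacant-&gt;uuid, uuid, UUID_SIZE);
    vacant-&gt;expires_at = now + PEER_EXPIRY;
    vacant-&gt;in_use = 1;
    return 1;
}

//  Drops peers whose expiry has passed; returns how many were dropped
static int
peer_table_expire (peer_table_t *self, int64_t now)
{
    int dropped = 0;
    int index;
    for (index = 0; index &lt; PEER_MAX; index++) {
        peer_t *peer = &amp;self-&gt;peers [index];
        if (peer-&gt;in_use &amp;&amp; now &gt;= peer-&gt;expires_at) {
            peer-&gt;in_use = 0;
            dropped++;
        }
    }
    return dropped;
}
</programlisting>

<para>A return of 1 from <literal>peer_table_seen</literal> is the moment to report a peer joining, and each peer dropped by <literal>peer_table_expire</literal> is a peer leaving.</para>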
<para>The third iteration of the UDP ping program is even simpler and more beautiful than the second. The main body, in C, is just ten lines of code.</para>

<example id="udpping3-php">
<title>UDP discovery, model 3 (udpping3.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>The interface code should be familiar if you've studied how we make multithreaded API classes:</para>

<example id="interface-php">
<title>UDP ping interface (interface.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>When I run this in two windows, it reports one peer joining the network. I kill that peer and a few seconds later it tells me the peer left:</para>

<screen>--------------------------------------
[006] JOINED
[032] 418E98D4B7184844B7D5E0EE5691084C
--------------------------------------
[004] LEFT
[032] 418E98D4B7184844B7D5E0EE5691084C
</screen>

<para>What's nice about a &Oslash;MQ message-based API is that I can wrap this any way I like. For instance, turning it into callbacks if I really want those. I can also trace all activity on the API very easily.</para>

<para>Some notes about tuning. On Ethernet, five seconds (the expiry time I used in this code) seems like a lot. On a badly stressed WiFi network you can get ping latencies of 30 seconds or more. If you use a too-aggressive value for the expiry, you'll disconnect nodes that are still there. On the other side, end user applications expect a certain liveliness. If it takes 30 seconds to report that a node has gone, users will get annoyed.</para>

<para>A decent strategy is to detect and report disappeared nodes rapidly, but only delete them after a longer interval. Visually, a node would be green when it's alive, then gray for a while as it goes out of reach, then finally disappear. We're not doing this now, but will do it in the real implementation of the as-yet-unnamed framework we're making.</para>
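
<para>As a rough sketch of that two-stage strategy, the state of a peer is just a function of how long it has been silent. The two thresholds below (5 and 30 seconds) are hypothetical values I picked from the numbers discussed above, and the names are invented for the illustration:</para>

<programlisting language="c">
#include &lt;stdint.h&gt;

#define PEER_SILENT_AFTER    5000   //  Hypothetical: report after 5 secs
#define PEER_GONE_AFTER     30000   //  Hypothetical: delete after 30 secs

typedef enum {
    PEER_ALIVE,                 //  Shown in green
    PEER_SILENT,                //  Shown in gray, but still remembered
    PEER_GONE                   //  Deleted from the list of peers
} peer_state_t;

//  Classifies a peer by how long ago, in msecs, we last heard from it
static peer_state_t
peer_state (int64_t last_seen, int64_t now)
{
    int64_t silence = now - last_seen;
    if (silence &gt;= PEER_GONE_AFTER)
        return PEER_GONE;
    if (silence &gt;= PEER_SILENT_AFTER)
        return PEER_SILENT;
    return PEER_ALIVE;
}
</programlisting>

<para>Any input from the peer resets <literal>last_seen</literal>, so a gray peer that comes back into range simply turns green again without a fresh join.</para>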

<para>As we will also see later, we have to treat any input from a node, not just UDP beacons, as a sign of life. UDP may get squashed when there's a lot of TCP traffic. This is perhaps the main reason we're not using an existing UDP discovery library: we have to integrate this tightly with our &Oslash;MQ messaging for it to work.</para>

</sect2>
<sect2>
<title>More about UDP</title>
<para>So we have discovery and presence working over UDP IPv4 broadcasts. It's not ideal, but it works for the local networks we have today. However we can't use UDP for real work, not without additional work to make it reliable. There's a joke about UDP but sometimes you'll get it, and sometimes you won't.</para>

<para>We'll stick to TCP for all one-to-one messaging. There is one more use-case for UDP after discovery, which is multicast file distribution. I'll explain why and how, then shelve that for another day. The why is simple: what we call "social networks" is just augmented culture. We create culture by sharing, and this means more and more, sharing works that we make or remix. Photos, documents, contracts, tweets. The clouds of devices we're aiming towards do more of this, not less.</para>

<para>Now, there are two principal patterns for sharing content. One is the "pubsub" pattern where one node sends out content to a set of other nodes, at the same time. Second is the "late joiner" pattern, where a node arrives somewhat later and wants to catch up to the conversation. We can deal with the late joiner using TCP unicast. But doing TCP unicast to a group of clients at the same time has some disadvantages. First, it can be slower than multicast. Second, it's unfair since some will get the content before others.</para>

<para>Before you jump off to design a UDP multicast protocol, realize that it's not a simple calculation. When you send a multicast packet, the WiFi access point uses a low bit rate, to ensure that even the furthest devices will get it safely. Most normal APs don't do the obvious optimization, which is to measure the distance of the furthest device and use that bit rate. Instead they just use a fixed value. So if you have a few devices, close to the AP, multicast will be insanely slow. But if you have a roomful of devices which all want to get the next chapter of the textbook, multicast can be insanely effective.</para>

<para>The curves cross around 6-12 devices depending on the network. You could in theory measure the curves in real-time and create an adaptive protocol. That would be cool but probably too hard for even the smartest of us.</para>

<para>If you do sit down and sketch out a UDP multicast protocol, realize that you need a channel for recovery, to get lost packets. You'd probably want to do this over TCP, using &Oslash;MQ. For now, however, we'll forget about multicast UDP, and assume all traffic goes over TCP.</para>

</sect2>
</sect1>
<sect1>
<title>Spinning off a Library Project</title>
<para>At this stage however, the code is growing larger than an example should be, so it's time to create a proper GitHub project. It's a rule: build your projects in public view, and tell people about them as you go, so your marketing and community building starts on day 1. I'll walk through what this involves. I explained in The &Oslash;MQ Community<xref linkend="the-community"/> about growing communities around projects. We need a few things:</para>

<itemizedlist>
  <listitem><para>A name.</para></listitem>
  <listitem><para>A slogan.</para></listitem>
  <listitem><para>A public GitHub repository.</para></listitem>
  <listitem><para>A README that links to the C4 process.</para></listitem>
  <listitem><para>License files.</para></listitem>
  <listitem><para>An issue tracker.</para></listitem>
  <listitem><para>Two maintainers.</para></listitem>
  <listitem><para>A first bootstrap version.</para></listitem>
</itemizedlist>
<para>The name and slogan first. The trademarks of the 21st century are domain names. So the first thing I do when spinning off a project is to look for a domain name that might work. Quite randomly, one of our old messaging products was called "Zyre" and I have the domain names for it. So, the "ZeroMQ Realtime Experience" project, which is a terribly forced construction but more or less accurate. We are aiming to create a framework for real time experiences (sharing games, photos, stories) over &Oslash;MQ.</para>

<para>I'm somewhat shy about pushing new projects into the &Oslash;MQ community too aggressively, and normally would start a project in either my personal account or the iMatix organization. But we've learned that moving projects after they become popular is counter-productive. My predictions of a future filled with moving pieces are either valid, or wrong. If this chapter is valid, we might as well launch this as a &Oslash;MQ project from the start. If it's wrong, we can delete the repository later or let it sink to the bottom of a long list of forgotten starts.</para>

<para>Start with the basics. The protocol (UDP and &Oslash;MQ/TCP) will be ZRE and the project will be ZyRE, with different capitalization to reduce confusion with the old project. I need a second maintainer, so I invite my friend Dong Min (the Korean hacker behind JeroMQ, a pure-Java &Oslash;MQ stack) to join. He's been working on very similar ideas so is enthusiastic. We discuss this and we get the idea of building ZyRE on top of JeroMQ as well as on top of CZMQ and libzmq. This would make it a lot easier to run ZyRE on Android. It would also give us two fully separate implementations from the start, always a good thing for a protocol.</para>

<para>So we take the FileMQ project I built in The Human Scale<xref linkend="the-human-scale"/> as a template for a new GitHub project. The GNU autoconf tools are quite decent but have a painful syntax. It's easiest to copy existing project files, and modify them. The FileMQ project builds a library, has test tools, license files, man pages, and so on. It's not too large so it's a good starting point.</para>

<para>I put together a README to summarize the goals of the project and point to C4. The issue tracker is enabled by default on new GitHub projects, so once we've pushed the UDP ping code as a first version, we're ready to go. However it's always good to recruit more maintainers, so I create an issue "Call for maintainers" that says:</para>

<blockquote>
  <para>If you'd like to help click that lovely green "Merge Pull Request" button, and get eternal karma, add a comment confirming that you've read and understand the C4 process at http://rfc.zeromq.org/spec:16.</para>
</blockquote>
<para>Finally, I change the issue tracker labels. GitHub by default offers the usual variety of issue types but with C4 we don't use them. Instead we need just two labels ("Urgent", in red, and "Ready", in black).</para>

</sect1>
<sect1>
<title>Point-to-point Messaging</title>
<para>I'm going to take the last UDP ping program and build a point-to-point messaging layer on top of that. Our goal is that we can detect peers as they join and leave the network, that we can send messages to them, and that we can get replies. It is a non-trivial problem to solve and takes Min and me two days to get a "hello world" version working.</para>

<para>We had to solve a number of issues:</para>

<itemizedlist>
  <listitem><para>What information to send in the UDP beacon, and how to format it.</para></listitem>
  <listitem><para>What &Oslash;MQ socket types to use to interconnect nodes.</para></listitem>
  <listitem><para>What &Oslash;MQ messages to send, and how to format them.</para></listitem>
  <listitem><para>How to send a message to a specific node.</para></listitem>
  <listitem><para>How to know the sender of any message so we could send a reply.</para></listitem>
  <listitem><para>How to recover from lost UDP beacons.</para></listitem>
  <listitem><para>How to avoid overloading the network with beacons.</para></listitem>
</itemizedlist>
<para>I'll explain these in enough detail that you understand why we made each choice we did, with some code fragments to illustrate. We tagged this code as <ulink url="https://github.com/zeromq/zyre/zipball/v0.1.0">version 0.1.0</ulink> so you can look at the code: most of the hard work is done in zre_interface.c.</para>

<sect2>
<title>UDP Beacon Framing</title>
<para>Sending UUIDs across the network is the bare minimum for a logical addressing scheme. However we have a few more aspects to get working before this will work in real use:</para>

<itemizedlist>
  <listitem><para>We need some protocol identification so that we can check for, and reject invalid packets.</para></listitem>
  <listitem><para>We need some version information so that we can change this protocol over time.</para></listitem>
  <listitem><para>We need to tell other nodes how to reach us via TCP, i.e. a &Oslash;MQ port they can talk to us on.</para></listitem>
</itemizedlist>
<para>Let's start with the beacon message format. We probably want a fixed protocol header that will never change in future versions, and a body that depends on the version (<xref linkend="figure-66"/>).</para>

<figure id="figure-66">
    <title>ZRE discovery message</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig66.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The version can be a 1-byte counter starting at 1. The UUID is 16 bytes, and the port is a 2-byte port number, since UDP nicely tells us the sender's IP address for every message we receive. This gives us a 22-byte frame.</para>

<para>The C language (and a few others like Erlang) makes it simple to read and write binary structures. We define the beacon frame structure:</para>

<programlisting language="c">
#define BEACON_PROTOCOL     "ZRE"
#define BEACON_VERSION      0x01

typedef struct {
    byte protocol [3];
    byte version;
    uuid_t uuid;
    uint16_t port;
} beacon_t;
</programlisting>

<para>Which makes sending and receiving beacons quite simple. Here is how we send a beacon, using the zre_udp class to do the non-portable network calls:</para>

<programlisting language="c">
//  Beacon object
beacon_t beacon;

//  Format beacon fields
beacon.protocol [0] = 'Z';
beacon.protocol [1] = 'R';
beacon.protocol [2] = 'E';
beacon.version = BEACON_VERSION;
memcpy (beacon.uuid, self-&gt;uuid, sizeof (uuid_t));
beacon.port = htons (self-&gt;port);

//  Broadcast the beacon to anyone who is listening
zre_udp_send (self-&gt;udp, (byte *) &amp;beacon, sizeof (beacon_t));
</programlisting>
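
<para>Sending the structure as raw bytes only works if the compiler lays it out in exactly 22 bytes with no padding. On the common ABIs it does, since every field happens to fall on its natural alignment, but it costs little to check. The sketch below uses stand-in type names of my own (<literal>zre_uuid_t</literal>, <literal>wire_beacon_t</literal>) so that it compiles without the project headers:</para>

<programlisting language="c">
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

typedef unsigned char byte;
typedef unsigned char zre_uuid_t [16];  //  Stand-in for uuid_t

typedef struct {
    byte protocol [3];
    byte version;
    zre_uuid_t uuid;
    uint16_t port;
} wire_beacon_t;

//  Returns 1 if the in-memory layout matches the 22-byte wire format:
//  3-byte protocol, 1-byte version, 16-byte UUID, 2-byte port
static int
s_beacon_layout_ok (void)
{
    return offsetof (wire_beacon_t, version) == 3
        &amp;&amp; offsetof (wire_beacon_t, uuid) == 4
        &amp;&amp; offsetof (wire_beacon_t, port) == 20
        &amp;&amp; sizeof (wire_beacon_t) == 22;
}
</programlisting>

<para>An assertion on this at startup catches an unusual compiler or packing option before it turns into mysteriously rejected beacons. With a C11 compiler the same conditions could be checked at build time instead.</para>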

<para>When we receive a beacon we need to guard against bogus data. We're not going to be paranoid against, for example, denial-of-service attacks. We just want to make sure we're not going to crash when a bad ZRE implementation sends us erroneous frames.</para>

<para>To validate a frame we check its size and header. If those are OK, we assume the body is usable. When we get a UUID that isn't ourselves (recall, we'll get our own UDP broadcasts back), we can treat this as a peer:</para>

<programlisting language="c">
//  Get beacon frame from network
beacon_t beacon;
ssize_t size = zre_udp_recv (self-&gt;udp, (byte *) &amp;beacon, sizeof (beacon_t));

//  Basic validation on the frame
if (size != sizeof (beacon_t)
||  beacon.protocol [0] != 'Z'
||  beacon.protocol [1] != 'R'
||  beacon.protocol [2] != 'E'
||  beacon.version != BEACON_VERSION)
    return 0;               //  Ignore invalid beacons

//  If we got a UUID and it's not our own beacon, we have a peer
if (memcmp (beacon.uuid, self-&gt;uuid, sizeof (uuid_t))) {
    char *identity = s_uuid_str (beacon.uuid);
    s_require_peer (self, identity,
        zre_udp_from (self-&gt;udp), ntohs (beacon.port));
    free (identity);
}
</programlisting>
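
<para>In a language or on a platform where casting a buffer to a structure is not an option, the same validation can be done byte by byte. This is a sketch under my own naming (<literal>s_beacon_parse</literal> is not part of ZyRE); it applies the same checks as above to the 22-byte frame and extracts the UUID and port without relying on structure layout or host byte order:</para>

<programlisting language="c">
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define BEACON_SIZE     22
#define BEACON_VERSION  0x01

//  Returns 1 and fills uuid and port if frame is a valid beacon;
//  returns 0 and leaves the outputs untouched otherwise
static int
s_beacon_parse (const uint8_t *frame, size_t size,
                uint8_t uuid [16], uint16_t *port)
{
    if (size != BEACON_SIZE
    ||  memcmp (frame, "ZRE", 3) != 0
    ||  frame [3] != BEACON_VERSION)
        return 0;               //  Ignore invalid beacons

    memcpy (uuid, frame + 4, 16);
    //  Port is in network byte order, high byte first
    *port = (uint16_t) (frame [20] &lt;&lt; 8 | frame [21]);
    return 1;
}
</programlisting>

<para>The caller still has to compare the returned UUID with its own before treating the sender as a peer.</para>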

</sect2>
<sect2>
<title>True Peer Connectivity (Harmony Pattern)</title>
<para>Since &Oslash;MQ is designed to make distributed messaging easy, people often ask how to interconnect a set of true peers (as compared to obvious clients and servers). It is a thorny question and &Oslash;MQ doesn't really provide a single clear answer.</para>

<para>TCP, which is the most commonly-used transport in &Oslash;MQ, is not symmetric; one side must bind and one must connect and though &Oslash;MQ tries to be neutral about this, it's not. When you connect, you create an outgoing message pipe. When you bind, you do not. When there is no pipe, you cannot write messages (&Oslash;MQ will return EAGAIN).</para>

<para>Developers who study &Oslash;MQ and then try to create N-to-N connections between sets of equal peers often try a ROUTER-to-ROUTER flow. It's obvious why: each peer needs to address a set of peers, which requires ROUTER. It usually ends with a plaintive email to the list.</para>

<para>My conclusion after trying several times from different angles is that ROUTER-to-ROUTER does not work. And the &Oslash;MQ reference manual does not allow it when it discusses ROUTER sockets in <literal>zmq_socket()</literal>. At a minimum, one peer must bind and one must connect, meaning the architecture is not symmetrical. But also because you simply can't tell when you are allowed to safely send a message to a peer. It's Catch-22: you can talk to a peer after it's talked to you. But the peer can't talk to you until you've talked to it. One side or the other will be losing messages and thus has to retry, which means the peers cannot be equal.</para>

<para>I'm going to explain the Harmony pattern, which solves this problem, and which we use in ZyRE.</para>

<para>We want a guarantee that when a peer "appears" on our network, we can talk to it safely, without &Oslash;MQ dropping messages. For this, we have to use a DEALER or PUSH socket which <emphasis>connects out to the peer</emphasis> so that even if that connection takes some non-zero time, there is immediately a pipe, and &Oslash;MQ will accept outgoing messages.</para>

<para>A DEALER socket cannot address multiple peers individually. But if we have one DEALER per peer, and we connect that DEALER to the peer, we can safely send messages to a peer as soon as we've connected to it.</para>

<para>Now, the next problem is to know who sent us a particular message. We need a reply address, that is, the UUID of the node that sent any given message. DEALER can't do this unless we prefix every single message with that 16-byte UUID, which would be wasteful. ROUTER does, if we set the identity properly before connecting to the router.</para>

<para>And so the Harmony pattern comes down to:</para>

<itemizedlist>
  <listitem><para>One ROUTER socket that we bind to a transient port, which we broadcast in our beacons.</para></listitem>
  <listitem><para>One DEALER socket <emphasis>per peer</emphasis> that we connect to the peer's ROUTER socket.</para></listitem>
  <listitem><para>Reading from our ROUTER socket.</para></listitem>
  <listitem><para>Writing to the peer's DEALER socket.</para></listitem>
</itemizedlist>
<para>Next problem is that discovery isn't neatly synchronized. We can get the first beacon from a peer <emphasis>after</emphasis> we start to receive messages from it. A message comes in on the ROUTER socket and has a nice UUID attached to it. But no physical IP address and port. We have to force discovery over TCP. To do this, our first command to any new peer we connect to is an OHAI command with our IP address and port. This ensures that the receiver connects back to us before trying to send us any command.</para>

<para>Breaking this down into steps:</para>

<itemizedlist>
  <listitem><para>If we receive a UDP beacon we connect to the peer.</para></listitem>
  <listitem><para>We read messages from our ROUTER socket, and each message comes with the UUID of the sender.</para></listitem>
  <listitem><para>If it's an OHAI message we connect back to that peer if not already connected to it.</para></listitem>
  <listitem><para>If it's any other message, we <emphasis>must</emphasis> already be connected to the peer (a good place for an assertion).</para></listitem>
  <listitem><para>We send messages to each peer using a dedicated per-peer DEALER socket, which <emphasis>must</emphasis> be connected.</para></listitem>
  <listitem><para>When we connect to a peer we also tell our application that the peer exists.</para></listitem>
  <listitem><para>Every time we get a message from a peer, we treat that as a heartbeat (it's alive).</para></listitem>
</itemizedlist>
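<para>The receiving side of these steps can be sketched as a small peer table. This is a minimal sketch under stated assumptions, not the ZyRE code: the <literal>peer_t</literal> structure, the fixed-size table, and all function names are hypothetical, and connecting a DEALER socket is reduced to setting a flag so that the logic stands alone.</para>

<programlisting language="c">
#include &lt;stdbool.h&gt;
#include &lt;string.h&gt;

#define MAX_PEERS 16            //  Hypothetical limit for this sketch

typedef struct {
    char uuid [33];             //  Sender UUID as a printable string
    bool connected;             //  True once our DEALER is connected
} peer_t;

peer_t peers [MAX_PEERS];
int nbr_peers = 0;

//  Find a peer by UUID, or return NULL if we don't know it
peer_t *
peer_lookup (const char *uuid)
{
    int index;
    for (index = 0; index &lt; nbr_peers; index++)
        if (strcmp (peers [index].uuid, uuid) == 0)
            return &amp;peers [index];
    return NULL;
}

//  Connect to a peer if needed; called for UDP beacons and for OHAI
peer_t *
peer_require (const char *uuid)
{
    peer_t *peer = peer_lookup (uuid);
    if (!peer &amp;&amp; nbr_peers &lt; MAX_PEERS) {
        peer = &amp;peers [nbr_peers++];
        strncpy (peer-&gt;uuid, uuid, sizeof (peer-&gt;uuid) - 1);
        peer-&gt;uuid [sizeof (peer-&gt;uuid) - 1] = 0;
        peer-&gt;connected = true;     //  Real code connects a DEALER here
    }
    return peer;
}

//  Handle one command read from the ROUTER socket. Returns true if
//  the command was legal for the current state of that peer.
bool
handle_command (const char *uuid, const char *command)
{
    if (strcmp (command, "OHAI") == 0)
        return peer_require (uuid) != NULL;
    //  Any other command requires an existing connection
    peer_t *peer = peer_lookup (uuid);
    return peer != NULL &amp;&amp; peer-&gt;connected;
}
</programlisting>

<para>In this sketch a command other than OHAI from an unknown UUID makes <literal>handle_command()</literal> return false; that return value is the condition the real code would assert on.</para>
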
<para>If we were not using UDP but some other discovery mechanism, I'd still use the Harmony pattern for a true peer network: one ROUTER for input from all peers, and one DEALER per peer for output. Bind the ROUTER, connect the DEALER, and start each conversation with an OHAI equivalent that provides the return IP address and port. You would need some external mechanism to bootstrap each connection.</para>

</sect2>
<sect2>
<title>Detecting Disappearances</title>
<para>Heartbeating sounds simple but it's not. UDP packets get dropped when there's a lot of TCP traffic, so if we depend on UDP beacons we'll get false disconnections. TCP traffic can be delayed for five, ten, 30 seconds if the network is really busy. So if we kill peers when they go quiet, we'll have false disconnections.</para>

<para>Since UDP beacons aren't reliable, it's tempting to add in TCP beacons. After all, TCP will deliver them reliably. One little problem. Imagine you have 100 nodes on a network, and each node sends a TCP beacon once a second. Each beacon is 22 bytes not counting TCP's framing overhead. That is 100 * 99 * 22 bytes per second, or 217,800 bytes/second just for heartbeating. That's about 1-2% of a typical WiFi network's ideal capacity, which sounds OK. But when a network is stressed, or fighting other networks for airspace, that extra 200K a second will break what's left. UDP broadcasts are at least low cost.</para>
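
<para>The arithmetic is easy to check for other network sizes. The sketch below is illustrative only: the function names are made up, and the UDP figure assumes that each node sends one broadcast per interval and that the UDP beacon is also 22 bytes, which the text above does not state.</para>

<programlisting language="c">
//  Bytes per second when every node sends a TCP beacon to every
//  other node, not counting framing overhead
long
tcp_beacon_load (long nodes, long beacon_size, long beacons_per_second)
{
    return nodes * (nodes - 1) * beacon_size * beacons_per_second;
}

//  Bytes per second when every node sends a single UDP broadcast
//  (assumed: one broadcast reaches all peers on the segment)
long
udp_beacon_load (long nodes, long beacon_size, long beacons_per_second)
{
    return nodes * beacon_size * beacons_per_second;
}
</programlisting>

<para>For 100 nodes sending 22-byte beacons once a second, <literal>tcp_beacon_load (100, 22, 1)</literal> gives 217,800 bytes per second, while <literal>udp_beacon_load (100, 22, 1)</literal> gives 2,200. The TCP cost grows with the square of the number of nodes; the broadcast cost grows linearly.</para>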

<para>So what we do is switch to TCP heartbeats only when a specific peer hasn't sent us any UDP beacons in a while. And then, we send TCP heartbeats only to that one peer. If the peer continues to be silent, we conclude it's gone away. If the peer comes back, with a different IP address and/or port, we have to disconnect our DEALER socket and reconnect to the new port.</para>

<para>This gives us a set of states for each peer, though at this stage the code doesn't use a formal state machine:</para>

<itemizedlist>
  <listitem><para>Peer visible thanks to UDP beacon (we connect using IP address and port from beacon)</para></listitem>
  <listitem><para>Peer visible thanks to OHAI command (we connect using IP address and port from command)</para></listitem>
  <listitem><para>Peer seems alive (we got a UDP beacon or command over TCP recently)</para></listitem>
  <listitem><para>Peer seems quiet (no activity in some time, so we send a HUGZ command)</para></listitem>
  <listitem><para>Peer has disappeared (no reply to our HUGZ commands, so we destroy peer)</para></listitem>
</itemizedlist>
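<para>The last three states depend only on how long a peer has been silent, so they can be derived from a single timestamp. This is a sketch rather than the ZyRE implementation: the names are hypothetical, the 5-second "quiet" threshold is an assumption, and the 10-second expiry is the figure used later in this chapter.</para>

<programlisting language="c">
#include &lt;stdint.h&gt;

#define PEER_EVASIVE   5000     //  Assumed: msecs of silence before HUGZ
#define PEER_EXPIRED  10000     //  Assumed: msecs of silence before giving up

typedef enum {
    PEER_ALIVE,                 //  Heard from peer recently
    PEER_QUIET,                 //  Silent for a while; send a HUGZ
    PEER_GONE                   //  No reply to HUGZ; destroy peer
} peer_state_t;

//  Work out a peer's state from the time we last heard from it.
//  Any UDP beacon or TCP command from the peer updates last_seen.
peer_state_t
peer_state (int64_t now, int64_t last_seen)
{
    int64_t silence = now - last_seen;
    if (silence &gt;= PEER_EXPIRED)
        return PEER_GONE;
    if (silence &gt;= PEER_EVASIVE)
        return PEER_QUIET;
    return PEER_ALIVE;
}
</programlisting>

<para>A reply to HUGZ counts as activity like any other command, so it resets <literal>last_seen</literal> and moves the peer back to the alive state without special handling.</para>
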
<para>There's one remaining scenario we didn't address in the code at this stage. It's possible for a peer to change IP addresses and ports without actually triggering a disappearance event. For example, if the user switches off WiFi and then switches it back on, then the access point can assign the peer a new IP address. We'll need to handle a disappeared WiFi interface on our node by unbinding the ROUTER socket and rebinding it when we can. Since this is not central to the design now, I decide to log an issue on the GitHub tracker and leave it for a rainy day.</para>

</sect2>
</sect1>
<sect1>
<title>Group Messaging</title>
<para>Group messaging is a common and very useful pattern. The concept is simple: instead of talking to a single node, you talk to a "group" of nodes. The group is just a name, a string that you agree on in the application. It's precisely like using the publish-subscribe prefixes in PUB and SUB sockets. In fact the only reason I say "group messaging" and not "pub-sub" is to prevent confusion, since we're not going to use PUB/SUB sockets for this.</para>

<para>PUB/SUB would almost work. But we've just done such a lot of work to solve the late joiner problem. Applications are inevitably going to wait for peers to arrive before sending messages to groups, so we have to build on the Harmony pattern rather than start again beside it.</para>

<para>Let's look at the operations we want to do on groups:</para>

<itemizedlist>
  <listitem><para>We want to join and leave groups.</para></listitem>
  <listitem><para>We want to know what other nodes are in any given group.</para></listitem>
  <listitem><para>We want to send a message to (all nodes in) a group.</para></listitem>
</itemizedlist>
<para>These will look familiar to anyone who's used Internet Relay Chat, except that we have no server. Every node will need to keep track of what each group represents. This information will not always be fully consistent across the network but it will be close enough.</para>

<para>Our interface will track a set of groups (each an object). These are all the known groups with one or more member nodes, excluding ourselves. We'll track nodes as they leave and join groups. Since nodes can join the network at any time, we have to tell new peers what groups we're in. When a peer disappears, we'll remove it from all groups we know about.</para>

<para>This gives us some new protocol commands:</para>

<itemizedlist>
  <listitem><para>JOIN - we send this to all peers when we join a group.</para></listitem>
  <listitem><para>LEAVE - we send this to all peers when we leave a group.</para></listitem>
</itemizedlist>
<para>Plus, we add a 'groups' field to the first command we send (renamed from OHAI to HELLO at this point because I need a larger lexicon of command verbs).</para>

<para>Lastly, let's add a way for peers to double-check the accuracy of their group data. The risk is that we miss one of the above messages. Though we are using Harmony to avoid the typical message loss at startup, it's worth being paranoid. For now, all we need is a way to detect such a failure. We'll deal with recovery later, if the problem actually happens.</para>

<para>I'll use the UDP beacon for this. What we want is a rolling counter that simply tells how many join and leave operations ("transitions") there have been for a node. It starts at 0 and increments for each group we join or leave. We can use a minimal 1-byte value since that will catch all failures except the astronomically rare "we lost precisely 256 messages in a row" failure (this is the one that hits during the first demo). We will also put the transitions counter into the JOIN, LEAVE, and HELLO commands. And to try to provoke the problem, we'll test by joining/leaving several hundred groups, with a high-water mark set to 10 or so.</para>
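
<para>The wrap-around behavior of a 1-byte counter is worth seeing in code, since it explains the "256 messages in a row" blind spot. This is a minimal sketch; the function names are hypothetical and not taken from the ZyRE source.</para>

<programlisting language="c">
#include &lt;stdint.h&gt;

//  Advance the sender's transitions counter for one JOIN or LEAVE;
//  unsigned 8-bit arithmetic wraps from 255 back to 0
uint8_t
transitions_next (uint8_t counter)
{
    return (uint8_t) (counter + 1);
}

//  Number of transitions a receiver missed, modulo 256. Zero means
//  the receiver is in step, or that it missed a multiple of 256.
uint8_t
transitions_missed (uint8_t expected, uint8_t received)
{
    return (uint8_t) (received - expected);
}
</programlisting>

<para>A receiver expecting 250 that sees 4 gets <literal>transitions_missed (250, 4) == 10</literal>, so a gap is detected correctly across the wrap. A receiver that missed exactly 256 transitions gets zero and notices nothing.</para>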

<para>Time to choose verbs for the group messaging. We need a command that means "talk to one peer" and one that means "talk to many peers". After some attempts, my best choices are "WHISPER" and "SHOUT", and this is what the code uses. The SHOUT command needs to tell the user the group name, as well as the sender peer.</para>

<para>Since groups are like publish-subscribe, you might be tempted to use this to broadcast the JOIN and LEAVE commands as well, perhaps by creating a "global" group that all nodes join. My advice is to keep groups purely as user-space concepts for two reasons. First, how do you join the global group if you need the global group to send out a JOIN command? Second, it creates special cases ('reserved names') which are messy.</para>

<para>It's simpler just to send JOINs and LEAVEs explicitly to all connected peers, period.</para>

<para>I'm not going to work through the implementation of group messaging in detail since it's fairly pedantic and not exciting. The data structures for group and peer management aren't optimal but they're workable. We need:</para>

<itemizedlist>
  <listitem><para>A list of groups for our interface, which we can send to new peers in a HELLO command;</para></listitem>
  <listitem><para>A hash of groups for other peers, which we update with information from HELLO, JOIN, and LEAVE commands;</para></listitem>
  <listitem><para>A hash of peers for each group, which we update with the same three commands.</para></listitem>
</itemizedlist>
<para>At this stage I'm starting to get pretty happy with the binary serialization (our codec generator from The Human Scale<xref linkend="the-human-scale"/>), which handles lists and dictionaries as well as strings and integers.</para>

<para>This version is tagged in the repository as v0.2.0 and you can <ulink url="https://github.com/zeromq/zyre/tags">download the tarball</ulink> if you want to check what the code looked like at this stage.</para>

</sect1>
<sect1>
<title>Testing and Simulation</title>
<sect2>
<title>On Assertions</title>
<para>The proper use of assertions is one of the hallmarks of a professional programmer.</para>

<para>Our confirmation bias as creators makes it hard to test our work properly. We tend to write tests to prove the code works, rather than trying to prove it doesn't. There are many reasons for this. We pretend to ourselves and others that we can be (could be) perfect, when in fact we consistently make mistakes. Bugs in code are seen as "bad", rather than "inevitable", so psychologically we want to see fewer of them, not uncover more of them. "He writes perfect code" is a compliment rather than a euphemism for "he never takes risks so his code is as boring and heavily used as cold spaghetti".</para>

<para>Some cultures teach us to aspire to perfection, and punish mistakes, in education and work, which makes this attitude worse. To accept that we're fallible, and then to learn how to turn that into profit rather than shame is one of the hardest intellectual exercises, in any profession. We leverage our fallibilities by working with others, and by challenging our own work sooner, not later.</para>

<para>One trick that makes it easier is to use assertions. Assertions are not a form of error handling. They are executable theories of fact. The code asserts, "at this point, such and such must be true" and if the assertion fails, the code kills itself.</para>

<para>The faster you can prove code incorrect, the faster and more accurately you can fix it. Believing that code works and proving that it behaves as expected is less science, and more magical thinking. It's far better to be able to say, "libzmq has five hundred assertions and despite all my efforts, not one of them fails".</para>

<para>So the ZyRE code base is scattered with assertions, and particularly a couple on the code that deals with the state of peers. This is the hardest aspect to get right: peers need to track each other and exchange state accurately, or things stop working. The algorithms depend on asynchronous messages flying around and I'm pretty sure the initial design has flaws. They always do.</para>

<para>And as I test the original ZyRE code by starting and stopping instances of zre_ping by hand, every so often I get an assertion failure. Running by hand doesn't reproduce these often enough, so let's make a proper tester tool.</para>

</sect2>
<sect2>
<title>On Up-front Testing</title>
<para>Being able to fully test the real behavior of individual components in the laboratory can make a 10x or 100x difference to the cost of your project. The confirmation bias engineers have toward their own work makes up-front testing incredibly profitable, and late-stage testing incredibly expensive.</para>

<para>I'll tell you a short story about a project we worked on in the late 1990s. We provided the software, and other teams the hardware, for a factory automation project. Three or four teams brought their experts on-site, which was a remote factory (funny how the polluting factories are always in remote border country).</para>

<para>One of these teams, a firm specializing in industrial automation, built ticket machines: kiosks, and software to run on them. Nothing unusual: swipe a badge, choose an option, receive a ticket. They assembled two of these kiosks, on-site, each week bringing some more bits and pieces. Ticket printers, monitor screens, special keypads from Israel. The stuff had to be resistant against dust since the kiosks sat outside. Nothing worked. The screens were unreadable in the sun. The ticket printers continually jammed and misprinted. The internals of the kiosk were just sat on wooden shelving. The kiosk software crashed regularly. It was comedic except that the project really, <emphasis>really</emphasis> had to work and so we spent weeks and then months on-site helping the other teams debug their bits and pieces until it worked.</para>

<para>A year later, a second factory, and the same story. By this time the client was getting impatient. So when they came to the third and largest factory, a year later, we jumped up and said, "please let us make the kiosks and the software and everything".</para>

<para>So we made a detailed design for the software and hardware and found suppliers for all the pieces. It took us three months to search the Internet for each component, and another two months to get them assembled into stainless-steel bricks each weighing about twenty kilos. These bricks were 60cm square and 20cm deep, with a large flat-screen panel behind unbreakable glass, and two connectors: one for power, one for Ethernet. You loaded up the paper bin with enough for six months, then screwed the brick into a housing, and it automatically booted, found its DNS server, loaded its Linux OS and then application software. It connected to the real server, and showed the main menu. You got access to the configuration screens by swiping a special badge and then entering a code.</para>

<para>The software was portable so we could test that as we wrote it, and as we collected the pieces from our suppliers we kept one of each so we had a disassembled kiosk to play with. When we got our finished kiosks, they all worked immediately. We shipped them to the client, who plugged them into their housing, switched them on, and went to business. We spent a week or so on site, and in ten years, one kiosk broke (the screen died, and was replaced).</para>

<para>Lesson is, test up-front so that when you plug the thing in, you know precisely how it's going to behave. If you haven't tested it up-front, you're going to be spending weeks and months in the field, ironing out problems that should never have been there.</para>

</sect2>
<sect2>
<title>The ZyRE Tester</title>
<para>During manual testing I did hit an assertion rarely. It then disappeared. Since I don't believe in magic, that means the code is still wrong somewhere. So, next step is heavy-duty testing of the ZyRE 0.2.0 code to try to break its assertions, and get a good idea of how it will behave in the field.</para>

<para>We packaged the discovery and messaging functionality as an 'interface' object that the main program creates, works with, and then destroys. We don't use any global variables. This makes it easy to start large numbers of interfaces and simulate real activity, all within one process. And if there's one thing we've learned from writing lots of examples, it's that &Oslash;MQ's ability to orchestrate multiple threads in a single process is <emphasis>much</emphasis> easier to work with than multiple processes.</para>

<para>The first version of the tester consists of a main thread which starts and stops a set of child threads, each running one interface, each with a ROUTER, DEALER, and UDP socket ('R', 'D', and 'U' in the diagram)(<xref linkend="figure-67"/>).</para>

<figure id="figure-67">
    <title>ZyRE Tester Tool</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig67.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The nice thing is that when I am connected to a WiFi access point, all ZyRE traffic (even between two interfaces in the same process) goes across the AP. This means I can fully stress test any WiFi infrastructure with just a couple of PCs running in a room. It's hard to emphasize how valuable this is: if we had built ZyRE as, say, a dedicated service for Android, we'd literally need dozens of Android tablets or phones to do any large-scale testing. Kiosks, and all that.</para>

<para>The focus is now on breaking the current code, trying to prove it wrong. There's <emphasis>no point</emphasis> at this stage in testing how well it runs, how fast it is, how much memory it uses, or anything else. We'll work up to trying (and failing) to break each individual functionality but first, we try to break some of the core assertions I've put into the code.</para>

<para>These are:</para>

<itemizedlist>
  <listitem><para>The first command that any node receives from a peer MUST be "HELLO". In other words, messages <emphasis>cannot</emphasis> be lost during the peer-to-peer connection process.</para></listitem>
  <listitem><para>The state each node calculates for its peers matches the state each peer calculates for itself. In other words, again, no messages are lost in the network.</para></listitem>
  <listitem><para>When my application sends a message to a peer, we have a connection to that peer. In other words, the application only "sees" a peer after we have established a &Oslash;MQ connection to it.</para></listitem>
</itemizedlist>
<para>With &Oslash;MQ, there are several cases where we may lose messages. One is the "late joiner" syndrome. Two is when we close sockets without sending everything. Three is when we overflow the high-water mark on a ROUTER or PUB socket. Four is when we use an unknown address with a ROUTER socket.</para>

<para>Now, I <emphasis>think</emphasis> Harmony gets around all these potential cases. But we're also adding UDP to the mix. So the first version of the tester simulates an unstable and dynamic network, where nodes come and go randomly. It's here that things will break.</para>

<para>Here is the main thread of the tester, which manages a pool of 100 threads, starting and stopping each one randomly. Every ~750 msecs it either starts or stops one random thread. We randomize the timing so that threads aren't all synchronized. After a few minutes we have an average of 50 threads happily chatting to each other like Korean teenagers in Gangnam subway station:</para>

<programlisting language="c">
int main (int argc, char *argv [])
{
    //  Initialize context for talking to tasks
    zctx_t *ctx = zctx_new ();
    zctx_set_linger (ctx, 100);

    //  Get number of interfaces to simulate, default 100
    int max_interface = 100;
    int nbr_interfaces = 0;
    if (argc &gt; 1)
        max_interface = atoi (argv [1]);

    //  We address interfaces as an array of pipes
    void **pipes = zmalloc (sizeof (void *) * max_interface);

    //  We will randomly start and stop interface threads
    while (!zctx_interrupted) {
        uint index = randof (max_interface);
        //  Toggle interface thread
        if (pipes [index]) {
            zstr_send (pipes [index], "STOP");
            zsocket_destroy (ctx, pipes [index]);
            pipes [index] = NULL;
            zclock_log ("I: Stopped interface (%d running)", --nbr_interfaces);
        }
        else {
            pipes [index] = zthread_fork (ctx, interface_task, NULL);
            zclock_log ("I: Started interface (%d running)", ++nbr_interfaces);
        }
        //  Sleep ~750 msecs randomly so we smooth out activity
        zclock_sleep (randof (500) + 500);
    }
    zctx_destroy (&amp;ctx);
    return 0;
}
</programlisting>

<para>Note that we maintain a 'pipe' to each child thread (CZMQ creates the pipe automatically when we use the zthread_fork() method). It's via this pipe that we tell child threads to stop, when it's time for them to leave. The child threads do the following (I'm switching to pseudo-code for clarity):</para>

<screen>create an interface
while true:
    poll on pipe to parent, and on interface
    if parent sent us a message:
        break
    if interface sent us a message:
        if message is ENTER:
            send a WHISPER to the new peer
        if message is EXIT:
            send a WHISPER to the departed peer
        if message is WHISPER:
            send back a WHISPER 1/2 of the time
        if message is SHOUT:
            send back a WHISPER 1/3 of the time
            send back a SHOUT 1/3 of the time
    once per second:
        join or leave one of 10 random groups
destroy interface
</screen>

</sect2>
<sect2>
<title>Test Results</title>
<para>Yes, we broke the code. Several times, in fact. This was satisfying. I'll work through the different things we found.</para>

<para>Getting nodes to agree on consistent group status was the most difficult. Every node needs to track the group membership of the whole network, as I already explained in the section "Group Messaging". Group messaging is a publish-subscribe pattern. JOINs and LEAVEs are analogous to 'subscribe' and 'unsubscribe' messages. It's essential none of these ever get lost, or we'll find nodes dropping randomly off groups.</para>

<para>So each node counts the total number of JOINs and LEAVEs it's ever done, and broadcasts this status (as a 1-byte rolling counter) in its UDP beacon. Other nodes pick up the status, compare it to their own calculations, and if there's a difference, the code asserts.</para>

<para>First problem was that UDP beacons get delayed randomly, so they're useless for carrying the status. When a beacon arrives late, the status is inaccurate and we get a 'false negative'. To fix this we moved the status information into the JOIN and LEAVE commands. We also added it to the HELLO command. The logic then becomes:</para>

<itemizedlist>
  <listitem><para>Get initial status for a peer from its HELLO command.</para></listitem>
  <listitem><para>When getting a JOIN or LEAVE from a peer, increment the status counter.</para></listitem>
  <listitem><para>Check that the new status counter matches the value in the JOIN or LEAVE command.</para></listitem>
  <listitem><para>If it doesn't, assert.</para></listitem>
</itemizedlist>
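<para>On the receiving side, this logic comes down to two small functions per peer. The sketch below uses hypothetical names and returns a boolean instead of asserting directly, so that the caller decides what to do with a mismatch.</para>

<programlisting language="c">
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

typedef struct {
    uint8_t status;             //  Our count of the peer's JOINs and LEAVEs
} peer_status_t;

//  HELLO carries the peer's current status; take it as our baseline
void
peer_status_hello (peer_status_t *self, uint8_t status)
{
    self-&gt;status = status;
}

//  JOIN and LEAVE carry the peer's status after the transition.
//  Increment our own count and report whether the two agree; the
//  caller asserts on the result.
bool
peer_status_check (peer_status_t *self, uint8_t status)
{
    self-&gt;status++;
    return self-&gt;status == status;
}
</programlisting>

<para>Because the counter now travels in the same TCP stream as the commands it describes, a delayed UDP beacon can no longer make the two sides appear out of step.</para>
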
<para>Next problem we got was that messages were arriving unexpectedly on new connections. The Harmony pattern connects, then sends HELLO as the first command. This means the receiving peer should always get HELLO as the first command from a new peer. We were seeing PING, JOIN, and other commands arriving.</para>

<para>This turned out to be due to CZMQ's ephemeral port logic. An ephemeral port is just a dynamically assigned port that a service can get rather than asking for a fixed port number. A POSIX system usually assigns ephemeral ports in the range 0xC000 to 0xFFFF. CZMQ's logic is to look for a free port in this range, bind to that, and return the port number to the caller.</para>

<para>Which sounds fine, until you get one node stopping and another node starting, close together, and the new node getting the port number of the old node. Remember that &Oslash;MQ tries to re-establish a broken connection. So when the first node stopped, its peers would retry to connect. When the new node appears on that same port, suddenly all the peers connect to it, and start chatting like they're old buddies.</para>

<para>It's a general problem that affects any larger-scale dynamic &Oslash;MQ application. There are a number of plausible answers. One is to not reuse ephemeral ports, which is easier said than done when you have multiple processes on one system. Another solution would be to select a random port each time, which at least reduces the risk of hitting a just-freed port. This brings the risk of a garbage connection down to perhaps 1/1000 but it's still there. Perhaps the best solution is to accept that this can happen, understand the causes, and deal with it on the application level.</para>

<para>We have a stateful protocol that always starts with a HELLO command. We know that it's possible for peers to connect to us, thinking we're an existing node that went away and came back, and send us other commands. Step one is, when we discover a new peer, to destroy any existing peer connected to the same endpoint. It's not a full answer but it's polite, at least. Step two is to ignore anything coming in from a new peer until that peer says HELLO.</para>

<para>This doesn't require any change to the protocol but it has to be specified in the protocol when we come to it: due to the way &Oslash;MQ connections work, it's possible to receive unexpected commands from a <emphasis>well-behaving</emphasis> peer and there is no way to return an error code, or otherwise tell that peer to reset its connection. Thus, a peer must discard any command from a peer until it receives HELLO.</para>

<para>In fact, if you draw this on a piece of paper and think it through, you'll see that you never get a HELLO from such a connection. The peer will send PINGs and JOINs and LEAVEs and then eventually time-out and close, as it fails to get any heartbeats back from us.</para>

<para>You'll also see that there's no risk of confusion, no way for commands from two peers to get mixed into a single stream on our DEALER socket.</para>
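
<para>The "discard until HELLO" rule needs only one flag per peer. This is a sketch with hypothetical names, not the ZyRE code; it shows the rule replacing the earlier assertion that the first command must be HELLO.</para>

<programlisting language="c">
#include &lt;stdbool.h&gt;
#include &lt;string.h&gt;

typedef struct {
    bool ready;                 //  True once the peer has said HELLO
} peer_gate_t;

//  Decide whether to process a command from a peer. Commands that
//  arrive before HELLO come from a stale connection to a reused
//  endpoint, so we drop them silently rather than assert.
bool
peer_gate_accept (peer_gate_t *self, const char *command)
{
    if (strcmp (command, "HELLO") == 0)
        self-&gt;ready = true;
    return self-&gt;ready;
}
</programlisting>

<para>A stale peer that sends PING, JOIN, and LEAVE without ever sending HELLO is rejected every time, and eventually times out on its own side.</para>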

<para>When you are satisfied this works, we're ready to move on. This version is tagged in the repository as v0.3.0 and you can <ulink url="https://github.com/zeromq/zyre/tags">download the tarball</ulink> if you want to check what the code looked like at this stage.</para>

<para>Note that doing heavy simulation of lots of nodes will probably cause your process to run out of file handles, giving an assertion failure in libzmq. I raised the per-process limit to 30,000 by running (on my Linux box):</para>

<screen>ulimit -n 30000
</screen>

</sect2>
<sect2>
<title>Tracing Activity</title>
<para>To debug the kinds of problems we saw here, we need extensive logging. There's a lot happening in parallel but every problem can be traced down to a specific exchange between two nodes, consisting of a set of events that happen in strict sequence. We know how to make very sophisticated logging but as usual it's wiser to make just what we need, no more. We have to capture:</para>

<itemizedlist>
  <listitem><para>Time and date for each event.</para></listitem>
  <listitem><para>In which node the event occurred.</para></listitem>
  <listitem><para>The peer node, if any.</para></listitem>
  <listitem><para>What the event was (e.g. which command arrived).</para></listitem>
  <listitem><para>Event data, if any.</para></listitem>
</itemizedlist>
<para>The very simplest technique is to print the necessary information to the console, with a timestamp. That's the approach I used. Then it's simple to find the nodes affected by a failure, filter the log file for only messages referring to them, and see exactly what happened.</para>
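
<para>A trace line that carries the five items above might be formatted as follows. The layout and the function name are assumptions made for illustration; the text does not specify the format the ZyRE tester actually prints.</para>

<programlisting language="c">
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;

//  Format one trace line: timestamp, node, peer (or "-"), event, data.
//  Returns the number of characters snprintf would have written.
int
trace_format (char *buffer, size_t size, time_t when,
              const char *node, const char *peer,
              const char *event, const char *data)
{
    char stamp [20];            //  "YYYY-MM-DD HH:MM:SS" plus terminator
    struct tm *utc = gmtime (&amp;when);
    strftime (stamp, sizeof (stamp), "%Y-%m-%d %H:%M:%S", utc);
    return snprintf (buffer, size, "%s node=%s peer=%s event=%s data=%s",
                     stamp, node, peer? peer: "-", event, data? data: "");
}
</programlisting>

<para>Putting the node and peer identifiers in fixed positions on every line is what makes the later filtering step a one-line text search.</para>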

</sect2>
<sect2>
<title>Dealing with Blocked Peers</title>
<para>In any performance-sensitive &Oslash;MQ architecture you need to solve the problem of flow control. You cannot simply send unlimited messages to a socket and hope for the best. At the one extreme, you can exhaust memory. This is a classic failure pattern for a message broker: one slow client stops receiving messages; the broker starts to queue them, and eventually exhausts memory and the whole process dies. At the other extreme, the socket drops messages, or blocks, as you hit the high-water mark.</para>

<para>With ZyRE we want to distribute messages to a set of peers, and we want to do this fairly. Using a single ROUTER socket for output would be problematic since any one blocked peer would block outgoing traffic to all peers. TCP does have good algorithms for spreading the network capacity across a set of connections. And we're using a separate DEALER socket to talk to each peer, so in theory each DEALER socket will send its queued messages in the background reasonably fairly.</para>

<para>The normal behavior of a DEALER socket that hits its high-water mark is to block. This is usually ideal, but it's a problem for us here. Our current interface design uses one thread that distributes messages to all peers. If one of those send calls were to block, all output would block.</para>

<para>There are a few options to avoid blocking. One is to use <literal>zmq_poll()</literal> on the whole set of DEALER sockets, and only write to sockets that are ready. I don't like this for a few reasons. First, the DEALER socket is hidden inside the peer class, and it is cleaner to allow each class to handle this opaquely. Second, what do we do with messages we can't yet deliver to a DEALER socket? Where do we queue them? Third, it seems to be side-stepping the issue. If a peer is really so busy it can't read its messages, something is wrong. Most likely, it's dead.</para>

<para>So no polling for output. The second option is to use one thread per peer. I quite like the idea of this, since it fits into the &Oslash;MQ design pattern of "do one thing in one thread". But this is going to create a <emphasis>lot</emphasis> of threads (square of the number of nodes we start) in the simulation, and we're already running out of file handles.</para>

<para>A third option is to use a non-blocking send. This is nicer and it's the solution I choose. We can then provide each peer with a reasonable outgoing queue (the HWM) and if that gets full, treat it as a fatal error on that peer. This will work for smaller messages. If we're sending large chunks -- e.g. for file transfer -- we'll need a credit-based flow control on top.</para>
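
<para>The policy itself, a bounded queue where overflow is fatal for that one peer, can be modeled without any sockets. The sketch below is a simulation with hypothetical names; it does not call &Oslash;MQ, and only imitates the way a send with a zero timeout fails with EAGAIN.</para>

<programlisting language="c">
#include &lt;errno.h&gt;
#include &lt;stdbool.h&gt;

typedef struct {
    int hwm;                    //  Most messages we will queue for the peer
    int queued;                 //  Messages waiting to go out
    bool failed;                //  Set when the queue overflowed
} mailbox_t;

//  Queue one message without blocking. Returns 0 on success; when
//  the queue is full, marks the peer as failed, sets errno to EAGAIN
//  and returns -1, in the way a send call with a zero timeout would.
int
mailbox_send (mailbox_t *self)
{
    if (self-&gt;failed || self-&gt;queued &gt;= self-&gt;hwm) {
        self-&gt;failed = true;
        errno = EAGAIN;
        return -1;
    }
    self-&gt;queued++;
    return 0;
}
</programlisting>

<para>The calling thread never waits: a send either succeeds at once or marks that single peer as failed, so one stuck peer cannot hold up output to the others.</para>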

<para>First step therefore is to prove to ourselves that we can turn the normal blocking DEALER socket into a non-blocking socket. This example creates a normal DEALER socket, connects it to some endpoint (so there's an outgoing pipe and the socket will accept messages), sets the high-water mark to four, and then sets the send timeout to zero:</para>

<example id="eagain-php">
<title>Checking EAGAIN on DEALER socket (eagain.php)</title>
<programlisting language="php">
(This example still needs translation into PHP)
</programlisting>

</example>
<para>When we run this, we send four messages successfully (they go nowhere, the socket just queues them), and then we get a nice EAGAIN error:</para>

<screen>Sending message 0
Sending message 1
Sending message 2
Sending message 3
Sending message 4
Resource temporarily unavailable
</screen>

<para>Next step is to decide what a reasonable high water mark would be for a peer. ZyRE is meant for human interactions, that is applications which chat at a low frequency. Perhaps two games, or a shared drawing program. I'd expect a hundred messages per second to be quite a lot. Our "peer is really dead" timeout is 10 seconds. So a high-water mark of 1,000 seems fair.</para>

<para>Rather than set a fixed HWM, or use the default (which coincidentally also happens to be 1,000), we calculate it as 100 * the timeout, that is, 100 messages per second for the duration of the expiry timeout. Here's how we configure a new DEALER socket for a peer:</para>

<programlisting language="c">
//  Create new outgoing socket (drop any messages in transit)
self-&gt;mailbox = zsocket_new (self-&gt;ctx, ZMQ_DEALER);

//  Set our caller 'From' identity so that receiving node knows
//  who each message came from.
zsocket_set_identity (self-&gt;mailbox, reply_to);

//  Set a high-water mark that allows for reasonable activity
zsocket_set_sndhwm (self-&gt;mailbox, PEER_EXPIRED * 100);

//  Send messages immediately or return EAGAIN
zsocket_set_sndtimeo (self-&gt;mailbox, 0);

//  Connect through to peer node
zsocket_connect (self-&gt;mailbox, "tcp://%s", endpoint);
</programlisting>

<para>And finally, what do we do when we get an EAGAIN on a peer? We don't need to go through all the work of destroying the peer since the interface will do this automatically if it doesn't get any message from the peer within the expiry timeout. Just dropping the last message seems very weak - it will give the receiving peer gaps.</para>

<para>I'd prefer a more brutal response. Brutal is good because it forces the design to a "good" or "bad" decision rather than a fuzzy "should work but to be honest there are a lot of edge cases so let's worry about it later". Destroy the socket, disconnect the peer, and stop sending anything to it. The peer will eventually have to reconnect and re-initialize any state. It's kind of an assertion that 100 messages a second is enough for anyone. So, in the <literal>zre_peer_send</literal> method:</para>

<programlisting language="c">
int
zre_peer_send (zre_peer_t *self, zre_msg_t **msg_p)
{
    assert (self);
    if (self-&gt;connected) {
        if (zre_msg_send (msg_p, self-&gt;mailbox) &amp;&amp; errno == EAGAIN) {
            zre_peer_disconnect (self);
            return -1;
        }
    }
    return 0;
}
</programlisting>

<para>Where the disconnect method looks like this:</para>

<programlisting language="c">
void
zre_peer_disconnect (zre_peer_t *self)
{
    //  If connected, destroy socket and drop all pending messages
    assert (self);
    if (self-&gt;connected) {
        zsocket_destroy (self-&gt;ctx, self-&gt;mailbox);
        free (self-&gt;endpoint);
        self-&gt;endpoint = NULL;
        self-&gt;connected = false;
    }
}
</programlisting>

</sect2>
</sect1>
<sect1>
<title>Distributed Logging and Monitoring</title>
<para>Let's look at logging and monitoring. If you've ever managed a real server (like a web server) you know how vital it is to have a capture of what is going on. There is a long list of reasons, not least:</para>

<itemizedlist>
  <listitem><para>To measure the performance of the system over time.</para></listitem>
  <listitem><para>To see what kinds of work are done the most, to optimize performance.</para></listitem>
  <listitem><para>To track errors and how often they occur.</para></listitem>
  <listitem><para>To do postmortems of failures.</para></listitem>
  <listitem><para>To provide an audit trail in case of dispute.</para></listitem>
</itemizedlist>
<para>Let's scope this in terms of the problems we think we'll have to solve:</para>

<itemizedlist>
  <listitem><para>We want to track key events (such as nodes leaving and rejoining the network).</para></listitem>
  <listitem><para>For each event we want to track a consistent set of data: the date/time, node which observed the event, peer that created the event, type of event itself, and other event data.</para></listitem>
  <listitem><para>We want to be able to switch logging on and off at any time.</para></listitem>
  <listitem><para>We want to be able to process log data mechanically, since it will be sizable.</para></listitem>
  <listitem><para>We want to be able to monitor a running system, that is, collect logs and analyze in real-time.</para></listitem>
  <listitem><para>We want log traffic to have minimal effect on the network.</para></listitem>
  <listitem><para>We want to be able to collect log data at a single point on the network.</para></listitem>
</itemizedlist>
<para>As in any design, some of these requirements are hostile to each other. For example, collecting log data in real-time means sending it over the network, which will affect network traffic to some extent. However as in any design these requirements are also hypothetical until we have running code, so we can't take them too seriously. We'll aim for <emphasis>plausibly good enough</emphasis> and improve over time.</para>

<sect2>
<title>A Plausible Minimal Implementation</title>
<para>Arguably, just dumping log data to disk is one solution, and it's what most mobile applications do (using 'debug logs'). But most failures require correlation of events from two nodes. This means searching lots of debug logs by hand to find the ones that matter. It's not a very clever approach.</para>

<para>We want to send log data somewhere central, either immediately, or opportunistically (i.e. store and forward). For now, let's focus on immediate logging. My first idea, when it comes to sending data, is to use ZyRE for this. Just send log data to a group called "LOG", and hope someone collects it.</para>

<para>But using ZyRE to log ZyRE itself is a Catch-22. Who logs the logger? What if we want a verbose log of every message sent? Do we include logging messages in that, or not? It quickly gets messy. We want a logging protocol that's independent of ZyRE's main ZRE protocol. The simplest approach is a PUB-SUB protocol, where all nodes publish log data on a PUB socket, and a collector picks that up via a SUB socket (<xref linkend="figure-68"/>).</para>

<figure id="figure-68">
    <title>Distributed Log Collection</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig68.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>The collector can, of course, run on any node. This gives us a nice range of use cases:</para>

<itemizedlist>
  <listitem><para>A passive log collector that stores log data on disk for eventual statistical analysis; this would be a PC with sufficient hard disk space for weeks or months of log data.</para></listitem>
  <listitem><para>A collector that stores log data into a database where it can be used in real-time by other applications. This might be overkill for a small workgroup but would be snazzy for tracking the performance of larger groups. The collector could collect log data over WiFi and then forward it over Ethernet to a database somewhere.</para></listitem>
  <listitem><para>A live meter application that joined the ZyRE network and then collected log data from nodes, showing events and statistics in real-time.</para></listitem>
</itemizedlist>
<para>Next question is how to interconnect the nodes and collector. Which side binds, and which connects? Both ways will work here but it's marginally better if the PUB sockets connect to the SUB socket. If you recall, &Oslash;MQ's internal buffers only pop into existence when there are connections. It means as soon as a node connects to the collector it can start sending log data without loss.</para>

<para>How do we tell nodes what endpoint to connect to? We may have any number of collectors on the network, and they'll be using arbitrary network addresses and ports. We need some kind of service announcement mechanism, and here we can use ZyRE to do the work for us. We could use group messaging, but it seems neater to build service discovery into the ZRE protocol itself. It's nothing complex: if a node provides a service X, it can tell other nodes about that when it sends them a HELLO command.</para>

<para>We'll extend the HELLO command with a <emphasis>headers</emphasis> field that holds a set of name=value pairs. Let's define that the header <literal>LOG_COLLECTOR</literal> specifies the collector endpoint (the SUB socket). A node that acts as a collector can add a header like this (for example):</para>

<screen>LOG_COLLECTOR=tcp://192.168.1.122:9992
</screen>

<para>When another node sees this header it simply connects its PUB socket to that endpoint. Log data now gets distributed to all collectors (zero or more) on the network.</para>

<para>Making this first version was fairly simple and took half a day. Here are the pieces we had to make or change:</para>

<itemizedlist>
  <listitem><para>We made a new class <literal>zre_log</literal> that accepts log data and manages the connection to the collector, if any.</para></listitem>
  <listitem><para>We added some basic management for peer headers, taken from the HELLO command.</para></listitem>
  <listitem><para>When a peer has the LOG_COLLECTOR header, we connect to the endpoint it specifies.</para></listitem>
  <listitem><para>Where we were logging to stdout, we switched to logging via the zre_log class.</para></listitem>
  <listitem><para>We extended the interface API with a method that lets the application set headers.</para></listitem>
  <listitem><para>We wrote a simple logger application that manages the SUB socket and sets the LOG_COLLECTOR header.</para></listitem>
  <listitem><para>We send our own headers when we send a HELLO command.</para></listitem>
</itemizedlist>
<para>This version is tagged in the ZyRE repository as v0.4.0 and you can <ulink url="https://github.com/zeromq/zyre/tags">download the tarball</ulink> if you want to check what the code looked like at this stage.</para>

<para>At this stage the log message is just a string. We'll make more professionally structured log data in a little while.</para>

<para>First, a note on dynamic ports. In the zre_tester app that we use for testing, we create and destroy interfaces aggressively. One consequence is that a new interface can easily reuse a port that was just freed by another application. If there's a &Oslash;MQ socket somewhere trying to connect to this port, the results can be hilarious.</para>

<para>Here's the scenario I had, which caused a few minutes' confusion. The logger was running on a dynamic port:</para>

<itemizedlist>
  <listitem><para>Start logger application</para></listitem>
  <listitem><para>Start tester application</para></listitem>
  <listitem><para>Stop logger</para></listitem>
  <listitem><para>Tester receives invalid message (and asserts as designed)</para></listitem>
</itemizedlist>
<para>When the tester created a new interface, it reused the dynamic port freed by the just-stopped logger, and suddenly the interface began to receive log data from nodes on its mailbox. We saw a similar situation before, where a new interface could reuse the port freed by an old interface, and start getting old data.</para>

<para>The lesson is, if you use dynamic ports, be prepared to receive random data from ill-informed applications that are reconnecting to you. Switching to a static port stopped the misbehaving connection. That's not a full solution though. There are two more weaknesses:</para>

<itemizedlist>
  <listitem><para>As I write this, libzmq doesn't check socket types when connecting. The <ulink url="http://rfc.zeromq.org/spec:15">ZMTP/2.0 protocol</ulink> does announce each peer's socket type, so this check is doable.</para></listitem>
  <listitem><para>The ZRE protocol has no fail-fast (assertion) mechanism; we need to read and parse a whole message before realizing that it's invalid.</para></listitem>
</itemizedlist>
<para>Let's address the second one. Socket pair validation wouldn't solve this fully anyhow.</para>

</sect2>
<sect2>
<title>Protocol Assertions</title>
<para>As Wikipedia puts it, "Fail-fast systems are usually designed to stop normal operation rather than attempt to continue a possibly flawed process." A protocol like HTTP has a fail-fast mechanism in that every request a client sends to an HTTP server must begin with a recognized method token such as "GET" or "POST". If it doesn't, the server can close the connection without reading anything more.</para>

<para>Our ROUTER socket is not connection-oriented so there's no way to "close the connection" when we get bad incoming messages. However, we can throw out the entire message if it's not valid. The problem is going to be worse when we use ephemeral ports, but it applies broadly to all protocols.</para>

<para>So let's define a <emphasis>protocol assertion</emphasis> as being a unique signature that we place at the start of each message, and which identifies the intended protocol. When we read a message, we check the signature, and if it's not what we expect, we discard the message silently. A good signature should be hard to confuse with regular data, and give us enough space for a number of protocols.</para>

<para>I'm going to use a 16-bit signature consisting of a 12-bit pattern and a 4-bit protocol ID (<xref linkend="figure-69"/>). The pattern %xAAA is meant to stay away from values we might otherwise expect to see at the start of a message: %x00, %xFF, and printable characters.</para>

<figure id="figure-69">
    <title>Protocol Signature</title>
    <mediaobject>
        <imageobject>
            <imagedata fileref="images/fig69.eps" format="EPS" width="4.8in"/>
        </imageobject>
    </mediaobject>
</figure>

<para>As our protocol codec is generated, it's relatively easy to add this assertion. The logic is:</para>

<itemizedlist>
  <listitem><para>Get first frame of message.</para></listitem>
  <listitem><para>Check if first two bytes are %xAAA followed by the expected 4-bit protocol ID.</para></listitem>
  <listitem><para>If so, continue to parse rest of message.</para></listitem>
  <listitem><para>If not, skip all 'more' frames, get the first frame of the next message, and repeat.</para></listitem>
</itemizedlist>
<para>And to test, I switched the logger back to using an ephemeral port. The interface now properly detects and discards any messages that don't have a valid signature. If the message has a valid signature and is <emphasis>still</emphasis> wrong, that's a proper bug.</para>
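
<para>The signature check itself is small. Here's a minimal sketch of it in plain C, not the generated codec: the function names are hypothetical, and it assumes the two signature bytes arrive in network byte order (most significant byte first), which the text above does not specify:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

//  Upper 12 bits of every signature (the %xAAA pattern)
#define SIGNATURE_PATTERN 0xAAA0

//  Build the 16-bit signature for a 4-bit protocol ID
static uint16_t
protocol_signature (uint8_t protocol_id)
{
    assert (protocol_id &lt;= 0x0F);
    return (uint16_t) (SIGNATURE_PATTERN | protocol_id);
}

//  Return true if the frame starts with the signature we expect.
//  Assumption: the signature is sent most significant byte first.
static bool
frame_has_signature (const uint8_t *frame, size_t size, uint8_t protocol_id)
{
    if (frame == NULL || size &lt; 2)
        return false;
    uint16_t received = (uint16_t) ((frame [0] &lt;&lt; 8) | frame [1]);
    return received == protocol_signature (protocol_id);
}
</programlisting>

<para>A receiver would call <literal>frame_has_signature()</literal> on the first frame of each message and, if it returns false, drop that frame and any remaining 'more' frames before reading the next message.</para>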

</sect2>
<sect2>
<title>Binary Logging Protocol</title>
<para>Now that we have the logging framework working properly, let's look at the protocol itself. Sending strings around the network is simple but when it comes to WiFi we really cannot afford to waste bandwidth. We have the tools to work with efficient binary protocols, so let's design one for logging.</para>

<para>This is going to be a PUB-SUB protocol and in &Oslash;MQ/3.x we do publisher-side filtering. This means we can do multi-level logging (errors, warnings, information), if we put the logging level at the start of the message. So, our message starts with a protocol signature (two bytes), a logging level (one byte), and an event type (one byte).</para>

<para>In the first version we send UUID strings to identify each node. As text, these are 32 characters each. We can send binary UUIDs but it's still verbose and wasteful. In the log files we don't care about the node identifiers. All we need is some way to correlate events. So what's the shortest identifier we can use that's going to be unique enough for logging? I say unique 'enough' because while we really want zero chance of duplicate UUIDs in the live code, log files are not so critical.</para>

<para>Looking for the simplest plausible answer: hash the IP address and port into a 2-byte value. We'll get some collisions but they'll be rare. How rare? As a quick sanity check I write a small program that generates a bunch of addresses and hashes them into 16-bit values, looking for collisions. To be sure, I generate 10,000 addresses across a small number of IP addresses (matching a simulation set-up), and then across a large number of addresses (matching a real-life setup). The hashing algorithm is a <emphasis>modified Bernstein</emphasis>:</para>

<programlisting language="c">
uint16_t hash = 0;
while (*endpoint)
    hash = 33 * hash ^ *endpoint++;
</programlisting>

<para>Over several runs I don't get any collisions, so this will work as an identifier for the log data. This adds four bytes (two for the node recording the event, and two for its peer in events that come from a peer).</para>
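
<para>If you want to repeat this kind of sanity check yourself, here's a rough sketch of a collision counter built around the same hash. It is not the program used above; the function names are hypothetical, and how you generate the endpoint strings is left to you:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

//  Modified Bernstein hash of an endpoint string, folded into 16 bits
static uint16_t
endpoint_hash (const char *endpoint)
{
    assert (endpoint);
    uint16_t hash = 0;
    while (*endpoint)
        hash = (uint16_t) (33 * hash ^ (unsigned char) *endpoint++);
    return hash;
}

//  Count endpoints whose hash was already produced by an earlier one
static size_t
count_collisions (const char **endpoints, size_t count)
{
    static uint8_t seen [65536];        //  One flag per possible hash
    memset (seen, 0, sizeof (seen));
    size_t collisions = 0;
    for (size_t index = 0; index &lt; count; index++) {
        uint16_t hash = endpoint_hash (endpoints [index]);
        if (seen [hash])
            collisions++;
        else
            seen [hash] = 1;
    }
    return collisions;
}
</programlisting>

<para>Bear in mind that a 16-bit space has only 65,536 values, so by the birthday bound the odds of at least one collision pass 50% at roughly 300 distinct endpoints. It's worth running a check like this against the number of nodes you actually expect on one network.</para>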

<para>Next, we want to store the date and time of the event. The POSIX time_t type used to be 32 bits, but since that overflows in 2038, it's now a 64-bit value on most platforms. We'll use this; there's no need for millisecond resolution in a log file: events are sequential, clocks are unlikely to be that tightly synchronized, and network latencies mean that precise times aren't that meaningful.</para>

<para>We're up to 16 bytes, which is decent. Finally we want to allow some additional data, formatted as text and depending on the type of event. Putting this all together gives the following message specification:</para>

<screen>&lt;class
    name = "zre_log_msg"
    script = "codec_c.gsl"
    signature = "2"
&gt;
This is the ZRE logging protocol - raw version.
&lt;include filename = "license.xml" /&gt;

&lt;!-- Protocol constants --&gt;
&lt;define name = "VERSION" value = "1" /&gt;

&lt;define name = "LEVEL_ERROR" value = "1" /&gt;
&lt;define name = "LEVEL_WARNING" value = "2" /&gt;
&lt;define name = "LEVEL_INFO" value = "3" /&gt;

&lt;define name = "EVENT_JOIN" value = "1" /&gt;
&lt;define name = "EVENT_LEAVE" value = "2" /&gt;
&lt;define name = "EVENT_ENTER" value = "3" /&gt;
&lt;define name = "EVENT_EXIT" value = "4" /&gt;

&lt;message name = "LOG" id = "1"&gt;
    &lt;field name = "level" type = "number" size = "1" /&gt;
    &lt;field name = "event" type = "number" size = "1" /&gt;
    &lt;field name = "node" type = "number" size = "2" /&gt;
    &lt;field name = "peer" type = "number" size = "2" /&gt;
    &lt;field name = "time" type = "number" size = "8" /&gt;
    &lt;field name = "data" type = "string" /&gt;
Log an event
&lt;/message&gt;

&lt;/class&gt;
</screen>

<para>This generates 800 lines of perfect binary codec (the zre_log_msg class). The codec does protocol assertions just like the main ZRE protocol does. Code generation has a fairly steep learning curve but it makes it so much easier to push your designs past "amateur" into "professional".</para>
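
<para>To make the 16-byte fixed header concrete, here's a hand-written sketch that packs the fields in the order described in the text: signature, level, event, node, peer, and time. This is an illustration only. The generated codec may lay out its frames differently, and the function names and the choice of network byte order are assumptions:</para>

<programlisting language="c">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define LOG_SIGNATURE   (0xAAA0 | 2)    //  %xAAA pattern + protocol ID 2
#define LOG_HEADER_SIZE 16              //  2 + 1 + 1 + 2 + 2 + 8 bytes

//  Write 'size' bytes of 'value' at 'dest', most significant byte first,
//  and return the position just past the bytes written
static uint8_t *
put_number (uint8_t *dest, uint64_t value, size_t size)
{
    for (size_t index = 0; index &lt; size; index++)
        dest [index] = (uint8_t) (value &gt;&gt; (8 * (size - 1 - index)));
    return dest + size;
}

//  Encode the fixed-size part of a log message into 'buffer', which
//  must hold at least LOG_HEADER_SIZE bytes; returns bytes written
static size_t
log_header_encode (uint8_t *buffer, uint8_t level, uint8_t event,
                   uint16_t node, uint16_t peer, uint64_t timestamp)
{
    assert (buffer);
    uint8_t *needle = buffer;
    needle = put_number (needle, LOG_SIGNATURE, 2);
    needle = put_number (needle, level, 1);
    needle = put_number (needle, event, 1);
    needle = put_number (needle, node, 2);
    needle = put_number (needle, peer, 2);
    needle = put_number (needle, timestamp, 8);
    return (size_t) (needle - buffer);
}
</programlisting>

<para>Because the level sits right after the signature, a subscriber can filter on the first three bytes to receive, say, only errors for this protocol.</para>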

<para><emphasis>More coming soon...</emphasis></para>

</sect2>
</sect1>
</chapter>
<chapter id="postface">
<title>Postface</title>
<sect1>
<title>Tales from Out There</title>
<para>I asked some of the contributors to the Guide to tell us what they were doing with &Oslash;MQ. Here are their stories.</para>

<sect2>
<title>Rob Gagnon's Story</title>
<para>"We use &Oslash;MQ to assist in aggregating thousands of events occurring every minute across our global network of telecommunications servers so that we can accurately report and monitor for situations that require our attention. &Oslash;MQ made the development of the system not only easier, but faster to develop, and more robust and fault-tolerant than we had originally planned in our original design.</para>

<para>"We're able to easily add and remove clients from the network without the loss of any message. If we need to enhance the server portion of our system, we can stop and restart it as well, without having to worry about stopping all of the clients first. The built-in buffering of &Oslash;MQ makes this all possible."</para>

</sect2>
<sect2>
<title>Tom van Leeuwen's Story</title>
<para>"I was looking for creating some kind of service bus connecting all kinds of services together. There were already some products that implemented a broker, but they did not have the functionality I wanted/needed. By accident I stumbled upon &Oslash;MQ which is awesome. It's very lightweight, lean, simple and easy to follow since the zguide is very complete and reads very well. I've actually implemented the Titanic pattern and the Majordomo broker with some additions (client/worker authentication and workers sending a catalog explaining what they provide and how they should be addressed).</para>

<para>"The beautiful thing about &Oslash;MQ is the fact that it is a library and not an application. You can mold it however you like and it simply puts boring things like queuing, reconnecting, tcp sockets and such to the background, making sure you can concentrate on what is important for you. I've implemented all kinds of workers/clients and the broker in Ruby, because that is the main language we use for development, but also some php clients to connect to the bus from existing php webapps. We use this service bus for cloud services connecting all kinds of platform devices to a service bus exposing functionality for automation.</para>

<para>"&Oslash;MQ is very easy to understand and if you spend a day in the zguide, you'll have good knowledge of how it works. I'm a network engineer, not a software developer, but managed to create a very nice solution for our automation needs! &Oslash;MQ: Thank you very much!"</para>

</sect2>
<sect2>
<title>Michael Jakl's Story</title>
<para>"We use &Oslash;MQ for distributing millions of documents per day in our distributed processing pipeline. We started out with big message queuing brokers that had their own respective issues and problems. In the quest of simplifying our architecture, we chose &Oslash;MQ to do the wiring. So far it had a huge impact in how our architecture scales and how easy it is to change/move the components. The plethora of language bindings lets us choose the right tool for the job without sacrificing interoperability in our system. We don't use a lot of sockets (less than 10 in our whole application), but that's all we needed to split a huge monolithic application into small independent parts.</para>

<para>"All in all, &Oslash;MQ lets me keep my sanity and helps my customers to stay within budget."</para>

</sect2>
<sect2>
<title>Vadim Shalts's Story</title>
<para>"I am a team leader at ActForex, which develops software for financial markets. Due to the nature of our domain, we need to process large volumes of prices quickly. In addition, it's extremely critical to minimize latency in processing orders and prices. Achieving high throughput is not enough. Everything must be handled in soft real-time with a predictable ultra low latency per price. The system consists of multiple components that exchange messages. Each price can pass through many processing stages, each of which increases total latency. As a consequence, low and predictable latency of messaging between components becomes a key factor of our architecture.</para>

<para>"We investigated different solutions to find one suitable for our needs. We tried different message brokers (RabbitMQ, ActiveMQ Apollo, Kafka), but failed to reach a low and predictable latency with any of them. In the end, we chose ZeroMQ used in conjunction with ZooKeeper for service discovery. Complex coordination with ZeroMQ requires a relatively large effort and a good understanding, as a result of the natural complexity of multi-threading. We found that an external agent like ZooKeeper is a better choice for service discovery and coordination while ZeroMQ can be used primarily for simple messaging. ZeroMQ fit perfectly into our architecture. It allowed us to achieve the desired latency with minimal effort. It saved us from a bottleneck in the processing of messages and made processing time very stable and predictable.</para>

<para>"I can decidedly recommend ZeroMQ for solutions where low latency is important."</para>

</sect2>
</sect1>
<sect1>
<title>How the Guide Happened</title>
<para>When I set out to write the Guide, we were still debating the pros and cons of forks and pull requests in the &Oslash;MQ community. Today, for what it's worth, this argument seems settled: the "liberal" policy which we adopted for libzmq in early 2012 broke our dependency on a single prime author, and opened the floor to dozens of new contributors. More profoundly, it allowed us to move to a gently organic evolutionary model, very different from the older forced-march model.</para>

<para>The reason I was confident this would work was that our work on the Guide had, for a year or more, shown the way. True, the text is my own work, which is perhaps as it should be. Writing is not programming. When we write, we tell a story and one doesn't want different voices telling one tale, it feels strange.</para>

<para>For me the real long-term value of the Guide is the repository of examples: about 65,000 lines of code in 24 different languages. It's partly about making &Oslash;MQ accessible to more people. People already refer to the Python and PHP example repositories -- two of the most complete -- when they want to tell others how to learn &Oslash;MQ. But it's also about learning programming languages.</para>

<para>Here's a loop of code in Tcl:</para>

<programlisting language="tcl">
while {1} {
    # Process all parts of the message
    zmq message message
    frontend recv_msg message
    set more [frontend getsockopt RCVMORE]
    backend send_msg message [expr {$more?"SNDMORE":""}]
    message close
    if {!$more} {
        break ; # Last message part
    }
}
</programlisting>

<para>And the same loop in Lua:</para>

<programlisting language="lua">
while true do
    --  Process all parts of the message
    local msg = frontend:recv()
    if (frontend:getopt(zmq.RCVMORE) == 1) then
        backend:send(msg, zmq.SNDMORE)
    else
        backend:send(msg, 0)
        break;      --  Last message part
    end
end
</programlisting>

<para>And this particular example (rrbroker) is in C#, C++, CL, Clojure, Erlang, F#, Go, Haskell, Haxe, Java, Lua, Node.js, Perl, PHP, Python, Ruby, Scala, Tcl, and of course C. This code base, all licensed as open source under the MIT/X11 license, may form the basis for other books, or projects.</para>

<para>But what this collection of translations says most profoundly is this: the language you choose is a detail, even a distraction. The power of &Oslash;MQ lies in the patterns it gives you and lets you build, and these transcend the comings and goings of languages. My goal as a software and social architect is to build structures that can last generations. There seems no point in aiming for mere decades.</para>

</sect1>
<sect1>
<title>Removing Friction</title>
<para>I'll explain the technical tool chain we used in terms of the friction we removed. With the Guide we're telling a story and the goal is to reach as many people as possible, as cheaply and smoothly as we can.</para>

<para>The core idea was to host the Guide on github and make it easy for anyone to contribute. It turned out to be more complex than that, however.</para>

<para>Let's start with the division of labor. I'm a good writer and can produce endless amounts of decent text quickly. But what was impossible for me was to provide the examples in other languages. Since the core &Oslash;MQ API is in C, it seemed logical to write the original examples in C. Also, C is a neutral choice; it's perhaps the only language that doesn't create strong emotions.</para>

<para>How to encourage people to make translations of the examples? We tried a few approaches and finally what worked best was to offer a "choose your language" link on every single example, in the text, which took people either to the translation, or to a page explaining how they could contribute. The way it usually works is that as people learn &Oslash;MQ in their preferred language, they contribute a handful of translations, or fixes to the existing ones.</para>

<para>At the same time I noticed a few people quite determinedly translating <emphasis>every single</emphasis> example. This was mainly binding authors who realized that the examples were a great way to encourage people to use their bindings. For their efforts, I extended the scripts to produce language-specific versions of the Guide. Instead of including the C code, we'd include the Python, or PHP code. Lua and Haxe also got their dedicated Guides.</para>

<para>Once we have an idea of who works on what, we know how to structure the work itself. It's clear that to write and test an example, what you want to work on is <emphasis>source code</emphasis>. So we import this source code when we build the Guide, and that's how we make language-specific versions.</para>

<para>I like to write in a plain text format. It's fast and works well with source control systems like git. Since the main platform for our websites is Wikidot, I write using Wikidot's very readable markup format.</para>

<para>At least in the first chapters, it was important to draw pictures to explain the flow of messages between peers. I found Ditaa, a lovely tool that chews up line-drawings and spits out elegant graphics. Having the graphics in the text, as text, makes it remarkably easy to work.</para>

<para>By now you'll realize that the Guide toolchain is highly customized, though it uses a lot of external tools. All are available on Ubuntu, which is a mercy, and the whole toolchain is in the Guide repository in the bin subdirectory.</para>

<para>Let's walk through the editing and publishing process. Here is how we produce the online version:</para>

<screen>bin/mkguide
</screen>

<para>Which works as follows:</para>

<itemizedlist>
  <listitem><para>The original text sits in a series of text files (one per chapter).</para></listitem>
  <listitem><para>The examples sit in the examples subdirectory, classified per language.</para></listitem>
  <listitem><para>We take the text and process it into a set of Wikidot-ready files. We do this for each of the languages that get their own Guide version.</para></listitem>
  <listitem><para>We extract the graphics and call Ditaa on each one to produce image files, which we store in the images subdirectory.</para></listitem>
  <listitem><para>We extract inline listings (which are not translated) and store these in the listings subdirectory.</para></listitem>
  <listitem><para>We use pygmentize on each example and listing to create a marked-up page in Wikidot format.</para></listitem>
  <listitem><para>We upload all changed files to the Guide wiki using the Wikidot API.</para></listitem>
</itemizedlist>
<para>Doing this from scratch takes a while. So we store the SHA1 signatures of every image, listing, example, and text file, and only process and upload changes, and that makes it easy to publish a new version of the Guide when people make new contributions.</para>

<para>To produce the PDF and Epub formats we do this:</para>

<screen>bin/mkpdfs
</screen>

<para>Which works as follows:</para>

<itemizedlist>
  <listitem><para>We use the mkbook script on all the input files to produce a DocBook output.</para></listitem>
  <listitem><para>We push the DocBook format through docbook2ps and ps2pdf to create clean PDFs, in each language.</para></listitem>
  <listitem><para>We push the DocBook format through db2epub to create Epub books, in each language.</para></listitem>
  <listitem><para>We upload the PDFs to the Guide wiki using the Wikidot API.</para></listitem>
</itemizedlist>
<para>It's important, when you create a community project, to lower the "change latency", i.e. the time it takes for people to see their work live. Or at least, to see that you've accepted their pull request. If that is more than a day or two, you've often lost your contributor's interest.</para>

</sect1>
<sect1>
<title>Licensing</title>
<para>I want people to reuse the Guide in their own work: in presentations, articles, and even other books. However, the deal is that if they remix my work, others can remix theirs. I'd like credit, and have no argument against others making money from their remixes. Thus, the Guide is licensed under cc-by-sa.</para>

<para>For the examples, we started with GPL, but it rapidly became clear this wasn't workable. The point of examples is to give people reusable code fragments so they will use &Oslash;MQ more widely, and if these are GPL that won't happen. We switched to MIT/X11, even for the larger and more complex examples that conceivably would work as LGPL.</para>

<para>However when we started turning the examples into stand-alone projects (as with Majordomo), we used the LGPL. Again, remixability trumps dissemination. Licenses are tools, use them with intent, not ideology.</para>

</sect1>
</chapter>
</book>