Query2 in timely dataflow
22 Feb 2016
Last week, we established that timely-dataflow rocks. We showed that it lets us crunch data with one order of magnitude better cost-efficiency than Redshift or Spark on EC2.
Timely is great, but it can be a bit intimidating. It’s lower-level than Spark, bringing us back a bit to the Hadoop manual map/reduce era. So this week, we will take the time to translate our good old Query 2 friend, step by step, to its timely-dataflow implementation.
From pseudo-SQL to execution plan
First, we need to do the query planner’s job by hand. Some people like SQL syntax — you guessed it, I’m not one of them — but translating it to something more computer-friendly is not simple.
SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, 8)
What’s not to like? Eye-hurting capitalization convention, moronic SUBSTR repetition, complete lack of composability, absurd reading order.
Big, big sigh.
Anyway. What would it look like in Spark, or in fluent collection style? (This is pseudo code)
UserVisits.load_from(...)
.map(visit => (visit.sourceIP.substring(0,8),visit.adRevenue))
.reduceByKey((a,b) => a+b)
.count
This tries less hard to be readable in natural language, but it is a very nice step in the direction of actually doing something.
To paraphrase it:
- we start with a big collection of UserVisits
- for each of them, we make a Key/Value pair of an 8-character prefix of the sourceIP and the visit revenue
- for each Key prefix, we sum the associated revenue Values
- and in the end, we will only be checking the number of results
In all the experiments we have done so far, I have computed the actual revenue per prefix, even if I never displayed more than the count of prefixes.
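For reference, here is a minimal single-threaded sketch of the same plan in plain Rust. The function and variable names are mine, not from the Query2 code, and it assumes visits arrive as an iterator of (sourceIP, adRevenue) pairs:
use std::collections::HashMap;

// Minimal sketch: group revenues by an 8-character prefix of the sourceIP,
// then report only the number of distinct prefixes.
fn query2_local<I: Iterator<Item = (String, f32)>>(visits: I) -> usize {
    let mut revenue_by_prefix: HashMap<String, f32> = HashMap::new();
    for (source_ip, ad_revenue) in visits {
        let prefix: String = source_ip.chars().take(8).collect();
        *revenue_by_prefix.entry(prefix).or_insert(0.0) += ad_revenue;
    }
    revenue_by_prefix.len()
}
The distributed plan does exactly this, except that the HashMap is sharded by prefix across workers and the final length is summed across workers.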
Now, if we were to run this in Shark, we would be more or less done at this point, and Shark would do the rest for us. But timely dataflow, even with differential-dataflow, does not go this extra mile for us.
Reducers
First, we need to decide how to distribute this to several workers. In everything that follows, a worker is some code that shares no application data with its counterparts. From an isolation point of view, a worker could be a process — they were just that in good old Hadoop. In timely, workers can be multiple threads running in different processes, possibly on different servers. But the isolation invariant stands: application logic and data stay contained in one worker. There are no cross-worker shared memory structures visible to the application developer.
A very efficient way to think about the execution plan, obviously, is to think in map and reduce terms.
- Maps are “pure” functions of an item in a stream. A map is run on each item of the stream, producing one item of output at each turn, without any side effect at all. Maps are the “nice guys” in map/reduce. They are very composable and scale naturally. They will often contain most of the application logic code, but are not really interesting from a structural point of view.
- Reduces, of course, are another story. They are operations that, for instance:
- may have state,
- may “break” the item-to-item stream of data (because they need to accumulate some or all of its input before outputting something),
- may need to know something exhaustive about some subset or all of the dataset to produce a correct result.
On the happy side, there is a small zoology of reduce pattern families.
So basically, when processing data in map/reduce, you pick and configure a few reducers, shove most of your application logic in the mapper, and let the framework do its magic. Which is more or less what we have just done with the pseudo-spark code above.
It has an input, a map, a first reducer (reduceByKey, which performs a reduction of numbers by summing them), and a second reducer in the form of the final count.
That is exactly the path I followed to write the timely implementation of Query2. From SQL to stream (at least in my head) to translation to timely.
I happened to follow a different mental path for the non-timely implementations. If you remember the previous posts, with fewer constraints to start with (a blank page), I managed to produce a slower implementation than the one constrained by timely… even though it had only one reducer step.
Distributing Reducers is more complicated than Mappers because we must deal with their “unit of work” and make sure it falls into one single worker.
- the reduceByKey unit of work is the portion of our dataset sharing the same key. Remember: it receives all visits sharing a sourceIP prefix, and produces the total revenue for them. To produce the correct result, we need to make sure all the pairs for a given prefix are dealt with in one single place
- the count unit of work is the whole dataset.
All in all, some data will flow over the network twice, once for each reducer. First, records are sent to the worker responsible for the matching shard of the data for the reduceByKey; later, each worker contributes its single count of keys to an arbitrary worker which sums them. Timely dataflow calls these worker-to-worker communications “Exchange”.
Implementing the two-reducer plan in timely
The full code is here.
It is not the executable I used for the EC2 tests. For those I used query2.rs, which allows me to tweak and instrument more things. I copy-pasta-ed the relevant bits to make query2_timely, something easier to explain. It has the same performance characteristics.
It’s a single main() function, but the first (and only) thing it does is yield control to timely’s configuration manager. Yeah, Inversion-Of-Control, rust-style.
fn main() {
timely::execute_from_args(std::env::args(), move |root| {
let peers = root.peers();
let index = root.index();
root.scoped::<u64, _, _>(move |builder| {
[...]
});
})
}
This allows timely to parse the arguments from the command line, and start as many workers as necessary. Each worker is given an index, and knows the number of its peers. We put these useful constants somewhere convenient as we will need them in places where borrowing root will not be an option — rust being rust, it is picky about that.
The really interesting stuff happens in scoped(). This is where we will plug everything together.
First, lines 23 to 33 are devoted to reading the input files. There is not much in there that is timely-specific, but let’s have a look at them, as some bits are actually relevant.
let files = dazone::files::files_for_format("5nodes", "uservisits", "buren-snz");
let files = files.enumerate().filter_map(move |(i, f)| {
if i % peers == index {
Some(f)
} else {
None
}
});
let uservisits = files.flat_map(|file| {
PartialDeserializer::new(file, Compressor::get("snz"), &[0, 3])
});
First we get an iterator over all the files from the right data directory. This is just stuff I made for these Query2 experiments. The next few lines are more relevant: each worker will load some of the files into the distributed system. We want every file to be read once and only once, and we want the load spread evenly across the workers. This is where the index and peers count we put aside earlier come in handy.
Once the files are filtered, we can load the actual content. PartialDeserializer and Compressor are part of the Query2 experiment code. The last three lines produce a standard rust Iterator over (sourceIP, adRevenue) pairs.
Next comes the SUBSTR, in the form of a map:
let uservisits = uservisits.map(|visit: (String, f32)| {
(Bytes8::prefix(&*visit.0), visit.1)
});
Here again, nothing comes from timely. Bytes8 is a structure around an array of eight bytes that I have written for Query2. As a matter of fact, Maps are so easy to deal with that I have not even bothered putting this one in the timely formalism. I could have done it, but what’s the point, really? Iterators are so easy…
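For illustration, a Bytes8-like wrapper could be as simple as the sketch below; the real one lives in the Query2 experiment code and may differ in its details:
// Hedged sketch: a fixed-size, hashable key holding the first eight bytes
// of the sourceIP. The real Bytes8 may differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Bytes8([u8; 8]);

impl Bytes8 {
    fn prefix(s: &str) -> Bytes8 {
        let mut buf = [0u8; 8];
        for (slot, byte) in buf.iter_mut().zip(s.as_bytes()) {
            *slot = *byte;
        }
        Bytes8(buf)
    }
}
The point of such a structure is that an eight-byte array is Copy, cheap to hash, and cheap to exchange over the network, unlike a heap-allocated String.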
A few more lines of preliminary:
let stream = uservisits.to_stream(builder);
let mut hashmap = ::std::collections::HashMap::new();
let mut sum = 0usize;
to_stream actually comes from timely! It takes a standard rust Iterator and transforms it into a timely Stream. From a 30,000-feet point of view, think of it as a mere adaptor.
The two other lines are standard rust, but they define the state of the worker: the HashMap stores the on-going reduced state of the reduceByKey step, and sum is the on-going figure for the final count().
And now for the main course, the two reducers in all their splendor.
let group = stream.unary_notify(
Exchange::new(move |x: &(Bytes8, f32)| {
::dazone::hash(&(x.0)) as u64
}),
"reduceByKey/Count",
vec![],
move |input, output, notif| {
input.for_each(|time, chunk| {
notif.notify_at(time);
for (k, v) in chunk.drain(..) {
update_hashmap(&mut hashmap, &|&a, &b| a + b, k, v);
}
});
notif.for_each(|time, _| {
if hashmap.len() > 0 {
output.session(&time).give(hashmap.len());
}
});
});
The first stage implements the reduceByKey — and half of the count, actually. Obviously, we are deep in the timely realm this time.
We plug a unary_notify operator on our uservisits stream, and we will keep the resulting stream in group for our next step.
unary_notify is “unary” in that it defines an operator with one input (the uservisits key/value tuples) and one output. It is “_notify” in that, as we are a reducer and not a map, we need some access to the shared computation progress in order to know when we have received everything that belongs to our shard of data. Access to this global state is why “timely” is called “timely”: each record in the stream is tagged with a “time”. Time is actually a complex thing in timely dataflow. Think about the blackboard scene in Back to the Future II, combined with the Interstellar library and the Twelve Monkeys mad scientists — or mad engineers, sometimes it’s hard to tell the difference.
In our example, it’s quite simple. There is just one “epoch” we are interested in for the reduceByKey: we need to be notified when all the workers in the system are done with the input files.
The Exchange is what decides to which worker a record should be sent: we are returning a hash of the key from the pair. Timely will make sure all records with the same hash go to the same worker.
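Any deterministic hash works here, as long as every worker computes the same value for the same key. As an illustration only (the real dazone::hash may use a different algorithm), it could look like this:
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hedged sketch of a key-hashing helper: timely only needs a stable u64
// per key, and maps that hash to a worker itself.
fn hash_key<T: Hash>(key: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}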
Next comes a debug and audit name for our operator, and a vector of timestamps we know we will want to be notified at. We could use this to register the required notification, except we don’t know what to_stream() uses as a timestamp. Fortunately, there are other ways to register notifications.
Finally, the last argument of unary_notify is the logic that the operator will run every time it gets a chance. It takes the form of a closure, which allows the logic to make use of the input, output and notification bus, and, by capture, the various variables the worker has declared as its state: more specifically, here, the hashmap variable.
Inside this closure, we code how to react to the presence of an input batch or a notification. In the case of an input batch, we get our chance to register on the notification bus that we need to know about the end of the epoch matching the data coming in. We are calling notify_at repeatedly. It is a bit off-putting, but the cost is amortized by the framework. It is preferable to guessing whatever to_stream() does, and it is what the timely guys recommend anyway.
Moving on to the “payload” logic, we update the state hashmap. The update_hashmap helper acts by inserting a new key or updating the value of an existing key using the addition closure we provide.
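For reference, a minimal sketch of such a helper could look like the following; the real update_hashmap lives in the Query2 experiment code and may differ in signature:
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::hash::Hash;

// Hedged sketch: merge a (key, value) pair into the map, combining with
// the existing value when the key is already present.
fn update_hashmap<K, V, F>(map: &mut HashMap<K, V>, combine: &F, key: K, value: V)
where
    K: Eq + Hash,
    F: Fn(&V, &V) -> V,
{
    match map.entry(key) {
        Entry::Occupied(mut e) => {
            let merged = combine(e.get(), &value);
            e.insert(merged);
        }
        Entry::Vacant(e) => {
            e.insert(value);
        }
    }
}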
The second bit of logic is what to do when the notification occurs: we send in the outgoing stream the sole count of entries in our map. This is where we do half of the count. A strict implementation of reduceByKey would shove the whole HashMap in the pipe, but we know better.
This is an instance of a Map being, again, the good guy.
Remember the pseudo-spark implementation above? It featured a reduceByKey(...).count(). So the stream between these two logical operators is a stream of disjoint HashMaps. Count, on the other hand, could be implemented as follows (again, pseudo-code).
.map( h:HashMap => h.count() )
.map( n => (0, n) )
.reduceByKey((a,b) => a+b)
The first map calls .len() on the HashMap, the second makes an absurd constant key to bring all the partial counts to the same worker, which sums them to get the result.
So what I have done here is move “up” the first map of the count as close as possible to the reduceByKey above, avoiding making a stream of HashMaps and having to explain that to rust.
Back to the code. All that remains is to move everybody to one worker (let’s pick worker 0) and sum the incoming numbers.
let _: Stream<_, ()> = group.unary_notify(Exchange::new(|_| 0u64),
"count",
vec![],
move |input, _, notif| {
input.for_each(|time, data| {
notif.notify_at(time);
for x in data.drain(..) {
sum += x;
}
});
notif.for_each(|_, _| {
if index == 0 {
println!("result: {}", sum);
}
});
});
});
The Exchange is now a constant function, and the rest is similar: the input chunks contain numbers that are added to the worker-scoped sum variable. We register the notification in the input handler, as we know spamming it is tolerable. Once we get notified, we output the count on the console.
And that’s it.
This executable is ready for the distributed mode, but the distribution itself is not covered: something or someone needs to copy the executable on each of the cluster nodes, provide them with the list of their peers, and start them. If you want to run it a second time, you need to go to each node and start the process again… For the EC2 tests, I used ansible scripts.
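For the record, timely’s execute_from_args parses a few standard command-line arguments (in the versions I know of: -w for workers per process, -n for the number of processes, -p for the process index, -h for a hostfile listing the peers). Starting the second of two processes with four workers each would then look something like this, with the binary name and hosts.txt being placeholders of mine:
./query2_timely -w 4 -n 2 -p 1 -h hosts.txt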
Recap
Of course, we are far from the elegant four lines of Spark: we had to do many things by hand, including translating the logical query into a distributable execution plan. We produced a full screen of code with a good quantity of boilerplate, but we have a standalone executable that will crunch the data with an efficiency way higher than that of Spark or Redshift.
Building an equivalent of Spark or Redshift on top of timely is a daunting task. But there are intermediate objectives that could make sense: helpers and ready-to-use operators to make writing data processing tasks easier, or daemons that would migrate executables to all the nodes, start them, and manage the peer list, for instance.
Not sure what the next step is yet.