Let's optimize

I/O and CPU

We have seen in part #1 that my laptop processes 30GB of deflated CSV in about 11 minutes. If we want to do better, the first step is to find out what our bottleneck is. The code was presented in part #2.

For years, we have worked under the assumption that I/O was the limiting factor in data processing. With SSDs and PCIe disks, all this has changed. Believe me, or re-run the bench and look at top: it’s very obvious that we are now CPU-bound.

Another simple way to convince ourselves is to check at what speed we can read our 30GB of data:

11:35 ~/dev/github/dazone% cat data/text-deflate/5nodes/uservisits/* | pv > /dev/null
  29GiB 0:00:46 [ 633MiB/s] [                                <=>         ]

If you did not know about pv, well, check it out. What we learn here is that we can very easily read the files for our benchmark from disk in 46 seconds, at more than half a gigabyte per second.

So we definitely have to look at what the CPUs are doing.

profile, showing gz and csv

Without trying to decipher too much out of it, the four or five top lines show that a lot of time is spent in CSV and Zlib functions. More “decoding stuff” appears just above the highlighted line (which is the actual final HashMap construction and is below 3% of the computing resources consumed).

A flamegraph will show it even better…

flamegraph

The fourth line (from the bottom) splits the actual CPU load nicely. From left to right:

All in all, no huge surprise.

Compression makes a lot of sense when I/O bandwidth is scarce and computing power is plentiful. But SSDs really have altered the equilibrium here (0.6 GB/s on a laptop…).

As for CSV, it is a human-friendly format. It’s relatively easy to parse by eye (mostly because each record is on its own line). But it is an absolute horror for a computer to parse. As a matter of fact, most of the widely used data formats in data science are human readable, and terrible for performance: CSV, Json, XML.

It is still frowned upon to use a binary format, but we will probably have to get used to it. Text formats are for Debug mode. HTTP is moving from a text format to a binary format. We have been doing HTTP in Debug mode for 20 years, this is probably enough. Let’s grow up, develop tools to pretty-print these new formats and protocols, and stop wasting so much computing power. Maybe we will save polar bears. At least, we’ll save some battery time.

Note that, as we try to find a better compromise for efficient data loading, we do not necessarily want to change the way data is put in long-term cold storage. It may be worth vaulting the data in highly compressed (bzip2) Json or XML or whatever format will be perceived as resilient to technology evolution, and use a separate, fast-loading, probably binary, format to feed day-to-day data analysis.

Parser-friendly encoding

First, there is now a wide variety of binary Json-equivalent formats. They are schema-free, and support strings, numbers, booleans, and some form of list and map. BSON, MessagePack, bincode and CBOR all happen to have a ready-to-use Rust encoder and decoder, so we can consider them as alternatives. It is relatively easy to design and implement these kinds of formats, so there are many competing options today, and it’s a bit hard to guess which formats will still be around in a few years.
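To give an idea of how little ceremony the schema-free route takes, here is a minimal sketch assuming serde derive and bincode 1.x; the struct and its field names are illustrative stand-ins, not the bench’s actual schema.

use serde::{Deserialize, Serialize};

// Illustrative record: a couple of fields standing in for the real uservisits schema.
#[derive(Serialize, Deserialize, Debug)]
struct UserVisit {
    source_ip: String,
    dest_url: String,
    ad_revenue: f32,
}

fn main() {
    let visit = UserVisit {
        source_ip: "127.0.0.1".into(),
        dest_url: "http://example.com/".into(),
        ad_revenue: 0.42,
    };
    // One derive line is enough to plug the same struct into bincode,
    // MessagePack (rmp-serde) or CBOR (serde_cbor) encoders alike.
    let bytes: Vec<u8> = bincode::serialize(&visit).expect("encode");
    let back: UserVisit = bincode::deserialize(&bytes).expect("decode");
    println!("{} bytes on disk, ad_revenue={}", bytes.len(), back.ad_revenue);
}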

ProtoBuf and Cap’n Proto

Then, we have schema-full formats. They require the developer to write a format specification (think IDL) that the encoder and decoder will use to read and write the data, freeing the data of the redundancy that comes with schema-free formats. There is a long and sometimes dark history here (yes, CORBA, I’m looking at you), so some people will get uncomfortable at the mere thought of IDL files. This is a bit of a shame. In the past decade, new formats have emerged from big organizations (Facebook, Google) and were deemed relevant enough to get some traction in open-source communities. They were originally targeted at message exchange, but nothing prevents using them as storage formats. I have found off-the-shelf Rust support for Protocol Buffers and Thrift, and implemented a Protocol Buffers alternative in the bench.

The ProtoBuf implementation is a departure from the POD approach: the IDL defines its own data class, where each field is actually a placeholder: native scalar values are wrapped in Option<> and Strings are wrapped in a proprietary wrapper. So switching to ProtoBuf implies a bit more work than the POD-based RustEncoding or Serde encoders.
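To illustrate the departure, here is a schematic comparison. This is not actual rust-protobuf generated code, just the shape of the thing with made-up names.

// Schematic only: a POD record versus the placeholder-style record a
// ProtoBuf code generator emits. Names and types are illustrative.

// POD-style record, as used with the RustEncoding or Serde encoders.
struct UserVisitPod {
    source_ip: String,
    ad_revenue: f32,
}

// ProtoBuf-style record: every field is optional until proven present, and
// strings live behind the library's own wrapper type in real generated code.
struct UserVisitProto {
    source_ip: Option<String>,
    ad_revenue: Option<f32>,
}

impl UserVisitProto {
    // Accessors hide the Option plumbing, roughly like generated getters do.
    fn get_source_ip(&self) -> &str {
        self.source_ip.as_deref().unwrap_or("")
    }
    fn get_ad_revenue(&self) -> f32 {
        self.ad_revenue.unwrap_or(0.0)
    }
}

fn main() {
    let pod = UserVisitPod { source_ip: "127.0.0.1".into(), ad_revenue: 0.42 };
    let proto = UserVisitProto { source_ip: Some(pod.source_ip), ad_revenue: Some(pod.ad_revenue) };
    println!("{} {}", proto.get_source_ip(), proto.get_ad_revenue());
}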

Cap’n Proto is a relatively recent development, “infinitely faster” than Protobuf. It does not use a POD or near-POD struct, but rather provides generated Reader and Builder types that wrap a buffer and provide accessors to the actual data. The buffer encoding does not contain pointers or architecture-dependent stuff, so it’s ready to be sent, read, written or shared. No encoding or decoding happens when loading a record. This is particularly relevant in our case because we only use one string among the seven in the record.

As a matter of fact, Cap’n Proto also offers an optional “packed” encoding (where runs of zeros are collapsed together). But if we use a more sophisticated compression on top of Cap’n Proto, it may or may not be useful. We’ll try both.

As with protobuf, we have to go through a proprietary interface to access the data, so the switch is not completely trivial.

Columnar data

I have added one more “encoding” for a very simple, not-even-proof-of-concept column storage scheme. This is very relevant here as we are only using two columns. It is called “buren”. If you don’t know why, you need to come and visit Paris.
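The idea is simple enough to sketch. This is not the actual buren code, just a naive illustration of the principle: each field goes to its own file, so a query only ever reads the columns it needs.

// Naive column-storage sketch (not the actual buren implementation): each
// field of the record is appended to its own file, so a query that only
// needs source_ip and ad_revenue never touches the other columns.
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let mut source_ip = BufWriter::new(File::create("source_ip.col")?);
    let mut ad_revenue = BufWriter::new(File::create("ad_revenue.col")?);

    let records = [("127.0.0.1", 0.42f32), ("10.0.0.1", 1.5f32)];
    for (ip, revenue) in &records {
        // One length-prefixed string per row in the text column...
        source_ip.write_all(&(ip.len() as u32).to_le_bytes())?;
        source_ip.write_all(ip.as_bytes())?;
        // ...and one fixed-size little-endian float per row in the numeric column.
        ad_revenue.write_all(&revenue.to_le_bytes())?;
    }
    source_ip.flush()?;
    ad_revenue.flush()?;
    Ok(())
}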

Compression

Zlib-based formats are not the only option. Snappy and LZ4 may be relevant alternatives. Not compressing the data at all may actually also be an option. I have not considered bzip2, as its computing requirements are even bigger than Zlib’s.

Plugging compression algorithms in and out of my code has been relatively painless, except for a few snags that I think are mostly due to the relative youth of the ecosystem: Rust having been stable for less than one year, library implementers are still working without a complete framework of good practice rules. Rust will get there with time, experience and discussion.
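Here is a minimal sketch of the kind of plumbing that makes such swapping cheap, assuming the flate2 crate for the gzip case; the snappy and lz4 crates would slot in the same way. This is an illustration, not the bench’s actual code.

// Pluggable decompression sketch: the rest of the pipeline only sees a
// boxed Read, so swapping codecs is a one-line change.
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader, Read};

fn open_data(path: &str, gzipped: bool) -> std::io::Result<Box<dyn Read>> {
    let file = File::open(path)?;
    Ok(if gzipped {
        Box::new(GzDecoder::new(file))
    } else {
        Box::new(file)
    })
}

fn main() -> std::io::Result<()> {
    // Count lines whatever the on-disk encoding is.
    let reader = BufReader::new(open_data("uservisits.csv.gz", true)?);
    println!("{} lines", reader.lines().count());
    Ok(())
}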

Results

This table has five groups of three columns, one group per compression scheme. From left to right: no compression, snappy, lz4, gz, zlib/deflate. For each group/compression scheme:

              ------ none -----   ----- snappy ----   ------ lz4 ------   ------- gz ------   - zlib/deflate --
format         disk   mbp   ovh    disk   mbp   ovh    disk   mbp   ovh    disk   mbp   ovh    disk   mbp   ovh
json            213   855   461      55   815   481      51   756   471      33  1037   638      33   873   516
csv             120   640   366      48   806   378      46   616   375      30   754   477      30   825   383
bincode         150   377   202      50   552   214      46   377   208      33   547   335      33   551   263
mpack           116   509   272      44   723   283      42   495   274      30   633   379      30   674   325
cbor            186  1220   698      53  1564   708      48  1171   697      33  1361   847      33  1537   719
protobuf        121   378   243      46   397   253      44   407   250      32   570   364      32   596   284
capnproto       188   242   207      63   243   171      58   239   158      39   482   319      39   416   207
pcapnproto      140   257   183      56   279   194      54   278   185      38   476   316      38   390   231
buren           155   135    83      39   139    87      38   157    93      22   211   114      22   170    99

There are still holes in the table: a few combinations I was not able to test, or for which I had widely inconsistent results. Remember that I’m running benches on a laptop, and that this 1/ is not scientific at all, 2/ generates a lot of angry fan noise in my office (which is also my bedroom). As a matter of fact, not all combinations on the mbp have been treated fairly. I have made extra runs for the well-performing combinations, where I was basically leaving my laptop alone in a relatively quiet state.

Or if you like bars… (shorter is faster)

format bars format bars

Anyway, I have highlighted a few “sweet spots” in the table. As I wrote earlier, Zlib is no longer a good choice.

Remarks:

Cap’n Proto all the way

We can try one more thing. As Cap’n Proto can work with no encoding at all, we can wrap its Reader around raw memory-mapped files, assuming they are in the “not packed” form. This will only work in the no-zero-packing, no-compression form. As a matter of fact, to make it work, I had to add a “record size” header between records in the encoding, so the files are even bigger than the non-compressed capnproto files. This amounts to 193GB (still smaller than Json) and runs, on the laptop, in 261s. So this compromise is not particularly relevant.
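For reference, here is a sketch of the record-size framing alone (the Cap’n Proto Reader side is left out). The little-endian u32 prefix is an assumption for illustration, not necessarily what the bench writes.

// Framing sketch: each record is prefixed with its length, so a reader over
// a memory-mapped byte slice can jump from record to record without decoding.
fn for_each_record<'a>(mut data: &'a [u8], mut f: impl FnMut(&'a [u8])) {
    while data.len() >= 4 {
        let len = u32::from_le_bytes([data[0], data[1], data[2], data[3]]) as usize;
        let (record, rest) = data[4..].split_at(len);
        f(record);
        data = rest;
    }
}

fn main() {
    // Two fake "records" framed with their size.
    let mut buf = Vec::new();
    for payload in [&b"abc"[..], &b"defgh"[..]] {
        buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        buf.extend_from_slice(payload);
    }
    for_each_record(&buf, |record| println!("record of {} bytes", record.len()));
}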

And the winner is…

So where are we now? I have picked the buren encoding and snappy compression. This is only marginally slower than uncompressed buren, and weighs significantly less on my hard drive.

flamegraph

Now the big “map” stage, with all the decoding, is at 25% instead of 85%. There might still be possible improvements here: buren actually makes copies of the textual data instead of borrowing it from the raw uncompressed buffer… malloc is at 2.6%, dalloc at 2%… But is it worth the trouble?

Now the big chunk is the partial aggregation, at 64%. Is there some kind of low-hanging fruit we could grab here?

Well, as a matter of fact, there is.

update_hashmap is the function that actually performs in-place aggregation in the intermediary result, and later in the final result. Basically, it is called with a key (an 8 to 12 byte prefix, remember) and a float (the ad_revenue). It looks the prefix up in the HashMap and updates the value by adding the new revenue, or just performs an insertion. This means we are performing a lot of insertions into the HashMap. A simple instrumentation (aka printf) showed me that in the case of Query2A, the partial map size at merge time is in the 700-800 range. I also happen to know that growing a HashMap is expensive. So let’s try and create these HashMaps at capacity instead of letting them grow organically.
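Here is a sketch of the change, not the bench’s actual update_hashmap: the key type and the capacity figure below are illustrative, the point being HashMap::with_capacity versus letting the map grow.

use std::collections::HashMap;

// In the spirit of update_hashmap: look the prefix up, add the revenue,
// or insert it if it is not there yet.
fn update_hashmap(map: &mut HashMap<Vec<u8>, f32>, key: &[u8], revenue: f32) {
    *map.entry(key.to_vec()).or_insert(0.0) += revenue;
}

fn main() {
    // Pre-size the map instead of letting it grow organically: the partial
    // maps were observed around 700-800 entries, so 1024 is a comfortable
    // guess that avoids any rehashing (the exact figure is an assumption).
    let mut partial: HashMap<Vec<u8>, f32> = HashMap::with_capacity(1024);
    update_hashmap(&mut partial, b"prefix01", 0.42);
    update_hashmap(&mut partial, b"prefix01", 1.00);
    update_hashmap(&mut partial, b"prefix02", 0.10);
    println!("{} keys", partial.len());
}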

It takes us from 135 seconds to 80 seconds. Wow. That was worth a try. I honestly did not think it could be so big an improvement. I double-checked and triple-checked it, but here we go. Have a last flamegraph on me.

flamegraph

All this is for Query2A. Something fishy is happening with Query2C; I’ll need more time to get into what’s going on there.

Conclusion

So that was a big post, but we have come a long way. We started at ~660 seconds and are now at 80 seconds. Let’s recap:

What’s next

Next post should be the one about running this on a — modest — cluster with timely dataflow, if I can get time on the cluster.


Rust and BigData series:

  1. Rust, BigData and my laptop
  2. The rust is in there
  3. Let’s optimize
  4. Hashes to hashes
  5. Embrace the glow cloud
  6. Query2 in timely dataflow
  7. Query2 in differential dataflow