Query2 and differential dataflow
29 Feb 2016

Along with timely-dataflow, Frank McSherry develops and maintains another library focusing on efficient data processing: differential-dataflow. This week, we’ll have a look at what it brings to the table.
Last week, we established that timely-dataflow rocks: it allowed us to crunch data with an order of magnitude better cost-efficiency than Redshift or Spark on EC2.
Timely is great, but it can be a bit intimidating. It’s lower-level than Spark, taking us back a bit to the Hadoop era of manual map/reduce. So this week, we will take the time to translate our good old friend Query 2, step by step, to its timely-dataflow implementation.
This is part #5 of a series about a BigData in Rust experiment.
We are working on a simple query nicknamed Query2 and comparing our results to the BigData benchmark.
Okay. So this post was supposed to be about running on a cluster. I promise we will come to that eventually, but this week I got a bit side-tracked. Serendipity happened! We will have to dive into Rust HashMaps characteristics.
We have seen in part #1 that my laptop processes 30GB of deflated CSV in about 11 minutes. If we want to do better, the first step is to find out what our bottleneck is. The code was presented in part #2.
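As a reminder of the workload before we profile it: Query2 groups user visits by a prefix of the source IP and sums the ad revenue per group, which in Rust naturally leans on a HashMap. Here is a minimal sketch of that aggregation, assuming made-up example records rather than the post's actual code or the benchmark data:

```rust
use std::collections::HashMap;

// Hedged sketch of the Query2 aggregation: group visits by a prefix of the
// source IP and sum the ad revenue per group. The records passed in below
// are illustrative, not the BigData benchmark's uservisits table.
fn query2(visits: &[(&str, f32)], prefix_len: usize) -> HashMap<String, f32> {
    let mut sums: HashMap<String, f32> = HashMap::new();
    for &(source_ip, ad_revenue) in visits {
        // The entry API does one lookup per record: insert 0.0 if the key
        // is new, then add this record's revenue to the running sum.
        let key: String = source_ip.chars().take(prefix_len).collect();
        *sums.entry(key).or_insert(0.0) += ad_revenue;
    }
    sums
}

fn main() {
    let visits = [("10.0.0.1", 1.5), ("10.0.0.2", 2.5), ("11.1.1.1", 4.0)];
    let sums = query2(&visits, 2);
    assert_eq!(sums["10"], 4.0);
    assert_eq!(sums["11"], 4.0);
    println!("{:?}", sums);
}
```

Every input record pays for at least one hash and one probe in this inner loop, which is exactly the kind of per-record CPU cost that makes HashMap characteristics worth a closer look.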
For years, we have worked under the assumption that IO was the limiting factor in data processing. With SSDs and PCIe disks, all this has changed. Believe me, or re-run the bench and look at top: it’s very obvious that we are now CPU-bound.