Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations
Goal to make large-scale programming simple for all developers.
Write a program in Visual Studio; Dryad(LINQ) takes care of shipping it to the cluster, fault tolerance, etc.
Wrestling with the implementation of GroupBy-Aggregate. GroupBy takes a sequence of objects with some kind of key, and groups them together by key. Similar to MapReduce.
Naïve execution plan splits map and reduce into two phases, with an all-to-all data exchange between them. However, applying the reduce after this exchange results in a large amount of network I/O.
A better idea is to do early partial aggregation: use an aggregation tree to achieve this. Reduces the disk and network I/O by up to one or two orders of magnitude.
Want to automate this optimization. Programmer writes the obvious code and the system takes care of the rest.
Notion of decomposable functions is key to this. Need an initial reducer that is commutative, and a combiner that is commutative and associative.
How do we decompose a function? Two ways: iterator and accumulator interface. Choice can have a significant impact on performance.
How do we deal with user-defined functions? Try automatic inference, but fall-through to a good annotation mechanism. Implement simple function and annotate it with the initial reduce and combiner implementation function names.
Hadoop interface for this adds quite a lot of complexity. Java’s static typing is not preserved.
Iterator interface has to build an entire group and iterate through it. Accumulator can discard the inputs if they are not needed. Oracle uses this approach, implemented with stored procedures. Hard to link in user-defined procedures.
Automatic decomposition looks at the expression and checks whether all leaf function calls are decomposable.
Want our approach to have good data reduction, pipelining, low memory consumption and parallelisability (multicore). Define six strategies, accumulator- and iterator-based.
Iterator PartialSort approach. Idea is to keep only a fixed number of chunks in memory; processed in parallel. The bound on memory makes pipelining possible. Strategy close to MapReduce.
Accumulator FullHash approach builds an in-memory parallel hash table with one accumulator object per key. Objects are accumulated immediately. This gives optimal data reduction and memory consumption proportional to the number of keys, not records. This is the DB strategy (DB2 and Oracle).
Evaluated with three applications: WordStates, TopDocs and PageRank on a 240-machine cluster. Accumulator-based implemen—