OSDI 2010: Day 2

Deterministic Parallelism

Deterministic Process Groups in dOS

  • Many sources of non-determinism in real systems (shared-memory, IPC, pthreads, devices, etc.). This makes programs hard to test, debug and replicate (for fault-tolerance). Multicore only makes this worse.
  • Solution: OS support for deterministic execution. A “deterministic box” around some processes.
  • What can be made deterministic? What can be done about remaining non-determinism?
  • Internal versus external non-determinism. Internal is not fundamental (arises from scheduling artifacts) and can be eliminated. By contrast, external comes from interactions with the real world, and cannot be eliminated. External includes users and the network. Internal includes channels used between processes in the deterministic process group (shared memory, pipes, private files). Processes outside the group are external sources of nondeterminism.
  • External nondeterminism is controlled with a shim program that controls external inputs.
  • Alternatively, a deterministic VMM could have been built with the whole system inside the box, but this is inflexible and costly.
  • dOS is a modified version of Linux.
  • Motivating examples: parallel computation and a web server. Consider a parallel (scientific?) computation that operates from local input files. This should execute deterministically even on a multiprocessor. A web server has a shim that does deterministic record/replay, but since internal nondeterminism is eliminated, the recording is much less costly. This makes it easy to replicate multithreaded servers. Could move web server request processing into a DPG and have all the I/O outside.
  • New system call sys_makedet() which creates a deterministic box that expands to include all children. Processes are otherwise exactly the same as Linux processes.
  • Semantics of internal determinism: execution proceeds as if serialized on a logical timeline.
  • External nondeterminism: e.g. a read() from a socket is nondeterministic in both what it returns (the data) and when it returns (blocking time). The shim program interposes on such calls: it could record the call or replay it from a previous recording, and it controls the logical time at which the call returns.
  • Other uses: deterministic replication with DPGs and shim programs. Shim ensures that replicas see the same input at the same logical time. 2PC is detailed in the paper.
  • Implemented on Linux 2.6.24 on x86_64. 8000 LOC across 50 files changed. Transparently supports unmodified binaries. Needed to modify thread scheduling, virtual memory and syscall entry/exit. Paper describes the implementation challenges.
  • Uses the DMP-O algorithm from previous ASPLOS papers, chosen because it is easy to implement. Threads run in parallel until communication is detected, at which point communication is serialized. Detection uses an “ownership table” that maps data to threads (see the sketch after this list). Parallel and serial modes execute atomically and serialize onto the logical timeline (one time step per mode switch).
  • All loads and stores must be instrumented to manage shared memory, as must all syscalls for in-kernel channels (pipes, files, signals, but also address space, fd table (implicit/covert channels)). Page protection hardware is used to monitor shared memory at a page granularity.
  • Evaluated on an 8-core Xeon with 10GB of RAM. Each application runs in a DPG. Used the racey deterministic stress test to verify determinism. How much internal nondeterminism is eliminated? How much overhead does dOS impose? How does it affect parallel scalability?
  • Compare log sizes for dOS and SMP-ReVirt. SMP-ReVirt produces logs that are 4 orders of magnitude bigger.
  • Overhead compares no-DPG to DPG-only to DPG-with-execution-record.
  • For Apache, DPGs can saturate a 1G network (100KB static pages). For 10KB static pages, the nondeterministic baseline saturates the link, DPG throughput drops by 26% and recording drops by 78% (relative to the nondeterministic baseline). For Chromium, the slowdown is about 1.6x for both DPG and record.
  • Ran parallel benchmarks on 2, 4 and 8 threads. Blackscholes, lu and pbzip benchmarks have 1 to 3x slowdown in DPGs. Dedup, fmm and make suffer more due to false sharing on a page.
  • Q: Do you intend only to run programs deterministically for debugging, or should it be always-on? Ideally always, but some workloads will not be suitable.
  • Q: How do you decide on a timeline? Logical time increments either when a fixed quantum of instructions has executed, or [...].
  • Q: Does putting in debug code distort the execution? Yes.
  • Q: Is it possible to write a custom MPI that replaces fine-grained shared memory with message passing that would get rid of much of the overhead? It’s certainly possible, but we haven’t done that.
  • Q: Is non-determinism really bad? Splitting between internal and external handles that, and gives you control over where the split lies.
  • Q: Did you measure CPU utilization in your Apache benchmarks? Yes, but I don’t have numbers.
  • Q: Did you think about other ways you might address the fine-grained memory sharing? Could use hardware support or recompile the program using a transforming compiler. These might get some of the scalability back.
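To make the DMP-O description above more concrete, here is a small, purely illustrative Python simulation of the ownership-table idea: threads run a quantum in parallel mode, touching a page owned by another thread counts as communication and forces a fall-back to serial mode, and logical time advances once per quantum. The class, page names and quantum structure are invented for the example; dOS itself implements this in the kernel using page-protection hardware.

```python
# Hedged sketch of the DMP-O idea (ownership table + parallel/serial modes).
# This is an illustrative simulation, not dOS code.

PARALLEL, SERIAL = "parallel", "serial"

class DmpO:
    def __init__(self):
        self.owner = {}          # page -> thread allowed to touch it in parallel mode
        self.logical_time = 0    # advances once per mode switch

    def access_ok(self, thread, page):
        # In parallel mode a thread may only touch pages it owns; touching a
        # page owned by someone else is "communication".
        current = self.owner.setdefault(page, thread)
        return current == thread

    def run_quantum(self, accesses):
        # accesses: list of (thread, page) pairs issued during this quantum.
        mode = PARALLEL
        for thread, page in accesses:
            if mode == PARALLEL and not self.access_ok(thread, page):
                mode = SERIAL    # detected sharing: fall back to serialized execution
        self.logical_time += 1   # one logical time step per mode switch
        return mode

if __name__ == "__main__":
    dmp = DmpO()
    print(dmp.run_quantum([("T1", "pageA"), ("T2", "pageB")]))  # parallel
    print(dmp.run_quantum([("T1", "pageA"), ("T2", "pageA")]))  # serial
```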

Efficient System-Enforced Deterministic Parallelism

  • Parallelism is a grand challenge, and nondeterminism makes it harder. Races are everywhere: memory access, file access, synchronization….
  • Do we have to live with races? Restrictive languages (such as Haskell, SHIM, DPJ) don’t allow races to occur in the first place. Could we have race-freedom in any language?
  • Determinator is a new OS that offers race-free parallel programming. Compatible with existing programming languages, from Java through C to assembly.
  • Central idea: a check-in/check-out model for shared state. Fork checks out a copy, threads work on a private working copy of shared state, and join checks in and merges changes. The idea comes from parallel Fortran DOALL, and from version control.
  • Example is a multithreaded game/simulation/actor-based program, where threads share a global array. In Determinator, each thread gets a copy of the array, and join merges the diffs.
  • Another example is Parallel Make, where there is a Heisenbug (two parallel jobs use the same temporary filename). Determinator creates a working copy of the file system for each parallel job, so the file doesn’t get overwritten.
  • Writes only get communicated at synchronization. Write/write races either go away or become deterministic conflicts, which cause a runtime exception.
  • Determinator is a microkernel, which enforces a strict hierarchy of process “spaces”. Only the root space can directly do external I/O, and children can only communicate with their immediate parent. Single thread per space. Each space has a single address space, but with a backup “shadow” space.
  • The microkernel API is very small (PUT, GET and RET). A user-level runtime provides higher-level abstractions (C library, pthreads, file system, process management).
  • Join does a three-way diff between the parent’s working copy, the original copy and the child’s final working copy (see the sketch after this list). Optimized using VM copy-on-write. Merges can be done at page level (faster) or byte level.
  • Can emulate a shared file system (non-persistent) with weak consistency. Can do merges at the file rather than byte granularity. Only used for intermediate results.
  • Other features include process migration for distributed computing support, deterministic scheduling, and other things (in the paper).
  • Compared single-node performance against Linux on parallel benchmarks. Coarse-grained benchmarks achieve parity with, or some speedup over, Linux. Fine-grained benchmarks (e.g. LU) take a serious performance hit.
  • Looked at the granularity of a parallel quicksort, and the break-even is around 100K elements in an array.
  • Q: Is check-out/check-in more like transactions? It absolutely resembles transactional memory, but these don’t try to provide determinism.
  • Q: point about a “PAR DO” construct in an old Comp Arch news.
  • Q: What about other forms of synchronization e.g. condition variables? These are a terrible fit for Determinator since they are not naturally deterministic. We do fork/join and barriers. Talk about non-hierarchical synchronization in the paper. Barriers allow a child to sync with its parent without terminating.
  • Q: Do you think it is ideal to support arbitrary languages, or would it be better to change the way programs are written to exploit the Determinator semantics? Trade-offs/synergy? We’ve tried to be completely language-agnostic, but interaction could be a huge benefit. A big issue is choosing the granularity for merges, so type information could help here. Or could implement the merge in a more informed way.
  • Q: How much experience do you have with figuring out the programming model to deal with conflicts? Don’t have much practical experience yet as it is just a prototype.
  • Q: Could the merge operation (working on a shared file) yield a result that I might not expect, and do I have to change my assumptions about shared files? Currently, we merge at file granularity, and don’t attempt to merge within the file: this would be a conflict.
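A minimal sketch of the check-out/check-in merge rule mentioned above, assuming an element-wise three-way diff between the original copy, the parent's working copy and a child's final copy. Determinator performs this at page or byte granularity inside the kernel/runtime; this Python version only shows the merge decision and the deterministic conflict case.

```python
# Three-way merge sketch: take whichever side changed an element, and raise a
# deterministic conflict if both sides changed it to different values.

class Conflict(Exception):
    pass

def merge(original, parent, child):
    merged = []
    for base, p, c in zip(original, parent, child):
        if p == base:            # parent didn't touch it: take the child's value
            merged.append(c)
        elif c == base:          # child didn't touch it: keep the parent's value
            merged.append(p)
        elif p == c:             # both wrote the same value: no conflict
            merged.append(p)
        else:                    # both wrote different values: deterministic conflict
            raise Conflict("write/write conflict")
    return merged

if __name__ == "__main__":
    base   = [0, 0, 0, 0]
    parent = [1, 0, 0, 0]        # parent wrote element 0
    child  = [1, 0, 7, 0]        # child wrote element 2 (and agreed on element 0)
    print(merge(base, parent, child))   # [1, 0, 7, 0]
```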

Stable Deterministic Multithreading through Schedule Memoization

  • Nondeterminism arises when different runs of a multithreaded program give different behaviour for the same input. Some behaviours may be incorrect.
  • Deterministic multithreading partly addresses this problem, but the schedules are tightly coupled to the input. A slight change in the input leads to a very different schedule. This complicates testing and can mean that some inputs deterministically lead to buggy schedules.
  • The idea is to have stability: memoize past schedules and repeat familiar ones.
  • TERN runs on Linux as a user-space scheduler. Schedule memoized as a totally-ordered sequence of synchronization operations. Only race-free schedules are memoized (previous work). Use symbolic execution (based on KLEE) to track the input constraints required to reuse a schedule. Memoization checks inputs against previous input constraints that were identified by the symbolic execution.
  • Evaluated on Apache, MySQL, PBZip2 and 11 scientific programs. For 13/14 programs, only needed to modify less than 10 lines of code. 100 schedules process over 90% of a real HTTP trace with 122K requests. Overhead is <10% for 9/14 programs.
  • TERN works on annotated program source, with annotations to say which inputs could affect scheduling. An LLVM module does the instrumentation. A schedule cache stores memoized schedules as a mapping from constraint sets to schedules (see the sketch after this list). On a hit, the schedule is replayed; on a miss, the memoizer runs.
  • Simplified example based on PBZip2, which has worker threads processing blocks independently. “symbolic” annotation identifies parameters that affect the schedule.
  • Developed programming techniques to make this practical, and these are detailed in the paper.
  • Stability experiment uses a 4-day httpd trace (122K), MySQL sysbench simple and tx workloads, and PBZip2 uses the contents of /usr. Ran on a 4-core Intel box.
  • TERN can reuse over 90% of schedules for three benchmarks. MySQL TX workload only reuses 44.2% of schedules.
  • Looked at bug stability: if you change the input, do bugs occur with the changed input but not with the original? Compared to COREDET. Looked at three buggy programs: fft, lu and barnes from SPLASH-2. The symptom is that global variables are printed before being assigned the correct value. TERN never sees the bug, but COREDET does for some thread/matrix-size configurations. Fewer schedules in TERN means the chance of hitting the bug is lower, and hence it is more stable.
  • What about TERN’s overhead? It is supposed to run in everyday execution, so this must be mild. Apache sees small overhead, MySQL’s is much greater (>30%), and PBZip2’s is negative (it runs slightly faster). Results on SPLASH are mixed, from -15% up to 40%. Details of the optimization/speedup are in Section 7 of the paper.
  • Future work is fast deterministic record/replay.
  • Q: Does not recording shared memory access order mean that you might not faithfully reproduce data race bugs? The premise is to make behaviour stable and deterministic, rather than eliminating bugs. But we can memoize race-free schedules only, by applying a race detector.
  • Q: Could the annotations be created automatically in future? Haven’t done this yet, but in our evaluation, the overhead is pretty small.
  • Q: How do you automatically determine constraints, and would these become bigger as execution goes on? We have two refinements that mitigate this. First, we remove redundant constraints from the constraint set. Second, we slice out branches that don’t affect synchronization orders.
  • Q: Is it true that you only memoize and replay schedules at program granularity (and cannot replay parts of schedules), and would that be a benefit (and would it be easy or difficult)? Currently not implemented, but this would be an interesting direction.
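The schedule cache described above can be pictured as a map from input constraints to memoized synchronization orders. The sketch below is a hedged illustration: the constraint representation (plain Python predicates over named inputs) and the schedule encoding are invented stand-ins for TERN's KLEE-derived constraints and its totally ordered synchronization logs.

```python
# Illustrative schedule-memoization cache: reuse a recorded synchronization
# order whenever the new input satisfies the constraints recorded with it.

class ScheduleCache:
    def __init__(self):
        self.entries = []        # list of (constraints, schedule) pairs

    def memoize(self, constraints, schedule):
        self.entries.append((constraints, schedule))

    def lookup(self, inputs):
        for constraints, schedule in self.entries:
            if all(pred(inputs) for pred in constraints):
                return schedule  # hit: replay this total order of sync operations
        return None              # miss: run the memoizer and record a new schedule

if __name__ == "__main__":
    cache = ScheduleCache()
    # e.g. "any run with at least 2 blocks and exactly 2 threads reuses this schedule"
    cache.memoize([lambda i: i["nblocks"] >= 2, lambda i: i["nthreads"] == 2],
                  ["T1:lock", "T2:lock", "T1:unlock", "T2:unlock"])
    print(cache.lookup({"nblocks": 8, "nthreads": 2}))   # hit
    print(cache.lookup({"nblocks": 8, "nthreads": 4}))   # miss -> None
```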

Systems Management

Enabling Configuration-Independent Automation by Non-Expert Users

  • It is hard to automate a task across different machine configurations. KarDo observes user actions as they perform a task, and produces a solution that works across configurations.
  • Strawman: collect a separate trace for every possible configuration. But the configuration space is too big to explore completely. Also, tasks are so diverse that we are unlikely to get more than a few traces for each task.
  • KarDo can generalize from a few traces.
  • Example application: turning on out-of-office (OOF) notification in Outlook. What if the two machines start off in different view modes? What if a Java update notification pops up between steps while a trace is being recorded? KarDo can infer which actions are non-state-modifying from traces of any task. This means far fewer traces are needed, and non-state-modifying actions can be skipped in the replayed trace.
  • Evaluated by automating tasks from the MS Help and eHow websites, in numerous categories.
  • Tracing user actions is challenging because the OS only gives mouse clicks and keypresses which may be meaningless on a different machine. Use the accessibility interface to get more information about the widgets that were affected by an action, which is meaningful across machines.
  • Generalizing to a single canonical solution requires distinguishing state-modifying actions from other actions. Three kinds of action: update, commit and navigate. Only consider the update and commit actions. To identify these, use visual cues and feed them into an SVM, which labels actions. For features, use widget type, whether it closes the window, whether it changes state, what its text label is, etc. Train it using a small set of examples.
  • To generalize across different configurations, navigation actions must be spliced in between the state-modifying ones in order to reach the right buttons, etc. Build a global navigation graph, then do breadth-first search on the graph to get to the desired state (see the sketch after this list).
  • To remove unnecessary user actions, use a two-pass algorithm. First pass removes unnecessary updates (which are later overwritten).
  • To create a canonical solution, need to create a per-task state modification graph, which may have conditional branches for different configurations (e.g. configuring a router, there may be different manufacturer applications).
  • Implemented KarDo as a thin-client backed by a server. Evaluated on 57 tasks. Used 20 diversely configured VMs, split into 10 training VMs and 10 test VMs. Each task performed manually on exactly 2 VMs.
  • Testing by direct replay of the automation recording (baseline) and comparing it to KarDo. Baseline is successful in 18% of cases. KarDo succeeds in 84% of cases.
  • Why does KarDo fail? 4% of tasks fail due to the navigation algorithm, 5% due to missing steps, and 7% due to configuration errors.
  • Q: How many of the examples could you have handled by modifying configuration files directly? Some reasonable percentage of them, but we want to handle more general things than just configuration tasks.
  • Q: To what extent could you adapt your system to perform command-line tasks, or editing specific files? We do consider some command-line steps, but this isn’t the primary goal. Could extend it in this direction, but would need to understand the syntax and text on the command-line, which is more complex.
  • Q: What if the training steps include mistakes? We remove unnecessary steps. What about misconfiguration? Could use traces to “check” the correctness of configurations.
  • Q: What if you have to branch, how resilient is this to cosmetic perturbations to the interface? The accessibility layer is the level we consider, so we look for the presence of widgets.
  • Q: How do you determine the start of a task? People upload traces to a central server, so we assume they determine the start before recording it, and do this deliberately. What about input-invariant actions (entering user-specific text in a text box)? We flag this in the recording stage.
  • Q: Have you thought about using Google searches as an input to the trainer? Had a HotNets paper on this very thing last year.
  • Q: What if your program has privacy-sensitive data and this was captured in a trace? Do they have to be manually anonymised? How do I make sure that I don’t upload parts of my private email? In the general case, you need to rely on the user.  However, we can infer that, if many users input the same value, it is probably not private. But I have to trust the operator? You’d have to trust the automated system, not the administrator.
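The navigation step above amounts to a shortest-path search over a global navigation graph of GUI states. The following sketch is illustrative only: the graph, state names and actions are made up, and KarDo builds the real graph from traces gathered through the accessibility interface.

```python
# Breadth-first search over a toy navigation graph: find the shortest sequence
# of navigation actions from the current GUI state to the state that contains
# the widget the next state-modifying action needs.

from collections import deque

def navigate(graph, start, goal):
    """graph: state -> list of (action, next_state); returns a list of actions."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None                  # goal not reachable from this configuration

if __name__ == "__main__":
    graph = {
        "main":           [("click Tools", "tools_menu")],
        "tools_menu":     [("click Options", "options_dialog")],
        "options_dialog": [("click Mail tab", "mail_tab")],
    }
    print(navigate(graph, "main", "mail_tab"))
```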

Automating Configuration Troubleshooting with Dynamic Information Flow Analysis

  • Systems are hard to configure, and users make mistakes. Troubleshooting is therefore difficult and expensive.
  • When we have misconfigurations today (such as a bogus value in a config file), we might look at log messages, ask colleagues, search Google or look at the code if it is available. The idea is to build a tool that automatically finds the root cause.
  • ConfAid uses application code to lead users to the root cause, by doing dynamic information flow analysis on binaries.
  • Developers find the root cause by tracing back through the variables and inputs that have influence on the error condition occurring. ConfAid uses taint tracking to identify these dependencies. When the application reads from a configuration file, it adds taint to variables, and traces this through data and control flow. ConfAid attempts to find execution paths that avoid the error. Different inputs have distinct taint.
  • Each variable has a taint set which identifies the inputs that might influence the variable if they change. These sets are built through data flow and control flow. Must explore both paths of a conditional statement.
  • Complete and sound analysis would lead to bad performance and a high false positive rate. Use heuristics to deal with this.
  • The bounded horizon heuristic prevents path explosion by only running a fixed number of instructions in an alternate path. However, this may lead to false positives and false negatives.
  • Also, the single mistake heuristic assumes that a configuration file contains only one mistake. This reduces the amount of taint and the number of explored paths.
  • The weighting heuristic attempts to reduce the false positive rate by weighting different taint propagations differently (see the sketch after this list). Data-flow taint is weighted more heavily than control-flow taint, and branches closer to the error are weighted higher than those further away.
  • ConfAid can also propagate taint between processes that use IPC system calls. Support communication via sockets, pipes, TCP/UDP and regular files.
  • Evaluated on OpenSSH, Apache and Postfix. Manually injected errors and evaluated the accuracy of the root-cause ranking and the time taken to produce it.
  • Used real-world misconfigurations and randomly generated bugs.
  • The correct root cause was ranked first or second for all 18 real-world bugs. Accuracy was also good for randomly generated bugs, with the correct root cause ranked first in 85% of cases.
  • Average execution time is 1 minute 32s. Apache is the slowest. The randomly-generated bugs only take about 23s.
  • Q: How do you handle external input or output of various types in your speculative execution? Can you do rollback? We abort alternative path exploration if it would produce externally observable output (e.g. to a file).
  • Q: Do you automate the parsing and identification of the configuration files? We just intercept the read call and assign taint from there. Don’t need to identify the parsing routines, etc.
  • Q: Can you combine this with backward flow analysis from the error point? (With program slicing?) This would be much more difficult. Program slicing goes backwards, but we find that the config is read at the very beginning, then run for a while, then something goes wrong. Probably not practical to do program slicing on such a workload. It would probably also give a lot of false positives, because our taint tracking is more contained.
  • Q: Could you use your technique for other types of errors, such as segfaults? You could, but we have focused on configuration problems.
  • Q: Are all of the bugs that you dealt with single-mistake bugs? Yes. How common do you think those kinds of bugs are? Well, it depends if the bugs are independent or if they influence each other. Will probably only find the first error if there are two. Relaxing the single mistake restriction in future work.
  • Q: Do you have anything to add in terms of the performance overhead (proportional to program size) and did you do anything to reduce it? Dynamic data-flow analysis is slow: 100 to 200x slowdown, and we haven’t optimized it much, but it would be hard to get much faster. If the problem happens far into the execution, it will take a long time to come up with the ranking. Idea is to use this offline when you encounter a bug.
  • Q: How is the syntax of the configuration file handled? Does ConfAid require a priori knowledge of the syntax? No, we don’t require this. Fine with both Apache and OpenSSH, without app-specific tuning. However, there might be some noise in the result.
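The weighting heuristic above can be illustrated with a toy taint-set representation: each variable maps configuration tokens to weights, data-flow propagation preserves the weight, and control-flow propagation attenuates it. The specific weights and the merge rule below are assumptions made for the example, not ConfAid's actual constants.

```python
# Toy weighted taint propagation: candidate root causes at the error site are
# ranked by the weight that reaches them through data and control flow.

DATA_WEIGHT, CONTROL_WEIGHT = 1.0, 0.5

def merge_taint(dest, src, weight):
    """Propagate taint from src into dest, attenuated by the given weight."""
    for token, w in src.items():
        dest[token] = max(dest.get(token, 0.0), w * weight)

if __name__ == "__main__":
    taint = {
        "opt_a": {"ConfigOption:A": 1.0},   # assigned directly from the config file
        "opt_b": {"ConfigOption:B": 1.0},
    }
    flag = {}
    merge_taint(flag, taint["opt_a"], DATA_WEIGHT)     # flag = f(opt_a): data flow
    error_site = {}
    merge_taint(error_site, flag, CONTROL_WEIGHT)      # branch on flag: control flow
    merge_taint(error_site, taint["opt_b"], DATA_WEIGHT)
    # Rank candidate root causes at the error site by weight.
    print(sorted(error_site.items(), key=lambda kv: -kv[1]))
    # [('ConfigOption:B', 1.0), ('ConfigOption:A', 0.5)]
```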

Inside the Data Center, 2

Large-scale Incremental Processing Using Distributed Transactions and Notifications

  • Indexing is a long process, not just matching keywords to URLs. Converts raw documents into documents ready for serving.
  • Example of duplicate elimination using MapReduce. Map documents to their checksum, and reduce them into single documents. However, refreshing the index involves adding more documents to the repository and deciding, for each one, whether it duplicates an existing document. With MapReduce, you would have to start again from scratch.
  • Goal is to have a large repository of documents with a high-quality index, and a small delay between crawling and indexing (”freshness”). MapReduce could only achieve freshness of a few days.
  • All the information is stored in Bigtable. Tables mapping URLs to metadata, and from checksums to canonical documents. On getting a new document, you need to update the checksum table, and flip a metadata bit in the URL table. This leads to a race condition.
  • Percolator adds distributed transactions to Bigtable. The API has get(), set(), commit() and iterator methods. Limited support for secondary indices.
  • Bigtable is a sorted (row, column, timestamp) store. Rows are partitioned into ranges called “tablets”. Tablets are spread across many machines.
  • Percolator provides “snapshot isolation” semantics, using two-phase commit coordinated by the client (see the sketch after this list). Locks are stored in additional Bigtable columns: a commit and a lock column for each data column. Transactions manipulate these columns transparently to the user.
  • Notifications allow users to run user-defined code when a column is updated. “Message collapsing” means that notifications are run at most once per write. Applications are structured as a series of observers, which hopefully form a tree.
  • Notifications run independently on individual rows, which means that stragglers don’t hold up other work.
  • Bus clumping: randomly scanning the repository for notifications leads to scanners clumping together, which reduces parallelism and overloads the Bigtable servers. The solution is to obtain locks on a row-by-row basis, and to jump to a random other row if the current one is contended.
  • Very different access pattern on the hardware. Lots of small, random disk I/Os with many RPCs per phase (compared to MapReduce’s streaming I/O with many documents per RPC).
  • Compared MapReduce versus Percolator on document processing latency. MapReduce is basically constant against churn at around 2200s, whereas Percolator has a much lower latency, until latency explodes around 35% churn. MapReduce is limited by stragglers. Operating point is around 2 to 15% churn.
  • Pros of Percolator: much better freshness, and better scalability (more throughput by buying more CPUs), which supports a bigger repository. Immune to stragglers. Cons are that it needs to reason about concurrency and is about 2x more expensive per document processed.
  • Percolator runs 1000 threads on a single Linux instance. This makes it easier for application developers, with straight-line code and meaningful stack traces. And it gets good multicore speedup. Problems with kernel scalability, though, such as the mmap semaphore, and had to fix those.
  • Unknown unknowns: some CPUs periodically got XOR calculations wrong in checksums. Sometimes Bigtable cannot delete files fast enough (the garbage collector). 50% of seeks were doing useless readahead, so switched to ext4. Incorrect resistor value on the motherboard led to the workload powering machines off.
  • Gets 50% fresher results, with a 3x larger repository.
  • An existence proof for distributed transactions at web scale.
  • Q:
  • Q:
  • Q: Did you consider doing Nectar? Didn’t consider this, but it could improve the performance of this system. We do batch RPCs for operations on the same row.
  • Q: What was causing stragglers? Machine failures, process failures, etc.
  • Q: Could you live with some of the failures that you saw? Some limitations in the Bigtable abstraction that made us have to debug down to the lowest layers.
  • Q: Were Stonebraker and DeWitt right? It isn’t a good idea to pretend that MapReduce is a database system. But MapReduce isn’t dead. Percolator has a handful of users, whereas MapReduce has thousands of Google users.
  • Q: Did you find cycles of notifications arising in practice? Did it exactly once, but were more careful after that. The system looked healthy but would have run for an infinitely long time.
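A very simplified sketch of the client-coordinated two-phase commit over per-column lock/write metadata described above. Real Percolator transactions also take timestamps from a timestamp oracle, detect write-write conflicts, and clean up abandoned locks; the in-memory "Bigtable", the Txn class and the counter standing in for the oracle are all illustrative assumptions that only show how staged data, locks and write pointers interact.

```python
# Simplified snapshot-isolation transaction over Bigtable-style columns.
import itertools
from collections import defaultdict

_ts = itertools.count(1)   # stand-in for the timestamp oracle

# table[row] holds three "columns", each mapping timestamp -> value.
table = defaultdict(lambda: {"data": {}, "lock": {}, "write": {}})

class Txn:
    def __init__(self):
        self.start_ts = next(_ts)
        self.writes = {}                     # row -> value, buffered until commit

    def set(self, row, value):
        self.writes[row] = value

    def get(self, row):
        # Read the latest committed value visible at start_ts.
        commits = [ts for ts in table[row]["write"] if ts <= self.start_ts]
        if not commits:
            return self.writes.get(row)
        data_ts = table[row]["write"][max(commits)]
        return table[row]["data"][data_ts]

    def commit(self):
        rows = list(self.writes)
        primary = rows[0]                    # would anchor lock cleanup (omitted)
        for row in rows:                     # phase 1: lock and stage data
            table[row]["lock"][self.start_ts] = primary
            table[row]["data"][self.start_ts] = self.writes[row]
        commit_ts = next(_ts)
        for row in rows:                     # phase 2: publish writes, drop locks
            table[row]["write"][commit_ts] = self.start_ts
            del table[row]["lock"][self.start_ts]

if __name__ == "__main__":
    t = Txn()
    t.set("url:example.com", "canonical")
    t.commit()
    print(Txn().get("url:example.com"))      # 'canonical'
```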

Reining in the Outliers in Map-Reduce Clusters using Mantri

  • Work began as a measurement study of outliers, but turned into fixing them. Now in production on Bing’s clusters.
  • MapReduce is used to build the search index, improve relevance and ads, perform genome analysis, etc. It decouples operations on data from mechanisms to scale across large distributed clusters.
  • Implementation is Cosmos (based on Dryad). Bing uses the SCOPE language to specify queries.
  • Example: finding frequent search queries. Three phases: read/parse, map and reduce. Barrier after the map task. Leads to a drop in cluster utilization towards the end of the map phase, due to stragglers. Reduce phase also has stragglers. Both phases slow down the overall job.
  • Want to get rid of stragglers and recomputations. This gives better predictability, better resource utilization and quicker response (good for developers).
  • Measured start time, end time, data rate and failure mode on a production cluster. Stragglers take 1.5x the duration of the median task in a phase. Half of phases have >10% of stragglers and no recomputation. 10% of stragglers take 10x the duration of the median task (very heavy tailed).
  • The impact of outliers on job completion time (based on a trace-driven simulator): 50% of jobs would improve by at least 34.7%.
  • If input data is unavailable, tasks must wait for recomputations, and this delay may cascade. The idea is to replicate intermediate data, and use a copy if the original is unavailable. The challenge is to decide what data to replicate, and where. Recomputes are rare and localized (mostly due to disk failures or high load), which allows a predictive model of how likely a machine is to cause a recompute. A cost-benefit calculation is based on the probability of a recompute and the runtime of the task. Since whole-rack failures are so rare, data is only replicated within a rack.
  • Cause of stragglers: tasks reading input over the network end up experiencing variable congestion. Solution is to spread tasks evenly across the cluster. Many aspects of MapReduce are already network-aware (such as shard placement), but the shuffle of the reduce phase means that this locality is harder to exploit.
  • Place reduce tasks so that the traffic on each link is proportional to the available bandwidth on that link. This would require global co-ordination across jobs to identify the congestion and place tasks. Can approximate this with local control.
  • Cause of stragglers: workload imbalance due to unequal partitions. Approximately 25% of outlier tasks have more data to process. Better to ignore these than duplicate them! Mantri builds an estimator that uses Graham’s 1969 theorem.
  • A final cause is contention on a machine. The idea is to copy tasks elsewhere in the cluster, and the challenge is to do this as early as possible. This needs an estimator to guess how much work is left to do. The trick is to copy when the predicted time of starting again is less than half the remaining time (see the sketch after this list).
  • Running time is a function of the input, network congestion, data left to process and various other factors.
  • Results from deployment in production clusters and from simulations. Running in production since May. Compared May–June (with Mantri) against April (the previous setup). Median job improvement is 32.1%. There is also a reduction in the resources used, because the Dryad scheduler used to launch too many duplicate tasks (Mantri launches 0.5x as many copy tasks as Dryad, with a 2.8x higher success rate).
  • Compared Mantri to Dryad, Hadoop, LATE and MapReduce on a trace. Mantri gives a better reduction in completion time than all of these schemes. The others have less time savings with more resources used.
  • Q: You took the MapReduce interface as given, so what would you do if you built a new system knowing what you know now? Providing better ways of partitioning maps, which we have already done in production.
  • Q: What would you do if you had a non-uniform cluster? We have very efficient cluster management software, called Autopilot, that monitors these things. The primary reason we don’t have persistent problems is that this will shoot particularly bad nodes in the head. A slot abstraction is how we split up heterogeneous machines.
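The restart/duplicate rule above ("copy when the predicted time of starting again is less than half the remaining time") can be sketched as below. The progress-rate estimator is a naive stand-in for Mantri's actual predictor, which also accounts for input size, network congestion and other factors.

```python
# Toy outlier-mitigation rule: duplicate a task only if a fresh copy is
# predicted to finish in less than half the time the current copy still needs.

def time_remaining(elapsed, progress):
    """Extrapolate remaining time from the fraction of input processed so far."""
    if progress <= 0.0:
        return float("inf")
    return elapsed * (1.0 - progress) / progress

def should_duplicate(elapsed, progress, predicted_fresh_runtime):
    remaining = time_remaining(elapsed, progress)
    return predicted_fresh_runtime < remaining / 2.0

if __name__ == "__main__":
    # A task that has run 600s but processed only 10% of its data, when a
    # typical task in this phase finishes in 120s: duplicate it.
    print(should_duplicate(elapsed=600, progress=0.1, predicted_fresh_runtime=120))
    # A healthy task near completion: leave it alone.
    print(should_duplicate(elapsed=100, progress=0.9, predicted_fresh_runtime=120))
```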

Transactional Consistency and Automatic Management in an Application Data Cache

  • Application-level caching is used to handle scaling issues in modern websites, e.g. Facebook, MediaWiki, LiveJournal. Caching whole web pages is not much good if you have highly personalized content. Database caches are less useful when so much processing is done in the application layer. This has led to the use of things like memcached or Java object caches. These are essentially a DHT, for application-level objects, which can separate commonly-used and customized content. This caching reduces load on the database and the application server.
  • Existing caches are very simple and force a lot of management work to the application. For example, they don’t provide transactional consistency, which violates guarantees that the underlying database would provide. This can expose anomalies to the user, and complicates application code (to deal with violated invariants).
  • Naming is harder than you think: a cache key must uniquely identify the value. Invalidations require global reasoning about the possible contents of the cache. Saw two bugs in MediaWiki (#7541 and #8391) that arose from these problems.
  • TxCache provides transactional consistency, bounded staleness and function memoization. It is implemented as a library that hides the complexity of cache management, and integrates with a new cache server we have developed. There are function calls for beginning, committing and aborting transactions, and a hook to make a side-effect-free function cacheable (depending only on the arguments provided and the state of the database). No need to deal with inconsistencies, naming, invalidation, etc.
  • Transactional consistency goal: all data seen in a transaction reflects a single point-in-time snapshot. Each transaction gets a timestamp, and objects in the database and cache are tagged with a validity interval. Transaction can read cached data if the validity interval includes the timestamp. Staleness allows cached data to be used for longer, if this is compatible with the application requirements. Nevertheless, it must be consistent with the data in the database if other parts of the transaction require DB data. Simply expose an API on the database for starting a transaction at a given timestamp.
  • Validity intervals are computed for application objects, queries and tuples (the last being tracked by the database). A query is valid if the current timestamp is in the intersection of the validity intervals of the tuples accessed (see the sketch after this list). Had to modify the database server to track this and return it with each SELECT query.
  • It is hard to choose a timestamp a priori, since we don’t know the access pattern or the contents of the cache. However, can choose this based on the tuples that would be accessed.
  • Each object in the cache has a set of invalidation tags, and the DB had to be modified to assign these tags to each query. Tags are computed from the query access methods (e.g. index lookups, wildcard tags for queries on the entire table). An update generates affected tags (one per key per tuple modified) and broadcasts these to all cache nodes in an ordered stream. Need to be careful to avoid lookup-invalidate race conditions.
  • Evaluated using the RUBiS benchmark suite, with the standard browsing/bidding workload. Ran with an in-memory (850MB) workload, and a 6GB (disk-bound) workload. As the cache size grows, throughput increases up to about 5000 requests per second. The cache hit rate flattens out at 95%. With no cache, the rate is about 1000 requests/second.
  • Throughput increases also for the disk-bound workload, though much lower (up to 500 requests/second). The cache hit rate is much higher though. Bottleneck in this workload is random I/O for infrequently-accessed data.
  • Adding staleness gives a 3x to 5x throughput improvement.
  • Investigated cache misses due to the stringent consistency: less than 10%. When consistency is turned off, the increase in throughput is much lower.
  • Q: How are the invalidation tags affected by complex queries like joins? We have a more complex approach based on predicate locks.
  • Q: How do you deal with fault tolerance? The cache is inherently fault tolerant because its data can always be recreated. Although migration etc. could improve that. Problems with the validity intervals? No, because these are generated by the database.
  • Q: Do you provide serializability or snapshot isolation? Depends on what the underlying database supports, since the cache is for read-only transactions.
  • Q: Why did your throughput keep going up as the hit rate levels off? This is because we are caching objects that are sometimes larger than the data that is used to generate them.
  • Q: Do you still get a performance win if you have a single cache node on a single system? The benefit of a distributed architecture is the ability to scale the effective size of the cache. Have you compared distributing the cache versus distributing the database? There are a lot of approaches based on DB replication that do similar things, but these are more heavyweight because they try to scale writes as well.
  • Q: How is the cache key chosen? Based on the function’s name and its arguments. A more complex approach could be used if the code might change. What if not all of the arguments matter? Still work to be done in the application to make sure the cache is used efficiently (e.g. in choosing the cacheable functions).
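The validity-interval reasoning above boils down to interval intersection plus a timestamp membership test. The sketch below uses closed (lo, hi) pairs for simplicity; the real system also deals with still-open intervals and staleness bounds.

```python
# Validity-interval bookkeeping: a cached result may satisfy a transaction only
# if the transaction's timestamp lies inside the intersection of the validity
# intervals of everything the result depends on.

def intersect(a, b):
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def query_validity(tuple_intervals):
    """Intersect the validity intervals of all tuples a query accessed."""
    interval = tuple_intervals[0]
    for iv in tuple_intervals[1:]:
        interval = intersect(interval, iv)
        if interval is None:
            return None
    return interval

def usable(cached_interval, txn_timestamp):
    return cached_interval is not None and \
           cached_interval[0] <= txn_timestamp <= cached_interval[1]

if __name__ == "__main__":
    iv = query_validity([(10, 50), (20, 60), (5, 45)])   # -> (20, 45)
    print(iv, usable(iv, 30), usable(iv, 48))            # (20, 45) True False
```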

Piccolo: Building Fast, Distributed Programs with Partitioned Tables

  • Motivating example is PageRank: for each node in a graph, add the current score of that node to its out-neighbours, and repeat until convergence. The only changing data (the current and next iteration scores) will fit in memory.
  • Dataflow models don’t have global state as a first-class concept. They need to read from and write to distributed storage, and intermediate global state is either inaccessible or costly to access.
  • Could implement it with MPI or RPC. Need to decide how to break up the computation explicitly.
  • Piccolo provides global access to distributed shared state, and the Piccolo runtime handles the communication between nodes. Usually ease of use and performance are in conflict.
  • Programming model uses a key-value store with a put/get/update/iterate interface. Implemented as a library for C++ and Python.
  • Showed how PageRank is implemented on Piccolo. A job is specified with a control function and a kernel function that actually does the computation. Doing it naïvely is slow, because it would do a lot of remote gets and remote puts.
  • Let the user control the partitioning function (e.g. for PageRank based on the domain of each page), and colocate tables that a kernel function uses on the same node.
  • Synchronization is challenging if multiple writers can access the same key at once, so Piccolo provides an atomic update function (e.g. sum, min or max) attached to a table value; this only works for commutative functions (see the sketch after this list). The synchronization primitive is a global barrier. Tables provide release consistency: accumulations are atomic, but they may be buffered and are only guaranteed to be applied at barriers. Need to do an explicit barrier between operations. Release consistency makes writes quite fast.
  • Fault tolerance: do checkpointing of the table state. The barrier is augmented with a checkpoint of particular tables. Use the Chandy-Lamport (non-blocking) protocol for checkpointing.
  • Load balancing is tricky because the tasks are stateful, so you cannot simply start copies of them. The master coordinates work stealing, and needs to pause updates to a partition before it is moved to another machine.
  • Evaluated against Hadoop on a local 64-core, 12-node cluster. Used a 100M page graph, taking around 8 seconds per iteration. Main Hadoop overheads are sorting, HDFS access and serialization.
  • Evaluated scaling up to 200 workers on EC2, and the iteration times stays about the same (about 60 seconds per iteration at all configurations).
  • Implemented iterative n-body and matrix multiply that don’t fit into Hadoop. Also asynchronous applications such as a distributed web crawler.
  • Q: What is your fallback strategy for dealing with computations that don’t have a commutative aggregate function? We haven’t found many applications like that. We would have to rely on pairwise locking in this case.
  • Q: Do you need to restart every node if a single node fails? Yes, you need to rollback all nodes to the last checkpoint, so need careful consideration of the checkpoint frequency.
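The put/get/update interface and accumulator-based synchronization described above can be mimicked with a toy in-process table, as below. The class and method names are invented to mirror the idea; the real Piccolo API is a C++/Python library with partitioned, distributed tables and a global barrier primitive.

```python
# Toy "Piccolo-style" table with a commutative accumulator, used for one
# PageRank iteration over a tiny graph.

class Table:
    def __init__(self, accumulate, default=0.0):
        self.accumulate = accumulate     # commutative combiner, e.g. sum
        self.default = default
        self.data = {}

    def update(self, key, value):        # applied atomically per key
        self.data[key] = self.accumulate(self.data.get(key, self.default), value)

    def get(self, key):
        return self.data.get(key, self.default)

def pagerank_iteration(graph, curr, nxt, damping=0.85):
    n = len(graph)
    for page, links in graph.items():
        share = curr.get(page) / max(len(links), 1)
        for target in links:
            nxt.update(target, damping * share)
        nxt.update(page, (1.0 - damping) / n)
    # ...a global barrier would go here before swapping curr and nxt...

if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
    curr = Table(lambda x, y: x + y)
    for p in graph:
        curr.update(p, 1.0 / len(graph))
    nxt = Table(lambda x, y: x + y)
    pagerank_iteration(graph, curr, nxt)
    print({p: round(nxt.get(p), 3) for p in graph})
```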

Cloud Storage

Depot: Cloud Storage with Minimal Trust

  • Storage providers like S3 are appealing because they free the user from a lot of concerns. However, they also present risks. Failures can cause undesired behaviour (such as not propagating permission changes immediately). Also risks of data loss, corruption or unavailability.
  • Approach is to take a radical stance on fault-tolerance. Completely eliminate trust from cloud storage for put-availability, eventual consistency, staleness detection and dependency preservation. It minimises trust for get-availability and durability. So you can always perform a put, for example. But gets are only available as long as there is at least one correct node.
  • Depot ensures high availability by running multiple servers and accepting a weaker-than-sequential consistency model. It falls back to client-to-client communication in the case of extreme cloud failure. It prevents omission and reordering by adding metadata to puts, which is fully replicated in nodes’ local state. Metadata received at a client is checked to ensure that the client sees a view satisfying various well-defined properties. Update metadata includes a local clock and a history. Logically each node stores all the metadata it sees (garbage collection is handled in the paper).
  • To protect consistency, nodes perform local checks on the metadata: each node ensures that the updating client has seen all the previous updates it claims to depend on (based on its history), and that it does not try to rewrite its own history (based on its local clock). See the sketch after this list.
  • Faults can lead to forks in the history, which partition correct nodes, and prevent eventual consistency. Forks are joined to gain eventual consistency: faults are transformed into concurrency. One faulty node is transformed into two correct virtual nodes. Since concurrency can introduce conflicts (concurrent updates to the same object), Depot exposes such conflicts to applications.
  • Depot provides various properties in terms of consistency, availability, integrity and eviction, which vary based on how many nodes are correct.
  • For get availability and durability, it would be ideal to trust only yourself, but impractical to do this. Depot minimises the number of correct nodes that are necessary to service the get. Any correct node that is willing to supply an object can service a get. To make it likely that a correct node has data, employ replication (including a local copy).
  • Contingency plan for storage service provider failure. Prototype forces all clients to store all of the puts that they create, and gossip update metadata in order to route get requests to the right node. Despite complete SSP failure, this allows the system to keep working.
  • Evaluated cost in terms of latency, resources and dollars (for storage).
  • Overhead comes from adding metadata and adding checks (SHA256, RSA verify and history verify).
  • Set up the evaluation on 12 Emulab nodes, 8 clients and 4 servers, with 1G links. Clients issue 1 request/second, and measure latency and per-request cost. As a baseline, emulate traditional cloud storage (servers implemented in Depot without any of the checks, and no metadata at the clients).
  • Latency overhead on get is very small (<1 ms), and on put, is about 5 ms.
  • Network overhead on gets are modest. CPU costs are bigger. Metadata transfer causes network cost, and verification has CPU cost.
  • Cost model based on cloud pricing. Get cost per TB is small over the baseline (about $100/TB). Puts incur more cost (about $220/TB) due to metadata.
  • Q: Does presenting overheads in terms of the Depot cost mask the effect? Isn’t the CPU cost something like 400%? Yes, but the absolute values are small. Plus the real cost is in the bandwidth.
  • Q: What can Byzantine clients do if you rely on gossiping metadata (such as cause all get requests to be routed to it)? Include a “receipt” mechanism to deal with such a situation, described in the paper. When a client receives a get response, it performs a check to ensure that data is sufficiently replicated.
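The local metadata checks above can be sketched as two simple tests on each incoming update: the writer's logical clock must advance by exactly one over the last update accepted from it, and every update it claims to depend on must already be present locally. Signatures and the real history hashes are omitted; the field names below are illustrative assumptions.

```python
# Toy version of Depot-style local checks on incoming update metadata.

def check_update(update, state):
    """update: {'writer', 'clock', 'deps': {writer: clock, ...}}
       state:  {'latest': {writer: clock}} -- local view of accepted updates."""
    latest = state["latest"]
    # Check 1: the writer must not fork or rewind its own history.
    if update["clock"] != latest.get(update["writer"], 0) + 1:
        return False
    # Check 2: all claimed dependencies must already be satisfied locally.
    for writer, clock in update["deps"].items():
        if latest.get(writer, 0) < clock:
            return False
    latest[update["writer"]] = update["clock"]
    return True

if __name__ == "__main__":
    state = {"latest": {}}
    print(check_update({"writer": "A", "clock": 1, "deps": {}}, state))        # True
    print(check_update({"writer": "B", "clock": 1, "deps": {"A": 1}}, state))  # True
    print(check_update({"writer": "A", "clock": 1, "deps": {}}, state))        # False: A rewrites history
    print(check_update({"writer": "C", "clock": 1, "deps": {"B": 5}}, state))  # False: missing dependency
```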

Comet: An Active Distributed Key-Value Store

  • Key-value stores are increasingly important, with great scalability, availability and reliability properties. Popular in both data centers and P2P. Stores are shared between many applications to avoid per-app storage system deployment, but building applications above these stores is challenging.
  • Main challenge is due to inflexibility: all apps are treated exactly the same way, independent of their needs (which may be conflicting). Motivating example is authors’ experience with Vanish on top of the Vuze DHT.
  • Vanish is self-destructing data storage for the web, based on the Vuze DHT, using temporal-based keys. Unfortunately, Vuze had a fixed 8-hour data timeout, and an overly-aggressive replication scheme, which harmed security. Making the necessary changes was simple, but deploying them was difficult, because it is a public service in production. Also it is hard to evaluate changes before deployment.
  • Propose extensible key/value stores that allow applications to customise functions of the store. Different data lifetimes, numbers of replicas and replication intervals. Also allow apps to define new functions, such as popularity tracking, access logging, or user-dependent responses. Built a peer-to-peer DHT: Comet.
  • Goals: flexibility, isolation, safety and lightweightness.
  • The main abstraction is an “active storage object” (ASO), which contains a value and a small set of handlers that are called on a put or get to the object (see the sketch after this list). Comet is built on top of a traditional DHT as an active runtime that runs ASO handlers in a sandbox.
  • Built several applications on top of the ASO extension API, including Vanish, Adeona, file sharing, etc. The extension API is 12 functions: 4 callbacks, 5 host interaction functions and 3 DHT interaction functions (get/put/lookup). Deliberately chose not to allow arbitrary network I/O because of the security implications.
  • The sandbox is less than 5 KLOC, based on a subset of Lua with unnecessary functions removed. Resource consumption is limited by counting per-handler bytecode instructions and memory consumption. Incoming and outgoing requests are rate-limited. DHT interaction is limited (to prevent traffic amplification and DDoS attacks): ASOs can only talk to neighbours and cannot issue recursive requests.
  • Built on top of Vuze using Lua. Deployed on PlanetLab. Discussing mainlining it in Vuze with the main engineer.
  • Considered three examples. App-specific customisation of the replication scheme. Did Vanish-specific replication in 41 lines of Lua.
  • A context-aware storage object is used to build a proximity-based distributed tracker (returning close nodes rather than random nodes). ASO uses 37 lines of Lua. Using the proximity-based tracker gives better latency between paired nodes than a random tracker.
  • A self-monitoring DHT: a node monitors its neighbours by periodically pinging them. This allows you to collect measurements of DHTs. Ran an experiment to get Vuze node lifetimes.
  • Q: Have you thought about macrosecurity (as opposed to microsecurity considered in the talk)? We added global limits to ASOs (not just per-ASOs limits), that is just standard rate-limiting for DHTs. These are, we believe, the right mechanisms to manage the overhead of Comet. But we need more experience to determine limits. What about fairness? Haven’t looked at this yet.
  • Q: What happens when the metadata in the ASOs diverge between replicas? In today’s DHTs, replicas are actually inconsistent (may be stale). Opportunity here for different replicas to merge with each other using an ASO.
  • Q: Were you able to enable certain classes of computation in Lua? Removed everything except for three classes of things: simple math, string manipulation and table manipulation. A modified Lua interpreter counts the number of instructions. Is there an interesting set of things that you might add back in, and what applications might that enable (or is it a gray area)? One example was to do true internet measurement with arbitrary network I/O, but we decided not to add these things.
  • Q: Does the use of a high-level language cause performance unpredictability that might harm the 99.9th percentile latency? An interesting issue, but we haven’t looked at this yet. May choose to use non-GC’d languages in a real deployment.
  • Q: Do you persist that ASO state? Currently stored in-memory, and use replication to gain persistence.
  • Q: What happens to reliability if things actually get cancelled? Applications should expect that things may fail at any given point in time. Can use tricks like Vanish does to get reliable deletes.
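Comet's handlers are written in sandboxed Lua, but the shape of an active storage object can be sketched in Python as below: a per-object value plus small callbacks the runtime invokes on a get and on a timer. The callback names and the Vanish-style lifetime logic are illustrative assumptions, not the exact Comet API.

```python
# Illustrative ASO with an application-chosen lifetime (Vanish-style) instead
# of the DHT's fixed timeout, plus a timer hook for self-monitoring.

import time

class ActiveStorageObject:
    def __init__(self, value, lifetime_seconds):
        self.value = value
        self.created = time.time()
        self.lifetime = lifetime_seconds     # app-specific, not the DHT default

    def on_get(self):
        # Refuse to return (and effectively delete) expired data.
        if time.time() - self.created > self.lifetime:
            self.value = None
        return self.value

    def on_timer(self):
        # Periodic callback: could e.g. ping replica neighbours and record
        # measurements for a self-monitoring DHT.
        return {"alive_for": time.time() - self.created}

if __name__ == "__main__":
    aso = ActiveStorageObject(b"secret-key-share", lifetime_seconds=3600)
    print(aso.on_get() is not None)      # True while within the lifetime
```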

SPORC: Group Collaboration using Untrusted Cloud Resources

  • Cloud deployment is attractive because it allows collaboration in user-facing applications, but the price is that you must trust the cloud provider.
  • Goal is to support practical cloud apps for real-time collaboration (and offline working), but using untrusted servers.
  • Application logic has to be moved to the clients, and every client has to maintain a copy of the local state. The server’s role is limited to storing encrypted data and processing client updates.
  • To keep clients in sync, use “operational transformation” (a technique from 1989): the application provides an app-specific transformation function that is used to fix up non-commutative actions (such as deleting two different characters concurrently). This can synchronize arbitrarily-divergent clients (see the sketch after this list).
  • Digital signatures aren’t enough, since the server could equivocate, so employ “fork* consistency”, which gives linearizability on an honest server, and detectable equivocation after 2 messages from a malicious server. Clients embed their history (hashed) in messages they send. Server may fork clients into partitions, but can’t then unfork them. Need to detect forks out of band. SPORC provides a way to recover from malicious forks.
  • Internally, the SPORC library keeps a history of all the operations that it sees, marked as committed or pending; this history can be used to reconstruct the client’s local state. Committed operations are fork* consistent across all clients, and pending operations are causally consistent. Clients enforce this by checking sequence numbers on each operation.
  • To commit an operation, the client encrypts and signs it, and sends it to the server. The server keeps a sequence of encrypted operations which represents the canonical state. It then broadcasts the operation to the clients. Finally, at each client, the operation needs to be transformed using OT (since the recipient’s state has probably diverged). Only after these transformations can the operation be applied to the local state and added to the end of the committed operations.
  • For access control, we cannot trust the server. Need to preserve causality and concurrency only makes it harder. Access control enforced using encryption with a single shared symmetric key shared between the clients. ACL changes are committed in the same way as update operations. All users have a public/private keypair in addition. Removing a user is a little more complicated than adding one: need to change the encryption key for subsequent changes and encrypt it under the public key of all of the remaining users, and also encrypt the old key under the new key (so it can be provided to any new users to access old operations).
  • If two people want to remove different users concurrently, the server will order them, but this creates a challenge for encrypting the new key. Need dependency pointers between operations and barriers: if a dependency pointer crosses a barrier, the operation will be rejected, which maintains the chain of old encryption keys.
  • To recover from a fork, use OT to resolve malicious forks by treating them as normal divergences.
  • Implemented as a client library and a generic server. The generic server can support any application since all it does is push encrypted data around. Applications are defined as operations and a transformation function. Implemented a Java CLI client and a browser-based version (using the Google Web Toolkit).
  • Evaluated on 8-core 2.3GHz AMD boxes, with one for the server, and 4 for clients. Tested the Java CLI version. Looked at latency, server throughput and time-to-join. Connected using 1G Ethernet.
  • Latency is <30ms for a text editor on up to 16 clients with low load, and slightly worse if everyone writes. Broke this down into various components: the most expensive thing is the RSA signature on outgoing messages (about 10ms per message). Could use a faster implementation like eSign to get this down into the microseconds. Client queuing delay gets worse as the number of clients increases on a write-everyone workload.
  • Server throughput gets up to 20MB/s as the payload size is increased to 16KB.
  • Q: What happens if an evil person about to be removed tries to add more evil users? The way we deal with these situations is to have multiple privilege levels (admins, writers and readers) with ACLs implemented on the client side. This is only a problem if you have a malicious admin.
  • Q: Why use a specialised server rather than just using a cloud storage provider or a DHT? In addition to storage, you need a way to keep clients in sync. We thought about doing a completely decentralised system. Advantage of a server is to get timely commits. Server is “highly-available”.
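The operational-transformation step above, restricted to character insertions, looks roughly like the sketch below: a pending local operation is shifted past any committed remote operation so that all clients converge on the same document. Real OT (and SPORC's use of it) also handles deletions, insert-vs-delete interactions and position ties.

```python
# Minimal OT sketch for insert-only text edits.

def transform_insert(local_pos, remote_pos):
    """Adjust a pending local insert against a committed remote insert."""
    return local_pos + 1 if remote_pos <= local_pos else local_pos

def apply_insert(text, pos, ch):
    return text[:pos] + ch + text[pos:]

if __name__ == "__main__":
    doc = "abcd"
    remote = (1, "X")                     # committed first by the server
    local = (3, "Y")                      # pending at this client
    doc = apply_insert(doc, *remote)      # "aXbcd"
    local_pos = transform_insert(local[0], remote[0])
    doc = apply_insert(doc, local_pos, local[1])
    print(doc)                            # "aXbcYd" on every client
```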
