SOSP 2007: Day 2

I’ll skip doing a review of the banquet last night, so it’s on with the papers!

Software Robustness

Bouncer: Securing Software by Blocking Bad Input

  • Filters block bad input before it is processed. Low overhead and no false positives. Programs can keep running even when under attack.
  • Performs symbolic execution of programs to perform analysis and generate input filters. Works on machine code.
  • Pre-condition slicing: compute a subsequence of trace instructions, executing which is sufficient to create a vulnerability. Find instruction with vulnerability (e.g. call to sprintf), then move backwards through the trace, adding the instructions on which the vulnerable instruction depends.
  • Deployment in a distributed scenario. When an exploit is found, run Bouncer, calculate an improved filter, and deploy this. Or run centrally on a cluster to find exploits in parallel.
  • Nirvana generates traces. Phoenix implements slicing.
  • Evaluated against four real vulnerabilities (including the Slammer worm). Ran experiment for 24 hours.
  • Filters have no false positives. Found perfect filters for two vulnerabilities. Some false negatives for two others.
  • Throughput is much closer to 100% when using filters, compared to having to restart on each attack. Can still maintain 80% throughput with 18000 attack probes per second.

Triage: Diagnosing Production Run Failures at the User’s Site

  • What to do with a production run failure (send error report)? At present, we just send a core dump, which takes a lot of manual effort to diagnose. At the same time, these failures cause damage to end users.
  • Reproduction is hard, because user environment is unique. Furthermore, there are privacy concerns.
  • Core dumps tell you what the failure is. Bug detection tells you about some errors. Diagnosis is a means of tracing back to the underlying fault, but existing tools are offline (e.g. requiring programmer input).
  • To diagnose: (i) need information about the failure (fault, error, propagation tree), but off-site techniques don’t work on-site; (ii) need guidance about what to do next, the kind of analysis to perform, the interesting variables to investigate (but there is no programmer to do this, so decisions must be taken automatically); (iii) need to be able to test “what-ifs”, such as changing input and seeing what happens, but most techniques focus on replaying the same thing.
  • Triage enables on-site diagnosis, using systems techniques to make offline analysis tools useful on-site, addressing the three challenges. It introduces a new technique called delta analysis. Tested in a human study with real programmers and real bugs.
  • Uses checkpointing and re-execution to capture the bug. Allows replaying the failure repeatedly, using different analysis tools. Little overhead in the normal-run case. Checkpoints every 200ms, keeping 20 previous checkpoints, in memory to lower overhead.
  • Uses a human-like “diagnosis protocol”. Repeated replay enables an incremental diagnosis. e.g. If the bug doesn’t always repeat, it may be a race condition.
  • But the replay isn’t identical: it can either be deterministic (plain), loose (tolerating some variance, such as the introduction of analysis tools which change the syscalls) or wild (including potentially large variations, maybe changing input or the code itself). Delta analysis is similar to diffing a pair of (successful and failing) runs. Triage doesn’t need to be safe when it replays execution: it only cares about why the bug happens.
  • Failure analysis variety of methods (bounds checking, taint analysis, symbolic execution, etc.) and delta generation (rearrange allocations, drop inputs, mutate inputs, drop code(!)).
  • Delta analysis: compute the basic block (of code) vector (1 if executed, 0 if not), and subtract them. The two closest runs are diffed (finding the edit distance and shortest edit script). So a basic block that differs could for example be to return NULL, and it is then dereferenced.
  • Human study: 15 programmers from faculty, researchers and grad students. Measured times to repair bugs with and without Triage. (People got core dumps, sample inputs, instructions for how to replicate, and access to debugging tools.) The evaluation didn’t expect the participants to work out how to replicate the bug, and set a maximum time.
  • 47% time saving on diagnosing real bugs, significant with probability of null hypothesis < 0.01%.

/* iComment: Bugs or Bad Comments? */

  • Many bugs are due to mismatches between the code and the programmer’s assumptions. Such as assuming that a lock is held when a piece of code is executed. Comments express these assumptions. 20% of Linux is comments (excluding blank lines and copyrights), but they are not utilised by compilers or bug detection tools.
  • However, comments are imprecise, compared to code, and cannot be tested. Therefore they tend to become less reliable as software evolves. Nevertheless, they are easier to understand for programmers. So wrong comments are likely to mislead programmers. Difficult to infer assumptions from source code (but see the MUVI paper from yesterday).
  • Mismatches indicate either bugs or bad comments. Challenge is to understand the natural language of comments. NLP techniques are not enough: POS tagging (97% accurate), chunking (90%) and semantic role labelling (70%). But they assume correct grammar, which comments don’t always have.
  • Therefore, also use machine learning, staistics and program analysis (with NLP). 1832 rules extracted, detecting 60 new bugs or bad comments (19 of these confirmed). Looking at call and locking bugs only.
  • Focus on the comments that contain rules (rather than explaining code), as these are more likely to be inconsistent with the code.
  • Example rule: must be held before entering .
  • Comment classifier is trained using a decision tree building algorithm. Some manual training is required (at least once per topic).
  • Validated using Linux, Mozilla, Wine and Apache, looking at lock- and call-related topics. Overall, 60 mismatches, 33 (12 confirmed) bugs, 27 (7) bad comments and 38 false positives, from 1832 rules. Training accuracy is between 90 and 100% (for lock-related comments). Cross-software accuracy is between 78 and 90% (saves the amount of training).

Distributed Systems

Sinfonia: A New Paradigm for Building Scalable Distributed Systems

  • The protocols in current distributed algorithms are complicated, error-prone, and often difficult to understand. We want to avoid such protocols. Focus is on data centre systems, with small and predictable latencies, and occasional crashes. Focus also on infrastructure applications, like cluster file systems, lock managers and communication services.
  • Sinfonia is a data sharing service, comprising a set of memory nodes which each export a linear address space (no structure imposed). Protocol design becomes the easier problem of shared data structure design. Minitransactions are used to access the memory nodes.
  • Minitranscations have ACID properties, balance power and efficiency. Reduce network round-trips, but are still flexible and easy to use. Semantics: perform a bunch of compare items (for equality over an address range), if all of these match then retrieve the read items and perform the write items. Execution of the transaction is piggybacked on the two-phase commit process, to minimise round-trips.
  • Commit coordinator runs at the application, which may crash, so a new two-phase commit protocol was necessary. Based on all memory nodes logging a “yes” vote, rather than the coordinator logging a “commit” decision.
  • Applications: sinfoniaFS (a cluster file system) and sinfoniaGCS (a group communication service - i.e. a chatroom with ordered notifications).
  • sinfoniaFS performs as well as LinuxNFS in the Andrew benchmark. sinfoniaGCS scales better than Spread.

PeerReview: Practical Accountability for Distributed Systems

  • Fault in a distributed system with distributed state can lead to incomplete information. In the general case, there could be multiple admins with different interests (and possible malicious intent).
  • Many faults are not fail-stop but instead change behaviour of a node. Dealing with general faults is difficult, because of a lack of control over the system. How can faults be detected or the faulty nodes identified. Then how do you convince others that a node is faulty (or not faulty) and must be fixed.
  • Real-world systems rely on accountability. e.g. signed receipts, double-entry bookkeeping and auditing. These can be used to detect, identify and convince about faults.
  • A fault is when a node deviates from expected behaviour. We want a system that generates a proof of misbehaviour against a faulty node. Difficult if we consider a fault that affects only a node’s internal state (requires an online trusted probe at each node). So we focus on observable faults: those that causally affect a correct node, and don’t require a trusted component.
  • Verifiable evidence: a proof of misbehaviour or a challenge that a faulty node cannot answer.
  • Accountability: whenever a fault is observed by a correct node, the system eventually generates verifiable evidence against a faulty node.
  • Implementation: as a library, PeerReview. Assumes that nodes modelled as a collection of deterministic state machines, with each node holding a reference implementation of all state machines. Also that correct nodes can eventually communicate and nodes can sign messages.
  • All nodes log all inputs and outputs. Each log is audited periodically by witnesses (a set of nodes for each node). If misbehaviour is detected, evidence is generated and distributed to other nodes.
  • Log entries form a hash chain to prevent forgery of a log. Keeping multiple logs or forking logs is prevented by checking the signed hashes to ensure that a single linear log is kept. Faults in a log are recognised by replaying the state machine with the known inputs and checking outputs against the log.
  • PeerReview guarantees that faults will be detected and good nodes cannot be accused.
  • Applicability: NFS server in the Linux kernel, overlay multicast, and P2P email system. Deals with small latency-sensitive requests, large transfers, and complex, large and decentralised systems.
  • Dominant cost depends on the number of witnesses per node.
  • Normal P2P email scheme scales as O(log n) with n being the number of nodes. PeerReview can scale O((log n)^2) if you accept a 99.999% probability of detecting faults. (100% accuracy is prohibitive (linear, I think)).

Dynamo: Amazon’s Highly Available Key-Value Store

  • Amazon architecture: loosely coupled SOA, with stateful services that manage their own state. Requirements on latency, availability and scale.
  • State management is the primary factory affecting scalability and availability. Data is only ever accessed by primary key, in self-describing blobs, so key-value is better than RDBMS (don’t need a query optimiser, triggers, etc.). RDBMS scale up, not out, and (strong, transactional) consistency is less important than availability.
  • Dynamo requirements: always writable (accept writes during failures), e.g. to add an item to a shopping cart must always be possible; user-perceived consistency (so can’t be too weak on consistency); performance with latency in the 99.9th percentile; low cost; incremental scalability; and tunable knobs for cost, consistency, durability and latency. No production-ready systems met these requirements
  • Replicated DHT with consistency management, using: consistent hashing, optimistic replication, sloppy quorum; anti-entropy mechanisms; and object versioning.
  • DHT is similar to Chord, using MD5 as the hash function. Clique for routing between replicas. Each storage node given many locations on the ring for load-balancing. Trade-off between balancing load and keeping membership tables concise.
  • Configurable for different scenarios e.g. consistent, durable, interactive user state; a high-performance read engine; or a distributed web cache.
  • Able to meet a 500ms SLA on latency, for the implementation of a shopping cart.
  • Drawbacks: no indefinite scaling, not appropriate if transactional semantics required, more challenging to program than using ACID properties.

System Maintenance

Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System

  • Typically test on a couple of vendor machines, then a bunch of beta tests in the outside world, but despite this, the software will still fail on user machines in the outside world (due to dependencies, odd configurations, etc.). Mirage enables vendors to cluster outside world machines in terms of the their configurations/environment, so as to ensure coverage of beta testing before public release (and hence improve reliability). This reduces “upgrade overhead”: the number of machines that fail.
  • Environment must be identified and fingerprinted, before clustering. Then the order and speed of deployment must be chosen by the vendor.
  • In a cluster, all machines “behave identically wrt an upgrade”. Benefit depends on quality of clustering and the quality of testing.
  • Identification: instrument the application to determine the resources (libraries, config files, environment variables and dependent applications) that it uses during a normal run. Filter out noise (log files, temp files and data) based on heuristics and vendor-provided rules. A library becomes a (name, version, hash of binary) tuple. The hash is used in case there are different builds, for example.
  • Tradeoff between low upgrade overhead (few failures) and fast deployment. Choice between deploying in parallel to many clusters (fast, but high overhead), or sequentially (slow, but low overhead). Fixing a problem for one environment might also fix it for another environment (benefit of serialisation).
  • Evaluated the quality of clustering (identification and clustering), and staged deployment (the upgrade overhead and deployment speed).
  • Accurate identification: between 200 and 850 environment resources (on firefox, apache, php and mysql), between 0 and 7 vendor rules, yielded 0 errors in any case.
  • Accurate clustering: 21 different environments, with 2 distros of linux, PHP and Apache and various MySQL configurations. Yielded 0 misplaced machines within 15 clusters. Varying the fingerprinting granularity can vary the number of clusters.
  • Controlling the tradeoff: simulation with 100,000 machines, 3 problems and 2 staging protocols. Upgrade overhead reduced very significantly, with, in the worst case, a 25% increase in deployment time.
  • Studying “real machines” and deployments is the subject of future work.

AutoBash: Improving Configuration Management with Operating System Causality Analysis

  • Current approach to configuration management is to ask friends, search online, read manual, try potential solutions. Frustrating, time consuming and tedious! Hard to undo a wrong solution or know if the problem was solved. A solution can cause new problems.
  • AutoBash: tries many solutions at once, provides an undo capability, explains to the user how it solves a problem and automatically runs regression tests.
  • When a problem is detected, there are two modes: Replay mode which automatically searches for a solution; and Observation mode which helps the user to fix the problem. All the time, there is a Health monitoring mode.
  • Observation mode: a modified bash where user tries to fix a problem by typing a command then testing if the app works, then possibly undoing the command and rolling back the command, before trying again. Has “predicates” which test if an application functions correctly and returns true iff the test passes. Idea is for application to be shipped with a testsuite of predicates. Speculator (SOSP’05) is used to perform process-level speculative execution, and make predicate testing safe (undoable). Shell provides a rollback command, used in conjunction with the speculative execution, to undo the attempted command.
  • Regression testing is handled by running all predicates in some database. This can be slow, however, so causality tracking is used to find which predicates might be affected by a change.
  • Tracking causality: Output set of kernel objects that an action causally affects, and Input set of kernel objects on which a predicate causally depends. If the output set of an action and input set of a predicate intersect, then the predicate must be checked. The output set can be tracked by syscall tracing, and then trimmed by excluding temporary objects (such as processes, or temporary files). Similarly, this can be done to determine the input set of a predicate.
  • Generating a causal explanation: the important actions are the ones whose output sets intersect with the input sets of the newly-passing predicates.
  • Replay mode: assume that vendors ship “common solutions” with their software, so that these may be automatically tested (safely, using speculative execution) in the background, while the user gets on with work.
  • Evaluation: overhead of speculative execution and effectiveness of causality analysis? Looked at CVS, GCC cross compiler and webserver. Created 10 bugs and 10 solutions.
  • Very little overhead (non-significant) of speculative execution. Causality analysis almost halves the total replay time, with a slight increase in the predicate testing time.


Integrating Concurrency Control and Energy Management in Device Drivers

  • Concentrating on wireless sensor networks. Concurrency of I/O operations are considered (synchronous versus asynchronous). Energy management is considered in terms of the necessary power state of devices to perform I/O.
  • The more workload information the app can provide the scheduler, the less energy needs to be used.
  • Hard to tell OS about application workloads: API extensions for hints to energy management. Done for CPU voltage scaling and disk spin-down. Sensor networks need a unique solution, however: harsh requirements, small power source, and need to run unattended for periods ranging from months to years. 1st generation OSs push all decisions to the applications.
  • ICEM: a driver architecture that automatically manages energy, implemented in TinyOS 2. New concept of Power Locks: split-phase (asynchronous) locks with integrated energy and configuration management. Dedicated, shared and virtualised drivers and a component library for building them.v Energy efficiency is 98.4% of hand-tuned implementation. Also much easier to program.
  • Case study on the TMote platform: three interfaces to six total devices. Logging application with producer, which writes sensor samples out to flash every 5 minutes, and a consumer that sends all samples from flash over the radio every 12 hours. Code complexity of hand-tuned implementation is very complex: need to order the power-on and -off of each different device. ICEM hides these calls, greatly simplifying the code.
  • Split-phase I/O operations: single thread of control, where read() call is given a callback, readDone(), that allows the driver to poke the application when it is done. These can then be scheduled efficiently, because the driver has information about what has to be done before it considers switching the device off.
  • Virtualised driver: only a functional interface. Assume multiple concurrent users, buffering requests for concurrency management, and managing energy by looking at pending requests. Such drivers have longer latencies, because the underlying device has to have synchronous access. Suitable for high-level hardware (like a block device?)
  • Dedicated driver: functional and power control interfaces. A single user only with no concurrency control and explicit energy management. Suitable for low-level hardware.
  • Shared driver: functional and lock interfaces. Multiple users, explicit concurrency control (using the lock), and implicit energy management based on pending requests.
  • Power locks: hardware-specific configuration and power interfaces, and a lock interface. Talk to a dedicated driver. First, request access to the lock, which powers on the underlying device and configures it and calls back the application (split-phase). Other locking requests are queued after the current user. Power down only happens when all pending lock requests have been fulfilled and released.
  • Arbiter: performs concurrency control and queuing/scheduling (FCFS and round-robin). Configurator: calls h/w-specific configuration interface on the dedicated driver. Power Manager: powers down device when it falls idle, and powers it back up when a new lock request comes in (with immediate and deferred policies).
  • Evaluation: comparing hand-tuned, ICEM, optimal serial and worst-case serial orderings, in the sensor logging application described above. ICEM performance very close to hand-tuned, and better than both serial versions.
  • ICEM overhead is 5.60% of total sampling energy. Node lifetime for ICEM is between 98.4% and 100% of the hand-tuned version, as the sampling period is increased.

VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems

  • Two key benefits of integrated power management: power savings are proportional to cost savings, and cooling capability is proportional to rack utilisation. Information and capabilities exported ACPI have led to application-specific power management policies. How do we do this in the context of virtualisation?
  • Problems: what should be exposed? Are there issues with isolation? Are there power benefits arising from resource sharing? VPM can give a 34% power improvement.
  • Data centres have a heterogeneous collection of platforms. Virtualisation should abstract this (e.g. because of migration), but this is tricky with different power characteristics in the underlying hardware. The workload stays the same, so we should virtualise the power management states: therefore we can have a consistent view of manageability across migrations.
  • What about isolation? Disassociate changes to virtual state from the management of the physical state: front-ends in DomUs, back-end in Dom0, the latter of which manages the physical state. Notion of soft scaling which allows the possibility of “scaling” a virtual CPU, which enables consolidation of soft-scaled virtual resources. (Two soft-scaled VCPUs on a single full-power CPU, but you might be able to turn off another physical CPU, for example.)
  • VPM rules in Dom0 which control the physical PM. Virtualisation layer policies are driven by VPM state requests from the VMs, which makes an implicit feedback loop. Might need some learning algorithms to derive rules.
  • e.g. Dom1 changes VPx state via a hypercall and creates a VPM event (comprising the VM soft state, and a shadow state (the physical manifestation of the current soft state). This leads to a new set of shadow states being calculated.
  • Workloads: tiered web service (RUBIS) (Linux on-demand gonvernor); transactional workload (select policy based on the transaction processing rate and the amount of slack); web service with quality of information metric (Travelport (Worldspan)) (policy based on the QoI and processing time of requests across different client classes).
  • Complex analysis has up to 4% degradation on throughput. Seems to work with multiple VMs too. [Lots of fairly convincing graphs but I think it'll be necessary to take a good look at the paper to judge what we're actually being presented.]

Leave a Reply