Archive for October, 2009
Monday, October 19th, 2009
Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations
- Goal to make large-scale programming simple for all developers.
- Write a program in Visual Studio; Dryad(LINQ) takes care of shipping it to the cluster, fault tolerance, etc.
- Wrestling with the implementation of GroupBy-Aggregate. GroupBy takes a sequence of objects with some kind of key, and groups them together by key. Similar to MapReduce.
- Naïve execution plan splits map and reduce into two phases, with an all-to-all data exchange between them. However, applying the reduce after this exchange results in a large amount of network I/O.
- A better idea is to do early partial aggregation: use an aggregation tree to achieve this. Reduces the disk and network I/O by up to one or two orders of magnitude.
- Want to automate this optimization. Programmer writes the obvious code and the system takes care of the rest.
- Notion of decomposable functions is key to this. Need an initial reducer that is commutative, and a combiner that is commutative and associative.
- How do we decompose a function? Two ways: iterator and accumulator interface. Choice can have a significant impact on performance.
- How do we deal with user-defined functions? Try automatic inference, but fall-through to a good annotation mechanism. Implement simple function and annotate it with the initial reduce and combiner implementation function names.
- Hadoop interface for this adds quite a lot of complexity. Java’s static typing is not preserved.
- Iterator interface has to build an entire group and iterate through it. Accumulator can discard the inputs if they are not needed. Oracle uses this approach, implemented with stored procedures. Hard to link in user-defined procedures.
- Automatic decomposition looks at the expression and checks whether all leaf function calls are decomposable.
- Want our approach to have good data reduction, pipelining, low memory consumption and parallelisability (multicore). Define six strategies, accumulator- and iterator-based.
- Iterator PartialSort approach. Idea is to keep only a fixed number of chunks in memory; processed in parallel. The bound on memory makes pipelining possible. Strategy close to MapReduce.
- Accumulator FullHash approach builds an in-memory parallel hash table with one accumulator object per key. Objects are accumulated immediately. This gives optimal data reduction and memory consumption proportional to the number of keys, not records. This is the DB strategy (DB2 and Oracle).
- Evaluated with three applications: WordStates, TopDocs and PageRank on a 240-machine cluster. Accumulator-based implemen—
Tuesday, October 13th, 2009
Better I/O Through Byte-Addressable, Persistent Memory
- DRAM is fast, byte-addressable and volatile, but disk/Flash are non-volatile, but slow and not byte-addressable. BPRAM is all three!
- Phase change memory is a promising source for this. Bits encoded as resistivity. Access latency in the nanoseconds, and far better endurance than flash. Designed BPFS for BPRAM.
- Goal: FS ops commit atomically and in program order. Data is durable as soon as the cache flushes. Use short-circuit shadow paging to get this (new consistency model).
- Eliminate DRAM buffer cache; use L1/2 instead. Put BPRAM on the memory bus. Provide atomicity and ordering in hardware.
- Both BPRAM and DRAM are addressable by the CPU: physical address space is partitioned into volatile/non-volatile.
- BPFS gets better performance than NTFS on the same media.
- What happens on crash during update? Short-circuit shadow paging comes into play (contrast with journalling or shadow paging). Overhead of journalling is that all data (or metadata) must be written twice. Shadow paging uses copy-on-write up to the root of the FS: drawback is that writes propagate all the way back to the root (multiple updates), and small writes have a large copying overhead.
- Short-circuit shadow paging makes in-place updates where possible. Uses byte-addressability and atomic, 64-bit writes. Both in-place updates and appends are made simple by this technique. Cross-directory rename does bubble up to the common ancestor.
- Problem: if data is cached in L1/L2, the ordering of cache eviction can lead to inconsistent states. Also, writes from the cache controller might not be atomic.
- So add two new hardware components to the CPU and cache controller. Epoch barriers are used to declare ordering constraints and they are much faster than a write-through cache. Also add capacitors to DIMMs which allow writes to propagate even after the loss of power.
- Do CoW then Barrier then Commit. Paper also shows how to make it work on multiprocessors.
- Built and evaluated on Windows as an in-kernel file system.
- Microbenchmarks (append n bytes and random n-byte write) compare NTFS/Disk, NTFS/RAM and BPFS/RAM. (Using DRAM in this experiments.) BPFS is significantly faster than NTFS on disk, and NTFS isn’t syncing so it isn’t durable!
- Postmark benchmark compares NTFS/Disk, NTFS/RAM, BPFS/RAM and (projected) BPFS/PCM. BPFS/PCM is much faster than both NTFS/Disk and NTFS/RAM. Analytical projection based on sustained throughput of PCM.
- Q: are the storage requirements of a database and of a file system converging when you have this hardware available? The changes to the hardware will be applicable to other sorts of storage systems like a database, not just filesystems.
- Q: have you thought about how to expose more capabilities of this medium to the applications (not just sequential reads and writes)? Applications are currently written in terms of what is efficient.
- Q: how do you atomically deal with a free-list? We don’t have a free-list. Don’t need to keep track of so many data structures because the medium is so fast, which lowers the consistency overhead.
- Q: where do you go next for multiprocessors and clusters? One of the goals was to have multiple concurrent threads operating on the FS at the same time.
- Q: is there a risk that data in the capacitor gets garbled after the machine gets switched off? [Taken offline.]
- Q: could you go even faster without having the consistency guarantees, for applications that don’t need it? There’s always a trade-off here.
- Q: how do you do mmap, and is meddling with L1/L2 caches going to be expensive? Haven’t implemented mmap yet, but we would have a much better guarantee of durability. Changes to the cache, in the paper have been looked at in terms of interference, and performed well.
- Q: why didn’t you benchmark against a file system using a B-tree or a red-black tree that can take advantage of random writes [why did you compare against NTFS]? NTFS is widely used, and it’d be interesting to compare against others.
Modular Data Storage with Anvil
- Data storage drives modern applications (everyone has a database) and they are frequently a bottleneck. Hand-built stores outperform general-purpose ones by up to 100x. Observe that changing the layout can substantially improve performance. Custom storage is hard to write, especially in order to provide consistency guarantees. Can be prohibitively expensive to experiment with new layouts.
- Need a simple and efficient modular framework to support a wide variety of layouts.
- Fine-grained modules: dTables. These are composable to build complex data stores. All writing is isolated to dedicated writable dTables, which incidentally has good disk access properties.
- dTable = key/value store. Maps integers/floats/strings/blobs to blobs. Provides an iterator to support in-order traversal. dTables used by applications and frontends, and also other dTables. Can transform data, add indices or otherwise construct complex functionality from simple pieces.
- Example of a mapping from customer IDs (mostly contiguous) to states. Start with an array dTable for the common case. Layer a dictionary on top of that (maps state names to array indices). Have an exception dTable for the case where a customer isn’t in one of the 50 states, and a linear-search dTable for their residences. But to make this fast, layer a B-Tree index dTable on top of the linear store.
- So far just read-only. Updates are hard to do transactionally. Need to implement a write-optimized dTable. Fundamental writable dTable is the journal dTable. New data is appended to a shared journal; data are cached in an in-ram AVL tree. The journal is digested when it gets large. Transaction system is described in the paper. Layer this over read-only dTables.
- Managed dTable goes at the top. Also have a Bloom filter dTable to deal with multiple overlaid read-only dTables.
- Many additional dTables listed in the paper.
- Evaluate the effect of simple configuration changes on performance (modularity). Key lookup workload, comparing contiguous versus sparse keys. Contiguous good with arrays; sparse good with B-trees. Also show the benefit of layering an index on top of a linear store. Also show the low overhead of the Exception dTable.
- Evaluated by running TPC-C. Replaces a SQLite backend with Anvil. Shows that Anvil outperforms both the original backend and MySQL. Split read and write stores perform well.
- Evaluated the cost of digesting and combining. These can be done in the background, taking advantage of additional cores and spare I/O bandwidth. Measured the overhead when doing a bulk load (1GB) into the dTable, with digests every few seconds.
- Q: why didn’t you compare either performance or features against BDB, which is very similar? Didn’t find it as easy to construct read-only data stores in BDB: creating customisable data stores has a lot of transactional overhead.
- Q: did you evaluate iteration? In paper. How does the performance depend on the order of updates? Not sure what you mean. Did look at overlay iteration in the paper, which ought to be the most expensive (due to key lookup cost), and overhead was only 10%
- Q: how did you make it so that the creator of a new dTable doesn’t have to consider ACID semantics? Most dTables are read-only, so you don’t need to worry about this (like shadow paging). The managed dTable has a small hook that enforces transactional semantics. And read/write dTables? Don’t envision that people will need to create these. Could implement your own, but this would miss the point.
- Q: how should developers write recovery tools for systems like Anvil? Anvil includes such a tool that handles recovery for you. Read-only semantics makes this much simpler.
Operating Systems Transactions
- [Unfortunately, I missed this talk due to having an obligation to man the registration desk. I'll try and track down the video and update this later.]
Do You Have to Reproduce the Bug at the First Replay Attempt? — PRES: Probabilistic Replay with Execution Sketching on Multiprocessors
- Concurrent programs are hard to write. Multi-core makes concurrent programming more important, and bugs more common. However they are non-deterministic (requiring e.g. a special or improbable thread interleaving). This makes it hard to reproduce them.
- Deterministic replay for uniprocessors is relatively easy: only need to record inputs, thread scheduling and return values of system calls. On multiprocessors, this is much more challenging. e.g. Simultaneous memory accesses are another source of non-determinism.
- Previous proposals introduce new hardware, which don’t exist in reality. Or there are software-only approaches but they have up to 100x slow-down.
- Ideally want to reproduce a bug with a single replay and no runtime overhead. But what if we relax this slightly?
- Idea 1: record only partial information during a production run. Idea 2: push the complexity into diagnosis time. Idea 3: use feedback from unsuccessful (non-reproducing) replay runs.
- Just record a sketch during the production run. When a replay goes off the sketch, terminate it immediately and feed back information about why it deviated for refining the next replay. Can eventually reproduce the bug with 100% probability.
- Several different methods for sketch recording. Spectrum of approaches from UP deterministic replay to full MP DR. Can record e.g. synchronization points, or basic blocks, or more: build up this information during the replay runs.
- At replay time, the partial information replayer consults the sketch to see that recorded global ordering is obeyed. How do we know whether a replay is successful? Use a failure detector based on crash, deadlock or incorrect results. This can also detect unsuccessful replay runs.
- When a replay attempt fails, start it again. But could do something different the next time: a random approach would just leave it to fate, but PRES is more systematic. Failed reproduction is due to un-recorded data races. The feedback generator captures these races and tweaks them in future runs. Start with many candidate races and filter them down.
- Implemented PRES using Pin. Evaluated many different applications (desktop, server and scientific). Overhead is around 18%, which is barely more than baseline Pin overhead. Macrobenchmarks show that PRES gets much higher throughput for server applications (MySQL).
- Effectiveness: UP algorithm doesn’t detect any bugs within 1000 replays, whereas PRES gets 12/13 in 10 attempts. Feedback generation is crucial to effectiveness. Race filtering also effective.
- Q: could you also use execution traces to help track down which parts of the execution trace cause the bug to not happen, and guide the programmer? Good idea.
- Q: could you apply PRES to virtual machine replay? Also a good point. The work could be integrated with virtual machines. Could you rollback an execution, is it precise enough? Depends on the fine-grainedness of the recording scheme used. What is the inherent overhead in collecting a trace with sufficient fidelity to do backtracking? If there is a lot of lock operations, the low-overhead approach (SYNC) could work. But if there is no synchronization, we can’t use this information, and a more heavyweight scheme would be needed.
- Q: why do the results differ so much from the next paper? The main idea is similar, but their work is more focussed on static analysis to reduce runtime overhead.
- Q: would you advocate this as a solution for long-running (one year or more) services, as it is often only after this time that they emerge? We can take a checkpoint of the process state, which solves the problem of data accumulation.
ODR: Output-Deterministic Replay for Multicore Debugging
- Debugging non-deterministic software failures is really hard. The problem is how to reproduce these failures for debugging. Model checking/testing/verification could work, but it’s not perfect, and it doesn’t capture everything. So we need deterministic replay.
- Need multiprocess operation, efficient recording, no special hardware and the ability to run arbitrary programs without annotation (especially programs with data races). All related work fails to meet one of these requirements.
- ODR is a user-level replay system, which works in the MP case, has only 1.6x overhead, needs no new hardware and works on arbitrary x86 Linux binaries.
- Often sufficient to produce any run with the same outputs… needn’t have the exact same execution. So the idea is to relax the determinism requirements.
- The classic guarantee is value determinism: replay run reads and writes must have the same values as the original. Relax this to “output determinism”: the replay run produces the same user-visible output as the original. This is not perfect, but still useful for debugging: reproduces most visible signs of failure, and enables reasoning about failure’s root cause.
- How to achieve this? Deterministic-run inference. Basic idea is to translate a program into a logical formula (verification condition). Function of schedule trace, input trace and read trace, returning an output trace. Use a formula solver to yield unknown schedule trace. Scale this by directing the inference using more original-run information. Also by relaxing memory consistency of the inferred run: where values read have nothing to do with schedule order, can use an arbitrary schedule trace.
- Three-dimensional inference design space: memory consistency (strict, lock order or null), query complexity (output, I/O&lock-order, I/O&lock-order&path or determinant), and inference-time (polynomial or exponential). Search- and query-intensive DRI fit into this space.
- Search-intensive is really slow (400–60000x slowdown). Formula generation, not solving is the bottleneck. Use multi-path symbolic execution with backtracking to generate formulas. Each backtrack involves a 200x slowdown. Backtracks are caused by race-tainted branches: wrong choice leads to a divergent, unsatisfying path. Work around by backtracking to the most recent race-tainted branch.
- If we know the path (QI-DRI), inference time improves by 100x but we need to record much more data, so there is a 6x slowdown.
- Future work is to reduce the path-search space, reduce the cost of each backtrack (cut down on race-tainted branch analysis), and parallelize formula generation by forking threads at each divergence.
- Q: if I put in arbitrary values in the memory consistency model, how do I ensure that invariants are maintained? We don’t actually do null consistency, but if we did and invariants were violated the program might crash and output determinism catches this (because the output would be different).
- Q: since your techniques are similar to the last paper, why are the results so different? In our approach, we do race-tainted branch analysis for the entire exectuion, and that is costly. We also do taint-flow propagation. There is more analysis in each backtracking iteration than theirs. We could reduce this cost by, for example, reusing results from previous iterations.
- Q: have you considered changing your static analysis to another algorithm that could significantly improve your formula generation time? We are considering static approaches to formula generation.
- RAMCloud: Scalable, High-Performance Storage Entirely in DRAM. New research project. Zero results. Motivated by wanting to build large scale systems with low latency. Data center style of web application separates the code from the data, in order to scale, but 4–5 orders of magnitude increase in latency. Want this kind of scaling with latency close to memory speeds (sub-microsecond). Basic architecture puts all data in DRAM. Scale using commodity servers. Reckon we can get 5–10us RPC latency end-to-end. Also have a story on durability and availability. Also want to support multiple applications.
Q: effect of other memories? Whichever wins should work with RAMCloud
Q: GMS system from UW 10 years ago? Not familiar with that [taken offline].
- Transactional Caching of Application Data using Recent Snapshots. DB-driven website performance issues: use memcached. Add an in-memory DHT that is very lightweight, and stores application objects (not a DB). DBs provide transactional consistency, but these caches don’t do this. Goal is transactional consistency for accesses to the cache. Idea is to embrace staleness: all read-only transactions to run on stale data. Avoids blocking and improves utilization. This is quite safe, since stale data is already everywhere. Application can control staleness. Add a TxCache library between memcached and the application. DHT values are timestamped, and have a validity interval.
Q: paper at HotStorage from HP Labs in similar area?
Q: how does the DB know the validity interval? Modified DB to track this.
- Chameleon: A self-managing, low cost file system. Targeting home user or small business, who doesn’t want to lose data. Doesn’t know anything about RAID. Cost-sensitive. Deployment scenario has 4 PCs connected by fast LAN, with a broadband connection to cloud storage. There are many ways to replicate, place and encode data. Ideally store data on at least one offline device to avoid vulnerability to viruses. A small, trusted “anti-availability kernel” enforces this requirement. Use linear programming to select and adapt storage configuration: the design space is now even more complicated. Tend towards the optimal solution.
- Sloth: Let the Hardware Do the Work!Looked at embedded OSes used in the automotive industry. OSEK OS is the prevalent real-time embedded OS: event-triggered, priority-driven real-time system. Don’t want to implement a scheduler. SLOTH lets the interrupt subsystem do the scheduling and dispatching work. All threads are implemented as interrupt handlers and have interrupt priorities. Each thread needs an IRQ source. Priorities enable pre-emption. Can implement a bunch of synchronization this way also. System is simple, small (concise implementation and memory footprint) and fast (2–20x).
Q: this looks like very simple scheduling… how would you deal with something more complicated like earliest deadline? Drawback is no blocking system calls, so can’t do everything.
- The case for cooperative kernel threads. Kernels are multithreaded, and drivers have concurrency bugs, which, if they are in the kernel, is bad. Event-based devices drivers need to use continuations to preserve driver context across blocking operations. This becomes very complex, almost as bad as dealing with pre-emptive threads. Cooperative threads give the best of both worlds: atomic execution but allowing blocking. Research showed that drivers are mostly I/O bound, so cooperative threads are appropriate. Implementing this as a “cooperative domain” in the Linux kernel.
Q: Linux does have cooperative thread scheduling available, so how does this interact with the work you are doing? Providing a framework for implementing drivers this way, much nicer.
- Abstractions for Scalable Operating Systems on Manycore Architectures. Tesselation. Goal isn’t just to support heterogeneous hardware, but also provide predictable performance and guarantees for applications. Asymmetrically structured OS: some cores are dedicated as a management unit for keeping track of and scheduling applications. Eliminates the need for per-core runqueues, improves cache locality, decreases lock contention and limits kernel interference with applications. Applications interact with the kernel through remote, asynchronous system calls. Applications make explicit requests for cores, and OS guarantees that they will be gang-scheduled. OS just provides cores to the application, doesn’t need to know about threads. Applications have private memory ranges.
Q: how do you balance the different demands for resources across applications? [Taken offline.]
- System Support for Custom Speculation Policies. Applications run on some speculation infrastructure, which speeds things up. Want to separate policy from mechanism. Typically implemented transparently, which means that you have to be conservative, giving limited opportunities for speculation. Idea is to push the policy into the application. What could an application do that is different from the default? Might allow some output to be uncommitted. Or could commit equivalent-but-not-identical results. Process gets a “speculative fork” interface. Use cases: predicting user actions (predictive bash shell), authentication and user-level network services (when you have a predictable protocol).
Q: how do you ensure errors in the speculative state don’t propagate to the main state? Need to be able to detect this, and could abort speculation in this case.
- IDEA: Integrated Distributed Energy Awareness for Wireless Sensor Networks. A new “group diet” for wireless sensor networks. Problem of overloaded nodes. Existing solutions are “single node diets”, which are unsatisfactory because nodes have to collaborate. Local efforts cannot go far enough unless there is some cooperation. Aim to improve application fidelity by matching system load to availability. Shift load from overutilized to underutilized nodes, and shift load away from threatened nodes. Like a distributed OS for sensor networks. IDEA evaluates multiple solutions and distributes information to the nodes. Ideal goal is awareness of application constraints.
Q: Quanto does some cross-node analysis? We’re building on these great ideas.
- Flicker: Refresh Power Reduction in DRAMs by Critical Data Partitioning. Hardware is over-designed for correctness and reliability. Make it less reliable and tolerate errors in software. Smartphones are a motivation: power consumption is way too high, due to the use of DRAM for responsiveness. Battery drains even when a phone is idle. Goal is to improve power consumption here. If you increase the refresh cycle length, the power consumption drops, but the error rate increases. Currently use 64ms refresh, so could we increase this? Secret sauce is a partitioning into critical and non-critical (e.g. soft-state) data. Map critical data to short-refresh cycle DRAM, and non-critical data to long-cycle DRAM. Requires some hardware changes. Hypothesise that smartphones have a lot of non-critical data. Initial results show 25% drop in power consumption with only 1% loss in reliability.
Q: [?] Looking at replication and checksumming in other work.
- BFT for the skeptics. Industry deals with crash failures a lot, so do we need full BFT? We already use checksums, timeouts, sanity checks, etc. to translate faults to crash faults. How often do we get faults that require BFT to handle it? Looked at ZooKeeper and real-world failures. Yahoo!’s crawler uses ZooKeeper extensively. Saw 9 issues, due to misconfiguration (5, BFT wouldn’t help), application bugs (2) and ZooKeeper bugs (2, correlated, BFT wouldn’t help). Could BFT hurt? It has more things to configure, so misconfigurations could become worse. Need to show that BFT really solves a problem before industry will pick it up.
Q: You showed that correctly implemented BFT couldn’t help with some failures? Failures were correlated, affecting all replicas.
- Prophecy: Using History for High-Throughput Fault Tolerance. BFT has poor throughput. Need 3f+1 replicas to handle f faulty replicas. Can we improve this for read-mostly, internet workloads? Add a “sketcher” to each replica, which sketches requests and responses. Only one machine sends a full response, the others send sketches. Trades off consistency for performance, which gives delay-once linearizability. Faulty replicas can return slightly stale data. Internet services have unmodified clients and short-lived sessions. Look at performance of PBFT. Can improve by consolidating sketch tables on a trusted sketcher. We already trust middleboxes, so why not trust this too? Performance is much better than PBFT. Work not specific to BFT, and could apply to PAxos, quorums, etc. while getting similar benefits.
- Securing Hardware Platforms Against Malicious Circuits Through Static Analysis. Make assumptions when building systems. Best way to break a system is to break its assumptions. People assume hardware is correct. What if we can’t make this assumption? Hardware is complex, expensive, static and the base of the system. Do “dead circuit identification”: highlight all potentially malicious circuits automatically. Attacker is motivated to avoid impacting functionality during testing (or else they’d be caught). DCI gets an assertion that says which paths are effectively short circuits. Use these assertions in a new graph algorithm to identify the possibly-malicious, dead circuits. No false negatives, but 30% over-identification. Empirical evidence shows a tight correlation between code coverage and
Q: is this primarily at design-time on HDL? Yes, this is one of our assumptions.
Q: what about redundant circuits for fault tolerance? This is used at design time where you can make calls about this.
- Enhancing Datacenter Network Security and Scalability with Trusted End Host Monitors. Cloud workload is dynamic and hostile. Key selling point is that multiple tenants can share common infrastructure. Need a new approach to security, because exploits are more likely, and the cloud resources can be used to perform exploits themselves. Cloud datacenters can help: they are centrally-controlled so monitoring becomes easier. The software and hardware and homogeneous. Plus a clean-slate approach is possible. Use the hypervisor as a trusted component. Hypervisor can send alarms to central controller when an attack is detected. Built a prototype from Hyper-V and a trusted Intel NIC.
Q: if you trust the VM, why do you need to trust the NIC? This gives some useful properties, and the NIC could do some filtering this for you.
Q: HotOS paper on this exact topic?
Q: [?] By “hypervisor” meant the entire virtualization stack, because we didn’t want to make the hypervisor itself any better.
- Architectural Attacks and their Mitigation by Binary Transformation. What happens if someone tries to attack you from a VM on the same machine in the cloud. There is cross-talk through shared architectural channels. Example is contention for the CPU data cache. This leaks information about the memory access pattern, which could for example be used to leak AES keys. Have showed that EC2 has similar vulnerabilities: placement vulnerabilities, cloud cartography and cross-VM exfiltration are all possible. Approach is to use dynamic binary rewriting to transform x86 instructions so that the architectural effects are mitigated. Can degrade observation of timing, or inject noise and delays to hide leakage signal. Methodology is to make things secure by default, then come back to improve performance.
Q: information leakage necessarily arises from statistical multiplexing, and we need statistical multiplexing to get good performance, so how can you address that? Assert that it should be possible.
Q: how well would existing techniques protect against these attacks? Not aware of techniques that could do this.
- Execution Synthesis. Say you have a bug in Linux on a remote machine. All you have to work with is a low-detail bug report. Reproducing it is time-consuming. Want a direction finding system from your system to a particular bug. Google Maps doesn’t do this at present…. So your bug report is a stack trace and some register contents. Do VM recording and replay. Don’t expect you to record behaviour that leads to the bug, since then you wouldn’t have the problem in the first place. Don’t care so much about performance, since you don’t run this in production. Find a state in the recording that is close to the bug report, then explore paths iteratively to get closer to the bug. Then you get a sequence of inputs that lead to reproducing the bug. Need a distance function, way of choosing inputs, and good information about what to include in a bug report.
Q: could you go backwards from a failure state and execute in reverse? We don’t have the entire failure state to begin with.
- Edge Mashups for Clinical Collaboration. Health industry is going from paper-based to electronic records. Want to empower non-programmers to build applications for real-time collaboriation, but need to respect things like HIPAA for logging and data retention. Example use-cases include expert-assisted surgery (call an expert for advice when complications arise, in real-time), and micro-clinics where nurses see the patients, but doctors write prescriptions remotely. Envision a graphical tool that pulls in photographic and chart data, which is synchronized between all participants. State serialized to XML which can be distributed to all the clients. Could be client/server or peer-to-peer. Need logging for accountability. “Break-glass” access control: anyone gets access but they are held accountable after-the-fact. Need low latency so doctors don’t feel that they are wasting time. Might migrate this to the cloud for scaling.
seL4: Formal Verification of an OS Kernel
- Formally proved the functional correctness of 8700 lines of C. No bugs.
Want to build high-assurance systems: small kernels which reduce the trusted computing base. Want strong security properties. Kernel has to be correct: if it falls over, so does the whole system.
seL4 has capabilities.
- Proof is that specification and code are equivalent. Need a formal semantics for every system call. Use Isabelle as a theorem prover to bridge the gap between spec and code. But what about assumptions (in the code) and expectations (of the spec)?
- Assume correct: compiler and linker, 600 lines of assembly code, hardware, cache, TLB management and 1200 lines of boot code.
- Given these assumptions, we get some nice properties: no null dereferences, no buffer overflows, no code injection, no memory leaks, no div-by-zero, no undefined shift, no undefined execution, and no infinite loops or recursion. Does not imply security, lack of bugs from expectation to the physical world, or absence of covert channels.
- Proof architecture admits proofs of higher-level properties (e.g. access control).
- Design is written in Haskell, which can be used to generate Isabelle code automatically.
- System model has three states: user, kernel and idle. Events are syscall, exception, IRQ and VM fault.
- Call graph is messy! A microkernel takes all of the messiness and packs it into a very small space.
- Formal methods practitioners (fans of abstraction) versus kernel developers (exterminate OS abstractions). Different view of the world. Haskell prototype unified these two things: OS people got to implement an OS, while the formal methods people got well-defined semantics. The C code is manually-written and hand-optimized, but based on the Haskell prototype.
- Aim to reduce complexity. Have to deal with virtual memory in the kernel. But we can put drivers outside the kernel. Concurrency is complex, so use an event-based kernel and limit pre-emption to a few well-chosen points in long-running operations. The C code is derived from the functional representation. Need to support a subset of C: everything from the standard, minus goto, switch fall-through, & on stack variables, side-effects in expressions, function pointers and unions.
- Found 16 bugs during testing, and 460 bugs during verification (roughly equally distributed between the C code, the design and the spec). Took 25 person-years in total: $6 million (compared to $87 million for EAL6).
One of the largest proofs ever done in a theorem prover: 200kloc handwritten, machine-checked proof. Proved 10kloc of OS code.
- Q: can you comment on what happens when you have to evolve the code? What effort is required? It depends. An optimization on the code level that doesn’t change the semantics might need a few days to re-prove. A new feature that adds new components could be added (in the paper) doesn’t depend on the rest of the kernel as long as you don’t screw around with existing data structures.
- Q: does your work solve the stated problem? The assumptions are significant, and you’ve just done a very significant type-check on the code? Is it really possible to solve the originally stated problem? This is just the first step. You can reduce the assumptions with more work. It isn’t the only technique that you should use, so if you deploy it in an Airbus you should also do testing and verification. [Taken offline.]
- Q: how can you verify something high-level like having the address spaces of two processes being isolated? We do this. You can still use seL4 in a stupid way, but you can use our security model with capabilities and reason about those in the spec. You don’t have to go down to the code.
- Q: did you see a correlation between the logical errors in the specification and the implementation? Not really. The C bugs were fairly “stupid”: typos, copy-and-paste, etc.
- Q: wouldn’t it be better to prove temporal properties? Are they expressible? They are expressible. We look at functional correctness, not temporal properties. But you need functional correctness before you can reason about temporal properties.
Helios: Heterogeneous Multiprocessing with Satellite Kernels
- Systems are getting more complicated, from UP to SMP to CMP to NUMA. This is still homogeneous. But hardware is no longer homogeneous: programmable NICs, GPGPUs, etc. Operating systems ignore this heterogeneity: the other devices have different instruction sets and often no cache coherence. This means that the standard OS abstractions are missing, and programming models are fragmented. Can we bring this back into the operating system?
- Helios is an OS for distributed systems in the small. Use four techniques to manage heterogeneity, simplify app development and provide a single programming model for heterogeneous systems.
- Result is that it is possible to offload processes to these heterogeneous devices with no code changes. Also improves performance on NUMA architectures.
- Satellite kernels. Want to make use of an I/O device, but the driver interface is a poor interface for applications that want to use programmable devices. It becomes hard to perform tasks like debugging, I/O and IPC with these devices. The driver doubles as an OS, within the OS itself. A satellite kernel runs on the device itself: fundamentally a microkernel. Also run separate satellite kernels on each NUMA node. Local IPC and remote IPC for communication between satellite kernels.
- Applications register as services in a namespace. The namespace connects IPC channels.
- Application placement is constrained by the use of heterogeneous ISA, an expectation of fast message passing and platform preference. Applications are allowed to specify affinity in their metadata: a hint for where the process should run. Easy for a dev, admin or user to edit affinity. Platform affinity is processed first, and this gurarantees certain performance characteristics. Can also contra-locate, e.g. if you don’t want the interference of an anti-virus program running on the same core. Algorithm attempts to balance simplicity with optimality.
- Applications are first compiled down to MSIL, and then that is compiled down to the appropriate ISA. Can encapsulate multiple versions of a method for different ISA in the MSIL (e.g. fast vector math).
- Implemented on Singularity, using an XScale programmable I/O card (2GHz ARM processor with 256MB of DRAM). Just need a timer, an interrupt controller and the ability to handle exceptions to implement a satellite kernel for a new device. No need for an MMU (thanks, Singularity!). GPUs are adding timers (Larrabee). Only supports two platforms and a limited set of applications.
- Evaluated several applications (network stack, FAT32 FS, mail server, web server, etc.) and how easy it was to run them on satellite kernels. Almost no code had to be changed (only in the TCP test harness). One line of metadata had to be changed in almost every case (zero in the other).
- Offloaded an entire networking stack to the XScale, and showed that the end-to-end performance of PNG compression-and-serving is improved when offloading to the XScale.
- Considered an email server built on Singularity, using a NUMA box. Emails per second handling improved by 39%. Turned out that the instruction throughput was much higher due to better cache utilization.
- Q: when you transfer data between two NUMA domains, couldn’t the IPC fail due to memory allocation failures? Singularity is statically verified, using contracts, so we don’t have to worry about that.
- Q: is this not just 20-year-old distributed microkernel research rehashed? We pay homage to that in the paper. A simple heuristic is sufficient to decide where to run some process, and you need to have process migration anyway, so why not just use that when you get a problem? Process migration is pretty difficult, in the heterogenous case. Abstraction turned out to be pretty brittle in commodity OSs.
- Q: is it reasonable to rely on protection from a large runtime? The system isn’t dependent on type-safety.
Surviving Sensor Network Software Faults
- Sensors have to operate unattended for months or even years. It’s hard to debug failures considering that the input is unknown. There is no debugger.
- Safe TinyOS introduces memory safety to sensor nodes. But what do you do when you get a safety violation? In the lab, spit out an error message; in the wild, reboot the entire node (losing valuable soft state and application data).
- Neutron is a new version of TinyOS. Reduces the cost of a violation by 95–99%. It has near-zero CPU overhead during execution. Runs on a 16-bit microcontroller.
- A TinyOS program is a graph of software components: statically instantiated code and state. Connections are typed by interface and there is minimal state sharing. Now have preemptive multithreading with a non-blocking, single-threaded kernel. Aim is to separate the program into independent units for recovery. Infer the boundaries of these at compile time. The kernel is a single unit.
- Units can be rebooted independently. A wrinkle involves cancelling system calls, so you need to block if a syscall is still pending. Blocks of allocated memory are tagged with the owning recovery uint, which enables these to be freed on reboot (by walking the heap). Can even reboot the kernel: just cancel all pending system calls (return ERETRY), and just have to maintain thread memory structures. Applications will continue after the kernel reboots.
- New idea of “precious state”: a group of precious variables will persist across a reboot. Annotate variables in the source code. Some restrictions on precious pointers. Precious variables must be accessed in atomic blocks. Variables are persisted on the stack across reboot: the set of precious state is usually smaller than the worst-case stack size.
- Evaluated the cost of a kernel violation in Neutron, compared to safe TinyOS. Looked at three libraries, running on a 55-node testbed. Show the effect of a reboot on the CTP workload. Neutron gets close to the non-reboot case. Also look at the effect on time synchronization in FTSP, showing what proportion of the nodes have unsynchronized time. Again, Neutron gets close to the non-reboot case. Looked at fault isolation. CTP and FTSP data persist across reboots.
- Main cost is in ROM bytes: 1–5kB of added code, roughly constant.
- Measured cost of a reboot in milliseconds. A kernel safety violation will result in a 10–20ms outage.
- Much lighterweight than microreboots (and lets you reboot a kernel, not a J2EE application).
- It’s easy to change the TinyOS toolchain, but changing the programming model isn’t due to the amount of deployed code.
- Q: how can I reason about a node that has survived a fault (a rebooted node is in a known-good state)? Do you have evidence that this is going to help us? [Showed emails from the questioner.] It is hard to diagnose these faults. Different approaches are possible.
- Q: what do you think of the alternatives, such as using an MMU, verification or simulation? Well we could add an MMU, but we don’t have it at present. These new developments might make the dependability better. Verification struggles with the huge input space.
- Q: how do you ensure that the precious state is not corrupted? Using safe TinyOS, so we won’t see memory access violations due to memory safety. Taint in the paper is about inconsistent state, not corruption.
- Q: what are your criteria for what state should be marked as precious? Don’t have a strong set of guidelines for this, but have done it by inspection so far.
- Q: what is your fault detection system, and what is its coverage? How do you know when you have a fault? The deputy compiler gets you to annotate code with things like buffer lengths. Can infer faulty behaviour from these. Annotating interfaces tends to be sufficient.
- Q: this is a very valid approach?
Monday, October 12th, 2009
Keynote: Barbara Liskov
- Inventing abstract data types, CLU, Type hierarchy, What next?
- More of a Programming Methodology talk than a systems talk.
- Started out in systems with the Venus machine on the Interdata 3. Presented it and its operating system at SOSP 1971.
- Back in the early 1970’s, people were concerned about the software crisis. (cf. Dijkstra’s 1972 Turing Award lecture, The Humble Programmer.) As machines got cheaper, bigger and faster, software started to matter a lot. A tendency to underprovision the hardware, creating challenges for the software developers.
- In late 1960’s the field of Programming Methodology began. Started to think about program design and structure: in order to have functionality, maintainability, etc.
- First paper: Go To Statement Considered Harmful (Dijkstra, 1968). A revolutionary letter to CACM. Use static program text to reason about dynamic program behaviour: it would be useful if these were as close as possible, but GOTOs prevent that. Example of debugging: understand how you got to a particular point in the text. GOTOs make this very difficult. Provoked a huge amount of resistance: can’t program without it. Pointed out limitations of the programming languages: branch into a label table is just a case statement, but they didn’t have case statements.
- Second paper: Program Development by Stepwise Refinement (Wirth, 1971). European school of software development: a top-down approach, starting with many abstract parts which are not initially implemented. Example was the 8-queens problem.
- Third paper: Information Distribution Aspects of Design Methodology (Parnas, 1971). A new interest in modularity, and information hiding. “The connections between modules are the assumptions which the modules make about each other.” Hedging about what modules were at the time.
- Fourth paper: On the Criteria to be used in Docomposing Systems into Modules (Parnas, 1972). How to actually break a system into modules: a hint of data abstraction but the full idea wasn’t there yet.
- Enter Barbara Liskov: A Design Methodology for Reliable Software Systems (1972). Entire system should be partitioned: no global state and each partition owns some part of the state. Partition exposes operations and only way of interacting with the state would be by calling operations on the modules.
- Wanted to apply the idea of partitions for building programs. It was unclear how to combine modules to make whole programs. Idea of partitions came out of doing work on operating systems: ur-partitions were the supervisor and user modes. This idea is carried a lot further by ADTs.
- Idea was to connect partitions to data types. A strike of inspiration. Ideas often arrive in the middle of the night, or when arriving to work with a fresh mind.
- March 1973 SIGPLAN/SIGOPS interface meeting on programming methodology was the debut of the idea. Began working with Steve Zilles. At the time, they were knowledgable about all the languages in existence (FORTRAN, LISP, ALGOL, PL/I, COBOL…). Started to do language design.
- People were interested in extensible languages as early as 1967 (Schuman and Jourrand, Balzer). How can we help people build dialects of languages that make them easier to use. Looked at syntactic and semantic extensibility. Syntactic extensions written in BNF and added to the language using some kind of preprocessor. Fortunately this died a death…. People were much more worried about writing programs than reading them. Didn’t realise that programs are read more often than they are written. Balzer imagined data types as being collections which allowed four defined operations (add, remove, etc.) with operator overloading.
- Hierarchical Program Structures (Dahl and Hoare, 1972): Simula 1967. Didn’t have encapsulation but did have inheritance to make simulation easier. Precursor to Smalltalk.
- Protection in Programming Languages (Morris, 1973). Recognised the importance of locality in module comprehension (allows local reasoning). Proposed sealed objects using encryption as an OS mechanism to guarantee locality.
- Global Variable Considered Harmful (Wulf and Shaw, 1973). In the 1960’s a stream of languages made block structure the big thing. Give locality within blocks, but you can always access things on the outside (i.e. global variables). Made analogy with Dijkstra’s paper that global variables are implicit connections between states of the program, which makes reasoning about it more difficult.
- Programming with Abstract Data Types (Liskov and Zilles, 1974).
- Said what ADTs were: a set of operations (whatever ones made sense, not a fixed set), encapsulation was important, and the operations were the only way to access object state.
- ADTs were “clusters” with encapsulation. Proposed polymorphism, static type checking (but weren’t sure if this was possible due to polymorphism) and exception handling.
- Why was a new programming language necessary? Needed to communicate the ideas to programmers. Enabled the testing of whether ADTs work in practice. Also made it possible to get a precise definition (tendency to think of the compiler as the language definition…). And to validate whether it was possible to achieve reasonable performance.
- Goals of language design: ease of use, simplicity, expressive power and performance. First two played off against second two.
- Also wanted minimality (limiting the language to what we could get by with), uniformity (keep abstract types similar to the built-ins) and safety (find errors as soon as possible: compile time?).
- Assumptions/design decisions. Wanted language to be heap-based with garbage collection (based on experience with LISP). Program was a collection of procedures rather than a linear piece of code (ALGOL style). People were scared of pointers, but used them to simplify the design. No block structure! Separate compilation of individual modules. Also had static type checking which was meant to speed up finding errors. No concurrency (cut out what wasn’t necessary to simplify a big project). No GOTOs. No inheritance.
- CLU clusters. A cluster had a header with a list of the operations that it defined. Thought of operations as belonging to the type rather than the object (passing an object as an argument). Defined the representation of the object internally, and the implementation of the operations. Used “cvt” to define unsealing on operation entry, and sealing on exit (but compile-time checked).
- Polymorphism: set[T: type], and had a where clause for the type parameter to specify, e.g., that T has an equals: T -> T -> bool function.
- Exception handling: Issues and a Proposed Notation (Goodenough, 1975). People didn’t know the right model: procedure should terminate (now the status quo), or throw exception to a higher level that would allow it to resume. PL/I had both. How should handlers be specified? At the call-site, or out of the main-line for all invocations to that function. CLU used termination, and specifies the exceptions that a method may call in the header. Handled at the call-site.
- How to handle exceptions? Handle it, propagate it up the call stack, or signal failure (the exception shouldn’t happen…). Can never be certain that these last exceptions won’t happen, but don’t want to write code to deal with this, so CLU introduced the “failure” exception. Really want accurate interfaces (know exactly what exceptions a method might call) and no useless code.
- Iterators. For all x in C do S. Solutions were to destroy the collection (repeated removal), or complicating the abstraction (turn it into an ordered set, making it indexible). In summer 1975-ish, the MIT group visited CMU, where the Alphard group were working on “Generators”. Thought these are a bit crusty, so invented iterators which are like procedures that you call and they yield instead of returning. Can nest iterators and recursively call them. Implemented by passing the loop body as an argument to the iterator. But this limited the expressive power.
- In 1987, gave a keynote at OOPSLA, but had been ignoring object-oriented languages and inheritance in particular. Took the opportunity to get into the literature. Much of it was very bad/confused. Inheritance being used for two different things: implementation simplification and type hierarchy. The two were not compatible.
- Implementation inheritance violated encapsulation! Subclasses depend on the implementation of the superclass, making it hard to change the superclass. CLU could do implementation sharing without inheritance.
- Type hierarchy is much more interesting, but wasn’t well understood. How were stacks and queues related?
- Led to the Liskov Substitution Principle: Objects of subtypes should behave like those of supertypes if used via supertypes methods. (Data abstraction and hierarchy (Liskov, 1988).) Didn’t realise that this was a big idea!
- What next? The world has changed from one where people had no idea about modularity, to one where modularity is based on abstraction.
- Modern programming languages (Java and C#) are pretty good. Procedures are missing: they are important, and the loss of them as a first-class thing makes the program less simple. Closures are missing, as are iterators. Exception handling is important but failure handling is not done well. Also need built-in types as a basis: extensibility might be going too far. Can we do better than “serialization” (horrible overloading of the term): can’t it be done by garbage collection?
- The state of programming is pretty lousy. The COBOL programmers of yesterday are now writing web services and browsers. The era of globals has returned. There’s little encapsulation and protection, and yet these are handling confidential information. Problem might be persistent storage violating abstraction: perhaps we would be better with an object store that provides automatic translation and type preservation.
- Programming language research. Is now the time for some new abstraction mechanisms? Probably not just specification langauges. Concurrency and multi-core: modularity is very helpful here, but there is still a lot of work to do. Should distributed systems be programmed in languages that include distribution as a first-class concept.
- System research has done well. Abstractions like DHTs, map-reduce, client/server, distributed information flow. These have been useful for making progress.
- Concerned that we trade off for performance (versus simplicity and semantics). Led astray by the end-to-end argument, but not sure that the argument is valid when the end is the user. We know that we’ll never be 100% reliable, but failures should be catastrophes, not laziness (or an optimization).
- The systems community has thrived because it has been so open: embracing new concepts. Worried about the semantics of the coming Internet Computer: what does it mean to run a program in the cloud (the PL should embrace distribution?). Massive parallelism also coming as a problem: not sure they necessarily follow from the end of Moore’s Law. But we will need to manage them as a distributed system, so there is a lot learn from that world.
FAWN: A Fast Array of Wimpy Nodes
- Awarded best paper.
- Energy is becoming an important consideration in computing? Google uses $40 million of energy a week. Can we reduce energy costs tenfold? Without increasing capital costs?
- Idea is to improve energy efficiency by using an array of well-balanced low-power systems. For computationally intensive, data-intensive systems.
- Goal is to reduce peak power: 20% energy loss on power and cooling is considered “good”. 100% utilization => 1000W but 20% utilization => 750W. Fawn wants to get 100% utilization down to 100W, and less utilization to less power.
- As CPU cycles have improved much faster than disk seeks, we have a gulf in wasted resources. Could rebalance by using very fast disks. Could use slower CPUs and moderately faster storage. Or could use slow CPUs and today’s standard disks. These are not equivalently efficient.
- Fastest processors have superlinear power usage (Xeon7350, etc.) due to things like branch prediction, caching, etc. which are not so useful for data-intensive, I/O intensive workloads. Custom ARM nodes etc. are slow but power usage is dominated by fixed power cost. In between we have things like the Atom and XScale: FAWN targets these.
- Application is data-intensive key-value systems. A critical infrastructure service with SLAs, random-access and read-write. Goal was to improve the efficiency in Queries/Joule or Queries/Second/Watt.
- Prototype using Alix3c2 nodes with flash storage, 500MHz CPU, 256MB DRAM and 4GB flash.
- Challenges: need efficient and fast failover, wimpy CPUs, limited DRAM and flash that is poor at small random writes.
- Architecture: single front-end connected by switch fabric to many back-end. Front-end manages backends, acts as a gateway and routes requests. Use consistent hashing. Interesting design decisions are at the backends (focus of the talk).
- High performance KV store: must map keys to values on the backend. Use an in-memory hashtable to find the location of a value on disk. 160-bit key into hashtable which contains a key fragment, valid bit and offset into the data region. On flash, store the entire key, length and the data itself. Limited DRAM means you only store a fragment of the key in memory (to get 12-byte hashtable entry): risk of collisions and multiple requests. Get a low probability of multiple flash reads.
- Small random writes problem avoided by log-structured datastore. Helps also with node addition, which requires transfer of key ranges: the log structure makes the transfer of key ranges a simple streaming transfer… a simple iteration of the data store. But the SLA means that we need the source of the data continues running while doing this streaming: log-structure lets you minimize locking. Also need a compact operation on the data store, which also runs in the background.
- Replication and strong consistency is covered in the paper.
- Evaluated by considering the efficiency (energy) of KV lookup. Look at the impact of background operations on query throughput. And then TCO analysis for random read workloads.
- Measured the efficiency of 256-byte KV lookups on various systems, looking at the power outlet power usage. Alix3c2/Sandisk got 346 QPS/Watt, whereas a standard desktop machine with SSD got 51.7 and two hard-disk based systems were 2.3 (MacBook Pro) and 1.96 (desktop).
- Peak capacity is 1600 QPS. During compaction ±800. During split ±1200, and Merge ±700.
- At 30% of peak query load, these have almost no discernable impact (±500 QPS in all cases).
- TCO = Capital Cost + Power Cost ($0.10/kWh). When should you use FAWN?
- Traditional system with 200W usage (5 x 2TB disks/160GB PCI-e Flash SSD/64GB FBDIMM per node). $2000–8000 per node. FAWN (10W usage) (2TB disk/64GB SATA Flash/DRAM-based). Where storage capacity is dominant, use FAWN + Disk. When you care about query throughput, use FAWN + DRAM. FAWN + Flash covers much of the remainder of the space, but Traditional + DRAM also covers some of the space.
- To be energy efficient, require some effort on the part of the system developer.
- Q: impact of networking on performance and cost models? Network is an important component, and we really want things like all-to-all communication, which often needs a high-powered core. But proposals for using low-powered commodity switches at scale mean that is possible to get down to a fixed power overhead per core.
- Q: off-the-shelf memcached could do better? [Inaudible.] It takes a lot of effort to get very high throughput.
- Q: what happens when you include latency in the model of TCO? Interactive services that serialize might shift the model? Flash devices gives pretty good common-case latency. What if you have computational load in the query? You have to trade off energy efficiency for longer periods of computation. Get high 99.9% latency during maintenance but it’s still okay.
- Q: what happens when a frontend thinks one node is down, etc.? Have optional replication to provide constant access to data, and a background replication process.
RouteBricks: Exploiting Parallelism to Scale Software Routers
- Awarded best paper.
- Want to build routers that are fast and programmable. Why do we want them to be programmable? Programmable routers enable ISPs to offer new services (intrusion detection, application acceleration, etc.). They make network monitoring simpler (measuring link latency, tracking down traffic). Finally they make it possible to offer new protocols, such as IP traceback, trajectory sampling etc. They enable, flexible, extensible networks.
- But today, speed versus programmability is a trade off. Fast routers are implemented in hardware (Tbps throughput), but offer no programmability. But programmable software routers get throughput < 10Gbps using general purpose CPUs.
- RouteBricks uses off-the-shelf PCs, a familiar programming environment and large-volume manufacturing to reduce costs. How do we get to Tbps routers with these ingredients?
- A router is just packet processing + switching. We have N linecards that handle each input or output port, and it must be able to handle packets at the per-port rate, R. Also have a switch fabric which much switch at N*R.
- RouteBricks uses one server instead of each line card and a commodity interconnect.
- Require internal link rates < R and per-server processing rate = c*R (c is a small, reasonable constant). Per-server fanout should be constant.
- What interconnect satisfies these requirements? Could naively have a crossbar switch, with N^2 line-rate links. Could instead use Valiant Load Balancing, which introduces a third stage between input and output. Have N intermediate servers, and R/N rate links from each input to the intermediate, and from each intermediate to each output. Now have N^2 links at rate 2R/N. Per-server processing rate is 3R. But with uniform traffic patterns, each server only must process at 2R.
- But this still gives linear per-server fanout (bad). If we increase per-server processing capacity, could assign multiple ports to each server. Or add a k-degree n-stage butterfly network. Combine these ideas: RouteBricks. Use full mesh if possible and extra servers otherwise.
- The trade-off depends on the kind of servers that you have. Assuming current (5 NICs * 2 X 10G ports or 8 X 1G ports and 1 external port per server).
- With 2R–3R processing rate, we need to optimize the server! Use a NUMA architecture (Nehalem, QuickPath interconnect), 2 quad-core CPUs, and 2*2*10G NICs. Run Click in kernel-mode.
- First try: got 1.3Gbps per server. Spending a lot of cycles on book-keeping operations, such as managing packet descriptors. So use batched packet operations and CPU polling, which got 3Gbps.
- Still a problem with how the cores accessed the NICs: queue access. Get synchronization overhead between the two cores. Simple rule: 1 core per port (input or output), gets no locking overhead. But now we have cache misses because of separate cores working on the same packet. So second rule: 1 core per packet. These rules are mutually exclusive!
- Solution is to uses NICs with multiple queues per port. Can now assign each queue to a single core, which achieves our objective.
- So, use state-of-the-art hardware, modify the NIC driver to do batching, and perform careful queue-to-core allocation.
- For no-op forwarding, get 9.7Gbps with min-size packets, and 24.6Gbps with a realistic mix.
- For IP routing (IPSec also but not shown), get 6.35Gbps for min-size, and 24.6Gbps with a realistic mix.
- Realistic size mix: R = 8–12 Gbps. In this case, we are I/O bound. Min-size packets are CPU-bound.
- Look at the next-gen Nehalem and do some back-of-the-envelope calculations. Could get R = 23–35 Gbps with upcoming servers for a realistic mix.
- Prototyped RB4: N = 4 in a full mesh. Realistic size mix was 35 Gbps.
- Did not talk about reordering (avoid per-flow reordering), latency (24us per server), open issues (power, form-factor, programming model). Slide illustrates Russell’s paradox quite nicely.
- Q: about programmability, the examples require maintaining cross-packet state, so would the choice of the load-balancing mechanism within the router affects this? When the bottleneck is the CPU, so yes this is important.
- Q: are the trends on both power and power-performance of next-gen processors in your favour? We spend more energy than the corresponding Cisco router, but we have a lot of room for improvement and will get better. Maybe best thing to do is use general-purpose CPUs with an efficient interconnect.
- Q: is the real choice between general purpose CPUs and a different programming model? Could you do better with ternary CAMs, etc.? Programmability is just trapping exceptional packets. IP routing is easy? Not suggesting that you just throw away all hardware routers, but you might want to do it in some cases, e.g. for specialized monitoring at the borders of your network. A programmable datapath might make it easy to deploy a new protocol.
- Q: curious about reordering phenomenon and how this would affect TCP? What would be the most extreme reordering that an adversary produce? [Inaudible.]
The Multikernel: A New OS Architecture for Scalable Multicore Systems
- Boots into Barrelfish rather than kubuntu to get the video driver to work.
- How do we structure an OS for the multicore future? Need to deal with scalability and heterogeneity.
- Sun Niagara has a banked L2 cache for all cores (bad for shared memory). Opteron Istanbul has per-core L2. Nehalem is quite different still. Need to do different things to work well on each of these. An 8-socket Opteron has an “interesting” interconnect. The interconnect matters, especially to access latencies.
- Can also have core diversity: within a system (programmable NICs, GPUs, FPGAs) and on a single die (for performance asymmetry, SIMD extensions or virtualization support).
- As core count increases, so does diversity. And unlike HPC systems, one cannot optimize at design time. Need systems software to adapt for you.
- So now is the time to rethink the default OS structure. Now we have shared-memory kernels on every core, data structures protected by locks and everything else being a device. So propose structuring the OS as a distributed system (like Barbara Liskov said earlier).
- First principle: make inter-core communication explicit. All communication should be done with messages and no shared state. This decouples the structure of the system from the inter-core communication mechanism, and makes communication patterns explicit (which cores you communicate with etc.). It naturally supports heterogeneity of cores. Also better matches future hardware (no cache coherency? Such as Intel’s 80-core machine). Also allows split-phase operation which makes concurrency easier. And finally, we can reason about performance, scalability, etc. of such systems.
- Simple microbenchmark to evaluate the trade-off between shared memory and message passing. Several cores update a shared array. In shared-memory case, we get stalling due to cache coherency protocol. This is limited by the latency of interconnect round-trips. Latency is linearish in the number of cores (up to 16), and gets worse with the number of cache lines modified.
- What if we had a server core that takes messages from a ring buffer per client core? Higher latency to begin with, but scales much better as the number of cache lines modified is increased. Client overhead is queuing delay at the server. The server’s actual cost is approximately constant, and very low. This would give us many spare cycles if the RPC is implemented split-phase.
- Second principle: make the OS structure hardware-neutral. The only hardware specific parts should be the message transports and the CPU/device drivers. Makes it possible to adapt to changing performance characteristics. Can late-bind protocol and message-passing implementations.
- Third principle: view state as replicated. Potentially-shared local state is accessed as if it were a local replica (e.g. scheduler queues, process control blocks, etc.). The message-passing model requires this. This naturally supports domains that don’t share memory.
- Replicas were previously used as a selective optimization in other systems. The multikernel makes sharing a local optimization, instead (opposite view). Only use shared state when it is faster, and make this decision at runtime.
- Can support applications that need shared memory if it is available, but the OS doesn’t rely on this.
- Currently run on x86-64 machines. A CPU driver handles traps and exceptions serially. A user-space monitor mediates local operations on global state. Use URPC inter-core message transport, but we expect that to change.
- Many non-original ideas (in particular decoupling message-passing from synchronization).
- Many applications running on Barrelfish (slide viewer, webserver, VMM, OpenMP benchmarks, etc.).
- How do you evaluate a radically-different OS structure? Barrelfish was from-scratch, so is less complete than other OSs. But we need to show that it has good baseline performance: comparable to existing systems.
- Case study of unmap (TLB shootdown). Logically need to send a message to every core and wait for all to acknowledge. Linux/Windows use IPIs and a spinlock to do this. Barrelfish makes a user request to a local monitor, and uses message passing.
- Tried several different communication protocols. One unicast channel per core; a single broadcast channel. Neither of these perform well (especially broadcast because there is no such thing in the interconnect).
- Really want a multicast optimization: send message once to every socket, which has an aggregation core that forwards it on to local cores. Also use the HyperTransport topology to make decisions about which cores are further away (and hence should receive the message earlier to give parallelism). Need to know the mapping of cores to sockets, and the messaging latency.
- Use constraint programming and a system knowledge base to perform online reasoning. Prolog query on SKB constructs the appropriate multicast structure. Unmap latency is much less than Windows for >7 cores, and Linux for >12 cores. It also scales much better.
- Show that there is no penalty for shared-memory user workloads (OpenMP), respectable network throughput, and a pipelined webserver with high throughput.
- Conclusion: no penalty for structuring using message passing. So we should start rethinking OS structure this way.
- Q: how do you build an application on top of Barrelfish that tries really hard to ignore topology? Focus so far on how to build the OS, and make no particular demands on application programming. Gone with POSIX so far, but we need some higher-level programming model to express things more easily.
- Q: about URPC performance, are you running a single thing on each processor and hence is the process only ever spinning waiting for URPC, avoiding any context switching latency? That’s true. We decoupled messaging primitive from notification primitive, because notifications are very expensive (at least on our hardware). This leads to a trade-off. The split phase API makes it efficient to work in the case where you don’t need notification immediately.
- Q: you didn’t mention virtualization… if we map one VM to one core, we can get good performance, so does Barrelfish obviate the need for virtualization? Barrelfish is viewed as orthogonal to virtualization.
- Q: your microbenchmark [inaudible]. The point of the benchmark is to make the case that messaging over shared memory can have reasonable performance.
- Q: as systems get bigger, do you expect messages to get lost? Maybe… if they do get lost, this model is a better way to structure it than shared-mmeory.
Fast Byte-granularity Software Fault Isolation
- Operating systems run many drivers and these run in a fully trusted mode. But they turn out to be a major source of bugs. Existing solutions either require changes to the source code or unacceptably bad performance. Why? They are complex and require complex, fine-grained temporal and spatial sharing within the kernel. But any mistake in this can be fatal….
- BGI isolates drivers in separate protection domains but allows domains to share the same address space. It provides strong isolation for existing drivers even with a complex kernel API. Write integrity, control-flow integrity and type-safety for kernel objects. Achieves this without changes to drivers or hardware, and only 6.4% CPU overhead (~12% space overhead).
- Use a BGI compiler on unmodified source code to get instrumented driver, then link this with the BGI interposition library to get a BGI driver.
- Protection model has kernel in trusted protection domain. Most drivers are untrusted and drivers may share untrusted domains. Each byte has an ACL. Access rights are read and write (default everything readable), indirect call, typed rights for different kernel objects (type safety for mutex, dispatcher, io etc.).
- Two primitives: CheckRight and SetRight, called by the interposition library (not the driver code). Dynamic type checking is possible using these: check the arguments to kernel API calls, which ensures type safety, and state safety (temporal properties).
- Compiler checks rights on writes and indirect calls, and grants and revokes write access to the stack as appropriate.
- Rights are granted on driver load (write to globals, icall right to address-taken functions). Rights granted/revoked to function arguments according to call semantics. Function arguments are also rights-checked.
- Example of a read request (event-based), and how the BGI compiler inserts inline checks to avoid many problems.
- Rights change as the driver makes kernel calls, and BGI enforces complex kernel API usage rules (use-before-initialized, write-after-initialized, etc.).
- Implementation must be fast. Fast crossing of protection domains (just a normal call instruction). Fast rights grant/revoke/check: uses compiler support to inline these checks and perform data alignment.
- ACL data structure encodes (domain, right) pair as an integer. Several tables of “drights”: arrays for efficient access, and have 1-byte dright per 8-byte memory slot. Need a conflict table when rights ranges overlap. Want to avoid accesses to these conflict tables by aligning data on 8-byte memory slots. Makes the non-conflict case common at the cost of space. Heap objects are 8-byte aligned, and also need special drights for writable half-slots.
- SetRight is implemented with 4 x86 assembly instructions. Uses arithmetic shift to obtain same code for kernel and user space addresses (with different tables).
- CheckRight has fast check (5 instructions) and slow check (7 instructions).
- BGI has a recovery mechanism that runs driver code inside a try block (cf. Nooks).
- Evaluated on 16 Vista device drivers: 400 KLOC.
- Evaluated for fault containment. Injected faults in the source of fat and intelpro drivers (following previous bug studies). Measured the number of faults contained by BGI: fat = 45/45, intelpro = 116/118.
- Measured file I/O performance with Postmark. Max CPU overhead was 10% (for FAT) and max throughput cost was 12.3% (for FAT).
- Measured network performance (max 16% CPU overhead and 10.2% throughput decrease both for UDP send).
- Found 28 new bugs in widely used drivers. Some were serious: writes to incorrect places or use of uninitialized objects. Some less so: abstraction violation, etc. BGI is a good bug-finding tool.
- Q: subtle problems such as virtual address aliasing or out-of-bound pointers within the bounds of other objects… how do you handle this? The granting of access rights is done according to the kernel API, for write/control-flow integrity and type safety. But we wouldn’t catch the case where a driver makes a write to something that it is allowed to write to, but isn’t the right thing.
- Q: how do you deal with type-unknown pointers? Can catch errors on data buffers. But where pointers are passed as arguments, you typically don’t do arithmetic on these.
- Q: compared with SFI work on already-compiled code (compiler not in TCB), could you do this on object code instead of source code? We have a binary version, but we report on the compiler version because it has better performance.
- Q: the transformations looks similar to existing SFI, so to what do you attribute this improvement in performance? Dealing with complex kernel API and enforcing fine-grained sharing. Previous work did a lot more copying or had expensive crossing of protection domains, in order to deal with complex sharing.
- Q: you’re redoing virtual memory protection with a complex compiler? Why not just use a microkernel? The performance can’t be that bad? The goal was to support existing drivers that run on existing operating systems with good performance. Could you do the same transformations on existing code to run in separate address spaces? Maybe, but don’t know of a way to do this. Goal was to deal with existing drivers on existing OS.
- Q: could you just insert latent compiler-inserted checks to avoid zero-day exploits, which are only turned on on zero-day? An interesting idea, but it’s better if you can cheaply run it all the time in deployed systems… you will find more bugs this way.
Tolerating Hardware Device Failures in Software
- Device drivers assume device perfection. But we can see hardware-dependence bugs across driver classes. Transient failures cause 8% of all unplanned reboots. The existing solution is to hand-code a hardened driver, which gets this down to 3%. What can we do with software fault tolerance? Detect hardware failures and perform recovery in software.
- Where do bugs come from? Device wear-out, insufficient burn-in, bridging faults, EMF radiation, firmware bugs, corrupted inputs, timing errors, unpredictable DMA, etc.
- Vendors give recommendations to driver developers. Firstly, validate all input coming from the hardware. Then ensure all operations take finite time. Failures should be reported. There are guidelines for recovery also. Goal is to implement as many of these as possible automatically.
- System is called Carburizer. Runs on driver source code, and compiler generates hardened driver that links to the Carburizer runtime.
- Goal is to fix these bugs with static analysis. Find driver code that uses device data, and ensures that the driver performs validity checks. Fixes bugs from infinite polling, unsafe array reference, unsafe pointer dereference and system panic calls.
- First, use CIL to identify tainted variables. Consult a table of functions known to perform I/O and mark heap and stack variables that receive data from these. Propagate taint through computation and return values. Now find the risky uses of these variables. e.g. Find loops where all terminating conditions depend on tainted variables.
- Now look for tainted variables used to perform unsafe array accesses: used as array index into static or dynamic arrays.
- Evaluated by analysis of Linux 220.127.116.11. Analysed 2.8 million LOC in 37 minutes (including analysis and compilation). Found a total of 992 bugs in driver code, with false positive rate of 7.4% (based on a manual sampling of 190 of the bugs). May also be some false negatives because don’t track taint via procedure arguments.
- Automatic fixing of infinite loops by inserting timeout code. Inserted code can never harm the system because the timeout is conservative.
- Inserts code to bounds-check arrays when an unsafe index is used.
- Empirical validation using synthetic fault injection on network drivers: modify return values of in and read functions. Without Carburizer, the system hung; with it, the driver recovered. Works well for transient device failures.
- Also want to report device errors to fault management systems. Carburizer (i) detects the failures, and (ii) reports them. Report things like loop timeouts, negative error returns and jumps to common cleanup code. Also looks for calls to functions with string arguments, and considers these to be error reporting, so it doesn’t insert additional reporting in this case.
- Evaluated failure reporting with manual analysis of drivers of different classes (network/scsi/sound). No false positives, but a few false negatives. Overall fixed 1135 cases of unreported failures. Thus it improves fault diagnosis.
- Static analysis is insufficient. We also need to consider runtime failures, such as missing and stuck interrupts. Can detect when to expect interrupts, and invoke ISR when bits are referenced but there is no interrupt activity. Can also detect how often to poll, which reduces spurious interrupt invocation, and improves overall performance.
- Can also tolerate stuck interrupts by seeing when an ISR is called too often, and converts from interrupt-based to polling in this case. This ensures the system and device make forward progress.
- Evaluated effect on network throughput: overhead is < 0.5%. And on CPU utilization, this goes up 5% for nVidia MCP 55 when recovery is used; no cost when just doing static bug fixing.
- Covers 8/15 of the recommendations for driver writers, automatically.
- Q: is your intention that driver developers should run Carburizer only in development and incorporate its fixes, or to run it on deployed code? Encourage sysadmins to run the tool all the time to cover as many bugs as possible.
- Q: talked about transient device failures, but what about transient errors in the CPU? In this case, even the code execution will fail. Can you comment on the relative frequencies of these kinds of failure? Not aware of this data. Cannot trust anything when the CPU fails, so not sure what we can do about this.
- Q: do you have a rough feel about how many of the found bugs could be used by malicious attackers to perform code injection? Don’t know off hand. Certainly very easy to hang the system, so could do this maliciously.
- Q: could you talk about the static analysis that you are using? How do you handle pointer aliases? Analysis is very simple, and we do have 7.4% false positives. But the combination of our techniques gives us this low rate.
Automatic Device Driver Synthesis with Termite
- Conventional driver development requires acquiring necessary information from OS interface spec and device spec. Need to combine this information to come up with a driver implementation that appropriately translates OS requests into driver requests.
- If we formalize these specs, we can do better by synthesizing drivers, and there is no longer a need for developers to be an expert in both the OS and the device. Only know one well. Furthermore, code can be specified once and synthesized many times.
- Use finite state machines as the basic formalism for writing these specs. Note device-initiated transitions and software-initiated transitions.
- First step is to take two state machines (OS and device) and synthesize a combined state machine that considers all possible transitions. Any transition represents a legal transition in the driver. Also associate timeout labels with transitions.
- Now translate the state machine into C source code (simple and covered in the paper).
- A real device has multiple functional units, so can’t possibly use a single FSM for all of these. A new language is used to compose multiple FSMs together (one per functional unit). Need a synthesis algorithm that can handle this: need to deal with the state explosion problem (by exploring the state space incrementally). Also need to deal with data, and cannot model all possible assignments to each variable. Instead manipulate data symbolically.
- Termite successfuly synthesizes drivers for Asis AX88772 USB-to-Ethernet (on Linux) and Ricoh R5C822 SD host controller (on Linux and FreeBSD).
- <1 kLOC for OS interface spec. <700 LOC for device specs. Synthesized drivers are 2 to 4 times larger than the Linux drivers.
- Showed a demo of a visual Termite debugger. Allows single-stepping and setting of breakpoints on states.
- Performance is very close to the native drivers.
- Some limitations: cannot specify constraints on data (alignment, fragmentation, etc.), complex inter-variable relations are not supported (limitation of symbolic execution engine), the structure of specifications is restricted, and Termite drivers require runtime support. Not conceptual limitations, only of the current implementation.
- One issue is how you write the specifications. This is particularly onerous for the device manufacturers. Also a potential source of bugs. Need a big brain to do this based on the HDL of the device. Ideally we would automate this translation (HDL to driver spec). Since you write devices in HDL anyway, this shouldn’t be too bad.
- Not much open-source hardware around, so there was a difficulty in finding hardware on which to evaluate this.
- Q: to make this practical, there are a few questions: performance? Scalability (for big devices like video cards)? Performance hasn’t been an issue so far, because the device developer makes the same assumptions that Termite ends up making, so the generated code is quite similar. (But the devices are simple?) So far this hasn’t been a problem. Scalability is more of an issue. Looking at using better symbolic execution that ignores irrelevant relations between variables.
- Q: how do you deal with firmware in the device (also specified in HDL)? At the moment, the manual process has to cover this. Once we automate this, it might be more challenging, so we might want to generate firmware as well.
- Q: what about the data size of compiled code, and the CPU utilization when saturating the device? Code and data size doesn’t look very big (roughly proportional to the amount of source code). Haven’t looked so much at data size.
- Q: does your spec language have room for cork tables and other crap? [Taken offline.]
- Q: the OS spec is intended to be device independent (e.g. generic for GPU, ethernet driver, etc.), but how do you cope with new features in a device, which people like to have? All the interesting stuff is on the side? How does that play with the OS spec if you want to take advantage of these features? We cope with standard features, could include “semi-standard” optional features in the OS spec. For unique features, you would need to extend (but not rewrite) the OS spec. No experience with doing this.
- Q: the requirement for a full functional spec of the OS driver interface is somewhat intimidating, so what is your experience in making these, and how do they scale? It’s generally doable, but we would want to create a special-purpose language to make this easier.
Automatically Patching Errors in Deployed Software
- Your code has bugs and vulnerabilities, but attack detectors (code injection, memory errors, etc.) exist. What do you do in this case? At the moment, just crash the application, which is a straightforward DoS.
- ClearView protects against unknown vulnerabilities, preserves functionality and works for legacy software.
- Zero-day exploits are a problem for hard-coded checks because they are unknown in advance.
- The application must continue to work (especially if mission-critical) despite attacks. A patch can repair the application. (Mind you, we shouldn’t always keep the application running: sometimes crashing is correct behaviour.)
- Want to do this without access to the source code, so can’t rely on built-in features. Needs to run on x86/Windows.
- Use learning as the secret sauce. Normal executions show how the application is supposed to run. Attacks provide information about the vulnerability, and can be used to give the system immunity. The first few attacks may crash the system, however.
- Detection is pluggable: tells you whether an execution was normal or an attack. Learning learns normal behaviour from successful runs, and checks constraints during attacks. This gives a predictive constraint, which is true on every good run, and false during every attack. Repair component creates a patch to re-establish constraints and invariants. System evaluates patches and distributes them to deployed applications.
- Use off-the-shelf detectors.
- Assume a single server and several community machines running the application. (Assume that they are not exploited to begin with.) Community machines report constraints back to the server. Use code injection and memory corruption attack detectors (others are possible).
- On an attack, the detector collects information and terminates the application. Server attempts to correlate this information with a constraint: leads to a predictive constraint. The server generates appropriate patches and distributes the best of these. The quality of patches can be refined by information about successful or failed attack attempts (failed or successful defenses). Redistribution is then possible.
- How do we learn normal behaviour? Use an ML technique called dynamic invariant detection (previous work), which has many optimizations for accuracy and speed. Technique was enhanced for this project.
- Inference results. ML technique is neither sound (overfitting) nor complete (templates are not exhaustive). However it is useful and effective. Sound in practice, and complete enough.
- How do we learn attack behaviour? Attack detectors aim to detect problems close to their source. Code injection uses Determina Memory Firewall (triggers when control jumps to code outside the original executable). Memory corruption uses Heap Guard (triggers on sentinel value overwrite). Techniques have low overhead and no false positives.
- Server pushes out appropriate instrumentation to the community. Only check constraints at attack sites (low overhead).
- Repairing installs additional detectors to see if you have a bad patch (e.g. looking for assertion violations).
- Aim to fix a problem while it is small, before the detector is invoked. Repair isn’t identical to what a human would write, but it is much more timely.
- Patches are evaluated in the field (do they avoid triggering the attack detector or prevent other behaviour deviations?).
- Evaluated with a Red Team that created 10 exploits (HTML pages) against Firefox 1.0. ClearView was not tuned to known vulnerabilities in that version, but the learning component focussed on the most security-critical components. Red Team had access to all project materials.
- ClearView detected every attack and prevented all exploits. For 7/10 vulnerabilities, ClearView generated patches that maintained functionality after an average of 4.9 minutes and 5.4 attacks. Handled polymorphic attack variants, simultaneous and intermixed attacks, and had no false positives (installing a patch when not under attack). Low overhead for detection and repair (considering this is an interactive application, not surprising).
- What about unrepaired vulnerabilities? 1. ClearView was misconfigured. 2. Learning suite was too small. 3. Needed constraint not built into Daikon. All zero-day attacks against the system and all trivial to fix with minor changes to ClearView.
- Q: introducing code is a bit scary… what if one of the patches introduces a new vulnerability? Firstly, you can only do this when you’ve found an exploitability. Red Team tried and failed. In one case, ClearView found a vulnerability in its own injected code.
- Q: if I were an attacker who wanted to DoS your system (knowing ClearView was running), I might try to disable ClearView somehow by making the ClearView DB learn an incorrect fact… so is your system vulnerable to that kind of attack? It doesn’t matter what facts are true during attacks, so you’d have to find good executions that weren’t observed as being bad to poison the database. It’s conceivable and theoretically possible that you could do that, but I don’t know if it’s practical.
- Q: what is the overhead of inserting invariants at every instruction? You will see between 5 and 10 constraints per instruction… learning is the biggest bottleneck, but you could distribute this amongst the community. In terms of unsoundness, we’re not seeing that as a problem.
- Q: how sensitive are you to the invariants that I have to specify for the patches, in the case where continuation introduces incorrectness into the persistent state? This is a policy decision: an hour of downtime for a bank is $6 million, so is it better to come back and fix things up later.
Debugging in the (Very) Large: Ten Years of Implementation and Experience
- 10 years of work by the first 8 authors.
- Even Microsoft’s shipping software has bugs! (And so does your hardware….)
- A bug is a flaw in program logic; an error is a failure in execution caused by a bug (1 bug -> many errors).
- How does Microsoft find out when things go wrong? We want to fix bugs regardless of source, prioritize bugs affecting the most users. Kernel crashes (BSOD), application crashes, everything down to invariant violations.
- Windows Error Reporting. What happens after you click “Send Error Report”?
- Server is over-provisioned to handle 100 million reports per day. 17 million programs have records in WER. 1000s of bugs have been fixed. Uses 200TB of storage, 60 servers over 10 years. Anyone in the audience can get access to WER data….
- Debugging in the large makes the user-developer feedback loop much longer, both in terms of the number of people and the latency. The problem is the human bottleneck (both in accuracy and latency). Goal was to remove humans from the loop.
- On an error, collect a minidump: stack of erroneous thread and a little extra context. If the user allows it, upload this to WER. An analysis procedure (!analyze) runs over all of these mini-dumps and clusters these in buckets.
- !analyze takes a minidump as input, and outputs a bucket ID. So increment the bucket count and prioritize buckets with the highest count. Actually upload only the first few minidumps for a bucket; after that just increment the count. Sometimes you need a full core dump, and programmers can request this to be collected on future hits.
- 2-phase bucketing strategy: labelling on the client (bucketed by failure point) and classifying on the servers (consolidate versions and replace offsets with symbols; find callers where the bug might be (if it calls known-good code)). This refines the bucket ID… more details in the paper.
- One bug can hit multiple buckets (up to 40% of error reports). Also multiple bugs can hit one bucket (up to 4% of error reports). Bucketing mostly works… scale is our friend (throw away a few here and there and you still have enough to debug).
- Bucket hits for a given program look like a Pareto curve. Just 20 buckets in Word 2010 account for 50% of all errors. Only fixing a small number of bugs will help many users.
- Earliest success story was finding heisenbugs in Windows kernel (>= 5 years old). Vista team fixed 5000 bugs in the beta. Anti-virus vendor accounted for 7.6% of all kernel crashes: in 30-days got this down to 3.6% of all kernel crashes. Office 2010 team fixed 22% of reports in 3 weeks.
- Example of hardware errors too: failure in a CPU (exact same revision and step). Chip vendor knew that the bug existed and didn’t think that it would get hit in real-life. Error reports dropped dramatically after the work-around was applied.
- Also hardware failures in SMBIOS of a particular laptop (buffer overrun); motherboard USB controller (only implemented 31/32 of DMA address bits). Lots of information about failures due to overclocking, HD controller resets and substandard memory.
- Also looked at malware. The Renos social engineering worm which caused Explorer.exe to crash when people downloaded something from an email. Saw a spike, issued a worm removal tools through Windows Automatic Update, and saw this decline very quickly. Shows that WER scales to handle global events.
- Distributed system architecture hasn’t changed in 10 years, and yet scales to global events.
- Product teams now have ship goals based on reducing WER reports: led to a numerically-driven debugging approach. Fundamentally changed software development at Microsoft.
- Q: what about privacy? Most private data ends up in the heap, not on the stack. Only collect stack. Also do some data-scrubbing based on things we know about (user ID, etc.). Looked for SSNs, credit-card numbers, etc.: found fewer than 10 possible matches in 100,000 error reports. MS also puts very strong employment conditions on how this data is accessed and used.
- Q: . Team looks for zero-day exploits in the WER data regularly. Philosophy is that there should be no overhead when users don’t hit an error report. 90% of users only ever have to do an increment of a counter.
- Q: are you keeping track of how many people don’t send error reports? We have good estimates of opt-in rates, from other systems that collect information from machines that are running normally (only if people opt-in to those programs of course). If half the world turned off the error reports, we’d still get enough information.
- Q: does analyzing these bugs tell you about common errors programmers make? Check for NULL is the most important ones. This has generated many guidelines that has been fed back into internal development processes.
Detecting Large-Scale System Problems by Mining Console Logs
- It is challenging to detect problems in large-scale internet services. Requires instrumentation (expensive to maintain, and may use modules that aren’t instrumented). So can we use console logs in lieu of instrumentation? They are easy for the developer to insert in their code, but they are imperfect (not originally intended for instrumentation). It is non-trivial to analyze their free-text contents.
- Go from 24 million lines of log messages, finding a small number of abnormal log segments, to a single page of visualization. Fully automatic process without manual input.
- Use machine learning based on carefully-defined (numerical) features in the logs. Parse the program source-code to infer information about the log contents, to generate appropriate features. Finally visualize the results.
- Key insight: log contains the necessary information to create features. Identifiers and state variables are useful. Important information is in the correlation between messages (examples taken from HDFS (Hadoop Distributed File System)). e.g. “receiving block x” followed by “received block x” is normal; without “received block x”, you have an error.
- First step is to parse free-text logs into semi-structured text. Look at program source to generate regular expressions that extract state variables (the block number, for example). It becomes non-trivial in OO languages, where type inference on the whole source tree is necessary. Yields highly accurate parsing results.
- Identifiers are widely used in logs (filename, object keys, user IDs, etc.). Do group-by on identifiers. Identifiers can be discovered automatically.
- Now build a numerical representation of traces for feature creation. Approach is similar to the bag of words model in information retrieval. Yields a message term vector with term frequencies.
- Can use these vectors to do anomaly detection. Use Principal Component Analysis (PCA) to capture normal patterns in all vectors. These are based on correlations between dimensions of the vector.
- Ran an experiment on Amazon EC2. 203 nodes ran Hadoop for 28 hours, with standard map-reduce jobs (sorting etc.). Generated 24 million lines of console logs with ~575000 HDFS blocks. 575000 vectors lead to ~680 distinct vectors. Distinct cases were labelled manually as being normal or abnormal (in the evaluation only). However, the algorithms are unsupervised and automatic.
- 11 kinds of anomaly, occurring 16916 times. PCA detected 16808 of these. Two kinds of false positive: background migration and multiple replicas. Believe that no unsupervised algorithm could do better, so we’re now allowing operators to provide feedback.
- Results are visualized with a decision tree. Unusual log message text is used to document split points in the decision tree.
- Future work. Want to improve parsing so that it doesn’t require source code, and support more languages. Also want to improve feature creation and machine learning so that online detection is possible, also across applications and layers to provide more useful and comprehensive information.
- Q: most applications have many identifiers, so how do you automatically detect these, and how reliably? The grouping step addresses the problem of multiple identifiers. In HDFS, we only have the block ID, but we have an example in the paper where we run the algorithm multiple times for each class of identifier.
- Q: how do you know whether the different values of an identifier correspond to the same identifier variable? [Taken offline.]
- Q: what was different about the single anomaly where you don’t do well? (Deleting a node when it no longer exists on the data node.) Block numbers can be arbitrary due to multiple reads and writes. Sometimes get errors in the correlations.
- Q: how common do you think this is in other systems besides HDFS? HDFS has this problem most severely because every block interaction is written to the logs.
- Q: you had to turn on the more-detailed logging level to get this to work, so how did you choose this? I had to turn on debug-logging. Depends on the problems you want to detect. Turn on more logging when you see problems but you can’t find out why.
- Q: what happens to performance when you do this? Also what about heisenbugs that go away with more-detailed logging? System doesn’t do anything about logging-based heisenbugs.
- Q: what if I add a new feature, would you be able to detect problems in it? Don’t currently deal with multiple versions of the software.
- Q: how much information does your visualization offer to the developer to help them diagnose detected problems? If the operators have some insight, this tool can help them provide useful information to the developers.