Inventing abstract data types, CLU, Type hierarchy, What next?
More of a Programming Methodology talk than a systems talk.
Started out in systems with the Venus machine on the Interdata 3. Presented it and its operating system at SOSP 1971.
Back in the early 1970’s, people were concerned about the software crisis. (cf. Dijkstra’s 1972 Turing Award lecture, The Humble Programmer.) As machines got cheaper, bigger and faster, software started to matter a lot. A tendency to underprovision the hardware, creating challenges for the software developers.
In late 1960’s the field of Programming Methodology began. Started to think about program design and structure: in order to have functionality, maintainability, etc.
First paper: Go To Statement Considered Harmful (Dijkstra, 1968). A revolutionary letter to CACM. Use static program text to reason about dynamic program behaviour: it would be useful if these were as close as possible, but GOTOs prevent that. Example of debugging: understand how you got to a particular point in the text. GOTOs make this very difficult. Provoked a huge amount of resistance: can’t program without it. Pointed out limitations of the programming languages: branch into a label table is just a case statement, but they didn’t have case statements.
Second paper: Program Development by Stepwise Refinement (Wirth, 1971). European school of software development: a top-down approach, starting with many abstract parts which are not initially implemented. Example was the 8-queens problem.
Third paper: Information Distribution Aspects of Design Methodology (Parnas, 1971). A new interest in modularity, and information hiding. “The connections between modules are the assumptions which the modules make about each other.” Hedging about what modules were at the time.
Fourth paper: On the Criteria to be used in Docomposing Systems into Modules (Parnas, 1972). How to actually break a system into modules: a hint of data abstraction but the full idea wasn’t there yet.
Enter Barbara Liskov: A Design Methodology for Reliable Software Systems (1972). Entire system should be partitioned: no global state and each partition owns some part of the state. Partition exposes operations and only way of interacting with the state would be by calling operations on the modules.
Wanted to apply the idea of partitions for building programs. It was unclear how to combine modules to make whole programs. Idea of partitions came out of doing work on operating systems: ur-partitions were the supervisor and user modes. This idea is carried a lot further by ADTs.
Idea was to connect partitions to data types. A strike of inspiration. Ideas often arrive in the middle of the night, or when arriving to work with a fresh mind.
March 1973 SIGPLAN/SIGOPS interface meeting on programming methodology was the debut of the idea. Began working with Steve Zilles. At the time, they were knowledgable about all the languages in existence (FORTRAN, LISP, ALGOL, PL/I, COBOL…). Started to do language design.
People were interested in extensible languages as early as 1967 (Schuman and Jourrand, Balzer). How can we help people build dialects of languages that make them easier to use. Looked at syntactic and semantic extensibility. Syntactic extensions written in BNF and added to the language using some kind of preprocessor. Fortunately this died a death…. People were much more worried about writing programs than reading them. Didn’t realise that programs are read more often than they are written. Balzer imagined data types as being collections which allowed four defined operations (add, remove, etc.) with operator overloading.
Hierarchical Program Structures (Dahl and Hoare, 1972): Simula 1967. Didn’t have encapsulation but did have inheritance to make simulation easier. Precursor to Smalltalk.
Protection in Programming Languages (Morris, 1973). Recognised the importance of locality in module comprehension (allows local reasoning). Proposed sealed objects using encryption as an OS mechanism to guarantee locality.
Global Variable Considered Harmful (Wulf and Shaw, 1973). In the 1960’s a stream of languages made block structure the big thing. Give locality within blocks, but you can always access things on the outside (i.e. global variables). Made analogy with Dijkstra’s paper that global variables are implicit connections between states of the program, which makes reasoning about it more difficult.
Programming with Abstract Data Types (Liskov and Zilles, 1974).
Said what ADTs were: a set of operations (whatever ones made sense, not a fixed set), encapsulation was important, and the operations were the only way to access object state.
ADTs were “clusters” with encapsulation. Proposed polymorphism, static type checking (but weren’t sure if this was possible due to polymorphism) and exception handling.
Why was a new programming language necessary? Needed to communicate the ideas to programmers. Enabled the testing of whether ADTs work in practice. Also made it possible to get a precise definition (tendency to think of the compiler as the language definition…). And to validate whether it was possible to achieve reasonable performance.
Goals of language design: ease of use, simplicity, expressive power and performance. First two played off against second two.
Also wanted minimality (limiting the language to what we could get by with), uniformity (keep abstract types similar to the built-ins) and safety (find errors as soon as possible: compile time?).
Assumptions/design decisions. Wanted language to be heap-based with garbage collection (based on experience with LISP). Program was a collection of procedures rather than a linear piece of code (ALGOL style). People were scared of pointers, but used them to simplify the design. No block structure! Separate compilation of individual modules. Also had static type checking which was meant to speed up finding errors. No concurrency (cut out what wasn’t necessary to simplify a big project). No GOTOs. No inheritance.
CLU clusters. A cluster had a header with a list of the operations that it defined. Thought of operations as belonging to the type rather than the object (passing an object as an argument). Defined the representation of the object internally, and the implementation of the operations. Used “cvt” to define unsealing on operation entry, and sealing on exit (but compile-time checked).
Polymorphism: set[T: type], and had a where clause for the type parameter to specify, e.g., that T has an equals: T -> T -> bool function.
Exception handling: Issues and a Proposed Notation (Goodenough, 1975). People didn’t know the right model: procedure should terminate (now the status quo), or throw exception to a higher level that would allow it to resume. PL/I had both. How should handlers be specified? At the call-site, or out of the main-line for all invocations to that function. CLU used termination, and specifies the exceptions that a method may call in the header. Handled at the call-site.
How to handle exceptions? Handle it, propagate it up the call stack, or signal failure (the exception shouldn’t happen…). Can never be certain that these last exceptions won’t happen, but don’t want to write code to deal with this, so CLU introduced the “failure” exception. Really want accurate interfaces (know exactly what exceptions a method might call) and no useless code.
Iterators. For all x in C do S. Solutions were to destroy the collection (repeated removal), or complicating the abstraction (turn it into an ordered set, making it indexible). In summer 1975-ish, the MIT group visited CMU, where the Alphard group were working on “Generators”. Thought these are a bit crusty, so invented iterators which are like procedures that you call and they yield instead of returning. Can nest iterators and recursively call them. Implemented by passing the loop body as an argument to the iterator. But this limited the expressive power.
In 1987, gave a keynote at OOPSLA, but had been ignoring object-oriented languages and inheritance in particular. Took the opportunity to get into the literature. Much of it was very bad/confused. Inheritance being used for two different things: implementation simplification and type hierarchy. The two were not compatible.
Implementation inheritance violated encapsulation! Subclasses depend on the implementation of the superclass, making it hard to change the superclass. CLU could do implementation sharing without inheritance.
Type hierarchy is much more interesting, but wasn’t well understood. How were stacks and queues related?
Led to the Liskov Substitution Principle: Objects of subtypes should behave like those of supertypes if used via supertypes methods. (Data abstraction and hierarchy (Liskov, 1988).) Didn’t realise that this was a big idea!
What next? The world has changed from one where people had no idea about modularity, to one where modularity is based on abstraction.
Modern programming languages (Java and C#) are pretty good. Procedures are missing: they are important, and the loss of them as a first-class thing makes the program less simple. Closures are missing, as are iterators. Exception handling is important but failure handling is not done well. Also need built-in types as a basis: extensibility might be going too far. Can we do better than “serialization” (horrible overloading of the term): can’t it be done by garbage collection?
The state of programming is pretty lousy. The COBOL programmers of yesterday are now writing web services and browsers. The era of globals has returned. There’s little encapsulation and protection, and yet these are handling confidential information. Problem might be persistent storage violating abstraction: perhaps we would be better with an object store that provides automatic translation and type preservation.
Programming language research. Is now the time for some new abstraction mechanisms? Probably not just specification langauges. Concurrency and multi-core: modularity is very helpful here, but there is still a lot of work to do. Should distributed systems be programmed in languages that include distribution as a first-class concept.
System research has done well. Abstractions like DHTs, map-reduce, client/server, distributed information flow. These have been useful for making progress.
Concerned that we trade off for performance (versus simplicity and semantics). Led astray by the end-to-end argument, but not sure that the argument is valid when the end is the user. We know that we’ll never be 100% reliable, but failures should be catastrophes, not laziness (or an optimization).
The systems community has thrived because it has been so open: embracing new concepts. Worried about the semantics of the coming Internet Computer: what does it mean to run a program in the cloud (the PL should embrace distribution?). Massive parallelism also coming as a problem: not sure they necessarily follow from the end of Moore’s Law. But we will need to manage them as a distributed system, so there is a lot learn from that world.
FAWN: A Fast Array of Wimpy Nodes
Awarded best paper.
Energy is becoming an important consideration in computing? Google uses $40 million of energy a week. Can we reduce energy costs tenfold? Without increasing capital costs?
Idea is to improve energy efficiency by using an array of well-balanced low-power systems. For computationally intensive, data-intensive systems.
Goal is to reduce peak power: 20% energy loss on power and cooling is considered “good”. 100% utilization => 1000W but 20% utilization => 750W. Fawn wants to get 100% utilization down to 100W, and less utilization to less power.
As CPU cycles have improved much faster than disk seeks, we have a gulf in wasted resources. Could rebalance by using very fast disks. Could use slower CPUs and moderately faster storage. Or could use slow CPUs and today’s standard disks. These are not equivalently efficient.
Fastest processors have superlinear power usage (Xeon7350, etc.) due to things like branch prediction, caching, etc. which are not so useful for data-intensive, I/O intensive workloads. Custom ARM nodes etc. are slow but power usage is dominated by fixed power cost. In between we have things like the Atom and XScale: FAWN targets these.
Application is data-intensive key-value systems. A critical infrastructure service with SLAs, random-access and read-write. Goal was to improve the efficiency in Queries/Joule or Queries/Second/Watt.
Prototype using Alix3c2 nodes with flash storage, 500MHz CPU, 256MB DRAM and 4GB flash.
Challenges: need efficient and fast failover, wimpy CPUs, limited DRAM and flash that is poor at small random writes.
Architecture: single front-end connected by switch fabric to many back-end. Front-end manages backends, acts as a gateway and routes requests. Use consistent hashing. Interesting design decisions are at the backends (focus of the talk).
High performance KV store: must map keys to values on the backend. Use an in-memory hashtable to find the location of a value on disk. 160-bit key into hashtable which contains a key fragment, valid bit and offset into the data region. On flash, store the entire key, length and the data itself. Limited DRAM means you only store a fragment of the key in memory (to get 12-byte hashtable entry): risk of collisions and multiple requests. Get a low probability of multiple flash reads.
Small random writes problem avoided by log-structured datastore. Helps also with node addition, which requires transfer of key ranges: the log structure makes the transfer of key ranges a simple streaming transfer… a simple iteration of the data store. But the SLA means that we need the source of the data continues running while doing this streaming: log-structure lets you minimize locking. Also need a compact operation on the data store, which also runs in the background.
Replication and strong consistency is covered in the paper.
Evaluated by considering the efficiency (energy) of KV lookup. Look at the impact of background operations on query throughput. And then TCO analysis for random read workloads.
Measured the efficiency of 256-byte KV lookups on various systems, looking at the power outlet power usage. Alix3c2/Sandisk got 346 QPS/Watt, whereas a standard desktop machine with SSD got 51.7 and two hard-disk based systems were 2.3 (MacBook Pro) and 1.96 (desktop).
Peak capacity is 1600 QPS. During compaction ±800. During split ±1200, and Merge ±700.
At 30% of peak query load, these have almost no discernable impact (±500 QPS in all cases).
TCO = Capital Cost + Power Cost ($0.10/kWh). When should you use FAWN?
Traditional system with 200W usage (5 x 2TB disks/160GB PCI-e Flash SSD/64GB FBDIMM per node). $2000–8000 per node. FAWN (10W usage) (2TB disk/64GB SATA Flash/DRAM-based). Where storage capacity is dominant, use FAWN + Disk. When you care about query throughput, use FAWN + DRAM. FAWN + Flash covers much of the remainder of the space, but Traditional + DRAM also covers some of the space.
To be energy efficient, require some effort on the part of the system developer.
Q: impact of networking on performance and cost models? Network is an important component, and we really want things like all-to-all communication, which often needs a high-powered core. But proposals for using low-powered commodity switches at scale mean that is possible to get down to a fixed power overhead per core.
Q: off-the-shelf memcached could do better? [Inaudible.] It takes a lot of effort to get very high throughput.
Q: what happens when you include latency in the model of TCO? Interactive services that serialize might shift the model? Flash devices gives pretty good common-case latency. What if you have computational load in the query? You have to trade off energy efficiency for longer periods of computation. Get high 99.9% latency during maintenance but it’s still okay.
Q: what happens when a frontend thinks one node is down, etc.? Have optional replication to provide constant access to data, and a background replication process.
RouteBricks: Exploiting Parallelism to Scale Software Routers
Awarded best paper.
Want to build routers that are fast and programmable. Why do we want them to be programmable? Programmable routers enable ISPs to offer new services (intrusion detection, application acceleration, etc.). They make network monitoring simpler (measuring link latency, tracking down traffic). Finally they make it possible to offer new protocols, such as IP traceback, trajectory sampling etc. They enable, flexible, extensible networks.
But today, speed versus programmability is a trade off. Fast routers are implemented in hardware (Tbps throughput), but offer no programmability. But programmable software routers get throughput < 10Gbps using general purpose CPUs.
RouteBricks uses off-the-shelf PCs, a familiar programming environment and large-volume manufacturing to reduce costs. How do we get to Tbps routers with these ingredients?
A router is just packet processing + switching. We have N linecards that handle each input or output port, and it must be able to handle packets at the per-port rate, R. Also have a switch fabric which much switch at N*R.
RouteBricks uses one server instead of each line card and a commodity interconnect.
Require internal link rates < R and per-server processing rate = c*R (c is a small, reasonable constant). Per-server fanout should be constant.
What interconnect satisfies these requirements? Could naively have a crossbar switch, with N^2 line-rate links. Could instead use Valiant Load Balancing, which introduces a third stage between input and output. Have N intermediate servers, and R/N rate links from each input to the intermediate, and from each intermediate to each output. Now have N^2 links at rate 2R/N. Per-server processing rate is 3R. But with uniform traffic patterns, each server only must process at 2R.
But this still gives linear per-server fanout (bad). If we increase per-server processing capacity, could assign multiple ports to each server. Or add a k-degree n-stage butterfly network. Combine these ideas: RouteBricks. Use full mesh if possible and extra servers otherwise.
The trade-off depends on the kind of servers that you have. Assuming current (5 NICs * 2 X 10G ports or 8 X 1G ports and 1 external port per server).
With 2R–3R processing rate, we need to optimize the server! Use a NUMA architecture (Nehalem, QuickPath interconnect), 2 quad-core CPUs, and 2*2*10G NICs. Run Click in kernel-mode.
First try: got 1.3Gbps per server. Spending a lot of cycles on book-keeping operations, such as managing packet descriptors. So use batched packet operations and CPU polling, which got 3Gbps.
Still a problem with how the cores accessed the NICs: queue access. Get synchronization overhead between the two cores. Simple rule: 1 core per port (input or output), gets no locking overhead. But now we have cache misses because of separate cores working on the same packet. So second rule: 1 core per packet. These rules are mutually exclusive!
Solution is to uses NICs with multiple queues per port. Can now assign each queue to a single core, which achieves our objective.
So, use state-of-the-art hardware, modify the NIC driver to do batching, and perform careful queue-to-core allocation.
For no-op forwarding, get 9.7Gbps with min-size packets, and 24.6Gbps with a realistic mix.
For IP routing (IPSec also but not shown), get 6.35Gbps for min-size, and 24.6Gbps with a realistic mix.
Realistic size mix: R = 8–12 Gbps. In this case, we are I/O bound. Min-size packets are CPU-bound.
Look at the next-gen Nehalem and do some back-of-the-envelope calculations. Could get R = 23–35 Gbps with upcoming servers for a realistic mix.
Prototyped RB4: N = 4 in a full mesh. Realistic size mix was 35 Gbps.
Did not talk about reordering (avoid per-flow reordering), latency (24us per server), open issues (power, form-factor, programming model). Slide illustrates Russell’s paradox quite nicely.
Q: about programmability, the examples require maintaining cross-packet state, so would the choice of the load-balancing mechanism within the router affects this? When the bottleneck is the CPU, so yes this is important.
Q: are the trends on both power and power-performance of next-gen processors in your favour? We spend more energy than the corresponding Cisco router, but we have a lot of room for improvement and will get better. Maybe best thing to do is use general-purpose CPUs with an efficient interconnect.
Q: is the real choice between general purpose CPUs and a different programming model? Could you do better with ternary CAMs, etc.? Programmability is just trapping exceptional packets. IP routing is easy? Not suggesting that you just throw away all hardware routers, but you might want to do it in some cases, e.g. for specialized monitoring at the borders of your network. A programmable datapath might make it easy to deploy a new protocol.
Q: curious about reordering phenomenon and how this would affect TCP? What would be the most extreme reordering that an adversary produce? [Inaudible.]
The Multikernel: A New OS Architecture for Scalable Multicore Systems
Boots into Barrelfish rather than kubuntu to get the video driver to work.
How do we structure an OS for the multicore future? Need to deal with scalability and heterogeneity.
Sun Niagara has a banked L2 cache for all cores (bad for shared memory). Opteron Istanbul has per-core L2. Nehalem is quite different still. Need to do different things to work well on each of these. An 8-socket Opteron has an “interesting” interconnect. The interconnect matters, especially to access latencies.
Can also have core diversity: within a system (programmable NICs, GPUs, FPGAs) and on a single die (for performance asymmetry, SIMD extensions or virtualization support).
As core count increases, so does diversity. And unlike HPC systems, one cannot optimize at design time. Need systems software to adapt for you.
So now is the time to rethink the default OS structure. Now we have shared-memory kernels on every core, data structures protected by locks and everything else being a device. So propose structuring the OS as a distributed system (like Barbara Liskov said earlier).
First principle: make inter-core communication explicit. All communication should be done with messages and no shared state. This decouples the structure of the system from the inter-core communication mechanism, and makes communication patterns explicit (which cores you communicate with etc.). It naturally supports heterogeneity of cores. Also better matches future hardware (no cache coherency? Such as Intel’s 80-core machine). Also allows split-phase operation which makes concurrency easier. And finally, we can reason about performance, scalability, etc. of such systems.
Simple microbenchmark to evaluate the trade-off between shared memory and message passing. Several cores update a shared array. In shared-memory case, we get stalling due to cache coherency protocol. This is limited by the latency of interconnect round-trips. Latency is linearish in the number of cores (up to 16), and gets worse with the number of cache lines modified.
What if we had a server core that takes messages from a ring buffer per client core? Higher latency to begin with, but scales much better as the number of cache lines modified is increased. Client overhead is queuing delay at the server. The server’s actual cost is approximately constant, and very low. This would give us many spare cycles if the RPC is implemented split-phase.
Second principle: make the OS structure hardware-neutral. The only hardware specific parts should be the message transports and the CPU/device drivers. Makes it possible to adapt to changing performance characteristics. Can late-bind protocol and message-passing implementations.
Third principle: view state as replicated. Potentially-shared local state is accessed as if it were a local replica (e.g. scheduler queues, process control blocks, etc.). The message-passing model requires this. This naturally supports domains that don’t share memory.
Replicas were previously used as a selective optimization in other systems. The multikernel makes sharing a local optimization, instead (opposite view). Only use shared state when it is faster, and make this decision at runtime.
Can support applications that need shared memory if it is available, but the OS doesn’t rely on this.
Currently run on x86-64 machines. A CPU driver handles traps and exceptions serially. A user-space monitor mediates local operations on global state. Use URPC inter-core message transport, but we expect that to change.
Many non-original ideas (in particular decoupling message-passing from synchronization).
Many applications running on Barrelfish (slide viewer, webserver, VMM, OpenMP benchmarks, etc.).
How do you evaluate a radically-different OS structure? Barrelfish was from-scratch, so is less complete than other OSs. But we need to show that it has good baseline performance: comparable to existing systems.
Case study of unmap (TLB shootdown). Logically need to send a message to every core and wait for all to acknowledge. Linux/Windows use IPIs and a spinlock to do this. Barrelfish makes a user request to a local monitor, and uses message passing.
Tried several different communication protocols. One unicast channel per core; a single broadcast channel. Neither of these perform well (especially broadcast because there is no such thing in the interconnect).
Really want a multicast optimization: send message once to every socket, which has an aggregation core that forwards it on to local cores. Also use the HyperTransport topology to make decisions about which cores are further away (and hence should receive the message earlier to give parallelism). Need to know the mapping of cores to sockets, and the messaging latency.
Use constraint programming and a system knowledge base to perform online reasoning. Prolog query on SKB constructs the appropriate multicast structure. Unmap latency is much less than Windows for >7 cores, and Linux for >12 cores. It also scales much better.
Show that there is no penalty for shared-memory user workloads (OpenMP), respectable network throughput, and a pipelined webserver with high throughput.
Conclusion: no penalty for structuring using message passing. So we should start rethinking OS structure this way.
Q: how do you build an application on top of Barrelfish that tries really hard to ignore topology? Focus so far on how to build the OS, and make no particular demands on application programming. Gone with POSIX so far, but we need some higher-level programming model to express things more easily.
Q: about URPC performance, are you running a single thing on each processor and hence is the process only ever spinning waiting for URPC, avoiding any context switching latency? That’s true. We decoupled messaging primitive from notification primitive, because notifications are very expensive (at least on our hardware). This leads to a trade-off. The split phase API makes it efficient to work in the case where you don’t need notification immediately.
Q: you didn’t mention virtualization… if we map one VM to one core, we can get good performance, so does Barrelfish obviate the need for virtualization? Barrelfish is viewed as orthogonal to virtualization.
Q: your microbenchmark [inaudible]. The point of the benchmark is to make the case that messaging over shared memory can have reasonable performance.
Q: as systems get bigger, do you expect messages to get lost? Maybe… if they do get lost, this model is a better way to structure it than shared-mmeory.
Fast Byte-granularity Software Fault Isolation
Operating systems run many drivers and these run in a fully trusted mode. But they turn out to be a major source of bugs. Existing solutions either require changes to the source code or unacceptably bad performance. Why? They are complex and require complex, fine-grained temporal and spatial sharing within the kernel. But any mistake in this can be fatal….
BGI isolates drivers in separate protection domains but allows domains to share the same address space. It provides strong isolation for existing drivers even with a complex kernel API. Write integrity, control-flow integrity and type-safety for kernel objects. Achieves this without changes to drivers or hardware, and only 6.4% CPU overhead (~12% space overhead).
Use a BGI compiler on unmodified source code to get instrumented driver, then link this with the BGI interposition library to get a BGI driver.
Protection model has kernel in trusted protection domain. Most drivers are untrusted and drivers may share untrusted domains. Each byte has an ACL. Access rights are read and write (default everything readable), indirect call, typed rights for different kernel objects (type safety for mutex, dispatcher, io etc.).
Two primitives: CheckRight and SetRight, called by the interposition library (not the driver code). Dynamic type checking is possible using these: check the arguments to kernel API calls, which ensures type safety, and state safety (temporal properties).
Compiler checks rights on writes and indirect calls, and grants and revokes write access to the stack as appropriate.
Rights are granted on driver load (write to globals, icall right to address-taken functions). Rights granted/revoked to function arguments according to call semantics. Function arguments are also rights-checked.
Example of a read request (event-based), and how the BGI compiler inserts inline checks to avoid many problems.
Rights change as the driver makes kernel calls, and BGI enforces complex kernel API usage rules (use-before-initialized, write-after-initialized, etc.).
Implementation must be fast. Fast crossing of protection domains (just a normal call instruction). Fast rights grant/revoke/check: uses compiler support to inline these checks and perform data alignment.
ACL data structure encodes (domain, right) pair as an integer. Several tables of “drights”: arrays for efficient access, and have 1-byte dright per 8-byte memory slot. Need a conflict table when rights ranges overlap. Want to avoid accesses to these conflict tables by aligning data on 8-byte memory slots. Makes the non-conflict case common at the cost of space. Heap objects are 8-byte aligned, and also need special drights for writable half-slots.
SetRight is implemented with 4 x86 assembly instructions. Uses arithmetic shift to obtain same code for kernel and user space addresses (with different tables).
CheckRight has fast check (5 instructions) and slow check (7 instructions).
BGI has a recovery mechanism that runs driver code inside a try block (cf. Nooks).
Evaluated on 16 Vista device drivers: 400 KLOC.
Evaluated for fault containment. Injected faults in the source of fat and intelpro drivers (following previous bug studies). Measured the number of faults contained by BGI: fat = 45/45, intelpro = 116/118.
Measured file I/O performance with Postmark. Max CPU overhead was 10% (for FAT) and max throughput cost was 12.3% (for FAT).
Measured network performance (max 16% CPU overhead and 10.2% throughput decrease both for UDP send).
Found 28 new bugs in widely used drivers. Some were serious: writes to incorrect places or use of uninitialized objects. Some less so: abstraction violation, etc. BGI is a good bug-finding tool.
Q: subtle problems such as virtual address aliasing or out-of-bound pointers within the bounds of other objects… how do you handle this? The granting of access rights is done according to the kernel API, for write/control-flow integrity and type safety. But we wouldn’t catch the case where a driver makes a write to something that it is allowed to write to, but isn’t the right thing.
Q: how do you deal with type-unknown pointers? Can catch errors on data buffers. But where pointers are passed as arguments, you typically don’t do arithmetic on these.
Q: compared with SFI work on already-compiled code (compiler not in TCB), could you do this on object code instead of source code? We have a binary version, but we report on the compiler version because it has better performance.
Q: the transformations looks similar to existing SFI, so to what do you attribute this improvement in performance? Dealing with complex kernel API and enforcing fine-grained sharing. Previous work did a lot more copying or had expensive crossing of protection domains, in order to deal with complex sharing.
Q: you’re redoing virtual memory protection with a complex compiler? Why not just use a microkernel? The performance can’t be that bad? The goal was to support existing drivers that run on existing operating systems with good performance. Could you do the same transformations on existing code to run in separate address spaces? Maybe, but don’t know of a way to do this. Goal was to deal with existing drivers on existing OS.
Q: could you just insert latent compiler-inserted checks to avoid zero-day exploits, which are only turned on on zero-day? An interesting idea, but it’s better if you can cheaply run it all the time in deployed systems… you will find more bugs this way.
Tolerating Hardware Device Failures in Software
Device drivers assume device perfection. But we can see hardware-dependence bugs across driver classes. Transient failures cause 8% of all unplanned reboots. The existing solution is to hand-code a hardened driver, which gets this down to 3%. What can we do with software fault tolerance? Detect hardware failures and perform recovery in software.
Where do bugs come from? Device wear-out, insufficient burn-in, bridging faults, EMF radiation, firmware bugs, corrupted inputs, timing errors, unpredictable DMA, etc.
Vendors give recommendations to driver developers. Firstly, validate all input coming from the hardware. Then ensure all operations take finite time. Failures should be reported. There are guidelines for recovery also. Goal is to implement as many of these as possible automatically.
System is called Carburizer. Runs on driver source code, and compiler generates hardened driver that links to the Carburizer runtime.
Goal is to fix these bugs with static analysis. Find driver code that uses device data, and ensures that the driver performs validity checks. Fixes bugs from infinite polling, unsafe array reference, unsafe pointer dereference and system panic calls.
First, use CIL to identify tainted variables. Consult a table of functions known to perform I/O and mark heap and stack variables that receive data from these. Propagate taint through computation and return values. Now find the risky uses of these variables. e.g. Find loops where all terminating conditions depend on tainted variables.
Now look for tainted variables used to perform unsafe array accesses: used as array index into static or dynamic arrays.
Evaluated by analysis of Linux 220.127.116.11. Analysed 2.8 million LOC in 37 minutes (including analysis and compilation). Found a total of 992 bugs in driver code, with false positive rate of 7.4% (based on a manual sampling of 190 of the bugs). May also be some false negatives because don’t track taint via procedure arguments.
Automatic fixing of infinite loops by inserting timeout code. Inserted code can never harm the system because the timeout is conservative.
Inserts code to bounds-check arrays when an unsafe index is used.
Empirical validation using synthetic fault injection on network drivers: modify return values of in and read functions. Without Carburizer, the system hung; with it, the driver recovered. Works well for transient device failures.
Also want to report device errors to fault management systems. Carburizer (i) detects the failures, and (ii) reports them. Report things like loop timeouts, negative error returns and jumps to common cleanup code. Also looks for calls to functions with string arguments, and considers these to be error reporting, so it doesn’t insert additional reporting in this case.
Evaluated failure reporting with manual analysis of drivers of different classes (network/scsi/sound). No false positives, but a few false negatives. Overall fixed 1135 cases of unreported failures. Thus it improves fault diagnosis.
Static analysis is insufficient. We also need to consider runtime failures, such as missing and stuck interrupts. Can detect when to expect interrupts, and invoke ISR when bits are referenced but there is no interrupt activity. Can also detect how often to poll, which reduces spurious interrupt invocation, and improves overall performance.
Can also tolerate stuck interrupts by seeing when an ISR is called too often, and converts from interrupt-based to polling in this case. This ensures the system and device make forward progress.
Evaluated effect on network throughput: overhead is < 0.5%. And on CPU utilization, this goes up 5% for nVidia MCP 55 when recovery is used; no cost when just doing static bug fixing.
Covers 8/15 of the recommendations for driver writers, automatically.
Q: is your intention that driver developers should run Carburizer only in development and incorporate its fixes, or to run it on deployed code? Encourage sysadmins to run the tool all the time to cover as many bugs as possible.
Q: talked about transient device failures, but what about transient errors in the CPU? In this case, even the code execution will fail. Can you comment on the relative frequencies of these kinds of failure? Not aware of this data. Cannot trust anything when the CPU fails, so not sure what we can do about this.
Q: do you have a rough feel about how many of the found bugs could be used by malicious attackers to perform code injection? Don’t know off hand. Certainly very easy to hang the system, so could do this maliciously.
Q: could you talk about the static analysis that you are using? How do you handle pointer aliases? Analysis is very simple, and we do have 7.4% false positives. But the combination of our techniques gives us this low rate.
Automatic Device Driver Synthesis with Termite
Conventional driver development requires acquiring necessary information from OS interface spec and device spec. Need to combine this information to come up with a driver implementation that appropriately translates OS requests into driver requests.
If we formalize these specs, we can do better by synthesizing drivers, and there is no longer a need for developers to be an expert in both the OS and the device. Only know one well. Furthermore, code can be specified once and synthesized many times.
Use finite state machines as the basic formalism for writing these specs. Note device-initiated transitions and software-initiated transitions.
First step is to take two state machines (OS and device) and synthesize a combined state machine that considers all possible transitions. Any transition represents a legal transition in the driver. Also associate timeout labels with transitions.
Now translate the state machine into C source code (simple and covered in the paper).
A real device has multiple functional units, so can’t possibly use a single FSM for all of these. A new language is used to compose multiple FSMs together (one per functional unit). Need a synthesis algorithm that can handle this: need to deal with the state explosion problem (by exploring the state space incrementally). Also need to deal with data, and cannot model all possible assignments to each variable. Instead manipulate data symbolically.
Termite successfuly synthesizes drivers for Asis AX88772 USB-to-Ethernet (on Linux) and Ricoh R5C822 SD host controller (on Linux and FreeBSD).
<1 kLOC for OS interface spec. <700 LOC for device specs. Synthesized drivers are 2 to 4 times larger than the Linux drivers.
Showed a demo of a visual Termite debugger. Allows single-stepping and setting of breakpoints on states.
Performance is very close to the native drivers.
Some limitations: cannot specify constraints on data (alignment, fragmentation, etc.), complex inter-variable relations are not supported (limitation of symbolic execution engine), the structure of specifications is restricted, and Termite drivers require runtime support. Not conceptual limitations, only of the current implementation.
One issue is how you write the specifications. This is particularly onerous for the device manufacturers. Also a potential source of bugs. Need a big brain to do this based on the HDL of the device. Ideally we would automate this translation (HDL to driver spec). Since you write devices in HDL anyway, this shouldn’t be too bad.
Not much open-source hardware around, so there was a difficulty in finding hardware on which to evaluate this.
Q: to make this practical, there are a few questions: performance? Scalability (for big devices like video cards)? Performance hasn’t been an issue so far, because the device developer makes the same assumptions that Termite ends up making, so the generated code is quite similar. (But the devices are simple?) So far this hasn’t been a problem. Scalability is more of an issue. Looking at using better symbolic execution that ignores irrelevant relations between variables.
Q: how do you deal with firmware in the device (also specified in HDL)? At the moment, the manual process has to cover this. Once we automate this, it might be more challenging, so we might want to generate firmware as well.
Q: what about the data size of compiled code, and the CPU utilization when saturating the device? Code and data size doesn’t look very big (roughly proportional to the amount of source code). Haven’t looked so much at data size.
Q: does your spec language have room for cork tables and other crap? [Taken offline.]
Q: the OS spec is intended to be device independent (e.g. generic for GPU, ethernet driver, etc.), but how do you cope with new features in a device, which people like to have? All the interesting stuff is on the side? How does that play with the OS spec if you want to take advantage of these features? We cope with standard features, could include “semi-standard” optional features in the OS spec. For unique features, you would need to extend (but not rewrite) the OS spec. No experience with doing this.
Q: the requirement for a full functional spec of the OS driver interface is somewhat intimidating, so what is your experience in making these, and how do they scale? It’s generally doable, but we would want to create a special-purpose language to make this easier.
Automatically Patching Errors in Deployed Software
Your code has bugs and vulnerabilities, but attack detectors (code injection, memory errors, etc.) exist. What do you do in this case? At the moment, just crash the application, which is a straightforward DoS.
ClearView protects against unknown vulnerabilities, preserves functionality and works for legacy software.
Zero-day exploits are a problem for hard-coded checks because they are unknown in advance.
The application must continue to work (especially if mission-critical) despite attacks. A patch can repair the application. (Mind you, we shouldn’t always keep the application running: sometimes crashing is correct behaviour.)
Want to do this without access to the source code, so can’t rely on built-in features. Needs to run on x86/Windows.
Use learning as the secret sauce. Normal executions show how the application is supposed to run. Attacks provide information about the vulnerability, and can be used to give the system immunity. The first few attacks may crash the system, however.
Detection is pluggable: tells you whether an execution was normal or an attack. Learning learns normal behaviour from successful runs, and checks constraints during attacks. This gives a predictive constraint, which is true on every good run, and false during every attack. Repair component creates a patch to re-establish constraints and invariants. System evaluates patches and distributes them to deployed applications.
Use off-the-shelf detectors.
Assume a single server and several community machines running the application. (Assume that they are not exploited to begin with.) Community machines report constraints back to the server. Use code injection and memory corruption attack detectors (others are possible).
On an attack, the detector collects information and terminates the application. Server attempts to correlate this information with a constraint: leads to a predictive constraint. The server generates appropriate patches and distributes the best of these. The quality of patches can be refined by information about successful or failed attack attempts (failed or successful defenses). Redistribution is then possible.
How do we learn normal behaviour? Use an ML technique called dynamic invariant detection (previous work), which has many optimizations for accuracy and speed. Technique was enhanced for this project.
Inference results. ML technique is neither sound (overfitting) nor complete (templates are not exhaustive). However it is useful and effective. Sound in practice, and complete enough.
How do we learn attack behaviour? Attack detectors aim to detect problems close to their source. Code injection uses Determina Memory Firewall (triggers when control jumps to code outside the original executable). Memory corruption uses Heap Guard (triggers on sentinel value overwrite). Techniques have low overhead and no false positives.
Server pushes out appropriate instrumentation to the community. Only check constraints at attack sites (low overhead).
Repairing installs additional detectors to see if you have a bad patch (e.g. looking for assertion violations).
Aim to fix a problem while it is small, before the detector is invoked. Repair isn’t identical to what a human would write, but it is much more timely.
Patches are evaluated in the field (do they avoid triggering the attack detector or prevent other behaviour deviations?).
Evaluated with a Red Team that created 10 exploits (HTML pages) against Firefox 1.0. ClearView was not tuned to known vulnerabilities in that version, but the learning component focussed on the most security-critical components. Red Team had access to all project materials.
ClearView detected every attack and prevented all exploits. For 7/10 vulnerabilities, ClearView generated patches that maintained functionality after an average of 4.9 minutes and 5.4 attacks. Handled polymorphic attack variants, simultaneous and intermixed attacks, and had no false positives (installing a patch when not under attack). Low overhead for detection and repair (considering this is an interactive application, not surprising).
What about unrepaired vulnerabilities? 1. ClearView was misconfigured. 2. Learning suite was too small. 3. Needed constraint not built into Daikon. All zero-day attacks against the system and all trivial to fix with minor changes to ClearView.
Q: introducing code is a bit scary… what if one of the patches introduces a new vulnerability? Firstly, you can only do this when you’ve found an exploitability. Red Team tried and failed. In one case, ClearView found a vulnerability in its own injected code.
Q: if I were an attacker who wanted to DoS your system (knowing ClearView was running), I might try to disable ClearView somehow by making the ClearView DB learn an incorrect fact… so is your system vulnerable to that kind of attack? It doesn’t matter what facts are true during attacks, so you’d have to find good executions that weren’t observed as being bad to poison the database. It’s conceivable and theoretically possible that you could do that, but I don’t know if it’s practical.
Q: what is the overhead of inserting invariants at every instruction? You will see between 5 and 10 constraints per instruction… learning is the biggest bottleneck, but you could distribute this amongst the community. In terms of unsoundness, we’re not seeing that as a problem.
Q: how sensitive are you to the invariants that I have to specify for the patches, in the case where continuation introduces incorrectness into the persistent state? This is a policy decision: an hour of downtime for a bank is $6 million, so is it better to come back and fix things up later.
Debugging in the (Very) Large: Ten Years of Implementation and Experience
10 years of work by the first 8 authors.
Even Microsoft’s shipping software has bugs! (And so does your hardware….)
A bug is a flaw in program logic; an error is a failure in execution caused by a bug (1 bug -> many errors).
How does Microsoft find out when things go wrong? We want to fix bugs regardless of source, prioritize bugs affecting the most users. Kernel crashes (BSOD), application crashes, everything down to invariant violations.
Windows Error Reporting. What happens after you click “Send Error Report”?
Server is over-provisioned to handle 100 million reports per day. 17 million programs have records in WER. 1000s of bugs have been fixed. Uses 200TB of storage, 60 servers over 10 years. Anyone in the audience can get access to WER data….
Debugging in the large makes the user-developer feedback loop much longer, both in terms of the number of people and the latency. The problem is the human bottleneck (both in accuracy and latency). Goal was to remove humans from the loop.
On an error, collect a minidump: stack of erroneous thread and a little extra context. If the user allows it, upload this to WER. An analysis procedure (!analyze) runs over all of these mini-dumps and clusters these in buckets.
!analyze takes a minidump as input, and outputs a bucket ID. So increment the bucket count and prioritize buckets with the highest count. Actually upload only the first few minidumps for a bucket; after that just increment the count. Sometimes you need a full core dump, and programmers can request this to be collected on future hits.
2-phase bucketing strategy: labelling on the client (bucketed by failure point) and classifying on the servers (consolidate versions and replace offsets with symbols; find callers where the bug might be (if it calls known-good code)). This refines the bucket ID… more details in the paper.
One bug can hit multiple buckets (up to 40% of error reports). Also multiple bugs can hit one bucket (up to 4% of error reports). Bucketing mostly works… scale is our friend (throw away a few here and there and you still have enough to debug).
Bucket hits for a given program look like a Pareto curve. Just 20 buckets in Word 2010 account for 50% of all errors. Only fixing a small number of bugs will help many users.
Earliest success story was finding heisenbugs in Windows kernel (>= 5 years old). Vista team fixed 5000 bugs in the beta. Anti-virus vendor accounted for 7.6% of all kernel crashes: in 30-days got this down to 3.6% of all kernel crashes. Office 2010 team fixed 22% of reports in 3 weeks.
Example of hardware errors too: failure in a CPU (exact same revision and step). Chip vendor knew that the bug existed and didn’t think that it would get hit in real-life. Error reports dropped dramatically after the work-around was applied.
Also hardware failures in SMBIOS of a particular laptop (buffer overrun); motherboard USB controller (only implemented 31/32 of DMA address bits). Lots of information about failures due to overclocking, HD controller resets and substandard memory.
Also looked at malware. The Renos social engineering worm which caused Explorer.exe to crash when people downloaded something from an email. Saw a spike, issued a worm removal tools through Windows Automatic Update, and saw this decline very quickly. Shows that WER scales to handle global events.
Distributed system architecture hasn’t changed in 10 years, and yet scales to global events.
Product teams now have ship goals based on reducing WER reports: led to a numerically-driven debugging approach. Fundamentally changed software development at Microsoft.
Q: what about privacy? Most private data ends up in the heap, not on the stack. Only collect stack. Also do some data-scrubbing based on things we know about (user ID, etc.). Looked for SSNs, credit-card numbers, etc.: found fewer than 10 possible matches in 100,000 error reports. MS also puts very strong employment conditions on how this data is accessed and used.
Q: . Team looks for zero-day exploits in the WER data regularly. Philosophy is that there should be no overhead when users don’t hit an error report. 90% of users only ever have to do an increment of a counter.
Q: are you keeping track of how many people don’t send error reports? We have good estimates of opt-in rates, from other systems that collect information from machines that are running normally (only if people opt-in to those programs of course). If half the world turned off the error reports, we’d still get enough information.
Q: does analyzing these bugs tell you about common errors programmers make? Check for NULL is the most important ones. This has generated many guidelines that has been fed back into internal development processes.
Detecting Large-Scale System Problems by Mining Console Logs
It is challenging to detect problems in large-scale internet services. Requires instrumentation (expensive to maintain, and may use modules that aren’t instrumented). So can we use console logs in lieu of instrumentation? They are easy for the developer to insert in their code, but they are imperfect (not originally intended for instrumentation). It is non-trivial to analyze their free-text contents.
Go from 24 million lines of log messages, finding a small number of abnormal log segments, to a single page of visualization. Fully automatic process without manual input.
Use machine learning based on carefully-defined (numerical) features in the logs. Parse the program source-code to infer information about the log contents, to generate appropriate features. Finally visualize the results.
Key insight: log contains the necessary information to create features. Identifiers and state variables are useful. Important information is in the correlation between messages (examples taken from HDFS (Hadoop Distributed File System)). e.g. “receiving block x” followed by “received block x” is normal; without “received block x”, you have an error.
First step is to parse free-text logs into semi-structured text. Look at program source to generate regular expressions that extract state variables (the block number, for example). It becomes non-trivial in OO languages, where type inference on the whole source tree is necessary. Yields highly accurate parsing results.
Identifiers are widely used in logs (filename, object keys, user IDs, etc.). Do group-by on identifiers. Identifiers can be discovered automatically.
Now build a numerical representation of traces for feature creation. Approach is similar to the bag of words model in information retrieval. Yields a message term vector with term frequencies.
Can use these vectors to do anomaly detection. Use Principal Component Analysis (PCA) to capture normal patterns in all vectors. These are based on correlations between dimensions of the vector.
Ran an experiment on Amazon EC2. 203 nodes ran Hadoop for 28 hours, with standard map-reduce jobs (sorting etc.). Generated 24 million lines of console logs with ~575000 HDFS blocks. 575000 vectors lead to ~680 distinct vectors. Distinct cases were labelled manually as being normal or abnormal (in the evaluation only). However, the algorithms are unsupervised and automatic.
11 kinds of anomaly, occurring 16916 times. PCA detected 16808 of these. Two kinds of false positive: background migration and multiple replicas. Believe that no unsupervised algorithm could do better, so we’re now allowing operators to provide feedback.
Results are visualized with a decision tree. Unusual log message text is used to document split points in the decision tree.
Future work. Want to improve parsing so that it doesn’t require source code, and support more languages. Also want to improve feature creation and machine learning so that online detection is possible, also across applications and layers to provide more useful and comprehensive information.
Q: most applications have many identifiers, so how do you automatically detect these, and how reliably? The grouping step addresses the problem of multiple identifiers. In HDFS, we only have the block ID, but we have an example in the paper where we run the algorithm multiple times for each class of identifier.
Q: how do you know whether the different values of an identifier correspond to the same identifier variable? [Taken offline.]
Q: what was different about the single anomaly where you don’t do well? (Deleting a node when it no longer exists on the data node.) Block numbers can be arbitrary due to multiple reads and writes. Sometimes get errors in the correlations.
Q: how common do you think this is in other systems besides HDFS? HDFS has this problem most severely because every block interaction is written to the logs.
Q: you had to turn on the more-detailed logging level to get this to work, so how did you choose this? I had to turn on debug-logging. Depends on the problems you want to detect. Turn on more logging when you see problems but you can’t find out why.
Q: what happens to performance when you do this? Also what about heisenbugs that go away with more-detailed logging? System doesn’t do anything about logging-based heisenbugs.
Q: what if I add a new feature, would you be able to detect problems in it? Don’t currently deal with multiple versions of the software.
Q: how much information does your visualization offer to the developer to help them diagnose detected problems? If the operators have some insight, this tool can help them provide useful information to the developers.