SOSP 2009: Day 2

I/O

Better I/O Through Byte-Addressable, Persistent Memory

  • DRAM is fast, byte-addressable and volatile, but disk/Flash are non-volatile, but slow and not byte-addressable. BPRAM is all three!
  • Phase change memory is a promising source for this. Bits encoded as resistivity. Access latency in the nanoseconds, and far better endurance than flash. Designed BPFS for BPRAM.
  • Goal: FS ops commit atomically and in program order. Data is durable as soon as the cache flushes. Use short-circuit shadow paging to get this (new consistency model).
  • Eliminate DRAM buffer cache; use L1/2 instead. Put BPRAM on the memory bus. Provide atomicity and ordering in hardware.
  • Both BPRAM and DRAM are addressable by the CPU: physical address space is partitioned into volatile/non-volatile.
  • BPFS gets better performance than NTFS on the same media.
  • What happens on crash during update? Short-circuit shadow paging comes into play (contrast with journalling or shadow paging). Overhead of journalling is that all data (or metadata) must be written twice. Shadow paging uses copy-on-write up to the root of the FS: drawback is that writes propagate all the way back to the root (multiple updates), and small writes have a large copying overhead.
  • Short-circuit shadow paging makes in-place updates where possible. Uses byte-addressability and atomic, 64-bit writes. Both in-place updates and appends are made simple by this technique. Cross-directory rename does bubble up to the common ancestor.
  • Problem: if data is cached in L1/L2, the ordering of cache eviction can lead to inconsistent states. Also, writes from the cache controller might not be atomic.
  • So add two new hardware components to the CPU and cache controller. Epoch barriers are used to declare ordering constraints and they are much faster than a write-through cache. Also add capacitors to DIMMs which allow writes to propagate even after the loss of power.
  • Do CoW then Barrier then Commit. Paper also shows how to make it work on multiprocessors.
  • Built and evaluated on Windows as an in-kernel file system.
  • Microbenchmarks (append n bytes and random n-byte write) compare NTFS/Disk, NTFS/RAM and BPFS/RAM. (Using DRAM in this experiments.) BPFS is significantly faster than NTFS on disk, and NTFS isn’t syncing so it isn’t durable!
  • Postmark benchmark compares NTFS/Disk, NTFS/RAM, BPFS/RAM and (projected) BPFS/PCM. BPFS/PCM is much faster than both NTFS/Disk and NTFS/RAM. Analytical projection based on sustained throughput of PCM.
  • Q: are the storage requirements of a database and of a file system converging when you have this hardware available? The changes to the hardware will be applicable to other sorts of storage systems like a database, not just filesystems.
  • Q: have you thought about how to expose more capabilities of this medium to the applications (not just sequential reads and writes)? Applications are currently written in terms of what is efficient.
  • Q: how do you atomically deal with a free-list? We don’t have a free-list. Don’t need to keep track of so many data structures because the medium is so fast, which lowers the consistency overhead.
  • Q: where do you go next for multiprocessors and clusters? One of the goals was to have multiple concurrent threads operating on the FS at the same time.
  • Q: is there a risk that data in the capacitor gets garbled after the machine gets switched off? [Taken offline.]
  • Q: could you go even faster without having the consistency guarantees, for applications that don’t need it? There’s always a trade-off here.
  • Q: how do you do mmap, and is meddling with L1/L2 caches going to be expensive? Haven’t implemented mmap yet, but we would have a much better guarantee of durability. Changes to the cache, in the paper have been looked at in terms of interference, and performed well.
  • Q: why didn’t you benchmark against a file system using a B-tree or a red-black tree that can take advantage of random writes [why did you compare against NTFS]? NTFS is widely used, and it’d be interesting to compare against others.

Modular Data Storage with Anvil

  • Data storage drives modern applications (everyone has a database) and they are frequently a bottleneck. Hand-built stores outperform general-purpose ones by up to 100x. Observe that changing the layout can substantially improve performance. Custom storage is hard to write, especially in order to provide consistency guarantees. Can be prohibitively expensive to experiment with new layouts.
  • Need a simple and efficient modular framework to support a wide variety of layouts.
  • Fine-grained modules: dTables. These are composable to build complex data stores. All writing is isolated to dedicated writable dTables, which incidentally has good disk access properties.
  • dTable = key/value store. Maps integers/floats/strings/blobs to blobs. Provides an iterator to support in-order traversal. dTables used by applications and frontends, and also other dTables. Can transform data, add indices or otherwise construct complex functionality from simple pieces.
  • Example of a mapping from customer IDs (mostly contiguous) to states. Start with an array dTable for the common case. Layer a dictionary on top of that (maps state names to array indices). Have an exception dTable for the case where a customer isn’t in one of the 50 states, and a linear-search dTable for their residences. But to make this fast, layer a B-Tree index dTable on top of the linear store.
  • So far just read-only. Updates are hard to do transactionally. Need to implement a write-optimized dTable. Fundamental writable dTable is the journal dTable. New data is appended to a shared journal; data are cached in an in-ram AVL tree. The journal is digested when it gets large. Transaction system is described in the paper. Layer this over read-only dTables.
  • Managed dTable goes at the top. Also have a Bloom filter dTable to deal with multiple overlaid read-only dTables.
  • Many additional dTables listed in the paper.
  • Evaluate  the effect of simple configuration changes on performance (modularity). Key lookup workload, comparing contiguous versus sparse keys. Contiguous good with arrays; sparse good with B-trees. Also show the benefit of layering an index on top of a linear store. Also show the low overhead of the Exception dTable.
  • Evaluated by running TPC-C. Replaces a SQLite backend with Anvil. Shows that Anvil outperforms both the original backend and MySQL. Split read and write stores perform well.
  • Evaluated the cost of digesting and combining. These can be done in the background, taking advantage of additional cores and spare I/O bandwidth. Measured the overhead when doing a bulk load (1GB) into the dTable, with digests every few seconds.
  • Q:  why didn’t you compare either performance or features against BDB, which is very similar? Didn’t find it as easy to construct read-only data stores in BDB: creating customisable data stores has a lot of transactional overhead.
  • Q: did you evaluate iteration? In paper. How does the performance depend on the order of updates? Not sure what you mean. Did look at overlay iteration in the paper, which ought to be the most expensive (due to key lookup cost), and overhead was only 10%
  • Q: how did you make it so that the creator of a new dTable doesn’t have to consider ACID semantics? Most dTables are read-only, so you don’t need to worry about this (like shadow paging). The managed dTable has a small hook that enforces transactional semantics. And read/write dTables? Don’t envision that people will need to create these. Could implement your own, but this would miss the point.
  • Q: how should developers write recovery tools for systems like Anvil? Anvil includes such a tool that handles recovery for you. Read-only semantics makes this much simpler.

Operating Systems Transactions

  • [Unfortunately, I missed this talk due to having an obligation to man the registration desk. I'll try and track down the video and update this later.]

Parallel Debugging

Do You Have to Reproduce the Bug at the First Replay Attempt? — PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

  • Concurrent programs are hard to write. Multi-core makes concurrent programming more important, and bugs more common. However they are non-deterministic (requiring e.g. a special or improbable thread interleaving). This makes it hard to reproduce them.
  • Deterministic replay for uniprocessors is relatively easy: only need to record inputs, thread scheduling and return values of system calls. On multiprocessors, this is much more challenging. e.g. Simultaneous memory accesses are another source of non-determinism.
  • Previous proposals introduce new hardware, which don’t exist in reality. Or there are software-only approaches but they have up to 100x slow-down.
  • Ideally want to reproduce a bug with a single replay and no runtime overhead. But what if we relax this slightly?
  • Idea 1: record only partial information during a production run. Idea 2: push the complexity into diagnosis time. Idea 3: use feedback from unsuccessful (non-reproducing) replay runs.
  • Just record a sketch during the production run. When a replay goes off the sketch, terminate it immediately and feed back information about why it deviated for refining the next replay. Can eventually reproduce the bug with 100% probability.
  • Several different methods for sketch recording. Spectrum of approaches from UP deterministic replay to full MP DR. Can record e.g. synchronization points, or basic blocks, or more: build up this information during the replay runs.
  • At replay time, the partial information replayer consults the sketch to see that recorded global ordering is obeyed. How do we know whether a replay is successful? Use a failure detector based on crash, deadlock or incorrect results. This can also detect unsuccessful replay runs.
  • When a replay attempt fails, start it again. But could do something different the next time: a random approach would just leave it to fate, but PRES is more systematic. Failed reproduction is due to un-recorded data races. The feedback generator captures these races and tweaks them in future runs. Start with many candidate races and filter them down.
  • Implemented PRES using Pin. Evaluated many different applications (desktop, server and scientific). Overhead is around 18%, which is barely more than baseline Pin overhead. Macrobenchmarks show that PRES gets much higher throughput for server applications (MySQL).
  • Effectiveness: UP algorithm doesn’t detect any bugs within 1000 replays, whereas PRES gets 12/13 in 10 attempts. Feedback generation is crucial to effectiveness. Race filtering also effective.
  • Q: could you also use execution traces to help track down which parts of the execution trace cause the bug to not happen, and guide the programmer? Good idea.
  • Q: could you apply PRES to virtual machine replay? Also a good point. The work could be integrated with virtual machines. Could you rollback an execution, is it precise enough? Depends on the fine-grainedness of the recording scheme used. What is the inherent overhead in collecting a trace with sufficient fidelity to do backtracking? If there is a lot of lock operations, the low-overhead approach (SYNC) could work. But if there is no synchronization, we can’t use this information, and a more heavyweight scheme would be needed.
  • Q: why do the results differ so much from the next paper? The main idea is similar, but their work is more focussed on static analysis to reduce runtime overhead.
  • Q: would you advocate this as a solution for long-running (one year or more) services, as it is often only after this time that they emerge? We can take a checkpoint of the process state, which solves the problem of data accumulation. 

ODR: Output-Deterministic Replay for Multicore Debugging

  • Debugging non-deterministic software failures is really hard. The problem is how to reproduce these failures for debugging. Model checking/testing/verification could work, but it’s not perfect, and it doesn’t capture everything. So we need deterministic replay.
  • Need multiprocess operation, efficient recording, no special hardware and the ability to run arbitrary programs without annotation (especially programs with data races). All related work fails to meet one of these requirements.
  • ODR is a user-level replay system, which works in the MP case, has only 1.6x overhead, needs no new hardware and works on arbitrary x86 Linux binaries.
  • Often sufficient to produce any run with the same outputs… needn’t have the exact same execution. So the idea is to relax the determinism requirements.
  • The classic guarantee is value determinism: replay run reads and writes must have the same values as the original. Relax this to “output determinism”: the replay run produces the same user-visible output as the original. This is not perfect, but still useful for debugging: reproduces most visible signs of failure, and enables reasoning about failure’s root cause.
  • How to achieve this? Deterministic-run inference. Basic idea is to translate a program into a logical formula (verification condition). Function of schedule trace, input trace and read trace, returning an output trace. Use a formula solver to yield unknown schedule trace. Scale this by directing the inference using more original-run information. Also by relaxing memory consistency of the inferred run: where values read have nothing to do with schedule order, can use an arbitrary schedule trace.
  • Three-dimensional inference design space: memory consistency (strict, lock order or null), query complexity (output, I/O&lock-order, I/O&lock-order&path or determinant), and inference-time (polynomial or exponential). Search- and query-intensive DRI fit into this space.
  • Search-intensive is really slow (400–60000x slowdown). Formula generation, not solving is the bottleneck. Use multi-path symbolic execution with backtracking to generate formulas. Each backtrack involves a 200x slowdown. Backtracks are caused by race-tainted branches: wrong choice leads to a divergent, unsatisfying path. Work around by backtracking to the most recent race-tainted branch.
  • If we know the path (QI-DRI), inference time improves by 100x but we need to record much more data, so there is a 6x slowdown.
  • Future work is to reduce the path-search space, reduce the cost of each backtrack (cut down on race-tainted branch analysis), and parallelize formula generation by forking threads at each divergence.
  • Q: if I put in arbitrary values in the memory consistency model, how do I ensure that invariants are maintained? We don’t actually do null consistency, but if we did and invariants were violated the program might crash and output determinism catches this (because the output would be different).
  • Q: since your techniques are similar to the last paper, why are the results so different? In our approach, we do race-tainted branch analysis for the entire exectuion, and that is costly. We also do taint-flow propagation. There is more analysis in each backtracking iteration than theirs. We could reduce this cost by, for example, reusing results from previous iterations.
  • Q: have you considered changing your static analysis to another algorithm that could significantly improve your formula generation time? We are considering static approaches to formula generation.

Works-in-Progress

  • RAMCloud: Scalable, High-Performance Storage Entirely in DRAM. New research project. Zero results. Motivated by wanting to build large scale systems with low latency. Data center style of web application separates the code from the data, in order to scale, but 4–5 orders of magnitude increase in latency. Want this kind of scaling with latency close to memory speeds (sub-microsecond). Basic architecture puts all data in DRAM. Scale using commodity servers. Reckon we can get 5–10us RPC latency end-to-end. Also have a story on durability and availability. Also want to support multiple applications.

    Q: effect of other memories? Whichever wins should work with RAMCloud

    Q: GMS system from UW 10 years ago? Not familiar with that [taken offline].
  • Transactional Caching of Application Data using Recent Snapshots. DB-driven website performance issues: use memcached. Add an in-memory DHT that is very lightweight, and stores application objects (not a DB). DBs provide transactional consistency, but these caches don’t do this. Goal is transactional consistency for accesses to the cache. Idea is to embrace staleness: all read-only transactions to run on stale data. Avoids blocking and improves utilization. This is quite safe, since stale data is already everywhere. Application can control staleness. Add a TxCache library between memcached and the application. DHT values are timestamped, and have a validity interval.

    Q: paper at HotStorage from HP Labs in similar area?

    Q: how does the DB know the validity interval? Modified DB to track this.
  • Chameleon: A self-managing, low cost file system. Targeting home user or small business, who doesn’t want to lose data. Doesn’t know anything about RAID. Cost-sensitive. Deployment scenario has 4 PCs connected by fast LAN, with a broadband connection to cloud storage. There are many ways to replicate, place and encode data. Ideally store data on at least one offline device to avoid vulnerability to viruses. A small, trusted “anti-availability kernel” enforces this requirement. Use linear programming to select and adapt storage configuration: the design space is now even more complicated. Tend towards the optimal solution.
  • Sloth: Let the Hardware Do the Work!Looked at embedded OSes used in the automotive industry. OSEK OS is the prevalent real-time embedded OS: event-triggered, priority-driven real-time system. Don’t want to implement a scheduler. SLOTH lets the interrupt subsystem do the scheduling and dispatching work. All threads are implemented as interrupt handlers and have interrupt priorities. Each thread needs an IRQ source. Priorities enable pre-emption. Can implement a bunch of synchronization this way also. System is simple, small (concise implementation and memory footprint) and fast (2–20x).

    Q: this looks like very simple scheduling… how would you deal with something more complicated like earliest deadline? Drawback is no blocking system calls, so can’t do everything.
  • The case for cooperative kernel threads. Kernels are multithreaded, and drivers have concurrency bugs, which, if they are in the kernel, is bad. Event-based devices drivers need to use continuations to preserve driver context across blocking operations. This becomes very complex, almost as bad as dealing with pre-emptive threads. Cooperative threads give the best of both worlds: atomic execution but allowing blocking. Research showed that drivers are mostly I/O bound, so cooperative threads are appropriate. Implementing this as a “cooperative domain” in the Linux kernel.

    Q: Linux does have cooperative thread scheduling available, so how does this interact with the work you are doing? Providing a framework for implementing drivers this way, much nicer.
  • Abstractions for Scalable Operating Systems on Manycore Architectures. Tesselation. Goal isn’t just to support heterogeneous hardware, but also provide predictable performance and guarantees for applications. Asymmetrically structured OS: some cores are dedicated as a management unit for keeping track of and scheduling applications. Eliminates the need for per-core runqueues, improves cache locality, decreases lock contention and limits kernel interference with applications. Applications interact with the kernel through remote, asynchronous system calls. Applications make explicit requests for cores, and OS guarantees that they will be gang-scheduled. OS just provides cores to the application, doesn’t need to know about threads. Applications have private memory ranges.

    Q: how do you balance the different demands for resources across applications? [Taken offline.]
  • System Support for Custom Speculation Policies. Applications run on some speculation infrastructure, which speeds things up. Want to separate policy from mechanism. Typically implemented transparently, which means that you have to be conservative, giving limited opportunities for speculation. Idea is to push the policy into the application. What could an application do that is different from the default? Might allow some output to be uncommitted. Or could commit equivalent-but-not-identical results. Process gets a “speculative fork” interface. Use cases: predicting user actions (predictive bash shell), authentication and user-level network services (when you have a predictable protocol).

    Q: how do you ensure errors in the speculative state don’t propagate to the main state? Need to be able to detect this, and could abort speculation in this case.
  • IDEA: Integrated Distributed Energy Awareness for Wireless Sensor Networks. A new “group diet” for wireless sensor networks. Problem of overloaded nodes. Existing solutions are “single node diets”, which are unsatisfactory because nodes have to collaborate. Local efforts cannot go far enough unless there is some cooperation. Aim to improve application fidelity by matching system load to availability. Shift load from overutilized to underutilized nodes, and shift load away from threatened nodes. Like a distributed OS for sensor networks. IDEA evaluates multiple solutions and distributes information to the nodes. Ideal goal is awareness of application constraints.

    Q: Quanto does some cross-node analysis? We’re building on these great ideas.
  • Flicker: Refresh Power Reduction in DRAMs by Critical Data Partitioning. Hardware is over-designed for correctness and reliability. Make it less reliable and tolerate errors in software. Smartphones are a motivation: power consumption is way too high, due to the use of DRAM for responsiveness. Battery drains even when a phone is idle. Goal is to improve power consumption here. If you increase the refresh cycle length, the power consumption drops, but the error rate increases. Currently use 64ms refresh, so could we increase this? Secret sauce is a partitioning into critical and non-critical (e.g. soft-state) data. Map critical data to short-refresh cycle DRAM, and non-critical data to long-cycle DRAM. Requires some hardware changes. Hypothesise that smartphones have a lot of non-critical data. Initial results show 25% drop in power consumption with only 1% loss in reliability.

    Q: [?] Looking at replication and checksumming in other work.
  • BFT for the skeptics. Industry deals with crash failures a lot, so do we need full BFT? We already use checksums, timeouts, sanity checks, etc. to translate faults to crash faults. How often do we get faults that require BFT to handle it? Looked at ZooKeeper and real-world failures. Yahoo!’s crawler uses ZooKeeper extensively. Saw 9 issues, due to misconfiguration (5, BFT wouldn’t help), application bugs (2) and ZooKeeper bugs (2, correlated, BFT wouldn’t help). Could BFT hurt? It has more things to configure, so misconfigurations could become worse. Need to show that BFT really solves a problem before industry will pick it up.

    Q: You showed that correctly implemented BFT couldn’t help with some failures? Failures were correlated, affecting all replicas.
  • Prophecy: Using History for High-Throughput Fault Tolerance. BFT has poor throughput. Need 3f+1 replicas to handle f faulty replicas. Can we improve this for read-mostly, internet workloads? Add a “sketcher” to each replica, which sketches requests and responses. Only one machine sends a full response, the others send sketches. Trades off consistency for performance, which gives delay-once linearizability. Faulty replicas can return slightly stale data. Internet services have unmodified clients and short-lived sessions. Look at performance of PBFT. Can improve by consolidating sketch tables on a trusted sketcher. We already trust middleboxes, so why not trust this too? Performance is much better than PBFT. Work not specific to BFT, and could apply to PAxos, quorums, etc. while getting similar benefits.
  • Securing Hardware Platforms Against Malicious Circuits Through Static Analysis. Make assumptions when building systems. Best way to break a system is to break its assumptions. People assume hardware is correct. What if we can’t make this assumption? Hardware is complex, expensive, static and the base of the system. Do “dead circuit identification”: highlight all potentially malicious circuits automatically. Attacker is motivated to avoid impacting functionality during testing (or else they’d be caught). DCI gets an assertion that says which paths are effectively short circuits. Use these assertions in a new graph algorithm to identify the possibly-malicious, dead circuits. No false negatives, but 30% over-identification. Empirical evidence shows a tight correlation between code coverage and

    Q: is this primarily at design-time on HDL? Yes, this is one of our assumptions.

    Q: what about redundant circuits for fault tolerance? This is used at design time where you can make calls about this.
  • Enhancing Datacenter Network Security and Scalability with Trusted End Host Monitors. Cloud workload is dynamic and hostile. Key selling point is that multiple tenants can share common infrastructure. Need a new approach to security, because exploits are more likely, and the cloud resources can be used to perform exploits themselves. Cloud datacenters can help: they are centrally-controlled so monitoring becomes easier. The software and hardware and homogeneous. Plus a clean-slate approach is possible. Use the hypervisor as a trusted component. Hypervisor can send alarms to central controller when an attack is detected. Built a prototype from Hyper-V and a trusted Intel NIC.

    Q: if you trust the VM, why do you need to trust the NIC? This gives some useful properties, and the NIC could do some filtering this for you.

    Q: HotOS paper on this exact topic?

    Q: [?] By “hypervisor” meant the entire virtualization stack, because we didn’t want to make the hypervisor itself any better.
  • Architectural Attacks and their Mitigation by Binary Transformation. What happens if someone tries to attack you from a VM on the same machine in the cloud. There is cross-talk through shared architectural channels. Example is contention for the CPU data cache. This leaks information about the memory access pattern, which could for example be used to leak AES keys. Have showed that EC2 has similar vulnerabilities: placement vulnerabilities, cloud cartography and cross-VM exfiltration are all possible. Approach is to use dynamic binary rewriting to transform x86 instructions so that the architectural effects are mitigated. Can degrade observation of timing, or inject noise and delays to hide leakage signal. Methodology is to make things secure by default, then come back to improve performance.

    Q: information leakage necessarily arises from statistical multiplexing, and we need statistical multiplexing to get good performance, so how can you address that? Assert that it should be possible.

    Q: how well would existing techniques protect against these attacks? Not aware of techniques that could do this.
  • Execution Synthesis. Say you have a bug in Linux on a remote machine. All you have to work with is a low-detail bug report. Reproducing it is time-consuming. Want a direction finding system from your system to a particular bug. Google Maps doesn’t do this at present…. So your bug report is a stack trace and some register contents. Do VM recording and replay. Don’t expect you to record behaviour that leads to the bug, since then you wouldn’t have the problem in the first place. Don’t care so much about performance, since you don’t run this in production. Find a state in the recording that is close to the bug report, then explore paths iteratively to get closer to the bug. Then you get a sequence of inputs that lead to reproducing the bug. Need a distance function, way of choosing inputs, and good information about what to include in a bug report.

    Q: could you go backwards from a failure state and execute in reverse? We don’t have the entire failure state to begin with.
  • Edge Mashups for Clinical Collaboration. Health industry is going from paper-based to electronic records. Want to empower non-programmers to build applications for real-time collaboriation, but need to respect things like HIPAA for logging and data retention. Example use-cases include expert-assisted surgery (call an expert for advice when complications arise, in real-time), and micro-clinics where nurses see the patients, but doctors write prescriptions remotely. Envision a graphical tool that pulls in photographic and chart data, which is synchronized between all participants. State serialized to XML which can be distributed to all the clients. Could be client/server or peer-to-peer. Need logging for accountability. “Break-glass” access control: anyone gets access but they are held accountable after-the-fact. Need low latency so doctors don’t feel that they are wasting time. Might migrate this to the cloud for scaling.

Kernels

seL4: Formal Verification of an OS Kernel

  • Formally proved the functional correctness of 8700 lines of C. No bugs.
    Want to build high-assurance systems: small kernels which reduce the trusted computing base. Want strong security properties. Kernel has to be correct: if it falls over, so does the whole system.
    seL4 has capabilities.
  • Proof is that specification and code are equivalent. Need a formal semantics for every system call. Use Isabelle as a theorem prover to bridge the gap between spec and code. But what about assumptions (in the code) and expectations (of the spec)?
  • Assume correct: compiler and linker, 600 lines of assembly code, hardware, cache, TLB management and 1200 lines of boot code.
  • Given these assumptions, we get some nice properties: no null dereferences, no buffer overflows, no code injection, no memory leaks, no div-by-zero, no undefined shift, no undefined execution, and no infinite loops or recursion. Does not imply security, lack of bugs from expectation to the physical world, or absence of covert channels.
  • Proof architecture admits proofs of higher-level properties (e.g. access control).
  • Design is written in Haskell, which can be used to generate Isabelle code automatically.
  • System model has three states: user, kernel and idle. Events are syscall, exception, IRQ and VM fault.
  • Call graph is messy! A microkernel takes all of the messiness and packs it into a very small space.
  • Formal methods practitioners (fans of abstraction) versus kernel developers (exterminate OS abstractions). Different view of the world. Haskell prototype unified these two things: OS people got to implement an OS, while the formal methods people got well-defined semantics. The C code is manually-written and hand-optimized, but based on the Haskell prototype.
  • Aim to reduce complexity. Have to deal with virtual memory in the kernel. But we can put drivers outside the kernel. Concurrency is complex, so use an event-based kernel and limit pre-emption to a few well-chosen points in long-running operations. The C code is derived from the functional representation. Need to support a subset of C: everything from the standard, minus goto, switch fall-through, & on stack variables, side-effects in expressions, function pointers and unions.
  • Found 16 bugs during testing, and 460 bugs during verification (roughly equally distributed between the C code, the design and the spec). Took 25 person-years in total: $6 million (compared to $87 million for EAL6).
    One of the largest proofs ever done in a theorem prover: 200kloc handwritten, machine-checked proof. Proved 10kloc of OS code.
  • Q: can you comment on what happens when you have to evolve the code? What effort is required? It depends. An optimization on the code level that doesn’t change the semantics might need a few days to re-prove. A new feature that adds new components could be added (in the paper) doesn’t depend on the rest of the kernel as long as you don’t screw around with existing data structures.
  • Q: does your work solve the stated problem? The assumptions are significant, and you’ve just done a very significant type-check on the code? Is it really possible to solve the originally stated problem? This is just the first step. You can reduce the assumptions with more work. It isn’t the only technique that you should use, so if you deploy it in an Airbus you should also do testing and verification. [Taken offline.]
  • Q: how can you verify something high-level like having the address spaces of two processes being isolated? We do this. You can still use seL4 in a stupid way, but you can use our security model with capabilities and reason about those in the spec. You don’t have to go down to the code.
  • Q: did you see a correlation between the logical errors in the specification and the implementation? Not really. The C bugs were fairly “stupid”: typos, copy-and-paste, etc.
  • Q: wouldn’t it be better to prove temporal properties? Are they expressible? They are expressible. We look at functional correctness, not temporal properties. But you need functional correctness before you can reason about temporal properties.

Helios: Heterogeneous Multiprocessing with Satellite Kernels

  • Systems are getting more complicated, from UP to SMP to CMP to NUMA. This is still homogeneous. But hardware is no longer homogeneous: programmable NICs, GPGPUs, etc. Operating systems ignore this heterogeneity: the other devices have different instruction sets and often no cache coherence. This means that the standard OS abstractions are missing, and programming models are fragmented. Can we bring this back into the operating system?
  • Helios is an OS for distributed systems in the small. Use four techniques to manage heterogeneity, simplify app development and provide a single programming model for heterogeneous systems.
  • Result is that it is possible to offload processes to these heterogeneous devices with no code changes. Also improves performance on NUMA architectures.
  • Satellite kernels. Want to make use of an I/O device, but the driver interface is a poor interface for applications that want to use programmable devices. It becomes hard to perform tasks like debugging, I/O and IPC with these devices. The driver doubles as an OS, within the OS itself. A satellite kernel runs on the device itself: fundamentally a microkernel. Also run separate satellite kernels on each NUMA node. Local IPC and remote IPC for communication between satellite kernels.
  • Applications register as services in a namespace. The namespace connects IPC channels.
  • Application placement is constrained by the use of heterogeneous ISA, an expectation of fast message passing and platform preference. Applications are allowed to specify affinity in their metadata: a hint for where the process should run. Easy for a dev, admin or user to edit affinity. Platform affinity is processed first, and this gurarantees certain performance characteristics. Can also contra-locate, e.g. if you don’t want the interference of an anti-virus program running on the same core. Algorithm attempts to balance simplicity with optimality.
  • Applications are first compiled down to MSIL, and then that is compiled down to the appropriate ISA. Can encapsulate multiple versions of a method for different ISA in the MSIL (e.g. fast vector math).
  • Implemented on Singularity, using an XScale programmable I/O card (2GHz ARM processor with 256MB of DRAM). Just need a timer, an interrupt controller and the ability to handle exceptions to implement a satellite kernel for a new device. No need for an MMU (thanks, Singularity!). GPUs are adding timers (Larrabee). Only supports two platforms and a limited set of applications.
  • Evaluated several applications (network stack, FAT32 FS, mail server, web server, etc.) and how easy it was to run them on satellite kernels. Almost no code had to be changed (only in the TCP test harness). One line of metadata had to be changed in almost every case (zero in the other).
  • Offloaded an entire networking stack to the XScale, and showed that the end-to-end performance of PNG compression-and-serving is improved when offloading to the XScale.
  • Considered an email server built on Singularity, using a NUMA box. Emails per second handling improved by 39%. Turned out that the instruction throughput was much higher due to better cache utilization.
  • Q: when you transfer data between two NUMA domains, couldn’t the IPC fail due to memory allocation failures? Singularity is statically verified, using contracts, so we don’t have to worry about that.
  • Q: is this not just 20-year-old distributed microkernel research rehashed? We pay homage to that in the paper. A simple heuristic is sufficient to decide where to run some process, and you need to have process migration anyway, so why not just use that when you get a problem? Process migration is pretty difficult, in the heterogenous case. Abstraction turned out to be pretty brittle in commodity OSs.
  • Q: is it reasonable to rely on protection from a large runtime? The system isn’t dependent on type-safety.

Surviving Sensor Network Software Faults

  • Sensors have to operate unattended for months or even years. It’s hard to debug failures considering that the input is unknown. There is no debugger.
  • Safe TinyOS introduces memory safety to sensor nodes. But what do you do when you get a safety violation? In the lab, spit out an error message; in the wild, reboot the entire node (losing valuable soft state and application data).
  • Neutron is a new version of TinyOS. Reduces the cost of a violation by 95–99%. It has near-zero CPU overhead during execution. Runs on a 16-bit microcontroller.
  • A TinyOS program is a graph of software components: statically instantiated code and state. Connections are typed by interface and there is minimal state sharing. Now have preemptive multithreading with a non-blocking, single-threaded kernel. Aim is to separate the program into independent units for recovery. Infer the boundaries of these at compile time. The kernel is a single unit.
  • Units can be rebooted independently. A wrinkle involves cancelling system calls, so you need to block if a syscall is still pending. Blocks of allocated memory are tagged with the owning recovery uint, which enables these to be freed on reboot (by walking the heap). Can even reboot the kernel: just cancel all pending system calls (return ERETRY), and just have to maintain thread memory structures. Applications will continue after the kernel reboots.
  • New idea of “precious state”: a group of precious variables will persist across a reboot. Annotate variables in the source code. Some restrictions on precious pointers. Precious variables must be accessed in atomic blocks. Variables are persisted on the stack across reboot: the set of precious state is usually smaller than the worst-case stack size.
  • Evaluated the cost of a kernel violation in Neutron, compared to safe TinyOS. Looked at three libraries, running on a 55-node testbed. Show the effect of a reboot on the CTP workload. Neutron gets close to the non-reboot case. Also look at the effect on time synchronization in FTSP, showing what proportion of the nodes have unsynchronized time. Again, Neutron gets close to the non-reboot case. Looked at fault isolation. CTP and FTSP data persist across reboots.
  • Main cost is in ROM bytes: 1–5kB of added code, roughly constant.
  • Measured cost of a reboot in milliseconds. A kernel safety violation will result in a 10–20ms outage.
  • Much lighterweight than microreboots (and lets you reboot a kernel, not a J2EE application).
  • It’s easy to change the TinyOS toolchain, but changing the programming model isn’t due to the amount of deployed code.
  • Q: how can I reason about a node that has survived a fault (a rebooted node is in a known-good state)? Do you have evidence that this is going to help us? [Showed emails from the questioner.] It is hard to diagnose these faults. Different approaches are possible.
  • Q: what do you think of the alternatives, such as using an MMU, verification or simulation? Well we could add an MMU, but we don’t have it at present. These new developments might make the dependability better. Verification struggles with the huge input space.
  • Q: how do you ensure that the precious state is not corrupted? Using safe TinyOS, so we won’t see memory access violations due to memory safety. Taint in the paper is about inconsistent state, not corruption.
  • Q: what are your criteria for what state should be marked as precious? Don’t have a strong set of guidelines for this, but have done it by inspection so far.
  • Q: what is your fault detection system, and what is its coverage? How do you know when you have a fault? The deputy compiler gets you to annotate code with things like buffer lengths. Can infer faulty behaviour from these. Annotating interfaces tends to be sufficient.
  • Q: this is a very valid approach?

Leave a Reply