SOSP 2007: Day 3

Indulge me for a moment. For some reason, the powers that be at SIGOPS like to hold SOSP at remote locations, which in recent years have included Bretton Woods, Banff, and Brighton. So I’ve not been the only one to point out that this year’s location, Skamania Lodge in southern Washington state, bears something of a resemblance to the infamous Overlook Hotel in Stephen King’s The Shining. A little Wikipedia surfing last night led me to discover that the basis for the Overlook is Timberline Lodge, just across the river in Oregon. Imagine my surprise as I watched the local news in our own hotel (which owes more than a little to Hitchcock), and the weather forecaster cut to a shot of the Timberline Lodge, where apparently the snow has just started falling for the season. Ah well, at least they’ve got a security camera up there these days.

Storage

DejaView: A Personal Virtual Computer Recorder

  • The Memex vision (Vannevar Bush): a device to store all of our books, records and communications, and to consult them quickly and flexibly. We need to archive, search, view and manipulate what we have seen. Web/desktop search is inadequate: there is no record of what we have seen but not saved.
  • DejaView: a personal virtual computer recorder (PVCR) that provides complete, transparent, fast recording of the desktop computing experience: TiVo for the desktop. Records the display like a video (playback, browse, fast-forward, rewind), plus text and context to use as an index.
  • Display recording: transparent, efficient and full-fidelity. Uses a virtual display driver that can redirect the display anywhere: to the user (on the screen) and to persistent storage. It presents a standard device interface, intercepts updates at a low level, and logs all display updates.
  • Text/context recording: the naïve approach would be to do OCR on the framebuffer, which is too slow and inaccurate. Instead leverage the accessibility infrastructure (screen reader technology), which is integrated in most standard GUI toolkits. It also provides useful contextual information: name/type of application, window in focus, special properties (e.g. menu text).
  • Execution recording: the goal is to be able to revive execution state at any time, including the entire desktop session, not just a single process, and it needs to be fast enough to save frequently without degrading the user experience. Use a virtual execution environment that decouples the desktop from the underlying OS by interposing on the syscall API. Only the desktop state is saved, not the entire OS, which cuts down on the size of the checkpoint. Saving state and snapshotting the filesystem requires some application downtime, so it must be optimised for interactivity. The checkpoint rate must be limited to make the overhead manageable, and pointless checkpoints (e.g. when the only change is the desktop clock ticking) are skipped; a checkpoint per second is sufficient granularity (see the sketch after this list).
  • On revival, use UnionFS to put a CoW R/W layer on top of the R/O checkpointed file system. Can therefore fork history.
  • Evaluation: based on real implementation - overhead, impact on interactivity and storage requirements; and access to data (latency etc.). Ran a bunch of benchmarks, based on normal workloads, like web browsing, video playback, catting a large file to screen, doing a kernel build, or something in Matlab. Also evaluated based on real desktop usage by grad students.
  • Little overhead for display or execution recording; 100% overhead for text recording; 125% for full recording when web browsing (due to a bug in Firefox’s accessibility). All other use cases have much less overhead. Checkpoint latency: 5ms to 22ms downtime; total checkpoint time from 80 to 200ms. Storage growth is very low for the real usage scenario (lower than a PVR with equivalent display resolution). Browse and search latency no more than 200ms, which is good enough for interactive use. Playback speedup about 200x for real usage. Session revive takes 1.5 to just over 4s.
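A minimal sketch of how I imagine the checkpoint-throttling logic, in Python; the names (including the hypothetical is_trivial() test on a display update) are mine, not the paper's:

    import time

    CHECKPOINT_INTERVAL = 1.0   # at most one checkpoint per second

    class CheckpointPolicy:
        """Decide whether a display update is worth a checkpoint (illustration only)."""

        def __init__(self, snapshot_session):
            # snapshot_session: callable that briefly pauses the desktop session,
            # saves its execution state and snapshots the file system.
            self.snapshot_session = snapshot_session
            self.last_checkpoint = 0.0

        def on_display_update(self, update):
            now = time.time()
            # Rate-limit checkpoints so the overhead stays manageable.
            if now - self.last_checkpoint < CHECKPOINT_INTERVAL:
                return
            # Skip pointless checkpoints, e.g. only the desktop clock has redrawn.
            if update.is_trivial():
                return
            self.snapshot_session()
            self.last_checkpoint = now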

Improving File System Reliability with I/O Shepherding

  • Storage systems are becoming complex, which can lead to complex failures. Want to manage disk and individual block failures. Addressed by checksumming, parity, mirroring, versioning, etc. However, these techniques are insufficient, and poorly understood (unlike performance and consistency). There is no good strategy in commodity FSs, only coarse-grained policies.
  • Reliability treatments are diffuse and inflexible. Different apps require different policies (e.g. desktop workload versus web servers).
  • I/O shepherd: localised, flexible policies for different types of storage. e.g. Mirroring for archival data, checksumming for scientific data, or different levels of protection based on the quality of the drive. These policies can be composed.
  • I/O shepherding layer interposed between the FS and the disk subsystem. Includes one or more policy tables, linking to policy code, policy primitives and policy metadata.
  • Policy table: for specifying policies. There are different block types and volumes, and different levels of reliability and importance, so we have a write policy and a read policy for each block type (like a function pointer), and a policy table for each volume (or mount point?). See the sketch after this list.
  • Policy metadata: need remapping, mirroring and sanity checking. Must be integrated with the FS. Also need I/O Shepherd Maps, to identify e.g. mirror locations.
  • Primitives and Code: want to make reliability management simple. Primitives like “Checksum”, “Parity”, “Sanity Check”, “Allocate Near”, “Allocate Far”. These are then composed into a full policy, in the policy code.
  • Challenge is to keep the new data and metadata consistent in the presence of crashes. Need consistency management.
  • CrookFS: ext3 with shepherding capabilities. 900LOC changes to the OS, and 3500LOC for the shepherding infrastructure. Small overhead. Integrates reliability policy with journalling: execute policies during checkpoint.
  • However, we can’t run e.g. remapping during checkpointing, because the checkpoint might fail, leaving us in an inconsistent state. If we keep the remap table in memory and the remap succeeds, but the system crashes before we can sync the table, the disk is left in an inconsistent state. At present, there is no consistency for any checkpoint recovery that changes state. Consistent reliability with the current journalling system is impossible!
  • Thus we need “chained transactions”: the previous transaction (which did the remapping) can only be released once its chained transaction (which updates the remap table) has committed. This fixes the flaw in journalling, and it is repeatable across crashes as long as the policies are idempotent.
  • Evaluation: [see the paper].
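To make the policy-table idea concrete, here is a rough Python sketch (the primitive and policy names are mine, not CrookFS code): per-block-type read and write policies composed from simple primitives, looked up by the shepherd on every I/O.

    # Illustrative primitives (stubs); the real ones run inside the shepherd layer.
    def checksum_ok(data):
        return True                    # stub: verify a per-block checksum

    def read_mirror(block):
        return block.mirror.read()     # read the far-away replica

    def reliable_read(block):
        """Read policy composed from primitives: verify the checksum, fall back to the mirror."""
        data = block.read()
        return data if checksum_ok(data) else read_mirror(block)

    def mirrored_write(block, data):
        """Write policy: write the primary copy and a mirror allocated far away."""
        block.write(data)
        block.mirror.write(data)

    def plain_write(block, data):
        block.write(data)

    # One policy table per volume: a read and a write policy for each block type,
    # essentially a table of function pointers consulted by the shepherd on every I/O.
    policy_table = {
        "inode": {"read": reliable_read, "write": mirrored_write},   # metadata: mirror it
        "data":  {"read": reliable_read, "write": plain_write},      # user data: checksum only
    }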

Generalized File System Dependencies

  • A new architecture for constructing file systems, using the “generalised dependency abstraction”.
  • e.g. for consistency, the FS must be kept consistent after every write, by e.g. journalling, but durability features must be traded off against performance. A file system must pick one tradeoff, but why can it not be extensible? Because such extensibility is hard to implement, is complicated by caches, and correctness must be maintained.
  • What is a simple, general mechanism for implementing any consistency model? The “patch” abstraction in their implementation, Featherstitch.
  • Featherstitch contributions: the patch and patchgroup abstractions (explicit, FS-agnostic write-before requirements). Replaces Linux’s FS and buffer cache layer, with implementations of ext2 and UFS. Journalling, WAFL and soft updates are implemented just by using patch arrangements.
  • Patches: a disk data change plus any dependencies it has on other disk data changes, along with some undo data. A dependency says which data must be written before which other data. The benefits are to separate the specification of write-before requirements from their enforcement, and to make these relationships explicit. Inspired by soft updates, but doesn’t mandate a particular FS or consistency model.
  • Example of rename: it involves adding a directory entry and removing the old one. So we could write the source directory with the entry removed, then write the new entry. But what if we crash in between? The file is lost.
  • Soft updates: increment the ref count for the file’s inode, then add the new dirent, then remove the old dirent, then decrement the ref count. However, at block granularity this creates a cyclic dependency between block updates. Expressed as patches, there is no cycle: we can do the increment patch, then the other patches, with no cycle (see the sketch after this list).
  • Also showed how this would be done with journalling and WAFL: somewhat more complicated.
  • Patchgroups: arise from application-defined consistency requirements. Applications commonly just make syscalls against the buffer cache (fsync, sync), or depend on the underlying FS’s ordering. Instead, the patch interface is extended to user space via patchgroups, which let applications specify write-before requirements between syscalls. This can make things more efficient by avoiding a bunch of syncs.
  • Patch optimisations: there is a massive dependency graph between patches even when allocating a few blocks for a new file, so patches must have a lightning-fast implementation. The primary overhead is unused undo data (150% overhead on some benchmarks). How do we detect where it is unused? The theorem is that undo data is only necessary when there are block-level cycles, so all dependencies should be specified when a patch is created.
  • Next: the patch data structures take a large amount of memory, so try to merge patches, i.e. when they are on the same block (hard patch merging) and when they overlap. There are several other optimisations in the paper!
  • Evaluation: measure optimisation effectiveness, compare with ext2 and ext3, check consistency, and run a benchmark (UW IMAP).
  • The optimisations are successful in reducing the number of patches, system time and undo data overhead.
  • Comparison with Linux: Featherstitch is between 19 and 26% slower at the PostMark benchmark. It’s faster on other benchmarks, where differences in the block allocation strategy outweigh the overhead.
  • Consistency correctness: under random crashes, is the FS consistent? Soft updates and journalling work as expected.
  • Benchmark: moving 1000 messages from one IMAP folder to another. Patchgroups are much faster than using fsync.
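To pin down the rename example, here is a toy Python version of the patch idea (pure illustration; Featherstitch itself is C kernel code and its real interface differs): each patch is a change to one block plus the set of patches that must reach the disk before it, and write-back only flushes a patch once its dependencies are on disk.

    class Patch:
        """A change to one disk block, plus the patches that must reach the disk first."""
        def __init__(self, block, change, depends_on=()):
            self.block = block
            self.change = change
            self.depends_on = list(depends_on)
            self.written = False

    def flush(patch, disk):
        """Write a patch only after everything it depends on is already on disk."""
        for dep in patch.depends_on:
            if not dep.written:
                flush(dep, disk)
        disk.apply(patch.block, patch.change)
        patch.written = True

    def rename_patches(inode_block, src_dir_block, dst_dir_block):
        # Soft-updates-style ordering for rename: the new name must be durable
        # before the old one disappears, so a crash never loses the file.
        inc_ref = Patch(inode_block,   "nlink += 1")
        add_new = Patch(dst_dir_block, "add new dirent", depends_on=[inc_ref])
        del_old = Patch(src_dir_block, "remove old dirent", depends_on=[add_new])
        dec_ref = Patch(inode_block,   "nlink -= 1", depends_on=[del_old])
        # inc_ref and dec_ref touch the same block; at block granularity this
        # ordering would be a cycle, but between individual patches it is not.
        return [inc_ref, add_new, del_old, dec_ref]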

Operating System Security

Information Flow Control For Standard OS Abstractions

  • Problem: vulnerabilities in websites can lead to exploits. Solution: decentralised information flow control (DIFC). Example: admin wants to secure web app that handles secret and non-secret data. Use a “declassifier” which decides who gets to see what data. User authenticates to the declassifier then passes all requests through it. Even a bug in the webapp cannot leak data, because the declassifier makes the final decision on who can see what. So why is nobody using this, if it’s so good?
  • Too hard to write the declassifier, requires smart people to develop all aspects of the system. Label systems are complex. Programming with DIFC can lead to unexpected behaviour. And it prevents the reuse of existing code, such as commodity operating systems.
  • Unexpected behaviour: unreliable communication between two processes of different security levels; mysterious failures when a secret process elevates the security of another process, and thereby prevents it from continuing with a task such as writing to a non-secret file (so that it can’t leak data from the secret process).
  • Solution: Flume, to solve these DIFC problems (user-level DIFC on Linux with a simple label system, plus glue between the Unix API and labels). Then develop an application and evaluate it. Goal: want to be able to install this on existing machines using apt-get.
  • Uses system call delegation which forwards system calls to a user-level Flume reference monitor.
  • Three process classes: confined (all syscalls go through refmon), Flume-oblivious (with no access to secret data), and unconfined/mediators (a mix of the two).
  • Simple label system: each process gets a secrecy label which summarises the categories of data that a process is assumed to have seen. (e.g. Financial Documents, HR Documents.) Single category is a tag; set of tags is a label. Any process can add any tag to its label, but it cannot remove it without special privileges. A process can create a new tag, and is given the ability to declassify it (and hence remove it from its secrecy label).
  • Communication rule: process p can send to q if p’s label is a subset of q’s label (see the sketch after this list).
  • Endpoints: intermediaries between processes for communication (per file-descriptor?), which can be labelled independently from the processes, and which declassify data. This is only possible if the owning process is able to declassify a tag in the secrecy label of the endpoint (better specified in set notation). A process can have many endpoints. This addresses some of the mysterious failures by eagerly revealing errors (which would violate the endpoint invariant).
  • Example application is the Python-based MoinMoin wiki (100KLOC). 43 instances of the same check to see if the current user is able to access a bit of data, which have led to bugs, and plugins further complicate this. Instead add a 1KLOC declassifier. TCB is server + declassifier. Untrusted code is the Wiki plus any plugins.
  • Apache is Flume-oblivious. Declassifier is a mediator. MoinMoin is confined, i.e. must syscall through the refmon.
  • Results: Flume allows the adoption of existing Unix software: just 1% of MoinMoin had to be changed (no change in Python interpreter or Apache). By using Flume, two ACL bypass bugs were solved automatically. Performance is within a factor of 2.
  • Limitations: bigger TCB than HiStar or Asbestos (Linux stack + 22KLOC refmon). Disk quotas are a covert channel.
  • Q: In the example showing how endpoints are created and secrecy labels may be changed, what if a malicious plugin attempts to perform declassification? This isn’t a problem: if the secrecy were critical to other applications, the plugin wouldn’t be able to declassify the tag (because the tag would have been created by another application), and untrusted software would not have the elevated privilege to declassify tags.
  • Q: Can the described policies be implemented inside SELinux? For a single application, probably yes, but the policies would become complex as more applications were added. Flume simplifies system administration and transfers these decisions to the content provider.
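A tiny sketch of the label system as I understood it (illustrative Python, not Flume’s actual interface): a label is a set of tags, p may send to q only if p’s secrecy label is a subset of q’s, and a tag can only be dropped by a process holding its declassification privilege.

    import itertools

    _tags = itertools.count()

    class Process:
        def __init__(self):
            self.secrecy = set()       # tags covering data this process may have seen
            self.declassify = set()    # tags this process is allowed to remove

        def create_tag(self):
            t = next(_tags)            # the creator gets the declassification privilege
            self.declassify.add(t)
            return t

        def add_tag(self, t):
            self.secrecy.add(t)        # any process may raise its own label

        def remove_tag(self, t):
            if t not in self.declassify:
                raise PermissionError("no privilege to declassify this tag")
            self.secrecy.discard(t)

    def can_send(p, q):
        """Communication rule: p may send to q iff p's secrecy label is a subset of q's."""
        return p.secrecy <= q.secrecy

    # e.g. a declassifier creates an HR tag; a worker only receives HR data once tainted.
    hr, worker = Process(), Process()
    hr_tag = hr.create_tag()
    hr.add_tag(hr_tag)
    assert not can_send(hr, worker)    # worker's label doesn't cover HR data
    worker.add_tag(hr_tag)
    assert can_send(hr, worker)        # now it does, and the taint sticks to the worker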

SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes

  • Kernel rootkits: malware inserted in OS kernels to avoid detection by user-level scanners. Becoming increasingly common in the wild. DMA-based attacks are becoming common. Current (user-space) security tools are insufficient because they assume kernel integrity, and cannot find all attacks (as they are detection-based).
  • Aim: a hypervisor that prevents injected code from executing at kernel privilege. Permit only user-approved code to execute at kernel privilege, based on a user-specified approval policy. Goals: security and ease of porting (not performance).
  • ~1100-LOC hypervisor that operates at a ring below the kernel, but doesn’t attempt to virtualise hardware. Enforces approved code execution in kernel mode, a property which holds over the system’s lifetime.
  • Assumes: the attacker can perform all attacks except hardware attacks against the CPU and memory; can modify memory contents, perform DMA writes and modify system firmware; and can have knowledge of 0-day kernel exploits. Assumes a single CPU with SVM or VT, a kernel that executes in 32-bit mode without self-modifying code, and no vulnerabilities in SecVisor itself (which they are going to formally verify).
  • Require: constrained instruction pointer, which should be within approved code regions whenever CPU is in kernel mode. Approved code regions must be immutable, and so cannot be modified by an attacker.
  • Constrained IP: kernel mode entry should set IP to being within approved regions. IP must remain within approved regions as long as we stay in kernel mode. Each kernel mode exit should set the CPU privilege level to user mode.
  • Kernel Entry: exception/interrupt happens, looks up handler in the interrupt vector, and jumps to that IP in kernel mode. So all entries in the vector must point to approved code.
  • During execution: place write XOR execute protection over kernel memory. But the problem is that the kernel and applications share the same address space, so what if the attacker attempts to jump to user-space code, perhaps via a buffer overrun? Solution: mark all application memory as non-executable (NX) on kernel entry.
  • Intercepting user-to-kernel switch: all CPU entry pointers point to approved code; mark approved code regions as NX during user mode execution; all user-to-kernel switches raise exceptions.
  • Kernel exit: intercept all kernel-to-user switches by using the exception that is raised by trying to execute NX code in user space after the switch back to user mode (see the sketch after this list).
  • Immutable approved code: memory may be written by software executing on CPU or by DMA writes by peripherals. Approved code must be marked read-only. IOMMU protects from DMA writes.
  • Implementation of memory protection: set protection independent of the OS, using shadow (or nested) page tables. A device exclusion vector (DEV) protects approved code from DMA writes.
  • [I don't think I need to include the explanation of shadow page tables.]
  • Shadow PT is used to set memory protections. Approved code regions are set read-only in the SPT, and DEV prevents DMA writes. We also need to prevent aliasing of pages that are in the approved regions to non-approved VAs.
  • Check entry pointers to ensure they contain VAs of approved code. GDT, LDT and IDT must be protected by shadowing these (to protect them from DMA writes).
  • Overhead from intercepting all switches, and the shadow PT synchronisation and kernel PT checks. I/O-intensive workloads will perform poorly.
  • SPECint benchmarks show that SecVisor is about 2 to 5% slower than Xen (except for the gcc test, where it is much slower). Some optimisation has been done in the latest version.
  • Q: Many past kernel exploits change the uid of user processes by overwriting kernel memory; does SecVisor address this? It doesn’t protect the integrity of kernel data structures. This is future work.
  • Q: Performance effect of nested page tables? Shadow overhead goes away, but you still have to check and protect the kernel page tables.
  • Q: Knowing everything you know about the system, how would you attack it as an intruder? Try to attack the TCB, but it is very small, so we don’t expect that this will be a problem.
  • Q: Deal with loadable modules? Paper discusses this: you could have a hash-based approval policy, and SecVisor performs code relocation on behalf of the kernel.
  • Q: How to protect the stable storage off which SecVisor is loaded? Use skinit, along with trusted computing techniques.
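My attempt to condense the kernel entry/exit protection dance into pseudocode (Python used purely as pseudocode; the real mechanism flips permissions in the shadow page tables and the DEV): approved kernel code is never writable and is executable only in kernel mode, while application memory is executable only in user mode, so any unapproved jump or mode switch faults into SecVisor.

    # Per-region page permissions as SecVisor would set them in the shadow page
    # tables (names and structure are illustrative, not SecVisor's code).
    perms = {
        "approved_kernel_code": {"read"},                    # NX while in user mode
        "kernel_data":          {"read", "write"},           # never executable
        "user_memory":          {"read", "write", "exec"},   # executable in user mode
    }

    def enter_kernel_mode(entry_point, approved_entry_points):
        # Every entry pointer (interrupt/exception vectors, syscall entry) must
        # already point into the approved code regions.
        assert entry_point in approved_entry_points
        perms["approved_kernel_code"] = {"read", "exec"}     # W xor X: never writable
        perms["user_memory"]          = {"read", "write"}    # NX: jumping to user code faults

    def exit_kernel_mode():
        # Making approved code NX in user mode means the next user-to-kernel
        # switch raises an exception that SecVisor intercepts.
        perms["approved_kernel_code"] = {"read"}
        perms["user_memory"]          = {"read", "write", "exec"}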

Secure Virtual Architecture: A Safe Execution Environment for Commodity Operating Systems

  • Safe execution environment: such as that provided by Java or C#. Array indexing within bounds, no use of uninitialised variables, type safe operations, no dangling pointers, control flow follows semantics, etc.
  • Commodity OSs do not use these kinds of languages.
  • Using a safe execution environment would provide security, novel design opportunities and the ability to develop new solutions to higher-level security challenges (such as information flow policies, for example).
  • Secure virtual architecture: interpose a compiler-based VM between a commodity OS and hardware. The VM provides a virtual instruction set architecture.
  • Contributions: a compiler-based VM that can host a commodity OS. First to provide security guarantees for a complete commodity OS.
  • Safety-checking compiler: similar to a C compiler which translates C into the virtual ISA (application level). Generates bytecode rather than native code. Bytecode has a type system which enables alias analysis in the VM.
  • VM contains a safety verifier (which ensures that the compiler has done the right thing; may insert runtime checks), native code generator, and OS and memory safety runtime libraries. The compiler is outside the TCB; only the verifier and code generator are inside.
  • Virtual ISA based on LLVM. Adds OS-neutral operations (SVA-OS) that remove difficult-to-analyse assembly code and encapsulate privileged operations. Porting to SVA-OS is like porting to a new architecture.
  • Goals: maximise safety guarantees, minimise reliance on OS correctness, minimise changes to the kernel, and retain original kernel memory allocators.
  • Guarantees: just like Java or C#, but type safety only for a subset of objects (due to the use of C code), and dangling pointers are rendered harmless (rather than never used). These limitations do not compromise the other guarantees.
  • Checks: load/store, bounds, illegal free and indirect call. Transforms: stack-to-heap promotion to deal with dangling pointers, and initialising memory. Need to track object bounds, using object lookups. Naïvely, the safety verifier would add code to register all allocations in the VM metadata, and every use of these objects would be preceded by a lookup to see if the access is within bounds. This can be improved by alias analysis, which groups objects into logical partitions (pools), reducing overhead from between 400 and 1100% to between 10 and 30% (see the sketch after this list).
  • Implementation: ported Linux to the SVA ISA, compiled using LLVM. SVA-OS is a runtime library linked into the kernel. Guarantees safety of everything apart from the kernel memory management code and [blink and you'll miss it, see the paper]. Around 5KLOC had to be changed in Linux.
  • Memory safety overhead is less than 59% (for a web server benchmark). Latency is bad with many small transfers. 80% of exploits caught (the non-caught one was in a library that isn’t currently ported to the VM).
  • Several performance improvements are possible, by changing the code, using smarter checks and smarter static analysis.
  • Q: To what extent does the analysis equate to type safety? Do not guarantee that pointers point to the correct object, but that it is a valid object.
  • Q: Must the number of memory pools be statically allocated? No. What if memory pools are allocated at run-time? Even then, their use is statically identifiable.
  • Q: Is the overhead really small enough to make this approach suitable for use today? Improving the overheads is future work.
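Roughly how I picture the inserted run-time checks (an illustrative Python sketch, not the real compiler output, which operates on LLVM bytecode): allocations are registered in the metadata of their statically inferred pool, and each load/store is preceded by a lookup that confirms the access lands inside a live object. Grouping objects into pools is what keeps each lookup cheap.

    class Pool:
        """One logical partition of objects, as grouped by the alias analysis."""
        def __init__(self):
            self.objects = {}                    # base address -> object size

        def register(self, base, size):
            # The verifier inserts a call like this after each allocation site.
            self.objects[base] = size

        def check_access(self, addr, width):
            # Inserted before each load/store: the access must fall inside a live object.
            for base, size in self.objects.items():
                if base <= addr and addr + width <= base + size:
                    return
            raise MemoryError("out-of-bounds or dangling access at %#x" % addr)

    # Without the alias analysis, every access would have to search one global
    # registry of all objects; per-pool lookups are what bring the overhead down.
    kernel_pool = Pool()
    kernel_pool.register(0x1000, 64)             # a 64-byte object
    kernel_pool.check_access(0x1000 + 60, 4)     # fine: within bounds
    # kernel_pool.check_access(0x1000 + 64, 4)   # would raise: one past the end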

And, with that, I’m off to Portland for an afternoon of shopping in a city where £1 buys $2.03, and there’s no sales tax. See you when I get back!
