SOSP 2007: Day 1

Since day 0 was driving up (through some outstanding scenery) from Portland to Skamania Lodge, I’ll come back to it when the photos are developed. Therefore, on with a little experiment: the live blog! (The page will grow downwards as more presentations are added.)

N.B. If you’re reading this through Facebook or are not a die-hard Computer Scientist, you might want to ignore it!

Web Meets Operating Systems

Protection and Communication Abstractions for Web Browsers in MashupOS

  • Adding inline modules to a single page involves trusting those modules not to interfere with each other or snoop data (e.g. a Google cookie). Want multiple principals in the web browser for client mashups.
  • The “same origin policy” is currently all or nothing: either no cross-domain interactions are allowed (iframe) or external scripts run with the privilege of the enclosing page (included script). With the latter there is a risk of cross-site scripting attacks (the Samy worm hit 1 million MySpace users in 24 hours), which leads to scripts being disallowed and flexibility being lost.
  • Browser should be a multi-principal OS. Balance ease-of-use and security.
  • Protect: memory (heap of script/DOM objects), persistent state (cookies) and remote data access (XMLHttpRequest).
  • What if integrator (Google) is trusted to access provider (e.g. widget developer) content, but provider is not able to access integrator content? New sandbox and openSandbox tags for “unauthorised” content. Invoking code inside sandbox only happens after a setuid call. Put user input in a sandbox and so prevent XSS attacks.

AjaxScope: Remotely Monitoring Client-side Web-App Behavior

  • Windows Live Local client-side codebase: 70kloc (1MB in size), 2855 functions, interacts with >14 backend services, and all this executes on the host, not in the cloud.
  • Web 1.0 did everything on the server and served static content: the developer had control over the machine and could profile or debug it. How does the developer address bugs or performance issues when code is largely running on the client? And of course there are mashups….
  • On-the-fly rewriting to add instrumentation, by a proxy between webapp and the client.
  • Goals: performance optimisation, debugging, testing (code coverage and A/B tests), user interaction feedback (feature discovery etc.).
  • No changes to original webapp or client-side browsers.
  • Experiments tested prototype against 90 websites, including Google Maps, Live Local and 88 other sites.
  • Adaptive instrumentation: don’t log every function with timestamps; instead profile whole script and, if it’s slow, drill down to discover the particular function that’s taking time (then recurse).
  • Distributed instrumentation: monitoring for JS memory leaks by looking for runtime patterns indicative of a leak. Check all object assignments for potential cycles: involves an expensive heap traversal, which wouldn’t scale to all assignments. So give each user a random N% of the assignments to check.
  • Need to address information protection: should it be possible for credit card information to be stored in the logging DB? At present, don’t instrument HTTPS pages, but in future could allow developers to mark state that should not be logged.
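The adaptive-instrumentation idea above can be sketched as a drill-down profiler: time a whole subtree cheaply, and only instrument one level deeper when the total exceeds a threshold. This is a toy model, not AjaxScope itself; the function names and costs are invented.

```python
import time

# Hypothetical call tree: (name, self_cost_seconds, children).
# The costs are made-up numbers for illustration.
TREE = ("page_init", 0.001, [
    ("render_map", 0.002, [
        ("fetch_tiles", 0.050, []),      # the real hotspot
        ("draw_overlay", 0.001, []),
    ]),
    ("wire_events", 0.001, []),
])

def run(node):
    """Simulate executing a function and its callees."""
    name, cost, children = node
    time.sleep(cost)
    for child in children:
        run(child)

def timed(node):
    start = time.perf_counter()
    run(node)
    return time.perf_counter() - start

def drill_down(node, threshold=0.01):
    """Instrument only the current level; recurse into slow subtrees."""
    name, cost, children = node
    total = timed(node)
    if total < threshold or not children:
        return name          # fast enough, or a leaf: report it
    # Time each child separately and descend into the slowest one.
    slowest = max(children, key=timed)
    return drill_down(slowest, threshold)

print(drill_down(TREE))      # identifies fetch_tiles as the hotspot
```

Only the level currently under suspicion pays the instrumentation cost, which is the point of the adaptive scheme.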

Swift: Secure Web Applications via Automatic Partitioning

  • Web development methods lack security reasoning. Swift makes interactive web applications secure and easier to write. Write a single program in a single language, which is automatically split by the compiler. Rich security policies as declarative annotations. Attempts to optimise the partitioning for performance.
  • For example, want to download state (e.g. a secret number in a guess-the-number game) and behaviour (input validation, checking the guess against the secret) to the browser to reduce number of server roundtrips, but (i) want to keep this secret from a malicious user, and (ii) preserve the integrity of any guess sent back to the server.
  • Swift code looks like regular Java code with an event-driven GUI.
  • Security policy on data denoted by labels with named principals (e.g. Alice->Bob == “Alice permits Bob to read”; Alice<-Bob == “Alice permits Bob to write”). e.g. int {server->server; server<-server} secretNumber; int {server->client; server<-server} numberOfTries;
  • Endorsements after, say, a bounds check allow a client-provided value to be trusted.
  • Placement constraints (based on who can read and write), and architectural constraints (e.g. database access must be located on server).
  • Performance optimisation: minimise number of network messages. Construct weighted control flow graph (with weights estimated by interprocedural dataflow analysis), then execute MIN-CUT/MAX-FLOW on it to find the optimal partitioning. (Slightly more subtle, since nodes could be placed on both the client and server: could for example replicate input validation for fast responsiveness at the client (if wrong) but proper integrity at the server.)
  • Server keeps state about expected control flow to prevent client attempting to execute arbitrary server code by falsifying a message. Client cannot corrupt server-local variables because server does not accept (non-endorsed?) client-supplied values to update high-integrity variables.
  • Evaluated by writing a suite of example applications and comparing the number of messages sent by the Swift-generated and hand-optimised versions of the apps.
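The partitioning objective can be illustrated with a toy version of the optimisation: given a weighted control-flow graph and placements pinned by the security labels, choose the remaining placements so as to minimise the weight of cross-host edges. The sketch below brute-forces a tiny invented graph rather than running min-cut/max-flow, but the objective is the same; all node names and weights are made up.

```python
from itertools import product

# Hypothetical statement-level control-flow graph for a guessing game.
# Edge weights estimate how often control crosses between two statements.
nodes = ["get_guess", "validate", "check_guess", "secret_store", "show_result"]
edges = [("get_guess", "validate", 10),   # every keystroke reaches validation
         ("validate", "check_guess", 2),  # only valid guesses go further
         ("check_guess", "secret_store", 10),
         ("check_guess", "show_result", 8)]

# Placements forced by confidentiality/integrity labels and architecture.
pinned = {"get_guess": "client", "show_result": "client",
          "secret_store": "server", "check_guess": "server"}

def cost(placement):
    # Each edge whose endpoints land on different hosts costs one
    # network message per traversal (weight = estimated frequency).
    return sum(w for a, b, w in edges if placement[a] != placement[b])

free = [n for n in nodes if n not in pinned]
best = None
for choice in product(["client", "server"], repeat=len(free)):
    placement = dict(pinned, **dict(zip(free, choice)))
    if best is None or cost(placement) < cost(best):
        best = placement

print(best["validate"], cost(best))   # validation lands on the client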


TxLinux: Managing Transactional Memory in an Operating System

  • Hardware transactional memory is a reality! Sun “Rock” chip supports it, and so should Solaris 10.
  • Locks are hard (deadlock, priority inversion, etc.). Transactional memory in the OS benefits user programs and simplifies programming.
  • Conventional wisdom maps lock_acquire and lock_release to tx_begin and tx_end, respectively. Transactionalising Linux this way took six person-years, mostly due to issues with I/O and idiosyncratic locking.
  • Reject conventional wisdom and insist that locks and transactions must cooperate! Retain legacy code. Flexibility to aid performance: TM is good when contention is rare; locks are good when contention is high.
  • Innovation: the Cooperative Transactional Spinlock. Critical sections choose dynamically between locks and transactions. If a critical section attempts I/O, roll back and use a lock instead. (This took one person-month to implement on Linux.)
  • Implemented as x86 extensions, using the Simics machine simulator. Benchmarks on pmake, bonnie++, MAB, (parallel) configure, and (parallel) find. Only kernel is using transactions, not user code. 2.5% speedup over Linux (16 CPUs); 1% speedup over Linux (32 CPUs).
  • Scheduling-aware transactions to eliminate 100% of priority inversion: contention manager decides in favour of higher priority process.
  • One pathological behaviour (in the Bonnie++ benchmark) was caused by the exponential backoff in transaction retry.
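A rough model of the cooperative spinlock idea, with invented names: run the critical section speculatively first, and if it attempts I/O, retry it under an ordinary lock. (Real hardware TM would roll back partial writes on abort; this toy sidesteps that by raising before any state is modified.)

```python
import threading

class IOAttempted(Exception):
    """Raised when a critical section tries to do I/O inside a transaction."""

class CxSpinlock:
    """Toy model of a cooperative transactional spinlock: transactions
    when contention is rare, a conventional lock when I/O is needed."""
    def __init__(self):
        self._lock = threading.Lock()

    def run(self, critical_section):
        try:
            # Speculative (transactional) attempt: I/O is forbidden.
            return critical_section(io_allowed=False)
        except IOAttempted:
            # Roll back and re-execute pessimistically under the lock.
            with self._lock:
                return critical_section(io_allowed=True)

def update_counter(state):
    def body(io_allowed):
        state["n"] += 1              # pure memory update: fine in a txn
        return state["n"]
    return body

def log_and_update(state):
    def body(io_allowed):
        if not io_allowed:
            raise IOAttempted()      # models hardware aborting the txn
        state["n"] += 1              # re-executed under the lock
        return state["n"]
    return body

lock = CxSpinlock()
state = {"n": 0}
print(lock.run(update_counter(state)))   # 1: committed as a transaction
print(lock.run(log_and_update(state)))   # 2: fell back to the lock
```

The key property is that legacy lock-based critical sections keep working unchanged, while I/O-free ones get transactional concurrency for free.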

MUVI: Automatically Inferring Multi-Variable Access Correlations and Detecting Related Semantic and Concurrency Bugs

  • Variable Access Correlation: programs contain many variables which are not isolated. Correct programs consistently access correlated variables. (e.g. an array and the length of that array, in two variables; different views of the same information (number of packets and bytes received); different aspects of related information (RGBA values in a framebuffer); iterator in a data structure).
  • Need to ensure consistent access (e.g. when updating array, must update length).
  • If x and y are correlated, an access (read or write) to x is correlated with an access to y. So we can detect where a programmer forgets to access a correlated variable (e.g. initialise RGB but forget alpha). Or where two correlated variables are updated concurrently (leaving the possibility of reaching an incorrect state due to a race condition).
  • Statistically infer access correlation based on variable access pattern in source code, assuming that a mature program is mostly correct. Correlation is inferred where variables commonly appear together, and seldom appear separately. Use static code distance between accesses.
  • Frequent itemset mining to determine potentially-correlated variables. Every variable is an item. Every function is an itemset. The accessed variables, tagged with their type, are placed in the itemset. Flow-insensitive, inter-procedural analysis, considering globals and structure-typed variables. IPA means that callee accesses are incorporated into the caller’s itemset. This yields frequent variable sets.
  • A post-processor prunes this set by identifying where items in a sub-itemset appear separately too many times, or are very popular (e.g. stdout and stderr). Then it categorises based on the type of access (read or write).
  • Inconsistent update bug detection by getting all write(x)->access(y) correlations. Concurrency bug detection by looking for common locks surrounding different accesses to a set of variables. Look for where the variables are accessed without holding that lock.
  • Approximately 13 to 19 per cent false positives on correlation inference. Analysis takes on the order of two hours. Bad programming shows up even when it is not a bug.
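The inference step can be sketched as frequent itemset mining over per-function access sets: variables that usually appear together and rarely apart are flagged as correlated, and a function that accesses only one of a correlated pair is reported. Variable names, functions, and thresholds below are all invented for illustration.

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-function variable-access sets (the "itemsets").
functions = {
    "init_image":  {"rgb", "alpha", "width"},
    "copy_image":  {"rgb", "alpha"},
    "scale_image": {"rgb", "alpha", "width"},
    "blit_image":  {"rgb", "alpha"},
    "reset_image": {"rgb"},          # suspicious: touches rgb but not alpha
}

single = Counter()
pair = Counter()
for accessed in functions.values():
    single.update(accessed)
    pair.update(frozenset(p) for p in combinations(sorted(accessed), 2))

MIN_SUPPORT, MIN_CONFIDENCE = 3, 0.75
correlated = set()
for p, together in pair.items():
    # Confidence in both directions: how often does each variable
    # appear *with* its partner, relative to all its appearances?
    if together >= MIN_SUPPORT and all(
            together / single[v] >= MIN_CONFIDENCE for v in p):
        correlated.add(p)

# Report functions that access exactly one variable of a correlated pair.
violations = [(f, tuple(sorted(p)))
              for f, accessed in functions.items()
              for p in correlated
              if len(p & accessed) == 1]
print(violations)   # flags reset_image for forgetting alpha
```

The support/confidence pruning mirrors the paper's observation that correlated variables commonly appear together and seldom apart; the remaining reports still need manual triage, which is where the quoted false-positive rate comes from.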

Byzantine Fault Tolerance

Zyzzyva: Speculative Byzantine Fault Tolerance

  • State of the art in BFT is too complex for system designers. Zyzzyva outperforms existing approaches, gives performance comparable to unreplicated services, and has overhead that approaches lower bounds.
  • State machine replication to tolerate failures. Two phases: agreement and execution. Agree on the request order (three-phase agreement), then execute.
  • In Zyzzyva, replicas execute requests without agreement, so much less overhead. The output commit happens at the client. Only the client needs to know that the system is consistent, not the replicas. So client only commits if it can confirm that the system is consistent (if it can verify that the reply is stable).
  • Verify the stable reply by using the request history. Replicas include request history in the replies. If all of the replies and histories match, then all replicas are in a consistent state, and we are done.
  • What if there are failures? We can commit output if a majority of the replies and histories are available and match. Commit phase: the client sends the histories to the replicas, and if it receives a majority of acks, then it can commit.
  • Same consistency guarantees as traditional BFT.
  • A faulty client cannot block progress or compromise safety.
  • Performance is at least twice as good as existing BFT methods, and about 35% of the unreplicated case (with no failures).
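A toy model of the client-side output commit described above, for f = 1: if all 3f+1 replies and histories match, the client commits on the fast path; if at least 2f+1 match, it falls back to the commit phase. Cryptographic verification and the ack round are omitted; the reply values are invented.

```python
from collections import Counter

F = 1                 # tolerated Byzantine faults
N = 3 * F + 1         # replicas

def client_decide(replies):
    """Each reply is (result, history_digest). The client acts on how
    many of them match: all N -> done; >= 2f+1 -> commit phase."""
    counts = Counter(replies)
    best, matching = counts.most_common(1)[0]
    if matching == N:
        return ("fast_commit", best[0])      # all 3f+1 agree: done
    if matching >= 2 * F + 1:
        # Would send a commit certificate and await 2f+1 acks.
        return ("two_phase_commit", best[0])
    return ("retry", None)                   # too few matches: resend

good = ("balance=42", "h7")
print(client_decide([good, good, good, good]))           # fast path
print(client_decide([good, good, good, ("bad", "hX")]))  # one faulty replica
```

This captures why speculation is cheap in the common case: with no faults the replicas never run an agreement protocol at all, and the client alone decides that the system is consistent.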

Tolerating Byzantine Faults in Database Systems using Commit Barrier Scheduling

  • 50% of errors in DBMSs are “non-crash” errors.
  • This is the first practical BFT database.
  • 3f+1 replicas, protocol globally orders requests, replicas execute in order: no concurrency. We want to be able to extract concurrency in the database context.
  • Introduce a “Shepherd” into the architecture: it is centralised, and faults in it are not handled (but it is small compared to the DB replicas). Many clients -> single shepherd -> many replicas. Need f+1 matching votes from the replicas.
  • Pre-determine which statements conflict to extract concurrency. Difficult to inspect SQL.
  • Commit Barrier Scheduling: run transactions first on the primary, then duplicate the ordering from the primary on the secondaries. Works best if primary is “sufficiently blocking”. Primary sends back result, shepherd responsible for running the same-ordered statements on the secondaries.
  • For correct execution: execute the statements of a transaction in the same order, and all replicas commit transactions in the same order.
  • This assumes a non-faulty primary! Concurrency on both primary and secondaries, but there is an increase in latency.
  • Faulty secondaries are not a problem, because of voting. Faulty primary is a problem, because it could generate an invalid schedule for the whole system. Assuming faulty primaries are rare.
  • Asymptotically, a 17% performance penalty relative to MySQL or a pass-through scheme (measured in transactions per second), but a 10x speedup over the serial case.
  • Masked bugs by using heterogeneous vendors and heterogeneous versions. Found concurrency bugs in MySQL (now patched).
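Commit barrier scheduling can be sketched with a trivial key-value "database" standing in for a real DBMS: the primary fixes the statement order, the shepherd replays that exact order on the secondaries, and an answer is accepted once f+1 replicas agree on it. Everything here (replica class, statements) is invented for illustration.

```python
F = 1  # tolerated faulty replicas: need f+1 matching answers

class Replica:
    """A trivially deterministic 'database': a key-value store where
    execution order fully determines every result."""
    def __init__(self):
        self.data = {}
    def execute(self, stmt):
        op, key, *val = stmt
        if op == "put":
            self.data[key] = val[0]
            return "ok"
        return self.data.get(key)     # "get"

def shepherd(statements, primary, secondaries):
    # The primary fixes the global order (here simply FIFO) and answers.
    order, results = [], []
    for stmt in statements:
        order.append(stmt)
        results.append(primary.execute(stmt))
    # Replay the primary's order verbatim on every secondary.
    sec_results = [[r.execute(s) for s in order] for r in secondaries]
    # Vote: accept the primary's answer once f+1 replicas agree on it.
    final = []
    for i, res in enumerate(results):
        agree = 1 + sum(sr[i] == res for sr in sec_results)
        final.append(res if agree >= F + 1 else "fault")
    return final

primary, secondaries = Replica(), [Replica(), Replica()]
stmts = [("put", "x", 1), ("get", "x")]
print(shepherd(stmts, primary, secondaries))  # ['ok', 1]
```

The sketch shows why a faulty primary is the hard case: it alone chooses the order everyone else replays, whereas a faulty secondary is simply outvoted.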

Low-Overhead Byzantine Fault-Tolerant Storage

  • Block storage protocol that can tolerate arbitrary Byzantine faults with similar performance to systems that can handle crash-faults.
  • Crash fault-tolerant (erasure-code based) storage performs much better than replicated BFT storage (for write bandwidth). So they introduce erasure-coded BFT storage, with a relatively small overhead over the CFT method (within 10% for sufficiently large writes).
  • Write overhead = two rounds plus a cryptographic checksum. Read overhead = cryptographic checksum.

Attested Append-Only Memory: Making Adversaries Stick to their Word

  • BFT protocols enforce linearisability and liveness on a replicated system, with up to 1/3 faulty replicas.
  • Problem of servers equivocating to clients. Does preventing it help? Can it improve on the 1/3 bound? If so, how do we prevent equivocation?
  • Introduce a trusted equivocation guard to prevent equivocation.
  • Attested Append-Only Memory: a set of numbered logs, with sequence #, stored value, and crypto digest of entire log for attestation.
  • Run A2M as a third-party service: in a process, in a virtual machine, in the VMM, or even in a secure coprocessor (cf. IBM’s vTPM implementation). Different levels of isolation and sizes of TCB.
  • Trustworthy systems are built from untrusted components. Looking at what trusted small components can be put together to make better systems.
  • A2M is simple and easily implementable, and prevents equivocation. It has broader implications on structuring trustworthy systems.
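The core data structure can be sketched as a digest-chained log: an attestation for entry n commits the attester to a single value for every earlier entry too, so presenting two different values for the same sequence number is detectable. Real attestations would be signed by the trusted component; signing is omitted in this toy.

```python
import hashlib

class A2MLog:
    """Toy attested append-only memory: a numbered log whose entries
    are chained by a cryptographic digest."""
    def __init__(self):
        self.entries = []               # (seq, value, digest)
        self.digest = b"\x00" * 32      # digest of the empty log
    def append(self, value: bytes):
        seq = len(self.entries)
        # Chain the new entry into the running digest, so the digest
        # commits to the entire history, not just this value.
        self.digest = hashlib.sha256(
            self.digest + seq.to_bytes(8, "big") + value).digest()
        self.entries.append((seq, value, self.digest))
        return (seq, self.digest)       # attestation for this entry

def equivocates(att_a, att_b):
    """Two attestations for the same sequence number with different
    digests prove the attester told two different stories."""
    return att_a[0] == att_b[0] and att_a[1] != att_b[1]

honest = A2MLog()
a1 = honest.append(b"prepare:req42")

liar = A2MLog()
a2 = liar.append(b"prepare:req99")      # different value, same slot

print(equivocates(a1, a2))  # True: the digests differ
```

Because entries can only be appended, a replica cannot later rewrite history and attest to both versions, which is exactly the equivocation the BFT protocols above must otherwise tolerate.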

Work in Progress

  • GIGA+: Scalable Directories for Shared File Systems: FS used as a lightweight database of small files. Want scalability to billion-to-trillion entries, striped across many servers; 100K inserts/sec. Fast, dynamic indexing structures are usually used, but they require synchronisation. GIGA+ is more parallel, with incremental growth over many servers, and high concurrency through minimal synchronisation. Client cache consistency is expensive, so GIGA+ uses stale partition-to-server maps, but without affecting the correctness of operations. Decentralised and concurrent indexing, using increasing prefixes of hashed filenames. Servers can then split partitions independently and in parallel.
  • Sequoia: Prediction Trees to Support Network-Aware Applications: Virtual trees to provide properties of the network to support applications like “What is the server that can provide this file with the greatest bandwidth?” Predicts bandwidth and latency. Tree position yields coordinates in the network, which can be used to estimate distance and the quality of paths in the network. Free hierarchical partitioning of the network.
  • Flexible, Wide-Area Storage for Distributed Systems with WheelFS: Can’t just use existing tools to make a cooperative web-cache (Apache caching proxy plus network file system). Cache can use old copies of data, and always can fall back to the origin. So why not tell the FS about this? Introduce semantic cues to help apps control behaviour in event of failure. Only one line change to make Apache distributed. Semantic cues embedded in the pathname. Useful for PlanetLab, Parallel Grid computations and Distributed make.
  • Reforming Software Delivery using P2P Technology: Analysed managed service department at HP. An edge node spends 1.8 hours daily synchronising its image. Idea is to combine P2P technology and rsync. Large chunk sizes give low per-chunk overhead, but less opportunities to exploit similarity. Pure P2P is not feasible because different clients use different tools, and also want to avoid using customer bandwidth. Related to CDNs. Decentralisation makes security and optimisation hard.
  • Ostra: Leveraging trust to thwart unwanted communication: e.g. spam, or mislabeled videos on YouTube. Usual defenses don’t work because you can’t do Bayesian filtering on video. Ostra uses social-networking trust relationships, which are assumed to be hard to form. If a source sends unwanted communication to a destination, the destination will cut off the source. For multi-hop transmission, the links along the path from source to destination are all penalised. No requirement of strong identities: one can create a network of sybils, or use different identities for friends.
  • Enabling BITE: High-Performance Snapshots in a High-Level Cache: Applications control when snapshots are taken, and can get and put objects that reside in pages. Does CoW to save storage. Provides time travel in a storage system. (BITE = Back In Time Execution.) Approach is virtualised, crash consistent and requires a Write-Ahead Snapshot invariant. High-level: database or file system, because lower level makes it difficult to achieve consistent application-requested snapshots. Therefore they sync from higher levels when a snapshot is requested. Need to make declaration of snapshots more efficient (currently transactional).
  • Kernel Memory Management in Verified Small Kernels: Constructed a small, highly-assured microkernel. Formal assurance that the abstract model translates to the kernel code, which must be rigid. seL4 exports all memory allocation and freeing to the user, with no implicit allocations within the kernel. Something of a capability model. Formally proved spatial partitioning. Haskell prototype and C/C++ version of the kernel. Ongoing work to evaluate and refine the performance.
  • The Case for DDoS Resistant Membership Management in P2P Systems: Malicious node M acts as index and says that a victim node is the source for a file (even though it is not), which, if there are millions of requests, could lead to DDoS. Exploitable mechanisms include: push-based mechanisms and the ability to have multiple logical IDs per physical ID (such as IP address). Achieved an attack magnitude of 700Mbps on a real network. Idea is to have self-validation of membership information. Instead of dumbly sending requests to a victim node, count the number of failures, and, past a threshold, stop hitting it.
  • Adaptive File Transfers for Diverse Environments: Goal to correctly and efficiently transfer files in a wide range of scenarios. Existing tools are scenario-specific (files in place, other files, peers, identical peers). Challenges: resources have varying performance, which changes dynamically; receivers may have different initial state (e.g. software version); and resources should not need to be set up in advance. Novel optimisation framework: first decide which resources are available, then schedule across them “in an intelligent fashion”. Defers disk operations when network is faster than disk.
  • Fine-Grained Isolation for the Apache Web Server: Restructuring legacy apps for the principle of least privilege. The Apache Web Server has a sensitive private key and a complicated parser, which could be compromised, leading to the private key being exposed. 222 heap objects and 389 globals. Manual isolation took 5 weeks. Crowbar is a binary instrumentation tool that tells you what memory a function accesses, what functions access what memory items and where a sensitive-data-generating function propagates.
  • A Social Networking-Based Access Control Scheme for Personal Content: Presently done messily, for example using an email service (but limited bandwidth, file size; plus push-based model is inefficient for content delivery where not all recipients would want the content). Or using a social networking site (but little support for access control on content-specific sites; plus users end up having a bunch of social networks that may get out of sync between sites). Idea is to separate the social network (which we can let people manage), and the sites that serve content. Social relations are captured by digital social attestations, and maintain this in one’s personal address book. Enables social ACLs (e.g. restrict access to family only, or friends only, etc.). Enables “social firewalls” (access only to people whom you know), or a social calendar (provides different views to different people, based on relationship).
  • Improving Virtual Appliances through Virtual Layered File Systems: Problem is existing file systems are not sufficient for management of large numbers of machines. Provisioning takes too long; existing file systems treat machines as fully independent; and security models treat every file in the same way(?). A layer is a self-contained set of files, which can be shared in a read-only manner. e.g. a library or an application can be a layer. Layers are composable into a traditional file system view, along with a private read-write layer to make the file systems independent. Layers can be composed into templates for provisioning. Updating is easy: just replace the layer! Security is improved because it is easier to keep machines up to date, and tell which files have been modified.
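The GIGA+ item above describes indexing directory entries by increasing prefixes of hashed filenames, with partitions splitting independently once full. A minimal single-node sketch of that idea (capacity, names, and prefix length are invented; the real system distributes the partitions across servers):

```python
import hashlib

def hash_bits(name: str, n: int) -> str:
    """First n bits of a hash of the filename, as a '0'/'1' string."""
    h = int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")
    return format(h, "0256b")[:n]

class Directory:
    """Toy GIGA+-style directory: each partition owns a hash prefix;
    a full partition splits on its own by extending its prefix one bit.
    (A child left over-full simply splits on the next insert into it.)"""
    def __init__(self, cap=2):
        self.cap = cap
        self.parts = {"": set()}        # prefix -> entries
    def _find(self, name):
        # Longest prefix of hash(name) that is a live partition.
        bits = hash_bits(name, 64)
        for n in range(len(bits), -1, -1):
            if bits[:n] in self.parts:
                return bits[:n]
    def insert(self, name):
        p = self._find(name)
        self.parts[p].add(name)
        if len(self.parts[p]) > self.cap:      # split independently
            old = self.parts.pop(p)
            self.parts[p + "0"], self.parts[p + "1"] = set(), set()
            for e in old:
                self.parts[hash_bits(e, len(p) + 1)].add(e)

d = Directory()
for i in range(20):
    d.insert(f"file{i}")
print(len(d.parts), sum(len(s) for s in d.parts.values()))  # 20 entries total
```

Because a split only ever refines one partition's own prefix, no global coordination is needed, and clients holding a stale partition map can still be redirected correctly by re-hashing, which is the property the talk emphasised.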
