OSDI 2010: Day 1

Kernels: Past, Present and Future

An Analysis of Linux Scalability to Many Cores

  • Looking at Amdahl’s Law limitations (serial paths) in the Linux kernel. For example, Exim spends about 70% of time in the kernel on a uniprocessor.
  • Want to know how serious the scaling problems are, and how much work is necessary to fix them (in light of recent efforts to make new scalable OSs).
  • Used a 48-core x86 box running a recent version of Linux. Looked at 7 real applications and analysed the scalability limitations of the Linux kernel, which did exist.
  • 3002 lines of code in 16 patches fixed the scalability limitations. Remaining bottlenecks were in the hardware or applications.
  • Ran applications with an in-memory file system to avoid a disk bottleneck. Found bottlenecks, fixed them and re-ran the application. Stopped when the application or hardware had a non-trivial bottleneck that they couldn’t fix.
  • Apps: Exim, memcached, Apache, PostgreSQL, gmake, Psearchy (inverted index building) and Metis (MapReduce implementation).
  • Exim throughput (messages per second) collapses around 40 cores and kernel time (per message) increases greatly at this point. Large explosion of time in lookup_mnt function (>30% of time). This function returns metadata about a mount point, under a spinlock. Acquiring and releasing the spinlock were using more cycles than the hashtable lookup, because of the ticket-based spinlock implementation. Cache traffic (atomic inc and read of current ticket value in acquire; increment in release) takes 120-420 cycles on a lightly-loaded system. With high load, the delay goes up to 500-4000 cycles.
  • Solution could be to implement scalable locks or message passing (like Barrelfish). But since the mount table is rarely modified, use per-core mount tables to cache this. First lookup in the local core’s table, else do the slow path through the global table. On modification, need to invalidate the per-core tables, but since this is rare, it leads to better scalability. This fix removes the collapse at 40 cores.
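
    A minimal sketch of the per-core caching idea as described (helper names are hypothetical, not the actual Linux patch):

        /* Fast path: consult this core's private table without shared locks. */
        struct vfsmount *lookup_mnt_cached(struct vfsmount *mnt, struct dentry *d)
        {
            struct vfsmount *child = percore_mnt_lookup(mnt, d); /* hypothetical */
            if (child)
                return child;                    /* common case: no shared lock */

            child = lookup_mnt_slow(mnt, d);     /* global table, under spinlock */
            percore_mnt_insert(mnt, d, child);   /* cache for next time */
            return child;
        }

        /* Mount-table modifications are rare: they must invalidate every
         * core's private table before publishing the change. */
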
  • Now see that the same functions execute more slowly on 48 cores than on 32, leading to a flattening out of scalability. The next bottleneck was dput(), which modifies the refcount on a dentry with an atomic decrement-and-test. Modifying the reference count is slow because the atomic operation causes cache-coherence traffic: concurrent cores must retry while a single core holds a hardware cache-line lock on the value.
  • Solution is “sloppy counters”: each core holds a few spare references to each object, and the reference count is kept in per-core counters. The sloppiness arises because the shared count need not reflect the true number of live references: cores hold batches of spare references locally. On an unlink, the kernel gathers the sloppy per-core state into the shared counter and disables the per-core counters. Uses O(N) space in the number of cores, but leads to no cache misses in the common case, and gives a further increase in scalability over the per-core lookup table fix.
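
    Roughly, a sloppy counter might look like the following sketch of the idea as described (not the paper’s code; BATCH and the flush threshold are made up):

        #define BATCH 8  /* spare references grabbed at a time (illustrative) */

        struct sloppy_counter {
            atomic_long_t  shared;   /* references accounted globally */
            long __percpu *spare;    /* references this core already holds */
        };

        void sloppy_get(struct sloppy_counter *c)
        {
            long *s = get_cpu_ptr(c->spare);
            if (*s == 0) {
                atomic_long_add(BATCH, &c->shared); /* rare: touch shared line */
                *s = BATCH;
            }
            (*s)--;                                 /* common case: core-local */
            put_cpu_ptr(c->spare);
        }

        void sloppy_put(struct sloppy_counter *c)
        {
            long *s = get_cpu_ptr(c->spare);
            (*s)++;                                 /* return ref to local pool */
            if (*s > 2 * BATCH) {                   /* cap the sloppiness */
                atomic_long_sub(*s, &c->shared);
                *s = 0;
            }
            put_cpu_ptr(c->spare);
        }

        /* On unlink, the kernel sums the per-core spares back into `shared`
         * and disables the local pools, as described above. */
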
  • In all, 9 areas were considered this way. Some applications were also modified (amounting to 60 lines of code). All well-known parallel programming techniques.
  • Almost all applications get some speedup (except Psearchy and Metis, whose scalability was already quite good).
  • Remaining bottlenecks include hardware queues on the NIC, application-level contention, cache capacity and DRAM throughput.
  • Limitations: only looked at 48 cores and a small set of applications. Some problems still loom: fork, virtual memory, page allocation, file system and concurrent address-space modifications.
  • Q: What was your experience in mainlining these patches? Haven’t tried because working on the talk. Sloppy counters are probably questionable. Are you going to try to submit? We intend to think about it.
  • Q: Do you really believe that hacking at each piece of shared state one at a time is the right way to go? No, but up to 48 cores it seems to work well. What do you really think? I have no idea.
  • Q: How many of your optimizations would work if you ran multiple applications on subsets of the core? Are they general enough for this case? I think so, since we didn’t target particular applications with these changes, but some applications may not benefit.
  • Q: How did you implement sloppy counters given that cache lines are pretty large, packing or wasting memory? We packed multiple per-core counters onto a single cache line. Did you find that managing these counters then led to a lot of overhead? For some counters, yes, such as sockets, which allocate a dentry per socket. The fix was not to allocate a dentry for a socket.
  • Q: A big ball of code wins in the long term because it is easier to make larger changes as well as smaller changes. Did you find that to be the case (that it was easier to modify the Linux kernel than a microkernel-based system)? [Taken offline.]
  • Q: Looking at spinlocks, why did your original results have an elbow in the curve at 40 cores, rather than a gradual degradation? The interconnect on the machine becomes congested. The other effect was an n-squared effect based on the number of waiters.
  • Q: Should we or should we not be designing new OS architectures for multicore systems? Up to 48 cores, it doesn’t seem like Linux has a problem.

Trust and Protection in the Illinois Browser Operating System

  • Today, the web is ubiquitous, and runs in web browsers on many different operating systems.
  • However, these applications are vulnerable to attacks at many layers. There are web-app vulnerabilities, browser vulnerabilities, and OS and library vulnerabilities. The last two are the rarer but also most severe, so we consider these only in this talk.
  • Web browsers, such as Firefox, are monolithic and easy to exploit, since they run in a single address space. “Secure” web browsers are not enough, since they still have a huge TCB (including the TCP stack, X server, file system, drivers, etc.). A microkernel would be better, but one still needs to trust all of the system components.
  • Illinois Browser OS (IBOS) is a new OS design to make a single app more secure, which leads to a reduced TCB.
  • The IBOS kernel makes browser abstractions first-class, i.e. web page instances, cookies, IP connections, etc.
  • Security invariants check key properties without understanding the implementation of the particular components. Challenge is to check security properties using these “invariants”.
  • An IBOS label is a tuple containing a protocol, domain and port. These are inferred and applied by the kernel. Leads to different “web page instances” for different originating domains. Since these may then access several servers (for content, images, ads, etc.), the web page instances can spawn multiple “net processes”. Enforcing this requires a split driver architecture: on sending an Ethernet frame, the IBOS kernel checks the TCP port and IP address against the net process’s label, and hands off the actual transfer to the NIC driver.
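
    A hedged sketch of what that per-frame check might look like (all names here are hypothetical; IBOS’s actual interfaces differ):

        struct ibos_label {
            const char *protocol;    /* e.g. "http" */
            const char *domain;      /* originating domain */
            uint16_t    port;
        };

        /* Called by the IBOS kernel before handing a frame to the NIC driver. */
        bool may_send_frame(const struct eth_frame *f, const struct ibos_label *l)
        {
            /* The kernel knows which IP addresses the labelled domain may
             * contact (e.g. from its own DNS lookups), so it can validate
             * the frame's destination without trusting the net process. */
            return ip_allowed_for_domain(frame_dst_ip(f), l->domain) &&
                   frame_dst_port(f) == l->port;
        }
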
  • A basic frame buffer driver splits the screen into three parts: an address bar at the top, which shows the current page address; toolbars that provide global control of the system; and the content itself. The IBOS kernel has a tab abstraction which multiplexes the display between different web page instances. Input devices are routed to the visible tab, which prevents a keylogger from hijacking another instance’s session.
  • Implemented on L4Ka::Pistachio, using WebKit and Qt. Also provide a “Unix-like” API through Qt on uclibc, which supports a PDF viewer.
  • TCB size is 42 KLOC, compared to 5.6 MLOC for Firefox+Linux+X, and 4.4 MLOC for ChromeOS. Could potentially use formal methods to verify this. Avoided 27 CVE exploits published in 2010, failing to prevent only 1, relating to device drivers. Also avoided 135 (77%) browser vulnerabilities published in Google Chrome’s bugtracker. Scored well on memory exploitation and sandbox-bypassing vulnerabilities.
  • Evaluated page load latency time for Google Maps, Bing, Craigslist, CS@Illinois, Wikipedia and Facebook. IBOS is close to Firefox and Chrome, and sometimes better (though sometimes a little worse).
  • Q: Does IBOS have the same functionality as Chrome and Firefox, especially for JavaScript etc.? Yes: the presentation is actually using HTML5 on IBOS. IBOS uses WebKit, which is pretty standard.
  • Q: What do you think is fundamental about the web model that makes this different from core microkernel work? Web browsers are a very important application, so borrowing ideas from OS design could help to improve them.
  • Q: In terms of user experience, can you have two frames co-existing on the same screen? Currently don’t support windowed display, only tabs.
  • Q: Does the IBOS kernel manage DMA, and is it specific to a particular network device? Currently only support one driver, but should generalise to other drivers that support DMA buffers.
  • Q: Does the design inherently prevent cross-site scripting, or can this be relaxed? Don’t actually prevent these attacks, but with the right policy, we can contain these attacks. The paper details a custom policy that allows sites to communicate with other sites. Is your presentation running on top of IBOS? Yes.
  • Q: Is there any reason why Firefox or IE could not run on top of IBOS? They could if you changed the windowing abstraction.

FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

  • System calls are typically synchronous and exception-based for the call and return. This is inflexible for scheduling, and expensive due to the mode-switch and pollution of processor structures when you multiplex these at a high frequency.
  • FlexSC is a shim that sits above the OS and provides exception-less syscalls, and FlexSC-threads is a threading library that uses FlexSC. Improves throughput in MySQL and Apache without changes to their codebase.
  • Took an application (Xalan from SPEC CPU 2006) that does almost no syscalls, and injected exceptions at various frequencies, emulating a null syscall and a write syscall. Measured only user-mode time, assuming it would be unaltered by the exceptions. However, almost half of the performance is lost; the indirect cost (pollution of processor structures) is the dominant problem.
  • Indirect cost: a Linux write() call can cause up to 2/3 of the L1 data cache and data TLB to be evicted. This also impacts the performance of OS code.
  • Exceptionless syscalls use a syscall page into which calls and results are written; syscall kernel threads do the processing. This leads to fewer mode switches, allows for batching of syscalls, and allows for dynamic multicore specialization, which further reduces the costs. In the ideal case, there is no need to context switch at all. The application writes its syscall registers into an entry on the page, and loops doing other work while the result is not marked as DONE.
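
    In outline, a syscall-page entry and the user-side submission might look like this (field and helper names are illustrative, based on the description above):

        enum sc_status { SC_FREE, SC_SUBMITTED, SC_BUSY, SC_DONE };

        struct sc_entry {
            volatile enum sc_status status;
            int  nr;                  /* syscall number, e.g. __NR_write */
            long args[6];
            long ret;
        };

        /* Illustrative user-side submission (entry allocation elided). */
        long flexsc_write(struct sc_entry *e, int fd, void *buf, size_t count)
        {
            e->nr = __NR_write;            /* write the "syscall registers"  */
            e->args[0] = fd;               /* into a free syscall-page entry */
            e->args[1] = (long)buf;
            e->args[2] = count;
            e->status = SC_SUBMITTED;      /* a kernel syscall thread will
                                              execute it and set SC_DONE     */
            while (e->status != SC_DONE)
                do_other_work();           /* hypothetical: stay busy        */
            return e->ret;
        }
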
  • Syscall threads are kernel-only threads that are part of the application process and run in the same process space. They execute requests from the syscall page and are schedulable on a per-core basis.
  • System call batching allows user space to make as many calls as it can, then jumps into the kernel which executes all of them before switching back to user mode.
  • Dynamic multicore specialization allows the OS to map syscall threads to particular cores, and user-mode threads to other cores. This can adapt to workload needs.
  • The user-mode interface is quite different to synchronous system calls. However, it is very suitable for event-driven servers, which already process asynchronous events. Multi-threaded applications can use FlexSC-Threads, which is compatible with Linux pthreads and requires no changes to the apps or even recompilation.
  • FlexSC-Threads is a hybrid (m-on-n) threading model, with one kernel-visible thread per core. It redirects system calls using libc wrappers to the syscall page, and switches to another runnable thread. When all threads are blocked, the kernel is entered, and we wait for at least one syscall to complete.
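
    Reusing the sc_entry sketch above, FlexSC-Threads’ interposition on a blocking call presumably looks something like this (hypothetical helpers):

        /* libc wrapper: redirect write() to the syscall page and run
         * another user-level thread instead of trapping into the kernel. */
        ssize_t write(int fd, const void *buf, size_t count)
        {
            struct sc_entry *e = post_syscall(__NR_write, fd, (long)buf, count);
            while (e->status != SC_DONE)
                schedule_next_user_thread(); /* only enter the kernel when
                                                every thread is blocked */
            return finish_syscall(e);        /* collect result, free entry */
        }
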
  • Evaluated on Linux 2.6.33 on a quad-core Nehalem, running at 2.3GHz. Sysbench on MySQL and ApacheBench on Apache were the workloads. Compared Linux NPTL (“sync”) versus FlexSC-Threads.
  • OLTP on MySQL on a single core sees a 15% improvement in throughput as the request concurrency increases; on 4 cores, the improvement is bigger. For latency, FlexSC dramatically improves the 95th-percentile latency, and gives a 30% improvement in the average case.
  • Individual performance metrics gathered using hardware performance counters. FlexSC achieves higher instructions-per-clock cycle than sync. Lower cache miss rates in several structures leads to this in both the kernel and user-mode.
  • Apache performance can improve by 80 to 90% on a single core, and 115% on four cores. Latency is halved on average, and again is much better at the 99th percentile. Instructions per cycle are even better for Apache (almost doubling in user mode and more than doubling in kernel mode).
  • Exception-less syscalls can coexist with legacy ones. Doesn’t require a drastically new OS architecture.
  • Q: Did you try to upstream the patches for the kernel portion or plan to? No. Why? A lot of time and effort that I would prefer to use in other ways.
  • Q: Does this reduce security by sharing memory between the kernel and user-mode? The pages are per-process, within the application’s address space. Anything you could do in a process before, you could do now.
  • Q: Do you have an architecture for having outstanding syscalls with dependencies between them? No. We rely on servers that have concurrent, independent work to do. There is related work that tries to do this (composite syscalls from Stony Brook?), but we don’t have support for that.
  • Q: How specific are your observations to the fact that there is a protection boundary, and how much is due to the modularity boundary? A project called “cohort scheduling” made similar observations (jumping around to do different things will cause problems), so some of our observations are related to this. The boundary only adds to the problem because of the direct cost of the exceptions, which we measured earlier.
  • Q: Did you measure to what extent you get cache pollution from doing different work after the syscall? Didn’t measure it, but we did see that the TLB miss rate increased for some cases due to going through multiple user-level threads. So we could do even better if we tried to restructure the application.
  • Q: If you dedicate a core to kernel syscall handling, how do you approach the scheduling problem? Tried many things, but the simplest algorithm ended up performing the best: get a static list of cores, and go to the first non-busy core when a request comes in. Turns out to be very elastic, but we were only running a single application. May need to fine-tune this with multiple applications.
  • Q: Have you noticed or can you imagine a situation where having more cores than threads could cause you performance degradation? Do you have to have more threads than cores? We targeted server workloads. It would be a bad idea to run scientific computing workloads on top of this.
  • Q: What happens to this system when you have lots of unrelated threads demanding service (since you poll the syscall pages to pick up messages)? How does it scale up? [Taken offline.]
  • Q: Do you feel like you’re going to be fully backward-compatible? We ran four or five different applications, and the only problem we had was with get_tid (get thread ID), which required a small patch. Other than that, we didn’t see other backward compatibility problems.

Inside the Data Center, 1

Finding a Needle in Haystack: Facebook’s Photo Storage

  • In April 2009, 1.5PB of photo data (15 billion photos in 4 versions), with a 25TB/week upload rate and 550k images/second serving rate.
  • Currently: 65 billion photos, 20PB of data. 60 TB/week upload and 1 million images/second serving rate.
  • Simple design would be to use NFS, with one file per image. Works for a typical website with a small working set and infrequent access of old content (99% hit rate on the CDN). Facebook has a large working set with frequent access of old content, leading to an 80% CDN hit rate.
  • I/O bandwidth gets wasted doing small metadata I/Os. Large directories required 10 iops per image read, improving to 3 iops per image with smaller directories. Using a file handle cache got to 2.5 iops per image read.
  • Haystack store replaces the photo server and storage layer in the NFS-based design. A typical unit is a 2U node with 12 1TB SATA drives in RAID-6 (as a single volume), with a single 10TB xfs filesystem running across these. A haystack is a log-structured append-only object store; each node hosts about 100 of them, at about 100GB each.
  • File layout has a superblock, followed by “needles” (objects) which are padded to a 64-byte alignment, and contain a key, an alternate key (signifying the image scale/size), the data and a checksum. Magic numbers are used to reconstruct the Haystack in case of corruption (which, in practice, hasn’t happened).
  • The index file has a subset of the needle metadata, and includes the offset of the needle in the log file.
  • The Haystack photo server takes HTTP requests and turns them into Haystack operations. It builds an index of all images in the Haystack in a hashtable, storing 32 bytes per photo (8 bytes per image scale: a 32-bit offset and a 16-bit size; compare 600 bytes per inode). A 5GB index can handle 10TB of photos.
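
    Put together, the on-disk and in-core structures described above might be sketched as follows (field sizes are my reading of the talk, not the exact format; a 32-bit offset in 64-byte units would also explain the 256GB limit mentioned in the Q&A below):

        struct needle {                 /* on-disk, 64-byte aligned */
            uint32_t magic;             /* for recovery after corruption */
            uint64_t key;               /* photo id */
            uint32_t alternate_key;     /* image scale */
            uint8_t  flags;             /* e.g. deleted */
            uint32_t size;
            uint8_t  data[];            /* image bytes, then a checksum,
                                           padded to a 64-byte boundary */
        };

        struct index_entry {            /* in-core, ~8 bytes per scale */
            uint32_t offset;            /* into the haystack file */
            uint16_t size;
        };
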
  • From the URL, determine the image and the size from the in-core index; 1 iop reads the data from disk. If an image exceeded the device’s maximum request size, it could take multiple iops, but this doesn’t happen in practice. Use the alloc_size mount option to grow files by 1GB extents. The only problem is when photos cross a stripe boundary (6 or 7% of photos).
  • Multiwrite for modifications, since users typically upload many images at a time. Asynchronously append images to the file, then do a single fsync(). The index file is asynchronously appended to without a flush (unless a threshold of outstanding dirty records is exceeded), since it can be reconstructed from the log.
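
    A sketch of that write path under the assumptions above (helper names and the image type are hypothetical):

        void multiwrite(int log_fd, int idx_fd, struct image *imgs, int n)
        {
            for (int i = 0; i < n; i++)
                append_needle(log_fd, &imgs[i]);   /* asynchronous appends */
            fsync(log_fd);                         /* one flush for the batch */

            for (int i = 0; i < n; i++)
                append_index_record(idx_fd, &imgs[i]);
            /* No flush for the index (unless too many dirty records are
             * outstanding): it can always be rebuilt from the log. */
        }
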
  • Delete just marks photos as deleted in the incore index and the needle header (synchronously).
  • Compaction generates a copy that skips duplicates and deleted photos. Works online and can accept reads, writes and deletes while this goes on.
  • Haystack directory maintains a logical-to-physical volume mapping. 3 physical haystacks on 3 nodes per logical volume. Writes are balanced across logical volumes, and reads are balanced across physical haystacks.
  • On upload, the web server receives photos and scales them. The directory provides the logical volume, and all three copies are written synchronously (the latency is relatively small compared to the browser-to-web-server latency).
  • On download, web server generates URLs in the web page based on information from the Haystack directory. The Haystack cache may return the image data, or else the cache fetches it from the Haystack store. The image is returned to the CDN which returns it to the browser.
  • Implemented in 8.5KLOC of C++. Took two engineers 4 months from inception to initial deployment.
  • Handling high-res images may require further improvements; software RAID-6 or RAID-10 might improve performance. Considering new drives with on-board flash to store the index. Currently, SSDs are not cost-effective.
  • Q: Did you find prefetching to be useful or are your reads so random? More and more, we are seeing photos accessed by tagging, so prefetching on the album index structure is less useful. Thumbnails are usually cached, so it looks purely random read access to Haystack.
  • Q: Have you considered using social data to figure out what your access stream might look like? The application (implemented in JavaScript) has a much better idea of what will be loaded, so it would make more sense to put the intelligence there than in Haystack. This also keeps Haystack simple.
  • Q: Have you considered migrating needles between Haystacks for load balancing? RAID6 limits us in this respect, and is simply horrible for mixed workloads. RAID6 would not cope with migration.
  • Q: Are high-res images supported for existing images? No, we don’t have the data stored.
  • Q: What do you get out of having Haystack getting as big as 100GB? This cuts down on file system metadata, but they could be smaller. Some Haystacks are even larger, but they run just fine. The only limit is 256GB, which would require 64-bit offsets in the index. Also want to keep the Haystack directory as small as possible, so big Haystacks keep this down. Want to distribute this in future.

Availability in Globally Distributed Storage Systems

  • Globally distributed storage systems are built from commodity servers, arranged into racks, then clusters.
  • 200+ clusters, pools of thousands of clients, 75PB filesystems.
  • Want to look at how available our data is, what the causes of unavailability are, and how we can tune our systems to do better.
  • 95th percentile of unavailability periods is about 15 minutes. Median due to node restarts is about 1 minute. Planned reboots cause longer unavailability periods (closer to 10 minutes median). Unplanned reboots are more severe, albeit not much (a few more minutes).
  • Correlated failures lead to a “failure burst”, which corresponds to many failures happening within 2 minutes of the previous one. 37% of all failures are part of a burst across at least 2 nodes. This has been fairly consistent over the last year. Showed an example of a burst of 50 servers within quick succession (failure of a shared power or network domain). Also see bursts of about size 20 (rolling reboot of machines with planned staggering, or a cascading failure).
  • Small failure bursts are not necessarily rack-correlated, but large failure-bursts are highly rack-correlated.
  • Came up with a scoring function for rack-correlation: the probability that a burst of the same size affecting randomly-chosen nodes in that cell would have a smaller burst score.
  • Data is stored in stripes, made up of chunks with fixed-size data and code blocks. Use erasure coding (such as Reed-Solomon) to reconstruct missing chunks from available ones.
  • Perform rack-aware placement on blocks. MTTF increases when rack-aware placement is used, for all encoding schemes (replication and Reed-Solomon). Different encoding schemes do better for different sizes of bursts.
  • Used a Markov chain to model the number of available chunks, with transitions due to chunks failing and recovering.
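
    A minimal form such a chain could take (my sketch, not necessarily the paper’s exact formulation): let state $s$ be the number of available chunks in a stripe of $n$, with per-chunk failure rate $\lambda$ and chunk recovery rate $\mu$:

        $$ s \xrightarrow{\; s\lambda \;} s-1, \qquad
           s \xrightarrow{\; \mu \;} s+1 \quad (s < n) $$

    For RS(6,3), $n = 9$ and the stripe becomes unavailable once fewer than 6 chunks survive, so the MTTF is the expected time to first reach $s = 5$ starting from $s = 9$.
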
  • Findings on recovery time: using Reed-Solomon (6, 3), a 10% reduction in recovery time leads to a 19% reduction in unavailability, without correlated failures.
  • Correlation matters a lot. Modelling correlated failures gives a substantial reduction in MTTF. However, it reduces the benefit of increased data redundancy.
  • Improving availability below the node layer does not significantly improve data availability (reducing the node failure rate is the most useful thing to do).
  • Multi-cell replication (across data centers) leads to a higher MTTF, but there is a trade-off between limiting the inter-cell recovery bandwidth and higher replication.
  • Q: Could controlling the order of planned upgrades and so on have a big impact on MTTF/availability? Yes. We need to do a better job of teasing apart all the reasons for unavailability (finding exact causes).
  • Q: Is it true that most of the time machines are unavailable due to long-lasting events or lots of short events? Graphs in the paper answer this. There are challenges in trying to model rare events, and we tend to be conservative.
  • Q: Can you talk about the cost effectiveness of different encoding schemes, and are you actually using erasure coding? We are considering using Reed-Solomon. What’s the drawback and why didn’t you start with this? For MapReduce, you might have parallel copies that you want many workers to read from. Also chunk recovery is more involved for Reed-Solomon, and has to touch more servers.
  • Q: If the number of cores per box goes up, what effect would this have on availability (with a single motherboard failure taking out more work)? Also, what about power saving and TCO? Multi-core servers are interesting to us in terms of work done per watt of electricity.

Nectar: Automatic Management of Data and Computation in Datacenters

  • Study shows that 20 to 40% of computation in a data center is redundant (based on production clusters in Microsoft). Over 50% of storage space is occupied by obsolete or seldom-used data sets (based on research cluster in MSR). Thus resource management is a problem.
  • Example application: click-log processing. All computations tend to do parsing and pre-processing, but only really need to do this once. Also, as the click-log grows, there is re-done computation on the previously-existing data.
  • 50% of data in the research cluster had not been accessed in the last 275 days.
  • Programs specified using LINQ, which may include user-defined functions (written in a .NET language). Submitted to DryadLINQ, which uses Dryad to execute it on the cluster. Data is stored in an in-house filesystem, called TidyFS.
  • Nectar interposes on submission to rewrite the LINQ program, performing lookups on a cache server. Dryad is modified to add cache entries based on computations that actually run. The distributed store contains Nectar program and data stores.
  • Mostly, the cache server gets several partial hits on the submitted query. The program rewriter will generate many programs that incorporate these hits into the submitted query, and choose the one that is cheapest to execute.
  • Considered an incremental MapReduce-style computation (Select-GroupBy-Select). Can cache the result of the GroupBy and the result of the Reduce. A new operator called MergeGroup can incorporate the previous GroupBy results with the results of Mapping and Grouping the increment data.
  • The executed program is stored in the Nectar program store, which contains a fingerprint based on the program and the data. The program fingerprint is computed from a static analysis of the program text. Dryad puts the result data in both the user store and the Nectar data store, and puts a mapping in the cache server from the fingerprint to the data store.
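
    In other words, the cache key is (plausibly) a combined hash of the program and its inputs, along these lines (hash64 and hash_combine are hypothetical helpers, not Nectar’s code):

        #include <stdint.h>
        #include <string.h>

        /* Fingerprint = hash of the (statically analysed) program text,
         * combined with the fingerprints of its input datasets. */
        uint64_t nectar_fingerprint(const char *program_text,
                                    const uint64_t *input_fps, int n)
        {
            uint64_t fp = hash64(program_text, strlen(program_text));
            for (int i = 0; i < n; i++)
                fp = hash_combine(fp, input_fps[i]);
            return fp;   /* the cache server maps this to a result URI */
        }
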
  • If the data has been deleted (due to garbage collection or a lack of space), the program store provides a program that can be used to regenerate the file.
  • The cache server uses a SQL database to keep the cache entries. A cache entry has usage stats, lookup stats and a URI to the results. The garbage collector uses these stats to make victim decisions.
  • Cache policies for deciding when to insert and delete entries. Final results are always cached, but sub-queries are cached when a subcomputation is popular enough. Deletion policy computes a “value” for each entry, based on the size of the result (cost), cumulative execution time (saving), the last time it was used and the number of times it was used.
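
    One plausible shape for that value function, purely as an illustration (the paper defines its own formula; the weights here are invented):

        /* Benefit (machine time saved, weighted by use count) per byte
         * stored, decayed by time since last use. */
        double entry_value(double cum_exec_secs, long num_uses,
                           double size_bytes, double secs_since_use)
        {
            return (cum_exec_secs * num_uses) /
                   (size_bytes * (1.0 + secs_since_use));
        }
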
  • Garbage collector uses mark and sweep, based on information in the cache server. However, the cache can be inaccurate (due to concurrent execution, or lack of usage statistics). Leases protect new datasets.
  • Evaluated based on logs from production clusters. 36000 jobs written in a DryadLINQ-like language. Simulated the effect of caching on these jobs. Between 15% and 65% of jobs would benefit from caching, across 25 production clusters. Overall, 25% of jobs benefit.
  • Also performed controlled experiments on a research cluster. Evaluated a common subcomputation (parsing and lexical analysis) shared between three jobs, which leads to about a 98% saving in machine time.
  • Evaluated incremental click log processing as data was added from day to day. Saved as much as 50% execution time when executing on the 5th day, having the 4th day’s results cached. Could do better when combining results from existing computations.
  • Q: Given that storage capacity is increasing so quickly, the precious resources are memory and I/O bandwidth, so do you find that to be true in your system, and how would you evaluate that kind of tradeoff? It is very important not to waste machine time in a data center. Could you model this to find the optimal trade-off between storing and recomputing data (in terms of the garbage collector)? It depends on the storage pressure, so we may need a threshold or ratio to decide how much to store on disk. Some heuristic basically? Yes.
  • Q: How do you know or ensure that the computations really are deterministic? Is it built into the language or is it something that you assume? DryadLINQ assumes that the result is deterministic. If the user uses random numbers, Nectar will always return the same result. It is not a problem with Nectar, rather it is in the DryadLINQ assumptions.
  • Q: Are we just reinventing distributed databases under different names, as a community? We did learn a lot from databases (incremental view management and materialized views). But this is work at much larger scales.
  • Q: Could we combine some of the lineage detection techniques with systems that are a bit more heterogeneous (e.g. transferring data from cluster to cluster)? The big difference is that Nectar maintains the program that generates a derived data set, so we can always regenerate it. However, we assume that we never delete “primary” data sets.

Security Technologies

Intrusion Recovery using Selective Re-Execution

  • Need a combination of proactive security and reactive recovery mechanisms.
  • Status quo for recovery is limited: anti-virus tools and backup tools. The former are limited to predictable attacks; the latter can cause data loss on reverting. Therefore admins have to track down the effects manually. The challenge is to disentangle changes made by the attacker from those made by legitimate users.
  • This work helps users do this disentangling on a single machine. The idea is to roll back affected objects after an intrusion is found, and then re-execute actions that were indirectly affected. This is a new approach to intrusion recovery, based on an action history graph using re-execution, “predicates” and “refinement”. Implemented a prototype on Linux that recovers from 10 real attacks.
  • Example attack scenario: modify /etc/passwd to add a new account, install trojans in pdflatex and ls to restart and hide the botnet. Afterwards, a new legitimate user is added.
  • Strawman idea: taint tracking on all OS-level dependencies in the system. Taint travels through shared files, but we have to be conservative in assumptions about how taint spreads, and so taint spreads too widely.
  • Second strawman: everything executes in a VM, and periodic checkpoints are taken. When an attack input is detected, rollback to the previous checkpoint, and replay inputs except the attack. Re-execution would be very expensive. Also, the original inputs may be meaningless in the context of the new system (due to non-determinism). Deterministic re-execution doesn’t work because the inputs themselves are being changed.
  • Selective re-execution uses an action history graph, which is a more detailed version of the taint-tracking dependency graph. Actions have dependencies, such as written/read data, exec arguments and exit codes. Selective re-execution only reruns some processes (like adduser in the example scenario). However, what about the exit code from adduser?
  • Many suspect computations are not affected: an attacker writing to a file may not taint everyone who reads from different parts of the file.
  • To minimize re-execution, specify predicates which allow Retro to skip equivalent computations. Also use refinement, which only replays fine-grained actions (not entire login sessions).
  • Example predicate: exit code. With the same exit status as before, no need to taint the admin shell after adduser Alice.
  • Example refinement: individual functions. So could do getpwname() instead of a generic read() on /etc/passwd.
  • What if the effect of the attack was externally visible? Cannot solve this in the general case (e.g. recalling spam sent from an attacked host). But could, e.g., email the user a diff of the output from a terminal session.
  • Implemented in about 4000 lines of C and another 200 lines of Python for the repair controller system. Uses btrfs as a checkpointing file system, and ptrace to shepherd the re-execution of processes and skip equivalent syscalls.
  • Evaluated by looking at whether Retro is better than manual repair, and at the cost of using Retro during normal execution.
  • Used 2 real-world attacks from a honeypot, 2 synthetic challenge attacks (including the example scenario and an sshd trojan), and 6 attacks from Taser. Retro can recover from all the attacks; in 6 cases, it required no user input. However, sometimes the user needs to skip the attacker’s login attempt, and two cases required significant help to get beyond skipped network traffic. Even so, Retro can pinpoint the exact objects involved, which is better than the manual status quo.
  • Repair cost is proportional to the extent of the attack. Worst case is the sshd trojan, but the other cases taint far fewer objects. Repair time is proportional to the number of objects, rather than the log size. This is much more efficient than VM re-execution.
  • Evaluated the runtime overhead by running HotCRP from just before the SOSP 2007 deadline. CPU cost was 35%, with a storage overhead of 100GB/day. Also looked at Apache-small-static-files and kernel-recompilation workloads, which had much higher overheads in terms of CPU and storage. However, can still store 2 weeks of logs on a 2TB disk in the worst case.
  • Q: How do you pick the level at which you record and re-execute actions? Did you experience any cases where it was fragile to small changes in the replay? We reintroduced higher-level managers in response to false dependencies that we saw. The interface between the repair controller and the repair managers is well-defined so it would be easy to add new abstractions if necessary.
  • Q: Could you tune the system to record less-precise information which would make it harder to recover, but pay a much lower cost? There are two kinds of trade-offs to make: either the number of user actions to preserve, or assume that some things won’t get compromised. Could also trade-off logging cost for repair time (by re-executing more things).
  • Q: What do you do about non-deterministic programs and negative information flow? Just want some acceptable execution that isn’t compromised, so actually non-determinism may be okay. Could this modify my data on disk? Yes. Can only provide the guarantee of an “acceptable” execution, not the same.
  • Q: What if the attacker could assume that Retro would be there: could he cause a large number of false positives? Well, the sshd trojan is perhaps a case like that, but our hope is that it would be much easier to detect these attacks, but we haven’t experimented with this.
  • Q: Did you evaluate throughput and latency for web workloads? Is there any protection of Retro from attackers? How hard would it be to adapt Retro to a closed-source system like Windows? For Windows machines, you would have to implement many more managers, but we believe the ideas are still applicable. For protecting Retro, we use SELinux to protect the kernel and Retro, but we assume that the kernel is intact, although we could consider a VMM approach. The libc wrappers? Have to be careful with those. [Throughput and latency question dragged offline.]

Static Checking of Dynamically-Varying Security Policies in Database-Backed Applications

  • Authentication is complicated, especially for web apps, and getting it wrong can have major consequences. Typical development involves thinking of some attacks, and auditing for these. A more principled approach involves setting a security (information flow and access control) policy on a resource.
  • One approach is dynamic policy checking: add metadata to objects in the system, and check it before objects interact with the environment. This is easy to add to existing programs and very flexible, but it only finds bugs on tested program paths, and there is a performance overhead.
  • Alternatively, could do static checking, which can run at compile time and involves no changes to runtime behaviour. But this usually requires extensive annotations, and has limited policy expressiveness if the type system is used.
  • UrFlow is a new analysis approach for the Ur/Web programming language. Contains a flexible and programmer-accessible policy language based on SQL. However, it runs statically at compile time and adds no runtime overhead.
  • Ur/Web has integrated parsing and type-checking of SQL and XML (HTML).
  • Can have simple policies based on whether rows exist in a table. Or can reason about what the client “knows”, such as string literals and contents in the source code. Can also have policies that join between multiple tables.
  • Can use first-order logic to reason about what is known at each point in the program. Then reconcile that with the SQL query policy (which may also be expressed in first-order logic). Then use an automated theorem prover to prove that the code meets the policy.
  • Program code is turned into a finite set of execution paths, and symbolic execution is done on those paths. This leads to a series of states, and a series of check predicates.
  • Evaluated using case studies for representative web applications: secrets, poll, user DB, calendar, forum and gradebook. 8 to 134 lines of code in the policies (forum most complicated).
  • Idea of using SQL is to take advantage of code that is already written in web apps, and get a free lunch from the checking. Hope to add these to a more mainstream language.
  • Q: Do you have any intention to analyse how hard it is to write policies? The analysis is never going to change the behaviour of the program, so no policies should block legitimate access to data. What about indirect paths to data? This is a hard problem and I don’t have an answer. The code author’s job is made easier by moving to a declarative paradigm.
  • Q: To what extent can your program give a better answer than just a yes or no from the theorem prover? When the analysis thinks it detects a violation, it displays the instruction that it thinks might be wrong and all of the facts it knows at that point. However, at least this is a static view rather than a dynamic view of the program. There’s probably a heuristic for filtering out irrelevant facts, but I haven’t thought much about the details.
  • Q: How do you manage the accumulation of security policies over time and becoming cumbersome? The language supports ways of encapsulating common patterns, which can help to shrink down the size of policy that you need to read.
  • Q: Is there a theoretical difference to existing security automata [inaudible]? Techniques very similar to software model checking. Can it be applied to type analysis? I haven’t seen any type systems that could reason about this.
  • Q: What about security policies that refer to exogenous facts that don’t exist in the database (like timed event sequences)? You can expand the database to store state fields that allow you to encapsulate the model. This would be worth looking into.

Accountable Virtual Machines

  • Scenario of a multiplayer networked game. In Counterstrike, the amount of ammo is local state, and fire events are broadcast to other players. So a player can modify the game not to decrement that counter. Such cheats exist and are used.
  • Cheating is a serious problem in itself, and since gaming is a multi-billion-dollar industry, we need techniques to stamp it out. It is also a symptom of a more general problem: how do you know that a networked system is running as intended?
  • Want to be able to detect that a remote machine is faulty, and obtain evidence that can convince a third party.
  • Challenges: mutual distrust and lack of access to the source code to understand how it should work.
  • Software runs in an accountable virtual machine (AVM) on an accountable virtual machine monitor (AVMM). Maintain a log of network inputs and outputs, and check this log against a reference image. If the AVM is correct, the same inputs will produce the same outputs; otherwise it is considered to be faulty.
  • Solution involves two pieces. First is tamper-evident logging (based on PeerReview). Log uses a hash chain and messages contain signed authenticators.
  • Second part is execution logging. How does Alice know if the log is from a correct execution of her software image? The AVMM additionally logs all nondeterministic inputs, so that the log fully specifies an execution.
  • Can audit and replay a log. If one player changes the reference image, the logs will diverge, which indicates that the AVM is faulty.
  • AVMs provide strong accountability for arbitrary, unmodified binaries. Don’t need to trust other participants, the AVMM or any software running on the other machine.
  • Evaluation based on the example of cheating in Counterstrike. Used a prototype AVMM based on VMWare Workstation 6.5.1 which has a mature logging/replay engine. Extended this with tamper-evident logging and auditing. Used three players on Nehalem boxes running Windows XP SP3.
  • Can AVMs detect real cheats? If the cheat needs to be installed in the AVM, it can trivially be detected (since the branch counter and instruction pointer will change). Examined 26 real cheats from the internet, and all were detectable.
  • Could cheats be adapted to subvert the AVMM? There are impossible-to-detect cheats (collusion), guaranteed detectable cheats, and technically-difficult-to-evade cheats, which we can’t guarantee are impossible to hide.
  • What is the impact on frame rate? With no fps cap (windowed mode, 800×600, software rendering), the frame rate drops by about 13% from bare hardware to AVMM, but is still around 137 fps. The main culprit is execution logging (about 11% overhead), so accountability only costs another 2%.
  • For a one-hour game, how expensive is auditing? The log is about 8MB per minute, which compresses to 2.47MB. So the log is 148MB. Replay takes approximately one hour.
  • Idea: stream logs to auditors during the game and detect cheats almost immediately. Since the machines are quad-core (one core plays the game and one does the logging), it is possible to do replay on the one or two spare cores. This causes some drop in the frame rate, but it stays above 100 fps.
  • Q: What if the issue is that a machine isn’t cheating but is in fact sabotaged? The AVM provides a full record of what has happened, and could demonstrate that the owner is not at fault.
  • Q: Would it be possible for the malicious party to run another instance in the background to get a plausible-looking log? The two copies would be completely independent, because the authenticators are signed with one machine’s key. The outputs wouldn’t match, so this would be detectable.
  • Q: Do most of the cheats have the form of changing the inputs? What is the equivalence class of cheats that you can and can’t detect? If the cheat causes the network I/O or control flow to be inconsistent, we can detect it.
  • Q: What about concurrency (e.g. who fires first) and delaying traffic in the network to change who fires first? Concurrency wouldn’t affect detectability. But from a distributed systems point of view? We can’t guarantee detection.
  • Q: What if the program that you’re dealing with has non-deterministic output? Randomness is not a problem for us because at the VM level, everything is deterministic.
  • Q: Is it sufficient to log at the AVM level when the application has latitude to process events in a different order? If this is a legitimate thing for the application to do, then we are vulnerable.
  • Q: How does privacy interact with the implementation of your system? Would you end up sharing too much information with your adversary (especially in the case of things that are more important than games)? It’s true that AVMs are verbose in what gets logged. The impact on privacy depends on what you’re doing. Auditing a machine that you’re paying for seems reasonable. It has to be okay for you as the auditor to see all computation that is going on. You can delay the online auditing, perhaps.
  • Q: How can you detect cheats that involve improving the player’s skill (or making all the walls invisible or using an aimbot)? Can’t help information leakage with this technology.

Concurrency Bugs

Bypassing Races in Live Applications with Execution Filters

  • Races and atomicity violations can be a big problem, and hard to find. One solution is to put everything under a big lock. Or can go for a complex fix by decomposing behaviour and adding state variables. But it takes a long time to fix these races, even if the root cause is well understood.
  • Idea of LOOM is to filter out buggy thread interleavings to mitigate bugs while the problem is fixed. Write declarative execution filters to specify the disallowed interleavings.
  • LOOM also works to implement live workarounds, so it is possible to install an execution filter into a running application. It installs execution filters safely and avoids introducing new errors.
  • LOOM incurs little overhead during normal execution.
  • Evaluated LOOM on 9 real races, and it bypasses all evaluated races safely, applies immediately, has little overhead and scales well.
  • LOOM is divided into Static Preparation (the application source is compiled down to LLVM bitcode, where transformations add a mechanism for LOOM update injection and safety maintenance) and Live Update (an execution filter is passed to the LOOM controller, which sends code to the LOOM update engine linked into the application binary).
  • Safety issue: what if the program is running inside a critical section when you update it? Or what if there are order constraints on multiple threads? The LOOM update engine identifies regions of the code where it is dangerous to update if any thread is inside them. It then evacuates the dangerous region, to get threads out, and applies the update once they are all out.
  • To control application threads, conditional breakpoints are added (in the instrumentation phase) on iterative or recursive control flow, i.e. CFG back-edges (there is no point in pausing code that can quickly run to completion). Details of the breakpoint mechanism are in the paper: threads hold an MRSW lock in read mode, and the breakpoint releases and reacquires it, pausing the thread if the update manager has grabbed the write lock in the meantime. Back-edges are identified as safe or dangerous, and the breakpoint only pauses at safe ones.
  • Hybrid instrumentation to reduce the performance overhead. Have two versions of an instrumented function: fast and slow. The slow version includes configurable slots for additional instructions. During the common case (no update), the fast path is always followed; only if the code is updated do the slots get used.
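
    Sketched in code, the two mechanisms might look like this (illustrative, not LOOM’s implementation):

        /* Back-edge breakpoint: drop and reacquire a read lock so threads
         * pause (at safe points only) while the manager holds the write lock. */
        void loom_backedge_check(void)
        {
            rw_read_unlock(&loom_update_lock);
            rw_read_lock(&loom_update_lock);   /* blocks during an update */
        }

        /* Hybrid instrumentation: an empty "slot" guarded by a cheap test. */
        void instrumented_fn(struct args *a)
        {
            if (unlikely(loom_slot_active))    /* almost always false */
                loom_run_slot(a);              /* slow path: installed filter */
            original_fn(a);                    /* fast path: near-zero cost */
        }
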
  • Measured the bare instrumentation overhead on Apache, MySQL and SPLASH benchmarks, and the overhead was between -1.84% and 4.11%. This is much better than Pin, which has from 14.94% to 1065% overhead.
  • Measured scalability with the number of threads. Overhead from 1 to 32 threads (on a 48-core box) is less than 12% in all cases; there is some anomalous speedup at 4 threads.
  • Intend to extend this to memory errors and security errors.
  • Q: Is there a potential for deadlock in your evacuation algorithm? That may be the case, but we didn’t observe this in our evaluation, and we can always “uninstall” the fix if there really is deadlock.
  • Q: Can you collect information at runtime to aid you with fixing the bug at the source code level? Can write execution filters based on line numbers.

Effective Data-Race Detection for the Kernel

  • Looked at a data race in the Windows kernel. Thread A ANDs out a bit, and Thread B ORs a different bit in. These can interleave badly, leading to one of the mutations getting lost, and cause a system hang.
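
    In code, the lost-update interleaving looks like this:

        enum { BIT_A = 0x1, BIT_B = 0x2 };
        unsigned flags;                     /* shared kernel state */

        /* Thread A: flags &= ~BIT_A;      Thread B: flags |= BIT_B;
         * Each is a load/modify/store sequence. One bad interleaving:
         *
         *   A: tmp_a = flags;
         *   B: tmp_b = flags;             // reads the pre-update value
         *   A: flags = tmp_a & ~BIT_A;
         *   B: flags = tmp_b | BIT_B;     // silently restores BIT_A
         *
         * A's update is lost; here, that led to a system hang. */
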
  • Data races are very hard to reproduce since the timings are very tight. And they are hard to debug, since they could be mistaken as a hardware bit-flip. They are also becoming more common, since people are moving away from monolithic locks to fine-grained locking and lock-free approaches.
  • Previous techniques are happens-before and lockset algorithms. The Intel Thread Checker runs at a 200x overhead, due to logging all synchronization accesses and instrumenting all possibly-racy memory accesses; this prevents practical usage in the field. Indeed, false failures arise due to timeouts. These techniques also require complete knowledge and logging of all locking semantics, which may not be feasible in practice (there are diverse ways of implementing locking, and many of them are used in the kernel).
  • Trade-off accuracy for efficiency. Can pick up false and benign data races. False data races are ones that cannot actually occur. Benign data races can and might occur, but are intentionally put there by the developer. Statistics counters are an example of this.
  • Another goal is to give the user control over the overhead (from 0.0x up).
  • Also want actionable data that is useful for the developer.
  • First insight: don’t infer that a data race might happen; instead, actually cause it to happen. Second insight: sample memory accesses for controllable overhead, using code and data breakpoints selected randomly to get uniform coverage. A data breakpoint pauses the first thread when it hits some memory location; if another thread comes along and accesses the same location, flag the current thread states as a possible data race.
  • Memory accesses are sampled by setting code breakpoints, which lead to data breakpoints being set on the relevant locations.
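
    In pseudocode, the detection step might look like this (a hedged sketch; the real tool uses hardware code and data breakpoints, and the helpers below are hypothetical):

        void on_code_breakpoint(void *insn, long *addr)
        {
            long before = *addr;
            set_data_breakpoint(addr);       /* trap other threads' accesses */
            pause_briefly();                 /* hold this thread at the access */
            clear_data_breakpoint(addr);

            if (data_breakpoint_fired() || *addr != before)
                report_race(insn, addr);     /* a second thread really did
                                                touch addr concurrently */
        }
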
  • Results: most dynamic data races are benign (104/113). Many could be heuristically pruned (86/113). Found 25 confirmed bugs in the Windows OS, with 8 more pending investigation.
  • Saw two bits in the same byte, each protected by individual locks, have a data race with bits being set and cleared.
  • The breakpoint approach found data races with as little as 5% overhead.
  • Future work includes prioritizing benign versus non-benign races, and also looking at (false) data sharing and its performance impacts.
  • Q: What are your preliminary thoughts on how to identify bugs as benign automatically? There are a lot of unanswered questions there….
  • Q: How random is your breakpoint setting? The demo was considering the ndis and ntfs drivers, and randomly sampled within those. The breakpoints are set based on a feedback loop to get the rate.
  • Q: How dependent is the technique on having full debug symbols, since the problems may be in third-party device drivers? The dependence is so we know where the memory access instructions are. You can load the module into memory and use some other technique to disassemble them, but we ended up accidentally putting a breakpoint into data.
  • Q: How is the sampling rate related to the expected time until you detect a particular race? We haven’t investigated that, but we would like to verify that we’re not just getting the easy races.

Ad Hoc Synchronization Considered Harmful

  • Synchronization is important to ensure the correctness of concurrent programs. Sometimes synchronization code is easy to identify (pthreads calls), but sometimes it is hard to recognise. Sometimes it even uses goto statements.
  • Ad hoc synchronization harms program reliability. Because it is hard to recognise, the programmer may be unaware of it, which may lead to bugs and performance issues: up to 67% of ad hoc syncs introduced bugs. Program analysis becomes more difficult; it may introduce hard-to-detect deadlocks, cause false positives in data race checkers, and confuse performance profiling. It may even cause problems for compilers and relaxed memory consistency models.
  • Contributions: quantitative evidence of the harmfulness of ad hoc syncs; and a tool (SyncFinder) that automatically identifies and annotates ad hoc syncs (detects deadlocks and bad practices).
  • Examined server, desktop and scientific applications (including Apache, MySQL, Mozilla JavaScript engine, SPLASH). Found lots of ad hoc sync loops in each of them (83 in MySQL). In OpenLDAP, 67% of ad hoc syncs were buggy; in Apache, 22% were buggy.
  • Showed an example of a hard-to-detect deadlock in Mozilla JS, which involves three threads. And an example which does an unnecessary sleep in MySQL, causing performance issues.
  • Ad hoc synchronization leads to benign data races. Leads to false positives in data race detectors (see above).
  • Diverse kinds of ad hoc synchronization: for example, a while loop or a goto loop; a single condition or multiple conditions; spinning on a synchronization variable, or a more complicated dependency structure.
  • To identify ad hoc sync, first need to understand it. Every waiting side has a loop body (a “sync loop”) with at least one exit condition that is associated with at least one exit-condition variable, plus a synchronization variable; the setting side does a write to the synchronization variable. First detect the loop, then extract the exit condition, then detect the exit-dependent variable set, and finally prune the set to find the synchronization variable.
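
    A canonical instance of the pattern, matching the waiting-side/setting-side description above (produce/consume helpers are placeholders):

        #include <sched.h>

        volatile int done = 0;          /* the synchronization variable */

        void waiting_side(void)         /* the "sync loop" */
        {
            while (!done)               /* exit condition: a loop-invariant, */
                sched_yield();          /* thread-shared variable            */
            consume_results();
        }

        void setting_side(void)
        {
            produce_results();
            done = 1;                   /* the remote write that exits the loop */
        }
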
  • To prune the sync loop, need to find variables that are shared with a remote thread, and are loop-invariant.
  • Report annotated code with the line numbers of sync variable reads and writes.
  • SyncFinder identifies 96% of ad hoc syncs in the considered applications, with only 6 false positives (out of thousands of loops).
  • Use cases: a tool to detect bad practices, and an extended race detector in Valgrind.
  • Q: What about the amount of bugs in non-ad hoc synchronization: do you have any data on that? We did some previous work on the reliability of normal synchronization, and we found that the proportion of buggy ad hoc sync is higher.
  • Q: Why is ad hoc synchronization so pervasive? What is the limitation in sync primitives, or is it just performance? We checked the comments surrounding this ad hoc synchronization. People just want a flexible way to do sync. They assume that it is a short-term fix (or that it will not be used in the common case). Would you suggest an additional sync primitive? Conditional wait and signal.
  • Q: Do people do this to avoid a synchronous write on the system bus? Is it always harmful? Your performance assumptions may change in the future on different architectures, and the cost to maintainability is probably higher than the performance win. Are lock-free data structures bad, then? I cannot say that.
  • Q: Is there any chance that these application developers declare their sync variables as volatile? They may do this, but it doesn’t guarantee that that is how they use it. We thought about this for our tool.
  • Q: How do you identify the bugs? We checked the code repository for patches on those locations and identify bugs in Bugzilla and the changelog.
  • Q: How long does SyncFinder take to run on the code base? Could you run it on the Linux kernel? We ran it on the 1 MLOC MySQL, which takes about 2.5 hours. The complexity is data-dependent: OpenLDAP is much smaller but takes almost as long as MySQL.
