Onix: A Distributed Control Platform for Large-scale Production Networks
In traditional networking, a router couples the forwarding plane and the control plane together in one box, which makes it difficult to change one without the other. This slows down new ideas: the TRILL protocol, for example, took 6 years to deploy.
Software-defined networking moves the control plane into a server outside the router that talks to the router using a management protocol, sending down forwarding entries to the router, and receiving information about new flows, etc.
One of the problems with L2 Ethernet networks is ARP broadcast, which doesn't scale. Idea: a centralized unicast ARP server. However, this must not become a single point of failure.
Onix is a software-defined networking platform that implements distributed control. Centralized control logic thought of as a distributed system. Simple API and flexible mechanisms for state distribution, scaling and reliability.
Challenges are generality, scalability, reliability, simplicity and performance.
The API is based on treating the network as a graph. Nodes are physical (or logical) network entities, such as a forwarding engine, which is connected to a port, which is connected to an end host. Nodes have methods defined on them (e.g. “write flow entry” for a forwarding engine). The graph is a “network information base” (NIB).
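The graph API above can be sketched as a toy NIB. Names, classes and methods here are illustrative assumptions, not the real Onix API:

```python
# Toy sketch of an Onix-style NIB: a graph of typed network entities
# with entity-specific methods. Names are illustrative, not the real API.

class Entity:
    def __init__(self, name):
        self.name = name
        self.links = []              # neighbouring entities in the graph

    def connect(self, other):
        self.links.append(other)
        other.links.append(self)

class ForwardingEngine(Entity):
    def __init__(self, name):
        super().__init__(name)
        self.flow_table = []

    def write_flow_entry(self, match, action):
        # In Onix this would be pushed down to the switch (e.g. via OpenFlow).
        self.flow_table.append((match, action))

class Port(Entity):
    pass

class Host(Entity):
    pass

# Tiny topology: forwarding engine -> port -> end host.
sw, p1, h1 = ForwardingEngine("sw1"), Port("sw1/1"), Host("h1")
sw.connect(p1)
p1.connect(h1)
sw.write_flow_entry({"dst": "h1"}, "output:1")
```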
Switching protocols integrated into the NIB, such as OpenFlow.
Need a way to distribute the NIB to the switches and to other control servers. Depending on the application, you might want strong consistency or eventual consistency. Two options: replicated-transactional (SQL) data, and one-hop memory-based DHT (for eventual consistency). For an ARP server: switch topology is hard state, IP-to-MAC mapping are soft state. Applications only interact with the NIB at run-time, after setting up the distribution mechanisms initially.
Integrated with ZooKeeper for distributed coordination: leader election, failover, distributed locking, barriers, etc.
Can trade off scalability and reliability by partitioning the control state in various ways. Partitioning techniques are based on VLANs and MPLS (versus aggregation techniques like routing on IP prefixes or ATM routing areas); both provide similar interfaces to the user. Can vary this by changing the subset of Onix instances to which a switch connects, by giving different computational tasks to different instances, or by partitioning the NIB between instances.
Applications can reduce the fidelity of information before sharing it (based on an ATM trick).
For reliability, need to deal with various kinds of failures: network element/link failures (the application has to deal with these), connectivity infrastructure failure (the network connecting Onix to the networking infrastructure; usually a management backbone is used, but multipathing within that backbone is possible), and Onix failure (application-controlled, but mitigated by distributed coordination facilities like ZooKeeper).
Implemented in 150KLOC of C++, Python and Java, from 15 developers in 3 institutions. In QA testing, and will be appearing in products.
Implemented Ethane (enterprise security solution) on Onix. Multiple Onix instances are necessary to handle the load from flows being set up.
Implemented a distributed virtual switch, which does centralized management of VMM virtual switches. Only need a single instance of Onix with a cold standby for failover.
Implemented a multi-tenant virtualised data center, which provides per-tenant L2 networks in a data center. Provides isolation in addressing and resource usage. Need to partition the NIB and management tasks.
Implemented a scale-out IP router, which does routing control over multiple physical switches.
Q: What is an “application” and how do they multiplex on a single network? What is the implication at the level of routing? Envision only a single application per Onix instance (app is a program that manipulates the NIB graph). Forwarding layer gets programmed using something like OpenFlow (a TCP connection to the controller), which programs a TCAM on the switch. How do you set up the network if there is no routing? Depends on the application. Need configuration support at the beginning.
Q: How do you refresh the ARP cache at the host, don’t you need to do broadcast? We didn’t actually build an ARP server, just an example for the talk. Imagine we could still save a lot of traffic from gratuitous ARPs.
Q: What is your favourite trick or hack that we might be able to reuse from your implementation? And/or what might you do differently? Favourite idea is that the distribution mechanisms are hidden from you in the NIB, and you can switch out the storage layer with your own if you want.
Can the Production Network Be the Testbed?
Hard to evaluate new network services realistically, which makes it hard to decide what should and should not be deployed in production networks. Testbed networks are too small and may not have realistic topologies; success on a testbed may not predict success when deployed on a real network.
Problem with NetFPGA: expensive to deploy and fan-out is small. Software testbeds are slow and may not have realistic topologies. And getting real users to join an experimental testbed is hard.
Solution: network slicing. Divide a production network into logical slices, in each of which a different control plane runs. Users can opt in to a particular slice. Need strong isolation between slices so that actions in one don't affect another.
Control plane makes up forwarding rules and pushes them down into the data plane. Data plane pushes exceptions up to the control plane (e.g. packets that don’t match any rules). Add a slicing layer between the control plane and data plane, and allow multiple control planes to control the same data plane.
Network slice is a collection of sliced switches or routers. The data plane remains unmodified, and the slicing layer performs rewrites on the control planes’ rules to ensure isolation.
Slicing policy specifies resource limits for a slice (e.g. link bandwidth, maximum number of rules, topology and CPU usage).
Now, allow real users to opt-in to slices on a flow-by-flow level. So one slice can handle HTTP, another can handle VoIP, etc.
Implemented on top of OpenFlow for communication between the control and data planes. Control logic is moved to a commodity PC. OpenFlow is just a firmware upgrade for various commodity hardware.
FlowVisor performs message handling (the slicing layer). On getting an exception from the data plane, FlowVisor looks up which slice controls the packet and passes it to the appropriate controller. When the controller sends back a rule, FlowVisor checks that the update is allowed, and writes it to the data plane if so.
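A minimal sketch of that slicing check, with made-up field names and a simplified flowspace containment test (the real FlowVisor rewrites rules and handles 20 OpenFlow message types):

```python
# Sketch of FlowVisor-style rule policing: a slice may only install flow
# rules whose match falls inside its allotted flowspace. Field names and
# the containment test are simplified assumptions.

def matches_within(match, flowspace):
    """True if the match satisfies every field the flowspace constrains."""
    return all(match.get(k) == v for k, v in flowspace.items())

class FlowVisor:
    def __init__(self, slices):
        self.slices = slices       # slice name -> flowspace dict
        self.data_plane = []       # rules actually written to switches

    def route_exception(self, pkt):
        # A packet-in from the data plane goes to the slice whose
        # flowspace covers the packet.
        for name, space in self.slices.items():
            if matches_within(pkt, space):
                return name
        return None

    def handle_flow_mod(self, slice_name, rule_match, action):
        # A controller's rule is only written down if it stays in-slice.
        if matches_within(rule_match, self.slices[slice_name]):
            self.data_plane.append((slice_name, rule_match, action))
            return True
        return False               # rejected: outside the slice

fv = FlowVisor({"http": {"tp_dst": 80}, "voip": {"tp_dst": 5060}})
```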
Implemented for 20 OpenFlow message types as a transparent OpenFlow proxy in 8261 lines of C.
Isolation performance in terms of device CPU usage (other metrics are considered in the paper). Need to prevent CPU exhaustion, using rule-insertion rate limiting. Consider a malicious slice that tries to do DoS on the switch. Isolation reduces this to 40% CPU usage on average (though bursty).
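The rule-insertion rate limiting could plausibly be a token bucket; this is an assumed mechanism for illustration, not necessarily FlowVisor's actual implementation:

```python
# Hedged sketch of rule-insertion rate limiting with a token bucket,
# bounding a slice's control-plane message rate to protect switch CPU.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(rate=10, burst=2)   # 10 rule insertions/sec, burst of 2
```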
Runs on the production network at Stanford, with 15 switches, 35 access points, 25 users and >1 year of use. Host 7 different OpenFlow demos on the same physical network. Also runs in many different research groups.
Not explicitly virtualization, because currently only deal with slices of the physical topology. Looking into virtual nodes in future.
Q: How do you create isolation between slices so that one slice doesn’t do too many flow setups per second? This is handled by isolating switch CPU usage.
Q: What about latency? Add about 0.483 ms of latency.
Q: Don’t you have to have n times as much forwarding memory, which creates problems for realism? Tricks that you can play in terms of caching, but it is true that the number of rules becomes a scarce commodity (second most important after device CPU).
Q: [...] Experimental services use the production slice for their control plane. Worked out an in-network control protocol.
Building Extensible Networks with Rule-Based Forwarding
Goal is to have flexible forwarding in the network. A network becomes more powerful if its forwarding plane is more flexible. For example, we might want forwarding to be aware of middleboxes (for intrusion detection). But also multipath, active networks, multicast, and so on.
However, flexibility is not enough; it must be secure as well. Need policy support to constrain forwarding directives, and every entity that appears in a forwarding directive must be able to refuse it. This is not respected today (e.g. denial of service, since the sender controls forwarding). Or one might want to do IP source routing but give every intermediate ISP the ability to refuse the route.
Idea: packet rule, which carries forwarding directive in the packet itself. All intermediate nodes must validate the rule. Rules encapsulate a proof of policy compliance.
Rules are owned by destinations (but they could be owned by other nodes). They are certified by trusted third parties. An extended DNS returns rules instead of addresses. Routers verify the rule signature and follow rule directives. Packets may also have a return rule for allowing data to be returned to the source.
Distribution and certification is done in the control plane (in the paper). Talk focuses on the data plane.
Assume: an anti-spoofing mechanism (ingress filtering), existence of rule-certifying entities and key distribution, and DNS well-provisioned against DDoS.
Rule is a sequence of actions in if-then-else statement form. Conditions are comparisons based on packet headers and router state. Allowable actions are: modify header, forward packet, drop packet or pass up to the control layer.
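A toy interpreter for such a rule, with a simplified condition/action encoding (assumed for illustration; the paper compiles rules to a binary form):

```python
# Sketch of an RBF-style rule: an if-then-else action list evaluated
# against packet headers. Actions mirror the allowed set described above:
# modify header, forward, drop, or pass to the control layer.

def eval_rule(rule, pkt):
    """rule: list of (condition, actions) pairs; first matching branch wins.
    condition: dict of header field -> required value, or None for 'else'.
    Returns the list of actions taken (header modifications are applied)."""
    taken = []
    for cond, actions in rule:
        if cond is None or all(pkt.get(k) == v for k, v in cond.items()):
            for act in actions:
                if act[0] == 'set':
                    pkt[act[1]] = act[2]
                taken.append(act)
            break
    return taken

# "If the packet has been checked by the IDS, forward to the destination;
# otherwise send it via the IDS middlebox first."
rule = [
    ({'ids_checked': True}, [('forward', 'dst')]),
    (None,                  [('set', 'waypoint', 'ids'), ('forward', 'ids')]),
]
```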
Rule forwarding engine can modify packet attributes, such as adding explicit congestion notification. Arbitrary attributes on a packet may be set. Need to make attribute setting secure, by using strong cryptography or checking the source (to avoid circumventing an IDS middlebox).
Rules can be used to prevent DoS, as they can work like network capabilities. Requires the ability to dynamically grant rules, which is unusual for DNS, so would need to do some proxying.
Many other examples of applications that work in this architecture.
Rules are flexible, policy-compliant and safe (bounded forwarding time, and cannot modify router state or amplify traffic). Invoked functions are controlled by the ISPs and middlebox owners. Resource isolation can be used to prevent attacks.
Evaluated several questions: size overhead of rules, forwarding overhead, performance isolation, load on certifiers and security. Only looking at overhead in the talk.
Rule compiler transforms the rules into a binary form. Rule size is between 60 and 140 bytes. Average across examples is below 85 bytes (13% overhead on an average IP packet). Overhead is 27% if using RSA signatures.
Built a prototype router on top of RouteBricks. Tested on an 8-core Nehalem with 2 dual-port NICs. Compared RBF to plain RouteBricks, and saw little overhead (no overhead for packets over 500 bytes). Can forward 35Gbps on a real traffic distribution.
Signature verification happens only at “trust boundary routers”. Results can be cached (about 19 bytes per rule), and new flows tend to represent 1% of backbone line capacity on average (5% in worst case). Existing hardware can cope with this, and it can be parallelized, so throwing more hardware at it will support higher bandwidth links.
Q: What if as a receiver, I want packets to take a particular path… can I drop those packets earlier in the path? No, can only verify the previous hop. Would need stronger crypto guarantees to detect that. What about stakeholders who aren’t named in rules? In a way, we are like overlay networks. We do not name entities between waypoints. But since they approve packet forwarding, you could see them as overlay nodes.
Q: What is the best-case/worst-case for ping? Latency is negligible in our implementation.
Q: Can you comment on deployability? Could use network slicing. Could be partially deployed at some routers and get many of the benefits. What is the incentive for ISPs? The ability to diversify their offering to customers.
TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones
60 million smartphones were sold in Q2 this year. But privacy on these devices is very poor, especially in terms of the applications that run on them (e.g. a wallpaper application on Android). The challenge is to balance fun and utility with privacy. Idea: look inside applications to watch how they use privacy-sensitive data, and determine how it leaves the phone.
Challenges include resource constraints, allowing apps to use some privacy-sensitive information, context-based privacy and applications sharing information.
Idea: use dynamic taint analysis, with a taint source, taint propagation and a taint sink. Need to trade off between performance and granularity. TaintDroid does taint tracking on Android, with variable tracking throughout Dalvik, handling native method invocation, and extending to storage.
Modified the Dalvik interpreter to store and propagate taint tags on variables. Store tags adjacent to local variables/args on the internal execution stack, for locality. For class fields, store them in instance fields on heap objects. For arrays, store one tag per array to minimize overhead (in particular because many arrays are strings).
Various rules for taint propagation through data-flow.
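The data-flow rules largely amount to OR-ing tag vectors; a minimal sketch (source names and the tag layout are illustrative, not TaintDroid's actual encoding):

```python
# Sketch of TaintDroid-style taint tags: a bit vector of sources,
# propagated through data flow by taking the union of operand tags.

TAINT_LOCATION = 1 << 0
TAINT_IMEI     = 1 << 1
TAINT_MIC      = 1 << 2

class Tainted:
    def __init__(self, value, tag=0):
        self.value, self.tag = value, tag

def binop(op, a, b):
    # Data-flow rule for binary ops: the result carries the union of
    # the operands' taint tags.
    return Tainted(op(a.value, b.value), a.tag | b.tag)

lat = Tainted(37.42, TAINT_LOCATION)   # from a location source
k   = Tainted(2.0)                     # untainted constant
out = binop(lambda x, y: x * y, lat, k)
```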
JNI is used to execute native methods. TaintDroid doesn’t support third-party native code, but uses a combination of heuristics and method profiles to patch up the tracking state on return from native methods.
For IPC and file propagation, taint at the message level and the file level. File taint is stored in an XATTR on the file.
On the CaffeineMark benchmark, 14% overall performance overhead (relatively worse for the "string" workload), 4.4% memory overhead, and 27% IPC overhead. Macrobenchmark overheads are between 3% and 29% (worst for taking pictures: 0.5 seconds).
Sources and sinks are carefully integrated into the existing framework. Need to treat sources differently based on the properties of the information that they provide. IMSI needed particularly careful handling.
Started with the top-50 most popular applications in categories on the Android Marketplace, and winnowed it down to 30 applications that used the internet, location, microphone and camera. 105 connections were flagged, but only 37 were clearly legitimate.
15/30 applications were sharing location with a third-party advertiser, usually in plaintext (e.g. in the HTTP GET URL for AdMob, which also contained a unique ID for the particular phone), sometimes every 30 seconds. No sharing was obvious to the user or covered in the EULA.
7 applications sent the IMEI, and 2 apps sent the phone number, IMSI, or ICC-ID to a remote server. One application's EULA mentioned that the IMEI would be sent; another sent a hash of the IMEI.
Limitations of the approach: only explicit data flows are tracked. On sources: the IMSI contains country and network codes, so tainting the whole IMSI is too conservative.
Source code should be released next Wednesday.
Q: What about timing attacks, and what implicit flows do you think are important? Need to do static analysis to capture all implicit flows, and limitations would lead to false positives.
Q: Have you thought about providing this application as a website to which people could upload applications? Haven’t thought about that specific model, but definite potential there.
Q: How do you stop applications modifying the XATTR to remove taint? XATTRs are not otherwise used in Android, so could use LSMs to protect them. Accessing them requires third-party native libraries, which we don't support (they can't be accessed from Java).
Q: Are you planning to do anything for control dependencies, and how would that impact performance and false positives? We are not attempting to identify privacy violations, just to show that privacy-sensitive information has left the phone. Using the information for control is the next step.
Q: Now that TaintDroid is public, what do you see as its future purpose (since people will now try to get around it)? Going to look at more techniques in future (implicit flows), though we can still catch low-hanging fruit where the leak is unintentional.
Q: [...]. Understanding how information is going to be used is a tough challenge. Want to integrate privacy-preserving technologies.
StarTrack Next Generation: A Scalable Infrastructure for Track-Based Applications
Since many phones can now identify their own location, many apps now use this information. A track is a sequence of (location, timestamp) readings. Applications could use these tracks to e.g. personalise driving directions (collapsing directions that are familiar, such as “Take 101 north”, rather than a bunch of redundant short steps around one’s own house).
Application taxonomy: use current location, use past locations or use tracks. StarTrack deals with the last case (e.g. personalised driving directions, track-based search, ride sharing, discovery, urban sensing).
Two kinds of application operations: insertion, and retrieval/manipulation/comparison. The ST server is a database designed to store geometric data efficiently.
Challenges: handling error-prone tracks, providing a flexible API and efficiently implementing track operations. (Also must provide scalability and fault tolerance.)
Raw tracks are difficult to compare for equality, since there may be noise, and readings may be taken at different times. Canonicalisation enables more efficient retrieval and comparison, by mapping similar tracks onto the same path.
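One plausible (assumed, not StarTrack's actual algorithm) canonicalisation: snap noisy readings to a coarse grid and collapse repeats, so two noisy traversals of the same route map to the same path:

```python
# Minimal canonicalisation sketch: round each (lat, lon) reading to a
# coarse grid cell and drop consecutive duplicates. Noise smaller than a
# cell, and readings taken at different times, then no longer matter.

def canonicalise(track, cell=0.001):
    snapped = [(round(lat / cell) * cell, round(lon / cell) * cell)
               for lat, lon, _ts in track]
    path = [snapped[0]]
    for p in snapped[1:]:
        if p != path[-1]:
            path.append(p)      # keep only transitions between cells
    return path

# Two noisy traversals of the same route, sampled at different times:
a = [(37.4212, -122.0841, 0), (37.4213, -122.0842, 5), (37.4221, -122.0850, 10)]
b = [(37.4214, -122.0841, 100), (37.4222, -122.0850, 130)]
```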
API needs to: pre-filter tracks (based on similarity criteria), manipulate them and fetch them. Track collections are abstract groupings of tracks, which aid programming convenience and implementation efficiency (preventing unnecessary message exchanges and enabling caching).
Manipulation operations include joining collections, sorting them, taking a subset, getting similar ones, getting common segments, and getting individual tracks.
Showed a code fragment for a ride-sharing application in terms of the StarTrack API.
In the DB, StarTrack maintains only non-duplicate tracks per user. The table of unique coordinates is separate. Quadtrees are used for geographic range searches.
To compute track similarity, compute the length of common segments over the length of all segments in both tracks (i.e. Jaccard similarity). Conventional DBs don’t support this efficiently. The track tree data structure is an efficient representation for this, for doing get-similar-tracks and get-common-segments.
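The brute-force version of that similarity measure, which the track tree accelerates (segment representation is a simplifying assumption):

```python
# Jaccard similarity over canonical path segments: length of shared
# segments over total length of segments in both tracks. A path is a
# list of (x, y) points; a segment is an adjacent point pair.

def segments(path):
    return {(path[i], path[i + 1]) for i in range(len(path) - 1)}

def seg_len(seg):
    (x1, y1), (x2, y2) = seg
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

def similarity(p, q):
    a, b = segments(p), segments(q)
    inter = sum(seg_len(s) for s in a & b)
    union = sum(seg_len(s) for s in a | b)
    return inter / union if union else 0.0

p = [(0, 0), (0, 1), (1, 1)]   # shares its first segment with q
q = [(0, 0), (0, 1), (0, 2)]
```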
Evaluated the performance of track trees and two sample applications (driving directions and ride sharing). Used synthetically-generated tracks, 9 servers and 3 DB servers.
Evaluated get-similar-tracks against DB, in-memory filtered, and in-memory brute force approaches. Track tree query takes 1ms for up to 100K tracks in the collection. Other approaches are much more costly. To build a track tree takes up to 2 minutes for 100K tracks, and about 170MB of RAM.
Personalised driving directions uses get-common-segments. With caching, requests take about 50 ms until hitting a wall at about 260 requests per second. Ride sharing can do 30 requests/second at 170 ms response time.
Have a working infrastructure, and looking for users.
Q: Have you tried to pair intent and content with a track? We have the availability to stick metadata on tracks, so different applications could associate different metadata. Expect we could have ACLs on this metadata.
Q: How do you handle errors in the tracks? Canonicalisation handles this, pre-filtering obviously-erroneous points. What technique do you use for this? Rely on the underlying algorithm but don’t use statistical techniques.
Q: Have you thought about how you might do non-intersecting paths in a privacy preserving way (e.g. to avoid an ex)? Thought about it, but it is very hard. Access control might address this, but would like to have a way to aggregate it using differential privacy techniques.
Q: Why does the performance curve hit a wall? Network congestion and packets getting dropped.
The Turtles Project: Design and Implementation of Nested Virtualization
Approach is to multiplex the hardware between different layers of virtualization. Focus on two layers of nesting.
A VMCS is maintained for each layer transition. The bottom hypervisor merges VMCSs so that it can run the nested guest directly on the hardware in root mode.
Need to avoid multiplying the number of vmexits to one per nesting layer. One idea is to make a single exit fast, and another involves cutting down their frequency.
Multi-dimensional paging cuts down the number of exits. The baseline approach uses shadow paging on shadow paging (assuming no EPT). Better is a shadow page table on EPT (the secondary hardware table), which turns three logical translations into two in hardware. Best is to observe that EPT tables rarely change: multi-dimensional paging compresses the two levels of EPT into one (from layer 0 to layer 2).
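The compression idea can be sketched by composing two translations into one, with page tables modelled as plain dicts over page numbers (a deliberate simplification of real nested EPT):

```python
# Two chained translations: L2-physical -> L1-physical via the L1
# hypervisor's EPT, then L1-physical -> L0-physical via the L0
# hypervisor's EPT. Folding them into a single L2 -> L0 table means
# hardware walks one table instead of two.

def compose(ept_12, ept_01):
    """Fold guest EPT (L2->L1) and host EPT (L1->L0) into one L2->L0 map.
    Pages without a complete mapping are omitted (they would fault)."""
    return {l2: ept_01[l1] for l2, l1 in ept_12.items() if l1 in ept_01}

ept_12 = {0: 7, 1: 3}        # L1 hypervisor's EPT for its L2 guest
ept_01 = {7: 42, 3: 9}       # L0 hypervisor's EPT for L1
shadow_ept = compose(ept_12, ept_01)
```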
For I/O virtualization, could do device emulation, or paravirtualized drivers, or (even better) direct driver assignment. This requires an IOMMU.
For nested, need to do multi-level device assignment. Idea is to have the bottom level emulate an IOMMU for the first level, and the second level access the device directly.
Experimental set up involved kernbench, SPECjbb, and netperf. Evaluated different hypervisor choices for the nested performance. Overhead is 9.5% in the guest, and a further 14.5% in the nested guest for kernbench. For SPECjbb, it’s 7.6% in the guest and a further 7.8% in the nested guest. vmread and vmwrite operations were the most expensive, so these were hacked to not cause a trap.
Multi-dimensional paging improves kernbench a lot (3.5x), but SPECjbb and netperf are roughly the same.
Multi-level device assignment is the best performing I/O setup, but it adds an overhead of 3x in CPU cost. Interrupts cause exit multiplication. If we could deliver interrupts directly to the nested VM, the overhead would just be 7%. But can still fill a 1G pipe.
Q: Do you have a feeling for how the overhead will scale as the number of levels increases? Hope that additional levels will add approximately the same overhead.
Q: Are you confident that you could run this under arbitrary unmodified hypervisors? Not advocating rootkits at all.
Q: Would binary translation help to apply changes to e.g. the vmread and vmwrite instructions? Two ways to solve this: either get better hardware support or do binary translation. Not feasible to do binary translation in this case.
mClock: Handling Throughput Variability for Hypervisor IO Scheduling
Done this work because customers need it.
Hypervisor currently multiplexes hardware resources between VMs. Three possible controls: reservation (minimum guarantee), limit (non work-conserving) and proportional share. Supported for CPU and Memory in ESX Server since 2003. Now looking at I/O resource allocation.
Contention for I/O resources can have an arbitrarily bad effect on a VM’s resource allocation. Hard problem for storage (considered in terms of iops) to get reservations or limits, because you need to continuously recompute the allocation in the face of bursty, variable workloads. Also, since this access may be distributed and shared, it is even more complex.
Goal: a single algorithm that supports all three controls, handles variable and unknown capacity, and is simple to implement.
Typical proportional-share scheduling assigns a weight to every application, and tags each request with the application. Then requests are spaced based on their tag and the weight of their application.
Three key ideas in mClock: use real-time tags (instead of virtual time tags) based on wall-clock time. Use separate tags for reservation, shares and limits. Then use dynamic tag selection and synchronization.
Scheduling proceeds by first doing constraint-based scheduling, then weight-based scheduling once reservations are met.
Idle VMs are synchronized when they make a new request.
Need to scale the number of requests based on their size.
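A simplified sketch of the tag assignment and scheduling decision (the real algorithm has more detail, e.g. the idle-VM synchronization and request-size scaling above):

```python
# mClock-style real-time tags, one per control, per request of VM v:
#   R = max(prev_R + 1/reservation, now)   (reservation tag)
#   L = max(prev_L + 1/limit,       now)   (limit tag)
#   P = max(prev_P + 1/weight,      now)   (proportional-share tag)
# Scheduling picks the smallest R tag that is behind real time
# (reservation not yet met); otherwise the smallest P tag among VMs
# still under their limit.

class VM:
    def __init__(self, r, l, w):
        self.r, self.l, self.w = r, l, w
        self.R = self.L = self.P = 0.0

    def tag(self, now):
        self.R = max(self.R + 1.0 / self.r, now)
        self.L = max(self.L + 1.0 / self.l, now)
        self.P = max(self.P + 1.0 / self.w, now)

def pick(vms, now):
    behind = [v for v in vms if v.R <= now]        # reservation not met
    if behind:
        return min(behind, key=lambda v: v.R)
    ok = [v for v in vms if v.L <= now]            # under their limit
    return min(ok, key=lambda v: v.P) if ok else None
```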
Distributed mClock for clustered storage architectures.
Experimented with a cluster of 3 to 6 VMs with a mixture of Windows and Linux hosts. Workloads: iometer and the Filebench OLTP benchmark. Showed an experiment with diverse workloads and allocations, comparing SFQ(D) to mClock; mClock closely enforces the limits and reservations for the relevant VMs.
Another experiment looking at burst handling, when idle VMs get benefit from spare capacity.
Looked at an OLTP workload on both SFQ(D) and mClock. On SFQ(D), the reservation is not respected, whereas mClock manages it.
Distributed mClock experiment, with three servers and three clients, and various reservations.
Future work involves extracting SLAs and deriving scheduling policies from those.
Q: Why does the total number of iops become lower for mClock than SFQ(D) when two VMs are switched on? Some variance is acceptable, due to the varying workloads and access patterns etc. Note that the VMs are not doing exactly the same workloads in each case.
Q: Can you comment on how you make the trade-off between throughput and latency? Focussed on getting a certain amount of throughput. Previous work looked at latency. Overall, to do a good job at latency, need support from the underlying array vendor.
Q: [...]. We group I/O into 32-request batches onto the LUN.
Virtualize Everything but Time
Problem is clock synchronization, which is important for network monitoring, traffic analysis, telecoms, finance, gaming, and distributed scheduling.
Status quo under Xen is based on ntpd, but virtualization amplifies its flaws, and it fails under live VM migration.
Propose a new architecture based on RADclock (a clock synchronization solution): robust, accurate and scalable, with support for seamless migration.
Idea: give each physical host a single clock that never migrates. A stateless clock-read function migrates with each VM. (Concept of a “dependent clock” (not new).)
Context is paravirtualized Xen guests. It would be more complex for fully-virtualized guests, and is not discussed here. Implemented for pvops Linux on Xen 4.0.
Clocks are built on local hardware (oscillators/counters). However, they are imperfect and drift (often due to temperature). The TSC counts CPU cycles and is the highest resolution/lowest latency time source, but it might be unreliable. HPET is reliable but low-resolution and high latency. The Xen clocksource is a hardware/software hybrid that the hypervisor provides, and aims to combine the reliability of HPET with the low latency of TSC: a 1GHz 64-bit counter. Performs well, but HPET is not far behind and a lot simpler.
Raw timestamps and clock timestamps are different, and scaled counters are insufficient since they drift. Network based synchronization is convenient, but suffers from delay variability and path asymmetry. (Typically assume symmetry.)
Two synchronization algorithms under discussion. First is ntpd, which is feedback-based, using the Xen clocksource on Xen. RADclock uses raw timestamps, and can use any raw counter, including HPET and the Xen clocksource.
Experimental setup includes an atomic clock, a GPS receiver and a network connection to a stratum-1 NTP server.
With the client and time server on the same LAN, and a 16-second polling period, can get 60us error. Using a feedback-directed polling period with multiple servers (but no load, no traffic and high quality servers) can lead to millisecond-level errors.
On Xen, ntpd is much worse. Consider a dependent ntpd clock, independent ntpd clocks and migrating independent ntpd clocks. With a dependent ntpd clock, only dom0 runs ntpd and the Xen clocksource is used (no clock sync in the guests). If an independent ntpd clock per guest runs (resource-hungry), works almost as well as non-virtualized but with a bit more noise due to virtualization latency. Adding some load and churn leads to millisecond-sized errors. A migrating independent ntpd clock is worse [...].
RADclock does raw counter reads, not clock reads; the timestamping is independent of the clock algorithm. Its synchronization algorithm is feed-forward, based on raw timestamps and server timestamps.
Feed-forward allows you to define a “difference clock” which can determine time differences on a small timescale that is highly accurate. This is inherently difficult in a feedback-based system.
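A sketch of why feed-forward enables a difference clock: interval measurements never involve the absolute offset, so offset corrections cannot disturb them (the clock model here is deliberately stripped down):

```python
# Feed-forward clock model: the absolute clock is roughly
#   Ca(t) = K + p * counter(t)
# (plus slowly varying corrections), while the difference clock measures
# short intervals as p * (c2 - c1). The difference clock never touches K,
# so stepping or slewing the absolute clock cannot perturb it.

class FeedForwardClock:
    def __init__(self, period, offset):
        self.p = period        # estimated counter period (seconds/tick)
        self.K = offset        # absolute offset, updated by the sync algo

    def absolute(self, counter):
        return self.K + self.p * counter

    def difference(self, c1, c2):
        # Independent of K: offset corrections never disturb intervals.
        return self.p * (c2 - c1)

clk = FeedForwardClock(period=1e-9, offset=1_000_000.0)   # 1 GHz counter
d1 = clk.difference(0, 5_000_000_000)
clk.K += 0.001         # sync algorithm steps the absolute clock by 1 ms
d2 = clk.difference(0, 5_000_000_000)
```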
Now have dom0 run a RADclock daemon that talks to timeserver, and make information available through XenStore.
For migration, only the clock-reading function actually migrates, and no state migrates. However, since we can have asymmetry and it will be different on a different host, this will result in a small clock jump. (No algorithm can do anything about this.)
Migration experiment. The ntpd clock has errors on the order of seconds and convergence time on the order of hours. The RADclock-based solution is much more stable.
Q: What about the protocol for network synchronisation with RADclock: could you keep the NTP protocol to allow incremental deployment? We do use the NTP protocol.
Q: How much is RADclock buying you over NTP when the biggest gain comes from using independent clocks? Could you modify NTP to do dependent clocks well? Problem is that two heads try to control the same state.
Q: How does your clock perform under a workload that might cause temperature variation? High temperature is not a problem: rapidly-changing temperature is the problem.
Deterministic Process Groups in dOS
Many sources of non-determinism in real systems (shared-memory, IPC, pthreads, devices, etc.). This makes programs hard to test, debug and replicate (for fault-tolerance). Multicore only makes this worse.
Solution: OS support for deterministic execution. A “deterministic box” around some processes.
What can be made deterministic? What can be done about remaining non-determinism?
Internal versus external non-determinism. Internal is not fundamental (arises from scheduling artifacts) and can be eliminated. By contrast, external comes from interactions with the real world, and cannot be eliminated. External includes users and the network. Internal includes channels used between processes in the deterministic process group (shared memory, pipes, private files). Processes outside the group are external sources of nondeterminism.
External nondeterminism is controlled with a shim program that controls external inputs.
Alternatively, could have made a deterministic VMM and put the whole thing in the box, but this is inflexible and costly.
dOS is a modified version of Linux.
Motivating examples: parallel computation and a web server. Consider a parallel (scientific?) computation that operates from local input files. This should execute deterministically even on a multiprocessor. A web server has a shim that does deterministic record/replay, but since internal nondeterminism is eliminated, the recording is much less costly. This makes it easy to replicate multithreaded servers. Could move web server request processing into a DPG and have all the I/O outside.
New system call sys_makedet() which creates a deterministic box that expands to include all children. Processes are otherwise exactly the same as Linux processes.
Semantics of internal determinism: execute as if serialized on a logical timeline.
External nondeterminism: e.g. read() from a socket, where both the "what" (data) and the "when" (blocking time) are nondeterministic; the shim program interposes on this. The shim could record the call, or replay it from a previous recording. It controls the logical time at which the call returns.
Other uses: deterministic replication with DPGs and shim programs. Shim ensures that replicas see the same input at the same logical time. 2PC is detailed in the paper.
Implemented on Linux 2.6.24 on x86_64. 8000 LOC across 50 files changed. Transparently supports unmodified binaries. Needed to modify thread scheduling, virtual memory and syscall entry/exit. Paper describes the implementation challenges.
Use the DMP-O algorithm from previous ASPLOS papers. Chose this because it is easy to implement. Threads run in parallel until communication is detected, then communication is serialized. Detection uses an “ownership table” that maps data to threads. Parallel and serial modes execute atomically and serialize onto the logical timeline (one time step per mode switch).
All loads and stores must be instrumented to manage shared memory, as must all syscalls for in-kernel channels (pipes, files, signals, but also address space, fd table (implicit/covert channels)). Page protection hardware is used to monitor shared memory at a page granularity.
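The ownership-table mechanism can be sketched roughly like this: a heavily simplified, address-granularity toy (not the real DMP-O implementation, which tracks pages and runs quanta of instructions). Accesses to self-owned or unowned data proceed in parallel mode; the first access to data owned by another thread ends that thread's parallel quantum, and its remaining accesses commit in serial mode in a deterministic thread order.

```python
def run_quantum(accesses, owner):
    """Toy DMP-O-style quantum. `accesses` maps thread id -> list of
    addresses touched; `owner` is the ownership table (addr -> thread).
    Returns the accesses that ran in parallel mode and those that were
    serialized. Illustrative only."""
    parallel, serial = [], []
    blocked = set()
    # Parallel mode: allow accesses to self-owned or unowned data.
    for tid, addrs in sorted(accesses.items()):
        for addr in addrs:
            if tid in blocked or owner.get(addr, tid) != tid:
                blocked.add(tid)           # communication detected
                serial.append((tid, addr))
            else:
                owner[addr] = tid          # claim ownership
                parallel.append((tid, addr))
    # Serial mode: blocked accesses commit in deterministic thread order,
    # claiming ownership as they go.
    for tid, addr in sorted(serial):
        owner[addr] = tid
    return parallel, serial

owner = {}
parallel, serial = run_quantum({1: ["a", "b"], 2: ["b", "c"]}, owner)
```

In the example, thread 2's access to "b" conflicts with thread 1, so thread 2's accesses are serialized; the final ownership table is itself deterministic.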
Evaluated on an 8-core Xeon with 10GB of RAM. Each application runs in a DPG. Used the racey deterministic stress test to verify determinism. How much internal nondeterminism is eliminated? How much overhead does dOS impose? How does it affect parallel scalability?
Compare log sizes for dOS and SMP-ReVirt. SMP-ReVirt produces logs that are 4 orders of magnitude bigger.
Overhead compares no-DPG to DPG-only to DPG-with-execution-record.
For Apache, DPGs can saturate a 1G network (100KB static pages). For 10KB static pages, the nondeterministic baseline saturates, DPG drops 26% and record drops 78% (both relative to the nondeterministic baseline). For Chromium, the slowdown is about 1.6x for both DPG and record.
Ran parallel benchmarks on 2, 4 and 8 threads. Blackscholes, lu and pbzip benchmarks have 1 to 3x slowdown in DPGs. Dedup, fmm and make suffer more due to false sharing on a page.
Q: Do you intend only to run programs deterministically for debugging, or should it be always-on? Ideally always, but some workloads will not be suitable.
Q: How do you decide on a timeline? Logical time increments either when a fixed set of instructions are reached, or [...].
Q: Does putting in debug code distort the execution? Yes.
Q: Is it possible to write a custom MPI that replaces fine-grained shared memory with message passing that would get rid of much of the overhead? It’s certainly possible, but we haven’t done that.
Q: Is non-determinism really bad? Splitting between internal and external handles that, and gives you control over where the split lies.
Q: Did you measure CPU utilization in your Apache benchmarks? Yes, but I don’t have numbers.
Q: Did you think about how else you might address the fine-grained memory sharing otherwise? Could use hardware support or recompile the program using a transforming compiler. These might get some of the scalability back.
Parallelism is a grand challenge, and nondeterminism makes it harder. Races are everywhere: memory access, file access, synchronization….
Do we have to live with races? Restrictive languages (such as Haskell, SHIM, DPJ) don’t allow races to occur in the first place. Could we have race-freedom in any language?
Determinator is a new OS that offers race-free parallel programming. Compatible with existing programming languages, from Java through C to assembly.
Central idea: a check-in/check-out model for shared state. Fork checks out a copy, threads work on a private working copy of shared state, and join checks in and merges changes. The idea comes from parallel Fortran DOALL, and from version control.
Example is a multithreaded game/simulation/actor-based program, where threads share a global array. In Determinator, each thread gets a copy of the array, and join merges the diffs.
Another example is Parallel Make, where there is a Heisenbug (two parallel jobs use the same temporary filename). Determinator creates a working copy of the file system for each parallel job, so the file doesn’t get overwritten.
Writes only get communicated at synchronization. Write/write races either go away or become deterministic conflicts, which cause a runtime exception.
Determinator is a ukernel, which enforces a strict hierarchy of process “spaces”. Only the root space can directly do external I/O, and children can only communicate with their immediate parent. Single thread per space. Each space has a single address space, but with a backup “shadow” space.
ukernel API is very small (PUT, GET and RET). User-level runtime provides higher-level abstractions (C library, pthreads, file system, process management).
Join does a three-way diff between the parent’s working copy, the original copy and the child’s final working copy. Optimized using VM copy on write. Can do merge at page-level (faster) or byte-level.
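A minimal sketch of the element-wise three-way merge at join, over fixed-size arrays (Determinator's byte- and page-level merges work analogously; this is an illustration, not its implementation):

```python
def three_way_merge(base, parent, child):
    """Element-wise three-way merge: compare the parent's and child's
    working copies against the original. Concurrent writes of different
    values to the same element are a deterministic conflict."""
    merged = []
    for b, p, c in zip(base, parent, child):
        if p == b:
            merged.append(c)      # only the child wrote (or nobody did)
        elif c == b:
            merged.append(p)      # only the parent wrote
        elif p == c:
            merged.append(p)      # both wrote the same value
        else:
            raise ValueError("deterministic write/write conflict")
    return merged

merged = three_way_merge([0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 2, 0])
try:
    three_way_merge([0], [5], [6])
    conflicted = False
except ValueError:
    conflicted = True
```

Note how a write/write race either disappears (disjoint elements merge cleanly) or surfaces as a repeatable runtime exception, matching the semantics described above.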
Can emulate a shared file system (non-persistent) with weak consistency. Can do merges at the file rather than byte granularity. Only used for intermediate results.
Other features include process migration for distributed computing support, deterministic scheduling, and other things (in the paper).
Compared single-node performance against Linux on parallel benchmarks. Some speedup over, or parity with, Linux on coarse-grained benchmarks. Fine-grained benchmarks (e.g. LU) take a serious performance hit.
Looked at the granularity of a parallel quicksort, and the break-even is around 100K elements in an array.
Q: Is check-out/check-in more like transactions? It absolutely resembles transactional memory, but these don’t try to provide determinism.
Q: point about a “PAR DO” construct in an old Comp Arch news.
Q: What about other forms of synchronization e.g. condition variables? These are a terrible fit for Determinator since they are not naturally deterministic. We do fork/join and barriers. Talk about non-hierarchical synchronization in the paper. Barriers allow a child to sync with its parent without terminating.
Q: Do you think it is ideal to support arbitrary languages, or would it be better to change the way programs are written to exploit the Determinator semantics? Trade-offs/synergy? We’ve tried to be completely language-agnostic, but interaction could be a huge benefit. A big issue is choosing the granularity for merges, so type information could help here. Or could implement the merge in a more informed way.
Q: How much experience do you have with figuring out the programming model to deal with conflicts? Don’t have much practical experience yet as it is just a prototype.
Q: Could the merge operation (working on a shared file) yield a result that I might not expect, and do I have to change my assumptions about shared files? Currently, we merge at file granularity, and don’t attempt to merge within the file: this would be a conflict.
Stable Deterministic Multithreading through Schedule Memoization
Nondeterminism arises when different runs of a multithreaded program give different behaviour for the same input. Some behaviours may be incorrect.
Deterministic multithreading partly addresses this problem, but the schedules are tightly coupled to the input. A slight change in the input leads to a very different schedule. This complicates testing and can mean that some inputs deterministically lead to buggy schedules.
The idea is to have stability: memoize past schedules and repeat familiar ones.
TERN runs on Linux as a user-space scheduler. Schedule memoized as a totally-ordered sequence of synchronization operations. Only race-free schedules are memoized (previous work). Use symbolic execution (based on KLEE) to track the input constraints required to reuse a schedule. Memoization checks inputs against previous input constraints that were identified by the symbolic execution.
Evaluated on Apache, MySQL, PBZip2 and 11 scientific programs. For 13/14 programs, fewer than 10 lines of code needed modification. 100 schedules process over 90% of a real HTTP trace with 122K requests. Overhead is <10% for 9/14 programs.
TERN works on annotated program source, with annotations to say which inputs could affect scheduling. An LLVM module does instrumentation. A schedule cache stores memoized schedules as a mapping from constraint sets to schedules. On a hit, the schedule is replayed; on a miss, the memoizer runs.
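The schedule cache can be sketched as below, with plain Python predicates standing in for the KLEE-derived symbolic constraints and a list of synchronization operations standing in for a memoized schedule (names invented):

```python
class ScheduleCache:
    """Toy memoizer: maps input constraints to a memoized synchronization
    order. Plain predicates stand in for symbolic-execution-derived
    constraints; a schedule is a total order of synchronization ops."""
    def __init__(self):
        self.entries = []                 # list of (constraints, schedule)

    def lookup(self, inp):
        for constraints, schedule in self.entries:
            if all(pred(inp) for pred in constraints):
                return schedule           # hit: replay this schedule
        return None                       # miss: run the memoizer

    def memoize(self, constraints, schedule):
        self.entries.append((constraints, schedule))

cache = ScheduleCache()
# e.g. an "at least 2 blocks" constraint, memoized with a 2-worker schedule
cache.memoize([lambda nblocks: nblocks >= 2],
              ["t1:lock", "t1:unlock", "t2:lock", "t2:unlock"])
hit = cache.lookup(4)
miss = cache.lookup(1)
```

Stability falls out of the structure: many different inputs satisfy the same constraint set, so they all replay the same familiar schedule.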
Simplified example based on PBZip2, which has worker threads processing blocks independently. “symbolic” annotation identifies parameters that affect the schedule.
Developed programming techniques to make this practical, and these are detailed in the paper.
Stability experiment uses a 4-day httpd trace (122K), MySQL sysbench simple and tx workloads, and PBZip2 uses the contents of /usr. Ran on a 4-core Intel box.
TERN can reuse over 90% of schedules for three benchmarks. MySQL TX workload only reuses 44.2% of schedules.
Looked at bug stability: if you change the input, do bugs occur in the changed input but not in the original? Compared to COREDET. Looked at three buggy programs: fft, lu and barnes from SPLASH-2. Symptom is that global variables are printed before being assigned the correct value. TERN never sees the bug, but COREDET does for some thread/matrix size configurations. Fewer schedules in TERN means that the chance of hitting the bug is less, and hence it is more stable.
What about TERN’s overhead? It is supposed to run in everyday execution, so this must be mild. Apache sees small overhead, MySQL’s is much greater (>30%), and PBZip2’s is negative (a speedup). Mixed results on SPLASH, from −15% up to 40%. Details of the optimization/speedup are in Section 7 of the paper.
Future work is fast deterministic record/replay.
Q: Does not recording shared memory access order mean that you might not faithfully reproduce data race bugs? The premise is to make behaviour stable and deterministic, rather than eliminating bugs. But we can memoize race-free schedules only, by applying a race detector.
Q: Could the annotations be created automatically in future? Haven’t done this yet, but in our evaluation, the overhead is pretty small.
Q: How do you automatically determine constraints, and would these become bigger as execution goes on? We have two refinements that mitigate this. First, we remove redundant constraints from the constraint set. Second, we slice out branches that don’t affect synchronization orders.
Q: Is it true that you only memoize and replay schedules at program granularity (and cannot replay parts of schedules), and would that be a benefit (and would it be easy or difficult)? Currently not implemented, but this would be an interesting direction.
Enabling Configuration-Independent Automation by Non-Expert Users
It is hard to automate a task on different machine configurations. KarDo observes user actions as they perform a task, and produces a solution that works across configurations.
Strawman: collect a separate trace for every possible configuration. But the configuration space is too big to explore completely. Also, tasks are so diverse that we are unlikely to get more than a few traces for each task.
KarDo can generalize from a few traces.
Example application: turning on the out-of-office (OOF) notification in Outlook. What if the two machines start off in different view modes? What if a Java update notification pops up between steps while recording? Can infer which actions are non-state-modifying from traces of any task. This means far fewer traces are needed, and non-state-modifying actions can be skipped in the replayed trace.
Evaluated by automating tasks from the MS Help and eHow website, in numerous categories.
Tracing user actions is challenging because the OS only gives mouse clicks and keypresses which may be meaningless on a different machine. Use the accessibility interface to get more information about the widgets that were affected by an action, which is meaningful across machines.
Generalizing to a single canonical solution requires distinguishing state-modifying actions from other actions. Three kinds of action: update, commit and navigate. Only consider the update and commit actions. To identify these, use visual cues and feed them into an SVM, which labels actions. For features, use widget type, whether it closes the window, whether it changes state, what its text label is, etc. Train it using a small set of examples.
To generalize across different configurations, must splice in navigation actions to the state-modifying ones in order to get to the right buttons, etc. Build a global navigation graph, then do breadth first search on the graph to get to the desired state.
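The splicing step can be sketched as a shortest-path search over a toy navigation graph; the states and actions below are invented for illustration, not KarDo's actual representation:

```python
from collections import deque

def navigation_path(graph, start, goal):
    """Shortest action sequence from start to goal over a navigation
    graph (state -> list of (action, next_state) edges), via BFS."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None                           # goal unreachable

graph = {
    "desktop": [("open Outlook", "main")],
    "main":    [("Tools menu", "tools")],
    "tools":   [("Options...", "options")],
}
path = navigation_path(graph, "desktop", "options")
```

The recovered path supplies the navigation actions to splice between state-modifying steps when the recorded machine and the target machine start in different UI states.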
To remove unnecessary user actions, use a two-pass algorithm. First pass removes unnecessary updates (which are later overwritten).
To create a canonical solution, need to create a per-task state modification graph, which may have conditional branches for different configurations (e.g. configuring a router, there may be different manufacturer applications).
Implemented KarDo as a thin-client backed by a server. Evaluated on 57 tasks. Used 20 diversely configured VMs, split into 10 training VMs and 10 test VMs. Each task performed manually on exactly 2 VMs.
Testing by direct replay of the automation recording (baseline) and comparing it to KarDo. Baseline is successful in 18% of cases. KarDo succeeds in 84% of cases.
Why does KarDo fail? 4% of mistakes are due to the navigation algorithm. 5% are due to missing steps and 7% are due to configuration errors.
Q: How many of the examples could you have handled by modifying configuration files directly? Some reasonable percentage of them, but we want to handle more general things than just configuration tasks.
Q: To what extent could you adapt your system to perform command-line tasks, or editing specific files? We do consider some command-line steps, but this isn’t the primary goal. Could extend it in this direction, but would need to understand the syntax and text on the command-line, which is more complex.
Q: What if the training steps include mistakes? We remove unnecessary steps. What about misconfiguration? Could use traces to “check” the correctness of configurations.
Q: What if you have to branch, how resilient is this to cosmetic perturbations to the interface? The accessibility layer is the level we consider, so we look for the presence of widgets.
Q: How do you determine the start of a task? People upload traces to a central server, so we assume they determine the start before recording it, and do this deliberately. What about input-invariant actions (entering user-specific text in a text box)? We flag this in the recording stage.
Q: Have you thought about using Google searches as an input to the trainer? Had a HotNets paper on this very thing last year.
Q: What if your program has privacy-sensitive data and this was captured in a trace? Do they have to be manually anonymised? How do I make sure that I don’t upload parts of my private email? In the general case, you need to rely on the user. However, we can infer that, if many users input the same value, it is probably not private. But I have to trust the operator? You’d have to trust the automated system, not the administrator.
Automating Configuration Troubleshooting with Dynamic Information Flow Analysis
Systems are hard to configure, and users make mistakes. Troubleshooting is therefore difficult and expensive.
When we have misconfigurations today (such as a bogus value in a config file), might look at log messages, ask colleagues, search Google or look at the code if available. The idea is to make a tool that automatically finds the root cause.
ConfAid uses application code to lead users to the root cause, by doing dynamic information flow analysis on binaries.
Developers find the root cause by tracing back through the variables and inputs that have influence on the error condition occurring. ConfAid uses taint tracking to identify these dependencies. When the application reads from a configuration file, it adds taint to variables, and traces this through data and control flow. ConfAid attempts to find execution paths that avoid the error. Different inputs have distinct taint.
Each variable has a taint set which identifies the inputs that might influence the variable if they change. These sets are built through data flow and control flow. Must explore both paths of a conditional statement.
Complete and sound analysis would lead to bad performance and a high false positive rate. Use heuristics to deal with this.
The bounded horizon heuristic prevents path explosion by only running a fixed number of instructions in an alternate path. However, this may lead to false positives and false negatives.
Also, the single mistake heuristic assumes that a configuration file contains only one mistake. This reduces the amount of taint and the number of explored paths.
The weighting heuristic attempts to reduce the false positive rate by weighting different taint propagations differently. Data flow taint is weighted more heavily than control flow taint. Weight branches closer to the error higher than those further away.
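A rough sketch of weighted taint sets; the weights, option names and helper functions are invented for illustration, not ConfAid's actual values or API:

```python
DATA_W, CONTROL_W = 1.0, 0.5    # invented weights: data flow outweighs control flow

def merge(*taints):
    """Union weighted taint sets, keeping the strongest weight per option."""
    out = {}
    for t in taints:
        for opt, w in t.items():
            out[opt] = max(out.get(opt, 0.0), w)
    return out

def scale(taint, factor):
    """Dampen taint that crosses a control-flow edge."""
    return {opt: w * factor for opt, w in taint.items()}

# port_var = parse(config["Port"])       -> direct data-flow taint
port_var = {"Port": DATA_W}
# if timeout_var > 0: ...                -> branch body is control-flow tainted
branch_body = scale({"Timeout": DATA_W}, CONTROL_W)
# taint reaching the error condition, ranked by weight
error_taint = merge(port_var, branch_body)
ranking = sorted(error_taint.items(), key=lambda kv: -kv[1])
```

In this toy run the data-flow-tainted option ("Port") outranks the control-flow-tainted one ("Timeout"), which is the effect the weighting heuristic is after.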
ConfAid can also propagate taint between processes that use IPC system calls. Support communication via sockets, pipes, TCP/UDP and regular files.
Evaluated on OpenSSH, Apache and Postfix. Manually injected errors and evaluated the accuracy and time-to-discover of the root cause result.
Used real-world misconfigurations and randomly generated bugs.
The correct root cause was ranked first or second for all 18 real-world bugs. Performance also great for randomly-generated bugs, getting 85% first-place correct root causes.
Average execution time is 1 minute 32s. Apache is the slowest. The randomly-generated bugs only take about 23s.
Q: How do you handle external input or output of various types in your speculative execution? Can you do rollback? We abort alternative path exploration if we see an externally observable effect.
Q: Do you automate the parsing and identification of the configuration files? We just intercept the read call and assign taint from there. Don’t need to identify the parsing routines, etc.
Q: Can you combine this with backward flow analysis from the error point? (With program slicing?) This would be much more difficult. Program slicing goes backwards, but we find that the config is read at the very beginning, then run for a while, then something goes wrong. Probably not practical to do program slicing on such a workload. It would probably also give a lot of false positives, because our taint tracking is more contained.
Q: Could you use your technique for other types of errors, such as segfaults? You could, but we have focused on configuration problems.
Q: Are all of the bugs that you dealt with single-mistake bugs? Yes. How common do you think those kinds of bugs are? Well, it depends if the bugs are independent or if they influence each other. Will probably only find the first error if there are two. Relaxing the single mistake restriction in future work.
Q: Do you have anything to add in terms of the performance overhead (proportional to program size) and did you do anything to reduce it? Dynamic data-flow analysis is slow: 100 to 200x slowdown, and we haven’t optimized it much, but it would be hard to get much faster. If the problem happens far into the execution, it will take a long time to come up with the ranking. Idea is to use this offline when you encounter a bug.
Q: How is the syntax of the configuration file handled? Does ConfAid require a priori knowledge of the syntax? No, we don’t require this. Fine with both Apache and OpenSSH, without app-specific tuning. However, there might be some noise in the result.
Inside the Data Center, 2
Large-scale Incremental Processing Using Distributed Transactions and Notifications
Indexing is a long process, not just matching keywords to URLs. Converts raw documents into documents ready for serving.
Example of duplicate elimination using MapReduce. Map documents to their checksum, and reduce each checksum group to a single canonical document. However, refreshing the index means adding more documents and deciding whether to admit each one to the repository. With MapReduce, you would have to start again from scratch.
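The batch version of duplicate elimination might look like this sketch (Python's built-in hash standing in for a real checksum, smallest-URL-wins standing in for the real canonical-document choice; both are arbitrary):

```python
from collections import defaultdict

def dedup(docs):
    """Map each (url, body) to its checksum, then reduce each checksum
    group to a single canonical document."""
    groups = defaultdict(list)
    for url, body in docs:                 # map: emit (checksum, doc)
        groups[hash(body)].append((url, body))
    # reduce: one canonical document per checksum group
    return {min(g)[0]: min(g)[1] for g in groups.values()}

canonical = dedup([("a", "x"), ("b", "x"), ("c", "y")])
```

The problem is incrementality: adding one new document forces the whole map and reduce over the entire repository to run again, which is the motivation for Percolator below.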
Goal is to have a large repository of documents with a high-quality index, and a small delay between crawling and indexing (“freshness”). MapReduce could only achieve freshness of a few days.
All the information is stored in Bigtable. Tables mapping URLs to metadata, and from checksums to canonical documents. On getting a new document, you need to update the checksum table, and flip a metadata bit in the URL table. This leads to a race condition.
Percolator adds distributed transactions to Bigtable. The API has get(), set(), commit() and iterator methods. Limited support for secondary indices.
Bigtable is a sorted (row, column, timestamp) store. Rows are partitioned into ranges called “tablets”. Tablets are spread across many machines.
Percolator provides “snapshot isolation” semantics, using two-phase commit coordinated by the client. Locks are stored in additional Bigtable columns: commit and lock for each data column. Transactions manipulate these columns transparently to the user.
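A heavily simplified sketch of the lock/write-column protocol (no lock cleanup, no read-time conflict checks against the write column, no Bigtable; class and method names are illustrative, not the Percolator API):

```python
class Table:
    """Toy versioned store with per-row data, lock and write (commit)
    columns, standing in for the extra Bigtable columns."""
    def __init__(self):
        self.data, self.lock, self.write = {}, {}, {}

class Txn:
    """Minimal client-coordinated two-phase commit over lock columns."""
    def __init__(self, table, start_ts):
        self.t, self.start_ts, self.buf = table, start_ts, {}

    def set(self, row, value):
        self.buf[row] = value             # writes are buffered until commit

    def commit(self, commit_ts):
        rows = sorted(self.buf)
        # Phase 1 (prewrite): take a lock and stage data at start_ts.
        for row in rows:
            if row in self.t.lock:        # another transaction holds a lock
                return False
            self.t.lock[row] = rows[0]    # all locks point at the primary
            self.t.data[(row, self.start_ts)] = self.buf[row]
        # Phase 2 (commit): replace each lock with a write record that
        # points readers at the staged data.
        for row in rows:
            del self.t.lock[row]
            self.t.write[(row, commit_ts)] = self.start_ts
        return True

def get(table, row, ts):
    """Snapshot read: latest committed version at or before ts."""
    commits = [c for (r, c) in table.write if r == row and c <= ts]
    if not commits:
        return None
    return table.data[(row, table.write[(row, max(commits))])]

t = Table()
txn = Txn(t, start_ts=1)
txn.set("url:a", "doc-v1")
ok = txn.commit(commit_ts=2)
```

A reader at timestamp 3 sees the committed document; a reader at timestamp 1 sees nothing, because the commit record postdates its snapshot, which is the essence of snapshot isolation.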
Notifications allow users to run user-defined code when a column is updated. “Message collapsing” means that notifications are run at most once per write. Applications are structured as a series of observers, which hopefully form a tree.
Notifications run independently on individual rows, which means that stragglers don’t hold up other work.
Bus clumping: randomly scanning the repository for notifications leads workers to clump together (like buses), which reduces parallelism and overloads the Bigtable servers. Solution is to obtain locks on a row-by-row basis, and to jump to another random row when a scan runs into a contended one.
Very different access pattern on the hardware. Lots of small, random disk I/Os with many RPCs per phase (compared to MapReduce’s streaming I/O with many documents per RPC).
Compared MapReduce versus Percolator on document processing latency. MapReduce is basically constant against churn at around 2200s, whereas Percolator has a much lower latency, until latency explodes around 35% churn. MapReduce is limited by stragglers. Operating point is around 2 to 15% churn.
Pros of Percolator: much better freshness, and better scalability (more throughput by buying more CPUs), which supports a bigger repository. Immune to stragglers. Cons are that it needs to reason about concurrency and is about 2x more expensive per document processed.
Percolator runs 1000 threads on a single Linux instance. This makes it easier for application developers, with straight-line code and meaningful stack traces. And it gets good multicore speedup. Problems with kernel scalability, though, such as the mmap semaphore, and had to fix those.
Unknown unknowns: some CPUs periodically got XOR calculations wrong in checksums. Sometimes Bigtable cannot delete files fast enough (the garbage collector). 50% of seeks were doing useless readahead, so switched to ext4. Incorrect resistor value on the motherboard led to the workload powering machines off.
Gets 50% fresher results, with a 3x larger repository.
An existence proof for distributed transactions at web scale.
Q: Did you consider doing Nectar? Didn’t consider this, but it could improve the performance of this system. We do batch RPCs for operations on the same row.
Q: What was causing stragglers? Machine failures, process failures, etc.
Q: Could you live with some of the failures that you saw? Some limitations in the Bigtable abstraction that made us have to debug down to the lowest layers.
Q: Were Stonebraker and DeWitt right? It isn’t a good idea to pretend that MapReduce is a database system. But MapReduce isn’t dead. Percolator has a handful of users, whereas MapReduce has thousands of Google users.
Q: Did you find cycles of notifications arising in practice? Did it exactly once, but were more careful after that. The system looked healthy but would have run for an infinitely long time.
Reining in the Outliers in Map-Reduce Clusters using Mantri
Work began as a measurement study, but then turned to fixing the problems it found. Now in production on Bing’s clusters.
MapReduce is used to build the search index, improve relevance and ads, perform genome analysis, etc. It decouples operations on data from mechanisms to scale across large distributed clusters.
Implementation is Cosmos (based on Dryad). Bing uses the SCOPE language to specify queries.
Example: finding frequent search queries. Three phases: read/parse, map and reduce. Barrier after the map task. Leads to a drop in cluster utilization towards the end of the map phase, due to stragglers. Reduce phase also has stragglers. Both phases slow down the overall job.
Want to get rid of stragglers and recomputations. This gives better predictability, better resource utilization and quicker response (good for developers).
Measured start time, end time, data rate and failure mode on a production cluster. Stragglers take 1.5x the duration of the median task in a phase. Half of phases have >10% of stragglers and no recomputation. 10% of stragglers take 10x the duration of the median task (very heavy tailed).
The impact of outliers on job completion time (based on a trace-driven simulator): 50% of jobs would improve by at least 34.7%.
If input data is unavailable, tasks must wait for recomputations, and this delay may cascade. Idea is to replicate intermediate data, and use a copy if the original is unavailable. Challenge is to decide what data to replicate, and where. Recomputes are rare and localized (mostly due to disk failures or high load), which yields a predictive model for how likely a machine is to cause a recompute. Cost-benefit calculation based on the probability of a recompute and the runtime of the task. Since whole-rack failures are so rare, only replicate within a rack.
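The cost-benefit rule reduces to something like this one-line check (variable names invented; Mantri's actual model estimates these quantities from machine history and task runtimes):

```python
def should_replicate(p_recompute, t_recompute, t_replicate):
    """Replicate a task's output within the rack when the expected cost
    of a recompute exceeds the cost of making the copy. A sketch of the
    cost-benefit calculation, not Mantri's actual model."""
    return p_recompute * t_recompute > t_replicate

cheap_insurance = should_replicate(0.2, 100, 10)   # likely recompute, cheap copy
wasteful = should_replicate(0.01, 100, 10)         # recompute too unlikely
```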
Cause of stragglers: tasks reading input over the network end up experiencing variable congestion. Solution is to spread tasks evenly across the cluster. Many aspects of MapReduce are already network-aware (such as shard placement), but the shuffle of the reduce phase means that this locality is harder to exploit.
Place reduce tasks so that the traffic on each link is proportional to the available bandwidth on that link. This would require global co-ordination across jobs to identify the congestion and place tasks. Can approximate this with local control.
Cause of stragglers: workload imbalance due to unequal partitions. Approximately 25% of outlier tasks have more data to process. Better to ignore these than duplicate them! Mantri builds an estimator that uses Graham’s 1969 theorem.
A final cause is contention on a machine. Idea is to copy tasks elsewhere in the cluster. Challenge is to do this as early as possible. Need to build an estimator to guess how much work is left to do. Trick is to copy when the predicted time of starting again is less than half the remaining time.
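The restart trick from the talk, as a one-line sketch (the time estimates are assumed to come from Mantri's progress estimator):

```python
def should_duplicate(t_remaining_est, t_fresh_est):
    """Schedule a copy elsewhere only if a fresh run is predicted to
    finish in less than half the current task's predicted remaining
    time. Sketch of the rule described in the talk."""
    return t_fresh_est < t_remaining_est / 2

kill_and_copy = should_duplicate(100, 30)   # fresh run clearly wins
leave_alone = should_duplicate(100, 60)     # not worth the extra slot
```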
Running time is a function of the input, network congestion, data left to process and various other factors.
Results from deployment in production clusters and simulations. Running in production since May. Compared May to June on Mantri against April (previous setup). Median job improvement is 32.1%. There is also a reduction in the number of resources used, because too many duplicate jobs were scheduled in the Dryad scheduler. (0.5x as many copy tasks launched as Dryad, with a 2.8x higher success rate).
Compared Mantri to Dryad, Hadoop, LATE and MapReduce on a trace. Mantri gives a better reduction in completion time than all of these schemes. The others have less time savings with more resources used.
Q: You took the MapReduce interface as given, so what would you do if you built a new system knowing what you know now? Providing better ways of partitioning maps, which we have already done in production.
Q: What would you do if you had a non-uniform cluster? We have very efficient cluster management software, called Autopilot, that monitors these things. The primary reason we don’t have persistent problems is that this will shoot particularly bad nodes in the head. A slot abstraction is how we split up heterogeneous machines.
Transactional Consistency and Automatic Management in an Application Data Cache
Application-level caching is used to handle scaling issues in modern websites, e.g. Facebook, MediaWiki, LiveJournal. Caching whole web pages is not much good if you have highly personalized content. Database caches are less useful when so much processing is done in the application layer. This has led to the use of things like memcached or Java object caches. These are essentially a DHT, for application-level objects, which can separate commonly-used and customized content. This caching reduces load on the database and the application server.
Existing caches are very simple and force a lot of management work to the application. For example, they don’t provide transactional consistency, which violates guarantees that the underlying database would provide. This can expose anomalies to the user, and complicates application code (to deal with violated invariants).
Naming is harder than you think: a cache key must uniquely identify the value. Invalidations require global reasoning about the possible contents of the cache. Saw two bugs in MediaWiki (#7541 and #8391) that arose from these problems.
TxCache has transactional consistency, bounded staleness and function memoization. Implemented as a library that hides the complexity of cache management. It integrates with a new cache server we have developed. Function calls for beginning, committing and aborting transactions, and a hook to make a side-effect-free function cacheable (depending only on the arguments provided and the state of the database). No need to deal with inconsistencies, naming, invalidation, etc.
Transactional consistency goal: all data seen in a transaction reflects a single point-in-time snapshot. Each transaction gets a timestamp, and objects in the database and cache are tagged with a validity interval. Transaction can read cached data if the validity interval includes the timestamp. Staleness allows cached data to be used for longer, if this is compatible with the application requirements. Nevertheless, it must be consistent with the data in the database if other parts of the transaction require DB data. Simply expose an API on the database for starting a transaction at a given timestamp.
Validity intervals computed for application objects, queries and tuples (the last being tracked by the database). A query is valid if the current timestamp is in the intersection of the validity intervals of the tuples accessed. Had to modify the database server to track this and return it with each SELECT query.
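The interval arithmetic is straightforward to sketch (the (lo, hi) representation is invented for the sketch, with None standing for "still valid"; not the TxCache wire format):

```python
def query_validity(tuple_intervals):
    """Intersect the validity intervals of the tuples a query touched.
    Intervals are (lo, hi) timestamp pairs; hi is None while still valid."""
    lo = max(iv[0] for iv in tuple_intervals)
    bounded = [iv[1] for iv in tuple_intervals if iv[1] is not None]
    return (lo, min(bounded) if bounded else None)

def usable_at(interval, ts):
    """A cached result can serve a transaction whose timestamp falls
    inside the result's validity interval."""
    lo, hi = interval
    return lo <= ts and (hi is None or ts <= hi)

iv = query_validity([(3, 10), (5, None)])
```

Here a query over a tuple valid since timestamp 3 (until 10) and one valid since 5 (still valid) yields the interval (5, 10), so a transaction at timestamp 7 can use the cached result but one at 12 cannot.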
It is hard to choose a timestamp a priori, since we don’t know the access pattern or the contents of the cache. However, can choose this based on the tuples that would be accessed.
Each object in the cache has invalidation tags, and the DB was modified to assign these tags to each query. Tags are computed from the query access methods (e.g. index lookups, wildcard tags for queries on the entire table). An update generates affected tags (one per key per tuple modified) and broadcasts these to all cache nodes in an ordered stream. Need to be careful to avoid lookup-invalidate race conditions.
Evaluated using the RUBiS benchmark suite, with the standard browsing/bidding workload. Ran with an in-memory (850MB) workload, and a 6GB (disk-bound) workload. As the cache size grows, throughput increases up to about 5000 requests per second. The cache hit rate flattens out at 95%. With no cache, the rate is about 1000 requests/second.
Throughput increases also for the disk-bound workload, though much lower (up to 500 requests/second). The cache hit rate is much higher though. Bottleneck in this workload is random I/O for infrequently-accessed data.
Adding staleness gives a 3x to 5x throughput improvement.
Investigated cache misses due to the stringent consistency: less than 10%. When consistency is turned off, the increase in throughput is much lower.
Q: How are the invalidation tags affected by complex queries like joins? We have a more complex approach based on predicate locks.
Q: How do you deal with fault tolerance? The cache is inherently fault tolerant because its data can always be recreated. Although migration etc. could improve that. Problems with the validity intervals? No, because these are generated by the database.
Q: Do you provide serializability or snapshot isolation? Depends on what the underlying database supports, since the cache is for read-only transactions.
Q: Why did your throughput keep going up as the hit rate levels off? This is because we are caching objects that are sometimes larger than the data that is used to generate them.
Q: Do you still get a performance win if you have a single cache node on a single system? The benefit of a distributed architecture is the ability to scale the effective size of the cache. Have you compared distributing the cache versus distributing the database? There are a lot of approaches based on DB replication that do similar things, but these are more heavyweight because they try to scale writes as well.
Q: How is the cache key chosen? Based on the function’s name and its arguments. A more complex approach could be used if the code might change. What if not all of the arguments matter? Still work to be done in the application to make sure the cache is used efficiently (e.g. in choosing the cacheable functions).
Piccolo: Building Fast, Distributed Programs with Partitioned Tables
Motivating example is PageRank: for each node in a graph, add the current score of that node to its out-neighbours, and repeat until convergence. The only changing data (the current and next iteration scores) will fit in memory.
Dataflow models don’t have global state as a first-class concept. Need to read from and write to distributed storage. There is no direct access to intermediate global state, and what access there is comes at a high cost.
Could implement it with MPI or RPC. Need to decide how to break up the computation explicitly.
Piccolo provides global access to distributed shared state, and the Piccolo runtime handles communication between the distributed components. Usually ease of use and performance are in conflict.
Programming model uses a key-value store with a put/get/update/iterate interface. Implemented as a library for C++ and Python.
Showed how PageRank is implemented on Piccolo. A job is specified with a control function and a kernel function that actually does the computation. Doing it naïvely is slow, because it would do a lot of remote gets and remote puts.
Let the user control the partitioning function (e.g. for PageRank based on the domain of each page), and colocate tables that a kernel function uses on the same node.
Synchronization is challenging if multiple writers can access the same key at once. So provide an atomic update (e.g. sum or min or max) function attached to a table value. Only works for commutative functions. Synchronization primitive is a global barrier. Tables provide release consistency: accumulations are atomic, but they may be buffered and only guaranteed to be applied at barriers. Need to do an explicit barrier between operations. Release consistency makes writes quite fast.
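The accumulator-plus-barrier model might look like this (a single-process sketch with invented names; the real Piccolo is a distributed C++/Python library with partitioned tables):

```python
# Minimal sketch of a Piccolo-style table with a commutative accumulator and
# barrier-buffered updates (hypothetical API, not the real library).

class Table:
    def __init__(self, accumulate):
        self.accumulate = accumulate   # commutative combiner, e.g. sum/min/max
        self.data = {}
        self.buffered = []             # updates held until the barrier

    def update(self, key, delta):
        # Release consistency: the update may be buffered, not applied at once.
        self.buffered.append((key, delta))

    def get(self, key, default=0.0):
        return self.data.get(key, default)

    def barrier(self):
        # All buffered accumulations are guaranteed applied at a barrier.
        for key, delta in self.buffered:
            self.data[key] = self.accumulate(self.data.get(key, 0.0), delta)
        self.buffered = []

# One PageRank-flavoured step: each node credits its out-neighbours.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}
nxt = Table(accumulate=lambda old, new: old + new)
for node, outs in graph.items():
    for dst in outs:
        nxt.update(dst, rank[node] / len(outs))
nxt.barrier()
assert abs(nxt.get("c") - 1.5) < 1e-9   # 0.5 from a + 1.0 from b
```

The accumulate function must be commutative precisely because the barrier applies buffered updates in an arbitrary order.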
Fault tolerance: do checkpointing of the table state. The barrier is augmented with a checkpoint of particular tables. Use the Chandy-Lamport (non-blocking) protocol for checkpointing.
Load balancing is tricky because the tasks are stateful, so you cannot simply restart a copy of a task elsewhere. The master coordinates work stealing, and needs to pause updates to a partition before it is stolen by another machine.
Evaluated against Hadoop on a local 64-core, 12-node cluster. Used a 100M page graph, taking around 8 seconds per iteration. Main Hadoop overheads are sorting, HDFS access and serialization.
Evaluated scaling up to 200 workers on EC2, and the iteration time stays about the same (about 60 seconds per iteration at all configurations).
Implemented iterative n-body and matrix multiply that don’t fit into Hadoop. Also asynchronous applications such as a distributed web crawler.
Q: What is your fallback strategy for dealing with computations that don’t have a commutative aggregate function? We haven’t found many applications like that. We would have to rely on pairwise locking in this case.
Q: Do you need to restart every node if a single node fails? Yes, you need to rollback all nodes to the last checkpoint, so need careful consideration of the checkpoint frequency.
Depot: Cloud Storage with Minimal Trust
Storage providers like S3 are appealing because they free the user from a lot of concerns. However, they also present risks. Failures can cause undesired behaviour (such as not propagating permission changes immediately). Also risks of data loss, corruption or unavailability.
Approach is to take a radical stance on fault-tolerance. Completely eliminate trust from cloud storage for put-availability, eventual consistency, staleness detection and dependency preservation. It minimises trust for get-availability and durability. So you can always perform a put, for example. But gets are only available as long as there is at least one correct node.
Depot ensures high availability by running multiple servers and non-sequential consistency. It falls back to client-to-client communication in the case of extreme cloud failure. It prevents omission and reordering, by adding metadata to puts, which is fully replicated in nodes’ local state. Metadata received at the client is checked to ensure that the client sees a view that satisfies various well-defined properties. Update metadata includes a local clock and a history. Logically each node stores all metadata it sees (garbage collection is handled in the paper).
To protect consistency, nodes perform local checks on the metadata: needs to ensure that the updating client has seen all previous updates (based on its history), and it doesn’t try to change history (based on its local clock).
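The flavour of these local checks can be sketched like so (a loose simplification with invented field names, not the actual Depot metadata format or its full check set):

```python
# Sketch of local checks on received update metadata (hypothetical
# representation). Each update names its writer, the writer's local clock,
# and a summary of what the writer had seen.

def check_update(update, state):
    """state maps node -> highest local clock accepted from that node."""
    writer, clock, seen = update["writer"], update["clock"], update["seen"]
    # Check 1: the writer's clock must advance by exactly one over what we
    # last accepted from it -- no gaps, and no rewriting of history.
    if clock != state.get(writer, 0) + 1:
        return False
    # Check 2: the writer must claim to have seen at least its own prior
    # updates (it cannot "forget" its own past).
    if seen.get(writer, 0) < state.get(writer, 0):
        return False
    return True

state = {"n1": 3}
assert check_update({"writer": "n1", "clock": 4, "seen": {"n1": 3}}, state)
assert not check_update({"writer": "n1", "clock": 4, "seen": {"n1": 1}}, state)
assert not check_update({"writer": "n1", "clock": 6, "seen": {"n1": 5}}, state)
```

A faulty node that signs two different updates with the same clock fails check 1 at any correct receiver, which is what produces the detectable fork described next.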
Faults can lead to forks in the history, which partition correct nodes, and prevent eventual consistency. Forks are joined to gain eventual consistency: faults are transformed into concurrency. One faulty node is transformed into two correct virtual nodes. Since concurrency can introduce conflicts (concurrent updates to the same object), Depot exposes such conflicts to applications.
Depot provides various properties in terms of consistency, availability, integrity and eviction, which vary based on how many nodes are correct.
For get availability and durability, it would be ideal to trust only yourself, but impractical to do this. Depot minimises the number of correct nodes that are necessary to service the get. Any correct node that is willing to supply an object can service a get. To make it likely that a correct node has data, employ replication (including a local copy).
Contingency plan for storage service provider failure. Prototype forces all clients to store all of the puts that they create, and gossip update metadata in order to route get requests to the right node. Despite complete SSP failure, this allows the system to keep working.
Evaluated cost in terms of latency, resources and dollars (for storage).
Overhead comes from adding metadata and adding checks (SHA256, RSA verify and history verify).
Set up the evaluation on 12 Emulab nodes, 8 clients and 4 servers, with a 1G link. Clients issue 1 request/second, and measure latency and per-request cost. As a baseline, emulate traditional cloud storage (servers implemented in Depot without any of the checks, and no metadata at the clients).
Latency overhead on get is very small (<1 ms), and on put, is about 5 ms.
Network overhead on gets is modest. CPU costs are bigger. Metadata transfer causes the network cost, and verification causes the CPU cost.
Cost model based on cloud pricing. Get cost per TB is small over the baseline (about $100/TB). Puts incur more cost (about $220/TB) due to metadata.
Q: Does presenting overheads in terms of the Depot cost mask the effect? Isn’t the CPU cost something like 400%? Yes, but the absolute values are small. Plus the real cost is in the bandwidth.
Q: What can Byzantine clients do if you rely on gossiping metadata (such as cause all get requests to be routed to it)? Include a “receipt” mechanism to deal with such a situation, described in the paper. When a client receives a get response, it performs a check to ensure that data is sufficiently replicated.
Comet: An Active Distributed Key-Value Store
Key-value stores are increasingly important, with great scalability, availability and reliability properties. Popular in both data centers and P2P. Stores are shared between many applications to avoid per-app storage system deployment, but building applications above these stores is challenging.
Main challenge is due to inflexibility: all apps are treated exactly the same way, independent of their needs (which may be conflicting). Motivating example is authors’ experience with Vanish on top of the Vuze DHT.
Vanish is self-destructing data storage for the web, based on the Vuze DHT, using temporal-based keys. Unfortunately, Vuze had a fixed 8-hour data timeout, and an overly-aggressive replication scheme, which harmed security. Making the necessary changes was simple, but deploying them was difficult, because it is a public service in production. Also it is hard to evaluate changes before deployment.
Propose extensible key/value stores that allow applications to customise functions of the store. Different data lifetimes, numbers of replicas and replication intervals. Also allow apps to define new functions, such as popularity tracking, access logging, or user-dependent responses. Built a peer-to-peer DHT: Comet.
Goals: flexibility, isolation, safety and light weight.
Main abstraction is an “active storage object” (ASO), which contains a value and a small set of handlers that are called on put or get to the object. Comet is built on top of a traditional DHT as an active runtime that runs ASO handlers in a sandbox.
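An ASO can be sketched as a value bundled with handlers (hypothetical runtime in Python; Comet’s real handlers are sandboxed Lua running inside the DHT node):

```python
# Sketch of an active storage object: a value plus small handlers that the
# runtime invokes on get/put (invented interface, not Comet's actual API).

class ASO:
    def __init__(self, value, on_get=None, on_put=None):
        self.value = value
        self.meta = {"gets": 0}       # per-object state the handlers may use
        self.on_get = on_get
        self.on_put = on_put

    def get(self):
        if self.on_get:
            self.on_get(self)         # handler runs before the value is returned
        return self.value

    def put(self, value):
        if self.on_put:
            value = self.on_put(self, value)   # handler may rewrite the value
        self.value = value

# Example customisation: popularity tracking on get.
def track(aso):
    aso.meta["gets"] += 1

store = {}
store["k"] = ASO("v0", on_get=track)
store["k"].get()
store["k"].get()
assert store["k"].meta["gets"] == 2
```

Access logging and user-dependent responses fall out of the same hook points: the handler sees each request and can record it or vary the returned value.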
Built several applications on top of the ASO extension API, including Vanish, Adeona, file sharing, etc. The extension API is 12 functions: 4 callbacks, 5 host interaction functions and 3 DHT interaction functions (get/put/lookup). Deliberately chose not to allow arbitrary network I/O because of the security implications.
Sandbox is less than 5 KLOC, based on a subset of Lua with unnecessary functions removed. Resource consumption is limited by counting per-handler bytecode instructions and memory consumption. Incoming and outgoing requests are rate-limited. DHT interaction is limited (to prevent traffic amplification and DDoS attacks): ASOs can only talk to their neighbours, and cannot make recursive requests.
Built on top of Vuze using Lua. Deployed on PlanetLab. Discussing mainlining it in Vuze with the main engineer.
Considered three examples. App-specific customisation of the replication scheme. Did Vanish-specific replication in 41 lines of Lua.
A context-aware storage object is used to build a proximity-based distributed tracker (returning close nodes rather than random nodes). ASO uses 37 lines of Lua. Using the proximity-based tracker gives better latency between paired nodes than a random tracker.
A self-monitoring DHT: a node monitors its neighbours by periodically pinging them. This allows you to collect measurements of DHTs. Ran an experiment to get Vuze node lifetimes.
Q: Have you thought about macrosecurity (as opposed to microsecurity considered in the talk)? We added global limits to ASOs (not just per-ASOs limits), that is just standard rate-limiting for DHTs. These are, we believe, the right mechanisms to manage the overhead of Comet. But we need more experience to determine limits. What about fairness? Haven’t looked at this yet.
Q: What happens when the metadata in the ASOs diverge between replicas? In today’s DHTs, replicas are actually inconsistent (may be stale). Opportunity here for different replicas to merge with each other using an ASO.
Q: Were you able to enable certain classes of computation in Lua? Removed everything except for three classes of things: simple math, string manipulation and table manipulation. A modified Lua interpreter counts the number of instructions. Is there an interesting set of things that you might add back in, and what applications might that enable (or is it a gray area)? One example was to do true internet measurement with arbitrary network I/O, but we decided not to add these things.
Q: Does the use of a high-level language cause performance unpredictability that might harm the 99.9th percentile latency? An interesting issue, but we haven’t looked at this yet. May choose to use non-GC’d languages in a real deployment.
Q: Do you persist that ASO state? Currently stored in-memory, and use replication to gain persistence.
Q: What happens to reliability if things actually get cancelled? Applications should expect that things may fail at any given point in time. Can use tricks like Vanish does to get reliable deletes.
SPORC: Group Collaboration using Untrusted Cloud Resources
Cloud deployment is attractive because it allows collaboration in user-facing applications, but the price is that you must trust the cloud provider.
Goal is to support practical cloud apps for real-time collaboration (and offline working), but using untrusted servers.
Application logic has to be moved to the clients, and every client has to maintain a copy of the local state. The server’s role is limited to storing encrypted data and processing client updates.
To keep clients in sync, use “operational transformation” (technique from 1989): application provides an app-specific transformation function that is used to fix up non-commutative actions (such as deleting two different characters concurrently). This can synchronize arbitrarily-divergent clients.
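A minimal transform for the concurrent-delete example (illustrative only; SPORC takes the transformation function from the application, and real OT also handles inserts and same-position conflicts):

```python
# Sketch of operational transformation for concurrent single-character
# deletes: once one delete has been applied, a concurrent delete at a later
# index must shift down by one to hit the character it originally named.

def transform_delete(op, applied):
    """Adjust delete position 'op' to account for an already-applied delete."""
    if applied < op:
        return op - 1
    return op

doc = list("abcdef")
op_a, op_b = 1, 4          # client A deletes 'b', client B deletes 'e'
del doc[op_a]                            # apply A's op first
del doc[transform_delete(op_b, op_a)]    # B's op, transformed against A's
assert "".join(doc) == "acdf"            # both intended characters are gone
```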
Digital signatures aren’t enough, since the server could equivocate, so employ “fork* consistency”, which gives linearizability on an honest server, and detectable equivocation after 2 messages from a malicious server. Clients embed their history (hashed) in messages they send. Server may fork clients into partitions, but can’t then unfork them. Need to detect forks out of band. SPORC provides a way to recover from malicious forks.
Internally, the SPORC library keeps a history of all the operations that it sees, marked as committed or pending, from which the client’s local state can be reconstructed. Committed operations are fork* consistent across all clients, and pending operations are causally consistent. Clients enforce this by checking sequence numbers on each operation.
To commit an operation, the client encrypts and signs it, and sends it to the server. The server keeps a sequence of encrypted operations which represents the canonical state. Then it broadcasts the operation to the clients. Finally, at each client, it needs to be transformed using OT (since the recipient’s state has probably diverged). Only after these transformations can the operation be applied to the local state, and added to the end of the committed operations.
For access control, we cannot trust the server. Need to preserve causality, and concurrency only makes it harder. Access control is enforced using encryption with a single symmetric key shared between the clients. ACL changes are committed in the same way as update operations. All users also have a public/private keypair. Removing a user is a little more complicated than adding one: need to change the encryption key for subsequent changes and encrypt it under the public key of each of the remaining users, and also encrypt the old key under the new key (so it can be provided to any new users to access old operations).
If two people want to remove different users concurrently, the server will order them, but this creates a challenge for encrypting the new key. Need dependency pointers between operations and barriers: if a dependency pointer crosses a barrier, the operation will be rejected, which maintains the chain of old encryption keys.
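The key chain on user removal can be sketched as follows (a toy stand-in for encryption with invented names — not real cryptography and not the SPORC wire format):

```python
# Sketch of the key-rotation chain on user removal. encrypt/decrypt are
# placeholders, NOT real cryptographic primitives.

def encrypt(key, plaintext):
    return ("enc", key, plaintext)

def decrypt(key, ct):
    tag, k, pt = ct
    assert tag == "enc" and k == key
    return pt

def remove_user(remaining_users, old_key, new_key):
    """Return what gets committed: the new key wrapped for each remaining
    user, plus the old key wrapped under the new key (so future joiners can
    walk the chain backwards and read old operations)."""
    wrapped = {u: encrypt(("pub", u), new_key) for u in remaining_users}
    chain_link = encrypt(new_key, old_key)
    return wrapped, chain_link

wrapped, link = remove_user(["alice", "bob"], old_key="k1", new_key="k2")
assert decrypt(("pub", "alice"), wrapped["alice"]) == "k2"
assert decrypt("k2", link) == "k1"   # the new key unlocks the old one
```

The barrier mechanism exists precisely to keep this chain intact: an operation whose dependency pointer crosses a barrier is rejected, so no committed operation can refer to a key outside the chain.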
To recover from a fork, use OT to resolve malicious forks by treating them as normal divergences.
Implemented as a client library and a generic server. The generic server can support any application since all it does is push encrypted data around. Applications are defined as operations and a transformation function. Implemented a Java CLI client and a browser-based version (using the Google Web Toolkit).
Evaluated on 8-core 2.3GHz AMD boxes, with one for the server, and 4 for clients. Tested the Java CLI version. Looked at latency, server throughput and time-to-join. Connected using 1G Ethernet.
Latency is <30ms for a text editor on up to 16 clients with low load, and slightly worse if everyone writes. Broke this down into various components: the most expensive thing is the RSA signature on outgoing messages (about 10ms per message). Could use a faster scheme like ESIGN to get this down into the microseconds. Client queuing delay gets worse as the number of clients increases on a write-everyone workload.
Server throughput gets up to 20MB/s as the payload size is increased to 16KB.
Q: What happens if an evil person about to be removed tries to add more evil users? The way we deal with these situations is to have multiple privilege levels (admins, writers and readers) with ACLs implemented on the client side. This is only a problem if you have a malicious admin.
Q: Why use a specialised server rather than just using a cloud storage provider or a DHT? In addition to storage, you need a way to keep clients in sync. We thought about doing a completely decentralised system. Advantage of a server is to get timely commits. Server is “highly-available”.
An Analysis of Linux Scalability to Many Cores
Looking at Amdahl’s Law limitations (serial paths) in the Linux kernel. For example, Exim spends about 70% of its time in the kernel on a uniprocessor.
Want to know how serious the scaling problems are, and how much work is necessary to fix them (in light of recent efforts to make new scalable OSs).
Used a 48-core x86 box running a recent version of Linux. Looked at 7 real applications, and analysed the scalability limitations of the Linux kernel, which did exist.
3002 lines of code in 16 patches fixed the scalability limitations. Remaining bottlenecks were in the hardware or applications.
Ran applications with an in-memory file system to avoid a disk bottleneck. Found bottlenecks, fixed them and re-ran the application. Stopped when the application or hardware had a non-trivial bottleneck that they couldn’t fix.
Apps: Exim, memcached, Apache, PostgreSQL, gmake, Psearchy (inverted index building) and Metis (MapReduce implementation).
Exim throughput (messages per second) collapses around 40 cores and kernel time (per message) increases greatly at this point. Large explosion of time in lookup_mnt function (>30% of time). This function returns metadata about a mount point, under a spinlock. Acquiring and releasing the spinlock were using more cycles than the hashtable lookup, because of the ticket-based spinlock implementation. Cache traffic (atomic inc and read of current ticket value in acquire; increment in release) takes 120-420 cycles on a lightly-loaded system. With high load, the delay goes up to 500-4000 cycles.
Solution could be to implement scalable locks or message passing (like Barrelfish). But since the mount table is rarely modified, use per-core mount tables to cache it. First look up in the local core’s table, else take the slow path through the global table. On modification, need to invalidate the per-core tables, but since this is rare, it leads to better scalability. This fix removes the collapse at 40 cores.
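The per-core cache idea, in miniature (illustrative Python; the real fix is in-kernel C with proper locking):

```python
# Sketch of a per-core mount-table cache. Lookups hit a per-core table with
# no shared cache-line traffic; the rare modification invalidates every
# per-core copy.

N_CORES = 4
global_table = {"/": "rootfs"}
per_core = [dict(global_table) for _ in range(N_CORES)]

def lookup_mnt(core, path):
    local = per_core[core]
    if path in local:                 # fast path: purely core-local
        return local[path]
    val = global_table.get(path)      # slow path: global table (locked in the kernel)
    if val is not None:
        local[path] = val             # refill the per-core cache
    return val

def modify_mount(path, value):
    global_table[path] = value
    for local in per_core:            # rare: invalidate all per-core copies
        local.clear()

modify_mount("/mnt", "ext4")
assert lookup_mnt(0, "/mnt") == "ext4"
assert "/mnt" in per_core[0] and "/mnt" not in per_core[1]
```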
Now see that the same functions execute more slowly on 48 cores than on 32, leading to a flattening out of scalability. Next bottleneck was dput(), which modifies the refcount on a dentry. Does an atomic decrement and test on the reference count. Reading the reference count is slow, because it is atomic, which makes cache coherency traffic occur. Concurrent cores must retry because a single core gets a hardware cache-line lock on the value.
Solution is “sloppy counters”: each core holds a few spare references to each object, and the reference count is split across per-core counters. Sloppiness occurs because references taken on a core are served from its local spares rather than the shared count. On an unlink, the kernel gathers the sloppy per-core information into the shared counter, and disables the per-core counters. Uses O(N) space in the number of cores, but leads to no cache misses in the common case. Leads to an increase in scalability over the per-core lookup table fix.
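A sloppy counter in miniature (invented structure; real kernel code packs per-core counters into cache lines and takes batches of spare references at a time):

```python
# Sketch of a sloppy counter: each core keeps spare references locally, so
# the common case touches only per-core state and causes no coherence traffic.

N_CORES = 4

class SloppyCounter:
    def __init__(self):
        self.shared = 0
        self.spare = [0] * N_CORES    # per-core spare references

    def get_ref(self, core):
        if self.spare[core] > 0:      # common case: consume a local spare
            self.spare[core] -= 1
        else:
            self.shared += 1          # slow path: touch the shared count

    def put_ref(self, core):
        self.spare[core] += 1         # release goes to the local spare pool

    def true_count(self):
        """On unlink, gather the sloppy per-core state into one number."""
        return self.shared - sum(self.spare)

c = SloppyCounter()
c.get_ref(0); c.get_ref(1); c.put_ref(0)
assert c.true_count() == 1            # one reference still held (by core 1)
```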
In all, 9 areas were considered this way. Some applications were also modified (amounting to 60 lines of code). All well-known parallel programming techniques.
Almost all applications get some speedup (except Psearchy and Metis, which were already quite high).
Remaining bottlenecks include hardware queues on the NIC, application-level contention, cache capacity and DRAM throughput.
Limitations: only looked at 48 cores and a small set of applications. Some problems still loom: fork, virtual memory, page allocation, file system and concurrent address-space modifications.
Q: What was your experience in mainlining these patches? Haven’t tried because working on the talk. Sloppy counters are probably questionable. Are you going to try to submit? We intend to think about it.
Q: Do you really believe that hacking at each piece of shared state one at a time is the right way to go? No, but up to 48 cores it seems to work well. What do you really think? I have no idea.
Q: How many of your optimizations would work if you ran multiple applications on subsets of the core? Are they general enough for this case? I think so, since we didn’t target particular applications with these changes, but some applications may not benefit.
Q: How did you implement sloppy counters given that cache lines are pretty large, packing or wasting memory? We packed multiple per-core counters onto a single cache line. Did you find that managing these counters then led to a lot of overhead? For some counters, yes, such as sockets, which allocate a dentry per socket. The fix was not to allocate a dentry for a socket.
Q: A big ball of code wins in the long term because it is easier to make larger changes as well as smaller changes. Did you find that to be the case (that it was easier to modify the Linux kernel than other microkernels)? [Taken offline.]
Q: Looking at spinlocks, why did your original results have an elbow in the curve at 40 cores, rather than a gradual degradation? The interconnect on the machine becomes congested. The other effect was an n-squared effect based on the number of waiters.
Q: Should we or should we not be designing new OS architectures for multicore systems? Up to 48 cores, it doesn’t seem like Linux has a problem.
Trust and Protection in the Illinois Browser Operating System
Today, the web is ubiquitous, and runs in web browsers on many different operating systems.
However, these applications are vulnerable to attacks at many layers. There are web-app vulnerabilities, browser vulnerabilities, and OS and library vulnerabilities. The last two are rarer but also the most severe, so we consider only these in this talk.
Web browsers, such as Firefox, are monolithic and easy to exploit, since they run in a single address space. “Secure” web browsers are not enough, since they still have a huge TCB (including the TCP stack, X server, file system, drivers, etc.). A microkernel would be better, but still need to trust all of the system components.
Illinois Browser OS (IBOS) is a new OS design to make a single app more secure, which leads to a reduced TCB.
The IBOS kernel makes browser abstractions first-class, i.e. web page instances, cookies, IP connections, etc.
Security invariants check key properties without understanding the implementation of the particular components. Challenge is to check security properties using these “invariants”.
An IBOS label is a tuple containing a protocol, domain and port. These are inferred and applied by the kernel. Leads to different “web page instances” for different originating domains. Since these may then access several servers (for content, images, ads, etc.), the web page instances can spawn multiple “net processes”. Enforcing this requires a split driver architecture: on sending an Ethernet frame, the IBOS kernel checks the TCP port and IP address against the net process’s label, and hands off the actual transfer to the NIC driver.
A basic frame buffer driver splits the screen into three parts: an address bar at the top, which shows the current page address; toolbars that provide global control of the system; and the content itself. The IBOS kernel has a tab abstraction which multiplexes the display between different web page instances. Input devices are routed to the visible tab, which prevents a keylogger from hijacking another instance’s session.
Implemented on L4Ka::Pistachio, using WebKit and Qt. Also provide a “Unix-like” API through Qt on uclibc, which supports a PDF viewer.
TCB size is 42 KLOC, compared to 5.6 MLOC for Firefox+Linux+X, and 4.4 MLOC for ChromeOS. Could potentially use formal methods to verify it. Avoided 27 CVE exploits published in 2010, and failed to prevent only 1, relating to device drivers. Also avoided 135 (77%) of the browser vulnerabilities published in Google Chrome’s bug tracker. Scored well on memory-exploitation and sandbox-bypassing vulnerabilities.
Evaluated page load latency time for Google Maps, Bing, Craigslist, CS@Illinois, Wikipedia and Facebook. IBOS is close to Firefox and Chrome, and sometimes better (though sometimes a little worse).
Q: What do you think is fundamental about the web model that makes this different from core microkernel work? Web browsers are a very important application, so borrowing ideas from OS design could help to improve them.
Q: Does the IBOS kernel manage DMA, and is it specific to a particular network device? Currently only support one driver, but should generalise to other drivers that support DMA buffers.
Q: Does the design inherently prevent cross-site scripting, or can this be relaxed? Don’t actually prevent these attacks, but with the right policy, we can contain these attacks. The paper details a custom policy that allows sites to communicate with other sites. Is your presentation running on top of IBOS? Yes.
Q: Is there any reason why Firefox or IE could not run on top of IBOS? They could if you changed the windowing abstraction.
FlexSC: Flexible System Call Scheduling with Exception-Less System Calls
System calls are typically synchronous and exception-based for the call and return. This is inflexible for scheduling, and expensive due to the mode-switch and pollution of processor structures when you multiplex these at a high frequency.
FlexSC is a shim that sits above the OS and provides exception-less syscalls, and FlexSC-threads is a threading library that uses FlexSC. Improves throughput in MySQL and Apache without changes to their codebase.
Took an application (Xalan from SpecCPU 2006) that does almost no syscalls, and injected exceptions with various frequencies. Emulated a null syscall and a write syscall. Measured only user-mode time and assume that it will be unaltered with the exceptions. However, almost half of the performance is lost. Indirect (pollution of structures) cost is the dominant problem.
Indirect cost: a Linux write() call can cause up to 2/3 of the L1 data cache and data TLB to be evicted. This also impacts the performance of OS code.
Exceptionless syscalls use a syscall page into which calls and results are written; syscall kernel threads do the processing. This leads to fewer mode switches, allows for batching of syscalls, and allows for dynamic multicore specialization, which further reduces the costs. In the ideal case, there is no need to context switch at all. The application writes its syscall registers into an entry on the page, and loops doing other work while the result is not marked as DONE.
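One syscall-page entry, in miniature (invented layout and status values; the real mechanism uses shared memory pages polled by in-kernel syscall threads):

```python
# Sketch of an exception-less syscall: the application fills an entry on the
# syscall page and keeps doing other work; a syscall thread executes the
# request and flips the status to DONE. No mode switch on submission.

FREE, SUBMITTED, DONE = 0, 1, 2

class SyscallEntry:
    def __init__(self):
        self.status = FREE
        self.number = None
        self.args = None
        self.result = None

def submit(entry, number, *args):
    entry.number, entry.args = number, args
    entry.status = SUBMITTED          # just a memory write, no exception

def syscall_thread_step(entry, handlers):
    """One scan step of a (kernel-side) syscall thread over the page."""
    if entry.status == SUBMITTED:
        entry.result = handlers[entry.number](*entry.args)
        entry.status = DONE

handlers = {1: lambda buf: len(buf)}  # stand-in for a write() handler
e = SyscallEntry()
submit(e, 1, b"hello")
while e.status != DONE:               # the app would do other work here
    syscall_thread_step(e, handlers)
assert e.result == 5
```

Batching falls out naturally: the application can fill many entries before anything crosses into the kernel, and the syscall threads drain them together.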
Syscall threads are kernel-only threads that are part of the application process and run in the same process space. They execute requests from the syscall page and are schedulable on a per-core basis.
System call batching allows user space to make as many calls as it can, then jumps into the kernel which executes all of them before switching back to user mode.
Dynamic multicore specialization allows the OS to map syscall threads to particular cores, and user-mode threads to other cores. This can adapt to workload needs.
The user-mode interface is quite different to synchronous system calls. However, it is very suitable for event-driven servers, which already process asynchronous events. Multi-threaded applications can use FlexSC-Threads, which is compatible with Linux pthreads and requires no changes to the apps or even recompilation.
FlexSC-Threads is a hybrid (m-on-n) threading model, with one kernel-visible thread per core. It redirects system calls using libc wrappers to the syscall page, and switches to another runnable thread. When all threads are blocked, the kernel is entered, and we wait for at least one syscall to complete.
Evaluated on Linux 2.6.33 on a quad-core Nehalem, running at 2.3GHz. Sysbench on MySQL and ApacheBench on Apache were the workloads. Compared Linux NPTL (”sync”) versus FlexSC-Threads.
OLTP on MySQL on a single core sees a 15% performance improvement in throughput as the request concurrency increases. On 4 cores, see a bigger improvement. For latency, FlexSC dramatically improves the 95th-percentile latency, and a 30% improvement in the average case.
Individual performance metrics gathered using hardware performance counters. FlexSC achieves higher instructions-per-clock cycle than sync. Lower cache miss rates in several structures leads to this in both the kernel and user-mode.
Apache performance improves by 80 to 90% on a single core, and 115% on four cores. Latency is halved on average, and again is much better at the 99th percentile. Instructions per cycle are even better for Apache (almost doubling in user mode and more than doubling in kernel mode).
Exception-less syscalls can coexist with legacy ones. Doesn’t require a drastically new OS architecture.
Q: Did you try to upstream the patches for the kernel portion or plan to? No. Why? A lot of time and effort that I would prefer to use in other ways.
Q: Does this reduce security by sharing memory between the kernel and user-mode? The pages are per-process, within the application’s address space. Anything you could do in a process before, you could do now.
Q: Do you have an architecture for having outstanding syscalls with dependencies between them? No. We rely on servers that have concurrent, independent work to do. There is related work that tries to do this (composite syscalls from Stony Brook?), but we don’t have support for that.
Q: How specific are your observations to the fact that there is a protection boundary, and how much is due to the modularity boundary? A project called “cohort scheduling” made similar observations (jumping around to do different things will cause problems), so some of our observations are related to this. The boundary only adds to the problem because of the direct cost of the exceptions, which we measured earlier.
Q: Did you measure to what extent you get cache pollution from doing different work after the syscall? Didn’t measure it, but we did see that the TLB miss rate increased for some cases due to going through multiple user-level threads. So we could do even better if we tried to restructure the application.
Q: If you dedicate a core to kernel syscall handling, how do you approach the scheduling problem? Tried many things, but the simplest algorithm ended up performing the best: get a static list of cores, and go to the first non-busy core when a request comes in. Turns out to be very elastic, but we were only running a single application. May need to fine-tune this with multiple applications.
Q: Have you noticed or can you imagine a situation where having more cores than threads could cause you performance degradation? Do you have to have more threads than cores? We targeted server workloads. It would be a bad idea to run scientific computing workloads on top of this.
Q: What happens to this system when you have lots of unrelated threads demanding service (since you poll the syscall pages to pick up messages)? How does it scale up? [Taken offline.]
Q: Do you feel like you’re going to be fully backward-compatible? We ran four or five different applications, and the only problem we had was with get_tid (get thread ID), which required a small patch. Other than that, we didn’t see other backward compatibility problems.
Inside the Data Center, 1
Finding a Needle in Haystack: Facebook’s Photo Storage
In April 2009, 1.5PB of photo data (15 billion photos in 4 versions), with a 25TB/week upload rate and 550k images/second serving rate.
Currently: 65 billion photos, 20PB of data. 60 TB/week upload and 1 million images/second serving rate.
Simple design would be to use NFS, with one file per image. Works for a typical website with a small working set and infrequent access of old content (99% hit rate on the CDN). Facebook has a large working set with frequent access of old content, leading to an 80% CDN hit rate.
I/O bandwidth gets wasted doing small metadata I/Os. Large directories required 10 iops per image read, improving to 3 iops per image with smaller directories. Using a file handle cache got to 2.5 iops per image read.
Haystack store replaces the photo server and storage layer in the NFS based design. A typical unit is a 2U node with 12 1TB SATA drives in RAID-6 (as a single volume). A single 10TB xfs filesystem runs across these. Haystack is a log-structured append-only object store, and each node hosts about 100 of these, at about 100GB in size.
File layout has a superblock, followed by “needles” (objects), which are padded to 64-byte boundaries and contain a key, an alternate key (signifying the size of the image), the data and a checksum. Magic numbers are used to reconstruct the Haystack in case of corruption, which doesn’t happen anyway.
The index file has a subset of the needle metadata, and includes the offset of the needle in the log file.
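A minimal sketch of the needle layout in Python (the magic number, field widths and ordering here are illustrative, not Facebook’s actual on-disk format):

```python
import struct
import zlib

MAGIC = 0xFACEB00C  # hypothetical magic number, used to resynchronise after corruption
ALIGN = 64          # needles are padded to 64-byte boundaries

def pack_needle(key, alt_key, data):
    """Serialize one needle: magic, key, alternate key, size, data, checksum."""
    header = struct.pack("<IQII", MAGIC, key, alt_key, len(data))
    body = header + data + struct.pack("<I", zlib.crc32(data))
    pad = (-len(body)) % ALIGN          # pad out to the next 64-byte boundary
    return body + b"\x00" * pad

def unpack_needle(buf):
    """Parse a needle back out, verifying the magic number and checksum."""
    magic, key, alt_key, size = struct.unpack_from("<IQII", buf, 0)
    assert magic == MAGIC, "corrupt needle"
    off = struct.calcsize("<IQII")
    data = buf[off:off + size]
    (crc,) = struct.unpack_from("<I", buf, off + size)
    assert crc == zlib.crc32(data), "checksum mismatch"
    return key, alt_key, data
```

The 64-byte padding is what lets a 32-bit offset in the index address up to 256GB of store, as mentioned in the Q&A below.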
The Haystack photo server takes HTTP requests and turns them into Haystack operations. It builds an index of all images in the Haystack in a hashtable, storing 32 bytes per photo (8 bytes per image scale: a 32-bit offset and a 16-bit size), compared to 600 bytes per inode. A 5GB index can handle 10TB of photos.
From the URL, determine the image and the size from the in-core index. 1 iop reads the data from disk. If the size of the image exceeds the maximum size of a single device read, this could take multiple iops, but it doesn’t happen in practice. Use the alloc_size mount option to grow files by 1GB extents. The only problem is when photos cross a stripe boundary (6 or 7% of photos).
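The read path can be sketched as a dict-based index (the real index is a compact in-memory hashtable in C++; field names here are made up):

```python
import io

class HaystackIndex:
    """In-core index: (photo key, scale) -> (offset, size) in the log file."""
    def __init__(self):
        self.entries = {}

    def add(self, key, scale, offset, size):
        self.entries[(key, scale)] = (offset, size)

    def read(self, log, key, scale):
        off, size = self.entries[(key, scale)]  # metadata lookup: no disk I/O
        log.seek(off)
        return log.read(size)                   # single iop in the common case
```

The point is that all metadata lookups are memory-resident, so each photo fetch costs exactly one disk operation.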
Multiwrite for modifications, since users typically upload many images at a time. Asynchronously append images to the file, then do a single fsync(). The index file is asynchronously appended to without a flush (unless a threshold of outstanding dirty records is exceeded), since it can be reconstructed from the log.
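The batched write path might look like the following (a sketch using plain POSIX-style file operations, not Haystack’s actual code):

```python
import os

def multiwrite(log_fd, needles):
    """Append a batch of needles, then flush once: a single fsync() is
    amortised over the whole upload, since users add many photos at a time."""
    offsets = []
    for n in needles:
        offsets.append(os.lseek(log_fd, 0, os.SEEK_END))
        os.write(log_fd, n)
    os.fsync(log_fd)  # one synchronous flush for the batch
    return offsets
```

The index file would be appended to afterwards without a flush, since it can always be rebuilt from the log.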
Delete just marks photos as deleted in the incore index and the needle header (synchronously).
Compaction generates a copy that skips duplicates and deleted photos. Works online and can accept reads, writes and deletes while this goes on.
Haystack directory maintains a logical-to-physical volume mapping. 3 physical haystacks on 3 nodes per logical volume. Writes are balanced across logical volumes, and reads are balanced across physical haystacks.
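A sketch of the directory’s mapping and balancing behaviour (class and method names are invented for illustration):

```python
import random

class HaystackDirectory:
    """Maps each logical volume to 3 physical haystacks on 3 nodes."""
    def __init__(self, mapping, writable):
        self.mapping = mapping    # logical volume id -> list of 3 (node, store) pairs
        self.writable = writable  # logical volumes still accepting writes

    def pick_write_volume(self):
        # writes are balanced across logical volumes
        return random.choice(self.writable)

    def replicas(self, logical):
        return self.mapping[logical]

    def pick_read_replica(self, logical):
        # reads are balanced across the physical copies
        return random.choice(self.mapping[logical])
```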
On upload, the web server receives photos and scales them. The directory provides the logical volume, and all three copies are written synchronously (the latency is relatively small compared to the browser-to-web-server latency).
On download, web server generates URLs in the web page based on information from the Haystack directory. The Haystack cache may return the image data, or else the cache fetches it from the Haystack store. The image is returned to the CDN which returns it to the browser.
Implemented in 8.5KLOC of C++. Took two engineers 4 months from inception to initial deployment.
May require more improvements to handle high-res images; software RAID6 or RAID10 might improve performance. Considering new drives with on-board Flash to store the index. Currently, SSDs are not cost-effective.
Q: Did you find prefetching to be useful or are your reads so random? More and more, we are seeing photos accessed by tagging, so prefetching on the album index structure is less useful. Thumbnails are usually cached, so it looks purely random read access to Haystack.
Q: Have you considered migrating needles between Haystacks for load balancing? RAID6 limits us in this respect, and is simply horrible for mixed workloads. RAID6 would not cope with migration.
Q: Are high-res images supported for existing images? No, we don’t have the data stored.
Q: What do you get out of letting a Haystack grow as big as 100GB? This cuts down on file system metadata, but they could be smaller. Some Haystacks are even larger, but they run just fine. The only limit is 256GB, beyond which we would need 64-bit offsets in the index. Also want to keep the Haystack directory as small as possible, so big Haystacks keep this down. Want to distribute this in future.
Availability in Globally Distributed Storage Systems
Globally distributed storage systems are built from commodity servers, arranged into racks, then clusters.
200+ clusters, pools of thousands of clients, 75PB filesystems.
Want to look at how available is our data, what are the causes of unavailability, and how can we tune our systems to make this better?
95th percentile of unavailability periods is about 15 minutes. Median due to node restarts is about 1 minute. Planned reboots cause longer unavailability periods (closer to 10 minutes median). Unplanned reboots are more severe, albeit not much (a few more minutes).
Correlated failures lead to a “failure burst”, which corresponds to many failures happening within 2 minutes of the previous one. 37% of all failures are part of a burst across at least 2 nodes. This has been fairly consistent over the last year. Showed an example of a burst of 50 servers within quick succession (failure of a shared power or network domain). Also see bursts of about size 20 (rolling reboot of machines with planned staggering, or a cascading failure).
Small failure bursts are not necessarily rack-correlated, but large failure-bursts are highly rack-correlated.
Came up with a scoring function for rack-correlation: the probability that a burst of the same size affecting randomly-chosen nodes in that cell would have a smaller burst score.
Data is stored in stripes, made up of chunks with fixed-size data and code blocks. Use erasure coding (such as Reed-Solomon) to reconstruct missing chunks from available ones.
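Reed-Solomon itself is more than a few lines, but single-parity XOR shows the reconstruction idea (a toy stand-in, not the encoding Google uses):

```python
def parity(chunks):
    """Compute one XOR parity block over fixed-size data chunks."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

def reconstruct(stripe, idx):
    """Rebuild chunk `idx` of a stripe [d0, ..., dn-1, parity] when it is lost:
    XOR of all the surviving blocks recovers the missing one."""
    out = bytearray(len(stripe[0]))
    for j, c in enumerate(stripe):
        if j == idx:
            continue
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)
```

Reed-Solomon generalises this to tolerate multiple simultaneous losses, e.g. any 3 chunks of a (6, 3) stripe.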
Perform rack-aware placement on blocks. MTTF increases when rack-aware placement is used, for all encoding schemes (replication and Reed-Solomon). Different encoding schemes do better for different sizes of bursts.
Used a Markov chain to model the number of available chunks, with transitions due to chunks failing and recovering.
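The flavour of the model can be sketched as a birth-death chain over the number of available chunks (a hypothetical model in the spirit of the paper, not its exact parameters, and solved here by plain Gaussian elimination):

```python
def mttf(n, lam, mu):
    """Expected time from n available chunks to data loss (0 available).
    In state i, a chunk fails at rate i*lam; a missing chunk recovers at
    rate (n - i)*mu.  Unknowns E[1..n] with E[0] = 0 satisfy:
        (i*lam + (n-i)*mu) * E[i] = 1 + i*lam*E[i-1] + (n-i)*mu*E[i+1]"""
    A = [[0.0] * n for _ in range(n)]
    b = [1.0] * n
    for i in range(1, n + 1):
        fail, rep = i * lam, (n - i) * mu
        A[i - 1][i - 1] = fail + rep
        if i >= 2:
            A[i - 1][i - 2] = -fail
        if i < n:
            A[i - 1][i] = -rep
    # Gaussian elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    E = [0.0] * n
    for r in range(n - 1, -1, -1):
        E[r] = (b[r] - sum(A[r][k] * E[k] for k in range(r + 1, n))) / A[r][r]
    return E[n - 1]  # start with all n chunks available
```

For two replicas with no repair (lam=1, mu=0) this gives 1/2 + 1 = 1.5, matching the analytic answer; raising the repair rate mu raises the MTTF, which is the paper’s recovery-time finding in miniature.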
Findings on recovery time: using Reed-Solomon (6, 3), a 10% reduction in recovery time leads to a 19% reduction in unavailability, without correlated failures.
Correlation matters a lot. Modelling correlated failures gives a substantial reduction in MTTF. However, it reduces the benefit of increased data redundancy.
Improving availability below the node layer does not significantly improve data availability; reducing the node failure rate is the most useful thing to do.
Multi-cell replication (across data centers) leads to a higher MTTF, but there is a trade-off between limiting the inter-cell recovery bandwidth and higher replication.
Q: Could controlling the order of planned upgrades and so on have a big impact on MTTF/availability? Yes. We need to do a better job of teasing apart all the reasons for unavailability (finding exact causes).
Q: Is it true that most of the time machines are unavailable due to long-lasting events or lots of short events? Graphs in the paper answer this. There are challenges in trying to model rare events, and we tend to be conservative.
Q: Can you talk about the cost effectiveness of different encoding schemes, and are you actually using erasure coding? We are considering using Reed-Solomon. What’s the drawback and why didn’t you start with this? For MapReduce, you might have parallel copies that you want many workers to read from. Also chunk recovery is more involved for Reed-Solomon, and has to touch more servers.
Q: If the number of cores per box goes up, what effect would this have on availability (with a single motherboard failure taking out more work)? Also, what about power saving and TCO? Multi-core servers are interesting to us in terms of work done per watt of electricity.
Nectar: Automatic Management of Data and Computation in Datacenters
Study shows that 20 to 40% of computation in a data center is redundant (based on production clusters in Microsoft). Over 50% of storage space is occupied by obsolete or seldom-used data sets (based on research cluster in MSR). Thus resource management is a problem.
Example application: click-log processing. All computations tend to do parsing and pre-processing, but only really need to do this once. Also, as the click-log grows, there is re-done computation on the previously-existing data.
50% of data in the research cluster had not been accessed in the last 275 days.
Programs specified using LINQ, which may include user-defined functions (written in a .NET language). Submitted to DryadLINQ, which uses Dryad to execute it on the cluster. Data is stored in an in-house filesystem, called TidyFS.
Nectar interposes on submission to rewrite the LINQ program, performing lookups on a cache server. Dryad is modified to add cache entries based on computations that actually run. The distributed store contains Nectar program and data stores.
Mostly, the cache server gets several partial hits on the submitted query. The program rewriter will generate many programs that incorporate these hits into the submitted query, and choose the one that is cheapest to execute.
Considered an incremental MapReduce-style computation (Select-GroupBy-Select). Can cache the result of the GroupBy and the result of the Reduce. A new operator called MergeGroup can incorporate the previous GroupBy results with the results of Mapping and Grouping the increment data.
The executed program is stored in the Nectar program store, which contains a fingerprint based on the program and the data. The program fingerprint is computed from a static analysis of the program text. Dryad puts the result data in both the user store and the Nectar data store, and puts a mapping in the cache server from the fingerprint to the data store.
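The fingerprint-keyed cache can be sketched as follows (hashing and cache shape are illustrative; Nectar’s real fingerprint comes from static analysis of the LINQ program and its dependencies):

```python
import hashlib

def fingerprint(program_text, input_fps):
    """Cache key derived from the program text plus the fingerprints
    of its input data sets, so the same query over new data misses."""
    h = hashlib.sha256(program_text.encode())
    for fp in input_fps:
        h.update(fp)
    return h.hexdigest()

cache = {}  # fingerprint -> URI of the result in the Nectar data store

def lookup_or_run(program_text, input_fps, run):
    """Return a cached result URI, or execute the job and cache its result."""
    fp = fingerprint(program_text, input_fps)
    if fp not in cache:
        cache[fp] = run()  # execute on the cluster, record the result URI
    return cache[fp]
```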
If the data has been deleted (due to garbage collection or a lack of space), the program store provides a program that can be used to regenerate the file.
The cache server uses a SQL database to keep the cache entries. A cache entry has usage stats, lookup stats and a URI to the results. The garbage collector uses these stats to make victim decisions.
Cache policies for deciding when to insert and delete entries. Final results are always cached, but sub-queries are cached when a subcomputation is popular enough. Deletion policy computes a “value” for each entry, based on the size of the result (cost), cumulative execution time (saving), the last time it was used and the number of times it was used.
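A possible shape for the value function (the paper combines these same inputs; the exact formula below is ours, not Nectar’s):

```python
import time

def entry_value(size_bytes, cum_exec_secs, last_used, uses, now=None):
    """Hypothetical score: machine time saved per byte stored, decayed by
    how long the entry has sat idle.  Low-value entries are evicted first."""
    now = now if now is not None else time.time()
    idle_days = (now - last_used) / 86400
    return (cum_exec_secs * uses) / (size_bytes * (1 + idle_days))
```

Under any such formula, entries that are large, rarely used, or stale score low and become garbage-collection victims.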
Garbage collector uses mark and sweep, based on information in the cache server. However, the cache can be inaccurate (due to concurrent execution, or lack of usage statistics). Leases protect new datasets.
Evaluated based on logs from production clusters. 36000 jobs written in a DryadLINQ-like language. Simulated the effect of caching on these jobs. Between 15% and 65% of jobs would benefit from caching, across 25 production clusters. Overall, 25% of jobs benefit.
Also performed controlled experiments on a research cluster. Evaluated a common subcomputation (parsing and lexical analysis) shared between three jobs, which leads to about a 98% saving in machine time.
Evaluated incremental click log processing as data was added from day to day. Saved as much as 50% execution time when executing on the 5th day, having the 4th day’s results cached. Could do better when combining results from existing computations.
Q: Given that storage capacity is increasing so quickly, the precious resources are memory and I/O bandwidth, so do you find that to be true in your system, and how would you evaluate that kind of tradeoff? It is very important not to waste machine time in a data center. Could you model this to find the optimal trade-off between storing and recomputing data (in terms of the garbage collector)? It depends on the storage pressure, so we may need a threshold or ratio to decide how much to store on disk. Some heuristic basically? Yes.
Q: How do you know or ensure that the computations really are deterministic? Is it built into the language or is it something that you assume? DryadLINQ assumes that the result is deterministic. If the user uses random numbers, Nectar will always return the same result. It is not a problem with Nectar, rather it is in the DryadLINQ assumptions.
Q: Are we just reinventing distributed databases under different names, as a community? We did learn a lot from databases (incremental view management and materialized views). But this is work at much larger scales.
Q: Could we combine some of the lineage detection techniques with systems that are a bit more heterogeneous (e.g. transferring data from cluster to cluster)? The big difference is that Nectar maintains the program that generates a derived data set, so we can always regenerate it. However, we assume that we never delete “primary” data sets.
Intrusion Recovery using Selective Re-Execution
Need a combination of proactive security and reactive recovery mechanisms.
Status quo for recovery is limited: anti-virus tools and backup tools. These respectively are limited to predictable attacks, and can cause data loss on reverting. Therefore admins have to track down the effects manually. The challenge is to disentangle changes made by attacker and legitimate users.
This work helps users do this disentangling on a single machine. The idea is to roll back affected objects after an intrusion is found, and then re-execute actions that were indirectly affected. This is a new approach to intrusion recovery, based on an action history graph using re-execution, “predicates” and “refinement”. Implemented a prototype on Linux that recovers from 10 real attacks.
Example attack scenario: modify /etc/passwd to add a new account, install trojans in pdflatex and ls to restart and hide the botnet. Afterwards, a new legitimate user is added.
Strawman idea: taint tracking on all OS-level dependencies in the system. Taint travels through shared files, but we have to be conservative in assumptions about how taint spreads, and so taint spreads too widely.
Second strawman: everything executes in a VM, and periodic checkpoints are taken. When an attack input is detected, rollback to the previous checkpoint, and replay inputs except the attack. Re-execution would be very expensive. Also, the original inputs may be meaningless in the context of the new system (due to non-determinism). Deterministic re-execution doesn’t work because the inputs themselves are being changed.
Selective re-execution uses an action history graph, which is a more detailed version of the taint-tracking dependency graph. Actions have dependencies, such as written/read data, exec arguments and exit codes. Selective re-execution only reruns some processes (like adduser in the example scenario). However, what about the exit code from adduser?
Many suspect computations are not affected: an attacker writing to a file may not taint everyone who reads from different parts of the file.
To minimize re-execution, specify predicates which allow Retro to skip equivalent computations. Also use refinement, which only replays fine-grained actions (not entire login sessions).
Example predicate: exit code. With the same exit status as before, no need to taint the admin shell after adduser Alice.
Example refinement: individual functions. So could do getpwname() instead of a generic read() on /etc/passwd.
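The predicate-driven taint propagation can be sketched over a simplified action history (a toy rendering, not Retro’s real API; action names are invented):

```python
def repair(actions, deps, outputs, reexec, attack):
    """Selective re-execution: `actions` come in dependency order, `deps` maps
    each action to its upstream actions, `outputs` records the original run.
    Taint starts at the attack; re-executed actions whose output matches the
    recorded one (the exit-code predicate) stop the taint from spreading."""
    tainted = {attack}
    repaired = dict(outputs)
    for a in actions:
        if a == attack:
            repaired[a] = None                  # roll back: drop the attack input
        elif deps[a] & tainted:
            repaired[a] = reexec(a)             # re-execute the affected action
            if repaired[a] != outputs[a]:       # predicate failed: taint spreads
                tainted.add(a)
    return repaired, tainted
```

In the talk’s example, a trojaned ls re-executes to different output and stays tainted, while adduser for the legitimate user returns the same exit code and the admin’s shell is never re-executed.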
What if the effect of the attack was externally visible? Cannot solve this in the general case (e.g. recalling spam sent from an attacked host). But could, e.g., email the user a diff of the output from a terminal session.
Implemented in about 4000 lines of C and another 200 lines of Python for the repair controller. Uses btrfs as a checkpointing file system, and ptrace to shepherd the re-execution of processes and skip equivalent syscalls.
Evaluated by looking at whether Retro is better than manual repair, and what the cost of using Retro is during normal execution.
Used 2 real-world attacks from a honeypot. 2 synthetic challenge attacks (including the example scenario) and an sshd trojan. And 6 attacks from Taser. Retro can recover from all attacks. In 6 cases, it required no user input. However, sometimes need to skip the attacker’s login attempt. Two required significant help to get beyond skipped network traffic. However, Retro can pinpoint the exact objects that are involved, which is better than the manual status quo.
Repair cost is proportional to the extent of the attack. Worst case is the sshd trojan, but the other cases taint far fewer objects. Repair time is proportional to the number of objects, rather than the log size. This is much more efficient than VM re-execution.
Evaluated the runtime overhead based on running HotCRP from just before the SOSP 2007 deadline. CPU cost was 35% with storage overhead as 100GB/day. Also looked at Apache-small-static-files and kernel recompilation workloads, which had much higher overhead in terms of CPU and storage. However, can still store 2 weeks of logs on a 2TB disk in the worst case.
Q: How do you pick the level at which you record and re-execute actions? Did you experience any cases where it was fragile to small changes in the replay? We reintroduced higher-level managers in response to false dependencies that we saw. The interface between the repair controller and the repair managers is well-defined so it would be easy to add new abstractions if necessary.
Q: Could you tune the system to record less-precise information which would make it harder to recover, but pay a much lower cost? There are two kinds of trade-offs to make: either the number of user actions to preserve, or assume that some things won’t get compromised. Could also trade-off logging cost for repair time (by re-executing more things).
Q: What do you do about non-deterministic programs and negative information flow? Just want some acceptable execution that isn’t compromised, so actually non-determinism may be okay. Could this modify my data on disk? Yes. Can only provide the guarantee of an “acceptable” execution, not the same.
Q: What if the attacker could assume that Retro would be there: could he cause a large number of false positives? Well, the sshd trojan is perhaps a case like that, but our hope is that it would be much easier to detect these attacks, but we haven’t experimented with this.
Q: Did you evaluate throughput and latency for web workloads? Is there any protection of Retro from attackers? How hard would it be to adapt Retro to a closed-source system like Windows? For Windows machines, you would have to implement many more managers, but we believe the ideas are still applicable. For protecting Retro, we use SELinux to protect the kernel and Retro, but we assume that the kernel is intact, although we could consider a VMM approach. The libc wrappers? Have to be careful with those. [Throughput and latency question dragged offline.]
Static Checking of Dynamically-Varying Security Policies in Database-Backed Applications
Authentication is complicated, especially for web apps, and getting it wrong can have major consequences. Typical development involves thinking of some attacks, and auditing for these. A more principled approach involves setting a security (information flow and access control) policy on a resource.
One approach is dynamic policy checking, by adding metadata to objects in the system, and checking them when before objects interact with the environment. This is easy to add to existing programs, and very flexible, but this only finds bugs for tested program paths, and there is performance overhead.
Alternatively, could do static checking, which can run at compile time and involves no changes to runtime behaviour. But this usually requires extensive annotations, and has limited policy expressiveness if the type system is used.
UrFlow is a new analysis approach for the Ur/Web programming language. Contains a flexible and programmer-accessible policy language based on SQL. However, it runs statically at compile time and adds no runtime overhead.
Ur/Web has integrated parsing and type-checking of SQL and XML (HTML).
Can have simple policies based on whether rows exist in a table. Or can reason about what the client “knows”, such as string literals and contents in the source code. Can also have policies that join between multiple tables.
Can use first-order logic to reason about what is known at each point in the program. Then reconcile that with the SQL query policy (which may also be expressed in first-order logic). Then use an automated theorem prover to prove that the code meets the policy.
Program code is turned into a finite set of execution paths, and symbolic execution is done on those paths. This leads to a series of states, and a series of check predicates.
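A toy rendering of the per-path check (set containment stands in for the first-order theorem prover; the atoms and clause shapes are invented for illustration):

```python
def entailed(facts, policy):
    """A policy is a set of conjunctive clauses (frozensets of atoms), the
    form an SQL-style policy query induces.  A check passes if the facts
    proven along the execution path entail at least one clause."""
    return any(clause <= facts for clause in policy)

def check_paths(paths, policy):
    """Symbolic execution yields, for each execution path, the set of facts
    known at the check site; every path must satisfy the policy."""
    return all(entailed(facts, policy) for facts in paths)
```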
Evaluated using case studies for representative web applications: secrets, poll, user DB, calendar, forum and gradebook. 8 to 134 lines of code in the policies (forum most complicated).
Idea of using SQL is to take advantage of code that is already written in web apps, and get a free lunch from the checking. Hope to add these to a more mainstream language.
Q: Do you have any intention to analyse how hard it is to write policies? The analysis is never going to change the behaviour of the program, so no policies should block legitimate access to data. What about indirect paths to data? This is a hard problem and I don’t have an answer. The code author’s job is made easier by moving to a declarative paradigm.
Q: To what extent can your program give a better answer than just a yes or no from the theorem prover? When the analysis thinks it detects a violation, it displays the instruction that it thinks might be wrong and all of the facts it knows at that point. However, at least this is a static view rather than a dynamic view of the program. There’s probably a heuristic for filtering out irrelevant facts, but I haven’t thought much about the details.
Q: How do you manage the accumulation of security policies over time and becoming cumbersome? The language supports ways of encapsulating common patterns, which can help to shrink down the size of policy that you need to read.
Q: Is there a theoretical difference to existing security automata [inaudible]? Techniques very similar to software model checking. Can it be applied to type analysis? I haven’t seen any type systems that could reason about this.
Q: What about security policies that refer to exogenous facts that don’t exist in the database (like timed event sequences)? You can expand the database to store state fields that allow you to encapsulate the model. This would be worth looking into.
Accountable Virtual Machines
Scenario of a multiplayer networked game. In Counterstrike, the amount of ammo is local state, and fire events are broadcast to other players. So a player can modify the game not to decrement that counter. Such cheats exist and are used.
Cheating is a serious problem in itself, and since gaming is a multi-billion-dollar industry, we need techniques to stamp it out. It is also a symptom of a more general problem: how do you know that a networked system is running as intended?
Want to be able to detect that a remote machine is faulty, and obtain evidence that can convince a third party.
Challenges: mutual distrust and lack of access to the source code to understand how it should work.
Software runs in an accountable virtual machine (AVM) on an accountable virtual machine monitor (AVMM). Maintain a log of network inputs and outputs, and check this log against a reference image. If the AVM is correct, the same inputs will produce the same outputs; otherwise it is considered to be faulty.
Solution involves two pieces. First is tamper-evident logging (based on PeerReview). Log uses a hash chain and messages contain signed authenticators.
Second part is execution logging. How does Alice know if the log is from a correct execution of her software image. Need the AVMM to specify an execution and additionally log all nondeterministic inputs.
Can audit and replay a log. If one player changes the reference image, the logs will diverge, which indicates that the AVM is faulty.
AVMs provide strong accountability for arbitrary, unmodified binaries. Don’t need to trust other participants, the AVMM or any software running on the other machine.
Evaluation based on the example of cheating in Counterstrike. Used a prototype AVMM based on VMWare Workstation 6.5.1 which has a mature logging/replay engine. Extended this with tamper-evident logging and auditing. Used three players on Nehalem boxes running Windows XP SP3.
Can AVMs detect real cheats? If the cheat needs to be installed in the AVM, it can trivially be detected (since the branch counter and instruction pointer will change). Examined 26 real cheats from the internet, and all were detectable.
Could cheats be adapted to subvert the AVMM? There are impossible-to-detect cheats (collusion), guaranteed detectable cheats, and technically-difficult-to-evade cheats, which we can’t guarantee are impossible to hide.
What is the impact on frame rate? With no fps cap (windowed mode, 800×600, software rendering), the frame rate drops by about 13% from bare hardware to AVMM. Still around 137 fps though. The main culprit is execution logging (about 11% overhead), so accountability only costs another 2%.
For a one-hour game, how expensive is auditing? The log is about 8MB per minute, which compresses to 2.47MB. So the log is 148MB. Replay takes approximately one hour.
Idea: stream logs to auditors during the game and have detection almost immediately. Since the machines are quad-core, one plays the game and one does logging, it is possible to do replay on one or two of the spare cores. This causes some drop in the frame rate, but it is still above 100 fps.
Q: What if the issue is that a machine isn’t cheating but is in fact sabotaged? The AVM provides a full record of what has happened, and could demonstrate that the owner is not at fault.
Q: Would it be possible for the malicious party to run another instance in the background to get a plausible-looking log? The two copies would be completely independent, because the authenticators are signed with one machine’s key. The outputs wouldn’t match, so this would be detectable.
Q: Do most of the cheats have the form of changing the inputs? What is the equivalence class of cheats that you can and can’t detect? If the cheat causes the network I/O or control flow to be inconsistent, we can detect it.
Q: What about concurrency (e.g. who fires first) and delaying traffic in the network to change who fires first? Concurrency wouldn’t affect detectability. But from a distributed systems point of view? We can’t guarantee detection.
Q: What if the program that you’re dealing with has non-deterministic output? Randomness is not a problem for us because at the VM level, everything is deterministic.
Q: Is it sufficient to log at the AVM level when the application has latitude to process events in a different order? If this is a legitimate thing for the application to do, then we are vulnerable.
Q: How does privacy interact with the implementation of your system? Would you end up sharing too much information with your adversary (especially in the case of things that are more important than games)? It’s true that AVMs are verbose in what gets logged. The impact on privacy depends on what you’re doing. Auditing a machine that you’re paying for seems reasonable. It has to be okay for you as the auditor to see all computation that is going on. You can delay the online auditing perhaps.
Q: How can you detect cheats that involve improving the player’s skill (or making all the walls invisible or using an aimbot)? Can’t help information leakage with this technology.
Bypassing Races in Live Applications with Execution Filters
Races and atomicity violations can be a big problem, and hard to find. One solution is to put everything under a big lock. Or can go for a complex fix by decomposing behaviour and adding state variables. But it takes a long time to fix these races, even if the root cause is well understood.
Idea of LOOM is to filter out buggy thread interleavings to mitigate bugs while the problem is fixed. Write declarative execution filters to specify the disallowed interleavings.
LOOM also works to implement live workarounds, so it is possible to install an execution filter into a running application. It installs execution filters safely and avoids introducing new errors.
LOOM incurs little overhead during normal execution.
Evaluated LOOM on 9 real races, and it bypasses all evaluated races safely, applies immediately, has little overhead and scales well.
LOOM is divided into Static Preparation (uses the application source, which is compiled down to LLVM bytecode where transformations are applied. This adds a mechanism for LOOM update injection and safety maintenance) and Live Update (passing an execution filter to the LOOM controller, which passes code to the LOOM update engine which is linked into the application binary).
Safety issue: what if the program is running inside a critical section when you update it? Or what if there are order constraints on multiple threads? The LOOM update engine identifies regions of the code where it is dangerous to update if any thread is inside that region. It then evacuates the dangerous region, to get threads out of the region, and applies the update once they are all out.
To control application threads, add conditional breakpoints (in the instrumentation phase) on iterative or recursive control flow (since there is no point in pausing code that can quickly run to completion): i.e. CFG backward edges. Details of the breakpoint mechanism in the paper. They use MRSW locks which are unlocked and reacquired in the breakpoint to pause threads (if the update manager grabs the write lock in the meantime). The back-edges are identified as safe or dangerous. The breakpoint only pauses if the back-edges are safe.
Hybrid instrumentation to reduce the performance overhead. Have two versions of an instrumented function: fast and slow. The slow version includes configurable slots for additional instructions. During the common case (no update), the fast path is always followed; only if the code is updated do the slots get used.
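A rough sketch of the fast/slow dispatch idea in Python (LOOM does this at the LLVM level; the decorator and slot table here are our invention):

```python
slots = {}  # function name -> list of filter callables installed at its slots

def instrumented(name):
    """Wrap a function with a per-function switch: the common case takes the
    fast path with near-zero cost; only after a live update installs filter
    code into the slots does the slow path run."""
    def deco(fn):
        def wrapper(*args, **kwargs):
            hooks = slots.get(name)
            if hooks is None:              # fast path: no filter installed
                return fn(*args, **kwargs)
            for h in hooks:                # slow path: run the filter slots
                h(*args, **kwargs)
            return fn(*args, **kwargs)
        return wrapper
    return deco
```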
Measured the bare instrumentation overhead on Apache, MySQL and SPLASH benchmarks, and the overhead was between -1.84% and 4.11%. This is much better than Pin, which has from 14.94% to 1065% overhead.
Measured the scalability of the number of threads. Overhead from 1 to 32 threads (on a 48-core box) is less than 12% in all cases. There is some anomalous speedup for 4 threads.
Intend to extend this to memory errors and security errors.
Q: Is there a potential for deadlock in your evacuation algorithm? That may be the case, but we didn’t observe this in our evaluation, and we can always “uninstall” the fix if there really is deadlock.
Q: Can you collect information at runtime to aid you with fixing the bug at the source code level? Can write execution filters based on line numbers.
Effective Data-Race Detection for the Kernel
Looked at a data race in the Windows kernel. Thread A ANDs out a bit, and Thread B ORs a different bit in. These can interleave badly, leading to one of the mutations getting lost, and cause a system hang.
Data races are very hard to reproduce since the timings are very tight. And they are hard to debug, since they could be mistaken for a hardware bit-flip. They are also becoming more common, since people are moving away from monolithic locks to fine-grained locking and lock-free approaches.
Previous techniques are happens-before and lockset algorithms. The Intel Thread Checker runs at a 200x overhead, due to logging all synchronization accesses and instrumenting all possibly-racy memory accesses. This prevents practical usage in the field; indeed, false failures arise due to timeouts. These techniques also require complete knowledge and logging of all locking semantics, which may not be feasible in practice (since there are diverse ways of implementing locking, and many of them are used in the kernel).
Trade-off accuracy for efficiency. Can pick up false and benign data races. False data races are ones that cannot actually occur. Benign data races can and might occur, but are intentionally put there by the developer. Statistics counters are an example of this.
Another goal is to give the user control over the overhead (from 0.0x up).
Also want actionable data that is useful for the developer.
First insight: don’t infer where a data race might happen; instead, actually cause it to happen. Second insight: sample memory accesses for controllable overhead. Use code and data breakpoints, selected randomly to get uniform coverage. A data breakpoint pauses the first thread when it hits some memory location; if another thread comes along and accesses the same location, flag the two threads’ states as a possible data race.
Memory accesses are sampled by setting code breakpoints, which lead to data breakpoints being set on the relevant locations.
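The detection step can be sketched without hardware breakpoints: pause at a sampled access and re-read the location afterwards; if the value changed during the pause, some other thread wrote it concurrently. This is a simplified single-threaded simulation (the `pause_hook` parameter stands in for the timed pause, and `racy_writer` plays the role of the concurrent thread), not the real trap-based mechanism.

```c
/* DataCollider-style check, sketched: snapshot the sampled
   location, pause, and re-read; a changed value means another
   thread wrote it during the pause — a data race. */
static int check_for_race(volatile int *addr, void (*pause_hook)(void)) {
    int before = *addr;
    pause_hook();              /* stand-in for the timed pause */
    return *addr != before;    /* 1 = possible race detected   */
}

static volatile int shared;
static void quiet(void)       { }             /* nobody interferes */
static void racy_writer(void) { shared = 42; }/* "other thread"    */

int demo(void) {
    shared = 7;
    int r1 = check_for_race(&shared, quiet);       /* no race: 0 */
    shared = 7;
    int r2 = check_for_race(&shared, racy_writer); /* race:    1 */
    return r1 * 10 + r2;
}
```

The value re-read catches write-after-read races even when the second thread's access is not trapped, which is why it complements the breakpoint-on-breakpoint check.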
Results: most dynamic data races are benign (104/113). Many could be heuristically pruned (86/113). Found 25 confirmed bugs in the Windows OS, with 8 more pending investigation.
Saw two bits in the same byte, each protected by individual locks, have a data race with bits being set and cleared.
The breakpoint approach found data races with as little as 5% overhead.
Future work includes prioritizing benign versus non-benign races, and also looking at (false) data sharing and its performance impacts.
Q: What are your preliminary thoughts on how to identify bugs as benign automatically? There are a lot of unanswered questions there….
Q: How random is your breakpoint setting? The demo considered the ndis and ntfs drivers, and randomly sampled within those. The breakpoints are set based on a feedback loop to get the rate.
Q: How dependent is the technique on having full debug symbols, since the problems may be in third-party device drivers? The dependence is so we know where the memory access instructions are. You can load the module into memory and use some other technique to disassemble them, but we ended up accidentally putting a breakpoint into data.
Q: How is the sampling rate related to the expected time until you detect a particular race? We haven’t investigated that, but we would like to verify that we’re not just getting the easy races.
Ad Hoc Synchronization Considered Harmful
Synchronization is important to ensure the correctness of concurrent programs. Sometimes synchronization code is easy to identify (pthreads calls), but sometimes it is hard to recognise. Sometimes it even uses goto statements.
Ad hoc synchronization harms program reliability. Because it is hard to recognise, the programmer may be unaware of it. This may lead to bugs and performance issues: up to 67% of ad hoc syncs introduced bugs. Program analysis becomes more difficult: ad hoc sync may introduce hard-to-detect deadlocks, cause false positives in data race checkers and confuse performance profiling. It may even cause problems for compilers and relaxed memory consistency models.
Contributions: quantitative evidence of the harmfulness of ad hoc syncs; and a tool (SyncFinder) that automatically identifies and annotates ad hoc syncs (detects deadlocks and bad practices).
Showed an example of a hard-to-detect deadlock in Mozilla JS, which involves three threads. And an example which does an unnecessary sleep in MySQL, causing performance issues.
Ad hoc synchronization leads to benign data races. Leads to false positives in data race detectors (see above).
Diverse kinds of ad hoc synchronization. For example, a while loop, or a goto loop. Or a single condition, or multiple conditions. Or spinning on a synchronization variable, or have a more complicated dependency structure….
To identify ad hoc sync, first need to understand it. Each instance has a waiting side, with a loop body (a “sync loop”) containing at least one exit condition that depends on at least one exit condition variable, one of which is the synchronization variable; and a setting side, which writes to the synchronization variable. First detect the loop, then extract the exit conditions, then detect the set of exit-dependent variables, and finally prune the set to find the synchronization variable.
To prune the sync loop, need to find variables that are shared with a remote thread, and are loop-invariant.
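The waiting-side/setting-side pattern described above might look like this minimal sketch (the names `done`, `waiting_side` and `setting_side` are invented; real ad hoc sync often lacks even the atomics used here, which is part of the problem): the sync variable is loop-invariant inside the sync loop, shared with a remote thread, and written only by the setting side.

```c
#include <stdatomic.h>

/* The synchronization variable: loop-invariant in the sync loop
   and written by a remote thread — the pattern SyncFinder finds. */
static atomic_int done = 0;

/* Waiting side: a "sync loop" whose exit condition reads the
   sync variable. Bounded here so the sketch always terminates. */
int waiting_side(int max_spins) {
    int spins = 0;
    while (!atomic_load(&done)) {   /* exit condition */
        if (++spins >= max_spins)
            return -1;              /* gave up waiting */
    }
    return spins;                   /* iterations spent spinning */
}

/* Setting side: a single write to the sync variable. */
void setting_side(void) { atomic_store(&done, 1); }

/* Demo: the loop gives up while done is unset, and exits
   immediately once the setting side has run. */
int demo(void) {
    int blocked = waiting_side(3);  /* done unset: -1 */
    setting_side();
    int passed = waiting_side(3);   /* done set: exits at once, 0 */
    return blocked == -1 && passed == 0;
}
```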
Report annotated code with the line numbers of sync variable reads and writes.
SyncFinder identifies 96% of ad hoc syncs in the considered applications, with only 6 false positives (out of thousands of loops).
Use cases: a tool to detect bad practices, and an extended race detector in Valgrind.
Q: What about the number of bugs in non-ad hoc synchronization: do you have any data on that? We did some previous work on the reliability of normal synchronization, and we found that the proportion of buggy ad hoc sync is higher.
Q: Why is ad hoc synchronization so pervasive? What is the limitation in sync primitives, or is it just performance? We checked the comments surrounding this ad hoc synchronization. People just want a flexible way to do sync. They assume that it is a short-term fix (or that it will not be used in the common case). Would you suggest an additional sync primitive? Conditional wait and signal.
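The "conditional wait and signal" alternative mentioned in the answer is the standard pthreads pattern; a minimal sketch (hypothetical `ready` flag and function names) shows why it is easier for tools to recognise: the blocking is expressed through well-known primitives rather than a hand-rolled loop.

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;

/* Setting side: publish the condition under the lock and signal. */
void publish(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
}

/* Waiting side: the standard predicate loop — no ad hoc spinning,
   and visible to any analysis or profiler that knows pthreads. */
void await(void) {
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
}

int demo(void) {
    publish();   /* condition already set when we wait */
    await();     /* predicate holds, so no blocking     */
    return ready;
}
```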
Q: Do people do this to avoid a synchronous write on the system bus? Is it always harmful? Your performance assumptions may change in the future on different architectures, and the cost to maintainability is probably higher than the performance win. Are lock-free data structures bad, then? I cannot say that.
Q: Is there any chance that these application developers declare their sync variables as volatile? They may do this, but it doesn’t guarantee that that is how they use it. We thought about this for our tool.
Q: How do you identify the bugs? We checked the code repository for patches on those locations and identify bugs in Bugzilla and the changelog.
Q: How long does SyncFinder take to run on the code base? Could you run it on the Linux kernel? We ran it on the 1 MLOC MySQL, which takes about 2.5 hours. The complexity is data-dependent: OpenLDAP is much smaller but takes almost as long as MySQL.