Archive for the ‘Technology’ Category

OSDI 2010: Day 3

Wednesday, October 6th, 2010

Production Networks

Onix: A Distributed Control Platform for Large-scale Production Networks

  • In traditional networking, a router couples a forwarding plane and a control plane together in one box. Changing one without the other is difficult, which slows down ideas like the TRILL protocol, which took 6 years to deploy.
  • Software-defined networking moves the control plane into a server outside the router that talks to the router using a management protocol, sending down forwarding entries to the router, and receiving information about new flows, etc.
  • One of the problems with L2 Ethernet networks is ARP broadcast, which doesn’t scale. One idea is a centralized unicast ARP server; however, you don’t want a single point of failure.
  • Onix is a software-defined networking platform that implements distributed control. Centralized control logic thought of as a distributed system. Simple API and flexible mechanisms for state distribution, scaling and reliability.
  • Challenges are generality, scalability, reliability, simplicity and performance.
  • The API is based on treating the network as a graph. Nodes are physical (or logical) network entities, such as a forwarding engine, which is connected to a port, which is connected to an end host. Nodes have methods defined on them (e.g. “write flow entry” for a forwarding engine). The graph is a “network information base” (NIB).
  • Switching protocols integrated into the NIB, such as OpenFlow.
  • Need a way to distribute the NIB to the switches and to other control servers. Depending on the application, you might want strong consistency or eventual consistency. Two options: replicated-transactional (SQL) data, and one-hop memory-based DHT (for eventual consistency). For an ARP server: switch topology is hard state, IP-to-MAC mapping are soft state. Applications only interact with the NIB at run-time, after setting up the distribution mechanisms initially.
  • Integrated with ZooKeeper for distributed coordination, in order to do leader election, failover, distributed locking, barriers, etc.
  • Can trade off scalability and reliability, by partitioning the control state variously. Partitioning techniques based on VLAN and MPLS (versus integration techniques like routing on IP prefixes or ATM routing areas). Provide similar interfaces to the user. Can change the subset of Onix instances to which a server connects to vary this. Or give different computational tasks to different instances. Or partition the NIB between instances.
  • Applications can reduce the fidelity of information before sharing it (based on an ATM trick).
  • For reliability, need to deal with various kinds of failures: network element/link failures (application has to deal with that), connectivity infrastructure failure (the network that connects Onix to the networking infrastructure, usually use a management backbone, but could do multipathing in that backbone), and Onix failure (application-controlled, but mitigated by distributed coordination facilities (like ZooKeeper)).
  • Implemented in 150KLOC of C++, Python and Java, from 15 developers in 3 institutions. In QA testing, and will be appearing in products.
  • Implemented Ethane (enterprise security solution) on Onix. Multiple Onix instances are necessary to handle the load from flows being set up.
  • Implemented a distributed virtual switch, which does centralized management of VMM virtual switches. Only need a single instance of Onix with a cold standby for failover.
  • Implemented a multi-tenant virtualised data center, which provides per-tenant L2 networks in a data center. Provides isolation in addressing and resource usage. Need to partition the NIB and management tasks.
  • Implemented a scale-out IP router, which does routing control over multiple physical switches.
  • Q: What is an “application” and how do they multiplex on a single network? What is the implication at the level of routing? Envision only a single application per Onix instance (app is a program that manipulates the NIB graph). Forwarding layer gets programmed using something like OpenFlow (a TCP connection to the controller), which programs a TCAM on the switch. How do you set up the network if there is no routing? Depends on the application. Need configuration support at the beginning.
  • Q: How do you refresh the ARP cache at the host, don’t you need to do broadcast? We didn’t actually build an ARP server, just an example for the talk. Imagine we could still save a lot of traffic from gratuitous ARPs.
  • Q: What is your favourite trick or hack that we might be able to reuse from your implementation? And/or what might you do differently? Favourite idea is that the distribution mechanisms are hidden from you in the NIB, and you can switch out the storage layer with your own if you want.
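
The NIB-as-graph idea above can be sketched in a few lines. This is a hypothetical illustration (class and method names are my own invention, not Onix’s API): network entities are graph nodes, and entity-specific operations like “write flow entry” are methods on the node.

```python
# Hypothetical sketch of the NIB graph: entities are nodes, applications
# manipulate the graph, and the platform distributes state underneath.
class NibNode:
    def __init__(self, kind):
        self.kind = kind
        self.links = []          # edges to other NIB nodes
        self.attrs = {}

    def link(self, other):
        self.links.append(other)
        other.links.append(self)

class ForwardingEngine(NibNode):
    def __init__(self):
        super().__init__("forwarding_engine")
        self.flow_table = []

    def write_flow_entry(self, match, action):
        # In Onix this would be pushed to the switch via OpenFlow;
        # here we just record it locally.
        self.flow_table.append((match, action))

# Forwarding engine -> port -> end host, as in the talk's example.
switch = ForwardingEngine()
port = NibNode("port")
host = NibNode("host")
switch.link(port)
port.link(host)
switch.write_flow_entry({"dst_mac": "aa:bb:cc:dd:ee:ff"}, "output:1")
```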

Can the Production Network Be the Testbed?

  • Hard to evaluate new network services realistically, which makes it hard to decide what should and should not be deployed in production networks. Testbed networks are too small, and maybe don’t have the proper topology. Success on a testbed may not predict success when deployed on a real network.
  • Problem with NetFPGA: expensive to deploy and fan-out is small. Software testbeds are slow and may not have realistic topologies. And getting real users to join an experimental testbed is hard.
  • Solution: network slicing. Divide a production network into logical slices, in each of which a different control plane runs. Users can opt in to a particular slice. Need strong isolation between slices so that actions in one don’t affect another.
  • Control plane makes up forwarding rules and pushes them down into the data plane. Data plane pushes exceptions up to the control plane (e.g. packets that don’t match any rules). Add a slicing layer between the control plane and data plane, and allow multiple control planes to control the same data plane.
  • Network slice is a collection of sliced switches or routers. The data plane remains unmodified, and the slicing layer performs rewrites on the control planes’ rules to ensure isolation.
  • Slicing policy specifies resource limits for a slice (e.g. link bandwidth, maximum number of rules, topology and CPU usage).
  • Now, allow real users to opt-in to slices on a flow-by-flow level. So one slice can handle HTTP, another can handle VoIP, etc.
  • Implemented on top of OpenFlow for communication between the control and data planes. Control logic is moved to a commodity PC. OpenFlow is just a firmware upgrade for various commodity hardware.
  • FlowVisor performs message handling (the slicing layer). On getting an exception from the data plane, FlowVisor looks up which slice controls the packet and passes it to the appropriate controller. When the controller sends back a rule, FlowVisor checks that the update is allowed, and writes it to the data plane if so.
  • Implemented for 20 OpenFlow message types as a transparent OpenFlow proxy in 8261 lines of C.
  • Isolation performance in terms of device CPU usage (other metrics are considered in the paper). Need to prevent CPU exhaustion, using rule-insertion rate limiting. Consider a malicious slice that tries to do DoS on the switch. Isolation reduces this to 40% CPU usage on average (though bursty).
  • Runs on the production network at Stanford, with 15 switches, 35 access points, 25 users and >1 year of use. Host 7 different OpenFlow demos on the same physical network. Also runs in many different research groups.
  • Not explicitly virtualization, because currently only deal with slices of the physical topology. Looking into virtual nodes in future.
  • Q: How do you create isolation between slices so that one slice doesn’t do too many flow setups per second? This is handled by isolating switch CPU usage.
  • Q: What about latency? It adds about 0.483 ms of latency.
  • Q: Don’t you have to have n times as much forwarding memory, which creates problems for realism? Tricks that you can play in terms of caching, but it is true that the number of rules becomes a scarce commodity (second most important after device CPU).
  • Q: [...] Experimental services use the production slice for their control plane. Worked out an in-network control protocol.
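
The slicing layer’s message handling can be illustrated with a toy proxy. This is my own sketch, not FlowVisor’s code (the flowspace/policy encoding is invented): exceptions from the data plane are dispatched to the slice whose flowspace matches the packet, and rules coming back from a controller are checked against the slice’s policy before touching the data plane.

```python
# Toy slicing layer in the FlowVisor style: dispatch exceptions by
# flowspace, and enforce per-slice resource limits on rule insertion.
class Slice:
    def __init__(self, name, flowspace, max_rules):
        self.name = name
        self.flowspace = flowspace   # e.g. {"tcp_dport": 80}
        self.max_rules = max_rules   # rules are a scarce commodity
        self.rules = []

    def owns(self, packet):
        return all(packet.get(k) == v for k, v in self.flowspace.items())

class SlicingProxy:
    def __init__(self, slices):
        self.slices = slices

    def on_exception(self, packet):
        # Look up which slice controls this packet and hand it over.
        for s in self.slices:
            if s.owns(packet):
                return s.name
        return None                  # unsliced traffic gets no controller

    def on_rule(self, slice_, rule):
        # Check the update is allowed before writing to the data plane.
        if len(slice_.rules) >= slice_.max_rules:
            raise PermissionError("rule quota exceeded for " + slice_.name)
        if not all(rule.get(k) == v for k, v in slice_.flowspace.items()):
            raise PermissionError("rule escapes the slice's flowspace")
        slice_.rules.append(rule)

http_slice = Slice("http", {"tcp_dport": 80}, max_rules=2)
proxy = SlicingProxy([http_slice])
```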

Building Extensible Networks with Rule-Based Forwarding

  • Goal is to have flexible forwarding in the network. A network becomes more powerful if its forwarding plane is more flexible. For example, we might want forwarding to be aware of middleboxes (for intrusion detection). But also multipath, active networks, multicast, and so on.
  • However, flexibility is not enough, because it must be secure as well. Need policy support to constrain forwarding directives. And need every entity that appears in a forwarding directive to be able to refuse it. Not respected today (e.g. denial of service, since the source performs forwarding). Or might want to do IP source routing but give every intermediate ISP the ability to refuse the route.
  • Idea: a packet rule, which carries the forwarding directive in the packet itself. All intermediate nodes must validate the rule. Rules encapsulate a proof of policy compliance.
  • Rules are owned by destinations (but they could be owned by other nodes). They are certified by trusted third parties. An extended DNS returns rules instead of addresses. Routers verify the rule signature and follow rule directives. Packets may also have a return rule for allowing data to be returned to the source.
  • Distribution and certification is done in the control plane (in the paper). Talk focuses on the data plane.
  • Assume: an anti-spoofing mechanism (ingress filtering), existence of rule-certifying entities and key distribution, and DNS well-provisioned against DDoS.
  • Rule is a sequence of actions in if-then-else statement form. Conditions are comparisons based on packet headers and router state. Allowable actions are: modify header, forward packet, drop packet or pass up to the control layer.
  • Rule forwarding engine can modify packet attributes, such as adding explicit congestion notification. Arbitrary attributes on a packet may be set. Need to make attribute setting secure, by using strong cryptography or checking the source (to avoid circumventing an IDS middlebox).
  • Rules can be used to prevent DoS, as they can work like network capabilities. Requires the ability to dynamically grant rules, which is unusual for DNS, so would need to do some proxying.
  • Many other examples of applications that work in this architecture.
  • Rules are flexible, policy-compliant and safe (bounded forwarding time, and cannot modify router state or amplify traffic). Invoked functions are controlled by the ISPs and middlebox owners. Resource isolation can be used to prevent attacks.
  • Evaluated several questions: size overhead of rules, forwarding overhead, performance isolation, load on certifiers and security. Only looking at overhead in the talk.
  • Rule compiler transforms the rules into a binary form. Rule size is between 60 and 140 bytes. Average across examples is below 85 bytes (13% overhead on an average IP packet). Overhead is 27% if using RSA signatures.
  • Built a prototype router on top of RouteBricks. Tested on an 8-core Nehalem with 2 dual-port NICs. Compared RBF to plain RouteBricks, and saw little overhead (no overhead for packets over 500 bytes). Can forward 35Gbps on a real traffic distribution.
  • Signature verification happens only at “trust boundary routers”. Results can be cached (about 19 bytes per rule), and new flows tend to represent 1% of backbone line capacity on average (5% in worst case). Existing hardware can cope with this, and it can be parallelized, so throwing more hardware at it will support higher bandwidth links.
  • Q: What if as a receiver, I want packets to take a particular path… can I drop those packets earlier in the path? No, can only verify the previous hop. Would need stronger crypto guarantees to detect that. What about stakeholders who aren’t named in rules? In a way, we are like overlay networks. We do not name entities between waypoints. But since they approve packet forwarding, you could see them as overlay nodes.
  • Q: What is the best-case/worst-case for ping? Latency is negligible in our implementation.
  • Q: Can you comment on deployability? Could use network slicing. Could be partially deployed at some routers and get many of the benefits. What is the incentive for ISPs? The ability to diversify their offering to customers.
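
The rule structure described above (if-then-else clauses over packet headers, with a fixed action set) can be sketched as a tiny interpreter. This is an illustrative encoding of my own, not the paper’s binary format: conditions are predicates on headers, actions are limited to modify/forward/drop/control, and a single pass over the clause list keeps forwarding time bounded.

```python
# Toy rule-based forwarding: one pass over if-then-else clauses; the
# allowed actions are fixed, so forwarding time is bounded and rules
# cannot mutate router state or amplify traffic.
def run_rule(rule, packet):
    for cond, action, arg in rule:
        if cond(packet):
            if action == "modify":
                packet.update(arg)        # e.g. set an ECN-like attribute
            elif action == "forward":
                return ("forward", arg)   # next hop / waypoint
            elif action == "drop":
                return ("drop", None)
            elif action == "control":
                return ("control", None)
    return ("drop", None)                 # default-deny

# A toy rule for the middlebox example: web traffic must traverse the
# IDS waypoint; everything else is refused. (Names are hypothetical.)
ids_rule = [
    (lambda p: p.get("dport") == 80, "forward", "ids_middlebox"),
    (lambda p: True, "drop", None),
]
```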

Mobile

TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones

  • 60 million smart phones sold in Q2 this year. But privacy is very poor on these devices, especially in terms of the applications that run on them (e.g. a wallpaper application on Android). Challenge is to balance fun and utility with privacy. Idea is to look inside applications to watch how they use privacy-sensitive data, and determine how it leaves the phone.
  • Challenges include resource constraints, allowing apps to use some privacy-sensitive information, context-based privacy and applications sharing information.
  • Idea: use dynamic taint analysis, with a taint source, taint propagation and a taint sink. Need to trade off between performance and granularity. TaintDroid does taint tracking on Android, with variable tracking throughout Dalvik, handling native method invocation, and extending to storage.
  • Modified the Dalvik interpreter to store and propagate taint tags on variables. Store tags adjacent to local variables/args on the internal execution stack, for locality. For class fields, store them in instance fields on heap objects. For arrays, store one tag per array to minimize overhead (in particular because many arrays are strings).
  • Various rules for taint propagation through data-flow.
  • JNI is used to execute native methods. TaintDroid doesn’t support third-party native code, but uses a combination of heuristics and method profiles to patch up the tracking state on return from native methods.
  • For IPC and file propagation, taint at the message level and the file level. File taint is stored in an XATTR on the file.
  • On the CaffeineMark benchmark, 14% overall performance overhead. Relatively worse for the “string” workload. 4.4% memory overhead. IPC has a 27% overhead. Macrobenchmark overheads are between 3% and 29% (worst for taking pictures: 0.5 seconds).
  • Sources and sinks are carefully integrated into the existing framework. Need to treat sources differently based on the properties of the information that they provide. IMSI needed particularly careful handling.
  • Started with the top-50 most popular applications in categories on the Android Marketplace, and winnowed it down to 30 applications that used the internet, location, microphone and camera. 105 connections were flagged, but only 37 were clearly legitimate.
  • 15/30 applications were sharing location with a third-party advertiser, usually in plaintext (e.g. in the HTTP GET url for AdMob, which also contained a unique ID for the particular phone). Sometimes every 30 seconds. No sharing was obvious to the user or covered in the EULA.
  • 7 applications sent the IMEI, and 2 apps send the phone number, IMSI, or ICC-ID to a remote server. One application mentioned in the EULA that said the IMEI would be sent, another sent a hash of the IMEI.
  • Limitations of the approach: it tracks only explicit data flows. On sources: the IMSI contains country and network codes, so tainting it whole is too conservative.
  • Source code should be released next Wednesday.
  • Q: What about timing attacks, and what implicit flows do you think are important? Need to do static analysis to capture all implicit flows, and limitations would lead to false positives.
  • Q: Have you thought about providing this application as a website to which people could upload applications? Haven’t thought about that specific model, but definite potential there.
  • Q: How do you stop applications modifying the XATTR to remove taint? XATTR is not otherwise used in Android, so could use LSMs to do this. Accessing it requires third-party native libraries, which we don’t support (can’t access them from Java).
  • Q: Are you planning to do anything for control dependencies, and how would that impact performance and false positives? We are not attempting to identify privacy violations, just that privacy-sensitive information has left the phone. Using the information for control is the next step.
  • Q: Now that TaintDroid is public, what do you see as its future purpose (since people will now try to get around it)? Going to look at more techniques in future (implicit flows), though we can still catch low-hanging fruit where the leak is unintentional.
  • Q: [...]. Understanding how information is going to be used is a tough challenge. Want to integrate privacy-preserving technologies.
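
The variable-level taint tracking described above can be sketched in miniature. This is an illustrative model, not TaintDroid’s Dalvik implementation (tag names and the sink are invented): each value carries a taint tag, data-flow operations OR the tags of their inputs into the result, and a taint sink flags tagged data leaving the phone.

```python
# Miniature taint tracking: tags are bit flags, propagation is the
# union of input tags, and the network-send sink checks the tag.
TAINT_LOCATION = 1 << 0
TAINT_IMEI     = 1 << 1

class Tainted:
    def __init__(self, value, tag=0):
        self.value, self.tag = value, tag

    def __add__(self, other):
        # Data-flow propagation rule: result tag = union of input tags.
        return Tainted(self.value + other.value, self.tag | other.tag)

def leaks(payload):
    # Taint sink: flag any payload tagged with a sensitive source.
    return bool(payload.tag & (TAINT_LOCATION | TAINT_IMEI))

lat = Tainted(37.42, TAINT_LOCATION)   # taint source: location API
msg = lat + Tainted(0.0)               # taint propagates through the sum
```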

StarTrack Next Generation: A Scalable Infrastructure for Track-Based Applications

  • Since many phones can now identify their own location, many apps now use this information. A track is a sequence of (location, timestamp) readings. Applications could use these tracks to e.g. personalise driving directions (collapsing directions that are familiar, such as “Take 101 north”, rather than a bunch of redundant short steps around one’s own house).
  • Application taxonomy: use current location, use past locations or use tracks. StarTrack deals with the last case (e.g. personalised driving directions, track-based search, ride sharing, discovery, urban sensing).
  • Two kinds of application: insertion and retrieval/manipulation/comparison operations. ST server is a database designed to store geometric data efficiently.
  • Challenges: handling error-prone tracks, providing a flexible API and efficiently implementing track operations. (Also must provide scalability and fault tolerance.)
  • Raw tracks are difficult to compare for equality, since there may be noise, and readings may be taken at different times. Canonicalisation enables more efficient retrieval and comparison, by mapping similar tracks onto the same path.
  • API needs to: pre-filter tracks (based on similarity criteria), manipulate them and fetch them. Track collections are abstract groupings of tracks, which aid programming convenience and implementation efficiency (preventing unnecessary message exchanges and enabling caching).
  • Manipulation operations include joining collections, sorting them, taking a subset, getting similar ones, getting common segments, and getting individual tracks.
  • Showed a code fragment for a ride-sharing application in terms of the StarTrack API.
  • In the DB, StarTrack maintains only non-duplicate tracks per user. A table of unique coordinates is kept separately. Quadtrees are used to do geographic range searches.
  • To compute track similarity, compute the length of common segments over the length of all segments in both tracks (i.e. Jaccard similarity). Conventional DBs don’t support this efficiently. The track tree data structure is an efficient representation for this, for doing get-similar-tracks and get-common-segments.
  • Evaluated the performance of track trees and two sample applications (driving directions and ride sharing). Used synthetically-generated tracks, 9 servers and 3 DB servers.
  • Evaluated get-similar-tracks against DB, in-memory filtered, and in-memory brute force approaches. Track tree query takes 1ms for up to 100K tracks in the collection. Other approaches are much more costly. To build a track tree takes up to 2 minutes for 100K tracks, and about 170MB of RAM.
  • Personalised driving directions uses get-common-segments. Cached takes about 50 ms until hitting a wall at about 260 requests per second. Ride sharing can do 30 requests/second at 170ms response time.
  • Have a working infrastructure, and looking for users.
  • Q: Have you tried to pair intent and content with a track? We have the availability to stick metadata on tracks, so different applications could associate different metadata. Expect we could have ACLs on this metadata.
  • Q: How do you handle errors in the tracks? Canonicalisation handles this, pre-filtering obviously-erroneous points. What technique do you use for this? Rely on the underlying algorithm but don’t use statistical techniques.
  • Q: Have you thought about how you might do non-intersecting paths in a privacy preserving way (e.g. to avoid an ex)? Thought about it, but it is very hard. Access control might address this, but would like to have a way to aggregate it using differential privacy techniques.
  • Q: Why does the performance curve hit a wall? Network congestion and packets getting dropped.
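
The similarity measure above (length of common segments over the length of all segments in both tracks) is Jaccard similarity over segment sets. A minimal sketch, simplifying by treating all canonicalised segments as unit length (the paper weights by segment length; the track-tree index that makes this fast is not shown):

```python
# Jaccard similarity between two canonicalised tracks, with tracks
# represented as lists of segment IDs and every segment weighted equally.
def track_similarity(a, b):
    sa, sb = set(a), set(b)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

home_to_work = ["s1", "s2", "s3", "s4"]
home_to_gym  = ["s1", "s2", "s5"]
print(track_similarity(home_to_work, home_to_gym))  # 2 common of 5 total -> 0.4
```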

Virtualization

The Turtles Project: Design and Implementation of Nested Virtualization

  • Approach is to multiplex the hardware between different layers of virtualization. Focus on two layers of nesting.
  • Multiple VMCS for each layer transition. The hypervisor merges VMCSs so that they can run in root mode.
  • Need to avoid multiplying the number of vmexits to one per nesting layer. One idea is to make a single exit fast, and another involves cutting down their frequency.
  • Multi-dimensional paging cuts down the number of exits. The baseline approach uses shadow paging on shadow paging (assuming no EPT). Better to use a shadow page table on EPT (secondary hardware table), which turns three logical translations into two in hardware. Best is multi-dimensional paging: since EPT tables rarely change, the two levels of EPT can be compressed into one (from layer 0 to 2).
  • For I/O virtualization, could do device emulation, or paravirtualized drivers, or (even better) direct driver assignment. This requires an IOMMU.
  • For nested, need to do multi-level device assignment. Idea is to have the bottom level emulate an IOMMU for the first level, and the second level access the device directly.
  • Experimental set up involved kernbench, SPECjbb, and netperf. Evaluated different hypervisor choices for the nested performance. Overhead is 9.5% in the guest, and a further 14.5% in the nested guest for kernbench. For SPECjbb, it’s 7.6% in the guest and a further 7.8% in the nested guest.  vmread and vmwrite operations were the most expensive, so these were hacked to not cause a trap.
  • Multi-dimensional paging improves kernbench a lot (3.5x), but SPECjbb and netperf are roughly the same.
  • Multi-level device assignment is the best performing I/O setup, but it adds an overhead of 3x in CPU cost. Interrupts cause exit multiplication. If we could deliver interrupts directly to the nested VM, the overhead would just be 7%. But can still fill a 1G pipe.
  • Q: Do you have a feeling for how the overhead will scale as the number of levels increases? Hope that additional levels will add approximately the same overhead.
  • Q: Are you confident that you could run this under arbitrary unmodified hypervisors? Not advocating rootkits at all.
  • Q: Would binary translation help to apply changes to e.g. the vmread and vmwrite instructions? Two ways to solve this: either get better hardware support or do binary translation. Not feasible to do binary translation in this case.
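
The multi-dimensional paging idea can be illustrated with translation tables as plain mappings. This is a toy model of my own (page-frame numbers are invented): with two nested layers there are three logical translations (L2 virtual → L2 physical → L1 physical → L0 physical), but the hardware walks only two tables, so the two EPT-like layers are folded into one.

```python
# Toy model of compressing two EPT-like translation layers into one,
# which is cheap because EPT tables rarely change.
gva_to_l2pa  = {0x1: 0x10}    # guest page table inside the nested guest (L2)
l2pa_to_l1pa = {0x10: 0x20}   # L1 hypervisor's EPT for its guest
l1pa_to_l0pa = {0x20: 0x30}   # L0 hypervisor's EPT for L1

def compress(outer, inner):
    # Fold two translation layers into a single layer-0-to-2 table.
    return {k: inner[v] for k, v in outer.items() if v in inner}

l2pa_to_l0pa = compress(l2pa_to_l1pa, l1pa_to_l0pa)
# Hardware now walks gva_to_l2pa and l2pa_to_l0pa: two tables, not three.
```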

mClock: Handling Throughput Variability for Hypervisor IO Scheduling

  • Done this work because customers need it.
  • Hypervisor currently multiplexes hardware resources between VMs. Three possible controls: reservation (minimum guarantee), limit (non work-conserving) and proportional share. Supported for CPU and Memory in ESX Server since 2003. Now looking at I/O resource allocation.
  • Contention for I/O resources can have an arbitrarily bad effect on a VM’s resource allocation. Hard problem for storage (considered in terms of iops) to get reservations or limits, because you need to continuously recompute the allocation in the face of bursty, variable workloads. Also, since this access may be distributed and shared, it is even more complex.
  • A single algorithm to support all controls, handling variable and unknown capacity, and that’s simple to implement.
  • Typical proportional-share scheduling assigns a weight to every application, and tags each request with the application. Then requests are spaced based on their tag and the weight of their application.
  • Three key ideas in mClock: use real-time tags (instead of virtual time tags) based on wall-clock time. Use separate tags for reservation, shares and limits. Then use dynamic tag selection and synchronization.
  • Scheduling proceeds by first doing constraint-based scheduling, then weight-based scheduling once reservations are met.
  • Idle VMs are synchronized when they make a new request.
  • Need to scale the number of requests based on their size.
  • Distributed mClock for clustered storage architectures.
  • Experimented with a 3 to 6 VM cluster with a mixture of Windows and Linux hosts. Workloads: iometer and the Filebench OLTP benchmark. Showed an experiment with diverse workloads and allocations. Compared SFQ(D) to mClock. Showed that mClock closely enforces the limits and reservations for the relevant VMs.
  • Another experiment looking at burst handling, when idle VMs get benefit from spare capacity.
  • Looked at an OLTP workload on both SFQ(D) and mClock. On SFQ(D), the reservation is not respected, whereas mClock manages it.
  • Distributed mClock experiment, with three servers and three clients, and various reservations.
  • Future work involves extracting SLAs and deriving scheduling policies from those.
  • Q: Why does the total number of iops become lower for mClock than for SFQ(D) when two VMs are switched on? Some variance would be acceptable, due to the varying workloads and access patterns etc. Should note that the VMs are not doing exactly the same workloads in each case.
  • Q: Can you comment on how you make the trade-off between throughput and latency? Focussed on getting a certain amount of throughput. Previous work looked at latency. Overall, to do a good job at latency, need support from the underlying array vendor.
  • Q: [...]. We group I/O into 32-request batches onto the LUN.
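
The tag-based scheduling above can be sketched as follows. This is a simplified model, not mClock’s exact algorithm (tag-spacing and idle-handling details differ): each VM carries reservation, limit and share tags in wall-clock time, spaced by 1/r, 1/l and 1/w; the scheduler first serves VMs that are behind on their reservations, then falls back to the smallest proportional-share tag among VMs under their limit.

```python
# Simplified mClock-style scheduler: real-time tags, separate tags per
# control, and a two-phase (constraint-based, then weight-based) pick.
class VM:
    def __init__(self, name, reservation, limit, weight):
        self.name = name
        self.r, self.l, self.w = reservation, limit, weight
        self.r_tag = self.l_tag = self.p_tag = 0.0

    def tag_request(self, now):
        # Tighter spacing for higher reservations/limits/weights.
        self.r_tag = max(self.r_tag + 1.0 / self.r, now)
        self.l_tag = max(self.l_tag + 1.0 / self.l, now)
        self.p_tag = self.p_tag + 1.0 / self.w

def pick(vms, now):
    # Constraint-based phase: any VM behind on its reservation goes first.
    eligible = [v for v in vms if v.r_tag <= now]
    if eligible:
        return min(eligible, key=lambda v: v.r_tag)
    # Weight-based phase, skipping VMs that would exceed their limit.
    under_limit = [v for v in vms if v.l_tag <= now]
    return min(under_limit, key=lambda v: v.p_tag) if under_limit else None

vms = [VM("a", reservation=100, limit=1000, weight=1),
       VM("b", reservation=10, limit=1000, weight=2)]
for v in vms:
    v.tag_request(0.0)
```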

Virtualize Everything but Time

  • Problem is clock synchronization, which is important for network monitoring, traffic analysis, telecoms, finance, gaming, and distributed scheduling.
  • The status quo under Xen is based on ntpd, but virtualization amplifies ntpd’s flaws, and it fails under live VM migration.
  • Propose a new architecture based on RADclock (a clock synchronization solution): a robust, accurate and scalable solution that supports seamless migration.
  • Idea: give each physical host a single clock that never migrates; a stateless clock-read function migrates with each VM. (This is the concept of a “dependent clock”, which is not new.)
  • Context is paravirtualized Xen guests. It would be more complex to do it for fully-virtualized guests, and is not discussed here. Implemented for pvops Linux (2.6.31.13) on Xen 4.0.
  • Clocks are built on local hardware (oscillators/counters). However, they are imperfect and drift (often due to temperature). The TSC counts CPU cycles and is the highest resolution/lowest latency time source, but it might be unreliable. HPET is reliable but low-resolution and high latency. The Xen clocksource is a hardware/software hybrid that the hypervisor provides, and aims to combine the reliability of HPET with the low latency of TSC: a 1GHz 64-bit counter. Performs well, but HPET is not far behind and a lot simpler.
  • Raw timestamps and clock timestamps are different, and scaled counters are insufficient since they drift. Network based synchronization is convenient, but suffers from delay variability and path asymmetry. (Typically assume symmetry.)
  • Two synchronization algorithms under discussion. First is ntpd, which is feedback-based, using the Xen clocksource on Xen. RADclock uses raw timestamps, and can use any raw counter, including HPET and the Xen clocksource.
  • Experimental setup includes an atomic clock, a GPS receiver and a network connection to a stratum-1 NTP server.
  • With the client and time server on the same LAN, and a 16-second polling period, can get 60us error. Using a feedback-directed polling period with multiple servers (but no load, no traffic and high quality servers) can lead to millisecond-level errors.
  • On Xen, ntpd is much worse. Consider a dependent ntpd clock, independent ntpd clocks and migrating independent ntpd clocks. With a dependent ntpd clock, only dom0 runs ntpd and the Xen clocksource is used (no clock sync in the guests). If an independent ntpd clock per guest runs (resource-hungry), works almost as well as non-virtualized but with a bit more noise due to virtualization latency. Adding some load and churn leads to millisecond-sized errors. A migrating independent ntpd clock is worse [...].
  • RADclock does raw counter reads, not clock reads and is independent of the clock algorithm. Has a synchronization algorithm based on raw timestamps and server timestamps with feed-forward.
  • Feed-forward allows you to define a “difference clock” which can determine time differences on a small timescale that is highly accurate. This is inherently difficult in a feedback-based system.
  • Now have dom0 run a RADclock daemon that talks to timeserver, and make information available through XenStore.
  • For migration, only the clock-reading function actually migrates, and no state migrates. However, since we can have asymmetry and it will be different on a different host, this will result in a small clock jump. (No algorithm can do anything about this.)
  • Migration experiment. The ntpd clock has errors on the order of seconds and convergence time on the order of hours. The RADclock-based solution is much more stable.
  • Q: What about the protocol for network synchronisation with RADclock: could you keep the NTP protocol to allow incremental deployment? We do use the NTP protocol.
  • Q: How much is RADclock buying you over NTP when the biggest gain comes from using independent clocks? Could you modify NTP to do dependent clocks well? Problem is that two heads try to control the same state.
  • Q: How does your clock perform under a workload that might cause temperature variation? High temperature is not a problem: rapidly-changing temperature is the problem.
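
The feed-forward “difference clock” idea above can be shown with a two-line model. This is an illustrative sketch (the rate value is invented, and RADclock’s real algorithm estimates rate and offset from NTP exchanges): because the clock is a function over raw counter reads, Ca(t) = K + p * counter, a time difference needs only the rate estimate p, not the offset K, so small intervals can be measured accurately without being disturbed by offset corrections.

```python
# Toy feed-forward clock: absolute time needs the offset K, but elapsed
# time needs only the rate estimate p, which is the "difference clock".
p = 1.0 / 2.4e9          # assumed counter period (seconds per tick, ~2.4 GHz)

def absolute_time(counter, K):
    return K + p * counter              # absolute clock: offset-dependent

def elapsed(counter_start, counter_end):
    # Difference clock: immune to offset corrections, so highly
    # accurate over small timescales.
    return p * (counter_end - counter_start)

print(elapsed(0, 2_400_000_000))        # ~1.0 second at 2.4 GHz
```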

OSDI 2010: Day 2

Tuesday, October 5th, 2010

Deterministic Parallelism

Deterministic Process Groups in dOS

  • Many sources of non-determinism in real systems (shared-memory, IPC, pthreads, devices, etc.). This makes programs hard to test, debug and replicate (for fault-tolerance). Multicore only makes this worse.
  • Solution: OS support for deterministic execution. A “deterministic box” around some processes.
  • What can be made deterministic? What can be done about remaining non-determinism?
  • Internal versus external non-determinism. Internal is not fundamental (arises from scheduling artifacts) and can be eliminated. By contrast, external comes from interactions with the real world, and cannot be eliminated. External includes users and the network. Internal includes channels used between processes in the deterministic process group (shared memory, pipes, private files). Processes outside the group are external sources of nondeterminism.
  • External nondeterminism is controlled with a shim program that controls external inputs.
  • Alternately, could have made a deterministic VMM and put the whole thing in the box, but this is inflexible and costly.
  • dOS is a modified version of Linux.
  • Motivating examples: parallel computation and a web server. Consider a parallel (scientific?) computation that operates from local input files. This should execute deterministically even on a multiprocessor. A web server has a shim that does deterministic record/replay, but since internal nondeterminism is eliminated, the recording is much less costly. This makes it easy to replicate multithreaded servers. Could move web server request processing into a DPG and have all the I/O outside.
  • New system call sys_makedet() which creates a deterministic box that expands to include all children. Processes are otherwise exactly the same as Linux processes.
  • Semantics of internal determinism: processes execute as if serialized on a logical timeline.
  • External nondeterminism: e.g. a read() from a socket is nondeterministic in both what is returned (the data) and when it returns (the blocking time). The shim program interposes on such calls; it could record the call or replay it from a previous recording, and it controls the logical time at which the call returns.
  • Other uses: deterministic replication with DPGs and shim programs. Shim ensures that replicas see the same input at the same logical time. 2PC is detailed in the paper.
  • Implemented on Linux 2.6.24 on x86_64. 8000 LOC across 50 files changed. Transparently supports unmodified binaries. Needed to modify thread scheduling, virtual memory and syscall entry/exit. Paper describes the implementation challenges.
  • Use the DMP-O algorithm from previous ASPLOS papers. Chose this because it is easy to implement. Threads run in parallel before communication detected, then communication is serialized. Detection using an “ownership table” that maps data to threads. Parallel and serial modes execute atomically and serialize onto the logical timeline (one time step per mode switch).
  • All loads and stores must be instrumented to manage shared memory, as must all syscalls for in-kernel channels (pipes, files, signals, but also address space, fd table (implicit/covert channels)). Page protection hardware is used to monitor shared memory at a page granularity.
  • Evaluated on an 8-core Xeon with 10GB of RAM. Each application runs in a DPG. Used the racey deterministic stress test to verify determinism. How much internal nondeterminism is eliminated? How much overhead does dOS impose? How does it affect parallel scalability?
  • Compare log sizes for dOS and SMP-ReVirt. SMP-ReVirt produces logs that are 4 orders of magnitude bigger.
  • Overhead compares no-DPG to DPG-only to DPG-with-execution-record.
  • For Apache, DPGs can saturate a 1G network (100KB static pages). For 10KB static pages, nondeterministic saturates, DPG drops 26% and record drops 78% (from nondeterministic). For Chromium, the slow down is about 1.6x for both DPG and record.
  • Ran parallel benchmarks on 2, 4 and 8 threads. Blackscholes, lu and pbzip benchmarks have 1 to 3x slowdown in DPGs. Dedup, fmm and make suffer more due to false sharing on a page.
  • Q: Do you intend only to run programs deterministically for debugging, or should it be always-on? Ideally always, but some workloads will not be suitable.
  • Q: How do you decide on a timeline? Logical time increments either when a fixed set of instructions are reached, or [...].
  • Q: Does putting in debug code distort the execution? Yes.
  • Q: Is it possible to write a custom MPI that replaces fine-grained shared memory with message passing that would get rid of much of the overhead? It’s certainly possible, but we haven’t done that.
  • Q: Is non-determinism really bad? Splitting between internal and external handles that, and gives you control over where the split lies.
  • Q: Did you measure CPU utilization in your Apache benchmarks? Yes, but I don’t have numbers.
  • Q: Did you think about how else you might address the fine-grained memory sharing otherwise? Could use hardware support or recompile the program using a transforming compiler. These might get some of the scalability back.
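The ownership-table mechanism behind DMP-O can be sketched as follows. This is a toy illustration of the idea as described above, not dOS's actual implementation: threads run in parallel until one touches data owned by another thread, at which point execution serializes and the deterministic logical clock advances. All class and method names are invented for the example.

```python
class OwnershipTable:
    """Maps each datum to the thread that may access it without serializing."""

    def __init__(self):
        self.owner = {}          # datum -> owning thread id
        self.logical_time = 0    # advances once per parallel/serial mode switch

    def access(self, tid, datum):
        """Return True if the access stays in parallel mode."""
        current = self.owner.setdefault(datum, tid)
        if current == tid:
            return True          # thread-private: no communication detected
        # Communication detected: enter serial mode, transfer ownership,
        # and advance the deterministic logical clock.
        self.owner[datum] = tid
        self.logical_time += 1
        return False

table = OwnershipTable()
assert table.access(1, "x")       # thread 1 first-touches x: parallel mode
assert table.access(1, "x")       # repeated private access: still parallel
assert not table.access(2, "x")   # thread 2 touches x: serialize
assert table.logical_time == 1
```

In dOS the same detection happens at page granularity using page-protection hardware; the dictionary here just stands in for that table.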

Efficient System-Enforced Deterministic Parallelism

  • Parallelism is a grand challenge, and nondeterminism makes it harder. Races are everywhere: memory access, file access, synchronization….
  • Do we have to live with races? Restrictive languages (such as Haskell, SHIM, DPJ) don’t allow races to occur in the first place. Could we have race-freedom in any language?
  • Determinator is a new OS that offers race-free parallel programming. Compatible with existing programming languages, from Java through C to assembly.
  • Core idea: a check-in/check-out model for shared state. Fork checks out a copy, threads work on a private working copy of shared state, and join checks in and merges changes. Idea from parallel Fortran DOALL, or version control.
  • Example is a multithreaded game/simulation/actor-based program, where threads share a global array. In Determinator, each thread gets a copy of the array, and join merges the diffs.
  • Another example is Parallel Make, where there is a Heisenbug (two parallel jobs use the same temporary filename). Determinator creates a working copy of the file system for each parallel job, so the file doesn’t get overwritten.
  • Writes only get communicated at synchronization. Write/write races either go away or become deterministic conflicts, which cause a runtime exception.
  • Determinator is a ukernel, which enforces a strict hierarchy of process “spaces”. Only the root space can directly do external I/O, and children can only communicate with their immediate parent. Single thread per space. Each space has a single address space, but with a backup “shadow” space.
  • ukernel API is very small (PUT, GET and RET). User-level runtime provides higher-level abstractions (C library, pthreads, file system, process management).
  • Join does a three-way diff between the parent’s working copy, the original copy and the child’s final working copy. Optimized using VM copy on write. Can do merge at page-level (faster) or byte-level.
  • Can emulate a shared file system (non-persistent) with weak consistency. Can do merges at the file rather than byte granularity. Only used for intermediate results.
  • Other features include process migration for distributed computing support, deterministic scheduling, and other things (in the paper).
  • Compared single-node performance against Linux on parallel benchmarks. Coarse-grained benchmarks achieve parity with, or some speedup over, Linux. Fine-grained benchmarks (e.g. LU) take a serious performance hit.
  • Looked at the granularity of a parallel quicksort, and the break-even is around 100K elements in an array.
  • Q: Is check-out/check-in more like transactions? It absolutely resembles transactional memory, but these don’t try to provide determinism.
  • Q: point about a “PAR DO” construct in an old Comp Arch news.
  • Q: What about other forms of synchronization e.g. condition variables? These are a terrible fit for Determinator since they are not naturally deterministic. We do fork/join and barriers. Talk about non-hierarchical synchronization in the paper. Barriers allow a child to sync with its parent without terminating.
  • Q: Do you think it is ideal to support arbitrary languages, or would it be better to change the way programs are written to exploit the Determinator semantics? Trade-offs/synergy? We’ve tried to be completely language-agnostic, but interaction could be a huge benefit. A big issue is choosing the granularity for merges, so type information could help here. Or could implement the merge in a more informed way.
  • Q: How much experience do you have with figuring out the programming model to deal with conflicts? Don’t have much practical experience yet as it is just a prototype.
  • Q: Could the merge operation (working on a shared file) yield a result that I might not expect, and do I have to change my assumptions about shared files? Currently, we merge at file granularity, and don’t attempt to merge within the file: this would be a conflict.
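The three-way diff at join time can be sketched at byte granularity. This is a simplified illustration of the merge rule described above, assuming equal-length buffers (Determinator itself merges virtual-memory pages with copy-on-write optimizations): bytes the child changed are applied to the parent's copy, and a write/write race on the same byte surfaces as a deterministic conflict.

```python
def join_merge(original, parent, child):
    """Three-way merge of a child's working copy back into the parent.

    Applies each byte the child changed relative to the original snapshot;
    if the parent concurrently changed the same byte to a different value,
    that is a deterministic conflict (a runtime exception in Determinator).
    """
    merged = bytearray(parent)
    for i in range(len(original)):
        if child[i] != original[i]:                       # child wrote here
            if parent[i] != original[i] and parent[i] != child[i]:
                raise RuntimeError(f"deterministic conflict at byte {i}")
            merged[i] = child[i]
    return bytes(merged)

orig   = b"aaaa"
parent = b"abaa"   # parent changed byte 1
child  = b"aaca"   # child changed byte 2
assert join_merge(orig, parent, child) == b"abca"
```

The same shape of merge applies at file granularity for the emulated shared file system: a file modified by both sides is reported as a conflict rather than merged within.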

Stable Deterministic Multithreading through Schedule Memoization

  • Nondeterminism arises when different runs of a multithreaded program give different behaviour for the same input. Some behaviours may be incorrect.
  • Deterministic multithreading partly addresses this problem, but the schedules are tightly coupled to the input. A slight change in the input leads to a very different schedule. This complicates testing and can mean that some inputs deterministically lead to buggy schedules.
  • The idea is to have stability: memoize past schedules and repeat familiar ones.
  • TERN runs on Linux as a user-space scheduler. Schedule memoized as a totally-ordered sequence of synchronization operations. Only race-free schedules are memoized (previous work). Use symbolic execution (based on KLEE) to track the input constraints required to reuse a schedule. Memoization checks inputs against previous input constraints that were identified by the symbolic execution.
  • Evaluated on Apache, MySQL, PBZip2 and 11 scientific programs. For 13/14 programs, only needed to modify less than 10 lines of code. 100 schedules process over 90% of a real HTTP trace with 122K requests. Overhead is <10% for 9/14 programs.
  • TERN works on annotated program source, with annotations to say which inputs could affect scheduling. An LLVM module does instrumentation. A schedule cache stores memoized schedules as a mapping from constraint sets to schedules. On a hit, the schedule is replayed; on a miss, the memoizer runs.
  • Simplified example based on PBZip2, which has worker threads processing blocks independently. “symbolic” annotation identifies parameters that affect the schedule.
  • Developed programming techniques to make this practical, and these are detailed in the paper.
  • Stability experiment uses a 4-day httpd trace (122K), MySQL sysbench simple and tx workloads, and PBZip2 uses the contents of /usr. Ran on a 4-core Intel box.
  • TERN can reuse over 90% of schedules for three benchmarks. MySQL TX workload only reuses 44.2% of schedules.
  • Looked at bug stability: if you change the input, do bugs occur in the changed input but not in the original? Compared to COREDET. Looked at three buggy programs: fft, lu and barnes from SPLASH-2. Symptom is that global variables are printed before being assigned the correct value. TERN never sees the bug, but COREDET does for some thread/matrix size configurations. Fewer schedules in TERN means that the chance of hitting the bug is less, and hence it is more stable.
  • What about TERN’s overhead? It is supposed to run in everyday execution, so this must be mild. Apache sees small overhead, MySQL is much greater (>30%), PBZip2 is negative. Mixture on SPLASH, -15% up to 40%. Details of the optimization/speedup are in Section 7 of the paper.
  • Future work is fast deterministic record/replay.
  • Q: Does not recording shared memory access order mean that you might not faithfully reproduce data race bugs? The premise is to make behaviour stable and deterministic, rather than eliminating bugs. But we can memoize race-free schedules only, by applying a race detector.
  • Q: Could the annotations be created automatically in future? Haven’t done this yet, but in our evaluation, the overhead is pretty small.
  • Q: How do you automatically determine constraints, and would these become bigger as execution goes on? We have two refinements that mitigate this. First, we remove redundant constraints from the constraint set. Second, we slice out branches that don’t affect synchronization orders.
  • Q: Is it true that you only memoize and replay schedules at program granularity (and cannot replay parts of schedules), and would that be a benefit (and would it be easy or difficult)? Currently not implemented, but this would be an interesting direction.
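The schedule cache at the heart of TERN can be sketched as a list of (constraint, schedule) pairs. In this toy version, plain Python predicates stand in for the symbolic path constraints that KLEE derives; the class and its API are invented for illustration.

```python
class ScheduleCache:
    """Memoizes synchronization-order schedules keyed by input constraints."""

    def __init__(self):
        self.entries = []   # list of (constraint predicate, schedule)

    def lookup(self, inputs):
        """On a hit, return the memoized schedule to replay; else None."""
        for pred, schedule in self.entries:
            if pred(inputs):
                return schedule
        return None         # miss: the memoizer would record a new schedule

    def memoize(self, pred, schedule):
        self.entries.append((pred, schedule))

cache = ScheduleCache()
# e.g. a PBZip2-like constraint: this schedule is valid whenever nthreads == 2
cache.memoize(lambda inp: inp["nthreads"] == 2,
              ["t1.acquire", "t2.acquire", "t1.release", "t2.release"])
assert cache.lookup({"nthreads": 2}) is not None   # hit: replay
assert cache.lookup({"nthreads": 4}) is None       # miss: memoize a new one
```

The stability property falls out of the lookup: many distinct inputs satisfy the same constraint set, so they all replay the same (race-free) schedule.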

Systems Management

Enabling Configuration-Independent Automation by Non-Expert Users

  • It is hard to automate a task across different machine configurations. KarDo observes user actions as they perform a task, and produces a solution that works across configurations.
  • Strawman: collect a separate trace for every possible configuration. But the configuration space is too big to explore completely. Also, tasks are so diverse that we are unlikely to get more than a few traces for each task.
  • KarDo can generalize from a few traces.
  • Example application: turning on out-of-office (OOF) notification in Outlook. What if the two machines start off in different view modes? What if a Java update notification pops up between steps during recording? KarDo can infer which actions are non-state-modifying from traces of any task. This means far fewer traces are needed, and non-state-modifying actions can be skipped in the replayed trace.
  • Evaluated by automating tasks from the MS Help and eHow website, in numerous categories.
  • Tracing user actions is challenging because the OS only gives mouse clicks and keypresses which may be meaningless on a different machine. Use the accessibility interface to get more information about the widgets that were affected by an action, which is meaningful across machines.
  • Generalizing to a single canonical solution requires distinguishing state-modifying actions from other actions. Three kinds of action: update, commit and navigate. Only consider the update and commit actions. To identify these, use visual cues and feed them into an SVM, which labels actions. For features, use widget type, whether it closes the window, whether it changes state, what its text label is, etc. Train it using a small set of examples.
  • To generalize across different configurations, must splice in navigation actions to the state-modifying ones in order to get to the right buttons, etc. Build a global navigation graph, then do breadth first search on the graph to get to the desired state.
  • To remove unnecessary user actions, use a two-pass algorithm. First pass removes unnecessary updates (which are later overwritten).
  • To create a canonical solution, need to create a per-task state modification graph, which may have conditional branches for different configurations (e.g. configuring a router, there may be different manufacturer applications).
  • Implemented KarDo as a thin-client backed by a server. Evaluated on 57 tasks. Used 20 diversely configured VMs, split into 10 training VMs and 10 test VMs. Each task performed manually on exactly 2 VMs.
  • Testing by direct replay of the automation recording (baseline) and comparing it to KarDo. Baseline is successful in 18% of cases. KarDo succeeds in 84% of cases.
  • Why does KarDo fail? 4% of mistakes are due to the navigation algorithm. 5% are due to missing steps and 7% are due to configuration errors.
  • Q: How many of the examples could you have handled by modifying configuration files directly? Some reasonable percentage of them, but we want to handle more general things than just configuration tasks.
  • Q: To what extent could you adapt your system to perform command-line tasks, or editing specific files? We do consider some command-line steps, but this isn’t the primary goal. Could extend it in this direction, but would need to understand the syntax and text on the command-line, which is more complex.
  • Q: What if the training steps include mistakes? We remove unnecessary steps. What about misconfiguration? Could use traces to “check” the correctness of configurations.
  • Q: What if you have to branch, how resilient is this to cosmetic perturbations to the interface? The accessibility layer is the level we consider, so we look for the presence of widgets.
  • Q: How do you determine the start of a task? People upload traces to a central server, so we assume they determine the start before recording it, and do this deliberately. What about input-invariant actions (entering user-specific text in a text box)? We flag this in the recording stage.
  • Q: Have you thought about using Google searches as an input to the trainer? Had a HotNets paper on this very thing last year.
  • Q: What if your program has privacy-sensitive data and this was captured in a trace? Do they have to be manually anonymised? How do I make sure that I don’t upload parts of my private email? In the general case, you need to rely on the user.  However, we can infer that, if many users input the same value, it is probably not private. But I have to trust the operator? You’d have to trust the automated system, not the administrator.
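The navigation-splicing step can be sketched as a breadth-first search over the global navigation graph. The graph, state names, and actions below are invented for illustration; KarDo builds the real graph from accessibility-layer observations of UI states.

```python
from collections import deque

def navigate(graph, start, goal):
    """BFS over the navigation graph; returns the action list to reach goal.

    graph: dict mapping a UI state to a list of (action, next_state) edges.
    Returns None if the goal state is unreachable from this configuration.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None

# Hypothetical fragment of a navigation graph for an Outlook-like UI.
graph = {
    "main":       [("click Tools", "tools_menu")],
    "tools_menu": [("click Options", "options_dialog")],
}
assert navigate(graph, "main", "options_dialog") == ["click Tools", "click Options"]
```

The shortest path found this way is spliced in front of the next state-modifying action, so the same canonical solution works on machines that start in different UI states.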

Automating Configuration Troubleshooting with Dynamic Information Flow Analysis

  • Systems are hard to configure, and users make mistakes. Troubleshooting is therefore difficult and expensive.
  • When we have misconfigurations today (such as a bogus value in a config file), we might look at log messages, ask colleagues, search Google or look at the code if available. The idea is to make a tool that automatically finds the root cause.
  • ConfAid uses application code to lead users to the root cause, by doing dynamic information flow analysis on binaries.
  • Developers find the root cause by tracing back through the variables and inputs that have influence on the error condition occurring. ConfAid uses taint tracking to identify these dependencies. When the application reads from a configuration file, it adds taint to variables, and traces this through data and control flow. ConfAid attempts to find execution paths that avoid the error. Different inputs have distinct taint.
  • Each variable has a taint set which identifies the inputs that might influence the variable if they change. These sets are built through data flow and control flow. Must explore both paths of a conditional statement.
  • Complete and sound analysis would lead to bad performance and a high false positive rate. Use heuristics to deal with this.
  • The bounded horizon heuristic prevents path explosion by only running a fixed number of instructions in an alternate path. However, this may lead to false positives and false negatives.
  • Also, the single mistake heuristic assumes that a configuration file contains only one mistake. This reduces the amount of taint and the number of explored paths.
  • The weighting heuristic attempts to reduce the false positive rate by weighting different taint propagations differently. Data flow taint is weighted more heavily than control flow taint. Weight branches closer to the error higher than those further away.
  • ConfAid can also propagate taint between processes that use IPC system calls. Support communication via sockets, pipes, TCP/UDP and regular files.
  • Evaluated on OpenSSH, Apache and Postfix. Manually injected errors and evaluated the accuracy and time-to-discover of the root cause result.
  • Used real-world misconfigurations and randomly generated bugs.
  • The correct root cause was ranked first or second for all 18 real-world bugs. Performance also great for randomly-generated bugs, getting 85% first-place correct root causes.
  • Average execution time is 1 minute 32s. Apache is the slowest. The randomly-generated bugs only take about 23s.
  • Q: How do you handle external input or output of various types in your speculative execution? Can you do rollback? We abort alternative path exploration if we see an externally observable side effect.
  • Q: Do you automate the parsing and identification of the configuration files? We just intercept the read call and assign taint from there. Don’t need to identify the parsing routines, etc.
  • Q: Can you combine this with backward flow analysis from the error point? (With program slicing?) This would be much more difficult. Program slicing goes backwards, but we find that the config is read at the very beginning, then run for a while, then something goes wrong. Probably not practical to do program slicing on such a workload. It would probably also give a lot of false positives, because our taint tracking is more contained.
  • Q: Could you use your technique for other types of errors, such as segfaults? You could, but we have focused on configuration problems.
  • Q: Are all of the bugs that you dealt with single-mistake bugs? Yes. How common do you think those kinds of bugs are? Well, it depends if the bugs are independent or if they influence each other. Will probably only find the first error if there are two. Relaxing the single mistake restriction in future work.
  • Q: Do you have anything to add in terms of the performance overhead (proportional to program size) and did you do anything to reduce it? Dynamic data-flow analysis is slow: 100 to 200x slowdown, and we haven’t optimized it much, but it would be hard to get much faster. If the problem happens far into the execution, it will take a long time to come up with the ranking. Idea is to use this offline when you encounter a bug.
  • Q: How is the syntax of the configuration file handled? Does ConfAid require a priori knowledge of the syntax? No, we don’t require this. Fine with both Apache and OpenSSH, without app-specific tuning. However, there might be some noise in the result.
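The taint-set propagation can be sketched as follows. This is a minimal illustration of the two propagation rules described above (data flow, and control flow through a tainted branch condition); the weighting heuristic and alternate-path exploration are omitted, and all variable and option names are invented.

```python
def assign(taint, dst, srcs, branch_taint=frozenset()):
    """Propagate taint on an assignment.

    dst's taint set becomes the union of its sources' taint (data flow)
    plus the taint of the enclosing branch condition (control flow).
    """
    taint[dst] = frozenset().union(
        *(taint.get(s, frozenset()) for s in srcs),
        branch_taint,
    )

# Config tokens read from the file each get a distinct taint label.
taint = {"opt_port": frozenset({"Port"}), "opt_user": frozenset({"User"})}

assign(taint, "sockaddr", ["opt_port"])                    # data flow
assign(taint, "flag", [], branch_taint=taint["opt_user"])  # control flow:
                                                           # assigned inside
                                                           # `if user == ...`
assert taint["sockaddr"] == {"Port"}
assert taint["flag"] == {"User"}
```

When an error condition is reached, ranking the tokens in its taint set (weighted by propagation kind and branch distance) yields ConfAid's root-cause candidates.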

Inside the Data Center, 2

Large-scale Incremental Processing Using Distributed Transactions and Notifications

  • Indexing is a long process, not just matching keywords to URLs. Converts raw documents into documents ready for serving.
  • Example of duplicate elimination using MapReduce. Map documents to their checksum, and reduce them into single documents. However, refreshing the index involves adding more documents to the repository and deciding whether to add each one to the repository. With MapReduce, you would have to start again from scratch.
  • Goal is to have a large repository of documents with a high-quality index, and a small delay between crawling and indexing (“freshness”). MapReduce could only achieve freshness of a few days.
  • All the information is stored in Bigtable. Tables mapping URLs to metadata, and from checksums to canonical documents. On getting a new document, you need to update the checksum table, and flip a metadata bit in the URL table. This leads to a race condition.
  • Percolator adds distributed transactions to Bigtable. The API has get(), set(), commit() and iterator methods. Limited support for secondary indices.
  • Bigtable is a sorted (row, column, timestamp) store. Rows are partitioned into ranges called “tablets”. Tablets are spread across many machines.
  • Percolator provides “snapshot isolation” semantics, using two-phase commit coordinated by the client. Locks are stored in additional Bigtable columns: commit and lock for each data column. Transactions manipulate these columns transparently to the user.
  • Notifications allow users to run user-defined code when a column is updated. “Message collapsing” means that notifications are run at most once per write. Applications are structured as a series of observers, which hopefully form a tree.
  • Notifications run independently on individual rows, which means that stragglers don’t hold up other work.
  • Bus clumping: randomly scanning the repository for notifications leads to scanners clumping together (like buses), which reduces parallelism and overloads the Bigtable servers. Solution is to acquire locks on a row-by-row basis, with a scanner jumping to a random row when it encounters contention.
  • Very different access pattern on the hardware. Lots of small, random disk I/Os with many RPCs per phase (compared to MapReduce’s streaming I/O with many documents per RPC).
  • Compared MapReduce versus Percolator on document processing latency. MapReduce is basically constant against churn at around 2200s, whereas Percolator has a much lower latency, until latency explodes around 35% churn. MapReduce is limited by stragglers. Operating point is around 2 to 15% churn.
  • Pros of Percolator: much better freshness, and better scalability (more throughput by buying more CPUs), which supports a bigger repository. Immune to stragglers. Cons are that it needs to reason about concurrency and is about 2x more expensive per document processed.
  • Percolator runs 1000 threads on a single Linux instance. This makes it easier for application developers, with straight-line code and meaningful stack traces. And it gets good multicore speedup. Problems with kernel scalability, though, such as the mmap semaphore, and had to fix those.
  • Unknown unknowns: some CPUs periodically got XOR calculations wrong in checksums. Sometimes Bigtable cannot delete files fast enough (the garbage collector). 50% of seeks were doing useless readahead, so switched to ext4. Incorrect resistor value on the motherboard led to the workload powering machines off.
  • Gets 50% fresher results, with a 3x larger repository.
  • An existence proof for distributed transactions at web scale.
  • Q: Did you consider doing Nectar? Didn’t consider this, but it could improve the performance of this system. We do batch RPCs for operations on the same row.
  • Q: What was causing stragglers? Machine failures, process failures, etc.
  • Q: Could you live with some of the failures that you saw? Some limitations in the Bigtable abstraction that made us have to debug down to the lowest layers.
  • Q: Were Stonebraker and DeWitt right? It isn’t a good idea to pretend that MapReduce is a database system. But MapReduce isn’t dead. Percolator has a handful of users, whereas MapReduce has thousands of Google users.
  • Q: Did you find cycles of notifications arising in practice? Did it exactly once, but were more careful after that. The system looked healthy but would have run for an infinitely long time.
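The lock-and-commit-column scheme can be sketched over a dict-based toy table. This is an illustrative simplification of the two-phase commit described above, with a plain dict standing in for Bigtable; it omits Percolator's primary-lock designation, conflict cleanup, and rollback of partially placed locks, and all names are invented.

```python
class ToyPercolator:
    """Toy snapshot-isolation commit: locks live in extra table columns."""

    def __init__(self):
        self.data, self.lock, self.write = {}, {}, {}
        self.ts = 0   # stand-in for the timestamp oracle

    def commit(self, writes):
        """writes: {row: value}. Returns False on a conflicting lock."""
        start = self.ts = self.ts + 1
        rows = sorted(writes)
        # Phase 1: lock every written row, storing the data at start_ts.
        for r in rows:
            if r in self.lock:
                return False   # conflict: another transaction holds the lock
            self.lock[r] = start
            self.data[(r, start)] = writes[r]
        # Phase 2: replace each lock with a commit record at commit_ts,
        # pointing at the data version written in phase 1.
        commit_ts = self.ts = self.ts + 1
        for r in rows:
            del self.lock[r]
            self.write[(r, commit_ts)] = start
        return True

db = ToyPercolator()
assert db.commit({"urlA": "doc1", "urlB": "doc2"})
assert db.data[("urlA", 1)] == "doc1"
assert db.write[("urlA", 2)] == 1   # commit record points at version 1
```

Because the lock and commit records are ordinary columns in the same table as the data, no separate lock service is needed and clients coordinate the 2PC themselves.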

Reining in the Outliers in Map-Reduce Clusters using Mantri

  • Work began as a measurement study of outliers in production MapReduce clusters, but grew into an effort to fix them. Now in production on Bing’s clusters.
  • MapReduce is used to build the search index, improve relevance and ads, perform genome analysis, etc. It decouples operations on data from mechanisms to scale across large distributed clusters.
  • Implementation is Cosmos (based on Dryad). Bing uses the SCOPE language to specify queries.
  • Example: finding frequent search queries. Three phases: read/parse, map and reduce. Barrier after the map task. Leads to a drop in cluster utilization towards the end of the map phase, due to stragglers. Reduce phase also has stragglers. Both phases slow down the overall job.
  • Want to get rid of stragglers and recomputations. This gives better predictability, better resource utilization and quicker response (good for developers).
  • Measured start time, end time, data rate and failure mode on a production cluster. Stragglers take 1.5x the duration of the median task in a phase. Half of phases have >10% of stragglers and no recomputation. 10% of stragglers take 10x the duration of the median task (very heavy tailed).
  • The impact of outliers on job completion time (based on a trace-driven simulator): 50% of jobs would improve by at least 34.7%.
  • If input data is unavailable, tasks must wait for recomputations, and this delay may cascade. Idea is to replicate intermediate data, and use a copy if the original is unavailable. Challenge is to decide what data to replicate, and where. Since recomputes are rare and localized (mostly due to disk failures or high load), a predictive model can estimate how likely a machine is to cause a recompute. Cost-benefit calculation based on the probability of a recompute and the runtime of the task. Since whole-rack failures are so rare, only replicate within a rack.
  • Cause of stragglers: tasks reading input over the network end up experiencing variable congestion. Solution is to spread tasks evenly across the cluster. Many aspects of MapReduce are already network-aware (such as shard placement), but the shuffle of the reduce phase means that this locality is harder to exploit.
  • Place reduce tasks so that the traffic on each link is proportional to the available bandwidth on that link. This would require global co-ordination across jobs to identify the congestion and place tasks. Can approximate this with local control.
  • Cause of stragglers: workload imbalance due to unequal partitions. Approximately 25% of outlier tasks have more data to process. Better to ignore these than duplicate them! Mantri builds an estimator that uses Graham’s 1969 theorem.
  • A final cause is contention on a machine. Idea is to copy tasks elsewhere in the cluster. Challenge is to do this as early as possible. Need to build an estimator to guess how much work is left to do. Trick is to copy when the predicted time of starting again is less than half the remaining time.
  • Running time is a function of the input, network congestion, data left to process and various other factors.
  • Results from deployment in production clusters and simulations. Running in production since May. Compared May to June on Mantri against April (previous setup). Median job improvement is 32.1%. There is also a reduction in the number of resources used, because too many duplicate jobs were scheduled in the Dryad scheduler. (0.5x as many copy tasks launched as Dryad, with a 2.8x higher success rate).
  • Compared Mantri to Dryad, Hadoop, LATE and MapReduce on a trace. Mantri gives a better reduction in completion time than all of these schemes. The others have less time savings with more resources used.
  • Q: You took the MapReduce interface as given, so what would you do if you built a new system knowing what you know now? Providing better ways of partitioning maps, which we have already done in production.
  • Q: What would you do if you had a non-uniform cluster? We have very efficient cluster management software, called Autopilot, that monitors these things. The primary reason we don’t have persistent problems is that this will shoot particularly bad nodes in the head. A slot abstraction is how we split up heterogeneous machines.
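Mantri's duplicate-task rule, as described above, is simple enough to state in a few lines: launch a copy only when a fresh start is predicted to take less than half the task's remaining time, so a duplicate never costs more resources than it is expected to save. The estimator functions themselves (which use progress reports and input size) are elided here; the predictions are caller-supplied.

```python
def should_duplicate(t_remaining, t_new_copy):
    """Mantri-style cost-benefit check for launching a duplicate task.

    t_remaining: predicted time for the running task to finish.
    t_new_copy:  predicted time for a fresh copy to finish elsewhere.
    """
    return t_new_copy < t_remaining / 2

assert should_duplicate(t_remaining=100, t_new_copy=40)       # copy helps
assert not should_duplicate(t_remaining=100, t_new_copy=60)   # not worth it
```

The factor of two is the key difference from eager speculative execution: it is why Mantri launched 0.5x as many copy tasks as the Dryad scheduler while achieving a 2.8x higher success rate.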

Transactional Consistency and Automatic Management in an Application Data Cache

  • Application-level caching is used to handle scaling issues in modern websites, e.g. Facebook, MediaWiki, LiveJournal. Caching whole web pages is not much good if you have highly personalized content. Database caches are less useful when so much processing is done in the application layer. This has led to the use of things like memcached or Java object caches. These are essentially a DHT, for application-level objects, which can separate commonly-used and customized content. This caching reduces load on the database and the application server.
  • Existing caches are very simple and force a lot of management work to the application. For example, they don’t provide transactional consistency, which violates guarantees that the underlying database would provide. This can expose anomalies to the user, and complicates application code (to deal with violated invariants).
  • Naming is harder than you think: a cache key must uniquely identify the value. Invalidations require global reasoning about the possible contents of the cache. Saw two bugs in MediaWiki (#7541 and #8391) that arose from these problems.
  • TxCache has transactional consistency, bounded staleness and function memoization. Implemented as a library that hides the complexity of cache management. It integrates with a new cache server we have developed. Function calls for beginning, committing and aborting transactions, and a hook to make a side-effect-free function cacheable (depending only on the arguments provided and the state of the database). No need to deal with inconsistencies, naming, invalidation, etc.
  • Transactional consistency goal: all data seen in a transaction reflects a single point-in-time snapshot. Each transaction gets a timestamp, and objects in the database and cache are tagged with a validity interval. Transaction can read cached data if the validity interval includes the timestamp. Staleness allows cached data to be used for longer, if this is compatible with the application requirements. Nevertheless, it must be consistent with the data in the database if other parts of the transaction require DB data. Simply expose an API on the database for starting a transaction at a given timestamp.
  • Validity intervals computed for application objects, queries and tuples (the last being tracked by the database). A query is valid if the current timestamp is in the intersection of the validity intervals of the tuples accessed. Had to modify the database server to track this and return it with each SELECT query.
  • It is hard to choose a timestamp a priori, since we don’t know the access pattern or the contents of the cache. However, can choose this based on the tuples that would be accessed.
  • Each object in the cache has a set of invalidation tags, and had to modify the DB to assign these tags to each query. Tags are computed from the query access methods (e.g. index lookups, wildcard tags for queries on the entire table). An update generates affected tags (one per key per tuple modified) and broadcasts these to all cache nodes in an ordered stream. Need to be careful to avoid lookup-invalidate race conditions.
  • Evaluated using the RUBiS benchmark suite, with the standard browsing/bidding workload. Ran with an in-memory (850MB) workload, and a 6GB (disk-bound) workload. As the cache size grows, throughput increases up to about 5000 requests per second. The cache hit rate flattens out at 95%. With no cache, the rate is about 1000 requests/second.
  • Throughput increases also for the disk-bound workload, though much lower (up to 500 requests/second). The cache hit rate is much higher though. Bottleneck in this workload is random I/O for infrequently-accessed data.
  • Adding staleness gives a 3x to 5x throughput improvement.
  • Investigated cache misses due to the stringent consistency: less than 10%. When consistency is turned off, the increase in throughput is much lower.
  • Q: How are the invalidation tags affected by complex queries like joins? We have a more complex approach based on predicate locks.
  • Q: How do you deal with fault tolerance? The cache is inherently fault tolerant because its data can always be recreated. Although migration etc. could improve that. Problems with the validity intervals? No, because these are generated by the database.
  • Q: Do you provide serializability or snapshot isolation? Depends on what the underlying database supports, since the cache is for read-only transactions.
  • Q: Why did your throughput keep going up as the hit rate levels off? This is because we are caching objects that are sometimes larger than the data that is used to generate them.
  • Q: Do you still get a performance win if you have a single cache node on a single system? The benefit of a distributed architecture is the ability to scale the effective size of the cache. Have you compared distributing the cache versus distributing the database? There are a lot of approaches based on DB replication that do similar things, but these are more heavyweight because they try to scale writes as well.
  • Q: How is the cache key chosen? Based on the function’s name and its arguments. A more complex approach could be used if the code might change. What if not all of the arguments matter? Still work to be done in the application to make sure the cache is used efficiently (e.g. in choosing the cacheable functions).
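The validity-interval mechanism above can be sketched in a few lines of Python. This is a hypothetical simplification: the function names and the half-open interval representation are illustrative, not TxCache’s actual API.

```python
def interval_intersection(intervals):
    """A query's validity interval is the intersection of the validity
    intervals of the tuples it reads (lo inclusive, hi exclusive)."""
    lo = max(i[0] for i in intervals)
    hi = min(i[1] for i in intervals)
    return (lo, hi) if lo < hi else None

def can_use_cached(entry_interval, txn_timestamp):
    """Cached data is usable only if the transaction's timestamp falls
    inside the cached entry's validity interval."""
    lo, hi = entry_interval
    return lo <= txn_timestamp < hi

# A query over tuples valid on [10, 20), [5, 30) and [12, 25) is itself
# valid on [12, 20), so a transaction at timestamp 15 may use the cache:
q = interval_intersection([(10, 20), (5, 30), (12, 25)])
assert q == (12, 20)
assert can_use_cached(q, 15)
assert not can_use_cached(q, 20)   # interval is half-open
```

This also shows why choosing the timestamp lazily helps: the transaction can pick any timestamp inside the intersection of the intervals of the data it has already touched.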

Piccolo: Building Fast, Distributed Programs with Partitioned Tables

  • Motivating example is PageRank: for each node in a graph, add the current score of that node to its out-neighbours, and repeat until convergence. The only changing data (the current and next iteration scores) will fit in memory.
  • Dataflow models don’t have global state as a first class concept. Need to read from and write to distributed storage. No access to intermediate global state, and costly access.
  • Could implement it with MPI or RPC. Need to decide how to break up the computation explicitly.
  • Piccolo provides global access to distributed shared state. Piccolo runtime handles communication between these components. Usually ease of use and performance are in conflict.
  • Programming model uses a key-value store with a put/get/update/iterate interface. Implemented as a library for C++ and Python.
  • Showed how PageRank is implemented on Piccolo. A job is specified with a control function and a kernel function that actually does the computation. Doing it naïvely is slow, because it would do a lot of remote gets and remote puts.
  • Let the user control the partitioning function (e.g. for PageRank based on the domain of each page), and colocate tables that a kernel function uses on the same node.
  • Synchronization is challenging if multiple writers can access the same key at once. So provide an atomic update (e.g. sum or min or max) function attached to a table value. Only works for commutative functions. Synchronization primitive is a global barrier. Tables provide release consistency: accumulations are atomic, but they may be buffered and only guaranteed to be applied at barriers. Need to do an explicit barrier between operations. Release consistency makes writes quite fast.
  • Fault tolerance: do checkpointing of the table state. The barrier is augmented with a checkpoint of particular tables. Use the Chandy-Lamport (non-blocking) protocol for checkpointing.
  • Load balancing is tricky because the tasks are stateful. Cannot simply start a copy of a task on another machine. The master coordinates work stealing, and needs to pause updates to a partition before it is stolen to another machine.
  • Evaluated against Hadoop on a local 64-core, 12-node cluster. Used a 100M page graph, taking around 8 seconds per iteration. Main Hadoop overheads are sorting, HDFS access and serialization.
  • Evaluated scaling up to 200 workers on EC2, and the iteration time stays about the same (about 60 seconds per iteration at all configurations).
  • Implemented iterative n-body and matrix multiply that don’t fit into Hadoop. Also asynchronous applications such as a distributed web crawler.
  • Q: What is your fallback strategy for dealing with computations that don’t have a commutative aggregate function? We haven’t found many applications like that. We would have to rely on pairwise locking in this case.
  • Q: Do you need to restart every node if a single node fails? Yes, you need to rollback all nodes to the last checkpoint, so need careful consideration of the checkpoint frequency.
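Piccolo’s accumulation-plus-barrier model (release consistency with a commutative update function) can be sketched as follows. The `Table` class and its methods are illustrative stand-ins, not Piccolo’s actual C++/Python API, and the distribution machinery is elided.

```python
from collections import defaultdict

class Table:
    def __init__(self, accumulate):
        self.accumulate = accumulate      # must be commutative (sum, min, max, ...)
        self.data = defaultdict(float)
        self.pending = defaultdict(list)  # buffered (possibly remote) updates

    def update(self, key, value):
        self.pending[key].append(value)   # buffered; not yet visible to readers

    def get(self, key):
        return self.data[key]

    def barrier(self):
        # Release consistency: buffered accumulations are only guaranteed
        # to be applied at a barrier, where each is applied atomically.
        for key, values in self.pending.items():
            for v in values:
                self.data[key] = self.accumulate(self.data[key], v)
        self.pending.clear()

# One PageRank-like step: each node adds its share to its out-neighbours.
scores = Table(accumulate=lambda a, b: a + b)
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for node, out in graph.items():
    for dst in out:
        scores.update(dst, 1.0 / len(out))
scores.barrier()
assert scores.get("c") == 1.5   # 0.5 from "a" + 1.0 from "b"
```

Buffering is what makes writes cheap: because the accumulator is commutative, updates can be applied in any order at the barrier, so they never need to be synchronised in flight.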

Cloud Storage

Depot: Cloud Storage with Minimal Trust

  • Storage providers like S3 are appealing because they free the user from a lot of concerns. However, they also present risks. Failures can cause undesired behaviour (such as not propagating permission changes immediately). Also risks of data loss, corruption or unavailability.
  • Approach is to take a radical stance on fault-tolerance. Completely eliminate trust from cloud storage for put-availability, eventual consistency, staleness detection and dependency preservation. It minimises trust for get-availability and durability. So you can always perform a put, for example. But gets are only available as long as there is at least one correct node.
  • Depot ensures high availability by running multiple servers and accepting a weaker-than-sequential consistency model. It falls back to client-to-client communication in the case of extreme cloud failure. It prevents omission and reordering, by adding metadata to puts, which is fully replicated in nodes’ local state. Metadata received at the client is checked to ensure that the client sees a view that satisfies various well-defined properties. Update metadata includes a local clock and a history. Logically each node stores all metadata it sees (garbage collection is handled in the paper).
  • To protect consistency, nodes perform local checks on the metadata: needs to ensure that the updating client has seen all previous updates (based on its history), and it doesn’t try to change history (based on its local clock).
  • Faults can lead to forks in the history, which partition correct nodes, and prevent eventual consistency. Forks are joined to gain eventual consistency: faults are transformed into concurrency. One faulty node is transformed into two correct virtual nodes. Since concurrency can introduce conflicts (concurrent updates to the same object), Depot exposes such conflicts to applications.
  • Depot provides various properties in terms of consistency, availability, integrity and eviction, which vary based on how many nodes are correct.
  • For get availability and durability, it would be ideal to trust only yourself, but impractical to do this. Depot minimises the number of correct nodes that are necessary to service the get. Any correct node that is willing to supply an object can service a get. To make it likely that a correct node has data, employ replication (including a local copy).
  • Contingency plan for storage service provider failure. Prototype forces all clients to store all of the puts that they create, and gossip update metadata in order to route get requests to the right node. Despite complete SSP failure, this allows the system to keep working.
  • Evaluated cost in terms of latency, resources and dollars (for storage).
  • Overhead comes from adding metadata and adding checks (SHA256, RSA verify and history verify).
  • Set up the evaluation on 12 Emulab nodes, 8 clients and 4 servers, with a 1G link. Clients issue 1 request/second, and measure latency and per-request cost. As a baseline, emulate traditional cloud storage (servers implemented in Depot without any of the checks, and no metadata at the clients).
  • Latency overhead on get is very small (<1 ms), and on put, is about 5 ms.
  • Network overhead on gets are modest. CPU costs are bigger. Metadata transfer causes network cost, and verification has CPU cost.
  • Cost model based on cloud pricing. Get cost per TB is small over the baseline (about $100/TB). Puts incur more cost (about $220/TB) due to metadata.
  • Q: Does presenting overheads in terms of the Depot cost mask the effect? Isn’t the CPU cost something like 400%? Yes, but the absolute values are small. Plus the real cost is in the bandwidth.
  • Q: What can Byzantine clients do if you rely on gossiping metadata (such as cause all get requests to be routed to it)? Include a “receipt” mechanism to deal with such a situation, described in the paper. When a client receives a get response, it performs a check to ensure that data is sufficiently replicated.
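The local metadata checks can be sketched roughly as follows. This is a hypothetical simplification: Depot’s real metadata includes version vectors and signed histories, and all the names here are illustrative.

```python
import hashlib

def h(update):
    """Hash of an update, used to chain histories together."""
    return hashlib.sha256(repr(update).encode()).hexdigest()

class Node:
    def __init__(self):
        self.history = {}  # writer -> hashes of accepted updates, in order
        self.clock = {}    # writer -> highest sequence number seen

    def check_update(self, writer, seq, prev_hash):
        hist = self.history.get(writer, [])
        # 1. The writer must have seen all of its previous updates: its new
        #    update must chain to the last one we accepted from it.
        expected = hist[-1] if hist else None
        if prev_hash != expected:
            return False   # an update was omitted or reordered
        # 2. The writer must not reuse a sequence number (rewrite history).
        if seq <= self.clock.get(writer, -1):
            return False   # fork detected
        return True

    def accept(self, writer, seq, prev_hash, body):
        assert self.check_update(writer, seq, prev_hash)
        self.history.setdefault(writer, []).append(h((writer, seq, body)))
        self.clock[writer] = seq

n = Node()
n.accept("w1", 0, None, "put x=1")
last = n.history["w1"][-1]
assert n.check_update("w1", 1, last)        # properly chained: accepted
assert not n.check_update("w1", 0, last)    # reused sequence number: a fork
```

A failed check 2 is exactly the “fork” case the talk describes: the system can then split the faulty writer into two virtual writers and treat the fork as concurrency.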

Comet: An Active Distributed Key-Value Store

  • Key-value stores are increasingly important, with great scalability, availability and reliability properties. Popular in both data centers and P2P. Stores are shared between many applications to avoid per-app storage system deployment, but building applications above these stores is challenging.
  • Main challenge is due to inflexibility: all apps are treated exactly the same way, independent of their needs (which may be conflicting). Motivating example is authors’ experience with Vanish on top of the Vuze DHT.
  • Vanish is self-destructing data storage for the web, based on the Vuze DHT, using temporal-based keys. Unfortunately, Vuze had a fixed 8-hour data timeout, and an overly-aggressive replication scheme, which harmed security. Making the necessary changes was simple, but deploying them was difficult, because it is a public service in production. Also it is hard to evaluate changes before deployment.
  • Propose extensible key/value stores that allow applications to customise functions of the store. Different data lifetimes, numbers of replicas and replication intervals. Also allow apps to define new functions, such as popularity tracking, access logging, or user-dependent responses. Built a peer-to-peer DHT: Comet.
  • Goals: flexibility, isolation, safety and lightweightness.
  • Main abstraction is an “active storage object” (ASO), which contains a value and a small set of handlers that are called on put or get to the object. Comet is built on top of a traditional DHT as an active runtime that runs ASO handlers in a sandbox.
  • Built several applications on top of the ASO extension API, including Vanish, Adeona, file sharing, etc. Extension API is 12 functions: 4 callbacks, 5 host interaction functions and 3 DHT interaction functions (get/put/lookup). Deliberately chose not to allow arbitrary network I/O because of security implications.
  • Sandbox is less than 5KLOC, based on a subset of Lua with unnecessary functions removed. Resource consumption is limited by counting the number of per-handler bytecode instructions and memory consumption. Incoming and outgoing requests are rate-limited. DHT interaction is limited (to prevent traffic amplification and DDoS attacks). ASOs can only talk to neighbours, and cannot make recursive requests.
  • Built on top of Vuze using Lua. Deployed on PlanetLab. Discussing mainlining it in Vuze with the main engineer.
  • Considered three examples. App-specific customisation of the replication scheme. Did Vanish-specific replication in 41 lines of Lua.
  • A context-aware storage object is used to build a proximity-based distributed tracker (returning close nodes rather than random nodes). ASO uses 37 lines of Lua. Using the proximity-based tracker gives better latency between paired nodes than a random tracker.
  • A self-monitoring DHT: a node monitors its neighbours by periodically pinging them. This allows you to collect measurements of DHTs. Ran an experiment to get Vuze node lifetimes.
  • Q: Have you thought about macrosecurity (as opposed to microsecurity considered in the talk)? We added global limits to ASOs (not just per-ASO limits), that is just standard rate-limiting for DHTs. These are, we believe, the right mechanisms to manage the overhead of Comet. But we need more experience to determine limits. What about fairness? Haven’t looked at this yet.
  • Q: What happens when the metadata in the ASOs diverge between replicas? In today’s DHTs, replicas are actually inconsistent (may be stale). Opportunity here for different replicas to merge with each other using an ASO.
  • Q: Were you able to enable certain classes of computation in Lua? Removed everything except for three classes of things: simple math, string manipulation and table manipulation. A modified Lua interpreter counts the number of instructions. Is there an interesting set of things that you might add back in, and what applications might that enable (or is it a gray area)? One example was to do true internet measurement with arbitrary network I/O, but we decided not to add these things.
  • Q: Does the use of a high-level language cause performance unpredictability that might harm the 99.9th percentile latency? An interesting issue, but we haven’t looked at this yet. May choose to use non-GC’d languages in a real deployment.
  • Q: Do you persist that ASO state? Currently stored in-memory, and use replication to gain persistence.
  • Q: What happens to reliability if things actually get cancelled? Applications should expect that things may fail at any given point in time. Can use tricks like Vanish does to get reliable deletes.
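The active-storage-object idea can be illustrated with a Python sketch. The real handlers are sandboxed Lua, and this class structure is purely illustrative, not Comet’s extension API.

```python
class ASO:
    """An active storage object: a value plus handlers that the DHT
    runtime invokes on each get/put to the object's key."""
    def __init__(self, value):
        self.value = value
        self.meta = {"gets": 0}   # app-defined state kept alongside the value

    def on_get(self, requester):
        # Handlers enable e.g. popularity tracking, access logging,
        # or returning different responses to different requesters.
        self.meta["gets"] += 1
        return self.value

    def on_put(self, new_value):
        self.value = new_value

class DHTNode:
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        if key in self.store:
            self.store[key].on_put(value)   # existing object handles the put
        else:
            self.store[key] = ASO(value)

    def get(self, key, requester):
        return self.store[key].on_get(requester)

node = DHTNode()
node.put("k", "hello")
assert node.get("k", requester="peer1") == "hello"
assert node.store["k"].meta["gets"] == 1    # the handler tracked the access
```

In Comet the handlers run inside a restricted Lua interpreter with instruction and memory limits, which is what makes running untrusted per-application code on shared DHT nodes tolerable.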

SPORC: Group Collaboration using Untrusted Cloud Resources

  • Cloud deployment is attractive because it allows collaboration in user-facing applications, but the price is that you must trust the cloud provider.
  • Goal is to support practical cloud apps for real-time collaboration (and offline working), but using untrusted servers.
  • Application logic has to be moved to the clients, and every client has to maintain a copy of the local state. The server’s role is limited to storing encrypted data and processing client updates.
  • To keep clients in sync, use “operational transformation” (technique from 1989): application provides an app-specific transformation function that is used to fix up non-commutative actions (such as deleting two different characters concurrently). This can synchronize arbitrarily-divergent clients.
  • Digital signatures aren’t enough, since the server could equivocate, so employ “fork* consistency”, which gives linearizability on an honest server, and detectable equivocation after 2 messages from a malicious server. Clients embed their history (hashed) in messages they send. Server may fork clients into partitions, but can’t then unfork them. Need to detect forks out of band. SPORC provides a way to recover from malicious forks.
  • Internally, the SPORC library keeps a history of all the operations that it sees, marked as committed or pending. Could reconstruct the client’s local state. Committed operations are fork* consistent across all clients, and pending operations are causally consistent. Clients enforce this by checking sequence numbers on each operation.
  • To commit an operation, the client encrypts and signs it, and sends it to the server. The server keeps a sequence of encrypted operations which represents the canonical state. Then it broadcasts the operation to the clients. Finally, at each client, it needs to be transformed using OT (since the recipient state has probably diverged). Only after these transformations can the operation be applied to the local state, and added to the end of the committed operations.
  • For access control, we cannot trust the server. Need to preserve causality and concurrency only makes it harder. Access control enforced using encryption with a single shared symmetric key shared between the clients. ACL changes are committed in the same way as update operations. All users have a public/private keypair in addition. Removing a user is a little more complicated than adding one: need to change the encryption key for subsequent changes and encrypt it under the public key of all of the remaining users, and also encrypt the old key under the new key (so it can be provided to any new users to access old operations).
  • If two people want to remove different users concurrently, the server will order them, but this creates a challenge for encrypting the new key. Need dependency pointers between operations and barriers: if a dependency pointer crosses a barrier, the operation will be rejected, which maintains the chain of old encryption keys.
  • To recover from a fork, use OT to resolve malicious forks by treating them as normal divergences.
  • Implemented as a client library and a generic server. The generic server can support any application since all it does is push encrypted data around. Applications are defined as operations and a transformation function. Implemented a Java CLI client and a browser-based version (using the Google Web Toolkit).
  • Evaluated on 8-core 2.3GHz AMD boxes, with one for the server, and 4 for clients. Tested the Java CLI version. Looked at latency, server throughput and time-to-join. Connected using 1G Ethernet.
  • Latency is <30ms for a text editor on up to 16 clients with low load, and slightly worse if everyone writes. Broke this down into various components: the most expensive thing is the RSA signature on outgoing messages (about 10ms per message). Could use a faster implementation like eSign to get this down into the microseconds. Client queuing delay gets worse as the number of clients increases on a write-everyone workload.
  • Server throughput gets up to 20MB/s as the payload size is increased to 16KB.
  • Q: What happens if an evil person about to be removed tries to add more evil users? The way we deal with these situations is to have multiple privilege levels (admins, writers and readers) with ACLs implemented on the client side. This is only a problem if you have a malicious admin.
  • Q: Why use a specialised server rather than just using a cloud storage provider or a DHT? In addition to storage, you need a way to keep clients in sync. We thought about doing a completely decentralised system. Advantage of a server is to get timely commits. Server is “highly-available”.
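A minimal illustration of the operational-transformation step at the heart of SPORC’s commit path, handling only concurrent single-character inserts; a real transformation function also covers deletes and more careful tie-breaking.

```python
def transform_insert(pos, other_pos):
    """Shift an insert position if a concurrent insert landed at or
    before it (ties broken in favour of the earlier-committed op)."""
    return pos + 1 if other_pos <= pos else pos

def apply_insert(doc, pos, ch):
    return doc[:pos] + ch + doc[pos:]

# Both clients start from "abc". Client A inserts "X" at position 1;
# client B concurrently inserts "Y" at position 2. The server commits
# A's operation first, so B's must be transformed against it.
doc_a = apply_insert("abc", 1, "X")                        # A applies its own op
doc_a = apply_insert(doc_a, transform_insert(2, 1), "Y")   # B's op, shifted to 3

doc_b = apply_insert("abc", 2, "Y")                        # B applies its own op
doc_b = apply_insert(doc_b, transform_insert(1, 2), "X")   # A's op, unshifted

assert doc_a == doc_b == "aXbYc"                           # replicas converge
```

The same machinery is what lets SPORC repair malicious forks: once a fork is detected, the divergent histories are treated like ordinary concurrent edits and transformed back together.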

OSDI 2010: Day 1

Monday, October 4th, 2010

Kernels: Past, Present and Future

An Analysis of Linux Scalability to Many Cores

  • Looking at Amdahl’s Law limitations (serial paths) in the Linux kernel. For example, Exim spends about 70% of time in the kernel on a uniprocessor.
  • Want to know how serious the scaling problems are, and how much work is necessary to fix them (in light of recent efforts to make new scalable OSs).
  • Used a 48-core x86 box running a recent version of Linux. Looked at 7 real applications, and analysed the scalability limitations of the Linux kernel, which did exist.
  • 3002 lines of code in 16 patches fixed the scalability limitations. Remaining bottlenecks were in the hardware or applications.
  • Ran applications with an in-memory file system to avoid a disk bottleneck. Found bottlenecks, fixed them and re-ran the application. Stopped when the application or hardware had a non-trivial bottleneck that they couldn’t fix.
  • Apps: Exim, memcached, Apache, PostgreSQL, gmake, Psearchy (inverted index building) and Metis (MapReduce implementation).
  • Exim throughput (messages per second) collapses around 40 cores and kernel time (per message) increases greatly at this point. Large explosion of time in lookup_mnt function (>30% of time). This function returns metadata about a mount point, under a spinlock. Acquiring and releasing the spinlock were using more cycles than the hashtable lookup, because of the ticket-based spinlock implementation. Cache traffic (atomic inc and read of current ticket value in acquire; increment in release) takes 120-420 cycles on a lightly-loaded system. With high load, the delay goes up to 500-4000 cycles.
  • Solution could be to implement scalable locks or message passing (like Barrelfish). But since the mount table is rarely modified, use per-core mount tables to cache this. First lookup in local core table, else do the slow path through the global table. On modification, need to invalidate the per-core tables, but since this is rare, it leads to better scalability. This fix removes the collapse at 40 cores.
  • Now see that the same functions execute more slowly on 48 cores than on 32, leading to a flattening out of scalability. Next bottleneck was dput(), which modifies the refcount on a dentry. Does an atomic decrement and test on the reference count. Reading the reference count is slow, because it is atomic, which makes cache coherency traffic occur. Concurrent cores must retry because a single core gets a hardware cache-line lock on the value.
  • Solution is “sloppy counters”: each core should hold a few spare references to each object. The reference count is stored in a per-core counter. Sloppiness occurs because multiple references on the same core will grab one reference each. On an unlink, the kernel gathers the sloppy information into the shared counter, and disables the per-core counters. Uses O(N) space in the number of cores, but leads to no cache misses in the common case. Leads to an increase in scalability over the per-core lookup table.
  • In all, 9 areas were considered this way. Some applications were also modified (amounting to 60 lines of code). All well-known parallel programming techniques.
  • Almost all applications get some speedup (except Psearchy and Metis, which were already quite high).
  • Remaining bottlenecks include hardware queues on the NIC, application-level contention, cache capacity and DRAM throughput.
  • Limitations: only looked at 48 cores and a small set of applications. Some problems still loom: fork, virtual memory, page allocation, file system and concurrent address-space modifications.
  • Q: What was your experience in mainlining these patches? Haven’t tried because working on the talk. Sloppy counters are probably questionable. Are you going to try to submit? We intend to think about it.
  • Q: Do you really believe that hacking at each piece of shared state one at a time is the right way to go? No, but up to 48 cores it seems to work well. What do you really think? I have no idea.
  • Q: How many of your optimizations would work if you ran multiple applications on subsets of the core? Are they general enough for this case? I think so, since we didn’t target particular applications with these changes, but some applications may not benefit.
  • Q: How did you implement sloppy counters given that cache lines are pretty large, packing or wasting memory? We packed multiple per-core counters onto a single cache line. Did you find that managing these counters then led to a lot of overhead? For some counters, yes, such as sockets, which allocate a dentry per socket. The fix was not to allocate a dentry for a socket.
  • Q: A big ball of code wins in the long term because it is easier to make larger changes as well as smaller changes. Did you find that to be the case (that it was easier to modify the Linux kernel than other microkernels)? [Taken offline.]
  • Q: Looking at spinlocks, why did your original results have an elbow in the curve at 40 cores, rather than a gradual degradation? The interconnect on the machine becomes congested. The other effect was an n-squared effect based on the number of waiters.
  • Q: Should we or should we not be designing new OS architectures for multicore systems? Up to 48 cores, it doesn’t seem like Linux has a problem.
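The sloppy-counter idea can be sketched in Python. This is purely illustrative: the kernel implementation packs per-core counters onto cache lines and has no such API; the class and method names here are invented.

```python
class SloppyCounter:
    def __init__(self, ncores):
        self.shared = 0                 # the single shared reference count
        self.local = [0] * ncores       # spare references held per core
        self.enabled = True

    def inc(self, core):
        # Common case: take a reference from the per-core counter, never
        # touching the shared count (so no cross-core cache-line traffic).
        self.local[core] += 1

    def dec(self, core):
        if self.local[core] > 0:
            self.local[core] -= 1       # return a spare reference locally
        else:
            self.shared -= 1            # slow path: touch the shared line

    def reconcile(self):
        # e.g. on unlink: gather the per-core spares into the shared
        # count and disable the local counters to get an exact value.
        self.shared += sum(self.local)
        self.local = [0] * len(self.local)
        self.enabled = False
        return self.shared

c = SloppyCounter(ncores=4)
c.inc(0); c.inc(0); c.inc(3)
c.dec(3)
assert c.reconcile() == 2   # exact count recovered at reconciliation
```

The trade-off is visible in the sketch: O(N) space in the number of cores and an O(N) reconciliation step, in exchange for increments and decrements that stay core-local in the common case.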

Trust and Protection in the Illinois Browser Operating System

  • Today, the web is ubiquitous, and runs in web browsers on many different operating systems.
  • However, these applications are vulnerable to attacks at many layers. There are web-app vulnerabilities, browser vulnerabilities, and OS and library vulnerabilities. The last two are the rarest but also the most severe, so we consider these only in this talk.
  • Web browsers, such as Firefox, are monolithic and easy to exploit, since they run in a single address space. “Secure” web browsers are not enough, since they still have a huge TCB (including the TCP stack, X server, file system, drivers, etc.). A microkernel would be better, but still need to trust all of the system components.
  • Illinois Browser OS (IBOS) is a new OS design to make a single app more secure, which leads to a reduced TCB.
  • The IBOS kernel makes browser abstractions first-class, i.e. web page instances, cookies, IP connections, etc.
  • Security invariants check key properties without understanding the implementation of the particular components. Challenge is to check security properties using these “invariants”.
  • An IBOS label is a tuple containing a protocol, domain and port. These are inferred and applied by the kernel. Leads to different “web page instances” for different originating domains. Since these may then access several servers (for content, images, ads, etc.), the web page instances can spawn multiple “net processes”. Enforcing this requires a split driver architecture: on sending an Ethernet frame, the IBOS kernel checks the TCP port and IP address in net process label, and hands off the actual transfer to the NIC driver.
  • A basic frame buffer driver splits the screen into three parts: an address bar at the top, which shows the current page address; toolbars that provide global control of the system; and the content itself. The IBOS kernel has a tab abstraction which multiplexes the display between different web page instances. Input devices are routed to the visible tab, which prevents a keylogger from hijacking another instance’s session.
  • Implemented on L4Ka::Pistachio, using WebKit and Qt. Also provide a “Unix-like” API through Qt on uclibc, which supports a PDF viewer.
  • TCB size is 42 KLOC, compared to 5.6 MLOC for Firefox+Linux+X, and 4.4 MLOC for ChromeOS. Could potentially use formal methods to verify it. Avoided 27 CVE exploits published in 2010, and didn’t prevent 1 relating to device drivers. Also avoided 135 (77%) browser vulnerabilities published in Google Chrome’s bugtracker. Scored well on memory exploitation and sandbox bypassing vulnerabilities.
  • Evaluated page load latency time for Google Maps, Bing, Craigslist, CS@Illinois, Wikipedia and Facebook. IBOS is close to Firefox and Chrome, and sometimes better (though sometimes a little worse).
  • Q: Does IBOS have the same functionality as Chrome and Firefox, especially for JavaScript etc.? Yes: the presentation is actually using HTML5 on IBOS. IBOS uses WebKit, which is pretty standard.
  • Q: What do you think is fundamental about the web model that makes this different from core microkernel work? Web browsers are a very important application, so borrowing ideas from OS design could help to improve them.
  • Q: In terms of user experience, can you have two frames co-existing on the same screen? Currently don’t support windowed display, only tabs.
  • Q: Does the IBOS kernel manage DMA, and is it specific to a particular network device? Currently only support one driver, but should generalise to other drivers that support DMA buffers.
  • Q: Does the design inherently prevent cross-site scripting, or can this be relaxed? Don’t actually prevent these attacks, but with the right policy, we can contain these attacks. The paper details a custom policy that allows sites to communicate with other sites. Is your presentation running on top of IBOS? Yes.
  • Q: Is there any reason why Firefox or IE could not run on top of IBOS? They could if you changed the windowing abstraction.
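The label-based check on outgoing traffic might be sketched like this. All names and data structures here are hypothetical; IBOS performs the check in the kernel on Ethernet frames before handing them to the NIC driver.

```python
from collections import namedtuple

# An IBOS label: the origin of the web page instance / net process.
Label = namedtuple("Label", ["protocol", "domain", "port"])

# Assumed resolver state mapping allowed domains to addresses.
DNS = {"example.com": "93.184.216.34"}

def kernel_allows_send(process_label, dst_ip, dst_port):
    """Before a frame reaches the NIC driver, verify that its
    destination matches the sending net process's label."""
    return (DNS.get(process_label.domain) == dst_ip
            and process_label.port == dst_port)

web_page = Label("https", "example.com", 443)
assert kernel_allows_send(web_page, "93.184.216.34", 443)
assert not kernel_allows_send(web_page, "10.0.0.1", 443)       # wrong host
assert not kernel_allows_send(web_page, "93.184.216.34", 80)   # wrong port
```

The point of the split driver architecture is that this small check is all the kernel needs to understand: the NIC driver itself (and the rest of the network stack) can stay outside the TCB.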

FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

  • System calls are typically synchronous and exception-based for the call and return. This is inflexible for scheduling, and expensive due to the mode-switch and pollution of processor structures when you multiplex these at a high frequency.
  • FlexSC is a shim that sits above the OS and provides exception-less syscalls, and FlexSC-threads is a threading library that uses FlexSC. Improves throughput in MySQL and Apache without changes to their codebase.
  • Took an application (Xalan from SpecCPU 2006) that does almost no syscalls, and injected exceptions with various frequencies. Emulated a null syscall and a write syscall. Measured only user-mode time and assume that it will be unaltered with the exceptions. However, almost half of the performance is lost. Indirect (pollution of structures) cost is the dominant problem.
  • Indirect cost: a Linux write() call can cause up to 2/3 of the L1 data cache and data TLB to be evicted. This also impacts the performance of OS code.
  • Exceptionless syscalls use a syscall page into which calls and results are written; syscall kernel threads do the processing. This leads to fewer mode switches, allows for batching of syscalls, and allows for dynamic multicore specialization, which further reduces the costs. In the ideal case, there is no need to context switch at all. The application writes its syscall registers into an entry on the page, and loops doing other work while the result is not marked as DONE.
  • Syscall threads are kernel-only threads that are part of the application process and run in the same process space. They execute requests from the syscall page and are schedulable on a per-core basis.
  • System call batching allows user space to make as many calls as it can, then jumps into the kernel which executes all of them before switching back to user mode.
  • Dynamic multicore specialization allows the OS to map syscall threads to particular cores, and user-mode threads to other cores. This can adapt to workload needs.
  • The user-mode interface is quite different to synchronous system calls. However, it is very suitable for event-driven servers, which already process asynchronous events. Multi-threaded applications can use FlexSC-Threads, which is compatible with Linux pthreads and requires no changes to the apps or even recompilation.
  • FlexSC-Threads is a hybrid (m-on-n) threading model, with one kernel-visible thread per core. It redirects system calls using libc wrappers to the syscall page, and switches to another runnable thread. When all threads are blocked, the kernel is entered, and we wait for at least one syscall to complete.
  • Evaluated on Linux 2.6.33 on a quad-core Nehalem, running at 2.3GHz. Sysbench on MySQL and ApacheBench on Apache were the workloads. Compared Linux NPTL (“sync”) versus FlexSC-Threads.
  • OLTP on MySQL on a single core sees a 15% performance improvement in throughput as the request concurrency increases. On 4 cores, see a bigger improvement. For latency, FlexSC dramatically improves the 95th-percentile latency, and a 30% improvement in the average case.
  • Individual performance metrics gathered using hardware performance counters. FlexSC achieves higher instructions-per-clock cycle than sync. Lower cache miss rates in several structures leads to this in both the kernel and user-mode.
  • Apache performance can improve by 80 t0 90% on a single core, and 115% on four cores. Latency is halved on average, and again is much better in the 99th percentile. Instructions per cycle are even better for Apache (almost doubling in user-mode and more than doubling in kernel-mode).
  • Exception-less syscalls can coexist with legacy ones. Doesn’t require a drastically new OS architecture.
  • Q: Did you try to upstream the patches for the kernel portion or plan to? No. Why? A lot of time and effort that I would prefer to use in other ways.
  • Q: Does this reduce security by sharing memory between the kernel and user-mode? The pages are per-process, within the application’s address space. Anything you could do in a process before, you could do now.
  • Q: Do you have an architecture for having outstanding syscalls with dependencies between them? No. We rely on servers that have concurrent, independent work to do. There is related work that tries to do this (composite syscalls from Stony Brook?), but we don’t have support for that.
  • Q: How specific are your observations to the fact that there is a protection boundary, and how much is due to the modularity boundary? A project called “cohort scheduling” made similar observations (jumping around to do different things will cause problems), so some of our observations are related to this. The boundary only adds to the problem because of the direct cost of the exceptions, which we measured earlier.
  • Q: Did you measure to what extent you get cache pollution from doing different work after the syscall? Didn’t measure it, but we did see that the TLB miss rate increased for some cases due to going through multiple user-level threads. So we could do even better if we tried to restructure the application.
  • Q: If you dedicate a core to kernel syscall handling, how do you approach the scheduling problem? Tried many things, but the simplest algorithm ended up performing the best: get a static list of cores, and go to the first non-busy core when a request comes in. Turns out to be very elastic, but we were only running a single application. May need to fine-tune this with multiple applications.
  • Q: Have you noticed or can you imagine a situation where having more cores than threads could cause you performance degradation? Do you have to have more threads than cores? We targeted server workloads. It would be a bad idea to run scientific computing workloads on top of this.
  • Q: What happens to this system when you have lots of unrelated threads demanding service (since you poll the syscall pages to pick up messages)? How does it scale up? [Taken offline.]
  • Q: Do you feel like you’re going to be fully backward-compatible? We ran four or five different applications, and the only problem we had was with get_tid (get thread ID), which required a small patch. Other than that, we didn’t see other backward compatibility problems.
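The exception-less mechanism described above can be sketched in miniature: user code posts a request into an entry on a shared syscall page and polls for completion instead of trapping, while a separate syscall thread drains the page. This is a toy Python simulation of the control flow only, not FlexSC's actual ABI; the entry format and all names are invented for illustration.

```python
import threading

# States for a syscall page entry (illustrative, not FlexSC's real layout).
FREE, SUBMITTED, DONE = 0, 1, 2

class SyscallEntry:
    def __init__(self):
        self.state = FREE
        self.op = None      # which "syscall" to run
        self.args = None
        self.result = None

def syscall_thread(page, stop):
    """Kernel-side worker: scans the page and executes submitted entries."""
    while not stop.is_set():
        for e in page:
            if e.state == SUBMITTED:
                e.result = e.op(*e.args)   # execute on the caller's behalf
                e.state = DONE

def submit(page, op, args):
    """User side: claim a free entry and post the request without trapping."""
    for e in page:
        if e.state == FREE:
            e.op, e.args, e.state = op, args, SUBMITTED
            return e
    raise RuntimeError("syscall page full")

page = [SyscallEntry() for _ in range(8)]
stop = threading.Event()
worker = threading.Thread(target=syscall_thread, args=(page, stop))
worker.start()

e = submit(page, lambda a, b: a + b, (2, 3))
while e.state != DONE:   # keep doing other useful work instead of blocking
    pass
stop.set(); worker.join()
print(e.result)          # -> 5
```

Batching falls out naturally: user code can fill several entries before any of them is picked up, and the syscall thread processes all pending entries in one sweep.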

Inside the Data Center, 1

Finding a Needle in Haystack: Facebook’s Photo Storage

  • In April 2009, 1.5PB of photo data (15 billion photos in 4 versions), with a 25TB/week upload rate and 550k images/second serving rate.
  • Currently: 65 billion photos, 20PB of data. 60 TB/week upload and 1 million images/second serving rate.
  • A simple design would be to use NFS, with one file per image. This works for a typical website, which has a small working set and infrequent access of old content (99% hit rate on the CDN). Facebook has a large working set with frequent access of old content, leading to an 80% CDN hit rate.
  • I/O bandwidth gets wasted doing small metadata I/Os. Large directories required 10 iops per image read, improving to 3 iops per image with smaller directories. Using a file handle cache got to 2.5 iops per image read.
  • Haystack store replaces the photo server and storage layer in the NFS based design. A typical unit is a 2U node with 12 1TB SATA drives in RAID-6 (as a single volume). A single 10TB xfs filesystem runs across these. Haystack is a log-structured append-only object store, and each node hosts about 100 of these, at about 100GB in size.
  • The file layout has a superblock, followed by "needles" (objects), which are padded to 64 bytes and contain a key, an alternate key (signifying the size of the image), the data and a checksum. Magic numbers are used to reconstruct the Haystack in case of corruption (which, in practice, doesn't happen).
  • The index file has a subset of the needle metadata, and includes the offset of the needle in the log file.
  • The Haystack photo server takes HTTP requests and turns them into Haystack operations. It builds an index of all images in the Haystack in a hashtable, storing 8 bytes per image scale (a 32-bit offset and a 16-bit size), i.e. 32 bytes per photo, compared to 600 bytes per inode. A 5GB index can handle 10TB of photos.
  • From the URL, the image and its size are determined from the in-core index, and one iop reads the data from disk. If the size of the image exceeded the maximum read size of the device, a read could take multiple iops, but this doesn't happen in practice. The alloc_size mount option is used to grow files by 1GB extents. The only problem is when photos cross a stripe boundary (6 or 7% of photos).
  • Multiwrite for modifications, since users typically upload many images at a time. Asynchronously append images to the file, then do a single fsync(). The index file is asynchronously appended to without a flush (unless a threshold of outstanding dirty records is exceeded), since it can be reconstructed from the log.
  • Delete just marks photos as deleted in the incore index and the needle header (synchronously).
  • Compaction generates a copy that skips duplicates and deleted photos. Works online and can accept reads, writes and deletes while this goes on.
  • Haystack directory maintains a logical-to-physical volume mapping. 3 physical haystacks on 3 nodes per logical volume. Writes are balanced across logical volumes, and reads are balanced across physical haystacks.
  • On upload, web server receives photos and scales them. The directory provides the logical volume, and writes synchronously all three copies (latency is relatively small compared to the browser-to-web-server latency).
  • On download, web server generates URLs in the web page based on information from the Haystack directory. The Haystack cache may return the image data, or else the cache fetches it from the Haystack store. The image is returned to the CDN which returns it to the browser.
  • Implemented in 8.5KLOC of C++. Took two engineers 4 months from inception to initial deployment.
  • More improvements may be required to handle high-res images; software RAID-6 or RAID-10 might improve performance. Considering new drives with on-board flash to store the index. Currently, SSDs are not cost-effective.
  • Q: Did you find prefetching to be useful or are your reads so random? More and more, we are seeing photos accessed by tagging, so prefetching on the album index structure is less useful. Thumbnails are usually cached, so access to Haystack looks like purely random reads.
  • Q: Have you considered using social data to figure out what your access stream might look like? The application (implemented in JavaScript) has a much better idea of what will be loaded, so it would make more sense to put the intelligence there than in Haystack. This also keeps Haystack simple.
  • Q: Have you considered migrating needles between Haystacks for load balancing? RAID6 limits us in this respect, and is simply horrible for mixed workloads. RAID6 would not cope with migration.
  • Q: Are high-res images supported for existing images? No, we don’t have the data stored.
  • Q: What do you get out of having Haystacks as big as 100GB? This cuts down on file system metadata, but they could be smaller. Some Haystacks are even larger, and they run just fine. The only limit is 256GB; going beyond that would require 64-bit offsets in the index. We also want to keep the Haystack directory as small as possible, so big Haystacks keep this down. Want to distribute this in future.
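The index arithmetic quoted in the talk (8 bytes per image scale, four scales per photo, a 5GB index serving 10TB) can be sanity-checked with a back-of-the-envelope sketch. The arithmetic and the inferred average photo size are ours, not numbers from the talk:

```python
# Back-of-the-envelope check of the in-core index sizing.
bytes_per_scale = 8                      # 32-bit offset + 16-bit size, padded
scales_per_photo = 4
index_bytes_per_photo = bytes_per_scale * scales_per_photo
print(index_bytes_per_photo)             # -> 32 bytes per photo, vs ~600/inode

index_ram = 5 * 2**30                    # a 5GB in-core index...
photos = index_ram // index_bytes_per_photo
print(photos)                            # -> 167,772,160 indexable photos

data_served = 10 * 2**40                 # ...covering 10TB of photo data
print(data_served // photos)             # -> 65536 bytes per photo on average
```

The implied ~64KB average per photo (summed across the four scales) is our inference from the 10TB/5GB ratio.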
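Compaction as described above (skipping duplicates and deleted needles while staying online) amounts to a pass that keeps only the latest live version of each needle. A hedged Python sketch; the record format here is invented for illustration, whereas the real store works against an on-disk log with magic numbers and checksums:

```python
# Sketch of Haystack-style compaction over an append-only log.
def compact(log):
    """log: list of dicts with 'key', 'alt_key', 'deleted', 'data'."""
    latest = {}
    for needle in log:                 # later entries supersede earlier ones
        latest[(needle['key'], needle['alt_key'])] = needle
    return [n for n in latest.values() if not n['deleted']]

log = [
    {'key': 1, 'alt_key': 'thumb', 'deleted': False, 'data': b'v1'},
    {'key': 1, 'alt_key': 'thumb', 'deleted': False, 'data': b'v2'},  # dup
    {'key': 2, 'alt_key': 'large', 'deleted': True,  'data': b''},    # deleted
]
print([n['data'] for n in compact(log)])   # -> [b'v2']
```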

Availability in Globally Distributed Storage Systems

  • Globally distributed storage systems are built from commodity servers, arranged into racks, then clusters.
  • 200+ clusters, pools of thousands of clients, 75PB filesystems.
  • Want to look at how available is our data, what are the causes of unavailability, and how can we tune our systems to make this better?
  • 95th percentile of unavailability periods is about 15 minutes. Median due to node restarts is about 1 minute. Planned reboots cause longer unavailability periods (closer to 10 minutes median). Unplanned reboots are more severe, albeit not much (a few more minutes).
  • Correlated failures lead to a "failure burst", which corresponds to many failures happening within 2 minutes of the previous one. 37% of all failures are part of a burst across at least 2 nodes. This has been fairly consistent over the last year. Showed an example of a burst of 50 servers failing in quick succession (failure of a shared power or network domain). Also see bursts of about size 20 (a rolling reboot of machines with planned staggering, or a cascading failure).
  • Small failure bursts are not necessarily rack-correlated, but large failure-bursts are highly rack-correlated.
  • Came up with a scoring function for rack-correlation: the probability that a burst of the same size affecting randomly-chosen nodes in that cell would have a smaller burst score.
  • Data is stored in stripes, made up of chunks with fixed-size data and code blocks. Use erasure coding (such as Reed-Solomon) to reconstruct missing chunks from available ones.
  • Perform rack-aware placement on blocks. MTTF increases when rack-aware placement is used, for all encoding schemes (replication and Reed-Solomon). Different encoding schemes do better for different sizes of bursts.
  • Used a Markov chain to model the number of available chunks, with transitions due to chunks failing and recovering.
  • Findings on recovery time: using Reed-Solomon (6, 3), a 10% reduction in recovery time leads to a 19% reduction in unavailability, without correlated failures.
  • Correlation matters a lot. Modelling correlated failures gives a substantial reduction in MTTF. However, it reduces the benefit of increased data redundancy.
  • Improving availability below the node layer does not significantly improve data availability (reducing the node failure rate is the most useful thing to do).
  • Multi-cell replication (across data centers) leads to a higher MTTF, but there is a trade-off between limiting the inter-cell recovery bandwidth and higher replication.
  • Q: Could controlling the order of planned upgrades and so on have a big impact on MTTF/availability? Yes. We need to do a better job of teasing apart all the reasons for unavailability (finding exact causes).
  • Q: Is it true that most of the time machines are unavailable due to long-lasting events or lots of short events? Graphs in the paper answer this. There are challenges in trying to model rare events, and we tend to be conservative.
  • Q: Can you talk about the cost effectiveness of different encoding schemes, and are you actually using erasure coding? We are considering using Reed-Solomon. What’s the drawback and why didn’t you start with this? For MapReduce, you might have parallel copies that you want many workers to read from. Also chunk recovery is more involved for Reed-Solomon, and has to touch more servers.
  • Q: If the number of cores per box goes up, what effect would this have on availability (with a single motherboard failure taking out more work)? Also, what about power saving and TCO? Multi-core servers are interesting to us in terms of work done per watt of electricity.
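The burst definition used above — a failure joins the current burst if it occurs within 2 minutes of the previous one — can be sketched directly. The timestamps below are invented:

```python
# Group failure timestamps (seconds) into bursts per the talk's definition.
def group_bursts(times, window=120):
    bursts, cur = [], []
    for t in sorted(times):
        if cur and t - cur[-1] > window:
            bursts.append(cur)
            cur = []
        cur.append(t)
    if cur:
        bursts.append(cur)
    return bursts

failures = [0, 50, 130, 500, 560, 1000]
bursts = group_bursts(failures)
print(bursts)            # -> [[0, 50, 130], [500, 560], [1000]]

# Fraction of failures that are part of a multi-node burst (cf. the 37% figure).
frac_correlated = sum(len(b) for b in bursts if len(b) >= 2) / len(failures)
print(frac_correlated)
```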
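To see why both the encoding scheme and correlation matter, here is a toy availability calculation under the (optimistic) assumption of independent chunk failures — deliberately not the paper's Markov model. Reed-Solomon(6,3) stores 9 chunks and survives any 3 losses at 1.5x storage; 3-way replication survives any 2 losses at 3x storage:

```python
from math import comb

def rs_available(p, data=6, code=3):
    """P(stripe readable): at most `code` of the data+code chunks are lost."""
    n = data + code
    return sum(comb(n, i) * (1 - p)**(n - i) * p**i for i in range(code + 1))

def replication_available(p, copies=3):
    """P(at least one of the copies is up)."""
    return 1 - p**copies

p = 0.01   # per-chunk unavailability, assumed independent (optimistic)
print(f"RS(6,3):     {rs_available(p):.7f} at 1.5x storage")
print(f"replication: {replication_available(p):.7f} at 3x storage")
```

With independent failures both schemes look extremely available; the paper's point is that correlated bursts break exactly this independence assumption, which is why modelling correlation reduces the apparent benefit of extra redundancy.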

Nectar: Automatic Management of Data and Computation in Datacenters

  • A study shows that 20 to 40% of computation in a data center is redundant (based on production clusters at Microsoft), and over 50% of storage space is occupied by obsolete or seldom-used data sets (based on a research cluster at MSR). Thus resource management is a problem.
  • Example application: click-log processing. All computations tend to do parsing and pre-processing, but only really need to do this once. Also, as the click-log grows, there is re-done computation on the previously-existing data.
  • 50% of data in the research cluster had not been accessed in the last 275 days.
  • Programs specified using LINQ, which may include user-defined functions (written in a .NET language). Submitted to DryadLINQ, which uses Dryad to execute it on the cluster. Data is stored in an in-house filesystem, called TidyFS.
  • Nectar interposes on submission to rewrite the LINQ program, performing lookups on a cache server. Dryad is modified to add cache entries based on computations that actually run. The distributed store contains Nectar program and data stores.
  • Mostly, the cache server gets several partial hits on the submitted query. The program rewriter will generate many programs that incorporate these hits into the submitted query, and choose the one that is cheapest to execute.
  • Considered an incremental MapReduce-style computation (Select-GroupBy-Select). Can cache the result of the GroupBy and the result of the Reduce. A new operator called MergeGroup can incorporate the previous GroupBy results with the results of Mapping and Grouping the increment data.
  • The executed program is stored in the Nectar program store, which contains a fingerprint based on the program and the data. The program fingerprint is computed from a static analysis of the program text. Dryad puts the result data in both the user store and the Nectar data store, and puts a mapping in the cache server from the fingerprint to the data store.
  • If the data has been deleted (due to garbage collection or a lack of space), the program store provides a program that can be used to regenerate the file.
  • The cache server uses a SQL database to keep the cache entries. A cache entry has usage stats, lookup stats and a URI to the results. The garbage collector uses these stats to make victim decisions.
  • Cache policies for deciding when to insert and delete entries. Final results are always cached, but sub-queries are cached when a subcomputation is popular enough. Deletion policy computes a “value” for each entry, based on the size of the result (cost), cumulative execution time (saving), the last time it was used and the number of times it was used.
  • Garbage collector uses mark and sweep, based on information in the cache server. However, the cache can be inaccurate (due to concurrent execution, or lack of usage statistics). Leases protect new datasets.
  • Evaluated based on logs from production clusters. 36000 jobs written in a DryadLINQ-like language. Simulated the effect of caching on these jobs. Between 15% and 65% of jobs would benefit from caching, across 25 production clusters. Overall, 25% of jobs benefit.
  • Also performed controlled experiments on a research cluster. Evaluated a common subcomputation (parsing and lexical analysis) shared between three jobs, which leads to about a 98% saving in machine time.
  • Evaluated incremental click log processing as data was added from day to day. Saved as much as 50% execution time when executing on the 5th day, having the 4th day’s results cached. Could do better when combining results from existing computations.
  • Q: Given that storage capacity is increasing so quickly, the precious resources are memory and I/O bandwidth, so do you find that to be true in your system, and how would you evaluate that kind of tradeoff? It is very important not to waste machine time in a data center. Could you model this to find the optimal trade-off between storing and recomputing data (in terms of the garbage collector)? It depends on the storage pressure, so we may need a threshold or ratio to decide how much to store on disk. Some heuristic basically? Yes.
  • Q: How do you know or ensure that the computations really are deterministic? Is it built into the language or is it something that you assume? DryadLINQ assumes that the result is deterministic. If the user uses random numbers, Nectar will always return the same result. It is not a problem with Nectar, rather it is in the DryadLINQ assumptions.
  • Q: Are we just reinventing distributed databases under different names, as a community? We did learn a lot from databases (incremental view management and materialized views). But this is work at much larger scales.
  • Q: Could we combine some of the lineage detection techniques with systems that are a bit more heterogeneous (e.g. transferring data from cluster to cluster)? The big difference is that Nectar maintains the program that generates a derived data set, so we can always regenerate it. However, we assume that we never delete “primary” data sets.
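The MergeGroup idea above — reuse cached GroupBy results for the old data and fold in groups computed over only the increment — can be sketched for a simple counting job. The function names are ours for illustration, not Nectar's actual operators:

```python
from collections import defaultdict

def group_count(records):
    """The GroupBy stage: count occurrences of each key."""
    groups = defaultdict(int)
    for key in records:
        groups[key] += 1
    return dict(groups)

def merge_groups(cached, delta):
    """The MergeGroup stage: fold increment groups into cached groups."""
    merged = dict(cached)
    for key, n in delta.items():
        merged[key] = merged.get(key, 0) + n
    return merged

old = ['a', 'b', 'a']            # previously processed click log
new = ['b', 'c']                 # today's increment
cached = group_count(old)        # computed once, then cached
incremental = merge_groups(cached, group_count(new))
assert incremental == group_count(old + new)   # same answer, less recompute
print(incremental)               # -> {'a': 2, 'b': 2, 'c': 1}
```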
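The deletion policy's "value" per cache entry can be illustrated with a toy scoring function. The paper names the inputs (result size as cost, cumulative execution time as saving, last use time and use count), but the exact weighting below is invented:

```python
# Hedged sketch of a Nectar-style eviction score; the formula is illustrative.
def cache_value(entry, now):
    saving = entry['cum_exec_secs'] * entry['uses']   # machine time avoided
    age = now - entry['last_used']                    # staleness penalty
    return saving / (entry['size_bytes'] * (1 + age))

entries = [
    {'id': 'hot',  'size_bytes': 10**6, 'cum_exec_secs': 3600,
     'uses': 40, 'last_used': 90},
    {'id': 'cold', 'size_bytes': 10**9, 'cum_exec_secs': 60,
     'uses': 1,  'last_used': 0},
]
now = 100
victim = min(entries, key=lambda e: cache_value(e, now))
print(victim['id'])   # -> 'cold': large result, rarely used, little time saved
```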

Security Technologies

Intrusion Recovery using Selective Re-Execution

  • Need a combination of proactive security and reactive recovery mechanisms.
  • The status quo for recovery is limited: anti-virus tools and backup tools. These are, respectively, limited to predictable attacks and liable to cause data loss on reverting, so admins have to track down the effects manually. The challenge is to disentangle the changes made by the attacker from those made by legitimate users.
  • This work helps users to do this disentangling on a single machine. The idea is to roll back affected objects after an intrusion is found, and then re-execute actions that were indirectly affected. This is a new approach to intrusion recovery, based on an action history graph using re-execution, "predicates" and "refinement". Implemented a prototype on Linux that recovers from 10 real attacks.
  • Example attack scenario: modify /etc/passwd to add a new account, install trojans in pdflatex and ls to restart and hide the botnet. Afterwards, a new legitimate user is added.
  • Strawman idea: taint tracking on all OS-level dependencies in the system. Taint travels through shared files, but we have to be conservative in assumptions about how taint spreads, and so taint spreads too widely.
  • Second strawman: everything executes in a VM, and periodic checkpoints are taken. When an attack input is detected, rollback to the previous checkpoint, and replay inputs except the attack. Re-execution would be very expensive. Also, the original inputs may be meaningless in the context of the new system (due to non-determinism). Deterministic re-execution doesn’t work because the inputs themselves are being changed.
  • Selective re-execution uses an action history graph, which is a more detailed version of the taint-tracking dependency graph. Actions have dependencies, such as written/read data, exec arguments and exit codes. Selective re-execution only reruns some processes (like adduser in the example scenario). However, what about the exit code from adduser?
  • Many suspect computations are not affected: an attacker writing to a file may not taint everyone who reads from different parts of the file.
  • To minimize re-execution, specify predicates which allow Retro to skip equivalent computations. Also use refinement, which only replays fine-grained actions (not entire login sessions).
  • Example predicate: exit code. With the same exit status as before, no need to taint the admin shell after adduser Alice.
  • Example refinement: individual functions. So could do getpwname() instead of a generic read() on /etc/passwd.
  • What if the effect of the attack was externally visible? This cannot be solved in the general case (e.g. recalling spam sent from an attacked host), but one could, for example, email the user a diff of the output from a terminal session.
  • Implemented in about 4000 lines of C, with another 200 lines of Python for the repair controller. Uses btrfs as a checkpointing file system, and ptrace to shepherd the re-execution of processes and skip equivalent syscalls.
  • Evaluated by looking at whether Retro is better than manual repair, and what the cost of using Retro is during normal execution.
  • Used 2 real-world attacks from a honeypot. 2 synthetic challenge attacks (including the example scenario) and an sshd trojan. And 6 attacks from Taser. Retro can recover from all attacks. In 6 cases, it required no user input. However, sometimes need to skip the attacker’s login attempt. Two required significant help to get beyond skipped network traffic. However, Retro can pinpoint the exact objects that are involved, which is better than the manual status quo.
  • Repair cost is proportional to the extent of the attack. Worst case is the sshd trojan, but the other cases taint far fewer objects. Repair time is proportional to the number of objects, rather than the log size. This is much more efficient than VM re-execution.
  • Evaluated the runtime overhead by running HotCRP from just before the SOSP 2007 deadline. CPU cost was 35%, with a storage overhead of 100GB/day. Also looked at Apache-small-static-files and kernel recompilation workloads, which had much higher overheads in terms of CPU and storage. However, even the worst case can still store 2 weeks of logs on a 2TB disk.
  • Q: How do you pick the level at which you record and re-execute actions? Did you experience any cases where it was fragile to small changes in the replay? We reintroduced higher-level managers in response to false dependencies that we saw. The interface between the repair controller and the repair managers is well-defined so it would be easy to add new abstractions if necessary.
  • Q: Could you tune the system to record less-precise information which would make it harder to recover, but pay a much lower cost? There are two kinds of trade-offs to make: either the number of user actions to preserve, or assume that some things won’t get compromised. Could also trade-off logging cost for repair time (by re-executing more things).
  • Q: What do you do about non-deterministic programs and negative information flow? Just want some acceptable execution that isn’t compromised, so actually non-determinism may be okay. Could this modify my data on disk? Yes. Can only provide the guarantee of an “acceptable” execution, not the same.
  • Q: What if the attacker could assume that Retro would be there: could he cause a large number of false positives? Well, the sshd trojan is perhaps a case like that, but our hope is that it would be much easier to detect these attacks, but we haven’t experimented with this.
  • Q: Did you evaluate throughput and latency for web workloads? Is there any protection of Retro from attackers? How hard would it be to adapt Retro to a closed-source system like Windows? For Windows machines, you would have to implement many more managers, but we believe the ideas are still applicable. For protecting Retro, we use SELinux to protect the kernel and Retro, but we assume that the kernel is intact, although we could consider a VMM approach. The libc wrappers? Have to be careful with those. [Throughput and latency question dragged offline.]
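The exit-code predicate from the adduser example above can be sketched as a simple equivalence check: after rolling back and re-executing a repaired action, a downstream consumer is tainted only if the value it actually observed changed. The structure below is illustrative, not Retro's implementation:

```python
# Sketch of Retro's predicate idea: skip re-execution of consumers whose
# observed dependency is unchanged after repair.
def needs_reexecution(recorded, replayed, predicate='exit_code'):
    """Taint the consumer only if the dependency it observed differs."""
    return recorded[predicate] != replayed[predicate]

# adduser re-ran without the attacker's account; it still exited with 0, so
# the admin shell that only observed the exit code is left untouched, while a
# reader of the file contents would be re-executed.
recorded = {'exit_code': 0, 'passwd_contents': 'root, attacker, alice'}
replayed = {'exit_code': 0, 'passwd_contents': 'root, alice'}
print(needs_reexecution(recorded, replayed))                     # -> False
print(needs_reexecution(recorded, replayed, 'passwd_contents'))  # -> True
```

Refinement plays a similar role at a finer grain: checking `getpwnam()`-level results rather than whole-file `read()`s keeps the set of tainted consumers small.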

Static Checking of Dynamically-Varying Security Policies in Database-Backed Applications

  • Authentication is complicated, especially for web apps, and getting it wrong can have major consequences. Typical development involves thinking of some attacks, and auditing for these. A more principled approach involves setting a security (information flow and access control) policy on a resource.
  • One approach is dynamic policy checking: adding metadata to objects in the system and checking it before objects interact with the environment. This is easy to add to existing programs and very flexible, but it only finds bugs on tested program paths, and there is a performance overhead.
  • Alternatively, could do static checking, which can run at compile time and involves no changes to runtime behaviour. But this usually requires extensive annotations, and has limited policy expressiveness if the type system is used.
  • UrFlow is a new analysis approach for the Ur/Web programming language. Contains a flexible and programmer-accessible policy language based on SQL. However, it runs statically at compile time and adds no runtime overhead.
  • Ur/Web has integrated parsing and type-checking of SQL and XML (HTML).
  • Can have simple policies based on whether rows exist in a table. Or can reason about what the client “knows”, such as string literals and contents in the source code. Can also have policies that join between multiple tables.
  • Can use first-order logic to reason about what is known at each point in the program. Then reconcile that with the SQL query policy (which may also be expressed in first-order logic). Then use an automated theorem prover to prove that the code meets the policy.
  • Program code is turned into a finite set of execution paths, and symbolic execution is done on those paths. This leads to a series of states, and a series of check predicates.
  • Evaluated using case studies for representative web applications: secrets, poll, user DB, calendar, forum and gradebook. 8 to 134 lines of code in the policies (forum most complicated).
  • Idea of using SQL is to take advantage of code that is already written in web apps, and get a free lunch from the checking. Hope to add these to a more mainstream language.
  • Q: Do you have any intention to analyse how hard it is to write policies? The analysis is never going to change the behaviour of the program, so no policies should block legitimate access to data. What about indirect paths to data? This is a hard problem and I don’t have an answer. The code author’s job is made easier by moving to a declarative paradigm.
  • Q: To what extent can your program give a better answer than just a yes or no from the theorem prover? When the analysis thinks it detects a violation, it displays the instruction that it thinks might be wrong and all of the facts it knows at that point. However, at least this is a static view rather than a dynamic view of the program. There’s probably a heuristic for filtering out irrelevant facts, but I haven’t thought much about the details.
  • Q: How do you manage the accumulation of security policies over time and becoming cumbersome? The language supports ways of encapsulating common patterns, which can help to shrink down the size of policy that you need to read.
  • Q: Is there a theoretical difference to existing security automata [inaudible]? Techniques very similar to software model checking. Can it be applied to type analysis? I haven’t seen any type systems that could reason about this.
  • Q: What about security policies that refer to exogenous facts that don’t exist in the database (like timed event sequences)? You can expand the database to store state fields that allow you to encapsulate the model. This would be worth looking into.

Accountable Virtual Machines

  • Scenario of a multiplayer networked game. In Counterstrike, the amount of ammo is local state, and fire events are broadcast to other players. So a player can modify the game not to decrement that counter. Such cheats exist and are used.
  • Cheating is a serious problem in itself, and since gaming is a multi-billion-dollar industry, we need techniques to stamp it out. It is also a symptom of a more general problem: how do you know that a networked system is running as intended?
  • Want to be able to detect that a remote machine is faulty, and obtain evidence that can convince a third party.
  • Challenges: mutual distrust and lack of access to the source code to understand how it should work.
  • Software runs in an accountable virtual machine (AVM) on an accountable virtual machine monitor (AVMM). The AVMM maintains a log of network inputs and outputs, and this log can be checked against a reference image. If the AVM is correct, the same inputs will produce the same outputs; otherwise it is considered to be faulty.
  • Solution involves two pieces. First is tamper-evident logging (based on PeerReview). Log uses a hash chain and messages contain signed authenticators.
  • The second part is execution logging. How does Alice know if the log is from a correct execution of her software image? The AVMM needs to specify an execution and additionally log all nondeterministic inputs.
  • Can audit and replay a log. If one player changes the reference image, the logs will diverge, which indicates that the AVM is faulty.
  • AVMs provide strong accountability for arbitrary, unmodified binaries. Don’t need to trust other participants, the AVMM or any software running on the other machine.
  • Evaluation based on the example of cheating in Counterstrike. Used a prototype AVMM based on VMWare Workstation 6.5.1 which has a mature logging/replay engine. Extended this with tamper-evident logging and auditing. Used three players on Nehalem boxes running Windows XP SP3.
  • Can AVMs detect real cheats? If the cheat needs to be installed in the AVM, it can trivially be detected (since the branch counter and instruction pointer will change). Examined 26 real cheats from the internet, and all were detectable.
  • Could cheats be adapted to subvert the AVMM? There are impossible-to-detect cheats (collusion), guaranteed detectable cheats, and technically-difficult-to-evade cheats, which we can’t guarantee are impossible to hide.
  • What is the impact on frame rate? With no fps cap (windowed mode, 800×600, software rendering), the frame rate drops by about 13% from bare hardware to AVMM, though it is still around 137 fps. The main culprit is execution logging (about 11% overhead), so accountability only costs another 2%.
  • For a one-hour game, how expensive is auditing? The log is about 8MB per minute, which compresses to 2.47MB. So the log is 148MB. Replay takes approximately one hour.
  • Idea: stream logs to auditors during the game and have detection almost immediately. Since the machines are quad-core (one core plays the game and one does logging), it is possible to do replay on the one or two spare cores. This causes some drop in the frame rate, but it is still above 100 fps.
  • Q: What if the issue is that a machine isn’t cheating but is in fact sabotaged? The AVM provides a full record of what has happened, and could demonstrate that the owner is not at fault.
  • Q: Would it be possible for the malicious party to run another instance in the background to get a plausible-looking log? The two copies would be completely independent, because the authenticators are signed with one machine’s key. The outputs wouldn’t match, so this would be detectable.
  • Q: Do most of the cheats have the form of changing the inputs? What is the equivalence class of cheats that you can and can't detect? If the cheat causes the network I/O or control flow to be inconsistent, we can detect it.
  • Q: What about concurrency (e.g. who fires first) and delaying traffic in the network to change who fires first? Concurrency wouldn’t affect detectability. But from a distributed systems point of view? We can’t guarantee detection.
  • Q: What if the program that you’re dealing with has non-deterministic output? Randomness is not a problem for us because at the VM level, everything is deterministic.
  • Q: Is it sufficient to log at the AVM level when the application has latitude to process events in a different order? If this is a legitimate thing for the application to do, then we are vulnerable.
  • Q: How does privacy interact with the implementation of your system? Would you end up sharing too much information with your adversary (especially in cases that are more important than games)? It's true that AVMs are verbose in what gets logged. The impact on privacy depends on what you're doing. Auditing a machine that you're paying for seems reasonable; it has to be okay for you as the auditor to see all the computation that is going on. Perhaps the online auditing could be delayed.
  • Q: How can you detect cheats that involve improving the player’s skill (or making all the walls invisible or using an aimbot)? Can’t help information leakage with this technology.
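The tamper-evident log described earlier can be sketched as a hash chain: each entry's hash covers the previous entry's hash, so truncating or rewriting history is detectable. Real PeerReview-style authenticators are additionally signed and exchanged with peers; signatures are omitted in this minimal sketch:

```python
import hashlib

def append(log, message):
    """Append a message; its hash chains over the previous entry's hash."""
    prev = log[-1][1] if log else b'\x00' * 32
    h = hashlib.sha256(prev + message).digest()
    log.append((message, h))
    return h   # this hash, once signed, would be the authenticator sent out

def verify(log):
    """Recompute the chain; any rewritten or reordered entry breaks it."""
    prev = b'\x00' * 32
    for message, h in log:
        if hashlib.sha256(prev + message).digest() != h:
            return False
        prev = h
    return True

log = []
append(log, b'input: player2 fired')
append(log, b'output: ammo=29')
assert verify(log)
log[0] = (b'output: ammo=30', log[0][1])   # tamper with recorded history
print(verify(log))   # -> False
```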

Concurrency Bugs

Bypassing Races in Live Applications with Execution Filters

  • Races and atomicity violations can be a big problem, and hard to find. One solution is to put everything under a big lock. Or can go for a complex fix by decomposing behaviour and adding state variables. But it takes a long time to fix these races, even if the root cause is well understood.
  • Idea of LOOM is to filter out buggy thread interleavings to mitigate bugs while the problem is fixed. Write declarative execution filters to specify the disallowed interleavings.
  • LOOM also works to implement live workarounds, so it is possible to install an execution filter into a running application. It installs execution filters safely and avoids introducing new errors.
  • LOOM incurs little overhead during normal execution.
  • Evaluated LOOM on 9 real races, and it bypasses all evaluated races safely, applies immediately, has little overhead and scales well.
  • LOOM is divided into Static Preparation (uses the application source, which is compiled down to LLVM bytecode where transformations are applied. This adds a mechanism for LOOM update injection and safety maintenance) and Live Update (passing an execution filter to the LOOM controller, which passes code to the LOOM update engine which is linked into the application binary).
  • Safety issue: what if the program is running inside a critical section when you update it? Or what if there are order constraints on multiple threads? The LOOM update engine identifies regions of the code where it is dangerous to update if any thread is inside that region. It then evacuates the dangerous region, to get threads out of it, and applies the update once they are all out.
  • To control application threads, add conditional breakpoints (in the instrumentation phase) on iterative or recursive control flow (since there is no point in pausing code that can quickly run to completion): i.e. CFG back-edges. Details of the breakpoint mechanism are in the paper. They use MRSW locks which are unlocked and reacquired in the breakpoint to pause threads (if the update manager grabs the write lock in the meantime). The back-edges are identified as safe or dangerous, and the breakpoint only pauses a thread at safe back-edges.
  • Hybrid instrumentation to reduce the performance overhead. Have two versions of an instrumented function: fast and slow. The slow version includes configurable slots for additional instructions. During the common case (no update), the fast path is always followed; only if the code is updated do the slots get used.
  • Measured the bare instrumentation overhead on Apache, MySQL and SPLASH benchmarks, and the overhead was between -1.84% and 4.11%. This is much better than Pin, which has from 14.94% to 1065% overhead.
  • Measured the scalability of the number of threads. Overhead from 1 to 32 threads (on a 48-core box) is less than 12% in all cases. There is some anomalous speedup for 4 threads.
  • Intend to extend this to memory errors and security errors.
  • Q: Is there a potential for deadlock in your evacuation algorithm? That may be the case, but we didn’t observe this in our evaluation, and we can always “uninstall” the fix if there really is deadlock.
  • Q: Can you collect information at runtime to aid you with fixing the bug at the source code level? Can write execution filters based on line numbers.
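The execution-filter mechanism can be approximated in a few lines. This is only a sketch with invented names, not LOOM’s real interface (LOOM instruments LLVM bytecode of C/C++ programs); it illustrates how a cheap fast path with an update slot lets a filter be installed into a running program to serialize a racy region:

```python
import threading

# Hypothetical sketch of LOOM-style execution filters (all names invented).
# An installed filter maps a code region to a shared lock, filtering out
# the disallowed interleaving (two threads inside the region at once).

_filters = {}                  # region name -> lock injected by live update
_filters_guard = threading.Lock()

def install_filter(region_names):
    """Live update: make all named regions mutually exclusive from now on."""
    lock = threading.Lock()
    with _filters_guard:
        for name in region_names:
            _filters[name] = lock

def region(name):
    """Stands in for LOOM's instrumentation slots around a code region."""
    def wrap(fn):
        def instrumented(*args, **kwargs):
            lock = _filters.get(name)   # fast path: no filter installed
            if lock is None:
                return fn(*args, **kwargs)
            with lock:                  # slow path: filter enforced
                return fn(*args, **kwargs)
        return instrumented
    return wrap

counter = 0

@region("update")
def buggy_update():
    global counter
    tmp = counter               # deliberately racy read-modify-write
    counter = tmp + 1

install_filter(["update"])      # the live workaround for the race
workers = [threading.Thread(target=lambda: [buggy_update() for _ in range(1000)])
           for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(counter)                  # 4000: the filter serialized the region
```

A real system would also need LOOM’s evacuation step, since installing a lock while a thread is already inside the region could miss an interleaving; here the filter is installed before the threads start.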

Effective Data-Race Detection for the Kernel

  • Looked at a data race in the Windows kernel. Thread A ANDs out a bit, and Thread B ORs a different bit in. These can interleave badly, leading to one of the mutations getting lost, and cause a system hang.
  • Data races are very hard to reproduce since the timings are very tight. And they are hard to debug, since they could be mistaken as a hardware bit-flip. They are also becoming more common, since people are moving away from monolithic locks to fine-grained locking and lock-free approaches.
  • Previous techniques are happens-before and lockset algorithms. The Intel Thread Checker runs at a 200x overhead, due to logging all synchronization accesses and instrumenting all possibly-racy memory accesses. This prevents practical usage in the field; indeed, false failures arise due to timeouts. These techniques also require complete knowledge and logging of all locking semantics, which may not be feasible in practice (since there are diverse ways of implementing locking, and many of them are used in the kernel).
  • Trade-off accuracy for efficiency. Can pick up false and benign data races. False data races are ones that cannot actually occur. Benign data races can and might occur, but are intentionally put there by the developer. Statistics counters are an example of this.
  • Another goal is to give the user control over the overhead (from 0.0x up).
  • Also want actionable data that is useful for the developer.
  • First insight: don’t infer that a data race might happen; instead, actually cause it to happen. Second insight: sample memory accesses for controllable overhead. Use code and data breakpoints, selected randomly to get uniform coverage. A data breakpoint pauses the first thread when it hits some memory location; if another thread comes along and accesses the same location, flag the current thread states as a possible data race.
  • Memory accesses are sampled by setting code breakpoints, which lead to data breakpoints being set on the relevant locations.
  • Results: most dynamic data races are benign (104/113). Many could be heuristically pruned (86/113). Found 25 confirmed bugs in the Windows OS, with 8 more pending investigation.
  • Saw two bits in the same byte, each protected by individual locks, have a data race with bits being set and cleared.
  • The breakpoint approach found data races with as little as 5% overhead.
  • Future work includes prioritizing benign versus non-benign races, and also looking at (false) data sharing and its performance impacts.
  • Q: What are your preliminary thoughts on how to identify bugs as benign automatically? There are a lot of unanswered questions there….
  • Q: How random is your breakpoint setting? The demo was considering the ndis and ntfs drivers, and randomly sampled within those. The breakpoints are set based on a feedback loop to get the rate.
  • Q: How dependent is the technique on having full debug symbols, since the problems may be in third-party device drivers? The dependence is only so that we know where the memory-access instructions are. You can load the module into memory and use some other technique to disassemble it, but we ended up accidentally putting a breakpoint into data.
  • Q: How is the sampling rate related to the expected time until you detect a particular race? We haven’t investigated that, but we would like to verify that we’re not just getting the easy races.
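The AND/OR race from the start of the talk is easy to replay on paper. The sketch below (bit values invented) walks through the bad interleaving by hand, showing why one of the two mutations is lost when the read-modify-write sequences overlap:

```python
# Thread A wants to clear bit 0; thread B wants to set bit 1. Neither
# update is atomic: each is a read, a bitwise op, then a write.
flags = 0b0101

# Bad interleaving: both threads read the old value before either writes.
a_read = flags                 # A reads 0b0101
b_read = flags                 # B reads 0b0101
flags = a_read & ~0b0001       # A writes 0b0100
flags = b_read | 0b0010        # B writes 0b0111: A's clear is lost
assert flags == 0b0111         # bit 0 is still set, looks like a bit-flip

# Serialized (atomic) updates keep both mutations:
flags = 0b0101
flags &= ~0b0001               # A's update completes first
flags |= 0b0010                # then B's
assert flags == 0b0110
```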

Ad Hoc Synchronization Considered Harmful

  • Synchronization is important to ensure the correctness of concurrent programs. Sometimes synchronization code is easy to identify (pthreads calls), but sometimes it is hard to recognise. Sometimes it even uses goto statements.
  • Ad hoc synchronization harms program reliability. Because it is hard to recognise, the programmer may be unaware of it. This may lead to bugs and performance issues: up to 67% of ad hoc syncs introduced bugs. Program analysis is more difficult, and ad hoc sync may introduce hard-to-detect deadlocks, cause false positives in data race checkers and confuse performance profiling. It may even cause problems for compilers and relaxed memory consistency models.
  • Contributions: quantitative evidence of the harmfulness of ad hoc syncs; and a tool (SyncFinder) that automatically identifies and annotates ad hoc syncs (detects deadlocks and bad practices).
  • Examined server, desktop and scientific applications (including Apache, MySQL, Mozilla JavaScript engine, SPLASH). Found lots of ad hoc sync loops in each of them (83 in MySQL). In OpenLDAP, 67% of ad hoc syncs were buggy; in Apache, 22% were buggy.
  • Showed an example of a hard-to-detect deadlock in Mozilla JS, which involves three threads. And an example which does an unnecessary sleep in MySQL, causing performance issues.
  • Ad hoc synchronization leads to benign data races. Leads to false positives in data race detectors (see above).
  • Diverse kinds of ad hoc synchronization. For example, a while loop or a goto loop; a single condition or multiple conditions; spinning on a synchronization variable or a more complicated dependency structure…
  • To identify ad hoc sync, need first to understand it. Every one on the waiting side has a loop body (a “sync loop”) with at least one exit condition that is associated with at least one exit condition variable, and a synchronization variable. The setting side does a write to the synchronization variable. First detect the loop, then extract the exit condition, then detect the exit-dependent variable set, and finally prune the set to find the synchronization variable.
  • To prune the sync loop, need to find variables that are shared with a remote thread, and are loop-invariant.
  • Report annotated code with the line numbers of sync variable reads and writes.
  • SyncFinder identifies 96% of ad hoc syncs in the considered applications, with only 6 false positives (out of thousands of loops).
  • Use cases: a tool to detect bad practices, and an extended race detector in Valgrind.
  • Q: What about the amount of bugs in non-ad hoc synchronization: do you have any data on that? We did some previous work on the reliability of normal synchronization, and we found that the proportion of buggy ad hoc sync is higher.
  • Q: Why is ad hoc synchronization so pervasive? What is the limitation in sync primitives, or is it just performance? We checked the comments surrounding this ad hoc synchronization. People just want a flexible way to do sync. They assume that it is a short-term fix (or that it will not be used in the common case). Would you suggest an additional sync primitive? Conditional wait and signal.
  • Q: Do people do this to avoid a synchronous write on the system bus? Is it always harmful? Your performance assumptions may change in the future on different architectures, and the cost to maintainability is probably higher than the performance win. Are lock-free data structures bad, then? I cannot say that.
  • Q: Is there any chance that these application developers declare their sync variables as volatile? They may do this, but it doesn’t guarantee that that is how they use it. We thought about this for our tool.
  • Q: How do you identify the bugs? We checked the code repository for patches on those locations and identify bugs in Bugzilla and the changelog.
  • Q: How long does SyncFinder take to run on the code base? Could you run it on the Linux kernel? We ran it on the 1 MLOC MySQL, which takes about 2.5 hours. The complexity is data-dependent: OpenLDAP is much smaller but takes almost as long as MySQL.
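The waiting-side/setting-side structure that SyncFinder searches for can be shown with a toy example (Python is used purely for illustration; the tool itself analyzes C/C++ source):

```python
import threading
import time

done = False          # the synchronization variable: shared, loop-invariant

def waiter():
    # The "sync loop": its exit condition reads a variable that only a
    # remote thread writes. Nothing marks this as synchronization, which
    # is exactly why programmers and analysis tools miss it.
    while not done:
        time.sleep(0.001)

def setter():
    global done
    done = True       # the setting side: a remote write to the sync variable

t = threading.Thread(target=waiter)
t.start()
setter()
t.join()
```

The conditional wait/signal primitive suggested in the Q&A makes the same pattern explicit: the waiter blocks in `Condition.wait()` and the setter calls `notify()`, so both analyzers and readers can see the synchronization.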

SIGCOMM 2010: Day 2

Wednesday, September 1st, 2010

Privacy

Privacy-Preserving P2P Data Sharing with OneSwarm

  • Three types of data: private, public (non-sensitive) and public (without attribution). This talk is about the last one: want to download and share data without people knowing what you’re downloading or sharing.
  • P2P is good for privacy because there is no centralized control or trust, you retain rights to your data (instead of giving them to a third party) and no centralized third party knows everything you’re doing. But in P2P: anyone can monitor your behavior!
  • Previous solutions: Tor plus BitTorrent, but this needs a fraction of the clients to be public and gives poor performance. Or Freenet, which has poor bulk-data performance and requires users to store others’ data. The median time to download a 1MB file with BitTorrent is 94s, but BT+Tor takes 589s and Freenet 1271s.
  • Implemented a OneSwarm client and released in March 2009… now hundreds of thousands of users. Based on social networks: share keys with your friends.
  • Searches are flooded through the network, and set up a data path along the successful route, which does not reveal who is the ultimate sender or provider.
  • Threat model: the attacker has a limited number of overlay nodes and can do anything on nodes he controls, including traffic injection/sniffing/correlation.
  • To support mobile peers, use a DHT to publish IP and port, however this is published, encrypted and signed separately for each peer. This makes it possible to remove a peer.
  • Sparse social networks are a problem: with only one friend, you have poor reliability, poor performance and privacy problems. This is bad for early adopters. Early adopters used a forum to share their public keys. Solution was to add a community server, as a source of untrusted peers.
  • With untrusted peers, delay responses to foil timing attacks, probabilistically forward queries and used deterministic random behavior to limit information leakage from repeated queries. Can trust/not trust peers on a per-object basis.
  • Want to have search without revealing the source and destination. The approach is based on flooding with delay, where searches are only forwarded using spare capacity, and delayed at each hop. Cancel messages move faster through the network. However, this gives no guarantee that all data can be found by all users at all times.
  • Search types: 160-bit content hash and text-based. For response delay, use a random reply delay seeded by the hash (if hash-based search). This is harder for text-based search, so delay based on the hash of the content.
  • Multipath connections are supported to avoid weak points in the network. Default is to cancel after getting 20 responses (not just one). Then the forwarding load is distributed.
  • Data transfer uses a modified version of BitTorrent, which handles multiple sources and swarming downloads naturally.
  • Timing attack is possible by monitoring the delay between search and the response, and inferring how many users could have sent the reply within the recorded time.
  • Evaluated potential of the timing attack using a synthetic OneSwarm overlay, based on 1.7 million last.fm users. Attackers use a public community server, and users with 26 or fewer friends take some untrusted friends as well. Design eliminates the attacker’s ability to pinpoint particular users.
  • Also evaluated performance using 120 PlanetLab nodes transferring a 20MB file. Median download time for OneSwarm is 173s (compared to 94s for BitTorrent, but much less than BT+Tor and Freenet). At the 70th percentile, OneSwarm and BT are similar (97s vs 190s), but BT+Tor and Freenet are much worse.
  • Multipath transfers increase average transfer rate from 29KB/s to 457KB/s.
  • Q. Would it be fair to classify this as Gnutella where the network is restricted to the social graph and searches are less flexible? Similar, but the key change is that the data flows over the path in the overlay (not directly), which makes it more privacy-preserving. Does this not give worse scalability than Gnutella, which had problems? Gnutella had problems when people were still using modems, and it is more viable to provide a few KB/s. The current overlay is not oversubscribing.
  • Q. What are you relying on to ensure that you don’t know where the data are coming from? Because you don’t know the topology. If you are next to an attacker, you rely on the delay that you add. Are you seeing a sum of random variables, which will leak more information as the path becomes longer? You could maybe estimate the hop-count but not pin-point nodes.
  • Q. Is a 20MB file too small for TCP to show realistic performance? Used this because we needed to experiment with Tor also, and we didn’t want to stress that network too much. For the Freenet experiment, we used a 5MB file and extrapolated from that because it was hard to get 20MB to download reliably.
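The deterministic random behavior mentioned above can be sketched as follows (constants invented): because the reply delay is a pure function of the content hash, an attacker who repeats a query sees the same delay every time and cannot average away the noise:

```python
import hashlib
import random

MAX_DELAY_MS = 150   # invented cap on the artificial per-hop reply delay

def reply_delay_ms(content_hash: bytes) -> int:
    """Deterministic 'random' delay, seeded by the search's content hash."""
    seed = int.from_bytes(hashlib.sha1(content_hash).digest()[:8], "big")
    return random.Random(seed).randrange(MAX_DELAY_MS)

h = hashlib.sha1(b"some shared file").digest()
assert reply_delay_ms(h) == reply_delay_ms(h)   # repeat queries learn nothing
assert 0 <= reply_delay_ms(h) < MAX_DELAY_MS
```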

Differentially-Private Network Trace Analysis

  • Can you conduct network trace analysis that provides strict, formal, “differential privacy” guarantees?
  • Selected some representative traces and tried to reproduce the results using differential privacy.
  • It was possible to reproduce every analysis attempted, but there is a privacy/accuracy trade-off.
  • Toolkit and analyses are available online from the PINQ web site.
  • Access to realistic data is helpful for networking research, but there is a tension between utility and privacy. The requirements of utility are usually for aggregate statistics, whereas privacy requirements are typically for individual behavior.
  • Other approaches include: trace anonymization (doesn’t always work unless people are excessively conservative), code-to-data (send your analysis to the people who hold the data, but it is hard to know what that code is doing), or secure multi-party computation (similar to code-to-data). The aim here is to start with formal guarantees and see how useful it can be.
  • Differential privacy: the results don’t depend on the presence or absence of an individual record. Doesn’t prevent disclosure, but makes no assumptions about what the attackers can do, and is agnostic to data types.
  • Uses PINQ, which is a programming language that guarantees programs are differentially private.
  • Challenges: getting DP requires introducing noise, so you need to use statistically robust measurements. PINQ requires analyses to be written as high-level, declarative queries, which can require some creativity or reinterpretation. Also (not dealt with): masking a few packets does not mask a person, and the guarantees degrade more as a dataset is reused (policy question of how you mete out access to a dataset).
  • Example is worm fingerprinting. Group packets by payload, filter by the count of source IPs being over a threshold and the count of destination IPs being over another threshold. Can then count the number of worms, approximately. Need to supply an epsilon to the count, which yields the differentially-private version.
  • Built some tools for analysis. For example, implemented three versions of a CDF. In doing this, you need to scale down the accuracy for each subquery in order to not degrade the dataset privacy too much.
  • Showed an example CDF. The differentially private one is not monotonic at the microscopic scale, but it gives a convincing macro-scale result.
  • Can also list frequently occurring strings, using an algorithm based on statistical properties of text, which gradually extends a prefix.
  • Extend worm fingerprinting: actually enumerate the payloads that have significant src/dest counts.
  • Also built more tools and analyses: did packet-level analyses, flow-level analyses and graph-level analyses. Sometimes had to compromise on privacy to get high accuracy (epsilon = 10 for weak privacy).
  • Many open questions. Perhaps the biggest is whether DP guarantees for packets are good enough. Or whether, if writing new analyses, they could be designed with DP in mind.
  • Q. Could extensions to PINQ apply to trace analysis that look for isolated events, such as network intrusions which are relatively rare? Can separate the two tasks: learning a rule or filter that could identify an intrusion (which could use DP), and apply that filter to individual packets (which could not use DP, because you effectively want to violate privacy at this point).
  • Q. Does someone need to hold onto the raw packets? Yes, like the code-to-data setting.
  • Q. In DP, each query may consume epsilon privacy and the provider must set a budget, so how do you set this? And what happens when the budget is exhausted? You could imagine turning off the dataset when the hard threshold is met. But this is really a policy question. Setting the budget is difficult: perhaps you can provide data outputs from a DP query to a large group who can then do useful work with it.
  • Q. Is there a trade-off between DP and the presence of correlations in the data? In a lot of cases, it is possible to restructure the data set to reduce the amount of correlation between individual records (by grouping the strongly correlated records together).
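The core mechanism behind these analyses is small enough to sketch. A counting query has sensitivity 1 (one record changes the count by at most 1), so adding Laplace noise with scale 1/epsilon gives epsilon-differential privacy. This is a generic illustration, not PINQ itself (PINQ is a C#/LINQ library):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) by inverting the CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Noisy count: changing one record moves the true count by at most 1,
    so Laplace(1/epsilon) noise suffices for epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy packet trace: the true count of packets from source 7 is 200.
packets = [{"src": i % 50} for i in range(10_000)]
noisy = private_count(packets, lambda p: p["src"] == 7, epsilon=0.1)
# Typically within a few tens of 200; each query spends privacy budget.
```

This also shows the privacy/accuracy trade-off from the talk directly: a smaller epsilon (stronger privacy) means a larger noise scale.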

Encrypting the Internet

  • 50 million websites exist, but only about 600k of them enable SSL/TLS. Can we change the infrastructure to make all transactions protected and secure?
  • Main drawback is protocol processing speed and cost, due to public key crypto for handshaking and symmetric crypto for the data. 2 million clock cycles for RSA decrypt.
  • Main contribution is a CPU that is capable of encrypting packets at line rates, and getting a 4–12x speedup in AES and a 40% speedup in RSA.
  • Encrypting the internet is not securing it. Don’t deal with certificate/trust management, malicious software or privacy breaches at the end-host.
  • AES is a block cipher, based on the Rijndael algorithm. Can use 128-bit blocks and either 128, 192 or 256-bit keys. AES takes 10, 12 or 14 rounds.
  • AES uses confusion (invert in GF(2^8) followed by an affine map). Then the bytes are permuted by shifting the rows of the S-box by varying amounts. Then the columns of the S-box are mixed by matrix multiplication. Uses many bit-linear operations, which are easy to implement in VLSI. Finally, add the round key using XOR.
  • AES is typically implemented using table lookups, which are costly (approximately 15 cycles per byte). But need to get 1Gb/s. So the aim is to implement them in combinatorial logic, on the processor data path.
  • Added new instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST. Cache attacks are eliminated. Challenge is to implement this in as small a gate area as possible. Mathematical techniques such as composite fields help to achieve this in 100-400 gates. Total number of gates is similar to an adder or multiplier.
  • RSA requires performing a modular exponentiation, which can be implemented using modular multiplication. Implementing a faster multiplication algorithm in assembly achieved a 40% speedup over OpenSSL.
  • Also implemented the first version of combined encryption and authentication for TLS 1.2.
  • AES-NI latency is 24 clocks/round, then 6 clocks, and the throughput is 2 clocks.
  • Overall, can move from 501 SSL sessions/second to 1216 SSL sessions/second using AES-NI in Galois counter mode.
  • Now, one core can saturate a 1G link, and 8 cores can saturate a 10G link.
  • Future work is to improve larger RSA variants and implement the eventual SHA-3 algorithm.
  • Q. When you get that fast, how many good-quality random bits per second can you get? This work doesn’t address that, but all we need is an entropy source, per a 2004 paper. Not sure what the product groups are doing in this respect.
  • Q. Is Intel working on speeding up the RSA? The speedup presented in the paper (40%) is good enough to saturate the link.
  • Q. Could you expose the GF operations as primitives themselves? It is implemented in such a way that you can isolate the inversion of GFs or the multiplication. Algorithms in the SHA-3 competition also exploit similar primitives.
  • Q. How general are your optimizations in terms of other block ciphers? You can implement a variety of crypto algorithms using the primitives we have designed, including several cryptographic hash functions.
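The RSA side of the paper comes down to making modular multiplication fast, since the private-key operation is one large modular exponentiation built from multiplications. A minimal square-and-multiply sketch with tiny toy numbers (nothing like real 1024-bit keys):

```python
def mod_exp(base: int, exp: int, mod: int) -> int:
    """Left-to-right square-and-multiply modular exponentiation."""
    result = 1
    for bit in bin(exp)[2:]:
        result = (result * result) % mod     # square for every exponent bit
        if bit == "1":
            result = (result * base) % mod   # extra multiply on set bits
    return result

# Toy RSA key (illustrative only): n = 61 * 53, e = 17.
p, q, e = 61, 53, 17
n, phi = p * q, (p - 1) * (q - 1)
d = pow(e, -1, phi)            # private exponent (Python 3.8+ modular inverse)

m = 42
c = mod_exp(m, e, n)           # "encrypt"
assert mod_exp(c, d, n) == m   # "decrypt" recovers the message
```

Every step is a modular multiplication, which is why a 40% faster multiplication routine translates directly into faster handshakes.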

Wireless LANs

Enabling Fine-Grained Channel Access in WLAN

  • 802.11n achieves about 45.2 Mbit/s at the application layer, which is much less than the advertised bitrate.
  • Overhead arises from various sources: CSMA, backoff, DIFS and SIFS, and ACK. Simple model of this overhead as the ratio of transmission time to total time for a packet. As the PHY data rate increases, the time spent transmitting data (the efficiency) becomes small compared to all of these overheads. There is a static delay that cannot be reduced, constraining speedup.
  • Existing MAC is limited by allocating a whole channel to a single user at once. Aggregation is a possible solution, but you require large aggregation (23KB frames) to get 80% efficiency at 300Mbps. And this also increases latency.
  • Basic idea is to divide the channel into small, fine-grained slices. Directly reducing the channel width doesn’t work because of guard-band overhead. The approach is therefore to use orthogonal overlapping subchannels (OFDM).
  • If nodes are asynchronous, you lose orthogonality (i.e. if you have multiple users). Challenge then is to coordinate transmissions in random-access networks like WLANs. Time-domain backoff is very inefficient in this case.
  • Designed new PHY and MAC architectures: “FICA”.
  • M-RTS/M-CTS/DATA/ACK access sequence.
  • Carrier-sensing and broadcasting can be used to analyze the timing misalignment. A proper cyclic-prefix accommodates the timing misalignment: a long one for M-RTS and a short one for M-CTS/DATA/ACK.
  • For contention resolution, time-domain backoff is inefficient. Solution is to do frequency-domain contention with PHY signalling in the M-RTS/M-CTS symbols.
  • Frequency domain backoff: reduce the number of subchannels to contend if there is a collision, and increase it if there is success. This is analogous to congestion-control mechanisms. Two policies: “update to max” and AIMD.
  • Implemented using the Sora software radio platform, based on a SoftWifi implementation.
  • Evaluated performance of the synchronization, the signalling reliability and the decoding performance.
  • Also showed simulation results for the performance gain over 802.11n, and showed an improvement in efficiency for both full aggregation (unrealistic) and a mixture of saturated and delay-sensitive traffic (realistic and with much greater benefits).
  • Q. How do you deal with the case when the number of sources exceeds the number of sub-carriers (in frequency-domain backoff)? Could you combine time and frequency? Yes, we could always do that.
  • Q. Is there a way of using this system with RTS/CTS? The overhead of these is so low (37us for RTS), that it might not be worth doing this.
  • Q. Why is the problem of asynchronous timing different from multipath fading? It can create arbitrarily bad interference with the FFT window that is done for OFDM.
  • Q. What happens if you compare your scheme to classic OFDM in terms of bits/seconds/Hz (considering delay due to synchronization)? There is a sweet point in symbol size that can meet your design goal.
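Frequency-domain backoff can be sketched as AIMD on the number of subchannels a node contends for (constants invented; the paper also describes an “update to max” policy):

```python
TOTAL_SUBCHANNELS = 16   # invented: subchannels available for contention

def aimd_update(contending: int, collided: bool) -> int:
    """AIMD policy: halve the subchannels contended for on a collision,
    add one back on success -- the frequency-domain analogue of TCP-style
    congestion control."""
    if collided:
        return max(1, contending // 2)              # multiplicative decrease
    return min(TOTAL_SUBCHANNELS, contending + 1)   # additive increase

n = TOTAL_SUBCHANNELS
history = []
for collided in [True, True, False, False, False]:
    n = aimd_update(n, collided)
    history.append(n)
print(history)   # [8, 4, 5, 6, 7]
```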

Predictable 802.11 Packet Delivery from Wireless Channel Measurements

  • 802.11 is fast (600Mbps), reliable (usable at vehicular speeds over extended range) and ubiquitous (cheap). But new applications, such as wireless displays or controller input, can stress the network.
  • In theory, performance should be easily measurable and used to guide channel rate selections. But the real-world doesn’t always match the theory. So statistical adaptation is often used, but convergence time becomes a problem (especially as the measurement results change dynamically).
  • Goal is to bridge theory and practice, and accurately predict performance over real channels and devices.
  • Channel metric is the received signal strength indicator (RSSI) which, with noise, gives the SNR for a packet. However, this isn’t perfect, because it can vary by 10dB on a per packet basis. Different subchannels have different SNRs.
  • 802.11n provides a new opportunity: detailed channel measurements, which are used for advanced MIMO techniques. Get a Channel State Information (CSI) matrix for per-antenna paths.
  • Use the Effective SNR (the total useful power in a link) as opposed to the packet SNR (total power in the link).
  • CSI is measured on receive, so for every received frame, we know all antennas and subcarriers used. Then take this and compute SNRs per symbol. And use textbook formulae to calculate per-symbol bit-error rates, and average them to get an effective bit-error rate. Finally convert this back to the effective SNR.
  • Every rate gets an effective SNR threshold, calculated offline per NIC implementation (not per NIC or per channel). This handles real NICs which may use interesting decoding techniques (hard/soft/maximum likelihood, etc.).
  • Application: what is the fastest configuration for a particular link? Select rate/MIMO/channel width based on the information.
  • Application: which antenna is the best to use to save power?
  • Application: what is the lowest transmit power at which I can support 100 Mbps?
  • Implemented in an Intel Wi-Fi Link 5300 NIC (3×3 MIMO, 450Mbps). Used two testbeds with over 200 widely varying links. Open-source Linux driver and used firmware debug mode to send CSI to the receiving host. Real-time computation took 4us per 3×3 CSI.
  • For predicting optimal 3×3 rate: effective SNR is much closer to the ground truth than packet SNR.
  • To evaluate rate control, used channel simulation on a mobile trace using MATLAB and the SoftRate GNU Radio. Effective SNR gets a better average delivered rate than SampleRate, SoftRate and SampleRate with fixed retry (802.11a algorithms).
  • Effective SNR extends to MIMO. Compared to optimal and an invented algorithm called “previous-OPT”. Effective SNR gets 80% accuracy and 10% overselection.
  • Related work: SoftRate, AccuRate and EEC (from yesterday). All work with 802.11a but don’t extend to more recent techniques.
  • Q. If you had CSI and it’s quick, do you need to do all of these things? The RSSI has a lot of error, and we were able to make this work.
  • Q. Is the debug mode on the NIC publicly available? Yes, I think so.
  • Q. Would a better comparison be to other techniques that use scheduled MACs? Trying to do something that works with what we have.
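The effective-SNR computation described above can be sketched end-to-end: per-subcarrier SNR to per-subcarrier BER via a textbook formula (BPSK used here for simplicity), average the BERs, then invert back to a single SNR. All numbers are invented:

```python
import math

def bpsk_ber(snr_linear: float) -> float:
    """Textbook BPSK bit-error rate for a given linear SNR."""
    return 0.5 * math.erfc(math.sqrt(snr_linear))

def effective_snr_db(subcarrier_snrs_db):
    """Average per-subcarrier BERs, then invert to an effective SNR in dB."""
    bers = [bpsk_ber(10 ** (s / 10)) for s in subcarrier_snrs_db]
    avg_ber = sum(bers) / len(bers)
    lo, hi = 0.0, 1e6              # invert bpsk_ber by bisection
    for _ in range(200):
        mid = (lo + hi) / 2
        if bpsk_ber(mid) > avg_ber:
            lo = mid
        else:
            hi = mid
    return 10 * math.log10(lo)

# A frequency-selective channel: two deeply faded subcarriers drag the
# effective SNR far below the ~19dB average packet SNR, matching what the
# error rate actually experiences.
snrs_db = [25, 24, 26, 3, 2, 25, 24, 26]
print(round(effective_snr_db(snrs_db), 1))   # roughly 5dB, not 19dB
```

On a flat channel the effective SNR equals the per-subcarrier SNR; it is only frequency selectivity that opens the gap the paper exploits.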

SourceSync: A Distributed Architecture for Sender Diversity

  • Receiver diversity underlies many systems, such as opportunistic routing protocols, and WLAN diversity protocols. In opportunistic routing, let any router that receives the packet forward it. If multiple routers/channels with different loss rates, the loss probability is now the joint probability of losing the packet on all channels.
  • Sender diversity is the converse of receiver diversity. If many senders transmit simultaneously, it is unlikely that they will all be attenuated at the same time. It provides analogous benefits to receiver diversity. For example, connect many APs to wired Ethernet, and let many of them broadcast a packet simultaneously to the client.
  • Challenge: simultaneous transmissions don’t strengthen each other, because they are likely to be out of sync. Need distributed symbol-level synchronization.
  • An 802.11 symbol takes 3.2us. With a synchronization error of 2us, the best you can get is an SNR of 2dB; a 1us synchronization error gives 5dB. But for the maximum bit rate, 802.11 needs an SNR of ~22dB.
  • Implemented the system, SourceSync, for an FPGA. Talking about opportunistic routing, but applies also to WLANs.
  • Natural way to synchronize transmitters is by reception. But since multiple paths have different delays, need to compensate for these differences.
  • Path delay is made up of propagation delay and packet detection delay (typically needing multiple samples to detect a symbol). Then a turnaround time between receipt and transmission.
  • Packet detection delay arises because receivers detect packets using correlation. Random noise can cause a receiver not to detect a packet on the first sample. Since routers see different random noise, they may need different numbers of samples. Routers can estimate this delay based on the phase shift in the various carriers.
  • Hardware turnaround time is hardware dependent, caused by different hardware pipelines and radio frontends. Routers locally calibrate this using their clocks.
  • Propagation delay is measured by probe-response between node pairs. A knows its packet detection delay, B knows its packet detection delay, and so we can compute this from the RTT.
  • Challenge: can nodes synchronize using carrier sense? Transmission from one of the joint senders triggers the other senders. All nodes use CSMA, so one of the nodes wins contention and begins transmitting; other nodes join in if they have the data.
  • The lead sender adds a sync header to the packet and a known fixed gap to all co-senders to join after the gap. Co-sender listens, turns around from receive to transmit, waits for a compensating delay, and sends the data.
  • Implemented in a FPGA of the WiGLAN radio. Built a testbed with a variety of line-of-sight and non-line-of-sight locations.
  • Evaluated: randomly pick a pair of nodes to transmit, and measure the synchronization error. The 90th percentile of synchronization error was 20ns at 5dB SNR, and as little as 1ns at 25dB SNR.
  • Can SourceSync achieve sender diversity gains? Two nodes transmit simultaneously to a receiver (again). Check that two channels have different OFDM subchannel SNRs (they do, in the example) and that SourceSync achieves higher SNR in all subchannels.
  • Compare using the best single access point to using SourceSync, with two senders and a client. Repeat for all locations. SourceSync gives a median throughput gain of 57%.
  • Compared with opportunistic routing. Single path does worst. ExOR does better. SourceSync + ExOR does best (doubled median throughput over single path, and 45% improvement over ExOR alone).
  • Q. Did you consider just increasing the power of a single AP instead of sending with multiple APs? There is a fundamental gain here: the SNR profile is different for different routers, and combining across multiple senders gets rid of these deep losses.
  • Q. Should there be more components to the calculation of delay, based on the RTT? The nice part about this technique is that channel access delay doesn’t affect us, because we use carrier sense for telling when to transmit.
  • Q. Why did you not compare the performance of your scheme to MIMO? Sender diversity is orthogonal to MIMO and could improve its performance.
  • Q. Is your synchronization header long enough to account for nodes being very distant? Actually, it’s the gap after the header that has to be long enough. It’s a simple system-level parameter.
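The probe-and-compensate arithmetic in the bullets above can be sketched in a few lines. The function names and nanosecond values below are mine, not SourceSync's; the real system does all of this in FPGA hardware at symbol granularity.

```python
# Sketch of SourceSync-style delay bookkeeping (hypothetical values, ns).

def propagation_delay(rtt_ns, det_a_ns, det_b_ns, turnaround_b_ns):
    """One-way propagation delay from an A -> B -> A probe exchange:
    RTT = 2 * propagation + both detection delays + B's turnaround."""
    return (rtt_ns - det_a_ns - det_b_ns - turnaround_b_ns) / 2.0

def compensating_delay(gap_ns, lead_to_rx_ns, lead_to_co_ns, co_to_rx_ns,
                       co_detection_ns, co_turnaround_ns):
    """How long a co-sender waits after the lead sender's sync header
    starts, so its data lands at the receiver exactly when the lead
    sender's data (sent gap_ns after the header) does."""
    lead_arrival = gap_ns + lead_to_rx_ns
    co_ready = lead_to_co_ns + co_detection_ns + co_turnaround_ns
    return lead_arrival - co_ready - co_to_rx_ns
```

The check for correct compensation is that the co-sender's data and the lead sender's data arrive at the receiver at the same instant.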

Novel Implementations of Network Components

SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware

  • […]
  • Existing approaches involve developing custom software, custom hardware or programmable hardware.
  • Platform header: a hash value for custom forwarding, a bitmap for what preprocessor should execute on the packet, a forwarding mode (including longest prefix matching or an exact match; also able to throw a software exception) and the virtual data plane ID.
  • Virtual data plane has its own preprocessing, lookup and post-processing stages: they operate in isolation.
  • Preprocessing stage: select processing functions from a library of modules (such as path splicing, IPv6 and OpenFlow). Also hashing: operator indicates what bits in the header should be incorporated in the packet-header hash to determine how the packet should be forwarded (can include up to 256 bits from the header).
  • Can do OpenFlow, where forwarding decisions are made on a 13-tuple (240 bits), which SwitchBlade hashes for custom forwarding to be done.
  • Modules are implemented in Verilog. Preprocessing and postprocessing modules extract the bits for lookup.
  • Forwarding stage: perform output port lookup based on mode bits. A software exception can be thrown and the packet redirected to the CPU. Could do hardware-accelerated virtual routers in software.
  • Implemented on NetFPGA.
  • Evaluated for resource utilization and packet forwarding overhead. Compared to a baseline implementation on NetFPGA. There is minimal resource overhead and no packet forwarding overhead.
  • Evaluated on a three-node topology.
  • SwitchBlade uses 13 million gates to get four data planes; other implementations (IPv4, splicing, OpenFlow) have one data plane and use 8 to 12 million gates.
  • No additional forwarding overhead compared to the reference implementation.
  • SwitchBlade is a programmable hardware platform with customizable parallel data planes. Provides isolation using rate limiters and fixed forwarding tables.
  • Q. How do you scale the performance beyond tens of Gbps? That limit is an artifact of the NetFPGA implementation, which uses 4×1G ports; a later one will have 4×10G ports.
  • Q. Doesn’t the next paper show that it is possible to do all this in software? Things like Click are limited by packet copying overhead, so you are limited by the bandwidth of the PCI bus.
  • Q. What kind of hash function do you use and do different applications require different properties? We use a collision-resistant hash.
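The hash-based custom forwarding described above might look roughly like this. The bitmask layout and the use of truncated SHA-256 are my assumptions for the sketch; the talk only says the hash is collision-resistant and covers up to 256 header bits.

```python
# Sketch of SwitchBlade-style custom forwarding (hypothetical layout).
import hashlib

def extract_bits(header: bytes, bitmask: bytes) -> bytes:
    """Keep only the header bits selected by the operator's bitmask."""
    return bytes(h & m for h, m in zip(header, bitmask))

def forwarding_hash(header: bytes, bitmask: bytes) -> int:
    # Truncated SHA-256 stands in for the collision-resistant hash.
    digest = hashlib.sha256(extract_bits(header, bitmask)).digest()
    return int.from_bytes(digest[:4], "big")

def lookup(table: dict, header: bytes, bitmask: bytes, default_port=0) -> int:
    """Exact-match forwarding on the hash of the selected header bits."""
    return table.get(forwarding_hash(header, bitmask), default_port)
```

Bits outside the mask never influence the forwarding decision, which is what lets one pipeline serve protocols with very different header formats.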

PacketShader: A GPU-Accelerated Software Router

  • Prototype achieves 40 Gbps on a single box by exploiting GPU acceleration.
  • Software routing is not just IP routing. It is driven by software and exploits commodity hardware.
  • 10G NICs cost from $200–300 per port. But software routers are limited to less than 10Gbps (8.7Gbps in RouteBricks is the best so far).
  • For 10G, it takes 1200 cycles to do packet I/O, and your budget is 1400 cycles. Lookup/encryption/hashing typically takes much more than that.
  • First step is to optimize the packet I/O. Then offload the other functions to the GPU.
  • GPUs are massively-parallel. Lots of small cores.
  • A GTX480 GPU has 480 cores and 1.2 billion transistors, most of which are dedicated to ALUs.
  • Operations like hashing, encryption, pattern matching, network coding and compression are computationally intensive. GPU is well suited to these. GPU can also effectively hide memory latency.
  • Memory bandwidth of a top-of-the-line CPU is 32GB/s, but the empirical bandwidth (on realistic access patterns) is 25GB/s. Multiple ports receiving and transmitting will consume this and cause contention. However, a GPU has 174GB/s memory bandwidth.
  • Key insight: stateless packet processing is parallelizable. Take packets from the head of the receive queue, batch them and process them in parallel.
  • Latency is not impacted by parallel processing.
  • Before shader: checksum, TTL, format check, etc. This will send some packets along the slow path. It collects the destination IP addresses and passes those to the shader.
  • Shader: takes IP addresses, looks up the forwarding table and returns the next hops.
  • Post-shader: packets are updated and transmitted through the output ports.
  • Also device drivers at the receive and transmit side. Implemented a custom driver; details in the paper.
  • Can scale further with a multicore CPU: one master core and three worker cores, with the master core talking to the shader. With multiple sockets, you need one GPU per CPU; there is no communication between the CPUs, and each CPU owns a subset of the input queues.
  • Evaluated by connecting a packet generator and PacketShader back-to-back. Generator generates up to 80Gbps.
  • GPU gives a speedup (over CPU-only) of 1.4x for IPv4, 4.8x for IPv6, 2.1x for OpenFlow and 3.5x for IPSec.
  • IPv6 table lookup requires more computation than IPv4 lookup; the algorithm is binary search on hash tables. Big performance improvement for small packets, but slightly worse for 1024- and 1514-byte packets. However, this is bounded by the motherboard I/O capacity.
  • IPSec tunneling adds a header and trailer to the encrypted packet. The improvement is across all packet sizes, and is actually bigger for larger packets.
  • PacketShader achieves 28.2 Gbps with CPU only, and is implemented in user space, rather than kernel space. Reaches 39.2 Gbps with the GPU.
  • Need to add a control plane (currently only does static forwarding). Need Quagga or Xorp.
  • Could also integrate with a programming environment, such as Click.
  • Q. Is it worth implementing such a sophisticated design to make a 40% saving? And do you have a breakdown of where the savings are made? The budget numbers and breakdown are taken from RouteBricks.
  • Q. What do you think about the power efficiency of this compared to other approaches? Idle to full load is 327W–594W with two CPUs and two GPUs. (Compared to 260W–353W for two CPUs.)
  • Q. Does this approach have advantages over an integrated network processor in terms of scalability or programmability? Network processors are not commodity. Based on experience, they are much more difficult to program.
  • Q. Why did your approach have such a significant speedup over RouteBricks etc. even without the GPU? Improvements in packet I/O throughput.
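The pre-shader/shader/post-shader split can be illustrated with a toy batched pipeline. The forwarding table, packet format and scan-based longest-prefix match below are illustrative only; in PacketShader the shader step runs as a GPU kernel over the whole batch in parallel.

```python
# Toy sketch of the batched pipeline (hypothetical table and packets).
import ipaddress

FIB = {  # prefix -> next hop
    ipaddress.ip_network("10.0.0.0/8"): "portA",
    ipaddress.ip_network("10.1.0.0/16"): "portB",
}

def pre_shader(batch):
    """Sanity checks; collect destination addresses for the shader."""
    good = [p for p in batch if p["ttl"] > 1]
    return good, [p["dst"] for p in good]

def shader(dsts):
    """Longest-prefix match over the whole batch at once."""
    hops = []
    for d in dsts:
        addr = ipaddress.ip_address(d)
        matches = [n for n in FIB if addr in n]
        hops.append(FIB[max(matches, key=lambda n: n.prefixlen)] if matches else None)
    return hops

def post_shader(pkts, hops):
    """Update packets and tag them with their output ports."""
    for p, h in zip(pkts, hops):
        p["ttl"] -= 1
        p["out"] = h
    return pkts
```

The point of the structure is that the shader step is stateless across packets, so the whole batch can be dispatched to the GPU at once.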

EffiCuts: Optimizing Packet Classification for Memory and Throughput

  • Packet classification is important for security, traffic monitoring and analysis, and QoS. Usually based on the source and destination IPs and ports, and the protocol field.
  • Line rates and classifier sizes are increasing. This leads to high power consumption.
  • Previous approaches have used either TCAMs (poor scalability) or algorithmic approaches (potentially scalable, but problematic). Most promising approach based on decision trees. Aim of this work is to address the scalability of decision tree algorithms.
  • HiCuts and HyperCuts have investigated decision trees previously. However, they require large memory.
  • EffiCuts reduces the memory overhead of HyperCuts while achieving high packet throughput. Uses 57x less memory and 8x less power.
  • Rules in the decision tree are hypercubes in the rule space. Tree building successively cuts down the rule space into smaller sub-spaces. Stops when the cube is small. Classification uses tree traversal.
  • HyperCuts’ memory overhead is due to many rules overlapping and varying in size: the fine cuts needed to separate small rules end up cutting, and hence replicating, large rules. There is also overhead because the rule space is sparse (leading to empty nodes or nodes with replicated rules).
  • Aim to tackle the variation in the rule size and the density of the rule space.
  • Separable trees: build separate trees for large and small rules. But separate them along different dimensions.
  • Build a distinct tree for each set of separable rules in 5 IP fields. This leads to a maximum of 31 trees, but in practice it’s more like 12.
  • Extra memory accesses to traverse multiple trees decreases packet throughput. To reduce the number of accesses, merge the trees.
  • HyperCuts uses equi-sized cuts to separate dense areas, whereas EffiCuts uses equally-dense cuts, which leads to fine cuts in dense areas and coarse cuts in sparse ones. Many details of this are in the paper.
  • Node co-location: colocate a node and its children, details of this in the paper.
  • Implemented HiCuts and Hypercuts with all heuristics, and EffiCuts. Used 16 rules per leaf. Power comparison uses an estimation from the Cacti tool to simulate the SRAM/TCAM.
  • First result: HyperCuts and HiCuts see memory grow more rapidly than EffiCuts. Replication decreases from 1000 to < 9. EffiCuts needs a constant number of bytes per rule as the number of rules grows.
  • EffiCuts requires 50% more memory accesses than HyperCuts. However, since EffiCuts uses much less memory, those accesses are inexpensive.
  • Throughput results are mixed (149 down to 73 million packets per second for one rule set; but 218 up to 318 for another). Still see an 8x saving in power.
  • Also compared EffiCuts to TCAM. Throughput story is also mixed, but EffiCuts consumes 6x less power than a TCAM.
  • Q. How do you separate “large” and “small” rules—using a threshold? We observed that the rule spread is essentially bimodal. This is based on a sensitivity analysis to the “largeness fraction” which varies between 0.1 and 0.95 without affecting the split.
  • Q. How would power consumption compare to a TCAM where you selectively turn on the relevant banks? Since we are comparing the worst-case packet match, every rule could go to a very different bank.
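The separable-trees idea can be sketched as grouping rules by which of their fields are "large". The 0.5 threshold below is a placeholder picked from the sensitivity range mentioned in the Q&A, and the two-field rules are invented for illustration.

```python
# Sketch of EffiCuts-style rule separation (hypothetical threshold).
LARGENESS = 0.5  # "largeness fraction"; Q&A says 0.1-0.95 all work

def largeness_pattern(rule, field_ranges):
    """Tuple of booleans: which of the rule's fields are 'large',
    i.e. cover more than LARGENESS of that field's full range."""
    pattern = []
    for (lo, hi), (flo, fhi) in zip(rule, field_ranges):
        covered = (hi - lo + 1) / (fhi - flo + 1)
        pattern.append(covered > LARGENESS)
    return tuple(pattern)

def separate(rules, field_ranges):
    """One tree per large/small pattern, so big and small rules never
    share a tree and fine cuts no longer replicate large rules."""
    trees = {}
    for r in rules:
        trees.setdefault(largeness_pattern(r, field_ranges), []).append(r)
    return trees
```

With five fields this yields at most a few dozen trees in theory, matching the "maximum of 31 trees, ~12 in practice" figure above.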

Cloud and Routing

Theory and New Primitives for Safely Connecting Routing Protocol Instances

  • Earlier measurement study showed that internet routing is much more complex than the traditional two-level hierarchical model (EIGRP/BGP/OSPF). Since it is so complicated, the connecting primitives play a critical role. 99.9% of analyzed networks depend on them. They are used for things like domain backup, partition healing and router-level shortest path routing.
  • Designs are usually either safe or flexible. The status quo is unsafe and inflexible. The talk describes something that is both safe and flexible!
  • Framework based on routing algebras (metarouting). Connecting primitives have provable safety properties, more expressivity and require no modifications to existing protocols.
  • Today’s connecting primitives have two features: route selection (ranking) and route redistribution (information exchange). Configured using various optional settings in router configuration scripts.
  • Misconfiguration can lead to suboptimal routes, or loops.
  • Two questions: how should routes be compared? and when should they be redistributed? The idea is to have a conversion function from routes to a universal metric. Each routing instance is associated with a pair of conversion functions. Now frame the problem as what properties these functions should satisfy.
  • Contributions: sufficient conditions to guarantee correct routing.
  • Goal: unmodified BGP, OSPF, RIP, EIGRP. But BGP and OSPF E2 are not strictly monotonic.
  • Non-BGP protocols modeled by a 2-dimensional ranking, comprising route type and cost. Conversion functions map these to and from an ordered set.
  • Route selection: prefer non-BGP to BGP routes; among non-BGP routes prefer type A, then B, then C; and among non-BGP routes of the same type prefer the lowest cost.
  • Domain backup/partition healing: currently possible, but only with a complex configuration, a star topology and giving protection only to the leaves. In the new design with the default conversion functions, can do this with any topology and any available path.
  • Router-level shortest path: currently only between OSPF and cannot change the cost. The new design can do it with OSPF, RIP and EIGRP using per-instance metrics.
  • Traffic engineering: existing design allows this only within instance, but now we can do it across instances.
  • Q. Do the functions have to be common across all the routers? Yes, it has to be consistent across all the border routers. But future work would let it be different.
  • Q. Do you see the potential for work in the standards bodies to standardize these functions? Talking to the router vendors about this.
  • Q. What is the behavior during convergence? We haven’t focused on that. After some time, we converge to a correct path. How can you be sure that it is correct without enforcing a global ordering? We do enforce a global ordering at present.
  • Q. Does your scheme require a centralized entity? We have no such entity. What about between different ASs? …
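The route-selection order above amounts to mapping every route into one ordered "universal metric" space and taking the minimum. A minimal sketch, with illustrative protocol names and type letters (the real conversion functions are per-instance and far richer):

```python
# Sketch of a universal-metric conversion for route selection.
TYPE_RANK = {"A": 0, "B": 1, "C": 2}

def universal_metric(route):
    """Map (protocol, type, cost) into one totally ordered space:
    non-BGP beats BGP; then type A < B < C; then lowest cost."""
    proto, rtype, cost = route
    is_bgp = (proto == "BGP")
    return (is_bgp, TYPE_RANK.get(rtype, 99), cost)

def best(routes):
    return min(routes, key=universal_metric)
```

Because every border router applies the same conversion, the ordering is globally consistent, which is what the safety proofs rely on.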

DONAR: Decentralized Server Selection for Cloud Services

  • Server selection as a customizable constrained optimization. This talk describes a distributed solution to this problem.
  • User-facing services are being replicated across the world. Server selection involves selecting which replica to use for a particular user. Want to be agnostic to the wide-area protocol being used. For example, distributed DNS and HTTP redirection/proxying.
  • Idea was to build a system, DONAR, to which server selection can be outsourced.
  • Policy interface. Naïve policies include round-robin (for load balancing), or location-aware closest-node. But want to support complex policies over many nodes.
  • Policies are represented by constraints. For example, a bandwidth cap, or a split ratio and allowable deviation. Without constraints, just use closest node, but this can lead to massive imbalance.
  • Improvement: add a bandwidth cap on some instances, which now gives some locality, but doesn’t overtax a particular instance (and offloads some traffic to other instances).
  • Improvement: split 10% across ten replicas with a +/- of 5%. The tolerance is unique to this implementation. Gives the ability to trade off network proximity and load distribution.
  • Can have a mix of bandwidth cap, proportional split and tolerance across different instances.
  • For a customer, can define a global linear program that describes the optimal pairing. Minimize network cost (distance) such that load is within tolerance and bandwidth caps are met.
  • Need to do this for each customer, and continuously. Aim is to have potentially hundreds of DONAR nodes, customers and replicas per customer, and tens of thousands of client groups per customer. The linear program has millions of variables per customer.
  • DONAR nodes could measure traffic and optimize locally. But no one node sees the entire client population, and the distribution at each node is skewed.
  • Could maybe aggregate all of the data at a central coordinator. This would end up sharing a lot of data and compromising responsiveness. For example, we would want a node to respond quickly to a flash crowd.
  • Actually came up with a decomposition strategy, for both the objective function and the constraints. Uses a Gauss-Siedel iterative decomposition approach. The summary data shared is only proportional to the number of replicas. Proof in the paper that this converges to the global optimum.
  • Deployed since November 2009, and in production use for two customers. Currently serves around one million DNS requests per day.
  • Other systems challenges in the paper: network availability, reliable data storage, etc.
  • Experimental setup using CoralCDN with a proportional split across the replicas. Closest node policy is very volatile, whereas DONAR equal split gives much more predictable workloads.
  • Better than simple round-robin, since DONAR keeps the network distance as small as possible (shown by the rank distribution of node distance from the client).
  • Q. How quickly does the global solution converge? Covered in the paper, and usually after one round of hearing from everyone.
  • Q. Can you incorporate server-specific policy like consistency or staleness (cf. weak/eventual consistency, where different replicas have different data)? We assume that we can direct a request to every replica. But we could solve this problem for separate domains.
  • Q. What’s the inaccuracy that comes from geolocation based on the location of the resolver? This has been well studied.
  • Q. What if a server’s settings change or it fails, how long will it take to recompute? Built-in support for liveness updates (either notification or loss of heartbeat), and use a heuristic before we rerun the optimization. Rerun the optimization every two minutes.
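The "split ratio with tolerance" policy above can be caricatured as follows: send each client group to the nearest replica whose share of total traffic would stay within target ± tol. All names and numbers are made up, and the real system solves this as a decomposed linear program rather than greedily.

```python
# Toy greedy version of DONAR's constrained server selection.

def assign(groups, replicas, dist, target, tol):
    """groups: {client_group: request_count}; dist: (group, replica) ->
    network distance; target/tol: desired load fraction per replica."""
    total = sum(groups.values())
    load = {r: 0 for r in replicas}
    plan = {}
    for g, n in sorted(groups.items(), key=lambda kv: -kv[1]):
        ok = [r for r in replicas if (load[r] + n) / total <= target + tol]
        # Fall back to all replicas if the cap cannot be respected.
        choice = min(ok or replicas, key=lambda r: dist[(g, r)])
        load[choice] += n
        plan[g] = choice
    return plan, load
```

Even this toy shows the trade-off in the bullets: with no tolerance everyone piles onto the closest replica; with a tolerance, traffic spills to farther replicas only when it must.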

Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud

  • Challenge: data privacy issues, such as national privacy laws and industry-specific privacy laws (like HIPAA). Challenge: SLA requirements like response time.
  • Possible solution is hybrid clouds.
  • First focus is on planning hybrid cloud layouts, making cost savings, minimizing response times and bandwidth costs.
  • Second focus is on migrating security policies, such as firewall contexts and ACLs.
  • Contributions: study of the complexity of enterprise applications, first-cut solutions to the two challenges, and validations.
  • Enterprise applications are typically three-tier but with multiple components in each tier, and complex interdependencies.
  • Abstract the planning problem using a graph with vertices for components, and virtual vertices for internal and external users. Each vertex has a size (number of servers), and edges have a number of transactions per second and size of transactions per second. Objective is to maximize cost savings through migration, subject to policy constraints and bounds on the increase in transaction delay. Then partition the graph between local and remote.
  • Approach is to use easily-available information, like computation times of components and communication times on links.
  • Model user response times using bounds on changes to the mean delay and variance.
  • Benefits for cost savings based on the estimates in the Above the Clouds tech report. Based on a non-elastic migration, and future work is to look at using the cloud for peaks.
  • Migration algorithm is based on a reachability matrix to determine necessary security policies.
  • Evaluated based on two case studies, the Windows Azure SDK application and a Campus Enterprise Resource Planning application.
  • Used a cloud testbed to evaluate a thumbnail-making example application. The plan results in an increase in mean delay of less than 10% and an increase in variance of less than 50%.
  • The campus ERP application involves multiple front-end, business logic and back-end components, in use at Purdue.
  • With a 30% bound on increase in mean delay, get $58k savings by migrating all components. For a 15% bound, get $38k savings by migrating a subset of the components. For a 15% bound with a placement policy (don’t move DBs), still get a $14k saving. Paper contains a sensitivity study on the benefit ratios.
  • Various security policies are also generated and evaluated.
  • Q. How much time did it take to understand the dependency mapping between components? Talked to operators and interviewed them, so the dependencies were extracted by humans. What about license dependencies, such as “may not be run in a VM”? [Taken offline.]
  • Q. Why did the delay increase more than the policy in the graphs? Used PlanetLab before a deadline, so the environment was hostile.
  • Q. Can you comment on the complexity of rewriting applications to run in split mode, which seems to be more important than performance or cost? There is an increasing trend towards service-oriented architectures, which make it easier to do this migration. Haven’t solved this for legacy applications.
  • Q. Did you have an anti-colocation scheme, e.g. for preventing certain pieces of data from being stored in certain locations? Constraints are based on cloud/non-cloud.
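The planning problem above can be caricatured as a tiny brute-force partition search: pick the set of components to migrate that maximizes savings while keeping the new cross-partition traffic under a bound. The real work formulates this over delay bounds and the component graph; all names and numbers here are invented.

```python
# Toy brute-force version of the hybrid-cloud placement problem.
from itertools import combinations

def best_plan(savings, traffic, pinned, max_cross_traffic):
    """savings: {component: saving if migrated}; traffic: {(a, b):
    transactions/s between components}; pinned components must stay
    on-premises (e.g. the "don't move DBs" policy)."""
    candidates = [c for c in savings if c not in pinned]
    best_set, best_val = frozenset(), 0
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            s = set(subset)
            # Traffic on edges that now cross the local/cloud boundary.
            cross = sum(t for (a, b), t in traffic.items()
                        if (a in s) != (b in s))
            if cross > max_cross_traffic:
                continue
            val = sum(savings[c] for c in s)
            if val > best_val:
                best_set, best_val = frozenset(s), val
    return best_set, best_val
```

Brute force is only viable for toy graphs; it is here to make the objective and constraints concrete, not to suggest the paper's algorithm.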

SIGCOMM 2010: Day 1

Tuesday, August 31st, 2010

Wireless and Measurement

Efficient Error Estimating Coding: Feasibility and Applications

  • Won best paper award.
  • Existing philosophy: errors are bad and only want completely correct data.
  • Can we accept partially correct packets and only enforce correctness end-to-end?
  • Contribution: error estimating coding. Enables the receiver to estimate the number of errors in a packet, but not correct them.
  • Smaller overhead/weaker functionality than error correcting codes.
  • Overheads come from redundancy and computation.
  • EECs need only O(log n) bits redundancy to estimate errors. e.g. 2% overhead on a 1500-byte packet. For just a threshold check, need only 4 bytes.
  • Efficient computationally: software implementation can support all 802.11 data rates. ECC is 10 to 100 times slower.
  • Can estimate number of errors in a packet with a provable quality.
  • Application: streaming video. FEC (forward error correction) often used here. Routers forward partially correct packets. But if the number of errors is so large that the data are unrecoverable, it will incur retransmission. Router should have asked for retransmission earlier when a packet could not be decoded. But it lacks computational power to evaluate an ECC. EEC is more computationally tractable in this scenario for BER-aware retransmission.
  • Implemented for Soekris Net5501-70 routers.
  • Key idea: router should treat different packets differently. Could use analogue or digital amplification as necessary.
  • Packet scheduling: image sensor network for emergency response. Let packets with smaller BER get through first.
  • Applies also to bulk data transfer. Can use partial packets and correct them end-to-end. Could use network coding or incremental redundancy. EEC helps to do WiFi rate adaption, if we know the mapping between data rate and BER, which EEC provides. Existing systems based on packet loss rate, signal-to-noise ratio (at the receiver), or modifying the PHY.
  • Implemented a prototype rate adaptation scheme using EEC. Consistently outperforms existing schemes based on packet loss rate and SNR.
  • More general problem is wireless carrier selection: goal is to select carrier with the best goodput.
  • Packet has n + k slots (n data bits and k EEC bits). p is the fraction of erroneous slots, with arbitrary position.
  • Naïve solution would be to add pilot bits to each packet with known values at known positions. But this needs the receiver to observe enough errors in the pilot bits. Since BER is usually small, need a lot of pilots to see a single error.
  • Instead, make the pilot bit a parity bit based on a known group of data bits in the packet. But cannot distinguish many cases with parity bit (1 error vs. 3 errors). Error probability of a parity bit is (inversely) correlated with error probability of the bits in its group.
  • Solution involves randomly selecting data bits to make up fixed sized groups and compute an EEC parity bit. Now permute the data and EEC bits and send.
  • Can refine to single- or multiple-level EEC. Details of multiple-level EEC in the paper.
  • Can prove a formal guarantee for the space complexity of the number of EEC bits (the O(log n) bound).
  • Compared to SoftPHY, EEC is a pure software solution, which is more deployable. But SoftPHY gives per-bit confidence information that EEC cannot provide.
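The parity-group construction above can be prototyped directly. This is a simplified single-level code with a group size I chose myself, and it assumes the EEC bits themselves arrive intact, which the real (multi-level, permuted) code does not.

```python
# Simplified single-level EEC: estimate BER from failing parity checks.
import random

GROUP = 32  # data bits per parity group (my choice for the sketch)

def encode(bits, n_parity, seed=0):
    """Each EEC bit is the parity of a random group of data bits."""
    rng = random.Random(seed)
    groups = [rng.sample(range(len(bits)), GROUP) for _ in range(n_parity)]
    parities = [sum(bits[i] for i in g) % 2 for g in groups]
    return groups, parities

def estimate_ber(received, groups, sent_parities):
    """A parity check over GROUP bits fails with probability
    q = (1 - (1 - 2p)**GROUP) / 2 at BER p; invert that for p."""
    fail = sum(1 for g, p in zip(groups, sent_parities)
               if sum(received[i] for i in g) % 2 != p)
    q = min(fail / len(groups), 0.499)  # clamp to keep the root real
    return (1 - (1 - 2 * q) ** (1 / GROUP)) / 2
```

This also shows why plain pilot bits are wasteful: a parity bit over 32 data bits fails roughly 24% of the time at a 1% BER, so a few hundred of them give a usable estimate where raw pilots would almost never see an error.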

Design and Implementation of an “Approximate” Communication System for Wireless Media Applications

  • Leverages properties of wireless PHY layer to improve media applications.
  • Media applications will comprise 91% of internet traffic by 2013, and wireless access is the dominant form of access (4 billion wireless hosts vs. 0.5 billion static hosts).
  • In this case, we should use the spectrum efficiently for media transfer.
  • Looking at hierarchically-structured media, such as MPEG4 and H.264. Different frames (e.g. I, P, B) have different value (I > P > B). So use unequal error protection to prioritize important frames.
  • Since data received is a predictable approximation of transmitted data, they can provide unequal error protection almost for free (in terms of additional spectrum).
  • Errors result from the constellation diagram used to determine QAM encoding. Typically, the error is restricted to symbols in the neighborhood of the received symbol in the constellation.
  • Using a Gray code-based bit mapping from symbols to the QAM encoding. So, by definition, neighboring symbols are just one bit different. Different bit positions offer different levels of protection. So can choose different bit positions to give different protection to data.
  • For the media example, put I frames in more-protected bit positions, and other frames in the other positions.
  • Also considered a block mapping scheme, which gives better protection for the most protected bits, and worse protection for the other bits, than a Gray code mapping.
  • Designed a system based on these principles: APEX.
  • A modern radio pipeline will apply randomization, such as scrambling, coding and interleaving. But this makes it harder to determine what bits go where. So they move randomization to before the assignment of bit positions.
  • Uses a greedy algorithm to map application data at various priorities that deals with unequal content size.
  • Evaluated at various bit rates and bit error rates. Gets a better PSNR than traditional transmission. Also showed that it works well with FEC.
  • Q. Did your evaluation take fading into account, and would the assumptions still hold? Experimentation done using a system not robust enough to take outdoors and do experiments. The assumptions might hold. If you are sharing wireless LANs more, does this become more critical?
  • Q. How will it work when your bit-rate adaption drops down to BPSK and QPSK? It does nothing in this case. How might it work? Could instead use smaller QAM symbols.
  • Q. What if you get 180 degree phase shift, due to fading or propagation delay, and the mapping changes? At the PHY layer, we expect there to be a mapping, and an indexing mechanism that can decode information in the header of the packet to select the mapping.
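The bit-position-protection property above is easy to check numerically: along one Gray-coded amplitude axis of a constellation, adjacent levels differ in exactly one bit, and the high-order positions flip least often, so a symbol error to a neighboring level rarely corrupts them. This is a check of the property only, not the APEX system.

```python
# Count how often each bit position differs between adjacent levels
# of a Gray-coded axis (e.g. 3 bits ~ one axis of 64-QAM).

def gray(n):
    return n ^ (n >> 1)

def flips_per_position(bits):
    levels = [gray(i) for i in range(2 ** bits)]
    counts = [0] * bits
    for a, b in zip(levels, levels[1:]):
        diff = a ^ b
        for pos in range(bits):
            if (diff >> pos) & 1:
                counts[pos] += 1
    return counts  # counts[-1] is the most significant bit
```

Placing I-frame bits in the rarely-flipping positions and B-frame bits in the others gives unequal error protection without extra spectrum.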

Not All Microseconds Are Equal: Enabling Per-Flow Measurements With Reference Latency Interpolation

  • Low-latency (e.g. financial) applications are increasingly important.
  • Current solution is a low-latency cut-through switch. In a tree network, it is hard to tell which switch is causing a problem at microsecond granularity. Need high-fidelity measurement within the routers themselves.
  • But SNMP and NetFlow provide no latency measurements, and active probes are typically only end-to-end. Measurement boxes are very expensive (£90k).
  • Need per-flow measurements because averages lose too much information about what happens to each flow. There is a significant amount of difference in average latencies across flows at a router.
  • Perform measurement on packet ingress and egress. Assume that router interfaces are synchronized, because cannot modify packets to carry timestamps.
  • Naïve approach: store timestamps for each packet on ingress and egress. Packet timestamps get sent along the egress route when the flow stops. Obviously this is too costly to do at 10Gbps. Sampling sacrifices accuracy.
  • Use LDAs with many counters for interesting flows, counting packets seen at each timestamp.
  • Divide time into windows, and measure the mean delay for packets within that window (locality observation). Error shrinks with a smaller window size.
  • Can inject a reference packet regularly that does have ingress and egress timestamps, which gives delay samples for each window.
  • Implementation has two components: reference packet generator and latency estimator.
  • Reference packet generation strategies: 1 in n packets or 1 in tau seconds. Actual approach is dynamic injection based on utilization: when utilization is high, inject fewer packets.
  • Latency estimator strategies: could use only the left reference packet (previous reference packet) or a linear interpolation of the left and right reference packets. Other non-linear estimators, such as shrinkage, are possible.
  • Maintain packet count, summed delay and sum of squares of packet delays for each flow.
  • Evaluated on various router traces, and simulated with the NetFlow YAF implementation. (RED active queue management policy.) Median relative error is 10–12%. As utilization grows, error decreases. RLI outperforms other schemes by 1 to 2 orders of magnitude.
  • Overhead is < 0.2% of link capacity. Packet loss difference is 0.001% at 80% utilization.
  • Q. Have you talked to router vendors about implementing this? No.
  • Q. When you compare the different approaches, why is sampling so much worse? Sampling scheme is one packet per thousand. So accuracy is very low if few packets are delayed.
  • Q. How does it influence the results if packet loss rates are high? Lost packets are not counted in our result.
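The interpolation step and the per-flow counters described above can be sketched as follows; the timestamps are hypothetical (think microseconds), and the real estimator runs in the router's data path.

```python
# Sketch of RLI-style latency estimation from reference packets.

def interpolate_delay(t_pkt, left_ref, right_ref):
    """Estimate a packet's delay from the reference packets bracketing
    its ingress time; refs are (ingress_time, measured_delay) pairs."""
    (t0, d0), (t1, d1) = left_ref, right_ref
    if t1 == t0:
        return d0
    frac = (t_pkt - t0) / (t1 - t0)
    return d0 + frac * (d1 - d0)

class FlowStats:
    """Per-flow counters: packet count, summed delay, sum of squares
    (enough for mean and variance of per-flow latency)."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, delay):
        self.count += 1
        self.total += delay
        self.total_sq += delay * delay

    def mean(self):
        return self.total / self.count
```

Only the reference packets carry real timestamps; every ordinary packet's delay is inferred, which is why the scheme costs so little link capacity.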

Data Center Networks

Generic and Automatic Address Configuration for Data Center Networks

    • Manual configuration is error-prone, causing downtime. DHCP is used at layer 2. But applications need this information as well. Data center designs (manually) encode this information in the IP address for routing. But DHCP isn’t enough for this.
    • Takes two inputs: a blueprint graph (with logical IDs for each machine) that can be automatically generated, and a physical topology graph that is available later when the data center is constructed.
    • Center of framework is a device-to-logical ID mapping. Need malfunction detection to update the mapping.
    • Maintaining a map between devices and the logical IDs is the graph isomorphism problem. The complexity of this problem is unknown (P or NPC). Introduce the O2 algorithm which solves the problem in this case. Proceeds by choosing an arbitrary first node matching (decomposition), then refinement. Terminates when no cell can be decomposed. Overall algorithm terminates when all cells have single nodes.
    • O2 has three weaknesses (i) iterative splitting in the refinement stage, (ii) iterative mapping in the decomposition stage, and (iii) making a random selection of the mapping candidate. Optimization algorithm has three heuristics that address these problems.
    • O2 turns out to be faster than its competitors: Nauty and Saucy.
    • Malfunctions cause the topology to differ from the blueprint, so O2 cannot find the mapping. The solution is to find the maximum common subgraph between the blueprint and physical graphs; this problem is both NP-hard and APX-hard.
    • Use heuristics based on the vertex degree changing. If no degrees have changed, probe subgraphs derived from anchor points using majority voting to identify miswired devices.
    • Protocols for channel building, physical topology collection and logical ID dissemination. A DAC manager coordinates these.
    • Experimented on a BCube(8, 1) network with 64 servers. The total time to run the algorithm was 275 milliseconds to autoconfigure all of the servers.
    • Ran simulations on larger topologies. Up to 46 seconds on a DCell(6, 3) network with 3.84 million devices.
    • Q. What are the next steps for this work? Can we design better logical IDs that can be used in routing.
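The refinement stage described above can be sketched as a partition-refinement loop: split every cell of the current partition by the multiset of cells its members' neighbours fall into. This is a simplified, hypothetical sketch of one side of the computation (the real O2 refines the blueprint and physical partitions in lockstep):

```python
def refine(cells, adj):
    """Repeatedly split each cell of the partition `cells` by its members'
    neighbour signatures, until no cell can be decomposed further.

    adj maps each node to its list of neighbours."""
    changed = True
    while changed:
        changed = False
        # Which cell does each node currently belong to?
        index = {v: i for i, cell in enumerate(cells) for v in cell}
        new_cells = []
        for cell in cells:
            # Signature of a node: sorted multiset of its neighbours' cells.
            groups = {}
            for v in cell:
                sig = tuple(sorted(index[u] for u in adj[v]))
                groups.setdefault(sig, []).append(v)
            if len(groups) > 1:
                changed = True
            new_cells.extend(groups.values())
        cells = new_cells
    return cells
```

On a three-node path graph, for example, one pass separates the two degree-1 endpoints from the middle node, after which no further decomposition is possible.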

    Symbiotic Routing in Future Data Centers

    • Reevaluate network architecture based on the different properties of data center networks when compared to the internet.
    • Despite lots of other work, the network interface has not changed, so what can we do at the application layer?
    • Network is a black box, and applications have to infer things like locality, congestion and failure; likewise networks have to infer things about the applications like flow properties.
    • MSRC designed CamCube: a network with x86 servers directly-connected in a 3D torus. Servers have (x, y, z) coordinates that are exposed to the application. The send/receive API is a simple 1-hop API. Multi-hop routing is provided as a service, which uses multipath when possible.
    • Built a high-throughput transport service, a large-file multicast service, an aggregation service and a distributed key-value cache service. Each had a custom routing protocol based on the properties that the application needed to obtain. e.g. High-throughput transport prefers disjoint paths, whereas file multicast prefers non-disjoint paths.
    • Testbed used 27 servers with 6×1G NICs. A simulator looked at a 20^3 (8000) node CamCube.
    • Custom routing policies yield a performance improvement (on average). Factor of 2 (median) improvement in end-to-end throughput for high throughput transport (10k x 1500 byte packets). Gains also for multicast and the distributed object cache (in terms of path length).
    • Also looked at impact on the network: achieved higher throughput with fewer packets (lower link utilization) for all applications.
    • The base routing protocol is still used to route along paths defined in the custom routing protocol, and to handle network failures; the custom protocols route in the common case.
    • Built a routing framework for describing these custom protocols. Two components: the routing policy and the queuing policy. Each service manages one packet queue per link.
    • Cache service: keyspace mapped onto the cube, evenly distributed across the servers. Routing: go to the nearer of the cache or primary nodes. On a cache miss, route from the cache to the primary, and populate the cache on the return.
    • The base protocol routes around link failures. If a replica server fails (in the key-value store), the key space is consistently remapped by the framework.
    • Forwarding function implemented in C#, running in userspace.
    • Benchmarked a single server in the testbed, communicating at 11.8Gbps with all six neighbors. Required 20% CPU utilization.
    • Can the routing approach be used outside CamCube? Network only needs to provide information about path diversity and topology, and the ability to program components.
    • Q. When you make an application for this framework, what would happen to the application if you decided to change the topology? The benefit of the black-box approach is that you don’t care about the topology. May be advantageous to target containerized/modular data centers, where the topology rarely (or never) changes.
    • Q. How would the performance look on other topologies, considering that the torus is optimal for latency and bandwidth? It is a benefit and a curse, given that we occasionally have long path length for some pairs. Topologies that give you higher path diversity give you better chances to employ these ideas.
    • Q. What if you ran multiple instances of the same application (rather than applications with very diverse routing policies)? We did run this, but the details are in the paper. For the high-throughput transport protocol, you might expect us to be susceptible to congestion, but the forwarding function returns many choices, which you can choose based on minimal queue length, for example.
    • Q. What is the net result here for forwarding latency? This is one of the main critiques of the topology itself. Currently experimenting with smart NICs so that we don’t have to go up to user space for straightforward forwarding.
    • Q. Can you write the forwarding method in the form of many overloaded methods that could be dispatched dynamically? To some extent, but packets are tagged with a service ID, which statically dispatches the forwarding method of the particular service.
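The exposed (x, y, z) coordinates make per-service forwarding functions easy to write: each service picks among the productive next hops on the torus according to its own policy. A rough illustrative sketch (not the CamCube API; names are mine):

```python
def torus_next_hops(src, dst, size):
    """All productive next hops from src towards dst on a 3D torus of side
    `size`: one hop per axis that still differs, stepping whichever way
    around that ring is shorter. A service's forwarding function can then
    choose among these, e.g. by queue length (high-throughput transport)
    or by passing through a cache node (cache service)."""
    hops = []
    for axis in range(3):
        delta = (dst[axis] - src[axis]) % size
        if delta == 0:
            continue  # already aligned on this axis
        step = 1 if delta <= size - delta else -1
        nxt = list(src)
        nxt[axis] = (src[axis] + step) % size
        hops.append(tuple(nxt))
    return hops
```

Returning several candidate hops is what lets a service exploit path diversity, e.g. picking the neighbour with the shortest queue under congestion, as mentioned in the Q&A below.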

    Data Center TCP

    • TCP is used for 99% of traffic in data centers, but what is the problem with it? Can suffer from bursty packet drops (Incast), and builds up large queues that add significant latency.
    • Many ad hoc workarounds for TCP problems, such as at the application level. This talk is about changing the TCP stack in the kernel to address these problems.
    • Interviewed developers, analyzed applications and did a lot of measurements. Systematic study of impairments and requirements in Microsoft’s data centers.
    • Case study: Microsoft Bing data center. 6000 servers, with passive instrumentation (application/socket/kernel-level). Search query goes to top-level aggregator, which splits the query and farms it out to mid-level aggregators, which then farm it out to worker nodes. Worker nodes have a 10ms deadline; mid-level aggregators are 50ms; and the top-level deadline is 250ms. Missed deadlines lead to missing data in the results.
    • Similarly, Facebook builds a page by pulling data from various servers. Similar traffic pattern to Bing.
    • Incast happens when “synchronized mice collide”. Caused by partition/aggregate: the queue overflows at the aggregator. To deal with the problem, Bing jitters requests over a 10ms window. This gets better performance at the median, but causes problems at higher percentiles (up to 99.9th is tracked).
    • Queue buildup occurs when big flows take up too much of a queue, increasing the latency of short flows.
    • Requirements: 1. High burst tolerance; 2. Low latency for short flows; 3. High throughput. 1 and 3 are in tension with 2. Deep buffers help 1 and 3 but increase latency; shallow buffers are bad for bursts and throughput.
    • Objective is low queue occupancy, with high throughput.
    • DCTCP builds on explicit congestion notification (ECN): marks are set on packets by switches in the middle of the network, noted by the receiver and echoed back to the sender in the ACKs.
    • Need C * RTT buffers for a single flow running at 100% throughput. If you have a large number of flows, you can have fewer buffers. If there is a low variance in sending rate, small buffers are sufficient.
    • Key idea: react in proportion to the extent of congestion, not its presence (cut the congestion window by less if there are fewer congestion notification bits set). Other key idea: mark packets based on the instantaneous queue length.
    • Sender maintains a running average of the fraction of marked packets, and adaptively cuts the congestion window based on how many packets are marked.
    • On a real deployment, DCTCP for two flows keeps the queue length much shorter than regular TCP.
    • Gets high burst tolerance from the large amount of free buffer space, and low latency from low buffer occupancy.
    • How long can DCTCP maintain queues without loss of throughput, and how do you set the parameters? Need to ensure that the queue size is stable by quantifying the oscillations.
    • Implemented on Windows 7 using real hardware with 1G and 10G stacks. Aim was to emulate traffic within one rack of the Bing data center. Generated query and background traffic based on distribution seen in Bing. For background flows, DCTCP gets lower latency for small flows, and matches TCP for large flows. For query flows, DCTCP does much better than TCP.
    • Tried scaling the traffic (background and query) by 10x. Compared DCTCP, TCP and TCP-RED with shallow-buffered switches, and TCP with a deep-buffered switch. TCP-DeepBuf does terribly on latency for short flows, and TCP-ShallowBuf suffers due to Incast.
    • Q. Shouldn’t you be comparing with TCP-ECN? We have done those comparisons, though they aren’t in the talk.
    • Q. If you reduce in proportion to the number of bits, it depends on the timescales on which the queue builds up, which depends on the number of competing sources and their own reactions. Don’t you have to do some control theoretic modeling? Data center is a homogeneous environment with all sources being DCTCP. Even there you really need to figure out the bandwidth properly? There is more detail in the paper.
    • Q. Marking based on the instantaneous queue has implications for the dynamics, which concerns me. Homogeneous RTTs help here, thanks to the data center environment. We believe there are simple ways to handle heterogeneous RTTs (e.g. multiple queues). I’d like to see how much diversity you can tolerate.
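The two key ideas (react in proportion to the extent of congestion, estimated as a running average of the marked fraction) can be sketched in a few lines. This is my paraphrase of the update rules described in the talk, not the Windows implementation; `g` is the EWMA gain:

```python
def dctcp_update(alpha, cwnd, acked, marked, g=1 / 16):
    """One round of DCTCP sender logic.

    alpha: running estimate of the fraction of ECN-marked packets.
    cwnd: congestion window. acked/marked: packets ACKed and marked
    this window. Regular TCP would halve cwnd on any mark; DCTCP cuts
    in proportion to alpha, so light congestion causes a small cut."""
    frac = marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * frac   # EWMA of the marked fraction
    if marked:
        cwnd = cwnd * (1 - alpha / 2)    # proportional window reduction
    return alpha, cwnd
```

With every packet marked, alpha converges to 1 and the cut approaches TCP's halving; with a small marked fraction, the window shrinks only slightly, which keeps throughput high while the queue stays short.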

    Inter-Domain Routing and Addressing

    Internet Inter-Domain Traffic

    • It’s hard to measure the Internet, and the lack of ground truth data makes it harder to know if you’re doing it accurately.
    • Wanted to collect a large data set that shows how the internet is evolving.
    • Conventional wisdom: global-scale end-to-end network, broad distribution of traffic sources and sinks, and the existence of a “core”. But as time passes, these are becoming less true.
    • Methodology: focussed on inter-domain traffic, not application layer things such as web hits/tweets/VPN/etc. Exported coarse-grain traffic statistics about ASNs, ASPaths, protocols, ports, etc. via anonymous XML forwarded to central servers. Covers 110 ISPs/content providers, 3k edge routers, 100k interfaces… i.e. about 25% of all inter-domain traffic. Then waited 2 years and repeated to get longitudinal data. Used commercial probes within a given ISP, with limited visibility into payload-based classification. Calculated percentages per category then weighted averages using number of routers in each deployment. Also incorporated informal and formal discussions with providers, and information about known traffic volumes.
    • Validated predictions based on a ground-truth based on 12 known ISP traffic demands (Known peak Tbps).
    • In two years, Google and Comcast have grown to be the 3rd and 6th biggest carriers in terms of traffic demands.
    • Cumulative distribution of carriers: in 2007, thousands of ASNs contributed the first 50% of content; in 2009, it was 150 ASNs. In 2010 it’s even more dramatic (but not shown).
    • By buying YouTube, Google went from 1% of internet traffic to 6% (not including the global cache — i.e. an underestimate).
    • In 2007, Comcast had “eyeball” peering ratios, but by 2009 they were a net producer of content. Video distribution, backbone consolidation, etc. contributed to this.
    • Price per megabit of wholesale for internet transit has collapsed whereas the revenue for internet advertisement has greatly increased.
    • IP and CDNs have been commoditized, hence enterprise traffic has moved to the cloud. Companies have consolidated. Bundling (triple/quad play etc.) has become popular. Several new economic models, such as paid content, paid peering, etc. (but often under NDA). Also disintermediation as customer and provider connect directly.
    • The traditional internet model is hierarchical: not really accurate, but pedagogically useful. The new “Tier-1” includes “hyper giants”: the large content providers.
    • In terms of protocols: HTTP has grown 24.76%, video has grown 67.09% and P2P has shrunk > 70%.
    • Port usage is also consolidating: similar distribution to the content providers: fewer ports account for more of the internet. Looked at Xbox traffic on TCP port 3074, and saw a huge drop. Actual cause was a move to port 80.
    • File sharing has migrated to the web. In 2010, P2P is the fastest-declining application group. Direct-download like Megavideo, etc. are much more popular now, and you can even get HD streaming.
    • Internet traffic changes have been driven by changing economic models. Shift from connectivity to content, and the move to port 80 are two major trends. This has implications for engineering and research, as security/fault tolerance/routing/traffic engineering/network design have become more difficult.
    • Q. Economic implications: what are other types of business models and arrangements that might come out of “hyper giants”? Still in the early stages, but it’s not clear - from a power perspective - who has the upper hand.
    • Q. If the number of top players decreases by an order of magnitude, do you see the role of CDNs diminish and do you have any data on that? Talk about CDNs in the paper (about 10% or more of internet traffic). In general, they are growing, but enterprise content is driving a lot of the CDNs.

    How Secure are Secure Interdomain Routing Protocols?

    • After a decade of research on secure BGP, there is still no consensus on the best solution. So this paper evaluates how well each protocol prevents “traffic attraction attacks”, based on simulation on empirical data. Used an AS-level map of the internet and business relationships, and a standard model of routing properties.
    • How do we strictly enforce correct path announcements? Solutions range from no crypto (BGP) to lots of crypto (data-plane security). It turns out that this isn’t the only problem: it is just as important to control whom you announce to as what you announce. Defensive filtering is also important.
    • Different relationships: customer-provider (customer pays), and peering (no payment). No value for how much a path costs in the model, but the paper models routing by preferring cheaper paths. After that, prefer shorter paths. Only transit traffic if it makes you money (i.e. on behalf of your customers).
    • A traffic attraction attack is an attempt to get as many people as possible to route traffic through the attacker’s AS (for tampering, eavesdropping, etc.). Simulations show that a traffic attraction attack can pull in traffic for 62% of ASs.
    • Origin authentication: secure database of IP to AS mappings to prevent people advertising origins they don’t own. Simulation shows 58% of ASs get attracted to the attacker.
    • Secure BGP: enforces that ASs cannot announce paths that have not been announced to them (using digital signatures), so an attacker can only append its own AS to a legitimately received path. Simulation shows that the attacker still attracts 18% of ASs.
    • Defensive filtering (of stubs (i.e. ASs with no customers, which should not route traffic)): provider drops announcements for prefixes not owned by its stubs. Defensive filtering thwarts all attacks by stubs (i.e. all of the previous cases), and 85% of ASs are stubs.
    • Sometimes announcing longer paths is better than announcing short paths; sometimes announcing to fewer neighbors (than all of them) is better. It’s NP-hard to find the optimal attack strategy.
    • Ran experiments by choosing randomly an attacker, victim pair and simulated a “smart attack” against each protocol.
    • Evaluated probability that an attack attracts 10% of ASs (over a random choice of attacker and victim). Defensive filtering alone is as effective as Secure BGP alone. However, the attacks could be smarter, so these numbers are underestimates.
    • Why aren’t we using defensive filtering today? It’s hard to keep track of the prefixes that your customers own. A push to implement origin authentication is ongoing, and this could be used to derive the filtering mapping. The threat model is somewhat strange though.
    • Q. The CAIDA data are less than ideal, so how robust are the statements in the paper? Ran the experiments twice, on CAIDA data from 2009, and the UCLA Cyclops data. Trends for the two sets are similar.
    • Q. Did you look at comparing the effectiveness of models versus their deployability? Completely ignored deployability and implementability in this paper, but it will be the subject of future work.
    • Q. Have you considered varying the 10% threshold? Yes. [Backup slide.]
    • Q. Seems like the easiest way to attract traffic would be to deaggregate the prefix, so do you take this into account? We didn’t evaluate it because we idealized things to make it tractable.
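The "standard model of routing properties" used in the simulations (prefer customer routes, then peers, then providers, break ties by path length) can be sketched as a comparison function. This is a hypothetical encoding of the model, not the authors' simulator:

```python
def prefer(route_a, route_b):
    """Return the preferred of two candidate routes under the standard
    economic routing model: routes learned from customers (who pay you)
    beat routes from peers, which beat routes from providers; ties are
    broken by shorter AS path. Each route is a dict with the neighbour
    relationship it was learned over ('from') and its AS path ('path')."""
    rank = {"customer": 0, "peer": 1, "provider": 2}
    key_a = (rank[route_a["from"]], len(route_a["path"]))
    key_b = (rank[route_b["from"]], len(route_b["path"]))
    return route_a if key_a <= key_b else route_b
```

Under this model an attacker's announcement wins at a given AS whenever it ranks ahead of the legitimate route, which is why announcing a short path over a customer link attracts so much traffic.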

    Understanding Block-level Address Usage in the Visible Internet

    • What can simple observations about the internet say? Contributed methodology, applications and validation.
    • Looked at spatial correlation, address utilization, dynamic addressing and low-bitrate identification. Based on data set gathered for IMC. New, deeper understanding and a new interpretation.
    • Is there spatial correlation in the IPv4 address space? Are adjacent addresses likely to be used in the same way? This could help to efficiently select representative addresses for more-detailed study.
    • Collected data by pinging each address in randomly selected /24 blocks every 11 minutes for a week, and collected the probe responses, probing 1% of the whole IPv4 space. Gives 5 billion ping responses.
    • Three metrics: availability (normalized sum of up durations), volatility (normalized number of up durations) and median-up (median duration of a period of uptime).
    • Graphed by mapping the IP addresses onto a Hilbert curve.
    • Algorithm: examine each block; if it is homogeneous, stop; otherwise split the block and recurse.
    • Validating spatial correlation is hard, because it is hard to find a ground truth. Therefore used USC’s network for comparison, and the general internet (hostname-inferred truth). Also evaluated for different samples and dates.
    • Selected USC because the operator provided the ground truth, and they had knowledge of both allocated and usage blocks.
    • 43% false negative rate, and 57% of blocks are correctly identified.
    • The general internet gives unbiased truth. The results are more correct than for USC: 68% correct to 32% false negative.
    • Low-bitrate identification: formalized RTT = transfer + queuing + propagation delays. Tried using median RTT to identify low-bitrate versus broadband. However, for international links, propagation time dominates. Variance of RTT separates low-bitrate from broadband.
    • Used hostnames as a form of ground truth (e.g. if the hostname contains “cable” or “3G”).
    • Q. Did you do any of this on IPv6? No. Are you planning to? Probably in the future when IPv6 is more popular.
    • Q. Is it reliable to use ping response to detect hosts, when some hosts refuse to handle ping responses? We didn’t consider false information in this work, but it would be valuable to consider this in future.
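The three per-address metrics can be computed directly from the probe series. A rough sketch of one plausible formulation (my own reconstruction; the paper's exact normalisations may differ):

```python
import statistics

def address_metrics(probes, interval):
    """Compute availability, volatility and median-up for one address from
    a series of ping outcomes (True = response) taken every `interval`
    seconds.

    availability: fraction of the observation window the address was up.
    volatility:   number of up periods, normalised by the maximum number
                  of up periods the series could contain.
    median_up:    median length of an up period, in seconds."""
    up_runs, run = [], 0
    for up in probes:
        if up:
            run += 1
        elif run:
            up_runs.append(run)
            run = 0
    if run:
        up_runs.append(run)
    total = len(probes) * interval
    availability = sum(up_runs) * interval / total
    volatility = len(up_runs) / ((len(probes) + 1) // 2)
    median_up = statistics.median(r * interval for r in up_runs) if up_runs else 0
    return availability, volatility, median_up
```

With an 11-minute probe interval over a week, each address yields roughly 900 samples, so these metrics are coarse but cheap to collect at the scale of 1% of the IPv4 space.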

    SOSP 2009: Day 2

    Tuesday, October 13th, 2009

    I/O

    Better I/O Through Byte-Addressable, Persistent Memory

    • DRAM is fast, byte-addressable and volatile, but disk/Flash are non-volatile, but slow and not byte-addressable. BPRAM is all three!
    • Phase-change memory is a promising candidate. Bits are encoded as resistivity. Access latency is in the nanoseconds, and endurance is far better than flash. Designed BPFS for BPRAM.
    • Goal: FS ops commit atomically and in program order. Data is durable as soon as the cache flushes. Use short-circuit shadow paging to get this (new consistency model).
    • Eliminate DRAM buffer cache; use L1/2 instead. Put BPRAM on the memory bus. Provide atomicity and ordering in hardware.
    • Both BPRAM and DRAM are addressable by the CPU: physical address space is partitioned into volatile/non-volatile.
    • BPFS gets better performance than NTFS on the same media.
    • What happens on crash during update? Short-circuit shadow paging comes into play (contrast with journalling or shadow paging). Overhead of journalling is that all data (or metadata) must be written twice. Shadow paging uses copy-on-write up to the root of the FS: drawback is that writes propagate all the way back to the root (multiple updates), and small writes have a large copying overhead.
    • Short-circuit shadow paging makes in-place updates where possible. Uses byte-addressability and atomic, 64-bit writes. Both in-place updates and appends are made simple by this technique. Cross-directory rename does bubble up to the common ancestor.
    • Problem: if data is cached in L1/L2, the ordering of cache eviction can lead to inconsistent states. Also, writes from the cache controller might not be atomic.
    • So add two new hardware components to the CPU and cache controller. Epoch barriers are used to declare ordering constraints and they are much faster than a write-through cache. Also add capacitors to DIMMs which allow writes to propagate even after the loss of power.
    • Do CoW then Barrier then Commit. Paper also shows how to make it work on multiprocessors.
    • Built and evaluated on Windows as an in-kernel file system.
    • Microbenchmarks (append n bytes and random n-byte write) compare NTFS/Disk, NTFS/RAM and BPFS/RAM. (Using DRAM in these experiments.) BPFS is significantly faster than NTFS on disk, and NTFS isn’t syncing so it isn’t durable!
    • Postmark benchmark compares NTFS/Disk, NTFS/RAM, BPFS/RAM and (projected) BPFS/PCM. BPFS/PCM is much faster than both NTFS/Disk and NTFS/RAM. Analytical projection based on sustained throughput of PCM.
    • Q: are the storage requirements of a database and of a file system converging when you have this hardware available? The changes to the hardware will be applicable to other sorts of storage systems like a database, not just filesystems.
    • Q: have you thought about how to expose more capabilities of this medium to the applications (not just sequential reads and writes)? Applications are currently written in terms of what is efficient.
    • Q: how do you atomically deal with a free-list? We don’t have a free-list. Don’t need to keep track of so many data structures because the medium is so fast, which lowers the consistency overhead.
    • Q: where do you go next for multiprocessors and clusters? One of the goals was to have multiple concurrent threads operating on the FS at the same time.
    • Q: is there a risk that data in the capacitor gets garbled after the machine gets switched off? [Taken offline.]
    • Q: could you go even faster without having the consistency guarantees, for applications that don’t need it? There’s always a trade-off here.
    • Q: how do you do mmap, and is meddling with L1/L2 caches going to be expensive? Haven’t implemented mmap yet, but we would have a much better guarantee of durability. Changes to the cache, in the paper have been looked at in terms of interference, and performed well.
    • Q: why didn’t you benchmark against a file system using a B-tree or a red-black tree that can take advantage of random writes [why did you compare against NTFS]? NTFS is widely used, and it’d be interesting to compare against others.
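The short-circuit decision at the heart of the scheme is: if a write fits in one atomic 64-bit update, do it in place; otherwise copy-on-write the affected node and publish the copy with a single atomic pointer swap, stopping the propagation as low in the tree as possible. A loose illustrative sketch (hypothetical helpers, not the BPFS code; the real system also issues epoch barriers to order the copy before the commit):

```python
ATOMIC_WRITE = 8  # bytes the hardware can commit atomically (64 bits)

def commit_write(block, offset, data):
    """Apply a write to a file-system block.

    Small writes are performed in place, relying on atomic 64-bit stores.
    Larger writes mutate a shadow copy, which the caller then publishes by
    atomically swapping one parent pointer -- the 'short-circuit' point."""
    if len(data) <= ATOMIC_WRITE:
        # In-place update: the hardware guarantees this is atomic.
        block[offset:offset + len(data)] = data
        return block
    # Copy-on-write: build the new version off to the side...
    shadow = bytearray(block)
    shadow[offset:offset + len(data)] = data
    # (an epoch barrier would go here, ordering the copy before the swap)
    return shadow  # ...then one atomic pointer update commits it
```

Only cross-directory rename forces the copy-on-write to bubble all the way up to the common ancestor; everything else commits lower down.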

    Modular Data Storage with Anvil

    • Data storage drives modern applications (everyone has a database) and they are frequently a bottleneck. Hand-built stores outperform general-purpose ones by up to 100x. Observe that changing the layout can substantially improve performance. Custom storage is hard to write, especially in order to provide consistency guarantees. Can be prohibitively expensive to experiment with new layouts.
    • Need a simple and efficient modular framework to support a wide variety of layouts.
    • Fine-grained modules: dTables. These are composable to build complex data stores. All writing is isolated to dedicated writable dTables, which incidentally has good disk access properties.
    • dTable = key/value store. Maps integers/floats/strings/blobs to blobs. Provides an iterator to support in-order traversal. dTables used by applications and frontends, and also other dTables. Can transform data, add indices or otherwise construct complex functionality from simple pieces.
    • Example of a mapping from customer IDs (mostly contiguous) to states. Start with an array dTable for the common case. Layer a dictionary on top of that (maps state names to array indices). Have an exception dTable for the case where a customer isn’t in one of the 50 states, and a linear-search dTable for their residences. But to make this fast, layer a B-Tree index dTable on top of the linear store.
    • So far just read-only. Updates are hard to do transactionally. Need to implement a write-optimized dTable. Fundamental writable dTable is the journal dTable. New data is appended to a shared journal; data are cached in an in-ram AVL tree. The journal is digested when it gets large. Transaction system is described in the paper. Layer this over read-only dTables.
    • Managed dTable goes at the top. Also have a Bloom filter dTable to deal with multiple overlaid read-only dTables.
    • Many additional dTables listed in the paper.
    • Evaluated the effect of simple configuration changes on performance (modularity). Key lookup workload, comparing contiguous versus sparse keys. Contiguous good with arrays; sparse good with B-trees. Also show the benefit of layering an index on top of a linear store, and the low overhead of the exception dTable.
    • Evaluated by running TPC-C. Replaces a SQLite backend with Anvil. Shows that Anvil outperforms both the original backend and MySQL. Split read and write stores perform well.
    • Evaluated the cost of digesting and combining. These can be done in the background, taking advantage of additional cores and spare I/O bandwidth. Measured the overhead when doing a bulk load (1GB) into the dTable, with digests every few seconds.
    • Q:  why didn’t you compare either performance or features against BDB, which is very similar? Didn’t find it as easy to construct read-only data stores in BDB: creating customisable data stores has a lot of transactional overhead.
    • Q: did you evaluate iteration? In paper. How does the performance depend on the order of updates? Not sure what you mean. Did look at overlay iteration in the paper, which ought to be the most expensive (due to key lookup cost), and overhead was only 10%.
    • Q: how did you make it so that the creator of a new dTable doesn’t have to consider ACID semantics? Most dTables are read-only, so you don’t need to worry about this (like shadow paging). The managed dTable has a small hook that enforces transactional semantics. And read/write dTables? Don’t envision that people will need to create these. Could implement your own, but this would miss the point.
    • Q: how should developers write recovery tools for systems like Anvil? Anvil includes such a tool that handles recovery for you. Read-only semantics makes this much simpler.
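The dTable composition idea, every layer exposes the same key/value interface and delegates to the layers beneath it, can be sketched in a few lines. These class names are hypothetical, standing in for Anvil's C++ modules:

```python
class ArrayDTable:
    """Read-only dTable for mostly-contiguous integer keys: the customer-ID
    example's common case."""
    def __init__(self, values):
        self.values = values

    def get(self, key):
        if 0 <= key < len(self.values):
            return self.values[key]
        return None

class ExceptionDTable:
    """Overlay dTable: consult a small exception store first (customers not
    in one of the 50 states), falling back to the base dTable."""
    def __init__(self, exceptions, base):
        self.exceptions, self.base = exceptions, base

    def get(self, key):
        v = self.exceptions.get(key)
        return v if v is not None else self.base.get(key)
```

A full stack would add a dictionary dTable (mapping array indices back to state names), a B-tree index over any linear store, and a writable journal dTable on top; each layer composes through the same `get`/iterate interface.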

    Operating Systems Transactions

    • [Unfortunately, I missed this talk due to having an obligation to man the registration desk. I'll try and track down the video and update this later.]

    Parallel Debugging

    Do You Have to Reproduce the Bug at the First Replay Attempt? — PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

    • Concurrent programs are hard to write. Multi-core makes concurrent programming more important, and bugs more common. However they are non-deterministic (requiring e.g. a special or improbable thread interleaving). This makes it hard to reproduce them.
    • Deterministic replay for uniprocessors is relatively easy: only need to record inputs, thread scheduling and return values of system calls. On multiprocessors, this is much more challenging. e.g. Simultaneous memory accesses are another source of non-determinism.
    • Previous proposals introduce new hardware that doesn’t yet exist. There are software-only approaches, but they have up to 100x slow-down.
    • Ideally want to reproduce a bug with a single replay and no runtime overhead. But what if we relax this slightly?
    • Idea 1: record only partial information during a production run. Idea 2: push the complexity into diagnosis time. Idea 3: use feedback from unsuccessful (non-reproducing) replay runs.
    • Just record a sketch during the production run. When a replay goes off the sketch, terminate it immediately and feed back information about why it deviated for refining the next replay. Can eventually reproduce the bug with 100% probability.
    • Several different methods for sketch recording. Spectrum of approaches from UP deterministic replay to full MP DR. Can record e.g. synchronization points, or basic blocks, or more: build up this information during the replay runs.
    • At replay time, the partial information replayer consults the sketch to see that recorded global ordering is obeyed. How do we know whether a replay is successful? Use a failure detector based on crash, deadlock or incorrect results. This can also detect unsuccessful replay runs.
    • When a replay attempt fails, start it again. But could do something different the next time: a random approach would just leave it to fate, but PRES is more systematic. Failed reproduction is due to un-recorded data races. The feedback generator captures these races and tweaks them in future runs. Start with many candidate races and filter them down.
    • Implemented PRES using Pin. Evaluated many different applications (desktop, server and scientific). Overhead is around 18%, which is barely more than baseline Pin overhead. Macrobenchmarks show that PRES gets much higher throughput for server applications (MySQL).
    • Effectiveness: UP algorithm doesn’t detect any bugs within 1000 replays, whereas PRES gets 12/13 in 10 attempts. Feedback generation is crucial to effectiveness. Race filtering also effective.
    • Q: could you also use execution traces to help track down which parts of the execution trace cause the bug to not happen, and guide the programmer? Good idea.
    • Q: could you apply PRES to virtual machine replay? Also a good point. The work could be integrated with virtual machines. Could you rollback an execution, is it precise enough? Depends on the fine-grainedness of the recording scheme used. What is the inherent overhead in collecting a trace with sufficient fidelity to do backtracking? If there is a lot of lock operations, the low-overhead approach (SYNC) could work. But if there is no synchronization, we can’t use this information, and a more heavyweight scheme would be needed.
    • Q: why do the results differ so much from the next paper? The main idea is similar, but their work is more focussed on static analysis to reduce runtime overhead.
    • Q: would you advocate this as a solution for long-running (one year or more) services, as it is often only after this time that they emerge? We can take a checkpoint of the process state, which solves the problem of data accumulation. 

    ODR: Output-Deterministic Replay for Multicore Debugging

    • Debugging non-deterministic software failures is really hard. The problem is how to reproduce these failures for debugging. Model checking/testing/verification could work, but it’s not perfect, and it doesn’t capture everything. So we need deterministic replay.
    • Need multiprocess operation, efficient recording, no special hardware and the ability to run arbitrary programs without annotation (especially programs with data races). All related work fails to meet at least one of these requirements.
    • ODR is a user-level replay system, which works in the MP case, has only 1.6x overhead, needs no new hardware and works on arbitrary x86 Linux binaries.
    • Often sufficient to produce any run with the same outputs… needn’t have the exact same execution. So the idea is to relax the determinism requirements.
    • The classic guarantee is value determinism: replay run reads and writes must have the same values as the original. Relax this to “output determinism”: the replay run produces the same user-visible output as the original. This is not perfect, but still useful for debugging: reproduces most visible signs of failure, and enables reasoning about failure’s root cause.
    • How to achieve this? Deterministic-run inference. Basic idea is to translate a program into a logical formula (verification condition). Function of schedule trace, input trace and read trace, returning an output trace. Use a formula solver to yield unknown schedule trace. Scale this by directing the inference using more original-run information. Also by relaxing memory consistency of the inferred run: where values read have nothing to do with schedule order, can use an arbitrary schedule trace.
    • Three-dimensional inference design space: memory consistency (strict, lock order or null), query complexity (output, I/O&lock-order, I/O&lock-order&path or determinant), and inference-time (polynomial or exponential). Search- and query-intensive DRI fit into this space.
    • Search-intensive is really slow (400–60000x slowdown). Formula generation, not solving, is the bottleneck. Use multi-path symbolic execution with backtracking to generate formulas. Each backtrack involves a 200x slowdown. Backtracks are caused by race-tainted branches: a wrong choice leads to a divergent, unsatisfiable path. Work around by backtracking to the most recent race-tainted branch.
    • If we know the path (QI-DRI), inference time improves by 100x but we need to record much more data, so there is a 6x slowdown.
    • Future work is to reduce the path-search space, reduce the cost of each backtrack (cut down on race-tainted branch analysis), and parallelize formula generation by forking threads at each divergence.
    • Q: if I put in arbitrary values in the memory consistency model, how do I ensure that invariants are maintained? We don’t actually do null consistency, but if we did and invariants were violated the program might crash and output determinism catches this (because the output would be different).
    • Q: since your techniques are similar to the last paper, why are the results so different? In our approach, we do race-tainted branch analysis for the entire execution, and that is costly. We also do taint-flow propagation. There is more analysis in each backtracking iteration than theirs. We could reduce this cost by, for example, reusing results from previous iterations.
    • Q: have you considered changing your static analysis to another algorithm that could significantly improve your formula generation time? We are considering static approaches to formula generation.
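    The deterministic-run inference idea can be illustrated with a toy: search for a schedule trace that makes the run reproduce the observed output. Everything here is invented for illustration — ODR encodes the run as a logical formula and uses a solver, not brute-force enumeration, and the two-thread `run` function is a stand-in for a real program.

    ```python
    import itertools

    def infer_schedule(run, inputs, observed_output, n_steps, n_threads=2):
        # Toy deterministic-run inference: enumerate thread schedules and
        # return any one whose run reproduces the observed output.
        for sched in itertools.product(range(n_threads), repeat=n_steps):
            if run(sched, inputs) == observed_output:
                return sched
        return None

    # Toy racy "program": thread 0 doubles x, thread 1 adds 3; the
    # interleaving order determines the final output.
    def run(sched, x):
        ops = {0: lambda v: v * 2, 1: lambda v: v + 3}
        done = set()
        for t in sched:
            if t not in done:   # each thread performs its op once
                x = ops[t](x)
                done.add(t)
        return x

    assert infer_schedule(run, 1, 8, 2) == (1, 0)   # (1+3)*2 == 8
    ```

    Any schedule that yields the right output is acceptable — that is exactly the relaxation from value determinism to output determinism.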

    Works-in-Progress

    • RAMCloud: Scalable, High-Performance Storage Entirely in DRAM. New research project. Zero results. Motivated by wanting to build large scale systems with low latency. Data center style of web application separates the code from the data, in order to scale, but 4–5 orders of magnitude increase in latency. Want this kind of scaling with latency close to memory speeds (sub-microsecond). Basic architecture puts all data in DRAM. Scale using commodity servers. Reckon we can get 5–10us RPC latency end-to-end. Also have a story on durability and availability. Also want to support multiple applications.

      Q: effect of other memories? Whichever wins should work with RAMCloud

      Q: GMS system from UW 10 years ago? Not familiar with that [taken offline].
    • Transactional Caching of Application Data using Recent Snapshots. DB-driven website performance issues: use memcached. Add an in-memory DHT that is very lightweight, and stores application objects (not a DB). DBs provide transactional consistency, but these caches don’t do this. Goal is transactional consistency for accesses to the cache. Idea is to embrace staleness: run all read-only transactions on stale data. Avoids blocking and improves utilization. This is quite safe, since stale data is already everywhere. Application can control staleness. Add a TxCache library between memcached and the application. DHT values are timestamped, and have a validity interval.

      Q: paper at HotStorage from HP Labs in similar area?

      Q: how does the DB know the validity interval? Modified DB to track this.
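      The validity-interval idea can be sketched as follows. This is a minimal illustration, not the TxCache API: the class name, method names and half-open `[start, end)` interval convention are all assumptions made for the sketch.

      ```python
      class TxCache:
          # Each cached value carries a validity interval; a read-only
          # transaction pins a (possibly stale) snapshot timestamp and only
          # sees values valid at that instant, so its reads are mutually
          # consistent even though they may be stale.
          def __init__(self):
              self.store = {}   # key -> list of (start, end, value)

          def put(self, key, value, start, end):
              self.store.setdefault(key, []).append((start, end, value))

          def get(self, key, snapshot_ts):
              for start, end, value in self.store.get(key, []):
                  if start <= snapshot_ts < end:
                      return value
              return None  # miss: fetch from the DB, insert with its interval

      cache = TxCache()
      cache.put("user:1", "alice", start=10, end=20)
      cache.put("user:1", "alice2", start=20, end=30)
      assert cache.get("user:1", snapshot_ts=15) == "alice"
      assert cache.get("user:1", snapshot_ts=25) == "alice2"
      ```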
    • Chameleon: A self-managing, low cost file system. Targeting home user or small business, who doesn’t want to lose data. Doesn’t know anything about RAID. Cost-sensitive. Deployment scenario has 4 PCs connected by fast LAN, with a broadband connection to cloud storage. There are many ways to replicate, place and encode data. Ideally store data on at least one offline device to avoid vulnerability to viruses. A small, trusted “anti-availability kernel” enforces this requirement. Use linear programming to select and adapt storage configuration: the design space is now even more complicated. Tend towards the optimal solution.
    • Sloth: Let the Hardware Do the Work! Looked at embedded OSes used in the automotive industry. OSEK OS is the prevalent real-time embedded OS: event-triggered, priority-driven real-time system. Don’t want to implement a scheduler. SLOTH lets the interrupt subsystem do the scheduling and dispatching work. All threads are implemented as interrupt handlers and have interrupt priorities. Each thread needs an IRQ source. Priorities enable pre-emption. Can implement a bunch of synchronization this way also. System is simple, small (concise implementation and memory footprint) and fast (2–20x).

      Q: this looks like very simple scheduling… how would you deal with something more complicated like earliest deadline? Drawback is no blocking system calls, so can’t do everything.
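      The core trick — activating a task is just raising its IRQ, and the interrupt controller’s priority arbitration *is* the scheduler — can be simulated in a few lines. This is a toy model with invented names, not SLOTH code (which runs on real interrupt hardware with run-to-completion handlers).

      ```python
      import heapq

      class SlothSim:
          # Threads are modelled as interrupt handlers with priorities;
          # "scheduling" is just dispatching the highest-priority pending IRQ.
          def __init__(self):
              self.pending = []   # min-heap of (-priority, name)
              self.trace = []

          def raise_irq(self, prio, name):
              # Activating a task == raising its interrupt request.
              heapq.heappush(self.pending, (-prio, name))

          def dispatch_all(self):
              while self.pending:
                  _, name = heapq.heappop(self.pending)
                  self.trace.append(name)   # handler runs to completion
              return self.trace

      sim = SlothSim()
      sim.raise_irq(1, "low")
      sim.raise_irq(3, "high")
      sim.raise_irq(2, "mid")
      assert sim.dispatch_all() == ["high", "mid", "low"]
      ```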
    • The case for cooperative kernel threads. Kernels are multithreaded, and drivers have concurrency bugs, which, if they are in the kernel, is bad. Event-based devices drivers need to use continuations to preserve driver context across blocking operations. This becomes very complex, almost as bad as dealing with pre-emptive threads. Cooperative threads give the best of both worlds: atomic execution but allowing blocking. Research showed that drivers are mostly I/O bound, so cooperative threads are appropriate. Implementing this as a “cooperative domain” in the Linux kernel.

      Q: Linux does have cooperative thread scheduling available, so how does this interact with the work you are doing? Providing a framework for implementing drivers this way, much nicer.
    • Abstractions for Scalable Operating Systems on Manycore Architectures. Tessellation. Goal isn’t just to support heterogeneous hardware, but also provide predictable performance and guarantees for applications. Asymmetrically structured OS: some cores are dedicated as a management unit for keeping track of and scheduling applications. Eliminates the need for per-core runqueues, improves cache locality, decreases lock contention and limits kernel interference with applications. Applications interact with the kernel through remote, asynchronous system calls. Applications make explicit requests for cores, and OS guarantees that they will be gang-scheduled. OS just provides cores to the application, doesn’t need to know about threads. Applications have private memory ranges.

      Q: how do you balance the different demands for resources across applications? [Taken offline.]
    • System Support for Custom Speculation Policies. Applications run on some speculation infrastructure, which speeds things up. Want to separate policy from mechanism. Typically implemented transparently, which means that you have to be conservative, giving limited opportunities for speculation. Idea is to push the policy into the application. What could an application do that is different from the default? Might allow some output to be uncommitted. Or could commit equivalent-but-not-identical results. Process gets a “speculative fork” interface. Use cases: predicting user actions (predictive bash shell), authentication and user-level network services (when you have a predictable protocol).

      Q: how do you ensure errors in the speculative state don’t propagate to the main state? Need to be able to detect this, and could abort speculation in this case.
    • IDEA: Integrated Distributed Energy Awareness for Wireless Sensor Networks. A new “group diet” for wireless sensor networks. Problem of overloaded nodes. Existing solutions are “single node diets”, which are unsatisfactory because nodes have to collaborate. Local efforts cannot go far enough unless there is some cooperation. Aim to improve application fidelity by matching system load to availability. Shift load from overutilized to underutilized nodes, and shift load away from threatened nodes. Like a distributed OS for sensor networks. IDEA evaluates multiple solutions and distributes information to the nodes. Ideal goal is awareness of application constraints.

      Q: Quanto does some cross-node analysis? We’re building on these great ideas.
    • Flicker: Refresh Power Reduction in DRAMs by Critical Data Partitioning. Hardware is over-designed for correctness and reliability. Make it less reliable and tolerate errors in software. Smartphones are a motivation: power consumption is way too high, due to the use of DRAM for responsiveness. Battery drains even when a phone is idle. Goal is to improve power consumption here. If you increase the refresh cycle length, the power consumption drops, but the error rate increases. Currently use 64ms refresh, so could we increase this? Secret sauce is a partitioning into critical and non-critical (e.g. soft-state) data. Map critical data to short-refresh cycle DRAM, and non-critical data to long-cycle DRAM. Requires some hardware changes. Hypothesise that smartphones have a lot of non-critical data. Initial results show 25% drop in power consumption with only 1% loss in reliability.

      Q: [?] Looking at replication and checksumming in other work.
    • BFT for the skeptics. Industry deals with crash failures a lot, so do we need full BFT? We already use checksums, timeouts, sanity checks, etc. to translate faults to crash faults. How often do we get faults that require BFT to handle them? Looked at ZooKeeper and real-world failures. Yahoo!’s crawler uses ZooKeeper extensively. Saw 9 issues, due to misconfiguration (5, BFT wouldn’t help), application bugs (2) and ZooKeeper bugs (2, correlated, BFT wouldn’t help). Could BFT hurt? It has more things to configure, so misconfigurations could become worse. Need to show that BFT really solves a problem before industry will pick it up.

      Q: You showed that correctly implemented BFT couldn’t help with some failures? Failures were correlated, affecting all replicas.
    • Prophecy: Using History for High-Throughput Fault Tolerance. BFT has poor throughput. Need 3f+1 replicas to handle f faulty replicas. Can we improve this for read-mostly, internet workloads? Add a “sketcher” to each replica, which sketches requests and responses. Only one machine sends a full response, the others send sketches. Trades off consistency for performance, which gives delay-once linearizability. Faulty replicas can return slightly stale data. Internet services have unmodified clients and short-lived sessions. Look at performance of PBFT. Can improve by consolidating sketch tables on a trusted sketcher. We already trust middleboxes, so why not trust this too? Performance is much better than PBFT. Work not specific to BFT, and could apply to Paxos, quorums, etc. while getting similar benefits.
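      The sketching idea can be illustrated with a toy client check: one replica returns the full response, the rest return digests, and the client accepts if enough replies agree. The function names, the use of SHA-256, and the `f+1` acceptance rule are assumptions made for this sketch, not details of the Prophecy protocol.

      ```python
      import hashlib

      def sketch(resp):
          # A sketch is just a digest of the response.
          return hashlib.sha256(resp.encode()).hexdigest()

      def client_check(full_response, sketches, f):
          # Accept when the full reply plus at least f matching sketches
          # agree, i.e. f+1 replies in total.
          matching = sum(1 for s in sketches if s == sketch(full_response))
          return (1 + matching) >= f + 1

      full = "page-contents"
      assert client_check(full, [sketch("page-contents"), sketch("stale")], f=1)
      ```

      Bandwidth drops because only one full response crosses the network per request; the trade-off is that a faulty replica can briefly serve stale data, hence the weaker delay-once linearizability guarantee.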
    • Securing Hardware Platforms Against Malicious Circuits Through Static Analysis. Make assumptions when building systems. Best way to break a system is to break its assumptions. People assume hardware is correct. What if we can’t make this assumption? Hardware is complex, expensive, static and the base of the system. Do “dead circuit identification”: highlight all potentially malicious circuits automatically. Attacker is motivated to avoid impacting functionality during testing (or else they’d be caught). DCI gets an assertion that says which paths are effectively short circuits. Use these assertions in a new graph algorithm to identify the possibly-malicious, dead circuits. No false negatives, but 30% over-identification. Empirical evidence shows a tight correlation between code coverage and [?]

      Q: is this primarily at design-time on HDL? Yes, this is one of our assumptions.

      Q: what about redundant circuits for fault tolerance? This is used at design time where you can make calls about this.
    • Enhancing Datacenter Network Security and Scalability with Trusted End Host Monitors. Cloud workload is dynamic and hostile. Key selling point is that multiple tenants can share common infrastructure. Need a new approach to security, because exploits are more likely, and the cloud resources can be used to perform exploits themselves. Cloud datacenters can help: they are centrally-controlled so monitoring becomes easier. The software and hardware are homogeneous. Plus a clean-slate approach is possible. Use the hypervisor as a trusted component. Hypervisor can send alarms to central controller when an attack is detected. Built a prototype from Hyper-V and a trusted Intel NIC.

      Q: if you trust the VM, why do you need to trust the NIC? This gives some useful properties, and the NIC could do some of this filtering for you.

      Q: HotOS paper on this exact topic?

      Q: [?] By “hypervisor” meant the entire virtualization stack, because we didn’t want to make the hypervisor itself any better.
    • Architectural Attacks and their Mitigation by Binary Transformation. What happens if someone tries to attack you from a VM on the same machine in the cloud? There is cross-talk through shared architectural channels. Example is contention for the CPU data cache. This leaks information about the memory access pattern, which could for example be used to leak AES keys. Have shown that EC2 has similar vulnerabilities: placement vulnerabilities, cloud cartography and cross-VM exfiltration are all possible. Approach is to use dynamic binary rewriting to transform x86 instructions so that the architectural effects are mitigated. Can degrade observation of timing, or inject noise and delays to hide leakage signal. Methodology is to make things secure by default, then come back to improve performance.

      Q: information leakage necessarily arises from statistical multiplexing, and we need statistical multiplexing to get good performance, so how can you address that? Assert that it should be possible.

      Q: how well would existing techniques protect against these attacks? Not aware of techniques that could do this.
    • Execution Synthesis. Say you have a bug in Linux on a remote machine. All you have to work with is a low-detail bug report. Reproducing it is time-consuming. Want a direction finding system from your system to a particular bug. Google Maps doesn’t do this at present…. So your bug report is a stack trace and some register contents. Do VM recording and replay. Don’t expect you to record behaviour that leads to the bug, since then you wouldn’t have the problem in the first place. Don’t care so much about performance, since you don’t run this in production. Find a state in the recording that is close to the bug report, then explore paths iteratively to get closer to the bug. Then you get a sequence of inputs that lead to reproducing the bug. Need a distance function, way of choosing inputs, and good information about what to include in a bug report.

      Q: could you go backwards from a failure state and execute in reverse? We don’t have the entire failure state to begin with.
    • Edge Mashups for Clinical Collaboration. Health industry is going from paper-based to electronic records. Want to empower non-programmers to build applications for real-time collaboration, but need to respect things like HIPAA for logging and data retention. Example use-cases include expert-assisted surgery (call an expert for advice when complications arise, in real-time), and micro-clinics where nurses see the patients, but doctors write prescriptions remotely. Envision a graphical tool that pulls in photographic and chart data, which is synchronized between all participants. State serialized to XML which can be distributed to all the clients. Could be client/server or peer-to-peer. Need logging for accountability. “Break-glass” access control: anyone gets access but they are held accountable after-the-fact. Need low latency so doctors don’t feel that they are wasting time. Might migrate this to the cloud for scaling.

    Kernels

    seL4: Formal Verification of an OS Kernel

    • Formally proved the functional correctness of 8700 lines of C. No bugs.
    • Want to build high-assurance systems: small kernels which reduce the trusted computing base. Want strong security properties. Kernel has to be correct: if it falls over, so does the whole system.
    • seL4 has capabilities.
    • Proof is that specification and code are equivalent. Need a formal semantics for every system call. Use Isabelle as a theorem prover to bridge the gap between spec and code. But what about assumptions (in the code) and expectations (of the spec)?
    • Assume correct: compiler and linker, 600 lines of assembly code, hardware, cache, TLB management and 1200 lines of boot code.
    • Given these assumptions, we get some nice properties: no null dereferences, no buffer overflows, no code injection, no memory leaks, no div-by-zero, no undefined shift, no undefined execution, and no infinite loops or recursion. Does not imply security, lack of bugs from expectation to the physical world, or absence of covert channels.
    • Proof architecture admits proofs of higher-level properties (e.g. access control).
    • Design is written in Haskell, which can be used to generate Isabelle code automatically.
    • System model has three states: user, kernel and idle. Events are syscall, exception, IRQ and VM fault.
    • Call graph is messy! A microkernel takes all of the messiness and packs it into a very small space.
    • Formal methods practitioners (fans of abstraction) versus kernel developers (exterminate OS abstractions). Different view of the world. Haskell prototype unified these two things: OS people got to implement an OS, while the formal methods people got well-defined semantics. The C code is manually-written and hand-optimized, but based on the Haskell prototype.
    • Aim to reduce complexity. Have to deal with virtual memory in the kernel. But we can put drivers outside the kernel. Concurrency is complex, so use an event-based kernel and limit pre-emption to a few well-chosen points in long-running operations. The C code is derived from the functional representation. Need to support a subset of C: everything from the standard, minus goto, switch fall-through, & on stack variables, side-effects in expressions, function pointers and unions.
    • Found 16 bugs during testing, and 460 bugs during verification (roughly equally distributed between the C code, the design and the spec). Took 25 person-years in total: $6 million (compared to $87 million for EAL6).
    • One of the largest proofs ever done in a theorem prover: 200kloc handwritten, machine-checked proof. Proved 10kloc of OS code.
    • Q: can you comment on what happens when you have to evolve the code? What effort is required? It depends. An optimization on the code level that doesn’t change the semantics might need a few days to re-prove. A new feature that adds new components (as in the paper) can be proved without touching the rest of the kernel, as long as you don’t screw around with existing data structures.
    • Q: does your work solve the stated problem? The assumptions are significant, and you’ve just done a very significant type-check on the code? Is it really possible to solve the originally stated problem? This is just the first step. You can reduce the assumptions with more work. It isn’t the only technique that you should use, so if you deploy it in an Airbus you should also do testing and verification. [Taken offline.]
    • Q: how can you verify something high-level like having the address spaces of two processes being isolated? We do this. You can still use seL4 in a stupid way, but you can use our security model with capabilities and reason about those in the spec. You don’t have to go down to the code.
    • Q: did you see a correlation between the logical errors in the specification and the implementation? Not really. The C bugs were fairly “stupid”: typos, copy-and-paste, etc.
    • Q: wouldn’t it be better to prove temporal properties? Are they expressible? They are expressible. We look at functional correctness, not temporal properties. But you need functional correctness before you can reason about temporal properties.
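    The “spec and code are equivalent” claim is a refinement proof: an abstraction function maps each concrete state to an abstract one, and every implementation step must commute with the corresponding spec step. A toy illustration of that commuting-square shape (seL4 proves this for all states in Isabelle; this sketch merely checks a few cases, and the set-vs-sorted-list example is invented):

    ```python
    def spec_add(s, x):
        # Abstract spec: state is a mathematical set.
        return s | {x}

    def impl_add(lst, x):
        # "Implementation": state is a sorted, duplicate-free list.
        if x not in lst:
            lst = sorted(lst + [x])
        return lst

    def abstract(lst):
        # Abstraction function: concrete state -> abstract state.
        return set(lst)

    # Check the commuting square on a few states: stepping the
    # implementation then abstracting equals abstracting then
    # stepping the spec.
    for lst in ([], [1], [1, 2]):
        for x in range(3):
            assert spec_add(abstract(lst), x) == abstract(impl_add(lst, x))
    ```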

    Helios: Heterogeneous Multiprocessing with Satellite Kernels

    • Systems are getting more complicated, from UP to SMP to CMP to NUMA. This is still homogeneous. But hardware is no longer homogeneous: programmable NICs, GPGPUs, etc. Operating systems ignore this heterogeneity: the other devices have different instruction sets and often no cache coherence. This means that the standard OS abstractions are missing, and programming models are fragmented. Can we bring this back into the operating system?
    • Helios is an OS for distributed systems in the small. Use four techniques to manage heterogeneity, simplify app development and provide a single programming model for heterogeneous systems.
    • Result is that it is possible to offload processes to these heterogeneous devices with no code changes. Also improves performance on NUMA architectures.
    • Satellite kernels. Want to make use of an I/O device, but the driver interface is a poor interface for applications that want to use programmable devices. It becomes hard to perform tasks like debugging, I/O and IPC with these devices. The driver doubles as an OS, within the OS itself. A satellite kernel runs on the device itself: fundamentally a microkernel. Also run separate satellite kernels on each NUMA node. Local IPC and remote IPC for communication between satellite kernels.
    • Applications register as services in a namespace. The namespace connects IPC channels.
    • Application placement is constrained by the use of heterogeneous ISAs, an expectation of fast message passing and platform preference. Applications are allowed to specify affinity in their metadata: a hint for where the process should run. Easy for a dev, admin or user to edit affinity. Platform affinity is processed first, and this guarantees certain performance characteristics. Can also contra-locate, e.g. if you don’t want the interference of an anti-virus program running on the same core. Algorithm attempts to balance simplicity with optimality.
    • Applications are first compiled down to MSIL, and then that is compiled down to the appropriate ISA. Can encapsulate multiple versions of a method for different ISA in the MSIL (e.g. fast vector math).
    • Implemented on Singularity, using an XScale programmable I/O card (2GHz ARM processor with 256MB of DRAM). Just need a timer, an interrupt controller and the ability to handle exceptions to implement a satellite kernel for a new device. No need for an MMU (thanks, Singularity!). GPUs are adding timers (Larrabee). Only supports two platforms and a limited set of applications.
    • Evaluated several applications (network stack, FAT32 FS, mail server, web server, etc.) and how easy it was to run them on satellite kernels. Almost no code had to be changed (only in the TCP test harness). One line of metadata had to be changed in almost every case (zero in the other).
    • Offloaded an entire networking stack to the XScale, and showed that the end-to-end performance of PNG compression-and-serving is improved when offloading to the XScale.
    • Considered an email server built on Singularity, using a NUMA box. Emails handled per second improved by 39%. Turned out that the instruction throughput was much higher due to better cache utilization.
    • Q: when you transfer data between two NUMA domains, couldn’t the IPC fail due to memory allocation failures? Singularity is statically verified, using contracts, so we don’t have to worry about that.
    • Q: is this not just 20-year-old distributed microkernel research rehashed? We pay homage to that in the paper. A simple heuristic is sufficient to decide where to run some process, and you need to have process migration anyway, so why not just use that when you get a problem? Process migration is pretty difficult, in the heterogeneous case. Abstraction turned out to be pretty brittle in commodity OSs.
    • Q: is it reasonable to rely on protection from a large runtime? The system isn’t dependent on type-safety.
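    The placement heuristic described above — platform affinity as a hard constraint processed first, positive affinity hints preferring kernels that host the named services — can be sketched as below. All names and the dictionary shapes are invented for the sketch; Helios’s metadata format and algorithm details differ.

    ```python
    def place(affinities, kernels):
        # Platform (ISA) affinity is a hard constraint, processed first.
        eligible = [k for k in kernels if k["isa"] == affinities["platform"]]

        # Positive affinity hints prefer kernels hosting named services
        # (e.g. to get fast local IPC with a netstack).
        def score(k):
            return sum(1 for svc in affinities.get("prefer", [])
                       if svc in k["services"])

        return max(eligible, key=score, default=None)

    kernels = [
        {"name": "numa0", "isa": "x86", "services": ["netstack"]},
        {"name": "nic0",  "isa": "arm", "services": ["netstack"]},
        {"name": "numa1", "isa": "x86", "services": []},
    ]
    best = place({"platform": "x86", "prefer": ["netstack"]}, kernels)
    assert best["name"] == "numa0"
    ```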

    Surviving Sensor Network Software Faults

    • Sensors have to operate unattended for months or even years. It’s hard to debug failures considering that the input is unknown. There is no debugger.
    • Safe TinyOS introduces memory safety to sensor nodes. But what do you do when you get a safety violation? In the lab, spit out an error message; in the wild, reboot the entire node (losing valuable soft state and application data).
    • Neutron is a new version of TinyOS. Reduces the cost of a violation by 95–99%. It has near-zero CPU overhead during execution. Runs on a 16-bit microcontroller.
    • A TinyOS program is a graph of software components: statically instantiated code and state. Connections are typed by interface and there is minimal state sharing. Now have preemptive multithreading with a non-blocking, single-threaded kernel. Aim is to separate the program into independent units for recovery. Infer the boundaries of these at compile time. The kernel is a single unit.
    • Units can be rebooted independently. A wrinkle involves cancelling system calls, so you need to block if a syscall is still pending. Blocks of allocated memory are tagged with the owning recovery unit, which enables these to be freed on reboot (by walking the heap). Can even reboot the kernel: just cancel all pending system calls (return ERETRY), and just have to maintain thread memory structures. Applications will continue after the kernel reboots.
    • New idea of “precious state”: a group of precious variables will persist across a reboot. Annotate variables in the source code. Some restrictions on precious pointers. Precious variables must be accessed in atomic blocks. Variables are persisted on the stack across reboot: the set of precious state is usually smaller than the worst-case stack size.
    • Evaluated the cost of a kernel violation in Neutron, compared to safe TinyOS. Looked at three libraries, running on a 55-node testbed. Show the effect of a reboot on the CTP workload. Neutron gets close to the non-reboot case. Also look at the effect on time synchronization in FTSP, showing what proportion of the nodes have unsynchronized time. Again, Neutron gets close to the non-reboot case. Looked at fault isolation. CTP and FTSP data persist across reboots.
    • Main cost is in ROM bytes: 1–5kB of added code, roughly constant.
    • Measured cost of a reboot in milliseconds. A kernel safety violation will result in a 10–20ms outage.
    • Much lighter weight than microreboots (and lets you reboot a kernel, not a J2EE application).
    • It’s easy to change the TinyOS toolchain, but changing the programming model isn’t, due to the amount of deployed code.
    • Q: how can I reason about a node that has survived a fault (a rebooted node is in a known-good state)? Do you have evidence that this is going to help us? [Showed emails from the questioner.] It is hard to diagnose these faults. Different approaches are possible.
    • Q: what do you think of the alternatives, such as using an MMU, verification or simulation? Well we could add an MMU, but we don’t have it at present. These new developments might make the dependability better. Verification struggles with the huge input space.
    • Q: how do you ensure that the precious state is not corrupted? Using safe TinyOS, so we won’t see memory access violations due to memory safety. Taint in the paper is about inconsistent state, not corruption.
    • Q: what are your criteria for what state should be marked as precious? Don’t have a strong set of guidelines for this, but have done it by inspection so far.
    • Q: what is your fault detection system, and what is its coverage? How do you know when you have a fault? The deputy compiler gets you to annotate code with things like buffer lengths. Can infer faulty behaviour from these. Annotating interfaces tends to be sufficient.
    • Q: this is a very valid approach?
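    The precious-state mechanism — reboot a recovery unit, reinitialise ordinary state, but carry annotated variables across — can be sketched as follows. This toy uses a Python class and dictionaries; Neutron actually does this with compile-time annotations on a 16-bit microcontroller, persisting precious variables via the stack during reboot.

    ```python
    class RecoveryUnit:
        # On a safety violation the unit reboots: ordinary state is
        # reinitialised, variables marked "precious" survive.
        def __init__(self, init_state, precious):
            self.init_state = dict(init_state)
            self.precious = set(precious)
            self.state = dict(init_state)

        def reboot(self):
            saved = {k: v for k, v in self.state.items()
                     if k in self.precious}
            self.state = dict(self.init_state)  # everything else resets
            self.state.update(saved)            # precious state persists

    unit = RecoveryUnit({"routing_table": {}, "scratch": 0},
                        precious=["routing_table"])
    unit.state["routing_table"] = {"parent": 7}   # valuable soft state
    unit.state["scratch"] = 42
    unit.reboot()
    assert unit.state == {"routing_table": {"parent": 7}, "scratch": 0}
    ```

    This is why a Neutron reboot costs so much less than a whole-node reboot: expensive-to-rebuild soft state such as routing tables survives the fault.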

    SOSP 2009: Day 1

    Monday, October 12th, 2009

    Keynote: Barbara Liskov

    • Inventing abstract data types, CLU, Type hierarchy, What next?
    • More of a Programming Methodology talk than a systems talk.
    • Started out in systems with the Venus machine on the Interdata 3. Presented it and its operating system at SOSP 1971.
    • Back in the early 1970’s, people were concerned about the software crisis. (cf. Dijkstra’s 1972 Turing Award lecture, The Humble Programmer.) As machines got cheaper, bigger and faster, software started to matter a lot. A tendency to underprovision the hardware, creating challenges for the software developers.
    • In late 1960’s the field of Programming Methodology began. Started to think about program design and structure: in order to have functionality, maintainability, etc.
    • First paper: Go To Statement Considered Harmful (Dijkstra, 1968). A revolutionary letter to CACM. Use static program text to reason about dynamic program behaviour: it would be useful if these were as close as possible, but GOTOs prevent that. Example of debugging: understand how you got to a particular point in the text. GOTOs make this very difficult. Provoked a huge amount of resistance: can’t program without it. Pointed out limitations of the programming languages: branch into a label table is just a case statement, but they didn’t have case statements.
    • Second paper: Program Development by Stepwise Refinement (Wirth, 1971). European school of software development: a top-down approach, starting with many abstract parts which are not initially implemented. Example was the 8-queens problem.
    • Third paper: Information Distribution Aspects of Design Methodology (Parnas, 1971). A new interest in modularity, and information hiding. “The connections between modules are the assumptions which the modules make about each other.” Hedging about what modules were at the time.
    • Fourth paper: On the Criteria to be used in Decomposing Systems into Modules (Parnas, 1972). How to actually break a system into modules: a hint of data abstraction but the full idea wasn’t there yet.
    • Enter Barbara Liskov: A Design Methodology for Reliable Software Systems (1972). Entire system should be partitioned: no global state and each partition owns some part of the state. Partition exposes operations and only way of interacting with the state would be by calling operations on the modules.
    • Wanted to apply the idea of partitions for building programs. It was unclear how to combine modules to make whole programs. Idea of partitions came out of doing work on operating systems: ur-partitions were the supervisor and user modes. This idea is carried a lot further by ADTs.
    • Idea was to connect partitions to data types. A stroke of inspiration. Ideas often arrive in the middle of the night, or when arriving at work with a fresh mind.
    • March 1973 SIGPLAN/SIGOPS interface meeting on programming methodology was the debut of the idea. Began working with Steve Zilles. At the time, they were knowledgeable about all the languages in existence (FORTRAN, LISP, ALGOL, PL/I, COBOL…). Started to do language design.
    • People were interested in extensible languages as early as 1967 (Schuman and Jourrand, Balzer). How can we help people build dialects of languages that make them easier to use? Looked at syntactic and semantic extensibility. Syntactic extensions written in BNF and added to the language using some kind of preprocessor. Fortunately this died a death… People were much more worried about writing programs than reading them. Didn’t realise that programs are read more often than they are written. Balzer imagined data types as being collections which allowed four defined operations (add, remove, etc.) with operator overloading.
    • Hierarchical Program Structures (Dahl and Hoare, 1972): Simula 1967. Didn’t have encapsulation but did have inheritance to make simulation easier. Precursor to Smalltalk.
    • Protection in Programming Languages (Morris, 1973). Recognised the importance of locality in module comprehension (allows local reasoning). Proposed sealed objects using encryption as an OS mechanism to guarantee locality.
    • Global Variable Considered Harmful (Wulf and Shaw, 1973). In the 1960’s a stream of languages made block structure the big thing. Give locality within blocks, but you can always access things on the outside (i.e. global variables). Made analogy with Dijkstra’s paper that global variables are implicit connections between states of the program, which makes reasoning about it more difficult.
    • Programming with Abstract Data Types (Liskov and Zilles, 1974).
    • Said what ADTs were: a set of operations (whatever ones made sense, not a fixed set), encapsulation was important, and the operations were the only way to access object state.
    • ADTs were “clusters” with encapsulation. Proposed polymorphism, static type checking (but weren’t sure if this was possible due to polymorphism) and exception handling.
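A minimal sketch of the ADT idea in Python (illustrative names, not CLU syntax): the cluster defines whatever operations make sense, the representation is hidden, and the operations are the only way to touch the object’s state.

```python
# Hypothetical CLU-style "cluster" sketched as a Python class: a small
# integer-set ADT whose representation (_rep) is reachable only via
# its operations.
class IntSet:
    def __init__(self):
        self._rep = []          # hidden representation

    def insert(self, x: int) -> None:
        if x not in self._rep:
            self._rep.append(x)

    def remove(self, x: int) -> None:
        if x in self._rep:
            self._rep.remove(x)

    def member(self, x: int) -> bool:
        return x in self._rep

    def size(self) -> int:
        return len(self._rep)

s = IntSet()
s.insert(3)
s.insert(3)                     # duplicate insert is a no-op
assert s.member(3) and s.size() == 1
s.remove(3)
assert not s.member(3)
```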
    • Why was a new programming language necessary? Needed to communicate the ideas to programmers. Enabled the testing of whether ADTs work in practice. Also made it possible to get a precise definition (tendency to think of the compiler as the language definition…). And to validate whether it was possible to achieve reasonable performance.
    • Goals of language design: ease of use, simplicity, expressive power and performance. First two played off against second two.
    • Also wanted minimality (limiting the language to what we could get by with), uniformity (keep abstract types similar to the built-ins) and safety (find errors as soon as possible: compile time?).
    • Assumptions/design decisions. Wanted language to be heap-based with garbage collection (based on experience with LISP). Program was a collection of procedures rather than a linear piece of code (ALGOL style). People were scared of pointers, but used them to simplify the design. No block structure! Separate compilation of individual modules. Also had static type checking which was meant to speed up finding errors. No concurrency (cut out what wasn’t necessary to simplify a big project). No GOTOs. No inheritance.
    • CLU clusters. A cluster had a header with a list of the operations that it defined.  Thought of operations as belonging to the type rather than the object (passing an object as an argument). Defined the representation of the object internally, and the implementation of the operations. Used “cvt” to define unsealing on operation entry, and sealing on exit (but compile-time checked).
    • Polymorphism: set[T: type], and had a where clause for the type parameter to specify, e.g., that T has an equals: T -> T -> bool function.
    • Exception handling: Issues and a Proposed Notation (Goodenough, 1975). People didn’t know the right model: procedure should terminate (now the status quo), or throw exception to a higher level that would allow it to resume. PL/I had both. How should handlers be specified? At the call-site, or out of the main-line for all invocations to that function. CLU used termination, and specifies the exceptions that a method may call in the header. Handled at the call-site.
    • How to handle exceptions? Handle it, propagate it up the call stack, or signal failure (the exception shouldn’t happen…). Can never be certain that these last exceptions won’t happen, but don’t want to write code to deal with this, so CLU introduced the “failure” exception. Really want accurate interfaces (know exactly what exceptions a method might call) and no useless code.
    • Iterators. For all x in C do S. Prior solutions were to destroy the collection (repeated removal), or to complicate the abstraction (turn it into an ordered set, making it indexable). In summer 1975-ish, the MIT group visited CMU, where the Alphard group were working on “Generators”. Thought these were a bit crusty, so invented iterators, which are like procedures that you call but which yield instead of returning. Can nest iterators and recursively call them. Implemented by passing the loop body as an argument to the iterator. But this limited the expressive power.
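Python generators are a close modern analogue of CLU iterators: a sketch showing elements yielded one at a time without exposing or destroying the collection, and iterators nesting as described above.

```python
# CLU-style iterator: yields elements one at a time, leaving the
# collection untouched.
def elements(collection):
    for x in collection:
        yield x

# Iterators can be nested and composed, as in CLU.
def pairs(c1, c2):
    for a in elements(c1):
        for b in elements(c2):
            yield (a, b)

assert list(elements([1, 2, 3])) == [1, 2, 3]
assert list(pairs([1, 2], "ab")) == [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```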
    • In 1987, gave a keynote at OOPSLA, but had been ignoring object-oriented languages and inheritance in particular. Took the opportunity to get into the literature. Much of it was very bad/confused. Inheritance being used for two different things: implementation simplification and type hierarchy. The two were not compatible.
    • Implementation inheritance violated encapsulation! Subclasses depend on the implementation of the superclass, making it hard to change the superclass. CLU could do implementation sharing without inheritance.
    • Type hierarchy is much more interesting, but wasn’t well understood. How were stacks and queues related?
    • Led to the Liskov Substitution Principle: Objects of subtypes should behave like those of supertypes if used via supertypes methods. (Data abstraction and hierarchy (Liskov, 1988).) Didn’t realise that this was a big idea!
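A toy illustration of the substitution principle (hypothetical classes, not from the talk): `Counter`’s contract says `increment()` grows the value by one; `BoundedCounter` silently stops at a bound, so code written against the supertype observably misbehaves when the subtype is substituted.

```python
# Counter's contract: increment() always grows value() by exactly 1.
class Counter:
    def __init__(self):
        self._n = 0
    def increment(self):
        self._n += 1
    def value(self):
        return self._n

# A subtype in syntax only: it weakens the contract, violating LSP.
class BoundedCounter(Counter):
    def increment(self):
        if self._n < 2:          # silently stops incrementing
            self._n += 1

def bump_three_times(c: Counter) -> int:
    for _ in range(3):
        c.increment()
    return c.value()

assert bump_three_times(Counter()) == 3
assert bump_three_times(BoundedCounter()) == 2   # observable difference
```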
    • What next? The world has changed from one where people had no idea about modularity, to one where modularity is based on abstraction.
    • Modern programming languages (Java and C#) are pretty good. Procedures are missing: they are important, and the loss of them as a first-class thing makes the program less simple. Closures are missing, as are iterators. Exception handling is important but failure handling is not done well. Also need built-in types as a basis: extensibility might be going too far. Can we do better than “serialization” (horrible overloading of the term): can’t it be done by garbage collection?
    • The state of programming is pretty lousy. The COBOL programmers of yesterday are now writing web services and browsers. The era of globals has returned. There’s little encapsulation and protection, and yet these are handling confidential information. Problem might be persistent storage violating abstraction: perhaps we would be better with an object store that provides automatic translation and type preservation.
    • Programming language research. Is now the time for some new abstraction mechanisms? Probably not just specification languages. Concurrency and multi-core: modularity is very helpful here, but there is still a lot of work to do. Should distributed systems be programmed in languages that include distribution as a first-class concept?
    • System research has done well. Abstractions like DHTs, map-reduce, client/server, distributed information flow. These have been useful for making progress.
    • Concerned that we trade off for performance (versus simplicity and semantics). Led astray by the end-to-end argument, but not sure that the argument is valid when the end is the user. We know that we’ll never be 100% reliable, but failures should be catastrophes, not laziness (or an optimization).
    • The systems community has thrived because it has been so open: embracing new concepts. Worried about the semantics of the coming Internet Computer: what does it mean to run a program in the cloud (should the PL embrace distribution)? Massive parallelism is also coming as a problem: not sure these problems necessarily follow from the end of Moore’s Law. But we will need to manage such systems as distributed systems, so there is a lot to learn from that world.

    Scalability

    FAWN: A Fast Array of Wimpy Nodes

    • Awarded best paper.
    • Energy is becoming an important consideration in computing: Google uses $40 million of energy a week. Can we reduce energy costs tenfold, without increasing capital costs?
    • Idea is to improve energy efficiency by using an array of well-balanced low-power systems. For computationally intensive, data-intensive systems.
    • Goal is to reduce peak power: 20% energy loss on power and cooling is considered “good”. 100% utilization => 1000W but 20% utilization => 750W. FAWN wants to get 100% utilization down to 100W, and less utilization to proportionally less power.
    • As CPU cycles have improved much faster than disk seeks, we have a gulf in wasted resources. Could rebalance by using very fast disks. Could use slower CPUs and moderately faster storage. Or could use slow CPUs and today’s standard disks. These are not equivalently efficient.
    • Fastest processors have superlinear power usage (Xeon7350, etc.) due to things like branch prediction, caching, etc. which are not so useful for data-intensive, I/O intensive workloads. Custom ARM nodes etc. are slow but power usage is dominated by fixed power cost. In between we have things like the Atom and XScale: FAWN targets these.
    • Application is data-intensive key-value systems. A critical infrastructure service with SLAs, random-access and read-write. Goal was to improve the efficiency in Queries/Joule or Queries/Second/Watt.
    • Prototype using Alix3c2 nodes with flash storage, 500MHz CPU, 256MB DRAM and 4GB flash.
    • Challenges: need efficient and fast failover, wimpy CPUs, limited DRAM and flash that is poor at small random writes.
    • Architecture: single front-end connected by switch fabric to many back-end. Front-end manages backends, acts as a gateway and routes requests. Use consistent hashing. Interesting design decisions are at the backends (focus of the talk).
    • High performance KV store: must map keys to values on the backend. Use an in-memory hashtable to find the location of a value on disk. 160-bit key into hashtable which contains a key fragment, valid bit and offset into the data region. On flash, store the entire key, length and the data itself. Limited DRAM means you only store a fragment of the key in memory (to get 12-byte hashtable entry): risk of collisions and multiple requests. Get a low probability of multiple flash reads.
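A rough sketch of that indexing scheme (sizes and names are illustrative, not FAWN’s exact layout): hash the key to a bucket, keep only a key fragment plus a flash offset in DRAM, and let the full key stored on flash resolve rare fragment collisions.

```python
# Hypothetical model of FAWN's in-memory index: store (key fragment,
# flash offset) per entry; a fragment collision just costs one extra
# flash read to check the full key.
import hashlib

class FawnIndex:
    def __init__(self, nbuckets=1024):
        self.buckets = [[] for _ in range(nbuckets)]

    @staticmethod
    def _digest(key: bytes) -> int:
        return int.from_bytes(hashlib.sha1(key).digest(), "big")

    def put(self, key: bytes, offset: int):
        h = self._digest(key)
        frag = h & 0xFFFF                  # small fragment kept in DRAM
        self.buckets[h % len(self.buckets)].append((frag, offset))

    def candidates(self, key: bytes):
        h = self._digest(key)
        frag = h & 0xFFFF
        # Usually one candidate; each one means a flash read to confirm.
        return [off for f, off in self.buckets[h % len(self.buckets)]
                if f == frag]

idx = FawnIndex()
idx.put(b"alice", offset=0)
idx.put(b"bob", offset=64)
assert idx.candidates(b"alice") == [0]
assert idx.candidates(b"bob") == [64]
```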
    • Small random writes problem avoided by a log-structured datastore. This also helps with node addition, which requires transfer of key ranges: the log structure makes transferring a key range a simple streaming transfer, i.e. a simple iteration over the data store. But the SLA means the source of the data must continue serving requests while doing this streaming: the log structure lets you minimize locking. Also need a compact operation on the data store, which likewise runs in the background.
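A minimal sketch of the append-only log idea (illustrative, not FAWN’s on-flash format): writes always append, so small random writes become sequential, reads take the latest entry, and compaction rewrites only live entries.

```python
# Append-only log store: put() never overwrites in place; get() scans
# from the tail so the latest entry wins; compact() keeps only live
# entries.
class LogStore:
    def __init__(self):
        self.log = []                       # (key, value) in append order

    def put(self, key, value):
        self.log.append((key, value))       # sequential write only

    def get(self, key):
        for k, v in reversed(self.log):     # latest entry wins
            if k == key:
                return v
        return None

    def compact(self):
        live, seen = [], set()
        for k, v in reversed(self.log):
            if k not in seen:
                seen.add(k)
                live.append((k, v))
        self.log = list(reversed(live))

s = LogStore()
s.put("a", 1); s.put("a", 2); s.put("b", 3)
assert s.get("a") == 2
s.compact()
assert len(s.log) == 2 and s.get("a") == 2 and s.get("b") == 3
```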
    • Replication and strong consistency is covered in the paper.
    • Evaluated by considering the efficiency (energy) of KV lookup. Look at the impact of background operations on query throughput. And then TCO analysis for random read workloads.
    • Measured the efficiency of 256-byte KV lookups on various systems, looking at the power outlet power usage. Alix3c2/Sandisk got 346 QPS/Watt, whereas a standard desktop machine with SSD got 51.7 and two hard-disk based systems were 2.3 (MacBook Pro) and 1.96 (desktop).
    • Peak capacity is 1600 QPS. During compaction ±800. During split ±1200, and Merge ±700.
    • At 30% of peak query load, these background operations have almost no discernible impact (±500 QPS in all cases).
    • TCO = Capital Cost + Power Cost ($0.10/kWh). When should you use FAWN?
    • Traditional system with 200W usage (5 x 2TB disks / 160GB PCI-e Flash SSD / 64GB FBDIMM per node), at $2000–8000 per node, versus FAWN with 10W usage (2TB disk / 64GB SATA Flash / DRAM-based). Where storage capacity is dominant, use FAWN + Disk. When you care about query throughput, use FAWN + DRAM. FAWN + Flash covers much of the remainder of the space, but Traditional + DRAM also covers some of it.
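The TCO formula above is simple enough to sketch; the node prices and wattages below are illustrative round numbers, not the paper’s table.

```python
# TCO = capital cost + power cost at $0.10/kWh, as stated in the talk.
def tco(capital_usd, watts, years=3, usd_per_kwh=0.10):
    hours = years * 365 * 24
    return capital_usd + (watts / 1000.0) * hours * usd_per_kwh

traditional = tco(capital_usd=2000, watts=200)   # illustrative node
fawn_node   = tco(capital_usd=500, watts=10)     # illustrative node
assert fawn_node < traditional                   # power dominates at 200W
```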
    • To be energy efficient, require some effort on the part of the system developer.
    • Q: impact of networking on performance and cost models? Network is an important component, and we really want things like all-to-all communication, which often needs a high-powered core. But proposals for using low-powered commodity switches at scale mean that is possible to get down to a fixed power overhead per core.
    • Q: off-the-shelf memcached could do better? [Inaudible.] It takes a lot of effort to get very high throughput.
    • Q: what happens when you include latency in the model of TCO? Interactive services that serialize might shift the model? Flash devices gives pretty good common-case latency. What if you have computational load in the query? You have to trade off energy efficiency for longer periods of computation. Get high 99.9% latency during maintenance but it’s still okay.
    • Q: what happens when a frontend thinks one node is down, etc.? Have optional replication to provide constant access to data, and a background replication process.

    RouteBricks: Exploiting Parallelism to Scale Software Routers

    • Awarded best paper.
    • Want to build routers that are fast and programmable. Why do we want them to be programmable? Programmable routers enable ISPs to offer new services (intrusion detection, application acceleration, etc.). They make network monitoring simpler (measuring link latency, tracking down traffic). Finally, they make it possible to offer new protocols, such as IP traceback, trajectory sampling, etc. They enable flexible, extensible networks.
    • But today, speed versus programmability is a trade off. Fast routers are implemented in hardware (Tbps throughput), but offer no programmability. But programmable software routers get throughput < 10Gbps using general purpose CPUs.
    • RouteBricks uses off-the-shelf PCs, a familiar programming environment and large-volume manufacturing to reduce costs. How do we get to Tbps routers with these ingredients?
    • A router is just packet processing + switching. We have N linecards that handle each input or output port, and each must be able to handle packets at the per-port rate, R. Also have a switch fabric, which must switch at N*R.
    • RouteBricks uses one server instead of each line card and a commodity interconnect.
    • Require internal link rates < R and per-server processing rate = c*R (c is a small, reasonable constant). Per-server fanout should be constant.
    • What interconnect satisfies these requirements? Could naively have a crossbar switch, with N^2 line-rate links. Could instead use Valiant Load Balancing, which introduces a third stage between input and output. Have N intermediate servers, and R/N rate links from each input to the intermediate, and from each intermediate to each output. Now have N^2 links at rate 2R/N. Per-server processing rate is 3R. But with uniform traffic patterns, each server only must process at 2R.
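The rate arithmetic in the bullet above can be checked with a couple of lines (R and N below are made-up example values): a full mesh needs line-rate links between every pair, while Valiant load balancing needs only rate-2R/N internal links at the cost of up to 3R per-server processing (2R for uniform traffic).

```python
# Link-rate comparison: naive full mesh vs Valiant load balancing.
def mesh_link_rate(R, N):
    return R                 # any pair may need the full port rate

def vlb_link_rate(R, N):
    return 2.0 * R / N       # R/N input->intermediate + R/N intermediate->output

R, N = 10.0, 20              # e.g. 10 Gbps ports, 20 servers (illustrative)
assert vlb_link_rate(R, N) == 1.0           # 1 Gbps internal links suffice
assert vlb_link_rate(R, N) < mesh_link_rate(R, N)
```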
    • But this still gives linear per-server fanout (bad). If we increase per-server processing capacity, could assign multiple ports to each server. Or add a k-degree n-stage butterfly network. Combine these ideas: RouteBricks. Use full mesh if possible and extra servers otherwise.
    • The trade-off depends on the kind of servers that you have. Assuming current (5 NICs * 2 X 10G ports or 8 X 1G ports and 1 external port per server).
    • With 2R–3R processing rate, we need to optimize the server! Use a NUMA architecture (Nehalem, QuickPath interconnect), 2 quad-core CPUs, and 2*2*10G NICs. Run Click in kernel-mode.
    • First try: got 1.3Gbps per server. Spending a lot of cycles on book-keeping operations, such as managing packet descriptors. So use batched packet operations and CPU polling, which got 3Gbps.
    • Still a problem with how the cores accessed the NICs: queue access. Get synchronization overhead between the two cores. Simple rule: 1 core per port (input or output), gets no locking overhead. But now we have cache misses because of separate cores working on the same packet. So second rule: 1 core per packet. These rules are mutually exclusive!
    • Solution is to use NICs with multiple queues per port. Can now assign each queue to a single core, which satisfies both rules.
    • So, use state-of-the-art hardware, modify the NIC driver to do batching, and perform careful queue-to-core allocation.
    • For no-op forwarding, get 9.7Gbps with min-size packets, and 24.6Gbps with a realistic mix.
    • For IP routing (IPSec also but not shown), get 6.35Gbps for min-size, and 24.6Gbps with a realistic mix.
    • Realistic size mix: R = 8–12 Gbps. In this case, we are I/O bound. Min-size packets are CPU-bound.
    • Look at the next-gen Nehalem and do some back-of-the-envelope calculations. Could get R = 23–35 Gbps with upcoming servers for a realistic mix.
    • Prototyped RB4: N = 4 in a full mesh. Realistic size mix was 35 Gbps.
    • Did not talk about reordering (avoid per-flow reordering), latency (24us per server), open issues (power, form-factor, programming model). Slide illustrates Russell’s paradox quite nicely.
    • Q: about programmability, the examples require maintaining cross-packet state, so would the choice of the load-balancing mechanism within the router affect this? When the bottleneck is the CPU, yes, this is important.
    • Q: are the trends on both power and power-performance of next-gen processors in your favour? We spend more energy than the corresponding Cisco router, but we have a lot of room for improvement and will get better. Maybe best thing to do is use general-purpose CPUs with an efficient interconnect.
    • Q: is the real choice between general purpose CPUs and a different programming model? Could you do better with ternary CAMs, etc.? Programmability is just trapping exceptional packets. IP routing is easy? Not suggesting that you just throw away all hardware routers, but you might want to do it in some cases, e.g. for specialized monitoring at the borders of your network. A programmable datapath might make it easy to deploy a new protocol.
    • Q: curious about the reordering phenomenon and how this would affect TCP? What would be the most extreme reordering that an adversary could produce? [Inaudible.]

    The Multikernel: A New OS Architecture for Scalable Multicore Systems

    • Boots into Barrelfish rather than Kubuntu to get the video driver to work.
    • How do we structure an OS for the multicore future? Need to deal with scalability and heterogeneity.
    • Sun Niagara has a banked L2 cache for all cores (bad for shared memory). Opteron Istanbul has per-core L2. Nehalem is quite different still. Need to do different things to work well on each of these. An 8-socket Opteron has an “interesting” interconnect. The interconnect matters, especially to access latencies.
    • Can also have core diversity: within a system (programmable NICs, GPUs, FPGAs) and on a single die (for performance asymmetry, SIMD extensions or virtualization support).
    • As core count increases, so does diversity. And unlike HPC systems, one cannot optimize at design time. Need systems software to adapt for you.
    • So now is the time to rethink the default OS structure. Now we have shared-memory kernels on every core, data structures protected by locks and everything else being a device. So propose structuring the OS as a distributed system (like Barbara Liskov said earlier).
    • First principle: make inter-core communication explicit. All communication should be done with messages and no shared state. This decouples the structure of the system from the inter-core communication mechanism, and makes communication patterns explicit (which cores you communicate with etc.). It naturally supports heterogeneity of cores. Also better matches future hardware (no cache coherency? Such as Intel’s 80-core machine). Also allows split-phase operation which makes concurrency easier. And finally, we can reason about performance, scalability, etc. of such systems.
    • Simple microbenchmark to evaluate the trade-off between shared memory and message passing. Several cores update a shared array. In shared-memory case, we get stalling due to cache coherency protocol. This is limited by the latency of interconnect round-trips. Latency is linearish in the number of cores (up to 16), and gets worse with the number of cache lines modified.
    • What if we had a server core that takes messages from a ring buffer per client core?  Higher latency to begin with, but scales much better as the number of cache lines modified is increased. Client overhead is queuing delay at the server. The server’s actual cost is approximately constant, and very low. This would give us many spare cycles if the RPC is implemented split-phase.
    • Second principle: make the OS structure hardware-neutral. The only hardware specific parts should be the message transports and the CPU/device drivers. Makes it possible to adapt to changing performance characteristics. Can late-bind protocol and message-passing implementations.
    • Third principle: view state as replicated. Potentially-shared local state is accessed as if it were a local replica (e.g. scheduler queues, process control blocks, etc.). The message-passing model requires this. This naturally supports domains that don’t share memory.
    • Replicas were previously used as a selective optimization in other systems. The multikernel makes sharing a local optimization, instead (opposite view). Only use shared state when it is faster, and make this decision at runtime.
    • Can support applications that need shared memory if it is available, but the OS doesn’t rely on this.
    • Currently run on x86-64 machines. A CPU driver handles traps and exceptions serially. A user-space monitor mediates local operations on global state. Use URPC inter-core message transport, but we expect that to change.
    • Many non-original ideas (in particular decoupling message-passing from synchronization).
    • Many applications running on Barrelfish (slide viewer, webserver, VMM, OpenMP benchmarks, etc.).
    • How do you evaluate a radically-different OS structure? Barrelfish was from-scratch, so is less complete than other OSs. But we need to show that it has good baseline performance: comparable to existing systems.
    • Case study of unmap (TLB shootdown). Logically need to send a message to every core and wait for all to acknowledge. Linux/Windows use IPIs and a spinlock to do this. Barrelfish makes a user request to a local monitor, and uses message passing.
    • Tried several different communication protocols. One unicast channel per core; a single broadcast channel. Neither of these perform well (especially broadcast because there is no such thing in the interconnect).
    • Really want a multicast optimization: send message once to every socket, which has an aggregation core that forwards it on to local cores. Also use the HyperTransport topology to make decisions about which cores are further away (and hence should receive the message earlier to give parallelism). Need to know the mapping of cores to sockets, and the messaging latency.
    • Use constraint programming and a system knowledge base to perform online reasoning. Prolog query on SKB constructs the appropriate multicast structure. Unmap latency is much less than Windows for >7 cores, and Linux for >12 cores. It also scales much better.
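The multicast structure itself is simple to sketch (a plain function rather than Barrelfish’s Prolog/SKB query; the topology below is made up): send once per socket to an aggregation core, which forwards to its local cores.

```python
# Build a per-socket aggregation plan: one cross-socket send per
# socket instead of one unicast per core.
def multicast_plan(cores_by_socket):
    """cores_by_socket: {socket_id: [core_ids]} -> {aggregator: local cores}."""
    plan = {}
    for socket, cores in cores_by_socket.items():
        aggregator, locals_ = cores[0], cores[1:]
        plan[aggregator] = locals_
    return plan

topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}   # 2 sockets, 4 cores each
plan = multicast_plan(topology)
# 2 cross-socket sends instead of 8 unicasts; local forwarding is cheap.
assert set(plan) == {0, 4}
assert plan[0] == [1, 2, 3] and plan[4] == [5, 6, 7]
```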
    • Show that there is no penalty for shared-memory user workloads (OpenMP), respectable network throughput, and a pipelined webserver with high throughput.
    • Conclusion: no penalty for structuring using message passing. So we should start rethinking OS structure this way.
    • Q: how do you build an application on top of Barrelfish that tries really hard to ignore topology? Focus so far on how to build the OS, and make no particular demands on application programming. Gone with POSIX so far, but we need some higher-level programming model to express things more easily.
    • Q: about URPC performance, are you running a single thing on each processor and hence is the process only ever spinning waiting for URPC, avoiding any context switching latency? That’s true. We decoupled messaging primitive from notification primitive, because notifications are very expensive (at least on our hardware). This leads to a trade-off. The split phase API makes it efficient to work in the case where you don’t need notification immediately.
    • Q: you didn’t mention virtualization… if we map one VM to one core, we can get good performance, so does Barrelfish obviate the need for virtualization? Barrelfish is viewed as orthogonal to virtualization.
    • Q: your microbenchmark [inaudible]. The point of the benchmark is to make the case that messaging over shared memory can have reasonable performance.
    • Q: as systems get bigger, do you expect messages to get lost? Maybe… if they do get lost, this model is a better way to structure it than shared memory.

    Device Drivers

    Fast Byte-granularity Software Fault Isolation

    • Operating systems run many drivers and these run in a fully trusted mode. But they turn out to be a major source of bugs. Existing solutions either require changes to the source code or unacceptably bad performance. Why? They are complex and require complex, fine-grained temporal and spatial sharing within the kernel. But any mistake in this can be fatal….
    • BGI isolates drivers in separate protection domains but allows domains to share the same address space. It provides strong isolation for existing drivers even with a complex kernel API. Write integrity, control-flow integrity and type-safety for kernel objects. Achieves this without changes to drivers or hardware, and only 6.4% CPU overhead (~12% space overhead).
    • Use a BGI compiler on unmodified source code to get instrumented driver, then link this with the BGI interposition library to get a BGI driver.
    • Protection model has kernel in trusted protection domain. Most drivers are untrusted and drivers may share untrusted domains. Each byte has an ACL. Access rights are read and write (default everything readable), indirect call, typed rights for different kernel objects (type safety for mutex, dispatcher, io etc.).
    • Two primitives: CheckRight and SetRight, called by the interposition library (not the driver code). Dynamic type checking is possible using these: check the arguments to kernel API calls, which ensures type safety, and state safety (temporal properties).
    • Compiler checks rights on writes and indirect calls, and grants and revokes write access to the stack as appropriate.
    • Rights are granted on driver load (write to globals, icall right to address-taken functions). Rights granted/revoked to function arguments according to call semantics. Function arguments are also rights-checked.
    • Example of a read request (event-based), and how the BGI compiler inserts inline checks to avoid many problems.
    • Rights change as the driver makes kernel calls, and BGI enforces complex kernel API usage rules (use-before-initialized, write-after-initialized, etc.).
    • Implementation must be fast. Fast crossing of protection domains (just a normal call instruction). Fast rights grant/revoke/check: uses compiler support to inline these checks and perform data alignment.
    • ACL data structure encodes (domain, right) pair as an integer. Several tables of “drights”: arrays for efficient access, and have 1-byte dright per 8-byte memory slot. Need a conflict table when rights ranges overlap. Want to avoid accesses to these conflict tables by aligning data on 8-byte memory slots. Makes the non-conflict case common at the cost of space. Heap objects are 8-byte aligned, and also need special drights for writable half-slots.
    • SetRight is implemented with 4 x86 assembly instructions. Uses arithmetic shift to obtain same code for kernel and user space addresses (with different tables).
    • CheckRight has fast check (5 instructions) and slow check (7 instructions).
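A toy model of the addressing arithmetic behind those tables (the dright encoding below is invented for illustration): one byte of (domain, right) per aligned 8-byte slot, so SetRight/CheckRight reduce to a shift plus a table index. Real BGI does this in a handful of x86 instructions.

```python
# Dright table: one byte per 8-byte memory slot, indexed by addr >> 3.
SLOT_SHIFT = 3                        # 8-byte slots

class DrightTable:
    def __init__(self, size_bytes):
        self.table = bytearray(size_bytes >> SLOT_SHIFT)

    def set_right(self, addr, dright):
        self.table[addr >> SLOT_SHIFT] = dright

    def check_right(self, addr, dright):
        return self.table[addr >> SLOT_SHIFT] == dright

WRITE_DOM3 = 0x13                     # e.g. (domain 3, write) packed in a byte
t = DrightTable(size_bytes=4096)
t.set_right(0x100, WRITE_DOM3)
assert t.check_right(0x100, WRITE_DOM3)
assert t.check_right(0x104, WRITE_DOM3)       # same 8-byte slot as 0x100
assert not t.check_right(0x108, WRITE_DOM3)   # next slot: no right granted
```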
    • BGI has a recovery mechanism that runs driver code inside a try block (cf. Nooks).
    • Evaluated on 16 Vista device drivers: 400 KLOC.
    • Evaluated for fault containment. Injected faults in the source of fat and intelpro drivers (following previous bug studies). Measured the number of faults contained by BGI: fat = 45/45, intelpro = 116/118.
    • Measured file I/O performance with Postmark. Max CPU overhead was 10% (for FAT) and max throughput cost was 12.3% (for FAT).
    • Measured network performance (max 16% CPU overhead and 10.2% throughput decrease both for UDP send).
    • Found 28 new bugs in widely used drivers. Some were serious: writes to incorrect places or use of uninitialized objects. Some less so: abstraction violation, etc. BGI is a good bug-finding tool.
    • Q: subtle problems such as virtual address aliasing or out-of-bound pointers within the bounds of other objects… how do you handle this? The granting of access rights is done according to the kernel API, for write/control-flow integrity and type safety. But we wouldn’t catch the case where a driver makes a write to something that it is allowed to write to, but isn’t the right thing.
    • Q: how do you deal with type-unknown pointers? Can catch errors on data buffers. But where pointers are passed as arguments, you typically don’t do arithmetic on these.
    • Q: compared with SFI work on already-compiled code (compiler not in TCB), could you do this on object code instead of source code? We have a binary version, but we report on the compiler version because it has better performance.
    • Q: the transformations looks similar to existing SFI, so to what do you attribute this improvement in performance? Dealing with complex kernel API and enforcing fine-grained sharing. Previous work did a lot more copying or had expensive crossing of protection domains, in order to deal with complex sharing.
    • Q: you’re redoing virtual memory protection with a complex compiler? Why not just use a microkernel? The performance can’t be that bad? The goal was to support existing drivers that run on existing operating systems with good performance. Could you do the same transformations on existing code to run in separate address spaces? Maybe, but don’t know of a way to do this. Goal was to deal with existing drivers on existing OS.
    • Q: could you just insert latent compiler-inserted checks to avoid zero-day exploits, turned on only when a zero-day appears? An interesting idea, but it’s better if you can cheaply run it all the time in deployed systems… you will find more bugs this way.

    Tolerating Hardware Device Failures in Software

    • Device drivers assume device perfection. But we can see hardware-dependence bugs across driver classes. Transient failures cause 8% of all unplanned reboots. The existing solution is to hand-code a hardened driver, which gets this down to 3%. What can we do with software fault tolerance? Detect hardware failures and perform recovery in software.
    • Where do bugs come from? Device wear-out, insufficient burn-in, bridging faults, EMF radiation, firmware bugs, corrupted inputs, timing errors, unpredictable DMA, etc.
    • Vendors give recommendations to driver developers. Firstly, validate all input coming from the hardware. Then ensure all operations take finite time. Failures should be reported. There are guidelines for recovery also. Goal is to implement as many of these as possible automatically.
    • System is called Carburizer. Runs on driver source code, and compiler generates hardened driver that links to the Carburizer runtime.
    • Goal is to fix these bugs with static analysis. Find driver code that uses device data, and ensure that the driver performs validity checks. Fixes bugs from infinite polling, unsafe array references, unsafe pointer dereferences and system panic calls.
    • First, use CIL to identify tainted variables. Consult a table of functions known to perform I/O and mark heap and stack variables that receive data from these. Propagate taint through computation and return values. Now find the risky uses of these variables. e.g. Find loops where all terminating conditions depend on tainted variables.
    • Now look for tainted variables used to perform unsafe array accesses: used as array index into static or dynamic arrays.
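The taint-tracking step above can be sketched in a few lines. This is a toy illustration in Python, not Carburizer’s CIL analysis: the statement format, the I/O function table and the loop summaries are all invented for the example.

```python
# Toy taint propagation over a list of simple three-address statements:
# variables receiving data from known I/O routines become tainted, taint
# propagates through assignments, and loops whose only exit conditions
# depend on tainted variables are flagged as infinite-polling suspects.

IO_FUNCS = {"readb", "readw", "inb"}  # hypothetical table of I/O routines

def propagate_taint(stmts):
    """stmts: list of (dst, op, args); returns the set of tainted vars."""
    tainted = set()
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for dst, op, args in stmts:
            if op in IO_FUNCS or any(a in tainted for a in args):
                if dst not in tainted:
                    tainted.add(dst)
                    changed = True
    return tainted

def risky_loops(loops, tainted):
    """Flag loops where *every* exit condition mentions only tainted vars."""
    return [name for name, exit_vars in loops
            if exit_vars and all(v in tainted for v in exit_vars)]

stmts = [
    ("status", "readb", ["reg"]),      # status comes from the device
    ("mask",   "and",   ["status"]),   # taint propagates through computation
    ("count",  "const", []),           # untainted: driver-controlled
]
loops = [("poll_loop", {"mask"}), ("bounded_loop", {"count"})]

t = propagate_taint(stmts)
print(risky_loops(loops, t))           # only poll_loop depends on device data
```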
    • Evaluated by analysis of Linux 2.6.18.8. Analysed 2.8 million LOC in 37 minutes (including analysis and compilation). Found a total of 992 bugs in driver code, with a false positive rate of 7.4% (based on a manual sampling of 190 of the bugs). May also be some false negatives, because the analysis doesn’t track taint through procedure arguments.
    • Automatic fixing of infinite loops by inserting timeout code. Inserted code can never harm the system because the timeout is conservative.
    • Inserts code to bounds-check arrays when an unsafe index is used.
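The two inserted fixes can be illustrated as follows — a sketch in Python rather than the C that Carburizer actually generates, with an invented timeout value and invented helper names.

```python
import time

# Sketch of the two Carburizer-style fixes: (1) a polling loop whose only
# exit condition depends on device data gets a conservative timeout;
# (2) a device-supplied array index gets a bounds check before use.

TIMEOUT_S = 0.01  # conservative: far longer than the device should take

def wait_for_ready(read_status, timeout=TIMEOUT_S):
    """Hardened polling loop: returns True if ready, False on timeout."""
    deadline = time.monotonic() + timeout
    while not read_status():            # device-dependent condition
        if time.monotonic() > deadline: # inserted timeout check
            return False                # report failure instead of hanging
    return True

def safe_lookup(table, tainted_index):
    """Hardened array access: reject out-of-range device-supplied indices."""
    if not 0 <= tainted_index < len(table):  # inserted bounds check
        return None                          # treat as a device failure
    return table[tainted_index]

# A stuck device (status never becomes ready) no longer hangs the system:
print(wait_for_ready(lambda: False))    # False, after ~10 ms
print(safe_lookup([10, 20, 30], 7))     # None: index from device is bogus
```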
    • Empirical validation using synthetic fault injection on network drivers: modify return values of in and read functions. Without Carburizer, the system hung; with it, the driver recovered. Works well for transient device failures.
    • Also want to report device errors to fault management systems. Carburizer (i) detects the failures, and (ii) reports them. Report things like loop timeouts, negative error returns and jumps to common cleanup code. Also looks for calls to functions with string arguments, and considers these to be error reporting, so it doesn’t insert additional reporting in this case.
    • Evaluated failure reporting with manual analysis of drivers of different classes (network/scsi/sound). No false positives, but a few false negatives. Overall fixed 1135 cases of unreported failures. Thus it improves fault diagnosis.
    • Static analysis is insufficient. We also need to consider runtime failures, such as missing and stuck interrupts. Can detect when to expect interrupts, and invoke ISR when bits are referenced but there is no interrupt activity. Can also detect how often to poll, which reduces spurious interrupt invocation, and improves overall performance.
    • Can also tolerate stuck interrupts by detecting when an ISR is called too often, and converting from interrupt-based operation to polling in this case. This ensures the system and device make forward progress.
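The runtime behaviour in the last two bullets can be sketched as a small monitor. The threshold and mode names are invented for illustration; the real Carburizer runtime does this inside the kernel.

```python
# Toy monitor: if the ISR fires far too often in an interval (stuck
# interrupt), or device registers are referenced with no interrupt
# activity at all (missing interrupt), fall back to polling the ISR.

STUCK_THRESHOLD = 1000   # ISR calls per interval that indicate "stuck"

class InterruptMonitor:
    def __init__(self):
        self.isr_calls = 0
        self.reg_refs = 0
        self.mode = "interrupt"

    def on_isr(self):
        self.isr_calls += 1

    def on_register_access(self):
        self.reg_refs += 1

    def end_interval(self):
        """Called periodically; decides whether to switch modes."""
        if self.isr_calls > STUCK_THRESHOLD:
            self.mode = "polling"        # stuck interrupt: mask it and poll
        elif self.reg_refs > 0 and self.isr_calls == 0:
            self.mode = "polling"        # missing interrupts: poll the ISR
        else:
            self.mode = "interrupt"
        self.isr_calls = self.reg_refs = 0

m = InterruptMonitor()
for _ in range(2000):                    # simulate a stuck interrupt line
    m.on_isr()
m.end_interval()
print(m.mode)                            # polling
```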
    • Evaluated effect on network throughput: overhead is < 0.5%. And on CPU utilization, this goes up 5% for nVidia MCP 55 when recovery is used; no cost when just doing static bug fixing.
    • Covers 8/15 of the recommendations for driver writers, automatically.
    • Q: is your intention that driver developers should run Carburizer only in development and incorporate its fixes, or to run it on deployed code? Encourage sysadmins to run the tool all the time to cover as many bugs as possible.
    • Q: talked about transient device failures, but what about transient errors in the CPU? In this case, even the code execution will fail. Can you comment on the relative frequencies of these kinds of failure? Not aware of this data. Cannot trust anything when the CPU fails, so not sure what we can do about this.
    • Q: do you have a rough feel about how many of the found bugs could be used by malicious attackers to perform code injection? Don’t know off hand. Certainly very easy to hang the system, so could do this maliciously.
    • Q: could you talk about the static analysis that you are using? How do you handle pointer aliases? Analysis is very simple, and we do have 7.4% false positives. But the combination of our techniques gives us this low rate.

    Automatic Device Driver Synthesis with Termite

    • Conventional driver development requires acquiring necessary information from OS interface spec and device spec. Need to combine this information to come up with a driver implementation that appropriately translates OS requests into driver requests.
    • If we formalize these specs, we can do better by synthesizing drivers, and there is no longer a need for developers to be an expert in both the OS and the device. Only know one well. Furthermore, code can be specified once and synthesized many times.
    • Use finite state machines as the basic formalism for writing these specs. Note device-initiated transitions and software-initiated transitions.
    • First step is to take two state machines (OS and device) and synthesize a combined state machine that considers all possible transitions. Any transition represents a legal transition in the driver. Also associate timeout labels with transitions.
    • Now translate the state machine into C source code (simple and covered in the paper).
    • A real device has multiple functional units, so can’t possibly use a single FSM for all of these. A new language is used to compose multiple FSMs together (one per functional unit). Need a synthesis algorithm that can handle this: need to deal with the state explosion problem (by exploring the state space incrementally). Also need to deal with data, and cannot model all possible assignments to each variable. Instead manipulate data symbolically.
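The first synthesis step — combining the OS and device state machines into one machine whose transitions are the legal driver behaviours — can be sketched as a product construction. The two tiny specs below are invented; Termite’s real specs also carry timeouts, data and multiple functional units.

```python
# Product of an OS-side FSM and a device-side FSM. Events known to only
# one machine move that machine alone; shared events must be accepted by
# both machines in lock-step.

def fsm_product(os_fsm, dev_fsm, os_states, dev_states):
    """Each FSM is a dict (state, event) -> next_state."""
    os_events = {e for _, e in os_fsm}
    dev_events = {e for _, e in dev_fsm}
    combined = {}
    for s1 in os_states:
        for s2 in dev_states:
            for e in os_events | dev_events:
                # None means "this machine cannot take e here"; a machine
                # that doesn't know e at all simply stays put.
                t1 = os_fsm.get((s1, e), None if e in os_events else s1)
                t2 = dev_fsm.get((s2, e), None if e in dev_events else s2)
                if t1 is not None and t2 is not None:
                    combined[((s1, s2), e)] = (t1, t2)
    return combined

os_fsm  = {("idle", "os_send"): "sending", ("sending", "done"): "idle"}
dev_fsm = {("ready", "write_reg"): "busy", ("busy", "done"): "ready"}
product = fsm_product(os_fsm, dev_fsm,
                      ["idle", "sending"], ["ready", "busy"])

# The driver may satisfy an OS send request by writing the device register;
# on completion ("done", shared) both machines return to their start states.
print(product[(("sending", "ready"), "write_reg")])   # ('sending', 'busy')
print(product[(("sending", "busy"), "done")])         # ('idle', 'ready')
```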
    • Termite successfully synthesizes drivers for the ASIX AX88772 USB-to-Ethernet adapter (on Linux) and the Ricoh R5C822 SD host controller (on Linux and FreeBSD).
    • <1 kLOC for OS interface spec. <700 LOC for device specs. Synthesized drivers are 2 to 4 times larger than the Linux drivers.
    • Showed a demo of a visual Termite debugger. Allows single-stepping and setting of breakpoints on states.
    • Performance is very close to the native drivers.
    • Some limitations: cannot specify constraints on data (alignment, fragmentation, etc.), complex inter-variable relations are not supported (limitation of symbolic execution engine), the structure of specifications is restricted, and Termite drivers require runtime support. Not conceptual limitations, only of the current implementation.
    • One issue is how you write the specifications. This is particularly onerous for the device manufacturers. Also a potential source of bugs. Need a big brain to do this based on the HDL of the device. Ideally we would automate this translation (HDL to driver spec). Since you write devices in HDL anyway, this shouldn’t be too bad.
    • Not much open-source hardware around, so there was a difficulty in finding hardware on which to evaluate this.
    • Q: to make this practical, there are a few questions: performance? Scalability (for big devices like video cards)? Performance hasn’t been an issue so far, because the device developer makes the same assumptions that Termite ends up making, so the generated code is quite similar. (But the devices are simple?) So far this hasn’t been a problem. Scalability is more of an issue. Looking at using better symbolic execution that ignores irrelevant relations between variables.
    • Q: how do you deal with firmware in the device (also specified in HDL)? At the moment, the manual process has to cover this. Once we automate this, it might be more challenging, so we might want to generate firmware as well.
    • Q: what about the data size of compiled code, and the CPU utilization when saturating the device? Code and data size doesn’t look very big (roughly proportional to the amount of source code). Haven’t looked so much at data size.
    • Q: does your spec language have room for cork tables and other crap? [Taken offline.]
    • Q: the OS spec is intended to be device independent (e.g. generic for GPU, ethernet driver, etc.), but how do you cope with new features in a device, which people like to have? All the interesting stuff is on the side? How does that play with the OS spec if you want to take advantage of these features? We cope with standard features, could include “semi-standard” optional features in the OS spec. For unique features, you would need to extend (but not rewrite) the OS spec. No experience with doing this.
    • Q: the requirement for a full functional spec of the OS driver interface is somewhat intimidating, so what is your experience in making these, and how do they scale? It’s generally doable, but we would want to create a special-purpose language to make this easier.

    Debugging

    Automatically Patching Errors in Deployed Software

    • Your code has bugs and vulnerabilities, but attack detectors (code injection, memory errors, etc.) exist. What do you do in this case? At the moment, just crash the application, which is a straightforward DoS.
    • ClearView protects against unknown vulnerabilities, preserves functionality and works for legacy software.
    • Zero-day exploits are a problem for hard-coded checks because they are unknown in advance.
    • The application must continue to work (especially if mission-critical) despite attacks. A patch can repair the application. (Mind you, we shouldn’t always keep the application running: sometimes crashing is correct behaviour.)
    • Want to do this without access to the source code, so can’t rely on built-in features. Needs to run on x86/Windows.
    • Use learning as the secret sauce. Normal executions show how the application is supposed to run. Attacks provide information about the vulnerability, and can be used to give the system immunity. The first few attacks may crash the system, however.
    • Detection is pluggable: tells you whether an execution was normal or an attack. Learning learns normal behaviour from successful runs, and checks constraints during attacks. This gives a predictive constraint, which is true on every good run, and false during every attack. Repair component creates a patch to re-establish constraints and invariants. System evaluates patches and distributes them to deployed applications.
    • Use off-the-shelf detectors.
    • Assume a single server and several community machines running the application. (Assume that they are not exploited to begin with.) Community machines report constraints back to the server. Use code injection and memory corruption attack detectors (others are possible).
    • On an attack, the detector collects information and terminates the application. Server attempts to correlate this information with a constraint: leads to a predictive constraint. The server generates appropriate patches and distributes the best of these. The quality of patches can be refined by information about successful or failed attack attempts (failed or successful defenses). Redistribution is then possible.
    • How do we learn normal behaviour? Use an ML technique called dynamic invariant detection (previous work), which has many optimizations for accuracy and speed. Technique was enhanced for this project.
    • Inference results. ML technique is neither sound (overfitting) nor complete (templates are not exhaustive). However it is useful and effective. Sound in practice, and complete enough.
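The learning step can be illustrated with a toy version of Daikon-style invariant detection: instantiate simple templates over good runs, keep those that held on every good run, and report which survivors are violated during an attack — the predictive constraints. The variable names and runs below are invented.

```python
# Templates used here: var == const, lo <= var <= hi, var != 0.

def learn_invariants(good_runs):
    """good_runs: list of dicts var -> value. Returns invariants that held
    on every good run, as (description, check_fn) pairs."""
    candidates = []
    for v in good_runs[0]:
        vals = [run[v] for run in good_runs]
        lo, hi = min(vals), max(vals)
        if lo == hi:
            candidates.append((f"{v} == {lo}", lambda r, v=v, c=lo: r[v] == c))
        candidates.append((f"{lo} <= {v} <= {hi}",
                           lambda r, v=v, lo=lo, hi=hi: lo <= r[v] <= hi))
        if all(x != 0 for x in vals):
            candidates.append((f"{v} != 0", lambda r, v=v: r[v] != 0))
    return candidates

def predictive(invariants, attack_run):
    """Constraints true on every good run but false during the attack."""
    return [desc for desc, check in invariants if not check(attack_run)]

good = [{"vtable": 0x1000, "argc": 2}, {"vtable": 0x1000, "argc": 3}]
attack = {"vtable": 0xdeadbeef, "argc": 2}   # corrupted vtable pointer
inv = learn_invariants(good)
print(predictive(inv, attack))   # the constraints on 'vtable' are violated
```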
    • How do we learn attack behaviour? Attack detectors aim to detect problems close to their source. Code injection uses Determina Memory Firewall (triggers when control jumps to code outside the original executable). Memory corruption uses Heap Guard (triggers on sentinel value overwrite). Techniques have low overhead and no false positives.
    • Server pushes out appropriate instrumentation to the community. Only check constraints at attack sites (low overhead).
    • Repairing installs additional detectors to see if you have a bad patch (e.g. looking for assertion violations).
    • Attack example is a JavaScript system routine, written in C++. Doesn’t perform typechecking of the argument, so vtable may be corrupt. Have a predictive constraint on the operand of JSRI instruction.
    • Aim to fix a problem while it is small, before the detector is invoked. Repair isn’t identical to what a human would write, but it is much more timely.
    • Patches are evaluated in the field (do they avoid triggering the attack detector or prevent other behaviour deviations?).
    • Evaluated with a Red Team that created 10 exploits (HTML pages) against Firefox 1.0. ClearView was not tuned to known vulnerabilities in that version, but the learning component focussed on the most security-critical components. Red Team had access to all project materials.
    • ClearView detected every attack and prevented all exploits. For 7/10 vulnerabilities, ClearView generated patches that maintained functionality after an average of 4.9 minutes and 5.4 attacks. Handled polymorphic attack variants, simultaneous and intermixed attacks, and had no false positives (installing a patch when not under attack). Low overhead for detection and repair (considering this is an interactive application, not surprising).
    • What about unrepaired vulnerabilities? 1. ClearView was misconfigured. 2. Learning suite was too small. 3. Needed constraint not built into Daikon. All zero-day attacks against the system and all trivial to fix with minor changes to ClearView.
    • Q: introducing code is a bit scary… what if one of the patches introduces a new vulnerability? Firstly, you can only do this when you’ve found an exploitable vulnerability. Red Team tried and failed. In one case, ClearView found a vulnerability in its own injected code.
    • Q: if I were an attacker who wanted to DoS your system (knowing ClearView was running), I might try to disable ClearView somehow by making the ClearView DB learn an incorrect fact… so is your system vulnerable to that kind of attack? It doesn’t matter what facts are true during attacks, so you’d have to find good executions that weren’t observed as being bad to poison the database. It’s conceivable and theoretically possible that you could do that, but I don’t know if it’s practical.
    • Q: what is the overhead of inserting invariants at every instruction? You will see between 5 and 10 constraints per instruction… learning is the biggest bottleneck, but you could distribute this amongst the community. In terms of unsoundness, we’re not seeing that as a problem.
    • Q: how sensitive are you to the invariants that I have to specify for the patches, in the case where continuation introduces incorrectness into the persistent state? This is a policy decision: an hour of downtime for a bank costs $6 million, so it may be better to keep running and fix things up later.

    Debugging in the (Very) Large: Ten Years of Implementation and Experience

    • 10 years of work by the first 8 authors.
    • Even Microsoft’s shipping software has bugs! (And so does your hardware….)
    • A bug is a flaw in program logic; an error is a failure in execution caused by a bug (1 bug -> many errors).
    • How does Microsoft find out when things go wrong? We want to fix bugs regardless of source, prioritize bugs affecting the most users. Kernel crashes (BSOD), application crashes, everything down to invariant violations.
    • Windows Error Reporting. What happens after you click “Send Error Report”?
    • Server is over-provisioned to handle 100 million reports per day. 17 million programs have records in WER. 1000s of bugs have been fixed. Uses 200TB of storage, 60 servers over 10 years. Anyone in the audience can get access to WER data….
    • Debugging in the large makes the user-developer feedback loop much longer, both in terms of the number of people and the latency. The problem is the human bottleneck (both in accuracy and latency). Goal was to remove humans from the loop.
    • On an error, collect a minidump: stack of erroneous thread and a little extra context. If the user allows it, upload this to WER. An analysis procedure (!analyze) runs over all of these mini-dumps and clusters these in buckets.
    • !analyze takes a minidump as input, and outputs a bucket ID. So increment the bucket count and prioritize buckets with the highest count. Actually upload only the first few minidumps for a bucket; after that just increment the count. Sometimes you need a full core dump, and programmers can request this to be collected on future hits.
    • 2-phase bucketing strategy: labelling on the client (bucketed by failure point) and classifying on the servers (consolidate versions and replace offsets with symbols; find callers where the bug might be (if it calls known-good code)). This refines the bucket ID… more details in the paper.
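The server-side flow can be sketched as follows. The bucket-ID heuristic here (blame the first stack frame outside known-good modules) is a crude stand-in for `!analyze`, and all of the data is invented.

```python
from collections import Counter

KNOWN_GOOD = {"ntdll.dll", "kernel32.dll"}   # blame the caller, not these

def bucket_id(minidump):
    """minidump: app name plus (module, symbol) frames, innermost first."""
    for module, symbol in minidump["stack"]:
        if module not in KNOWN_GOOD:
            return f"{minidump['app']}!{module}!{symbol}"
    return f"{minidump['app']}!unknown"

buckets = Counter()
reports = [
    {"app": "word", "stack": [("ntdll.dll", "memcpy"), ("word.exe", "Save")]},
    {"app": "word", "stack": [("word.exe", "Paste")]},
    {"app": "word", "stack": [("ntdll.dll", "memcpy"), ("word.exe", "Save")]},
]
for r in reports:
    buckets[bucket_id(r)] += 1           # mostly just a counter increment

print(buckets.most_common(1))            # the hottest bucket to debug first
```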
    • One bug can hit multiple buckets (up to 40% of error reports). Also multiple bugs can hit one bucket (up to 4% of error reports). Bucketing mostly works… scale is our friend (throw away a few here and there and you still have enough to debug).
    • Bucket hits for a given program look like a Pareto curve. Just 20 buckets in Word 2010 account for 50% of all errors. Only fixing a small number of bugs will help many users.
    • Earliest success story was finding heisenbugs in the Windows kernel (>= 5 years old). Vista team fixed 5000 bugs in the beta. An anti-virus vendor accounted for 7.6% of all kernel crashes; within 30 days this was down to 3.6%. Office 2010 team fixed 22% of reports in 3 weeks.
    • Example of hardware errors too: failure in a CPU (exact same revision and step). Chip vendor knew that the bug existed and didn’t think that it would get hit in real-life. Error reports dropped dramatically after the work-around was applied.
    • Also hardware failures in SMBIOS of a particular laptop (buffer overrun); motherboard USB controller (only implemented 31/32 of DMA address bits). Lots of information about failures due to overclocking, HD controller resets and substandard memory.
    • Also looked at malware. The Renos social-engineering worm caused Explorer.exe to crash when people downloaded something from an email. Saw a spike, issued a worm-removal tool through Windows Automatic Update, and saw infections decline very quickly. Shows that WER scales to handle global events.
    • Distributed system architecture hasn’t changed in 10 years, and yet scales to global events.
    • Product teams now have ship goals based on reducing WER reports: led to a numerically-driven debugging approach. Fundamentally changed software development at Microsoft.
    • Q: what about privacy? Most private data ends up in the heap, not on the stack. Only collect stack. Also do some data-scrubbing based on things we know about (user ID, etc.). Looked for SSNs, credit-card numbers, etc.: found fewer than 10 possible matches in 100,000 error reports. MS also puts very strong employment conditions on how this data is accessed and used.
    • Q: []. Team looks for zero-day exploits in the WER data regularly. Philosophy is that there should be no overhead when users don’t hit an error report. 90% of users only ever have to do an increment of a counter.
    • Q: are you keeping track of how many people don’t send error reports? We have good estimates of opt-in rates, from other systems that collect information from machines that are running normally (only if people opt-in to those programs of course). If half the world turned off the error reports, we’d still get enough information.
    • Q: does analyzing these bugs tell you about common errors programmers make? Missing NULL checks are the most important one. This has generated many guidelines that have been fed back into internal development processes.

    Detecting Large-Scale System Problems by Mining Console Logs

    • It is challenging to detect problems in large-scale internet services. Requires instrumentation (expensive to maintain, and may use modules that aren’t instrumented). So can we use console logs in lieu of instrumentation? They are easy for the developer to insert in their code, but they are imperfect (not originally intended for instrumentation). It is non-trivial to analyze their free-text contents.
    • Go from 24 million lines of log messages, finding a small number of abnormal log segments, to a single page of visualization. Fully automatic process without manual input.
    • Use machine learning based on carefully-defined (numerical) features in the logs. Parse the program source-code to infer information about the log contents, to generate appropriate features. Finally visualize the results.
    • Key insight: log contains the necessary information to create features. Identifiers and state variables are useful. Important information is in the correlation between messages (examples taken from HDFS (Hadoop Distributed File System)). e.g. “receiving block x” followed by “received block x” is normal; without “received block x”, you have an error.
    • First step is to parse free-text logs into semi-structured text. Look at program source to generate regular expressions that extract state variables (the block number, for example). It becomes non-trivial in OO languages, where type inference on the whole source tree is necessary. Yields highly accurate parsing results.
    • Identifiers are widely used in logs (filename, object keys, user IDs, etc.). Do group-by on identifiers. Identifiers can be discovered automatically.
    • Now build a numerical representation of traces for feature creation. Approach is similar to the bag of words model in information retrieval. Yields a message term vector with term frequencies.
    • Can use these vectors to do anomaly detection. Use Principal Component Analysis (PCA) to capture normal patterns in all vectors. These are based on correlations between dimensions of the vector.
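The pipeline up to this point can be sketched end-to-end on toy data. The regex below is written by hand (the paper derives message templates from source code), and a simple frequency test stands in for PCA in the last step: vectors matching the dominant pattern are normal, rare vectors are anomalies.

```python
import re
from collections import defaultdict, Counter

LINE_RE = re.compile(r"(receiving|received|deleting) block (blk_\d+)")

logs = """\
receiving block blk_1
received block blk_1
receiving block blk_2
received block blk_2
receiving block blk_3
"""

# Parse free text into (message type, identifier), then group-by identifier
# to build per-block message-count vectors.
vectors = defaultdict(Counter)
for line in logs.splitlines():
    m = LINE_RE.match(line)
    if m:
        msg_type, block = m.groups()
        vectors[block][msg_type] += 1

# Count how often each distinct vector shape occurs; rare shapes are suspects.
shapes = Counter(tuple(sorted(v.items())) for v in vectors.values())
normal = shapes.most_common(1)[0][0]
anomalies = [b for b, v in vectors.items()
             if tuple(sorted(v.items())) != normal]
print(anomalies)    # blk_3 never logged "received": likely an error
```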
    • Ran an experiment on Amazon EC2. 203 nodes ran Hadoop for 28 hours, with standard map-reduce jobs (sorting etc.). Generated 24 million lines of console logs with ~575000 HDFS blocks. 575000 vectors lead to ~680 distinct vectors. Distinct cases were labelled manually as being normal or abnormal (in the evaluation only). However, the algorithms are unsupervised and automatic.
    • 11 kinds of anomaly, occurring 16916 times. PCA detected 16808 of these. Two kinds of false positive: background migration and multiple replicas. Believe that no unsupervised algorithm could do better, so we’re now allowing operators to provide feedback.
    • Results are visualized with a decision tree. Unusual log message text is used to document split points in the decision tree.
    • Future work. Want to improve parsing so that it doesn’t require source code, and support more languages. Also want to improve feature creation and machine learning so that online detection is possible, also across applications and layers to provide more useful and comprehensive information.
    • Q: most applications have many identifiers, so how do you automatically detect these, and how reliably? The grouping step addresses the problem of multiple identifiers. In HDFS, we only have the block ID, but we have an example in the paper where we run the algorithm multiple times for each class of identifier.
    • Q: how do you know whether the different values of an identifier correspond to the same identifier variable? [Taken offline.]
    • Q: what was different about the single anomaly where you don’t do well? (Deleting a node when it no longer exists on the data node.) Block numbers can be arbitrary due to multiple reads and writes. Sometimes get errors in the correlations.
    • Q: how common do you think this is in other systems besides HDFS? HDFS has this problem most severely because every block interaction is written to the logs.
    • Q: you had to turn on the more-detailed logging level to get this to work, so how did you choose this? I had to turn on debug-logging. Depends on the problems you want to detect. Turn on more logging when you see problems but you can’t find out why.
    • Q: what happens to performance when you do this? Also what about heisenbugs that go away with more-detailed logging? System doesn’t do anything about logging-based heisenbugs.
    • Q: what if I add a new feature, would you be able to detect problems in it? Don’t currently deal with multiple versions of the software.
    • Q: how much information does your visualization offer to the developer to help them diagnose detected problems? If the operators have some insight, this tool can help them provide useful information to the developers.

    NSDI 2009: Day 3

    Friday, April 24th, 2009

    Wireless #2: Programming and Transport

    Wishbone: Profile-based Partitioning for Sensornet Applications

    • Extension to the WaveScope system.
    • Example application of locating marmots: listening out for loud alarm calls when confronted by a predator. High data rate application (node has four microphones). Used for sensing applications (animal localisation, pothole detection, computer vision, pipeline leak detection, speaker identification and EEG seizure detection). All expressible as a data flow graph. Predictable data rates. But run on heterogeneous platforms (CPU and radio bottlenecks).
    • Used Linux servers, smartphones (Java-based and iPhones) and Meraki routers. Want to be able to mix and match. Have to deal with incompatible software environments (languages, SDKs and OSes).
    • Application is a dataflow graph (edges are streams; nodes are stream operators). Inputs are from sensor sources, outputs are results either to the user or a server. System partitions this graph between the embedded nodes and the server side. Wishbone handles serialisation and deserialisation across the network interface. Compiles and loads subgraph onto any platform. Replicates graph across all nodes (but may have different partitions depending on the node).
    • Contributions. First broadly portable sensor net programming environment. Partitioning algorithm. Optimises CPU/radio tradeoff.
    • WaveScope compiler gives dataflow graph in portable IL. Then have a backend code generator for each of the target platforms.
    • Wishbone needs sample data from the user to make the correct partitioning decisions.
    • Target a TinyOS mote (16-bit microcontroller, 10KB RAM, no memory protection, no threads, task-granularity messaging model). Not directly compatible with the WaveScope threading model. Use profile-directed cooperative multitasking, which makes good decisions (based on profiling information) about where to place yield points.
    • Profile streams and operators to find execution times, data rates. Separately profile network connections.
    • Some data flow nodes must be pinned to a particular platform (e.g. source and sink nodes, stateful global operators). Others are unpinned (stateless and locally stateful nodes).
    • Assign weights to the edges (network bandwidth) and nodes (CPU cost) in the dataflow graph. Interested in the sum of CPU cost on the embedded device, and the sum of the edges cut by the network boundary. Formulate as an Integer Linear Program. Enforce a resource bound on the embedded device (sum over nodes placed on it) and a network bound (sum over cut edges), then minimise a linear combination of the CPU and network cost. The tricky bit is to enforce the other constraints (pinning and graph topology) while staying linear. Can set a parameter to trade off between CPU and radio, or else set it to reflect energy consumption.
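The objective can be made concrete on a tiny invented dataflow graph. Wishbone solves this as an ILP; exhaustive search stands in here, which is only feasible because the example has three unpinned operators.

```python
from itertools import product

# Each unpinned operator goes on the embedded node (1) or the server (0);
# minimize alpha * embedded CPU + (1 - alpha) * cut bandwidth, subject to
# a CPU bound on the embedded device. All weights are invented.

cpu = {"filter": 3, "fft": 10, "detect": 2}            # per-operator CPU cost
edges = [("src", "filter", 8), ("filter", "fft", 4),
         ("fft", "detect", 1), ("detect", "sink", 1)]  # (u, v, bandwidth)
pinned = {"src": 1, "sink": 0}       # source on the device, sink on the server
CPU_BOUND, ALPHA = 15, 0.5

best = None
ops = list(cpu)
for bits in product([0, 1], repeat=len(ops)):
    place = dict(zip(ops, bits), **pinned)
    embedded_cpu = sum(c for o, c in cpu.items() if place[o] == 1)
    if embedded_cpu > CPU_BOUND:
        continue                      # violates the embedded resource bound
    cut = sum(bw for u, v, bw in edges if place[u] != place[v])
    cost = ALPHA * embedded_cpu + (1 - ALPHA) * cut
    if best is None or cost < best[0]:
        best = (cost, place)

print(best[1])   # cheap early filtering on the device, heavy FFT on the server
```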
    • Evaluated by looking at bandwidth versus compute cost on a linear pipeline of operators (evaluated on an iPhone as the embedded node). Observe that processing reduces data quantity overall, but this is not monotonic.
    • Evaluated by changing the input data rate of the application. Look at how many operators remain in the optimal node partition. (EEG graph with multiple channels partitions horizontally (some channels on server, some on device).)
    • Given CPU and network bounds, can find an optimal partition if it exists. The partition gives estimated throughput. So do a binary search over CPU bounds to find the maximum possible throughput.
    • Ground truth is how many detections can actually be gotten out of a network of TMotes. Also looked at the compute/bandwidth tension in a single TMote/basestation network.
    • Related to graph partitioning for scientific codes (Zoltan in symmetric supercomputers), task scheduling (list scheduling), MapReduce/Condor, and Tenet and Vango (sensor network context, run TinyOS on embedded and server devices).
    • Q: have you considered situations where operators could tune the amount of processing that they do (or tolerate packet loss)? Considered it, but there are several generalisations we would look at first. Does it really change the structure of the system? Yes, we’d probably need a new partitioning algorithm.
    • Q: would it be worth profiling the amount of synchronisation necessary for pinned (global stateful) operators? We have strong consistency requirements, so we’d have to have a relatively expensive operation to synchronise this.
    • Q: have you applied the same sorts of partitioning for the scenario where you have different partitioning for different sensors? Allow you to port application across different platforms but the network is homogeneous, but you could run the algorithm multiple times (not considered in this paper).
    • Q: what is the effect of additional load on available bandwidth (interference, etc.)? Currently assume all nodes transmit at the same data rate, and use this assumption to assign channel capacity.

    Softspeak: Making VoIP Play Well in Existing 802.11 Deployments

    • VoIP and WiFi are becoming increasingly popular. Now even cellphones are becoming available with these capabilities (even Skype on iPhone). If these get used heavily, what happens to existing 802.11 deployments?
    • Imagine a commodity AP in a coffee shop. Most users are data users, and maybe there are one or two VoIP users. But as VoIP becomes more popular, the number of VoIP users will increase.
    • What does this do to call quality and to data users?
    • 802.11 was designed for data networks and has substantial per-packet overhead (headers, ACK) and contention (backoff and collision). VoIP has small packets, a high packet rate (20–100 packets per second), and does not respond to congestion. So VoIP makes inefficient use of WiFi.
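Back-of-the-envelope airtime arithmetic shows how bad this is. The timing constants below are the standard 802.11b values, and the 60-byte packet (20 bytes of audio plus 40 bytes of RTP/UDP/IP headers) is a typical VoIP frame; none of these numbers come from the paper.

```python
SLOT, SIFS, DIFS = 20e-6, 10e-6, 50e-6     # 802.11b timings (seconds)
PREAMBLE = 192e-6                          # long PLCP preamble + header
RATE = 11e6                                # data rate, bits/s
CW_MIN = 31                                # -> mean backoff of 15.5 slots

def airtime(payload_bytes, mac_overhead_bytes=34):
    """Total channel time to send one frame, including backoff and ACK."""
    backoff = (CW_MIN / 2) * SLOT
    data = PREAMBLE + (payload_bytes + mac_overhead_bytes) * 8 / RATE
    ack = PREAMBLE + 14 * 8 / RATE
    return DIFS + backoff + data + SIFS + ack

voip = airtime(60)
useful = 60 * 8 / RATE                     # time actually spent on payload
# Under 6% of the channel time carries the VoIP payload.
print(f"total {voip * 1e6:.0f} us, efficiency {useful / voip:.1%}")
```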
    • Measured impact as residual capacity (TCP/UDP throughput) and “mean opinion score” (MOS, how audio appears to a real person, calculated based on loss and jitter).
    • Used an 802.11 testbed and gradually enabled VoIP stations. Downlink MOS tails off much faster than uplink MOS (after 4 VoIP stations). Degradation of TCP throughput is linear, but much more severe than would be expected from the size of VoIP packets.
    • Possible solutions: could decrease VoIP packet rate, or use higher speed networks (802.11g/n). Lowering the packet rate still degrades MOS. Higher speed networks run in “protection mode” in presence of older 802.11 versions, which loses many of the benefits. VoIP is too infrequent to benefit from 802.11n aggregation.
    • Could prioritise VoIP traffic (802.11e), which increases the contention overhead, and further reduces the residual capacity.
    • Softspeak does prioritised TDMA for the uplink and addresses contention overhead. By serialising channel access, we avoid collisions. Cycle between 10 TDMA slots, each of 1ms (1ms accommodates current codecs well; could be varied). VoIP stations have to establish a schedule, synchronise clocks and compete with non-TDMA traffic.
    • Non-VoIP stations are unaware of TDMA, which may prevent VoIP stations from sending on time. If a data user wins contention, the VoIP station cannot transmit; if they collide, this is worse. Use 802.11e VoIP prioritisation to improve VoIP quality; with TDMA we don’t see the same data-rate degradation. Even a quick collision recovery for VoIP means that we may overrun our TDMA slots: can use multiple priority levels to address this (though can exhaust priority levels). Worst case, fall back to regular 802.11, and do no worse than 802.11.
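The uplink discipline can be sketched as follows: ten 1 ms slots cycling every 10 ms, with each VoIP station owning one slot and only queuing its packet when its slot comes around. Clock synchronisation and contention with data stations are omitted; the station names are invented.

```python
SLOTS, SLOT_LEN_MS = 10, 1.0               # 10 slots of 1 ms, as in the talk

def slot_owner(now_ms, assignments):
    """Which station may transmit at now_ms (ms since the sync point)."""
    slot = int(now_ms // SLOT_LEN_MS) % SLOTS
    return assignments.get(slot)           # None: slot free for data traffic

assignments = {0: "phone-A", 1: "phone-B", 7: "phone-C"}
for t in [0.2, 1.5, 7.9, 11.5, 4.0]:
    print(f"t={t:4.1f} ms -> {slot_owner(t, assignments)}")
```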
    • Experiment by visualising TDMA, with ten TDMA VoIP stations, and some CSMA/CA background data traffic. Most transmissions should start in their own or the next slot.
    • Softspeak also does downlink aggregation. VoIP stations will overhear aggregated packets and extract their own portion of the packet.
    • No modification to wireless card, access point or VoIP application. Softspeak controller registers IP address and port with the aggregator (wired connection to AP). Implemented for Skype and Twinkle.
    • Evaluated for call quality and residual throughput. TCP data traffic and 10ms voice codec. (More results in the paper.) Adding TDMA alone has little effect over 802.11b. Aggregation by itself greatly improves downlink MOS (and slightly improves residual throughput). Softspeak in total greatly improves downlink MOS (almost back at single VoIP station level) and also improves residual throughput by 5x.
    • For 802.11g, get a 3x improvement in residual throughput, slight degradation in downlink MOS and large improvement in uplink MOS.
    • Also looks at performance when contending with web traffic. The bulk TCP upload improvement disappears, but the combined TCP capacity improvement is preserved. Softspeak doesn’t change the basic behaviour of the data traffic.
    • Q: how would the result change if you were measuring against multiple TCP flows? Experiment we did with web traffic showed this. Multiple bulk TCP flows? In one direction, saw same level of improvement; in the other saw queuing at the AP and would need some form of prioritisation at the AP.
    • Q: how could this work in a real network (because you have more than one collision domain; how do you assign slots across multiple hops?)? Didn’t evaluate the case with multiple collision domains; details of slot assignment in the paper. Use an 802.11 management protocol for probe response which increases the range when it can be heard. Doesn’t completely eliminate hidden terminal problem, but we hope that this would apply to TDMA.
    • Q: with multiple APs and multiple slots, wouldn’t you need some kind of centralised control? Could work across multiple APs.
    • Q: did you measure the delay performance of this? Incorporated in the MOS value.
    • Q: did you compare to giving short packets high priority with a rate limiter? This is part of the 802.11e specification, and we measured this. Why does aggregation win by reducing the packet per second rate? Typically if you have short packets, you use more bandwidth.
    • Q: if I am an enterprise administrator, why not just reserve a single channel for VoIP traffic, because it no longer has to contend with data or non-VoIP traffic? That would be a good solution if you had that available to you.
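The uplink TDMA scheme described above can be sketched in a few lines. This is my own illustrative Python, not Softspeak’s implementation; the function name, the microsecond clock and the parameters are assumptions, though the defaults (10 slots of 1ms each) match the talk.

```python
def next_slot_start(now_us, slot_index, num_slots=10, slot_us=1000):
    """Start time (in microseconds) of the next occurrence of this
    station's TDMA slot, given a synchronised clock reading now_us.

    The frame cycles through num_slots slots of slot_us each, so the
    station owning slot i may transmit during
    [k*frame + i*slot_us, k*frame + (i+1)*slot_us) for integer k.
    """
    frame_us = num_slots * slot_us               # 10ms frame, as in the talk
    frame_start = (now_us // frame_us) * frame_us
    slot_start = frame_start + slot_index * slot_us
    if slot_start < now_us:                      # our slot already began this frame
        slot_start += frame_us                   # wait for the next frame
    return slot_start
```

With synchronised clocks, every VoIP station independently computes the same frame boundaries, so channel access is serialised without any per-packet coordination.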

    Block-switched Networks: A New Paradigm for Wireless Transport

    • Today, TCP performs poorly over wireless links. On a bad link, you get 40x goodput difference between UDP and TCP. Median has 2x goodput difference, and a good link has 1.6x goodput difference.
    • End-to-end rate control is error-prone and doesn’t work well in wireless networks. Loss feedback does not distinguish congestion and corruption losses. End-to-end retransmissions are wasteful in a multi-hop wireless network. Route disruptions cause unavailability (could use a disruption-tolerant protocol).
    • The use of packetisation introduces non-trivial overhead (for channel access: listen, backoff, RTS/CTS; and link-layer ARQ).
    • Also see complex cross-layer interaction: link-layer ARQs and backoffs hurt TCP rate control.
    • Hop is a clean-slate redesign. End-to-end becomes hop-by-hop. Packets become blocks to amortise overhead.
    • Reliable block transfer, ACK withholding and micro-block prioritisation on a per-hop basis. Virtual retransmission and backpressure components on a multi-hop basis.
    • Hop sends data in 802.11 burst mode. CSMA performed only before a TXOP (burst). Block-SYN and Block-ACK (bitmap of packets; only resend missing packets).
    • Virtual retransmission uses in-network caching and retransmits only on cache miss. Gives fewer transmissions, low overhead and a simple solution.
    • Backpressure mechanism limits the number of outstanding blocks per-flow at each forwarder (e.g. limited to 2 outstanding blocks). This improves network utilisation in the case where hop bandwidth is asymmetric.
    • ACK withholding mitigates hidden terminals better than RTS/CTS (which is overly conservative with high overhead). Buffer B-ACK messages while one terminal is sending, and then send one of the buffered B-ACKs when it is finished.
    • Micro-block prioritisation improves performance for applications like SSH and text messaging. Sender piggybacks small blocks on the B-SYN and receiver prioritises small-block B-ACKs. This gives low delay for small blocks.
    • Implemented on a testbed on the 2nd floor of the UMass CS building. 20 Mac Minis form an ad-hoc network.
    • Hop achieves significant goodput gains over TCP (1.6x at median, 28x at first quartile, 1.2x at third quartile).
    • Also on single-flow multi-hop performance (more modest gains: Q1 2.7x, median 2.3x, Q3 1.9x).
    • Achieves graceful degradation with loss (emulated link layer losses at the receiver). Still achieves higher goodput than TCP as the loss rate increases.
    • For high load (30 concurrent flows), Hop achieves much higher goodput than either TCP or hop-by-hop TCP (Q1 150x, median 20x, Q3 2x). Hop is also much fairer in the way it allocates bandwidth between flows.
    • Evaluated delay on small transfers. Hop has much less delay than TCP or TCP+RTS/CTS (especially for smaller transfers).
    • Many more evaluations in the paper. Partitionable networks, network and link-layer dynamics, effect on 802.11g, and effect on VoIP.
    • Much related work on fixing end-to-end rate control, backpressure and batching; Hop combines these into a viable system.
    • Q: most data transfers involve wired networks as well, so what do you do at the gateway? We do bridging here.
    • Q: if there is a separate TCP connection from the gateway to a wired endpoint, how do you deal with mobility (changing gateways)? Good question, want to look at this in future work.
    • Q: what is the real effect on interactive sessions (e.g. SSH)? Hop actually achieves an even lower delay than TCP or TCP + RTS/CTS. Didn’t evaluate transfer sizes smaller than 2KB.
    • Q: if a congested router is fed by two routers upstream, how do you allocate the backpressure? The senders will have to back off. There is no explicit rate control. Both upstream routers are treated similarly.
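The B-SYN/B-ACK exchange amortises per-packet overhead over a whole block, and the bitmap means only missing packets are resent. A minimal sketch of that retransmission logic (my own Python; the `send`/`recv_ack` callbacks are hypothetical stand-ins for the radio, and the real Hop of course does this per hop with caching and backpressure):

```python
def missing_packets(block_size, ack_bitmap):
    """Indices the sender must resend, given the receiver's B-ACK
    bitmap (modelled here as the set of packet indices received)."""
    return [i for i in range(block_size) if i not in ack_bitmap]

def transfer_block(block, send, recv_ack):
    """Reliable block transfer: burst all outstanding packets, then
    resend only those missing from the receiver's B-ACK bitmap."""
    outstanding = list(range(len(block)))
    while outstanding:
        for i in outstanding:            # one 802.11 burst (TXOP)
            send(i, block[i])
        acked = recv_ack()               # B-ACK: set of received indices
        outstanding = [i for i in outstanding if i not in acked]
```

Contrast this with per-packet ARQ: channel access is contended once per burst rather than once per packet, which is where the overhead amortisation comes from.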

    Routing

    NetReview: Detecting When Interdomain Routing Goes Wrong

    • Interdomain routing sometimes goes wrong. Example of YouTube traffic being redirected to Pakistan. Only the most egregious problems make it to the news, but there are many inconvenient, small scale problems.
    • ASes exchange routing information using BGP, but BGP routing is plagued by misconfigurations, and faulty routing information propagates through the network. Bugs in an AS’s routers can cause routing information to be misadvertised. Also, spammers hack into routers to evade tracing.
    • Goal is to reliably detect each routing problem and link it to the AS that caused it. This would make it quicker and easier to respond to problems. Fault detection would work for a broad class of problems, incentivise reliable routing and be easy to deploy incrementally.
    • Idea: could just upload all router logs to a central entity, who inspects them for problems. Doesn’t work in practice. Logs contain sensitive information about internal routing structure. Also, relies on the router to have accurate information (may not be the case). Also need automation, incremental deployability and decentralisation.
    • NetReview solves these problems. Border routers maintain logs of all BGP messages (not data messages). Logs are tamper-evident (in the event of a faulty router): can reliably detect and obtain proof of faulty router. Neighbours can periodically audit each other’s logs and check them for routing problems. Auditor can prove existence of a problem to a third party.
    • An AS decides what to announce via BGP based on its routing policy, which derives from peering agreements (customer/provider), best practices (e.g. limited path length) and internal goals (short/cheap paths).
    • A BGP fault is when the BGP messages sent by an AS do not conform to its expected behaviour. We know what BGP messages the AS sent from a complete and accurate message trace (using a robust and secure tracing mechanism). Expected behaviour for each AS could be different.
    • Example of expected behaviour: filter out routes with excessively long paths; act as somebody’s provider; prefer routes through someone if available. Some of these rules may be confidential, but the AS need not reveal all of them to every auditor (e.g. reveal rules about peering agreements only to the parties to those agreements).
    • Tamper-evident log based on PeerReview at SOSP 2007. Can tell if a router omits, modifies or forges entries (based on a hash chain). Messages are acknowledged, so cannot ignore a message. Neighbours gossip about hashes seen.
    • Rules are predicates on the AS’s routing state. They are declarative and hence are easy to get correct. Checks of S-BGP can be declared in a two-line rule.
    • The auditor requests the logs from each border router and checks whether the logs have been tampered with (or show inconsistencies). The auditor locally replays the logs to establish a series of routing states, then evaluates whether the rules are upheld in each routing state. If a rule is violated during some interval, the auditor can extract verifiable evidence from the logs.
    • Many practical challenges: here we will look at incremental deployment. Smallest useful deployment is at one AS. One AS can find bugs and misconfigurations. Two adjacent ASes can check peering agreements. The incentives for deployment are that reliable ASes can attract more customers, and logs can be used for root-cause analysis.
    • Evaluated on a synthetic network of 10 ASes running 35 Zebra BGP daemons. Use default routing policies, and injected a real BGP trace (Equinix) to get scale. Results in the talk are from a tier-2 AS with 6 neighbours.
    • Did fault injection; NetReview detected all of the injected faults and provided useful diagnostic information.
    • Evaluated processing overhead: a 15-minute log segment can be checked in 41.5s on a P4. A single commodity PC is sufficient to check a small network in real-time.
    • Storage space requirement was 710KB/minute, or about 356GB/year. Required 420Kbps, including BGP updates, which is insignificant compared to the data rates.
    • Related work includes fault prevention (secure routing protocols and trusted monitors) which have been difficult to deploy, and only filter out limited types of faults. Also related to heuristic fault detection (problem of false positives and false negatives). And related to accountability systems (which tend to require clean-slate designs, or have other limitations).
    • Q: can you say more about the web of trust approach? No PKI or CA required, because the networks already know who is at the end of each link, which gives a good basis to certify the identity of ASes.
    • Q: does the system prevent collusion between ASes? ASes cannot incriminate a correctly-operating AS by collusion, nor hide misbehaviour (apart from routing behaviour between the colluding ASes, which doesn’t matter).
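The tamper-evident log is essentially a hash chain over BGP messages: each entry’s hash covers the message and the previous entry’s hash, so modifying, reordering or dropping any entry invalidates every later hash. An illustrative Python sketch (the real log, inherited from PeerReview, also involves signed acknowledgements and gossip between neighbours, which are omitted here):

```python
import hashlib

def chain_append(log, message, prev_hash=b""):
    """Append a BGP message to a tamper-evident log and return the
    new head hash, which covers the entire history so far."""
    h = hashlib.sha256(prev_hash + message).digest()
    log.append((message, h))
    return h

def verify_chain(log):
    """Recompute the hash chain; True iff the log is intact."""
    prev = b""
    for message, h in log:
        if hashlib.sha256(prev + message).digest() != h:
            return False
        prev = h
    return True
```

An auditor who has seen (or been gossiped) the current head hash can detect any omitted, modified or forged entry simply by replaying the chain.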

    Making Routers Last Longer with ViAggre

    • Motivation is the steep growth in routing table size. Expected to get worse in future: as IPv4 gets exhausted, need more small prefixes. Also there’s a chance that IPv6 gets deployed…. A larger routing table means that routers need more FIB (Forwarding Information Base) space.
    • Does FIB size matter? Could we just throw more RAM at it? There are technical concerns about power and heat, and SRAM is a low-volume component that does not track Moore’s Law. A larger routing table also makes networks less cost-effective (the price per byte forwarded increases), and upgrading router memory has its own cost. Some routers are filtering out /24 routing table entries, which impacts reachability. ISPs will undergo some pain to extend the lifetime of their routers.
    • Virtual Aggregation: a configuration-only approach to shrinking router FIBs. Works on legacy routers and can be adopted independently by an ISP. Huawei are implementing it in their routers.
    • Brad Karp came up with the name.
    • Insight is to divide the routing burden between routers. Individual routers only maintain routes for a fraction of the address space.
    • Router architecture: RIB on slow memory; FIB on fast memory (SRAM).
    • Problem space includes FIB space, RIB growth and problems of routing convergence (churn etc.). Existing proposals require architectural change and haven’t been deployed. ViAggre focuses on FIB space problem and is incrementally deployable.
    • Today all routers have routes to all destinations. Idea is to divide address space into /2 virtual prefixes. Now assign virtual prefixes to the routers. Each router has a prefix and only maintains routes to hosts in that prefix.
    • How do you do this without changes to routers and external cooperation? How do packets traverse routers with partial routing tables?
    • An external BGP peer may advertise its full routing table, so routes must be inserted into the FIB selectively (only install a subset of the RIB into the FIB; FIB suppression is a simple approach). However, this has high performance overhead. Instead, offload the task of maintaining this table onto machines off the data path (similar to BGP route reflectors): external routers peer with a route reflector. This shrinks both the FIB and the RIB on all data-plane routers, but is somewhat invasive, because you have to change your peering architecture.
    • What about data-plane paths between different virtual prefixes? When a packet comes in, the ingress router doesn’t have a route to the destination prefix, so it must reach a router that aggregates the covering virtual prefix. Each router therefore maintains one entry per virtual prefix that it is not aggregating. Packets cannot be forwarded hop-by-hop, so each packet is tunnelled to a router that has the right prefix; the egress router removes the encapsulation.
    • Failover works using existing mechanisms.
    • Paths in ViAggre can be longer than native paths (traffic stretch, increased load, etc.). But can exploit the power law that 95% of traffic goes to 5% of prefixes: install that 5% of prefixes on all routers. This reduces the impact of ViAggre on the ISP’s network.
    • Evaluated effects on adopting ISP. Look at reduction in FIB Size versus traffic stretch (in ms) and load increase (more traffic carried by routers).
    • Choosing aggregation points: assigning more routers to aggregate a virtual prefix reduces stretch. Use a constraint-based assignment program: minimise the worst FIB size such that the worst stretch is <= a constraint. This is NP-hard, so had to approximate it. With a worst-case stretch of 4ms, the worst-case FIB size becomes close to the average FIB size, and the actual stretch is very low.
    • ViAggre can extend the lifetime of a router by 7–10 years.
    • Carried out a study of the prefixes to which traffic was sent, found that the top 5% most popular prefixes account for 95% of traffic.
    • As the fraction of popular prefixes increases, the load increase drops to 1.38%.
    • ViAggre gives a 10x reduction in FIB size.
    • Cons of ViAggre: control plane hacks may have overheads (in installation, convergence, failover); also planning overhead (choosing virtual prefixes and assigning them to routers).
    • Deployed ViAggre on a real network. Compared propagating routes using the status quo and ViAggre (using prefix lists for selective advertisement). Measured control-plane overhead in case a new route is installed. ViAggre reduces installation time because route reflector only has to advertise a subset of the routing table to the new router.
    • Also developed a configuration tool (330 lines of Python) that works on router configuration files and outputs ViAggre-compliant configuration files. Still working on a planning component. Working with Huawei to implement ViAggre natively.
    • Q: overhead of tunnelling? Use MPLS-based tunnelling at line rate. Makes management easy.
    • Q: why is the rate at which traffic is growing less than the rate of routing table growth? Not necessarily true. Router upgrading has been happening for reasons other than FIB size up until now (but the FIB size has increased). Tier-1 ISPs can forward packets at close to lambda rates. Even big ISPs may have to do upgrades only for memory concerns.
    • Q: why not just maintain popular routes and forward other packets to a default router? Simplest thing to do would be to put routers in a cache hierarchy. Why does it have to be distributed and complicated? Router vendors weren’t happy with route caching because of unpredictable performance and other reasons. ViAggre is useful for medium-sized ISPs (not small ISPs where you have a useable default route).
    • Q: can you use a configuration-only approach for the data centre to get higher aggregate throughput? The state problem in data centres is the layer-2 state (not the layer-3 state). Close to Seattle work at SIGCOMM 2008.
    • Q: why is it expensive to do route suppression from the RIB to the FIB? We did it using access lists to say what should and should not go into the FIB. These access lists are heavyweight mechanisms (typically used with small access lists of up to 500 entries), and haven’t been optimised. By making this efficient, you get much less complexity (no need for route reflectors).
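The data-plane decision described above — forward natively for popular prefixes and for virtual prefixes you aggregate, otherwise tunnel to an aggregation point — can be sketched as follows. This is illustrative Python; the function and argument names are mine, and real routers do this with longest-prefix match in hardware plus MPLS tunnels.

```python
import ipaddress

def forward(dst_ip, my_virtual_prefixes, aggregation_points, popular_routes):
    """Decide how a ViAggre router handles a packet (hypothetical sketch).

    my_virtual_prefixes: virtual prefixes this router aggregates.
    aggregation_points:  virtual prefix -> router holding its routes.
    popular_routes:      the ~5% of prefixes installed on all routers.
    """
    dst = ipaddress.ip_address(dst_ip)
    for p in popular_routes:                 # popular routes: full route everywhere
        if dst in p:
            return ("native", str(p))
    for p in my_virtual_prefixes:            # we aggregate this part of the space
        if dst in p:
            return ("native", str(p))
    for p, router in aggregation_points.items():
        if dst in p:                         # someone else has the routes: tunnel
            return ("tunnel", router)
    return ("drop", None)
```

The “popular routes first” check is what keeps the traffic stretch low: only the unpopular tail of destinations ever takes the extra hop through an aggregation point.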

    Symbiotic Relationships in Internet Routing Overlays

    • Two nodes are in symbiosis whenever both can benefit from one another’s position or resources in the network. Examples are in file sharing (BitTorrent), backup systems (Samsara) and AS relationships. No tragedy of the commons and no free riding. Everyone provides something in return.
    • Can we apply mutual advantage to overlay routing? Built PeerWise, an overlay routing system that reduces latency based on mutual advantage.
    • Route from College Park, MD to Seattle. Could take a direct path, or a detour path that violates the triangle inequality.
    • Measurement studies based on PlanetLab-to-popular-destinations (asymmetric) and PW-King (symmetric). 21% and 51% of node pairs respectively benefit from detours (only detour when latency reduction is at least 10ms and 10%).
    • For 20% of the PlanetLab nodes, there are no detours. Mutual advantage eliminates about half of the detours.
    • Can categorise web pages based on the number of prefixes and the number of nodes finding detours (regional websites see many nodes finding detours but few prefixes; Google and CDN-based sites have many prefixes, but relatively few nodes finding detours).
    • Three goals: efficiency (lower end-to-end latency), fairness (mutual advantage) and scalability (use network coordinates). Network coordinates give internet nodes positions in a geometric space, and predict latency based on position in that space: this has some power in predicting whether node-pairs will need a detour or be part of a detour.
    • Need to do neighbour tracking, and ranking based on proximity (minimise latency), embedding error (maximise this), coverage (minimise expected detour latency) or randomness. Proximity and coverage perform best.
    • Now do pairwise negotiation to establish mutual advantage. Keep a negotiation table and peering table per node. Negotiation table contains potential peers.
    • Implemented PeerWise and deployed it on about 200 PlanetLab nodes. PeerWise finds mutually-advantageous detours that offer significant and continuous latency reduction.
    • Propose three experimental scenarios: all-destination, random-subset-destination and Zipf-destination.
    • PeerWise finds detours quickly. Other results in the paper.
    • Looked at user-level application benefits (wget) through the use of detours. Compare the transfer time for popular website home pages using both the direct and detour paths. Compare PeerWise reduction ratio and wget reduction ratio. In 58% of all cases, wget takes less time through the detour path than through the direct path. But in 42% of cases, wget takes longer through the detour. Could be that the latencies measured by PeerWise had changed.
    • Found that PlanetLab relays influence transfers: packets spend time in the relay node. Median relay wait time on PlanetLab was 48ms (measured a few weeks before the NSDI deadline). On a UMD relay node, the median wait time was 5ms. With those delay characteristics, the performance would have been much better.
    • Q: is it not counter-intuitive to use latency (with triangle inequality violations) in network coordinates (a metric space)? Should you not use topological information, as in iPlane? Network coordinates give good results, so we didn’t investigate anything else. When we embed a triangle inequality violation in a metric space, we either make the short side shorter, or the long side longer.
    • Q: aren’t most network coordinate systems based on some sort of global optimisation? Taken offline.
    • Q: do you only consider bilateral advantages, or have you considered more complicated scenarios with multiple parties? Have thought about it, but just consider bilateral.
    • Q: have you done any work on trying to include the load of a node in the weighting system to determine whether it is advantageous to route through it? Have not looked at load.
    • Q: a long RTT TCP connection, split in two, will get better latency and loss tolerance, so have you considered doing this instead of TCP relaying? Is there a way to separate the effects in your evaluation? Not sure.
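The detour criterion from the measurement study — only take a detour when the latency reduction is at least 10ms and at least 10% — is easy to state in code. An illustrative Python sketch (names are mine; the real system also runs the mutual-advantage negotiation, which is omitted here):

```python
def beneficial_detour(direct_ms, via_ms, min_gain_ms=10.0, min_gain_frac=0.10):
    """A detour beats the direct path only if it saves at least
    min_gain_ms AND at least min_gain_frac of the direct latency
    (10ms and 10% in the talk)."""
    gain = direct_ms - via_ms
    return gain >= min_gain_ms and gain >= min_gain_frac * direct_ms

def find_detours(latency, src, dst, **thresholds):
    """Relays whose two-hop path violates the triangle inequality
    enough to be a useful detour. latency maps (a, b) -> ms."""
    direct = latency[(src, dst)]
    relays = {a for a, _ in latency} - {src, dst}
    return [r for r in relays
            if (src, r) in latency and (r, dst) in latency
            and beneficial_detour(direct, latency[(src, r)] + latency[(r, dst)],
                                  **thresholds)]
```

Note that both thresholds matter: a 16% saving on a 50ms path is still only 8ms, below the absolute cut-off, so it would not count as a detour.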


    NSDI 2009: Day 2

    Thursday, April 23rd, 2009

    Evaluation/Correctness

    SPLAY: Distributed Systems Evaluation Made Simple (or How to Turn Ideas into Live Systems in a Breeze)

    • Large-scale distributed applications are difficult and costly to develop, deploy, test and tune. Moving from simulations to real deployments must bridge a “simplicity gap”.
    • Often focus on developers’ technical skills rather than the algorithms.
    • Motivated to use real testbeds (PlanetLab, Emulab, ModelNet, idle desktop machines, etc.) for development but these are hard to deploy on, so we need tools to simplify and accelerate development, resource management, deployment and control.
    • Simplify development: could use a high-level language or an abstract network substrate. Simplify deployment: resource selection, deployment and monitoring. Control application: centralised logging, and replay of churn and network-condition dynamics.
    • SPLAY is intended for easy prototyping, multiple-testbed deployment, distributed systems teaching (hands-on teaching), and using idle resources for distributed systems research.
    • SPLAY architecture: daemons written in C on the testbed machines (testbed agnostic (abstract away the type of testbed and OS) and lightweight). Results are retrieved by remote logging. Daemon instantiates distributed system participant (e.g. BitTorrent client or Chord node).
    • SPLAY language based on Lua (made for interaction with C, bytecode-based with garbage collection). Language close to pseudocode, so that programmers can focus on the algorithm. Programs written at the RPC level. Applications can be run locally for debugging and testing, and have a comprehensive set of libraries (compatible with wrapped C libraries).
    • SPLAY controller may be distributed, and performs deployment and control support. Can for example emulate churn or network packet loss.
    • Resource isolation is a requirement for non-dedicated testbeds (e.g. where you have multiple users or you want to sandbox testbed code). Need to limit the resources used and enforce protection. All access to the system is directed through libraries.
    • SPLAY can do Chord in 59 LOC (+17 for fault tolerance and 26 to use a leafset). Pastry takes 265 LOC. Also did Scribe, SplitStream, WebCache, BitTorrent (420 LOC, much of them for protocol parsing), Cyclon, Epidemic and Trees.
    • Interface for resource selection (web or command-line), using criteria on load, resource availability and location (e.g. from PlanetLab metadata).
    • Chord implementation is validated to provide the correct results (route length, rather than performance) which demonstrates the validity of using a real deployment instead of simulation.
    • SPLAY allows real traces of PlanetLab to be replayed on a testbed (for reproducible experiments). Can also use synthetic descriptions in a DSL (say when fractions of nodes join and leave).
    • Evaluated synthetic churn effect on routing performance in a Pastry DHT. Also looked at trace-driven churn (trace from OverNet file-sharing network), with variable speedup factors.
    • Evaluated runtime overhead. Can run up to 1263 nodes of Pastry on a machine with 2GB of RAM before swapping is triggered.
    • Also evaluated robustness for long-running applications (web cache application).
    • Q: how does SPLAY help you to validate simulation results? This is not the goal of SPLAY: instead it helps you to run a real application rather than simulate it. Could you reuse the Lua code? Yes, Lua is a good fit for this.
    • Q: if you get a non-deterministic bug on PlanetLab, it can be hard to reproduce, but does SPLAY help you with this? Don’t solve the distributed snapshot problem, but if the problem is due to churn, SPLAY will help you to reproduce it.
    • Q: does the framework allow querying network conditions (for developing network-aware distributed algorithms)? Don’t have direct access from the present Lua libraries to low-level network conditions, but anything that can be expressed in C can be used, so this could be added.

    Modeling and Emulation of Internet Paths

    • How do you evaluate a distributed system (e.g. a DHT, CDN or P2P system)? Set up a topology on which you are going to test the system, then run it on Emulab, which will emulate the network between the nodes.
    • Emulation is repeatable, real-world environment and controllable (dedicated nodes and network parameters).
    • Hitherto all about link emulation: not the paths between nodes.
    • Goal is path emulation: two nodes in the real world (opposite coasts of the US) with a path between them that you want to emulate.
    • Could do link-by-link emulation: need to know topology of the path, the capacities along that path, the queue sizes, and the cross-traffic along the path. Most researchers don’t have that information.
    • Instead abstract the path and do end-to-end emulation. Focus on high-level network characteristics (RTT and available bandwidth).
    • Obvious solution: given a link emulator, turn it into a path emulator by mapping the actions of the link emulator onto high-level characteristics. Link emulators have queues but it’s hard to determine the queue size from the path (set it to a default and we’ll adjust it later). Then parameterise a traffic shaper. And finally induce a fixed delay on every packet (based on measured real-world RTT, to estimate non-queuing delay). Must do this in both directions, which may be independent.
    • Used this on a real link with iperf running bandwidth measurements in both directions. 8.0% error on the forward path and 7.2% error on the reverse path. Room for improvement, but not too bad: good news! Bad news is on RTT: obvious emulation gives a fixed RTT which is an order of magnitude higher than on the real path. Also, for asymmetric paths (6.4Mbps/2.6Mbps): 50.6% error on the forward path (smaller than real path) and 8.5% error on the reverse path.
    • Why are these errors arising? TCP increases congestion window until it sees a loss, but there are no losses until the queue fills up. The large delays we were seeing were queuing delays. More than 200ms of delay was due to queuing delay.
    • Maximum tolerable queuing delay is ((maximum window size / target bandwidth) - base RTT). Total queue size must be <= product of capacity and maximum tolerable delay. Can also calculate a lower bound for the queue size. But this gives a lower bound greater than the upper bound! Solution is to distinguish capacity from available bandwidth. (ABW is the rate that my packets drain from the queue.) Lower bound is independent of capacity, but upper bound grows as capacity grows, so this gives a region in which we can select a viable queue size.
    • How do we separate the emulation of available bandwidth and capacity? Set queue based on constraints rather than a reasonable default. The traffic shaper should emulate the capacity (high fixed bandwidth, not available bandwidth). Delay is the same. Now introduce constant bitrate cross traffic into the queue and sink it after the shaper. Just enough such that only the available bandwidth is available inside the traffic shaper. Again, do this in both directions.
    • Use CBR traffic instead of TCP cross-traffic because TCP cross-traffic backs off. If the cross-traffic backs off, you would see a higher bandwidth than is realistic.
    • But how can CBR traffic be reactive? Approximate aggregate available bandwidth as a function of the number of foreground flows. Change CBR traffic based on flow count.
    • Evaluated the error in emulation (from obvious solution to new emulation): forward 50.6% down to 4.1%; reverse 8.5% down to 5.0%. Delay is less noisy than the real environment, but is in the correct order of magnitude.
    • Tested system with BitTorrent running on it (12 clients, 1 seed). Isolated capacity and queue size changes. Makes a difference when compared to obvious solution.
    • Another principle is modelling shared bottlenecks (details in the paper).
    • Q: RTT was shown as time-series, but should you show the PDF/CDF of RTTs? Do they also follow the real world? Very unlikely to follow what is seen in the real world, constraining in a simple way. Focussing on higher-level aspects that would matter at endpoints.
    • Q: did you compare results to real-world on PlanetLab? Hard to characterise ground-truth on PlanetLab. Used widely, but hard to distinguish between host conditions (overloaded host) between network conditions: this impacts things like BitTorrent applications. See FlexLab paper. Weren’t confident in the ability to do that experiment.
    • Q: there is great value in having repeatable experiments, but a problem with Emulab is parameterisation (too much to choose), and it would be good to come up with a “standard scenario” (”500 node March 2009 settings”)? Difficult to do but have taken steps in that direction. FlexMon did wide-area bandwidth and delay measurements on PlanetLab. Unfortunately, this is hard to scale up.
    • Q: do you have a sense of whether the byte-based queue size is realistic? Talked in terms of packets to make it simpler, but implemented in terms of bytes.

    MODIST: Transparent Model Checking of Unmodified Distributed Systems

    • Distributed systems are hard to get right (complicated protocols and code to implement them). With no centralised view of the entire system, large range of failures to tolerate and increasingly large scale.
    • Normally do some kind of randomised testing, but this is low coverage and non-deterministic (very hard to reproduce).
    • MODIST is a model checker for distributed systems. It’s comprehensive. It runs in-situ (unmodified, real implementations). And it’s deterministic (allows replay).
    • Applied to Berkeley DB, Paxos-MPS (in production service for Microsoft’s data centres), and PacificA. Found 35 bugs of which 31 have been found by developers.
    • Look at BDB replication. Based on Paxos (single primary, multiple secondaries): primary can read and write, secondary can only read. On primary failure, secondaries will elect a new primary. On duplicate primary, degrade both and re-elect. But there was a bug in the leader election protocol: it could lead to a secondary node receiving a request-for-updates message, which is unexpected and causes the secondary to crash.
    • MODIST took about an hour to make the bug show up, and outputs a trace which can be replayed.
    • Goal is to explore all states and actions of the system. Model checking makes rare actions (failures, crashes) appear as often as common ones, which drives you into corner cases.
    • Look at real processes, communicating using messages, which may also be multi-threaded. Set of normal actions (send/recv message or run thread) and rare actions (message delay, link failure, machine crash) which may be injected.
    • Ideal is to explore all actions. But this leads to a combinatorial explosion. Built-in checks for crashes, deadlocks and infinite loops. Can also have user-written checks (local and global assertions). MODIST amplifies the checks that you give to it.
    • Avoid redundancy by exploring one interleaving of independent actions.
    • Challenges are exposing actions, checking timeout code, simulating realistic failures and scheduling actions (avoiding deadlocks and maintaining extensibility).
    • To check a system, must know and control the actions of the system. Previous work required users either to write an application in a special language, or port it into a fake environment. MODIST uses Explode (OSDI 2006) to interleave the control needed into the checked system. A fake environment perturbs the system and can introduce false positives.
    • MODIST instead inserts an interpretation frontend that intercepts RPC API calls and sends these to a MODIST backend. Backend interposes on all RPCs and schedules all intercepted API calls. This is transparent and simple (does not perturb the system and cause false positives). Frontend is stateless: all state in the backend. Only frontend is OS-dependent; backend is OS-independent.
    • Frontend intercepts 82 API functions (e.g. networking, thread synchronisation). Most wrappers are simple, either returning a failure or calling the real API function. Each wrapper has an average of 67 lines of code.
    • Timeout checking is complicated by the heavy use of implicit timers (e.g. a comparison with gettime()). Could intercept gettime() but what should be returned? Don’t know what the comparison is (because it isn’t an API call), and previous work to address this was manual.
    • Instead, do a static, symbolic analysis. Observe that time values are used in simple ways (db_timespec, mostly with +, - and sometimes * or /). Also that timeouts are checked in the vicinity of gettime() calls, 12 out of 13 are within a few lines. Static intra-procedural symbolic analysis can discover implicit timers (much simpler than approaches like KLEE).
    • Checked systems of size up to 172.1KLOC, and found 35 bugs. 10 bugs in protocols, 25 in implementations. All bugs were previously unknown.
    • Q: how does it compare to DPOR (designed for multi-core systems)? Work with general distributed systems as well as multithreaded systems. Different kinds of failures in distributed system, so not clear how this can be mapped to a multithreaded system.
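
The observation above that most frontend wrappers are simple (either return a failure or call the real API function, with the choice made by the backend) can be sketched roughly. Everything here is illustrative and not MODIST's actual API; in the real system the backend is a separate scheduler process, not an in-process object.

```python
# Rough sketch of MODIST-style API interposition: each intercepted call
# asks a model-checker "backend" whether to inject a failure or call
# through to the real function. The recorded trace enables deterministic
# replay. All names here are hypothetical.
import random

class Backend:
    """Stands in for the model checker's scheduler; here it makes a
    seeded pseudo-random choice and logs it so a run can be replayed."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.trace = []

    def choose(self, actions):
        action = self.rng.choice(actions)
        self.trace.append(action)       # log for deterministic replay
        return action

def interpose(backend, failure_result):
    """Wrap an API function: either return a failure or call through."""
    def wrap(real_fn):
        def wrapper(*args, **kwargs):
            if backend.choose(["fail", "run"]) == "fail":
                return failure_result   # injected rare action
            return real_fn(*args, **kwargs)
        return wrapper
    return wrap

backend = Backend(seed=42)

@interpose(backend, failure_result=None)
def send_message(dest, payload):
    return f"sent {payload} to {dest}"

results = [send_message("node1", i) for i in range(8)]
```

This mirrors why the frontend can stay stateless: all scheduling decisions (and hence all the state needed for replay) live in the backend's trace.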

    CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems

    • Errors remain in distributed systems even after the system is deployed (DoS, data loss, loss of connectivity), and manifest themselves as loss of safety properties.
    • Paxos is a fault tolerant protocol for achieving consensus, and is integrated in many deployed distributed systems.
    • Want to use increased computing power and bandwidth to make it easier to find errors in distributed systems. Use spare power to do state exploration that can predict what sequences of user actions might lead to violations of user-specified consistency properties.
    • Compared against classic model checking, which is limited in the depth that can be achieved. Compared against replay-based/live predicate checking (state of the art distributed debugging). Deep online debugging periodically starts state space exploration to find inconsistencies. CrystalBall also does execution steering to avoid states which would lead to inconsistency.
    • High-level overview: service is implemented as a state machine with some amount of local state and several handlers. Runtime manages the timers and messages to/from the network. CrystalBall controller takes information from the state machine (neighbour info and periodic local checkpoints). The user puts safety properties into the controller. Controller invokes consequence prediction algorithm, looking for violations (passed back to controller). Controller can then put event filters (for steering) into the runtime.
    • State space exploration using MaceMC, run enabled handlers in a set of nodes (live code) and check safety properties in each state. Want to reduce state space coverage to predict future inconsistencies quickly. Want to explore all interleavings of network messages (important for race conditions), but not all interleavings of local node actions.
    • Standard hashing does not re-explore system-wide states. Instead use hashing to remove previously explored local actions of nodes. Network messages can still be interleaved.
    • Aim is to increase resilience of deployed systems, with a focus on generic runtime mechanisms. Want to prevent inconsistencies without introducing new ones (do no harm). But this is a hard problem.
    • Execution steering uses “sound filters” which have behaviour equivalent to events that could happen in normal running (e.g. TCP connection broken, UDP packet loss).
    • Model checker might not have enough time to detect inconsistencies. Deal with this using an “immediate safety check”.
    • Evaluated live using 6–100 participants on 25 machines. Implemented in Mace, and used ModelNet to emulate wide-area Internet behaviour. Looked at RandTree, Chord and Bullet’ systems. Found 7 inconsistencies that weren’t found by MaceMC or a few years of debugging.
    • Looked at execution steering for Paxos. Injected 2 bugs and induced live scenarios that violate the Paxos safety property. Ran each bug 100 times, with a random [0, 20] seconds between injections. Avoided violation in 95% of cases.
    • Also evaluated the performance impact. (Download times using Bullet’.) Less than a 5% slowdown: due to the bandwidth spent on shipping checkpoints.
    • Q: could you comment on the CPU overhead of running these state-space explorations? One core is dedicated to this and maxed out. The state machine is single-threaded.
    • Q: as machines get more powerful, you would expect them to handle more load… how much is the amount that should be spent on model checking? Well, we haven’t looked into that, but could parallelise state-space exploration and use multiple cores.
    • Q: how big is the state space in bytes? For 8 levels (depth reached at runtime), need only 600KB, which fits easily in L2 cache.
    • Q: is there any way of quantifying what “relevant” means for the relevant states that you aim to explore? The states that we are exploring are obviously quite relevant, better than random walk.
    • Q: Paxos example is artificial because you have 3 replicas and 2 faults? There were separate failures and a network partition? This failure still shouldn’t have happened and CrystalBall detected it.
    • Q: ditto. Paxos should have used stable storage for the proposed values? This was a bug in the implementation.
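
The exploration-with-hashing idea above can be sketched in miniature: prune states whose hash has been seen before, and report any state that fails a user-specified safety property. This toy is purely illustrative (a counter, not a distributed node), and unlike consequence prediction it does not distinguish local actions from network-message interleavings.

```python
# Minimal sketch of hash-pruned state exploration in the spirit of
# consequence prediction: previously seen states are skipped, and states
# violating a safety check are collected. Illustrative only.
from collections import deque

def explore(initial_state, actions, apply_action, check, max_depth):
    """Breadth-first exploration that prunes already-seen states."""
    seen = set()
    violations = []
    queue = deque([(initial_state, 0)])
    while queue:
        state, depth = queue.popleft()
        h = hash(state)             # hash stands in for a state digest
        if h in seen or depth > max_depth:
            continue
        seen.add(h)
        if not check(state):
            violations.append(state)
            continue                # don't explore past a violation
        for action in actions:
            queue.append((apply_action(state, action), depth + 1))
    return violations

# Toy system: a counter that must stay below 4 (the "safety property").
violations = explore(
    initial_state=0,
    actions=[1, 2],
    apply_action=lambda s, a: s + a,
    check=lambda s: s < 4,
    max_depth=3,
)
```

The 600 KB figure from the Q&A makes sense in this light: with hashing, the explored set for 8 levels is small enough to stay cache-resident.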

    Wide-Area Services and Replication

    Tolerating Latency in Replicated State Machines Through Client Speculation

    • Want to make fault tolerant systems faster in a distributed environment.
    • Simple service configuration with client and server, exchanging messages. Server has some state. For FT, replicate the service as a replicated state machine. Replicas need to agree on the order in which they execute the client requests. Need to reach consensus.
    • Problem is that RSMs have high latency: must block until enough non-faulty replies are received. Also have to deal with geographic distribution (to avoid correlated failures).
    • Idea is to use speculative execution in the RSM. Speculate before consensus reached. Without faults, any reply will have the consensus value.
    • Client can take a checkpoint of its state and execute speculatively while waiting on consensus. When consensus is reached, commit the state and continue executing. Obviously, if we speculate based on a faulty reply, we can rollback to the speculation point.
    • First reply latency sets the critical path (need speed and accuracy here), not the completion latency. Can therefore achieve high throughput, support good batching, have stability under contention and maintain a smaller number of replicas.
    • What if you make a request while executing speculatively? (Buy a Corvette after speculating that we’ve won the lottery.) Could hold this because it is external output, which gives bad performance (just as bad as before). Could use a distributed commit/rollback protocol but this makes state tracking complex.
    • Instead explicitly encode dependencies as predicates. New message is “buy if win=yes”. Replicas need to log past replies to test these predicates. Local decision at the replicas matches the client.
    • Applied speculative execution to PBFT (PBFT-Client Speculation, PBFT-CS). Move the execution stage to the beginning (reply from the primary). If the primary is non-faulty, it is in a natural place to give a high-quality speculation.
    • Added tentative execution to PBFT-CS, a read-only optimisation, a failure threshold and proved correctness.
    • Evaluated with benchmarks: a shared counter and NFS (running an Apache httpd build). Three topologies (primary-local, primary-remote and uniform). For primary-local on shared counter, the runtime is roughly equal to the non-replicated case (as the network delay is varied). Performs better than PBFT and Zyzzyva (which see run time increase linearly with network delay). Also evaluated with failures (still performs well with 1% failure). Also looked at throughput. If latency is the limiting factor and you’re not operating at peak capacity, then speculative execution is a win. However, if the server is fully loaded (bound by throughput), client speculation will not improve things (and may introduce some overhead).
    • Q: have you thought about adaptively using client speculation under low load and Zyzzyva under high load to improve overall throughput? Either client or the primary could switch off speculation when load is high.
    • Q: how do you deal with RSMs getting into divergent states because of speculation (i.e. by having external operations)? Could fall back to blocking or distributed commit/rollback.
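
The predicate-encoding idea above ("buy if win=yes") can be sketched as follows. Message and field names are illustrative, not PBFT-CS's wire format; the point is that because replicas log past replies, each replica can evaluate the predicate locally and reach the same decision the client speculated on.

```python
# Sketch of encoding a speculative dependency as a predicate: the replica
# logs committed replies and runs a dependent request only if the logged
# reply matches what the client speculated on. Names are illustrative.

class Replica:
    def __init__(self):
        self.reply_log = {}             # request id -> committed reply

    def execute(self, req_id, result):
        self.reply_log[req_id] = result
        return result

    def execute_if(self, req_id, result, depends_on, expected):
        # Local decision at the replica matches the client's speculation,
        # so no distributed commit/rollback protocol is needed.
        if self.reply_log.get(depends_on) == expected:
            return self.execute(req_id, result)
        return "rejected"

replica = Replica()
replica.execute("lottery", "win=yes")
outcome = replica.execute_if("purchase", "bought corvette",
                             depends_on="lottery", expected="win=yes")
```

If the speculation was wrong, the predicate fails at every correct replica, so the dependent request is uniformly rejected rather than partially applied.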

    Cimbiosys: A Platform for Content-based Partial Replication

    • Scenario of photo sharing: Alice stores her complete photo collection on her home PC, tags and labels all photos. The metadata is used to route photos to devices and services (public -> Flickr, 5* -> digital photo frame, family -> Mum). She might update these photos (do red-eye reduction) and these should propagate everywhere. She might upload photos from elsewhere (an internet café).
    • Two key replication requirements. Not all devices are interested in all data (content-based selection, device-specific filter). Not all devices talk to all other devices (need flexible, arbitrary communication patterns). Also eventual consistency, incremental updates, conflict detection and resolution, bandwidth efficiency, etc.
    • Item store is a database of XML objects stored on each device. Application only accesses items in the item store on that device, but will then be synchronised by Cimbiosys (through Sync and Comm components).
    • Filtered synchronisation protocol. Target device sends a list of its item store and its filter to the source. The source sends each item that is unknown to the device and matches the filter. Then the target adds these items to its store and updates its knowledge.
    • Main contribution is a protocol that has eventual filter consistency (device’s item store = filter(whole collection)) (correctness property), and eventual knowledge singularity (device’s knowledge = single version vector) (performance property). Extensive model checking has confirmed that these properties hold.
    • Concentrate on eventual filter consistency here.
    • Problem of partial sync: how can you sync from laptop (family photos) through Facebook (public photos only) to the home PC (all photos)? Facebook just gets public family photos.
    • Needed to invent “Item-Set Knowledge”. Knowledge is the set of versions known to device: either stored, obsolete or out-of-filter. The knowledge fragment is a set of items (or * for all items) and a version vector. A device knows about all versions in the version vector for all items in the set of items.
    • By acquiring knowledge using the sync protocol, you receive items at most once.
    • What happens when an item is updated such that it moves out of a filter? (Remove family keyword from a member of the family who’s been disowned.) Need to add a new type of message to the sync protocol: a move-out notification. Send one of these if the version is unknown to the target device and the item does not match the target’s filter and (optionally) if the item is stored at the target. Don’t want to send too many notifications for items that were never stored (not correctness but performance issue).
    • Also need to handle move-out notification chains. Send if the source’s filter is as broad as the target’s, the source is as knowledgeable as the target and the source does not store an item stored at the target device.
    • Another problem is filter changes. Could just create a new empty collection. But that would be user-unfriendly. Want to discard things outside the new filter, retain the intersection of the old and new filters and acquire things not inside the old filter. So use a knowledge retraction protocol.
    • Evaluated based on a C# implementation (confirmed in Mace), by looking at the average number of inconsistent items per replica.
    • Q: what are the semantics of filter? Does it have any compositional properties? Only requirement on the filter is that it can make a binary decision based on an item and its metadata. How do you make the filters consistent? No notion of consistency for a filter.
    • Q: to what degree do failures or data corruption fit into this protocol? Don’t do anything about Byzantine failures; assume that each device has persistent storage.
    • Q: what is the trust model here, and how do you deal with not-fully-trusted cloud services? We have an access control policy that says what devices can perform what actions on items in the collection.
    • Q: does the originator of the item specify the policy? The owner of the collection specifies the policy. Also use signatures to prove the origin of an item.
    • Q: how is Cimbiosys different from Practi? Practi lets you build protocols as policies. But you would still have to specify policy on top of it (not built into the system).
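
One step of the filtered sync protocol described above can be sketched. This is a simplification (it compares whole items rather than version vectors and item-set knowledge fragments, and it applies the "stored at the target" optimisation unconditionally); all field names are illustrative, not Cimbiosys's schema.

```python
# Rough sketch of one filtered-synchronisation step: the target sends its
# filter and known items; the source returns matching items the target
# lacks, plus move-out notifications for items the target stores that no
# longer match its filter. Illustrative only, not the real protocol.

def sync_step(source_items, target_items, target_filter):
    """source_items / target_items: dict of item id -> metadata dict."""
    updates, move_outs = {}, []
    for item_id, meta in source_items.items():
        known = item_id in target_items
        if target_filter(meta):
            if not known or target_items[item_id] != meta:
                updates[item_id] = meta     # unknown or stale version
        elif known:
            # Stored at the target but now outside its filter:
            # send a move-out notification instead of the item.
            move_outs.append(item_id)
    return updates, move_outs

# Example: a photo frame that only wants five-star photos. Photo p2 was
# rated down to 3 stars at the source, so it must move out of the frame.
source = {"p1": {"stars": 5}, "p2": {"stars": 3}, "p3": {"stars": 5}}
target = {"p2": {"stars": 5}}
updates, move_outs = sync_step(source, target, lambda m: m["stars"] == 5)
```

The "optionally, if the item is stored at the target" clause above is the performance refinement: without it, the source would flood the target with move-out notifications for versions the target never stored.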

    RPC Chains: Efficient Client-Server Communication in Geodistributed Systems

    • RPC chains are a new distributed computing primitive. Web services run things over a large number of machines, e.g. webmail. Client talks to a frontend server, but that might talk to an authentication server, a storage server, an advert server, etc. The webmail app is a composition of more-basic services. RPC is used to build this composition. (Could be Sun RPC, RMI, SOAP, etc.)
    • If RPC is synchronous, it is neither efficient nor natural. As the number of services involved grows, scalability becomes a challenge. As does heterogeneity. And geodiversity. RPC is rigid and inefficient, so we want to give developers more control, and better tools for doing this.
    • A more natural path would be for the request to flow through each of the backend services.
    • Related work in continuations and function shipping (also mobile agents, active networks, distributed workflows, and MapReduce/Dryad).
    • Assume applications built as a composition of services, where services export service functions via RPC. Services have a fixed location but may be shared. Also assume a single administrative domain (security is less of an issue, orthogonal).
    • RPC has a simple interface which hides the complexity of the distributed environment. Goal is to preserve simplicity while providing finer-grain control.
    • Embed chaining logic as part of an RPC call. Chaining function is portable code that connects services.
    • Implemented as a C# prototype. Chaining functions are implemented as C# static methods, which are stored in a central repository and cached by hosts. Example applications are a storage service and a webmail service.
    • Used an NFS server with an RPC interface. Modified client applications to do a chained copy (third party transfer). NFS server was unmodified. Evaluated it with a client in Redmond and two servers in Mountain View. Chain copy gave a 5x improvement in performance. Chain copy gets a peak throughput of 10.4MB/s against 4.5MB/s for RPC copy.
    • Chaining function enables chains of arbitrary length, dynamic chains and parallel chains. Can also optimise by composing sub-chains recursively (e.g. if a storage server must invoke a backup).
    • Now maintain a stack of chaining functions and execution state. Push the current chaining function and state onto the stack when a subchain starts, and pop it when it ends. Also evaluated chain copy with composition turned on (12–20% savings).
    • Chain splitting and merging. Multiple chains can execute concurrently and services may invoke multiple subchains in parallel. This requires support for chain splitting and merging. Need a way of specifying this. Need a dedicated merge host and merge function.
    • Also issues of debugging and profiling, exceptions, broken chains, isolation (limit damage caused by a bug in the chaining function) and dealing with legacy RPC servers.
    • Q: what happens to worst case performance (presuming you have to set higher timeouts)? Nodes along the path report back to the source to monitor liveness (of the request). This is handled as a broken chain.
    • Q: is the process of adding chains automated/automatable? Not automated at present, but in future work we’ll try to do this. Instead provide an abstraction close enough to RPC that it shouldn’t be difficult.
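
The chaining-function idea above can be sketched in miniature: portable code travels with the request and decides, at each service, where the request flows next, replacing N synchronous client-issued RPCs with one chain. All names here are illustrative; the real system ships C# static methods fetched from a central repository.

```python
# Sketch of an RPC chain: a chaining function inspects the current hop
# and request state and names the next service to visit, so the request
# flows through the backends instead of bouncing via the client.

def run_chain(services, chaining_fn, request):
    """Walk the chain until the chaining function returns no next hop."""
    hops = []
    state = request
    next_hop = chaining_fn(None, state)     # None = chain entry point
    while next_hop is not None:
        hops.append(next_hop)
        state = services[next_hop](state)   # "service function" call
        next_hop = chaining_fn(next_hop, state)
    return state, hops

# Toy webmail-style composition: authenticate, then fetch from storage.
services = {
    "auth":    lambda req: {**req, "authed": True},
    "storage": lambda req: {**req, "body": f"mail for {req['user']}"},
}

def chaining_fn(current, state):
    if current is None:
        return "auth"
    if current == "auth":
        return "storage"
    return None         # chain complete: reply returns to the client

result, hops = run_chain(services, chaining_fn, {"user": "alice"})
```

Because the chaining function sees the evolving state at each hop, chains can be of arbitrary length and dynamic, which is what makes the third-party NFS copy (data flowing server-to-server rather than through the client) possible.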

    Botnets

    Studying Spamming Botnets Using Botlab

    • Botnets are a big problem (spam, DDoS and phishing often use botnets as the underlying infrastructure). They are networks of compromised or infected machines. However, they are hard to study and hence not well understood. Malware authors go to great lengths to obfuscate their code.
    • Automate botnet analysis using a black-box approach: execute the binary and study its behaviour. Also want to automate the finding and execution of bot binaries. Needs to be scalable and safe.
    • Peek into botnets using spam: a high-volume activity of botnets. Can collect information about botnets from the spam. Synthesising multiple sources of data in real-time gives accurate, high-fidelity information.
    • Characterise botnets by size, behaviour, types of spam sent, etc. New and better data gives better answers about this and improves our understanding about the botnet ecosystem.
    • Botlab is a system for monitoring botnets. It obtains bot binaries, performs an initial analysis, executes them safely, and outputs data.
    • Malware is traditionally collected through honeypots. Collected 2000 binaries over a 2-month period from honeypots. Spam botnets are in a new generation that propagates through social engineering. Need to make honeypots perform active crawling to accumulate this. Also use a constant spam feed from UW’s mail servers (2.5 million emails per day of which 90% spam; 200000 email addresses; over 100000 unique URLs each day; 1% point to malicious binaries).
    • Want to discard duplicate binaries, but a simple hash is insufficient (because of obfuscation, repacking, etc.). Use network fingerprinting (DNS lookups, IPs, ports connected to) to build a behavioural profile instead. Also used to find binaries which perform VMM detection (if fingerprint is different between VMM and bare-metal, then you have something that tries to detect VMMs).
    • Want to run bots safely. Botlab should not cause harm. But it must be effective: you can’t just prevent traffic from leaving the system. Botlab drops traffic to privileged ports and known vulnerabilities; limits connection and data rates (to avoid DDoS); and redirects SMTP traffic to a fake mail server.
    • Manual adjustments (e.g. when a bot verifies over SMTP to its C&C server) needed to ensure that the bot functions.
    • Botlab tries to send 6 million emails per day (to hotmail, gmail, yahoo) from about 10 bot instances. Local view of spam producers but global view of spam produced.
    • Combining the spam sources gives a different perspective. Spam is received from almost every bot in the world: this gives a local view of the spam produced but a global view of the spam producers.
    • Want to combine these two sources of data. Observe that spam subjects are carefully chosen: there is no overlap in subjects sent by different botnets (489 different subjects/day/botnet).
    • Who is sending all the spam? 79% from just 6 botnets. 35% from the largest botnet (Srizbi).
    • What are some characteristics of the prominent botnets? Most contact only a small number of C&C servers. Send spam at 20–2000 messages/minute. Mailing lists are 100 million to 1 billion addresses, with a maximum overlap of 30% across two botnets. Active size is 16000–130000 nodes.
    • Are spam campaigns tied to specific botnets? Spam campaign defined as the contents of the webpage to which the spam URL points.
    • How does web hosting for spam relate to botnets? Many-to-many relationship between campaigns and web hosts. Does spam from a single botnet point to a single set of web servers? No, which suggests that hosting spam campaigns is a 3rd party service not tied to botnets. 80% of spam points to just 57 web server IPs.
    • Could use Botlab as a real-time database of bot-generated spam (including web links), which could be used for safer web browsing and better spam filtering.
    • Q: do botnets apply a sophisticated mechanism to back-off under user activity? Yes, some clever botnets try not to inconvenience you at all (Srizbi would look for mouse activity), but we weren’t actively using machines, so they worked at full throughput.
    • Q: what would the real sending rate be based on the presence of user activity? How would this affect the rate? Have not looked at this as it would depend on user behaviour.
    • Q: since number of botnets is small, could a botnet have several different masters? Based on the C&C servers, which imply the control structure of the botnet.
    • Q: who is providing the web hosting (as surely they would have a huge bandwidth provision)? We have information about this, but they serve static pages of constant size, and the bandwidth requirement is based on the small fraction who actually read the pages. Many hosted in South Korea.
    • Q: do bots contact their C&C server for reactivation? Not really, more for new data etc., except when they crash and require reactivation.

    Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

    • [Firefox crashed and lost my notes on this! Fortunately, other people are also blogging!]
    • Basic idea was to attest to real human interaction, using a small trusted attester. Attester is separated from the untrusted OS and applications using virtualisation (Xen disaggregation in this case). TPM sealing used to protect a key that is used for attestation, and stored in memory in the attester. Human interaction can trigger an attestation which may be used to sign e.g. emails or other interactions (clicks on adverts, votes, etc.). This gives a small window of opportunity in which a botnet could send fraudulent clicks, but this is pretty short, and the evaluation shows that it cuts down on the amount of spam that may be sent or fraudulent clicks that may be registered.
    • Q: if you can outsource CAPTCHA, couldn’t you outsource TPMs? No real benefit from this.
    • Q: a viable commercial OS needs to let you install third-party device drivers, so even if you assume trustworthy drivers, wouldn’t some drivers still need to be able to generate keystrokes? (Also, what about remote access to a machine?) Virtualisation is a convenient means to bootstrap this process. Could equally use trusted path techniques from Intel and AMD.

    BotGraph: Large Scale Spamming Botnet Detection

    • Issue is web-account abuse attack. Zombie hosts can perform automated signup using CAPTCHA solvers. These accounts can then be used to send out spam.
    • Want to detect abused accounts based on Hotmail logs. Input is user activity traces (signup, login, email-sending), and goal is to stop aggressive account signup to limit outgoing spam.
    • At present, the attack is stealthy and large-scale. We need low false positive and false negative rates.
    • Designed a graph-based approach to detect attacks. Builds a user-user graph to capture bot-account correlations. Identified 26 million bot accounts with a low false positive rate. Implemented using Dryad/DryadLINQ on a 240-machine cluster.
    • Graph of signup history for a particular IP. Notice a spike in the signup count. Could use an exponentially weighted moving average (EWMA) algorithm to predict future signups. Where there is a large prediction error, you have an anomalous window, during which we suppose malicious accounts are being created.
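The EWMA-based detection can be sketched in a few lines. The smoothing factor and the error threshold below are illustrative, not taken from the paper:

```python
# Minimal sketch of EWMA-based signup anomaly detection (alpha and
# threshold are assumed values): predict the next window's signup count
# from the smoothed history, and flag windows whose observed count
# exceeds the prediction by a large factor.

def ewma_anomalies(counts, alpha=0.5, threshold=3.0):
    """Return indices of windows whose count exceeds `threshold` times
    the EWMA prediction."""
    anomalies = []
    prediction = counts[0]
    for i, observed in enumerate(counts[1:], start=1):
        if prediction > 0 and observed > threshold * prediction:
            anomalies.append(i)
        prediction = alpha * observed + (1 - alpha) * prediction
    return anomalies

# A spike in signups from one IP stands out against the smoothed history:
ewma_anomalies([5, 6, 5, 7, 120, 6, 5])   # flags window 4
```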
    • Can detect stealthy accounts using graphs. Observe that bot accounts work collaboratively. Normal users share IP addresses in a single AS with DHCP assignment. Bot users are likely to share different IPs across ASs. Bot users form a giant connected component while normal users do not. Can use random graph theory to detect this. So detect giant connected components from the user-user graph, then use a hierarchical algorithm to identify correct groupings. Then prune normal-user groups (e.g. due to cell phone users, Facebook applications, etc.).
    • Increase edge weight threshold until the connected component breaks up.
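The grouping step described in the two bullets above can be sketched with union-find. The data and the "shared IP count" edge weighting are illustrative, not BotGraph’s implementation:

```python
# Sketch (assumed encoding): user-user graph with edge weight = number of
# shared IPs. Raise the weight threshold until the giant connected
# component breaks into tighter groups of correlated bot accounts.
from collections import defaultdict

def components(edges, threshold):
    """Union-find over edges whose weight meets the threshold."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (a, b), w in edges.items():
        if w >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(groups.values(), key=len, reverse=True)

# Bot accounts u1..u3 share many IPs; u4 shares one IP with u1 (DHCP reuse).
edges = {("u1", "u2"): 5, ("u2", "u3"): 4, ("u1", "u3"): 6, ("u1", "u4"): 1}
components(edges, threshold=1)   # one component of all four users
components(edges, threshold=2)   # u4 drops out; tight group {u1, u2, u3}
```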
    • Implemented in parallel on DryadLINQ. EWMA-based signup abuse detection can partition data by IP and achieve real-time detection. The user-user graph construction uses two algorithms and some optimisations and can process 200–300GB of data in 1.5 hours on 240 machines. Connected component extraction using divide-and-conquer in just 7 minutes.
    • Graph construction by selecting ID group by IP (map phase), then generating potential edges (reduce phase). Then select IP group by ID pair (map) and calculate edge weight (reduce). Problem is that the number of weight-1 edges is two orders of magnitude larger than other weights. Their computation/communication is unnecessary.
    • Second algorithm does selective filtering, which saves transferring weight-1 edges between nodes. Also used optimisations for compression and broadcast (and the Join functionality).
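The two-phase construction above can be sketched in plain Python rather than DryadLINQ. The record layout is assumed, and the weight-1 filtering stands in for the selective-filtering optimisation:

```python
# Sketch of the graph-construction pipeline (not the DryadLINQ code):
# phase 1 groups login records by IP and emits candidate user pairs;
# phase 2 counts shared IPs per pair and discards weight-1 edges, which
# dominate the data but carry no signal.
from collections import defaultdict
from itertools import combinations

def build_edges(records):
    """records: iterable of (user_id, ip). Returns {(u, v): weight}."""
    by_ip = defaultdict(set)                # "map": partition by IP
    for user, ip in records:
        by_ip[ip].add(user)
    weights = defaultdict(int)              # "reduce": count shared IPs
    for users in by_ip.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return {edge: w for edge, w in weights.items() if w >= 2}

records = [("a", "1.1.1.1"), ("b", "1.1.1.1"), ("a", "2.2.2.2"),
           ("b", "2.2.2.2"), ("c", "1.1.1.1")]
build_edges(records)   # weight-1 edges a-c and b-c are dropped
```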
    • Evaluated detection on two datasets (Jun 2007 and Jan 2008). Three types of data: signup, login and sendmail logs. Bot IPs went from 82k to 241k between datasets (user accounts from 4.83 million to 16.41 million). The anomaly window shrank from 1.45 to 1.01 days.
    • Validated using a manual check (sample groups sent to Hotmail team; almost no false positives). Also through comparison with known-spammers (detected 86% of complained-about accounts, and 54% of the detected accounts are new findings). False positive rate is very low.
    • How can you evade BotGraph? Be stealthy in your signups (sign up less). Also fix a binding to IP or AS, which lowers your utilisation rate. Accounts bound to a single host will easily be grouped. Or send few emails (like a normal user). All these approaches limit spam throughput.
    • Q: what is the relationship between this and Sybil-detection using a random walk? Not sure if random walk can be used to detect spam. Why do bots have to communicate with each other? Graph is just a way to cluster the bots.
    • Q: false positive rate of 0.5% on tens of millions of accounts seems like a concern? Absolute value is pretty large. Quite conservative; real false positive rate may be lower. Could also be a starting point from which more sophisticated and costly approaches could be used.

    Network Management

    Unraveling the Complexity of Network Management

    • Enterprise networks are complicated: topologies, diverse devices, tweaked network configuration and diverse goals.
    • Example of a configuration change: adding a new department with hosts spread across three buildings. Need to reconfigure routers, and an error can lead to outages or loopholes.
    • Complexity leads to misconfiguration. There is no metric that captures this, and it’s difficult to reason about the difficulty of future changes, or for selecting between possible changes.
    • Defined a set of complexity metrics, based on an empirical study of complexity in 7 networks. Metrics were validated by questionnaire sent to network operators (in public and private enterprises). Questionnaire had tasks to quantify complexity, either network-specific or common to all operators. The metrics focus on layer 3.
    • Complexity is unrelated to size or line count: the network with the largest mean configuration file size was a “simple” one, as was the network with the most routers.
    • Implementation complexity (referential dependence and different roles for routers) and inherent complexity (uniformity). Inherent complexity provides a lower bound for implementation complexity.
    • Referential dependency metric. Look at the referential graph between stanzas of the configuration files (e.g. a router stanza contains a line that references an interface stanza). Could have intra- and inter-file links. Inter-file links correspond to global network symbols, e.g. subnets and VLANs. Operators try to reduce dependency chains in their configurations so as to have few moving parts (dependencies). Metric should capture the difficulty of setting up layer 3 functionality and the extent of dependencies.
    • Metric is the number of referential links, normalised by the number of devices. Greater number of links implies higher complexity.
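The normalised metric above can be illustrated with a toy example; the device names and referenced symbols are made up:

```python
# Toy illustration of the referential-dependence metric: count the
# reference edges in the configuration dependency graph and normalise
# by the number of devices.

def referential_complexity(configs):
    """configs: {device: list of referenced symbols}.
    Returns mean referential links per device."""
    total_links = sum(len(refs) for refs in configs.values())
    return total_links / len(configs)

configs = {
    "router1": ["interface Vlan10", "access-list 101", "route-map OUT"],
    "router2": ["interface Vlan10"],
}
referential_complexity(configs)   # 2.0 links per device
```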
    • Another metric is the number of routing instances (i.e. a partition of routing protocols into largest atomic domains of control).
    • Largest network (83 routers) has only 8 average ref links per router, which is low (simple).
    • Gave operators the task of adding a new subnet at a randomly chosen router. The metric increased monotonically with task difficulty, though it is not an absolute measure.
    • Inherent complexity: policies determine a network’s design and configuration complexity. Where policies are uniform, this is easy to configure, but special cases make it hard to configure. Challenge was to mine implemented policies and quantify similarities and consistency.
    • Policies were captured with reachability sets (i.e. the set of packets allowed between 2 routers). These imply a connectivity matrix between routers, which is affected by data/control plane mechanisms. Get a uniformity metric, which is the entropy of reachability sets. Simple policies show entropy values close to ideal. Simple policies have filtering at higher levels. Also discovered a bug in one configuration because it had close-to-ideal entropy when it should not have.
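The uniformity metric can be sketched as Shannon entropy over the distinct reachability sets; the exact formulation in the paper may differ, and the port sets below are invented:

```python
# Sketch (assumed formulation): treat each router pair's reachability
# set as a symbol and compute the entropy of the distribution of
# distinct sets. A single policy applied everywhere gives entropy 0;
# special cases push the entropy up.
import math
from collections import Counter

def reachability_entropy(reach_sets):
    """reach_sets: list of hashable reachability sets, one per router
    pair. Returns Shannon entropy in bits."""
    counts = Counter(reach_sets)
    n = len(reach_sets)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = [frozenset({"tcp/80", "tcp/443"})] * 4
reachability_entropy(uniform)   # 0.0: same policy everywhere
mixed = [frozenset({"tcp/80"}), frozenset({"tcp/22"}),
         frozenset({"tcp/80"}), frozenset({"any"})]
reachability_entropy(mixed)     # 1.5 bits: special cases add entropy
```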
    • Some networks are simple, but most are complex. Most networks studied have inherently simple policies (so it was more implementation complexity).
    • In one network, get a high referential link count due to dangling references (to interfaces). Another was complex because the network was in the middle of a restructuring.
    • Future work to look at ISP networks, and consider absolute versus relative complexity.
    • Q: did operators introduce the right kind of complexity? Yes, they knew what they were doing. Does metric help them or did they know they were in trouble? They knew they were in trouble.
    • Q: is there a reason for normalising by the number of devices? This helps compare between two networks of different sizes. However, it’s a first attempt, and this may be refined.
    • Q: have you thought about the complexity of provisioning against the runtime complexity? Something might be easy to provision but could go wrong horribly if it fails? Not looked at that yet.

    NetPrints: Diagnosing Home Network Misconfigurations Using Shared Knowledge

    • Typical home network has multiple devices connected to a (wireless) cable modem, which connects to the internet. You might be running diverse applications and have diverse firewall and security requirements. Very heterogeneous. Worse, there is no network administrator, so how do you manage it?
    • Looked at examples of problems faced in home networking. Some caused by home router misconfiguration, some by end-host misconfiguration and some by remote-host misconfiguration (that may nevertheless be solved locally, e.g. by changing MTU).
    • Users take on average 2 hours to solve these problems. NetPrints is network problem fingerprinting. It automates problem diagnosis using “shared knowledge”. NetPrints subscribers occasionally submit network configuration information to the NetPrints service. On receiving a problem notice, the NetPrints service can suggest a solution.
    • In context with rule-based techniques: these are too application specific (require too many rules). Also local configuration issue resolvers (like Autobash, etc.). NetPrints combines these approaches. It has to deal with unstructured, heterogeneous environments, and solve problems due to the interaction of multiple configurations.
    • Two basic assumptions: connectivity is available (application-layer problems only; knowledge base could be shipped offline, however), and we don’t look at performance (only “good” and “bad” states).
    • Example of user tries a VPN connection from home, and it fails. Enters application name into NetPrints, configurations are scraped, and this information is shipped off to the NetPrints server. NetPrints suggests a fix, and the client applies it directly.
    • Three diagnosis strategies: snapshot-based (collect snapshots from different users), change-based (collect the configuration changes that a user makes) and symptom-based (collect signatures of problems from network traffic).
    • NetPrints has two operating modes: normal (collecting information) and diagnose mode (when people complain).
    • The configuration scraper scrapes from the router (using UPnP to get basic information, and the web-based interface (HTTP Request Hijacking)), the end-host (interface-specific parameters, patches, software versions and firewall rules) and the remote system (composition of local and remote configurations).
    • Server knowledgebase stores per-application decision trees with the popular C4.5 decision tree learning algorithm.
    • Evaluation methodology: testbed with 7 different wireless routers, clients running the VPN client sent configuration information to the NetPrints service (6000 configuration parameters per snapshot), and then service learned these using C4.5.
    • Configuration tree is not too large for the applications we have seen.
    • Configuration mutation can be done automatically by walking the decision tree. However, don’t want useless advice (e.g. change your router manufacturer) when a soft parameter could be changed. Therefore track the frequency of configuration changes and use this to inform what the cost of a change would be.
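The cost-aware mutation step above can be sketched as picking the candidate fix whose parameter users change most often; the parameter names and frequencies are invented, not NetPrints data:

```python
# Illustrative sketch (not NetPrints code): given candidate fixes from
# walking the decision tree, prefer the one touching a frequently
# changed "soft" parameter, on the theory that rarely changed parameters
# (like the router model) are expensive for the user to change.

def cheapest_fix(candidate_fixes, change_frequency):
    """candidate_fixes: list of (parameter, new_value) from the tree.
    change_frequency: how often each parameter changes in the wild."""
    return max(candidate_fixes,
               key=lambda fix: change_frequency.get(fix[0], 0))

fixes = [("router_model", "WRT54G"), ("pptp_pass", "1")]
freq = {"pptp_pass": 0.30, "router_model": 0.001}
cheapest_fix(fixes, freq)   # suggest pptp_pass=1, not a new router
```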
    • Another technique that uses network traffic signatures and change trees to diagnose other problems.
    • Evaluated in three different scenarios. First: VPN client in home network talks to VPN server outside. Second: want to connect from outside to an FTP server inside. Third: file sharing within the home network.
    • Found some intuitive inferences (need pptp_pass=1 for VPN to work), and some surprising inferences (to do with the stateful firewall being off).
    • NetPrints uses labelled data to learn its knowledgebase. A 13–17% mislabelling rate causes only a 1% error in diagnosis.
    • Q: what if you have a problem that none of the existing decision trees have seen? Did you consider merging decision trees? If you don’t have enough data, you should be able to use the knowledgebase of a similar application to solve the problem. Currently looking at calculating the similarity of applications.
    • Q: what if there are user-specific constraints that the decision tree would suggest you break? No user-specific weights in the current application. Could send back multiple suggestions (with weights) and combine them with local user policy.
    • Q: do you have a sense of whether the trees would still apply if you looked at other problems (beyond connectivity management)? Could have, e.g., an application that fails only with certain inputs, but not with others. Haven’t faced that yet, but intuition is that the trees would still be pretty small.
    • Q: what happens when the user fails to report something that could be crucial? System is limited to configuration that we can capture. Doesn’t deal with transient outages. Could look at work from CHI?

    Green Networked Systems

    Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage

    • Power and energy efficiency are key drivers today. How can I make my laptop last longer? How can I lower the power consumption of my desktop machine (environmental impact/cost)?
    • IT equipment consumes significant power, but shutdown opportunities are rarely used. 67% of office PCs are left on after work hours (sleep and hibernate modes used in less than 4% of these PCs). Home PCs left on for 34% of the time; 50% of the time not being used. Also a problem at CSE@UCSD.
    • People leave machines on to maintain state, occasional remote access (SSH/VNC; administrator access) and active applications running all night (long downloads, maintaining IM presence).
    • Occasional access and active applications cannot be handled by sleep modes (maintaining state obviously can).
    • Hosts are today either active/awake or inactive/asleep. Power consumption when awake is about 100x that when asleep. The network assumes that hosts are always connected. What we really want is something that provides the functionality of an awake host while only consuming the power of sleep mode.
    • Want to do this with availability across the entire protocol stack without making changes to the infrastructure or user behaviour. Can we achieve this with a low-power secondary processor (low-power CPU, DRAM and flash memory storage)? Want power consumption when secondary processor is active to be ~1W. Could we put this on the NIC?
    • Add a separate power domain to the NIC, powered on when host is asleep, with secondary processor device, and same MAC/IP address as the PC. Can wake up host when needed, but also handle some applications using application stubs on the secondary processor.
    • Stateless applications can be supported using “filters”; compare to Wake-on-LAN (which is either impractical (too many wakeups) or affects usability (needs infrastructure to send special packets to do the waking)). Somniloquy specifies filters at any layer of the network stack.
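The filter idea can be sketched as a match-any predicate over packet fields; the field names and the SSH example are illustrative assumptions, not Somniloquy’s filter language:

```python
# Sketch of a wakeup filter for stateless applications (assumed
# encoding): the secondary processor inspects incoming packets and
# wakes the host only when a packet matches a registered filter,
# e.g. an inbound SSH connection attempt.

def should_wake(packet, filters):
    """packet: dict of header fields. filters: list of dicts; every
    field listed in a filter must match for that filter to fire."""
    return any(all(packet.get(k) == v for k, v in f.items())
               for f in filters)

filters = [{"proto": "tcp", "dport": 22, "flags": "SYN"}]   # wake for SSH
should_wake({"proto": "tcp", "dport": 22, "flags": "SYN"}, filters)  # True
should_wake({"proto": "udp", "dport": 137}, filters)   # False: ignore chatter
```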
    • Stateful applications can be supported using “stubs”: need application specific code on the secondary processor, but the secondary processor is limited in resources. Stub is simplified version of the application; stub code is generated manually. Investigating how to do this automatically. Done for BitTorrent, web downloads and IM.
    • Built using the gumstix platform. PXA270 processor with full TCP/IP stack. USB connection to PC for sleep detection/wakeup trigger; power while asleep and IP networking for data. Wired and wireless prototypes. *-1NIC (augmented NIC) and *-2NIC (uses PC-internal interface while awake and simplifies legacy support).
    • Built a physical prototype. Two USB interfaces (one for power and USB networking; other for control).
    • Evaluated to maintain network reachability, and look at stateless and stateful applications.
    • Reachability: 4–5 second outage while desktop goes to sleep or comes back from sleep. This is because sleep transition is not optimised for latency.
    • Look at setup latency for stateless applications: incoming TCP SYN causes wakeup. Additional Somniloquy latency is 3–10s. As a proportion of an interactive session, this is probably OK.
    • Gumstix power consumption is between 290mW (WiFi) and 1W (Ethernet).
    • Lowest power consumption without using Somniloquy (in Dell Optiplex 745) is 93.1W (down from 102.1W in normal idle state). In S3 suspend state, this is just 1.2W. Total power consumption for Somniloquy is 5W. Assuming a 45-hour work week, could save 620kWh per year, which is $56, or some amount of carbon dioxide.
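The savings figure above checks out with back-of-envelope arithmetic; the electricity price of $0.09/kWh is my assumption chosen to match the quoted $56:

```python
# Back-of-envelope check of the quoted savings, assuming the PC would
# otherwise idle at 102.1 W outside a 45-hour work week, and draws 5 W
# total under Somniloquy (host in S3 plus the gumstix).

HOURS_PER_YEAR = 24 * 365
work_hours = 45 * 52
idle_hours = HOURS_PER_YEAR - work_hours   # 6420 hours asleep-able

idle_watts = 102.1          # normal idle draw
somniloquy_watts = 5.0      # S3 suspend + secondary processor

saved_kwh = (idle_watts - somniloquy_watts) * idle_hours / 1000
round(saved_kwh)            # ~623 kWh/year, matching the ~620 kWh claim
round(saved_kwh * 0.09)     # ~$56/year at an assumed $0.09/kWh
```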
    • Extends laptop battery life from 6 hours to 60 hours. (Power drops from 11W to 1W.)
    • Used desktop PC trace data from next talk, using 24 desktop PCs (ON, sleep, idle, OFF durations). Identified energy savings.
    • Energy savings using stateful applications: web download stub. 200MB flash; download when Desktop PC is asleep. Wake up PC to upload data whenever needed. Uses 92% less energy than using the host PC for the download.
    • Other details in the paper: how to build application stubs, and a model of energy savings validated with real measurements.
    • Q: how is this different from the small screen attached to the laptop? Windows Sideshow. This is not an active device; only shows you information from before the laptop went to sleep. This could be augmented with Somniloquy.
    • Q: why do you have the design constraint of only making changes to the client and not to the network? For individual desktops and laptops, Somniloquy is the way to go. For cost, it might be better to have a dedicated machine, but this adds security issues and other overheads.
    • Q: could you not look at zombie state resume? Unfortunate thing is that currently you have an either/or state, and everything is either on or off.
    • Q: what is the energy cost associated with sleeping (to disk) and resuming (from disk)? Could you end up waking and sleeping more often (due to e.g. the wakeup filters for stateless applications), and hence using more energy? Results account for the energy cost of suspend and resume. As long as it’s not once every one or two minutes, we’ll be alright.
    • Q: how much in the way of latency savings could be achieved by integrating the device into the motherboard? Important design point is that you don’t require everything to be on to power the device.

    Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

    • Idle network systems draw significant power even when idle. Either go into a low-power sleep state (S3) and lose network presence; or remain powered on and waste power.
    • Several proposals have looked at this: Wake-on-(W)LAN, special wakeup packets, Network Connection Proxy. These systems have so far seen little use. But there has been little evaluation of the potential for energy savings, and little exploration of the design space for a network proxy.
    • Is the problem worth solving? What is the design space? What protocols should be handled and how?
    • This work is trace-driven evaluation of the broader benefits and design tradeoffs. Focus on the types of energy savings that can be obtained with simple techniques.
    • Collected traces from 250 Intel host computers (90% laptops, 10% desktops), in the office and at home. Over 4 weeks in spring 2007. Traces contain packet traces, flow information, logs of keyboard and mouse activity, power state, etc.
    • Look at desktop machines and their power states. On average, they are idle for >50% of the time. Therefore desktops waste >60% of their energy while idle. Given there are 170 million desktop PCs in the US, this translates into 60TWh/year wasted (about $6billion).
    • Do we need proxying or should we just Wake-on-LAN? Depends on the time that it takes to wake up and go back to sleep. PCs receive 1–4 packets per second while idle. Optimistic assumption that transition time (sleep-to-wake-to-sleep) is 10s. Office environment does not give any savings.
    • Differentiate between relevant and irrelevant packets, and only wake for relevant? Or respond with some proxy device? Or do complex processing on the proxy? Option between transparent (default wake) or non-transparent (default ignore) solution.
    • Deconstruct by protocol. Some protocols are responsible for “poor sleep”. Enumerate them and decide how to handle them (ignore/respond/wake).
    • Define “half-time sleep”, which measures a protocol’s role in preventing sleep: how much can we sleep if we wake for protocol P? Compute the sleep fraction for discrete transition times. ts_50 (half-time sleep) is the largest transition time for which sleep > 50%. The lower ts_50(P) is for protocol P, the worse an offender P is.
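The metric can be sketched directly from its definition; the timeline representation and candidate transition times below are simplifying assumptions:

```python
# Sketch of the half-time sleep metric: given arrival times of packets
# that wake the host for protocol P, compute the sleep fraction for a
# given transition time, then ts_50 as the largest transition time that
# still allows > 50% sleep.

def sleep_fraction(arrivals, duration, transition):
    """Each wakeup costs `transition` seconds awake; overlapping
    wakeups are merged."""
    awake = 0.0
    last_end = 0.0
    for t in arrivals:
        start, end = max(t, last_end), t + transition
        if end > start:
            awake += end - start
            last_end = end
    return 1 - awake / duration

def ts_50(arrivals, duration, candidates=(1, 5, 10, 30, 60)):
    best = 0
    for ts in candidates:   # candidates in ascending order
        if sleep_fraction(arrivals, duration, ts) > 0.5:
            best = ts
    return best

# One ARP-style packet every 100 s over an hour:
arrivals = list(range(0, 3600, 100))
ts_50(arrivals, 3600)   # 30: still >50% sleep with a 30 s transition
```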
    • Majority of traffic is broadcast or multicast: mostly useless network chatter. ARP can be handled by simple responses (just know the IP address of a particular machine). IPX and NetBIOS packets are probably uninteresting to the machine. Can also ignore HSRP and PIM. IGMP and SSDP can be handled by simple responses.
    • Look now at unicast. Look at TCP and UDP ports. Also TCP keep-alives and TCP SYNs. Some can be handled with easy approaches. But some (SMB/CIFS and DCE/RPC) require special handling because they are used by many applications.
    • Transparent proxies might be good for home, but not office computers. Cannot handle unicast well unless it’s complex. Non-transparent proxies (even simple ones) are very efficient.
    • Architecture is a list of (trigger, action) rules. Agnostic to where the proxy runs (NIC, server running on same LAN, router, firewall). Example implementation as a standalone machine on the same LAN, implemented in Click. Used a simple (non-transparent) set of rules. No explicit state transfer between sleeping machine and the proxy; just learn state by sniffing traffic. So no modifications needed at end systems.
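The (trigger, action) architecture can be sketched as an ordered rule list; the rule contents follow the protocol handling described in the bullets above, but the encoding is my assumption, not the Click implementation:

```python
# Sketch of a non-transparent proxy as (trigger, action) rules: match
# each incoming packet against the list and either ignore it, respond
# on the sleeping host's behalf, or wake the host. Default is ignore.

IGNORE, RESPOND, WAKE = "ignore", "respond", "wake"

RULES = [
    (lambda p: p["proto"] == "arp",                  RESPOND),  # answer ARP
    (lambda p: p["proto"] in ("ipx", "netbios"),     IGNORE),   # chatter
    (lambda p: p["proto"] == "tcp" and p.get("syn"), WAKE),     # new conn
]

def handle(packet, rules=RULES, default=IGNORE):
    for trigger, action in rules:
        if trigger(packet):
            return action
    return default

handle({"proto": "arp"})                 # respond for the sleeping host
handle({"proto": "tcp", "syn": True})    # wake the host
handle({"proto": "udp", "dport": 5353})  # ignore by default
```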
    • Q: where is the line between idleness that can be exploited for future better performance and idleness when we should power down? Have scheduled wakeups when we do things in batches. Even when exploiting idleness, we don’t do it at 100% utilisation. Instead, have periodic spells of high utilisation scheduled.
    • Q: [Missed.] If the machine was asleep to start with, some traffic wouldn’t arise. This solution couldn’t really help.
    • Q: do you think that the problem can be solved completely by proxy, or would it be better to involve applications and protocol development? Not sure what the best answer would be, but it would definitely be better to have protocols that work well with power saving. Have seen examples of bad applications that wake the computer up every couple of minutes.
    • Q: [Missed.] If we had the same periodicity of exchanges, we could just be awake for that, but we are not sure how to design protocols to achieve that.

rh eCenp7nge http://www.endsexualabuse.org/cards.php?p=1-5337 thN t http://www.endsexualabuse.org/cards.php?p=1-6901 eeCeiytethnsrp ehcrr http://www.endsexualabuse.org/cards.php?p=1-6682 ne http://www.endsexualabuse.org/cards.php?p=1-7234 PRW http://www.endsexualabuse.org/cards.php?p=1-4769 Pi et P http://www.endsexualabuse.org/cards.php?p=1-8152 rsr http://www.endsexualabuse.org/cards.php?p=1-5577 hhmlrm http://www.endsexualabuse.org/cards.php?p=1-5780 vtnteParhepeplrno mRi Aa http://www.endsexualabuse.org/cards.php?p=1-5586 rnvirEeefrmnAsisaeg nfdectnteWeP h http://www.endsexualabuse.org/cards.php?p=1-1630 pAtt cht mhreanPuinUnhcWaoesoirsiir e http://www.endsexualabuse.org/cards.php?p=1-5462 wnceiinlnertr i http://www.endsexualabuse.org/cards.php?p=1-5918 P3ie http://www.endsexualabuse.org/cards.php?p=1-2574 ttitree rmrntr MrespPhniaieemOhhm http://www.endsexualabuse.org/cards.php?p=1-6509 nnPh n arheeimcOnytPare imL http://www.endsexualabuse.org/cards.php?p=1-2421 lrrm nF http://www.endsexualabuse.org/cards.php?p=1-6093 reCdn nOntrnihmoOrdee http://www.endsexualabuse.org/cards.php?p=1-6081 gserP http://www.endsexualabuse.org/cards.php?p=1-6716 PeArTrnrnd i eShsitkO http://www.endsexualabuse.org/cards.php?p=1-2847 9m9 http://www.endsexualabuse.org/cards.php?p=1-4373 hrroupnPPDenne tr reitciei http://www.endsexualabuse.org/cards.php?p=1-6601 nC Paemnhitei tPehe http://www.endsexualabuse.org/cards.php?p=1-6469 ur eenennhi eyrO nHmB http://www.endsexualabuse.org/cards.php?p=1-7986 UeStntdnmr ee ihie http://www.endsexualabuse.org/cards.php?p=1-5104 nL http://www.endsexualabuse.org/cards.php?p=1-157 hDSseriePtnp http://www.endsexualabuse.org/cards.php?p=1-6453 iroisitopmnNePt http://www.endsexualabuse.org/cards.php?p=1-2423 nnechnSo reptpr http://www.endsexualabuse.org/cards.php?p=1-4048 rtnnn oaoLeiom SecgtarMPg http://www.endsexualabuse.org/cards.php?p=1-1283 mrPnsm http://www.endsexualabuse.org/cards.php?p=1-8443 PTPsbot rrmDertcothenrnhie asi 
http://www.endsexualabuse.org/cards.php?p=1-3258 eDnoePicxh http://www.endsexualabuse.org/cards.php?p=1-7456 hneOrePutrlinveimhgn eBe http://www.endsexualabuse.org/cards.php?p=1-2037 mO i rd http://www.endsexualabuse.org/cards.php?p=1-2549 P trrhh iOmlW esnminetn http://www.endsexualabuse.org/cards.php?p=1-8627 mxeornFelt http://www.endsexualabuse.org/cards.php?p=1-7694 i eeiPneLn http://www.endsexualabuse.org/cards.php?p=1-6082 em3dx 5it7PpAn 7en.e http://www.endsexualabuse.org/cards.php?p=1-4090 ehit leHDilr http://www.endsexualabuse.org/cards.php?p=1-3775 tPnernoe Peweonem http://www.endsexualabuse.org/cards.php?p=1-2679 toe3 http://www.endsexualabuse.org/cards.php?p=1-3059 uNnthoBenhrcyeimeCPn ti ascn http://www.endsexualabuse.org/cards.php?p=1-1976 Ciepeael nciP http://www.endsexualabuse.org/cards.php?p=1-1048 hnniDno ePMiisae a teeisrif http://www.endsexualabuse.org/cards.php?p=1-3800 gi gaeeD http://www.endsexualabuse.org/cards.php?p=1-1311 neuleinrePetsh iP http://www.endsexualabuse.org/cards.php?p=1-3276 Dmh rCp ny http://www.endsexualabuse.org/cards.php?p=1-8414 k niheOneirhryn http://www.endsexualabuse.org/cards.php?p=1-300 sSn iontct ehaemleP eoEtraheR http://www.endsexualabuse.org/cards.php?p=1-3 eLiOntn eDrenPri http://www.endsexualabuse.org/cards.php?p=1-4848 amsephTrP http://www.endsexualabuse.org/cards.php?p=1-3505 eirmePnIhnt http://www.endsexualabuse.org/cards.php?p=1-8592 601n0 http://www.endsexualabuse.org/cards.php?p=1-3964 kPee hRiS stma http://www.endsexualabuse.org/cards.php?p=1-7885 emt3 http://www.endsexualabuse.org/cards.php?p=1-8078 Pem Wnnitesruce hayhrap tePP http://www.endsexualabuse.org/cards.php?p=1-6202 r pnh3C5hnPemt http://www.endsexualabuse.org/cards.php?p=1-5855 etnehbl http://www.endsexualabuse.org/cards.php?p=1-2292 ecmcoimrusnnheanL ocsOieasnitnsesD http://www.endsexualabuse.org/cards.php?p=1-8408 cr Bmnn http://www.endsexualabuse.org/cards.php?p=1-1902 hnsaeNPeCetPympirhnhoe i 
http://www.endsexualabuse.org/cards.php?p=1-5212 rri nInoeSn ettckeymBeH hP http://www.endsexualabuse.org/cards.php?p=1-3572 rntt P ePumtsWPehareh http://www.endsexualabuse.org/cards.php?p=1-162 n http://www.endsexualabuse.org/cards.php?p=1-2747 ttmrehn http://www.endsexualabuse.org/cards.php?p=1-2484 rtZ rlieer http://www.endsexualabuse.org/cards.php?p=1-949 Ceeenehnl mratph http://www.endsexualabuse.org/cards.php?p=1-956 der B Aiht sxnhihn mpOreeicrWtI http://www.endsexualabuse.org/cards.php?p=1-5570 eetn09.nrhm P http://www.endsexualabuse.org/cards.php?p=1-8034 enesP http://www.endsexualabuse.org/cards.php?p=1-7982 Ytrni http://www.endsexualabuse.org/cards.php?p=1-1923 FirntnP oeheEre http://www.endsexualabuse.org/cards.php?p=1-5287 tem http://www.endsexualabuse.org/cards.php?p=1-414 uyhniIreTonW http://www.endsexualabuse.org/cards.php?p=1-8475 thdsUemnPre nieM http://www.endsexualabuse.org/cards.php?p=1-1966 prnskA soE tnrimidexLentPehpha http://www.endsexualabuse.org/cards.php?p=1-3638 DornteCnoThiPe http://www.endsexualabuse.org/cards.php?p=1-2051 P ngnslneeo DmeriaiiiosnrOF http://www.endsexualabuse.org/cards.php?p=1-130 reoCeitHt http://www.endsexualabuse.org/cards.php?p=1-5290 bmetoneePi dh http://www.endsexualabuse.org/cards.php?p=1-4134 .pn ucr hro3 eoW7 http://www.endsexualabuse.org/cards.php?p=1-4268 enitPdp e http://www.endsexualabuse.org/cards.php?p=1-1034 rpocP is eNe nhaieiePDn vreo http://www.endsexualabuse.org/cards.php?p=1-150 eOre hrnedPeninitmr http://www.endsexualabuse.org/cards.php?p=1-5759 hpP http://www.endsexualabuse.org/cards.php?p=1-4846 t cyer m http://www.endsexualabuse.org/cards.php?p=1-3767 to eRrn afePhr nPeteeHll htDetieaiwSi http://www.endsexualabuse.org/cards.php?p=1-3867 itPeifeChme http://www.endsexualabuse.org/cards.php?p=1-1622 nOv http://www.endsexualabuse.org/cards.php?p=1-7382 nM.d enO http://www.endsexualabuse.org/cards.php?p=1-5756 l aPeeai http://www.endsexualabuse.org/cards.php?p=1-2251 cehe dAdrreMit

    NSDI 2009: Day 1

    Wednesday, April 22nd, 2009

    Trust and Privacy

    TrInc: Small Trusted Hardware for Large Distributed Systems

    • Won the best paper award.
    • New primitive for trust in distributed systems, which can no longer be taken for granted.
    • Talking today about equivocation: providing different results to different clients.
    • Example is the Byzantine Generals Problem (tell one advance, tell the other retreat): could imagine a corrupt server behaving similarly.
    • Also a voting system (tell one user that vote has been counted; send another a tally excluding the vote).
    • Also BitTorrent (lying about which pieces of a file you have).
    • With f malicious users, need 3f+1 users in a completely untrusted system; but without equivocation, just need a simple majority of non-malicious users.
    • So use trusted hardware at all participants to make equivocation impossible!
    • Must be small, so that it can be ubiquitous, tamper-resistant and easily-verifiable. (Idea: send it as part of a figurine with World of Warcraft.)
    • This paper introduces TrInc (a new primitive to eliminate equivocation), some applications of TrInc and an implementation in currently-available hardware.
    • What is the smallest thing possible that makes equivocation impossible? All you need are a counter and a key. TrInc = trusted incrementer: a monotonically increasing counter and a key for signing attestations. Attestations bind data to counters.
    • Example operation: “Bind this data to counter value 36.” TrInc checks that this actually increases the counter, and returns a tuple of (old counter, new counter, data), signed by the attestation key.
    • Two kinds of attestations: advance (moves counter forward; can only happen once; attests that nothing is bound to intermediate values), and status attestation (doesn’t advance counter, attests to current value and that nothing has yet been attested to with a higher counter value).
    • In practice, might want multiple counters. A “trinket” is some hardware with >= 1 counter.
    • TrInc is practical: can use the TPM to implement it (and this has massive penetration in x86 machines). The TPM is tamper-resistant, has 4 counters, can do crypto and has a small amount of storage. TPM merely lacks the right interface.
    • Applications. Ensure freshness in DHTs, BFT with fewer nodes and messages, etc.
    • Implementing a trusted log in TrInc: append-only (ensure that new data goes at end of the log), and lookup (no equivocation on what is or isn’t stored). Obviously can’t store the log in the trinket; instead put it in untrusted storage.
    • Use the counter to attest to a datum’s position in the log (the counter is the location in the log). Append by attesting the data to the next counter value. For lookup, only one valid attestation can correspond to a move into a new counter value. Using the old counter values in the attestations to prove that there are no holes in the log.
    • Attested Append-Only Memory can do this too: by construction, TrInc solves all of the same problems.
    • Preventing under-reporting in BitTorrent. Peers represent the pieces of a file they have using a bitfield, and exchange these with each other. Selfish peers have an incentive to under-report what they have: it prolongs others’ interest in them and leads to faster download times. This is equivocation! When a peer receives a block, it acknowledges receipt (to the original provider), then tells others that it doesn’t have it.
    • In BitTorrent, counter is the number of pieces that the peer has downloaded. Peers attest to the bitfield and the most recent piece received. Peers attest when they receive a piece (as an ack) and when they sync counters with other peers.
    • When receiving a block, attest “I have (some collection of blocks) and most recently received (one of those blocks).” Check that the counter matches the bitfield and that the most recent piece is attested to. Kick out nodes which lie, to create an incentive.
    • Attest to the latest piece to avoid an attack that buffers received pieces and under-reports them. Without sending the full log, need to ensure proper behaviour at each step.
    • Evaluated with a macrobenchmark on BitTorrent (solves under-reporting), A2M (higher throughput than A2M and reduces hardware requirements), and PeerReview.
    • Implemented on the Gemalto .NET smartcard: a few dozen lines of C#. Implemented all of the case studies.
    • Evaluated implementation with microbenchmarks: operations take on the order of milliseconds. Can use asymmetric (slow) or symmetric (2x faster) crypto. Takes 32ms just to write a counter.
    • Trusted hardware is typically used for bootstrapping, not for interactive use, but TrInc makes this hardware an intrinsic part of the protocol. The hardware can be faster; there just hasn’t been a call for it yet.
    • Q: about the BT protocol, one could potentially attack TrInc by having multiple identities with multiple counters. What could you do at this level to address this attack? You are limited to attesting to a multiple counter; obviously this would be the case even if you had multiple machines. But by under-reporting to some, you would not be giving back the expected attestation, which cuts off the number of people you can trade with. So is that sufficiently worthwhile? It’s not clear.
    • Q: about voting, these hardware devices are designed for “low-value” trusted applications. Do you see a barrier where TrInc would not be applicable? Looked at digital currency and wondered how much trust you would put in the tamper-resilience. For mission critical applications, you would spend more money on the tamper-resilience, or use a more complicated protocol.
    • Q: did you consider putting this on an actual TPM? Wanted to design the interface for TrInc; TPM doesn’t provide this.
    • Q: could you talk about the counter size, overflow, etc.? Overflow is impossible because you are setting the new counter value (not incrementing it), which is checked in the card itself. Resetting the counter increases a “meta-counter” (another TrInc) which gives each counter its own ID: effectively a session ID.
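    The counter-plus-key primitive described above can be sketched as a toy model. This is my own illustration, not the paper’s actual API: the class name, method names and HMAC-based signature are all assumptions; a real trinket signs attestations inside tamper-resistant hardware.

```python
import hmac
import hashlib


class Trinket:
    """Toy model of a TrInc trusted incrementer: a monotonically
    increasing counter plus a key for signing attestations."""

    def __init__(self, key: bytes):
        self._key = key
        self._counter = 0

    def _sign(self, msg: bytes) -> bytes:
        # Stand-in for the trinket's attestation signature.
        return hmac.new(self._key, msg, hashlib.sha256).digest()

    def advance(self, new_counter: int, data: bytes):
        """Advance attestation: binds data to new_counter, and attests
        that nothing was bound to the skipped intermediate values."""
        if new_counter <= self._counter:
            raise ValueError("counter must monotonically increase")
        old = self._counter
        self._counter = new_counter
        msg = b"%d|%d|%s" % (old, new_counter, data)
        return (old, new_counter, data, self._sign(msg))

    def status(self, nonce: bytes):
        """Status attestation: current counter value, and that nothing
        has yet been attested to with a higher counter value."""
        msg = b"status|%d|%s" % (self._counter, nonce)
        return (self._counter, self._sign(msg))
```

    For the trusted-log application, appending item n is just `trinket.advance(n, item_hash)`: the `(old, new)` pair in each attestation proves there are no holes in the log.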

    Sybil-Resilient Online Content Voting

    • Many websites encourage users to vote for different types of content (e.g. Digg). Sybil attacks can pollute the results (promoting spam links on Digg).
    • Talk today about defending against this kind of attack, and how they implemented it on Digg.
    • Hard to defend against Sybil attack because an open system lets an attacker join easily. CAPTCHA etc. are insufficient. Need a resource that cannot be acquired in abundance: use social network links.
    • Edges between genuine users and the attacker’s subnetwork of sybils are the “attack edges”. Hypothesise that the number of these is small.
    • Assume you can collect all votes and the social graph. Can be binary vote or multiple choice vote. Goal is to collect a subset of votes that includes mostly votes from real people (but might include some from Sybils).
    • Designate a vote collector, assign appropriate link capacities, and then use max-flow to collect votes.
    • Need to break symmetry (Sybil network can exactly mirror the real social network), so designate a known non-attacker as the “vote collector”. Then use max-flow to the vote collector: bogus votes are congested at the small number of attack edges. Honest votes are congested at edges closer to the collector. Attack edges should be farther away from the collector. So give more capacity to the edges that are closer to the collector.
    • System is called “SumUp”: designed to assign capacity in the graph and leverage user feedback.
    • Assign capacity to collect at most v votes (ideally the number of honest votes, estimated using a separate mechanism). Give greater capacity to edges nearer collector, using process called “ticket distribution”: give equal fraction of tickets (initially v) to all edges out from the collector. Each node consumes one ticket, and distributes the rest to each of its outgoing links. Constructs a “vote envelope” around the collector.
    • Observe that when number of honest votes >> v, the number of collected votes is roughly equal to v. When it is << v, the number of collected votes << v. So iteratively (and exponentially) adjust v until at least 0.5 * v votes are collected.
    • Prove that the number of bogus votes is limited to the number of attack edges plus a constant factor.
    • Also prove that a large fraction of the honest votes are collected.
    • Can do better by using feedback from the vote collector, if it can tag some votes as bogus. Then reduce capacity on attack edges close to the collector (or possibly ignore them altogether). Idea is to penalise all edges along the path taken by the bogus vote (because we know that one of these is the attack edge).
    • Associate a penalty with each link: initially all zero. When a bogus vote is tagged, penalise the edge by 1/capacity. Links with a higher penalty receive fewer tickets. Ultimately eliminate links with a high penalty.
    • Evaluated on real social networks, and real Sybil attacks.
    • Applied to YouTube (0.5M), Flickr (1.5M) and synthetic (3M) social graphs.
    • As the fraction of honest votes increases past 0.1, the average number of bogus votes per attack edge increased sharply (up to 5 per edge) in all three graphs.
    • The fraction of honest votes collected is always > 90%.
    • Looked at real Sybil attack on Digg (positive and negative votes on articles). Digg maintains 130,000 “popular” articles among 7 million articles, using an undisclosed algorithm. Digg has a 3M node social network, with 0.5M nodes in a connected component. 80% of votes are from the connected component. (Data obtained by crawling Digg.)
    • Made the Digg founder (Kevin Rose) the vote collector. Manually sampled 30 articles. Found subjective evidence of attacks in 15 articles (one was an advert; 10 had votes from newly-registered voters; 4 received <50 votes after being marked "popular").
    • Observe that suspicious articles receive more negative votes (based on 5794 “popular articles”).
    • Q: even if SumUp can give you the attack edges, it would be difficult to defend against attacks in recommendation systems where there are a small number of honest nodes, so just by compromising a few hundred honest nodes (e.g. using a botnet), it would be possible to overwhelm the system. How does SumUp deal with this? SumUp doesn’t deal with this case.
    • Q: is there a dependency on the location of the collector? Could you manipulate the graph to place attack edges near to the vote collector? Yes, so the feedback mechanism is important.
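    The ticket-distribution step above can be sketched as a breadth-first pass from the collector. This is a simplified illustration under my own assumptions (even splitting of tickets, a base capacity of one per edge, tree-like propagation); the paper’s scheme handles the general graph case more carefully.

```python
from collections import deque


def distribute_tickets(graph, collector, v):
    """Sketch of SumUp-style ticket distribution: the collector starts
    with v tickets; each node consumes one ticket and splits the rest
    evenly among its outgoing edges, so edges nearer the collector get
    more capacity. `graph` maps node -> list of neighbours."""
    tickets = {collector: v}
    capacity = {}          # (u, w) -> capacity of edge u->w
    seen = {collector}
    q = deque([collector])
    while q:
        u = q.popleft()
        out = [w for w in graph.get(u, []) if w not in seen]
        remaining = max(tickets.get(u, 0) - 1, 0)  # node consumes one
        share = remaining // len(out) if out else 0
        for w in out:
            capacity[(u, w)] = share + 1  # every edge has base capacity 1
            tickets[w] = tickets.get(w, 0) + share
            seen.add(w)
            q.append(w)
    return capacity
```

    On a chain collector → a → b with v = 4, the edge out of the collector gets more capacity than the edge one hop further out, which is the “vote envelope” effect: bogus votes beyond the envelope congest at the few attack edges.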

    Bunker: A Privacy-Oriented Platform for Network Tracing

    • Bunker anonymises data that it collects and offers software engineering benefits.
    • Network tracing used for traffic engineering, fault diagnosis, recovery, research studies. But customer privacy is very important to ISPs. Raw data is a liability for ISPs (lost, stolen, subpoenaed, etc.).
    • So nobody can have access to the raw data (ISPs always say no). Anonymising the data can help to mitigate privacy concerns. Anonymisation is a form of obfuscation that destroys personally-identifying data.
    • Could do anon. offline or online. Offline has high privacy risks (do it after collecting the trace). Online has high engineering costs (need to anon. the trace simultaneously with collection, at line speed).
    • A regex for phishing (looks for forms that take PINs, usernames, passwords, etc.) using libpcre takes 5.5s to process 30MB of data (44Mbps maximum). But we want to look at multiple-gigabit links with multiple regexes.
    • Want the best of both worlds. So buffer raw data on disk, but the only thing that comes out is the anonymised trace.
    • Bunker is a closed box that protects sensitive data. Contains all raw data and processing code. Restricted access to the box (e.g. no console). Make the box “safe-on-reboot”: when it is rebooted, clear the ECC RAM using the BIOS after reboot, and use encryption to protect on-disk data. Use an encryption key held in RAM inside the closed box. Data on disk cannot be decrypted after reboot.
    • Design: online capture module, and offline TCP assembly, parsing and anonymisation modules. One-way interface that passes out the anon. data. Add an online encryption module and offline decryption module to store data on disk.
    • Closed-box VM built on Xen hypervisor. Also an open-box VM on the same platform which provides access to the trace, accessed using a separate NIC.
    • Closed-box implementation: no I/O or drivers in this VM except those needed (custom-made menuconfig). Use firewalls to restrict network communication (e.g. standard iptables config).
    • On boot, one of two configurations may be selected. Debugging config enables all drivers and allows access to the closed box. Tracing configuration eliminates most I/O and drivers. On choosing tracing config, the display and keyboard freeze (as there are no drivers), the kernel’s init runs a script to start the trace, and the operator can only log in to the open box using its dedicated NIC.
    • Gives strong privacy properties, and allows trace processing to be done offline (in your favourite language, e.g. Python).
    • Bunker has a large TCB, but narrow interfaces: it remains secure as long as a vulnerability cannot be exploited through narrow interfaces. Three classes of attacks: closed-box interfaces, hardware and trace injection.
    • Assume that it is hard to attack a VM from another VM. Enumerate each of the interfaces and reason that the defences are secure.
    • Safe-on-reboot eliminates most hardware attacks. One remaining is extracting keys from RAM while the system is running (cold boot attacks, bus monitor, special (e.g. FireWire) device to dump RAM without OS support). Secure co-processors could thwart these attacks, but TPMs are not useful.
    • Bunker has <7kloc and took 2 months to develop. Much smaller (order of magnitude) than previous line-speed systems at UW and Toronto. Able to use Python to simplify the development.
    • Q: what have you learned by trying to sell Bunker (with admitted vulnerabilities) to network operators? Do they require a proof that no un-anonymised data can come out? Universities take their jobs very seriously (sometimes more so than ISPs). If you can prove that no data can come out, that’s great, but don’t know how to do that. Found that by explaining carefully what you’re doing, support is often forthcoming.
    • Q: great solution assuming the anonymisation is good enough. There have been several mistakes about this in the past. So how does Bunker affect this? Bunker doesn’t protect against that: indeed, it might even be worse. Assumes that there are no bugs in the anonymisation code: do code inspections and make it publically available to improve its quality. Need to work on these open problems.
    • Q: do you worry about physical access to the infrastructure or machine? Well, if you can do that, you can install your own network tap, so what’s the point? Bunker is designed to lower ISP’s liability. Doesn’t stop a lawyer coming in with a subpoena allowing him to install a new network tap.
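    The key idea that anonymised data is the only thing to leave the closed box can be illustrated with a keyed hash whose key lives only in RAM. This is a hypothetical sketch of mine, not Bunker’s code: the real system anonymises whole packet traces, and the key also protects the on-disk raw buffer.

```python
import hmac
import hashlib
import os

# The key exists only in the closed box's RAM; after a reboot it is
# gone, so buffered raw data on disk can no longer be decrypted or
# linked back to real identities ("safe-on-reboot").
_SESSION_KEY = os.urandom(32)


def anonymise_ip(ip: str) -> str:
    """Keyed one-way mapping of an address: consistent within a trace
    (so flows can still be correlated) but irreversible without the
    in-RAM key. Only output like this crosses the one-way interface."""
    digest = hmac.new(_SESSION_KEY, ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]
```

    The same address always maps to the same token, so offline analysis in the open box still works, but no analysis can recover the original address.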

    Storage

    Flexible, Wide-Area Storage for Distributed Systems with WheelFS

    • Increasing data storage on widely-spread resources (testbeds, grids, data centres, etc.). But not yet seen a universal storage layer. It’s hard because of failures, latency and limited bandwidth.
    • CoralCDN prefers low delay to strong consistency. Google wants to store e.g. email near the customer. Facebook forces all updates for a user to go through one data centre. So each application builds its own storage layer. No flexible layer gives all of these properties!
    • Need control of wide-area tradeoffs (fast timeout vs. consistency; fast writes vs. durability; proximity vs. availability). Also need a common, familiar API that looks like a traditional file system so we can reuse existing software built for local storage.
    • Solution is “semantic cues”: a small set of application-specific controls that corresponds to each of the wide-area challenges (e.g. eventual consistency, replication level, particular site). Allow applications to specify cues on a per-file basis, by putting it in the path names of files.
    • WheelFS works over the wide area, based on a standard distributed file system design with the addition of these cues. Built a full prototype which runs several applications.
    • Design: WheelFS looks like a big data storage layer. Distributed application runs on a bunch of client nodes spread throughout the wide-area. FUSE presents WheelFS to the application, and WheelFS client software communicates with WheelFS storage nodes. WheelFS configuration service uses Paxos and RSMs to map files to nodes.
    • Files have a primary and (by default) two replicas. A file’s primary is its creator. Clients can cache files using a lease-based invalidation protocol. Strict close-to-open consistency (serialised through the primary).
    • Consistency is enforced under failures (even network partitions): failing to reach the primary blocks the operation, until the configuration service promotes a new primary.
    • Only applications have the knowledge to make the right tradeoffs. So embed these cues in the path name.
    • /wfs/cache/a/b/foo -> /wfs/cache/a/b/.cue/foo (e.g. .EventualConsistency).
    • Flexible and minimal interface change that makes it easy to use existing applications.
    • Can apply a cue to entire directory subtrees, multiple cues at once (later cues override earlier, conflicting cues). Assume the developer uses these sensibly.
    • Durability through RepLevel=n cue. Permanent property of the data.
    • Large reads through HotSpot cue. Transient property (only applies to a particular opening of the file) using P2P BitTorrent-style caching for hotspots.
    • Placement using Site cue. Permanent property of the data.
    • Consistency though EventualConsistency cue. Either transient or permanent property.
    • Reading with eventual consistency: read latest version of the file that you can find quickly. Don’t need to go through the primary (which might have failed). Can try replicas, or a cached copy of the file elsewhere. .MaxTime=t cue to specify how long you should spend looking for a file. Writes can go to any replica, which creates two divergent replicas: a background maintenance process will figure out a way to merge files (without application involvement). Reconciling directories by taking the set union of files, and files by choosing one version as the winner (this will lose some writes, but is usually OK for apps that can tolerate eventual consistency).
    • Example use: cooperative web cache. Make a one-line change in the Apache config file to point it to a file in WheelFS. Using default WheelFS semantics leads to blocking under failure with strong consistency. But the freshness of a page can be determined using saved HTTP headers, so it’s alright to use eventual consistency.
    • Cache directory becomes /wfs/cache/.EventualConsistency/.MaxTime=200/.HotSpot/
    • Implemented on Linux, MacOS and FreeBSD. About 20kloc of C++. Support Unix ACLs. Deployed on PlanetLab and Emulab.
    • Applications include co-op web cache, all-pairs pings, distributed mail, file distribution and distributed make. At most 13 lines of configuration code had to be changed.
    • Evaluated performance: does it scale better than single-server DFS? Does it achieve performance equivalent to specialised storage? How does it perform under failures?
    • Scaling: as number of concurrent clients increases past about 125, NFS performance starts to suffer relative to WheelFS due to buffer cache exhaustion. (However, NFS is better than WheelFS for fewer clients. This is because it’s a local NFS server at MIT, and WheelFS is on PlanetLab.) WheelFS performance seems constant with added load.
    • Specialised storage for co-op web cache on PlanetLab: 40 nodes as proxies, 40 as clients, same workload as CoralCDN paper. Compare CoralCDN to Apache on WheelFS. WheelFS achieves same rate as CoralCDN. However, CoralCDN ramps up to full rate faster, due to special optimisations.
    • Evaluated web cache under failures using Emulab. Each minute, one site went offline for 30 secs. Eventual consistency improves performance under failures: rate remains almost constant, whereas strict consistency makes it fall greatly.
    • Main difference with related work is the control over trade-offs.
    • Compare to storage with configurable consistency (PNUTS, PADS). WheelFS also provides durability and placement controls.
    • Q: do these primitives have a place in a generic distributed file systems? Where do you draw the line in what you include and what you don’t? Have only looked at what existing applications need and included that. If other applications become important, other things may be included.
    • Q: what insight have you gained from making it simple to create different configurations of system? Most applications just need something simple that applies to both reads and writes (i.e. strict or eventual consistency).
    • Q: want to hear more about reconciliation? Most apps that need append-only storage can be implemented as writing things into a directory (and taking the union). Cooperative web caching doesn’t care if you lose a version of the file. That’s been enough so far.
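    Embedding cues in path names, as in the cache-directory example above, amounts to a small parsing step in the client. A minimal sketch, assuming cues are exactly the dot-prefixed path components and that later cues override earlier ones (the function name and return shape are my own):

```python
def parse_cues(path):
    """Split WheelFS semantic cues out of a path name, e.g.
    /wfs/cache/.EventualConsistency/.MaxTime=200/.HotSpot/foo.
    Returns the plain path and a dict of cues; flag cues map to True.
    (A sketch: it would mis-handle ordinary dot-files.)"""
    cues, parts = {}, []
    for comp in path.split("/"):
        if comp.startswith(".") and comp not in (".", ".."):
            name, _, value = comp[1:].partition("=")
            cues[name] = value or True  # later cues override earlier ones
        else:
            parts.append(comp)
    return "/".join(parts), cues
```

    This is why existing applications need only a one-line configuration change: the cue-laden string is still a legal path as far as the application is concerned.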

    PADS: A Policy Architecture for Distributed Storage Systems

    • There are lots of data storage systems. They take a lot of time and effort to build: lots to reimplement. Why can’t this be easier? Is there a better way to build distributed storage systems, one that focusses on high-level design rather than low-level details?
    • Previous work suggested a microkernel approach: a general mechanism layer with a pluggable policy.
    • Challenge was to build 10 different systems, each in 1kloc, before graduation! With PADS, 2 grad students built 12 diverse systems in just 4 months. Evidence that PADS captured the basic abstractions for building distributed storage systems.
    • Questions of data storage and propagation are just questions of routing. Consistency and durability are questions of blocking.
    • Routing specifies how data flows among nodes. When and where to send an update? Who to contact on a local read miss? Look at examples of routing in Bayou, Coda, chain replication and TierStore.
    • Primitive of subscription: options are the data set of interest (e.g. a path), notifications (invalidations) in causal order, or updates (bodies of files). Leads to an event-driven API. PADS gives a DSL to make this easier (based on OverLog), called R/Overlog. Policy is a bunch of rules that invoke actions.
    • Simple example is: on read, block and establish subscription to the server.
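The flavour of that rule can be sketched in Python (the real policies are written in R/OverLog; all class and method names here are invented for illustration):

```python
# Hypothetical sketch of an event-driven PADS-style routing rule: on a
# local read miss, establish a subscription to the server for the
# missing path, and hand back a handle the blocking layer can wait on.

class Subscription:
    def __init__(self, publisher, subset, want_bodies=True):
        self.publisher = publisher      # node we pull updates from
        self.subset = subset            # data set of interest, e.g. a path
        self.want_bodies = want_bodies  # updates (bodies) vs invalidations only

class ClientPolicy:
    def __init__(self, server):
        self.server = server
        self.subscriptions = []

    def on_read_miss(self, path):
        # Routing rule: contact the server on a local read miss.
        sub = Subscription(self.server, path)
        self.subscriptions.append(sub)
        return sub                      # caller blocks until data arrives

policy = ClientPolicy(server="srv-1")
sub = policy.on_read_miss("/photos/cat.jpg")
```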
    • OverLog makes it possible to implement a whole system in a single page of rules. Rules for TierStore were presented in a single slide. Easier to debug and do code reviews.
    • Blocking policy defines when it is safe to access local data (either for consistency (what version can be accessed?) or durability (have updates propagated to safe locations?)). Need to block until the required semantics are guaranteed.
    • PADS provides 4 blocking points: before/after read/write. Specify a list of conditions that provide the required semantics. PADS provides 4 built-in bookkeeping conditions and one extensible condition.
    • e.g. block before read: Is_causal; block after write: R_Msg (ackFromServer).
    • Is PADS a better way to build distributed systems? Is it general enough to build any system? Is it easy-to-use? Is it easy-to-adapt? What are the overheads associated with the PADS approach?
    • Built a range of different systems that occupy different parts of the design space. Max number of routing rules was 75; up to 6 blocking conditions.
    • Added cooperative caching to Coda in 13 rules. Took less than a week, and greatly improved the read latency.
    • Overheads in an implementation of Bayou: within a small factor of the ideal number of bytes that must be transferred. Also looked at microbenchmarks on Coda versus P-Coda (PADS version): very close, and mostly due to the Java implementation of PADS.
    • Q: with this size of implementation, do you deal with system failures and recovery? Yes.
    • Q: how do you express routing that is based on network topology (e.g. TierStore hierarchy over DTN in a developing region) in OverLog? OverLog is typically used to set up network overlays, ping nodes and so on. Once you know who is alive, you can call into PADS to say that you’ve detected a peer with whom you can communicate for storage.
    • Q: can you talk about the trade-off between language versus library? (OverLog rules are a bit like haikus, sometimes you’d prefer a paragraph of text.) Can also use a Java API to configure PADS. Why OverLog; why not Java? Wanted to take advantage of the haiku of OverLog.
    • Q: a storage system doesn’t only have to worry about data movement, there’s also reconciliation or dealing with different storage layers at the same time. Should PADS worry about these? PADS doesn’t do conflict detection (it uses a simple scheme), and that has mostly been left to the application (though haven’t decided all of this so far). The storage layer is more of an application issue than a system issue.
    • Q: with OverLog, the runtime state can get very big, so how does this scale (with e.g. a complex topology)? Originally, DataLog would set up data flows which cause a lot of state to be exchanged. The custom version used for PADS cuts this down while using the same language.

    Wireless #1: Software Radios

    Sora: High Performance Software Radio Using General Purpose Multi-core Processors

    • Won the best paper award.
    • Currently, each wireless protocol is implemented using special hardware. Software radio ideal is a generic RF frontend and protocols implemented in software. Leads to universal connectivity and cost saving; a faster development cycle; and an open platform for wireless research.
    • Challenges. Need to process a large volume of high-fidelity digital signals (e.g. for 802.11 with 20MHz channel, need 1.2Gbps throughput… up to 5Gbps for 802.11n. Will be over 10Gbps for future standards). Processing is computationally-intensive: several complicated processing blocks, operating at high signal speeds. Need 40G operations per second to process 802.11a. Also a real-time system, so many hard deadlines and need accurate timing control. 10us windows for response.
    • Possible approaches include programmable hardware (FPGAs, embedded DSPs) (high performance, low programmability) and general purpose processors (low performance (100Kbps), high programmability). Sora is high performance and highly programmable. Achieves 10Gbps with ~10us latency.
    • Approach. A new PCIe-based interface card and optimisations to implement PHY algorithms and streamline processing on a multi-core CPU. “Core dedication” offers real-time support.
    • Uses a general radio frontend connected to a PCIe-based high-speed interface card (offers high throughput and low latency (~1us)). Frontend can connect to up to 8 channels. FPGA on card implements logic for control and data path (PCIe, DMA, SDRAM controllers).
    • Core part of Sora is the software architecture. To achieve high performance, uses three technologies.
    • First, an efficient PHY implementation makes extensive use of lookup tables which trade memory for calculation and still fit in L2 cache (e.g. convolutional encode requires 8 operations per bit in a direct implementation, but can use a 32Kb lookup table and two operations per 8 bits).
    • Second, most PHY algorithms have data parallelism (e.g. FFT and its inverse). So use wide-vector SIMD extensions developed for multimedia in the CPU.
    • Third, use multi-core streamline processing to speed up PHY processing. Divide processing pipelines into sub-pipelines, and assign these to different cores. Use a lightweight synchronised FIFO to connect cores. Can also do static scheduling at compile time.
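A toy version of the pipeline split, with a synchronised FIFO connecting the stages (Python threads and queues stand in for dedicated cores and Sora’s lightweight FIFOs; the stage functions are placeholders, not real PHY blocks):

```python
# Split a processing pipeline into sub-pipelines connected by FIFOs,
# each sub-pipeline running on its own worker (illustration only).
import queue, threading

def stage(fn, inq, outq):
    while True:
        item = inq.get()
        if item is None:            # sentinel: shut down and propagate
            outq.put(None)
            break
        outq.put(fn(item))

src, mid, sink = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(lambda x: x * 2, src, mid))   # stand-in for e.g. an FFT stage
t2 = threading.Thread(target=stage, args=(lambda x: x + 1, mid, sink))  # stand-in for e.g. a demapping stage
t1.start(); t2.start()

for sample in range(5):
    src.put(sample)
src.put(None)

results = []
while (item := sink.get()) is not None:
    results.append(item)
t1.join(); t2.join()
# results == [1, 3, 5, 7, 9]
```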
    • Core dedication for real-time support: exclusively allocate enough cores for SDR processing in a multi-core system. This guarantees predictable performance and achieves us-level timing control. A simple abstraction, easily implemented in standard OSes (and easier than a real-time scheduler), such as WinXP.
    • Implemented for WinXP in 14kloc (C code) including the PCIe driver. Also, SoftWiFi implements 802.11a/b/g in 9kloc (C) in 4 man-months for development and testing. Works at up to 54Mbps.
    • Without optimisations, the required computation is far too large for any practical system. Sora offers up to a 30x speedup at high data rates.
    • End-to-end throughput compared to commercial-commercial 802.11 cards, Sora-commercial and Commercial-Sora. It is very close and sometimes faster to use Sora. Also seamlessly interoperates with commercial WiFi.
    • Extensions for jumbo frames in 802.11 can increase throughput. Also simply implement TDMA MAC. And applications which show low-level information about the PHY layer.
    • Q: do you have an algorithm for deciding how to allocate jobs to cores, or do you need to come up with an approach for each CPU architecture? Currently rely on the programmer to decide this, but there has been much other research on this, which could apply to Sora.
    • Q: most radios are used in mobile devices, so what is needed to make Sora work on power-constrained devices? GPPs have huge power consumption compared to special devices. Currently the benefit of SDR is for prototyping, so this is less of a concern. Also, Sora would work well on base stations.
    • Q: did you look into using existing systems for scheduling data-flow graphs on multi-cores? Don’t really need to consider the dynamic case because of fixed rounds etc.
    • Q: how do you provision dedicated cores in the presence of a shared cache and shared bus?
    • Q: a lot of the finer details in 802.11 is for working at low-performance (weak signal strength and high multipath), so does Sora have spare processing capacity to work with this? In the presence of these, you won’t get high throughput, so we don’t handle this completely.

    Enabling MAC Protocol Implementations on Software-Defined Radios

    • What’s the hype about wireless MAC protocols? Achieving highest performance is application-specific (e.g. throughput, latency, power). No one MAC fits all. So there are diverse MAC implementations and optimisations. How can we easily implement these?
    • First approach has been to use standard wireless NICs (high performance and low cost). Although the MAC is software, it’s closed-source and fixed functionality. SDRs allow full reprogramming of the PHY and MAC layers, but are higher cost and lower performance.
    • Various projects have used SDR for evaluation (based on GNU Radio and USRP). All processing is done in userspace (“extreme SDR”).
    • “Extreme” SDR architecture based on a frontend, ADC/DAC, FPGA, and USB connection to kernel and eventually userspace. Much too slow for 802.11 timeouts.
    • So commonly move layers closer to the frontend. However, these are costly, require special toolkits, require embedded systems knowledge and are much less portable.
    • Instead, take a split-functionality approach. Put a small, performance-critical part on the radio hardware, and a larger piece on the host for flexibility. Then develop an API for the core functions.
    • Building blocks are carrier sense, precision scheduling, backoff, fast-packet detection, dependent packet generation and fine-grained radio control. Believe this is a reasonable first “toolbox” for implementing high-performance MAC protocols. Talk about precision scheduling and fast-packet detection.
    • Precision scheduling. Do the scheduling on the host (for flexibility) and triggering on the hardware (for performance). Requires a lead time that varies based on the architecture.
    • Want to know how much precision we gain from this approach. Transmission error is approximately 1ms if triggering in the host. If in the kernel, this lowers to 35us. With split-functionality, this gives 125ns precision in scheduling.
    • Fast-packet detection. Goal is to detect packets accurately in the hardware, before they have been demodulated. The longer it takes to detect a data packet, the longer it will take to generate an ACK. Then demodulate only when necessary as this is CPU intensive. Uses a “matched filter”, which is an optimal linear filter for maximising the SNR. Try to detect framing bits, which are transformed into a discrete waveform. This is used as the known signal, and is cross-correlated with the incoming signal. If the correlation exceeds some score, trigger a response, or other action.
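A minimal sketch of the matched-filter detection (template, signal and threshold below are all made up for illustration):

```python
# Cross-correlate a known framing waveform with the incoming signal and
# trigger wherever the normalised score clears a threshold.

def matched_filter(signal, template, threshold):
    n, hits = len(template), []
    energy = sum(t * t for t in template)          # normalisation factor
    for i in range(len(signal) - n + 1):
        score = sum(signal[i + j] * template[j] for j in range(n))
        if score / energy >= threshold:
            hits.append(i)                         # e.g. schedule an ACK here
    return hits

template = [1, -1, 1, 1, -1, -1, 1, -1]   # framing bits as a discrete waveform
signal = [0, 0] + template + [0, 0, 0]    # a clean frame starting at offset 2
hits = matched_filter(signal, template, 0.9)
```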
    • Simulation of detecting 1000 data packets destined to the host in varying noise. The matched filter achieves better noise tolerance than the full decoder (in the simulator). In real life, achieves 100% accuracy detecting frames, and <0.5% false positives.
    • Other mechanisms in the toolbox are detailed in the paper.
    • Implemented on GNU Radio and USRP. Implemented two popular MAC protocols (802.11-like and Bluetooth-like).
    • CSMA 802.11-like protocol uses carrier sense, backoff, fast-packet recognition and dependent packets. Cannot interoperate with real 802.11 because of bandwidth limitations. Target bitrate is 500Kbps, and uses the 2.485GHz band to avoid 802.11 interference. Achieves 2x throughput of the host-based “extreme” approach for 1MB-size file transfers.
    • TDMA Bluetooth-like protocol. Piconet of master and slaves, with 650us slot size. Bluetooth-like because USRP cannot frequency hop at a high enough rate to interoperate with Bluetooth. Again, target rate of 500Kbps, performing ten 100KB file transfers, and vary the number of slaves. Achieves 4x the average throughput of the host-based approach, using a much shorter guard time.
    • Q: is the split always applicable, even if the cores could be heterogeneous? The most important part of the API is between the radio hardware and the host, not core-to-core. (Follow-up: for embedded applications, trend towards system-on-a-chip, and you could have cores geared towards different things (such as radio).)
    • Q: how do you work around virtual carrier sense? Can include multiple timestamps in a packet.
    • Q: are there fuzzy edges or things that you might have trouble dealing with in this API? Yes, definitely not saying that we can do everything. Currently working on generating “fast ACKs”, but if you pre-modulate it then you don’t know the destination, so need to track that.
    • Q: problems encountered in sensor nets in developing new APIs; how generic is this work if a new protocol were developed? Difficult to say that any set is complete. Could tweak the implementation of the core functions to implement new ones (e.g. ZigZag). Starting to look into implementing novel MACs.
    • Q: as the PHY gets faster, will the matched filter be adequate? Possible to use multiple filters in parallel (though USRP-1 doesn’t have room for that). Could also switch the coefficients to search for other things.

    Content Distribution

    AntFarm: Efficient Content Distribution with Managed Swarms

    • What is the most efficient way to disseminate a large number of files to a large number of clients? A simple solution might be a simple client-server, which creates a bottleneck at the server, and leads to a high cost of ownership for the content owner.
    • Alternative is to do peer-to-peer. Examples include BitTorrent. This sacrifices efficiency, because peers share limited information and there is no global sense of the system as a whole: gives little control to the provider. Managing swarms could lead to a better use of bandwidth.
    • Goals for AntFarm: high performance (throughput), low cost of deployment, performance guarantees (administrator control) and accounting (resource contribution policies).
    • Key insight is to treat content distribution as an optimisation problem. Uses a hybrid architecture, revisiting the BitTorrent protocol, but in fact a brand-new protocol.
    • Has a set of peers, organised into swarms. A logically separate coordinator manages these swarms. Seeders outside the system provide the data, but altruistic peers will contribute much of the bandwidth.
    • As a strawman, the coordinator could schedule every single packet sent in the system: this is clearly unscaleable. Instead, it makes critical decisions based on observed dynamics. Remaining decisions left to the peers themselves. Peers can implement micro-optimisations (e.g. rarest block first).
    • Coordinator takes active measurements and extracts key parameters. It then formulates an optimisation problem that calculates the optimal bandwidth allocation.
    • Want to maximise throughput subject to bandwidth constraints. Response curve of swarm aggregate bandwidth as the seeder bandwidth is increased. At first, increasing seeder bandwidth gives a multiplicative increase in the aggregate bandwidth, but this eventually becomes slope=1, then flat. (Assumes that peers in the swarm are homogeneous (in network capacity) and the downlink is faster than uplink.)
    • Each swarm will have a different response curve. The coordinator measures these, and uses these for optimisations. Optimised using an iterative algorithm: allocate bandwidth to the swarm whose response curve has the steepest slope (favouring swarms with lower bandwidth where this is equal). Can first address SLAs and QoS constraints, which might lead to a very different allocation of bandwidth.
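The iterative allocation can be sketched as a greedy loop (the response curves below are invented toy functions, not measured ones):

```python
# "Steepest slope first": each unit of seeder bandwidth goes to the
# swarm whose response curve currently rises fastest, with ties broken
# in favour of the swarm that has received less so far.

def allocate(response_curves, budget, step=1):
    """response_curves[i](x) -> aggregate swarm bandwidth given x seeder bw."""
    alloc = [0] * len(response_curves)
    for _ in range(budget // step):
        gains = [c(a + step) - c(a) for c, a in zip(response_curves, alloc)]
        best = max(range(len(gains)), key=lambda i: (gains[i], -alloc[i]))
        alloc[best] += step
    return alloc

# Toy curves: swarm 0 multiplies seeder bandwidth 3x, swarm 1 only 2x,
# and both saturate after 10 units.
curves = [lambda x: 3 * min(x, 10), lambda x: 2 * min(x, 10)]
result = allocate(curves, 15)
```

With a budget of 15 units, the first 10 go to the steeper swarm 0 until it saturates, and the rest go to swarm 1.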
    • AntFarm must adapt to change as nodes churn and network conditions change. AntFarm will update response curves and bandwidth allocations.
    • AntFarm is built on top of a new wire protocol, which uses tokens as a form of microcurrency that is traded for blocks. Tokens are small and unforgeable. Peers return spent tokens to the coordinator as a proof of contribution.
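One plausible way to make tokens small and unforgeable is for the coordinator to MAC them under a secret key; this is an assumption about the mechanism for illustration, not AntFarm’s actual wire format:

```python
# Hypothetical token scheme: the coordinator tags (peer, serial) with an
# HMAC; peers cannot mint tokens without the key, and returned spent
# tokens serve as proof of contribution. Field layout is invented.
import hmac, hashlib

SECRET = b"coordinator-key"                # known only to the coordinator

def mint(peer_id, serial):
    body = f"{peer_id}:{serial}".encode()
    tag = hmac.new(SECRET, body, hashlib.sha256).digest()[:8]
    return body + tag                      # small: body plus an 8-byte tag

def redeem(token):
    body, tag = token[:-8], token[-8:]
    expected = hmac.new(SECRET, body, hashlib.sha256).digest()[:8]
    return hmac.compare_digest(tag, expected), body

tok = mint("peer-42", 7)
ok, body = redeem(tok)
```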
    • Performance evaluation looks at global aggregate bandwidth across all swarms. Tested using a Zipf distribution of files with 60KB/s and 200KB/s seeders. Compared to client-server and BitTorrent. AntFarm greatly outperforms both of these cases.
    • Compare AntFarm to BitTorrent with two swarms: one self-sufficient and one singleton. BitTorrent will starve the singleton, but AntFarm will recognise based on the response curves that seeder bandwidth should be allocated to the singleton. Also observe that BitTorrent will starve new swarms (AntFarm will not).
    • Token management is embarrassingly parallel, which aids scalability. Ran coordinators on PlanetLab hosts and simulated multiple peers on other PlanetLab hosts. A one-machine coordinator supports 10K peers, and 8 coordinators will support up to 80K peers. A single PC can compute allocations for 10,000 swarms with 1,000,000 peers in 6 seconds (done once every 5 minutes).
    • AntFarm requires no fine-tuning, and subsumes hacks that have been devised for BitTorrent.
    • Q: what incentive does a swarm have to report its response curve correctly? There is a potential collusion problem here, but we assume that peers want data and will exchange tokens to ensure that they get the data as fast as possible.
    • Q: is there any concern about a Sybil attack that involves passing credits amongst yourself? Can force people to back an account with a credit card, to mitigate this.
    • Q: do you think that this token-based system will be necessary in a commercial system? It gives us what we want in terms of response curves.

    HashCache: Cache Storage for the Next Billion

    • The next billion internet users are schools and urban middle class in developing regions. They have affordable hardware (OLPC, Classmate) but very expensive internet connections.
    • Standard approach for bandwidth saving is using a large cache. Large caches mean larger bandwidth savings. Can do overnight prefetch or push content from peers. They also have good offline behaviour, enabling prefetching and local search. Can even accelerate dynamic sites.
    • Cost is about 5–10GB of RAM per TB of storage. Cannot use laptop-grade hardware for caches: need server-grade hardware which is 10x more expensive than laptops.
    • Solution is a new storage engine that allows policies for efficiency and performance to be specified. Requires much less RAM than commercial or open-source caches, even for terabyte-sized caches. All techniques support far more GB/$ and allow a performance tradeoff.
    • Open-source solutions need multiple seeks for hits, misses and writes and depend on default filesystems. Commercial systems (using a circular log) require a single seek and achieve much better performance.
    • Focus on reducing the size of the (in-memory) index. Squid used 560 bits per entry; Tiger uses 232 bits per entry.
    • The cache size is limited by the memory size and performance is limited by the number of seeks. Want to reduce the dependency on memory size and improve the performance of inevitable seeks.
    • Instead, use the disk as a hashtable. Need on-disk structures for key lookup and value storage.
    • Basic HashCache policy: H(URL) = h bits… stores in a disk-based hash table of contiguous blocks, then puts the data in a circular log.
    • Collision control is difficult in disk-based systems as it requires multiple seeks. Instead use set associativity, t ways. The possible locations are allocated contiguously so they can be read together (which is good as seek time dominates for small reads).
    • Normally would reduce seeks using an in-memory hash table (with space consumed by pointers), but the disk is already a hash table, so pointers are not needed: just use a large bitmap that mirrors the disk layout, and store one hash per URL.
    • Large disks can support 10–100+ million objects. Global cache replacement is relevant when the disk size is roughly equal to that of the working set. When you have much larger disks, local replacement policies are roughly equivalent to global ones. Do LRU within the sets.
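An in-memory model of the set-associative design with LRU within each set (the parameters and the use of SHA-1 are illustrative, not HashCache’s exact layout):

```python
# H(url) picks one of num_sets sets of `ways` contiguous slots; within a
# set, replacement is LRU, and only a short hash of the URL is stored.
import hashlib

class HashCacheIndex:
    def __init__(self, num_sets=1024, ways=8):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [[] for _ in range(num_sets)]   # short hashes, MRU last

    def _key(self, url):
        h = int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")
        return h % self.num_sets, h >> 32           # (set index, stored tag)

    def lookup(self, url):
        s, tag = self._key(url)
        ways = self.sets[s]
        if tag in ways:
            ways.remove(tag); ways.append(tag)      # mark most recently used
            return True
        return False

    def insert(self, url):
        s, tag = self._key(url)
        ways = self.sets[s]
        if tag not in ways:
            if len(ways) == self.ways:
                ways.pop(0)                         # evict LRU within this set only
            ways.append(tag)
```

Because a whole set occupies contiguous blocks, a miss or a collision probe costs at most one seek.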
    • Most misses require no seeks; one seek per read; one seek per write. However, writes still need seeks.
    • Storing objects by hash produces random reads and writes. Need to restructure the on-disk table and store only the hash, rank and offset. Move all data to the log. Group writes will amortise seeks and scheduling related writes will enable read prefetch. Gives reads and writes in < 1 seek.
    • HashCache requires just 54 bits per URL.
    • All policies implemented in a Storage Engine with plug-in policies. Built a web proxy using the storage engine. Can have multiple apps on the same box, sharing memory. 20kloc (C) for the proxy and 1kloc for the indexing policies.
    • Evaluated using Web Polygraph (de facto feature and performance testing tool for web proxies). Compared against Squid and Tiger. Evaluated with “low end”, “high end” and “large disk” hardware capacities.
    • For low end, achieves hit rate comparable to Squid and Tiger. Can achieve performance comparable to Squid or Tiger, depending on the policy used.
    • On high end (5x 18GB disks), achieves performance very close to Tiger, much better than Squid.
    • Can achieve much larger disk capacities than either Squid or Tiger for the same amount of RAM. (1.5–5.4TB, depending on policy.)
    • Uses up to 600MB of RAM with a 1TB disk (large disk configuration).
    • Currently deploying HashCache in Ghana and Nigeria, and working with a school supplier on new deployments.
    • Q: what observable bandwidth improvements do you see? Many techniques require large caches (e.g. WAN accelerator tools), and we are working on these. Is the performance improvement like that of Squid? Yes, and it will be better for things like multiple people in a class watching a YouTube video (where the number of objects is large).
    • Q: why do we need these large caches? Is there evidence that by increasing cache size from 200GB to 1TB there will be a drastic improvement? Wanted to move beyond web caching (where the benefits are limited) to WAN acceleration, which requires much larger caches.

    iPlane Nano: Path Prediction for Peer-to-Peer Applications

    • Example application is a P2P CDN where content is replicated across a geographically distributed set of end-hosts. Every client needs to be redirected to the replica that provides the best performance. However, internet performance is neither constant nor queryable.
    • Current best practice is for each application to measure internet performance on its own. Would be better for end-hosts to have the ability to predict performance without having to make measurements, and share infrastructure across applications.
    • Problem has been looked at before. Network coordinates were limited to latency but were a lightweight (scaleable) distributed system. iPlane had a rich set of metrics and used arbitrary end-hosts, but required a 2GB atlas to be distributed and had a large memory footprint.
    • iPlane Nano has same information as iPlane and sufficient accuracy, but only uses a 7MB atlas at end-hosts and services queries locally.
    • On the server side, iPlane Nano uses the same measurements as iPlane but stores and processes them differently.
    • Size of atlas is O(number of vantage points * number of destinations * average traceroute path length). iPlane combines paths to improve predictions. Instead replace atlas of paths with atlas of links. Now this is O(number of nodes * number of links).
    • Clients can use swarming to disseminate the atlas, and can service queries locally using the atlas.
    • However, just storing links loses routing policy information encoded in the routes (i.e. which path would actually be used?). Need to extract routing policy from measured routes and represent this compactly.
    • Strawman: could try to use shortest AS path routing + valley-free + early-exit routing. However, this gave very poor quality predictions (iPlane got 81% correct, this strawman approach got 30%). So we have thrown away too much information.
    • First technique is inferring AS filters. Not every path is necessarily a route (ASes filter propagation of a route received from one neighbour to other neighbours). Filters can be inferred from measured routes, by recording every triple of three successive ASes in each measured route. Store (AS1, AS2, AS3) to imply that AS2 forwards packets from AS1 to AS3. This still gives multiple policy-compliant paths for some endpoint pairs, due to upstream AS routing policies.
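The triple-filter check can be sketched directly (the AS names below are invented):

```python
# Record every AS triple seen in measured routes; a candidate path is
# policy-compliant only if each of its consecutive triples was observed.

def build_filters(measured_routes):
    triples = set()
    for route in measured_routes:
        for i in range(len(route) - 2):
            triples.add(tuple(route[i:i + 3]))
    return triples

def policy_compliant(path, triples):
    return all(tuple(path[i:i + 3]) in triples
               for i in range(len(path) - 2))

measured = [["A", "B", "C", "D"], ["E", "B", "F"]]
filters = build_filters(measured)
# B has been seen forwarding A's traffic to C, but never A's traffic to F.
```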
    • Second technique is to infer AS preferences. For each measured route, alternate paths are determined in the link-based atlas. When paths diverge, this indicates preference.
    • Another challenge is routing asymmetry. Undirected edges are used to compute routes assuming symmetric routing (i.e. when the route has not been specifically measured), but more than half of internet routes are asymmetric. Merge clients’ additional (low-rate) traceroute measurements into the atlas that is distributed to all clients. Prefer a directed path in the atlas for prediction, or else fall back to undirected paths.
    • The improved path predictions are 70% accurate, which is almost as good as iPlane (with a 6.6MB atlas rather than 2GB; and a 1.4MB daily update).
    • Want to use routes to predict latency and loss rate. Latency is sum of link latencies. Loss-rate is the probability of loss on any link in the route. Ongoing challenge is to measure these properties themselves (link latency is hard to measure.) iPlane Nano can make good enough predictions to help applications.
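Composing per-link estimates into path predictions is simple arithmetic (the numbers below are made up):

```python
# Path latency is the sum of link latencies; path loss is the chance of
# losing the packet on any link along the route.

def path_latency(link_latencies_ms):
    return sum(link_latencies_ms)

def path_loss(link_loss_rates):
    survive = 1.0
    for p in link_loss_rates:
        survive *= (1.0 - p)       # packet must survive every link
    return 1.0 - survive

lat = path_latency([10, 25, 5])    # 40 ms end to end
loss = path_loss([0.01, 0.02])     # 1 - 0.99 * 0.98 = 0.0298
```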
    • System used to improve P2P applications (CDN, VoIP and detour routing for reliability). Look at CDN here; others in paper.
    • CDN chooses replica with best performance to serve a client request. Evaluated with 199 PlanetLab nodes as clients, and 10 random Akamai nodes as the replicas. Each node wants to download a 1MB file from the “best” replica. Look at the inflation in download time (w.r.t. optimal strategy) as a CDF of nodes. iPlane Nano does better than Vivaldi and OASIS, and indeed outperforms selection based on measured latency. Random assignment gives bad inflation, which shows the importance of an informed choice.
    • Q: how much does it matter if you have an out-of-date atlas? Once a day is good enough to capture the variance in latency and loss rate.
    • Q: how expensive is it to recompute the atlas? Very inexpensive.
    • Q: will your AS inferences lead to false AS links and how do you deal with that? Inter-AS links is a tricky issue, but rather than getting the topology right, it’s better to make good enough predictions, which are useful for applications. Addressing that will only improve accuracy.
    • Q: what gain do you get from not just querying the server? Say you want to instrument BitTorrent and rank order the peers that will give good performance. But what if every BitTorrent peer hits the server… this will overload a server and lead to you needing costly infrastructure (like Google).
    • Q: what is the measurement overhead that the end-hosts will incur if they have to run their own measurements? We process about 100 traceroutes per day at the end hosts.
    • Q: when does this technique (path segment composition) work better than others? Assumes routers are performing destination-based routing, rather than load-balancing. Turns out around 70% of routes are identical from day-to-day.

    BFT

    Making Byzantine Fault Tolerant Systems Tolerate Byzantine Failures

    • We’ve heard a lot about applications and optimisations for BFT systems. We now have impressive best case performance for many scenarios. But what happens when failures actually occur? Performance drops to zero or the system crashes!
    • How do we get robust BFT? Describe the route to “Aardvark” which is an implementation of this technique, and show that the performance under failures is not too bad.
    • 10 years ago, we thought BFT could never be fast. Goal was to show that BFT could work (in an asynchronous network). FLP means that all we can guarantee is eventual progress. Systems were designed so that the normal case was fast and safety was maintained.
    • Wanted to maximise performance with a synchronous network and all clients working properly. This is misguided (surely failures must occur or else why would we have BFT?), dangerous (encourages fragile optimisations with corner cases that are difficult to reason about, easy to overlook and difficult to implement) and futile (diminishing returns in performance improvements).
    • New goal: address the middle ground between (asynchronous with failures) and (synchronous without failures): i.e. a synchronous network with failures.
    • Want to maximise performance when the network is synchronous and at most f servers fail, while remaining safe if at most f servers fail.
    • Protocol is structured as a series of filters to remove some amount of bad messages. This limits the effect that bad messages can have on performance. Same filters are applied to all messages.
    • Signatures are expensive, so use MACs? But MACs can be used by clients to generate ambiguity, so Aardvark insists on signed requests. (Showed an example of an attack on MACs by a faulty client, where the MAC is validated by the primary, but the replicas cannot validate it, which leads to a tricky protocol that lowers throughput. Also a problem with a faulty primary.) Use a hybrid MAC/signature which is easier to verify. Signature schemes are asymmetric so most of the work can be pushed to the client. But what if a faulty client sends bad signatures into the system? Filter them out (blacklist for client, then verify MAC, then verify signature, and if this fails then blacklist the client).
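The structure of that filter chain might look like this (the verification functions are stubs; this sketches the ordering of checks, not Aardvark’s implementation):

```python
# Cheap checks first: blacklist, then MAC, then signature. A valid MAC
# with an invalid signature pins the fault on the client, who is then
# blacklisted so it cannot keep injecting expensive-to-verify requests.

class RequestFilter:
    def __init__(self, verify_mac, verify_sig):
        self.blacklist = set()
        self.verify_mac, self.verify_sig = verify_mac, verify_sig

    def admit(self, client, request):
        if client in self.blacklist:
            return False                    # cheapest check first
        if not self.verify_mac(client, request):
            return False                    # bad MAC: drop, sender unproven
        if not self.verify_sig(client, request):
            self.blacklist.add(client)      # valid MAC + bad signature: faulty client
            return False
        return True

f = RequestFilter(verify_mac=lambda c, r: r.get("mac") == "ok",
                  verify_sig=lambda c, r: r.get("sig") == "ok")
```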
    • View changes to be avoided? But they can be done frequently to enable high throughput even under failures. The primary is in a unique position of power (client sends request to primary, primary forwards it to replicas) and could wait for a long time. Usually deal with this using a view change timeout. But a bad primary can be just fast enough to avoid being replaced. Instead use adaptive view changes based on observed and required throughput. Guarantees that the current primary can either provide good throughput or be promptly replaced.
    • Hardware multicast is a boon? Use separate work queues for clients and network connections between machines.
    • Evaluated throughput versus latency compared to HQ, Q/U, PBFT and Zyzzyva. Aardvark has longer latency than others (at low throughput), and sustains a lower throughput than PBFT or Zyzzyva.
    • Evaluated performance with failures. Byzantine failures are arbitrary (cannot enumerate all of them), so made a good-faith effort to stress the systems. HQ implementation crashes with a faulty client (not all error handling was implemented). PBFT, Q/U and Zyzzyva drop to zero throughput. Aardvark maintains peak performance (although this is lower than the other schemes’ peaks).
    • Also looked at effect of delay.
    • Q: why does the hybrid MAC/signature protocol require a MAC? If we don’t use a MAC, we don’t know who is sending the message so the MAC gives us a quick way to identify the sender and blacklist.
    • Q: given that people are already reluctant to use BFT, why would they take the performance hit? How well should you perform under failures or no failures? Imagine there is a range of protocols there, and could choose a trade-off.
    • Q: how would you deal with heterogeneous speeds in the adaptive scheme (and faulty nodes causing an attack there)? Looking at symmetric systems and base throughput on the history over previous views.

    Zeno: Eventually Consistent Byzantine-Fault Tolerance

    • Data centre storage systems are the backbone of many internet-based services (e.g. Amazon, Facebook, Google). They have high availability and reliability requirements. Cost of downtime is huge.
    • Example of Amazon’s Dynamo shopping cart service. Needs reliable storage and responsiveness. Dynamo achieves reliability through replication. It achieves responsiveness by allowing stale state to be viewed during failures, and eventual consistency.
    • Cannot simultaneously achieve strong consistency and high availability if network partitions are possible (CAP theorem). Many storage backends prefer availability over consistency (e.g. Dynamo, PNUTS, Cassandra).
    • Two fault models: crash and Byzantine. Many deployed systems assume crash faults because the infrastructure is trusted. But Byzantine faults can happen (S3, Google and Netflix have had multiple-hour outages), and the majority of database bugs exhibit non-crash behaviour. So use BFT, which withstands arbitrary faults using 3f+1 replicas to tolerate f faults. Used for mission-critical systems, e.g. avionics. Much work has improved BFT performance, but what about availability?
    • Existing protocols strive for strong consistency, which assumes the abstraction of a single correct server. Need >= 2/3 of replicas to be available.
    • Key idea is relaxed consistency to give availability. Data is available when other nodes block, but sometimes stale. Zeno is an eventually consistent BFT protocol.
    • What is an eventually consistent BFT service? Assume three clients, A, B and C, that are accessing the service. Model service state as partial order on operations. Have a committed history and one or more tentative histories from some point in time onwards. Can merge tentative histories to give a committed history. But some operations (e.g. two add to baskets of an item where only one is available) can be inconsistent. Therefore have “strong” and “weak” operation types. A weak operation observes eventual consistency and may miss previous operations, but will eventually get committed (e.g. add/delete items to/from shopping cart). A strong operation always observes the committed history (e.g. checkout in a shopping cart: only pay for what you buy).
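The committed-versus-tentative split described above can be sketched as follows. This is a minimal illustration, not Zeno's actual data structures; the class and method names are invented for the example.

```python
# Sketch of committed vs. tentative histories: weak operations append to a
# tentative suffix and return quickly (possibly observing stale state);
# a strong operation only executes against the committed history, so all
# earlier weak operations must first be committed.

class History:
    def __init__(self):
        self.committed = []   # totally ordered, agreed prefix
        self.tentative = []   # weak ops not yet committed

    def weak_op(self, op):
        self.tentative.append(op)   # eventually consistent: may miss ops

    def strong_op(self, op):
        # Strong ops observe the committed history, forcing reconciliation
        # of all pending weak operations first.
        self.committed.extend(self.tentative)
        self.tentative.clear()
        self.committed.append(op)

h = History()
h.weak_op("add(item1)")
h.weak_op("add(item2)")
h.strong_op("checkout")   # only pay for what you buy
assert h.committed == ["add(item1)", "add(item2)", "checkout"]
```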
    • Zeno has four components: the normal case for strong and weak operations; handling a faulty replica; conflict detection; and conflict resolution.
    • Zeno requires 4 replicas (3f+1, f=1).
    • [Detailed description of the protocol.]
    • Strong quorum is used for strong consistency: ensures that no two requests are assigned the same sequence number (need 2f+1=3 matching replies). Weak operations don’t use the strong quorum: just need f+1 matching replies. With a weak quorum, intersection is not guaranteed, but it is not necessary for eventual consistency.
    • In event of a faulty primary, must be able to do a view change. Typically these require strong quorums. Zeno has a weak view change protocol that only requires weak quorums, which is necessary for high availability.
    • [Detailed description of conflict detection protocol.] Based on sequence number mismatch (same sequence number assigned to different requests).
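The sequence-number-mismatch check can be sketched as below. This is an illustration of the idea only (the function and log structure are assumptions, not Zeno's wire protocol).

```python
# Sketch of conflict detection: a replica records the request digest it
# saw for each (view, sequence-number) pair; a *different* digest for the
# same pair is evidence of divergent histories and triggers resolution.

def detect_conflict(log, view, seqno, digest):
    """Return True if (view, seqno) was already bound to a different request."""
    key = (view, seqno)
    if key in log and log[key] != digest:
        return True              # same sequence number, different request
    log.setdefault(key, digest)  # first binding for this slot
    return False

log = {}
assert not detect_conflict(log, view=0, seqno=7, digest="d1")
assert not detect_conflict(log, view=0, seqno=7, digest="d1")  # duplicate is fine
assert detect_conflict(log, view=0, seqno=7, digest="d2")      # conflict detected
```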
    • Conflict resolution: weak operations are propagated between primaries of weak views, and finally reconciled. Correctness proof in the technical report.
    • Evaluated with a simulated workload with a varying fraction of weak operations. With no concurrent operations, compared against Zyzzyva. Look at all-strong, 50% weak and all-weak mixes, comparing the throughput of Zyzzyva, Zeno (strong) and Zeno (weak). Weak operations continue to make progress in the presence of a network partition (but stall briefly when the partition is resolved, presumably while the histories are merged).
    • With concurrent operations that conflict (during the network partition): weak throughput drops briefly when the partition occurs, and takes a slightly worse hit when the partition is resolved (but actually performs better during the partition). So Zeno provides higher availability than Zyzzyva.
    • Q: instead of working with arbitrary partitions, could you exploit cliques on either side (i.e. in separate data centres)? Yes, definitely.
    • Q: what happens to the client state when the operations are rolled back during reconciliation? The result that you see might not be the final result: this influences the choice of weak operations. What if you insert weak operations before strong operations? When a strong operation is committed, all weak operations before it must be committed.
    • Q: throughput results were an order of magnitude lower than in previous talk? Just used a small number of clients.
    • Q: can you give an example of an application where it is okay to have a period of strong consistency, followed by one where the results may be obliterated by the conflict resolution? Shopping cart is a prime candidate. But future operations may rely on operations that have not yet been committed? Yes, that’s a design choice.
    • Q: is it true that you will always end up with divergent histories? If you assume that we have signatures, then no.