OSDI 2010: Day 3

October 6th, 2010

Production Networks

Onix: A Distributed Control Platform for Large-scale Production Networks

  • In traditional networking, a router couples the forwarding plane and the control plane together in one box. Changing one without the other is difficult to do. This slows down ideas like the TRILL protocol, which took 6 years to deploy.
  • Software-defined networking moves the control plane into a server outside the router that talks to the router using a management protocol, sending down forwarding entries to the router, and receiving information about new flows, etc.
  • One of the problems with L2 Ethernet networks is ARP broadcast, which doesn’t scale. One idea is a centralized ARP server queried by unicast. However, you don’t want a single point of failure.
  • Onix is a software-defined networking platform that implements distributed control. Centralized control logic thought of as a distributed system. Simple API and flexible mechanisms for state distribution, scaling and reliability.
  • Challenges are generality, scalability, reliability, simplicity and performance.
  • The API is based on treating the network as a graph. Nodes are physical (or logical) network entities, such as a forwarding engine, which is connected to a port, which is connected to an end host. Nodes have methods defined on them (e.g. “write flow entry” for a forwarding engine). The graph is a “network information base” (NIB). (A toy sketch of this style of API follows these notes.)
  • Switching protocols integrated into the NIB, such as OpenFlow.
  • Need a way to distribute the NIB to the switches and to other control servers. Depending on the application, you might want strong consistency or eventual consistency. Two options: a replicated transactional (SQL) database, and a one-hop in-memory DHT (for eventual consistency). For an ARP server: switch topology is hard state, IP-to-MAC mappings are soft state. Applications only interact with the NIB at run-time, after setting up the distribution mechanisms initially.
  • Integrated with ZooKeeper for distributed coordination, in order to do leader election, failover, distributed locking, barriers, etc.
  • Can trade off scalability and reliability, by partitioning the control state variously. Partitioning techniques based on VLAN and MPLS (versus integration techniques like routing on IP prefixes or ATM routing areas). Provide similar interfaces to the user. Can change the subset of Onix instances to which a server connects to vary this. Or give different computational tasks to different instances. Or partition the NIB between instances.
  • Applications can reduce the fidelity of information before sharing it (based on an ATM trick).
  • For reliability, need to deal with various kinds of failures: network element/link failures (application has to deal with that), connectivity infrastructure failure (the network that connects Onix to the networking infrastructure, usually use a management backbone, but could do multipathing in that backbone), and Onix failure (application-controlled, but mitigated by distributed coordination facilities (like ZooKeeper)).
  • Implemented in 150KLOC of C++, Python and Java, from 15 developers in 3 institutions. In QA testing, and will be appearing in products.
  • Implemented Ethane (enterprise security solution) on Onix. Multiple Onix instances are necessary to handle the load from flows being set up.
  • Implemented a distributed virtual switch, which does centralized management of VMM virtual switches. Only need a single instance of Onix with a cold standby for failover.
  • Implemented a multi-tenant virtualised data center, which provides per-tenant L2 networks in a data center. Provides isolation in addressing and resource usage. Need to partition the NIB and management tasks.
  • Implemented a scale-out IP router, which does routing control over multiple physical switches.
  • Q: What is an “application” and how do they multiplex on a single network? What is the implication at the level of routing? Envision only a single application per Onix instance (app is a program that manipulates the NIB graph). Forwarding layer gets programmed using something like OpenFlow (a TCP connection to the controller), which programs a TCAM on the switch. How do you set up the network if there is no routing? Depends on the application. Need configuration support at the beginning.
  • Q: How do you refresh the ARP cache at the host, don’t you need to do broadcast? We didn’t actually build an ARP server, just an example for the talk. Imagine we could still save a lot of traffic from gratuitous ARPs.
  • Q: What is your favourite trick or hack that we might be able to reuse from your implementation? And/or what might you do differently? Favourite idea is that the distribution mechanisms are hidden from you in the NIB, and you can switch out the storage layer with your own if you want.
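
To make the NIB idea above concrete, here is a minimal sketch of a network-as-a-graph API. The class and method names are hypothetical, not the real Onix API, and real Onix also backs the graph with its distribution mechanisms (the transactional store or the DHT), which are hidden from the application.

    # Hypothetical NIB-style API: the network is a graph of typed entities with
    # attributes, links and entity-specific methods. Not the actual Onix code.

    class Entity:
        def __init__(self, name):
            self.name = name
            self.links = set()          # neighbouring entities in the graph
            self.attrs = {}             # e.g. addresses, counters, status

        def link(self, other):
            self.links.add(other)
            other.links.add(self)

    class ForwardingEngine(Entity):
        def __init__(self, name):
            super().__init__(name)
            self.flow_table = []

        def write_flow_entry(self, match, actions):
            # In Onix this would be pushed down to the switch, e.g. via OpenFlow.
            self.flow_table.append((match, actions))

    class NIB:
        def __init__(self):
            self.entities = {}

        def add(self, entity):
            self.entities[entity.name] = entity
            return entity

    # Example: an application walks the graph and installs a flow entry.
    nib = NIB()
    sw = nib.add(ForwardingEngine("switch-1"))
    port = nib.add(Entity("switch-1/port-3"))
    host = nib.add(Entity("host-a"))
    sw.link(port)
    port.link(host)
    sw.write_flow_entry(match={"dst": "host-a"}, actions=["output:3"])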

Can the Production Network Be the Testbed?

  • Hard to evaluate new network services realistically, which makes it hard to decide what should and should not be deployed in production networks. Testbed networks are too small, and may not have realistic topologies. Success on a testbed may not predict success when deployed on a real network.
  • Problem with NetFPGA: expensive to deploy and fan-out is small. Software testbeds are slow and may not have realistic topologies. And getting real users to join an experimental testbed is hard.
  • Solution: network slicing. Divide a production network into logical slices, in each of which different control plane is run. Users can opt in to a particular slice. Need to have strong isolation between slices so that actions in one don’t affect another.
  • Control plane makes up forwarding rules and pushes them down into the data plane. Data plane pushes exceptions up to the control plane (e.g. packets that don’t match any rules). Add a slicing layer between the control plane and data plane, and allow multiple control planes to control the same data plane.
  • Network slice is a collection of sliced switches or routers. The data plane remains unmodified, and the slicing layer performs rewrites on the control planes’ rules to ensure isolation.
  • Slicing policy specifies resource limits for a slice (e.g. link bandwidth, maximum number of rules, topology and CPU usage).
  • Now, allow real users to opt-in to slices on a flow-by-flow level. So one slice can handle HTTP, another can handle VoIP, etc.
  • Implemented on top of OpenFlow for communication between the control and data planes. Control logic is moved to a commodity PC. OpenFlow is just a firmware upgrade for various commodity hardware.
  • FlowVisor performs message handling (the slicing layer). On getting an exception from the data plane, FlowVisor looks up which slice controls the packet and passes it to the appropriate controller. When the controller sends back a rule, FlowVisor checks that the update is allowed, and writes it to the data plane if so. (A rough sketch of this behaviour follows these notes.)
  • Implemented for 20 OpenFlow message types as a transparent OpenFlow proxy in 8261 lines of C.
  • Isolation performance in terms of device CPU usage (other metrics are considered in the paper). Need to prevent CPU exhaustion, using rule-insertion rate limiting. Consider a malicious slice that tries to do DoS on the switch. Isolation reduces this to 40% CPU usage on average (though bursty).
  • Runs on the production network at Stanford, with 15 switches, 35 access points, 25 users and >1 year of use. Host 7 different OpenFlow demos on the same physical network. Also runs in many different research groups.
  • Not explicitly virtualization, because currently only deal with slices of the physical topology. Looking into virtual nodes in future.
  • Q: How do you create isolation between slices so that one slice doesn’t do too many flow setups per second? This is handled by isolating switch CPU usage.
  • Q: What about latency?  Add about 0.483 ms of latency.
  • Q: Don’t you have to have n times as much forwarding memory, which creates problems for realism? Tricks that you can play in terms of caching, but it is true that the number of rules becomes a scarce commodity (second most important after device CPU).
  • Q: [...] Experimental services use the production slice for their control plane. Worked out an in-network control protocol.
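
A rough sketch of the slicing-layer behaviour described above: map data-plane exceptions to the slice whose flowspace covers them, and check controller rules against that flowspace and the slice's resource budget before they reach the data plane. Field names, the flowspace predicates and the rule-budget check are simplified assumptions, not FlowVisor's actual code, which handles 20 OpenFlow message types and rewrites wildcard rules.

    class Slice:
        def __init__(self, name, controller, flowspace, max_rules):
            self.name = name
            self.controller = controller    # callable: packet -> flow rule or None
            self.flowspace = flowspace      # predicate over packet/match fields
            self.max_rules = max_rules
            self.installed = 0

    def packet_in(slices, packet):
        """Data-plane exception: find the slice whose flowspace covers the packet
        and hand it to that slice's controller."""
        for s in slices:
            if s.flowspace(packet):
                rule = s.controller(packet)
                return flow_mod(s, rule) if rule is not None else None
        return None                         # no slice owns this traffic: drop it

    def flow_mod(s, rule):
        """Controller wants to install a rule: enforce the slicing policy first."""
        if not s.flowspace(rule["match"]):
            raise PermissionError(f"rule outside slice {s.name}'s flowspace")
        if s.installed >= s.max_rules:
            raise PermissionError(f"slice {s.name} exceeded its rule budget")
        s.installed += 1
        return rule                         # would be written to the switch here

    # Example: one slice owns all HTTP traffic and installs exact-match rules.
    http_slice = Slice(
        "http-experiment",
        controller=lambda pkt: {"match": {"tp_dst": 80, "src": pkt["src"]},
                                "actions": ["output:2"]},
        flowspace=lambda fields: fields.get("tp_dst") == 80,
        max_rules=1000)
    print(packet_in([http_slice], {"src": "10.0.0.5", "tp_dst": 80}))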

Building Extensible Networks with Rule-Based Forwarding

  • Goal is to have flexible forwarding in the network. A network becomes more powerful if its forwarding plane is more flexible. For example, we might want forwarding to be aware of middleboxes (for intrusion detection). But also multipath, active networks, multicast, and so on.
  • However, flexibility is not enough, because it must be secure as well. Need policy support to constrain forwarding directives. And need every entity that appears in a forwarding directive to be able to refuse it. Not respected today (e.g. denial of service, since the source performs forwarding). Or might want to do IP source routing but give every intermediate ISP the ability to refuse the route.
  • Idea: packet rule, which carries forwarding directive in the packet itself. All intermediate nodes must validate the rule. Rules encapsulate a proof of policy compliance.
  • Rules are owned by destinations (but they could be owned by other nodes). They are certified by trusted third parties. An extended DNS returns rules instead of addresses. Routers verify the rule signature and follow rule directives. Packets may also have a return rule for allowing data to be returned to the source.
  • Distribution and certification is done in the control plane (in the paper). Talk focuses on the data plane.
  • Assume: an anti-spoofing mechanism (ingress filtering), existence of rule-certifying entities and key distribution, and DNS well-provisioned against DDoS.
  • A rule is a sequence of actions in if-then-else form. Conditions are comparisons over packet headers and router state. Allowable actions are: modify the header, forward the packet, drop the packet, or pass it up to the control layer. (A toy encoding is sketched after these notes.)
  • Rule forwarding engine can modify packet attributes, such as adding explicit congestion notification. Arbitrary attributes on a packet may be set. Need to make attribute setting secure, by using strong cryptography or checking the source (to avoid circumventing an IDS middlebox).
  • Rules can be used to prevent DoS, as they can work like network capabilities. Requires the ability to dynamically grant rules, which is unusual for DNS, so would need to do some proxying.
  • Many other examples of applications that work in this architecture.
  • Rules are flexible, policy-compliant and safe (bounded forwarding time, and cannot modify router state or amplify traffic). Invoked functions are controlled by the ISPs and middlebox owners. Resource isolation can be used to prevent attacks.
  • Evaluated several questions: size overhead of rules, forwarding overhead, performance isolation, load on certifiers and security. Only looking at overhead in the talk.
  • Rule compiler transforms the rules into a binary form. Rule size is between 60 and 140 bytes. Average across examples is below 85 bytes (13% overhead on an average IP packet). Overhead is 27% if using RSA signatures.
  • Built a prototype router on top of RouteBricks. Tested on an 8-core Nehalem with 2 dual-port NICs. Compared RBF to plain RouteBricks, and saw little overhead (no overhead for packets over 500 bytes). Can forward 35Gbps on a real traffic distribution.
  • Signature verification happens only at “trust boundary routers”. Results can be cached (about 19 bytes per rule), and new flows tend to represent 1% of backbone line capacity on average (5% in worst case). Existing hardware can cope with this, and it can be parallelized, so throwing more hardware at it will support higher bandwidth links.
  • Q: What if as a receiver, I want packets to take a particular path… can I drop those packets earlier in the path? No, can only verify the previous hop. Would need stronger crypto guarantees to detect that. What about stakeholders who aren’t named in rules? In a way, we are like overlay networks. We do not name entities between waypoints. But since they approve packet forwarding, you could see them as overlay nodes.
  • Q: What is the best-case/worst-case for ping? Latency is negligible in our implementation.
  • Q: Can you comment on deployability? Could use network slicing. Could be partially deployed at some routers and get many of the benefits. What is the incentive for ISPs? The ability to diversify their offering to customers.
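
A toy encoding of such a rule, as promised above: a list of (condition, actions) clauses evaluated in order, with the action vocabulary restricted to header modification, forward, drop, and punt to control. This illustrates the idea only; it is not the paper's compiled, signed binary rule format.

    # Hypothetical if-then-else rule encoding. Because the only actions are
    # header modification, forward, drop and send-to-control, forwarding time
    # is bounded and rules cannot modify router state.

    def evaluate_rule(rule, packet, router_state):
        for clause in rule:                       # rule = [{"if": ..., "then": ...}]
            if clause["if"](packet, router_state):
                for action in clause["then"]:
                    kind = action[0]
                    if kind == "set":             # modify a header/attribute field
                        _, field, value = action
                        packet[field] = value
                    elif kind == "forward":       # forward towards a named next hop
                        return ("forward", action[1])
                    elif kind == "drop":
                        return ("drop", None)
                    elif kind == "control":       # punt to the control layer
                        return ("control", None)
        return ("drop", None)                     # default-deny if nothing matched

    # Example: send packets through an IDS middlebox unless it has marked them ok.
    ids_rule = [
        {"if": lambda p, s: p.get("ids_ok") == 1,
         "then": [("forward", "destination")]},
        {"if": lambda p, s: True,
         "then": [("set", "needs_ids", 1), ("forward", "ids_middlebox")]},
    ]
    print(evaluate_rule(ids_rule, {"dst": "10.0.0.7"}, {}))   # forwarded to the IDS first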

Mobile

TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones

  • 60 million smart phones sold in Q2 this year. But privacy is very poor on these devices, especially in terms of the applications that run on them (e.g. a wallpaper application on Android). Challenge is to balance fun and utility with privacy. Idea is to look inside applications to watch how they use privacy-sensitive data, and determine how it leaves the phone.
  • Challenges include resource constraints, allowing apps to use some privacy-sensitive information, context-based privacy and applications sharing information.
  • Idea: use dynamic taint analysis, with a taint source, taint propagation and a taint sink. Need to trade off between performance and granularity. TaintDroid does taint tracking on Android, with variable tracking throughout Dalvik, handling native method invocation, and extending to storage.
  • Modified the Dalvik interpreter to store and propagate taint tags on variables. Store tags adjacent to local variables/args on the internal execution stack, for locality. For class fields, store them in instance fields on heap objects. For arrays, store one tag per array to minimize overhead (in particular because many arrays are strings).
  • Various rules govern taint propagation through data flow (a toy example is sketched after these notes).
  • JNI is used to execute native methods. TaintDroid doesn’t support third-party native code, but uses a combination of heuristics and method profiles to patch up the tracking state on return from native methods.
  • For IPC and file propagation, taint at the message level and the file level. File taint is stored in an XATTR on the file.
  • On the CaffeineMark benchmark, overall 14% performance overhead. Relatively worse for the “string” workload. 4.4% memory overhead. IPC has a 27% overhead. Macrobenchmark overheads are between 3% and 29% (worst for taking pictures: 0.5 seconds).
  • Sources and sinks are carefully integrated into the existing framework. Need to treat sources differently based on the properties of the information that they provide. IMSI needed particularly careful handling.
  • Started with the top-50 most popular applications in categories on the Android Marketplace, and winnowed it down to 30 applications that used the internet, location, microphone and camera. 105 connections were flagged, but only 37 were clearly legitimate.
  • 15/30 applications were sharing location with a third-party advertiser, usually in plaintext (e.g. in the HTTP GET url for AdMob, which also contained a unique ID for the particular phone). Sometimes every 30 seconds. No sharing was obvious to the user or covered in the EULA.
  • 7 applications sent the IMEI, and 2 apps sent the phone number, IMSI, or ICC-ID to a remote server. One application’s EULA mentioned that the IMEI would be sent; another sent a hash of the IMEI.
  • Limitations to the approach: only explicit data flows. Sources: IMSI contains country and network codes, so tainting it is too conservative.
  • Source code should be released next Wednesday.
  • Q: What about timing attacks, and what implicit flows do you think are important? Need to do static analysis to capture all implicit flows, and limitations would lead to false positives.
  • Q: Have you thought about providing this application as a website to which people could upload applications? Haven’t thought about that specific model, but definite potential there.
  • Q: How do you stop applications modifying the XATTR to remove taint? XATTR is not used in Android, so could use LSMs to do this. Accessing it requires third-party native libraries, which we don’t support (can’t access them from Java).
  • Q: Are you planning to do anything for control dependencies, and how would that impact performance and false positives? We are not attempting to identify privacy violations, just that privacy-sensitive information has left the phone. Using the information for control is the next step.
  • Q: Now that TaintDroid is public, what do you see as its future purpose (since people will now try to get around it)? Going to look at more techniques in future (implicit flows), though we can still catch low-hanging fruit where the leak is unintentional.
  • Q: [...]. Understanding how information is going to be used is a tough challenge. Want to integrate privacy-preserving technologies.
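
As a toy illustration of the propagation rules mentioned above: taint tags are bitvectors, data-flow rules propagate the union of the source tags to the destination, arrays carry a single tag, and a network sink checks the tag of outgoing data. TaintDroid does this inside the Dalvik interpreter; the Python below is only meant to show the shape of the rules.

    TAINT_LOCATION = 0x1
    TAINT_IMEI     = 0x2                        # further sources get further bits

    class Tainted:
        def __init__(self, value, tag=0):
            self.value, self.tag = value, tag

    def binop(op, a, b):
        # data-flow rule: destination tag is the union of the operand tags
        return Tainted(op(a.value, b.value), a.tag | b.tag)

    def array_get(array_tag, element):
        # one tag per array: a read inherits the whole array's tag
        return Tainted(element.value, element.tag | array_tag)

    def network_send(sink_name, data):
        # taint sink: flag privacy-sensitive data leaving the phone
        if data.tag:
            print(f"ALERT: data tagged {data.tag:#x} sent to {sink_name}")

    lat = Tainted(37.4275, TAINT_LOCATION)      # taint source: location API
    lon = Tainted(-122.1697, TAINT_LOCATION)
    msg = binop(lambda a, b: f"{a},{b}", lat, lon)
    network_send("ads.example.com", msg)        # -> ALERT: data tagged 0x1 ...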

StarTrack Next Generation: A Scalable Infrastructure for Track-Based Applications

  • Since many phones can now identify their own location, many apps now use this information. A track is a sequence of (location, timestamp) readings. Applications could use these tracks to e.g. personalise driving directions (collapsing directions that are familiar, such as “Take 101 north”, rather than a bunch of redundant short steps around one’s own house).
  • Application taxonomy: use current location, use past locations or use tracks. StarTrack deals with the last case (e.g. personalised driving directions, track-based search, ride sharing, discovery, urban sensing).
  • Two kinds of application: insertion and retrieval/manipulation/comparison operations. ST server is a database designed to store geometric data efficiently.
  • Challenges: handling error-prone tracks, providing a flexible API and efficiently implementing track operations. (Also must provide scalability and fault tolerance.)
  • Raw tracks are difficult to compare for equality, since there may be noise, and readings may be taken at different times. Canonicalisation enables more efficient retrieval and comparison, by mapping similar tracks onto the same path.
  • API needs to: pre-filter tracks (based on similarity criteria), manipulate them and fetch them. Track collections are abstract groupings of tracks, which aid programming convenience and implementation efficiency (preventing unnecessary message exchanges and enabling caching).
  • Manipulation operations include joining collections, sorting them, taking a subset, getting similar ones, getting common segments, and getting individual tracks.
  • Showed a code fragment for a ride-sharing application in terms of the StarTrack API.
  • In the DB, StarTrack maintains only non-duplicate tracks per user. A table of unique coordinates is kept separately. Quadtrees are used to do geographic range searches.
  • To compute track similarity, compute the length of common segments over the length of all segments in both tracks (i.e. Jaccard similarity). Conventional DBs don’t support this efficiently. The track tree data structure is an efficient representation for this, for doing get-similar-tracks and get-common-segments. (A brute-force version of the similarity measure is sketched after these notes.)
  • Evaluated the performance of track trees and two sample applications (driving directions and ride sharing). Used synthetically-generated tracks, 9 servers and 3 DB servers.
  • Evaluated get-similar-tracks against DB, in-memory filtered, and in-memory brute force approaches. Track tree query takes 1ms for up to 100K tracks in the collection. Other approaches are much more costly. To build a track tree takes up to 2 minutes for 100K tracks, and about 170MB of RAM.
  • Personalised driving directions uses get-common-segments. With caching, requests take about 50 ms until throughput hits a wall at about 260 requests per second. Ride sharing can do 30 requests/second at 170 ms response time.
  • Have a working infrastructure, and looking for users.
  • Q: Have you tried to pair intent and content with a track? We have the availability to stick metadata on tracks, so different applications could associate different metadata. Expect we could have ACLs on this metadata.
  • Q: How do you handle errors in the tracks? Canonicalisation handles this, pre-filtering obviously-erroneous points. What technique do you use for this? Rely on the underlying algorithm but don’t use statistical techniques.
  • Q: Have you thought about how you might do non-intersecting paths in a privacy preserving way (e.g. to avoid an ex)? Thought about it, but it is very hard. Access control might address this, but would like to have a way to aggregate it using differential privacy techniques.
  • Q: Why does the performance curve hit a wall? Network congestion and packets getting dropped.
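
A brute-force version of the similarity computation referenced above, assuming tracks have already been canonicalised into sets of road-segment IDs with known lengths. StarTrack answers the same query with its track-tree index rather than this linear scan, and the segment names here are made up.

    def jaccard_similarity(track_a, track_b, seg_len):
        """track_a, track_b: sets of canonical segment IDs; seg_len: id -> metres."""
        shared = track_a & track_b
        union = track_a | track_b
        return sum(seg_len[s] for s in shared) / sum(seg_len[s] for s in union)

    def get_similar_tracks(collection, query, seg_len, threshold=0.7):
        # brute force: the track tree avoids touching every track in the collection
        return [t for t in collection
                if jaccard_similarity(t, query, seg_len) >= threshold]

    seg_len = {"101N-a": 800, "101N-b": 1200, "univ-ave": 300, "embarcadero": 500}
    commute = {"101N-a", "101N-b", "univ-ave"}
    other   = {"101N-a", "101N-b", "embarcadero"}
    print(jaccard_similarity(commute, other, seg_len))   # 2000 / 2800 ≈ 0.71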

Virtualization

The Turtles Project: Design and Implementation of Nested Virtualization

  • Approach is to multiplex the hardware between different layers of virtualization. Focus on two layers of nesting.
  • There is a VMCS for each layer transition; the bottom-level hypervisor merges them so that it can run the nested guest directly on the hardware (only the bottom level runs in root mode).
  • Need to avoid multiplying the number of vmexits to one per nesting layer. One idea is to make a single exit fast, and another involves cutting down their frequency.
  • Multi-dimensional paging cuts down the number of exits. Baseline approach uses shadow paging on shadow paging (assuming no EPT). Better to use a shadow page table on EPT (secondary hardware table), which turns three logical translations into two in hardware. Best is to observe that since EPT tables rarely change, do multi-dimensional paging, which compresses the two levels of EPT into one (from layer 0 to 2). (A toy composition of the two translation levels is sketched after these notes.)
  • For I/O virtualization, could do device emulation, or paravirtualized drivers, or (even better) direct driver assignment. This requires an IOMMU.
  • For nested, need to do multi-level device assignment. Idea is to have the bottom level emulate an IOMMU for the first level, and the second level access the device directly.
  • Experimental set up involved kernbench, SPECjbb, and netperf. Evaluated different hypervisor choices for the nested performance. Overhead is 9.5% in the guest, and a further 14.5% in the nested guest for kernbench. For SPECjbb, it’s 7.6% in the guest and a further 7.8% in the nested guest.  vmread and vmwrite operations were the most expensive, so these were hacked to not cause a trap.
  • Multidimensional paging improves kernbench a lot (3.5x), but SPECjbb and netperf are roughly the same.
  • Multi-level device assignment is the best performing I/O setup, but it adds an overhead of 3x in CPU cost. Interrupts cause exit multiplication. If we could deliver interrupts directly to the nested VM, the overhead would just be 7%. But can still fill a 1G pipe.
  • Q: Do you have a feeling for how the overhead will scale as the number of levels increases? Hope that additional levels will add approximately the same overhead.
  • Q: Are you confident that you could run this under arbitrary unmodified hypervisors? Not advocating rootkits at all.
  • Q:
  • Q: Would binary translation help to apply changes to e.g. the vmread and vmwrite instructions? Two ways to solve this: either get better hardware support or do binary translation. Not feasible to do binary translation in this case.
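
A toy illustration of the multi-dimensional paging idea referenced above: compose the L1 hypervisor's EPT (guest-2 physical to guest-1 physical) with L0's EPT (guest-1 physical to machine frames) into a single table mapping guest-2 physical pages straight to machine frames, so the hardware walks only one extra level. Page-number dictionaries stand in for real EPT structures here.

    # ept_12: guest-2 physical page -> guest-1 physical page (built by the L1 hypervisor)
    # ept_01: guest-1 physical page -> machine page          (built by L0)
    # The composed table lets the hardware use a single EPT for the nested guest.

    def compose_ept(ept_12, ept_01):
        ept_02 = {}
        for g2_page, g1_page in ept_12.items():
            if g1_page in ept_01:                 # only map pages L0 has backed
                ept_02[g2_page] = ept_01[g1_page]
        return ept_02

    ept_12 = {0x10: 0x200, 0x11: 0x201}
    ept_01 = {0x200: 0x7f30, 0x201: 0x7f31}
    print(compose_ept(ept_12, ept_01))            # guest-2 pages map straight to machine pages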

mClock: Handling Throughput Variability for Hypervisor IO Scheduling

  • Done this work because customers need it.
  • Hypervisor currently multiplexes hardware resources between VMs. Three possible controls: reservation (minimum guarantee), limit (non work-conserving) and proportional share. Supported for CPU and Memory in ESX Server since 2003. Now looking at I/O resource allocation.
  • Contention for I/O resources can have an arbitrarily bad effect on a VM’s resource allocation. Hard problem for storage (considered in terms of iops) to get reservations or limits, because you need to continuously recompute the allocation in the face of bursty, variable workloads. Also, since this access may be distributed and shared, it is even more complex.
  • A single algorithm to support all controls, handling variable and unknown capacity, and that’s simple to implement.
  • Typical proportional-share scheduling assigns a weight to every application, and tags each request with the application. Then requests are spaced based on their tag and the weight of their application.
  • Three key ideas in mClock: use real-time tags (instead of virtual time tags) based on wall-clock time. Use separate tags for reservation, shares and limits. Then use dynamic tag selection and synchronization.
  • Scheduling proceeds by first doing constraint-based scheduling, then weight-based scheduling once reservations are met. (A condensed sketch of the scheduler follows these notes.)
  • Idle VMs are synchronized when they make a new request.
  • Need to scale the number of requests based on their size.
  • Distributed mClock for clustered storage architectures.
  • Experimented with a 3 to 6 VM cluster with a mixture of Windows and Linux hosts. Workloads: iometer and the Filebench OLTP benchmark. Showed an experiment with diverse workloads and allocations. Compared SFQ(D) to mClock. Showed that mClock closely enforces the limits and reservations for the relevant VMs.
  • Another experiment looking at burst handling, when idle VMs get benefit from spare capacity.
  • Looked at an OLTP workload on both SFQ(D) and mClock. On SFQ(D), the reservation is not respected, whereas mClock manages it.
  • Distributed mClock experiment, with three servers and three clients, and various reservations.
  • Future work involves extracting SLAs and deriving scheduling policies from those.
  • Q: Why does the total number of IOPS become lower for mClock than for SFQ(D) when two VMs are switched on? Some variance would be acceptable, due to the varying workloads and access patterns etc. Should note that the VMs are not doing exactly the same workloads in each case.
  • Q: Can you comment on how you make the trade-off between throughput and latency? Focussed on getting a certain amount of throughput. Previous work looked at latency. Overall, to do a good job at latency, need support from the underlying array vendor.
  • Q: [...]. We group I/O into 32-request batches onto the LUN.
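
A condensed sketch of the tag-based scheduling described above. The tag spacing (1/r, 1/l, 1/w in wall-clock time) and the two dispatch phases follow the description in the talk, but idle-VM synchronization, request-size scaling, tag adjustment after weight-based dispatch and the distributed variant are all omitted, so treat this as an outline rather than mClock itself.

    import time

    class VM:
        def __init__(self, name, r, l, w):
            self.name, self.r, self.l, self.w = name, r, l, w   # reservation, limit, weight
            self.R = self.L = self.P = 0.0       # last assigned tags
            self.queue = []                      # FIFO of pending requests

        def enqueue(self, req, now):
            self.R = max(self.R + 1.0 / self.r, now)   # reservation tag
            self.L = max(self.L + 1.0 / self.l, now)   # limit tag
            self.P = max(self.P + 1.0 / self.w, now)   # proportional-share tag
            self.queue.append((self.R, self.L, self.P, req))

    def dispatch(vms, now=None):
        now = time.time() if now is None else now
        backlogged = [v for v in vms if v.queue]
        # Phase 1 (constraint-based): serve any request whose reservation tag is due.
        due = [v for v in backlogged if v.queue[0][0] <= now]
        if due:
            vm = min(due, key=lambda v: v.queue[0][0])
            return vm.name, vm.queue.pop(0)[3]
        # Phase 2 (weight-based): among VMs not exceeding their limit, pick the
        # smallest proportional-share tag.
        eligible = [v for v in backlogged if v.queue[0][1] <= now]
        if not eligible:
            return None                          # everyone is at its limit
        vm = min(eligible, key=lambda v: v.queue[0][2])
        return vm.name, vm.queue.pop(0)[3]

    vms = [VM("web", r=100, l=500, w=2), VM("batch", r=50, l=1000, w=1)]
    for i in range(3):
        vms[0].enqueue(f"web-io-{i}", 0.0)
        vms[1].enqueue(f"batch-io-{i}", 0.0)
    print(dispatch(vms, now=1.0))                # web's reservation tag is due first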

Virtualize Everything but Time

  • Problem is clock synchronization, which is important for network monitoring, traffic analysis, telecoms, finance, gaming, and distributed scheduling.
  • The status quo under Xen is based on ntpd, but virtualization amplifies its flaws, and it fails when live VM migration is done.
  • Propose a new architecture based on RADclock (a clock synchronization solution): robust, accurate and scalable, with support for seamless migration.
  • Idea: give each physical host a single clock that never migrates. A stateless clock-read function migrates with each VM. (Concept of a “dependent clock” (not new).)
  • Context is paravirtualized Xen guests. It would be more complex to do it for fully-virtualized guests, and is not discussed here. Implemented for pvops Linux (2.6.31.13) on Xen 4.0.
  • Clocks are built on local hardware (oscillators/counters). However, they are imperfect and drift (often due to temperature). The TSC counts CPU cycles and is the highest resolution/lowest latency time source, but it might be unreliable. HPET is reliable but low-resolution and high latency. The Xen clocksource is a hardware/software hybrid that the hypervisor provides, and aims to combine the reliability of HPET with the low latency of TSC: a 1GHz 64-bit counter. Performs well, but HPET is not far behind and a lot simpler.
  • Raw timestamps and clock timestamps are different, and scaled counters are insufficient since they drift. Network based synchronization is convenient, but suffers from delay variability and path asymmetry. (Typically assume symmetry.)
  • Two synchronization algorithms under discussion. First is ntpd, which is feedback-based, using the Xen clocksource on Xen. RADclock uses raw timestamps, and can use any raw counter, including HPET and the Xen clocksource.
  • Experimental setup includes an atomic clock, a GPS receiver and a network connection to a stratum-1 NTP server.
  • With the client and time server on the same LAN, and a 16-second polling period, can get 60us error. Using a feedback-directed polling period with multiple servers (but no load, no traffic and high quality servers) can lead to millisecond-level errors.
  • On Xen, ntpd is much worse. Consider a dependent ntpd clock, independent ntpd clocks and migrating independent ntpd clocks. With a dependent ntpd clock, only dom0 runs ntpd and the Xen clocksource is used (no clock sync in the guests). If an independent ntpd clock per guest runs (resource-hungry), works almost as well as non-virtualized but with a bit more noise due to virtualization latency. Adding some load and churn leads to millisecond-sized errors. A migrating independent ntpd clock is worse [...].
  • RADclock does raw counter reads, not clock reads and is independent of the clock algorithm. Has a synchronization algorithm based on raw timestamps and server timestamps with feed-forward.
  • Feed-forward allows you to define a “difference clock” which can determine time differences on a small timescale with high accuracy. This is inherently difficult in a feedback-based system. (A sketch of the feed-forward clock read follows these notes.)
  • Now have dom0 run a RADclock daemon that talks to timeserver, and make information available through XenStore.
  • For migration, only the clock-reading function actually migrates, and no state migrates. However, since we can have asymmetry and it will be different on a different host, this will result in a small clock jump. (No algorithm can do anything about this.)
  • Migration experiment. The ntpd clock has errors on the order of seconds and convergence time on the order of hours. The RADclock-based solution is much more stable.
  • Q: What about the protocol for network synchronisation with RADclock: could you keep the NTP protocol to allow incremental deployment? We do use the NTP protocol.
  • Q: How much is RADclock buying you over NTP when the biggest gain comes from using independent clocks? Could you modify NTP to do dependent clocks well? Problem is that two heads try to control the same state.
  • Q: How does your clock perform under a workload that might cause temperature variation? High temperature is not a problem: rapidly-changing temperature is the problem.
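
A sketch of the feed-forward clock-reading idea described above: the per-host daemon estimates a counter period and origin from its exchanges with the time server, and the stateless clock-read function that migrates with a guest is just arithmetic over the raw counter and those parameters (published, e.g., via XenStore). The parameter names and values below are illustrative.

    class FeedForwardParams:
        def __init__(self, period, K, offset):
            self.period = period      # seconds per tick of the raw counter (HPET, Xen clocksource, ...)
            self.K = K                # counter origin: absolute time when the counter read zero
            self.offset = offset      # slowly varying correction from the sync algorithm

    def absolute_time(params, raw_counter):
        # Absolute clock: uses all three parameters. After migration it may jump
        # slightly, because path asymmetry to the server differs on the new host.
        return params.K + params.period * raw_counter + params.offset

    def difference_time(params, raw_start, raw_end):
        # Difference clock: only the period matters, so short intervals stay
        # accurate even across offset corrections.
        return params.period * (raw_end - raw_start)

    params = FeedForwardParams(period=1e-9, K=1.2867e9, offset=3.2e-5)
    print(absolute_time(params, 5_000_000_000))
    print(difference_time(params, 5_000_000_000, 5_000_250_000))   # 250 microseconds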

OSDI 2010: Day 2

October 5th, 2010

Deterministic Parallelism

Deterministic Process Groups in dOS

  • Many sources of non-determinism in real systems (shared-memory, IPC, pthreads, devices, etc.). This makes programs hard to test, debug and replicate (for fault-tolerance). Multicore only makes this worse.
  • Solution: OS support for deterministic execution. A “deterministic box” around some processes.
  • What can be made deterministic? What can be done about remaining non-determinism?
  • Internal versus external non-determinism. Internal is not fundamental (arises from scheduling artifacts) and can be eliminated. By contrast, external comes from interactions with the real world, and cannot be eliminated. External includes users and the network. Internal includes channels used between processes in the deterministic process group (shared memory, pipes, private files). Processes outside the group are external sources of nondeterminism.
  • External nondeterminism is controlled with a shim program that controls external inputs.
  • Alternately, could have made a deterministic VMM and put the whole thing in the box, but this is inflexible and costly.
  • dOS is a modified version of Linux.
  • Motivating examples: parallel computation and a web server. Consider a parallel (scientific?) computation that operates from local input files. This should execute deterministically even on a multiprocessor. A web server has a shim that does deterministic record/replay, but since internal nondeterminism is eliminated, the recording is much less costly. This makes it easy to replicate multithreaded servers. Could move web server request processing into a DPG and have all the I/O outside.
  • New system call sys_makedet() which creates a deterministic box that expands to include all children. Processes are otherwise exactly the same as Linux processes.
  • Semantics of internal determinism: processes execute as if serialized on a logical timeline.
  • External nondeterministic: e.g. read() from a socket (the what (data) and when (blocking time)), which the shim program interposes on. The shim could perhaps record the call or replay the call from a previous recording. It controls the logical time at which the call returns.
  • Other uses: deterministic replication with DPGs and shim programs. Shim ensures that replicas see the same input at the same logical time. 2PC is detailed in the paper.
  • Implemented on Linux 2.6.24 on x86_64. 8000 LOC across 50 files changed. Transparently supports unmodified binaries. Needed to modify thread scheduling, virtual memory and syscall entry/exit. Paper describes the implementation challenges.
  • Use the DMP-O algorithm from previous ASPLOS papers. Chose this because it is easy to implement. Threads run in parallel until communication is detected, then communication is serialized. Detection uses an “ownership table” that maps data to threads. Parallel and serial modes execute atomically and serialize onto the logical timeline (one time step per mode switch). (A toy simulation of this idea follows these notes.)
  • All loads and stores must be instrumented to manage shared memory, as must all syscalls for in-kernel channels (pipes, files, signals, but also address space, fd table (implicit/covert channels)). Page protection hardware is used to monitor shared memory at a page granularity.
  • Evaluated on an 8-core Xeon with 10GB of RAM. Each application runs in a DPG. Used the racey deterministic stress test to verify determinism. How much internal nondeterminism is eliminated? How much overhead does dOS impose? How does it affect parallel scalability?
  • Compare log sizes for dOS and SMP-ReVirt. SMP-ReVirt produces logs that are 4 orders of magnitude bigger.
  • Overhead compares no-DPG to DPG-only to DPG-with-execution-record.
  • For Apache, DPGs can saturate a 1G network (100KB static pages). For 10KB static pages, nondeterministic saturates, DPG drops 26% and record drops 78% (from nondeterministic). For Chromium, the slow down is about 1.6x for both DPG and record.
  • Ran parallel benchmarks on 2, 4 and 8 threads. Blackscholes, lu and pbzip benchmarks have 1 to 3x slowdown in DPGs. Dedup, fmm and make suffer more due to false sharing on a page.
  • Q: Do you intend only to run programs deterministically for debugging, or should it be always-on? Ideally always, but some workloads will not be suitable.
  • Q: How do you decide on a timeline? Logical time increments either when a fixed set of instructions are reached, or [...].
  • Q: Does putting in debug code distort the execution? Yes.
  • Q: Is it possible to write a custom MPI that replaces fine-grained shared memory with message passing that would get rid of much of the overhead? It’s certainly possible, but we haven’t done that.
  • Q: Is non-determinism really bad? Splitting between internal and external handles that, and gives you control over where the split lies.
  • Q: Did you measure CPU utilization in your Apache benchmarks? Yes, but I don’t have numbers.
  • Q: Did you think about how else you might address the fine-grained memory sharing otherwise? Could use hardware support or recompile the program using a transforming compiler. These might get some of the scalability back.
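
A toy, single-threaded simulation of the DMP-O idea referenced above: threads make progress in parallel mode only on data they own, blocked accesses execute one at a time in a fixed thread order during serial mode (transferring ownership), and each round advances the logical clock. dOS detects the accesses with page-protection hardware rather than interpreting them, so this is only meant to show why the result is deterministic.

    def run_deterministic(threads):
        """threads: dict tid -> list of (op, addr); tid order fixes the
        deterministic serialization order."""
        owner = {}                              # addr -> owning tid (the ownership table)
        pc = {tid: 0 for tid in threads}        # next access per thread
        logical_time = 0
        while any(pc[t] < len(ops) for t, ops in threads.items()):
            blocked = []
            # Parallel mode: each thread advances while it only touches data it owns.
            for tid in sorted(threads):
                while pc[tid] < len(threads[tid]):
                    op, addr = threads[tid][pc[tid]]
                    if owner.get(addr) != tid:  # unowned or owned elsewhere: wait
                        blocked.append(tid)
                        break
                    pc[tid] += 1
            # Serial mode: blocked accesses run one at a time in tid order, and
            # ownership is (re)assigned so each access can proceed.
            for tid in blocked:
                op, addr = threads[tid][pc[tid]]
                owner[addr] = tid
                pc[tid] += 1
            logical_time += 1                   # one step per parallel+serial round
        return logical_time

    threads = {1: [("w", "x"), ("r", "y")], 2: [("w", "y"), ("r", "x")]}
    print(run_deterministic(threads))           # same answer on every run: 2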

Efficient System-Enforced Deterministic Parallelism

  • Parallelism is a grand challenge, and nondeterminism makes it harder. Races are everywhere: memory access, file access, synchronization….
  • Do we have to live with races? Restrictive languages (such as Haskell, SHIM, DPJ) don’t allow races to occur in the first place. Could we have race-freedom in any language?
  • Determinator is a new OS that offers race-free parallel programming. Compatible with existing programming languages, from Java through C to assembly.
  • Central idea: a check-out/check-in model for shared state. Fork checks out a copy, threads work on private working copies of shared state, and join checks the changes back in and merges them. The idea comes from parallel Fortran DOALL, and from version control.
  • Example is a multithreaded game/simulation/actor-based program, where threads share a global array. In Determinator, each thread gets a copy of the array, and join merges the diffs.
  • Another example is Parallel Make, where there is a Heisenbug (two parallel jobs use the same temporary filename). Determinator creates a working copy of the file system for each parallel job, so the file doesn’t get overwritten.
  • Writes only get communicated at synchronization. Write/write races either go away or become deterministic conflicts, which cause a runtime exception.
  • Determinator is a ukernel, which enforces a strict hierarchy of process “spaces”. Only the root space can directly do external I/O, and children can only communicate with their immediate parent. Single thread per space. Each space has a single address space, but with a backup “shadow” space.
  • ukernel API is very small (PUT, GET and RET). User-level runtime provides higher-level abstractions (C library, pthreads, file system, process management).
  • Join does a three-way diff between the parent’s working copy, the original copy and the child’s final working copy. Optimized using VM copy-on-write. Can do the merge at page level (faster) or byte level. (A byte-level version is sketched after these notes.)
  • Can emulate a shared file system (non-persistent) with weak consistency. Can do merges at the file rather than byte granularity. Only used for intermediate results.
  • Other features include process migration for distributed computing support, deterministic scheduling, and other things (in the paper).
  • Compared single-node performance against Linux on parallel benchmarks. Some speedup over/parity with Linux on coarse-grained benchmarks. Fine-grained (e.g. LU) benchmarks get a serious performance hit.
  • Looked at the granularity of a parallel quicksort, and the break-even is around 100K elements in an array.
  • Q: Is check-out/check-in more like transactions? It absolutely resembles transactional memory, but these don’t try to provide determinism.
  • Q: point about a “PAR DO” construct in an old Comp Arch news.
  • Q: What about other forms of synchronization e.g. condition variables? These are a terrible fit for Determinator since they are not naturally deterministic. We do fork/join and barriers. Talk about non-hierarchical synchronization in the paper. Barriers allow a child to sync with its parent without terminating.
  • Q: Do you think it is ideal to support arbitrary languages, or would it be better to change the way programs are written to exploit the Determinator semantics? Trade-offs/synergy? We’ve tried to be completely language-agnostic, but interaction could be a huge benefit. A big issue is choosing the granularity for merges, so type information could help here. Or could implement the merge in a more informed way.
  • Q: How much experience do you have with figuring out the programming model to deal with conflicts? Don’t have much practical experience yet as it is just a prototype.
  • Q: Could the merge operation (working on a shared file) yield a result that I might not expect, and do I have to change my assumptions about shared files? Currently, we merge at file granularity, and don’t attempt to merge within the file: this would be a conflict.
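
A byte-level version of the join-time merge referenced above: compare the parent's and child's copies against the snapshot taken at fork, fold the child's changes in, and raise a deterministic conflict when both sides wrote the same location differently. Determinator does this with VM copy-on-write at page or byte granularity; the function below is just the merge rule.

    def join_merge(base, parent, child):
        merged = bytearray(parent)
        for i in range(len(base)):
            parent_changed = parent[i] != base[i]
            child_changed = child[i] != base[i]
            if child_changed and parent_changed and parent[i] != child[i]:
                # a write/write race becomes a deterministic runtime exception
                raise RuntimeError(f"deterministic conflict at offset {i}")
            if child_changed:
                merged[i] = child[i]
        return bytes(merged)

    base   = b"aaaaaaaa"
    parent = b"aaaaZaaa"        # parent wrote offset 4 after forking the child
    child  = b"bbaaaaaa"        # child wrote offsets 0-1
    print(join_merge(base, parent, child))   # b"bbaaZaaa"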

Stable Deterministic Multithreading through Schedule Memoization

  • Nondeterminism arises when different runs of a multithreaded program give different behaviour for the same input. Some behaviours may be incorrect.
  • Deterministic multithreading partly addresses this problem, but the schedules are tightly coupled to the input. A slight change in the input leads to a very different schedule. This complicates testing and can mean that some inputs deterministically lead to buggy schedules.
  • The idea is to have stability: memoize past schedules and repeat familiar ones.
  • TERN runs on Linux as a user-space scheduler. Schedule memoized as a totally-ordered sequence of synchronization operations. Only race-free schedules are memoized (previous work). Use symbolic execution (based on KLEE) to track the input constraints required to reuse a schedule. Memoization checks inputs against previous input constraints that were identified by the symbolic execution.
  • Evaluated on Apache, MySQL, PBZip2 and 11 scientific programs. For 13/14 programs, only needed to modify less than 10 lines of code. 100 schedules process over 90% of a real HTTP trace with 122K requests. Overhead is <10% for 9/14 programs.
  • TERN works on annotated program source, with annotations to say which inputs could affect scheduling. An LLVM module does instrumentation. A schedule cache stores memoized schedules as a mapping from constraint sets to schedules. On a hit, the schedule is replayed; on a miss, the memoizer runs. (A toy schedule cache is sketched after these notes.)
  • Simplified example based on PBZip2, which has worker threads processing blocks independently. “symbolic” annotation identifies parameters that affect the schedule.
  • Developed programming techniques to make this practical, and these are detailed in the paper.
  • Stability experiment uses a 4-day httpd trace (122K), MySQL sysbench simple and tx workloads, and PBZip2 uses the contents of /usr. Ran on a 4-core Intel box.
  • TERN can reuse over 90% of schedules for three benchmarks. MySQL TX workload only reuses 44.2% of schedules.
  • Looked at bug stability: if you change the input, do bugs occur in the changed input but not in the original? Compared to COREDET. Looked at three buggy programs: fft, lu and barnes from SPLASH-2. Symptom is that global variables are printed before being assigned the correct value. TERN never sees the bug, but COREDET does for some thread/matrix size configurations. Fewer schedules in TERN means that the chance of hitting the bug is less, and hence it is more stable.
  • What about TERN’s overhead? It is supposed to run in everyday execution, so this must be mild. Apache sees small overhead, MySQL is much greater (>30%), PBZip2 is negative. Mixture on SPLASH, -15% up to 40%. Details of the optimization/speedup are in Section 7 of the paper.
  • Future work is fast deterministic record/replay.
  • Q: Does not recording shared memory access order mean that you might not faithfully reproduce data race bugs? The premise is to make behaviour stable and deterministic, rather than eliminating bugs. But we can memoize race-free schedules only, by applying a race detector.
  • Q: Could the annotations be created automatically in future? Haven’t done this yet, but in our evaluation, the overhead is pretty small.
  • Q: How do you automatically determine constraints, and would these become bigger as execution goes on? We have two refinements that mitigate this. First, we remove redundant constraints from the constraint set. Second, we slice out branches that don’t affect synchronization orders.
  • Q: Is it true that you only memoize and replay schedules at program granularity (and cannot replay parts of schedules), and would that be a benefit (and would it be easy or difficult)? Currently not implemented, but this would be an interesting direction.
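
The toy schedule cache promised above: schedules are totally ordered synchronization operations stored together with the input constraints under which they may be reused. Real TERN derives the constraints by symbolic execution with KLEE; plain Python predicates stand in for them here, and the PBZip2-style example is invented.

    class ScheduleCache:
        def __init__(self):
            self.entries = []            # list of (constraints, schedule)

        def lookup(self, inputs):
            for constraints, schedule in self.entries:
                if all(c(inputs) for c in constraints):
                    return schedule      # hit: replay this synchronization order
            return None

        def memoize(self, constraints, schedule):
            self.entries.append((constraints, schedule))

    cache = ScheduleCache()
    # e.g. for a PBZip2-like program, the schedule depends on the worker count
    # and on whether the file spans more than one block.
    cache.memoize(
        constraints=[lambda i: i["nthreads"] == 4,
                     lambda i: i["filesize"] > 512 * 1024],
        schedule=["t0:lock", "t0:unlock", "t1:lock", "t1:unlock",
                  "t2:lock", "t2:unlock", "t3:lock", "t3:unlock"],
    )
    print(cache.lookup({"nthreads": 4, "filesize": 4 << 20}))   # hit: reuse the order
    print(cache.lookup({"nthreads": 8, "filesize": 4 << 20}))   # miss: memoizer would run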

Systems Management

Enabling Configuration-Independent Automation by Non-Expert Users

  • It is hard to automate a task on different machine configurations. KarDo observes user actions as they perform a task, and produces a solution that works across configurations.
  • Strawman: collect a separate trace for every possible configuration. But the configuration space is too big to explore completely. Also, tasks are so diverse that we are unlikely to get more than a few traces for each task.
  • KarDo can generalize from a few traces.
  • Example application: turning on the out-of-office (OOF) notification in Outlook. What if the two machines start off in different view modes? What if a Java update notification pops up between steps while a trace is being recorded? KarDo can infer which actions are non-state-modifying from traces of any task. This means far fewer traces are needed, and non-state-modifying actions can be skipped in the replayed trace.
  • Evaluated by automating tasks from the MS Help and eHow website, in numerous categories.
  • Tracing user actions is challenging because the OS only gives mouse clicks and keypresses which may be meaningless on a different machine. Use the accessibility interface to get more information about the widgets that were affected by an action, which is meaningful across machines.
  • Generalizing to a single canonical solution requires distinguishing state-modifying actions from other actions. Three kinds of action: update, commit and navigate. Only consider the update and commit actions. To identify these, use visual cues and feed them into an SVM, which labels actions. For features, use widget type, whether it closes the window, whether it changes state, what its text label is, etc. Train it using a small set of examples.
  • To generalize across different configurations, must splice navigation actions in between the state-modifying ones in order to get to the right buttons, etc. Build a global navigation graph, then do breadth-first search over the graph to reach the desired state (sketched after these notes).
  • To remove unnecessary user actions, use a two-pass algorithm. First pass removes unnecessary updates (which are later overwritten).
  • To create a canonical solution, need to create a per-task state modification graph, which may have conditional branches for different configurations (e.g. configuring a router, there may be different manufacturer applications).
  • Implemented KarDo as a thin-client backed by a server. Evaluated on 57 tasks. Used 20 diversely configured VMs, split into 10 training VMs and 10 test VMs. Each task performed manually on exactly 2 VMs.
  • Testing by direct replay of the automation recording (baseline) and comparing it to KarDo. Baseline is successful in 18% of cases. KarDo succeeds in 84% of cases.
  • Why does KarDo fail? 4% of mistakes are due to the navigation algorithm. 5% are due to missing steps and 7% are due to configuration errors.
  • Q: How many of the examples could you have handled by modifying configuration files directly? Some reasonable percentage of them, but we want to handle more general things than just configuration tasks.
  • Q: To what extent could you adapt your system to perform command-line tasks, or editing specific files? We do consider some command-line steps, but this isn’t the primary goal. Could extend it in this direction, but would need to understand the syntax and text on the command-line, which is more complex.
  • Q: What if the training steps include mistakes? We remove unnecessary steps. What about misconfiguration? Could use traces to “check” the correctness of configurations.
  • Q: What if you have to branch, how resilient is this to cosmetic perturbations to the interface? The accessibility layer is the level we consider, so we look for the presence of widgets.
  • Q: How do you determine the start of a task? People upload traces to a central server, so we assume they determine the start before recording it, and do this deliberately. What about input-invariant actions (entering user-specific text in a text box)? We flag this in the recording stage.
  • Q: Have you thought about using Google searches as an input to the trainer? Had a HotNets paper on this very thing last year.
  • Q: What if your program has privacy-sensitive data and this was captured in a trace? Do they have to be manually anonymised? How do I make sure that I don’t upload parts of my private email? In the general case, you need to rely on the user.  However, we can infer that, if many users input the same value, it is probably not private. But I have to trust the operator? You’d have to trust the automated system, not the administrator.
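
The navigation step referenced above, sketched below: given a global navigation graph learned from traces, replay searches from the machine's current UI state to the window containing the next state-modifying action. The graph encoding and the Outlook state names are illustrative; the real system identifies states and widgets via the accessibility interface.

    from collections import deque

    def navigation_path(graph, start, goal):
        """graph: state -> {action: next_state}; returns the list of actions."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for action, nxt in graph.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [action]))
        return None                      # goal unreachable in this configuration

    graph = {
        "outlook:mail":       {"click Tools": "outlook:tools-menu"},
        "outlook:tools-menu": {"click Options": "outlook:options"},
        "outlook:options":    {"click Out of Office": "outlook:oof-dialog"},
    }
    print(navigation_path(graph, "outlook:mail", "outlook:oof-dialog"))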

Automating Configuration Troubleshooting with Dynamic Information Flow Analysis

  • Systems are hard to configure, and users make mistakes. Troubleshooting is therefore difficult and expensive.
  • When we have misconfigurations today (such as a bogus value in a config file), we might look at log messages, ask colleagues, search Google or look at the code if available. The idea is to make a tool that automatically finds the root cause.
  • ConfAid uses application code to lead users to the root cause, by doing dynamic information flow analysis on binaries.
  • Developers find the root cause by tracing back through the variables and inputs that have influence on the error condition occurring. ConfAid uses taint tracking to identify these dependencies. When the application reads from a configuration file, it adds taint to variables, and traces this through data and control flow. ConfAid attempts to find execution paths that avoid the error. Different inputs have distinct taint.
  • Each variable has a taint set which identifies the inputs that might influence the variable if they change. These sets are built through data flow and control flow. Must explore both paths of a conditional statement.
  • Complete and sound analysis would lead to bad performance and a high false positive rate. Use heuristics to deal with this.
  • The bounded horizon heuristic prevents path explosion by only running a fixed number of instructions in an alternate path. However, this may lead to false positives and false negatives.
  • Also, the single mistake heuristic assumes that a configuration file contains only one mistake. This reduces the amount of taint and the number of explored paths.
  • The weighting heuristic attempts to reduce the false-positive rate by weighting different taint propagations differently: data-flow taint is weighted more heavily than control-flow taint, and branches closer to the error are weighted higher than those further away. (A toy version is sketched after these notes.)
  • ConfAid can also propagate taint between processes that use IPC system calls. Support communication via sockets, pipes, TCP/UDP and regular files.
  • Evaluated on OpenSSH, Apache and Postfix. Manually injected errors and evaluated the accuracy and time-to-discover of the root cause result.
  • Used real-world misconfigurations and randomly generated bugs.
  • The correct root cause was ranked first or second for all 18 real-world bugs. Performance also great for randomly-generated bugs, getting 85% first-place correct root causes.
  • Average execution time is 1 minute 32s. Apache is the slowest. The randomly-generated bugs only take about 23s.
  • Q: How do you handle external input or output of various types in your speculative execution? Can you do rollback? We abort alternative path exploration if we see externally observable output (e.g. a file write).
  • Q: Do you automate the parsing and identification of the configuration files? We just intercept the read call and assign taint from there. Don’t need to identify the parsing routines, etc.
  • Q: Can you combine this with backward flow analysis from the error point? (With program slicing?) This would be much more difficult. Program slicing goes backwards, but we find that the config is read at the very beginning, then run for a while, then something goes wrong. Probably not practical to do program slicing on such a workload. It would probably also give a lot of false positives, because our taint tracking is more contained.
  • Q: Could you use your technique for other types of errors, such as segfaults? You could, but we have focused on configuration problems.
  • Q: Are all of the bugs that you dealt with single-mistake bugs? Yes. How common do you think those kinds of bugs are? Well, it depends if the bugs are independent or if they influence each other. Will probably only find the first error if there are two. Relaxing the single mistake restriction in future work.
  • Q: Do you have anything to add in terms of the performance overhead (proportional to program size) and did you do anything to reduce it? Dynamic data-flow analysis is slow: 100 to 200x slowdown, and we haven’t optimized it much, but it would be hard to get much faster. If the problem happens far into the execution, it will take a long time to come up with the ranking. Idea is to use this offline when you encounter a bug.
  • Q: How is the syntax of the configuration file handled? Does ConfAid require a priori knowledge of the syntax? No, we don’t require this. Fine with both Apache and OpenSSH, without app-specific tuning. However, there might be some noise in the result.
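
A toy version of the weighted taint propagation described above: each variable carries a map from configuration tokens to weights, data flow propagates at full weight, control flow at a discount, and the tokens that reach the error site with the highest accumulated weight are ranked as likely root causes. The weights and propagation points are illustrative, not ConfAid's actual parameters.

    DATA_WEIGHT, CONTROL_WEIGHT = 1.0, 0.5

    def merge(dst, src, weight):
        for token, w in src.items():
            dst[token] = dst.get(token, 0.0) + w * weight
        return dst

    # taint introduced when the config file is read: one token per option
    timeout_taint = {"Timeout": 1.0}
    root_taint    = {"DocumentRoot": 1.0}

    # data flow: a derived variable inherits the taint at full weight
    deadline_taint = merge({}, timeout_taint, DATA_WEIGHT)

    # control flow: code run under a branch on DocumentRoot gets discounted taint
    path_taint = merge({}, root_taint, CONTROL_WEIGHT)

    # error site: whatever reaches it forms the candidate root-cause ranking
    error_taint = merge(dict(deadline_taint), path_taint, 1.0)
    ranking = sorted(error_taint.items(), key=lambda kv: -kv[1])
    print(ranking)    # [('Timeout', 1.0), ('DocumentRoot', 0.5)]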

Inside the Data Center, 2

Large-scale Incremental Processing Using Distributed Transactions and Notifications

  • Indexing is a long process, not just matching keywords to URLs. Converts raw documents into documents ready for serving.
  • Example of duplicate elimination using MapReduce. Map documents to their checksum, and reduce them into single documents. However, refreshing the index involves crawling more documents and deciding whether each one should be added to the repository. With MapReduce, you would have to start again from scratch.
  • Goal is to have a large repository of documents with a high-quality index, and a small delay between crawling and indexing (”freshness”). MapReduce could only achieve freshness of a few days.
  • All the information is stored in Bigtable. Tables mapping URLs to metadata, and from checksums to canonical documents. On getting a new document, you need to update the checksum table, and flip a metadata bit in the URL table. This leads to a race condition.
  • Percolator adds distributed transactions to Bigtable. The API has get(), set(), commit() and iterator methods. Limited support for secondary indices.
  • Bigtable is a sorted (row, column, timestamp) store. Rows are partitioned into ranges called “tablets”. Tablets are spread across many machines.
  • Percolator provides “snapshot isolation” semantics, using two-phase commit coordinated by the client. Locks are stored in additional Bigtable columns: a commit and a lock column for each data column. Transactions manipulate these columns transparently to the user. (A reduced sketch of the commit protocol follows these notes.)
  • Notifications allow users to run user-defined code when a column is updated. “Message collapsing” means that notifications are run at most once per write. Applications are structured as a series of observers, which hopefully form a tree.
  • Notifications run independently on individual rows, which means that stragglers don’t hold up other work.
  • Bus clumping: randomly scanning the repository for notifications will lead to clumping, which reduces parallelism and overloads the Bigtable servers. Solution is to obtain locks on a row-by-row basis, and randomly jump to another row if the “bus” is contended.
  • Very different access pattern on the hardware. Lots of small, random disk I/Os with many RPCs per phase (compared to MapReduce’s streaming I/O with many documents per RPC).
  • Compared MapReduce versus Percolator on document processing latency. MapReduce is basically constant against churn at around 2200s, whereas Percolator has a much lower latency, until latency explodes around 35% churn. MapReduce is limited by stragglers. Operating point is around 2 to 15% churn.
  • Pros of Percolator: much better freshness, and better scalability (more throughput by buying more CPUs), which supports a bigger repository. Immune to stragglers. Cons are that it needs to reason about concurrency and is about 2x more expensive per document processed.
  • Percolator runs 1000 threads on a single Linux instance. This makes it easier for application developers, with straight-line code and meaningful stack traces. And it gets good multicore speedup. Problems with kernel scalability, though, such as the mmap semaphore, and had to fix those.
  • Unknown unknowns: some CPUs periodically got XOR calculations wrong in checksums. Sometimes Bigtable cannot delete files fast enough (the garbage collector). 50% of seeks were doing useless readahead, so switched to ext4. Incorrect resistor value on the motherboard led to the workload powering machines off.
  • Gets 50% fresher results, with a 3x larger repository.
  • An existence proof for distributed transactions at web scale.
  • Q:
  • Q:
  • Q: Did you consider doing Nectar? Didn’t consider this, but it could improve the performance of this system. We do batch RPCs for operations on the same row.
  • Q: What was causing stragglers? Machine failures, process failures, etc.
  • Q: Could you live with some of the failures that you saw? Some limitations in the Bigtable abstraction that made us have to debug down to the lowest layers.
  • Q: Were Stonebraker and DeWitt right? It isn’t a good idea to pretend that MapReduce is a database system. But MapReduce isn’t dead. Percolator has a handful of users, whereas MapReduce has thousands of Google users.
  • Q: Did you find cycles of notifications arising in practice? It happened exactly once, and we were more careful after that. The system looked healthy but would have run for an infinitely long time.
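
The snapshot-isolation scheme above can be illustrated with a toy, single-process model. This is only a sketch under assumed names (a data/lock/write column per cell and a counter standing in for the timestamp oracle); the real Percolator distinguishes primary and secondary locks, cleans up after crashed clients, and runs against Bigtable.

```python
import itertools

_ts = itertools.count(1)   # stand-in for Percolator's timestamp oracle
TABLE = {}                 # (row, col) -> {"data": {ts: value}, "lock": {ts: info}, "write": {ts: data_ts}}

def _cell(row, col):
    return TABLE.setdefault((row, col), {"data": {}, "lock": {}, "write": {}})

class Txn:
    def __init__(self):
        self.start_ts = next(_ts)
        self.writes = {}   # (row, col) -> value, buffered until commit

    def get(self, row, col):
        c = _cell(row, col)
        if any(ts <= self.start_ts for ts in c["lock"]):
            raise RuntimeError("pending lock: a real client would wait or clean it up")
        committed = [ts for ts in c["write"] if ts <= self.start_ts]
        return c["data"][c["write"][max(committed)]] if committed else None

    def set(self, row, col, value):
        self.writes[(row, col)] = value

    def commit(self):
        # Phase 1: "prewrite" -- lock every written cell at start_ts.
        # (A real implementation would roll back locks already taken on failure.)
        for (row, col), value in self.writes.items():
            c = _cell(row, col)
            if c["lock"] or any(ts >= self.start_ts for ts in c["write"]):
                return False           # conflict: another writer got there first
            c["data"][self.start_ts] = value
            c["lock"][self.start_ts] = (row, col)
        # Phase 2: commit at a fresh timestamp, replacing locks with write records.
        commit_ts = next(_ts)
        for key in self.writes:
            c = _cell(*key)
            del c["lock"][self.start_ts]
            c["write"][commit_ts] = self.start_ts
        return True

t = Txn()
t.set("urlA", "processed", True)
assert t.commit()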

Reining in the Outliers in Map-Reduce Clusters using Mantri

  • Work began as a measurement study of outliers, but grew into fixing them. Now in production on Bing’s clusters.
  • MapReduce is used to build the search index, improve relevance and ads, perform genome analysis, etc. It decouples operations on data from mechanisms to scale across large distributed clusters.
  • Implementation is Cosmos (based on Dryad). Bing uses the SCOPE language to specify queries.
  • Example: finding frequent search queries. Three phases: read/parse, map and reduce. Barrier after the map task. Leads to a drop in cluster utilization towards the end of the map phase, due to stragglers. Reduce phase also has stragglers. Both phases slow down the overall job.
  • Want to get rid of stragglers and recomputations. This gives better predictability, better resource utilization and quicker response (good for developers).
  • Measured start time, end time, data rate and failure mode on a production cluster. Stragglers take 1.5x the duration of the median task in a phase. Half of phases have >10% of stragglers and no recomputation. 10% of stragglers take 10x the duration of the median task (very heavy tailed).
  • The impact of outliers on job completion time (based on a trace-driven simulator): 50% of jobs would improve by at least 34.7%.
  • If input data is unavailable, tasks must wait for recomputations, and this delay may cascade. Idea is to replicate intermediate data, and use a copy if the original is unavailable. Challenge is to decide what data to replicate, and where. Recomputes are rare and localized (mostly due to disk failures or high load), which allows a predictive model of how likely a machine is to cause a recompute. Cost-benefit calculation based on the probability of a recompute and the runtime of the task. Since whole-rack failures are so rare, only replicate within a rack.
  • Cause of stragglers: tasks reading input over the network end up experiencing variable congestion. Solution is to spread tasks evenly across the cluster. Many aspects of MapReduce are already network-aware (such as shard placement), but the shuffle of the reduce phase means that this locality is harder to exploit.
  • Place reduce tasks so that the traffic on each link is proportional to the available bandwidth on that link. This would require global co-ordination across jobs to identify the congestion and place tasks. Can approximate this with local control.
  • Cause of stragglers: workload imbalance due to unequal partitions. Approximately 25% of outlier tasks have more data to process. Better to ignore these than duplicate them! Mantri builds an estimator that uses Graham’s 1969 theorem.
  • A final cause is contention on a machine. Idea is to copy tasks elsewhere in the cluster. Challenge is to do this as early as possible. Need to build an estimator to guess how much work is left to do. Trick is to copy when the predicted time of starting again is less than half the remaining time (see the sketch after the Q&A below).
  • Running time is a function of the input, network congestion, data left to process and various other factors.
  • Results from deployment in production clusters and simulations. Running in production since May. Compared May–June (running Mantri) against April (the previous setup). Median job improvement is 32.1%. There is also a reduction in the number of resources used, because the Dryad scheduler was launching too many duplicate tasks. (0.5x as many copy tasks launched as Dryad, with a 2.8x higher success rate.)
  • Compared Mantri to Dryad, Hadoop, LATE and MapReduce on a trace. Mantri gives a better reduction in completion time than all of these schemes. The others have less time savings with more resources used.
  • Q: You took the MapReduce interface as given, so what would you do if you built a new system knowing what you know now? Providing better ways of partitioning maps, which we have already done in production.
  • Q: What would you do if you had a non-uniform cluster? We have very efficient cluster management software, called Autopilot, that monitors these things. The primary reason we don’t have persistent problems is that this will shoot particularly bad nodes in the head. A slot abstraction is how we split up heterogeneous machines.
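
The restart rule above is simple enough to sketch. This is only an illustration with made-up estimator inputs; Mantri's real estimators are driven by per-task progress reports and data sizes.

```python
def should_duplicate(t_remaining_est, t_new_est, running_copies=1):
    """Launch a duplicate only if a fresh copy is predicted to finish in less
    than half the time the current copy still needs (and we aren't already
    running duplicates)."""
    if running_copies > 1:
        return False
    return t_new_est < t_remaining_est / 2.0

# A task predicted to need 20 more minutes, where a restart on a lightly
# loaded machine is predicted to take 8 minutes, gets a duplicate:
print(should_duplicate(t_remaining_est=1200, t_new_est=480))   # True
```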

Transactional Consistency and Automatic Management in an Application Data Cache

  • Application-level caching is used to handle scaling issues in modern websites, e.g. Facebook, MediaWiki, LiveJournal. Caching whole web pages is not much good if you have highly personalized content. Database caches are less useful when so much processing is done in the application layer. This has led to the use of things like memcached or Java object caches. These are essentially a DHT, for application-level objects, which can separate commonly-used and customized content. This caching reduces load on the database and the application server.
  • Existing caches are very simple and force a lot of management work to the application. For example, they don’t provide transactional consistency, which violates guarantees that the underlying database would provide. This can expose anomalies to the user, and complicates application code (to deal with violated invariants).
  • Naming is harder than you think: a cache key must uniquely identify the value. Invalidations require global reasoning about the possible contents of the cache. Saw two bugs in MediaWiki (#7541 and #8391) that arose from these problems.
  • TxCache has transactional consistency, bounded staleness and function memoization. Implemented as a library that hides the complexity of cache management. It integrates with a new cache server we have developed. Function calls for beginning, committing and aborting transactions, and a hook to make a side-effect-free function cacheable (depending only on the arguments provided and the state of the database). No need to deal with inconsistencies, naming, invalidation, etc.
  • Transactional consistency goal: all data seen in a transaction reflects a single point-in-time snapshot. Each transaction gets a timestamp, and objects in the database and cache are tagged with a validity interval. Transaction can read cached data if the validity interval includes the timestamp. Staleness allows cached data to be used for longer, if this is compatible with the application requirements. Nevertheless, it must be consistent with the data in the database if other parts of the transaction require DB data. Simply expose an API on the database for starting a transaction at a given timestamp.
  • Validity intervals computed for application objects, queries and tuples (the last being tracked by the database). A query is valid if the current timestamp is in the intersection of the validity intervals of the tuples accessed (see the sketch after the Q&A below). Had to modify the database server to track this and return it with each SELECT query.
  • It is hard to choose a timestamp a priori, since we don’t know the access pattern or the contents of the cache. However, can choose this based on the tuples that would be accessed.
  • Each object in the cache has a set of invalidation tags, and the DB was modified to assign these tags to each query. Tags are computed from the query access methods (e.g. index lookups, wildcard tags for queries on the entire table). An update generates affected tags (one per key per tuple modified) and broadcasts these to all cache nodes in an ordered stream. Need to be careful to avoid lookup-invalidate race conditions.
  • Evaluated using the RUBiS benchmark suite, with the standard browsing/bidding workload. Ran with an in-memory (850MB) workload, and a 6GB (disk-bound) workload. As the cache size grows, throughput increases up to about 5000 requests per second. The cache hit rate flattens out at 95%. With no cache, the rate is about 1000 requests/second.
  • Throughput increases also for the disk-bound workload, though much lower (up to 500 requests/second). The cache hit rate is much higher though. Bottleneck in this workload is random I/O for infrequently-accessed data.
  • Adding staleness gives a 3x to 5x throughput improvement.
  • Investigated cache misses due to the stringent consistency: less than 10%. When consistency is turned off, the increase in throughput is much lower.
  • Q: How are the invalidation tags affected by complex queries like joins? We have a more complex approach based on predicate locks.
  • Q: How do you deal with fault tolerance? The cache is inherently fault tolerant because its data can always be recreated. Although migration etc. could improve that. Problems with the validity intervals? No, because these are generated by the database.
  • Q: Do you provide serializability or snapshot isolation? Depends on what the underlying database supports, since the cache is for read-only transactions.
  • Q: Why did your throughput keep going up as the hit rate levels off? This is because we are caching objects that are sometimes larger than the data that is used to generate them.
  • Q: Do you still get a performance win if you have a single cache node on a single system? The benefit of a distributed architecture is the ability to scale the effective size of the cache. Have you compared distributing the cache versus distributing the database? There are a lot of approaches based on DB replication that do similar things, but these are more heavyweight because they try to scale writes as well.
  • Q: How is the cache key chosen? Based on the function’s name and its arguments. A more complex approach could be used if the code might change. What if not all of the arguments matter? Still work to be done in the application to make sure the cache is used efficiently (e.g. in choosing the cacheable functions).
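
The validity-interval check is the core of the consistency story above and is easy to sketch. The names and interval bookkeeping here are assumptions for illustration; in TxCache the intervals are computed by the modified database and the cache server.

```python
class CachedValue:
    def __init__(self, value, valid_from, valid_until):
        self.value = value
        self.interval = (valid_from, valid_until)   # timestamps at which this value is current

class ReadOnlyTxn:
    def __init__(self, now, staleness_limit):
        # Any timestamp in this window is acceptable to the application.
        self.window = (now - staleness_limit, now)

    def try_cache(self, entry):
        lo = max(self.window[0], entry.interval[0])
        hi = min(self.window[1], entry.interval[1])
        if lo > hi:
            return None            # a hit here would break the single-snapshot guarantee
        self.window = (lo, hi)     # pin later reads (cached or from the DB) to this range
        return entry.value
```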

Piccolo: Building Fast, Distributed Programs with Partitioned Tables

  • Motivating example is PageRank: for each node in a graph, add the current score of that node to its out-neighbours, and repeat until convergence. The only changing data (the current and next iteration scores) will fit in memory.
  • Dataflow models don’t have global state as a first class concept. Need to read from and write to distributed storage. No access to intermediate global state, and costly access.
  • Could implement it with MPI or RPC. Need to decide how to break up the computation explicitly.
  • Piccolo provides global access to distributed shared state. The Piccolo runtime handles communication between the nodes sharing this state. Usually ease of use and performance are in conflict.
  • Programming model uses a key-value store with a put/get/update/iterate interface. Implemented as a library for C++ and Python.
  • Showed how PageRank is implemented on Piccolo. A job is specified with a control function and a kernel function that actually does the computation. Doing it naïvely is slow, because it would do a lot of remote gets and remote puts.
  • Let the user control the partitioning function (e.g. for PageRank based on the domain of each page), and colocate tables that a kernel function uses on the same node.
  • Synchronization is challenging if multiple writers can access the same key at once. So provide an atomic update (e.g. sum, min or max) function attached to a table value; this only works for commutative functions. The synchronization primitive is a global barrier. Tables provide release consistency: accumulations are atomic, but they may be buffered and are only guaranteed to be applied at barriers, so an explicit barrier is needed between operations. Release consistency makes writes quite fast (see the kernel sketch after the Q&A below).
  • Fault tolerance: do checkpointing of the table state. The barrier is augmented with a checkpoint of particular tables. Use the Chandy-Lamport (non-blocking) protocol for checkpointing.
  • Load balancing is tricky because the tasks are stateful. Cannot just start copy tasks. The master coordinates work stealing, and needs to pause updates to a partition before it is stolen to another machine.
  • Evaluated against Hadoop on a local 64-core, 12-node cluster. Used a 100M page graph, taking around 8 seconds per iteration. Main Hadoop overheads are sorting, HDFS access and serialization.
  • Evaluated scaling up to 200 workers on EC2, and the iteration time stays about the same (about 60 seconds per iteration at all configurations).
  • Implemented iterative n-body and matrix multiply that don’t fit into Hadoop. Also asynchronous applications such as a distributed web crawler.
  • Q: What is your fallback strategy for dealing with computations that don’t have a commutative aggregate function? We haven’t found many applications like that. We would have to rely on pairwise locking in this case.
  • Q: Do you need to restart every node if a single node fails? Yes, you need to rollback all nodes to the last checkpoint, so need careful consideration of the checkpoint frequency.
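
A rough idea of what a Piccolo-style PageRank kernel looks like against a put/get/update table. Class and method names here are assumptions for illustration; the real library is C++/Python and handles partitioning, colocation and barriers in the runtime.

```python
class Table(dict):
    def __init__(self, accumulate):
        super().__init__()
        self.accumulate = accumulate               # commutative accumulator, e.g. sum

    def update(self, key, delta):                  # applied atomically by the runtime
        self[key] = self.accumulate(self.get(key, 0.0), delta)

def pagerank_kernel(graph_partition, curr, nxt):
    # In Piccolo this runs once per partition, with curr/nxt colocated on the node.
    for page, links in graph_partition.items():
        share = 0.85 * curr.get(page, 1.0) / max(len(links), 1)
        for dst in links:
            nxt.update(dst, share)                 # may be remote; buffered until the barrier

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
curr = Table(lambda a, b: a + b)
nxt = Table(lambda a, b: a + b)
for page in graph:
    nxt.update(page, 0.15)                         # random-jump term
pagerank_kernel(graph, curr, nxt)                  # a global barrier would follow here
```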

Cloud Storage

Depot: Cloud Storage with Minimal Trust

  • Storage providers like S3 are appealing because they free the user from a lot of concerns. However, they also present risks. Failures can cause undesired behaviour (such as not propagating permission changes immediately). Also risks of data loss, corruption or unavailability.
  • Approach is to take a radical stance on fault-tolerance. Completely eliminate trust from cloud storage for put-availability, eventual consistency, staleness detection and dependency preservation. It minimises trust for get-availability and durability. So you can always perform a put, for example. But gets are only available as long as there is at least one correct node.
  • Depot ensures high availability by running multiple servers and accepting weaker-than-sequential consistency. It falls back to client-to-client communication in the case of extreme cloud failure. It prevents omission and reordering by adding metadata to puts, which is fully replicated in nodes’ local state. Metadata received at the client is checked to ensure that the client sees a view that satisfies various well-defined properties. Update metadata includes a local clock and a history. Logically each node stores all metadata it sees (garbage collection is handled in the paper).
  • To protect consistency, nodes perform local checks on the metadata: need to ensure that the updating client has seen all previous updates (based on its history), and that it doesn’t try to change history (based on its local clock). (See the sketch after the Q&A below.)
  • Faults can lead to forks in the history, which partition correct nodes, and prevent eventual consistency. Forks are joined to gain eventual consistency: faults are transformed into concurrency. One faulty node is transformed into two correct virtual nodes. Since concurrency can introduce conflicts (concurrent updates to the same object), Depot exposes such conflicts to applications.
  • Depot provides various properties in terms of consistency, availability, integrity and eviction, which vary based on how many nodes are correct.
  • For get availability and durability, it would be ideal to trust only yourself, but impractical to do this. Depot minimises the number of correct nodes that are necessary to service the get. Any correct node that is willing to supply an object can service a get. To make it likely that a correct node has data, employ replication (including a local copy).
  • Contingency plan for storage service provider failure. Prototype forces all clients to store all of the puts that they create, and gossip update metadata in order to route get requests to the right node. Despite complete SSP failure, this allows the system to keep working.
  • Evaluated cost in terms of latency, resources and dollars (for storage).
  • Overhead comes from adding metadata and adding checks (SHA256, RSA verify and history verify).
  • Set up the evaluation on 12 Emulab nodes, 8 clients and 4 servers, with a 1G link. Clients issue 1 request/second, and measure latency and per-request cost. As a baseline, emulate traditional cloud storage (servers implemented in Depot without any of the checks, and no metadata at the clients).
  • Latency overhead on get is very small (<1 ms), and on put, is about 5 ms.
  • Network overhead on gets is modest. CPU costs are bigger. Metadata transfer causes network cost, and verification has CPU cost.
  • Cost model based on cloud pricing. Get cost per TB is small over the baseline (about $100/TB). Puts incur more cost (about $220/TB) due to metadata.
  • Q: Does presenting overheads in terms of the Depot cost mask the effect? Isn’t the CPU cost something like 400%? Yes, but the absolute values are small. Plus the real cost is in the bandwidth.
  • Q: What can Byzantine clients do if you rely on gossiping metadata (such as cause all get requests to be routed to it)? Include a “receipt” mechanism to deal with such a situation, described in the paper. When a client receives a get response, it performs a check to ensure that data is sufficiently replicated.
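
The local checks above amount to “only accept an update that extends the history we already believe the writer had”. The sketch below is only an illustration on a simplified update format (field names are assumptions); real Depot updates are signed and carry version-vector histories.

```python
import hashlib

def h(x):
    return hashlib.sha256(repr(x).encode()).hexdigest()

class Replica:
    def __init__(self):
        self.last = {}   # writer -> (last accepted clock, hash of that writer's history)

    def check_update(self, update):
        """update = {"writer", "clock", "prev_hash", "body"}: accept only if it
        extends the writer's history as we know it."""
        clock, prev = self.last.get(update["writer"], (0, h(())))
        if update["clock"] != clock + 1:
            return False   # writer skipped or reused a local clock value
        if update["prev_hash"] != prev:
            return False   # writer is rewriting its history: evidence of a fork
        self.last[update["writer"]] = (update["clock"], h((prev, update["body"])))
        return True
```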

Comet: An Active Distributed Key-Value Store

  • Key-value stores are increasingly important, with great scalability, availability and reliability properties. Popular in both data centers and P2P. Stores are shared between many applications to avoid per-app storage system deployment, but building applications above these stores is challenging.
  • Main challenge is due to inflexibility: all apps are treated exactly the same way, independent of their needs (which may be conflicting). Motivating example is authors’ experience with Vanish on top of the Vuze DHT.
  • Vanish is self-destructing data storage for the web, based on the Vuze DHT, using temporal-based keys. Unfortunately, Vuze had a fixed 8-hour data timeout, and an overly-aggressive replication scheme, which harmed security. Making the necessary changes was simple, but deploying them was difficult, because it is a public service in production. Also it is hard to evaluate changes before deployment.
  • Propose extensible key/value stores that allow applications to customise functions of the store. Different data lifetimes, numbers of replicas and replication intervals. Also allow apps to define new functions, such as popularity tracking, access logging, or user-dependent responses. Built a peer-to-peer DHT: Comet.
  • Goals: flexibility, isolation, safety and light weight.
  • Main abstraction is an “active storage object” (ASO), which contains a value and a small set of handlers that are called on put or get to the object (see the sketch after the Q&A below). Comet is built on top of a traditional DHT as an active runtime that runs ASO handlers in a sandbox.
  • Built several applications on top of the ASO extension API, including Vanish, Adeona, file sharing, etc. The extension API is 12 functions: 4 callbacks, 5 host interaction functions and 3 DHT interaction functions (get/put/lookup). Deliberately chose not to allow arbitrary network I/O because of security implications.
  • Sandbox is less than 5KLOC, based on a subset of Lua with unnecessary functions removed. Resource consumption is limited by counting per-handler bytecode instructions and memory consumption. Incoming and outgoing requests are rate-limited. DHT interaction is limited (to prevent traffic amplification and DDoS attacks): ASOs can only talk to neighbours, and cannot issue recursive requests.
  • Built on top of Vuze using Lua. Deployed on PlanetLab. Discussing mainlining it in Vuze with the main engineer.
  • Considered three examples. App-specific customisation of the replication scheme. Did Vanish-specific replication in 41 lines of Lua.
  • A context-aware storage object is used to build a proximity-based distributed tracker (returning close nodes rather than random nodes). ASO uses 37 lines of Lua. Using the proximity-based tracker gives better latency between paired nodes than a random tracker.
  • A self-monitoring DHT: a node monitors its neighbours by periodically pinging them. This allows you to collect measurements of DHTs. Ran an experiment to get Vuze node lifetimes.
  • Q: Have you thought about macrosecurity (as opposed to the microsecurity considered in the talk)? We added global limits to ASOs (not just per-ASO limits); that is just standard rate-limiting for DHTs. These are, we believe, the right mechanisms to manage the overhead of Comet. But we need more experience to determine limits. What about fairness? Haven’t looked at this yet.
  • Q: What happens when the metadata in the ASOs diverge between replicas? In today’s DHTs, replicas are actually inconsistent (may be stale). Opportunity here for different replicas to merge with each other using an ASO.
  • Q: Were you able to enable certain classes of computation in Lua? Removed everything except for three classes of things: simple math, string manipulation and table manipulation. A modified Lua interpreter counts the number of instructions. Is there an interesting set of things that you might add back in, and what applications might that enable (or is it a gray area)? One example was to do true internet measurement with arbitrary network I/O, but we decided not to add these things.
  • Q: Does the use of a high-level language cause performance unpredictability that might harm the 99.9th percentile latency? An interesting issue, but we haven’t looked at this yet. May choose to use non-GC’d languages in a real deployment.
  • Q: Do you persist that ASO state? Currently stored in-memory, and use replication to gain persistence.
  • Q: What happens to reliability if things actually get cancelled? Applications should expect that things may fail at any given point in time. Can use tricks like Vanish does to get reliable deletes.
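
Comet handlers are small Lua programs run in the sandbox described above; purely to illustrate the active-storage-object model, here is a Python sketch of an object that tracks popularity and self-destructs after an application-chosen lifetime (the class, callback names and fields are assumptions).

```python
import time

class ASO:
    def __init__(self, value, lifetime_secs):
        self.value = value
        self.created = time.time()
        self.lifetime = lifetime_secs      # chosen by the application, not the DHT
        self.gets = 0

    def on_get(self):                      # run by the DHT node on every get
        if time.time() - self.created > self.lifetime:
            return None                    # Vanish-style self-destruction
        self.gets += 1                     # popularity tracking / access logging
        return self.value

    def on_timer(self, dht):               # periodic callback
        # An app-specific replication policy could re-put the value here,
        # subject to the sandbox's rate limits on DHT interaction.
        pass
```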

SPORC: Group Collaboration using Untrusted Cloud Resources

  • Cloud deployment is attractive because it allows collaboration in user-facing applications, but the price is that you must trust the cloud provider.
  • Goal is to support practical cloud apps for real-time collaboration (and offline working), but using untrusted servers.
  • Application logic has to be moved to the clients, and every client maintains a local copy of the state. The server’s role is limited to storing encrypted data and processing client updates.
  • To keep clients in sync, use “operational transformation” (a technique from 1989): the application provides an app-specific transformation function that is used to fix up non-commutative actions (such as deleting two different characters concurrently); see the sketch after the Q&A below. This can synchronize arbitrarily-divergent clients.
  • Digital signatures aren’t enough, since the server could equivocate, so employ “fork* consistency”, which gives linearizability on an honest server, and detectable equivocation after 2 messages from a malicious server. Clients embed their history (hashed) in messages they send. Server may fork clients into partitions, but can’t then unfork them. Need to detect forks out of band. SPORC provides a way to recover from malicious forks.
  • Internally, the SPORC library keeps a history of all the operations that it sees, marked as committed or pending. Could reconstruct the client’s local state. Committed operations are fork* consistent across all clients, and pending operations are causally consistent. Clients enforce this by checking sequence numbers on each operation.
  • To commit an operation, the client encrypts and signs it, and sends it to the server. The server keeps a sequence of encrypted operations which represents the canonical state. Then it broadcasts the operation to the clients. Finally, at each client, it needs to be transformed using OT (since the recipient’s state has probably diverged). Only after these transformations can the operation be applied to the local state, and added to the end of the committed operations.
  • For access control, we cannot trust the server. Need to preserve causality, and concurrency only makes it harder. Access control is enforced using encryption with a single symmetric key shared between the clients. ACL changes are committed in the same way as update operations. All users have a public/private keypair in addition. Removing a user is a little more complicated than adding one: need to change the encryption key for subsequent changes and encrypt it under the public key of all of the remaining users, and also encrypt the old key under the new key (so it can be provided to any new users to access old operations).
  • If two people want to remove different users concurrently, the server will order them, but this creates a challenge for encrypting the new key. Need dependency pointers between operations and barriers: if a dependency pointer crosses a barrier, the operation will be rejected, which maintains the chain of old encryption keys.
  • To recover from a fork, use OT to resolve malicious forks by treating them as normal divergences.
  • Implemented as a client library and a generic server. The generic server can support any application since all it does is push encrypted data around. Applications are defined as operations and a transformation function. Implemented a Java CLI client and a browser-based version (using the Google Web Toolkit).
  • Evaluated on 8-core 2.3GHz AMD boxes, with one for the server, and 4 for clients. Tested the Java CLI version. Looked at latency, server throughput and time-to-join. Connected using 1G Ethernet.
  • Latency is <30ms for a text editor on up to 16 clients with low load, and slightly worse if everyone writes. Broke this down into various components: the most expensive thing is the RSA signature on outgoing messages (about 10ms per message). Could use a faster implementation like eSign to get this down into the microseconds. Client queuing delay gets worse as the number of clients increases on a write-everyone workload.
  • Server throughput gets up to 20MB/s as the payload size is increased to 16KB.
  • Q: What happens if an evil person about to be removed tries to add more evil users? The way we deal with these situations is to have multiple privilege levels (admins, writers and readers) with ACLs implemented on the client side. This is only a problem if you have a malicious admin.
  • Q: Why use a specialised server rather than just using a cloud storage provider or a DHT? In addition to storage, you need a way to keep clients in sync. We thought about doing a completely decentralised system. Advantage of a server is to get timely commits. Server is “highly-available”.
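
Operational transformation for the concrete case mentioned above (two clients concurrently deleting characters) can be sketched in a few lines. This is only an illustration of the idea; real OT functions cover inserts, deletes and richer document models, and in SPORC they also repair maliciously induced forks.

```python
def transform_delete(op, other):
    """Adjust the position in `op` to account for a concurrent delete `other`
    that has already been applied to the local document."""
    if other["pos"] < op["pos"]:
        return {"type": "del", "pos": op["pos"] - 1}
    if other["pos"] == op["pos"]:
        return {"type": "nop"}            # both clients deleted the same character
    return dict(op)

doc = list("sporc")
a = {"type": "del", "pos": 1}             # client A deletes 'p'
b = {"type": "del", "pos": 3}             # client B concurrently deletes 'r'
del doc[a["pos"]]                         # A's op applied locally first
b2 = transform_delete(b, a)               # transform B's op against A before applying it
if b2["type"] == "del":
    del doc[b2["pos"]]
print("".join(doc))                       # -> "soc"
```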

OSDI 2010: Day 1

October 4th, 2010

Kernels: Past, Present and Future

An Analysis of Linux Scalability to Many Cores

  • Looking at Amdahl’s Law limitations (serial paths) in the Linux kernel. For example, Exim spends about 70% of time in the kernel on a uniprocessor.
  • Want to know how serious the scaling problems are, and how much work is necessary to fix them (in light of recent efforts to make new scalable OSs).
  • Used a 48-core x86 box running a recent version of Linux. Looked at 7 real applications, and analysed the limitations of the Linux kernel, which did exist.
  • 3002 lines of code in 16 patches fixed the scalability limitations. Remaining bottlenecks were in the hardware or applications.
  • Ran applications with an in-memory file system to avoid a disk bottleneck. Found bottlenecks, fixed them and re-ran the application. Stopped when the application or hardware had a non-trivial bottleneck that they couldn’t fix.
  • Apps: Exim, memcached, Apache, PostgreSQL, gmake, Psearchy (inverted index building) and Metis (MapReduce implementation).
  • Exim throughput (messages per second) collapses around 40 cores and kernel time (per message) increases greatly at this point. Large explosion of time in lookup_mnt function (>30% of time). This function returns metadata about a mount point, under a spinlock. Acquiring and releasing the spinlock were using more cycles than the hashtable lookup, because of the ticket-based spinlock implementation. Cache traffic (atomic inc and read of current ticket value in acquire; increment in release) takes 120-420 cycles on a lightly-loaded system. With high load, the delay goes up to 500-4000 cycles.
  • Solution could be to implement scalable locks or message passing (like Barrelfish). But since the mount table is rarely modified, use per-core mount tables to cache it. First look up in the local core’s table, else take the slow path through the global table. On modification, need to invalidate the per-core tables, but since this is rare, it leads to better scalability. This fix removes the collapse at 40 cores.
  • Now see that the same functions execute more slowly on 48 cores than on 32, leading to a flattening out of scalability. Next bottleneck was dput(), which modifies the refcount on a dentry. Does an atomic decrement and test on the reference count. Reading the reference count is slow, because it is atomic, which makes cache coherency traffic occur. Concurrent cores must retry because a single core gets a hardware cache-line lock on the value.
  • Solution is “sloppy counters”: each core holds a few spare references to each object in a per-core counter, so references taken and released on that core don’t need to touch the shared count (see the sketch after the Q&A below). On an unlink, the kernel gathers the sloppy per-core information back into the shared counter, and disables the per-core counters. Uses O(N) space in the number of cores, but leads to no cache misses in the common case. Leads to an increase in scalability over the per-core lookup table fix.
  • In all, 9 areas were considered this way. Some applications were also modified (amounting to 60 lines of code). All well-known parallel programming techniques.
  • Almost all applications get some speedup (except Psearchy and Metis, which were already quite high).
  • Remaining bottlenecks include hardware queues on the NIC, application-level contention, cache capacity and DRAM throughput.
  • Limitations: only looked at 48 cores and a small set of applications. Some problems still loom: fork, virtual memory, page allocation, file system and concurrent address-space modifications.
  • Q: What was your experience in mainlining these patches? Haven’t tried yet, because we were working on the talk. Sloppy counters are probably questionable. Are you going to try to submit? We intend to think about it.
  • Q: Do you really believe that hacking at each piece of shared state one at a time is the right way to go? No, but up to 48 cores it seems to work well. What do you really think? I have no idea.
  • Q: How many of your optimizations would work if you ran multiple applications on subsets of the core? Are they general enough for this case? I think so, since we didn’t target particular applications with these changes, but some applications may not benefit.
  • Q: How did you implement sloppy counters given that cache lines are pretty large, packing or wasting memory? We packed multiple per-core counters onto a single cache line. Did you find that managing these counters then led to a lot of overhead? For some counters, yes, such as sockets, which allocate a dentry per socket. The fix was not to allocate a dentry for a socket.
  • Q: A big ball of code wins in the long term because it is easier to make larger changes as well as smaller changes. Did you find that to be the case (that it was easier to modify the Linux kernel than other microkernels)? [Taken offline.]
  • Q: Looking at spinlocks, why did your original results have an elbow in the curve at 40 cores, rather than a gradual degradation? The interconnect on the machine becomes congested. The other effect was an n-squared effect based on the number of waiters.
  • Q: Should we or should we not be designing new OS architectures for multicore systems? Up to 48 cores, it doesn’t seem like Linux has a problem.
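
The sloppy-counter bookkeeping is easy to show in miniature. The real implementation is per-CPU C code inside the kernel, with counters packed onto cache lines; the Python below is only an illustration of the accounting (class and method names are assumptions).

```python
class SloppyCounter:
    def __init__(self, ncores, spare=4):
        self.shared = 0                  # central count of all references handed out
        self.local = [0] * ncores        # spare (unused) references held by each core
        self.spare = spare

    def get(self, core):                 # take a reference on this core
        if self.local[core] == 0:
            # Slow path: acquire a batch of spares by bumping the shared counter once.
            self.shared += self.spare
            self.local[core] = self.spare
        self.local[core] -= 1            # fast path: no shared cache line touched

    def put(self, core):                 # release a reference back into the core's stash
        self.local[core] += 1

    def reconcile(self):                 # e.g. on unlink: gather spares back centrally
        self.shared -= sum(self.local)
        self.local = [0] * len(self.local)
        return self.shared               # true number of outstanding references
```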

Trust and Protection in the Illinois Browser Operating System

  • Today, the web is ubiquitous, and runs in web browsers on many different operating systems.
  • However, these applications are vulnerable to attacks at many layers. There are web-app vulnerabilities, browser vulnerabilities, and OS and library vulnerabilities. The last two are rarer but also the most severe, so this talk considers only those.
  • Web browsers, such as Firefox, are monolithic and easy to exploit, since they run in a single address space. “Secure” web browsers are not enough, since they still have a huge TCB (including the TCP stack, X server, file system, drivers, etc.). A microkernel would be better, but still need to trust all of the system components.
  • Illinois Browser OS (IBOS) is a new OS design to make a single app more secure, which leads to a reduced TCB.
  • The IBOS kernel makes browser abstractions first-class, i.e. web page instances, cookies, IP connections, etc.
  • Security invariants check key properties without understanding the implementation of the particular components. Challenge is to check security properties using these “invariants”.
  • An IBOS label is a tuple containing a protocol, domain and port. These are inferred and applied by the kernel. Leads to different “web page instances” for different originating domains. Since these may then access several servers (for content, images, ads, etc.), the web page instances can spawn multiple “net processes”. Enforcing this requires a split driver architecture: on sending an Ethernet frame, the IBOS kernel checks the TCP port and IP address against the net process’s label (see the sketch after the Q&A below), and hands off the actual transfer to the NIC driver.
  • A basic frame buffer driver splits the screen into three parts: an address bar at the top, which shows the current page address; toolbars that provide global control of the system; and the content itself. The IBOS kernel has a tab abstraction which multiplexes the display between different web page instances. Input devices are routed to the visible tab, which prevents a keylogger from hijacking another instance’s session.
  • Implemented on L4Ka::Pistachio, using WebKit and Qt. Also provide a “Unix-like” API through Qt on uclibc, which supports a PDF viewer.
  • TCB size is 42 KLOC, compared to 5.6 MLOC for Firefox+Linux+X, and 4.4 MLOC for ChromeOS. Could potentially use formal methods to verify that. Avoided 27 CVE exploits published in 2010, and didn’t prevent 1 relating to device drivers. Also avoided 135 (77%) browser vulnerabilities published in Google Chrome’s bugtracker. Scored well on memory exploitation and sandbox bypassing vulnerabilities.
  • Evaluated page load latency time for Google Maps, Bing, Craigslist, CS@Illinois, Wikipedia and Facebook. IBOS is close to Firefox and Chrome, and sometimes better (though sometimes a little worse).
  • Q: Does IBOS have the same functionality as Chrome and Firefox, especially for JavaScript etc.? Yes: the presentation is actually using HTML5 on IBOS. IBOS uses WebKit, which is pretty standard.
  • Q: What do you think is fundamental about the web model that makes this different from core microkernel work? Web browsers are a very important application, so borrowing ideas from OS design could help to improve them.
  • Q: In terms of user experience, can you have two frames co-existing on the same screen? Currently don’t support windowed display, only tabs.
  • Q: Does the IBOS kernel manage DMA, and is it specific to a particular network device? Currently only support one driver, but should generalise to other drivers that support DMA buffers.
  • Q: Does the design inherently prevent cross-site scripting, or can this be relaxed? Don’t actually prevent these attacks, but with the right policy, we can contain these attacks. The paper details a custom policy that allows sites to communicate with other sites. Is your presentation running on top of IBOS? Yes.
  • Q: Is there any reason why Firefox or IE could not run on top of IBOS? They could if you changed the windowing abstraction.
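
A minimal sketch of the outgoing-frame check implied by the split driver design, with assumed field and function names; the real IBOS kernel also tracks which IP addresses the label’s domain resolved to, and the NIC driver itself remains untrusted.

```python
def may_transmit(label, resolved_ips, frame):
    """label: (protocol, domain, port) inferred by the kernel for this net process;
    resolved_ips: addresses the kernel saw DNS return for the label's domain."""
    _proto, _domain, port = label
    return frame["dst_ip"] in resolved_ips and frame["dst_port"] == port

label = ("http", "example.com", 80)                        # hypothetical label
frame = {"dst_ip": "192.0.2.10", "dst_port": 80, "payload": b"GET / HTTP/1.1\r\n"}
assert may_transmit(label, {"192.0.2.10"}, frame)          # allowed
assert not may_transmit(label, {"192.0.2.10"},             # wrong destination: dropped
                        {"dst_ip": "198.51.100.7", "dst_port": 80, "payload": b""})
```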

FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

  • System calls are typically synchronous and exception-based for the call and return. This is inflexible for scheduling, and expensive due to the mode-switch and pollution of processor structures when you multiplex these at a high frequency.
  • FlexSC is a shim that sits above the OS and provides exception-less syscalls, and FlexSC-threads is a threading library that uses FlexSC. Improves throughput in MySQL and Apache without changes to their codebase.
  • Took an application (Xalan from SpecCPU 2006) that does almost no syscalls, and injected exceptions with various frequencies. Emulated a null syscall and a write syscall. Measured only user-mode time and assume that it will be unaltered with the exceptions. However, almost half of the performance is lost. Indirect (pollution of structures) cost is the dominant problem.
  • Indirect cost: a Linux write() call can cause up to 2/3 of the L1 data cache and data TLB to be evicted. This also impacts the performance of OS code.
  • Exceptionless syscalls use a syscall page into which calls and results are written; syscall kernel threads do the processing (see the sketch after the Q&A below). This leads to fewer mode switches, allows for batching of syscalls, and allows for dynamic multicore specialization, which further reduces the costs. In the ideal case, there is no need to context switch at all. The application writes its syscall registers into an entry on the page, and loops doing other work while the result is not marked as DONE.
  • Syscall threads are kernel-only threads that are part of the application process and run in the same process space. They execute requests from the syscall page and are schedulable on a per-core basis.
  • System call batching allows user space to make as many calls as it can, then jumps into the kernel which executes all of them before switching back to user mode.
  • Dynamic multicore specialization allows the OS to map syscall threads to particular cores, and user-mode threads to other cores. This can adapt to workload needs.
  • The user-mode interface is quite different to synchronous system calls. However, it is very suitable for event-driven servers, which already process asynchronous events. Multi-threaded applications can use FlexSC-Threads, which is compatible with Linux pthreads and requires no changes to the apps or even recompilation.
  • FlexSC-Threads is a hybrid (m-on-n) threading model, with one kernel-visible thread per core. It redirects system calls using libc wrappers to the syscall page, and switches to another runnable thread. When all threads are blocked, the kernel is entered, and we wait for at least one syscall to complete.
  • Evaluated on Linux 2.6.33 on a quad-core Nehalem, running at 2.3GHz. Sysbench on MySQL and ApacheBench on Apache were the workloads. Compared Linux NPTL (“sync”) versus FlexSC-Threads.
  • OLTP on MySQL on a single core sees a 15% performance improvement in throughput as the request concurrency increases. On 4 cores, see a bigger improvement. For latency, FlexSC dramatically improves the 95th-percentile latency, and a 30% improvement in the average case.
  • Individual performance metrics gathered using hardware performance counters. FlexSC achieves higher instructions-per-clock cycle than sync. Lower cache miss rates in several structures leads to this in both the kernel and user-mode.
  • Apache performance can improve by 80 to 90% on a single core, and 115% on four cores. Latency is halved on average, and again is much better at the 99th percentile. Instructions per cycle are even better for Apache (almost doubling in user-mode and more than doubling in kernel-mode).
  • Exception-less syscalls can coexist with legacy ones. Doesn’t require a drastically new OS architecture.
  • Q: Did you try to upstream the patches for the kernel portion or plan to? No. Why? A lot of time and effort that I would prefer to use in other ways.
  • Q: Does this reduce security by sharing memory between the kernel and user-mode? The pages are per-process, within the application’s address space. Anything you could do in a process before, you could do now.
  • Q: Do you have an architecture for having outstanding syscalls with dependencies between them? No. We rely on servers that have concurrent, independent work to do. There is related work that tries to do this (composite syscalls from Stony Brook?), but we don’t have support for that.
  • Q: How specific are your observations to the fact that there is a protection boundary, and how much is due to the modularity boundary? A project called “cohort scheduling” made similar observations (jumping around to do different things will cause problems), so some of our observations are related to this. The boundary only adds to the problem because of the direct cost of the exceptions, which we measured earlier.
  • Q: Did you measure to what extent you get cache pollution from doing different work after the syscall? Didn’t measure it, but we did see that the TLB miss rate increased for some cases due to going through multiple user-level threads. So we could do even better if we tried to restructure the application.
  • Q: If you dedicate a core to kernel syscall handling, how do you approach the scheduling problem? Tried many things, but the simplest algorithm ended up performing the best: get a static list of cores, and go to the first non-busy core when a request comes in. Turns out to be very elastic, but we were only running a single application. May need to fine-tune this with multiple applications.
  • Q: Have you noticed or can you imagine a situation where having more cores than threads could cause you performance degradation? Do you have to have more threads than cores? We targeted server workloads. It would be a bad idea to run scientific computing workloads on top of this.
  • Q: What happens to this system when you have lots of unrelated threads demanding service (since you poll the syscall pages to pick up messages)? How does it scale up? [Taken offline.]
  • Q: Do you feel like you’re going to be fully backward-compatible? We ran four or five different applications, and the only problem we had was with get_tid (get thread ID), which required a small patch. Other than that, we didn’t see other backward compatibility problems.
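
The syscall-page protocol described above can be modelled with plain objects. This is only a sketch of the idea (statuses, entry fields and function names are assumptions); in FlexSC the page is memory shared with the kernel, and in-kernel syscall threads drain it.

```python
FREE, SUBMITTED, BUSY, DONE = range(4)

class Entry:
    def __init__(self):
        self.status = FREE
        self.number = self.args = self.ret = None

syscall_page = [Entry() for _ in range(64)]

def submit(number, args):                    # user space: post a request without trapping
    for e in syscall_page:
        if e.status == FREE:
            e.number, e.args, e.status = number, args, SUBMITTED
            return e
    raise RuntimeError("page full: fall back to a synchronous call")

def syscall_thread_pass(dispatch):           # kernel side: drain submitted entries
    for e in syscall_page:
        if e.status == SUBMITTED:
            e.status = BUSY
            e.ret = dispatch(e.number, e.args)
            e.status = DONE

# The user-level scheduler keeps running other threads and later picks up
# e.ret once e.status == DONE, instead of blocking in an exception-based call.
e = submit(number=1, args=("/tmp/x", b"hello"))          # hypothetical "write" request
syscall_thread_pass(lambda num, args: len(args[1]))      # stand-in for kernel dispatch
assert e.status == DONE and e.ret == 5
```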

Inside the Data Center, 1

Finding a Needle in Haystack: Facebook’s Photo Storage

  • In April 2009, 1.5PB of photo data (15 billion photos in 4 versions), with a 25TB/week upload rate and 550k images/second serving rate.
  • Currently: 65 billion photos, 20PB of data. 60 TB/week upload and 1 million images/second serving rate.
  • Simple design would be to use NFS, with one file per image. Works for a typical website with a small working set and infrequent access of old content (99% hit rate on the CDN). Facebook has a large working set with frequent access of old content, leading to an 80% CDN hit rate.
  • I/O bandwidth gets wasted doing small metadata I/Os. Large directories required 10 iops per image read, improving to 3 iops per image with smaller directories. Using a file handle cache got to 2.5 iops per image read.
  • Haystack store replaces the photo server and storage layer in the NFS based design. A typical unit is a 2U node with 12 1TB SATA drives in RAID-6 (as a single volume). A single 10TB xfs filesystem runs across these. Haystack is a log-structured append-only object store, and each node hosts about 100 of these, at about 100GB in size.
  • File layout has a superblock, followed by “needles” (objects) which are padded to 64-bytes, and contain a key, an alternate key (signifying the size of the image), the data and a checksum. Magic numbers are used to reconstruct the Haystack in case of corruption, which doesn’t happen anyway.
  • The index file has a subset of the needle metadata, and includes the offset of the needle in the log file.
  • The Haystack photo server takes HTTP requests and turns them into Haystack operations. It builds an index of all images in the Haystack in a hashtable, storing 32 bytes per photo (8 bytes per image scale: a 32-bit offset and a 16-bit size), compared to 600 bytes per inode. A 5GB in-memory index can handle 10TB of photos.
  • From the URL, determine the image and the size from the in-core index (see the sketch after the Q&A below). 1 iop reads the data from disk. If the size of the image exceeds the maximum size of a device read, it could take multiple iops, but this doesn’t happen in practice. Use the alloc_size mount option to grow files by 1GB extents. The only problem is when photos cross a stripe boundary (6 or 7% of photos).
  • Multiwrite for modifications, since users typically upload many images at a time. Asynchronously append images to the file, then do a single fsync(). The index file is asynchronously appended to without a flush (unless a threshold of outstanding dirty records is exceeded), since it can be reconstructed from the log.
  • Delete just marks photos as deleted in the incore index and the needle header (synchronously).
  • Compaction generates a copy that skips duplicates and deleted photos. Works online and can accept reads, writes and deletes while this goes on.
  • Haystack directory maintains a logical-to-physical volume mapping. 3 physical haystacks on 3 nodes per logical volume. Writes are balanced across logical volumes, and reads are balanced across physical haystacks.
  • On upload, the web server receives photos and scales them. The directory provides the logical volume, and all three copies are written synchronously (the latency is relatively small compared to the browser-to-web-server latency).
  • On download, web server generates URLs in the web page based on information from the Haystack directory. The Haystack cache may return the image data, or else the cache fetches it from the Haystack store. The image is returned to the CDN which returns it to the browser.
  • Implemented in 8.5KLOC of C++. Took two engineers 4 months from inception to initial deployment.
  • May require more improvements to handle high-res images, so software RAID6 or RAID10 might improve performance. Considering new drives with on-board Flash to store the index. Currently, SSDs are not cost-effective.
  • Q: Did you find prefetching to be useful or are your reads so random? More and more, we are seeing photos accessed by tagging, so prefetching on the album index structure is less useful. Thumbnails are usually cached, so it looks purely random read access to Haystack.
  • Q: Have you considered using social data to figure out what your access stream might look like? The application (implemented in JavaScript) has a much better idea of what will be loaded, so it would make more sense to put the intelligence there than in Haystack. This also keeps Haystack simple.
  • Q: Have you considered migrating needles between Haystacks for load balancing? RAID6 limits us in this respect, and is simply horrible for mixed workloads. RAID6 would not cope with migration.
  • Q: Are high-res images supported for existing images? No, we don’t have the data stored.
  • Q: What do you get out of having Haystack getting as big as 100GB? This cuts down on file system metadata, but they could be smaller. Some Haystacks are even larger, but they run just fine. The only limit is 256GB, which would require 64-bit offsets in the index. Also want to keep the Haystack directory as small as possible, so big Haystacks keep this down. Want to distribute this in future.
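
A minimal sketch of the read path above: the in-memory index maps a photo key and scale straight to an offset and size, so serving a photo costs one read of the log file. Field names and the index layout here are assumptions; real needles also carry keys, flags and checksums.

```python
index = {}   # (photo_key, scale) -> (offset, size); ~8 bytes per scale on disk

def read_photo(haystack_path, photo_key, scale):
    try:
        offset, size = index[(photo_key, scale)]   # no filesystem metadata I/O
    except KeyError:
        return None                                # unknown or deleted photo
    with open(haystack_path, "rb") as f:           # one iop in the common case
        f.seek(offset)
        return f.read(size)                        # needle: header + image data + checksum
```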

Availability in Globally Distributed Storage Systems

  • Globally distributed storage systems are built from commodity servers, arranged into racks, then clusters.
  • 200+ clusters, pools of thousands of clients, 75PB filesystems.
  • Want to look at how available is our data, what are the causes of unavailability, and how can we tune our systems to make this better?
  • 95th percentile of unavailability periods is about 15 minutes. Median due to node restarts is about 1 minute. Planned reboots cause longer unavailability periods (closer to 10 minutes median). Unplanned reboots are more severe, albeit not much (a few more minutes).
  • Correlated failures lead to a “failure burst”, which corresponds to many failures happening within 2 minutes of the previous one. 37% of all failures are part of a burst across at least 2 nodes. This has been fairly consistent over the last year. Showed an example of a burst of 50 servers within quick succession (failure of a shared power or network domain). Also see bursts of about size 20 (rolling reboot of machines with planned staggering, or a cascading failure).
  • Small failure bursts are not necessarily rack-correlated, but large failure-bursts are highly rack-correlated.
  • Came up with a rack-affinity score for a burst: the probability that a burst of the same size, affecting randomly-chosen nodes in that cell, would be less concentrated within racks.
  • Data is stored in stripes, made up of chunks with fixed-size data and code blocks. Use erasure coding (such as Reed-Solomon) to reconstruct missing chunks from available ones.
  • Perform rack-aware placement of blocks, so that the chunks of a stripe land in different racks (see the sketch after the Q&A below). MTTF increases when rack-aware placement is used, for all encoding schemes (replication and Reed-Solomon). Different encoding schemes do better for different sizes of bursts.
  • Used a Markov chain to model the number of available chunks, with transitions due to chunks failing and recovering.
  • Findings on recovery time: using Reed-Solomon (6, 3), a 10% reduction in recovery time leads to a 19% reduction in unavailability, without correlated failures.
  • Correlation matters a lot. Modelling correlated failures gives a substantial reduction in MTTF. However, it reduces the benefit of increased data redundancy.
  • Improving availability below the node layer does not significantly improve data availability (reducing the node failure rate is the most useful thing to do).
  • Multi-cell replication (across data centers) leads to a higher MTTF, but there is a trade-off between limiting the inter-cell recovery bandwidth and higher replication.
  • Q: Could controlling the order of planned upgrades and so on have a big impact on MTTF/availability? Yes. We need to do a better job of teasing apart all the reasons for unavailability (finding exact causes).
  • Q: Is it true that most of the time machines are unavailable due to long-lasting events or lots of short events? Graphs in the paper answer this. There are challenges in trying to model rare events, and we tend to be conservative.
  • Q: Can you talk about the cost effectiveness of different encoding schemes, and are you actually using erasure coding? We are considering using Reed-Solomon. What’s the drawback and why didn’t you start with this? For MapReduce, you might have parallel copies that you want many workers to read from. Also chunk recovery is more involved for Reed-Solomon, and has to touch more servers.
  • Q: If the number of cores per box goes up, what effect would this have on availability (with a single motherboard failure taking out more work)? Also, what about power saving and TCO? Multi-core servers are interesting to us in terms of work done per watt of electricity.
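
A minimal sketch of rack-aware placement, under the assumption that it simply means putting each chunk of a stripe in a distinct rack (function names are illustrative); the real placement policy also balances load and capacity.

```python
def place_stripe(chunks, nodes_by_rack):
    """chunks: chunk ids for one stripe; nodes_by_rack: {rack: [node, ...]}."""
    if len(chunks) > len(nodes_by_rack):
        raise ValueError("need at least one distinct rack per chunk")
    # Each chunk goes to a different rack, so a rack-correlated burst costs at
    # most one chunk per stripe.
    return {chunk: (rack, nodes_by_rack[rack][0])
            for chunk, rack in zip(chunks, nodes_by_rack)}

racks = {"r1": ["n1", "n2"], "r2": ["n3"], "r3": ["n4"], "r4": ["n5"]}
print(place_stripe(["d0", "d1", "d2", "p0"], racks))
```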

Nectar: Automatic Management of Data and Computation in Datacenters

  • Study shows that 20 to 40% of computation in a data center is redundant (based on production clusters in Microsoft). Over 50% of storage space is occupied by obsolete or seldom-used data sets (based on research cluster in MSR). Thus resource management is a problem.
  • Example application: click-log processing. All computations tend to do parsing and pre-processing, but only really need to do this once. Also, as the click-log grows, there is re-done computation on the previously-existing data.
  • 50% of data in the research cluster had not been accessed in the last 275 days.
  • Programs specified using LINQ, which may include user-defined functions (written in a .NET language). Submitted to DryadLINQ, which uses Dryad to execute it on the cluster. Data is stored in an in-house filesystem, called TidyFS.
  • Nectar interposes on submission to rewrite the LINQ program, performing lookups on a cache server. Dryad is modified to add cache entries based on computations that actually run. The distributed store contains Nectar program and data stores.
  • Mostly, the cache server gets several partial hits on the submitted query. The program rewriter will generate many programs that incorporate these hits into the submitted query, and choose the one that is cheapest to execute.
  • Considered an incremental MapReduce-style computation (Select-GroupBy-Select). Can cache the result of the GroupBy and the result of the Reduce. A new operator called MergeGroup can incorporate the previous GroupBy results with the results of Mapping and Grouping the increment data.
  • The executed program is stored in the Nectar program store, which contains a fingerprint based on the program and the data (see the sketch after the Q&A below). The program fingerprint is computed from a static analysis of the program text. Dryad puts the result data in both the user store and the Nectar data store, and puts a mapping in the cache server from the fingerprint to the data store.
  • If the data has been deleted (due to garbage collection or a lack of space), the program store provides a program that can be used to regenerate the file.
  • The cache server uses a SQL database to keep the cache entries. A cache entry has usage stats, lookup stats and a URI to the results. The garbage collector uses these stats to make victim decisions.
  • Cache policies for deciding when to insert and delete entries. Final results are always cached, but sub-queries are cached when a subcomputation is popular enough. Deletion policy computes a “value” for each entry, based on the size of the result (cost), cumulative execution time (saving), the last time it was used and the number of times it was used.
  • Garbage collector uses mark and sweep, based on information in the cache server. However, the cache can be inaccurate (due to concurrent execution, or lack of usage statistics). Leases protect new datasets.
  • Evaluated based on logs from production clusters. 36000 jobs written in a DryadLINQ-like language. Simulated the effect of caching on these jobs. Between 15% and 65% of jobs would benefit from caching, across 25 production clusters. Overall, 25% of jobs benefit.
  • Also performed controlled experiments on a research cluster. Evaluated a common subcomputation (parsing and lexical analysis) shared between three jobs, which leads to about a 98% saving in machine time.
  • Evaluated incremental click log processing as data was added from day to day. Saved as much as 50% execution time when executing on the 5th day, having the 4th day’s results cached. Could do better when combining results from existing computations.
  • Q: Given that storage capacity is increasing so quickly, the precious resources are memory and I/O bandwidth, so do you find that to be true in your system, and how would you evaluate that kind of tradeoff? It is very important not to waste machine time in a data center. Could you model this to find the optimal trade-off between storing and recomputing data (in terms of the garbage collector)? It depends on the storage pressure, so we may need a threshold or ratio to decide how much to store on disk. Some heuristic basically? Yes.
  • Q: How do you know or ensure that the computations really are deterministic? Is it built into the language or is it something that you assume? DryadLINQ assumes that the result is deterministic. If the user uses random numbers, Nectar will always return the same result. It is not a problem with Nectar, rather it is in the DryadLINQ assumptions.
  • Q: Are we just reinventing distributed databases under different names, as a community? We did learn a lot from databases (incremental view management and materialized views). But this is work at much larger scales.
  • Q: Could we combine some of the lineage detection techniques with systems that are a bit more heterogeneous (e.g. transferring data from cluster to cluster)? The big difference is that Nectar maintains the program that generates a derived data set, so we can always regenerate it. However, we assume that we never delete “primary” data sets.
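
The cache-lookup idea above boils down to hashing a program together with the fingerprints of its inputs. The sketch below is only an illustration (names are assumptions); the real program fingerprint comes from static analysis of the LINQ program and its dependencies, and the rewriter also exploits partial hits.

```python
import hashlib

cache = {}   # fingerprint -> URI of a cached result in the Nectar data store

def fingerprint(program_text, input_fingerprints):
    h = hashlib.sha256(program_text.encode())
    for fp in input_fingerprints:
        h.update(fp.encode())
    return h.hexdigest()

def run(program_text, input_fingerprints, execute):
    fp = fingerprint(program_text, input_fingerprints)
    if fp in cache:
        return cache[fp]                    # reuse: skip the whole computation
    result_uri = execute(program_text)      # run on the cluster (Dryad in Nectar)
    cache[fp] = result_uri                  # record for future (possibly partial) hits
    return result_uri
```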

Security Technologies

Intrusion Recovery using Selective Re-Execution

  • Need a combination of proactive security and reactive recovery mechanisms.
  • Status quo for recovery is limited: anti-virus tools are limited to predictable attacks, and backup tools can cause data loss on reverting. Therefore admins have to track down the effects of an intrusion manually. The challenge is to disentangle changes made by the attacker from those made by legitimate users.
  • This work helps user to do this disentangling on a single machine. Idea is to rollback affected objects after an intrusion is found, and then re-execute actions that were indirectly affected. This is a new approach to intrusion recovery, based on an action history graph using re-execution, “predicates” and “refinement”. Implemented a prototype on Linux that recovers from 10 real attacks.
  • Example attack scenario: modify /etc/passwd to add a new account, install trojans in pdflatex and ls to restart and hide the botnet. Afterwards, a new legitimate user is added.
  • Strawman idea: taint tracking on all OS-level dependencies in the system. Taint travels through shared files, but we have to be conservative in assumptions about how taint spreads, and so taint spreads too widely.
  • Second strawman: everything executes in a VM, and periodic checkpoints are taken. When an attack input is detected, rollback to the previous checkpoint, and replay inputs except the attack. Re-execution would be very expensive. Also, the original inputs may be meaningless in the context of the new system (due to non-determinism). Deterministic re-execution doesn’t work because the inputs themselves are being changed.
  • Selective re-execution uses an action history graph, which is a more detailed version of the taint-tracking dependency graph. Actions have dependencies, such as written/read data, exec arguments and exit codes. Selective re-execution only reruns some processes (like adduser in the example scenario). However, what about the exit code from adduser?
  • Many suspect computations are not affected: an attacker writing to a file may not taint everyone who reads from different parts of the file.
  • To minimize re-execution, specify predicates which allow Retro to skip equivalent computations. Also use refinement, which only replays fine-grained actions (not entire login sessions).
  • Example predicate: exit code. With the same exit status as before, no need to taint the admin shell after adduser Alice (see the sketch after this list).
  • Example refinement: individual functions. So could do getpwname() instead of a generic read() on /etc/passwd.
  • What if the effect of the attack was externally visible? This cannot be solved in the general case (e.g. recalling spam sent from an attacked host), but Retro could, for example, email the user a diff of the output from a terminal session.
  • Implemented in about 4000 lines of C and another 200 lines of Python for the repair controller system. Use btrfs as a checkpointing file system. Use ptrace to shepherd the re-execution of processes and skip equivalent syscalls.
  • Evaluated by looking at whether Retro is better than manual repair, and what the cost of using Retro is during normal execution.
  • Used 2 real-world attacks from a honeypot. 2 synthetic challenge attacks (including the example scenario) and an sshd trojan. And 6 attacks from Taser. Retro can recover from all attacks. In 6 cases, it required no user input. However, sometimes need to skip the attacker’s login attempt. Two required significant help to get beyond skipped network traffic. However, Retro can pinpoint the exact objects that are involved, which is better than the manual status quo.
  • Repair cost is proportional to the extent of the attack. Worst case is the sshd trojan, but the other cases taint far fewer objects. Repair time is proportional to the number of objects, rather than the log size. This is much more efficient than VM re-execution.
  • Evaluated the runtime overhead based on running HotCRP from just before the SOSP 2007 deadline. CPU cost was 35%, with storage overhead of 100GB/day. Also looked at Apache-small-static-files and kernel recompilation workloads, which had much higher overhead in terms of CPU and storage. However, can still store 2 weeks of logs on a 2TB disk in the worst case.
  • Q: How do you pick the level at which you record and re-execute actions? Did you experience any cases where it was fragile to small changes in the replay? We reintroduced higher-level managers in response to false dependencies that we saw. The interface between the repair controller and the repair managers is well-defined so it would be easy to add new abstractions if necessary.
  • Q: Could you tune the system to record less-precise information which would make it harder to recover, but pay a much lower cost? There are two kinds of trade-offs to make: either the number of user actions to preserve, or assume that some things won’t get compromised. Could also trade-off logging cost for repair time (by re-executing more things).
  • Q: What do you do about non-deterministic programs and negative information flow? Just want some acceptable execution that isn’t compromised, so actually non-determinism may be okay. Could this modify my data on disk? Yes. Can only provide the guarantee of an “acceptable” execution, not the same.
  • Q: What if the attacker could assume that Retro would be there: could he cause a large number of false positives? Well, the sshd trojan is perhaps a case like that, but our hope is that it would be much easier to detect these attacks, but we haven’t experimented with this.
  • Q: Did you evaluate throughput and latency for web workloads? Is there any protection of Retro from attackers? How hard would it be to adapt Retro to a closed-source system like Windows? For Windows machines, you would have to implement many more managers, but we believe the ideas are still applicable. For protecting Retro, we use SELinux to protect the kernel and Retro, but we assume that the kernel is intact, although we could consider a VMM approach. The libc wrappers? Have to be careful with those. [Throughput and latency question dragged offline.]
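
To make the exit-code predicate concrete, here is a minimal sketch of how selective re-execution can skip downstream actions when a re-executed command produces the same exit status as the logged run. The function names and action format are hypothetical, not Retro's actual interfaces.

```python
# Hedged sketch of the "exit code" predicate during selective re-execution.
import subprocess

def reexecute(action):
    """Re-run one logged action (e.g. adduser) and return its new exit code."""
    return subprocess.call(action["argv"])

def repair(action, logged_exit_code, dependents):
    new_exit_code = reexecute(action)
    if new_exit_code == logged_exit_code:
        # Predicate holds: equivalent result, so downstream consumers of the
        # exit code (e.g. the admin's shell) need not be tainted or re-run.
        return []
    # Predicate fails: dependents must be re-executed as well.
    return dependents
```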

Static Checking of Dynamically-Varying Security Policies in Database-Backed Applications

  • Authentication is complicated, especially for web apps, and getting it wrong can have major consequences. Typical development involves thinking of some attacks, and auditing for these. A more principled approach involves setting a security (information flow and access control) policy on a resource.
  • One approach is dynamic policy checking, by adding metadata to objects in the system and checking them before objects interact with the environment. This is easy to add to existing programs, and very flexible, but it only finds bugs for tested program paths, and there is performance overhead.
  • Alternatively, could do static checking, which can run at compile time and involves no changes to runtime behaviour. But this usually requires extensive annotations, and has limited policy expressiveness if the type system is used.
  • UrFlow is a new analysis approach for the Ur/Web programming language. Contains a flexible and programmer-accessible policy language based on SQL. However, it runs statically at compile time and adds no runtime overhead.
  • Ur/Web has integrated parsing and type-checking of SQL and XML (HTML).
  • Can have simple policies based on whether rows exist in a table. Or can reason about what the client “knows”, such as string literals and contents in the source code. Can also have policies that join between multiple tables.
  • Can use first-order logic to reason about what is known at each point in the program. Then reconcile that with the SQL query policy (which may also be expressed in first-order logic). Then use an automated theorem prover to prove that the code meets the policy.
  • Program code is turned into a finite set of execution paths, and symbolic execution is done on those paths. This leads to a series of states, and a series of check predicates.
  • Evaluated using case studies for representative web applications: secrets, poll, user DB, calendar, forum and gradebook. 8 to 134 lines of code in the policies (forum most complicated).
  • Idea of using SQL is to take advantage of code that is already written in web apps, and get a free lunch from the checking. Hope to add these to a more mainstream language.
  • Q: Do you have any intention to analyse how hard it is to write policies? The analysis is never going to change the behaviour of the program, so no policies should block legitimate access to data. What about indirect paths to data? This is a hard problem and I don’t have an answer. The code author’s job is made easier by moving to a declarative paradigm.
  • Q: To what extent can your program give a better answer than just a yes or no from the theorem prover? When the analysis thinks it detects a violation, it displays the instruction that it thinks might be wrong and all of the facts it knows at that point. However, at least this is a static view rather than a dynamic view of the program. There’s probably a heuristic for filtering out irrelevant facts, but I haven’t thought much about the details.
  • Q: How do you manage the accumulation of security policies over time and becoming cumbersome? The language supports ways of encapsulating common patterns, which can help to shrink down the size of policy that you need to read.
  • Q: Is there a theoretical difference to existing security automata [inaudible]? Techniques very similar to software model checking. Can it be applied to type analysis? I haven’t seen any type systems that could reason about this.
  • Q: What about security policies that refer to exogenous facts that don’t exist in the database (like timed event sequences)? You can expand the database to store state fields that allow you to encapsulate the model. This would be worth looking into.

Accountable Virtual Machines

  • Scenario of a multiplayer networked game. In Counterstrike, the amount of ammo is local state, and fire events are broadcast to other players. So a player can modify the game not to decrement that counter. Such cheats exist and are used.
  • Cheating is a serious problem in itself, and since gaming is a multi-billion-dollar industry, we need techniques to stamp it out. It is also a symptom of a more general problem: how do you know that a networked system is running as intended?
  • Want to be able to detect that a remote machine is faulty, and obtain evidence that can convince a third party.
  • Challenges: mutual distrust and lack of access to the source code to understand how it should work.
  • Software runs in an accountable virtual machine on an accountable virtual machine monitor. Maintain a log of network inputs and outputs, and can check this log with a reference image. If the AVM is correct, the same inputs will produce the same outputs. Otherwise it is considered to be faulty.
  • Solution involves two pieces. First is tamper-evident logging (based on PeerReview). Log uses a hash chain and messages contain signed authenticators (see the sketch after this list).
  • Second part is execution logging. How does Alice know if the log is from a correct execution of her software image? The AVMM needs to specify an execution and additionally log all nondeterministic inputs.
  • Can audit and replay a log. If one player changes the reference image, the logs will diverge, which indicates that the AVM is faulty.
  • AVMs provide strong accountability for arbitrary, unmodified binaries. Don’t need to trust other participants, the AVMM or any software running on the other machine.
  • Evaluation based on the example of cheating in Counterstrike. Used a prototype AVMM based on VMWare Workstation 6.5.1 which has a mature logging/replay engine. Extended this with tamper-evident logging and auditing. Used three players on Nehalem boxes running Windows XP SP3.
  • Can AVMs detect real cheats? If the cheat needs to be installed in the AVM, it can trivially be detected (since the branch counter and instruction pointer will change). Examined 26 real cheats from the internet, and all were detectable.
  • Could cheats be adapted to subvert the AVMM? There are impossible-to-detect cheats (collusion), guaranteed detectable cheats, and technically-difficult-to-evade cheats, which we can’t guarantee are impossible to hide.
  • What is the impact on frame rate? With no fps cap (windowed mode, 800×600, software rendering), the frame rate drops by about 13% from bare hardware to AVMM. Still around 137 fps though. The main culprit is execution logging (about 11% overhead), so accountability only costs another 2%.
  • For a one-hour game, how expensive is auditing? The log is about 8MB per minute, which compresses to 2.47MB per minute. So a one-hour log is about 148MB. Replay takes approximately one hour.
  • Idea: stream logs to auditors during the game and get detection almost immediately. Since the machines are quad-core (one core plays the game and one does logging), it is possible to do replay on one or two of the spare cores. This causes some drop in the frame rate, but it stays above 100 fps.
  • Q: What if the issue is that a machine isn’t cheating but is in fact sabotaged? The AVM provides a full record of what has happened, and could demonstrate that the owner is not at fault.
  • Q: Would it be possible for the malicious party to run another instance in the background to get a plausible-looking log? The two copies would be completely independent, because the authenticators are signed with one machine’s key. The outputs wouldn’t match, so this would be detectable.
  • Q: Do most of the cheats have the form of changing the inputs? What is the equivalence class of cheats that you can and can’t detect? If the cheat causes the network I/O or control flow to be inconsistent, we can detect it.
  • Q: What about concurrency (e.g. who fires first) and delaying traffic in the network to change who fires first? Concurrency wouldn’t affect detectability. But from a distributed systems point of view? We can’t guarantee detection.
  • Q: What if the program that you’re dealing with has non-deterministic output? Randomness is not a problem for us because at the VM level, everything is deterministic.
  • Q: Is it sufficient to log at the AVM level when the application has latitude to process events in a different order? If this is a legitimate thing for the application to do, then we are vulnerable.
  • Q: How does privacy interact with the implementation of your system? Would you end up sharing too much information with your adversary (especially in the case of things that are more important than games)? It’s true that AVMs are verbose in what gets logged. The impact on privacy depends on what you’re doing. Auditing a machine that you’re paying for seems reasonable. It has to be okay for you as the auditor to see all the computation that is going on. You can perhaps delay the online auditing.
  • Q: How can you detect cheats that involve improving the player’s skill (or making all the walls invisible or using an aimbot)? Can’t help information leakage with this technology.
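
A minimal sketch of the tamper-evident log in the PeerReview style described above: every entry extends a hash chain, and each outgoing message carries an authenticator computed over the latest chain hash. Real AVMs sign authenticators with a private key; HMAC stands in here purely for illustration.

```python
# Hedged sketch of a tamper-evident log (hash chain + authenticators).
import hashlib, hmac

class TamperEvidentLog:
    def __init__(self, signing_key: bytes):
        self.key = signing_key
        self.top_hash = b"\x00" * 32   # h_0
        self.entries = []

    def append(self, entry: bytes) -> bytes:
        """h_i = H(h_{i-1} || entry_i); returns an authenticator over h_i."""
        self.top_hash = hashlib.sha256(self.top_hash + entry).digest()
        self.entries.append(entry)
        # Stand-in for a public-key signature over the latest chain hash.
        return hmac.new(self.key, self.top_hash, hashlib.sha256).digest()

    @staticmethod
    def verify_chain(entries) -> bytes:
        """Recompute the chain during an audit and compare to authenticators."""
        h = b"\x00" * 32
        for e in entries:
            h = hashlib.sha256(h + e).digest()
        return h
```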

Concurrency Bugs

Bypassing Races in Live Applications with Execution Filters

  • Races and atomicity violations can be a big problem, and hard to find. One solution is to put everything under a big lock. Or can go for a complex fix by decomposing behaviour and adding state variables. But it takes a long time to fix these races, even if the root cause is well understood.
  • Idea of LOOM is to filter out buggy thread interleavings to mitigate bugs while the problem is fixed. Write declarative execution filters to specify the disallowed interleavings.
  • LOOM also works to implement live workarounds, so it is possible to install an execution filter into a running application. It installs execution filters safely and avoids introducing new errors.
  • LOOM incurs little overhead during normal execution.
  • Evaluated LOOM on 9 real races, and it bypasses all evaluated races safely, applies immediately, has little overhead and scales well.
  • LOOM is divided into Static Preparation (uses the application source, which is compiled down to LLVM bytecode where transformations are applied. This adds a mechanism for LOOM update injection and safety maintenance) and Live Update (passing an execution filter to the LOOM controller, which passes code to the LOOM update engine which is linked into the application binary).
  • Safety issue: what if the program is running inside a critical section when you update it? Or what if there are order constraints on multiple threads? The LOOM update engine identifies regions of the code where it is dangerous to update while any thread is inside them. It then evacuates the dangerous regions, getting threads out, and applies the update once they are all out.
  • To control application threads, add conditional breakpoints (in the instrumentation phase) on iterative or recursive control flow (since there is no point in pausing code that can quickly run to completion): i.e. CFG backward edges. Details of the breakpoint mechanism in the paper. They use MRSW locks which are unlocked and reacquired in the breakpoint to pause threads (if the update manager grabs the write lock in the meantime). The back-edges are identified as safe or dangerous, and the breakpoint only pauses the thread at safe back-edges (see the sketch after this list).
  • Hybrid instrumentation to reduce the performance overhead. Have two versions of an instrumented function: fast and slow. The slow version includes configurable slots for additional instructions. During the common case (no update), the fast path is always followed; only if the code is updated do the slots get used.
  • Measured the bare instrumentation overhead on Apache, MySQL and SPLASH benchmarks, and the overhead was between -1.84% and 4.11%. This is much better than Pin, which has from 14.94% to 1065% overhead.
  • Measured the scalability of the number of threads. Overhead from 1 to 32 threads (on a 48-core box) is less than 12% in all cases. There is some anomalous speedup for 4 threads.
  • Intend to extend this to memory errors and security errors.
  • Q: Is there a potential for deadlock in your evacuation algorithm? That may be the case, but we didn’t observe this in our evaluation, and we can always “uninstall” the fix if there really is deadlock.
  • Q: Can you collect information at runtime to aid you with fixing the bug at the source code level? Can write execution filters based on line numbers.
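
The following sketch illustrates the evacuation idea: instrumented back-edges check an update-pending flag and park the thread only at safe points, and the update engine applies the execution filter once every thread is parked. It substitutes a flag and semaphore for LOOM's MRSW locks, so the names and mechanics are simplifying assumptions rather than the real implementation.

```python
# Hedged sketch of evacuating dangerous regions before a live update.
import threading

update_pending = threading.Event()   # set by the update manager
update_done = threading.Event()      # set once the filter is installed
parked = threading.Semaphore(0)      # counts threads parked at safe points

def backedge_breakpoint(in_dangerous_region: bool):
    """Instrumented at loop back-edges; pauses the thread only at safe points."""
    if update_pending.is_set() and not in_dangerous_region:
        parked.release()             # report "I am outside any dangerous region"
        update_done.wait()           # block until the live update is applied

def install_filter(num_threads, apply_filter):
    update_pending.set()
    for _ in range(num_threads):     # wait until every thread has evacuated
        parked.acquire()
    apply_filter()                   # safe: no thread is inside a dangerous region
    update_pending.clear()
    update_done.set()                # release the parked threads
```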

Effective Data-Race Detection for the Kernel

  • Looked at a data race in the Windows kernel. Thread A ANDs out a bit, and Thread B ORs a different bit in. These can interleave badly, leading to one of the mutations getting lost, and cause a system hang.
  • Data races are very hard to reproduce since the timings are very tight. And they are hard to debug, since they could be mistaken as a hardware bit-flip. They are also becoming more common, since people are moving away from monolithic locks to fine-grained locking and lock-free approaches.
  • Previous techniques are happens-before and lockset algorithms. The Intel Thread Checker runs at a 200x overhead, due to logging all synchronization accesses and instrumenting all possibly-racy memory accesses. This prevents practical usage in the field. Indeed, false failures arise due to timeouts. They also require complete knowledge and logging of all locking semantics, which may not be feasible in practice (since there are diverse ways of implementing locking, and many of them are used in the kernel).
  • Trade-off accuracy for efficiency. Can pick up false and benign data races. False data races are ones that cannot actually occur. Benign data races can and might occur, but are intentionally put there by the developer. Statistics counters are an example of this.
  • Another goal is to give the user control over the overhead (from 0.0x up).
  • Also want actionable data that is useful for the developer.
  • First insight: don't infer that a data race might happen; instead actually cause it to happen (see the sketch after this list). Second insight: sample memory accesses for controllable overhead. Use code and data breakpoints, selected randomly to get uniform coverage. A data breakpoint pauses the first thread when it hits some memory location; if another thread comes along and accesses the same location, flag the current thread states as a possible data race.
  • Memory accesses are sampled by setting code breakpoints, which lead to data breakpoints being set on the relevant locations.
  • Results: most dynamic data races are benign (104/113). Many could be heuristically pruned (86/113). Found 25 confirmed bugs in the Windows OS, with 8 more pending investigation.
  • Saw two bits in the same byte, each protected by individual locks, have a data race with bits being set and cleared.
  • The breakpoint approach found data races with as little as 5% overhead.
  • Future work includes prioritizing benign versus non-benign races, and also looking at (false) data sharing and its performance impacts.
  • Q: What are your preliminary thoughts on how to identify bugs as benign automatically? There are a lot of unanswered questions there….
  • Q: How random is your breakpoint setting? The demo was considering the ndis and ntfs drivers, and randomly sampled within those. The breakpoints are set based on a feedback loop to get the rate.
  • Q: How dependent is the technique on having full debug symbols, since the problems may be in third-party device drivers? The dependence is so we know where the memory access instructions are. You can load the module into memory and use some other technique to disassemble them, but we ended up accidentally putting a breakpoint into data.
  • Q: How is the sampling rate related to the expected time until you detect a particular race? We haven’t investigated that, but we would like to verify that we’re not just getting the easy races.
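
A toy sketch of the "cause the race to happen" idea: pause at a sampled access and check whether the value changes underneath us while paused. The real detector uses hardware code and data breakpoints to trap the second thread directly; this pure-Python version only mirrors the simpler re-read check, and the shared dict stands in for kernel memory.

```python
# Hedged sketch: widen the race window by pausing, then detect a concurrent write.
import threading, time

memory = {"flags": 0}          # stand-in for a kernel data structure
races = []

def sampled_read(addr, pause_ms=2):
    """Pause at a sampled read; if the value changes while we sleep,
    another thread wrote it concurrently, i.e. a possible data race."""
    before = memory[addr]
    time.sleep(pause_ms / 1000.0)          # widen the race window
    after = memory[addr]
    if after != before:
        races.append((addr, before, after))
    return after

def racy_writer():
    memory["flags"] |= 0x2                 # unsynchronized read-modify-write

t = threading.Thread(target=racy_writer)
t.start()
sampled_read("flags")
t.join()
print(races)
```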

Ad Hoc Synchronization Considered Harmful

  • Synchronization is important to ensure the correctness of concurrent programs. Sometimes synchronization code is easy to identify (pthreads calls), but sometimes it is hard to recognise. Sometimes it even uses goto statements.
  • Ad hoc synchronization harms program reliability. Because it is hard to recognise, the programmer may be unaware of it. This may lead to bugs and performance issues. Saw up to 67% of ad hoc sync introducing bugs. It makes program analysis more difficult, may introduce hard-to-detect deadlocks, cause false positives in data race checkers and confuse performance profiling. It may even cause problems for compilers and relaxed memory consistency models.
  • Contributions: quantitative evidence of the harmfulness of ad hoc syncs; and a tool (SyncFinder) that automatically identifies and annotates ad hoc syncs (detects deadlocks and bad practices).
  • Examined server, desktop and scientific applications (including Apache, MySQL, Mozilla JavaScript engine, SPLASH). Found lots of ad hoc sync loops in each of them (83 in MySQL). In OpenLDAP, 67% of ad hoc syncs were buggy; in Apache, 22% were buggy.
  • Showed an example of a hard-to-detect deadlock in Mozilla JS, which involves three threads. And an example which does an unnecessary sleep in MySQL, causing performance issues.
  • Ad hoc synchronization leads to benign data races. Leads to false positives in data race detectors (see above).
  • Diverse kinds of ad hoc synchronization: a while loop or a goto loop; a single exit condition or multiple conditions; spinning on a synchronization variable or a more complicated dependency structure.
  • To identify ad hoc sync, first need to understand it. Every instance on the waiting side has a loop body (a “sync loop”) with at least one exit condition associated with at least one exit condition variable, plus a synchronization variable; the setting side writes to the synchronization variable (see the sketch after this list). First detect the loop, then extract the exit condition, then detect the exit-dependent variable set, and finally prune the set to find the synchronization variable.
  • To prune the sync loop, need to find variables that are shared with a remote thread, and are loop-invariant.
  • Report annotated code with the line numbers of sync variable reads and writes.
  • SyncFinder identifies 96% of ad hoc syncs in the considered applications, with only 6 false positives (out of thousands of loops).
  • Use cases: a tool to detect bad practices, and an extended race detector in Valgrind.
  • Q: What about the amount of bugs in non-ad hoc synchronization: do you have any data on that? We did some previous work on the reliability of normal synchronization, and we found that the proportion of buggy ad hoc sync is higher.
  • Q: Why is ad hoc synchronization so pervasive? What is the limitation in sync primitives, or is it just performance? We checked the comments surrounding this ad hoc synchronization. People just want a flexible way to do sync. They assume that it is a short-term fix (or that it will not be used in the common case). Would you suggest an additional sync primitive? Conditional wait and signal.
  • Q: Do people do this to avoid a synchronous write on the system bus? Is it always harmful? Your performance assumptions may change in the future on different architectures, and the cost to maintainability is probably higher than the performance win. Are lock-free data structures bad, then? I cannot say that.
  • Q: Is there any chance that these application developers declare their sync variables as volatile? They may do this, but it doesn’t guarantee that that is how they use it. We thought about this for our tool.
  • Q: How do you identify the bugs? We checked the code repository for patches on those locations and identify bugs in Bugzilla and the changelog.
  • Q: How long does SyncFinder take to run on the code base? Could you run it on the Linux kernel? We ran it on the 1 MLOC MySQL, which takes about 2.5 hours. The complexity is data-dependent: OpenLDAP is much smaller but takes almost as long as MySQL.
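
For concreteness, here is the shape of an ad hoc synchronization that SyncFinder targets: a sync loop whose exit condition reads a loop-invariant shared variable, and a setting side that writes it. The example is purely illustrative and written in Python rather than the C/C++ of the applications studied.

```python
# Illustrative ad hoc synchronization: waiting side spins, setting side writes.
import threading, time

done = False                      # the synchronization variable
result = None

def setting_side():
    global done, result
    result = 42                   # produce some data
    done = True                   # remote write to the sync variable

def waiting_side():
    while not done:               # sync loop: exit condition on shared `done`
        time.sleep(0.001)         # ad hoc: polling instead of a condvar/signal
    return result

t = threading.Thread(target=setting_side)
t.start()
print(waiting_side())
t.join()
```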

SIGCOMM 2010: Day 3

September 2nd, 2010

Network IDS

NetFence: Preventing Internet Denial of Service from Inside Out

  • DDoS is projected to be the biggest problem facing the internet in the next 12 months, and it is difficult to combat, since combating it conflicts with the internet design principles of openness and robustness.
  • Previously, people have looked at the receivers of the DDoS (Denial of Edge Service). Usually using network filters or network capabilities.
  • But with a large enough botnet, bots can collude to send packet floods which impair network services.
  • Challenge is to design a network architecture that combats both kinds of attack.
  • Solution: NetFence. Gives the network control over its resource allocation to combat denial of network services (DoNS). Also hierarchical and coupled with network capabilities.
  • Hierarchical congestion policing slows down flooding senders, and is robust to both compromised routers and hosts. Uses symmetric key cryptography, and each packet carries a secure token (based on Passport from NSDI 2008).
  • Secure congestion policing feedback acts like network capabilities. Capabilities are returned if the receiver authorizes the traffic as “desired”.
  • Two types of packet: request and regular. Packet has five fields: mode (nop/monitor), link ID, action (up or down), timestamp and MAC (authentication).
  • First a sender sends a request packet. The access router stamps the mode field as nop, and the MAC is calculated based on a hash of the original fields. If an attack is deemed to be underway (i.e. congestion is encountered), the action gets set to down (deprioritize), the link ID is stored and the mode is set to monitor. The routers have keys distributed using Diffie-Hellman over BGP.
  • Policing is done at the access router, which looks at the packet sent back from the receiver (mode, action, etc.), and configures a leaky bucket as necessary.
  • Congestion policing loop uses AIMD at the access router to vary the sender’s bucket capacity (see the sketch after this list).
  • A policing cycle is started based on a load- or loss-based detection procedure in the bottleneck router. RED is used to signal congestion within a cycle.
  • Works because: (i) secret keys used by routers to do feedback, (ii) periodic AIMD used to achieve fairness/efficiency, and (iii) congestion feedback acts as capabilities to prevent unbounded traffic.
  • Provable fairness shown in the paper. i.e. Each good user achieves a proportion of the network capacity that is equal to one over the total number of senders. Denial of service becomes “predictable delay of service”.
  • Many possible attacks against a system like this. Discussed in the paper, but two are discussed here.
  • To deal with floods of request packets, the request packet channel is separate, and there is a per-sender request packet limit, which is policed. There is a priority-based backoff which emulates computational puzzles.
  • To deal with routers hiding backoff feedback, the system treats the absence of an up-feedback as down-feedback.
  • Implemented on Linux using XORP and Click. AES-128 is the MAC function (see Encrypting the Internet). Benchmarked using DeterLab, dual-core 3GHz Xeons with 2GB of RAM.
  • The bottleneck router has 0 processing overhead when there is no attack. Overhead is 492–554 ns/packet (one AES computation) when there is an attack.
  • Shim layer between IP and TCP (or something else), which adds a header overhead between 20–28 bytes.
  • Simulated using NS-2 to evaluate various attacks. Compared to other systems, which put more state in the core.
  • Experiment on a denial of edge services attack. As the number of simulated senders increases, the file transfer time remains constant, unlike Fair Queuing, where it increases; TVA+ and StopIt are faster (but less scalable).
  • Experiment on a denial of network services attack. Looked at ratio of average user throughput to average attacker throughput. NetFence achieves fairness.
  • Q. Do you distinguish good and bad users? No, we used AIMD to achieve fairness instead.
  • Q. How can you separate a flash crowd from malicious traffic? We don’t, treating extreme congestion the same way as an attack because it is a failure of end-to-end congestion control.
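
A minimal sketch of the AIMD policing loop at the access router: each control interval, the sender's leaky-bucket rate is cut multiplicatively on "down" feedback and increased additively otherwise. The constants and structure are assumptions for illustration, not NetFence's actual parameters.

```python
# Hedged sketch of AIMD rate adjustment for a sender's leaky bucket.
ALPHA = 0.125e6        # additive increase, bytes/sec per interval (assumed)
BETA = 0.5             # multiplicative decrease factor (assumed)
RATE_MIN, RATE_MAX = 10e3, 1e9

def update_bucket_rate(current_rate, feedback_action):
    if feedback_action == "down":          # bottleneck signalled congestion
        return max(RATE_MIN, current_rate * BETA)
    # "up" feedback (absence of up-feedback is treated as "down" upstream)
    return min(RATE_MAX, current_rate + ALPHA)

# Example: a flooding sender keeps receiving "down" feedback and converges
# towards RATE_MIN, while a well-behaved sender ramps back up additively.
rate = 100e6
for action in ["down", "down", "up", "up", "up"]:
    rate = update_bucket_rate(rate, action)
    print(action, rate)
```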

ASTUTE: Detecting a Different Class of Traffic Anomalies

  • Network management is used to ensure that customer SLAs, security policies, resource availability are maintained. Anomaly detection normally involves building a statistical model of normal traffic and defining an anomaly as a deviation from normal.
  • However, it is hard to obtain a model of “normal” traffic. Look at a time series of packet counts, and usually define a model baseline (tolerance) based on something like EWMA, and anomalies are anything outside that. However, training isn’t guaranteed to be anomaly-free.
  • Aim is to detect anomalies without having to define what is normal. Advantage is a simple tool that doesn’t have to perform training and is hence immune to data poisoning. It is accurate for a well-defined class of traffic anomalies, with theoretical guarantees on the false positive rates. However, its applicability is limited to when traffic characteristics don’t change.
  • Empirical properties: flow independence (although some weak correlation between flows), stationarity (time invariance over the timescales of a typical flow duration), and independence and stationarity => equilibrium.
  • ASTUTE = A Short-Timescale Uncorrelated Traffic Equilibrium. Between two consecutive time-bins, flow volume changes are zero-mean i.i.d.
  • Measure the number of flows, mean volume changes and variance of volume changes between consecutive time bins. Flag an alarm if the ASTUTE Assessment Value (AAV), calculated from these, is greater than some threshold (see the sketch after this list).
  • The threshold controls the false positive rate. Appeal to the central limit theorem, so for a large number of flows, the AAV has a Gaussian distribution. False positive rate is just the area of the bell curve outside the threshold.
  • If ASTUTE is violated, at least one of the model assumptions is violated. For example, stationarity. Long bin sizes (one hour) lead to anomalies flagged when people arrive and leave at the beginning and end of the day (daily bias). Short timescales see no bias at all.
  • Worked with flow traces from Internet2, GEANT2 and the Technicolor corporate network. Compared Kalman and Wavelet filters.
  • Small overlap between anomalies detected by ASTUTE and the other methods. ASTUTE finds different classes of anomalies: tends to be larger numbers of flows with fewer packets than the Kalman and Wavelet approaches.
  • Plotted classified anomalies in each network on a similar graph (#flows vs #packets per flow), and saw that ASTUTE is worse on DoS attacks, but better on prefix outages, link outages and gaps, and port scans.
  • Looked at the ROC curve to see the trade-off between false and true alarms. Kalman would need a much higher false positive rate to detect port scans. But ASTUTE would require a very high false positive rate to detect DoS attacks.
  • Q. Can you not detect large flows because the time windows are so short that they look i.i.d. over those time scales? If it has a small number of flows, it will look independent to ASTUTE. There is an analytical limit to how many flows you need before you can detect it (threshold-squared).
  • Q. Who cares about detecting correlated flows? ASTUTE is not only useful for anomaly detection. But the interesting thing is that it can identify things that the operator would not be aware of, like bugs in misbehaving applications.
  • Q. Do you have the ground truth that the DoS attacks are real DoS attacks? Yes, we have analyzed the data, and there were lots of SYN packets going to a single location, usually from a single IP.
  • Q. Is there a way to classify an anomaly or is it ad hoc? We started with visual inspection, but we developed a tool for this.
  • Q. If your traffic is skewed towards a few flows, does the CLT hold? The CLT and assumption that we have lots of flows is an assumption for normal behavior.
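
A hedged sketch of the ASTUTE check: compute per-flow volume changes between two consecutive bins, form a standardized mean (one plausible form of the AAV that is consistent with the CLT argument above), and raise an alarm when it exceeds the threshold. The exact AAV definition in the paper may differ in detail.

```python
# Hedged sketch of an ASTUTE-style equilibrium check between two time bins.
import math

def astute_alarm(volumes_prev, volumes_curr, threshold=3.0):
    """volumes_*: dict mapping flow key -> byte (or packet) count in that bin."""
    keys = set(volumes_prev) | set(volumes_curr)
    deltas = [volumes_curr.get(k, 0) - volumes_prev.get(k, 0) for k in keys]
    f = len(deltas)
    if f < 2:
        return False, 0.0
    mean = sum(deltas) / f
    var = sum((d - mean) ** 2 for d in deltas) / (f - 1)
    if var == 0:
        return False, 0.0
    aav = mean * math.sqrt(f) / math.sqrt(var)   # ~N(0,1) under the equilibrium
    return abs(aav) > threshold, aav
```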

NetShield: Massive Semantics-Based Vulnerability Signature Matching for High-Speed Networks

  • Maintaining network security is a grand challenge. Worms and botnets are widespread.
  • Talk concentrates on signature-based IDS. Normally, there is a database of signatures which is matched against each packet and used to generate alerts. This needs to be accurate and fast.
  • State of the art is regex-based. Used in Cisco IPS, Juniper IDS and Bro (open-source). It can efficiently match multiple signatures simultaneously using an NDFA, and can describe the syntactic context. But the expressive power is limited, and it cannot describe the semantic context. This leads to inaccuracy.
  • Other state of the art is vulnerability signatures for host-based IDS. It directly describes the semantic context and is very expressive (able to describe the vulnerability exactly). It’s accurate, but slow, using sequential matching and requiring protocol parsing.
  • Vulnerability signature matching requires parsing, matching and combining. Since the protocol grammar is context-sensitive, a regex cannot be used (and would be impractical anyway).
  • Also a regex assumes a single input, so it cannot help with the combining phase.
  • So regex approaches cannot be used to match vulnerability signatures.
  • First challenge: matching thousands of vulnerability signatures simultaneously. Second challenge: parse protocols quickly. Solution achieves 10G throughput with an efficient matching algorithm and a tailored parsing design for high-speed matching.
  • Basically, a vulnerability signature uses a sequence of protocol data units (PDUs) with one predicate per PDU. PDU could be something like the HTTP version or the method. Need numbers and strings, number operators (comparisons) and string operators (equality, regex matching and length)
  • Given n signatures defined on k matching dimensions, a matcher is a two-tuple (field, operation) or a four-tuple for associative array elements. This leads to an n-by-k table. A table representation admits the possibility of matching multiple signatures simultaneously. Table looks like an associative array, with lots of don’t-cares.
  • Worst-case time complexity is O((log n)^(k-1)), with O(n^k) space complexity. Based on the Snort and Cisco rulesets, which have selective matchers, the design actually gives O(k) time complexity.
  • Iterative matching algorithm on the columns, based on intersecting relevant rulesets with special treatment for don't-cares (see the sketch after this list).
  • Merging requires k-1 iterations. Worst-case merge complexity is O(n), but for real-world rulesets it will be more like O(1).
  • For high-speed parsing, compare tree-based and streaming parsers. Streaming parsers can retain only signature-related fields. Built an automated parser generator that builds a parsing state machine for the protocol.
  • Implemented in 10kloc of C++ and 3kloc of Python. Evaluated on 26GB traces from Tsinghua University, Northwestern and DARPA. Run on a P4 3.8GHz with 4GB of RAM. For HTTP, 794 vulnerability signatures, and WINRPC 45 vulnerability signatures. Speedup ratio compared to Binpac is around 11x for non-HTTP and 3–4x for HTTP. Maintained throughput of 2.63 (HTTP in the university) to 17.6 (HTTP at Northwestern) Gbps for parsing and matching. Multicore gives a speedup.
  • Tool available online.
  • Q. Can you go into more details about the memory overhead? DFA requires 5.29GB for 973 Snort rules, whereas NetShield requires 2.3MB. The XFA paper showed 863 rules in 1.08MB. NetShield could improve by implementing XFA.
  • Q. Is it possible to do the massive matching using GPUs? Currently, most connections are independent, so yes probably.
  • Q. Do your scalability results not show that you require a clock cycle per bit? We only have to look at the bits in the signature.
  • Q. What are the advantages of your scheme with respect to XFA? Limited accuracy: XFA would make false positives.
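
A minimal sketch of the column-by-column candidate selection: each signature is a row of per-field matchers with don't-cares, and matching intersects, per field, the signatures satisfied by the parsed PDU value with those that don't care about that field. NetShield uses specialised per-operator data structures; this only shows the underlying set algebra, and the signatures are made up.

```python
# Hedged sketch of table-based candidate selection with don't-cares.
DONT_CARE = object()

rules = {
    "sig1": {"method": "GET", "filename_len": lambda n: n > 300, "version": DONT_CARE},
    "sig2": {"method": DONT_CARE, "filename_len": DONT_CARE, "version": "1.0"},
}

def match(pdu_fields):
    candidates = set(rules)
    for field, value in pdu_fields.items():
        satisfied = set()
        for name, row in rules.items():
            matcher = row.get(field, DONT_CARE)
            if matcher is DONT_CARE:
                satisfied.add(name)          # don't-care always passes
            elif callable(matcher):
                if matcher(value):
                    satisfied.add(name)
            elif matcher == value:
                satisfied.add(name)
        candidates &= satisfied              # column-wise intersection
    return candidates

print(match({"method": "GET", "filename_len": 512, "version": "1.1"}))  # {'sig1'}
```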

Network Architecture and Operations

R3: Resilient Routing Reconfiguration

  • Failures are common, but today’s emerging applications impose a stringent requirement of network reliability. Plus SLA violations may impact an ISP’s revenue. Aim is to recover quickly from a single or multiple overlapping failures.
  • In a 500-link network, the number of failure scenarios with up to three failed links exceeds 20 million. So it is difficult to optimize routing to avoid congestion under all possible failure scenarios.
  • Existing approaches focus exclusively on reachability. But these may lead to congestion and unpredictable performance. Some existing approaches consider only a small subset of failures, or optimize routing after failures, but this is too little, too late.
  • R3 requires no enumeration of failure scenarios, is provably congestion-free, efficient in terms of storage overhead and flexible to diverse requirements.
  • Represent network as a graph, with link capacities and traffic demands on each link. Output of R3 is a base routing and a protection routing. Protection routing is a fast rerouting defined for every link that might fail.
  • Idea is to transform topology uncertainty to traffic uncertainty. Routing is optimized for the set of traffic demands on the original topology. Consider the amount of load that is shifted to other links when a failure occurs. If the routing is congestion-free, the rerouted traffic stays below capacity (see the sketch after this list).
  • R3 has two phases. First, offline precomputation which minimizes congestion for original demand plus rerouting virtual demand on the original topology. The protection routing may use routes that later fail. Solve using a linear programming technique.
  • After a link fails, convert the protection routing for that link into a valid routing that doesn’t use any other failed links. After the failure, need to reconfigure the protection routing, which uses the computed rerouting.
  • Offline precomputation and online reconfiguration are sufficient to get congestion-free routing. Whether it is optimal for more than one link failure is an open problem. The reconfiguration is order-independent, which enables distributed reconfiguration.
  • Some extensions: fixed base routing, trade-off between no-failure and failure protection to bound the no-failure performance, trade-off link utilization and end-to-end delay, prioritized traffic protection, realistic failure scenarios (shared risk and maintenance link groups), and traffic variations.
  • Evaluated on two real networks and a synthetic topology. Compared to various rerouting schemes. Added R3 to OSPF and MPLS-ff. Looked for maximum link utilization.
  • For a single failure, R3 achieves near optimal performance. Under multiple failures, it is at least 50% better than other schemes.
  • Implemented for Linux and Linux MPLS. Emulated the Abilene topology on Emulab. 3 physical link failures simulated. Outperforms OSPF+recon by a factor of around 3.
  • Profiled the precomputation time: less than 36 minutes for each topology and less than 17 minutes for non-generated topologies.
  • Storage overhead is < 300KB in the FIB, and < 20MB in the RIB.
  • Q. Have you looked at how to redistribute traffic after a link returns? Have a reconfiguration rule for failure recovery. It will revert back to the last failure scenario, but the ordering may be different (this is provably alright).
  • Q. Have you looked at the overhead of announcements during convergence under churn? No packets will be dropped during this case.
  • Q. How does your algorithm cope with network partition? Studied this in the paper. In this case, we cannot have reachability, so we cannot have congestion-freedom. R3 will ignore the demands that it cannot fulfill.
  • Q. How does your approach compare against oblivious routing schemes (such as Valiant load balance)? These don’t usually handle a large number of failures. Normally big ISPs see larger number of failures than that.
  • Q. How do you evaluate traffic prioritization? Get 10 different priority classes from a US ISP, and show that IP traffic gets sacrificed to protect VPN traffic.
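
A minimal sketch of the congestion-free condition described above: when a link fails, its load is redistributed according to that link's protection routing, and the routing is congestion-free if every remaining link still stays within capacity. The data structures are illustrative, not R3's.

```python
# Hedged sketch of checking congestion-freedom after a single link failure.
def congestion_free_after_failure(load, capacity, failed_link, protection):
    """
    load:       dict link -> traffic currently carried (base routing)
    capacity:   dict link -> link capacity
    protection: dict link -> fraction of the failed link's load it absorbs
                (the protection routing for failed_link, summing to 1)
    """
    shifted = load[failed_link]
    for link, frac in protection.items():
        if link == failed_link:
            continue
        if load[link] + frac * shifted > capacity[link]:
            return False
    return True

# Example: link "AB" fails and its 6 units are split 50/50 over "AC" and "CB".
load = {"AB": 6, "AC": 3, "CB": 4}
cap = {"AB": 10, "AC": 10, "CB": 10}
print(congestion_free_after_failure(load, cap, "AB", {"AC": 0.5, "CB": 0.5}))  # True
```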

Mercury: Detecting the Performance Impact of Network Upgrades

  • Networks are becoming more complex and diverse. Software and hardware are both becoming more complex. This makes things more sensitive to network glitches or other performance issues. Purpose is to see whether a change makes the network better or worse performing.
  • Normal intuition is that an upgrade will make things better, but complex interactions can lead to unintended consequences. So it is important to monitor the impact of upgrades. This is hard due to the scale and diversity of different devices. So the challenge is to efficiently monitor at scale.
  • Mercury does automated data mining to extract trends, scales across a large number of measurements and flexibly across data sources, and is easy to interpret. Challenge is how to know when an upgrade happens, what their effect on performance is, and to find common factors in who is affected (or it is network-wide).
  • Could drive upgrade detection from the change management system, but since human information is unreliable, instead mine the configuration and workflow logs. Things like OS version and firmware upgrades are easy to track. However, lots of small configuration changes are not related to upgrades (such as customer provisioning). Out-of-the-ordinary changes are ones that are applied to multiple locations in the network, but rarely.
  • Divide event series (SNMP etc.) into equal time-bins to get a time series. Behavior change detection is based on a persistent shift in levels. Recursive rank-based cumulative sum is used on means, medians, standard deviations or distributions (see the sketch after this list).
  • Identifying commonality (of attributes, configurations, etc.) is a machine learning problem (search in a multi-dimensional space). Use the RIPPER rule learner for this.
  • Sometimes aggregation will erroneously amplify rare events. Solution is to time-align each upgrade to each device (as if the upgrade happened at the same time).
  • Evaluated using close interaction with network operators. Used data sets about router configurations and workflow logs, and performance event series: SNMP and syslogs. Collected this data from a tier-1 ISP over 6 months. 988 routers in the study. Categories of router: core, aggregate, access, route reflector and hub.
  • Upgrade detection evaluated for false positives and false negatives. Threshold varied (frequency of change). Tends to see more false positives than false negatives, but these can be filtered.
  • Mercury reduces the number of upgrade-induced change points that the operator must look at by several orders of magnitude, compared to number of syslog entries. It confirmed the earlier operator findings and showed some unknown to the operator.
  • OS upgrades could cause CPU utilization to go down on access routers, but increases in memory utilization on aggregate routers (larger OS image). Varying changes in the number of layer-1 link flaps. More protection switching events.
  • Firmware upgrades could cause lower CPU utilization on the central and customer-facing routers.
  • Protection switching is line-card protection in customer-facing routers. Failover for the access router that customers connect to. Saw a small increase in the frequency of automated PS events. Time alignment was able to show this problem.
  • Q. Have you thought about the inverse problem where your triggers are the alarms of an anomaly detector, and you want to find the root causes? Problem with that is false alarms. With better anomaly detectors, this might become feasible.
  • Q. What is the time horizon of the attribute changes that you consider? We do persistent change detection, so look at daily averages over a history of about 6 months. We are now looking at whether transient things do matter (for the purpose of meeting SLAs, etc.).
  • Q. Do you monitor link capacity in your system? Currently only look at aggregate router statistics, not particular links/interfaces. We are starting to look into that.
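
To illustrate the level-shift detection, here is a plain mean-based CUSUM over a time-binned series; Mercury's detector is a recursive rank-based CUSUM, so treat this only as the shape of the test, not the actual algorithm.

```python
# Hedged sketch of detecting a persistent level shift with a CUSUM statistic.
def cusum_change_point(series, threshold):
    """Return the index of a persistent level shift, or None."""
    mean = sum(series) / len(series)
    s, s_min, s_max = 0.0, 0.0, 0.0
    best_idx = None
    for i, x in enumerate(series):
        s += x - mean                  # cumulative deviation from the overall mean
        if s > s_max:
            s_max = s
        if s < s_min:
            s_min, best_idx = s, i
    if s_max - s_min > threshold:
        return best_idx
    return None

# e.g. daily CPU utilisation before and after an upgrade (made-up numbers)
before, after = [50, 52, 49, 51, 50], [60, 61, 59, 62, 60]
print(cusum_change_point(before + after, threshold=20))   # index just before the shift
```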

California Fault Lines: Understanding the Causes and Impact of Network Failures

  • Most network failures are not catastrophic. But it’s difficult to collect comprehensive failure data. Lightweight techniques are limited, and special-purpose monitoring is expensive.
  • Contributions: a methodology to reconstruct the failure history of a network using only commonly-available data. Basically a time series of layer-3 failure events. Preferably annotated with the cause and impact of the failure. Data source for this is the syslog and the router configuration files in a version control system.
  • But this data is not intended for failure reconstruction. First rebuild the topology from the configuration file, then replay syslog messages (see the sketch after this list). We also have semi-structured data from the maintenance logs.
  • Looked at CENIC network with 200 routers and 5 years of data (California academic network).
  • Limitations: syslog is sent using UDP, which leads to message loss. We might see a series of log messages containing a DOWN followed by another DOWN, so just ignore messages until the sequence gets back on track. There is also selection bias in the operational announcements.
  • Comprehensiveness: treat the operational announcements as ground truth and see how many of them have corresponding syslog messages. 97% of announcements were confirmed by the syslog.
  • Accuracy: using Skitter project which does frequent traceroutes to confirm that no packets went over down routers.
  • Validated down states using RouteViews (recorded BGP traffic) to track failure events.
  • 60% of failures last less than a minute, which inhibits detection or recovery. Turns out mostly to be flap events.
  • 7000 emails led to 3000 events. 28% of events are failures and 18% of observed failures are explained.
  • Failure causes: hardware, power, external, software, other and configuration. Hardware is the biggest cause of notices, but software is the biggest cause of failures (32% of failures). But almost 80% of software failures were due to scheduled changes.
  • Q. How are those failures distributed on the network? More at the backbone or on the edge? More downtime on the customer links and the high performance links than on the backbone.
  • Q. How does what you show reveal more about the impact than simply tracking the control plane? It's hard to know what the actual impact is, since we don't collect that information. What other sources of information do we need on top of routing information? If we understood link utilization then we could see how links were being strained by these events.
  • Q. Are you saying that software upgrades are a dominant cause of failures? Not dominant, but serious. The UP/DOWN messages are a side-effect of the maintenance activity? Might be interesting to look at this.
  • Q. Do you see many concurrent failures? More details about this in the paper.
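
A minimal sketch of the syslog replay step: pair each link DOWN with the next UP for the same link to reconstruct failure intervals. Syslog parsing, UDP message-loss handling and topology reconstruction from the router configurations are omitted.

```python
# Hedged sketch of replaying link state messages into failure intervals.
def failure_intervals(events):
    """events: list of (timestamp, link, state) with state in {"DOWN", "UP"},
    assumed time-ordered. Returns a list of (link, down_time, up_time)."""
    open_failures, intervals = {}, []
    for ts, link, state in events:
        if state == "DOWN" and link not in open_failures:
            open_failures[link] = ts
        elif state == "UP" and link in open_failures:
            intervals.append((link, open_failures.pop(link), ts))
        # DOWN-after-DOWN or UP-without-DOWN: likely lost messages; skip until
        # the sequence gets back on track, as described above.
    return intervals

log = [(0, "lax-sfo", "DOWN"), (40, "lax-sfo", "UP"),
       (100, "sac-svl", "DOWN"), (130, "sac-svl", "UP")]
print(failure_intervals(log))
```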

Novel Technologies for Data Center Networks

c-Through: Part-Time Optics in Data Centers

  • Comparing optical circuit switching to electrical packet switching. Circuit switching vs. store and forward. Optical can do 320×100G, vs. 16×40G for electrical. But the optical switching time is about 10ms, compared to packet granularity.
  • Despite slow switching time, optical circuit switching is still promising. Full bisection bandwidth at packet granularity may not be necessary.
  • Looked at a hybrid packet/circuit switched architecture. PS for low latency and optical-CS for high capacity transfer. Optical paths are provisioned rack-to-rack.
  • Control plane needs to estimate traffic demand and configure optical circuit based on it. Data plane does traffic demuxing and optimizes circuit utilization (maybe).
  • c-Through is a specific design for this. A centralized controller manages circuit configuration. Applications and switches are not modified, and end hosts are leveraged for traffic management.
  • Enlarge socket buffers for applications to identify which flows are heavy and which are lightweight. This generates a per-rack demand vector. Applications are unmodified, and packets are buffered per-flow to avoid head of line blocking. This estimates traffic demand and pre-batches data to improve optical circuit utilization.
  • Traffic demand vectors aggregated into a traffic matrix. Use Edmonds’ algorithm to compute the optimal configuration (a maximum-weight matching problem; see the sketch after this list). Then servers are notified. The control traffic overhead could be reduced.
  • Electrical and optical networks isolated using VLANs.
  • Traffic control on hosts makes end-hosts tag packets for the two VLANs accordingly.
  • Testbed with 16 servers, a hybrid network on a 48-port Ethernet switch. Optical switch is emulated using 4G links, whereas electrical network uses 100Mbps links. Optical circuit emulation: optical paths are only available when hosts are notified. There is a 10ms reconfiguration delay.
  • Evaluated TCP performance using dynamic bandwidth, overhead of traffic control and buffering effects. Also application performance (VM migration, MapReduce, MPI-FFT).
  • TCP exploits the dynamic bandwidth quickly. Throughput ramps up within 10ms. Throughput stabilizes within 100ms.
  • MapReduce performance. Since shuffling is independently transferred, it is amenable to batching. Sorted 10GB of random data, which took 800 seconds on an electrical network. With full bisection bandwidth, the performance is 135 seconds. As c-Through varies the buffer size limit, the best performance is 153 seconds, for 100MB buffers, which is close to ideal. As the reconfiguration interval is varied, can do it as infrequently as every 3 seconds, and the performance is 168s.
  • Ran Yahoo Gridmix benchmark, which contains 3 runs of 100 mixed jobs, such as web query, web scan and sorting. Uses 200GB of uncompressed data, and 50GB of compressed data. c-Through comes very close to the full bisection bandwidth network.
  • Q. Surprised by the claim that TCP works fine in this case, considering the multipath issues: would new protocols not be more appropriate? With this technique, we didn’t see many things blow up.
  • Q. Do you think it could work if the fibre is cut, and how will it affect the network? Current system doesn’t take this into account, but since there is dynamic monitoring, we could detect that and handle it.
  • Q. Won’t you have to reconfigure faster to catch short, bursty flows, and then isn’t there a risk of oscillations? Didn’t see that in our experiments.
  • Q. What is the cost of these optical technologies, and are they practical today? Expensive fixed cost, but the per-port marginal cost is not so high, which makes it competitive. A mature technology that is already on the market.
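
A small sketch of the circuit-configuration step: racks are nodes, aggregated rack-to-rack demand is the edge weight, and a maximum-weight matching (Edmonds' algorithm, via networkx here) chooses which rack pairs receive optical circuits. The demand numbers are made up.

```python
# Hedged sketch of choosing optical circuits via maximum-weight matching.
import networkx as nx

demand = {("rack1", "rack2"): 900, ("rack1", "rack3"): 120,
          ("rack2", "rack4"): 300, ("rack3", "rack4"): 750}

G = nx.Graph()
for (a, b), mb in demand.items():
    G.add_edge(a, b, weight=mb)

# Edmonds' blossom algorithm for maximum-weight matching.
matching = nx.max_weight_matching(G, maxcardinality=True)
print(matching)   # e.g. {('rack1', 'rack2'), ('rack3', 'rack4')}
```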

Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers

  • Talk is about combining electrical packet switches (ePSs) and optical circuit switches (oCSs) in a data center network. Both cost $500/port. But ePS is limited to about 1G or 10G (maybe 40G or 100G in future), and oCS is rate-free. ePS uses 12W/port and oCS uses 240mW/port. Finally the oCS doesn’t require a transceiver, which costs another watt per port. But the downside of oCS is the 12ms switching time. ePS suited to bursty, uniform traffic, whereas oCS suitable for stable, pair-wise traffic.
  • Switching delay is due to mirrors on motors which it is necessary to reposition to switch the circuit. This simply gives a full crossbar circuit switch which does not decode packets and needs an external scheduler.
  • Wavelength division multiplexing uses one wavelength for a channel. WDM mux and demux are used on the electrical packet switch transceivers.
  • Need stability to be increased, using aggregation. Processes are more likely to communicate than threads; racks are more likely to communicate than servers; data centers are more likely to communicate than pods. Sweet spot is modular data centers.
  • With 64 pods, with 1024 hosts per pod, with a 10% electrical network (10:1 oversubscribed), need $6.3M, 96.5kW and 6656 cables. With a 100% electrical example, it would cost $62.2M, use 950kW and need 65,536 cables. Helios costs $22.1M, uses 157.2kW and needs 14016 cables.
  • Optical switch has a simple software agent, and the intelligence is in the centralized topology manager. Control loop estimates the traffic demand (hard to do), computes the optimal topology for maximum throughput, and then configure the pod and circuit switches.
  • Estimate: will this flow use more bandwidth if we give it more capacity (is it an elephant flow or a mouse flow)? However, the results are biased by the current topology. So use the Hedera algorithm (NSDI 2010), which assumes all hosts are connected to an ideal crossbar switch, then compute the max-min fair bandwidth fixpoint (see the sketch after this list).
  • The optimal topology is computed as a max-weight perfect matching on a bipartite graph, using Edmonds’ algorithm.
  • Testbed used two networks: a traditional one and a Helios network. 100% bisection bandwidth is 240Gb/s. Used 26 servers, and various switches including an optical circuit switch.
  • Ran Hadoop on this network, but didn’t get good numbers because the network was massively overprovisioned.
  • Got 190Gb/s peak and 171Gb/s on average on the traditional network, with drops due to hash collisions. The 50Gb/s difference from the full bisection bandwidth is the TCP overhead.
  • Helios got 160Gb/s peak and 43Gb/s average, due to some quirks of the packet-switched routers, such as port debouncing (which prevents false positives on ports coming up) leading to poor performance on reconfiguration. Turning that off got the average up to 87Gb/s. Turning off EDC got a 142Gb/s average. The remaining overhead is a limitation in the software. Still have 27ms gaps, due to some switching delay.
  • Helios used unidirectional circuits but there are bidirectional circuits as well. Unidirectional doesn’t waste bandwidth on the return path, which leads to a daisy chain topology.
  • First paper to demonstrate WDM in a hybrid electrical/optical network.
  • Q. Have you thought about how the traffic demand estimation technique would work at lower levels (down to within a pod, a rack, a server, a process)? The Hedera demand estimator works on the level of TCP-flows, so we could do that. Would the bias you get become stronger? [Taken offline.]
  • Q. The number of electrical and optical switches you provision is an a priori design decisions, so how would you address changing traffic patterns? The way around that is to build a hybrid electrical/optical switch.
  • Q. Have you thought about application-limited flows, where there is a bottleneck in the application that stops it using the additional bandwidth? Sensitive to the elephant flow classification. The whole pipeline depends on a good classification. Wouldn’t it be better to use OS modification (per c-Through)? Prefer not to modify the host.
  • Q. What would happen, if you didn’t have such short RTTs (such as in an aggregation network), to the end-to-end flows without buffering? It’s not clear that this would do so well (unmodified) between data centers, but the switching technology is well-suited.
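
A hedged sketch of the demand-estimation idea: assume every host sits on an ideal crossbar with unit NIC capacity and compute a max-min fair rate for each flow by progressive filling over source and destination constraints. This is a generic water-filling loop, not the exact Hedera estimator.

```python
# Hedged sketch of max-min fair flow rates on an ideal crossbar (water-filling).
def max_min_fair(flows, capacity=1.0, iters=50):
    """flows: list of (src, dst). Returns {flow_index: rate}."""
    rate = {i: 0.0 for i in range(len(flows))}
    frozen = set()
    for _ in range(iters):
        # remaining capacity per host, divided evenly among its unfrozen flows
        incr = {}
        for host in {h for f in flows for h in f}:
            active = [i for i, f in enumerate(flows) if host in f and i not in frozen]
            used = sum(rate[i] for i, f in enumerate(flows) if host in f)
            if active:
                incr[host] = (capacity - used) / len(active)
        progressed = False
        for i, (s, d) in enumerate(flows):
            if i in frozen:
                continue
            step = min(incr.get(s, 0.0), incr.get(d, 0.0))
            if step <= 1e-9:
                frozen.add(i)          # bottlenecked at src or dst; freeze it
            else:
                rate[i] += step
                progressed = True
        if not progressed:
            break
    return rate

print(max_min_fair([("A", "B"), ("A", "C"), ("D", "B")]))  # each flow gets ~0.5
```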

Scalable Flow-Based Networking with DIFANE

  • A scalable way to apply fine-grained policies in enterprises.
  • Want to support flexible policies, such as access control rules, customized routing (e.g. Skype calls on a low-latency path) and measurement (e.g. detailed HTTP traffic statistics).
  • Flow-based switches store their rules in a high-speed TCAM, and perform simple actions based on those rules. The flow space has at least five dimensions. Want to specify these in a high-level management system and enforce low-level rules in the TCAM. Want to support large numbers of hosts, switches and policies with limited TCAM space.
  • If you pre-install the rules in the switches, this is simple, but it doesn’t support host mobility and switches don’t have enough memory for all rules.
  • Alternatively (per Ethane, NOX), install the rules on demand, buffering the first packet while the rules are looked up in the controller. This delays the first packet, adds switch complexity, and risks DoS from an attacker sending many different packet headers.
  • DIFANE supports host mobility, reduces memory usage and keeps all packets in the data plane.
  • Stage 1: controller proactively generates rules and sends them to some authority switches. The flow space is partitioned between the authority switches.
  • Stage 2: authority switches keep packets in the data plane. When a packet misses at an ingress switch, it is routed to the authority switch, which forwards it and sends feedback with the rules for the ingress switch to cache. Subsequent packets hit the cache and are forwarded directly. There is no longer a race between updating the cache and forwarding subsequent packets.
  • A small set of coarse-grained wildcard rules is used to give the partition function for authority switches. Not a DHT, since wildcards are used in the rules.
  • A switch’s TCAM has cached rules, authority rules (if the switch is an authority switch) and partition rules (to route to an authority switch). Prefer cached rules and authority rules over partition rules (see the sketch after these notes).
  • Switch prototype built with an OpenFlow switch.
  • Tricky to cache rules when wildcard rules may overlap (with different priorities). Therefore have to generate new rules based on contiguous subregions. Partition based on minimizing the TCAM entries in switches. Use a decision-tree-based rule partition algorithm to decide where to place the splits in the flow space.
  • Need to handle policy changes at the controller, topology changes at the switches and host mobility.
  • Evaluated prototype by implementing DIFANE in a kernel-level Click-based OpenFlow switch. Traffic generator, switches and controller run on separate 3GHz Xeons.
  • NOX sees a 10ms RTT delay for the first packet, but DIFANE sees a 0.4ms delay.
  • DIFANE can easily be implemented in hardware, whereas NOX requires more software intervention.
  • For peak throughput (one authority switch, single-packet flow), NOX hits an ingress switch bottleneck at 20Kflows/sec with one ingress switch, and then reaches a controller bottleneck with more ingress switches.
  • How many authority switches? Depends on number of rules. Campus network has 30K rules, which is assumed to be 160KB of TCAM memory. This leads to about 3 authority switches. An IPTV network with 5M rules requires 1.6MB of TCAM and would require 100 authority switches.
  • Tension between distributed (switch-based) and centralized (controller-based, easier to manage) operation. DIFANE is a point in between these extremes.
  • Q. How realistic are your assumed TCAM sizes? Already have 160 KB TCAMs, so we would just use more switches.
  • Q. If you have a slow path you can scale much better, so why do you want to keep everything on the fast path? [Taken offline.]
  • Q. Did you experiment with cache replacement policies? Much work done on how to cache rules, so we can just leverage that.
  • Q. What about the importance of dynamic rules that might change frequently, and how can DIFANE handle it? Think that only traffic engineering needs such dynamic rules. DIFANE can get the controller involved to manage these. But the performance gain is not much over OpenFlow in that scenario. Isn’t a benefit of OpenFlow that you can implement e.g. authentication at the application level? Yes, but we can get the controller to push this into the rules.
  • Q. Is there a cost to have all switches be authority switches? Depends on the network and how it is used. Why not make every switch an authority switch? May need more redirection, and hence more stretch. Also the rules will become smaller.
  • Q. Does this make internet traffic more unpredictable? A reasonable comment, but since we know the location of the authority switch, we know the paths that the traffic may take.
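
To make the rule-preference ordering above concrete, here is a toy model of a DIFANE switch’s lookup: cached and authority rules are consulted before partition rules, which only redirect packets to the authority switch owning that slice of flow space. The rule format and field names are invented for this sketch.

```python
def matches(rule, pkt):
    """A rule matches if every non-wildcard field agrees with the packet."""
    return all(v == '*' or pkt.get(f) == v for f, v in rule['match'].items())

def lookup(pkt, cached, authority, partition):
    for rules in (cached, authority, partition):          # preference order
        for rule in sorted(rules, key=lambda r: -r['priority']):
            if matches(rule, pkt):
                return rule['action']
    return 'drop'

# With no cached or authority hit, the partition rule redirects the packet
# (still in the data plane) to the authority switch for this flow-space slice.
partition = [{'priority': 1, 'match': {'dst_ip': '*'},
              'action': 'redirect:authority-switch-2'}]
print(lookup({'src_ip': '10.0.0.1', 'dst_ip': '10.0.0.7'}, [], [], partition))
```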

Social Networks

An analysis of Social Network-based Sybil defenses

  • Many online services allow attackers to create accounts for free and they can hence manipulate the system.
  • Defense approaches: trusted certification (such as SSN or passport number), or resource challenges (e.g. cryptopuzzles, not hard to solve if you can get cloud resources on demand). Or can use links in the social network to detect Sybils, since we presume that attackers can only create a limited number of links to non-Sybil users. Spawned a lot of research.
  • Unanswered questions: since the schemes use different mechanisms, it is unclear how they are related, or whether there is a common insight across them. Answering this would help us understand the limitations of the defenses.
  • Talk proposes a new methodology for comparing these systems and finds that they all work in a similar manner. It implies that they have a hidden dependence on the network structure, which identifies the limitations of the schemes.
  • The interesting fact is how these schemes identify nodes as Sybils.
  • Schemes take a social network and a single trusted node, and declare Sybils from the perspective of the trusted node. Internally, each node has a Sybil probability, which gives a ranking of nodes by Sybilness. Can this ranking be used to compare schemes?
  • Compared rankings from each scheme on the same social graph. The rankings are jumbled between the different schemes, but all schemes seemed to have a cut-off point: the partitions on either side of it were roughly the same across schemes, though unordered within.
  • The cut-off point comes at the boundary of the local community. So all schemes are effectively detecting communities. Nodes in the local community are ranked higher, but the rankings within and outwith the community are in no particular order. Can we then leverage the work on community detection to design new approaches?
  • Bad news: this depends on the graph having monolithic community structure, and the characteristics of the community around the trusted node.
  • Does this make certain network structures more vulnerable? Does having more communities make it harder to identify Sybils? Evaluated this on various real-world social networks. Simulated a Sybil attack by consistently adding Sybils (5% attack links and 25% Sybil nodes). Accuracy measured using the ranking, i.e. the probability that Sybils will be ranked lower than non-Sybils (see the sketch after these notes). Compared the amount of community structure (modularity) to the accuracy. Modularity seems to be negatively correlated with accuracy.
  • How can the attacker use this intuition? Can he do better than just choosing random links? For example, by placing links closer to the trusted node, the attacker could blend in to the community of the trusted node. The experiment ranks the nodes and gives the attacker the ability to place links randomly among the top N nodes. Smaller N implies an attacker with more control. The graph shows that an attacker with more control reduces the accuracy of the algorithms.
  • Moving forward: could be useful for whitelisting nodes, and could potentially incorporate information from more layers to make the decision about who is a Sybil.
  • Q. Have you evaluated where the number of Sybil nodes far exceeds the number of attack links? The results hold in those settings as well.
  • Q. Attacks are launched from compromised and fake accounts, so how do you deal with this? This violates the basic assumption that the attacker has few real links, so none of these schemes will work.
  • Q. What if the Sybils form multiple communities? No matter the Sybil topology, as long as the number of attack links is small, none of these schemes will work.
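
The accuracy metric used in the evaluation (the probability that a Sybil is ranked lower than a non-Sybil, effectively an AUC over the scheme’s node ranking) is straightforward to compute. A minimal sampling-based sketch, with invented names:

```python
import random

def ranking_accuracy(ranking, sybils, trials=100_000):
    """ranking: node ids ordered from most to least trusted; sybils: set of ids."""
    pos = {node: i for i, node in enumerate(ranking)}
    honest = [n for n in ranking if n not in sybils]
    bad = [n for n in ranking if n in sybils]
    hits = sum(pos[random.choice(honest)] < pos[random.choice(bad)]
               for _ in range(trials))
    return hits / trials        # 1.0 = perfect separation, 0.5 = random ranking
```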

The Little Engine(s) that could: Scaling Online Social Networks.

  • Systems should be scalable, but it can be hard to implement and is not necessary at the start of an online service. Of course, this can lead to a success disaster. The cloud gives hardware scalability, but no automatic application scalability.
  • Frontend, stateless components are easy to make transparently scalable, but the data source is a bottleneck.
  • Obvious solution is full replication of the DB, but the state doesn’t decrease with the number of servers. However it maintains data locality.
  • Next most likely solution is horizontal partitioning/sharding, but the splits are disjoint, which is bad news for OSNs. The shards cannot be disjoint, because OSNs involve queries across social links, or data dissemination across social links. Presumably want to colocate all of your friends on the same server.
  • Relational databases don’t perform well under horizontal partitioning, and are expensive, so people use DHTs. These perform better but there is no SQL, less abstraction, and they suffer under high traffic (incest, multi-get hole, jitter). Also a DHT gives random partitioning, which means many servers will be hit with a particular update, and there is a high replication overhead.
  • Can leverage underlying social structure to make the partition. The SPAR (Social Partitioning And Replication) algorithm does this.
  • Algorithm has to be online (due to system and social network dynamics), fast and simple (using local information, a hill-climbing heuristic and back-pressure load balancing), stable (no cascades) and effective (it approximates an NP-hard problem: minimize replicas while maintaining a level of redundancy). A simplified sketch follows these notes.
  • Evaluated on real OSN data (Twitter/Orkut/Facebook). Looked at various algorithms, including random partitioning, MO and METIS.
  • SPAR has a lower replication overhead than the other algorithms, with only 22% overhead over the replication constraint.
  • Three-tier application: front end and application logic are stateless on top, with SPAR middleware in the application logic and the data store (to intercept messages). The SPAR controller, partition manager and directory service coordinate the whole system. To applications, SPAR is totally transparent, implemented on top of MySQL and Cassandra, but could be implemented using other things.
  • Evaluated using a non-distributed Twitter clone (StatusNet) and real Twitter data, and saw if it could scale up across 16 commodity desktop machines. Within the 99th-percentile latency target, MySQL with full replication handled 16 requests per second, whereas SPAR+MySQL handled 2,500 requests per second. Vanilla Cassandra did 200 req/s, whereas SPAR+Cassandra did 800 req/s.
  • Q. Can you replicate e.g. Facebook pictures based on the groups of friends? The rule is applied when processing the query itself, though some redundant data would be stored.
  • Q. Have you looked at incorporating more dynamic interaction behaviors in the partitioning algorithms? We have considered adding weights.
  • Q. Any thoughts on Diaspora? Only know what I read in the news and that it’s fully distributed, so don’t think there will be such a thing as a global data store.
  • Q. []? The more clustered you are, the less replication you will need. The results are consistent for large data sets.
  • Q. Would the replication overhead for Orkut not be higher? 12 or 16.
  • Q. Where is the notion of load per server? Would this not allocate servers that have absolutely no work to do? Details in paper.
  • Q. Are there not better designs than a read fan-out? Arguably.
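
A heavily simplified sketch of SPAR’s hill-climbing step: each user has a master server, every friend must be readable locally (forcing replicas), and when an edge is added we greedily pick whichever of “leave as-is”, “move u”, or “move v” forces the fewest replicas. This ignores load balancing, the k-redundancy constraint and the back-pressure mechanism from the paper; all names and structures are mine.

```python
from collections import defaultdict

class SparSketch:
    def __init__(self, n_servers):
        self.n = n_servers
        self.master = {}                       # user -> server holding the master
        self.friends = defaultdict(set)

    def add_user(self, user):
        # Place the master on the currently least-loaded server.
        load = lambda s: sum(1 for m in self.master.values() if m == s)
        self.master[user] = min(range(self.n), key=load)

    def _replicas(self, u, su, v, sv):
        """Replicas forced by u and v under a hypothetical master placement."""
        place = dict(self.master)
        place[u], place[v] = su, sv
        return (sum(1 for f in self.friends[u] if place.get(f) != su) +
                sum(1 for f in self.friends[v] if place.get(f) != sv))

    def add_edge(self, u, v):
        self.friends[u].add(v)
        self.friends[v].add(u)
        su, sv = self.master[u], self.master[v]
        candidates = [(su, sv), (sv, sv), (su, su)]   # as-is, move u, move v
        self.master[u], self.master[v] = min(
            candidates, key=lambda c: self._replicas(u, c[0], v, c[1]))
```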

Crowdsourcing Service-Level Network Event Detection

  • Want to identify problems that affect end-to-end performance. Do this in an online way with reliable detection and isolation.
  • Idea is to do monitoring at the edge systems and detect drops in performance.
  • Need a system that is scalable, and has localization in time and space. Scalability from passive monitoring, and fully distributed detection. Also privacy, reliability from uncontrolled hosts and wide adoption (incentive to install).
  • Approach: passively monitor local performance information (signals), and detect drops in performance. Then attempt to get group corroboration from other hosts. A likelihood ratio distinguishes network effects from coincidence. Store the data distributedly, and give the operator a tap to get that data out.
  • Evaluated the approach using peer-to-peer applications (a natural fit). This gets us an edge trace. The dataset is from a plugin called Ono, which has been installed by 1 million BitTorrent users worldwide.
  • Case study on the BT Yahoo! network, which has information about confirmed events on its web interface. Gives the dates and times of the issue arising and having been fixed.
  • BitTorrent peers monitor many performance signals, both general and protocol specific (like Torrent availability). The individual signals are noisy, having uncontrolled duration and having a wide range of values. Use some moving-average smoothing to make this easier to interpret.
  • For group corroboration, ask why multiple local detections might occur at the same time. Could be service-specific problems (e.g. lack of a seeder), coincidence (noisy local detection), or a genuine network problem. The probability of coincidence becomes very small with a large number of users (see the sketch after these notes). Can tune a likelihood-ratio knob to make this more or less sensitive.
  • Evaluated in the wide-area. Don’t know the false positive or false negative rates, because ISPs wouldn’t provide information about when their network went down. Therefore use public information from BT Yahoo!, and do some work under NDA.
  • In one month of BT Yahoo! data, detected 181 events, of which 54 occurred during confirmed events. There were 14 other reported problems. The remaining detections are not necessarily false positives.
  • Worked with a North American ISP under NDA. Detected 50% of events in regions with fewer than 10k subscribers.
  • Evaluated sensitivity to the likelihood ratio, detected problems 2% of the time for small moving average deviations and 0.75% of the time for larger deviations.
  • Deployed as the Network Early Warning System in 48k BitTorrent clients.
  • Q. This seems like the right approach since performance monitoring should be done at the application layer.
  • Q. Do you think that IP prefix or geolocation information would be useful for grouping? Depends on whether groupings are useful to help with the problem. Using IP prefix already.
  • Q. How are your techniques different from the earlier talks on anomaly detection? This is at the user, so the information that comes back is more useful. Why are you using moving averages compared to something more sophisticated? Wanted to implement it simply and get it incorporated in a BitTorrent client. Many schemes assume a long continuous stream of data.
  • Q. Once you have detected the events, what do you do with them? The idea is for operators to go and fetch this information. But there is a root cause analysis problem here, which is important future work in this area.
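
The corroboration step can be sketched as a simple test of how unlikely the observed number of simultaneous local detections would be if the peers were independent. The real system’s likelihood ratio and its tuning are more involved; the background rates and threshold below are invented.

```python
def prob_at_least_k(rates, k):
    """Poisson-binomial tail: P(at least k independent peers fire in one window)."""
    dp = [1.0] + [0.0] * len(rates)        # dp[j] = P(exactly j peers fire)
    for p in rates:
        for j in range(len(dp) - 1, 0, -1):
            dp[j] = dp[j] * (1 - p) + dp[j - 1] * p
        dp[0] *= (1 - p)
    return sum(dp[k:])

def is_network_event(rates, k_observed, threshold=100.0):
    """Declare an event if coincidence is at least `threshold` times too unlikely."""
    coincidence = prob_at_least_k(rates, k_observed)
    return coincidence > 0 and 1.0 / coincidence >= threshold

# e.g. 3 of 8 peers, each with a ~2% background detection rate, fire together.
print(is_network_event([0.02] * 8, 3))     # True: very unlikely to be coincidence
```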

SIGCOMM 2010: Day 2

September 1st, 2010

Privacy

Privacy-Preserving P2P Data Sharing with OneSwarm

  • Three types of data: private, public (non-sensitive) and public (without attribution). This talk is about the last one: want to download and share data without people knowing what you’re downloading or sharing.
  • P2P is good for privacy because there is no centralized control or trust, you retain rights to your data (instead of giving them to a third party) and no centralized third party knows everything you’re doing. But in P2P: anyone can monitor your behavior!
  • Previous solutions: Tor plus BitTorrent, but this needs a fraction of the clients to be public and gives poor performance. Or Freenet, which has poor bulk data performance and requires users to store others’ data. The median time to download a 1MB file with BitTorrent is 94s, but BT+Tor is 589s and Freenet is 1271s.
  • Implemented a OneSwarm client and released in March 2009… now hundreds of thousands of users. Based on social networks: share keys with your friends.
  • Searches are flooded through the network, and set up a data path along the successful route, which does not reveal who is the ultimate sender or provider.
  • Threat model: the attacker has a limited number of overlay nodes and can do anything on nodes he controls, including traffic injection/sniffing/correlation.
  • To support mobile peers, use a DHT to publish IP and port, however this is published, encrypted and signed separately for each peer. This makes it possible to remove a peer.
  • Sparse social networks are a problem: with only one friend, you have poor reliability, poor performance and privacy problems. This is bad for early adopters. Early adopters used a forum to share their public keys. Solution was to add a community server, as a source of untrusted peers.
  • With untrusted peers, delay responses to foil timing attacks, probabilistically forward queries and used deterministic random behavior to limit information leakage from repeated queries. Can trust/not trust peers on a per-object basis.
  • Want to have search without revealing the source and destination. The approach is based on flooding with delay, where searches are only forwarded using spare capacity, and delayed at each hop. Cancel messages move faster through the network. However, this gives no guarantee that all data can be found by all users at all times.
  • Search types: 160-bit content hash and text-based. For response delay, use a random reply delay seeded by the hash (for hash-based searches); see the sketch after these notes. This is harder for text-based search, so delay based on the hash of the matching content.
  • Multipath connections are supported to avoid weak points in the network. Default is to cancel after getting 20 responses (not just one). Then the forwarding load is distributed.
  • Data transfer uses a modified version of BitTorrent, which handles multiple sources and swarming downloads naturally.
  • Timing attack is possible by monitoring the delay between search and the response, and inferring how many users could have sent the reply within the recorded time.
  • Evaluated potential of the timing attack using a synthetic OneSwarm overlay, based on 1.7 million last.fm users. Attackers use a public community server, and users with 26 or fewer friends take some untrusted friends as well. The design eliminates the attacker’s ability to pinpoint particular users.
  • Also evaluated performance using 120 PlanetLab nodes transferring a 20MB file. Median download time for OneSwarm is 173s (compared to 94s for BitTorrent, but much less than BT+Tor and Freenet). At the 70th percentile, OneSwarm and BT are similar (97s vs 190s), but BT+Tor and Freenet are much worse.
  • Multipath transfers increase average transfer rate from 29KB/s to 457KB/s.
  • Q. Would it be fair to classify this as Gnutella where the network is restricted to the social graph and searches are less flexible? Similar, but the key change is that the data flows over the path in the overlay (not directly), which makes it more privacy-preserving. Does this not give worse scalability than Gnutella, which had problems? Gnutella had problems when people were still using modems, and it is more viable to provide a few KB/s. The current overlay is not oversubscribing.
  • Q. What are you relying on to ensure that you don’t know where the data are coming from? Because you don’t know the topology. If you are next to an attacker, you rely on the delay that you add. Are you seeing a sum of random variables, which will leak more information as the path becomes longer? You could maybe estimate the hop-count but not pin-point nodes.
  • Q. Is a 20MB file too small for TCP to show realistic performance? Used this because we needed to experiment with Tor also, and we didn’t want to stress that network too much. For the Freenet experiment, we used a 5MB file and extrapolated from that because it was hard to get 20MB to download reliably.
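
The deterministic response delay for hash-based searches can be sketched as seeding a PRNG with the content hash plus a per-peer secret, so that repeated queries for the same object always see the same delay and leak nothing new on repetition. The function name, the secret and the delay bound below are assumptions, not OneSwarm’s actual parameters.

```python
import hashlib
import random

def response_delay_ms(content_hash: bytes, peer_secret: bytes,
                      max_delay_ms: int = 150) -> int:
    """Same (content, peer) pair always yields the same delay."""
    seed = int.from_bytes(hashlib.sha1(peer_secret + content_hash).digest(), 'big')
    return random.Random(seed).randint(0, max_delay_ms)

# Re-issuing the same search yields an identical delay, so an attacker learns
# nothing more from repeated queries.
assert response_delay_ms(b'\x01' * 20, b'secret') == response_delay_ms(b'\x01' * 20, b'secret')
```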

Differentially-Private Network Trace Analysis

  • Can you conduct network trace analysis that provides strict, formal, “differential privacy” guarantees?
  • Selected some representative traces and tried to reproduce the results using differential privacy.
  • It was possible to reproduce every analysis attempted, but there is a privacy/accuracy trade-off.
  • Toolkit and analyses are available online from the PINQ web site.
  • Access to realistic data is helpful for networking research, but there is a tension between utility and privacy. The requirements of utility are usually for aggregate statistics, whereas privacy requirements are typically for individual behavior.
  • Other approaches include: trace anonymization (doesn’t always work unless people are excessively conservative), code-to-data (send your analysis to the people who hold the data, but it is hard to know what that code is doing), or secure multi-party computation (similar to code-to-data). The aim here is to start with formal guarantees and see how useful it can be.
  • Differential privacy: the results don’t depend on the presence or absence of an individual record. Doesn’t prevent disclosure, but makes no assumptions about what the attackers can do, and is agnostic to data types.
  • Uses PINQ, which is a programming language that guarantees programs are differentially private.
  • Challenges: getting DP requires introducing noise, so you need to use statistically robust measurements. PINQ requires analyses to be written as high-level, declarative queries, which can require some creativity or reinterpretation. Also (not dealt with): masking a few packets does not mask a person, and the guarantees degrade more as a dataset is reused (policy question of how you mete out access to a dataset).
  • Example is worm fingerprinting. Group packets by payload, filter by the count of source IPs being over a threshold and the count of destination IPs being over another threshold. Can then count the number of worms, approximately. Need to supply an epsilon to the count, which turns it into the differentially-private version (see the sketch after these notes).
  • Built some tools for analysis. For example, implemented three versions of a CDF. In doing this, you need to scale down the accuracy for each subquery in order to not degrade the dataset privacy too much.
  • Showed an example CDF. The differentially private one is not monotonic at the microscopic scale, but it gives a convincing macro-scale result.
  • Can also list frequently occurring strings, using an algorithm based on statistical properties of text, which gradually extends a prefix.
  • Extend worm fingerprinting: actually enumerate the payloads that have significant src/dest counts.
  • Also built more tools and analyses: did packet-level analyses, flow-level analyses and graph-level analyses. Sometimes had to compromise on privacy to get high accuracy (epsilon = 10 for weak privacy).
  • Many open questions. Perhaps the biggest is whether DP guarantees for packets are good enough. Or whether, if writing new analyses, they could be designed with DP in mind.
  • Q. Could extensions to PINQ apply to trace analysis that look for isolated events, such as network intrusions which are relatively rare? Can separate the two tasks: learning a rule or filter that could identify an intrusion (which could use DP), and apply that filter to individual packets (which could not use DP, because you effectively want to violate privacy at this point).
  • Q. Does someone need to hold onto the raw packets? Yes, like the code-to-data setting.
  • Q. In DP, each query may consume epsilon privacy and the provider must set a budget, so how do you set this? And what happens when the budget is exhausted? You could imagine turning off the dataset when the hard threshold is met. But this is really a policy question. Setting the budget is difficult: perhaps you can provide data outputs from a DP query to a large group who can then do useful work with it.
  • Q. Is there a trade-off between DP and the presence of correlations in the data? In a lot of cases, it is possible to restructure the data set to reduce the amount of correlation between individual records (by grouping the strongly correlated records together).
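
The counting step in the worm-fingerprinting example boils down to the Laplace mechanism: a count with sensitivity 1 plus Laplace(1/epsilon) noise is epsilon-differentially private. The pipeline below is only schematic (PINQ expresses it declaratively and also tracks how the privacy budget is spent across sub-queries); the thresholds and epsilon are invented.

```python
from collections import defaultdict
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Counting query of sensitivity 1: add Laplace(1/epsilon) noise."""
    return true_count + np.random.laplace(scale=1.0 / epsilon)

def count_worms(packets, src_thresh=20, dst_thresh=20, epsilon=0.1):
    """packets: iterable of (payload, src_ip, dst_ip) tuples."""
    srcs, dsts = defaultdict(set), defaultdict(set)
    for payload, s, d in packets:
        srcs[payload].add(s)
        dsts[payload].add(d)
    suspicious = sum(1 for p in srcs
                     if noisy_count(len(srcs[p]), epsilon) > src_thresh
                     and noisy_count(len(dsts[p]), epsilon) > dst_thresh)
    return noisy_count(suspicious, epsilon)    # approximate number of worms
```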

Encrypting the Internet

  • 50 million websites exist, but only about 600k of them enable SSL/TLS. Can we change the infrastructure to make all transactions protected and secure?
  • Main drawback is protocol processing speed and cost, due to public key crypto for handshaking and symmetric crypto for the data. 2 million clock cycles for RSA decrypt.
  • Main contribution is a CPU that is capable of encrypting packets at line rates, and getting a 4–12x speedup in AES and a 40% speedup in RSA.
  • Encrypting the internet is not securing it. Don’t deal with certificate/trust management, malicious software or privacy breaches at the end-host.
  • AES is a block cipher, based on the Rijndael algorithm. Can use 128-bit blocks and either 128, 192 or 256-bit keys. AES takes 10, 12 or 14 rounds.
  • AES uses confusion (invert in GF(2^8) followed by an affine map); see the sketch after these notes. Then the bytes are permuted by shifting the rows of the state by varying amounts. Then the columns of the state are mixed by matrix multiplication. Uses many bit-linear operations, which are easy to implement in VLSI. Finally, add the round key using XOR.
  • AES is typically implemented using table lookups, which are costly (approximately 15 cycles per byte). But need to get 1Gb/s. So the aim is to implement them in combinatorial logic, on the processor data path.
  • Added new instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST. Cache attacks are eliminated. Challenge is to implement this in as small a gate area as possible. Mathematical techniques such as composite fields help to achieve this in 100-400 gates. Total number of gates is similar to an adder or multiplier.
  • RSA requires performing a modular exponentiation, which can be implemented using modular multiplication. Implementing a faster multiplication algorithm in assembly achieved a 40% speedup over OpenSSL.
  • Also implemented the first version of combined encryption and authentication for TLS 1.2.
  • AES-NI latency is 24 clocks/round, then 6 clocks, and the throughput is 2 clocks.
  • Overall, can move from 501 SSL sessions/second to 1216 SSL sessions/second using AES-NI in Galois counter mode.
  • Now, one core can saturate a 1G link, and 8 cores can saturate a 10G link.
  • Future work is to improve larger RSA variants and implement the eventual SHA-3 algorithm.
  • Q. When you get that fast, how many good-quality random bits per second you can get? This work doesn’t address that, but all we need is an entropy source per a 2004 paper. Not sure what the product groups are doing in this respect.
  • Q. Is Intel working on speeding up the RSA? The speedup presented in the paper (40%) is good enough to saturate the link.
  • Q. Could you expose the GF operations as primitives themselves? It is implemented in such a way that you can isolate the inversion of GFs or the multiplication. Algorithms in the SHA-3 competition also exploit similar primitives.
  • Q. How general are your optimizations in terms of other block ciphers? You can implement a variety of crypto algorithms using the primitives we have design, including several cryptographic hash functions.
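
The confusion step described above (inversion in GF(2^8) followed by a fixed affine map) is the standard AES S-box construction. A plain-Python illustration of the math that AESENC bakes into hardware:

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a: int) -> int:
    if a == 0:
        return 0                   # AES defines inv(0) = 0
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def sbox(byte: int) -> int:
    """Multiplicative inverse followed by the fixed affine transformation."""
    b, c = gf_inv(byte), 0x63
    out = 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8)) ^
               (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (c >> i)) & 1
        out |= bit << i
    return out

assert sbox(0x00) == 0x63 and sbox(0x01) == 0x7C   # matches the published S-box
```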

Wireless LANs

Enabling Fine-Grained Channel Access in WLAN

  • 802.11n achieves about 45.2 Mbit/s at the application layer, which is much less than the advertised bitrate.
  • Overhead arises from various sources: CSMA, backoff, DIFS and SIFS, and ACK. Simple model of this overhead as the ratio of transmission time to total time for a packet. As the PHY data rate increases, the time for transmitting data (efficiency) becomes small compared to all of these overheads. There is a static delay that cannot be reduced, constraining speedup.
  • Existing MAC is limited by allocating a whole channel to a single user at once. Aggregation is a possible solution, but you require large aggregation (23KB frames) to get 80% efficiency at 300Mbps. And this also increases latency.
  • Basic idea is to divide the channel into small, fine-grained slices. Directly reducing the channel width doesn’t work because of guard-band overhead. The approach then is to use orthogonal overlapping subchannels (OFDM).
  • If nodes are asynchronous, you lose orthogonality (i.e. if you have multiple users). The challenge then is to coordinate transmissions in random-access networks like WLANs. Time-domain backoff is very inefficient in this case.
  • Designed new PHY and MAC architectures: “FICA”.
  • M-RTS/M-CTS/DATA/ACK access sequence.
  • Carrier-sensing and broadcasting can be used to analyze the timing misalignment. A proper cyclic-prefix accommodates the timing misalignment: a long one for M-RTS and a short one for M-CTS/DATA/ACK.
  • For contention resolution, time-domain backoff is inefficient. Solution is to do frequency-domain contention with PHY signalling in the M-RTS/M-CTS symbols.
  • Frequency-domain backoff: reduce the number of subchannels to contend for if there is a collision, and increase it if there is success. This is analogous to congestion-control mechanisms. Two policies: “update to max” and AIMD (both sketched after these notes).
  • Implemented using the Sora software radio platform, based on a SoftWifi implementation.
  • Evaluated performance of the synchronization, the signalling reliability and the decoding performance.
  • Also showed simulation results for the performance gain over 802.11n, and showed an improvement in efficiency for both full aggregation (unrealistic) and a mixture of saturated and delay-sensitive traffic (realistic and with much greater benefits).
  • Q. How do you deal with the case when the number of sources exceeds the number of sub-carriers (in frequency-domain backoff)? Could you combine time and frequency? Yes, we could always do that.
  • Q. Is there a way of using this system with RTS/CTS? The overhead of these is so low (37us for RTS), that it might not be worth doing this.
  • Q. Why is the problem of asynchronous timing different from multipath fading? It can create arbitrarily bad interference with the FFT window that is done for OFDM.
  • Q. What happens if you compare your scheme to classic OFDM in terms of bits/seconds/Hz (considering delay due to synchronization)? There is a sweet point in symbol size that can meet your design goal.
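
The two frequency-domain backoff policies mentioned above fit in a couple of lines: each node adjusts the number of subchannels it will contend for in the next round. The maximum of 16 subchannels below is an invented constant.

```python
def update_to_max(current, success, max_subchannels=16):
    """Jump straight back to the maximum after a success; halve on collision."""
    return max_subchannels if success else max(1, current // 2)

def aimd(current, success, max_subchannels=16):
    """Additive increase on success, multiplicative decrease on collision."""
    return min(max_subchannels, current + 1) if success else max(1, current // 2)
```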

Predictable 802.11 Packet Delivery from Wireless Channel Measurements

  • 802.11 is fast (600Mbps), reliable (usable at vehicular speeds over extended range) and ubiquitous (cheap). But new applications, such as wireless displays or controller input, can stress the network.
  • In theory, performance should be easily measurable and used to guide channel rate selections. But the real-world doesn’t always match the theory. So statistical adaptation is often used, but convergence time becomes a problem (especially as the measurement results change dynamically).
  • Goal is to bridge theory and practice, and accurately predict performance over real channels and devices.
  • Channel metric is the received signal strength indicator (RSSI) which, with noise, gives the SNR for a packet. However, this isn’t perfect, because it can vary by 10dB on a per packet basis. Different subchannels have different SNRs.
  • 802.11n provides a new opportunity: detailed channel measurements, which are used for advanced MIMO techniques. Get a Channel State Information (CSI) matrix for per-antenna paths.
  • Use the Effective SNR (the total useful power in a link) as opposed to the packet SNR (total power in the link).
  • CSI is measured on receive, so for every received frame, we know all antennas and subcarriers used. Then take this and compute SNRs per symbol. Use textbook formulae to calculate per-symbol bit-error rates, and average them to get an effective bit-error rate. Finally convert this back to the effective SNR (see the sketch after these notes).
  • Every rate gets an effective SNR threshold, calculated offline per NIC implementation (not per NIC or per channel). This handles real NICs which may use interesting decoding techniques (hard/soft/maximum likelihood, etc.).
  • Application: what is the fastest configuration for a particular link? Select rate/MIMO/channel width based on the information.
  • Application: which antenna is the best to use to save power?
  • Application: what is the lowest transmit power at which I can support 100 Mbps?
  • Implemented in an Intel Wi-Fi Link 5300 NIC (3×3 MIMO, 450Mbps). Used two testbeds with over 200 widely varying links. Open-source Linux driver and used firmware debug mode to send CSI to the receiving host. Real-time computation took 4us per 3×3 CSI.
  • For predicting optimal 3×3 rate: effective SNR is much closer to the ground truth than packet SNR.
  • To evaluate rate control, used channel simulation on a mobile trace using MATLAB and the SoftRate GNU Radio. Effective SNR gets a better average delivered rate than SampleRate, SoftRate and SampleRate with fixed retry (802.11a algorithms).
  • Effective SNR extends to MIMO. Compared to optimal and an invented algorithm called “previous-OPT”. Effective SNR gets 80% accuracy and 10% overselection.
  • Related work: SoftRate, AccuRate and EEC (from yesterday). All work with 802.11a but don’t extend to more recent techniques.
  • Q. If you had CSI and it’s quick, do you need to do all of these things? The RSSI has a lot of error, and we were able to make this work.
  • Q. Is the debug mode on the NIC publicly available? Yes, I think so.
  • Q. Would a better comparison be to other techniques that use scheduled MACs? Trying to do something that works with what we have.
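
The effective-SNR pipeline (per-subcarrier SNR → per-symbol BER → averaged BER → back to an SNR) can be sketched as below. For simplicity this assumes BPSK, where BER = 0.5·erfc(√SNR); the real system uses the formula for each modulation plus the per-implementation thresholds mentioned in the talk.

```python
import numpy as np
from scipy.special import erfc, erfcinv

def effective_snr_db(subcarrier_snr_db):
    snr = 10 ** (np.asarray(subcarrier_snr_db) / 10.0)
    ber = 0.5 * erfc(np.sqrt(snr))         # per-subcarrier bit error rate (BPSK)
    eff_ber = ber.mean()                   # average over subcarriers
    eff_snr = erfcinv(2 * eff_ber) ** 2    # invert the BER formula
    return 10 * np.log10(eff_snr)

# A channel with one deeply faded subcarrier has a much lower effective SNR
# than its average SNR would suggest.
print(effective_snr_db([25, 24, 26, 3]))
```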

SourceSync: A Distributed Architecture for Sender Diversity

  • Receiver diversity underlies many systems, such as opportunistic routing protocols, and WLAN diversity protocols. In opportunistic routing, let any router that receives the packet forward it. If multiple routers/channels with different loss rates, the loss probability is now the joint probability of losing the packet on all channels.
  • Sender diversity is the converse of receiver diversity. If many senders transmit simultaneously, it is unlikely that they will all be attenuated at the same time. It provides analogous benefits to receiver diversity. For example, connect many APs to wired Ethernet, and let many of them broadcast a packet simultaneously to the client.
  • Challenge: simultaneous transmissions don’t strengthen each other, because they are likely to be out of sync. Need distributed symbol-level synchronization.
  • An 802.11 symbol takes 3.2us. With a synchronization error of 2us, the best you can get is an SNR of 2dB. A 1us synchronization error gives 5dB. But for the maximum bit rate, 802.11 needs an SNR of ~22dB.
  • Implemented the system, SourceSync, for an FPGA. Talking about opportunistic routing, but applies also to WLANs.
  • Natural way to synchronize transmitters is by reception. But since multiple paths have different delays, need to compensate for these differences.
  • Path delay is made up of propagation delay and packet detection delay (typically needing multiple samples to detect a symbol). Then a turnaround time between receipt and transmission.
  • Packet detection delay is caused by receivers detecting packets using correlation. Random noise can cause a receiver not to detect a packet on the first sample. Since routers see different random noise, they may take different numbers of samples. Routers can estimate this delay based on the phase shift in various carriers.
  • Hardware turnaround time is hardware dependent, caused by different hardware pipelines and radio frontends. Routers locally calibrate this using their clocks.
  • Propagation delay is measured by probe-response between node pairs. A knows its packet detection delay, B knows its packet detection delay, and so we can compute this from the RTT.
  • Challenge: can nodes synchronize using carrier sense? Transmission from one of the joint senders triggers the other senders. All nodes use CSMA, so one of the nodes wins contention and begins transmitting; other nodes join in if they have the data.
  • The lead sender adds a sync header to the packet and a known fixed gap to all co-senders to join after the gap. Co-sender listens, turns around from receive to transmit, waits for a compensating delay, and sends the data.
  • Implemented in a FPGA of the WiGLAN radio. Built a testbed with a variety of line-of-sight and non-line-of-sight locations.
  • Evaluated: randomly pick a pair of nodes to transmit, and measure the synchronization error. The 90th percentile of synchronization error was 20ns at 5dB SNR, and as little as 1ns at 25dB SNR.
  • Can SourceSync achieve sender diversity gains? Two nodes transmit simultaneously to a receiver (again). Check that two channels have different OFDM subchannel SNRs (they do, in the example) and that SourceSync achieves higher SNR in all subchannels.
  • Compare using the best single access point to using SourceSync, with two senders and a client. Repeat for all locations. SourceSync gives a median throughput gain of 57%.
  • Compared with opportunistic routing. Single path does worst. ExOR does better. SourceSync + ExOR does best (doubled median throughput over single path, and 45% improvement over ExOR alone).
  • Q. Did you consider just increasing the power of a single AP instead of sending with multiple APs? There is a fundamental gain here: the SNR profile is different for different routers, and combining across multiple senders gets rid of these deep losses.
  • Q. Should there be more components to the calculation of delay, based on the RTT? The nice part about this technique is that channel access delay doesn’t affect us, because we use carrier sense for telling when to transmit.
  • Q. Why did you not compare the performance of your scheme to MIMO? Sender diversity is orthogonal to MIMO and could improve its performance.
  • Q. Is your synchronization header long enough to account for nodes being very distant? Actually, it’s the gap after the header that has to be long enough. It’s a simple system-level parameter.

Novel Implementations of Network Components

SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware

  • […]
  • Existing approaches involve developing custom software, custom hardware or programmable hardware.
  • Platform header: a hash value for custom forwarding, a bitmap for what preprocessor should execute on the packet, a forwarding mode (including longest prefix matching or an exact match; also able to throw a software exception) and the virtual data plane ID.
  • Virtual data plane has its own preprocessing, lookup and post-processing stages: they operate in isolation.
  • Preprocessing stage: select processing functions from a library of modules (such as path splicing, IPv6 and OpenFlow). Also hashing: operator indicates what bits in the header should be incorporated in the packet-header hash to determine how the packet should be forwarded (can include up to 256 bits from the header).
  • Can do OpenFlow, where forwarding decisions are made on a 13-tuple (240 bits), which SwitchBlade hashes for custom forwarding to be done.
  • Modules are implemented in Verilog. Preprocessing and postprocessing modules extract the bits for lookup.
  • Forwarding stage: perform output port lookup based on mode bits. A software exception can be thrown and the packet redirected to the CPU. Could do hardware-accelerated virtual routers in software.
  • Implemented on NetFPGA.
  • Evaluated for resource utilization and packet forwarding overhead. Compared to a baseline implementation on NetFPGA. There is minimal resource overhead and no packet forwarding overhead.
  • Evaluated on a three-node topology.
  • SwitchBlade uses 13 million gates to get four data planes; other implementations (IPv4, splicing, OpenFlow) have one data plane and use 8 to 12 million gates.
  • No additional forwarding overhead compared to the reference implementation.
  • SwitchBlade is a programmable hardware platform with customizable parallel data planes. Provides isolation using rate limiters and fixed forwarding tables.
  • Q. How do you scale the performance beyond tens of Gbps? An artifact of the NetFPGA implementation which uses 4×1G ports. A later one will have 4×10G.
  • Q. Doesn’t the next paper show that it is possible to do all this in software? Things like Click are limited by packet copying overhead, so you are limited by the bandwidth of the PCI bus.
  • Q. What kind of hash function do you use and do different applications require different properties? We use a collision-resistant hash.

PacketShader: A GPU-Accelerated Software Router

  • Prototype achieves 40 Gbps on a single box by exploiting GPU acceleration.
  • Software routing is not just IP routing. It is driven by software and exploits commodity hardware.
  • 10G NICs cost from $200–300 per port. But software routers are limited to less than 10Gbps (8.7Gbps in RouteBricks is the best so far).
  • For 10G, it takes 1200 cycles to do packet I/O, and your budget is 1400 cycles. Lookup/encryption/hashing typically takes much more than that.
  • First step is to optimize the packet I/O. Then offload the other functions to the GPU.
  • GPUs are massively-parallel. Lots of small cores.
  • A GTX480 GPU has 480 cores and 1.2 billion transistors, most of which is dedicated to ALU.
  • Operations like hashing, encryption, pattern matching, network coding and compression are computationally intensive. GPU is well suited to these. GPU can also effectively hide memory latency.
  • Memory bandwidth of a top-of-the-line CPU is 32GB/s, but the empirical bandwidth (on realistic access patterns) is 25GB/s. Multiple ports receiving and transmitting will consume this and cause contention. However, a GPU has 174GB/s memory bandwidth.
  • Key insight: stateless packet processing is parallelizable. Take packets from the head of the receive queue, batch them and process them in parallel.
  • Latency is not impacted by parallel processing.
  • Before shader: checksum, TTL, format check, etc. This will send some packets along the slow path. It collects the destination IP addresses and passes those to the shader.
  • Shader: takes IP addresses, looks up the forwarding table and returns the next hops.
  • Post-shader: packets are updated and transmitted through the output ports.
  • Also device drivers at the receive and transmit side. Implemented a custom driver; details in the paper.
  • Can scale further with a multicore CPU. One master core and three worker cores. Master core talks to the shader. Once you have multi-socket, you need one GPU per CPU. Multi-socket, there is no communication between the CPUs, and each CPU owns a subset of the input queues.
  • Evaluated by connecting a packet generator and PacketShader back-to-back. Generator generates up to 80Gbps.
  • GPU gives a speedup (over CPU-only) of 1.4x for IPv4, 4.8x for IPv6, 2.1x for OpenFlow and 3.5x for IPSec.
  • IPv6 table lookup requires more computation than IPv4 lookup. The algorithm is binary search on hash tables. Big performance improvement for small packets, but slightly worse for 1024 and 1514 bytes. However, this is bounded by the motherboard I/O capacity.
  • IPSec tunneling adds a header and trailer to the encrypted packet. The improvement is across all packet sizes, and is actually bigger for larger packets.
  • PacketShader achieves 28.2 Gbps with CPU only, and is implemented in user space, rather than kernel space. Reaches 39.2 Gbps with the GPU.
  • Need to add a control plane (currently only does static forwarding). Need Quagga or Xorp.
  • Could also integrate with a programming environment, such as Click.
  • Q. Is it worth implementing such a sophisticated design to make a 40% saving? And do you have a breakdown of where the savings are made? The budget numbers and breakdown are taken from RouteBricks.
  • Q. What do you think about the power efficiency of this compared to other approaches? Idle to full load is 327W–594W with two CPUs and two GPUs. (Compared to 260W–353W for two CPUs.)
  • Q. Does this approach have advantages over an integrated network processor in terms of scalability or programmability? Network processors are not commodity. Based on experience, they are much more difficult to program.
  • Q. Why did your approach have such a significant speedup over RouteBricks etc. even without the GPU? Improvements in packet I/O throughput.

EffiCuts: Optimizing Packet Classification for Memory and Throughput

  • Packet classification is important for security, traffic monitoring and analysis, and QoS. Usually based on the source and destination IPs and ports, and the protocol field.
  • Line rates and classifier sizes are increasing. This leads to high power consumption.
  • Previous approaches have used either TCAMs (poor scalability) or algorithmic approaches (potentially scalable, but problematic). Most promising approach based on decision trees. Aim of this work is to address the scalability of decision tree algorithms.
  • HiCuts and HyperCuts have investigated decision trees previously. However, they require large memory.
  • EffiCuts reduces the memory overhead of HyperCuts while achieving high packet throughput. Uses 57x less memory and 8x less power.
  • Rules in the decision tree are hypercubes in the rule space. Tree building successively cuts down the rule space into smaller sub-spaces. Stops when the cube is small. Classification uses tree traversal.
  • HyperCuts’ memory overhead arises because many rules overlap and vary in size: the fine cuts needed to separate small rules end up cutting, and hence replicating, large rules. Also overhead because the rule space is sparse (leading to empty nodes or nodes with replicated rules).
  • Aim to tackle the variation in the rule size and the density of the rule space.
  • Separable trees: build separate trees for large and small rules. But separate them along different dimensions.
  • Build a distinct tree for each set of separable rules in the 5 IP fields (see the sketch after these notes). This leads to a maximum of 31 trees, but in practice it’s more like 12.
  • Extra memory accesses to traverse multiple trees decreases packet throughput. To reduce the number of accesses, merge the trees.
  • HyperCuts uses equi-sized cuts to separate dense areas, whereas EffiCuts uses equi-dense cuts, which leads to fine cuts in dense areas and coarse cuts in sparse areas. Many details of this in the paper.
  • Node co-location: colocate a node and its children, details of this in the paper.
  • Implemented HiCuts and Hypercuts with all heuristics, and EffiCuts. Used 16 rules per leaf. Power comparison uses an estimation from the Cacti tool to simulate the SRAM/TCAM.
  • First result: HyperCuts and HiCuts see memory grow more rapidly than EffiCuts. Replication decreases from 1000 to < 9. EffiCuts needs a constant number of bytes per rule as the number of rules grows.
  • EffiCuts requires 50% more memory accesses than HyperCuts. However, since EffiCuts uses much less memory, memory copies are inexpensive.
  • Throughput results are mixed (149 down to 73 million packets per second for one rule set; but 218 up to 318 for another). Still see an 8x saving in power.
  • Also compared EffiCuts to TCAM. Throughput story is also mixed, but EffiCuts consumes 6x less power than a TCAM.
  • Q. How do you separate “large” and “small” rules—using a threshold? We observed that the rule spread is essentially bimodal. This is based on a sensitivity analysis to the “largeness fraction” which varies between 0.1 and 0.95 without affecting the split.
  • Q. How would power consumption compare to a TCAM where you selectively turn on the relevant banks? Since we are comparing the worst-case packet match, every rule could go to a very different bank.
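
The “separable rules” idea can be sketched by classifying each rule according to which of the five fields it covers “largely”, and building one decision tree per distinct pattern, so small and large rules never share a tree. The largeness test (more than half the field) and the rule representation here are my own simplifications.

```python
from collections import defaultdict

FIELDS = ['src_ip', 'dst_ip', 'src_port', 'dst_port', 'proto']

def largeness_pattern(rule, threshold=0.5):
    """rule[field] = (lo, hi, field_size); True where the rule is 'large'."""
    return tuple((hi - lo + 1) / size > threshold
                 for (lo, hi, size) in (rule[f] for f in FIELDS))

def split_into_trees(rules):
    """Group rules by pattern; each group gets its own decision tree."""
    trees = defaultdict(list)
    for r in rules:
        trees[largeness_pattern(r)].append(r)
    return trees   # the talk quotes at most 31 trees, ~12 non-empty in practice
```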

Cloud and Routing

Theory and New Primitives for Safely Connecting Routing Protocol Instances

  • Earlier measurement study showed that internet routing is much more complex than the traditional two-level hierarchical model (EIGRP/BGP/OSPF). Since it is so complicated, the connecting primitives play a critical role. 99.9% of analyzed networks depend on them. They are used for things like domain backup, partition healing and router-level shortest path routing.
  • Designs are usually either safe or flexible. The status quo is unsafe and inflexible. The talk describes something that is both safe and flexible!
  • Framework based on routing algebras (metarouting). Connecting primitives have provable safety properties, more expressivity and require no modifications to existing protocols.
  • Today’s connecting primitives have two features: route selection (ranking) and route redistribution (information exchange). Configured using various optional settings in router configuration scripts.
  • Misconfiguration can lead to suboptimal routes, or loops.
  • Two questions: how should routes be compared? and when should they be redistributed? The idea is to have a conversion function from routes to a universal metric. Each routing instance is associated with a pair of conversion functions. Now frame the problem as what properties these functions should satisfy.
  • Contributions: sufficient conditions to guarantee correct routing.
  • Goal: unmodified BGP, OSPF, RIP, EIGRP. But BGP and OSPF E2 are not strictly monotonic.
  • Non-BGP protocols modeled by a 2-dimensional ranking, comprising route type and cost. Conversion functions map these to and from an ordered set.
  • Route selection: prefer non-BGP to BGP routes; for non-BGP routes prefer type A, then B, then C; and among non-BGP routes of the same type prefer the lowest cost (see the sketch after these notes).
  • Domain backup/partition healing: currently possible, but only with a complex configuration, a star topology and giving protection only to the leaves. In the new design with the default conversion functions, can do this with any topology and any available path.
  • Router-level shortest path: currently only possible between OSPF instances, and cannot change the cost. The new design can do it with OSPF, RIP and EIGRP using per-instance metrics.
  • Traffic engineering: existing design allows this only within instance, but now we can do it across instances.
  • Q. Do the functions have to be common across all the routers? Yes, it has to be consistent across all the border routers. But future work would let it be different.
  • Q. Do you see the potential for work in the standards bodies to standardize these functions? Talking to the router vendors about this.
  • Q. What is the behavior during convergence? We haven’t focused on that. After some time, we converge to a correct path. How can you be sure that it is correct without enforcing a global ordering? We do enforce a global ordering at present.
  • Q. Does your scheme require a centralized entity? We have no such entity. What about between different ASs? …
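
The conversion-function idea can be sketched as mapping each protocol’s native ranking into a single, totally ordered key that every border router compares. The concrete key encoding below (and the fallback to BGP’s own decision process for BGP routes) is invented for illustration.

```python
TYPE_RANK = {'A': 0, 'B': 1, 'C': 2}

def universal_key(route):
    """route = {'proto': ..., 'type': 'A'|'B'|'C', 'cost': int} for non-BGP routes."""
    if route['proto'] == 'BGP':
        # Any non-BGP route beats any BGP route; ties among BGP routes would
        # fall back to BGP's own selection.
        return (1,)
    return (0, TYPE_RANK[route['type']], route['cost'])

def select(routes):
    return min(routes, key=universal_key)

best = select([{'proto': 'BGP'},
               {'proto': 'OSPF', 'type': 'B', 'cost': 30},
               {'proto': 'RIP',  'type': 'A', 'cost': 7}])
print(best)    # the RIP route: non-BGP, type A, lowest cost
```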

DONAR: Decentralized Server Selection for Cloud Services

  • Server selection as a customizable constrained optimization. This talk describes a distributed solution to this problem.
  • User-facing services are being replicated across the world. Server selection involves selecting which replica to use for a particular user. Want to be agnostic to the wide-area protocol being used. For example, distributed DNS and HTTP redirection/proxying.
  • Idea was to build a system, DONAR, to which server selection can be outsourced.
  • Policy interface. Naïve policies include round-robin (for load balancing), or location-aware closest-node. But want to support complex policies over many nodes.
  • Policies are represented by constraints. For example, a bandwidth cap, or a split ratio and allowable deviation. Without constraints, just use closest node, but this can lead to massive imbalance.
  • Improvement: add a bandwidth cap on some instances, which now gives some locality, but doesn’t overtax a particular instance (and offloads some traffic to other instances).
  • Improvement: split 10% across ten replicas with a +/- of 5%. The tolerance is unique to this implementation. Gives the ability to trade off network proximity and load distribution.
  • Can have a mix of bandwidth cap, proportional split and tolerance across different instances.
  • For a customer, can define a global linear program that describes the optimal pairing: minimize network cost (distance) such that load is within tolerance and bandwidth caps are met (a toy version is sketched after these notes).
  • Need to do this for each customer, and continuously. Aim is to have potentially hundreds of DONAR nodes, customers and replicas per customer, and tens of thousands of client groups per customer. The linear program has millions of variables per customer.
  • DONAR nodes could measure traffic and optimize locally. But no one node sees the entire client population, and the distribution at each node is skewed.
  • Could maybe aggregate all of the data at a central coordinator. This would end up sharing a lot of data and compromising responsiveness. For example, we would want a node to respond quickly to a flash crowd.
  • Actually came up with a decomposition strategy, for both the objective function and the constraints. Uses a Gauss-Seidel iterative decomposition approach. The summary data shared is only proportional to the number of replicas. Proof in the paper that this converges to the global optimum.
  • Deployed since November 2009, and in production use for two customers. Currently serves around one million DNS requests per day.
  • Other systems challenges in the paper: network availability, reliable data storage, etc.
  • Experimental setup using CoralCDN with a proportional split across the replicas. Closest node policy is very volatile, whereas DONAR equal split gives much more predictable workloads.
  • Better than simple round-robin, since DONAR keeps the network distance as small as possible (shown by the rank distribution of node distance from the client).
  • Q. How quickly does the global solution converge? Covered in the paper, and usually after one round of hearing from everyone.
  • Q. Can you incorporate server-specific policy like consistency or staleness (cf. weak/eventual consistency, where different replicas have different data)? We assume that we can direct a request to every replica. But we could solve this problem for separate domains.
  • Q. What’s the inaccuracy that comes from geolocation based on the location of the resolver? This has been well studied.
  • Q. What if a server’s settings change or it fails, how long will it take to recompute? Built-in support for liveness updates (either notification or loss of heartbeat), and use a heuristic before we rerun the optimization. Rerun the optimization every two minutes.
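
The per-customer global optimization can be written down as a small linear program; the sketch below solves a centralised toy instance directly with scipy (the paper’s contribution is decomposing this so DONAR nodes need not share all their data). The demands, caps and distances are made up.

```python
import numpy as np
from scipy.optimize import linprog

demand = np.array([100.0, 50.0, 80.0])           # requests/s per client group
cap    = np.array([120.0, 150.0])                # bandwidth cap per replica
dist   = np.array([[ 5.0, 20.0],                 # network "cost" (e.g. RTT)
                   [15.0,  4.0],
                   [ 9.0, 11.0]])                # shape: (groups, replicas)

G, R = dist.shape
c = dist.flatten()                               # minimise total weighted distance

# Each client group's demand must be fully assigned: sum_j x[i,j] = demand[i]
A_eq = np.zeros((G, G * R))
for i in range(G):
    A_eq[i, i * R:(i + 1) * R] = 1.0

# Each replica must stay under its cap: sum_i x[i,j] <= cap[j]
A_ub = np.zeros((R, G * R))
for j in range(R):
    A_ub[j, j::R] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=cap, A_eq=A_eq, b_eq=demand)
print("assignment (requests/s):\n", res.x.reshape(G, R))
```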

Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud

  • Challenge: data privacy issues, such as national privacy laws and industry-specific privacy laws (like HIPAA). Challenge: SLA requirements like response time.
  • Possible solution is hybrid clouds.
  • First focus is on planning hybrid cloud layouts, making cost savings, minimizing response times and bandwidth costs.
  • Second focus is on migrating security policies, such as firewall contexts and ACLs.
  • Contributions: study of the complexity of enterprise applications, first-cut solutions to the two challenges, and validations.
  • Enterprise applications are typically three-tier but with multiple components in each tier, and complex interdependencies.
  • Abstract the planning problem using a graph with vertices for components, and virtual vertices for internal and external users. Each vertex has a size (number of servers), and edges have a number of transactions per second and size of transactions per second. Objective is to maximize cost savings through migration, subject to policy constraints and bounds on the increase in transaction delay. Then partition the graph between local and remote.
  • Approach is to use easily-available information, like computation times of components and communication times on links.
  • Model user response times using bounds on changes to the mean delay and variance.
  • Benefits for cost savings based on the estimates in the Above the Clouds tech report. Based on a non-elastic migration, and future work is to look at using the cloud for peaks.
  • Migration algorithm is based on a reachability matrix to determine necessary security policies.
  • Evaluated based on two case studies, the Windows Azure SDK application and a Campus Enterprise Resource Planning application.
  • Used a cloud testbed to evaluate a thumbnail-making example application. The plan results in an increase in mean delay of less than 10% and an increase in variance of less than 50%.
  • The campus ERP application involves multiple front-end, business logic and back-end components, in use at Purdue.
  • With a 30% bound on increase in mean delay, get $58k savings by migrating all components. For a 15% bound, get $38k savings by migrating a subset of the components. For a 15% bound with a placement policy (don’t move DBs), still get a $14k saving. Paper contains a sensitivity study on the benefit ratios.
  • Various security policies are also generated and evaluated.
  • Q. How much time did it take to understand the dependency mapping between components? Talked to operators and interviewed them, so the dependencies were extracted by humans. What about license dependencies, such as “may not be run in a VM”? [Taken offline.]
  • Q. Why did the delay increase more than the policy in the graphs? Used PlanetLab before a deadline, so the environment was hostile.
  • Q. Can you comment on the complexity of rewriting applications to run in split mode, which seems to be more important than performance or cost? There is an increasing trend towards service-oriented architectures, which make it easier to do this migration. Haven’t solved this for legacy applications.
  • Q. Did you have an anti-colocation scheme, e.g. for preventing certain pieces of data from being stored in certain locations? Constraints are based on cloud/non-cloud.
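
A rough, brute-force sketch of the planning formulation described above, with invented cost and delay numbers; the paper's algorithm and delay model are more sophisticated, and `plan_migration` and its parameters are illustrative names rather than the authors' code.

```python
from itertools import combinations

# Components are graph vertices; edges carry transactions/sec.  Migrating a
# component saves money, but every edge that crosses the local/cloud boundary
# pays a bandwidth cost and (crudely, in this sketch) adds wide-area delay.
def plan_migration(components, edges, savings, bw_cost, wan_delay_ms,
                   delay_bound_ms, pinned=()):
    best = (0.0, frozenset())                        # (net savings, migrated set)
    movable = [c for c in components if c not in pinned]
    for k in range(len(movable) + 1):
        for subset in combinations(movable, k):
            cloud = set(subset)
            crossing = [(u, v, tps) for (u, v, tps) in edges
                        if (u in cloud) != (v in cloud)]
            added_delay = wan_delay_ms * len(crossing)   # crude delay model
            if added_delay > delay_bound_ms:
                continue
            net = (sum(savings[c] for c in cloud)
                   - sum(tps * bw_cost for (_, _, tps) in crossing))
            if net > best[0]:
                best = (net, frozenset(cloud))
    return best

if __name__ == "__main__":
    components = ["frontend", "logic", "db"]
    edges = [("frontend", "logic", 200.0), ("logic", "db", 500.0)]
    savings = {"frontend": 10.0, "logic": 25.0, "db": 40.0}
    # Placement policy: pin the database locally ("don't move DBs").
    print(plan_migration(components, edges, savings, bw_cost=0.01,
                         wan_delay_ms=40, delay_bound_ms=80, pinned=("db",)))
```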

SIGCOMM 2010: Day 1

August 31st, 2010

Wireless and Measurement

Efficient Error Estimating Coding: Feasibility and Applications

  • Won best paper award.
  • Existing philosophy: errors are bad and only want completely correct data.
  • Can we accept partially correct packets and only enforce correctness end-to-end?
  • Contribution: error estimating coding. Enables the receiver to estimate the number of errors in a packet, but not correct them.
  • Smaller overhead/weaker functionality than error correcting codes.
  • Overheads come from redundancy and computation.
  • EECs need only O(log n) bits redundancy to estimate errors. e.g. 2% overhead on a 1500-byte packet. For just a threshold check, need only 4 bytes.
  • Efficient computationally: software implementation can support all 802.11 data rates. ECC is 10 to 100 times slower.
  • Can estimate number of errors in a packet with a provable quality.
  • Application: streaming video. FEC (forward error correction) often used here. Routers forward partially correct packets. But if the number of errors is so large that the data are unrecoverable, it will incur retransmission. Router should have asked for retransmission earlier when a packet could not be decoded. But it lacks computational power to evaluate an ECC. EEC is more computationally tractable in this scenario for BER-aware retransmission.
  • Implemented for Soekris Net5501-70 routers.
  • Key idea: router should treat different packets differently. Could use analogue or digital amplification as necessary.
  • Packet scheduling: image sensor network for emergency response. Let packets with smaller BER get through first.
  • Applies also to bulk data transfer. Can use partial packets and correct them end-to-end. Could use network coding or incremental redundancy. EEC helps to do WiFi rate adaptation, if we know the mapping between data rate and BER, which EEC provides. Existing systems based on packet loss rate, signal-to-noise ratio (at the receiver), or modifying the PHY.
  • Implemented a prototype rate adaptation scheme using EEC. Consistently outperforms existing schemes based on packet loss rate and SNR.
  • More general problem is wireless carrier selection: goal is to select carrier with the best goodput.
  • Packet has n + k slots (n data bits and k EEC bits). p is the fraction of erroneous slots, with arbitrary position.
  • Naïve solution would be to add pilot bits to each packet with known values at known positions. But this needs the receiver to observe enough errors in the pilot bits. Since BER is usually small, need a lot of pilots to see a single error.
  • Instead, make the pilot bit a parity bit based on a known group of data bits in the packet. But cannot distinguish many cases with parity bit (1 error vs. 3 errors). Error probability of a parity bit is (inversely) correlated with error probability of the bits in its group.
  • Solution involves randomly selecting data bits to make up fixed-size groups and computing an EEC parity bit for each group. Now permute the data and EEC bits and send (see the sketch after this list).
  • Can refine to single- or multiple-level EEC. Details of multiple-level EEC in the paper.
  • Can prove a formal guarantee for the space complexity of the number of EEC bits (the O(log n) bound).
  • Compared to SoftPHY, EEC is a pure software solution, which is more deployable. But SoftPHY gives per-bit confidence information that EEC cannot provide.
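
A simplified single-level EEC sketch along the lines described above. The group size, group count and estimator inversion are illustrative choices; the paper's construction also has to cope with corrupted EEC bits and uses multiple levels, whereas here only the data bits are corrupted.

```python
import random

L = 32           # data bits per parity group (illustrative)
GROUPS = 256     # number of EEC parity bits (illustrative)

def make_groups(n_bits, rng):
    return [rng.sample(range(n_bits), L) for _ in range(GROUPS)]

def parity(bits, group):
    return sum(bits[i] for i in group) & 1

def encode(data_bits, groups):
    return [parity(data_bits, g) for g in groups]

def estimate_ber(rx_data, rx_parity, groups):
    failures = sum(parity(rx_data, g) != pb for g, pb in zip(groups, rx_parity))
    q = failures / len(groups)
    q = min(q, 0.5 - 1e-9)              # keep the inversion in range
    # If each bit flips with probability p, a group parity check fails with
    # probability q = (1 - (1 - 2p)^L) / 2; invert to recover p.
    return (1 - (1 - 2 * q) ** (1.0 / L)) / 2

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.randrange(2) for _ in range(12000)]        # ~1500-byte packet
    groups = make_groups(len(data), rng)                   # known to both ends
    eec = encode(data, groups)
    true_ber = 0.01
    # For simplicity only the data bits are corrupted in this sketch.
    noisy = [b ^ (rng.random() < true_ber) for b in data]
    print("estimated BER:", estimate_ber(noisy, eec, groups))
```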

Design and Implementation of an “Approximate” Communication System for Wireless Media Applications

  • Leverages properties of wireless PHY layer to improve media applications.
  • Media applications by 2013 will comprise 91% of internet traffic, and wireless-based access is the dominant form of access (4 billion wireless hosts vs. 0.5 billion static hosts).
  • In this case, we should use the spectrum efficiently for media transfer.
  • Looking at hierarchically-structured media, such as MPEG4 and H.264. Different frames (e.g. I, P, B) have different value (I > P > B). So use unequal error protection to prioritize important frames.
  • Since data received is a predictable approximation of transmitted data, they can provide unequal error protection almost for free (in terms of additional spectrum).
  • Errors result from the constellation diagram used to determine QAM encoding. Typically, the error is restricted to symbols in the neighborhood of the received symbol in the constellation.
  • Using a Gray code-based bit mapping from symbols to the QAM encoding. So, by definition, neighboring symbols are just one bit different. Different bit positions offer different levels of protection. So can choose different bit positions to give different protection to data.
  • For the media example, put I frames in more-protected bit positions, and other frames in the other positions (see the sketch after this list).
  • Also considered a block mapping scheme, which gives better protection for the most protected bits, and worse protection for the other bits, than a Gray code mapping.
  • Designed a system based on these principles: APEX.
  • A modern radio pipeline will apply randomization, such as scrambling, coding and interleaving. But this makes it harder to determine what bits go where. So they move randomization to before the assignment of bit positions.
  • Uses a greedy algorithm to map application data at various priorities that deals with unequal content size.
  • Evaluated at various bit rates and bit error rates. Gets a better PSNR than traditional transmission. Also showed that it works well with FEC.
  • Q. Did your evaluation take fading into account, and would the assumptions still hold? Experimentation done using a system not robust enough to take outdoors and do experiments. The assumptions might hold. If you are sharing wireless LANs more, does this become more critical?
  • Q. How will it work when your bit-rate adaption drops down to BPSK and QPSK? It does nothing in this case. How might it work? Could instead use smaller QAM symbols.
  • Q. What if you get 180 degree phase shift, due to fading or propagation delay, and the mapping changes? At the PHY layer, we expect there to be a mapping, and an indexing mechanism that can decode information in the header of the packet to select the mapping.
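
A toy Monte Carlo illustration of the unequal protection offered by bit positions in a Gray-mapped 16-QAM axis, which is the property the system exploits. The mapping and noise model are textbook choices made for the example, not the APEX implementation.

```python
import random

# Gray mapping per 16-QAM axis: the first bit is the sign, the second selects
# inner/outer amplitude.  Nearest-neighbour noise flips the sign bit only when
# the symbol crosses zero, so the first position is better protected.
GRAY_TO_LEVEL = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}
LEVEL_TO_GRAY = {v: k for k, v in GRAY_TO_LEVEL.items()}

def modulate_axis(msb, lsb):
    return GRAY_TO_LEVEL[(msb, lsb)]

def demodulate_axis(y):
    # Slice the noisy coordinate to the nearest constellation level.
    level = min((-3, -1, 1, 3), key=lambda l: abs(y - l))
    return LEVEL_TO_GRAY[level]

def bit_error_rates(noise_sigma=1.0, trials=100_000, seed=0):
    rng = random.Random(seed)
    errs = [0, 0]
    for _ in range(trials):
        bits = (rng.randrange(2), rng.randrange(2))
        y = modulate_axis(*bits) + rng.gauss(0, noise_sigma)
        rx = demodulate_axis(y)
        for pos in range(2):
            errs[pos] += bits[pos] != rx[pos]
    return [e / trials for e in errs]

if __name__ == "__main__":
    msb_ber, lsb_ber = bit_error_rates()
    print("protected (sign) bit BER:    ", msb_ber)
    print("weaker (inner/outer) bit BER:", lsb_ber)
    # APEX-style placement: I-frame bits go into the protected position,
    # lower-priority (e.g. B-frame) bits into the weaker one.
    i_frame_bit, b_frame_bit = 1, 0
    print("transmitted level:", modulate_axis(i_frame_bit, b_frame_bit))
```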

    Not All Microseconds Are Equal: Enabling Per-Flow Measurements With Reference Latency Interpolation

    • Low-latency (e.g. financial) applications are increasingly important.
    • Current solution is a low-latency cut-through switch. In a tree network, it is hard to tell which switch is causing a problem at microsecond granularity. Need high-fidelity measurement within the routers themselves.
    • But SNMP and NetFlow provide no latency measurements, and active probes are typically only end-to-end. Measurement boxes are very expensive (£90k).
    • Need per-flow measurements because averages lose too much information about what happens to each flow. There is a significant amount of difference in average latencies across flows at a router.
    • Perform measurement on packet ingress and egress. Assume that router interfaces are synchronized, because cannot modify packets to carry timestamps.
    • Naïve approach: store timestamps for each packet on ingress and egress. Packet timestamps get sent along the egress route when the flow stops. Obviously this is too costly to do at 10Gbps. Sampling sacrifices accuracy.
    • Use LDAs with many counters for interesting flows, counting packets seen at each timestamp.
    • Divide time into windows, and measure the mean delay for packets within that window (locality observation). Error shrinks with a smaller window size.
    • Can inject a reference packet regularly that does have ingress and egress timestamps, which gives delay samples for each window.
    • Implementation has two components: reference packet generator and latency estimator.
    • Reference packet generation strategies: 1 in n packets or 1 in tau seconds. Actual approach is dynamic injection based on utilization. When high utilization, inject fewer packets.
    • Latency estimator strategies: could use only the left reference packet (previous reference packet) or a linear interpolation of the left and right reference packets (see the sketch after this list). Other non-linear estimators, such as shrinkage, are possible.
    • Maintain packet count, summed delay and sum of squares of packet delays for each flow.
    • Evaluated on various router traces, and simulated with NetFlow YAF implementation. (RED active queue management policy.) Median relative error is 10–12%. As utilization grows, error decreases. RLI outperforms other schemes by 1 to 2 orders of magnitude.
    • Overhead is < 0.2% of link capacity. Packet loss difference is 0.001% at 80% utilization.
    • Q. Have you talked to router vendors about implementing this? No.
    • Q. When you compare the different approaches, why is sampling so much worse? Sampling scheme is one packet per thousand. So accuracy is very low if few packets are delayed.
    • Q. How does it influence the results if packet loss rates are high? Lost packets are not counted in our result.
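
A minimal sketch of the reference-packet interpolation and per-flow counters described above; the class and field names are invented for the example.

```python
from collections import defaultdict

class RLIEstimator:
    def __init__(self):
        # flow -> [packet count, sum of delays, sum of squared delays]
        self.flows = defaultdict(lambda: [0, 0.0, 0.0])
        self.pending = []              # packets waiting for a right reference

    def packet(self, flow, ingress_ts):
        self.pending.append((flow, ingress_ts))

    def reference(self, left_ref, right_ref):
        """left_ref/right_ref are (ingress_ts, measured_delay) pairs."""
        (tl, dl), (tr, dr) = left_ref, right_ref
        for flow, t in self.pending:
            if tl <= t <= tr:
                # Linear interpolation between the two reference delays.
                delay = dl + (dr - dl) * (t - tl) / (tr - tl)
                c = self.flows[flow]
                c[0] += 1
                c[1] += delay
                c[2] += delay * delay
        self.pending = [p for p in self.pending if p[1] > tr]

    def stats(self, flow):
        n, s, sq = self.flows[flow]
        mean = s / n
        var = sq / n - mean * mean
        return mean, var

if __name__ == "__main__":
    est = RLIEstimator()
    est.packet("flowA", 1.2)
    est.packet("flowA", 1.7)
    est.reference((1.0, 0.010), (2.0, 0.030))   # delay rose from 10ms to 30ms
    print(est.stats("flowA"))                   # mean ~= 0.019
```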

    Data Center Networks

    Generic and Automatic Address Configuration for Data Center Networks

    • Manual configuration is error-prone, causing downtime. DHCP is used at layer 2. But applications need this information as well. Data center designs (manually) encode this information in the IP address for routing. But DHCP isn’t enough for this.
    • Takes two inputs: a blueprint graph (with logical IDs for each machine) that can be automatically generated, and a physical topology graph that is available later when the data center is constructed.
    • Center of framework is a device-to-logical ID mapping. Need malfunction detection to update the mapping.
    • Maintaining a map between devices and the logical IDs is the graph isomorphism problem. The complexity of this problem is unknown (P or NPC). Introduce the O2 algorithm which solves the problem in this case. Proceeds by choosing an arbitrary first node matching (decomposition), then refinement (a generic refinement sketch appears after this list). Terminates when no cell can be decomposed. Overall algorithm terminates when all cells have single nodes.
    • O2 has three weaknesses (i) iterative splitting in the refinement stage, (ii) iterative mapping in the decomposition stage, and (iii) making a random selection of the mapping candidate. Optimization algorithm has three heuristics that address these problems.
    • O2 turns out to be faster than its competitors: Nauty and Saucy.
    • Malfunctions cause the topology to differ from the blueprint, so O2 cannot find the mapping. Solution is to find the maximum common subgraph between the blueprint and physical graphs. This problem is NP-hard and APX-hard.
    • Use heuristics based on the vertex degree changing. If no degrees have changed, probe subgraphs derived from anchor points using majority voting to identify miswired devices.
    • Protocols for channel building, physical topology collection and logical ID dissemination. A DAC manager coordinates these.
    • Experimented on a BCube(8, 1) network with 64 servers. The total time to run the algorithm was 275 milliseconds to autoconfigure all of the servers.
    • Ran simulations on larger topologies. Up to 46 seconds on a DCell(6, 3) network with 3.84 million devices.
    • Q. What are the next steps for this work? Can we design better logical IDs that can be used in routing.
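
The refinement stage can be pictured with generic partition refinement by neighbour-cell degrees, sketched below. This is a textbook refinement step rather than the O2 algorithm itself; any cells left with more than one node would be handled by the decomposition step.

```python
def refine(adj, cells):
    """adj: node -> set of neighbours; cells: list of sets partitioning the nodes."""
    changed = True
    while changed:
        changed = False
        new_cells = []
        for cell in cells:
            # Signature of a node: how many neighbours it has in each cell.
            sig = {v: tuple(len(adj[v] & other) for other in cells) for v in cell}
            groups = {}
            for v, s in sig.items():
                groups.setdefault(s, set()).add(v)
            if len(groups) > 1:
                changed = True
            new_cells.extend(groups.values())
        cells = new_cells
    return cells

if __name__ == "__main__":
    # A 4-node path graph: the two ends are distinguishable from the middle.
    adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    print(refine(adj, [set(adj)]))      # -> [{1, 4}, {2, 3}]
```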

    Symbiotic Routing in Future Data Centers

    • Reevaluate network architecture based on the different properties of data center networks when compared to the internet.
    • Despite lots of other work, the network interface has not changed, so what can we do at the application layer?
    • Network is a black box, and applications have to infer things like locality, congestion and failure; likewise networks have to infer things about the applications like flow properties.
    • MSRC designed CamCube: a network with x86 servers directly-connected in a 3D torus. Servers have (x, y, z) coordinates that are exposed to the application. The send/receive API is a simple 1-hop API. Multi-hop routing is provided as a service, which uses multipath when possible.
    • Built a high-throughput transport service, a large-file multicast service, an aggregation service and a distributed key-value cache service. Each had a custom routing protocol based on the properties that the application needed to obtain. e.g. High-throughput transport prefers disjoint paths, whereas file multicast prefers non-disjoint paths.
    • Testbed used 27 servers with 6×1G NICs. A simulator looked at a 20^3 (8000) node CamCube.
    • Custom routing policies yield a performance improvement (on average). Factor of 2 (median) improvement in end-to-end throughput for high throughput transport (10k x 1500 byte packets). Gains also for multicast and the distributed object cache (in terms of path length).
    • Also looked at impact on the network: achieved higher throughput with fewer packets (lower link utilization) for all applications.
    • The base routing protocol is still used to route along paths defined in the custom routing protocol, and to handle network failures. The custom protocols route for the common case.
    • Built a routing framework for describing these custom protocols. Two components: the routing policy and the queuing policy. Each service manages one packet queue per link.
    • Cache service: keyspace mapped onto the cube, evenly distributed across the servers. Routing: go to the nearer of the cache or primary nodes (see the sketch after this list). On a cache miss, route from the cache to the primary, and populate the cache on the return.
    • The base protocol routes around link failures. If a replica server fails (in the key-value store), the key space is consistently remapped by the framework.
    • Forwarding function implemented in C#, running in userspace.
    • Benchmarked a single server in the testbed, communicating at 11.8Gbps with all six neighbors. Required 20% CPU utilization.
    • Can the routing approach be used outside CamCube? Network only needs to provide information about path diversity and topology, and the ability to program components.
    • Q. When you make an application for this framework, what would happen to the application if you decided to change the topology? The benefit of the black-box approach is that you don’t care about the topology. May be advantageous to target containerized/modular data centers where the topology cannot frequently (or at all) change.
    • Q. How would the performance look on other topologies, considering that the torus is optimal for latency and bandwidth? It is a benefit and a curse, given that we occasionally have long path length for some pairs. Topologies that give you higher path diversity give you better chances to employ these ideas.
    • Q. What if you ran multiple instances of the same application (rather than applications with very diverse routing policies)? We did run this, but the details are in the paper. For the high-throughput transport protocol, you might expect us to be susceptible to congestion, but the forwarding function returns many choices, which you can choose based on minimal queue length, for example.
    • Q. What is the net result here for forwarding latency? This is one of the main critiques of the topology itself. Currently experimenting with smart NICs so that we don’t have to go up to user space for straightforward forwarding.
    • Q. Can you write the forwarding method in the form of many overloaded methods that could be dispatched dynamically? To some extent, but packets are tagged with a service ID, which statically dispatches the forwarding method of the particular service.
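
A toy sketch of coordinate-based service routing on a 3-D torus, including the cache-service policy of routing to the nearer of the cache or primary node. The coordinates, hashing and function names are invented for the example and are not the CamCube API.

```python
import hashlib

K = 20          # a 20^3-node cube, as in the simulations mentioned above

def torus_dist(a, b):
    return sum(min((ai - bi) % K, (bi - ai) % K) for ai, bi in zip(a, b))

def next_hop(cur, dst):
    """Greedy: take one hop along some dimension that reduces torus distance."""
    for dim in range(3):
        for step in (1, -1):
            cand = list(cur)
            cand[dim] = (cand[dim] + step) % K
            if torus_dist(tuple(cand), dst) < torus_dist(cur, dst):
                return tuple(cand)
    return cur                                   # already at the destination

def key_to_coord(key, salt=b""):
    h = hashlib.sha1(salt + key.encode()).digest()
    return (h[0] % K, h[1] % K, h[2] % K)

def cache_lookup_target(src, key):
    """Cache service policy: go to the nearer of the cache or primary node."""
    primary = key_to_coord(key)
    cache = key_to_coord(key, salt=b"cache")
    return min((cache, primary), key=lambda n: torus_dist(src, n))

if __name__ == "__main__":
    src, key = (0, 0, 0), "user:42"
    target = cache_lookup_target(src, key)
    hops, cur = [], src
    while cur != target:
        cur = next_hop(cur, target)
        hops.append(cur)
    print("route to", target, "via", hops)
```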

    Data Center TCP

    • TCP is used for 99% of traffic in data centers, but what is the problem with it? Can suffer from bursty packet drops (Incast), and builds up large queues that add significant latency.
    • Many ad hoc workarounds for TCP problems, such as at the application level. This talk is about changing the TCP stack in the kernel to address these problems.
    • Interviewed developers, analyzed applications and did a lot of measurements. Systematic study of impairments and requirements in Microsoft’s data centers.
    • Case study: Microsoft Bing data center. 6000 servers, with passive instrumentation (application/socket/kernel-level). Search query goes to top-level aggregator, which splits the query and farms it out to mid-level aggregators, which then farm it out to worker nodes. Worker nodes have a 10ms deadline; mid-level aggregators are 50ms; and the top-level deadline is 250ms. Missed deadlines lead to missing data in the results.
    • Similarly, Facebook builds a page by pulling data from various servers. Similar traffic pattern to Bing.
    • Incast happens when “synchronized mice collide”. Caused by partition/aggregate: the queue overflows at the aggregator. To deal with the problem, Bing jitters requests over a 10ms window. This gets better performance at the median, but causes problems at higher percentiles (up to 99.9th is tracked).
    • Queue buildup occurs when big flows take up too much of a queue, which increases the latency for short flows.
    • Requirements: 1. High burst tolerance; 2. Low latency for short flows; 3. High throughput. 1 and 3 are in tension with 2. Deep buffers helps 1 and 3 but increases latency. Shallow buffers are bad for bursts and throughput.
    • Objective is low queue occupancy, with high throughput.
    • TCP uses explicit congestion notification (ECN): marks are inserted into packets by switches in the middle of the network, noted by the receiver and echoed back to the sender in the ACKs.
    • Need C * RTT buffers for a single flow running at 100% throughput. If you have a large number of flows, you can have fewer buffers. If there is a low variance in sending rate, small buffers are sufficient.
    • Key idea: react in proportion to the extent of congestion, not its presence (cut the congestion window by less if there are fewer congestion notification bits set). Other key idea: mark packets based on the instantaneous queue length.
    • Sender maintains a running average of the fraction of marked packets, and adaptively cuts the congestion window based on how many packets are marked (see the sketch after this list).
    • On a real deployment, DCTCP for two flows keeps the queue length much shorter than regular TCP.
    • Get high burst tolerance because low queue occupancy leaves ample headroom to absorb bursts; get low latency from the low buffer occupancy itself.
    • How long can DCTCP maintain queues without loss of throughput, and how do you set the parameters? Need to ensure that the queue size is stable by quantifying the oscillations.
    • Implemented on Windows 7 using real hardware with 1G and 10G stacks. Aim was to emulate traffic within one rack of the Bing data center. Generated query and background traffic based on distribution seen in Bing. For background flows, DCTCP gets lower latency for small flows, and matches TCP for large flows. For query flows, DCTCP does much better than TCP.
    • Tried scaling the traffic (background and query) by 10x. Compared DCTCP, TCP and TCP-RED with shallow-buffered switches, and TCP with a deep-buffered switch. TCP-DeepBuf does terribly for latency with short flows, and TCP-ShallowBuf does terribly due to Incast.
    • Q. Shouldn’t you be comparing with TCP-ECN? We have done those comparisons, though they aren’t in the talk.
    • Q. If you reduce in proportion to the number of bits, it depends on the timescales on which the queue builds up, which depends on the number of competing sources and their own reactions. Don’t you have to do some control theoretic modeling? Data center is a homogeneous environment with all sources being DCTCP. Even there you really need to figure out the bandwidth properly? There is more detail in the paper.
    • Q. Depending on the instantaneous queue for the notification has implications for the dynamics, and I am concerned about this. Homogeneous RTT helps here, thanks to the data center environment. We believe there are some simple ways to solve the problem for heterogeneous RTT (e.g. multiple queues). I’d like to see how much diversity you can tolerate.
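
A minimal sketch of the DCTCP sender behaviour described above: a running estimate of the marked fraction and a window cut proportional to it. The per-window update and the constants here are a simplification of the paper's algorithm.

```python
class DctcpSender:
    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd        # congestion window, in packets
        self.alpha = 0.0        # estimated fraction of ECN-marked packets
        self.g = g              # EWMA gain

    def on_window_acked(self, acked, marked):
        """Called once per window: acked packets total, marked of them ECN-marked."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # React in proportion to the *extent* of congestion, not just its presence.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0    # standard additive increase

if __name__ == "__main__":
    s = DctcpSender()
    for marked in [0, 0, 3, 5, 0, 2]:          # marks seen in successive windows
        s.on_window_acked(acked=10, marked=marked)
        print(f"alpha={s.alpha:.3f}  cwnd={s.cwnd:.2f}")
```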

    Inter-Domain Routing and Addressing

    Internet Inter-Domain Traffic

    • It’s hard to measure the Internet, and the lack of ground truth data makes it harder to know if you’re doing it accurately.
    • Wanted to collect a large data set that shows how the internet is evolving.
    • Conventional wisdom: global-scale end-to-end network, broad distribution of traffic sources and sinks, and the existence of a “core”. But as time passes, these are becoming less true.
    • Methodology: focussed on inter-domain traffic, not application layer things such as web hits/tweets/VPN/etc. Exported coarse-grain traffic statistics about ASNs, ASPaths, protocols, ports, etc. via anonymous XML forwarded to central servers. Covers 110 ISPs/content providers, 3k edge routers, 100k interfaces… i.e. about 25% of all inter-domain traffic. Then waited 2 years and repeated to get longitudinal data. Used commercial probes within a given ISP, with limited visibility into payload-based classification. Calculated percentages per category then weighted averages using number of routers in each deployment. Also incorporated informal and formal discussions with providers, and information about known traffic volumes.
    • Validated predictions based on a ground-truth based on 12 known ISP traffic demands (Known peak Tbps).
    • In two years, Google and Comcast have grown to be the 3rd and 6th biggest carriers in terms of traffic demands.
    • Cumulative distribution of carriers: in 2007, thousands of ASNs contributed the first 50% of content; in 2009, it was 150 ASNs. In 2010 it’s even more dramatic (but not shown).
    • By buying YouTube, Google went from 1% of internet traffic to 6% (not including the global cache — i.e. an underestimate).
    • In 2007, Comcast had “eyeball” peering ratios, but by 2009, they were a net producer of content. Video distribution, backbone consolidation, etc. contributed to this.
    • Price per megabit of wholesale for internet transit has collapsed whereas the revenue for internet advertisement has greatly increased.
    • IP and CDNs have been commoditized, hence enterprise traffic has moved to the cloud. Companies have consolidated. Bundling (triple/quad play etc.) has become popular. Several new economic models, such as paid content, paid peering, etc. (but often under NDA). Also disintermediation as customer and provider connect directly.
    • Traditional internet model is hierarchical: not really true but pedagogically used. The new “Tier-1” has “Hyper Giants”, representing the large content providers.
    • In terms of protocols: HTTP has grown 24.76%, video has grown 67.09% and P2P has shrunk > 70%.
    • Port usage is also consolidating: similar distribution to the content providers: fewer ports account for more of the internet. Looked at Xbox traffic on TCP port 3074, and saw a huge drop. Actual cause was a move to port 80.
    • File sharing has migrated to the web. In 2010, P2P is the fastest-declining application group. Direct-download like Megavideo, etc. are much more popular now, and you can even get HD streaming.
    • Internet traffic changes have been driven by changing economic models. Shift from connectivity to content, and the move to port 80 are two major trends. This has implications for engineering and research, as security/fault tolerance/routing/traffic engineering/network design have become more difficult.
    • Q. Economic implications: what are other types of business models and arrangements that might come out of “hyper giants”? Still in the early stages, but it’s not clear - from a power perspective - who has the upper hand.
    • Q. If the number of top players decreases by an order of magnitude, do you see the role of CDNs diminish and do you have any data on that? Talk about CDNs in the paper (about 10% or more of internet traffic). In general, they are growing, but enterprise content is driving a lot of the CDNs.

    How Secure are Secure Interdomain Routing Protocols?

    • After a decade of research on secure BGP, no idea what the best solution is. So this paper evaluates how well each protocol prevents “traffic attraction attacks”, based on simulation on empirical data. Used an AS-level map of the internet and business relationships, and a standard model of routing properties.
    • How do we strictly enforce correct path announcements? Solutions range from no crypto (BGP) to lots of crypto (data-plane security). It turns out that this isn’t the only problem: just as important to control who you announce to, as well as what is announced. Defensive filtering is also important.
    • Different relationships: customer-provider (customer pays), and peering (no payment). No value for how much a path costs in the model, but the paper models routing by preferring cheaper paths. After that, prefer shorter paths. Only transit traffic if it makes you money (i.e. on behalf of your customers). (A sketch of this preference/export model appears after this list.)
    • A traffic attraction attack is an attempt to get as many people as possible to route traffic through the attacker’s AS (for tampering, eavesdropping, etc.). Simulations show that a traffic attraction attack can pull in traffic for 62% of ASs.
    • Origin authentication: secure database of IP to AS mappings to prevent people advertising origins they don’t own. Simulation shows 58% of ASs get attracted to the attacker.
    • Secure BGP: enforces that ASs cannot announce paths that have not been announced to them (using digital signatures). So an attacker can only extend paths that were actually announced to it. Simulation shows that the attacker still attracts 18% of ASs.
    • Defensive filtering (of stubs (i.e. ASs with no customers, which should not route traffic)): provider drops announcements for prefixes not owned by its stubs. Defensive filtering thwarts all attacks by stubs (i.e. all of the previous cases), and 85% of ASs are stubs.
    • Sometimes announcing longer paths is better than announcing short paths; sometimes announcing to fewer neighbors (than all of them) is better. It’s NP-hard to find the optimal attack strategy.
    • Ran experiments by choosing randomly an attacker, victim pair and simulated a “smart attack” for each protocol.
    • Evaluated probability that an attack attracts 10% of ASs (over a random choice of attacker and victim). Defensive filtering alone is as effective as Secure BGP alone. However, the attacks could be smarter, so these numbers are underestimates.
    • Why aren’t we using defensive filtering today? It’s hard to keep track of the prefixes that your customers own. A push to implement origin authentication is ongoing, and this could be used to derive the filtering mapping. The threat model is somewhat strange though.
    • Q. The CAIDA data are less than ideal, so how robust are the statements in the paper? Ran the experiments twice, on CAIDA data from 2009, and the UCLA Cyclops data. Trends for the two sets are similar.
    • Q. Did you look at comparing the effectiveness of models versus their deployability? Completely ignored deployability and implementability in this paper, but it will be the subject of future work.
    • Q. Have you considered varying the 10% threshold? Yes. [Backup slide.]
    • Q. Seems like the easiest way to attract traffic would be to deaggregate the prefix, so do you take this into account? We didn’t evaluate it because we idealized things to make it tractable.
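
A small sketch of the "prefer cheaper, then shorter" routing model and money-making export rule used in the simulations, as described above; the relationship labels and function names are illustrative.

```python
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}     # lower is better

def better(route_a, route_b):
    """Each route is (relationship_to_next_hop, as_path); return the preferred one."""
    key = lambda r: (PREFERENCE[r[0]], len(r[1]))
    return min((route_a, route_b), key=key)

def export_to(neighbour_relationship, route_relationship):
    """Only transit traffic that makes money: routes learned from customers go
    to everyone; routes learned from peers/providers go only to customers."""
    return route_relationship == "customer" or neighbour_relationship == "customer"

if __name__ == "__main__":
    r1 = ("provider", ["A", "B", "C"])        # short, but costs money
    r2 = ("customer", ["A", "D", "E", "F"])   # longer, but a customer route
    print(better(r1, r2))                     # customer route wins despite length
    print(export_to("peer", "provider"))      # False: no free transit for peers
```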

    Understanding Block-level Address Usage in the Visible Internet

    • What can simple observations about the internet say? Contributed methodology, applications and validation.
    • Looked at spatial correlation, address utilization, dynamic addressing and low-bitrate identification. Based on data set gathered for IMC. New, deeper understanding and a new interpretation.
    • Is there spatial correlation in the IPv4 address space? Are adjacent addresses likely to be used in the same way? This could help to efficiently select representative addresses for more-detailed study.
    • Collected data by pinging each address in randomly selected /24 blocks every 11 minutes for a week, and collected the probe responses, probing 1% of the whole IPv4 space. Gives 5 billion ping responses.
    • Three metrics: availability (normalized sum of up durations), volatility (normalized number of up durations) and median-up (median duration of a period of uptime).
    • Graphed by mapping the IP addresses onto a Hilbert curve.
    • Algorithm: examine each block; if it is homogeneous, stop; otherwise split the block and recurse (see the sketch after this list).
    • Validating spatial correlation is hard, because it is hard to find a ground truth. Therefore used USC’s network for comparison, and the general internet (hostname-inferred truth). Also evaluated for different samples and dates.
    • Selected USC because the operator provided the ground truth, and they had knowledge of both allocated and usage blocks.
    • 43% false negative rate, and 57% of blocks are correctly identified.
    • The general internet gives unbiased truth. The results are more correct than for USC: 68% correct to 32% false negative.
    • Low-bitrate identification: formalized RTT = transfer + queuing + propagation delays. Tried using median RTT to identify low-bitrate versus broadband. However, for international links, propagation time dominates. Variance of RTT separates low-bitrate from broadband.
    • Used hostnames as a form of ground truth (e.g. if the hostname contains “cable” or “3G”).
    • Q. Did you do any of this on IPv6? No. Are you planning to? Probably in the future when IPv6 is more popular.
    • Q. Is it reliable to use ping response to detect hosts, when some hosts refuse to handle ping responses? We didn’t consider false information in this work, but it would be valuable to consider this in future.
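
A sketch of the homogeneity-driven block splitting described above. The homogeneity test here is a simple spread threshold on a single availability metric; the paper's criterion and metrics are richer.

```python
def split_blocks(avail, lo, hi, min_size=8, spread=0.2):
    """avail: per-address availability values in [0, 1], indexed by offset.
    Returns a list of (lo, hi) half-open ranges judged homogeneous."""
    block = avail[lo:hi]
    homogeneous = (max(block) - min(block)) <= spread
    if homogeneous or hi - lo <= min_size:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_blocks(avail, lo, mid, min_size, spread)
            + split_blocks(avail, mid, hi, min_size, spread))

if __name__ == "__main__":
    # A /24-sized block: the first half looks like mostly-up servers, the
    # second half like rarely-up, dynamically addressed hosts.
    avail = [0.95] * 128 + [0.10] * 128
    print(split_blocks(avail, 0, 256))        # -> [(0, 128), (128, 256)]
```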

    SOSP 2009: Day 3

    October 19th, 2009

    Clusters

    Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

    • Goal to make large-scale programming simple for all developers.
    • Write a program in Visual Studio; Dryad(LINQ) takes care of shipping it to the cluster, fault tolerance, etc.
    • Wrestling with the implementation of GroupBy-Aggregate. GroupBy takes a sequence of objects with some kind of key, and groups them together by key. Similar to MapReduce.
    • Naïve execution plan splits map and reduce into two phases, with an all-to-all data exchange between them. However, applying the reduce after this exchange results in a large amount of network I/O.
    • A better idea is to do early partial aggregation: use an aggregation tree to achieve this. Reduces the disk and network I/O by up to one or two orders of magnitude.
    • Want to automate this optimization. Programmer writes the obvious code and the system takes care of the rest.
    • Notion of decomposable functions is key to this. Need an initial reducer that is commutative, and a combiner that is commutative and associative (a small example appears after this list).
    • How do we decompose a function? Two ways: iterator and accumulator interface. Choice can have a significant impact on performance.
    • How do we deal with user-defined functions? Try automatic inference, but fall-through to a good annotation mechanism. Implement simple function and annotate it with the initial reduce and combiner implementation function names.
    • Hadoop interface for this adds quite a lot of complexity. Java’s static typing is not preserved.
    • Iterator interface has to build an entire group and iterate through it. Accumulator can discard the inputs if they are not needed. Oracle uses this approach, implemented with stored procedures. Hard to link in user-defined procedures.
    • Automatic decomposition looks at the expression and checks whether all leaf function calls are decomposable.
    • Want our approach to have good data reduction, pipelining, low memory consumption and parallelisability (multicore). Define six strategies, accumulator- and iterator-based.
    • Iterator PartialSort approach. Idea is to keep only a fixed number of chunks in memory; processed in parallel. The bound on memory makes pipelining possible. Strategy close to MapReduce.
    • Accumulator FullHash approach builds an in-memory parallel hash table with one accumulator object per key. Objects are accumulated immediately. This gives optimal data reduction and memory consumption proportional to the number of keys, not records. This is the DB strategy (DB2 and Oracle).
    • Evaluated with three applications: WordStates, TopDocs and PageRank on a 240-machine cluster. Accumulator-based implemen—
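
A small example of a decomposable aggregate, in the spirit of the partial-aggregation discussion above: an average split into a commutative initial reduce and a commutative, associative combine. This is plain Python for illustration, not DryadLINQ code.

```python
from collections import defaultdict

def initial_reduce(records):
    """Runs near the data: records are (key, value); emit (key, (sum, count))."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in records:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)

def combine(partials):
    """Runs at aggregation-tree nodes: merge (sum, count) pairs per key."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return dict(merged)

def final_reduce(merged):
    return {key: s / c for key, (s, c) in merged.items()}

if __name__ == "__main__":
    partition1 = [("a", 1.0), ("b", 4.0), ("a", 3.0)]
    partition2 = [("a", 5.0), ("b", 0.0)]
    partials = [initial_reduce(partition1), initial_reduce(partition2)]
    print(final_reduce(combine(partials)))     # {'a': 3.0, 'b': 2.0}
```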


    SOSP 2009: Day 2

    October 13th, 2009

    I/O

    Better I/O Through Byte-Addressable, Persistent Memory

    • DRAM is fast, byte-addressable and volatile, but disk/Flash are non-volatile, but slow and not byte-addressable. BPRAM is all three!
    • Phase change memory is a promising source for this. Bits encoded as resistivity. Access latency in the nanoseconds, and far better endurance than flash. Designed BPFS for BPRAM.
    • Goal: FS ops commit atomically and in program order. Data is durable as soon as the cache flushes. Use short-circuit shadow paging to get this (new consistency model).
    • Eliminate DRAM buffer cache; use L1/2 instead. Put BPRAM on the memory bus. Provide atomicity and ordering in hardware.
    • Both BPRAM and DRAM are addressable by the CPU: physical address space is partitioned into volatile/non-volatile.
    • BPFS gets better performance than NTFS on the same media.
    • What happens on crash during update? Short-circuit shadow paging comes into play (contrast with journalling or shadow paging). Overhead of journalling is that all data (or metadata) must be written twice. Shadow paging uses copy-on-write up to the root of the FS: drawback is that writes propagate all the way back to the root (multiple updates), and small writes have a large copying overhead.
    • Short-circuit shadow paging makes in-place updates where possible. Uses byte-addressability and atomic, 64-bit writes. Both in-place updates and appends are made simple by this technique (a toy illustration appears after this list). Cross-directory rename does bubble up to the common ancestor.
    • Problem: if data is cached in L1/L2, the ordering of cache eviction can lead to inconsistent states. Also, writes from the cache controller might not be atomic.
    • So add two new hardware components to the CPU and cache controller. Epoch barriers are used to declare ordering constraints and they are much faster than a write-through cache. Also add capacitors to DIMMs which allow writes to propagate even after the loss of power.
    • Do CoW then Barrier then Commit. Paper also shows how to make it work on multiprocessors.
    • Built and evaluated on Windows as an in-kernel file system.
    • Microbenchmarks (append n bytes and random n-byte write) compare NTFS/Disk, NTFS/RAM and BPFS/RAM. (Using DRAM in these experiments.) BPFS is significantly faster than NTFS on disk, and NTFS isn’t syncing so it isn’t durable!
    • Postmark benchmark compares NTFS/Disk, NTFS/RAM, BPFS/RAM and (projected) BPFS/PCM. BPFS/PCM is much faster than both NTFS/Disk and NTFS/RAM. Analytical projection based on sustained throughput of PCM.
    • Q: are the storage requirements of a database and of a file system converging when you have this hardware available? The changes to the hardware will be applicable to other sorts of storage systems like a database, not just filesystems.
    • Q: have you thought about how to expose more capabilities of this medium to the applications (not just sequential reads and writes)? Applications are currently written in terms of what is efficient.
    • Q: how do you atomically deal with a free-list? We don’t have a free-list. Don’t need to keep track of so many data structures because the medium is so fast, which lowers the consistency overhead.
    • Q: where do you go next for multiprocessors and clusters? One of the goals was to have multiple concurrent threads operating on the FS at the same time.
    • Q: is there a risk that data in the capacitor gets garbled after the machine gets switched off? [Taken offline.]
    • Q: could you go even faster without having the consistency guarantees, for applications that don’t need it? There’s always a trade-off here.
    • Q: how do you do mmap, and is meddling with L1/L2 caches going to be expensive? Haven’t implemented mmap yet, but we would have a much better guarantee of durability. Changes to the cache, in the paper have been looked at in terms of interference, and performed well.
    • Q: why didn’t you benchmark against a file system using a B-tree or a red-black tree that can take advantage of random writes [why did you compare against NTFS]? NTFS is widely used, and it’d be interesting to compare against others.
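
A toy illustration of the short-circuit shadow paging rule described above. The data structures are invented for the sketch; BPFS is an in-kernel file system with a full tree of indirect blocks, and the atomicity here is only asserted, not provided by hardware.

```python
ATOMIC_BYTES = 8            # assume the hardware guarantees 64-bit atomic writes

class Block:
    def __init__(self, data):
        self.data = bytearray(data)

class TinyFile:
    def __init__(self, size, block_size=64):
        self.block_size = block_size
        # "Pointer" table: one atomically updatable slot per block.
        self.blocks = [Block(b"\0" * block_size) for _ in range(size // block_size)]

    def write(self, offset, payload):
        i, off = divmod(offset, self.block_size)
        assert off + len(payload) <= self.block_size, "sketch: one block only"
        if len(payload) <= ATOMIC_BYTES:
            # Short circuit: small writes are done in place (atomic by assumption).
            self.blocks[i].data[off:off + len(payload)] = payload
        else:
            # Copy-on-write the block, then commit with one pointer update.
            shadow = Block(self.blocks[i].data)
            shadow.data[off:off + len(payload)] = payload
            self.blocks[i] = shadow          # the single atomic commit point

if __name__ == "__main__":
    f = TinyFile(size=256)
    f.write(0, b"12345678")                  # 8 bytes: updated in place
    f.write(64, b"this is a larger write")   # copy-on-write plus atomic swing
    print(bytes(f.blocks[0].data[:8]), bytes(f.blocks[1].data[:22]))
```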

    Modular Data Storage with Anvil

    • Data storage drives modern applications (everyone has a database) and they are frequently a bottleneck. Hand-built stores outperform general-purpose ones by up to 100x. Observe that changing the layout can substantially improve performance. Custom storage is hard to write, especially in order to provide consistency guarantees. Can be prohibitively expensive to experiment with new layouts.
    • Need a simple and efficient modular framework to support a wide variety of layouts.
    • Fine-grained modules: dTables. These are composable to build complex data stores. All writing is isolated to dedicated writable dTables, which incidentally has good disk access properties.
    • dTable = key/value store. Maps integers/floats/strings/blobs to blobs. Provides an iterator to support in-order traversal. dTables used by applications and frontends, and also other dTables. Can transform data, add indices or otherwise construct complex functionality from simple pieces.
    • Example of a mapping from customer IDs (mostly contiguous) to states. Start with an array dTable for the common case. Layer a dictionary on top of that (maps state names to array indices). Have an exception dTable for the case where a customer isn’t in one of the 50 states, and a linear-search dTable for their residences. But to make this fast, layer a B-Tree index dTable on top of the linear store.
    • So far just read-only. Updates are hard to do transactionally. Need to implement a write-optimized dTable. Fundamental writable dTable is the journal dTable. New data is appended to a shared journal; data are cached in an in-RAM AVL tree. The journal is digested when it gets large. Transaction system is described in the paper. Layer this over read-only dTables (see the sketch after this list).
    • Managed dTable goes at the top. Also have a Bloom filter dTable to deal with multiple overlaid read-only dTables.
    • Many additional dTables listed in the paper.
    • Evaluate  the effect of simple configuration changes on performance (modularity). Key lookup workload, comparing contiguous versus sparse keys. Contiguous good with arrays; sparse good with B-trees. Also show the benefit of layering an index on top of a linear store. Also show the low overhead of the Exception dTable.
    • Evaluated by running TPC-C. Replaces a SQLite backend with Anvil. Shows that Anvil outperforms both the original backend and MySQL. Split read and write stores perform well.
    • Evaluated the cost of digesting and combining. These can be done in the background, taking advantage of additional cores and spare I/O bandwidth. Measured the overhead when doing a bulk load (1GB) into the dTable, with digests every few seconds.
    • Q:  why didn’t you compare either performance or features against BDB, which is very similar? Didn’t find it as easy to construct read-only data stores in BDB: creating customisable data stores has a lot of transactional overhead.
    • Q: did you evaluate iteration? In paper. How does the performance depend on the order of updates? Not sure what you mean. Did look at overlay iteration in the paper, which ought to be the most expensive (due to key lookup cost), and overhead was only 10%
    • Q: how did you make it so that the creator of a new dTable doesn’t have to consider ACID semantics? Most dTables are read-only, so you don’t need to worry about this (like shadow paging). The managed dTable has a small hook that enforces transactional semantics. And read/write dTables? Don’t envision that people will need to create these. Could implement your own, but this would miss the point.
    • Q: how should developers write recovery tools for systems like Anvil? Anvil includes such a tool that handles recovery for you. Read-only semantics makes this much simpler.
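
A minimal sketch of how dTables compose, layering a writable journal dTable over a read-only base. The interfaces are invented for illustration; Anvil itself is a C++ system with many more module types and a real transaction mechanism.

```python
class ReadOnlyDTable:
    def __init__(self, pairs):
        self._data = dict(pairs)            # stand-in for an on-disk layout
    def get(self, key):
        return self._data.get(key)
    def keys(self):
        return iter(sorted(self._data))     # in-order traversal, as dTables require

class JournalDTable:
    """Writable dTable: appends go to a journal; reads are served from memory."""
    def __init__(self):
        self.journal = []                   # would be a shared on-disk journal
        self._mem = {}                      # would be an AVL tree in Anvil
    def put(self, key, value):
        self.journal.append((key, value))
        self._mem[key] = value
    def get(self, key):
        return self._mem.get(key)
    def keys(self):
        return iter(sorted(self._mem))

class OverlayDTable:
    """Reads hit the writable layer first, then fall through to the base."""
    def __init__(self, writable, base):
        self.writable, self.base = writable, base
    def get(self, key):
        value = self.writable.get(key)
        return value if value is not None else self.base.get(key)
    def keys(self):
        return iter(sorted(set(self.writable.keys()) | set(self.base.keys())))

if __name__ == "__main__":
    base = ReadOnlyDTable({1: "CA", 2: "WA"})
    journal = JournalDTable()
    store = OverlayDTable(journal, base)
    journal.put(3, "NY")
    print([(k, store.get(k)) for k in store.keys()])   # [(1,'CA'), (2,'WA'), (3,'NY')]
```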

    Operating Systems Transactions

    • [Unfortunately, I missed this talk due to having an obligation to man the registration desk. I'll try and track down the video and update this later.]

    Parallel Debugging

    Do You Have to Reproduce the Bug at the First Replay Attempt? — PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

    • Concurrent programs are hard to write. Multi-core makes concurrent programming more important, and bugs more common. However they are non-deterministic (requiring e.g. a special or improbable thread interleaving). This makes it hard to reproduce them.
    • Deterministic replay for uniprocessors is relatively easy: only need to record inputs, thread scheduling and return values of system calls. On multiprocessors, this is much more challenging. e.g. Simultaneous memory accesses are another source of non-determinism.
    • Previous proposals introduce new hardware, which don’t exist in reality. Or there are software-only approaches but they have up to 100x slow-down.
    • Ideally want to reproduce a bug with a single replay and no runtime overhead. But what if we relax this slightly?
    • Idea 1: record only partial information during a production run. Idea 2: push the complexity into diagnosis time. Idea 3: use feedback from unsuccessful (non-reproducing) replay runs.
    • Just record a sketch during the production run. When a replay goes off the sketch, terminate it immediately and feed back information about why it deviated for refining the next replay. Can eventually reproduce the bug with 100% probability.
    • Several different methods for sketch recording. Spectrum of approaches from UP deterministic replay to full MP DR. Can record e.g. synchronization points, or basic blocks, or more: build up this information during the replay runs.
    • At replay time, the partial information replayer consults the sketch to see that recorded global ordering is obeyed. How do we know whether a replay is successful? Use a failure detector based on crash, deadlock or incorrect results. This can also detect unsuccessful replay runs.
    • When a replay attempt fails, start it again. But could do something different the next time: a random approach would just leave it to fate, but PRES is more systematic. Failed reproduction is due to un-recorded data races. The feedback generator captures these races and tweaks them in future runs. Start with many candidate races and filter them down.
    • Implemented PRES using Pin. Evaluated many different applications (desktop, server and scientific). Overhead is around 18%, which is barely more than baseline Pin overhead. Macrobenchmarks show that PRES gets much higher throughput for server applications (MySQL).
    • Effectiveness: UP algorithm doesn’t detect any bugs within 1000 replays, whereas PRES gets 12/13 in 10 attempts. Feedback generation is crucial to effectiveness. Race filtering also effective.
    • Q: could you also use execution traces to help track down which parts of the execution trace cause the bug to not happen, and guide the programmer? Good idea.
    • Q: could you apply PRES to virtual machine replay? Also a good point. The work could be integrated with virtual machines. Could you rollback an execution, is it precise enough? Depends on the fine-grainedness of the recording scheme used. What is the inherent overhead in collecting a trace with sufficient fidelity to do backtracking? If there is a lot of lock operations, the low-overhead approach (SYNC) could work. But if there is no synchronization, we can’t use this information, and a more heavyweight scheme would be needed.
    • Q: why do the results differ so much from the next paper? The main idea is similar, but their work is more focussed on static analysis to reduce runtime overhead.
    • Q: would you advocate this as a solution for long-running (one year or more) services, as it is often only after this time that they emerge? We can take a checkpoint of the process state, which solves the problem of data accumulation. 

    ODR: Output-Deterministic Replay for Multicore Debugging

    • Debugging non-deterministic software failures is really hard. The problem is how to reproduce these failures for debugging. Model checking/testing/verification could work, but it’s not perfect, and it doesn’t capture everything. So we need deterministic replay.
    • Need multiprocess operation, efficient recording, no special hardware and the ability to run arbitrary programs without annotation (especially programs with data races). All related work fails to meet one of these requirements.
    • ODR is a user-level replay system, which works in the MP case, has only 1.6x overhead, needs no new hardware and works on arbitrary x86 Linux binaries.
    • Often sufficient to produce any run with the same outputs… needn’t have the exact same execution. So the idea is to relax the determinism requirements.
    • The classic guarantee is value determinism: replay run reads and writes must have the same values as the original. Relax this to “output determinism”: the replay run produces the same user-visible output as the original. This is not perfect, but still useful for debugging: reproduces most visible signs of failure, and enables reasoning about failure’s root cause.
    • How to achieve this? Deterministic-run inference. Basic idea is to translate a program into a logical formula (verification condition). Function of schedule trace, input trace and read trace, returning an output trace. Use a formula solver to yield unknown schedule trace. Scale this by directing the inference using more original-run information. Also by relaxing memory consistency of the inferred run: where values read have nothing to do with schedule order, can use an arbitrary schedule trace.
    • Three-dimensional inference design space: memory consistency (strict, lock order or null), query complexity (output, I/O&lock-order, I/O&lock-order&path or determinant), and inference-time (polynomial or exponential). Search- and query-intensive DRI fit into this space.
    • Search-intensive is really slow (400–60000x slowdown). Formula generation, not solving is the bottleneck. Use multi-path symbolic execution with backtracking to generate formulas. Each backtrack involves a 200x slowdown. Backtracks are caused by race-tainted branches: wrong choice leads to a divergent, unsatisfying path. Work around by backtracking to the most recent race-tainted branch.
    • If we know the path (QI-DRI), inference time improves by 100x but we need to record much more data, so there is a 6x slowdown.
    • Future work is to reduce the path-search space, reduce the cost of each backtrack (cut down on race-tainted branch analysis), and parallelize formula generation by forking threads at each divergence.
    • Q: if I put in arbitrary values in the memory consistency model, how do I ensure that invariants are maintained? We don’t actually do null consistency, but if we did and invariants were violated the program might crash and output determinism catches this (because the output would be different).
    • Q: since your techniques are similar to the last paper, why are the results so different? In our approach, we do race-tainted branch analysis for the entire execution, and that is costly. We also do taint-flow propagation. There is more analysis in each backtracking iteration than theirs. We could reduce this cost by, for example, reusing results from previous iterations.
    • Q: have you considered changing your static analysis to another algorithm that could significantly improve your formula generation time? We are considering static approaches to formula generation.

    Works-in-Progress

    • RAMCloud: Scalable, High-Performance Storage Entirely in DRAM. New research project. Zero results. Motivated by wanting to build large scale systems with low latency. Data center style of web application separates the code from the data, in order to scale, but 4–5 orders of magnitude increase in latency. Want this kind of scaling with latency close to memory speeds (sub-microsecond). Basic architecture puts all data in DRAM. Scale using commodity servers. Reckon we can get 5–10us RPC latency end-to-end. Also have a story on durability and availability. Also want to support multiple applications.

      Q: effect of other memories? Whichever wins should work with RAMCloud

      Q: GMS system from UW 10 years ago? Not familiar with that [taken offline].
    • Transactional Caching of Application Data using Recent Snapshots. DB-driven website performance issues: use memcached. Add an in-memory DHT that is very lightweight, and stores application objects (not a DB). DBs provide transactional consistency, but these caches don’t do this. Goal is transactional consistency for accesses to the cache. Idea is to embrace staleness: all read-only transactions to run on stale data. Avoids blocking and improves utilization. This is quite safe, since stale data is already everywhere. Application can control staleness. Add a TxCache library between memcached and the application. DHT values are timestamped, and have a validity interval.

      Q: paper at HotStorage from HP Labs in similar area?

      Q: how does the DB know the validity interval? Modified DB to track this.
    • Chameleon: A self-managing, low cost file system. Targeting home user or small business, who doesn’t want to lose data. Doesn’t know anything about RAID. Cost-sensitive. Deployment scenario has 4 PCs connected by fast LAN, with a broadband connection to cloud storage. There are many ways to replicate, place and encode data. Ideally store data on at least one offline device to avoid vulnerability to viruses. A small, trusted “anti-availability kernel” enforces this requirement. Use linear programming to select and adapt storage configuration: the design space is now even more complicated. Tend towards the optimal solution.
    • Sloth: Let the Hardware Do the Work! Looked at embedded OSes used in the automotive industry. OSEK OS is the prevalent real-time embedded OS: event-triggered, priority-driven real-time system. Don’t want to implement a scheduler. SLOTH lets the interrupt subsystem do the scheduling and dispatching work. All threads are implemented as interrupt handlers and have interrupt priorities. Each thread needs an IRQ source. Priorities enable pre-emption. Can implement a bunch of synchronization this way also. System is simple, small (concise implementation and memory footprint) and fast (2–20x).

      Q: this looks like very simple scheduling… how would you deal with something more complicated like earliest deadline? Drawback is no blocking system calls, so can’t do everything.
    • The case for cooperative kernel threads. Kernels are multithreaded, and drivers have concurrency bugs, which, if they are in the kernel, is bad. Event-based devices drivers need to use continuations to preserve driver context across blocking operations. This becomes very complex, almost as bad as dealing with pre-emptive threads. Cooperative threads give the best of both worlds: atomic execution but allowing blocking. Research showed that drivers are mostly I/O bound, so cooperative threads are appropriate. Implementing this as a “cooperative domain” in the Linux kernel.

      Q: Linux does have cooperative thread scheduling available, so how does this interact with the work you are doing? Providing a framework for implementing drivers this way, much nicer.
    • Abstractions for Scalable Operating Systems on Manycore Architectures. Tesselation. Goal isn’t just to support heterogeneous hardware, but also provide predictable performance and guarantees for applications. Asymmetrically structured OS: some cores are dedicated as a management unit for keeping track of and scheduling applications. Eliminates the need for per-core runqueues, improves cache locality, decreases lock contention and limits kernel interference with applications. Applications interact with the kernel through remote, asynchronous system calls. Applications make explicit requests for cores, and OS guarantees that they will be gang-scheduled. OS just provides cores to the application, doesn’t need to know about threads. Applications have private memory ranges.

      Q: how do you balance the different demands for resources across applications? [Taken offline.]
    • System Support for Custom Speculation Policies. Applications run on some speculation infrastructure, which speeds things up. Want to separate policy from mechanism. Typically implemented transparently, which means that you have to be conservative, giving limited opportunities for speculation. Idea is to push the policy into the application. What could an application do that is different from the default? Might allow some output to be uncommitted. Or could commit equivalent-but-not-identical results. Process gets a “speculative fork” interface. Use cases: predicting user actions (predictive bash shell), authentication and user-level network services (when you have a predictable protocol).

      Q: how do you ensure errors in the speculative state don’t propagate to the main state? Need to be able to detect this, and could abort speculation in this case.
    • IDEA: Integrated Distributed Energy Awareness for Wireless Sensor Networks. A new “group diet” for wireless sensor networks. Problem of overloaded nodes. Existing solutions are “single node diets”, which are unsatisfactory because nodes have to collaborate. Local efforts cannot go far enough unless there is some cooperation. Aim to improve application fidelity by matching system load to availability. Shift load from overutilized to underutilized nodes, and shift load away from threatened nodes. Like a distributed OS for sensor networks. IDEA evaluates multiple solutions and distributes information to the nodes. Ideal goal is awareness of application constraints.

      Q: Quanto does some cross-node analysis? We’re building on these great ideas.
    • Flicker: Refresh Power Reduction in DRAMs by Critical Data Partitioning. Hardware is over-designed for correctness and reliability. Make it less reliable and tolerate errors in software. Smartphones are a motivation: power consumption is way too high, due to the use of DRAM for responsiveness. Battery drains even when a phone is idle. Goal is to improve power consumption here. If you increase the refresh cycle length, the power consumption drops, but the error rate increases. Currently use 64ms refresh, so could we increase this? Secret sauce is a partitioning into critical and non-critical (e.g. soft-state) data. Map critical data to short-refresh cycle DRAM, and non-critical data to long-cycle DRAM. Requires some hardware changes. Hypothesise that smartphones have a lot of non-critical data. Initial results show 25% drop in power consumption with only 1% loss in reliability.

      Q: [?] Looking at replication and checksumming in other work.
    • BFT for the skeptics. Industry deals with crash failures a lot, so do we need full BFT? We already use checksums, timeouts, sanity checks, etc. to translate faults into crash faults. How often do we get faults that require BFT to handle them? Looked at ZooKeeper and real-world failures. Yahoo!’s crawler uses ZooKeeper extensively. Saw 9 issues, due to misconfiguration (5, BFT wouldn’t help), application bugs (2) and ZooKeeper bugs (2, correlated, BFT wouldn’t help). Could BFT hurt? It has more things to configure, so misconfigurations could become worse. Need to show that BFT really solves a problem before industry will pick it up.

      Q: You showed that correctly implemented BFT couldn’t help with some failures? Failures were correlated, affecting all replicas.
    • Prophecy: Using History for High-Throughput Fault Tolerance. BFT has poor throughput. Need 3f+1 replicas to handle f faulty replicas. Can we improve this for read-mostly, internet workloads? Add a “sketcher” to each replica, which sketches requests and responses. Only one machine sends a full response, the others send sketches. Trades off consistency for performance, which gives delay-once linearizability. Faulty replicas can return slightly stale data. Internet services have unmodified clients and short-lived sessions. Look at performance of PBFT. Can improve by consolidating sketch tables on a trusted sketcher. We already trust middleboxes, so why not trust this too? Performance is much better than PBFT. Work not specific to BFT, and could apply to Paxos, quorums, etc. while getting similar benefits.
    • Securing Hardware Platforms Against Malicious Circuits Through Static Analysis. Make assumptions when building systems. Best way to break a system is to break its assumptions. People assume hardware is correct. What if we can’t make this assumption? Hardware is complex, expensive, static and the base of the system. Do “dead circuit identification”: highlight all potentially malicious circuits automatically. Attacker is motivated to avoid impacting functionality during testing (or else they’d be caught). DCI gets an assertion that says which paths are effectively short circuits. Use these assertions in a new graph algorithm to identify the possibly-malicious, dead circuits. No false negatives, but 30% over-identification. Empirical evidence shows a tight correlation between code coverage and…

      Q: is this primarily at design-time on HDL? Yes, this is one of our assumptions.

      Q: what about redundant circuits for fault tolerance? This is used at design time where you can make calls about this.
    • Enhancing Datacenter Network Security and Scalability with Trusted End Host Monitors. Cloud workload is dynamic and hostile. Key selling point is that multiple tenants can share common infrastructure. Need a new approach to security, because exploits are more likely, and the cloud resources can be used to perform exploits themselves. Cloud datacenters can help: they are centrally controlled so monitoring becomes easier. The software and hardware are homogeneous. Plus a clean-slate approach is possible. Use the hypervisor as a trusted component. Hypervisor can send alarms to a central controller when an attack is detected. Built a prototype from Hyper-V and a trusted Intel NIC.

      Q: if you trust the VM, why do you need to trust the NIC? This gives some useful properties, and the NIC could do some of this filtering for you.

      Q: HotOS paper on this exact topic?

      Q: [?] By “hypervisor” meant the entire virtualization stack, because we didn’t want to make the hypervisor itself any better.
    • Architectural Attacks and their Mitigation by Binary Transformation. What happens if someone tries to attack you from a VM on the same machine in the cloud? There is cross-talk through shared architectural channels. Example is contention for the CPU data cache. This leaks information about the memory access pattern, which could for example be used to leak AES keys. Have shown that EC2 has similar vulnerabilities: placement vulnerabilities, cloud cartography and cross-VM exfiltration are all possible. Approach is to use dynamic binary rewriting to transform x86 instructions so that the architectural effects are mitigated. Can degrade observation of timing, or inject noise and delays to hide the leakage signal. Methodology is to make things secure by default, then come back to improve performance.

      Q: information leakage necessarily arises from statistical multiplexing, and we need statistical multiplexing to get good performance, so how can you address that? Assert that it should be possible.

      Q: how well would existing techniques protect against these attacks? Not aware of techniques that could do this.
    • Execution Synthesis. Say you have a bug in Linux on a remote machine. All you have to work with is a low-detail bug report. Reproducing it is time-consuming. Want a direction finding system from your system to a particular bug. Google Maps doesn’t do this at present…. So your bug report is a stack trace and some register contents. Do VM recording and replay. Don’t expect you to record behaviour that leads to the bug, since then you wouldn’t have the problem in the first place. Don’t care so much about performance, since you don’t run this in production. Find a state in the recording that is close to the bug report, then explore paths iteratively to get closer to the bug. Then you get a sequence of inputs that lead to reproducing the bug. Need a distance function, way of choosing inputs, and good information about what to include in a bug report.

      Q: could you go backwards from a failure state and execute in reverse? We don’t have the entire failure state to begin with.
    • Edge Mashups for Clinical Collaboration. Health industry is going from paper-based to electronic records. Want to empower non-programmers to build applications for real-time collaboration, but need to respect things like HIPAA for logging and data retention. Example use-cases include expert-assisted surgery (call an expert for advice when complications arise, in real-time), and micro-clinics where nurses see the patients, but doctors write prescriptions remotely. Envision a graphical tool that pulls in photographic and chart data, which is synchronized between all participants. State serialized to XML which can be distributed to all the clients. Could be client/server or peer-to-peer. Need logging for accountability. “Break-glass” access control: anyone gets access but they are held accountable after-the-fact. Need low latency so doctors don’t feel that they are wasting time. Might migrate this to the cloud for scaling.

    Kernels

    seL4: Formal Verification of an OS Kernel

    • Formally proved the functional correctness of 8700 lines of C. No bugs.
      Want to build high-assurance systems: small kernels which reduce the trusted computing base. Want strong security properties. Kernel has to be correct: if it falls over, so does the whole system.
      seL4 has capabilities.
    • Proof is that specification and code are equivalent. Need a formal semantics for every system call. Use Isabelle as a theorem prover to bridge the gap between spec and code. But what about assumptions (in the code) and expectations (of the spec)?
    • Assume correct: compiler and linker, 600 lines of assembly code, hardware, cache, TLB management and 1200 lines of boot code.
    • Given these assumptions, we get some nice properties: no null dereferences, no buffer overflows, no code injection, no memory leaks, no div-by-zero, no undefined shift, no undefined execution, and no infinite loops or recursion. Does not imply security, nor that the spec’s expectations match the physical world, nor the absence of covert channels.
    • Proof architecture admits proofs of higher-level properties (e.g. access control).
    • Design is written in Haskell, which can be used to generate Isabelle code automatically.
    • System model has three states: user, kernel and idle. Events are syscall, exception, IRQ and VM fault.
    • Call graph is messy! A microkernel takes all of the messiness and packs it into a very small space.
    • Formal methods practitioners (fans of abstraction) versus kernel developers (exterminate OS abstractions). Different view of the world. Haskell prototype unified these two things: OS people got to implement an OS, while the formal methods people got well-defined semantics. The C code is manually-written and hand-optimized, but based on the Haskell prototype.
    • Aim to reduce complexity. Have to deal with virtual memory in the kernel. But we can put drivers outside the kernel. Concurrency is complex, so use an event-based kernel and limit pre-emption to a few well-chosen points in long-running operations. The C code is derived from the functional representation. Need to support a subset of C: everything from the standard, minus goto, switch fall-through, & on stack variables, side-effects in expressions, function pointers and unions (a small illustration of this style follows at the end of this talk’s notes).
    • Found 16 bugs during testing, and 460 bugs during verification (roughly equally distributed between the C code, the design and the spec). Took 25 person-years in total: $6 million (compared to $87 million for EAL6).
      One of the largest proofs ever done in a theorem prover: 200kloc handwritten, machine-checked proof. Proved 10kloc of OS code.
    • Q: can you comment on what happens when you have to evolve the code? What effort is required? It depends. An optimization on the code level that doesn’t change the semantics might need a few days to re-prove. A new feature that adds new components (like the one in the paper) can be added without re-proving the rest of the kernel, as long as you don’t screw around with existing data structures.
    • Q: does your work solve the stated problem? The assumptions are significant, and you’ve just done a very significant type-check on the code? Is it really possible to solve the originally stated problem? This is just the first step. You can reduce the assumptions with more work. It isn’t the only technique that you should use, so if you deploy it in an Airbus you should also do testing and verification. [Taken offline.]
    • Q: how can you verify something high-level like having the address spaces of two processes being isolated? We do this. You can still use seL4 in a stupid way, but you can use our security model with capabilities and reason about those in the spec. You don’t have to go down to the code.
    • Q: did you see a correlation between the logical errors in the specification and the implementation? Not really. The C bugs were fairly “stupid”: typos, copy-and-paste, etc.
    • Q: wouldn’t it be better to prove temporal properties? Are they expressible? They are expressible. We look at functional correctness, not temporal properties. But you need functional correctness before you can reason about temporal properties.
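
    A small illustration of the restricted C style described above, written for these notes rather than taken from the verified kernel: the first function buries a side effect inside an expression (outside the subset), the second keeps every effect in its own statement.

```c
/* Outside the subset: the increment is a side effect buried in an expression. */
int sum_outside_subset(const int *a, int n)
{
    int i = 0;
    int s = 0;
    while (i < n) {
        s += a[i++];
    }
    return s;
}

/* Inside the subset: every statement has one explicit effect, which keeps the
 * translation to the Isabelle model straightforward. */
int sum_inside_subset(const int *a, int n)
{
    int s = 0;
    int i;
    for (i = 0; i < n; i = i + 1) {
        s = s + a[i];
    }
    return s;
}
```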

    Helios: Heterogeneous Multiprocessing with Satellite Kernels

    • Systems are getting more complicated, from UP to SMP to CMP to NUMA. This is still homogeneous. But hardware is no longer homogeneous: programmable NICs, GPGPUs, etc. Operating systems ignore this heterogeneity: the other devices have different instruction sets and often no cache coherence. This means that the standard OS abstractions are missing, and programming models are fragmented. Can we bring this back into the operating system?
    • Helios is an OS for distributed systems in the small. Use four techniques to manage heterogeneity, simplify app development and provide a single programming model for heterogeneous systems.
    • Result is that it is possible to offload processes to these heterogeneous devices with no code changes. Also improves performance on NUMA architectures.
    • Satellite kernels. Want to make use of an I/O device, but the driver interface is a poor interface for applications that want to use programmable devices. It becomes hard to perform tasks like debugging, I/O and IPC with these devices. The driver doubles as an OS, within the OS itself. A satellite kernel runs on the device itself: fundamentally a microkernel. Also run separate satellite kernels on each NUMA node. Local IPC and remote IPC for communication between satellite kernels.
    • Applications register as services in a namespace. The namespace connects IPC channels.
    • Application placement is constrained by the use of heterogeneous ISAs, an expectation of fast message passing and platform preference. Applications are allowed to specify affinity in their metadata: a hint for where the process should run. Easy for a dev, admin or user to edit affinity. Platform affinity is processed first, and this guarantees certain performance characteristics. Can also contra-locate, e.g. if you don’t want the interference of an anti-virus program running on the same core. Algorithm attempts to balance simplicity with optimality.
    • Applications are first compiled down to MSIL, and then that is compiled down to the appropriate ISA. Can encapsulate multiple versions of a method for different ISA in the MSIL (e.g. fast vector math).
    • Implemented on Singularity, using an XScale programmable I/O card (2GHz ARM processor with 256MB of DRAM). Just need a timer, an interrupt controller and the ability to handle exceptions to implement a satellite kernel for a new device. No need for an MMU (thanks, Singularity!). GPUs are adding timers (Larrabee). Only supports two platforms and a limited set of applications.
    • Evaluated several applications (network stack, FAT32 FS, mail server, web server, etc.) and how easy it was to run them on satellite kernels. Almost no code had to be changed (only in the TCP test harness). One line of metadata had to be changed in almost every case (zero in the other).
    • Offloaded an entire networking stack to the XScale, and showed that the end-to-end performance of PNG compression-and-serving is improved when offloading to the XScale.
    • Considered an email server built on Singularity, using a NUMA box. Emails per second handling improved by 39%. Turned out that the instruction throughput was much higher due to better cache utilization.
    • Q: when you transfer data between two NUMA domains, couldn’t the IPC fail due to memory allocation failures? Singularity is statically verified, using contracts, so we don’t have to worry about that.
    • Q: is this not just 20-year-old distributed microkernel research rehashed? We pay homage to that in the paper. A simple heuristic is sufficient to decide where to run some process, and you need to have process migration anyway, so why not just use that when you get a problem? Process migration is pretty difficult, in the heterogeneous case. Abstraction turned out to be pretty brittle in commodity OSs.
    • Q: is it reasonable to rely on protection from a large runtime? The system isn’t dependent on type-safety.

    Surviving Sensor Network Software Faults

    • Sensors have to operate unattended for months or even years. It’s hard to debug failures considering that the input is unknown. There is no debugger.
    • Safe TinyOS introduces memory safety to sensor nodes. But what do you do when you get a safety violation? In the lab, spit out an error message; in the wild, reboot the entire node (losing valuable soft state and application data).
    • Neutron is a new version of TinyOS. Reduces the cost of a violation by 95–99%. It has near-zero CPU overhead during execution. Runs on a 16-bit microcontroller.
    • A TinyOS program is a graph of software components: statically instantiated code and state. Connections are typed by interface and there is minimal state sharing. Now have preemptive multithreading with a non-blocking, single-threaded kernel. Aim is to separate the program into independent units for recovery. Infer the boundaries of these at compile time. The kernel is a single unit.
    • Units can be rebooted independently. A wrinkle involves cancelling system calls, so you need to block if a syscall is still pending. Blocks of allocated memory are tagged with the owning recovery unit, which enables these to be freed on reboot (by walking the heap). Can even reboot the kernel: just cancel all pending system calls (return ERETRY), and just have to maintain thread memory structures. Applications will continue after the kernel reboots.
    • New idea of “precious state”: a group of precious variables will persist across a reboot. Annotate variables in the source code (a sketch follows at the end of this talk’s notes). Some restrictions on precious pointers. Precious variables must be accessed in atomic blocks. Variables are persisted on the stack across reboot: the set of precious state is usually smaller than the worst-case stack size.
    • Evaluated the cost of a kernel violation in Neutron, compared to safe TinyOS. Looked at three libraries, running on a 55-node testbed. Show the effect of a reboot on the CTP workload. Neutron gets close to the non-reboot case. Also look at the effect on time synchronization in FTSP, showing what proportion of the nodes have unsynchronized time. Again, Neutron gets close to the non-reboot case. Looked at fault isolation. CTP and FTSP data persist across reboots.
    • Main cost is in ROM bytes: 1–5kB of added code, roughly constant.
    • Measured cost of a reboot in milliseconds. A kernel safety violation will result in a 10–20ms outage.
    • Much lighter-weight than microreboots (and lets you reboot a kernel, not a J2EE application).
    • It’s easy to change the TinyOS toolchain, but changing the programming model isn’t, due to the amount of deployed code.
    • Q: how can I reason about a node that has survived a fault (a rebooted node is in a known-good state)? Do you have evidence that this is going to help us? [Showed emails from the questioner.] It is hard to diagnose these faults. Different approaches are possible.
    • Q: what do you think of the alternatives, such as using an MMU, verification or simulation? Well we could add an MMU, but we don’t have it at present. These new developments might make the dependability better. Verification struggles with the huge input space.
    • Q: how do you ensure that the precious state is not corrupted? Using safe TinyOS, so we won’t see memory access violations due to memory safety. Taint in the paper is about inconsistent state, not corruption.
    • Q: what are your criteria for what state should be marked as precious? Don’t have a strong set of guidelines for this, but have done it by inspection so far.
    • Q: what is your fault detection system, and what is its coverage? How do you know when you have a fault? The deputy compiler gets you to annotate code with things like buffer lengths. Can infer faulty behaviour from these. Annotating interfaces tends to be sufficient.
    • Q: this is a very valid approach?
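
    A C-flavoured sketch of the “precious state” idea from the talk. The PRECIOUS and ATOMIC_BLOCK markers are hypothetical placeholders for Neutron’s real annotation and TinyOS atomic sections, reduced to no-ops here so the fragment compiles.

```c
#define PRECIOUS        /* real system: tag the variable to be persisted across a reboot */
#define ATOMIC_BLOCK    /* real system: disable preemption around the block */

PRECIOUS static unsigned short routing_parent;   /* survives a recovery-unit reboot */
PRECIOUS static unsigned short routing_seqno;
static unsigned char scratch[32];                /* ordinary state: reinitialised on reboot */

void record_parent(unsigned short parent)
{
    ATOMIC_BLOCK {                 /* precious variables only touched in atomic blocks */
        routing_parent = parent;
        routing_seqno = routing_seqno + 1;
    }
    scratch[0] = 0;                /* non-precious state needs no such care */
}
```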

    SOSP 2009: Day 1

    October 12th, 2009

    Keynote: Barbara Liskov

    • Outline: inventing abstract data types, CLU, type hierarchy, what next?
    • More of a Programming Methodology talk than a systems talk.
    • Started out in systems with the Venus machine on the Interdata 3. Presented it and its operating system at SOSP 1971.
    • Back in the early 1970’s, people were concerned about the software crisis. (cf. Dijkstra’s 1972 Turing Award lecture, The Humble Programmer.) As machines got cheaper, bigger and faster, software started to matter a lot. A tendency to underprovision the hardware, creating challenges for the software developers.
    • In late 1960’s the field of Programming Methodology began. Started to think about program design and structure: in order to have functionality, maintainability, etc.
    • First paper: Go To Statement Considered Harmful (Dijkstra, 1968). A revolutionary letter to CACM. Use static program text to reason about dynamic program behaviour: it would be useful if these were as close as possible, but GOTOs prevent that. Example of debugging: understand how you got to a particular point in the text. GOTOs make this very difficult. Provoked a huge amount of resistance: can’t program without it. Pointed out limitations of the programming languages: branch into a label table is just a case statement, but they didn’t have case statements.
    • Second paper: Program Development by Stepwise Refinement (Wirth, 1971). European school of software development: a top-down approach, starting with many abstract parts which are not initially implemented. Example was the 8-queens problem.
    • Third paper: Information Distribution Aspects of Design Methodology (Parnas, 1971). A new interest in modularity, and information hiding. “The connections between modules are the assumptions which the modules make about each other.” Hedging about what modules were at the time.
    • Fourth paper: On the Criteria to be used in Decomposing Systems into Modules (Parnas, 1972). How to actually break a system into modules: a hint of data abstraction but the full idea wasn’t there yet.
    • Enter Barbara Liskov: A Design Methodology for Reliable Software Systems (1972). Entire system should be partitioned: no global state and each partition owns some part of the state. Partition exposes operations and only way of interacting with the state would be by calling operations on the modules.
    • Wanted to apply the idea of partitions for building programs. It was unclear how to combine modules to make whole programs. Idea of partitions came out of doing work on operating systems: ur-partitions were the supervisor and user modes. This idea is carried a lot further by ADTs.
    • Idea was to connect partitions to data types. A stroke of inspiration. Ideas often arrive in the middle of the night, or when arriving at work with a fresh mind.
    • March 1973 SIGPLAN/SIGOPS interface meeting on programming methodology was the debut of the idea. Began working with Steve Zilles. At the time, they were knowledgeable about all the languages in existence (FORTRAN, LISP, ALGOL, PL/I, COBOL…). Started to do language design.
    • People were interested in extensible languages as early as 1967 (Schuman and Jourrand, Balzer). How can we help people build dialects of languages that make them easier to use? Looked at syntactic and semantic extensibility. Syntactic extensions written in BNF and added to the language using some kind of preprocessor. Fortunately this died a death…. People were much more worried about writing programs than reading them. Didn’t realise that programs are read more often than they are written. Balzer imagined data types as being collections which allowed four defined operations (add, remove, etc.) with operator overloading.
    • Hierarchical Program Structures (Dahl and Hoare, 1972): Simula 67. Didn’t have encapsulation but did have inheritance to make simulation easier. Precursor to Smalltalk.
    • Protection in Programming Languages (Morris, 1973). Recognised the importance of locality in module comprehension (allows local reasoning). Proposed sealed objects using encryption as an OS mechanism to guarantee locality.
    • Global Variable Considered Harmful (Wulf and Shaw, 1973). In the 1960’s a stream of languages made block structure the big thing. Give locality within blocks, but you can always access things on the outside (i.e. global variables). Made analogy with Dijkstra’s paper that global variables are implicit connections between states of the program, which makes reasoning about it more difficult.
    • Programming with Abstract Data Types (Liskov and Zilles, 1974).
    • Said what ADTs were: a set of operations (whatever ones made sense, not a fixed set), encapsulation was important, and the operations were the only way to access object state.
    • ADTs were “clusters” with encapsulation. Proposed polymorphism, static type checking (but weren’t sure if this was possible due to polymorphism) and exception handling.
    • Why was a new programming language necessary? Needed to communicate the ideas to programmers. Enabled the testing of whether ADTs work in practice. Also made it possible to get a precise definition (tendency to think of the compiler as the language definition…). And to validate whether it was possible to achieve reasonable performance.
    • Goals of language design: ease of use, simplicity, expressive power and performance. First two played off against second two.
    • Also wanted minimality (limiting the language to what we could get by with), uniformity (keep abstract types similar to the built-ins) and safety (find errors as soon as possible: compile time?).
    • Assumptions/design decisions. Wanted language to be heap-based with garbage collection (based on experience with LISP). Program was a collection of procedures rather than a linear piece of code (ALGOL style). People were scared of pointers, but used them to simplify the design. No block structure! Separate compilation of individual modules. Also had static type checking which was meant to speed up finding errors. No concurrency (cut out what wasn’t necessary to simplify a big project). No GOTOs. No inheritance.
    • CLU clusters. A cluster had a header with a list of the operations that it defined.  Thought of operations as belonging to the type rather than the object (passing an object as an argument). Defined the representation of the object internally, and the implementation of the operations. Used “cvt” to define unsealing on operation entry, and sealing on exit (but compile-time checked).
    • Polymorphism: set[T: type], and had a where clause for the type parameter to specify, e.g., that T has an equals: T -> T -> bool function.
    • Exception handling: Issues and a Proposed Notation (Goodenough, 1975). People didn’t know the right model: procedure should terminate (now the status quo), or throw exception to a higher level that would allow it to resume. PL/I had both. How should handlers be specified? At the call-site, or out of the main-line for all invocations to that function. CLU used termination, and specifies the exceptions that a method may call in the header. Handled at the call-site.
    • How to handle exceptions? Handle it, propagate it up the call stack, or signal failure (the exception shouldn’t happen…). Can never be certain that these last exceptions won’t happen, but don’t want to write code to deal with this, so CLU introduced the “failure” exception. Really want accurate interfaces (know exactly what exceptions a method might call) and no useless code.
    • Iterators. For all x in C do S. Solutions were to destroy the collection (repeated removal), or complicating the abstraction (turn it into an ordered set, making it indexable). In summer 1975-ish, the MIT group visited CMU, where the Alphard group were working on “generators”. Thought these were a bit crusty, so invented iterators, which are like procedures that you call and which yield instead of returning. Can nest iterators and recursively call them. Implemented by passing the loop body as an argument to the iterator (a C rendering of this trick appears at the end of these keynote notes). But this limited the expressive power.
    • In 1987, gave a keynote at OOPSLA, but had been ignoring object-oriented languages and inheritance in particular. Took the opportunity to get into the literature. Much of it was very bad/confused. Inheritance being used for two different things: implementation simplification and type hierarchy. The two were not compatible.
    • Implementation inheritance violated encapsulation! Subclasses depend on the implementation of the superclass, making it hard to change the superclass. CLU could do implementation sharing without inheritance.
    • Type hierarchy is much more interesting, but wasn’t well understood. How were stacks and queues related?
    • Led to the Liskov Substitution Principle: objects of subtypes should behave like those of supertypes if used via the supertype’s methods. (Data abstraction and hierarchy (Liskov, 1988).) Didn’t realise that this was a big idea!
    • What next? The world has changed from one where people had no idea about modularity, to one where modularity is based on abstraction.
    • Modern programming languages (Java and C#) are pretty good. Procedures are missing: they are important, and the loss of them as a first-class thing makes the program less simple. Closures are missing, as are iterators. Exception handling is important but failure handling is not done well. Also need built-in types as a basis: extensibility might be going too far. Can we do better than “serialization” (horrible overloading of the term): can’t it be done by garbage collection?
    • The state of programming is pretty lousy. The COBOL programmers of yesterday are now writing web services and browsers. The era of globals has returned. There’s little encapsulation and protection, and yet these are handling confidential information. Problem might be persistent storage violating abstraction: perhaps we would be better with an object store that provides automatic translation and type preservation.
    • Programming language research. Is now the time for some new abstraction mechanisms? Probably not just specification languages. Concurrency and multi-core: modularity is very helpful here, but there is still a lot of work to do. Should distributed systems be programmed in languages that include distribution as a first-class concept?
    • System research has done well. Abstractions like DHTs, map-reduce, client/server, distributed information flow. These have been useful for making progress.
    • Concerned that we trade off for performance (versus simplicity and semantics). Led astray by the end-to-end argument, but not sure that the argument is valid when the end is the user. We know that we’ll never be 100% reliable, but failures should be catastrophes, not laziness (or an optimization).
    • The systems community has thrived because it has been so open: embracing new concepts. Worried about the semantics of the coming Internet Computer: what does it mean to run a program in the cloud (the PL should embrace distribution?). Massive parallelism is also coming as a problem: not sure it necessarily follows from the end of Moore’s Law. But we will need to manage these machines as a distributed system, so there is a lot to learn from that world.
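
    The iterator implementation trick mentioned above (passing the loop body to the iterator) translates naturally into C with function pointers; this small example is mine, not CLU:

```c
#include <stdio.h>

typedef void (*loop_body)(int element, void *env);

/* An iterator over the elements of an array-backed set: it "yields" each
 * element by calling the loop body passed in by the caller. */
static void set_elements(const int *set, int n, loop_body body, void *env)
{
    for (int i = 0; i < n; i++)
        body(set[i], env);
}

static void sum_body(int element, void *env)
{
    *(int *)env += element;
}

int main(void)
{
    int set[] = {3, 1, 4, 1, 5};
    int sum = 0;
    set_elements(set, 5, sum_body, &sum);
    printf("sum = %d\n", sum);    /* prints: sum = 14 */
    return 0;
}
```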

    Scalability

    FAWN: A Fast Array of Wimpy Nodes

    • Awarded best paper.
    • Energy is becoming an important consideration in computing. Google uses $40 million of energy a week. Can we reduce energy costs tenfold, without increasing capital costs?
    • Idea is to improve energy efficiency by using an array of well-balanced low-power systems. For computationally intensive, data-intensive systems.
    • Goal is to reduce peak power: 20% energy loss on power and cooling is considered “good”. 100% utilization => 1000W but 20% utilization => 750W. FAWN wants to get 100% utilization down to 100W, and less utilization to less power.
    • As CPU cycles have improved much faster than disk seeks, we have a gulf in wasted resources. Could rebalance by using very fast disks. Could use slower CPUs and moderately faster storage. Or could use slow CPUs and today’s standard disks. These are not equivalently efficient.
    • Fastest processors have superlinear power usage (Xeon7350, etc.) due to things like branch prediction, caching, etc. which are not so useful for data-intensive, I/O intensive workloads. Custom ARM nodes etc. are slow but power usage is dominated by fixed power cost. In between we have things like the Atom and XScale: FAWN targets these.
    • Application is data-intensive key-value systems. A critical infrastructure service with SLAs, random-access and read-write. Goal was to improve the efficiency in Queries/Joule or Queries/Second/Watt.
    • Prototype using Alix3c2 nodes with flash storage, 500MHz CPU, 256MB DRAM and 4GB flash.
    • Challenges: need efficient and fast failover, wimpy CPUs, limited DRAM and flash that is poor at small random writes.
    • Architecture: a single front-end connected by a switch fabric to many back-ends. The front-end manages the back-ends, acts as a gateway and routes requests. Use consistent hashing. Interesting design decisions are at the back-ends (focus of the talk).
    • High-performance KV store: must map keys to values on the backend. Use an in-memory hashtable to find the location of a value on disk. The 160-bit key is hashed into the hashtable, whose entries contain a key fragment, a valid bit and an offset into the data region (a sketch of such an entry follows at the end of this talk’s notes). On flash, store the entire key, the length and the data itself. Limited DRAM means you only store a fragment of the key in memory (to get a 12-byte hashtable entry): risk of collisions and multiple requests. Get a low probability of multiple flash reads.
    • Small random writes problem avoided by a log-structured datastore. Helps also with node addition, which requires transfer of key ranges: the log structure makes the transfer of key ranges a simple streaming transfer… a simple iteration of the data store. But the SLA means that the source of the data needs to continue serving requests while doing this streaming: the log structure lets you minimize locking. Also need a compact operation on the data store, which also runs in the background.
    • Replication and strong consistency is covered in the paper.
    • Evaluated by considering the efficiency (energy) of KV lookup. Look at the impact of background operations on query throughput. And then TCO analysis for random read workloads.
    • Measured the efficiency of 256-byte KV lookups on various systems, looking at the power outlet power usage. Alix3c2/Sandisk got 346 QPS/Watt, whereas a standard desktop machine with SSD got 51.7 and two hard-disk based systems were 2.3 (MacBook Pro) and 1.96 (desktop).
    • Peak capacity is 1600 QPS. During compaction, ~800; during split, ~1200; and during merge, ~700.
    • At 30% of peak query load, these maintenance operations have almost no discernible impact (~500 QPS in all cases).
    • TCO = Capital Cost + Power Cost ($0.10/kWh). When should you use FAWN?
    • Traditional system with 200W usage (per node: 5 × 2TB disks, or a 160GB PCI-e Flash SSD, or 64GB of FBDIMM), at $2000–8000 per node. FAWN nodes with 10W usage (2TB disk, or 64GB SATA Flash, or DRAM-based). Where storage capacity is dominant, use FAWN + Disk. When you care about query throughput, use FAWN + DRAM. FAWN + Flash covers much of the remainder of the space, but Traditional + DRAM also covers some of the space.
    • To be energy efficient, require some effort on the part of the system developer.
    • Q: impact of networking on performance and cost models? Network is an important component, and we really want things like all-to-all communication, which often needs a high-powered core. But proposals for using low-powered commodity switches at scale mean that is possible to get down to a fixed power overhead per core.
    • Q: off-the-shelf memcached could do better? [Inaudible.] It takes a lot of effort to get very high throughput.
    • Q: what happens when you include latency in the model of TCO? Interactive services that serialize might shift the model? Flash devices gives pretty good common-case latency. What if you have computational load in the query? You have to trade off energy efficiency for longer periods of computation. Get high 99.9% latency during maintenance but it’s still okay.
    • Q: what happens when a frontend thinks one node is down, etc.? Have optional replication to provide constant access to data, and a background replication process.
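
    A rough sketch of the in-memory index entry described above. The field widths are guesses packed to match the 12-byte figure quoted in the talk; the real FAWN-DS layout may differ.

```c
#include <stdint.h>

struct fawn_index_entry {
    uint16_t key_fragment : 15;   /* low-order bits of the hashed key */
    uint16_t valid        : 1;
    uint16_t reserved;
    uint64_t data_offset;         /* offset of (key, length, value) in the flash log */
} __attribute__((packed));        /* 12 bytes when packed */

/* A matching fragment means "probably here": the full key stored on flash is
 * compared after the read, so a collision only costs an extra flash read. */
static int fragment_matches(const struct fawn_index_entry *e, uint16_t frag)
{
    return e->valid && e->key_fragment == (frag & 0x7fff);
}
```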

    RouteBricks: Exploiting Parallelism to Scale Software Routers

    • Awarded best paper.
    • Want to build routers that are fast and programmable. Why do we want them to be programmable? Programmable routers enable ISPs to offer new services (intrusion detection, application acceleration, etc.). They make network monitoring simpler (measuring link latency, tracking down traffic). Finally they make it possible to offer new protocols, such as IP traceback, trajectory sampling, etc. They enable flexible, extensible networks.
    • But today, speed versus programmability is a trade-off. Fast routers are implemented in hardware (Tbps throughput), but offer no programmability. Programmable software routers, by contrast, get throughput < 10Gbps using general-purpose CPUs.
    • RouteBricks uses off-the-shelf PCs, a familiar programming environment and large-volume manufacturing to reduce costs. How do we get to Tbps routers with these ingredients?
    • A router is just packet processing + switching. We have N linecards that handle each input or output port, and each must be able to handle packets at the per-port rate, R. Also have a switch fabric which must switch at N*R.
    • RouteBricks uses one server instead of each line card and a commodity interconnect.
    • Require internal link rates < R and per-server processing rate = c*R (c is a small, reasonable constant). Per-server fanout should be constant.
    • What interconnect satisfies these requirements? Could naively have a crossbar switch, with N^2 line-rate links. Could instead use Valiant Load Balancing, which introduces a third stage between input and output. Have N intermediate servers, with R/N-rate links from each input to each intermediate, and from each intermediate to each output. Now have N^2 links at rate 2R/N (a small worked example of these rates follows at the end of this talk’s notes). Per-server processing rate is 3R, but with uniform traffic patterns each server only needs to process at 2R.
    • But this still gives linear per-server fanout (bad). If we increase per-server processing capacity, could assign multiple ports to each server. Or add a k-degree n-stage butterfly network. Combine these ideas: RouteBricks. Use full mesh if possible and extra servers otherwise.
    • The trade-off depends on the kind of servers that you have. Assuming current servers (5 NICs × 2 × 10G ports, or 8 × 1G ports, and 1 external port per server).
    • With 2R–3R processing rate, we need to optimize the server! Use a NUMA architecture (Nehalem, QuickPath interconnect), 2 quad-core CPUs, and 2*2*10G NICs. Run Click in kernel-mode.
    • First try: got 1.3Gbps per server. Spending a lot of cycles on book-keeping operations, such as managing packet descriptors. So use batched packet operations and CPU polling, which got 3Gbps.
    • Still a problem with how the cores accessed the NICs: queue access. Get synchronization overhead between the two cores. Simple rule: 1 core per port (input or output), gets no locking overhead. But now we have cache misses because of separate cores working on the same packet. So second rule: 1 core per packet. These rules are mutually exclusive!
    • Solution is to use NICs with multiple queues per port. Can now assign each queue to a single core, which achieves our objective.
    • So, use state-of-the-art hardware, modify the NIC driver to do batching, and perform careful queue-to-core allocation.
    • For no-op forwarding, get 9.7Gbps with min-size packets, and 24.6Gbps with a realistic mix.
    • For IP routing (IPSec also but not shown), get 6.35Gbps for min-size, and 24.6Gbps with a realistic mix.
    • Realistic size mix: R = 8–12 Gbps. In this case, we are I/O bound. Min-size packets are CPU-bound.
    • Look at the next-gen Nehalem and do some back-of-the-envelope calculations. Could get R = 23–35 Gbps with upcoming servers for a realistic mix.
    • Prototyped RB4: N = 4 in a full mesh. Realistic size mix was 35 Gbps.
    • Did not talk about reordering (avoid per-flow reordering), latency (24us per server), open issues (power, form-factor, programming model). Slide illustrates Russell’s paradox quite nicely.
    • Q: about programmability, the examples require maintaining cross-packet state, so would the choice of the load-balancing mechanism within the router affect this? When the bottleneck is the CPU, yes, this is important.
    • Q: are the trends on both power and power-performance of next-gen processors in your favour? We spend more energy than the corresponding Cisco router, but we have a lot of room for improvement and will get better. Maybe best thing to do is use general-purpose CPUs with an efficient interconnect.
    • Q: is the real choice between general purpose CPUs and a different programming model? Could you do better with ternary CAMs, etc.? Programmability is just trapping exceptional packets. IP routing is easy? Not suggesting that you just throw away all hardware routers, but you might want to do it in some cases, e.g. for specialized monitoring at the borders of your network. A programmable datapath might make it easy to deploy a new protocol.
    • Q: curious about the reordering phenomenon and how this would affect TCP? What would be the most extreme reordering that an adversary could produce? [Inaudible.]
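
    A back-of-the-envelope helper for the Valiant Load Balancing numbers above (N^2 internal links at 2R/N, per-server processing of 3R worst case and 2R for uniform traffic). The N and R values are arbitrary examples.

```c
#include <stdio.h>

int main(void)
{
    const double R = 10.0;   /* external port rate, Gbps */
    const int    N = 32;     /* number of servers / external ports */

    double internal_link_rate   = 2.0 * R / N;   /* Gbps per internal mesh link */
    double worst_case_processing = 3.0 * R;      /* Gbps per server, adversarial traffic */
    double uniform_processing    = 2.0 * R;      /* Gbps per server, uniform traffic */

    printf("internal links: %d at %.2f Gbps each\n", N * N, internal_link_rate);
    printf("per-server processing: %.0f Gbps worst case, %.0f Gbps uniform\n",
           worst_case_processing, uniform_processing);
    return 0;
}
```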

    The Multikernel: A New OS Architecture for Scalable Multicore Systems

    • Boots into Barrelfish rather than kubuntu to get the video driver to work.
    • How do we structure an OS for the multicore future? Need to deal with scalability and heterogeneity.
    • Sun Niagara has a banked L2 cache for all cores (bad for shared memory). Opteron Istanbul has per-core L2. Nehalem is quite different still. Need to do different things to work well on each of these. An 8-socket Opteron has an “interesting” interconnect. The interconnect matters, especially to access latencies.
    • Can also have core diversity: within a system (programmable NICs, GPUs, FPGAs) and on a single die (for performance asymmetry, SIMD extensions or virtualization support).
    • As core count increases, so does diversity. And unlike HPC systems, one cannot optimize at design time. Need systems software to adapt for you.
    • So now is the time to rethink the default OS structure. Now we have shared-memory kernels on every core, data structures protected by locks and everything else being a device. So propose structuring the OS as a distributed system (like Barbara Liskov said earlier).
    • First principle: make inter-core communication explicit. All communication should be done with messages and no shared state. This decouples the structure of the system from the inter-core communication mechanism, and makes communication patterns explicit (which cores you communicate with etc.). It naturally supports heterogeneity of cores. Also better matches future hardware (no cache coherency? Such as Intel’s 80-core machine). Also allows split-phase operation which makes concurrency easier. And finally, we can reason about performance, scalability, etc. of such systems.
    • Simple microbenchmark to evaluate the trade-off between shared memory and message passing. Several cores update a shared array. In shared-memory case, we get stalling due to cache coherency protocol. This is limited by the latency of interconnect round-trips. Latency is linearish in the number of cores (up to 16), and gets worse with the number of cache lines modified.
    • What if we had a server core that takes messages from a ring buffer per client core (a sketch of such a ring follows at the end of this talk’s notes)? Higher latency to begin with, but it scales much better as the number of cache lines modified is increased. Client overhead is queuing delay at the server. The server’s actual cost is approximately constant, and very low. This would give us many spare cycles if the RPC is implemented split-phase.
    • Second principle: make the OS structure hardware-neutral. The only hardware specific parts should be the message transports and the CPU/device drivers. Makes it possible to adapt to changing performance characteristics. Can late-bind protocol and message-passing implementations.
    • Third principle: view state as replicated. Potentially-shared local state is accessed as if it were a local replica (e.g. scheduler queues, process control blocks, etc.). The message-passing model requires this. This naturally supports domains that don’t share memory.
    • Replicas were previously used as a selective optimization in other systems. The multikernel makes sharing a local optimization, instead (opposite view). Only use shared state when it is faster, and make this decision at runtime.
    • Can support applications that need shared memory if it is available, but the OS doesn’t rely on this.
    • Currently run on x86-64 machines. A CPU driver handles traps and exceptions serially. A user-space monitor mediates local operations on global state. Use URPC inter-core message transport, but we expect that to change.
    • Many non-original ideas (in particular decoupling message-passing from synchronization).
    • Many applications running on Barrelfish (slide viewer, webserver, VMM, OpenMP benchmarks, etc.).
    • How do you evaluate a radically-different OS structure? Barrelfish was from-scratch, so is less complete than other OSs. But we need to show that it has good baseline performance: comparable to existing systems.
    • Case study of unmap (TLB shootdown). Logically need to send a message to every core and wait for all to acknowledge. Linux/Windows use IPIs and a spinlock to do this. Barrelfish makes a user request to a local monitor, and uses message passing.
    • Tried several different communication protocols. One unicast channel per core; a single broadcast channel. Neither of these perform well (especially broadcast because there is no such thing in the interconnect).
    • Really want a multicast optimization: send message once to every socket, which has an aggregation core that forwards it on to local cores. Also use the HyperTransport topology to make decisions about which cores are further away (and hence should receive the message earlier to give parallelism). Need to know the mapping of cores to sockets, and the messaging latency.
    • Use constraint programming and a system knowledge base to perform online reasoning. Prolog query on SKB constructs the appropriate multicast structure. Unmap latency is much less than Windows for >7 cores, and Linux for >12 cores. It also scales much better.
    • Show that there is no penalty for shared-memory user workloads (OpenMP), respectable network throughput, and a pipelined webserver with high throughput.
    • Conclusion: no penalty for structuring using message passing. So we should start rethinking OS structure this way.
    • Q: how do you build an application on top of Barrelfish that tries really hard to ignore topology? Focus so far on how to build the OS, and make no particular demands on application programming. Gone with POSIX so far, but we need some higher-level programming model to express things more easily.
    • Q: about URPC performance, are you running a single thing on each processor and hence is the process only ever spinning waiting for URPC, avoiding any context switching latency? That’s true. We decoupled messaging primitive from notification primitive, because notifications are very expensive (at least on our hardware). This leads to a trade-off. The split phase API makes it efficient to work in the case where you don’t need notification immediately.
    • Q: you didn’t mention virtualization… if we map one VM to one core, we can get good performance, so does Barrelfish obviate the need for virtualization? Barrelfish is viewed as orthogonal to virtualization.
    • Q: your microbenchmark [inaudible]. The point of the benchmark is to make the case that messaging over shared memory can have reasonable performance.
    • Q: as systems get bigger, do you expect messages to get lost? Maybe… if they do get lost, this model is a better way to structure it than shared memory.
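
    A minimal sketch of the message-passing side of the microbenchmark: one single-producer/single-consumer ring per client core, polled by the server core. This illustrates the idea only — it omits the memory barriers a real implementation needs and is not Barrelfish’s URPC code.

```c
#include <stdint.h>

#define RING_SLOTS 64

struct ring {
    volatile uint64_t head;            /* written only by the client core (producer) */
    volatile uint64_t tail;            /* written only by the server core (consumer) */
    volatile uint64_t slots[RING_SLOTS];
};

/* Client core: enqueue a request without touching any shared locks. */
static int ring_send(struct ring *r, uint64_t msg)
{
    if (r->head - r->tail == RING_SLOTS)
        return 0;                      /* ring full: caller retries later (split-phase) */
    r->slots[r->head % RING_SLOTS] = msg;
    r->head++;
    return 1;
}

/* Server core: poll each client's ring in turn and apply the update locally. */
static int ring_recv(struct ring *r, uint64_t *msg)
{
    if (r->tail == r->head)
        return 0;                      /* nothing pending from this client */
    *msg = r->slots[r->tail % RING_SLOTS];
    r->tail++;
    return 1;
}
```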

    Device Drivers

    Fast Byte-granularity Software Fault Isolation

    • Operating systems run many drivers and these run in a fully trusted mode. But they turn out to be a major source of bugs. Existing solutions either require changes to the source code or unacceptably bad performance. Why? They are complex and require complex, fine-grained temporal and spatial sharing within the kernel. But any mistake in this can be fatal….
    • BGI isolates drivers in separate protection domains but allows domains to share the same address space. It provides strong isolation for existing drivers even with a complex kernel API. Write integrity, control-flow integrity and type-safety for kernel objects. Achieves this without changes to drivers or hardware, and only 6.4% CPU overhead (~12% space overhead).
    • Use a BGI compiler on unmodified source code to get instrumented driver, then link this with the BGI interposition library to get a BGI driver.
    • Protection model has kernel in trusted protection domain. Most drivers are untrusted and drivers may share untrusted domains. Each byte has an ACL. Access rights are read and write (default everything readable), indirect call, typed rights for different kernel objects (type safety for mutex, dispatcher, io etc.).
    • Two primitives: CheckRight and SetRight, called by the interposition library (not the driver code). Dynamic type checking is possible using these: check the arguments to kernel API calls, which ensures type safety, and state safety (temporal properties).
    • Compiler checks rights on writes and indirect calls, and grants and revokes write access to the stack as appropriate.
    • Rights are granted on driver load (write to globals, icall right to address-taken functions). Rights granted/revoked to function arguments according to call semantics. Function arguments are also rights-checked.
    • Example of a read request (event-based), and how the BGI compiler inserts inline checks to avoid many problems.
    • Rights change as the driver makes kernel calls, and BGI enforces complex kernel API usage rules (use-before-initialized, write-after-initialized, etc.).
    • Implementation must be fast. Fast crossing of protection domains (just a normal call instruction). Fast rights grant/revoke/check: uses compiler support to inline these checks and perform data alignment.
    • ACL data structure encodes a (domain, right) pair as an integer. Several tables of “drights”: arrays for efficient access, with a 1-byte dright per 8-byte memory slot (a sketch of the lookup follows at the end of this talk’s notes). Need a conflict table when rights ranges overlap. Want to avoid accesses to these conflict tables by aligning data on 8-byte memory slots. Makes the non-conflict case common at the cost of space. Heap objects are 8-byte aligned, and also need special drights for writable half-slots.
    • SetRight is implemented with 4 x86 assembly instructions. Uses arithmetic shift to obtain same code for kernel and user space addresses (with different tables).
    • CheckRight has fast check (5 instructions) and slow check (7 instructions).
    • BGI has a recovery mechanism that runs driver code inside a try block (cf. Nooks).
    • Evaluated on 16 Vista device drivers: 400 KLOC.
    • Evaluated for fault containment. Injected faults in the source of fat and intelpro drivers (following previous bug studies). Measured the number of faults contained by BGI: fat = 45/45, intelpro = 116/118.
    • Measured file I/O performance with Postmark. Max CPU overhead was 10% (for FAT) and max throughput cost was 12.3% (for FAT).
    • Measured network performance (max 16% CPU overhead and 10.2% throughput decrease both for UDP send).
    • Found 28 new bugs in widely used drivers. Some were serious: writes to incorrect places or use of uninitialized objects. Some less so: abstraction violation, etc. BGI is a good bug-finding tool.
    • Q: subtle problems such as virtual address aliasing or out-of-bound pointers within the bounds of other objects… how do you handle this? The granting of access rights is done according to the kernel API, for write/control-flow integrity and type safety. But we wouldn’t catch the case where a driver makes a write to something that it is allowed to write to, but isn’t the right thing.
    • Q: how do you deal with type-unknown pointers? Can catch errors on data buffers. But where pointers are passed as arguments, you typically don’t do arithmetic on these.
    • Q: compared with SFI work on already-compiled code (compiler not in TCB), could you do this on object code instead of source code? We have a binary version, but we report on the compiler version because it has better performance.
    • Q: the transformations looks similar to existing SFI, so to what do you attribute this improvement in performance? Dealing with complex kernel API and enforcing fine-grained sharing. Previous work did a lot more copying or had expensive crossing of protection domains, in order to deal with complex sharing.
    • Q: you’re redoing virtual memory protection with a complex compiler? Why not just use a microkernel? The performance can’t be that bad? The goal was to support existing drivers that run on existing operating systems with good performance. Could you do the same transformations on existing code to run in separate address spaces? Maybe, but don’t know of a way to do this. Goal was to deal with existing drivers on existing OS.
    • Q:  could you just insert latent compiler-inserted checks to avoid zero-day exploits, which are only turned on on zero-day? An interesting idea, but it’s better if you can cheaply run it all the time in deployed systems… you will find more bugs this way.
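
    A toy sketch of the dright-table fast path described above: one byte of encoded (domain, right) per 8-byte slot, indexed by shifting the address. The table size, masking and encoding are simplifications, not BGI’s actual layout.

```c
#include <stdint.h>

#define SLOT_SHIFT  3                        /* 8-byte memory slots */
#define TABLE_SLOTS (1u << 20)               /* toy table covering a small toy address range */

static uint8_t dright_table[TABLE_SLOTS];

static inline uint32_t slot_index(uintptr_t addr)
{
    /* Real system: separate kernel/user tables selected by an arithmetic shift,
     * with no masking needed; the mask here just keeps the toy example safe. */
    return (uint32_t)(addr >> SLOT_SHIFT) & (TABLE_SLOTS - 1);
}

/* Grant or revoke: called by the interposition library around kernel API calls. */
static inline void set_right(uintptr_t addr, uint8_t dright)
{
    dright_table[slot_index(addr)] = dright;
}

/* Fast check inserted by the compiler before writes and indirect calls; the
 * real system falls back to a conflict table when rights within a slot differ. */
static inline int check_right(uintptr_t addr, uint8_t expected)
{
    return dright_table[slot_index(addr)] == expected;
}
```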

    Tolerating Hardware Device Failures in Software

    • Device drivers assume device perfection. But we can see hardware-dependence bugs across driver classes. Transient failures cause 8% of all unplanned reboots. The existing solution is to hand-code a hardened driver, which gets this down to 3%. What can we do with software fault tolerance? Detect hardware failures and perform recovery in software.
    • Where do bugs come from? Device wear-out, insufficient burn-in, bridging faults, EMF radiation, firmware bugs, corrupted inputs, timing errors, unpredictable DMA, etc.
    • Vendors give recommendations to driver developers. Firstly, validate all input coming from the hardware. Then ensure all operations take finite time. Failures should be reported. There are guidelines for recovery also. Goal is to implement as many of these as possible automatically.
    • System is called Carburizer. Runs on driver source code, and compiler generates hardened driver that links to the Carburizer runtime.
    • Goal is to fix these bugs with static analysis. Find driver code that uses device data, and ensures that the driver performs validity checks. Fixes bugs from infinite polling, unsafe array reference, unsafe pointer dereference and system panic calls.
    • First, use CIL to identify tainted variables. Consult a table of functions known to perform I/O and mark heap and stack variables that receive data from these. Propagate taint through computation and return values. Now find the risky uses of these variables. e.g. Find loops where all terminating conditions depend on tainted variables.
    • Now look for tainted variables used to perform unsafe array accesses: used as array index into static or dynamic arrays.
    • Evaluated by analysis of Linux 2.6.18.8. Analysed 2.8 million LOC in 37 minutes (including analysis and compilation). Found a total of 992 bugs in driver code, with a false positive rate of 7.4% (based on a manual sampling of 190 of the bugs). May also be some false negatives, because the analysis doesn’t track taint via procedure arguments.
    • Automatic fixing of infinite loops by inserting timeout code (a before/after sketch follows at the end of this talk’s notes). Inserted code can never harm the system because the timeout is conservative.
    • Inserts code to bounds-check arrays when an unsafe index is used.
    • Empirical validation using synthetic fault injection on network drivers: modify return values of in and read functions. Without Carburizer, the system hung; with it, the driver recovered. Works well for transient device failures.
    • Also want to report device errors to fault management systems. Carburizer (i) detects the failures, and (ii) reports them. Report things like loop timeouts, negative error returns and jumps to common cleanup code. Also looks for calls to functions with string arguments, and considers these to be error reporting, so it doesn’t insert additional reporting in this case.
    • Evaluated failure reporting with manual analysis of drivers of different classes (network/scsi/sound). No false positives, but a few false negatives. Overall fixed 1135 cases of unreported failures. Thus it improves fault diagnosis.
    • Static analysis is insufficient. We also need to consider runtime failures, such as missing and stuck interrupts. Can detect when to expect interrupts, and invoke ISR when bits are referenced but there is no interrupt activity. Can also detect how often to poll, which reduces spurious interrupt invocation, and improves overall performance.
    • Can also tolerate stuck interrupts by seeing when an ISR is called too often, and converts from interrupt-based to polling in this case. This ensures the system and device make forward progress.
    • Evaluated effect on network throughput: overhead is < 0.5%. And on CPU utilization, this goes up 5% for nVidia MCP 55 when recovery is used; no cost when just doing static bug fixing.
    • Covers 8/15 of the recommendations for driver writers, automatically.
    • Q: is your intention that driver developers should run Carburizer only in development and incorporate its fixes, or to run it on deployed code? Encourage sysadmins to run the tool all the time to cover as many bugs as possible.
    • Q: talked about transient device failures, but what about transient errors in the CPU? In this case, even the code execution will fail. Can you comment on the relative frequencies of these kinds of failure? Not aware of this data. Cannot trust anything when the CPU fails, so not sure what we can do about this.
    • Q: do you have a rough feel about how many of the found bugs could be used by malicious attackers to perform code injection? Don’t know off hand. Certainly very easy to hang the system, so could do this maliciously.
    • Q: could you talk about the static analysis that you are using? How do you handle pointer aliases? Analysis is very simple, and we do have 7.4% false positives. But the combination of our techniques gives us this low rate.
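A minimal Python sketch of the taint-tracking step described above, over a toy three-address IR rather than CIL and real C driver code: variables assigned from known I/O routines are marked tainted, taint propagates to a fixpoint, and loops whose exit conditions are all tainted, or tainted array indexes, get flagged for the timeout and bounds-check fixes. All function names, variables, and the IR encoding are illustrative, not Carburizer's actual implementation.

```python
# Toy IR: assignments are (dst, op, srcs); loops list their exit-condition variables;
# array accesses record the index variable.  Illustrative only.
IO_FUNCTIONS = {"readl", "inb", "pci_read_config_byte"}   # assumed I/O routines

assignments = [
    ("status", "call", ["readl"]),        # status = readl(reg)  -> tainted
    ("masked", "and",  ["status", "MASK"]),
    ("count",  "const", []),
]
loops = [
    {"name": "poll_loop",  "exit_conditions": ["masked"]},           # all tainted -> risky
    {"name": "retry_loop", "exit_conditions": ["masked", "count"]},  # 'count' untainted -> ok
]
array_accesses = [
    {"array": "ring", "index": "status"},   # tainted index -> unsafe
    {"array": "buf",  "index": "count"},
]

def propagate_taint(assignments):
    """Iterate to a fixpoint, marking any variable derived from device data."""
    tainted = set()
    changed = True
    while changed:
        changed = False
        for dst, op, srcs in assignments:
            if dst in tainted:
                continue
            if (op == "call" and any(s in IO_FUNCTIONS for s in srcs)) or \
               any(s in tainted for s in srcs):
                tainted.add(dst)
                changed = True
    return tainted

tainted = propagate_taint(assignments)
for loop in loops:
    if loop["exit_conditions"] and all(v in tainted for v in loop["exit_conditions"]):
        print(f"possible infinite poll in {loop['name']}: insert a conservative timeout")
for acc in array_accesses:
    if acc["index"] in tainted:
        print(f"unsafe array index '{acc['index']}' into {acc['array']}: insert bounds check")
```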

    Automatic Device Driver Synthesis with Termite

    • Conventional driver development requires acquiring necessary information from OS interface spec and device spec. Need to combine this information to come up with a driver implementation that appropriately translates OS requests into driver requests.
    • If we formalize these specs, we can do better by synthesizing drivers, and there is no longer a need for developers to be experts in both the OS and the device; they only need to know one well. Furthermore, a spec can be written once and drivers synthesized many times.
    • Use finite state machines as the basic formalism for writing these specs. Note device-initiated transitions and software-initiated transitions.
    • First step is to take two state machines (OS and device) and synthesize a combined state machine that considers all possible transitions, any of which represents a legal transition in the driver. Also associate timeout labels with transitions. (A rough product-machine sketch appears after this list.)
    • Now translate the state machine into C source code (simple and covered in the paper).
    • A real device has multiple functional units, so can’t possibly use a single FSM for all of these. A new language is used to compose multiple FSMs together (one per functional unit). Need a synthesis algorithm that can handle this: need to deal with the state explosion problem (by exploring the state space incrementally). Also need to deal with data, and cannot model all possible assignments to each variable. Instead manipulate data symbolically.
    • Termite successfully synthesizes drivers for the ASIX AX88772 USB-to-Ethernet adapter (on Linux) and the Ricoh R5C822 SD host controller (on Linux and FreeBSD).
    • <1 kLOC for OS interface spec. <700 LOC for device specs. Synthesized drivers are 2 to 4 times larger than the Linux drivers.
    • Showed a demo of a visual Termite debugger. Allows single-stepping and setting of breakpoints on states.
    • Performance is very close to the native drivers.
    • Some limitations: cannot specify constraints on data (alignment, fragmentation, etc.), complex inter-variable relations are not supported (limitation of symbolic execution engine), the structure of specifications is restricted, and Termite drivers require runtime support. Not conceptual limitations, only of the current implementation.
    • One issue is how you write the specifications. This is particularly onerous for the device manufacturers. Also a potential source of bugs. Need a big brain to do this based on the HDL of the device. Ideally we would automate this translation (HDL to driver spec). Since you write devices in HDL anyway, this shouldn’t be too bad.
    • Not much open-source hardware around, so there was a difficulty in finding hardware on which to evaluate this.
    • Q: to make this practical, there are a few questions: performance? Scalability (for big devices like video cards)? Performance hasn’t been an issue so far, because the device developer makes the same assumptions that Termite ends up making, so the generated code is quite similar. (But the devices are simple?) So far this hasn’t been a problem. Scalability is more of an issue. Looking at using better symbolic execution that ignores irrelevant relations between variables.
    • Q: how do you deal with firmware in the device (also specified in HDL)? At the moment, the manual process has to cover this. Once we automate this, it might be more challenging, so we might want to generate firmware as well.
    • Q: what about the data size of compiled code, and the CPU utilization when saturating the device? Code and data size doesn’t look very big (roughly proportional to the amount of source code). Haven’t looked so much at data size.
    • Q: does your spec language have room for quirk tables and other crap? [Taken offline.]
    • Q: the OS spec is intended to be device independent (e.g. generic for GPU, ethernet driver, etc.), but how do you cope with new features in a device, which people like to have? All the interesting stuff is on the side? How does that play with the OS spec if you want to take advantage of these features? We cope with standard features, could include “semi-standard” optional features in the OS spec. For unique features, you would need to extend (but not rewrite) the OS spec. No experience with doing this.
    • Q: the requirement for a full functional spec of the OS driver interface is somewhat intimidating, so what is your experience in making these, and how do they scale? It’s generally doable, but we would want to create a special-purpose language to make this easier.
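A rough Python sketch of the combined-state-machine step, using two tiny hypothetical FSMs (this is not Termite's spec language or synthesis algorithm; real Termite also handles data symbolically and composes many FSMs): driver states are pairs of OS-spec and device-spec states, shared labels must be taken jointly, and each remaining transition is a legal driver action.

```python
from itertools import product

# Hypothetical OS-interface spec and device spec: (state, label) -> next state.
os_spec = {
    "states": {"idle", "tx_requested", "tx_done"},
    "trans": {("idle", "os_send"): "tx_requested",
              ("tx_requested", "complete"): "tx_done",
              ("tx_done", "ack_os"): "idle"},
}
dev_spec = {
    "states": {"ready", "sending"},
    "trans": {("ready", "write_tx_reg"): "sending",
              ("sending", "complete"): "ready"},   # device-initiated transition
}

def combine(a, b):
    """Product machine: shared labels are taken jointly, others independently."""
    shared = {l for (_, l) in a["trans"]} & {l for (_, l) in b["trans"]}
    combined = {}
    for (sa, sb) in product(a["states"], b["states"]):
        for (s, label), nxt in a["trans"].items():
            if s != sa:
                continue
            if label in shared:
                if (sb, label) in b["trans"]:
                    combined[((sa, sb), label)] = (nxt, b["trans"][(sb, label)])
            else:
                combined[((sa, sb), label)] = (nxt, sb)
        for (s, label), nxt in b["trans"].items():
            if s == sb and label not in shared:
                combined[((sa, sb), label)] = (sa, nxt)
    return combined

for (state, label), nxt in sorted(combine(os_spec, dev_spec).items()):
    print(f"{state} --{label}--> {nxt}")
```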

    Debugging

    Automatically Patching Errors in Deployed Software

    • Your code has bugs and vulnerabilities, but attack detectors (code injection, memory errors, etc.) exist. What do you do in this case? At the moment, just crash the application, which is a straightforward DoS.
    • ClearView protects against unknown vulnerabilities, preserves functionality and works for legacy software.
    • Zero-day exploits are a problem for hard-coded checks because they are unknown in advance.
    • The application must continue to work (especially if mission-critical) despite attacks. A patch can repair the application. (Mind you, we shouldn’t always keep the application running: sometimes crashing is correct behaviour.)
    • Want to do this without access to the source code, so can’t rely on built-in features. Needs to run on x86/Windows.
    • Use learning as the secret sauce. Normal executions show how the application is supposed to run. Attacks provide information about the vulnerability, and can be used to give the system immunity. The first few attacks may crash the system, however.
    • Detection is pluggable: tells you whether an execution was normal or an attack. Learning learns normal behaviour from successful runs, and checks constraints during attacks. This gives a predictive constraint, which is true on every good run, and false during every attack. Repair component creates a patch to re-establish constraints and invariants. System evaluates patches and distributes them to deployed applications.
    • Use off-the-shelf detectors.
    • Assume a single server and several community machines running the application. (Assume that they are not exploited to begin with.) Community machines report constraints back to the server. Use code injection and memory corruption attack detectors (others are possible).
    • On an attack, the detector collects information and terminates the application. Server attempts to correlate this information with a constraint: leads to a predictive constraint. The server generates appropriate patches and distributes the best of these. The quality of patches can be refined by information about successful or failed attack attempts (failed or successful defenses). Redistribution is then possible.
    • How do we learn normal behaviour? Use an ML technique called dynamic invariant detection (previous work), which has many optimizations for accuracy and speed. Technique was enhanced for this project. (A toy sketch of the idea appears after this list.)
    • Inference results. ML technique is neither sound (overfitting) nor complete (templates are not exhaustive). However it is useful and effective. Sound in practice, and complete enough.
    • How do we learn attack behaviour? Attack detectors aim to detect problems close to their source. Code injection uses Determina Memory Firewall (triggers when control jumps to code outside the original executable). Memory corruption uses Heap Guard (triggers on sentinel value overwrite). Techniques have low overhead and no false positives.
    • Server pushes out appropriate instrumentation to the community. Only check constraints at attack sites (low overhead).
    • Repairing installs additional detectors to see if you have a bad patch (e.g. looking for assertion violations).
    • Attack example is a JavaScript system routine, written in C++. It doesn’t perform typechecking of its argument, so the vtable may be corrupt. Have a predictive constraint on the operand of the JSRI instruction.
    • Aim to fix a problem while it is small, before the detector is invoked. Repair isn’t identical to what a human would write, but it is much more timely.
    • Patches are evaluated in the field (do they avoid triggering the attack detector or prevent other behaviour deviations?).
    • Evaluated with a Red Team that created 10 exploits (HTML pages) against Firefox 1.0. ClearView was not tuned to known vulnerabilities in that version, but the learning component focussed on the most security-critical components. Red Team had access to all project materials.
    • ClearView detected every attack and prevented all exploits. For 7/10 vulnerabilities, ClearView generated patches that maintained functionality after an average of 4.9 minutes and 5.4 attacks. Handled polymorphic attack variants, simultaneous and intermixed attacks, and had no false positives (installing a patch when not under attack). Low overhead for detection and repair (considering this is an interactive application, not surprising).
    • What about unrepaired vulnerabilities? 1. ClearView was misconfigured. 2. Learning suite was too small. 3. Needed constraint not built into Daikon. All were zero-day attacks against the system, and all were trivial to fix with minor changes to ClearView.
    • Q: introducing code is a bit scary… what if one of the patches introduces a new vulnerability? Firstly, a patch is only generated once an exploit has been detected. The Red Team tried this and failed. In one case, ClearView found a vulnerability in its own injected code.
    • Q: if I were an attacker who wanted to DoS your system (knowing ClearView was running), I might try to disable ClearView somehow by making the ClearView DB learn an incorrect fact… so is your system vulnerable to that kind of attack? It doesn’t matter what facts are true during attacks, so you’d have to find good executions that weren’t observed as being bad to poison the database. It’s conceivable and theoretically possible that you could do that, but I don’t know if it’s practical.
    • Q: what is the overhead of inserting invariants at every instruction? You will see between 5 and 10 constraints per instruction… learning is the biggest bottleneck, but you could distribute this amongst the community. In terms of unsoundness, we’re not seeing that as a problem.
    • Q: how sensitive are you to the invariants that I have to specify for the patches, in the case where continuation introduces incorrectness into the persistent state? This is a policy decision: an hour of downtime for a bank is $6 million, so it may be better to keep running and fix things up later.
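A toy Python sketch of the predictive-constraint idea, standing in for Daikon-style dynamic invariant detection: infer simple candidate invariants from values observed at one program point on good runs, and keep those that fail on every attack run. The invariant templates, values, and thresholds are made up for illustration.

```python
def infer_invariants(good_values):
    """Daikon-style candidate invariants from values seen at one instruction."""
    invs = []
    if len(set(good_values)) == 1:
        invs.append(("== constant", lambda v, c=good_values[0]: v == c))
    if len(set(good_values)) <= 5:
        invs.append(("one of observed set", lambda v, s=set(good_values): v in s))
    invs.append((">= observed minimum", lambda v, m=min(good_values): v >= m))
    return invs

def predictive_constraints(good_values, attack_values):
    """Constraints true on every good run and false on every attack run."""
    return [name for name, check in infer_invariants(good_values)
            if all(check(v) for v in good_values)
            and not any(check(v) for v in attack_values)]

# e.g. the target of an indirect call (the JSRI operand) in normal runs vs. an attack
good = [0x401000, 0x401000, 0x401080, 0x401000]
attack = [0x7ffd1234]          # injected code outside the original executable
print(predictive_constraints(good, attack))
```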

    Debugging in the (Very) Large: Ten Years of Implementation and Experience

    • 10 years of work by the first 8 authors.
    • Even Microsoft’s shipping software has bugs! (And so does your hardware….)
    • A bug is a flaw in program logic; an error is a failure in execution caused by a bug (1 bug -> many errors).
    • How does Microsoft find out when things go wrong? We want to fix bugs regardless of source, prioritize bugs affecting the most users. Kernel crashes (BSOD), application crashes, everything down to invariant violations.
    • Windows Error Reporting. What happens after you click “Send Error Report”?
    • Server is over-provisioned to handle 100 million reports per day. 17 million programs have records in WER. 1000s of bugs have been fixed. Uses 200TB of storage, 60 servers over 10 years. Anyone in the audience can get access to WER data….
    • Debugging in the large makes the user-developer feedback loop much longer, both in terms of the number of people and the latency. The problem is the human bottleneck (both in accuracy and latency). Goal was to remove humans from the loop.
    • On an error, collect a minidump: the stack of the erroneous thread and a little extra context. If the user allows it, upload this to WER. An analysis procedure (!analyze) runs over all of these minidumps and clusters them into buckets.
    • !analyze takes a minidump as input, and outputs a bucket ID. So increment the bucket count and prioritize buckets with the highest count. Actually upload only the first few minidumps for a bucket; after that just increment the count (a simplified sketch of this flow appears after this list). Sometimes you need a full core dump, and programmers can request this to be collected on future hits.
    • 2-phase bucketing strategy: labelling on the client (bucketed by failure point) and classifying on the servers (consolidate versions and replace offsets with symbols; find callers where the bug might be (if it calls known-good code)). This refines the bucket ID… more details in the paper.
    • One bug can hit multiple buckets (up to 40% of error reports). Also multiple bugs can hit one bucket (up to 4% of error reports). Bucketing mostly works… scale is our friend (throw away a few here and there and you still have enough to debug).
    • Bucket hits for a given program look like a Pareto curve. Just 20 buckets in Word 2010 account for 50% of all errors. Only fixing a small number of bugs will help many users.
    • Earliest success story was finding heisenbugs in the Windows kernel (>= 5 years old). Vista team fixed 5000 bugs in the beta. One anti-virus vendor accounted for 7.6% of all kernel crashes: in 30 days this was brought down to 3.6% of all kernel crashes. Office 2010 team fixed 22% of reports in 3 weeks.
    • Example of hardware errors too: failure in a CPU (exact same revision and stepping). The chip vendor knew that the bug existed and didn’t think that it would get hit in real life. Error reports dropped dramatically after the work-around was applied.
    • Also hardware failures in SMBIOS of a particular laptop (buffer overrun); motherboard USB controller (only implemented 31/32 of DMA address bits). Lots of information about failures due to overclocking, HD controller resets and substandard memory.
    • Also looked at malware. The Renos social-engineering worm caused Explorer.exe to crash when people downloaded something from an email. Saw a spike, issued a worm-removal tool through Windows Automatic Update, and saw the reports decline very quickly. Shows that WER scales to handle global events.
    • Distributed system architecture hasn’t changed in 10 years, and yet scales to global events.
    • Product teams now have ship goals based on reducing WER reports: led to a numerically-driven debugging approach. Fundamentally changed software development at Microsoft.
    • Q: what about privacy? Most private data ends up in the heap, not on the stack. Only collect stack. Also do some data-scrubbing based on things we know about (user ID, etc.). Looked for SSNs, credit-card numbers, etc.: found fewer than 10 possible matches in 100,000 error reports. MS also puts very strong employment conditions on how this data is accessed and used.
    • Q: []. Team looks for zero-day exploits in the WER data regularly. Philosophy is that there should be no overhead when users don’t hit an error report. 90% of users only ever have to do an increment of a counter.
    • Q: are you keeping track of how many people don’t send error reports? We have good estimates of opt-in rates, from other systems that collect information from machines that are running normally (only if people opt-in to those programs of course). If half the world turned off the error reports, we’d still get enough information.
    • Q: does analyzing these bugs tell you about common errors programmers make? Checking for NULL is the most important one. This has generated many guidelines that have been fed back into internal development processes.
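A simplified Python sketch of the reporting and bucketing flow described above, with made-up field names and a hash in place of !analyze's real bucketing logic: label a crash by its failure point, count hits per bucket, and store full minidumps only for the first few hits.

```python
import hashlib
from collections import defaultdict

MINIDUMPS_PER_BUCKET = 3          # only the first few hits upload a full minidump

bucket_counts = defaultdict(int)  # stand-in for the WER database
stored_minidumps = defaultdict(list)

def client_bucket_label(crash):
    """First-phase (client-side) label: bucket by failure point."""
    key = f"{crash['program']}|{crash['version']}|{crash['module']}+{crash['offset']:#x}"
    return hashlib.sha1(key.encode()).hexdigest()[:16]

def report(crash, minidump):
    bucket = client_bucket_label(crash)
    bucket_counts[bucket] += 1
    if bucket_counts[bucket] <= MINIDUMPS_PER_BUCKET:
        stored_minidumps[bucket].append(minidump)   # later hits just increment the counter
    return bucket

crash = {"program": "word.exe", "version": "14.0", "module": "mso.dll", "offset": 0x1a2b}
for _ in range(10):
    report(crash, b"<minidump bytes>")
# prioritise buckets with the highest counts
print(sorted(bucket_counts.items(), key=lambda kv: -kv[1]))
```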

    Detecting Large-Scale System Problems by Mining Console Logs

    • It is challenging to detect problems in large-scale internet services. Requires instrumentation (expensive to maintain, and may use modules that aren’t instrumented). So can we use console logs in lieu of instrumentation? They are easy for the developer to insert in their code, but they are imperfect (not originally intended for instrumentation). It is non-trivial to analyze their free-text contents.
    • Go from 24 million lines of log messages, finding a small number of abnormal log segments, to a single page of visualization. Fully automatic process without manual input.
    • Use machine learning based on carefully-defined (numerical) features in the logs. Parse the program source-code to infer information about the log contents, to generate appropriate features. Finally visualize the results.
    • Key insight: log contains the necessary information to create features. Identifiers and state variables are useful. Important information is in the correlation between messages (examples taken from HDFS (Hadoop Distributed File System)). e.g. “receiving block x” followed by “received block x” is normal; without “received block x”, you have an error.
    • First step is to parse free-text logs into semi-structured text. Look at program source to generate regular expressions that extract state variables (the block number, for example). It becomes non-trivial in OO languages, where type inference on the whole source tree is necessary. Yields highly accurate parsing results.
    • Identifiers are widely used in logs (filename, object keys, user IDs, etc.). Do group-by on identifiers. Identifiers can be discovered automatically.
    • Now build a numerical representation of traces for feature creation. Approach is similar to the bag of words model in information retrieval. Yields a message term vector with term frequencies.
    • Can use these vectors to do anomaly detection. Use Principal Component Analysis (PCA) to capture normal patterns in all vectors, based on correlations between dimensions of the vector. (A condensed sketch of the pipeline appears after this list.)
    • Ran an experiment on Amazon EC2. 203 nodes ran Hadoop for 28 hours, with standard map-reduce jobs (sorting etc.). Generated 24 million lines of console logs with ~575000 HDFS blocks. 575000 vectors lead to ~680 distinct vectors. Distinct cases were labelled manually as being normal or abnormal (in the evaluation only). However, the algorithms are unsupervised and automatic.
    • 11 kinds of anomaly, occurring 16916 times. PCA detected 16808 of these. Two kinds of false positive: background migration and multiple replicas. Believe that no unsupervised algorithm could do better, so we’re now allowing operators to provide feedback.
    • Results are visualized with a decision tree. Unusual log message text is used to document split points in the decision tree.
    • Future work. Want to improve parsing so that it doesn’t require source code, and support more languages. Also want to improve feature creation and machine learning so that online detection is possible, also across applications and layers to provide more useful and comprehensive information.
    • Q: most applications have many identifiers, so how do you automatically detect these, and how reliably? The grouping step addresses the problem of multiple identifiers. In HDFS, we only have the block ID, but we have an example in the paper where we run the algorithm multiple times for each class of identifier.
    • Q: how do you know whether the different values of an identifier correspond to the same identifier variable? [Taken offline.]
    • Q: what was different about the single anomaly where you don’t do well? (Deleting a node when it no longer exists on the data node.) Block numbers can be arbitrary due to multiple reads and writes. Sometimes get errors in the correlations.
    • Q: how common do you think this is in other systems besides HDFS? HDFS has this problem most severely because every block interaction is written to the logs.
    • Q: you had to turn on the more-detailed logging level to get this to work, so how did you choose this? I had to turn on debug-logging. Depends on the problems you want to detect. Turn on more logging when you see problems but you can’t find out why.
    • Q: what happens to performance when you do this? Also what about heisenbugs that go away with more-detailed logging? System doesn’t do anything about logging-based heisenbugs.
    • Q: what if I add a new feature, would you be able to detect problems in it? Don’t currently deal with multiple versions of the software.
    • Q: how much information does your visualization offer to the developer to help them diagnose detected problems? If the operators have some insight, this tool can help them provide useful information to the developers.
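A condensed Python sketch of the pipeline on a made-up HDFS-like log, with numpy's SVD standing in for the paper's PCA machinery and a toy anomaly threshold: parse message types and block IDs with a regex, group by block ID into message-count vectors, and flag vectors with a large residual outside the dominant principal component.

```python
import re
import numpy as np
from collections import defaultdict

# Hypothetical HDFS-like console log; blk_4 loses two "received" messages.
log_lines = (
    ["receiving block blk_1", "received block blk_1"] * 1 +
    ["receiving block blk_2", "received block blk_2"] * 2 +
    ["receiving block blk_3", "received block blk_3"] * 3 +
    ["receiving block blk_4"] * 3 + ["received block blk_4"] * 1
)

# 1. Parse free text into (message type, identifier) pairs with a regex
#    (the paper derives such templates from the program source).
pattern = re.compile(r"(receiving|received) block (blk_\d+)")
msg_types = ["receiving", "received"]
traces = defaultdict(lambda: np.zeros(len(msg_types)))
for line in log_lines:
    m = pattern.match(line)
    if m:
        traces[m.group(2)][msg_types.index(m.group(1))] += 1

# 2. Group by identifier into message-count vectors (bag-of-messages features).
ids = sorted(traces)
X = np.array([traces[i] for i in ids])

# 3. PCA: normal traces vary along the dominant component ("received" tracks
#    "receiving"); a large residual outside that subspace marks an anomaly.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
residual = Xc - (Xc @ vt[:1].T) @ vt[:1]
scores = (residual ** 2).sum(axis=1)
threshold = 3 * np.median(scores)      # toy threshold; the paper uses a Q-statistic
for ident, score in zip(ids, scores):
    print(ident, "ANOMALY" if score > threshold else "ok", round(float(score), 3))
```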

    NSDI 2009: Day 3

    April 24th, 2009

    Wireless #2: Programming and Transport

    Wishbone: Profile-based Partitioning for Sensornet Applications

    • Extension to the WaveScope system.
    • Example application of locating marmots: listening out for loud alarm calls when confronted by a predator. High data rate application (node has four microphones). Used for sensing applications (animal localisation, pothole detection, computer vision, pipeline leak detection, speaker identification and EEG seizure detection). All expressible as a data flow graph. Predictable data rates. But run on heterogeneous platforms (CPU and radio bottlenecks).
    • Used Linux uservers, smartphones (Java-based and iPhones) and Meraki routers. Want to be able to mix and match. Have to deal with incompatible software environments (languages, SDKs and OSes).
    • Application is a dataflow graph (edges are streams; nodes are stream operators). Inputs are from sensor sources, outputs are results either to the user or a server. System partitions this graph between the embedded nodes and the server side. Wishbone handles serialisation and deserialisation across the network interface. Compiles and loads subgraph onto any platform. Replicates graph across all nodes (but may have different partitions depending on the node).
    • Contributions. First broadly portable sensor net programming environment. Partitioning algorithm. Optimises CPU/radio tradeoff.
    • WaveScope compiler gives dataflow graph in portable IL. Then have a backend code generator for each of the target platforms.
    • Wishbone needs sample data from the user to make the correct partitioning decisions.
    • Target a TinyOS mote. (16-bit microcontroller, 10KB of RAM, no memory protection, no threads, task-granularity messaging model.) Not directly compatible with the WaveScope threading model. Use profile-directed cooperative multitasking, which makes good decisions (based on profiling information) about where to place yield points.
    • Profile streams and operators to find execution times, data rates. Separately profile network connections.
    • Some data flow nodes must be pinned to a particular platform (e.g. source and sink nodes, stateful global operators). Others are unpinned (stateless and locally stateful nodes).
    • Assign weights to the edges (net bandwidth) and nodes (CPU cost) in the dataflow graph. Interested in the sum of CPU cost on the embedded device, and the sum of edges cut across the network boundary. Formulate as an Integer Linear Program. Enforce a resource bound on the embedded device (sum over nodes on the embedded device), and a network bound (sum over cut edges). Then minimise a linear combination of the CPU and network cost. The tricky bit is enforcing the other constraints (pinning and graph topology) while staying linear. Can set a parameter to trade off between CPU and radio, or else set it to reflect energy consumption. (A brute-force sketch of the objective appears after this list.)
    • Evaluated by looking at bandwidth versus compute cost on a linear pipeline of operators (evaluated on an iPhone as the embedded node). Observe that processing reduces data quantity overall, but this is not monotonic.
    • Evaluated by changing the input data rate of the application. Look at how many operators remain in the optimal node partition. (EEG graph with multiple channels partitions horizontally (some channels on server, some on device).)
    • Given CPU and network bounds, can find an optimal partition if it exists. The partition gives estimated throughput. So do a binary search over CPU bounds to find the maximum possible throughput.
    • Ground truth is how many detections can actually be obtained from a network of TMotes. Also looked at the compute/bandwidth tension in a single TMote/basestation network.
    • Related to graph partitioning for scientific codes (Zoltan in symmetric supercomputers), task scheduling (list scheduling), MapReduce/Condor, and Tenet and Vango (sensor network context, run TinyOS on embedded and server devices).
    • Q: have you considered situations where operators could tune the amount of processing that they do (or tolerate packet loss)? Considered it, but there are several generalisations we would look at first. Does it really change the structure of the system? Yes, we’d probably need a new partitioning algorithm.
    • Q: would it be worth profiling the amount of synchronisation necessary for pinned (global stateful) operators? We have strong consistency requirements, so we’d have to have a relatively expensive operation to synchronise this.
    • Q: have you applied the same sorts of partitioning for the scenario where you have different partitioning for different sensors? Allow you to port application across different platforms but the network is homogeneous, but you could run the algorithm multiple times (not considered in this paper).
    • Q: what is the effect of additional load on available bandwidth (interference, etc.)? Currently assume all nodes transmit at the same data rate, and use this assumption to assign channel capacity.
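A brute-force Python sketch of the partitioning objective on a toy four-operator pipeline (the real system solves an ILP; the operator names, profiled costs, bounds, and weights below are all made up): respect pinning and the embedded CPU and radio bounds, and minimise a weighted sum of embedded CPU cost and bytes crossing the network boundary.

```python
from itertools import product

# Toy dataflow pipeline: source -> filter -> detect -> sink, with hypothetical
# profiled costs.  cpu[] is the cost of running an operator on the embedded node;
# edges carry bytes/sec.
ops   = ["source", "filter", "detect", "sink"]
cpu   = {"source": 10, "filter": 40, "detect": 120, "sink": 0}
edges = {("source", "filter"): 800, ("filter", "detect"): 200, ("detect", "sink"): 20}
pinned = {"source": "embedded", "sink": "server"}   # sources/sinks are pinned

CPU_BOUND, NET_BOUND = 100, 500     # resource limits on the mote and radio
ALPHA, BETA = 1.0, 0.5              # CPU-vs-radio tradeoff knob

best = None
for placement in product(["embedded", "server"], repeat=len(ops)):
    assign = dict(zip(ops, placement))
    if any(assign[o] != side for o, side in pinned.items()):
        continue                    # violates pinning
    cpu_cost = sum(cpu[o] for o in ops if assign[o] == "embedded")
    net_cost = sum(b for (u, v), b in edges.items() if assign[u] != assign[v])
    if cpu_cost > CPU_BOUND or net_cost > NET_BOUND:
        continue                    # violates embedded CPU or radio bound
    cost = ALPHA * cpu_cost + BETA * net_cost
    if best is None or cost < best[0]:
        best = (cost, assign, cpu_cost, net_cost)

print(best)
```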

    Softspeak: Making VoIP Play Well in Existing 802.11 Deployments

    • VoIP and WiFi are becoming increasingly popular. Now even cellphones are becoming available with these capabilities (even Skype on iPhone). If these get used heavily, what happens to existing 802.11 deployments?
    • Imagine a commodity AP in a coffee shop. Most users are data users, and maybe there are one or two VoIP users. But as VoIP becomes more popular, the number of VoIP users will increase.
    • What does this do to call quality and to data users?
    • 802.11 was designed for data networks and has substantial per-packet overhead (headers, ACK) and contention (backoff and collision). VoIP has small packets, a high packet rate (20–100 packets per second), and does not respond to congestion. So VoIP makes inefficient use of WiFi.
    • Measured impact as residual capacity (TCP/UDP throughput) and “mean opinion score” (MOS, how audio appears to a real person, calculated based on loss and jitter).
    • Used an 802.11 testbed and gradually added VoIP stations. Downlink MOS tails off much faster than uplink MOS (after 4 VoIP stations). Degradation of TCP throughput is linear, but much more severe than would be expected from the size of VoIP packets.
    • Possible solutions: could decrease VoIP packet rate, or use higher speed networks (802.11g/n). Lowering the packet rate still degrades MOS. Higher speed networks run in “protection mode” in presence of older 802.11 versions, which loses many of the benefits. VoIP is too infrequent to benefit from 802.11n aggregation.
    • Could prioritise VoIP traffic (802.11e), which increases the contention overhead, and further reduces the residual capacity.
    • Softspeak does prioritised TDMA for the uplink and addresses contention overhead. By serialising channel access, we avoid collisions. Cycle between 10 TDMA slots, each of 1ms (1ms accommodates current codecs well; could be varied). VoIP stations have to establish a schedule, synchronise clocks and compete with non-TDMA traffic. (A tiny sketch of the slot arithmetic appears after this list.)
    • Non VoIP stations are unaware of TDMA, which may prevent VoIP stations from sending on time. If data user wins contention, the VoIP cannot transmit; if they collide this is worse. Use 802.11e VoIP prioritisation to improve VoIP quality, but TDMA means that we don’t see the same data rate degradation. Even a quick collision recovery for VoIP means that we will overrun our TDMA slots: can use multiple priority levels to address this (though can exhaust priority levels). Worst case, fall back to regular 802.11, and do no worse than 802.11.
    • Experiment by visualising TDMA, with ten TDMA VoIP stations, and some CSMA/CA background data traffic. Most transmissions should start in their own or the next slot.
    • Softspeak also does downlink aggregation. VoIP stations will overhear aggregated packets and extract their own portion of the packet.
    • No modification to wireless card, access point or VoIP application. Softspeak controller registers IP address and port with the aggregator (wired connection to AP). Implemented for Skype and Twinkle.
    • Evaluated for call quality and residual throughput. TCP data traffic and 10ms voice codec. (More results in the paper.) Adding TDMA alone has little effect over 802.11b. Aggregation by itself greatly improves downlink MOS (and slightly improves residual throughput). Softspeak in total greatly improves downlink MOS (almost back at single VoIP station level) and also improves residual throughput by 5x.
    • For 802.11g, get a 3x improvement in residual throughput, slight degradation in downlink MOS and large improvement in uplink MOS.
    • Also looked at performance when contending with web traffic. The bulk TCP upload improvement disappears, but the combined TCP capacity improvement is preserved. Softspeak doesn’t change the basic behaviour of the data traffic.
    • Q: how would the result change if you were measuring against multiple TCP flows? Experiment we did with web traffic showed this. Multiple bulk TCP flows? In one direction, saw same level of improvement; in the other saw queuing at the AP and would need some form of prioritisation at the AP.
    • Q: how could this work in a real network (because you have more than one collision domain; how do you assign slots across multiple hops?)? Didn’t evaluate the case with multiple collision domains; details of slot assignment in the paper. Use an 802.11 management protocol for probe response which increases the range when it can be heard. Doesn’t completely eliminate hidden terminal problem, but we hope that this would apply to TDMA.
    • Q: with multiple APs and multiple slots, wouldn’t you need some kind of centralised control? Could work across multiple APs.
    • Q: did you measure the delay performance of this? Incorporated in the MOS value.
    • Q: did you compare to giving short packets high priority with a rate limiter? This is part of the 802.11e specification, and we measured this. Why does aggregation win by reducing the packet per second rate? Typically if you have short packets, you use more bandwidth.
    • Q: if I am an enterprise administrator, why not just reserve a single channel for VoIP traffic, because it no longer has to contend with data or non-VoIP traffic? That would be a good solution if you had that available to you.
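A tiny Python sketch of the uplink slot arithmetic, assuming an already-synchronised clock and the talk's 10 slots of 1ms each; real Softspeak must additionally establish the schedule and contend with non-TDMA traffic.

```python
SLOT_MS = 1.0          # each TDMA slot is 1 ms
NUM_SLOTS = 10         # 10 slots -> a 10 ms cycle, matching a 10 ms voice codec

def next_send_time(now_ms, my_slot):
    """Time (ms, on the shared synchronised clock) when this station may next send."""
    cycle = SLOT_MS * NUM_SLOTS
    start = (now_ms // cycle) * cycle + my_slot * SLOT_MS
    if now_ms >= start + SLOT_MS:        # our slot already ended this cycle
        start += cycle
    return max(start, now_ms)            # send immediately if inside our slot

# e.g. the station holding slot 3, asking at t = 27.4 ms, sends at t = 33 ms
print(next_send_time(27.4, 3))
```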

    Block-switched Networks: A New Paradigm for Wireless Transport

    • Today, TCP performs poorly over wireless links. On a bad link, you get 40x goodput difference between UDP and TCP. Median has 2x goodput difference, and a good link has 1.6x goodput difference.
    • End-to-end rate control is error-prone and doesn’t work well in wireless networks. Loss feedback does not distinguish congestion and corruption losses. End-to-end retransmissions are wasteful in a multi-hop wireless network. Route disruptions cause unavailability (could use a disruption-tolerant protocol).
    • The use of packetisation introduces non-trivial overhead (for channel access: listen, backoff, RTS/CTS; and link-layer ARQ).
    • Also see complex cross-layer interaction: link-layer ARQs and backoffs hurt TCP rate control.
    • Hop is a clean-slate redesign. End-to-end becomes hop-by-hop. Packets become blocks to amortise overhead.
    • Reliable block transfer, ACK withholding and micro-block prioritisation on a per-hop basis. Virtual retransmission and backpressure components on a multi-hop basis.
    • Hop sends data in 802.11 burst mode. CSMA performed only before a TXOP (burst). Block-SYN and Block-ACK (bitmap of received packets; only resend missing packets; see the sketch after this list).
    • Virtual retransmission uses in-network caching and retransmits only on cache miss. Gives fewer transmissions, low overhead and a simple solution.
    • Backpressure mechanism limits the number of outstanding blocks per-flow at each forwarder (e.g. limited to 2 outstanding blocks). This improves network utilisation in the case where hop bandwidth is asymmetric.
    • ACK withholding mitigates hidden terminals better than RTS/CTS (which is overly conservative with high overhead). Buffer B-ACK messages while one terminal is sending, and then send one of the buffered B-ACKs when it is finished.
    • Micro-block prioritisation improves performance for applications like SSH and text messaging. Sender piggybacks small blocks on the B-SYN and receiver prioritises small-block B-ACKs. This gives low delay for small blocks.
    • Implemented on a testbed on the 2nd floor of the UMass CS building. 20 Mac Minis form an ad-hoc network.
    • Hop achieves significant goodput gains over TCP (1.6x at the median, 28x at the first quartile, 1.2x at the third quartile).
    • Also on single-flow multi-hop performance (more modest gains: Q1 2.7x, median 2.3x, Q3 1.9x).
    • Achieves graceful degradation with loss (emulated link layer losses at the receiver). Still achieves higher goodput than TCP as the loss rate increases.
    • For high load (30 concurrent flows), Hop achieves much higher goodput than either TCP or hop-by-hop TCP (Q1 150x, median 20x, Q3 2x). Hop is also much fairer in the way it allocates bandwidth between flows.
    • Evaluated delay on small transfers. Hop has much less delay than TCP or TCP+RTS/CTS (especially for smaller transfers).
    • Many more evaluations in the paper. Partitionable networks, network and link-layer dynamics, effect on 802.11g, and effect on VoIP.
    • Much related work on fixing end-to-end rate control, backpressure and batching; Hop combines these into a viable system.
    • Q: most data transfers involve wired networks as well, so what do you do at the gateway? We do bridging here.
    • Q: if there is a separate TCP connection from the gateway to a wired endpoint, how do you deal with mobility (changing gateways)? Good question, want to look at this in future work.
    • Q: what is the real effect on interactive sessions (e.g. SSH)? Hop actually achieves an even lower delay than TCP or TCP + RTS/CTS. Didn’t evaluate transfer sizes smaller than 2KB.
    • Q: if a congested router is fed by two routers upstream, how do you allocate the backpressure? The senders will have to back off. There is no explicit rate control. Both upstream routers are treated similarly.
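A small Python sketch of per-hop block transfer with a Block-ACK bitmap, using made-up sizes and a simulated lossy burst: the sender bursts the block, the receiver's bitmap says what arrived, and only missing packets are resent until the block completes.

```python
import random

PACKETS_PER_BLOCK = 16

def send_burst(packet_ids, loss_rate=0.3):
    """Pretend 802.11 burst: each packet independently survives or is lost."""
    return {p for p in packet_ids if random.random() > loss_rate}

def transfer_block(block_id):
    received = set()
    rounds = 0
    while len(received) < PACKETS_PER_BLOCK:
        missing = [p for p in range(PACKETS_PER_BLOCK) if p not in received]
        received |= send_burst(missing)            # B-SYN followed by a burst of missing packets
        bitmap = [p in received for p in range(PACKETS_PER_BLOCK)]   # B-ACK bitmap
        rounds += 1
    return rounds, bitmap

random.seed(0)
rounds, bitmap = transfer_block("blk-7")
print(f"block delivered after {rounds} burst rounds, final bitmap: {bitmap}")
```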

    Routing

    NetReview: Detecting When Interdomain Routing Goes Wrong

    • Interdomain routing sometimes goes wrong. Example of YouTube traffic being redirected to Pakistan. Only the most egregious problems make it to the news, but there are many inconvenient, small scale problems.
    • ASes exchange routing information using BGP. But BGP routing is plagued by misconfigurations, and faulty routing information propagates through the network. Bugs in router software can cause routing information to be misadvertised. Also, spammers hack into routers to evade tracing.
    • Goal is to reliably detect each routing problem and link it to the AS that caused it. This would make it quicker and easier to respond to problems. Fault detection would work for a broad class of problems, incentivise reliable routing and be easy to deploy incrementally.
    • Idea: could just upload all router logs to a central entity, who inspects them for problems. Doesn’t work in practice. Logs contain sensitive information about internal routing structure. Also, relies on the router to have accurate information (may not be the case). Also need automation, incremental deployability and decentralisation.
    • NetReview solves these problems. Border routers maintain logs of all BGP messages (not data messages). Logs are tamper-evident (in the event of a faulty router): can reliably detect and obtain proof of faulty router. Neighbours can periodically audit each other’s logs and check them for routing problems. Auditor can prove existence of a problem to a third party.
    • ASes decide what to announce via BGP based on their routing policies, which derive from peering agreements (customer/provider), best practices (limited path length) and internal goals (short/cheap paths).
    • A BGP fault is when the BGP messages sent by an AS do not conform to its expected behaviour. We know what BGP messages the AS sent from a complete and accurate message trace (using a robust and secure tracing mechanism). Expected behaviour for each AS could be different.
    • Examples of expected behaviour: filter out routes with excessively long paths; act as somebody’s provider; prefer routes through someone if available. Some of these rules may be confidential, but the AS need not reveal all of them to every auditor (e.g. reveal rules about agreements only to parties to those agreements).
    • Tamper-evident log based on PeerReview (SOSP 2007). Can tell if a router omits, modifies or forges entries (based on a hash chain). Messages are acknowledged, so a router cannot ignore a message. Neighbours gossip about the hashes they have seen. (A minimal hash-chain sketch appears after this list.)
    • Rules are predicates on the AS’s routing state. They are declarative and hence are easy to get correct. Checks of S-BGP can be declared in a two-line rule.
    • Auditor requests the logs from each border router. Checks to see if the logs have been tampered with (or show inconsistencies). The auditor locally replays the logs to establish a series of routing states, then evaluates whether the rules are upheld in each routing state. If a rule is violated during some interval, the auditor can extract verifiable evidence from the logs.
    • Many practical challenges: here we will look at incremental deployment. Smallest useful deployment is at one AS. One AS can find bugs and misconfigurations. Two adjacent ASes can check peering agreements. The incentives for deployment are that reliable ASes can attract more customers, and logs can be used for root-cause analysis.
    • Evaluated on a synthetic network of 10 ASes running 35 Zebra BGP daemons. Use default routing policies, and injected a real BGP trace (Equinix) to get scale. Results in the talk are from a tier-2 AS with 6 neighbours.
    • Did fault injection: NetReview detected all of the injected faults and provided useful diagnostic information.
    • Evaluated processing overhead: a 15-minute log segment can be checked in 41.5s on a P4. A single commodity PC is sufficient to check a small network in real-time.
    • Storage space requirement was 710KB/minute, or about 356GB/year. Required 420Kbps, including BGP updates, which is insignificant compared to the data rates.
    • Related work includes fault prevention (secure routing protocols and trusted monitors) which have been difficult to deploy, and only filter out limited types of faults. Also related to heuristic fault detection (problem of false positives and false negatives). And related to accountability systems (which tend to require clean-slate designs, or have other limitations).
    • Q: can you say more about the web of trust approach? No PKI or CA required, because the networks already know who is at the end of each link, which gives a good basis to certify the identity of ASes.
    • Q: does the system prevent collusion between ASes? ASes cannot incriminate a correctly-operating AS by collusion, nor hide misbehaviour (apart from routing behaviour between the colluding ASes, which doesn’t matter).
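A minimal Python sketch of a tamper-evident log in the spirit of PeerReview's hash chain (greatly simplified: no signatures, acknowledgements, or gossip protocol): each entry's hash covers the previous hash, so modifying or omitting an earlier entry changes the head hash that neighbours have already seen.

```python
import hashlib

def entry_hash(prev_hash, message):
    return hashlib.sha256(prev_hash + message.encode()).hexdigest().encode()

class BgpLog:
    def __init__(self):
        self.entries = []                     # (message, hash-after-appending) pairs
        self.head = b"genesis"

    def append(self, message):
        self.head = entry_hash(self.head, message)
        self.entries.append((message, self.head))
        return self.head                      # the head hash is what neighbours see

    def verify(self, claimed_head):
        h = b"genesis"
        for message, stored in self.entries:
            h = entry_hash(h, message)
            if h != stored:
                return False
        return h == claimed_head

log = BgpLog()
log.append("UPDATE 10.0.0.0/8 path 7 3 1")
head = log.append("WITHDRAW 10.0.0.0/8")
assert log.verify(head)
log.entries[0] = ("UPDATE 10.0.0.0/8 path 7 666 1", log.entries[0][1])   # tamper
print("log verifies after tampering?", log.verify(head))                 # False
```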

    Making Routers Last Longer with ViAggre

    • Motivation is the steep growth in routing table size. Expected to get worse in future: as IPv4 gets exhausted, need more small prefixes. Also there’s a chance that IPv6 gets deployed…. A larger routing table means that routers need more FIB (Forwarding Information Base) space.
    • Does FIB size matter? Could we just throw more RAM at it? There are technical concerns about power and heat. SRAM is a low-volume component that does not track Moore’s Law. Also, a larger routing table means less cost-effective networks (price per byte forwarded increases). Also, there’s a cost in upgrading router memory. Some routers are filtering out /24 routing table entries, which impacts reachability. ISPs will undergo some pain to extend the lifetime of their routers.
    • Virtual Aggregation: a configuration-only approach to shrinking router FIBs. Works on legacy routers and can be adopted independently by an ISP. Huawei are implementing it in their routers.
    • Brad Karp came up with the name.
    • Insight is to divide the routing burden between routers. Individual routers only maintain routes for a fraction of the address space.
    • Router architecture: RIB on slow memory; FIB on fast memory (SRAM).
    • Problem space includes FIB space, RIB growth and problems of routing convergence (churn etc.). Existing proposals require architectural change and haven’t been deployed. ViAggre focuses on FIB space problem and is incrementally deployable.
    • Today all routers have routes to all destinations. Idea is to divide address space into /2 virtual prefixes. Now assign virtual prefixes to the routers. Each router has a prefix and only maintains routes to hosts in that prefix.
    • How do you do this without changes to routers and external cooperation? How do packets traverse routers with partial routing tables?
    • An external BGP peer may advertise its full routing table: need to insert routes into the FIB selectively (only install a subset of the RIB into the FIB; FIB suppression is a simple approach). However, this has high performance overhead. Instead, offload the task of maintaining this table onto machines off the data path (similar approach to BGP route reflectors). External routers peer with a route reflector. Shrinks the FIB and RIB on all data-plane routers. This is somewhat invasive, because you have to change your peering architecture.
    • What about data-plane paths between different virtual prefixes? When a packet comes in, the ingress router doesn’t have a route to the prefix, so you need to route to a router with the right virtual prefix. So maintain one entry per virtual prefix that a router is not aggregating. Cannot forward packets in a hop-by-hop fashion, so the packet must be tunnelled to a router that has the right prefix. The egress router removes the encapsulation. (A toy sketch of this decision appears after this list.)
    • Failover works using existing mechanisms.
    • Paths in ViAggre can be longer than native paths (traffic stretch, increased load, etc.). But can use the power law that 95% of traffic goes to 5% of prefixes. So install that 5% of prefixes on all routers. This will reduce the impact of ViAggre on the ISP’s network.
    • Evaluated effects on adopting ISP. Look at reduction in FIB Size versus traffic stretch (in ms) and load increase (more traffic carried by routers).
    • Choosing aggregation points: assigning more routers to aggregate a virtual prefix reduces stretch. Use a constraint-based assignment program. Want to minimise the worst FIB size such that the worst stretch is <= a constraint. This is NP-hard, so it had to be approximated. With a worst-case stretch constraint of 4ms, the worst-case FIB size becomes close to the average FIB size, and the actual stretch is very low.
    • ViAggre can extend the lifetime of a router by 7–10 years.
    • Carried out a study of the prefixes to which traffic was sent, found that the top 5% most popular prefixes account for 95% of traffic.
    • As the fraction of popular prefixes increases, the load increase drops to 1.38%.
    • ViAggre gives a 10x reduction in FIB size.
    • Cons of ViAggre: control plane hacks may have overheads (in installation, convergence, failover); also planning overhead (choosing virtual prefixes and assigning them to routers).
    • Deployed ViAggre on a real network. Compared propagating routes using the status quo and ViAggre (using prefix lists for selective advertisement). Measured control-plane overhead in case a new route is installed. ViAggre reduces installation time because route reflector only has to advertise a subset of the routing table to the new router.
    • Also developed a configuration tool (330 lines of Python) that works on router configuration files and outputs ViAggre-compliant configuration files. Still working on a planning component. Working with Huawei to implement ViAggre natively.
    • Q: overhead of tunnelling? Use MPLS-based tunnelling at line rate. Makes management easy.
    • Q: why is the rate at which traffic is growing less than the rate of routing table growth? Not necessarily true. Router upgrading has been happening for reasons other than FIB size up until now (but the FIB size has increased). Tier-1 ISPs can forward packets at close to lambda rates. Even big ISPs may have to do upgrades only for memory concerns.
    • Q: why not just maintain popular routes and forward other packets to a default router? Simplest thing to do would be to put routers in a cache hierarchy. Why does it have to be distributed and complicated? Router vendors weren’t happy with route caching because of unpredictable performance and other reasons. ViAggre is useful for medium-sized ISPs (not small ISPs where you have a useable default route).
    • Q: can you use a configuration-only approach for the data centre to get higher aggregate throughput? The state problem in data centres is the layer-2 state (not the layer-3 state). Close to Seattle work at SIGCOMM 2008.
    • Q: why is it expensive to do route suppression from the RIB to the FIB? We did it using access lists to say what should and should not go into the FIB. These access lists are heavyweight mechanisms (typically used with small access lists of up to 500 entries), and haven’t been optimised. By making this efficient, you get much less complexity (no need for route reflectors).
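A toy Python sketch of the ViAggre data-plane decision, with made-up virtual prefixes, routers, and popular prefixes: map the destination to its /2 virtual prefix; forward natively if this router aggregates that prefix or the destination falls in a popular prefix installed everywhere, otherwise tunnel to an aggregation point.

```python
import ipaddress

# Hypothetical assignment of /2 virtual prefixes to aggregation-point routers.
virtual_prefixes = {
    ipaddress.ip_network("0.0.0.0/2"):   "routerA",
    ipaddress.ip_network("64.0.0.0/2"):  "routerB",
    ipaddress.ip_network("128.0.0.0/2"): "routerC",
    ipaddress.ip_network("192.0.0.0/2"): "routerD",
}
# The few most popular prefixes are installed in every FIB to limit stretch.
popular_prefixes = [ipaddress.ip_network("203.0.113.0/24")]

def forward(this_router, dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    if any(dst in p for p in popular_prefixes):
        return f"native forwarding (popular prefix) from {this_router}"
    for vp, aggregation_point in virtual_prefixes.items():
        if dst in vp:
            if aggregation_point == this_router:
                return f"native forwarding ({vp} is aggregated here)"
            return f"tunnel (e.g. MPLS) to {aggregation_point}, which holds {vp}"
    raise ValueError("unreachable: the virtual prefixes cover the whole space")

print(forward("routerA", "203.0.113.7"))   # popular prefix, forwarded natively
print(forward("routerA", "66.35.250.55"))  # tunnelled to routerB
print(forward("routerB", "66.35.250.55"))  # routerB holds these routes itself
```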

    Symbiotic Relationships in Internet Routing Overlays

    • Two nodes are in symbiosis whenever both can benefit from one another’s position or resources in the network. Examples are in file sharing (BitTorrent), backup systems (Samsara) and AS relationships. No tragedy of the commons and no free riding. Everyone provides something in return.
    • Can we apply mutual advantage to overlay routing? Built PeerWise, an overlay routing system that reduces latency based on mutual advantage.
    • Route from College Park, MD to Seattle. Could take a direct path, or a detour path that violates the triangle inequality.
    • Measurement studies based on PlanetLab-to-popular-destinations (asymmetric) and PW-King (symmetric). 21% and 51% of node pairs respectively benefit from detours (only detour when the latency reduction is at least 10ms and 10%; a small sketch of this check appears after this list).
    • For 20% of the PlanetLab nodes, there are no detours. Mutual advantage eliminates about half of the detours.
    • Can categorise web pages based on the number of prefixes and the number of nodes finding detours (regional websites see many nodes finding detours but few prefixes; Google and CDN-based sites have many prefixes, but relatively few nodes finding detours).
    • Three goals: efficiency (lower end-to-end latency), fairness (mutual advantage) and scalability (use network coordinates). Network coordinates give internet nodes positions in a geometric space, and predict latency based on position in that space: this has some power in predicting whether node-pairs will need a detour or be part of a detour.
    • Need to do neighbour tracking, and ranking based on proximity (minimise latency), embedding error (maximise this), coverage (minimise expected detour latency) or randomness. Proximity and coverage perform best.
    • Now do pairwise negotiation to establish mutual advantage. Keep a negotiation table and peering table per node. Negotiation table contains potential peers.
    • Implemented PeerWise and deployed it on about 200 PlanetLab nodes. PeerWise finds mutually-advantageous detours that offer significant and continuous latency reduction.
    • Propose three experimental scenarios: all-destination, random-subset-destination and Zipf-destination.
    • PeerWise finds detours quickly. Other results in the paper.
    • Looked at user-level application benefits (wget) through the use of detours. Compare the transfer time for popular website home pages using both the direct and detour paths. Compare PeerWise reduction ratio and wget reduction ratio. In 58% of all cases, wget takes less time through the detour path than through the direct path. But in 42% of cases, wget takes longer through the detour. Could be that the latencies measured by PeerWise had changed.
    • Found that PlanetLab relays influence transfers: packets spend time in the relay node. Median relay wait time on PlanetLab was 48ms (measured a few weeks before the NSDI deadline). Found that on a UMD relay node, the median wait time was 5ms. With these delay characteristics, the performance would have been much better.
    • Q: is it not counter-intuitive to use latency (with triangle-inequality violations) in network coordinates (a metric space); should you not use topological information like in iPlane? Network coordinates give good results, so we didn’t investigate anything else. When we embed a triangle-inequality violation in a metric space, either the short sides get shorter, or the long side gets longer.
    • Q: aren’t most network coordinate systems based on some sort of global optimisation? Taken offline.
    • Q: do you only consider bilateral advantages, or have you considered more complicated scenarios with multiple parties? Have thought about it, but just consider bilateral.
    • Q: have you done any work on trying to include the load of a node in the weighting system to determine whether it is advantageous to route through it? Have not looked at load.
    • Q: a long RTT TCP connection, split in two, will get better latency and loss tolerance, so have you considered doing this instead of TCP relaying? Is there a way to separate the effects in your evaluation? Not sure.
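A small Python sketch of detour discovery and the mutual-advantage check over a made-up latency matrix, using the talk's thresholds (a detour must save at least 10ms and 10%): two nodes peer only if each provides the other a qualifying one-hop detour. The latencies and the simplified negotiation check are illustrative, not PeerWise's actual protocol.

```python
# Hypothetical measured latencies (ms) between overlay nodes; the underlay can
# produce triangle-inequality violations, so detours can beat direct paths.
lat = {
    ("A", "B"): 95, ("A", "C"): 30, ("C", "B"): 40,   # A->C->B (70ms) beats A->B
    ("C", "D"): 80, ("C", "A"): 30, ("A", "D"): 35,   # C->A->D (65ms) beats C->D
    ("B", "A"): 95, ("B", "C"): 40, ("D", "C"): 80, ("D", "A"): 35,
}

def best_detour(src, dst):
    """Best one-hop detour that saves at least 10 ms and 10% over the direct path."""
    direct = lat[(src, dst)]
    candidates = []
    for relay in {a for a, _ in lat} - {src, dst}:
        if (src, relay) in lat and (relay, dst) in lat:
            via = lat[(src, relay)] + lat[(relay, dst)]
            if direct - via >= 10 and direct - via >= 0.1 * direct:
                candidates.append((via, relay))
    return min(candidates) if candidates else None

def mutually_advantageous(x, y, x_dst, y_dst):
    """Simplified negotiation: x relays for y and vice versa only if both gain."""
    return (best_detour(x, x_dst) or (None, None))[1] == y and \
           (best_detour(y, y_dst) or (None, None))[1] == x

print(best_detour("A", "B"))                      # (70, 'C')
print(mutually_advantageous("A", "C", "B", "D"))  # True: A relays via C, C relays via A
```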
