Archive for the ‘Uni’ Category

NSDI 2009: Day 3

Friday, April 24th, 2009

Wireless #2: Programming and Transport

Wishbone: Profile-based Partitioning for Sensornet Applications

  • Extension to the WaveScope system.
  • Example application of locating marmots: listening for the loud alarm calls they make when confronted by a predator. A high data rate application (each node has four microphones). Used for sensing applications (animal localisation, pothole detection, computer vision, pipeline leak detection, speaker identification and EEG seizure detection), all expressible as a dataflow graph with predictable data rates, but run on heterogeneous platforms (CPU and radio bottlenecks).
  • Used Linux servers, smartphones (Java-based and iPhones) and Meraki routers. Want to be able to mix and match. Have to deal with incompatible software environments (languages, SDKs and OSes).
  • Application is a dataflow graph (edges are streams; nodes are stream operators). Inputs are from sensor sources, outputs are results either to the user or a server. System partitions this graph between the embedded nodes and the server side. Wishbone handles serialisation and deserialisation across the network interface. Compiles and loads subgraph onto any platform. Replicates graph across all nodes (but may have different partitions depending on the node).
  • Contributions. First broadly portable sensor net programming environment. Partitioning algorithm. Optimises CPU/radio tradeoff.
  • WaveScope compiler gives dataflow graph in portable IL. Then have a backend code generator for each of the target platforms.
  • Wishbone needs sample data from the user to make the correct partitioning decisions.
  • Target a TinyOS mote. (16-bit microcontroller, 10KB of RAM, no memory protection, no threads, task-granularity messaging model.) Not directly compatible with the WaveScope threading model. Use profile-directed cooperative multitasking, which makes good decisions (based on profiling information) about where to place yield points.
  • Profile streams and operators to find execution times, data rates. Separately profile network connections.
  • Some data flow nodes must be pinned to a particular platform (e.g. source and sink nodes, stateful global operators). Others are unpinned (stateless and locally stateful nodes).
  • Assign weights to the edges (network bandwidth) and nodes (CPU cost) in the dataflow graph. Interested in the sum of CPU cost on the embedded device, and the sum of edges cut by the network boundary. Formulate as an Integer Linear Program: enforce a resource bound on the embedded device (sum over nodes placed there) and a network bound (sum over cut edges), then minimise a linear combination of the CPU and network cost. (The tricky bit is to enforce the other constraints (pinning and graph topology) while staying linear.) Can set a parameter to trade off between CPU and radio, or set it to reflect energy consumption.
  • Evaluated by looking at bandwidth versus compute cost on a linear pipeline of operators (evaluated on an iPhone as the embedded node). Observe that processing reduces data quantity overall, but this is not monotonic.
  • Evaluated by changing the input data rate of the application. Look at how many operators remain in the optimal node partition. (EEG graph with multiple channels partitions horizontally (some channels on server, some on device).)
  • Given CPU and network bounds, can find an optimal partition if it exists. The partition gives estimated throughput. So do a binary search over CPU bounds to find the maximum possible throughput.
  • Ground truth is how many detections can actually be obtained from a network of TMotes. Also looked at the compute/bandwidth tension in a single TMote/basestation network.
  • Related to graph partitioning for scientific codes (Zoltan in symmetric supercomputers), task scheduling (list scheduling), MapReduce/Condor, and Tenet and Vango (sensor network context, run TinyOS on embedded and server devices).
  • Q: have you considered situations where operators could tune the amount of processing that they do (or tolerate packet loss)? Considered it, but there are several generalisations we would look at first. Does it really change the structure of the system? Yes, we’d probably need a new partitioning algorithm.
  • Q: would it be worth profiling the amount of synchronisation necessary for pinned (global stateful) operators? We have strong consistency requirements, so we’d have to have a relatively expensive operation to synchronise this.
  • Q: have you applied the same sort of partitioning to the scenario where you have different partitions for different sensors? Wishbone lets you port an application across different platforms, but assumes the network is homogeneous; you could run the algorithm multiple times (not considered in this paper).
  • Q: what is the effect of additional load on available bandwidth (interference, etc.)? Currently assume all nodes transmit at the same data rate, and use this assumption to assign channel capacity.
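The ILP-based partitioning step above is easy to sketch for a toy graph: rather than calling an ILP solver, this brute-forces every valid cut of a small pipeline, enforcing the pinning, topology, CPU and network constraints and minimising the linear combination of costs. All node names, costs and bounds here are made up for illustration; Wishbone's actual formulation and numbers are in the paper.

```python
from itertools import combinations

# Hypothetical linear pipeline: src -> filter -> fft -> classify -> sink
cpu_cost = {"src": 1, "filter": 3, "fft": 8, "classify": 5, "sink": 0}
edge_bw  = {("src", "filter"): 100, ("filter", "fft"): 40,
            ("fft", "classify"): 10, ("classify", "sink"): 2}
pinned_embedded = {"src"}       # sensor source must stay on the node
pinned_server   = {"sink"}      # result sink must stay on the server
unpinned = [n for n in cpu_cost if n not in pinned_embedded | pinned_server]

def evaluate(embedded):
    # topology constraint: no edge may flow from the server side back down
    if any(u not in embedded and v in embedded for (u, v) in edge_bw):
        return None
    cpu = sum(cpu_cost[n] for n in embedded)
    net = sum(bw for (u, v), bw in edge_bw.items()
              if u in embedded and v not in embedded)
    return cpu, net

def best_partition(cpu_bound, net_bound, alpha=1.0, beta=1.0):
    best, best_cost = None, float("inf")
    for k in range(len(unpinned) + 1):
        for subset in combinations(unpinned, k):
            embedded = pinned_embedded | set(subset)
            res = evaluate(embedded)
            if res is None:
                continue
            cpu, net = res
            if cpu <= cpu_bound and net <= net_bound:
                cost = alpha * cpu + beta * net
                if cost < best_cost:
                    best, best_cost = embedded, cost
    return best, best_cost

# Processing on the node shrinks the data, so a deep cut wins here
part, cost = best_partition(cpu_bound=20, net_bound=50)
```

With these illustrative numbers the cheapest feasible cut keeps the whole pipeline except the sink on the embedded side, sending only the 2-unit classifier output over the radio.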

Softspeak: Making VoIP Play Well in Existing 802.11 Deployments

  • VoIP and WiFi are becoming increasingly popular. Now even cellphones are becoming available with these capabilities (even Skype on iPhone). If these get used heavily, what happens to existing 802.11 deployments?
  • Imagine a commodity AP in a coffee shop. Most users are data users, and maybe there are one or two VoIP users. But as VoIP becomes more popular, the number of VoIP users will increase.
  • What does this do to call quality and to data users?
  • 802.11 was designed for data networks and has substantial per-packet overhead (headers, ACK) and contention (backoff and collision). VoIP has small packets, a high packet rate (20–100 packets per second), and does not respond to congestion. So VoIP makes inefficient use of WiFi.
  • Measured impact as residual capacity (TCP/UDP throughput) and “mean opinion score” (MOS, how the audio quality appears to a real listener, calculated based on loss and jitter).
  • Used an 802.11 testbed and gradually enabled VoIP stations. Downlink MOS tails off much faster than uplink MOS (after 4 VoIP stations). Degradation of TCP throughput is linear, but much more severe than would be expected from the size of VoIP packets.
  • Possible solutions: could decrease VoIP packet rate, or use higher speed networks (802.11g/n). Lowering the packet rate still degrades MOS. Higher speed networks run in “protection mode” in presence of older 802.11 versions, which loses many of the benefits. VoIP is too infrequent to benefit from 802.11n aggregation.
  • Could prioritise VoIP traffic (802.11e), which increases the contention overhead, and further reduces the residual capacity.
  • Softspeak does prioritised TDMA for the uplink and addresses contention overhead. By serialising channel access, we avoid collisions. Cycle between 10 TDMA slots, each of 1ms (1ms accommodates current codecs well; could be varied). VoIP stations have to establish a schedule, synchronise clocks and compete with non-TDMA traffic.
  • Non-VoIP stations are unaware of TDMA, which may prevent VoIP stations from sending on time: if a data user wins contention, the VoIP station cannot transmit, and a collision is even worse. Use 802.11e VoIP prioritisation to improve VoIP quality; with TDMA we don’t see the same data-rate degradation. Even a quick collision recovery for VoIP can overrun a TDMA slot: multiple priority levels address this (though priority levels can be exhausted). In the worst case, fall back to regular 802.11, and do no worse than 802.11.
  • Experiment by visualising TDMA, with ten TDMA VoIP stations, and some CSMA/CA background data traffic. Most transmissions should start in their own or the next slot.
  • Softspeak also does downlink aggregation. VoIP stations will overhear aggregated packets and extract their own portion of the packet.
  • No modification to wireless card, access point or VoIP application. Softspeak controller registers IP address and port with the aggregator (wired connection to AP). Implemented for Skype and Twinkle.
  • Evaluated for call quality and residual throughput. TCP data traffic and 10ms voice codec. (More results in the paper.) Adding TDMA alone has little effect over 802.11b. Aggregation by itself greatly improves downlink MOS (and slightly improves residual throughput). Softspeak in total greatly improves downlink MOS (almost back at single VoIP station level) and also improves residual throughput by 5x.
  • For 802.11g, get a 3x improvement in residual throughput, slight degradation in downlink MOS and large improvement in uplink MOS.
  • Also looks at performance when contending with web traffic. The bulk TCP upload improvement disappears, but the combined TCP capacity improvement is preserved. Softspeak doesn’t change the basic behaviour of the data traffic.
  • Q: how would the result change if you were measuring against multiple TCP flows? Experiment we did with web traffic showed this. Multiple bulk TCP flows? In one direction, saw same level of improvement; in the other saw queuing at the AP and would need some form of prioritisation at the AP.
  • Q: how could this work in a real network (because you have more than one collision domain; how do you assign slots across multiple hops?)? Didn’t evaluate the case with multiple collision domains; details of slot assignment in the paper. Use an 802.11 management protocol for probe response which increases the range when it can be heard. Doesn’t completely eliminate hidden terminal problem, but we hope that this would apply to TDMA.
  • Q: with multiple APs and multiple slots, wouldn’t you need some kind of centralised control? Could work across multiple APs.
  • Q: did you measure the delay performance of this? Incorporated in the MOS value.
  • Q: did you compare to giving short packets high priority with a rate limiter? This is part of the 802.11e specification, and we measured this. Why does aggregation win by reducing the packet per second rate? Typically if you have short packets, you use more bandwidth.
  • Q: if I am an enterprise administrator, why not just reserve a single channel for VoIP traffic, because it no longer has to contend with data or non-VoIP traffic? That would be a good solution if you had that available to you.
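The uplink TDMA cycle described above (10 slots of 1ms, repeating every 10ms) can be sketched in a few lines. The function names and the exact timing arithmetic here are illustrative, not Softspeak's implementation:

```python
SLOT_MS = 1.0
NUM_SLOTS = 10
FRAME_MS = SLOT_MS * NUM_SLOTS  # 10 ms cycle fits current voice codecs

def slot_at(t_ms):
    """Which TDMA slot (0..9) is active at time t_ms."""
    return int(t_ms // SLOT_MS) % NUM_SLOTS

def next_transmit_time(t_ms, my_slot):
    """Earliest start of my_slot at or after t_ms: a VoIP station
    holds its packet until its own slot comes around."""
    frame_start = (t_ms // FRAME_MS) * FRAME_MS
    candidate = frame_start + my_slot * SLOT_MS
    if candidate < t_ms:
        candidate += FRAME_MS  # missed this frame; wait for the next
    return candidate
```

For example, a station owning slot 2 that becomes ready at t = 3.5ms waits until t = 12.0ms, the start of its slot in the next frame. (In the real system, contention from non-TDMA stations can still push a transmission into the next slot, which is what the visualisation experiment checks.)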

Block-switched Networks: A New Paradigm for Wireless Transport

  • Today, TCP performs poorly over wireless links. On a bad link, you get 40x goodput difference between UDP and TCP. Median has 2x goodput difference, and a good link has 1.6x goodput difference.
  • End-to-end rate control is error-prone and doesn’t work well in wireless networks. Loss feedback does not distinguish congestion and corruption losses. End-to-end retransmissions are wasteful in a multi-hop wireless network. Route disruptions cause unavailability (could use a disruption-tolerant protocol).
  • The use of packetisation introduces non-trivial overhead (for channel access: listen, backoff, RTS/CTS; and link-layer ARQ).
  • Also see complex cross-layer interaction: link-layer ARQs and backoffs hurt TCP rate control.
  • Hop is a clean-slate redesign. End-to-end becomes hop-by-hop. Packets become blocks to amortise overhead.
  • Reliable block transfer, ACK withholding and micro-block prioritisation on a per-hop basis. Virtual retransmission and backpressure components on a multi-hop basis.
  • Hop sends data in 802.11 burst mode. CSMA performed only before a TXOP (burst). Block-SYN and Block-ACK (bitmap of packets; only resend missing packets).
  • Virtual retransmission uses in-network caching and retransmits only on cache miss. Gives fewer transmissions, low overhead and a simple solution.
  • Backpressure mechanism limits the number of outstanding blocks per-flow at each forwarder (e.g. limited to 2 outstanding blocks). This improves network utilisation in the case where hop bandwidth is asymmetric.
  • ACK withholding mitigates hidden terminals better than RTS/CTS (which is overly conservative with high overhead). Buffer B-ACK messages while one terminal is sending, and then send one of the buffered B-ACKs when it is finished.
  • Micro-block prioritisation improves performance for applications like SSH and text messaging. Sender piggybacks small blocks on the B-SYN and receiver prioritises small-block B-ACKs. This gives low delay for small blocks.
  • Implemented on a testbed on the 2nd floor of the UMass CS building. 20 Mac Minis form an ad-hoc network.
  • Hop achieves significant goodput gains over TCP (1.6x at median, 28x at first quartile, 1.2x at third quartile).
  • Also on single-flow multi-hop performance (more modest gains: Q1 2.7x, median 2.3x, Q3 1.9x).
  • Achieves graceful degradation with loss (emulated link layer losses at the receiver). Still achieves higher goodput than TCP as the loss rate increases.
  • For high load (30 concurrent flows), Hop achieves much higher goodput than either TCP or hop-by-hop TCP (Q1 150x, median 20x, Q3 2x). Hop is also much fairer in the way it allocates bandwidth between flows.
  • Evaluated delay on small transfers. Hop has much less delay than TCP or TCP+RTS/CTS (especially for smaller transfers).
  • Many more evaluations in the paper. Partitionable networks, network and link-layer dynamics, effect on 802.11g, and effect on VoIP.
  • Much related work on fixing end-to-end rate control, backpressure and batching; Hop combines these into a viable system.
  • Q: most data transfers involve wired networks as well, so what do you do at the gateway? We do bridging here.
  • Q: if there is a separate TCP connection from the gateway to a wired endpoint, how do you deal with mobility (changing gateways)? Good question, want to look at this in future work.
  • Q: what is the real effect on interactive sessions (e.g. SSH)? Hop actually achieves an even lower delay than TCP or TCP + RTS/CTS. Didn’t evaluate transfer sizes smaller than 2KB.
  • Q: if a congested router is fed by two routers upstream, how do you allocate the backpressure? The senders will have to back off. There is no explicit rate control. Both upstream routers are treated similarly.
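Hop's Block-SYN/Block-ACK exchange (the receiver returns a bitmap of the packets it got; the sender retransmits only the missing ones) can be sketched as follows. The block size and function names are illustrative:

```python
BLOCK_SIZE = 64  # packets per block (illustrative, not Hop's value)

def make_block_ack(received_seqs, block_size=BLOCK_SIZE):
    """Receiver side: build a bitmap with bit i set iff packet i
    of the block arrived."""
    bitmap = 0
    for seq in received_seqs:
        bitmap |= 1 << seq
    return bitmap

def missing_packets(bitmap, block_size=BLOCK_SIZE):
    """Sender side: the packets to retransmit in the next burst."""
    return [i for i in range(block_size) if not (bitmap >> i) & 1]

# First burst of an 8-packet block: packets 3 and 7 are lost
ack = make_block_ack(set(range(8)) - {3, 7}, block_size=8)
resend = missing_packets(ack, block_size=8)
```

A single bitmap ACK per block, instead of per-packet link-layer ARQ, is what amortises the channel-access overhead described above.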


NetReview: Detecting When Interdomain Routing Goes Wrong

  • Interdomain routing sometimes goes wrong. Example of YouTube traffic being redirected to Pakistan. Only the most egregious problems make it to the news, but there are many inconvenient, small scale problems.
  • ASes exchange routing information using BGP, but BGP routing is plagued by misconfigurations, and faulty routing information propagates through the network. Bugs at the AS level can cause routing information to be misadvertised. Also, spammers hack into routers to evade tracing.
  • Goal is to reliably detect each routing problem and link it to the AS that caused it. This would make it quicker and easier to respond to problems. Fault detection would work for a broad class of problems, incentivise reliable routing and be easy to deploy incrementally.
  • Idea: could just upload all router logs to a central entity that inspects them for problems. This doesn’t work in practice: logs contain sensitive information about internal routing structure, and it relies on routers having accurate information (which may not be the case). Also need automation, incremental deployability and decentralisation.
  • NetReview solves these problems. Border routers maintain logs of all BGP messages (not data messages). Logs are tamper-evident (in the event of a faulty router): can reliably detect and obtain proof of faulty router. Neighbours can periodically audit each other’s logs and check them for routing problems. Auditor can prove existence of a problem to a third party.
  • ASes decide what to announce via BGP based on its routing policy, based on peering agreements (customer/provider), best practices (limited path length) and internal goals (short/cheap path).
  • A BGP fault is when the BGP messages sent by an AS do not conform to its expected behaviour. We know what BGP messages the AS sent from a complete and accurate message trace (using a robust and secure tracing mechanism). Expected behaviour for each AS could be different.
  • Example of expected behaviour: filter out routes with excessively long paths; act as somebody’s provider; prefer routes through someone if available. Some of these rules may be confidential, but the AS need not reveal all of them to each auditor (e.g. reveal rules about agreements only to the parties to those agreements).
  • Tamper-evident log based on PeerReview at SOSP 2007. Can tell if a router omits, modifies or forges entries (based on a hash chain). Messages are acknowledged, so cannot ignore a message. Neighbours gossip about hashes seen.
  • Rules are predicates on the AS’s routing state. They are declarative and hence are easy to get correct. Checks of S-BGP can be declared in a two-line rule.
  • Auditor requests the logs from each border router. Checks to see if the logs have been tampered with (or show inconsistencies). The auditor locally replays the logs to establish a series of routing states, then evaluates whether the rules are upheld in each routing state. If a rule is violated during some interval, the auditor can extract verifiable evidence from the logs.
  • Many practical challenges: here we will look at incremental deployment. Smallest useful deployment is at one AS. One AS can find bugs and misconfigurations. Two adjacent ASes can check peering agreements. The incentives for deployment are that reliable ASes can attract more customers, and logs can be used for root-cause analysis.
  • Evaluated on a synthetic network of 10 ASes running 35 Zebra BGP daemons. Use default routing policies, and injected a real BGP trace (Equinix) to get scale. Results in the talk are from a tier-2 AS with 6 neighbours.
  • Did fault injection; NetReview detected all of the injected faults and provided useful diagnostic information.
  • Evaluated processing overhead: a 15-minute log segment can be checked in 41.5s on a P4. A single commodity PC is sufficient to check a small network in real-time.
  • Storage space requirement was 710KB/minute, or about 356GB/year. Required 420Kbps, including BGP updates, which is insignificant compared to the data rates.
  • Related work includes fault prevention (secure routing protocols and trusted monitors) which have been difficult to deploy, and only filter out limited types of faults. Also related to heuristic fault detection (problem of false positives and false negatives). And related to accountability systems (which tend to require clean-slate designs, or have other limitations).
  • Q: can you say more about the web of trust approach? No PKI or CA required, because the networks already know who is at the end of each link, which gives a good basis to certify the identity of ASes.
  • Q: does the system prevent collusion between ASes? ASes cannot incriminate a correctly-operating AS by collusion, nor hide misbehaviour (apart from routing behaviour between the colluding ASes, which doesn’t matter).
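The tamper-evident log that NetReview borrows from PeerReview is at heart a hash chain: each entry's hash covers the previous hash, so omitting, modifying or reordering entries changes every later authenticator. A minimal sketch (the class and the message formats are illustrative, not NetReview's wire format):

```python
import hashlib

class TamperEvidentLog:
    """Append-only log with a hash chain, PeerReview-style."""
    def __init__(self):
        self.entries = []
        self.top = b"\x00" * 32  # genesis hash

    def append(self, message: bytes):
        # Each entry's hash commits to the entire history before it.
        self.top = hashlib.sha256(self.top + message).digest()
        self.entries.append((message, self.top))
        return self.top  # authenticator sent alongside the BGP message

    def verify(self):
        """Auditor side: replay the chain and check every recorded hash."""
        h = b"\x00" * 32
        for message, recorded in self.entries:
            h = hashlib.sha256(h + message).digest()
            if h != recorded:
                return False
        return True

log = TamperEvidentLog()
log.append(b"UPDATE prefix=192.0.2.0/24 path=7 42")
log.append(b"WITHDRAW prefix=198.51.100.0/24")
ok_before = log.verify()
log.entries[0] = (b"forged", log.entries[0][1])  # tamper with an entry
ok_after = log.verify()
```

Because neighbours acknowledge messages and gossip about the hashes they have seen, a router cannot maintain two divergent chains without being caught.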

Making Routers Last Longer with ViAggre

  • Motivation is the steep growth in routing table size. Expected to get worse in future: as IPv4 gets exhausted, need more small prefixes. Also there’s a chance that IPv6 gets deployed…. A larger routing table means that routers need more FIB (Forwarding Information Base) space.
  • Does FIB size matter? Could you just throw more RAM at it? There are technical concerns about power and heat. SRAM is a low-volume component that does not track Moore’s Law. A larger routing table also means less cost-effective networks (the price per byte forwarded increases), and there is a cost in upgrading router memory. Some routers are filtering out /24 routing table entries, which impacts reachability. ISPs will undergo some pain to extend the lifetime of their routers.
  • Virtual Aggregation: a configuration-only approach to shrinking router FIBs. Works on legacy routers and can be adopted independently by an ISP. Huawei are implementing it in their routers.
  • Brad Karp came up with the name.
  • Insight is to divide the routing burden between routers. Individual routers only maintain routes for a fraction of the address space.
  • Router architecture: RIB on slow memory; FIB on fast memory (SRAM).
  • Problem space includes FIB space, RIB growth and problems of routing convergence (churn etc.). Existing proposals require architectural change and haven’t been deployed. ViAggre focuses on FIB space problem and is incrementally deployable.
  • Today all routers have routes to all destinations. Idea is to divide address space into /2 virtual prefixes. Now assign virtual prefixes to the routers. Each router has a prefix and only maintains routes to hosts in that prefix.
  • How do you do this without changes to routers and external cooperation? How do packets traverse routers with partial routing tables?
  • An external BGP peer may advertise its full routing table: need to insert routes into the FIB selectively (only install a subset of the RIB into the FIB; FIB suppression is a simple approach). However, this has high performance overhead. Instead, offload the task of maintaining this table onto machines off the data path (a similar approach to BGP route reflectors). External routers peer with a route reflector. This shrinks the FIB and RIB on all data-plane routers, but is somewhat invasive, because you have to change your peering architecture.
  • What about data-plane paths between different virtual prefixes? When a packet comes in, the ingress router doesn’t have a route to the destination prefix, so it must route to the router with the right virtual prefix. Maintain one entry per virtual prefix that a router is not aggregating. Packets cannot be forwarded hop-by-hop, so they must be tunnelled to a router that has the right prefix; the egress router removes the encapsulation.
  • Failover works using existing mechanisms.
  • Paths in ViAggre can be longer than native paths (traffic stretch, increased load, etc.). But can exploit the power law that 95% of traffic goes to 5% of prefixes: install that popular 5% of prefixes on all routers. This reduces the impact of ViAggre on the ISP’s network.
  • Evaluated effects on adopting ISP. Look at reduction in FIB Size versus traffic stretch (in ms) and load increase (more traffic carried by routers).
  • Choosing aggregation points: assigning more routers to aggregate a virtual prefix reduces stretch. Use a constraint-based assignment program: minimise the worst FIB size such that the worst stretch is <= a constraint. This is NP-hard, so had to approximate it. With a worst-case stretch of 4ms, the worst FIB size becomes close to the average FIB size, and the actual stretch is very low.
  • ViAggre can extend the lifetime of a router by 7–10 years.
  • Carried out a study of the prefixes to which traffic was sent, found that the top 5% most popular prefixes account for 95% of traffic.
  • As the fraction of popular prefixes increase, the load increase drops to 1.38%.
  • ViAggre gives a 10x reduction in FIB size.
  • Cons of ViAggre: control plane hacks may have overheads (in installation, convergence, failover); also planning overhead (choosing virtual prefixes and assigning them to routers).
  • Deployed ViAggre on a real network. Compared propagating routes using the status quo and ViAggre (using prefix lists for selective advertisement). Measured control-plane overhead in case a new route is installed. ViAggre reduces installation time because route reflector only has to advertise a subset of the routing table to the new router.
  • Also developed a configuration tool (330 lines of Python) that works on router configuration files and outputs ViAggre-compliant configuration files. Still working on a planning component. Working with Huawei to implement ViAggre natively.
  • Q: overhead of tunnelling? Use MPLS-based tunnelling at line rate. Makes management easy.
  • Q: why is the rate at which traffic is growing less than the rate of routing table growth? Not necessarily true. Router upgrading has been happening for reasons other than FIB size up until now (but the FIB size has increased). Tier-1 ISPs can forward packets at close to lambda rates. Even big ISPs may have to do upgrades only for memory concerns.
  • Q: why not just maintain popular routes and forward other packets to a default router? Simplest thing to do would be to put routers in a cache hierarchy. Why does it have to be distributed and complicated? Router vendors weren’t happy with route caching because of unpredictable performance and other reasons. ViAggre is useful for medium-sized ISPs (not small ISPs where you have a useable default route).
  • Q: can you use a configuration-only approach in the data centre to get higher aggregate throughput? The state problem in data centres is the layer-2 state (not the layer-3 state). Close to the SEATTLE work at SIGCOMM 2008.
  • Q: why is it expensive to do route suppression from the RIB to the FIB? We did it using access lists to say what should and should not go into the FIB. These access lists are heavyweight mechanisms (typically used with small lists of up to 500 entries), and haven’t been optimised. By making this efficient, you would get much less complexity (no need for route reflectors).
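The forwarding decision at a ViAggre router (deliver natively if you aggregate the destination's virtual prefix, otherwise tunnel to an aggregation point) can be sketched as follows. The /2 split into four virtual prefixes and the router-to-prefix assignment here are illustrative:

```python
import ipaddress

# Divide IPv4 space into four hypothetical /2 virtual prefixes
VIRTUAL_PREFIXES = [ipaddress.ip_network(f"{i << 6}.0.0.0/2")
                    for i in range(4)]

# Illustrative assignment of one aggregation point per virtual prefix
aggregation_point = {vp: f"router{idx}"
                     for idx, vp in enumerate(VIRTUAL_PREFIXES)}

def forward(my_router, dst):
    """Ingress decision: native longest-prefix match if we aggregate
    the destination's virtual prefix, else tunnel to its aggregation
    point (which strips the encapsulation and delivers natively)."""
    dst = ipaddress.ip_address(dst)
    vp = next(p for p in VIRTUAL_PREFIXES if dst in p)
    ap = aggregation_point[vp]
    if ap == my_router:
        return ("native", str(dst))  # full routes for this prefix in FIB
    return ("tunnel", ap)            # one FIB entry per foreign virtual prefix

forward("router0", "8.8.8.8")    # in 0.0.0.0/2, aggregated locally
forward("router0", "192.0.2.1")  # in 192.0.0.0/2, tunnel to router3
```

Each router thus keeps full routes only for its own virtual prefix plus one tunnel entry per foreign virtual prefix (and, per the popularity optimisation above, the most popular prefixes).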

Symbiotic Relationships in Internet Routing Overlays

  • Two nodes are in symbiosis whenever both can benefit from one another’s position or resources in the network. Examples are in file sharing (BitTorrent), backup systems (Samsara) and AS relationships. No tragedy of the commons and no free riding. Everyone provides something in return.
  • Can we apply mutual advantage to overlay routing? Built PeerWise, an overlay routing system that reduces latency based on mutual advantage.
  • Route from College Park, MD to Seattle. Could take a direct path, or a detour path that violates the triangle inequality.
  • Measurement studies based on PlanetLab-to-popular-destinations (asymmetric) and PW-King (symmetric). 21% and 51% of node pairs respectively benefit from detours (only detour when latency reduction is at least 10ms and 10%).
  • For 20% of the PlanetLab nodes, there are no detours. Mutual advantage eliminates about half of the detours.
  • Can categorise web pages based on the number of prefixes and the number of nodes finding detours (regional websites see many nodes finding detours but few prefixes; Google and CDN-based sites have many prefixes, but relatively few nodes finding detours).
  • Three goals: efficiency (lower end-to-end latency), fairness (mutual advantage) and scalability (use network coordinates). Network coordinates give internet nodes positions in a geometric space, and predict latency based on position in that space: this has some power in predicting whether node-pairs will need a detour or be part of a detour.
  • Need to do neighbour tracking, and ranking based on proximity (minimise latency), embedding error (maximise this), coverage (minimise expected detour latency) or randomness. Proximity and coverage perform best.
  • Now do pairwise negotiation to establish mutual advantage. Keep a negotiation table and peering table per node. Negotiation table contains potential peers.
  • Implemented PeerWise and deployed it on about 200 PlanetLab nodes. PeerWise finds mutually-advantageous detours that offer significant and continuous latency reduction.
  • Propose three experimental scenarios: all-destination, random-subset-destination and Zipf-destination.
  • PeerWise finds detours quickly. Other results in the paper.
  • Looked at user-level application benefits (wget) through the use of detours. Compare the transfer time for popular website home pages using both the direct and detour paths. Compare PeerWise reduction ratio and wget reduction ratio. In 58% of all cases, wget takes less time through the detour path than through the direct path. But in 42% of cases, wget takes longer through the detour. Could be that the latencies measured by PeerWise had changed.
  • Found that PlanetLab relays influence transfers: packets spend time in the relay node. The median relay wait time on PlanetLab was 48ms (measured a few weeks before the NSDI deadline); on a UMD relay node, the median wait time was 5ms. With these delay characteristics, the performance would have been much better.
  • Q: is it not counter-intuitive to use latency (with triangle inequality violations) in network coordinates (a metric space)? Should you not use topological information, as in iPlane? Network coordinates give good results, so we didn’t investigate anything else. When we embed a triangle inequality violation in a metric space, we either make the short sides shorter, or the long side longer.
  • Q: aren’t most network coordinate systems based on some sort of global optimisation? Taken offline.
  • Q: do you only consider bilateral advantages, or have you considered more complicated scenarios with multiple parties? Have thought about it, but just consider bilateral.
  • Q: have you done any work on trying to include the load of a node in the weighting system to determine whether it is advantageous to route through it? Have not looked at load.
  • Q: a long RTT TCP connection, split in two, will get better latency and loss tolerance, so have you considered doing this instead of TCP relaying? Is there a way to separate the effects in your evaluation? Not sure.

NSDI 2009: Day 2

Thursday, April 23rd, 2009


SPLAY: Distributed Systems Evaluation Made Simple (or How to Turn Ideas into Live Systems in a Breeze)

  • Large-scale distributed applications are difficult and costly to develop, deploy, test and tune. Moving from simulations to real deployments must bridge a “simplicity gap”.
  • Often focus on developers’ technical skills rather than the algorithms.
  • Motivated to use real testbeds (PlanetLab, Emulab, ModelNet, idle desktop machines, etc.) for development but these are hard to deploy on, so we need tools to simplify and accelerate development, resource management, deployment and control.
  • Simplify development: could use a high-level language or an abstract network substrate. Simplify deployment: resource selection, deployment and monitoring. Control the application: centralised logging, replay of churn and network-condition dynamics.
  • SPLAY is intended for easy prototyping, multiple-testbed deployment, distributed systems teaching (hands-on teaching), and using idle resources for distributed systems research.
  • SPLAY architecture: daemons written in C on the testbed machines, which are testbed-agnostic (abstracting away the type of testbed and OS) and lightweight. Results are retrieved by remote logging. The daemon instantiates a distributed-system participant (e.g. a BitTorrent client or Chord node).
  • SPLAY language based on Lua (made for interaction with C, bytecode-based with garbage collection). Language close to pseudocode, so that programmers can focus on the algorithm. Programs written at the RPC level. Applications can be run locally for debugging and testing, and have a comprehensive set of libraries (compatible with wrapped C libraries).
  • SPLAY controller may be distributed, and performs deployment and control support. Can for example emulate churn or network packet loss.
  • Resource isolation is a requirement for non-dedicated testbeds (e.g. where you have multiple users or you want to sandbox testbed code). Need to limit the resources used and enforce protection. All access to the system is directed through libraries.
  • SPLAY can do Chord in 59 LOC (+17 for fault tolerance and 26 to use a leafset). Pastry takes 265 LOC. Also did Scribe, SplitStream, WebCache, BitTorrent (420 LOC, much of it for protocol parsing), Cyclon, Epidemic and Trees.
  • Interface for resource selection (web or command-line), using criteria on load, resource availability and location (e.g. from PlanetLab metadata).
  • Chord implementation is validated to provide the correct results (route length, rather than performance) which demonstrates the validity of using a real deployment instead of simulation.
  • SPLAY allows real traces of PlanetLab to be replayed on a testbed (for reproducible experiments). Can also use synthetic descriptions in a DSL (say when fractions of nodes join and leave).
  • Evaluated synthetic churn effect on routing performance in a Pastry DHT. Also looked at trace-driven churn (trace from OverNet file-sharing network), with variable speedup factors.
  • Evaluated runtime overhead. Can run up to 1263 nodes of Pastry on a machine with 2GB of RAM before swapping is triggered.
  • Also evaluated robustness for long-running applications (web cache application).
  • Q: how does SPLAY help you to validate simulation results? This is not the goal of SPLAY: instead it helps you to run a real application rather than simulate it. Could you reuse the Lua code? Yes, Lua is a good fit for this.
  • Q: if you get a non-deterministic bug on PlanetLab, it can be hard to reproduce, but does SPLAY help you with this? Don’t solve the distributed snapshot problem, but if the problem is due to churn, SPLAY will help you to reproduce it.
  • Q: does the framework allow querying network conditions (for developing network-aware distributed algorithms)? Don’t have direct access from the present Lua libraries to low-level network conditions, but anything that can be expressed in C can be used, so this could be added.
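The 59-line Lua Chord mentioned above is written at the RPC level: each routing step is a remote call to find the successor of a key. A rough Python sketch of that core lookup follows; the names are illustrative, not SPLAY's actual Lua API, and a real node would also run a stabilisation loop and keep a finger table.

```python
class ChordNode:
    """Toy Chord node; ids live on a ring (e.g. of size 2**m)."""
    def __init__(self, node_id):
        self.id = node_id
        self.successor = self          # wired up when the ring is built

    @staticmethod
    def _in_interval(key, a, b):
        # True if key lies in the half-open ring interval (a, b].
        if a < b:
            return a < key <= b
        return key > a or key <= b     # interval wraps past zero

    def find_successor(self, key):
        # In SPLAY this hop would be a remote procedure call; here we
        # walk the ring hop by hop (a real implementation would use the
        # finger table to jump in O(log N) hops instead).
        node = self
        while not self._in_interval(key, node.id, node.successor.id):
            node = node.successor
        return node.successor
```

The point of the SPLAY design is that code at this level of abstraction, plus the library's RPC layer, is all the programmer writes; churn, deployment and logging are handled by the daemons and controller.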

Modeling and Emulation of Internet Paths

  • How do you evaluate a distributed system? (e.g. a DHT, CDN, P2P, …) Set up a topology on which you are going to test the system, then run it on Emulab, which will emulate the network between them.
  • Emulation is repeatable, real-world environment and controllable (dedicated nodes and network parameters).
  • Hitherto all about link emulation: not the paths between nodes.
  • Goal is path emulation: two nodes in the real world (opposite coasts of the US) with a path between them that you want to emulate.
  • Could do link-by-link emulation: need to know topology of the path, the capacities along that path, the queue sizes, and the cross-traffic along the path. Most researchers don’t have that information.
  • Instead abstract the path and do end-to-end emulation. Focus on high-level network characteristics (RTT and available bandwidth).
  • Obvious solution: given a link emulator, turn it into a path emulator by mapping the actions of the link emulator onto high-level characteristics. Link emulators have queues but it’s hard to determine the queue size from the path (set it to a default and we’ll adjust it later). Then parameterise a traffic shaper. And finally induce a fixed delay on every packet (based on measured real-world RTT, to estimate non-queuing delay). Must do this in both directions, which may be independent.
  • Used this on a real link with iperf running bandwidth measurements in both directions. 8.0% error on the forward path and 7.2% error on the reverse path. Room for improvement, but not too bad: good news! Bad news is on RTT: obvious emulation gives a fixed RTT which is an order of magnitude higher than on the real path. Also, for asymmetric paths (6.4Mbps/2.6Mbps): 50.6% error on the forward path (smaller than real path) and 8.5% error on the reverse path.
  • Why are these errors arising? TCP increases congestion window until it sees a loss, but there are no losses until the queue fills up. The large delays we were seeing were queuing delays. More than 200ms of delay was due to queuing delay.
  • Maximum tolerable queuing delay is ((maximum window size / target bandwidth) - base RTT). Total queue size must be <= product of capacity and maximum tolerable delay. Can also calculate a lower bound for the queue size. But this gives a lower bound greater than the upper bound! Solution is to distinguish capacity from available bandwidth. (ABW is the rate that my packets drain from the queue.) Lower bound is independent of capacity, but upper bound grows as capacity grows, so this gives a region in which we can select a viable queue size.
  • How do we separate the emulation of available bandwidth and capacity? Set queue based on constraints rather than a reasonable default. The traffic shaper should emulate the capacity (high fixed bandwidth, not available bandwidth). Delay is the same. Now introduce constant bitrate cross traffic into the queue and sink it after the shaper. Just enough such that only the available bandwidth is available inside the traffic shaper. Again, do this in both directions.
  • Use CBR traffic instead of TCP cross-traffic because TCP cross-traffic backs off. If the cross-traffic backs off, you would see a higher bandwidth than is realistic.
  • But how can CBR traffic be reactive? Approximate aggregate available bandwidth as a function of the number of foreground flows. Change CBR traffic based on flow count.
  • Evaluated the error in emulation (from obvious solution to new emulation): forward 50.6% down to 4.1%; reverse 8.5% down to 5.0%. Delay is less noisy than the real environment, but is in the correct order of magnitude.
  • Tested system with BitTorrent running on it (12 clients, 1 seed). Isolated capacity and queue size changes. Makes a difference when compared to obvious solution.
  • Another principle is modelling shared bottlenecks (details in the paper).
  • Q: RTT was shown as time-series, but should you show the PDF/CDF of RTTs? Do they also follow the real world? Very unlikely to follow what is seen in the real world, constraining in a simple way. Focussing on higher-level aspects that would matter at endpoints.
  • Q: did you compare results to real-world on PlanetLab? Hard to characterise ground-truth on PlanetLab. Used widely, but hard to distinguish between host conditions (overloaded host) between network conditions: this impacts things like BitTorrent applications. See FlexLab paper. Weren’t confident in the ability to do that experiment.
  • Q: there is great value in having repeatable experiments, but a problem with Emulab is parameterisation (too much to choose), and it would be good to come up with a “standard scenario” (“500-node March 2009 settings”)? Difficult to do but have taken steps in that direction. FlexMon did wide-area bandwidth and delay measurements on PlanetLab. Unfortunately, this is hard to scale up.
  • Q: do you have a sense of whether the byte-based queue size is realistic? Talked in terms of packets to make it simpler, but implemented in terms of bytes.
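The queue-sizing argument above can be made concrete. Assuming units of bytes, bytes/sec and seconds, the upper bound follows from the maximum tolerable queuing delay (with the queue draining at the link's *capacity*); the lower bound shown here is one reading of the talk's "lower bound independent of capacity" (the window beyond the path's bandwidth-delay product), not a formula quoted from the paper.

```python
def queue_size_bounds(max_window, target_bw, base_rtt, capacity):
    """Viable queue-size region for the path emulator (sketch).

    max_window : sender's maximum TCP window (bytes)
    target_bw  : available bandwidth to emulate (bytes/sec)
    base_rtt   : non-queuing round-trip time (seconds)
    capacity   : raw capacity of the emulated link (bytes/sec)
    """
    # Queuing delay must stay below (max window / target bandwidth) - base RTT,
    # and the queue drains at the link capacity, giving the upper bound.
    max_queuing_delay = max_window / target_bw - base_rtt
    upper = capacity * max_queuing_delay
    # The queue must absorb whatever part of the window exceeds the
    # bandwidth-delay product (hedged reading of the talk's lower bound).
    lower = max_window - target_bw * base_rtt
    return lower, upper
```

Note that if capacity is set equal to the available bandwidth (the "obvious" emulation), the two bounds coincide and the region collapses; raising capacity above the available bandwidth opens up the region, which is exactly why the design separates the two and fills the gap with CBR cross-traffic.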

MODIST: Transparent Model Checking of Unmodified Distributed Systems

  • Distributed systems are hard to get right (complicated protocols and code to implement them). With no centralised view of the entire system, large range of failures to tolerate and increasingly large scale.
  • Normally do some kind of randomised testing, but this is low coverage and non-deterministic (very hard to reproduce).
  • MODIST is a model checker for distributed systems. It’s comprehensive. It runs in-situ (unmodified, real implementations). And it’s deterministic (allows replay).
  • Applied to Berkeley DB, Paxos-MPS (in a production service for Microsoft’s data centres), and PacificA. Found 35 bugs, of which 31 have been confirmed by developers.
  • Look at BDB replication. Based on Paxos (single primary, multiple secondaries): primary can read and write, secondary can only read. On primary failure, secondaries will elect new primary. On duplicate primary, degrade both and re-elect. But there was a bug in the leader election protocol: could lead to a secondary node receiving a request-for-updates message, which is unexpected and causes the secondary to crash.
  • MODIST took about an hour to make the bug show up, and outputs a trace which can be replayed.
  • Goal is to explore all states and actions of the system. Model checking makes rare actions (failures, crashes) appear as often as common ones, which drives you into corner cases.
  • Look at real processes, communicating using messages, which may also be multi-threaded. Set of normal actions (send/recv message or run thread) and rare actions (message delay, link failure, machine crash) which may be injected.
  • Ideal is to explore all actions. But this leads to a combinatorial explosion. Built-in checks for crashes, deadlocks and infinite loops. Can also have user-written checks (local and global assertions). MODIST amplifies the checks that you give to it.
  • Avoid redundancy by exploring one interleaving of independent actions.
  • Challenges are exposing actions, checking timeout code, simulating realistic failures and scheduling actions (avoiding deadlocks and maintaining extensibility).
  • To check a system, must know and control the actions of the system. Previous work required users either to write an application in a special language, or port it into a fake environment. MODIST builds on Explode (OSDI 2006) to insert the needed control into the checked system. A fake environment perturbs the system and can introduce false positives.
  • MODIST instead inserts an interpretation frontend that intercepts RPC API calls and sends these to a MODIST backend. Backend interposes on all RPCs and schedules all intercepted API calls. This is transparent and simple (does not perturb the system and cause false positives). Frontend is stateless: all state in the backend. Only frontend is OS-dependent; backend is OS-independent.
  • Frontend intercepts 82 API functions (e.g. networking, thread synchronisation). Most wrappers are simple, either returning a failure or calling the real API function. Each wrapper has an average of 67 lines of code.
  • Timeout checking is complicated by the heavy use of implicit timers (e.g. a comparison with gettime()). Could intercept gettime() but what should be returned? Don’t know what the comparison is (because it isn’t an API call), and previous work to address this was manual.
  • Instead, do a static, symbolic analysis. Observe that time values are used in simple ways (db_timespec, mostly with +, - and sometimes * or /). Also that timeouts are checked in the vicinity of gettime() calls, 12 out of 13 are within a few lines. Static intra-procedural symbolic analysis can discover implicit timers (much simpler than approaches like KLEE).
  • Checked systems of size up to 172.1KLOC, and found 35 bugs. 10 bugs in protocols, 25 in implementations. All bugs were previously unknown.
  • Q: how does it compare to DPOR (designed for multi-core systems)? Work with general distributed systems as well as multithreaded systems. Different kinds of failures in distributed system, so not clear how this can be mapped to a multithreaded system.
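The interposition structure described above (a stateless frontend wrapper that either injects a failure or calls through to the real API, with every decision made by a central backend) might look like this minimal Python sketch. Names are illustrative, not MODIST's real Win32 interception; the explicit schedule is what makes a run replayable.

```python
class Backend:
    """Central scheduler: owns all state and decides each action's fate.
    Because the schedule of choices is explicit, a buggy run can be
    replayed exactly."""
    def __init__(self, schedule):
        self.schedule = iter(schedule)   # sequence of 'run' / 'fail'

    def decide(self, api_name):
        return next(self.schedule)

def make_wrapper(real_fn, backend):
    # Stateless frontend wrapper: either inject a failure (the rare
    # action, amplified by the model checker) or call through to the
    # real API (the common action).
    def wrapper(*args, **kwargs):
        if backend.decide(real_fn.__name__) == 'fail':
            raise OSError('injected failure')
        return real_fn(*args, **kwargs)
    return wrapper
```

The real system wraps 82 API functions this way; keeping all state in the backend is what lets the frontend stay simple and OS-dependent while the checking logic stays OS-independent.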

CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems

  • Errors remain in distributed systems even after the system is deployed (DoS, data loss, loss of connectivity), and manifest themselves as loss of safety properties.
  • Paxos is a fault tolerant protocol for achieving consensus, and is integrated in many deployed distributed systems.
  • Want to use increased computing power and bandwidth to make it easier to find errors in distributed systems. Use spare power to do state exploration that can predict what sequences of user actions might lead to violations of user-specified consistency properties.
  • Compared against classic model checking, which is limited in the depth that can be achieved. Compared against replay-based/live predicate checking (state of the art distributed debugging). Deep online debugging periodically starts state space exploration to find inconsistencies. CrystalBall also does execution steering to avoid states which would lead to inconsistency.
  • High-level overview: service is implemented as a state machine with some amount of local state and several handlers. Runtime manages the timers and messages to/from the network. CrystalBall controller takes information from the state machine (neighbour info and periodic local checkpoints). The user puts safety properties into the controller. Controller invokes consequence prediction algorithm, looking for violations (passed back to controller). Controller can then put event filters (for steering) into the runtime.
  • State space exploration using MaceMC: run enabled handlers in a set of nodes (live code) and check safety properties in each state. Want to reduce the state space that must be covered so as to predict future inconsistencies quickly. Want to explore all interleavings of network messages (important for race conditions), but not all interleavings of local node actions.
  • Standard hashing does not re-explore system-wide states. Instead use hashing to remove previously explored local actions of nodes. Network messages can still be interleaved.
  • Aim is to increase resilience of deployed systems, with a focus on generic runtime mechanisms. Want to prevent inconsistencies without introducing new ones (do no harm). But this is a hard problem.
  • Execution steering uses “sound filters” which have behaviour equivalent to events that could happen in normal running (e.g. TCP connection broken, UDP packet loss).
  • Model checker might not have enough time to detect inconsistencies. Deal with this using an “immediate safety check”.
  • Evaluated live using 6–100 participants on 25 machines. Implemented in Mace, and used ModelNet to emulate wide-area Internet behaviour. Looked at RandTree, Chord and Bullet’ systems. Found 7 inconsistencies that weren’t found by MaceMC or a few years of debugging.
  • Looked at execution steering for Paxos. Injected 2 bugs and induced live scenarios that violate the Paxos safety property. Ran each bug 100 times, with a random [0, 20] seconds between injections. Avoided violation in 95% of cases.
  • Also evaluated the performance impact. (Download times using Bullet’.) Less than a 5% slowdown: due to the bandwidth spent on shipping checkpoints.
  • Q: could you comment on the CPU overhead of running these state-space explorations? One core is dedicated to this and maxed out. The state machine is single-threaded.
  • Q: as machines get more powerful, you would expect them to handle more load… how much is the amount that should be spent on model checking? Well, we haven’t looked into that, but could parallelise state-space exploration and use multiple cores.
  • Q: how big is the state space in bytes? For 8 levels (depth reached at runtime), need only 600KB, which fits easily in L2 cache.
  • Q: is there any way of quantifying what “relevant” means for the relevant states that you aim to explore? The states that we are exploring are obviously quite relevant, better than random walk.
  • Q: Paxos example is artificial because you have 3 replicas and 2 faults? There were separate failures and a network partition? This failure still shouldn’t have happened and CrystalBall detected it.
  • Q: ditto. Paxos should have used stable storage for the proposed values? This was a bug in the implementation.
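As a toy illustration of consequence prediction, the sketch below runs a bounded breadth-first exploration from a checkpoint, prunes states already seen (a crude stand-in for the hashing of explored local actions described above, which in CrystalBall still allows network-message interleavings), and reports the first action sequence violating a safety property. This is illustrative Python, not CrystalBall's actual algorithm.

```python
from collections import deque

def explore(initial_state, actions, safety, max_depth=5):
    """Bounded BFS consequence prediction from a checkpoint.
    actions: pure functions state -> new state (hashable states).
    safety:  state -> bool (user-specified consistency property).
    Returns the first action sequence reaching a violation, or None."""
    seen = {initial_state}
    frontier = deque([(initial_state, [])])
    while frontier:
        state, trace = frontier.popleft()
        if not safety(state):
            return trace               # predicted inconsistency
        if len(trace) >= max_depth:
            continue                   # budget exhausted on this branch
        for i, act in enumerate(actions):
            nxt = act(state)
            if nxt not in seen:        # hash-based pruning of revisits
                seen.add(nxt)
                frontier.append((nxt, trace + [i]))
    return None
```

A returned trace is what the controller would use to install event filters and steer execution away from the bad state.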

Wide-Area Services and Replication

Tolerating Latency in Replicated State Machines Through Client Speculation

  • Want to make fault tolerant systems faster in a distributed environment.
  • Simple service configuration with client and server, exchanging messages. Server has some state. For FT, replicate the service as a replicated state machine. Replicas need to agree on the order in which they execute the client requests. Need to reach consensus.
  • Problem is that RSMs have high latency: must block until enough non-faulty replies are received. Also have to deal with geographic distribution (to avoid correlated failures).
  • Idea is to use speculative execution in the RSM. Speculate before consensus reached. Without faults, any reply will have the consensus value.
  • Client can take a checkpoint of its state and execute speculatively while waiting on consensus. When consensus is reached, commit the state and continue executing. Obviously, if we speculate based on a faulty reply, we can rollback to the speculation point.
  • First reply latency sets the critical path (need speed and accuracy here), not the completion latency. Can therefore achieve high throughput, support good batching, have stability under contention and maintain a smaller number of replicas.
  • What if you make a request while executing speculatively? (Buy a Corvette after speculating that we’ve won the lottery.) Could hold this because it is external output, which gives bad performance (just as bad as before). Could use a distributed commit/rollback protocol but this makes state tracking complex.
  • Instead explicitly encode dependencies as predicates. New message is “buy if win=yes”. Replicas need to log past replies to test these predicates. Local decision at the replicas matches the client.
  • Applied speculative execution to PBFT (PBFT-Client Speculation, PBFT-CS). Move the execution stage to the beginning (reply from the primary). If the primary is non-faulty, it is in a natural place to give a high-quality speculation.
  • Added tentative execution to PBFT-CS, a read-only optimisation, a failure threshold and proved correctness.
  • Evaluated with benchmarks: a shared counter and NFS (running an Apache httpd build). Three topologies (primary-local, primary-remote and uniform). For primary-local on shared counter, the runtime is roughly equal to the non-replicated case (as the network delay is varied). Performs better than PBFT and Zyzzyva (which see run time increase linearly with network delay). Also evaluated with failures (still performs well with 1% failure). Also looked at throughput. If latency is the limiting factor and you’re not operating at peak capacity, then speculative execution is a win. However, if the server is fully loaded (bound by throughput), client speculation will not improve things (and may introduce some overhead).
  • Q: have you thought about adaptively using client speculation under low load and Zyzzyva under high load to improve overall throughput? Either client or the primary could switch off speculation when load is high.
  • Q: how do you deal with RSMs getting into divergent states because of speculation (i.e. by having external operations)? Could fall back to blocking or distributed commit/rollback.
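The "buy if win=yes" predicate encoding can be sketched in a few lines: each replica logs its past replies and evaluates the client's predicate locally before executing a dependent request, so its decision matches the client's speculation. Names and message shapes here are illustrative, not the actual PBFT-CS protocol.

```python
class Replica:
    def __init__(self):
        self.reply_log = {}              # request id -> logged reply

    def execute(self, req_id, op):
        result = op()
        self.reply_log[req_id] = result  # logged so predicates can be tested
        return result

    def execute_if(self, req_id, op, dep_id, predicted):
        # Predicate check: only run op if our logged reply for dep_id
        # matches the value the client speculated on.
        if self.reply_log.get(dep_id) != predicted:
            return ('abort', None)       # client rolls back to checkpoint
        return ('ok', self.execute(req_id, op))
```

If the speculation was wrong, the replica aborts the dependent operation and the client rolls back, which is what avoids both holding external output and running a distributed commit/rollback protocol.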

Cimbiosys: A Platform for Content-based Partial Replication

  • Scenario of photo sharing: Alice stores her complete photo collection on her home PC, tags and labels all photos. The metadata is used to route photos to devices and services (public -> Flickr, 5* -> digital photo frame, family -> Mum). She might update these photos (do red-eye reduction) and these should propagate everywhere. She might upload photos from elsewhere (an internet café).
  • Two key replication requirements. Not all devices are interested in all data (content-based selection, device-specific filter). Not all devices talk to all other devices (need flexible, arbitrary communication patterns). Also eventual consistency, incremental updates, conflict detection and resolution, bandwidth efficiency, etc.
  • Item store is a database of XML objects stored on each device. Application only accesses items in the item store on that device, but will then be synchronised by Cimbiosys (through Sync and Comm components).
  • Filtered synchronisation protocol. Target device sends a list of its item-store contents and its filter to the source. The source sends each item that is unknown to the device and matches the filter. Then the target adds these items to its store and updates its knowledge.
  • Main contribution is a protocol that has eventual filter consistency (device’s item store = filter(whole collection)) (correctness property), and eventual knowledge singularity (device’s knowledge = single version vector) (performance property). Extensive model checking has confirmed that these properties hold.
  • Concentrate on eventual filter consistency here.
  • Problem of partial sync: how can you sync from laptop (family photos) through Facebook (public photos only) to the home PC (all photos)? Facebook just gets public family photos.
  • Needed to invent “Item-Set Knowledge”. Knowledge is the set of versions known to device: either stored, obsolete or out-of-filter. The knowledge fragment is a set of items (or * for all items) and a version vector. A device knows about all versions in the version vector for all items in the set of items.
  • By acquiring knowledge using the sync protocol, you receive items at most once.
  • What happens when an item is updated such that it moves out of a filter? (Remove family keyword from a member of the family who’s been disowned.) Need to add a new type of message to the sync protocol: a move-out notification. Send one of these if the version is unknown to the target device and the item does not match the target’s filter and (optionally) if the item is stored at the target. Don’t want to send too many notifications for items that were never stored (not correctness but performance issue).
  • Also need to handle move-out notification chains. Send if the source’s filter is as broad as the target’s, the source is as knowledgeable as the target and the source does not store an item stored at the target device.
  • Another problem is filter changes. Could just create a new empty collection. But that would be user-unfriendly. Want to discard things outside the new filter, retain the intersection of the old and new filters and acquire things not inside the old filter. So use a knowledge retraction protocol.
  • Evaluated based on a C# implementation (confirmed in Mace), by looking at the average number of inconsistent items per replica.
  • Q: what are the semantics of filter? Does it have any compositional properties? Only requirement on the filter is that it can make a binary decision based on an item and its metadata. How do you make the filters consistent? No notion of consistency for a filter.
  • Q: to what degree do failures or data corruption fit into this protocol? Don’t do anything about Byzantine failures; assume that each device has persistent storage.
  • Q: what is the trust model here, and how do you deal with not-fully-trusted cloud services? We have an access control policy that says what devices can perform what actions on items in the collection.
  • Q: does the originator of the item specify the policy? The owner of the collection specifies the policy. Also use signatures to prove the origin of an item.
  • Q: how is Cimbiosys different from PRACTI? PRACTI lets you build protocols as policies. But you would still have to specify policy on top of it (not built into the system).
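One round of the filtered sync protocol, including move-out notifications, might be sketched as below. Knowledge is modelled as a flat set of version ids for brevity; the paper's item-set knowledge fragments are richer, so treat this as illustrative only.

```python
def sync(source_store, target_knowledge, target_filter, target_has):
    """One source -> target sync round (sketch).
    source_store:     dict item_id -> (version, tags)
    target_knowledge: set of version ids the target already knows
    target_filter:    tags -> bool (content-based selection)
    target_has:       set of item_ids currently stored at the target
    Returns (items to add, move-out notifications, updated knowledge)."""
    to_send, move_outs = [], []
    new_knowledge = set(target_knowledge)
    for item_id, (version, tags) in source_store.items():
        if version in target_knowledge:
            continue                   # target already knows this version
        if target_filter(tags):
            to_send.append((item_id, version, tags))
        elif item_id in target_has:
            # The item was updated out of the target's filter: notify the
            # target so it discards its stale copy.
            move_outs.append(item_id)
        new_knowledge.add(version)
    return to_send, move_outs, new_knowledge
```

This captures the two message types (item transfer and move-out notification); the suppression of move-outs for never-stored items is the performance refinement mentioned above.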

RPC Chains: Efficient Client-Server Communication in Geodistributed Systems

  • RPC chains are a new distributed computing primitive. Web services run things over a large number of machines, e.g. webmail. Client talks to a frontend server, but that might talk to an authentication server, a storage server, an advert server, etc. The webmail app is a composition of more-basic services. RPC is used to build this composition. (Could be Sun RPC, RMI, SOAP, etc.)
  • If RPC is synchronous, it is neither efficient nor natural. As the number of services involved grows, scalability becomes a challenge. As does heterogeneity. And geodiversity. RPC is rigid and inefficient, so we want to give developers more control, and better tools for doing this.
  • A more natural path would be for the request to flow through each of the backend services.
  • Related work in continuations and function shipping (also mobile agents, active networks, distributed workflows, and MapReduce/Dryad).
  • Assume applications built as a composition of services, where services export service functions via RPC. Services have a fixed location but may be shared. Also assume a single administrative domain (security is less of an issue, orthogonal).
  • RPC has a simple interface which hides the complexity of the distributed environment. Goal is to preserve simplicity while providing finer-grain control.
  • Embed chaining logic as part of an RPC call. Chaining function is portable code that connects services.
  • Implemented as a C# prototype. Chaining functions are implemented as C# static methods, which are stored in a central repository and cached by hosts. Example applications are a storage service and a webmail service.
  • Used an NFS server with an RPC interface. Modified client applications to do a chained copy (third party transfer). NFS server was unmodified. Evaluated it with a client in Redmond and two servers in Mountain View. Chain copy gave a 5x improvement in performance. Chain copy gets a peak throughput of 10.4MB/s against 4.5MB/s for RPC copy.
  • Chaining function enables chains of arbitrary length, dynamic chains and parallel chains. Can also optimise by composing sub-chains recursively (e.g. if a storage server must invoke a backup).
  • Now maintain a stack of chaining functions and execution state. Push the current chaining function and state onto the stack when a subchain starts, and pop it when it ends. Also evaluated chain copy with composition turned on (12–20% savings).
  • Chain splitting and merging. Multiple chains can execute concurrently and services may invoke multiple subchains in parallel. This requires support for chain splitting and merging. Need a way of specifying this. Need a dedicated merge host and merge function.
  • Also issues of debugging and profiling, exceptions, broken chains, isolation (limit damage caused by a bug in the chaining function) and dealing with legacy RPC servers.
  • Q: what happens to worst case performance (presuming you have to set higher timeouts)? Nodes along the path report back to the source to monitor liveness (of the request). This is handled as a broken chain.
  • Q: is the process of adding chains automated/automatable? Not automated at present, but in future work we’ll try to do this. Instead provide an abstraction close enough to RPC that it shouldn’t be difficult.
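The chained-copy example can be sketched as a loop driven by chaining logic: each hop's chaining function names the next service, so the data flows server-to-server and never returns to the client mid-chain. This is an illustrative Python skeleton, not the C# prototype's API.

```python
def chain_copy(state, services):
    """Drive a chain until the chaining logic names no next hop.
    services: hop name -> function executed at that service."""
    hop = state['next']
    while hop is not None:
        service_fn = services[hop]
        state = service_fn(state)        # runs at the remote service
        hop = state['next']              # chaining logic picks next hop
    return state                         # final reply goes to the client

def read_at_src(state):
    # At the source NFS server: read the file, ship it onward.
    state['data'] = 'file-bytes'
    state['next'] = 'dst'                # data goes straight to dst
    return state

def write_at_dst(state):
    # At the destination server: write, then end the chain.
    state['written'] = state['data']
    state['next'] = None
    return state
```

In the measured third-party transfer, this server-to-server flow is what turns two client round-trips per block into one hop, giving the reported ~5x throughput improvement.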


Studying Spamming Botnets Using Botlab

  • Botnets are a big problem (spam, DDoS, phishing often use botnets as the underlying infrastructure). They are a network of compromised or infected machines. However, they are hard to study and hence not well understood. Malware authors go to great lengths to obfuscate their code.
  • Automate botnet analysis using a black-box approach: execute the binary and study its behaviour. Also want to automate the finding and execution of bot binaries. Needs to be scalable and safe.
  • Peek into botnets using spam: a high-volume activity of botnets. Can collect information about botnets from the spam. Synthesising multiple sources of data in real-time gives accurate, high-fidelity information.
  • Characterise botnets by size, behaviour, types of spam sent, etc. New and better data gives better answers about this and improves our understanding about the botnet ecosystem.
  • Botlab is a system for monitoring botnets. It obtains bot binaries, performs an initial analysis, executes them safely, and outputs data.
  • Malware is traditionally collected through honeypots. Collected 2000 binaries over a 2-month period from honeypots. Spam botnets are a new generation that propagates through social engineering. Need to make honeypots perform active crawling to accumulate this. Also use a constant spam feed from UW’s mail servers (2.5 million emails per day of which 90% spam; 200000 email addresses; over 100000 unique URLs each day; 1% point to malicious binaries).
  • Want to discard duplicate binaries, but a simple hash is insufficient (because of obfuscation, repacking, etc.). Use network fingerprinting (DNS lookups, IPs, ports connected to) to build a behavioural profile instead. Also used to find binaries which perform VMM detection (if fingerprint is different between VMM and bare-metal, then you have something that tries to detect VMMs).
  • Want to run bots safely. Botlab should not cause harm. But it must be effective: you can’t just prevent traffic from leaving the system. Botlab drops traffic to privileged ports and known vulnerabilities; limits connection and data rates (to avoid DDoS); and redirects SMTP traffic to a fake mail server.
  • Manual adjustments (e.g. when a bot verifies over SMTP to its C&C server) needed to ensure that the bot functions.
  • Botlab tries to send 6 million emails per day (to hotmail, gmail, yahoo) from about 10 bot instances. Local view of spam producers but global view of spam produced.
  • Combining the spam sources gives a different perspective. Spam is received from almost every bot in the world: this gives a local view of the spam produced but a global view of the spam producers.
  • Want to combine these two sources of data. Observe that spam subjects are carefully chosen: there is no overlap in subjects sent by different botnets (489 different subjects/day/botnet).
  • Who is sending all the spam? 79% from just 6 botnets. 35% from the largest botnet (Srizbi).
  • What are some characteristics of the prominent botnets? Most contact only a small number of C&C servers. Send spam at 20–2000 messages/minute. Mailing lists are 100 million to 1 billion addresses, with a maximum overlap of 30% across two botnets. Active size is 16000–130000 nodes.
  • Are spam campaigns tied to specific botnets? Spam campaign defined as the contents of the webpage to which the spam URL points.
  • How does web hosting for spam relate to botnets? Many-to-many relationship between campaigns and web hosts. Does spam from a single botnet point to a single set of web servers? No, which suggests that hosting spam campaigns is a 3rd party service not tied to botnets. 80% of spam points to just 57 web server IPs.
  • Could use Botlab as a real-time database of bot-generated spam (including web links), which could be used for safer web browsing and better spam filtering.
  • Q: do botnets apply a sophisticated mechanism to back-off under user activity? Yes, some clever botnets try not to inconvenience you at all (Srizbi would look for mouse activity), but we weren’t actively using machines, so they worked at full throughput.
  • Q: what would the real sending rate be based on the presence of user activity? How would this affect the rate? Have not looked at this as it would depend on user behaviour.
  • Q: since number of botnets is small, could a botnet have several different masters? Based on the C&C servers, which imply the control structure of the botnet.
  • Q: who is providing the web hosting (as surely they would have a huge bandwidth provision)? We have information about this, but they serve static pages of constant size, and the bandwidth requirement is based on the small fraction who actually read the pages. Many hosted in South Korea.
  • Q: do bots contact their C&C server for reactivation? Not really, more for new data etc., except when they crash and require reactivation.
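The behavioural deduplication step above can be sketched as follows: hash the observed network behaviour (DNS lookups, contacted IPs, ports) rather than the binary bytes, which repacking and obfuscation change. The binary names, feature set and `network_fingerprint` helper here are hypothetical illustrations, not Botlab's actual implementation:

```python
import hashlib

def network_fingerprint(dns_lookups, contacted_ips, ports):
    """Build a behavioural fingerprint from observed network activity.

    Sorting makes the fingerprint independent of event order, so two
    differently-packed binaries with identical behaviour collide."""
    canonical = "|".join(sorted(dns_lookups) + sorted(contacted_ips) +
                         sorted(str(p) for p in ports))
    return hashlib.sha256(canonical.encode()).hexdigest()

def deduplicate(binaries):
    """Keep one representative binary per behavioural profile."""
    seen = {}
    for name, behaviour in binaries.items():
        fp = network_fingerprint(*behaviour)
        seen.setdefault(fp, name)
    return sorted(seen.values())

# Two repacked variants of the same bot share a fingerprint:
bots = {
    "bot_a.exe": (["cc.example.net"], ["10.0.0.1"], [25, 80]),
    "bot_a_repacked.exe": (["cc.example.net"], ["10.0.0.1"], [80, 25]),
    "bot_b.exe": (["other.example.org"], ["10.0.0.2"], [443]),
}
print(deduplicate(bots))  # → ['bot_a.exe', 'bot_b.exe']
```

The same comparison across a VMM run and a bare-metal run would, as the talk notes, expose binaries that change behaviour when they detect virtualisation.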

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

  • [Firefox crashed and lost my notes on this! Fortunately, other people are also blogging!]
  • Basic idea was to attest to real human interaction, using a small trusted attester. Attester is separated from the untrusted OS and applications using virtualisation (Xen disaggregation in this case). TPM sealing used to protect a key that is used for attestation, and stored in memory in the attester. Human interaction can trigger an attestation which may be used to sign e.g. emails or other interactions (clicks on adverts, votes, etc.). This gives a small window of opportunity in which a botnet could send fraudulent clicks, but this is pretty short, and the evaluation shows that it cuts down on the amount of spam that may be sent or fraudulent clicks that may be registered.
  • Q: if you can outsource CAPTCHA, couldn’t you outsource TPMs? No real benefit from this.
  • Q: a viable commercial OS needs to let you install third-party device drivers, so even if you assume trustworthy drivers, wouldn’t some drivers still need to be able to generate keystrokes? (Also, what about remote access to a machine?) Virtualisation is a convenient means to bootstrap this process. Could equally use trusted path techniques from Intel and AMD.
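A toy sketch of the attestation idea: sign an action only if human input happened within a short window. The key handling and window length are invented for illustration; the real system seals the key with a TPM and isolates the attester from the untrusted OS via Xen disaggregation:

```python
import hashlib
import hmac

ATTESTER_KEY = b"tpm-sealed-key"  # hypothetical; really sealed by the TPM
WINDOW_SECONDS = 1.0              # hypothetical attestation window

class Attester:
    """Isolated component that vouches for recent human interaction."""
    def __init__(self):
        self.last_input = None

    def human_input(self, now):
        """Called on a keyboard/mouse event from the trusted input path."""
        self.last_input = now

    def attest(self, action, now):
        """Sign the action only if human input occurred recently;
        otherwise refuse, so a bot with no user present gets nothing."""
        if self.last_input is None or now - self.last_input > WINDOW_SECONDS:
            return None
        return hmac.new(ATTESTER_KEY, action, hashlib.sha256).hexdigest()

a = Attester()
assert a.attest(b"send-mail", now=5.0) is None  # no human input yet
a.human_input(4.9)
token = a.attest(b"send-mail", now=5.0)         # within the window: signed
```

A verifier holding the corresponding key can then prioritise attested emails or clicks; a bot can only piggyback on the short window after genuine input, which is the residual risk mentioned in the talk.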

BotGraph: Large Scale Spamming Botnet Detection

  • Issue is web-account abuse attack. Zombie hosts can perform automated signup using CAPTCHA solvers. These accounts can then be used to send out spam.
  • Want to detect abused accounts based on Hotmail logs. Input is user activity traces (signup, login, email-sending), and goal is to stop aggressive account signup to limit outgoing spam.
  • At present, the attack is stealthy and large-scale. We need low false positive and false negative rates.
  • Designed a graph-based approach to detect attacks. Builds a user-user graph to capture bot-account correlations. Identified 26 million bot accounts with a low false positive rate. Implemented using Dryad/DryadLINQ on a 240-machine cluster.
  • Graph of signup history for a particular IP. Notice a spike in the signup count. Could use an exponentially weighted moving average (EWMA) to predict future signups. Where there is a large prediction error, you have an anomalous window, during which we suppose malicious accounts are being created.
  • Can detect stealthy accounts using graphs. Observe that bot accounts work collaboratively. Normal users share IP addresses in a single AS with DHCP assignment. Bot users are likely to share different IPs across ASs. Bot users form a giant connected component while normal users do not. Can use random graph theory to detect this. So detect giant connected components from the user-user graph, then use a hierarchical algorithm to identify correct groupings. Then prune normal-user groups (e.g. due to cell phone users, Facebook applications, etc.).
  • Increase edge weight threshold until the connected component breaks up.
  • Implemented in parallel on DryadLINQ. EWMA-based signup abuse detection can partition data by IP and achieve real-time detection. The user-user graph construction uses two algorithms and some optimisations and can process 200–300GB of data in 1.5 hours on 240 machines. Connected component extraction using divide-and-conquer in just 7 minutes.
  • Graph construction by selecting ID group by IP (map phase), then generating potential edges (reduce phase). Then select IP group by ID pair (map) and calculate edge weight (reduce). Problem is that the number of weight-1 edges is two orders of magnitude larger than other weights. Their computation/communication is unnecessary.
  • Second algorithm does selective filtering, which saves transferring weight-1 edges between nodes. Also used optimisations for compression and broadcast (and the Join functionality).
  • Evaluated detection on two datasets (Jun 2007 and Jan 2008). Three types of data: signup, login and sendmail logs. Bot IPs went from 82k to 241k between datasets (user accounts from 4.83 million to 16.41 million). The anomaly window shrank from 1.45 to 1.01 days.
  • Validated using a manual check (sample groups sent to Hotmail team; almost no false positives). Also through comparison with known-spammers (detected 86% of complained-about accounts, and 54% of the detected accounts are new findings). False positive rate is very low.
  • How can you evade BotGraph? Be stealthy in your signups (sign up less). Also fix a binding to IP or AS, which lowers your utilisation rate. Accounts bound to a single host will easily be grouped. Or send few emails (like a normal user). All these approaches limit spam throughput.
  • Q: what is the relationship between this and Sybil-detection using a random walk? Not sure if random walk can be used to detect spam. Why do bots have to communicate with each other? Graph is just a way to cluster the bots.
  • Q: false positive rate of 0.5% on tens of millions of accounts seems like a concern? Absolute value is pretty large. Quite conservative; real false positive rate may be lower. Could also be a starting point from which more sophisticated and costly approaches could be used.
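The EWMA-based signup anomaly detection can be sketched roughly as below; the smoothing factor and spike threshold are hypothetical choices, not the paper's parameters:

```python
def ewma_anomalies(counts, alpha=0.1, threshold=3.0):
    """Flag windows whose signup count exceeds the EWMA prediction by
    more than `threshold` times (both parameters hypothetical)."""
    prediction, anomalies = counts[0], []
    for i, observed in enumerate(counts[1:], start=1):
        if prediction > 0 and observed / prediction > threshold:
            anomalies.append(i)  # large prediction error: anomalous window
        prediction = alpha * observed + (1 - alpha) * prediction
    return anomalies

# A signup spike at one IP, as in the talk's example:
history = [3, 4, 3, 5, 4, 250, 300, 4, 3]
print(ewma_anomalies(history))  # → [5, 6]
```

Because the detection is per-IP, the data can be partitioned by IP and run in real time, as the talk notes; the stealthier accounts are then left for the user-user graph stage.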

Network Management

Unraveling the Complexity of Network Management

  • Enterprise networks are complicated: topologies, diverse devices, tweaked network configuration and diverse goals.
  • Example of a configuration change: adding a new department with hosts spread across three buildings. Need to reconfigure routers, and an error can lead to outages or loopholes.
  • Complexity leads to misconfiguration. There is no metric that captures this, and it’s difficult to reason about the difficulty of future changes, or for selecting between possible changes.
  • Defined a set of complexity metrics, based on an empirical study of complexity in 7 networks. Metrics were validated by questionnaire sent to network operators (in public and private enterprises). Questionnaire had tasks to quantify complexity, either network-specific or common to all operators. The metrics focus on layer 3.
  • Complexity is unrelated to size or line count. The network with the largest mean file size had a “simple” configuration, as did the one with the most routers.
  • Implementation complexity (referential dependence and different roles for routers) and inherent complexity (uniformity). Inherent complexity provides a lower bound for implementation complexity.
  • Referential dependency metric. Look at the referential graph between stanzas of the configuration files. (E.g. a router stanza contains a line that references an interface stanza.) Could have intra- and inter-file links. Inter-file links correspond to global network symbols, e.g. the subnet and VLANs. Operators try to reduce dependency chains in their configurations so as to have few moving parts (dependencies). Metric should capture the difficulty of setting up layer 3 functionality and the extent of dependencies.
  • Metric is the number of referential links, normalised by the number of devices. Greater number of links implies higher complexity.
  • Another metric is the number of routing instances (i.e. a partition of routing protocols into largest atomic domains of control).
  • Largest network (83 routers) has only 8 average ref links per router, which is low (simple).
  • Gave operators a task of adding a new subnet at a randomly chosen router. The metric increased monotonically with task difficulty, though it is not an absolute measure.
  • Inherent complexity: policies determine a network’s design and configuration complexity. Where policies are uniform, this is easy to configure, but special cases make it hard to configure. Challenge was to mine implemented policies and quantify similarities and consistency.
  • Policies were captured with reachability sets (i.e. the set of packets allowed between 2 routers). These imply a connectivity matrix between routers, which is affected by data/control plane mechanisms. Get a uniformity metric, which is the entropy of reachability sets. Simple policies show entropy values close to ideal. Simple policies have filtering at higher levels. Also discovered a bug in one configuration because it had close-to-ideal entropy when it should not have.
  • Some networks are simple, but most are complex. Most networks studied have inherently simple policies (so it was more implementation complexity).
  • In one network, get a high referential link count due to dangling references (to interfaces). Another was complex because the network was in the middle of a restructuring.
  • Future work to look at ISP networks, and consider absolute versus relative complexity.
  • Q: did operators introduce the right kind of complexity? Yes, they knew what they were doing. Does metric help them or did they know they were in trouble? They knew they were in trouble.
  • Q: is there a reason for normalising by the number of devices? This helps compare between two networks of different sizes. However, it’s a first attempt, and this may be refined.
  • Q: have you thought about the complexity of provisioning against the runtime complexity? Something might be easy to provision but could go wrong horribly if it fails? Not looked at that yet.
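The uniformity metric (entropy over reachability sets) can be illustrated with a small sketch. The service names are invented, and this simplification treats each router pair's reachability set as an unordered set of allowed services:

```python
from collections import Counter
from math import log2

def uniformity_entropy(reachability_sets):
    """Entropy of the distribution of distinct reachability sets.

    A single distinct set (perfectly uniform policy) gives entropy 0;
    many special cases drive the entropy up, signalling inherent
    complexity in the implemented policy."""
    counts = Counter(frozenset(s) for s in reachability_sets)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical reachability sets for 8 router pairs:
uniform = [{"http", "smtp"}] * 8
special_cases = [{"http", "smtp"}] * 4 + [{"http"}] * 2 + [{"smtp"}, set()]
assert uniformity_entropy(uniform) == 0.0
print(uniformity_entropy(special_cases))  # → 1.75
```

This matches the observation in the talk that simple policies show entropy close to the ideal, and that a configuration whose entropy is unexpectedly close to ideal (or far from it) can reveal a bug.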

NetPrints: Diagnosing Home Network Misconfigurations Using Shared Knowledge

  • Typical home network has multiple devices connected to a (wireless) cable modem, which connects to the internet. You might be running diverse applications and have diverse firewall and security requirements. Very heterogeneous. Worse, there is no network administrator, so how do you manage it?
  • Looked at examples of problems faced in home networking. Some caused by home router misconfiguration, some by end-host misconfiguration and some by remote-host misconfiguration (that may nevertheless be solved locally, e.g. by changing MTU).
  • Users take on average 2 hours to solve these problems. NetPrints is network problem fingerprinting. It automates problem diagnosis using “shared knowledge”. NetPrints subscribers occasionally submit network configuration information to the NetPrints service. On receiving a problem notice, the NetPrints service can suggest a solution.
  • In context with rule-based techniques: these are too application specific (require too many rules). Also local configuration issue resolvers (like Autobash, etc.). NetPrints combines these approaches. It has to deal with unstructured, heterogeneous environments, and solve problems due to the interaction of multiple configurations.
  • Two basic assumptions: connectivity is available (application-layer problems only; knowledge base could be shipped offline, however), and we don’t look at performance (only “good” and “bad” states).
  • Example of user tries a VPN connection from home, and it fails. Enters application name into NetPrints, configurations are scraped, and this information is shipped off to the NetPrints server. NetPrints suggests a fix, and the client applies it directly.
  • Three diagnosis strategies: snapshot-based (collect snapshots from different users), change-based (collect the configuration changes that a user makes) and symptom-based (collect signatures of problems from network traffic).
  • NetPrints has two operating modes: normal (collecting information) and diagnose mode (when people complain).
  • The configuration scraper scrapes from the router (using UPnP to get basic information, and the web-based interface (HTTP Request Hijacking)), the end-host (interface-specific parameters, patches, software versions and firewall rules) and the remote system (composition of local and remote configurations).
  • Server knowledgebase stores per-application decision trees with the popular C4.5 decision tree learning algorithm.
  • Evaluation methodology: testbed with 7 different wireless routers, clients running the VPN client sent configuration information to the NetPrints service (6000 configuration parameters per snapshot), and then service learned these using C4.5.
  • Configuration tree is not too large for the applications we have seen.
  • Configuration mutation can be done automatically by walking the decision tree. However, don’t want useless advice (e.g. change your router manufacturer) when a soft parameter could be changed. Therefore track the frequency of configuration changes and use this to inform what the cost of a change would be.
  • Another technique that uses network traffic signatures and change trees to diagnose other problems.
  • Evaluated in three different scenarios. First: VPN client in home network talks to VPN server outside. Second: want connect from outside to an FTP server inside. Third: file sharing within the home network.
  • Found some intuitive inferences (need pptp_pass=1 for VPN to work), and some surprising inferences (to do with the stateful firewall being off).
  • NetPrints uses labelled data to learn its knowledgebase. A 13–17% mislabelling rate causes only a 1% error in diagnosis.
  • Q: what if you have a problem that none of the existing decision trees have seen? Did you consider merging decision trees? If you don’t have enough data, you should be able to use the knowledgebase of a similar application to solve the problem. Currently looking at calculating the similarity of applications.
  • Q: what if there are user-specific constraints that the decision tree would suggest you break? No user-specific weights in the current application. Could send back multiple suggestions (with weights) and combine them with local user policy.
  • Q: do you have a sense of whether the trees would still apply if you looked at other problems (beyond connectivity management)? Could have, e.g., an application that fails only with certain inputs, but not with others. Haven’t faced that yet, but intuition is that the trees would still be pretty small.
  • Q: what happens when the user fails to report something that could be crucial? System is limited to configuration that we can capture. Doesn’t deal with transient outages. Could look at work from CHI?
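A rough sketch of the diagnosis idea: learn which configuration parameter separates “good” from “bad” snapshots, then suggest a working value. NetPrints uses the C4.5 decision tree learner; this stand-in uses a crude single split by misclassification count. The `pptp_pass` example is borrowed from the talk; everything else is hypothetical:

```python
def best_split(snapshots):
    """Pick the config parameter that best separates good from bad
    snapshots (a crude stand-in for C4.5's information-gain split)."""
    params = snapshots[0][0].keys()
    def errors(p):
        # Majority label per value of p; count the disagreements.
        by_value = {}
        for config, label in snapshots:
            by_value.setdefault(config[p], []).append(label)
        return sum(len(ls) - max(ls.count("good"), ls.count("bad"))
                   for ls in by_value.values())
    return min(params, key=errors)

def suggest_fix(snapshots, broken_config):
    """Suggest flipping the most discriminating parameter to a value
    observed in a working configuration, or None if it already matches."""
    p = best_split(snapshots)
    good_values = {c[p] for c, label in snapshots if label == "good"}
    if broken_config[p] not in good_values:
        return (p, sorted(good_values)[0])
    return None

# Hypothetical VPN snapshots: pptp_pass is what actually matters.
snaps = [
    ({"pptp_pass": 1, "upnp": 0}, "good"),
    ({"pptp_pass": 1, "upnp": 1}, "good"),
    ({"pptp_pass": 0, "upnp": 1}, "bad"),
    ({"pptp_pass": 0, "upnp": 0}, "bad"),
]
print(suggest_fix(snaps, {"pptp_pass": 0, "upnp": 1}))  # → ('pptp_pass', 1)
```

The change-frequency weighting described above would then rank candidate mutations, so that a soft parameter is suggested before “change your router manufacturer”.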

Green Networked Systems

Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage

  • Power and energy efficiency are key drivers today. How can I make my laptop last longer? How can I lower the power consumption of my desktop machine (environmental impact/cost)?
  • IT equipment consumes significant power, but shutdown opportunities are rarely used. 67% of office PCs are left on after work hours (sleep and hibernate modes used in less than 4% of these PCs). Home PCs left on for 34% of the time; 50% of the time not being used. Also a problem at CSE@UCSD.
  • People leave machines on to maintain state, occasional remote access (SSH/VNC; administrator access) and active applications running all night (long downloads, maintaining IM presence).
  • Occasional access and active applications cannot be handled by sleep modes (maintaining state obviously can).
  • Hosts are today either active/awake or inactive/asleep. Power consumption is 100x in awake compared to asleep. The network assumes that hosts are always connected. What we really want is something that provides the functionality of an awake host while only consuming the power of sleep mode.
  • Want to do this with availability across the entire protocol stack without making changes to the infrastructure or user behaviour. Can we achieve this with a low-power secondary processor (low-power CPU, DRAM and flash memory storage)? Want power consumption when secondary processor is active to be ~1W. Could we put this on the NIC?
  • Add a separate power domain to the NIC, powered on when host is asleep, with secondary processor device, and same MAC/IP address as the PC. Can wake up host when needed, but also handle some applications using application stubs on the secondary processor.
  • Stateless applications can be supported using “filters”; compare to Wake-on-LAN (which is either impractical (too many wakeups) or it affects usability (need infrastructure to send special packets to do the waking)). Somniloquy specifies filters at any layer of the network stack.
  • Stateful applications can be supported using “stubs”: need application specific code on the secondary processor, but the secondary processor is limited in resources. Stub is simplified version of the application; stub code is generated manually. Investigating how to do this automatically. Done for BitTorrent, web downloads and IM.
  • Built using the gumstix platform. PXA270 processor with full TCP/IP stack. USB connection to PC for sleep detection/wakeup trigger; power while asleep and IP networking for data. Wired and wireless prototypes. *-1NIC (augmented NIC) and *-2NIC (uses PC-internal interface while awake and simplifies legacy support).
  • Built a physical prototype. Two USB interfaces (one for power and USB networking; other for control).
  • Evaluated to maintain network reachability, and look at stateless and stateful applications.
  • Reachability: 4–5 second outage while desktop goes to sleep or comes back from sleep. This is because sleep transition is not optimised for latency.
  • Look at setup latency for stateless applications: incoming TCP SYN causes wakeup. Additional Somniloquy latency is 3–10s. As a proportion of an interactive session, this is probably OK.
  • Gumstix power consumption is between 290mW (WiFi) and 1W (Ethernet).
  • Lowest power consumption without using Somniloquy (in Dell Optiplex 745) is 93.1W (down from 102.1W in normal idle state). In S3 suspend state, this is just 1.2W. Total power consumption for Somniloquy is 5W. Assuming a 45-hour work week, could save 620kWh per year, which is $56, or some amount of carbon dioxide.
  • Extends laptop battery life from 6 hours to 60 hours. (Power drops from 11W to 1W.)
  • Used desktop PC trace data from next talk, using 24 desktop PCs (ON, sleep, idle, OFF durations). Identified energy savings.
  • Energy savings using stateful applications: web download stub. 200MB flash; download when Desktop PC is asleep. Wake up PC to upload data whenever needed. Uses 92% less energy than using the host PC for the download.
  • Other details in the paper: how to build application stubs, and a model of energy savings validated with real measurements.
  • Q: how is this different from the small screen attached to the laptop? Windows Sideshow. This is not an active device; only shows you information from before the laptop went to sleep. This could be augmented with Somniloquy.
  • Q: why do you have the design constraint of only making changes to the client and not to the network? For individual desktops and laptops, Somniloquy is the way to go. For cost, it might be better to have a dedicated machine, but this adds security issues and other overheads.
  • Q: could you not look at zombie state resume? Unfortunate thing is that currently you have an either/or state, and everything is either on or off.
  • Q: what is the energy cost associated with sleeping (to disk) and resuming (from disk)? Could you end up waking and sleeping more (due to e.g. the wakeup filters for stateless applications), and hence using more energy? Results account for the energy cost of suspend and resume. As long as it’s not once every one or two minutes, we’ll be alright.
  • Q: how much in the way of latency savings could be achieved by integrating the device into the motherboard? Important design point is that you don’t require everything to be on to power the device.
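The back-of-envelope saving quoted above can be reproduced with a tiny model, assuming the machine would otherwise sit at its ~102W idle draw outside a 45-hour work week while Somniloquy draws ~5W:

```python
# Figures from the talk: normal idle 102.1W, Somniloquy total 5W,
# 45-hour work week. Everything else is simple arithmetic.
HOURS_PER_YEAR = 52 * (7 * 24)
WORK_HOURS_PER_YEAR = 52 * 45

def annual_savings_kwh(idle_watts=102.1, somniloquy_watts=5.0):
    """kWh saved per year by sleeping (via Somniloquy) outside work hours."""
    idle_hours = HOURS_PER_YEAR - WORK_HOURS_PER_YEAR
    return (idle_watts - somniloquy_watts) * idle_hours / 1000.0

print(round(annual_savings_kwh()))  # ~620 kWh/year, matching the talk
```

At the quoted $0.09/kWh implied by the talk's $56 figure, the dollar saving follows directly from the same model.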

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

  • Networked systems draw significant power even when idle. Either go into a low-power sleep state (S3) and lose network presence; or remain powered on and waste power.
  • Several proposals have looked at this: Wake-on-(W)LAN, special wakeup packets, Network Connection Proxy. These systems have so far seen little use. But there has been little evaluation of the potential for energy savings, and little exploration of the design space for a network proxy.
  • Is the problem worth solving? What is the design space? What protocols should be handled and how?
  • This work is trace-driven evaluation of the broader benefits and design tradeoffs. Focus on the types of energy savings that can be obtained with simple techniques.
  • Collected traces from 250 Intel host computers (90% laptops, 10% desktops), in the office and at home. Over 4 weeks in spring 2007. Traces contain packet traces, flow information, logs of keyboard and mouse activity, power state, etc.
  • Look at desktop machines and their power states. On average, they are idle for >50% of the time. Therefore desktops waste >60% of their energy while idle. Given there are 170 million desktop PCs in the US, this translates into 60TWh/year wasted (about $6billion).
  • Do we need proxying or should we just Wake-on-LAN? Depends on the time that it takes to wake up and go back to sleep. PCs receive 1–4 packets per second while idle. Optimistic assumption that transition time (sleep-to-wake-to-sleep) is 10s. Office environment does not give any savings.
  • Differentiate between relevant and irrelevant packets, and only wake for relevant? Or respond with some proxy device? Or do complex processing on the proxy? Option between transparent (default wake) or non-transparent (default ignore) solution.
  • Deconstruct by protocol. Some protocols are responsible for “poor sleep”. Enumerate them and decide how to handle them (ignore/respond/wake).
  • Define half-time sleep, which measures a protocol’s role in preventing sleep: how much can we sleep if we wake for protocol P? Compute sleep for discrete transition times. ts_50 (half-time sleep) is the largest transition time for which sleep > 50%. The lower ts_50(P) is for protocol P, the worse an offender P is.
  • Majority of traffic is broadcast or multicast. Mostly useless network chatter. ARP can be handled by simple responses (just know IP address of a particular machine). IPX and NetBIOS packets are probably uninteresting to the machine. Can also ignore HSRP and PIM. IGMP and SSDP can be handled by simple responses.
  • Look now at unicast. Look at TCP and UDP ports. Also TCP keep-alives and TCP SYNs. Some can be handled with easy approaches. But some (SMB/CIFS and DCE/RPC) require special handling because they are used by many applications.
  • Transparent proxies might be good for home, but not office computers. Cannot handle unicast well unless it’s complex. Non-transparent proxies (even simple ones) are very efficient.
  • Architecture is a list of (trigger, action) rules. Agnostic to where the proxy runs (NIC, server running on same LAN, router, firewall). Example implementation as a standalone machine on the same LAN, implemented in Click. Used a simple (non-transparent) set of rules. No explicit state transfer between sleeping machine and the proxy; just learn state by sniffing traffic. So no modifications needed at end systems.
  • Q: where is the line between idleness that can be exploited for future better performance and idleness when we should power down? Have scheduled wakeups when we do things in batches. Even when exploiting idleness, we don’t do it at 100% utilisation. Instead, have periodic spells of high utilisation scheduled.
  • Q: [Missed.] If the machine was asleep to start with, some traffic wouldn’t arise. This solution couldn’t really help.
  • Q: do you think that the problem can be solved completely by proxy, or would it be better to involve applications and protocol development? Not sure what the best answer would be, but it would definitely be better to have protocols that work well with power saving. Have seen examples of bad applications that wake the computer up every couple of minutes.
  • Q: [Missed.] If we had the same periodicity of exchanges, we could just be awake for that, but we are not sure how to design protocols to achieve that.
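The half-time-sleep (ts_50) metric can be sketched as follows, assuming the host wakes for every packet and pays a fixed transition cost per wake. The traces and candidate transition times are made up for illustration:

```python
def sleep_fraction(interarrival_s, transition_s):
    """Fraction of time asleep if the host wakes for every packet and
    each wake costs `transition_s` seconds of awake time."""
    asleep = sum(max(0.0, gap - transition_s) for gap in interarrival_s)
    return asleep / sum(interarrival_s)

def ts_50(interarrival_s, candidates=(0.1, 0.5, 1, 2, 5, 10, 30)):
    """Largest transition time still allowing >50% sleep: the talk's
    half-time-sleep metric, computed over discrete candidates."""
    ok = [t for t in candidates if sleep_fraction(interarrival_s, t) > 0.5]
    return max(ok) if ok else None

# One packet every 0.5s (office-like chatter) vs. one every 60s:
office = [0.5] * 100
quiet = [60.0] * 100
print(ts_50(office), ts_50(quiet))  # chatter tolerates only tiny transitions
```

This mirrors the talk's conclusion: with 1–4 packets/second of idle chatter and a 10s sleep-to-wake-to-sleep transition, plain Wake-on-LAN yields no savings in the office, so the chattiest protocols must be ignored or proxied rather than woken for.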

NSDI 2009: Day 1

Wednesday, April 22nd, 2009

Trust and Privacy

TrInc: Small Trusted Hardware for Large Distributed Systems

  • Won the best paper award.
  • New primitive for trust in distributed systems, which can no longer be taken for granted.
  • Talking today about equivocation: providing different results to different clients.
  • Example is the Byzantine Generals Problem (tell one advance, tell the other retreat): could imagine a corrupt server behaving similarly.
  • Also a voting system (tell one user that vote has been counted; send another a tally excluding the vote).
  • Also BitTorrent (lying about which pieces of a file you have).
  • With f malicious users, need 3f+1 users in a completely untrusted system; but without equivocation, just need a simple majority of non-malicious users.
  • So use trusted hardware at all participants to make equivocation impossible!
  • Must be small, so that it can be ubiquitous, tamper-resistant and easily-verifiable. (Idea: send it as part of a figurine with World of Warcraft.)
  • This paper introduces TrInc (a new primitive to eliminate equivocation), some applications of TrInc and an implementation in currently-available hardware.
  • What is the smallest thing possible that makes equivocation impossible? All you need are a counter and a key. TrInc = trusted incrementer: a monotonically increasing counter and a key for signing attestations. Attestations bind data to counters.
  • Operation e.g.: “Bind this data to counter value 36.” TrInc checks to see if this actually increases the counter, and returns a tuple of (old counter, new counter, data), signed by attestation key.
  • Two kinds of attestations: advance (moves counter forward; can only happen once; attests that nothing is bound to intermediate values), and status attestation (doesn’t advance counter, attests to current value and that nothing has yet been attested to with a higher counter value).
  • In practice, might want multiple counters. A “trinket” is some hardware with >= 1 counter.
  • TrInc is practical: can use the TPM to implement it (and this has massive penetration in x86 machines). The TPM is tamper-resistant, has 4 counters, can do crypto and has a small amount of storage. TPM merely lacks the right interface.
  • Applications. Ensure freshness in DHTs, BFT with fewer nodes and messages, etc.
  • Implementing a trusted log in TrInc: append-only (ensure that new data goes at end of the log), and lookup (no equivocation on what is or isn’t stored). Obviously can’t store the log in the trinket; instead put it in untrusted storage.
  • Use the counter to attest to a datum’s position in the log (the counter is the location in the log). Append by attesting the data to the next counter value. For lookup, only one valid attestation can correspond to a move into a new counter value. Using the old counter values in the attestations to prove that there are no holes in the log.
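A hedged sketch of that construction: the counter value doubles as the log position, the entries live in untrusted storage, and verification checks that consecutive attestations chain with no holes. All names are illustrative, and an HMAC stands in for the trinket's signing key.

```python
import hashlib
import hmac

KEY = b"trinket-key"  # held inside the trusted hardware in the real design

def attest(old: int, new: int, data: bytes) -> bytes:
    # Stand-in for the trinket's signed advance attestation.
    return hmac.new(KEY, b"%d|%d|%s" % (old, new, data), hashlib.sha256).digest()

class TrustedLog:
    """Append-only log kept in untrusted storage; the counter is the position."""

    def __init__(self):
        self.counter = 0   # lives in the trinket
        self.entries = []  # (old, new, data, attestation) in untrusted storage

    def append(self, data: bytes):
        old, new = self.counter, self.counter + 1
        self.counter = new
        self.entries.append((old, new, data, attest(old, new, data)))

    def verify(self) -> bool:
        # No holes: each attestation's old counter must equal the previous
        # entry's new counter, and each signature must check out.
        prev = 0
        for old, new, data, sig in self.entries:
            if old != prev or sig != attest(old, new, data):
                return False
            prev = new
        return prev == self.counter
```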
  • Attested Append-Only Memory can do this too: by construction, TrInc solves all of the same problems.
  • Preventing under-reporting in BitTorrent. Peers represent which pieces of a file they have using a bitfield, and exchange these with each other. Selfish peers have an incentive to under-report what they have: it yields prolonged interest from others and leads to faster download times. This is equivocation! When a peer receives a block, it acknowledges receipt (to the original provider), then tells others that it doesn’t have it.
  • In BitTorrent, counter is the number of pieces that the peer has downloaded. Peers attest to the bitfield and the most recent piece received. Peers attest when they receive a piece (as an ack) and when they sync counters with other peers.
  • When receiving a block, attest “I have (some collection of blocks) and most recently received (one of those blocks).” Check that the counter matches the bitfield and that the most recent piece is attested to. Kick out nodes which lie, to create an incentive.
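The consistency check a peer performs might look like this minimal sketch (signature verification omitted; the function name and bitfield encoding are assumptions):

```python
def check_attestation(counter: int, bitfield: int, latest_piece: int) -> bool:
    """Sanity-check a peer's TrInc-style attestation: the counter must equal
    the number of pieces the bitfield claims, and the most recently received
    piece must be among them. Toy encoding: bit i set means piece i is held."""
    has_latest = bool(bitfield >> latest_piece & 1)
    return bin(bitfield).count("1") == counter and has_latest
```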
  • Attest to the latest piece to avoid an attack that buffers received pieces and under-reports. Without sending the full log, need to ensure proper behaviour at each step.
  • Evaluated with a macrobenchmark on BitTorrent (solves under-reporting), A2M (higher throughput than A2M and reduces hardware requirements), and PeerReview.
  • Implemented on the Gemalto .NET smartcard: a few dozen lines of C#. Implemented all of the case studies.
  • Evaluated implementation with microbenchmarks: operations take on the order of milliseconds. Can use asymmetric (slow) or symmetric (2x faster) crypto. Takes 32ms just to write a counter.
  • Trusted hardware is typically used for bootstrapping, not for interactive use, but TrInc makes this hardware an intrinsic part of the protocol. The hardware can be faster; there just hasn’t been a call for it yet.
  • Q: about the BT protocol, one could potentially attack TrInc by having multiple identities with multiple counters. What could you do at this level to address this attack? You are limited to attesting with multiple counters, and this would be the case even if you had multiple machines. But by under-reporting to some, you would not be giving back the expected attestation, which cuts off the number of people you can trade with. So is that sufficiently worthwhile? It’s not clear.
  • Q: about voting, these hardware devices are designed for “low-value” trusted applications. Do you see a barrier where TrInc would not be applicable? Looked at digital currency and wondered how much trust you would put in the tamper-resilience. For mission critical applications, you would spend more money on the tamper-resilience, or use a more complicated protocol.
  • Q: did you consider putting this on an actual TPM? Wanted to design the interface for TrInc; TPM doesn’t provide this.
  • Q: could you talk about the counter size, overflow, etc.? Overflow is impossible because you are setting the new counter value (not incrementing it), which is checked in the card itself. Resetting the counter increases a “meta-counter” (another TrInc) which gives each counter its own ID: effectively a session ID.

Sybil-Resilient Online Content Voting

  • Many websites encourage users to vote for different types of content (e.g. Digg). Sybil attacks can pollute the results (promoting spam links on Digg).
  • Talk today about defending against this kind of attack, and how they implemented it on Digg.
  • Hard to defend against Sybil attack because an open system lets an attacker join easily. CAPTCHA etc. are insufficient. Need a resource that cannot be acquired in abundance: use social network links.
  • Edges between genuine friends and subnetwork of attacker sybils are the attack edges. Hypothesise that the number of these is small.
  • Assume you can collect all votes and the social graph. Can be binary vote or multiple choice vote. Goal is to collect a subset of votes that includes mostly votes from real people (but might include some from Sybils).
  • Designate a vote collector, use max-flow to collect votes and then assign appropriate link capacities.
  • Need to break symmetry (Sybil network can exactly mirror the real social network), so designate a known non-attacker as the “vote collector”. Then use max-flow to the vote collector: bogus votes are congested at the small number of attack edges. Honest votes are congested at edges closer to the collector. Attack edges should be farther away from the collector. So give more capacity to the edges that are closer to the collector.
  • System is called “SumUp”: designed to assign capacity in the graph and leverage user feedback.
  • Assign capacity to collect at most v votes (ideally the number of honest votes, estimated using a separate mechanism). Give greater capacity to edges nearer collector, using process called “ticket distribution”: give equal fraction of tickets (initially v) to all edges out from the collector. Each node consumes one ticket, and distributes the rest to each of its outgoing links. Constructs a “vote envelope” around the collector.
  • Observe that when number of honest votes >> v, the number of collected votes is roughly equal to v. When it is << v, the number of collected votes << v. So iteratively (and exponentially) adjust v until at least 0.5 * v votes are collected.
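The ticket-distribution process can be sketched as a breadth-first pass from the collector. This simplifies SumUp's actual rules (the paper's capacity assignment and vote-envelope details differ), but it shows the key property: edges nearer the collector carry more capacity.

```python
from collections import deque

def ticket_distribution(graph, collector, v):
    """Hedged sketch of SumUp-style ticket distribution: the collector starts
    with v tickets; each node consumes one and splits the rest evenly among
    its outgoing edges, so capacity decays with distance from the collector."""
    tickets = {collector: v}
    capacity = {}  # edge (u, v) -> assigned capacity
    queue = deque([collector])
    seen = {collector}
    while queue:
        node = queue.popleft()
        out = [n for n in graph.get(node, []) if n not in seen]
        remaining = max(tickets.get(node, 0) - 1, 0)  # node consumes one ticket
        for nbr in out:
            share = remaining // len(out)
            capacity[(node, nbr)] = 1 + share  # every edge has capacity >= 1
            tickets[nbr] = tickets.get(nbr, 0) + share
            seen.add(nbr)
            queue.append(nbr)
    return capacity
```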
  • Prove that the number of bogus votes is limited to the number of attack edges plus a constant factor.
  • Also prove that a large fraction of the honest votes are collected.
  • Can do better by using feedback from the vote collector, if it can tag some votes as bogus. Then reduce capacity on attack edges close to the collector (or possibly ignore them altogether). Idea is to penalise all edges along the path taken by the bogus vote (because we know that one of these is the attack edge).
  • Associate a penalty with each link: initially all zero. When a bogus vote is tagged, penalise the edge by 1/capacity. Links with a higher penalty receive fewer tickets. Ultimately eliminate links with a high penalty.
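The feedback step is simple to state; a minimal sketch, assuming edges are (u, v) tuples and capacities come from the ticket distribution:

```python
def penalise_path(penalty, capacity, path):
    """When the collector tags a vote as bogus, charge every link on its path
    1/capacity, since one of them must be the attack edge. Links accumulating
    high penalties then receive fewer tickets (hedged sketch)."""
    for edge in path:
        penalty[edge] = penalty.get(edge, 0.0) + 1.0 / capacity[edge]
    return penalty
```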
  • Evaluated on real social networks, and real Sybil attacks.
  • Applied to YouTube (0.5M), Flickr (1.5M) and synthetic (3M) social graphs.
  • As the fraction of honest votes increases past 0.1, the average number of bogus votes per attack edge increased sharply (up to 5 per edge) in all three graphs.
  • The fraction of honest votes collected is always > 90%.
  • Looked at real Sybil attack on Digg (positive and negative votes on articles). Digg maintains 130,000 “popular” articles among 7 million articles, using an undisclosed algorithm. Digg has a 3M node social network, with 0.5M nodes in a connected component. 80% of votes are from the connected component. (Data obtained by crawling Digg.)
  • Made the Digg founder (Kevin Rose) the vote collector. Manually sampled 30 articles. Found subjective evidence of attacks in 15 articles (one was an advert; 10 had votes from newly-registered voters; 4 received <50 votes after being marked "popular").
  • Observe that suspicious articles receive more negative votes (based on 5794 “popular articles”).
  • Q: even if SumUp can give you the attack edges, it would be difficult to defend against attacks in recommendation systems where there are a small number of honest nodes, so just by compromising a few hundred honest nodes (e.g. using a botnet), it would be possible to overwhelm the system. How does SumUp deal with this? SumUp doesn’t deal with this case.
  • Q: is there a dependency on the location of the collector? Could you manipulate the graph to place attack edges near to the vote collector? Yes, so the feedback mechanism is important.

Bunker: A Privacy-Oriented Platform for Network Tracing

  • Bunker anonymises data that it collects and offers software engineering benefits.
  • Network tracing used for traffic engineering, fault diagnosis, recovery, research studies. But customer privacy is very important to ISPs. Raw data is a liability for ISPs (lost, stolen, subpoenaed, etc.).
  • So nobody can have access to the raw data (ISPs always say no). Anonymising the data can help to mitigate privacy concerns. Anonymisation is a form of obfuscation that destroys personally-identifying data.
  • Could do anon. offline or online. Offline has high privacy risks (do it after collecting the trace). Online has high engineering costs (need to anon. the trace simultaneously with collection, at line speed).
  • A regex for phishing (looks for forms that take PINs, usernames, passwords, etc.) using libpcre takes 5.5s to process 30MB of data (44Mbps maximum). But we want to look at multiple-gigabit links with multiple regexes.
  • Want the best of both worlds. So buffer raw data on disk, but the only thing that comes out is the anonymised trace.
  • Bunker is a closed box that protects sensitive data. Contains all raw data and processing code. Restricted access to the box (e.g. no console). Make the box “safe-on-reboot”: when it is rebooted, clear the ECC RAM using the BIOS after reboot, and use encryption to protect on-disk data. Use an encryption key held in RAM inside the closed box. Data on disk cannot be decrypted after reboot.
  • Design: online capture module, and offline TCP assembly, parsing and anonymisation modules. One-way interface that passes out the anon. data. Add an online encryption module and offline decryption module to store data on disk.
  • Closed-box VM built on Xen hypervisor. Also an open-box VM on the same platform which provides access to the trace, accessed using a separate NIC.
  • Closed-box implementation: no I/O or drivers in this VM except those needed (custom-made menuconfig). Use firewalls to restrict network communication (e.g. standard iptables config).
  • On boot, one of two configurations may be selected. Debugging config enables all drivers and allows access to the closed box. Tracing configuration eliminates most I/O and drivers. On choosing tracing config, the display and keyboard freeze (as there are no drivers), the kernel’s init runs a script to start the trace, and the operator can only log in to the open box using its dedicated NIC.
  • Gives strong privacy properties, and allows trace processing to be done offline (in your favourite language, e.g. Python).
  • Bunker has a large TCB, but narrow interfaces: it remains secure as long as a vulnerability cannot be exploited through narrow interfaces. Three classes of attacks: closed-box interfaces, hardware and trace injection.
  • Assume that it is hard to attack a VM from another VM. Enumerate each of the interfaces and reason that the defences are secure.
  • Safe-on-reboot eliminates most hardware attacks. One remaining is extracting keys from RAM while the system is running (cold boot attacks, bus monitor, special (e.g. FireWire) device to dump RAM without OS support). Secure co-processors could thwart these attacks, but TPMs are not useful.
  • Bunker has <7kloc and took 2 months to develop. Much smaller (order of magnitude) than previous line-speed systems at UW and Toronto. Able to use Python to simplify the development.
  • Q: what have you learned by trying to sell Bunker (with admitted vulnerabilities) to network operators? Do they require a proof that no un-anonymised data can come out? Universities take their jobs very seriously (sometimes more so than ISPs). If you can prove that no data can come out, that’s great, but don’t know how to do that. Found that by explaining carefully what you’re doing, support is often forthcoming.
  • Q: great solution assuming the anonymisation is good enough. There have been several mistakes about this in the past. So how does Bunker affect this? Bunker doesn’t protect against that: indeed, it might even be worse. Assumes that there are no bugs in the anonymisation code: do code inspections and make it publicly available to improve its quality. Need to work on these open problems.
  • Q: do you worry about physical access to the infrastructure or machine? Well, if you can do that, you can install your own network tap, so what’s the point? Bunker is designed to lower ISP’s liability. Doesn’t stop a lawyer coming in with a subpoena allowing him to install a new network tap.


Flexible, Wide-Area Storage for Distributed Systems with WheelFS

  • Increasingly, data is stored on widely-spread resources (testbeds, grids, data centres, etc.). But we have not yet seen a universal storage layer. It’s hard because of failures, latency and limited bandwidth.
  • CoralCDN prefers low delay to strong consistency. Google wants to store e.g. email near the customer. Facebook forces all updates for a user to go through one data centre. So each application builds its own storage layer. No flexible layer gives all of these properties!
  • Need control of wide-area tradeoffs (fast timeout vs. consistency; fast writes vs. durability; proximity vs. availability). Also need a common, familiar API that looks like a traditional file system so we can reuse existing software built for local storage.
  • Solution is “semantic cues”: a small set of application-specific controls that corresponds to each of the wide-area challenges (e.g. eventual consistency, replication level, particular site). Allow applications to specify cues on a per-file basis, by putting it in the path names of files.
  • WheelFS works over the wide area, based on a standard distributed file system design with the addition of these cues. Built a full prototype which runs several applications.
  • Design: WheelFS looks like a big data storage layer. Distributed application runs on a bunch of client nodes spread throughout the wide-area. FUSE presents WheelFS to the application, and WheelFS client software communicates with WheelFS storage nodes. WheelFS configuration service uses Paxos and RSMs to map files to nodes.
  • Files have a primary and (by default) two replicas. A file’s primary is its creator. Clients can cache files using a lease-based invalidation protocol. Strict close-to-open consistency (serialised through the primary).
  • Consistency is enforced under failures (even network partitions): failing to reach the primary blocks the operation, until the configuration service promotes a new primary.
  • Only applications have the knowledge to make the right tradeoffs. So embed these cues in the path name.
  • /wfs/cache/a/b/foo -> /wfs/cache/a/b/.cue/foo (e.g. .EventualConsistency).
  • Flexible and minimal interface change that makes it easy to use existing applications.
  • Can apply a cue to entire directory subtrees, multiple cues at once (later cues override earlier, conflicting cues). Assume the developer uses these sensibly.
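Since cues are just path components, a client can strip and collect them in one pass. A sketch under stated assumptions: the cue names come from the talk, but the parsing rules (and treating every dot-prefixed component as a cue) are guesses.

```python
def parse_cues(path: str):
    """Split a WheelFS-style path into the real path and its semantic cues.
    Cue components start with '.'; later cues override earlier conflicting
    ones. Simplification: any dot-prefixed component is treated as a cue."""
    parts, cues = [], {}
    for comp in path.split("/"):
        if comp.startswith(".") and comp != ".":
            name, _, value = comp[1:].partition("=")
            cues[name] = value or True  # later occurrence overrides earlier
        else:
            parts.append(comp)
    return "/".join(parts), cues
```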
  • Durability through RepLevel=n cue. Permanent property of the data.
  • Large reads through HotSpot cue. Transient property (only applies to a particular opening of the file) using P2P BitTorrent-style caching for hotspots.
  • Placement using Site cue. Permanent property of the data.
  • Consistency through EventualConsistency cue. Either transient or permanent property.
  • Reading with eventual consistency: read latest version of the file that you can find quickly. Don’t need to go through the primary (which might have failed). Can try replicas, or a cached copy of the file elsewhere. .MaxTime=t cue to specify how long you should spend looking for a file. Writes can go to any replica, which creates two divergent replicas: a background maintenance process will figure out a way to merge files (without application involvement). Reconciling directories by taking the set union of files, and files by choosing one version as the winner (this will lose some writes, but is usually OK for apps that can tolerate eventual consistency).
  • Example use: cooperative web cache. Make a one-line change in the Apache config file to point it to a file in WheelFS. Using default WheelFS semantics leads to blocking under failure with strong consistency. But the freshness of a page can be determined using saved HTTP headers, so it’s alright to use eventual consistency.
  • Cache directory becomes /wfs/cache/.EventualConsistency/.MaxTime=200/.HotSpot/
  • Implemented on Linux, MacOS and FreeBSD. About 20kloc of C++. Support Unix ACLs. Deployed on PlanetLab and Emulab.
  • Applications include co-op web cache, all-pairs pings, distributed mail, file distribution and distributed make. At most 13 lines of configuration code had to be changed.
  • Evaluated performance: does it scale better than single-server DFS? Does it achieve performance equivalent to specialised storage? How does it perform under failures?
  • Scaling: as number of concurrent clients increases past about 125, NFS performance starts to suffer relative to WheelFS due to buffer cache exhaustion. (However, NFS is better than WheelFS for fewer clients. This is because it’s a local NFS server at MIT, and WheelFS is on PlanetLab.) WheelFS performance seems constant with added load.
  • Specialised storage for co-op web cache on PlanetLab: 40 nodes as proxies, 40 as clients, same workload as CoralCDN paper. Compare CoralCDN to Apache on WheelFS. WheelFS achieves same rate as CoralCDN. However, CoralCDN ramps up to full rate faster, due to special optimisations.
  • Evaluated web cache under failures using Emulab. Each minute, one site went offline for 30 secs. Eventual consistency improves performance under failures: rate remains almost constant, whereas strict consistency makes it fall greatly.
  • Main difference with related work is the control over trade-offs.
  • Compare to storage with configurable consistency (PNUTS, PADS). WheelFS also provides durability and placement controls.
  • Q: do these primitives have a place in a generic distributed file systems? Where do you draw the line in what you include and what you don’t? Have only looked at what existing applications need and included that. If other applications become important, other things may be included.
  • Q: what insight have you gained from making it simple to create different configurations of system? Most applications just need something simple that applies to both reads and writes (i.e. strict or eventual consistency).
  • Q: want to hear more about reconciliation? Most apps that need append-only storage can be implemented as writing things into a directory (and taking the union). Cooperative web caching doesn’t care if you lose a version of the file. That’s been enough so far.

PADS: A Policy Architecture for Distributed Storage Systems

  • There are lots of data storage systems. They take a lot of time and effort to build: lots to reimplement. Why can’t this be easier? Is there a better way to build distributed storage systems, focussing on high-level design rather than low-level details?
  • Previous work suggested a microkernel approach: a general mechanism layer with a pluggable policy.
  • Challenge was to build 10 different systems, each in 1kloc, before graduation! With PADS, 2 grad students built 12 diverse systems in just 4 months. Evidence that PADS captured the basic abstractions for building distributed storage systems.
  • Questions of data storage and propagation are just questions of routing. Consistency and durability are questions of blocking.
  • Routing specifies how data flows among nodes. When and where to send an update? Who to contact on a local read miss? Look at examples of routing in Bayou, Coda, chain replication and TierStore.
  • Primitive of subscription: options are the data set of interest (e.g. a path), notifications (invalidations) in causal order, or updates (bodies of files). Leads to an event-driven API. PADS gives a DSL to make this easier (based on OverLog), called R/Overlog. Policy is a bunch of rules that invoke actions.
  • Simple example is: on read, block and establish subscription to the server.
  • OverLog makes it possible to implement a whole system in a single page of rules. Rules for TierStore were presented in a single slide. Easier to debug and do code reviews.
  • Blocking policy defines when it is safe to access local data (either for consistency (what version can be accessed?) or durability (have updates propagated to safe locations?)). Need to block until the required semantics are guaranteed.
  • PADS provides 4 blocking points: before/after read/write. Specify a list of conditions that provide the required semantics. PADS provides 4 built-in bookkeeping conditions and one extensible condition.
  • e.g. block before read on condition Is_causal; block after write on condition R_Msg (ackFromServer).
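The blocking-point idea can be modelled as lists of predicates attached to the four points; the condition names and state encoding here are illustrative, not PADS's actual built-ins:

```python
class BlockingPolicy:
    """Toy model of PADS's four blocking points. An operation proceeds only
    once every condition registered at its blocking point holds."""

    def __init__(self):
        self.points = {"before_read": [], "after_read": [],
                       "before_write": [], "after_write": []}

    def require(self, point, condition):
        self.points[point].append(condition)

    def check(self, point, state) -> bool:
        return all(cond(state) for cond in self.points[point])

policy = BlockingPolicy()
policy.require("before_read", lambda s: s["is_causal"])        # consistency
policy.require("after_write", lambda s: s["acked_by_server"])  # durability
```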
  • Is PADS a better way to build distributed systems? Is it general enough to build any system? Is it easy-to-use? Is it easy-to-adapt? What are the overheads associated with the PADS approach?
  • Built a range of different systems that occupy different parts of the design space. Max number of routing rules was 75; up to 6 blocking conditions.
  • Added cooperative caching to Coda in 13 rules. Took less than a week, and greatly improved the read latency.
  • Overheads in an implementation of Bayou: within a small factor of the ideal number of bytes that must be transferred. Also looked at microbenchmarks on Coda versus P-Coda (PADS version): very close, and mostly due to the Java implementation of PADS.
  • Q: with this size of implementation, do you deal with system failures and recovery? Yes.
  • Q: how do you express routing that is based on network topology (e.g. TierStore hierarchy over DTN in a developing region) in OverLog? OverLog is typically used to set up network overlays, ping nodes and so on. Once you know who is alive, you can call into PADS to say that you’ve detected a peer with whom you can communicate for storage.
  • Q: can you talk about the trade-off between language versus library? (OverLog rules are a bit like haikus, sometimes you’d prefer a paragraph of text.) Can also use a Java API to configure PADS. Why OverLog; why not Java? Wanted to take advantage of the haiku of OverLog.
  • Q: a storage system doesn’t only have to worry about data movement, there’s also reconciliation or dealing with different storage layers at the same time. Should PADS worry about these? PADS doesn’t do conflict detection (it uses a simple scheme), and that has mostly been left to the application (though haven’t decided all of this so far). The storage layer is more of an application issue than a system issue.
  • Q: with OverLog, the runtime state can get very big, so how does this scale (with e.g. a complex topology)? Originally, DataLog would set up data flows which cause a lot of state to be exchanged. The custom version used for PADS cuts this down while using the same language.

Wireless #1: Software Radios

Sora: High Performance Software Radio Using General Purpose Multi-core Processors

  • Won the best paper award.
  • Currently, each wireless protocol is implemented using special hardware. Software radio ideal is a generic RF frontend and protocols implemented in software. Leads to universal connectivity and cost saving; a faster development cycle; and an open platform for wireless research.
  • Challenges. Need to process a large volume of high-fidelity digital signals (e.g. for 802.11 with 20MHz channel, need 1.2Gbps throughput… up to 5Gbps for 802.11n. Will be over 10Gbps for future standards). Processing is computationally-intensive: several complicated processing blocks, operating at high signal speeds. Need 40G operations per second to process 802.11a. Also a real-time system, so many hard deadlines and need accurate timing control. 10us windows for response.
  • Possible approaches include programmable hardware (FPGAs, embedded DSPs) (high performance, low programmability) and general purpose processors (low performance (100Kbps), high programmability). Sora is high performance and highly programmable. Achieves 10Gbps with ~10us latency.
  • Approach. A new PCIe-based interface card and optimisations to implement PHY algorithms and streamline processing on a multi-core CPU. “Core dedication” offers real-time support.
  • Uses a general radio frontend connected to a PCIe-based high-speed interface card (offers high throughput and low latency (~1us)). Frontend can connect to up to 8 channels. FPGA on card implements logic for control and data path (PCIe, DMA, SDRAM controllers).
  • Core part of Sora is the software architecture. To achieve high performance, uses three technologies.
  • First, an efficient PHY implementation makes extensive use of lookup tables which trade memory for calculation and still fit in L2 cache (e.g. convolutional encode requires 8 operations per bit in a direct implementation, but can use a 32Kb lookup table and two operations per 8 bits).
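The table trick generalises: precompute every (state, byte) output once, then encoding is a single lookup per byte. The rate-1/2, constraint-length-3 code below is a textbook example for illustration, not Sora's actual encoder or table layout.

```python
# Generator polynomials for a toy rate-1/2, constraint-length-3 code.
G0, G1 = 0b111, 0b101

def encode_bit(state, bit):
    # Shift the bit into a 3-bit register; each output bit is the parity of
    # the register masked by a generator polynomial.
    reg = (state << 1 | bit) & 0b111
    out = (bin(reg & G0).count("1") & 1, bin(reg & G1).count("1") & 1)
    return reg & 0b11, out  # keep 2 bits of state

def encode_byte_slow(state, byte):
    # Direct implementation: 8 per-bit steps per byte (MSB first).
    out = []
    for i in range(7, -1, -1):
        state, bits = encode_bit(state, byte >> i & 1)
        out.extend(bits)
    return state, tuple(out)

# Precompute: 4 states x 256 bytes = 1024 (next_state, output) entries,
# trading a small table for per-bit computation.
TABLE = {(s, b): encode_byte_slow(s, b) for s in range(4) for b in range(256)}

def encode_byte_fast(state, byte):
    return TABLE[(state, byte)]  # one lookup replaces 8 per-bit steps
```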
  • Second, most PHY algorithms have data parallelism (e.g. FFT and its inverse). So use wide-vector SIMD extensions developed for multimedia in the CPU.
  • Third, use multi-core streamline processing to speed up PHY processing. Divide processing pipelines into sub-pipelines, and assign these to different cores. Use a lightweight synchronised FIFO to connect cores. Can also do static scheduling at compile time.
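A toy version of that streamline structure, using threads and queues in place of dedicated cores and Sora's lightweight synchronised FIFOs (the two stages are placeholders for real PHY blocks):

```python
import queue
import threading

def stage(func, inq, outq):
    """One pipeline sub-stage, conceptually pinned to its own core: pull from
    the upstream FIFO, process, push downstream. A None sentinel shuts the
    stage down and propagates to the next one."""
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)
            return
        outq.put(func(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)),  # stage 1
    threading.Thread(target=stage, args=(lambda x: x + 1, q2, q3)),  # stage 2
]
for t in threads:
    t.start()
for sample in [1, 2, 3]:
    q1.put(sample)
q1.put(None)
results = []
while (item := q3.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
```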
  • Core dedication for real-time support: exclusively allocate enough cores for SDR processing in a multi-core system. This guarantees predictable performance and achieves us-level timing control. A simple abstraction, easily implemented in standard OSes such as WinXP (and easier than implementing a real-time scheduler).
  • Implemented for WinXP in 14kloc (C code) including the PCIe driver. Also, SoftWiFi implements 802.11a/b/g in 9kloc (C) in 4 man-months for development and testing. Works at up to 54Mbps.
  • Without optimisations, the required computation is far too large for any practical system. Sora offers up to a 30x speedup at high data rates.
  • End-to-end throughput compared to commercial-commercial 802.11 cards, Sora-commercial and Commercial-Sora. It is very close and sometimes faster to use Sora. Also seamlessly interoperates with commercial WiFi.
  • Extensions for jumbo frames in 802.11 can increase throughput. Also simply implement TDMA MAC. And applications which show low-level information about the PHY layer.
  • Q: do you have an algorithm for deciding how to allocate jobs to cores, or do you need to come up with an approach for each CPU architecture? Currently rely on the programmer to decide this, but there has been much other research on this, which could apply to Sora.
  • Q: most radios are used in mobile devices, so what is needed to make Sora work on power-constrained devices? GPPs have huge power consumption compared to special devices. Currently the benefit of SDR is for prototyping, so this is less of a concern. Also, Sora would work well on base stations.
  • Q: did you look into using existing systems for scheduling data-flow graphs on multi-cores? Don’t really need to consider the dynamic case because of fixed rounds etc.
  • Q: how do you provision dedicated cores in the presence of a shared cache and shared bus?
  • Q: a lot of the finer details in 802.11 is for working at low-performance (weak signal strength and high multipath), so does Sora have spare processing capacity to work with this? In the presence of these, you won’t get high throughput, so we don’t handle this completely.

Enabling MAC Protocol Implementations on Software-Defined Radios

  • What’s the hype about wireless MAC protocols? Achieving highest performance is application-specific (e.g. throughput, latency, power). No one MAC fits all. So there are diverse MAC implementations and optimisations. How can we easily implement these?
  • First approach has been to use standard wireless NICs (high performance and low cost). Although the MAC is software, it’s closed-source and fixed functionality. SDRs allow full reprogramming of the PHY and MAC layers, but are higher cost and lower performance.
  • Various projects have used SDR for evaluation (based on GNU Radio and USRP). All processing is done in userspace (“extreme SDR”).
  • “Extreme” SDR architecture based on a frontend, ADC/DAC, FPGA, and USB connection to kernel and eventually userspace. Much too slow for 802.11 timeouts.
  • So commonly move layers closer to the frontend. However, these are costly, require special toolkits, require embedded systems knowledge and are much less portable.
  • Instead, take a split-functionality approach. Put a small, performance-critical part on the radio hardware, and a larger piece on the host for flexibility. Then develop an API for the core functions.
  • Building blocks are carrier sense, precision scheduling, backoff, fast-packet detection, dependent packet generation and fine-grained radio control. Believe this is a reasonable first “toolbox” for implementing high-performance MAC protocols. Talk about precision scheduling and fast-packet detection.
  • Precision scheduling. Do the scheduling on the host (for flexibility) and triggering on the hardware (for performance). Requires a lead time that varies based on the architecture.
  • Want to know how much precision we gain from this approach. Transmission error is approximately 1ms if triggering in the host. If in the kernel, this lowers to 35us. With split-functionality, this gives 125ns precision in scheduling.
  • Fast-packet detection. Goal is to detect packets accurately in the hardware, before they have been demodulated. The longer it takes to detect a data packet, the longer it will take to generate an ACK. Then demodulate only when necessary as this is CPU intensive. Uses a “matched filter”, which is an optimal linear filter for maximising the SNR. Try to detect framing bits, which are transformed into a discrete waveform. This is used as the known signal, and is cross-correlated with the incoming signal. If the correlation exceeds some score, trigger a response, or other action.
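In essence, the matched filter is a sliding cross-correlation against the known framing waveform, with a trigger threshold. A pure-Python sketch (real implementations normalise the score and run in the FPGA):

```python
def matched_filter(signal, template, threshold):
    """Slide the known framing waveform over the incoming samples and return
    the offsets where the correlation score meets the threshold, i.e. where a
    packet likely starts. Toy version: raw dot-product, no normalisation."""
    n, m = len(signal), len(template)
    hits = []
    for i in range(n - m + 1):
        score = sum(s * t for s, t in zip(signal[i:i + m], template))
        if score >= threshold:
            hits.append(i)
    return hits
```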
  • Simulation of detecting 1000 data packets destined to the host in varying noise. The matched filter achieves better noise tolerance than the full decoder (in the simulator). In real life, achieves 100% accuracy detecting frames, and <0.5% false positives.
  • Other mechanisms in the toolbox are detailed in the paper.
  • Implemented on GNU Radio and USRP. Implemented two popular MAC protocols (802.11-like and Bluetooth-like).
  • CSMA 802.11-like protocol uses carrier sense, backoff, fast-packet recognition and dependent packets. Cannot interoperate with real 802.11 because of bandwidth limitations. Target bitrate is 500Kbps, and uses the 2.485GHz band to avoid 802.11 interference. Achieves 2x throughput of the host-based “extreme” approach for 1MB-size file transfers.
  • TDMA Bluetooth-like protocol. Piconet of master and slaves, with 650us slot size. Bluetooth-like because USRP cannot frequency hop at a high enough rate to interoperate with Bluetooth. Again, target rate of 500Kbps, performing ten 100KB file transfers, and vary the number of slaves. Achieves 4x the average throughput of the host-based approach, using a much shorter guard time.
  • Q: is the split always applicable, even if the cores could be heterogeneous? The most important part of the API is between the radio hardware and the host, not core-to-core. (Follow-up: for embedded applications, trend towards system-on-a-chip, and you could have cores geared towards different things (such as radio).)
  • Q: how do you work around virtual carrier sense? Can include multiple timestamps in a packet.
  • Q: are there fuzzy edges or things that you might have trouble dealing with in this API? Yes, definitely not saying that we can do everything. Currently working on generating “fast ACKs”, but if you pre-modulate it then you don’t know the destination, so need to track that.
  • Q: problems encountered in sensor nets in developing new APIs; how generic is this work if a new protocol were developed? Difficult to say that any set is complete. Could tweak the implementation of the core functions to implement new ones (e.g. ZigZag). Starting to look into implementing novel MACs.
  • Q: as the PHY gets faster, will the matched filter be adequate? Possible to use multiple filters in parallel (though USRP-1 doesn’t have room for that). Could also switch the coefficients to search for other things.
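The fast-packet detection idea above — cross-correlate the incoming samples against the known framing-bit waveform and trigger when the score exceeds a threshold — can be sketched in a few lines of numpy. This is a toy illustration, not the paper's FPGA implementation; the preamble, noise level and threshold value are all made up.

```python
import numpy as np

def matched_filter_detect(samples, preamble, threshold=0.8):
    """Cross-correlate samples against a known preamble waveform and
    return the offsets where the normalised correlation score exceeds
    the (hypothetical) threshold."""
    # Matched filtering against a real-valued template: correlate the
    # incoming stream with the preamble at every alignment.
    scores = np.correlate(samples, preamble, mode="valid")
    # Normalise by the preamble's own energy so a perfect match scores ~1.0.
    scores /= np.dot(preamble, preamble)
    return np.flatnonzero(scores > threshold)

# Toy usage: embed a 64-sample "framing bits" waveform in noise, detect it.
rng = np.random.default_rng(0)
preamble = np.sign(rng.standard_normal(64))   # stand-in discrete waveform
signal = 0.2 * rng.standard_normal(1000)      # channel noise
signal[300:364] += preamble                   # packet starts at offset 300
hits = matched_filter_detect(signal, preamble)
```

The point of the normalisation is that the trigger threshold becomes independent of the preamble length, which is roughly what lets the hardware fire an early response before any demodulation happens.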

Content Distribution

AntFarm: Efficient Content Distribution with Managed Swarms

  • What is the most efficient way to disseminate a large number of files to a large number of clients? A simple solution is client-server, which creates a bottleneck at the server and leads to a high cost of ownership for the content owner.
  • Alternative is to do peer-to-peer. Examples include BitTorrent. This sacrifices efficiency, because peers share limited information and there is no global sense of the system as a whole: gives little control to the provider. Managing swarms could lead to a better use of bandwidth.
  • Goals for AntFarm: high performance (throughput), low cost of deployment, performance guarantees (administrator control) and accounting (resource contribution policies).
  • Key insight is to treat content distribution as an optimisation problem. Uses a hybrid architecture that revisits the BitTorrent protocol, though it is in fact a brand-new protocol.
  • Has a set of peers, organised into swarms. A logically separate coordinator manages these swarms. Seeders outside the system provide the data, but altruistic peers will contribute much of the bandwidth.
  • As a strawman, the coordinator could schedule every single packet sent in the system: this is clearly unscaleable. Instead, it makes critical decisions based on observed dynamics. Remaining decisions left to the peers themselves. Peers can implement micro-optimisations (e.g. rarest block first).
  • Coordinator takes active measurements and extracts key parameters. It then formulates an optimisation problem that calculates the optimal bandwidth allocation.
  • Want to maximise throughput subject to bandwidth constraints. Response curve of swarm aggregate bandwidth as the seeder bandwidth is increased. At first, increasing seeder bandwidth gives a multiplicative increase in the aggregate bandwidth, but this eventually becomes slope=1, then flat. (Assumes that peers in the swarm are homogeneous (in network capacity) and the downlink is faster than uplink.)
  • Each swarm will have a different response curve. The coordinator measures these, and uses these for optimisations. Optimised using an iterative algorithm: allocate bandwidth to the swarm whose response curve has the steepest slope (favouring swarms with lower bandwidth where this is equal). Can first address SLAs and QoS constraints, which might lead to a very different allocation of bandwidth.
  • AntFarm must adapt to change as nodes churn and network conditions change. AntFarm will update response curves and bandwidth allocations.
  • AntFarm is built on top of a new wire protocol, which uses tokens as a form of microcurrency that is traded for blocks. Tokens are small and unforgeable. Peers return spent tokens to the coordinator as a proof of contribution.
  • Performance evaluation looks at global aggregate bandwidth across all swarms. Tested using a Zipf distribution of files with 60KB/s and 200KB/s seeders. Compared to client-server and BitTorrent. AntFarm greatly outperforms both of these cases.
  • Compare AntFarm to BitTorrent with two swarms: one self-sufficient and one singleton. BitTorrent will starve the singleton, but AntFarm will recognise based on the response curves that seeder bandwidth should be allocated to the singleton. Also observe that BitTorrent will starve new swarms (AntFarm will not).
  • Token management is embarrassingly parallel, which aids scalability. Ran coordinators on PlanetLab hosts and simulated multiple peers on other PlanetLab hosts. A one-machine coordinator supports 10K peers, and 8 coordinators support up to 80K peers. A single PC can compute allocations for 10,000 swarms with 1,000,000 peers in 6 seconds (done once every 5 minutes).
  • AntFarm requires no fine-tuning, and subsumes hacks that have been devised for BitTorrent.
  • Q: what incentive does a swarm have to report its response curve correctly? There is a potential collusion problem here, but we assume that peers want data and will exchange tokens to ensure that they get the data as fast as possible.
  • Q: is there any concern about a Sybil attack that involves passing credits amongst yourself? Can force people to back an account with a credit card, to mitigate this.
  • Q: do you think that this token-based system will be necessary in a commercial system? It gives us what we want in terms of response curves.
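The coordinator's iterative allocation — repeatedly give bandwidth to the swarm whose response curve currently has the steepest slope, favouring the swarm with the lower allocation on ties — can be sketched as a greedy loop. The response curves and step size below are invented for illustration; the real coordinator measures curves from observed dynamics and satisfies SLA/QoS constraints first.

```python
def allocate(seeder_capacity, curves, step=1.0):
    """Greedy sketch of the coordinator's loop: `curves` maps a swarm
    id to its response function f(b), the aggregate swarm bandwidth
    when the seeder devotes b units to it.  Each round, assign `step`
    units to the swarm with the steepest marginal gain, preferring the
    swarm with the lower current allocation on ties."""
    alloc = {s: 0.0 for s in curves}
    remaining = seeder_capacity
    while remaining >= step:
        def marginal(s):
            return curves[s](alloc[s] + step) - curves[s](alloc[s])
        best = max(curves, key=lambda s: (marginal(s), -alloc[s]))
        alloc[best] += step
        remaining -= step
    return alloc

# Toy response curves (hypothetical shapes): a self-sufficient swarm
# that barely benefits from seeder bandwidth, and a singleton whose
# aggregate bandwidth tracks the seeder one-for-one until saturation.
curves = {
    "big": lambda b: 100.0 + 0.1 * b,
    "singleton": lambda b: min(b, 50.0),
}
alloc = allocate(20, curves)
```

With these curves the singleton's marginal gain (slope 1) always beats the big swarm's (slope 0.1), so all 20 units go to the singleton — the same behaviour the talk contrasted with BitTorrent, which would starve it.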

HashCache: Cache Storage for the Next Billion

  • The next billion internet users are schools and urban middle class in developing regions. They have affordable hardware (OLPC, Classmate) but very expensive internet connections.
  • Standard approach for bandwidth saving is using a large cache. Large caches mean larger bandwidth savings. Can do overnight prefetch or push content from peers. They also have good offline behaviour, enabling prefetching and local search. Can even accelerate dynamic sites.
  • Cost is about 5–10GB of RAM per TB of storage. Cannot use laptop-grade hardware for caches: need server-grade hardware which is 10x more expensive than laptops.
  • Solution is a new storage engine that allows policies for efficiency and performance to be specified. Requires much less RAM than commercial or open-source caches, even for terabyte-sized caches. All techniques support far more GB/$ and allow a performance tradeoff.
  • Open-source solutions need multiple seeks for hits, misses and writes and depend on default filesystems. Commercial systems (using a circular log) require a single seek and achieve much better performance.
  • Focus on reducing the size of the (in-memory) index. Squid used 560 bits per entry; Tiger uses 232 bits per entry.
  • The cache size is limited by the memory size and performance is limited by the number of seeks. Want to reduce the dependency on memory size and improve the performance of inevitable seeks.
  • Instead, use the disk as a hashtable. Need on-disk structures for key lookup and value storage.
  • Basic HashCache policy: take h bits of H(URL) to index a disk-based hash table of contiguous blocks, then put the object data in a circular log.
  • Collision control is difficult in disk-based systems as it requires multiple seeks. Instead use set associativity, t ways. The possible locations are allocated contiguously so they can be read together (which is good as seek time dominates for small reads).
  • Normally would reduce seeks using an in-memory hash table (with space consumed by pointers), but disk is already a hash table so pointers are not needed, so just use a large bitmap that mirrors the disk layout. Just store one hash per URL.
  • Large disks can support 10–100+ million objects. Global cache replacement is relevant when the disk size is roughly equal to that of the working set. When you have much larger disks, local replacement policies are roughly equivalent to global ones. Do LRU within the sets.
  • Most misses require no seeks; reads and writes each take one seek. However, writes still need seeks, which motivates the next refinement.
  • Storing objects by hash produces random reads and writes. Need to restructure the on-disk table and store only the hash, rank and offset. Move all data to the log. Group writes will amortise seeks and scheduling related writes will enable read prefetch. Gives reads and writes in < 1 seek.
  • HashCache requires just 54 bits per URL.
  • All policies implemented in a Storage Engine with plug-in policies. Built a web proxy using the storage engine. Can have multiple apps on the same box, sharing memory. 20kloc (C) for the proxy and 1kloc for the indexing policies.
  • Evaluated using Web Polygraph (de facto feature and performance testing tool for web proxies). Compared against Squid and Tiger. Evaluated with “low end”, “high end” and “large disk” hardware capacities.
  • For low end, achieves hit rate comparable to Squid and Tiger. Can achieve performance comparable to Squid or Tiger, depending on the policy used.
  • On high end (5x 18GB disks), achieves performance very close to Tiger, much better than Squid.
  • Can achieve much larger disk capacities than either Squid or Tiger for the same amount of RAM. (1.5–5.4TB, depending on policy.)
  • Uses up to 600MB of RAM with a 1TB disk (large disk configuration).
  • Currently deploying HashCache in Ghana and Nigeria, and working with a school supplier on new deployments.
  • Q: what observable bandwidth improvements do you see? Many techniques require large caches (e.g. WAN accelerator tools), and we are working on these. Is the performance improvement like that of Squid? Yes, and it will be better for things like multiple people in a class watching a YouTube video (where the number of objects is large).
  • Q: why do we need these large caches? Is there evidence that by increasing cache size from 200GB to 1TB there will be a drastic improvement? Wanted to move beyond web caching (where the benefits are limited) to WAN acceleration, which requires much larger caches.
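The set-associative design described above can be modelled compactly: hash the URL to a set of contiguous slots, keep only a short hash fragment per slot in memory so most misses are decided without touching disk, and do LRU within each set. Everything here (the class name, set/way counts, the SHA-1-derived fragment) is an illustrative toy, with the "disk" simulated by Python lists — it shows the seek accounting, not a real storage engine.

```python
import hashlib

class HashCacheSketch:
    """Toy model of HashCache's set-associative layout: the 'disk' is
    a table of sets, each with `ways` contiguous slots that can be read
    with one seek.  In memory we keep only a short hash fragment per
    slot, so most misses need no seek at all; LRU is kept per set."""

    def __init__(self, num_sets=1024, ways=8):
        self.num_sets, self.ways = num_sets, ways
        self.mem = [[0] * ways for _ in range(num_sets)]   # in-memory fragments
        self.disk = [[None] * ways for _ in range(num_sets)]  # simulated disk
        self.lru = [list(range(ways)) for _ in range(num_sets)]  # MRU at back
        self.seeks = 0

    def _hashes(self, url):
        h = int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")
        # set index, plus a non-zero 16-bit fragment for the memory index
        return h % self.num_sets, (h >> 32) & 0xFFFF or 1

    def get(self, url):
        s, frag = self._hashes(url)
        if frag not in self.mem[s]:
            return None               # miss decided purely in memory: no seek
        self.seeks += 1               # one seek reads the whole set
        for way in range(self.ways):
            entry = self.disk[s][way]
            if entry and entry[0] == url:
                self.lru[s].remove(way)
                self.lru[s].append(way)
                return entry[1]
        return None                   # hash-fragment false positive

    def put(self, url, data):
        s, frag = self._hashes(url)
        way = self.lru[s].pop(0)      # evict the LRU way within the set
        self.lru[s].append(way)
        self.mem[s][way] = frag
        self.disk[s][way] = (url, data)
        self.seeks += 1               # one seek writes the set
```

The memory cost per slot is just the fragment (plus rank/offset in the log-structured variant), which is how the design gets down to tens of bits per URL instead of Squid's 560.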

iPlane Nano: Path Prediction for Peer-to-Peer Applications

  • Example application is a P2P CDN where content is replicated across a geographically distributed set of end-hosts. Every client needs to be redirected to the replica that provides best performance. However, internet performance is neither constant nor queriable.
  • Current best practice is for each application to measure internet performance on its own. Would be better for end-hosts to have the ability to predict performance without having to make measurements, and share infrastructure across applications.
  • Problem has been looked at before. Network coordinates were limited to latency but were a lightweight (scaleable) distributed system. iPlane had a rich set of metrics and used arbitrary end-hosts, but required a 2GB atlas to be distributed and had a large memory footprint.
  • iPlane Nano has same information as iPlane and sufficient accuracy, but only uses a 7MB atlas at end-hosts and services queries locally.
  • On the server side, iPlane Nano uses the same measurements as iPlane but stores and processes them differently.
  • Size of atlas is O(number of vantage points * number of destinations * average traceroute path length). iPlane combines paths to improve predictions. Instead replace atlas of paths with atlas of links. Now this is O(number of nodes * number of links).
  • Clients can use swarming to disseminate the atlas, and can service queries locally using the atlas.
  • However, just storing links loses routing policy information encoded in the routes (i.e. which path would actually be used?). Need to extract routing policy from measured routes and represent this compactly.
  • Strawman: could try to use shortest AS path routing + valley-free + early-exit routing. However, this gave very poor quality predictions (iPlane got 81% correct, this strawman approach got 30%). So we have thrown away too much information.
  • First technique is inferring AS filters. Not every path is necessarily a route (ASes filter propagation of a route received from one neighbour to other neighbours). Filters can be inferred from measured routes, by recording every triple of successive ASes in each measured route. Store (AS1, AS2, AS3) to imply that AS2 forwards packets from AS1 to AS3. This still gives multiple policy-compliant paths for some endpoint pairs, due to upstream AS routing policies.
  • Second technique is to infer AS preferences. For each measured route, alternate paths are determined in the link-based atlas. When paths diverge, this indicates preference.
  • Another challenge is routing asymmetry. Undirected edges are used to compute routes assuming symmetric routing (i.e. when the route has not been specifically measured), but more than half of internet routes are asymmetric. Merge clients’ additional (low-rate) traceroute measurements into the atlas that is distributed to all clients. Prefer a directed path in the atlas for prediction, or else fall back to undirected paths.
  • The improved path predictions are 70% accurate, which is almost as good as iPlane (with a 6.6MB atlas rather than 2GB; and a 1.4MB daily update).
  • Want to use routes to predict latency and loss rate. Latency is sum of link latencies. Loss-rate is the probability of loss on any link in the route. Ongoing challenge is to measure these properties themselves (link latency is hard to measure.) iPlane Nano can make good enough predictions to help applications.
  • System used to improve P2P applications (CDN, VoIP and detour routing for reliability). Look at CDN here; others in paper.
  • CDN chooses replica with best performance to serve a client request. Evaluated with 199 PlanetLab nodes as clients, and 10 random Akamai nodes as the replicas. Each node wants to download a 1MB file from the “best” replica. Look at the inflation in download time (w.r.t. optimal strategy) as a CDF of nodes. iPlane Nano does better than Vivaldi and OASIS, and indeed outperforms the expected measured latency. Random assignment gives bad inflation which shows the importance of an informed choice.
  • Q: how much does it matter if you have an out-of-date atlas? Once a day is good enough to capture the variance in latency and loss rate.
  • Q: how expensive is it to recompute the atlas? Very inexpensive.
  • Q: will your AS inferences lead to false AS links and how do you deal with that? Inter-AS links is a tricky issue, but rather than getting the topology right, it’s better to make good enough predictions, which are useful for applications. Addressing that will only improve accuracy.
  • Q: what gain do you get from not just querying the server? Say you want to instrument BitTorrent and rank order the peers that will give good performance. But what if every BitTorrent peer hits the server… this will overload a server and lead to you needing costly infrastructure (like Google).
  • Q: what is the measurement overhead that the end-hosts will incur if they have to run their own measurements? We process about 100 traceroutes per day at the end hosts.
  • Q: when does this technique (path segment composition) work better than others? Assumes routers are performing destination-based routing, rather than load-balancing. Turns out around 70% of routes are identical from day-to-day.
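The AS-filter inference above reduces to collecting observed triples of successive ASes and admitting only candidate paths whose every triple was seen in some measured route. A minimal sketch, with hypothetical AS names:

```python
def infer_filters(measured_routes):
    """Collect every (AS1, AS2, AS3) triple seen in measured AS-level
    routes: its presence implies AS2 is willing to forward traffic
    from AS1 towards AS3."""
    triples = set()
    for route in measured_routes:
        for i in range(len(route) - 2):
            triples.add(tuple(route[i:i + 3]))
    return triples

def policy_compliant(path, triples):
    """A candidate AS path passes the inferred filters only if each of
    its successive triples was observed in some measured route."""
    return all(tuple(path[i:i + 3]) in triples
               for i in range(len(path) - 2))

# Toy atlas built from two measured routes.
measured = [["AS1", "AS2", "AS3"], ["AS2", "AS3", "AS4"]]
triples = infer_filters(measured)
```

Note how filtering composes path segments: AS1→AS2→AS3→AS4 was never measured end-to-end, yet both of its triples were, so it is admitted — whereas a shortcut like AS1→AS2→AS4 is rejected. The second technique (AS preferences) then picks among the surviving compliant paths.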


Making Byzantine Fault Tolerant Systems Tolerate Byzantine Failures

  • We’ve heard a lot about applications and optimisations for BFT systems. We now have impressive best case performance for many scenarios. But what happens when failures actually occur? Performance drops to zero or the system crashes!
  • How do we get robust BFT? Describe the route to “Aardvark” which is an implementation of this technique, and show that the performance under failures is not too bad.
  • 10 years ago, we thought BFT could never be fast. Goal was to show that BFT could work (in an asynchronous network). FLP means that all we can guarantee is eventual progress. Systems were designed so that the normal case was fast and safety was maintained.
  • Wanted to maximise performance with a synchronous network and all clients working properly. This is misguided (surely failures must occur or else why would we have BFT?), dangerous (encourages fragile optimisations with corner cases that are difficult to reason about, easy to overlook and difficult to implement) and futile (diminishing returns in performance improvements).
  • New goal: address the middle ground between (asynchronous with failures) and (synchronous without failures): i.e. a synchronous network with failures.
  • Want to maximise performance when the network is synchronous and at most f servers fail, while remaining safe if at most f servers fail.
  • Protocol is structured as a series of filters to remove some amount of bad messages. This limits the effect that bad messages can have on performance. Same filters are applied to all messages.
  • Signatures are expensive, so use MACs? But MACs can be used by clients to generate ambiguity, so Aardvark insists on signed requests. (Showed an example of an attack on MACs by a faulty client, where the MAC is validated by the primary, but the replicas cannot validate it, which leads to a tricky protocol that lowers throughput. Also a problem with a faulty primary.) Use a hybrid MAC/signature which is easier to verify. Signature schemes are asymmetric so most of the work can be pushed to the client. But what if a faulty client sends bad signatures into the system? Filter them out (blacklist for client, then verify MAC, then verify signature, and if this fails then blacklist the client).
  • View changes to be avoided? But they can be done frequently to enable high throughput even under failures. The primary is in a unique position of power (client sends request to primary, primary forwards it to replicas) and could wait for a long time. Usually deal with this using a view change timeout. But a bad primary can be just fast enough to avoid being replaced. Instead use adaptive view changes based on observed and required throughput. Guarantees that the current primary can either provide good throughput or be promptly replaced.
  • Hardware multicast is a boon? Use separate work queues for clients and network connections between machines.
  • Evaluated throughput versus latency compared to HQ, Q/U, PBFT and Zyzzyva. Aardvark has longer latency than others (at low throughput), and sustains a lower throughput than PBFT or Zyzzyva.
  • Evaluated performance with failures. Byzantine failures are arbitrary (cannot enumerate all of them), so made a good-faith effort to strain the systems. The HQ implementation crashes with a faulty client (not all error handling was implemented); PBFT, Q/U and Zyzzyva drop to zero throughput. Aardvark maintains peak performance (although this is lower than the other schemes).
  • Also looked at effect of delay.
  • Q: why does the hybrid MAC/signature protocol require a MAC? If we don’t use a MAC, we don’t know who is sending the message so the MAC gives us a quick way to identify the sender and blacklist.
  • Q: given that people are already reluctant to use BFT, why would they take the performance hit? How well should you perform under failures versus no failures? Imagine there is a range of protocols, and one could choose a trade-off.
  • Q: how would you deal with heterogeneous speeds in the adaptive scheme (and faulty nodes causing an attack there)? Looking at symmetric systems and base throughput on the history over previous views.
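The adaptive view change idea — the primary must sustain a throughput demand that ratchets upwards while it keeps up, and is replaced promptly when it falls behind — can be sketched as a small state machine. The floor, growth factor and slack values below are invented; the actual protocol bases the requirement on observed history over previous views.

```python
class AdaptiveViewChange:
    """Sketch of Aardvark-style adaptive view changes.  A primary that
    is 'just fast enough' to dodge a fixed timeout cannot dodge a
    requirement that keeps rising: either it provides good throughput
    or it is promptly replaced.  All constants are hypothetical."""

    def __init__(self, floor=100.0, growth=1.01, slack=0.9):
        self.floor = floor        # demand placed on a fresh primary
        self.growth = growth      # ratchet factor applied each period
        self.slack = slack        # tolerated dip before replacement
        self.required = floor

    def observe(self, throughput):
        """Feed in one period's observed throughput; return True if a
        view change should be triggered."""
        if throughput < self.slack * self.required:
            self.required = self.floor   # the next primary starts afresh
            return True
        # The primary kept up: demand slightly more next period.
        self.required = max(self.required, throughput) * self.growth
        return False
```

A usage example: after one good period at 200 ops/s the requirement rises above 200, so a primary that then throttles itself to 150 ops/s is caught even though 150 would have satisfied the original floor.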

Zeno: Eventually Consistent Byzantine-Fault Tolerance

  • Data centre storage systems are the backbone of many internet-based services (e.g. Amazon, Facebook, Google). They have high availability and reliability requirements. Cost of downtime is huge.
  • Example of Amazon’s Dynamo shopping cart service. Needs reliable storage and responsiveness. Dynamo achieves reliability through replication. It achieves responsiveness by allowing stale state to be viewed during failures, and eventual consistency.
  • Cannot simultaneously achieve strong consistency and high availability if network partitions are possible (CAP theorem). Many storage backends prefer availability over consistency (e.g. Dynamo, PNUTS, Cassandra).
  • Two fault models: crash and Byzantine. Many deployed systems assume crash faults because the infrastructure is trusted. But Byzantine faults can happen (S3, Google and Netflix had multiple-hour outages), as the majority of database bugs exhibit non-crash behaviour. So use BFT, which withstands arbitrary faults, using 3f+1 replicas to tolerate f faults. Used for mission-critical systems, e.g. avionics. Much work has gone into improving performance, but what about availability?
  • Existing protocols strive for strong consistency, which assumes the abstraction of a single correct server. Need >= 2/3 of replicas to be available.
  • Key idea is relaxed consistency to give availability. Data is available when other nodes block, but sometimes stale. Zeno is an eventually consistent BFT protocol.
  • What is an eventually consistent BFT service? Assume three clients, A, B and C, that are accessing the service. Model service state as partial order on operations. Have a committed history and one or more tentative histories from some point in time onwards. Can merge tentative histories to give a committed history. But some operations (e.g. two add to baskets of an item where only one is available) can be inconsistent. Therefore have “strong” and “weak” operation types. A weak operation observes eventual consistency and may miss previous operations, but will eventually get committed (e.g. add/delete items to/from shopping cart). A strong operation always observes the committed history (e.g. checkout in a shopping cart: only pay for what you buy).
  • Zeno has four components. Normal case for strong and weak operations; handling a faulty replica; conflict detection; and conflict resolution.
  • Zeno requires 4 replicas (3f+1, f=1).
  • [Detailed description of the protocol.]
  • Strong quorum is used for strong consistency: ensures that no two requests are assigned the same sequence number (need 2f+1=3 matching replies). Weak operations don’t use the strong quorum: just need f+1 matching replies. With a weak quorum, intersection is not guaranteed, but it is not necessary for eventual consistency.
  • In event of a faulty primary, must be able to do a view change. Typically these require strong quorums. Zeno has a weak view change protocol that only requires weak quorums, which is necessary for high availability.
  • [Detailed description of conflict detection protocol.] Based on sequence number mismatch (same sequence number assigned to different requests).
  • Conflict resolution: weak operations are propagated between primaries of weak views, and finally reconciled. Correctness proof in the technical report.
  • Evaluated with a simulated workload with a varying fraction of weak operations. With no concurrent operations, compared against Zyzzyva. Look at all strong, 50% weak and all weak. Look at the throughput for Zyzzyva, Zeno(strong) and Zeno(weak). Weak operations continue to make progress in the presence of a network partition (but stall a bit when the partition is resolved, presumably as the conflict is resolved).
  • With concurrent operations that conflict (during the network partition). Weak drops briefly on the original partition, and also takes a slightly worse hit when the partition is resolved (but actually performs better during the partition). So Zeno provides higher availability than Zyzzyva.
  • Q: instead of working with arbitrary partitions, could you exploit cliques on either side (i.e. in separate data centres)? Yes, definitely.
  • Q: what happens to the client state when the operations are rolled up? The result that you see might not be the final result: this influences the choice of weak operations. What if you insert weak operations before strong operations? When a strong operation is committed, all weak operations before it must be committed.
  • Q: throughput results were an order of magnitude lower than in previous talk? Just used a small number of clients.
  • Q: can you give an example of an application where it is okay to have a period of strong consistency, followed by one where the results may be obliterated by the conflict resolution? Shopping cart is a prime candidate. But future operations may rely on operations that have not yet been committed? Yes, that’s a design choice.
  • Q: is it true that you will always end up with divergent histories? If you assume that we have signatures, then no.
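The quorum arithmetic behind Zeno's weak/strong split is worth making concrete: with n = 3f + 1 replicas, any two strong quorums of 2f + 1 intersect in at least one correct replica, while weak quorums of f + 1 need not intersect at all — which is exactly why weak operations can proceed on both sides of a partition and may later need reconciliation. A small sketch of that arithmetic:

```python
def quorum_sizes(f):
    """Quorum sizes for tolerating f Byzantine replicas out of
    n = 3f + 1.  Strong operations need 2f+1 matching replies; weak
    operations need only f+1 (enough to include one correct replica,
    guaranteeing eventual commitment but not freshness)."""
    n = 3 * f + 1
    weak, strong = f + 1, 2 * f + 1
    # Any two strong quorums share at least f+1 replicas, so at least
    # one of the shared replicas is correct: no conflicting sequence
    # numbers can both be strongly committed.
    strong_overlap = 2 * strong - n           # = f + 1
    assert strong_overlap - f >= 1
    # Two weak quorums may be disjoint (2*(f+1) - n = 1 - f <= 0 for
    # f >= 1), which is what permits divergent tentative histories.
    return n, weak, strong
```

For f = 1 this gives 4 replicas, weak quorums of 2 and strong quorums of 3 — matching the 4-replica configuration used in the talk.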

SOSP 2007: Day 3

Wednesday, October 17th, 2007

Indulge me for a moment. For some reason, the powers that be at SIGOPS like to hold SOSP at remote locations, which in recent years have included Bretton Woods, Banff, and Brighton. So I’ve not been the only one to point out that this year’s location, Skamania Lodge in southern Washington state, bears something of a resemblance to the infamous Overlook Hotel in Stephen King’s The Shining. A little Wikipedia surfing last night led me to discover that the basis for the Overlook is Timberline Lodge, just across the river in Oregon. Imagine my surprise as I watched the local news in our own hotel (which owes more than a little to Hitchcock), and the weather forecaster cut to a shot of the Timberline Lodge, where apparently the snow has just started falling for the season. Ah well, at least they’ve got a security camera up there these days.


SOSP 2007: Day 2

Tuesday, October 16th, 2007

I’ll skip doing a review of the banquet last night, so it’s on with the papers!


SOSP 2007: Day 1

Monday, October 15th, 2007

Since day 0 was driving up (through some outstanding scenery) from Portland to Skamania Lodge, I’ll come back to it when the photos are developed. Therefore, on with a little experiment: the live blog! (The page will grow downwards as more presentations are added.)

N.B. If you’re reading this through Facebook or are not a die-hard Computer Scientist, you might want to ignore it!


Things I am excited about, part II

Thursday, August 24th, 2006

Just twenty minutes ago, I finally handed in the fruits of four months’ labour. Of all the dissertations that I have written (or helped to write), Gathering BLOBs with OGSA-DAI is possibly my favourite, though I’m still glad to have it finished. In fact, I finished 30 days later than planned, but that’s okay, because I planned to finish 31 days early. The deadline is tomorrow, and I’m handing it in today to save the schlep out to KB for the penultimate time. (One more trip next week to give a valedictory presentation, but that’s not too much to worry about.)

Thus it's off for a guilt-free weekend of fun (to contrast with all the guilty weekends and weeks of fun that have gone before), with pizzas, parties and popular music.

You may think that was a rather self-satisfied post. Don’t worry: there are more Things I am excited about. It gets worse.

Things I am excited about, part I

Tuesday, August 22nd, 2006

For the first time, I’m living in Edinburgh (well, sort of) during the Fringe. This has made it possible to see even more shows than last year. And it’s also given me the chance to work on my rage control techniques.