TrInc: Small Trusted Hardware for Large Distributed Systems
Won the best paper award.
New primitive for trust in distributed systems, which can no longer be taken for granted.
Talking today about equivocation: providing different results to different clients.
Example is the Byzantine Generals Problem (tell one advance, tell the other retreat): could imagine a corrupt server behaving similarly.
Also a voting system (tell one user that vote has been counted; send another a tally excluding the vote).
Also BitTorrent (lying about which pieces of a file you have).
With f malicious users, need 3f+1 users in a completely untrusted system; but without equivocation, just need a simple majority of non-malicious users.
So use trusted hardware at all participants to make equivocation impossible!
Must be small, so that it can be ubiquitous, tamper-resistant and easily-verifiable. (Idea: send it as part of a figurine with World of Warcraft.)
This paper introduces TrInc (a new primitive to eliminate equivocation), some applications of TrInc and an implementation in currently-available hardware.
What is the smallest thing possible that makes equivocation impossible? All you need are a counter and a key. TrInc = trusted incrementer: a monotonically increasing counter and a key for signing attestations. Attestations bind data to counters.
Operation e.g.: “Bind this data to counter value 36.” TrInc checks to see if this actually increases the counter, and returns a tuple of (old counter, new counter, data), signed by attestation key.
Two kinds of attestations: advance (moves counter forward; can only happen once; attests that nothing is bound to intermediate values), and status attestation (doesn’t advance counter, attests to current value and that nothing has yet been attested to with a higher counter value).
In practice, might want multiple counters. A “trinket” is some hardware with >= 1 counter.
TrInc is practical: can use the TPM to implement it (and this has massive penetration in x86 machines). The TPM is tamper-resistant, has 4 counters, can do crypto and has a small amount of storage. TPM merely lacks the right interface.
Applications. Ensure freshness in DHTs, BFT with fewer nodes and messages, etc.
Implementing a trusted log in TrInc: append-only (ensure that new data goes at end of the log), and lookup (no equivocation on what is or isn’t stored). Obviously can’t store the log in the trinket; instead put it in untrusted storage.
Use the counter to attest to a datum’s position in the log (the counter is the location in the log). Append by attesting the data to the next counter value. For lookup, only one valid attestation can correspond to a move into a new counter value. Using the old counter values in the attestations to prove that there are no holes in the log.
Attested Append-Only Memory can do this too: by construction, TrInc solves all of the same problems.
Preving under-reporting in BitTorrent. Peers represent what pieces of a file they have using a bitfield, and exchange these with each other. Selfish peers have an incentive to under-report what they have. It yields prolonged interest from others and leads to faster download times. This is equivocation! When a peer receives a block, it acknowledges receipt (to the original provider), then tells others that it doesn’t have it.
In BitTorrent, counter is the number of pieces that the peer has downloaded. Peers attest to the bitfield and the most recent piece received. Peers attest when they receive a piece (as an ack) and when they sync counters with other peers.
When receiving a block, attest “I have (some collection of blocks) and most recently received (one of those blocks).” Check that the counter matches the bitfield and that the most recent piece is attested to. Kick out nodes which lie, to create an incentive.
Attest to the latest piece to avoid an attack by buffering received nodes and under-reporting. Without sending the full log, need to ensure proper behaviour at each step.
Evaluated with a macrobenchmark on BitTorrent (solves under-reporting), A2M (higher throughput than A2M and reduces hardware requirements), and PeerReview.
Implemented on the Gemalto .NET smartcard: a few dozen lines of C#. Implemented all of the case studies.
Evaluated implementation with microbenchmarks: operations take on the order of milliseconds. Can use asymmetric (slow) or symmetric (2x faster) crypto. Takes 32ms just to write a counter.
Trusted hardware is typically used for bootstrapping, not for interactive use, but TrInc makes this hardware an intrinsic part of the protocol. The hardware can be faster; there just hasn’t been a call for it yet.
Q: about the BT protocol, one could potentially attack TrInc by having multiple identities with multiple counters. What could you do at this level to address this attack? You are limited to attesting to a multiple counter; obviously this would be the case even if you had multiple machines. But by under-reporting to some, you would not be giving back the expected attestation, which cuts off the number of people you can trade with. So is that sufficiently worthwhile? It’s not clear.
Q: about voting, these hardware devices are designed for “low-value” trusted applications. Do you see a barrier where TrInc would not be applicable? Looked at digital currency and wondered how much trust you would put in the tamper-resilience. For mission critical applications, you would spend more money on the tamper-resilience, or use a more complicated protocol.
Q: did you consider putting this on an actual TPM? Wanted to design the interface for TrInc; TPM doesn’t provide this.
Q: could you talk about the counter size, overflow, etc.? Overflow is impossible because you are setting the new counter value (not incrementing it), which is checked in the card itself. Resetting the counter increases a “meta-counter” (another TrInc) which gives each counter its own ID: effectively a session ID.
Sybil-Resilient Online Content Voting
Many websites encourage users to vote for different types of content (e.g. Digg). Sybil attacks can pollute the results (promoting spam links on Digg).
Talk today about defending against this kind of attack, and how they implemented it on Digg.
Hard to defend against Sybil attack because an open system lets an attacker join easily. CAPTCHA etc. are insufficient. Need a resource that cannot be acquired in abundance: use social network links.
Edges between genuine friends and subnetwork of attacker sybils are the attack edges. Hypothesise that the number of these is small.
Assume you can collect all votes and the social graph. Can be binary vote or multiple choice vote. Goal is to attack a subset of votes that includes mostly votes from real people (but might include some from Sybils).
Designate a vote collector, use max-flow to collect votes and then assign appropriate link capacities.
Need to break symmetry (Sybil network can exactly mirror the real social network), so designate a known non-attacker as the “vote collector”. Then use max-flow to the vote collector: bogus votes are congested at the small number of attack edges. Honest votes are congested at edges closer to the collector. Attack edges should be farther away from the collector. So give more capacity to the edges that are closer to the collector.
System is called “SumUp”: designed to assign capacity in the graph and leverage user feedback.
Assign capacity to collect at most v votes (ideally the number of honest votes, estimated using a separate mechanism). Give greater capacity to edges nearer collector, using process called “ticket distribution”: give equal fraction of tickets (initially v) to all edges out from the collector. Each node consumes one ticket, and distributes the rest to each of its outgoing links. Constructs a “vote envelope” around the collector.
Observe that when number of honest votes >> v, the number of collected votes is roughly equal to v. When it is << v, the number of collected votes << v. So iteratively (and exponentially) adjust v until at least 0.5 * v votes are collected.
Prove that the number of bogus votes is limited to the number of attack edges plus a constant factor.
Also prove that a large fraction of the honest votes are collected.
Can do better by using feedback from the vote collector, if it can tag some votes as bogus. Then reduce capacity on attack edges close to the collector (or possibly ignore them altogether). Idea is to penalise all edges along the path taken by the bogus vote (because we know that one of these is the attack edge).
Associate a penalty with each link: initially all zero. When a bogus vote is tagged, penalise the edge by 1/capacity. Links with a higher penalty receive fewer tickets. Ultimately eliminate links with a high penalty.
Evaluated on real social networks, and real Sybil attacks.
Applied to YouTube (0.5M), Flickr (1.5M) and synthetic (3M) social graphs.
As the fraction of honest votes increases past 0.1, the average number of bogus votes per attack edge increased sharply (up to 5 per edge) in all three graphs.
The fraction of honest votes collected is always > 90%.
Looked at real Sybil attack on Digg (positive and negative votes on articles). Digg maintains 130,000 “popular” articles among 7 million articles, using an undisclosed algorithm. Digg has a 3M node social network, with 0.5M nodes in a connected component. 80% of votes are from the connected component. (Data obtained by crawling Digg.)
Made the Digg founder (Kevin Rose) the vote collector. Manually sampled 30 articles. Found subjective evidence of attacks in 15 articles (one was an advert; 10 had votes from newly-registered voters; 4 received <50 votes after being marked "popular").
Observe that suspicious articles receive more negative votes (based on 5794 “popular articles”).
Q: even if SumUp can give you the attack edges, it would be difficult to defend against attacks in recommendation systems where there are a small number of honest nodes, so just by compromising a few hundred honest nodes (e.g. using a botnet), it would be possible to overwhelm the system. How does SumUp deal with this? SumUp doesn’t deal with this case.
Q: is there a dependency on the location of the collector? Could you manipulate the graph to place attack edges near to the vote collector? Yes, so the feedback mechanism is important.
Bunker: A Privacy-Oriented Platform for Network Tracing
Bunker anonymises data that it collects and offers software engineering benefits.
Network tracing used for traffic engineering, fault diagnosis, recovery, research studies. But customer privacy is very important to ISPs. Raw data is a liability for ISPs (lost, stolen, subpoenaed, etc.).
So nobody can have access to the raw data (ISPs always say no). Anonymising the data can help to mitigate privacy concerns. Anonymisation is a form of obfuscation that destroys personally-identifying data.
Could do anon. offline or online. Offline has high privacy risks (do it after collecting the trace). Online has high engineering costs (need to anon. the trace simultaneously with collection, at line speed).
A regex for phishing (looks for forms that take PINs, usernames, passwords, etc.) using libpcre takes 5.5s to process 30Mb of data (44Mbps maximum). But we want to look at multiple-gigabit links with multiple regexes.
Want the best of both worlds. So buffer raw data on disk, but the only thing that comes out is the anonymised trace.
Bunker is a closed box that protects sensitive data. Contains all raw data and processing code. Restricted access to the box (e.g. no console). Make the box “safe-on-reboot”: when it is rebooted, clear the ECC RAM using the BIOS after reboot, and use encryption to protect on-disk data. Use an encryption key held in RAM inside the closed box. Data on disk cannot be decrypted after reboot.
Design: online capture module, and offline TCP assembly, parsing and anonymisation modules. One-way interface that passes out the anon. data. Add an online encryption module and offline decryption module to store data on disk.
Closed-box VM built on Xen hypervisor. Also an open-box VM on the same platform which provides access to the trace, accessed using a separate NIC.
Closed-box implementation: no I/O or drivers in this VM except those needed (custom-made menuconfig). Use firewalls to restrict network communication (e.g. standard iptables config).
On boot, one of two configurations may be selected. Debugging config enables all drivers and allows access to the closed box. Tracing configuration eliminates most I/O and drivers. On choosing tracing config, the display and keyboard freeze (as there are no drivers), the kernel’s init runs a script to start the trace, and the operator can only log in to the open box using its dedicated NIC.
Gives strong privacy properties, and allows trace processing to be done offline (in your favourite language, e.g. Python).
Bunker has a large TCB, but narrow interfaces: it remains secure as long as a vulnerability cannot be exploited through narrow interfaces. Three classes of attacks: closed-box interfaces, hardware and trace injection.
Assume that it is hard to attack a VM from another VM. Enumerate each of the interfaces and reason that the defences are secure.
Safe-on-reboot eliminates most hardware attacks. One remaining is extracting keys from RAM while the system is running (cold boot attacks, bus monitor, special (e.g. FireWire) device to dump RAM without OS support). Secure co-processors could thwart these attacks, but TPMs are not useful.
Bunker has <7kloc and took 2 months to develop. Much smaller (order of magnitude) than previous line-speed systems at UW and Toronto. Able to use Python to simplify the development.
Q: what have you learned by trying to sell Bunker (with admitted vulnerabilities) to network operators? Do they require a proof that non un-anon. data can come out? Universities take their jobs very seriously (sometimes more so than ISPs). If you can prove that no data can come out, that’s great, but don’t know how to do that. Found that by explaining carefully what you’re doing, support is often forthcoming.
Q: great solution assuming the anonymisation is good enough. There have been several mistakes about this in the past. So how does Bunker affect this? Bunker doesn’t protect against that: indeed, it might even be worse. Assumes that there are no bugs in the anonymisation code: do code inspections and make it publically available to improve its quality. Need to work on these open problems.
Q: do you worry about physical access to the infrastructure or machine? Well, if you can do that, you can install your own network tap, so what’s the point? Bunker is designed to lower ISP’s liability. Doesn’t stop a lawyer coming in with a subpoena allowing him to install a new network tap.
Flexible, Wide-Area Storage for Distributed Systems with WheelFS
Increasing data storage on widely-spread resources (testbeds, grids, data centres, etc.). But not yet seen a universal storage layer. It’s hard because of failures, latency and limited bandwidth.
CoralCDN prefers low delay to strong consistency. Google wants to store e.g. email near the customer. Facebook forces all updates for a user to go through one data centre. So each application builds its own storage layer. No flexible layer gives all of these properties!
Need control of wide-area tradeoffs (fast timeout vs. consistency; fast writes vs. durability; proximity vs. availability). Also need a common, familiar API that looks like a traditional file system so we can reuse existing software built for local storage.
Solution is “semantic cues”: a small set of application-specific controls that corresponds to each of the wide-area challenges (e.g. eventual consistency, replication level, particular site). Allow applications to specify cues on a per-file basis, by putting it in the path names of files.
WheelFS works over the wide area, based on a standard distributed file system design with the addition of these cues. Built a full prototype which runs several applications.
Design: WheelFS looks like a big data storage layer. Distributed application runs on a bunch of client nodes spread throughout the wide-area. FUSE presents WheelFS to the application, and WheelFS client software communicates with WheelFS storage nodes. WheelFS configuration service uses Paxos and RSMs to map files to nodes.
Files have a primary and (by default) two replicas. A file’s primary is its creator. Clients can cache files using a lease-based invalidation protocol. Strict close-to-open consistency (serialised through the primary).
Consistency is enforced under failures (even network partitions): failing to reach the primary blocks the operation, until the configuration service promotes a new primary.
Only applications have the knowledge to make the right tradeoffs. So embed these cues in the path name.
Flexible and minimal interface change that makes it easy to use existing applications.
Can apply a cue to entire directory subtrees, multiple cues at once (later cues override earlier, conflicting cues). Assume the developer uses these sensibly.
Durability through RepLevel=n cue. Permanent property of the data.
Large reads through HotSpot cue. Transient property (only applies to a particular opening of the file) using P2P BitTorrent-style caching for hotspots.
Placement using Site cue. Permanent property of the data.
Consistency though EventualConsistency cue. Either transient or permanent property.
Reading with eventual consistency: read latest version of the file that you can find quickly. Don’t need to go through the primary (which might have failed). Can try replicas, or a cached copy of the file elsewhere. .MaxTime=t cue to specify how long you should spend looking for a file. Writes can go to any replica, which creates two divergent replicas: a background maintenance process will figure out a way to merge files (without application involvement). Reconciling directories by taking the set union of files, and files by choosing one version as the winner (this will lose some writes, but is usually OK for apps that can tolerate eventual consistency).
Example use: cooperative web cache. Make a one-line change in the Apache config file to point it to a file in WheelFS. Using default WheelFS semantics leads to blocking under failure with strong consistency. But the freshness of a page can be determined using saved HTTP headers, so it’s alright to use eventual consistency.
Implemented on Linux, MacOS and FreeBSD. About 20kloc of C++. Support Unix ACLs. Deployed on PlanetLab and Emulab.
Applications include co-op web cache, all-pairs pings, distributed mail, file distribution and distributed make. At most 13 lines of configuration code had to be changed.
Evaluated performance: does it scale better than single-server DFS? Does it achieve performance equivalent to specialised storage? How does it perform under failures?
Scaling: as number of concurrent clients increases past about 125, NFS performance starts to suffer relative to WheelFS due to buffer cache exhaustion. (However, NFS is better than WheelFS for fewer clients. This is because it’s a local NFS server at MIT, and WheelFS is on PlanetLab.) WheelFS performance seems constant with added load.
Specialised storage for co-op web cache on PlanetLAb: 40 nodes as proxies, 40 as clients, same workload as CoralCDN paper. Compare CoralCDN to Apache on WheelFS. WheelFS achieves same rate as CoralCDN. However, CoralCDN ramps up to full rate faster, due to special optimisations.
Evaluated web cache under failures using Emulab. Each minute, one site went offline for 30 secs. Eventual consistency improves performance under failures: rate remains almost constant, whereas strict consistency makes it fall greatly.
Main difference with related work is the control over trade-offs.
Compare to storage with configurable consistency (PNUTS, PADS). WheelFS also provides durability and placement controls.
Q: do these primitives have a place in a generic distributed file systems? Where do you draw the line in what you include and what you don’t? Have only looked at what existing applications need and included that. If other applications become important, other things may be included.
Q: what insight have you gained from making it simple to create different configurations of system? Most applications just need something simple that applies to both reads and writes (i.e. strict or eventual consistency).
Q: want to hear more about reconciliation? Most apps that need append-only storage can be implemented as writing things into a directory (and taking the union). Cooperative web caching doesn’t care if you lose a version of the file. That’s been enough so far.
PADS: A Policy Architecture for Distributed Storage Systems
There are lots of data storage systems. They take a lot of time and effort to build: lots to reimplement. Why can’t this be easier? Is there a better way to build distributed storage systems, focussing on high-level design, not low-level details.
Previous work suggested a microkernel approach: a general mechanism layer with a pluggable policy.
Challenge was to build 10 different systems, each in 1kloc, before graduation! With PADS, 2 grad students built 12 diverse systems in just 4 months. Evidence that PADS captured the basic abstractions for building distributed storage systems.
Questions of data storage and propagation are just questions of routing. Consistency and durability are questions of blocking.
Routing specifies how data flows among nodes. When and where to send an update? Who to contact on a local read miss? Look at examples of routing in Bayou, Coda, chain replication and TierStore.
Primitive of subscription: options are the data set of interest (e.g. a path), notifications (invalidations) in causal order, or updates (bodies of files). Leads to an event-driven API. PADS gives a DSL to make this easier (based on OverLog), called R/Overlog. Policy is a bunch of rules that invoke actions.
Simple example is: on read, block and establish subscription to the server.
OverLog makes it possible to implement a whole system in a single page of rules. Rules for TierStore were presented in a single slide. Easier to debug and do code reviews.
Blocking policy defines when it is safe to access local data (either for consistency (what version can be accessed?) or durability (have updates propagated to safe locations?)). Need to block until the required semantics are guaranteed.
PADS provides 4 blocking points: before/after read/write. Specify a list of conditions that provide the required semantics. PADS provides 4 built-in bookkeeping conditions and one extensible condition.
e.g. Read at block: Is_causal. Write after block: R_Msg (ackFromServer).
Is PADS a better way to build distributed systems? Is it general enough to build any system? Is it easy-to-use? Is it easy-to-adapt? What are the overheads associated with the PADS approach?
Built a range of different systems that occupy different parts of the design space. Max number of routing rules was 75; up to 6 blocking conditions.
Added cooperative caching to Coda in 13 rules. Took less than a week, and greatly improved the read latency.
Overheads in an implementation of Bayou: within a small factor of the ideal number of bytes that must be transferred. Also looked at microbenchmarks on Coda versus P-Coda (PADS version): very close, and mostly due to the Java implementation of PADS.
Q: with this size of implementation, do you deal with system failures and recovery? Yes.
Q: how do you express routing that is based on network topology (e.g. TierStore hierarchy over DTN in a developing region) in OverLog? OverLog is typically used to set up network overlays, ping nodes and so on. Once you know who is alive, you can call into PADS to say that you’ve detected a peer with whom you can communicate for storage.
Q: can you talk about the trade-off between language versus library? (OverLog rules are a bit like haikus, sometimes you’d prefer a paragraph of text.) Can also use a Java API to configure PADS. Why OverLog; why not Java? Wanted to take advantage of the haiku of OverLog.
Q: a storage system doesn’t only have to worry about data movement, there’s also reconciliation or dealing with different storage layers at the same time. Should PADS worry about these? PADS doesn’t do conflict detection (it uses a simple scheme), and that has mostly been left to the application (though haven’t decided all of this so far). The storage layer is more of an application issue than a system issue.
Q: with OverLog, the runtime state can get very big, so how does this scale (with e.g. a complex topology)? Originally, DataLog would set up data flows which cause a lot of state to be exchanged. The custom version used for PADS cuts this down while using the same language.
Wireless #1: Software Radios
Sora: High Performance Software Radio Using General Purpose Multi-core Processors
Won the best paper award.
Currently, each wireless protocol is implemented using special hardware. Software radio ideal is a generic RF frontend and protocols implemented in software. Leads to universal connectivity and cost saving; a faster development cycle; and an open platform for wireless research.
Challenges. Need to process a large volume of high-fidelity digital signals (e.g. for 802.11 with 20MHz channel, need 1.2Gbps throughput… up to 5Gbps for 802.11n. Will be over 10Gbps for future standards). Processing is computationally-intensive: several complicated processing blocks, operating at high signal speeds. Need 40G operations per second to process 802.11a. Also a real-time system, so many hard deadlines and need accurate timing control. 10us windows for response.
Possible approaches include programmable hardware (FPGAs, embedded DSPs) (high performance, low programmability) and general purpose processors (low performance (100Kbps), high programmability). Sora is high performance and highly programmable. Achieves 10Gbps with ~10us latency.
Approach. A new PCIe-based interface card and optimisations to implement PHY algorithms and streamline processing on a multi-core CPU. “Core dedication” offers real-time support.
Uses a general radio frontend connected to a PCIe-based high-speed interface card (offers high throughput and low latency (~1us)). Frontend can connect to up to 8 channels. FPGA on card implements logic for control and data path (PCIe, DMA, SDRAM controllers).
Core part of Sora is the software architecture. To achieve high performance, uses three technologies.
First, an efficient PHY implementation makes extensive use of lookup tables which trade memory for calculation and still fit in L2 cache (e.g. convolutional encode requires 8 operations per bit in a direct implementation, but can use a 32Kb lookup table and two operations per 8 bits).
Second, most PHY algorithms have data parallelism (e.g. FFT and its inverse). So use wide-vector SIMD extensions developed for multimedia in the CPU.
Third, use multi-core streamline processing to speed up PHY processing. Divide processing pipelines into sub-pipelines, and assign these to different cores. Use a lightweight synchronised FIFO to connect cores. Can also do static scheduling at compile time.
Core dedication for real-time support: exclusively allocate enough cores for SDR processing in a multi-core system. This guarantees predictable performance and achieves us-level timing control. A simple abstraction, easily implemented in standard OSes (and easier than a real-time scheduler), such as WinXP.
Implemented for WinXP in 14kloc (C code) including the PCIe driver. Also, SoftWiFi implements 802.11a/b/g in 9kloc (C) in 4 man-months for development and testing. Works at up to 54Mbps.
Without optimisations, the required computation is far too large for any practical system. Sora offers up to a 30x speedup at high data rates.
End-to-end throughput compared to commercial-commercial 802.11 cards, Sora-commercial and Commercial-Sora. It is very close and sometimes faster to use Sora. Also seamlessly interoperates with commercial WiFi.
Extensions for jumbo frames in 802.11 can increase throughput. Also simply implement TDMA MAC. And applications which show low-level information about the PHY layer.
Q: do you have an algorithm for deciding how to allocate jobs to cores, or do you need to come up with an approach for each CPU architecture? Currently rely on the programmer to decide this, but there has been much other research on this, which could apply to Sora.
Q: most radios are used in mobile devices, so what is needed to make Sora work on power-constrained devices? GPPs have huge power consumption compared to special devices. Currently the benefit of SDR is for prototyping, so this is less of a concern. Also, Sora would work well on base stations.
Q: did you look into using existing systems for scheduling data-flow graphs on multi-cores? Don’t really need to consider the dynamic case because of fixed rounds etc.
Q: how do you provision dedicated cores in the presence of a shared cache and shared bus?
Q: a lot of the finer details in 802.11 is for working at low-performance (weak signal strength and high multipath), so does Sora have spare processing capacity to work with this? In the presence of these, you won’t get high throughput, so we don’t handle this completely.
Enabling MAC Protocol Implementations on Software-Defined Radios
What’s the hype about wireless MAC protocols? Achieving highest performance is application-specific (e.g. throughput, latency, power). No one MAC fits all. So there are diverse MAC implementations and optimisations. How can we easily implement these?
First approach has been to use standard wireless NICs (high performance and low cost). Although MAC is software, it’s closed-source and fixed functionality. SDR allows modifying full reprogramming of the PHY and MAC layers, but are higher cost and lower performance.
Various projects have used SDR for evaluation (based on GNU Radio and USRP). All processing is done in userspace (”extreme SDR”).
“Extreme” SDR architecture based on a frontend, ADC/DAC, FPGA, and USB connection to kernel and eventually userspace. Much too slow for 802.11 timeouts.
So commonly move layers closer to the frontend. However, these are costly, require special toolkits, require embedded systems knowledge and are much less portable.
Instead, take a split-functionality approach. Put a small, performance-critical part on the radio hardware, and a larger piece on the host for flexibility. Then develop an API for the core functions.
Building blocks are carrier sense, precision scheduling, backoff, fast-packet detection, dependent packet generation and fine-grained radio control. Believe this is a reasonable first “toolbox” for implementing high-performance MAC protocols. Talk about precision scheduling and fast-packet detection.
Precision scheduling. Do the scheduling on the host (for flexibility) and triggering on the hardware (for performance). Requires a lead time that varies based on the architecture.
Want to know how much precision we gain from this approach. Transmission error is approximately 1ms if triggering in the host. If in the kernel, this lowers to 35us. With split-functionality, this gives 125ns precision in scheduling.
Fast-packet detection. Goal is to detect packets accurately in the hardware, before they have been demodulated. The longer it takes to detect a data packet, the longer it will take to generate an ACK. Then demodulate only when necessary as this is CPU intensive. Uses a “matched filter”, which is an optimal linear filter for maximising the SNR. Try to detect framing bits, which are transformed into a discrete waveform. This is used as the known signal, and is cross-correlated with the incoming signal. If the correlation exceeds some score, trigger a response, or other action.
Simulation of detecting 1000 data packets destined to the host in varying noise. The matched filter achieves better noise tolerance than the full decoder (in the simulator). In real life, achieves 100% accuracy detecting frames, and <0.5% false positives.
Other mechanisms in the toolbox are detailed in the paper.
Implemented on GNU Radio and USRP. Implemented two popular MAC protocols (802.11-like and Bluetooth-like).
CSMA 802.11-like protocol uses carrier sense, backoff, fast-packet recognition and dependent packets. Cannot interoperate with real 802.11 because of bandwidth limitations. Target bitrate is 500Kbps, and uses the 2.485GHz band to avoid 802.11 interference. Achieves 2x throughput of the host-based “extreme” approach for 1MB-size file transfers.
TDMA Bluetooth-like protocol. Piconet of master and slaves, with 650us slot size. Bluetooth-like because USRP cannot frequency hop at a high enough rate to interoperate with Bluetooth. Again, target rate of 500Kbps, performing ten 100KB file transfers, and vary the number of slaves. Achieves 4x the average throughput of the host-based approach, using a much short guard time.
Q: is the split always applicable, even if the cores could be heterogeneous? The most important part of the API is between the radio hardware and the host, not core-to-core. (Follow-up: for embedded applications, trend towards system-on-a-chip, and you could have cores geared towards different things (such as radio).)
Q: how do you work around virtual carrier sense? Can include multiple timestamps in a packet.
Q: are there fuzzy edges or things that you might have trouble dealing with in this API? Yes, definitely not saying that we can do everything. Currently working on generating “fast ACKs”, but if you pre-modulate it then you don’t know the destination, so need to track that.
Q: problems encountered in sensor nets in developing new APIs; how generic is this work if a new protocol were developed? Difficult to say that any set is complete. Could tweak the implementation of the core functions to implement new ones (e.g. ZigZag). Starting to look into implementing novel MACs.
Q: as the PHY gets faster, will the matched filter be adequate? Possible to use multiple filters in parallel (though USRP-1 doesn’t have room for that). Could also switch the coefficients to search for other things.
AntFarm: Efficient Content Distribution with Managed Swarms
What is the most efficient way to disseminate a large number of files to a large number of clients? A simple solution might be a simple client-server, which creates a bottleneck at the server, and leads to a high cost of ownership for the content owner.
Alternative is to do peer-to-peer. Examples include BitTorrent. This sacrifices efficiency, because peers share limited information and there is no global sense of the system as a whole: gives little control to the provider. Managing swarms could lead to a better use of bandwidth.
Goals for AntFarm: high performance (throughput), low cost of deployment, performance guarantees (administrator control) and accounting (resource contribution policies).
Key insight is to treat content distribution as an optimisation problem. Uses a hybrid architecture, revisiting the BitTorrent protocol, but in fact a brand-new protocol.
Has a set of peers, organised into swarms. A logically separate coordinator manages these swarms. Seeders outside the system provide the data, but altruistic peers will contribute much of the bandwidth.
As a strawman, the coordinator could schedule every single packet sent in the system: this is clearly unscaleable. Instead, it makes critical decisions based on observed dynamics. Remaining decisions left to the peers themselves. Peers can implement micro-optimisations (e.g. rarest block first).
Coordinator takes active measurements and extracts key parameters. It then formulates an optimisation problem that calculates the optimal bandwidth allocation.
Want to maximise throughput subject to bandwidth constraints. Response curve of swarm aggregate bandwidth as the seeder bandwidth is increased. At first, increasing seeder bandwidth gives a multiplicative increase in the aggregate bandwidth, but this eventually becomes slope=1, then flat. (Assumes that peers in the swarm are homogeneous (in network capacity) and the downlink is faster than uplink.)
Each swarm will have a different response curve. The coordinator measures these, and uses these for optimisations. Optimised using an iterative algorithm: allocate bandwidth to the swarm whose response curve has the steepest slope (favouring swarms with lower bandwidth where this is equal). Can first address SLAs and QoS constraints, which might lead to a very different allocation of bandwidth.
AntFarm must adapt to change as nodes churn and network conditions change. AntFarm will update response curves and bandwidth allocations.
AntFarm is built on top of a new wire protocol, which uses tokens as a form of microcurrency that is traded for blocks. Tokens are small and unforgeable. Peers return spent tokens to the coordinator as a proof of contribution.
Performance evaluation looks at global aggregate bandwidth across all swarms. Tested using a Zipf distribution of files with 60KB/s and 200KB/s seeders. Compared to client-server and BitTorrent. AntFarm greatly outperforms both of these cases.
Compare AntFarm to BitTorrent with two swarms: one self-sufficient and one singleton. BitTorrent will starve the singleton, but AntFarm will recognise based on the response curves that seeder bandwidth should be allocated to the singleton. Also observe that BitTorrent will starve new swarms (AntFarm will not).
Token management is embarrassingly parallel, which aids scalability. Ran coordinators on PlanetLab hosts and simulated multiple peers on other PlanetLab hosts. A one-machine coordinator supports 10K peers, and 8 coordinators will support up to 80K peers. A single PC can comput allocations for 10000 swards with 1000000 peers in 6 seconds (done once every 5 minutes).
AntFarm requires no fine-tuning, and subsumes hacks that have been devised for BitTorrent.
Q: what incentive does a swarm have to report its response curve correctly? There is a potential collusion problem here, but we assume that peers want data and will exchange tokens to ensure that they get the data as fast as possible.
Q: is there any concern about a Sybil attack that involves passing credits amongst yourself? Can force people to back an account with a credit card, to mitigate this.
Q: do you think that this token-based system will be necessary in a commercial system? It gives us what we want in terms of response curves.
HashCache: Cache Storage for the Next Billion
The next billion internet users are schools and urban middle class in developing regions. They have affordable hardware (OLPC, Classmate) but very expensive internet connections.
Standard approach for bandwidth saving is using a large cache. Large caches mean larger bandwidth savings. Can do overnight prefetch or push content from peers. They also have good offline behaviour, enabling prefetching and local search. Can even accelerate dynamic sites.
Cost is about 5–10GB of RAM per TB of storage. Cannot use laptop-grade hardware for caches: need server-grade hardware which is 10x more expensive than laptops.
Solution is a new storage engine that allows policies for efficiency and performance to be specified. Requires much less RAM than commercial or open-source caches, even for terabyte-sized caches. All techniques support far more GB/$ and allow a performance tradeoff.
Open-source solutions need multiple seeks for hits, misses and writes and depend on default filesystems. Commercial systems (using a circular log) require a single seek and achieve much better performance.
Focus on reducing the size of the (in-memory) index. Squid used 560 bits per entry; Tiger uses 232 bits per entry.
The cache size is limited by the memory size and performance is limited by the number of seeks. Want to reduce the dependency on memory size and improve the performance of inevitable seeks.
Instead, use the disk as a hashtable. Need on-disk structures for key lookup and value storage.
Basic HashCache policy: H(URL) = h bits… stores in a disk-based hash table of contiguous blocks, then puts the data in a circular log.
Collision control is difficult in disk-based systems as it requires multiple seeks. Instead use set associativity, t ways. The possible locations are allocated contiguously so they can be read together (which is good as seek time dominates for small reads).
Normally would reduce seeks using an in-memory hash table (with space consumed by pointers), but disk is already a hash table so pointers are not needed, so just use a large bitmap that mirrors the disk layout. Just store one hash per URL.
Large disks can support 10–100+ million objects. Global cache replacement is relevant when the disk size is roughly equal to that of the working set. When you have much larger disks, local replacement policies are roughly equivalent to global ones. Do LRU within the sets.
Most misses require no seeks; one seek per read; one seek per write. However, writes still need seeks.
Storing objects by hash produces random reads and writes. Need to restructure the on-disk table and store only the hash, rank and offset. Move all data to the log. Group writes will amortise seeks and scheduling related writes will enable read prefetch. Gives reads and writes in < 1 seek.
HashCache requires just 54 bits per URL.
All policies implemented in a Storage Engine with plug-in policies. Built a web proxy using the storage engine. Can have multiple apps on the same box, sharing memory. 20kloc (C) for the proxy and 1kloc for the indexing policies.
Evaluated using Web Polygraph (de facto feature and performance testing tool for web proxies). Compared against Squid and Tiger. Evaluated with “low end”, “high end” and “large disk” hardware capacities.
For low end, achieves hit rate comparable to Squid and Tiger. Can achieve performance comparable to Squid or Tiger, depending on the policy used.
On high end (5x 18GB disks), achieves performance very close to Tiger, much better than Squid.
Can achieve much larger disk capacities than either Squid or Tiger for the same amount of RAM. (1.5–5.4TB, depending on policy.)
Uses up to 600MB of ram with a 1TB disk (large disk configuration).
Currently deploying HashCache in Ghana and Nigeria, and working with a school supplier on new deployments.
Q: what observable bandwidth improvements do you see? Many techniques require large caches (e.g. WAN accelerator tools), and we are working on these. Is the performance improvement like that of Squid? Yes, and it will be better for things like multiple people in a class watching a YouTube video (where the number of objects is large).
Q: why do we need these large caches? Is there evidence that by increasing cache size from 200GB to 1TB there will be a drastic improvement? Wanted to move beyond web caching (where the benefits are limited) to WAN acceleration, which requires much larger caches.
iPlane Nano: Path Prediction for Peer-to-Peer Applications
Example application is a P2P CDN where content is replicated across a geographically distributed set of end-hosts. Every client needs to be redirected to the replica that provides best performance. However, internet performance is neither constant nor queriable.
Current best practice is for each application to measure internet performance on its own. Would be better for end-hosts to have the ability to predict performance without having to make measurements, and share infrastructure across applications.
Problem has been looked at before. Network coordinates were limited to latency but were a lightweight (scaleable) distributed system. iPlane had a rich set of metrics and used arbitrary end-hosts, but required a 2GB atlas to be distributed and had a large memory footprint.
iPlane Nano has same information as iPlane and sufficient accuracy, but only uses a 7MB atlas at end-hosts and services queries locally.
On the server side, iPlane Nano uses the same measurements as iPlane but stores and processes them differently.
Size of atlas is O(number of vantage points * number of destinations * average traceroute path length). iPlane combines paths to improve predictions. Instead replace atlas of paths with atlas of links. Now this is O(number of nodes * number of links).
Clients can use swarming to disseminate the atlas, and can service queries locally using the atlas.
However, just storing links loses routing policy information encoded in the routes (i.e. which path would actually be used?). Need to extract routing policy from measured routes and represent this compactly.
Strawman: could try to use shortest AS path routing + valley-free + early-exit routing. However, this gave very poor quality predictions (iPlane got 81% correct, this strawman approach got 30%). So we have thrown away too much information.
First technique is inferring AS filters. Not every path is necessarily a route (ASes filter propagation of a route received from one neighbour to other neighbours). Filters can be inferred from measured routes, by recording every triple of three successive ASes in each measured rout. Store (AS1, AS2, AS3) to imply that AS2 forwards packets from AS1 to AS3. This still gives multiple policy-compliant paths for some endpoint pairs, due to upstream AS routing policies.
Second technique is to infer AS preferences. For each measured route, alternate paths are determined in the link-based atlas. When paths diverge, this indicates preference.
Another challenge is routing asymmetry. Undirected edges are used to compute routes assuming symmetric routing (i.e. when the route has not been specifically measured), but more than half of internet routes are asymmetric. Merge clients’ additional (low-rate) traceroute measurements into the atlas that is distributed to all clients. Prefer a directed path in the atlas for prediction, or else fall back to undirected paths.
The improved path predictions are 70% accurate, which is almost as good as iPlane (with a 6.6MB atlas rather than 2GB; and a 1.4MB daily update).
Want to use routes to predict latency and loss rate. Latency is sum of link latencies. Loss-rate is the probability of loss on any link in the route. Ongoing challenge is to measure these properties themselves (link latency is hard to measure.) iPlane Nano can make good enough predictions to help applications.
System used to improve P2P applications (CDN, VoIP and detour routing for reliability). Look at CDN here; others in paper.
CDN chooses replica with best performance to serve a client request. Evaluated with 199 PlanetLab nodes as clients, and 10 random Akamai nodes as the replicas. Each node wants to download a 1MB file from the “best” replica. Look at the inflation in download time (w.r.t. optimal strategy) as a CDF of nodes. iPlane Nano does better than Vivaldi and OASIS, and indeed outperforms the expected measured latency. Random assignment gives bad inflation which shows the importance of an informed choice.
Q: how much does it matter if you have an out-of-date atlas? Once a day is good enough to capture the variance in latency and loss rate.
Q: how expensive is it to recompute the atlas? Very inexpensive.
Q: will your AS inferences lead to false AS links and how do you deal with that? Inter-AS links is a tricky issue, but rather than getting the topology right, it’s better to make good enough predictions, which are useful for applications. Addressing that will only improve accuracy.
Q: what gain do you get from not just querying the server? Say you want to instrument BitTorrent and rank order the peers that will give good performance. But what if every BitTorrent peer hits the server… this will overload a server and lead to you needing costly infrastructure (like Google).
Q: what is the measurement overhead that the end-hosts will incur if they have to run their own measurements? We process about 100 traceroutes per day at the end hosts.
Q: when does this technique (path segment composition) work better than others? Assumes routers are performing destination-based routing, rather than load-balancing. Turns out around 70% of routes are identical from day-to-day.
Making Byzantine Fault Tolerant Systems Tolerate Byzantine Failures
We’ve heard a lot about applications and optimisations for BFT systems. We now have impressive best case performance for many scenarios. But what happens when failures actually occur? Performance drops to zero or the system crashes!
How do we get robust BFT? Describe the route to “Aardvark” which is an implementation of this technique, and show that the performance under failures is not too bad.
10 years ago, we thought BFT could never be fast. Goal was to should that BFT could work (in an asynchronous network). FLP means that all we can guarantee is eventual progress. Systems were designed so that the normal case was fast and safety was maintained.
Wanted to maximise performance with a synchronous network and all clients working properly. This is misguided (surely failures must occur or else why would we have BFT?), dangerous (encourages fragile optimisations with corner cases that are difficult to reason about, easy to overlook and difficult to implement) and futile (diminishing returns in performance improvements).
New goal: address the middle ground between (asynchronous with failures) and (synchronous without failures): i.e. a synchronous network with failures.
Want to maximise performance when the network is synchronous and at most f servers fail, while remaining safe if at most f servers fail.
Protocol is structured as a series of filters to remove some amount of bad messages. This limits the effect that bad messages can have on performance. Same filters are applied to all messages.
Signatures are expensive, so use MACs? But MACs can be used by clients to generate ambiguity, so Aardvark insists on signed requests. (Showed an example of an attack on MACs by a faulty client, where the MAC is validated by the primary, but the replicas cannot validate it, which leads to a tricky protocol that lowers throughput. Also a problem with a faulty primary.) Use a hybrid MAC/signature which is easier to verify. Signature schemes are asymmetric so most of the work can be pushed to the client. But what if a faulty client sends bad signatures into the system? Filter them out (blacklist for client, then verify MAC, then verify signature, and if this fails then blacklist the client).
View changes to be avoided? But they can be done frequently to enable high throughput even under failures. The primary is in a unique position of power (client sends request to primary, primary forwards it to replicas) and could wait for a long time. Usually deal with this using a view change timeout. But a bad primary can be just fast enough to avoid being replaced. Instead use adaptive view changes based on observed and required throughput. Guarantees that the current primary can either provide good throughput or be promptly replaced.
Hardware multicast is a boon? Use separate work queues for clients and network connections between machines.
Evaluated throughput versus latency compared to HQ, Q/U, PBFT and Zyzzyva. Aardvark has longer latency than others (at low throughput), and sustains a lower throughput than PBFT or Zyzzyva.
Evaluated performance with failures. Byzantine failures are arbitrary (cannot enumerate all of them), so made a good-faith effort to strain this. HQ implementation crashes with a faulty client (not all error handling was implemented) PBFT, Q/U and Zyzzyva drop to zero throughput. Aardvark maintains peak performance (although this is lower than the other schemes).
Also looked at effect of delay.
Q: why does the hybrid MAC/signature protocol require a MAC? If we don’t use a MAC, we don’t know who is sending the message so the MAC gives us a quick way to identify the sender and blacklist.
Q: given that people are already reluctant to use BFT, why would they take the performance hit? How well should you perform under failures or no failures? Imagine there is a range of protocols there, and could choose a trade-off.
Q: how would you deal with heterogeneous speeds in the adaptive scheme (and faulty nodes causing an attack there)? Looking at symmetric systems and base throughput on the history over previous views.
Zeno: Eventually Consistent Byzantine-Fault Tolerance
Data centre storage systems are the backbone of many internet-based services (e.g. Amazon, Facebook, Google). They have high availability and reliability requirements. Cost of downtime is huge.
Example of Amazon’s Dynamo shopping cart service. Needs reliable storage and responsiveness. Dynamo achieves reliability through replication. It achieves responsiveness by allowing stale state to be viewed during failures, and eventual consistency.
Cannot simultaneously achieve strong consistency and high availability if network partitions are possible (CAP theorem). Many storage backends prefer availability over consistency (e.g. Dynamo, PNUTS, Cassandra).
Two fault models: crash and Byzantine. Many deployed systems assume crash faults because the infrastructure is trusted. But Byzantine faults can happen (S3, Google, NetFlix had multiple-hour outages), as the majority of database bugs exhibit non-crash behaviour. So use BFT, which withstands arbitrary faults, using 3f+1 replicas to tolerate f faults. Used for mission critical systems, e.g. avionics. Improvements have been made in improving performance, but what about availability?
Existing protocols strive for strong consistency, which assumes the abstraction of a single correct server. Need >= 2/3 of replicas to be available.
Key idea is relaxed consistency to give availability. Data is available when other nodes block, but sometimes stale. Zeno is an eventually consistent BFT protocol.
What is an eventually consistent BFT service? Assume three clients, A, B and C, that are accessing the service. Model service state as partial order on operations. Have a committed history and one or more tentative histories from some point in time onwards. Can merge tentative histories to give a committed history. But some operations (e.g. two add to baskets of an item where only one is available) can be inconsistent. Therefore have “strong” and “weak” operation types. A weak operation observes eventual consistency and may miss previous operations, but will eventually get committed (e.g. add/delete items to/from shopping cart). A strong operation always observes the committed history (e.g. checkout in a shopping cart: only pay for what you buy).
Zeno has four components. Normal case for strong and weak operations; handling a faulty replica; conflict detetion; and conflict resolution.
Zeno requires 4 replicas (3f+1, f=1).
[Detailed description of the protocol.]
Strong quorum is used for strong consistency: ensures that no two requests are assigned the same sequence number (need 2f+1=3 matching replies). Weak operations don’t use the strong quorum: just need f+1 matching replies. With a weak quorum, intersection is not guaranteed, but it is not necessary for eventual consistency.
In event of a faulty primary, must be able to do a view change. Typically these require strong quorums. Zeno has a weak view change protocol that only requires weak quorums, which is necessary for high availability.
[Detailed description of conflict detection protocol.] Based on sequence number mismatch (same sequence number assigned to different requests).
Conflict resolution: weak operations are propagated between primaries of weak views, and finally reconciled. Correctness proof in the technical report.
Evaluated with a simulated workload with a varying fraction of weak operations. With no concurrent operations, compared against Zyzzyva. Look at all strong, 50% weak and all weak. Look at the throughput for Zyzzyva, Zeno(strong) and Zeno(weak). Weak operations continue to make progress in the presence of a network partition (but stall a bit with the partition is resolved, presumably as the conflict is resolved).
With concurrent operations that conflict (during the network partition). Weak drops briefly on the original partition, and also takes a slightly worse hit when the partition is resolved (but actually performs better during the partition). So Zeno provides higher availability than Zeno.
Q: instead of working with arbitrary partitions, could you exploit cliques on either side (i.e. in separate data centres)? Yes, definitely.
Q: what happens to the client state when the operations are rolled up? The result that you see might not be the final result: this influences the choice of weak operations. What if you insert weak operations before strong operations? When a strong operation is committed, all weak operations before it must be committed.
Q: throughput results were an order of magnitude lower than in previous talk? Just used a small number of clients.
Q: can you give an example of an application where it is okay to have a period of strong consistency, followed by one where the results may be obliterated by the conflict resolution? Shopping cart is a prime candidate. But future operations may rely on operations that have not yet been committed? Yes, that’s a design choice.
Q: is it true that you will always end up with divergent histories? If you assume that we have signatures, then no.