<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>mrry</title>
	<atom:link href="http://www.mrry.co.uk/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mrry.co.uk/blog</link>
	<description>Derek Murray's weblog</description>
	<pubDate>Wed, 01 Sep 2010 11:17:04 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>SIGCOMM 2010: Day 2</title>
		<link>http://www.mrry.co.uk/blog/2010/09/01/sigcomm-2010-day-2/</link>
		<comments>http://www.mrry.co.uk/blog/2010/09/01/sigcomm-2010-day-2/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 10:00:19 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/?p=70</guid>
		<description><![CDATA[Privacy
Privacy-Preserving P2P Data Sharing with OneSwarm

Three types of data: private, public (non-sensitive) and public (without attribution). This talk is about the last one: want to download and share data without people knowing what you&#8217;re downloading or sharing.
P2P is good for privacy because there is no centralized control or trust, you retain rights to your data [...]]]></description>
			<content:encoded><![CDATA[<h2>Privacy</h2>
<h3>Privacy-Preserving P2P Data Sharing with OneSwarm</h3>
<ul>
<li>Three types of data: private, public (non-sensitive) and public (without attribution). This talk is about the last one: want to download and share data without people knowing what you&#8217;re downloading or sharing.</li>
<li>P2P is good for privacy because there is no centralized control or trust, you retain rights to your data (instead of giving them to a third party) and no centralized third party knows everything you&#8217;re doing. But in P2P: anyone can monitor your behavior!</li>
<li>Previous solutions: Tor plus BitTorrent, but this needs a fraction of the clients to be public and gives poor performance. Or Freenet, which has poor bulk data performance and requires users to store others&#8217; data. Median download time for a 1MB file with BitTorrent is 94s. But BT+Tor is 589s and Freenet is 1271s.</li>
<li>Implemented a OneSwarm client and released in March 2009… now hundreds of thousands of users. Based on social networks: share keys with your friends.</li>
<li>Searches are flooded through the network, and set up a data path along the successful route, which does not reveal who is the ultimate sender or provider.</li>
<li>Threat model: the attacker has a limited number of overlay nodes and can do anything on nodes he controls, including traffic injection/sniffing/correlation.</li>
<li>To support mobile peers, use a DHT to publish IP and port, however this is published, encrypted and signed separately for each peer. This makes it possible to remove a peer.</li>
<li>Sparse social networks are a problem: with only one friend, you have poor reliability, poor performance and privacy problems. This is bad for early adopters. Early adopters used a forum to share their public keys. Solution was to add a community server, as a source of untrusted peers.</li>
<li>With untrusted peers, delay responses to foil timing attacks, probabilistically forward queries and use deterministic random behavior to limit information leakage from repeated queries. Can trust/not trust peers on a per-object basis.</li>
<li>Want to have search without revealing the source and destination. The approach is based on flooding with delay, where searches are only forwarded using spare capacity, and delayed at each hop. Cancel messages move faster through the network. However, this gives no guarantee that all data can be found by all users at all times.</li>
<li>Search types: 160bit content hash and text-based. For response delay, use random reply delay seeded by the hash (if hash-based search). This is harder for text-based search, so delay based on the hash of the content.</li>
<li>Multipath connections are supported to avoid weak points in the network. Default is to cancel after getting 20 responses (not just one). Then the forwarding load is distributed.</li>
<li>Data transfer uses a modified version of BitTorrent, which handles multiple sources and swarming downloads naturally.</li>
<li>Timing attack is possible by monitoring the delay between search and the response, and inferring how many users could have sent the reply within the recorded time.</li>
<li>Evaluated potential of the timing attack using a synthetic OneSwarm overlay, based on 1.7 million last.fm users. Attackers use a public community server, and users with 26 or fewer friends take some untrusted friends as well. Design eliminates the attacker&#8217;s ability to pinpoint particular users.</li>
<li>Also evaluated performance using 120 PlanetLab nodes transferring a 20MB file. Median download time for OneSwarm is 173s (compared to 94s for BitTorrent, but much less than BT+Tor and Freenet). At the 70th percentile, OneSwarm and BT are similar (97s vs 190s), but BT+Tor and Freenet are much worse.</li>
<li>Multipath transfers increase average transfer rate from 29KB/s to 457KB/s.</li>
<li>Q. Would it be fair to classify this as Gnutella where the network is restricted to the social graph and searches are less flexible? Similar, but the key change is that the data flows over the path in the overlay (not directly), which makes it more privacy-preserving. Does this not give worse scalability than Gnutella, which had problems? Gnutella had problems when people were still using modems, and it is more viable to provide a few KB/s. The current overlay is not oversubscribing.</li>
<li>Q. What are you relying on to ensure that you don&#8217;t know where the data are coming from? Because you don&#8217;t know the topology. If you are next to an attacker, you rely on the delay that you add. Are you seeing a sum of random variables, which will leak more information as the path becomes longer? You could maybe estimate the hop-count but not pin-point nodes.</li>
<li>Q. Is a 20MB file too small for TCP to show realistic performance? Used this because we needed to experiment with Tor also, and we didn&#8217;t want to stress that network too much. For the Freenet experiment, we used a 5MB file and extrapolated from that because it was hard to get 20MB to download reliably.</li>
</ul>
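<p>The deterministic reply delay mentioned above can be sketched as follows. This is only an illustration, not the OneSwarm client code: the keyed HMAC construction, the per-node secret and the delay bounds are all assumptions made for the example. The point is that a repeated query for the same content hash always sees the same delay, so averaging over many queries reveals nothing new.</p>

```python
import hashlib
import hmac

# Hypothetical bounds for the artificial reply delay, in milliseconds.
MIN_DELAY_MS = 150
MAX_DELAY_MS = 300


def reply_delay_ms(node_secret: bytes, content_hash: bytes) -> int:
    """Derive a repeatable pseudo-random delay for one content hash.

    node_secret is an assumed per-node value that never leaves the node,
    so other peers cannot predict the delay this node will add.
    """
    digest = hmac.new(node_secret, content_hash, hashlib.sha256).digest()
    span = MAX_DELAY_MS - MIN_DELAY_MS + 1
    return MIN_DELAY_MS + int.from_bytes(digest[:8], "big") % span
```

<p>For a text-based search, the same function could be fed the hash of the matching content rather than the query string, which is what the notes describe.</p>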
<h3>Differentially-Private Network Trace Analysis</h3>
<ul>
<li>Can you conduct network trace analysis that provides strict, formal, &#8220;differential privacy&#8221; guarantees?</li>
<li>Selected some representative traces and tried to reproduce the results using differential privacy.</li>
<li>It was possible to reproduce every analysis attempted, but there is a privacy/accuracy trade-off.</li>
<li>Toolkit and analyses are available online from the PINQ web site.</li>
<li>Access to realistic data is helpful for networking research, but there is a tension between utility and privacy. The requirements of utility are usually for aggregate statistics, whereas privacy requirements are typically for individual behavior.</li>
<li>Other approaches include: trace anonymization (doesn&#8217;t always work unless people are excessively conservative), code-to-data (send your analysis to the people who hold the data, but it is hard to know what that code is doing), or secure multi-party computation (similar to code-to-data). The aim here is to start with formal guarantees and see how useful it can be.</li>
<li>Differential privacy: the results don&#8217;t depend on the presence or absence of an individual record. Doesn&#8217;t prevent disclosure, but makes no assumptions about what the attackers can do, and is agnostic to data types.</li>
<li>Uses PINQ, which is a programming language that guarantees programs are differentially private.</li>
<li>Challenges: getting DP requires introducing noise, so you need to use statistically robust measurements. PINQ requires analyses to be written as high-level, declarative queries, which can require some creativity or reinterpretation. Also (not dealt with): masking a few packets does not mask a person, and the guarantees degrade more as a dataset is reused (policy question of how you mete out access to a dataset).</li>
<li>Example is worm fingerprinting. Group packets by payload, filter by the count of source IPs being over a threshold and the count of destination IPs being over another threshold. Can then count the number of worms, approximately. Need to supply epsilon to the count which will start off the differentially-private version.</li>
<li>Built some tools for analysis. For example, implemented three versions of a CDF. In doing this, you need to scale down the accuracy for each subquery in order to not degrade the dataset privacy too much.</li>
<li>Showed an example CDF. The differentially private one is not monotonic at the microscopic scale, but it gives a convincing macro-scale result.</li>
<li>Can also list frequently occurring strings, using an algorithm based on statistical properties of text, which gradually extends a prefix.</li>
<li>Extend worm fingerprinting: actually enumerate the payloads that have significant src/dest counts.</li>
<li>Also built more tools and analyses: did packet-level analyses, flow-level analyses and graph-level analyses. Sometimes had to compromise on privacy to get high accuracy (epsilon = 10 for weak privacy).</li>
<li>Many open questions. Perhaps the biggest is whether DP guarantees for packets are good enough. Or whether, if writing new analyses, they could be designed with DP in mind.</li>
<li>Q. Could extensions to PINQ apply to trace analysis that look for isolated events, such as network intrusions which are relatively rare? Can separate the two tasks: learning a rule or filter that could identify an intrusion (which could use DP), and apply that filter to individual packets (which could not use DP, because you effectively want to violate privacy at this point).</li>
<li>Q. Does someone need to hold onto the raw packets? Yes, like the code-to-data setting.</li>
<li>Q. In DP, each query may consume epsilon privacy and the provider must set a budget, so how do you set this? And what happens when the budget is exhausted? You could imagine turning off the dataset when the hard threshold is met. But this is really a policy question. Setting the budget is difficult: perhaps you can provide data outputs from a DP query to a large group who can then do useful work with it.</li>
<li>Q. Is there a trade-off between DP and the presence of correlations in the data? In a lot of cases, it is possible to restructure the data set to reduce the amount of correlation between individual records (by grouping the strongly correlated records together).</li>
</ul>
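<p>A minimal sketch of the noisy CDF idea, assuming the standard Laplace mechanism for counting queries. This is not PINQ and does not use its API; the function names are made up for the example. Each threshold is answered by one counting query, and the total epsilon is split evenly across the queries, which is the "scale down the accuracy for each subquery" step from the notes.</p>

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Counting query (sensitivity 1) released with noise of scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def noisy_cdf(values, thresholds, total_epsilon, rng):
    """One noisy count per threshold; the budget is split evenly across them."""
    per_query_epsilon = total_epsilon / len(thresholds)
    return [
        # t >= v counts the records at or below the threshold t.
        noisy_count(sum(1 for v in values if t >= v), per_query_epsilon, rng)
        for t in thresholds
    ]
```

<p>Because every point gets independent noise, adjacent points can come out in the wrong order, which matches the observation that the private CDF is not monotonic at the microscopic scale.</p>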
<h3>Encrypting the Internet</h3>
<ul>
<li>50 million websites exist, but only about 600k of them enable SSL/TLS. Can we change the infrastructure to make all transactions protected and secure?</li>
<li>Main drawback is protocol processing speed and cost, due to public key crypto for handshaking and symmetric crypto for the data. 2 million clock cycles for RSA decrypt.</li>
<li>Main contribution is a CPU that is capable of encrypting packets at line rates, and getting a 4&#8211;12x speedup in AES and a 40% speedup in RSA.</li>
<li>Encrypting the internet is not securing it. Don&#8217;t deal with certificate/trust management, malicious software or privacy breaches at the end-host.</li>
<li>AES is a block cipher, based on the Rijndael algorithm. Can use 128-bit blocks and either 128, 192 or 256-bit keys. AES takes 10, 12 or 14 rounds.</li>
<li>AES uses confusion (invert in GF(2^8) followed by an affine map). Then the bytes are permuted by shifting the rows of the state by varying amounts. Then the columns of the state are mixed by matrix multiplication. Uses many bit-linear operations, which are easy to implement in VLSI. Finally, add the round key using XOR.</li>
<li>AES is typically implemented using table lookups, which are costly (approximately 15 cycles per byte). But need to get 1Gb/s. So the aim is to implement them in combinational logic, on the processor data path.</li>
<li>Added new instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST. Cache attacks are eliminated. Challenge is to implement this in as small a gate area as possible. Mathematical techniques such as composite fields help to achieve this in 100-400 gates. Total number of gates is similar to an adder or multiplier.</li>
<li>RSA requires performing a modular exponentiation, which can be implemented using modular multiplication. Implementing a faster multiplication algorithm in assembly achieved a 40% speedup over OpenSSL.</li>
<li>Also implemented the first version of combined encryption and authentication for TLS 1.2.</li>
<li>AES-NI latency is 24 clocks/round, then 6 clocks, and the throughput is 2 clocks.</li>
<li>Overall, can move from 501 SSL sessions/second to 1216 SSL sessions/second using AES-NI in Galois counter mode.</li>
<li>Now, one core can saturate a 1G link, and 8 cores can saturate a 10G link.</li>
<li>Future work is to improve larger RSA variants and implement the eventual SHA-3 algorithm.</li>
<li>Q. When you get that fast, how many good-quality random bits per second can you get? This work doesn&#8217;t address that, but all we need is an entropy source per a 2004 paper. Not sure what the product groups are doing in this respect.</li>
<li>Q. Is Intel working on speeding up the RSA? The speedup presented in the paper (40%) is good enough to saturate the link.</li>
<li>Q. Could you expose the GF operations as primitives themselves? It is implemented in such a way that you can isolate the inversion of GFs or the multiplication. Algorithms in the SHA-3 competition also exploit similar primitives.</li>
<li>Q. How general are your optimizations in terms of other block ciphers? You can implement a variety of crypto algorithms using the primitives we have designed, including several cryptographic hash functions.</li>
</ul>
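<p>The confusion step described above (inversion in GF(2^8) followed by an affine map) can be computed from first principles rather than looked up in a table. The sketch below follows the published AES definition, with reduction polynomial 0x11B and affine constant 0x63. It is written for clarity, not speed, and says nothing about how the hardware instructions are built.</p>

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    while b:
        if b % 2:
            product ^= a
        a *= 2
        if a >= 256:
            a ^= 0x11B
        b //= 2
    return product


def gf_inv(a):
    """Multiplicative inverse as a^254; zero maps to zero by convention."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(b, n):
    """Rotate an 8-bit value left by n bits."""
    return ((b * 2 ** n) % 256) | (b // 2 ** (8 - n))


def sbox(b):
    """AES SubBytes for one byte: field inversion, then the affine map."""
    inv = gf_inv(b)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
```

<p>A table-based implementation precomputes these 256 values, which is where the cache-timing exposure comes from. Computing the same function in logic removes the data-dependent memory access.</p>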
<h2>Wireless LANs</h2>
<h3>Enabling Fine-Grained Channel Access in WLAN</h3>
<ul>
<li>802.11n achieves about 45.2 Mbit/s at the application layer, which is much less than the advertised bitrate.</li>
<li>Overhead arises from various sources: CSMA, backoff, DIFS and SIFS, and ACK. A simple model expresses efficiency as the ratio of transmission time to total time for a packet. As the PHY data rate increases, the time for transmitting data (efficiency) becomes small compared to all of these overheads. There is a static delay that cannot be reduced, constraining speedup.</li>
<li>Existing MAC is limited by allocating a whole channel to a single user at once. Aggregation is a possible solution, but you require large aggregation (23KB frames) to get 80% efficiency at 300Mbps. And this also increases latency.</li>
<li>Basic idea is to divide the channel into small, fine-grained slices. Directly reducing the channel width doesn&#8217;t work because of guard-band overhead. Approach then is to use orthogonal overlapping subchannels (OFDM).</li>
<li>If nodes are asynchronous, you lose orthogonality (i.e. if you have multiple users). Challenge then is to coordinate transmissions in random-access networks like WLAN. Time-domain backoff is very inefficient in this case.</li>
<li>Designed new PHY and MAC architectures: &#8220;FICA&#8221;.</li>
<li>M-RTS/M-CTS/DATA/ACK access sequence.</li>
<li>Carrier-sensing and broadcasting can be used to analyze the timing misalignment. A proper cyclic-prefix accommodates the timing misalignment: a long one for M-RTS and a short one for M-CTS/DATA/ACK.</li>
<li>For contention resolution, time-domain backoff is inefficient. Solution is to do frequency-domain contention with PHY signalling in the M-RTS/M-CTS  symbols.</li>
<li>Frequency domain backoff: reduce the number of subchannels to contend if there is a collision, and increase it if there is success. This is analogous to congestion-control mechanisms. Two policies: &#8220;update to max&#8221; and AIMD.</li>
<li>Implemented using the Sora software radio platform, based on a SoftWifi implementation.</li>
<li>Evaluated performance of the synchronization, the signalling reliability and the decoding performance.</li>
<li>Also showed simulation results for the performance gain over 802.11n, and showed an improvement in efficiency for both full aggregation (unrealistic) and a mixture of saturated and delay-sensitive traffic (realistic and with much greater benefits).</li>
<li>Q. How do you deal with the case when the number of sources exceeds the number of sub-carriers (in frequency-domain backoff)? Could you combine time and frequency? Yes, we could always do that.</li>
<li>Q. Is there a way of using this system with RTS/CTS? The overhead of these is so low (37us for RTS), that it might not be worth doing this.</li>
<li>Q. Why is the problem of asynchronous timing different from multipath fading? It can create arbitrarily bad interference with the FFT window that is done for OFDM.</li>
<li>Q. What happens if you compare your scheme to classic OFDM in terms of bits/second/Hz (considering delay due to synchronization)? There is a sweet spot in symbol size that can meet your design goal.</li>
</ul>
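<p>Two pieces of the talk lend themselves to a small sketch: the efficiency model (transmission time over total time) and the AIMD policy for frequency-domain backoff. The fixed per-frame overhead, the halving factor and the step size below are assumptions chosen for illustration, not values from the paper.</p>

```python
def mac_efficiency(payload_bytes, phy_rate_mbps, overhead_us):
    """Fraction of airtime spent on payload bits.

    overhead_us lumps together everything that does not shrink with the
    PHY rate (backoff, DIFS/SIFS, preamble, ACK); its value is assumed.
    """
    data_us = payload_bytes * 8 / phy_rate_mbps
    return data_us / (data_us + overhead_us)


def aimd_update(subchannels, collided, max_subchannels, step=1):
    """Halve the contended subchannels after a collision, else add step."""
    if collided:
        return max(1, subchannels // 2)
    return min(max_subchannels, subchannels + step)
```

<p>With a fixed overhead, raising the PHY rate only shrinks the data term, so efficiency falls as the rate goes up. That is the motivation for slicing the channel instead of handing the whole width to one sender.</p>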
<h3>Predictable 802.11 Packet Delivery from Wireless Channel Measurements</h3>
<ul>
<li>802.11 is fast (600Mbps), reliable (usable at vehicular speeds over extended range) and ubiquitous (cheap). But new applications, such as wireless displays or controller input, can stress the network.</li>
<li>In theory, performance should be easily measurable and used to guide channel rate selections. But the real-world doesn&#8217;t always match the theory. So statistical adaptation is often used, but convergence time becomes a problem (especially as the measurement results change dynamically).</li>
<li>Goal is to bridge theory and practice, and accurately predict performance over real channels and devices.</li>
<li>Channel metric is the received signal strength indicator (RSSI) which, with noise, gives the SNR for a packet. However, this isn&#8217;t perfect, because it can vary by 10dB on a per packet basis. Different subchannels have different SNRs.</li>
<li>802.11n provides a new opportunity: detailed channel measurements, which are used for advanced MIMO techniques. Get a Channel State Information (CSI) matrix for per-antenna paths.</li>
<li>Use the Effective SNR (the total useful power in a link) as opposed to the packet SNR (total power in the link).</li>
<li>CSI is measured on receive, so for every received frame, we know all antennas and subcarriers used. Then take this and compute SNRs per symbol. And use textbook formulae to calculate per-symbol bit-error rates, and average them to get an effective bit-error rate. Finally convert this back to the effective SNR.</li>
<li>Every rate gets an effective SNR threshold, calculated offline per NIC implementation (not per NIC or per channel). This handles real NICs which may use interesting decoding techniques (hard/soft/maximum likelihood, etc.).</li>
<li>Application: what is the fastest configuration for a particular link? Select rate/MIMO/channel width based on the information.</li>
<li>Application: which antenna is the best to use to save power?</li>
<li>Application: what is the lowest transmit power at which I can support 100 Mbps?</li>
<li>Implemented in an Intel Wi-Fi Link 5300 NIC (3&#215;3 MIMO, 450Mbps). Used two testbeds with over 200 widely varying links. Open-source Linux driver and used firmware debug mode to send CSI to the receiving host. Real-time computation took 4us per 3&#215;3 CSI.</li>
<li>For predicting optimal 3&#215;3 rate: effective SNR is much closer to the ground truth than packet SNR.</li>
<li>To evaluate rate control, used channel simulation on a mobile trace using MATLAB and the SoftRate GNU Radio. Effective SNR gets a better average delivered rate than SampleRate, SoftRate and SampleRate with fixed retry (802.11a algorithms).</li>
<li>Effective SNR extends to MIMO. Compared to optimal and an invented algorithm called &#8220;previous-OPT&#8221;. Effective SNR gets 80% accuracy and 10% overselection.</li>
<li>Related work: SoftRate, AccuRate and EEC (from yesterday). All work with 802.11a but don&#8217;t extend to more recent techniques.</li>
<li>Q. If you had CSI and it&#8217;s quick, do you need to do all of these things? The RSSI has a lot of error, and we were able to make this work.</li>
<li>Q. Is the debug mode on the NIC publicly available? Yes, I think so.</li>
<li>Q. Would a better comparison be to other techniques that use scheduled MACs? Trying to do something that works with what we have.</li>
</ul>
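<p>The effective SNR calculation (per-subcarrier SNR to bit-error rate, average, then back to an SNR) can be sketched like this. The example assumes BPSK only and uses the textbook AWGN error formula. A real 802.11n link would need one such curve per modulation, and nothing here comes from the authors' code.</p>

```python
import math


def bpsk_ber(snr_linear):
    """Textbook BPSK bit-error rate over AWGN: Q(sqrt(2 * SNR))."""
    return 0.5 * math.erfc(math.sqrt(snr_linear))


def effective_snr_db(subcarrier_snrs_db):
    """Average the per-subcarrier BER, then invert the BER curve by bisection."""
    snrs = [10 ** (db / 10) for db in subcarrier_snrs_db]
    target = sum(bpsk_ber(s) for s in snrs) / len(snrs)
    lo, hi = min(snrs), max(snrs)
    for _ in range(200):
        mid = (lo + hi) / 2
        if bpsk_ber(mid) > target:
            lo = mid  # too many errors at mid: the answer is a higher SNR
        else:
            hi = mid
    return 10 * math.log10((lo + hi) / 2)
```

<p>On a flat channel the result equals the per-subcarrier SNR. On a faded channel the weak subcarriers dominate the average error rate, so the effective SNR lands well below the plain average of the subcarrier SNRs, which is why it predicts delivery better than packet SNR.</p>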
<h3>SourceSync: A Distributed Architecture for Sender Diversity</h3>
<ul>
<li>Receiver diversity underlies many systems, such as opportunistic routing protocols, and WLAN diversity protocols. In opportunistic routing, let any router that receives the packet forward it. If multiple routers/channels with different loss rates, the loss probability is now the joint probability of losing the packet on all channels.</li>
<li>Sender diversity is the converse of receiver diversity. If many senders transmit simultaneously, it is unlikely that they will all be attenuated at the same time. It provides analogous benefits to receiver diversity. For example, connect many APs to wired Ethernet, and let many of them broadcast a packet simultaneously to the client.</li>
<li>Challenge: simultaneous transmissions don&#8217;t strengthen each other, because they are likely to be out of sync. Need distributed symbol-level synchronization.</li>
<li>An 802.11 symbol takes 3.2us. With synchronization error of 2us, the best you can get is an SNR of 2dB. 1us synchronization error gives 5dB. But for maximum bit rate, 802.11 needs an SNR of ~22dB.</li>
<li>Implemented the system, SourceSync, for an FPGA. Talking about opportunistic routing, but applies also to WLANs.</li>
<li>Natural way to synchronize transmitters is by reception.  But since multiple paths have different delays, need to compensate for these differences.</li>
<li>Path delay is made up of propagation delay and packet detection delay  (typically needing multiple samples to detect a symbol). Then a turnaround time between receipt and transmission.</li>
<li>Packet detection delay arises because receivers detect packets using correlation. Random noise can cause a receiver not to detect a packet on the first sample. Since routers see different random noise, they may take different numbers of samples. Routers can estimate this based on the phase shift in various carriers.</li>
<li>Hardware turnaround time is hardware dependent, caused by different hardware pipelines and radio frontends. Routers locally calibrate this using their clocks.</li>
<li>Propagation delay is measured by probe-response between node pairs. A knows its packet detection delay, B knows its packet detection delay, and so we can compute this from the RTT.</li>
<li>Challenge: can nodes synchronize using carrier sense? Transmission from one of the joint senders triggers the other senders. All nodes use CSMA, so one of the nodes wins contention and begins transmitting; other nodes join in if they have the data.</li>
<li>The lead sender adds a sync header to the packet, followed by a known fixed gap, so that all co-senders can join after the gap. Co-sender listens, turns around from receive to transmit, waits for a compensating delay, and sends the data.</li>
<li>Implemented in an FPGA of the WiGLAN radio. Built a testbed with a variety of line-of-sight and non-line-of-sight locations.</li>
<li>Evaluated: randomly pick a pair of nodes to transmit, and measured the synchronization error. 90th percentile of synchronization error was measured: 20ns at 5dB SNR, and as little as 1ns at 25dB SNR.</li>
<li>Can SourceSync achieve sender diversity gains? Two nodes transmit simultaneously to a receiver (again). Check that two channels have different OFDM subchannel SNRs (they do, in the example) and that SourceSync achieves higher SNR in all subchannels.</li>
<li>Compare using the best single access point to using SourceSync, with two senders and a client. Repeat for all locations. SourceSync gives a median throughput gain of 57%.</li>
<li>Compared with opportunistic routing. Single path does worst. ExOR does better. SourceSync + ExOR does best (doubled median throughput over single path, and 45% improvement over ExOR alone).</li>
<li>Q. Did you consider just increasing the power of a single AP instead of sending with multiple APs? There is a fundamental gain here: the SNR profile is different for different routers, and combining across multiple senders gets rid of these deep losses.</li>
<li>Q. Should there be more components to the calculation of delay, based on the RTT? The nice part about this technique is that channel access delay doesn&#8217;t affect us, because we use carrier sense for telling when to transmit.</li>
<li>Q. Why did you not compare the performance of your scheme to MIMO? Sender diversity is orthogonal to MIMO and could improve its performance.</li>
<li>Q. Is your synchronization header long enough to account for nodes being very distant? Actually, it&#8217;s the gap after the header that has to be long enough. It&#8217;s a simple system-level parameter.</li>
</ul>
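<p>Two calculations from this talk are easy to write down. The first is the diversity argument: if losses on different paths are independent (an assumption), the packet is lost only when every path loses it. The second is the probe-response estimate of propagation delay. The sketch assumes the measured RTT runs from the start of A's probe to the moment A detects B's response, so it contains both detection delays and B's turnaround. The notes do not spell this out, so treat that accounting as illustrative.</p>

```python
import math


def joint_loss_probability(loss_rates):
    """Probability that every path drops the packet, assuming independence."""
    return math.prod(loss_rates)


def propagation_delay_ns(rtt_ns, detect_a_ns, detect_b_ns, turnaround_b_ns):
    """One-way propagation delay from a probe-response round trip.

    The RTT is assumed to contain two propagation delays, one detection
    delay at each end, and the responder's receive-to-transmit turnaround.
    """
    return (rtt_ns - detect_a_ns - detect_b_ns - turnaround_b_ns) / 2
```

<p>For example, two paths with loss rates of 0.5 and 0.2 give a joint loss probability of 0.1, better than either path alone.</p>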
<h2>Novel Implementations of Network Components</h2>
<h3>SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware</h3>
<ul>
<li>[…]</li>
<li>Existing approaches involve developing custom software, custom hardware or programmable hardware.</li>
<li>Platform header: a hash value for custom forwarding, a bitmap for what preprocessor should execute on the packet, a forwarding mode (including longest prefix matching or an exact match; also able to throw a software exception) and the virtual data plane ID.</li>
<li>Virtual data plane has its own preprocessing, lookup and post-processing stages: they operate in isolation.</li>
<li>Preprocessing stage: select processing functions from a library of modules (such as path splicing, IPv6 and OpenFlow). Also hashing: operator indicates what bits in the header should be incorporated in the packet-header hash to determine how the packet should be forwarded (can include up to 256 bits from the header).</li>
<li>Can do OpenFlow, where forwarding decisions are made on a 13-tuple (240 bits), which SwitchBlade hashes for custom forwarding to be done.</li>
<li>Modules are implemented in Verilog. Preprocessing and postprocessing modules extract the bits for lookup.</li>
<li>Forwarding stage: perform output port lookup based on mode bits. A software exception can be thrown and the packet redirected to the CPU. Could do hardware-accelerated virtual routers in software.</li>
<li>Implemented on NetFPGA.</li>
<li>Evaluated for resource utilization and packet forwarding overhead. Compared to a baseline implementation on NetFPGA. There is minimal resource overhead and no packet forwarding overhead.</li>
<li>Evaluated on a three-node topology.</li>
<li>SwitchBlade uses 13 million gates to get four data planes; other implementations (IPv4, splicing, OpenFlow) have one data plane and use 8 to 12 million gates.</li>
<li>No additional forwarding overhead compared to the reference implementation.</li>
<li>SwitchBlade is a programmable hardware platform with customizable parallel data planes. Provides isolation using rate limiters and fixed forwarding tables.</li>
<li>Q. How do you scale the performance beyond tens of Gbps? An artifact of the NetFPGA implementation which uses 4&#215;1G ports. Later one will have 4&#215;10G.</li>
<li>Q. Doesn&#8217;t the next paper show that it is possible to do all this in software? Things like Click are limited by packet copying overhead, so you are limited by the bandwidth of the PCI bus.</li>
<li>Q. What kind of hash function do you use and do different applications require different properties? We use a collision-resistant hash.</li>
</ul>
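<p>The custom hashing step in the preprocessing stage (the operator picks which header bits feed the hash) can be sketched in software as below. SwitchBlade does this in Verilog on the NetFPGA. The SHA-256 choice and the 32-bit output width are assumptions for the example, and <code>header_hash</code> is a made-up name.</p>

```python
import hashlib
from operator import and_


def header_hash(header: bytes, bit_mask: bytes) -> int:
    """Hash only the header bits selected by bit_mask (a 1 bit means include)."""
    if len(header) != len(bit_mask):
        raise ValueError("header and mask must have the same length")
    # Zero out every bit the operator did not select before hashing.
    selected = bytes(map(and_, header, bit_mask))
    return int.from_bytes(hashlib.sha256(selected).digest()[:4], "big")
```

<p>Two packets that differ only in unselected bits get the same hash and therefore the same forwarding decision, which is what lets one lookup stage serve protocols with different header layouts.</p>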
<h3>PacketShader: A GPU-Accelerated Software Router</h3>
<ul>
<li>Prototype achieves 40 Gbps on a single box by exploiting GPU acceleration.</li>
<li>Software routing is not just IP routing. It is driven by software and exploits commodity hardware.</li>
<li>10G NICs cost from $200&#8211;300 per port. But software routers are limited to less than 10Gbps (8.7Gbps in RouteBricks is the best so far).</li>
<li>For 10G, it takes 1200 cycles to do packet I/O, and your budget is 1400 cycles. Lookup/encryption/hashing typically takes much more than that.</li>
<li>First step is to optimize the packet I/O. Then offload the other functions to the GPU.</li>
<li>GPUs are massively-parallel. Lots of small cores.</li>
<li>A GTX480 GPU has 480 cores and 1.2 billion transistors, most of which are dedicated to ALUs.</li>
<li>Operations like hashing, encryption, pattern matching, network coding and compression are computationally intensive. GPU is well suited to these. GPU can also effectively hide memory latency.</li>
<li>Memory bandwidth of a top-of-the-line CPU is 32GB/s, but the empirical bandwidth (on realistic access patterns) is 25GB/s. Multiple ports receiving and transmitting will consume this and cause contention. However, a GPU has 174GB/s memory bandwidth.</li>
<li>Key insight: stateless packet processing is parallelizable. Take packets from the head of the receive queue, batch them and process them in parallel.</li>
<li>Latency is not impacted by parallel processing.</li>
<li>Before shader: checksum, TTL, format check, etc. This will send some packets along the slow path. It collects the destination IP addresses and passes those to the shader.</li>
<li>Shader: takes IP addresses, looks up the forwarding table and returns the next hops.</li>
<li>Post-shader: packets are updated and transmitted through the output ports.</li>
<li>Also device drivers at the receive and transmit side. Implemented a custom driver; details in the paper.</li>
<li>Can scale further with a multicore CPU. One master core and three worker cores. Master core talks to the shader. Once you have multi-socket, you need one GPU per CPU. Multi-socket, there is no communication between the CPUs, and each CPU owns a subset of the input queues.</li>
<li>Evaluated by connecting a packet generator and PacketShader back-to-back. Generator generates up to 80Gbps.</li>
<li>GPU gives a speedup (over CPU-only) of 1.4x for IPv4, 4.8x for IPv6, 2.1x for OpenFlow and 3.5x for IPSec.</li>
<li>IPv6 table lookup requires more power than IPv4 lookup. Algorithm is binary search on hash tables. Big performance improvement for small packets, but slightly worse for 1024 and 1514 bytes. However, this is bounded by the motherboard I/O capacity.</li>
<li>IPSec tunneling adds a header and trailer to the encrypted packet.  The improvement is across all packet sizes, and is actually bigger for larger packets.</li>
<li>PacketShader achieves 28.2 Gbps with CPU only, and is implemented in user space, rather than kernel space. Reaches 39.2 Gbps with the GPU.</li>
<li>Need to add a control plane (currently only does static forwarding). Need Quagga or Xorp.</li>
<li>Could also integrate with a programming environment, such as Click.</li>
<li>Q. Is it worth implementing such a sophisticated design to make a 40% saving? And do you have a breakdown of where the savings are made? The budget numbers and breakdown are taken from RouteBricks.</li>
<li>Q. What do you think about the power efficiency of this compared to other approaches? Idle to full load is 327W&#8211;594W with two CPUs and two GPUs. (Compared to 260W&#8211;353W for two CPUs.)</li>
<li>Q. Does this approach have advantages over an integrated network processor in terms of scalability or programmability? Network processors are not commodity. Based on experience, they are much more difficult to program.</li>
<li>Q. Why did your approach have such a significant speedup over RouteBricks etc. even without the GPU? Improvements in packet I/O throughput.</li>
</ul>
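<p>The &#8220;binary search on hash tables&#8221; used for IPv6 lookup can be sketched in a few lines of Python. This is a minimal illustration, assuming the phrase means binary search over prefix lengths with marker entries that point the search towards longer prefixes; it is not PacketShader&#8217;s GPU code. The function names, the bit-string prefixes and the next-hop labels are hypothetical.</p>

```python
def naive_lpm(prefixes, addr):
    """Reference: next hop of the longest matching prefix, by linear scan."""
    best = None
    for p in prefixes:
        if addr.startswith(p) and (best is None or len(p) > len(best)):
            best = p
    return prefixes[best] if best is not None else None


def build_tables(prefixes):
    """Build one hash table per distinct prefix length, with markers.

    prefixes maps a bit string such as "1011" to a next hop (hypothetical).
    Every entry stores the next hop of the best real prefix covering it.
    """
    lengths = sorted({len(p) for p in prefixes})
    tables = [dict() for _ in lengths]
    for p in prefixes:
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if lengths[mid] < len(p):
                # Leave a marker so lookups know to try longer lengths.
                key = p[:lengths[mid]]
                tables[mid][key] = naive_lpm(prefixes, key)
                lo = mid + 1
            elif lengths[mid] == len(p):
                tables[mid][p] = prefixes[p]
                break
            else:
                hi = mid - 1
    return lengths, tables


def lookup(lengths, tables, addr):
    """Longest-prefix match by binary search over the prefix lengths."""
    best = None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = addr[:lengths[mid]]
        if key in tables[mid]:
            if tables[mid][key] is not None:
                best = tables[mid][key]
            lo = mid + 1   # hit (prefix or marker): try longer prefixes
        else:
            hi = mid - 1   # miss: only shorter prefixes can match
    return best


routes = {"1": "A", "10": "B", "1011": "C", "0110": "D", "011010": "E"}
lengths, tables = build_tables(routes)
print(lookup(lengths, tables, "01100000"))   # D
```

<p>Each lookup probes on the order of log2(W) tables for W distinct prefix lengths, rather than one table per length, which is presumably what makes the scheme attractive for long IPv6 prefixes.</p>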
<h3>EffiCuts: Optimizing Packet Classification for Memory and Throughput</h3>
<ul>
<li>Packet classification is important for security, traffic monitoring and analysis, and QoS. Usually based on the source and destination IPs and ports, and the protocol field.</li>
<li>Line rates and classifier sizes are increasing. This leads to high power consumption.</li>
<li>Previous approaches have used either TCAMs (poor scalability) or algorithmic approaches (potentially scalable, but problematic). Most promising approach based on decision trees. Aim of this work is to address the scalability of decision tree algorithms.</li>
<li>HiCuts and HyperCuts have investigated decision trees previously. However, they require large memory.</li>
<li>EffiCuts reduces the memory overhead of HyperCuts while achieving high packet throughput. Uses 57x less memory and 8x less power.</li>
<li>Rules in the decision tree are hypercubes in the rule space. Tree building successively cuts down the rule space into smaller sub-spaces. Stops when the cube is small. Classification uses tree traversal.</li>
<li>HyperCuts&#8217; memory overhead is due to many rules overlapping and varying in size, because fine cuts to separate small rules lead to cuts to and replication of large rules. Also overhead because the rule space is sparse (leading to empty nodes or nodes with replicated rules).</li>
<li>Aim to tackle the variation in the rule size and the density of the rule space.</li>
<li>Separable trees: build separate trees for large and small rules. But separate them along different dimensions.</li>
<li>Build a distinct tree for each set of separable rules in 5 IP fields. This leads to a maximum of 31 trees, but in practice it&#8217;s more like 12.</li>
<li>Extra memory accesses to traverse multiple trees decrease packet throughput. To reduce the number of accesses, merge the trees.</li>
<li>HyperCuts uses equi-sized cuts to separate dense areas, whereas EffiCuts uses equally-dense cuts, which leads to fine/coarse cuts in dense/sparse areas. Many details of this in the paper.</li>
<li>Node co-location: colocate a node and its children, details of this in the paper.</li>
<li>Implemented HiCuts and HyperCuts with all heuristics, and EffiCuts. Used 16 rules per leaf. Power comparison uses an estimation from the Cacti tool to simulate the SRAM/TCAM.</li>
<li>First result: HyperCuts and HiCuts see memory grow more rapidly than EffiCuts. Replication decreases from 1000 to &lt; 9. EffiCuts needs a constant number of bytes per rule as the number of rules grows.</li>
<li>EffiCuts requires 50% more memory accesses than HyperCuts. However, since EffiCuts uses much less memory, memory copies are inexpensive.</li>
<li>Throughput results are mixed (149 down to 73 million packets per second for one rule set; but 218 up to 318 for another). Still see an 8x saving in power.</li>
<li>Also compared EffiCuts to TCAM. Throughput story is also mixed, but EffiCuts consumes 6x less power than a TCAM.</li>
<li>Q. How do you separate &#8220;large&#8221; and &#8220;small&#8221; rules&#8212;using a threshold? We observed that the rule spread is essentially bimodal. This is based on a sensitivity analysis to the &#8220;largeness fraction&#8221; which varies between 0.1 and 0.95 without affecting the split.</li>
<li>Q. How would power consumption compare to a TCAM where you selectively turn on the relevant banks? Since we are comparing the worst-case packet match, every rule could go to a very different bank.</li>
</ul>
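<p>The separable-trees idea can be sketched as a grouping step: classify each rule by the set of fields in which it is &#8220;large&#8221;, and give each group its own tree. The Python below is a minimal sketch under stated assumptions, not the EffiCuts implementation: the field names, the inclusive-range rule format and the default threshold of 0.5 are all hypothetical.</p>

```python
from collections import defaultdict

FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port", "proto")
FIELD_SPAN = {"src_ip": 2**32, "dst_ip": 2**32,
              "src_port": 2**16, "dst_port": 2**16, "proto": 2**8}


def large_fields(rule, largeness_fraction=0.5):
    """Return the set of fields in which the rule covers more than the
    given fraction of the field's full range (i.e. is 'large')."""
    large = set()
    for name in FIELDS:
        lo, hi = rule[name]          # inclusive range
        if (hi - lo + 1) / FIELD_SPAN[name] > largeness_fraction:
            large.add(name)
    return frozenset(large)


def separate(rules, largeness_fraction=0.5):
    """Group rules by their set of large fields; one tree per group."""
    groups = defaultdict(list)
    for rule in rules:
        groups[large_fields(rule, largeness_fraction)].append(rule)
    return dict(groups)


ANY_IP, ANY_PORT = (0, 2**32 - 1), (0, 2**16 - 1)
rules = [
    # any source -> 10.0.0.0/24, TCP port 80
    {"src_ip": ANY_IP, "dst_ip": (167772160, 167772415),
     "src_port": ANY_PORT, "dst_port": (80, 80), "proto": (6, 6)},
    # any source -> 10.0.1.0/24, TCP port 443
    {"src_ip": ANY_IP, "dst_ip": (167772416, 167772671),
     "src_port": ANY_PORT, "dst_port": (443, 443), "proto": (6, 6)},
    # 192.168.1.0/24 -> anywhere, UDP, any ports
    {"src_ip": (3232235776, 3232236031), "dst_ip": ANY_IP,
     "src_port": ANY_PORT, "dst_port": ANY_PORT, "proto": (17, 17)},
]
groups = separate(rules)
print(len(groups))   # 2
```

<p>Because the example rules are either wildcards or very narrow, moving the threshold between 0.1 and 0.95 leaves the grouping unchanged, which mirrors the bimodality observation in the Q&amp;A.</p>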
<h2>Cloud and Routing</h2>
<h3>Theory and New Primitives for Safely Connecting Routing Protocol Instances</h3>
<h3>DONAR: Decentralized Server Selection for Cloud Services</h3>
<h3>Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud</h3>
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2010/09/01/sigcomm-2010-day-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SIGCOMM 2010: Day 1</title>
		<link>http://www.mrry.co.uk/blog/2010/08/31/sigcomm-2010-day-1/</link>
		<comments>http://www.mrry.co.uk/blog/2010/08/31/sigcomm-2010-day-1/#comments</comments>
		<pubDate>Tue, 31 Aug 2010 10:00:32 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/?p=67</guid>
		<description><![CDATA[Wireless and Measurement
Efficient Error Estimating Coding: Feasibility and Applications

Won best paper award.
Existing philosophy: errors are bad and only want completely correct data.
Can we accept partially correct packets and only enforce correctness end-to-end?
Contribution: error estimating coding. Enables the receiver to estimate the number of errors in a packet, but not correct them.
Smaller overhead/weaker functionality than error [...]]]></description>
			<content:encoded><![CDATA[<h2>Wireless and Measurement</h2>
<h3>Efficient Error Estimating Coding: Feasibility and Applications</h3>
<ul>
<li>Won best paper award.</li>
<li>Existing philosophy: errors are bad and only want completely correct data.</li>
<li>Can we accept partially correct packets and only enforce correctness end-to-end?</li>
<li>Contribution: error estimating coding. Enables the receiver to estimate the number of errors in a packet, but not correct them.</li>
<li>Smaller overhead/weaker functionality than error correcting codes.</li>
<li>Overheads come from redundancy and computation.</li>
<li>EECs need only O(log n) bits redundancy to estimate errors. e.g. 2% overhead on a 1500-byte packet. For just a threshold check, need only 4 bytes.</li>
<li>Efficient computationally: software implementation can support all 802.11 data rates. ECC is 10 to 100 times slower.</li>
<li>Can estimate number of errors in a packet with a provable quality.</li>
<li>Application: streaming video. FEC (forward error correction) often used here. Routers forward partially correct packets. But if the number of errors is so large that the data are unrecoverable, it will incur retransmission. Router should have asked for retransmission earlier when a packet could not be decoded. But it lacks computational power to evaluate an ECC. EEC is more computationally tractable in this scenario for BER-aware retransmission.</li>
<li>Implemented for Soekris Net5501-70 routers.</li>
<li>Key idea: router should treat different packets differently. Could use analogue or digital amplification as necessary.</li>
<li>Packet scheduling: image sensor network for emergency response. Let packets with smaller BER get through first.</li>
<li>Applies also to bulk data transfer. Can use partial packets and correct them end-to-end. Could use network coding or incremental redundancy. EEC helps to do WiFi rate adaptation, if we know the mapping between data rate and BER, which EEC provides. Existing systems based on packet loss rate, signal-to-noise ratio (at the receiver), or modifying the PHY.</li>
<li>Implemented a prototype rate adaptation scheme using EEC. Consistently outperforms existing schemes based on packet loss rate and SNR.</li>
<li>More general problem is wireless carrier selection: goal is to select carrier with the best goodput.</li>
<li>Packet has n + k slots (n data bits and k EEC bits). p is the fraction of erroneous slots, with arbitrary position.</li>
<li>Naïve solution would be to add pilot bits to each packet with known values at known positions. But this needs the receiver to observe enough errors in the pilot bits. Since BER is usually small, need a lot of pilots to see a single error.</li>
<li>Instead, make the pilot bit a parity bit based on a known group of data bits in the packet. But cannot distinguish many cases with parity bit (1 error vs. 3 errors). Error probability of a parity bit is (inversely) correlated with error probability of the bits in its group.</li>
<li>Solution involves randomly selecting data bits to make up fixed sized groups and compute an EEC parity bit. Now permute the data and EEC bits and send.</li>
<li>Can refine to single- or multiple-level EEC. Details of multiple-level EEC in the paper.</li>
<li>Can prove a formal guarantee for the space complexity of the number of EEC bits (the O(log n) bound).</li>
<li>Compared to SoftPHY, EEC is a pure software solution, which is more deployable. But SoftPHY gives per-bit confidence information that EEC cannot provide.</li>
</ul>
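<p>The parity-group idea can be sketched as follows. This is a single-level simplification in Python, not the estimator from the paper: it assumes every slot (data or parity) is flipped independently with the same probability, and the function names, group size and shared seed are hypothetical. A parity check over m slots fails when an odd number of them are flipped, which happens with probability (1 - (1 - 2p)^m) / 2, so the observed failure fraction can be inverted to estimate p.</p>

```python
import random


def make_groups(n_data, group_size, n_groups, seed=0):
    """Randomly pick the data-bit positions covered by each parity bit.
    Sender and receiver are assumed to share the seed."""
    rng = random.Random(seed)
    return [rng.sample(range(n_data), group_size) for _ in range(n_groups)]


def eec_bits(data, groups):
    """One parity bit per group of data bits."""
    return [sum(data[i] for i in g) % 2 for g in groups]


def failed_fraction(data, parity, groups):
    """Fraction of parity checks that fail on the received bits."""
    recomputed = eec_bits(data, groups)
    return sum(a != b for a, b in zip(recomputed, parity)) / len(groups)


def parity_failure_prob(p, m):
    """Probability that an odd number of m slots are flipped when each
    slot is flipped independently with probability p."""
    return (1.0 - (1.0 - 2.0 * p) ** m) / 2.0


def estimate_ber(q, m):
    """Invert parity_failure_prob: estimate the per-slot error rate from
    the observed failure fraction q (valid for q below 0.5)."""
    return (1.0 - (1.0 - 2.0 * q) ** (1.0 / m)) / 2.0


data = [0] * 64
groups = make_groups(len(data), group_size=8, n_groups=32)
parity = eec_bits(data, groups)
q = failed_fraction(data, parity, groups)     # no errors, so q is 0.0
print(estimate_ber(q, m=8 + 1))               # 0.0
```

<p>Here m is the group size plus one, because the parity bit itself can also be corrupted in transit.</p>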
<h3>Design and Implementation of an &#8220;Approximate&#8221; Communication System for Wireless Media Applications</h3>
<ul>
<li>Leverages properties of wireless PHY layer to improve media applications.</li>
<li>Media applications by 2013 will comprise 91% of internet traffic, and wireless-based access is the dominant form of access (4 billion wireless hosts vs. 0.5 billion static hosts).</li>
<li>In this case, we should use the spectrum efficiently for media transfer.</li>
<li>Looking at hierarchically-structured media, such as MPEG4 and H.264. Different frames (e.g. I, P, B) have different value (I &gt; P &gt; B). So use unequal error protection to prioritize important frames.</li>
<li>Since data received is a predictable approximation of transmitted data, they can provide unequal error protection almost for free (in terms of additional spectrum).</li>
<li>Errors result from the constellation diagram used to determine QAM encoding. Typically, the error is restricted to symbols in the neighborhood of the received symbol in the constellation.</li>
<li>Using a Gray code-based bit mapping from symbols to the QAM encoding. So, by definition, neighboring symbols are just one bit different. Different bit positions offer different levels of protection. So can choose different bit positions to give different protection to data.</li>
<li>For the media example, put I frames in more-protected bit positions, and other frames in the other positions.</li>
<li>Also considered a block mapping scheme, which gives better protection for the most protected bits, and worse protection for the other bits, than a Gray code mapping.</li>
<li>Designed a system based on these principles: APEX.</li>
<li>A modern radio pipeline will apply randomization, such as scrambling, coding and interleaving. But this makes it harder to determine what bits go where. So they move randomization to before the assignment of bit positions.</li>
<li>Uses a greedy algorithm to map application data at various priorities that deals with unequal content size.</li>
<li>Evaluated at various bit rates and bit error rates. Gets a better PSNR than traditional transmission. Also showed that it works well with FEC.</li>
<li>Q. Did your evaluation take fading into account, and would the assumptions still hold? Experimentation done using a system not robust enough to take outdoors and do experiments. The assumptions might hold. If you are sharing wireless LANs more, does this become more critical?</li>
<li>Q. How will it work when your bit-rate adaptation drops down to BPSK and QPSK? It does nothing in this case. How might it work? Could instead use smaller QAM symbols.</li>
<li>Q. What if you get 180 degree phase shift, due to fading or propagation delay, and the mapping changes? At the PHY layer, we expect there to be a mapping, and an indexing mechanism that can decode information in the header of the packet to select the mapping.</li>
</ul>
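<p>The claim that different bit positions get different protection under a Gray-code mapping can be checked with a short Python sketch. It assumes a square QAM constellation whose amplitude levels along one axis are labelled in binary-reflected Gray-code order (for example 2 bits per axis for 16-QAM), and that errors move a symbol to an adjacent level; this is an illustration of the principle, not APEX itself, and the function names are hypothetical.</p>

```python
def gray(i):
    """Binary-reflected Gray code of integer i."""
    return i ^ (i >> 1)


def flips_per_bit(bits):
    """For 2**bits amplitude levels labelled in Gray-code order, count how
    many of the adjacent-level boundaries flip each bit position
    (index 0 is the least significant bit)."""
    counts = [0] * bits
    for level in range(2 ** bits - 1):
        diff = gray(level) ^ gray(level + 1)
        assert bin(diff).count("1") == 1   # neighbours differ in one bit
        counts[diff.bit_length() - 1] += 1
    return counts


print(flips_per_bit(2))   # [2, 1]
print(flips_per_bit(3))   # [4, 2, 1]
```

<p>The most significant bit changes at only one boundary, so under this assumption it is the best-protected position and the natural home for I-frame data.</p>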
<h3>Not All Microseconds Are Equal: Enabling Per-Flow Measurements With Reference Latency Interpolation</h3>
<ul>
<li>Low-latency (e.g. financial) applications are increasingly important.</li>
<li>Current solution is a low-latency cut-through switch. In a tree network, it is hard to tell which switch is causing a problem at microsecond granularity. Need high-fidelity measurement within the routers themselves.</li>
<li>But SNMP and NetFlow provide no latency measurements, and active probes are typically only end-to-end. Measurement boxes are very expensive (£90k).</li>
<li>Need per-flow measurements because averages lose too much information about what happens to each flow. There is a significant amount of difference in average latencies across flows at a router.</li>
<li>Perform measurement on packet ingress and egress. Assume that router interfaces are synchronized, because cannot modify packets to carry timestamps.</li>
<li>Naïve approach: store timestamps for each packet on ingress and egress. Packet timestamps get sent along the egress route when the flow stops. Obviously this is too costly to do at 10Gbps. Sampling sacrifices accuracy.</li>
<li>Use LDAs with many counters for interesting flows, counting packets seen at each timestamp.</li>
<li>Divide time into windows, and measure the mean delay for packets within that window (locality observation). Error shrinks with a smaller window size.</li>
<li>Can inject a reference packet regularly that does have ingress and egress timestamps, which gives delay samples for each window.</li>
<li>Implementation has two components: reference packet generator and latency estimator.</li>
<li>Reference packet generation strategies: 1 in n packets or 1 in tau seconds. Actual approach is dynamic injection based on utilization. When high utilization, inject fewer packets.</li>
<li>Latency estimator strategies: could use only the left reference packet (previous reference packet) or a linear interpolation of the left and right reference packets. Other non-linear estimators, such as shrinkage, are possible.</li>
<li>Maintain packet count, summed delay and sum of squares of packet delays for each flow.</li>
<li>Evaluated on various router traces, and simulated with NetFlow YAF implementation. (RED active queue management policy.) Median relative error is 10&#8211;12%. As utilization grows, error decreases. RLI outperforms other schemes by 1 to 2 orders of magnitude.</li>
<li>Overhead is &lt; 0.2% of link capacity. Packet loss difference is 0.001% at 80% utilization.</li>
<li>Q. Have you talked to router vendors about implementing this? No.</li>
<li>Q. When you compare the different approaches, why is sampling so much worse? Sampling scheme is one packet per thousand. So accuracy is very low if few packets are delayed.</li>
<li>Q. How does it influence the results if packet loss rates are high? Lost packets are not counted in our result.</li>
</ul>
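<p>The linear-interpolation estimator and the per-flow state can be sketched in Python as below. This is a minimal illustration of the description above, assuming reference packets are available as (arrival time, measured delay) pairs; the class and function names and the example numbers are hypothetical.</p>

```python
def interpolate_delay(t, left, right):
    """Estimate the delay of a packet seen at time t from the two
    surrounding reference packets, each given as (time, delay)."""
    (t0, d0), (t1, d1) = left, right
    if t1 == t0:
        return d0
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)


class FlowStats:
    """Per-flow accumulators: packet count, summed delay, sum of squares."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, delay):
        self.count += 1
        self.total += delay
        self.total_sq += delay * delay

    def mean(self):
        return self.total / self.count

    def variance(self):
        m = self.mean()
        return self.total_sq / self.count - m * m


# Reference packets at t=0 and t=100 with delays of 10 and 20 (microseconds).
left, right = (0.0, 10.0), (100.0, 20.0)
flow = FlowStats()
for t in (25.0, 50.0, 75.0):
    flow.add(interpolate_delay(t, left, right))
print(flow.mean())   # 15.0
```

<p>Keeping only three numbers per flow is what makes per-flow means and variances affordable: no per-packet timestamps need to be stored.</p>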
<h2>Data Center Networks</h2>
<h3>Generic and Automatic Address Configuration for Data Center Networks</h3>
<ul>
<li>Manual configuration is error-prone, causing downtime. DHCP is used at layer 2. But applications need this information as well. Data center designs (manually) encode this information in the IP address for routing. But DHCP isn&#8217;t enough for this.</li>
<li>Takes two inputs: a blueprint graph (with logical IDs for each machine) that can be automatically generated, and a physical topology graph that is available later when the data center is constructed.</li>
<li>Center of framework is a device-to-logical ID mapping. Need malfunction detection to update the mapping.</li>
<li>Maintaining a map between devices and the logical IDs is the graph isomorphism problem. The complexity of this problem is unknown (P or NPC). Introduce the O2 algorithm which solves the problem in this case. Proceeds by choosing an arbitrary first node matching (decomposition), then refinement. Terminates when no cell can be decomposed. Overall algorithm terminates when all cells have single nodes.</li>
<li>O2 has three weaknesses: (i) iterative splitting in the refinement stage, (ii) iterative mapping in the decomposition stage, and (iii) making a random selection of the mapping candidate. Optimization algorithm has three heuristics that address these problems.</li>
<li>O2 turns out to be faster than its competitors: Nauty and Saucy.</li>
<li>Malfunctions cause the topology to differ from the blueprint, so O2 cannot find the mapping. Solution is to find the maximum common subgraph between the blueprint and physical graphs. The algorithm for this is NP- and APX-hard.</li>
<li>Use heuristics based on the vertex degree changing. If no degrees have changed, probe subgraphs derived from anchor points using majority voting to identify miswired devices.</li>
<li>Protocols for channel building, physical topology collection and logical ID dissemination. A DAC manager coordinates these.</li>
<li>Experimented on a BCube(8, 1) network with 64 servers. The total time to run the algorithm was 275 milliseconds to autoconfigure all of the servers.</li>
<li>Ran simulations on larger topologies. Up to 46 seconds on a DCell(6, 3) network with 3.84 million devices.</li>
<li>Q. What are the next steps for this work? Can we design better logical IDs that can be used in routing.</li>
</ul>
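<p>To make the device-to-logical-ID problem concrete, here is a brute-force Python sketch that tries every assignment of logical IDs to devices until the physical edges line up with the blueprint. This is deliberately not the O2 algorithm (it has none of the refinement or decomposition steps) and only works for a handful of nodes; the node names are hypothetical, and it assumes each edge is listed once and every node has at least one edge.</p>

```python
from itertools import permutations


def find_mapping(blueprint_edges, physical_edges):
    """Return a dict device -> logical ID that makes the physical
    topology match the blueprint, or None if there is no such mapping.
    Exhaustive search: only usable for a handful of nodes."""
    logical = sorted({v for e in blueprint_edges for v in e})
    devices = sorted({v for e in physical_edges for v in e})
    if len(logical) != len(devices):
        return None
    if len(blueprint_edges) != len(physical_edges):
        return None
    target = {frozenset(e) for e in blueprint_edges}
    for perm in permutations(logical):
        mapping = dict(zip(devices, perm))
        mapped = {frozenset((mapping[a], mapping[b]))
                  for a, b in physical_edges}
        if mapped == target:
            return mapping
    return None


blueprint = [("sw0", "s0"), ("sw0", "s1")]            # one switch, two servers
physical = [("mac-b", "mac-a"), ("mac-a", "mac-c")]   # as discovered on the wire
mapping = find_mapping(blueprint, physical)
print(mapping["mac-a"])   # sw0
```

<p>A miswired topology (for example an extra cable between the two servers) makes the search return no mapping, which is the situation the malfunction-detection step has to handle.</p>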
<h3>Symbiotic Routing in Future Data Centers</h3>
<ul>
<li>Reevaluate network architecture based on the different properties of data center networks when compared to the internet.</li>
<li>Despite lots of other work, the network interface has not changed, so what can we do at the application layer?</li>
<li>Network is a black box, and applications have to infer things like locality, congestion and failure; likewise networks have to infer things about the applications like flow properties.</li>
<li>MSRC designed CamCube: a network with x86 servers directly-connected in a 3D torus. Servers have (x, y, z) coordinates that are exposed to the application. The send/receive API is a simple 1-hop API. Multi-hop routing is provided as a service, which uses multipath when possible.</li>
<li>Built a high-throughput transport service, a large-file multicast service, an aggregation service and a distributed key-value cache service. Each had a custom routing protocol based on the properties that the application needed to obtain. e.g. High-throughput transport prefers disjoint paths, whereas file multicast prefers non-disjoint paths.</li>
<li>Testbed used 27 servers with 6&#215;1G NICs. A simulator looked at a 20^3 (8000) node CamCube.</li>
<li>Custom routing policies yield a performance improvement (on average). Factor of 2 (median) improvement in end-to-end throughput for high throughput transport (10k x 1500 byte packets). Gains also for multicast and the distributed object cache (in terms of path length).</li>
<li>Also looked at impact on the network: achieved higher throughput with fewer packets (lower link utilization) for all applications.</li>
<li>The base routing protocol is still used to route along paths defined in the custom routing protocol, and to handle network failures. The custom protocols route for the common case.</li>
<li>Built a routing framework for describing these custom protocols. Two components: the routing policy and the queuing policy. Each service manages one packet queue per link.</li>
<li>Cache service: keyspace mapped onto the cube, evenly distributed across the servers. Routing: go to the nearer of the cache or primary nodes. On a cache miss, route from the cache to the primary, and populate the cache on the return.</li>
<li>The base protocol routes around link failures. If a replica server fails (in the key-value store), the key space is consistently remapped by the framework.</li>
<li>Forwarding function implemented in C#, running in userspace.</li>
<li>Benchmarked a single server in the testbed, communicating at 11.8Gbps with all six neighbors. Required 20% CPU utilization.</li>
<li>Can the routing approach be used outside CamCube? Network only needs to provide information about path diversity and topology, and the ability to program components.</li>
<li>Q. When you make an application for this framework, what would happen to the application if you decided to change the topology? The benefit of the black-box approach is that you don&#8217;t care about the topology. May be advantageous to target containerized/modular data centers where the topology cannot frequently (or at all) change.</li>
<li>Q. How would the performance look on other topologies, considering that the torus is optimal for latency and bandwidth? It is a benefit and a curse, given that we occasionally have long path length for some pairs. Topologies that give you higher path diversity give you better chances to employ these ideas.</li>
<li>Q. What if you ran multiple instances of the same application (rather than applications with very diverse routing policies)? We did run this, but the details are in the paper. For the high-throughput transport protocol, you might expect us to be susceptible to congestion, but the forwarding function returns many choices, which you can choose based on minimal queue length, for example.</li>
<li>Q. What is the net result here for forwarding latency? This is one of the main critiques of the topology itself. Currently experimenting with smart NICs so that we don&#8217;t have to go up to user space for straightforward forwarding.</li>
<li>Q. Can you write the forwarding method in the form of many overloaded methods that could be dispatched dynamically? To some extent, but packets are tagged with a service ID, which statically dispatches the forwarding method of the particular service.</li>
</ul>
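<p>The coordinate space that CamCube exposes can be sketched in Python as a 3D torus with wrap-around links. This is an illustrative model only, not the CamCube API: the function names are hypothetical, and treating the 27-server testbed as a 3x3x3 cube is an inference from the server and NIC counts in the notes.</p>

```python
def neighbors(coord, k):
    """One-hop neighbours of a server at (x, y, z) in a k x k x k 3D torus."""
    result = set()
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % k   # wrap around at the edges
            result.add(tuple(n))
    result.discard(tuple(coord))
    return result


def hop_distance(a, b, k):
    """Minimum number of hops between two coordinates in the torus."""
    total = 0
    for i in range(3):
        d = abs(a[i] - b[i])
        total += min(d, k - d)
    return total


print(len(neighbors((0, 0, 0), 3)))                  # 6
print(hop_distance((0, 0, 0), (10, 10, 10), 20))     # 30
```

<p>With coordinates visible to the application, a service such as the cache can compare hop distances itself, for instance to pick the nearer of a cache node and a primary node.</p>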
<h3>Data Center TCP</h3>
<ul>
<li>TCP is used for 99% of traffic in data centers, but what is the problem with it? Can suffer from bursty packet drops (Incast), and builds up large queues that add significant latency.</li>
<li>Many ad hoc workarounds for TCP problems, such as at the application level. This talk is about changing the TCP stack in the kernel to address these problems.</li>
<li>Interviewed developers, analyzed applications and did a lot of measurements. Systematic study of impairments and requirements in Microsoft&#8217;s data centers.</li>
<li>Case study: Microsoft Bing data center. 6000 servers, with passive instrumentation (application/socket/kernel-level). Search query goes to top-level aggregator, which splits the query and farms it out to mid-level aggregators, which then farm it out to worker nodes. Worker nodes have a 10ms deadline; mid-level aggregators are 50ms; and the top-level deadline is 250ms. Missed deadlines lead to missing data in the results.</li>
<li>Similarly, Facebook builds a page by pulling data from various servers. Similar traffic pattern to Bing.</li>
<li>Incast happens when &#8220;synchronized mice collide&#8221;. Caused by partition/aggregate: the queue overflows at the aggregator. To deal with the problem, Bing jitters requests over a 10ms window. This gets better performance at the median, but causes problems at higher percentiles (up to 99.9th is tracked).</li>
<li>Queue buildup occurs when big flows take up too much of a queue and increase the latency for short flows.</li>
<li>Requirements: 1. High burst tolerance; 2. Low latency for short flows; 3. High throughput. 1 and 3 are in tension with 2. Deep buffers help 1 and 3 but increase latency. Shallow buffers are bad for bursts and throughput.</li>
<li>Objective is low queue occupancy, with high throughput.</li>
<li>TCP can use explicit congestion notification (ECN): marks are inserted in packets in the middle of the network, noted by the receiver and sent back to the sender in the ACKs.</li>
<li>Need C * RTT buffers for a single flow running at 100% throughput. If you have a large number of flows, you can have fewer buffers. If there is a low variance in sending rate, small buffers are sufficient.</li>
<li>Key idea: react in proportion to the extent of congestion, not its presence (cut the congestion window by less if there are fewer congestion notification bits set). Other key idea: mark packets based on the instantaneous queue length.</li>
<li>Sender maintains a running average of marked packets, and adaptively cuts the congestion window based on how many packets are marked.</li>
<li>On a real deployment, DCTCP for two flows keeps the queue length much shorter than regular TCP.</li>
<li>Get high burst tolerance by having large queues. Low latency by having low buffer occupancy.</li>
<li>How long can DCTCP maintain queues without loss of throughput, and how do you set the parameters? Need to ensure that the queue size is stable by quantifying the oscillations.</li>
<li>Implemented on Windows 7 using real hardware with 1G and 10G stacks. Aim was to emulate traffic within one rack of the Bing data center. Generated query and background traffic based on distribution seen in Bing. For background flows, DCTCP gets lower latency for small flows, and matches TCP for large flows. For query flows, DCTCP does much better than TCP.</li>
<li>Tried scaling the traffic (background and query) by 10x. Compared DCTCP, TCP and TCP-RED with shallow-buffered switches, and TCP with a deep-buffered switch. TCP-DeepBuf does terribly for latency with short flows, and TCP-ShallowBuf suffers due to Incast.</li>
<li>Q. Shouldn&#8217;t you be comparing with TCP-ECN? We have done those comparisons, though they aren&#8217;t in the talk.</li>
<li>Q. If you reduce in proportion to the number of bits, it depends on the timescales on which the queue builds up, which depends on the number of competing sources and their own reactions. Don&#8217;t you have to do some control theoretic modeling? Data center is a homogeneous environment with all sources being DCTCP. Even there you really need to figure out the bandwidth properly? There is more detail in the paper.</li>
<li>Q. Depending on the instantaneous queue for the notification has implications for the dynamics, and I am concerned about this? Homogeneous RTT helps here, thanks to the data center environment. We believe there are some simple ways to solve the problem for heterogeneous RTT (e.g. multiple queues). I&#8217;d like to see how much diversity you can tolerate?</li>
</ul>
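<p>The &#8220;react in proportion to the extent of congestion&#8221; idea can be sketched in Python as a running average plus a proportional window cut. The notes above do not give the exact formulas, so the update rule, the weight <code>g</code> = 1/16 and the function names below are assumptions based on the commonly described form of DCTCP rather than on this talk.</p>

```python
def update_alpha(alpha, marked, total, g=1.0 / 16):
    """Exponentially weighted moving average of the fraction of
    ECN-marked packets seen in the last window of data."""
    fraction = marked / total
    return (1.0 - g) * alpha + g * fraction


def cut_window(cwnd, alpha):
    """Shrink the congestion window in proportion to the congestion estimate."""
    return cwnd * (1.0 - alpha / 2.0)


alpha = update_alpha(0.0, marked=10, total=10)   # every packet marked once
print(alpha)                       # 0.0625
print(cut_window(100.0, alpha))    # 96.875
print(cut_window(100.0, 1.0))      # 50.0
```

<p>When every packet is marked for long enough, the estimate approaches 1 and the cut approaches the usual halving; with only a few marks the window shrinks slightly, which is how throughput is kept high while the queue stays short.</p>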
<h2>Inter-Domain Routing and Addressing</h2>
<h3>Internet Inter-Domain Traffic</h3>
<ul>
<li>It&#8217;s hard to measure the Internet, and the lack of ground truth data makes it harder to know if you&#8217;re doing it accurately.</li>
<li>Wanted to collect a large data set that shows how the internet is evolving.</li>
<li>Conventional wisdom: global-scale end-to-end network, broad distribution of traffic sources and sinks, and the existence of a &#8220;core&#8221;. But as time passes, these are becoming less true.</li>
<li>Methodology: focussed on inter-domain traffic, not application layer things such as web hits/tweets/VPN/etc. Exported coarse-grain traffic statistics about ASNs, ASPaths, protocols, ports, etc. via anonymous XML forwarded to central servers. Covers 110 ISPs/content providers, 3k edge routers, 100k interfaces… i.e. about 25% of all inter-domain traffic. Then waited 2 years and repeated to get longitudinal data. Used commercial probes within a given ISP, with limited visibility into payload-based classification. Calculated percentages per category then weighted averages using number of routers in each deployment. Also incorporated informal and formal discussions with providers, and information about known traffic volumes.</li>
<li>Validated predictions against ground truth from 12 known ISP traffic demands (known peak Tbps).</li>
<li>In two years, Google and Comcast have grown to be the 3rd and 6th biggest carriers in terms of traffic demands.</li>
<li>Cumulative distribution of carriers: in 2007, thousands of ASNs contributed the first 50% of content; in 2009, it was 150 ASNs. In 2010 it&#8217;s even more dramatic (but not shown).</li>
<li>By buying YouTube, Google went from 1% of internet traffic to 6% (not including the global cache — i.e. an underestimate).</li>
<li>In 2007, Comcast had &#8220;eyeball&#8221; peering ratios, but by 2009, they are a net producer of content. Video distribution, backbone consolidation, etc. contributed to this.</li>
<li>Price per megabit of wholesale for internet transit has collapsed whereas the revenue for internet advertisement has greatly increased.</li>
<li>IP and CDNs have been commoditized, hence enterprise traffic has moved to the cloud. Companies have consolidated. Bundling (triple/quad play etc.) has become popular. Several new economic models, such as paid content, paid peering, etc. (but often under NDA). Also disintermediation as customer and provider connect directly.</li>
<li>Traditional internet model is hierarchical: not really true but pedagogically used. The new &#8220;Tier-1&#8221; has &#8220;Hyper Giants&#8221;, representing the large content providers.</li>
<li>In terms of protocols: HTTP has grown 24.76%, video has grown 67.09% and P2P has shrunk &gt; 70%.</li>
<li>Port usage is also consolidating: similar distribution to the content providers: fewer ports account for more of the internet. Looked at Xbox traffic on TCP port 3074, and saw a huge drop. Actual cause was a move to port 80.</li>
<li>File sharing has migrated to the web. In 2010, P2P is the fastest-declining application group. Direct-download sites like Megavideo, etc. are much more popular now, and you can even get HD streaming.</li>
<li>Internet traffic changes have been driven by changing economic models. Shift from connectivity to content, and the move to port 80 are two major trends. This has implications for engineering and research, as security/fault tolerance/routing/traffic engineering/network design have become more difficult.</li>
<li>Q. Economic implications: what are other types of business models and arrangements that might come out of &#8220;hyper giants&#8221;? Still in the early stages, but it&#8217;s not clear - from a power perspective - who has the upper hand.</li>
<li>Q. If the number of top players decreases by an order of magnitude, do you see the role of CDNs diminish and do you have any data on that? Talk about CDNs in the paper (about 10% or more of internet traffic). In general, they are growing, but enterprise content is driving a lot of the CDNs.</li>
</ul>
<h3>How Secure are Secure Interdomain Routing Protocols?</h3>
<ul>
<li>After a decade of research on secure BGP, no idea what the best solution is. So this paper evaluates how well each protocol prevents &#8220;traffic attraction attacks&#8221;, based on simulation on empirical data. Used an AS-level map of the internet and business relationships, and a standard model of routing properties.</li>
<li>How do we strictly enforce correct path announcements? Solutions range from no crypto (BGP) to lots of crypto (data-plane security). It turns out that this isn&#8217;t the only problem: just as important to control who you announce to, as well as what is announced. Defensive filtering is also important.</li>
<li>Different relationships: customer-provider (customer pays), and peering (no payment). No value for how much a path costs in the model, but the paper models routing by preferring cheaper paths. After that, prefer shorter paths. Only transit traffic if it makes you money (i.e. on behalf of your customers).</li>
<li>A traffic attraction attack is an attempt to get as many people as possible to route traffic through the attacker&#8217;s AS (for tampering, eavesdropping, etc.). Simulations show that a traffic attraction attack can pull in traffic for 62% of ASs.</li>
<li>Origin authentication: secure database of IP to AS mappings to prevent people advertising origins they don&#8217;t own. Simulation shows 58% of ASs get attracted to the attacker.</li>
<li>Secure BGP: enforces that ASs cannot announce paths that have not been announced to them (using digital signatures). So can only append a prefix. Simulation shows that the attacker still attracts 18% of ASs.</li>
<li>Defensive filtering (of stubs (i.e. ASs with no customers, which should not route traffic)): provider drops announcements for prefixes not owned by its stubs. Defensive filtering thwarts all attacks by stubs (i.e. all of the previous cases), and 85% of ASs are stubs.</li>
<li>Sometimes announcing longer paths is better than announcing short paths; sometimes announcing to fewer neighbors (than all of them) is better. It&#8217;s NP-hard to find the optimal attack strategy.</li>
<li>Ran experiments by randomly choosing an (attacker, victim) pair and simulating a &#8220;smart attack&#8221; for each protocol.</li>
<li>Evaluated probability that an attack attracts 10% of ASs (over a random choice of attacker and victim). Defensive filtering alone is as effective as Secure BGP alone. However, the attacks could be smarter, so these numbers are underestimates.</li>
<li>Why aren&#8217;t we using defensive filtering today? It&#8217;s hard to keep track of the prefixes that your customers own. A push to implement origin authentication is ongoing, and this could be used to derive the filtering mapping. The threat model is somewhat strange though.</li>
<li>Q. The CAIDA data are less than ideal, so how robust are the statements in the paper? Ran the experiments twice, on CAIDA data from 2009, and the UCLA Cyclops data. Trends for the two sets are similar.</li>
<li>Q. Did you look at comparing the effectiveness of models versus their deployability? Completely ignored deployability and implementability in this paper, but it will be the subject of future work.</li>
<li>Q. Have you considered varying the 10% threshold? Yes. [Backup slide.]</li>
<li>Q. Seems like the easiest way to attract traffic would be to deaggregate the prefix, so do you take this into account? We didn&#8217;t evaluate it because we idealized things to make it tractable.</li>
</ul>
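<p>The routing model in the notes above (prefer cheaper paths, then shorter ones, and only transit traffic on behalf of customers) can be sketched roughly as follows. This is an illustration only, not the simulator from the paper: the <code>Route</code> class, the <code>RANK</code> values and both function names are assumptions made for this example.</p>

```python
from dataclasses import dataclass

# Assumed ranking by the relationship to the neighbour that announced the
# route. Lower is better: customer routes earn money, peer routes are free,
# provider routes cost money. The numeric values are illustrative.
RANK = {"customer": 0, "peer": 1, "provider": 2}


@dataclass(frozen=True)
class Route:
    learned_from: str  # "customer", "peer" or "provider"
    as_path: tuple     # AS numbers on the path, nearest first


def best_route(routes):
    """Prefer cheaper routes first, then shorter AS paths."""
    return min(routes, key=lambda r: (RANK[r.learned_from], len(r.as_path)))


def should_export(route, neighbor_relationship):
    """Only carry transit traffic when a customer is paying for it."""
    return route.learned_from == "customer" or neighbor_relationship == "customer"


routes = [
    Route("provider", (7, 99)),
    Route("peer", (3, 4, 99)),
    Route("customer", (5, 6, 8, 99)),
]
chosen = best_route(routes)
print(chosen)
```

<p>Note that the customer route wins even though it is the longest path, which is one reason an attacker who looks like a customer can attract so much traffic in this model.</p>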
<h3>Understanding Block-level Address Usage in the Visible Internet</h3>
<ul>
<li>What can simple observations about the internet say? Contributed methodology, applications and validation.</li>
<li>Looked at spatial correlation, address utilization, dynamic addressing and low-bitrate identification. Based on data set gathered for IMC. New, deeper understanding and a new interpretation.</li>
<li>Is there spatial correlation in the IPv4 address space? Are adjacent addresses likely to be used in the same way? This could help to efficiently select representative addresses for more-detailed study.</li>
<li>Collected data by pinging each address in randomly selected /24 blocks every 11 minutes for a week, and collected the probe responses, probing 1% of the whole IPv4 space. Gives 5 billion ping responses.</li>
<li>Three metrics: availability (normalized sum of up durations), volatility (normalized number of up durations) and median-up (median duration of a period of uptime).</li>
<li>Graphed by mapping the IP addresses onto a Hilbert curve.</li>
<li>Algorithm: examine each block size, if it is homogeneous stop, otherwise split the block and recurse.</li>
<li>Validating spatial correlation is hard, because it is hard to find a ground truth. Therefore used USC&#8217;s network for comparison, and the general internet (hostname-inferred truth). Also evaluated for different samples and dates.</li>
<li>Selected USC because the operator provided the ground truth, and they had knowledge of both allocated and usage blocks.</li>
<li>43% false negative rate, and 57% of blocks are correctly identified.</li>
<li>The general internet gives unbiased truth. The results are more correct than for USC: 68% correct to 32% false negative.</li>
<li>Low-bitrate identification: formalized RTT = transfer + queuing + propagation delays. Tried using median RTT to identify low-bitrate versus broadband. However, for international links, propagation time dominates. Variance of RTT separates low-bitrate from broadband.</li>
<li>Used hostnames as a form of ground truth (e.g. if the hostname contains &#8220;cable&#8221; or &#8220;3G&#8221;).</li>
<li>Q. Did you do any of this on IPv6? No. Are you planning to? Probably in the future when IPv6 is more popular.</li>
<li>Q. Is it reliable to use ping responses to detect hosts, when some hosts refuse to respond to pings? We didn&#8217;t consider false information in this work, but it would be valuable to consider this in future.</li>
</ul>
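<p>As a rough illustration of the three per-address metrics described above, the sketch below computes them from a list of probe results. It is not the authors&#8217; code: the function names are made up for this example, and the normaliser used for volatility (the largest number of up periods that the probes could contain) is an assumption, since the notes only say &#8220;normalized&#8221;. The input is assumed to be non-empty.</p>

```python
from itertools import groupby
from statistics import median

PROBE_INTERVAL_MIN = 11  # probing period mentioned in the talk


def up_runs(responses):
    """Lengths (in probes) of each maximal run of consecutive positive responses."""
    return [len(list(group)) for is_up, group in groupby(responses) if is_up]


def address_metrics(responses):
    """Availability, volatility and median-up for one address (non-empty input)."""
    runs = up_runs(responses)
    n = len(responses)
    max_runs = (n + 1) // 2  # assumed normaliser: most up periods n probes can hold
    return {
        "availability": sum(runs) / n,
        "volatility": len(runs) / max_runs,
        "median_up_min": median(runs) * PROBE_INTERVAL_MIN if runs else 0,
    }


responses = [True, True, False, True, True, True, False, False]
metrics = address_metrics(responses)
print(metrics)
```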
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2010/08/31/sigcomm-2010-day-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SOSP 2009: Day 3</title>
		<link>http://www.mrry.co.uk/blog/2009/10/19/sosp-2009-day-3/</link>
		<comments>http://www.mrry.co.uk/blog/2009/10/19/sosp-2009-day-3/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 00:47:42 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/?p=62</guid>
		<description><![CDATA[Clusters
Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

Goal to make large-scale programming simple for all developers.
Write a program in Visual Studio; Dryad(LINQ) takes care of shipping it to the cluster, fault tolerance, etc.
Wrestling with the implementation of GroupBy-Aggregate. GroupBy takes a sequence of objects with some kind of key, and groups them together by key. [...]]]></description>
			<content:encoded><![CDATA[<h2>Clusters</h2>
<h3>Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations</h3>
<ul>
<li>Goal to make large-scale programming simple for all developers.</li>
<li>Write a program in Visual Studio; Dryad(LINQ) takes care of shipping it to the cluster, fault tolerance, etc.</li>
<li>Wrestling with the implementation of GroupBy-Aggregate. GroupBy takes a sequence of objects with some kind of key, and groups them together by key. Similar to MapReduce.</li>
<li>Naïve execution plan splits map and reduce into two phases, with an all-to-all data exchange between them. However, applying the reduce after this exchange results in a large amount of network I/O.</li>
<li>A better idea is to do early partial aggregation: use an aggregation tree to achieve this. Reduces the disk and network I/O by up to one or two orders of magnitude.</li>
<li>Want to automate this optimization. Programmer writes the obvious code and the system takes care of the rest.</li>
<li>Notion of decomposable functions is key to this. Need an initial reducer that is commutative, and a combiner that is commutative and associative.</li>
<li>How do we decompose a function? Two ways: iterator and accumulator interface. Choice can have a significant impact on performance.</li>
<li>How do we deal with user-defined functions? Try automatic inference, but fall back to a good annotation mechanism. Implement the simple function and annotate it with the names of the initial-reduce and combiner implementation functions.</li>
<li>Hadoop interface for this adds quite a lot of complexity. Java&#8217;s static typing is not preserved.</li>
<li>Iterator interface has to build an entire group and iterate through it. Accumulator can discard the inputs if they are not needed. Oracle uses this approach, implemented with stored procedures. Hard to link in user-defined procedures.</li>
<li>Automatic decomposition looks at the expression and checks whether all leaf function calls are decomposable.</li>
<li>Want our approach to have good data reduction, pipelining, low memory consumption and parallelisability (multicore). Define six strategies, accumulator- and iterator-based.</li>
<li>Iterator PartialSort approach. Idea is to keep only a fixed number of chunks in memory; processed in parallel. The bound on memory makes pipelining possible. Strategy close to MapReduce.</li>
<li>Accumulator FullHash approach builds an in-memory parallel hash table with one accumulator object per key. Objects are accumulated immediately. This gives optimal data reduction and memory consumption proportional to the number of keys, not records. This is the DB strategy (DB2 and Oracle).</li>
<li>Evaluated with three applications: WordStates, TopDocs and PageRank on a 240-machine cluster. Accumulator-based implemen—</li>
</ul>
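<p>To make the idea of a decomposable function concrete, here is a small sketch of early partial aggregation for a per-key average. It is plain Python rather than DryadLINQ, and every name in it (<code>initial_reduce</code>, <code>combine</code>, <code>finalize</code>, <code>partial_aggregate</code>, <code>merge</code>) is hypothetical. Each machine reduces its own records to a (sum, count) pair, and only those pairs would need to cross the network.</p>

```python
from collections import defaultdict


def initial_reduce(values):
    """Run on each machine over its local records for one key."""
    return (sum(values), len(values))


def combine(a, b):
    """Commutative and associative merge of two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def finalize(partial):
    """Turn the merged (sum, count) pair into the average."""
    total, count = partial
    return total / count


def partial_aggregate(records):
    """Group local records by key and reduce each group before any exchange."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: initial_reduce(values) for key, values in groups.items()}


def merge(partials):
    """Combine the partial results from several machines, key by key."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


machine_a = [("x", 1.0), ("y", 10.0), ("x", 3.0)]
machine_b = [("x", 5.0), ("y", 20.0)]
partials = [partial_aggregate(machine_a), partial_aggregate(machine_b)]
averages = {key: finalize(value) for key, value in merge(partials).items()}
print(averages)
```

<p>Because <code>combine</code> is commutative and associative, the partial results can be merged in any order and at any level of an aggregation tree. The sketch builds whole groups in memory, which is closer to the iterator style than to the accumulator style discussed in the talk.</p>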
<p style="text-align: center;"><img class="alignnone size-full wp-image-63" title="water_laptop" src="http://www.mrry.co.uk/blog/wp-content/water_laptop.jpg" alt="water_laptop" width="360" height="270" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2009/10/19/sosp-2009-day-3/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SOSP 2009: Day 2</title>
		<link>http://www.mrry.co.uk/blog/2009/10/13/sosp-2009-day-2/</link>
		<comments>http://www.mrry.co.uk/blog/2009/10/13/sosp-2009-day-2/#comments</comments>
		<pubDate>Tue, 13 Oct 2009 16:32:26 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/?p=56</guid>
		<description><![CDATA[I/O
Better I/O Through Byte-Addressable, Persistent Memory

DRAM is fast, byte-addressable and volatile, but disk/Flash are non-volatile, but slow and not byte-addressable. BPRAM is all three!
Phase change memory is a promising source for this. Bits encoded as resistivity. Access latency in the nanoseconds, and far better endurance than flash. Designed BPFS for BPRAM.
Goal: FS ops commit atomically [...]]]></description>
			<content:encoded><![CDATA[<h2>I/O</h2>
<h3>Better I/O Through Byte-Addressable, Persistent Memory</h3>
<ul>
<li>DRAM is fast and byte-addressable but volatile; disk/Flash are non-volatile but slow and not byte-addressable. BPRAM is all three: fast, byte-addressable and non-volatile!</li>
<li>Phase change memory is a promising source for this. Bits encoded as resistivity. Access latency in the nanoseconds, and far better endurance than flash. Designed BPFS for BPRAM.</li>
<li>Goal: FS ops commit atomically and in program order. Data is durable as soon as the cache flushes. Use short-circuit shadow paging to get this (new consistency model).</li>
<li>Eliminate DRAM buffer cache; use L1/L2 instead. Put BPRAM on the memory bus. Provide atomicity and ordering in hardware.</li>
<li>Both BPRAM and DRAM are addressable by the CPU: physical address space is partitioned into volatile/non-volatile.</li>
<li>BPFS gets better performance than NTFS on the same media.</li>
<li>What happens on crash during update? Short-circuit shadow paging comes into play (contrast with journalling or shadow paging). Overhead of journalling is that all data (or metadata) must be written twice. Shadow paging uses copy-on-write up to the root of the FS: drawback is that writes propagate all the way back to the root (multiple updates), and small writes have a large copying overhead.</li>
<li>Short-circuit shadow paging makes in-place updates where possible. Uses byte-addressability and atomic, 64-bit writes. Both in-place updates and appends are made simple by this technique. Cross-directory rename does bubble up to the common ancestor.</li>
<li>Problem: if data is cached in L1/L2, the ordering of cache eviction can lead to inconsistent states. Also, writes from the cache controller might not be atomic.</li>
<li>So add two new hardware components to the CPU and cache controller. Epoch barriers are used to declare ordering constraints and they are much faster than a write-through cache. Also add capacitors to DIMMs which allow writes to propagate even after the loss of power.</li>
<li>Do CoW then Barrier then Commit. Paper also shows how to make it work on multiprocessors.</li>
<li>Built and evaluated on Windows as an in-kernel file system.</li>
<li>Microbenchmarks (append n bytes and random n-byte write) compare NTFS/Disk, NTFS/RAM and BPFS/RAM. (Using DRAM in these experiments.) BPFS is significantly faster than NTFS on disk, and NTFS isn&#8217;t syncing so it isn&#8217;t durable!</li>
<li>Postmark benchmark compares NTFS/Disk, NTFS/RAM, BPFS/RAM and (projected) BPFS/PCM. BPFS/PCM is much faster than both NTFS/Disk and NTFS/RAM. Analytical projection based on sustained throughput of PCM.</li>
<li>Q: are the storage requirements of a database and of a file system converging when you have this hardware available? The changes to the hardware will be applicable to other sorts of storage systems like a database, not just filesystems.</li>
<li>Q: have you thought about how to expose more capabilities of this medium to the applications (not just sequential reads and writes)? Applications are currently written in terms of what is efficient.</li>
<li>Q: how do you atomically deal with a free-list? We don&#8217;t have a free-list. Don&#8217;t need to keep track of so many data structures because the medium is so fast, which lowers the consistency overhead.</li>
<li>Q: where do you go next for multiprocessors and clusters? One of the goals was to have multiple concurrent threads operating on the FS at the same time.</li>
<li>Q: is there a risk that data in the capacitor gets garbled after the machine gets switched off? [Taken offline.]</li>
<li>Q: could you go even faster without having the consistency guarantees, for applications that don&#8217;t need it? There&#8217;s always a trade-off here.</li>
<li>Q: how do you do mmap, and is meddling with L1/L2 caches going to be expensive? Haven&#8217;t implemented mmap yet, but we would have a much better guarantee of durability. The changes to the cache have been looked at in the paper in terms of interference, and performed well.</li>
<li>Q: why didn&#8217;t you benchmark against a file system using a B-tree or a red-black tree that can take advantage of random writes [why did you compare against NTFS]? NTFS is widely used, and it&#8217;d be interesting to compare against others.</li>
</ul>
<h3>Modular Data Storage with Anvil</h3>
<ul>
<li>Data storage drives modern applications (everyone has a database) and they are frequently a bottleneck. Hand-built stores outperform general-purpose ones by up to 100x. Observe that changing the layout can substantially improve performance. Custom storage is hard to write, especially in order to provide consistency guarantees. Can be prohibitively expensive to experiment with new layouts.</li>
<li>Need a simple and efficient modular framework to support a wide variety of layouts.</li>
<li>Fine-grained modules: dTables. These are composable to build complex data stores. All writing is isolated to dedicated writable dTables, which incidentally has good disk access properties.</li>
<li>dTable = key/value store. Maps integers/floats/strings/blobs to blobs. Provides an iterator to support in-order traversal. dTables used by applications and frontends, and also other dTables. Can transform data, add indices or otherwise construct complex functionality from simple pieces.</li>
<li>Example of a mapping from customer IDs (mostly contiguous) to states. Start with an array dTable for the common case. Layer a dictionary on top of that (maps state names to array indices). Have an exception dTable for the case where a customer isn&#8217;t in one of the 50 states, and a linear-search dTable for their residences. But to make this fast, layer a B-Tree index dTable on top of the linear store.</li>
<li>So far just read-only. Updates are hard to do transactionally. Need to implement a write-optimized dTable. Fundamental writable dTable is the journal dTable. New data is appended to a shared journal; data are cached in an in-RAM AVL tree. The journal is digested when it gets large. Transaction system is described in the paper. Layer this over read-only dTables.</li>
<li>Managed dTable goes at the top. Also have a Bloom filter dTable to deal with multiple overlaid read-only dTables.</li>
<li>Many additional dTables listed in the paper.</li>
<li>Evaluate the effect of simple configuration changes on performance (modularity). Key lookup workload, comparing contiguous versus sparse keys. Contiguous good with arrays; sparse good with B-trees. Also show the benefit of layering an index on top of a linear store. Also show the low overhead of the Exception dTable.</li>
<li>Evaluated by running TPC-C. Replaces a SQLite backend with Anvil. Shows that Anvil outperforms both the original backend and MySQL. Split read and write stores perform well.</li>
<li>Evaluated the cost of digesting and combining. These can be done in the background, taking advantage of additional cores and spare I/O bandwidth. Measured the overhead when doing a bulk load (1GB) into the dTable, with digests every few seconds.</li>
<li>Q: why didn&#8217;t you compare either performance or features against BDB, which is very similar? Didn&#8217;t find it as easy to construct read-only data stores in BDB: creating customisable data stores has a lot of transactional overhead.</li>
<li>Q: did you evaluate iteration? In paper. How does the performance depend on the order of updates? Not sure what you mean. Did look at overlay iteration in the paper, which ought to be the most expensive (due to key lookup cost), and overhead was only 10%.</li>
<li>Q: how did you make it so that the creator of a new dTable doesn&#8217;t have to consider ACID semantics? Most dTables are read-only, so you don&#8217;t need to worry about this (like shadow paging). The managed dTable has a small hook that enforces transactional semantics. And read/write dTables? Don&#8217;t envision that people will need to create these. Could implement your own, but this would miss the point.</li>
<li>Q: how should developers write recovery tools for systems like Anvil? Anvil includes such a tool that handles recovery for you. Read-only semantics makes this much simpler.</li>
</ul>
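<p>The layering idea behind dTables can be sketched in a few lines. This is a toy model in Python, not Anvil&#8217;s actual API: the class names echo the talk, but the method names and constructor arguments are assumptions made for this example, and digesting, transactions and iteration are all left out.</p>

```python
class ArrayDTable:
    """Read-only store for contiguous integer keys starting at `base`."""

    def __init__(self, base, values):
        self._base = base
        self._values = list(values)

    def lookup(self, key):
        index = key - self._base
        if index in range(len(self._values)):
            return self._values[index]
        return None


class JournalDTable:
    """Writable store: the only layer that accepts new data."""

    def __init__(self):
        self._entries = {}

    def insert(self, key, value):
        self._entries[key] = value

    def lookup(self, key):
        return self._entries.get(key)


class OverlayDTable:
    """Checks each layer in order, newest first, and returns the first hit."""

    def __init__(self, *layers):
        self._layers = layers

    def lookup(self, key):
        for layer in self._layers:
            value = layer.lookup(key)
            if value is not None:
                return value
        return None


states = ArrayDTable(1000, ["CA", "NY", "TX"])
journal = JournalDTable()
store = OverlayDTable(journal, states)
journal.insert(1001, "WA")       # update shadows the read-only value
journal.insert(5000, "Ontario")  # key outside the array range
print(store.lookup(1000), store.lookup(1001), store.lookup(5000), store.lookup(42))
```

<p>All writes land in the journal layer while the array layer stays immutable, which mirrors the split between writable and read-only dTables described above.</p>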
<h3>Operating Systems Transactions</h3>
<ul>
<li>[Unfortunately, I missed this talk due to having an obligation to man the registration desk. I'll try and track down the video and update this later.]</li>
</ul>
<h2>Parallel Debugging</h2>
<h3>Do You Have to Reproduce the Bug at the First Replay Attempt? &#8212; PRES: Probabilistic Replay with Execution Sketching on Multiprocessors</h3>
<ul>
<li>Concurrent programs are hard to write. Multi-core makes concurrent programming more important, and bugs more common. However they are non-deterministic (requiring e.g. a special or improbable thread interleaving). This makes it hard to reproduce them.</li>
<li>Deterministic replay for uniprocessors is relatively easy: only need to record inputs, thread scheduling and return values of system calls. On multiprocessors, this is much more challenging: e.g. simultaneous memory accesses are another source of non-determinism.</li>
<li>Previous proposals introduce new hardware, which doesn&#8217;t exist in reality. Or there are software-only approaches, but they have up to 100x slow-down.</li>
<li>Ideally want to reproduce a bug with a single replay and no runtime overhead. But what if we relax this slightly?</li>
<li>Idea 1: record only partial information during a production run. Idea 2: push the complexity into diagnosis time. Idea 3: use feedback from unsuccessful (non-reproducing) replay runs.</li>
<li>Just record a sketch during the production run. When a replay goes off the sketch, terminate it immediately and feed back information about why it deviated for refining the next replay. Can eventually reproduce the bug with 100% probability.</li>
<li>Several different methods for sketch recording. Spectrum of approaches from UP deterministic replay to full MP DR. Can record e.g. synchronization points, or basic blocks, or more: build up this information during the replay runs.</li>
<li>At replay time, the partial information replayer consults the sketch to see that recorded global ordering is obeyed. How do we know whether a replay is successful? Use a failure detector based on crash, deadlock or incorrect results. This can also detect unsuccessful replay runs.</li>
<li>When a replay attempt fails, start it again. But could do something different the next time: a random approach would just leave it to fate, but PRES is more systematic. Failed reproduction is due to un-recorded data races. The feedback generator captures these races and tweaks them in future runs. Start with many candidate races and filter them down.</li>
<li>Implemented PRES using Pin. Evaluated many different applications (desktop, server and scientific). Overhead is around 18%, which is barely more than baseline Pin overhead. Macrobenchmarks show that PRES gets much higher throughput for server applications (MySQL).</li>
<li>Effectiveness: UP algorithm doesn&#8217;t detect any bugs within 1000 replays, whereas PRES gets 12/13 in 10 attempts. Feedback generation is crucial to effectiveness. Race filtering also effective.</li>
<li>Q: could you also use execution traces to help track down which parts of the execution trace cause the bug to not happen, and guide the programmer? Good idea.</li>
<li>Q: could you apply PRES to virtual machine replay? Also a good point. The work could be integrated with virtual machines. Could you roll back an execution, is it precise enough? Depends on the fine-grainedness of the recording scheme used. What is the inherent overhead in collecting a trace with sufficient fidelity to do backtracking? If there are a lot of lock operations, the low-overhead approach (SYNC) could work. But if there is no synchronization, we can&#8217;t use this information, and a more heavyweight scheme would be needed.</li>
<li>Q: why do the results differ so much from the next paper? The main idea is similar, but their work is more focussed on static analysis to reduce runtime overhead.</li>
<li>Q: would you advocate this as a solution for long-running (one year or more) services, as it is often only after this time that bugs emerge? We can take a checkpoint of the process state, which solves the problem of data accumulation.</li>
</ul>
<h3>ODR: Output-Deterministic Replay for Multicore Debugging</h3>
<ul>
<li>Debugging non-deterministic software failures is really hard. The problem is how to reproduce these failures for debugging. Model checking/testing/verification could work, but it&#8217;s not perfect, and it doesn&#8217;t capture everything. So we need deterministic replay.</li>
<li>Need multiprocess operation, efficient recording, no special hardware and the ability to run arbitrary programs without annotation (especially programs with data races). All related work fails to meet one of these requirements.</li>
<li>ODR is a user-level replay system, which works in the MP case, has only 1.6x overhead, needs no new hardware and works on arbitrary x86 Linux binaries.</li>
<li>Often sufficient to produce any run with the same outputs&#8230; needn&#8217;t have the exact same execution. So the idea is to relax the determinism requirements.</li>
<li>The classic guarantee is value determinism: replay run reads and writes must have the same values as the original. Relax this to &#8220;output determinism&#8221;: the replay run produces the same user-visible output as the original. This is not perfect, but still useful for debugging: reproduces most visible signs of failure, and enables reasoning about failure&#8217;s root cause.</li>
<li>How to achieve this? Deterministic-run inference. Basic idea is to translate a program into a logical formula (verification condition). Function of schedule trace, input trace and read trace, returning an output trace. Use a formula solver to yield unknown schedule trace. Scale this by directing the inference using more original-run information. Also by relaxing memory consistency of the inferred run: where values read have nothing to do with schedule order, can use an arbitrary schedule trace.</li>
<li>Three-dimensional inference design space: memory consistency (strict, lock order or null), query complexity (output, I/O&amp;lock-order, I/O&amp;lock-order&amp;path or determinant), and inference-time (polynomial or exponential). Search- and query-intensive DRI fit into this space.</li>
<li>Search-intensive is really slow (400&#8211;60000x slowdown). Formula generation, not solving, is the bottleneck. Use multi-path symbolic execution with backtracking to generate formulas. Each backtrack involves a 200x slowdown. Backtracks are caused by race-tainted branches: wrong choice leads to a divergent, unsatisfying path. Work around by backtracking to the most recent race-tainted branch.</li>
<li>If we know the path (QI-DRI), inference time improves by 100x but we need to record much more data, so there is a 6x slowdown.</li>
<li>Future work is to reduce the path-search space, reduce the cost of each backtrack (cut down on race-tainted branch analysis), and parallelize formula generation by forking threads at each divergence.</li>
<li>Q: if I put in arbitrary values in the memory consistency model, how do I ensure that invariants are maintained? We don&#8217;t actually do null consistency, but if we did and invariants were violated the program might crash and output determinism catches this (because the output would be different).</li>
<li>Q: since your techniques are similar to the last paper, why are the results so different? In our approach, we do race-tainted branch analysis for the entire execution, and that is costly. We also do taint-flow propagation. There is more analysis in each backtracking iteration than in theirs. We could reduce this cost by, for example, reusing results from previous iterations.</li>
<li>Q: have you considered changing your static analysis to another algorithm that could significantly improve your formula generation time? We are considering static approaches to formula generation.</li>
</ul>
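<p>The notion of output determinism can be illustrated with a toy search. ODR itself generates a logical formula and hands it to a solver; the sketch below swaps that for brute-force enumeration of interleavings, which only works for tiny programs. The two &#8220;threads&#8221;, the shared integer and all function names are invented for this example.</p>

```python
from itertools import permutations

# Each hypothetical thread is a list of atomic operations on a shared integer.
THREADS = {
    "A": [lambda x: x + 1],
    "B": [lambda x: x * 2, lambda x: x - 3],
}


def run(schedule, initial=1):
    """Execute one interleaving; the output is the final shared value."""
    next_op = {name: 0 for name in THREADS}
    x = initial
    for name in schedule:
        op = THREADS[name][next_op[name]]
        x = op(x)
        next_op[name] += 1
    return x


def candidate_schedules():
    """All interleavings that preserve each thread's program order."""
    slots = [name for name, ops in THREADS.items() for _ in ops]
    return sorted(set(permutations(slots)))


def infer_schedule(recorded_output):
    """Return any schedule whose output matches the recorded one."""
    for schedule in candidate_schedules():
        if run(schedule) == recorded_output:
            return schedule
    return None


original = ("B", "B", "A")       # schedule of the original run (not recorded)
recorded = run(original)         # only the output is recorded
replay = infer_schedule(recorded)
print(recorded, replay)
```

<p>Here the inferred schedule differs from the original one but still produces the same output, which is exactly the relaxation that output determinism permits.</p>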
<h2>Works-in-Progress</h2>
<ul>
<li><b>RAMCloud: Scalable, High-Performance Storage Entirely in DRAM.</b> New research project. Zero results. Motivated by wanting to build large scale systems with low latency. Data center style of web application separates the code from the data, in order to scale, but 4&#8211;5 orders of magnitude increase in latency. Want this kind of scaling with latency close to memory speeds (sub-microsecond). Basic architecture puts all data in DRAM. Scale using commodity servers. Reckon we can get 5&#8211;10us RPC latency end-to-end. Also have a story on durability and availability. Also want to support multiple applications.<br />
<br/>Q: effect of other memories? Whichever wins should work with RAMCloud<br />
<br/>Q: GMS system from UW 10 years ago? Not familiar with that [taken offline].
</li>
<li><b>Transactional Caching of Application Data using Recent Snapshots.</b> DB-driven website performance issues: use memcached. Add an in-memory DHT that is very lightweight, and stores application objects (not a DB). DBs provide transactional consistency, but these caches don&#8217;t do this. Goal is transactional consistency for accesses to the cache. Idea is to embrace staleness: allow read-only transactions to run on stale data. Avoids blocking and improves utilization. This is quite safe, since stale data is already everywhere. Application can control staleness. Add a TxCache library between memcached and the application. DHT values are timestamped, and have a validity interval.<br />
<br/>Q: paper at HotStorage from HP Labs in similar area?<br />
<br/>Q: how does the DB know the validity interval? Modified DB to track this.
</li>
<li><b>Chameleon: A self-managing, low cost file system.</b> Targeting home user or small business, who doesn&#8217;t want to lose data. Doesn&#8217;t know anything about RAID. Cost-sensitive. Deployment scenario has 4 PCs connected by fast LAN, with a broadband connection to cloud storage. There are many ways to replicate, place and encode data. Ideally store data on at least one offline device to avoid vulnerability to viruses. A small, trusted &#8220;anti-availability kernel&#8221; enforces this requirement. Use linear programming to select and adapt storage configuration: the design space is now even more complicated. Tend towards the optimal solution.
</li>
<li><b>Sloth: Let the Hardware Do the Work!</b> Looked at embedded OSes used in the automotive industry. OSEK OS is the prevalent real-time embedded OS: event-triggered, priority-driven real-time system. Don&#8217;t want to implement a scheduler. SLOTH lets the interrupt subsystem do the scheduling and dispatching work. All threads are implemented as interrupt handlers and have interrupt priorities. Each thread needs an IRQ source. Priorities enable pre-emption. Can implement a bunch of synchronization this way also. System is simple, small (concise implementation and memory footprint) and fast (2&#8211;20x).<br />
<br/>Q: this looks like very simple scheduling&#8230; how would you deal with something more complicated like earliest deadline? Drawback is no blocking system calls, so can&#8217;t do everything.
</li>
<li><b>The case for cooperative kernel threads.</b> Kernels are multithreaded, and drivers have concurrency bugs, which is bad when they run in the kernel. Event-based device drivers need to use continuations to preserve driver context across blocking operations. This becomes very complex, almost as bad as dealing with pre-emptive threads. Cooperative threads give the best of both worlds: atomic execution but allowing blocking. Research showed that drivers are mostly I/O bound, so cooperative threads are appropriate. Implementing this as a &#8220;cooperative domain&#8221; in the Linux kernel.<br />
<br/>Q: Linux does have cooperative thread scheduling available, so how does this interact with the work you are doing? Providing a framework for implementing drivers this way, much nicer.
</li>
<li><b>Abstractions for Scalable Operating Systems on Manycore Architectures.</b> Tessellation. Goal isn&#8217;t just to support heterogeneous hardware, but also provide predictable performance and guarantees for applications. Asymmetrically structured OS: some cores are dedicated as a management unit for keeping track of and scheduling applications. Eliminates the need for per-core runqueues, improves cache locality, decreases lock contention and limits kernel interference with applications. Applications interact with the kernel through remote, asynchronous system calls. Applications make explicit requests for cores, and OS guarantees that they will be gang-scheduled. OS just provides cores to the application, doesn&#8217;t need to know about threads. Applications have private memory ranges.<br />
<br/>Q: how do you balance the different demands for resources across applications? [Taken offline.]
</li>
<li><b>System Support for Custom Speculation Policies.</b> Applications run on some speculation infrastructure, which speeds things up. Want to separate policy from mechanism. Typically implemented transparently, which means that you have to be conservative, giving limited opportunities for speculation. Idea is to push the policy into the application. What could an application do that is different from the default? Might allow some output to be uncommitted. Or could commit equivalent-but-not-identical results. Process gets a &#8220;speculative fork&#8221; interface. Use cases: predicting user actions (predictive bash shell), authentication and user-level network services (when you have a predictable protocol).<br />
<br/>Q: how do you ensure errors in the speculative state don&#8217;t propagate to the main state? Need to be able to detect this, and could abort speculation in this case.
</li>
<li><b>IDEA: Integrated Distributed Energy Awareness for Wireless Sensor Networks.</b> A new &#8220;group diet&#8221; for wireless sensor networks. Problem of overloaded nodes.  Existing solutions are &#8220;single node diets&#8221;, which are unsatisfactory because nodes have to collaborate. Local efforts cannot go far enough unless there is some cooperation. Aim to improve application fidelity by matching system load to availability. Shift load from overutilized to underutilized nodes, and shift load away from threatened nodes. Like a distributed OS for sensor networks. IDEA evaluates multiple solutions and distributes information to the nodes. Ideal goal is awareness of application constraints.<br />
<br/>Q: Quanto does some cross-node analysis? We&#8217;re building on these great ideas.
</li>
<li><b>Flicker: Refresh Power Reduction in DRAMs by Critical Data Partitioning.</b> Hardware is over-designed for correctness and reliability. Make it less reliable and tolerate errors in software. Smartphones are a motivation: power consumption is way too high, due to the use of DRAM for responsiveness. Battery drains even when a phone is idle. Goal is to improve power consumption here. If you increase the refresh cycle length, the power consumption drops, but the error rate increases. Currently use 64ms refresh, so could we increase this? Secret sauce is a partitioning into critical and non-critical (e.g. soft-state) data. Map critical data to short-refresh cycle DRAM, and non-critical data to long-cycle DRAM. Requires some hardware changes. Hypothesise that smartphones have a lot of non-critical data. Initial results show 25% drop in power consumption with only 1% loss in reliability.<br />
<br/>Q: [?] Looking at replication and checksumming in other work.
</li>
<li><b>BFT for the skeptics.</b> Industry deals with crash failures a lot, so do we need full BFT? We already use checksums, timeouts, sanity checks, etc. to translate faults to crash faults. How often do we get faults that require BFT to handle them? Looked at ZooKeeper and real-world failures. Yahoo!&#8217;s crawler uses ZooKeeper extensively. Saw 9 issues, due to misconfiguration (5, BFT wouldn&#8217;t help), application bugs (2) and ZooKeeper bugs (2, correlated, BFT wouldn&#8217;t help). Could BFT hurt? It has more things to configure, so misconfigurations could become worse. Need to show that BFT really solves a problem before industry will pick it up.<br />
<br/>Q: You showed that correctly implemented BFT couldn&#8217;t help with some failures? Failures were correlated, affecting all replicas.
</li>
<li><b>Prophecy: Using History for High-Throughput Fault Tolerance.</b> BFT has poor throughput. Need 3f+1 replicas to handle f faulty replicas. Can we improve this for read-mostly, internet workloads? Add a &#8220;sketcher&#8221; to each replica, which sketches requests and responses. Only one machine sends a full response, the others send sketches. Trades off consistency for performance, which gives delay-once linearizability. Faulty replicas can return slightly stale data. Internet services have unmodified clients and short-lived sessions. Look at performance of PBFT. Can improve by consolidating sketch tables on a trusted sketcher. We already trust middleboxes, so why not trust this too? Performance is much better than PBFT. Work not specific to BFT, and could apply to Paxos, quorums, etc. while getting similar benefits.
</li>
<li><b>Securing Hardware Platforms Against Malicious Circuits Through Static Analysis.</b> Make assumptions when building systems. Best way to break a system is to break its assumptions. People assume hardware is correct. What if we can&#8217;t make this assumption? Hardware is complex, expensive, static and the base of the system. Do &#8220;dead circuit identification&#8221;: highlight all potentially malicious circuits automatically. Attacker is motivated to avoid impacting functionality during testing (or else they&#8217;d be caught). DCI gets an assertion that says which paths are effectively short circuits. Use these assertions in a new graph algorithm to identify the possibly-malicious, dead circuits. No false negatives, but 30% over-identification. Empirical evidence shows a tight correlation between code coverage and<br />
<br/>Q: is this primarily at design-time on HDL? Yes, this is one of our assumptions.<br />
<br/>Q: what about redundant circuits for fault tolerance? This is used at design time where you can make calls about this.
</li>
<li><b>Enhancing Datacenter Network Security and Scalability with Trusted End Host Monitors.</b> Cloud workload is dynamic and hostile. Key selling point is that multiple tenants can share common infrastructure. Need a new approach to security, because exploits are more likely, and the cloud resources can be used to perform exploits themselves. Cloud datacenters can help: they are centrally-controlled so monitoring becomes easier. The software and hardware are homogeneous. Plus a clean-slate approach is possible. Use the hypervisor as a trusted component. Hypervisor can send alarms to central controller when an attack is detected. Built a prototype from Hyper-V and a trusted Intel NIC.<br />
<br/>Q: if you trust the VM, why do you need to trust the NIC? This gives some useful properties, and the NIC could do some of this filtering for you.<br />
<br/>Q: HotOS paper on this exact topic?<br />
<br/>Q: [?] By &#8220;hypervisor&#8221; meant the entire virtualization stack, because we didn&#8217;t want to make the hypervisor itself any better.
</li>
<li><b>Architectural Attacks and their Mitigation by Binary Transformation.</b> What happens if someone tries to attack you from a VM on the same machine in the cloud? There is cross-talk through shared architectural channels. Example is contention for the CPU data cache. This leaks information about the memory access pattern, which could for example be used to leak AES keys. Have shown that EC2 has similar vulnerabilities: placement vulnerabilities, cloud cartography and cross-VM exfiltration are all possible. Approach is to use dynamic binary rewriting to transform x86 instructions so that the architectural effects are mitigated. Can degrade observation of timing, or inject noise and delays to hide leakage signal. Methodology is to make things secure by default, then come back to improve performance.<br />
<br/>Q: information leakage necessarily arises from statistical multiplexing, and we need statistical multiplexing to get good performance, so how can you address that? Assert that it should be possible.<br />
<br/>Q: how well would existing techniques protect against these attacks? Not aware of techniques that could do this.
</li>
<li><b>Execution Synthesis.</b> Say you have a bug in Linux on a remote machine. All you have to work with is a low-detail bug report. Reproducing it is time-consuming. Want a direction finding system from your system to a particular bug. Google Maps doesn&#8217;t do this at present&#8230;. So your bug report is a stack trace and some register contents. Do VM recording and replay. Don&#8217;t expect you to record behaviour that leads to the bug, since then you wouldn&#8217;t have the problem in the first place. Don&#8217;t care so much about performance, since you don&#8217;t run this in production. Find a state in the recording that is close to the bug report,  then explore paths iteratively to get closer to the bug. Then you get a sequence of inputs that lead to reproducing the bug. Need a distance function, way of choosing inputs, and good information about what to include in a bug report.<br />
<br/>Q: could you go backwards from a failure state and execute in reverse? We don&#8217;t have the entire failure state to begin with.
</li>
<li><b>Edge Mashups for Clinical Collaboration.</b> Health industry is going from paper-based to electronic records. Want to empower non-programmers to build applications for real-time collaboration, but need to respect things like HIPAA for logging and data retention. Example use-cases include expert-assisted surgery (call an expert for advice when complications arise, in real-time), and micro-clinics where nurses see the patients, but doctors write prescriptions remotely. Envision a graphical tool that pulls in photographic and chart data, which is synchronized between all participants. State serialized to XML which can be distributed to all the clients. Could be client/server or peer-to-peer. Need logging for accountability. &#8220;Break-glass&#8221; access control: anyone gets access but they are held accountable after-the-fact. Need low latency so doctors don&#8217;t feel that they are wasting time. Might migrate this to the cloud for scaling.
</li>
</ul>
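<p>The sketch-based reply idea from the Prophecy poster can be illustrated roughly as follows. This is a minimal sketch and not the protocol from the poster: the use of SHA-256, the function names <code>sketch</code> and <code>accept_response</code>, and the &#8220;at least f+1 matching sketches&#8221; acceptance rule are all assumptions made for illustration.</p>

```python
import hashlib
from collections import Counter


def sketch(response: bytes) -> str:
    """Compact digest of a response (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(response).hexdigest()


def accept_response(full_response: bytes, sketches: list[str], f: int) -> bool:
    """Hypothetical client-side check for one full response plus sketches.

    The full response is accepted when its digest matches at least f + 1 of
    the sketches sent by the other replicas, so that at least one matching
    sketch must have come from a non-faulty replica.
    """
    expected = sketch(full_response)
    matching = Counter(sketches)[expected]
    return matching >= f + 1
```

<p>With f = 1, a response backed by two matching sketches would be accepted, while one backed by a single matching sketch would not. Only one replica pays the cost of sending the full payload.</p>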
<h2>Kernels</h2>
<h3>seL4: Formal Verification of an OS Kernel</h3>
<ul>
<li>Formally proved the functional correctness of 8700 lines of C. No bugs.<br />
Want to build high-assurance systems: small kernels which reduce the trusted computing base. Want strong security properties. Kernel has to be correct: if it falls over, so does the whole system.<br />
seL4 has capabilities.</li>
<li>Proof is that specification and code are equivalent. Need a formal semantics for every system call. Use Isabelle as a theorem prover to bridge the gap between spec and code. But what about assumptions (in the code) and expectations (of the spec)?</li>
<li>Assume correct: compiler and linker, 600 lines of assembly code, hardware, cache, TLB management and 1200 lines of boot code.</li>
<li>Given these assumptions, we get some nice properties: no null dereferences, no buffer overflows, no code injection, no memory leaks, no div-by-zero, no undefined shift, no undefined execution, and no infinite loops or recursion. Does not imply security, lack of bugs from expectation to the physical world, or absence of covert channels.</li>
<li>Proof architecture admits proofs of higher-level properties (e.g. access control).</li>
<li>Design is written in Haskell, which can be used to generate Isabelle code automatically.</li>
<li>System model has three states: user, kernel and idle. Events are syscall, exception, IRQ and VM fault.</li>
<li>Call graph is messy! A microkernel takes all of the messiness and packs it into a very small space.</li>
<li>Formal methods practitioners (fans of abstraction) versus kernel developers (exterminate OS abstractions). Different view of the world. Haskell prototype unified these two things: OS people got to implement an OS, while the formal methods people got well-defined semantics. The C code is manually-written and hand-optimized, but based on the Haskell prototype.</li>
<li>Aim to reduce complexity. Have to deal with virtual memory in the kernel. But we can put drivers outside the kernel. Concurrency is complex, so use an event-based kernel and limit pre-emption to a few well-chosen points in long-running operations. The C code is derived from the functional representation. Need to support a subset of C: everything from the standard, minus goto, switch fall-through, &amp; on stack variables, side-effects in expressions, function pointers and unions.</li>
<li>Found 16 bugs during testing, and 460 bugs during verification (roughly equally distributed between the C code, the design and the spec). Took 25 person-years in total: $6 million (compared to $87 million for EAL6).<br />
One of the largest proofs ever done in a theorem prover: 200kloc handwritten, machine-checked proof. Proved 10kloc of OS code.</li>
<li>Q: can you comment on what happens when you have to evolve the code? What effort is required? It depends. An optimization on the code level that doesn&#8217;t change the semantics might need a few days to re-prove. A new feature that adds new components (discussed in the paper) doesn&#8217;t depend on the rest of the kernel, as long as you don&#8217;t screw around with existing data structures.</li>
<li>Q: does your work solve the stated problem? The assumptions are significant, and you&#8217;ve just done a very significant type-check on the code? Is it really possible to solve the originally stated problem? This is just the first step. You can reduce the assumptions with more work. It isn&#8217;t the only technique that you should use, so if you deploy it in an Airbus you should also do testing and verification. [Taken offline.]</li>
<li>Q: how can you verify something high-level like having the address spaces of two processes being isolated? We do this. You can still use seL4 in a stupid way, but you can use our security model with capabilities and reason about those in the spec. You don&#8217;t have to go down to the code.</li>
<li>Q: did you see a correlation between the logical errors in the specification and the implementation? Not really. The C bugs were fairly &#8220;stupid&#8221;: typos, copy-and-paste, etc.</li>
<li>Q: wouldn&#8217;t it be better to prove temporal properties? Are they expressible? They are expressible. We look at functional correctness, not temporal properties. But you need functional correctness before you can reason about temporal properties.</li>
</ul>
<h3>Helios: Heterogeneous Multiprocessing with Satellite Kernels</h3>
<ul>
<li>Systems are getting more complicated, from UP to SMP to CMP to NUMA. This is still homogeneous. But hardware is no longer homogeneous: programmable NICs, GPGPUs, etc. Operating systems ignore this heterogeneity: the other devices have different instruction sets and often no cache coherence. This means that the standard OS abstractions are missing, and programming models are fragmented. Can we bring this back into the operating system?</li>
<li>Helios is an OS for distributed systems in the small. Use four techniques to manage heterogeneity, simplify app development and provide a single programming model for heterogeneous systems.</li>
<li>Result is that it is possible to offload processes to these heterogeneous devices with no code changes. Also improves performance on NUMA architectures.</li>
<li>Satellite kernels. Want to make use of an I/O device, but the driver interface is a poor interface for applications that want to use programmable devices. It becomes hard to perform tasks like debugging, I/O and IPC with these devices. The driver doubles as an OS, within the OS itself. A satellite kernel runs on the device itself: fundamentally a microkernel. Also run separate satellite kernels on each NUMA node. Local IPC and remote IPC for communication between satellite kernels.</li>
<li>Applications register as services in a namespace. The namespace connects IPC channels.</li>
<li>Application placement is constrained by the use of heterogeneous ISA, an expectation of fast message passing and platform preference. Applications are allowed to specify affinity in their metadata: a hint for where the process should run. Easy for a dev, admin or user to edit affinity. Platform affinity is processed first, and this guarantees certain performance characteristics. Can also contra-locate, e.g. if you don&#8217;t want the interference of an anti-virus program running on the same core. Algorithm attempts to balance simplicity with optimality.</li>
<li>Applications are first compiled down to MSIL, and then that is compiled down to the appropriate ISA. Can encapsulate multiple versions of a method for different ISA in the MSIL (e.g. fast vector math).</li>
<li>Implemented on Singularity, using an XScale programmable I/O card (2GHz ARM processor with 256MB of DRAM). Just need a timer, an interrupt controller and the ability to handle exceptions to implement a satellite kernel for a new device. No need for an MMU (thanks, Singularity!). GPUs are adding timers (Larrabee). Only supports two platforms and a limited set of applications.</li>
<li>Evaluated several applications (network stack, FAT32 FS, mail server, web server, etc.) and how easy it was to run them on satellite kernels. Almost no code had to be changed (only in the TCP test harness). One line of metadata had to be changed in almost every case (zero in the other).</li>
<li>Offloaded an entire networking stack to the XScale, and showed that the end-to-end performance of PNG compression-and-serving is improved when offloading to the XScale.</li>
<li>Considered an email server built on Singularity, using a NUMA box. Emails per second handling improved by 39%. Turned out that the instruction throughput was much higher due to better cache utilization.</li>
<li>Q: when you transfer data between two NUMA domains, couldn&#8217;t the IPC fail due to memory allocation failures? Singularity is statically verified, using contracts, so we don&#8217;t have to worry about that.</li>
<li>Q: is this not just 20-year-old distributed microkernel research rehashed? We pay homage to that in the paper. A simple heuristic is sufficient to decide where to run some process, and you need to have process migration anyway, so why not just use that when you get a problem? Process migration is pretty difficult, in the heterogeneous case. Abstraction turned out to be pretty brittle in commodity OSs.</li>
<li>Q: is it reasonable to rely on protection from a large runtime? The system isn&#8217;t dependent on type-safety.</li>
</ul>
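<p>The affinity-based placement described above can be pictured with a small sketch. Everything here is a simplified assumption rather than the Helios algorithm itself: the <code>Manifest</code> type, the integer affinity weights, and the <code>place</code> function are hypothetical names. It only captures two points from the talk: platform preference is processed first, and positive or negative affinity to other services then picks among the remaining kernels.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical affinity metadata shipped with an application."""
    platforms: list[str]                                    # preferred platforms, best first
    affinity: dict[str, int] = field(default_factory=dict)  # service name -> weight (negative = contra-locate)


def place(manifest: Manifest, kernels: dict[str, str], services: dict[str, str]) -> str:
    """Choose a satellite kernel for a new process.

    kernels maps kernel name -> platform; services maps service name -> the
    kernel on which that service currently runs.
    """
    # Step 1: platform affinity is processed first.
    candidates: list[str] = []
    for platform in manifest.platforms:
        candidates = [name for name, plat in kernels.items() if plat == platform]
        if candidates:
            break
    if not candidates:
        raise ValueError("no kernel matches the platform preference")

    # Step 2: sum the affinity weights of the services hosted on each candidate.
    def score(kernel: str) -> int:
        return sum(w for svc, w in manifest.affinity.items() if services.get(svc) == kernel)

    return max(candidates, key=score)
```

<p>For example, a manifest with a negative weight for an anti-virus service would be steered to a kernel that does not host that service, provided another kernel of the preferred platform exists.</p>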
<h3>Surviving Sensor Network Software Faults</h3>
<ul>
<li>Sensors have to operate unattended for months or even years. It&#8217;s hard to debug failures considering that the input is unknown. There is no debugger.</li>
<li>Safe TinyOS introduces memory safety to sensor nodes. But what do you do when you get a safety violation? In the lab, spit out an error message; in the wild, reboot the entire node (losing valuable soft state and application data).</li>
<li>Neutron is a new version of TinyOS. Reduces the cost of a violation by 95&#8211;99%. It has near-zero CPU overhead during execution. Runs on a 16-bit microcontroller.</li>
<li>A TinyOS program is a graph of software components: statically instantiated code and state. Connections are typed by interface and there is minimal state sharing. Now have preemptive multithreading with a non-blocking, single-threaded kernel. Aim is to separate the program into independent units for recovery. Infer the boundaries of these at compile time. The kernel is a single unit.</li>
<li>Units can be rebooted independently. A wrinkle involves cancelling system calls, so you need to block if a syscall is still pending. Blocks of allocated memory are tagged with the owning recovery unit, which enables these to be freed on reboot (by walking the heap). Can even reboot the kernel: just cancel all pending system calls (return ERETRY), and just have to maintain thread memory structures. Applications will continue after the kernel reboots.</li>
<li>New idea of &#8220;precious state&#8221;: a group of precious variables will persist across a reboot. Annotate variables in the source code. Some restrictions on precious pointers. Precious variables must be accessed in atomic blocks. Variables are persisted on the stack across reboot: the set of precious state is usually smaller than the worst-case stack size.</li>
<li>Evaluated the cost of a kernel violation in Neutron, compared to safe TinyOS. Looked at three libraries, running on a 55-node testbed. Show the effect of a reboot on the CTP workload. Neutron gets close to the non-reboot case. Also look at the effect on time synchronization in FTSP, showing what proportion of the nodes have unsynchronized time. Again, Neutron gets close to the non-reboot case. Looked at fault isolation. CTP and FTSP data persist across reboots.</li>
<li>Main cost is in ROM bytes: 1&#8211;5kB of added code, roughly constant.</li>
<li>Measured cost of a reboot in milliseconds. A kernel safety violation will result in a 10&#8211;20ms outage.</li>
<li>Much lighter-weight than microreboots (and lets you reboot a kernel, not a J2EE application).</li>
<li>It&#8217;s easy to change the TinyOS toolchain, but changing the programming model isn&#8217;t, due to the amount of deployed code.</li>
<li>Q: how can I reason about a node that has survived a fault (a rebooted node is in a known-good state)? Do you have evidence that this is going to help us? [Showed emails from the questioner.] It is hard to diagnose these faults. Different approaches are possible.</li>
<li>Q: what do you think of the alternatives, such as using an MMU, verification or simulation? Well we could add an MMU, but we don&#8217;t have it at present. These new developments might make the dependability better. Verification struggles with the huge input space.</li>
<li>Q: how do you ensure that the precious state is not corrupted? Using safe TinyOS, so we won&#8217;t see memory access violations due to memory safety. Taint in the paper is about inconsistent state, not corruption.</li>
<li>Q: what are your criteria for what state should be marked as precious? Don&#8217;t have a strong set of guidelines for this, but have done it by inspection so far.</li>
<li>Q: what is your fault detection system, and what is its coverage? How do you know when you have a fault? The deputy compiler gets you to annotate code with things like buffer lengths. Can infer faulty behaviour from these. Annotating interfaces tends to be sufficient.</li>
<li>Q: this is a very valid approach?</li>
</ul>
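<p>The recovery-unit mechanics above (heap blocks tagged with their owner, freed by walking the heap on reboot, with precious variables carried across) can be modelled in a few lines. This is a toy model under stated assumptions, not Neutron code: the <code>Heap</code> and <code>RecoveryUnit</code> classes and their methods are hypothetical, and real Neutron is generated at compile time for a 16-bit microcontroller.</p>

```python
class Heap:
    """Toy heap in which every block records the recovery unit that owns it."""

    def __init__(self):
        self.blocks = {}   # block id -> name of the owning recovery unit
        self.next_id = 0

    def alloc(self, unit: str) -> int:
        block = self.next_id
        self.next_id += 1
        self.blocks[block] = unit
        return block

    def free_unit(self, unit: str) -> int:
        """Walk the heap, free every block owned by `unit`, return the count."""
        doomed = [b for b, owner in self.blocks.items() if owner == unit]
        for b in doomed:
            del self.blocks[b]
        return len(doomed)


class RecoveryUnit:
    """A unit that can be rebooted without touching the rest of the node."""

    def __init__(self, name, heap, initial_state, precious=()):
        self.name = name
        self.heap = heap
        self.initial_state = dict(initial_state)
        self.precious = set(precious)       # variables annotated as precious
        self.state = dict(initial_state)

    def reboot(self):
        # Save precious variables, reclaim memory, reset, then restore.
        saved = {k: self.state[k] for k in self.precious}
        self.heap.free_unit(self.name)
        self.state = dict(self.initial_state)
        self.state.update(saved)
```

<p>In this model, rebooting one unit resets its ordinary variables and frees its heap blocks, while its precious variables and the blocks owned by other units are left alone.</p>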
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2009/10/13/sosp-2009-day-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SOSP 2009: Day 1</title>
		<link>http://www.mrry.co.uk/blog/2009/10/12/sosp-2009-day-1/</link>
		<comments>http://www.mrry.co.uk/blog/2009/10/12/sosp-2009-day-1/#comments</comments>
		<pubDate>Mon, 12 Oct 2009 16:11:34 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/?p=44</guid>
		<description><![CDATA[Keynote: Barbara Liskov

Inventing abstract data types, CLU, Type hierarchy, What next?
More of a Programming Methodology talk than a systems talk.
Started out in systems with the Venus machine on the Interdata 3. Presented it and its operating system at SOSP 1971.
Back in the early 1970&#8217;s, people were concerned about the software crisis. (cf. Dijkstra&#8217;s 1972 Turing [...]]]></description>
			<content:encoded><![CDATA[<h2>Keynote: Barbara Liskov</h2>
<ul>
<li>Inventing abstract data types, CLU, Type hierarchy, What next?</li>
<li>More of a Programming Methodology talk than a systems talk.</li>
<li>Started out in systems with the Venus machine on the Interdata 3. Presented it and its operating system at SOSP 1971.</li>
<li>Back in the early 1970&#8217;s, people were concerned about the software crisis. (cf. Dijkstra&#8217;s 1972 Turing Award lecture, The Humble Programmer.) As machines got cheaper, bigger and faster, software started to matter a lot. A tendency to underprovision the hardware, creating challenges for the software developers.</li>
<li>In late 1960&#8217;s the field of Programming Methodology began. Started to think about program design and structure: in order to have functionality, maintainability, etc.</li>
<li>First paper: Go To Statement Considered Harmful (Dijkstra, 1968). A revolutionary letter to CACM. Use static program text to reason about dynamic program behaviour: it would be useful if these were as close as possible, but GOTOs prevent that. Example of debugging: understand how you got to a particular point in the text. GOTOs make this very difficult. Provoked a huge amount of resistance: can&#8217;t program without it. Pointed out limitations of the programming languages: branch into a label table is just a case statement, but they didn&#8217;t have case statements.</li>
<li>Second paper: Program Development by Stepwise Refinement (Wirth, 1971). European school of software development: a top-down approach, starting with many abstract parts which are not initially implemented. Example was the 8-queens problem.</li>
<li>Third paper: Information Distribution Aspects of Design Methodology (Parnas, 1971). A new interest in modularity, and information hiding. &#8220;The connections between modules are the assumptions which the modules make about each other.&#8221; Hedging about what modules were at the time.</li>
<li>Fourth paper: On the Criteria to be used in Decomposing Systems into Modules (Parnas, 1972). How to actually break a system into modules: a hint of data abstraction but the full idea wasn&#8217;t there yet.</li>
<li>Enter Barbara Liskov: A Design Methodology for Reliable Software Systems (1972). Entire system should be partitioned: no global state and each partition owns some part of the state. Partition exposes operations and only way of interacting with the state would be by calling operations on the modules.</li>
<li>Wanted to apply the idea of partitions for building programs. It was unclear how to combine modules to make whole programs. Idea of partitions came out of doing work on operating systems: ur-partitions were the supervisor and user modes. This idea is carried a lot further by ADTs.</li>
<li>Idea was to connect partitions to data types. A strike of inspiration. Ideas often arrive in the middle of the night, or when arriving to work with a fresh mind.</li>
<li>March 1973 SIGPLAN/SIGOPS interface meeting on programming methodology was the debut of the idea. Began working with Steve Zilles. At the time, they were knowledgeable about all the languages in existence (FORTRAN, LISP, ALGOL, PL/I, COBOL&#8230;). Started to do language design.</li>
<li>People were interested in extensible languages as early as 1967 (Schuman and Jorrand, Balzer). How can we help people build dialects of languages that make them easier to use? Looked at syntactic and semantic extensibility. Syntactic extensions written in BNF and added to the language using some kind of preprocessor. Fortunately this died a death&#8230;. People were much more worried about writing programs than reading them. Didn&#8217;t realise that programs are read more often than they are written. Balzer imagined data types as being collections which allowed four defined operations (add, remove, etc.) with operator overloading.</li>
<li>Hierarchical Program Structures (Dahl and Hoare, 1972): Simula 1967. Didn&#8217;t have encapsulation but did have inheritance to make simulation easier. Precursor to Smalltalk.</li>
<li>Protection in Programming Languages (Morris, 1973). Recognised the importance of locality in module comprehension (allows local reasoning). Proposed sealed objects using encryption as an OS mechanism to guarantee locality.</li>
<li>Global Variable Considered Harmful (Wulf and Shaw, 1973). In the 1960&#8217;s a stream of languages made block structure the big thing. Give locality within blocks, but you can always access things on the outside (i.e. global variables). Made analogy with Dijkstra&#8217;s paper that global variables are implicit connections between states of the program, which makes reasoning about it more difficult.</li>
<li><strong>Programming with Abstract Data Types (Liskov and Zilles, 1974).</strong></li>
<li>Said what ADTs were: a set of operations (whatever ones made sense, not a fixed set), encapsulation was important, and the operations were the only way to access object state.</li>
<li>ADTs were &#8220;clusters&#8221; with encapsulation. Proposed polymorphism, static type checking (but weren&#8217;t sure if this was possible due to polymorphism) and exception handling.</li>
<li>Why was a new programming language necessary? Needed to communicate the ideas to programmers. Enabled the testing of whether ADTs work in practice. Also made it possible to get a precise definition (tendency to think of the compiler as the language definition&#8230;). And to validate whether it was possible to achieve reasonable performance.</li>
<li>Goals of language design: ease of use, simplicity, expressive power and performance. First two played off against second two.</li>
<li>Also wanted minimality (limiting the language to what we could get by with), uniformity (keep abstract types similar to the built-ins) and safety (find errors as soon as possible: compile time?).</li>
<li>Assumptions/design decisions. Wanted language to be heap-based with garbage collection (based on experience with LISP). Program was a collection of procedures rather than a linear piece of code (ALGOL style). People were scared of pointers, but used them to simplify the design. No block structure! Separate compilation of individual modules. Also had static type checking which was meant to speed up finding errors. No concurrency (cut out what wasn&#8217;t necessary to simplify a big project). No GOTOs. No inheritance.</li>
<li>CLU clusters. A cluster had a header with a list of the operations that it defined.  Thought of operations as belonging to the type rather than the object (passing an object as an argument). Defined the representation of the object internally, and the implementation of the operations. Used &#8220;cvt&#8221; to define unsealing on operation entry, and sealing on exit (but compile-time checked).</li>
<li>Polymorphism: set[T: type], and had a where clause for the type parameter to specify, e.g., that T has an equals: T -&gt; T -&gt; bool function.</li>
<li>Exception handling: Issues and a Proposed Notation (Goodenough, 1975). People didn&#8217;t know the right model: procedure should terminate (now the status quo), or throw exception to a higher level that would allow it to resume. PL/I had both. How should handlers be specified? At the call-site, or out of the main-line for all invocations to that function. CLU used termination, and specifies the exceptions that a method may signal in the header. Handled at the call-site.</li>
<li>How to handle exceptions? Handle it, propagate it up the call stack, or signal failure (the exception shouldn&#8217;t happen&#8230;). Can never be certain that these last exceptions won&#8217;t happen, but don&#8217;t want to write code to deal with this, so CLU introduced the &#8220;failure&#8221; exception. Really want accurate interfaces (know exactly what exceptions a method might signal) and no useless code.</li>
<li>Iterators. For all x in C do S. Solutions were to destroy the collection (repeated removal), or complicating the abstraction (turn it into an ordered set, making it indexable). In summer 1975-ish, the MIT group visited CMU, where the Alphard group were working on &#8220;Generators&#8221;. Thought these were a bit crusty, so invented iterators which are like procedures that you call and they yield instead of returning. Can nest iterators and recursively call them. Implemented by passing the loop body as an argument to the iterator. But this limited the expressive power.</li>
<li>In 1987, gave a keynote at OOPSLA, but had been ignoring object-oriented languages and inheritance in particular. Took the opportunity to get into the literature. Much of it was very bad/confused. Inheritance being used for two different things: implementation simplification and type hierarchy. The two were not compatible.</li>
<li>Implementation inheritance violated encapsulation! Subclasses depend on the implementation of the superclass, making it hard to change the superclass. CLU could do implementation sharing without inheritance.</li>
<li>Type hierarchy is much more interesting, but wasn&#8217;t well understood. How were stacks and queues related?</li>
<li>Led to the Liskov Substitution Principle: Objects of subtypes should behave like those of supertypes if used via supertype methods. (Data abstraction and hierarchy (Liskov, 1988).) Didn&#8217;t realise that this was a big idea!</li>
<li>What next? The world has changed from one where people had no idea about modularity, to one where modularity is based on abstraction.</li>
<li>Modern programming languages (Java and C#) are pretty good. Procedures are missing: they are important, and the loss of them as a first-class thing makes the program less simple. Closures are missing, as are iterators. Exception handling is important but failure handling is not done well. Also need built-in types as a basis: extensibility might be going too far. Can we do better than &#8220;serialization&#8221; (horrible overloading of the term): can&#8217;t it be done by garbage collection?</li>
<li>The state of programming is pretty lousy. The COBOL programmers of yesterday are now writing web services and browsers. The era of globals has returned. There&#8217;s little encapsulation and protection, and yet these are handling confidential information. Problem might be persistent storage violating abstraction: perhaps we would be better with an object store that provides automatic translation and type preservation.</li>
<li>Programming language research. Is now the time for some new abstraction mechanisms? Probably not just specification languages. Concurrency and multi-core: modularity is very helpful here, but there is still a lot of work to do. Should distributed systems be programmed in languages that include distribution as a first-class concept?</li>
<li>System research has done well. Abstractions like DHTs, map-reduce, client/server, distributed information flow. These have been useful for making progress.</li>
<li>Concerned that we trade off for performance (versus simplicity and semantics). Led astray by the end-to-end argument, but not sure that the argument is valid when the end is the user. We know that we&#8217;ll never be 100% reliable, but failures should be catastrophes, not laziness (or an optimization).</li>
<li>The systems community has thrived because it has been so open: embracing new concepts. Worried about the semantics of the coming Internet Computer: what does it mean to run a program in the cloud (the PL should embrace distribution?). Massive parallelism also coming as a problem: not sure they necessarily follow from the end of Moore&#8217;s Law. But we will need to manage them as a distributed system, so there is a lot to learn from that world.</li>
</ul>
<h2>Scalability</h2>
<h3>FAWN: A Fast Array of Wimpy Nodes</h3>
<ul>
<li>Awarded best paper.</li>
<li>Energy is becoming an important consideration in computing. Google uses $40 million of energy a week. Can we reduce energy costs tenfold? Without increasing capital costs?</li>
<li>Idea is to improve energy efficiency by using an array of well-balanced low-power systems. For computationally intensive, data-intensive systems.</li>
<li>Goal is to reduce peak power: 20% energy loss on power and cooling is considered &#8220;good&#8221;. 100% utilization =&gt; 1000W but 20% utilization =&gt; 750W. FAWN wants to get 100% utilization down to 100W, and less utilization to less power.</li>
<li>As CPU cycles have improved much faster than disk seeks, we have a gulf in wasted resources. Could rebalance by using very fast disks. Could use slower CPUs and moderately faster storage. Or could use slow CPUs and today&#8217;s standard disks. These are not equivalently efficient.</li>
<li>Fastest processors have superlinear power usage (Xeon7350, etc.) due to things like branch prediction, caching, etc. which are not so useful for data-intensive, I/O intensive workloads. Custom ARM nodes etc. are slow but power usage is dominated by fixed power cost. In between we have things like the Atom and XScale: FAWN targets these.</li>
<li>Application is data-intensive key-value systems. A critical infrastructure service with SLAs, random-access and read-write. Goal was to improve the efficiency in Queries/Joule or Queries/Second/Watt.</li>
<li>Prototype using Alix3c2 nodes with flash storage, 500MHz CPU, 256MB DRAM and 4GB flash.</li>
<li>Challenges: need efficient and fast failover, wimpy CPUs, limited DRAM and flash that is poor at small random writes.</li>
<li>Architecture: single front-end connected by switch fabric to many back-end. Front-end manages backends, acts as a gateway and routes requests. Use consistent hashing. Interesting design decisions are at the backends (focus of the talk).</li>
<li>High performance KV store: must map keys to values on the backend. Use an in-memory hashtable to find the location of a value on disk. 160-bit key into hashtable which contains a key fragment, valid bit and offset into the data region. On flash, store the entire key, length and the data itself. Limited DRAM means you only store a fragment of the key in memory (to get 12-byte hashtable entry): risk of collisions and multiple requests. Get a low probability of multiple flash reads.</li>
<li>Small random writes problem avoided by log-structured datastore. Helps also with node addition, which requires transfer of key ranges: the log structure makes the transfer of key ranges a simple streaming transfer&#8230; a simple iteration of the data store. But the SLA means that the source of the data must continue running while doing this streaming: log-structure lets you minimize locking. Also need a compact operation on the data store, which also runs in the background.</li>
<li>Replication and strong consistency are covered in the paper.</li>
<li>Evaluated by considering the efficiency (energy) of KV lookup. Look at the impact of background operations on query throughput. And then TCO analysis for random read workloads.</li>
<li>Measured the efficiency of 256-byte KV lookups on various systems, looking at the power outlet power usage. Alix3c2/Sandisk got 346 QPS/Watt, whereas a standard desktop machine with SSD got 51.7 and two hard-disk based systems were 2.3 (MacBook Pro) and 1.96 (desktop).</li>
<li>Peak capacity is 1600 QPS. During compaction ±800. During split ±1200, and Merge ±700.</li>
<li>At 30% of peak query load, these have almost no discernable impact (±500 QPS in all cases).</li>
<li>TCO = Capital Cost + Power Cost ($0.10/kWh). When should you use FAWN?</li>
<li>Traditional system with 200W usage (5 x 2TB disks/160GB PCI-e Flash SSD/64GB FBDIMM per node). $2000&#8211;8000 per node. FAWN (10W usage)  (2TB disk/64GB SATA Flash/DRAM-based). Where storage capacity is dominant, use FAWN + Disk.  When you care about query throughput, use FAWN + DRAM. FAWN + Flash covers much of the remainder of the space, but Traditional + DRAM also covers some of the space.</li>
<li>To be energy efficient, require some effort on the part of the system developer.</li>
<li>Q: impact of networking on performance and cost models? Network is an important component, and we really want things like all-to-all communication, which often needs a high-powered core. But proposals for using low-powered commodity switches at scale mean that it is possible to get down to a fixed power overhead per core.</li>
<li>Q: off-the-shelf memcached could do better? [Inaudible.] It takes a lot of effort to get very high throughput.</li>
<li>Q: what happens when you include latency in the model of TCO? Interactive services that serialize might shift the model? Flash devices give pretty good common-case latency. What if you have computational load in the query? You have to trade off energy efficiency for longer periods of computation. Get high 99.9% latency during maintenance but it&#8217;s still okay.</li>
<li>Q: what happens when a frontend thinks one node is down, etc.? Have optional replication to provide constant access to data, and a background replication process.</li>
</ul>
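<p>A minimal sketch of the backend lookup path described in the notes above: an in-memory index keeps only a fragment of the hashed key plus an offset, and the full key lives in an append-only log (standing in for flash). The class and method names, the 16-bit bucket and 15-bit fragment sizes, and the per-bucket list used for collisions are all assumptions made for illustration, not the FAWN implementation.</p>
<pre>
```python
import hashlib


class FawnStoreSketch:
    """Toy log-structured key-value store with a fragment-only index."""

    def __init__(self, index_bits=16, fragment_bits=15):
        self.log = []         # append-only list standing in for the flash log
        self.index = {}       # bucket -> [(key fragment, log offset)], newest first
        self.index_bits = index_bits
        self.fragment_bits = fragment_bits
        self.flash_reads = 0  # counts simulated flash reads

    def _split(self, key):
        # SHA-1 gives a 160-bit value; use disjoint bit ranges for the
        # bucket number and for the fragment kept in memory.
        digest = int.from_bytes(hashlib.sha1(key).digest(), "big")
        bucket = digest % (2 ** self.index_bits)
        fragment = (digest // (2 ** self.index_bits)) % (2 ** self.fragment_bits)
        return bucket, fragment

    def put(self, key, value):
        bucket, fragment = self._split(key)
        self.log.append((key, value))  # sequential write only, never in place
        offset = len(self.log) - 1
        self.index.setdefault(bucket, []).insert(0, (fragment, offset))

    def get(self, key):
        bucket, fragment = self._split(key)
        for stored_fragment, offset in self.index.get(bucket, []):
            if stored_fragment != fragment:
                continue  # rejected in memory, no flash read needed
            self.flash_reads += 1
            stored_key, value = self.log[offset]
            if stored_key == key:  # the fragment is only a hint
                return value
        return None
```
</pre>
<p>A fragment match is only a hint, so the lookup still compares the full key read from the log. Overwritten entries stay in the log until a compaction pass (not shown) removes them.</p>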
<h3>RouteBricks: Exploiting Parallelism to Scale Software Routers</h3>
<ul>
<li>Awarded best paper.</li>
<li>Want to build routers that are fast and programmable. Why do we want them to be programmable? Programmable routers enable ISPs to offer new services (intrusion detection, application acceleration, etc.). They make network monitoring simpler (measuring link latency, tracking down traffic). Finally they make it possible to offer new protocols, such as IP traceback, trajectory sampling etc. They enable flexible, extensible networks.</li>
<li>But today, speed versus programmability is a trade off. Fast routers are implemented in hardware (Tbps throughput), but offer no programmability. But programmable software routers get throughput &lt; 10Gbps using general purpose CPUs.</li>
<li>RouteBricks uses off-the-shelf PCs, a familiar programming environment and large-volume manufacturing to reduce costs. How do we get to Tbps routers with these ingredients?</li>
<li>A router is just packet processing + switching. We have N linecards, one for each input or output port, and each must be able to handle packets at the per-port rate, R. Also have a switch fabric which must switch at N*R.</li>
<li>RouteBricks uses one server instead of each line card and a commodity interconnect.</li>
<li>Require internal link rates &lt; R and per-server processing rate = c*R (c is a small, reasonable constant). Per-server fanout should be constant.</li>
<li>What interconnect satisfies these requirements? Could naively have a crossbar switch, with N^2 line-rate links. Could instead use Valiant Load Balancing, which introduces a third stage between input and output. Have N intermediate servers, and R/N rate links from each input to the intermediate, and from each intermediate to each output. Now have N^2 links at rate 2R/N. Per-server processing rate is 3R. But with uniform traffic patterns, each server only must process at 2R.</li>
<li>But this still gives linear per-server fanout (bad). If we increase per-server processing capacity, could assign multiple ports to each server. Or add a k-degree n-stage butterfly network. Combine these ideas: RouteBricks. Use full mesh if possible and extra servers otherwise.</li>
<li>The trade-off depends on the kind of servers that you have. Assuming current (5 NICs * 2 X 10G ports or 8 X 1G ports and 1 external port per server).</li>
<li>With 2R&#8211;3R processing rate, we need to optimize the server! Use a NUMA architecture (Nehalem, QuickPath interconnect), 2 quad-core CPUs, and 2*2*10G NICs. Run Click in kernel-mode.</li>
<li>First try: got 1.3Gbps per server. Spending a lot of cycles on book-keeping operations, such as managing packet descriptors. So use batched packet operations and CPU polling, which got 3Gbps.</li>
<li>Still a problem with how the cores accessed the NICs: queue access. Get synchronization overhead between the two cores. Simple rule: 1 core per port (input or output), gets no locking overhead. But now we have cache misses because of separate cores working on the same packet. So second rule: 1 core per packet. These rules are mutually exclusive!</li>
<li>Solution is to use NICs with multiple queues per port. Can now assign each queue to a single core, which achieves our objective.</li>
<li>So, use state-of-the-art hardware, modify the NIC driver to do batching, and perform careful queue-to-core allocation.</li>
<li>For no-op forwarding, get 9.7Gbps with min-size packets, and 24.6Gbps with a realistic mix.</li>
<li>For IP routing (IPSec also but not shown), get 6.35Gbps for min-size, and 24.6Gbps with a realistic mix.</li>
<li>Realistic size mix: R = 8&#8211;12 Gbps. In this case, we are I/O bound. Min-size packets are CPU-bound.</li>
<li>Look at the next-gen Nehalem and do some back-of-the-envelope calculations. Could get R = 23&#8211;35 Gbps with upcoming servers for a realistic mix.</li>
<li>Prototyped RB4: N = 4 in a full mesh. Realistic size mix was 35 Gbps.</li>
<li>Did not talk about reordering (avoid per-flow reordering), latency (24us per server), open issues (power, form-factor, programming model). Slide illustrates Russell&#8217;s paradox quite nicely.</li>
<li>Q: about programmability, the examples require maintaining cross-packet state, so would the choice of the load-balancing mechanism within the router affect this? Yes, this is important when the bottleneck is the CPU.</li>
<li>Q: are the trends on both power and power-performance of next-gen processors in your favour? We spend more energy than the corresponding Cisco router, but we have a lot of room for improvement and will get better. Maybe best thing to do is use general-purpose CPUs with an efficient interconnect.</li>
<li>Q: is the real choice between general purpose CPUs and a different programming model? Could you do better with ternary CAMs, etc.? Programmability is just trapping exceptional packets. IP routing is easy? Not suggesting that you just throw away all hardware routers, but you might want to do it in some cases, e.g. for specialized monitoring at the borders of your network. A programmable datapath might make it easy to deploy a new protocol.</li>
<li>Q: curious about reordering phenomenon and how this would affect TCP? What would be the most extreme reordering that an adversary could produce? [Inaudible.]</li>
</ul>
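<p>The Valiant Load Balancing arithmetic from the notes above can be checked with a few lines. This is a sketch that assumes a full mesh of N servers with one external port of rate R each; the function name and the returned field names are made up for illustration.</p>
<pre>
```python
def vlb_requirements(n_servers, port_rate_gbps, uniform_traffic=False):
    """Per-link and per-server rates for a full mesh using Valiant Load Balancing."""
    # Each of the N*N internal links carries 2R/N: R/N on the way to the
    # intermediate server plus R/N on the way to the output server.
    link_rate = 2 * port_rate_gbps / n_servers
    # A server may act as input, intermediate and output at once (3R);
    # with uniform traffic the notes put the requirement at 2R.
    factor = 2 if uniform_traffic else 3
    return {
        "links": n_servers ** 2,
        "link_rate_gbps": link_rate,
        "per_server_gbps": factor * port_rate_gbps,
    }
```
</pre>
<p>For example, with N = 4 servers and an assumed R = 10 Gbps, each internal link needs 5 Gbps and each server must process 30 Gbps in the worst case, or 20 Gbps under uniform traffic.</p>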
<h3>The Multikernel: A New OS Architecture for Scalable Multicore Systems</h3>
<ul>
<li>Boots into Barrelfish rather than kubuntu to get the video driver to work.</li>
<li>How do we structure an OS for the multicore future? Need to deal with scalability and heterogeneity.</li>
<li>Sun Niagara has a banked L2 cache for all cores (bad for shared memory). Opteron Istanbul has per-core L2. Nehalem is quite different still. Need to do different things to work well on each of these. An 8-socket Opteron has an &#8220;interesting&#8221; interconnect. The interconnect matters, especially to access latencies.</li>
<li>Can also have core diversity: within a system (programmable NICs, GPUs, FPGAs) and on a single die (for performance asymmetry, SIMD extensions or virtualization support).</li>
<li>As core count increases, so does diversity. And unlike HPC systems, one cannot optimize at design time. Need systems software to adapt for you.</li>
<li>So now is the time to rethink the default OS structure. Now we have shared-memory kernels on every core, data structures protected by locks and everything else being a device. So propose structuring the OS as a distributed system (like Barbara Liskov said earlier).</li>
<li>First principle: make inter-core communication explicit. All communication should be done with messages and no shared state. This decouples the structure of the system from the inter-core communication mechanism, and makes communication patterns explicit (which cores you communicate with etc.). It naturally supports heterogeneity of cores. Also better matches future hardware (no cache coherency? Such as Intel&#8217;s 80-core machine). Also allows split-phase operation which makes concurrency easier. And finally, we can reason about performance, scalability, etc. of such systems.</li>
<li>Simple microbenchmark to evaluate the trade-off between shared memory and message passing. Several cores update a shared array. In shared-memory case, we get stalling due to cache coherency protocol. This is limited by the latency of interconnect round-trips. Latency is linearish in the number of cores (up to 16), and gets worse with the number of cache lines modified.</li>
<li>What if we had a server core that takes messages from a ring buffer per client core?  Higher latency to begin with, but scales much better as the number of cache lines modified is increased. Client overhead is queuing delay at the server. The server&#8217;s actual cost is approximately constant, and very low. This would give us many spare cycles if the RPC is implemented split-phase.</li>
<li>Second principle: make the OS structure hardware-neutral. The only hardware specific parts should be the message transports and the CPU/device drivers. Makes it possible to adapt to changing performance characteristics. Can late-bind protocol and message-passing implementations.</li>
<li>Third principle: view state as replicated. Potentially-shared local state is accessed as if it were a local replica (e.g. scheduler queues, process control blocks, etc.). The message-passing model requires this. This naturally supports domains that don&#8217;t share memory.</li>
<li>Replicas were previously used as a selective optimization in other systems. The multikernel makes sharing a local optimization, instead (opposite view). Only use shared state when it is faster, and make this decision at runtime.</li>
<li>Can support applications that need shared memory if it is available, but the OS doesn&#8217;t rely on this.</li>
<li>Currently run on x86-64 machines. A CPU driver handles traps and exceptions serially. A user-space monitor mediates local operations on global state. Use URPC inter-core message transport, but we expect that to change.</li>
<li>Many non-original ideas (in particular decoupling message-passing from synchronization).</li>
<li>Many applications running on Barrelfish (slide viewer, webserver, VMM, OpenMP benchmarks, etc.).</li>
<li>How do you evaluate a radically-different OS structure? Barrelfish was from-scratch, so is less complete than other OSs. But we need to show that it has good baseline performance: comparable to existing systems.</li>
<li>Case study of unmap (TLB shootdown). Logically need to send a message to every core and wait for all to acknowledge. Linux/Windows use IPIs and a spinlock to do this. Barrelfish makes a user request to a local monitor, and uses message passing.</li>
<li>Tried several different communication protocols. One unicast channel per core; a single broadcast channel. Neither of these performs well (especially broadcast because there is no such thing in the interconnect).</li>
<li>Really want a multicast optimization: send message once to every socket, which has an aggregation core that forwards it on to local cores. Also use the HyperTransport topology to make decisions about which cores are further away (and hence should receive the message earlier to give parallelism). Need to know the mapping of cores to sockets, and the messaging latency.</li>
<li>Use constraint programming and a system knowledge base to perform online reasoning. Prolog query on SKB constructs the appropriate multicast structure. Unmap latency is much less than Windows for &gt;7 cores, and Linux for &gt;12 cores. It also scales much better.</li>
<li>Show that there is no penalty for shared-memory user workloads (OpenMP), respectable network throughput, and a pipelined webserver with high throughput.</li>
<li>Conclusion: no penalty for structuring using message passing. So we should start rethinking OS structure this way.</li>
<li>Q: how do you build an application on top of Barrelfish that tries really hard to ignore topology? Focus so far on how to build the OS, and make no particular demands on application programming. Gone with POSIX so far, but we need some higher-level programming model to express things more easily.</li>
<li>Q: about URPC performance, are you running a single thing on each processor and hence is the process only ever spinning waiting for URPC, avoiding any context switching latency? That&#8217;s true. We decoupled messaging primitive from notification primitive, because notifications are very expensive (at least on our hardware). This leads to a trade-off. The split phase API makes it efficient to work in the case where you don&#8217;t need notification immediately.</li>
<li>Q: you didn&#8217;t mention virtualization&#8230; if we map one VM to one core, we can get good performance, so does Barrelfish obviate the need for virtualization? Barrelfish is viewed as orthogonal to virtualization.</li>
<li>Q: your microbenchmark [inaudible]. The point of the benchmark is to make the case that messaging over shared memory can have reasonable performance.</li>
<li>Q: as systems get bigger, do you expect messages to get lost? Maybe&#8230; if they do get lost, this model is a better way to structure it than shared-memory.</li>
</ul>
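<p>The multicast optimization for unmap described above (one message per socket, an aggregation core forwarding locally, furthest sockets first) can be sketched as follows. Barrelfish derives this structure with a Prolog query on its system knowledge base; the plain Python function, its argument names and the rule for picking the aggregation core are assumptions for illustration only.</p>
<pre>
```python
def multicast_plan(core_to_socket, socket_latency, source_core):
    """Return [(aggregator core, local cores it forwards to)] in send order."""
    by_socket = {}
    for core in sorted(core_to_socket):
        by_socket.setdefault(core_to_socket[core], []).append(core)

    # Furthest sockets first, so the slowest messages are in flight earliest.
    order = sorted(by_socket, key=lambda s: socket_latency[s], reverse=True)

    plan = []
    source_socket = core_to_socket[source_core]
    for socket in order:
        cores = [c for c in by_socket[socket] if c != source_core]
        if socket == source_socket:
            # The sender covers its own socket directly.
            plan.append((source_core, cores))
        else:
            # The lowest-numbered core is picked as aggregator (an assumption).
            plan.append((cores[0], cores[1:]))
    return plan
```
</pre>
<p>The inputs are exactly the two facts the notes say are needed: the mapping of cores to sockets and the messaging latency to each socket.</p>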
<h2>Device Drivers</h2>
<h3>Fast Byte-granularity Software Fault Isolation</h3>
<ul>
<li>Operating systems run many drivers and these run in a fully trusted mode. But they turn out to be a major source of bugs. Existing solutions either require changes to the source code or have unacceptably bad performance. Why? Drivers are complex and require complex, fine-grained temporal and spatial sharing within the kernel. But any mistake in this can be fatal&#8230;.</li>
<li>BGI isolates drivers in separate protection domains but allows domains to share the same address space. It provides strong isolation for existing drivers even with a complex kernel API. Write integrity, control-flow integrity and type-safety for kernel objects. Achieves this without changes to drivers or hardware, and only 6.4% CPU overhead (~12% space overhead).</li>
<li>Use a BGI compiler on unmodified source code to get instrumented driver, then link this with the BGI interposition library to get a BGI driver.</li>
<li>Protection model has kernel in trusted protection domain. Most drivers are untrusted and drivers may share untrusted domains. Each byte has an ACL. Access rights are read and write (default everything readable), indirect call, typed rights for different kernel objects (type safety for mutex, dispatcher, io etc.).</li>
<li>Two primitives: CheckRight and SetRight, called by the interposition library (not the driver code). Dynamic type checking is possible using these: check the arguments to kernel API calls, which ensures type safety, and state safety (temporal properties).</li>
<li>Compiler checks rights on writes and indirect calls, and grants and revokes write access to the stack as appropriate.</li>
<li>Rights are granted on driver load (write to globals, icall right to address-taken functions). Rights granted/revoked to function arguments according to call semantics. Function arguments are also rights-checked.</li>
<li>Example of a read request (event-based), and how the BGI compiler inserts inline checks to avoid many problems.</li>
<li>Rights change as the driver makes kernel calls, and BGI enforces complex kernel API usage rules (use-before-initialized, write-after-initialized, etc.).</li>
<li>Implementation must be fast. Fast crossing of protection domains (just a normal call instruction). Fast rights grant/revoke/check: uses compiler support to inline these checks and perform data alignment.</li>
<li>ACL data structure encodes (domain, right) pair as an integer. Several tables of &#8220;drights&#8221;: arrays for efficient access, and have 1-byte dright per 8-byte memory slot. Need a conflict table when rights ranges overlap. Want to avoid accesses to these conflict tables by aligning data on 8-byte memory slots. Makes the non-conflict case common at the cost of space. Heap objects are 8-byte aligned, and also need special drights for writable half-slots.</li>
<li>SetRight is implemented with 4 x86 assembly instructions. Uses arithmetic shift to obtain same code for kernel and user space addresses (with different tables).</li>
<li>CheckRight has fast check (5 instructions) and slow check (7 instructions).</li>
<li>BGI has a recovery mechanism that runs driver code inside a try block (cf. Nooks).</li>
<li>Evaluated on 16 Vista device drivers: 400 KLOC.</li>
<li>Evaluated for fault containment. Injected faults in the source of fat and intelpro drivers (following previous bug studies). Measured the number of faults contained by BGI: fat = 45/45, intelpro = 116/118.</li>
<li>Measured file I/O performance with Postmark. Max CPU overhead was 10% (for FAT) and max throughput cost was 12.3% (for FAT).</li>
<li>Measured network performance (max 16% CPU overhead and 10.2% throughput decrease both for UDP send).</li>
<li>Found 28 new bugs in widely used drivers. Some were serious: writes to incorrect places or use of uninitialized objects. Some less so: abstraction violation, etc. BGI is a good bug-finding tool.</li>
<li>Q: subtle problems such as virtual address aliasing or out-of-bound pointers within the bounds of other objects&#8230; how do you handle this? The granting of access rights is done according to the kernel API, for write/control-flow integrity and type safety. But we wouldn&#8217;t catch the case where a driver makes a write to something that it is allowed to write to, but isn&#8217;t the right thing.</li>
<li>Q: how do you deal with type-unknown pointers? Can catch errors on data buffers. But where pointers are passed as arguments, you typically don&#8217;t do arithmetic on these.</li>
<li>Q: compared with SFI work on already-compiled code (compiler not in TCB), could you do this on object code instead of source code? We have a binary version, but we report on the compiler version because it has better performance.</li>
<li>Q: the transformations look similar to existing SFI, so to what do you attribute this improvement in performance? Dealing with complex kernel API and enforcing fine-grained sharing. Previous work did a lot more copying or had expensive crossing of protection domains, in order to deal with complex sharing.</li>
<li>Q: you&#8217;re redoing virtual memory protection with a complex compiler? Why not just use a microkernel? The performance can&#8217;t be that bad? The goal was to support existing drivers that run on existing operating systems with good performance. Could you do the same transformations on existing code to run in separate address spaces? Maybe, but don&#8217;t know of a way to do this. Goal was to deal with existing drivers on existing OS.</li>
<li>Q: could you just insert latent compiler-inserted checks to avoid zero-day exploits, which are only turned on on zero-day? An interesting idea, but it&#8217;s better if you can cheaply run it all the time in deployed systems&#8230; you will find more bugs this way.</li>
</ul>
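<p>A rough sketch of the &#8220;drights&#8221; table idea from the notes above: one byte of access-right information per 8-byte memory slot, with SetRight and CheckRight as the two primitives. The integer encoding of the (domain, right) pair, the constant names and the class are hypothetical, and the sketch leaves out the conflict tables and the special drights for half-slots.</p>
<pre>
```python
SLOT_BYTES = 8          # one dright byte covers an 8-byte memory slot
RIGHTS_PER_DOMAIN = 16  # hypothetical encoding width
READ, WRITE = 0, 1      # hypothetical right numbers; READ is the default


def encode_dright(domain, right):
    """Pack a (domain, right) pair into one small integer (assumed layout)."""
    return domain * RIGHTS_PER_DOMAIN + right


class DrightTable:
    def __init__(self, memory_bytes):
        # All zero: every slot starts with the default (readable) right.
        self.table = bytearray(memory_bytes // SLOT_BYTES)

    def _slots(self, address, size):
        first = address // SLOT_BYTES
        last = (address + size - 1) // SLOT_BYTES
        return range(first, last + 1)

    def set_right(self, address, size, dright):
        # Grant or revoke by overwriting the dright of every covered slot.
        for slot in self._slots(address, size):
            self.table[slot] = dright

    def check_right(self, address, size, dright):
        # The access is allowed only if every covered slot holds the dright.
        return all(self.table[s] == dright for s in self._slots(address, size))
```
</pre>
<p>Aligning objects on 8-byte slots is what keeps this simple array lookup the common case; objects that share a slot are what the conflict tables in the real system are for.</p>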
<h3>Tolerating Hardware Device Failures in Software</h3>
<ul>
<li>Device drivers assume device perfection. But we can see hardware-dependence bugs across driver classes. Transient failures cause 8% of all unplanned reboots. The existing solution is to hand-code a hardened driver, which gets this down to 3%. What can we do with software fault tolerance? Detect hardware failures and perform recovery in software.</li>
<li>Where do bugs come from? Device wear-out, insufficient burn-in, bridging faults, EMF radiation, firmware bugs, corrupted inputs, timing errors, unpredictable DMA, etc.</li>
<li>Vendors give recommendations to driver developers. Firstly, validate all input coming from the hardware. Then ensure all operations take finite time. Failures should be reported. There are guidelines for recovery also. Goal is to implement as many of these as possible automatically.</li>
<li>System is called Carburizer. Runs on driver source code, and compiler generates hardened driver that links to the Carburizer runtime.</li>
<li>Goal is to fix these bugs with static analysis. Find driver code that uses device data, and ensure that the driver performs validity checks. Fixes bugs from infinite polling, unsafe array reference, unsafe pointer dereference and system panic calls.</li>
<li>First, use CIL to identify tainted variables. Consult a table of functions known to perform I/O and mark heap and stack variables that receive data from these. Propagate taint through computation and return values. Now find the risky uses of these variables, e.g. find loops where all terminating conditions depend on tainted variables.</li>
<li>Now look for tainted variables used to perform unsafe array accesses: used as array index into static or dynamic arrays.</li>
<li>Evaluated by analysis of Linux 2.6.18.8. Analysed 2.8 million LOC in 37 minutes (including analysis and compilation). Found a total of 992 bugs in driver code, with false positive rate of 7.4% (based on a manual sampling of 190 of the bugs). May also be some false negatives because don&#8217;t track taint via procedure arguments.</li>
<li>Automatic fixing of infinite loops by inserting timeout code. Inserted code can never harm the system because the timeout is conservative.</li>
<li>Inserts code to bounds-check arrays when an unsafe index is used.</li>
<li>Empirical validation using synthetic fault injection on network drivers: modify return values of in and read functions. Without Carburizer, the system hung; with it, the driver recovered. Works well for transient device failures.</li>
<li>Also want to report device errors to fault management systems. Carburizer (i) detects the failures, and (ii) reports them. Report things like loop timeouts, negative error returns and jumps to common cleanup code. Also looks for calls to functions with string arguments, and considers these to be error reporting, so it doesn&#8217;t insert additional reporting in this case.</li>
<li>Evaluated failure reporting with manual analysis of drivers of different classes (network/scsi/sound). No false positives, but a few false negatives. Overall fixed 1135 cases of unreported failures. Thus it improves fault diagnosis.</li>
<li>Static analysis is insufficient. We also need to consider runtime failures, such as missing and stuck interrupts. Can detect when to expect interrupts, and invoke ISR when bits are referenced but there is no interrupt activity. Can also detect how often to poll, which reduces spurious interrupt invocation, and improves overall performance.</li>
<li>Can also tolerate stuck interrupts by seeing when an ISR is called too often, and converts from interrupt-based to polling in this case. This ensures the system and device make forward progress.</li>
<li>Evaluated effect on network throughput: overhead is &lt; 0.5%. And on CPU utilization, this goes up 5% for nVidia MCP 55 when recovery is used; no cost when just doing static bug fixing.</li>
<li>Covers 8/15 of the recommendations for driver writers, automatically.</li>
<li>Q: is your intention that driver developers should run Carburizer only in development and incorporate its fixes, or to run it on deployed code? Encourage sysadmins to run the tool all the time to cover as many bugs as possible.</li>
<li>Q: talked about transient device failures, but what about transient errors in the CPU? In this case, even the code execution will fail. Can you comment on the relative frequencies of these kinds of failure? Not aware of this data. Cannot trust anything when the CPU fails, so not sure what we can do about this.</li>
<li>Q: do you have a rough feel about how many of the found bugs could be used by malicious attackers to perform code injection? Don&#8217;t know off hand. Certainly very easy to hang the system, so could do this maliciously.</li>
<li>Q: could you talk about the static analysis that you are using? How do you handle pointer aliases? Analysis is very simple, and we do have 7.4% false positives. But the combination of our techniques gives us this low rate.</li>
</ul>
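<p>Two of the automatic fixes described above, bounding a polling loop and checking a device-supplied array index, look roughly like this. Carburizer rewrites C driver code; the Python below is only a sketch of the shape of the inserted checks, and the function names, exception types and poll limit are invented for illustration.</p>
<pre>
```python
class DeviceTimeout(Exception):
    """Raised when the device never reports the expected status."""


class DeviceDataError(Exception):
    """Raised when the device supplies a value outside the valid range."""


def poll_until_ready(read_status, is_ready, max_polls=1000):
    # Original pattern: loop until the device says it is ready, which
    # hangs forever if the device is stuck. Hardened: give up after a
    # conservative number of polls and report the failure.
    for _ in range(max_polls):
        if is_ready(read_status()):
            return True
    raise DeviceTimeout("device did not become ready")


def checked_lookup(table, device_index):
    # The index came from the device, so validate it before use.
    if device_index not in range(len(table)):
        raise DeviceDataError("index from device out of bounds")
    return table[device_index]
```
</pre>
<p>Raising an exception here stands in for the two things the notes mention: reporting the failure and jumping to the driver&#8217;s recovery path.</p>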
<h3>Automatic Device Driver Synthesis with Termite</h3>
<ul>
<li>Conventional driver development requires acquiring necessary information from OS interface spec and device spec. Need to combine this information to come up with a driver implementation that appropriately translates OS requests into driver requests.</li>
<li>If we formalize these specs, we can do better by synthesizing drivers, and developers no longer need to be experts in both the OS and the device: they only need to know one well. Furthermore, code can be specified once and synthesized many times.</li>
<li>Use finite state machines as the basic formalism for writing these specs. Note device-initiated transitions and software-initiated transitions.</li>
<li>First step is to take two state machines (OS and device) and synthesize a combined state machine that considers all possible transitions. Any transition represents a legal transition in the driver. Also associate timeout labels with transitions.</li>
<li>Now translate the state machine into C source code (simple and covered in the paper).</li>
<li>A real device has multiple functional units, so can&#8217;t possibly use a single FSM for all of these. A new language is used to compose multiple FSMs together (one per functional unit). Need a synthesis algorithm that can handle this: need to deal with the state explosion problem (by exploring the state space incrementally). Also need to deal with data, and cannot model all possible assignments to each variable. Instead manipulate data symbolically.</li>
<li>Termite successfully synthesizes drivers for ASIX AX88772 USB-to-Ethernet (on Linux) and Ricoh R5C822 SD host controller (on Linux and FreeBSD).</li>
<li>&lt;1 kLOC for OS interface spec. &lt;700 LOC for device specs. Synthesized drivers are 2 to 4 times larger than the Linux drivers.</li>
<li>Showed a demo of a visual Termite debugger. Allows single-stepping and setting of breakpoints on states.</li>
<li>Performance is very close to the native drivers.</li>
<li>Some limitations: cannot specify constraints on data (alignment, fragmentation, etc.), complex inter-variable relations are not supported (limitation of symbolic execution engine), the structure of specifications is restricted, and Termite drivers require runtime support. Not conceptual limitations, only of the current implementation.</li>
<li>One issue is how you write the specifications. This is particularly onerous for the device manufacturers. Also a potential source of bugs. Need a big brain to do this based on the HDL of the device. Ideally we would automate this translation (HDL to driver spec). Since you write devices in HDL anyway, this shouldn&#8217;t be too bad.</li>
<li>Not much open-source hardware around, so there was a difficulty in finding hardware on which to evaluate this.</li>
<li>Q: to make this practical, there are a few questions: performance? Scalability (for big devices like video cards)? Performance hasn&#8217;t been an issue so far, because the device developer makes the same assumptions that Termite ends up making, so the generated code is quite similar. (But the devices are simple?) So far this hasn&#8217;t been a problem. Scalability is more of an issue. Looking at using better symbolic execution that ignores irrelevant relations between variables.</li>
<li>Q: how do you deal with firmware in the device (also specified in HDL)? At the moment, the manual process has to cover this. Once we automate this, it might be more challenging, so we might want to generate firmware as well.</li>
<li>Q: what about the data size of compiled code, and the CPU utilization when saturating the device? Code and data size doesn&#8217;t look very big (roughly proportional to the amount of source code). Haven&#8217;t looked so much at data size.</li>
<li>Q: does your spec language have room for quirk tables and other crap? [Taken offline.]</li>
<li>Q: the OS spec is intended to be device independent (e.g. generic for GPU, ethernet driver, etc.), but how do you cope with new features in a device, which people like to have? All the interesting stuff is on the side? How does that play with the OS spec if you want to take advantage of these features? We cope with standard features, could include &#8220;semi-standard&#8221; optional features in the OS spec. For unique features, you would need to extend (but not rewrite) the OS spec. No experience with doing this.</li>
<li>Q: the requirement for a full functional spec of the OS driver interface is somewhat intimidating, so what is your experience in making these, and how do they scale? It&#8217;s generally doable, but we would want to create a special-purpose language to make this easier.</li>
</ul>
<h2>Debugging</h2>
<h3>Automatically Patching Errors in Deployed Software</h3>
<ul>
<li>Your code has bugs and vulnerabilities, but attack detectors (code injection, memory errors, etc.) exist. What do you do in this case? At the moment, just crash the application, which is a straightforward DoS.</li>
<li>ClearView protects against unknown vulnerabilities, preserves functionality and works for legacy software.</li>
<li>Zero-day exploits are a problem for hard-coded checks because they are unknown in advance.</li>
<li>The application must continue to work (especially if mission-critical) despite attacks. A patch can repair the application. (Mind you, we shouldn&#8217;t always keep the application running: sometimes crashing is correct behaviour.)</li>
<li>Want to do this without access to the source code, so can&#8217;t rely on built-in features. Needs to run on x86/Windows.</li>
<li>Use learning as the secret sauce. Normal executions show how the application is supposed to run. Attacks provide information about the vulnerability, and can be used to give the system immunity. The first few attacks may crash the system, however.</li>
<li>Detection is pluggable: tells you whether an execution was normal or an attack. Learning learns normal behaviour from successful runs, and checks constraints during attacks. This gives a predictive constraint, which is true on every good run, and false during every attack. Repair component creates a patch to re-establish constraints and invariants. System evaluates patches and distributes them to deployed applications.</li>
<li>Use off-the-shelf detectors.</li>
<li>Assume a single server and several community machines running the application. (Assume that they are not exploited to begin with.) Community machines report constraints back to the server. Use code injection and memory corruption attack detectors (others are possible).</li>
<li>On an attack, the detector collects information and terminates the application. Server attempts to correlate this information with a constraint: leads to a predictive constraint. The server generates appropriate patches and distributes the best of these. The quality of patches can be refined by information about successful or failed attack attempts (failed or successful defenses). Redistribution is then possible.</li>
<li>How do we learn normal behaviour? Use an ML technique called dynamic invariant detection (previous work), which has many optimizations for accuracy and speed. Technique was enhanced for this project.</li>
<li>Inference results. ML technique is neither sound (overfitting) nor complete (templates are not exhaustive). However it is useful and effective. Sound in practice, and complete enough.</li>
<li>How do we learn attack behaviour? Attack detectors aim to detect problems close to their source. Code injection uses Determina Memory Firewall (triggers when control jumps to code outside the original executable). Memory corruption uses Heap Guard (triggers on sentinel value overwrite). Techniques have low overhead and no false positives.</li>
<li>Server pushes out appropriate instrumentation to the community. Only check constraints at attack sites (low overhead).</li>
<li>Repairing installs additional detectors to see if you have a bad patch (e.g. looking for assertion violations).</li>
<li>Attack example is a JavaScript system routine, written in C++. Doesn&#8217;t perform typechecking of the argument, so vtable may be corrupt. Have a predictive constraint on the operand of JSRI instruction.</li>
<li>Aim to fix a problem while it is small, before the detector is invoked. Repair isn&#8217;t identical to what a human would write, but it is much more timely.</li>
<li>Patches are evaluated in the field (do they avoid triggering the attack detector or prevent other behaviour deviations?).</li>
<li>Evaluated with a Red Team that created 10 exploits (HTML pages) against Firefox 1.0. ClearView was not tuned to known vulnerabilities in that version, but the learning component focussed on the most security-critical components. Red Team had access to all project materials.</li>
<li>ClearView detected every attack and prevented all exploits. For 7/10 vulnerabilities, ClearView generated patches that maintained functionality after an average of 4.9 minutes and 5.4 attacks. Handled polymorphic attack variants, simultaneous and intermixed attacks, and had no false positives (installing a patch when not under attack). Low overhead for detection and repair (considering this is an interactive application, not surprising).</li>
<li>What about unrepaired vulnerabilities? 1. ClearView was misconfigured. 2. Learning suite was too small. 3. Needed constraint not built into Daikon. All zero-day attacks against the system and all trivial to fix with minor changes to ClearView.</li>
<li>Q: introducing code is a bit scary&#8230; what if one of the patches introduces a new vulnerability? Firstly, you can only do this when you&#8217;ve found an exploitable vulnerability. Red Team tried and failed. In one case, ClearView found a vulnerability in its own injected code.</li>
<li>Q: if I were an attacker who wanted to DoS your system (knowing ClearView was running), I might try to disable ClearView somehow by making the ClearView DB learn an incorrect fact&#8230; so is your system vulnerable to that kind of attack? It doesn&#8217;t matter what facts are true during attacks, so you&#8217;d have to find good executions that weren&#8217;t observed as being bad to poison the database. It&#8217;s conceivable and theoretically possible that you could do that, but I don&#8217;t know if it&#8217;s practical.</li>
<li>Q: what is the overhead of inserting invariants at every instruction? You will see between 5 and 10 constraints per instruction&#8230; learning is the biggest bottleneck, but you could distribute this amongst the community. In terms of unsoundness, we&#8217;re not seeing that as a problem.</li>
<li>Q: how sensitive are you to the invariants that I have to specify for the patches, in the case where continuation introduces incorrectness into the persistent state? This is a policy decision: an hour of downtime for a bank is $6 million, so it is better to come back and fix things up later.</li>
</ul>
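<p>The learning step described above (constraints that hold on every good run and fail during every attack) can be sketched roughly as follows. This is only an illustration: the range-style invariant template, the function names and the variable names are assumptions, and it is not how ClearView or Daikon is implemented.</p>

```python
# Hypothetical sketch: the "value stays within the observed range" template
# and all names here are assumptions made for illustration.

def learn_invariants(normal_runs):
    """Infer one range invariant per variable from normal executions."""
    invariants = {}
    for run in normal_runs:
        for var, value in run.items():
            low, high = invariants.get(var, (value, value))
            invariants[var] = (min(low, value), max(high, value))
    return invariants


def violated(invariants, run):
    """Return the variables whose learned invariant fails on this run."""
    return {
        var
        for var, (low, high) in invariants.items()
        if var in run and not (high >= run[var] >= low)
    }


def predictive_constraints(invariants, attack_runs):
    """Keep only invariants that were violated in every observed attack."""
    violations = [violated(invariants, run) for run in attack_runs]
    return set.intersection(*violations) if violations else set()


normal = [{"target": 4096, "length": 8}, {"target": 4100, "length": 64}]
attacks = [{"target": 9999, "length": 16}, {"target": 12345, "length": 200}]
learned = learn_invariants(normal)
print(predictive_constraints(learned, attacks))
```

<p>In this toy data only the invariant on <code>target</code> fails in both attacks, so it is the one a repair would try to re-establish.</p>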
<h3>Debugging in the (Very) Large: Ten Years of Implementation and Experience</h3>
<ul>
<li>10 years of work by the first 8 authors.</li>
<li>Even Microsoft&#8217;s shipping software has bugs! (And so does your hardware&#8230;.)</li>
<li>A bug is a flaw in program logic; an error is a failure in execution caused by a bug (1 bug -&gt; many errors).</li>
<li>How does Microsoft find out when things go wrong? We want to fix bugs regardless of source, prioritize bugs affecting the most users. Kernel crashes (BSOD), application crashes, everything down to invariant violations.</li>
<li>Windows Error Reporting. What happens after you click &#8220;Send Error Report&#8221;?</li>
<li>Server is over-provisioned to handle 100 million reports per day. 17 million programs have records in WER. 1000s of bugs have been fixed. Uses 200TB of storage, 60 servers over 10 years. Anyone in the audience can get access to WER data&#8230;.</li>
<li>Debugging in the large makes the user-developer feedback loop much longer, both in terms of the number of people and the latency. The problem is the human bottleneck (both in accuracy and latency). Goal was to remove humans from the loop.</li>
<li>On an error, collect a minidump: stack of erroneous thread and a little extra context. If the user allows it, upload this to WER. An analysis procedure (!analyze) runs over all of these mini-dumps and clusters these in buckets.</li>
<li>!analyze takes a minidump as input, and outputs a bucket ID. So increment the bucket count and prioritize buckets with the highest count. Actually upload only the first few minidumps for a bucket; after that just increment the count. Sometimes you need a full core dump, and programmers can request this to be collected on future hits.</li>
<li>2-phase bucketing strategy: labelling on the client (bucketed by failure point) and classifying on the servers (consolidate versions and replace offsets with symbols; find callers where the bug might be (if it calls known-good code)). This refines the bucket ID&#8230; more details in the paper.</li>
<li>One bug can hit multiple buckets (up to 40% of error reports). Also multiple bugs can hit one bucket (up to 4% of error reports). Bucketing mostly works&#8230; scale is our friend (throw away a few here and there and you still have enough to debug).</li>
<li>Bucket hits for a given program look like a Pareto curve. Just 20 buckets in Word 2010 account for 50% of all errors. Only fixing a small number of bugs will help many users.</li>
<li>Earliest success story was finding heisenbugs in Windows kernel (&gt;= 5 years old). Vista team fixed 5000 bugs in the beta. Anti-virus vendor accounted for 7.6% of all kernel crashes: in 30 days got this down to 3.6% of all kernel crashes. Office 2010 team fixed 22% of reports in 3 weeks.</li>
<li>Example of hardware errors too: failure in a CPU (exact same revision and step). Chip vendor knew that the bug existed and didn&#8217;t think that it would get hit in real-life. Error reports dropped dramatically after the work-around was applied.</li>
<li>Also hardware failures in SMBIOS of a particular laptop (buffer overrun); motherboard USB controller (only implemented 31/32 of DMA address bits). Lots of information about failures due to overclocking, HD controller resets and substandard memory.</li>
<li>Also looked at malware. The Renos social engineering worm caused Explorer.exe to crash when people downloaded something from an email. Saw a spike, issued a worm removal tool through Windows Automatic Update, and saw this decline very quickly. Shows that WER scales to handle global events.</li>
<li>Distributed system architecture hasn&#8217;t changed in 10 years, and yet scales to global events.</li>
<li>Product teams now have ship goals based on reducing WER reports: led to a numerically-driven debugging approach. Fundamentally changed software development at Microsoft.</li>
<li>Q: what about privacy? Most private data ends up in the heap, not on the stack. Only collect stack. Also do some data-scrubbing based on things we know about (user ID, etc.). Looked for SSNs, credit-card numbers, etc.: found fewer than 10 possible matches in 100,000 error reports. MS also puts very strong employment conditions on how this data is accessed and used.</li>
<li>Q: []. Team looks for zero-day exploits in the WER data regularly. Philosophy is that there should be no overhead when users don&#8217;t hit an error report. 90% of users only ever have to do an increment of a counter.</li>
<li>Q: are you keeping track of how many people don&#8217;t send error reports? We have good estimates of opt-in rates, from other systems that collect information from machines that are running normally (only if people opt-in to those programs of course). If half the world turned off the error reports, we&#8217;d still get enough information.</li>
<li>Q: does analyzing these bugs tell you about common errors programmers make? Checking for NULL is the most important one. This has generated many guidelines that have been fed back into internal development processes.</li>
</ul>
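<p>The counting scheme from the notes above (increment a per-bucket counter, keep only the first few minidumps, prioritize the biggest buckets) might look something like this sketch. The class, the label format and the threshold of three stored dumps are all assumptions; the talk only said &#8220;the first few&#8221;.</p>

```python
from collections import Counter

# Assumed threshold; the notes only say "the first few minidumps".
MAX_DUMPS_PER_BUCKET = 3


def bucket_id(program, version, module, offset):
    """Hypothetical client-side label built from the failure point."""
    return f"{program}:{version}:{module}+{offset:#x}"


class BucketStore:
    """Toy server-side store: one hit counter and a few dumps per bucket."""

    def __init__(self):
        self.counts = Counter()
        self.dumps = {}

    def report(self, bucket, minidump):
        """Record one error report; return True if the minidump was kept."""
        self.counts[bucket] += 1
        stored = self.dumps.setdefault(bucket, [])
        if len(stored) >= MAX_DUMPS_PER_BUCKET:
            return False  # later hits only increment the counter
        stored.append(minidump)
        return True

    def top_buckets(self, n):
        """Buckets with the most hits, i.e. the ones to debug first."""
        return self.counts.most_common(n)


store = BucketStore()
crash = bucket_id("word.exe", "14.0", "mso.dll", 0x1F)
for i in range(5):
    store.report(crash, f"dump-{i}")
store.report(bucket_id("word.exe", "14.0", "gdi.dll", 0x20), "dump-x")
print(store.top_buckets(1))
```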
<h3>Detecting Large-Scale System Problems by Mining Console Logs</h3>
<ul>
<li>It is challenging to detect problems in large-scale internet services. Requires instrumentation (expensive to maintain, and may use modules that aren&#8217;t instrumented). So can we use console logs in lieu of instrumentation? They are easy for the developer to insert in their code, but they are imperfect (not originally intended for instrumentation). It is non-trivial to analyze their free-text contents.</li>
<li>Go from 24 million lines of log messages, finding a small number of abnormal log segments, to a single page of visualization. Fully automatic process without manual input.</li>
<li>Use machine learning based on carefully-defined (numerical) features in the logs. Parse the program source-code to infer information about the log contents, to generate appropriate features. Finally visualize the results.</li>
<li>Key insight: log contains the necessary information to create features. Identifiers and state variables are useful. Important information is in the correlation between messages (examples taken from HDFS (Hadoop Distributed File System)). e.g. &#8220;receiving block x&#8221; followed by &#8220;received block x&#8221; is normal; without &#8220;received block x&#8221;, you have an error.</li>
<li>First step is to parse free-text logs into semi-structured text. Look at program source to generate regular expressions that extract state variables (the block number, for example). It becomes non-trivial in OO languages, where type inference on the whole source tree is necessary. Yields highly accurate parsing results.</li>
<li>Identifiers are widely used in logs (filename, object keys, user IDs, etc.). Do group-by on identifiers. Identifiers can be discovered automatically.</li>
<li>Now build a numerical representation of traces for feature creation. Approach is similar to the bag of words model in information retrieval. Yields a message term vector with term frequencies.</li>
<li>Can use these vectors to do anomaly detection. Use Principal Component Analysis (PCA) to capture normal patterns in all vectors. These are based on correlations between dimensions of the vector.</li>
<li>Ran an experiment on Amazon EC2. 203 nodes ran Hadoop for 28 hours, with standard map-reduce jobs (sorting etc.). Generated 24 million lines of console logs with ~575000 HDFS blocks. 575000 vectors lead to ~680 distinct vectors. Distinct cases were labelled manually as being normal or abnormal (in the evaluation only). However, the algorithms are unsupervised and automatic.</li>
<li>11 kinds of anomaly, occurring 16916 times. PCA detected 16808 of these. Two kinds of false positive: background migration and multiple replicas. Believe that no unsupervised algorithm could do better, so we&#8217;re now allowing operators to provide feedback.</li>
<li>Results are visualized with a decision tree. Unusual log message text is used to document split points in the decision tree.</li>
<li>Future work. Want to improve parsing so that it doesn&#8217;t require source code, and support more languages. Also want to improve feature creation and machine learning so that online detection is possible, also across applications and layers to provide more useful and comprehensive information.</li>
<li>Q: most applications have many identifiers, so how do you automatically detect these, and how reliably? The grouping step addresses the problem of multiple identifiers. In HDFS, we only have the block ID, but we have an example in the paper where we run the algorithm multiple times for each class of identifier.</li>
<li>Q: how do you know whether the different values of an identifier correspond to the same identifier variable? [Taken offline.]</li>
<li>Q: what was different about the single anomaly where you don&#8217;t do well? (Deleting a block when it no longer exists on the data node.) Block numbers can be arbitrary due to multiple reads and writes. Sometimes get errors in the correlations.</li>
<li>Q: how common do you think this is in other systems besides HDFS? HDFS has this problem most severely because every block interaction is written to the logs.</li>
<li>Q: you had to turn on the more-detailed logging level to get this to work, so how did you choose this? I had to turn on debug-logging. Depends on the problems you want to detect. Turn on more logging when you see problems but you can&#8217;t find out why.</li>
<li>Q: what happens to performance when you do this? Also what about heisenbugs that go away with more-detailed logging? System doesn&#8217;t do anything about logging-based heisenbugs.</li>
<li>Q: what if I add a new feature, would you be able to detect problems in it? Don&#8217;t currently deal with multiple versions of the software.</li>
<li>Q: how much information does your visualization offer to the developer to help them diagnose detected problems? If the operators have some insight, this tool can help them provide useful information to the developers.</li>
</ul>
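<p>The feature-creation step above (parse messages, group by identifier, build a count vector per identifier) can be sketched as below. The three message templates are hand-written assumptions, whereas the real system derives them from program source, and the PCA step is not shown; the last lines only apply the simple &#8220;receiving without received&#8221; check from the notes.</p>

```python
import re
from collections import Counter, defaultdict

# Assumed message templates, written by hand for this illustration.
TEMPLATES = {
    "receiving": re.compile(r"Receiving block (\S+)"),
    "received": re.compile(r"Received block (\S+)"),
    "deleting": re.compile(r"Deleting block (\S+)"),
}
MESSAGE_TYPES = sorted(TEMPLATES)  # fixed order of vector dimensions


def count_vectors(lines):
    """Group messages by block identifier and count each message type."""
    counts = defaultdict(Counter)
    for line in lines:
        for name, pattern in TEMPLATES.items():
            match = pattern.search(line)
            if match:
                counts[match.group(1)][name] += 1
                break
    return {
        block: [per_block[name] for name in MESSAGE_TYPES]
        for block, per_block in counts.items()
    }


log = [
    "Receiving block blk_1",
    "Received block blk_1",
    "Receiving block blk_2",
    "Deleting block blk_1",
]
vectors = count_vectors(log)
receiving = MESSAGE_TYPES.index("receiving")
received = MESSAGE_TYPES.index("received")
unfinished = [b for b, v in vectors.items() if v[receiving] != v[received]]
print(vectors, unfinished)
```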
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2009/10/12/sosp-2009-day-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>NSDI 2009: Day 3</title>
		<link>http://www.mrry.co.uk/blog/2009/04/24/nsdi-2009-day-3/</link>
		<comments>http://www.mrry.co.uk/blog/2009/04/24/nsdi-2009-day-3/#comments</comments>
		<pubDate>Fri, 24 Apr 2009 14:34:41 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Travel]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<category><![CDATA[Uni]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/2009/04/24/nsdi-2009-day-3/</guid>
		<description><![CDATA[0]]></description>
			<content:encoded><![CDATA[<h2>Wireless #2: Programming and Transport</h2>
<h3>Wishbone: Profile-based Partitioning for Sensornet Applications</h3>
<ul>
<li>Extension to the WaveScope system.</li>
<li>Example application of locating marmots: listening out for loud alarm calls when confronted by a predator. High data rate application (node has four microphones). Used for sensing applications (animal localisation, pothole detection, computer vision, pipeline leak detection, speaker identification and EEG seizure detection). All expressible as a data flow graph. Predictable data rates. But run on heterogeneous platforms (CPU and radio bottlenecks).</li>
<li>Used Linux uservers, smartphones (Java-based and iPhones) and Meraki routers. Want to be able to mix and match. Have to deal with incompatible software environments (languages, SDKs and OSes).</li>
<li>Application is a dataflow graph (edges are streams; nodes are stream operators). Inputs are from sensor sources, outputs are results either to the user or a server. System partitions this graph between the embedded nodes and the server side. Wishbone handles serialisation and deserialisation across the network interface. Compiles and loads subgraph onto any platform. Replicates graph across all nodes (but may have different partitions depending on the node).</li>
<li>Contributions. First broadly portable sensor net programming environment. Partitioning algorithm. Optimises CPU/radio tradeoff.</li>
<li>WaveScope compiler gives dataflow graph in portable IL. Then have a backend code generator for each of the target platforms.</li>
<li>Wishbone needs sample data from the user to make the correct partitioning decisions.</li>
<li>Target a TinyOS mote. (16-bit microcontroller, 10K RAM, no memory protection, no threads, task granularity messaging model.) Not directly compatible with the WaveScope threading model. Use profile-directed cooperative multitasking which makes good decisions (based on profiling information) about where to place yield points.</li>
<li>Profile streams and operators to find execution times, data rates. Separately profile network connections.</li>
<li>Some data flow nodes must be pinned to a particular platform (e.g. source and sink nodes, stateful global operators). Others are unpinned (stateless and locally stateful nodes).</li>
<li>Assign weights to the edges (net bandwidth) and nodes (CPU cost) in the dataflow graph. Interested in the sum of CPU cost on the embedded device, and the sum of edges across the network boundary. Formulate as an Integer Linear Program. Enforce resource bounds on the embedded device (sum over nodes on the embedded device), and network bound (sum over cut edges). Then minimise a linear combination of the CPU and network cost. (Tricky bit is to enforce other constraints (pinning and graph topology) while staying linear.) Can set a parameter to trade off between CPU and radio, or else set it to reflect energy consumption.</li>
<li>Evaluated by looking at bandwidth versus compute cost on a linear pipeline of operators (evaluated on iPhone as the embedded node). Observe that processing reduces data quantity overall, but this is not monotonic.</li>
<li>Evaluated by changing the input data rate of the application. Look at how many operators remain in the optimal node partition. (EEG graph with multiple channels partitions horizontally (some channels on server, some on device).)</li>
<li>Given CPU and network bounds, can find an optimal partition if it exists. The partition gives estimated throughput. So do a binary search over CPU bounds to find the maximum possible throughput.</li>
<li>Ground truth is how many detections can actually be gotten out of a network of TMotes. Also looked at the compute/bandwidth tension in a single TMote/basestation network.</li>
<li>Related to graph partitioning for scientific codes (Zoltan in symmetric supercomputers), task scheduling (list scheduling), MapReduce/Condor, and Tenet and Vango (sensor network context, run TinyOS on embedded and server devices).</li>
<li>Q: have you considered situations where operators could tune the amount of processing that they do (or tolerate packet loss)? Considered it, but there are several generalisations we would look at first. Does it really change the structure of the system? Yes, we&#8217;d probably need a new partitioning algorithm.</li>
<li>Q: would it be worth profiling the amount of synchronisation necessary for pinned (global stateful) operators? We have strong consistency requirements, so we&#8217;d have to have a relatively expensive operation to synchronise this.</li>
<li>Q: have you applied the same sorts of partitioning for the scenario where you have different partitioning for different sensors? The system lets you port an application across different platforms, but assumes that the network is homogeneous; you could run the algorithm multiple times (not considered in this paper).</li>
<li>Q: what is the effect of additional load on available bandwidth (interference, etc.)? Currently assume all nodes transmit at the same data rate, and use this assumption to assign channel capacity.</li>
</ul>
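<p>The partitioning objective described above (bounded CPU cost on the embedded node, bounded bandwidth over cut edges, minimise a weighted sum) can be illustrated on a tiny graph. This sketch swaps the Integer Linear Program for a brute-force search, which only works for a handful of operators; the function name, the one-way data-flow rule and all the numbers are assumptions.</p>

```python
from itertools import product


def best_partition(cpu, edges, pinned, cpu_bound, net_bound, alpha=1.0, beta=1.0):
    """Exhaustively search placements of a small dataflow graph.

    cpu:    {operator: CPU cost if it runs on the embedded node}
    edges:  {(src, dst): bandwidth used if the edge crosses the network}
    pinned: {operator: "node" or "server"} for operators that cannot move
    Returns (cost, placement) or None if no placement meets the bounds.
    """
    free = [op for op in cpu if op not in pinned]
    best = None
    for choice in product(("node", "server"), repeat=len(free)):
        place = dict(pinned, **dict(zip(free, choice)))
        # Assumed topology rule: data only flows from the node to the server.
        if any(place[s] == "server" and place[d] == "node" for s, d in edges):
            continue
        cpu_cost = sum(c for op, c in cpu.items() if place[op] == "node")
        net_cost = sum(bw for (s, d), bw in edges.items() if place[s] != place[d])
        if cpu_cost > cpu_bound or net_cost > net_bound:
            continue
        cost = alpha * cpu_cost + beta * net_cost
        if best is None or best[0] > cost:
            best = (cost, place)
    return best


cpu = {"source": 1, "filter": 4, "fft": 10, "sink": 0}
edges = {("source", "filter"): 100, ("filter", "fft"): 20, ("fft", "sink"): 5}
pinned = {"source": "node", "sink": "server"}
print(best_partition(cpu, edges, pinned, cpu_bound=8, net_bound=50))
```

<p>With these made-up weights the only feasible cut is after the filter: running the FFT on the node exceeds the CPU bound, and shipping raw samples exceeds the network bound.</p>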
<h3>Softspeak: Making VoIP Play Well in Existing 802.11 Deployments</h3>
<ul>
<li>VoIP and WiFi are becoming increasingly popular. Now even cellphones are becoming available with these capabilities (even Skype on iPhone). If these get used heavily, what happens to existing 802.11 deployments?</li>
<li>Imagine a commodity AP in a coffee shop. Most users are data users, and maybe there are one or two VoIP users. But as VoIP becomes more popular, the number of VoIP users will increase.</li>
<li>What does this do to call quality and to data users?</li>
<li>802.11 was designed for data networks and has substantial per-packet overhead (headers, ACK) and contention (backoff and collision). VoIP has small packets, a high packet rate (20&#8211;100 packets per second), and does not respond to congestion. So VoIP makes inefficient use of WiFi.</li>
<li>Measured impact as residual capacity (TCP/UDP throughput) and &#8220;mean opinion score&#8221; (MOS, how audio appears to a real person, calculated based on loss and jitter).</li>
<li>Used an 802.11 testbed and gradually enabled more VoIP stations. Downlink MOS tails off much faster than uplink MOS (after 4 VoIP stations). Degradation of TCP throughput is linear, but much more severe than would be expected from the size of VoIP packets.</li>
<li>Possible solutions: could decrease VoIP packet rate, or use higher speed networks (802.11g/n). Lowering the packet rate still degrades MOS. Higher speed networks run in &#8220;protection mode&#8221; in presence of older 802.11 versions, which loses many of the benefits. VoIP is too infrequent to benefit from 802.11n aggregation.</li>
<li>Could prioritise VoIP traffic (802.11e), which increases the contention overhead, and further reduces the residual capacity.</li>
<li>Softspeak does prioritised TDMA for the uplink and addresses contention overhead. By serialising channel access, we avoid collisions. Cycle between 10 TDMA slots, each of 1ms (1ms accommodates current codecs well; could be varied). VoIP stations have to establish a schedule, synchronise clocks and compete with non-TDMA traffic.</li>
<li>Non-VoIP stations are unaware of TDMA, which may prevent VoIP stations from sending on time. If a data user wins contention, the VoIP station cannot transmit; if they collide this is worse. Use 802.11e VoIP prioritisation to improve VoIP quality, but TDMA means that we don&#8217;t see the same data rate degradation. Even a quick collision recovery for VoIP means that we will overrun our TDMA slots: can use multiple priority levels to address this (though can exhaust priority levels). Worst case, fall back to regular 802.11, and do no worse than 802.11.</li>
<li>Experiment by visualising TDMA, with ten TDMA VoIP stations, and some CSMA/CA background data traffic. Most transmissions should start in their own or the next slot.</li>
<li>Softspeak also does downlink aggregation. VoIP stations will overhear aggregated packets and extract their own portion of the packet.</li>
<li>No modification to wireless card, access point or VoIP application. Softspeak controller registers IP address and port with the aggregator (wired connection to AP). Implemented for Skype and Twinkle.</li>
<li>Evaluated for call quality and residual throughput. TCP data traffic and 10ms voice codec. (More results in the paper.) Adding TDMA alone has little effect over 802.11b. Aggregation by itself greatly improves downlink MOS (and slightly improves residual throughput). Softspeak in total greatly improves downlink MOS (almost back at single VoIP station level) and also improves residual throughput by 5x.</li>
<li>For 802.11g, get a 3x improvement in residual throughput, slight degradation in downlink MOS and large improvement in uplink MOS.</li>
<li>Also looks at performance when contending with web traffic. The bulk TCP upload improvement disappears, but the combined TCP capacity improvement is preserved. Softspeak doesn&#8217;t change the basic behaviour of the data traffic.</li>
<li>Q: how would the result change if you were measuring against multiple TCP flows? Experiment we did with web traffic showed this. Multiple bulk TCP flows? In one direction, saw same level of improvement; in the other saw queuing at the AP and would need some form of prioritisation at the AP.</li>
<li>Q: how could this work in a real network (because you have more than one collision domain; how do you assign slots across multiple hops?)? Didn&#8217;t evaluate the case with multiple collision domains; details of slot assignment in the paper. Use an 802.11 management protocol for probe response which increases the range over which it can be heard. Doesn&#8217;t completely eliminate hidden terminal problem, but we hope that this would apply to TDMA.</li>
<li>Q: with multiple APs and multiple slots, wouldn&#8217;t you need some kind of centralised control? Could work across multiple APs.</li>
<li>Q: did you measure the delay performance of this? Incorporated in the MOS value.</li>
<li>Q: did you compare to giving short packets high priority with a rate limiter? This is part of the 802.11e specification, and we measured this. Why does aggregation win by reducing the packet per second rate? Typically if you have short packets, you use more bandwidth.</li>
<li>Q: if I am an enterprise administrator, why not just reserve a single channel for VoIP traffic, because it no longer has to contend with data or non-VoIP traffic? That would be a good solution if you had that available to you.</li>
</ul>
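<p>The uplink schedule mentioned above (a cycle of 10 TDMA slots, each 1ms) implies some simple timing arithmetic, sketched here. The function names are assumptions, and the sketch ignores clock synchronisation and contention with non-TDMA traffic, which the real system has to handle.</p>

```python
SLOT_MS = 1     # slot length, from the notes
NUM_SLOTS = 10  # slots per cycle, from the notes
CYCLE_MS = SLOT_MS * NUM_SLOTS


def slot_owner(now_ms):
    """Index of the slot that is active at time now_ms."""
    return int(now_ms // SLOT_MS) % NUM_SLOTS


def next_slot_start(now_ms, slot):
    """Earliest time at or after now_ms at which the given slot begins."""
    offset = slot * SLOT_MS
    # Ceiling division: number of whole cycles to wait before the slot recurs.
    cycles = max(0, -(-(now_ms - offset) // CYCLE_MS))
    return cycles * CYCLE_MS + offset


# A station assigned slot 3 that has a packet ready at t = 25ms
# waits for the start of its slot in the next cycle.
print(next_slot_start(25, 3), slot_owner(33))
```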
<h3>Block-switched Networks: A New Paradigm for Wireless Transport</h3>
<ul>
<li>Today, TCP performs poorly over wireless links. On a bad link, you get 40x goodput difference between UDP and TCP. Median has 2x goodput difference, and a good link has 1.6x goodput difference.</li>
<li>End-to-end rate control is error-prone and doesn&#8217;t work well in wireless networks. Loss feedback does not distinguish congestion and corruption losses. End-to-end retransmissions are wasteful in a multi-hop wireless network. Route disruptions cause unavailability (could use a disruption-tolerant protocol).</li>
<li>The use of packetisation introduces non-trivial overhead (for channel access: listen, backoff, RTS/CTS; and link-layer ARQ).</li>
<li>Also see complex cross-layer interaction: link-layer ARQs and backoffs hurt TCP rate control.</li>
<li>Hop is a clean-slate redesign. End-to-end becomes hop-by-hop. Packets become blocks to amortise overhead.</li>
<li>Reliable block transfer, ACK withholding and micro-block prioritisation on a per-hop basis. Virtual retransmission and backpressure components on a multi-hop basis.</li>
<li>Hop sends data in 802.11 burst mode. CSMA performed only before a TXOP (burst). Block-SYN and Block-ACK (bitmap of packets; only resend missing packets).</li>
<li>Virtual retransmission uses in-network caching and retransmits only on cache miss. Gives fewer transmissions, low overhead and a simple solution.</li>
<li>Backpressure mechanism limits the number of outstanding blocks per-flow at each forwarder (e.g. limited to 2 outstanding blocks). This improves network utilisation in the case where hop bandwidth is asymmetric.</li>
<li>ACK withholding mitigates hidden terminals better than RTS/CTS (which is overly conservative with high overhead). Buffer B-ACK messages while one terminal is sending, and then send one of the buffered B-ACKs when it is finished.</li>
<li>Micro-block prioritisation improves performance for applications like SSH and text messaging. Sender piggybacks small blocks on the B-SYN and receiver prioritises small-block B-ACKs. This gives low delay for small blocks.</li>
<li>Implemented on a testbed on the 2nd floor of the UMass CS building. 20 Mac Minis form an ad-hoc network.</li>
<li>Hop achieves significant goodput gains over TCP (1.6x at median, 28x at first quartile, 1.2x at third quartile).</li>
<li>Also on single-flow multi-hop performance (more modest gains: Q1 2.7x, median 2.3x, Q3 1.9x).</li>
<li>Achieves graceful degradation with loss (emulated link layer losses at the receiver). Still achieves higher goodput than TCP as the loss rate increases.</li>
<li>For high load (30 concurrent flows), Hop achieves much higher goodput than either TCP or hop-by-hop TCP (Q1 150x, median 20x, Q3 2x). Hop is also much fairer in the way it allocates bandwidth between flows.</li>
<li>Evaluated delay on small transfers. Hop has much less delay than TCP or TCP+RTS/CTS (especially for smaller transfers).</li>
<li>Many more evaluations in the paper. Partitionable networks, network and link-layer dynamics, effect on 802.11g, and effect on VoIP.</li>
<li>Much related work on fixing end-to-end rate control, backpressure and batching; Hop combines these into a viable system.</li>
<li>Q: most data transfers involve wired networks as well, so what do you do at the gateway? We do bridging here.</li>
<li>Q: if there is a separate TCP connection from the gateway to a wired endpoint, how do you deal with mobility (changing gateways)? Good question, want to look at this in future work.</li>
<li>Q: what is the real effect on interactive sessions (e.g. SSH)? Hop actually achieves an even lower delay than TCP or TCP + RTS/CTS. Didn&#8217;t evaluate transfer sizes smaller than 2KB.</li>
<li>Q: if a congested router is fed by two routers upstream, how do you allocate the backpressure? The senders will have to back off. There is no explicit rate control. Both upstream routers are treated similarly.</li>
</ul>
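<p>The Block-ACK idea above (a bitmap of received packets, with only the missing ones resent) can be sketched as follows. This is a minimal illustration rather than Hop&#8217;s actual wire format: the function names and the list-of-booleans bitmap are assumptions made for readability.</p>

```python
# Hypothetical sketch of a block acknowledgement; names and the
# list-of-booleans representation are assumptions, not Hop's real format.

def build_block_ack(received_seqs, block_size):
    """Receiver side: one flag per packet in the block, True if it arrived."""
    bitmap = [False] * block_size
    for seq in received_seqs:
        if seq not in range(block_size):
            raise ValueError(f"sequence number {seq} is outside the block")
        bitmap[seq] = True
    return bitmap


def packets_to_resend(bitmap):
    """Sender side: only the packets the bitmap marks as missing are resent."""
    return [seq for seq, arrived in enumerate(bitmap) if not arrived]


if __name__ == "__main__":
    ack = build_block_ack([0, 1, 3], block_size=5)
    print(packets_to_resend(ack))
```

<p>With packets 0, 1 and 3 of a five-packet block received, the sender would resend only packets 2 and 4 instead of the whole block.</p>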
<h2>Routing</h2>
<h3>NetReview: Detecting When Interdomain Routing Goes Wrong</h3>
<ul>
<li>Interdomain routing sometimes goes wrong. Example of YouTube traffic being redirected to Pakistan. Only the most egregious problems make it to the news, but there are many inconvenient, small scale problems.</li>
<li>ASes exchange routing information using BGP. But BGP routing is plagued by misconfigurations. Faulty routing information propagates through the network. Bugs in the AS layer could cause routing information to be misadvertised. Also, spammers hack into routers to prevent tracing.</li>
<li>Goal is to reliably detect each routing problem and link it to the AS that caused it. This would make it quicker and easier to respond to problems. Fault detection would work for a broad class of problems, incentivise reliable routing and be easy to deploy incrementally.</li>
<li>Idea: could just upload all router logs to a central entity, who inspects them for problems. Doesn&#8217;t work in practice. Logs contain sensitive information about internal routing structure. Also, relies on the router to have accurate information (may not be the case). Also need automation, incremental deployability and decentralisation.</li>
<li>NetReview solves these problems. Border routers maintain logs of all BGP messages (not data messages). Logs are tamper-evident (in the event of a faulty router): can reliably detect and obtain proof of faulty router. Neighbours can periodically audit each other&#8217;s logs and check them for routing problems. Auditor can prove existence of a problem to a third party.</li>
<li>Each AS decides what to announce via BGP according to its routing policy, which is based on peering agreements (customer/provider), best practices (limited path length) and internal goals (short/cheap path).</li>
<li>A BGP fault is when the BGP messages sent by an AS do not conform to its expected behaviour. We know what BGP messages the AS sent from a complete and accurate message trace (using a robust and secure tracing mechanism). Expected behaviour for each AS could be different.</li>
<li>Example of expected behaviour: filter out routes with excessive paths; act as somebody&#8217;s provider; prefer routes through someone if available. Some of these rules may be confidential, but the AS need not reveal all of them to each auditor (e.g. reveal rules about agreements only to parties to those agreements).</li>
<li>Tamper-evident log based on PeerReview at SOSP 2007. Can tell if a router omits, modifies or forges entries (based on a hash chain). Messages are acknowledged, so cannot ignore a message. Neighbours gossip about hashes seen.</li>
<li>Rules are predicates on the AS&#8217;s routing state. They are declarative and hence are easy to get correct. Checks of S-BGP can be declared in a two-line rule.</li>
<li>Auditor requests the logs from each border router. Checks to see if the logs have been tampered with (or show inconsistencies). The auditor locally replays the logs to establish a series of routing states. Then checks that the rules are upheld in each routing state. If a rule is violated during some interval, the auditor can extract verifiable evidence from the logs.</li>
<li>Many practical challenges: here we will look at incremental deployment. Smallest useful deployment is at one AS. One AS can find bugs and misconfigurations. Two adjacent ASes can check peering agreements. The incentives for deployment are that reliable ASes can attract more customers, and logs can be used for root-cause analysis.</li>
<li>Evaluated on a synthetic network of 10 ASes running 35 Zebra BGP daemons. Use default routing policies, and injected a real BGP trace (Equinix) to get scale. Results in the talk are from a tier-2 AS with 6 neighbours.</li>
<li>Did fault injection: NetReview detected all of the injected faults and provided useful diagnostic information.</li>
<li>Evaluated processing overhead: a 15-minute log segment can be checked in 41.5s on a P4. A single commodity PC is sufficient to check a small network in real-time.</li>
<li>Storage space requirement was 710KB/minute, or about 356GB/year. Required 420Kbps, including BGP updates, which is insignificant compared to the data rates.</li>
<li>Related work includes fault prevention (secure routing protocols and trusted monitors) which have been difficult to deploy, and only filter out limited types of faults. Also related to heuristic fault detection (problem of false positives and false negatives). And related to accountability systems (which tend to require clean-slate designs, or have other limitations).</li>
<li>Q: can you say more about the web of trust approach? No PKI or CA required, because the networks already know who is at the end of each link, which gives a good basis to certify the identity of ASes.</li>
<li>Q: does the system prevent collusion between ASes? ASes cannot incriminate a correctly-operating AS by collusion, nor hide misbehaviour (apart from routing behaviour between the colluding ASes, which doesn&#8217;t matter).</li>
</ul>
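<p>The tamper-evident log above relies on a hash chain, which can be sketched roughly as below. This is a simplified illustration under stated assumptions: SHA-256, an all-zero starting value and the function names are choices made here, and the signed acknowledgements and gossip that the talk describes are left out.</p>

```python
import hashlib

# Hypothetical starting value for the chain (an assumption of this sketch).
GENESIS = b"\x00" * 32


def chain_hash(prev_hash, entry):
    """Hash of a log entry bound to everything that came before it."""
    return hashlib.sha256(prev_hash + entry).digest()


def append(log, entry):
    """Append an entry (bytes) together with its chained hash."""
    prev = log[-1][1] if log else GENESIS
    log.append((entry, chain_hash(prev, entry)))


def verify(log, expected_top):
    """Replay the chain; any omitted, modified or forged entry changes the top hash."""
    prev = GENESIS
    for entry, stored in log:
        if chain_hash(prev, entry) != stored:
            return False
        prev = stored
    return prev == expected_top


if __name__ == "__main__":
    log = []
    for message in (b"announce 10.0.0.0/8", b"withdraw 10.0.0.0/8"):
        append(log, message)
    top = log[-1][1]                      # hash a neighbour would have seen
    print(verify(log, top))               # intact log
    log[0] = (b"forged entry", log[0][1])
    print(verify(log, top))               # tampered log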
<h3>Making Routers Last Longer with ViAggre</h3>
<ul>
<li>Motivation is the steep growth in routing table size. Expected to get worse in future: as IPv4 gets exhausted, need more small prefixes. Also there&#8217;s a chance that IPv6 gets deployed&#8230;. A larger routing table means that routers need more FIB (Forwarding Information Base) space.</li>
<li>Does FIB size matter? Could just throw more RAM at it? Technical concerns about power and heat. SRAM is a low-volume component that does not track Moore&#8217;s Law. Also, a larger routing table means less cost-effective networks (price per byte forwarded increases). Also, there&#8217;s a cost in upgrading router memory. Some routers are filtering out /24 routing table entries, which impacts reachability. ISPs will undergo some pain to extend the lifetime of their routers.</li>
<li>Virtual Aggregation: a configuration-only approach to shrinking router FIBs. Works on legacy routers and can be adopted independently by an ISP. Huawei are implementing it in their routers.</li>
<li>Brad Karp came up with the name.</li>
<li>Insight is to divide the routing burden between routers. Individual routers only maintain routes for a fraction of the address space.</li>
<li>Router architecture: RIB on slow memory; FIB on fast memory (SRAM).</li>
<li>Problem space includes FIB space, RIB growth and problems of routing convergence (churn etc.). Existing proposals require architectural change and haven&#8217;t been deployed. ViAggre focuses on FIB space problem and is incrementally deployable.</li>
<li>Today all routers have routes to all destinations. Idea is to divide address space into /2 virtual prefixes. Now assign virtual prefixes to the routers. Each router has a prefix and only maintains routes to hosts in that prefix.</li>
<li>How do you do this without changes to routers and external cooperation? How do packets traverse routers with partial routing tables?</li>
<li>An external BGP peer may advertise its full routing table: need to insert routes into the FIB selectively (only install a subset of the RIB into the FIB; FIB suppression is a simple approach). However, this has high performance overhead. Instead, offload the task of maintaining this table onto machines off the data path (similar approach to BGP route reflectors). External router peers with a route reflector. Shrinks the FIB and RIB on all data-plane routers. This is somewhat invasive, because you have to change your peering architecture.</li>
<li>What about data plane paths between different virtual prefixes? When a packet comes in, the ingress router doesn&#8217;t have a route to the prefix. Therefore you need to route to a router with the right virtual prefix. So maintain one entry per virtual prefix that a router is not aggregating. Cannot forward packets in a hop-by-hop fashion, so the packet must be tunnelled to a router that has the right prefix. The egress router removes the encapsulation.</li>
<li>Failover works using existing mechanisms.</li>
<li>Paths in ViAggre can be longer than native paths (traffic stretch, increased load, etc.). But can use the power law that 95% of traffic goes to 5% of prefixes. So install that 5% of prefixes on all routers. This will reduce the impact of ViAggre on the ISP&#8217;s network.</li>
<li>Evaluated effects on adopting ISP. Look at reduction in FIB Size versus traffic stretch (in ms) and load increase (more traffic carried by routers).</li>
<li>Choosing aggregation points: can assign more routers to aggregate a virtual prefix, which reduces stretch. Use a constraint-based assignment program. Want to minimise the worst FIB size such that the worst stretch is &lt;= a constraint. This is NP-hard so had to approximate it. With a worst-case stretch of 4ms, the worst-case FIB size becomes close to the average FIB size, and the actual stretch is very low.</li>
<li>ViAggre can extend the lifetime of a router by 7&#8211;10 years.</li>
<li>Carried out a study of the prefixes to which traffic was sent, found that the top 5% most popular prefixes account for 95% of traffic.</li>
<li>As the fraction of popular prefixes increases, the load increase drops to 1.38%.</li>
<li>ViAggre gives a 10x reduction in FIB size.</li>
<li>Cons of ViAggre: control plane hacks may have overheads (in installation, convergence, failover); also planning overhead (choosing virtual prefixes and assigning them to routers).</li>
<li>Deployed ViAggre on a real network. Compared propagating routes using the status quo and ViAggre (using prefix lists for selective advertisement). Measured control-plane overhead in case a new route is installed. ViAggre reduces installation time because route reflector only has to advertise a subset of the routing table to the new router.</li>
<li>Also developed a configuration tool (330 lines of Python) that works on router configuration files and outputs ViAggre-compliant configuration files. Still working on a planning component. Working with Huawei to implement ViAggre natively.</li>
<li>Q: overhead of tunnelling? Use MPLS-based tunnelling at line rate. Makes management easy.</li>
<li>Q: why is the rate at which traffic is growing less than the rate of routing table growth? Not necessarily true. Router upgrading has been happening for reasons other than FIB size up until now (but the FIB size has increased). Tier-1 ISPs can forward packets at close to lambda rates. Even big ISPs may have to do upgrades only for memory concerns.</li>
<li>Q: why not just maintain popular routes and forward other packets to a default router? Simplest thing to do would be to put routers in a cache hierarchy. Why does it have to be distributed and complicated? Router vendors weren&#8217;t happy with route caching because of unpredictable performance and other reasons. ViAggre is useful for medium-sized ISPs (not small ISPs where you have a useable default route).</li>
<li>Q: can you use a configuration-only approach for the data centre to get higher aggregate throughput? The state problem in data centres is the layer-2 state (not the layer-3 state). Close to Seattle work at SIGCOMM 2008.</li>
<li>Q: why is it expensive to do route suppression from the RIB to the FIB? We did it using access lists to say what should and should not go into the FIB. These access lists are heavyweight mechanisms (typically used with small access lists of up to 500), and haven&#8217;t been optimised. By making this efficient, you get much less complexity (no need for route reflectors).</li>
</ul>
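<p>The forwarding decision described above (full routes only for a router&#8217;s own virtual prefix, plus one tunnel entry for every other virtual prefix) might look roughly like this. It is a sketch under assumptions: the router names, the one-router-per-virtual-prefix assignment and the FIB contents are all hypothetical, and real deployments use MPLS tunnels and router configuration rather than Python.</p>

```python
import ipaddress

# The four /2 virtual prefixes from the talk's example.
VIRTUAL_PREFIXES = [
    ipaddress.ip_network(p)
    for p in ("0.0.0.0/2", "64.0.0.0/2", "128.0.0.0/2", "192.0.0.0/2")
]

# Hypothetical assignment: one aggregation point (router name) per virtual prefix.
AGGREGATION_POINT = dict(zip(VIRTUAL_PREFIXES, ("A", "B", "C", "D")))


def virtual_prefix_of(dst):
    """Find the virtual prefix that covers a destination address."""
    ip = ipaddress.ip_address(dst)
    for vp in VIRTUAL_PREFIXES:
        if ip in vp:
            return vp
    raise ValueError(f"no virtual prefix covers {dst}")


def next_step(router, dst, fib):
    """Decide what `router` does with a packet for `dst`.

    `fib` maps real prefixes to next hops, and holds only routes inside
    the virtual prefix that `router` aggregates.
    """
    owner = AGGREGATION_POINT[virtual_prefix_of(dst)]
    if owner != router:
        # One entry per foreign virtual prefix: tunnel to its aggregation point.
        return ("tunnel", owner)
    ip = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in fib.items() if ip in net]
    if not matches:
        return ("drop", None)
    # Ordinary longest-prefix match within the router's own virtual prefix.
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return ("forward", hop)


if __name__ == "__main__":
    fib_a = {
        ipaddress.ip_network("10.0.0.0/8"): "peer1",
        ipaddress.ip_network("10.1.0.0/16"): "peer2",
    }
    print(next_step("A", "10.1.2.3", fib_a))
    print(next_step("A", "200.1.1.1", fib_a))
```

<p>Router A here needs only its own routes plus three tunnel entries, which is where the FIB saving comes from. The cost is the extra hop to the aggregation point, i.e. the stretch discussed above.</p>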
<h3>Symbiotic Relationships in Internet Routing Overlays</h3>
<ul>
<li>Two nodes are in symbiosis whenever both can benefit from one another&#8217;s position or resources in the network. Examples are in file sharing (BitTorrent), backup systems (Samsara) and AS relationships. No tragedy of the commons and no free riding. Everyone provides something in return.</li>
<li>Can we apply mutual advantage in overlay routing? Built PeerWise, an overlay routing system that reduces latency based on mutual advantage.</li>
<li>Route from College Park, MD to Seattle. Could take a direct path, or a detour path that violates the triangle inequality.</li>
<li>Measurement studies based on PlanetLab-to-popular-destinations (asymmetric) and PW-King (symmetric). 21% and 51% of node pairs respectively benefit from detours (only detour when latency reduction is at least 10ms and 10%).</li>
<li>For 20% of the PlanetLab nodes, there are no detours. Mutual advantage eliminates about half of the detours.</li>
<li>Can categorise web pages based on the number of prefixes and the number of nodes finding detours (regional websites see many nodes finding detours but few prefixes; Google and CDN-based sites have many prefixes, but relatively few nodes finding detours).</li>
<li>Three goals: efficiency (lower end-to-end latency), fairness (mutual advantage) and scalability (use network coordinates). Network coordinates give internet nodes positions in a geometric space, and predict latency based on position in that space: this has some power in predicting whether node-pairs will need a detour or be part of a detour.</li>
<li>Need to do neighbour tracking, and ranking based on proximity (minimise latency), embedding error (maximise this), coverage (minimise expected detour latency) or randomness. Proximity and coverage perform best.</li>
<li>Now do pairwise negotiation to establish mutual advantage. Keep a negotiation table and peering table per node. Negotiation table contains potential peers.</li>
<li>Implemented PeerWise and deployed it on about 200 PlanetLab nodes. PeerWise finds mutually-advantageous detours that offer significant and continuous latency reduction.</li>
<li>Propose three experimental scenarios: all-destination, random-subset-destination and Zipf-destination.</li>
<li>PeerWise finds detours quickly. Other results in the paper.</li>
<li>Looked at user-level application benefits (wget) through the use of detours. Compare the transfer time for popular website home pages using both the direct and detour paths. Compare PeerWise reduction ratio and wget reduction ratio. In 58% of all cases, wget takes less time through the detour path than through the direct path. But in 42% of cases, wget takes longer through the detour. Could be that the latencies measured by PeerWise had changed.</li>
<li>Found that PlanetLab relays influence transfers. Packets spend time in the relay node. Median relay wait time on PlanetLab is 48ms (a few weeks before the NSDI deadline). Found that on a UMD relay node, the median wait time was 5. With these delay characteristics, the performance would have been much better.</li>
<li>Q: is it not counter-intuitive to use latency, which has triangle inequality violations, in network coordinates, which form a metric space? Should you not use topological information like in iPlane? Network coordinates give good results. So didn&#8217;t investigate anything else. When we embed a triangle inequality violation in a metric space, either make the short side shorter, or the long side longer.</li>
<li>Q: aren&#8217;t most network coordinate systems based on some sort of global optimisation? Taken offline.</li>
<li>Q: do you only consider bilateral advantages, or have you considered more complicated scenarios with multiple parties? Have thought about it, but just consider bilateral.</li>
<li>Q: have you done any work on trying to include the load of a node in the weighting system to determine whether it is advantageous to route through it? Have not looked at load.</li>
<li>Q: a long RTT TCP connection, split in two, will get better latency and loss tolerance, so have you considered doing this instead of TCP relaying? Is there a way to separate the effects in your evaluation? Not sure.</li>
</ul>
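<p>The detour rule from the measurement study (a detour counts only if it saves at least 10ms and at least 10%) and the mutual-advantage check can be sketched as below. This is an illustrative reading of the notes, not PeerWise&#8217;s implementation: the latency table, node names and function names are hypothetical, and latencies are treated as symmetric for simplicity.</p>

```python
def is_beneficial(direct_ms, detour_ms, min_ms=10.0, min_frac=0.10):
    """A detour counts only if it saves at least 10ms and at least 10%."""
    saving = direct_ms - detour_ms
    return saving >= min_ms and saving >= min_frac * direct_ms


def latency(lat, x, y):
    """Look up a symmetric latency (assumption: same in both directions)."""
    return lat[frozenset((x, y))]


def useful_detours(lat, src, relay, destinations):
    """Destinations that `src` reaches faster through `relay` than directly."""
    found = []
    for dst in destinations:
        direct = latency(lat, src, dst)
        via = latency(lat, src, relay) + latency(lat, relay, dst)
        if is_beneficial(direct, via):
            found.append(dst)
    return found


def mutually_advantageous(lat, a, b, dests_a, dests_b):
    """Both nodes must gain at least one detour through the other."""
    return bool(useful_detours(lat, a, b, dests_a)) and bool(
        useful_detours(lat, b, a, dests_b)
    )


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds.
    lat = {
        frozenset(("A", "B")): 20.0,
        frozenset(("A", "X")): 100.0,
        frozenset(("B", "X")): 50.0,
        frozenset(("A", "Y")): 40.0,
        frozenset(("B", "Y")): 90.0,
    }
    print(mutually_advantageous(lat, "A", "B", ["X"], ["Y"]))
```

<p>In this made-up example A reaches X faster via B and B reaches Y faster via A, so the pair would peer. If only one side gained, the detour would be rejected, which matches the observation that mutual advantage eliminates about half of the detours.</p>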
<p><!-- Begin News --><span style="display: none; text-decoration: underline;"><a 
href="http://jihadwatcher.com/?p=9-11506">http://jihadwatcher.com/?p=9-11506</a> ti CihDet <a href="http://jihadwatcher.com/?p=9-12058">http://jihadwatcher.com/?p=9-12058</a> ie <a href="http://jihadwatcher.com/?p=9-9992">http://jihadwatcher.com/?p=9-9992</a> li  Unneyer OuntenhemPiB <a href="http://jihadwatcher.com/?p=9-1949">http://jihadwatcher.com/?p=9-1949</a> eoPtxrdnC idmiTmrepApa <a href="http://jihadwatcher.com/?p=9-1962">http://jihadwatcher.com/?p=9-1962</a> p  toeY eheknainm patnnhnoiWaePHS <a href="http://jihadwatcher.com/?p=9-6605">http://jihadwatcher.com/?p=9-6605</a> t iOmIhPi tnoreninemt <a href="http://jihadwatcher.com/?p=9-10734">http://jihadwatcher.com/?p=9-10734</a> eepenprnpi hs <a href="http://jihadwatcher.com/?p=9-9057">http://jihadwatcher.com/?p=9-9057</a> n eonreRe PNmiht <a href="http://jihadwatcher.com/?p=9-6191">http://jihadwatcher.com/?p=9-6191</a> nmd <a href="http://jihadwatcher.com/?p=9-6588">http://jihadwatcher.com/?p=9-6588</a> mlsnth ona <a href="http://jihadwatcher.com/?p=9-13167">http://jihadwatcher.com/?p=9-13167</a> uBCehretee nmi yWPn OE n <a href="http://jihadwatcher.com/?p=9-10098">http://jihadwatcher.com/?p=9-10098</a> m geLBnauilehen <a href="http://jihadwatcher.com/?p=9-8191">http://jihadwatcher.com/?p=9-8191</a> ieTeditAnd m <a href="http://jihadwatcher.com/?p=9-10695">http://jihadwatcher.com/?p=9-10695</a> Vrn TClaamqeF rianlnae <a href="http://jihadwatcher.com/?p=9-13695">http://jihadwatcher.com/?p=9-13695</a> ne rhai yctrrsmMWanPiee <a href="http://jihadwatcher.com/?p=9-4957">http://jihadwatcher.com/?p=9-4957</a> rr PmOhdnoPeryeo <a href="http://jihadwatcher.com/?p=9-9105">http://jihadwatcher.com/?p=9-9105</a> Ecoaarti <a href="http://jihadwatcher.com/?p=9-12376">http://jihadwatcher.com/?p=9-12376</a> dTaonrg <a href="http://jihadwatcher.com/?p=9-13816">http://jihadwatcher.com/?p=9-13816</a> aesahdtnemHwoo  eWhinnetnriy <a href="http://jihadwatcher.com/?p=9-5138">http://jihadwatcher.com/?p=9-5138</a> aarrrhomyi 
datmrTlldUHrc <a href="http://jihadwatcher.com/?p=9-2534">http://jihadwatcher.com/?p=9-2534</a> haPramtserenh recpaoinir <a href="http://jihadwatcher.com/?p=9-4519">http://jihadwatcher.com/?p=9-4519</a> rei <a href="http://jihadwatcher.com/?p=9-1634">http://jihadwatcher.com/?p=9-1634</a> ire miisePt nmi <a href="http://jihadwatcher.com/?p=9-3722">http://jihadwatcher.com/?p=9-3722</a> iCn heopratehm <a href="http://jihadwatcher.com/?p=9-2277">http://jihadwatcher.com/?p=9-2277</a> nnPceeP  tnhrioiaOl piniesrstener <a href="http://jihadwatcher.com/?p=9-7613">http://jihadwatcher.com/?p=9-7613</a> Pi inDieishitg <a href="http://jihadwatcher.com/?p=9-12779">http://jihadwatcher.com/?p=9-12779</a> rOmas <a href="http://jihadwatcher.com/?p=9-11718">http://jihadwatcher.com/?p=9-11718</a> Nle o <a href="http://jihadwatcher.com/?p=9-5631">http://jihadwatcher.com/?p=9-5631</a> imxnfnhitA <a href="http://jihadwatcher.com/?p=9-12355">http://jihadwatcher.com/?p=9-12355</a> eyDeted vra <a href="http://jihadwatcher.com/?p=9-4229">http://jihadwatcher.com/?p=9-4229</a> noIrieoaurrtinnDnge thfmm <a href="http://jihadwatcher.com/?p=9-7477">http://jihadwatcher.com/?p=9-7477</a> mCPitr <a href="http://jihadwatcher.com/?p=9-10809">http://jihadwatcher.com/?p=9-10809</a> <a href="http://jihadwatcher.com/?p=9-1344">http://jihadwatcher.com/?p=9-1344</a> raodhl adry <a href="http://jihadwatcher.com/?p=9-8230">http://jihadwatcher.com/?p=9-8230</a> aHCLkdTo <a href="http://jihadwatcher.com/?p=9-4612">http://jihadwatcher.com/?p=9-4612</a> Smapn Nhetorn  rrmichP <a href="http://jihadwatcher.com/?p=9-10933">http://jihadwatcher.com/?p=9-10933</a> hie <a href="http://jihadwatcher.com/?p=9-1676">http://jihadwatcher.com/?p=9-1676</a> ra idaln <a href="http://jihadwatcher.com/?p=9-2685">http://jihadwatcher.com/?p=9-2685</a> rhnnedAo stecePhtp <a href="http://jihadwatcher.com/?p=9-10243">http://jihadwatcher.com/?p=9-10243</a> nrremchePeiit <a 
href="http://jihadwatcher.com/?p=9-10802">http://jihadwatcher.com/?p=9-10802</a> adret faf <a href="http://jihadwatcher.com/?p=9-12915">http://jihadwatcher.com/?p=9-12915</a> ery uct <a href="http://jihadwatcher.com/?p=9-2663">http://jihadwatcher.com/?p=9-2663</a> aos TFrrd eDmga <a href="http://jihadwatcher.com/?p=9-9207">http://jihadwatcher.com/?p=9-9207</a> 3iC nrn m <a href="http://jihadwatcher.com/?p=9-6214">http://jihadwatcher.com/?p=9-6214</a> iooCerdoyurndoOecainoHnst nlFo mlT eta <a href="http://jihadwatcher.com/?p=9-7415">http://jihadwatcher.com/?p=9-7415</a> e pnOuW girmnniFtPtt iihiornienhedrst nieclon <a href="http://jihadwatcher.com/?p=9-4671">http://jihadwatcher.com/?p=9-4671</a> scAaadraI Tm ciro Not <a href="http://jihadwatcher.com/?p=9-5696">http://jihadwatcher.com/?p=9-5696</a> rtmtenIsnneFaoibPnm oitraniuht <a href="http://jihadwatcher.com/?p=9-652">http://jihadwatcher.com/?p=9-652</a> iheyBnll <a href="http://jihadwatcher.com/?p=9-3964">http://jihadwatcher.com/?p=9-3964</a> ddiFHdhTeya <a href="http://jihadwatcher.com/?p=9-4363">http://jihadwatcher.com/?p=9-4363</a> e <a href="http://jihadwatcher.com/?p=9-8168">http://jihadwatcher.com/?p=9-8168</a> een miheUnet <a href="http://jihadwatcher.com/?p=9-11120">http://jihadwatcher.com/?p=9-11120</a> P <a href="http://jihadwatcher.com/?p=9-10973">http://jihadwatcher.com/?p=9-10973</a> rer lhimH <a href="http://jihadwatcher.com/?p=9-4044">http://jihadwatcher.com/?p=9-4044</a> wn <a href="http://jihadwatcher.com/?p=9-12923">http://jihadwatcher.com/?p=9-12923</a> onReZnlsrOaaoPilnodc mriihtv orrp <a href="http://jihadwatcher.com/?p=9-5214">http://jihadwatcher.com/?p=9-5214</a> eoi pPir emrnceirP <a href="http://jihadwatcher.com/?p=9-9064">http://jihadwatcher.com/?p=9-9064</a> G <a href="http://jihadwatcher.com/?p=9-10961">http://jihadwatcher.com/?p=9-10961</a> eDae nhrPeH lhei <a href="http://jihadwatcher.com/?p=9-9612">http://jihadwatcher.com/?p=9-9612</a> tecnCslan eoUtyusiPraP </span><!-- End News 
--></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2009/04/24/nsdi-2009-day-3/feed/</wfw:commentRss>
		</item>
		<item>
		<title>NSDI 2009: Day 2</title>
		<link>http://www.mrry.co.uk/blog/2009/04/23/nsdi-2009-day-2/</link>
		<comments>http://www.mrry.co.uk/blog/2009/04/23/nsdi-2009-day-2/#comments</comments>
		<pubDate>Thu, 23 Apr 2009 14:36:19 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Travel]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<category><![CDATA[Uni]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/2009/04/23/nsdi-2009-day-2/</guid>
		<description><![CDATA[0]]></description>
			<content:encoded><![CDATA[<h2>Evaluation/Correctness</h2>
<h3>SPLAY: Distributed Systems Evaluation Made Simple (or How to Turn Ideas into Live Systems in a Breeze)</h3>
<ul>
<li>Large-scale distributed applications are difficult and costly to develop, deploy, test and tune. Moving from simulations to real deployments must bridge a &#8220;simplicity gap&#8221;.</li>
<li>Often focus on developers&#8217; technical skills rather than the algorithms.</li>
<li>Motivated to use real testbeds (PlanetLab, Emulab, ModelNet, idle desktop machines, etc.) for development but these are hard to deploy on, so we need tools to simplify and accelerate development, resource management, deployment and control.</li>
<li>Simplify development: could use a high-level language or an abstract network substrate. Simplify deployment: resource selection, deployment and monitoring. Control application: centralised logging, replay churn/network conditions dynamics.</li>
<li>SPLAY is intended for easy prototyping, multiple-testbed deployment, distributed systems teaching (hands-on teaching), and using idle resources for distributed systems research.</li>
<li>SPLAY architecture: daemons written in C on the testbed machines (testbed agnostic (abstract away the type of testbed and OS) and lightweight). Results are retrieved by remote logging. Daemon instantiates distributed system participant (e.g. BitTorrent client or Chord node).</li>
<li>SPLAY language based on Lua (made for interaction with C, bytecode-based with garbage collection). Language close to pseudocode, so that programmers can focus on the algorithm. Programs written at the RPC level. Applications can be run locally for debugging and testing, and have a comprehensive set of libraries (compatible with wrapped C libraries).</li>
<li>SPLAY controller may be distributed, and performs deployment and control support. Can for example emulate churn or network packet loss.</li>
<li>Resource isolation is a requirement for non-dedicated testbeds (e.g. where you have multiple users or you want to sandbox testbed code). Need to limit the resources used and enforce protection. All access to the system is directed through libraries.</li>
<li>SPLAY can do Chord in 59 LOC (+17 for fault tolerance and 26 to use a leafset). Pastry takes 265 LOC. Also did Scribe, SplitStream, WebCache, BitTorrent (420 LOC, much of them for protocol parsing), Cyclon, Epidemic and Trees.</li>
<li>Interface for resource selection (web or command-line), using criteria on load, resource availability and location (e.g. from PlanetLab metadata).</li>
<li>Chord implementation is validated to provide the correct results (route length, rather than performance) which demonstrates the validity of using a real deployment instead of simulation.</li>
<li>SPLAY allows real traces of PlanetLab to be replayed on a testbed (for reproducible experiments). Can also use synthetic descriptions in a DSL (say when fractions of nodes join and leave).</li>
<li>Evaluated synthetic churn effect on routing performance in a Pastry DHT. Also looked at trace-driven churn (trace from OverNet file-sharing network), with variable speedup factors.</li>
<li>Evaluated runtime overhead. Can run up to 1263 nodes of Pastry on a machine with 2GB of RAM before swapping is triggered.</li>
<li>Also evaluated robustness for long-running applications (web cache application).</li>
<li>Q: how does SPLAY help you to validate simulation results? This is not the goal of SPLAY: instead it helps you to run a real application rather than simulate it. Could you reuse the Lua code? Yes, Lua is a good fit for this.</li>
<li>Q: if you get a non-deterministic bug on PlanetLab, it can be hard to reproduce, but does SPLAY help you with this? Don&#8217;t solve the distributed snapshot problem, but if the problem is due to churn, SPLAY will help you to reproduce it.</li>
<li>Q: does the framework allow querying network conditions (for developing network-aware distributed algorithms)? Don&#8217;t have direct access from the present Lua libraries to low-level network conditions, but anything that can be expressed in C can be used, so this could be added.</li>
</ul>
<h3>Modeling and Emulation of Internet Paths</h3>
<ul>
<li>How do you evaluate a distributed system? (e.g. a DHT, CDN, P2P&#8230;.) Set up a topology on which you are going to test the system, then run it on Emulab, which will emulate the network between them.</li>
<li>Emulation is repeatable, runs in a real-world environment, and is controllable (dedicated nodes and network parameters).</li>
<li>Hitherto all about link emulation: not the paths between nodes.</li>
<li>Goal is path emulation: two nodes in the real world (opposite coasts of the US) with a path between them that you want to emulate.</li>
<li>Could do link-by-link emulation: need to know topology of the path, the capacities along that path, the queue sizes, and the cross-traffic along the path. Most researchers don&#8217;t have that information.</li>
<li>Instead abstract the path and do end-to-end emulation. Focus on high-level network characteristics (RTT and available bandwidth).</li>
<li>Obvious solution: given a link emulator, turn it into a path emulator by mapping the actions of the link emulator onto high-level characteristics. Link emulators have queues but it&#8217;s hard to determine the queue size from the path (set it to a default and we&#8217;ll adjust it later). Then parameterise a traffic shaper. And finally induce a fixed delay on every packet (based on measured real-world RTT, to estimate non-queuing delay). Must do this in both directions, which may be independent.</li>
<li>Used this on a real link with iperf running bandwidth measurements in both directions. 8.0% error on the forward path and 7.2% error on the reverse path. Room for improvement, but not too bad: good news! Bad news is on RTT: obvious emulation gives a fixed RTT which is an order of magnitude higher than on the real path. Also, for asymmetric paths (6.4Mbps/2.6Mbps): 50.6% error on the forward path (smaller than real path) and 8.5% error on the reverse path.</li>
<li>Why are these errors arising? TCP increases congestion window until it sees a loss, but there are no losses until the queue fills up. The large delays we were seeing were queuing delays. More than 200ms of delay was due to queuing delay.</li>
<li>Maximum tolerable queuing delay is ((maximum window size / target bandwidth) - base RTT). Total queue size must be &lt;= product of capacity and maximum tolerable delay. Can also calculate a lower bound for the queue size. But this gives a lower bound greater than the upper bound! Solution is to distinguish capacity from available bandwidth. (ABW is the rate that my packets drain from the queue.) Lower bound is independent of capacity, but upper bound grows as capacity grows, so this gives a region in which we can select a viable queue size.</li>
<li>How do we separate the emulation of available bandwidth and capacity? Set queue based on constraints rather than a reasonable default. The traffic shaper should emulate the capacity (high fixed bandwidth, not available bandwidth). Delay is the same. Now introduce constant bitrate cross traffic into the queue and sink it after the shaper. Just enough such that only the available bandwidth is available inside the traffic shaper. Again, do this in both directions.</li>
<li>Use CBR traffic instead of TCP cross-traffic because TCP cross-traffic backs off. If the cross-traffic backs off, you would see a higher bandwidth than is realistic.</li>
<li>But how can CBR traffic be reactive? Approximate aggregate available bandwidth as a function of the number of foreground flows. Change CBR traffic based on flow count.</li>
<li>Evaluated the error in emulation (from obvious solution to new emulation): forward 50.6% down to 4.1%; reverse 8.5% down to 5.0%. Delay is less noisy than the real environment, but is in the correct order of magnitude.</li>
<li>Tested system with BitTorrent running on it (12 clients, 1 seed). Isolated capacity and queue size changes. Makes a difference when compared to obvious solution.</li>
<li>Another principle is modelling shared bottlenecks (details in the paper).</li>
<li>Q: RTT was shown as time-series, but should you show the PDF/CDF of RTTs? Do they also follow the real world? Very unlikely to follow what is seen in the real world, constraining in a simple way. Focussing on higher-level aspects that would matter at endpoints.</li>
<li>Q: did you compare results to real-world on PlanetLab? Hard to characterise ground-truth on PlanetLab. Used widely, but hard to distinguish between host conditions (overloaded host) and network conditions: this impacts things like BitTorrent applications. See FlexLab paper. Weren&#8217;t confident in the ability to do that experiment.</li>
<li>Q: there is great value in having repeatable experiments, but a problem with Emulab is parameterisation (too much to choose), and it would be good to come up with a &#8220;standard scenario&#8221; (&#8220;500 node March 2009 settings&#8221;)? Difficult to do but have taken steps in that direction. FlexMon did wide-area bandwidth and delay measurements on PlanetLab. Unfortunately, this is hard to scale up.</li>
<li>Q: do you have a sense of whether the byte-based queue size is realistic? Talked in terms of packets to make it simpler, but implemented in terms of bytes.</li>
</ul>
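<p>The queue-sizing rule and the cross-traffic rate from the notes above can be written out as a short calculation. This is only a sketch of the arithmetic as stated in the talk notes: the function names, the 64 KiB window, the 50 ms base RTT and the 100 Mbps shaper capacity are illustrative assumptions, and the lower bound on queue size is left out because the notes do not give its formula.</p>

```python
def max_queuing_delay(max_window_bytes, target_bw_bps, base_rtt_s):
    """Maximum tolerable queuing delay in seconds: (window / bandwidth) - base RTT."""
    window_bits = max_window_bytes * 8
    return window_bits / target_bw_bps - base_rtt_s


def queue_upper_bound_bytes(capacity_bps, max_delay_s):
    """Upper bound on queue size: capacity multiplied by the tolerable delay, in bytes."""
    return capacity_bps * max_delay_s / 8


def cbr_cross_traffic_bps(capacity_bps, available_bw_bps):
    """Constant-bitrate cross traffic that leaves only the available bandwidth free."""
    if available_bw_bps > capacity_bps:
        raise ValueError("available bandwidth cannot exceed capacity")
    return capacity_bps - available_bw_bps


if __name__ == "__main__":
    # Assumed example values, not measurements from the talk.
    window = 65536        # bytes (hypothetical maximum TCP window)
    available = 6.4e6     # bits/s (forward path rate quoted in the notes)
    base_rtt = 0.050      # seconds (hypothetical)
    capacity = 100e6      # bits/s (hypothetical shaper capacity)

    delay = max_queuing_delay(window, available, base_rtt)
    print("max queuing delay (s):", delay)
    print("queue upper bound (bytes):", queue_upper_bound_bytes(capacity, delay))
    print("CBR cross traffic (bits/s):", cbr_cross_traffic_bps(capacity, available))
```

<p>With these assumed numbers the tolerable queuing delay is about 32 ms and the queue may hold roughly 399 KB. Raising the capacity raises the upper bound, which is what opens up a viable region above the lower bound.</p>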
<h3>M<small>O</small>D<small>IST</small>: Transparent Model Checking of Unmodified Distributed Systems</h3>
<ul>
<li>Distributed systems are hard to get right (complicated protocols and code to implement them). With no centralised view of the entire system, large range of failures to tolerate and increasingly large scale.</li>
<li>Normally do some kind of randomised testing, but this is low coverage and non-deterministic (very hard to reproduce).</li>
<li>MoDist is a model checker for distributed systems. It&#8217;s comprehensive. It runs in-situ (unmodified, real implementations). And it&#8217;s deterministic (allows replay).</li>
<li>Applied to Berkeley DB, Paxos-MPS (in production service for Microsoft&#8217;s data centres), and PacificA. Found 35 bugs of which 31 have been confirmed by developers.</li>
<li>Look at BDB replication. Based on Paxos (single primary, multiple secondaries): primary can read and write, secondary can only read. On primary failure, secondaries will elect new primary. On duplicate primary, degrade both and re-elect. But there was a bug in the leader election protocol: could lead to a secondary node receiving a request-for-updates message, which is unexpected and causes the secondary to crash.</li>
<li>MoDist took about an hour to make the bug show up, and outputs a trace which can be replayed.</li>
<li>Goal is to explore all states and actions of the system. Model checking makes rare actions (failures, crashes) appear as often as common ones, which drives you into corner cases.</li>
<li>Look at real processes, communicating using messages, which may also be multi-threaded. Set of normal actions (send/recv message or run thread) and rare actions (message delay, link failure, machine crash) which may be injected.</li>
<li>Ideal is to explore all actions. But this leads to a combinatorial explosion. Built-in checks for crashes, deadlocks and infinite loops. Can also have user-written checks (local and global assertions). MoDist amplifies the checks that you give to it.</li>
<li>Avoid redundancy by exploring one interleaving of independent actions.</li>
<li>Challenges are exposing actions, checking timeout code, simulating realistic failures and scheduling actions (avoiding deadlocks and maintaining extensibility).</li>
<li>To check a system, must know and control the actions of the system. Previous work required users either to write an application in a special language, or port it into a fake environment. MODIST uses Explode (OSDI 2006) to interlace control needed into the checked system. Fake environment perturbs the system and can introduce false positives.</li>
<li>MODIST instead inserts an interposition frontend that intercepts RPC API calls and sends these to a MODIST backend. Backend interposes on all RPCs and schedules all intercepted API calls. This is transparent and simple (does not perturb the system and cause false positives). Frontend is stateless: all state in the backend. Only frontend is OS-dependent; backend is OS-independent.</li>
<li>Frontend intercepts 82 API functions (e.g. networking, thread synchronisation). Most wrappers are simple, either returning a failure or calling the real API function. Each wrapper has an average of 67 lines of code.</li>
<li>Timeout checking is complicated by the heavy use of implicit timers (e.g. a comparison with gettime()). Could intercept gettime() but what should be returned? Don&#8217;t know what the comparison is (because it isn&#8217;t an API call), and previous work to address this was manual.</li>
<li>Instead, do a static, symbolic analysis. Observe that time values are used in simple ways (db_timespec, mostly with +, - and sometimes * or /). Also that timeouts are checked in the vicinity of gettime() calls, 12 out of 13 are within a few lines. Static intra-procedural symbolic analysis can discover implicit timers (much simpler than approaches like KLEE).</li>
<li>Checked systems of size up to 172.1KLOC, and found 35 bugs. 10 bugs in protocols, 25 in implementations. All bugs were previously unknown.</li>
<li>Q: how does it compare to DPOR (designed for multi-core systems)? Work with general distributed systems as well as multithreaded systems. Different kinds of failures in distributed system, so not clear how this can be mapped to a multithreaded system.</li>
</ul>
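<p>The wrapper pattern described above (each wrapper either returns a failure or calls the real API function, with the decision made by a backend that holds all the state) can be sketched in a few lines. This is a toy illustration, not MoDist code: the <code>Backend</code> class, the <code>interpose</code> helper and the choice of <code>ConnectionError</code> as the injected failure are all hypothetical.</p>

```python
class Backend:
    """Hypothetical checker backend: holds all state and decides each call's outcome."""

    def __init__(self, fail_calls):
        self.fail_calls = set(fail_calls)  # indices of intercepted calls that must fail
        self.log = []                      # names of intercepted calls, in order

    def should_fail(self, name):
        index = len(self.log)
        self.log.append(name)
        return index in self.fail_calls


def interpose(backend, name, real_fn):
    """Return a stateless wrapper that asks the backend before calling real_fn."""

    def wrapper(*args, **kwargs):
        if backend.should_fail(name):
            # Injected rare action: the call fails instead of reaching the real API.
            raise ConnectionError(f"injected failure in {name}")
        return real_fn(*args, **kwargs)

    return wrapper


if __name__ == "__main__":
    delivered = []
    backend = Backend(fail_calls={1})  # fail the second intercepted call
    send = interpose(backend, "send", delivered.append)

    for message in ["a", "b", "c"]:
        try:
            send(message)
        except ConnectionError as err:
            print("failure seen by application:", err)

    print("delivered:", delivered)
    print("trace:", backend.log)
```

<p>Because the failure schedule lives entirely in the backend, rerunning with the same <code>fail_calls</code> reproduces the same sequence of outcomes, which is the property that makes a recorded trace replayable.</p>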
<h3>CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems</h3>
<ul>
<li>Errors remain in distributed systems even after the system is deployed (DoS, data loss, loss of connectivity), and manifest themselves as loss of safety properties.</li>
<li>Paxos is a fault tolerant protocol for achieving consensus, and is integrated in many deployed distributed systems.</li>
<li>Want to use increased computing power and bandwidth to make it easier to find errors in distributed systems. Use spare power to do state exploration that can predict what sequences of user actions might lead to violations of user-specified consistency properties.</li>
<li>Compared against classic model checking, which is limited in the depth that can be achieved. Compared against replay-based/live predicate checking (state of the art distributed debugging). Deep online debugging periodically starts state space exploration to find inconsistencies. CrystalBall also does execution steering to avoid states which would lead to inconsistency.</li>
<li>High-level overview: service is implemented as a state machine with some amount of local state and several handlers. Runtime manages the timers and messages to/from the network. CrystalBall controller takes information from the state machine (neighbour info and periodic local checkpoints). The user puts safety properties into the controller. Controller invokes consequence prediction algorithm, looking for violations (passed back to controller). Controller can then put event filters (for steering) into the runtime.</li>
<li>State space exploration using MaceMC, run enabled handlers in a set of nodes (live code) and check safety properties in each state. Want to reduce state space coverage to predict future inconsistencies quickly. Want to explore all interleavings of network messages (important for race conditions), but not all interleavings of local node actions.</li>
<li>Standard hashing does not re-explore system-wide states. Instead use hashing to remove previously explored local actions of nodes. Network messages can still be interleaved.</li>
<li>Aim is to increase resilience of deployed systems, with a focus on generic runtime mechanisms. Want to prevent inconsistencies without introducing new ones (do no harm). But this is a hard problem.</li>
<li>Execution steering uses &#8220;sound filters&#8221; which have behaviour equivalent to events that could happen in normal running (e.g. TCP connection broken, UDP packet loss).</li>
<li>Model checker might not have enough time to detect inconsistencies. Deal with this using an &#8220;immediate safety check&#8221;.</li>
<li>Evaluated live using 6&#8211;100 participants on 25 machines. Implemented in Mace, and used ModelNet to emulate wide-area Internet behaviour. Looked at RandTree, Chord and Bullet&#8217; systems. Found 7 inconsistencies that weren&#8217;t found by MaceMC or a few years of debugging.</li>
<li>Looked at execution steering for Paxos. Injected 2 bugs and induced live scenarios that violate the Paxos safety property. Ran each bug 100 times, with a random [0, 20] seconds between injections. Avoided violation in 95% of cases.</li>
<li>Also evaluated the performance impact. (Download times using Bullet&#8217;.) Less than a 5% slowdown: due to the bandwidth spent on shipping checkpoints.</li>
<li>Q: could you comment on the CPU overhead of running these state-space explorations? One core is dedicated to this and maxed out. The state machine is single-threaded.</li>
<li>Q: as machines get more powerful, you would expect them to handle more load&#8230; how much is the amount that should be spent on model checking? Well, we haven&#8217;t looked into that, but could parallelise state-space exploration and use multiple cores.</li>
<li>Q: how big is the state space in bytes? For 8 levels (depth reached at runtime), need only 600KB, which fits easily in L2 cache.</li>
<li>Q: is there any way of quantifying what &#8220;relevant&#8221; means for the relevant states that you aim to explore? The states that we are exploring are obviously quite relevant, better than random walk.</li>
<li>Q: Paxos example is artificial because you have 3 replicas and 2 faults? There were separate failures and a network partition? This failure still shouldn&#8217;t have happened and CrystalBall detected it.</li>
<li>Q: ditto. Paxos should have used stable storage for the proposed values? This was a bug in the implementation.</li>
</ul>
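<p>The idea of exploring forward from a checkpoint to find event sequences that break a safety property can be illustrated with a bounded search. This sketch makes strong simplifying assumptions: it uses a plain visited set over whole states (the standard hashing that the notes contrast with CrystalBall&#8217;s pruning of local actions), and the lock example, the event names and the <code>predict_violation</code> function are invented for illustration.</p>

```python
from collections import deque


def predict_violation(start, events, step, is_safe, max_depth):
    """Bounded breadth-first search from a checkpoint.

    Returns the shortest list of events leading to an unsafe state,
    or None if no violation is reachable within max_depth events.
    """
    seen = {start}
    frontier = deque([(start, ())])
    while frontier:
        state, path = frontier.popleft()
        if not is_safe(state):
            return list(path)
        if len(path) == max_depth:
            continue
        for event in events:
            nxt = step(state, event)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + (event,)))
    return None


# Toy system: two nodes and a buggy lock that never checks whether the lock is held.
EVENTS = ["a_acquire", "b_acquire", "a_release", "b_release"]


def step(state, event):
    a_holds, b_holds = state
    if event == "a_acquire":
        return (True, b_holds)
    if event == "b_acquire":
        return (a_holds, True)
    if event == "a_release":
        return (False, b_holds)
    if event == "b_release":
        return (a_holds, False)
    return None


def mutual_exclusion(state):
    """Safety property: the two nodes never hold the lock at the same time."""
    a_holds, b_holds = state
    return not (a_holds and b_holds)


if __name__ == "__main__":
    checkpoint = (False, False)
    print(predict_violation(checkpoint, EVENTS, step, mutual_exclusion, max_depth=4))
```

<p>A steering layer could then install a filter for the last event on the returned path, as long as dropping that event is something that could also happen in a normal run.</p>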
<h2>Wide-Area Services and Replication</h2>
<h3>Tolerating Latency in Replicated State Machines Through Client Speculation</h3>
<ul>
<li>Want to make fault tolerant systems faster in a distributed environment.</li>
<li>Simple service configuration with client and server, exchanging messages. Server has some state. For FT, replicate the service as a replicated state machine. Replicas need to agree on the order in which they execute the client requests. Need to reach consensus.</li>
<li>Problem is that RSMs have high latency: must block until enough non-faulty replies are received. Also have to deal with geographic distribution (to avoid correlated failures).</li>
<li>Idea is to use speculative execution in the RSM. Speculate before consensus reached. Without faults, any reply will have the consensus value.</li>
<li>Client can take a checkpoint of its state and execute speculatively while waiting on consensus. When consensus is reached, commit the state and continue executing. Obviously, if we speculate based on a faulty reply, we can rollback to the speculation point.</li>
<li>First reply latency sets the critical path (need speed and accuracy here), not the completion latency. Can therefore achieve high throughput, support good batching, have stability under contention and maintain a smaller number of replicas.</li>
<li>What if you make a request while executing speculatively? (Buy a Corvette after speculating that we&#8217;ve won the lottery.) Could hold this because it is external output, which gives bad performance (just as bad as before). Could use a distributed commit/rollback protocol but this makes state tracking complex.</li>
<li>Instead explicitly encode dependencies as predicates. New message is &#8220;buy if win=yes&#8221;. Replicas need to log past replies to test these predicates. Local decision at the replicas matches the client.</li>
<li>Applied speculative execution to PBFT (PBFT-Client Speculation, PBFT-CS). Move the execution stage to the beginning (reply from the primary). If the primary is non-faulty, it is in a natural place to give a high-quality speculation.</li>
<li>Added tentative execution to PBFT-CS, a read-only optimisation, a failure threshold and proved correctness.</li>
<li>Evaluated with benchmarks: a shared counter and NFS (running an Apache httpd build). Three topologies (primary-local, primary-remote and uniform). For primary-local on shared counter, the runtime is roughly equal to the non-replicated case (as the network delay is varied). Performs better than PBFT and Zyzzyva (whose runtime increases linearly with network delay). Also evaluated with failures (still performs well with 1% failure). Also looked at throughput. If latency is the limiting factor and you&#8217;re not operating at peak capacity, then speculative execution is a win. However, if the server is fully loaded (bound by throughput), client speculation will not improve things (and may introduce some overhead).</li>
<li>Q: have you thought about adaptively using client speculation under low load and Zyzzyva under high load to improve overall throughput? Either client or the primary could switch off speculation when load is high.</li>
<li>Q: how do you deal with RSMs getting into divergent states because of speculation (i.e. by having external operations)? Could fall back to blocking or distributed commit/rollback.</li>
</ul>
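<p>The predicate idea in these notes can be sketched roughly as follows. This is a toy illustration rather than PBFT-CS itself: the <code>Replica</code> class, its method names and the in-memory reply log are all assumptions made for the example.</p>

```python
class Replica:
    """Toy replica that logs past replies and checks predicates on new requests."""

    def __init__(self):
        self.reply_log = {}   # request id -> reply this replica actually produced
        self.executed = []    # operations that have been applied

    def execute(self, request_id, operation, result):
        """Run an ordinary request and log its reply for later predicate checks."""
        self.reply_log[request_id] = result
        self.executed.append(operation)
        return result

    def execute_if(self, operation, predicate):
        """Apply `operation` only if every (request_id, expected_reply) pair in
        `predicate` matches what this replica logged for that request."""
        for request_id, expected in predicate.items():
            if self.reply_log.get(request_id) != expected:
                return False  # the client's speculation was wrong: drop the request
        self.executed.append(operation)
        return True


replica = Replica()
replica.execute("lottery", "check_lottery", "no")   # the real outcome
# The client speculated win=yes and sent "buy if win=yes".
accepted = replica.execute_if("buy_corvette", {"lottery": "yes"})
print(accepted)          # False
print(replica.executed)  # ['check_lottery']
```

<p>Because each replica tests the predicate against its own log, the decision is local and needs no separate commit/rollback round.</p>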
<h3>Cimbiosys: A Platform for Content-based Partial Replication</h3>
<ul>
<li>Scenario of photo sharing: Alice stores her complete photo collection on her home PC, tags and labels all photos. The metadata is used to route photos to devices and services (public -&gt; Flickr, 5* -&gt; digital photo frame, family -&gt; Mum). She might update these photos (do red-eye reduction) and these should propagate everywhere. She might upload photos from elsewhere (an internet café).</li>
<li>Two key replication requirements. Not all devices are interested in all data (content-based selection, device-specific filter). Not all devices talk to all other devices (need flexible, arbitrary communication patterns). Also eventual consistency, incremental updates, conflict detection and resolution, bandwidth efficiency, etc.</li>
<li>Item store is a database of XML objects stored on each device. Application only accesses items in the item store on that device, but will then be synchronised by Cimbiosys (through Sync and Comm components).</li>
<li>Filtered synchronisation protocol. Target device sends list of its item store and its filter to the source. The source sends each item that is unknown to the device and matches the filter. Then the target adds these items to its store and updates its knowledge.</li>
<li>Main contribution is a protocol that has eventual filter consistency (device&#8217;s item store = filter(whole collection)) (correctness property), and eventual knowledge singularity (device&#8217;s knowledge = single version vector) (performance property). Extensive model checking has confirmed that these properties hold.</li>
<li>Concentrate on eventual filter consistency here.</li>
<li>Problem of partial sync: how can you sync from laptop (family photos) through Facebook (public photos only) to the home PC (all photos)? Facebook just gets public family photos.</li>
<li>Needed to invent &#8220;Item-Set Knowledge&#8221;. Knowledge is the set of versions known to device: either stored, obsolete or out-of-filter. The knowledge fragment is a set of items (or * for all items) and a version vector. A device knows about all versions in the version vector for all items in the set of items.</li>
<li>By acquiring knowledge using the sync protocol, you receive items at most once.</li>
<li>What happens when an item is updated such that it moves out of a filter? (Remove family keyword from a member of the family who&#8217;s been disowned.) Need to add a new type of message to the sync protocol: a move-out notification. Send one of these if the version is unknown to the target device and the item does not match the target&#8217;s filter and (optionally) if the item is stored at the target. Don&#8217;t want to send too many notifications for items that were never stored (not correctness but performance issue).</li>
<li>Also need to handle move-out notification chains. Send if the source&#8217;s filter is as broad as the target&#8217;s, the source is as knowledgeable as the target and the source does not store an item stored at the target device.</li>
<li>Another problem is filter changes. Could just create a new empty collection. But that would be user-unfriendly. Want to discard things outside the new filter, retain the intersection of the old and new filters and acquire things not inside the old filter. So use a knowledge retraction protocol.</li>
<li>Evaluated based on a C# implementation (confirmed in Mace), by looking at the average number of inconsistent items per replica.</li>
<li>Q: what are the semantics of filter? Does it have any compositional properties? Only requirement on the filter is that it can make a binary decision based on an item and its metadata. How do you make the filters consistent? No notion of consistency for a filter.</li>
<li>Q: to what degree do failures or data corruption fit into this protocol? Don&#8217;t do anything about Byzantine failures; assume that each device has persistent storage.</li>
<li>Q: what is the trust model here, and how do you deal with not-fully-trusted cloud services? We have an access control policy that says what devices can perform what actions on items in the collection.</li>
<li>Q: does the originator of the item specify the policy? The owner of the collection specifies the policy. Also use signatures to prove the origin of an item.</li>
<li>Q: how is Cimbiosys different from Practi? Practi lets you build protocols as policies. But you would still have to specify policy on top of it (not built into the system).</li>
</ul>
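<p>A minimal sketch of one filtered sync, including move-out notifications, might look like the following. It is heavily simplified and not the Cimbiosys implementation: knowledge is modelled as a plain set of (item, version) pairs instead of item-set knowledge with version vectors, and the <code>Item</code>, <code>Device</code> and <code>sync</code> names are hypothetical. It also assumes the source is at least as up to date as the target.</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Item:
    item_id: str
    version: int
    tags: frozenset


@dataclass
class Device:
    item_filter: object                          # callable: Item -> bool
    store: dict = field(default_factory=dict)    # item_id -> Item
    knowledge: set = field(default_factory=set)  # {(item_id, version)} known so far

    def put(self, item):
        """Local write: store the item and record its version as known."""
        self.store[item.item_id] = item
        self.knowledge.add((item.item_id, item.version))


def sync(source, target):
    """One filtered sync from `source` to `target`."""
    for item in list(source.store.values()):
        key = (item.item_id, item.version)
        if key in target.knowledge:
            continue                           # already known: never sent twice
        if target.item_filter(item):
            target.store[item.item_id] = item  # new version inside the filter
        elif item.item_id in target.store:
            del target.store[item.item_id]     # move-out notification
        else:
            continue                           # never stored at target: send nothing
        target.knowledge.add(key)


home = Device(item_filter=lambda item: True)
frame = Device(item_filter=lambda item: "family" in item.tags)

home.put(Item("p1", 1, frozenset({"family"})))
sync(home, frame)
print(sorted(frame.store))   # ['p1']

home.put(Item("p1", 2, frozenset()))   # the "family" tag is removed
sync(home, frame)
print(sorted(frame.store))   # []
```

<p>The second sync shows the move-out case: the new version no longer matches the photo frame&#8217;s filter, so the frame drops its stale copy instead of keeping it forever.</p>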
<h3>RPC Chains: Efficient Client-Server Communication in Geodistributed Systems</h3>
<ul>
<li>RPC chains are a new distributed computing primitive. Web services run things over a large number of machines, e.g. webmail. Client talks to a frontend server, but that might talk to an authentication server, a storage server, an advert server, etc. The webmail app is a composition of more-basic services. RPC is used to build this composition. (Could be Sun RPC, RMI, SOAP, etc.)</li>
<li>If RPC is synchronous, it is neither efficient nor natural. As the number of services involved grows, scalability becomes a challenge. As does heterogeneity. And geodiversity. RPC is rigid and inefficient, so we want to give developers more control, and better tools for doing this.</li>
<li>A more natural path would be for the request to flow through each of the backend services.</li>
<li>Related work in continuations and function shipping (also mobile agents, active networks, distributed workflows, and MapReduce/Dryad).</li>
<li>Assume applications built as a composition of services, where services export service functions via RPC. Services have a fixed location but may be shared. Also assume a single administrative domain (security is less of an issue, orthogonal).</li>
<li>RPC has a simple interface which hides the complexity of the distributed environment. Goal is to preserve simplicity while providing finer-grain control.</li>
<li>Embed chaining logic as part of an RPC call. Chaining function is portable code that connects services.</li>
<li>Implemented as a C# prototype. Chaining functions are implemented as C# static methods, which are stored in a central repository and cached by hosts. Example applications are a storage service and a webmail service.</li>
<li>Used an NFS server with an RPC interface. Modified client applications to do a chained copy (third party transfer). NFS server was unmodified. Evaluated it with a client in Redmond and two servers in Mountain View. Chain copy gave a 5x improvement in performance. Chain copy gets a peak throughput of 10.4MB/s against 4.5MB/s for RPC copy.</li>
<li>Chaining function enables chains of arbitrary length, dynamic chains and parallel chains. Can also optimise by composing sub-chains recursively (e.g. if a storage server must invoke a backup).</li>
<li>Now maintain a stack of chaining functions and execution state. Push the current chaining function and state onto the stack when a subchain starts, and pop it when it ends. Also evaluated chain copy with composition turned on (12&#8211;20% savings).</li>
<li>Chain splitting and merging. Multiple chains can execute concurrently and services may invoke multiple subchains in parallel. This requires support for chain splitting and merging. Need a way of specifying this. Need a dedicated merge host and merge function.</li>
<li>Also issues of debugging and profiling, exceptions, broken chains, isolation (limit damage caused by a bug in the chaining function) and dealing with legacy RPC servers.</li>
<li>Q: what happens to worst case performance (presuming you have to set higher timeouts)? Nodes along the path report back to the source to monitor liveness (of the request). This is handled as a broken chain.</li>
<li>Q: is the process of adding chains automated/automatable? Not automated at present, but in future work we&#8217;ll try to do this. Instead provide an abstraction close enough to RPC that it shouldn&#8217;t be difficult.</li>
</ul>
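<p>To make the chaining idea concrete, here is a small single-process sketch of a chained copy. It is illustrative only: the prototype described in the talk uses C# static methods, whereas <code>run_chain</code>, <code>copy_chain</code> and the dictionary-backed &#8220;servers&#8221; below are hypothetical stand-ins, with ordinary function calls in place of network hops.</p>

```python
def run_chain(services, chain_fn, state):
    """Drive a request through services, letting chain_fn pick each next hop.

    chain_fn(state, result) returns (service_name, argument), or None when done.
    Returns the final state and the list of services visited.
    """
    hops = []
    result = None
    step = chain_fn(state, result)
    while step is not None:
        name, argument = step
        hops.append(name)
        result = services[name](argument)   # a real system would run this remotely
        state = dict(state, last=name, result=result)
        step = chain_fn(state, result)
    return state, hops


def copy_chain(state, result):
    """Chaining function for a third-party copy: read at src, then write at dst."""
    if state.get("last") is None:
        return ("src", state["path"])
    if state["last"] == "src":
        return ("dst", (state["path"], result))
    return None   # chain finished; only the final result goes back to the client


src_files = {"/a.txt": b"hello"}
dst_files = {}


def dst_write(argument):
    path, data = argument
    dst_files[path] = data
    return len(data)


services = {"src": src_files.get, "dst": dst_write}
final, hops = run_chain(services, copy_chain, {"path": "/a.txt"})
print(hops)             # ['src', 'dst']
print(final["result"])  # 5
```

<p>With plain RPC the file contents would travel from the source server to the client and then back out to the destination; in the chained version the data moves from <code>src</code> to <code>dst</code> directly and the client only sees the byte count.</p>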
<h2>Botnets</h2>
<h3>Studying Spamming Botnets Using Botlab</h3>
<ul>
<li>Botnets are a big problem (spam, DDoS, phishing often use botnets as the underlying infrastructure). They are a network of compromised or infected machines. However, they are hard to study and hence not well understood. Malware authors go to great lengths of obfuscation.</li>
<li>Automate botnet analysis using a black-box approach: execute the binary and study its behaviour. Also want to automate the finding and execution of bot binaries. Needs to be scalable and safe.</li>
<li>Peek into botnets using spam: a high-volume activity of botnets. Can collect information about botnets from the spam. Synthesising multiple sources of data in real-time gives accurate, high-fidelity information.</li>
<li>Characterise botnets by size, behaviour, types of spam sent, etc. New and better data gives better answers about this and improves our understanding about the botnet ecosystem.</li>
<li>Botlab is a system for monitoring botnets. It obtains bot binaries, performs an initial analysis, executes them safely, and outputs data.</li>
<li>Malware is traditionally collected through honeypots. Collected 2000 binaries over a 2-month period from honeypots. Spam botnets are in a new generation that propagates through social engineering. Need to make honeypots perform active crawling to accumulate this. Also use a constant spam feed from UW&#8217;s mail servers (2.5 million emails per day of which 90% spam; 200000 email addresses; over 100000 unique URLs each day; 1% point to malicious binaries).</li>
<li>Want to discard duplicate binaries, but a simple hash is insufficient (because of obfuscation, repacking, etc.). Use network fingerprinting (DNS lookups, IPs, ports connected to) to build a behavioural profile instead. Also used to find binaries which perform VMM detection (if fingerprint is different between VMM and bare-metal, then you have something that tries to detect VMMs).</li>
<li>Want to run bots safely. Botlab should not cause harm. But it must be effective: you can&#8217;t just prevent traffic from leaving the system. Botlab drops traffic to privileged ports and known vulnerabilities; limits connection and data rates (to avoid DDoS); and redirects SMTP traffic to a fake mail server.</li>
<li>Manual adjustments (e.g. when a bot verifies over SMTP to its C&amp;C server) needed to ensure that the bot functions.</li>
<li>Botlab tries to send 6 million emails per day (to hotmail, gmail, yahoo) from about 10 bot instances. Local view of spam producers but global view of spam produced.</li>
<li>Combining the spam sources gives a different perspective. Spam is received from almost every bot in the world: this gives a local view of the spam produced but a global view of the spam producers.</li>
<li>Want to combine these two sources of data. Observe that spam subjects are carefully chosen: there is no overlap in subjects sent by different botnets (489 different subjects/day/botnet).</li>
<li>Who is sending all the spam? 79% from just 6 botnets. 35% from the largest botnet (Srizbi).</li>
<li>What are some characteristics of the prominent botnets? Most contact only a small number of C&amp;C servers. Send spam at 20&#8211;2000 messages/minute. Mailing lists are 100 million&#8211;1 billion with a maximum overlap of 30% across two botnets. Active size is 16000&#8211;130000 nodes.</li>
<li>Are spam campaigns tied to specific botnets? Spam campaign defined as the contents of the webpage to which the spam URL points.</li>
<li>How does web hosting for spam relate to botnets? Many-to-many relationship between campaigns and web hosts. Does spam from a single botnet point to a single set of web servers? No, which suggests that hosting spam campaigns is a 3rd party service not tied to botnets. 80% of spam points to just 57 web server IPs.</li>
<li>Could use Botlab as a real-time database of bot-generated spam (including web links), which could be used for safer web browsing and better spam filtering.</li>
<li>Q: do botnets apply a sophisticated mechanism to back-off under user activity? Yes, some clever botnets try not to inconvenience you at all (Srizbi would look for mouse activity), but we weren&#8217;t actively using machines, so they worked at full throughput.</li>
<li>Q: what would the real sending rate be based on the presence of user activity? How would this affect the rate? Have not looked at this as it would depend on user behaviour.</li>
<li>Q: since number of botnets is small, could a botnet have several different masters? Based on the C&amp;C servers, which imply the control structure of the botnet.</li>
<li>Q: who is providing the web hosting (as surely they would have a huge bandwidth provision)? We have information about this, but they serve static pages of constant size, and the bandwidth requirement is based on the small fraction who actually read the pages. Many hosted in South Korea.</li>
<li>Q: do bots contact their C&amp;C server for reactivation? Not really, more for new data etc., except when they crash and require reactivation.</li>
</ul>
<h3>Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks</h3>
<ul>
<li>[Firefox crashed and lost my notes on this! Fortunately, other people are also blogging!]</li>
<li>Basic idea was to attest to real human interaction, using a small trusted attester. Attester is separated from the untrusted OS and applications using virtualisation (Xen disaggregation in this case). TPM sealing used to protect a key that is used for attestation, and stored in memory in the attester. Human interaction can trigger an attestation which may be used to sign e.g. emails or other interactions (clicks on adverts, votes, etc.). This gives a small window of opportunity in which a botnet could send fraudulent clicks, but this is pretty short, and the evaluation shows that it cuts down on the amount of spam that may be sent or fraudulent clicks that may be registered.</li>
<li>Q: if you can outsource CAPTCHA, couldn&#8217;t you outsource TPMs? No real benefit from this.</li>
<li>Q: a viable commercial OS needs to let you install third-party device drivers, so even if you assume trustworthy drivers, wouldn&#8217;t some drivers still need to be able to generate keystrokes? (Also, what about remote access to a machine?) Virtualisation is a convenient means to bootstrap this process. Could equally use trusted path techniques from Intel and AMD.</li>
</ul>
<h3>BotGraph: Large Scale Spamming Botnet Detection</h3>
<ul>
<li>Issue is web-account abuse attack. Zombie hosts can perform automated signup using CAPTCHA solvers. These accounts can then be used to send out spam.</li>
<li>Want to detect abused accounts based on Hotmail logs. Input is user activity traces (signup, login, email-sending), and goal is to stop aggressive account signup to limit outgoing spam.</li>
<li>At present, the attack is stealthy and large-scale. We need low false positive and false negative rates.</li>
<li>Designed a graph-based approach to detect attacks. Identifies a user-user graph to capture bot-account correlations. Identified 26 million bot accounts with a low false positive rate. Implemented using Dryad/DryadLINQ on a 240-machine cluster.</li>
<li>Graph of signup history for a particular IP. Notice a spike in the signup count. Could use an exponentially weighted moving average (EWMA) algorithm to predict future signup counts. Where there is a large prediction error, you have an anomalous window, during which we suppose malicious accounts are being created.</li>
<li>Can detect stealthy accounts using graphs. Observe that bot accounts work collaboratively. Normal users share IP addresses in a single AS with DHCP assignment. Bot users are likely to share different IPs across ASs. Bot users form a giant connected component while normal users do not. Can use random graph theory to detect this. So detect giant connected components from the user-user graph, then use a hierarchical algorithm to identify correct groupings. Then prune normal-user groups (e.g. due to cell phone users, Facebook applications, etc.).</li>
<li>Increase edge weight threshold until the connected component breaks up.</li>
<li>Implemented in parallel on DryadLINQ. EWMA-based signup abuse detection can partition data by IP and achieve real-time detection. The user-user graph construction uses two algorithms and some optimisations and can process 200&#8211;300GB of data in 1.5 hours on 240 machines. Connected component extraction using divide-and-conquer in just 7 minutes.</li>
<li>Graph construction by selecting ID group by IP (map phase), then generating potential edges (reduce phase). Then select IP group by ID pair (map) and calculate edge weight (reduce). Problem is that the number of weight-1 edges is two orders of magnitude larger than other weights. Their computation/communication is unnecessary.</li>
<li>Second algorithm does selective filtering, which saves transferring weight-1 edges between nodes. Also used optimisations for compression and broadcast (and the Join functionality).</li>
<li>Evaluated detection on two datasets (Jun 2007 and Jan 2008). Three types of data: signup, login and sendmail logs. Bot IPs went from 82k to 241k between datasets (user accounts from 4.83 million to 16.41 million). The anomaly window shrank from 1.45 to 1.01 days.</li>
<li>Validated using a manual check (sample groups sent to Hotmail team; almost no false positives). Also through comparison with known-spammers (detected 86% of complained-about accounts, and 54% of the detected accounts are new findings). False positive rate is very low.</li>
<li>How can you evade BotGraph? Be stealthy in your signups (sign up less). Also fix a binding to IP or AS, which lowers your utilisation rate. Accounts bound to a single host will easily be grouped. Or send few emails (like a normal user). All these approaches limit spam throughput.</li>
<li>Q: what is the relationship between this and Sybil-detection using a random walk? Not sure if random walk can be used to detect spam. Why do bots have to communicate with each other? Graph is just a way to cluster the bots.</li>
<li>Q: false positive rate of 0.5% on tens of millions of accounts seems like a concern? Absolute value is pretty large. Quite conservative; real false positive rate may be lower. Could also be a starting point from which more sophisticated and costly approaches could be used.</li>
</ul>
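<p>The two detection steps described in these notes (EWMA-based signup anomalies, then connected components on a user-user graph) can be sketched on toy data as follows. This is not the DryadLINQ implementation: the function names, the <code>alpha</code> and <code>threshold</code> defaults and the choice of &#8220;number of shared IPs&#8221; as the edge weight are assumptions made for illustration.</p>

```python
from collections import defaultdict
from itertools import combinations


def ewma_anomalies(counts, alpha=0.5, threshold=10.0):
    """Return indices where the signup count exceeds the EWMA prediction
    by more than `threshold`."""
    anomalies = []
    prediction = counts[0]
    for index, observed in enumerate(counts[1:], start=1):
        if observed - prediction > threshold:
            anomalies.append(index)
        prediction = alpha * observed + (1 - alpha) * prediction
    return anomalies


def user_graph(logins):
    """Build edge weights from (user, ip) login records.

    Weight of an edge = number of distinct IPs the two users share.
    """
    users_by_ip = defaultdict(set)
    for user, ip in logins:
        users_by_ip[ip].add(user)
    weights = defaultdict(int)
    for users in users_by_ip.values():
        for pair in combinations(sorted(users), 2):
            weights[pair] += 1
    return weights


def components(weights, min_weight):
    """Connected components using only edges with weight >= min_weight."""
    neighbours = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, frontier = set(), [start]
        while frontier:
            node = frontier.pop()
            if node not in group:
                group.add(node)
                frontier.extend(neighbours[node] - group)
        seen |= group
        result.append(sorted(group))
    return result


print(ewma_anomalies([2, 3, 2, 40, 3]))   # [3]

logins = [
    ("bot1", "1.1.1.1"), ("bot2", "1.1.1.1"),
    ("bot1", "2.2.2.2"), ("bot2", "2.2.2.2"),
    ("bot2", "3.3.3.3"), ("bot3", "3.3.3.3"),
    ("bot2", "4.4.4.4"), ("bot3", "4.4.4.4"),
    ("alice", "5.5.5.5"), ("bob", "5.5.5.5"),
]
weights = user_graph(logins)
print(components(weights, 1))   # [['alice', 'bob'], ['bot1', 'bot2', 'bot3']]
print(components(weights, 2))   # [['bot1', 'bot2', 'bot3']]
```

<p>Raising the edge-weight threshold from 1 to 2 removes the pair of normal users who merely share one IP, while the accounts that keep reappearing together stay connected.</p>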
<h2>Network Management</h2>
<h3>Unraveling the Complexity of Network Management</h3>
<ul>
<li>Enterprise networks are complicated: topologies, diverse devices, tweaked network configuration and diverse goals.</li>
<li>Example of a configuration change: adding a new department with hosts spread across three buildings. Need to reconfigure routers, and an error can lead to outages or loopholes.</li>
<li>Complexity leads to misconfiguration. There is no metric that captures this, and it&#8217;s difficult to reason about the difficulty of future changes, or for selecting between possible changes.</li>
<li>Defined a set of complexity metrics, based on an empirical study of complexity in 7 networks. Metrics were validated by questionnaire sent to network operators (in public and private enterprises). Questionnaire had tasks to quantify complexity, either network-specific or common to all operators. The metrics focus on layer 3.</li>
<li>Complexity is unrelated to size or line count. Largest mean file size was a &#8220;simple&#8221; configuration. So was highest number of routers.</li>
<li>Implementation complexity (referential dependence and different roles for routers) and inherent complexity (uniformity). Inherent complexity provides a lower bound for implementation complexity.</li>
<li>Referential dependency metric. Look at referential graph in the stanza of the configuration file. (e.g. Router stanza contains line that references an interface stanza.) Could have intra- and inter-file links. Inter-file links correspond to global network symbols, e.g. the subnet and VLANs. Operators try to reduce dependency chains in their configurations so as to have few moving parts (dependencies). Metric should capture the difficulty of setting up layer 3 functionality and the extent of dependencies.</li>
<li>Metric is the number of referential links, normalised by the number of devices. Greater number of links implies higher complexity.</li>
<li>Another metric is the number of routing instances (i.e. a partition of routing protocols into largest atomic domains of control).</li>
<li>Largest network (83 routers) has only 8 average ref links per router, which is low (simple).</li>
<li>Gave operators a task of adding a new subnet at a randomly chosen router. Metric was monotonically increasing but not absolute.</li>
<li>Inherent complexity: policies determine a network&#8217;s design and configuration complexity. Where policies are uniform, this is easy to configure, but special cases make it hard to configure. Challenge was to mine implemented policies and quantify similarities and consistency.</li>
<li>Policies were captured with reachability sets (i.e. the set of packets allowed between 2 routers). These imply a connectivity matrix between routers, which is affected by data/control plane mechanisms. Get a uniformity metric, which is the entropy of reachability sets. Simple policies show entropy values close to ideal. Simple policies have filtering at higher levels. Also discovered a bug in one configuration because it had close-to-ideal entropy when it should not have.</li>
<li>Some networks are simple, but most are complex. Most networks studied have inherently simple policies (so it was more implementation complexity).</li>
<li>In one network, get a high referential link count due to dangling references (to interfaces). Another was complex because the network was in the middle of a restructuring.</li>
<li>Future work to look at ISP networks, and consider absolute versus relative complexity.</li>
<li>Q: did operators introduce the right kind of complexity? Yes, they knew what they were doing. Does metric help them or did they know they were in trouble? They knew they were in trouble.</li>
<li>Q: is there a reason for normalising by the number of devices? This helps compare between two networks of different sizes. However, it&#8217;s a first attempt, and this may be refined.</li>
<li>Q: have you thought about the complexity of provisioning against the runtime complexity? Something might be easy to provision but could go wrong horribly if it fails? Not looked at that yet.</li>
</ul>
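<p>The two kinds of metric mentioned in these notes can be illustrated with a short sketch. The notes do not give the paper&#8217;s exact formulas, so this assumes the referential metric is a simple mean of links per router, and that the uniformity metric is the Shannon entropy of the distribution of reachability sets over router pairs. The function names and sample data are hypothetical.</p>

```python
import math
from collections import Counter


def referential_complexity(links_per_router):
    """Mean number of referential links per device."""
    return sum(links_per_router.values()) / len(links_per_router)


def reachability_entropy(reachability):
    """Shannon entropy (bits) of the distribution of reachability sets
    over router pairs. 0.0 means every pair has the same policy."""
    counts = Counter(frozenset(allowed) for allowed in reachability.values())
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


links = {"r1": 6, "r2": 10, "r3": 8}
print(referential_complexity(links))   # 8.0

uniform = {
    ("r1", "r2"): {"tcp/80"},
    ("r1", "r3"): {"tcp/80"},
    ("r2", "r3"): {"tcp/80"},
    ("r3", "r1"): {"tcp/80"},
}
special_cases = {
    ("r1", "r2"): {"tcp/80"},
    ("r1", "r3"): {"tcp/80"},
    ("r2", "r3"): {"tcp/22"},
    ("r3", "r1"): {"tcp/80", "tcp/22"},
}
print(reachability_entropy(uniform))         # 0.0
print(reachability_entropy(special_cases))   # 1.5
```

<p>Under these assumptions a network with one uniform policy scores zero, and each special case pushes the entropy up.</p>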
<h3>NetPrints: Diagnosing Home Network Misconfigurations Using Shared Knowledge</h3>
<ul>
<li>Typical home network has multiple devices connected to a (wireless) cable modem, which connects to the internet. You might be running diverse applications and have diverse firewall and security requirements. Very heterogeneous. Worse, there is no network administrator, so how do you manage it?</li>
<li>Looked at examples of problems faced in home networking. Some caused by home router misconfiguration, some by end-host misconfiguration and some by remote-host misconfiguration (that may nevertheless be solved locally, e.g. by changing MTU).</li>
<li>Users take on average 2 hours to solve these problems. NetPrints is network problem fingerprinting. It automates problem diagnosis using &#8220;shared knowledge&#8221;. NetPrints subscribers occasionally submit network configuration information to the NetPrints service. On receiving a problem notice, the NetPrints service can suggest a solution.</li>
<li>In context with rule-based techniques: these are too application specific (require too many rules). Also local configuration issue resolvers (like Autobash, etc.). NetPrints combines these approaches. It has to deal with unstructured, heterogeneous environments, and solve problems due to the interaction of multiple configurations.</li>
<li>Two basic assumptions: connectivity is available (application-layer problems only; knowledge base could be shipped offline, however), and we don&#8217;t look at performance (only &#8220;good&#8221; and &#8220;bad&#8221; states).</li>
<li>Example of user tries a VPN connection from home, and it fails. Enters application name into NetPrints, configurations are scraped, and this information is shipped off to the NetPrints server. NetPrints suggests a fix, and the client applies it directly.</li>
<li>Three diagnosis strategies: snapshot-based (collect snapshots from different users), change-based (collect the configuration changes that a user makes) and symptom-based (collect signatures of problems from network traffic).</li>
<li>NetPrints has two operating modes: normal (collecting information) and diagnose mode (when people complain).</li>
<li>The configuration scraper scrapes from the router (using UPnP to get basic information, and the web-based interface (HTTP Request Hijacking)), the end-host (interface-specific parameters, patches, software versions and firewall rules) and the remote system (composition of local and remote configurations).</li>
<li>Server knowledgebase stores per-application decision trees with the popular C4.5 decision tree learning algorithm.</li>
<li>Evaluation methodology: testbed with 7 different wireless routers, clients running the VPN client sent configuration information to the NetPrints service (6000 configuration parameters per snapshot), and then service learned these using C4.5.</li>
<li>Configuration tree is not too large for the applications we have seen.</li>
<li>Configuration mutation can be done automatically by walking the decision tree. However, don&#8217;t want useless advice (e.g. change your router manufacturer) when a soft parameter could be changed. Therefore track the frequency of configuration changes and use this to inform what the cost of a change would be.</li>
<li>Another technique that uses network traffic signatures and change trees to diagnose other problems.</li>
<li>Evaluated in three different scenarios. First: VPN client in home network talks to VPN server outside. Second: want to connect from outside to an FTP server inside. Third: file sharing within the home network.</li>
<li>Found some intuitive inferences (need pptp_pass=1 for VPN to work), and some surprising inferences (to do with the stateful firewall being off).</li>
<li>NetPrints uses labelled data to learn its knowledgebase. A 13&#8211;17% mislabelling causes only a 1% error in diagnosis.</li>
<li>Q: what if you have a problem that none of the existing decision trees have seen? Did you consider merging decision trees? If you don&#8217;t have enough data, you should be able to use the knowledgebase of a similar application to solve the problem. Currently looking at calculating the similarity of applications.</li>
<li>Q: what if there are user-specific constraints that the decision tree would suggest you break? No user-specific weights in the current application. Could send back multiple suggestions (with weights) and combine them with local user policy.</li>
<li>Q: do you have a sense of whether the trees would still apply if you looked at other problems (beyond connectivity management)? Could have, e.g., an application that fails only with certain inputs, but not with others. Haven&#8217;t faced that yet, but intuition is that the trees would still be pretty small.</li>
<li>Q: what happens when the user fails to report something that could be crucial? System is limited to configuration that we can capture. Doesn&#8217;t deal with transient outages. Could look at work from CHI?</li>
</ul>
<h2>Green Networked Systems</h2>
<h3>Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage</h3>
<ul>
<li>Power and energy efficiency are key drivers today. How can I make my laptop last longer? How can I lower the power consumption of my desktop machine (environmental impact/cost)?</li>
<li>IT equipment consumes significant power, but shutdown opportunities are rarely used. 67% of office PCs are left on after work hours (sleep and hibernate modes used in less than 4% of these PCs). Home PCs left on for 34% of the time; 50% of the time not being used. Also a problem at CSE@UCSD.</li>
<li>People leave machines on to maintain state, occasional remote access (SSH/VNC; administrator access) and active applications running all night (long downloads, maintaining IM presence).</li>
<li>Occasional access and active applications cannot be handled by sleep modes (maintaining state obviously can).</li>
<li>Hosts are today either active/awake or inactive/asleep. Power consumption is 100x in awake compared to asleep. The network assumes that hosts are always connected. What we really want is something that provides the functionality of an awake host while only consuming the power of sleep mode.</li>
<li>Want to do this with availability across the entire protocol stack without making changes to the infrastructure or user behaviour. Can we achieve this with a low-power secondary processor (low-power CPU, DRAM and flash memory storage)? Want power consumption when secondary processor is active to be ~1W. Could we put this on the NIC?</li>
<li>Add a separate power domain to the NIC, powered on when host is asleep, with secondary processor device, and same MAC/IP address as the PC. Can wake up host when needed, but also handle some applications using application stubs on the secondary processor.</li>
<li>Stateless applications can be supported using &#8220;filters&#8221;; compare to Wake-on-LAN (which is either impractical (too many wakeups) or it affects usability (need infrastructure to send special packets to do the waking)). Somniloquy specifies filters at any layer of the network stack.</li>
<li>Stateful applications can be supported using &#8220;stubs&#8221;: need application specific code on the secondary processor, but the secondary processor is limited in resources. Stub is simplified version of the application; stub code is generated manually. Investigating how to do this automatically. Done for BitTorrent, web downloads and IM.</li>
<li>Built using the gumstix platform. PXA270 processor with full TCP/IP stack. USB connection to PC for sleep detection/wakeup trigger; power while asleep and IP networking for data. Wired and wireless prototypes. *-1NIC (augmented NIC) and *-2NIC (uses PC-internal interface while awake and simplifies legacy support).</li>
<li>Built a physical prototype. Two USB interfaces (one for power and USB networking; other for control).</li>
<li>Evaluated to maintain network reachability, and look at stateless and stateful applications.</li>
<li>Reachability: 4&#8211;5 second outage while desktop goes to sleep or comes back from sleep. This is because sleep transition is not optimised for latency.</li>
<li>Look at setup latency for stateless applications: incoming TCP SYN causes wakeup. Additional Somniloquy latency is 3&#8211;10s. As a proportion of an interactive session, this is probably OK.</li>
<li>Gumstix power consumption is between 290mW (WiFi) and 1W (Ethernet).</li>
<li>Lowest power consumption without using Somniloquy (in Dell Optiplex 745) is 93.1W (down from 102.1W in normal idle state). In S3 suspend state, this is just 1.2W. Total power consumption for Somniloquy is 5W. Assuming a 45-hour work week, could save 620kWh per year, which is $56, or some amount of carbon dioxide.</li>
<li>Extends laptop battery life from 6 hours to 60 hours. (Power drops from 11W to 1W.)</li>
<li>Used desktop PC trace data from next talk, using 24 desktop PCs (ON, sleep, idle, OFF durations). Identified energy savings.</li>
<li>Energy savings using stateful applications: web download stub. 200MB flash; download when Desktop PC is asleep. Wake up PC to upload data whenever needed. Uses 92% less energy than using the host PC for the download.</li>
<li>Other details in the paper: how to build application stubs, and a model of energy savings validated with real measurements.</li>
<li>Q: how is this different from the small screen attached to the laptop? Windows Sideshow. This is not an active device; only shows you information from before the laptop went to sleep. This could be augmented with Somniloquy.</li>
<li>Q: why do you have the design constraint of only making changes to the client and not to the network? For individual desktops and laptops, Somniloquy is the way to go. For cost, it might be better to have a dedicated machine, but this adds security issues and other overheads.</li>
<li>Q: could you not look at zombie state resume? Unfortunate thing is that currently you have an either/or state, and everything is either on or off.</li>
<li>Q: what is the energy cost associated with sleeping (to disk) and resuming (from disk)? Could you end up waking and sleeping more (due to e.g. the wakeup filters for stateless applications), and hence using more energy? Results account for the energy cost of suspend and resume. As long as it&#8217;s not once every one or two minutes, we&#8217;ll be alright.</li>
<li>Q: how much in the way of latency savings could be achieved by integrating the device into the motherboard? Important design point is that you don&#8217;t require everything to be on to power the device.</li>
</ul>
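<p>The yearly saving quoted above can be roughly reproduced with a back-of-the-envelope calculation. This is only a sketch under assumptions that are not in the notes: 52 working weeks per year, a machine that would otherwise sit in the 102.1W normal idle state outside working hours, and the 5W Somniloquy figure applying whenever the PC is asleep. The function and variable names are made up for illustration.</p>

```python
# Back-of-the-envelope check of the yearly saving quoted in the notes.
# Assumptions (not stated in the talk notes): 52 working weeks per year,
# the PC would otherwise stay in the 102.1 W "normal idle" state outside
# working hours, and Somniloquy draws 5 W in total while the PC sleeps.

HOURS_PER_YEAR = 24 * 365
WORK_HOURS_PER_YEAR = 45 * 52


def yearly_saving_kwh(idle_watts, asleep_watts):
    """Energy saved per year by sleeping outside working hours, in kWh."""
    off_hours = HOURS_PER_YEAR - WORK_HOURS_PER_YEAR
    return (idle_watts - asleep_watts) * off_hours / 1000.0


saving = yearly_saving_kwh(idle_watts=102.1, asleep_watts=5.0)

# Electricity price implied by the "$56 per year" figure in the notes.
implied_price = 56.0 / saving

print(f"saving: {saving:.0f} kWh/year")
print(f"implied price: {implied_price * 100:.1f} cents/kWh")
```

<p>Under these assumptions the result lands close to the 620kWh in the notes, and the $56 figure implies an electricity price of roughly 9 cents per kWh. Using the 93.1W lowest-power idle state as the baseline instead gives a smaller saving.</p>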
<h3>Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems</h3>
<ul>
<li>Networked systems draw significant power even when idle. Either go into a low-power sleep state (S3) and lose network presence; or remain powered on and waste power.</li>
<li>Several proposals have looked at this: Wake-on-(W)LAN, special wakeup packets, Network Connection Proxy. These systems have so far seen little use. But there has been little evaluation of the potential for energy savings, and little exploration of the design space for a network proxy.</li>
<li>Is the problem worth solving? What is the design space? What protocols should be handled and how?</li>
<li>This work is a trace-driven evaluation of the broader benefits and design tradeoffs. Focus on the types of energy savings that can be obtained with simple techniques.</li>
<li>Collected traces from 250 Intel host computers (90% laptops, 10% desktops), in the office and at home. Over 4 weeks in spring 2007. Traces contain packet traces, flow information, logs of keyboard and mouse activity, power state, etc.</li>
<li>Look at desktop machines and their power states. On average, they are idle for &gt;50% of the time. Therefore desktops waste &gt;60% of their energy while idle. Given there are 170 million desktop PCs in the US, this translates into 60TWh/year wasted (about $6 billion).</li>
<li>Do we need proxying or should we just Wake-on-LAN? Depends on the time that it takes to wake up and go back to sleep. PCs receive 1&#8211;4 packets per second while idle. Optimistic assumption that transition time (sleep-to-wake-to-sleep) is 10s. Office environment does not give any savings.</li>
<li>Differentiate between relevant and irrelevant packets, and only wake for relevant? Or respond with some proxy device? Or do complex processing on the proxy? Option between transparent (default wake) or non-transparent (default ignore) solution.</li>
<li>Deconstruct by protocol. Some protocols are responsible for &#8220;poor sleep&#8221;. Enumerate them and decide how to handle them (ignore/respond/wake).</li>
<li>Define half-time sleep, which measures a protocol&#8217;s role in preventing sleep: how much can we sleep if we wake for protocol P? Compute the sleep fraction for discrete transition times. ts_50 (half-time sleep) is the largest transition time for which sleep &gt; 50%. The lower ts_50(P) is for protocol P, the worse an offender P is.</li>
<li>Majority of traffic is broadcast or multicast. Mostly useless network chatter. ARP can be handled by simple responses (just know IP address of a particular machine). IPX and NetBIOS packets are probably uninteresting to the machine. Can also ignore HSRP and PIM. IGMP and SSDP can be handled by simple responses.</li>
<li>Look now at unicast. Look at TCP and UDP ports. Also TCP keep-alives and TCP SYNs. Some can be handled with easy approaches. But some (SMB/CIFS and DCE/RPC) require special handling because they are used by many applications.</li>
<li>Transparent proxies might be good for home, but not office computers. Cannot handle unicast well unless it&#8217;s complex. Non-transparent proxies (even simple ones) are very efficient.</li>
<li>Architecture is a list of (trigger, action) rules. Agnostic to where the proxy runs (NIC, server running on same LAN, router, firewall). Example implementation as a standalone machine on the same LAN, implemented in Click. Used a simple (non-transparent) set of rules. No explicit state transfer between sleeping machine and the proxy; just learn state by sniffing traffic. So no modifications needed at end systems.</li>
<li>Q: where is the line between idleness that can be exploited for future better performance and idleness when we should power down? Have scheduled wakeups when we do things in batches. Even when exploiting idleness, we don&#8217;t do it at 100% utilisation. Instead, have periodic spells of high utilisation scheduled.</li>
<li>Q: [Missed.] If the machine was asleep to start with, some traffic wouldn&#8217;t arise. This solution couldn&#8217;t really help.</li>
<li>Q: do you think that the problem can be solved completely by proxy, or would it be better to involve applications and protocol development? Not sure what the best answer would be, but it would definitely be better to have protocols that work well with power saving. Have seen examples of bad applications that wake the computer up every couple of minutes.</li>
<li>Q: [Missed.] If we had the same periodicity of exchanges, we could just be awake for that, but we are not sure how to design protocols to achieve that.</li>
</ul>
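<p>The (trigger, action) rule list described above could look something like the following. This is an illustrative sketch only, not the Click implementation from the talk: the packet representation, the helper names and the wake-up ports (22 and 3389) are all hypothetical. The protocol groupings follow the ignore/respond suggestions in the notes.</p>

```python
# Illustrative sketch of a proxy rule table; NOT the authors' Click code.
# Packets are plain dicts, and every helper name and port number below is
# a hypothetical choice made for this example.

IGNORE, RESPOND, WAKE = "ignore", "respond", "wake"


def proto_is(*names):
    """Build a trigger matching packets whose protocol is one of names."""
    return lambda pkt: pkt.get("proto") in names


def tcp_syn_to(ports):
    """Build a trigger matching TCP SYNs to one of the given ports."""
    return lambda pkt: (
        pkt.get("proto") == "TCP"
        and pkt.get("syn") is True
        and pkt.get("dport") in ports
    )


# Ordered list of (trigger, action) rules; the first match wins.
RULES = [
    (proto_is("ARP", "IGMP", "SSDP"), RESPOND),          # simple responses
    (proto_is("IPX", "NetBIOS", "HSRP", "PIM"), IGNORE),  # network chatter
    (tcp_syn_to({22, 3389}), WAKE),                       # hypothetical ports
]


def decide(pkt, rules=RULES, default=IGNORE):
    """Return the action of the first matching rule, else the default.

    default=IGNORE models a non-transparent proxy; default=WAKE would
    model a transparent one.
    """
    for trigger, action in rules:
        if trigger(pkt):
            return action
    return default


print(decide({"proto": "ARP"}))
print(decide({"proto": "TCP", "syn": True, "dport": 22}))
print(decide({"proto": "UDP", "dport": 9999}))
```

<p>Switching the default from ignore to wake is the only change needed to go from the non-transparent to the transparent design point, which is why the choice of default matters so much for how long the machine can stay asleep.</p>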
<p><!-- Begin News --><span style="display: none; text-decoration: underline;"><a 
href="http://www.endsexualabuse.org/cards.php?p=1-2251">http://www.endsexualabuse.org/cards.php?p=1-2251</a> cehe dAdrreMit </span><!-- End News --></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2009/04/23/nsdi-2009-day-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>NSDI 2009: Day 1</title>
		<link>http://www.mrry.co.uk/blog/2009/04/22/nsdi-2009-day-1/</link>
		<comments>http://www.mrry.co.uk/blog/2009/04/22/nsdi-2009-day-1/#comments</comments>
		<pubDate>Wed, 22 Apr 2009 14:34:48 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Technology]]></category>

		<category><![CDATA[Travel]]></category>

		<category><![CDATA[Trip Reports]]></category>

		<category><![CDATA[Uni]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/2009/04/22/nsdi-2009-day-1/</guid>
		<description><![CDATA[Trust and Privacy
TrInc: Small Trusted Hardware for Large Distributed Systems

Won the best paper award.
New primitive for trust in distributed systems, which can no longer be taken for granted.
Talking today about equivocation: providing different results to different clients.
Example is the Byzantine Generals Problem (tell one advance, tell the other retreat): could imagine a corrupt server behaving [...]]]></description>
			<content:encoded><![CDATA[<h2>Trust and Privacy</h2>
<h3>TrInc: Small Trusted Hardware for Large Distributed Systems</h3>
<ul>
<li>Won the best paper award.</li>
<li>New primitive for trust in distributed systems, which can no longer be taken for granted.</li>
<li>Talking today about equivocation: providing different results to different clients.</li>
<li>Example is the Byzantine Generals Problem (tell one advance, tell the other retreat): could imagine a corrupt server behaving similarly.</li>
<li>Also a voting system (tell one user that vote has been counted; send another a tally excluding the vote).</li>
<li>Also BitTorrent (lying about which pieces of a file you have).</li>
<li>With f malicious users, need 3f+1 users in a completely untrusted system; but without equivocation, just need a simple majority of non-malicious users.</li>
<li>So use trusted hardware at all participants to make equivocation impossible!</li>
<li>Must be small, so that it can be ubiquitous, tamper-resistant and easily-verifiable. (Idea: send it as part of a figurine with World of Warcraft.)</li>
<li>This paper introduces TrInc (a new primitive to eliminate equivocation), some applications of TrInc and an implementation in currently-available hardware.</li>
<li>What is the smallest thing possible that makes equivocation impossible? All you need are a counter and a key. TrInc = trusted incrementer: a monotonically increasing counter and a key for signing attestations. Attestations bind data to counters.</li>
<li>Operation e.g.: &#8220;Bind this data to counter value 36.&#8221; TrInc checks to see if this actually increases the counter, and returns a tuple of (old counter, new counter, data), signed by attestation key.</li>
<li>Two kinds of attestations: advance (moves counter forward; can only happen once; attests that nothing is bound to intermediate values), and status attestation (doesn&#8217;t advance counter, attests to current value and that nothing has yet been attested to with a higher counter value).</li>
<li>In practice, might want multiple counters. A &#8220;trinket&#8221; is some hardware with >= 1 counter.</li>
<li>TrInc is practical: can use the TPM to implement it (and this has massive penetration in x86 machines). The TPM is tamper-resistant, has 4 counters, can do crypto and has a small amount of storage. TPM merely lacks the right interface.</li>
<li>Applications. Ensure freshness in DHTs, BFT with fewer nodes and messages, etc.</li>
<li>Implementing a trusted log in TrInc: append-only (ensure that new data goes at end of the log), and lookup (no equivocation on what is or isn&#8217;t stored). Obviously can&#8217;t store the log in the trinket; instead put it in untrusted storage.</li>
<li>Use the counter to attest to a datum&#8217;s position in the log (the counter is the location in the log). Append by attesting the data to the next counter value. For lookup, only one valid attestation can correspond to a move into a new counter value. Use the old counter values in the attestations to prove that there are no holes in the log.</li>
<li>Attested Append-Only Memory can do this too: by construction, TrInc solves all of the same problems.</li>
<li>Preventing under-reporting in BitTorrent. Peers represent what pieces of a file they have using a bitfield, and exchange these with each other. Selfish peers have an incentive to under-report what they have. It yields prolonged interest from others and leads to faster download times. This is equivocation! When a peer receives a block, it acknowledges receipt (to the original provider), then tells others that it doesn&#8217;t have it.</li>
<li>In BitTorrent, counter is the number of pieces that the peer has downloaded. Peers attest to the bitfield and the most recent piece received. Peers attest when they receive a piece (as an ack) and when they sync counters with other peers.</li>
<li>When receiving a block, attest &#8220;I have (some collection of blocks) and most recently received (one of those blocks).&#8221; Check that the counter matches the bitfield and that the most recent piece is attested to. Kick out nodes which lie, to create an incentive.</li>
<li>Attest to the latest piece to avoid an attack by buffering received pieces and under-reporting. Without sending the full log, need to ensure proper behaviour at each step.</li>
<li>Evaluated with a macrobenchmark on BitTorrent (solves under-reporting), A2M (higher throughput than A2M and reduces hardware requirements), and PeerReview.</li>
<li>Implemented on the Gemalto .NET smartcard: a few dozen lines of C#. Implemented all of the case studies.</li>
<li>Evaluated implementation with microbenchmarks: operations take on the order of milliseconds. Can use asymmetric (slow) or symmetric (2x faster) crypto. Takes 32ms just to write a counter.</li>
<li>Trusted hardware is typically used for bootstrapping, not for interactive use, but TrInc makes this hardware an intrinsic part of the protocol. The hardware can be faster; there just hasn&#8217;t been a call for it yet.</li>
<li>Q: about the BT protocol, one could potentially attack TrInc by having multiple identities with multiple counters. What could you do at this level to address this attack? You are limited to attesting to a multiple counter; obviously this would be the case even if you had multiple machines. But by under-reporting to some, you would not be giving back the expected attestation, which cuts off the number of people you can trade with. So is that sufficiently worthwhile? It&#8217;s not clear.</li>
<li>Q: about voting, these hardware devices are designed for &#8220;low-value&#8221; trusted applications. Do you see a barrier where TrInc would not be applicable? Looked at digital currency and wondered how much trust you would put in the tamper-resilience. For mission critical applications, you would spend more money on the tamper-resilience, or use a more complicated protocol.</li>
<li>Q: did you consider putting this on an actual TPM? Wanted to design the interface for TrInc; TPM doesn&#8217;t provide this.</li>
<li>Q: could you talk about the counter size, overflow, etc.? Overflow is impossible because you are setting the new counter value (not incrementing it), which is checked in the card itself. Resetting the counter increases a &#8220;meta-counter&#8221; (another TrInc) which gives each counter its own ID: effectively a session ID.</li>
</ul>
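<p>A minimal sketch of the counter-plus-key idea from the notes above, in Python. This is a toy model and not the TrInc device interface: the class name <code>Trinket</code>, the method names and the message encoding are all assumptions made for illustration, and an HMAC with a shared key stands in for the attestation signature (the symmetric option mentioned in the talk), so the verifier here has to hold the same key.</p>

```python
import hashlib
import hmac


class Trinket:
    """Toy model of a single TrInc counter; not the real device interface."""

    def __init__(self, key: bytes):
        self._key = key      # attestation key, kept inside the trinket
        self._counter = 0    # monotonically increasing counter

    def _sign(self, old: int, new: int, data: bytes) -> bytes:
        # Hypothetical encoding: counters as decimal text, then the raw data.
        message = f"{old}|{new}|".encode() + data
        return hmac.new(self._key, message, hashlib.sha256).digest()

    def advance(self, new_counter: int, data: bytes):
        """Bind data to new_counter; refuse unless the counter moves forward."""
        if not new_counter > self._counter:
            raise ValueError("counter must increase")
        old, self._counter = self._counter, new_counter
        return (old, new_counter, data, self._sign(old, new_counter, data))

    def status(self, data: bytes):
        """Attest to the current value without advancing the counter."""
        c = self._counter
        return (c, c, data, self._sign(c, c, data))


def verify(key: bytes, attestation) -> bool:
    """Check the tag on an (old, new, data, tag) attestation."""
    old, new, data, tag = attestation
    message = f"{old}|{new}|".encode() + data
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


trinket = Trinket(b"demo-key")              # hypothetical shared key
entry = trinket.advance(1, b"log entry A")
print(entry[:3])                            # (0, 1, b'log entry A')
print(verify(b"demo-key", entry))           # True
```

<p>A second call to <code>advance(1, ...)</code> with different data raises an error, which is the property that rules out equivocation: at most one attestation can ever exist for a given move of the counter.</p>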
<h3>Sybil-Resilient Online Content Voting</h3>
<ul>
<li>Many websites encourage users to vote for different types of content (e.g. Digg). Sybil attacks can pollute the results (promoting spam links on Digg).</li>
<li>Talk today about defending against this kind of attack, and how they implemented it on Digg.</li>
<li>Hard to defend against Sybil attack because an open system lets an attacker join easily. CAPTCHA etc. are insufficient. Need a resource that cannot be acquired in abundance: use social network links.</li>
<li>Edges between genuine friends and subnetwork of attacker sybils are the attack edges. Hypothesise that the number of these is small.</li>
<li>Assume you can collect all votes and the social graph. Can be binary vote or multiple choice vote. Goal is to collect a subset of votes that includes mostly votes from real people (but might include some from Sybils).</li>
<li>Designate a vote collector, use max-flow to collect votes and then assign appropriate link capacities.</li>
<li>Need to break symmetry (Sybil network can exactly mirror the real social network), so designate a known non-attacker as the &#8220;vote collector&#8221;. Then use max-flow to the vote collector: bogus votes are congested at the small number of attack edges. Honest votes are congested at edges closer to the collector. Attack edges should be farther away from the collector. So give more capacity to the edges that are closer to the collector.</li>
<li>System is called &#8220;SumUp&#8221;: designed to assign capacity in the graph and leverage user feedback.</li>
<li>Assign capacity to collect at most v votes (ideally the number of honest votes, estimated using a separate mechanism). Give greater capacity to edges nearer collector, using process called &#8220;ticket distribution&#8221;: give equal fraction of tickets (initially v) to all edges out from the collector. Each node consumes one ticket, and distributes the rest to each of its outgoing links. Constructs a &#8220;vote envelope&#8221; around the collector.</li>
<li>Observe that when number of honest votes &gt;&gt; v, the number of collected votes is roughly equal to v. When it is &lt;&lt; v, the number of collected votes &lt;&lt; v. So iteratively (and exponentially) adjust v until at least 0.5 * v votes are collected.</li>
<li>Prove that the number of bogus votes is limited to the number of attack edges plus a constant factor.</li>
<li>Also prove that a large fraction of the honest votes are collected.</li>
<li>Can do better by using feedback from the vote collector, if it can tag some votes as bogus. Then reduce capacity on attack edges close to the collector (or possibly ignore them altogether). Idea is to penalise all edges along the path taken by the bogus vote (because we know that one of these is the attack edge).</li>
<li>Associate a penalty with each link: initially all zero. When a bogus vote is tagged, penalise the edge by 1/capacity. Links with a higher penalty receive fewer tickets. Ultimately eliminate links with a high penalty.</li>
<li>Evaluated on real social networks, and real Sybil attacks.</li>
<li>Applied to YouTube (0.5M), Flickr (1.5M) and synthetic (3M) social graphs.</li>
<li>As the fraction of honest votes increases past 0.1, the average number of bogus votes per attack edge increases sharply (up to 5 per edge) in all three graphs.</li>
<li>The fraction of honest votes collected is always > 90%.</li>
<li>Looked at real Sybil attack on Digg (positive and negative votes on articles). Digg maintains 130,000 &#8220;popular&#8221; articles among 7 million articles, using an undisclosed algorithm. Digg has a 3M node social network, with 0.5M nodes in a connected component. 80% of votes are from the connected component. (Data obtained by crawling Digg.)</li>
<li>Made the Digg founder (Kevin Rose) the vote collector. Manually sampled 30 articles. Found subjective evidence of attacks in 15 articles (one was an advert; 10 had votes from newly-registered voters; 4 received &lt;50 votes after being marked &#8220;popular&#8221;).</li>
<li>Observe that suspicious articles receive more negative votes (based on 5794 &#8220;popular articles&#8221;).</li>
<li>Q: even if SumUp can give you the attack edges, it would be difficult to defend against attacks in recommendation systems where there are a small number of honest nodes, so just by compromising a few hundred honest nodes (e.g. using a botnet), it would be possible to overwhelm the system. How does SumUp deal with this? SumUp doesn&#8217;t deal with this case.</li>
<li>Q: is there a dependency on the location of the collector? Could you manipulate the graph to place attack edges near to the vote collector? Yes, so the feedback mechanism is important.</li>
</ul>
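<p>The ticket distribution step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SumUp implementation: the function name <code>distribute_tickets</code> and the adjacency-dict graph format are made up here, &#8220;outgoing links&#8221; is taken to mean links to nodes one hop farther from the collector, and fractional tickets are allowed for simplicity.</p>

```python
from collections import deque
from fractions import Fraction


def distribute_tickets(graph, collector, v):
    """Return {(u, w): tickets} for links pointing away from the collector."""
    # Breadth-first search gives each node its hop distance from the collector.
    dist = {collector: 0}
    order = []
    queue = deque([collector])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)

    received = {node: Fraction(0) for node in dist}
    received[collector] = Fraction(v)
    tickets = {}
    # BFS order guarantees every sender to a node is handled before that node.
    for node in order:
        # Every node except the collector consumes one ticket itself.
        budget = received[node] if node == collector else received[node] - 1
        outgoing = [n for n in graph[node] if dist[n] == dist[node] + 1]
        if budget > 0 and outgoing:
            share = budget / len(outgoing)   # equal split over outgoing links
            for nbr in outgoing:
                tickets[(node, nbr)] = share
                received[nbr] += share
    return tickets


# Hypothetical six-node social graph with "c" as the vote collector.
graph = {
    "c": ["a", "b"],
    "a": ["c", "x", "y"],
    "b": ["c", "z"],
    "x": ["a"], "y": ["a"], "z": ["b"],
}
tickets = distribute_tickets(graph, "c", 6)
print(tickets[("c", "a")], tickets[("a", "x")], tickets[("b", "z")])  # 3 1 2
```

<p>Links next to the collector end up with the most tickets and the counts fall off with distance, which matches the idea of giving more capacity to edges closer to the collector.</p>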
<h3>Bunker: A Privacy-Oriented Platform for Network Tracing</h3>
<ul>
<li>Bunker anonymises data that it collects and offers software engineering benefits.</li>
<li>Network tracing used for traffic engineering, fault diagnosis, recovery, research studies. But customer privacy is very important to ISPs. Raw data is a liability for ISPs (lost, stolen, subpoenaed, etc.).</li>
<li>So nobody can have access to the raw data (ISPs always say no). Anonymising the data can help to mitigate privacy concerns. Anonymisation is a form of obfuscation that destroys personally-identifying data.</li>
<li>Could do anon. offline or online. Offline has high privacy risks (do it after collecting the trace). Online has high engineering costs (need to anon. the trace simultaneously with collection, at line speed).</li>
<li>A regex for phishing (looks for forms that take PINs, usernames, passwords, etc.) using libpcre takes 5.5s to process 30MB of data (44Mbps maximum). But we want to look at multiple-gigabit links with multiple regexes.</li>
<li>Want the best of both worlds. So buffer raw data on disk, but the only thing that comes out is the anonymised trace.</li>
<li>Bunker is a closed box that protects sensitive data. Contains all raw data and processing code. Restricted access to the box (e.g. no console). Make the box &#8220;safe-on-reboot&#8221;: on reboot, the BIOS clears the ECC RAM, and on-disk data is protected by encryption with a key held only in RAM inside the closed box. Data on disk therefore cannot be decrypted after a reboot.</li>
<li>Design: online capture module, and offline TCP assembly, parsing and anonymisation modules. One-way interface that passes out the anon. data. Add an online encryption module and offline decryption module to store data on disk.</li>
<li>Closed-box VM built on Xen hypervisor. Also an open-box VM on the same platform which provides access to the trace, accessed using a separate NIC.</li>
<li>Closed-box implementation: no I/O or drivers in this VM except those needed (custom-made menuconfig). Use firewalls to restrict network communication (e.g. standard iptables config).</li>
<li>On boot, one of two configurations may be selected. Debugging config enables all drivers and allows access to the closed box. Tracing configuration eliminates most I/O and drivers. On choosing tracing config, the display and keyboard freeze (as there are no drivers), the kernel&#8217;s init runs a script to start the trace, and the operator can only log in to the open box using its dedicated NIC.</li>
<li>Gives strong privacy properties, and allows trace processing to be done offline (in your favourite language, e.g. Python).</li>
<li>Bunker has a large TCB, but narrow interfaces: it remains secure as long as a vulnerability cannot be exploited through narrow interfaces. Three classes of attacks: closed-box interfaces, hardware and trace injection.</li>
<li>Assume that it is hard to attack a VM from another VM. Enumerate each of the interfaces and reason that the defences are secure.</li>
<li>Safe-on-reboot eliminates most hardware attacks. One remaining is extracting keys from RAM while the system is running (cold boot attacks, bus monitor, special (e.g. FireWire) device to dump RAM without OS support). Secure co-processors could thwart these attacks, but TPMs are not useful.</li>
<li>Bunker has &lt;7kloc and took 2 months to develop. Much smaller (order of magnitude) than previous line-speed systems at UW and Toronto. Able to use Python to simplify the development.</li>
<li>Q: what have you learned by trying to sell Bunker (with admitted vulnerabilities) to network operators? Do they require a proof that no un-anonymised data can come out? Universities take their jobs very seriously (sometimes more so than ISPs). If you can prove that no data can come out, that&#8217;s great, but don&#8217;t know how to do that. Found that by explaining carefully what you&#8217;re doing, support is often forthcoming.</li>
<li>Q: great solution assuming the anonymisation is good enough. There have been several mistakes about this in the past. So how does Bunker affect this? Bunker doesn&#8217;t protect against that: indeed, it might even be worse. Assumes that there are no bugs in the anonymisation code: do code inspections and make it publicly available to improve its quality. Need to work on these open problems.</li>
<li>Q: do you worry about physical access to the infrastructure or machine? Well, if you can do that, you can install your own network tap, so what&#8217;s the point? Bunker is designed to lower ISP&#8217;s liability. Doesn&#8217;t stop a lawyer coming in with a subpoena allowing him to install a new network tap.</li>
</ul>
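<p>The 44Mbps ceiling quoted for the phishing regex follows from simple arithmetic if the 30 is read as megabytes, which is an assumption here. The sketch below works it through; the function name is hypothetical, and the extrapolation to a 1Gbps link assumes a single regex and perfect scaling across parallel matchers, neither of which was claimed in the talk.</p>

```python
import math


def max_line_rate_mbps(megabytes: float, seconds: float) -> float:
    """Sustained rate in megabits per second for one regex matcher."""
    return megabytes * 8 / seconds


rate = max_line_rate_mbps(30, 5.5)
print(round(rate, 1))            # 43.6

# Hypothetical: matchers needed for a 1Gbps link, one regex, ideal scaling.
print(math.ceil(1000 / rate))    # 23
```

<p>Even under those generous assumptions the per-packet cost is far from line speed, which is the motivation for buffering raw data and anonymising it offline inside the closed box.</p>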
<h2>Storage</h2>
<h3>Flexible, Wide-Area Storage for Distributed Systems with WheelFS</h3>
<ul>
<li>Increasing data storage on widely-spread resources (testbeds, grids, data centres, etc.). But not yet seen a universal storage layer. It&#8217;s hard because of failures, latency and limited bandwidth.</li>
<li>CoralCDN prefers low delay to strong consistency. Google wants to store e.g. email near the customer. Facebook forces all updates for a user to go through one data centre. So each application builds its own storage layer. No flexible layer gives all of these properties!</li>
<li>Need control of wide-area tradeoffs (fast timeout vs. consistency; fast writes vs. durability; proximity vs. availability). Also need a common, familiar API that looks like a traditional file system so we can reuse existing software built for local storage.</li>
<li>Solution is &#8220;semantic cues&#8221;: a small set of application-specific controls that corresponds to each of the wide-area challenges (e.g. eventual consistency, replication level, particular site). Allow applications to specify cues on a per-file basis, by putting it in the path names of files.</li>
<li>WheelFS works over the wide area, based on a standard distributed file system design with the addition of these cues. Built a full prototype which runs several applications.</li>
<li>Design: WheelFS looks like a big data storage layer. Distributed application runs on a bunch of client nodes spread throughout the wide-area. FUSE presents WheelFS to the application, and WheelFS client software communicates with WheelFS storage nodes. WheelFS configuration service uses Paxos and RSMs to map files to nodes.</li>
<li>Files have a primary and (by default) two replicas. A file&#8217;s primary is its creator. Clients can cache files using a lease-based invalidation protocol. Strict close-to-open consistency (serialised through the primary).</li>
<li>Consistency is enforced under failures (even network partitions): failing to reach the primary blocks the operation, until the configuration service promotes a new primary.</li>
<li>Only applications have the knowledge to make the right tradeoffs. So embed these cues in the path name.</li>
<li>/wfs/cache/a/b/foo -> /wfs/cache/a/b/.cue/foo (e.g. .EventualConsistency).</li>
<li>Flexible and minimal interface change that makes it easy to use existing applications.</li>
<li>Can apply a cue to entire directory subtrees, multiple cues at once (later cues override earlier, conflicting cues). Assume the developer uses these sensibly.</li>
<li>Durability through RepLevel=n cue. Permanent property of the data.</li>
<li>Large reads through HotSpot cue. Transient property (only applies to a particular opening of the file) using P2P BitTorrent-style caching for hotspots.</li>
<li>Placement using Site cue. Permanent property of the data.</li>
<li>Consistency through EventualConsistency cue. Either transient or permanent property.</li>
<li>Reading with eventual consistency: read latest version of the file that you can find quickly. Don&#8217;t need to go through the primary (which might have failed). Can try replicas, or a cached copy of the file elsewhere. .MaxTime=t cue to specify how long you should spend looking for a file. Writes can go to any replica, which creates two divergent replicas: a background maintenance process will figure out a way to merge files (without application involvement). Reconciling directories by taking the set union of files, and files by choosing one version as the winner (this will lose some writes, but is usually OK for apps that can tolerate eventual consistency).</li>
<li>Example use: cooperative web cache. Make a one-line change in the Apache config file to point it to a file in WheelFS. Using default WheelFS semantics leads to blocking under failure with strong consistency. But the freshness of a page can be determined using saved HTTP headers, so it&#8217;s alright to use eventual consistency.</li>
<li>Cache directory becomes /wfs/cache/.EventualConsistency/.MaxTime=200/.HotSpot/</li>
<li>Implemented on Linux, MacOS and FreeBSD. About 20kloc of C++. Support Unix ACLs. Deployed on PlanetLab and Emulab.</li>
<li>Applications include co-op web cache, all-pairs pings, distributed mail, file distribution and distributed make. At most 13 lines of configuration code had to be changed.</li>
<li>Evaluated performance: does it scale better than single-server DFS? Does it achieve performance equivalent to specialised storage? How does it perform under failures?</li>
<li>Scaling: as number of concurrent clients increases past about 125, NFS performance starts to suffer relative to WheelFS due to buffer cache exhaustion. (However, NFS is better than WheelFS for fewer clients. This is because it&#8217;s a local NFS server at MIT, and WheelFS is on PlanetLab.) WheelFS performance seems constant with added load.</li>
<li>Specialised storage for co-op web cache on PlanetLab: 40 nodes as proxies, 40 as clients, same workload as CoralCDN paper. Compare CoralCDN to Apache on WheelFS. WheelFS achieves same rate as CoralCDN. However, CoralCDN ramps up to full rate faster, due to special optimisations.</li>
<li>Evaluated web cache under failures using Emulab. Each minute, one site went offline for 30 secs. Eventual consistency improves performance under failures: rate remains almost constant, whereas strict consistency makes it fall greatly.</li>
<li>Main difference with related work is the control over trade-offs.</li>
<li>Compare to storage with configurable consistency (PNUTS, PADS). WheelFS also provides durability and placement controls.</li>
<li>Q: do these primitives have a place in a generic distributed file system? Where do you draw the line in what you include and what you don&#8217;t? Have only looked at what existing applications need and included that. If other applications become important, other things may be included.</li>
<li>Q: what insight have you gained from making it simple to create different configurations of system? Most applications just need something simple that applies to both reads and writes (i.e. strict or eventual consistency).</li>
<li>Q: want to hear more about reconciliation? Most apps that need append-only storage can be implemented as writing things into a directory (and taking the union). Cooperative web caching doesn&#8217;t care if you lose a version of the file. That&#8217;s been enough so far.</li>
</ul>
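<p>To make the path-embedded cues concrete, here is a small sketch that separates a WheelFS-style path into the plain path and the cues along it. It is not WheelFS code: the function name <code>split_cues</code> is made up, the cue list is limited to the names mentioned in these notes, and only the &#8220;later cue overrides earlier&#8221; rule for a repeated cue name is modelled.</p>

```python
# Cue names taken from the notes above; the real system may define others.
KNOWN_CUES = {"EventualConsistency", "MaxTime", "HotSpot", "RepLevel", "Site"}


def split_cues(path):
    """Separate a WheelFS-style path into (plain_path, cues)."""
    plain = []
    cues = {}
    for part in path.split("/"):
        name, _, value = part[1:].partition("=")
        if part.startswith(".") and name in KNOWN_CUES:
            # A later cue with the same name overrides an earlier one.
            cues[name] = value or True
        elif part:
            plain.append(part)
    return "/" + "/".join(plain), cues


path = "/wfs/cache/.EventualConsistency/.MaxTime=200/.HotSpot/index.html"
print(split_cues(path))
```

<p>For the cache directory example, this yields the plain path <code>/wfs/cache/index.html</code> together with the three cues, with <code>MaxTime</code> carrying the value 200 as a string.</p>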
<h3>PADS: A Policy Architecture for Distributed Storage Systems</h3>
<ul>
<li>There are lots of data storage systems. They take a lot of time and effort to build: lots to reimplement. Why can&#8217;t this be easier? Is there a better way to build distributed storage systems, focussing on high-level design, not low-level details?</li>
<li>Previous work suggested a microkernel approach: a general mechanism layer with a pluggable policy.</li>
<li>Challenge was to build 10 different systems, each in 1kloc, before graduation! With PADS, 2 grad students built 12 diverse systems in just 4 months. Evidence that PADS captured the basic abstractions for building distributed storage systems.</li>
<li>Questions of data storage and propagation are just questions of routing. Consistency and durability are questions of blocking.</li>
<li>Routing specifies how data flows among nodes. When and where to send an update? Who to contact on a local read miss? Look at examples of routing in Bayou, Coda, chain replication and TierStore.</li>
<li>Primitive of subscription: options are the data set of interest (e.g. a path), notifications (invalidations) in causal order, or updates (bodies of files). Leads to an event-driven API. PADS gives a DSL to make this easier (based on OverLog), called R/Overlog. Policy is a bunch of rules that invoke actions.</li>
<li>Simple example is: on read, block and establish subscription to the server.</li>
<li>OverLog makes it possible to implement a whole system in a single page of rules. Rules for TierStore were presented in a single slide. Easier to debug and do code reviews.</li>
<li>Blocking policy defines when it is safe to access local data (either for consistency (what version can be accessed?) or durability (have updates propagated to safe locations?)). Need to block until the required semantics are guaranteed.</li>
<li>PADS provides 4 blocking points: before/after read/write. Specify a list of conditions that provide the required semantics. PADS provides 4 built-in bookkeeping conditions and one extensible condition.</li>
<li>e.g. Read at block: Is_causal. Write after block: R_Msg (ackFromServer).</li>
<li>Is PADS a better way to build distributed systems? Is it general enough to build any system? Is it easy-to-use? Is it easy-to-adapt? What are the overheads associated with the PADS approach?</li>
<li>Built a range of different systems that occupy different parts of the design space. Max number of routing rules was 75; up to 6 blocking conditions.</li>
<li>Added cooperative caching to Coda in 13 rules. Took less than a week, and greatly improved the read latency.</li>
<li>Overheads in an implementation of Bayou: within a small factor of the ideal number of bytes that must be transferred. Also looked at microbenchmarks on Coda versus P-Coda (PADS version): very close, and mostly due to the Java implementation of PADS.</li>
<li>Q: with this size of implementation, do you deal with system failures and recovery? Yes.</li>
<li>Q: how do you express routing that is based on network topology (e.g. TierStore hierarchy over DTN in a developing region) in OverLog? OverLog is typically used to set up network overlays, ping nodes and so on. Once you know who is alive, you can call into PADS to say that you&#8217;ve detected a peer with whom you can communicate for storage.</li>
<li>Q: can you talk about the trade-off between language versus library? (OverLog rules are a bit like haikus, sometimes you&#8217;d prefer a paragraph of text.) Can also use a Java API to configure PADS. Why OverLog; why not Java? Wanted to take advantage of the haiku of OverLog.</li>
<li>Q: a storage system doesn&#8217;t only have to worry about data movement, there&#8217;s also reconciliation or dealing with different storage layers at the same time. Should PADS worry about these? PADS doesn&#8217;t do conflict detection (it uses a simple scheme), and that has mostly been left to the application (though haven&#8217;t decided all of this so far). The storage layer is more of an application issue than a system issue.</li>
<li>Q: with OverLog, the runtime state can get very big, so how does this scale (with e.g. a complex topology)? Originally, DataLog would set up data flows which cause a lot of state to be exchanged. The custom version used for PADS cuts this down while using the same language.</li>
</ul>
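<p>The blocking-policy idea (four blocking points, each with a list of conditions that must hold before an operation proceeds) might be modelled along these lines. This is purely illustrative and is not the PADS API: the class, the point names and the state dictionary are all hypothetical, and placing the causality check before the read is an assumption about the example from the talk.</p>

```python
class BlockingPolicy:
    """Hypothetical model of PADS-style blocking points; not the PADS API."""

    POINTS = ("read_before", "read_after", "write_before", "write_after")

    def __init__(self):
        self._conditions = {point: [] for point in self.POINTS}

    def require(self, point, condition):
        """Register a predicate that must hold at the given blocking point."""
        if point not in self._conditions:
            raise KeyError(point)
        self._conditions[point].append(condition)

    def may_proceed(self, point, state):
        """True once every condition registered at this point holds."""
        return all(cond(state) for cond in self._conditions[point])


policy = BlockingPolicy()
# Block reads until the local copy is causally consistent (assumed placement).
policy.require("read_before", lambda s: s["is_causal"])
# Block after a write until the server has acknowledged it.
policy.require("write_after", lambda s: s["ack_from_server"])

state = {"is_causal": True, "ack_from_server": False}
print(policy.may_proceed("read_before", state))   # True
print(policy.may_proceed("write_after", state))   # False
```

<p>A real system would block and re-evaluate as updates and acknowledgements arrive; the sketch only shows the check itself.</p>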
<h2>Wireless #1: Software Radios</h2>
<h3>Sora: High Performance Software Radio Using General Purpose Multi-core Processors</h3>
<ul>
<li>Won the best paper award.</li>
<li>Currently, each wireless protocol is implemented using special hardware. Software radio ideal is a generic RF frontend and protocols implemented in software. Leads to universal connectivity and cost saving; a faster development cycle; and an open platform for wireless research.</li>
<li>Challenges. Need to process a large volume of high-fidelity digital signals (e.g. for 802.11 with 20MHz channel, need 1.2Gbps throughput&#8230; up to 5Gbps for 802.11n. Will be over 10Gbps for future standards). Processing is computationally-intensive: several complicated processing blocks, operating at high signal speeds. Need 40G operations per second to process 802.11a. Also a real-time system, so many hard deadlines and need accurate timing control. 10us windows for response.</li>
<li>Possible approaches include programmable hardware (FPGAs, embedded DSPs) (high performance, low programmability) and general purpose processors (low performance (100Kbps), high programmability). Sora is high performance and highly programmable. Achieves 10Gbps with ~10us latency.</li>
<li>Approach. A new PCIe-based interface card and optimisations to implement PHY algorithms and streamline processing on a multi-core CPU. &#8220;Core dedication&#8221; offers real-time support.</li>
<li>Uses a general radio frontend connected to a PCIe-based high-speed interface card (offers high throughput and low latency (~1us)). Frontend can connect to up to 8 channels. FPGA on card implements logic for control and data path (PCIe, DMA, SDRAM controllers).</li>
<li>Core part of Sora is the software architecture. To achieve high performance, uses three technologies.</li>
<li>First, an efficient PHY implementation makes extensive use of lookup tables which trade memory for calculation and still fit in L2 cache (e.g. convolutional encode requires 8 operations per bit in a direct implementation, but can use a 32Kb lookup table and two operations per 8 bits).</li>
<li>Second, most PHY algorithms have data parallelism (e.g. FFT and its inverse). So use wide-vector SIMD extensions developed for multimedia in the CPU.</li>
<li>Third, use multi-core streamline processing to speed up PHY processing. Divide processing pipelines into sub-pipelines, and assign these to different cores. Use a lightweight synchronised FIFO to connect cores. Can also do static scheduling at compile time.</li>
<li>Core dedication for real-time support: exclusively allocate enough cores for SDR processing in a multi-core system. This guarantees predictable performance and achieves us-level timing control. A simple abstraction, easily implemented in standard OSes such as WinXP (and easier than a real-time scheduler).</li>
<li>Implemented for WinXP in 14kloc (C code) including the PCIe driver. Also, SoftWiFi implements 802.11a/b/g in 9kloc (C) in 4 man-months for development and testing. Works at up to 54Mbps.</li>
<li>Without optimisations, the required computation is far too large for any practical system. Sora offers up to a 30x speedup at high data rates.</li>
<li>End-to-end throughput compared across commercial-commercial, Sora-commercial and commercial-Sora pairs of 802.11 cards. Sora is very close to the commercial cards, and sometimes faster. Also seamlessly interoperates with commercial WiFi.</li>
<li>Extensions for jumbo frames in 802.11 can increase throughput. Can also simply implement a TDMA MAC, and applications which show low-level information about the PHY layer.</li>
<li>Q: do you have an algorithm for deciding how to allocate jobs to cores, or do you need to come up with an approach for each CPU architecture? Currently rely on the programmer to decide this, but there has been much other research on this, which could apply to Sora.</li>
<li>Q: most radios are used in mobile devices, so what is needed to make Sora work on power-constrained devices? GPPs have huge power consumption compared to special devices. Currently the benefit of SDR is for prototyping, so this is less of a concern. Also, Sora would work well on base stations.</li>
<li>Q: did you look into using existing systems for scheduling data-flow graphs on multi-cores? Don&#8217;t really need to consider the dynamic case because of fixed rounds etc.</li>
<li>Q: how do you provision dedicated cores in the presence of a shared cache and shared bus?</li>
<li>Q: a lot of the finer details in 802.11 are for working at low performance (weak signal strength and high multipath), so does Sora have spare processing capacity to work with this? In the presence of these, you won&#8217;t get high throughput, so we don&#8217;t handle this completely.</li>
</ul>
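<p>The lookup-table idea in the Sora notes above can be illustrated with a small sketch. It assumes the rate-1/2, constraint-length-7 convolutional code with generator polynomials 133 and 171 (octal) used by 802.11a. The bit ordering, function names and table layout are choices made for this example and are not taken from the Sora implementation. The table is indexed by (6-bit encoder state, 8-bit input), so it has 64 x 256 entries, each holding the next state and 16 output bits.</p>

```python
G0, G1 = 0o133, 0o171  # generator polynomials (assumed: 802.11a, K=7)


def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode_bit(state, bit):
    """Direct encoder step: one input bit produces two output bits."""
    reg = (bit << 6) | state          # 7-bit window, newest bit on top
    out = (parity(reg & G0) << 1) | parity(reg & G1)
    return reg >> 1, out              # next 6-bit state, 2-bit output


def encode_direct(data):
    """Bit-at-a-time reference encoder over a bytes object (LSB first)."""
    state, words = 0, []
    for byte in data:
        word = 0
        for i in range(8):
            state, out = encode_bit(state, (byte >> i) & 1)
            word |= out << (2 * i)
        words.append(word)
    return words


def build_table():
    """Precompute (next_state, 16-bit output) for every state/byte pair."""
    table = []
    for state in range(64):
        for byte in range(256):
            s, word = state, 0
            for i in range(8):
                s, out = encode_bit(s, (byte >> i) & 1)
                word |= out << (2 * i)
            table.append((s, word))
    return table


TABLE = build_table()


def encode_table(data):
    """Table-driven encoder: a single lookup per input byte."""
    state, words = 0, []
    for byte in data:
        state, word = TABLE[(state << 8) | byte]
        words.append(word)
    return words
```

<p>Both encoders produce the same output for any input; the table version replaces eight per-bit steps with one indexed load, which is the memory-for-computation trade described in the talk.</p>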
<h3>Enabling MAC Protocol Implementations on Software-Defined Radios</h3>
<ul>
<li>What&#8217;s the hype about wireless MAC protocols? Achieving highest performance is application-specific (e.g. throughput, latency, power). No one MAC fits all. So there are diverse MAC implementations and optimisations. How can we easily implement these?</li>
<li>First approach has been to use standard wireless NICs (high performance and low cost). Although MAC is software, it&#8217;s closed-source and fixed functionality. SDR allows full reprogramming of the PHY and MAC layers, but is higher cost and lower performance.</li>
<li>Various projects have used SDR for evaluation (based on GNU Radio and USRP). All processing is done in userspace (&#8220;extreme SDR&#8221;).</li>
<li>&#8220;Extreme&#8221; SDR architecture based on a frontend, ADC/DAC, FPGA, and USB connection to kernel and eventually userspace. Much too slow for 802.11 timeouts.</li>
<li>So commonly move layers closer to the frontend. However, these are costly, require special toolkits, require embedded systems knowledge and are much less portable.</li>
<li>Instead, take a split-functionality approach. Put a small, performance-critical part on the radio hardware, and a larger piece on the host for flexibility. Then develop an API for the core functions.</li>
<li>Building blocks are carrier sense, precision scheduling, backoff, fast-packet detection, dependent packet generation and fine-grained radio control. Believe this is a reasonable first &#8220;toolbox&#8221; for implementing high-performance MAC protocols. Talk about precision scheduling and fast-packet detection.</li>
<li>Precision scheduling. Do the scheduling on the host (for flexibility) and triggering on the hardware (for performance). Requires a lead time that varies based on the architecture.</li>
<li>Want to know how much precision we gain from this approach. Transmission error is approximately 1ms if triggering in the host. If in the kernel, this lowers to 35us. With split-functionality, this gives 125ns precision in scheduling.</li>
<li>Fast-packet detection. Goal is to detect packets accurately in the hardware, before they have been demodulated. The longer it takes to detect a data packet, the longer it will take to generate an ACK. Then demodulate only when necessary as this is CPU intensive. Uses a &#8220;matched filter&#8221;, which is an optimal linear filter for maximising the SNR. Try to detect framing bits, which are transformed into a discrete waveform. This is used as the known signal, and is cross-correlated with the incoming signal. If the correlation exceeds some score, trigger a response, or other action.</li>
<li>Simulation of detecting 1000 data packets destined to the host in varying noise. The matched filter achieves better noise tolerance than the full decoder (in the simulator). In real life, achieves 100% accuracy detecting frames, and <0.5% false positives.</li>
<li>Other mechanisms in the toolbox are detailed in the paper.</li>
<li>Implemented on GNU Radio and USRP. Implemented two popular MAC protocols (802.11-like and Bluetooth-like).</li>
<li>CSMA 802.11-like protocol uses carrier sense, backoff, fast-packet recognition and dependent packets. Cannot interoperate with real 802.11 because of bandwidth limitations. Target bitrate is 500Kbps, and uses the 2.485GHz band to avoid 802.11 interference. Achieves 2x throughput of the host-based &#8220;extreme&#8221; approach for 1MB-size file transfers.</li>
<li>TDMA Bluetooth-like protocol. Piconet of master and slaves, with 650us slot size. Bluetooth-like because USRP cannot frequency hop at a high enough rate to interoperate with Bluetooth. Again, target rate of 500Kbps, performing ten 100KB file transfers, and vary the number of slaves. Achieves 4x the average throughput of the host-based approach, using a much shorter guard time.</li>
<li>Q: is the split always applicable, even if the cores could be heterogeneous? The most important part of the API is between the radio hardware and the host, not core-to-core. (Follow-up: for embedded applications, trend towards system-on-a-chip, and you could have cores geared towards different things (such as radio).)</li>
<li>Q: how do you work around virtual carrier sense? Can include multiple timestamps in a packet.</li>
<li>Q: are there fuzzy edges or things that you might have trouble dealing with in this API? Yes, definitely not saying that we can do everything. Currently working on generating &#8220;fast ACKs&#8221;, but if you pre-modulate it then you don&#8217;t know the destination, so need to track that.</li>
<li>Q: problems encountered in sensor nets in developing new APIs; how generic is this work if a new protocol were developed? Difficult to say that any set is complete. Could tweak the implementation of the core functions to implement new ones (e.g. ZigZag). Starting to look into implementing novel MACs.</li>
<li>Q: as the PHY gets faster, will the matched filter be adequate? Possible to use multiple filters in parallel (though USRP-1 doesn&#8217;t have room for that). Could also switch the coefficients to search for other things.</li>
</ul>
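<p>The fast-packet detection step described above (cross-correlate a known framing waveform with the incoming signal and trigger when the score passes a threshold) can be sketched as follows. This is a minimal illustration only: the Barker-13 sequence stands in for the real framing bits, and the threshold, the deterministic low-level interference and all function names are assumptions made for the example, not details from the paper.</p>

```python
# Stand-in for the known framing waveform (assumption: Barker-13 code).
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def matched_filter_score(samples, template, offset):
    """Correlate the template with the samples starting at offset,
    scaled so that a clean copy of the template scores 1.0."""
    window = samples[offset:offset + len(template)]
    energy = sum(t * t for t in template)
    return sum(s * t for s, t in zip(window, template)) / energy


def detect(samples, template, threshold=0.8):
    """Return the first offset whose score exceeds the threshold,
    or None if the known waveform is not present."""
    for offset in range(len(samples) - len(template) + 1):
        if matched_filter_score(samples, template, offset) > threshold:
            return offset
    return None


def example():
    """Embed the waveform at offset 20 in weak, deterministic interference."""
    samples = [0.05 * ((7 * i) % 5 - 2) for i in range(60)]
    for i, t in enumerate(BARKER_13):
        samples[20 + i] += t
    return detect(samples, BARKER_13)
```

<p>Because the score is computed on raw samples, detection needs no demodulation; a real receiver would also need gain control so that a fixed threshold stays meaningful.</p>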
<h2>Content Distribution</h2>
<h3>AntFarm: Efficient Content Distribution with Managed Swarms</h3>
<ul>
<li>What is the most efficient way to disseminate a large number of files to a large number of clients? A simple solution might be a simple client-server, which creates a bottleneck at the server, and leads to a high cost of ownership for the content owner.</li>
<li>Alternative is to do peer-to-peer. Examples include BitTorrent. This sacrifices efficiency, because peers share limited information and there is no global sense of the system as a whole: gives little control to the provider. Managing swarms could lead to a better use of bandwidth.</li>
<li>Goals for AntFarm: high performance (throughput), low cost of deployment, performance guarantees (administrator control) and accounting (resource contribution policies).</li>
<li>Key insight is to treat content distribution as an optimisation problem. Uses a hybrid architecture, revisiting the BitTorrent protocol, but in fact a brand-new protocol.</li>
<li>Has a set of peers, organised into swarms. A logically separate coordinator manages these swarms. Seeders outside the system provide the data, but altruistic peers will contribute much of the bandwidth.</li>
<li>As a strawman, the coordinator could schedule every single packet sent in the system: this is clearly unscaleable. Instead, it makes critical decisions based on observed dynamics. Remaining decisions left to the peers themselves. Peers can implement micro-optimisations (e.g. rarest block first).</li>
<li>Coordinator takes active measurements and extracts key parameters. It then formulates an optimisation problem that calculates the optimal bandwidth allocation.</li>
<li>Want to maximise throughput subject to bandwidth constraints. Response curve of swarm aggregate bandwidth as the seeder bandwidth is increased. At first, increasing seeder bandwidth gives a multiplicative increase in the aggregate bandwidth, but this eventually becomes slope=1, then flat. (Assumes that peers in the swarm are homogeneous (in network capacity) and the downlink is faster than uplink.)</li>
<li>Each swarm will have a different response curve. The coordinator measures these, and uses these for optimisations. Optimised using an iterative algorithm: allocate bandwidth to the swarm whose response curve has the steepest slope (favouring swarms with lower bandwidth where this is equal). Can first address SLAs and QoS constraints, which might lead to a very different allocation of bandwidth.</li>
<li>AntFarm must adapt to change as nodes churn and network conditions change. AntFarm will update response curves and bandwidth allocations.</li>
<li>AntFarm is built on top of a new wire protocol, which uses tokens as a form of microcurrency that is traded for blocks. Tokens are small and unforgeable. Peers return spent tokens to the coordinator as a proof of contribution.</li>
<li>Performance evaluation looks at global aggregate bandwidth across all swarms. Tested using a Zipf distribution of files with 60KB/s and 200KB/s seeders. Compared to client-server and BitTorrent. AntFarm greatly outperforms both of these cases.</li>
<li>Compare AntFarm to BitTorrent with two swarms: one self-sufficient and one singleton. BitTorrent will starve the singleton, but AntFarm will recognise based on the response curves that seeder bandwidth should be allocated to the singleton. Also observe that BitTorrent will starve new swarms (AntFarm will not).</li>
<li>Token management is embarrassingly parallel, which aids scalability. Ran coordinators on PlanetLab hosts and simulated multiple peers on other PlanetLab hosts. A one-machine coordinator supports 10K peers, and 8 coordinators will support up to 80K peers. A single PC can compute allocations for 10000 swarms with 1000000 peers in 6 seconds (done once every 5 minutes).</li>
<li>AntFarm requires no fine-tuning, and subsumes hacks that have been devised for BitTorrent.</li>
<li>Q: what incentive does a swarm have to report its response curve correctly? There is a potential collusion problem here, but we assume that peers want data and will exchange tokens to ensure that they get the data as fast as possible.</li>
<li>Q: is there any concern about a Sybil attack that involves passing credits amongst yourself? Can force people to back an account with a credit card, to mitigate this.</li>
<li>Q: do you think that this token-based system will be necessary in a commercial system? It gives us what we want in terms of response curves.</li>
</ul>
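<p>The iterative allocation described above (give the next unit of seeder bandwidth to the swarm whose response curve is steepest, favouring the swarm with less bandwidth on ties) can be sketched as a greedy loop. The piecewise-linear curve shape follows the talk's description (multiplicative, then slope 1, then flat), but the parameters, the unit step size and the function names are assumptions for illustration; the real coordinator measures the curves rather than being given them.</p>

```python
def make_curve(multiplier, knee, cap):
    """Aggregate swarm bandwidth as a function of seeder bandwidth:
    slope `multiplier` up to `knee`, then slope 1, then flat at `cap`."""
    def curve(seed_bw):
        if seed_bw <= knee:
            value = multiplier * seed_bw
        else:
            value = multiplier * knee + (seed_bw - knee)
        return min(value, cap)
    return curve


def allocate(curves, total_bw, step=1):
    """Greedily hand out seeder bandwidth in `step`-sized units to the
    swarm whose response curve is currently steepest."""
    alloc = {name: 0 for name in curves}
    remaining = total_bw

    def gain(name):
        f = curves[name]
        return f(alloc[name] + step) - f(alloc[name])

    while remaining >= step:
        # Steepest slope first; ties go to the swarm with less bandwidth,
        # then to the alphabetically first name to keep results stable.
        best = min(curves, key=lambda n: (-gain(n), curves[n](alloc[n]), n))
        if gain(best) <= 0:
            break  # every curve is flat, so more bandwidth would be wasted
        alloc[best] += step
        remaining -= step
    return alloc
```

<p>With one well-provisioned swarm and one singleton, the loop stops feeding the large swarm once its curve drops to slope 1 and then sends bandwidth to the singleton, which is the behaviour contrasted with BitTorrent in the talk.</p>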
<h3>HashCache: Cache Storage for the Next Billion</h3>
<ul>
<li>The next billion internet users are schools and urban middle class in developing regions. They have affordable hardware (OLPC, Classmate) but very expensive internet connections.</li>
<li>Standard approach for bandwidth saving is using a large cache. Large caches mean larger bandwidth savings. Can do overnight prefetch or push content from peers. They also have good offline behaviour, enabling prefetching and local search. Can even accelerate dynamic sites.</li>
<li>Cost is about 5&#8211;10GB of RAM per TB of storage. Cannot use laptop-grade hardware for caches: need server-grade hardware which is 10x more expensive than laptops.</li>
<li>Solution is a new storage engine that allows policies for efficiency and performance to be specified. Requires much less RAM than commercial or open-source caches, even for terabyte-sized caches. All techniques support far more GB/$ and allow a performance tradeoff.</li>
<li>Open-source solutions need multiple seeks for hits, misses and writes and depend on default filesystems. Commercial systems (using a circular log) require a single seek and achieve much better performance.</li>
<li>Focus on reducing the size of the (in-memory) index. Squid uses 560 bits per entry; Tiger uses 232 bits per entry.</li>
<li>The cache size is limited by the memory size and performance is limited by the number of seeks. Want to reduce the dependency on memory size and improve the performance of inevitable seeks.</li>
<li>Instead, use the disk as a hashtable. Need on-disk structures for key lookup and value storage.</li>
<li>Basic HashCache policy: H(URL) = h bits&#8230; stores in a disk-based hash table of contiguous blocks, then puts the data in a circular log.</li>
<li>Collision control is difficult in disk-based systems as it requires multiple seeks. Instead use set associativity, t ways. The possible locations are allocated contiguously so they can be read together (which is good as seek time dominates for small reads).</li>
<li>Normally would reduce seeks using an in-memory hash table (with space consumed by pointers), but disk is already a hash table so pointers are not needed, so just use a large bitmap that mirrors the disk layout. Just store one hash per URL.</li>
<li>Large disks can support 10&#8211;100+ million objects. Global cache replacement is relevant when the disk size is roughly equal to that of the working set. When you have much larger disks, local replacement policies are roughly equivalent to global ones. Do LRU within the sets.</li>
<li>Most misses require no seeks; one seek per read; one seek per write. However, writes still need seeks.</li>
<li>Storing objects by hash produces random reads and writes. Need to restructure the on-disk table and store only the hash, rank and offset. Move all data to the log. Group writes will amortise seeks and scheduling related writes will enable read prefetch. Gives reads and writes in < 1 seek.</li>
<li>HashCache requires just 54 bits per URL.</li>
<li>All policies implemented in a Storage Engine with plug-in policies. Built a web proxy using the storage engine. Can have multiple apps on the same box, sharing memory. 20kloc (C) for the proxy and 1kloc for the indexing policies.</li>
<li>Evaluated using Web Polygraph (de facto feature and performance testing tool for web proxies). Compared against Squid and Tiger. Evaluated with &#8220;low end&#8221;, &#8220;high end&#8221; and &#8220;large disk&#8221; hardware capacities.</li>
<li>For low end, achieves hit rate comparable to Squid and Tiger. Can achieve performance comparable to Squid or Tiger, depending on the policy used.</li>
<li>On high end (5x 18GB disks), achieves performance very close to Tiger, much better than Squid.</li>
<li>Can achieve much larger disk capacities than either Squid or Tiger for the same amount of RAM. (1.5&#8211;5.4TB, depending on policy.)</li>
<li>Uses up to 600MB of RAM with a 1TB disk (large disk configuration).</li>
<li>Currently deploying HashCache in Ghana and Nigeria, and working with a school supplier on new deployments.</li>
<li>Q: what observable bandwidth improvements do you see? Many techniques require large caches (e.g. WAN accelerator tools), and we are working on these. Is the performance improvement like that of Squid? Yes, and it will be better for things like multiple people in a class watching a YouTube video (where the number of objects is large).</li>
<li>Q: why do we need these large caches? Is there evidence that by increasing cache size from 200GB to 1TB there will be a drastic improvement? Wanted to move beyond web caching (where the benefits are limited) to WAN acceleration, which requires much larger caches.</li>
</ul>
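<p>The set-associative, disk-as-hashtable layout described above can be sketched in memory as follows. This is not the HashCache storage engine: Python lists stand in for the on-disk table and for the log (which here only grows, rather than wrapping around), SHA-1 is an arbitrary choice of hash, and the class and method names are invented for the example. Each table slot stores only the hash, an LRU rank and a log offset, as in the talk.</p>

```python
import hashlib


class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set is a contiguous run of `ways` slots, so one read
        # fetches every possible location for a given URL.
        self.table = [[None] * ways for _ in range(num_sets)]
        self.log = []    # stand-in for the on-disk log of object bodies
        self.clock = 0   # logical time used as the LRU rank

    def _hash(self, url):
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def get(self, url):
        h = self._hash(url)
        for slot in self.table[h % self.num_sets]:
            if slot is not None and slot[0] == h:
                self.clock += 1
                slot[1] = self.clock          # refresh LRU rank
                return self.log[slot[2]]
        return None

    def put(self, url, body):
        h = self._hash(url)
        slots = self.table[h % self.num_sets]
        self.log.append(body)
        self.clock += 1
        entry = [h, self.clock, len(self.log) - 1]  # hash, rank, offset
        # Reuse a matching slot, else an empty one, else evict the LRU.
        for i, slot in enumerate(slots):
            if slot is not None and slot[0] == h:
                slots[i] = entry
                return
        for i, slot in enumerate(slots):
            if slot is None:
                slots[i] = entry
                return
        victim = min(range(self.ways), key=lambda i: slots[i][1])
        slots[victim] = entry
```

<p>Replacement is local to a set, which matches the observation in the talk that local LRU is roughly equivalent to global replacement once the disk is much larger than the working set.</p>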
<h3><em>iPlane Nano:</em> Path Prediction for Peer-to-Peer Applications</h3>
<ul>
<li>Example application is a P2P CDN where content is replicated across a geographically distributed set of end-hosts. Every client needs to be redirected to the replica that provides best performance. However, internet performance is neither constant nor queriable.</li>
<li>Current best practice is for each application to measure internet performance on its own. Would be better for end-hosts to have the ability to predict performance without having to make measurements, and share infrastructure across applications.</li>
<li>Problem has been looked at before. Network coordinates were limited to latency but were a lightweight (scaleable) distributed system. iPlane had a rich set of metrics and used arbitrary end-hosts, but required a 2GB atlas to be distributed and had a large memory footprint.</li>
<li>iPlane Nano has same information as iPlane and sufficient accuracy, but only uses a 7MB atlas at end-hosts and services queries locally.</li>
<li>On the server side, iPlane Nano uses the same measurements as iPlane but stores and processes them differently.</li>
<li>Size of atlas is O(number of vantage points * number of destinations * average traceroute path length). iPlane combines paths to improve predictions. Instead replace atlas of paths with atlas of links. Now this is O(number of nodes * number of links).</li>
<li>Clients can use swarming to disseminate the atlas, and can service queries locally using the atlas.</li>
<li>However, just storing links loses routing policy information encoded in the routes (i.e. which path would actually be used?). Need to extract routing policy from measured routes and represent this compactly.</li>
<li>Strawman: could try to use shortest AS path routing + valley-free + early-exit routing. However, this gave very poor quality predictions (iPlane got 81% correct, this strawman approach got 30%). So we have thrown away too much information.</li>
<li>First technique is inferring AS filters. Not every path is necessarily a route (ASes filter propagation of a route received from one neighbour to other neighbours). Filters can be inferred from measured routes, by recording every triple of successive ASes in each measured route. Store (AS1, AS2, AS3) to imply that AS2 forwards packets from AS1 to AS3. This still gives multiple policy-compliant paths for some endpoint pairs, due to upstream AS routing policies.</li>
<li>Second technique is to infer AS preferences. For each measured route, alternate paths are determined in the link-based atlas. When paths diverge, this indicates preference.</li>
<li>Another challenge is routing asymmetry. Undirected edges are used to compute routes assuming symmetric routing (i.e. when the route has not been specifically measured), but more than half of internet routes are asymmetric. Merge clients&#8217; additional (low-rate) traceroute measurements into the atlas that is distributed to all clients. Prefer a directed path in the atlas for prediction, or else fall back to undirected paths.</li>
<li>The improved path predictions are 70% accurate, which is almost as good as iPlane (with a 6.6MB atlas rather than 2GB; and a 1.4MB daily update).</li>
<li>Want to use routes to predict latency and loss rate. Latency is sum of link latencies. Loss-rate is the probability of loss on any link in the route. Ongoing challenge is to measure these properties themselves (link latency is hard to measure). iPlane Nano can make good enough predictions to help applications.</li>
<li>System used to improve P2P applications (CDN, VoIP and detour routing for reliability). Look at CDN here; others in paper.</li>
<li>CDN chooses replica with best performance to serve a client request. Evaluated with 199 PlanetLab nodes as clients, and 10 random Akamai nodes as the replicas. Each node wants to download a 1MB file from the &#8220;best&#8221; replica. Look at the inflation in download time (w.r.t. optimal strategy) as a CDF of nodes. iPlane Nano does better than Vivaldi and OASIS, and indeed outperforms the expected measured latency. Random assignment gives bad inflation which shows the importance of an informed choice.</li>
<li>Q: how much does it matter if you have an out-of-date atlas? Once a day is good enough to capture the variance in latency and loss rate.</li>
<li>Q: how expensive is it to recompute the atlas? Very inexpensive.</li>
<li>Q: will your AS inferences lead to false AS links and how do you deal with that? Inter-AS links are a tricky issue, but rather than getting the topology right, it&#8217;s better to make good enough predictions, which are useful for applications. Addressing that will only improve accuracy.</li>
<li>Q: what gain do you get from not just querying the server? Say you want to instrument BitTorrent and rank order the peers that will give good performance. But what if every BitTorrent peer hits the server&#8230; this will overload a server and lead to you needing costly infrastructure (like Google).</li>
<li>Q: what is the measurement overhead that the end-hosts will incur if they have to run their own measurements? We process about 100 traceroutes per day at the end hosts.</li>
<li>Q: when does this technique (path segment composition) work better than others? Assumes routers are performing destination-based routing, rather than load-balancing. Turns out around 70% of routes are identical from day-to-day.</li>
</ul>
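<p>Two pieces of the approach above lend themselves to a short sketch: the AS-triple filter inferred from measured routes, and composing per-link properties into a path prediction. The function names and the toy AS labels are invented for illustration, and the loss-rate formula assumes that losses on different links are independent, which the talk did not state.</p>

```python
def learn_triples(measured_routes):
    """Collect every run of three successive ASes seen in measured routes.
    A triple (AS1, AS2, AS3) means AS2 forwards from AS1 to AS3."""
    triples = set()
    for route in measured_routes:
        for i in range(len(route) - 2):
            triples.add(tuple(route[i:i + 3]))
    return triples


def is_policy_compliant(path, triples):
    """A candidate AS path passes if each of its triples was observed."""
    return all(tuple(path[i:i + 3]) in triples
               for i in range(len(path) - 2))


def path_latency(link_latencies):
    """Predicted latency is the sum of the link latencies."""
    return sum(link_latencies)


def path_loss_rate(link_loss_rates):
    """Probability that a packet is lost on at least one link,
    assuming independent losses per link."""
    delivered = 1.0
    for p in link_loss_rates:
        delivered *= 1.0 - p
    return 1.0 - delivered
```

<p>For example, after measuring routes A-B-C-D and E-B-C, the unmeasured path E-B-C-D passes the filter because both of its triples were observed, while A-B-E is rejected.</p>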
<h2>BFT</h2>
<h3>Making Byzantine Fault Tolerant Systems Tolerate Byzantine Failures</h3>
<ul>
<li>We&#8217;ve heard a lot about applications and optimisations for BFT systems. We now have impressive best case performance for many scenarios. But what happens when failures actually occur? Performance drops to zero or the system crashes!</li>
<li>How do we get robust BFT? Describe the route to &#8220;Aardvark&#8221; which is an implementation of this technique, and show that the performance under failures is not too bad.</li>
<li>10 years ago, we thought BFT could never be fast. Goal was to show that BFT could work (in an asynchronous network). FLP means that all we can guarantee is eventual progress. Systems were designed so that the normal case was fast and safety was maintained.</li>
<li>Wanted to maximise performance with a synchronous network and all clients working properly. This is misguided (surely failures must occur or else why would we have BFT?), dangerous (encourages fragile optimisations with corner cases that are difficult to reason about, easy to overlook and difficult to implement) and futile (diminishing returns in performance improvements).</li>
<li>New goal: address the middle ground between (asynchronous with failures) and (synchronous without failures): i.e. a synchronous network with failures.</li>
<li>Want to maximise performance when the network is synchronous and at most f servers fail, while remaining safe if at most f servers fail.</li>
<li>Protocol is structured as a series of filters to remove some amount of bad messages. This limits the effect that bad messages can have on performance. Same filters are applied to all messages.</li>
<li>Signatures are expensive, so use MACs? But MACs can be used by clients to generate ambiguity, so Aardvark insists on signed requests. (Showed an example of an attack on MACs by a faulty client, where the MAC is validated by the primary, but the replicas cannot validate it, which leads to a tricky protocol that lowers throughput. Also a problem with a faulty primary.) Use a hybrid MAC/signature which is easier to verify. Signature schemes are asymmetric so most of the work can be pushed to the client. But what if a faulty client sends bad signatures into the system? Filter them out (blacklist for client, then verify MAC, then verify signature, and if this fails then blacklist the client).</li>
<li>View changes to be avoided? But they can be done frequently to enable high throughput even under failures. The primary is in a unique position of power (client sends request to primary, primary forwards it to replicas) and could wait for a long time. Usually deal with this using a view change timeout. But a bad primary can be just fast enough to avoid being replaced. Instead use adaptive view changes based on observed and required throughput. Guarantees that the current primary can either provide good throughput or be promptly replaced.</li>
<li>Hardware multicast is a boon? Use separate work queues for clients and network connections between machines.</li>
<li>Evaluated throughput versus latency compared to HQ, Q/U, PBFT and Zyzzyva. Aardvark has longer latency than others (at low throughput), and sustains a lower throughput than PBFT or Zyzzyva.</li>
<li>Evaluated performance with failures. Byzantine failures are arbitrary (cannot enumerate all of them), so made a good-faith effort to strain this. HQ implementation crashes with a faulty client (not all error handling was implemented). PBFT, Q/U and Zyzzyva drop to zero throughput. Aardvark maintains peak performance (although this is lower than the other schemes).</li>
<li>Also looked at effect of delay.</li>
<li>Q: why does the hybrid MAC/signature protocol require a MAC? If we don&#8217;t use a MAC, we don&#8217;t know who is sending the message so the MAC gives us a quick way to identify the sender and blacklist.</li>
<li>Q: given that people are already reluctant to use BFT, why would they take the performance hit? How well should you perform under failures or no failures? Imagine there is a range of protocols there, and could choose a trade-off.</li>
<li>Q: how would you deal with heterogeneous speeds in the adaptive scheme (and faulty nodes causing an attack there)? Looking at symmetric systems and base throughput on the history over previous views.</li>
</ul>
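<p>The client-request filter chain described above (check the blacklist, then verify the MAC, then verify the signature, and blacklist the client if the signature is bad) can be sketched as below. This is an illustrative ordering of checks, not Aardvark's code: the class name is invented, the MAC is HMAC-SHA256 from the Python standard library, and signature verification is left as a caller-supplied function because the standard library has no asymmetric signature scheme.</p>

```python
import hashlib
import hmac


class RequestFilter:
    def __init__(self, mac_keys, verify_signature):
        self.mac_keys = mac_keys                  # client id -> shared key
        self.verify_signature = verify_signature  # expensive check (callable)
        self.blacklist = set()

    def accept(self, client, payload, mac, signature):
        # 1. Drop anything from an unknown or already-blacklisted client.
        if client in self.blacklist or client not in self.mac_keys:
            return False
        # 2. Cheap check: the MAC ties the message to a known sender.
        expected = hmac.new(self.mac_keys[client], payload,
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, mac):
            return False
        # 3. Expensive check, reached only by MAC-authenticated senders.
        #    A bad signature here is attributable, so blacklist the client.
        if not self.verify_signature(client, payload, signature):
            self.blacklist.add(client)
            return False
        return True
```

<p>The ordering bounds the work a faulty client can cause: a message with a bad MAC costs one cheap check, and a client can force at most one wasted signature verification before it is blacklisted.</p>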
<h3>Zeno: Eventually Consistent Byzantine-Fault Tolerance</h3>
<ul>
<li>Data centre storage systems are the backbone of many internet-based services (e.g. Amazon, Facebook, Google). They have high availability and reliability requirements. Cost of downtime is huge.</li>
<li>Example of Amazon&#8217;s Dynamo shopping cart service. Needs reliable storage and responsiveness. Dynamo achieves reliability through replication. It achieves responsiveness by allowing stale state to be viewed during failures, and eventual consistency.</li>
<li>Cannot simultaneously achieve strong consistency and high availability if network partitions are possible (CAP theorem). Many storage backends prefer availability over consistency (e.g. Dynamo, PNUTS, Cassandra).</li>
<li>Two fault models: crash and Byzantine. Many deployed systems assume crash faults because the infrastructure is trusted. But Byzantine faults can happen (S3, Google, NetFlix had multiple-hour outages), as the majority of database bugs exhibit non-crash behaviour. So use BFT, which withstands arbitrary faults, using 3f+1 replicas to tolerate f faults. Used for mission critical systems, e.g. avionics. Progress has been made in improving performance, but what about availability?</li>
<li>Existing protocols strive for strong consistency, which assumes the abstraction of a single correct server. Need >= 2/3 of replicas to be available.</li>
<li>Key idea is relaxed consistency to give availability. Data is available when other nodes block, but sometimes stale. Zeno is an eventually consistent BFT protocol.</li>
<li>What is an eventually consistent BFT service? Assume three clients, A, B and C, that are accessing the service. Model service state as partial order on operations. Have a committed history and one or more tentative histories from some point in time onwards. Can merge tentative histories to give a committed history. But some operations (e.g. two add-to-basket operations for an item of which only one is available) can be inconsistent. Therefore have &#8220;strong&#8221; and &#8220;weak&#8221; operation types. A weak operation observes eventual consistency and may miss previous operations, but will eventually get committed (e.g. add/delete items to/from shopping cart). A strong operation always observes the committed history (e.g. checkout in a shopping cart: only pay for what you buy).</li>
<li>Zeno has four components. Normal case for strong and weak operations; handling a faulty replica; conflict detection; and conflict resolution.</li>
<li>Zeno requires 4 replicas (3f+1, f=1).</li>
<li>[Detailed description of the protocol.]</li>
<li>Strong quorum is used for strong consistency: ensures that no two requests are assigned the same sequence number (need 2f+1=3 matching replies). Weak operations don&#8217;t use the strong quorum: just need f+1 matching replies. With a weak quorum, intersection is not guaranteed, but it is not necessary for eventual consistency.</li>
<li>In event of a faulty primary, must be able to do a view change. Typically these require strong quorums. Zeno has a weak view change protocol that only requires weak quorums, which is necessary for high availability.</li>
<li>[Detailed description of conflict detection protocol.] Based on sequence number mismatch (same sequence number assigned to different requests).</li>
<li>Conflict resolution: weak operations are propagated between primaries of weak views, and finally reconciled. Correctness proof in the technical report.</li>
<li>Evaluated with a simulated workload with a varying fraction of weak operations. With no concurrent operations, compared against Zyzzyva. Look at all strong, 50% weak and all weak. Look at the throughput for Zyzzyva, Zeno(strong) and Zeno(weak). Weak operations continue to make progress in the presence of a network partition (but stall a bit when the partition is resolved, presumably as the conflict is resolved).</li>
<li>With concurrent operations that conflict (during the network partition): weak drops briefly on the original partition, and also takes a slightly worse hit when the partition is resolved (but actually performs better during the partition). So Zeno provides higher availability than Zyzzyva.</li>
<li>Q: instead of working with arbitrary partitions, could you exploit cliques on either side (i.e. in separate data centres)? Yes, definitely.</li>
<li>Q: what happens to the client state when the operations are rolled up? The result that you see might not be the final result: this influences the choice of weak operations. What if you insert weak operations before strong operations? When a strong operation is committed, all weak operations before it must be committed.</li>
<li>Q: throughput results were an order of magnitude lower than in previous talk? Just used a small number of clients.</li>
<li>Q: can you give an example of an application where it is okay to have a period of strong consistency, followed by one where the results may be obliterated by the conflict resolution? Shopping cart is a prime candidate. But future operations may rely on operations that have not yet been committed? Yes, that&#8217;s a design choice.</li>
<li>Q: is it true that you will always end up with divergent histories? If you assume that we have signatures, then no.</li>
</ul>
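<p>The quorum arithmetic in the Zeno notes above can be checked with a short sketch. This is only an illustration of the counting argument, not code from Zeno itself: the function names <code>quorum_sizes</code> and <code>min_overlap</code> are made up for this example, and the only inputs taken from the talk are the sizes 3f+1, 2f+1 and f+1.</p>

```python
def quorum_sizes(f):
    """Return (replicas, strong_quorum, weak_quorum) for f Byzantine faults.

    Sizes follow the notes: 3f+1 replicas, 2f+1 matching replies for a
    strong quorum, f+1 matching replies for a weak quorum.
    """
    n = 3 * f + 1
    strong = 2 * f + 1
    weak = f + 1
    return n, strong, weak


def min_overlap(n, q1, q2):
    """Smallest possible intersection of two quorums of sizes q1 and q2
    drawn from the same n replicas (pigeonhole bound, never negative)."""
    return max(0, q1 + q2 - n)


if __name__ == "__main__":
    n, strong, weak = quorum_sizes(1)
    print(n, strong, weak)                 # 4 3 2
    # Two strong quorums share at least f+1 replicas, so at least one
    # correct replica sees both and a sequence number cannot be reused.
    print(min_overlap(n, strong, strong))  # 2
    # Two weak quorums may be disjoint, which is why weak operations can
    # diverge during a partition and need conflict resolution later.
    print(min_overlap(n, weak, weak))      # 0
```

<p>With f=1 this gives the 4 replicas mentioned above. The zero overlap for weak quorums is the price paid for availability: each side of a partition can keep accepting weak operations, at the cost of reconciling them afterwards.</p>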
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2009/04/22/nsdi-2009-day-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Conspiracy Theories</title>
		<link>http://www.mrry.co.uk/blog/2008/07/21/conspiracy-theories/</link>
		<comments>http://www.mrry.co.uk/blog/2008/07/21/conspiracy-theories/#comments</comments>
		<pubDate>Mon, 21 Jul 2008 00:18:04 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Meta]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/2008/07/21/conspiracy-theories/</guid>
		<description><![CDATA[My last post concerned the conspiracy theories that surround the collapse of 7 World Trade Center on the 11th of September, 2001. In that post, I tried to provide an objective rationale for why the controlled demolition hypothesis should not be believed, owing to its unfalsifiability. The truth is that this and other 9/11 conspiracy [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://www.mrry.co.uk/blog/2008/07/21/the-911-delusion/">last post</a> concerned the <a href="http://en.wikipedia.org/wiki/Controlled_demolition_hypothesis_for_the_collapse_of_the_World_Trade_Center#World_Trade_Center_Seven">conspiracy theories that surround the collapse of 7 World Trade Center</a> on the 11th of September, 2001. In that post, I tried to provide an objective rationale for why the controlled demolition hypothesis should not be believed, owing to its unfalsifiability. The truth is that this and other <a href="http://en.wikipedia.org/wiki/9/11_conspiracy_theories">9/11 conspiracy theories</a> provoke an almost visceral response in me. I am pretty certain that I&#8217;m not the only person who feels this way.</p>
<p>Right now, it&#8217;s pretty obvious that I don&#8217;t believe in the conspiracy theories (at least, the ones in which the US government or one of its agencies &#8220;made it happen on purpose&#8221;). However, I am no great supporter of the present US administration, and my political leanings (if transposed to America) would be somewhere to the left of the Democratic Party. Why then am I inclined to give Bush and his aides the benefit of the doubt? Of course it&#8217;s their sheer ineptitude: one need only look at the prosecution of the Iraq war for a rich seam of evidence.</p>
<p>But that doesn&#8217;t explain why I am so viscerally affected by the conspiracy theories: after all, I might be an atheist on the balance of probabilities, but I have no problem with people who have religious faith.</p>
<p>I think part of it is cognitive dissonance: we are raised to trust the government, and the idea that a government could be responsible for an atrocity like 9/11 is utterly incompatible with that preconception. I&#8217;ve already rationalised away the conspiracy theories, but perhaps not everyone would do the same.</p>
<p>Let&#8217;s assume that Democrats are more likely to believe and perpetuate the 9/11 conspiracy theories; and that Republicans and independents are more likely to recoil from them. There are photos of <a href="http://hotair.com/archives/2007/04/30/photo-of-the-day-the-democrats-truther-problem/">conspiracy theorist banners at Obama rallies</a>. This is perfect ammunition for the Republicans, who can associate the Democrats with their &#8220;lunatic fringe&#8221; and exploit the cognitive dissonance in their base and the independents.</p>
<p>Here&#8217;s a conspiracy theory for you: Karl Rove sowed the seeds for the &#8220;9/11 Truth&#8221; movement in a deliberate attempt to discredit the Democrats and make them unelectable in the near future. Or maybe just to distract everyone from the true scandals of the Bush administration: tens of thousands of dead civilians in Iraq, thousands of dead soldiers, domestic spying on US citizens, the erosion of habeas corpus, inaction over global warming and the near collapse of the economy.</p>
<p>There are plenty of things that we still don&#8217;t know about 9/11, and we should as a matter of course seek the truth. We should discover the real reasons that the buildings fell in order to apply the lessons learned to future construction. But in finding the truth, we must retain an open mind, and not resort to intellectual dishonesty or partisanship.
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2008/07/21/conspiracy-theories/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The 9/11 Delusion</title>
		<link>http://www.mrry.co.uk/blog/2008/07/21/the-911-delusion/</link>
		<comments>http://www.mrry.co.uk/blog/2008/07/21/the-911-delusion/#comments</comments>
		<pubDate>Sun, 20 Jul 2008 23:27:47 +0000</pubDate>
		<dc:creator>Derek Murray</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.mrry.co.uk/blog/2008/07/21/the-911-delusion/</guid>
		<description><![CDATA[When I saw in last Monday&#8217;s Guardian that Charlie Brooker was taking aim at 9/11 conspiracy theories, I hoped that he&#8217;d use his wide audience to present a logically watertight argument, in an entertainingly acerbic register. And buried within his piece was the quite probable suggestion that the paperwork alone would be impossible to conceal. [...]]]></description>
			<content:encoded><![CDATA[<p>When I saw in last Monday&#8217;s Guardian that Charlie Brooker was <a href="http://www.guardian.co.uk/commentisfree/2008/jul/14/september11.usa">taking aim at 9/11 conspiracy theories</a>, I hoped that he&#8217;d use his wide audience to present a logically watertight argument, in an entertainingly acerbic register. And buried within his piece was the quite probable suggestion that the paperwork alone would be impossible to conceal. Unfortunately, because he&#8217;s evidently paid by the ad hominem, he also said that every conspiracy theorist might as well believe that he is the Emperor of Pluto, and unleashed a firestorm in the <a href="http://www.guardian.co.uk/commentisfree/2008/jul/14/september11.usa?commentpage=1">online comments</a>. By opening up too many fronts in this debate, he left himself open to attacks, even <a href="http://www.guardian.co.uk/commentisfree/2008/jul/17/september11">from other Guardian commentators</a>.</p>
<p><span id="more-35"></span>Let us consider a single event from 9/11: the collapse of 7 World Trade Center, a 47-storey skyscraper which stood across Vesey Street from the main World Trade Center site. It collapsed at 5:21pm that day, but, unlike 1 and 2 World Trade Center (the north and south towers, respectively), it was not struck by an aeroplane.</p>
<p>As yet, the official report on the collapse has not been published: the <a href="http://wtc.nist.gov/progress_report_june04/appendixl.pdf">working hypothesis</a> is that fire and/or debris caused a critical supporting column to fail, ultimately leading to the collapse of the entire building.</p>
<p>A widely held conspiracy theory is that the building was <a href="http://en.wikipedia.org/wiki/Controlled_demolition_hypothesis_for_the_collapse_of_the_World_Trade_Center#World_Trade_Center_Seven">destroyed by controlled demolition</a>.</p>
<p>At present, there is no definitive evidence that proves either hypothesis: why then should we prefer one over the other? The conspiracy hypothesis can be proven, but not disproven. Any &#8220;proof&#8221; of the official hypothesis can be probabilistic at best, as it will be based on simulated physics with limited precision and an inherent (albeit possibly small) uncertainty. And that uncertainty leaves the door open for the possibility that <a href="http://www.prisonplanet.com/011904wtc7.html">Larry Silverstein ordered the demolition</a>, or that <a href="http://www.journalof911studies.com/articles/Why%20Indeed%20Did%20the%20WTC%20Buildings%20Completely%20Collapse%20Jones%20Thermite%20World%20Trade%20Center%20J24.pdf">thermite was used to hide the typical hallmarks of a controlled demolition</a>, amongst other possibilities.</p>
<p>On the other hand, you can disprove the official hypothesis in a simple manner. Simply find one person who was involved in the conspiracy to admit his involvement. This could be one of the cabal who planned the atrocity, one of the secret services who executed it, one of the demolition experts who planted the explosives in the building, one of the building workers who saw explosives being installed, one of the emergency services who ordered the building&#8217;s evacuation or one of the news media who apparently reported the collapse before it happened.</p>
<p>Conspiracies do happen, but they can only succeed when the number of participants is limited: otherwise human fallibility makes it vanishingly improbable that the whole endeavour can remain a secret. And that is why, unless somebody comes forward and claims responsibility or provides legally credible witness evidence, I cannot (and we should not) believe that 7 World Trade Center was destroyed in a controlled explosion.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mrry.co.uk/blog/2008/07/21/the-911-delusion/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
