Three types of data: private, public (non-sensitive) and public (without attribution). This talk is about the last one: users want to download and share data without others knowing what they’re downloading or sharing.
P2P is good for privacy because there is no centralized control or trust, you retain rights to your data (instead of giving them to a third party) and no centralized third party knows everything you’re doing. But in P2P: anyone can monitor your behavior!
Previous solutions: Tor plus BitTorrent, but this needs a fraction of the clients to be public and gives poor performance. Or Freenet, which has poor bulk-data performance and requires users to store others’ data. The median time to download a 1MB file with BitTorrent is 94s, but BT+Tor takes 589s and Freenet 1271s.
Implemented the OneSwarm client and released it in March 2009… now hundreds of thousands of users. Based on social networks: share keys with your friends.
Searches are flooded through the network, and a data path is set up along the successful route, which does not reveal who the ultimate sender or provider is.
Threat model: the attacker has a limited number of overlay nodes and can do anything on nodes he controls, including traffic injection/sniffing/correlation.
To support mobile peers, a DHT is used to publish each peer’s IP and port; however, this record is encrypted and signed separately for each peer, which makes it possible to remove a peer.
Sparse social networks are a problem: with only one friend, you get poor reliability, poor performance and privacy problems. This is bad for early adopters, who used a forum to share their public keys. The solution was to add a community server as a source of untrusted peers.
With untrusted peers: delay responses to foil timing attacks, probabilistically forward queries, and use deterministic random behavior to limit information leakage from repeated queries. Peers can be trusted or not on a per-object basis.
Want to have search without revealing the source and destination. The approach is based on flooding with delay, where searches are only forwarded using spare capacity, and delayed at each hop. Cancel messages move faster through the network. However, this gives no guarantee that all data can be found by all users at all times.
Search types: 160-bit content hash and text-based. For the response delay, use a random reply delay seeded by the hash (for hash-based search). This is harder for text-based search, so the delay is instead based on the hash of the matching content.
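The hash-seeded delay can be sketched as follows (a minimal illustration: the `node_secret` input, SHA-1, and the 150ms cap are assumptions for the example, not OneSwarm’s actual parameters):

```python
import hashlib

def reply_delay_ms(content_hash: bytes, node_secret: bytes,
                   max_delay_ms: int = 150) -> int:
    # Seed the delay from the query's content hash plus a per-node
    # secret, so repeated queries always observe the same delay and
    # cannot average away the timing noise.
    digest = hashlib.sha1(node_secret + content_hash).digest()
    return int.from_bytes(digest[:4], "big") % max_delay_ms
```

Because the delay is a deterministic function of the query, an attacker gains nothing by re-issuing the same search.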
Multipath connections are supported to avoid weak points in the network. Default is to cancel after getting 20 responses (not just one). Then the forwarding load is distributed.
Data transfer uses a modified version of BitTorrent, which handles multiple sources and swarming downloads naturally.
Timing attack is possible by monitoring the delay between search and the response, and inferring how many users could have sent the reply within the recorded time.
Evaluated the potential of the timing attack using a synthetic OneSwarm overlay, based on 1.7 million last.fm users. Attackers use a public community server, and users with 26 or fewer friends take on some untrusted peers as well. The design eliminates the attacker’s ability to pinpoint particular users.
Also evaluated performance using 120 PlanetLab nodes transferring a 20MB file. Median download time for OneSwarm is 173s (compared to 94s for BitTorrent, but much less than BT+Tor and Freenet). At the 70th percentile, OneSwarm and BT are within a factor of two (97s vs 190s), while BT+Tor and Freenet are much worse.
Multipath transfers increase average transfer rate from 29KB/s to 457KB/s.
Q. Would it be fair to classify this as Gnutella where the network is restricted to the social graph and searches are less flexible? Similar, but the key change is that the data flows over the path in the overlay (not directly), which makes it more privacy-preserving. Does this not give worse scalability than Gnutella, which had problems? Gnutella had problems when people were still using modems; it is now more viable to provide a few KB/s. The current overlay is not oversubscribed.
Q. What are you relying on to ensure that you don’t know where the data are coming from? The fact that you don’t know the topology. If you are next to an attacker, you rely on the delay that you add. Are you seeing a sum of random variables, which will leak more information as the path becomes longer? You could maybe estimate the hop count but not pinpoint nodes.
Q. Is a 20MB file too small for TCP to show realistic performance? Used this because we needed to experiment with Tor also, and we didn’t want to stress that network too much. For the Freenet experiment, we used a 5MB file and extrapolated from that because it was hard to get 20MB to download reliably.
Differentially-Private Network Trace Analysis
Can you conduct network trace analysis that provides strict, formal, “differential privacy” guarantees?
Selected some representative trace analyses and tried to reproduce their results under differential privacy.
It was possible to reproduce every analysis attempted, but there is a privacy/accuracy trade-off.
Toolkit and analyses are available online from the PINQ web site.
Access to realistic data is helpful for networking research, but there is a tension between utility and privacy. The requirements of utility are usually for aggregate statistics, whereas privacy requirements are typically for individual behavior.
Other approaches include: trace anonymization (doesn’t always work unless people are excessively conservative), code-to-data (send your analysis to the people who hold the data, but it is hard to know what that code is doing), or secure multi-party computation (similar to code-to-data). The aim here is to start with formal guarantees and see how useful it can be.
Differential privacy: the results are essentially unchanged by the presence or absence of an individual record. Doesn’t prevent disclosure, but makes no assumptions about what the attackers can do, and is agnostic to data types.
Uses PINQ, a declarative query language that guarantees programs are differentially private.
Challenges: getting DP requires introducing noise, so you need to use statistically robust measurements. PINQ requires analyses to be written as high-level, declarative queries, which can require some creativity or reinterpretation. Also (not dealt with): masking a few packets does not mask a person, and the guarantees degrade more as a dataset is reused (policy question of how you mete out access to a dataset).
Example is worm fingerprinting. Group packets by payload, then filter to payloads where the count of distinct source IPs is over one threshold and the count of distinct destination IPs is over another. Can then count the number of worms, approximately. Need to supply an epsilon to the count, which invokes the differentially-private version.
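The shape of that query can be sketched in Python (a toy stand-in for the PINQ version: here the grouping and thresholding run in the clear and only the final count is noised, whereas PINQ composes every step through its own differentially private operators):

```python
import math
import random
from collections import defaultdict

def laplace(scale: float, rng: random.Random) -> float:
    # Sample Laplace(0, scale) noise via the inverse CDF.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_worm_count(packets, src_thresh, dst_thresh, epsilon, seed=0):
    # packets: iterable of (payload, src_ip, dst_ip) tuples.
    srcs, dsts = defaultdict(set), defaultdict(set)
    for payload, src, dst in packets:
        srcs[payload].add(src)
        dsts[payload].add(dst)
    count = sum(1 for p in srcs
                if len(srcs[p]) > src_thresh and len(dsts[p]) > dst_thresh)
    # Laplace(1/epsilon) noise makes a unit-sensitivity count
    # epsilon-differentially private.
    return count + laplace(1.0 / epsilon, random.Random(seed))
```

Smaller epsilon means stronger privacy but noisier counts, which is the privacy/accuracy trade-off mentioned above.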
Built some tools for analysis. For example, implemented three versions of a CDF. In doing this, you need to scale down the accuracy for each subquery in order to not degrade the dataset privacy too much.
Showed an example CDF. The differentially private one is not monotonic at the microscopic scale, but it gives a convincing macro-scale result.
Can also list frequently occurring strings, using an algorithm based on statistical properties of text, which gradually extends a prefix.
Extend worm fingerprinting: actually enumerate the payloads that have significant src/dest counts.
Also built more tools and analyses: did packet-level analyses, flow-level analyses and graph-level analyses. Sometimes had to compromise on privacy to get high accuracy (epsilon = 10 for weak privacy).
Many open questions. Perhaps the biggest is whether DP guarantees for packets are good enough. Or whether, if writing new analyses, they could be designed with DP in mind.
Q. Could extensions to PINQ apply to trace analysis that look for isolated events, such as network intrusions which are relatively rare? Can separate the two tasks: learning a rule or filter that could identify an intrusion (which could use DP), and apply that filter to individual packets (which could not use DP, because you effectively want to violate privacy at this point).
Q. Does someone need to hold onto the raw packets? Yes, like the code-to-data setting.
Q. In DP, each query may consume epsilon privacy and the provider must set a budget, so how do you set this? And what happens when the budget is exhausted? You could imagine turning off the dataset when the hard threshold is met. But this is really a policy question. Setting the budget is difficult: perhaps you can provide data outputs from a DP query to a large group who can then do useful work with it.
Q. Is there a trade-off between DP and the presence of correlations in the data? In a lot of cases, it is possible to restructure the data set to reduce the amount of correlation between individual records (by grouping the strongly correlated records together).
Encrypting the Internet
50 million websites exist, but only about 600k of them enable SSL/TLS. Can we change the infrastructure to make all transactions protected and secure?
Main drawback is protocol processing speed and cost, due to public key crypto for handshaking and symmetric crypto for the data. 2 million clock cycles for RSA decrypt.
Main contribution is a CPU that is capable of encrypting packets at line rates, and getting a 4–12x speedup in AES and a 40% speedup in RSA.
Encrypting the internet is not the same as securing it: this work doesn’t deal with certificate/trust management, malicious software or privacy breaches at the end host.
AES is a block cipher based on the Rijndael algorithm. It uses 128-bit blocks with 128-, 192- or 256-bit keys, taking 10, 12 or 14 rounds respectively.
AES uses confusion in the S-box (inversion in GF(2^8) followed by an affine map). Then the bytes are permuted by shifting the rows of the state by varying amounts, and the columns of the state are mixed by matrix multiplication. These use many bit-linear operations, which are easy to implement in VLSI. Finally, the round key is added using XOR.
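The confusion step can be reproduced in a few lines: derive the S-box by inverting each byte in GF(2^8) modulo the AES polynomial 0x11B and applying the affine map. This sketches the mathematics only, not the composite-field hardware implementation described in the talk:

```python
def gf_mul(a: int, b: int) -> int:
    # Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1.
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def gf_inv(a: int) -> int:
    # Brute-force inverse; AES defines inv(0) = 0.
    if a == 0:
        return 0
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

def sbox(x: int) -> int:
    inv = gf_inv(x)
    # Affine map: each output bit XORs five rotated input bits
    # together with the constant 0x63.
    out = 0x63
    for i in range(8):
        bit = 0
        for j in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + j) % 8)) & 1
        out ^= bit << i
    return out
```

A hardware implementation computes the inversion in a composite field rather than by table lookup, which is what gets the gate count down to 100–400 gates.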
AES is typically implemented using table lookups, which are costly (approximately 15 cycles per byte). But need to get 1Gb/s. So the aim is to implement them in combinatorial logic, on the processor data path.
Added new instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST. Cache attacks are eliminated. Challenge is to implement this in as small a gate area as possible. Mathematical techniques such as composite fields help to achieve this in 100-400 gates. Total number of gates is similar to an adder or multiplier.
RSA requires a modular exponentiation, which is implemented using repeated modular multiplication. Implementing a faster multiplication algorithm in assembly achieved a 40% speedup over OpenSSL.
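The core operation can be sketched as binary square-and-multiply, where every step is a modular multiplication (the reported speedup came from a hand-tuned assembly multiply, not from the exponentiation schedule itself):

```python
def mod_exp(base: int, exp: int, mod: int) -> int:
    # Right-to-left binary exponentiation: one modular squaring per
    # exponent bit, plus a modular multiply for each set bit.
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:
            result = (result * base) % mod
        base = (base * base) % mod
        exp >>= 1
    return result
```

A 1024-bit RSA private-key operation performs on the order of a thousand such modular multiplies, so a faster multiply translates directly into faster RSA.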
Also implemented the first version of combined encryption and authentication for TLS 1.2.
AES-NI latency is 24 clocks/round, then 6 clocks, and the throughput is 2 clocks.
Overall, can move from 501 SSL sessions/second to 1216 SSL sessions/second using AES-NI in Galois counter mode.
Now, one core can saturate a 1G link, and 8 cores can saturate a 10G link.
Future work is to improve larger RSA variants and implement the eventual SHA-3 algorithm.
Q. When you get that fast, how many good-quality random bits per second can you get? This work doesn’t address that, but all we need is an entropy source, per a 2004 paper. Not sure what the product groups are doing in this respect.
Q. Is Intel working on speeding up the RSA? The speedup presented in the paper (40%) is good enough to saturate the link.
Q. Could you expose the GF operations as primitives themselves? It is implemented in such a way that you can isolate the inversion of GFs or the multiplication. Algorithms in the SHA-3 competition also exploit similar primitives.
Q. How general are your optimizations in terms of other block ciphers? You can implement a variety of crypto algorithms using the primitives we have designed, including several cryptographic hash functions.
Enabling Fine-Grained Channel Access in WLAN
802.11n achieves about 45.2 Mbit/s at the application layer, which is much less than the advertised bitrate.
Overhead arises from various sources: CSMA, backoff, DIFS and SIFS, and ACKs. A simple model treats efficiency as the ratio of transmission time to total time for a packet. As the PHY data rate increases, the time spent transmitting data becomes small compared to all of these overheads. There is a static delay that cannot be reduced, constraining the achievable speedup.
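That ratio can be made concrete with a toy model (the 100us of fixed per-packet overhead is an illustrative stand-in for DIFS + SIFS + backoff + ACK, not a measured value):

```python
def mac_efficiency(phy_rate_mbps: float, payload_bytes: int = 1500,
                   overhead_us: float = 100.0) -> float:
    # Time on air for the payload shrinks as the PHY rate grows,
    # but the fixed MAC overhead does not.
    t_data_us = payload_bytes * 8 / phy_rate_mbps
    return t_data_us / (t_data_us + overhead_us)
```

With these assumed numbers, a 1500-byte frame is ~69% efficient at 54 Mbps but only ~29% efficient at 300 Mbps, which is the trend the talk describes.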
Existing MAC is limited by allocating a whole channel to a single user at once. Aggregation is a possible solution, but you require large aggregation (23KB frames) to get 80% efficiency at 300Mbps. And this also increases latency.
Basic idea is to divide the channel into small, fine-grained slices. Directly reducing the channel width doesn’t work because of guard-band overhead. The approach is instead to use orthogonal overlapping subchannels (OFDM).
If nodes are asynchronous (i.e. with multiple users), you lose orthogonality. The challenge is then to coordinate transmissions in random-access networks like WLANs. Time-domain backoff is very inefficient in this case.
Designed new PHY and MAC architectures: “FICA”.
M-RTS/M-CTS/DATA/ACK access sequence.
Carrier-sensing and broadcasting can be used to analyze the timing misalignment. A proper cyclic-prefix accommodates the timing misalignment: a long one for M-RTS and a short one for M-CTS/DATA/ACK.
For contention resolution, time-domain backoff is inefficient. Solution is to do frequency-domain contention with PHY signalling in the M-RTS/M-CTS symbols.
Frequency domain backoff: reduce the number of subchannels to contend if there is a collision, and increase it if there is success. This is analogous to congestion-control mechanisms. Two policies: “update to max” and AIMD.
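The two backoff policies can be sketched as follows (the 16-subchannel total is an assumed parameter for illustration):

```python
def next_contention_width(current: int, collided: bool,
                          total: int = 16, policy: str = "aimd") -> int:
    # On collision, multiplicatively decrease the number of
    # subchannels this node contends for; on success, increase.
    if collided:
        return max(1, current // 2)
    if policy == "max":
        return total                     # "update to max" policy
    return min(total, current + 1)       # AIMD: additive increase
```

The collision/success feedback plays the same role that packet loss and ACKs play in TCP congestion control, as the talk notes.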
Implemented using the Sora software radio platform, based on a SoftWifi implementation.
Evaluated performance of the synchronization, the signalling reliability and the decoding performance.
Also showed simulation results for the performance gain over 802.11n, and showed an improvement in efficiency for both full aggregation (unrealistic) and a mixture of saturated and delay-sensitive traffic (realistic and with much greater benefits).
Q. How do you deal with the case when the number of sources exceeds the number of sub-carriers (in frequency-domain backoff)? Could you combine time and frequency? Yes, we could always do that.
Q. Is there a way of using this system with RTS/CTS? The overhead of these is so low (37us for RTS), that it might not be worth doing this.
Q. Why is the problem of asynchronous timing different from multipath fading? It can create arbitrarily bad interference with the FFT window that is done for OFDM.
Q. What happens if you compare your scheme to classic OFDM in terms of bits/second/Hz (considering delay due to synchronization)? There is a sweet spot in symbol size that can meet your design goal.
Predictable 802.11 Packet Delivery from Wireless Channel Measurements
802.11 is fast (600Mbps), reliable (usable at vehicular speeds over extended range) and ubiquitous (cheap). But new applications, such as wireless displays or controller input, can stress the network.
In theory, performance should be easily measurable and used to guide channel rate selections. But the real-world doesn’t always match the theory. So statistical adaptation is often used, but convergence time becomes a problem (especially as the measurement results change dynamically).
Goal is to bridge theory and practice, and accurately predict performance over real channels and devices.
Channel metric is the received signal strength indicator (RSSI) which, with noise, gives the SNR for a packet. However, this isn’t perfect, because it can vary by 10dB on a per packet basis. Different subchannels have different SNRs.
802.11n provides a new opportunity: detailed channel measurements, which are used for advanced MIMO techniques. Get a Channel State Information (CSI) matrix for per-antenna paths.
Use the Effective SNR (the total useful power in a link) as opposed to the packet SNR (total power in the link).
CSI is measured on receive, so for every received frame we know the channel for all antennas and subcarriers used. From this, compute per-subcarrier SNRs, use textbook formulae to calculate per-subcarrier bit-error rates, average them to get an effective bit-error rate, and finally convert this back to an effective SNR.
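The pipeline can be sketched for BPSK (the system applies the appropriate textbook formula per modulation; the bisection inverse here is a simple stand-in):

```python
import math

def q_func(x: float) -> float:
    # Gaussian tail probability Q(x).
    return 0.5 * math.erfc(x / math.sqrt(2))

def ber_bpsk(snr_linear: float) -> float:
    # Textbook BPSK bit-error rate in AWGN.
    return q_func(math.sqrt(2 * snr_linear))

def effective_snr_db(subcarrier_snrs_db):
    # Per-subcarrier SNR -> per-subcarrier BER -> average BER.
    bers = [ber_bpsk(10 ** (s / 10)) for s in subcarrier_snrs_db]
    avg_ber = sum(bers) / len(bers)
    # Invert the BER curve back to a single equivalent SNR
    # (geometric bisection; BER is monotonically decreasing in SNR).
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if ber_bpsk(mid) > avg_ber:
            lo = mid
        else:
            hi = mid
    return 10 * math.log10(math.sqrt(lo * hi))
```

Because the average is dominated by the worst subcarriers, a frequency-selective channel gets an effective SNR well below its mean packet SNR, which is exactly why the metric predicts delivery better.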
Every rate gets an effective SNR threshold, calculated offline per NIC implementation (not per NIC or per channel). This handles real NICs which may use interesting decoding techniques (hard/soft/maximum likelihood, etc.).
Application: what is the fastest configuration for a particular link? Select rate/MIMO/channel width based on the information.
Application: which antenna is the best to use to save power?
Application: what is the lowest transmit power at which I can support 100 Mbps?
Implemented in an Intel Wi-Fi Link 5300 NIC (3×3 MIMO, 450Mbps). Used two testbeds with over 200 widely varying links. Open-source Linux driver and used firmware debug mode to send CSI to the receiving host. Real-time computation took 4us per 3×3 CSI.
For predicting optimal 3×3 rate: effective SNR is much closer to the ground truth than packet SNR.
To evaluate rate control, used channel simulation on a mobile trace using MATLAB and the SoftRate GNU Radio. Effective SNR gets a better average delivered rate than SampleRate, SoftRate and SampleRate with fixed retry (802.11a algorithms).
Effective SNR extends to MIMO. Compared to optimal and an invented algorithm called “previous-OPT”. Effective SNR gets 80% accuracy and 10% overselection.
Related work: SoftRate, AccuRate and EEC (from yesterday). All work with 802.11a but don’t extend to more recent techniques.
Q. If you had CSI and it’s quick, do you need to do all of these things? The RSSI has a lot of error, and we were able to make this work.
Q. Is the debug mode on the NIC publicly available? Yes, I think so.
Q. Would a better comparison be to other techniques that use scheduled MACs? Trying to do something that works with what we have.
SourceSync: A Distributed Architecture for Sender Diversity
Receiver diversity underlies many systems, such as opportunistic routing protocols and WLAN diversity protocols. In opportunistic routing, any router that receives the packet may forward it. With multiple routers/channels with different loss rates, the loss probability becomes the joint probability of losing the packet on all channels.
Sender diversity is the converse of receiver diversity: if many senders transmit simultaneously, it is unlikely that they will all be attenuated at the same time. It provides analogous benefits to receiver diversity. For example, connect many APs to wired Ethernet, and let many of them broadcast a packet simultaneously to the client.
Challenge: simultaneous transmissions don’t strengthen each other, because they are likely to be out of sync. Need distributed symbol-level synchronization.
An 802.11 symbol takes 3.2us. With a synchronization error of 2us, the best you can get is an SNR of 2dB; a 1us error gives 5dB. But for the maximum bit rate, 802.11 needs an SNR of ~22dB.
Implemented the system, SourceSync, on an FPGA. The talk focuses on opportunistic routing, but it applies also to WLANs.
Natural way to synchronize transmitters is by reception. But since multiple paths have different delays, need to compensate for these differences.
Path delay is made up of propagation delay and packet detection delay (multiple samples are typically needed to detect a packet), plus a turnaround time between receipt and transmission.
Packet detection delay arises because receivers detect packets using correlation. Random noise can cause a receiver not to detect a packet on the first sample; since routers see different random noise, they may need different numbers of samples. Routers can estimate this delay from the phase shift across subcarriers.
Hardware turnaround time is hardware dependent, caused by different hardware pipelines and radio frontends. Routers locally calibrate this using their clocks.
Propagation delay is measured by a probe/response exchange between node pairs. A knows its packet detection delay and B knows its, so the propagation delay can be computed from the RTT.
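The arithmetic is simple once the per-node delays are calibrated (a sketch of the idea; the exact set of terms SourceSync subtracts is detailed in the paper):

```python
def propagation_delay_us(rtt_us: float, det_a_us: float,
                         det_b_us: float, turnaround_b_us: float) -> float:
    # RTT = 2 * propagation + A's and B's packet detection delays
    #       + B's receive-to-transmit turnaround time.
    return (rtt_us - det_a_us - det_b_us - turnaround_b_us) / 2.0
```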
Challenge: can nodes synchronize using carrier sense? Transmission from one of the joint senders triggers the other senders. All nodes use CSMA, so one of the nodes wins contention and begins transmitting; other nodes join in if they have the data.
The lead sender adds a sync header to the packet and a known fixed gap to all co-senders to join after the gap. Co-sender listens, turns around from receive to transmit, waits for a compensating delay, and sends the data.
Implemented on the FPGA of the WiGLAN radio. Built a testbed with a variety of line-of-sight and non-line-of-sight locations.
Evaluated: randomly pick a pair of nodes to transmit, and measure the synchronization error. The 90th percentile of synchronization error was 20ns at 5dB SNR, and as little as 1ns at 25dB SNR.
Can SourceSync achieve sender diversity gains? Two nodes transmit simultaneously to a receiver (again). Check that two channels have different OFDM subchannel SNRs (they do, in the example) and that SourceSync achieves higher SNR in all subchannels.
Compare using the best single access point to using SourceSync, with two senders and a client. Repeat for all locations. SourceSync gives a median throughput gain of 57%.
Compared with opportunistic routing. Single path does worst. ExOR does better. SourceSync + ExOR does best (doubled median throughput over single path, and 45% improvement over ExOR alone).
Q. Did you consider just increasing the power of a single AP instead of sending with multiple APs? There is a fundamental gain here: the SNR profile is different for different routers, and combining across multiple senders gets rid of these deep losses.
Q. Should there be more components to the calculation of delay, based on the RTT? The nice part about this technique is that channel access delay doesn’t affect us, because we use carrier sense for telling when to transmit.
Q. Why did you not compare the performance of your scheme to MIMO? Sender diversity is orthogonal to MIMO and could improve its performance.
Q. Is your synchronization header long enough to account for nodes being very distant? Actually, it’s the gap after the header that has to be long enough. It’s a simple system-level parameter.
Novel Implementations of Network Components
SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware
Platform header contains: a hash value for custom forwarding, a bitmap of which preprocessors should execute on the packet, a forwarding mode (longest-prefix matching or an exact match; can also throw a software exception) and the virtual data plane ID.
Virtual data plane has its own preprocessing, lookup and post-processing stages: they operate in isolation.
Preprocessing stage: select processing functions from a library of modules (such as path splicing, IPv6 and OpenFlow). Also hashing: operator indicates what bits in the header should be incorporated in the packet-header hash to determine how the packet should be forwarded (can include up to 256 bits from the header).
Can do OpenFlow, where forwarding decisions are made on a 13-tuple (240 bits), which SwitchBlade hashes for custom forwarding to be done.
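The bitmap-plus-hash idea can be sketched as follows (MD5 and the 1024-entry table are illustrative stand-ins; SwitchBlade’s actual hash function and table size are details of the hardware):

```python
import hashlib

def forwarding_index(header: bytes, bitmask: bytes,
                     table_size: int = 1024) -> int:
    # Keep only the header bits selected by the operator's bitmask,
    # then hash them down to a forwarding-table index.
    selected = bytes(h & m for h, m in zip(header, bitmask))
    digest = hashlib.md5(selected).digest()
    return int.from_bytes(digest[:4], "big") % table_size
```

The same mechanism serves any protocol: the operator just supplies a different bitmask over the header bits.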
Modules are implemented in Verilog. Preprocessing and postprocessing modules extract the bits for lookup.
Forwarding stage: perform output-port lookup based on the mode bits. A software exception can be thrown and the packet redirected to the CPU, enabling hardware-accelerated virtual routers with a software slow path.
Implemented on NetFPGA.
Evaluated for resource utilization and packet forwarding overhead. Compared to a baseline implementation on NetFPGA. There is minimal resource overhead and no packet forwarding overhead.
Evaluated on a three-node topology.
SwitchBlade uses 13 million gates to get four data planes; other implementations (IPv4, splicing, OpenFlow) have one data plane and use 8 to 12 million gates.
No additional forwarding overhead compared to the reference implementation.
SwitchBlade is a programmable hardware platform with customizable parallel data planes. Provides isolation using rate limiters and fixed forwarding tables.
Q. How do you scale the performance beyond tens of Gbps? This is an artifact of the NetFPGA implementation, which uses 4×1G ports. A later one will have 4×10G.
Q. Doesn’t the next paper show that it is possible to do all this in software? Things like Click are limited by packet copying overhead, so you are limited by the bandwidth of the PCI bus.
Q. What kind of hash function do you use and do different applications require different properties? We use a collision-resistant hash.
PacketShader: A GPU-Accelerated Software Router
Prototype achieves 40 Gbps on a single box by exploiting GPU acceleration.
Software routing is not just IP routing. It is driven by software and exploits commodity hardware.
10G NICs cost from $200–300 per port. But software routers are limited to less than 10Gbps (8.7Gbps in RouteBricks is the best so far).
For 10G, packet I/O takes 1200 cycles out of a per-packet budget of about 1400 cycles. Lookup/encryption/hashing typically take much more than the remainder.
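The budget arithmetic works out for a RouteBricks-class box (the core count and clock speed below are assumptions matching that setup, not numbers from this talk):

```python
def cycles_per_packet(link_gbps: float = 10.0, pkt_bytes: int = 64,
                      cores: int = 8, clock_ghz: float = 2.66) -> float:
    # 64B minimum-size frames occupy 84B on the wire
    # (preamble + inter-frame gap), giving ~14.88 Mpps at 10G.
    wire_bits = (pkt_bytes + 20) * 8
    pps = link_gbps * 1e9 / wire_bits
    return cores * clock_ghz * 1e9 / pps
```

With these assumed parameters the result is roughly 1430 cycles per packet, consistent with the ~1400-cycle budget quoted in the talk.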
First step is to optimize the packet I/O. Then offload the other functions to the GPU.
GPUs are massively-parallel. Lots of small cores.
A GTX480 GPU has 480 cores and 1.2 billion transistors, most of which is dedicated to ALU.
Operations like hashing, encryption, pattern matching, network coding and compression are computationally intensive. GPU is well suited to these. GPU can also effectively hide memory latency.
Memory bandwidth of a top-of-the-line CPU is 32GB/s, but the empirical bandwidth (on realistic access patterns) is 25GB/s. Multiple ports receiving and transmitting will consume this and cause contention. However, a GPU has 174GB/s memory bandwidth.
Key insight: stateless packet processing is parallelizable. Take packets from the head of the receive queue, batch them and process them in parallel.
Latency is not impacted by parallel processing.
Before shader: checksum, TTL, format check, etc. This will send some packets along the slow path. It collects the destination IP addresses and passes those to the shader.
Shader: takes IP addresses, looks up the forwarding table and returns the next hops.
Post-shader: packets are updated and transmitted through the output ports.
Also device drivers at the receive and transmit side. Implemented a custom driver; details in the paper.
Can scale further with a multicore CPU: one master core and three worker cores, with the master core talking to the shader. With multiple sockets, you need one GPU per CPU; there is no communication between the CPUs, and each CPU owns a subset of the input queues.
Evaluated by connecting a packet generator and PacketShader back-to-back. Generator generates up to 80Gbps.
GPU gives a speedup (over CPU-only) of 1.4x for IPv4, 4.8x for IPv6, 2.1x for OpenFlow and 3.5x for IPSec.
IPv6 table lookup requires more computation than IPv4 lookup; the algorithm is binary search on hash tables. Big performance improvement for small packets, but slightly worse for 1024- and 1514-byte packets; however, this is bounded by the motherboard I/O capacity.
IPSec tunneling adds a header and trailer to the encrypted packet. The improvement is across all packet sizes, and is actually bigger for larger packets.
PacketShader achieves 28.2 Gbps with CPU only, and is implemented in user space, rather than kernel space. Reaches 39.2 Gbps with the GPU.
Need to add a control plane (currently only does static forwarding), e.g. Quagga or XORP.
Could also integrate with a programming environment, such as Click.
Q. Is it worth implementing such a sophisticated design to make a 40% saving? And do you have a breakdown of where the savings are made? The budget numbers and breakdown are taken from RouteBricks.
Q. What do you think about the power efficiency of this compared to other approaches? Idle to full load is 327W–594W with two CPUs and two GPUs. (Compared to 260W–353W for two CPUs.)
Q. Does this approach have advantages over an integrated network processor in terms of scalability or programmability? Network processors are not commodity. Based on experience, they are much more difficult to program.
Q. Why did your approach have such a significant speedup over RouteBricks etc. even without the GPU? Improvements in packet I/O throughput.
EffiCuts: Optimizing Packet Classification for Memory and Throughput
Packet classification is important for security, traffic monitoring and analysis, and QoS. Usually based on the source and destination IPs and ports, and the protocol field.
Line rates and classifier sizes are increasing. This leads to high power consumption.
Previous approaches have used either TCAMs (poor scalability) or algorithmic approaches (potentially scalable, but problematic). Most promising approach based on decision trees. Aim of this work is to address the scalability of decision tree algorithms.
HiCuts and HyperCuts have investigated decision trees previously. However, they require large memory.
EffiCuts reduces the memory overhead of HyperCuts while achieving high packet throughput. Uses 57x less memory and 8x less power.
Rules in the decision tree are hypercubes in the rule space. Tree building successively cuts down the rule space into smaller sub-spaces. Stops when the cube is small. Classification uses tree traversal.
HyperCuts’ memory overhead arises because many rules overlap and vary in size: the fine cuts needed to separate small rules also cut, and thus replicate, large rules. There is also overhead because the rule space is sparse (leading to empty nodes or nodes with replicated rules).
Aim to tackle the variation in the rule size and the density of the rule space.
Separable trees: build separate trees for large and small rules. But separate them along different dimensions.
Build a distinct tree for each set of separable rules in 5 IP fields. This leads to a maximum of 31 trees, but in practice it’s more like 12.
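Separability can be sketched as bucketing each rule by the set of fields in which it is “large” (the 0.5 largeness threshold is illustrative; per the Q&A below, the split is insensitive to thresholds between 0.1 and 0.95):

```python
def separable_buckets(rules, field_sizes, largeness: float = 0.5):
    # rules: list of rules, each a list of (lo, hi) ranges per field.
    # A rule's bucket is the tuple of booleans saying, per field,
    # whether it covers more than `largeness` of that field's space.
    buckets = {}
    for rule in rules:
        key = tuple((hi - lo + 1) / size > largeness
                    for (lo, hi), size in zip(rule, field_sizes))
        buckets.setdefault(key, []).append(rule)
    return buckets
```

With 5 fields there are at most 2^5 = 32 such combinations, which is where the 31-tree bound comes from; in practice only around 12 buckets are populated.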
Extra memory accesses to traverse multiple trees decreases packet throughput. To reduce the number of accesses, merge the trees.
HyperCuts uses equi-sized cuts to separate dense areas, whereas EffiCuts uses equi-dense cuts, which lead to fine cuts in dense areas and coarse cuts in sparse areas. Many details of this are in the paper.
Node co-location: colocate a node and its children, details of this in the paper.
Implemented HiCuts and Hypercuts with all heuristics, and EffiCuts. Used 16 rules per leaf. Power comparison uses an estimation from the Cacti tool to simulate the SRAM/TCAM.
First result: HyperCuts and HiCuts see memory grow much more rapidly than EffiCuts. Replication decreases from 1000 to under 9. EffiCuts needs a constant number of bytes per rule as the number of rules grows.
EffiCuts requires 50% more memory accesses than HyperCuts. However, since EffiCuts uses much less memory, memory copies are inexpensive.
Throughput results are mixed (149 down to 73 million packets per second for one rule set; but 218 up to 318 for another). Still see an 8x saving in power.
Also compared EffiCuts to TCAM. Throughput story is also mixed, but EffiCuts consumes 6x less power than a TCAM.
Q. How do you separate “large” and “small” rules—using a threshold? We observed that the rule spread is essentially bimodal. This is based on a sensitivity analysis to the “largeness fraction” which varies between 0.1 and 0.95 without affecting the split.
Q. How would power consumption compare to a TCAM where you selectively turn on the relevant banks? Since we are comparing the worst-case packet match, every rule could go to a very different bank.
Cloud and Routing
Theory and New Primitives for Safely Connecting Routing Protocol Instances
Earlier measurement study showed that internet routing is much more complex than the traditional two-level hierarchical model (EIGRP/BGP/OSPF). Since it is so complicated, the connecting primitives play a critical role. 99.9% of analyzed networks depend on them. They are used for things like domain backup, partition healing and router-level shortest path routing.
Designs are usually either safe or flexible. The status quo is unsafe and inflexible. The talk describes something that is both safe and flexible!
Framework based on routing algebras (metarouting). Connecting primitives have provable safety properties, more expressivity and require no modifications to existing protocols.
Today’s connecting primitives have two features: route selection (ranking) and route redistribution (information exchange). Configured using various optional settings in router configuration scripts.
Misconfiguration can lead to suboptimal routes, or loops.
Two questions: how should routes be compared? and when should they be redistributed? The idea is to have a conversion function from routes to a universal metric. Each routing instance is associated with a pair of conversion functions. Now frame the problem as what properties these functions should satisfy.
Contributions: sufficient conditions to guarantee correct routing.
Goal: unmodified BGP, OSPF, RIP, EIGRP. But BGP and OSPF E2 are not strictly monotonic.
Non-BGP protocols modeled by a 2-dimensional ranking, comprising route type and cost. Conversion functions map these to and from an ordered set.
Route selection: prefer non-BGP to BGP routes, for non-BGP routes prefer type-A then B and then C, and among non-BGP routes of the same type prefer the lowest cost.
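The ranking above can be expressed as a conversion to a universal metric, so route selection reduces to taking a minimum. This is a hedged sketch: the tuple encoding and names are illustrative assumptions, not the paper's actual functions.

```python
# Map each (protocol, type, cost) route to a lexicographically ordered tuple:
# non-BGP before BGP, then type A < B < C, then lowest cost.

TYPE_RANK = {"A": 0, "B": 1, "C": 2}

def to_universal(route):
    proto, rtype, cost = route
    if proto == "bgp":
        return (1, 0, 0)           # all BGP routes rank below non-BGP ones
    return (0, TYPE_RANK[rtype], cost)

routes = [("bgp", "A", 1), ("ospf", "B", 20), ("rip", "B", 5), ("ospf", "C", 1)]
best = min(routes, key=to_universal)
# best is the non-BGP route of the best type with the lowest cost.
```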
Domain backup/partition healing: currently possible, but only with a complex configuration, a star topology and giving protection only to the leaves. In the new design with the default conversion functions, can do this with any topology and any available path.
Router-level shortest path: currently only between OSPF and cannot change the cost. The new design can do it with OSPF, RIP and EIGRP using per-instance metrics.
Traffic engineering: existing design allows this only within instance, but now we can do it across instances.
Q. Do the functions have to be common across all the routers? Yes, it has to be consistent across all the border routers. But future work would let it be different.
Q. Do you see the potential for work in the standards bodies to standardize these functions? Talking to the router vendors about this.
Q. What is the behavior during convergence? We haven’t focused on that. After some time, we converge to a correct path. How can you be sure that it is correct without enforcing a global ordering? We do enforce a global ordering at present.
Q. Does your scheme require a centralized entity? We have no such entity. What about between different ASs? …
DONAR: Decentralized Server Selection for Cloud Services
Server selection as a customizable constrained optimization. This talk describes a distributed solution to this problem.
User-facing services are being replicated across the world. Server selection involves selecting which replica to use for a particular user. Want to be agnostic to the wide-area protocol being used. For example, distributed DNS and HTTP redirection/proxying.
Idea was to build a system, DONAR, to which server selection can be outsourced.
Policy interface. Naïve policies include round-robin (for load balancing), or location-aware closest-node. But want to support complex policies over many nodes.
Policies are represented by constraints. For example, a bandwidth cap, or a split ratio and allowable deviation. Without constraints, just use closest node, but this can lead to massive imbalance.
Improvement: add a bandwidth cap on some instances, which now gives some locality, but doesn’t overtax a particular instance (and offloads some traffic to other instances).
Improvement: split 10% across ten replicas with a +/- of 5%. The tolerance is unique to this implementation. Gives the ability to trade off network proximity and load distribution.
Can have a mix of bandwidth cap, proportional split and tolerance across different instances.
For a customer, can define a global linear program that describes the optimal pairing. Minimize network cost (distance) such that load is within tolerance and bandwidth caps are met.
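A toy brute-force version of that optimization (the real system solves a large linear program; all numbers and names here are made up): assign each client group to a replica, minimizing total network distance subject to each replica's load staying within the requested split plus/minus tolerance.

```python
import itertools

clients = {"eu": 1.0, "us": 1.0, "asia": 1.0}          # unit load per group
dist = {("eu", "r1"): 1, ("eu", "r2"): 5,
        ("us", "r1"): 5, ("us", "r2"): 1,
        ("asia", "r1"): 3, ("asia", "r2"): 3}
replicas = ["r1", "r2"]
split, tol = 0.5, 0.2                                   # 50% each, +/- 20%

def feasible(assign):
    """Check each replica's share of total load is within split +/- tol."""
    total = sum(clients.values())
    for r in replicas:
        load = sum(clients[c] for c, rep in assign.items() if rep == r)
        if abs(load / total - split) > tol:
            return False
    return True

# Enumerate all assignments; prefer feasible ones, then lowest total distance.
best = min(
    (dict(zip(clients, combo))
     for combo in itertools.product(replicas, repeat=len(clients))),
    key=lambda a: (not feasible(a), sum(dist[c, r] for c, r in a.items())),
)
```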
Need to do this for each customer, and continuously. Aim is to have potentially hundreds of DONAR nodes, customers and replicas per customer, and tens of thousands of client groups per customer. The linear program has millions of variables per customer.
DONAR nodes could measure traffic and optimize locally. But no one node sees the entire client population, and the distribution at each node is skewed.
Could maybe aggregate all of the data at a central coordinator. This would end up sharing a lot of data and compromising responsiveness. For example, we would want a node to respond quickly to a flash crowd.
Actually came up with a decomposition strategy, for both the objective function and the constraints. Uses a Gauss-Seidel iterative decomposition approach. The summary data shared is only proportional to the number of replicas. Proof in the paper that this converges to the global optimum.
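The shape of such a Gauss-Seidel-style decomposition can be sketched as follows (this is an assumed update rule for illustration, not DONAR's actual one): each node re-optimizes its own clients in turn, given only the aggregate per-replica load reported by the others, so the shared state is proportional to the number of replicas.

```python
replicas = ["r1", "r2"]
# Each node sees a skewed local client population: (load, {replica: distance}).
nodes = {
    "n1": [(1.0, {"r1": 1, "r2": 4})],
    "n2": [(1.0, {"r1": 1, "r2": 2})],
}
target = {"r1": 1.0, "r2": 1.0}         # desired absolute load per replica
assign = {n: {r: 0.0 for r in replicas} for n in nodes}
penalty = 10.0                           # weight on exceeding the target load

for _ in range(20):                      # Gauss-Seidel sweeps: nodes update in turn
    for n, groups in nodes.items():
        # Aggregate load placed by all other nodes, per replica.
        others = {r: sum(assign[m][r] for m in nodes if m != n)
                  for r in replicas}
        assign[n] = {r: 0.0 for r in replicas}
        for load, d in groups:
            # Greedily place on the replica with lowest distance plus
            # penalty for pushing the replica past its target.
            best = min(replicas, key=lambda r: d[r] + penalty *
                       max(0.0, others[r] + load - target[r]))
            assign[n][best] += load
# n1 keeps its nearby replica r1; n2 shifts to r2 once r1 looks loaded.
```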
Deployed since November 2009, and in production use for two customers. Currently serves around one million DNS requests per day.
Other systems challenges in the paper: network availability, reliable data storage, etc.
Experimental setup using CoralCDN with a proportional split across the replicas. Closest node policy is very volatile, whereas DONAR equal split gives much more predictable workloads.
Better than simple round-robin, since DONAR keeps the network distance as small as possible (shown by the rank distribution of node distance from the client).
Q. How quickly does the global solution converge? Covered in the paper, and usually after one round of hearing from everyone.
Q. Can you incorporate server-specific policy like consistency or staleness (cf. weak/eventual consistency, where different replicas have different data)? We assume that we can direct a request to every replica. But we could solve this problem for separate domains.
Q. What’s the inaccuracy that comes from geolocation based on the location of the resolver? This has been well studied.
Q. What if a server’s settings change or it fails, how long will it take to recompute? Built-in support for liveness updates (either notification or loss of heartbeat), and use a heuristic before we rerun the optimization. Rerun the optimization every two minutes.
Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud
Challenge: data privacy issues, such as national privacy laws and industry-specific privacy laws (like HIPAA). Challenge: SLA requirements like response time.
Possible solution is hybrid clouds.
First focus is on planning hybrid cloud layouts, making cost savings, minimizing response times and bandwidth costs.
Second focus is on migrating security policies, such as firewall contexts and ACLs.
Contributions: study of the complexity of enterprise applications, first-cut solutions to the two challenges, and validations.
Enterprise applications are typically three-tier but with multiple components in each tier, and complex interdependencies.
Abstract the planning problem using a graph with vertices for components, and virtual vertices for internal and external users. Each vertex has a size (number of servers), and edges have a number of transactions per second and size of transactions per second. Objective is to maximize cost savings through migration, subject to policy constraints and bounds on the increase in transaction delay. Then partition the graph between local and remote.
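The formulation above can be sketched with a toy brute-force planner (all numbers, component names and the delay model are invented for illustration): pick the subset of components to migrate that maximizes savings while the extra wide-area delay stays under a bound and pinned components stay local.

```python
import itertools

savings = {"web": 10, "logic": 20, "db": 30}   # $/yr saved if migrated
# Edges: (endpoint, endpoint, transactions per second); "user" is always local.
edges = [("user", "web", 5), ("web", "logic", 20), ("logic", "db", 50)]
WAN_DELAY_PER_TX = 0.5      # extra delay charged per tx/s crossing the split
DELAY_BOUND = 15.0
pinned_local = {"db"}        # placement policy: databases stay on-premises

def plan(components):
    best, best_saving = frozenset(), 0
    for k in range(len(components) + 1):
        for cloud in map(frozenset, itertools.combinations(components, k)):
            if cloud & pinned_local:
                continue
            # Traffic crossing the enterprise/cloud boundary.
            cut = sum(tx for a, b, tx in edges
                      if (a in cloud) != (b in cloud))
            delay = WAN_DELAY_PER_TX * cut
            saving = sum(savings[c] for c in cloud)
            if delay <= DELAY_BOUND and saving > best_saving:
                best, best_saving = cloud, saving
    return best, best_saving
```

With these toy numbers, only migrating the web tier fits under the delay bound, because the logic–db link is too chatty to cut.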
Approach is to use easily-available information, like computation times of components and communication times on links.
Model user response times using bounds on changes to the mean delay and variance.
Benefits for cost savings based on the estimates in the Above the Clouds tech report. Based on a non-elastic migration, and future work is to look at using the cloud for peaks.
Migration algorithm is based on a reachability matrix to determine necessary security policies.
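A hedged sketch of the reachability-matrix idea (component names and the rule format are illustrative, not from the paper): derive the end-to-end reachability implied by today's ACLs, then list the flows that now cross the enterprise/cloud boundary and therefore need explicit rules there.

```python
components = ["web", "logic", "db"]
acl_allows = {("web", "logic"), ("logic", "db")}   # today's permitted flows

# Reachability matrix: reach[a][b] is True iff a may send traffic to b.
reach = {a: {b: (a, b) in acl_allows for b in components} for a in components}

migrated = {"logic"}
# Allowed flows that cross the boundary after migration need boundary rules.
cloud_rules = sorted(
    (a, b) for a in components for b in components
    if reach[a][b] and ((a in migrated) != (b in migrated))
)
```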
Evaluated based on two case studies, the Windows Azure SDK application and a Campus Enterprise Resource Planning application.
Used a cloud testbed to evaluate a thumbnail-making example application. The plan results in an increase in mean delay of less than 10% and an increase in variance of less than 50%.
The campus ERP application involves multiple front-end, business logic and back-end components, in use at Purdue.
With a 30% bound on increase in mean delay, get $58k savings by migrating all components. For a 15% bound, get $38k savings by migrating a subset of the components. For a 15% bound with a placement policy (don’t move DBs), still get a $14k saving. Paper contains a sensitivity study on the benefit ratios.
Various security policies are also generated and evaluated.
Q. How much time did it take to understand the dependency mapping between components? Talked to operators and interviewed them, so the dependencies were extracted by humans. What about license dependencies, such as “may not be run in a VM”? [Taken offline.]
Q. Why did the delay increase more than the policy in the graphs? Used PlanetLab before a deadline, so the environment was hostile.
Q. Can you comment on the complexity of rewriting applications to run in split mode, which seems to be more important than performance or cost? There is an increasing trend towards service-oriented architectures, which make it easier to do this migration. Haven’t solved this for legacy applications.
Q. Did you have an anti-colocation scheme, e.g. for preventing certain pieces of data from being stored in certain locations? Constraints are based only on cloud/non-cloud placement.