SIGCOMM 2010: Day 1

Wireless and Measurement

Efficient Error Estimating Coding: Feasibility and Applications

Won best paper award.
Existing philosophy: errors are bad and only want completely correct data.
Can we accept partially correct packets and only enforce correctness end-to-end?
Contribution: error estimating coding. Enables the receiver to estimate the number of errors in a packet, but not correct them.
Smaller overhead/weaker functionality than error correcting codes.
Overheads come from redundancy and computation.
EECs need only O(log n) bits redundancy to estimate errors. e.g. 2% overhead on a 1500-byte packet. For just a threshold check, need only 4 bytes.
Efficient computationally: software implementation can support all 802.11 data rates. ECC is 10 to 100 times slower.
Can estimate number of errors in a packet with a provable quality.
Application: streaming video. FEC (forward error correction) often used here. Routers forward partially correct packets. But if the number of errors is so large that the data are unrecoverable, it will incur retransmission. Router should have asked for retransmission earlier when a packet could not be decoded. But it lacks computational power to evaluate an ECC. EEC is more computationally tractable in this scenario for BER-aware retransmission.
Implemented for Soekris Net5501-70 routers.
Key idea: router should treat different packets differently. Could use analogue or digital amplification as necessary.
Packet scheduling: image sensor network for emergency response. Let packets with smaller BER get through first.
Applies also to bulk data transfer. Can use partial packets and correct them end-to-end. Could use network coding or incremental redundancy. EEC helps to do WiFi rate adaption, if we know the mapping between data rate and BER, which EEC provides. Existing systems based on packet loss rate, signal-to-noise ratio (at the receiver), or modifying the PHY.
Implemented a prototype rate adaptation scheme using EEC. Consistently outperforms existing schemes based on packet loss rate and SNR.
More general problem is wireless carrier selection: goal is to select carrier with the best goodput.
Packet has n + k slots (n data bits and k EEC bits). p is the fraction of erroneous slots, with arbitrary position.
NaÃ¯ve solution would be to add pilot bits to each packet with known values at known positions. But this needs the receiver to observe enough errors in the pilot bits. Since BER is usually small, need a lot of pilots to see a single error.
Instead, make the pilot bit a parity bit based on a known group of data bits in the packet. But cannot distinguish many cases with parity bit (1 error vs. 3 errors). Error probability of a parity bit is (inversely) correlated with error probability of the bits in its group.
Solution involves randomly selecting data bits to make up fixed sized groups and compute an EEC parity bit. Now permute the data and EEC bits and send.
Can refine to single- or multiple-level EEC. Details of multiple-level EEC in the paper.
Can prove a formal guarantee for the space complexity of the number of EEC bits (the O(log n) bound).
Compared to SoftPHY, EEC is a pure software solution, which is more deployable. But SoftPHY gives per-bit confidence information that EEC cannot provide.

Design and Implementation of an “Approximate” Communication System for Wireless Media Applications

Leverages properties of wireless PHY layer to improve media applications.

Media applications by 2013 will comprise 91% of internet traffic, and wireless based access is the dominant for of access (4 billion wireless hosts vs. 0.5 billion static hosts).

In this case, we should use the spectrum efficiently for media transfer.

Looking at hierarchically-structured media, such as MPEG4 and H.264. Different frames (e.g. I, P, B) have different value (I > P > B). So use unequal error protection to prioritize important frames.

Since data received is a predictable approximation of transmitted data, they can provide unequal error protection almost for free (in terms of additional spectrum).

Errors result from the constellation diagram used to determine QAM encoding. Typically, the error is restricted to symbols in the neighborhood of the received symbol in the constellation.

Using a Gray code-based bit mapping from symbols to the QAM encoding. So, by definition, neighboring symbols are just one bit different. Different bit positions offer different levels of protection. So can choose different bit positions to give different protection to data.

For the media example, but I frames in more-protected bit positions, and other frames in the other positions.

Also considered a block mapping scheme, which gives better protection for the most protected bits, and worse protection for the other bits, than a Gray code mapping.

Designed a system based on these principles: APEX.

A modern radio pipeline will apply randomization, such as scrambling, coding and interleaving. But this makes it harder to determine what bits go where. So they move randomization to before the assignment of bit positions.

Uses a greedy algorithm to map application data at various priorities that deals with unequal content size.

Evaluated at various bit rates and bit error rates. Gets a better PSNR than traditional transmission. Also showed that it works well with FEC.

Q. Did your evaluation take fading into account, and would the assumptions still hold? Experimentation done using a system not robust enough to take outdoors and do experiments. The assumptions might hold. If you are sharing wireless LANs more, does this become more critical?

Q. How will it work when your bit-rate adaption drops down to BPSK and QPSK? It does nothing in this case. How might it work? Could instead use smaller QAM symbols.

Q. What if you get 180 degree phase shift, due to fading or propagation delay, and the mapping changes? At the PHY layer, we expect there to be a mapping, and an indexing mechanism that can decode information in the header of the packet to select the mapping.

Not All Microseconds Are Equal: Enabling Per-Flow Measurements With Reference Latency Interpolation

Low-latency (e.g. financial) applications are increasingly important.
Current solution is a low-latency cut-through switch. In a tree network, it is hard to tell which switch is causing a problem at microsecond granularity. Need high-fidelity measurement within the routers themselves.
But SNMP and NetFlow provide no latency measurements, and active probes are typically only end-to-end. Measurement boxes are very expensive (Â£90k).
Need per-flow measurements because averages lose too much information about what happens to each flow. There is a significant amount of difference in average latencies across flows at a router.
Perform measurement on packet ingress and egress. Assume that router interfaces are synchronized, because cannot modify packets to carry timestamps.
NaÃ¯ve approach: store timestamps for each packet on ingress and egress. Packet timestamps get sent along the egress route when the flow steps. Obviously this is too costly to do at 10Gbps. Sampling sacrifices accuracy.
Use LDAs with many counters for interesting flows, counting packets seen at each timestamp.
Divide time into windows, and measure the mean delay for packets within that window (locality observation). Error shrinks with a smaller window size.
Can inject a reference packet regularly that does have ingress and egress timestamps, which gives delay samples for each window.
Implementation has two components: reference packet generator and latency estimator.
Reference packet generation strategies: 1 in n packets or 1 in tau seconds. Actual approach is dynamic injection based on utilization. When high utilization, inject fewer packets.
Latency estimator strategies: could use only the left reference packet (previous reference packet) or a linear interpolation of the left and right reference packets. Other non-linear estimators, such as shrinkage, are possible.
Maintain packet count, summed delay and sum of squares of packet delays for each flow.
Evaluated on various router traces, and simulated with NetFlow YAF implementation. (RED active queue management policy.) Median relative error is 10–12%. As utilization grows, error decreases. RLI outperforms other schemes by 1 to 2 orders of magnitude.
Overhead is < 0.2% of link capacity. Packet loss difference is 0.001% at 80% utilization.
Q. Have you talked to router vendors about implementing this? No.
Q. When you compare the different approaches, why is sampling so much worse? Sampling scheme is one packet per thousand. So accuracy is very low if few packets are delayed.
Q. How does it influence the results if packet loss rates are high? Lost packets are not counted in our result.

Data Center Networks

Generic and Automatic Address Configuration for Data Center Networks

Manual configuration is error-prone, causing downtime. DHCP is used at layer 2. But applications need this information as well. Data center designs (manually) encode this information in the IP address for routing. But DHCP isn’t enough for this.
Takes two inputs: a blueprint graph (with logical IDs for each machine) that can be automatically generated, and a physical topology graph that is available later when the data center is constructed.
Center of framework is a device-to-logical ID mapping. Need malfunction detection to update the mapping.
Maintaining a map between devices and the logical IDs is the graph isomorphism problem. The complexity of this problem is unknown (P or NPC). Introduce the O2 algorithm which solves the problem in this case. Proceeds by choosing an arbitrary first node matching (decomposition), then refinement. Terminates when no cell can be decomposed. Overall algorithm terminates when all cells have single nodes.
O2 has three weaknesses (i) iterative splitting in the refinement stage, (ii) iterative mapping in the decomposition stage, and (iii) making a random selection of the mapping candidate. Optimization algorithm has three heuristics that address these problems.
O2 turns out to be faster than its competitors: Nauty and Saucy.
Malfunctions cause the topology to differ from the blueprint, so O2 cannot find the mapping. Solution is to find the maximum common subgraph between the blueprint and physical graphs. The algorithm for this is NP- and APX-hard.
Use heuristics based on the vertex degree changing. If no degrees have changed, probe subgraphs derived from anchor points using majority voting to identify miswired devices.
Protocols for channel building, physical topology collection and logical ID dissemination. A DAC manager coordinates these.
Experimented on a BCube(8, 1) network with 64 servers. The total time to run the algorithm was 275 milliseconds to autoconfigure all of the servers.
Ran simulations on larger topologies. Up to 46 seconds on a DCell(6, 3) network with 3.84 million devices.
Q. What are the next steps for this work? Can we design better logical IDs that can be used in routing.

Symbiotic Routing in Future Data Centers

Reevaluate network architecture based on the different properties of data center networks when compared to the internet.
Despite lots of other work, the network interface has not changed, so what can we do at the application layer?
Network is a black box, and applications have to infer things like locality, congestion and failure; likewise networks have to infer things about the applications like flow properties.
MSRC designed CamCube: a network with x86 servers directly-connected in a 3D torus. Servers have (x, y, z) coordinates that are exposed to the application. The send/receive API is a simple 1-hop API. Multi-hop routing is provided as a service, which uses multipath when possible.
Built a high-throughput transport service, a large-file multicast service, an aggregation service and a distributed key-value cache service. Each had a custom routing protocol based on the properties that the application needed to obtain. e.g. High-throughput transport prefers disjoint paths, whereas file multicast prefers non-disjoint paths.
Testbed used 27 servers with 6×1G NICs. A simulator looked at a 20^3 (8000) node CamCube.
Custom routing policies yield a performance improvement (on average). Factor of 2 (median) improvement in end-to-end throughput for high throughput transport (10k x 1500 byte packets). Gains also for multicast and the distributed object cache (in terms of path length).
Also looked at impact on the network: achieved higher throughput with fewer packets (lower link utilization) for all applications.
The base routing protocol is still used to route along paths defined in the custom routing protocol, and to handle network failures. The custom protocol route for the common case.
Built a routing framework for describing these custom protocol. Two components: the routing policy and the queuing policy. Each service manages one packet queue per link.
Cache service: keyspace mapped onto the cube, evenly distributed across the servers. Routing: go to the nearer of the cache or primary nodes. On a cache miss, route from the cache to the primary, and populate the cache on the return.
The base protocol routes around link failures. If a replica server fails (in the key-value store), the key space is consistently remapped by the framework.
Forwarding function implemented in C#, running in userspace.
Benchmarked a single server in the testbed, communicating at 11.8Gbps with all six neighbors. Required 20% CPU utilization.
Can the routing approach be used outside CamCube? Network only needs to provide information about path diversity and topology, and the ability to program components.
Q. When you make an application for this framework, what would happen to the application if you decided to change the topology? The benefit of the black-box approach is that you don’t care about the topology. May be advantageous to target containerized/modular data centers where the topology cannot frequently (or at all) change.
Q. How would the performance look on other topologies, considering that the torus is optimal for latency and bandwidth? It is a benefit and a curse, given that we occasionally have long path length for some pairs. Topologies that give you higher path diversity give you better chances to employ these ideas.
Q. What if you ran multiple instances of the same application (rather than applications with very diverse routing policies)? We did run this, but the details are in the paper. For the high-throughput transport protocol, you might expect us to be susceptible to congestion, but the forwarding function returns many choices, which you can choose based on minimal queue length, for example.
Q. What is the net result here for forwarding latency? This is one of the main critiques of the topology itself. Currently experimenting with smart NICs so that we don’t have to go up to user space for straightforward forwarding.
Q. Can you write the forwarding method in the form of many overloaded methods that code be dispatched dynamically? To some extent, but packets are tagged with a service ID, which statically dispatches the forwarding method of the particular service.

Data Center TCP

TCP is used for 99% of traffic in data centers, but what is the problem with it? Can suffer from bursty packet drops (Incast), and builds up large queues that add significant latency.
Many ad hoc workarounds for TCP problems, such as at the application level. This talk is about changing the TCP stack in the kernel to address these problems.
Interviewed developers, analyzed applications and did a lot of measurements. Systematic study of impairments and requirements in Microsoft’s data centers.
Case study: Microsoft Bing data center. 6000 servers, with passive instrumentation (application/socket/kernel-level). Search query goes to top-level aggregator, which splits the query and farms it out to mid-level aggregators, which then farm it out to worker nodes. Worker nodes have a 10ms deadline; mid-level aggregators are 50ms; and the top-level deadline is 250ms. Missed deadlines lead to missing data in the results.
Similarly, Facebook builds a page by pulling data from various servers. Similar traffic pattern to Bing.
Incast happens when “synchronized mice collide”. Caused by partition/aggregate: the queue overflows at the aggregator. To deal with the problem, Bing jitters requests over a 10ms window. This gets better performance at the median, but causes problems at higher percentiles (up to 99.9th is tracked).
Queue buildup causes when big flows take up too much of a queue and increases the latency for short flows.
Requirements: 1. High burst tolerance; 2. Low latency for short flows; 3. High throughput. 1 and 3 are in tension with 2. Deep buffers helps 1 and 3 but increases latency. Shallow buffers are bad for bursts and throughput.
Objective is low queue occupancy, with high throughput.
TCP uses explicit congestion notification, inserted in packets in the middle, noted by the receiver and sent back to the sender in the ACKs.
Need C * RTT buffers for a single flow running at 100% throughput. If you have a large number of flows, you can have fewer buffers. If there is a low variance in sending rate, small buffers are sufficient.
Key idea: react in proportion to the extent of congestion, not its presence (cut the congestion window by less if there are fewer congestion notification bits set). Other key idea: mark packets based on the instantaneous queue length.
Sender maintains a running average of marked packets, and adaptively cut the congestion window based on how many packets are marked.
On a real deployment, DCTCP for two flows keeps the queue length much shorter than regular TCP.
Get high burst tolerance by having large queues. Low latency by having low buffer occupancy.
How long can DCTCP maintain queues without loss of throughput, and how do you set the parameters? Need to ensure that the queue size is stable by quantifying the oscillations.
Implemented on Windows 7 using real hardware with 1G and 10G stacks. Aim was to emulate traffic within one rack of the Bing data center. Generated query and background traffic based on distribution seen in Bing. For background flows, DCTCP gets lower latency for small flows, and matches TCP for large flows. For query flows, DCTCP does much better than TCP.
Tried scaling the traffic (background and query) by 10x. Compared DCTCP, TCP and TCP-RED with shallow-buffered switches, and TCP with a deep-buffered switch. TCP-DeepBuf does terribly for latency with short flows, and TCP-ShallowBuf due to Incast
Q. Shouldn’t you be comparing with TCP-ECN? We have done those comparisons, though they aren’t in the talk.
Q. If you reduce in proportion to the number of bits, it depends on the timescales on which the queue builds up, which depends on the number of competing sources and their own reactions. Don’t you have to do some control theoretic modeling? Data center is a homogeneous environment with all sources being DCTCP. Even there you really need to figure out the bandwidth properly? There is more detail in the paper.
Q. Depending on the instantaneous queue for the notification has implications for the dynamics, and I am concerned about this? Homogeneous RTT helps here, thanks to the data center environment. We believe there are some simple ways to solve the problem for heterogeneous RTT (e.g. multiple queues). I’d like see how much diversity you can tolerate?

Inter-Domain Routing and Addressing

Internet Inter-Domain Traffic

It’s hard to measure the Internet, and the lack of ground truth data makes it harder to know if you’re doing it accurately.
Wanted to collect a large data set that shows how the internet is evolving.
Conventional wisdom: global-scale end-to-end network, broad distribution of traffic sources and sinks, and the existence of a “core”. But as time passes, these are becoming less true.
Methodology: focussed on inter-domain traffic, not application layer things such as web hits/tweets/VPN/etc. Exported coarse-grain traffic statistics about ASNs, ASPaths, protocols, ports, etc. via anonymous XML forwarded to central servers. Covers 110 ISPs/content providers, 3k edge routers, 100k interfacesâ€¦ i.e. about 25% of all inter-domain traffic. Then waited 2 years and repeated to get longitudinal data. Used commercial probes within a given ISP, with limited visibility into payload-based classification. Calculated percentages per category then weighted averages using number of routers in each deployment. Also incorporated informal and formal discussions with providers, and information about known traffic volumes.
Validated predictions based on a ground-truth based on 12 known ISP traffic demands (Known peak Tbps).
In two years, Google and Comcast have grown to be the 3rd and 6th biggest carriers in terms of traffic demands.
Cumulative distribution of carriers: in 2007, thousands of ASN contributed the first 50% of content; in 2009, it was 150 ASNs. In 2010 it’s even more dramatic (but not shown).
By buying YouTube, Google went from 1% of internet traffic to 6% (not including the global cache â€” i.e. an underestimate).
In 2007, Comcast has “eyeball” peering ratios, but by 2009, they are a net producer of content. Video distribution, backbone consolidation, etc. contributed to this.
Price per megabit of wholesale for internet transit has collapsed whereas the revenue for internet advertisement has greatly increased.
IP and CDNs have been commoditized, hence enterprise traffic has moved to the cloud. Companies have consolidated. Bundling (triple/quad play etc.) has become popular. Several new economic models, such as paid content, paid peering, etc. (but often under NDA). Also disintermediation as customer and provider connect directly.
Traditional internet model is hierarchical: not really true but pedagogically used. The new “Tier-1″ has “Hyper Giants”, representing the large content providers.
In terms of protocols: HTTP has grown 24.76%, video has grown 67.09% and P2P has shrunk > 70%.
Port usage is also consolidating: similar distribution to the content providers: fewer ports account for more of the internet. Looked at Xbox traffic on TCP port 3074, and saw a huge drop. Actual cause was a move to port 80.
File sharing has migrated to the web. In 2010, P2P is the fastest-declining application group. Direct-download like Megavideo, etc. are much more popular now, and you can even get HD streaming.
Internet traffic changes have been driven by changing economic models. Shift from connectivity to content, and the move to port 80 are two major trends. This has implications for engineering and research, as security/fault tolerance/routing/traffic engineering/network design have become more difficult.
Q. Economic implications: what are other types of business models and arrangements that might come out of “hyper giants”? Still in the early stages, but it’s not clear - from a power perspective - who has the upper hand.
Q. If the number of top players decreases by an order of magnitude, do you see the role of CDNs diminish and do you have any data on that? Talk about CDNs in the paper (about 10% or more of internet traffic). In general, they are growing, but enterprise content is driving a lot of the CDNs.

How Secure are Secure Interdomain Routing Protocols?

After a decade of research on secure BGP, no idea what the best solution. So this paper evaluates how well each protocol prevents “traffic attraction attacks”, based on simulation on empirical data. Used an AS-level map of the internet and business relationships, and a standard model of routing properties.
How do we strictly enforce correct path announcements? Solutions range from no crypto (BGP) to lots of crypto (data-plane security). It turns out that this isn’t the only problem: just as important to control who you announce to, as well as what is announced. Defensive filtering is also important.
Different relationships: customer-provider (customer pays), and peering (no payment). No value for how much a path costs in the model, but the paper models routing by preferring cheaper paths. After that, prefer shorter paths. Only transit traffic if it makes you money (i.e. on behalf of your customers).
A traffic attraction attack is an attempt to get as many people as possible to route traffic through the attacker’s AS (for tampering, eavesdropping, etc.). Simulations show that a traffic attraction attack can pull in traffic for 62% of ASs.
Origin authentication: secure database of IP to AS mappings to prevent people advertising origins they don’t own. Simulation shows 58% of ASs get attracted to the attacker.
Secure BGP: enforces that ASs cannot announce paths that have not been announced to them (using digital signatures). So can only append a prefix. Simulation shows that the attacker still attracts 18% of ASs.
Defensive filtering (of stubs (i.e. ASs with no customers, which should not route traffic)): provider drops announcements for prefixes not owned by its stubs. Defensive filtering thwarts all attacks by stubs (i.e. all of the previous cases), and 85% of ASs are stubs.
Sometimes announcing longer paths is better than announcing short paths; sometimes announcing to fewer neighbors (than all of them) is better. It’s NP-hard to find the optimal attack strategy.
Ran experiments by choosing randomly an attacker, victim pair and simulated a “smart attack” each protocol.
Evaluated probability that an attack attracts 10% of ASs (over a random choice of attacker and victim). Defensive filtering alone is as effective as Secure BGP alone. However, the attacks could be smarter, so these numbers are underestimates.
Why aren’t we using defensive filtering today? It’s hard to keep track of the prefixes that your customers own. A push to implement origin authentication is ongoing, and this could be used to derive the filtering mapping. The threat model is somewhat strange though.
Q. The CAIDA data are less than ideal, so how robust are the statements in the paper? Ran the experiments twice, on CAIDA data from 2009, and the UCLA Cyclops data. Trends for the two sets are similar.
Q. Did you look at comparing the effectiveness of models versus their deployability? Completely ignored deployability and implementability in this paper, but it will be the subject of future work.
Q. Have you considered varying the 10% threshold? Yes. [Backup slide.]
Q. Seems like the easiest way to attract traffic would be to deaggregate the prefix, so do you take this into account? We didn’t evaluate it because we idealized things to make it tractable.

Understanding Block-level Address Usage in the Visible Internet

What can simple observations about the internet say? Contributed methodology, applications and validation.
Looked at spatial correlation, address utilization, dynamic addressing and low-bitrate identification. Based on data set gathered for IMC. New, deeper understanding and a new interpretation.
Is there spatial correlation in the IPv4 address space? Are adjacent addresses likely to be used in the same way? This could help to efficiently select representative addresses for more-detailed study.
Collected data by pinging each address in randomly selected /24 blocks every 11 minutes for a week, and collected the probe responses, probing 1% of the whole IPv4 space. Gives 5 billion ping responses.
Three metrics: availability (normalized sum of up durations), volatility (normalized number of up durations) and median-up (median duration of a period of uptime).
Graphed by mapping the IP addresses onto a Hilbert curve.
Algorithm: examine each block size, if it is homogeneous stop, otherwise split the block and recurse.
Validating spatial correlation is hard, because it is hard to find a ground truth. Therefore used USC’s network for comparison, and the general internet (hostname-inferred truth). Also evaluated for different samples and dates.
Selected USC because the operator provided the ground truth, and they had knowledge of both allocated and usage blocks.
43% false negative rate, and 57% of blocks are correctly identified.
The general internet gives unbiased truth. The results are more correct than for USC: 68% correct to 32% false negative.
Low-bitrate identification: formalized RTT = transfer + queuing + propagation delays. Tried using median RTT to identify low-bitrate versus broadband. However, for international links, propagation time dominates. Variance of RTT separates low-bitrate from broadband.
Used hostnames as a form of ground truth (e.g. if the hostname contains “cable” or “3G”).
Q. Did you do any of this on IPv6? No. Are you planning to? Probably in the future when IPv6 is more popular.
Q. Is it reliable to use ping response to detect hosts, when some hosts refuse to handle ping responses? We didn’t consider false information in this work, but it would be valuable to consider this in future.

This entry was posted on Tuesday, August 31st, 2010 at 11:00 am and is filed under Technology, Trip Reports. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

mrry