Archive for September, 2010

SIGCOMM 2010: Day 3

Thursday, September 2nd, 2010

Network IDS

NetFence: Preventing Internet Denial of Service from Inside Out

  • DDoS is projected to be the biggest problem facing the internet in the next 12 months, and it is difficult to combat because it conflicts with the Internet’s design principles of openness and robustness.
  • Previous work has looked at protecting the receivers of DDoS attacks (denial of edge service), usually using network filters or network capabilities.
  • But with a large enough botnet, bots can collude to send packet floods which impair network services.
  • Challenge is to design a network architecture that combats both kinds of attack.
  • Solution: NetFence. Gives the network control over its resource allocation to combat denial of network services (DoNS). Also hierarchical and coupled with network capabilities.
  • Hierarchical congestion policing slows down flooding senders, and is robust to both compromised routers and hosts. Uses symmetric key cryptography, and each packet carries a secure token (based on Passport from NSDI 2008).
  • Secure congestion policing feedback is like a network capability. Capabilities are returned if the receiver authorizes the traffic as “desired”.
  • Two types of packet: request and regular. Packet has five fields: mode (nop/monitor), link ID, action (up or down), timestamp and MAC (authentication).
  • First a sender sends a request packet. The access router stamps the mode field as nop, and the MAC is calculated based on a hash of the original fields. If an attack is deemed to be underway (i.e. congestion is encountered), the action is set to down (deprioritize), the link ID is stored and the mode is set to monitor. The routers have keys distributed using Diffie-Hellman over BGP.
  • Policing is done at the access router, which looks at the packet sent back from the receiver (mode, action, etc.), and configures a leaky bucket as necessary.
  • Congestion policing loop uses AIMD at the access router to vary the sender’s bucket capacity (see the sketch after these notes).
  • A policing cycle is started based on a load- or loss-based detection procedure in the bottleneck router. RED is used to signal congestion within a cycle.
  • Works because: (i) secret keys used by routers to do feedback, (ii) periodic AIMD used to achieve fairness/efficiency, and (iii) congestion feedback acts as capabilities to prevent unbounded traffic.
  • Provable fairness is shown in the paper: each good user achieves a share of the network capacity equal to one over the total number of senders. Denial of service becomes “predictable delay of service”.
  • Many possible attacks against a system like this. Discussed in the paper, but two are discussed here.
  • To deal with floods of request packets, the request packet channel is separate, and there is a per-sender request packet limit, which is policed. There is a priority-based backoff which emulates computational puzzles.
  • To deal with routers hiding backoff feedback, the system treats the absence of an up-feedback as down-feedback.
  • Implemented on Linux using XORP and Click. AES-128 is the MAC function (see Encrypting the Internet). Benchmarked using DeterLab, dual-core 3GHz Xeons with 2GB of RAM.
  • The bottleneck router has 0 processing overhead when there is no attack. Overhead is 492–554 ns/packet (one AES computation) when there is an attack.
  • Shim layer between IP and TCP (or something else), which adds a header overhead between 20–28 bytes.
  • Simulated using NS-2 to evaluate various attacks. Compared to other systems, which put more state in the core.
  • Experiment on a denial of edge services attack. As the number of simulated senders increases, the file transfer time remains constant, unlike Fair Queuing where it increases; TVA+ and StopIt are faster (but less scalable).
  • Experiment on a denial of network services attack. Looked at ratio of average user throughput to average attacker throughput. NetFence achieves fairness.
  • Q. Do you distinguish good and bad users? No, we used AIMD to achieve fairness instead.
  • Q. How can you separate a flash crowd from malicious traffic? We don’t, treating extreme congestion the same way as an attack because it is a failure of end-to-end congestion control.
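
The policing loop at the access router can be pictured with a minimal sketch, assuming a leaky bucket whose drain rate is adjusted by AIMD on receiver feedback. The class and parameter names below are invented for illustration, not NetFence’s:

```python
# Hypothetical sketch of NetFence-style policing at the access router:
# additive increase on "up" feedback, multiplicative decrease on "down".
# Per the talk, the absence of up-feedback is treated as down-feedback.

class SenderPolicer:
    def __init__(self, init_rate=1.0, alpha=0.1, beta=0.5):
        self.rate = init_rate    # leaky-bucket drain rate (e.g. Mbps)
        self.alpha = alpha       # additive increase step
        self.beta = beta         # multiplicative decrease factor

    def on_feedback(self, action):
        if action == "up":
            self.rate += self.alpha     # additive increase
        else:                           # "down", or no up-feedback seen
            self.rate *= self.beta      # multiplicative decrease
        return self.rate

policer = SenderPolicer()
for fb in ["up", "up", "down", "up"]:
    print(fb, round(policer.on_feedback(fb), 3))
```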

ASTUTE: Detecting a Different Class of Traffic Anomalies

  • Network management is used to ensure that customer SLAs, security policies, resource availability are maintained. Anomaly detection normally involves building a statistical model of normal traffic and defining an anomaly as a deviation from normal.
  • However, it is hard to obtain a model of “normal” traffic. Look at a time series of packet counts, and usually define a model baseline (tolerance) based on something like EWMA, and anomalies are anything outside that. However, training isn’t guaranteed to be anomaly-free.
  • Aim is to detect anomalies without having to define what is normal. Advantage is a simple tool that doesn’t have to perform training and is hence immune to data poisoning. It is accurate for a well-defined class of traffic anomalies, with theoretical guarantees on the false positive rates. However, its applicability is limited to when traffic characteristics don’t change.
  • Empirical properties: flow independence (although some weak correlation between flows), stationarity (time invariance over the timescales of a typical flow duration), and independence and stationarity => equilibrium.
  • ASTUTE = A Short-Timescale Uncorrelated Traffic Equilibrium. Between two consecutive time-bins, flow volume changes are zero-mean i.i.d.
  • Measure the number of flows, mean volume changes and variance of volume changes between consecutive time bins. Flag an alarm if the ASTUTE Assessment Value (AAV), calculated from these, is greater than some threshold (see the sketch after these notes).
  • The threshold controls the false positive rate. Appeal to the central limit theorem, so for a large number of flows, the AAV has a Gaussian distribution. False positive rate is just the area of the bell curve outside the threshold.
  • If ASTUTE is violated, at least one of the model assumptions is violated. For example, stationarity. Long bin sizes (one hour) lead to anomalies flagged when people arrive and leave at the beginning and end of the day (daily bias). Short timescales see no bias at all.
  • Worked with flow traces from Internet2, GEANT2 and the Technicolor corporate network. Compared Kalman and Wavelet filters.
  • Small overlap between anomalies detected by ASTUTE and the other methods. ASTUTE finds different classes of anomalies: tends to be larger numbers of flows with fewer packets than the Kalman and Wavelet approaches.
  • Plotted classified anomalies in each network on a similar graph (#flows vs #packets per flow), and saw that ASTUTE is worse on DoS attacks, but better on prefix outages, link outages and gaps, and port scans.
  • Looked at the ROC curve to see the trade-off between false and true alarms. Kalman would need a much higher false positive rate to detect port scans. But ASTUTE would require a very high false positive rate to detect DoS attacks.
  • Q. Can you not detect large flows because the time windows are so short that they look i.i.d. over those time scales? If it has a small number of flows, it will look independent to ASTUTE. There is an analytical limit to how many flows you need before you can detect it (threshold-squared).
  • Q. Who cares about detecting correlated flows? ASTUTE is not only useful for anomaly detection. But the interesting thing is that it can identify things that the operator would not be aware of, like bugs in misbehaving applications.
  • Q. Do you have the ground truth that the DoS attacks are real DoS attacks? Yes, we have analyzed the data, and there were lots of SYN packets going to a single location, usually from a single IP.
  • Q. Is there a way to classify an anomaly or is it ad hoc? We started with visual inspection, but we developed a tool for this.
  • Q. If your traffic is skewed towards a few flows, does the CLT hold? The CLT and assumption that we have lots of flows is an assumption for normal behavior.
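
The detector itself is compact. A minimal sketch, assuming the AAV is the mean flow-volume change scaled by the square root of the number of flows and normalized by the standard deviation, so that under the equilibrium it is approximately standard normal:

```python
import math

def aav(changes):
    """ASTUTE assessment value for one pair of consecutive bins.
    changes[i] = volume change of flow i between the bins. Sketch assumes
    AAV = mean * sqrt(F) / stddev, so that under the ASTUTE equilibrium
    (zero-mean i.i.d. changes, many flows) AAV is roughly N(0, 1)."""
    F = len(changes)
    mean = sum(changes) / F
    var = sum((d - mean) ** 2 for d in changes) / F
    return mean * math.sqrt(F) / math.sqrt(var)

def is_anomaly(changes, K=3.0):
    # The false-positive rate is the Gaussian tail mass beyond +/- K.
    return abs(aav(changes)) > K

print(is_anomaly([1, -2, 0, 3, -1, 0, 2, -3]))   # balanced changes: no alarm
print(is_anomaly([5, 4, 6, 5, 7, 5, 4, 6]))      # correlated surge: alarm
```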

NetShield: Massive Semantics-Based Vulnerability Signature Matching for High-Speed Networks

  • Maintaining network security is a grand challenge. Worms and botnets are widespread.
  • Talk concentrates on signature-based IDS. Normally, there is a database of signatures which is matched against each packet and used to generate alerts. This needs to be accurate and fast.
  • State of the art is regex-based. Used in Cisco IPS, Juniper IDS and Bro (open-source). It can efficiently match multiple signatures simultaneously using an NDFA, and can describe the syntactic context. But the expressive power is limited, and it cannot describe the semantic context. This leads to inaccuracy.
  • Other state of the art is vulnerability signatures for host-based IDS. It directly describes the semantic context and is very expressive (able to describe the vulnerability exactly). It’s accurate, but slow, using sequential matching and requiring protocol parsing.
  • Vulnerability signature matching requires parsing, matching and combining. Since the protocol grammar is context-sensitive, it cannot be handled by a regex, and building one would be practically difficult anyway.
  • Also a regex assumes a single input, so it cannot help with the combining phase.
  • So regex approaches cannot be used to match vulnerability signatures.
  • First challenge: matching thousands of vulnerability signatures simultaneously. Second challenge: parse protocols quickly. Solution achieves 10G throughput with an efficient matching algorithm and a tailored parsing design for high-speed matching.
  • Basically, a vulnerability signature uses a sequence of protocol data units (PDUs) with one predicate per PDU. PDU could be something like the HTTP version or the method. Need numbers and strings, number operators (comparisons) and string operators (equality, regex matching and length)
  • Given n signatures defined on k matching dimensions, a matcher is a two-tuple (field, operation) or a four-tuple for associative array elements. This leads to an n-by-k table. A table representation admits the possibility of matching multiple signatures simultaneously. Table looks like an associative array, with lots of don’t-cares.
  • Worst case complexity is O((log n)^(k-1)) time or O(n^k) space. Based on the Snort and Cisco rulesets, which have selective matchers, the design actually gives O(k) time complexity.
  • Iterative matching algorithm on the columns, based on intersecting relevant rulesets with special treatment for don’t-cares (see the sketch after these notes).
  • Merging requires k-1 iterations. Worst case merge complexity is O(n), but for real-world rulesets it is more like O(1).
  • For high-speed parsing, compare tree-based and streaming parsers. Streaming parsers can retain only signature-related fields. Built an automated parser generator that builds a parsing state machine for the protocol.
  • Implemented in 10kloc of C++ and 3kloc of Python. Evaluated on 26GB traces from Tsinghua University, Northwestern and DARPA. Run on a P4 3.8GHz with 4GB of RAM. For HTTP, 794 vulnerability signatures, and WINRPC 45 vulnerability signatures. Speedup ratio compared to Binpac is around 11x for non-HTTP and 3–4x for HTTP. Maintained throughput of 2.63 (HTTP in the university) to 17.6 (HTTP at Northwestern) Gbps for parsing and matching. Multicore gives a speedup.
  • Tool available online.
  • Q. Can you go into more details about the memory overhead? DFA requires 5.29GB for 973 Snort rules, whereas NetShield requires 2.3MB. The XFA paper showed 863 rules in 1.08MB. NetShield could improve by implementing XFA.
  • Q. Is it possible to do the massive matching using GPUs? Currently, most connections are independent, so yes probably.
  • Q. Do your scalability results not show that you require a clock cycle per bit? We only have to look at the bits in the signature.
  • Q. What are the advantages of your scheme with respect to XFA? Accuracy: XFA has limited accuracy and would produce false positives.
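
The candidate-selection idea behind the matching can be sketched as a column-by-column intersection over a signature table with don’t-cares. The rules and data structures below are invented for illustration; the paper’s algorithm is considerably more refined (associative-array fields, priorities, etc.):

```python
# Invented illustration of column-by-column candidate matching over a
# signature table with don't-cares.
DONT_CARE = "*"

# One matcher per field; a matcher is "*", an exact string, or a numeric
# comparison such as (">", 1000).
signatures = [
    {"method": "GET",  "version": "*",   "uri_len": (">", 1000)},
    {"method": "*",    "version": "1.0", "uri_len": (">", 2000)},
    {"method": "POST", "version": "*",   "uri_len": "*"},
]

def field_matches(matcher, value):
    if matcher == DONT_CARE:
        return True
    if isinstance(matcher, tuple):
        op, bound = matcher
        return value > bound if op == ">" else value < bound
    return matcher == value

def match(pdu_fields):
    # Start with every rule as a candidate, then intersect field by field;
    # don't-cares keep a rule in the candidate set.
    candidates = set(range(len(signatures)))
    for field, value in pdu_fields.items():
        candidates = {i for i in candidates
                      if field_matches(signatures[i][field], value)}
    return candidates

print(match({"method": "GET", "version": "1.0", "uri_len": 2500}))  # {0, 1}
```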

Network Architecture and Operations

R3: Resilient Routing Reconfiguration

  • Failures are common, but today’s emerging applications impose a stringent requirement of network reliability. Plus SLA violations may impact an ISP’s revenue. Aim is to recover quickly from a single or multiple overlapping failures.
  • In a 500-link network, the number of failure scenarios with up to three failed links exceeds 20 million. So it is difficult to optimize routing to avoid congestion under all possible failure scenarios.
  • Existing approaches focus exclusively on reachability. But these may lead to congestion and unpredictable performance. Some existing approaches consider only a small subset of failures, or optimize routing after failures, but this is too little, too late.
  • R3 requires no enumeration of failure scenarios, is provably congestion-free, efficient in terms of storage overhead, and flexible to diverse requirements.
  • Represent the network as a graph, with a capacity on each link and a set of traffic demands. Output of R3 is a base routing and a protection routing. Protection routing is a fast rerouting defined for every link that might fail.
  • Idea is to transform topology uncertainty into traffic uncertainty. Routing is optimized for the set of traffic demands on the original topology. Consider the amount of load that is shifted to other links when a failure occurs. If the routing is congestion-free, rerouted traffic is less than capacity.
  • R3 has two phases. First, offline precomputation which minimizes congestion for original demand plus rerouting virtual demand on the original topology. The protection routing may use routes that later fail. Solve using a linear programming technique.
  • After a link fails, convert the protection routing for that link into a valid routing that doesn’t use any other failed links (see the sketch after these notes). After the failure, the protection routing needs to be reconfigured, which uses the computed rerouting.
  • Offline precomputation and online reconfiguration are sufficient to get congestion-free routing. Whether it is optimal for more than one link failure is an open problem. The reconfiguration is order independent, which enables distributed reconfiguration.
  • Some extensions: fixed base routing, trade-off between no-failure and failure protection to bound the no-failure performance, trade-off link utilization and end-to-end delay, prioritized traffic protection, realistic failure scenarios (shared-risk and maintenance link groups), and traffic variations.
  • Evaluated on two real networks and a synthetic topology. Compared to various rerouting schemes. Added R3 to OSPF and MPLS-ff. Looked for maximum link utilization.
  • For a single failure, R3 achieves near optimal performance. Under multiple failures, it is at least 50% better than other schemes.
  • Implemented for Linux and Linux MPLS. Emulated the Abilene topology on Emulab. 3 physical link failures simulated. Outperforms OSPF+recon by a factor of around 3.
  • Profiled the precomputation time: less than 36 minutes for each topology and less than 17 minutes for non-generated topologies.
  • Storage overhead is < 300KB in the FIB, and < 20MB in the RIB.
  • Q. Have you looked at how to redistribute traffic after a link returns? Have a reconfiguration rule for failure recovery. It will revert back to the last failure scenario, but the ordering may be different (this is provably alright).
  • Q. Have you looked at the overhead of announcements during convergence under churn? No packets will be dropped during this case.
  • Q. How does your algorithm cope with network partition? Studied this in the paper. In this case, we cannot have reachability, so we cannot have congestion-freedom. R3 will ignore the demands that it cannot fulfill.
  • Q. How does your approach compare against oblivious routing schemes (such as Valiant load balancing)? These don’t usually handle a large number of failures, and big ISPs normally see a larger number of failures than that.
  • Q. How do you evaluate traffic prioritization? Got 10 different priority classes from a US ISP, and showed that IP traffic gets sacrificed to protect VPN traffic.
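
A toy sketch of the online step, under my reading of the talk (not the paper’s notation): a protection routing for link e is precomputed before anyone knows which links will fail, so it may place some fraction of e’s virtual demand back on e itself; when e actually fails, that self-allocation is redistributed proportionally over the remaining links.

```python
# Toy sketch of activating a protection routing after link e fails.
# r_e maps each link to the fraction of e's traffic placed there.

def activate_protection(r_e, failed_link):
    self_frac = r_e.get(failed_link, 0.0)
    assert self_frac < 1.0, "protection routing cannot rely entirely on e"
    scale = 1.0 / (1.0 - self_frac)
    return {l: f * scale for l, f in r_e.items() if l != failed_link}

# 20% of link e's protection demand was (virtually) left on e itself.
r_e = {"e": 0.2, "a": 0.5, "b": 0.3}
print(activate_protection(r_e, "e"))   # {'a': 0.625, 'b': 0.375}
```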

Mercury: Detecting the Performance Impact of Network Upgrades

  • Networks are becoming more complex and diverse. Software and hardware are both becoming more complex. This makes things more sensitive to network glitches or other performance issues. Purpose is to see whether a change makes the network perform better or worse.
  • Normal intuition is that an upgrade will make things better, but complex interactions can lead to unintended consequences. So it is important to monitor the impact of upgrades. This is hard due to the scale and diversity of different devices. So the challenge is to efficiently monitor at scale.
  • Mercury does automated data mining to extract trends, scales across a large number of measurements and flexibly across data sources, and is easy to interpret. Challenge is how to know when an upgrade happens, what their effect on performance is, and to find common factors in who is affected (or it is network-wide).
  • Could drive upgrade detection from the change management system, but since human information is unreliable, instead mine the configuration and workflow logs. Things like OS version and firmware upgrades are easy to track. However, lots of small configuration changes are not related to upgrades (such as customer provisioning). Out-of-the-ordinary changes are ones that are applied to multiple locations in the network, but rarely.
  • Divide event series (SNMP etc.) into equal time-bins to get a time series. Behavior change detection is based on a persistent shift in levels. A recursive rank-based cumulative sum is used on means, medians, standard deviations or distributions (see the sketch after these notes).
  • Identifying commonality (of attributes, configurations, etc.) is a machine learning problem (search in a multi-dimensional space). Use the RIPPER rule learner for this.
  • Sometimes aggregation will erroneously amplify rare events. Solution is to time-align each upgrade to each device (as if the upgrade happened at the same time).
  • Evaluated using close interaction with network operators. Used data sets about router configurations and workflow logs, and performance event series: SNMP and syslogs. Collected this data from a tier-1 ISP over 6 months. 988 routers in the study. Categories of router: core, aggregate, access, route reflector and hub.
  • Upgrade detection evaluated for false positives and false negatives. Threshold varied (frequency of change). Tends to see more false positives than false negatives, but these can be filtered.
  • Mercury reduces the number of upgrade-induced change points that the operator must look at by several orders of magnitude, compared to number of syslog entries. It confirmed the earlier operator findings and showed some unknown to the operator.
  • OS upgrades could cause CPU utilization to go down on access routers, but increases in memory utilization on aggregate routers (larger OS image). Varying changes in the number of layer-1 link flaps. More protection switching events.
  • Firmware upgrades could cause less CPU utilization on the central and customer-facing routers’ CPUs.
  • Protection switching is line-card protection in customer-facing routers. Failover for the access router that customers connect to. Saw a small increase in the frequency of automated PS events. Time alignment was able to show this problem.
  • Q. Have you thought about the inverse problem where your triggers are the alarms of an anomaly detector, and you want to find the root causes? Problem with that is false alarms. With better anomaly detectors, this might become feasible.
  • Q. What is the time horizon of the attribute changes that you consider? We do persistent change detection, so look at daily averages over a history of about 6 months. We are now looking at whether transient things do matter (for the purpose of meeting SLAs, etc.).
  • Q. Do you monitor link capacity in your system? Currently only look at aggregate router statistics, not particular links/interfaces. We are starting to look into that.
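
To illustrate the flavor of the level-shift detection, here is a rough rank-based cumulative-sum test on a daily series. The statistic and threshold are illustrative only, not the paper’s exact recursive procedure:

```python
import math

def rank_cusum_changepoint(series):
    """Return the index with the largest rank-CUSUM excursion, or None."""
    n = len(series)
    order = sorted(range(n), key=lambda i: series[i])
    rank_of = [0] * n
    for r, i in enumerate(order):
        rank_of[i] = r + 1
    mean_rank = (n + 1) / 2
    s, best_i, best_mag = 0.0, None, 0.0
    for i in range(n):
        s += rank_of[i] - mean_rank     # deviation of rank from its mean
        if abs(s) > best_mag:
            best_mag, best_i = abs(s), i
    # A persistent level shift produces a large excursion near the change.
    threshold = n * math.sqrt(n) / 4    # illustrative threshold only
    return best_i if best_mag > threshold else None

cpu = [30, 31, 29, 30, 32, 31, 45, 46, 44, 47, 45, 46]  # upgrade after day 5
print(rank_cusum_changepoint(cpu))     # -> 5 (last sample before the shift)
```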

California Fault Lines: Understanding the Causes and Impact of Network Failures

  • Most network failures are not catastrophic. But it’s difficult to collect comprehensive failure data. Lightweight techniques are limited, and special-purpose monitoring is expensive.
  • Contributions: a methodology to reconstruct the failure history of a network using only commonly-available data. Basically a time series of layer-3 failure events. Preferably annotated with the cause and impact of the failure. Data source for this is the syslog and the router configuration files in a version control system.
  • But this data is not intended for failure reconstruction. First rebuild the topology from the configuration file, then replay syslog messages. We also have semi-structured data from the maintenance logs.
  • Looked at CENIC network with 200 routers and 5 years of data (California academic network).
  • Limitations: syslog is sent using UDP, which leads to message loss. We might see a series of log messages containing a DOWN followed by another DOWN, so just ignore messages until we get back on track (see the sketch after these notes). Selection bias in the operational announcements.
  • Comprehensiveness: treat the operational announcements as ground truth and see how many of them have corresponding syslog messages. 97% of announcements were confirmed by the syslog.
  • Accuracy: used the Skitter project, which does frequent traceroutes, to confirm that no packets went over down routers.
  • Validated down states using RouteViews (recorded BGP traffic) to track failure events.
  • 60% of failures last less than a minute, which inhibits detection or recovery. Turns out mostly to be flap events.
  • 7000 emails led to 3000 events. 28% of events are failures and 18% of observed failures are explained.
  • Failure causes: hardware, power, external, software, other and configuration. Hardware is the biggest cause of notices, but software is the biggest cause of failures (32% of failures). But almost 80% of software failures were due to scheduled changes.
  • Q. How are those failures distributed on the network? More at the backbone or on the edge? More downtime on the customer links and the high performance links than on the backbone.
  • Q. Does what you show reveal more about the impact than simply tracking the control plane? It’s hard to know what the actual impact is, since we don’t collect that information. What other sources of information do we need on top of routing information? If we understood link utilization then we could see how links were being strained by these events.
  • Q. Are you saying that software upgrades are a dominant cause of failures? Not dominant, but serious. The UP/DOWN messages are a side-effect of the maintenance activity? Might be interesting to look at this.
  • Q. Do you see many concurrent failures? More details about this in the paper.
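
The reconstruction step can be sketched as replaying UP/DOWN syslog messages into per-link failure intervals, restarting an interval when a duplicate DOWN betrays a lost UP message. The message format here is invented for illustration:

```python
def failure_intervals(messages):
    """messages: time-sorted (timestamp, link, state) tuples, state UP/DOWN."""
    down_since, intervals = {}, []
    for ts, link, state in messages:
        if state == "DOWN":
            # A second DOWN without an intervening UP means we lost the UP
            # (syslog over UDP): drop the stale interval and restart here.
            down_since[link] = ts
        elif state == "UP" and link in down_since:
            intervals.append((link, down_since.pop(link), ts))
    return intervals

log = [(100, "ge-0/0/1", "DOWN"), (130, "ge-0/0/1", "UP"),
       (200, "ge-0/0/1", "DOWN"), (250, "ge-0/0/1", "DOWN"),  # UP was lost
       (260, "ge-0/0/1", "UP")]
print(failure_intervals(log))  # [('ge-0/0/1', 100, 130), ('ge-0/0/1', 250, 260)]
```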

Novel Technologies for Data Center Networks

c-Through: Part-Time Optics in Data Centers

  • Comparing optical circuit switching to electrical packet switching. Circuit switching vs. store and forward. Optical can do 320×100G, vs. 16×40G for electrical. But the optical switching time is about 10ms, compared to packet granularity.
  • Despite slow switching time, optical circuit switching is still promising. Full bisection bandwidth at packet granularity may not be necessary.
  • Looked at a hybrid packet/circuit switched architecture. PS for low latency and optical-CS for high capacity transfer. Optical paths are provisioned rack-to-rack.
  • Control plane needs to estimate traffic demand and configure optical circuit based on it. Data plane does traffic demuxing and optimizes circuit utilization (maybe).
  • c-Through is a specific design for this. A centralized controller manages circuit configuration. Applications and switches are not modified, and end hosts are leveraged for traffic management.
  • Enlarge socket buffers for applications to identify which flows are heavy and which are lightweight. This generates a per-rack demand vector. Applications are unmodified, and packets are buffered per-flow to avoid head of line blocking. This estimates traffic demand and pre-batches data to improve optical circuit utilization.
  • Traffic demand vectors are aggregated into a traffic matrix. Use Edmonds’ algorithm to compute the optimal configuration (a maximum weight matching problem; see the sketch after these notes). Then servers are notified. The control traffic overhead could be reduced.
  • Electrical and optical networks isolated using VLANs.
  • Traffic control on hosts, which makes end-hosts tag packets for the two VLANs accordingly.
  • Testbed with 16 servers, a hybrid network on a 48-port Ethernet switch. Optical switch is emulated using 4G links, whereas electrical network uses 100Mbps links. Optical circuit emulation: optical paths are only available when hosts are notified. There is a 10ms reconfiguration delay.
  • Evaluated TCP performance using dynamic bandwidth, overhead of traffic control and buffering effects. Also application performance (VM migration, MapReduce, MPI-FFT).
  • TCP exploits the dynamic bandwidth quickly. Throughput ramps up within 10ms. Throughput stabilizes within 100ms.
  • MapReduce performance. Since shuffling is independently transferred, it is amenable to batching. Sorted 10GB of random data, which took 800 seconds on an electrical network. With full bisection bandwidth, the performance is 135 seconds. As c-Through varies the buffer size limit, the best performance is 153 seconds, for 100MB buffers, which is close to ideal. As the reconfiguration interval is varied, can do it as infrequently as every 3 seconds, and the performance is 168s.
  • Ran Yahoo Gridmix benchmark, which contains 3 runs of 100 mixed jobs, such as web query, web scan and sorting. Uses 200GB of uncompressed data, and 50GB of compressed data. c-Through comes very close to the full bisection bandwidth network.
  • Q. Surprised by the claim that TCP works fine in this case, considering the multipath issues: would new protocols not be more appropriate? With this technique we didn’t see many things blow up.
  • Q. Do you think it could work if the fibre is cut, and how will it affect the network? Current system doesn’t take this into account, but since there is dynamic monitoring, we could detect that and handle it.
  • Q. Won’t you have to reconfigure faster to catch short, bursty flows, and then isn’t there a risk of oscillations? Didn’t see that in our experiments.
  • Q. What is the cost of these optical technologies, and are they practical today? Expensive fixed cost, but the per-port marginal cost is not so high, which makes it competitive. A mature technology that is already on the market.

Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers

  • Talk is about combining electrical packet switches (ePSs) and optical circuit switches (oCSs) in a data center network. Both cost $500/port. But ePS is limited to about 1G or 10G (maybe 40G or 100G in future), and oCS is rate-free. ePS uses 12W/port and oCS uses 240mW/port. Finally the oCS doesn’t require a transceiver, which costs another watt per port. But the downside of oCS is the 12ms switching time. ePS suited to bursty, uniform traffic, whereas oCS suitable for stable, pair-wise traffic.
  • Switching delay is due to mirrors on motors that must be repositioned to switch the circuit. This simply gives a full crossbar circuit switch which does not decode packets and needs an external scheduler.
  • Wavelength division multiplexing uses one wavelength for a channel. WDM mux and demux are used on the electrical packet switch transceivers.
  • Need stability to be increased, using aggregation. Processes are more likely to communicate than threads; racks are more likely to communicate than servers; data centers are more likely to communicate than pods. Sweet spot is modular data centers.
  • With 64 pods, with 1024 hosts per pod, with a 10% electrical network (10:1 oversubscribed), need $6.3M, 96.5kW and 6656 cables. With a 100% electrical example, it would cost $62.2M, use 950kW and need 65,536 cables. Helios costs $22.1M, uses 157.2kW and needs 14016 cables.
  • Optical switch has a simple software agent, and the intelligence is in the centralized topology manager. Control loop estimates the traffic demand (hard to do), computes the optimal topology for maximum throughput, and then configure the pod and circuit switches.
  • Estimate: will this flow use more bandwidth if we give it more capacity (is it an elephant flow or a mouse flow)? However the results are biased by the current topology. So use the Hedera algorithm (NSDI 2010), which assumes all hosts are connected to an ideal crossbar switch, then computes the max-min fair bandwidth fixpoint (see the sketch after these notes).
  • The optimal topology is computed as a max-weight perfect matching on a bipartite graph, using Edmonds’ algorithm.
  • Testbed used two networks: a traditional one and a Helios network. 100% bisection bandwidth is 240Gb/s. Used 26 servers, and various switches including an optical circuit switch.
  • Ran Hadoop on this network, but didn’t get good numbers because the network was massively overprovisioned.
  • Got 190Gb/s peak and 171Gb/s on average on the traditional network, with drops due to hash collisions. The 50Gb/s difference from the full bisection bandwidth is the TCP overhead.
  • Helios got 160Gb/s peak and 43Gb/s average. Due to some quirks of the packet switched routers, such as port debouncing (which prevents false positives on ports coming up), reconfiguration performed poorly. Turning that off got the average up to 87Gb/s. Turning off EDC got a 142Gb/s average. Remaining overhead is a limitation in the software. Still have 27ms gaps, due to some switching delay.
  • Helios used unidirectional circuits but there are bidirectional circuits as well. Unidirectional doesn’t waste bandwidth on the return path, which leads to a daisy chain topology.
  • First paper to demonstrate WDM in a hybrid electrical/optical network.
  • Q. Have you thought about how the traffic demand estimation technique would work at lower levels (down to within a pod, a rack, a server, a process)? The Hedera demand estimator works on the level of TCP-flows, so we could do that. Would the bias you get become stronger? [Taken offline.]
  • Q. The number of electrical and optical switches you provision is an a priori design decision, so how would you address changing traffic patterns? The way around that is to build a hybrid electrical/optical switch.
  • Q. Have you thought about application-limited flows, where there is a bottleneck in the application that stops it using the additional bandwidth? Sensitive to the elephant flow classification. The whole pipeline depends on a good classification. Wouldn’t it be better to use OS modification (per c-Through)? Prefer not to modify the host.
  • Q. What would happen, if you didn’t have such short RTTs (such as in an aggregation network), to the end-to-end flows without buffering? It’s not clear that this would do so well (unmodified) between data centers, but the switching technology is well-suited.
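
The demand estimation step can be illustrated with a simplified max-min fair allocation on an ideal crossbar, in the spirit of the Hedera estimator the talk cites. This is textbook progressive filling, not the paper’s exact fixpoint iteration:

```python
# Max-min fair rates on an ideal crossbar where every host NIC has unit
# send and receive capacity.

def max_min_fair(flows, capacity=1.0):
    """flows: list of (src, dst) pairs. Returns {flow_index: rate}."""
    rates, active = {}, set(range(len(flows)))
    residual = {}
    for s, d in flows:
        residual.setdefault(("out", s), capacity)
        residual.setdefault(("in", d), capacity)

    def users(port):
        return [i for i in active
                if ("out", flows[i][0]) == port or ("in", flows[i][1]) == port]

    def share(port):
        u = users(port)
        return residual[port] / len(u) if u else float("inf")

    while active:
        port = min(residual, key=share)      # most constrained port
        rate = share(port)
        for i in users(port):                # flows bottlenecked here
            rates[i] = rate
            residual[("out", flows[i][0])] -= rate
            residual[("in", flows[i][1])] -= rate
            active.discard(i)
    return rates

print(max_min_fair([("a", "b"), ("a", "c"), ("d", "b")]))
# -> {0: 0.5, 1: 0.5, 2: 0.5}
```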

Scalable Flow-Based Networking with DIFANE

  • A scalable way to apply fine-grained policies in enterprises.
  • Want to support flexible policies, such as access control rules, customized routing (e.g. Skype calls on a low-latency path) and measurement (e.g. detailed HTTP traffic statistics).
  • Flow-based switches store their rules in a high-speed TCAM, and perform simple actions based on those rules. The flow space has at least five dimensions. Want to specify these in a high-level management system and enforce low-level rules in the TCAM. Want to support large numbers of hosts, switches and policies with limited TCAM space.
  • If you pre-install the rules in the switches, this is simple, but it doesn’t support host mobility and switches don’t have enough memory for all rules.
  • Alternatively (per Ethane, NOX) install the rules on demand, buffering the first packet while the rules are looked up in the controller. The first packet misses the rules, which adds switch complexity and creates a risk of DoS from senders generating many different packet headers.
  • DIFANE supports host mobility, reduces memory usage and keeps all packets in the data plane.
  • Stage 1: controller proactively generates rules and sends them to some authority switches. The flow space is partitioned between the authority switches.
  • Stage 2: authority switches keep packets in the data plane. When a packet is received, it is routed to the authority switch and sends feedback of the rules to cache. Subsequent packets hit the cache and are forwarded directly. There is no longer a race between updating the cache and forwarding subsequent packets.
  • A small set of coarse-grained wildcard rules is used to give the partition function for authority switches. Not a DHT, since wildcards are used in the rules.
  • A switch’s TCAM has cached rules, authority rules (if the switch is an authority switch) and partition rules (to route to an authority switch). Cached rules and authority rules are preferred over partition rules (see the sketch after these notes).
  • Switch prototype built with an OpenFlow switch.
  • Tricky to cache rules when wildcard rules may overlap (with different priorities). Therefore have to generate new rules based on contiguous subregions. Partition based on minimizing the TCAM entries in switches. Use a decision-tree-based rule partition algorithm to decide where to place the splits in the flow space.
  • Need to handle policy changes at the controller, topology changes at the switches and host mobility.
  • Evaluated prototype by implementing DIFANE in a kernel-level Click-based OpenFlow switch. Traffic generator, switches and controller run on separate 3GHz Xeons.
  • NOX sees a 10ms RTT delay for the first packet, but DIFANE sees a 0.4ms delay.
  • DIFANE can easily be implemented in hardware, whereas NOX requires more software intervention.
  • For peak throughput (one authority switch, single-packet flow), NOX hits an ingress switch bottleneck at 20Kflows/sec with one ingress switch, and then reaches a controller bottleneck with more ingress switches.
  • How many authority switches? Depends on number of rules. Campus network has 30K rules, which is assumed to be 160KB of TCAM memory. This leads to about 3 authority switches. An IPTV network with 5M rules requires 1.6MB of TCAM and would require 100 authority switches.
  • Tension between distributed (switch-based) and centralized (controller-based, easier to manage) operation. DIFANE is a point in between these extremes.
  • Q. How realistic are your assumed TCAM sizes? Already have 160 KB TCAMs, so we would just use more switches.
  • Q. If you have a slow path you can scale much better, so why do you want to keep everything on the fast path? [Taken offline.]
  • Q. Did you experiment with cache replacement policies? Much work done on how to cache rules, so we can just leverage that.
  • Q. What about the importance of dynamic rules that might change frequently, and how can DIFANE handle it? Think that only traffic engineering needs such dynamic rules. DIFANE can get the controller involved to manage these. But the performance gain is not much over OpenFlow in that scenario. Isn’t a benefit of OpenFlow that you can implement e.g. authentication at the application level? Yes, but we can get the controller to push this into the rules.
  • Q. Is there a cost to have all switches be authority switches? Depends on the network and how it is used. Why not make every switch an authority switch? May need more redirection, and hence more stretch. Also the rules will become smaller.
  • Q. Does this make internet traffic more unpredictable? A reasonable comment, but since we know the location of the authority switch, we know the paths that the traffic may take.

Social Networks

An Analysis of Social Network-Based Sybil Defenses

  • Many online services allow attackers to create accounts for free and they can hence manipulate the system.
  • Defense approaches: trusted certification (such as SSN or passport number), or resource challenges (e.g. cryptopuzzles, not hard to solve if you can get cloud resources on demand). Or can use links in the social network to detect Sybils, since we presume that attackers can only create a limited number of links to non-Sybil users. Spawned a lot of research.
  • Unanswered questions: since the schemes use different mechanisms, it is unclear how they are related, or whether there is a common insight across them. Answering this would help us understand the limitations of the defenses.
  • Talk proposes a new methodology for comparing these systems and finds that they all work in a similar manner. It implies that they have a hidden dependence on the network structure, which identifies the limitations of the schemes.
  • The interesting fact is how these schemes identify nodes as Sybils.
  • Schemes take a social network and a single trusted node, and declares Sybils from the perspective of the trusted node. Internally, each node has a Sybil probability, which gives each node a ranking of Sybilness. Can this ranking be used to compare schemes?
  • Compared rankings from each scheme from the same social graph. The ranking is jumbled between the different schemes. All schemes seemed to have a cut-off point where the partitions were (unordered) equalish.
  • The cut-off point comes at the boundary of the local community. So all schemes are effectively detecting communities. Nodes in the local community are ranked higher, but the ranking within and outwith the community are in no particular order. Can we then leverage the work on community detection to design new approaches?
  • Bad news: this depends on the graph having monolithic community structure, and the characteristics of the community around the trusted node.
  • Does this make certain network structures more vulnerable? Does having more communities make it harder to identify communities? Evaluated this on various real-world social networks. Simulated a Sybil attack by consistently adding Sybils (5% attack links and 25% Sybil nodes). Accuracy measured using ranking, i.e. the probability that Sybils will be ranked lower than non-Sybils. Compared amount of community structure (modularity) to the accuracy. Modularity seems to be negatively correlated with accuracy.
  • How can the attacker use this intuition? Can he do better than just choosing random links? For example, by placing links closer to the trusted node. Then the attacker could blend in to the community of the trusted node. Experiment ranks the nodes and gives the attacker the ability to place links randomly among the top N nodes. Smaller N implies an attacker with more control. Graph shows an attacker with more control will reduce the accuracy of the algorithms.
  • Moving forward: could be useful for whitelisting nodes, and could potentially incorporate information from more layers to make the decision about who is a Sybil.
  • Q. Have you evaluated where the number of Sybil nodes far exceeds the number of attack links? The results hold in those settings as well.
  • Q. Attacks are launched from compromised and fake accounts, so how do you deal with this? This violates the basic assumption that the attacker has few real links, so none of these schemes will work.
  • Q. What if the Sybils form multiple communities? No matter the Sybil topology, as long as the number of attack links is small, none of these schemes will work.

The Little Engine(s) That Could: Scaling Online Social Networks

  • Systems should be scalable, but it can be hard to implement and is not necessary at the start of an online service. Of course, this can lead to a success disaster. The cloud gives hardware scalability, but no automatic application scalability.
  • Frontend, stateless components are easy to make transparently scalable, but the data source is a bottleneck.
  • Obvious solution is full replication of the DB, but the state doesn’t decrease with the number of servers. However it maintains data locality.
  • Next most likely solution is horizontal partitioning/sharding, but the splits are disjoint, which is bad news for OSNs. The shards cannot be disjoint, because OSNs involve queries across social links, or data dissemination across social links. Presumably want to colocate all of your friends on the same server.
  • Relational databases don’t perform well under horizontal partitioning, and are expensive, so people use DHTs. These perform better but there is no SQL, less abstraction, and they suffer under high traffic (incest, multi-get hole, jitter). Also a DHT gives random partitioning, which means many servers will be hit with a particular update, and there is a high replication overhead.
  • Can leverage underlying social structure to make the partition. The SPAR (Social Partitioning And Replication) algorithm does this.
  • Algorithm has to be online (due to system and social network dynamics), fast and simple (using local information, a hill-climbing heuristic and back-pressure load balancing; see the sketch after these notes), stable (no cascades) and effective (approximates an NP-hard problem: minimize replicas while also maintaining a level of redundancy).
  • Evaluated on real OSN data (Twitter/Orkut/Facebook). Looked at various algorithms, including random partitioning, MO and METIS.
  • SPAR has a lower replication overhead than the other algorithms, with only 22% overhead over the replication constraint.
  • Three-tier application: front end and application logic are stateless on top, with SPAR middleware in the application logic and the data store (to intercept messages). The SPAR controller, partition manager and directory service coordinate the whole system. To applications, SPAR is totally transparent, implemented on top of MySQL and Cassandra, but could be implemented using other things.
  • Evaluated using a non-distributed Twitter clone (StatusNet) and real Twitter data, and saw if it could scale up across 16 commodity desktop machines. At the 99th-percentile latency target, MySQL with full replication managed 16 requests per second, whereas SPAR+MySQL does 2500 requests per second. Vanilla Cassandra does 200 req/s, whereas SPAR+Cassandra does 800 req/s.
  • Q. Can you replicate e.g. Facebook pictures based on the groups of friends? The rule is applied when processing the query itself, though some redundant data would be stored.
  • Q. Have you looked at incorporating more dynamic interaction behaviors in the partitioning algorithms? We have considered adding weights.
  • Q. Any thoughts on Diaspora? Only know what I read in the news and that it’s fully distributed, so don’t think there will be such a thing as a global data store.
  • Q. []? The more clustered you are, the less replication you will need. The results are consistent for large data sets.
  • Q. Would the replication overhead for Orkut not be higher? 12 or 16.
  • Q. Where is the notion of load per server? Would this not allocate servers that have absolutely no work to do? Details in paper.
  • Q. Are there not better designs than a read fan-out? Arguably.
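
A heavily simplified sketch of the hill-climbing step on a new social edge, ignoring load balancing and the k-redundancy constraint of the real algorithm. The data structures are invented for illustration:

```python
# Masters live on numbered servers; a replica of u is needed on any other
# server that hosts a neighbor of u (to keep queries local).

def replicas_needed(master, edges):
    reps = set()
    for u, v in edges:
        if master[u] != master[v]:
            reps.add((u, master[v]))   # replica of u beside v's master
            reps.add((v, master[u]))
    return len(reps)

def on_new_edge(master, edges, u, v):
    edges = edges + [(u, v)]
    best, best_cost = dict(master), replicas_needed(master, edges)
    for node, target in ((u, master[v]), (v, master[u])):
        trial = dict(master)
        trial[node] = target           # try moving one endpoint's master
        cost = replicas_needed(trial, edges)
        if cost < best_cost:           # greedy, local decision
            best, best_cost = trial, cost
    return best

master = {"a": 1, "b": 1, "c": 2}
print(on_new_edge(master, [("a", "b")], "b", "c"))  # c moves to server 1
```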

Crowdsourcing Service-Level Network Event Detection

  • Want to identify problems that affect end-to-end performance. Do this in an online way with reliable detection and isolation.
  • Idea is to do monitoring at the edge systems and detect drops in performance.
  • Need a system that is scalable, and has localization in time and space. Scalability from passive monitoring, and fully distributed detection. Also privacy, reliability from uncontrolled hosts and wide adoption (incentive to install).
  • Approach: passively monitor local performance information (signals), and detect drops in performance. Then attempt to get group corroboration from other hosts. A likelihood ratio distinguishes network effects from coincidence. Store the data in a distributed fashion, and give the operator a tap to get that data out.
  • Evaluated the approach using peer-to-peer applications (a natural fit). This gets us an edge trace. The dataset is from a plugin called Ono, which has been installed by 1 million BitTorrent users worldwide.
  • Case study on the BT Yahoo! network, which has information about confirmed events on its web interface. Gives the dates and times of the issue arising and having been fixed.
  • BitTorrent peers monitor many performance signals, both general and protocol specific (like Torrent availability). The individual signals are noisy, having uncontrolled duration and having a wide range of values. Use some moving-average smoothing to make this easier to interpret.
  • For group corroboration, ask why multiple local detections might occur at the same time: service-specific problems (e.g. lack of a seeder), coincidence (noisy local detection), or a genuine network problem. The probability of coincidence becomes very small with a large number of users. Can tune a likelihood ratio knob to make this more or less sensitive (see the sketch after these notes).
  • Evaluated in the wide-area. Don’t know the false positive or false negative rates, because ISPs wouldn’t provide information about when their network went down. Therefore use public information from BT Yahoo!, and do some work under NDA.
  • In one month of BT Yahoo! data, detected 181 events and 54 occur during confirmed events. There were 14 other reported problems. Remaining are not necessarily false positives.
  • Worked with a North American ISP under NDA. Detected 50% of events in regions with fewer than 10k subscribers.
  • Evaluated sensitivity to the likelihood ratio, detected problems 2% of the time for small moving average deviations and 0.75% of the time for larger deviations.
  • Deployed as the Network Early Warning System in 48k BitTorrent clients.
  • Q. This seems like the right approach since performance monitoring should be done at the application layer.
  • Q. Do you think that IP prefix or geolocation information would be useful for grouping? Depends on whether groupings are useful to help with the problem. Using IP prefix already.
  • Q. How are your techniques different from the earlier talks on anomaly detection? This is at the user, so the information that comes back is more useful. Why are you using moving averages compared to something more sophisticated? Wanted to implement it simply and get it incorporated in a BitTorrent client. Many schemes assume a long continuous stream of data.
  • Q. Once you have detected the events, what do you do with them? The idea is for operators to go and fetch this information. But there is a root cause analysis problem here, which is important future work in this area.
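
The two-step idea (local moving-average detection, then group corroboration) can be sketched as below. The corroboration here just computes how implausible the observed number of simultaneous detections would be under independent noise; all parameters are invented:

```python
from math import comb

def local_detect(samples, window=10, k=0.5):
    """Local detection: latest sample below k times its moving average."""
    history = samples[-window - 1:-1]
    return samples[-1] < k * (sum(history) / len(history))

def coincidence_prob(n_detecting, n_hosts, p_local=0.01):
    """How likely n_detecting simultaneous local detections among n_hosts
    would be if detections were just independent local noise."""
    return (comb(n_hosts, n_detecting) * p_local ** n_detecting
            * (1 - p_local) ** (n_hosts - n_detecting))

throughput = [500, 480, 510, 490, 505, 495, 500, 510, 490, 500, 120]
print(local_detect(throughput))            # True: a large local drop
print(coincidence_prob(6, 40) < 1e-5)      # True: too many to be chance
```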

SIGCOMM 2010: Day 2

Wednesday, September 1st, 2010

Privacy

Privacy-Preserving P2P Data Sharing with OneSwarm

  • Three types of data: private, public (non-sensitive) and public (without attribution). This talk is about the last one: want to download and share data without people knowing what you’re downloading or sharing.
  • P2P is good for privacy because there is no centralized control or trust, you retain rights to your data (instead of giving them to a third party) and no centralized third party knows everything you’re doing. But in P2P: anyone can monitor your behavior!
  • Previous solutions: Tor plus BitTorrent, but this needs a fraction of the clients to be public and gives poor performance. Or Freenet, which has poor bulk data performance and requires users to store others’ data. Median time to download a 1MB file with BitTorrent is 94s, but BT+Tor takes 589s and Freenet 1271s.
  • Implemented a OneSwarm client and released in March 2009… now hundreds of thousands of users. Based on social networks: share keys with your friends.
  • Searches are flooded through the network, and set up a data path along the successful route, which does not reveal who is the ultimate sender or provider.
  • Threat model: the attacker has a limited number of overlay nodes and can do anything on nodes he controls, including traffic injection/sniffing/correlation.
  • To support mobile peers, use a DHT to publish IP and port, however this is published, encrypted and signed separately for each peer. This makes it possible to remove a peer.
  • Sparse social networks are a problem: with only one friend, you have poor reliability, poor performance and privacy problems. This is bad for early adopters. Early adopters used a forum to share their public keys. Solution was to add a community server, as a source of untrusted peers.
  • With untrusted peers, delay responses to foil timing attacks, probabilistically forward queries and used deterministic random behavior to limit information leakage from repeated queries. Can trust/not trust peers on a per-object basis.
  • Want to have search without revealing the source and destination. The approach is based on flooding with delay, where searches are only forwarded using spare capacity, and delayed at each hop. Cancel messages move faster through the network. However, this gives no guarantee that all data can be found by all users at all times.
  • Search types: 160-bit content hash and text-based. For response delay, use a random reply delay seeded by the hash (if hash-based search). This is harder for text-based search, so delay based on the hash of the matching content (see the sketch after these notes).
  • Multipath connections are supported to avoid weak points in the network. Default is to cancel after getting 20 responses (not just one). Then the forwarding load is distributed.
  • Data transfer uses a modified version of BitTorrent, which handles multiple sources and swarming downloads naturally.
  • Timing attack is possible by monitoring the delay between search and the response, and inferring how many users could have sent the reply within the recorded time.
  • Evaluated potential of the timing attack using a synthetic OneSwarm overlay, based on 1.7 million last.fm users. Attackers use a public community server, and users with 26 or fewer friends take some untrusted friends as well. Design eliminates the attacker’s ability to pinpoint particular users.
  • Also evaluated performance using 120 PlanetLab nodes transferring a 20MB file. Median download time for OneSwarm is 173s (compared to 94s for BitTorrent, but much less than BT+Tor and Freenet). At the 70th percentile, OneSwarm and BT are similar (97s vs 190s), but BT+Tor and Freenet are much worse.
  • Multipath transfers increase average transfer rate from 29KB/s to 457KB/s.
  • Q. Would it be fair to classify this as Gnutella where the network is restricted to the social graph and searches are less flexible? Similar, but the key change is that the data flows over the path in the overlay (not directly), which makes it more privacy-preserving. Does this not give worse scalability than Gnutella, which had problems? Gnutella had problems when people were still using modems, and it is more viable to provide a few KB/s. The current overlay is not oversubscribing.
  • Q. What are you relying on to ensure that you don’t know where the data are coming from? Because you don’t know the topology. If you are next to an attacker, you rely on the delay that you add. Are you seeing a sum of random variables, which will leak more information as the path becomes longer? You could maybe estimate the hop-count but not pin-point nodes.
  • Q. Is a 20MB file too small for TCP to show realistic performance? Used this because we needed to experiment with Tor also, and we didn’t want to stress that network too much. For the Freenet experiment, we used a 5MB file and extrapolated from that because it was hard to get 20MB to download reliably.
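
The deterministic delay trick can be sketched in a few lines: seeding the delay with a per-node secret and the content hash means repeated queries for the same object always see the same delay, so an attacker cannot average the noise away over many probes. The constants below are invented:

```python
import hashlib

NODE_SECRET = b"per-node-random-secret"   # invented; chosen once per node
MAX_DELAY_MS = 150                        # invented constant

def response_delay_ms(content_hash: bytes) -> int:
    # Same (node, object) pair always yields the same delay, so repeated
    # probes for one object leak nothing new about proximity to the source.
    digest = hashlib.sha1(NODE_SECRET + content_hash).digest()
    return int.from_bytes(digest[:4], "big") % MAX_DELAY_MS

h = hashlib.sha1(b"some shared file").digest()
print(response_delay_ms(h), response_delay_ms(h))   # identical every time
```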

Differentially-Private Network Trace Analysis

  • Can you conduct network trace analysis that provides strict, formal, “differential privacy” guarantees?
  • Selected some representative traces and tried to reproduce the results using differential privacy.
  • It was possible to reproduce every analysis attempted, but there is a privacy/accuracy trade-off.
  • Toolkit and analyses are available online from the PINQ web site.
  • Access to realistic data is helpful for networking research, but there is a tension between utility and privacy. The requirements of utility are usually for aggregate statistics, whereas privacy requirements are typically for individual behavior.
  • Other approaches include: trace anonymization (doesn’t always work unless people are excessively conservative), code-to-data (send your analysis to the people who hold the data, but it is hard to know what that code is doing), or secure multi-party computation (similar to code-to-data). The aim here is to start with formal guarantees and see how useful it can be.
  • Differential privacy: the results don’t depend on the presence or absence of an individual record. Doesn’t prevent disclosure, but makes no assumptions about what the attackers can do, and is agnostic to data types.
  • Uses PINQ (Privacy Integrated Queries), a LINQ-based query platform that guarantees programs are differentially private.
  • Challenges: getting DP requires introducing noise, so you need to use statistically robust measurements. PINQ requires analyses to be written as high-level, declarative queries, which can require some creativity or reinterpretation. Also (not dealt with): masking a few packets does not mask a person, and the guarantees degrade more as a dataset is reused (policy question of how you mete out access to a dataset).
  • Example is worm fingerprinting. Group packets by payload, filter by the count of source IPs being over a threshold and the count of destination IPs being over another threshold. Can then count the number of worms, approximately. Need to supply epsilon to the count, which turns it into the differentially-private version (see the sketch after these notes).
  • Built some tools for analysis. For example, implemented three versions of a CDF. In doing this, you need to scale down the accuracy for each subquery in order to not degrade the dataset privacy too much.
  • Showed an example CDF. The differentially private one is not monotonic at the microscopic scale, but it gives a convincing macro-scale result.
  • Can also list frequently occurring strings, using an algorithm based on statistical properties of text, which gradually extends a prefix.
  • Extend worm fingerprinting: actually enumerate the payloads that have significant src/dest counts.
  • Also built more tools and analyses: did packet-level analyses, flow-level analyses and graph-level analyses. Sometimes had to compromise on privacy to get high accuracy (epsilon = 10 for weak privacy).
  • Many open questions. Perhaps the biggest is whether DP guarantees for packets are good enough. Or whether, if writing new analyses, they could be designed with DP in mind.
  • Q. Could extensions to PINQ apply to trace analysis that look for isolated events, such as network intrusions which are relatively rare? Can separate the two tasks: learning a rule or filter that could identify an intrusion (which could use DP), and apply that filter to individual packets (which could not use DP, because you effectively want to violate privacy at this point).
  • Q. Does someone need to hold onto the raw packets? Yes, like the code-to-data setting.
  • Q. In DP, each query may consume epsilon privacy and the provider must set a budget, so how do you set this? And what happens when the budget is exhausted? You could imagine turning off the dataset when the hard threshold is met. But this is really a policy question. Setting the budget is difficult: perhaps you can provide data outputs from a DP query to a large group who can then do useful work with it.
  • Q. Is there a trade-off between DP and the presence of correlations in the data? In a lot of cases, it is possible to restructure the data set to reduce the amount of correlation between individual records (by grouping the strongly correlated records together).
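
For flavor, here is the worm-fingerprinting count with a Laplace-noised release. PINQ expresses this as LINQ queries in C# and also protects the intermediate grouping and thresholding; this Python rendering noises only the final count:

```python
import math, random
from collections import defaultdict

def laplace(scale):
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def noisy_count(n, epsilon):
    # A count has sensitivity 1, so Lap(1/epsilon) noise gives epsilon-DP.
    return n + laplace(1.0 / epsilon)

packets = [("<payload-A>", "1.1.1.1", "2.2.2.2"),
           ("<payload-A>", "3.3.3.3", "4.4.4.4"),
           ("<payload-A>", "5.5.5.5", "6.6.6.6"),
           ("<payload-B>", "1.1.1.1", "2.2.2.2")]

srcs, dsts = defaultdict(set), defaultdict(set)
for payload, src, dst in packets:
    srcs[payload].add(src)
    dsts[payload].add(dst)

worm_like = [p for p in srcs if len(srcs[p]) >= 3 and len(dsts[p]) >= 3]
# True count is 1; the noise scale of 10 reflects the strong privacy of a
# small epsilon (the privacy/accuracy trade-off from the talk).
print(noisy_count(len(worm_like), epsilon=0.1))
```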

Encrypting the Internet

  • 50 million websites exist, but only about 600k of them enable SSL/TLS. Can we change the infrastructure to make all transactions protected and secure?
  • Main drawback is protocol processing speed and cost, due to public key crypto for handshaking and symmetric crypto for the data. 2 million clock cycles for RSA decrypt.
  • Main contribution is a CPU that is capable of encrypting packets at line rates, and getting a 4–12x speedup in AES and a 40% speedup in RSA.
  • Encrypting the internet is not securing it. Don’t deal with certificate/trust management, malicious software or privacy breaches at the end-host.
  • AES is a block cipher, based on the Rijndael algorithm. It uses 128-bit blocks and either 128-, 192- or 256-bit keys, taking 10, 12 or 14 rounds respectively.
  • AES uses confusion (inversion in GF(2^8) followed by an affine map), applied via the S-box. Then the bytes are permuted by shifting the rows of the state by varying amounts, and the columns of the state are mixed by matrix multiplication. These use many bit-linear operations, which are easy to implement in VLSI. Finally, the round key is added using XOR.
  • AES is typically implemented using table lookups, which are costly (approximately 15 cycles per byte), but the target is above 1Gb/s. So the aim is to implement the rounds in combinational logic, on the processor data path.
  • Added new instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST. Cache attacks are eliminated. Challenge is to implement this in as small a gate area as possible. Mathematical techniques such as composite fields help to achieve this in 100-400 gates. Total number of gates is similar to an adder or multiplier.
  • RSA requires performing a modular exponentiation, which can be implemented using modular multiplication. Implementing a faster multiplication algorithm in assembly achieved a 40% speedup over OpenSSL.
  • Also implemented the first version of combined encryption and authentication for TLS 1.2.
  • AES-NI round latency was 24 clocks in the first implementation, later reduced to 6 clocks; the throughput is 2 clocks per round.
  • Overall, can move from 501 SSL sessions/second to 1216 SSL sessions/second using AES-NI in Galois counter mode.
  • Now, one core can saturate a 1G link, and 8 cores can saturate a 10G link (see the back-of-envelope check after this list).
  • Future work is to improve larger RSA variants and implement the eventual SHA-3 algorithm.
  • Q. When you get that fast, how many good-quality random bits per second can you get? This work doesn’t address that, but all we need is an entropy source, per a 2004 paper. Not sure what the product groups are doing in this respect.
  • Q. Is Intel working on speeding up the RSA? The speedup presented in the paper (40%) is good enough to saturate the link.
  • Q. Could you expose the GF operations as primitives themselves? It is implemented in such a way that you can isolate the inversion of GFs or the multiplication. Algorithms in the SHA-3 competition also exploit similar primitives.
  • Q. How general are your optimizations in terms of other block ciphers? You can implement a variety of crypto algorithms using the primitives we have designed, including several cryptographic hash functions.
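
A quick sanity check of the single-core claim, using the round figures above. The 3GHz clock and the assumption that a parallelizable mode (CTR/GCM) keeps the round pipeline full are mine; the rest comes from the talk.

```python
# AES-128 runs 10 rounds; with a 2-clock throughput per round instruction,
# a parallelizable mode such as CTR/GCM can retire one 16-byte block
# roughly every 10 * 2 = 20 clocks once the pipeline is full.
clock_hz = 3.0e9           # assumed 3GHz core
cycles_per_block = 10 * 2  # rounds x per-round throughput (clocks)
gbps = clock_hz / cycles_per_block * 16 * 8 / 1e9
print(round(gbps, 1), "Gb/s of raw AES")  # ~19.2 Gb/s
# TLS record framing, GCM authentication and packet I/O eat into this,
# which is consistent with one core comfortably saturating a 1G link.
```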

Wireless LANs

Enabling Fine-Grained Channel Access in WLAN

  • 802.11n achieves about 45.2 Mbit/s at the application layer, which is much less than the advertised bitrate.
  • Overhead arises from various sources: CSMA, backoff, DIFS and SIFS, and ACKs. A simple model treats efficiency as the ratio of transmission time to total time for a packet. As the PHY data rate increases, the time spent transmitting data becomes small compared to all of these overheads. There is a static delay that cannot be reduced, constraining speedup.
  • Existing MAC is limited by allocating a whole channel to a single user at once. Aggregation is a possible solution, but you require large aggregation (23KB frames) to get 80% efficiency at 300Mbps. And this also increases latency.
  • Basic idea is to divide the channel into small, fine-grained slices. Directly reducing the channel width doesn’t work because of guard-band overhead. The approach is therefore to use orthogonal overlapping subchannels (OFDM).
  • If nodes are asynchronous, you lose orthogonality (i.e. if you have multiple users). The challenge is then to coordinate transmissions in random-access networks like WLANs. Time-domain backoff is very inefficient in this case.
  • Designed new PHY and MAC architectures: “FICA”.
  • M-RTS/M-CTS/DATA/ACK access sequence.
  • Carrier-sensing and broadcasting can be used to analyze the timing misalignment. A proper cyclic-prefix accommodates the timing misalignment: a long one for M-RTS and a short one for M-CTS/DATA/ACK.
  • For contention resolution, time-domain backoff is inefficient. Solution is to do frequency-domain contention with PHY signalling in the M-RTS/M-CTS symbols.
  • Frequency-domain backoff: reduce the number of subchannels to contend for if there is a collision, and increase it on success. This is analogous to congestion-control mechanisms. Two policies: “update to max” and AIMD (see the sketch after this list).
  • Implemented using the Sora software radio platform, based on a SoftWifi implementation.
  • Evaluated performance of the synchronization, the signalling reliability and the decoding performance.
  • Also showed simulation results for the performance gain over 802.11n, and showed an improvement in efficiency for both full aggregation (unrealistic) and a mixture of saturated and delay-sensitive traffic (realistic and with much greater benefits).
  • Q. How do you deal with the case when the number of sources exceeds the number of sub-carriers (in frequency-domain backoff)? Could you combine time and frequency? Yes, we could always do that.
  • Q. Is there a way of using this system with RTS/CTS? The overhead of these is so low (37us for an RTS) that it might not be worth doing.
  • Q. Why is the problem of asynchronous timing different from multipath fading? It can create arbitrarily bad interference with the FFT window that is done for OFDM.
  • Q. What happens if you compare your scheme to classic OFDM in terms of bits/seconds/Hz (considering delay due to synchronization)? There is a sweet point in symbol size that can meet your design goal.
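
A minimal sketch of the two frequency-domain backoff policies mentioned above. The function name and the halving constant are my own; FICA’s adaptation operates on the number of subchannels a node contends for in the M-RTS symbol.

```python
def frequency_domain_backoff(contended, total, collided, policy="aimd"):
    """Return how many subchannels to contend for in the next round.

    contended: subchannels contended for in the last round
    total:     subchannels available in the channel
    collided:  whether the last contention saw a collision
    """
    if collided:
        # Multiplicative decrease, mirroring congestion control.
        return max(1, contended // 2)
    if policy == "aimd":
        return min(total, contended + 1)   # additive increase
    return total                           # "update to max" policy

# Example: a node backs off after a collision, then probes back up.
n = 16
for collided in [False, False, True, False]:
    n = frequency_domain_backoff(n, total=16, collided=collided)
    print(n)  # 16, 16, 8, 9
```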

Predictable 802.11 Packet Delivery from Wireless Channel Measurements

  • 802.11 is fast (600Mbps), reliable (usable at vehicular speeds over extended range) and ubiquitous (cheap). But new applications, such as wireless displays or controller input, can stress the network.
  • In theory, performance should be easily measurable and used to guide channel rate selection. But the real world doesn’t always match the theory. So statistical adaptation is often used, but convergence time becomes a problem (especially as the measurement results change dynamically).
  • Goal is to bridge theory and practice, and accurately predict performance over real channels and devices.
  • Channel metric is the received signal strength indicator (RSSI) which, with noise, gives the SNR for a packet. However, this isn’t perfect, because it can vary by 10dB on a per packet basis. Different subchannels have different SNRs.
  • 802.11n provides a new opportunity: detailed channel measurements, which are used for advanced MIMO techniques. Get a Channel State Information (CSI) matrix for per-antenna paths.
  • Use the Effective SNR (the total useful power in a link) as opposed to the packet SNR (total power in the link).
  • CSI is measured on receive, so for every received frame we know the channel on all antennas and subcarriers used. From this, compute per-symbol SNRs, use textbook formulae to calculate per-symbol bit-error rates, average them to get an effective bit-error rate, and finally convert that back to an effective SNR (see the sketch after this list).
  • Every rate gets an effective SNR threshold, calculated offline per NIC implementation (not per NIC or per channel). This handles real NICs which may use interesting decoding techniques (hard/soft/maximum likelihood, etc.).
  • Application: what is the fastest configuration for a particular link? Select rate/MIMO/channel width based on the information.
  • Application: which antenna is the best to use to save power?
  • Application: what is the lowest transmit power at which I can support 100 Mbps?
  • Implemented in an Intel Wi-Fi Link 5300 NIC (3×3 MIMO, 450Mbps). Used two testbeds with over 200 widely varying links. Open-source Linux driver and used firmware debug mode to send CSI to the receiving host. Real-time computation took 4us per 3×3 CSI.
  • For predicting optimal 3×3 rate: effective SNR is much closer to the ground truth than packet SNR.
  • To evaluate rate control, used channel simulation on a mobile trace using MATLAB and the SoftRate GNU Radio. Effective SNR gets a better average delivered rate than SampleRate, SoftRate and SampleRate with fixed retry (802.11a algorithms).
  • Effective SNR extends to MIMO. Compared to optimal and an invented algorithm called “previous-OPT”. Effective SNR gets 80% accuracy and 10% overselection.
  • Related work: SoftRate, AccuRate and EEC (from yesterday). All work with 802.11a but don’t extend to more recent techniques.
  • Q. If you had CSI and it’s quick, do you need to do all of these things? The RSSI has a lot of error, and we were able to make this work.
  • Q. Is the debug mode on the NIC publicly available? Yes, I think so.
  • Q. Would a better comparison be to other techniques that use scheduled MACs? Trying to do something that works with what we have.
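
A sketch of the effective-SNR pipeline described above, for the simplest case of BPSK on a single stream. The textbook relation BER = Q(sqrt(2·SNR)) for BPSK is standard; the bisection inversion and the uniform averaging over subcarriers are my simplifications of what the paper does per modulation and per MIMO configuration.

```python
import math

def q_function(x):
    # Gaussian tail probability Q(x), via the complementary error function.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def effective_snr_db(subcarrier_snrs_db):
    """Per-subcarrier SNR -> per-subcarrier BER (BPSK: Q(sqrt(2*snr))),
    average to an effective BER, then invert back to an SNR."""
    bers = [q_function(math.sqrt(2.0 * 10 ** (s / 10.0)))
            for s in subcarrier_snrs_db]
    eff_ber = sum(bers) / len(bers)
    lo, hi = 0.0, 1e4  # invert BER(snr) by bisection (BER is decreasing)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if q_function(math.sqrt(2.0 * mid)) > eff_ber:
            lo = mid
        else:
            hi = mid
    return 10.0 * math.log10(max(lo, 1e-12))

# One deeply faded subcarrier drags the effective SNR well below the
# average SNR, which is why packet-level SNR is a poor predictor.
print(effective_snr_db([20.0, 20.0, 3.0, 20.0]))  # ~5dB, not ~16dB
```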

SourceSync: A Distributed Architecture for Sender Diversity

  • Receiver diversity underlies many systems, such as opportunistic routing protocols, and WLAN diversity protocols. In opportunistic routing, let any router that receives the packet forward it. If multiple routers/channels with different loss rates, the loss probability is now the joint probability of losing the packet on all channels.
  • Sender diversity is the converse of receiver diversity. If many senders transmit simultaneously, it is unlikely that they will all be attenuated at the same time. It provides analogous benefits to receiver diversity. For example, connect many APs to wired Ethernet, and let many of them broadcast a packet simultaneously to the client.
  • Challenge: simultaneous transmissions don’t strengthen each other, because they are likely to be out of sync. Need distributed symbol-level synchronization.
  • An 802.11 symbol takes 3.2us. With a synchronization error of 2us, the best you can get is an SNR of 2dB; a 1us synchronization error gives 5dB. But for the maximum bit rate, 802.11 needs an SNR of ~22dB.
  • Implemented the system, SourceSync, for an FPGA. Talking about opportunistic routing, but applies also to WLANs.
  • Natural way to synchronize transmitters is by reception. But since multiple paths have different delays, need to compensate for these differences.
  • Path delay is made up of propagation delay and packet detection delay (typically needing multiple samples to detect a symbol). Then a turnaround time between receipt and transmission.
  • Packet detection delay arises because receivers detect packets using correlation: random noise can cause a receiver not to detect a packet on the first sample, and since routers see different random noise, they may take different numbers of samples. Routers can estimate this delay from the phase shift across subcarriers.
  • Hardware turnaround time is hardware dependent, caused by different hardware pipelines and radio frontends. Routers locally calibrate this using their clocks.
  • Propagation delay is measured by probe-response between node pairs. A knows its packet detection delay, B knows its packet detection delay, and so the propagation delay can be computed from the RTT (see the sketch after this list).
  • Challenge: can nodes synchronize using carrier sense? Transmission from one of the joint senders triggers the other senders. All nodes use CSMA, so one of the nodes wins contention and begins transmitting; other nodes join in if they have the data.
  • The lead sender adds a sync header to the packet and a known fixed gap to all co-senders to join after the gap. Co-sender listens, turns around from receive to transmit, waits for a compensating delay, and sends the data.
  • Implemented in an FPGA of the WiGLAN radio. Built a testbed with a variety of line-of-sight and non-line-of-sight locations.
  • Evaluated: randomly pick a pair of nodes to transmit, and measure the synchronization error. 90th percentile of synchronization error: 20ns at 5dB SNR, and as little as 1ns at 25dB SNR.
  • Can SourceSync achieve sender diversity gains? Two nodes transmit simultaneously to a receiver (again). Check that two channels have different OFDM subchannel SNRs (they do, in the example) and that SourceSync achieves higher SNR in all subchannels.
  • Compare using the best single access point to using SourceSync, with two senders and a client. Repeat for all locations. SourceSync gives a median throughput gain of 57%.
  • Compared with opportunistic routing. Single path does worst. ExOR does better. SourceSync + ExOR does best (doubled median throughput over single path, and 45% improvement over ExOR alone).
  • Q. Did you consider just increasing the power of a single AP instead of sending with multiple APs? There is a fundamental gain here: the SNR profile is different for different routers, and combining across multiple senders gets rid of these deep losses.
  • Q. Should there be more components to the calculation of delay, based on the RTT? The nice part about this technique is that channel access delay doesn’t affect us, because we use carrier sense for telling when to transmit.
  • Q. Why did you not compare the performance of your scheme to MIMO? Sender diversity is orthogonal to MIMO and could improve its performance.
  • Q. Is your synchronization header long enough to account for nodes being very distant? Actually, it’s the gap after the header that has to be long enough. It’s a simple system-level parameter.
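
A sketch of the delay bookkeeping described above. The decomposition into propagation, detection and turnaround components is from the talk; the exact accounting in the probe/response exchange is my assumption.

```python
def propagation_delay(rtt_ns, det_a_ns, det_b_ns, turnaround_b_ns):
    """Estimate one-way propagation delay between A and B from a
    probe/response: the RTT contains two propagation delays plus both
    nodes' packet detection delays and B's receive-to-transmit
    turnaround (a sketch of the accounting; details in the paper)."""
    return (rtt_ns - det_a_ns - det_b_ns - turnaround_b_ns) / 2.0

def compensating_delay(all_path_delays_ns, my_path_delay_ns):
    """A co-sender on a fast path waits out the difference to the
    slowest path, so all joint transmissions align at the receiver."""
    return max(all_path_delays_ns) - my_path_delay_ns

# Example: three co-senders whose total path delays differ by tens of ns.
delays = [120.0, 95.0, 140.0]
print([compensating_delay(delays, d) for d in delays])  # [20.0, 45.0, 0.0]
```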

Novel Implementations of Network Components

SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware

  • […]
  • Existing approaches involve developing custom software, custom hardware or programmable hardware.
  • Platform header: a hash value for custom forwarding, a bitmap for what preprocessor should execute on the packet, a forwarding mode (including longest prefix matching or an exact match; also able to throw a software exception) and the virtual data plane ID.
  • Virtual data plane has its own preprocessing, lookup and post-processing stages: they operate in isolation.
  • Preprocessing stage: select processing functions from a library of modules (such as path splicing, IPv6 and OpenFlow). Also hashing: the operator indicates which bits in the header should be incorporated into the packet-header hash that determines how the packet is forwarded (can include up to 256 bits from the header; see the sketch after this list).
  • Can do OpenFlow, where forwarding decisions are made on a 13-tuple (240 bits), which SwitchBlade hashes for custom forwarding to be done.
  • Modules are implemented in Verilog. Preprocessing and postprocessing modules extract the bits for lookup.
  • Forwarding stage: perform output port lookup based on mode bits. A software exception can be thrown and the packet redirected to the CPU. Could do hardware-accelerated virtual routers in software.
  • Implemented on NetFPGA.
  • Evaluated for resource utilization and packet forwarding overhead. Compared to a baseline implementation on NetFPGA. There is minimal resource overhead and no packet forwarding overhead.
  • Evaluated on a three-node topology.
  • SwitchBlade uses 13 million gates to get four data planes; other implementations (IPv4, splicing, OpenFlow) have one data plane and use 8 to 12 million gates.
  • No additional forwarding overhead compared to the reference implementation.
  • SwitchBlade is a programmable hardware platform with customizable parallel data planes. Provides isolation using rate limiters and fixed forwarding tables.
  • Q. How do you scale the performance beyond tens of Gbps? That is an artifact of the NetFPGA implementation, which uses 4×1G ports. A later one will have 4×10G.
  • Q. Doesn’t the next paper show that it is possible to do all this in software? Things like Click are limited by packet copying overhead, so you are limited by the bandwidth of the PCI bus.
  • Q. What kind of hash function do you use and do different applications require different properties? We use a collision-resistant hash.
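
A software sketch of the platform-header hashing described above. SwitchBlade does this in Verilog on the NetFPGA; here, hypothetically, the operator’s bit selection is a byte mask over the first 32 header bytes (256 bits), and SHA-256 stands in for whatever collision-resistant hash the hardware uses.

```python
import hashlib
import struct

def custom_forwarding_index(header: bytes, mask: bytes, table_size: int) -> int:
    """Select the operator-specified header bits (up to 256), hash them,
    and reduce the digest to an index into the custom forwarding table."""
    selected = bytes(h & m for h, m in zip(header[:32], mask[:32]))
    digest = hashlib.sha256(selected).digest()
    return struct.unpack("!I", digest[:4])[0] % table_size

# Example: mask in only the bytes of an (assumed) destination field.
header = bytes(range(32))
mask = bytes([0x00] * 16 + [0xFF] * 4 + [0x00] * 12)
print(custom_forwarding_index(header, mask, table_size=1024))
```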

PacketShader: A GPU-Accelerated Software Router

  • Prototype achieves 40 Gbps on a single box by exploiting GPU acceleration.
  • Software routing is not just IP routing. It is driven by software and exploits commodity hardware.
  • 10G NICs cost from $200–300 per port. But software routers are limited to less than 10Gbps (8.7Gbps in RouteBricks is the best so far).
  • For 10G, it takes 1200 cycles to do packet I/O, and your budget is 1400 cycles. Lookup/encryption/hashing typically takes much more than that.
  • First step is to optimize the packet I/O. Then offload the other functions to the GPU.
  • GPUs are massively-parallel. Lots of small cores.
  • A GTX480 GPU has 480 cores and 1.2 billion transistors, most of which are dedicated to ALUs.
  • Operations like hashing, encryption, pattern matching, network coding and compression are computationally intensive. GPU is well suited to these. GPU can also effectively hide memory latency.
  • Memory bandwidth of a top-of-the-line CPU is 32GB/s, but the empirical bandwidth (on realistic access patterns) is 25GB/s. Multiple ports receiving and transmitting will consume this and cause contention. However, a GPU has 174GB/s memory bandwidth.
  • Key insight: stateless packet processing is parallelizable. Take packets from the head of the receive queue, batch them, and process them in parallel (see the sketch after this list).
  • Latency is not impacted by parallel processing.
  • Pre-shader: checksum, TTL, format check, etc. This will send some packets along the slow path. It collects the destination IP addresses and passes them to the shader.
  • Shader: takes IP addresses, looks up the forwarding table and returns the next hops.
  • Post-shader: packets are updated and transmitted through the output ports.
  • Also device drivers at the receive and transmit side. Implemented a custom driver; details in the paper.
  • Can scale further with a multicore CPU: one master core and three worker cores, with the master core talking to the shader. With multiple sockets, you need one GPU per CPU; there is no communication between the CPUs, and each CPU owns a subset of the input queues.
  • Evaluated by connecting a packet generator and PacketShader back-to-back. Generator generates up to 80Gbps.
  • GPU gives a speedup (over CPU-only) of 1.4x for IPv4, 4.8x for IPv6, 2.1x for OpenFlow and 3.5x for IPSec.
  • IPv6 table lookup requires more computation than IPv4 lookup; the algorithm is binary search on hash tables. Big performance improvement for small packets, but slightly worse for 1024- and 1514-byte packets. However, this is bounded by the motherboard I/O capacity.
  • IPSec tunneling adds a header and trailer to the encrypted packet. The improvement is across all packet sizes, and is actually bigger for larger packets.
  • PacketShader achieves 28.2 Gbps with CPU only, and is implemented in user space, rather than kernel space. Reaches 39.2 Gbps with the GPU.
  • Need to add a control plane (currently only does static forwarding), e.g. Quagga or XORP.
  • Could also integrate with a programming environment, such as Click.
  • Q. Is it worth implementing such a sophisticated design to make a 40% saving? And do you have a breakdown of where the savings are made? The budget numbers and breakdown are taken from RouteBricks.
  • Q. What do you think about the power efficiency of this compared to other approaches? Idle to full load is 327W–594W with two CPUs and two GPUs. (Compared to 260W–353W for two CPUs.)
  • Q. Does this approach have advantages over an integrated network processor in terms of scalability or programmability? Network processors are not commodity. Based on experience, they are much more difficult to program.
  • Q. Why did your approach have such a significant speedup over RouteBricks etc. even without the GPU? Improvements in packet I/O throughput.
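
A sketch of the pre-shader/shader/post-shader split described above, in Python for exposition. The packet fields and the `fib_lookup`/`emit` callbacks are hypothetical; in PacketShader the shader stage is a GPU kernel that performs all the lookups for a batch in parallel.

```python
def process_batch(packets, fib_lookup, emit):
    """One pipeline iteration over a batch taken from the RX queues."""
    fast, slow = [], []
    for pkt in packets:                       # pre-shader (CPU)
        if pkt["ttl"] <= 1 or not pkt["checksum_ok"]:
            slow.append(pkt)                  # exceptions take the slow path
        else:
            fast.append(pkt)
    # Shader: one data-parallel lookup per packet. On the GPU this is a
    # kernel launch over the whole batch; here it is a comprehension.
    next_hops = [fib_lookup(p["dst_ip"]) for p in fast]
    for pkt, hop in zip(fast, next_hops):     # post-shader (CPU)
        pkt["ttl"] -= 1
        emit(pkt, hop)
    return slow                               # handled by the CPU slow path

# Example with a toy FIB and a printing emitter.
fib = {"10.0.0.1": "port0", "10.0.0.2": "port1"}
pkts = [{"dst_ip": "10.0.0.1", "ttl": 64, "checksum_ok": True},
        {"dst_ip": "10.0.0.2", "ttl": 1, "checksum_ok": True}]
process_batch(pkts, fib.get, lambda p, hop: print(p["dst_ip"], "->", hop))
```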

EffiCuts: Optimizing Packet Classification for Memory and Throughput

  • Packet classification is important for security, traffic monitoring and analysis, and QoS. Usually based on the source and destination IPs and ports, and the protocol field.
  • Line rates and classifier sizes are increasing. This leads to high power consumption.
  • Previous approaches have used either TCAMs (poor scalability) or algorithmic approaches (potentially scalable, but problematic). Most promising approach based on decision trees. Aim of this work is to address the scalability of decision tree algorithms.
  • HiCuts and HyperCuts have investigated decision trees previously. However, they require large memory.
  • EffiCuts reduces the memory overhead of HyperCuts while achieving high packet throughput. Uses 57x less memory and 8x less power.
  • Rules in the decision tree are hypercubes in the rule space. Tree building successively cuts down the rule space into smaller sub-spaces. Stops when the cube is small. Classification uses tree traversal.
  • HyperCuts’ memory overhead arises because many rules overlap and vary in size: the fine cuts needed to separate small rules also cut, and hence replicate, large rules. There is also overhead because the rule space is sparse (leading to empty nodes or nodes with replicated rules).
  • Aim to tackle the variation in the rule size and the density of the rule space.
  • Separable trees: build separate trees for large and small rules. But separate them along different dimensions.
  • Build a distinct tree for each set of separable rules over the 5 IP fields. This leads to a maximum of 31 trees, but in practice it’s more like 12 (see the sketch after this list).
  • Extra memory accesses to traverse multiple trees decreases packet throughput. To reduce the number of accesses, merge the trees.
  • HyperCuts uses equi-sized cuts to separate dense areas, whereas EffiCuts uses equi-dense cuts, which leads to fine/coarse cuts in dense/sparse areas. Many details of this in the paper.
  • Node co-location: colocate a node and its children, details of this in the paper.
  • Implemented HiCuts and HyperCuts with all heuristics, and EffiCuts. Used 16 rules per leaf. Power comparison uses an estimation from the Cacti tool to simulate the SRAM/TCAM.
  • First result: HyperCuts and HiCuts see memory grow more rapidly than EffiCuts. Replication decreases from 1000 to < 9. EffiCuts needs a constant number of bytes per rule as the number of rules grows.
  • EffiCuts requires 50% more memory accesses than HyperCuts. However, since EffiCuts uses much less memory, those accesses are inexpensive.
  • Throughput results are mixed (149 down to 73 million packets per second for one rule set; but 218 up to 318 for another). Still see an 8x saving in power.
  • Also compared EffiCuts to TCAM. Throughput story is also mixed, but EffiCuts consumes 6x less power than a TCAM.
  • Q. How do you separate “large” and “small” rules—using a threshold? We observed that the rule spread is essentially bimodal. This is based on a sensitivity analysis to the “largeness fraction” which varies between 0.1 and 0.95 without affecting the split.
  • Q. How would power consumption compare to a TCAM where you selectively turn on the relevant banks? Since we are comparing the worst-case packet match, every rule could go to a very different bank.
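
A sketch of how rules might be assigned to separable trees, as described above. The rule representation (a (lo, hi) range per field) and the 0.5 threshold are illustrative; per the Q&A, the split is insensitive to the “largeness fraction” over 0.1–0.95.

```python
LARGENESS_FRACTION = 0.5  # illustrative; the split is insensitive to this

def separability_category(rule, field_ranges):
    """Classify a rule by which of its 5 fields are 'large' (cover most
    of that field's range). Rules with the same category share a tree,
    so the fine cuts needed for small rules never replicate large ones.
    Five binary choices give at most 2^5 = 32 categories (~31 usable
    trees); in practice far fewer are populated."""
    category = []
    for (lo, hi), (fmin, fmax) in zip(rule, field_ranges):
        extent = (hi - lo + 1) / (fmax - fmin + 1)
        category.append(extent >= LARGENESS_FRACTION)
    return tuple(category)

# Example: src/dst IP wildcarded, src port and protocol exact.
ip, port, proto = (0, 2**32 - 1), (0, 2**16 - 1), (0, 255)
rule = [ip, ip, (80, 80), (0, 2**16 - 1), (6, 6)]
print(separability_category(rule, [ip, ip, port, port, proto]))
# -> (True, True, False, True, False)
```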

Cloud and Routing

Theory and New Primitives for Safely Connecting Routing Protocol Instances

  • Earlier measurement study showed that internet routing is much more complex than the traditional two-level hierarchical model (EIGRP/BGP/OSPF). Since it is so complicated, the connecting primitives play a critical role. 99.9% of analyzed networks depend on them. They are used for things like domain backup, partition healing and router-level shortest path routing.
  • Designs are usually either safe or flexible. The status quo is unsafe and inflexible. The talk describes something that is both safe and flexible!
  • Framework based on routing algebras (metarouting). Connecting primitives have provable safety properties, more expressivity and require no modifications to existing protocols.
  • Today’s connecting primitives have two features: route selection (ranking) and route redistribution (information exchange). Configured using various optional settings in router configuration scripts.
  • Misconfiguration can lead to suboptimal routes, or loops.
  • Two questions: how should routes be compared, and when should they be redistributed? The idea is to have a conversion function from routes to a universal metric. Each routing instance is associated with a pair of conversion functions, and the problem is then framed as what properties these functions should satisfy.
  • Contributions: sufficient conditions to guarantee correct routing.
  • Goal: unmodified BGP, OSPF, RIP, EIGRP. But BGP and OSPF E2 are not strictly monotonic.
  • Non-BGP protocols modeled by a 2-dimensional ranking, comprising route type and cost. Conversion functions map these to and from an ordered set.
  • Route selection: prefer non-BGP to BGP routes; among non-BGP routes prefer type A, then B, then C; and among non-BGP routes of the same type prefer the lowest cost (see the sketch after this list).
  • Domain backup/partition healing: currently possible, but only with a complex configuration, a star topology and giving protection only to the leaves. In the new design with the default conversion functions, can do this with any topology and any available path.
  • Router-level shortest path: currently only possible between OSPF instances, and the cost cannot be changed. The new design can do it with OSPF, RIP and EIGRP using per-instance metrics.
  • Traffic engineering: existing design allows this only within instance, but now we can do it across instances.
  • Q. Do the functions have to be common across all the routers? Yes, it has to be consistent across all the border routers. But future work would let it be different.
  • Q. Do you see the potential for work in the standards bodies to standardize these functions? Talking to the router vendors about this.
  • Q. What is the behavior during convergence? We haven’t focused on that. After some time, we converge to a correct path. How can you be sure that it is correct without enforcing a global ordering? We do enforce a global ordering at present.
  • Q. Does your scheme require a centralized entity? We have no such entity. What about between different ASs? …
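
A sketch of route selection through conversion functions, as described above. The tuple encoding of the universal metric and the dict-based route representation are my own; the ordering implements the preference rule from the talk.

```python
TYPE_RANK = {"A": 0, "B": 1, "C": 2}   # route types as in the talk

def to_universal(route):
    """Convert a route to a universal metric (smaller tuple wins):
    non-BGP beats BGP; among non-BGP, rank by route type then cost."""
    if route["proto"] == "BGP":
        return (1, 0, 0)               # every BGP route ranks last
    return (0, TYPE_RANK[route["type"]], route["cost"])

def select_route(routes):
    return min(routes, key=to_universal)

routes = [{"proto": "BGP"},
          {"proto": "OSPF",  "type": "B", "cost": 10},
          {"proto": "EIGRP", "type": "A", "cost": 50}]
print(select_route(routes))  # EIGRP route: type A beats type B despite cost
```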

DONAR: Decentralized Server Selection for Cloud Services

  • Server selection as a customizable constrained optimization. This talk describes a distributed solution to this problem.
  • User-facing services are being replicated across the world. Server selection involves selecting which replica to use for a particular user. Want to be agnostic to the wide-area protocol being used. For example, distributed DNS and HTTP redirection/proxying.
  • Idea was to build a system, DONAR, to which server selection can be outsourced.
  • Policy interface. Naïve policies include round-robin (for load balancing), or location-aware closest-node. But want to support complex policies over many nodes.
  • Policies are represented by constraints. For example, a bandwidth cap, or a split ratio and allowable deviation. Without constraints, just use closest node, but this can lead to massive imbalance.
  • Improvement: add a bandwidth cap on some instances, which now gives some locality, but doesn’t overtax a particular instance (and offloads some traffic to other instances).
  • Improvement: split 10% to each of ten replicas with a tolerance of +/- 5%. The tolerance is unique to this implementation. Gives the ability to trade off network proximity and load distribution.
  • Can have a mix of bandwidth cap, proportional split and tolerance across different instances.
  • For a customer, can define a global linear program that describes the optimal pairing: minimize network cost (distance) such that load is within tolerance and bandwidth caps are met (see the toy instance after this list).
  • Need to do this for each customer, and continuously. Aim is to have potentially hundreds of DONAR nodes, customers and replicas per customer, and tens of thousands of client groups per customer. The linear program has millions of variables per customer.
  • DONAR nodes could measure traffic and optimize locally. But no one node sees the entire client population, and the distribution at each node is skewed.
  • Could maybe aggregate all of the data at a central coordinator. This would end up sharing a lot of data and compromising responsiveness. For example, we would want a node to respond quickly to a flash crowd.
  • Actually came up with a decomposition strategy, for both the objective function and the constraints, using a Gauss-Seidel iterative decomposition approach. The summary data shared is only proportional to the number of replicas. Proof in the paper that this converges to the global optimum.
  • Deployed since November 2009, and in production use for two customers. Currently serves around one million DNS requests per day.
  • Other systems challenges in the paper: network availability, reliable data storage, etc.
  • Experimental setup using CoralCDN with a proportional split across the replicas. Closest node policy is very volatile, whereas DONAR equal split gives much more predictable workloads.
  • Better than simple round-robin, since DONAR keeps the network distance as small as possible (shown by the rank distribution of node distance from the client).
  • Q. How quickly does the global solution converge? Covered in the paper, and usually after one round of hearing from everyone.
  • Q. Can you incorporate server-specific policy like consistency or staleness (cf. weak/eventual consistency, where different replicas have different data)? We assume that we can direct a request to every replica. But we could solve this problem for separate domains.
  • Q. What’s the inaccuracy that comes from geolocation based on the location of the resolver? This has been well studied.
  • Q. What if a server’s settings change or it fails, how long will it take to recompute? Built-in support for liveness updates (either notification or loss of heartbeat), and use a heuristic before we rerun the optimization. Rerun the optimization every two minutes.
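
A toy instance of the per-customer optimization described above, using scipy. All numbers (distances, demand shares, the 50% +/- 5% split) are made up, and the real system solves this in decomposed form across DONAR nodes rather than centrally.

```python
import numpy as np
from scipy.optimize import linprog

# x[c, r] = fraction of client group c's requests sent to replica r.
dist = np.array([[5.0, 20.0],    # client group 0 -> replicas 0, 1
                 [18.0, 4.0]])   # client group 1
demand = np.array([0.6, 0.4])    # fraction of total traffic per group
target, tol = 0.5, 0.05          # each replica: 50% +/- 5% of traffic

n_c, n_r = dist.shape
c = (dist * demand[:, None]).ravel()     # minimize total network distance
# Each client group's fractions must sum to 1.
A_eq = np.zeros((n_c, n_c * n_r))
for i in range(n_c):
    A_eq[i, i * n_r:(i + 1) * n_r] = 1.0
b_eq = np.ones(n_c)
# Each replica's load must lie within [target - tol, target + tol].
load = np.zeros((n_r, n_c * n_r))
for j in range(n_r):
    for i in range(n_c):
        load[j, i * n_r + j] = demand[i]
A_ub = np.vstack([load, -load])
b_ub = np.concatenate([np.full(n_r, target + tol),
                       np.full(n_r, -(target - tol))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * (n_c * n_r))
print(res.x.reshape(n_c, n_r))  # optimal split per client group
```

With these numbers the pure closest-node assignment would overload replica 0, so the solution diverts just enough of group 0’s traffic to replica 1 to meet the tolerance.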

Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud

  • Challenge: data privacy issues, such as national privacy laws and industry-specific privacy laws (like HIPAA). Challenge: SLA requirements like response time.
  • Possible solution is hybrid clouds.
  • First focus is on planning hybrid cloud layouts, making cost savings, minimizing response times and bandwidth costs.
  • Second focus is on migrating security policies, such as firewall contexts and ACLs.
  • Contributions: study of the complexity of enterprise applications, first-cut solutions to the two challenges, and validations.
  • Enterprise applications are typically three-tier but with multiple components in each tier, and complex interdependencies.
  • Abstract the planning problem as a graph with vertices for components, and virtual vertices for internal and external users. Each vertex has a size (number of servers), and each edge carries a number of transactions per second and a volume of data per second. The objective is to maximize cost savings through migration, subject to policy constraints and bounds on the increase in transaction delay. Then partition the graph between local and remote (see the sketch after this list).
  • Approach is to use easily-available information, like computation times of components and communication times on links.
  • Model user response times using bounds on changes to the mean delay and variance.
  • Benefits for cost savings based on the estimates in the Above the Clouds tech report. Based on a non-elastic migration, and future work is to look at using the cloud for peaks.
  • Migration algorithm is based on a reachability matrix to determine necessary security policies.
  • Evaluated based on two case studies, the Windows Azure SDK application and a Campus Enterprise Resource Planning application.
  • Used a cloud testbed to evaluate a thumbnail-making example application. The plan results in an increase in mean delay of less than 10% and an increase in variance of less than 50%.
  • The campus ERP application involves multiple front-end, business logic and back-end components, in use at Purdue.
  • With a 30% bound on increase in mean delay, get $58k savings by migrating all components. For a 15% bound, get $38k savings by migrating a subset of the components. For a 15% bound with a placement policy (don’t move DBs), still get a $14k saving. Paper contains a sensitivity study on the benefit ratios.
  • Various security policies are also generated and evaluated.
  • Q. How much time did it take to understand the dependency mapping between components? Talked to operators and interviewed them, so the dependencies were extracted by humans. What about license dependencies, such as “may not be run in a VM”? [Taken offline.]
  • Q. Why did the delay increase more than the policy allowed in the graphs? Used PlanetLab before a deadline, so the environment was hostile.
  • Q. Can you comment on the complexity of rewriting applications to run in split mode, which seems to be more important than performance or cost? There is an increasing trend towards service-oriented architectures, which make it easier to do this migration. Haven’t solved this for legacy applications.
  • Q. Did you have a co-location scheme, e.g. for preventing certain pieces of data from being stored in certain locations? Constraints are based on cloud/non-cloud.
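
A brute-force sketch of the planning problem described above: pick the subset of components to migrate that maximizes savings subject to a delay bound. Component names, savings and edge delays are invented, and the paper’s formulation also models variance, bandwidth costs and policy constraints rather than enumerating subsets.

```python
from itertools import combinations

def plan_migration(components, edges, delay_bound):
    """Choose the subset of components to move to the cloud that
    maximizes savings, subject to a bound on the delay added by edges
    that become wide-area (i.e. cross the local/remote split).

    components: {name: savings_if_migrated}
    edges:      {(a, b): delay_added_if_split}; vertices not in
                `components` (e.g. users) always stay local.
    """
    names = list(components)
    best, best_saving = frozenset(), 0.0
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            cloud = set(subset)
            delay = sum(d for (a, b), d in edges.items()
                        if (a in cloud) != (b in cloud))
            saving = sum(components[c] for c in cloud)
            if delay <= delay_bound and saving > best_saving:
                best, best_saving = frozenset(cloud), saving
    return best, best_saving

# Example: the user-facing tier is expensive to separate from its users.
comps = {"web": 10.0, "logic": 20.0, "db": 30.0}
edges = {("users", "web"): 15.0, ("web", "logic"): 5.0, ("logic", "db"): 40.0}
print(plan_migration(comps, edges, delay_bound=10.0))
# -> (frozenset({'logic', 'db'}), 50.0): migrating everything would add 15.0
```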