SIGCOMM 2010: Day 3

Thursday, September 2nd, 2010

Network IDS

NetFence: Preventing Internet Denial of Service from Inside Out

  • DDoS is projected to be the biggest problem facing the Internet in the next 12 months, and it is difficult to combat, since countermeasures conflict with the Internet's design principles of openness and robustness.
  • Previous work has focused on the receivers of DDoS attacks (denial of edge service), usually using network filters or network capabilities.
  • But with a large enough botnet, bots can collude to send packet floods that impair network services.
  • Challenge is to design a network architecture that combats both kinds of attack.
  • Solution: NetFence. Gives the network control over its resource allocation to combat denial of network services (DoNS). Also hierarchical and coupled with network capabilities.
  • Hierarchical congestion policing slows down flooding senders, and is robust to both compromised routers and hosts. Uses symmetric key cryptography, and each packet carries a secure token (based on Passport from NSDI 2008).
  • Secure congestion policing feedback acts like a network capability. Capabilities are returned if the receiver authorizes the traffic as “desired”.
  • Two types of packet: request and regular. Packet has five fields: mode (nop/monitor), link ID, action (up or down), timestamp and MAC (authentication).
  • First, a sender sends a request packet. The mode field gets stamped by the access router as nop, and the MAC is calculated based on a hash of the original fields. If an attack is deemed to be underway (i.e. congestion is encountered), the action is set to down (deprioritize), the link ID is stored, and the mode is set to monitor. The routers have keys distributed using Diffie-Hellman over BGP.
  • Policing is done at the access router, which looks at the packet sent back from the receiver (mode, action, etc.), and configures a leaky bucket as necessary.
  • Congestion policing loop uses AIMD at the access router to vary the sender’s bucket capacity.
  • A policing cycle is started based on a load- or loss-based detection procedure in the bottleneck router. RED is used to signal congestion within a cycle.
  • Works because: (i) secret keys used by routers to do feedback, (ii) periodic AIMD used to achieve fairness/efficiency, and (iii) congestion feedback acts as capabilities to prevent unbounded traffic.
  • Provable fairness is shown in the paper: each good user achieves a share of the network capacity equal to one over the total number of senders. Denial of service becomes “predictable delay of service”.
  • Many attacks are possible against a system like this. They are discussed in the paper; two are covered here.
  • To deal with floods of request packets, the request packet channel is separate, and there is a per-sender request packet limit, which is policed. There is a priority-based backoff which emulates computational puzzles.
  • To deal with routers hiding backoff feedback, the system treats the absence of an up-feedback as down-feedback.
  • Implemented on Linux using XORP and Click. AES-128 is the MAC function (see Encrypting the Internet). Benchmarked using DeterLab, dual-core 3GHz Xeons with 2GB of RAM.
  • The bottleneck router has no processing overhead when there is no attack. Overhead is 492–554 ns/packet (one AES computation) when there is an attack.
  • A shim layer sits between IP and TCP (or another transport), adding a header overhead of 20–28 bytes.
  • Simulated using NS-2 to evaluate various attacks. Compared to other systems, which put more state in the core.
  • Experiment on a denial of edge services attack: as the number of simulated senders increases, the file transfer time remains constant, unlike Fair Queuing, where it increases; TVA+ and StopIt are faster (but less scalable).
  • Experiment on a denial of network services attack. Looked at ratio of average user throughput to average attacker throughput. NetFence achieves fairness.
  • Q. Do you distinguish good and bad users? No, we used AIMD to achieve fairness instead.
  • Q. How can you separate a flash crowd from malicious traffic? We don’t, treating extreme congestion the same way as an attack because it is a failure of end-to-end congestion control.
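The AIMD policing loop at the access router described above can be sketched as follows; the class name, constants and feedback encoding are illustrative assumptions, not taken from the paper. Note how absent feedback is treated as down-feedback, so routers cannot gain by hiding backoff signals:

```python
# Sketch of NetFence-style access-router policing: each sender gets a
# leaky bucket whose rate is adjusted by AIMD based on the congestion
# feedback ("up"/"down") echoed back by the receiver.
# Constants are illustrative, not taken from the paper.

ALPHA = 10.0   # additive increase (packets/s) on "up" feedback
BETA = 0.5     # multiplicative decrease factor on "down" feedback

class SenderPolicer:
    def __init__(self, initial_rate=100.0):
        self.rate = initial_rate  # allowed sending rate (packets/s)

    def on_feedback(self, action):
        """Apply one AIMD step; absent feedback is treated as "down",
        so a router hiding backoff feedback cannot gain by doing so."""
        if action == "up":
            self.rate += ALPHA
        else:  # "down" or absent feedback
            self.rate *= BETA

p = SenderPolicer(initial_rate=100.0)
p.on_feedback("up")    # rate: 110.0
p.on_feedback("down")  # rate: 55.0
p.on_feedback(None)    # absent feedback, treated as down: 27.5
```

The leaky bucket itself (capacity set from `rate`) would then police the sender's packets at the access router.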

ASTUTE: Detecting a Different Class of Traffic Anomalies

  • Network management is used to ensure that customer SLAs, security policies, resource availability are maintained. Anomaly detection normally involves building a statistical model of normal traffic and defining an anomaly as a deviation from normal.
  • However, it is hard to obtain a model of “normal” traffic. Look at a time series of packet counts, and usually define a model baseline (tolerance) based on something like EWMA, and anomalies are anything outside that. However, training isn’t guaranteed to be anomaly-free.
  • Aim is to detect anomalies without having to define what is normal. Advantage is a simple tool that doesn’t have to perform training and is hence immune to data poisoning. It is accurate for a well-defined class of traffic anomalies, with theoretical guarantees on the false positive rates. However, its applicability is limited to when traffic characteristics don’t change.
  • Empirical properties: flow independence (although some weak correlation between flows), stationarity (time invariance over the timescales of a typical flow duration), and independence and stationarity => equilibrium.
  • ASTUTE = A Short-Timescale Uncorrelated Traffic Equilibrium. Between two consecutive time-bins, flow volume changes are zero-mean i.i.d.
  • Measure the number of flows, mean volume changes and variance of volume changes between consecutive time bins. Flag an alarm if the ASTUTE Assessment Value (AAV), calculated from these, is greater than some threshold.
  • The threshold controls the false positive rate. Appeal to the central limit theorem, so for a large number of flows, the AAV has a Gaussian distribution. False positive rate is just the area of the bell curve outside the threshold.
  • If ASTUTE is violated, at least one of the model assumptions is violated. For example, stationarity. Long bin sizes (one hour) lead to anomalies flagged when people arrive and leave at the beginning and end of the day (daily bias). Short timescales see no bias at all.
  • Worked with flow traces from Internet2, GEANT2 and the Technicolor corporate network. Compared against Kalman and Wavelet filters.
  • Small overlap between anomalies detected by ASTUTE and the other methods. ASTUTE finds different classes of anomalies: tends to be larger numbers of flows with fewer packets than the Kalman and Wavelet approaches.
  • Plotted classified anomalies in each network on a similar graph (#flows vs #packets per flow), and saw that ASTUTE is worse on DoS attacks, but better on prefix outages, link outages and gaps, and port scans.
  • Looked at the ROC curve to see the trade-off between false and true alarms. Kalman would need a much higher false positive rate to detect port scans. But ASTUTE would require a very high false positive rate to detect DoS attacks.
  • Q. Can you not detect large flows because the time windows are so short that they look i.i.d. over those time scales? If it has a small number of flows, it will look independent to ASTUTE. There is an analytical limit to how many flows you need before you can detect it (threshold-squared).
  • Q. Who cares about detecting correlated flows? ASTUTE is not only useful for anomaly detection. But the interesting thing is that it can identify things that the operator would not be aware of, like bugs in misbehaving applications.
  • Q. Do you have the ground truth that the DoS attacks are real DoS attacks? Yes, we have analyzed the data, and there were lots of SYN packets going to a single location, usually from a single IP.
  • Q. Is there a way to classify an anomaly or is it ad hoc? We started with visual inspection, but we developed a tool for this.
  • Q. If your traffic is skewed towards a few flows, does the CLT hold? The CLT and assumption that we have lots of flows is an assumption for normal behavior.
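A minimal sketch of the ASTUTE check described above, assuming the statistic is the mean flow-volume change scaled by sqrt(F)/sigma (the paper's exact AAV formula may differ in detail):

```python
# Sketch of an ASTUTE-style check: between two consecutive time bins,
# per-flow volume changes should be zero-mean i.i.d.; a CLT-based
# statistic far from zero flags an anomaly. The exact statistic in the
# paper may differ; this follows the description above.
import math
import statistics

def astute_alarm(deltas, threshold):
    """deltas: per-flow volume changes between two consecutive bins.
    With many flows, aav is approximately N(0,1) under the equilibrium
    hypothesis, so threshold directly bounds the false positive rate."""
    f = len(deltas)
    mean = statistics.fmean(deltas)
    stdev = statistics.pstdev(deltas)
    aav = mean * math.sqrt(f) / stdev
    return aav, abs(aav) > threshold

# Balanced, uncorrelated changes: no alarm.
aav, alarm = astute_alarm([1, -1, 2, -2, 1, -1, 2, -2], threshold=3.0)
assert not alarm

# Correlated increases across many flows: alarm.
aav, alarm = astute_alarm([5, 6, 5, 7, 6, 5, 6, 7], threshold=3.0)
assert alarm
```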

NetShield: Massive Semantics-Based Vulnerability Signature Matching for High-Speed Networks

  • Maintaining network security is a grand challenge. Worms and botnets are widespread.
  • Talk concentrates on signature-based IDS. Normally, there is a database of signatures which is matched against each packet and used to generate alerts. This needs to be accurate and fast.
  • State of the art is regex-based. Used in Cisco IPS, Juniper IDS and Bro (open-source). It can efficiently match multiple signatures simultaneously using an NDFA, and can describe the syntactic context. But the expressive power is limited, and it cannot describe the semantic context. This leads to inaccuracy.
  • Other state of the art is vulnerability signatures for host-based IDS. It directly describes the semantic context and is very expressive (able to describe the vulnerability exactly). It’s accurate, but slow, using sequential matching and requiring protocol parsing.
  • Vulnerability signature matching requires parsing, matching and combining. Since the protocol grammar is context-sensitive, a regex cannot be used (and would be impractical anyway).
  • Also a regex assumes a single input, so it cannot help with the combining phase.
  • So regex approaches cannot be used to match vulnerability signatures.
  • First challenge: matching thousands of vulnerability signatures simultaneously. Second challenge: parse protocols quickly. Solution achieves 10G throughput with an efficient matching algorithm and a tailored parsing design for high-speed matching.
  • Basically, a vulnerability signature uses a sequence of protocol data units (PDUs) with one predicate per PDU. A PDU could be something like the HTTP version or the method. Need numbers and strings, number operators (comparisons) and string operators (equality, regex matching and length).
  • Given n signatures defined on k matching dimensions, a matcher is a two-tuple (field, operation) or a four-tuple for associative array elements. This leads to an n-by-k table. A table representation admits the possibility of matching multiple signatures simultaneously. Table looks like an associative array, with lots of don’t-cares.
  • Worst-case time complexity is O((log n)^(k-1)), with O(n^k) space complexity. Based on the Snort and Cisco rulesets, which have selective matchers, the design actually gives O(k) time complexity.
  • Iterative matching algorithm on the columns, based on intersecting relevant rulesets with special treatment for don’t cares.
  • Merging requires k-1 iterations. Worst-case merge complexity is O(n), but for real-world rulesets it is closer to O(1).
  • For high-speed parsing, compare tree-based and streaming parsers. Streaming parsers need only retain signature-related fields. Built an automated parser generator that produces a parsing state machine for the protocol.
  • Implemented in 10kloc of C++ and 3kloc of Python. Evaluated on 26GB traces from Tsinghua University, Northwestern and DARPA. Run on a P4 3.8GHz with 4GB of RAM. For HTTP, 794 vulnerability signatures, and WINRPC 45 vulnerability signatures. Speedup ratio compared to Binpac is around 11x for non-HTTP and 3–4x for HTTP. Maintained throughput of 2.63 (HTTP in the university) to 17.6 (HTTP at Northwestern) Gbps for parsing and matching. Multicore gives a speedup.
  • Tool available online.
  • Q. Can you go into more details about the memory overhead? DFA requires 5.29GB for 973 Snort rules, whereas NetShield requires 2.3MB. The XFA paper showed 863 rules in 1.08MB. NetShield could improve by implementing XFA.
  • Q. Is it possible to do the massive matching using GPUs? Currently, most connections are independent, so yes probably.
  • Q. Do your scalability results not show that you require a clock cycle per bit? We only have to look at the bits in the signature.
  • Q. What are the advantages of your scheme with respect to XFA? XFA has limited accuracy: it would produce false positives.
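The table-based candidate-selection idea (an n-by-k table with don't-cares, intersected column by column) can be sketched as follows; the rule fields and predicates are illustrative, not drawn from the Snort/Cisco rulesets:

```python
# Sketch of NetShield-style candidate selection: rules form an n-by-k
# table with don't-cares; for each matcher (column), compute the rule
# IDs whose predicate holds (don't-care always holds), then intersect
# column by column. Rules and fields here are illustrative.

DONT_CARE = object()

def match_rules(table, packet_fields):
    """table: list of {field: predicate}; an absent field means
    don't-care. Returns the set of matching rule IDs."""
    candidates = set(range(len(table)))
    for field, value in packet_fields.items():
        hits = set()
        for rid, rule in enumerate(table):
            pred = rule.get(field, DONT_CARE)
            if pred is DONT_CARE or pred(value):
                hits.add(rid)
        candidates &= hits  # one merge per column, k-1 merges overall
    return candidates

rules = [
    {"method": lambda m: m == "GET", "uri_len": lambda n: n > 1000},
    {"method": lambda m: m == "POST"},     # uri_len: don't care
    {"uri_len": lambda n: n > 500},        # method: don't care
]
assert match_rules(rules, {"method": "GET", "uri_len": 2000}) == {0, 2}
assert match_rules(rules, {"method": "POST", "uri_len": 10}) == {1}
```

With selective matchers (few rules match any given column value), the candidate sets stay small, which is where the observed O(k) behavior comes from.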

Network Architecture and Operations

R3: Resilient Routing Reconfiguration

  • Failures are common, but today’s emerging applications impose a stringent requirement of network reliability. Plus SLA violations may impact an ISP’s revenue. Aim is to recover quickly from a single or multiple overlapping failures.
  • In a 500-link network, the number of failure scenarios involving up to three links exceeds 20 million. So it is difficult to optimize routing to avoid congestion under all possible failure scenarios.
  • Existing approaches focus exclusively on reachability. But these may lead to congestion and unpredictable performance. Some existing approaches consider only a small subset of failures, or optimize routing after failures, but this is too little, too late.
  • R3 requires no enumeration of failure scenarios, is provably congestion-free, efficient in terms of storage overhead, and flexible to diverse requirements.
  • Represent network as a graph, with link capacities and traffic demands on each link. Output of R3 is a base routing and a protection routing. Protection routing is a fast rerouting defined for every link that might fail.
  • Idea is to transform topology uncertainty into traffic uncertainty. Routing is optimized for the set of traffic demands on the original topology. Consider the amount of load that is shifted to other links when a failure occurs. If the routing is congestion-free, rerouted traffic is less than capacity.
  • R3 has two phases. First, offline precomputation which minimizes congestion for original demand plus rerouting virtual demand on the original topology. The protection routing may use routes that later fail. Solve using a linear programming technique.
  • After a link fails, convert the protection routing for that link into a valid routing that doesn’t use any other failed links. After the failure, need to reconfigure the protection routing, which uses the computed rerouting.
  • Offline precomputation and online reconfiguration are sufficient to get congestion-free routing. Whether it is optimal for more than one link failure is an open problem. The reconfiguration is order-independent, which enables distributed reconfiguration.
  • Some extensions: fixed base routing, a trade-off between no-failure and failure protection to bound the no-failure performance, a trade-off between link utilization and end-to-end delay, prioritized traffic protection, realistic failure scenarios (shared-risk and maintenance link groups), and traffic variations.
  • Evaluated on two real networks and a synthetic topology. Compared to various rerouting schemes. Added R3 to OSPF and MPLS-ff. Looked for maximum link utilization.
  • For a single failure, R3 achieves near optimal performance. Under multiple failures, it is at least 50% better than other schemes.
  • Implemented for Linux and Linux MPLS. Emulated the Abilene topology on Emulab. 3 physical link failures simulated. Outperforms OSPF+recon by a factor of around 3.
  • Profiled the precomputation time: less than 36 minutes for each topology and less than 17 minutes for non-generated topologies.
  • Storage overhead is < 300KB in the FIB, and < 20MB in the RIB.
  • Q. Have you looked at how to redistribute traffic after a link returns? Have a reconfiguration rule for failure recovery. It will revert back to the last failure scenario, but the ordering may be different (this is provably alright).
  • Q. Have you looked at the overhead of announcements during convergence under churn? No packets will be dropped during this case.
  • Q. How does your algorithm cope with network partition? Studied this in the paper. In this case, we cannot have reachability, so we cannot have congestion-freedom. R3 will ignore the demands that it cannot fulfill.
  • Q. How does your approach compare against oblivious routing schemes (such as Valiant load balancing)? These don’t usually handle a large number of failures; big ISPs normally see more failures than such schemes can handle.
  • Q. How do you evaluate traffic prioritization? Got 10 different priority classes from a US ISP, and showed that IP traffic gets sacrificed to protect VPN traffic.
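The congestion-free condition can be illustrated with a toy check; link names, loads and the protection-routing encoding are made up for the example (R3 itself derives the protection routing offline via linear programming):

```python
# Sketch of the R3 congestion-free condition: when link f fails, its
# base-routing load is rerouted along the protection routing for f; the
# network stays congestion-free if, for every surviving link, base load
# plus rerouted load is within capacity. Values are illustrative.

def congestion_free(capacity, base_load, protection):
    """capacity, base_load: dict link -> number.
    protection: dict failed_link -> {link: fraction of the failed
    link's load rerouted over that link}."""
    for f, split in protection.items():
        for link, frac in split.items():
            if base_load[link] + frac * base_load[f] > capacity[link]:
                return False
    return True

cap = {"A": 10, "B": 10, "C": 10}
load = {"A": 6, "B": 5, "C": 5}
prot = {"A": {"B": 0.5, "C": 0.5}}  # if A fails, split its 6 units
assert congestion_free(cap, load, prot)       # B: 5+3=8, C: 5+3=8
assert not congestion_free(cap, load, {"A": {"B": 1.0}})  # B: 5+6=11
```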

Mercury: Detecting the Performance Impact of Network Upgrades

  • Networks are becoming more complex and diverse. Software and hardware are both becoming more complex. This makes things more sensitive to network glitches or other performance issues. Purpose is to see whether a change makes the network better or worse performing.
  • Normal intuition is that an upgrade will make things better, but complex interactions can lead to unintended consequences. So it is important to monitor the impact of upgrades. This is hard due to the scale and diversity of different devices. So the challenge is to efficiently monitor at scale.
  • Mercury does automated data mining to extract trends, scales across a large number of measurements and flexibly across data sources, and is easy to interpret. Challenge is how to know when an upgrade happens, what their effect on performance is, and to find common factors in who is affected (or it is network-wide).
  • Could drive upgrade detection from the change management system, but since human information is unreliable, instead mine the configuration and workflow logs. Things like OS version and firmware upgrades are easy to track. However, lots of small configuration changes are not related to upgrades (such as customer provisioning). Out-of-the-ordinary changes are ones that are applied to multiple locations in the network, but rarely.
  • Divide event series (SNMP etc.) into equal time-bins to get a time series. Behavior change detection is based on a persistent shift in levels. Recursive rank-based cumulative sum is used on means, medians, standard deviations or distributions.
  • Identifying commonality (of attributes, configurations, etc.) is a machine learning problem (search in a multi-dimensional space). Use the RIPPER rule learner for this.
  • Sometimes aggregation will erroneously amplify rare events. Solution is to time-align each upgrade to each device (as if the upgrade happened at the same time).
  • Evaluated using close interaction with network operators. Used data sets about router configurations and workflow logs, and performance event series: SNMP and syslogs. Collected this data from a tier-1 ISP over 6 months. 988 routers in the study. Categories of router: core, aggregate, access, route reflector and hub.
  • Upgrade detection evaluated for false positives and false negatives. Threshold varied (frequency of change). Tends to see more false positives than false negatives, but these can be filtered.
  • Mercury reduces the number of upgrade-induced change points that the operator must look at by several orders of magnitude, compared to number of syslog entries. It confirmed the earlier operator findings and showed some unknown to the operator.
  • OS upgrades could cause CPU utilization to go down on access routers, but memory utilization to go up on aggregate routers (larger OS image). Varying changes in the number of layer-1 link flaps. More protection switching events.
  • Firmware upgrades could cause less CPU utilization on the central and CPU-facing routers’ CPUs.
  • Protection switching is line-card protection in customer-facing routers. Failover for the access router that customers connect to. Saw a small increase in the frequency of automated PS events. Time alignment was able to show this problem.
  • Q. Have you thought about the inverse problem where your triggers are the alarms of an anomaly detector, and you want to find the root causes? Problem with that is false alarms. With better anomaly detectors, this might become feasible.
  • Q. What is the time horizon of the attribute changes that you consider? We do persistent change detection, so look at daily averages over a history of about 6 months. We are now looking at whether transient things do matter (for the purpose of meeting SLAs, etc.).
  • Q. Do you monitor link capacity in your system? Currently only look at aggregate router statistics, not particular links/interfaces. We are starting to look into that.
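A much-simplified sketch of the level-shift detection: Mercury uses a recursive rank-based CUSUM, whereas this toy version runs a mean-based two-sided CUSUM against a short baseline window. All names and constants are illustrative:

```python
# Simplified level-shift detector in the spirit of Mercury's behavior
# change detection: a two-sided CUSUM on a performance time series
# flags a persistent shift relative to a baseline window. (Mercury uses
# a recursive rank-based CUSUM; this mean-based variant is illustrative.)
import statistics

def cusum_change(series, baseline=4, drift=0.5, threshold=4.0):
    """Return the index where a persistent level shift is flagged, or
    None. drift/threshold are in units of the baseline's stdev."""
    ref = series[:baseline]
    mu = statistics.fmean(ref)
    sd = statistics.pstdev(ref) or 1.0
    hi = lo = 0.0
    for i in range(baseline, len(series)):
        z = (series[i] - mu) / sd
        hi = max(0.0, hi + z - drift)   # upward shift accumulator
        lo = max(0.0, lo - z - drift)   # downward shift accumulator
        if hi > threshold or lo > threshold:
            return i
    return None

# CPU utilization: stable around 30, shifts to ~50 after an "upgrade".
assert cusum_change([30, 31, 29, 30, 31, 30, 50, 51, 49, 50]) == 6
assert cusum_change([30, 31, 29, 30, 30, 31, 29, 30]) is None
```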

California Fault Lines: Understanding the Causes and Impact of Network Failures

  • Most network failures are not catastrophic. But it’s difficult to collect comprehensive failure data. Lightweight techniques are limited, and special-purpose monitoring is expensive.
  • Contributions: a methodology to reconstruct the failure history of a network using only commonly-available data. Basically a time series of layer-3 failure events. Preferably annotated with the cause and impact of the failure. Data source for this is the syslog and the router configuration files in a version control system.
  • But this data is not intended for failure reconstruction. First rebuild the topology from the configuration file, then replay syslog messages. We also have semi-structured data from the maintenance logs.
  • Looked at CENIC network with 200 routers and 5 years of data (California academic network).
  • Limitations: syslog is sent using UDP, which leads to message loss. We might see a series of log messages containing a DOWN followed by another DOWN, so just ignore messages until we get back on track. Selection bias in the operational announcements.
  • Comprehensiveness: treat the operational announcements as ground truth and see how many of them have corresponding syslog messages. 97% of announcements were confirmed by the syslog.
  • Accuracy: using Skitter project which does frequent traceroutes to confirm that no packets went over down routers.
  • Validated down states using RouteViews (recorded BGP traffic) to track failure events.
  • 60% of failures last less than a minute, which inhibits detection or recovery. Turns out mostly to be flap events.
  • 7000 emails led to 3000 events. 28% of events are failures and 18% of observed failures are explained.
  • Failure causes: hardware, power, external, software, other and configuration. Hardware is the biggest cause of notices, but software is the biggest cause of failures (32% of failures). But almost 80% of software failures were due to scheduled changes.
  • Q. How are those failures distributed on the network? More at the backbone or on the edge? More downtime on the customer links and the high performance links than on the backbone.
  • Q. How do you think what you show shows more about the impact than simply tracking the control plane? It’s hard to know what the actual impact is, since we don’t collect that information. What other sources of information do we need on top of routing information? If we understood link utilization then we could see how links were being strained by these events.
  • Q. Are you saying that software upgrades are a dominant cause of failures? Not dominant, but serious. The UP/DOWN messages are a side-effect of the maintenance activity? Might be interesting to look at this.
  • Q. Do you see many concurrent failures? More details about this in the paper.
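The syslog-replay step can be sketched as pairing DOWN/UP messages per link. The handling of a lost UP (a DOWN followed by another DOWN) below is one plausible interpretation of "ignore messages until back on track"; message format and timestamps are illustrative:

```python
# Sketch of the syslog-replay step: pair DOWN/UP messages per link to
# reconstruct layer-3 failure intervals. Because syslog rides on UDP,
# messages can be lost; a second DOWN with no intervening UP closes
# the dangling interval with an unknown end time (None).

def reconstruct_failures(messages):
    """messages: chronological (timestamp, link, 'DOWN'|'UP') tuples.
    Returns {link: [(down_ts, up_ts_or_None), ...]}."""
    open_down = {}
    history = {}
    for ts, link, state in messages:
        if state == "DOWN":
            if link in open_down:  # lost UP: close dangling interval
                history.setdefault(link, []).append(
                    (open_down.pop(link), None))
            open_down[link] = ts
        elif state == "UP" and link in open_down:
            history.setdefault(link, []).append(
                (open_down.pop(link), ts))
    return history

msgs = [(1, "l1", "DOWN"), (5, "l1", "UP"),
        (8, "l1", "DOWN"), (12, "l1", "DOWN"), (15, "l1", "UP")]
assert reconstruct_failures(msgs)["l1"] == [(1, 5), (8, None), (12, 15)]
```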

Novel Technologies for Data Center Networks

c-Through: Part-Time Optics in Data Centers

  • Comparing optical circuit switching to electrical packet switching. Circuit switching vs. store and forward. Optical can do 320×100G, vs. 16×40G for electrical. But the optical switching time is about 10ms, compared to packet granularity.
  • Despite slow switching time, optical circuit switching is still promising. Full bisection bandwidth at packet granularity may not be necessary.
  • Looked at a hybrid packet/circuit switched architecture. PS for low latency and optical-CS for high capacity transfer. Optical paths are provisioned rack-to-rack.
  • Control plane needs to estimate traffic demand and configure optical circuit based on it. Data plane does traffic demuxing and optimizes circuit utilization (maybe).
  • c-Through is a specific design for this. A centralized controller manages circuit configuration. Applications and switches are not modified, and end hosts are leveraged for traffic management.
  • Enlarge socket buffers for applications to identify which flows are heavy and which are lightweight. This generates a per-rack demand vector. Applications are unmodified, and packets are buffered per-flow to avoid head of line blocking. This estimates traffic demand and pre-batches data to improve optical circuit utilization.
  • Traffic demand vectors aggregated into a traffic matrix. Use Edmonds’ algorithm to compute the optimal configuration (maximum weight matching problem). Then servers are notified. The control traffic overhead could be reduced.
  • Electrical and optical networks isolated using VLANs.
  • Traffic control on hosts, which makes end-hosts tag packets for the two VLANs accordingly.
  • Testbed with 16 servers, a hybrid network on a 48-port Ethernet switch. Optical switch is emulated using 4G links, whereas electrical network uses 100Mbps links. Optical circuit emulation: optical paths are only available when hosts are notified. There is a 10ms reconfiguration delay.
  • Evaluated TCP performance using dynamic bandwidth, overhead of traffic control and buffering effects. Also application performance (VM migration, MapReduce, MPI-FFT).
  • TCP exploits the dynamic bandwidth quickly. Throughput ramps up within 10ms. Throughput stabilizes within 100ms.
  • MapReduce performance. Since shuffling is independently transferred, it is amenable to batching. Sorted 10GB of random data, which took 800 seconds on an electrical network. With full bisection bandwidth, the performance is 135 seconds. As c-Through varies the buffer size limit, the best performance is 153 seconds, for 100MB buffers, which is close to ideal. As the reconfiguration interval is varied, can do it as infrequently as every 3 seconds, and the performance is 168s.
  • Ran Yahoo Gridmix benchmark, which contains 3 runs of 100 mixed jobs, such as web query, web scan and sorting. Uses 200GB of uncompressed data, and 50GB of compressed data. c-Through comes very close to the full bisection bandwidth network.
  • Q. Surprised by the claim that TCP works fine in this case, considering the multipath issues: would new protocols not be more appropriate? With this technique, we didn’t see many things blow up.
  • Q. Do you think it could work if the fibre is cut, and how will it affect the network? Current system doesn’t take this into account, but since there is dynamic monitoring, we could detect that and handle it.
  • Q. Won’t you have to reconfigure faster to catch short, bursty flows, and then isn’t there a risk of oscillations? Didn’t see that in our experiments.
  • Q. What is the cost of these optical technologies, and are they practical today? Expensive fixed cost, but the per-port marginal cost is not so high, which makes it competitive. A mature technology that is already on the market.
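The controller's circuit-configuration step can be sketched as a maximum-weight matching over the rack-to-rack demand matrix. The paper uses Edmonds' algorithm; this toy version brute-forces all pairings of a small, even number of racks, and the demand values are illustrative:

```python
# Sketch of the c-Through circuit configuration: aggregate per-rack
# demand vectors into a traffic matrix, then connect rack pairs so that
# the total demand carried by optical circuits is maximized (a maximum-
# weight matching; brute force stands in for Edmonds' algorithm here).
from itertools import permutations

def best_circuits(demand):
    """demand[i][j]: buffered bytes from rack i to rack j. Assumes an
    even number of racks. Returns the set of rack pairs to connect."""
    racks = range(len(demand))
    best, best_pairs = -1, None
    for perm in permutations(racks):
        # pair racks (perm[0],perm[1]), (perm[2],perm[3]), ...
        pairs = [(perm[i], perm[i + 1]) for i in range(0, len(perm), 2)]
        # circuits are rack-to-rack, so count demand in both directions
        w = sum(demand[a][b] + demand[b][a] for a, b in pairs)
        if w > best:
            best, best_pairs = w, {frozenset(p) for p in pairs}
    return best_pairs

demand = [[0, 9, 1, 1],
          [9, 0, 1, 1],
          [1, 1, 0, 8],
          [1, 1, 8, 0]]
assert best_circuits(demand) == {frozenset({0, 1}), frozenset({2, 3})}
```

In the real system, the demand matrix comes from the enlarged socket-buffer occupancy reported by hosts, and the chosen pairs are pushed to the optical switch and the hosts' VLAN tagging rules.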

Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers

  • Talk is about combining electrical packet switches (ePSs) and optical circuit switches (oCSs) in a data center network. Both cost $500/port. But ePS is limited to about 1G or 10G (maybe 40G or 100G in future), and oCS is rate-free. ePS uses 12W/port and oCS uses 240mW/port. Finally the oCS doesn’t require a transceiver, which costs another watt per port. But the downside of oCS is the 12ms switching time. ePS suited to bursty, uniform traffic, whereas oCS suitable for stable, pair-wise traffic.
  • Switching delay is due to mirrors on motors, which must be repositioned to switch the circuit. This simply gives a full crossbar circuit switch, which does not decode packets and needs an external scheduler.
  • Wavelength division multiplexing uses one wavelength for a channel. WDM mux and demux are used on the electrical packet switch transceivers.
  • Stability needs to be increased, using aggregation. Processes are more likely to communicate than threads; racks are more likely to communicate than servers; data centers are more likely to communicate than pods. Sweet spot is modular data centers.
  • With 64 pods, with 1024 hosts per pod, with a 10% electrical network (10:1 oversubscribed), need $6.3M, 96.5kW and 6656 cables. With a 100% electrical example, it would cost $62.2M, use 950kW and need 65,536 cables. Helios costs $22.1M, uses 157.2kW and needs 14016 cables.
  • Optical switch has a simple software agent, and the intelligence is in the centralized topology manager. Control loop estimates the traffic demand (hard to do), computes the optimal topology for maximum throughput, and then configure the pod and circuit switches.
  • Estimate: will this flow use more bandwidth if we give it more capacity (is it an elephant flow or a mouse flow)? However the results are biased by the current topology. So use the Hedera algorithm (NSDI 2010) which assumes all hosts are connected to an ideal crossbar switch, then compute the max-min fair bandwidth fixpoint.
  • The optimal topology is computed as a max-weight perfect matching on a bipartite graph, using Edmonds’ algorithm.
  • Testbed used two networks: a traditional one and a Helios network. 100% bisection bandwidth is 240Gb/s. Used 26 servers, and various switches including an optical circuit switch.
  • Ran Hadoop on this network, but didn’t get good numbers because the network was massively overprovisioned.
  • Got 190Gb/s peak and 171Gb/s on average on the traditional network, with drops due to hash collisions. The 50Gb/s difference from the full bisection bandwidth is the TCP overhead.
  • Helios got 160Gb/s peak and 43Gb/s average, due to some quirks of the packet-switched routers, such as port debouncing (which prevents false positives on ports coming up) leading to poor performance on reconfiguration. Turning that off got the average up to 87Gb/s. Turning off EDC got a 142Gb/s average. The remaining overhead is a limitation in the software. Still have 27ms gaps, due to some switching delay.
  • Helios used unidirectional circuits but there are bidirectional circuits as well. Unidirectional doesn’t waste bandwidth on the return path, which leads to a daisy chain topology.
  • First paper to demonstrate WDM in a hybrid electrical/optical network.
  • Q. Have you thought about how the traffic demand estimation technique would work at lower levels (down to within a pod, a rack, a server, a process)? The Hedera demand estimator works on the level of TCP-flows, so we could do that. Would the bias you get become stronger? [Taken offline.]
  • Q. The number of electrical and optical switches you provision is an a priori design decision, so how would you address changing traffic patterns? The way around that is to build a hybrid electrical/optical switch.
  • Q. Have you thought about application-limited flows, where there is a bottleneck in the application that stops it using the additional bandwidth? Sensitive to the elephant flow classification. The whole pipeline depends on a good classification. Wouldn’t it be better to use OS modification (per c-Through)? Prefer not to modify the host.
  • Q. What would happen, if you didn’t have such short RTTs (such as in an aggregation network), to the end-to-end flows without buffering? It’s not clear that this would do so well (unmodified) between data centers, but the switching technology is well-suited.
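The topology computation described above can be sketched as a max-weight perfect matching on a bipartite graph (which pod each circuit should point at). The paper uses Edmonds' algorithm; this toy version brute-forces permutations, and the demand values are illustrative:

```python
# Sketch of Helios's topology computation: given the (Hedera-estimated)
# pod-to-pod demand matrix, choose where each pod's circuit points so
# that the total demand served by circuits is maximized -- a max-weight
# perfect matching on a bipartite graph, brute-forced here.
from itertools import permutations

def best_topology(demand):
    """demand[i][j]: estimated demand from pod i to pod j.
    Returns a tuple t where pod i's circuit points at pod t[i]."""
    k = len(demand)
    best, best_assign = -1, None
    for assign in permutations(range(k)):
        if any(i == assign[i] for i in range(k)):
            continue  # a circuit from a pod to itself is useless
        w = sum(demand[i][assign[i]] for i in range(k))
        if w > best:
            best, best_assign = w, assign
    return best_assign

demand = [[0, 8, 2],
          [1, 0, 9],
          [7, 3, 0]]
assert best_topology(demand) == (1, 2, 0)  # carries 8 + 9 + 7 = 24
```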

Scalable Flow-Based Networking with DIFANE

  • A scalable way to apply fine-grained policies in enterprises.
  • Want to support flexible policies, such as access control rules, customized routing (e.g. Skype calls on a low-latency path) and measurement (e.g. detailed HTTP traffic statistics).
  • Flow-based switches store their rules in a high-speed TCAM, and perform simple actions based on those rules. The flow space has at least five dimensions. Want to specify these in a high-level management system and enforce low-level rules in the TCAM. Want to support large numbers of hosts, switches and policies with limited TCAM space.
  • If you pre-install the rules in the switches, this is simple, but it doesn’t support host mobility and switches don’t have enough memory for all rules.
  • Alternatively (per Ethane, NOX), install the rules on demand, buffering the first packet while the rules are looked up in the controller. The first packet always misses the rules, the buffering adds switch complexity, and there is a risk of DoS from an attacker sending many packets with different headers.
  • DIFANE supports host mobility, reduces memory usage and keeps all packets in the data plane.
  • Stage 1: controller proactively generates rules and sends them to some authority switches. The flow space is partitioned between the authority switches.
  • Stage 2: authority switches keep packets in the data plane. When a packet misses the cache, it is redirected to the authority switch, which forwards it on and sends the matching rules back to the ingress switch to cache. Subsequent packets hit the cache and are forwarded directly. There is no longer a race between updating the cache and forwarding subsequent packets.
  • A small set of coarse-grained wildcard rules is used to give the partition function for authority switches. Not a DHT, since wildcards are used in the rules.
  • A switch’s TCAM has cached rules, authority rules (if the switch is an authority switch) and partition rules (to route to an authority switch). Prefer cached rules and authority rules over partition rules.
  • Switch prototype built with an OpenFlow switch.
  • Tricky to cache rules when wildcard rules may overlap (with different priorities). Therefore have to generate new rules based on contiguous subregions. Partition based on minimizing the TCAM entries in switches. Use a decision-tree-based rule partition algorithm to decide where to place the splits in the flow space.
  • Need to handle policy changes at the controller, topology changes at the switches and host mobility.
  • Evaluated prototype by implementing DIFANE in a kernel-level Click-based OpenFlow switch. Traffic generator, switches and controller run on separate 3GHz Xeons.
  • NOX sees a 10ms RTT delay for the first packet, but DIFANE sees a 0.4ms delay.
  • DIFANE can easily be implemented in hardware, whereas NOX requires more software intervention.
  • For peak throughput (one authority switch, single-packet flow), NOX hits an ingress switch bottleneck at 20Kflows/sec with one ingress switch, and then reaches a controller bottleneck with more ingress switches.
  • How many authority switches? Depends on number of rules. Campus network has 30K rules, which is assumed to be 160KB of TCAM memory. This leads to about 3 authority switches. An IPTV network with 5M rules requires 1.6MB of TCAM and would require 100 authority switches.
  • Tension between distributed (switch-based) and centralized (controller-based, easier to manage) operation. DIFANE is a point in between these extremes.
  • Q. How realistic are your assumed TCAM sizes? Already have 160 KB TCAMs, so we would just use more switches.
  • Q. If you have a slow path you can scale much better, so why do you want to keep everything on the fast path? [Taken offline.]
  • Q. Did you experiment with cache replacement policies? Much work done on how to cache rules, so we can just leverage that.
  • Q. What about the importance of dynamic rules that might change frequently, and how can DIFANE handle it? Think that only traffic engineering needs such dynamic rules. DIFANE can get the controller involved to manage these. But the performance gain is not much over OpenFlow in that scenario. Isn’t a benefit of OpenFlow that you can implement e.g. authentication at the application level? Yes, but we can get the controller to push this into the rules.
  • Q. Is there a cost to have all switches be authority switches? Depends on the network and how it is used. Why not make every switch an authority switch? May need more redirection, and hence more stretch. Also the rules will become smaller.
  • Q. Does this make internet traffic more unpredictable? A reasonable comment, but since we know the location of the authority switch, we know the paths that the traffic may take.
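The lookup and caching behaviour described above can be sketched in a few lines. This is a toy simulation under my own simplifications (a single match field instead of a five-dimensional TCAM match, and invented helper names like `process` and the `goto:` action), not DIFANE's actual switch implementation:

```python
# A minimal sketch of DIFANE's data-plane rule lookup. Each rule is
# (priority, predicate, action); cached and authority rules are preferred
# over the coarse partition rules that redirect to an authority switch.

class Switch:
    def __init__(self, name):
        self.name = name
        self.cached = []      # cached fine-grained rules (highest preference)
        self.authority = []   # authority rules, if this is an authority switch
        self.partition = []   # wildcard rules pointing at authority switches

    def add_rule(self, table, priority, predicate, action):
        getattr(self, table).append((priority, predicate, action))
        getattr(self, table).sort(key=lambda r: -r[0])  # highest priority first

    def lookup(self, pkt):
        # Prefer cached and authority rules over partition rules, as in the talk.
        for table in (self.cached, self.authority, self.partition):
            for priority, predicate, action in table:
                if predicate(pkt):
                    return action
        return None

def process(ingress, authorities, pkt):
    """First packet: redirected to the authority switch, which forwards it and
    feeds the matching rule back for the ingress switch to cache. Later
    packets hit the cache and stay entirely on the data plane."""
    action = ingress.lookup(pkt)
    if isinstance(action, str) and action.startswith("goto:"):
        auth = authorities[action[len("goto:"):]]
        real_action = auth.lookup(pkt)
        ingress.add_rule("cached", 10,
                         lambda p, k=pkt["dst"]: p["dst"] == k, real_action)
        return real_action   # the authority switch forwards the packet itself
    return action

# Example: one ingress switch, one authority switch owning the whole flow space.
auth = Switch("A1")
auth.add_rule("authority", 5, lambda p: p["dst"] == "10.0.0.2", "deny")
auth.add_rule("authority", 1, lambda p: True, "permit")

ingress = Switch("S1")
ingress.add_rule("partition", 0, lambda p: True, "goto:A1")

pkt = {"dst": "10.0.0.2"}
print(process(ingress, {"A1": auth}, pkt))   # first packet goes via A1
print(ingress.lookup(pkt))                   # subsequent packets: cache hit
```

Note that the redirected packet never waits on a controller: the miss is resolved by another switch, which is the point of keeping everything on the fast path.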

Social Networks

An analysis of Social Network-based Sybil defenses

  • Many online services allow attackers to create accounts for free and they can hence manipulate the system.
  • Defense approaches: trusted certification (such as SSN or passport number), or resource challenges (e.g. cryptopuzzles, not hard to solve if you can get cloud resources on demand). Or can use links in the social network to detect Sybils, since we presume that attackers can only create a limited number of links to non-Sybil users. Spawned a lot of research.
  • Unanswered questions: since the schemes use different mechanisms, it is unclear how the schemes are related, or whether there is a common insight across the schemes? This would help us understand the limitations of the defenses.
  • Talk proposes a new methodology for comparing these systems and finds that they all work in a similar manner. It implies that they have a hidden dependence on the network structure, which identifies the limitations of the schemes.
  • The interesting fact is how these schemes identify nodes as Sybils.
  • Schemes take a social network and a single trusted node, and declare Sybils from the perspective of the trusted node. Internally, each node has a Sybil probability, which gives a ranking of nodes by Sybilness. Can this ranking be used to compare schemes?
  • Compared rankings from each scheme on the same social graph. The ordering was jumbled between the different schemes, but all schemes seemed to share a cut-off point at which the partitions, though internally unordered, were roughly equal.
  • The cut-off point comes at the boundary of the local community. So all schemes are effectively detecting communities. Nodes in the local community are ranked higher, but the ranking within and outwith the community are in no particular order. Can we then leverage the work on community detection to design new approaches?
  • Bad news: this depends on the graph having monolithic community structure, and the characteristics of the community around the trusted node.
  • Does this make certain network structures more vulnerable? Does having more community structure make it harder to identify Sybils? Evaluated this on various real-world social networks. Simulated a Sybil attack by adding Sybils (5% attack links and 25% Sybil nodes). Accuracy is measured using the ranking, i.e. the probability that Sybils will be ranked lower than non-Sybils. Compared the amount of community structure (modularity) to the accuracy: modularity seems to be negatively correlated with accuracy.
  • How can the attacker use this intuition? Can he do better than just choosing random links? For example, by placing links closer to the trusted node, the attacker could blend in to the community of the trusted node. The experiment ranks the nodes and gives the attacker the ability to place links randomly among the top N nodes. Smaller N implies an attacker with more control. The graph shows that an attacker with more control reduces the accuracy of the algorithms.
  • Moving forward: could be useful for whitelisting nodes, and could potentially incorporate information from more layers to make the decision about who is a Sybil.
  • Q. Have you evaluated where the number of Sybil nodes far exceeds the number of attack links? The results hold in those settings as well.
  • Q. Attacks are launched from compromised and fake accounts, so how do you deal with this? This violates the basic assumption that the attacker has few real links, so none of these schemes will work.
  • Q. What if the Sybils form multiple communities? No matter the Sybil topology, as long as the number of attack links is small, none of these schemes will work.
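The accuracy metric used in the evaluation above can be made concrete: it is the probability that a randomly chosen Sybil is ranked below a randomly chosen non-Sybil. A small sketch with invented node names and scores (this is my reading of the metric from the talk, not the paper's code):

```python
# Accuracy of a Sybil-defense ranking: the probability that a random Sybil
# is ranked lower (less trusted) than a random non-Sybil, with ties
# counted as half. An ideal scheme scores 1.0; random ranking scores ~0.5.

def ranking_accuracy(scores, sybils):
    """scores: node -> trust score from some scheme (higher = more trusted).
    sybils: the set of nodes that are actually Sybils."""
    honest = [n for n in scores if n not in sybils]
    bad = [n for n in scores if n in sybils]
    wins = 0.0
    for h in honest:
        for s in bad:
            if scores[s] < scores[h]:
                wins += 1
            elif scores[s] == scores[h]:
                wins += 0.5
    return wins / (len(honest) * len(bad))

# A scheme that separates the two groups completely is perfectly accurate:
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "x": 0.2, "y": 0.1}
print(ranking_accuracy(scores, {"x", "y"}))   # 1.0
```

This is the same pairwise-comparison form as AUC, which is why it is insensitive to the (arbitrary) ordering within each partition that the talk observed.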

The Little Engine(s) that could: Scaling Online Social Networks.

  • Systems should be scalable, but it can be hard to implement and is not necessary at the start of an online service. Of course, this can lead to a success disaster. The cloud gives hardware scalability, but no automatic application scalability.
  • Frontend, stateless components are easy to make transparently scalable, but the data source is a bottleneck.
  • Obvious solution is full replication of the DB, but the state doesn’t decrease with the number of servers. However it maintains data locality.
  • Next most likely solution is horizontal partitioning/sharding, but the splits are disjoint, which is bad news for OSNs. The shards cannot be disjoint, because OSNs involve queries across social links, or data dissemination across social links. Presumably want to colocate all of your friends on the same server.
  • Relational databases don’t perform well under horizontal partitioning, and are expensive, so people use DHTs. These perform better but there is no SQL, less abstraction, and they suffer under high traffic (incest, multi-get hole, jitter). Also a DHT gives random partitioning, which means many servers will be hit with a particular update, and there is a high replication overhead.
  • Can leverage underlying social structure to make the partition. The SPAR (Social Partitioning And Replication) algorithm does this.
  • Algorithm has to be online (due to system and social network dynamics), fast and simple (using local information, a hill-climbing heuristic and back-pressure load balancing), stable (no cascades) and effective (approximating an NP-hard problem: minimize replicas while also maintaining a level of redundancy).
  • Evaluated on real OSN data (Twitter/Orkut/Facebook). Looked at various algorithms, including random partitioning, MO and METIS.
  • SPAR has a lower replication overhead than the other algorithms, with only 22% overhead over the replication constraint.
  • Three-tier application: front end and application logic are stateless on top, with SPAR middleware in the application logic and the data store (to intercept messages). The SPAR controller, partition manager and directory service coordinate the whole system. To applications, SPAR is totally transparent, implemented on top of MySQL and Cassandra, but could be implemented using other things.
  • Evaluated using a non-distributed Twitter clone (StatusNet) and real Twitter data, and saw if it could scale up across 16 commodity desktop machines. Within a bound on 99th-percentile latency, MySQL with full replication sustained 16 requests per second, whereas SPAR+MySQL sustained 2500 requests per second. Vanilla Cassandra did 200 req/s, whereas SPAR+Cassandra did 800 req/s.
  • Q. Can you replicate e.g. Facebook pictures based on the groups of friends? The rule is applied when processing the query itself, though some redundant data would be stored.
  • Q. Have you looked at incorporating more dynamic interaction behaviors in the partitioning algorithms? We have considered adding weights.
  • Q. Any thoughts on Diaspora? Only know what I read in the news and that it’s fully distributed, so don’t think there will be such a thing as a global data store.
  • Q. []? The more clustered you are, the less replication you will need. The results are consistent for large data sets.
  • Q. Would the replication overhead for Orkut not be higher? 12 or 16.
  • Q. Where is the notion of load per server? Would this not allocate servers that have absolutely no work to do? Details in paper.
  • Q. Are there not better designs than a read fan-out? Arguably.
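The partitioning objective above can be illustrated with a toy greedy heuristic: place each user's master replica so that friends end up colocated, minimizing the slave replicas needed for local reads. This is a deliberate simplification of my own (it ignores SPAR's redundancy constraint and back-pressure load balancing, and the function names are invented), not the paper's algorithm:

```python
# Toy social partitioning: on each new friendship edge, greedily consider
# moving either endpoint's master to whichever server lowers the total
# number of slave replicas (copies needed for cross-server edges).

def replica_count(master, edges):
    """Slaves needed: for each edge (u, v) spanning servers, each side needs
    a copy of the other, counted once per (user, server) pair."""
    slaves = set()
    for u, v in edges:
        if master[u] != master[v]:
            slaves.add((u, master[v]))  # copy of u on v's server
            slaves.add((v, master[u]))
    return len(slaves)

def add_edge(master, edges, u, v, servers):
    """Hill-climbing step: try moving u or v; keep the best single move."""
    edges.append((u, v))
    best = (replica_count(master, edges), None, None)
    for node in (u, v):
        old = master[node]
        for s in servers:
            master[node] = s
            cost = replica_count(master, edges)
            if cost < best[0]:
                best = (cost, node, s)
        master[node] = old
    if best[1] is not None:
        master[best[1]] = best[2]

servers = ["s0", "s1"]
master = {"alice": "s0", "bob": "s1", "carol": "s1"}
edges = []
add_edge(master, edges, "alice", "bob", servers)
add_edge(master, edges, "alice", "carol", servers)
# alice moves in with bob and carol, so no slave replicas are needed:
print(master, replica_count(master, edges))
```

In the real system the moves must also respect per-server load, which is why the heuristic needs the back-pressure component rather than pure replica minimization.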

Crowdsourcing Service-Level Network Event Detection

  • Want to identify problems that affect end-to-end performance. Do this in an online way with reliable detection and isolation.
  • Idea is to do monitoring at the edge systems and detect drops in performance.
  • Need a system that is scalable, and has localization in time and space. Scalability from passive monitoring, and fully distributed detection. Also privacy, reliability from uncontrolled hosts and wide adoption (incentive to install).
  • Approach: passively monitor local performance information (signals), and detect drops in performance. Then attempt to get group corroboration from other hosts. A likelihood ratio distinguishes network effects from coincidence. Store the data in a distributed fashion, and give the operator a tap to get that data out.
  • Evaluated the approach using peer-to-peer applications (a natural fit). This gets us an edge trace. The dataset is from a plugin called Ono, which has been installed by 1 million BitTorrent users worldwide.
  • Case study on the BT Yahoo! network, which has information about confirmed events on its web interface. Gives the dates and times of the issue arising and having been fixed.
  • BitTorrent peers monitor many performance signals, both general and protocol specific (like Torrent availability). The individual signals are noisy, having uncontrolled duration and having a wide range of values. Use some moving-average smoothing to make this easier to interpret.
  • To do group corroboration, consider why multiple local detections might occur at the same time. Could be service-specific problems (e.g. lack of a seeder), coincidence (noisy local detection), or a genuine network problem. The probability of coincidence becomes very small with a large number of users. Can tune a likelihood ratio knob to make detection more or less sensitive.
  • Evaluated in the wide-area. Don’t know the false positive or false negative rates, because ISPs wouldn’t provide information about when their network went down. Therefore use public information from BT Yahoo!, and do some work under NDA.
  • In one month of BT Yahoo! data, detected 181 events, of which 54 occurred during confirmed events. There were 14 other reported problems. The remaining events are not necessarily false positives.
  • Worked with a North American ISP under NDA. Detected 50% of events in regions with fewer than 10k subscribers.
  • Evaluated sensitivity to the likelihood ratio, detected problems 2% of the time for small moving average deviations and 0.75% of the time for larger deviations.
  • Deployed as the Network Early Warning System in 48k BitTorrent clients.
  • Q. This seems like the right approach since performance monitoring should be done at the application layer.
  • Q. Do you think that IP prefix or geolocation information would be useful for grouping? Depends on whether groupings are useful to help with the problem. Using IP prefix already.
  • Q. How are your techniques different from the earlier talks on anomaly detection? This is at the user, so the information that comes back is more useful. Why are you using moving averages compared to something more sophisticated? Wanted to implement it simply and get it incorporated in a BitTorrent client. Many schemes assume a long continuous stream of data.
  • Q. Once you have detected the events, what do you do with them? The idea is for operators to go and fetch this information. But there is a root cause analysis problem here, which is important future work in this area.
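The detection-and-corroboration pipeline above can be sketched end to end. All thresholds, window sizes and the exact likelihood-ratio form here are invented for illustration; the deployed system (NEWS) uses its own tuned variants:

```python
# Sketch of crowdsourced event detection: each peer smooths a noisy local
# throughput signal with a moving average and flags a sharp drop; the group
# then asks how unlikely it is that k of n peers flagged the same window
# by coincidence.

from math import comb

def local_detect(samples, window=5, drop=0.5):
    """Flag a performance drop when the latest sample falls below `drop`
    times the moving average of the preceding `window` samples."""
    if len(samples) <= window:
        return False
    avg = sum(samples[-window - 1:-1]) / window
    return samples[-1] < drop * avg

def coincidence_prob(n, k, p):
    """P(at least k of n peers each flag independently with probability p),
    under a binomial model of independent false alarms."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def likelihood_ratio(n, k, p):
    """Observed detection rate vs. the rate expected by chance; thresholding
    this ratio is the tunable sensitivity knob."""
    return (k / n) / p

# One peer sees a sharp drop after stable throughput:
print(local_detect([100, 98, 102, 99, 101, 40]))    # True

# 8 of 20 peers flagging at once, each with a 5% coincidental rate,
# is very unlikely to be chance:
print(coincidence_prob(20, 8, 0.05), likelihood_ratio(20, 8, 0.05))
```

The binomial term shows why corroboration works at scale: with many monitoring peers, even a modest number of simultaneous local detections is astronomically unlikely to be coincidence.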

SOSP 2009: Day 3

Monday, October 19th, 2009


Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

  • Goal to make large-scale programming simple for all developers.
  • Write a program in Visual Studio; Dryad(LINQ) takes care of shipping it to the cluster, fault tolerance, etc.
  • Wrestling with the implementation of GroupBy-Aggregate. GroupBy takes a sequence of objects with some kind of key, and groups them together by key. Similar to MapReduce.
  • Naïve execution plan splits map and reduce into two phases, with an all-to-all data exchange between them. However, applying the reduce after this exchange results in a large amount of network I/O.
  • A better idea is to do early partial aggregation: use an aggregation tree to achieve this. Reduces the disk and network I/O by up to one or two orders of magnitude.
  • Want to automate this optimization. Programmer writes the obvious code and the system takes care of the rest.
  • Notion of decomposable functions is key to this. Need an initial reducer that is commutative, and a combiner that is commutative and associative.
  • How do we decompose a function? Two ways: iterator and accumulator interface. Choice can have a significant impact on performance.
  • How do we deal with user-defined functions? Try automatic inference, but fall back to a good annotation mechanism: implement the simple function and annotate it with the initial reducer and combiner implementation function names.
  • Hadoop interface for this adds quite a lot of complexity. Java’s static typing is not preserved.
  • Iterator interface has to build an entire group and iterate through it. Accumulator can discard the inputs if they are not needed. Oracle uses this approach, implemented with stored procedures. Hard to link in user-defined procedures.
  • Automatic decomposition looks at the expression and checks whether all leaf function calls are decomposable.
  • Want our approach to have good data reduction, pipelining, low memory consumption and parallelisability (multicore). Define six strategies, accumulator- and iterator-based.
  • Iterator PartialSort approach. Idea is to keep only a fixed number of chunks in memory; processed in parallel. The bound on memory makes pipelining possible. Strategy close to MapReduce.
  • Accumulator FullHash approach builds an in-memory parallel hash table with one accumulator object per key. Objects are accumulated immediately. This gives optimal data reduction and memory consumption proportional to the number of keys, not records. This is the DB strategy (DB2 and Oracle).
  • Evaluated with three applications: WordStates, TopDocs and PageRank on a 240-machine cluster. Accumulator-based implementations…
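The decomposition described above is easy to make concrete for Average, the classic decomposable aggregate: an initial reducer produces per-partition (sum, count) pairs before the all-to-all exchange, and a commutative, associative combiner merges them at any level of the aggregation tree. A minimal sketch with invented function names:

```python
# GroupBy-Aggregate with early partial aggregation. Shipping (sum, count)
# pairs instead of raw records is what cuts disk and network I/O by the
# orders of magnitude mentioned above.

from collections import defaultdict

def initial_reduce(records):
    """Per-partition partial aggregation: key -> (sum, count)."""
    partial = defaultdict(lambda: (0, 0))
    for key, value in records:
        s, c = partial[key]
        partial[key] = (s + value, c + 1)
    return partial

def combine(partials):
    """Merge partial (sum, count) pairs; commutative and associative, so it
    can run at any level of an aggregation tree."""
    merged = defaultdict(lambda: (0, 0))
    for partial in partials:
        for key, (s, c) in partial.items():
            ms, mc = merged[key]
            merged[key] = (ms + s, mc + c)
    return merged

def finalize(merged):
    """Final pass: turn (sum, count) into the average."""
    return {key: s / c for key, (s, c) in merged.items()}

# Two "machines" each pre-aggregate locally, then a combiner merges them:
part1 = initial_reduce([("a", 1), ("a", 3), ("b", 10)])
part2 = initial_reduce([("a", 2), ("b", 30)])
print(finalize(combine([part1, part2])))   # {'a': 2.0, 'b': 20.0}
```

The accumulator-style variant would update one mutable object per key in a hash table as records arrive (the FullHash strategy), rather than materializing each group for an iterator; the decomposition into initial reducer and combiner is the same either way.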


The 9/11 Delusion

Monday, July 21st, 2008

When I saw in last Monday’s Guardian that Charlie Brooker was taking aim at 9/11 conspiracy theories, I hoped that he’d use his wide audience to present a logically watertight argument, in an entertainingly acerbic register. And buried within his piece was the quite probable suggestion that the paperwork alone would be impossible to conceal. Unfortunately, because he’s evidently paid by the ad hominem, he also said that every conspiracy theorist might as well believe that he is the Emperor of Pluto, and unleashed a firestorm in the online comments. By opening up too many fronts in this debate, he left himself open to attacks, even from other Guardian commentators.


At least he’s got a bag for life

Sunday, May 25th, 2008

The law of unintended consequences often comes shopping with me.


My name is not Bob

Sunday, May 18th, 2008

This weekend finds me back in Glasgow to visit my parents, and I’ve spent much of the afternoon clearing out a desk that I’d used since 1996. Filled with memories, old tickets and trinkets, my first football match, my first gig, my first trip to London on my own, my trip to Cambridge for interview. Filled with old school work and school reports (Computing – “I am pleased with my grades, and I like computers.” R.E. – “I am pleased with my progress in the short course, and look forward to its completion.”), and the surprising insight that I apparently had a “particular ability at Volleyball.” Filled with greetings cards from old friends, people I barely remember, and people I’d rather forget.


Strangers on a plane

Friday, May 16th, 2008

The other day, I was getting off a plane from Istanbul back to Stansted, and retrieving my Duty-Free carry-on, when a fellow passenger accosted me:

“I think that’s my bag,” he said.

“I’m fairly sure it’s not.”

“Does it have Turkish Delight in it?”

“Well, yes….”

Civic Pride

Monday, May 5th, 2008

Growing up in Glasgow, I was exposed to more than my fair share of internecine rivalries: when I was more serious about blogging, I planned a grand series of posts cataloguing every single one of them. Easy, I thought, there’s the other football team, the other side of the river, the suburbs, the other city, and don’t even get me started on the English.


The Apprentice: I demand a recount!

Thursday, June 14th, 2007

I accept that it’s a bit jejune to get worked up about Reality TV, but last night’s Apprentice took the biscuit.