Wireless #2: Programming and Transport
Wishbone: Profile-based Partitioning for Sensornet Applications
- Extension to the WaveScope system.
- Example application of locating marmots: listening out for loud alarm calls when confronted by a predator. High data rate application (node has four microphones). Used for sensing applications (animal localisation, pothole detection, computer vision, pipeline leak detection, speaker identification and EEG seizure detection). All expressible as a data flow graph. Predictable data rates. But run on heterogeneous platforms (CPU and radio bottlenecks).
- Used Linux servers, smartphones (Java-based and iPhones) and Meraki routers. Want to be able to mix and match. Have to deal with incompatible software environments (languages, SDKs and OSes).
- Application is a dataflow graph (edges are streams; nodes are stream operators). Inputs are from sensor sources, outputs are results either to the user or a server. System partitions this graph between the embedded nodes and the server side. Wishbone handles serialisation and deserialisation across the network interface. Compiles and loads subgraph onto any platform. Replicates graph across all nodes (but may have different partitions depending on the node).
- Contributions. First broadly portable sensor net programming environment. Partitioning algorithm. Optimises CPU/radio tradeoff.
- WaveScope compiler gives dataflow graph in portable IL. Then have a backend code generator for each of the target platforms.
- Wishbone needs sample data from the user to make the correct partitioning decisions.
- Target a TinyOS mote. (16-bit microcontroller, 10KB RAM, no memory protection, no threads, task-granularity messaging model.) Not directly compatible with the WaveScope threading model. Use profile-directed cooperative multitasking, which makes good decisions (based on profiling information) about where to place yield points.
- Profile streams and operators to find execution times, data rates. Separately profile network connections.
- Some data flow nodes must be pinned to a particular platform (e.g. source and sink nodes, stateful global operators). Others are unpinned (stateless and locally stateful nodes).
- Assign weights to the edges (net bandwidth) and nodes (CPU cost) in the dataflow graph. Interested in the sum of CPU cost on the embedded device, and the sum of edges crossing the network boundary. Formulate as an Integer Linear Program. Enforce a resource bound on the embedded device (sum over nodes on the embedded device) and a network bound (sum over cut edges). Then minimise a linear combination of the CPU and network cost. (The tricky bit is to enforce the other constraints (pinning and graph topology) while staying linear.) Can set a parameter to trade off between CPU and radio, or else set it to reflect energy consumption.
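The ILP itself isn't shown in the talk, but the objective it optimises can be checked by brute-force enumeration on a toy graph. A sketch (all operator names, costs and bounds below are invented for illustration; the real system enforces graph-topology constraints that a brute force over a linear pipeline doesn't need):

```python
from itertools import product

# Toy dataflow graph: node -> CPU cost on the embedded device,
# edge -> bytes/sec on the stream. Numbers are illustrative only.
cpu_cost = {"src": 1, "filter": 4, "fft": 8, "detect": 2, "sink": 0}
edges = {("src", "filter"): 100, ("filter", "fft"): 60,
         ("fft", "detect"): 10, ("detect", "sink"): 2}
pinned = {"src": "node", "sink": "server"}  # sources/sinks must stay put

def cost(assign, alpha=1.0, beta=1.0, cpu_bound=15, net_bound=80):
    """alpha*CPU + beta*net for one assignment, or None if a bound fails."""
    cpu = sum(c for v, c in cpu_cost.items() if assign[v] == "node")
    net = sum(bw for (u, v), bw in edges.items() if assign[u] != assign[v])
    if cpu > cpu_bound or net > net_bound:
        return None
    return alpha * cpu + beta * net

def best_partition():
    free = [v for v in cpu_cost if v not in pinned]
    best, best_cost = None, float("inf")
    for choice in product(["node", "server"], repeat=len(free)):
        assign = {**pinned, **dict(zip(free, choice))}
        c = cost(assign)
        if c is not None and c < best_cost:
            best, best_cost = assign, c
    return best, best_cost
```

With these toy numbers, keeping the whole pipeline on the node wins because the processing reduces the data crossing the radio boundary to almost nothing.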
- Evaluated by looking at bandwidth versus compute cost on a linear pipeline of operators (evaluated on iPhone as the embedded node). Observe that processing reduces data quantity overall, but this is not monotonic.
- Evaluated by changing the input data rate of the application. Look at how many operators remain in the optimal node partition. (EEG graph with multiple channels partitions horizontally (some channels on server, some on device).)
- Given CPU and network bounds, can find an optimal partition if it exists. The partition gives estimated throughput. So do a binary search over CPU bounds to find the maximum possible throughput.
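The search over CPU bounds is a standard binary search on a monotone feasibility predicate; a minimal sketch, where the predicate stands in for "the partitioner found a feasible partition at this rate":

```python
def max_throughput(feasible, lo=0.0, hi=1000.0, eps=0.5):
    """Find the largest input rate for which a partition exists.

    `feasible(rate)` stands in for running the partitioner under a given
    resource bound, and is assumed monotone: feasible at some rate implies
    feasible at any lower rate, which is what makes binary search valid.
    """
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            lo = mid  # a valid partition exists at this rate; push higher
        else:
            hi = mid
    return lo
```

For example, with the toy predicate `lambda r: r <= 220.0` the search converges to within `eps` of 220.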
- Ground truth is how many detections can actually be obtained from a network of TMotes. Also looked at the compute/bandwidth tension in a single TMote/basestation network.
- Related to graph partitioning for scientific codes (Zoltan in symmetric supercomputers), task scheduling (list scheduling), MapReduce/Condor, and Tenet and Vango (sensor network context, run TinyOS on embedded and server devices).
- Q: have you considered situations where operators could tune the amount of processing that they do (or tolerate packet loss)? Considered it, but there are several generalisations we would look at first. Does it really change the structure of the system? Yes, we’d probably need a new partitioning algorithm.
- Q: would it be worth profiling the amount of synchronisation necessary for pinned (global stateful) operators? We have strong consistency requirements, so we’d have to have a relatively expensive operation to synchronise this.
- Q: have you applied the same sort of partitioning to the scenario where different sensors need different partitions? The system lets you port an application across different platforms, but assumes a homogeneous network; you could run the algorithm multiple times (not considered in this paper).
- Q: what is the effect of additional load on available bandwidth (interference, etc.)? Currently assume all nodes transmit at the same data rate, and use this assumption to assign channel capacity.
Softspeak: Making VoIP Play Well in Existing 802.11 Deployments
- VoIP and WiFi are becoming increasingly popular. Now even cellphones are becoming available with these capabilities (even Skype on iPhone). If these get used heavily, what happens to existing 802.11 deployments?
- Imagine a commodity AP in a coffee shop. Most users are data users, and maybe there are one or two VoIP users. But as VoIP becomes more popular, the number of VoIP users will increase.
- What does this do to call quality and to data users?
- 802.11 was designed for data networks and has substantial per-packet overhead (headers, ACK) and contention (backoff and collision). VoIP has small packets, a high packet rate (20–100 packets per second), and does not respond to congestion. So VoIP makes inefficient use of WiFi.
- Measured impact as residual capacity (TCP/UDP throughput) and “mean opinion score” (MOS, how audio appears to a real person, calculated based on loss and jitter).
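The talk doesn't give the MOS calculation; a common way to compute it is the ITU-T G.107 E-model, sketched here with an illustrative G.711 loss-impairment fit (the constants are standard published approximations, not taken from the Softspeak paper):

```python
import math

def r_to_mos(R):
    """Standard E-model R-factor to MOS mapping (ITU-T G.107)."""
    if R < 0:
        return 1.0
    if R > 100:
        return 4.5
    return 1 + 0.035 * R + R * (R - 60) * (100 - R) * 7e-6

def mos(one_way_delay_ms, loss_frac):
    """Simplified E-model: start from R = 93.2 and subtract delay and
    loss impairments. The G.711 loss term is a common published fit;
    none of these constants come from the talk itself."""
    d = one_way_delay_ms
    Id = 0.024 * d + (0.11 * (d - 177.3) if d > 177.3 else 0.0)
    Ie = 30 * math.log(1 + 15 * loss_frac)
    return r_to_mos(93.2 - Id - Ie)
```

A lightly loaded call (50ms, no loss) scores close to the 4.5 ceiling; adding delay and loss pulls the score down, which is the degradation the experiments measure.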
- Used an 802.11 testbed and gradually enabled VoIP stations. Downlink MOS tails off much faster than uplink MOS (after 4 VoIP stations). Degradation of TCP throughput is linear, but much more severe than would be expected from the size of the VoIP packets.
- Possible solutions: could decrease VoIP packet rate, or use higher speed networks (802.11g/n). Lowering the packet rate still degrades MOS. Higher speed networks run in “protection mode” in presence of older 802.11 versions, which loses many of the benefits. VoIP is too infrequent to benefit from 802.11n aggregation.
- Could prioritise VoIP traffic (802.11e), which increases the contention overhead, and further reduces the residual capacity.
- Softspeak does prioritised TDMA for the uplink and addresses contention overhead. By serialising channel access, we avoid collisions. Cycle between 10 TDMA slots, each of 1ms (1ms accommodates current codecs well; could be varied). VoIP stations have to establish a schedule, synchronise clocks and compete with non-TDMA traffic.
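The slot arithmetic implied above can be sketched as follows (the 10-slot, 1ms cycle follows the talk; slot assignment and clock synchronisation are assumed to have already happened):

```python
def next_slot_start(now_ms, my_slot, n_slots=10, slot_ms=1.0):
    """Time of the next start of `my_slot` in a repeating TDMA cycle.

    A station holds off transmitting until its own slot comes around;
    returns `now_ms` itself if the slot is starting right now.
    """
    cycle_ms = n_slots * slot_ms
    slot_offset = my_slot * slot_ms
    pos = now_ms % cycle_ms                # position within the current cycle
    wait = (slot_offset - pos) % cycle_ms  # 0 if the slot starts right now
    return now_ms + wait
```

For example, a station holding slot 7 that wakes at t = 23.5ms would wait until t = 27ms.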
- Non-VoIP stations are unaware of TDMA, which may prevent VoIP stations from sending on time. If a data user wins contention, the VoIP station cannot transmit; if they collide, this is worse. Use 802.11e VoIP prioritisation to improve VoIP quality, but TDMA means that we don't see the same data-rate degradation. Even quick collision recovery for VoIP means that we may overrun our TDMA slots: can use multiple priority levels to address this (though we can exhaust the priority levels). Worst case, fall back to regular 802.11, and do no worse than 802.11.
- Experiment by visualising TDMA, with ten TDMA VoIP stations, and some CSMA/CA background data traffic. Most transmissions should start in their own or the next slot.
- Softspeak also does downlink aggregation. VoIP stations will overhear aggregated packets and extract their own portion of the packet.
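A minimal sketch of aggregate-and-extract, assuming a made-up frame layout of (station id, payload length, payload bytes) per entry; the actual Softspeak encoding is not specified in the talk:

```python
import struct

def aggregate(payloads):
    """Pack several stations' VoIP payloads into one downlink frame.

    `payloads` maps a station id to its voice payload bytes. The header
    format here (two 16-bit fields per entry) is purely illustrative.
    """
    frame = b""
    for station_id, data in payloads.items():
        frame += struct.pack("!HH", station_id, len(data)) + data
    return frame

def extract(frame, my_id):
    """Each station overhears the frame and pulls out only its own part."""
    off = 0
    while off < len(frame):
        station_id, length = struct.unpack_from("!HH", frame, off)
        off += 4
        if station_id == my_id:
            return frame[off:off + length]
        off += length
    return None
```

One frame now carries many stations' packets, so the per-packet 802.11 overhead (headers, ACK, contention) is paid once per aggregate instead of once per call.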
- No modification to wireless card, access point or VoIP application. Softspeak controller registers IP address and port with the aggregator (wired connection to AP). Implemented for Skype and Twinkle.
- Evaluated for call quality and residual throughput. TCP data traffic and 10ms voice codec. (More results in the paper.) Adding TDMA alone has little effect over 802.11b. Aggregation by itself greatly improves downlink MOS (and slightly improves residual throughput). Softspeak in total greatly improves downlink MOS (almost back at single VoIP station level) and also improves residual throughput by 5x.
- For 802.11g, get a 3x improvement in residual throughput, slight degradation in downlink MOS and large improvement in uplink MOS.
- Also looks at performance when contending with web traffic. The bulk TCP upload improvement disappears, but the combined TCP capacity improvement is preserved. Softspeak doesn’t change the basic behaviour of the data traffic.
- Q: how would the result change if you were measuring against multiple TCP flows? Experiment we did with web traffic showed this. Multiple bulk TCP flows? In one direction, saw same level of improvement; in the other saw queuing at the AP and would need some form of prioritisation at the AP.
- Q: how could this work in a real network (because you have more than one collision domain; how do you assign slots across multiple hops?)? Didn’t evaluate the case with multiple collision domains; details of slot assignment in the paper. Use an 802.11 management protocol for probe response which increases the range when it can be heard. Doesn’t completely eliminate hidden terminal problem, but we hope that this would apply to TDMA.
- Q: with multiple APs and multiple slots, wouldn’t you need some kind of centralised control? Could work across multiple APs.
- Q: did you measure the delay performance of this? Incorporated in the MOS value.
- Q: did you compare to giving short packets high priority with a rate limiter? This is part of the 802.11e specification, and we measured this. Why does aggregation win by reducing the packet per second rate? Typically if you have short packets, you use more bandwidth.
- Q: if I am an enterprise administrator, why not just reserve a single channel for VoIP traffic, because it no longer has to contend with data or non-VoIP traffic? That would be a good solution if you had that available to you.
Block-switched Networks: A New Paradigm for Wireless Transport
- Today, TCP performs poorly over wireless links. On a bad link, you get 40x goodput difference between UDP and TCP. Median has 2x goodput difference, and a good link has 1.6x goodput difference.
- End-to-end rate control is error-prone and doesn’t work well in wireless networks. Loss feedback does not distinguish congestion and corruption losses. End-to-end retransmissions are wasteful in a multi-hop wireless network. Route disruptions cause unavailability (could use a disruption-tolerant protocol).
- The use of packetisation introduces non-trivial overhead (for channel access: listen, backoff, RTS/CTS; and link-layer ARQ).
- Also see complex cross-layer interaction: link-layer ARQs and backoffs hurt TCP rate control.
- Hop is a clean-slate redesign. End-to-end becomes hop-by-hop. Packets become blocks to amortise overhead.
- Reliable block transfer, ACK withholding and micro-block prioritisation on a per-hop basis. Virtual retransmission and backpressure components on a multi-hop basis.
- Hop sends data in 802.11 burst mode. CSMA performed only before a TXOP (burst). Block-SYN and Block-ACK (bitmap of packets; only resend missing packets).
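The Block-ACK bitmap logic can be sketched as follows (sequence numbering within a block is illustrative; the wire format is not given in the talk):

```python
def block_ack_bitmap(received):
    """Receiver side of a B-ACK: one bit per packet in the block,
    set if that packet arrived."""
    bits = 0
    for seq in received:
        bits |= 1 << seq
    return bits

def missing_packets(bitmap, block_size):
    """Sender side: resend only the packets whose bit is clear."""
    return [seq for seq in range(block_size) if not bitmap & (1 << seq)]
```

So for an 8-packet block where packets 2, 5 and 6 were lost, only those three are retransmitted rather than the whole block or an end-to-end window.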
- Virtual retransmission uses in-network caching and retransmits only on cache miss. Gives fewer transmissions, low overhead and a simple solution.
- Backpressure mechanism limits the number of outstanding blocks per-flow at each forwarder (e.g. limited to 2 outstanding blocks). This improves network utilisation in the case where hop bandwidth is asymmetric.
- ACK withholding mitigates hidden terminals better than RTS/CTS (which is overly conservative with high overhead). Buffer B-ACK messages while one terminal is sending, and then send one of the buffered B-ACKs when it is finished.
- Micro-block prioritisation improves performance for applications like SSH and text messaging. Sender piggybacks small blocks on the B-SYN and receiver prioritises small-block B-ACKs. This gives low delay for small blocks.
- Implemented on a testbed on the 2nd floor of the UMass CS building. 20 Mac Minis form an ad-hoc network.
- Hop achieves significant goodput gains over TCP (1.6x at median, 28x at first quartile, 1.2x at third quartile).
- Also on single-flow multi-hop performance (more modest gains: Q1 2.7x, median 2.3x, Q3 1.9x).
- Achieves graceful degradation with loss (emulated link layer losses at the receiver). Still achieves higher goodput than TCP as the loss rate increases.
- For high load (30 concurrent flows), Hop achieves much higher goodput than either TCP or hop-by-hop TCP (Q1 150x, median 20x, Q3 2x). Hop is also much fairer in the way it allocates bandwidth between flows.
- Evaluated delay on small transfers. Hop has much less delay than TCP or TCP+RTS/CTS (especially for smaller transfers).
- Many more evaluations in the paper. Partitionable networks, network and link-layer dynamics, effect on 802.11g, and effect on VoIP.
- Much related work on fixing end-to-end rate control, backpressure and batching; Hop combines these into a viable system.
- Q: most data transfers involve wired networks as well, so what do you do at the gateway? We do bridging here.
- Q: if there is a separate TCP connection from the gateway to a wired endpoint, how do you deal with mobility (changing gateways)? Good question, want to look at this in future work.
- Q: what is the real effect on interactive sessions (e.g. SSH)? Hop actually achieves an even lower delay than TCP or TCP + RTS/CTS. Didn’t evaluate transfer sizes smaller than 2KB.
- Q: if a congested router is fed by two routers upstream, how do you allocate the backpressure? The senders will have to back off. There is no explicit rate control. Both upstream routers are treated similarly.
Routing
NetReview: Detecting When Interdomain Routing Goes Wrong
- Interdomain routing sometimes goes wrong. Example of YouTube traffic being redirected to Pakistan. Only the most egregious problems make it to the news, but there are many inconvenient, small scale problems.
- ASes exchange routing information using BGP. But BGP routing is plagued by misconfigurations. Faulty routing information propagates through the network. Bugs in routers can cause routing information to be misadvertised. Also, spammers hack into routers to evade tracing.
- Goal is to reliably detect each routing problem and link it to the AS that caused it. This would make it quicker and easier to respond to problems. Fault detection would work for a broad class of problems, incentivise reliable routing and be easy to deploy incrementally.
- Idea: could just upload all router logs to a central entity, who inspects them for problems. Doesn’t work in practice. Logs contain sensitive information about internal routing structure. Also, relies on the router to have accurate information (may not be the case). Also need automation, incremental deployability and decentralisation.
- NetReview solves these problems. Border routers maintain logs of all BGP messages (not data messages). Logs are tamper-evident (in the event of a faulty router): can reliably detect and obtain proof of faulty router. Neighbours can periodically audit each other’s logs and check them for routing problems. Auditor can prove existence of a problem to a third party.
- ASes decide what to announce via BGP based on its routing policy, based on peering agreements (customer/provider), best practices (limited path length) and internal goals (short/cheap path).
- A BGP fault is when the BGP messages sent by an AS do not conform to its expected behaviour. We know what BGP messages the AS sent from a complete and accurate message trace (using a robust and secure tracing mechanism). Expected behaviour for each AS could be different.
- Example of expected behaviour: filter out routes with excessively long paths; act as somebody's provider; prefer routes through someone if available. Some of these rules may be confidential, but the AS need not reveal all of them to each auditor (e.g. reveal rules about agreements only to parties to those agreements).
- Tamper-evident log based on PeerReview at SOSP 2007. Can tell if a router omits, modifies or forges entries (based on a hash chain). Messages are acknowledged, so cannot ignore a message. Neighbours gossip about hashes seen.
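The hash chain at the core of the tamper-evident log can be sketched as follows (PeerReview's real protocol also involves signatures, acknowledgements and gossip, all omitted here):

```python
import hashlib

def append_entry(log, message):
    """Append `message` as a hash-chained entry: each entry's digest
    covers the previous entry's digest, so the chain commits to history."""
    prev = log[-1][1] if log else b"\x00" * 32
    digest = hashlib.sha256(prev + message).digest()
    log.append((message, digest))

def verify(log):
    """Recompute the chain; any omitted, modified or reordered entry
    makes some stored digest fail to match."""
    prev = b"\x00" * 32
    for message, digest in log:
        if hashlib.sha256(prev + message).digest() != digest:
            return False
        prev = digest
    return True
```

An auditor who holds an authenticated recent digest can therefore detect a router that later rewrites its own log.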
- Rules are predicates on the AS’s routing state. They are declarative and hence are easy to get correct. Checks of S-BGP can be declared in a two-line rule.
- Auditor requests the logs from each border router. Checks to see if the logs have been tampered with (or show inconsistencies). The auditor locally replays the logs to establish a series of routing states, then evaluates that the rules are upheld in each routing state. If a rule is violated during some interval, the auditor can extract verifiable evidence from the logs.
- Many practical challenges: here we will look at incremental deployment. Smallest useful deployment is at one AS. One AS can find bugs and misconfigurations. Two adjacent ASes can check peering agreements. The incentives for deployment are that reliable ASes can attract more customers, and logs can be used for root-cause analysis.
- Evaluated on a synthetic network of 10 ASes running 35 Zebra BGP daemons. Use default routing policies, and injected a real BGP trace (Equinix) to get scale. Results in the talk are from a tier-2 AS with 6 neighbours.
- Did fault injection: NetReview detected all of the injected faults and provided useful diagnostic information.
- Evaluated processing overhead: a 15-minute log segment can be checked in 41.5s on a P4. A single commodity PC is sufficient to check a small network in real-time.
- Storage space requirement was 710KB/minute, or about 356GB/year. Required 420Kbps, including BGP updates, which is insignificant compared to the data rates.
- Related work includes fault prevention (secure routing protocols and trusted monitors) which have been difficult to deploy, and only filter out limited types of faults. Also related to heuristic fault detection (problem of false positives and false negatives). And related to accountability systems (which tend to require clean-slate designs, or have other limitations).
- Q: can you say more about the web of trust approach? No PKI or CA required, because the networks already know who is at the end of each link, which gives a good basis to certify the identity of ASes.
- Q: does the system prevent collusion between ASes? ASes cannot incriminate a correctly-operating AS by collusion, nor hide misbehaviour (apart from routing behaviour between the colluding ASes, which doesn’t matter).
Making Routers Last Longer with ViAggre
- Motivation is the steep growth in routing table size. Expected to get worse in future: as IPv4 gets exhausted, need more small prefixes. Also there's a chance that IPv6 gets deployed… A larger routing table means that routers need more FIB (Forwarding Information Base) space.
- Does FIB size matter? Could just throw more RAM at it? Technical concerns about power and heat. SRAM is a low-volume component that does not track Moore's Law. Also, a larger routing table means less cost-effective networks (price per byte forwarded increases). Also, there's a cost in upgrading router memory. Some routers are filtering out /24 routing table entries, which impacts reachability. ISPs will undergo some pain to extend the lifetime of their routers.
- Virtual Aggregation: a configuration-only approach to shrinking router FIBs. Works on legacy routers and can be adopted independently by an ISP. Huawei are implementing it in their routers.
- Brad Karp came up with the name.
- Insight is to divide the routing burden between routers. Individual routers only maintain routes for a fraction of the address space.
- Router architecture: RIB on slow memory; FIB on fast memory (SRAM).
- Problem space includes FIB space, RIB growth and problems of routing convergence (churn etc.). Existing proposals require architectural change and haven’t been deployed. ViAggre focuses on FIB space problem and is incrementally deployable.
- Today all routers have routes to all destinations. Idea is to divide the address space into /2 virtual prefixes and assign the virtual prefixes to routers. Each router is responsible for one virtual prefix and only maintains routes to destinations in that prefix.
- How do you do this without changes to routers and external cooperation? How do packets traverse routers with partial routing tables?
- An external BGP peer may advertise its full routing table: need to insert routes into the FIB selectively (only install a subset of the RIB into the FIB; FIB suppression is a simple approach). However, this has high performance overhead. Instead, offload the task of maintaining this table onto machines off the data path (similar approach to BGP route reflectors). External routers peer with a route reflector. Shrinks the FIB and RIB on all data-plane routers. This is somewhat invasive, because you have to change your peering architecture.
- What about data plane paths between different virtual prefixes? When a packet comes in, the ingress router doesn’t have a route to the prefix. Therefore you need to route to router with the right virtual prefix. So maintain one entry per virtual prefix that a router is not aggregating. Cannot forward packets in a hop-by-hop fashion, so the packet must be tunnelled to a router that has the right prefix. The egress router removes the encapsulation.
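A sketch of the ingress decision, assuming the four /2 virtual prefixes from the example and invented router names (real ViAggre uses MPLS tunnels and existing router configuration, not application code):

```python
import ipaddress

# Hypothetical assignment: four /2 virtual prefixes, each owned by one
# aggregation-point router. Names and ownership are invented.
VIRTUAL_PREFIXES = {
    ipaddress.ip_network("0.0.0.0/2"):   "routerA",
    ipaddress.ip_network("64.0.0.0/2"):  "routerB",
    ipaddress.ip_network("128.0.0.0/2"): "routerC",
    ipaddress.ip_network("192.0.0.0/2"): "routerD",
}

def forward(dst_ip, my_router):
    """Ingress behaviour: deliver natively if this router aggregates the
    covering virtual prefix, otherwise tunnel to the router that does
    (the egress router then removes the encapsulation)."""
    addr = ipaddress.ip_address(dst_ip)
    for prefix, owner in VIRTUAL_PREFIXES.items():
        if addr in prefix:
            if owner == my_router:
                return ("native", owner)  # full routes held locally
            return ("tunnel", owner)      # encapsulate towards owner
    raise ValueError("address not covered by any virtual prefix")
```

Each router thus carries only its own virtual prefix's routes plus one entry per virtual prefix it does not aggregate.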
- Failover works using existing mechanisms.
- Paths in ViAggre can be longer than native paths (traffic stretch, increased load, etc.). But can use the power law that 95% of traffic goes to 5% of prefixes. So install that 5% of prefixes on all routers. This will reduce the impact of ViAggre on the ISP's network.
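Selecting the popular slice of prefixes to install everywhere could look like this (a sketch; the per-prefix traffic counts are whatever the ISP measures, and the data below is invented):

```python
from collections import Counter

def popular_prefixes(traffic_by_prefix, fraction=0.05):
    """Pick the most popular `fraction` of prefixes by traffic volume,
    per the 95/5 observation: install these on every router so that
    most traffic avoids the tunnelled detour."""
    ranked = [p for p, _ in Counter(traffic_by_prefix).most_common()]
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k])
```

The remaining 95% of prefixes stay only at their aggregation points, keeping the FIB reduction while bounding the extra load from stretch.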
- Evaluated effects on adopting ISP. Look at reduction in FIB Size versus traffic stretch (in ms) and load increase (more traffic carried by routers).
- Choose aggregation points so as to assign more routers to aggregate a virtual prefix. This reduces stretch. Use a constraint-based assignment program. Want to minimise the worst FIB size such that the worst stretch is <= a constraint. This is NP-hard, so had to approximate it. With a worst-case stretch of 4ms, the worst case becomes close to the average FIB size, and the actual stretch is very low.
- ViAggre can extend the lifetime of a router by 7–10 years.
- Carried out a study of the prefixes to which traffic was sent, found that the top 5% most popular prefixes account for 95% of traffic.
- As the fraction of popular prefixes increases, the load increase drops to 1.38%.
- ViAggre gives a 10x reduction in FIB size.
- Cons of ViAggre: control plane hacks may have overheads (in installation, convergence, failover); also planning overhead (choosing virtual prefixes and assigning them to routers).
- Deployed ViAggre on a real network. Compared propagating routes using the status quo and ViAggre (using prefix lists for selective advertisement). Measured control-plane overhead in case a new route is installed. ViAggre reduces installation time because route reflector only has to advertise a subset of the routing table to the new router.
- Also developed a configuration tool (330 lines of Python) that works on router configuration files and outputs ViAggre-compliant configuration files. Still working on a planning component. Working with Huawei to implement ViAggre natively.
- Q: overhead of tunnelling? Use MPLS-based tunnelling at line rate. Makes management easy.
- Q: why is the rate at which traffic is growing less than the rate of routing table growth? Not necessarily true. Router upgrading has been happening for reasons other than FIB size up until now (but the FIB size has increased). Tier-1 ISPs can forward packets at close to lambda rates. Even big ISPs may have to do upgrades only for memory concerns.
- Q: why not just maintain popular routes and forward other packets to a default router? Simplest thing to do would be to put routers in a cache hierarchy. Why does it have to be distributed and complicated? Router vendors weren't happy with route caching because of unpredictable performance and other reasons. ViAggre is useful for medium-sized ISPs (not small ISPs, where you have a usable default route).
- Q: can you use a configuration-only approach for the data centre to get higher aggregate throughput? The state problem in data centres is the layer-2 state (not the layer-3 state). Close to Seattle work at SIGCOMM 2008.
- Q: why is it expensive to do route suppression from the RIB to the FIB? We did it using access lists to say what should and should not go into the FIB. These access lists are heavyweight mechanisms (typically used with small access lists of up to 500 entries), and haven't been optimised. By making this efficient, you get much less complexity (no need for route reflectors).
Symbiotic Relationships in Internet Routing Overlays
- Two nodes are in symbiosis whenever both can benefit from one another’s position or resources in the network. Examples are in file sharing (BitTorrent), backup systems (Samsara) and AS relationships. No tragedy of the commons and no free riding. Everyone provides something in return.
- Can we apply mutual advantage to overlay routing? Built PeerWise, an overlay routing system that reduces latency based on mutual advantage.
- Route from College Park, MD to Seattle. Could take a direct path, or a detour path that violates the triangle inequality.
- Measurement studies based on PlanetLab-to-popular-destinations (asymmetric) and PW-King (symmetric). 21% and 51% of node pairs respectively benefit from detours (only detour when latency reduction is at least 10ms and 10%).
- For 20% of the PlanetLab nodes, there are no detours. Mutual advantage eliminates about half of the detours.
- Can categorise web pages based on the number of prefixes and the number of nodes finding detours (regional websites see many nodes finding detours but few prefixes; Google and CDN-based sites have many prefixes, but relatively few nodes finding detours).
- Three goals: efficiency (lower end-to-end latency), fairness (mutual advantage) and scalability (use network coordinates). Network coordinates give internet nodes positions in a geometric space, and predict latency based on position in that space: this has some power in predicting whether node-pairs will need a detour or be part of a detour.
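The mutual-advantage test with the 10ms/10% detour threshold from the measurement study can be sketched as follows (the RTT data is illustrative; in PeerWise some of these latencies would be predicted from network coordinates rather than measured):

```python
def detour_helps(direct_ms, via_ms, min_gain_ms=10.0, min_gain_frac=0.10):
    """The talk's threshold: detour only if it beats the direct path
    by at least 10 ms and at least 10%."""
    gain = direct_ms - via_ms
    return gain >= min_gain_ms and gain >= min_gain_frac * direct_ms

def mutual_advantage(rtt, a, a_dst, b, b_dst):
    """PeerWise-style symbiosis sketch: a relays for b, b relays for a,
    and they peer only if each one's own detour meets the threshold.
    `rtt[x][y]` holds round-trip times (invented example data below)."""
    a_ok = detour_helps(rtt[a][a_dst], rtt[a][b] + rtt[b][a_dst])
    b_ok = detour_helps(rtt[b][b_dst], rtt[b][a] + rtt[a][b_dst])
    return a_ok and b_ok
```

The bilateral check is what rules out free riding: a node that offers no useful detour to its partner never gets one in return.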
- Need to do neighbour tracking, and ranking based on proximity (minimise latency), embedding error (maximise this), coverage (minimise expected detour latency) or randomness. Proximity and coverage perform best.
- Now do pairwise negotiation to establish mutual advantage. Keep a negotiation table and peering table per node. Negotiation table contains potential peers.
- Implemented PeerWise and deployed it on about 200 PlanetLab nodes. PeerWise finds mutually-advantageous detours that offer significant and continuous latency reduction.
- Propose three experimental scenarios: all-destination, random-subset-destination and Zipf-destination.
- PeerWise finds detours quickly. Other results in the paper.
- Looked at user-level application benefits (wget) through the use of detours. Compare the transfer time for popular website home pages using both the direct and detour paths. Compare PeerWise reduction ratio and wget reduction ratio. In 58% of all cases, wget takes less time through the detour path than through the direct path. But in 42% of cases, wget takes longer through the detour. Could be that the latencies measured by PeerWise had changed.
- Found that PlanetLab relays influence transfers. Packets spend time in the relay node. Median relay wait time on PlanetLab was 48ms (measured a few weeks before the NSDI deadline). Found that on a UMD relay node, the median wait time was 5ms. With these delay characteristics, the performance would have been much better.
- Q: is it not counter-intuitive to use latency (with triangle inequality violations) in network coordinates (a metric space)? Should you not use topological information, as in iPlane? Network coordinates give good results, so we didn't investigate anything else. When we embed a triangle inequality violation in a metric space, we either make the short side shorter or the long side longer.
- Q: aren’t most network coordinate systems based on some sort of global optimisation? Taken offline.
- Q: do you only consider bilateral advantages, or have you considered more complicated scenarios with multiple parties? Have thought about it, but just consider bilateral.
- Q: have you done any work on trying to include the load of a node in the weighting system to determine whether it is advantageous to route through it? Have not looked at load.
- Q: a long RTT TCP connection, split in two, will get better latency and loss tolerance, so have you considered doing this instead of TCP relaying? Is there a way to separate the effects in your evaluation? Not sure.