OSDI 2010: Day 3

Production Networks

Onix: A Distributed Control Platform for Large-scale Production Networks

  • In traditional networking, a router couples the forwarding plane and the control plane together in one box. Changing one without the other is difficult, which slows the deployment of new ideas: the TRILL protocol, for example, took 6 years to deploy.
  • Software-defined networking moves the control plane into a server outside the router that talks to the router using a management protocol, sending down forwarding entries to the router, and receiving information about new flows, etc.
  • One of the problems with L2 Ethernet networks is ARP broadcast, which doesn’t scale. One idea is a centralized ARP server queried by unicast; however, that mustn’t become a single point of failure.
  • Onix is a software-defined networking platform that implements distributed control. Centralized control logic thought of as a distributed system. Simple API and flexible mechanisms for state distribution, scaling and reliability.
  • Challenges are generality, scalability, reliability, simplicity and performance.
  • The API is based on treating the network as a graph. Nodes are physical (or logical) network entities, such as a forwarding engine, which is connected to a port, which is connected to an end host. Nodes have methods defined on them (e.g. “write flow entry” for a forwarding engine). The graph is a “network information base” (NIB).
  • Switching protocols integrated into the NIB, such as OpenFlow.
  • Need a way to distribute the NIB to the switches and to other control servers. Depending on the application, you might want strong consistency or eventual consistency, so two options are offered: a replicated transactional (SQL) store, and a one-hop in-memory DHT (for eventual consistency). For an ARP server, the switch topology is hard state, while the IP-to-MAC mappings are soft state (a minimal sketch of this split appears after this list). Applications set up the distribution mechanisms once and thereafter only interact with the NIB at run time.
  • Integrated with ZooKeeper for distributed coordination, in order to do leader election, failover, distributed locking, barriers, etc.
  • Scalability and reliability can be traded off by partitioning the control state in various ways. Partitioning techniques are based on VLANs and MPLS (versus aggregation techniques like routing on IP prefixes or ATM routing areas), and provide similar interfaces to the user. The trade-off can be varied by changing the subset of Onix instances to which a server connects, by giving different computational tasks to different instances, or by partitioning the NIB between instances.
  • Applications can reduce the fidelity of information before sharing it (based on an ATM trick).
  • For reliability, need to deal with various kinds of failures: network element/link failures (application has to deal with that), connectivity infrastructure failure (the network that connects Onix to the networking infrastructure, usually use a management backbone, but could do multipathing in that backbone), and Onix failure (application-controlled, but mitigated by distributed coordination facilities (like ZooKeeper)).
  • Implemented in 150KLOC of C++, Python and Java, from 15 developers in 3 institutions. In QA testing, and will be appearing in products.
  • Implemented Ethane (enterprise security solution) on Onix. Multiple Onix instances are necessary to handle the load from flows being set up.
  • Implemented a distributed virtual switch, which does centralized management of VMM virtual switches. Only need a single instance of Onix with a cold standby for failover.
  • Implemented a multi-tenant virtualised data center, which provides per-tenant L2 networks in a data center. Provides isolation in addressing and resource usage. Need to partition the NIB and management tasks.
  • Implemented a scale-out IP router, which does routing control over multiple physical switches.
  • Q: What is an “application” and how do they multiplex on a single network? What is the implication at the level of routing? Envision only a single application per Onix instance (app is a program that manipulates the NIB graph). Forwarding layer gets programmed using something like OpenFlow (a TCP connection to the controller), which programs a TCAM on the switch. How do you set up the network if there is no routing? Depends on the application. Need configuration support at the beginning.
  • Q: How do you refresh the ARP cache at the host, don’t you need to do broadcast? We didn’t actually build an ARP server, just an example for the talk. Imagine we could still save a lot of traffic from gratuitous ARPs.
  • Q: What is your favourite trick or hack that we might be able to reuse from your implementation? And/or what might you do differently? Favourite idea is that the distribution mechanisms are hidden from you in the NIB, and you can switch out the storage layer with your own if you want.
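
Aside: a minimal Python sketch of the NIB-as-graph idea described above. The class names (NIBEntity, ForwardingEngine, ARPState) are made up for illustration and are not Onix’s real API; the point is that an application only manipulates graph entities, while the platform decides which state is hard (replicated transactionally) and which is soft (pushed to the DHT).

    # Hypothetical sketch of an Onix-style NIB graph; names are illustrative only.

    class NIBEntity:
        def __init__(self, name):
            self.name = name
            self.links = set()          # edges in the network graph

        def attach(self, other):
            self.links.add(other)
            other.links.add(self)

    class ForwardingEngine(NIBEntity):
        def __init__(self, name):
            super().__init__(name)
            self.flow_table = []        # hard state: replicated transactionally

        def write_flow_entry(self, match, action):
            self.flow_table.append((match, action))

    class ARPState(NIBEntity):
        """Soft state: IP-to-MAC mappings, suited to the eventually consistent DHT."""
        def __init__(self):
            super().__init__("arp")
            self.ip_to_mac = {}

    # An application only manipulates the graph; the platform handles distribution.
    switch = ForwardingEngine("switch-1")
    arp = ARPState()
    arp.ip_to_mac["10.0.0.2"] = "aa:bb:cc:dd:ee:ff"
    switch.write_flow_entry(match={"dst_ip": "10.0.0.2"}, action="output:3")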

Can the Production Network Be the Testbed?

  • Hard to evaluate new network services realistically, which makes it hard to decide what should and should not be deployed in production networks. Testbed networks are too small and may not have realistic topologies, and success on a testbed does not necessarily predict how a service will fare when deployed on a real network.
  • Problem with NetFPGA: expensive to deploy and fan-out is small. Software testbeds are slow and may not have realistic topologies. And getting real users to join an experimental testbed is hard.
  • Solution: network slicing. Divide a production network into logical slices, in each of which a different control plane runs. Users can opt in to a particular slice. Strong isolation between slices is needed so that actions in one don’t affect another.
  • Control plane makes up forwarding rules and pushes them down into the data plane. Data plane pushes exceptions up to the control plane (e.g. packets that don’t match any rules). Add a slicing layer between the control plane and data plane, and allow multiple control planes to control the same data plane.
  • Network slice is a collection of sliced switches or routers. The data plane remains unmodified, and the slicing layer performs rewrites on the control planes’ rules to ensure isolation.
  • Slicing policy specifies resource limits for a slice (e.g. link bandwidth, maximum number of rules, topology and CPU usage).
  • Now, allow real users to opt-in to slices on a flow-by-flow level. So one slice can handle HTTP, another can handle VoIP, etc.
  • Implemented on top of OpenFlow for communication between the control and data planes. Control logic is moved to a commodity PC. OpenFlow is just a firmware upgrade for various commodity hardware.
  • FlowVisor performs message handling (it is the slicing layer). On getting an exception from the data plane, FlowVisor looks up which slice controls the packet and passes it to the appropriate controller. When the controller sends back a rule, FlowVisor checks that the update is allowed, and writes it to the data plane if so. (A toy version of this message-handling loop is sketched after this list.)
  • Implemented for 20 OpenFlow message types as a transparent OpenFlow proxy in 8261 lines of C.
  • Isolation performance in terms of device CPU usage (other metrics are considered in the paper). Need to prevent CPU exhaustion, using rule-insertion rate limiting. Consider a malicious slice that tries to do DoS on the switch. Isolation reduces this to 40% CPU usage on average (though bursty).
  • Runs on the production network at Stanford, with 15 switches, 35 access points, 25 users and >1 year of use. Host 7 different OpenFlow demos on the same physical network. Also runs in many different research groups.
  • Not explicitly virtualization, because currently only deal with slices of the physical topology. Looking into virtual nodes in future.
  • Q: How do you create isolation between slices so that one slice doesn’t do too many flow setups per second? This is handled by isolating switch CPU usage.
  • Q: What about latency? Adds about 0.483 ms of latency.
  • Q: Don’t you have to have n times as much forwarding memory, which creates problems for realism? Tricks that you can play in terms of caching, but it is true that the number of rules becomes a scarce commodity (second most important after device CPU).
  • Q: [...] Experimental services use the production slice for their control plane. Worked out an in-network control protocol.
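
Aside: a toy Python sketch of the FlowVisor message-handling loop described above. The types and function names (Slice, on_packet_in, on_flow_mod) are invented for illustration; the real FlowVisor is a transparent OpenFlow proxy written in C and covers around 20 message types.

    # Hypothetical slicing-layer sketch; not FlowVisor's actual interfaces.

    class Slice:
        def __init__(self, name, flowspace, max_rules):
            self.name = name
            self.flowspace = flowspace      # predicate: does a packet/match belong here?
            self.max_rules = max_rules
            self.rules = []

    def on_packet_in(slices, packet):
        """Exception from the data plane: find the slice that controls this packet."""
        for s in slices:
            if s.flowspace(packet):
                return s                    # hand the packet to this slice's controller
        return None                         # no slice claims it

    def on_flow_mod(slice_, rule, switch_rules):
        """Rule from a controller: validate/rewrite it before touching the data plane."""
        if not slice_.flowspace(rule["match"]):
            raise PermissionError("rule escapes the slice's flowspace")
        if len(slice_.rules) >= slice_.max_rules:
            raise RuntimeError("slice exceeded its rule quota")
        slice_.rules.append(rule)
        switch_rules.append(rule)           # only now does the rule reach the switch

    # Example: an HTTP slice may only install rules matching TCP port 80.
    http = Slice("http", lambda m: m.get("tp_dst") == 80, max_rules=1000)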

Building Extensible Networks with Rule-Based Forwarding

  • Goal is to have flexible forwarding in the network. A network becomes more powerful if its forwarding plane is more flexible. For example, we might want forwarding to be aware of middleboxes (for intrusion detection). But also multipath, active networks, multicast, and so on.
  • However, flexibility is not enough: forwarding must be secure as well. There needs to be policy support to constrain forwarding directives, and every entity that appears in a forwarding directive must be able to refuse it. This is not respected today (e.g. denial of service is possible because the source decides where packets are sent). Similarly, one might want IP source routing but give every intermediate ISP the ability to refuse the route.
  • Idea: a “packet rule”, which carries the forwarding directive in the packet itself. All intermediate nodes must validate the rule, and rules encapsulate a proof of policy compliance.
  • Rules are owned by destinations (but they could be owned by other nodes). They are certified by trusted third parties. An extended DNS returns rules instead of addresses. Routers verify the rule signature and follow rule directives. Packets may also have a return rule for allowing data to be returned to the source.
  • Distribution and certification is done in the control plane (in the paper). Talk focuses on the data plane.
  • Assume: an anti-spoofing mechanism (ingress filtering), existence of rule-certifying entities and key distribution, and DNS well-provisioned against DDoS.
  • A rule is a sequence of actions in if-then-else form. Conditions are comparisons over packet headers and router state. Allowable actions are: modify the header, forward the packet, drop the packet, or pass it up to the control layer. (A toy evaluator in this style is sketched after this list.)
  • Rule forwarding engine can modify packet attributes, such as adding explicit congestion notification. Arbitrary attributes on a packet may be set. Need to make attribute setting secure, by using strong cryptography or checking the source (to avoid circumventing an IDS middlebox).
  • Rules can be used to prevent DoS, as they can work like network capabilities. Requires the ability to dynamically grant rules, which is unusual for DNS, so would need to do some proxying.
  • Many other examples of applications that work in this architecture.
  • Rules are flexible, policy-compliant and safe (bounded forwarding time, and cannot modify router state or amplify traffic). Invoked functions are controlled by the ISPs and middlebox owners. Resource isolation can be used to prevent attacks.
  • Evaluated several questions: size overhead of rules, forwarding overhead, performance isolation, load on certifiers and security. Only looking at overhead in the talk.
  • Rule compiler transforms the rules into a binary form. Rule size is between 60 and 140 bytes. Average across examples is below 85 bytes (13% overhead on an average IP packet). Overhead is 27% if using RSA signatures.
  • Built a prototype router on top of RouteBricks. Tested on an 8-core Nehalem with 2 dual-port NICs. Compared RBF to plain RouteBricks, and saw little overhead (no overhead for packets over 500 bytes). Can forward 35Gbps on a real traffic distribution.
  • Signature verification happens only at “trust boundary routers”. Results can be cached (about 19 bytes per rule), and new flows tend to represent 1% of backbone line capacity on average (5% in worst case). Existing hardware can cope with this, and it can be parallelized, so throwing more hardware at it will support higher bandwidth links.
  • Q: What if as a receiver, I want packets to take a particular path… can I drop those packets earlier in the path? No, can only verify the previous hop. Would need stronger crypto guarantees to detect that. What about stakeholders who aren’t named in rules? In a way, we are like overlay networks. We do not name entities between waypoints. But since they approve packet forwarding, you could see them as overlay nodes.
  • Q: What is the best-case/worst-case for ping? Latency is negligible in our implementation.
  • Q: Can you comment on deployability? Could use network slicing. Could be partially deployed at some routers and get many of the benefits. What is the incentive for ISPs? The ability to diversify their offering to customers.
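
Aside: a hypothetical Python sketch of evaluating a packet rule as an if-then-else action list, as described above. The encoding of conditions and actions is invented; the real system compiles rules to a compact binary form and verifies their signatures before evaluation.

    # Toy rule evaluator; condition/action representation is illustrative only.

    def eval_rule(rule, packet, router_state):
        for cond, action, args in rule:           # each clause: (condition, action, args)
            if not cond(packet, router_state):
                continue                          # fall through to the next clause
            if action == "modify":
                packet.update(args)               # e.g. set an attribute such as ECN
            elif action == "forward":
                return ("forward", args)          # next hop / waypoint
            elif action == "drop":
                return ("drop", None)
            elif action == "control":
                return ("to_control_plane", None)
        return ("drop", None)                     # default-deny if nothing matched

    # Example: send packets through an IDS waypoint unless already marked as checked.
    rule = [
        (lambda p, s: p.get("ids_checked"), "forward", "dst"),
        (lambda p, s: True, "forward", "ids.example.net"),
    ]
    print(eval_rule(rule, {"dst": "10.0.0.9"}, {}))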

Mobile

TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones

  • 60 million smartphones were sold in Q2 this year, but privacy on these devices is very poor, especially in terms of the applications that run on them (the example given was a wallpaper application on Android). The challenge is to balance fun and utility with privacy. The idea is to look inside applications to watch how they use privacy-sensitive data, and determine how it leaves the phone.
  • Challenges include resource constraints, allowing apps to use some privacy-sensitive information, context-based privacy and applications sharing information.
  • Idea: use dynamic taint analysis, with a taint source, taint propagation and a taint sink. Need to trade off between performance and granularity. TaintDroid does taint tracking on Android, with variable tracking throughout Dalvik, handling native method invocation, and extending to storage.
  • Modified the Dalvik interpreter to store and propagate taint tags on variables. Store tags adjacent to local variables/args on the internal execution stack, for locality. For class fields, store them in instance fields on heap objects. For arrays, store one tag per array to minimize overhead (in particular because many arrays are strings).
  • Various rules govern taint propagation through data flow (e.g. the result of a binary operation carries the union of its operands’ tags); a toy version is sketched after this list.
  • JNI is used to execute native methods. TaintDroid doesn’t support third-party native code, but uses a combination of heuristics and method profiles to patch up the tracking state on return from native methods.
  • For IPC and file propagation, taint at the message level and the file level. File taint is stored in an XATTR on the file.
  • On the CaffeineMark benchmark, the overall performance overhead is 14%, relatively worse for the “string” workload. Memory overhead is 4.4%, and IPC has a 27% overhead. Macrobenchmark overheads are between 3% and 29% (worst for taking pictures: 0.5 seconds).
  • Sources and sinks are carefully integrated into the existing framework. Need to treat sources differently based on the properties of the information that they provide. IMSI needed particularly careful handling.
  • Started with the top-50 most popular applications across categories on the Android Marketplace, and winnowed the list down to 30 applications that used the internet as well as location, microphone or camera. 105 connections were flagged, but only 37 were clearly legitimate.
  • 15/30 applications were sharing location with a third-party advertiser, usually in plaintext (e.g. in the HTTP GET url for AdMob, which also contained a unique ID for the particular phone). Sometimes every 30 seconds. No sharing was obvious to the user or covered in the EULA.
  • 7 applications sent the IMEI, and 2 apps sent the phone number, IMSI, or ICC-ID to a remote server. One application’s EULA mentioned that the IMEI would be sent; another sent a hash of the IMEI.
  • Limitations of the approach: it only tracks explicit data flows. On the source side, the IMSI contains country and network codes, so tainting the whole value is too conservative.
  • Source code should be released next Wednesday.
  • Q: What about timing attacks, and what implicit flows do you think are important? Need to do static analysis to capture all implicit flows, and limitations would lead to false positives.
  • Q: Have you thought about providing this application as a website to which people could upload applications? Haven’t thought about that specific model, but definite potential there.
  • Q: How do you stop applications modifying the XATTR to remove taint? XATTRs are not otherwise used in Android, so LSMs could be used to protect them. Accessing them requires third-party native libraries, which we don’t support (they can’t be accessed from Java).
  • Q: Are you planning to do anything for control dependencies, and how would that impact performance and false positives? We are not attempting to identify privacy violations, just to show that privacy-sensitive information has left the phone. Using the information for control decisions is the next step.
  • Q: Now that TaintDroid is public, what do you see as its future purpose (since people will now try to get around it)? Going to look at more techniques in future (implicit flows), though we can still catch low-hanging fruit where the leak is unintentional.
  • Q: [...]. Understanding how information is going to be used is a tough challenge. Want to integrate privacy-preserving technologies.
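
Aside: a toy Python sketch of variable-level taint tracking in the spirit of TaintDroid’s data-flow rules. The tag constants and helper functions are invented for illustration; the real system stores and propagates tags inside the Dalvik interpreter, not in application code.

    # Hypothetical taint-tracking sketch; names and structure are illustrative only.

    TAINT_LOCATION = 1 << 0
    TAINT_IMEI     = 1 << 1
    TAINT_MIC      = 1 << 2

    class Tainted:
        def __init__(self, value, tag=0):
            self.value = value
            self.tag = tag                    # bitmask of taint sources

    def binary_op(op, a, b):
        """Result of a binary operation carries the union of both operands' tags."""
        return Tainted(op(a.value, b.value), a.tag | b.tag)

    def network_send(dest, data):
        """Taint sink: flag any tainted data leaving the phone."""
        if data.tag:
            print(f"ALERT: data with taint {data.tag:#x} sent to {dest}")

    lat = Tainted(37.42, TAINT_LOCATION)      # taint source: GPS reading
    msg = binary_op(lambda x, y: f"{x},{y}", lat, Tainted(-122.08, TAINT_LOCATION))
    network_send("ads.example.com:80", msg)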

StarTrack Next Generation: A Scalable Infrastructure for Track-Based Applications

  • Since many phones can now identify their own location, many apps now use this information. A track is a sequence of (location, timestamp) readings. Applications could use these tracks to e.g. personalise driving directions (collapsing directions that are familiar, such as “Take 101 north”, rather than a bunch of redundant short steps around one’s own house).
  • Application taxonomy: use current location, use past locations or use tracks. StarTrack deals with the last case (e.g. personalised driving directions, track-based search, ride sharing, discovery, urban sensing).
  • Applications perform two kinds of operations: insertion, and retrieval/manipulation/comparison. The StarTrack server is a database designed to store geometric data efficiently.
  • Challenges: handling error-prone tracks, providing a flexible API and efficiently implementing track operations. (Also must provide scalability and fault tolerance.)
  • Raw tracks are difficult to compare for equality, since there may be noise, and readings may be taken at different times. Canonicalisation enables more efficient retrieval and comparison, by mapping similar tracks onto the same path.
  • API needs to: pre-filter tracks (based on similarity criteria), manipulate them and fetch them. Track collections are abstract groupings of tracks, which aid programming convenience and implementation efficiency (preventing unnecessary message exchanges and enabling caching).
  • Manipulation operations include joining collections, sorting them, taking a subset, getting similar ones, getting common segments, and getting individual tracks.
  • Showed a code fragment for a ride-sharing application in terms of the StarTrack API.
  • In the DB, StarTrack maintains only non-duplicate tracks per user; the table of unique coordinates is kept separate. Quadtrees are used for geographic range searches.
  • To compute track similarity, compute the length of the common segments over the length of all segments in both tracks (i.e. a Jaccard-style similarity; see the sketch after this list). Conventional DBs don’t support this efficiently. The track tree data structure is an efficient representation for this, supporting get-similar-tracks and get-common-segments.
  • Evaluated the performance of track trees and two sample applications (driving directions and ride sharing). Used synthetically-generated tracks, 9 servers and 3 DB servers.
  • Evaluated get-similar-tracks against DB, in-memory filtered, and in-memory brute force approaches. Track tree query takes 1ms for up to 100K tracks in the collection. Other approaches are much more costly. To build a track tree takes up to 2 minutes for 100K tracks, and about 170MB of RAM.
  • The personalised driving directions application uses get-common-segments. With caching, requests take about 50 ms until throughput hits a wall at about 260 requests per second. Ride sharing can do 30 requests/second at 170 ms response time.
  • Have a working infrastructure, and looking for users.
  • Q: Have you tried to pair intent and content with a track? We have the ability to attach metadata to tracks, so different applications could associate different metadata. Expect we could have ACLs on this metadata.
  • Q: How do you handle errors in the tracks? Canonicalisation handles this, pre-filtering obviously-erroneous points. What technique do you use for this? Rely on the underlying algorithm but don’t use statistical techniques.
  • Q: Have you thought about how you might do non-intersecting paths in a privacy preserving way (e.g. to avoid an ex)? Thought about it, but it is very hard. Access control might address this, but would like to have a way to aggregate it using differential privacy techniques.
  • Q: Why does the performance curve hit a wall? Network congestion and packets getting dropped.
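
Aside: a small Python sketch of the Jaccard-style track similarity described above, assuming tracks have already been canonicalised into shared segment IDs with known lengths. All names and numbers are illustrative; the real system answers such queries with the track tree rather than set operations.

    # Toy similarity: common segment length over total segment length in both tracks.

    def track_similarity(track_a, track_b, seg_length):
        """track_a, track_b: sets of canonical segment IDs; seg_length: ID -> metres."""
        common = track_a & track_b
        union = track_a | track_b
        if not union:
            return 0.0
        return sum(seg_length[s] for s in common) / sum(seg_length[s] for s in union)

    seg_length = {"101N-a": 1200.0, "101N-b": 900.0, "embarcadero": 400.0}
    home_to_work = {"101N-a", "101N-b"}
    home_to_gym  = {"101N-a", "embarcadero"}
    print(track_similarity(home_to_work, home_to_gym, seg_length))  # 0.48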

Virtualization

The Turtles Project: Design and Implementation of Nested Virtualization

  • Approach is to multiplex the hardware between different layers of virtualization. Focus on two layers of nesting.
  • There is a VMCS for each layer transition; the bottom-level hypervisor merges VMCSs so that nested guests can actually be run from root mode.
  • Need to avoid the number of vmexits multiplying with each nesting layer. One idea is to make a single exit fast; another involves cutting down their frequency.
  • Multi-dimensional paging cuts down the number of exits. The baseline approach uses shadow paging on top of shadow paging (assuming no EPT). Better is a shadow page table on top of EPT (the secondary hardware table), which turns three logical translations into two in hardware. Best is to observe that EPT tables rarely change, and to do multi-dimensional paging, which compresses the two levels of EPT into one (from layer 0 directly to layer 2); see the sketch after this list.
  • For I/O virtualization, could do device emulation, or paravirtualized drivers, or (even better) direct driver assignment. This requires an IOMMU.
  • For nested virtualization, need to do multi-level device assignment: the bottom level emulates an IOMMU for the first level, and the second level accesses the device directly.
  • The experimental setup involved kernbench, SPECjbb and netperf, evaluating different hypervisor choices for nested performance. For kernbench, the overhead is 9.5% in the guest and a further 14.5% in the nested guest; for SPECjbb, it’s 7.6% in the guest and a further 7.8% in the nested guest. vmread and vmwrite operations were the most expensive, so these were hacked to not cause a trap.
  • Multi-dimensional paging improves kernbench a lot (3.5x), but SPECjbb and netperf are roughly the same.
  • Multi-level device assignment is the best performing I/O setup, but it adds an overhead of 3x in CPU cost. Interrupts cause exit multiplication. If we could deliver interrupts directly to the nested VM, the overhead would just be 7%. But can still fill a 1G pipe.
  • Q: Do you have a feeling for how the overhead will scale as the number of levels increases? Hope that additional levels will add approximately the same overhead.
  • Q: Are you confident that you could run this under arbitrary unmodified hypervisors? Not advocating rootkits at all.
  • Q:
  • Q: Would binary translation help to apply changes to e.g. the vmread and vmwrite instructions? Two ways to solve this: either get better hardware support or do binary translation. Not feasible to do binary translation in this case.
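
Aside: a toy Python sketch of the idea behind multi-dimensional paging: composing the two nested EPT translations into the single table that the hardware actually walks. The page numbers are made up, and real EPTs are multi-level hardware structures rather than dictionaries.

    # L2 guest-physical -> L1-physical -> machine frames, collapsed into one table.

    ept_1_to_2 = {0x10: 0x200, 0x11: 0x201}      # maintained by the L1 hypervisor
    ept_0_to_1 = {0x200: 0x9000, 0x201: 0x9001}  # maintained by the L0 hypervisor

    def compress(ept12, ept01):
        """Build the EPT 0->2 table that the hardware walks for the nested guest."""
        return {gpa2: ept01[gpa1] for gpa2, gpa1 in ept12.items() if gpa1 in ept01}

    ept_0_to_2 = compress(ept_1_to_2, ept_0_to_1)
    print(ept_0_to_2)   # maps 0x10 -> 0x9000 and 0x11 -> 0x9001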

mClock: Handling Throughput Variability for Hypervisor IO Scheduling

  • Done this work because customers need it.
  • Hypervisor currently multiplexes hardware resources between VMs. Three possible controls: reservation (minimum guarantee), limit (non work-conserving) and proportional share. Supported for CPU and Memory in ESX Server since 2003. Now looking at I/O resource allocation.
  • Contention for I/O resources can have an arbitrarily bad effect on a VM’s resource allocation. Hard problem for storage (considered in terms of iops) to get reservations or limits, because you need to continuously recompute the allocation in the face of bursty, variable workloads. Also, since this access may be distributed and shared, it is even more complex.
  • The goal is a single algorithm that supports all of these controls, handles variable and unknown capacity, and is simple to implement.
  • Typical proportional-share scheduling assigns a weight to every application, and tags each request with the application. Then requests are spaced based on their tag and the weight of their application.
  • Three key ideas in mClock: use real-time tags (instead of virtual time tags) based on wall-clock time. Use separate tags for reservation, shares and limits. Then use dynamic tag selection and synchronization.
  • Scheduling proceeds by first doing constraint-based scheduling, then weight-based scheduling once reservations are met (see the sketch after this list).
  • Idle VMs are synchronized when they make a new request.
  • Requests need to be scaled based on their size (a large I/O counts for proportionally more).
  • Distributed mClock for clustered storage architectures.
  • Experimented with a 3 to 6 VM cluster with a mixture of Windows and Linux hosts. Workloads: iometer and the Filebench OLTP benchmark. Showed an experiment with diverse workloads and allocations, comparing SFQ(D) to mClock; mClock closely enforces the limits and reservations for the relevant VMs.
  • Another experiment looking at burst handling, when idle VMs get benefit from spare capacity.
  • Looked at an OLTP workload on both SFQ(D) and mClock. On SFQ(D), the reservation is not respected, whereas mClock manages it.
  • Distributed mClock experiment, with three servers and three clients, and various reservations.
  • Future work involves extracting SLAs and deriving scheduling policies from those.
  • Q: Why does the total number of iops become lower for mClock than for SFQ(D) when two VMs are switched on? Some variance would be acceptable, due to the varying workloads and access patterns etc. Note that the VMs are not doing exactly the same workloads in each case.
  • Q: Can you comment on how you make the trade-off between throughput and latency? Focussed on getting a certain amount of throughput. Previous work looked at latency. Overall, to do a good job at latency, need support from the underlying array vendor.
  • Q: [...]. We group I/O into 32-request batches onto the LUN.
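
Aside: a minimal Python sketch of mClock-style tag assignment and scheduling as described above, with simplifying assumptions (one outstanding request per VM, no request-size scaling, no batching; the class and function names are invented).

    # Toy mClock-like scheduler: real-time tags for reservation (R), limit (L),
    # and proportional share (P); constraint-based phase first, then weight-based.
    import time

    class VM:
        def __init__(self, name, reservation, limit, weight):
            self.name = name
            self.r, self.l, self.w = reservation, limit, weight   # IOPS, IOPS, share
            self.R = self.L = self.P = 0.0                        # last tag values

        def tag(self, now):
            """Tags are spaced by 1/reservation, 1/limit and 1/weight, and idle VMs
            are synchronized to the current time when they issue a new request."""
            self.R = max(self.R + 1.0 / self.r, now)
            self.L = max(self.L + 1.0 / self.l, now)
            self.P = max(self.P + 1.0 / self.w, now)

    def pick_next(vms, now):
        """Serve any VM whose reservation tag is due (R <= now); otherwise pick by
        proportional-share tag among VMs that have not hit their limit."""
        behind = [vm for vm in vms if vm.R <= now]
        if behind:
            return min(behind, key=lambda vm: vm.R)
        eligible = [vm for vm in vms if vm.L <= now]
        return min(eligible, key=lambda vm: vm.P) if eligible else None

    vms = [VM("oltp", reservation=250, limit=1000, weight=2),
           VM("batch", reservation=50, limit=500, weight=1)]
    now = time.time()
    for vm in vms:
        vm.tag(now)
    print(pick_next(vms, now).name)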

Virtualize Everything but Time

  • Problem is clock synchronization, which is important for network monitoring, traffic analysis, telecoms, finance, gaming, and distributed scheduling.
  • The status quo under Xen is based on ntpd, but virtualization amplifies ntpd’s flaws, and it fails when live VM migration is done.
  • Propose a new architecture based on RADclock (a clock synchronization solution): robust, accurate and scalable, with support for seamless migration.
  • Idea: give each physical host a single clock that never migrates. A stateless clock-read function migrates with each VM. (Concept of a “dependent clock” (not new).)
  • Context is paravirtualized Xen guests. It would be more complex to do it for fully-virtualized guests, and is not discussed here. Implemented for pvops Linux (2.6.31.13) on Xen 4.0.
  • Clocks are built on local hardware (oscillators/counters). However, they are imperfect and drift (often due to temperature). The TSC counts CPU cycles and is the highest resolution/lowest latency time source, but it might be unreliable. HPET is reliable but low-resolution and high latency. The Xen clocksource is a hardware/software hybrid that the hypervisor provides, and aims to combine the reliability of HPET with the low latency of TSC: a 1GHz 64-bit counter. Performs well, but HPET is not far behind and a lot simpler.
  • Raw timestamps and clock timestamps are different, and scaled counters are insufficient since they drift. Network based synchronization is convenient, but suffers from delay variability and path asymmetry. (Typically assume symmetry.)
  • Two synchronization algorithms under discussion. First is ntpd, which is feedback-based, using the Xen clocksource on Xen. RADclock uses raw timestamps, and can use any raw counter, including HPET and the Xen clocksource.
  • Experimental setup includes an atomic clock, a GPS receiver and a network connection to a stratum-1 NTP server.
  • With the client and time server on the same LAN, and a 16-second polling period, can get 60us error. Using a feedback-directed polling period with multiple servers (but no load, no traffic and high quality servers) can lead to millisecond-level errors.
  • On Xen, ntpd is much worse. Consider a dependent ntpd clock, independent ntpd clocks and migrating independent ntpd clocks. With a dependent ntpd clock, only dom0 runs ntpd and the Xen clocksource is used (no clock sync in the guests). If an independent ntpd clock per guest runs (resource-hungry), works almost as well as non-virtualized but with a bit more noise due to virtualization latency. Adding some load and churn leads to millisecond-sized errors. A migrating independent ntpd clock is worse [...].
  • RADclock does raw counter reads, not clock reads and is independent of the clock algorithm. Has a synchronization algorithm based on raw timestamps and server timestamps with feed-forward.
  • Feed-forward allows you to define a “difference clock”, which can measure time differences over small timescales very accurately; this is inherently difficult in a feedback-based system. (A sketch of the two read functions appears after this list.)
  • Now have dom0 run a RADclock daemon that talks to the time server, and makes the clock information available through XenStore.
  • For migration, only the clock-reading function actually migrates, and no state migrates. However, since we can have asymmetry and it will be different on a different host, this will result in a small clock jump. (No algorithm can do anything about this.)
  • Migration experiment. The ntpd clock has errors on the order of seconds and convergence time on the order of hours. The RADclock-based solution is much more stable.
  • Q: What about the protocol for network synchronisation with RADclock: could you keep the NTP protocol to allow incremental deployment? We do use the NTP protocol.
  • Q: How much is RADclock buying you over NTP when the biggest gain comes from using independent clocks? Could you modify NTP to do dependent clocks well? Problem is that two heads try to control the same state.
  • Q: How does your clock perform under a workload that might cause temperature variation? High temperature is not a problem: rapidly-changing temperature is the problem.
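
Aside: a small Python sketch of the feed-forward clock reads described above, assuming the daemon publishes a period estimate p (seconds per counter tick) and an offset K (e.g. via XenStore). The function names and values are illustrative, not RADclock’s actual interface; the point is that the read functions are stateless and can migrate with a VM.

    def absolute_time(counter, p, K):
        """Absolute time: subject to drift, corrected as the daemon refines K."""
        return K + p * counter

    def time_difference(counter_start, counter_end, p):
        """Difference clock: depends only on the short-term period estimate p,
        so small intervals are measured very accurately."""
        return p * (counter_end - counter_start)

    p = 1.0 / 1e9                 # e.g. a 1 GHz Xen clocksource tick
    K = 1.287e9                   # offset obtained from synchronization
    t0, t1 = 4_000_000_000, 4_000_050_000
    print(absolute_time(t1, p, K), time_difference(t0, t1, p))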
