Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Paolo Costa costa@imperial.ac.uk joint work with Ant Rowstron, Austin Donnelly, and Greg O’Shea (MSR Cambridge) Hussam Abu-Libdeh, Simon Schubert (Interns) Paolo Costa CamCube - Rethinking the Data Center Cluster 2/54 • Data Centers – Large group of networked servers – Typically built from commodity hardware • Scale-out vs. scale-up – Many low-cost PCs rather than few expensive high-end servers – Economies of scale • Custom software stack – – – – – Microsoft: Autopilot, Dryad Google: GFS, BigTable, MapReduce Yahoo: Hadoop, HDFS Amazon: Dynamo, Astrolabe Facebook: Cassandra, Scribe, HBase Paolo Costa CamCube - Rethinking the Data Center Cluster Focus of this talk 3/54 • Mismatch between abstraction & reality • How programmers see the network Paolo Costa CamCube - Rethinking the Data Center Cluster Key-based Overlay networks Explicit structure Symmetry of role … 4/54 Paolo Costa CamCube - Rethinking the Data Center Cluster 5/54 ~12 racks / pod 10 Gbps 10 Gbps ~ 20-40 servers / rack Paolo Costa CamCube - Rethinking the Data Center Cluster 6/54 ~12 racks / pod 10 Gbps 10 Gbps ~ 20-40 servers / rack Paolo Costa CamCube - Rethinking the Data Center Cluster 7/54 ~12 racks / pod 10 Gbps Data center services are distributed but they have no control over the network 10 Gbps Must infer network properties (e.g., locality) Must use inefficient overlay networks No bandwidth guarantee ~ 20-40 servers / rack Paolo Costa CamCube - Rethinking the Data Center Cluster 8/54 • Top-down (applications) • Bottom-up (network) – Custom routing? • Overlays • Use richer topologies (e.g., fat-trees) – Priority traffic? • Use multiple TCP flows – Locality? • Use traceroute and ping – TCP Incast? • Add random jitter Paolo Costa – More bandwidth? – Fate-sharing? • Use prediction models to estimate flow length / traffic patterns – TCP Incast? • Use ECN in routers CamCube - Rethinking the Data Center Cluster 9/24 • The network is still a black box Good for a multi-owned, heterogeneous Internet Limiting for a single-owned, homogenous data center • DCs are not mini-Internets – – – – single owner / administration domain we know (and define) the topology low hardware/software diversity no malicious or selfish node • Is TCP/IP still appropriate? Paolo Costa CamCube - Rethinking the Data Center Cluster 10/54 • “A data center from the perspective of a distributed • systems builder” What happens if you integrate network & compute? Distributed Systems CamCube HPC Paolo Costa Networking CamCube - Rethinking the Data Center Cluster 11/54 TCP/IP Service composition Architecture Link and key-based API Topology Paolo Costa CamCube - Rethinking the Data Center Cluster 12/54 (1,2,2) (1,2,2) (1,2,1) (1,2,1) y • No distinction between overlay • and underlying network Nodes have (x,y,z)coordinate z – defines key-space • Simple one hop API to x send/receive packets – all other functionality implemented as services • Key coordinates are re-mapped in case of failures – enable a KBR-like API Paolo Costa CamCube - Rethinking the Data Center Cluster 13/54 (1,2,2) (1,2,1) y • No distinction between overlay • and underlying network Nodes have (x,y,z)coordinate z – defines key-space • Simple one hop API to x send/receive packets – all other functionality implemented as services • Key coordinates are re-mapped in case of failures – enable a KBR-like API Paolo Costa CamCube - Rethinking the Data Center Cluster 14/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 15/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 16/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 17/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 18/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 19/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 20/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 21/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 22/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 23/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 24/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 25/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 26/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 27/54 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 28/54 (1,1,0) (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 29/54 (1,1,0) (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 30/54 (1,1,0) (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 31/54 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 32/54 • 27-server (3x3x3) prototype – quad-core Intel Xeon 5520 2.27 Ghz – 12GB RAM – 6 Intel PRO/1000 PT 1 Gbps ports – Runtime & services implemented in C# • Multicast, key-value store, MapReduce, graph processing engine, … • Experiment – all servers saturating 6 links – 11.94 Gbps (9K pkts) using < 22% CPU Paolo Costa CamCube - Rethinking the Data Center Cluster 33/54 • MapReduce – originally developed by Google – open-source version (Hadoop) used by several companies • Yahoo, Facebook, Twitter, Amazon, … – many applications • data mining, search, indexing… • Simple abstraction – map: apply a function that generates (key, value) pairs – reduce: combine intermediate results Paolo Costa CamCube - Rethinking the Data Center Cluster 34/54 Reduce Map Map Map Paolo Costa Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 35/54 Results stored on a single node Can handle all reduce functions Bottleneck (network and CPU) Map Map Map Map Map Map Map Map Map Map Map Map Reduce Reduce Paolo Costa CamCube - Rethinking the Data Center Cluster 36/54 CPU & network load distributed N2 flows: needs full bisection bw results are split across R servers not always applicable (e.g., max) Map Map Map Map Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Paolo Costa CamCube - Rethinking the Data Center Cluster 37/54 Reduce Map Map Map Paolo Costa Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 38/54 Use hierarchical aggregation! Reduce Map Map Map Paolo Costa Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 39/54 Map Map Map Map Map Map Map Map Map Map Paolo Costa Map Map Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 40/54 Intermediate results can be aggregated at each hop Aggregate Aggregate Map Map Map Paolo Costa Aggregate Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 41/54 Paolo Costa CamCube - Rethinking the Data Center Cluster 42/54 In-network aggregation Paolo Costa CamCube - Rethinking the Data Center Cluster 43/54 Traffic reduced at each hop No need for full bisection bw Final results stored on a single node In-network aggregation Paolo Costa CamCube - Rethinking the Data Center Cluster 44/54 • Camdoop builds 1 tree – failure resilient – (up to) 6 Gbps throughput / server Paolo Costa CamCube - Rethinking the Data Center Cluster 45/54 • Camdoop builds 1 tree 6 disjoint trees – failure resilient – (up to) 6 Gbps throughput / server Paolo Costa CamCube - Rethinking the Data Center Cluster 46/54 • 27-server testbed • 27 GB input size – 1GB / server • Output size/ intermediate size (S) – S=1/N ≈ 0 (full aggregation) – S=1 (no aggregation) • Two configurations: – All-to-one – All-to-all Paolo Costa CamCube - Rethinking the Data Center Cluster 47/54 Time (s) logscale Worse 1000 100 switch 10 Camdoop (no agg) Camdoop Switch 1 Gbps (bound) Better 1 0 Full aggregation Paolo Costa 0.2 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 48/54 Impact of aggregation Impact of higher bandwidth Performance independent of S Time (s) logscale Worse 1000 100 switch 10 Camdoop (no agg) Facebook reported aggregation ratio Better 1 0 Full aggregation Paolo Costa 0.2 Camdoop Switch 1 Gbps (bound) 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 49/54 Worse 1000 switch Time (s) logscale Camdoop (no agg) Camdoop 100 Switch 1 Gbps (bound) 10 Better 1 0 Full aggregation Paolo Costa 0.2 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 50/54 Impact of higher bandwidth Worse 1000 switch Time (s) logscale Impact of aggregation Camdoop (no agg) Camdoop 100 Switch 1 Gbps (bound) 10 Better 1 0 Full aggregation Paolo Costa 0.2 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 51/54 Worse 60 Time (ms) 50 switch Camdoop (no agg) Camdoop 40 30 20 10 0 Better 20 KB 200 KB Input data size / server Paolo Costa CamCube - Rethinking the Data Center Cluster 52/54 In-network aggregation beneficial also for low-latency services Worse 60 Time (ms) 50 switch Camdoop (no agg) Camdoop 40 30 20 10 0 Better 20 KB 200 KB Input data size / server Paolo Costa CamCube - Rethinking the Data Center Cluster 53/54 • Building large-scale services is hard – mismatch between physical and logical topology – the network is a black box • CamCube – designed from the perspective of developers – explicit topology and key-based abstraction • Benefits simplified development more efficient applications and better fault-tolerance H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. “Symbiotic Routing for Future Data Centres”, In ACM SIGCOMM, 2010. P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea. “Camdoop: Exploiting Innetwork Aggregation for Big Data Applications”, In USENIX NSDI, 2012. Paolo Costa http://www.doc.ic.ac.uk/~costa