Download (slides)

Paolo Costa costa@imperial.ac.uk joint work with Ant Rowstron, Austin Donnelly, and Greg O’Shea (MSR Cambridge) Hussam Abu-Libdeh, Simon Schubert (Interns) Paolo Costa CamCube - Rethinking the Data Center Cluster 2/54 • Data Centers – Large group of networked servers – Typically built from commodity hardware • Scale-out vs. scale-up – Many low-cost PCs rather than few expensive high-end servers – Economies of scale • Custom software stack – – – – – Microsoft: Autopilot, Dryad Google: GFS, BigTable, MapReduce Yahoo: Hadoop, HDFS Amazon: Dynamo, Astrolabe Facebook: Cassandra, Scribe, HBase Paolo Costa CamCube - Rethinking the Data Center Cluster Focus of this talk 3/54 • Mismatch between abstraction & reality • How programmers see the network Paolo Costa      CamCube - Rethinking the Data Center Cluster Key-based Overlay networks Explicit structure Symmetry of role … 4/54 Paolo Costa CamCube - Rethinking the Data Center Cluster 5/54 ~12 racks / pod 10 Gbps 10 Gbps ~ 20-40 servers / rack Paolo Costa CamCube - Rethinking the Data Center Cluster 6/54 ~12 racks / pod 10 Gbps 10 Gbps ~ 20-40 servers / rack Paolo Costa CamCube - Rethinking the Data Center Cluster 7/54 ~12 racks / pod 10 Gbps Data center services are distributed but they have no control over the network 10 Gbps Must infer network properties (e.g., locality) Must use inefficient overlay networks No bandwidth guarantee ~ 20-40 servers / rack Paolo Costa CamCube - Rethinking the Data Center Cluster 8/54 • Top-down (applications) • Bottom-up (network) – Custom routing? • Overlays • Use richer topologies (e.g., fat-trees) – Priority traffic? • Use multiple TCP flows – Locality? • Use traceroute and ping – TCP Incast? • Add random jitter Paolo Costa – More bandwidth? – Fate-sharing? • Use prediction models to estimate flow length / traffic patterns – TCP Incast? • Use ECN in routers CamCube - Rethinking the Data Center Cluster 9/24 • The network is still a black box  Good for a multi-owned, heterogeneous Internet  Limiting for a single-owned, homogenous data center • DCs are not mini-Internets – – – – single owner / administration domain we know (and define) the topology low hardware/software diversity no malicious or selfish node • Is TCP/IP still appropriate? Paolo Costa CamCube - Rethinking the Data Center Cluster 10/54 • “A data center from the perspective of a distributed • systems builder” What happens if you integrate network & compute? Distributed Systems CamCube HPC Paolo Costa Networking CamCube - Rethinking the Data Center Cluster 11/54 TCP/IP Service composition Architecture Link and key-based API Topology Paolo Costa CamCube - Rethinking the Data Center Cluster 12/54 (1,2,2) (1,2,2) (1,2,1) (1,2,1) y • No distinction between overlay • and underlying network Nodes have (x,y,z)coordinate z – defines key-space • Simple one hop API to x send/receive packets – all other functionality implemented as services • Key coordinates are re-mapped in case of failures – enable a KBR-like API Paolo Costa CamCube - Rethinking the Data Center Cluster 13/54 (1,2,2) (1,2,1) y • No distinction between overlay • and underlying network Nodes have (x,y,z)coordinate z – defines key-space • Simple one hop API to x send/receive packets – all other functionality implemented as services • Key coordinates are re-mapped in case of failures – enable a KBR-like API Paolo Costa CamCube - Rethinking the Data Center Cluster 14/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 15/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 16/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 17/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 18/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 19/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 20/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 21/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 22/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 23/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 24/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 25/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 26/54 Fetch mail costa:1234 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 27/54 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 28/54 (1,1,0) (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 29/54 (1,1,0) (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 30/54 (1,1,0) (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 31/54 (0,0,0) • • • • (1,0,0) (2,0,0) Packets move from service to service and server to server Services can intercept and process packets along the way Keys are re-mapped in case of failures Server resources (RAM, Disks, CPUs) are accessible to all services Paolo Costa CamCube - Rethinking the Data Center Cluster 32/54 • 27-server (3x3x3) prototype – quad-core Intel Xeon 5520 2.27 Ghz – 12GB RAM – 6 Intel PRO/1000 PT 1 Gbps ports – Runtime & services implemented in C# • Multicast, key-value store, MapReduce, graph processing engine, … • Experiment – all servers saturating 6 links – 11.94 Gbps (9K pkts) using < 22% CPU Paolo Costa CamCube - Rethinking the Data Center Cluster 33/54 • MapReduce – originally developed by Google – open-source version (Hadoop) used by several companies • Yahoo, Facebook, Twitter, Amazon, … – many applications • data mining, search, indexing… • Simple abstraction – map: apply a function that generates (key, value) pairs – reduce: combine intermediate results Paolo Costa CamCube - Rethinking the Data Center Cluster 34/54 Reduce Map Map Map Paolo Costa Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 35/54  Results stored on a single node  Can handle all reduce functions  Bottleneck (network and CPU) Map Map Map Map Map Map Map Map Map Map Map Map Reduce Reduce Paolo Costa CamCube - Rethinking the Data Center Cluster 36/54  CPU & network load distributed  N2 flows: needs full bisection bw  results are split across R servers  not always applicable (e.g., max) Map Map Map Map Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Paolo Costa CamCube - Rethinking the Data Center Cluster 37/54 Reduce Map Map Map Paolo Costa Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 38/54 Use hierarchical aggregation! Reduce Map Map Map Paolo Costa Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 39/54 Map Map Map Map Map Map Map Map Map Map Paolo Costa Map Map Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 40/54 Intermediate results can be aggregated at each hop Aggregate Aggregate Map Map Map Paolo Costa Aggregate Map Map Map Map Map Map CamCube - Rethinking the Data Center Cluster Map Map Map 41/54 Paolo Costa CamCube - Rethinking the Data Center Cluster 42/54 In-network aggregation Paolo Costa CamCube - Rethinking the Data Center Cluster 43/54  Traffic reduced at each hop  No need for full bisection bw  Final results stored on a single node In-network aggregation Paolo Costa CamCube - Rethinking the Data Center Cluster 44/54 • Camdoop builds 1 tree – failure resilient – (up to) 6 Gbps throughput / server Paolo Costa CamCube - Rethinking the Data Center Cluster 45/54 • Camdoop builds 1 tree 6 disjoint trees – failure resilient – (up to) 6 Gbps throughput / server Paolo Costa CamCube - Rethinking the Data Center Cluster 46/54 • 27-server testbed • 27 GB input size – 1GB / server • Output size/ intermediate size (S) – S=1/N ≈ 0 (full aggregation) – S=1 (no aggregation) • Two configurations: – All-to-one – All-to-all Paolo Costa CamCube - Rethinking the Data Center Cluster 47/54 Time (s) logscale Worse 1000 100 switch 10 Camdoop (no agg) Camdoop Switch 1 Gbps (bound) Better 1 0 Full aggregation Paolo Costa 0.2 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 48/54 Impact of aggregation Impact of higher bandwidth Performance independent of S Time (s) logscale Worse 1000 100 switch 10 Camdoop (no agg) Facebook reported aggregation ratio Better 1 0 Full aggregation Paolo Costa 0.2 Camdoop Switch 1 Gbps (bound) 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 49/54 Worse 1000 switch Time (s) logscale Camdoop (no agg) Camdoop 100 Switch 1 Gbps (bound) 10 Better 1 0 Full aggregation Paolo Costa 0.2 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 50/54 Impact of higher bandwidth Worse 1000 switch Time (s) logscale Impact of aggregation Camdoop (no agg) Camdoop 100 Switch 1 Gbps (bound) 10 Better 1 0 Full aggregation Paolo Costa 0.2 0.4 0.6 Output size / intermediate size (S) CamCube - Rethinking the Data Center Cluster 0.8 1 No aggregation 51/54 Worse 60 Time (ms) 50 switch Camdoop (no agg) Camdoop 40 30 20 10 0 Better 20 KB 200 KB Input data size / server Paolo Costa CamCube - Rethinking the Data Center Cluster 52/54 In-network aggregation beneficial also for low-latency services Worse 60 Time (ms) 50 switch Camdoop (no agg) Camdoop 40 30 20 10 0 Better 20 KB 200 KB Input data size / server Paolo Costa CamCube - Rethinking the Data Center Cluster 53/54 • Building large-scale services is hard – mismatch between physical and logical topology – the network is a black box • CamCube – designed from the perspective of developers – explicit topology and key-based abstraction • Benefits simplified development more efficient applications and better fault-tolerance H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. “Symbiotic Routing for Future Data Centres”, In ACM SIGCOMM, 2010. P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea. “Camdoop: Exploiting Innetwork Aggregation for Big Data Applications”, In USENIX NSDI, 2012. Paolo Costa http://www.doc.ic.ac.uk/~costa

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download (slides)