CamCube - Rethinking the Data Center Cluster

Paolo Costa
costa@imperial.ac.uk
joint work with Ant Rowstron, Austin Donnelly, and Greg O’Shea (MSR Cambridge), and Hussam Abu-Libdeh and Simon Schubert (interns)
• Data Centers
– Large group of networked servers
– Typically built from commodity hardware
• Scale-out vs. scale-up
– Many low-cost PCs rather than a few expensive high-end servers
– Economies of scale
• Custom software stack (the focus of this talk)
– Microsoft: Autopilot, Dryad
– Google: GFS, BigTable, MapReduce
– Yahoo: Hadoop, HDFS
– Amazon: Dynamo, Astrolabe
– Facebook: Cassandra, Scribe, HBase
• Mismatch between abstraction & reality
• How programmers see the network:
– Key-based
– Overlay networks
– Explicit structure
– Symmetry of role
– …
[Figure: conventional data center topology: pods of ~12 racks, ~20-40 servers per rack, 10 Gbps links between switch tiers]

• Data center services are distributed, but they have no control over the network:
– Must infer network properties (e.g., locality)
– Must use inefficient overlay networks
– No bandwidth guarantees
• Top-down (applications)
– Custom routing? Use overlays
– Priority traffic? Use multiple TCP flows
– Locality? Use traceroute and ping
– TCP Incast? Add random jitter
• Bottom-up (network)
– More bandwidth? Use richer topologies (e.g., fat-trees)
– Fate-sharing? Use prediction models to estimate flow length / traffic patterns
– TCP Incast? Use ECN in routers
• The network is still a black box
– good for a multi-owned, heterogeneous Internet
– limiting for a single-owned, homogeneous data center
• DCs are not mini-Internets:
– single owner / administration domain
– we know (and define) the topology
– low hardware/software diversity
– no malicious or selfish nodes
• Is TCP/IP still appropriate?
• “A data center from the perspective of a distributed systems builder”
• What happens if you integrate network & compute?

[Figure: CamCube at the intersection of Distributed Systems, Networking, and HPC]
[Figure: CamCube architecture: a direct-connect topology, a link- and key-based API, with TCP/IP and service composition layered on top]
[Figure: 3D torus of servers with (x,y,z) coordinates, e.g. (1,2,1) and (1,2,2)]

• No distinction between overlay and underlying network
• Nodes have an (x,y,z) coordinate
– defines the key-space
• Simple one-hop API to send/receive packets
– all other functionality implemented as services
• Key coordinates are re-mapped in case of failures
– enables a KBR-like API
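The key-based routing this enables can be sketched in a few lines. A minimal sketch (plain Python, not the actual CamCube API; `next_hop` and `route` are illustrative names), assuming a 3x3x3 torus where each server links to its six axis neighbours:

```python
# Illustrative sketch (not the CamCube API): greedy key-based routing on a
# 3x3x3 torus. At each hop, move one step along a dimension where the current
# coordinate differs from the destination key, taking the shorter
# wrap-around direction along that ring.
K = 3  # servers per dimension (3x3x3 = 27 servers)

def next_hop(cur, dst):
    """Return the one-hop neighbour of `cur` that is closer to `dst`."""
    for dim in range(3):
        diff = (dst[dim] - cur[dim]) % K
        if diff == 0:
            continue
        step = 1 if diff <= K // 2 else -1  # shorter way around the ring
        nxt = list(cur)
        nxt[dim] = (cur[dim] + step) % K
        return tuple(nxt)
    return cur  # already at the destination

def route(src, dst):
    """Full path from src to dst, crossing one physical link per hop."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst))
    return path
```

Greedy per-dimension stepping like this yields shortest paths on the torus; in CamCube, routing itself is just another service built on the one-hop API, so applications can replace this policy with their own.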
[Figure: a “Fetch mail” request for key costa:1234 travels hop by hop from (0,0,0) through (1,0,0) to (2,0,0); when a server fails, the key is re-mapped and the request is re-routed, e.g. via (1,1,0)]

• Packets move from service to service and server to server
• Services can intercept and process packets along the way
• Keys are re-mapped in case of failures
• Server resources (RAM, disks, CPUs) are accessible to all services
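The interception model above can be illustrated with a toy service chain. A hypothetical sketch (class and function names are invented for illustration, not CamCube’s real API): each server runs its packets through a chain of services, and any service may answer a request en route instead of forwarding it.

```python
# Hypothetical sketch of per-hop service interception (illustrative names,
# not the real CamCube API). Each server passes a packet through its service
# chain; a service returns "deliver" to consume the packet at this hop or
# "forward" to let it continue toward the server owning the destination key.

class CacheService:
    """Example service: answers a request en route if it holds the key."""
    def __init__(self):
        self.cache = {}

    def on_packet(self, packet):
        key = packet["key"]
        if packet["op"] == "get" and key in self.cache:
            packet["op"] = "reply"          # consume the request here...
            packet["value"] = self.cache[key]
            return "deliver"                # ...instead of forwarding it
        return "forward"

def process_at_server(services, packet):
    """Run the packet through this server's service chain."""
    for service in services:
        if service.on_packet(packet) == "deliver":
            return packet                   # intercepted at this hop
    return None                             # keep forwarding to the next hop

# A cached entry lets an intermediate server reply without reaching (2,0,0):
svc = CacheService()
svc.cache["costa:1234"] = "3 new messages"
pkt = {"op": "get", "key": "costa:1234", "dst": (2, 0, 0)}
result = process_at_server([svc], pkt)
```

This is the design point the slides emphasise: because every hop is a server, functionality like caching can live on the path rather than only at the endpoints.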
• 27-server (3x3x3) prototype
– quad-core Intel Xeon 5520 @ 2.27 GHz
– 12 GB RAM
– 6 Intel PRO/1000 PT 1 Gbps ports
– runtime & services implemented in C#
• Multicast, key-value store, MapReduce, graph processing engine, …
• Experiment
– all servers saturating all 6 links
– 11.94 Gbps (9 KB packets) using < 22% CPU
• MapReduce
– originally developed by Google
– open-source version (Hadoop) used by several companies: Yahoo, Facebook, Twitter, Amazon, …
– many applications: data mining, search, indexing, …
• Simple abstraction
– map: apply a function that generates (key, value) pairs
– reduce: combine intermediate results
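The abstraction can be made concrete with the classic word-count example. A minimal single-process sketch (plain Python, not Hadoop or Camdoop):

```python
# Minimal illustration of the map/reduce abstraction: map emits (key, value)
# pairs, reduce combines all values that share a key.
from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: combine all counts for one word."""
    return (key, sum(values))

def map_reduce(documents):
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: group pairs by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

counts = map_reduce(["a rose is a rose", "a daisy"])
```

Here the shuffle is just a dictionary; in a real deployment the grouped pairs are shipped across the network between the map and reduce phases, and that shuffle traffic is exactly what the rest of this talk targets.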
[Figure: all-to-one shuffle: every map task sends its full output to a single reduce task]
Single reduce task:
+ results stored on a single node
+ can handle all reduce functions
- bottleneck (network and CPU)
[Figure: all-to-all shuffle: each map task partitions its output across many reduce tasks]

Multiple reduce tasks:
+ CPU & network load distributed
- N² flows: needs full bisection bandwidth
- results are split across R servers
- not always applicable (e.g., max)
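The all-to-all shuffle above boils down to hash partitioning: each of the N map tasks splits its output into R buckets, one per reducer, so every map task opens a flow to every reducer. A small sketch of that partitioning step (illustrative, not Hadoop’s actual Partitioner API):

```python
# Sketch of a partitioned shuffle: each map task hash-partitions its
# (key, value) pairs across R reducers, so all values for a given key land
# at the same reducer, and there are N x R flows in total.
def partition(pairs, R):
    """Split one map task's output into R per-reducer buckets."""
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[hash(key) % R].append((key, value))
    return buckets

R = 4
map_outputs = [[("a", 1), ("b", 1)],   # 3 map tasks' intermediate pairs
               [("a", 1), ("c", 1)],
               [("b", 1)]]
# One network flow per (map task, reducer) pair: 3 x 4 = 12 flows.
flows = {(m, r): partition(out, R)[r]
         for m, out in enumerate(map_outputs) for r in range(R)}
```

Many of these flows cross the core of the network simultaneously, which is why this pattern needs full bisection bandwidth to run at speed.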
[Figure: back to the all-to-one case, where the single reduce task is the bottleneck]

Use hierarchical aggregation!
[Figure: map tasks arranged in a tree; intermediate results are combined at Aggregate nodes on the way to the root]

Intermediate results can be aggregated at each hop.

In-network aggregation:
+ traffic reduced at each hop
+ no need for full bisection bandwidth
+ final results stored on a single node
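The hop-by-hop combining above can be sketched as a recursive walk over the aggregation tree, assuming the reduce function is commutative and associative (e.g., summing counts). The tree shape and names below are invented for illustration:

```python
# Sketch of in-network aggregation: instead of shipping every map output to
# the root reducer, each node on the tree merges its children's partial
# results with its own before forwarding, so traffic shrinks at every hop.
# Assumes a commutative/associative reduce function (here: summing counts).
def aggregate_up(tree, node, local_counts):
    """Combine this node's counts with its children's aggregated counts."""
    combined = dict(local_counts[node])
    for child in tree.get(node, []):
        for key, value in aggregate_up(tree, child, local_counts).items():
            combined[key] = combined.get(key, 0) + value
    return combined  # one merged stream toward the root

# Invented example: a tree rooted at server A, each holding partial counts.
tree = {"A": ["B", "C"], "B": ["D"]}
local_counts = {"A": {"x": 1}, "B": {"x": 2, "y": 1},
                "C": {"y": 4}, "D": {"x": 5}}
totals = aggregate_up(tree, "A", local_counts)
```

Each tree edge now carries one merged dictionary instead of every descendant’s raw output, which is why traffic is reduced at each hop and full bisection bandwidth is no longer needed.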
• Camdoop builds 6 disjoint trees (rather than a single tree)
– failure resilient
– (up to) 6 Gbps throughput per server
• 27-server testbed
• 27 GB input size (1 GB / server)
• Ratio of output size to intermediate size (S)
– S = 1/N ≈ 0 (full aggregation)
– S = 1 (no aggregation)
• Two configurations:
– All-to-one
– All-to-all
[Figure: completion time in seconds (log scale, lower is better) vs. the ratio S of output size to intermediate size, from full aggregation (S ≈ 0) to no aggregation (S = 1). Series: switch, Camdoop (no agg), Camdoop, and the 1 Gbps switch bound. The switch baseline’s performance is independent of S; the gap to Camdoop (no agg) shows the impact of CamCube’s higher bandwidth; Camdoop’s time drops further as aggregation increases. An annotation marks the Facebook-reported aggregation ratio.]
[Figure: completion time in milliseconds (lower is better) for small inputs of 20 KB and 200 KB per server, comparing switch, Camdoop (no agg), and Camdoop]

In-network aggregation is beneficial also for low-latency services.
• Building large-scale services is hard
– mismatch between physical and logical topology
– the network is a black box
• CamCube
– designed from the perspective of developers
– explicit topology and key-based abstraction
• Benefits
– simplified development
– more efficient applications and better fault-tolerance

H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. “Symbiotic Routing for Future Data Centres”. In ACM SIGCOMM, 2010.
P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea. “Camdoop: Exploiting In-network Aggregation for Big Data Applications”. In USENIX NSDI, 2012.

Paolo Costa
http://www.doc.ic.ac.uk/~costa