Tapestry Deployment and
Fault-tolerant Routing
Ben Y. Zhao
L. Huang, S. Rhea, J. Stribling,
A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat
January 2003
Scaling Network Applications

• Complexities of global deployment
  - Network unreliability
  - Lack of administrative control over components
  - BGP slow convergence, redundancy unexploited
  - Constrains protocol deployment: multicast, congestion ctrl.
• Management of large-scale resources / components
  - Locate, utilize resources despite failures

Enabling Technology: DOLR
(Decentralized Object Location and Routing)

[Figure: clients publish and locate objects by GUID (GUID1, GUID2) through the DOLR overlay]

What is Tapestry?

• DOLR driving OceanStore global storage
  (Zhao, Kubiatowicz, Joseph et al. 2000)
• Network structure
  - Nodes assigned bit-sequence nodeIds from namespace 0-2^160, based on some radix (e.g. 16)
  - Keys from the same namespace
  - Keys dynamically map to 1 unique live node: the root
• Base API (see the interface sketch below)
  - Publish / Unpublish (ObjectID)
  - RouteToNode (NodeId)
  - RouteToObject (ObjectID)

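A minimal sketch of what the base API above might look like as a Java interface; the type and method names (DolrApi, Guid, NodeId) are illustrative placeholders, not the actual Tapestry classes.

    // Hypothetical sketch of the DOLR base API listed above.
    final class Guid { /* 160-bit object identifier (placeholder) */ }
    final class NodeId { /* 160-bit node identifier (placeholder) */ }

    public interface DolrApi {
        // Advertise / withdraw a locally stored object under its GUID
        void publish(Guid objectId);
        void unpublish(Guid objectId);

        // Route a message to the unique live node whose ID matches `destination`
        void routeToNode(NodeId destination, byte[] message);

        // Route a message toward the object's root, delivering to a nearby replica
        void routeToObject(Guid objectId, byte[] message);
    }
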
Tapestry Mesh

[Figure: Tapestry routing mesh; nodes are labeled with hex NodeIDs (0xEF97, 0xEF32, 0xE399, 0xEF34, 0xE555, 0x0999, ...) and links are annotated with routing levels 1-4]

Object Location

[Figure]

Talk Outline

• Introduction
• Architecture
  - Node architecture
  - Node implementation
• Deployment Evaluation
• Fault-tolerant Routing

Single Node Architecture

[Layer diagram, top to bottom:]
  Applications: Decentralized File Systems | Application-Level Multicast | Approximate Text Matching
  Application Interface / Upcall API
  Dynamic Node Management | Routing Table & Router | Object Pointer DB
  Network Link Management
  Transport Protocols

Single Node Implementation

[Block diagram, top to bottom:]
  Applications (API calls down, upcalls up; enter/leave Tapestry)
  Application Programming Interface
  Dynamic Tapestry (state maint., node ins/del) | Core Router (route to node / obj) | Patchwork (distance map, UDP pings) | Routing Link Maintenance
  Network Stage
  SEDA Event-driven Framework
  Java Virtual Machine

Deployment Status

• C simulator
  - Packet-level simulation
  - Scales up to 10,000 nodes
• Java implementation
  - 50,000 semicolons of Java, 270 class files
  - Deployed on local-area cluster (40 nodes)
  - Deployed on PlanetLab global network (~100 distributed nodes)

Talk Outline

• Introduction
• Architecture
• Deployment Evaluation
  - Micro-benchmarks
  - Stable network performance
  - Single and parallel node insertion
• Fault-tolerant Routing

Micro-benchmark Methodology

[Setup: Sender (Control + Tapestry) -> LAN link -> Receiver (Control + Tapestry)]

• Experiment run in a LAN over GBit Ethernet
• Sender sends 60,001 messages at full speed
• Measure inter-arrival time for the last 50,000 msgs
  - Skipping the first 10,000 msgs removes cold-start effects
  - Averaging over 50,000 msgs removes network jitter effects

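A minimal sketch of how message rate and throughput might be derived from the recorded inter-arrival times; the variable names are illustrative, not taken from the benchmark harness.

    // Sketch: turn receiver-side inter-arrival gaps into msgs/sec and MB/s.
    public final class ThroughputCalc {
        public static void report(long[] interArrivalNanos, int messageSizeBytes) {
            long totalNanos = 0;
            for (long gap : interArrivalNanos) {
                totalNanos += gap;                      // sum the gaps of the last 50,000 msgs
            }
            double seconds = totalNanos / 1e9;
            double msgsPerSec = interArrivalNanos.length / seconds;
            double mbPerSec = msgsPerSec * messageSizeBytes / (1024.0 * 1024.0);
            System.out.printf("%.0f msgs/sec, %.2f MB/s%n", msgsPerSec, mbPerSec);
        }
    }
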
Micro-benchmark Results

[Plots: "Message Processing Latency" (time / msg (ms) vs. message size (KB)) and "Sustainable Throughput" (TPut (MB/s) vs. message size (KB))]

• Constant processing overhead ~50 µs
• Latency dominated by byte copying
• For 5 KB messages, throughput = ~10,000 msgs/sec

Large Scale Methodology

• PlanetLab global network
  - 101 machines at 42 institutions, in North America, Europe, Australia (~60 machines utilized)
  - 1.26 GHz PIII (1 GB RAM), 1.8 GHz P4 (2 GB RAM)
  - North American machines (2/3) on Internet2
• Tapestry Java deployment
  - 6-7 nodes on each physical machine
  - IBM Java JDK 1.30
  - Node virtualization inside JVM and SEDA
  - Scheduling between virtual nodes increases latency

Node to Node Routing

[Plot: RDP (min, median, 90th percentile) vs. internode RTT ping time (5 ms buckets); median = 31.5, 90th percentile = 135]

• RDP = ratio of end-to-end routing latency to shortest ping distance between nodes
• All node pairs measured, placed into buckets

Object Location

[Plot: RDP (min, median, 90th percentile) vs. client-to-object RTT ping time (1 ms buckets); 90th percentile = 158]

• RDP = ratio of end-to-end latency for object location to shortest ping distance between client and object location
• Each node publishes 10,000 objects; lookup on all objects

Latency to Insert Node

[Plot: integration latency (ms) vs. size of existing network (nodes)]

• Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing Tapestry
• Humps due to expected filling of each routing level

Bandwidth to Insert Node

[Plot: control traffic bandwidth (KB) vs. size of existing network (nodes)]

• Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized over each node in the network
• Per-node bandwidth decreases with the size of the network

Parallel Insertion Latency

[Plot: latency to convergence (ms) vs. ratio of insertion group size to network size; 90th percentile = 55042]

• Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes
• Shown as a function of insertion group size / network size

Talk Outline

• Introduction
• Architecture
• Deployment Evaluation
• Fault-tolerant Routing
  - Tunneling through scalable overlays
  - Example using Tapestry

Adaptive and Resilient Routing

• Goals
  - Reachability as a service
  - Agility / adaptability in routing
  - Scalable deployment
  - Useful for all client endpoints

Existing Redundancy in DOLR/DHTs

• Fault detection via soft-state beacons
  - Periodically sent to each node in the routing table
  - Scales logarithmically with the size of the network
  - Worst-case overhead: 2^40 nodes, 160-bit IDs -> 20 hex digits;
    1 beacon/sec, 100 B each = 240 kbps (rough arithmetic sketched below)
  - Can minimize B/W w/ better techniques (Hakim, Shelley)
• Precomputed backup routes
  - Intermediate hops in the overlay path are flexible
  - Keep a list of backups for outgoing hops
    (e.g. 3 node pointers for each route entry in Tapestry)
  - Maintain backups using node membership algorithms (no additional overhead)

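A rough back-of-the-envelope check of the beacon overhead quoted above. The routing-table size used here (20 hex-digit levels x 15 entries per level, ~300 neighbors) is an assumption chosen to be consistent with the slide's 240 kbps figure, not a number taken from the Tapestry implementation.

    // Rough arithmetic sketch for soft-state beacon overhead (assumed table size).
    public final class BeaconOverhead {
        public static void main(String[] args) {
            int levels = 20;            // assumed populated hex-digit routing levels
            int entriesPerLevel = 15;   // assumed entries per level
            int beaconBytes = 100;      // 100 B per beacon (from the slide)
            double beaconsPerSec = 1.0; // 1 beacon/sec per routing-table entry

            int neighbors = levels * entriesPerLevel;   // ~300 neighbors
            double kbps = neighbors * beaconsPerSec * beaconBytes * 8 / 1000.0;
            // 300 entries * 100 B/s * 8 bits = 240,000 bps = 240 kbps
            System.out.printf("%d neighbors -> %.0f kbps of beacon traffic%n", neighbors, kbps);
        }
    }
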
Bootstrapping Non-overlay Endpoints

• Goal
  - Allow non-overlay nodes to benefit
  - Endpoints communicate via overlay proxies
• Example: legacy nodes L1, L2
  - Li registers w/ a nearby overlay proxy Pi
  - Pi assigns Li a proxy name Di, s.t. Di is the closest possible unique name to Pi
    (e.g. start w/ Pi, increment for each node; see the sketch below)
  - L1 and L2 exchange new proxy names
  - Messages route to the nodes using their proxy names

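A minimal sketch of the proxy-name assignment rule above (start from the proxy's own ID and increment until an unused name is found). The BigInteger IDs and the ProxyNamer class are illustrative, not part of the Tapestry API.

    import java.math.BigInteger;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: proxy P_i hands each registering legacy node the closest unused
    // name to its own ID, per the "start w/ P_i, increment for each node" rule.
    public final class ProxyNamer {
        private final BigInteger proxyId;                        // P_i's own overlay name
        private final Set<BigInteger> assigned = new HashSet<>();

        public ProxyNamer(BigInteger proxyId) {
            this.proxyId = proxyId;
            this.assigned.add(proxyId);                          // the proxy keeps its own name
        }

        // Returns D_i for the next legacy node that registers with this proxy.
        public BigInteger assignProxyName() {
            BigInteger candidate = proxyId;
            while (assigned.contains(candidate)) {
                candidate = candidate.add(BigInteger.ONE);       // next closest name
            }
            assigned.add(candidate);
            return candidate;
        }
    }
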
Tunneling through an Overlay

[Figure: legacy nodes L1 and L2 attach to overlay proxies P1 and P2; proxy names D1 and D2 are published into the overlay network]

• L1 registers with P1 as document D1
• L2 registers with P2 as document D2
• Traffic tunnels through the overlay via the proxies

Failure Avoidance in Tapestry

[Figure]

Routing Convergence

[Figure]

Bandwidth Overhead for Misroute

[Plot: "Increase in Latency for 1 Misroute (Secondary Route)": proportional increase to path latency vs. position of branch (hop); curves for 20 ms, 26.66 ms, 60 ms, 80 ms, 93.33 ms]

• Status: under deployment on PlanetLab

For more information …
Tapestry and related projects (and these slides):
http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore:
http://oceanstore.cs.berkeley.edu
Related papers:
http://oceanstore.cs.berkeley.edu/publications
http://www.cs.berkeley.edu/~ravenben/publications
ravenben@eecs.berkeley.edu
Backup Slides Follow…
The Naming Problem

• Tracking modifiable objects
  - Example: email, Usenet articles, tagged audio
  - Goal: verifiable names, robust to small changes
• Current approaches
  - Content-based hashed naming
  - Content-independent naming
• ADOLR Project (Feng Zhou, Li Zhuang)
  - Approximate names based on feature vectors
  - Leverage to match / search for similar content

Approximation Extension to DOLR/DHT

• Publication using features
  - Objects are described using a set of features:
    AO ≡ Feature Vector (FV) = {f1, f2, f3, …, fn}
  - Locate AOs in DOLR ≡ find all AOs in the network with |FV* ∩ FV| ≥ Thres, 0 < Thres ≤ |FV|
    (see the matching sketch below)
• Driving application: decentralized spam filter
  - Humans are the only fool-proof spam filter
  - Mark spam, publish spam by text feature vector
  - Incoming mail filtered by FV query on the P2P overlay

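A minimal sketch of the threshold test above: an incoming object matches a published approximate object when their feature vectors share at least Thres features. The set types and names are illustrative.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the approximate-match predicate |FV* ∩ FV| >= Thres.
    public final class FeatureMatch {
        // query = FV* of the incoming object, published = FV of a stored AO
        public static boolean matches(Set<String> query, Set<String> published, int thres) {
            Set<String> common = new HashSet<>(query);
            common.retainAll(published);       // intersection of the two feature vectors
            return common.size() >= thres;     // e.g. thres = 3 of 10 features
        }
    }
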
Evaluation on Real Emails

• Accuracy of feature vector matching on real emails
  - Spam: 29,631 junk emails from www.spamarchive.org
    14,925 unique; 86% of spam ≤ 5 KB
  - Normal emails: 9,589 total = 50% newsgroup posts, 50% personal emails

"Similarity" Test (3,440 modified copies of 39 emails):
  THRES   Detected   Fail   %
  3/10    3356       84     97.56
  4/10    3172       268    92.21

"False Positive" Test (9,589 normal × 14,925 spam pairs):
  Match    FP # pairs   Probability
  2/10     4            2.79e-8
  >2/10    0            0

• Status
  - Prototype implemented as an Outlook plug-in
  - Interfaces w/ Tapestry overlay
  - http://www.cs.berkeley.edu/~zf/spamwatch

State of the Art Routing

• High-dimensionality and coordinate-based P2P routing
  - Tapestry, Pastry, Chord, CAN, etc.
  - Sub-linear storage and # of overlay hops per route
  - Properties dependent on random name distribution
  - Optimized for uniform mesh-style networks

Reality

• Transit-stub topology, disparate resources per node
• Result: inefficient inter-domain routing (b/w, latency)

[Figure: P2P overlay network spanning domains AS-1, AS-2, AS-3, with sender S and receiver R]

Landmark Routing on P2P

• Brocade
  - Exploit non-uniformity
  - Minimize wide-area routing hops / bandwidth
• Secondary overlay on top of Tapestry
  - Select super-nodes by admin. domain
    Divide network into cover sets
  - Super-nodes form a secondary Tapestry
    Advertise cover set as local objects
  - Brocade routes directly into the destination's local network, then resumes P2P routing

Brocade Routing

[Figure: original route vs. Brocade route from S to D; the Brocade layer sits above the P2P network spanning AS-1, AS-2, AS-3]

Overlay Routing Networks

• CAN: Ratnasamy et al. (ACIRI / UCB)
  - Uses d-dimensional coordinate space to implement a distributed hash table
  - Route to the neighbor closest to the destination coordinate
  - Fast insertion / deletion; constant-sized routing state
  - Unconstrained # of hops; overlay distance not prop. to physical distance
• Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
  - Linear namespace modeled as a circular address space
  - "Finger table" points to a logarithmic # of increasingly remote hosts
  - Simplicity in algorithms; fast fault-recovery
  - Log2(N) hops and routing state; overlay distance not prop. to physical distance
• Pastry: Rowstron and Druschel (Microsoft / Rice)
  - Hypercube routing similar to PRR97
  - Objects replicated to servers by name
  - Fast fault-recovery; Log(N) hops and routing state
  - Data replication required for fault-tolerance

Routing in Detail

Example: octal digits, 2^12 namespace, routing from 2175 to 0157:

  2175 -> 0880 -> 0123 -> 0154 -> 0157

Each hop resolves one more digit of the destination ID (0***, 01**, 015*, 0157); at every node, the routing table holds one entry per digit value 0-7 for each level. A code sketch of this digit-by-digit step follows.

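A minimal sketch of the digit-by-digit forwarding illustrated above; the routing-table layout and the nextHop helper are illustrative, not the actual Tapestry data structures.

    // Sketch: prefix-style digit routing as in 2175 -> 0880 -> 0123 -> 0154 -> 0157.
    // IDs are fixed-length strings of octal digits; table[level][digit] holds a known
    // neighbor that matches the local ID in `level` digits and has `digit` next.
    public final class DigitRouter {
        public static String nextHop(String localId, String destId, String[][] table) {
            // Length of the prefix this node already shares with the destination.
            int level = 0;
            while (level < destId.length() && localId.charAt(level) == destId.charAt(level)) {
                level++;
            }
            if (level == destId.length()) {
                return localId;                      // we are the destination
            }
            int digit = destId.charAt(level) - '0';  // next digit to resolve (0-7 for octal)
            return table[level][digit];              // neighbor matching one more digit (may be null)
        }
    }
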
Publish / Lookup Details

• Publish object with ObjectID:
  // route towards the "virtual root," ID = ObjectID
  for (i = 0; i < Log2(N); i += j) {   // define hierarchy
    - j is the # of bits per digit (i.e. for hex digits, j = 4)
    - Insert an entry into the nearest node that matches on the last i bits
    - If no match is found, deterministically choose an alternative
    - The real root node is found when no external routes are left
  }
• Lookup object:
  Traverse the same path to the root as publish, except search for an entry at each node
  for (i = 0; i < Log2(N); i += j) {
    - Search for a cached object location
    - Once found, route via IP or Tapestry to the object
  }

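A minimal sketch of the per-hop behavior above: publish drops an object pointer at every hop on the path from the storing server toward the object's root, and lookup walks the same kind of path and stops at the first node holding a pointer. The Node/ObjectLocation types and the explicit path list are illustrative stand-ins for the hop-by-hop routing.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    final class Node {
        final Map<String, String> pointers = new HashMap<>();   // objectId -> serverId
    }

    public final class ObjectLocation {
        // Publish: leave a pointer to the server at each hop toward the root.
        static void publish(String objectId, String serverId, List<Node> pathToRoot) {
            for (Node hop : pathToRoot) {
                hop.pointers.put(objectId, serverId);
            }
        }

        // Lookup: follow the path toward the root, stop at the first cached pointer.
        static String lookup(String objectId, List<Node> pathToRoot) {
            for (Node hop : pathToRoot) {
                String server = hop.pointers.get(objectId);
                if (server != null) {
                    return server;                               // route to the server from here
                }
            }
            return null;                                         // reached the root: not published
        }
    }
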
Dynamic Insertion

1. Build up the new node's routing map (sketched below)
   - Send messages to each hop along the path from the gateway to the current node N' that best approximates N
   - The ith hop along the path sends its ith-level route table to N
   - N optimizes those tables where necessary
2. Notify, via acked multicast, the nodes with null entries for N's ID
3. Notified nodes issue republish messages for relevant objects
4. Notify local neighbors

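A minimal sketch of step 1: the ith hop on the route toward the new node's ID contributes its level-i routing table, which the new node then optimizes. The table representation and helper are illustrative.

    import java.util.List;

    // Sketch of bootstrapping a new node's routing table during insertion.
    // tablesAlongPath.get(i) is the full routing table of the ith hop on the path
    // from the gateway toward the new node's ID; types are illustrative.
    public final class InsertionBootstrap {
        static String[][] buildInitialTable(List<String[][]> tablesAlongPath, int entriesPerLevel) {
            int levels = tablesAlongPath.size();
            String[][] newTable = new String[levels][entriesPerLevel];
            for (int i = 0; i < levels; i++) {
                // The ith hop's level-i row seeds the new node's level i.
                System.arraycopy(tablesAlongPath.get(i)[i], 0, newTable[i], 0, entriesPerLevel);
            }
            return newTable;   // the new node later replaces entries with closer neighbors
        }
    }
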
Dynamic Insertion Example

[Figure: dynamic insertion example; new node 0x143FE enters via gateway 0xD73FF into a mesh of nodes (0x779FE, 0x244FE, 0xA23FE, 0x973FE, 0x243FE, ...), with routing hops labeled by level 1-4]

Dynamic Root Mapping

• Problem: choosing a root node for every object
  - Deterministic over network changes
  - Globally consistent
• Assumptions
  - All nodes with the same matching suffix contain the same null/non-null pattern in the next level of the routing map
  - Requires: consistent knowledge of nodes across the network

PRR Solution

• Given desired ID N:
  - Find the set S of existing network nodes matching the most suffix digits with N
  - Choose Si = the node in S with the highest-valued ID
• Issues:
  - Mapping must be generated statically using global knowledge
  - Must be kept as hard state in order to operate in a changing environment
  - Mapping is not well distributed; many nodes get no mappings

Tapestry Solution

• Globally consistent distributed algorithm (sketched below):
  - Attempt to route to the desired ID Ni
  - Whenever a null entry is encountered, choose the next "higher" non-null pointer entry
  - If the current node S holds the only non-null pointer in the rest of the route map, terminate the route: f(N) = S
• Assumes:
  - Routing maps across the network are up to date
  - Null/non-null properties are identical at all nodes sharing the same suffix

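A minimal sketch of the surrogate-routing rule above: when the entry for the desired digit is null, try the next higher digit value (wrapping around); if no other entry exists at that level, the current node becomes the root. The table layout is illustrative; an entry that points back to the local node means routing continues locally at the next level.

    // Sketch of Tapestry-style surrogate routing for deterministic root selection.
    // table[level][digit] is the neighbor for `digit` at `level`, or null if no
    // node with that digit pattern is known.
    public final class SurrogateRouter {
        // Next hop toward the root of objectDigits, or null if this node is the root.
        public static String nextHop(String[][] table, int[] objectDigits, int level, int base) {
            int desired = objectDigits[level];
            for (int i = 0; i < base; i++) {
                String candidate = table[level][(desired + i) % base];  // desired digit, then next "higher"
                if (candidate != null) {
                    return candidate;
                }
            }
            return null;   // no non-null entry left at this level: current node is the root f(N)
        }
    }
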
Analysis

• Globally consistent deterministic mapping
  - Null entry => no node in the network with that suffix
  - Consistent map => identical null entries across the route maps of nodes with the same suffix
• Additional hops compared to the PRR solution:
  - Reduce to the coupon collector problem, assuming random distribution
  - With n ln(n) + cn entries, P(all coupons) = 1 - e^(-c)
  - For n = b, c = b - ln(b): P(b^2 nodes left) = 1 - b/e^b, where b/e^b ≈ 1.8×10^(-6) for b = 16
  - # of additional hops ≤ log_b(b^2) = 2
• Distributed algorithm with minimal additional hops

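A short worked check of the numbers above; taking b = 16 (hex digits) is an assumption consistent with the radix used elsewhere in the talk.

    \[
      n\ln n + cn \;\big|_{\,n=b,\;c=b-\ln b} \;=\; b\ln b + b(b-\ln b) \;=\; b^{2},
    \]
    \[
      P(\text{all } b \text{ digit values seen}) \;=\; 1 - e^{-c} \;=\; 1 - e^{-(b-\ln b)} \;=\; 1 - \frac{b}{e^{b}} \;\approx\; 1 - 1.8\times10^{-6} \quad (b = 16),
    \]
    \[
      \text{additional hops} \;\le\; \log_b\!\left(b^{2}\right) \;=\; 2.
    \]
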
Dynamic Mapping Border Cases

• Node vanishes undetected
  - Routing proceeds on the invalid link and fails
  - No backup router, so proceed to surrogate routing
• Node enters the network undetected; messages go to the surrogate node instead
  - The new node checks with the surrogate after all such nodes have been notified
  - Route info at the surrogate is moved to the new node

SPAA slides follow
Network Assumption

• Nearest neighbor is hard in a general metric
• Assume the following:
  - A ball of radius 2r contains only a factor of c more nodes than a ball of radius r
  - Also, b > c^2
  - [Both assumed by PRR]
• Start knowing one node; allow distance queries

Algorithm Idea

• Call a node a level-i node if it matches the new node in i digits.
• The whole network is contained in a forest of trees rooted at the highest possible level i_max.
• Let list[i_max] contain the roots of all trees. Then, starting at i_max, while i > 1:
  - list[i-1] = getChildren(list[i])
  - Certainly, list[i] contains the level-i neighbors.

We Reach The Whole Network

[Figure: the level-by-level expansion covering every node of the example mesh (NodeIDs 0xEF97, 0xEF32, 0xE399, 0xEF34, ..., with links labeled by level 1-4)]

The Real Algorithm

• The simplified version reaches ALL nodes in the network.
• But far-away nodes are not likely to have close descendents
  => trim the list at each step.
• New version (sketched below): while i > 1
  - list[i-1] = getChildren(list[i])
  - Trim(list[i-1])

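A minimal sketch of the level-by-level neighbor search with trimming; getChildren and the distance oracle are illustrative stand-ins for the overlay queries and measurements described around this slide.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch: walk down from level iMax, expanding each candidate's children and
    // keeping only the k nodes closest to the new node (the Trim step).
    public final class NeighborSearch {
        interface Overlay {
            List<String> getChildren(String nodeId, int level);  // level-(i-1) nodes known to nodeId
            double distanceTo(String nodeId);                    // measured distance from the new node
        }

        static List<String> findNeighbors(Overlay overlay, List<String> roots, int iMax, int k) {
            List<String> list = new ArrayList<>(roots);          // list[iMax]
            for (int i = iMax; i > 1; i--) {
                List<String> next = new ArrayList<>();
                for (String node : list) {
                    next.addAll(overlay.getChildren(node, i));   // list[i-1] = getChildren(list[i])
                }
                next.sort(Comparator.comparingDouble(overlay::distanceTo));
                list = new ArrayList<>(next.subList(0, Math.min(k, next.size())));  // Trim: keep k closest
            }
            return list;
        }
    }
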
How to Trim

• Consider a circle of radius r with at least one level-i node.
• A level-(i-1) node in the little circle must point to a level-i node in the big circle (radius < 2r).
• Want: list[i] has three times the radius of list[i-1], and list[i-1] contains at least one level-i node.

Animation

[Animation slide: the new node's neighbor search]

True in Expectation

• Want: list[i] has three times the radius of list[i-1], and list[i-1] contains at least one level-i node
• Suppose list[i-1] has k elements and radius r
  - Expect a ball of radius 4r to contain kc^2/b nodes
  - A ball of radius 3r contains fewer than k nodes, so keeping k all along is enough
• To work with high probability, k = O(log n)

Steps of Insertion

• Find the node with the closest matching ID (the surrogate) and get a preliminary neighbor table
  - If the surrogate's table is hole-free, so is this one
• Find all nodes that need to put the new node in their routing tables, via multicast
  - w.h.p. the nodes contacted while building the table are the only ones that need to update their own tables
• Optimize the neighbor table
• Need:
  - No fillable holes
  - Keep objects reachable

Need-to-know nodes

• Need-to-know = a node with a hole in its neighbor table filled by the new node
• If 1234 is the new node, and no 123? nodes existed, must notify the 12?? nodes
• Acknowledged multicast to all matching nodes

Acknowledged Multicast Algorithm

Locates and contacts all nodes with a given prefix (see the sketch below):
• Create a tree based on IDs as we go
• Nodes send acks when all of their children have been reached
• The starting node knows when all nodes have been reached

[Figure: the node owning 543?? sends to any 5430?, any 5431?, any 5434?, etc. if possible; e.g. 5431? and 5434? in turn reach 54340 and 54345]

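A minimal sketch of the acknowledged multicast above: extend the shared prefix by one digit at a time, forward to one known node per extended prefix, and treat the return of each recursive call as that child's ack. The Overlay interface and anyNodeWithPrefix lookup are illustrative placeholders for the routing-table queries.

    import java.util.List;

    // Sketch of acknowledged multicast to all nodes with a given prefix.
    // In the real algorithm each extended prefix is handled by a different node
    // and acks flow back up the tree; here the recursion returning plays that role.
    public final class AckedMulticast {
        interface Overlay {
            String anyNodeWithPrefix(String prefix);        // some node whose ID starts with prefix, or null
            void deliver(String nodeId, String message);    // contact one node
            List<Character> digitAlphabet();                // e.g. '0'..'7' for octal IDs
        }

        // Called for the prefix being multicast to; returns once the whole subtree is reached.
        static void multicast(Overlay overlay, String prefix, String message, int idLength) {
            if (prefix.length() == idLength) {
                overlay.deliver(overlay.anyNodeWithPrefix(prefix), message);  // a full ID: exactly one node
                return;
            }
            for (char digit : overlay.digitAlphabet()) {
                String child = prefix + digit;               // e.g. 543?? -> 5430?, 5431?, 5434?, ...
                if (overlay.anyNodeWithPrefix(child) != null) {
                    multicast(overlay, child, message, idLength);   // recurse; returning == child's ack
                }
            }
        }
    }
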