Master’s Project Presentation
End-host Route Selection in the
CHEETAH Networking Solution
Zhanxiang Huang
05/01/2006
Advisor: Malathi Veeraraghavan
Acknowledgement: This work was carried out under the sponsorship of NSF ITR0312376, NSF ANI-0335190, NSF ANI-0087487, and DOE DE-FG02-04ER25640
grants.
Outline
• CHEETAH project overview
• End-host route selection problem
• Model-based solution
• Measurement-based solution
• Conclusion and future work
Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH)
Goal: high-speed, rate-guaranteed end-to-end circuits with call-by-call-based bandwidth sharing
[Figure: existing options for an end-to-end connection. Telephony network: 64kbps circuits. Connectionless best-effort Internet: congestion, delay, jitter, loss. Long-term leased line: under-utilized and expensive.]
CHEETAH Applications
• Applications:
– video telephony
– high speed file transfer
– remote visualization
especially in eScience community,
e.g. Terascale Supernova Initiative (TSI) project
Current CHEETAH Network
[Figure: current CHEETAH network. Labels include SN16000 switches at NC, ORNL, and Atlanta, each with a control card (signaling engine), OC192 cards, and GbE or GbE/10GbE cards; OC-192 links between switches; sites CUNY, UVA, NCSU, ORNL (Cray X-1), and GTech; a high-speed network; end-host software; and a dynamic signaling scheme.]
CHEETAH End-host Software Architecture
[Figure: two end-hosts, each running an application and the CHEETAH software (OCS Client, Routing Decision, RSVP-TE Module, C-TCP) alongside TCP/IP. Each end-host has two NICs, NIC I and NIC II; the hosts are connected through both the Internet and the CHEETAH network.]
– OCS: check Optical Connection Service availability.
– Routing Decision: choose between circuit and Internet
path for each file transfer.
– RSVP-TE Module: dynamic provision of circuits.
– C-TCP: transport layer protocol optimized for circuits.
Circuit or Internet Path?
• Circuit setup requests may be denied.
• The better choice depends on the data transfer delays on the two paths.
An extreme example: transfer a 1K-byte file using TCP.
– Internet (best-effort path): round trip time = 24ms, bottleneck link rate = 100Mbps. The transfer delay is about 100ms.
– CHEETAH network (circuit): round trip time = 8ms, circuit rate = 1Gbps, setup delay = 5 seconds. The transfer delay is about 5.1 seconds.
What Determines Data Transfer Delays?
• Over paths:
– Circuit:
• Circuit rate
• Round trip time
• Setup delay
– Internet:
• Round trip time
• Bottleneck link rate
• Packet loss rate
• At end-hosts:
– Transport layer protocol and parameter settings
– OS Process scheduling
– Hard disk throughput
How to Estimate Data Transfer Delays?
• Model-based solution
– Construct mathematical models for computing file transfer
delays over the circuit and Internet paths.
• Measurement-based solution
– Estimate file transfer delays based on delay
measurements of past file transfers.
Model-based Solution
• Modeling TCP delay over Internet path
– TCP Reno delay model [UMass98]
• Modeling delay over CHEETAH circuit
– Let Pb be the call blocking probability
– Average delay over circuit is
$$(1 - P_b)\,(\text{setup\_delay} + \text{transfer\_delay\_over\_circuit}) + P_b\,(\text{average\_setup\_failure\_delay} + \text{delay\_over\_Internet})$$
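The average-delay expression for the circuit can be sketched in a few lines of Python. This is an illustrative sketch only: the function and parameter names are hypothetical, all delays are assumed to be in seconds, and a blocked call is assumed to fall back to the Internet path as the formula states.

```python
def expected_circuit_delay(p_block, setup_delay, circuit_transfer_delay,
                           setup_failure_delay, internet_delay):
    """Average delay when a circuit is attempted (names are illustrative).

    With probability (1 - p_block) the circuit is set up and used; with
    probability p_block the setup fails and the transfer uses the Internet.
    """
    if not 0.0 <= p_block <= 1.0:
        raise ValueError("p_block must be in [0, 1]")
    success = (1.0 - p_block) * (setup_delay + circuit_transfer_delay)
    blocked = p_block * (setup_failure_delay + internet_delay)
    return success + blocked
```

With p_block = 0 the result reduces to setup_delay + circuit_transfer_delay, and with p_block = 1 it reduces to the failure delay plus the Internet delay.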
Inputs to Delay Models
• Inputs to TCP Reno delay model:
– File size
– Bottleneck link rate
– Round trip time
– Packet loss rate
– Initial congestion window size
– Sender and receiver buffer sizes
• Inputs to circuit delay model:
– File size
– Circuit rate
– Round trip time over the circuit path
– Round trip time over the signaling path
– Call processing delay at each switch
– Signaling engine call load
– Number of switches on the path
– Call blocking probability
Limitations of the Model-based Solution
• Packet loss rate is difficult to measure.
(Tools that I tested include Sting, iperf, ping, and badabing, among others.)
• So are the call blocking probability and the signaling engine call load.
• Many TCP variants are emerging, but there are no delay models for them yet.
– e.g. BIC-TCP has been included in Linux kernel 2.6 but has not been modeled yet.
Measurement-based Solution
• Assumptions
– Fixed circuit rates, e.g. 1Gbps,
100Mbps…
– The number of destinations with
which an end-host typically
communicates, is not large.
– Internet traffic has repeating
patterns over time, which means
that during a specific time
period, round trip time, packet
loss rate and call blocking
probability are likely the same.
[Figure: delay versus file size for the Internet path and the circuit. The two curves cross at the crossover file size; the Internet is faster below it and the circuit is faster above it.]
Idea: Discretize time and file size. In each time slot, for each destination and each circuit rate, measure the delays of file transfers over both paths to find the crossover file size.
Active and Passive Measurements
• Active measurements
– Traffic is injected into the network explicitly for
the purpose of obtaining measurements.
• Passive measurements
– Data is collected under normal network usage.
A Best-case Active-measurement Experiment
Drawback: significant
measurement traffic overhead
Best-case means packet loss rate and call blocking probability are
equal to zero. TCP buffers are set to Bandwidth Delay Product values.
Active Measurements
Delays on the Internet path and the circuit are random variables, DI and DC.
1. Find an interval (min, max) that contains the crossover file size;
2. Measure delays on both paths for file size mid=(min+max)/2;
3. If |E(DI)-E(DC)|<e, crossover=mid and stop;
4. If E(DI)>E(DC), max=mid;
5. If E(DI)<E(DC), min=mid;
6. Go to 2.
[Figure: delay versus file size for the Internet path and the circuit, with min, mid, max, and the crossover marked on the file-size axis.]
Let M be the initial max file size and N be the initial min file size. Traffic size = O(M*log(M-N)).
Drawback: measurement traffic overhead.
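The six steps above amount to a bisection on file size. The sketch below is a minimal illustration under stated assumptions: `measure_internet` and `measure_circuit` are hypothetical callables that return the mean measured delay in seconds for a given file size in bytes, and the two `toy_*` functions are simple stand-ins for real measurements (100 Mbps bottleneck, 1 Gbps circuit, 5 s setup delay), not part of the original work.

```python
def find_crossover(measure_internet, measure_circuit, lo, hi, eps, max_iters=60):
    """Bisect on file size until the two mean delays differ by less than eps."""
    for _ in range(max_iters):
        mid = (lo + hi) / 2.0
        d_i = measure_internet(mid)   # estimate of E(DI) at size mid
        d_c = measure_circuit(mid)    # estimate of E(DC) at size mid
        if abs(d_i - d_c) < eps:
            return mid                # step 3: crossover found
        if d_i > d_c:
            hi = mid                  # step 4: circuit already wins at mid
        else:
            lo = mid                  # step 5: Internet still wins at mid
    return (lo + hi) / 2.0            # best guess if eps was never reached


def toy_internet_delay(size_bytes, rate_bps=100e6):
    """Stand-in for a measurement: pure transmission time on the bottleneck."""
    return size_bytes * 8.0 / rate_bps


def toy_circuit_delay(size_bytes, rate_bps=1e9, setup_s=5.0):
    """Stand-in for a measurement: setup delay plus transmission time."""
    return setup_s + size_bytes * 8.0 / rate_bps
```

For example, `find_crossover(toy_internet_delay, toy_circuit_delay, 0.0, 1e9, 1e-3)` converges to roughly 69 MB for these toy parameters. In a real deployment each call to a measurement function injects a transfer of size mid, which is the overhead noted above.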
Passive Measurements
1. Initialize (min, max) to (0, +inf).
2. If file size < min, choose Internet;
3. If file size > max, choose circuit;
4. If min <= file size <= max, choose each path with probability ½. Record the data transfer delays.
5. Once there are sufficient records to compute Pr(DI-DC>0) for a file size in (min, max), adjust min or max based on Pr(DI-DC>0).
[Figure: probability p of choosing the circuit versus file size: 0 below min, 1/2 between min and max, and 1 above max, with the crossover between min and max.]
(Note that min and max are file sizes in application queries, and DI and DC are assumed to follow normal distributions.)
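The path choice and bound adjustment in the steps above could be sketched as follows. The slides do not say how min or max is adjusted from Pr(DI-DC>0), so the `threshold` parameter and the update rule in `adjust_bounds` are assumptions made for illustration; all names are hypothetical.

```python
import random


def choose_path(file_size, lo, hi, rng=random):
    """Steps 2 to 4: pick a path for one transfer given the current (min, max)."""
    if file_size < lo:
        return "internet"
    if file_size > hi:
        return "circuit"
    # Inside the uncertain interval: explore both paths with probability 1/2.
    return "internet" if rng.random() < 0.5 else "circuit"


def adjust_bounds(lo, hi, file_size, prob_internet_slower, threshold=0.9):
    """Step 5 (assumed rule): shrink (min, max) once the evidence is strong.

    prob_internet_slower is Pr(DI - DC > 0) for this file size.
    """
    if prob_internet_slower >= threshold:
        hi = min(hi, file_size)       # circuit clearly better here
    elif prob_internet_slower <= 1.0 - threshold:
        lo = max(lo, file_size)       # Internet clearly better here
    return lo, hi
```

Because the exploration uses only transfers that applications request anyway, no extra measurement traffic is injected; the cost is that some transfers inside (min, max) take the slower path.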
Hybrid Measurements
• Fast startup
– Find the bottleneck link rate of the Internet path and the
circuit setup delay through either passive or active
measurement.
– Solve the equation for “file_size”.
$$\text{estimated\_setup\_delay} + \frac{\text{file\_size}}{\text{circuit\_rate}} = \frac{\text{file\_size}}{\text{Internet\_path\_bottleneck\_link\_rate}}$$
– Init (min, max) with (file_size/2, file_size*2).
• Use active measurements when initiated by
administrator users.
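The fast-startup equation has a closed-form solution, file_size = estimated_setup_delay / (1/Internet_rate - 1/circuit_rate). A minimal sketch follows, with hypothetical function names and with rates in bits per second and sizes in bits as assumed units.

```python
def estimate_crossover_bits(setup_delay_s, circuit_rate_bps, internet_rate_bps):
    """Solve setup + size/circuit_rate = size/internet_rate for size (in bits)."""
    if circuit_rate_bps <= internet_rate_bps:
        # No positive solution: the circuit never catches up with the Internet.
        raise ValueError("circuit rate must exceed the Internet bottleneck rate")
    return setup_delay_s / (1.0 / internet_rate_bps - 1.0 / circuit_rate_bps)


def initial_interval(file_size):
    """Initial (min, max) around the estimate: (file_size/2, file_size*2)."""
    return file_size / 2.0, file_size * 2.0
```

With the numbers used elsewhere in these slides (5 s setup delay, 1 Gbps circuit, 100 Mbps bottleneck), the estimate is about 5.56e8 bits, or roughly 69 MByte.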
Bookkeeping Data Structure
Time Slot               Destination      Circuit Rate   Crossover File Size
02:00 – 03:00 Sunday    128.109.34.22    1Gbps          50MByte – 70MByte

Transfer Delay Records
File Size    DI (sec)    DC (sec)
50MByte      5.081       5.715
60MByte      5.060       5.066
70MByte      5.033       4.002
…            …           …
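One way to hold such a bookkeeping entry in memory is sketched below. The class names, field names, and the `lookup` helper are hypothetical; the slides do not describe the actual database layout beyond the columns shown. The example entry reuses the values from the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DelayRecord:
    file_size_mb: float
    d_internet_s: float   # DI, measured delay over the Internet path
    d_circuit_s: float    # DC, measured delay over the circuit


@dataclass
class CrossoverEntry:
    crossover_min_mb: float
    crossover_max_mb: float
    records: List[DelayRecord] = field(default_factory=list)


# Key: (time slot, destination, circuit rate)
Key = Tuple[str, str, str]

db: Dict[Key, CrossoverEntry] = {
    ("Sunday 02:00-03:00", "128.109.34.22", "1Gbps"): CrossoverEntry(
        50.0, 70.0,
        [DelayRecord(50, 5.081, 5.715),
         DelayRecord(60, 5.060, 5.066),
         DelayRecord(70, 5.033, 4.002)],
    )
}


def lookup(db, time_slot, destination, circuit_rate, file_size_mb):
    """Return 'internet', 'circuit', 'either', or 'unknown' for a query."""
    entry = db.get((time_slot, destination, circuit_rate))
    if entry is None:
        return "unknown"
    if file_size_mb < entry.crossover_min_mb:
        return "internet"
    if file_size_mb > entry.crossover_max_mb:
        return "circuit"
    return "either"
```

Keying on (time slot, destination, circuit rate) mirrors the discretization described for the measurement-based solution; a query is then a single dictionary lookup.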
Interaction Between CHEETAH Software
Modules and Applications
[Figure: interaction between the Routing Decision (RD) module and other components. Labels include, inside the Routing Decision Module: Thread 1 (Query Interface, decision-making), Thread 2 (Measurement Report Monitor Interface), Thread 3 (Active Measurement Scheduler), the RD Database, the Admin Interface, and the SysCall Interface. Other labels: Administrator, Application, RD API, RSVP API, RSVP / C-TCP Modules, TCP, and Measurement Tools. Arrows are numbered 1 to 7 and labeled query, reply, trigger, update, report delays, report blocks, and report delays or bandwidth.]
Evaluation
• Experiment setup
– The Routing Decision server and an application run on a
Linux-2.6 box with 2 Xeon 2.8GHz CPUs and 1GB memory.
– The application queries with parameters, <128.109.34.22,
1Gbps circuit rate, 1GByte file size, time slot 02:00 Sunday>.
The database has an entry corresponding to this IP and time
slot.
– Internet path: bottleneck link rate=100Mbps; round trip time
=24ms. Circuit: round trip time=8ms.
• Delay
– An application submits 100 queries.
– Mean query delay = 0.0055 sec < round trip time << 5 sec
(the average setup delay).
– Query delay standard deviation = 2.3608e-004 sec < 0.3ms
Conclusion and Future Work
• Conclusion
– The measurement-based solution is better than the model-based solution:
• Adaptive to new TCP variants
• Adaptive to traffic pattern changes
• Adaptive to hardware or software configuration changes
• Low overhead
• Future work
– Scalability issues
• For a computer that communicates with a large number of end-hosts (e.g. a
web server), we can separate the RD module from the computer and run a
separate RD server for it.
• For computers in the same LAN and with the same hardware and software
configurations, we create an RD server for the whole LAN.
References
[CHEETAH] M. Veeraraghavan, X. Zheng, H. Lee, M. Gardner, and W. Feng. CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture. In Proc. of Opticomm 2003, Oct. 13-17, 2003, Dallas, TX. Won Best Student Paper Award.
[C-TCP] A. P. Mudambi, X. Zheng, and M. Veeraraghavan. A Transport Protocol for Dedicated End-to-End Circuits. Accepted by ICC 2006.
[UMass98] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: A simple model and its empirical validation. In SIGCOMM ’98, September 1998.
Backup Slides
How to compute Pr(DI-DC>0)?
• Assume the delays observed on the Internet path and the circuit are normally distributed random variables, DI and DC. Each file size has its own pair of these random variables. Treating DI and DC as independent:
$$E(D_I - D_C) = E(D_I) - E(D_C)$$
$$V(D_I - D_C) = V(D_I) + V(D_C)$$
[Figure: probability density P(DI-DC) of the difference DI-DC, with the mean E(DI-DC) and the origin 0 marked on the DI-DC axis.]
• The number of samples needed is
$$n = \left(\frac{2 z \sigma}{w}\right)^2,$$
where z is the standard normal quantile for the confidence level α, σ is the sample standard deviation, and w is the width of the confidence interval.
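Under the normality assumption, Pr(DI-DC>0) and the sample-size rule can be computed with the Python standard library. The sketch below is illustrative: the function names are hypothetical, DI and DC are assumed independent, and the two-sided quantile used for z is an assumption, since the slide does not state which quantile is meant.

```python
from math import ceil, sqrt
from statistics import NormalDist


def prob_internet_slower(mean_i, var_i, mean_c, var_c):
    """Pr(DI - DC > 0) for independent normal DI and DC."""
    diff = NormalDist(mu=mean_i - mean_c, sigma=sqrt(var_i + var_c))
    return 1.0 - diff.cdf(0.0)


def required_samples(sigma, confidence, width):
    """n = (2*z*sigma/w)^2, rounded up; z is taken as the two-sided quantile."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return ceil((2.0 * z * sigma / width) ** 2)
```

For instance, with a sample standard deviation of 1 s, a 95% confidence level, and a 1 s wide interval, `required_samples(1.0, 0.95, 1.0)` gives 16.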
CHEETAH network
[Figure: CHEETAH network topology. Labeled sites: NYC, WASH, UVa, CUNY, NCSU, NC (MCNC), ORNL, Atlanta. Labeled equipment: Cheetah-ornl, cheetah-nc, Cheetah-atl, HOPI Force10, Abilene T640, Juniper T320, Juniper M20, Force10 E300 switch, Foundry FastIron FESX448, Catalyst 4948, Catalyst 7600, Centuar, UCNS, X1(E). Labeled hosts (H): UVa host, CUNY host, Orbitty compute nodes Compute-0-0 to Compute-0-4 (152.48.249.2 to 152.48.249.6), Wukong (152.48.249.102), Wuneng (152.48.249.103), Zelda1 to Zelda5 (10.0.0.11 to 10.0.0.15). Link labels: OC192, OC-192 lambda, 10GbE, GbE, 1G, 2x1G, 1GFC, 3x1G VLAN, and switch port numbers such as 1-6-1 and 1-7-33. Legend: direct fibers, VLANs, MPLS tunnels.]
By Xuan Zheng, xuan@virginia.edu
Delay model
The average delay using a CHEETAH circuit is:
$$E[T_{cheetah}] = (1 - P_b)\,(E[T_{setup}] + E[T_{circuit}]) + P_b\,(E[T_{fail}] + E[T_{tcp}^{primary}]) \quad (1)$$
Comparing (1) with $E[T_{tcp}^{primary}]$, we get:
$$(1) - E[T_{tcp}^{primary}] = (1 - P_b)\,(E[T_{setup}] + E[T_{circuit}] - E[T_{tcp}^{primary}]) + P_b\,E[T_{fail}] \quad (2)$$
Approximating $E[T_{fail}]$ by $E[T_{setup}]$, we get:
$$(2) = (1 - P_b)\,(E[T_{circuit}] - E[T_{tcp}^{primary}]) + E[T_{setup}] \quad (3)$$
If (3) < 0, then the application should try to set up a circuit; otherwise it should use the primary access link.
In other words, if
$$E[T_{tcp}^{primary}] - E[T_{circuit}] > \frac{E[T_{setup}]}{1 - P_b} \quad (4)$$
then attempt circuit setup; otherwise resort to the TCP/IP Internet path.
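Condition (4) translates directly into a one-line test. The sketch below uses hypothetical parameter names, with all delays in seconds, and assumes that the failure delay has already been approximated by the setup delay as in the derivation.

```python
def should_attempt_circuit(t_tcp, t_circuit, t_setup, p_block):
    """Condition (4): attempt a circuit if E[T_tcp] - E[T_circuit] > E[T_setup]/(1 - Pb)."""
    if not 0.0 <= p_block < 1.0:
        raise ValueError("p_block must be in [0, 1)")
    return (t_tcp - t_circuit) > t_setup / (1.0 - p_block)
```

For the small-file case the circuit saves far less than the setup delay, so `should_attempt_circuit(0.1, 0.001, 5.0, 0.0)` is False, while a transfer that takes 100 s over TCP and 10 s over the circuit passes the test even with a 10% blocking probability.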
Circuit delay model (1)
(a) $P_b$ is the call blocking probability.
(b)
$$E[T_{circuit}] = \frac{f}{r_c} + \frac{T_{prop}^{circuit}}{2},$$
in which
$f$ is the size of the file to transfer,
$r_c$ is the data rate of the circuit, and
$T_{prop}^{circuit}$ is the round-trip propagation delay of the circuit.
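As a quick numeric check of item (b), the sketch below evaluates the circuit transfer delay as file size over circuit rate plus half the round-trip propagation delay. The function name and the choice of bits and seconds as units are assumptions.

```python
def circuit_transfer_delay(file_size_bits, circuit_rate_bps, rtt_prop_s):
    """E[T_circuit] = f / r_c + T_prop / 2 (one-way propagation after transmission)."""
    if circuit_rate_bps <= 0:
        raise ValueError("circuit rate must be positive")
    return file_size_bits / circuit_rate_bps + rtt_prop_s / 2.0
```

A 1 GByte file (8e9 bits) over a 1 Gbps circuit with an 8 ms round trip time gives 8.004 s, so for large files the propagation term is negligible next to the transmission time.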
Circuit delay model (2)
(c)
$$E[T_{setup}] = \frac{m_{sig}}{r_s}\left[1 + \frac{\rho_{sig}}{2(1 - \rho_{sig})}\right](k + 1) + T_{sp}\left[1 + \frac{\rho_{sp}}{2(1 - \rho_{sp})}\right]k + T_{prop}^{signaling}, \quad (6)$$
in which
$m_{sig}$ is the cumulative size of signaling messages used in circuit setup,
$r_s$ is the signaling link rate, assuming all the signaling links have the same rate,
$\rho_{sig}$ is the traffic load of the M/D/1 queue model of the signaling link,
$k$ is the number of switches on the circuit path,
$T_{sp}$ is the call-processing delay incurred at each switch,
$\rho_{sp}$ is the traffic load of the M/D/1 queue model of the signaling processor, and
$T_{prop}^{signaling}$ is the round-trip propagation delay of the circuit's signaling path.
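The setup-delay expression (6) can be evaluated as sketched below. The extracted slide has lost its operators, so this sketch assumes the usual M/D/1 sojourn factor 1 + ρ/(2(1-ρ)) and reads the link multiplier as (k + 1), i.e. one more signaling link than switches; both readings are assumptions, as are all function and parameter names.

```python
def md1_factor(load):
    """M/D/1 sojourn time divided by service time: 1 + rho / (2 * (1 - rho))."""
    if not 0.0 <= load < 1.0:
        raise ValueError("load must be in [0, 1)")
    return 1.0 + load / (2.0 * (1.0 - load))


def expected_setup_delay(m_sig_bits, r_s_bps, rho_sig, k, t_sp_s, rho_sp, t_prop_sig_s):
    """E[T_setup]: link transmission/queueing + switch processing + propagation."""
    link_term = (m_sig_bits / r_s_bps) * md1_factor(rho_sig) * (k + 1)
    processing_term = t_sp_s * md1_factor(rho_sp) * k
    return link_term + processing_term + t_prop_sig_s
```

With zero load on both queues the factors are 1, so the result is simply (k + 1) message transmission times, k call-processing delays, and the signaling round-trip propagation delay.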
TCP-Reno delay model (1)
(d)
$$E[T_{tcp}^{primary}] = E[T_{ss}] + E[T_{loss}] + E[T_{ca}] + E[T_{delack}]$$
$E[T_{delack}]$ depends on the specific operating system. Approximate $E[T_{delack}]$ as 100ms for BSD-derived stacks and 150ms for Windows.
(i) Calculate $E[T_{ss}]$
$$E[T_{ss}] = \begin{cases} RTT \cdot \log_\gamma\!\left(\dfrac{E[d_{ss}](\gamma - 1)}{w_1} + 1\right), & \text{when } E[W_{ss}] \le W_{max} \\[2ex] RTT \cdot \left[\log_\gamma\!\left(\dfrac{W_{max}}{w_1}\right) + 1 + \dfrac{1}{W_{max}}\left(E[d_{ss}] - \dfrac{\gamma W_{max} - w_1}{\gamma - 1}\right)\right], & \text{otherwise} \end{cases}$$
in which
$RTT$ = [the round trip delay]
$\gamma = 1 + 1/b$ [the rate of exponential growth of cwnd during slow start]
$$E[d_{ss}] = \frac{[1 - (1 - p)^d](1 - p)}{p} + 1$$
$$E[W_{ss}] = \frac{E[d_{ss}](\gamma - 1)}{\gamma} + \frac{w_1}{\gamma}$$
$w_1$ = [sender's initial cwnd size]
$W_{max}$ = [the maximum window we would expect TCP to achieve at the end of slow start]
$d$ = [number of segments to send]
$p$ = [data segment loss rate]
$b$ = [number of segments acknowledged by one delayed ACK]
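The slow-start term can be evaluated as in the sketch below. It follows the piecewise expression for E[T_ss] as read from this slide; the function name and return convention are hypothetical, and the loss rate is restricted to 0 < p < 1 because the slide does not cover the loss-free case.

```python
from math import log


def expected_slow_start(d, p, rtt, w1, w_max, b=2):
    """Return (E[d_ss], E[W_ss], E[T_ss]) for the initial slow-start phase.

    d: segments to send, p: segment loss rate, rtt: round trip time (s),
    w1: initial cwnd (segments), w_max: maximum window, b: segments per ACK.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("this sketch requires 0 < p < 1")
    gamma = 1.0 + 1.0 / b
    e_dss = (1.0 - (1.0 - p) ** d) * (1.0 - p) / p + 1.0
    e_wss = e_dss * (gamma - 1.0) / gamma + w1 / gamma
    if e_wss <= w_max:
        # The window never reaches W_max during slow start.
        t_ss = rtt * log(e_dss * (gamma - 1.0) / w1 + 1.0, gamma)
    else:
        # The window is capped at W_max for the rest of slow start.
        t_ss = rtt * (log(w_max / w1, gamma) + 1.0
                      + (e_dss - (gamma * w_max - w1) / (gamma - 1.0)) / w_max)
    return e_dss, e_wss, t_ss
```

For example, with d = 1000 segments, p = 0.01, RTT = 24 ms, w1 = 2, and W_max = 64, about 100 segments are expected to be sent in slow start, taking roughly 0.19 s.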
TCP-Reno delay model (2)
(ii) Calculate $E[T_{loss}]$
$$E[T_{loss}] = l_{ss}\left[Q(p, E[W_{ss}])\,E[Z^{TO}] + (1 - Q(p, E[W_{ss}]))\,RTT\right]$$
$$l_{ss} = 1 - (1 - p)^d$$
$$Q(p, w) = \min\!\left(1, \frac{1 + (1 - p)^3\,[1 - (1 - p)^{w - 3}]}{[1 - (1 - p)^w] / [1 - (1 - p)^3]}\right)$$
$$E[Z^{TO}] = \frac{G(p)\,T_0}{1 - p}$$
$$G(p) = 1 + \sum_{i=1}^{6} 2^{i-1} p^i$$
$T_0$ = [the average duration of the first TO in a sequence of one or more successive timeouts].
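A sketch of the loss-recovery term follows. It takes E[W_ss] as an input rather than recomputing it, restricts p to 0 < p < 1, and uses hypothetical function names; the formulas are the ones on this slide as reconstructed from the extracted text.

```python
def g_of_p(p):
    """G(p) = 1 + sum_{i=1..6} 2^(i-1) * p^i."""
    return 1.0 + sum(2 ** (i - 1) * p ** i for i in range(1, 7))


def q_of_pw(p, w):
    """Q(p, w): probability that a loss in a window of size w leads to a timeout."""
    numerator = 1.0 + (1.0 - p) ** 3 * (1.0 - (1.0 - p) ** (w - 3))
    denominator = (1.0 - (1.0 - p) ** w) / (1.0 - (1.0 - p) ** 3)
    return min(1.0, numerator / denominator)


def expected_loss_delay(d, p, rtt, t0, e_wss):
    """E[T_loss] = l_ss * (Q * E[Z_TO] + (1 - Q) * RTT)."""
    if not 0.0 < p < 1.0:
        raise ValueError("this sketch requires 0 < p < 1")
    l_ss = 1.0 - (1.0 - p) ** d           # probability of a loss during slow start
    e_zto = g_of_p(p) * t0 / (1.0 - p)    # expected cost of a timeout sequence
    q = q_of_pw(p, e_wss)
    return l_ss * (q * e_zto + (1.0 - q) * rtt)
```

With p = 0.01, a 30-segment window, RTT = 24 ms, and T0 = 1 s, Q is about 0.14 and the expected recovery cost is roughly 0.16 s.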
TCP-Reno delay model (3)
(iii) Calculate $E[T_{ca}]$
$$E[T_{ca}] = E[d_{ca}] / R(p, RTT, T_0, W_{max})$$
$$E[d_{ca}] = d - E[d_{ss}]$$
$$W(p) = \frac{2 + b}{3b} + \sqrt{\frac{8(1 - p)}{3bp} + \left(\frac{2 + b}{3b}\right)^2}$$
$$R(p, RTT, T_0, W_{max}) = \begin{cases} \dfrac{\dfrac{1 - p}{p} + \dfrac{W(p)}{2} + Q(p, W(p))}{RTT\left(\dfrac{b}{2} W(p) + 1\right) + \dfrac{Q(p, W(p))\,G(p)\,T_0}{1 - p}}, & \text{when } W(p) < W_{max} \\[4ex] \dfrac{\dfrac{1 - p}{p} + \dfrac{W_{max}}{2} + Q(p, W_{max})}{RTT\left(\dfrac{b}{8} W_{max} + \dfrac{1 - p}{p\,W_{max}} + 2\right) + \dfrac{Q(p, W_{max})\,G(p)\,T_0}{1 - p}}, & \text{otherwise} \end{cases}$$
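The congestion-avoidance term can be sketched as below, using the steady-state throughput R in segments per second. The helper functions repeat the definitions of G(p) and Q(p, w) so that the block is self-contained; all names are hypothetical and p is restricted to 0 < p < 1.

```python
from math import sqrt


def g_of_p(p):
    """G(p) = 1 + sum_{i=1..6} 2^(i-1) * p^i."""
    return 1.0 + sum(2 ** (i - 1) * p ** i for i in range(1, 7))


def q_of_pw(p, w):
    """Q(p, w): probability that a loss in a window of size w leads to a timeout."""
    numerator = 1.0 + (1.0 - p) ** 3 * (1.0 - (1.0 - p) ** (w - 3))
    denominator = (1.0 - (1.0 - p) ** w) / (1.0 - (1.0 - p) ** 3)
    return min(1.0, numerator / denominator)


def w_of_p(p, b=2):
    """W(p): expected window size at the time of a loss (unconstrained case)."""
    a = (2.0 + b) / (3.0 * b)
    return a + sqrt(8.0 * (1.0 - p) / (3.0 * b * p) + a * a)


def throughput(p, rtt, t0, w_max, b=2):
    """R(p, RTT, T0, Wmax) in segments per second."""
    if not 0.0 < p < 1.0:
        raise ValueError("this sketch requires 0 < p < 1")
    w = w_of_p(p, b)
    timeout_cost = g_of_p(p) * t0 / (1.0 - p)
    if w < w_max:
        q = q_of_pw(p, w)
        num = (1.0 - p) / p + w / 2.0 + q
        den = rtt * (b / 2.0 * w + 1.0) + q * timeout_cost
    else:
        q = q_of_pw(p, w_max)
        num = (1.0 - p) / p + w_max / 2.0 + q
        den = rtt * (b / 8.0 * w_max + (1.0 - p) / (p * w_max) + 2.0) + q * timeout_cost
    return num / den


def expected_ca_delay(d, e_dss, p, rtt, t0, w_max, b=2):
    """E[T_ca] = (d - E[d_ss]) / R."""
    return (d - e_dss) / throughput(p, rtt, t0, w_max, b)
```

For p = 0.01, RTT = 24 ms, T0 = 1 s, W_max = 64, and b = 2, this gives a throughput of roughly 175 segments per second, so sending the 900 segments that remain after slow start takes a little over 5 s.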
Binary Search Algorithm for Determining
the Crossover File Size for One Destination
[Flowchart; the branch structure is not recoverable from the extracted text. Box labels: Start; Init sl = s = su = setup_delay*circuit_rate, cover = false; Start Setup Delay Timer; Call Bandwidth Requester; Setup Success? (Yes/No); Too many fails (End); Stop Setup Delay Timer; Start Circuit Transfer Delay Timer; Transfer file of size s over circuit; Stop Circuit Transfer Delay Timer; Tear down circuit; Compute Circuit Throughput; Start Internet Transfer Delay Timer; Transfer file of size s over the Internet; Stop Internet Transfer Delay Timer; Compute Internet Throughput; | T_Internet - T_Circuit | < delta? (Yes/No); Crossover File Size is s and update the DB; Internet Throughput > Circuit Throughput? (Yes/No); End. Bound-update boxes: "sl = s; if ( !cover ) su = 2*su; s = (sl+su)/2", "su = s; s = (sl+su)/2; if ( !cover ) cover = true", "sl = 0; s = (sl+su)/2; if ( !cover ) cover = true", and "sl = su".]
s denotes the file size,
sl denotes the lower bound of s,
su denotes the upper bound of s,
cover denotes whether or not (sl, su) has covered the crossover file size, and
delta is the threshold for the difference between circuit and Internet throughputs.
Measurement example room in
Experiment setup
Host: mvstu6
CPU            2 CPUs, each an Intel(R) Xeon(TM) CPU 2.80GHz with 1024KB cache
Memory         1GB
Hard disk      1 MegaRAID, Model: LD 0 RAID0 69G
OS             2.6.12-1.1381_FC3smp
File system    EXT3
NIC            Intel PRO/1000 Single Port Adapters working at rate 100Mbps, Full Duplex
Acronym
• CHEETAH – Circuit-switched High-speed End-to-End Transport ArcHitecture
• PLR – Packet Loss Rate
• SD – Setup/Teardown Delay
• RTT – Round Trip Time
• AB – Available Bandwidth
• GMPLS – Generalized Multi-Protocol Label Switching
• SONET – Synchronous Optical NETwork
• SDH – Synchronous Digital Hierarchy