Master’s Project Presentation

End-host Route Selection in the CHEETAH Networking Solution
Zhanxiang Huang
05/01/2006
Advisor: Malathi Veeraraghavan

Acknowledgement: This work was carried out under the sponsorship of NSF ITR0312376, NSF ANI-0335190, NSF ANI-0087487, and DOE DE-FG02-04ER25640 grants.

Outline
• CHEETAH project overview
• End-host route selection problem
• Model-based solution
• Measurement-based solution
• Conclusion and future work

Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH)
Goal: high-speed rate-guaranteed end-to-end circuits with call-by-call-based bandwidth sharing.
[Figure: existing options for an end-to-end connection. Telephony network: 64 kbps circuits. Connectionless best-effort Internet: congestion, delay, jitter, loss. Long-term leased line: under-utilized and expensive.]

CHEETAH Applications
• Applications:
  – video telephony
  – high-speed file transfer
  – remote visualization
• Especially in the eScience community, e.g. the Terascale Supernova Initiative (TSI) project.

Current CHEETAH Network
[Figure: current CHEETAH network. SN16000 switches (control card with signaling engine, OC192 cards, GbE/10GbE cards) interconnected by OC-192 links; sites and hosts labeled CUNY, UVA, NCSU, NC, ORNL (Cray X-1), Atlanta and GTech. Callouts: high-speed network, dynamic signaling scheme, end-host software.]

CHEETAH End-host Software Architecture
[Figure: two end-hosts, each running an application and the CHEETAH software (OCS Client, Routing Decision, RSVP-TE Module, C-TCP). On each host, NIC I carries TCP/IP traffic to the Internet and NIC II carries C-TCP traffic to the CHEETAH network.]
– OCS: check Optical Connection Service availability.
– Routing Decision: choose between circuit and Internet path for each file transfer.
– RSVP-TE Module: dynamic provisioning of circuits.
– C-TCP: transport layer protocol optimized for circuits.

Circuit or Internet Path?
• Circuit setup requests may be denied.
• It depends on the data transfer delays on the two paths.
An extreme example: transfer a 1K-byte file using TCP.
• Internet (best-effort path): round trip time = 24 ms, bottleneck link rate = 100 Mbps. Internet transfer delay is about 100 ms.
• CHEETAH network (circuit): round trip time = 8 ms, circuit rate = 1 Gbps, setup delay = 5 seconds. Circuit transfer delay is about 5.1 seconds.

What Determines Data Transfer Delays?
• Over paths:
  – Circuit:
    • Circuit rate
    • Round trip time
    • Setup delay
  – Internet:
    • Round trip time
    • Bottleneck link rate
    • Packet loss rate
• At end-hosts:
  – Transport layer protocol and parameter settings
  – OS process scheduling
  – Hard disk throughput

How to Estimate Data Transfer Delays?
• Model-based solution
  – Construct mathematical models for computing file transfer delays over the circuit and Internet paths.
• Measurement-based solution
  – Estimate file transfer delays based on delay measurements of past file transfers.

Model-based Solution
• Modeling TCP delay over Internet path
  – TCP Reno delay model [UMass98]
• Modeling delay over CHEETAH circuit
  – Let $P_b$ be the call blocking probability
  – Average delay over circuit is
    \[ (1 - P_b)\,(\text{setup\_delay} + \text{transfer\_delay\_over\_circuit}) + P_b\,(\text{average\_setup\_failure\_delay} + \text{delay\_over\_Internet}) \]

Inputs to Delay Models
• Inputs to TCP Reno delay model:
  – File size
  – Bottleneck link rate
  – Round trip time
  – Packet loss rate
  – Initial congestion window size
  – Sender and receiver buffer sizes
• Inputs to circuit delay model:
  – File size
  – Circuit rate
  – Round trip time over the circuit path
  – Round trip time over the signaling path
  – Call processing delay at each switch
  – Signaling engine call load
  – Number of switches on the path
  – Call blocking probability

Limitations of the Model-based Solution
• Packet loss rate is difficult to measure. (Tools that I tested include Sting, iperf, ping, and badabing.)
• The same is true of the call blocking probability and the signaling engine call load.
• Many TCP variants are emerging, but there is no delay model for them yet.
  – e.g. BIC-TCP has been included in Linux kernel 2.6 but has not been modeled yet.

Measurement-based Solution
• Assumptions
  – Fixed circuit rates, e.g. 1 Gbps, 100 Mbps, …
  – The number of destinations with which an end-host typically communicates is not large.
  – Internet traffic has repeating patterns over time, which means that during a specific time period the round trip time, packet loss rate and call blocking probability are likely to be the same.
[Figure: delay versus file size for the Internet path and the circuit. The two curves meet at the crossover file size; the Internet path is marked below the crossover and the circuit above it.]
Idea: Discretize time and file size. In each time slot, for each destination and each circuit rate, measure the delays of file transfers over both paths to find the crossover file size.

Active and Passive Measurements
• Active measurements
  – Traffic is injected into the network explicitly for the purpose of obtaining measurements.
• Passive measurements
  – Data is collected under normal network usage.

A Best-case Active-measurement Experiment
• Best-case means the packet loss rate and call blocking probability are equal to zero.
• TCP buffers are set to the bandwidth-delay product values.
• Drawback: significant measurement traffic overhead.

Active Measurements
Delays on the Internet path and the circuit are random variables, DI and DC.
1. Find an interval (min, max) that contains the crossover file size;
2. Measure delays on both paths for file size mid = (min+max)/2;
3. If |E(DI)-E(DC)| < e, crossover = mid;
4. If E(DI) > E(DC), max = mid;
5. If E(DI) < E(DC), min = mid;
6. Go to 2.
[Figure: delay versus file size for the Internet path and the circuit, with min, mid and max marked around the crossover.]
Let M be the initial max file size and N be the initial min file size. Traffic size = O(M*log(M-N)).
Drawback: measurement traffic overhead.

Passive Measurements
1. Initiate (min, max) with (0, +inf).
2. If file size < min, choose Internet;
3. If file size > max, choose circuit;
4. If min <= file size <= max, choose each path with probability ½. Record the data transfer delays.
5. Once there are sufficient records to compute Pr(DI-DC>0) for a file size in (min, max), adjust min or max based on Pr(DI-DC>0).
[Figure: path-selection probability p versus file size, with levels 1, 1/2 and 0; the crossover lies between min and max.]
(Note that min and max are file sizes seen in application queries, and DI and DC are assumed to follow normal distributions.)

Hybrid Measurements
• Fast startup
  – Find the bottleneck link rate of the Internet path and the circuit setup delay through either passive or active measurement.
  – Solve the following equation for "file_size":
    \[ \text{estimated\_setup\_delay} + \frac{\text{file\_size}}{\text{circuit\_rate}} = \frac{\text{file\_size}}{\text{Internet\_path\_bottleneck\_link\_rate}} \]
  – Init (min, max) with (file_size/2, file_size*2).
• Use active measurements when initiated by administrator users.

Bookkeeping Data Structure
Time Slot               Destination     Circuit Rate   Crossover File Size
02:00 – 03:00 Sunday    128.109.34.22   1Gbps          50MByte – 70MByte

Transfer Delay Records
File Size   DI (sec)   DC (sec)
50MByte     5.081      5.715
60MByte     5.060      5.066
70MByte     5.033      4.002
…           …          …

Interaction Between CHEETAH Software Modules and Applications
[Figure: the Routing Decision (RD) module and its neighbors. Thread 1 serves application queries through the Query Interface and Decision-making component (query, reply). Thread 2 runs the Measurement Report Monitor Interface, which receives reported delays or bandwidth from the measurement tools and reported delays and blocks from the RSVP / C-TCP modules, and updates the RD Database. Thread 3 runs the Active Measurement Scheduler, which triggers measurements through the SysCall Interface. The administrator reaches the module through the Admin Interface; applications use the RD API and the RSVP API.]

Evaluation
• Experiment setup
  – The Routing Decision server and an application run on a Linux 2.6 box with two 2.8 GHz Xeon CPUs and 1 GB of memory.
  – The application queries with parameters <128.109.34.22, 1Gbps circuit rate, 1GByte file size, time slot 02:00 Sunday>. The database has an entry corresponding to this IP and time slot.
  – Internet path: bottleneck link rate = 100 Mbps; round trip time = 24 ms. Circuit: round trip time = 8 ms.
• Delay
  – An application submits 100 queries.
  – Mean query delay = 0.0055 sec < round trip time << 5 sec (the average setup delay).
  – Query delay standard deviation = 2.3608e-004 sec < 0.3 ms

Conclusion and Future Work
• Conclusion
  – The measurement-based solution is better than the model-based solution:
    • Adaptive to new TCP variants
    • Adaptive to traffic pattern changes
    • Adaptive to hardware or software configuration changes
    • Low overhead
• Future work
  – Scalability issues
    • For a computer that communicates with a large number of end-hosts (e.g. a web server), we can separate the RD module from the computer and run a separate RD server for it.
    • For computers in the same LAN and with the same hardware and software configurations, we can create one RD server for the whole LAN.

Reference
[CHEETAH] M. Veeraraghavan, X. Zheng, H. Lee, M. Gardner, and W. Feng. CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture. In Proc. of Opticomm 2003, Oct. 13-17, 2003, Dallas, TX. Won Best Student Paper Award.
[C-TCP] A. P. Mudambi, X. Zheng, and M. Veeraraghavan. A Transport Protocol for Dedicated End-to-End Circuits. Accepted by ICC 2006.
[UMass98] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: A simple model and its empirical validation. In SIGCOMM ’98, September 1998.

Backup Slides

How to compute Pr(DI-DC>0)?
• Assume the delays observed on the Internet path and the circuit are normally distributed random variables, DI and DC. Each file size has its own pair of random variables.
\[ E(D_I - D_C) = E(D_I) - E(D_C) \]
\[ V(D_I - D_C) = V(D_I) + V(D_C) \]
[Figure: probability density P(DI-DC) of the difference DI-DC, centered at E(DI-DC), with 0 marked on the horizontal axis.]
• Number of samples needed:
\[ n = \left( \frac{2 z \sigma}{w} \right)^{2} \]
where $z$ is the standard normal value for confidence level $\alpha$, $\sigma$ is the sample standard deviation, $\alpha$ is the confidence level and $w$ is the width of the confidence interval.
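The computation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names are hypothetical, DI and DC are assumed independent (so the variances add, as on the slide), and the sample-size formula is the reading n = (2zσ/w)² of the slide's expression.

```python
import math
from statistics import mean, variance


def prob_internet_slower(di_samples, dc_samples):
    """Estimate Pr(DI - DC > 0), treating DI and DC as independent normals."""
    mu = mean(di_samples) - mean(dc_samples)            # E(DI - DC)
    var = variance(di_samples) + variance(dc_samples)   # V(DI - DC)
    if var == 0:
        # Degenerate case: no spread in either set of records.
        return 1.0 if mu > 0 else (0.0 if mu < 0 else 0.5)
    z = mu / math.sqrt(var)
    # For X ~ N(mu, var), Pr(X > 0) equals the standard normal CDF at mu/sigma.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def required_samples(z_value, sample_std, width):
    """Records needed for a confidence interval of the given width (assumed form)."""
    return math.ceil((2.0 * z_value * sample_std / width) ** 2)
```

A probability close to 1 suggests the circuit is faster for that file size, and a probability close to 0 suggests the Internet path is faster; how far min or max is then moved is a policy choice the slide leaves open.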
CHEETAH network
[Figure: CHEETAH network topology. The Cheetah-ornl, cheetah-nc and Cheetah-atl nodes are connected by OC-192 links (including the Atlanta OC-192 lambda) and expose GbE and 10GbE ports. Attached hosts include Zelda1 to Zelda5 (10.0.0.11 to 10.0.0.15), Wukong (152.48.249.102), Wuneng (152.48.249.103), the Orbitty compute nodes Compute-0-0 to Compute-0-4 (152.48.249.2 to 152.48.249.6) and the ORNL X1(E). UVa and CUNY hosts are reached through 2x1G MPLS tunnels and VLANs across the NCSU M20, Juniper T320, Force10 E300, HOPI Force10 (NYC, WASH), Abilene T640, Catalyst 4948, MCNC Catalyst 7600 and Foundry FastIron FESX448 devices. Legend: direct fibers, VLANs, MPLS tunnels.]
By Xuan Zheng, xuan@virginia.edu

Delay model
The average delay using a CHEETAH circuit is:
\[ E[T_{cheetah}] = (1 - P_b)\,\bigl(E[T_{setup}] + E[T_{circuit}]\bigr) + P_b\,\bigl(E[T_{fail}] + E[T_{tcp}^{primary}]\bigr) \quad (1) \]
Comparing (1) with $E[T_{tcp}^{primary}]$, we get:
\[ (1) - E[T_{tcp}^{primary}] = (1 - P_b)\,\bigl(E[T_{setup}] + E[T_{circuit}] - E[T_{tcp}^{primary}]\bigr) + P_b\,E[T_{fail}] \quad (2) \]
Approximating $E[T_{fail}]$ by $E[T_{setup}]$, we get:
\[ (2) = (1 - P_b)\,\bigl(E[T_{circuit}] - E[T_{tcp}^{primary}]\bigr) + E[T_{setup}] \quad (3) \]
If (3) < 0 then the application should try to set up a circuit; otherwise it should use the primary access link.
In other words, if
\[ E[T_{tcp}^{primary}] - E[T_{circuit}] > \frac{E[T_{setup}]}{1 - P_b} \quad (4) \]
then attempt circuit setup; otherwise resort to the TCP/IP Internet path.

Circuit delay model (1)
(a) $P_b$ is the call blocking probability.
(b)
\[ E[T_{circuit}] = \frac{f}{r_c} + \frac{T_{prop}^{circuit}}{2} \]
in which $f$ is the size of the file to transfer, $r_c$ is the data rate of the circuit, and $T_{prop}^{circuit}$ is the round-trip propagation delay of the circuit.
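Rule (4) and the circuit transfer delay in (b) can be illustrated with a short Python sketch. The function names are hypothetical, and the Internet delay is passed in as a number because the slides obtain it from the TCP-Reno model or from measurements.

```python
def circuit_transfer_delay(file_bits, circuit_rate_bps, rtt_circuit_s):
    """E[T_circuit] = f / r_c + T_prop / 2, with T_prop the round-trip delay."""
    return file_bits / circuit_rate_bps + rtt_circuit_s / 2.0


def should_attempt_circuit(e_tcp_s, e_circuit_s, e_setup_s, p_block):
    """Rule (4): try the circuit if E[T_tcp] - E[T_circuit] > E[T_setup] / (1 - Pb)."""
    if p_block >= 1.0:
        # Every call is blocked, so a setup attempt can only add delay.
        return False
    return (e_tcp_s - e_circuit_s) > e_setup_s / (1.0 - p_block)


# Illustrative numbers only: a 1 GB file on a 1 Gbps circuit with an 8 ms RTT,
# an assumed Internet delay of 80 s, a 5 s setup delay and Pb = 0.1.
large_file = should_attempt_circuit(
    e_tcp_s=80.0,
    e_circuit_s=circuit_transfer_delay(8e9, 1e9, 0.008),
    e_setup_s=5.0,
    p_block=0.1,
)
```

With these assumed inputs the large transfer favors the circuit, while a transfer whose Internet delay is far below the 5 s setup delay does not.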
Circuit delay model (2)
(c)
\[ E[T_{setup}] = \frac{m_{sig}}{r_s}\left[1 + \frac{\rho_{sig}}{2(1 - \rho_{sig})}\right](k + 1) + T_{sp}\left[1 + \frac{\rho_{sp}}{2(1 - \rho_{sp})}\right]k + T_{prop}^{signaling} \quad (6) \]
in which $m_{sig}$ is the cumulative size of signaling messages used in circuit setup, $r_s$ is the signaling link rate (assuming all the signaling links have the same rate), $\rho_{sig}$ is the traffic load of the M/D/1 queue model of the signaling link, $k$ is the number of switches on the circuit path, $T_{sp}$ is the call-processing delay incurred at each switch, $\rho_{sp}$ is the traffic load of the M/D/1 queue model of the signaling processor, and $T_{prop}^{signaling}$ is the round-trip propagation delay of the circuit's signaling path.

TCP-Reno delay model (1)
(d)
\[ E[T_{tcp}^{primary}] = E[T_{ss}] + E[T_{loss}] + E[T_{ca}] + E[T_{delack}] \]
$E[T_{delack}]$ depends on the specific operating system. Approximate $E[T_{delack}]$ as 100 ms for BSD-derived stacks and 150 ms for Windows.
(i) Calculate $E[T_{ss}]$
\[ E[T_{ss}] = \begin{cases} RTT \left[ \log_{\gamma}\!\left( \dfrac{E[d_{ss}](\gamma - 1)}{w_1} + 1 \right) \right], & \text{when } E[W_{ss}] \le W_{max} \\[2ex] RTT \left[ \log_{\gamma}\!\left( \dfrac{W_{max}}{w_1} \right) + 1 + \dfrac{1}{W_{max}} \left( E[d_{ss}] - \dfrac{\gamma W_{max} - w_1}{\gamma - 1} \right) \right], & \text{otherwise} \end{cases} \]
in which
• $RTT$ = the round trip delay
• $\gamma = 1 + 1/b$ = the rate of exponential growth of cwnd during slow start
• $E[d_{ss}] = \dfrac{[1 - (1 - p)^{d}](1 - p)}{p} + 1$
• $w_1$ = sender's initial cwnd size
• $W_{max}$ = the maximum window we would expect TCP to achieve at the end of slow start
• $d$ = number of segments to send
• $p$ = data segment loss rate
• $E[W_{ss}] = \dfrac{E[d_{ss}](\gamma - 1)}{\gamma} + \dfrac{w_1}{\gamma}$
• $b$ = number of segments to send a delayed ACK

TCP-Reno delay model (2)
(ii) Calculate $E[T_{loss}]$
\[ E[T_{loss}] = l_{ss} \left[ Q(p, E[W_{ss}])\, E[Z^{TO}] + \bigl(1 - Q(p, E[W_{ss}])\bigr)\, RTT \right] \]
\[ l_{ss} = 1 - (1 - p)^{d} \]
\[ Q(p, w) = \min\!\left( 1,\ \frac{1 + (1 - p)^{3}\,[1 - (1 - p)^{w - 3}]}{[1 - (1 - p)^{w}] \,/\, [1 - (1 - p)^{3}]} \right) \]
\[ E[Z^{TO}] = \frac{G(p)\, T_0}{1 - p}, \qquad G(p) = 1 + \sum_{i=1}^{6} 2^{\,i-1} p^{\,i} \]
$T_0$ = the average duration of the first TO in a sequence of one or more successive timeouts.
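As a rough illustration of parts (i) and (ii), the slow-start and loss terms can be evaluated as below. This is a sketch under assumptions: the formulas follow one reading of the slide (the extracted text is ambiguous in places), the function names and default arguments are hypothetical, and the loss rate is restricted to 0 < p < 1 so that the divisions are defined.

```python
import math


def expected_dss(d, p):
    """E[d_ss]: expected number of segments sent during slow start."""
    if not 0.0 < p < 1.0:
        raise ValueError("this sketch assumes 0 < p < 1")
    return (1.0 - (1.0 - p) ** d) * (1.0 - p) / p + 1.0


def expected_tss(d, p, rtt, w1, w_max, b=2):
    """E[T_ss]: expected slow-start duration in seconds."""
    gamma = 1.0 + 1.0 / b
    e_dss = expected_dss(d, p)
    e_wss = e_dss * (gamma - 1.0) / gamma + w1 / gamma
    if e_wss <= w_max:
        return rtt * math.log(e_dss * (gamma - 1.0) / w1 + 1.0, gamma)
    extra = e_dss - (gamma * w_max - w1) / (gamma - 1.0)
    return rtt * (math.log(w_max / w1, gamma) + 1.0 + extra / w_max)


def g(p):
    """G(p) = 1 + sum over i = 1..6 of 2^(i-1) * p^i."""
    return 1.0 + sum(2 ** (i - 1) * p ** i for i in range(1, 7))


def q(p, w):
    """Q(p, w): probability that a loss in a window of size w leads to a timeout."""
    numerator = 1.0 + (1.0 - p) ** 3 * (1.0 - (1.0 - p) ** (w - 3))
    denominator = (1.0 - (1.0 - p) ** w) / (1.0 - (1.0 - p) ** 3)
    return min(1.0, numerator / denominator)


def expected_tloss(d, p, rtt, t0, w1, b=2):
    """E[T_loss]: expected cost of a loss that ends slow start."""
    gamma = 1.0 + 1.0 / b
    e_wss = expected_dss(d, p) * (gamma - 1.0) / gamma + w1 / gamma
    l_ss = 1.0 - (1.0 - p) ** d
    q_ss = q(p, e_wss)
    return l_ss * (q_ss * g(p) * t0 / (1.0 - p) + (1.0 - q_ss) * rtt)
```

Segment counts such as d, w1 and W_max are in segments rather than bytes, so a file size has to be divided by the segment size before calling these functions.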
TCP-Reno delay model (3)
(iii) Calculate $E[T_{ca}]$
\[ E[T_{ca}] = \frac{E[d_{ca}]}{R(p, RTT, T_0, W_{max})}, \qquad E[d_{ca}] = d - E[d_{ss}] \]
\[ W(p) = \frac{2 + b}{3b} + \sqrt{ \frac{8(1 - p)}{3bp} + \left( \frac{2 + b}{3b} \right)^{2} } \]
\[ R(p, RTT, T_0, W_{max}) = \begin{cases} \dfrac{ \dfrac{1 - p}{p} + \dfrac{W(p)}{2} + Q(p, W(p)) }{ RTT \left( \dfrac{b}{2} W(p) + 1 \right) + \dfrac{ Q(p, W(p))\, G(p)\, T_0 }{ 1 - p } }, & \text{when } W(p) < W_{max} \\[4ex] \dfrac{ \dfrac{1 - p}{p} + \dfrac{W_{max}}{2} + Q(p, W_{max}) }{ RTT \left( \dfrac{b}{8} W_{max} + \dfrac{1 - p}{p\, W_{max}} + 2 \right) + \dfrac{ Q(p, W_{max})\, G(p)\, T_0 }{ 1 - p } }, & \text{otherwise} \end{cases} \]

Binary Search Algorithm for Determining the Crossover File Size for One Destination
[Figure: flowchart. Recoverable steps: Init sl = s = su = setup_delay*circuit_rate, cover = false. Start the setup delay timer and call the bandwidth requester; on setup success stop the setup delay timer, otherwise retry and end after too many fails. Start the circuit transfer delay timer, transfer a file of size s over the circuit, stop the timer, tear down the circuit and compute the circuit throughput. Start the Internet transfer delay timer, transfer a file of size s over the Internet, stop the timer and compute the Internet throughput. If |T_Internet - T_Circuit| < delta, the crossover file size is s and the DB is updated. Otherwise the decisions "Internet throughput > circuit throughput" and "sl = su" select one of three updates before the next round: (sl = s; if (!cover) su = 2*su; s = (sl+su)/2), (sl = 0; s = (sl+su)/2; if (!cover) cover = true), or (su = s; s = (sl+su)/2; if (!cover) cover = true).]
s denotes the file size, sl denotes the lower bound of s, su denotes the upper bound of s, cover denotes whether or not (sl, su) has covered the crossover file size, and delta is the threshold for the difference between circuit and Internet throughputs.
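The search in this flowchart can be approximated by the simplified Python sketch below. It is not the flowchart itself: real transfers are replaced by two hypothetical delay functions (ideal-rate formulas with the 100 Mbps, 1 Gbps, 24 ms, 8 ms and 5 s values used elsewhere in the slides), setup failures are ignored, and delays are compared instead of throughputs.

```python
def example_internet_delay(bits):
    """Hypothetical stand-in for a measured Internet transfer delay (seconds)."""
    return 0.024 + bits / 100e6


def example_circuit_delay(bits):
    """Hypothetical stand-in for setup delay plus circuit transfer delay (seconds)."""
    return 5.0 + 0.004 + bits / 1e9


def find_crossover(internet_delay, circuit_delay, initial_upper, delta, max_iter=100):
    """Return a file size (bits) where the two delays differ by less than delta."""
    lo, hi = 0.0, float(initial_upper)
    # Double the upper bound until the circuit is at least as fast there,
    # so that (lo, hi) covers the crossover file size.
    for _ in range(max_iter):
        if internet_delay(hi) >= circuit_delay(hi):
            break
        hi *= 2.0
    else:
        raise RuntimeError("no crossover found below the search limit")
    # Bisect the covered interval.
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        d_internet, d_circuit = internet_delay(mid), circuit_delay(mid)
        if abs(d_internet - d_circuit) < delta:
            return mid
        if d_internet > d_circuit:
            hi = mid   # circuit already wins at mid: crossover is smaller
        else:
            lo = mid   # Internet still wins at mid: crossover is larger
    return (lo + hi) / 2.0


# Start from su = setup_delay * circuit_rate, as in the flowchart's Init box.
crossover_bits = find_crossover(
    example_internet_delay, example_circuit_delay, 5.0 * 1e9, 1e-3
)
```

Under these idealized delay functions the crossover comes out near 69 MB; with real measurements each probe would be an actual pair of transfers, which is the traffic overhead the active-measurement slide warns about.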
Measurement example
[Figure: measurement example; the plot is not recoverable.]

Experiment setup
Host:         mvstu6
CPU:          2 CPUs, each an Intel(R) Xeon(TM) CPU 2.80GHz with 1024KB cache
Memory:       1GB
Hard disk:    1 MegaRAID, Model: LD 0 RAID0 69G
OS:           2.6.12-1.1381_FC3smp
File system:  EXT3
NIC:          Intel PRO/1000 Single Port Adapters working at rate 100Mbps, Full Duplex

Acronym
• CHEETAH – Circuit-switched High-speed End-to-End Transport ArcHitecture
• PLR – Packet Loss Rate
• SD – Setup/Teardown Delay
• RTT – Round Trip Time
• AB – Available Bandwidth
• GMPLS – Generalized Multi-Protocol Label Switching
• SONET – Synchronous Optical NETwork
• SDH – Synchronous Digital Hierarchy