Building Network-Centric
Systems
Liviu Iftode
Before WWW, people were happy...
[Diagram: two hosts, CS.umd.EDU and CS.rutgers.EDU, each running Emacs and NFS locally, exchanging E-mail and Telnet traffic over TCP/IP.]
Mostly local computing
Occasional TCP/IP networking with low expectations and mostly
non-interactive traffic
local area networks: file server (NFS)
wide area networks (Internet): E-mail, Telnet, FTP
Networking was not a major concern for the OS
One Exception: Cluster Computing
Multicomputers
Clusters of computers
Cost-effective solution for high-performance
distributed computing
TCP/IP networking was the headache
large software overheads
Software DSM not a network-centric system :-(
The Great WWW Challenge
[Diagram: Web browsing - a client requests http://www.Bank.com over TCP/IP from the Bank.com server.]
World Wide Web made access over the Internet easy
Internet became commercial
Dramatic increase of interactive traffic
WWW networking creates a network-centric system:
Internet server
performance: service more network clients
availability: be accessible all the time over the network
security: protect resources against network attacks
Network-Centric Systems
Networking dominates the operating system
Mobile Systems
mobility-aware TCP/IP (Mobile IP, I-TCP, etc.), disconnected file systems (Coda), adaptation-aware applications for mobility (Odyssey), etc.
Internet Servers
resource allocation (Lazy Receive Processing, Resource
Containers), OS shortcuts (Scout, IO-Lite), etc
Pervasive/Ubiquitous Systems
TinyOS, sensor networks (Directed Diffusion, etc.), programmability (One.world, etc.)
Storage Networking
network-attached storage (NASD, etc.), peer-to-peer systems (OceanStore, etc.), secure file systems (SFS, Farsite), etc.
Big Picture
Research sparked by various OS-Networking
tensions
Shift of focus from Performance to Availability
and Manageability
Networking and Storage I/O Convergence
Server-based and serverless systems
TCP/IP and non-TCP/IP protocols
Local area, wide-area, ad-hoc and
application/overlay networks
Significant interest from industry
Outline
TCP Servers
Migratory-TCP and Service Continuations
Cooperative Computing, Smart Messages and
Spatial Programming
Federated File Systems
Talk Highlights and Conclusions
Problem 1: TCP/IP is too Expensive
CPU time breakdown for Apache (uniprocessor-based Web server): network processing 71%, user space 20%, other system calls 9%.
Traditional Send/Receive Communication
[Diagram: the sender application calls send(a); its OS performs copy(a,send_buf) and DMA(send_buf,NIC). On the receiver, a NIC interrupt triggers DMA(NIC,recv_buf) and copy(recv_buf,b) before receive(b) completes.]
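For reference, a minimal user-level sketch of this path with the BSD sockets API (the address and port are illustrative); each send()/recv() below pays the kernel copy, DMA, and interrupt costs shown above.

    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void) {
        char a[4096] = "request", b[4096];
        struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(8080) };
        inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);  /* illustrative server */

        int s = socket(AF_INET, SOCK_STREAM, 0);
        connect(s, (struct sockaddr *)&srv, sizeof(srv));
        send(s, a, strlen(a), 0);  /* kernel: copy(a,send_buf), then DMA(send_buf,NIC) */
        recv(s, b, sizeof(b), 0);  /* kernel: interrupt, DMA(NIC,recv_buf), copy(recv_buf,b) */
        close(s);
        return 0;
    }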
A Closer Look
Finer-grained CPU breakdown: TCP send 45%, user space 20%, hardware and software interrupt processing 19% in total (11% and 8%), other system calls 9%, TCP receive 7%, IP send and IP receive ~0% each.
Multiprocessor Server Performance
Does not Scale
[Chart: throughput (requests/s, 0-700) vs. offered load (300-750 connections/s); the dual-processor curve gains little over the uniprocessor curve.]
Apache Web server 1.3.20 on 1-way and 2-way 300 MHz Pentium II SMPs, with clients repeatedly accessing a static 16 KB file
TCP/IP-Application Co-Habitation
TCP/IP “steals” compute cycles and memory from
applications
TCP/IP executes in kernel-mode: mode switching
overhead
TCP/IP executes asynchronously
interrupt processing overhead
internal synchronization on multiprocessor servers causes
execution serialization
Cache pollution
Hidden “Service-work”
TCP packet retransmission
TCP ACK processing
ARP request service
Extreme cases can compromise server performance
Receive livelocks
Denial-of-service (DoS) attacks
Two Solutions
Replace TCP/IP with a lightweight transport protocol
Offload some/all of the TCP from host to a dedicated
computing unit (processor, computer or “intelligent”
network interface)
Industry: high-performance, expensive solutions
Memory-to-Memory Communication: InfiniBand
“Intelligent” network interface: TCP Offload Engine (TOE)
Cost-effective and flexible solutions: TCP Servers
Memory-to-Memory (M-M) Communication
[Diagram: with TCP/IP, send and receive traverse Application, OS, and NIC on both ends; with M-M, a Remote DMA moves data directly between the sender's and receiver's memory buffers, bypassing the OS on the data path.]
Memory-to-Memory Communication
is Non-Intrusive
[Diagram: RDMA_Write(a,b) moves buffer a from the sender's NIC directly into buffer b on the receiver; b is updated with no receiver CPU involvement.]
Sender:
low overhead
Receiver:
zero overhead
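As an illustration, a minimal sketch of such a one-sided write using the modern libibverbs API (the prototypes in this talk used VIA-style hardware such as Emulex cLAN, not these exact calls); it assumes a connected queue pair qp, a registered local memory region mr, and the peer's remote address and rkey exchanged out of band.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* RDMA-write `len` bytes from local_buf into the receiver's buffer:
       the receiver's CPU and OS take no part in the transfer. */
    int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                   uint32_t len, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)local_buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,   /* one-sided operation */
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;  /* where b lives on the receiver */
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }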
TCP Server at a Glance
A software offloading architecture using existing hardware
Basic idea: Dedicate one or more computing units
exclusively for TCP/IP
Compared to TOE
track technology better: latest processors
flexible: adapt to changing load conditions
cost-effective: no extra hardware
Isolate application computation from network processing
Eliminate network interrupts and context switches
Efficient resource allocation
Additional performance gains (zero-copy) with extended socket API
Related work
Very preliminary offloading solutions: Piglet, CSP
Socket Direct Protocol, Zero-copy TCP
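The dedication itself is easy to approximate at user level; a minimal Linux sketch that pins the calling process to one CPU via sched_setaffinity - only the flavor of the idea, since the actual TCP Server prototype partitions work inside the kernel.

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process (e.g., a network-processing process) to
       CPU `cpu`, leaving the remaining CPUs to the application. */
    int dedicate_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }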
Two TCP Server Architectures
TCP Servers for multiprocessor servers: [Diagram: one CPU runs the server application, another runs the TCP Server (all TCP/IP processing); the two communicate through shared memory.]
TCP Servers for cluster-based servers: [Diagram: an application node and a dedicated TCP Server node, connected by M-M communication.]
Where to Split TCP/IP Processing?
(How much to offload?)
[Diagram: the send and receive paths, top to bottom; the split between Application Processors and TCP Servers can be drawn at any layer:]
APPLICATION
system calls: SEND / RECEIVE
copy_from_application_buffers / copy_to_application_buffers
TCP_send / TCP_receive
IP_send / IP_receive
packet_scheduler / software_interrupt_handler
setup_DMA / interrupt_handler
packet_out / packet_in
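One hedged way to read the figure as code; the names below are illustrative, not from the prototype.

    /* Candidate offload split points, top to bottom of the stack above. */
    enum split_point {
        SPLIT_AT_SYSCALL,   /* offload everything below send()/receive() */
        SPLIT_BELOW_COPY,   /* host copies to/from app buffers; TCP and below offloaded */
        SPLIT_BELOW_TCP,    /* host runs TCP; IP and below offloaded */
        SPLIT_BELOW_IP,     /* host runs TCP/IP; scheduling, DMA setup, interrupts offloaded */
    };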
Evaluation Testbed
Multiprocessor Server
4-Way 550MHz Intel Pentium II system
running Apache 1.3.20 web server on Linux 2.4.9
NIC: 3Com 996-BT Gigabit Ethernet
Used sclients as a client program [Banga 97]
Comparative Throughput
[Bar chart: throughput (requests/sec, 0-3,500) for Uniprocessor, SMP with 4 processors, SMP with 1 TCP Server, and SMP with 2 TCP Servers.]
Clients issue file requests according to a web server trace
Adaptive TCP Servers
Static TCP Server configuration
Too few TCP Servers can make network processing the bottleneck
Too many TCP Servers degrade the performance of CPU-intensive applications
Dynamic TCP Server configuration
Monitor the TCP Server queue lengths and
system load
Dynamically add or remove TCP Server
processors
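A minimal sketch of that adaptation loop, with hypothetical hooks (tcp_server_queue_len, system_load, add_tcp_server_cpu, remove_tcp_server_cpu) standing in for the prototype's internals.

    #include <unistd.h>

    extern int    tcp_server_queue_len(void);   /* hypothetical monitoring hook */
    extern double system_load(void);            /* hypothetical monitoring hook */
    extern void   add_tcp_server_cpu(void);     /* hypothetical actuation hook */
    extern void   remove_tcp_server_cpu(void);  /* hypothetical actuation hook */

    void adapt_loop(int qlen_hi, int qlen_lo) {
        for (;;) {
            int q = tcp_server_queue_len();
            if (q > qlen_hi)
                add_tcp_server_cpu();        /* network processing is the bottleneck */
            else if (q < qlen_lo && system_load() > 0.9)
                remove_tcp_server_cpu();     /* return the CPU to the application */
            sleep(1);                        /* sampling period */
        }
    }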
Next Target: Storage Networking
Storage Networking dilemma
TCP (offloading) or not TCP?
not TCP: M-M Communication (InfiniBand), DAFS (Direct Access File System)
TCP: iSCSI (SCSI over IP)
non-TCP/IP solutions require new wiring or tunneling over IP-based Ethernet networks
TCP/IP solutions require TCP offloading
Future Work: TCP Servers & iSCSI
[Diagram: one CPU runs the server application, another runs the TCP Server extended with iSCSI; they communicate through shared memory, and the TCP Server speaks iSCSI over TCP/IP to SCSI storage.]
Use TCP-Servers to connect to SCSI storage using
iSCSI protocol over TCP/IP networks
Problem 2: TCP/IP is too Rigid
Server vs. Service Availability
client interested in Service availability
Adverse conditions may affect service availability
internetwork congestion or failure
servers overloaded, failed or under DoS attack
TCP has one response
network delays => packet loss => retransmission
TCP limits the OS solutions for service availability
early binding of service to a server
client cannot switch to another server for sustained
service after the connection is established
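The early binding is visible in the sockets API itself; in this minimal sketch (address illustrative), once connect() succeeds the connection is tied to that one server for its lifetime.

    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int connect_to_service(void) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(80) };
        inet_pton(AF_INET, "10.0.0.1", &srv.sin_addr);  /* one server, fixed at connect time */
        connect(s, (struct sockaddr *)&srv, sizeof(srv));
        return s;  /* standard TCP cannot move this connection to another server */
    }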
Service Availability through Migration
[Diagram: a client's live connection migrates from Server 1 to Server 2.]
Migratory TCP at a Glance
Migratory TCP migrates live connections among
cooperative servers
Migration mechanism is generic (not application specific)
lightweight (fine-grained migration) and low-latency
Migration triggered by client or server
Servers can be geographically distributed (different IP
addresses)
Requires changes to the server application
Totally transparent to the client application
Interoperates with existing TCP
Migration policies decoupled from migration mechanism
Basic Idea: Fine-Grained State Migration
[Diagram: the Server 1 process holds application state plus per-connection state for connections C1-C6; only the migrating connection's state moves to the Server 2 process.]
Migratory-TCP (Lazy) Protocol
[Diagram: per-connection state flows lazily from Server 1 to Server 2 when the client's connection migrates.]
Non-Intrusive Migration
Migrate state without involving old-server application
(only old server OS)
Old server exports per-connection state periodically
Connection state and Application state can go out of
sync
Upon migration, new server imports the last exported
state of the migrated connection
OS uses connection state to synchronize with
application
Non-intrusive migration with M-M communication
uses RDMA read to extract state from the old server with
zero-overhead
works even when the old server is overloaded or frozen
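For flavor, a minimal sketch of the server-side export discipline (sc_t, app_state, process(), and this export() signature are illustrative; the SC API call on the next slide is export(sc, state)): the server snapshots per-connection application state at consistent points so a lazy migration can resume from the last export.

    #include <stddef.h>
    #include <unistd.h>

    typedef int sc_t;                       /* illustrative SC handle */
    struct app_state { long bytes_done; };  /* whatever the new server must resume from */
    extern size_t process(const char *req, ssize_t n, char *reply, struct app_state *st);
    extern void   export(sc_t sc, struct app_state *st);

    void serve(int conn, sc_t sc) {
        struct app_state st = { 0 };
        char req[4096], reply[4096];
        ssize_t n;
        while ((n = read(conn, req, sizeof(req))) > 0) {
            size_t rlen = process(req, n, reply, &st);  /* application work */
            write(conn, reply, rlen);
            export(sc, &st);  /* consistent snapshot; migration resumes from here */
        }
    }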
Service Continuation (SC)
[Diagram: a front-end server process and two back-end server processes connected by pipes; each process exports connection, application, and pipe state through the SC API.]

Front-end server process:
    sc = create_cont(C1);     /* create an SC for client connection C1 */
    p1 = pipe();
    associate(sc, p1);        /* tie the pipe to the continuation */
    fork_exec(Process1);
    ...
    export(sc, state);        /* snapshot connection + pipe + application state */

Back-end server process 1:
    sc = open_cont(p1);       /* join the continuation through the pipe */
    ...
    export(sc, state);

Back-end server process 2:
    sc = open_cont(p2);
    ...
    export(sc, state);
Related Work
Process migration: Sprite [Douglis ‘91], Locus [Walker
‘83], MOSIX [Barak ‘98], etc.
VM migration [Rosenblum ‘02, Nieh ‘02]
Migration in web server clusters [Snoeren ‘00, Luo ‘01]
Fault-tolerant TCP [Alvisi ‘00]
TCP extensions for host mobility: I-TCP [Bakre ‘95], Snoop TCP [Balakrishnan ‘95], end-to-end approaches [Snoeren ‘00], MSOCKS [Maltz ‘98]
SCTP (RFC 2960)
Evaluation
Implemented SC and M-TCP in FreeBSD kernel
Integrated SC in real Internet servers
web, media streaming, transactional DB
Microbenchmark
impact of migration on client-perceived throughput for a two-process server using TTCP
Real applications
sustain web server throughput under load produced
by increasing the number of client connections
Impact of Migration on Throughput
[Chart: effective throughput (KB/s, 7,300-8,000) vs. migration period (no migration, 2 s, 5 s, 10 s) for SC sizes of 1 KB, 5 KB, and 10 KB.]
Web Server Throughput
[Chart: throughput (replies/s, 0-900, left axis) and migrated connections (0-12,000, right axis) vs. offered load (300-1,700 connections/s), comparing M-Apache with stock Apache.]
Future Research: Use SC to Build
Self-Healing Cluster-based Systems
[Diagram: Service Continuations SC2 and SC3 within a cluster.]
Problem 3: Computer Systems move
Outdoors
[Images: sensors, a Linux watch, a Linux camera, a Linux car.]
Massive numbers of computers will be embedded
everywhere in the physical world
Dynamic ad-hoc networking
How to execute user-defined applications over these
networks?
Outdoor Distributed Computing
Traditional distributed computing has been indoor
Target: performance and/or fault tolerance
Stable configuration, robust networking (TCP/IP or M-M)
Relatively small scale
Functionally equivalent nodes
Message passing or shared memory programming
Outdoor Distributed Computing
Target: Collect/Disseminate distributed data and/or perform
collective tasks
Volatile nodes and links
Node equivalence determined by their physical properties
(content-based naming)
Data migration is a poor fit:
end-to-end transfer control is expensive
too rigid for such a dynamic network
Cooperative Computing at a Glance
Distributed computing with execution migration
Smart Message: carries the execution state (and
possibly the code) in addition to the payload
execution state assumed to be small (explicit migration)
code usually cached (few applications)
Nodes “cooperate” by allowing Smart Messages
to execute on them
to use their memory to store “persistent” data (tags)
Nodes do not provide routing
Smart Message executes on each node of its path
Application executed on target nodes (nodes of interest)
Routing executed on each node of the path (self-routing)
During its lifetime, an application generates at least one, and possibly multiple, Smart Messages
Smart vs. “Dumb” Messages
[Illustration: “Mary’s lunch” (appetizer, entree, dessert), contrasting data migration with execution migration.]
Smart Messages
[Diagram: a network of nodes, three of them tagged Hot.]
SM Execution
[Diagram: the SM hops node to node; its counter N (shown as 0, 1, 1, 1, 2, 2, 3 along the path) increments at each Hot node until N == 3.]
Application:
    do
        migrate(Hot_tag, timeout);     /* self-route to the next Hot node */
        Water_tag = ON;                /* act on the node of interest */
        N = N + 1;
    until (N == 3 or timeout);
Routing:
    migrate(tag, timeout) {
        do
            if (NextHot_tag)
                sys_migrate(NextHot_tag, timeout);  /* one hop toward a Hot node */
            else {
                spawn_SM(Route_Discovery, Hot);     /* look for a route */
                block_SM(NextHot_tag, timeout);     /* block until routing info appears */
            }
        until (Hot_tag or timeout);
    }
Cooperative Node Architecture
[Diagram: arriving SMs pass through an Admission Manager, execute in a Virtual Machine under the scheduler, access the Tag Space, and migrate onward; everything sits above OS & I/O.]
Admission control for resource security
Non-preemptive scheduling with timeout-kill
Tags created by SMs (limited lifetime) or I/O tags
(permanent)
global tag name space {hash(SM code), tag name}
five protection domains defined using hash(SM code), SM source
node ID, and SM starting time.
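As a reading aid, a minimal sketch of that global tag name (field sizes are illustrative).

    #include <stdint.h>

    /* Global tag name space: {hash(SM code), tag name}.  Hashing the SM's
       code into the name gives each application family its own namespace. */
    struct sm_tag_name {
        uint64_t code_hash;   /* hash(SM code) */
        char     name[32];    /* tag name within that namespace */
    };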
Related Work
Mobile agents (D’Agents, Ajanta)
Active networks (ANTS, SNAP)
Sensor networks (Diffusion, TinyOS, TAG)
Pervasive computing (One.world)
Prototype Implementation
8 HP iPAQs running Linux
802.11 wireless communication
Sun Java K Virtual Machine
Geographic (simplified GPSR) and
On-Demand (AODV) routing
[Diagram: a user node injects the SM, which traverses intermediate nodes to reach the nodes of interest.]

Completion time:
Routing algorithm  | Code not cached (ms) | Code cached (ms)
Geographic (GPSR)  | 415.6                | 126.6
On-demand (AODV)   | 506.6                | 314.7
Self-Routing
There is no best routing outdoors
Depends on application and node property dynamics
Application-controlled routing
Possible with Smart Messages (execution state
carried in the message)
When migration times out, the application is upcalled
on the current node to decide what to do next
Self-Routing Effectiveness (simulation)
• geographical routing to reach target regions
• on-demand routing within region
• application decides when to switch between the two
[Simulation snapshot: the starting node, the nodes of interest, and other nodes.]
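A minimal sketch of that application-controlled switch, with the SM primitives bound as hypothetical C calls (in_region, geo_migrate, ondemand_migrate are illustrative names).

    extern int  in_region(int region);                   /* hypothetical SM primitive */
    extern void geo_migrate(int region, int timeout);    /* geographic (GPSR) routing */
    extern void ondemand_migrate(int tag, int timeout);  /* on-demand (AODV) routing */

    /* Carried inside the SM; the application decides which routing to use. */
    void migrate_to_target(int region, int tag, int timeout) {
        if (!in_region(region))
            geo_migrate(region, timeout);     /* far away: head for the target region */
        else
            ondemand_migrate(tag, timeout);   /* inside the region: find nodes of interest */
    }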
Next Target: Spatial Programming
Smart Messages: too low-level a programming model
How to describe distributed computing over dynamic outdoor networks of embedded systems, with limited knowledge of resource number, location, etc.?
Spatial Programming (SP) design guidelines:
space is a first-order programming concept
resources named by their expected location and properties
(spatial reference)
reference consistency: spatial reference-to-resource mappings are consistent throughout the program
program must tolerate resource dynamics
SP can be implemented using Smart Messages (the
spatial reference mapping table carried as payload)
Spatial Programming Example
[Illustration: Left Hill and Right Hill with mobile sprinklers carrying temperature sensors; a hot spot sits on the Left Hill.]
Program sprinklers to water the hottest spot of the Left Hill
    for (i = 0; i < 10; i++)
        if ({Left_Hill:Hot}[i].temp > Max_temp) {  /* spatial reference for Hot spots on Left Hill */
            Max_temp = {Left_Hill:Hot}[i].temp;
            id = i;
        }
    {Left_Hill:Hot}[id].water = ON;  /* spatial reference consistency: [id] still names the same node */

What if there are fewer than 10 hot spots? (the program must tolerate resource dynamics)
Problem 4: Manageable Distributed File
Systems
Most distributed file servers use TCP/IP both for
client-server and intra-server communication
Strong file consistency, file locking and load balancing:
difficult to provide
File servers require significant human effort to manage:
add storage, move directories, etc
Cluster-based file servers are cost-effective
Scalable performance requires load balancing
Load balancing may require file migration
File migration limited if file naming is location-dependent
We need a scalable, location-independent and easy to
manage cluster-based distributed file system
Federated File System at a Glance
[Diagram: application processes A1, A2, and A3 spread across cluster nodes; each node runs FedFS above its local file system, and the nodes are connected by an M-M interconnect.]
Global file name space over a cluster of autonomous local file systems interconnected by an M-M network
Location Independent Global File Naming
Virtual Directory (VD): union of local directories
volatile, created on demand (dirmerge)
contains information about files including location (homes of files)
assigned dynamically to nodes (managers)
supports location independent file naming and file migration
Directory Tables (DT): local caches of VD entries (~TLB)
[Diagram: the virtual directory /usr, containing file1 and file2, is the union of local /usr directories: file1 on local file system 1, file2 on local file system 2.]
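A minimal sketch of the lookup idea (the names and structures are illustrative, not FedFS's actual interfaces): the directory table is consulted like a TLB, and a miss triggers dirmerge to build the virtual-directory entry on demand.

    #include <stddef.h>

    struct vd_entry { const char *name; int home_node; };  /* where the file lives */

    extern struct vd_entry *dt_lookup(const char *path);   /* local DT cache (~TLB) */
    extern struct vd_entry *dirmerge(const char *path);    /* build VD entry on demand */
    extern void dt_insert(struct vd_entry *e);

    struct vd_entry *resolve(const char *path) {
        struct vd_entry *e = dt_lookup(path);  /* fast path: cached mapping */
        if (!e) {
            e = dirmerge(path);                /* miss: merge the local directories */
            if (e) dt_insert(e);               /* cache for later lookups */
        }
        return e;
    }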
Direct Access File System (DAFS): [Diagram: an application's DAFS client talks directly to a DAFS server over M-M.]
Distributed NFS over FedFS: [Diagram: applications with NFS clients reach NFS servers over TCP/IP; the NFS servers run FedFS over their local file systems, federated across an M-M interconnect.]
Federated DAFS: [Diagram: applications with DAFS clients reach DAFS servers over M-M; the DAFS servers run FedFS over their local file systems, federated across the M-M interconnect.]
Related Work
Cluster-based File Systems
Frangipani [Thekkath ‘97], PVFS [Carns ‘00], GFS, Archipelago [Ji ‘00], Trapeze (Duke)
DAFS [NetApp ‘03, Magoutis ‘01-‘03]
User-level communication in cluster-based network
servers [Carrera’02]
Experimental Platform
Eight node server cluster
800 MHz PIII, 512 MB SDRAM, 9 GB 10K RPM
SCSI
Client
Dual processor (300 MHz PII), 512 MB SDRAM
Linux-2.4
Servers and Clients equipped with Emulex cLAN
adapter (M-M network)
Workload I
Postmark – Synthetic benchmark
Short-lived small files
Mix of metadata-intensive operations
Postmark outline
Create a pool of files
Perform transactions – READ/WRITE paired with
CREATE/DELETE
Delete created files
Each Postmark client performs 30,000 transactions
Clients distribute requests to servers using a hash function on pathnames (see the sketch below)
Files are physically placed on the node that receives the client requests
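A minimal sketch of such a distribution function (djb2 is an arbitrary stand-in; the evaluation does not say which hash was used).

    /* Map a pathname to one of nservers file servers. */
    unsigned long hash_path(const char *path) {
        unsigned long h = 5381;                   /* djb2 string hash */
        while (*path)
            h = h * 33 + (unsigned char)*path++;
        return h;
    }

    int pick_server(const char *path, int nservers) {
        return (int)(hash_path(path) % nservers);
    }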
Postmark Throughput
[Chart: Postmark throughput (txns/sec, 0-30,000) vs. number of servers (1-8) for file sizes of 2 KB, 4 KB, 8 KB, and 16 KB.]
Workload II
Postmark performs only READ transactions
No create/delete operations
Federated DAFS does not control file placement
Client requests are not necessarily sent to the node where the file resides
Postmark Read Throughput
[Chart: Postmark read throughput (txns/sec, 0-60,000) vs. number of servers, comparing PostmarkRead with PostmarkRead-NoCache.]
Next Target: Federated DAFS over
the Internet
[Diagram: applications with DAFS clients attach over M-M to nearby DAFS servers running FedFS over local file systems; the DAFS servers federate with each other across the Internet over TCP/IP.]
Outline
TCP Servers
Migratory-TCP and Service Continuations
Cooperative Computing, Smart Messages and
Spatial Programming
Federated File Systems
Talk Highlights and Conclusions
Talk Highlights
Back to Migration
Service Continuation: service availability and self-healing clusters
Smart Messages: programming dynamic networks of embedded
systems
Exploit Non-Intrusive M-M Communication
TCP offloading
State migration
Federated file systems
Network and Storage I/O Convergence
TCP Servers & iSCSI
Federated File Systems & M-M
Programmability
Smart Messages and Spatial Programming
Extended Server API: Service Continuation, TCP Servers,
Federated file system
Conclusions
Network-Centric Systems: a very promising border-crossing systems research area
Common issues for a large spectrum of systems and
networks
Tremendous potential to impact industry
Acknowledgements
UMD students: Andrzej Kochut, Chunyuan Liao, Tamer
Nadeem, Iulian Neamtiu and Jihwang Yeo.
Rutgers students: Ashok Arumugam, Kalpana Banerjee,
Aniruddha Bohra, Cristian Borcea, Suresh Gopalakrisnan,
Deepa Iyer, Porlin Kang, Vivek Pathak, Murali Rangarajan,
Rabita Sarker, Akhilesh Saxena, Steve Smaldone, Kiran
Srinivasan, Florin Sultan and Gang Xu.
Post-doc: Chalermek Intanagonwiwat
Collaborations at Rutgers: EEL (Ulrich Kremer), DARK
(Ricardo Bianchini), PANIC (Rich Martin and Thu Nguyen)
Support: NSF ITR ANI-0121416 and CAREER CCR-013366