Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer, UC Berkeley
Vanguard Conference, May 5, 2003

Five Claims
- Scalability is easy (availability is hard)
  - How to build a giant-scale system
- Services are King (infrastructure centric)
- P2P is cool but overrated
- XML doesn't help much
- Need new IT for the Third World

Claim 1: Availability >> Scalability

Key New Problems
- Unknown but large growth
  - Incremental & absolute scalability
  - 1000's of components
- Must be truly highly available
  - Hot swap everything (no recovery time)
  - $6M/hour for stocks, $300K/hour for retailers
  - Graceful degradation under faults & saturation
- Constant evolution (internet time)
  - Software will be buggy
  - Hardware will fail
  - These can't be emergencies...

Typical Cluster
[Diagram: a typical cluster]

Scalability is EASY
- Just add more nodes...
  - Unless you want HA
  - Or you want to change the system...
- Hard parts:
  - Availability
  - Overload
  - Evolution

Step 1: Basic Architecture
[Diagram: clients reach a single-site server over the IP network; a load manager fronts the nodes, an optional backplane connects them, and a persistent data store sits behind them]

Step 2: Divide Persistent Data
- Transactional (expensive, must be correct)
- Supporting data
  - HTML, images, applets, etc.
  - Read mostly
  - Publish in snapshots (normally stale)
- Session state
  - Persistent across connections
  - Limited lifetime
  - Can be lost (rarely)

Step 3: Basic Availability
- a) Depend on layer-7 switches
  - Isolate external names (IP address, port) from specific machines
- b) Automatic detection of problems
  - Node-level checks (e.g. memory footprint)
  - (Remote) app-level checks
- c) Focus on MTTR
  - Easier to fix & test than MTBF

Step 4: Overload Strategy
- Goal: degrade service (some) to allow more capacity during overload
- Examples:
  - Simpler pages (less dynamic, smaller size)
  - Fewer options (just the basics)
- Must kick in automatically
  - "overload mode"
  - Relatively easy to detect

Step 5: Disaster Tolerance
- a) Pick a few locations
  - Independent failure
- b) Dynamic redirection of load
  - Best: client-side control
  - Next: switch all traffic (long routes)
  - Worst: DNS remapping (takes a while)
- c) Target site will get overloaded
  - ...but you have overload handling

Step 6: Online Evolution
- Goal: rapid evolution without downtime
- a) Publishing model
  - Decouple development from the live system
  - Atomic push of content
  - Automatic revert if trouble arises
- b) Three methods

Evolution: Three Approaches
- Flash upgrade
  - Fast reboot into the new version
  - Focus on MTTR (< 10 sec)
  - Reduces yield (and uptime)
- Rolling upgrade
  - Upgrade nodes one at a time in a "wave"
  - Temporary 1/n harvest reduction, 100% yield
  - Requires co-existing versions
- "Big Flip"

The Big Flip
- Steps:
  1) Take down 1/2 the nodes
  2) Upgrade that half
  3) Flip the "active half" (site upgraded)
  4) Upgrade the second half
  5) Return to 100%
- Avoids mixed versions (!)
  - Can replace schema, protocols, ...
- Twice used to change physical location
- (A code sketch of this procedure appears after the next slide.)

The Hat Trick
- Merge the handling of:
  - Disaster tolerance
  - Online evolution
  - Overload handling
- The first two reduce capacity, which then kicks in overload handling (!)
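The big flip is essentially an orchestration script. Below is a minimal sketch in Python, assuming a hypothetical node list and hypothetical upgrade()/set_active() helpers; the real load manager and content-push tooling are site-specific and not described in the talk.

    # Sketch of the "big flip": upgrade half the site at a time so the two
    # software versions never serve traffic together. Node names and the
    # upgrade()/set_active() helpers are hypothetical placeholders.

    def upgrade(node):
        print(f"upgrading {node} to the new version")

    def set_active(nodes):
        print(f"load manager now routes all traffic to: {', '.join(nodes)}")

    def big_flip(nodes):
        half_a, half_b = nodes[:len(nodes) // 2], nodes[len(nodes) // 2:]
        set_active(half_b)             # 1) take down half A; site runs at ~50% capacity
        for n in half_a:               # 2) upgrade the idle half
            upgrade(n)
        set_active(half_a)             # 3) flip: the upgraded half becomes the live site
        for n in half_b:               # 4) upgrade the second half
            upgrade(n)
        set_active(half_a + half_b)    # 5) return to 100% capacity

    big_flip([f"node{i:02d}" for i in range(8)])

Because only one half is ever live, the two versions never have to interoperate, which is what lets the flip change schemas, protocols, or even physical location.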
Claim 2: Services are King

Coming of The Infrastructure
[Chart, 1996-2000, scale 0-100: the Telephone vs. the PC vs. non-PC Internet devices]

Infrastructure Services
- Much simpler devices
  - lower cost & more functionality
  - longer battery life
- Data is in the infrastructure
  - can lose the device
  - enables groupware
  - can update/access from home or work
    - phone book on the web, not in the phone
    - can use a real PC & keyboard
- Much faster access
  - Surfing is 3-7 times faster
  - Graphics look good

Transformation Examples
- Tailor content for each user & device
- [Figure: distillation examples with reductions of 6.8x, 65x, and 10x, including a distilled text excerpt ("The Remote Queue Model: We introduce Remote Queues (RQ), ...")]

Infrastructure Services (2)
- Much cheaper overall cost (20x?)
  - Device utilization = 4%, infrastructure = 80%
  - Admin & support costs also decrease
- "Super Convergence" (all -> IP)
  - View PowerPoint slides with teleconference
  - Integrated cell phone, pager, web/email access
  - Map, driving directions, location-based services
- Can upgrade/add services in place!
  - Devices last longer and grow in usefulness
  - Easy to deploy new services => new revenue

Internet Phases (prediction)
- Internet as new media
  - HTML, basic search
- Consumer services (today)
  - Shopping, travel, Yahoo!, eBay, tickets, ...
- Industrial services
  - XML, micropayments, spot markets
- Everything in the infrastructure
  - store your data in the infrastructure
  - access anytime/anywhere

Claim 3: P2P is cool but overrated

P2P Services?
- Not soon...
- Challenges:
  - Untrusted nodes!!
  - Network partitions
  - Much harder to understand behavior
  - Harder to upgrade
  - Relatively few advantages...

Better: Smart Clients
- Mostly helps with:
  - Load balancing
  - Disaster tolerance
  - Overload
- Can also offload work from servers
- Can also personalize results
  - E.g. mix search results locally
  - Can include private data

Claim 4: XML doesn't help much...

The Problem
- Need services for computers to use
- HTML only works for people
  - Sites depend on human interpretation of ambiguous content
  - "Scraping" content is BAD
    - Very error prone
    - No strategy for evolution
- XML doesn't solve any of these issues!
  - At best: RPC with an extensible schema

Why it is hard...
- The real problem is *social*
  - What do the fields mean?
  - Who gets to decide?
  - Two sides still need to agree on the schema
- Doesn't make evolution better...
  - Can you ignore stuff you don't understand?
  - When can a field change? Consequences?
  - At least need a versioning system...
- XML can mislead us to ignore/postpone the real issues!
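To make the versioning point concrete, here is a minimal illustrative sketch (not from the talk) of a reader that checks an explicit schema version and drops fields it does not understand. The field names and the version policy are invented for the example; what a field means, who decides, and what a default implies are exactly the social agreements the code cannot supply by itself.

    # Hypothetical sketch: tolerant reading of an extensible record.
    # KNOWN_FIELDS and the "version <= 2" rule are made-up conventions.

    KNOWN_FIELDS = {"schema_version", "sku", "price", "currency"}

    def read_offer(record: dict) -> dict:
        version = record.get("schema_version", 1)
        if version > 2:
            # A newer version may have changed field meanings, so refusing is
            # safer than silently ignoring what we do not understand.
            raise ValueError(f"unsupported schema version {version}")
        offer = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
        offer.setdefault("currency", "USD")  # a default both sides must have agreed on
        return offer

    print(read_offer({"schema_version": 2, "sku": "A1", "price": 9.50, "promo_code": "X"}))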
Claim 5: New IT for the Third World

Plug for new area...
- Bridging the IT gap is the only long-term path to global stability
- Convergence makes it possible:
  - 802.11 wireless ($5/chipset)
  - Systems on a chip (cost, power)
  - Infrastructure services (cost, power)
- Goal: 10-100x reduction in overall cost and power

Five Claims
- Scalability is easy (availability is hard)
  - How to build a giant-scale system
- Services are King (infrastructure centric)
- P2P is cool but overrated
- XML doesn't help much
- Need new IT for the Third World

Backup

Refinement
- Retrieve part of a distilled object at higher quality
- [Figure: image distilled by 60x, then zoomed back in to the original resolution]

The CAP Theorem
- Consistency, Availability, Tolerance to network Partitions
- Theorem: you can have at most two of these properties for any shared-data system

Forfeit Partitions
- Examples:
  - Single-site databases
  - Cluster databases
  - LDAP
  - xFS file system
- Traits:
  - 2-phase commit
  - cache validation protocols

Forfeit Availability
- Examples:
  - Distributed databases
  - Distributed locking
  - Majority protocols
- Traits:
  - Pessimistic locking
  - Make minority partitions unavailable

Forfeit Consistency
- Examples:
  - Coda
  - Web caching
  - DNS
- Traits:
  - expirations/leases
  - conflict resolution
  - optimistic

These Tradeoffs are Real
- The whole space is useful
- Real internet systems are a careful mixture of ACID and BASE subsystems
  - We use ACID for user profiles and logging (for revenue)
- But there is almost no work in this area
- Symptom of a deeper problem: the systems and database communities are separate but overlapping (with distinct vocabulary)

CAP Take Homes
- Can have consistency & availability within a cluster (the foundation of Ninja), but it is still hard in practice
- OS/networking is good at BASE/availability, but terrible at consistency
- Databases are better at C than at availability
- Wide-area databases can't have both
- Disconnected clients can't have both
- All systems are probabilistic...

The DQ Principle
- Data/query * queries/sec = constant = DQ
  - for a given node
  - for a given app/OS release
- A fault can reduce the capacity (Q), the completeness (D), or both
- Faults reduce this constant linearly (at best)

Harvest & Yield
- Yield: fraction of answered queries
  - Related to uptime, but measured by queries, not by time
  - Drop 1 out of 10 connections => 90% yield
  - At full utilization: yield ~ capacity ~ Q
- Harvest: fraction of the complete result
  - Reflects that some of the data may be missing due to faults
  - Replication: maintain D under faults
- DQ corollary: harvest * yield ~ constant
  - ACID => choose 100% harvest (reduce Q but keep 100% D)

Harvest Options
1) Ignore lost nodes
   - RPC gives up
   - forfeit a small part of the database
   - reduce D, keep Q
2) Pair up nodes
   - RPC tries the alternate
   - survives one fault per pair
   - reduce Q, keep D
3) n-member replica groups
- Decide when you care...
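The harvest/yield bookkeeping above can be made concrete with a small sketch. The numbers are assumptions for illustration (100 identical nodes with the data spread evenly), not figures from the talk; the point is that options 1 and 2 trade D against Q differently, yet at saturation harvest * yield falls by the same factor either way.

    # Illustrative only: 100 identical nodes, data spread evenly across them.

    NODES = 100

    def ignore_lost_nodes(failed):
        harvest = (NODES - failed) / NODES   # forfeit that slice of the data (D drops)
        yield_fraction = 1.0                 # keep answering every query (Q intact)
        return harvest, yield_fraction

    def pair_up_nodes(failed):
        harvest = 1.0                        # the mirror keeps the data complete (D intact)
        yield_fraction = (NODES - failed) / NODES  # at saturation, capacity (Q) drops instead
        return harvest, yield_fraction

    for failed in (1, 5):
        h1, y1 = ignore_lost_nodes(failed)
        h2, y2 = pair_up_nodes(failed)
        print(failed, round(h1 * y1, 2), round(h2 * y2, 2))  # the two products match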
[Diagram: RAID-like node pairs]

Replica Groups
- With n members:
  - Each fault reduces Q by 1/n
  - D is stable until the nth fault
  - Added load is 1/(n-1) per fault
    - n=2 => double load, or 50% capacity
    - n=4 => 133% load, or 75% capacity
    - the "load redirection problem"
  - Disaster tolerance: better have > 3 mirrors
- (A numeric sketch of the redirected load appears at the end of these notes.)

Graceful Degradation
- Goal: smooth decrease in harvest/yield proportional to faults
  - we know DQ drops linearly
- Saturation will occur
  - high peak/average ratios...
  - must reduce harvest or yield (or both)
  - must do admission control!!!
- One answer: reduce D dynamically
  - disaster => redirect load, then reduce D to compensate for the extra load
  - (sketched at the end of these notes)

Thinking Probabilistically
- Maximize symmetry
  - SPMD + simple replication schemes
- Make faults independent
  - requires thought
  - avoid cascading errors/faults
  - understand redirected load
  - KISS
- Use randomness
  - makes the worst case and the average case the same
  - ex: Inktomi spreads data & queries randomly
  - node loss implies a random 1% harvest reduction

Server Pollution
- Can't fix all memory leaks
  - Third-party software leaks memory and sockets
  - so does the OS sometimes
  - Some failures tie up local resources
- Solution: planned periodic "bounce"
  - Not worth the stress to do any better
  - Bounce time is less than 10 seconds
  - Nice to remove load first...

Key New Problems
- Unknown but large growth
  - Incremental & absolute scalability
  - 1000's of components
- Must be truly highly available
  - Hot swap everything (no recovery time allowed)
  - No "night"
  - Graceful degradation under faults & saturation
- Constant evolution (internet time)
  - Software will be buggy
  - Hardware will fail
  - These can't be emergencies...

Conclusions
- Parallel programming is very relevant, except...
  - it historically avoids availability
  - no notion of online evolution
  - limited notions of graceful degradation (checkpointing)
  - best for CPU-bound tasks
- Must think probabilistically about everything
  - no such thing as a 100% working system
  - no such thing as 100% fault tolerance
  - partial results are often OK (and better than none)
  - Capacity * Completeness == Constant

Partial checklist
- What is shared? (namespace, schema?)
- What kind of state is in each boundary?
- How would you evolve an API?
- Lifetime of references? Expiration impact?
- Graceful degradation as modules go down?
- External persistent names?
- Consistency semantics and boundary?

The Move to Clusters
- No single machine can handle the load
  - Only solution is clusters
- Other cluster advantages:
  - Cost: about 50% cheaper per CPU
  - Availability: possible to build HA systems
  - Incremental growth: add nodes as needed
  - Replace whole nodes (easier)

Goals
- Sheer scale
  - Handle 100M users, going toward 1B
  - Largest: AOL web cache, 12B hits/day
- High availability
  - Large cost for downtime
    - $250K per hour for online retailers
    - $6M per hour for stock brokers
  - Disaster tolerance?
- Overload handling
- System evolution
- Decentralization?
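The load figures quoted on the Replica Groups slide follow from simple arithmetic. A minimal sketch, assuming n identical replicas and perfectly even redistribution of a failed replica's load:

    # Arithmetic behind the Replica Groups slide (assumed: even redistribution).

    def after_faults(n, faults=1):
        survivors = n - faults
        load_multiplier = n / survivors   # each survivor absorbs the lost share
        capacity = survivors / n          # ...or, if it cannot, group capacity drops
        return load_multiplier, capacity

    for n in (2, 4):
        load, cap = after_faults(n)
        print(f"n={n}: {load:.0%} load on each survivor, or {cap:.0%} capacity")
    # n=2: 200% load on each survivor, or 50% capacity
    # n=4: 133% load on each survivor, or 75% capacity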
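Similarly, the "reduce D dynamically" idea from the Graceful Degradation slide can be sketched as a one-line admission-control rule: when offered load exceeds a node's DQ capacity, shrink each query's harvest so that yield stays at 100%. The capacity and load numbers below are invented for illustration.

    # Illustrative policy: trade harvest (D) for yield under overload.

    def degrade(offered_qps, dq_capacity):
        if offered_qps <= dq_capacity:
            return 1.0, 1.0                      # full harvest, full yield
        harvest = dq_capacity / offered_qps      # answer every query with less data
        return harvest, 1.0

    print(degrade(8_000, 10_000))    # under capacity: (1.0, 1.0)
    print(degrade(15_000, 10_000))   # overloaded: harvest ~0.67, yield 1.0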