Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer, UC Berkeley
Vanguard Conference, May 5, 2003

Five Claims
- Scalability is easy (availability is hard)
  - How to build a giant-scale system
- Services are King (infrastructure centric)
- P2P is cool but overrated
- XML doesn't help much
- Need new IT for the Third World

Claim 1: Availability >> Scalability

Key New Problems
- Unknown but large growth
  - Incremental & absolute scalability
  - 1000's of components
- Must be truly highly available
  - Hot swap everything (no recovery time)
  - $6M/hour for stocks, $300K/hour for retailers
  - Graceful degradation under faults & saturation
- Constant evolution (internet time)
  - Software will be buggy
  - Hardware will fail
  - These can't be emergencies...

Typical Cluster
[Diagram: a typical cluster]

Scalability is EASY
- Just add more nodes...
  - Unless you want HA
  - Or you want to change the system...
- Hard parts:
  - Availability
  - Overload
  - Evolution

Step 1: Basic Architecture
[Diagram: clients reach a single-site server over the IP network; a load manager fronts the nodes, an optional backplane connects them, and a persistent data store sits behind them]

Step 2: Divide Persistent Data
- Transactional (expensive, must be correct)
- Supporting data
  - HTML, images, applets, etc.
  - Read mostly
  - Publish in snapshots (normally stale)
- Session state
  - Persistent across connections
  - Limited lifetime
  - Can be lost (rarely)

Step 3: Basic Availability
- a) Depend on layer-7 switches
  - Isolate external names (IP address, port) from specific machines
- b) Automatic detection of problems
  - Node-level checks (e.g. memory footprint)
  - (Remote) app-level checks
- c) Focus on MTTR
  - Easier to fix & test than MTBF

Step 4: Overload Strategy
- Goal: degrade service (some) to allow more capacity during overload
- Examples:
  - Simpler pages (less dynamic, smaller size)
  - Fewer options (just the basics)
- Must kick in automatically
  - "overload mode"
  - Relatively easy to detect

Step 5: Disaster Tolerance
- a) Pick a few locations
  - Independent failure
- b) Dynamic redirection of load
  - Best: client-side control
  - Next: switch all traffic (long routes)
  - Worst: DNS remapping (takes a while)
- c) Target site will get overloaded
  - ...but you have overload handling

Step 6: Online Evolution
- Goal: rapid evolution without downtime
- a) Publishing model
  - Decouple development from the live system
  - Atomic push of content
  - Automatic revert if trouble arises
- b) Three methods

Evolution: Three Approaches
- Flash upgrade
  - Fast reboot into the new version
  - Focus on MTTR (< 10 sec)
  - Reduces yield (and uptime)
- Rolling upgrade
  - Upgrade nodes one at a time in a "wave"
  - Temporary 1/n harvest reduction, 100% yield
  - Requires co-existing versions
- "Big Flip"

The Big Flip
- Steps:
  1) Take down 1/2 the nodes
  2) Upgrade that half
  3) Flip the "active half" (site upgraded)
  4) Upgrade the second half
  5) Return to 100%
- Avoids mixed versions (!)
  - Can replace schema, protocols, ...
- Twice used to change physical location
- (A code sketch of this procedure appears after the next slide.)

The Hat Trick
- Merge the handling of:
  - Disaster tolerance
  - Online evolution
  - Overload handling
- The first two reduce capacity, which then kicks in overload handling (!)
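The big flip is essentially an orchestration script. Below is a minimal sketch in Python, assuming a hypothetical node list and hypothetical upgrade()/set_active() helpers; the real load manager and content-push tooling are site-specific and not described in the talk.

    # Sketch of the "big flip": upgrade half the site at a time so the two
    # software versions never serve traffic together. Node names and the
    # upgrade()/set_active() helpers are hypothetical placeholders.

    def upgrade(node):
        print(f"upgrading {node} to the new version")

    def set_active(nodes):
        print(f"load manager now routes all traffic to: {', '.join(nodes)}")

    def big_flip(nodes):
        half_a, half_b = nodes[:len(nodes) // 2], nodes[len(nodes) // 2:]
        set_active(half_b)             # 1) take down half A; site runs at ~50% capacity
        for n in half_a:               # 2) upgrade the idle half
            upgrade(n)
        set_active(half_a)             # 3) flip: the upgraded half becomes the live site
        for n in half_b:               # 4) upgrade the second half
            upgrade(n)
        set_active(half_a + half_b)    # 5) return to 100% capacity

    big_flip([f"node{i:02d}" for i in range(8)])

Because only one half is ever live, the two versions never have to interoperate, which is what lets the flip change schemas, protocols, or even physical location.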
Claim 2: Services are King

Coming of The Infrastructure
[Chart, 1996-2000, scale 0-100: the Telephone vs. the PC vs. non-PC Internet devices]

Infrastructure Services
- Much simpler devices
  - lower cost & more functionality
  - longer battery life
- Data is in the infrastructure
  - can lose the device
  - enables groupware
  - can update/access from home or work
    - phone book on the web, not in the phone
    - can use a real PC & keyboard
- Much faster access
  - Surfing is 3-7 times faster
  - Graphics look good

Transformation Examples
- Tailor content for each user & device
- [Figure: distillation examples with reductions of 6.8x, 65x, and 10x, including a distilled text excerpt ("The Remote Queue Model: We introduce Remote Queues (RQ), ...")]

Infrastructure Services (2)
- Much cheaper overall cost (20x?)
  - Device utilization = 4%, infrastructure = 80%
  - Admin & support costs also decrease
- "Super Convergence" (all -> IP)
  - View PowerPoint slides with teleconference
  - Integrated cell phone, pager, web/email access
  - Map, driving directions, location-based services
- Can upgrade/add services in place!
  - Devices last longer and grow in usefulness
  - Easy to deploy new services => new revenue

Internet Phases (prediction)
- Internet as new media
  - HTML, basic search
- Consumer services (today)
  - Shopping, travel, Yahoo!, eBay, tickets, ...
- Industrial services
  - XML, micropayments, spot markets
- Everything in the infrastructure
  - store your data in the infrastructure
  - access anytime/anywhere

Claim 3: P2P is cool but overrated

P2P Services?
- Not soon...
- Challenges:
  - Untrusted nodes!!
  - Network partitions
  - Much harder to understand behavior
  - Harder to upgrade
  - Relatively few advantages...

Better: Smart Clients
- Mostly helps with:
  - Load balancing
  - Disaster tolerance
  - Overload
- Can also offload work from servers
- Can also personalize results
  - E.g. mix search results locally
  - Can include private data

Claim 4: XML doesn't help much...

The Problem
- Need services for computers to use
- HTML only works for people
  - Sites depend on human interpretation of ambiguous content
  - "Scraping" content is BAD
    - Very error prone
    - No strategy for evolution
- XML doesn't solve any of these issues!
  - At best: RPC with an extensible schema

Why it is hard...
- The real problem is *social*
  - What do the fields mean?
  - Who gets to decide?
  - Two sides still need to agree on the schema
- Doesn't make evolution better...
  - Can you ignore stuff you don't understand?
  - When can a field change? Consequences?
  - At least need a versioning system...
- XML can mislead us to ignore/postpone the real issues!
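To make the versioning point concrete, here is a minimal illustrative sketch (not from the talk) of a reader that checks an explicit schema version and drops fields it does not understand. The field names and the version policy are invented for the example; what a field means, who decides, and what a default implies are exactly the social agreements the code cannot supply by itself.

    # Hypothetical sketch: tolerant reading of an extensible record.
    # KNOWN_FIELDS and the "version <= 2" rule are made-up conventions.

    KNOWN_FIELDS = {"schema_version", "sku", "price", "currency"}

    def read_offer(record: dict) -> dict:
        version = record.get("schema_version", 1)
        if version > 2:
            # A newer version may have changed field meanings, so refusing is
            # safer than silently ignoring what we do not understand.
            raise ValueError(f"unsupported schema version {version}")
        offer = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
        offer.setdefault("currency", "USD")  # a default both sides must have agreed on
        return offer

    print(read_offer({"schema_version": 2, "sku": "A1", "price": 9.50, "promo_code": "X"}))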
Claim 5: New IT for the Third World

Plug for new area...
- Bridging the IT gap is the only long-term path to global stability
- Convergence makes it possible:
  - 802.11 wireless ($5/chipset)
  - Systems on a chip (cost, power)
  - Infrastructure services (cost, power)
- Goal: 10-100x reduction in overall cost and power

Five Claims
- Scalability is easy (availability is hard)
  - How to build a giant-scale system
- Services are King (infrastructure centric)
- P2P is cool but overrated
- XML doesn't help much
- Need new IT for the Third World

Backup

Refinement
- Retrieve part of a distilled object at higher quality
- [Figure: image distilled by 60x, then zoomed back in to the original resolution]

The CAP Theorem
- Consistency, Availability, Tolerance to network Partitions
- Theorem: you can have at most two of these properties for any shared-data system

Forfeit Partitions
- Examples:
  - Single-site databases
  - Cluster databases
  - LDAP
  - xFS file system
- Traits:
  - 2-phase commit
  - cache validation protocols

Forfeit Availability
- Examples:
  - Distributed databases
  - Distributed locking
  - Majority protocols
- Traits:
  - Pessimistic locking
  - Make minority partitions unavailable

Forfeit Consistency
- Examples:
  - Coda
  - Web caching
  - DNS
- Traits:
  - expirations/leases
  - conflict resolution
  - optimistic

These Tradeoffs are Real
- The whole space is useful
- Real internet systems are a careful mixture of ACID and BASE subsystems
  - We use ACID for user profiles and logging (for revenue)
- But there is almost no work in this area
- Symptom of a deeper problem: the systems and database communities are separate but overlapping (with distinct vocabulary)

CAP Take Homes
- Can have consistency & availability within a cluster (the foundation of Ninja), but it is still hard in practice
- OS/networking is good at BASE/availability, but terrible at consistency
- Databases are better at C than at availability
- Wide-area databases can't have both
- Disconnected clients can't have both
- All systems are probabilistic...

The DQ Principle
- Data/query * queries/sec = constant = DQ
  - for a given node
  - for a given app/OS release
- A fault can reduce the capacity (Q), the completeness (D), or both
- Faults reduce this constant linearly (at best)

Harvest & Yield
- Yield: fraction of answered queries
  - Related to uptime, but measured by queries, not by time
  - Drop 1 out of 10 connections => 90% yield
  - At full utilization: yield ~ capacity ~ Q
- Harvest: fraction of the complete result
  - Reflects that some of the data may be missing due to faults
  - Replication: maintain D under faults
- DQ corollary: harvest * yield ~ constant
  - ACID => choose 100% harvest (reduce Q but keep 100% D)

Harvest Options
1) Ignore lost nodes
   - RPC gives up
   - forfeit a small part of the database
   - reduce D, keep Q
2) Pair up nodes
   - RPC tries the alternate
   - survives one fault per pair
   - reduce Q, keep D
3) n-member replica groups
- Decide when you care...
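The harvest/yield bookkeeping above can be made concrete with a small sketch. The numbers are assumptions for illustration (100 identical nodes with the data spread evenly), not figures from the talk; the point is that options 1 and 2 trade D against Q differently, yet at saturation harvest * yield falls by the same factor either way.

    # Illustrative only: 100 identical nodes, data spread evenly across them.

    NODES = 100

    def ignore_lost_nodes(failed):
        harvest = (NODES - failed) / NODES   # forfeit that slice of the data (D drops)
        yield_fraction = 1.0                 # keep answering every query (Q intact)
        return harvest, yield_fraction

    def pair_up_nodes(failed):
        harvest = 1.0                        # the mirror keeps the data complete (D intact)
        yield_fraction = (NODES - failed) / NODES  # at saturation, capacity (Q) drops instead
        return harvest, yield_fraction

    for failed in (1, 5):
        h1, y1 = ignore_lost_nodes(failed)
        h2, y2 = pair_up_nodes(failed)
        print(failed, round(h1 * y1, 2), round(h2 * y2, 2))  # the two products match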
[Diagram: RAID-like node pairs]

Replica Groups
- With n members:
  - Each fault reduces Q by 1/n
  - D is stable until the nth fault
  - Added load is 1/(n-1) per fault
    - n=2 => double load, or 50% capacity
    - n=4 => 133% load, or 75% capacity
    - the "load redirection problem"
  - Disaster tolerance: better have > 3 mirrors
- (A numeric sketch of the redirected load appears at the end of these notes.)

Graceful Degradation
- Goal: smooth decrease in harvest/yield proportional to faults
  - we know DQ drops linearly
- Saturation will occur
  - high peak/average ratios...
  - must reduce harvest or yield (or both)
  - must do admission control!!!
- One answer: reduce D dynamically
  - disaster => redirect load, then reduce D to compensate for the extra load
  - (sketched at the end of these notes)

Thinking Probabilistically
- Maximize symmetry
  - SPMD + simple replication schemes
- Make faults independent
  - requires thought
  - avoid cascading errors/faults
  - understand redirected load
  - KISS
- Use randomness
  - makes the worst case and the average case the same
  - ex: Inktomi spreads data & queries randomly
  - node loss implies a random 1% harvest reduction

Server Pollution
- Can't fix all memory leaks
  - Third-party software leaks memory and sockets
  - so does the OS sometimes
  - Some failures tie up local resources
- Solution: planned periodic "bounce"
  - Not worth the stress to do any better
  - Bounce time is less than 10 seconds
  - Nice to remove load first...

Key New Problems
- Unknown but large growth
  - Incremental & absolute scalability
  - 1000's of components
- Must be truly highly available
  - Hot swap everything (no recovery time allowed)
  - No "night"
  - Graceful degradation under faults & saturation
- Constant evolution (internet time)
  - Software will be buggy
  - Hardware will fail
  - These can't be emergencies...

Conclusions
- Parallel programming is very relevant, except...
  - it historically avoids availability
  - no notion of online evolution
  - limited notions of graceful degradation (checkpointing)
  - best for CPU-bound tasks
- Must think probabilistically about everything
  - no such thing as a 100% working system
  - no such thing as 100% fault tolerance
  - partial results are often OK (and better than none)
  - Capacity * Completeness == Constant

Partial checklist
- What is shared? (namespace, schema?)
- What kind of state is in each boundary?
- How would you evolve an API?
- Lifetime of references? Expiration impact?
- Graceful degradation as modules go down?
- External persistent names?
- Consistency semantics and boundary?

The Move to Clusters
- No single machine can handle the load
  - Only solution is clusters
- Other cluster advantages:
  - Cost: about 50% cheaper per CPU
  - Availability: possible to build HA systems
  - Incremental growth: add nodes as needed
  - Replace whole nodes (easier)

Goals
- Sheer scale
  - Handle 100M users, going toward 1B
  - Largest: AOL web cache, 12B hits/day
- High availability
  - Large cost for downtime
    - $250K per hour for online retailers
    - $6M per hour for stock brokers
  - Disaster tolerance?
- Overload handling
- System evolution
- Decentralization?
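The load figures quoted on the Replica Groups slide follow from simple arithmetic. A minimal sketch, assuming n identical replicas and perfectly even redistribution of a failed replica's load:

    # Arithmetic behind the Replica Groups slide (assumed: even redistribution).

    def after_faults(n, faults=1):
        survivors = n - faults
        load_multiplier = n / survivors   # each survivor absorbs the lost share
        capacity = survivors / n          # ...or, if it cannot, group capacity drops
        return load_multiplier, capacity

    for n in (2, 4):
        load, cap = after_faults(n)
        print(f"n={n}: {load:.0%} load on each survivor, or {cap:.0%} capacity")
    # n=2: 200% load on each survivor, or 50% capacity
    # n=4: 133% load on each survivor, or 75% capacity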
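Similarly, the "reduce D dynamically" idea from the Graceful Degradation slide can be sketched as a one-line admission-control rule: when offered load exceeds a node's DQ capacity, shrink each query's harvest so that yield stays at 100%. The capacity and load numbers below are invented for illustration.

    # Illustrative policy: trade harvest (D) for yield under overload.

    def degrade(offered_qps, dq_capacity):
        if offered_qps <= dq_capacity:
            return 1.0, 1.0                      # full harvest, full yield
        harvest = dq_capacity / offered_qps      # answer every query with less data
        return harvest, 1.0

    print(degrade(8_000, 10_000))    # under capacity: (1.0, 1.0)
    print(degrade(15_000, 10_000))   # overloaded: harvest ~0.67, yield 1.0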