RM3G: Next Generation Recovery Manager
Steve Zhang and Armando Fox
Stanford University

Design Goals
- Overall goal: manage the detection of and recovery from system failures
- New in 3G: focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
  - The previous generation used End-2-End and Exception monitors
- Do not tie ourselves to any particular algorithm, and make new algorithms easy to plug in
- Standardize the APIs for observation, analysis, and control of system components
- Provide common services and abstractions to SLT algorithms
- RM itself must also be resilient to failures
[Diagram labels: SLTs, RM3G, Comp]

© 2004 Steve Zhang

RADS Architecture
[Diagram labels: User, Operator, Client, Server, Distributed Middleware, SLT Services (RM3G), PNE, Edge Network, Application-Specific Overlay Network, Router, Commodity Internet & IP networks]

Design Diagram
[Diagram labels: SLT Processes (spawned by SLT Proc Srv), Comp A, Comp B, Comp C, SLT Plug-ins, Data Store Srv, SLT Select Srv, Ctrl Srv, Ctrl/Obsrv point descriptors, Control policies, RM Proc Srv, Observation Points, RMDB, Name & Reg Srv, Control Points]

Collaboration with ACME
- Infrastructure for monitoring, analyzing, and controlling Internet-scale systems
  - Sensors = Observation Points
  - Actuators = Control Points
- RM potentially benefits from two ACME features
  - An in-network aggregator that combines data from sensors as they are routed through an overlay network
  - A configuration language that specifies under what conditions to trigger actuators
- ACME could benefit from more powerful sensor data analysis using SLTs

Observation Points
- We want to avoid requiring every component to be individually instrumented
  - Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
- Several types of observation data can be collected in an application-generic way
  - The OS can provide application-level data (e.g. memory usage, number of open files) and system-level data (e.g. size of swap space, network ports used)
  - Middleware can provide intra-application data (e.g. interaction between different components of an application)

SLT Data Services
- Abstracts information from observation points
  - SLT algorithms are spawned for each component in the system as the components are instantiated
  - Observation data is stored by the SLT Data Server, possibly in a streaming database
- Listens for feedback from SLT algorithms to adjust the data stream as necessary
  - Increase the data sampling rate if an anomaly is suspected
  - Stop reporting certain data if it is deemed irrelevant
- Provides persistent data storage for SLT algorithms
  - Remember properties learned from previous analysis of observation data

Control Points
- Assumes crash-only components
  - Components can be reliably restarted through external means (we can't rely on components restarting themselves cleanly)
- Initially, only restart control points are supported
  - Instrument the application server (JBoss) to restart applications and application components
  - The OS can restart application servers
  - IP-addressable power strips can restart entire nodes
- Components can specify a custom control policy
  - Leverage ACME's configuration language

Future Work
- "Master" SLT
  - Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
- Support additional types of control points
  - Multiple-level settings that tune component parameters (e.g. filter level)
- Support additional types of observation points
  - Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
- Online SLT algorithms for anomaly detection are not mature
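
Illustrative sketch (not from the slides): the loop described under SLT Data Services and Control Points, where a per-component SLT plug-in analyzes observation data, feeds back to the data server to raise the sampling rate when an anomaly is suspected, and triggers a restart control point on failure, might look roughly like the Python below. Every name here (SLTPlugin, ThresholdDetector, DataServer, RecoveryManager, the verdict strings, and the interval values) is a hypothetical chosen for illustration, and the fixed-threshold detector is only a stand-in for a real SLT algorithm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Protocol


class SLTPlugin(Protocol):
    """Hypothetical plug-in interface: one detector instance per component."""

    def observe(self, sample: Dict[str, float]) -> str:
        """Return 'ok', 'suspect', or 'failed' for the latest sample."""
        ...


@dataclass
class ThresholdDetector:
    """Stand-in for an SLT algorithm: compares one metric to fixed limits."""
    metric: str
    suspect_above: float
    failed_above: float

    def observe(self, sample: Dict[str, float]) -> str:
        value = sample[self.metric]
        if value > self.failed_above:
            return "failed"
        if value > self.suspect_above:
            return "suspect"
        return "ok"


@dataclass
class DataServer:
    """Tracks a per-component sampling interval and adjusts it on feedback."""
    base_interval_s: float = 10.0   # assumed default sampling interval
    fast_interval_s: float = 1.0    # assumed interval while under suspicion
    intervals: Dict[str, float] = field(default_factory=dict)

    def interval(self, component: str) -> float:
        return self.intervals.get(component, self.base_interval_s)

    def feedback(self, component: str, verdict: str) -> None:
        # Sample faster while an anomaly is suspected; otherwise use the base rate.
        if verdict == "suspect":
            self.intervals[component] = self.fast_interval_s
        else:
            self.intervals[component] = self.base_interval_s


@dataclass
class RecoveryManager:
    """Routes samples to plug-ins and invokes the restart control point."""
    data_server: DataServer
    restart: Callable[[str], None]  # external restart control point
    plugins: Dict[str, SLTPlugin] = field(default_factory=dict)

    def register(self, component: str, plugin: SLTPlugin) -> None:
        self.plugins[component] = plugin

    def report(self, component: str, sample: Dict[str, float]) -> str:
        verdict = self.plugins[component].observe(sample)
        self.data_server.feedback(component, verdict)
        if verdict == "failed":
            # Crash-only assumption: recovery is an externally driven restart.
            self.restart(component)
        return verdict
```

In this sketch a caller would construct a RecoveryManager with a DataServer and a restart callback, register one detector per component, and call report() for each new sample; swapping in a different detector only requires another class with the same observe() method.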