EMBL-EBI
Visualization & Data mining

Visualisation
  The process of representing abstract data to aid in understanding the meaning of the data.
  Not to be confused with rendering data (drawing pictures).
  Typically, though, we render data in such a way as to visualize the information within that data.

Introduction
  Biological data comes from, and is of interest to:
    Chemists : reaction mechanism, drug design
    Biologists : sequence, expression, homology, function
    Structural biologists : atomic structure, fold, classification, function
    Medicine : clinical effect
    Education
    Media
  Presentation of diverse information to a diverse audience.
  Each has their own point of view (context).
    Expert = scientist working within their own field of expertise
    Non-expert = scientist using data/information outside their field
    Novice = non-scientist

Not just presentation of results
  Web pages
    These are notoriously badly designed, often leaving the information on the site unusable.
    The front page should load quickly
    The main point should appear on the first full screen
    Clutter : not logically laid out
    Too busy : cannot find the salient point
    8% of men & 0.5% of women are colour blind
    Bad text/fonts
    Google is a good design
  Too often it doesn't work
    User will go somewhere else
    The latest whizz-bang stuff only works on the latest browsers
    Only works in one browser : they only tested on one
    Does not conform to standard HTML

Asking questions
  Biological data is very complex
    Chemistry, Biology, Physics, Statistics, Medicine…
  Most users will be from a different field
  Asking the right question is difficult.
    The user cannot use the correct terminology
    Too many things to query (2000 attributes in MSD)
  SQL : not suitable for most users
  Interface too complex
    Too many check boxes, widgets etc.
    Trying to be too clever
    The “Go” button is buried somewhere

Result presentation
  Biological data is complex
    Chemistry, physics, biology, statistics, medicine…
  Expert users want all the detail
    i.e. they want to use a specific method
    They want all the details
    They want (I hope) the statistical validity of the results
  The non-expert wants the best-practice answer returned within their own context.
    They want comparative analysis with other fields
    They want to know the results are valid

Query design
  The simple text box design is very common
    Suitable for text queries
    Only one logic : AND or OR
    Predefined
    Easy to use
    Limited scope
  2000 attributes -> 2000 check-boxes !

Query design
  Graphical interface
    Multiple logic : AND/OR/NOT
    Under the user's control
    Slower
    Steep learning curve
    Some users just cannot get it
    Intuitive once mastered
    Pretty

Query design
  Figurative 2D sketch for 3D query (active sites)
    Informative : presents meaning for the question
    Slower
    Less error prone
  HIS|SER:S/H>C2.0 HIS.ne2:S/S>C2.0 HIS.[n]/T>C2.0

YAMGP (yet another molecular graphics program)
  Many different programs are available:
  AstexViewer@MSD-EBI, LigPlot, VMD, Quanta, InsightII, Bobscript, WebMol, Frodo, iMol, Chime, Grasp, Pymol, POVRay, Spock, Rasmol, Mage, Raster3D, Yasara, Molscript, Chimera, O, MolMol, Whatif, XtalView, WebLab-viewer, Swiss-PDBviewer

Result visualisation
  Multiple types of biological data
    Textual data
    3D structure
    2D chemical sketches
    1D sequence
    Node linked
    General/derived data
    Web pages
    Time
    Errors/Variance
    Patented !
Visualisation : AstexViewer@MSD-EBI
  Visualisation
    Structure/sequence/data
    Lensing
    Linked views
    Brushing
    Picking
    Flying views
    Hyperbolic distortion
    Animation
    Solid rendering
    Depth cues
    Colour, lighting
    Highlighting
    Etc…

Visualisation : comparative analysis
  Similarity/Difference
    Data superposition
  Attribute display
    Colour, size…
  Correlation
    Attribute mapping
    Sequence colour by structure alignment
  Analysis Example

Animation
  Time dependent display
    Reaction chemistry
    Visual clues
    Expression data
  Shown as…
    Rotation
    Flash
    On/off
    Object synchronization
    Size, colour…
  Sound
    NO : incredibly annoying
  Animation Example

Multidimensional analysis
  Comparative analysis on multiple data
    e.g. Phi, Psi, B-value, Omega
  1D & 2D are easy
  3D graphs are difficult to see.
  4D requires 3D + iso-surfaces
  Higher : too busy
  Use 2D + multiple properties
    SPOTFIRE is the best known
    Use : X/Y/colour/size/shape…
    Interactive bracketing
  Example

Visualization : Summary
  Rendering data is not visualization
  Not just the display of results
  Huge array of non-specific techniques : an entire scientific field !

Data mining
  “Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary)
  “True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)

Data mining & Data analysis
  Traditional analysis is via “verification-driven analysis”
    Requires a hypothesis of the desired information (target)
    Requires correct interpretation of the proposed query
  Discovery-driven data mining
    Finds data with common characteristics
    Results are ideal solutions to discovery
    Finds results without previous hypothesis
    Results have unbiased mean and variance

So what is Hypothesis driven data analysis ?
  Define a target = hypothesis
  Search for target
  There are/are-not “hits”
  Verify/negate hypothesis
  Distribution is centred on target
  Example target : “catalytic triad”
    HIS,ASP,SER : text string matching
    Atomic coordinates : coordinate superposition
    Mathematical graph : graph matching
    (diagram labels : data hierarchy, knowledge)

Four types of data mining
  Creation of predictive models : future data expectation
  Link analysis : connections between data objects
  Database segmentation : classification
  Deviation detection : finding outliers
  IBM : white papers

So what is this data mining ?
  Given multiple sets of primary data (dependent variables)
    Characters, numbers, Function(numbers), …
  Find anomalies
    Too many : numerical occurrence
    Data variation : derivatives
    Singularities
    …
  Correlations and clusters
    Within primary data
    With other data (independent variables)
  Finds new things ! But not what it means !

Example
  Retail and financial industries are heavily into DM.
  A well-known US food supermarket chain found a correlation :
    Babies' nappies
    Beer
    5pm on Friday
  Wife rings husband, “get some nappies for the weekend”
  Husband takes opportunity to buy some beer !
  You won't get grant funding to test this hypothesis !

Self/Cross data mining
  Most mining software looks for correlations between dependent variables.
    Rainfall, temperature, cloud-cover
    It rains when it is cloudy
    Free : http://www.cs.waikato.ac.nz/~ml/
  Bioinformatics usually involves anomalies within data objects
    Sequence clusters (sequence fingerprints)
    Local coordinate clusters (active sites)
    Global coordinate clusters (folds)

Data mining : not idiot proof
  Date of birth and age will give 100% correlation
  Authors for structure submission will be correlated to authors on primary citation.
  “Lysozyme” is the most common fold pattern
  36 spellings of E. coli will mask results.
  Requires representative sets
    Statistically valid ones too !
  Signal/Noise ratio is a problem

Discovery driven data mining of the PDB
  Analysis of 3-dimensional coordinates
  Common patterns of atomic interactions defined locally
    DB segmentation : active sites & common packing features
    Link analysis : similarity between different functional groups
  Defined globally
    DB segmentation : common patterns of super-secondary structure
    Link analysis : common folds in diverse protein families
    Outlier detection : unique folds

Issues
  Systematic “error” propagates as solution
    300 lysozyme structures return as a strong solution
  Results cannot be found below the noise level
    Need to characterise the noise level
    Need to improve signal/noise ratio (S/N) to see information
  Target is not biologically defined
    It does not give you the biological answer
    Results should reproduce known biology
    Can give you new results not previously observed

Data selection
  Cannot leave in 300 lysozyme structures !
    Select by sequence similarity at 70% exact alignment
  Different “phase space” to select data
    Remove structures with resolution < 2.5 Å
    Remove NMR (different statistics)
    Remove pre-1982 etc.
    Geometrical analysis criteria to check for outliers
    Using properties, NOT target parameters of structure solution

Local atomic interactions
  Data
    Function(3D coordinates) = distance
    Atom names (independent variable)
    Residue names (independent variable)
  Create 3D hash table of triplets of distances(*) between “points”
    This is the dependent variable
    Order = 3

Local atomic interactions
  Merge triplets
    Any pair of N-fold interactions is an (N+1)-fold interaction if they have (N-1) equivalence.
    Order = N
  Just keep going until no more (N+1)-fold interactions are found.
  Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40)

Catalytic quartet

Electrostatic interaction
  Ligands are found close by rather than associated with the residues

Iron binding site

Double disulphide

N-linked glycosylation binding site
  Spot the non-sugar
  This glycosylation site is the same as the active site found in “1a53” : indole-3-glycerol phosphate synthase

Summary
  Nearly all Bioinformatics is based on hypothesis driven data analysis
  Data mining has lost its meaning within Bioinformatics.
  Discovery driven data-analysis (true data mining) :
    Can find unknown dependencies, clusters, outliers
    Is based on statistical probability
    Returns distributions unbiased by previous ideas
  Information technology may be better for genomes (1D)
    “A numerical measure of the uncertainty of an outcome”
    Information content of gene sequences can be defined by the normalized probability of finding “words” within that sequence
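
The final point of the summary can be illustrated with a small sketch. The slides do not give a formula, so this reads "information content" as the Shannon entropy of the word distribution, which is an assumption. The function names, the use of overlapping words, and the default word length of 3 are also illustrative choices rather than anything stated in the slides.

```python
import math
from collections import Counter


def word_probabilities(sequence: str, k: int) -> dict[str, float]:
    """Normalised frequency of each overlapping k-letter word in the sequence."""
    # Slide a window of length k along the sequence (overlapping words: an assumption).
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(words)
    total = sum(counts.values())
    # For sequences shorter than k there are no words and the result is empty.
    return {word: count / total for word, count in counts.items()}


def information_content(sequence: str, k: int = 3) -> float:
    """Shannon entropy, in bits, of the k-letter word distribution (assumed measure)."""
    probabilities = word_probabilities(sequence, k)
    return -sum(p * math.log2(p) for p in probabilities.values())
```

A sequence made of a single repeated word carries no uncertainty under this measure, while a sequence in which every word is equally likely gives the maximum value for that number of distinct words.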