Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Analysing Cosmological Simulations in the Virtual Observatory: Designing and Mining the Millennium Simulation Database Gerard Lemson German Astrophysical Virtual Observatory (GAVO) ARI, Heidelberg MPE, Garching bei München 1 Acknowledgments   Alex Szalay Virgo consortium, in particular:  Volker Springel, Simon White, Gabriella DeLucia, Jeremy Blaizot (MPA, Munich, Germany),  Carlos Frenk, Richard Bower, John Helly (ICC, Durham, UK)  Similar efforts/sites to Millennium Database  Durham (mirror of Millennium DB)  Horizon/GalICS (Lyon)  ITVO (Trieste)  2 GAVO is funded by the German Federal Ministry for Education and Research (BMBF) Garching, June 26, 2008 Summary  VO aims to provide access to remote data for use/analysis by 3rd parties.  Data analysis requires  advanced methods for analysis  data  Data sets are often very large, often far away (makes them even larger!)  To analyse remote datasets, one needs to be able to bring the analysis to the data.  “Standard” approach using flat files and C/IDL/etc code sub-optimal  To analyse very large datasets we also need advanced methods of data organisation and data access  Structured approach supported by relational database system allows one to concentrate on science, iso worry about I/O optimisation etc  And the questions can become pretty complex ! 3 Garching, June 26, 2008 Case study: The Millennium Simulation Springel V. et al. 2005 Nature 435, 629 4 Garching, June 26, 2008 Millennium Simulation  Virgo consortium        Gadget 3 10 billion particles, dark matter only 500 Mpc periodic box Concordance model (as of 2004) initial conditions 64 snapshots 350000 CPU hours O(30Tb) raw + post-processed data  Post-processing data products complex and large  Challenge to analyse, even locally!  SimDAP-like approach required for remote access. 5 Garching, June 26, 2008 Intermezzo: Data Access is Hitting a Wall (courtesy Alex Szalay) FTP and GREP are not adequate       You can GREP/FTP 1 MB in a second You can GREP/FTP 1 GB in a minute You can GREP/FTP 1 TB in 2 days You can GREP/FTP 1 PB in 3 years SFTP much slower and 1PB ~2,000 disks  At some point you need indices to limit search parallel data search and analysis  This is where databases can help 6 Garching, June 26, 2008 Analysis and Databases (courtesy Alex Szalay)  Much statistical analysis deals with         Creating uniform samples -- data filtering Assembling relevant subsets Estimating completeness Censoring bad data Counting and building histograms Generating Monte-Carlo subsets Likelihood calculations Hypothesis testing  Traditionally these are performed on files  Most of these tasks are much better done inside a database 7 Garching, June 26, 2008 Advantages of relational databases  Encapsulation of data in terms of logical structure, no need to know about internals of data storage  Standard query language for finding information  Advanced query optimizers (indexes, clustering)  Transparent internal parallelization  Authenticated remote access for multiple users at same time     8 Forces one to think carefully about data structure Speeds up path from science question to answer Facilitates communication (query code is cleaner) Facilitates adaptation to IVOA standards (ADQL) Garching, June 26, 2008 Millennium Simulation Phenomenology  Density field on 2563 mesh  CIC  Gaussian smoothed: 1.25,2.5,5,10 Mpc/h  Friends-of-Friends (FOF) groups  SUBFIND Subhalos  Galaxies from 2 semi-analytical models (SAMs)  MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)  Durham (GalForm, Bower et al, 2006)  Subhalo and galaxy formation histories: merger trees  Mock catalogues on light-cone  Pencil beams (Kitzbichler & White, 2006)  All-sky (depth of SDSS spectral sample) (Blaizot et al, 2005)  In preparation: Spectra for light cone galaxies 9 Garching, June 26, 2008 10 Garching, June 26, 2008 Millennium Simulation Phenomenology  Density field on 2563 mesh  CIC  Gaussian smoothed: 1.25,2.5,5,10 Mpc/h  Friends-of-Friends (FOF) groups  SUBFIND Subhalos  Galaxies from 2 semi-analytical models (SAMs)  MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)  Durham (GalForm, Bower et al, 2006)  Subhalo and galaxy formation histories: merger trees  Mock catalogues on light-cone  Pencil beams (Kitzbichler & White, 2006)  All-sky (depth of SDSS spectral sample) (Blaizot et al, 2005)  In preparation: Spectra for light cone galaxies 11 Garching, June 26, 2008 FOF groups, (sub)halos and galaxies 12 Garching, June 26, 2008 Millennium Simulation Phenomenology  Density field on 2563 mesh  CIC  Gaussian smoothed: 1.25,2.5,5,10 Mpc/h  Friends-of-Friends (FOF) groups  SUBFIND Subhalos  Galaxies from 2 semi-analytical models (SAMs)  MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)  Durham (GalForm, Bower et al, 2006)  Subhalo and galaxy formation histories: merger trees  Mock catalogues on light-cone  Pencil beams (Kitzbichler & White, 2006)  All-sky (depth of SDSS spectral sample) (Blaizot et al, 2005)  In preparation: Spectra for light cone galaxies 13 Garching, June 26, 2008 Time evolution: merger trees 14 Garching, June 26, 2008 Millennium Simulation Phenomenology  Density field on 2563 mesh  CIC  Gaussian smoothed: 1.25,2.5,5,10 Mpc/h  Friends-of-Friends (FOF) groups  SUBFIND Subhalos  Galaxies from 2 semi-analytical models (SAMs)  MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)  Durham (GalForm, Bower et al, 2006)  Subhalo and galaxy formation histories: merger trees  Mock catalogues on light-cone  Pencil beams (Kitzbichler & White, 2006)  All-sky (depth of SDSS spectral sample) (Blaizot et al, 2005)  In preparation: Spectra for light cone galaxies 15 Garching, June 26, 2008 Mock catalogues 16 Garching, June 26, 2008 Millennium Simulation Phenomenology  Density field on 2563 mesh  CIC  Gaussian smoothed: 1.25,2.5,5,10 Mpc/h  Friends-of-Friends (FOF) groups  SUBFIND Subhalos  Galaxies from 2 semi-analytical models (SAMs)  MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)  Durham (GalForm, Bower et al, 2006)  Subhalo and galaxy formation histories: merger trees  Mock catalogues on light-cone  Pencil beams (Kitzbichler & White, 2006)  All-sky (depth of SDSS spectral sample) (Blaizot et al, 2005)  In preparation: Spectra for light cone galaxies 17 Garching, June 26, 2008 Synthetic spectra (not yet available) 18 Garching, June 26, 2008 Hierarchy of Data Products FOF Group Parent FOF group Subhalo Located in Light Cone Galaxy Density Field Mesh Cell SUBFIND result Located in original Spectrum SAM Galaxy Merger Tree Parent halo Subhalo MergerTree Tree relationships 19 Garching, June 26, 2008 Designing the Database   Need a model for data, including relations between different objects Model needs to support science: “20 questions” (following Gray & Szalay) 1. 2. 3. 4. 5. 6. 7. 8. 20 Return the galaxies residing in halos of mass between 10^13 and 10^14 solar masses. Return the galaxy content at z=3 of the progenitors of a halo identified at z=0 Return the complete halo merger tree for a halo identified at z=0 Find all the z=3 progenitors of z=0 red ellipticals (i.e. B-V>0.8 B/T > 0.5) Find the descendents at z=1 of all LBG's (i.e. galaxies with SFR>10 Msun/yr) at z=3 Find all the z=2 galaxies which were within 1Mpc of a LBG (i.e. SFR>10Msun/yr) at some previous redshift. Find the multiplicity function of halos depending on their environment (overdensity of density field smoothed on certain scale) Find the dependency of halo properties on environment Garching, June 26, 2008 Data model features  Each object its table  properties are columns  each a unique identifier  Relations implemented through foreign keys,     pointers to unique identifier column FOF to mesh cell it lies in Sub-halo to its FOF group galaxy to its sub-halo etc  Special design needed for  Hierarchical relations: merger trees  Spatial relations: multi-dimensional indexes required  Support for random sample selection 21 Garching, June 26, 2008 Formation histories: Subhalo and Galaxy merger trees  Tree structure  halos have single descendant  halos have main progenitor  Hierarchical structures usually handled using recursive code  inefficient for data access  not (well) supported in RDBs  Tree indexes  depth first ordering of nodes defines identifier  pointer to last progenitor in subtree 22 Garching, June 26, 2008 23 Garching, June 26, 2008 Merger trees : select prog.* from galaxies des , galaxies prog where des.galaxyId = 0 and prog.galaxyId between des.galaxyId and des.lastProgenitorId Branching points : select descendantId from galaxies des where descendantId != -1 group by descendantId having count(*) > 1 24 Garching, June 26, 2008 Spatial queries, random samples  Spatial queries require multi-dimensional indexes.  (x,y,z) does not work: need discretisation  index on (ix,iy,iz) with ix=floor(x/10) etc  More sophisticated: space filling curves  bit-interleaving/oct-tree/Z-Index  Peano-Hilbert curve  Need custom functions for range queries (Implemented in T-SQL)  Random sampling using a RANDOM column  RANDOM from [0,1000000] 25 Garching, June 26, 2008 The Millennium Database web site  SQLServer 2005 database  Web application (Java in Apache Tomcat web server)     portal: http://www.mpa-garching.mpg.de/millennium/ public DB access: http://www.g-vo.org/Millennium private access: http://www.g-vo.org/MyMillennium MyDB  Access methods  browser with plotting capabilities through VOPlot applet  wget + IDL, R  TOPCAT (3.1) 26 Garching, June 26, 2008 27 Garching, June 26, 2008 Usage statistics      28 Up since August 2006 (astro-ph/0608019) ~225 registered users > 5 million queries > 40 billion rows ~130 papers, ~50% not related to Virgo consortium (see http://www.mpa-garching.mpg.de/millennium ) Garching, June 26, 2008 Some science questions and their implementation as SQL If time permits, in any case 1-1 demo possible. 29 Find light cone galaxies in a slice in redshift, RA and Dec select from where and 30 ra,dec,redshift_obs kitzbichler2006a_obs redshift_obs between 1 and 1.1 dec between -.05 and .0 Garching, June 26, 2008 Color-magnitude for random sample of galaxies select , , from where and and 31 mag_bdust mag_bdust - mag_vdust as color type delucia2006a snapnum=63 random between 0 and 100 mag_b < 0 Garching, June 26, 2008 32 Garching, June 26, 2008 Get merger tree for identified galaxy select p.snapnum , p.x,p.y,p.z , p.stellarmass , p.mag_b-p.mag_v as color from delucia2006a d , delucia2006a p where d.galaxyid=0 and p.galaxyid between d.galaxyid and d.lastprogenitorid 33 Garching, June 26, 2008 34 Garching, June 26, 2008 35 Garching, June 26, 2008 Histogram of density field at redshifts 0,1,2,3; Gaussian smoothing 5 Mpc/h select , , from where group order 36 snapnum .01*floor(f.g5/.01) as g5 count(*) as num mfield f f.snapnum in (63,41,32,27) by snapnum,.01*floor(f.g5/.01) by 1,2 Garching, June 26, 2008 37 Garching, June 26, 2008 FOF multiplicity function at redshifts 0,1,2,3, select snapnum , .1*floor(log10(np)/.1) as lognp , count(*) as num from fof where snapnum in (63,41,32,27) group by snapnum , .1*floor(log10(np)/.1) order by 1,2 38 Garching, June 26, 2008 39 Garching, June 26, 2008 FOF mass multiplicity function, conditioned on density in environment select .1*floor(log10(fof.np)/.1) as lognp , count(*) as num from mfield f , fof where fof.snapnum=f.snapnum and fof.phkey = f.phkey and f.snapnum=63 and f.g5 between 1 and 1.1 group by .1*floor(log10(fof.np)/.1) order by 1 40 Garching, June 26, 2008 41 Garching, June 26, 2008 Thank you ! 42