Download Section 2 XML as Semi Structured Data

Brian Sabino CS 632 <XML> Introduction It's estimated that today 90% of the worlds data sits outside of any database system, usually in the file system or a custom application [PR 98]. Much of this data is unstructured or only semi-structured and as such is not suitable for storage in today's world of RDBMSs. XML holds the promise of a common interchange format, and, more importantly from our perspective, a method for giving data enough structure that it can be stored in a database. This will allow the database community to offer efficient storage, transaction processing, and the ability to query data to areas where databases have traditionally had little impact. So, the question now becomes how to deal with XML documents and data. Do we need an entirely new data model? Can we adapt relational databases to work with semi-structured XML data? And finally what types of queries should we support, and in what language? This paper starts with a very brief introduction to XML in section 1. In section 2 we will examine XML as traditional semi-structured data and give an introduction to LORE. In section 3 we will hear some opposing views and examine the possibility of storing XML in a traditional RDBMS with a look at STORED. Finally in section 4 we will briefly examine the proposed query languages for XML. Section 1 What is XML For the purposes of this paper we don't really care what XML is, we are really only interested in how we can apply database techniques to XML, and what opportunities 1 Brian Sabino CS 632 XML offers the database community. Therefore we will give only a brief introduction starting with XML basics then explaining DTD's and Namespaces. XML stands for Extensible Mark-up Language. XML like HTML is a subset of SGML (Standard Generalized Markup Language). Now lets spend a moment to untangle the alphabet soup. Markup Language just means that the text is "marked" with tags (usually symmetric) like the familiar <html></html>. XML is designed by the w3c to be a simple subset of the very complex SGML in the hopes that this simplicity will speed adoption. So how is XML different from HTML? HTML defines how a document is to be displayed, but it says nothing about the structure or contents of the document. XML on the other hand contains both display information and document structure information. XML allows you to define your own tags and use them to give structure to a document. For example you might write the following in HTML: <table><tr><td align=center> Panel on:<br> <b><font size=-1>Honey, I Shrunk the DBMS: Footprint, Mobility, and Beyond </font></b><br><font size=-1>Praveen Seshadri</font></td></tr> </table> While an equivalent piece of XML might look like: <Panel startTime=1400 endTime=1530><Title>Honey, I Shrunk the DBMS: Footprint, Mobility, and Beyond</Title> <Moderator> <firstName> Praveen </firstName> <lastName> Seshadri </lastName></Moderator></Panel> DTDs A DTD (Document Type Definition) can be thought of as a grammar defining the form of an XML document. A DTD looks like a series of regular expressions with the * (keen closure), +, ? (optional), and | operators. Continuing the above example we might define a DTD file conference.dtd as follows: <!ELEMENT conference (presentation | panel)+> 2 Brian Sabino CS 632 <!ELEMENT panel (title, moderators) > <!ELEMENT title (#PCDATA)> <!ELEMENT moderators (moderator)+> <!ELEMENT moderator (firstName,lastName)> … To finish the example we would add <!DOCTYPE conference SYSTEM "conference.dtd"> to the original XML document. A DTD provides a way to validate an XML document. A good example of this would be a company that created an "invoice.dtd". Whenever they received an invoice they could compare it to the DTD and determine if the invoice was well formed (used syntactically correct XML) and valid (adhered to the DTD). The DTD is of particular interest to us because it can be seen in some ways as defining a schema that the XML documents must adhere to. Namespaces Namespaces are a simple addition to XML to allow multiple people to use the same XML tags. For example instead of every conference defining their own tags they can instead use a common set of tags like so: <Moderator:Conference>. Meaning this is the moderator tag defined in the Conference namespace. Namespaces are important because they help insure a common format for similar XML documents. Unfortunately we cannot always ensure that similar documents use the same namespace, and thus determining if two similar documents are equivalent is a serious challenge for XML document processing. XSL XSL stands for eXtensible Stylesheet Language. This is a tree based language very similar to CSS for HTML[W3CXSL]. It provides methods for manipulating the display of an XML document, for example if you wanted to display different pieces of 3 Brian Sabino CS 632 one XML document in different places. As we will see later extensions to XSL are being looked at as a possible query language for XSL. Section 2 XML as Semi Structured Data Some people see XML as something of a solved problem. If you view XML solely as semi-structured data then you can simply apply the considerable research done on semi-structured data [Abi97, Bun97, PGMW95] to processing XML. An excellent example of this approach is LORE. Before examining LORE we will first look at what is the "de facto model for semistructured data"[Suc98] the OEM model. The OEM model The OEM model [PGMW95] treats semi-structured data as a labeled digraph with one unique root. Each node in this graph is either a built in data type (possible leaf) or a node with edges to other nodes (internal node). Each edge in this model represents a relationship between nodes. The example at right shows how the XML example above might be translated under the OEM model. A key point to note is that there is no schema that the data conforms to. Instead the data is in some sense "self describing" [MAG+97] Another important observation is that the data can be irregular. That is for example a panel might have zero or more moderators, or a moderator might have only a last name. The ability to effectively handle irregular data is an advantage of 4 Brian Sabino CS 632 the OEM model over the relational model, but as we will see this feature has interesting implications for queries over the data. LORE Lore (Lightweight Object REpository) is a DBMS built from the ground up to work with semi-structured data [MAG+97]. Lore is based on the OEM model in that it explicitly represents its data as a digraph. Lore has its own query language Lorel which will be discussed below in section 4. A unique feature of Lore is its use of Data Guides. A data guide is in essence a digraph summary of the OEM database with the property that every path (sequence of relations) in the database is represented in the data guide. For example a moderator node (from above) in the data guide will always have two edges (first name, last name) even though some moderator nodes in the database might have zero, one, or two of those edges. These data guides provide a concise graphical representation of the database which can be used for query creation. Drawbacks of XML as Semi-Structured Data There are two problems with using traditional semi-structured data techniques with XML. The first problem is schema replication [DFS99]. With this approach the schema is in some sense stored with every data item. Put another way, the schema is represented by the edges in the graph, and all edges are retained even if many subtrees have identical schema. This incurs a space cost for storing the redundant data, and a time cost for processing the replicated schema. The second and more important problem is that we can't use a traditional RDBMS with this model. 5 Brian Sabino CS 632 Section 3 Alternatives to Semi Structured Data So the big question is this: Do we have to throw out the relational model to effectively use XML? A very interesting answer comes in the form of STORED a query language over semi-structured data [DFS99] (Semistructured TO RElational Data). Stored The basic idea of STORED is this: use data mining techniques to find a good schema for the semi-structured data. Put all the data that fits the schema in a normal relational database, and put the rest in an "overflow graph". As long as the amount of data in the overflow graph is small we can store it however we like. Obviously the first challenge is schema extraction. STORED employs a new data mining algorithm over semi-structured data [WL98]. The basic idea of the algorithm is to find the largest and most common subtrees, and the most common paths in the OEM data representation. STORED then uses this information to construct appropriate relational tables. The second challenge is automatic overflow generation. A typical XML document may have attributes like <sum> that appear only once, and don't make sense in any sort of schema. In these cases an "overflow graph" is generated to capture any data not stored in the relational tables. This keeps the STORED representation lossless. STORED is a query language that takes in graph queries and rewrites them as relational queries, however it is of much more interest for it's ability to store XML data in a slightly modified RDBMS. 6 Brian Sabino CS 632 Section 4 Toward a Query Language At this point there is no clear consensus on a query language for XML [XML99]. There are however several possibilities: LOREL the query language for LORE, XQL, and XML-QL. Not surprisingly the LOREL approach is tree based with queries having the form attribute1.attribute2 = x. This approach has the advantage of being very general, but it is not suitable for the specific needs of an XQL query language. The same can be said for the other semistructured data querying methods including UnQL[BDHS96] and STRUQL. Microsoft's proposal XQL is basically an extension of XSL that allows for pattern matching of tags. However this proposal has been criticized as being too weak in that it does not allow sorting, joining, or output specifications[XML99]. Finally XMLQL [DFF+98] offers a more powerful language using both pattern matching and querying across multiple XML sources. Both XQL and XML-QL are under active consideration by the w3c, and many companies are waiting for a standard to emerge before fully adopting XML as a document standard [XML99]. Conclusion In the short run I think an approach like STORED, where a traditional relational schema is automatically extracted will be popular for dealing with XML data. Who wants to abandon the tried and true RDBMS when it can possibly be adapted to efficiently deal with these new types of data? I envision large database companies augmenting their products with "overflow graphs" like those in STORED in order to fully support XML. In the long run I think some very few semi-structured only databases such as LOREL will appear applications with very irregular data that cannot easily be mapped to a fixed schema. 7 Brian Sabino CS 632 Overall I think XML has a lot to offer the database community. XML represents a common data format, an opportunity to extend the database into the realm of document management and storage, and an opportunity to bring the database closer to the web. </XML> 8 Brian Sabino CS 632 [Abi97] Serge Abiteboul. Querying semi-structured data. Proceedings of the International Conference on Database Theory, p 1-18 1997 [BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. Aquery language and optimization techniques for unstructured data. Proceedings of ACMSIGMOD International Conference on Management of Data. p 505 - 516 1996 [Bun97] Peter Buneman. Tutorial: Semistructured data. Proceedings of ACM Symposium on Principles of Database Systems, p 117-121 1997 [DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. Xml-ql: A query language for XML, 1998 http://www.w3c.org/TR/NOTE-xml-ql/ [DFS99] Alin Deutsch, Mary Fernandez, Dan Suciu. Storing Semistructured Data with STORED [FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for a web-site management system. SIGMOD Record 26(3) p 4-11 1997 [MAG+97] J. McHugh, S. Abiteboul, R.Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54-66 1997 [PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. IEEE International Conference on Data Engineering, p251-260 1995 [PR98] STS Prasad and Anand Rajaraman Virtual Database Technology, XML, and the Evolution of the Web [Suc98] Dan Suciu Semistructured Data and XML. Proceedings of the International Conference on Foundations of Data Organization 98 [W3CXSL] http://www.w3.org/Style/XSL/ [WL98] Ke Wang and Huiqing Liu. Discovering typical structures of documents: a road map approach. ACM SIGIR Conference on Research and Development in Information Retrieval 1998 [XML99] QL98 http://www.xml.com/xml/pub/1999/03/quest/index2.html 9

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Section 2 XML as Semi Structured Data