Brian Sabino
CS 632
<XML>
Introduction
It's estimated that today 90% of the world's data sits outside of any database
system, usually in the file system or a custom application [PR98]. Much of this data is
unstructured or only semi-structured and as such is not suitable for storage in today's
world of RDBMSs. XML holds the promise of a common interchange format, and, more
importantly from our perspective, a method for giving data enough structure that it can be
stored in a database. This will allow the database community to offer efficient storage,
transaction processing, and the ability to query data to areas where databases have
traditionally had little impact.
So the question now becomes how to deal with XML documents and data. Do
we need an entirely new data model? Can we adapt relational databases to work with
semi-structured XML data? And finally, what types of queries should we support, and in
what language?
This paper starts with a very brief introduction to XML in section 1. In section 2
we examine XML as traditional semi-structured data and give an introduction to
LORE. In section 3 we hear some opposing views and examine the possibility of
storing XML in a traditional RDBMS, with a look at STORED. Finally, in section 4 we
briefly examine the proposed query languages for XML.
Section 1 What is XML
For the purposes of this paper we don't really care what XML is; we are really
only interested in how we can apply database techniques to XML, and what opportunities
XML offers the database community. Therefore we will give only a brief introduction,
starting with XML basics and then explaining DTDs and Namespaces.
XML stands for Extensible Markup Language. XML, like HTML, is derived from
SGML (Standard Generalized Markup Language). Now let's spend a moment to untangle
the alphabet soup. Markup Language just means that the text is "marked" with tags
(usually symmetric) like the familiar <html></html>. XML was designed by the W3C to be
a simple subset of the very complex SGML in the hope that this simplicity will speed
adoption. So how is XML different from HTML? HTML defines how a document is to
be displayed, but it says nothing about the structure or contents of the document. XML,
on the other hand, describes the structure and contents of a document and leaves the
display to a separate stylesheet. XML allows you to define your own tags and use them
to give structure to a document. For example, you might write the following in HTML:
<table><tr><td align=center>
Panel on:<br>
<b><font size=-1>Honey, I Shrunk the DBMS: Footprint, Mobility, and Beyond
</font></b><br><font size=-1>Praveen Seshadri</font></td></tr>
</table>
While an equivalent piece of XML might look like:
<Panel startTime="1400" endTime="1530"><Title>Honey, I Shrunk the DBMS: Footprint,
Mobility, and Beyond</Title> <Moderator> <firstName> Praveen </firstName>
<lastName> Seshadri </lastName></Moderator></Panel>
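To make the difference concrete, here is a minimal sketch in Python of how a program can pull
structured fields out of the XML version, something the HTML version gives it no reliable way
to do. It uses the standard library's ElementTree parser on the panel example, with the
attribute values quoted as XML requires. The variable names are our own and only illustrative.

```python
import xml.etree.ElementTree as ET

# The panel example from the text, with attribute values quoted so it is well-formed.
PANEL_XML = """
<Panel startTime="1400" endTime="1530">
  <Title>Honey, I Shrunk the DBMS: Footprint, Mobility, and Beyond</Title>
  <Moderator>
    <firstName> Praveen </firstName>
    <lastName> Seshadri </lastName>
  </Moderator>
</Panel>
"""

panel = ET.fromstring(PANEL_XML)

# Tags name the data, so each field can be addressed directly by its path.
title = panel.findtext("Title").strip()
last_name = panel.findtext("Moderator/lastName").strip()
start_time = panel.get("startTime")

print(title)
print(last_name, start_time)
```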
DTDs
A DTD (Document Type Definition) can be thought of as a grammar defining the
form of an XML document. A DTD looks like a series of regular expressions with the *
(Kleene closure), + (one or more), ? (optional), and | (alternation) operators. Continuing
the above example, we might define a DTD file conference.dtd as follows:
<!ELEMENT conference (presentation | panel)+>
<!ELEMENT panel (title, moderators) >
<!ELEMENT title (#PCDATA)>
<!ELEMENT moderators (moderator)+>
<!ELEMENT moderator (firstName,lastName)> …
To finish the example we would add <!DOCTYPE conference SYSTEM
"conference.dtd"> to the original XML document. A DTD provides a way to validate an
XML document. A good example of this would be a company that created an
"invoice.dtd". Whenever they received an invoice they could check that it was well
formed (used syntactically correct XML) and then compare it to the DTD to determine
whether it was valid (adhered to the DTD). The DTD is of particular interest to us because
it can be seen in some ways as defining a schema that the XML documents must adhere to.
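The "regular expression" view of a DTD can be taken quite literally. The sketch below is not a
real DTD validator (Python's ElementTree does not validate against DTDs); instead the content
models of conference.dtd are translated by hand into regular expressions over the sequence of
child tag names. The names CONTENT_MODELS and is_valid are hypothetical, and we assume that
any element whose declaration is not listed (including those elided from the DTD above)
holds text only.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models. Each child is written as "name," so that a
# DTD sequence becomes plain regex concatenation over the child tag names.
CONTENT_MODELS = {
    "conference": r"(presentation,|panel,)+",
    "panel": r"title,moderators,",
    "moderators": r"(moderator,)+",
    "moderator": r"firstName,lastName,",
}

def is_valid(element):
    """Check an element and all of its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Assumption: unlisted elements are (#PCDATA), so they may not have children.
        return len(element) == 0
    children = "".join(child.tag + "," for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in element)

good = ET.fromstring(
    "<conference><panel><title>T</title><moderators>"
    "<moderator><firstName>A</firstName><lastName>B</lastName></moderator>"
    "</moderators></panel></conference>"
)
# Invalid: <moderators> must contain at least one <moderator>.
bad = ET.fromstring(
    "<conference><panel><title>T</title><moderators/></panel></conference>"
)

print(is_valid(good))  # True
print(is_valid(bad))   # False
```

Both documents are well formed, since the parser accepts them; only the first is valid with
respect to the content models.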
Namespaces
Namespaces are a simple addition to XML that allows multiple people to use the
same XML tags. For example, instead of every conference defining its own tags, they
can all use a common set of tags like so: <Conference:Moderator>, meaning this is
the Moderator tag defined in the Conference namespace. Namespaces are important
because they help ensure a common format for similar XML documents. Unfortunately
we cannot always ensure that similar documents use the same namespace, and thus
determining whether two similar documents are equivalent is a serious challenge for XML
document processing.
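The following sketch illustrates both halves of that point using Python's standard
ElementTree parser, which expands each prefixed tag into a {namespace}name pair. The
namespace URI is a made-up placeholder, not a real vocabulary. Two documents that bind
different prefixes to the same namespace end up with identical tag names, while a document
that uses no namespace at all does not match, even though a human would read all three as
"a moderator".

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used here purely for illustration.
CONF_NS = "http://example.org/conference"

doc_a = ET.fromstring(
    f'<Conference:Moderator xmlns:Conference="{CONF_NS}">Seshadri</Conference:Moderator>'
)
# Same namespace, different prefix: the prefix is only a local abbreviation.
doc_b = ET.fromstring(f'<c:Moderator xmlns:c="{CONF_NS}">Seshadri</c:Moderator>')
# No namespace: the tag looks similar but is a different name to the parser.
doc_c = ET.fromstring("<Moderator>Seshadri</Moderator>")

print(doc_a.tag)
same_vocabulary = doc_a.tag == doc_b.tag
unrelated = doc_a.tag == doc_c.tag
print(same_vocabulary, unrelated)
```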
XSL
XSL stands for eXtensible Stylesheet Language. It is a tree-based language that plays
a role for XML similar to that of CSS for HTML [W3CXSL]. It provides methods for manipulating
the display of an XML document, for example if you wanted to display different pieces of
one XML document in different places. As we will see later, extensions to XSL are
being looked at as a possible query language for XML.
Section 2 XML as Semi Structured Data
Some people see XML as something of a solved problem. If you view XML
solely as semi-structured data, then you can simply apply the considerable research done
on semi-structured data [Abi97, Bun97, PGMW95] to processing XML. An excellent
example of this approach is LORE. Before examining LORE we will first look at what is
the "de facto model for semistructured data" [Suc98], the OEM model.
The OEM model
The OEM (Object Exchange Model) [PGMW95] treats semi-structured data as a labeled digraph
with one unique root. Each node in this graph is either a value of a built-in data type (a
possible leaf) or a node with edges to other nodes (an internal node). Each edge in this
model represents a relationship between nodes. The example at right shows how the
XML example above might be translated under the OEM model. A key point to note is
that there is no schema that the data conforms to. Instead the data is in some sense "self
describing" [MAG+97]. Another important observation is that the data can be irregular.
For example, a panel might have zero or more moderators, or a moderator might
have only a last name. The ability to effectively handle irregular data is an advantage of
the OEM model over the relational model, but as we will see this feature has interesting
implications for queries over the data.
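As a rough illustration, the panel example can be written down as such a graph in a few lines
of Python. This is only a sketch of the idea, not the representation used by any OEM system:
the OemNode class is hypothetical, XML attributes are assumed to become ordinary labeled
edges, and the second moderator is invented to show irregular data.

```python
from dataclasses import dataclass, field
from typing import Union

Atom = Union[str, int, float]

@dataclass
class OemNode:
    """Either a leaf holding an atomic value, or an internal node with labeled edges."""
    value: Union[Atom, None] = None
    edges: list = field(default_factory=list)  # list of (label, OemNode) pairs

    def add(self, label, child):
        """Attach `child` under an edge named `label` and return the child."""
        self.edges.append((label, child))
        return child

    def children(self, label):
        """Return every node reachable through an edge with this label."""
        return [node for edge_label, node in self.edges if edge_label == label]

root = OemNode()
panel = root.add("Panel", OemNode())
panel.add("startTime", OemNode(1400))
panel.add("endTime", OemNode(1530))
panel.add("Title", OemNode("Honey, I Shrunk the DBMS: Footprint, Mobility, and Beyond"))
moderator = panel.add("Moderator", OemNode())
moderator.add("firstName", OemNode("Praveen"))
moderator.add("lastName", OemNode("Seshadri"))

# Irregular data needs no schema change: this (hypothetical) moderator
# has only a last name, and nothing has to be declared in advance.
second = panel.add("Moderator", OemNode())
second.add("lastName", OemNode("Doe"))

last_names = [n.value for m in panel.children("Moderator") for n in m.children("lastName")]
print(last_names)
```

Note that the labels live on the edges of each individual object, which is what makes the
data self describing.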
LORE
Lore (Lightweight Object REpository) is a DBMS built from the ground up to
work with semi-structured data [MAG+97]. Lore is based on the OEM model in that it
explicitly represents its data as a digraph. Lore has its own query language, Lorel, which
will be discussed below in section 4. A unique feature of Lore is its use of data guides.
A data guide is in essence a digraph summary of the OEM database with the property that
every path (sequence of edge labels) in the database is represented in the data guide. For
example, a moderator node (from above) in the data guide will always have two edges
(first name, last name) even though some moderator nodes in the database might have
zero, one, or two of those edges. These data guides provide a concise graphical
representation of the database which can be used for query creation.
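The path-summary idea can be sketched in Python as follows. This is a simplification under
stated assumptions, not Lore's actual construction: the data is tree shaped (no shared nodes
or cycles), each object is a list of (label, child) edges, and the function name
build_data_guide is our own. The second and third moderators are hypothetical irregular
entries.

```python
# Each object is a list of (label, child) edges; a leaf is any non-list value.
database = [
    ("Panel", [
        ("Title", "Honey, I Shrunk the DBMS"),
        ("Moderator", [("firstName", "Praveen"), ("lastName", "Seshadri")]),
        ("Moderator", [("lastName", "Doe")]),  # hypothetical: last name only
        ("Moderator", []),                     # hypothetical: no name at all
    ]),
]

def build_data_guide(obj, guide=None):
    """Merge every label path below `obj` into a nested dict, one entry per distinct path."""
    if guide is None:
        guide = {}
    if isinstance(obj, list):
        for label, child in obj:
            # setdefault reuses the existing entry, so repeated paths collapse into one.
            build_data_guide(child, guide.setdefault(label, {}))
    return guide

guide = build_data_guide(database)
print(guide)
```

The three moderator objects collapse into a single Moderator entry carrying both firstName
and lastName, even though only one moderator in the data has both.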
Drawbacks of XML as Semi-Structured Data
There are two problems with using traditional semi-structured data techniques
with XML. The first problem is schema replication [DFS99]. With this approach the
schema is in some sense stored with every data item. Put another way, the schema is
represented by the edges in the graph, and all edges are retained even if many subtrees
have identical schema. This incurs a space cost for storing the redundant data, and a time
cost for processing the replicated schema. The second and more important problem is
that we can't use a traditional RDBMS with this model.
Section 3 Alternatives to Semi Structured Data
So the big question is this: Do we have to throw out the relational model to
effectively use XML? A very interesting answer comes in the form of STORED
(Semistructured TO RElational Data), a query language over semi-structured data [DFS99].
STORED
The basic idea of STORED is this: use data mining techniques to find a good
schema for the semi-structured data. Put all the data that fits the schema in a normal
relational database, and put the rest in an "overflow graph". As long as the amount of
data in the overflow graph is small we can store it however we like.
Obviously the first challenge is schema extraction. STORED employs a new data
mining algorithm over semi-structured data [WL98]. The basic idea of the algorithm is
to find the largest and most common subtrees, and the most common paths in the OEM
data representation. STORED then uses this information to construct appropriate
relational tables.
The second challenge is automatic overflow generation. A typical XML
document may have elements like <sum> that appear only once and don't make sense in
any sort of schema. In these cases an "overflow graph" is generated to capture any data
not stored in the relational tables. This keeps the STORED representation lossless.
STORED is a query language that takes in graph queries and rewrites them as
relational queries; however, it is of much more interest for its ability to store XML data in
a slightly modified RDBMS.
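The split between relational tables and an overflow graph can be sketched in Python as
follows. This is a toy stand-in, not the STORED algorithm: in place of the data mining step
it uses a plain frequency threshold to choose the columns, the records and the names
split_storage, reconstruct, and min_support are all hypothetical, and it assumes no stored
value is None.

```python
from collections import Counter

# Hypothetical semi-structured records, each a mapping from label to value.
records = [
    {"firstName": "Praveen", "lastName": "Seshadri"},
    {"firstName": "Ada", "lastName": "Lovelace"},
    {"lastName": "Doe"},
    {"firstName": "Alan", "lastName": "Turing", "sum": 3},
]

def split_storage(records, min_support=0.5):
    """Keep labels found in at least `min_support` of the records as table columns;
    everything else goes into a per-record overflow store."""
    counts = Counter(label for record in records for label in record)
    columns = sorted(l for l, c in counts.items() if c / len(records) >= min_support)
    rows, overflow = [], {}
    for record_id, record in enumerate(records):
        rows.append((record_id,) + tuple(record.get(col) for col in columns))
        leftover = {l: v for l, v in record.items() if l not in columns}
        if leftover:
            overflow[record_id] = leftover
    return columns, rows, overflow

def reconstruct(columns, rows, overflow):
    """Rebuild the original records, showing that the split loses nothing."""
    rebuilt = []
    for record_id, *values in rows:
        record = {c: v for c, v in zip(columns, values) if v is not None}
        record.update(overflow.get(record_id, {}))
        rebuilt.append(record)
    return rebuilt

columns, rows, overflow = split_storage(records)
print(columns)
print(overflow)
print(reconstruct(columns, rows, overflow) == records)
```

Here the rare sum label ends up in the overflow store while firstName and lastName become
columns, with a NULL-like None for the moderator who has no first name.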
Section 4 Toward a Query Language
At this point there is no clear consensus on a query language for XML [XML99].
There are however several possibilities: LOREL (the query language for LORE), XQL, and
XML-QL. Not surprisingly the LOREL approach is tree based, with queries having the
form attribute1.attribute2 = x. This approach has the advantage of being very general,
but it is not suitable for the specific needs of an XML query language. The same can be
said for the other semistructured data querying methods, including UnQL [BDHS96] and
STRUQL [FFLS97]. Microsoft's proposal, XQL, is basically an extension of XSL that allows
for pattern matching of tags. However, this proposal has been criticized as being too weak in
that it does not allow sorting, joining, or output specifications [XML99]. Finally, XML-QL
[DFF+98] offers a more powerful language using both pattern matching and querying
across multiple XML sources. Both XQL and XML-QL are under active consideration
by the W3C, and many companies are waiting for a standard to emerge before fully
adopting XML as a document standard [XML99].
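To give a feel for the path-expression style of query, here is a minimal Python sketch that
evaluates an expression of the form attribute1.attribute2 = x over graph-shaped data. It
illustrates only the general idea and is not Lorel syntax or semantics: the functions follow,
evaluate, and matches are hypothetical, objects are lists of (label, child) edges, and the
second moderator is an invented irregular entry.

```python
# Each object is a list of (label, child) edges; a leaf is any non-list value.
database = [
    ("Panel", [
        ("Title", "Honey, I Shrunk the DBMS"),
        ("Moderator", [("firstName", "Praveen"), ("lastName", "Seshadri")]),
        ("Moderator", [("lastName", "Doe")]),  # hypothetical: last name only
    ]),
]

def follow(nodes, label):
    """Return all children reached from `nodes` through an edge with this label."""
    return [child
            for node in nodes if isinstance(node, list)
            for edge_label, child in node if edge_label == label]

def evaluate(root, path_expression):
    """Walk a dotted path such as 'Panel.Moderator.lastName' from the root."""
    nodes = [root]
    for label in path_expression.split("."):
        nodes = follow(nodes, label)
    return nodes

def matches(root, path_expression, x):
    """True if at least one object reached by the path equals x."""
    return x in evaluate(root, path_expression)

print(evaluate(database, "Panel.Moderator.lastName"))
print(matches(database, "Panel.Moderator.lastName", "Seshadri"))
print(evaluate(database, "Panel.Moderator.firstName"))
```

In this sketch a path that some objects lack simply contributes nothing to the result rather
than raising an error, which is one way irregular data shapes the behavior of queries.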
Conclusion
In the short run I think an approach like STORED, where a traditional relational
schema is automatically extracted, will be popular for dealing with XML data. Who
wants to abandon the tried and true RDBMS when it can possibly be adapted to
efficiently deal with these new types of data? I envision large database companies
augmenting their products with "overflow graphs" like those in STORED in order to fully
support XML. In the long run I think a very few semi-structured-only databases such
as LORE will appear for applications with very irregular data that cannot easily be mapped
to a fixed schema.
Overall I think XML has a lot to offer the database community. XML represents
a common data format, an opportunity to extend the database into the realm of document
management and storage, and an opportunity to bring the database closer to the web.
</XML>
[Abi97] Serge Abiteboul. Querying semi-structured data. Proceedings of the
International Conference on Database Theory, p 1-18, 1997
[BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query
language and optimization techniques for unstructured data. Proceedings of ACM-SIGMOD
International Conference on Management of Data, p 505-516, 1996
[Bun97] Peter Buneman. Tutorial: Semistructured data. Proceedings of ACM Symposium
on Principles of Database Systems, p 117-121, 1997
[DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A
query language for XML, 1998. http://www.w3c.org/TR/NOTE-xml-ql/
[DFS99] Alin Deutsch, Mary Fernandez, and Dan Suciu. Storing Semistructured Data with
STORED
[FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for a
web-site management system. SIGMOD Record, 26(3), p 4-11, 1997
[MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A
database management system for semistructured data. SIGMOD Record,
26(3), p 54-66, 1997
[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange
across heterogeneous information sources. IEEE International Conference on
Data Engineering, p 251-260, 1995
[PR98] STS Prasad and Anand Rajaraman. Virtual Database Technology, XML, and the
Evolution of the Web
[Suc98] Dan Suciu. Semistructured Data and XML. Proceedings of the International
Conference on Foundations of Data Organization, 1998
[W3CXSL] http://www.w3.org/Style/XSL/
[WL98] Ke Wang and Huiqing Liu. Discovering typical structures of documents: a road
map approach. ACM SIGIR Conference on Research and Development in
Information Retrieval, 1998
[XML99] QL98 http://www.xml.com/xml/pub/1999/03/quest/index2.html