;; -*- mode: memo -*- Tutorial: XML toolkit V1.02 September 5, 2002 Makoto Onizuka (oni@acm.org) 1. What is XML toolkit? XML toolkit is a framework for a light-weight and high performance XML data stream processing. It is comprised from two fundamental libraries (XML TSAX parser, XPath processor) and a collection of simple XML processing commands (that correspond to plain text processing commands like cat, head, tail, sort, grep and so forth) built upon those libraries. Section 2 explains the XML toolkit commands and Section 3 explains the XML toolkit libraries and an example. 2. Let's look into the some details of XML toolkit commands. 2.1 What is merit of XML toolkit commands? There may be many types of data operation demand for XML data streams, like concatenation, filtering (selection), re-structuring, sorting, duplicate elimination, calculating aggregation, and so forth. XML toolkit commands satisfy these kinds of demands. You can also combine any of the XML toolkit commands using pipe. For example, cat book.txt article.txt | grep title | sort; can be replaced as xcat book.xml article.xml | xsort -c "/bib" -e "book" -k "title/text()"; Of course, as the data is written in XML format, one can devise more complex data operations like 'choose every first author in every book or article'. xcat book.xml article.xml | xhead -c "/bib/book" -e "author" -n 1 | xrun /bib/*; If the data is written in plain text, we need to do much more work than this using perl or combination of sed, awk, etc. How do we then compare XML toolkit commands with XSL syntax or XQuery? Although each toolkit command has only subset functions of XSL and XQuery, as described above, we can combine any of the toolkit commands, existing unix commands, and user-implemented data operation commands using UNIX pipe to realize complex data operations. 2.2 XML toolkit commands details 2.2.1 xcat a)usage: xcat [-b] [-t type-expr] xmlfile1 xmlfile2 ...; The -h option shows the above usage message. b)description: This command simply concatenates a given XML files (either file or stdin, either tokenized XML or plain XML) and outputs it to stdout as plain XML (default) or as tokenized XML (-b option). The tokenized XML (or we call "binary XML") is such a format that all tags (element and attribute) is tokenized as integer. The output XML can be a forest, meaning that there may be several root elements. The -t option enables an advanced tokenizer for an integer value with some constant strings like "$30" or "30 dollar". For example, xcat -b -t 'xpathExpr:PREFIX%iSUFFIX' tokenizes all text value whose prefix is "PREFIX" and suffix is "SUFFIX" that are matched with the given xpathExpr. %i indicates that the value except PREFIX and SUFFIX is a integer type. c)example: xcat article.xml book.xml > bib.xml; xcat -b article.xml book.xml > bib.xml; xcat -b -t '//price:$%i' article.xml book.xml > bib.xml; 2.2.2 xrun a)usage: xrun xpathExpr xmlfile; "xrun" without any parameters shows the above usage message. b)description This command inputs one XPath expression and one XML file (either tokenized XML or plain XML) and outputs a sub-part of the input XML that satisfies the given XPath expression. The result can be a forest. The restriction of the XPath expression is described in Section 4. c)example: xrun "/" xmlfile; xrun "/bib/book/title" xmlfile; xrun "//title/text()" xmlfile; xrun "//book/*/@year" xmlfile; 2.2.3 xsort More details are written in the PLANX paper. http://www.cs.washington.edu/homes/suciu/XMLTK/planx.ps a)usage: xsort [-binary | -b ] [(-memory | -m) mem-size] [(-type | -t) type-expr] [(-context | -c) xpath-expr] [(-element | -e) xpath-expr] [(-key | -k) xpath-expr]]] [xmlfile]; The -help, -h or -? shows the above usage message. b)description This command sorts target element (specified by an element XPath expression) under the context element (specified by a context XPath expression) using the key element (specified by a key XPath expression) for an XML input (stdin or file). All elements under the context are emitted except the elements satisfied by the -e XPath expression. The -b and -t options play the same role as that of the xcat. The -m option determines the memory size for in-memory sorting (beyond which sorting is conducted externally, i.e. on disk). c)example: xsort -c "/bib" -e "book" -k "title/text()" xmlfile; This sorts all "book" element under "bib" element using "title/text()" as key value. xsort -c "/bib" -e "book" -k "title" xmlfile; When you specify the key parameter as an element (not text()), then the concatenation of all text() under the key element is interpreted as a key value. xsort -c "/bib" -e "*" -k "@year" xmlfile; This sorts all elements under "bib" element using "@year" as a key value. The key value is always treated as string value (not as integer value). xsort -c "/bib/book" -e "author" -k "lastname/text()" -e "*" xmlfile; This sorts all "author" element under "book" element using "lastname/text()" as key value. In addition, all elements except "author" under the "book" element also are outputed. In this case, there are two -e XPath expressions for one context element. The xsort tries to check the first XPath expression, and if it fails then it tries to check the second XPath expression. So, the order of the -e XPath expression means a precedence. xsort -c "/bib" -e "book" -k "title/text()" -e "article -k "@year" xmlfile; This sorts all "book" element under "bib" element using "title/text()" as key value, and sorts "article" element under "bib" element using "@year" as key value. xsort -c "/bib/book" -e "author" -k "lastname/text()" -c "/bib/article" -e "author" -k "firstname/text()" xmlfile; This sorts all "author" element under "bib/book" element using "lastname/text()" as key value, and sorts "author" element under "/bib/article" element using "firstname/text()" as key value. In this case, there are two context XPath expressions. The xsort tries to check the first XPath expression, and if it fails then it tries to check the second XPath expression. So, the order of the -c XPath expression also means a precedence. 2.2.4 xtail a)usage: xtail [-help | -h | -? ] [-binary | -b ] [(-context | -c) xpath-expr [(-element | -e) xpath-expr [(-number | -n) number]]] [xmlfile]; The -help, -h or -? shows the above usage message. b)description This command filters the last specified number of target element (specified by an element XPath expression) under the context element (specified by a context XPath expression) for an XML input (stdin or file). All elements under the context are emitted except the elements satisfied by the -e XPath expression. The -b option plays the same role as that of the xcat. c)example: xtail -c "/bib" -e "//author" -n 2 xmlfile; xtail -c "/bib" -e "*" -n +1 xmlfile; xtail -c "/bib" -e "*" -n -1 xmlfile; 2.2.5 xstat a)usage: xstat xmlfile; b)description This command reports the number of elements, the maximum depth and the average depth of the input XML file. For example, if the input XML contains only one root element (like ), then the xstat outputs: c)example: xstat bib.xml; 2.2.6 file2xml a)usage: file2xml [-s start directory name] [-f file to direct output]; b)description This command collects file and directory information (name, path, size, permission, type, userid, gourpid, last access time, last modified time) under a specified start directory and packs it as XML format. For example, if "CVS" directory contains one "Root" file, file2xml outputs: CVS Root /homes/june/makoto/work/xmlbase/xmltoolkit/xmltk/file2xml/CVS/Root 33 -rwrr---- regular file 13750 330 Wed Nov 21 11:22:33 2001 Wed Nov 21 11:22:23 2001 c)example: file2xml -s . -f currentFile.xml; 2.2.7 createSindex a)usage: createSindex [-t threshHold] xmlfile; b)description This command creates a streaming index (SIX) for the input XML file. If the XML file's name is "xmlfile.xml", the index is stored at "xmlfile.six" in the same directory as the XML file. The "-t" option specifies the threshHold (byte) for the index entry deletion. Whenever there is a SIX file for its XML file, the XPath processor automatically uses the index to skip parsing certain part of XML. There are two type skips: [fail skip] When we receive a startElement SAX event and it is not what we need, we want to skip the whole element and go to the nest sibling element. For example, if we want to find a book element (/bib/book) and receive an article startElement event, then we skip the article element (Element/attribute based fail skip). Another example is that we want to find a book that is published at 1999 (/bib/book[@year='1999']) but receive a book start element event that is published at 2001 (Value based fail skip). [succeed skip] When we receive an endElement SAX event and we don't need to look into the following siblings, we can skip them all. For example, if we want to find a second book's first author (/bib/book[2]/author[1]) and receive an end element SAX event for the author element, then we can skip 2 depth to go to bib endElement (Element/attribute based succeed skip). Another 1 depth succeed skip example is that we have a key constraints on isbn of books and we found a book whose isbn matches our condition (/bib/book[isbn='XX123'], value based succeed skip). Some future work on the SIX is to enable skip parsing using key constraints, a schema specifying a order of sibling elements (e.g. DTD content model like (A, B*, C)), sorted elements, and existential filter condition. c)example: createSindex bib.xml; This creates a full streaming index (bib.six) for the bib.xml. createSindex -t 100 bib.xml; This creates a partial streaming index (bib.six) for the bib.xml whose entry is only for elements larger than 100 byte. 3. Let's look into the some details of XML toolkit libraries. 3.1 What is functionality? There are two libraries: one is a tokenized XML parser, and the other is XPath processor. The tokenized XML parser converts every element tag or attribute into an unique integer value (we implement as XTOKEN) and calls tokenized SAX events. For example, startElement('bib') can be tokenized as startElement(5) and the value '5' corresponds to 'bib'. We extends the Xmill's XML parser to implement our tokenized XML parser, as it parses XML data stream around ten times faster than the xerces XML parser. But it just supports ascii character code (it doesn't support unicode) and does not have DTD validation. The XPath processor combined with the above tokenized XML parser calls only such SAX events that satisfy a given collection of XPath expressions. In other words, it filters out any SAX events that does not satisfy any given XPath expressions. This is effective for such a case that you want to operate some particular part of the input XML data stream (not whole of the XML data). For example, if you want to make a title list of bibliography, then you need to register two XPath expression '/bib/book/title' as $x1 and '/bib/article/title' as $x2 (or simply one XPath expression '//title' or '/bib/*/title'). If the input XML data stream is Data on the web Serge Abiteboul Peter Buneman Dan Suciu
Processing XML Streams with Deterministic ... Todd J Green Makoto Onizuka Dan Suciu
then the XPath processor calls only the following call-back events: startContext($1) // for '/bib/book/title' startElement('title') character('Data on the web') endElement('title') endContext($1) startContext($2) // for '/bib/article/title' startElement('title') character('Processing XML Streams with Deterministic ...') endElement('title') endContext($2) You can find there are additional events 'startContext' and 'endContext'. These enable the applications to know which XPath expression satisfies which SAX events. 3.2 How does one develop a new application using XML toolkit library? If you would like to build a new application or command using XML toolkit library, you need to understand: 1) com-lite interface 2) main functions 3) handler functions (including SAX events) 4) library functions 3.2.1 com-lite interface TJ wrote this part already. See URL http://www.cs.washington.edu/homes/suciu/XMLTK/comlite.html The concept is that it utilizes the factory class design pattern and smart pointer. 3.2.2 main functions Here, I explain library functions using the xrun example. #include "xmltk.h" int main(int argc, char* argv[]){ if (argc != 3){ cout << "Usage: " << argv[0] << " xpathExpr xmlfile" << endl; return 1; } // initialize token table that convert string // (element, attribute, extented integer) to XTOKEN InitGlobalTokenTable(); // // before the parsing // try { IDfaFilter *g_pfilter = NULL; // create XML parser and XPath processor object // 1st parameter: Interface ID // 2nd parameter: result CreateDfaFilter(&IID_IDfaFilter, (void**)&g_pfilter); // register a given XPath to the XPath processor. // 1st parameter: Variable *, a variable for a parent XPath expression // 2nd parameter: char *, XPath expression // 3rd parameter: book, echo toggle (described at Section 3.4.1) // The XPath processor manages the construction/destruction of // the returned Variable *. Variable * g_pvRoot = g_pfilter->RegisterQuery(NULL, argv[1], true); // create an input file stream object // 1st parameter: char *, file path IFileStream *pstm = _CreateFileStream(argv[2]); // create an output file stream object // 1st parameter: bool, binary format or plain format ITSAXContentHandler *pchOut = CreateStdoutStream(false); // create a handler that implements call-back functions // 1st parameter: ITSAXContentHandler * myHandler *handler = new myHandler(pchOut); // register hander to XPath processor // 1st parameter: IDfaFilter * // 2nd parameter: user class derived from IFilteredTSAXHandler IUnknownCL_SetHandler(g_pfilter, handler); // // parse (call-back functions are invoked from the parse) // 1st parameter: IFileStream * // 2nd parameter: IDfaFilter * ParseUnknownStream(pstm, g_pfilter); // // after the parsing // ATOMICRELEASE deletes the parameter and set it as NULL; ATOMICRELEASE(pchOut); ATOMICRELEASE(pstm); ATOMICRELEASE(g_pfilter); } catch (_Error &err) { // print an error message err.perror(); } // cleanup the token table CleanupGlobalTokenTable(); return 0; } 3.2.3 Hanlder functions Similar to handling the SAX events, you need to implement your handler to receive events from the XPath processor (or XML parser) using ITSAXContentHandler or IFilteredTSAXHandler interface. ITSAXContentHandler interface defines SAX events and IFilteredTSAXHandler interface defines both SAX events and context events (startContext and endContext). class myHandler : public IFilteredTSAXHandler { public: // // *** IUnknownCL methods *** // CL_STDMETHOD is a macro to define a bool method // CL_STDMETHOD_ is a macro to define a method whose return // type is specified by the first parameter // CL_STDMETHOD(QueryInterface) (RCLIID riid, void **ppvObj); CL_STDMETHOD_(ULONG,AddRef) (); CL_STDMETHOD_(ULONG,Release) (); // // *** ITSAXContentHandler methods *** // CL_STDMETHOD(startDocument) (); CL_STDMETHOD(endDocument) (); CL_STDMETHOD(startElement) (XTOKEN xtName); CL_STDMETHOD(endElement) (XTOKEN xtName); CL_STDMETHOD(attribute) (XTOKEN xtName, char *pszChars, int cchChars); CL_STDMETHOD(characters) (char *pszChars, int cchChars); CL_STDMETHOD(cdata) (char *pszChars, int cchChars); CL_STDMETHOD(extendedint) (XTOKEN xt, int iInt); // // *** IFilteredTSAXHandler methods *** // CL_STDMETHOD(startContext) (Variable *pv); CL_STDMETHOD(endContext) (Variable *pv); myHandler(ITSAXContentHandler *pchOut) { m_pchOut = pchOut; m_pchOut->AddRef(); } virtual ~myHandler(){ ATOMICRELEASE(m_pchOut); } ITSAXContentHandler *m_pchOut; ULONG m_cRef; }; // // *** IUnknown methods *** // // RCLIID is a static value that indicates a class ID bool myHandler::QueryInterface(RCLIID riid, void **ppvObj) { if (IsEqualCLIID(riid, &IID_IUnknownCL) || IsEqualCLIID(riid, &IID_ITSAXContentHandler) || IsEqualCLIID(riid, &IID_IFilteredTSAXHandler)){ *ppvObj = static_cast(this); } else { *ppvObj = NULL; return false; } AddRef(); return true; } ULONG myHandler::AddRef(){ return ++m_cRef; } ULONG myHandler::Release(){ if (--m_cRef > 0) { return m_cRef; } delete this; return 0; } You need follow the com-lite style because the XML Toolkit class library is defined in com-lite style and you need to define your own class as a sub-class of those library classes. Thus, you need to implement all IUnknownCL methods in the way like above. You can implement anything you like for every SAX call-back function, ITSAXContentHandler and IFilteredTSAXHandler methods. You can also implement anything you like on the constructer and destructor of your handler. You may want to initialize your own data member, or may need to free your data member memory space. 3.2.4 library functions a) g_ptt: g_ptt is a global variable points to a global token table. char * StrFromXTOKEN(XTOKEN token); returns the original string of the "token". XTOKEN XTOKENFromStr(char * s, XST_TYPE); XST_TYPE=: (XST_ELEMENT | XST_ATTRIBUTE | XST_EXTENTEDINT) returns a token of the "s" as type "XST_TYPE". The XST_TYPE is eigther XST_ELEMENT, XST_ATTRIBUTE, or XST_EXTENTEDINT. b) Variable class: The Variable instance is returned by the IDfaFilter::RegisterQuery(). Variable * getParent(void); returns a variable for the parent XPath expression. void setUserSpace(void * s); You can set any pointer to your own memory space. void* getUserSpace(void); returns the pointer to the memory space. 3.3 Advanced feature of the tokenized XML processor 3.3.1 Input The input XML data can be either tokenized XML or ordinary plain text XML. The tokenized XML (or we call "binary XML") is such a format that all tags (element and attribute) is tokenized as integer. 3.3.2 Call-back event In addition to the same as SAX events (startDocument, endDocument, startElement, endElement, characters, cdata), there are additional and modified events. a) characters(char *pszChars, int cchChars); cdata(char *pszChars, int cchChars); Not like ordinary XML SAX parser, the pszChars is a pointer to the character data WITHOUT STRING TERMINATOR. So you need to know the length of the string. The ccChars tells you the length of the string. b) attribute(XTOKEN xtName, char *pszChars, int cchChars); Not like the ordinally SAX event interface, an attribute is independent from its element and "attribute" event is invoked for each attribute. c) extendedint(XTOKEN xt, int iInt); This event is a special case for character SAX event. If the user registers some extended integer using "RegisterType" method with a parameter in 'xpathExpr:PREFIX%iSUFFIX' format, then an extendedint event is invoked instead of a character event. d) context events: startContext(Variable *pv), endContext(Variable *pv) As decribed in Section 3.1, these enable the applications to know which XPath expression satisfies which SAX events. The Variable * is returned when the user register an XPath expression. 3.4 Advanced feature of the XPath processor 3.4.1 Echo toggle (ON/OFF) When the user register XPath expression, he/she needs to specify an echo toggle (ON/OFF) by the third parameter of "RegisterQuery" method. RegisterQuery(Variable * parentVariable, const char * xpath, bool echoToggle); If the toggle is ON (true), then both the context events and all the SAX events during the given XPath expression is satisfied are call-backed (as described in Section 3.1). If the toggle is OFF (false), then the only context events are invoked. None of SAX events are invoked. 3.4.2 Echo overwrite The echo toggle of an XPath expression can be overwritten by its child XPath expression. For example, if the user want to output all element except "title" element, then he/she realizes it as follows: Variable * v1 = g_pfilter->RegisterQuery(NULL, "/", true); Variable * v2 = g_pfilter->RegisterQuery(v1, "//title", false); This directs all SAX events invoked under "/" XPath expression unless "//title" XPath expression is satisfied. 3.4.3 Precedence Like xsort examples, the user sometimes need to use some conditional procedure: IF ... THEN ... ELSE IF ... THEN ... .... ELSE ... The XPath processor supports this functionality by introducing a precedence in XPath expressions. RegisterQuery(Variable * parentVariable, const char * xpath, bool echoToggle float precedence); The forth parameter of the "RegisterQuery" method indicates the precedence. When several XPath expressions become satisfied, the XPath processor only invokes startContext events with the highest precedence. Precisely to say, the XPath processor calculates the minimum precedence in the satisfied XPath expressions, and locates and invokes startContext event for variables with the minimum precedence. 4. Restrictions 4.1 The predicate expressions == supported == 1. The predicate needs to be written with an attribute or value. If the predicate is binary, it has to be [attribute operator value] order. e.g. //book[@year]/title, //book[@year=1999]/title 2. The position predicate can be used at any location step. e.g. /bib/book[2]/author[1] 3. Predicates can be concatinated. e.g. //book[2][@year=1999]/title[contains(text(),'XML')/text() == not supported (restrictions) == 4. The tail location step can not have the predicate written in the above, because the lazyDFA does not support output XML buffering. n.g. //book[@year] o.k. //book[@year]/title 5. DoubleSlash with position predicate is not supported. n.g. /bib/book//name[2] "book" has to control the increment/reset the number of "name". 6. Position predicate following another predicate is not supported. n.g. //book[@year=1998][2] o.k. //book[2][@year=1998] "@year" has to control the increment and "book" has to control the reset. 7. Element or "." is not supported in the predicate. n.g. //book[contains(.,'XML')]/title 8. Such XPath expressions that uses both "//" and any predicate does not work correctly. The lazyDFA does not output error message in this case. Precisely saying, such XPath expressions does not work correctly if it contains some predicate and it matches with prural XML fragment simaltaniously. For example, //a/a[1] matches two XML fragment in the following XML. <-- 1st context starts here <-- 1st context completed here. simaltaniously, 2nd context starts here. 123 123 456 456