Divided No More
Teradata Streamed XML Shredding enables large document processing.
by Srinivas Pandrangi
Extensible markup language (XML) is a popular data format for application integration. Teradata XML Services extends the capabilities of the Teradata Database by providing functionality specific to handling XML-formatted data, which includes:
- XML Publishing Compose XML documents from SQL query results
- XML Shredding Populate Teradata Database with contents of XML documents
- XML Processing Validate, transform and query
The size and structure of XML documents vary considerably depending on their application. Teradata XML Services supports processing large XML documents through the streaming versions of the publishing and shredding capabilities. Streamed publishing [See "Bit by bit" at TeradataMagazine.com] allows users to publish large XML documents as a stream of XML fragments. This conserves memory on both the database node and the client system
Likewise, streamed shredding does the same for large XML documents whose contents are to be extracted and used for database updates. Since Teradata XML Services is implemented as a set of functions and stored procedures, the size of the documents that can be processed is limited by the memory available to each function.
In versions of Teradata XML Services before v1.1.7, large documents needed to be split into smaller documents using XMLSPLIT. Each of these then had to be individually shredded. Streamed shredding removes these limitations and provides improved performance by processing the document stream in a single pass, without the need to load the entire document into memory at one time.
XML document shredding involves two phases: a design phase, in which the hierarchical structure of the XML documents is mapped to the tabular structure of the relational database, and a runtime phase, when the mapping is applied to XML documents to extract their content for database updates. In streamed shredding, the design phase involves annotating the XML schema to which the document conforms. Runtime involves invoking the AS_SHRED_BATCH stored procedure using the annotated schema and the XML documents as parameters.
Schema annotation
The structure of XML documents is described by XML schemas. These are usually one or more documents conforming to the W3C XML Schema Definition Language that uses ".xsd" as a filename extension. The schema for XML documents is defined either by a standards organization (XBRL, FpML, ACORD, etc.) or internally by the user’s organization. If a schema is not available, one can be generated based on a sample XML document using a tool like Altova’s XMLSpy.
In the design phase, the user annotates the schema—essentially adding a comment to it—with the mapping instructions. The shredding annotations indicate which element and attribute values are to be extracted from an XML document stream, how to transform them and how they map to a relational schema.
Most of the mapping information is contained in the context element, an example of which is shown below: (The complete sample annotated schema can be found at sample.xsd.)

Click to enlarge
The context element sets the environment for shredding by specifying the default database and the encoding that will be used. The transaction element encapsulates the database operations that, as a result of shredding, will be done in a single transaction.
Operation types can be insert, update, delete and upsert. Each operation describes the tables that will be affected, with each column of the table being described by its name (the name attribute), data type (the sqltype element), the data item in the XML document structure it relates to (the ref and path attributes), and the transformations that need to be done on these data items before they are used in the database updates. The transformations are described in the sqlexpr element using any valid SQL expression. This puts the full expressive power of SQL at the user’s disposal.
In the example above, ‘:)’ is appended to a previously extracted value (Name_1) to derive the value for the Name column in the ocsxml table. Note that the Name_1 column is defined as a transient column to indicate that this is a temporary holding place for a value to be used in computing the value of another column. This temporary value will not be stored in the target ocsxml table. (For details of the mapping definition, download the Orange Book, Teradata XML Services version 1.1.7 at Teradata.com/t/at-your-service.)
Runtime
Users register the annotated schemas with the database using the REGISTER_SCHEMA stored procedure. Once a schema is registered, instead of passing around the schema documents in their entirety, users can refer to it by its uniform resource identifier (URI), a string that uniquely identifies that schema within the XML Services Schema Repository. At runtime, users will call a stored procedure, such as AS_SHRED_BATCH, and pass to it the XML documents to be shredded, along with the registration URI of the annotated schema. Below is an example of a document to be shredded using the described annotated schema:

Click to enlarge
After one or more sample documents, such as the one shown above, are loaded into a work table, the next step is to invoke the AS_SHRED_BATCH procedure, using:
call SYSXML.AS_SHRED_BATCH(‘select id, doc from xmlstaging’,‘annotated_po_schema.xsd’
, NULL, ‘poshredder’, NULL, rc)
The first parameter of the stored procedure is a SQL query that returns the documents to be shredded. The second parameter is the registered URI of the annotated schema, the third parameter is NULL and the fourth is the user who will perform the database updates that result from the shredding operations. This user should have the privileges necessary to execute all of the database operations defined in the annotated schema for the specified target tables.
Once the call to AS_SHRED_BATCH is executed, the XML documents in the work table are processed using a SAX parser (a streaming, event-driven API for parsing XML documents). As the parser in the XML document stream encounters each element or attribute, it is validated against the schema. If the annotation in the schema says that the data item is part of a database update operation, the data item is loaded into a temporary work table. After the entire document stream has been processed, set-based SQL operations are performed to update the target tables. Using the sample document and annotated schema from this article, the shredding would result in the insertion of the following row into the ocsxml table:
OCS Fictional Company:) | ADDRESS LINE 1 | | MELBOURNE | 03 99999999 |
03 99999999 | 99 999 999 | ADDRESS LINE 1 | E9999999 | 9999 | J BLOGGS | ST
High performance
The streamed shredding functions provide users with a high-performance method to extract data from large XML documents for input into the Teradata Database. These functions are ideal for situations where external systems provide large data extracts that need to be loaded into the enterprise data warehouse. Recent performance tests have shown more than a 60% improvement in performance from streamed shredding over other available methods. Along with streamed publishing, shredding provides a resource-efficient method for the Teradata Database to interact with data in XML format.
Srinivas Pandrangi is the architect of Teradata XML Services. He has more than 10 years of experience implementing industry standards, including those related to XML.
Editor's Note
For another article on Teradata XML Services, see "Speak the language" at TeradataMagazine.com.