
Jan 8, 2013

XML processing in Java and reading XML data with a StAX reader



Q. What APIs does Java provide to process XML? What are the pros and cons of each, and when should you use which?
A. The main APIs are SAX, DOM, StAX, and JAXB.

SAX: Pros: Memory efficient and faster than the DOM parser. Good for very large files. Supports schema validation.
SAX: Cons: Read-only. It has no XPath support. You have to write more code to get things done because there is no object model mapping; you have to tap into the parser's events and build your own objects.

DOM: Pros: Simple to use, bi-directional (read and write), and supports schema validation. The parsed XML is kept in memory, which allows it to be manipulated and preserves element order. Supports CRUD operations.
DOM: Cons: Not suited to large XML files as it consumes more memory. You have to map the result to your own object model, because DOM exposes only a generic tree of nodes.

StAX: Pros: Gives you the best of SAX (i.e. memory efficiency) and DOM (i.e. ease of use). Supports both reading and writing. Efficient processing: a single thread can read several documents at the same time, and XML can also be processed in parallel on multiple threads. If you need speed, a StAX parser is usually a strong choice.
StAX: Cons: You have to write more code to get things done, and you have to get used to processing XML as a stream.
 
JAXB: Pros: Allows you to access and process XML data without having to know XML, by binding Java objects to XML through annotations. Supports both reading and writing, and is more memory efficient than DOM (as DOM is a generic model).
JAXB: Cons: It can only parse valid XML documents.

SAX, DOM, StAX, and JAXB are just specifications. There are many open source and commercial implementations of these specifications. For example, the JAXB API has implementations like Metro (the reference implementation, included in Java SE 6), EclipseLink MOXy, and JaxMe.

There are other XML-to-object mapping frameworks, such as XStream from ThoughtWorks and JiBX (very efficient; it uses bytecode enhancement).

StAX has the following implementation libraries: the reference implementation (http://stax.codehaus.org), the Woodstox implementation (http://woodstox.codehaus.org), and Sun's SJSXP implementation (https://sjsxp.dev.java.net/).

If you want to transform XML from one format to another, use TrAX. TrAX is based on XSLT, which is a rule-based language. A TrAX source document may be created via SAX or DOM. TrAX needs both Java and XSLT knowledge.
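As a rough illustration of a TrAX transformation, the sketch below applies a small XSLT stylesheet through the javax.xml.transform API. The stylesheet, the output element names (job, code), and the class name TraxExample are made up for this example; only the metadata element and its BatchCode attribute come from the snippet used later in this post.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TraxExample {

    // Hypothetical stylesheet: turns the BatchCode attribute of <metadata> into a <code> child of <job>.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/metadata'>"
        + "<job><code><xsl:value-of select='@BatchCode'/></code></job>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws TransformerException {
        // The rules live in the stylesheet; the Java side only wires source, rules and result together.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform("<metadata BatchCode='SOME_JOB_NAME' />"));
    }
}
```

The source here is a plain stream; a DOMSource or SAXSource could be passed to transform() instead when the input already exists in one of those forms.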



Q. What is the main difference between the StAX and SAX APIs?
A. The main differences between the StAX and SAX APIs are:

  • StAX is a "pull" API. SAX is a "push" API. 
  • StAX can do both XML reading and writing. SAX can only do reading.
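The second point can be sketched with the StAX XMLStreamWriter. This is a minimal example, assuming the same metadata element and attribute names as the snippet used later in this post; the class and method names are hypothetical.

```java
import java.io.StringWriter;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StaxWriteExample {

    // Hypothetical helper: writes a single <metadata> element with two attributes.
    static String writeMetadata(long batchJobId, String batchCode) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("metadata");
        // Attributes must be written straight after the start element they belong to.
        writer.writeAttribute("BatchJobId", Long.toString(batchJobId));
        writer.writeAttribute("BatchCode", batchCode);
        writer.writeEndElement();
        writer.close(); // flushes the writer; it does not close the underlying StringWriter
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeMetadata(17232674L, "CSHFR"));
    }
}
```

SAX has no counterpart to this; with SAX you would have to produce the output through some other API.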

SAX is a push style API, which means the SAX parser iterates through the XML and calls methods on the handler object provided by you. For instance, when the SAX parser encounters the beginning of an XML element, it calls startElement on your handler object. It "pushes" the information from the XML into your object. Hence the name "push" style API. This is also referred to as an "event driven" API.
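A minimal sketch of the push style, assuming a made-up jobs document and a hypothetical helper that just records element names. Note that the code never asks for the next event; the parser decides when startElement runs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPushExample {

    // Hypothetical helper: collects element names in the order the parser reports them.
    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // Called by the parser for every start tag it encounters.
                names.add(qName);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // parse() returns only after the whole document has been pushed through the handler.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<jobs><job id='1'/><job id='2'/></jobs>"));
    }
}
```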

StAX is a pull style API, which means that you have to move the StAX parser from item to item in the XML file yourself, just like you do with a standard Iterator or JDBC ResultSet. You can then access the XML information via the StAX parser for each such "item" encountered in the XML file.
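The pull style can be sketched with the cursor-based XMLStreamReader. The jobs document, the id attribute, and the helper name are assumptions made for the example. Because the loop belongs to the caller, it can stop as soon as it has what it needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxPullExample {

    // Hypothetical helper: returns the "id" of each <job> element, stopping after 'limit' matches.
    static List<String> firstJobIds(String xml, int limit) throws XMLStreamException {
        List<String> ids = new ArrayList<String>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            // The caller drives the parser: nothing is read until next() is called.
            while (reader.hasNext() && ids.size() < limit) {
                int eventType = reader.next();
                if (eventType == XMLStreamConstants.START_ELEMENT
                        && "job".equals(reader.getLocalName())) {
                    ids.add(reader.getAttributeValue(null, "id"));
                }
            }
        } finally {
            reader.close();
        }
        return ids;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstJobIds("<jobs><job id='1'/><job id='2'/><job id='3'/></jobs>", 2));
    }
}
```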

Q. Why do you need StAX when you already have SAX and DOM?

A. The primary goal of the StAX API is to give "parsing control to the programmer by exposing a simple iterator based API. This allows the programmer to ask for the next event (pull the event) and allows state to be stored in procedural fashion." StAX was created to address limitations in the two most prevalent parsing APIs, SAX and DOM.

Q. What is the issue with the DOM parser?
A. The DOM model involves creating in-memory objects representing an entire document tree and the complete data set state for an XML document. Once in memory, DOM trees can be navigated freely and parsed arbitrarily, and as such provide maximum flexibility for developers. However, the cost of this flexibility is a potentially large memory footprint and significant processor requirements, as the entire representation of the document must be held in memory as objects for the duration of the document processing. So this approach is not suited to larger XML documents.
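To make the trade-off concrete, here is a small sketch of DOM usage. The jobs document and the helper names are hypothetical. The whole tree is built before parse() returns, which is what makes random access and in-place edits easy, and also what makes large documents expensive.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeExample {

    // Parses the complete document into an in-memory tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: jumps straight to the last <job> and edits it in place.
    // Assumes the document contains at least one <job> element.
    static String markLastJobDone(Document doc) {
        NodeList jobs = doc.getElementsByTagName("job");
        Element last = (Element) jobs.item(jobs.getLength() - 1);
        last.setAttribute("status", "done");
        return last.getAttribute("id");
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<jobs><job id='1'/><job id='2'/></jobs>");
        System.out.println("Marked job " + markLastJobDone(doc));
    }
}
```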


Q. What is the issue with a "push" parser?
A. Pull parsing provides several advantages over push parsing when working with XML streams:
  • With pull parsing, the invoking application controls the application thread, and can call methods on the parser when needed. By contrast, with push processing, the parser controls the application thread, and the client can only accept invocations from the parser.
  • Pull parsing libraries can be much smaller and the client code to interact with those libraries much simpler than with push libraries, even for more complex documents.
  • Pull clients can read multiple documents at one time with a single thread.
  • A StAX pull parser can filter XML documents such that elements unnecessary to the client can be ignored, and it can support XML views of non-XML data.
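The point about reading multiple documents with a single thread can be sketched as follows. The two tiny documents and the class name are made up for the example. The loop takes one event from each reader in turn, which is possible only because the caller, not the parser, decides when the next event is read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxInterleaveExample {

    // Hypothetical helper: alternates between two documents on the calling thread
    // and records each start element together with the document it came from.
    static List<String> interleave(String xmlA, String xmlB) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
        XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
        List<String> seen = new ArrayList<String>();
        try {
            while (a.hasNext() || b.hasNext()) {
                if (a.hasNext()) {
                    advance(a, "A", seen);
                }
                if (b.hasNext()) {
                    advance(b, "B", seen);
                }
            }
        } finally {
            a.close();
            b.close();
        }
        return seen;
    }

    // Pulls exactly one event from the given reader.
    private static void advance(XMLStreamReader reader, String label, List<String> seen)
            throws XMLStreamException {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
            seen.add(label + ":" + reader.getLocalName());
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(interleave("<a><x/></a>", "<b><y/></b>"));
    }
}
```

With a push parser, each parse() call would run to completion before the next could start, so the same interleaving would need one thread per document.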


Now that you know why a StAX parser is useful, here is an example of using the StAX API to read an XML snippet. The XML snippet is shown below:

 
<metadata BatchJobId='17232674' ParentBatchJobId='17232675' BatchCode='SOME_JOB_NAME' />


Now the StAX parser code:

package com.myapp.item.reader;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Iterator;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.myapp.model.BatchJobMeta;

public class CashForecastJobMetaStaxReader {

 private final static Logger logger = LoggerFactory.getLogger(CashForecastJobMetaStaxReader.class);

 private static final String BATCH_JOB_ID = "BatchJobId";
 private static final String PARENT_BATCH_JOB_ID = "ParentBatchJobId";
 private static final String BATCH_CODE = "BatchCode";
 

 public BatchJobMeta readBatchJobMetaInfo(String input) {

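  // First create a new XMLInputFactory and wrap the input string as a stream.
  // UTF-8 is named explicitly so the bytes do not depend on the platform default charset.
  XMLInputFactory inputFactory = XMLInputFactory.newInstance();
  InputStream in = new ByteArrayInputStream(input.getBytes(java.nio.charset.StandardCharsets.UTF_8));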
  BatchJobMeta item = null;
  try {
   XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
   item = new BatchJobMeta();

   while (eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();

    if (event.isStartElement()) {
     StartElement startElement = event.asStartElement();
     @SuppressWarnings("unchecked")
     Iterator<Attribute> attributes = startElement.getAttributes();
     while (attributes.hasNext()) {
      Attribute attribute = attributes.next();
      // Compare on the local name so that a namespace does not break the match
      String name = attribute.getName().getLocalPart();
      if (BATCH_JOB_ID.equals(name)) {
       item.setBatchJobId(Long.valueOf(attribute.getValue()));
      } else if (PARENT_BATCH_JOB_ID.equals(name)) {
       item.setParentBatchJobId(Long.valueOf(attribute.getValue()));
      } else if (BATCH_CODE.equals(name)) {
       item.setBatchCode(attribute.getValue());
      }
     }
    }
   }

  } catch (XMLStreamException e) {
   logger.error("Failed to read batch job metadata", e);
  }

  return item;

 }

}



Finally, the JUnit test class that tests the above XML reader code.

 

package com.myapp.item.reader;

import org.junit.Assert;
import org.junit.Test;

import com.myapp.model.BatchJobMeta;

public class CashForecastJobMetaStaxReaderTest {
 private static final String META_SNIPPET = "<metadata BatchCode=\"CSHFR\" BatchJobId=\"17232674\" ParentBatchJobId=\"17232675\" />";

 @Test
 public void testReadBatchJobMetaInfo() {
  CashForecastJobMetaStaxReader staxReader = new CashForecastJobMetaStaxReader();
  BatchJobMeta readBatchJobMetaInfo = staxReader.readBatchJobMetaInfo(META_SNIPPET);

  Assert.assertNotNull(readBatchJobMetaInfo);

  Assert.assertEquals("Failed on Job Id", 17232674L, readBatchJobMetaInfo.getBatchJobId());
  Assert.assertEquals("Failed on Parent Job Id", 17232675L, readBatchJobMetaInfo.getParentBatchJobId());
  Assert.assertEquals("Failed on batch code", "CSHFR", readBatchJobMetaInfo.getBatchCode());
 }

}

