Practical SAX

As a CMS vendor we live on the edge between unstructured text and structured data. And somehow, don't we all build a little CMS once in a while? We deal with beans and business logic as well as rich and plain ole text and of course they interact. This is a writeup of some my experiences with and interpretations of SAX.

Choosing the representation

On the delivery side we deal with practically read-only data. We want to deliver XML content a) fast and b) concurrently, yet dynamically transformed per request. Requirement a) hints at an in-memory data structure like DOM. Now DOM trees tend to use quite some memory but most surprisingly, a DOM tree is not necessarily thread-safe, not even for reading. Try hitting hard on a Xerces document with several threads and you'll be surprised. XML itself is too slow if you want to transform it before delivery. So in order to stay lean and thread-safe, we chose a binary stream encoding similar to what has been done for Fast Infoset (though our format never leaves the JVM). A "Markup" as we call it, is a frozen representation of SAX events that can be concurrently written onto a chain of ContentHandlers. Sizewise, with some naive compression in place (like using indices for repeated namespace uris, for example), it turns out, a Markup takes up about as much space as the original XML.

SAX and Namespaces

SAX namespace support comes in two flavours. Depending on the configuration of a SAX parser's "prefixes" option, it will deliver xmlns* attributes for namespace declarations and optionally qnames with prefixes. We settled on the "no-prefixes" default as it carries the least ambiguity. It turns out this is a big weakness of dealing with SAX. Most APIs that deal with SAX do not specify which namespace flavour they would like their events in. They go with the default (no prefixes) but expect qnames after all (since all parsers deliver them anyway). The correct way would be to deal with the XMLReader interface which offers the aforementioned configuration option, like the TraX SaxSource interface. Yet, J2SE's transformer won't ask a SaxSource for prefixes and will barf if you don't hand them anyway (One of my Mustang contributions fixed that). The SAXResult API doesn't even give you the chance to define what mode it will be operated in. All in all, this is a weak point of SAX. They should have settled for exactly one representation.


As we mix templated beans and rich text, we're mostly dealing with XML fragments as opposed to whole documents. SAX is not really well-defined in this area either. Common practice seems to be to send #startDocument and #endDocument anyway as it serves as init and stop event for serializers. Also, can your serializer deal with namespace declarations "around" the fragment? startPrefixMapping, startElement, endElement, startElement, endElement, endPrefixMapping?

Mixing SAX and character data

We want to interop with character-oriented output in both directions. We want to embed JSP-in-SAX-in-JSP. SAX in JSP is not too difficult, it's plain serialization, right? It's a little more complicated. Typically your JSP will declare the outer namespaces using plain text. Now when we serialize a fragment, we want to leverage this context, so our serializer must support *declaration* of namespaces (so it knows which prefixes to generate) without actually printing the xmlns attributes (already there). Nothing that the Apache or JDK serializer will do for us. Vice versa, while we're deep in a chain of filters piping SAX events around, we find that we want to embed another fragment that is rendered with a JSP. We don't want to parse the JSP's output into SAX; we want to simply embed it raw. SAX doesn't give us that feature, so we support an extended SAX protocol to transport chunks of "raw" character data. (The apache serializer actually supports something similar, through the #start/#endNonEscaping protocol.) So we have a really nice embedding option where the JSP writes onto a custom writer that actually sends chunks of non-escaped chars down the pipeline.


Although state machines can be difficult to handle, a little framework code can help you out. I find that a filter-based transformation approach works fine for use-cases like link rewriting. You can even easily build filters that listen to the stream and start processing at a certain XPath and call you back with the fragment below. And it's magnitudes faster than XSL. Not to speak of the amount of temporary memory allocated.


SAX is the foster child of the XML APIs. It always ranks as "also", "low-level", "if you need the speed". I was surprised about the number of ambiguities and bugs I ran into with seemingly straightforward tasks. Yet after some persistence, it's the ideal choice for us. It's fast, low-overhead and still gives us all the dynamics we need. I'm looking forward to some benchmarking on a T2000(?) coming to our house soon.

Posted by Matthias at 1:12.34. Comment: blog@mernst.org
Edited on: 2006-02-23 8:37.38


Is it so difficult?

Again and again, if you're life is too simple, make it complicated. Write a hundred times:

  R resource = acquireResource();
  try {
  } finally { 

Well at least it saved James Roberson's day ...

Posted by Matthias at 20:59.31. Comment: blog@mernst.org

Moot Challenge

Of course I should have been more specific. Actually this code does exactly what I want: the class to define extends another class loaded by the "parent" class loader. The invoking code will only refer to it through its superclass. A new classloader seems much cleaner to me than injecting those bytes into the parent class loader itself, for example like java.lang.reflect.Proxy. It makes the generated class eligible for collection, too, if it's no longer being used. Or is there some catch I'm missing here? How expensive is a class loader anyway? Can the VM deal with hundreds or thousands of class loaders? Will the PermSpace blow up? Gotta try...

A valid point raised by Eugene, though. Challenging the blogosphere (can't believe I just wrote this word) if you don't have comments is kind of questionable.

Posted by Matthias at 8:39.29. Comment: blog@mernst.org


The shortest way to define a class ...

  private static <T> Class<T> defineClass(ClassLoader parent, final String className, final byte[] bytes) {
final Class[] result = new Class[1];
new ClassLoader(parent) {{
  result[0] = defineClass(className, bytes, 0, bytes.length);
return (Class<T>)result[0];
Can you do it shorter?

Tom Hawtin can:

  private static <T> Class<T> defineClass(ClassLoader parent, final String className, final byte[] bytes) {
    return new ClassLoader(parent) {
      public Class defineClass() { return defineClass(className, bytes, 0, bytes.length); }
I was surprised but yes, the type of an anonymous class construction expression is the full anonymous type, not just the supertype: http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#15.9.1: The class being instantiated is the anonymous subclass. ... The type of the class instance creation expression is the class type being instantiated.
Posted by Matthias at 23:32.25. Comment: blog@mernst.org
Edited on: 2006-02-15 23:00.56


Compiled EL: an adventure in bytecode engineering

We have a running gag at work. When a webapp is slow to start, we'll say "ah it needs to compile the JSPs". Actually, nowadays, JSP compilation is pretty much useless. No one uses Java code in JSPs anymore. It is string literals, custom tags and expression language. Now the sad thing is, the EL is interpreted, too, so there's no need to compile JSPs anymore except for legacy compliance.

Interpreted EL turns me off. Too much reflection during evaluation, BeanInfo caching, ... Tomcat's JSP implementation doesn't even remember the parsed expression trees with a page instance, so they parse every expression anew, mitigated by yet another cache. A nice synchronization bottleneck.

So I've been intrigued by the idea of compiling EL to bytecode. Since I didn't want to wait for Gilad to bring out #invokedynamic (which is probably running in the lab already), I decided to go the strongly typed route. With generic type information, you actually have a fair chance to do proper typing for an expression without casting. Thus the compiler interface takes a type environment as follows:

  public static CompiledExpression compile(String expressionString, Map<String, ? extends Type> env);

  public interface CompiledExpression {
    Type getResultType();
    Object evaluate(Map<String,?> env) throws Exception;

The expression string is parsed with the Commons EL parser. For bytecode generation, the compiler draws all its information from the runtime environment; no searching/parsing of class files takes place. The java.lang.reflect.Type hierarchy holds all the required information. Resolving all wildcards and type variables correctly proved a fun exercise; the operators and type coercion are a bitch. So back to the fun part; I built a simple interactive toplevel which looks like follows:

string: class java.lang.String = string
map: interface java.util.Map[class java.lang.String, class java.lang.String] = {key=value}
> map.keySet.iterator.next
   0:     aload_1
   1:     ldc     #14; //String map
   3:     invokeinterface     #20,  2; //InterfaceMethod java/util/Map.get:(Ljava/lang/Object;)Ljava/lang/Object;
   8:     checkcast     #16; //class java/util/Map
   11:     invokeinterface     #24,  1; //InterfaceMethod java/util/Map.keySet:()Ljava/util/Set;
   16:     invokeinterface     #30,  1; //InterfaceMethod java/util/Set.iterator:()Ljava/util/Iterator;
   21:     invokeinterface     #36,  1; //InterfaceMethod java/util/Iterator.next:()Ljava/lang/Object;
   26:     checkcast     #38; //class java/lang/String
   29:     areturn

   type: class java.lang.String
  value: "key"
> map["key"].class.newInstance
   0:     aload_1
   1:     ldc     #14; //String map
   3:     invokeinterface     #20,  2; //InterfaceMethod java/util/Map.get:(Ljava/lang/Object;)Ljava/lang/Object;
   8:     checkcast     #16; //class java/util/Map
   11:     ldc     #22; //String key
   13:     invokeinterface     #20,  2; //InterfaceMethod java/util/Map.get:(Ljava/lang/Object;)Ljava/lang/Object;
   18:     checkcast     #24; //class java/lang/String
   21:     invokevirtual     #28; //Method java/lang/String.getClass:()Ljava/lang/Class;
   24:     invokevirtual     #34; //Method java/lang/Class.newInstance:()Ljava/lang/Object;
   27:     areturn

   type: ?  extends [class java.lang.Object]
  value: ""

As you can see, you can call methods; a nice add-on to EL is the ability to call methods with arguments. The compiler curries one argument a time; a curried call is typed as a Map:

> map.put["key"]
   type: interface java.util.Map[class java.lang.String, class java.lang.String]
  value: fun { {key=value}.put("key", $0) }

So Map.put is called with one argument, one remaining. This is represented by a special Map implementation that completes the call when #get is called. This part right now requires reflective method invocation at runtime. However, the Expression instance already contains the Method object to invoke and thus profits from java.lang.reflect's bytecode generation. I want to play with functional expressions, too, e.g. collection[map.get] should apply map.get to each collection element.

If you're interested to toy around with it, let me know.

Posted by Matthias at 1:05.42. Comment: blog@mernst.org

Code snippet: calling javap programmatically

This just came in handy. I'm doing some bytecode generation, using the javap classes to disassemble the generated code on the go. You need tools.jar in your classpath:

  private static Field JAVAP_C = null;
  static {
    try {
      JAVAP_C = JavapEnvironment.class.getDeclaredField("showDisassembled"); // :-(
    } catch (NoSuchFieldException e) {

  public static void javap(byte[] classBytes, String methodName) {
    JavapEnvironment env = new JavapEnvironment();
    try {
      if(JAVAP_C != null) JAVAP_C.set(env, true);
    } catch (IllegalAccessException e) {
      throw new RuntimeException(e);

    PrintWriter pw = new PrintWriter(System.out);
    JavapPrinter javapPrinter = new JavapPrinter(new ByteArrayInputStream(classBytes), pw, env);
    ClassData classData = new ClassData(new ByteArrayInputStream(classBytes));
    MethodData[] methods = classData.getMethods();
    for (int i = 0; i < methods.length; i++) {
      MethodData method = methods[i];

Example output:

   0:     iconst_3
   1:     ineg
   2:     invokestatic     #18; //Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   5:     areturn
Posted by Matthias at 24:15.57. Comment: blog@mernst.org


the j.u.logging default configuration

I don't mind the j.u.logging API. Its default configuration sucks hard, however. How to enable FINE output programmatically?
     ExpressionCompiler.logger.setLevel(Level.FINE); // you would think
     Logger.getLogger("").getHandlers()[0].setLevel(Level.ALL); // #!@#!@$
Why is that? $JRE_HOME/lib/logging.properties decides what goes to your console:
     # Limit the message that are printed on the console to INFO and above.
     java.util.logging.ConsoleHandler.level = INFO
See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4462908. Looks like I finally arrived in 2001...
Posted by Matthias at 10:13.48. Comment: blog@mernst.org