Straddling Tables
…isn’t some new extreme interrogation technique at gitmo[1]; it’s just the title of a blog post about the commonalities between file-format features. The goal is to have one file, a Pilferpage, that can be dynamically converted into HTML, XUL, ODF, XSL-FO, Flex, CALS[2], DocBook, or CSV. Doing this involves seeing what can and cannot be converted between these formats. If one language doesn’t support bold text, or hierarchical table rows, then this may affect the design of the unifying Pilferpage file-format. First up, tables.
At a bare minimum Tables consist of a flat list of rows containing cells. They don’t necessarily contain headers or footers, or cell spans, or hierarchical rows. As it turns out, however, when you add a few more features Tables become a generalised model for both DataGrids and TreeViews, so (for the purposes of comparing tables) we’ll call all three of these Tables.
Hierarchical Rows

In HTML, XSL-FO, CSV, and ODF table rows are a one-dimensional list, whereas in XUL rows can be hierarchical (the Subject column, above). Hierarchical rows are a simple way of supporting multi-column TreeViews. HTML, DocBook, XSL-FO, ODF and CSV do not inherently support row hierarchy but you can fake this through formatting (style padding or indenting characters). In HTML you can make this faux-hierarchy faux-interactive with JavaScript. So hierarchical rows can be represented, and they seem like a useful feature for Pilferpage.
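As a rough sketch of the faux-hierarchy idea, here is one way hierarchical rows could be flattened into an ordinary list of rows with indenting characters in the first cell. The Row class, the flatten function and the sample data are all made up for illustration; none of this is an actual Pilferpage structure.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """A hypothetical hierarchical row: cell values plus child rows."""
    cells: list
    children: list = field(default_factory=list)


def flatten(rows, depth=0, indent="  "):
    """Yield flat rows, indenting the first cell once per nesting level.

    Assumes every row has at least one cell.
    """
    for row in rows:
        first, *rest = row.cells
        yield [indent * depth + first, *rest]
        yield from flatten(row.children, depth + 1, indent)


# Made-up sample data: a small message thread with a Subject column.
tree = [
    Row(["Inbox", "3"], [
        Row(["Re: tables", "1"], [Row(["Re: Re: tables", "0"])]),
        Row(["Hello", "2"]),
    ]),
]
flat_rows = list(flatten(tree))
for flat_row in flat_rows:
    print(flat_row)
```

For HTML the depth would more likely become style padding than literal spaces, but the traversal would be the same.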
Hierarchical Tables
All of the markup-language-based formats (HTML, XSL-FO, ODF, DocBook) support hierarchical tables, which leaves CSV as the one that doesn’t. For CSV you’d have to render different tables one after another (or, more sensibly, as separate CSV files). Provided there was a way of extracting tables from the page (perhaps with some URL parameters), CSV could use hierarchical tables too, so again this seems like a useful feature for pivot-tables and general data drilling.
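To make the “separate CSV files” option concrete, here is a minimal sketch that splits a table containing nested tables into one CSV text per table. The nested-list representation, the tables_to_csv function and the “see name.csv” placeholder are assumptions for the example, not anything Pilferpage defines.

```python
import csv
import io


def tables_to_csv(table, name="table"):
    """Return {name: csv_text} for a table and any tables nested in its cells.

    A table is a list of rows. A cell is either a string or a nested table
    (a list of rows). A nested table is written out under its own name and
    replaced in the parent by a placeholder pointing at that name.
    """
    out = {}
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    counter = 0
    for row in table:
        flat = []
        for cell in row:
            if isinstance(cell, list):
                counter += 1
                child = f"{name}-{counter}"
                out.update(tables_to_csv(cell, child))
                flat.append(f"see {child}.csv")
            else:
                flat.append(cell)
        writer.writerow(flat)
    out[name] = buf.getvalue()
    return out


# Made-up sample data: a sales table with a drill-down table in one cell.
nested = [
    ["Region", "Sales"],
    ["North", [["Month", "Sales"], ["Jan", "4"]]],
]
files = tables_to_csv(nested, "sales")
```

Each key in files could then be served as its own CSV download, for instance selected by a URL parameter.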
Subtables
Subtables however are not useful — these are typically an awkward way of encoding cell-spanning and row-spanning by declaring that multiple nested tables should be treated as a single unified table. I believe that Subtables will be deprecated in ODF 1.2.
Pilferpage won’t support them.
Headers and Footers
As well as providing visual cues, table headers and footers make cell data accessible to disabled people. A disabled person navigating a cell that reads “11%” may not be able to easily glance up the column in order to understand that it’s about “Elbow Growth”, but by explicitly encoding headers the table can be made practically usable to these people. Another benefit of clearly articulating the table header/footer is that software can reuse the data more easily. Most DataGrids expect explicit column headings, for example.
Column Headers and Footers
HTML, ODF, XSL-FO and Flex all support basic single-level column headers.
HTML, ODF, and XSL-FO support single-level column footers.
HTML and ODF support multi-level column headers and footers.
CSV doesn’t support table headers or footers. Some software infers headers by assuming that they’re in the first row.
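Python’s standard csv module is one example of software making that first-row assumption: csv.DictReader, when given no field names, takes them from the first row. The sample data below is made up.

```python
import csv
import io

# Made-up sample data; the first line is only a header by convention.
data = "Limb,Growth\nElbow,11%\n"

# With no fieldnames argument, DictReader treats the first row as headers.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows)
```

Nothing in the file itself says the first line is a header, so a CSV whose first row is really data would be misread.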
HTML uses table headings and footers to allow progressive loading of table data via the <thead>, <tbody>, and <tfoot> tags. The idea is that a browser may receive the table header, then the footer, and then the table body. The browser keeps filling out the table body as more and more data is streamed in. If browsers support this reliably then it would be useful to rearrange a Pilferpage in order to support this in HTML.
Cell Headers and Footers
There’s a difference between column headers and cell headers,

In the screenshot the yellow table cells mark a cross-section from a table that only shows the top-left and bottom-right portions. This technique is already popular, and to make this accessible they couldn’t use column headings — they’d need cell headings. So, by cell headings we mean that each cell references the appropriate cell headings rather than headings being implied by columns or rows.
HTML and ODF support cell headings. XSL-FO, CSV, and Flex do not.
Another example is the periodic table of elements, where its groupings could be expressed by cell headings. Note the background colours:

Heading Levels
Again, this is perhaps best described by a screenshot,

The ‘System’ heading is a grouping of the ‘Metal Parts‘ and ‘Wood Parts‘ headings. As well as the cells being headings the relationship between these three cells is expressed by encoding heading levels.
HTML and ODF support hierarchical cell headings by way of putting textual headings inside cell headings. XSL-FO, Flex, and CSV do not.
It seems that an ambiguity can occur when the heading levels are encoded using conventional text headings: if in HTML a cell heading contains H1 and H3 and another cell heading contains H2 then in which order should a screen-reader speak the headings? Because of this it seems that it would be better to encode cell headings per-cell, perhaps with a heading-level attribute.
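Here is a small sketch of how an explicit heading level could remove that ambiguity: each cell references its headings, each heading carries its own level, and the reading order falls out of a sort. The Heading class, the level field and the ids are hypothetical; the heading texts come from the screenshot described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heading:
    text: str
    level: int  # hypothetical heading-level attribute; 1 is the outermost


def spoken_headings(cell_heading_ids, headings):
    """Return the heading texts for one cell, outermost level first."""
    found = [headings[heading_id] for heading_id in cell_heading_ids]
    return [h.text for h in sorted(found, key=lambda h: h.level)]


headings = {
    "sys": Heading("System", 1),
    "metal": Heading("Metal Parts", 2),
    "wood": Heading("Wood Parts", 2),
}

# The cell lists its headings in arbitrary order; the level decides the order.
order = spoken_headings(["metal", "sys"], headings)
print(order)
```

Because the level lives on the heading cell rather than on H1/H2/H3 text inside it, the order no longer depends on how the text headings happen to be nested.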
Diagonal Table Headings
Diagonal table headings (or labels) are used to describe columns and rows in a more compact notation. Given a table of,

One would make a diagonal table heading of,

They’re popular in Asia, particularly in China and Japan. There seems to be some disagreement as to whether diagonal table headings are headings or labels but in both cases a user navigating cell data may want to access a description while browsing cell data. The distinction between heading or label may only be useful when browsing a table hierarchically, where headings (but not labels) would presumably group cell headings. Personally I’d say that they’re headings and not labels.
These Diagonal Table Headings can also be multi-level. In the following table I’ve coloured the headings,

And sometimes they even cram a title in there…

While in English the letters do appear cramped it’s not the same for the Chinese language,

(source: Diagonal Table Header Specification)
HTML, CSV, and Flex do not support diagonal table headings. ODF doesn’t support them yet, although there is a proposal to support diagonal table headings in OpenOffice.org (and ODF). Unfortunately no one seems to be making much progress here. Chinese versions of Microsoft Office and OpenOffice.org appear to support them (via UOF and .doc) but these aren’t part of ODF. So, how do we want diagonal headings in Pilferpage? Well, as there’s no output format that supports them it’s probably not worth the bother.
But it’s worth considering to ensure that it wouldn’t break our assumptions about table data-structures; that it can build upon our existing table heading hierarchies. With some simple rules it looks like Pilferpage could support this,
- Where there’s an attribute of diagonalHeading="right down" in the <cell> tag. Values are up/right/down/left.
- This same cell should be empty (only text nodes of whitespace in the cell).
- From this cell any cells along the axis (depending on the diagonalHeading) must span in that direction to the edges of the table.
If a Pilferpage table does this then we would have enough information to generate a diagonal table header.
These diagonal table headings can appear in any cell, so you can get diagonal table footers too. Perhaps then it should be an attribute of diagonalTitle or aggregateTitle where you could specify several formatting options.
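The three rules could be checked with something like the sketch below. It is only one reading of them: in particular it interprets “span in that direction to the edges of the table” as “every position from the diagonal cell to the table edge holds a non-empty cell”. The grid representation and the function name are assumptions.

```python
DIRECTIONS = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}


def diagonal_heading_ok(grid, row, col, diagonal_heading):
    """Check the three rules for the cell at grid[row][col].

    grid is a rectangular list of rows; each cell is a string, or None
    where no cell exists. diagonal_heading is a value such as "right down".
    """
    names = diagonal_heading.split()
    if not names or any(name not in DIRECTIONS for name in names):
        return False  # rule 1: only up/right/down/left are allowed
    if grid[row][col] is None or grid[row][col].strip():
        return False  # rule 2: the cell itself holds only whitespace
    for name in names:
        dr, dc = DIRECTIONS[name]
        r, c = row + dr, col + dc
        while 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            if not grid[r][c]:  # None or "" breaks the run of cells
                return False  # rule 3, as interpreted above
            r, c = r + dr, c + dc
    return True


# Made-up sample table with the diagonal heading in the top-left corner.
grid = [
    [" ", "Mon", "Tue"],
    ["AM", "1", "2"],
    ["PM", "3", "4"],
]
print(diagonal_heading_ok(grid, 0, 0, "right down"))
```

A diagonal footer would be the same check from a different corner, for example the bottom-left cell with "up right".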
Spanning Cells
HTML, ODF, and XSL-FO cells can be spanned down or right but not left or up or in arbitrary shapes (for example, ‘L’ shaped spans aren’t possible).
Flex and CSV do not support cell spanning.
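For formats without spanning, one option (a sketch, not a decision) would be to expand spans into a plain grid before output. The code below assumes cells are (text, colspan, rowspan) tuples, which is a made-up representation, and relies on spans only ever growing right and down.

```python
def layout(rows):
    """Map each grid position (row, col) to the text of the cell covering it.

    rows is a list of rows; each cell is (text, colspan, rowspan). A span only
    grows right and down from its origin, so every span is a rectangle.
    Overlapping spans are not detected in this sketch.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:  # skip slots taken by earlier row spans
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid


# Made-up sample: "A" spans two columns, "B" spans two rows.
rows = [
    [("A", 2, 1), ("B", 1, 2)],
    [("C", 1, 1), ("D", 1, 1)],
]
grid = layout(rows)
print(grid)
```

A CSV writer could then repeat the spanned value in each covered position, or leave all but the first blank.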
Style
For now I’m just going to declare these out of scope. It’s too big of a topic.
Request For Comment
I probably haven’t gotten everything right so please post comments and I’ll update the post accordingly. Cheers!
[1] It’s an old technique there. [2] Just kidding, no one uses CALS.


August 16th, 2008 at 3:58 pm
Flex does support Headers, although they are separated from the DataGridColumn object (weird, I know): http://livedocs.adobe.com/flex/201/langref/mx/controls/DataGrid.html
August 21st, 2008 at 8:02 am
Hey Matt,
This is a bit of a combined response to your previous posts on Pilferpage as a whole. From an information science (library) perspective, the idea of a single-source, higher-level language for producing multiple outputs is nothing new. When you ask “What would a source format that straddled all these formats look like?”, my thought has always been that a (purely) semantic XML format would be the answer. Moreover, I would expect the elements of this source format to be the natural union of all the target formats (ie. in order to support column headers in one of the target formats, they would have to be definable in the source format); although the argument could be made to use the intersection for simplicity or an initial version.
You may also find that part of the “separation of concerns” problems in HTML actually arise from the fact that pretty well all flavours of HTML are mixed semantic-style formats (eg. containing which is semantic, and <b>, <i>, or which are stylistic). There are plenty of legacy and human-factor reasons for this, but the net result in my opinion is that it confuses things — are you marking up document structure or style? We’re wrestling with much the same problem in Lemon8-XML, and how to balance users’ expectations of WYSIWYG editing with the need for them to actually edit in a WYSIWYM fashion.
I also think, having worked a long time ago with Cocoon and Popoon (and of course Docvert), that the XML/XSLT pipeline approach is a very powerful one, especially if you can piggyback on existing XSLT mappings (crosswalks) that have been written by others. In many cases, there is a “dumbing-down” (ie. loss of information) in these (eg. converting from, say, Docbook to HTML) but it sure beats writing everything from scratch. Unfortunately, I don’t have a good answer for an appropriate source format — I’ve tried to stick behind the NLM Journal Publishing DTD, if only because it is well-documented, well-used, and highly semantic (ie. very little style-based markup). Perhaps more importantly, it seems able to represent any type of complex document (including rich metadata); and there are already XHTML and XSL-FO XSLT available.
We’re about to spend some serious time sitting back and digging much deeper into this issue at PKP, since it has some major implications for L8X development, as well as some partnerships with publishers and epresses around the world. It would definitely be great to share your ideas and experience on these issues as we try to figure it out together.
MJ
August 21st, 2008 at 8:03 am
Nice. Wordpress didn’t escape my sample HTML tags. Sorry about that.
August 21st, 2008 at 8:09 pm
@Z-Bo: thanks for the info, I’m not particularly good at Flex so any advice on how to structure the output would be good.
@MJ: Heh, that’s ok — just fixed it.
I think our ideas on the format are pretty much aligned and of course the idea of making blog posts about it is to get feedback (if this was developed in isolation it wouldn’t get anywhere) and see how practical or useful this is.
I generally agree that we should keep style out of the source format. I suppose the one thing I am considering is reuse of some of the XSL-FO ideas of page regions (before, start, end, after, body) for webpages. It would be optional, but it would look something like this…
<page>
<top>header</top>
<left>menu</left>
<middle>body content</middle>
<right>sidebar</right>
<bottom>footer</bottom>
</page>
To effectively straddle XSL-FO you need to be able to do that, and the semantics work well (I think) for HTML, XUL and especially the headers and footers of ODT (ODF). Although these page regions have no semantic meaning I think they’re good approximations of the output format structures and it seems good to have a standardised taxonomy. An alternative would be to not enforce a naming convention and to expect the user to do some post-processing to arrange their nodes, eg. to have a page of <section id="top">header</section> …which comes through to HTML as… <div id="top" class="section">header</div> …but that seems to be a rather pointless abstraction/indirection.
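For what it’s worth, the fixed-region version is simple to map to HTML. Here is a rough sketch using Python’s standard xml.etree.ElementTree; the page_to_html function is hypothetical and it only copies plain text out of each region, dropping any nested markup.

```python
import xml.etree.ElementTree as ET

# The five region names from the <page> example above.
REGIONS = ("top", "left", "middle", "right", "bottom")


def page_to_html(page_xml):
    """Turn each page region into a <div> whose id is the region name."""
    page = ET.fromstring(page_xml)
    body = ET.Element("body")
    for region in page:
        if region.tag not in REGIONS:
            raise ValueError(f"unknown page region: {region.tag}")
        div = ET.SubElement(body, "div", {"id": region.tag, "class": "section"})
        div.text = region.text  # text only; child elements are ignored here
    return ET.tostring(body, encoding="unicode")


page_xml = (
    "<page>"
    "<top>header</top><left>menu</left><middle>body content</middle>"
    "<right>sidebar</right><bottom>footer</bottom>"
    "</page>"
)
html = page_to_html(page_xml)
print(html)
```

The same loop could emit XSL-FO regions instead, which is roughly the point of standardising the names.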
August 25th, 2008 at 1:57 pm
Good to see we’re heading in the same direction. But again, beware how hard-wired our brains are to think of things in terms of layout: the structure above (page / top / left … bottom etc) isn’t semantic at all, but purely layout description. This maps nicely to XSL-FO, but doesn’t say anything about the meaning of the content that goes in each region. That’s where it gets tricky: something defined semantically (“title”, for instance) could map to different layout elements based on user preferences, and not just output formats (ie. “title” could go into “page/top”, or “page/bottom”, or “head/title”, and so on).
If you’re going to define a generic layout container format (and this might be the most useful way to go), then all I would suggest is being clear up front about that aspect. One of our partners has put some work into using the ePub (http://en.wikipedia.org/wiki/Epub#IDPF / http://www.idpf.org/specs.htm) format for exactly this purpose, although I haven’t yet had a chance to evaluate its suitability. There seem to be at least a few competing standards, which might be the way to go rather than inventing another one from scratch.
August 16th, 2009 at 5:13 am
Can I suggest for the PilferPage table markup that you consider the idea of not creating container elements for rows, columns, pages, etc.? The alternative, to declare rows, columns, etc. independently of the cells, is actually quite attractive. So a table would be essentially a list of cells, each of which might fall into one or more rows, columns, sheets, hypersheets, etc., by reference rather than by containment.
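A minimal sketch of what I mean, with made-up names throughout: the table is just a list of cells, and each cell names the groups it belongs to. Rows and columns are then recovered by indexing rather than by nesting.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    text: str
    groups: tuple  # names of the rows, columns, sheets, etc. it belongs to


def index_by_group(cells):
    """Build {group name: [cell texts]} from a flat list of cells."""
    groups = defaultdict(list)
    for cell in cells:
        for name in cell.groups:
            groups[name].append(cell.text)
    return dict(groups)


# Made-up sample data; the "row:" and "col:" prefixes are just a convention.
cells = [
    Cell("Elbow Growth", ("row:1", "col:A")),
    Cell("11%", ("row:2", "col:A")),
    Cell("Knee Growth", ("row:1", "col:B")),
    Cell("7%", ("row:2", "col:B")),
]
by_group = index_by_group(cells)
print(by_group["row:2"])
```

A cell can belong to any number of groups this way, which containment cannot express without duplicating the cell.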