[oclug] Open Source Oppositions Get Serious
rechlin at ncf.ca
Wed Oct 9 21:44:58 EDT 2002
At 11:36 PM 10/8/02 -0400, you wrote:
>On Tue, Oct 08, 2002 at 10:10:26PM -0400, Robert Echlin wrote:
> Using XSL, you can convert from XML to LaTeX (in fact, this is one of
>the main ways of transforming DocBook XML into PDF). Using standard LaTeX
>tools, you can easily convert to HTML (and probably XHTML, though if not, it
>would be trivial to support that standard). Saving as "XML format" is too
>ambiguous a requirement, but with a small amount of coding, any reasonable
>XML-based file format could be supported.
Do you mean XSL or XSLT?
I know that XSLT creates a DOM of the entire doc before applying the
transformation.
This is scalable for a small data set - say under 100K, or at a stretch up
to a couple of MB.
The problems with XSLT for large data sets are:
(please let me know if this description is incorrect)
a) It makes a minimum of two copies of the document in RAM - as I
understand it, you apply the transformations to one DOM to get another
DOM (or LaTeX doc) in RAM before outputting the second copy to disk.
b) It touches *all* the data 3 times
- disk to first DOM
- first DOM to second DOM
- second DOM to output (disk or dynamic web page or what have you)
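The difference shows up directly in code. Here's a toy sketch in Python
(purely illustrative - the point above is about XSLT processors in general,
not this snippet) of a DOM-style pass, which builds the whole tree in RAM,
versus a SAX-style streaming pass over the same document:

```python
import io
import xml.dom.minidom
import xml.sax

# A small made-up document with 1000 elements.
DOC = "<names>" + "".join(f"<n>{i}</n>" for i in range(1000)) + "</names>"

# DOM approach: the entire tree sits in RAM before we touch a single node.
dom = xml.dom.minidom.parseString(DOC)
dom_count = len(dom.getElementsByTagName("n"))

# SAX approach: elements are handed to us one at a time; nothing is retained.
class Counter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "n":
            self.count += 1

handler = Counter()
xml.sax.parse(io.BytesIO(DOC.encode()), handler)

print(dom_count, handler.count)  # both approaches see the same 1000 elements
```

Both give the same answer; the difference is that the SAX pass never holds
more than one event's worth of data at a time.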
And of course your XML to LaTeX to HTML solution adds another layer of
processing on top of that.
I prefer streaming tools for processing text data, such as OmniMark (plug)
or SAX, or even a streaming Perl program. A streaming program generally
touches the data once - more only for the bits that must be buffered while
waiting for data you don't have yet, such as when you are reversing the
order of, for instance, first and last name. If you want to put the total
page count on every page, put in placeholders and go over the output a
second time to insert it - it will still be 25%-40% faster than XSLT.
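The placeholder trick is easy to do in a streaming program. A minimal
sketch (Python rather than Perl, and the page format here is made up for
illustration):

```python
# First pass: stream the pages out, writing a placeholder where the
# total page count belongs, since we don't know the total yet.
PLACEHOLDER = "@@TOTAL@@"

def first_pass(pages):
    out = []
    for i, body in enumerate(pages, start=1):
        out.append(f"Page {i} of {PLACEHOLDER}: {body}")
    return out, len(pages)

# Second pass: one more sweep over the output to fill in the real count.
def second_pass(lines, total):
    return [line.replace(PLACEHOLDER, str(total)) for line in lines]

lines, total = first_pass(["intro", "body", "index"])
print(second_pass(lines, total))
# ['Page 1 of 3: intro', 'Page 2 of 3: body', 'Page 3 of 3: index']
```

Two passes over the data, but each pass is a straight stream - no tree
ever lives in RAM.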
Sample of streaming Perl processing:
60 HTML pages (288K total) in about 8 seconds on a 166 MHz machine.
That's the generated pages at http://magma.ca/~rechlin/fred/index.html (OK,
that's a plug)
Caveat - the source here is formatted ASCII, not XML or SGML.
That's fast enough that I can use the HTML output for the final edit and
make changes on the fly.
The template is stored in RAM, and parsed line by line, inserting variables
based on a cfg file that's in RAM of course, and on the source file name.
When the program gets to the spot where the source data goes into the page,
the source page is read line by line and simple HTML is inserted as required.
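For what it's worth, that template-and-substitute scheme looks roughly
like this (a sketch in Python rather than Perl; the CONTENT marker, the
cfg dict, and the HTML wrapping are all made up for illustration):

```python
# The template and cfg live in RAM; the source is streamed line by line.
TEMPLATE = [
    "<html><head><title>{title}</title></head><body>",
    "@@CONTENT@@",  # the spot where the source data goes into the page
    "</body></html>",
]
CFG = {"title": "Fred's pages"}  # would be loaded from a cfg file

def render(source_lines):
    for tline in TEMPLATE:
        if tline == "@@CONTENT@@":
            # Stream the source page, wrapping each line in simple HTML.
            for sline in source_lines:
                yield f"<p>{sline}</p>"
        else:
            # Insert variables from the cfg into the template line.
            yield tline.format(**CFG)

page = list(render(["line one", "line two"]))
print("\n".join(page))
```

Everything outside the source data is a dictionary lookup, so the cost per
page is dominated by a single read of the source file.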
OmniMark had one prospect who took home a copy of the OmniMark
command-line tool from a show, way back when it was free. They spent a
couple of days setting up a duplicate in OmniMark of what they were doing
in their usual tool, then turned it loose on their data.
That was over 2GB of data.
They normally took 8 hours to process it.
OmniMark finished it while they were at lunch. (45 minutes)
As I said to Greg, LaTeX and TeX sound like interesting tools. I've heard
of them before, and I think I need to try them. I would like to put in
place a better process for creating PDFs for our documentation at work. I
don't need PDFs for my personal stuff - yet.
But I certainly wouldn't use them for HTML.
(the long winded)
> //\ dmo at acm dot org
----- Robert Echlin, B. Eng. --
Read '101 ways to burn water' at: http://magma.ca/~rechlin/burn/
Fave site: http://www.HogwartsAlumni.com (bias alert)
Ottawa, Ontario, Canada