[oclug] Open Source Oppositions Get Serious

Robert Echlin rechlin at ncf.ca
Wed Oct 9 21:44:58 EDT 2002


At 11:36 PM 10/8/02 -0400, you wrote:
>On Tue, Oct 08, 2002 at 10:10:26PM -0400, Robert Echlin wrote:
>  Using XSL, you can convert from XML to LaTeX (in fact, this is one of
>the main ways of transforming DocBook XML into PDF).  Using standard LaTeX
>tools, you can easily convert to HTML (and probably XHTML, though if not, it
>would be trivial to support that standard).  Saving as "XML format" is too
>ambiguous a requirement, but with a small amount of coding, any reasonable
>XML-based file format could be supported.

Do you mean XSL or XSLT?
As I understand it, XSLT builds a DOM of the entire doc before applying
any transformations.
Does XSL?
Building the whole tree in RAM scales for a small data set - say under
100K, or at a stretch up to a couple of MB.

The problems with XSLT for large data sets are:
  (please let me know if this description is incorrect)
a) It makes a minimum of two copies of the document in RAM - as I 
understand it, you apply the transformations to one DOM to get another 
DOM (or LaTeX doc) in RAM before outputting the second copy to disk.
b) It touches *all* the data 3 times
  - disk to first DOM
  - first DOM to second DOM
  - second DOM to output (disk or dynamic web page or what have you)

And of course your XML to LaTeX to HTML solution adds another layer of 
processing.
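
To make the tree-versus-stream distinction concrete, here is a minimal
sketch in Python (not XSLT itself). The sample document and the "para"
element name are made up for illustration. The first function builds the
whole tree in RAM before doing any work; the second uses a SAX handler,
which sees each element once as it goes by and keeps only a counter.

```python
import xml.dom.minidom
import xml.sax

# Made-up sample input, for illustration only.
SAMPLE = "<doc><para>one</para><para>two</para><para>three</para></doc>"


def count_paras_dom(xml_text):
    """Tree approach: parse the entire document into RAM, then query it."""
    dom = xml.dom.minidom.parseString(xml_text)
    return len(dom.getElementsByTagName("para"))


class ParaCounter(xml.sax.ContentHandler):
    """Streaming approach: react to each element as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Only the running count is kept; the element itself is not stored.
        if name == "para":
            self.count += 1


def count_paras_sax(xml_text):
    handler = ParaCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count
```

Both functions give the same answer for the same input. The difference is
only in how much of the document is held in memory while getting there.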

I prefer streaming tools for processing text data, such as OmniMark (plug), 
SAX, or even a streaming Perl program. In general, a streaming program 
touches the data once. The exception is data that has to be buffered until 
something later in the input arrives - for instance, when you are reversing 
the order of first and last name. If you want to put the total page count 
on every page, please put in placeholders and go over the output a second 
time to insert it - it will still be 25%-40% faster than XSLT.
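
The placeholder trick for the page count can be sketched as two streaming
passes. This is a Python illustration under stated assumptions, not the
OmniMark or Perl code referred to here: the placeholder string, the
"Page N of M" footer, and the function names are all hypothetical.

```python
import io

# Hypothetical marker; pick something that cannot occur in the real text.
PLACEHOLDER = "@@TOTAL_PAGES@@"


def first_pass(pages, out):
    """Stream the pages to `out` once, leaving a placeholder for the total.

    Returns the number of pages seen, which is only known at the end.
    """
    total = 0
    for number, body in enumerate(pages, start=1):
        out.write(f"{body}\nPage {number} of {PLACEHOLDER}\n")
        total = number
    return total


def second_pass(src, out, total):
    """Stream the first-pass output again, filling in the known total."""
    for line in src:
        out.write(line.replace(PLACEHOLDER, str(total)))


def number_pages(pages):
    """Run both passes using in-memory buffers (files would work the same)."""
    draft = io.StringIO()
    total = first_pass(pages, draft)
    draft.seek(0)
    final = io.StringIO()
    second_pass(draft, final, total)
    return final.getvalue()
```

Each pass reads its input front to back and holds only one line or page at
a time, so the data is touched twice but never loaded as a whole.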

Sample of streaming Perl processing:
60 HTML pages (288K total) in about 8 seconds on a 166 MHz machine.
That's the generated pages at http://magma.ca/~rechlin/fred/index.html (OK, 
that's a plug)
Caveat - the source here is formatted ASCII, not XML or SGML.
That's fast enough that I can use the HTML output for the final edit and 
make changes on the fly.

The template is held in RAM and processed line by line, filling in 
variables from a cfg file (also held in RAM, of course) and from the source 
file name. When the program reaches the spot in the template where the 
source data goes, the source file is read line by line and simple HTML is 
inserted as required.
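
A rough sketch of that template loop, in Python rather than the original
Perl. The "$name" variable syntax, the "<!-- BODY -->" marker line, and
the rule of wrapping each non-blank source line in a paragraph tag are
assumptions made for the example, not details of the actual program.

```python
import html
import io
import string

BODY_MARKER = "<!-- BODY -->"  # hypothetical marker for the insertion point


def render_page(template_lines, cfg, source_name, source_lines, out):
    """Write one HTML page to `out`, one template line at a time."""
    # Variables come from the config plus the source file name.
    variables = dict(cfg, filename=source_name)
    for line in template_lines:
        if line.strip() == BODY_MARKER:
            # Stream the source in here, emitting simple HTML per line.
            for src in source_lines:
                text = src.rstrip("\n")
                if text:
                    out.write(f"<p>{html.escape(text)}</p>\n")
        else:
            # safe_substitute leaves unknown $names untouched.
            out.write(string.Template(line).safe_substitute(variables))


def render_to_string(template_lines, cfg, source_name, source_lines):
    buf = io.StringIO()
    render_page(template_lines, cfg, source_name, source_lines, buf)
    return buf.getvalue()
```

Only the template and the config stay in memory. The source is consumed a
line at a time, so `source_lines` can just as well be an open file.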

Another sample:
OmniMark had one prospect who took home a copy of the OmniMark command-line 
tool from a show, way back when it was free, took a couple of days to set 
up a duplicate in OmniMark of what they were doing in their usual tool, and 
turned it loose on their data.

That was over 2GB of data.
They normally took 8 hours to process it.
OmniMark finished it while they were at lunch. (45 minutes)

As I said to Greg, LaTeX and TeX sound like interesting tools. I've heard 
of them before, and I think I need to try them. I would like to put in 
place a better process for creating PDFs for our documentation at work. I 
don't need PDFs for my personal stuff - yet.

But I certainly wouldn't use them for HTML.

Robert
(the long winded)

>-dave0
>--
>     ('>
>     //\  dmo at acm dot org
>     v_/_

----- Robert Echlin, B. Eng.    --
Read '101 ways to burn water' at: http://magma.ca/~rechlin/burn/
Fave site: http://www.HogwartsAlumni.com (bias alert)
Ottawa, Ontario, Canada



