[oclug] A set of Linux/XML/database/Python questions

Marcin Kolbuszewski marcink at magma.ca
Sun Feb 13 17:06:37 EST 2005


Hello,

I've just registered, and I hope this forum is appropriate for the questions I have.
If not, please gently tell me to go away...

Essentially, I want to set up a largish database that will be populated from XML files and
will have a web interface to put XML in, take XML out, and do some searching. I'd like as
much of the design as possible to be 'automatic'; where it has to be manual, I'd like it to
be in Python, because that is something I am familiar with.

Naturally, I do not want to reinvent the wheel, so I would appreciate your pointers and
advice. In some sense I know how to do what I want at the low level, i.e. how to
parse XML (SAX, DOM), how to build SQL, how to connect to a database from Python, how to
write HTML, but there must be better ways :-)

Below is a description of the problem:

First I will have approx. 100,000 XML files that look more or less like this. Each
contains roughly a dozen '<inner>' elements. So the total number
of '<outerelement>' 'objects' will be 100,000. In the final version I'd like to try with
roughly a million...

<outerelement>
 <outer1>AA</outer1>
 <outer2>BB</outer2>
 ...
 <outer10>CC</outer10>
 <data>10 kilobytes ascii here</data>
 <inner>
   <inner1>a</inner1>
   <inner2>b</inner2>
   <inner3>c</inner3>
 </inner>
 <inner>
   <inner1>aa</inner1>
   <inner2>bbb</inner2>
   <inner3>ccc</inner3>
 </inner>
 ...
 <inner>
   <inner1>aaaaaaaa</inner1>
   <inner2>bbbbbbbb</inner2>
   <inner3>cccccccc</inner3>
 </inner>
</outerelement>
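
For concreteness, a file of this shape could be read in Python along the following lines. This is only a sketch: it uses xml.etree.ElementTree rather than raw SAX or DOM, it assumes the '...' placeholders are not present in the real files, and the helper name parse_outer and the shortened SAMPLE text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for one real file (assumption: '...' lines removed).
SAMPLE = """<outerelement>
 <outer1>AA</outer1>
 <outer2>BB</outer2>
 <data>10 kilobytes ascii here</data>
 <inner>
   <inner1>a</inner1>
   <inner2>b</inner2>
   <inner3>c</inner3>
 </inner>
 <inner>
   <inner1>aa</inner1>
   <inner2>bbb</inner2>
   <inner3>ccc</inner3>
 </inner>
</outerelement>"""


def parse_outer(xml_text):
    """Split one <outerelement> into its scalar fields and its <inner> rows."""
    root = ET.fromstring(xml_text)
    outer = {}    # outer1..outerN and data, keyed by tag name
    inners = []   # one dict per <inner> block
    for child in root:
        if child.tag == "inner":
            inners.append({leaf.tag: (leaf.text or "") for leaf in child})
        else:
            outer[child.tag] = child.text or ""
    return outer, inners


outer, inners = parse_outer(SAMPLE)
```

The result is one dictionary of outer fields plus a list of dictionaries for the <inner> blocks, which maps naturally onto a parent table and a child table.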


I need to set up a database to hold it. The <inner> and <outer?> fields have
to be searchable, and '<data>' has to be free-text searchable, but only after narrowing the
amount of data to a sensible amount (<500). Performance is secondary.
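
To make the requirement concrete, here is a minimal sketch of one possible layout and of the two-step search. Everything in it is an assumption: SQLite is used only because it ships with Python, the table and column names (outerelement, inner_row) and the function name search are invented, only a few of the outer fields are shown, and '<500' is read as 'fewer than 500 matching records'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
CREATE TABLE outerelement (
    id     INTEGER PRIMARY KEY,
    outer1 TEXT,
    outer2 TEXT,
    data   TEXT              -- the ~10 KB ASCII payload
);
CREATE TABLE inner_row (
    id       INTEGER PRIMARY KEY,
    outer_id INTEGER NOT NULL REFERENCES outerelement(id),
    inner1   TEXT,
    inner2   TEXT,
    inner3   TEXT
);
-- Index the fields that are searched directly.
CREATE INDEX idx_outer1 ON outerelement(outer1);
CREATE INDEX idx_inner1 ON inner_row(inner1);
""")


def search(conn, outer1, inner1, text, max_candidates=500):
    """Narrow by indexed fields first, then scan <data> of the survivors."""
    rows = conn.execute(
        """SELECT DISTINCT o.id, o.data
           FROM outerelement AS o
           JOIN inner_row AS i ON i.outer_id = o.id
           WHERE o.outer1 = ? AND i.inner1 = ?
           ORDER BY o.id""",
        (outer1, inner1),
    ).fetchall()
    if len(rows) >= max_candidates:
        raise ValueError("narrow the search further before free-text matching")
    # Plain substring match on the narrowed set; no full-text index needed.
    return [row_id for row_id, data in rows if text in data]
```

Because the free-text step only ever sees the narrowed set, a plain substring scan is probably enough here, which fits with performance being secondary.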

So I need something to suggest a database schema, possibly create the database, parse the XML,
create the SQL, insert into the database, and help create a web search interface, all with the
least amount of effort.
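
As a rough idea of what the 'automatic' part could look like, the sketch below derives column names from one sample document, generates the CREATE TABLE and INSERT statements from them, and loads the document. It is an illustration under assumptions, not a finished tool: SQLite stands in for whatever database is chosen, every column is typed TEXT, tag names are assumed to be unique within their parent and not to clash with 'id' or 'outer_id', and all function and table names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Shortened stand-in for one real file.
SAMPLE = (
    "<outerelement><outer1>AA</outer1><outer2>BB</outer2><data>text</data>"
    "<inner><inner1>a</inner1><inner2>b</inner2></inner>"
    "<inner><inner1>aa</inner1><inner2>bbb</inner2></inner>"
    "</outerelement>"
)


def quote_ident(name):
    # Column names come from XML tags, so quote them as SQL identifiers.
    return '"' + name.replace('"', '""') + '"'


def infer_schema(root, child_tag="inner"):
    """Collect column names for the parent table and the child table."""
    outer_cols = [c.tag for c in root if c.tag != child_tag]
    inner_cols = []
    for inner in root.findall(child_tag):
        for leaf in inner:
            if leaf.tag not in inner_cols:
                inner_cols.append(leaf.tag)
    return outer_cols, inner_cols


def create_tables(conn, outer_cols, inner_cols):
    outer_defs = ", ".join(quote_ident(c) + " TEXT" for c in outer_cols)
    inner_defs = ", ".join(quote_ident(c) + " TEXT" for c in inner_cols)
    conn.execute(
        "CREATE TABLE outerelement (id INTEGER PRIMARY KEY, %s)" % outer_defs)
    conn.execute(
        "CREATE TABLE inner_row (id INTEGER PRIMARY KEY, "
        "outer_id INTEGER REFERENCES outerelement(id), %s)" % inner_defs)


def load(conn, root, outer_cols, inner_cols, child_tag="inner"):
    """Insert one document; values are always passed as parameters."""
    cur = conn.execute(
        "INSERT INTO outerelement (%s) VALUES (%s)" % (
            ", ".join(quote_ident(c) for c in outer_cols),
            ", ".join("?" for _ in outer_cols)),
        [root.findtext(c, default="") for c in outer_cols])
    outer_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO inner_row (outer_id, %s) VALUES (?, %s)" % (
            ", ".join(quote_ident(c) for c in inner_cols),
            ", ".join("?" for _ in inner_cols)),
        [[outer_id] + [inner.findtext(c, default="") for c in inner_cols]
         for inner in root.findall(child_tag)])
    return outer_id


root = ET.fromstring(SAMPLE)
outer_cols, inner_cols = infer_schema(root)
conn = sqlite3.connect(":memory:")
create_tables(conn, outer_cols, inner_cols)
load(conn, root, outer_cols, inner_cols)
conn.commit()
```

Only the identifiers are formatted into the SQL text; the field values go through '?' placeholders, so the 10 KB of ASCII in <data> needs no escaping by hand.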

If I were to do it myself, I'd either pay Oracle to do it or reinvent the wheel:
parse the files using SAX or DOM, manually code the creation of SQL, open the DB from Python,
run the SQL, etc.

So I'd appreciate pointers to resources that would help me do it as painlessly as possible.

Cheers,

Marcin Kolbuszewski           Email: marcink at magma.ca


