I have noticed that MARC XML is saved inconsistently to the content column of the ole_ds_bib_t table. Some records have the XML declaration (the preamble) before the XML, such as the following:
<?xml version="1.0" encoding="UTF-8"?>
I think this text should be removed if it is present in the incoming XML. All the data is stored as UTF-8 in the database, so the declaration is redundant. Also, when you fetch the data using something like JDBC, it comes back as a string that is already decoded Unicode text rather than UTF-8 bytes, so the encoding attribute no longer applies. It also takes up space needlessly.
I have also noticed that the XML is not consistently formatted. I looked at a few records: some had whitespace formatting and others did not. It looked like the ones with the XML preamble listed above did not include whitespace. It would be good to format the XML one way before storing it in the database, either with whitespace formatting or without. Formatting with whitespace has the advantage that the XML is easier to read and debug when you are running ad hoc SQL queries. Formatting without whitespace has the advantage that it would save a lot of space on disk.
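To illustrate the idea, here is a minimal sketch of the normalization step in Python. It is not the OLE code (which is Java); the function names strip_declaration and normalize are hypothetical, and the sketch assumes the record has no mixed content and no namespace prefixes that need to be preserved.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_declaration(xml_text):
    """Remove the XML declaration if present; otherwise return the text unchanged."""
    return _XML_DECL.sub('', xml_text, count=1)


def normalize(xml_text, pretty=True):
    """Return the XML with no declaration and one consistent whitespace style.

    pretty=True  -> indented, easier to read in ad hoc SQL queries
    pretty=False -> compact, smaller on disk
    """
    root = ET.fromstring(strip_declaration(xml_text))

    # Drop existing layout whitespace so the result does not depend on
    # how the incoming XML happened to be formatted.
    for el in root.iter():
        # Only clear text on container elements; leaf text is real data.
        if len(el) > 0 and el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None

    if pretty:
        ET.indent(root, space="  ")  # requires Python 3.9+

    # encoding="unicode" returns a str and emits no XML declaration.
    return ET.tostring(root, encoding="unicode")
```

Running every incoming record through one function like this before the insert would make the stored format independent of the source, whichever of the two whitespace styles is chosen.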
Also, would it be possible to eliminate the <collection> element? A stored record will never contain a collection of <record> elements; it will only ever have one. So it is confusing and a waste of space to include the collection element.
I would recommend XML such as the following with no preamble and formatted with whitespace.
For backward compatibility, the code would need to be modified to handle a top-level element of either collection or record.
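The backward-compatible read could look something like the following Python sketch. Again, this is only an illustration and not the OLE implementation; extract_record and local_name are hypothetical names, and the sketch assumes a stored document holds exactly one record, with or without the MARC 21 slim namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the element name without any '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def extract_record(xml_text):
    """Return the single <record> element from old- or new-style content.

    Old style: <collection><record>...</record></collection>
    New style: <record>...</record>
    """
    root = ET.fromstring(xml_text)
    name = local_name(root.tag)

    if name == "record":
        # New style: the record is already the top-level element.
        return root

    if name == "collection":
        # Old style: unwrap the one record inside the collection.
        records = [child for child in root if local_name(child.tag) == "record"]
        if len(records) != 1:
            raise ValueError(
                "expected exactly one record, found %d" % len(records))
        return records[0]

    raise ValueError("unexpected top-level element: %s" % name)
```

With a check like this on the read path, existing rows could stay as they are while new rows are written without the collection wrapper.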