MARC XML inconsistently saved to ole_ds_bib_t.content column

Notes

I have noticed that MARC XML is saved inconsistently to the content column of the ole_ds_bib_t table. Some records have text such as the following before the XML, i.e. the XML declaration (preamble):

<?xml version="1.0" encoding="UTF-8"?>

I am thinking that this text should be removed if it is present in the incoming XML. All the data is stored as UTF-8 in the database, so it is redundant to have it there. Also, when you fetch the data using something like JDBC, it comes back as a string, which is already decoded Unicode text rather than UTF-8 bytes, so the encoding declaration no longer describes it. It is also needlessly taking up space.
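As a rough illustration of the removal step, here is a minimal Python sketch. The function name strip_xml_declaration is hypothetical and is not part of OLE; the sketch assumes the incoming XML is already available as a decoded string.

```python
import re

# Matches an XML declaration at the very start of the text, allowing for an
# optional byte order mark and leading whitespace. Hypothetical helper, not OLE code.
_XML_DECL = re.compile(r'^\ufeff?\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(marc_xml: str) -> str:
    """Return marc_xml without a leading XML declaration, if one is present."""
    return _XML_DECL.sub('', marc_xml, count=1)
```

Input that has no declaration is returned unchanged, so the same call could be applied to every incoming record.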

I have also noticed that the XML is not consistently formatted. I looked at a few records and some had whitespace formatting while others did not. It looked like the ones with the XML preamble listed above did not include whitespace. It would be good to format the XML consistently before storing it to the database, either with whitespace formatting or without. Formatting with whitespace has the advantage that it is easier to read and debug if you are doing ad hoc queries using SQL. Formatting without whitespace has the advantage that it would save a lot of space on disk.
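To make the two formatting options concrete, here is a small Python sketch using the standard library's xml.dom.minidom. The names normalize_marc_xml and _drop_indentation are hypothetical. The sketch assumes that whitespace-only text between elements is indentation and can be discarded, while text inside leaf elements such as leader or subfield is data and is kept as-is.

```python
from xml.dom import minidom


def _drop_indentation(node):
    """Remove whitespace-only text nodes that sit between child elements."""
    has_elements = any(c.nodeType == c.ELEMENT_NODE for c in node.childNodes)
    for child in list(node.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            _drop_indentation(child)
        elif (has_elements and child.nodeType == child.TEXT_NODE
                and not child.data.strip()):
            node.removeChild(child)


def normalize_marc_xml(marc_xml: str, pretty: bool = True) -> str:
    """Return the root element serialized in one consistent style.

    pretty=True indents with two spaces; pretty=False emits no extra whitespace.
    Serializing the root element (not the document) omits the XML declaration.
    """
    root = minidom.parseString(marc_xml).documentElement
    _drop_indentation(root)
    if pretty:
        return root.toprettyxml(indent="  ").strip()
    return root.toxml()
```

Whichever style is chosen, running every record through the same function before saving would make the stored text uniform.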

Also, would it be possible to eliminate the <collection> element? The content will never contain a collection of <record> elements. It will only ever have one. So, it is confusing and a waste of space to include the collection element.

I would recommend XML such as the following with no preamble and formatted with whitespace.

<record xmlns="http://www.loc.gov/MARC21/slim">
...
</record>

The code would need to be modified to handle either collection or record as the top-level element, for backward compatibility with existing rows.
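A backward-compatible reader could look roughly like the following Python sketch. The function name extract_record is hypothetical, and the sketch assumes that existing rows wrap exactly one record in a collection element in the MARC21 slim namespace.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
RECORD_TAG = f"{{{MARC_NS}}}record"
COLLECTION_TAG = f"{{{MARC_NS}}}collection"


def extract_record(marc_xml: str) -> ET.Element:
    """Return the single record element, whether or not a collection wraps it."""
    root = ET.fromstring(marc_xml)
    if root.tag == RECORD_TAG:
        # New-style content: the record is the top-level element.
        return root
    if root.tag == COLLECTION_TAG:
        # Old-style content: unwrap the one record inside the collection.
        records = root.findall(RECORD_TAG)
        if len(records) != 1:
            raise ValueError(f"expected exactly 1 record, found {len(records)}")
        return records[0]
    raise ValueError(f"unexpected top-level element: {root.tag}")
```

Callers then work with the record element only and do not need to know which form was stored.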

Assignee

Dale Arntson

Reporter

Jon Miller

Labels

None

Priority By Function

4 - Describe

Solr Version

None

Work Group

None

Process & Sub-Process

None

Parent Jira

None

Co-Assignee/s

None

Due By

None

Contribution

No

Contributing Developer

None

Contributing Institution

None

Contribution Type

None

Value Proposition

None

Components

Fix versions

Priority

Critical