OLE DocStore

1. Overview

Document Store for OLE is a content management system with features such as checkin, checkout, versioning and locking for library records such as Bibliographic, Instance (Holdings and Items), Patron and License. Most of the records are in XML format, but the Document Store is format agnostic: it stores the content as is, without any type conversion. Indexing of the stored data is also supported for efficient search and retrieval. Although the Document Store is an independent system that comes with a basic UI for the supported operations, the majority of interaction comes from OLE, such as ingest of new records, editing of existing records, search and retrieval.

2. Core Technologies

Document Store for OLE uses Apache Jackrabbit 2.0 for content storage and Apache Solr 3.0 for indexing and searching. 

3. Architecture

The architecture was designed primarily around the need to store various document types and formats at high volume. Jackrabbit organizes content as a hierarchy of "items". An item can be a node or a property (which stores the actual value). For example, to store the first name, middle name and last name of a person in Jackrabbit, there would be a node for the person with first name, middle name and last name as properties that store the actual values. In comparison to a traditional RDBMS, nodes can be thought of as tables and properties as columns that hold the actual values.
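The node and property idea can be illustrated with a minimal Python sketch. This is not the Jackrabbit API; the Node class, the property names and the sample values are all hypothetical and only mirror the person example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified stand-in for a repository node: a name, properties and child nodes."""
    name: str
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        """Attach a child node and return it for convenience."""
        self.children.append(child)
        return child


# A "person" node whose properties hold the actual values (sample data is made up).
root = Node("root")
person = root.add_child(Node("person", properties={
    "firstName": "Ada",
    "middleName": "King",
    "lastName": "Lovelace",
}))

print(person.properties["lastName"])
```

The node only provides structure; the values live in its properties, which is the distinction the RDBMS comparison is pointing at.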

The architecture is flexible, i.e. at the time of setup implementors can specify the content hierarchy of the data. The default content hierarchy, however, has been set up assuming three levels, i.e. document category, document type and document format, as shown in the diagram below:

1 - Denotes the 1st level of content hierarchy under the root node, which represents document categories. Here it is "Work".

2 - Denotes the 2nd level which represents the document types such as "Bibliographic" and "Instance"

3 - Denotes the 3rd level which represents the document formats such as "MARC" and "Dublin Core" for Bibliographic records and "OLEML" (OLE Markup Language defined) for Instance records.
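The three default levels can be written down as a small Python sketch. The category, type and format names come from the text above; the slash-separated path layout and the function name are assumptions made for illustration, not the actual repository paths.

```python
# Default three-level hierarchy: category -> type -> formats (names from the text above).
DEFAULT_HIERARCHY = {
    "Work": {
        "Bibliographic": ["MARC", "Dublin Core"],
        "Instance": ["OLEML"],
    },
}


def format_path(category: str, doc_type: str, doc_format: str) -> str:
    """Return a slash-separated path for a known category/type/format, else raise ValueError."""
    types = DEFAULT_HIERARCHY.get(category, {})
    if doc_format not in types.get(doc_type, []):
        raise ValueError(f"unknown combination: {category}/{doc_type}/{doc_format}")
    return "/" + "/".join([category, doc_type, doc_format])


print(format_path("Work", "Bibliographic", "MARC"))
```

Because the hierarchy is plain data, an implementor who wants different categories or formats at setup time would only change the dictionary in this sketch.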

File System Analogy

The content hierarchy in the docstore can be thought of as a file system with folders and files. Wherever it says "nt:folder", think of it as a folder that can contain one or more folders and files. Wherever it says "nt:file", think of it as a file with actual content in it. For the MARC content hierarchy represented in the picture on the left, there can be between 1 and 1K folders at L1. Each MARC folder at L1 can in turn have between 1 and 1K MARC folders at L2. Each MARC folder at L2 can then have between 1 and 1K MARC folders at L3. Each MARC folder at L3 can then have up to 1K actual MARC files.


L1 - This denotes level 1 where you can have up to 1K nodes.

L2 - This denotes level 2 where you can have up to 1K nodes.

L3 - This denotes level 3 where you can have up to 1K nodes.

Total number of MARC records at L1

Jackrabbit recommends a small number of nodes per parent node for efficiency.

The total number of nodes (i.e. the resulting number of files) that can be accommodated at L3 is 1M (1K * 1K) and at L2 is 1000M (1000 * 1M), so at L1 we can expect to handle 1000M nodes or records.
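The arithmetic can be reproduced with a short Python sketch. It assumes one particular reading of the figures: that 1M is the number of files reachable from a single L2 folder (1K L3 folders, each with 1K files) and 1000M is the number reachable from a single L1 folder. The function and variable names are hypothetical.

```python
K = 1000  # "1K": the per-folder limit used in this section


def files_under(folder_levels_below: int, fanout: int = K) -> int:
    """Files reachable from one folder that has `folder_levels_below` folder levels beneath it.

    Every folder holds up to `fanout` children, and the lowest folders hold files.
    """
    return fanout ** (folder_levels_below + 1)


per_l3_folder = files_under(0)  # one L3 folder holds up to 1K files
per_l2_folder = files_under(1)  # 1K * 1K = 1M
per_l1_folder = files_under(2)  # 1000 * 1M = 1000M

print(per_l3_folder, per_l2_folder, per_l1_folder)
```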

Turning on versioning for a particular node results in the creation of ~4 more nodes to maintain version info, so we can expect roughly fourfold growth for data that has versioning turned on.

Instance Content Management & Metadata Information

The Instance content hierarchy is similar to the MARC content hierarchy explained above. Even though Instance data may be represented as one single document, within the docstore the content is broken down and stored in the respective folders and files as shown in the picture on the left. At L1 there can be between 1 and 1K OLEML folders. Each OLEML folder at L1 can in turn have between 1 and 1K further OLEML folders at L2. Each OLEML folder at L2 can then have between 1 and 1K Instance folders at L3. Each Instance folder can have one Instance file to store instance-specific content and a folder for Holdings. Each Holdings folder can contain one or more Item files with item-specific content. The Holdings folder further contains a Holdings file with holdings-specific information.
Metadata about each record, for instance "dateLastUpdated", "fastAddFlag" and "supressFromPublic", that can be part of the Instance/Holdings/Item records themselves is stripped out and added as properties on the node. This can be thought of as setting properties on a regular folder, like its name and any other attributes.
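The metadata handling described above can be sketched in Python using the standard library XML parser. The field names come from the text; treating them as direct child elements of the record, the sample record layout and the function name are assumptions, not the actual OLEML schema or DocStore code.

```python
import xml.etree.ElementTree as ET

# Field names taken from the text; treating them as direct child elements is an assumption.
METADATA_FIELDS = ("dateLastUpdated", "fastAddFlag", "supressFromPublic")


def split_metadata(record_xml: str):
    """Return (content_xml, properties): metadata child elements removed and collected in a dict."""
    root = ET.fromstring(record_xml)
    properties = {}
    for name in METADATA_FIELDS:
        # findall returns a list, so removing children while looping over it is safe.
        for child in root.findall(name):
            properties[name] = (child.text or "").strip()
            root.remove(child)
    return ET.tostring(root, encoding="unicode"), properties


# Hypothetical item record used only to exercise the function.
sample = (
    "<item>"
    "<barcode>12345</barcode>"
    "<fastAddFlag>false</fastAddFlag>"
    "<dateLastUpdated>2012-01-31</dateLastUpdated>"
    "</item>"
)
content, props = split_metadata(sample)
print(content)
print(props)
```

The remaining XML would be stored as the file content, while the collected dictionary would become properties on the corresponding node.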


 
L1 - This denotes level 1 with a parent node and corresponding child nodes.

L2 - This denotes level 2 with up to 1K nodes.

L3 - This denotes level 3 with up to 1K nodes.

Total number of Instance records at L1

Jackrabbit recommends a small number of nodes per parent node for efficiency.

The total number of nodes (i.e. the resulting number of files) that can be accommodated at L3 is 1M (1K * 1K) and at L2 is 1000M (1000 * 1M), so at L1 we can expect to handle 1000M nodes or records.

Turning on versioning for a particular node results in the creation of ~4 more nodes to maintain version info, so we can expect roughly fourfold growth for data that has versioning turned on.

Linking within the docstore is achieved by adding reference properties on the nodes themselves to indicate which node is linked to which other node. In Jackrabbit, linking is supported by a property type called "Reference". With regard to the linking of Bib and Instance records, there can be 1 - Many and Many - Many linking as depicted in the picture.

Green Dots - 1 Bib - 1 Instance
Blue Dots - Many Bibs - 1 Instance
Not shown - There can be many instances with 1 Bib as well

This is the general logical model for Instance records in the DocStore. It shows the following business rules:
One bib can link to multiple instances;
One instance may be linked by multiple bibs;
One instance has one holdings record and multiple items.
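The business rules above can be modelled with a small Python sketch. The class and field names are hypothetical, and recording the link on both sides is an assumption; the text only says that reference properties are added on the nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    uuid: str
    holdings: str                                  # exactly one holdings record per instance
    items: list = field(default_factory=list)      # multiple items are allowed
    bib_uuids: set = field(default_factory=set)    # references to the linked bibs


@dataclass
class Bib:
    uuid: str
    instance_uuids: set = field(default_factory=set)  # references to the linked instances


def link(bib: Bib, instance: Instance) -> None:
    """Record the link on both nodes (assumed here), using each record's UUID."""
    bib.instance_uuids.add(instance.uuid)
    instance.bib_uuids.add(bib.uuid)


# Made-up identifiers: two bibs share one instance, and one bib has two instances.
bib_a, bib_b = Bib("bib-a"), Bib("bib-b")
inst_1 = Instance("inst-1", holdings="holdings-1", items=["item-1", "item-2"])
inst_2 = Instance("inst-2", holdings="holdings-2")

link(bib_a, inst_1)
link(bib_b, inst_1)   # many bibs, one instance
link(bib_a, inst_2)   # one bib, many instances

print(sorted(inst_1.bib_uuids), sorted(bib_a.instance_uuids))
```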

 

 

4. Features (Available through RESTful services)

Document Store comes with two sets of RESTful service application programming interfaces (APIs). The first set consists of services for operations against the Document Store, which are as follows:

a. Ingest (Single file with one or more records) - Allows storing of documents in the Document Store. The input file has to conform to a standard schema.

b. Checkout (Requires the Universally Unique Identifier (UUID), created by Jackrabbit, of the document to be checked out) - Allows for checking out a single file from the Document Store.

c. Checkin (Similar to the Ingest) - Checks in a file with versioning.

d. Browse - To be able to look at the repository file count. 

e. Link (Requires UUIDs of two documents) - Allows two records to be linked to each other via their UUIDs.

The second set of services is for the discovery layer. Currently these are straightforward Solr APIs that are constructed based on the search criteria, examples of which will be provided in the technical section.
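As a rough illustration of constructing a Solr query from search criteria, the Python sketch below builds a URL for Solr's standard select handler. The host, the field names ("DocType", "Title") and the helper name are hypothetical; the actual DocStore index fields are not given here.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, criteria: dict, rows: int = 10) -> str:
    """Build a Solr select URL that ANDs together field:"value" clauses."""
    clauses = []
    for field_name, value in criteria.items():
        # Escape backslashes and double quotes so the value stays inside one quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field_name}:"{escaped}"')
    params = {"q": " AND ".join(clauses) or "*:*", "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8080/solr",                        # hypothetical host and path
    {"DocType": "bibliographic", "Title": "moby dick"},  # hypothetical field names
)
print(url)
```

With no criteria the sketch falls back to the match-all query `*:*`, which can be handy for checking that the index is reachable.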

5. DocStore/Discovery Service Contracts

6. Document Store Installation Guide

Operated as a Community Resource by the Open Library Foundation