<< Chapter < Page | Chapter >> Page > |
Indexing full-text in a relational database does not give optimum performance or results: in fact, the performance degradation could be described as exponential in relation to the size of the database. Keeping both advantages and disadvantages in mind, it was proposed that all REKn binary data be stored in a file system rather than in the database. File systems are designed to store files, whereas the PostgreSQL database is designed to store relational data. To get the two mixed together defeats the advantages of each. Moreover, in testing the proof of concept, users found speed to be a significant issue, with many unwilling to wait five minutes between operations. In its proof-of-concept iteration, the computing interaction simply could not keep pace with the cognitive functions it was intended to augment and assist. We recognized that this issue could be resolved in the future by recourse to high-performance computing techniques—in the meantime, however, we decided to reduce the REKn data to a subset, which would allow us to imagine and work on functionality at a smaller scale.
Having decided to store all binary data in a file system, we had to develop a standardized method of storing and linking the data, one that accounted for both linking the relational data to the file system data as well as keeping the data mobile (such as would allow migrating the data to a new server or distributing the files over multiple servers). Flexibility was also flagged as an important design consideration, since the storage solution might eventually be shared with many different organizations, each with their own particular needs. This method would also require the implementation of a search technology capable of performing fast searches over millions of documents. In addition to the problem posed by the sheer volume of documents, the variety of file types stored would require the employ of an indexing engine capable of extracting text out of encoded files. After a survey of the existing software tools, it was decided that Lucene was the perfect fit for our project requirements: it is an open source full-text indexing engine capable of handling millions of files of various types without any major degradation in performance, and it is extensible with plug-ins to handle additional file types should the need arise.
The question of how to go about harvesting data for REKn, or indeed any content-specific knowledgebase, turned out to be a question of negotiating with the suppliers of document collections for permission to copy the documents. Since each of these suppliers (such as the academic and commercial publishers and the publication amalgamator service providers) has structured access to the documents differently, scripts to allow for harvesting their documents had to be tailored individually for each supplier. For example, some suppliers provide an API to their database, others use HTTP, and still others distribute their documents via tapes or CDs of files. Designing an automated process for harvesting documents from suppliers could be accomplished by combining all of these different scripts together with a mechanism for automatically detecting the various custom access requirements and selecting the correct script to use.
Notification Switch
Would you like to follow the 'Online humanities scholarship: the shape of things to come' conversation and receive update notifications?