Ideally, such endeavors will prompt a reassessment of the initial exclusion criteria for knowledgebase materials. The increasing number of books published and republished in electronic format, for example, means that including monograph-length studies of the Sonnets is no longer a task so onerous as to be prohibitive. Large-scale digitization projects such as Google Books and the Internet Archive, meanwhile, are making a growing number of books, both old and new, available in digital form.
We recognized that the next stages of our work would be predicated on the ability to create topic- or domain-specific knowledgebases from electronic materials. The work thus pointed to the need for a better Internet resource discovery system: one that allowed topic-specific harvesting of Internet-based data, returned results pertinent to targeted knowledge domains, and integrated with existing collections of materials (such as REKn) operating in existing reading systems (such as PReE), so that existing tools could be brought to bear on the results. To investigate this further, we collaborated with Iter, a not-for-profit partnership created to develop and support electronic resources to assist scholars studying European culture from 400 to 1700 CE. On the mandate, history, and development of Iter, see Bowen (2000, 2008); for a more detailed report on this collaborative experiment, see Siemens et al. (2006).
We thought we could use technologies like Nutch, along with models from other more complex harvesters such as DataFountains and the Nalanda iVia Focused Crawler (see also Mitchell 2006), to create something that would suit our purposes and be freely distributable and transportable among our several partners and their work. In using such technologies, we also hoped to explore how best to exploit representations of ontological structures found in bibliographic databases to ensure that the material returned via Internet searches was reliably on-topic.
The underlying method for the prototype REKn Crawler is straightforward: an Iter search returns bibliographic (MARC) records, which in turn provide the metadata (author, title, subject, and so on) used to seed a web search, the results of which are returned to the knowledgebase. The original corpus is thus complemented by a collection of web pages related to the same subject. While not all of these materials will be directly relevant, many may still prove useful.
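To make the flow concrete, the Python sketch below illustrates the seeding step as just described. It is a minimal illustration, not the Crawler's actual code: the record structure, the web_search callable, and the example record are our own assumptions, standing in for whatever harvester (such as Nutch) actually issues the queries.

```python
# Illustrative sketch only: how one bibliographic record's metadata might
# seed a web search whose results feed the knowledgebase.

def seed_query(marc_record: dict) -> str:
    """Build a search query from the metadata of one bibliographic record."""
    parts = [
        marc_record.get("author", ""),
        marc_record.get("title", ""),
        " ".join(marc_record.get("subjects", [])),
    ]
    return " ".join(p for p in parts if p).strip()

def harvest(marc_record: dict, web_search) -> list:
    """Seed a web search with one record; return pages for the knowledgebase."""
    query = seed_query(marc_record)
    # web_search stands in for the actual harvesting engine (e.g., a Nutch crawl).
    return web_search(query)

# Example record, of the kind an Iter search on the Sonnets might return.
record = {
    "author": "Vendler, Helen",
    "title": "The Art of Shakespeare's Sonnets",
    "subjects": ["Shakespeare, William, 1564-1616. Sonnets"],
}
```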
The method aims to ensure accuracy, scalability, and utility. Accuracy is ensured insofar as results are disambiguated by comparison against Iter's bibliographic records, a process that applies domain-specific ontological structures. Scalability is ensured in that individual searches can be automatically sequenced, drawing bibliographic records from Iter one at a time so that the harvester covers all parts of an identified knowledge domain. Utility is ensured because the resulting materials are drawn into the reading system and bibliographic records are created for them (from the original records, or using Lemon8-XML).
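Continuing the sketch above, the scalability and accuracy claims can be read as a simple control loop: records are drawn from Iter one at a time, each seeds its own harvest, and the record's subject headings supply terms for a rough on-topic check. The overlap test below is a deliberately crude stand-in for the Crawler's disambiguation against Iter's records, not a description of its actual implementation; the knowledgebase.add call is likewise assumed.

```python
def on_topic(page_text: str, subjects: list, threshold: int = 2) -> bool:
    """Crude stand-in for disambiguation: count subject-heading terms on the page."""
    text = page_text.lower()
    hits = sum(1 for term in subjects for word in term.lower().split() if word in text)
    return hits >= threshold

def crawl_domain(iter_records, web_search, knowledgebase):
    """Sequence individual searches, drawing one bibliographic record at a time."""
    for record in iter_records:  # together, the records span the knowledge domain
        for page in harvest(record, web_search):
            if on_topic(page["text"], record.get("subjects", [])):
                # Store the page with a bibliographic record derived from the
                # original MARC data (or generated via a tool like Lemon8-XML).
                knowledgebase.add(page, source_record=record)
```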