According to Toms and O’Brien (2008), the work of humanities researchers using digital resources is concerned with access to sources, the presentation of texts and the ability to analyse texts using a well-defined set of analysis tools. HiTHeR promises direct retrieval of relevant primary sources for research on the NCSE collections. It provides an automatically generated browsing interface, which supports the 'chain of readings' activities that define most humanities researchers' work. In humanities research processes, new relevant resources are typically found by following up on other relevant resources discovered earlier. HiTHeR offers an interface to primary resources by automatically generating a chain of related documents for reading.
However, the advanced automated methods that could generate such a browsing view, using text mining to support users' information retrieval, require greater processing power than standard desktop environments provide. Prior to the current case study, we experimented with a simple document similarity index that places journals with similar contents next to each other. Initial benchmarks on a stand-alone server led us to conclude that (assuming the test set was representative) a complete set of comparisons for the corpus would take more than 1,000 years!
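To illustrate why a complete set of comparisons grows so quickly, the following is a minimal sketch of a pairwise document similarity measure. The chapter does not say which measure HiTHeR used; cosine similarity over simple term-frequency vectors is an assumption made here for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens).

    Assumption: real text mining would use better tokenisation and weighting.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_count(n_docs):
    """Number of unordered document pairs in a corpus of n_docs documents."""
    return n_docs * (n_docs - 1) // 2
```

Because every document is compared with every other, the number of comparisons grows quadratically: `pair_count(n)` is n(n-1)/2, so doubling the corpus roughly quadruples the work.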
Governments, private enterprise and funding bodies are investing heavily in digitization of cultural heritage and humanities research resources. With advances in the availability of parallel computing resources and the simultaneous need to process large and complicated historical collections, it seems logical to turn attention towards the best parallel computing infrastructures to support work as envisioned in the HiTHeR project. In HiTHeR we set up an infrastructure based on High Throughput Computing (HTC), which uses many computational resources to accomplish a single computational task.
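The idea of spreading one large task over many resources can be sketched as follows. This is not the HiTHeR implementation and does not use any Condor interface; it only shows, under the assumption that each document pair can be compared independently, how the full set of comparisons might be cut into batches that separate machines could process. The function name `pair_batches` and its parameters are hypothetical.

```python
from itertools import combinations, islice


def pair_batches(n_docs, batch_size):
    """Yield lists of (i, j) document index pairs with i < j.

    Each list holds at most batch_size pairs and could be handed to a
    separate worker, since no comparison depends on another.
    """
    pairs = combinations(range(n_docs), 2)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch
```

Each batch is an independent unit of work, which is what makes this kind of problem a natural fit for high throughput computing: the results only need to be merged once all batches have finished.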
The HiTHeR project created a prototype infrastructure to demonstrate to textual scholars, and indeed to humanities researchers in general, the utility of HTC methods using Condor, which it uses to set up a Campus Grid. In our case, we have built a Campus Grid using underutilized computers from two institutions that share a building at King’s College London: the Centre for Computing in the Humanities (CCH) and the Centre for e-Research (CeRch). We use two types of computer systems: underutilized normal desktops and dedicated servers. Both CCH and CeRch have a large number of desktop machines and servers, used to present their vast archives and online publications. While the servers contain several terabytes of data, they have underused processing capabilities which can be made available for advanced processing. Additionally, the Condor Toolkit can use the national research infrastructure in the UK, the National Grid Service (NGS), which is a free service to UK researchers and provides dedicated advanced computing facilities.
The evaluation showed that the time needed to calculate document similarity could be reduced significantly by using the HTC resource. However, it also showed that more work is needed to determine exactly how text mining for the humanities can best be served by UK research infrastructures. More research is also needed to determine when HTC is sufficient and when dedicated hardware is required.
There is great untapped potential for using e-Research technologies in textual resource analysis. Computation on textual resources is quite well researched, and there are by now many well-performing algorithms and data structures that serve not only the general user but also the specific needs of researchers. But less work has been done to consider the infrastructural needs of future research based on these methodologies. More user studies are required to analyse existing work in Digital Humanities involving textual resources. We need to better understand how new methods such as text mining could be used, and how the discipline of textual studies, and the humanities in general, is transformed by the ability to do more data-driven empirical research. The field of humanities has the opportunity to move towards a new, more empirical way of working in which more and more resources, increasing not only in number but in size, become easily available. Interest in such new working practices already exists, as repeatedly shown in research reports for conferences. This chapter has presented just a few of the many projects working on this agenda. TextGrid looks at how collaboration can enable new research in textual studies, while HiTHeR looks at enhancing online editions in Digital Humanities using text mining approaches. As the need for research using large digital corpora increases, other projects will emerge that will further advance computational text analysis in arts and humanities research.
Brockman, W. S., L. Newmann, et al., Eds. (2001). Scholarly Work in the Humanities and the Evolving Information Environment. Washington DC, Digital Library Federation, Council on Library and Information Resources.
Gietz, P., A. Aschenbrenner, et al. (2006). TextGrid and eHumanities. Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, IEEE Computer Society.
Nentwich, M. (2003). Cyberscience: Research in the Age of the Internet. Vienna, Austrian Academy of Science Press.
Schreibman, S., R. Siemens, et al., Eds. (2004). A Companion to Digital Humanities. Oxford, Blackwell Publishing.
Toms, E. and H. L. O'Brien (2008). "Understanding the information and communication technology needs of the e-humanist." Journal of Documentation 64.