The answer to this question depends on your past experience with text and/or data analysis research. A user with a background in natural language processing (NLP) methods, for example, might be ready to jump straight into the data capsule documentation (a tool designed for intermediate to advanced HTRC users, since it requires familiarity with the Linux/Unix command line and programming languages like Python or R).
For the most part, though, this page addresses the basics for someone who is new to both text analysis and HTRC tools and data.
Important: You will need an HTRC Analytics account to use most HTRC tools. The exceptions are Bookworm and the other visualization tools, which require no account. Read about user accounts.
Computational text analysis is the use of computational tools to study textual data, such as the text of a novel or the text of a thousand novels. The analysis may include identifying frequently used words; named entities, like people, places, and things; or words that commonly occur together and suggest significant themes or topics. The speed and efficiency of computational tools allow one to analyze many more volumes than one could by traditional reading. The term "distant reading" is frequently used to characterize this computational analysis of massive amounts of text and contrast it with the term "close reading" applied to traditional scholarly analysis and interpretation of texts.
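To make that concrete, here is a minimal sketch, in plain Python, of the simplest form of this kind of analysis: counting word frequencies in a passage of text. The sample sentence and the tokenization rule are illustrative assumptions, not part of any HTRC tool.

```python
# A toy illustration of distant reading at its simplest: word frequencies.
# The sample text and the tokenization rule are illustrative assumptions.
from collections import Counter
import re

text = (
    "Call me Ishmael. Some years ago, never mind how long precisely, "
    "having little or no money in my purse, I thought I would sail about."
)

# Lowercase the text and treat runs of letters as rough word tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Show the five most frequently used words.
for word, count in Counter(tokens).most_common(5):
    print(word, count)
```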
Text data to analyze (in the form of volumes) comes from the HathiTrust Digital Library.
To perform text analysis via HTRC, volumes typically go through a transformation process, meaning that in most situations you don't get direct, human-readable access to the texts you want to analyze, especially texts that are still under copyright. There are some exceptions, such as steps you can take when requesting a research data capsule, but even then you will be prevented from taking raw textual data outside the capsule environment.
One of the foundational principles of performing text analysis at HTRC is the concept of non-consumptive use. The HathiTrust Digital Library is a repository of millions of volumes and book-like artifacts, some in the public domain and others under copyright. As such, all access to these materials through HTRC adheres to the non-consumptive use policy, which means that, in most cases, no substantial portion of restricted texts is available for a researcher to read or distribute for its original expressive content (read about additional permissions granted as a Members-only benefit for research capsule users). Most forms of HathiTrust textual data have undergone some level of transformation, so that textual analysis is possible but more traditional forms of reading are not. This may sound like complicated legalese, but it is what lets HTRC researchers gain access to over 18 million volumes on which they can perform text and data mining at a scale no other repository offers. Read the full non-consumptive use preamble and policy.
There are several ways to select the texts you want to analyze, but ultimately the first step is to create an HTRC workset.
Essentially, a workset is a list of HathiTrust IDs, the unique identifiers assigned to each volume in the HathiTrust Digital Library. There are ways to gather these IDs yourself, but the easiest way to make a workset is usually to start in the HathiTrust Digital Library site, create a collection (or find a pre-existing collection that interests you), and then import that collection into HTRC using our Import from HathiTrust form.
Read all about this and our two other methods for creating a workset!
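For a concrete sense of what a workset contains, here is a minimal sketch of a workset as plain data: just a list of HathiTrust volume IDs written to a one-column CSV. The IDs and the filename are hypothetical placeholders; in practice you would gather real IDs through the methods above.

```python
# A workset, at its core, is a list of HathiTrust volume IDs.
# The IDs and filename below are hypothetical placeholders.
import csv

workset = [
    "mdp.39015012345678",  # IDs pair a namespace (e.g., "mdp") with a local identifier
    "uc1.b000123456",
    "nyp.33433012345678",
]

with open("my_workset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["htid"])            # a simple header for the ID column
    writer.writerows([vid] for vid in workset)
```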
Nope! We have several handy visualization tools on HTRC Analytics that do not require worksets, such as the Bookworm tool or the Gendered Characterizations tool. However, if you want to limit your investigation to a custom set of volumes, rather than the whole collection or other pre-defined sets, you'll need to create a workset of the items you want to study.
Want full text of public domain items? HathiTrust makes that data available via their dataset request process.
We recommend using one of our beginner-friendly algorithms. They are easy to use in the sense that you do not need to know any programming languages to perform some common text analysis methods, such as Named Entity Recognition or Topic Modeling.
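If you're curious what a method like Named Entity Recognition actually does, here is a minimal sketch using the spaCy library. This is an outside illustration of the general technique, not the software behind HTRC's built-in algorithm, and it assumes spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm).

```python
# A general illustration of Named Entity Recognition with spaCy;
# HTRC's built-in NER algorithm may use different software under the hood.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Herman Melville published Moby-Dick in New York in 1851.")

# Each detected entity carries a label such as PERSON, GPE (a place), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```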
The next logical progression would be to start tackling more complex text analysis methods utilizing software and code libraries with one of our derived datasets or with full-text volumes in the data capsule (see the Feature Reader and Workset Toolkit pages).
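As a taste of that next step, here is a minimal sketch using the HTRC Feature Reader (installable with pip install htrc-feature-reader). It assumes you have already downloaded an Extracted Features file for a volume; the file path below is a hypothetical placeholder.

```python
# A minimal sketch with the HTRC Feature Reader; the path is a placeholder
# for an Extracted Features file you have downloaded locally.
from htrc_features import FeatureReader

fr = FeatureReader(["path/to/volume.json.bz2"])
for vol in fr.volumes():
    print(vol.title)
    # tokenlist() returns a pandas DataFrame of token counts;
    # pos=False and case=False fold part-of-speech tags and letter case.
    tokens = vol.tokenlist(pos=False, case=False)
    print(tokens.head())
```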
Data capsules are virtual machines (accessed like a remote desktop environment) with increased security settings. There are some standard pre-installed tools on the desktop, like Voyant and an Anaconda install that includes many standard Python libraries. All data outputs are subject to HTRC automated and manual human review, ensuring that every researcher is in compliance with non-consumptive use of their data and research materials. Read the Data Capsule Terms of Use.
Throughout HTRC Analytics we try to indicate whether a tool or resource is beginner-friendly, intermediate, or advanced, so that researchers and users who broadly fall into those categories (or traverse them) can tell whether a tool will be useful for them.
Beginner - indicates that no programming knowledge is required for the tool or resource.
Intermediate - indicates that some high-level knowledge of either coding or data output may be required for using the tool or resource.
Advanced - indicates that programming skills will likely be necessary in order to use the tool or resource.