
Learn and support

New to HTRC?

Where do I begin?

The answer to this question depends on your past experience with text and/or data analysis research. A user with a background in natural language processing (NLP) methods, for example, might be ready to jump straight into the data capsule documentation. (The data capsule is designed for more intermediate to advanced HTRC users, since it requires familiarity with the Linux/Unix command line and programming languages like Python or R.)

But, for the most part, this page will address the basics for someone who is new to both text analysis and HTRC tools and data.

Important: You will need an HTRC Analytics account to use most HTRC tools. Exceptions are the Bookworm and other visualization tools (no account required for these). Read about user accounts.

What's text analysis?

Computational text analysis is the use of computational tools to study textual data, such as the text of a novel or the text of a thousand novels. The analysis may include identifying frequently used words; named entities, like people, places, and things; or words that commonly occur together and suggest significant themes or topics. The speed and efficiency of computational tools allow one to analyze many more volumes than one could by traditional reading. The term "distant reading" is frequently used to characterize this computational analysis of massive amounts of text and contrast it with the term "close reading" applied to traditional scholarly analysis and interpretation of texts.
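To make this concrete, here is a minimal sketch of the simplest method mentioned above, counting frequently used words, written in Python. The sample text is purely illustrative; in practice the text would come from the volumes you are studying.

```python
from collections import Counter
import re

# A tiny stand-in for a volume's plain text (illustrative only).
text = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness."""

# Lowercase the text and split it into word tokens.
tokens = re.findall(r"[a-z']+", text.lower())

# Count how often each word appears.
word_counts = Counter(tokens)

# Show the ten most frequent words.
for word, count in word_counts.most_common(10):
    print(f"{word}\t{count}")
```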

Okay, great! Where do I get the data to start analyzing my texts?

Text data to analyze (in the form of volumes) comes from the HathiTrust Digital Library.

To perform text analysis via HTRC, the volumes typically go through a transformation process, meaning that in most situations you don't get direct, human-readable access to the texts you want to analyze, especially texts that are still under copyright. There are some exceptions, like steps you can take when requesting a research data capsule, but even in that situation you are prevented from taking raw textual data anywhere outside the capsule environment.

Non-consumptive use

One of the foundational principles of performing text analysis at HTRC is the concept of non-consumptive use. The HathiTrust Digital Library is a repository of millions of volumes and book-like artifacts, some in the public domain and others under copyright. As such, all access to these materials through HTRC adheres to the non-consumptive use policy, which means that, in most cases, no substantial portion of restricted texts is available for a researcher to read or distribute for its original expressive content (read about additional permissions granted as a Members-only benefit for research capsule users). Most forms of HathiTrust textual data have undergone some level of transformation, so that textual analysis is possible but more traditional forms of reading are not. This may sound like complicated legalese, but it is what lets HTRC researchers access over 18 million volumes and perform text and data mining at a scale no other repository can match. Read the full non-consumptive use preamble and policy.

So how do I get my data?

There are several ways to do this, but ultimately the first step is to create an HTRC workset.

What's a workset?

Essentially, a workset is a list of HathiTrust IDs, which are unique identifiers assigned to each volume in the HathiTrust Digital Library. There are ways to gather these IDs yourself, but the easiest way to make a workset is probably to start in the HathiTrust Digital Library, create a collection (or find a pre-existing collection that interests you), and then import that collection into HTRC using our Import from HathiTrust form.

Read all about this and our two other methods for creating a workset!
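To give a rough sense of what a workset boils down to, here is a small sketch of reading a workset-style list of volume IDs in Python. The file contents and IDs below are placeholders for illustration only, not real volumes or an official HTRC export format.

```python
import csv

# Hypothetical workset file: one HathiTrust volume ID per row
# (these IDs are placeholders, not real volumes).
workset_csv = """htid
mdp.39015000000001
uc1.b000000002
nyp.33433000000003
"""

# Parse the IDs so they can be handed to downstream tools.
reader = csv.DictReader(workset_csv.splitlines())
volume_ids = [row["htid"] for row in reader]

print(f"Workset contains {len(volume_ids)} volumes:")
for htid in volume_ids:
    print(" ", htid)
```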

Do I have to create a workset to start analyzing text data from the HathiTrust Digital Library?

Nope! We have several handy visualization tools on HTRC Analytics that do not require worksets, such as the Bookworm tool or the Gendered Characterizations tool. However, if you want to limit your investigation to a custom set of volumes, rather than the whole collection or other pre-defined sets, you'll need to create a workset of the items you want to study.

Want full text of public domain items? HathiTrust makes that data available via their dataset request process.

OK, I've explored the Bookworm tool, and also made my first workset - I'm ready to use it! What can I do with it?

We recommend using one of our beginner-friendly algorithms. They are easy to use in the sense that you do not need to know any coding languages to perform some common text analysis methods, such as Named Entity Recognition or Topic Modeling.

Read about the algorithms.
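The web-based algorithms require no coding at all, but if you are curious what a method like Named Entity Recognition actually produces, here is a small local sketch using the spaCy library. This is not the HTRC algorithm itself, just an illustration of the technique, and it assumes spaCy and its small English model are installed.

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

# Load a small pretrained English pipeline.
nlp = spacy.load("en_core_web_sm")

# Illustrative sentence; an HTRC algorithm would run over the volumes in your workset.
doc = nlp("Herman Melville published Moby-Dick in London and New York in 1851.")

# Print each named entity and its predicted type (PERSON, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```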

I've played around with the algorithms, read their documentation, and feel like I'm getting the hang of their outputs. What should I do next?

The next logical step is to start tackling more complex text analysis methods using software and code libraries, either with one of our derived datasets or with full-text volumes in the data capsule (see the Feature Reader and Workset Toolkit pages).
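For example, a minimal sketch using the htrc-feature-reader Python package to open a single Extracted Features file and list its most frequent tokens might look like the following. The file path is a placeholder, and you should check the Feature Reader documentation for the exact current API.

```python
from htrc_features import Volume  # assumes: pip install htrc-feature-reader

# Placeholder path to a downloaded HTRC Extracted Features file.
vol = Volume("data/example.json.bz2")

# Basic volume-level metadata.
print(vol.title, vol.year, vol.page_count)

# Page-level token counts as a pandas DataFrame.
tokens = vol.tokenlist()

# Aggregate counts across pages and show the ten most frequent tokens.
top = (tokens.groupby(level="token").sum()
             .sort_values("count", ascending=False)
             .head(10))
print(top)
```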

Data Capsules

Data capsules are virtual machines (accessed like a remote desktop environment) with increased security settings. There are some standard pre-installed tools on the desktop, like Voyant and an Anaconda install that includes many standard Python libraries. All data outputs are subject to HTRC automated and manual human review, ensuring that every researcher is in compliance with non-consumptive use of their data and research materials. Read the Data Capsule Terms of Use.

Read about using our data capsules.
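As a rough illustration of a non-consumptive workflow inside a capsule, the sketch below counts words across plain-text files and writes out only aggregate counts for review and release. The directory paths are hypothetical; the actual capsule layout and export process are described in the Data Capsule documentation.

```python
from collections import Counter
from pathlib import Path
import re

# Hypothetical paths: the real directory layout inside a capsule is
# described in the Data Capsule documentation.
text_dir = Path("/home/dcuser/texts")       # raw volume text (stays in the capsule)
results = Path("/home/dcuser/results.csv")  # derived output submitted for review

# Count word tokens across all plain-text files in the capsule.
counts = Counter()
for path in text_dir.glob("*.txt"):
    tokens = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    counts.update(tokens)

# Only aggregate word counts, not the raw text, are written out for
# HTRC review and release, in keeping with non-consumptive use.
with results.open("w", encoding="utf-8") as f:
    f.write("token,count\n")
    for token, count in counts.most_common(1000):
        f.write(f"{token},{count}\n")
```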

How do I know what level of expertise I should have before I try to use a tool or other HTRC resource?

Throughout HTRC Analytics we try to indicate whether a tool or resource is beginner-friendly, intermediate, or advanced, so that researchers and users who broadly fall into those categories (or move between them) can tell whether a tool will be useful for them.

Beginner - indicates that no programming knowledge is required for the tool or resource.

Intermediate - indicates that some high-level knowledge of either coding or data output may be required for using the tool or resource.

Advanced - indicates that programming skills will likely be necessary in order to use the tool or resource.