
HTRC glossary

The following is an alphabetically organized list of the terminology used across HTRC Analytics and its corresponding documentation. If you're new to our site, to text analysis, or to HathiTrust generally, you may want to keep this page open in a separate window or tab for reference.

HTRC algorithms are click-and-run tools that allow account holders to perform several types of text analysis on HathiTrust data within the website. There are currently four HTRC algorithms:

  • Token Count and Tag Cloud Creator (visualization is generated)
  • Named Entity Recognition (csv file is generated)
  • InPHO Topic Model Explorer (visualization is generated)
  • Extracted Features Download Helper (script is generated)

With the exception of the Extracted Features Download Helper, these tools accept only worksets of fewer than 3,000 volumes.

Read more about algorithms.

A collection is made in the HathiTrust Digital Library site. Collections are typically user-generated, but institutions or HathiTrust can add collections as well. Simply put, a collection is a set of HathiTrust volumes grouped within the HathiTrust Digital Library.

A template is a snapshot, or image, of an existing research data capsule that can be made available for other HTRC account holders to find and use. A template is created by an HTRC user who wants to share a capsule they have set up for educational or research purposes. While in Maintenance mode, this user installs additional code libraries, software tools, worksets, and any other data accessible from the internet that is not already included in HTRC’s Ubuntu virtual machine, so that other users can clone the capsule without repeating the setup and customization.

Data capsules are customizable environments for writing or running your own code or tools to study HathiTrust volumes. Users will likely need familiarity with the command line and some knowledge of a programming language, which is why the data capsule is considered an advanced tool.

Read more about data capsules.

A dataset on HTRC Analytics typically refers to one of our “derived” datasets, meaning a file (or set of files) that contains information extracted or created from HathiTrust volumes, but not the full text from any volume.

HTRC offers four derived datasets:

  • Extracted Features (our most robust dataset, which contains volume and page-level metadata and data, including part-of-speech tags, tokens, and token counts for over 17 million volumes in the HathiTrust Digital Library)
  • Word Frequencies in English-language Literature, 1700-1922 (dataset of word frequencies across all HathiTrust English-language volumes published from 1700 to 1922)
  • Geographic Locations in English-Language Literature, 1701-2011 (dataset of volume metadata and geographic locations mentioned in works of English-language fiction found in the HathiTrust Digital Library)
  • BookNLP Dataset for English-Language Fiction (a derived dataset created by running the BookNLP pipeline over the NovelTM English-language fiction set of around 213,000 volumes in the HathiTrust Digital Library)
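To make the Extracted Features idea concrete, here is a minimal sketch of summing token counts for one page. The real Extracted Features files are JSON with per-page token and part-of-speech counts; the field names and values below are simplified assumptions for illustration, not the actual schema.

```python
import json

# A tiny stand-in for one page of an Extracted Features file.
# Field names here are illustrative assumptions, not the real schema.
page = json.loads("""
{
  "seq": "00000007",
  "tokenPosCount": {
    "the":   {"DT": 12},
    "whale": {"NN": 3},
    "sea":   {"NN": 2}
  }
}
""")

def page_token_total(page):
    """Sum all token occurrences on a page, across parts of speech."""
    return sum(
        count
        for pos_counts in page["tokenPosCount"].values()
        for count in pos_counts.values()
    )

print(page_token_total(page))  # 17
```

Because the dataset ships counts rather than full text, analyses like this stay within non-consumptive bounds.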

Read more about datasets.

The HTRC Feature Reader is an HTRC-developed Python library for working with our Extracted Features dataset; it simplifies common data science and text analysis methods.

Read the GitHub documentation and Feature Reader info page.

Hathifiles are tab-delimited metadata files provided by HathiTrust that can be downloaded from the Hathifiles page. They contain information, including bibliographic metadata, about every item in the HathiTrust Digital Library.
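Because Hathifiles are plain tab-delimited text, they can be read with standard CSV tooling. The sketch below uses an inline sample; the column names are illustrative assumptions, not the actual Hathifile schema, and real files are far wider.

```python
import csv
import io

# A tiny inline sample standing in for a Hathifile download.
# Column names are made up for illustration only.
sample = (
    "htid\ttitle\trights_date_used\n"
    "mdp.39015012345678\tExample Title\t1899\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
print(rows[0]["htid"])  # mdp.39015012345678
```

For a real file, you would open it (often gzip-compressed) and stream rows the same way rather than loading everything into memory.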

Wait…there are HTRC datasets and HathiTrust datasets, and they are not the same thing?

Correct! Both HathiTrust and HTRC provide access to various datasets. More information about requesting HathiTrust datasets is available on the HathiTrust website.

The non-consumptive use policy underpins all research conducted using HTRC methods and tools. In essence, HTRC supports computational research as long as researchers do not view or read large amounts of text from in-copyright items. HTRC works to transform human-readable text into machine-readable data so that the Research Center and its users comply with copyright law.

Read more about HathiTrust and HTRC’s non-consumptive use policy.

An HTRC visualization is a visual representation of your data, usually produced after some sort of processing has been performed on it. HTRC’s Bookworm tool is an example of a visualization we offer as a form of textual analysis. The InPHO Topic Model tool is another.

An HTRC workset is a collection (or list) of HathiTrust volumes. We call it a workset because HTRC treats it not as a collection of readable texts but as data. HathiTrust volume IDs allow our systems to retrieve information from the digital library’s systems and convert the text into data in a variety of ways, depending on the type of research you want to do.
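A workset, then, is essentially a list of HathiTrust volume IDs. Each ID pairs a source prefix with a local identifier, separated by the first period (e.g. "mdp.39015012345678"). The sketch below splits IDs this way; the sample IDs are made up for illustration.

```python
# A toy workset: a list of (fabricated) HathiTrust volume IDs.
workset = [
    "mdp.39015012345678",
    "uc1.b000123456",
]

def split_volume_id(htid):
    """Split a HathiTrust volume ID into its prefix and local identifier.

    Only the first period separates the two parts; the local part
    may itself contain periods.
    """
    prefix, local_id = htid.split(".", 1)
    return prefix, local_id

for htid in workset:
    prefix, local_id = split_volume_id(htid)
    print(prefix, local_id)
```

Handling IDs as structured data like this is the usual first step before asking HTRC systems to fetch or process the corresponding volumes.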

Read more about worksets.

The workset toolkit is a command-line tool, used inside an HTRC data capsule, that helps users download data for individual items or worksets.

Read more about the toolkit.