
HTRC glossary

The following is an alphabetically organized list of the terminology used across HTRC Analytics and its corresponding documentation. If you're new to our site, to text analysis, or to HathiTrust generally, you may want to keep this page open in a separate window or tab for reference.

HTRC algorithms are click-and-run tools that allow account holders to perform several types of text analysis on HathiTrust data within the website. There are currently four HTRC algorithms:

  • Token Count and Tag Cloud Creator (visualization is generated)
  • Named Entity Recognition (csv file is generated)
  • InPHO Topic Model Explorer (visualization is generated)
  • Extracted Features Download Helper (script is generated)

With the exception of the Extracted Features Download Helper, these tools accept only worksets of fewer than 3,000 volumes.

Read more about algorithms.

A collection is made in the HathiTrust Digital Library site. Collections are typically user-generated, but institutions or HathiTrust can add collections as well. Simply put, a collection is a set of HathiTrust volumes grouped within the HathiTrust Digital Library.

A template is a snapshot, or image, of an existing research data capsule that can be made available for other HTRC account holders to find and use. A template is created by an HTRC user who wants to share a capsule they have set up for educational or research purposes. While in Maintenance mode, this user installs additional code libraries, software tools, worksets, and any other data accessible from the internet that is not already included in HTRC’s Ubuntu virtual machine, so that other users can clone the capsule without repeating the setup and customization.

Data capsules are customizable environments for writing or running your own code or tools to study HathiTrust volumes. Users will likely need familiarity with the command line and some knowledge of a programming language, which is why the data capsule is considered an advanced tool.

Read more about data capsules.

A dataset on HTRC Analytics typically refers to one of our “derived” datasets, meaning a file (or set of files) that contains information extracted or created from HathiTrust volumes, but not the full text from any volume.

HTRC offers four derived datasets:

  • Extracted Features (our most robust dataset, which contains volume and page-level metadata and data, including part-of-speech tags, tokens, and token counts for over 17 million volumes in the HathiTrust Digital Library)
  • Word Frequencies in English-language Literature, 1700-1922 (dataset of word frequencies across all HathiTrust English-language volumes published from 1700 to 1922)
  • Geographic Locations in English-Language Literature, 1701-2011 (dataset of volume metadata and geographic locations mentioned in works of English-language fiction found in the HathiTrust Digital Library)
  • BookNLP Dataset for English-Language Fiction (a derived dataset created by running the BookNLP pipeline over the NovelTM English-language fiction set of around 213,000 volumes in the HathiTrust Digital Library)
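To make the Extracted Features idea concrete, here is a minimal sketch of summing token counts for one page. The real Extracted Features files are JSON with per-page token and part-of-speech counts; the field names and values below are simplified assumptions for illustration, not the actual schema.

```python
import json

# A tiny stand-in for one page of an Extracted Features file.
# Field names here are illustrative assumptions, not the real schema.
page = json.loads("""
{
  "seq": "00000007",
  "tokenPosCount": {
    "the":   {"DT": 12},
    "whale": {"NN": 3},
    "sea":   {"NN": 2}
  }
}
""")

def page_token_total(page):
    """Sum all token occurrences on a page, across parts of speech."""
    return sum(
        count
        for pos_counts in page["tokenPosCount"].values()
        for count in pos_counts.values()
    )

print(page_token_total(page))  # 17
```

Because the dataset ships counts rather than full text, analyses like this stay within non-consumptive bounds.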

Read more about datasets.

The HTRC Feature Reader is an HTRC-developed Python library for working with our Extracted Features dataset; it simplifies common data science and text analysis methods.

Read the GitHub documentation and Feature Reader info page.

Hathifiles are tab-delimited metadata files provided by HathiTrust that can be downloaded from the Hathifiles page. They contain information, including bibliographic metadata, about every item in the HathiTrust Digital Library.
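Because Hathifiles are plain tab-delimited text, they can be read with standard CSV tooling. The sketch below uses an inline sample; the column names are illustrative assumptions, not the actual Hathifile schema, and real files are far wider.

```python
import csv
import io

# A tiny inline sample standing in for a Hathifile download.
# Column names are made up for illustration only.
sample = (
    "htid\ttitle\trights_date_used\n"
    "mdp.39015012345678\tExample Title\t1899\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
print(rows[0]["htid"])  # mdp.39015012345678
```

For a real file, you would open it (often gzip-compressed) and stream rows the same way rather than loading everything into memory.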

Wait…there are HTRC datasets and HathiTrust datasets, and they are not the same thing?

Correct! Both HathiTrust and HTRC provide access to various datasets. More information about requesting HathiTrust datasets is available on the HathiTrust website.

The non-consumptive use policy underpins all research conducted using HTRC methods and tools. In essence, HTRC supports computational research as long as researchers do not view or read large amounts of text from in-copyright items. HTRC works to transform human-readable text into machine-readable data so that the Research Center and its users comply with copyright law.

Read more about HathiTrust and HTRC’s non-consumptive use policy.

An HTRC visualization is a visual representation of your data, usually produced after some sort of processing has been performed on it. HTRC’s Bookworm tool is an example of a visualization we offer as a form of textual analysis. The InPHO Topic Model tool is another.

An HTRC workset is a collection (or list) of HathiTrust volumes. We call it a workset because HTRC treats it not as a collection of readable texts but as data. HathiTrust volume IDs allow our systems to retrieve information from the digital library’s systems and convert the text into data in a variety of ways, depending on the type of research you want to do.
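A workset, then, is essentially a list of HathiTrust volume IDs. Each ID pairs a source prefix with a local identifier, separated by the first period (e.g. "mdp.39015012345678"). The sketch below splits IDs this way; the sample IDs are made up for illustration.

```python
# A toy workset: a list of (fabricated) HathiTrust volume IDs.
workset = [
    "mdp.39015012345678",
    "uc1.b000123456",
]

def split_volume_id(htid):
    """Split a HathiTrust volume ID into its prefix and local identifier.

    Only the first period separates the two parts; the local part
    may itself contain periods.
    """
    prefix, local_id = htid.split(".", 1)
    return prefix, local_id

for htid in workset:
    prefix, local_id = split_volume_id(htid)
    print(prefix, local_id)
```

Handling IDs as structured data like this is the usual first step before asking HTRC systems to fetch or process the corresponding volumes.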

Read more about worksets.

The workset toolkit is a command-line tool, used inside an HTRC data capsule, that helps users download data for individual items or worksets.

Read more about the toolkit.