HTRC algorithms are click-and-run tools that allow account holders to perform several types of text analysis queries on HathiTrust data within the website. There are currently 4 HTRC algorithms:
With the exception of the Extracted Features Download Helper, only worksets of fewer than 3,000 volumes are valid when using these tools.
A collection is made in the HathiTrust Digital Library site. Collections are typically user-generated, but institutions or HathiTrust itself can add collections as well. Simply put, a collection is a group of HathiTrust volumes that exists within the HathiTrust Digital Library.
A template is a snapshot, or image, of a pre-existing research data capsule that can be made accessible for other HTRC account holders to find and use. The template is created and set up by one HTRC user who has a research data capsule they wish to share for educational or research purposes. While the capsule is in Maintenance mode, this user downloads additional code libraries, software tools, worksets, and potentially any other data they can access from the internet that is not already available in HTRC’s Ubuntu virtual machine, so that other users can clone the capsule without having to repeat the setup and customization.
Data capsules are customizable environments for writing or running your own code or tools to study HathiTrust volumes. Users will likely need some familiarity with the command line and some knowledge of a programming language, which is why the data capsule is considered an advanced tool.
A dataset on HTRC Analytics typically refers to one of our “derived” datasets, meaning a file (or set of files) that contains information extracted or created from HathiTrust volumes, but not the full text from any volume.
HTRC offers 4 derived datasets:
An HTRC-developed Python library, used with our Extracted Features dataset, that simplifies common data science and text analysis methods.
Read the GitHub documentation and Feature Reader info page.
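For a sense of how the library is used, here is a minimal sketch. It assumes the htrc-feature-reader package is installed and that a few Extracted Features files have already been downloaded locally; the file paths are hypothetical examples, and the GitHub documentation is the authoritative reference for method names and options.

```python
# Minimal sketch of reading local Extracted Features files with the
# HTRC Feature Reader. File paths below are hypothetical examples.
from htrc_features import FeatureReader

paths = [
    "data/mdp.39015012345678.json.bz2",  # example Extracted Features file
    "data/uc1.b000123456.json.bz2",      # example Extracted Features file
]

fr = FeatureReader(paths)
for vol in fr.volumes():
    print(vol.title)            # bibliographic metadata for the volume
    tokens = vol.tokenlist()    # per-page token counts as a pandas DataFrame
    print(tokens.head())
```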
Tab-delimited metadata files provided by HathiTrust that can be downloaded from the Hathifiles page. These files contain information, including bibliographic metadata, about every item in the HathiTrust Digital Library.
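As a quick illustration, a Hathifile can be loaded with pandas for analysis. This is only a sketch under stated assumptions: the file name is hypothetical, and because the field list is published separately on the Hathifiles page, the file is read here without assuming a header row.

```python
import csv
import pandas as pd

# Hypothetical file name; real Hathifiles are listed on the Hathifiles page.
# The files are gzipped and tab-delimited; the field list is documented
# separately, so no header row is assumed here.
hathifile = "hathi_full_20240101.txt.gz"

df = pd.read_csv(
    hathifile,
    sep="\t",                 # tab-delimited, per the Hathifiles description
    header=None,              # column names come from the separate field list
    dtype=str,                # keep identifiers (OCLC, ISBN, etc.) as strings
    quoting=csv.QUOTE_NONE,   # precaution: titles may contain stray quotes
)

print(df.shape)  # one row per item in the HathiTrust Digital Library
```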
Wait…there are HTRC datasets and HathiTrust datasets, and they are not the same thing?
Correct! Both HathiTrust and HTRC provide access to various datasets. More information about requesting HathiTrust datasets is available on the HathiTrust website.
This policy underpins all research conducted using HTRC methods and tools. Essentially, it means that HTRC supports computational research as long as researchers do not view or read large amounts of text from items under copyright. HTRC works to transform human-readable text into machine-readable text so that the Research Center and its users comply with copyright law.
Read more about HathiTrust and HTRC’s non-consumptive use policy.
An HTRC visualization is a visual representation of your data, usually produced after some sort of processing has been performed on it. HTRC’s Bookworm tool is an example of a visualization we offer as a form of textual analysis. The InPHO Topic Model tool is another.
An HTRC workset is a collection (or list) of HathiTrust volumes. We call it a workset because, rather than treating it as a collection of readable texts, HTRC needs it as data. The HathiTrust volume IDs in a workset allow our systems to retrieve information from the digital library’s systems and convert the text into data in a variety of ways, depending on what type of research you are interested in doing.
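At its simplest, a workset can be pictured as a plain list of HathiTrust volume IDs. The sketch below is purely illustrative; the file name and IDs are made-up examples of the format, shown only to convey how such a list might be read before being handed to an HTRC tool.

```python
# Illustration only: a workset treated as a plain list of HathiTrust
# volume IDs, one per line. The file name and IDs are hypothetical.
#
# my_workset.txt might contain lines like:
#   mdp.39015012345678
#   uc1.b000123456

with open("my_workset.txt") as f:
    volume_ids = [line.strip() for line in f if line.strip()]

print(f"Workset contains {len(volume_ids)} volumes")
```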
This is a command line tool used inside an HTRC data capsule that helps users download data for individual items or worksets.