Data

Recommended worksets

What is a recommended workset?

Recommended worksets are curated worksets created by researchers for past and ongoing projects. They draw on resources from the HathiTrust Digital Library and can be analyzed with tools built by the HTRC.

Why should I use a recommended workset?

To help future researchers and library educators understand HTRC tools and text mining best practices, HTRC is sharing these research-compatible worksets in the hope of fostering a community of HTRC users. Please read more about each workset below, and feel free to explore tools like the Named Entity Recognizer or the Token Count and Tag Cloud Creator using one (or all!) of the worksets. Because these worksets have been pre-approved for producing meaningful results, users can focus on the power of the tools themselves rather than on compiling a suitable workset for analysis beforehand.

Recommended workset options

NovelTM Datasets for English-Language Fiction: Manually-Checked Subset
Original researchers: Patrick Kimutis, Ted Underwood, Jessica Witte
Workset description: Randomized subset of a larger 138,164-volume list of English-language fiction in HathiTrust; distributed evenly across publication years 1700-2009.
# of volumes: 2,730
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

NovelTM Datasets for English-Language Fiction: Gender-Balanced Subset
Original researchers: Patrick Kimutis, Ted Underwood, Jessica Witte
Workset description: Subset of the "NovelTM Datasets for English-Language Fiction: Manually-Checked Subset" workset; reduced to produce a list giving equal balance to men, women, and "other" authors.
# of volumes: 1,501
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

NovelTM Datasets for English-Language Fiction: Frequently Reprinted Titles
Original researchers: Patrick Kimutis, Ted Underwood, Jessica Witte
Workset description: Subset of a larger 138,164-volume list of English-language fiction in HathiTrust, containing the volumes that had the most reprintings (i.e., obscure titles removed).
# of volumes: 2,100
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

Yellow Fever in the Caribbean (French-Language Volumes)
Original researchers: Mariola Espinosa
Workset description: Deduplicated set of French-language volumes published before 1900 that contain 10+ mentions of yellow fever and related historical terms (fièvre jaune, typhus d'Amérique, and maladie de Siam).
# of volumes: 2,193
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

Yellow Fever in the Caribbean (Spanish-Language Volumes)
Original researchers: Mariola Espinosa
Workset description: Deduplicated set of Spanish-language volumes published before 1900 that contain 10+ mentions of yellow fever and related historical terms (fiebre amarilla, calentura biliosa remitente amarilla, vómito negro, and vómito prieto).
# of volumes: 677
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

United States Presidential Speeches (1970s)
Original researchers: Eleanor Koehl
Workset description: Select issues of the Public Papers of the Presidents of the United States, including those from the presidencies of Jimmy Carter, Gerald Ford, and Richard Nixon. Includes volumes representing every year of the 1970s. The volumes contain public messages and statements from U.S. presidents; the series is published by the National Archives.
# of volumes: 16
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

20th Century English-Language Speculative Fiction
Original researchers: Laure Thompson and David Mimno
Workset description: Volumes of speculative fiction identified both by matching titles and authors to Worlds Without End (WWE), an extensive fan-built database of speculative fiction, and by computational text similarity analysis techniques. Includes works published in the 20th century.
# of volumes: 2,454
Recommended algorithms:
  • Token Count and Tag Cloud Creator
  • Named Entity Recognizer
  • InPhO Topic Model Explorer

How do I use a recommended workset?

Once you have an account and are logged in to HTRC Analytics, click the Worksets main menu option. On the Worksets page, you will see a table of either My Worksets or Recommended Worksets, depending on whether you have created your own worksets from your HTRC Analytics account. Switch to the Recommended Worksets view using the faceting menu on the far right of the screen.

You can also find the recommended worksets in the All Worksets view, since all of them are publicly available. A recommended workset is easy to identify by its color highlighting and by the author listed as 'htrc'.

Click the name of the recommended workset you would like to test (e.g., 20th Century English-Language Speculative Fiction). You will be taken to a workset details page. In the right-hand corner of the screen, click the Analyze With Algorithm menu. Choose the algorithm you would like to run on the workset, then follow the steps on the page that opens.

Please visit our Workset documentation Wiki for more information about HTRC worksets.