Due to scheduled maintenance, some HTRC services are not available from Friday, July 11th at 12:00am ET to Wednesday, July 16th at 12:00pm ET. We apologize for any inconvenience.
Recommended worksets are curated worksets created by researchers for past and ongoing projects. They draw on resources from the HathiTrust Digital Library and can be analyzed with tools built by the HTRC.
To help future researchers and library educators understand HTRC tools and text mining best practices, HTRC is sharing these research-compatible worksets in the hope of fostering a community of HTRC users. Please read more about each workset below, and feel free to explore tools like the Named Entity Recognizer or the Token Count and Tag Cloud Creator using one (or all!) of the worksets. Because these worksets have been pre-approved for producing meaningful results, users can focus on the power of the tools themselves rather than on compiling a suitable workset for analysis beforehand.
Workset Title | Original Researchers | Workset Description | # of Volumes | Recommended Algorithms |
---|---|---|---|---|
NovelTM Datasets for English-Language Fiction: Manually-Checked Subset | Patrick Kimutis, Ted Underwood, Jessica Witte | Randomized subset of a larger 138,164-volume list of English-language fiction in HathiTrust; distributed evenly across publication years 1700-2009. | 2,730 | |
NovelTM Datasets for English-Language Fiction: Gender-Balanced Subset | Patrick Kimutis, Ted Underwood, Jessica Witte | Subset of the "NovelTM Datasets for English-Language Fiction: Manually-Checked Subset" workset; reduced to produce a list giving equal balance to men, women, and "other" authors. | 1,501 | |
NovelTM Datasets for English-Language Fiction: Frequently Reprinted Titles | Patrick Kimutis, Ted Underwood, Jessica Witte | Subset of a larger 138,164-volume list of English-language fiction in HathiTrust containing those volumes that had the most reprintings (i.e., obscure titles removed). | 2,100 | |
Yellow Fever in the Caribbean (French-Language Volumes) | Mariola Espinosa | Deduplicated set of French-language volumes containing 10+ mentions of yellow fever and related historical terms (fièvre jaune, typhus d'Amérique, and maladie de Siam) and published before 1900. | 2,193 | |
Yellow Fever in the Caribbean (Spanish-Language Volumes) | Mariola Espinosa | Deduplicated set of Spanish-language volumes containing 10+ mentions of yellow fever and related historical terms (fiebre amarilla, calentura biliosa remitente amarilla, vómito negro, and vómito prieto) and published before 1900. | 677 | |
United States Presidential Speeches (1970s) | Eleanor Koehl | Select issues of the Public Papers of the Presidents of the United States, including those from the presidencies of Jimmy Carter, Gerald Ford, and Richard Nixon. Includes volumes representing every year of the 1970s. Volumes contain public messages and statements from U.S. presidents and are published by the National Archives. | 16 | |
20th Century English-Language Speculative Fiction | Laure Thompson and David Mimno | Volumes of speculative fiction identified both by matching titles and authors to Worlds Without End (WWE), an extensive fan-built database of speculative fiction, and via computational text similarity analysis techniques. Includes works published in the 20th century. | 2,454 | |
Once you have an account and are logged in to HTRC Analytics, you may click on the Worksets main menu option. On the Worksets page, you will see a table of either My Worksets or Recommended Worksets (this depends on whether you have created your own worksets from your HTRC Analytics account). Change to the Recommended Worksets view by using the faceting menu located on the far right of the screen.
You can also find the recommended worksets in the All Worksets view, since all of them are publicly available. A recommended workset is easy to identify by its color highlighting and by the author listed as 'htrc'.
Click on the name of the recommended workset that you would like to test (e.g., 20th Century English-Language Speculative Fiction). You will be taken to a workset details page. In the right-hand corner of the screen, click the Analyze With Algorithm menu. Choose the algorithm you would like to run on the workset, and follow the steps on the next page.
Please visit our Workset documentation Wiki for more information about HTRC worksets.