HTRC Algorithms are web-based, click-and-run tools to perform computational text analysis on volumes in the HathiTrust Digital Library. The algorithms can help you explore, analyze, and visualize public worksets or those you have created.
Generate a script for downloading extracted features data for your workset of choice. The script is a file containing a list of rsync commands, one for accessing each volume in the workset. After you download the script from HTRC Analytics, run it locally on your own computer; the rsync commands then retrieve the extracted features data. For more information on the extracted features data, see the documentation.
Note: Extracted features data was not created for a small number of volumes, so it is possible that not all of your workset volumes will be processed.
Result of job: script to download extracted features data files.
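For illustration only, the downloaded script can also be driven from Python rather than run directly in a shell, by executing each rsync line in turn. This is a minimal sketch, not an HTRC-provided workflow; the filename EF_Rsync.sh is hypothetical, so substitute whatever name your download has.

```python
import subprocess
from pathlib import Path

# Hypothetical filename; use whatever name HTRC Analytics gave your downloaded script.
script = Path("EF_Rsync.sh")

for line in script.read_text().splitlines():
    line = line.strip()
    if not line.startswith("rsync"):
        continue                      # skip blank lines and comments
    # Run each rsync command exactly as it appears in the script.
    subprocess.run(line, shell=True, check=True)
```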
The InPho Topic Explorer trains multiple LDA topic models and allows you to export files containing the word-topic and topic-document distributions, along with an interactive visualization. For a full, detailed description, please review the documentation.
How it works:
- Downloads each HathiTrust volume from the Data API.
- Tokenizes each volume using the topicexplorer init command.
- Applies a stoplist based on the frequency of terms in the corpus, removing the most frequent words accounting for 50% of the collection and the least frequent words accounting for 10% of the collection (a rough sketch of this step appears after this list).
- Creates a new topic model for each number of topics specified. For example, "20 40 60 80" would train separate models with 20, 40, 60, and 80 topics.
- Displays a visualization of how topics across models cluster together. This lets a user see the granularity of the different models and how terms may be grouped into "larger" topics.
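A rough sketch of the frequency-based stoplisting step, in plain Python on a toy corpus, may make the 50%/10% cutoffs concrete. This is illustrative only and is not the InPho Topic Explorer's implementation:

```python
from collections import Counter

def frequency_stoplist(docs, high_mass=0.50, low_mass=0.10):
    """Terms whose cumulative counts cover the top `high_mass` or the
    bottom `low_mass` of all tokens in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    ranked = counts.most_common()              # most frequent first
    stop = set()

    running = 0
    for term, n in ranked:                     # high-frequency cut: ~50% of tokens
        if running >= high_mass * total:
            break
        stop.add(term)
        running += n

    running = 0
    for term, n in reversed(ranked):           # low-frequency cut: ~10% of tokens
        if running >= low_mass * total:
            break
        stop.add(term)
        running += n
    return stop

# Toy corpus standing in for tokenized HathiTrust volumes.
docs = [
    ["the", "whale", "and", "the", "sea"],
    ["the", "ship", "and", "the", "whale"],
    ["a", "storm", "at", "sea"],
]
stop = frequency_stoplist(docs)
filtered = [[t for t in doc if t not in stop] for doc in docs]
print(stop)
print(filtered)
# Separate LDA models (e.g. with 20, 40, 60, and 80 topics) would then be
# trained on the filtered corpus, one per requested topic count.
```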
More documentation of the Topic Explorer is available at https://inpho.github.io/topic-explorer/.
Result of job: Four files are generated. Three are for displaying a visualization of topic clusters and top terms: topics.html, cluster.csv, topics.json. The final file (workset.tez) can be used with a local install of the Topic Explorer to access the complete word-topic and topic-document matrices, along with other advanced analytics.
Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract.
How it works:
- performs header/body/footer identification
- extracts body text only for analysis
- rejoins end-of-line hyphenated words in order to de-hyphenate the text (this step and the entity extraction are roughly sketched after this list)
- tokenizes the text using the Stanford NLP model for the language specified by the user
- performs entity recognition/extraction using the Stanford Named Entity Recognizer
- shuffles the entities found on each page (to prevent aiding page reconstruction)
- saves the resulting entities to a file
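The algorithm itself runs server-side with the Stanford NLP toolkit. As a rough local approximation of the de-hyphenation and entity-extraction steps, the sketch below uses the Stanford NLP group's Python package stanza rather than the Java Stanford Named Entity Recognizer that HTRC uses; the sample text is illustrative only.

```python
import re
import stanza

def dehyphenate(text: str) -> str:
    # Rejoin words split across line breaks with a hyphen, e.g. "Mel-\nville" -> "Melville".
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

stanza.download("en")                               # fetch the English models once
nlp = stanza.Pipeline("en", processors="tokenize,ner")

page_text = "Herman Mel-\nville wrote Moby-Dick in 1851 in Massachusetts."
doc = nlp(dehyphenate(page_text))

# Collect (text, type) pairs such as PERSON, GPE, DATE, MONEY.
entities = [(ent.text, ent.type) for ent in doc.ents]
print(entities)
```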
Result of job: table of the named entities found in a workset.
Identify the tokens (words) that occur most often in a workset and the number of times they occur. Create a tag cloud visualization of the most frequently occurring words in a workset, where each word's size is proportional to the number of times it occurs.
How it works:
- identifies page header/body/footer
- extracts page body only for analysis
- rejoins end-of-line hyphenated words in order to de-hyphenate the text
- removes stop words as specified by the user
- applies replacement rules (i.e., corrections) as specified by the user, maintaining the original case of the replaced words
- tokenizes the text using the Stanford NLP model for the language specified by the user, or does white-space tokenization
- counts tokens
- sorts tokens in descending order by count
- saves the sorted token counts to a file
- generates the tag cloud according to the filter(s) specified by the user
Result of job: tag cloud showing the most frequently occurring words, and a file with a list of those words and the number of times they occur.
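The counting and tag cloud generation happen on HTRC's servers, but the basic steps above can be approximated locally. The sketch below is illustrative only: it uses Python's collections.Counter and the third-party wordcloud package, not the HTRC implementation, and the sample text and stop words are made up.

```python
from collections import Counter
from wordcloud import WordCloud

text = "call me ishmael some years ago never mind how long precisely call me ishmael"
stopwords = {"me", "how", "some"}          # user-specified stop words

# Whitespace tokenization, stop-word removal, then descending-frequency counts.
tokens = [t for t in text.lower().split() if t not in stopwords]
counts = Counter(tokens)
for token, n in counts.most_common(10):
    print(token, n)

# Size each word in proportion to its count and write the cloud to an image file.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(counts)
cloud.to_file("tag_cloud.png")
```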