Research Guides: Text Mining Resources: Sources

Text data sources available at Princeton

Dimensions Plus API
Dimensions Plus includes grants, publications, citations, alternative metrics, clinical trials, patents, and policy documents. To enable API access, register with your NetID and password, then email support@dimensions.ai.
Elsevier API
Elsevier's API program allows you to integrate content and data from Elsevier products into your own website and applications. APIs are free for the products Princeton subscribes to: Scopus, Engineering Village, and subscribed journals in ScienceDirect.
IEEE API Portal
API portal for IEEE
JSTOR Data for Research
Data for Research (DfR) provides datasets of content on JSTOR for use in research and teaching. Researchers may use DfR to define and submit their desired dataset to be automatically processed. Data available through the service includes metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets on JSTOR. Datasets are produced at no cost to researchers and may include data for up to 25,000 documents.
LexisNexis Web Services Kit
LexisNexis Web Services Kit is a mediated service that allows bulk download of Nexis Uni content (formerly LexisNexis Academic). Without the API, Nexis Uni allows up to 250 documents and 1,000 metadata downloads. Contact your subject librarian for access to the LexisNexis Web Services Kit.
PLOS API
Python tool for downloading/updating/maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of doing web scraping.
ProQuest TDM Studio
ProQuest TDM (Text and Data Mining) Studio allows you to create and analyze datasets from ProQuest content.
RavenPack News Analytics
Use for financial and economic analysis. Access through WRDS.
News and Social Media Datasets
Listing of additional news and social media datasets available at Princeton.
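
For readers new to the n-gram and word-count data mentioned under JSTOR above, the following is a minimal Python sketch of how such counts can be derived from a piece of text. It is illustrative only: the tokenization rule, the function names, and the sample sentence are assumptions made for this example, not a description of how JSTOR builds its datasets.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives plain word counts)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Sample sentence invented for this example.
sample = "Text mining turns text into data, and data into evidence."
tokens = tokenize(sample)

word_counts = ngram_counts(tokens, 1)    # unigrams, i.e. word counts
bigram_counts = ngram_counts(tokens, 2)  # pairs of adjacent words

print(word_counts["text"])           # how often "text" occurs
print(bigram_counts["text mining"])  # how often "text mining" occurs
```

Counts like these discard word order beyond the n-gram window, which is why they can often be shared for content whose full text cannot be.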

Freely available text data sources

arXiv Bulk Data
Our mission is to provide rapid dissemination of scientific results at no cost to authors or readers. Providing free Application Programming Interfaces (APIs) helps us to advance that mission by enabling platforms and projects that extend the discoverability of arXiv e-prints and provide valuable services to scientists and interested readers.
Books to Scrape
Demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
CORE: Open Access Research Papers
CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. Full datasets available by request.
HathiTrust Research Center Analytics
Supports large-scale computational analysis of the works in the HathiTrust Digital Library to facilitate non-profit and educational research.
Internet Archive
Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more.
Library of Congress (LC) for Robots
We hope this list of APIs, bulk downloads, and tutorials will help you begin exploring the many ways the Library of Congress provides machine-readable access to its digital collections.
New York Times Developer Network
All the APIs fit to post.
Project Gutenberg Robot Access
Project Gutenberg is a library of over 60,000 free eBooks. Information about robot access to our pages outlines allowable automated access to content.
PubMed APIs
PMC hosts a number of important article datasets and makes our APIs and some code available via public code repositories.
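
Several of the sources above, Project Gutenberg among them, publish rules for automated access. As a minimal sketch of how a script might check such rules before downloading, the example below uses Python's standard `urllib.robotparser` on a small robots.txt. The rules, the `example.org` URLs, and the user-agent name are all hypothetical; always consult the actual policy of the site you plan to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. A real script would
# load the site's own file with parser.set_url(...) followed by parser.read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-research-bot"  # hypothetical name for your script

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# Ask whether this user agent may fetch each URL under the rules above.
allowed = parser.can_fetch(USER_AGENT, "https://example.org/ebooks/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.org/private/report")

# Requested pause between requests in seconds, or None if the file sets none.
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

A robots.txt file covers only part of a site's policy. Terms of use and bulk-download instructions, where a source provides them, take precedence over anything a script infers.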

Social media and the web

For data collection from social media, researchers typically use the public APIs that the platforms provide, such as the following:

Twitter API
Google Blogger
Internet Archive Bulk Download
Download files from archive.org in an automated way using wget.
reddit API
Access data from posts, threads, comments, users and more from reddit and subreddits.
Historical Reddit data has been collected at http://files.pushshift.io/reddit/ as monthly CSV downloads.
Stanford Large Network Dataset Collection (SNAP)
The SNAP library has collected data on large social and information networks since 2004.
Twitter Streaming APIs
Public streams provide access to public data flowing through Twitter. Suitable for following specific users or topics, and data mining. You can also access single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.
Wikipedia Data Dumps
Monthly database backups of all Wikimedia wikis in various formats.
Yelp API
Access to business data, including location, photos, Yelp rating, price levels, hours of operation, and types of transactions. Also includes a Review API, which returns up to 3 review excerpts for a business.
Blog Authorship Corpus
Over 600,000 posts from more than 19,000 bloggers.

Government documents

Congress.gov API
The Congress.gov API includes bills, amendments, summaries, Congress, members, the Congressional Record, committee reports, nominations, treaties, and House Communications. Over time we will be adding hearing transcripts and Senate Communications. Sign up for a free API key to use.
ProQuest Congressional Text 1824-2020
Full text of United States Congressional hearings (both House and Senate), 1824-2020, as extracted by ProQuest from its various Congressional hearings collections and delivered in bulk as XML files. Politics Librarian Jeremy Darrington pre-processed the data to extract individual hearing files, rename them by hearing ID, and group them into folders by decade. By accessing the data, you agree to abide by the included Terms of Use file. Read it thoroughly before use.
CourtListener API / Bulk Legal Data
Opinions, docket files, and more from 420 courts.
FDSys Bulk Download
Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE) and more.
Harvard Caselaw Access Project
Includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States. Scope includes all state courts and federal courts. Research scholars can qualify for bulk data access by agreeing to certain use and redistribution restrictions. You can request a bulk access agreement by creating an account and then visiting your account page.
U.S. Department of the Interior. Bureau of Land Management. General Land Office Records.
BLM GLO provides direct access to all of the data behind the glorecords.blm.gov website with a series of web service methods. XML format.
Voxgov
Provides access to real-time documents, press releases, and social media posts from candidates for Congress and governor across the U.S. Options to compare candidates and groups (e.g. Senate Democrats vs. Republicans), filter by geography or demographics, and to generate term frequency charts and word clouds. (Princeton-subscription resource)
United States Patent & Trademark Open Data Portal
"Open data" is publicly available data that is structured in a way that enables the data to be fully discoverable and usable by end users. It can be freely used, reused and redistributed by anyone. Its value lies not only in what it does today, but also in what it can do in the future. It is a valuable national resource and a strategic asset to the federal government, its partners, and the public.
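
Several sources in this section deliver their content as XML, including the ProQuest Congressional hearings and the BLM GLO web services. The following is a minimal Python sketch of pulling plain text out of an XML record with the standard library before analysis. The element names, the `id` attribute, and the sample record are hypothetical; real files follow each provider's own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical hearing record, invented for this example.
SAMPLE_XML = """\
<hearing id="H-0001">
  <title>Example Hearing on Data Access</title>
  <body>
    <p>The committee met at 10 a.m.</p>
    <p>Testimony was <i>received</i> from two witnesses.</p>
  </body>
</hearing>
"""


def extract_text(xml_string):
    """Return all text in the record as one string, with markup removed."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element in document order, including nested ones.
    pieces = (piece.strip() for piece in root.itertext())
    # Drop whitespace-only fragments left over from indentation.
    return " ".join(piece for piece in pieces if piece)


record_id = ET.fromstring(SAMPLE_XML).get("id")
text = extract_text(SAMPLE_XML)

print(record_id)
print(text)
```

Joining fragments with a single space is a simplification: it can insert a space where an inline tag splits a word, so check a few records by hand before processing a full collection.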