Data
I3 BigQuery Data Repository
Together with Matt Marx
The I3 BigQuery Data Repository is designed to make it easier for researchers to work with large-scale innovation datasets. Instead of wrestling with massive files and local storage limitations, users can access curated datasets directly in the cloud using Google BigQuery. It’s fast, scalable, and works seamlessly with SQL, Python, and R—making it easy to process data, collaborate with coauthors, and ensure reproducibility. The repository includes both raw data (like OpenAlex and PatentsView) and derived datasets contributed by the research community, such as DISCERN and Reliance on Science. The goal is to lower barriers to entry and foster a more collaborative, efficient research environment, with workshops, tutorials, and new tools on the horizon.
To access the data, star the nber-i3 project in the BigQuery Console.
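Once the project is starred, tables can also be queried from Python. The following is a minimal sketch, not an official client for the repository. It assumes the `google-cloud-bigquery` package (plus `pandas` and `db-dtypes` for DataFrame output) is installed and that you have your own Google Cloud project to bill queries to. Only the project ID `nber-i3` comes from the text above. The helper names and the dataset and table names in the usage line are hypothetical placeholders, so check the BigQuery Console for the actual names.

```python
import re

PROJECT_ID = "nber-i3"  # project name as given in the text above

# Conservative pattern for dataset/table identifiers (letters, digits, underscores).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_preview_query(dataset: str, table: str, limit: int = 10) -> str:
    """Return a SQL string that previews the first `limit` rows of a table.

    `dataset` and `table` are supplied by the caller; this sketch does not
    know which datasets or tables the repository actually contains.
    """
    for name in (dataset, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Backticks are needed because the project ID contains a hyphen.
    return f"SELECT * FROM `{PROJECT_ID}.{dataset}.{table}` LIMIT {int(limit)}"


def run_query(sql: str, billing_project: str):
    """Run `sql` and return the result as a pandas DataFrame.

    `billing_project` is YOUR Google Cloud project ID (an assumption of this
    sketch); query costs are charged to it, not to nber-i3.
    """
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return client.query(sql).to_dataframe()
```

For example, `run_query(build_preview_query("example_dataset", "example_table"), "my-billing-project")` would preview ten rows, where all three arguments are placeholders to replace with real names. Keeping the query builder separate from the network call makes it easy to inspect the SQL before anything is billed.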
Join our Google Group for updates and discussions.
Panel Data on U.S. STEM PhD graduates, 1950-2022
Together with Hansen Zhang, Lee Fleming and Daniel P. Gross
This dataset leverages a near-population sample of U.S. STEM PhD dissertations (1950-2022) to document the scientific training ecosystem. Using large language models, we extract information from dissertation acknowledgments to identify research sponsors and classify graduates by their association with 18 critical technology areas defined by the White House Office of Science and Technology Policy. The repository provides three panel datasets tracking PhD production by field, university, and time, including measures of government, industry, and non-profit support, as well as graduates’ contributions to AI, quantum science, biotechnology, and other strategically important fields.
The data is available for download here
DISCERN 2.0: Duke Innovation & SCientific Enterprises Research Network
Together with Sharon Belenzon, Larisa Cioaca, Lia Sheer and John Shin
The DISCERN dataset, now in version 2.0, links U.S. public firms’ data from Compustat to their patents and scientific publications, providing comprehensive coverage of subsidiaries and ownership changes over time. This latest update extends coverage to 2021, transitions to open-access sources (PatentsView for patents and OpenAlex for publications), and uses SEC filings for more reliable subsidiary data. DISCERN 2.0 is freely available under the O-UDA-1.0 License, allowing unrestricted use for research and commercial purposes. This update significantly enhances the dataset’s utility for studying corporate innovation and R&D investment trends.
The data is available for download here