Relationship of federal support to PhD production at the field-year level, 1970–2022

| | Ln(PhD graduates) | | | Ln(Publications) | |
|---|---|---|---|---|---|
| Variable | (1) All | (2) All | (3) Non-USG | (4) | (5) |
| Ln(USG-supported PhDs) | 0.437*** | 0.770*** | 0.623*** | | |
| | (0.015) | (0.021) | (0.037) | | |
| Ln(Non-USG PhDs) | 0.534*** | | | | |
| | (0.021) | | | | |
| Ln(Past 20 years' PhDs) | | | | 0.253*** | |
| | | | | (0.098) | |
| Ln(Past 20 years' USG PhDs) | | | | | 0.246** |
| | | | | | (0.098) |
| N | 901 | 901 | 901 | 900 | 900 |
| F-stat | 133.81 | 147.14 | 147.14 | 622.98 | 342.38 |
| Field FEs | Y | Y | Y | Y | Y |
| Year FEs | Y | Y | Y | Y | Y |
Funding the U.S. Scientific Training Ecosystem: New Data, Methods, and Evidence
Every year, the United States produces roughly 30,000 new STEM PhDs—the scientists and engineers who will pioneer the next generation of discoveries in artificial intelligence, biotechnology, quantum computing, and countless other fields. Yet despite the critical role these researchers will play in advancing human knowledge, we lack a comprehensive understanding of the funding landscape that makes their training possible.
Using data from the near-population of U.S. STEM PhD dissertations since 1950, this research provides the first comprehensive mapping of doctoral funding sources across seven decades. Our analysis creates a new, publicly available dataset that enables systematic research into how scientific training is financed, with implications that extend far beyond individual career outcomes to the very structure of knowledge production in America.
Who funds PhD training in the United States?
We find that the U.S. federal government is by far the largest sponsor of STEM PhD training. Over 40% of graduates acknowledge direct government support, compared to roughly 10% from industry and 15% from non-profits. The National Science Foundation and the National Institutes of Health are each acknowledged by more PhD graduates than the entire commercial sector. This pattern of government dominance holds across most universities and fields, though with notable variation across disciplines.
The balance between federal and industry funding varies markedly across scientific fields, reflecting both scientific priorities and commercial interests. Automotive engineering and pharmaceutical science, for instance, receive more industry support than government funding, while astronomy and astrophysics are almost entirely government-supported, with essentially no private backing. Fields such as geology and materials science draw a balanced mix from both sectors.
These patterns become clearer when examining specific funding organizations. The following table shows that traditional government agencies like NSF, NIH, and DoD each dwarf private sector and philanthropic contributions to doctoral education, despite the prominence of major technology companies and foundations in public discourse about scientific research.
| Government agencies | Count | Firms | Count | Non-profit organizations | Count |
|---|---|---|---|---|---|
| National Science Foundation | 91,895 | Intel | 2,276 | Howard Hughes Medical Institute | 3,733 |
| Department of Health and Human Services | 78,033 | IBM | 1,943 | American Heart Association | 3,677 |
| Department of Defense | 34,103 | Merck | 1,417 | Sigma Xi | 2,790 |
| Department of Energy | 30,544 | | 1,300 | American Cancer Society | 1,568 |
| National Aeronautics and Space Administration | 15,044 | Microsoft | 1,221 | American Chemical Society | 1,453 |
| Department of Agriculture | 13,455 | Pfizer | 1,180 | Geological Society of America | 1,388 |
| Department of Commerce | 8,928 | General Electric | 980 | Robert Wood Johnson Foundation | 1,181 |
| Department of the Interior | 6,890 | DuPont | 875 | W. M. Keck Foundation | 1,104 |
| Environmental Protection Agency | 5,638 | Dow Chemical | 847 | Fulbright Program | 953 |
| Department of Transportation | 5,292 | Eli Lilly | 822 | David and Lucile Packard Foundation | 943 |
| Department of Education | 4,615 | Chevron | 821 | Welch Foundation | 910 |
| Department of State | 4,248 | ExxonMobil | 775 | Burroughs Wellcome Fund | 897 |
| Department of Veterans Affairs | 2,220 | GlaxoSmithKline | 765 | Gordon and Betty Moore Foundation | 856 |
| Agency for International Development | 1,617 | Novartis | 753 | Ford Foundation | 818 |
| Department of Homeland Security | 1,369 | Boeing | 718 | National Geographic Society | 760 |
PhD production in critical technology areas
Using our classification of graduates into 18 critical technology areas identified by the White House, we map the institutional landscape training scientists in AI, quantum computing, biotechnology, and other strategically important fields. MIT, Stanford, and UC Berkeley emerge as the top producers across multiple technology areas, while federal agencies—led by NSF and DoD—are the primary funders in nearly every critical technology domain. The data reveal both the concentration of critical technology training at elite institutions and the government’s outsized role in developing national technological capabilities.
Effect of government funding on PhD production
Leveraging variation in federal agencies’ funding priorities and budget fluctuations over time, we estimate that total PhD production rises nearly in proportion to government support: a 10% increase in government-funded graduates leads to a 7.5% increase in total PhD production. This indicates either that federal investment crowds in additional private support, or that government funding is more prevalent than acknowledgments suggest. Either interpretation points to the same conclusion: public investment is the primary lever determining the size and composition of America’s scientific workforce, with government funding decisions today directly shaping research capacity for decades to come.
Summary
These findings reveal the federal government as the dominant architect of America’s scientific workforce, with funding decisions today directly shaping the research enterprise for decades to come. As policymakers consider investments in artificial intelligence, quantum computing, and other emerging fields, our data provide the empirical foundation to understand how funding choices translate into scientific talent. The methods we’ve developed can now track this ecosystem in real time, offering policymakers and institutions unprecedented visibility into how public investment influences the scale and direction of research across fields, regions, and time.
Methodology
Data Collection and Sample Construction
We compiled a near-population dataset of U.S. STEM PhD graduates from ProQuest Dissertations & Theses Global (PQDT), supplemented with dissertations from individual university repositories. Our sample contains 1.17 million dissertations from 1950 to 2022, filtered to natural sciences and engineering fields at R1/R2 Carnegie-classified institutions. We obtained full dissertation text for about 870,000 graduates (75% of the sample, rising to 96% post-2000). To validate sample completeness, we compared annual graduate counts to the Survey of Earned Doctorates, finding close alignment until 2010 and 90% coverage through 2022.
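The coverage comparison above reduces to a per-year ratio of sample counts to benchmark totals. A minimal sketch, assuming hypothetical inputs (a list of graduation years for the sample and a dictionary of benchmark totals standing in for the SED figures):

```python
from collections import Counter

def coverage_by_year(grad_years, benchmark_totals):
    """Compare annual graduate counts in a sample to benchmark totals.

    grad_years: iterable of graduation years, one entry per dissertation.
    benchmark_totals: dict mapping year -> benchmark count (a stand-in
    for the Survey of Earned Doctorates published totals).
    """
    counts = Counter(grad_years)
    return {year: counts.get(year, 0) / total
            for year, total in benchmark_totals.items()}

# Toy data: near-complete coverage early, ~90% later, as described above.
sample = [2009] * 98 + [2015] * 90
print(coverage_by_year(sample, {2009: 100, 2015: 100}))
# → {2009: 0.98, 2015: 0.9}
```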
Critical Technology Classification
We developed an unsupervised large language model pipeline to classify dissertations by their relationship to 18 critical technology areas identified by the White House Office of Science and Technology Policy. Using GPT-4o-mini, we first generated standardized one-sentence summaries of each dissertation based on titles and abstracts. We then applied zero-shot classification to assess relevance to each technology area, followed by a second-stage filter that mapped dissertations to specific technology subfields. This two-stage approach classified 42.7% of graduates to at least one critical technology area, with validation showing strong correlations between our classifications and universities’ publication patterns in corresponding fields.
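The two-stage flow can be sketched as below. The prompts, area list, and `llm` callable are illustrative stand-ins, not the authors' actual prompts or the real OSTP area definitions; a real pipeline would pass a model-API client in place of the stub:

```python
# Illustrative critical technology areas (the real list has 18 entries).
CRITICAL_AREAS = ["artificial intelligence", "quantum information science",
                  "biotechnology"]

def classify_dissertation(title, abstract, llm):
    """Sketch of the two-stage classification: summarize, then zero-shot
    relevance per area, then a second-stage subfield filter."""
    # Stage 0: standardized one-sentence summary from title and abstract.
    summary = llm(f"Summarize in one sentence: {title}. {abstract}")
    # Stage 1: zero-shot yes/no relevance to each critical technology area.
    relevant = [
        area for area in CRITICAL_AREAS
        if llm(f"Is this relevant to {area}? Answer yes or no.\n{summary}") == "yes"
    ]
    # Stage 2: map each relevant area to a specific technology subfield.
    return {area: llm(f"Which {area} subfield best fits this work?\n{summary}")
            for area in relevant}

# Stubbed "LLM" for demonstration only; a real pipeline calls a model API.
def fake_llm(prompt):
    if prompt.startswith("Summarize"):
        return "A study of deep neural networks for protein folding."
    if "quantum" in prompt:
        return "no"
    if "Answer yes or no" in prompt:
        return "yes"
    return "machine learning"

print(classify_dissertation("Deep nets for proteins", "An abstract.", fake_llm))
```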
Research Sponsor Identification
To extract funding information, we processed dissertation full text using a six-step pipeline combining rule-based text processing with large language models. We first isolated potential acknowledgment sentences using keyword matching, then employed Solar 10.7B and Smaug 34B models to identify supporting organizations, classify them by sector (government, industry, non-profit), and extract grant identifiers. We consolidated entity names using additional LLM processing and linked organizations to external registries including the Research Organization Registry (ROR) and Wikidata. This process extracted 9.3 million organizational mentions from 11 million acknowledgment sentences.
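The first, rule-based step of this pipeline can be sketched as follows. The keyword list and sentence splitter here are illustrative assumptions, not the published implementation:

```python
import re

# Keywords flagging candidate acknowledgment sentences (illustrative list;
# the actual pipeline's keyword set is not reproduced here).
FUNDING_KEYWORDS = ("funded", "supported", "grant", "fellowship", "thank")

def candidate_sentences(text):
    """Step 1 of the pipeline: isolate sentences that may acknowledge a
    sponsor, before handing them to an LLM for entity extraction."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if any(k in s.lower() for k in FUNDING_KEYWORDS)]

text = ("I thank my advisor for her guidance. "
        "This work was supported by an NSF grant. "
        "Chapter 2 studies polymer membranes.")
print(candidate_sentences(text))
# → ['I thank my advisor for her guidance.',
#    'This work was supported by an NSF grant.']
```

Only the flagged sentences are passed to the LLM stage, which keeps model costs proportional to acknowledgment text rather than full dissertation length.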
Validation
We validated our sponsor identification against multiple benchmarks, including manual review of 500 dissertations, NSF Graduate Research Fellowship awardee lists (finding 98.6% accuracy among those mentioning NSF), and university-level comparisons with the Survey of Graduate Students and Postdoctorates in Science and Engineering (showing strong correlations across agencies).
Causal Analysis
To estimate causal effects of federal funding on PhD production, we employed a shift-share instrumental variables design that exploits variation in agencies’ field-specific funding priorities and annual budget fluctuations. The instrument leverages each university-field’s historical share of graduates from specific agencies, interacted with those agencies’ total annual graduate support, to identify exogenous variation in federal funding exposure.
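The instrument's construction can be sketched in stylized form: each field's base-period share of graduates funded by an agency (the "share"), interacted with that agency's total annual graduate support (the "shift"), summed over agencies. Variable names and toy numbers below are illustrative assumptions, not the paper's data:

```python
def shift_share_instrument(base_shares, agency_totals):
    """Stylized shift-share instrument.

    base_shares: {field: {agency: share of the field's base-period
        USG-supported graduates funded by that agency}}.
    agency_totals: {agency: {year: total graduates the agency supported
        nationally that year}}.
    Returns {field: {year: predicted federal funding exposure}}.
    """
    years = next(iter(agency_totals.values())).keys()
    return {
        field: {
            year: sum(share * agency_totals[agency][year]
                      for agency, share in shares.items())
            for year in years
        }
        for field, shares in base_shares.items()
    }

# Toy example: a field historically reliant on NSF and NASA.
base_shares = {"astronomy": {"NSF": 0.6, "NASA": 0.4}}
agency_totals = {"NSF": {1980: 1000, 1981: 1200},
                 "NASA": {1980: 500, 1981: 400}}
print(shift_share_instrument(base_shares, agency_totals))
# → {'astronomy': {1980: 800.0, 1981: 880.0}}
```

Because the shares are fixed in a base period and the shifts are national agency totals, the predicted exposure moves only with agency-level budget fluctuations, not with field-specific demand shocks.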
Acknowledgments
We thank Bhaven Sampat and Bruce Weinberg for helpful conversations, as well as audiences at the ICSSI annual conference, the Summer School on Data and Algorithms for Science, Technology & Innovation Studies conference, and the NBER Investments in Early Career Scientists meeting for comments. We also thank James Dunham and colleagues at the Center for Security and Emerging Technology for insights related to technology classification; Michelle Qiu and Max Murakami-Moses for research assistance; and the Duke University Fuqua School of Business, the University of Toronto Rotman School of Management, the UC Berkeley Technology Competitiveness and Industrial Policy Center, the Alfred P. Sloan Foundation, and the National Science Foundation (Grant No. 2420824) for financial support. All errors are our own.