Where Can I Find Data?
“Where can I find data?” may well be the question I am most often asked by students. It’s a good question, and although the answer changes over time, there are some relatively stable sites to point to. This page summarises some of those options as a starting point for further exploration. All sites have been manually checked against minimal criteria: URL works and data accessible. Other initiatives like UNICC are a matter of “watch this space” and not yet included. Consider these as you begin to explore what others have created, so that you can make sure your work complements and/or extends what has gone before, and that we continue to build a great cathedral of cumulative science together!
In short, this is a sign-posting service to accessible data.
┌──────────────────┐
│ ← DATA THIS WAY │
└────────┬┬────────┘
││
┌────────┴┴────────────┐
│ AND MORE THIS WAY → │
└────────┬┬────────────┘
││
││
─────────────┘└──────────────
In brief:
- Title: “Where Can I Find Data?”
- By: Mark Gotham, 2025
- Licence: CC BY-SA
- URLs last accessed and verified: October 2025
- Suggestions/contributions: … are welcome! PR or email (contact details here)
Top Tips
Practical:
- Check many sources, not just one! If in doubt, start with Kaggle/Hugging Face for most beginner projects … but do branch out to check others once you settle on a topic.
- You may have access to more resources than you think. E.g., some databases require a subscription or similar, but school/university/city library membership might unlock access.
- Accessing data with APIs takes a bit more technical overhead, but that’s often worthwhile, especially if you want structured data from/about social media.
Ethical:
- Check licences (e.g., MIT) before using data. Most of the datasets on the repositories listed here will have permissive licences that enable research (at least!), but always check.
- “Scraping” websites is possible but should be considered a last resort, used only for want of a better option and after checking for any indication of permission/restrictions (e.g., a robots.txt file).
- … and many more! E.g., among your ethical considerations should be how the creators made their datasets.
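As a minimal sketch of the robots.txt check mentioned above, Python's standard library includes a parser for these files. The rules and URLs below are hypothetical, made up purely for illustration. For a real site you would load the live file instead (see the comment in the code), and a permissive robots.txt is not the same thing as a licence to reuse the data.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    # Parse rules already held in memory. For a live site you would
    # instead call parser.set_url("https://<site>/robots.txt") and
    # then parser.read().
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    for url in (
        "https://example.org/data/table.csv",
        "https://example.org/private/users.csv",
    ):
        print(url, is_allowed(EXAMPLE_RULES, url))
```

Treat a result of False as a clear “no”, and a result of True as “no objection recorded here”, which still leaves the site's terms of use to check.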
Sensible:
- In working with data it’s likely you’ll end up expanding and/or improving it. Having gone to that trouble, find a way to contribute your work back into the ecosystem! I.e., this page can double as a “where can I put my data?”
Publicly Accessible Data Sources for Analysis
For the Data Itself
✅ Hundreds of thousands of datasets, freely downloadable, across many formats and topics.
❌ Quality varies. Most are user-uploads and not subject to quality control (see ‘Paired Academic Publications’ below).
- Zenodo
“Built and operated by CERN and OpenAIRE to ensure that everyone can join in Open Science”.
- Open Science Framework (OSF)
A “free, open platform to support your research and enable collaboration”.
- Kaggle Datasets
Datasets as well as benchmarks, models and more.
- Hugging Face Datasets
ML-focus. Includes search by modality (text, audio, vision).
- GitHub Datasets
A small, curated list of public datasets hosted in GitHub repos. Many more datasets are available on GitHub beyond this list.
Paired Academic Publications
✅ Many datasets (e.g., those at the platforms above) are paired with an academic publication. Having an associated academic publication (that has been properly reviewed) provides some quality control.
❌ The academic journals themselves tend not to host the data (so you need the paired dataset above). There are many great datasets without a paired academic publication.
- Journals focussed entirely on datasets:
- Some high-profile examples include Data (MDPI), Scientific Data (Nature Portfolio), Earth System Science Data, and GigaScience
- Examples specific to the humanities include Journal of Open Humanities Data (JOHD)
- Humanities journals that include a “track” (or similar) specifically for introducing datasets
Knowledge Graphs and Metadata
✅ Knowledge graphs and metadata support sharing of and search over structured representations of knowledge. Individual datasets like those above may not adhere to such best practices.
❌ Use for structure and relationships, not (usually) as a primary source of data.
- Wikidata
Free, collaborative knowledge base (not Wikipedia). Use for: Entity relationships, multilingual facts, ontology.
- Europeana
Cultural heritage data (art, books, music). 10M+ items from European museums/libraries.
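To give a flavour of querying a knowledge graph, here is a minimal sketch that sends a SPARQL query to Wikidata's public query service. The query is the familiar introductory one (items that are an instance of “house cat”), and the `user_agent` string is a placeholder you should replace with something identifying your own project. The helper names are illustrative, not part of any Wikidata library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of (P31) house cat (Q146), with English labels.
CAT_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_url(query):
    """Encode the query as a GET request asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def run_query(query, user_agent="example-data-course-script/0.1"):
    """Send the query (needs network access) and return the parsed JSON."""
    # The user_agent value is a placeholder; use one describing your project.
    request = Request(build_url(query), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_labels(result):
    """Pull the human-readable labels out of a SPARQL JSON result."""
    return [row["itemLabel"]["value"] for row in result["results"]["bindings"]]


# Usage (requires network access):
#     print(extract_labels(run_query(CAT_QUERY)))
```

Note how the result is a set of relationships and labels rather than a ready-made table, which is why these sources work best for structure rather than as primary data.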
Aggregated and Processed Data
✅ Pre-processed, visualised, and updated by experts.
❌ Where the raw data is not provided, this leaves you less scope to explore your own analysis.
- Gapminder
Global development data (health, economy). Includes interactive tools and datasets.
- Our World in Data (OWID)
Rigorous, source-verified datasets on global issues (climate, inequality, …). Most of the studies include an option to download the data used directly, or (failing that) at least provide a redirect to the third-party website hosting that data.
- Google Trends
See what the world is searching for by topic (and also region, time, …). Free, but limited to trends (not raw counts). See also the Google Books Ngram Viewer.
Paid Resources (but often free via Libraries)
❌ Not easily available to all.
✅ May be more available than you think, e.g., via your local library.
- Factiva
News, business, and financial data. Available via university libraries.
- LexisNexis
Legal, news, and business databases. Typically included in academic library subscriptions.
Further Specific Sources by Type
✅ Specific sources such as government data tend to be free, high-quality and as authoritative (for their subject) as can be expected.
❌ Given the specific focus, the overall scope is more limited than Kaggle, for instance.
Government and Similar
- data.gov (USA)
250k+ datasets from US federal agencies (health, environment, economy).
- data.gov.uk (UK)
UK public sector data (crime, education, transport). Related UK sites include the UK Data Service, Police data, the Office for National Statistics (ONS), and the Open Data Institute (ODI) (as a partner, not a source).
- EU Open Data Portal
500k+ datasets from EU institutions (trade, environment, research).
- UN Data
UN-aggregated global statistics (demographics, trade, SDGs).
Geospatial (Location-Based) Data
- OpenStreetMap (OSM) Data
Free global map data for roads, buildings, and places of interest. Explore manually or engage computationally using osmnx (Python) or the Overpass API.
- NASA Earthdata
Satellite imagery, climate, and environmental data.
- NOAA Climate Data
Historical weather, ocean, and climate records.
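For the computational route into OpenStreetMap, here is a minimal sketch using the Overpass API directly with only the standard library. It assumes the public instance at overpass-api.de, and the tag (`amenity=library`) and bounding box are arbitrary illustrative choices. The function names are my own, not part of any OSM package.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# One of the public Overpass instances (an assumption; others exist).
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_query(key, value, bbox, timeout=25):
    """Build an Overpass QL query for nodes tagged key=value in a bounding box.

    bbox is (south, west, north, east) in decimal degrees.
    """
    south, west, north, east = bbox
    return (
        f"[out:json][timeout:{timeout}];\n"
        f'node["{key}"="{value}"]({south},{west},{north},{east});\n'
        "out body;"
    )


def fetch(query):
    """POST the query to the Overpass server (needs network access)."""
    body = urlencode({"data": query}).encode("utf-8")
    with urlopen(OVERPASS_URL, data=body, timeout=60) as response:
        return json.load(response)


def named_points(result):
    """Return (name, lat, lon) for every returned node that has a name tag."""
    points = []
    for element in result.get("elements", []):
        name = element.get("tags", {}).get("name")
        if name is not None:
            points.append((name, element["lat"], element["lon"]))
    return points


# Usage (requires network access; the bounding box is an arbitrary example):
#     query = build_query("amenity", "library", (51.50, -0.15, 51.53, -0.10))
#     print(named_points(fetch(query)))
```

Keep bounding boxes small while experimenting: the public servers are a shared resource.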
Economic and Financial Data
- FRED (Federal Reserve Economic Data)
US economic time series data (GDP, unemployment, interest rates, …). Most data is directly downloadable as CSV.
- OpenCorporates
Global company registry (100M+ entities).
- OpenSpending
“Spending” (financial) data about countries all around the world.
- World Bank Open Data
Many indicators for most countries across topics including poverty, education, and energy.
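Since much of this economic data comes as CSV, here is a minimal sketch of downloading and parsing a single FRED series with the standard library. The URL pattern is an assumption based on the site's CSV download link at the time of writing and may change. The sample text uses a made-up series name and made-up values, purely to show the shape of the file, including the “.” that marks a missing observation.

```python
import csv
import io
from urllib.request import urlopen

# Assumed URL pattern for the CSV download of one series; verify before use.
FRED_CSV_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"

# Made-up sample in the same two-column shape (not real data).
SAMPLE = "observation_date,EXAMPLE\n2000-01-01,1.5\n2000-02-01,.\n"


def download_series_csv(series_id):
    """Fetch the raw CSV text for one series (needs network access)."""
    with urlopen(FRED_CSV_URL.format(series_id=series_id), timeout=30) as resp:
        return resp.read().decode("utf-8")


def parse_series_csv(text):
    """Return (series_name, [(date, value_or_None), ...]) from CSV text.

    Columns are read by position, since the header of the date column
    has not always had the same name.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        date, raw = row[0], row[1]
        value = None if raw in ("", ".") else float(raw)
        rows.append((date, value))
    return header[1], rows


if __name__ == "__main__":
    name, observations = parse_series_csv(SAMPLE)
    print(name, observations)
```

To try it on a real series, pass the text returned by `download_series_csv("UNRATE")` (the unemployment rate) to `parse_series_csv`.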
Health and Life Sciences
- WHO (World Health Organization)
Global health statistics: diseases, mortality, healthcare access.
- CDC (Centers for Disease Control and Prevention) WONDER
US public health data: mortality, births, infectious diseases.
APIs for Social Media and More
✅ Social media is a great repository of cultural trends and more. Many social media and related companies offer APIs, which provide structured access to all (and only) the data the company is content to share.
❌ Not all APIs are equal, “free” can be a misnomer, and APIs are more technically demanding than direct downloads.
- Academic Torrents
API or similar (CLI) required. Provided by a non-profit.
- GDELT Project
Global event data across the media (broadcast, print, and web).
- Reddit API
Access posts/comments through the Reddit API, e.g., via PRAW (The Python Reddit API Wrapper). Note: Tighter restrictions since 2023, e.g., OAuth2 and a Reddit account are required.
- “X” / “Twitter” API
‘Free’, but be wary of rate limits.
- Facebook
Not publicly accessible. There are ways, but generally to be avoided.
- Many other companies like Spotify
Restrictions apply and the data coverage can change without warning (it has done so in this case). See also secondary representations such as https://spotifyplaylistarchive.com/.
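To make the Reddit entry above more concrete, here is a minimal sketch using PRAW in read-only mode. It assumes you have installed the third-party `praw` package and registered an app on Reddit to obtain a client ID and secret. The credential values, subreddit name, and helper names are all placeholders, and the current terms and rate limits are yours to check.

```python
def make_client(client_id, client_secret, user_agent):
    """Create a read-only Reddit client from your own app credentials."""
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def hot_titles(reddit, subreddit_name, limit=5):
    """Return (title, score) pairs for the current 'hot' posts."""
    listing = reddit.subreddit(subreddit_name).hot(limit=limit)
    return [(submission.title, submission.score) for submission in listing]


# Usage (requires network access and real credentials; values are placeholders):
#     reddit = make_client("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET",
#                          "my-research-script/0.1 by u/your_username")
#     for title, score in hot_titles(reddit, "datasets"):
#         print(score, title)
```

Keeping the client creation separate from the data-gathering function makes it easier to swap in saved data later, which matters when API access changes without warning.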
That’s all folks! Thanks for reading! Have you enjoyed this? Is it missing something? Suggestions are welcome: PR or email (contact details here).