“Where can I find data?” may well be the question students ask me most often. It’s a good question, and although the answer changes over time, there are some relatively stable sites to point to. This page summarises some of those options as a starting point for further exploration. All sites have been manually checked against minimal criteria: the URL works and the data is accessible. Other initiatives like UNICC are a matter of “watch this space” and are not yet included. Consider these as you begin to explore what others have created, so that you can make sure your work complements and/or extends what has gone before, and that we continue to build a great cathedral of cumulative science together!

In short, this is a sign-posting service to accessible data.

    ┌──────────────────┐
    │ ← DATA THIS WAY  │
    └────────┬┬────────┘
             ││
    ┌────────┴┴────────────┐
    │  AND MORE THIS WAY → │
    └────────┬┬────────────┘
             ││
             ││
─────────────┘└──────────────

In brief:

  • Title: “Where Can I Find Data?”
  • By: Mark Gotham, 2025
  • Licence: CC-By-SA
  • URLs last accessed and verified: October 2025
  • Suggestions/contributions: … are welcome! PR or email (contact details here)

Top Tips

Practical:

  • Check many sources, not just one! If in doubt, start with Kaggle/Hugging Face for most beginner projects … but do branch out to check others once you settle on a topic.
  • You may have access to more resources than you think. E.g., some databases require a subscription or similar, but school/university/city library membership might unlock access.
  • Accessing data with APIs takes a bit more technical overhead, but that’s often worthwhile, especially if you want structured data from/about social media.

Ethical:

  • Check licences (e.g., MIT) before using data. Most of the datasets on the repositories listed here will have permissive licences that enable research (at least!), but always check.
  • “Scraping” websites is possible, but it should be a last resort, used only for want of a better option and after checking for any indication of permissions/restrictions (e.g., the site’s robots.txt file).
  • … and many more! E.g., among your ethical considerations should be how the creators of a dataset made it.
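
To make the robots.txt check concrete, here is a minimal sketch using Python’s standard-library urllib.robotparser. The robots.txt content, the bot name, and the example.org URLs are all made up for illustration; in practice you would load the real file from the site you have in mind. Passing this check is only one indication: it does not replace reading the site’s terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# For a real site you would call parser.set_url(".../robots.txt") and parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer permission queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-research-bot" is an assumed user-agent name, not a registered one.
may_fetch_public = parser.can_fetch(
    "my-research-bot", "https://example.org/data/table.csv"
)
may_fetch_private = parser.can_fetch(
    "my-research-bot", "https://example.org/private/users.csv"
)
# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay("my-research-bot")

print(may_fetch_public, may_fetch_private, delay)
```

If a path is disallowed, treat that as a “no” and look for an official download or API instead.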

Sensible:

  • In working with data, it’s likely you’ll end up expanding and/or improving it. Having gone to that trouble, find a way to contribute your work back into the ecosystem! I.e., this page can double as a “where can I put my data?” guide.

Publicly Accessible Data Sources for Analysis

For the Data Itself

✅ Hundreds of thousands of datasets, freely downloadable, across many formats and topics.

❌ Quality varies. Most are user-uploads and not subject to quality control (see ‘Paired Academic Publications’ below).
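
Since quality varies, it is worth running a few quick checks on any file before building on it. The following is a rough sketch, assuming the dataset is a small CSV file; the function name and the sample rows are invented for illustration and are not taken from any real dataset. It counts rows, empty cells per column, and exact duplicate rows using only the standard library.

```python
import csv
import io


def summarise_csv(text: str) -> dict:
    """Return basic sanity checks for CSV text: size, empty cells, duplicate rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])

    # Count cells that are empty (or absent) in each column.
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }

    # Count rows that exactly repeat an earlier row.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicate_rows += 1
        seen.add(key)

    return {
        "rows": len(rows),
        "columns": columns,
        "missing": missing,
        "duplicate_rows": duplicate_rows,
    }


# Made-up sample data: one empty cell and one repeated row.
SAMPLE = """\
country,year,value
Ghana,2020,3.1
Peru,2020,
Ghana,2020,3.1
"""

report = summarise_csv(SAMPLE)
print(report)
```

Checks like these will not tell you whether the data is right, only whether it is obviously broken, so they complement rather than replace a paired publication.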

Paired Academic Publications

✅ Many datasets (e.g., those at the platforms above) are paired with an academic publication. Having an associated academic publication (that has been properly reviewed) provides some quality control.

❌ The academic journals themselves tend not to host the data (so you need the paired dataset above). Conversely, there are many great datasets without a paired academic publication.


Knowledge Graphs and Metadata

✅ Knowledge graphs and metadata support sharing of and search over structured representations of knowledge. Individual datasets like those above may not adhere to such best practices.

❌ Use for structure and relationships, not (usually) as a primary source of data.

  • Wikidata
    Free, collaborative knowledge base (not Wikipedia). Use for: Entity relationships, multilingual facts, ontology.

  • Europeana
    Cultural heritage data (art, books, music). 10M+ items from European museums/libraries.
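
As an illustration of querying a knowledge graph, here is a hedged sketch for Wikidata’s public SPARQL endpoint. It only builds the request URL and parses a response in the standard SPARQL JSON results layout, so it runs without network access. The query is the familiar “instances of house cat” example (P31 is “instance of”, Q146 is “house cat”); the sample response is a hand-written placeholder, not real Wikidata output, and the function names are my own.

```python
import json
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of (P31) house cat (Q146), with labels.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def build_query_url(query: str) -> str:
    """Build a GET URL asking the endpoint for JSON-formatted results."""
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


def parse_bindings(response_text: str) -> list[dict]:
    """Flatten SPARQL JSON results into one plain dict per result row."""
    payload = json.loads(response_text)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


# Placeholder response in the SPARQL JSON results shape (not real data).
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [{
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE"},
        "itemLabel": {"type": "literal", "value": "example cat"},
    }]},
})

url = build_query_url(QUERY)
rows = parse_bindings(SAMPLE_RESPONSE)
print(url)
print(rows)
```

To run it for real, you would fetch `url` with an HTTP client of your choice, sending a descriptive User-Agent header, and pass the response body to `parse_bindings`.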


Aggregated and Processed Data

✅ Pre-processed, visualised, and updated by experts.

❌ Where the raw data is not provided, this leaves you less scope to explore your own analysis.

  • Gapminder

    Global development data (health, economy). Includes interactive tools and datasets.

  • Our World in Data (OWID)

    Rigorous, source-verified datasets on global issues (climate, inequality, …). Most of the studies include an option to download the data used directly, or (failing that) at least provide a redirect to the third-party website hosting that data.

  • Google Trends

    See what the world is searching for by topic (and also region, time, …). Free, but limited to trends (not raw counts). See also the Google Books Ngram Viewer.


❌ Not easily available to all.

✅ May be more available than you think, e.g., via your local library.

  • Factiva

    News, business, and financial data. Available via university libraries.

  • LexisNexis

    Legal, news, and business databases. Typically included in academic library subscriptions.


Further Specific Sources by Type

✅ Specific sources such as government data tend to be free, high-quality and as authoritative (for their subject) as can be expected.

❌ Given the specific focus, the overall scope is more limited than Kaggle, for instance.


Government and Similar


Geospatial (Location-Based) Data


Economic and Financial Data


Health and Life Sciences


APIs for Social Media and More

✅ Social media is a rich repository of cultural trends and more. Many social media and related companies offer APIs that provide structured access to the data the company is content to share (and only that data).

❌ Not all APIs are equal, “free” can be a misnomer, and APIs are more technically demanding than direct downloads.

  • Academic Torrents

    API or similar (CLI) required. Provided by a non-profit.

  • GDELT Project

    Global event data across the media (broadcast, print, and web).

  • Reddit API

    Access posts/comments through the Reddit API, e.g., via PRAW (the Python Reddit API Wrapper). Note: restrictions have been tighter since 2023, e.g., OAuth2 and a Reddit account are required.

  • “X” / “Twitter” API

    ‘Free’, but be wary of rate limits.

  • Facebook

    Not publicly accessible. There are workarounds, but they are generally best avoided.

  • Many other companies like Spotify

    Restrictions apply and the data coverage can change without warning (it has done so in this case). See also secondary representations such as https://spotifyplaylistarchive.com/.
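
On rate limits: a common way to stay on the right side of them is to wait and retry with increasing delays (“exponential backoff”). Below is a minimal sketch of that idea, not any particular API’s official client. The RateLimited exception, the call_with_backoff helper, and the fake fetch function are all hypothetical; a real client would raise or detect something equivalent when the server answers HTTP 429 (“Too Many Requests”).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that the API refused the call due to rate limiting."""


def call_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), doubling the wait after each rate-limit refusal."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt.
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


# Demonstration with a fake API that refuses the first two calls.
calls = {"count": 0}


def fake_fetch() -> dict:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited()
    return {"status": "ok"}


waits: list[float] = []  # Record the waits instead of actually sleeping.
result = call_with_backoff(fake_fetch, sleep=waits.append)
print(result, waits)
```

Where an API documents its own limits or returns a “retry after” hint, follow that in preference to a generic schedule like this one.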


That’s all folks! Thanks for reading! Have you enjoyed this? Is it missing something? Suggestions are welcome: PR or email (contact details here).