NOTICE: This repo is automatically generated by apd-core. Please DO NOT modify this file directly. We have provided a new way to contribute to Awesome Public Datasets. Pull requests opened directly against this repo are no longer accepted.
This is a list of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in sindresorhus's awesome list.
Table of Contents
- Agriculture
- Biology
- Climate+Weather
- ComplexNetworks
- ComputerNetworks
- DataChallenges
- EarthScience
- Economics
- Education
- Energy
- Finance
- GIS
- Government
- Healthcare
- ImageProcessing
- MachineLearning
- Museums
- NaturalLanguage
- Neuroscience
- Physics
- Psychology+Cognition
- PublicDomains
- SearchEngines
- SocialNetworks
- SocialSciences
- Software
- Sports
- TimeSeries
- Transportation
- Complementary Collections
Biology
- 1000 Genomes
- American Gut (Microbiome Project)
- Broad Bioimage Benchmark Collection (BBBC)
- Broad Cancer Cell Line Encyclopedia (CCLE)
- Cell Image Library
- Complete Genomics Public Data
- EBI ArrayExpress
- EBI Protein Data Bank in Europe
- ENCODE project
- Electron Microscopy Pilot Image Archive (EMPIAR)
- Ensembl Genomes
- Gene Expression Omnibus (GEO)
- Gene Ontology (GO)
- Global Biotic Interactions (GloBI)
- Harvard Medical School (HMS) LINCS Project
- Human Genome Diversity Project
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- International HapMap Project
- Journal of Cell Biology DataViewer
- KEGG - KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
- MIT Cancer Genomics Data
- NCBI Proteins
- NCBI Taxonomy
- NCI Genomic Data Commons
- NIH Microarray data
- OpenSNP genotypes data
- Pathguide - Protein-Protein Interactions Catalog
- Protein Data Bank
- Psychiatric Genomics Consortium
- PubChem Project
- PubGene (now Coremine Medical)
- Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
- Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
- Sequence Read Archive (SRA)
- Stanford Microarray Data
- Stowers Institute Original Data Repository
- Systems Science of Biological Dynamics (SSBD) Database
- The Cancer Genome Atlas (TCGA), available via Broad GDAC
- The Catalogue of Life
- The Personal Genome Project
- UCSC Public Data
- UniGene
- Universal Protein Resource (UniProt)
Climate+Weather
- Actuaries Climate Index
- Australian Weather
- Aviation Weather Center - Consistent, timely and accurate weather information for the world airspace system
- Brazilian Weather - Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- European Climate Assessment & Dataset
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- NOAA SURFRAD Meteorology and Radiation Datasets
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WU Historical Weather Worldwide
- WorldClim - Global Climate Data
ComplexNetworks
- AMiner Citation Network Dataset
- CrossRef DOI URLs
- DBLP Citation dataset
- DIMACS Road Networks Collection
- NBER Patent Citations
- NIST complex networks data collection
- Network Repository with Interactive Exploratory Analysis Tools
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Small Network Data
- Stanford GraphBase
- Stanford Large Network Dataset Collection
- Stanford Longitudinal Network Data Sources
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
ComputerNetworks
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- CRAWDAD Wireless datasets from Dartmouth Univ.
- ClueWeb09 - 1B web pages
- ClueWeb12 - 733M web pages
- CommonCrawl Web Data over 7 years
- Criteo click-through data
- Internet-Wide Scan Data Repository
- OONI: Open Observatory of Network Interference - Internet censorship data
- Open Mobile Data by MobiPerf
- Rapid7 Sonar Internet Scans
- UCSD Network Telescope, IPv4 /8 net
DataChallenges
- Bruteforce Database
- Challenges in Machine Learning
- CrowdANALYTIX dataX
- D4D Challenge of Orange
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- KDD Cup by Tencent 2012
- Kaggle Competition Data
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- TravisTorrent Dataset - MSR'2017 Mining Challenge
- TunedIT - Data mining & machine learning data sets, algorithms, challenges
- Yelp Dataset Challenge
EarthScience
- AQUASTAT - Global water resources and uses
- BODC - marine data of ~22K vars
- EOSDIS - NASA's earth observing system data
- Earth Models
- Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements
- Marinexplore - Open Oceanographic Data
- Smithsonian Institution Global Volcano and Eruption Database
- USGS Earthquake Archives
Economics
- American Economic Association (AEA)
- EconData from UMD
- Economic Freedom of the World Data
- Historical MacroEconomic Statistics
- INFORUM - Interindustry Forecasting at the University of Maryland
- International Economics Database
- International Trade Statistics
- Internet Product Code Database
- Joint External Debt Data Hub
- Jon Haveman International Trade Data Links
- OpenCorporates Database of Companies in the World
- Our World in Data
- SciencesPo World Trade Gravity Datasets
- The Atlas of Economic Complexity
- The Center for International Data
- The Observatory of Economic Complexity
- UN Commodity Trade Statistics
- UN Human Development Reports
Energy
- AMPds
- BLUED
- COMBED
- DRED
- ECO
- EIA
- HES - Household Electricity Study, UK
- HFED
- PLAID - The Plug Load Appliance Identification Dataset
- REDD
- Tracebase
- UK-DALE - UK Domestic Appliance-Level Electricity
- WHITED
- iAWE
Finance
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- NYSE Market Data
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
GIS
- ArcGIS Open Data portal
- Cambridge, MA, US, GIS data on GitHub
- Factual Global Location Data
- Geo Maps - High Quality GeoJSON maps programmatically generated
- Geo Spatial Data from ASU
- Geo Wiki Project - Citizen-driven Environmental Monitoring
- GeoFabrik - OSM data extracted to a variety of formats and areas
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Homeland Infrastructure Foundation-Level Data
- Landsat 8 on AWS
- List of all countries in all languages
- National Weather Service GIS Data Portal
- Natural Earth - vectors and rasters of the world
- OpenAddresses
- OpenStreetMap (OSM)
- Pleiades - Gazetteer and graph of ancient places
- Reverse Geocoder using OSM data
- TIGER/Line - U.S. boundaries and roads
- TZ Timezones shapefiles
- TwoFishes - Foursquare's coarse geocoder
- UN Environmental Data
- World boundaries from the U.S. Department of State
- World countries in multiple formats
Government
- Alberta, Province of Canada
- Antwerp, Belgium
- Argentina (non official)
- Argentina
- Austin, TX, US
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Austria (data.gv.at)
- Baton Rouge, LA, US
- Belgium
- Brazil
- Buenos Aires, Argentina
- Calgary, AB, Canada
- Cambridge, MA, US
- Canada
- Chicago
- Chile
- Dallas Open Data
- DataBC - data from the Province of British Columbia
- Denver Open Data
- Durham, NC Open Data
- Edmonton, AB, Canada
- England LGInform
- EuroStat
- EveryPolitician - Ongoing project collating and sharing data on every politician.
- FedStats
- Finland
- France
- Fredericton, NB, Canada
- Gatineau, QC, Canada
- Germany
- Ghent, Belgium
- Glasgow, Scotland, UK
- Greece
- Guardian world governments
- Halifax, NS, Canada
- Helsinki Region, Finland
- Hong Kong, China
- Houston Open Data
- Indian Government Data
- Indonesian Data Portal
- Ireland's Open Data Portal
- Italy - The dati.gov.it portal is the national catalog of metadata for data released in open format by Italian public administrations. The portal is promoted by the Italian Government and managed by the Agenzia per l’Italia digitale with the support of FormezPA.
- Japan
- Laval, QC, Canada
- Lexington, KY
- London Datastore, UK
- London, ON, Canada
- Los Angeles Open Data
- MassGIS, Massachusetts, U.S.
- Metropolitan Transportation Commission (MTC), California, US
- Mexico
- Mississauga, ON, Canada
- Moldova
- Moncton, NB, Canada
- Montreal, QC, Canada
- Mountain View, California, US (GIS)
- NYC Open Data
- NYC betanyc
- Netherlands
- New Zealand
- OECD
- Oakland, California, US
- Oklahoma
- Open Data for Africa
- Open Government Data (OGD) Platform India
- OpenDataSoft's list of 1,600 open data portals
- Oregon
- Ottawa, ON, Canada
- Palo Alto, California, US
- Portland, Oregon
- Portugal - Pordata organization
- Puerto Rico Government
- Quebec City, QC, Canada
- Quebec Province of Canada
- Regina SK, Canada
- Rio de Janeiro, Brazil
- Romania
- Russia
- San Francisco Data sets
- San Jose, California, US
- San Mateo County, California, US
- Saskatchewan, Province of Canada
- Seattle
- Singapore Government Data
- South Africa Trade Statistics
- South Africa
- State of Utah, US
- Switzerland
- Taiwan g0v
- Taiwan
- Tel-Aviv Open Data
- Texas Open Data
- The World Bank
- Toronto, ON, Canada
- Tunisia
- U.K. Government Data
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. National Center for Education Statistics (NCES)
- U.S. Open Government
- UK 2011 Census Open Atlas Project
- U.S. Patent and Trademark Office (USPTO) Bulk Data Products
- Uganda Bureau of Statistics
- United Nations
- Uruguay
- Valley Transportation Authority (VTA), California, US
- Vancouver, BC Open Data Catalog
- Victoria, BC, Canada
- Vienna, Austria
Healthcare
- Composition of Foods Raw, Processed, Prepared USDA National Nutrient Database for Standard Reference - The database consists of several sets of data: food descriptions, nutrients, weights and measures, footnotes, and sources of data. The Nutrient Data file contains mean nutrient values per 100 g of the edible portion of food, along with fields to further describe the mean value.
- EHDP Large Health Data Sets
- GDC - GDC supports several cancer genome programs for CCG, TCGA, TARGET etc.
- Gapminder World demographic databases
- MeSH, the vocabulary thesaurus used for indexing articles for PubMed
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- Number of Ebola Cases and Deaths in Affected Countries (2014)
- Open-ODS (structure of the UK NHS)
- OpenPaymentsData, Healthcare financial relationship data
- PhysioBank Databases - A large and growing archive of physiological data.
- The Cancer Imaging Archive (TCIA)
- The Cancer Genome Atlas project (TCGA)
- World Health Organization Global Health Observatory
ImageProcessing
- 10k US Adult Faces Database
- 2GB of Photos of Cats
- Adience Unfiltered faces for gender and age classification
- Affective Image Classification
- Animals with attributes
- Caltech Pedestrian Detection Benchmark
- Chars74K dataset - Character Recognition in Natural Images (both English and Kannada are available)
- Face Recognition Benchmark
- Flickr: 32 Class Brand Logos
- GDXray - X-ray images for X-ray testing and Computer Vision
- ImageNet (in WordNet hierarchy)
- Indoor Scene Recognition
- International Affective Picture System, UFL
- MNIST database of handwritten digits, 70,000 examples
- Massive Visual Memory Stimuli, MIT
- SUN database, MIT
- Several Shape-from-Silhouette Datasets
- Stanford Dogs Dataset
- The Action Similarity Labeling (ASLAN) Challenge
- The Oxford-IIIT Pet Dataset
- Violent-Flows - Crowd Violence / Non-violence Database and benchmark
- Visual genome
- YouTube Faces Database
MachineLearning
- Context-aware data sets from five domains
- Delve Datasets for classification and regression
- Discogs Monthly Data
- Free Music Archive
- IMDb Database
- Keel Repository for classification, regression and time series
- Labeled Faces in the Wild (LFW)
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- New Yorker caption contest ratings
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- YouTube-BoundingBoxes
- YouTube-8M
- eBay Online Auctions (2012)
Museums
- Canada Science and Technology Museums Corporation's Open Data
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Natural History Museum (London) Data Portal
- Rijksmuseum Historical Art Collection
- Tate Collection metadata
- The Getty vocabularies
NaturalLanguage
- Automatic Keyphrase Extraction
- Blogger Corpus
- CLiPS Stylometry Investigation Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Freebase of people, places, and things
- Google Books Ngrams (2.2TB)
- Google MC-AFP - Generated based on the public available Gigaword dataset using Paragraph Vectors
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
- Machine Comprehension Test (MCTest) of text from Microsoft Research
- Machine Translation of European languages
- Making Sense of Microposts 2013 - Concept Extraction
- Making Sense of Microposts 2016 - Named Entity rEcognition and Linking
- Multi-Domain Sentiment Dataset (version 2.0)
- Open Multilingual Wordnet
- POS/NER/Chunk annotated data
- Personae Corpus
- SMS Spam Collection in English
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- Stanford Question Answering Dataset (SQuAD)
- USENET postings corpus of 2005~2011
- Universal Dependencies
- Webhose - News/Blogs in multiple languages
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
Neuroscience
- Allen Institute Datasets
- Brain Catalogue
- Brainomics
- CodeNeuro Datasets
- Collaborative Research in Computational Neuroscience (CRCNS)
- FCP-INDI
- Human Connectome Project
- NDAR
- NIMH Data Archive
- NeuroData
- Neuroelectro
- OASIS
- OpenfMRI
- Study Forrest
Physics
- CERN Open Data Portal
- Crystallography Open Database
- IceCube - South Pole Neutrino Observatory
- NASA Exoplanet Archive
- NSSDC (NASA) data of 550 spacecraft
- Sloan Digital Sky Survey (SDSS) - Mapping the Universe
PublicDomains
- Amazon
- Archive.org Datasets
- Archive-it from Internet Archive
- CMU JASA data archive
- CMU StatLab collections
- Data.World
- Data360
- Enigma Public
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Microsoft Data Science for Research
- Numbrary
- Open Library Data Dumps
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- StatSci.org
- Stats4Stem R data sets
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
SearchEngines
- Academic Torrents of data sharing from UMB
- DataMarket (Qlik)
- Datahub.io
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Institute of Education Sciences
- National Technical Reports Library
- Open Data Certificates (beta)
- OpenDataNetwork - A search engine of all Socrata powered data portals
- Statista.com - Statistics and studies
- Zenodo - An open dependable home for the long-tail of science
SocialNetworks
- 72 hours #gamergate Twitter Scrape
- Ancestry.com Forum Dataset over 10 years
- CMU Enron Email of 150 users
- Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape
- EDRM Enron EMail of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- Foursquare from UMN/Sarwat (2013)
- GitHub Collaboration Archive
- Google Scholar citation relations
- High-Resolution Contact Networks from Wearable Sensors
- Indie Map: social graph and crawl of top IndieWeb sites
- Mobile Social Networks from UMASS
- Network Twitter Data
- Reddit Comments
- Skytrax' Air Travel Reviews Dataset
- Social Twitter Data
- SourceForge.net Research Data
- Twitter Data for Online Reputation Management
- Twitter Data for Sentiment Analysis
- Twitter Graph of entire Twitter site
- Twitter Scrape Calufa May 2011
- UNIMI/LAW Social Network Datasets
- Yahoo! Graph and Social Data
- YouTube Video Social Graph in 2007, 2008
SocialSciences
- ACLED (Armed Conflict Location & Event Data Project)
- Canadian Legal Information Institute
- Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc
- Correlates of War Project
- Cryptome Conspiracy Theory Items
- Datacards
- European Social Survey
- FBI Hate Crime 2013 - aggregated data
- Fragile States Index
- GDELT Global Events Database
- General Social Survey (GSS) since 1972
- German Social Survey
- Global Religious Futures Project
- Humanitarian Data Exchange
- INFORM Index for Risk Management
- Institute for Demographic Studies
- International Networks Archive
- International Social Survey Program ISSP
- International Studies Compendium Project
- James McGuire Cross National Data
- MIT Reality Mining Dataset
- MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
- Minnesota Population Center
- Notre Dame Global Adaptation Index (ND-GAIN)
- Open Crime and Policing Data in England, Wales and Northern Ireland
- OpenSanctions - A global database of persons and companies of political, criminal, or economic interest.
- Paul Hensel General International Data Page
- PewResearch Internet Survey Project
- PewResearch Society Data Collection
- Political Polarity Data
- StackExchange Data Explorer
- Terrorism Research and Analysis Consortium
- Texas Inmates Executed Since 1984
- Titanic Survival Data Set
- UCB's Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UN Civil Society Database
- UPJOHN for Labor Employment Research
- Universities Worldwide
- Uppsala Conflict Data Program
- World Bank Open Data
- WorldPop project - Worldwide human population distributions
Software
- FLOSSmole data about free, libre, and open source software development
- Libraries.io Open Source Repository and Dependency Metadata
Sports
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman's Baseball Database
- Pinhooker: Thoroughbred Bloodstock Sale Data
- Retrosheet Baseball Statistics
- Tennis database of rankings, results, and stats for ATP
- Tennis database of rankings, results, and stats for WTA
TimeSeries
- Databanks International Cross National Time Series Data Archive
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Time Series Data Library (TSDL) from MU
- UC Riverside Time Series Dataset
Transportation
- Airlines OD Data 1987-2008
- Bay Area Bike Share Data
- Bike Share Systems (BSS) collection
- GeoLife GPS Trajectory from Microsoft Research
- German train system by Deutsche Bahn
- Hubway Million Rides in MA
- Montreal BIXI Bike Share
- NYC Taxi Trip Data 2009-
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Uber trip data April 2014 to September 2014
- Open Traffic collection
- OpenFlights - airport, airline and route data
- Philadelphia Bike Share Stations (JSON)
- Plane Crash Database, since 1920
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Toronto Bike Share Stations (XML file)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
Complementary Collections
- Data Packaged Core Datasets
- Database of Scientific Code Contributions
- A growing collection of public datasets: CoolDatasets.
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- OpenDataMonitor: An overview of available open data resources in Europe
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives