19 Free Public Data Sets For Your First Data Science Project

Source: https://www.springboard.com/blog/free-public-data-sets-data-science-project/

Completing your first project is a major milestone on the road to becoming a data scientist. It's also an intimidating process. The first step is to find an appropriate, interesting data set. You should decide how large and how messy a data set you want to work with. Cleaning data is an integral part of data science, but you may want to start with a clean data set for your first project so that you can focus on the analysis rather than on the cleanup.

Based on what we learned from our Foundations of Data Science Workshop, we've selected data sets of varying types and complexity that we think work well for first projects (some of them work for research projects as well!). These data sets cover a variety of sources: demographic data, economic data, text data, and corporate data.
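
Before committing to a data set, it can help to measure how large and how messy it actually is. The following is a minimal Python sketch, assuming the data arrives as CSV. The `summarize` helper and the sample rows are hypothetical, made up for illustration, and do not come from any of the sources listed here.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file; not real data.
SAMPLE_CSV = """state,population,median_income
A,1000,52000
B,,48000
C,2500,
D,1800,61000
"""


def summarize(csv_text):
    """Return (row_count, column_names, missing_counts) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or blank cells as missing values.
            if value is None or value.strip() == "":
                missing[name] += 1
    return rows, columns, missing


if __name__ == "__main__":
    rows, columns, missing = summarize(SAMPLE_CSV)
    print(f"{rows} rows, {len(columns)} columns")
    for name in columns:
        print(f"{name}: {missing[name]} missing")
```

A data set with few missing cells per column is a reasonable candidate for a first project; one with many gaps will need cleaning before any analysis.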

    1. United States Census Data: The United States Census publishes reams of demographic data at the state, city, and even zip code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean and very comprehensive.
    2. FBI Crime Data: The FBI crime data set is fascinating. If you're interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
    3. CDC Cause of Death: The Centers for Disease Control and Prevention (CDC) maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
    4. Medicare Hospital Quality: Medicare maintains a database on complication rates by hospital that allows for interesting comparisons.
    5. SEER Cancer Incidence: The US government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors.
    6. Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
    7. The Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, like GDP and exchange rates.
    8. IMF Economic Data: If you want a view of international data, you can find it on the IMF website.
    9. Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One data set to explore is the weekly returns of the Dow Jones Index.
    10. Boston Housing Data: The Boston Housing Data Set contains median housing prices in Boston suburbs as well as 13 attributes that contribute to those prices. It's an excellent set for experimenting with various types of regression.
    11. Enron Emails: After the collapse of Enron, a data set of roughly 500,000 emails with message text and metadata was released. The data set is now famous and provides an excellent testing ground for text-related analysis. It has the messiness of real-world data.
    12. Google N-Grams: If you're interested in truly massive data, the Google n-grams data set counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
    13. Sentence Sentiments: Researchers have labeled 3,000 sentences as expressing positive or negative sentiments. If you're interested in classifying text, this is a great place to start.
    14. Reddit Comments: Reddit released a data set of every comment that has ever been made on the site. That's over a terabyte of data uncompressed, so if you want a smaller data set to work with, Kaggle has hosted the comments from May 2015 on its site.
    15. Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles.
    16. Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The data set lends itself both to classification techniques (will a given loan default?) and to regressions (how much will be paid back on a given loan?).
    17. Walmart: Walmart has released store-level sales data for 98 items across 45 stores. This is an excellent data set for time series analysis and has interesting seasonal components as well.
    18. Airbnb: Airbnb released user session data as part of a contest to create analysis and visualizations.
    19. Yelp: Yelp releases an academic data set that contains information for the areas around 30 universities.
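
Several of the data sets above, such as the Boston housing data and the Lending Club loans, are suggested for regression. As a rough illustration of what that means, here is a minimal Python sketch of an ordinary least squares fit with a single feature. The `fit_line` helper and the numbers are made up for illustration; they are not drawn from any of the listed data sets, and a real project would more likely use a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy values: a single attribute and a price, NOT real housing data.
rooms = [4.0, 5.0, 6.0, 7.0]
prices = [15.0, 20.0, 25.0, 30.0]

slope, intercept = fit_line(rooms, prices)
print(f"price = {slope:.2f} * rooms + {intercept:.2f}")
```

With a real data set, the same idea extends to several attributes at once, which is where experimenting with different regression types becomes interesting.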

Well, now it's time to get cracking! If you want to jumpstart your data science career today, I'd recommend checking out our 12-Week Online Workshop, Foundations of Data Science. Head here for more on that. If you want even more resources, check out the Springboard home page.

