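When teaching yourself Python, Stata, or R, you may have run into this problem: to try out a command or model you are interested in, you need a sample dataset (a dta file or a data frame). Where can you find one quickly, and one that is both reliable and easy to use? As mentioned in an earlier post, the two simplest options are, first, typing the data in by hand (covered in a previous article): https://zhuanlan.zhihu.com/p/321627537
and second, loading a dataset that ships with the software. Most textbooks cover the first approach. As for the second... of the hundred-plus Stata, R, and Python textbooks I have read, almost none mention it, which is why this tip is so little known.
Loading a built-in dataset is simple and efficient, and the data tend to be more realistic. It should be a beginner's first choice for getting data.
A few of these datasets will be familiar to many users: auto and nlsw in Stata, and iris, mtcars, and Titanic in R or Python.
First, a quick review of how to enter a dataset by hand, using Python as the example (the original shows a screenshot here; the code is in the article linked above).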
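Since the screenshot is not reproduced here, the following is only a minimal sketch of manual entry with pandas, not the code from the linked article. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data typed in by hand (names and values are illustrative only)
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [23, 35, 29],
        "income": [5200.0, 8100.0, 6400.0],
    }
)

print(df)         # show the whole data frame
print(df.dtypes)  # check the type pandas inferred for each column
```

This works for a handful of rows, but it quickly becomes tedious, which is the motivation for the built-in datasets discussed next.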
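1 Stata's built-in datasets
Stata is a highly commercialized, paid product, and whatever features it offers tend to be very polished. Loading built-in data takes a single command: sysuse auto.dta. Beyond the datasets installed locally, Stata also provides official online datasets. The online collection includes the datasets available through sysuse, and any of them can be loaded with webuse xxx.dta.
Overall, Stata's built-in data are comprehensive and easy to use. That is why, when I first started running regressions in Python, I often went to Stata to export data.
1.1 The sysuse command and local datasets
Typing sysuse dir in the Stata Command window returns the list of several dozen example datasets installed locally.
Then use sysuse followed by the dataset name to load that dataset into Stata's memory. Simple and convenient.
1.2 Online example datasets
The help dta_manuals command returns a list, organized by manual, of several hundred example datasets. They can be loaded with webuse, and the list also includes all of the local datasets.
Next, click the last entry, the User's Guide [U] (help q_user also works), which comes with several dozen example datasets. The first two are ones we know well: auto, and nlswork, a member of the nlsw family.
Figure: viewing Stata's example dta files with help q_user.
Click use next to a dataset to load it into memory (as shown in the figure below), or click describe to describe the data without loading them.
Figure: clicking use loads the auto dataset online. The auto dataset can also be loaded online with the webuse command.
Once a Stata example dataset is in memory, the following commands export it for later use in R or Python.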
. export delimited using "D:\Spyder\auto.csv", replace // export as a CSV file
. export excel using "D:\Spyder\auto.xlsx", firstrow(variables) // export as an Excel file
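Note: information can be lost when a Stata dta file is exported. For numeric variables with value labels, the CSV or xlsx file stores the value labels rather than the underlying numbers. For example, the foreign variable is coded 1 or 0 with the labels Foreign and Domestic; after export, the column contains Foreign or Domestic instead of 1 or 0. Both export delimited and export excel accept a nolabel option that writes the underlying numeric values instead.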
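If the file has already been exported with labels, the numeric coding can be restored on the Python side. The sketch below assumes a CSV shaped like the exported auto.csv; to stay self-contained it reads a few hypothetical rows from an inline string instead of D:\Spyder\auto.csv, and the car names and prices are illustrative, not taken from auto.dta.

```python
import io

import pandas as pd

# Stand-in for the exported auto.csv (hypothetical rows, illustrative values)
csv_text = """make,price,mpg,foreign
Car A,4100,22,Domestic
Car B,4800,20,Domestic
Car C,9700,17,Foreign
"""

auto = pd.read_csv(io.StringIO(csv_text))

# Map the exported value labels back to the 0/1 coding described in the note
label_to_code = {"Domestic": 0, "Foreign": 1}
auto["foreign_code"] = auto["foreign"].map(label_to_code)

print(auto)
```

With the real file, replacing io.StringIO(csv_text) by the path of the exported CSV should be the only change needed.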
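2 R's built-in datasets
In principle, R is the most convenient of the three for built-in data, even faster than Stata, because R's built-in datasets can be viewed and used directly without first loading them into the current working environment. (Back in the day I tried all sorts of ways to import R's built-in data, and I often shuffled data over from Stata for R to use.)
2.1 Datasets in R's built-in datasets package
Any built-in dataset that comes from the local datasets package can be viewed and used directly, which is slightly more convenient than in Stata. First, a demonstration of viewing and using the mtcars dataset directly.
> ## view the mtcars dataset directly
> head(mtcars) ## show the first 6 rows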
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> View(mtcars) ## open the data in the viewer, similar to Stata's browse command; the result is shown at the end
> ## inspect the mtcars dataset and its variables
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> summary(mtcars)
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000
gear carb
Min. :3.000 Min. :1.000
1st Qu.:3.000 1st Qu.:2.000
Median :4.000 Median :2.000
Mean :3.688 Mean :2.812
3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :8.000
> ## fit a model with the mtcars data
> m1 <- lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)
> m1
Call:
lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)
Coefficients:
(Intercept) mtcars$cyl mtcars$disp mtcars$hp
34.18492 -1.22742 -0.01884 -0.01468
> summary(m1)
Call:
lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)
Residuals:
Min 1Q Median 3Q Max
-4.0889 -2.0845 -0.7745 1.3972 6.9183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.18492 2.59078 13.195 1.54e-13 ***
mtcars$cyl -1.22742 0.79728 -1.540 0.1349
mtcars$disp -0.01884 0.01040 -1.811 0.0809 .
mtcars$hp -0.01468 0.01465 -1.002 0.3250
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 28 degrees of freedom
Multiple R-squared: 0.7679, Adjusted R-squared: 0.743
F-statistic: 30.88 on 3 and 28 DF, p-value: 5.054e-09
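> ## so far we have been using mtcars directly
> ## load mtcars into the current working environment (Environment)
> data("mtcars")
> View(iris) ## view the data; the result is shown in the figure below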
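For readers following along in Python, here is a rough sketch of the pandas counterparts of head, str, and summary. It is not part of the original R session; it only uses the six mtcars rows printed by head(mtcars) above, typed in by hand, and the variable name mtcars_head is my own.

```python
import pandas as pd

# The six rows shown by head(mtcars) above, keyed by car name
rows = {
    "Mazda RX4":         [21.0, 6, 160, 110, 3.90, 2.620, 16.46, 0, 1, 4, 4],
    "Mazda RX4 Wag":     [21.0, 6, 160, 110, 3.90, 2.875, 17.02, 0, 1, 4, 4],
    "Datsun 710":        [22.8, 4, 108, 93, 3.85, 2.320, 18.61, 1, 1, 4, 1],
    "Hornet 4 Drive":    [21.4, 6, 258, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1],
    "Hornet Sportabout": [18.7, 8, 360, 175, 3.15, 3.440, 17.02, 0, 0, 3, 2],
    "Valiant":           [18.1, 6, 225, 105, 2.76, 3.460, 20.22, 1, 0, 3, 1],
}
cols = ["mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"]

mtcars_head = pd.DataFrame.from_dict(rows, orient="index", columns=cols)

print(mtcars_head.head())      # roughly head() in R
mtcars_head.info()             # roughly str() in R
print(mtcars_head.describe())  # roughly summary() in R
```

Because only six rows are included, the describe() output will not match the summary(mtcars) figures above, which are computed on all 32 observations.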
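Next, use the data function to list all the datasets in the datasets package. The same command works for data bundled with other packages.
> data(package = 'datasets') ## or simply type data()
## the result is a long list of datasets, including the familiar iris and mtcars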
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers
1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different
diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European
Stock Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson
Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure
Personal Expenditure Data
UScitiesD Distances Between European Cities and
Between US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe's Quartet of 'Identical' Simple
Linear Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of
Australian Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights by Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student's 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and
Between US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the
UK
freeny Freeny's Revenue Data
freeny.x (freeny) Freeny's Revenue Data
freeny.y (freeny) Freeny's Revenue Data
infert Infertility after Spontaneous and Induced
Abortion
iris Edgar Anderson's Iris Data
iris3 Edgar Anderson's Iris Data
islands Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the
UK
lh Luteinizing Hormone in Blood Samples
longley Longley's Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the
UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at
Nottingham, 1920-1939
npk Classical N, P, K Factorial Experiment
occupationalStatus Occupational Status of Fathers and their
Sons
precip Annual Precipitation in US Cities
presidents Quarterly Approval Ratings of US
Presidents
pressure Vapor Pressure of Mercury as a Function of
Temperature
quakes Locations of Earthquakes off Fiji
randu Random Numbers from Congruential Generator
RANDU
rivers Lengths of Major North American Rivers
rock Measurements on Petroleum Rock Samples
sleep Student's Sleep Data
stack.loss (stackloss)
Brownlee's Stack Loss Plant Data
stack.x (stackloss)
Brownlee's Stack Loss Plant Data
stackloss Brownlee's Stack Loss Plant Data
state.abb (state) US State Facts and Figures
state.area (state) US State Facts and Figures
state.center (state)
US State Facts and Figures
state.division (state)
US State Facts and Figures
state.name (state) US State Facts and Figures
state.region (state)
US State Facts and Figures
state.x77 (state) US State Facts and Figures
sunspot.month Monthly Sunspot Data, from 1749 to
"Present"
sunspot.year Yearly Sunspot Data, 1700-1988
sunspots Monthly Sunspot Numbers, 1749-1983
swiss Swiss Fertility and Socioeconomic
Indicators (1888) Data
treering Yearly Treering Data, -6000-1979
trees Diameter, Height and Volume for Black
Cherry Trees
uspop Populations Recorded by the US Census
volcano Topographic Information on Auckland's
Maunga Whau Volcano
warpbreaks The Number of Breaks in Yarn during
Weaving
women Average Heights and Weights for American
Women
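3 Loading built-in datasets in Python
3.1 The datasets module in scikit-learn
According to the scikit-learn website, scikit-learn ships with roughly thirty datasets, and each one has its own dedicated loading function.
from sklearn import datasets ## import the datasets module
iris = datasets.load_iris() ## load the iris dataset
print(iris) ## output is too long to show here
You can also click through the object in Spyder's variable explorer to inspect it, as shown in the figure below.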
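The printed object is a dictionary-like bundle rather than a table, so it can help to list the available loaders and to ask for a pandas data frame instead. The sketch below assumes a reasonably recent scikit-learn (the as_frame argument is not available in old releases) with pandas installed; the names loaders and iris_df are my own.

```python
from sklearn import datasets

# Names of the bundled load_* helpers; the exact list depends on the installed version
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)

# as_frame=True asks the loader to return pandas objects
iris = datasets.load_iris(as_frame=True)
iris_df = iris.frame  # features plus a "target" column in one DataFrame

print(iris_df.head())
print(iris_df.shape)
```

The data frame form is usually easier to browse and to pass on to plotting or regression code than the raw bundle.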
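3.2 Datasets bundled with the seaborn plotting package
Like scikit-learn, the seaborn plotting package also comes with some classic datasets, such as Titanic.
import seaborn as sns
titanic = sns.load_dataset('titanic') ## load the titanic dataset (fetched from seaborn's online data repository, so an internet connection is needed)
## list the other datasets available through seaborn
sns.get_dataset_names()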
Out[2]:
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'exercise',
'flights',
'fmri',
'gammas',
'geyser',
'iris',
'mpg',
'penguins',
'planets',
'tips',
'titanic']
Ref 7. Dataset loading utilities, scikit-learn.org
----End of article-----