Do You Still Need Stata If You Use Python? A Little-Known Tip: Loading Built-in Datasets in Stata, Python, and R (with Code)

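When teaching yourself Python, Stata, or R, have you ever been stuck on this question: you want to try out a command or model, so you need a sample dataset (a dta file or a data frame), but where can you find one quickly, and how do you know it is reliable and easy to use? As mentioned in an earlier post, the two simplest approaches are these. The first is to enter the data by hand, which a previous article covered: https://zhuanlan.zhihu.com/p/321627537

The second is to load a dataset that ships with the software. Manual entry is covered in most textbooks. As for the second approach, among the hundred-plus Stata, R, and Python textbooks I have read, almost none mention it. That is why I call this a "cold" (little-known) tip.

Loading a built-in dataset is simple and efficient, and the data are often more realistic. It should be a beginner's first choice for getting data into the software.

A few common datasets will be familiar to many users: auto and nlsw in Stata, and iris, mtcars, and Titanic in R or Python.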

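Next, let's review how to enter a dataset by hand, using Python as the example. (The original post showed a screenshot here; the code is in the article linked above.)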
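Since the screenshot is not reproduced here, the following is a minimal sketch of what manual entry can look like, assuming pandas is installed. It is not the code from the linked article; the column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical values, typed in by hand purely for illustration.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [23, 35, 41],
        "income": [3200.0, 5400.5, 6100.0],
    }
)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column
```

This works for a handful of rows, but it quickly becomes tedious, which is the motivation for using built-in datasets instead.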

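1 Built-in datasets in Stata

Stata is a heavily commercialized, paid product, and whatever features it has tend to be very polished. Loading built-in data is a case in point: a single command, sysuse auto.dta, does it. Besides the datasets installed with the program, Stata also provides official online datasets. These include the datasets available through sysuse, and can be loaded with webuse xxx.dta.

Overall, Stata's built-in data are broad in coverage and easy to use. For that reason, when I first learned to run regressions in Python, I often went back to Stata to export data.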

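1.1 The sysuse command and local datasets

Typing sysuse dir in the Stata command window returns a list of the few dozen example datasets installed locally with Stata.

Then use sysuse followed by a dataset name to load that dataset into Stata's memory. Simple and convenient.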

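1.2 Online example datasets

The help dta_manuals command returns a list, organized by manual, of several hundred example datasets. All of them can be loaded with webuse, and the list also covers every local dataset.

Next, click the last manual in the list, User's Guide [U] (the help q_user command also works). It contains a few dozen example datasets. The first two are familiar ones: auto, and nlswork from the nlsw family. (Figure in the original: the help q_user command lists Stata's example dta files.)

Click use next to a dataset to load it into memory, or click describe to describe the data without loading it. (Figure in the original: clicking use loads the auto dataset online; the webuse command can also load auto online.)

Once a Stata example dataset is in memory, the following commands export it for later use in R or Python.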

. export delimited using "D:\Spyder\auto.csv", replace //export as a CSV file

. export excel using "D:\Spyder\auto.xlsx", firstrow(variables) //export as an Excel file

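Note: information can be lost when exporting a Stata dta file. For numeric variables with value labels, exporting to CSV or xlsx keeps the value labels rather than the underlying numbers. For example, the foreign variable takes the values 1 or 0, labeled Foreign and Domestic; after export the column contains Foreign or Domestic instead of 1 or 0. Adding the nolabel option to export delimited or export excel writes the underlying numbers instead.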
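If the file has already been exported with labels, the numeric codes can be rebuilt on the Python side. The sketch below assumes pandas is installed and uses a small in-memory stand-in for auto.csv; the rows are invented for illustration, and only the mapping (0 = Domestic, 1 = Foreign) comes from the text above.

```python
import io

import pandas as pd

# Stand-in for the exported auto.csv; these rows are invented for illustration.
csv_text = """make,price,foreign
Car A,4000,Domestic
Car B,6000,Foreign
Car C,5000,Domestic
"""

auto = pd.read_csv(io.StringIO(csv_text))

# Value labels as described in the text: 0 = Domestic, 1 = Foreign.
label_to_code = {"Domestic": 0, "Foreign": 1}
auto["foreign_code"] = auto["foreign"].map(label_to_code)

print(auto)
```

With the real file, replace io.StringIO(csv_text) with the path to the exported CSV. Any label missing from the mapping becomes NaN, which makes unexpected labels easy to spot.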

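2 Built-in datasets in R

In principle, R is the most convenient of the three for built-in datasets, even quicker than Stata, because its built-in datasets can be viewed or used directly without first being loaded into the current working environment. (Back in the day I tried all sorts of ways to load R's built-in data, and often shuffled data over from Stata for R to use.)

2.1 Datasets in R's built-in datasets package

Any built-in dataset that comes from the local datasets package can be viewed and used directly, which is slightly more convenient than in Stata. First, a demonstration of viewing and using the mtcars dataset directly.

> ## View the mtcars dataset directly

> head(mtcars) ## show the first 6 rows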

mpg cyl disp hp drat wt qsec vs am gear carb

Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4

Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4

Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

> View(mtcars) ## Open the data in the viewer, equivalent to Stata's browse command. The result is shown at the end

> ## Inspect the mtcars dataset and its variables

> str(mtcars)

'data.frame':32 obs. of 11 variables:

$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...

$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...

$ disp: num 160 160 108 258 360 ...

$ hp : num 110 110 93 110 175 105 245 62 95 123 ...

$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...

$ wt : num 2.62 2.88 2.32 3.21 3.44 ...

$ qsec: num 16.5 17 18.6 19.4 17 ...

$ vs : num 0 0 1 1 0 1 0 1 1 1 ...

$ am : num 1 1 1 0 0 0 0 0 0 0 ...

$ gear: num 4 4 4 3 3 3 3 4 4 4 ...

$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

> summary(mtcars)

mpg cyl disp hp drat

Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760

1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080

Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695

Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597

3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920

Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930

wt qsec vs am

Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000

1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000

Median :3.325 Median :17.71 Median :0.0000 Median :0.0000

Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062

3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000

Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000

gear carb

Min. :3.000 Min. :1.000

1st Qu.:3.000 1st Qu.:2.000

Median :4.000 Median :2.000

Mean :3.688 Mean :2.812

3rd Qu.:4.000 3rd Qu.:4.000

Max. :5.000 Max. :8.000

> ## Fit a model with the mtcars data

> m1 <- lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)

> m1

Call:

lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)

Coefficients:

(Intercept) mtcars$cyl mtcars$disp mtcars$hp

34.18492 -1.22742 -0.01884 -0.01468

> summary(m1)

Call:

lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)

Residuals:

Min 1Q Median 3Q Max

-4.0889 -2.0845 -0.7745 1.3972 6.9183

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 34.18492 2.59078 13.195 1.54e-13 ***

mtcars$cyl -1.22742 0.79728 -1.540 0.1349

mtcars$disp -0.01884 0.01040 -1.811 0.0809 .

mtcars$hp -0.01468 0.01465 -1.002 0.3250

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.055 on 28 degrees of freedom

Multiple R-squared: 0.7679,Adjusted R-squared: 0.743

F-statistic: 30.88 on 3 and 28 DF, p-value: 5.054e-09

> ## So far we have been using mtcars directly

> ## Load mtcars into the current working environment (Environment)

> data("mtcars")

> View(mtcars) ## View the data (shown as a figure in the original)

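Next, use the data function to list all datasets in the datasets package. The command is similar for data that come with other packages.

> data(package = 'datasets') ## or simply type data()

## The result is a long list of datasets, including the familiar iris and mtcars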

Data sets in package ‘datasets’:

AirPassengers Monthly Airline Passenger Numbers

1949-1960

BJsales Sales Data with Leading Indicator

BJsales.lead (BJsales)

Sales Data with Leading Indicator

BOD Biochemical Oxygen Demand

CO2 Carbon Dioxide Uptake in Grass Plants

ChickWeight Weight versus age of chicks on different

diets

DNase Elisa assay of DNase

EuStockMarkets Daily Closing Prices of Major European

Stock Indices, 1991-1998

Formaldehyde Determination of Formaldehyde

HairEyeColor Hair and Eye Color of Statistics Students

Harman23.cor Harman Example 2.3

Harman74.cor Harman Example 7.4

Indometh Pharmacokinetics of Indomethacin

InsectSprays Effectiveness of Insect Sprays

JohnsonJohnson Quarterly Earnings per Johnson & Johnson

Share

LakeHuron Level of Lake Huron 1875-1972

LifeCycleSavings Intercountry Life-Cycle Savings Data

Loblolly Growth of Loblolly pine trees

Nile Flow of the River Nile

Orange Growth of Orange Trees

OrchardSprays Potency of Orchard Sprays

PlantGrowth Results from an Experiment on Plant Growth

Puromycin Reaction Velocity of an Enzymatic Reaction

Seatbelts Road Casualties in Great Britain 1969-84

Theoph Pharmacokinetics of Theophylline

Titanic Survival of passengers on the Titanic

ToothGrowth The Effect of Vitamin C on Tooth Growth in

Guinea Pigs

UCBAdmissions Student Admissions at UC Berkeley

UKDriverDeaths Road Casualties in Great Britain 1969-84

UKgas UK Quarterly Gas Consumption

USAccDeaths Accidental Deaths in the US 1973-1978

USArrests Violent Crime Rates by US State

USJudgeRatings Lawyers' Ratings of State Judges in the US

Superior Court

USPersonalExpenditure

Personal Expenditure Data

UScitiesD Distances Between European Cities and

Between US Cities

VADeaths Death Rates in Virginia (1940)

WWWusage Internet Usage per Minute

WorldPhones The World's Telephones

ability.cov Ability and Intelligence Tests

airmiles Passenger Miles on Commercial US Airlines,

1937-1960

airquality New York Air Quality Measurements

anscombe Anscombe's Quartet of 'Identical' Simple

Linear Regressions

attenu The Joyner-Boore Attenuation Data

attitude The Chatterjee-Price Attitude Data

austres Quarterly Time Series of the Number of

Australian Residents

beaver1 (beavers) Body Temperature Series of Two Beavers

beaver2 (beavers) Body Temperature Series of Two Beavers

cars Speed and Stopping Distances of Cars

chickwts Chicken Weights by Feed Type

co2 Mauna Loa Atmospheric CO2 Concentration

crimtab Student's 3000 Criminals Data

discoveries Yearly Numbers of Important Discoveries

esoph Smoking, Alcohol and (O)esophageal Cancer

euro Conversion Rates of Euro Currencies

euro.cross (euro) Conversion Rates of Euro Currencies

eurodist Distances Between European Cities and

Between US Cities

faithful Old Faithful Geyser Data

fdeaths (UKLungDeaths)

Monthly Deaths from Lung Diseases in the

UK

freeny Freeny's Revenue Data

freeny.x (freeny) Freeny's Revenue Data

freeny.y (freeny) Freeny's Revenue Data

infert Infertility after Spontaneous and Induced

Abortion

iris Edgar Anderson's Iris Data

iris3 Edgar Anderson's Iris Data

islands Areas of the World's Major Landmasses

ldeaths (UKLungDeaths)

Monthly Deaths from Lung Diseases in the

UK

lh Luteinizing Hormone in Blood Samples

longley Longley's Economic Regression Data

lynx Annual Canadian Lynx trappings 1821-1934

mdeaths (UKLungDeaths)

Monthly Deaths from Lung Diseases in the

UK

morley Michelson Speed of Light Data

mtcars Motor Trend Car Road Tests

nhtemp Average Yearly Temperatures in New Haven

nottem Average Monthly Temperatures at

Nottingham, 1920-1939

npk Classical N, P, K Factorial Experiment

occupationalStatus Occupational Status of Fathers and their

Sons

precip Annual Precipitation in US Cities

presidents Quarterly Approval Ratings of US

Presidents

pressure Vapor Pressure of Mercury as a Function of

Temperature

quakes Locations of Earthquakes off Fiji

randu Random Numbers from Congruential Generator

RANDU

rivers Lengths of Major North American Rivers

rock Measurements on Petroleum Rock Samples

sleep Student's Sleep Data

stack.loss (stackloss)

Brownlee's Stack Loss Plant Data

stack.x (stackloss)

Brownlee's Stack Loss Plant Data

stackloss Brownlee's Stack Loss Plant Data

state.abb (state) US State Facts and Figures

state.area (state) US State Facts and Figures

state.center (state)

US State Facts and Figures

state.division (state)

US State Facts and Figures

state.name (state) US State Facts and Figures

state.region (state)

US State Facts and Figures

state.x77 (state) US State Facts and Figures

sunspot.month Monthly Sunspot Data, from 1749 to

"Present"

sunspot.year Yearly Sunspot Data, 1700-1988

sunspots Monthly Sunspot Numbers, 1749-1983

swiss Swiss Fertility and Socioeconomic

Indicators (1888) Data

treering Yearly Treering Data, -6000-1979

trees Diameter, Height and Volume for Black

Cherry Trees

uspop Populations Recorded by the US Census

volcano Topographic Information on Auckland's

Maunga Whau Volcano

warpbreaks The Number of Breaks in Yarn during

Weaving

women Average Heights and Weights for American

Women

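3 Built-in datasets in Python

3.1 The datasets module in scikit-learn

According to the scikit-learn website, scikit-learn comes with roughly thirty datasets. Each dataset also has its own dedicated loader function.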
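One way to see which loader functions are available is to list the public names in sklearn.datasets. This is a rough sketch, assuming scikit-learn is installed; filtering by name prefix is my own shortcut, not an official listing API, and the exact names depend on the installed version.

```python
from sklearn import datasets

# Collect the public names that look like dataset helpers.
# load_*  : small datasets bundled with the package
# fetch_* : larger datasets downloaded on first use
# make_*  : generators for synthetic data
prefixes = ("load_", "fetch_", "make_")
loaders = sorted(name for name in dir(datasets) if name.startswith(prefixes))

for name in loaders:
    print(name)
```

A few load_ names (for example load_files) read user-supplied files rather than bundled data, so treat the output as a starting point and check the documentation for each function.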

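from sklearn import datasets ## import the datasets module

iris = datasets.load_iris() ## load the iris dataset

print(iris) ## output is too long to show here

You can also inspect the object by clicking through it in Spyder's variable explorer (shown as a figure in the original).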
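The object returned by load_iris is a dictionary-like Bunch rather than a data frame, which is why printing it is unwieldy. A minimal sketch of converting it to a pandas DataFrame follows, assuming pandas and scikit-learn are installed; the names iris_df, target, and species are my own choices.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix -> DataFrame, using the feature names stored in the Bunch.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the numeric target and a readable species column.
iris_df["target"] = iris.target
iris_df["species"] = iris.target_names[iris.target]

print(iris_df.head())
```

The resulting table has one row per flower and can be browsed or modeled in the same way as mtcars in R.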

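3.2 Datasets provided by the seaborn plotting package

Like scikit-learn, the seaborn plotting package also provides some classic datasets, such as Titanic. Note that sns.load_dataset fetches the data from seaborn's online repository, so it needs an internet connection.

import seaborn as sns

titanic=sns.load_dataset('titanic') ## load the titanic dataset

## You can also list the other datasets seaborn offers

sns.get_dataset_names()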

Out[2]:

['anagrams',

'anscombe',

'attention',

'brain_networks',

'car_crashes',

'diamonds',

'dots',

'exercise',

'flights',

'fmri',

'gammas',

'geyser',

'iris',

'mpg',

'penguins',

'planets',

'tips',

'titanic']

Reference: Dataset loading utilities, scikit-learn.org

---- End of article ----
