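Summary: hands-on data science. Picking one good learning resource, then settling down to study it and practice with it, works far better than skimming bits and pieces from everywhere.
Workflow: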
1) Set the working directory to the folder that holds the data file (vehicles.csv.zip in the next step):
# setwd() sets R's current working directory; getwd() returns it
setwd("your file path")
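Before reading anything, it can help to confirm that the archive really is in the working directory. The following is a minimal sketch, not part of the original steps; data_dir is a hypothetical stand-in (here the session's temporary folder) for wherever the file actually lives.

```r
# Sketch: switch the working directory, check for the data file, switch back.
data_dir <- tempdir()                       # hypothetical stand-in for the real data folder
old_dir <- setwd(data_dir)                  # setwd() invisibly returns the previous directory
has_zip <- file.exists("vehicles.csv.zip")  # TRUE only if the archive is in this folder
print(has_zip)
setwd(old_dir)                              # restore the original working directory
```

Saving the value returned by setwd() makes it easy to return to the starting directory afterwards.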
2) Load the data. read.csv() can read straight from the zip archive, with no need to unpack it first:
vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"), stringsAsFactors = FALSE)
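Function usage:
# unz() opens a connection to a single file inside a zip archive.
# Arguments: description is the path to the zip file; filename is the name of the file inside it.
# Related connection functions: file, gzfile, bzfile, etc.
unz(description, filename, open = "", encoding = getOption("encoding"))
# read.csv() reads a CSV file; it is a wrapper around read.table().
# Before R 4.0.0, string columns (a column of names, say) were converted to factors by default
# (character -> factor); since R 4.0.0 the default is stringsAsFactors = FALSE.
# Passing stringsAsFactors = FALSE explicitly keeps them as character on any R version.
read.csv()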
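To see what stringsAsFactors changes, here is a small sketch that reads the same CSV text twice. The two-row data set is made up for illustration; the text argument of read.csv() is used only so the example needs no file on disk.

```r
# Made-up CSV content, read from a string instead of a file
csv_text <- "make,year\nSubaru,1993\nToyota,1985"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$make)   # "character"
class(as_fct$make)   # "factor"
levels(as_fct$make)  # "Subaru" "Toyota"
```

Character columns are usually easier to clean with string functions such as substr(); they can be converted to factors later when a model or plot needs them.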
3) Check that the data loaded by displaying the first or last few rows:
# head() and tail() return the first or last part of an object (data.frame, matrix, table, function)
> head(vehicles, n=5) # print the first 5 rows
> tail(vehicles, n=5) # print the last 5 rows
> tail(vehicles, n=2)
barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 cityA08U
37899 15.69571 0 0 0 18 0 0 0
37900 18.31167 0 0 0 16 0 0 0
cityCD cityE cityUF co2 co2A co2TailpipeAGpm co2TailpipeGpm comb08
37899 0 0 0 -1 -1 0 423.1905 21
37900 0 0 0 -1 -1 0 493.7222 18
comb08U combA08 combA08U combE combinedCD combinedUF cylinders displ
37899 0 0 0 0 0 0 4 2.2
37900 0 0 0 0 0 0 4 2.2
drive engId eng_dscr feScore fuelCost08
37899 4-Wheel or All-Wheel Drive 66030 (FFS) -1 1600
37900 4-Wheel or All-Wheel Drive 66031 (FFS,TRBO) -1 2250
fuelCostA08 fuelType fuelType1 ghgScore ghgScoreA highway08
37899 0 Regular Regular Gasoline -1 -1 24
37900 0 Premium Premium Gasoline -1 -1 21
highway08U highwayA08 highwayA08U highwayCD highwayE highwayUF hlv hpv
37899 0 0 0 0 0 0 0 0
37900 0 0 0 0 0 0 0 0
id lv2 lv4 make model mpgData phevBlended pv2 pv4 range
37899 9998 0 14 Subaru Legacy AWD N false 0 90 0
37900 9999 0 14 Subaru Legacy AWD Turbo N false 0 90 0
rangeCity rangeCityA rangeHwy rangeHwyA trany UCity UCityA
37899 0 0 0 0 Manual 5-spd 23 0
37900 0 0 0 0 Automatic 4-spd 20 0
UHighway UHighwayA VClass year youSaveSpend guzzler trans_dscr
37899 34 0 Compact Cars 1993 -1250
37900 29 0 Compact Cars 1993 -4500 CLKUP
tCharger sCharger atvType fuelType2 rangeA evMotor mfrCode c240Dscr
37899 NA
37900 TRUE
charge240b c240bDscr createdOn
37899 0 Tue Jan 01 00:00:00 EST 2013
37900 0 Tue Jan 01 00:00:00 EST 2013
modifiedOn startStop phevCity phevHwy phevComb
37899 Tue Jan 01 00:00:00 EST 2013 0 0 0
37900 Tue Jan 01 00:00:00 EST 2013 0 0 0
> nrow(vehicles) # 37900, the number of rows
> ncol(vehicles) # 83, the number of columns
> names(vehicles) # the column names
[1] "barrels08" "barrelsA08" "charge120" "charge240"
[5] "city08" "city08U" "cityA08" "cityA08U"
[9] "cityCD" "cityE" "cityUF" "co2"
[13] "co2A" "co2TailpipeAGpm" "co2TailpipeGpm" "comb08"
[17] "comb08U" "combA08" "combA08U" "combE"
[21] "combinedCD" "combinedUF" "cylinders" "displ"
[25] "drive" "engId" "eng_dscr" "feScore"
[29] "fuelCost08" "fuelCostA08" "fuelType" "fuelType1"
[33] "ghgScore" "ghgScoreA" "highway08" "highway08U"
[37] "highwayA08" "highwayA08U" "highwayCD" "highwayE"
[41] "highwayUF" "hlv" "hpv" "id"
[45] "lv2" "lv4" "make" "model"
[49] "mpgData" "phevBlended" "pv2" "pv4"
[53] "range" "rangeCity" "rangeCityA" "rangeHwy"
[57] "rangeHwyA" "trany" "UCity" "UCityA"
[61] "UHighway" "UHighwayA" "VClass" "year"
[65] "youSaveSpend" "guzzler" "trans_dscr" "tCharger"
[69] "sCharger" "atvType" "fuelType2" "rangeA"
[73] "evMotor" "mfrCode" "c240Dscr" "charge240b"
[77] "c240bDscr" "createdOn" "modifiedOn" "startStop"
[81] "phevCity" "phevHwy" "phevComb"
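With 83 columns, printing whole rows is hard to read. A compact alternative, sketched here on a made-up three-row data frame (toy and its values are illustrative, not taken from vehicles.csv), is to look at the dimensions and column types instead.

```r
# Made-up miniature data frame standing in for vehicles
toy <- data.frame(make = c("Subaru", "Subaru", "Toyota"),
                  year = c(1993L, 1993L, 1985L),
                  comb08 = c(21, 18, 30),
                  stringsAsFactors = FALSE)

dim(toy)            # 3 3: rows, then columns
str(toy)            # one line per column: name, type, first few values
sapply(toy, class)  # class of every column as a named character vector
```

The same three calls work unchanged on the full vehicles data frame.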
# One of the columns is year (e.g. 1992); we can check how many distinct years the data covers
> unique(vehicles[,"year"]) # note: values come back in order of first appearance in the data, not sorted
[1] 1985 1993 1994 1995 1996 1997 1998 1999 2000 2001 1986 2002 2003 2004 2005
[16] 2006 2007 2008 2009 2010 1984 1987 1988 1989 1990 1991 1992 2011 2012 2013
[31] 2014 2015 2016 2017
> length(unique(vehicles[,"year"]))
[1] 34
> first_year <- min(vehicles[,"year"]) # 1984, the earliest year in the list above
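Because unique() keeps values in order of first appearance, the earliest and latest years are easier to read off after sorting. A small sketch on a made-up vector of years (the real column would be vehicles$year):

```r
# Made-up sample of model years, deliberately unsorted and with a repeat
years <- c(1985, 1993, 1994, 1984, 2017, 1993)

sort(unique(years))    # 1984 1985 1993 1994 2017
range(years)           # 1984 2017 (minimum and maximum in one call)
length(unique(years))  # 5
```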
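Next, look at the fuel types and how many vehicles use each one. The counts show that most vehicles (26348 of 37900, about 70%) run on regular gasoline.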
> unique(vehicles[,"fuelType1"])
[1] "Regular Gasoline" "Premium Gasoline" "Diesel"
[4] "Natural Gas" "Electricity" "Midgrade Gasoline"
> table(vehicles$fuelType1) # same as table(vehicles[,"fuelType1"])
           Diesel       Electricity Midgrade Gasoline       Natural Gas
             1101               121                74                60
 Premium Gasoline  Regular Gasoline
            10196             26348
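Function usage:
# table() counts how often each distinct value occurs
# $ extracts one column from a data frame by name; a data frame is a list of columns with the S3 class "data.frame"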
> table(c(8,9,8,7,8,9,8,7,8,9))
7 8 9
2 5 3
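The result of table() can be sorted and turned into proportions, which makes a statement like "most vehicles use regular gasoline" easy to check. A sketch on a made-up fuel vector (the labels and counts are illustrative only):

```r
# Made-up fuel labels standing in for vehicles$fuelType1
fuel <- c("Regular", "Premium", "Regular", "Diesel", "Regular", "Premium")

counts <- table(fuel)
sorted <- sort(counts, decreasing = TRUE)  # most frequent value first
shares <- prop.table(counts)               # counts divided by their total

names(sorted)[1]               # "Regular"
as.numeric(shares["Regular"])  # 0.5
```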
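Explore the trany column, which records the transmission type, e.g. Manual 5-spd or Automatic 4-spd.
# Replace empty strings with NA so that missing values are marked as missing
vehicles$trany[vehicles$trany == ""] <- NA
# We only care whether the transmission is automatic or manual, so use substr() to take the first 4 characters of trany and build a new variable, trany2
# substr(x, start, stop), e.g. substr("sinablog", 2, 4) gives [1] "ina"
# substring(text, first, last = 1000000L) needs no end position; by default it runs to the end of the string
vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4) == "Auto", "Auto", "Manual")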
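The recoding above can be tried on a few made-up values to see why the empty strings are turned into NA first: an NA stays NA in trany2 instead of being silently labelled "Manual". The sample strings below are illustrative, not rows copied from the data.

```r
# Made-up transmission descriptions, including one empty string
trany <- c("Manual 5-spd", "Automatic 4-spd", "", "Automatic (S6)")

trany[trany == ""] <- NA  # mark the missing description as NA
trany2 <- ifelse(substr(trany, 1, 4) == "Auto", "Auto", "Manual")

trany2                          # "Manual" "Auto" NA "Auto"
table(trany2, useNA = "ifany")  # the NA gets its own count
```

Without the NA step, the empty string would fail the "Auto" test and be counted as a manual transmission.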
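Function usage:
ifelse() is a function, not an operator. It is a vectorized counterpart of the C-style max = x > y ? x : y
> x <- c(11,3,5,13,20)
> y <- c(2,4,9,13,1)
> ifelse(x>y,x,y) # element-wise maximum of x and y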
[1] 11 4 9 13 20
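For this particular case base R already has pmax(), which gives the same element-wise maximum. The sketch below compares the two and also shows how ifelse() passes an NA in the test through to the result.

```r
x <- c(11, 3, 5, 13, 20)
y <- c(2, 4, 9, 13, 1)

via_ifelse <- ifelse(x > y, x, y)  # pick x where x > y, otherwise y
via_pmax <- pmax(x, y)             # built-in parallel maximum

identical(via_ifelse, via_pmax)          # TRUE
ifelse(c(TRUE, NA, FALSE), "yes", "no")  # "yes" NA "no"
```

ifelse() remains the more general tool, since the two branches can be any values and not only the compared vectors.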
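A data frame is a table-like structure made of rows and columns. Unlike a matrix, whose elements must all share one type, each column of a data frame can have its own data type (much like the table type in recent MATLAB releases). Every column has a name, and each row can be given a row name; if none is given, rows are indexed starting from 1. Use data.frame() to create one.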
> student <- data.frame(ID=c(11,12,13),Name=c("Devin","Jacd","Pace"),Gender=c("M","F","M"))
> student
ID Name Gender
1 11 Devin M
2 12 Jacd F
3 13 Pace M
> names(student)
[1] "ID" "Name" "Gender"
# student$Gender=="F" returns a logical vector, as in MATLAB; the [row, column] indexing resembles Python slicing
> student[student$Gender=="F",]
ID Name Gender
2 12 Jacd F
> student[student$Gender=="F","ID"]
[1] 12
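Two details of this kind of indexing are worth a quick sketch: selecting a single column returns a plain vector unless drop = FALSE is given, and subset() offers a more readable way to write the same filter. The student data frame is rebuilt here so the block runs on its own.

```r
student <- data.frame(ID = c(11, 12, 13),
                      Name = c("Devin", "Jacd", "Pace"),
                      Gender = c("M", "F", "M"),
                      stringsAsFactors = FALSE)

# Rows chosen by a condition, columns chosen by name
males <- student[student$Gender == "M", c("ID", "Name")]

# drop = FALSE keeps a one-column result as a data frame instead of a vector
ids_df <- student[student$Gender == "F", "ID", drop = FALSE]

# subset() expresses the same row and column selection
same <- subset(student, Gender == "M", select = c(ID, Name))
```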
Reference: 数据科学实战手册(R+Python) (Practical Data Science Cookbook), Chapter 2.