Reading Notes: Data Science in Practice, Chapter 2: Visual Analysis of Automobile Data

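Summary: For hands-on data science, picking one good learning resource and then settling down to study and practice it works much better than skimming many sources.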

Workflow:
1) Set the working directory to the folder containing the vehicles.csv data file:

# setwd() sets R's current working directory; getwd() returns it
setwd("your file path")

2) Load the data; it can be read directly from the zip file:

vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"), stringsAsFactors = FALSE)

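Function usage:

# Arguments: description: path of the zip file, filename: name of the file inside the zip. Similar functions: file, gzfile, bzfile, etc.
unz(description, filename, open="",encoding=getOption("encoding"))
# read.csv() reads a CSV file and is a wrapper around read.table(). Before R 4.0.0, string columns (e.g. a column of names)
# were converted to factors by default (character -> factor). Pass stringsAsFactors = FALSE to keep them as character vectors.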
read.csv()
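
To make the effect of stringsAsFactors concrete, here is a minimal sketch that reads a small inline CSV through the text argument instead of the real file. The sample rows are made up for illustration and are not taken from vehicles.csv.

```r
# Small inline CSV standing in for vehicles.csv (illustrative sample, not the real file)
csv_text <- "make,year\nSubaru,1993\nSubaru,1993\nToyota,1985"

# Read the same text twice, once with and once without factor conversion
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$make)   # "factor"
class(as_char$make)     # "character"
levels(as_factor$make)  # the distinct values stored as factor levels
```

Keeping strings as character vectors is usually more convenient for text operations such as substr(), which is why the notes pass stringsAsFactors = FALSE when loading.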

3) Check that the data has loaded by displaying the first few rows:

# Return the first or last part of an Object (data.frame,matrix,table,function)
> head(vehicles, n=5)  # print the first 5 rows
> tail(vehicles, n=5)  # print the last 5 rows
> tail(vehicles, n=2)
      barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 cityA08U
37899  15.69571          0         0         0     18       0       0        0
37900  18.31167          0         0         0     16       0       0        0
      cityCD cityE cityUF co2 co2A co2TailpipeAGpm co2TailpipeGpm comb08
37899      0     0      0  -1   -1               0       423.1905     21
37900      0     0      0  -1   -1               0       493.7222     18
      comb08U combA08 combA08U combE combinedCD combinedUF cylinders displ
37899       0       0        0     0          0          0         4   2.2
37900       0       0        0     0          0          0         4   2.2
                           drive engId   eng_dscr feScore fuelCost08
37899 4-Wheel or All-Wheel Drive 66030      (FFS)      -1       1600
37900 4-Wheel or All-Wheel Drive 66031 (FFS,TRBO)      -1       2250
      fuelCostA08 fuelType        fuelType1 ghgScore ghgScoreA highway08
37899           0  Regular Regular Gasoline       -1        -1        24
37900           0  Premium Premium Gasoline       -1        -1        21
      highway08U highwayA08 highwayA08U highwayCD highwayE highwayUF hlv hpv
37899          0          0           0         0        0         0   0   0
37900          0          0           0         0        0         0   0   0
        id lv2 lv4   make            model mpgData phevBlended pv2 pv4 range
37899 9998   0  14 Subaru       Legacy AWD       N       false   0  90     0
37900 9999   0  14 Subaru Legacy AWD Turbo       N       false   0  90     0
      rangeCity rangeCityA rangeHwy rangeHwyA           trany UCity UCityA
37899         0          0        0         0    Manual 5-spd    23      0
37900         0          0        0         0 Automatic 4-spd    20      0
      UHighway UHighwayA       VClass year youSaveSpend guzzler trans_dscr
37899       34         0 Compact Cars 1993        -1250                   
37900       29         0 Compact Cars 1993        -4500              CLKUP
      tCharger sCharger atvType fuelType2 rangeA evMotor mfrCode c240Dscr
37899       NA                                                           
37900     TRUE                                                           
      charge240b c240bDscr                    createdOn
37899          0           Tue Jan 01 00:00:00 EST 2013
37900          0           Tue Jan 01 00:00:00 EST 2013
                        modifiedOn startStop phevCity phevHwy phevComb
37899 Tue Jan 01 00:00:00 EST 2013                  0       0        0
37900 Tue Jan 01 00:00:00 EST 2013                  0       0        0
> nrow(vehicles)  # 37900, total number of rows
> ncol(vehicles)  # 83, total number of columns
> names(vehicles)  # names of all columns
 [1] "barrels08"       "barrelsA08"      "charge120"       "charge240"      
 [5] "city08"          "city08U"         "cityA08"         "cityA08U"       
 [9] "cityCD"          "cityE"           "cityUF"          "co2"            
[13] "co2A"            "co2TailpipeAGpm" "co2TailpipeGpm"  "comb08"         
[17] "comb08U"         "combA08"         "combA08U"        "combE"          
[21] "combinedCD"      "combinedUF"      "cylinders"       "displ"          
[25] "drive"           "engId"           "eng_dscr"        "feScore"        
[29] "fuelCost08"      "fuelCostA08"     "fuelType"        "fuelType1"      
[33] "ghgScore"        "ghgScoreA"       "highway08"       "highway08U"     
[37] "highwayA08"      "highwayA08U"     "highwayCD"       "highwayE"       
[41] "highwayUF"       "hlv"             "hpv"             "id"             
[45] "lv2"             "lv4"             "make"            "model"          
[49] "mpgData"         "phevBlended"     "pv2"             "pv4"            
[53] "range"           "rangeCity"       "rangeCityA"      "rangeHwy"       
[57] "rangeHwyA"       "trany"           "UCity"           "UCityA"         
[61] "UHighway"        "UHighwayA"       "VClass"          "year"           
[65] "youSaveSpend"    "guzzler"         "trans_dscr"      "tCharger"       
[69] "sCharger"        "atvType"         "fuelType2"       "rangeA"         
[73] "evMotor"         "mfrCode"         "c240Dscr"        "charge240b"     
[77] "c240bDscr"       "createdOn"       "modifiedOn"      "startStop"      
[81] "phevCity"        "phevHwy"         "phevComb"   
# One of the columns is year (e.g. 1992); we can check how many distinct years the data covers
> unique(vehicles[,"year"])  # note: values appear in order of first occurrence in the data, not sorted
 [1] 1985 1993 1994 1995 1996 1997 1998 1999 2000 2001 1986 2002 2003 2004 2005
[16] 2006 2007 2008 2009 2010 1984 1987 1988 1989 1990 1991 1992 2011 2012 2013
[31] 2014 2015 2016 2017
> length(unique(vehicles[,"year"]))
  [1] 34
> first_year <- min(vehicles[,"year"])  # 1984
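
Because unique() keeps the order of first appearance, it can help to sort the result or ask for the range directly. The following is a small sketch on a toy data frame; the name toy and its values are assumptions for illustration, not part of vehicles.

```r
# Toy data frame standing in for vehicles (values are illustrative)
toy <- data.frame(year = c(1985, 1993, 1984, 1993, 1985),
                  make = c("Toyota", "Subaru", "Ford", "Subaru", "Toyota"))

unique(toy$year)        # 1985 1993 1984: order of first appearance
sort(unique(toy$year))  # 1984 1985 1993: ascending order
range(toy$year)         # 1984 1993: smallest and largest year
dim(toy)                # 5 2: rows and columns in one call
```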

We can also look at the fuel types and how many vehicles use each one. The counts show that most cars run on regular gasoline.

> unique(vehicles[,"fuelType1"])
[1] "Regular Gasoline"  "Premium Gasoline"  "Diesel"           
[4] "Natural Gas"       "Electricity"       "Midgrade Gasoline"
> table(vehicles$fuelType1) # same as table(vehicles[,"fuelType1"])
           Diesel       Electricity Midgrade Gasoline       Natural Gas 
             1101               121                74                60 
 Premium Gasoline  Regular Gasoline 
            10196             26348 

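Function usage:

# table() counts how often each distinct value occurs
# $ extracts one column from a data frame by name; a data frame is a list with the S3 class "data.frame"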
> table(c(8,9,8,7,8,9,8,7,8,9))
7 8 9 
2 5 3 
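
To put a number on "most cars run on regular gasoline", the counts can be turned into shares. This is a sketch that copies the counts from the table(vehicles$fuelType1) output above into a named vector (the names counts and share are mine), so it runs without the data file.

```r
# Counts copied from the table(vehicles$fuelType1) output above
counts <- c("Diesel" = 1101, "Electricity" = 121, "Midgrade Gasoline" = 74,
            "Natural Gas" = 60, "Premium Gasoline" = 10196,
            "Regular Gasoline" = 26348)

sum(counts)                                # 37900, matches nrow(vehicles)
names(sort(counts, decreasing = TRUE))[1]  # "Regular Gasoline"

# Share of each fuel type in percent, rounded to one decimal place
share <- round(100 * prop.table(counts), 1)
share["Regular Gasoline"]                  # 69.5
```

On the real data the same calls should work directly on the result of table(vehicles$fuelType1).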

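Next, explore the transmission column trany, which holds values such as Manual 5-spd and Automatic 4-spd.

# Replace empty strings with NA (missing value)
vehicles$trany[vehicles$trany == ""] <- NA
# We only care whether the transmission is automatic or manual, so use substr() to take the first 4 characters of trany and build a new variable trany2
# substr(x,start,stop), e.g. substr("sinablog",2,4) gives [1] "ina"
# substring(text,first,last=1000000) does not need an end position; it defaults to the end of the string
vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4)=="Auto","Auto","Manual")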
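
The same two steps can be tried on a short vector to see how missing values behave. This is a sketch with sample strings chosen for illustration; the point is that an NA in trany stays NA in trany2 instead of being labeled "Manual".

```r
# Sample transmission strings (illustrative, not read from the data file)
trany <- c("Manual 5-spd", "Automatic 4-spd", "", "Automatic (S6)")

trany[trany == ""] <- NA  # treat empty strings as missing
trany2 <- ifelse(substr(trany, 1, 4) == "Auto", "Auto", "Manual")

trany2                          # "Manual" "Auto" NA "Auto"
table(trany2, useNA = "ifany")  # counts per label, including the NA
```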

Function usage:
ifelse() is a function, not an operator. It is the vectorized counterpart of the C-style ternary max=x>y?x:y

> x = c(11,3,5,13,20)
> y = c(2,4,9,13,1)
> ifelse(x>y,x,y)  # element-wise: picks the larger of x and y at each position
[1] 11  4  9 13 20
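
As a cross-check, the element-wise maximum is also available as the built-in pmax(), and ifelse() can return labels rather than values. A minimal sketch using the same x and y:

```r
x <- c(11, 3, 5, 13, 20)
y <- c(2, 4, 9, 13, 1)

ifelse(x > y, x, y)      # 11 4 9 13 20
pmax(x, y)               # same result via the built-in parallel maximum
ifelse(x > y, "x", "y")  # "x" "y" "y" "y" "x": labels instead of values
```

Note that the tie at position 4 (13 vs 13) takes the "no" branch, because 13 > 13 is FALSE.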

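A Data Frame is like a table made of rows and columns. Unlike a Matrix, whose elements must all share one type, each column of a Data Frame can hold a different data type (it resembles the table type in recent versions of Matlab). Every column has a name, and each row can be given a name as well; if no row names are specified, rows are indexed starting from 1. A Data Frame can be created with data.frame().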

> student <- data.frame(ID=c(11,12,13),Name=c("Devin","Jacd","Pace"),Gender=c("M","F","M"))
> student
  ID  Name Gender
1 11 Devin      M
2 12  Jacd      F
3 13  Pace      M
> names(student)
[1] "ID"     "Name"   "Gender"
# student$Gender=="F" returns a logical vector, as in Matlab; the [rows, columns] indexing resembles slicing in Python
> student[student$Gender=="F",] 
  ID Name Gender
2 12 Jacd      F
> student[student$Gender=="F","ID"]
[1] 12
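
A few more operations on the same student data frame, as a sketch: checking the per-column types, filtering on a numeric column, and inspecting the default row names. stringsAsFactors = FALSE is set explicitly here so the result does not depend on the R version.

```r
student <- data.frame(ID = c(11, 12, 13),
                      Name = c("Devin", "Jacd", "Pace"),
                      Gender = c("M", "F", "M"),
                      stringsAsFactors = FALSE)

sapply(student, class)                     # ID is numeric; Name and Gender are character
student[student$Gender == "M", "Name"]     # "Devin" "Pace"
student[student$ID > 11, c("ID", "Name")]  # rows 2 and 3, two columns
rownames(student)                          # "1" "2" "3": default row names
```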

[1] 数据科学实战手册 (R+Python), Chapter 2.
