R: Converting Between Factors and Strings

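When importing a large dataset in R versions before 4.0.0, all character columns are converted to factors by default unless you explicitly specify “stringsAsFactors = FALSE”, which slows down data processing. (Since R 4.0.0 the default is FALSE.)

The sample data is as follows: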

name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018

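The summary of the data shows that the strings were converted to factors and counted per level (building the factor levels is also one of the reasons processing is slower). The summary is as follows: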

  name        math         english     sex        year     
 guagua:1   Min.   :65.0   Min.   :68.0   F:2   Min.   :2018  
 MM    :1   1st Qu.:72.5   1st Qu.:75.5   M:2   1st Qu.:2018  
 yiifaa:1   Median :80.0   Median :83.0         Median :2018  
 yiifee:1   Mean   :80.0   Mean   :83.0         Mean   :2018  
            3rd Qu.:87.5   3rd Qu.:90.5         3rd Qu.:2018  
            Max.   :95.0   Max.   :98.0         Max.   :2018  
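
To reproduce this summary without an external file, the sample data can be passed to read.table through its “text” argument. This is only a minimal sketch: the inline string and the variable name csv_text are assumptions for illustration, not part of the original script.

```r
# Sample data inlined so the sketch runs without Score.txt
# (csv_text is a hypothetical name used only for this illustration)
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Ask for factor conversion explicitly so the result does not depend on the R version
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     quote = "\"", stringsAsFactors = TRUE)

# The character columns (name, sex) are now factors
print(sapply(scores, class))

# summary() reports a count per level for each factor column
print(summary(scores))
```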

But such per-level counts are not meaningful here, so the columns need to be converted to character with “as.character”, as follows:

#!/usr/bin/env Rscript
setwd("D:/Workspace/R-Works/R-Stat")
scores <- read.table("Score.txt", header = TRUE, sep = ",", quote = "\"", encoding = "UTF-8", stringsAsFactors = TRUE)
# Convert the factor to character
scores$name <- as.character(scores$name)
# Convert one more column as a test
scores$sex <- as.character(scores$sex)
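
When many columns are affected, converting them one at a time gets tedious. A possible sketch that converts every factor column in one pass is shown here. The small data frame is built inline as an assumption for illustration, and is_fct is a hypothetical helper name.

```r
# Small stand-in for the imported data (built inline for illustration only)
scores <- data.frame(
  name = factor(c("yiifaa", "yiifee", "guagua", "MM")),
  math = c(65, 95, 75, 85),
  sex  = factor(c("M", "F", "M", "F"))
)

# Logical vector marking which columns are factors
is_fct <- sapply(scores, is.factor)

# Replace only those columns with their character equivalents
scores[is_fct] <- lapply(scores[is_fct], as.character)

# name and sex are now character; math is left untouched
str(scores)
```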

Checking the summary again gives:

name                math         english         sex                 year     
 Length:4           Min.   :65.0   Min.   :68.0   Length:4           Min.   :2018  
 Class :character   1st Qu.:72.5   1st Qu.:75.5   Class :character   1st Qu.:2018  
 Mode  :character   Median :80.0   Median :83.0   Mode  :character   Median :2018  
                    Mean   :80.0   Mean   :83.0                      Mean   :2018  
                    3rd Qu.:87.5   3rd Qu.:90.5                      3rd Qu.:2018  
                    Max.   :95.0   Max.   :98.0                      Max.   :2018  

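As you can see, the per-level counts are gone from the summary and a total element count (Length) appears instead. To restore the per-level counts, the factor has to be created again, as follows: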

# sex is a nominal category, so an unordered factor with explicit levels is sufficient
scores$sex <- factor(scores$sex, levels = c("M", "F"))
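
The round trip between strings and a factor can be checked on a standalone vector. This is a small sketch under the assumption that the column holds the four values from the sample data; sex and sex_f are illustrative names.

```r
# The sex column from the sample data as a plain character vector
sex <- c("M", "F", "M", "F")

# String -> factor, with the level order given explicitly
sex_f <- factor(sex, levels = c("M", "F"))

# The levels follow the order passed in, not alphabetical order
print(levels(sex_f))

# summary() on a factor gives the count per level again
print(summary(sex_f))

# Factor -> string recovers the original values
sex_back <- as.character(sex_f)
print(identical(sex_back, sex))
```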

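Conclusion

When importing large datasets, to improve performance, work in two steps where possible:
1. Explicitly specify “stringsAsFactors = FALSE”;
2. Convert only the columns (vectors) you need into factors, one by one.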
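
The two steps above might look like the following sketch. The data is inlined through the “text” argument so the example is self-contained, and csv_text and factor_cols are hypothetical names introduced for illustration.

```r
# Sample data inlined for illustration; a real script would read Score.txt instead
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Step 1: import with strings kept as character
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     quote = "\"", stringsAsFactors = FALSE)

# Step 2: convert only the columns that are real categories
factor_cols <- c("sex")
scores[factor_cols] <- lapply(scores[factor_cols], factor)

# name stays character, sex becomes a factor with two levels
str(scores)
```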
