R for Data Science Summary: readr

As its name suggests, the readr package provides functions for reading data into R. Here we load the tidyverse directly, since it includes readr:

library(tidyverse)

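The main functions are:

  • Delimited files: read_csv() (comma), read_csv2() (semicolon), read_tsv() (tab), read_delim() (any delimiter)
  • Fixed-width and whitespace-separated files: read_fwf() (fixed width), read_table() (columns separated by whitespace)
  • Apache-style log files: read_log()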
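
As a rough illustration of how the delimited readers differ, the sketch below parses three small inline strings. It assumes readr 2.0 or later, where literal data is wrapped in I() and show_col_types is available; the sample data is made up for this example.

```r
library(readr)

# Semicolon-delimited with a decimal comma, common in many European locales
semi <- read_csv2(I("a;b\n1,5;2\n3,5;4"), show_col_types = FALSE)

# Tab-delimited
tabs <- read_tsv(I("a\tb\n1\t2\n3\t4"), show_col_types = FALSE)

# Any single-character delimiter, here a pipe
pipes <- read_delim(I("a|b\n1|2\n3|4"), delim = "|", show_col_types = FALSE)

semi
tabs
pipes
```

read_csv2() treats the comma as the decimal mark, so "1,5" is read as the number 1.5 rather than as two fields.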

Let's start with read_csv():

heights <- read_csv("data/heights.csv")
#> Parsed with column specification:
#> cols(
#>   earn = col_double(),
#>   height = col_double(),
#>   sex = col_character(),
#>   ed = col_integer(),
#>   age = col_integer(),
#>   race = col_character()
#> )

read_csv("a,b,c
1,2,3
4,5,6")
#> # A tibble: 2 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

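Unlike base R's read.csv(), read_csv() returns a tibble rather than a plain data.frame. This can cause trouble for older code that expects a data.frame. In that case you can read the file with read.csv() to get a data.frame and call as_tibble() only where a tibble is needed, or go the other way and call as.data.frame() on the result of read_csv().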
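
The conversion in both directions can be sketched as follows. The inline data is invented for illustration, and read.csv(text = ...) is used only to keep the example free of external files.

```r
library(tibble)

# Base R reader: returns a plain data.frame
df <- read.csv(text = "a,b\n1,2\n3,4")

# Convert to a tibble where tidyverse-style printing and subsetting are wanted
tb <- as_tibble(df)

# Convert back for older code that expects a plain data.frame
back <- as.data.frame(tb)

class(df)
class(tb)
class(back)
```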

Some special cases:

read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     2     3

read_csv("# A comment I want to skip
  x,y,z
  1,2,3", comment = "#")
#> # A tibble: 1 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     2     3

read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 x 3
#>      X1    X2    X3
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
#> # A tibble: 2 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 x 3
#>       a     b c    
#>   <int> <int> <chr>
#> 1     1     2 <NA>

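These options cover roughly 75% of the CSV files you will meet in practice. For the rest, read_tsv() handles tab-separated files and read_fwf() handles fixed-width files.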
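
For fixed-width files, a minimal sketch using the fwf-sample.txt file that ships with readr looks like this. The widths and column names follow the example in readr's own documentation; for your own files they would have to be worked out from the file layout.

```r
library(readr)

# Path to a small fixed-width sample file bundled with readr
fwf_sample <- readr_example("fwf-sample.txt")

# Describe each field by its width in characters and give it a name
fwf <- read_fwf(
  fwf_sample,
  fwf_widths(c(20, 10, 12), c("name", "state", "ssn"))
)

fwf
```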

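How readr parses a file

When readr reads data, it guesses the type of each column. You can see this heuristic at work with guess_parser(), which returns readr's best guess for a character vector, and parse_guess(), which uses that guess to parse the vector: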

guess_parser("2010-10-01")
#> [1] "date"
guess_parser("15:01")
#> [1] "time"
guess_parser(c("TRUE", "FALSE"))
#> [1] "logical"
guess_parser(c("1", "5", "9"))
#> [1] "integer"
guess_parser(c("12,352,561"))
#> [1] "number"

str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"
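
A few more cases may help show how the guess falls back to a more general type when values are mixed. Note that the output above comes from an older readr; in recent versions whole numbers are guessed as double unless guess_integer = TRUE is passed, which is why the last call below sets it explicitly.

```r
library(readr)

# A single decimal value makes the whole vector a double
mixed_numbers <- guess_parser(c("1", "2.5"))

# Anything that fits no stricter type falls back to character
mixed_text <- guess_parser(c("1", "a"))

# Opt in to integer guessing (assumes a readr version with guess_integer)
whole_numbers <- guess_parser(c("1", "5", "9"), guess_integer = TRUE)

mixed_numbers
mixed_text
whole_numbers
```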

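However, this approach has two problems:

  • readr only looks at the first 1000 rows when guessing. If those rows are numeric and later rows hold a different type (for example strings or decimals after a run of integers), parsing fails.
  • If the first 1000 rows of a column are all NA, readr guesses a character column here and does not consider what type the later values actually have.

We can try this on readr_example("challenge.csv"). The dataset has two columns, x and y: the first 1000 rows of x are integers and the rest are doubles, while the first 1000 rows of y are NA and the rest are dates: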

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 1000 parsing failures.
#>    row col   expected         actual       file
#>   1001 x     no trailing cha… .2383797508… '/home/travis/R/Library/readr…
#>   1002 x     no trailing cha… .4116799717… '/home/travis/R/Library/readr…
#>   1003 x     no trailing cha… .7460716762… '/home/travis/R/Library/readr…
#>   1004 x     no trailing cha… .7234505538… '/home/travis/R/Library/readr…
#>   1005 x     no trailing cha… .6145241374… '/home/travis/R/Library/readr…
#>   .... ...   ................ ............ ..............................
#> See problems(...) for more details.

Use problems() to retrieve the parsing failures:

problems(challenge)
#> # A tibble: 1,000 x 5
#>     row col   expected         actual       file                          
#>   <int> <chr> <chr>            <chr>        <chr>                         
#> 1  1001 x     no trailing cha… .2383797508… '/home/travis/R/Library/readr…
#> 2  1002 x     no trailing cha… .4116799717… '/home/travis/R/Library/readr…
#> 3  1003 x     no trailing cha… .7460716762… '/home/travis/R/Library/readr…
#> 4  1004 x     no trailing cha… .7234505538… '/home/travis/R/Library/readr…
#> 5  1005 x     no trailing cha… .6145241374… '/home/travis/R/Library/readr…
#> 6  1006 x     no trailing cha… .4739805692… '/home/travis/R/Library/readr…
#> # ... with 994 more rows
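
The same mechanism can be reproduced on a tiny, made-up input by forcing a column type that one value cannot satisfy. This sketch assumes readr 2.0 or later (for I()); the exact wording of the warning and of the expected column varies between versions.

```r
library(readr)

# "abc" cannot be parsed as an integer, so that cell becomes NA
# and the failure is recorded alongside the data
bad <- read_csv(
  I("x\n1\n2\nabc"),
  col_types = cols(x = col_integer())
)

# One row per parsing failure, with its location and the offending value
failures <- problems(bad)

bad
failures
```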

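A good strategy is to fix the column types one column at a time. Start by copying the guessed column specification into the call: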

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_integer(),
    y = col_character()
  )
)
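
Instead of copying the specification by hand, it can also be pulled from a previous read with spec() and passed back in. This is a small sketch on invented inline data, assuming readr 2.0 or later.

```r
library(readr)

csv_text <- "x,y\n1,a\n2,b"

# First read: let readr guess the column types
first_try <- read_csv(I(csv_text), show_col_types = FALSE)

# Extract the specification readr settled on
guessed <- spec(first_try)

# Second read: reuse that specification explicitly, ready to be edited
second_try <- read_csv(I(csv_text), col_types = guessed)

guessed
```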

Then change the type of x to double:

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)

tail(challenge)
#> # A tibble: 6 x 2
#>       x y         
#>   <dbl> <chr>     
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06

This fixes the first problem. The y column is still stored as character, so next we declare it as a date:

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

tail(challenge)
#> # A tibble: 6 x 2
#>       x y         
#>   <dbl> <date>    
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06
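
Two variations on the column specification are worth knowing: the compact string form, where each letter stands for one column ("d" for double, "D" for date), and an explicit format for dates that are not in ISO 8601. The inline data below is made up for illustration and assumes readr 2.0 or later.

```r
library(readr)

# Compact specification: one letter per column, in order
compact <- read_csv(I("x,y\n0.5,2019-11-21"), col_types = "dD")

# Non-ISO dates need an explicit format string
custom <- read_csv(
  I("x,y\n0.5,21/11/2019"),
  col_types = cols(
    x = col_double(),
    y = col_date(format = "%d/%m/%Y")
  )
)

compact
custom
```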

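As noted earlier, readr guesses column types from the first 1000 rows by default. Raising guess_max by just one row, to 1001, is enough for this file: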

challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
challenge2
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # ... with 1,994 more rows

Sometimes it is easier to diagnose problems if you read every column in as character:

challenge2 <- read_csv(readr_example("challenge.csv"), 
  col_types = cols(.default = col_character())
)

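This is especially useful together with type_convert(), which applies the parsing heuristics to the character columns of a data frame: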

df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 1     1.21 
#> 2 2     2.32 
#> 3 3     4.56

# Note the column types
type_convert(df)
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_double()
#> )
#> # A tibble: 3 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1  1.21
#> 2     2  2.32
#> 3     3  4.56
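
type_convert() also accepts an explicit col_types argument, which avoids relying on the guess. This matters because recent readr versions guess whole numbers as double rather than integer. A minimal sketch:

```r
library(readr)
library(tibble)

# All columns start out as character
raw <- tibble(
  x = c("1", "2", "3"),
  y = c("1.21", "2.32", "4.56")
)

# Spell out the target types instead of letting readr guess them
converted <- type_convert(
  raw,
  col_types = cols(x = col_integer(), y = col_double())
)

converted
```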

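Writing files

write_csv() and write_tsv() are the main functions for writing files. They always encode strings as UTF-8 and write dates and date-times in ISO 8601 format. To export a CSV file for Excel, use write_excel_csv(), which writes a special marker (a byte order mark) at the start of the file to tell Excel that the file is UTF-8 encoded.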

write_csv(challenge, "challenge.csv")
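
To see what write_excel_csv() adds, the sketch below writes a one-row table to a temporary file and inspects its first three bytes, which should be the UTF-8 byte order mark (EF BB BF). The table contents and the temporary path are illustrative only.

```r
library(readr)
library(tibble)

path <- tempfile(fileext = ".csv")

# A value with a non-ASCII character, written as an escape to keep the source ASCII
write_excel_csv(tibble(name = "caf\u00e9"), path)

# Read the raw bytes at the start of the file
first_bytes <- readBin(path, what = "raw", n = 3)
first_bytes
```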

Note that the column type information is lost when you save to CSV:

challenge
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # ... with 1,994 more rows
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> # A tibble: 2,000 x 2
#>       x y    
#>   <int> <chr>
#> 1   404 <NA> 
#> 2  4172 <NA> 
#> 3  3004 <NA> 
#> 4   787 <NA> 
#> 5    37 <NA> 
#> 6  2332 <NA> 
#> # ... with 1,994 more rows
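
If you have to stay with CSV, one way around the lost types is to supply the column specification again when reading the file back. A small sketch with made-up data and a temporary file:

```r
library(readr)
library(tibble)

original <- tibble(
  x = c(404, 0.805),
  y = as.Date(c(NA, "2019-11-21"))
)

path <- tempfile(fileext = ".csv")
write_csv(original, path)

# State the types explicitly instead of relying on the guess
restored <- read_csv(
  path,
  col_types = cols(x = col_double(), y = col_date())
)

restored
```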

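One alternative is write_rds() and read_rds(), which store data in RDS, R's own binary format. These two functions are thin wrappers around the base functions saveRDS() and readRDS():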

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # ... with 1,994 more rows
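
A quick way to convince yourself that RDS keeps the column types is to round-trip a small table and compare it with the original. The data and the temporary path are invented for this sketch.

```r
library(readr)
library(tibble)

original <- tibble(
  x = c(404, 0.805),
  y = as.Date(c(NA, "2019-11-21"))
)

path <- tempfile(fileext = ".rds")
write_rds(original, path)
restored <- read_rds(path)

# Column classes survive the round trip without any col_types argument
sapply(restored, class)
all.equal(original, restored)
```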

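Another option is the feather package, which implements a fast binary file format that can also be shared with other programming languages: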

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
#> # A tibble: 2,000 x 2
#>       x      y
#>   <dbl> <date>
#> 1   404   <NA>
#> 2  4172   <NA>
#> 3  3004   <NA>
#> 4   787   <NA>
#> 5    37   <NA>
#> 6  2332   <NA>
#> # ... with 1,994 more rows

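Reading other types of data

  • haven reads SPSS, Stata, and SAS files
  • readxl reads .xls and .xlsx files
  • DBI, together with a database-specific backend (e.g. RMySQL, RSQLite, RPostgreSQL), runs SQL queries against a database and returns a data frame
  • jsonlite reads JSON files
  • xml2 reads XML files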
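
As one small example from this list, jsonlite can turn a JSON array of records into a data frame, which can then be converted to a tibble. The JSON string below is made up for illustration.

```r
library(jsonlite)
library(tibble)

json_text <- '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'

# An array of objects with the same keys is simplified to a data.frame
records <- fromJSON(json_text)

# Convert to a tibble to work with it like the readr results above
records_tbl <- as_tibble(records)
records_tbl
```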

The full code for this post has been uploaded to GitHub.

10-07 11:28