Counting the comma-separated numbers in Excel cells read into R

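I have a .csv file that I read into R. One column contains cells such as the following (suppose):

Cell C1 = 2,3  C2 = 1,2,3,4  C3 = 1, and so on…

Edit: C1 refers to column C, row 1.

In R, I just want to get the count of numbers in each of these cells. How can I do this?

Does anyone have any clue?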

This is how I read the data exported from Excel:

    data <- read.csv("location", header = TRUE)

I need to compute the length of each cell in one of the data columns.

    V24
    1,2,3,4
    1,2,3,4
    1,4,2,3
    1,2,4,3
    1,3,2,4
    4,3,1,2

The data is too large, so I can't paste all of it here.

A snapshot of the data: 12 columns and 35 rows.

Edit 1

 dput(string_data) structure(list(v_1 = c(NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 2L ), v_2 = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA, 2L), v_3 = structure(c(1L, 1L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 1L, 2L), .Label = c("", "1,4", "2", "2,1", "2,4", "3", "4"), class = "factor"), v_4 = c(NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, 0L, 0L, NA, NA, NA, NA, 0L, 2L, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v_5 = c(NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA, 2L, 2L, NA, NA, NA, NA, 2L, 0L, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v_6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v_7 = c(NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 0L, NA, NA, 0L, NA, NA, 1L, NA, NA, 0L, 0L, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 0L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA), v_8 = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 1L, NA, NA, 1L, NA, NA, 0L, NA, NA, 1L, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 1L, NA, NA, NA, 
NA, 0L, NA, NA, NA, NA, NA, NA, 0L, NA, NA), v_9 = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA, 1L, NA, NA, 3L, NA, NA, 4L, NA, NA, 3L, 3L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4L, 3L, NA, NA, NA, NA, 4L, NA, NA, NA, NA, NA, NA, 4L, NA, NA), v_10 = c(NA, 5L, NA, NA, NA, 0L, 3L, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3L, NA, NA, 3L, NA, NA, NA, NA, NA, NA), v_11 = c(NA, 0L, NA, NA, NA, 0L, 2L, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, 2L, NA, NA, NA, NA, NA, NA ), v_12 = structure(c(1L, 4L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("", "3", "4", "4,1,3"), class = "factor")), .Names = c("v_1", "v_2", "v_3", "v_4", "v_5", "v_6", "v_7", "v_8", "v_9", "v_10", "v_11", "v_12"), class = "data.frame", row.names = c(NA, -58L )) 

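We can use str_trim to remove leading/trailing whitespace (if any) and str_count to count the separators; the number of items in a cell is the separator count plus one. Blank cells can be detected with nzchar, and we can set those elements to NA.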

    library(stringr)
    dat1$V24 <- str_trim(dat1$V24)
    with(dat1, str_count(V24, ',')+1 * NA^!nzchar(V24))
    #[1] NA 4 NA NA NA NA 4 NA 4 NA NA 3 NA NA NA NA 3
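The `NA^!nzchar(...)` part of that expression is terse, so here is a minimal base-R sketch of how the mask behaves. The vectors `x` and `commas` are made-up illustration values, not part of the original data. It relies on `NA^1` being `NA` and `NA^0` being `1`, so blank cells turn into `NA` while non-blank cells get the `+ 1`.

```r
# Illustrative values only: one blank cell, one cell with two items, one with a single item.
x <- c("", "1,2", "7")
commas <- c(0, 1, 0)        # separator counts that str_count(x, ',') would give

blank <- !nzchar(x)         # TRUE where the cell is empty
mask <- NA^blank            # NA^TRUE is NA, NA^FALSE is 1

# Same arithmetic as in the answer, with explicit parentheses:
# blank cells become NA, the others become (number of commas + 1).
result <- commas + 1 * NA^(!nzchar(x))
print(mask)
print(result)
```

Because `!` binds very loosely in R, the unparenthesised form used in the answer is parsed the same way; the parentheses here are only for readability.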

The similar functions from stringi would be faster:

    library(stringi)
    dat1$V24 <- stri_trim_both(dat1$V24)
    with(dat1, stri_count(V24, fixed= ',')+1 * NA^!nzchar(V24))
    #[1] NA 4 NA NA NA NA 4 NA 4 NA NA 3 NA NA NA NA 3
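If neither package is available, roughly the same result can be sketched in base R by splitting on the comma and taking the lengths. This is not the approach used in the answer above; `count_items` is a hypothetical helper name, and it assumes R 3.2.0 or later for `trimws()` and `lengths()`.

```r
# Hypothetical helper: number of comma-separated items per cell, NA for blank or missing cells.
count_items <- function(x) {
  x <- trimws(as.character(x))                   # also converts factors to character
  n <- lengths(strsplit(x, ",", fixed = TRUE))   # items per cell
  n[is.na(x) | !nzchar(x)] <- NA_integer_        # blank or NA cells are reported as NA
  n
}

counts <- count_items(c("", "1,2,3,4", " ", NA, "1,3,2"))
print(counts)
```

One difference to be aware of: a trailing comma such as "1,2," counts as two items here, whereas counting separators and adding one would give three.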

Update

If you want to do this for every third column of the dataset:

    indx <- seq(1, ncol(dat2), by=3)
    lapply(dat2[indx], function(x) {
      r1 <- str_trim(x)
      str_count(r1, ',')+1 * NA^!nzchar(r1)
    })
    #$V1
    #[1] 1 4 1 1 3
    #$V4
    #[1] 4 1 2 3 NA

where

    dat2[indx]
    #       V1       V4
    #1       1  1,2,5,6
    #2 1,2,3,4        1
    #3      3       1,2
    #4       1 15,23,24
    #5   1,2,3

UPDATE2

    lapply(dat3[indx], function(x) {
      r1 <- str_trim(x)
      str_count(r1, ',')+1 * NA^is.na(r1)
    })
    #$V1
    #[1] 1 4 1 NA 3
    #$V4
    #[1] 4 NA 2 3 NA

UPDATE3

According to the dput output of string_data, only two columns (3 and 12) are of class factor; these are the ones that hold string elements such as 2,4.

    indx1 <- sapply(string_data, is.factor)
    lapply(string_data[indx1], function(x) {
      r1 <- str_trim(x)
      str_count(r1, ',')+1 * NA^!nzchar(r1)
    })
    #$v_3
    #[1] NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA
    #[26] NA 1 NA NA NA NA NA 2 NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
    #[51] NA NA 2 NA NA NA NA 2
    #$v_12
    #[1] NA 3 NA NA NA NA 1 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    #[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
    #[51] NA 1 NA NA NA NA NA NA

All other variables are of class integer or logical.

Data
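One caveat with selecting columns by `is.factor`: since R 4.0.0, `read.csv` no longer converts strings to factors by default, so the same columns would arrive as character. A hedged base-R sketch that covers both cases follows; the data frame `df` and its column names are invented stand-ins for illustration, not the asker's data.

```r
# Invented stand-in for a data frame read with read.csv(); column names are hypothetical.
df <- data.frame(
  a = c("", "1,4", "2"),               # character column
  b = c(NA, 3L, 1L),                   # integer column, should be skipped
  c = factor(c("4,1,3", "", "3")),     # factor column, as in older R versions
  stringsAsFactors = FALSE
)

# Pick columns that hold text, whether stored as factor or character.
is_text <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))

counts_by_col <- lapply(df[is_text], function(col) {
  x <- trimws(as.character(col))
  n <- lengths(strsplit(x, ",", fixed = TRUE))
  n[is.na(x) | !nzchar(x)] <- NA_integer_   # blank cells become NA
  n
})
print(counts_by_col)
```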

    dat1 <- data.frame(V24=c('', '1,2,3,4', ' ', '', '', ' ', '1,2,3,4', '',
                             '1,4,2,3', '', '', '1,2,4', ' ', '', '', ' ', '1,3,2'),
                       stringsAsFactors=FALSE)

    dat2 <- data.frame(V1=c('1', '1,2,3,4', '3 ', '1', '1,2,3'),
                       V2=1:5, V3=6:10,
                       V4=c('1,2,5,6', '1', '1,2', '15,23,24', ' '),
                       V6=11:15, stringsAsFactors=FALSE)

    dat3 <- data.frame(V1= c('1', '1,2,3,4', '3 ', NA, '1,2,3'),
                       V2=1:5, V3=6:10,
                       V4=c('1,2,5,6', NA, '1,2', '15,23,24', NA),
                       stringsAsFactors=FALSE)

In base R, the function that read.table uses for this is count.fields, which you can use like this (with @akrun's sample data):

    count.fields(textConnection(dat1$V24), sep = ",", blank.lines.skip = FALSE)
    # [1] 0 4 1 0 0 1 4 0 4 0 0 3 1 0 0 1 3

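Replacing the 0s with NA should be straightforward.

Note that this is not quite the same as @akrun's approach, because count.fields is meant to work out how many columns a dataset should have. As a result, " " and the empty string are treated differently, which is why my result has 1 where @akrun's has NA. You can strip those whitespace-only cells with gsub("\\s+", "", dat1$V24).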
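Putting those two remarks together, a minimal sketch might look like the following. The vector `x` is a small made-up sample rather than the full dat1$V24 column, and the sketch assumes that blank lines are reported as 0, as in the output shown above.

```r
# Made-up sample: an empty cell, a four-item cell, a whitespace-only cell, a three-item cell.
x <- c("", "1,2,3,4", " ", "1,2,4")

x <- gsub("\\s+", "", x)                 # strip whitespace so " " becomes ""

con <- textConnection(x)
n <- count.fields(con, sep = ",", blank.lines.skip = FALSE)
close(con)                               # release the text connection

n[which(n == 0)] <- NA_integer_          # report empty cells as NA instead of 0
print(n)
```

Using `which()` for the replacement keeps the assignment safe even if some entries are already NA.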