Automatically detecting the column types of an Excel sheet

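I have an Excel file with several sheets, each with several columns, so I would rather not specify the column types individually but have them detected automatically. I want them read the way stringsAsFactors = FALSE does in read.csv, because that interprets the column types correctly. With my current approach, a column containing "0.492 ± 0.6" is interpreted as a number and returns NA, because the stringsAsFactors option is not available in read_excel. So I wrote the workaround below, which more or less works, but I cannot use it in real life because I am not allowed to create a new file. Note: I need the other columns to be numeric or integer, and the columns that contain only text to be character, as stringsAsFactors gives me in my read.csv example.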

    library(readxl)

    file <- "myfile.xlsx"
    firstread <- read_excel(file, sheet = "mysheet", col_names = TRUE,
                            na = "", skip = 0)
    # firstread has the problem of a column with "0.492 ± 0.6"
    # being interpreted as a number (returns NA)

    colna <- colnames(firstread)

    # read every column as character
    colnumt <- ncol(firstread)
    textcol <- rep("text", colnumt)
    secondreadchar <- read_excel(file, sheet = "mysheet", col_names = TRUE,
                                 col_types = textcol, na = "", skip = 0)
    # another column, with the number 0.532, is now 0.5319999999999999,
    # and there are several other similar cases

    # read again with stringsAsFactors
    # critical step: in real life, I "cannot" write a csv file
    write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
    stringsasfactor <- read.csv("allcharac.txt", stringsAsFactors = FALSE)
    colnames(stringsasfactor) <- colna
    # the column with "0.492 ± 0.6" is now character, as desired,
    # and the others are numeric, as desired

Here is a script that imports all the data in your Excel file. It puts each sheet's data in a list called dfs:

    library(readxl)

    # Get all the sheets
    all_sheets <- excel_sheets("myfile.xlsx")

    # Loop through the sheet names and get the data in each sheet
    dfs <- lapply(all_sheets, function(sheet) {

      # Get the number of columns in the current sheet
      col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = sheet))

      # Read the sheet with every column as text
      df <- read_excel(path = "myfile.xlsx", sheet = sheet,
                       col_types = rep("text", col_num))

      # Convert to data.frame
      df <- as.data.frame(df, stringsAsFactors = FALSE)

      # Find the numeric columns by trying to convert them to numeric.
      # If any non-missing value becomes NA, the column is not numeric.
      cond <- vapply(df, function(col) {
        col <- col[!is.na(col)]
        all(!is.na(suppressWarnings(as.numeric(col))))
      }, logical(1))

      # Convert only when at least one column qualifies
      if (any(cond)) {
        df[cond] <- lapply(df[cond], as.numeric)
      }

      # Return df in the desired format
      df
    })

    # Just for convenience, to remember which sheet
    # is associated with which data frame
    names(dfs) <- all_sheets

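The process is as follows:

First, get all the sheets in the file with excel_sheets, then loop through the sheet names to create the data frames. Each of these data frames is initially imported as text by setting the col_types argument to text. Once the columns have been read as text, you can convert the structure from a tibble to a data.frame. After that, you find the columns that are actually numeric and convert them to numeric values.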
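The conversion step can also be sketched without any file, using base R's type.convert(), which is what read.csv relies on to pick column types. This is only a minimal sketch under stated assumptions: retype_columns is a hypothetical helper name, and the sheet data frame below is made-up data standing in for a sheet that was read with every column as text.

```r
# Hypothetical helper (not part of readxl): re-type the columns of an
# all-character data frame in memory, without writing an intermediate file.
retype_columns <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  # type.convert() picks logical, integer, numeric or character per column;
  # as.is = TRUE keeps non-numeric text as character instead of factor.
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

# Made-up data standing in for a sheet read with all columns as text
sheet <- data.frame(
  width = c("0.492 ± 0.6", "0.51 ± 0.2"),
  score = c("0.532", "1.25"),
  count = c("3", "7"),
  label = c("a", "b"),
  stringsAsFactors = FALSE
)

typed <- retype_columns(sheet)
sapply(typed, class)
```

In this sketch the width column stays character because its values cannot be parsed as numbers, while score and count become numeric and integer. Whether this matches what read.csv would give on real data depends on details such as NA markers, so treat it as a starting point.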

Edit:

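As of late April, a new version of readxl has been released, and the read_excel function gained two enhancements relevant to this question. First, you can have the function guess the column types for you by passing "guess" to the col_types argument. The second enhancement (a corollary of the first) is that a guess_max argument was added to read_excel. This new argument lets you set the number of rows used to guess the column types. Essentially, what I wrote above can be shortened to: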

    library(readxl)

    # Get all the sheets
    all_sheets <- excel_sheets("myfile.xlsx")

    dfs <- lapply(all_sheets, function(sheetname) {
      suppressWarnings(read_excel(path = "myfile.xlsx",
                                  sheet = sheetname,
                                  col_types = "guess",
                                  guess_max = Inf))
    })

    # Just for convenience, to remember which sheet
    # is associated with which data frame
    names(dfs) <- all_sheets

I recommend updating readxl to the latest version, which shortens your script and avoids possible annoyances.

I hope this helps.