Automatically detecting the column types of an Excel sheet
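I have an Excel file with several sheets, each with several columns, so I would rather not specify the column types one by one but have them detected automatically. I want to read them the way stringsAsFactors = FALSE would, because it interprets the column types correctly. With my current approach, a column containing "0.492 ± 0.6" is interpreted as numeric and returns NA, because the stringsAsFactors option is not available in read_excel. So I wrote the workaround below, which more or less works, but I cannot use it in real life because I am not allowed to create a new file. Note: I need some columns as numeric or integer, and others that contain only text as character, just as stringsAsFactors does in my read.csv example.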
    library(readxl)

    file <- "myfile.xlsx"
    firstread <- read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
    # firstread has the problem of the column with "0.492 ± 0.6"
    # being interpreted as a number (returns NA)

    colna <- colnames(firstread)

    # read every column as character
    colnumt <- ncol(firstread)
    textcol <- rep("text", colnumt)
    secondreadchar <- read_excel(file, sheet = "mysheet", col_names = TRUE,
                                 col_types = textcol, na = "", skip = 0)
    # another column, with the number 0.532, is now 0.5319999999999999,
    # and there are several other similar cases

    # read again with stringsAsFactors
    # critical step: in real life, I "cannot" write a csv file
    write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
    stringsasfactor <- read.csv("allcharac.txt", stringsAsFactors = FALSE)
    colnames(stringsasfactor) <- colna
    # the column with "0.492 ± 0.6" is now character, as desired,
    # and the others are numeric, as desired as well
Here is a script that imports all the data from your Excel file. It puts the data from each sheet into a list named dfs:
    library(readxl)

    # Get all the sheets
    all_sheets <- excel_sheets("myfile.xlsx")

    # Loop through the sheet names and get the data in each sheet
    dfs <- lapply(all_sheets, function(x) {

      # Get the number of columns in the current sheet
      col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))

      # Get the data frame with all columns as text
      df <- read_excel(path = "myfile.xlsx", sheet = x,
                       col_types = rep("text", col_num))

      # Convert to data.frame
      df <- as.data.frame(df, stringsAsFactors = FALSE)

      # Find numeric fields by trying to convert them to numeric values.
      # If any non-missing value becomes NA, the field is not numeric.
      # Otherwise it is numeric.
      cond <- vapply(df, function(col) {
        col <- col[!is.na(col)]
        all(!is.na(suppressWarnings(as.numeric(col))))
      }, logical(1))
      numeric_cols <- names(df)[cond]

      # Convert the numeric columns in place
      df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)

      # Return df in the desired format
      df
    })

    # Just for convenience, in order to remember
    # which sheet is associated with which data frame
    names(dfs) <- all_sheets
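The process is as follows:
First, get all the sheets in the file with excel_sheets, then loop through the sheet names to create the data frames. For each of these data frames, the data is initially imported as text by setting the col_types argument to text. Once the columns have been read as text, you can convert the structure from a tibble (tbl_df), which is what read_excel returns, to a plain data.frame. After that, you can find the columns that are actually numeric and convert them to numeric values.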
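The text-to-numeric step can also be sketched without any intermediate file, which matters given the constraint in the question. The example below is only a minimal sketch: the data frame txt is a hand-built stand-in for a sheet read with every column as text, and its column names and values are made up for illustration. It relies on base R's type.convert with as.is = TRUE, the same type detection that read.csv applies when stringsAsFactors = FALSE.

```r
# Hypothetical stand-in for a sheet imported with col_types = "text";
# the column names and values are invented for this example.
txt <- data.frame(
  measure = c("0.492 ± 0.6", "0.51 ± 0.2", NA),
  value   = c("0.532", "1.25", "3"),
  count   = c("1", "2", "3"),
  label   = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Re-detect the type of each column in memory.
# as.is = TRUE keeps non-numeric text as character instead of factor.
converted <- txt
converted[] <- lapply(txt, type.convert, as.is = TRUE)

# Inspect the resulting class of every column
col_classes <- vapply(converted, class, character(1))
print(col_classes)
```

Compared with the manual as.numeric check, this approach also distinguishes integer from double columns, which the question asks for.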
Edit:
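As of late April, a new version of readxl has been released, and the read_excel function gained two enhancements relevant to this question. First, you can have the function guess the column types by passing "guess" to the col_types argument. The second enhancement (a corollary of the first) is that a guess_max argument was added to read_excel. This new argument lets you set the number of rows used for guessing the column types. In essence, what I wrote above can be shortened to: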
    library(readxl)

    # Get all the sheets
    all_sheets <- excel_sheets("myfile.xlsx")

    dfs <- lapply(all_sheets, function(sheetname) {
      suppressWarnings(read_excel(path = "myfile.xlsx",
                                  sheet = sheetname,
                                  col_types = "guess",
                                  guess_max = Inf))
    })

    # Just for convenience, in order to remember
    # which sheet is associated with which data frame
    names(dfs) <- all_sheets
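After importing, it can help to confirm which type each column ended up with. The snippet below is a small sketch of such a check. Here dfs is a hand-built stand-in list with made-up sheet names and columns, so that the code runs without an Excel file; in practice it would be the list produced by the lapply call.

```r
# Hypothetical stand-in for the list of imported sheets;
# sheet names, column names and values are invented for this example.
dfs <- list(
  sheet1 = data.frame(id = 1:2,
                      note = c("0.492 ± 0.6", "n/a"),
                      stringsAsFactors = FALSE),
  sheet2 = data.frame(x = c(0.532, 1.5))
)

# For every sheet, report the first class of each column
column_types <- lapply(dfs, function(df) {
  vapply(df, function(col) class(col)[1], character(1))
})

str(column_types)
```

Taking class(col)[1] keeps the check usable for columns with more than one class, such as date-time columns.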
I recommend updating readxl to the latest version, which shortens your script and avoids possible annoyances.
I hope this helps.