Building a proper data frame from a list of matrices after importing an .xlsx file

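What I have done:

I imported an .xlsx file into R. The file consists of three sheets, and I bound all of the sheets into a list.

What I need:

Now I want to combine this list of matrices into a single data.frame. The header should be names(dataset).

I tried the as.data.frame argument as described for read.xlsx, but it did not work. I also explicitly tried as.data.frame(as.table(dataset)), but that still produces a long, list-like data.frame, not what I want.

I want a structure with header = names and the values below it, the way read.table imports data.

This is the code I am using: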

    xlfile <- list.files(pattern = "*.xlsx")
    wb <- loadWorkbook(xlfile)
    sheet_ct <- wb$getNumberOfSheets()
    b <- rbind(list(lapply(1:sheet_ct, function(x) {
      res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
    })))
    b <- b[-c(1), ]  # Just want to remove the second header

I would like the data arranged as follows.

      Ei Mi hours Nphy Cphy CHLphy Nhet Chet Ndet Cdet DON DOC DIN DIC AT dCCHO TEPC Ncocco Ccocco CHLcocco PICcocco par Temp Sal co2atm u10 dicfl co2ppm co2mol pH
    1 1 1 1 0.1023488 0.6534707 0.1053458 0.04994161 0.3308593 0.04991916 0.3307085 0.05042275 49.76304 14.99330000 2050.132 2150.007 0.9642220 0.1339044 0.1040715 0.6500288 0.1087667 0.1000664 0.0000000 9.900000 31.31000 370 0.01 -2.963256000 565.1855 0.02562326 7.879427
    2 1 1 2 0.1045240 0.6448216 0.1103250 0.04988347 0.3304699 0.04984045 0.3301691 0.05085697 49.52745 14.98729000 2050.264 2150.007 0.9308690 0.1652179 0.1076058 0.6386706 0.1164099 0.1001396 0.0000000 9.900000 31.31000 370 0.01 -2.971632000 565.7373 0.02564828 7.879042
    3 1 1 3 0.1064772 0.6369597 0.1148174 0.04982555 0.3300819 0.04976363 0.3296314 0.05130091 49.29323 14.98221000 2050.396 2150.007 0.8997098 0.1941872 0.1104229 0.6291149 0.1225822 0.1007908 0.8695131 9.900000 31.31000 370 0.01 -2.980446000 566.3179 0.02567460 7.878636
    4 1 1 4 0.1081702 0.6299084 0.1187672 0.04976784 0.3296952 0.04968840 0.3290949 0.05175249 49.06034 14.97810000 2050.524 2150.007 0.8705440 0.2210289 0.1125141 0.6213265 0.1273103 0.1018360 1.5513170 9.900000 31.31000 370 0.01 -2.989259000 566.8983 0.02570091 7.878231
    5 1 1 5 0.1095905 0.6239005 0.1221460 0.04971029 0.3293089 0.04961446 0.3285598 0.05220978 48.82878 14.97485000 2050.641 2150.007 0.8431960 0.2459341 0.1140222 0.6152447 0.1308843 0.1034179 2.7777070 9.900000

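Please do not suggest that I put all the data on one sheet and convert the .xlsx to .csv or plain text. I am really struggling to get a proper data frame out of the .xlsx file.

Here is the file.

Here is the follow-up post: follow-up

The result is: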

    str(full_data)
    'data.frame': 0 obs. of 19 variables:
     $ Experiment : Factor w/ 2 levels "#","1":
     $ Mesocosm : Factor w/ 10 levels "#","1","2","3",..:
     $ Exp.day : Factor w/ 24 levels "1","10","11",..:
     $ Hour : Factor w/ 24 levels "108","12","132",..:
     $ Temperature: Factor w/ 125 levels "10","10.01","10.02",..:
     $ Salinity : num
     $ pH : num
     $ DIC : Factor w/ 205 levels "1582.2925","1588.6475",..:
     $ TA : Factor w/ 117 levels "1813","1826",..:
     $ DIN : Factor w/ 66 levels "0.2","0.3","0.4",..:
     $ Chl.a : Factor w/ 156 levels "0.171","0.22",..:
     $ PIC : Factor w/ 194 levels "-0.47","-0.96",..:
     $ POC : Factor w/ 199 levels "-0.046","1.733",..:
     $ PON : Factor w/ 151 levels "1.675","1.723",..:
     $ POP : Factor w/ 110 levels "0.032","0.034",..:
     $ DOC : Factor w/ 93 levels "100.1","100.4",..:
     $ DON : Factor w/ 1 level "Âµmol/L":
     $ DOP : Factor w/ 1 level "Âµmol/L":
     $ TEP : Factor w/ 100 levels "10.4934","11.0053",..:

[Note: the above is the structure after reading from the .xlsx file. The factor levels make calculation and manipulation tedious and messy.]

This is what I want to achieve:

    str(a)
    'data.frame': 9936 obs. of 29 variables:
     $ Ei : int 1 1 1 1 1 1 1 1 1 1 ...
     $ Mi : int 1 1 1 1 1 1 1 1 1 1 ...
     $ hours : int 1 2 3 4 5 6 7 8 9 10 ...
     $ Cphy : num 0.653 0.645 0.637 0.63 0.624 ...
     $ CHLphy : num 0.105 0.11 0.115 0.119 0.122 ...
     $ Nhet : num 0.0499 0.0499 0.0498 0.0498 0.0497 ...
     $ Chet : num 0.331 0.33 0.33 0.33 0.329 ...
     $ Ndet : num 0.0499 0.0498 0.0498 0.0497 0.0496 ...
     $ Cdet : num 0.331 0.33 0.33 0.329 0.329 ...
     $ DON : num 0.0504 0.0509 0.0513 0.0518 0.0522 ...
     $ DOC : num 49.8 49.5 49.3 49.1 48.8 ...
     $ DIN : num 15 15 15 15 15 ...
     $ DIC : num 2050 2050 2050 2051 2051 ...
     $ AT : num 2150 2150 2150 2150 2150 ...
     $ dCCHO : num 0.964 0.931 0.9 0.871 0.843 ...
     $ TEPC : num 0.134 0.165 0.194 0.221 0.246 ...
     $ Ncocco : num 0.104 0.108 0.11 0.113 0.114 ...
     $ Ccocco : num 0.65 0.639 0.629 0.621 0.615 ...
     $ CHLcocco: num 0.109 0.116 0.123 0.127 0.131 ...
     $ PICcocco: num 0.1 0.1 0.101 0.102 0.103 ...
     $ par : num 0 0 0.87 1.55 2.78 ...
     $ Temp : num 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
     $ Sal : num 31.3 31.3 31.3 31.3 31.3 ...
     $ co2atm : num 370 370 370 370 370 370 370 370 370 370 ...
     $ u10 : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
     $ dicfl : num -2.96 -2.97 -2.98 -2.99 -3 ...
     $ co2ppm : num 565 566 566 567 567 ...
     $ co2mol : num 0.0256 0.0256 0.0257 0.0257 0.0257 ...
     $ pH : num 7.88 7.88 7.88 7.88 7.88 ...

[Note: sorry about the extra columns. This is a different data set (plain text) that I read with read.table.]

With NA handling:

    > unique(mydf_1$Exp.num)
    [1] # 1
    Levels: # 1
    > unique(mydf_2$Exp.num)
    [1] # 2
    Levels: # 2
    > unique(mydf_3$Exp.num)
    [1] # 3
    Levels: # 3
    > unique(full_data$Exp.num)
    [1] 2 3 4

Without NA handling:

    > unique(full_data$Exp.num)
    [1] 1 NA 2 3
    > unique(full_data$Mesocosm)
    [1] 1 2 3 4 5 6 7 8 9 NA

I think this is what you need. I have added some comments on what I am doing:

    library(xlsx)  # provides loadWorkbook() and read.xlsx()

    # Assumes exactly one .xlsx file in the working directory
    xlfile <- list.files(pattern = "\\.xlsx$")
    wb <- loadWorkbook(xlfile)
    sheet_ct <- wb$getNumberOfSheets()

    # Read the sheets into 3 separate data frames (mydf_1, mydf_2, mydf_3).
    # With startRow/endRow you don't need my function below to eliminate NAs,
    # but you need to specify the first and last rows.
    for (i in 1:sheet_ct) {
      print(i)
      variable_name <- sprintf("mydf_%s", i)
      assign(variable_name, read.xlsx(xlfile, sheetIndex = i, startRow = 1, endRow = 209))
    }

    # This part was unclear. I chose the second sheet's names as column names,
    # but you can choose whichever you want in the same way
    # (the second and third sheets had the same names).
    colnames(mydf_1) <- names(mydf_2)

    # Some of the sheets were loaded with a few blank rows (full of NAs), which I
    # remove with the following function, based on the first column, which is
    # always populated as far as I can see.
    remove_na_rows <- function(x) {
      sum(!is.na(x))  # number of non-NA entries
    }
    mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num), ]
    mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num), ]
    mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num), ]

    # Make one data frame, dropping the first row of each sheet
    full_data <- rbind(mydf_1[-1, ], mydf_2[-1, ], mydf_3[-1, ])

    # Convert fields to numeric. Going through as.character() makes factors yield
    # their values rather than their level codes; assigning into full_data[]
    # keeps the result a data.frame instead of a plain list.
    full_data[] <- lapply(full_data, function(x) as.numeric(as.character(x)))

    # Use this to convert any column to integer
    # (column names as in the target layout; adjust them to the names in your sheets)
    full_data$Ei <- as.integer(full_data[["Ei"]])
    full_data$Mi <- as.integer(full_data[["Mi"]])
    full_data$hours <- as.integer(full_data[["hours"]])

    # ********* Code to use for removing NA rows *********
    # If you rbind without caring about the NA rows, you can use the code below
    # to get rid of them. I just tested it and it seems to be working.
    n_row <- NULL
    for (i in seq_len(nrow(full_data))) {
      x <- full_data[i, ]
      if (all(is.na(x))) {
        n_row <- append(n_row, i)
      }
    }
    if (length(n_row) > 0) {  # with n_row still NULL, full_data[-n_row, ] would drop every row
      full_data <- full_data[-n_row, ]
    }
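One pitfall worth checking when converting imported columns: calling as.numeric() directly on a factor returns the internal level codes, not the values that are displayed. A minimal sketch, using a made-up factor `f` that is not taken from the workbook:

```r
# Hypothetical factor, standing in for a column that was read as a factor
f <- factor(c("10.5", "9.9", "31.31"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers actually shown in the column

print(codes)
print(values)
```

If a column still contains non-numeric entries such as a units row, the as.character() route turns those entries into NA with a warning, so it may be easier to drop such rows first.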

I think this is now what you need.
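As a possible alternative to creating one variable per sheet with assign(), the sheets can stay in a list and be stacked with do.call(rbind, ...). The sketch below uses small invented data frames in place of the read.xlsx() results so that it runs without the workbook. The names `sheets` and `combined` are illustrative, and it assumes that every sheet has the same column names.

```r
# Stand-ins for lapply(1:sheet_ct, function(i) read.xlsx(xlfile, sheetIndex = i));
# the values are invented for illustration only.
sheets <- list(
  data.frame(Exp.num = c(1, 1),  Mesocosm = c(1, 2),  pH = c(7.88, 7.87)),
  data.frame(Exp.num = c(2, 2),  Mesocosm = c(1, 2),  pH = c(7.90, 7.91)),
  data.frame(Exp.num = c(3, NA), Mesocosm = c(1, NA), pH = c(7.85, NA))
)

# rbind(list(...)) binds the list object itself; do.call() passes each
# data frame as a separate argument to rbind(), which stacks their rows.
combined <- do.call(rbind, sheets)

# Drop rows in which every column is NA (blank spreadsheet rows).
combined <- combined[rowSums(!is.na(combined)) > 0, ]
rownames(combined) <- NULL

str(combined)
```

With the real workbook, `sheets` would come from the lapply() call shown in the first comment rather than from hand-written data frames.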
