Convert XLS to CSV or R data.frame

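I need to download this file (not manually) and convert its contents to a data.frame. The ability to skip a few rows would be useful. I am looking for a solution in R or Python.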

The file itself is available at:

http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en

Here is what I have tried so far:

I tried XLConnect ( Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream )
I tried RODBC ( Error in odbcConnectExcel("xl.file") : odbcConnectExcel is only usable with 32-bit Windows )
I tried xlrd in Python ( XLRDError: Unsupported format or corrupt file )
I tried gdata ( Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing! )

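If you open the file in Notepad, you can see that it is actually an XML file, and when you open it in Excel you get a warning that the file format and the extension do not match.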
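The errors listed above are consistent with readers that expect a binary OLE2 (.xls) or zipped OOXML (.xlsx) container. A quick way to confirm what a downloaded file really is, is to look at its first bytes. The sketch below is a minimal illustration; the function name `sniff_spreadsheet_format` and the returned labels are hypothetical, and the heuristic covers only the common cases.

```python
# Well-known file signatures.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx (OOXML is a ZIP archive)


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    The labels returned here are made up for this example.
    """
    if head.startswith(OLE2_MAGIC):
        return "ole2-xls"
    if head.startswith(ZIP_MAGIC):
        return "ooxml-xlsx"
    # Drop a UTF-8 byte-order mark and leading whitespace before checking for markup.
    text = head.lstrip(b"\xef\xbb\xbf").lstrip()
    if text.startswith(b"<?xml") or text.startswith(b"<Workbook"):
        return "xml"
    if text[:6].lower().startswith((b"<html", b"<!doct")):
        return "html"
    return "unknown"
```

To use it, read the first 64 bytes or so of the downloaded file in binary mode and pass them in. A result of "xml" suggests an XML parser is the right tool rather than an Excel reader.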

Ideas that I could explore on my own would also be useful, so please comment even if you do not have a complete answer.

My attempt so far, using XML and regex:

    library(XML)
    library(stringr)

    download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en",
                  destfile = "horizons.xls")
    doc <- readLines(con = "horizons.xls")
    # Keep only the <Table>...</Table> part and parse it as XML
    doc <- str_extract(doc, "<Table[^>]*>(.*?)</Table>")
    doc <- xmlParse(doc)
    listing <- xpathApply(doc, "//Row", xmlToDataFrame)
    # Drop the first three rows
    listing <- listing[4:length(listing)]
    listing <- do.call(rbind, lapply(listing, t))[, 6:16]
    # Strip everything except digits, "-" and "." from the numeric columns
    listing[, 3:11] <- gsub("[^-.0-9]", "", listing[, 3:11])
    listing <- as.data.frame(listing, row.names = NULL, stringsAsFactors = FALSE)
    listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ")
    listing[5:11] <- lapply(listing[5:11], as.numeric)
    names(listing) <- c("Product Name", "Ticker", "Class", "Price", "Price % Change",
                        "Volume", "NAV/unit", "NAV % Change", "% Prem/Disc", "Outst. Shares")
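For readers working in Python, the numeric clean-up step in the attempt above (dropping every character except digits, "-" and ".", then converting to a number) might look like the following. This is only a sketch; the helper name `to_number` is hypothetical, and returning `None` for unparseable cells is an assumption about the desired behaviour.

```python
import re

# Same character class as the gsub() call in the R attempt.
_NON_NUMERIC = re.compile(r"[^-.0-9]")


def to_number(text):
    """Strip everything except digits, '-' and '.', then convert to float.

    Returns None when what remains is not a valid number.
    """
    cleaned = _NON_NUMERIC.sub("", text)
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Thousands separators, currency symbols and percent signs are all removed by the same pattern, so a cell such as "5,675,671" is read as a plain number.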

Here is one possible way to do it in R:

    library(XML)

    download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en",
                  file.path(tempdir(), "xls.xml"))
    doc <- xmlParse(file.path(tempdir(), "xls.xml"))
    # Select all rows in the "ss" namespace and drop the first two
    df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE)
    # Use the first row as the column header, then delete it
    names(df) <- unlist(df[1, ], use.names = FALSE)
    df <- df[-1, ]
    head(df)
    #   # # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares
    # 2 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302 1.9621 -0.11 5675671
    # 3 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539 1.2312 -0.20 5675671
    # 4 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144 0.9095 0.16 22019328
    # 5 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842 0.1864 0.21 22019328
    # 6 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441 0.6578 -0.30 902485
    # 7 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804 0.1395 -0.23 500440
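Since the question also asks about Python, the same idea (namespace-aware row lookup, skipping leading rows, promoting one row to the header) can be sketched with the standard library alone. This is an illustration under assumptions, not a tested solution for the real file: `read_spreadsheetml`, `to_csv` and `SAMPLE` are hypothetical names, the sample workbook is made up, and the right `skip_rows` value for the real download has not been verified here.

```python
import csv
import io
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"  # SpreadsheetML 2003 namespace
NS = {"ss": SS}

# Made-up miniature workbook, used only to demonstrate the functions below.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row><Cell><Data ss:Type="String">Report title</Data></Cell></Row>
      <Row>
        <Cell><Data ss:Type="String">Ticker</Data></Cell>
        <Cell><Data ss:Type="String">Class</Data></Cell>
        <Cell><Data ss:Type="String">Price</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">HXS</Data></Cell>
        <Cell ss:Index="3"><Data ss:Type="Number">47.3800</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""


def read_spreadsheetml(source, skip_rows=0):
    """Return (header, records) from the first worksheet.

    source is a path or a binary file-like object; skip_rows is the number
    of rows to drop before the row that is used as the header.
    """
    root = ET.parse(source).getroot()
    table = root.find("ss:Worksheet/ss:Table", NS)
    if table is None:
        raise ValueError("no Worksheet/Table element found")
    rows = []
    for row in table.findall("ss:Row", NS):
        cells = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index (1-based) appears when empty cells before this one were omitted.
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                cells.extend([""] * (int(index) - 1 - len(cells)))
            data = cell.find("ss:Data", NS)
            cells.append("".join(data.itertext()).strip() if data is not None else "")
        rows.append(cells)
    rows = rows[skip_rows:]
    if not rows:
        raise ValueError("no rows left after skipping")
    header, records = rows[0], rows[1:]
    # Pad short rows so every record has one value per header column.
    records = [r + [""] * (len(header) - len(r)) for r in records]
    return header, records


def to_csv(header, records):
    """Serialise the header and records as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return buf.getvalue()
```

Calling `read_spreadsheetml(io.BytesIO(SAMPLE.encode("utf-8")), skip_rows=1)` drops the title row and treats the next row as the header. The returned lists can be handed to `pandas.DataFrame(records, columns=header)` if pandas is available.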

This may not be the best approach, but it should give you some pointers.

    require("XML")

    # download.file() returns a status code, so save to a file and parse that file
    myfile1 <- file.path(tempdir(), "horizons.xml")
    download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en",
                  destfile = myfile1)
    doc <- xmlParse(myfile1)
    root_doc <- xmlRoot(doc)

    # Find the Worksheet node among the children of the root
    identify_worksheet <- c()
    for (i in 1:xmlSize(root_doc)) {
      identify_worksheet <- c(identify_worksheet, xmlName(root_doc[[i]]) == "Worksheet")
    }
    worksheet_index <- which(identify_worksheet == TRUE)
    name1 <- xmlSApply(root_doc[[worksheet_index]], xmlName)

    # Keep only the rows that have the maximum number of cells
    row_size <- xmlSize(root_doc[[worksheet_index]][[name1]])
    col_size <- max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize))
    row_index <- which(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize) ==
                         max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize)))

    # The first full-width row is the header; the rest are data
    df1 <- data.frame(matrix(nrow = length(row_index) - 1, ncol = col_size), stringsAsFactors = FALSE)
    colnames(df1) <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[1]]])
    for (i in 2:length(row_index)) {
      df_index <- i - 1
      df1[df_index, ] <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[i]]])
    }
    View(df1)
    df2 <- df1[4:ncol(df1)]
    View(df2)

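Identifying the node names in the XML-format Excel sheet: I would like to know whether the names below are a standard followed by XML-format Excel files, and whether, when there are multiple worksheets, the worksheet names are suffixed with an incrementing number (for example Worksheet1, Worksheet2, and so on).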

    xmlName(root_doc)
    [1] "Workbook"
    xmlName(root_doc[[1]])
    [1] "DocumentProperties"
    xmlName(root_doc[[2]])
    [1] "Styles"
    xmlName(root_doc[[3]])
    [1] "Worksheet"
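On the naming question: as far as I know, in the SpreadsheetML 2003 format every sheet is a `Worksheet` element and the sheet's display name is stored in its `ss:Name` attribute, so the element name itself is not suffixed with a number. One way to check this against a given file is to list the child element names and the `ss:Name` values. The sketch below is in Python; `describe_workbook`, `local_name` and the two-sheet `SAMPLE_WORKBOOK` are hypothetical and only illustrate the check.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"

# Made-up workbook with two sheets; the sheet names are invented for this example.
SAMPLE_WORKBOOK = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office"/>
  <Styles/>
  <Worksheet ss:Name="Funds"><Table/></Worksheet>
  <Worksheet ss:Name="Notes"><Table/></Worksheet>
</Workbook>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_workbook(xml_text):
    """Return (top-level child element names, worksheet ss:Name values)."""
    root = ET.fromstring(xml_text)
    children = [local_name(child.tag) for child in root]
    sheet_names = [
        ws.get(f"{{{SS}}}Name") for ws in root.findall(f"{{{SS}}}Worksheet")
    ]
    return children, sheet_names
```

With several sheets, the element name `Worksheet` simply repeats, so selecting a sheet by position or by its `ss:Name` value is likely more robust than matching on a numbered element name.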

Output:

    head(df1)
      # # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume
    1 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223
    2 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769
    3 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656
    4 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0
    5 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200
    6 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900
      NAV/unit NAV % Change % Prem/Disc Outst. Shares
    1 47.4302 1.9621 -0.11 5675671
    2 37.3539 1.2312 -0.20 5675671
    3 27.9144 0.9095 0.16 22019328
    4 21.9842 0.1864 0.21 22019328
    5 21.5441 0.6578 -0.30 902485
    6 30.0804 0.1395 -0.23 500440

    head(df2)
      ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit
    1 INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302
    2 Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539
    3 Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144
    4 Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842
    5 Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441
    6 Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804
      NAV % Change % Prem/Disc Outst. Shares
    1 1.9621 -0.11 5675671
    2 1.2312 -0.20 5675671
    3 0.9095 0.16 22019328
    4 0.1864 0.21 22019328
    5 0.6578 -0.30 902485
    6 0.1395 -0.23 500440
