XLS转换为CSV或R data.frame

我需要(非手动)下载这个文件,并将内容转换为data.frame,忽略几行的能力将是有用的。 我正在寻找R或Python的解决scheme。

该文件本身可以从:

http://horizo​​ns.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en

以下是我迄今为止所做的:

  1. 我试过XLConnect( Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream
  2. 我试过RODBC( Error in odbcConnectExcel("xl.file") : odbcConnectExcel is only usable with 32-bit Windows
  3. 我已经尝试过Python中的XLRDError: Unsupported format or corrupt fileXLRDError: Unsupported format or corrupt file
  4. 我试过gdata( Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing!

如果您在记事本中打开文件,它是一个xml文件,并在Excel中打开时,您会收到一条警告消息“的格式和扩展名不匹配”。

我可以探索自己的想法也是有用的,如果你没有答案,请评论。

我到目前为止使用XML / regex的尝试:

 library(XML) library(stringr) download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", destfile = "horizons.xls") doc <- readLines(con = "horizons.xls") doc <- str_extract(doc,"<Table[^>]*>(.*?)</Table>") doc <- xmlParse(doc) listing <- xpathApply(doc, "//Row", xmlToDataFrame) listing <- listing[4:length(listing)] listing <- do.call(rbind,lapply(listing, t))[,6:16] listing[,3:11] <- gsub("[^-.0-9]", "", listing[,3:11]) listing <- as.data.frame(listing, row.names = NULL,stringsAsFactors = FALSE,) listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ") listing[5:11] <- lapply(listing[5:11],as.numeric) names(listing) <- c( "Product Name", "Ticker", "Class", "Price", "Price % Change", "Volume", "NAV/unit", "NAV % Change", "% Prem/Disc", "Outst. Shares" ) 

也许有一种方法可以在R:

 library(XML) download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", file.path(tempdir(), "xls.xml")) doc <- xmlParse(file.path(tempdir(), "xls.xml")) df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE) names(df) <- unlist(df[1, ], use.names = F); df <- df[-1, ] # put first row as col header and delete it head(df) # # # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares # 2 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302 1.9621 -0.11 5675671 # 3 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539 1.2312 -0.20 5675671 # 4 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144 0.9095 0.16 22019328 # 5 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842 0.1864 0.21 22019328 # 6 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441 0.6578 -0.30 902485 # 7 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804 0.1395 -0.23 500440 

这可能不是最好的方法,但会有助于提供一些头脑。

 require("XML") myfile1 <- download.file(http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en) doc <- xmlParse(myfile1) root_doc <- xmlRoot(doc) identify_worksheet <- c() for (i in 1:xmlSize(root_doc)){ identify_worksheet <- c(identify_worksheet, xmlName(root_doc[[i]]) == "Worksheet") } worksheet_index = which(identify_worksheet == TRUE) name1 <- xmlSApply(root_doc[[worksheet_index]], xmlName) row_size <- xmlSize(root_doc[[worksheet_index]][[name1]]) col_size = max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize)) row_index = which(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize) == max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize))) df1 <- data.frame(matrix(nrow = length(row_index)-1, ncol = col_size), stringsAsFactors = FALSE) colnames(df1) <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[1]]]) for(i in 2:length(row_index)){ df_index = i-1 df1[df_index,] <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[i]]]) } View(df1) df2 <- df1[4:ncol(df1)] View(df2) 

从XML格式的Excel表格中识别名称。 我希望知道,下面的名称是在XML格式的Excel表单中遵循的标准,而且当有多个工作表时,工作表的名称以数字为后缀增加(例如:Worksheet1,Worksheet2等等。 )。

 xmlName(root_doc) [1] "Workbook" xmlName(root_doc[[1]]) [1] "DocumentProperties" xmlName(root_doc[[2]]) [1] "Styles" xmlName(root_doc[[3]]) [1] "Worksheet" 

产量

 head(df1) # # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume 1 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 2 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 3 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 4 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 5 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 6 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 NAV/unit NAV % Change % Prem/Disc Outst. Shares 1 47.4302 1.9621 -0.11 5675671 2 37.3539 1.2312 -0.20 5675671 3 27.9144 0.9095 0.16 22019328 4 21.9842 0.1864 0.21 22019328 5 21.5441 0.6578 -0.30 902485 6 30.0804 0.1395 -0.23 500440 head(df2) ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit 1 INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302 2 Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539 3 Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144 4 Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842 5 Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441 6 Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804 NAV % Change % Prem/Disc Outst. Shares 1 1.9621 -0.11 5675671 2 1.2312 -0.20 5675671 3 0.9095 0.16 22019328 4 0.1864 0.21 22019328 5 0.6578 -0.30 902485 6 0.1395 -0.23 500440