Converting a .data file to .csv

I found the following dataset, named ecoli.data, which is available at:

https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/

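I would like to open it in R for a classification task, but I would rather convert the file to CSV first. When I opened it, I noticed that it is not tab-delimited: the values on each line are separated by what looks like three spaces. So the basic question is how to convert this file to CSV using Excel or Python.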

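Rename the file to ecoli.txt and then open it in Excel. This launches Microsoft Excel's Text Import Wizard, which lets you select options such as "Fixed width". Just click "Next" a few times and then "Finish", and the data will appear in the Excel grid. Then save it again as CSV.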

Using Python 2.7:

    import csv

    # Read every line and split it on runs of whitespace.
    with open('ecoli.data.txt') as input_file:
        lines = input_file.readlines()
        newLines = []
        for line in lines:
            newLine = line.strip().split()
            newLines.append( newLine )

    # On Python 2, csv.writer expects the file opened in binary mode ('wb').
    with open('output.csv', 'wb') as test_file:
        file_writer = csv.writer(test_file)
        file_writer.writerows( newLines )
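On Python 3 the same approach needs a small change, because `csv.writer` expects a text-mode file opened with `newline=''` rather than `'wb'`. Below is a minimal sketch, assuming the file is purely whitespace-separated with no spaces inside any field; the function name `whitespace_to_csv` is illustrative and not part of the original answer.

```python
import csv


def whitespace_to_csv(src_path, dst_path):
    """Convert a whitespace-separated text file to CSV; return the row count."""
    rows = []
    with open(src_path) as src:
        for line in src:
            fields = line.split()  # splits on any run of spaces or tabs
            if fields:             # skip blank lines
                rows.append(fields)
    # Python 3: text mode with newline='' so the csv module controls line endings.
    with open(dst_path, 'w', newline='') as dst:
        csv.writer(dst).writerows(rows)
    return len(rows)
```

Calling `whitespace_to_csv('ecoli.data', 'ecoli.csv')` would then write one CSV row per non-empty input line, with no header row.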

Here are two ways to actually do this in R (that work):

    library(readr)

    url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data"

With base R:

    df <- read.table(url)

    dplyr::glimpse(df)
    ## Observations: 336
    ## Variables:
    ## $ V1 (fctr) AAT_ECOLI, ACEA_ECOLI, ACEK_ECOLI, ACKA_ECOLI, ADI_ECOLI, ...
    ## $ V2 (dbl) 0.49, 0.07, 0.56, 0.59, 0.23, 0.67, 0.29, 0.21, 0.20, 0.42,...
    ## $ V3 (dbl) 0.29, 0.40, 0.40, 0.49, 0.32, 0.39, 0.28, 0.34, 0.44, 0.40,...
    ## $ V4 (dbl) 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48,...
    ## $ V5 (dbl) 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...
    ## $ V6 (dbl) 0.56, 0.54, 0.49, 0.52, 0.55, 0.36, 0.44, 0.51, 0.46, 0.56,...
    ## $ V7 (dbl) 0.24, 0.35, 0.37, 0.45, 0.25, 0.38, 0.23, 0.28, 0.51, 0.18,...
    ## $ V8 (dbl) 0.35, 0.44, 0.46, 0.36, 0.35, 0.46, 0.34, 0.39, 0.57, 0.30,...
    ## $ V9 (fctr) cp, cp, cp, cp, cp, cp, cp, cp, cp, cp, cp, cp, cp, cp, cp...

    write.csv(df, "ecoli.csv", row.names=FALSE)

With readr functions:

    df <- read_table(url, col_names=FALSE)

    dplyr::glimpse(df)
    ## Observations: 336
    ## Variables:
    ## $ X1 (chr) "AAT_ECOLI", "ACEA_ECOLI", "ACEK_ECOLI", "ACKA_ECOLI", "ADI...
    ## $ X2 (dbl) 0.49, 0.07, 0.56, 0.59, 0.23, 0.67, 0.29, 0.21, 0.20, 0.42,...
    ## $ X3 (dbl) 0.29, 0.40, 0.40, 0.49, 0.32, 0.39, 0.28, 0.34, 0.44, 0.40,...
    ## $ X4 (dbl) 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48,...
    ## $ X5 (dbl) 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...
    ## $ X6 (dbl) 0.56, 0.54, 0.49, 0.52, 0.55, 0.36, 0.44, 0.51, 0.46, 0.56,...
    ## $ X7 (dbl) 0.24, 0.35, 0.37, 0.45, 0.25, 0.38, 0.23, 0.28, 0.51, 0.18,...
    ## $ X8 (dbl) 0.35, 0.44, 0.46, 0.36, 0.35, 0.46, 0.34, 0.39, 0.57, 0.30,...
    ## $ X9 (chr) "cp", "cp", "cp", "cp", "cp", "cp", "cp", "cp", "cp", "cp",...

    write_csv(df, "ecoli.csv")
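Whichever route is used, the resulting file can be sanity-checked against the 336 observations and 9 variables that `glimpse` reports. Here is a small Python sketch for that check, assuming the CSV has a single header row (which both `write.csv` and `write_csv` produce by default); `csv_shape` is a hypothetical helper name, not something from the answers.

```python
import csv


def csv_shape(path, has_header=True):
    """Return (data_rows, columns) for a CSV file."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    if not rows:
        return (0, 0)
    n_cols = len(rows[0])  # column count is taken from the first row
    n_rows = len(rows) - 1 if has_header else len(rows)
    return (n_rows, n_cols)
```

For the files written by the R code, `csv_shape('ecoli.csv')` should come back as `(336, 9)`. For a headerless file such as the one the Python answer writes, pass `has_header=False`.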