Normalizing and cleaning Excel files in R

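I need help cleaning Excel files in R.

These are Excel files made by different people, and they are supposed to contain the same text. My task is to compare snippets of text and make sure they match (sometimes people make typos, sometimes people copy and paste; it is a mess).

My particular problem is that there is no standard formatting, and some of the text has been extracted from PDFs.

To give you an idea, the text can look like this: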

File A:

"The monkey must eat:

· Bananas, or

6 Bananas.

File B:

"The monkey must eat:

Bananas, or

5 Bananas".

File C:

"The monkey must eat:

· Bananas, or

6 Bananas.

So far I have used a combination of the following functions, but in the end my comparison is still FALSE.

    library(stringi)  # stri_enc_toascii, stri_replace_all_charclass
    library(tm)       # removeNumbers, stripWhitespace, removePunctuation

    monkeyr$txtcp <- stri_enc_toascii(monkeyr$txtcp)
    monkeyr$txtcp <- removeNumbers(monkeyr$txtcp) # bad idea, as I want to compare the number of bananas
    monkeyr$txtcp <- tolower(monkeyr$txtcp)
    monkeyr$txtcp <- stripWhitespace(monkeyr$txtcp)
    monkeyr$txtcp <- removePunctuation(monkeyr$txtcp)
    monkeyr$txtcp <- trimws(monkeyr$txtcp)
    monkeyr$txtcp <- stri_replace_all_charclass(monkeyr$txtcp, "\t", " ", merge = TRUE)
    # The line above was specifically meant to remove the "tab" in File C.
    # It does not work. This is some sort of "invisible" tab that gets turned into
    # a series of ->->-> when saved as csv.

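Additional information:

This is what File C looks like after being stripped and opened in Excel:

[Screenshot of the "invisible" tab, which turns into arrows]

Any suggestions on how to normalize the text somehow?

Note: no packages that use Java.

Thanks in advance.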

Edit

Sample input:

    monkeyr <- structure(list(
      id = c("MON1", "MON2", "MON3"),
      txtcp = c(
        "The monkey must be fed a combination of:\r\n<U+F0B7> Bananas, or\r\n<U+F0B7> 6 Bananas.",
        "The monkey must:\r\n· Be active\r\n· Be petted\r\n· Be inactive.",
        "The monkey must:\r\nbe tame\r\njump"
      )
    ), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("id", "txtcp"))

Expected output:

    cleanmonkey <- structure(list(
      id = c("MON1", "MON2", "MON3"),
      txtcp = c(
        "the monkey must be fed a combination of bananas or 6 bananas",
        "the monkey must be active be petted be inactive",
        "the monkey must be tame jump"
      )
    ), .Names = c("id", "txtcp"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))

It is not elegant, but how about this? The code replaces the non-ASCII elements, then the "\r" and "\n", and finally cleans up the extra whitespace.

    library(stringr)

    # Drop the escaped non-ASCII code points, e.g. <U+F0B7>
    monkeyr$clean <- str_replace_all(string = monkeyr$txtcp, pattern = "<U.*>", replacement = "")
    # Drop carriage returns and line feeds
    monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\\r", replacement = "")
    monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\\n", replacement = "")
    # Drop punctuation
    monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "[[:punct:]]", replacement = "")
    # Drop runs of two whitespace characters
    monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\s{2}", replacement = "")

    monkeyr$clean
    [1] "The monkey must be fed a combination of Bananas or 6 Bananas"
    [2] "The monkey must Be active Be petted Be inactive"
    [3] "The monkey mustbe tamejump"

Note that some words get merged: "mustbe" and "tamejump".
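One possible way to avoid the merged words is to replace line breaks, bullets, and punctuation with a space instead of an empty string, and only then collapse repeated whitespace. The following is a minimal base R sketch under that assumption (no extra packages, so no Java dependency). The name `normalize_text` is hypothetical, and the sample vector is copied from the question's `monkeyr$txtcp`, with the middle dot written as `\u00b7`.

```r
# Sample text from the question, as a plain character vector.
txt <- c(
  "The monkey must be fed a combination of:\r\n<U+F0B7> Bananas, or\r\n<U+F0B7> 6 Bananas.",
  "The monkey must:\r\n\u00b7 Be active\r\n\u00b7 Be petted\r\n\u00b7 Be inactive.",
  "The monkey must:\r\nbe tame\r\njump"
)

# Hypothetical helper: every removed element becomes a space, so
# neighbouring words are never glued together.
normalize_text <- function(x) {
  # Escaped code points as they appear in the sample, e.g. <U+F0B7>
  x <- gsub("<U\\+[0-9A-Fa-f]+>", " ", x)
  # Anything outside printable ASCII: bullets, tabs, \r, \n, other control characters
  x <- gsub("[^\\x20-\\x7E]", " ", x, perl = TRUE)
  # Punctuation becomes a space as well
  x <- gsub("[[:punct:]]", " ", x)
  x <- tolower(x)
  # Collapse whitespace runs and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

normalize_text(txt)
# [1] "the monkey must be fed a combination of bananas or 6 bananas"
# [2] "the monkey must be active be petted be inactive"
# [3] "the monkey must be tame jump"
```

Digits are kept, so the number of bananas can still be compared. Because punctuation turns into a space, a word such as "don't" would become "don t"; that should not matter for the comparison as long as the same function is applied to every file before testing equality.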
