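Best way to look up a list of strings against a dictionary to output English words

I have a list of about 20,000 strictly alphabetic text strings, output as a CSV file into Excel, but it is rather messy. What I would like to do is query a separate reference file of English dictionary words, so that I can basically build a lookup that returns the dictionary words, minus the load of text noise that is prefixed or appended to each string. Examples below.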

    xyzbuildingcontractor = Building Contractor
    upholsteryabcdef = Upholstery
    lmnoengineer = Engineer

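As a relative n00b programmer, I just want to gauge the best way of doing this, and whether Excel is the best platform for it.

Any guidance would be very gratefully received. Thanks in advance.

Jim

OK, this is a very rough draft that you will probably need to tweak, but the general idea is this: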

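  1. A Trie is used to build up a dictionary of words
  2. The clsTrieIterator class allows several words to be tracked in the Trie at the same time
  3. The string to be tested is parsed one character at a time, and each character starts a new clsTrieIterator
  4. All existing active clsTrieIterators consume each next character; if the resulting combination of characters is not possible given the dictionary, that iterator stops being tracked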
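The four steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the VBA used in this answer, assuming strictly alphabetic input as described in the question; `build_trie`, `find_words` and the `_END` marker are hypothetical names, not part of the classes below. Like the VBA draft, it stops tracking a match as soon as it completes a word.

```python
_END = object()  # hypothetical end-of-word marker; can never collide with a character key


def build_trie(words):
    """Step 1: build a nested-dict trie from the dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def find_words(text, root):
    """Steps 2-4: scan text once, tracking every partial match still possible."""
    lowered = text.lower()
    active = []  # (trie node, start index) for each match being tracked
    found = []
    for i, ch in enumerate(lowered):
        active.append((root, i))  # step 3: a new match may start at this character
        survivors = []
        for node, start in active:  # step 4: every active match consumes the character
            child = node.get(ch)
            if child is None:
                continue  # no dictionary word continues this way, stop tracking
            if _END in child:
                # A complete word: record it and stop tracking this match
                found.append(lowered[start:i + 1].capitalize())
            else:
                survivors.append((child, start))
        active = survivors
    return found


trie = build_trie(["Building", "Contractor", "Upholstery", "Engineer"])
print(" ".join(find_words("xyzbuildingcontractor", trie)))
```

Each character of the input is read exactly once, and the number of matches tracked at any moment is bounded by the length of the longest dictionary word.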

Here is a brief example:

    Public Sub Main()
        Dim wf As clsWordFinder
        Set wf = New clsWordFinder

        wf.Add "Building"
        wf.Add "Contractor"
        wf.Add "Upholstery"
        wf.Add "Engineer"

        Debug.Print wf.getWordsFromString("xyzbuildingcontractor")
        Debug.Print wf.getWordsFromString("upholsteryabcdef")
        Debug.Print wf.getWordsFromString("lmnoengineer")
    End Sub

This outputs the following to the VBA Immediate window:

    Building Contractor
    Upholstery
    Engineer

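...and here are the classes.

clsTrieNode is each individual node of the tree. It represents a single letter and may have up to 26 children, provided they form valid words in the dictionary. If the combination of characters from the root of the tree down to this node, taken node by node, forms a word, the Trie sets "isWord".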

    Option Compare Database   ' Access-only; omit this line in Excel
    Option Explicit

    Public KeyChar As String
    Public isWord As Boolean
    Private m_Children(0 To 25) As clsTrieNode

    Public Property Get Child(strChar As String) As clsTrieNode
        'better be ONE char
        Set Child = m_Children(charToIndex(strChar))
    End Property

    Public Property Set Child(strChar As String, oNode As clsTrieNode)
        Set m_Children(charToIndex(strChar)) = oNode
    End Property

    'Maps "a".."z" to 0..25; any other character falls outside the array bounds
    Private Function charToIndex(strChar As String) As Long
        charToIndex = Asc(strChar) - 97 'asc("a")
    End Function

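clsTrie is the public-facing interface that interacts with the tree of nodes. It contains an Add method that puts a word into the dictionary and an isWord method that tests a string against the dictionary to see whether it is a valid word. Remove would be a nice method to have, but it probably isn't needed for your problem, so I haven't implemented it.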

    Option Compare Database   ' Access-only; omit this line in Excel
    Option Explicit

    Private m_Head As clsTrieNode

    Private Sub Class_Initialize()
        Set m_Head = New clsTrieNode
    End Sub

    Public Sub Add(strKey As String)
        Dim currNode As clsTrieNode
        Dim tempNode As clsTrieNode
        Set currNode = m_Head

        Dim strLCaseKey As String
        strLCaseKey = LCase(strKey)

        'Follow the part of the key that is already in the trie
        Dim i As Long
        For i = 1 To Len(strLCaseKey)
            If Not currNode.Child(Mid(strLCaseKey, i, 1)) Is Nothing Then
                Set currNode = currNode.Child(Mid(strLCaseKey, i, 1))
            Else
                Exit For
            End If
        Next

        'Create nodes for the remaining characters
        For i = i To Len(strLCaseKey)
            Set tempNode = New clsTrieNode
            tempNode.KeyChar = Mid(strLCaseKey, i, 1)
            Set currNode.Child(Mid(strLCaseKey, i, 1)) = tempNode
            Set currNode = tempNode
        Next

        currNode.isWord = True
    End Sub

    Public Sub Remove(strKey As String)
        'Might be nice to have
    End Sub

    Public Function isWord(strKey As String) As Boolean
        Dim currNode As clsTrieNode
        Set currNode = m_Head

        Dim strLCaseKey As String
        strLCaseKey = LCase(strKey)

        Dim i As Long
        For i = 1 To Len(strLCaseKey)
            If Not currNode.Child(Mid(strLCaseKey, i, 1)) Is Nothing Then
                Set currNode = currNode.Child(Mid(strLCaseKey, i, 1))
            Else
                isWord = False
                Exit Function
            End If
        Next

        isWord = currNode.isWord
    End Function

    Public Function getIterator() As clsTrieIterator
        Dim oIterator As clsTrieIterator
        Set oIterator = New clsTrieIterator
        oIterator.Init m_Head
        Set getIterator = oIterator
    End Function
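For readers who want to try the node and trie logic outside VBA, here is a rough Python sketch of the same structure: a fixed 26-slot child table per node, plus add and whole-key lookup. The names `TrieNode`, `Trie` and `char_to_index` are illustrative only. One deliberate difference, labeled as an assumption: the sketch raises a `ValueError` for characters outside a-z, whereas the VBA `charToIndex` would simply produce an out-of-range array index.

```python
class TrieNode:
    """One letter of the trie with a fixed 26-slot child table (a-z)."""

    def __init__(self):
        self.children = [None] * 26
        self.is_word = False


def char_to_index(ch):
    """Map 'a'..'z' to 0..25; reject anything else (assumption, see lead-in)."""
    index = ord(ch) - ord("a")
    if not 0 <= index < 26:
        raise ValueError(f"unsupported character: {ch!r}")
    return index


class Trie:
    def __init__(self):
        self.head = TrieNode()

    def add(self, key):
        """Walk the existing path, creating missing nodes, then mark the end."""
        node = self.head
        for ch in key.lower():
            i = char_to_index(ch)
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.is_word = True

    def is_word(self, key):
        """True only if the whole key ends on a node marked as a word."""
        node = self.head
        for ch in key.lower():
            node = node.children[char_to_index(ch)]
            if node is None:
                return False
        return node.is_word


trie = Trie()
trie.add("Engineer")
print(trie.is_word("engineer"), trie.is_word("engine"))
```

A prefix such as "engine" reaches a node in the trie, but that node is not marked as a word unless "engine" itself was added.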

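clsTrieIterator is a special class returned by clsTrie that lets you work through a string one character at a time using consumeChar, rather than testing a whole key at once as clsTrie.isWord does. This gives you some freedom to parse a string without backtracking or reading the same character more than once, and it lets you find words when you are not sure how long they are going to be.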

    Option Compare Database   ' Access-only; omit this line in Excel
    Option Explicit

    Private m_currNode As clsTrieNode
    Private m_currString As String

    Public Property Get getCurrentString() As String
        getCurrentString = m_currString
    End Property

    Public Sub Init(oNode As clsTrieNode)
        Set m_currNode = oNode
    End Sub

    'Returns False (and invalidates the iterator) when no word can continue with strChar
    Public Function consumeChar(strChar As String) As Boolean
        Dim strLCaseChar As String
        strLCaseChar = LCase(strChar)

        If Not m_currNode.Child(strLCaseChar) Is Nothing Then
            consumeChar = True
            Set m_currNode = m_currNode.Child(strLCaseChar)
            m_currString = m_currString & strChar
        Else
            consumeChar = False
            Set m_currNode = Nothing
        End If
    End Function

    'Only call this while the last consumeChar returned True
    Public Function isWord() As Boolean
        isWord = m_currNode.isWord
    End Function
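To show what character-at-a-time consumption buys you, the following Python sketch feeds a string to a single cursor and reports every dictionary word it passes through, without knowing the word lengths in advance. `TrieCursor`, `prefix_words` and `_END` are hypothetical names used for illustration, and the trie here is a plain nested dict rather than the node class above.

```python
_END = object()  # hypothetical end-of-word marker


def build_trie(words):
    """Nested-dict trie; _END marks nodes that complete a word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


class TrieCursor:
    """Tracks one position in the trie as characters are fed in."""

    def __init__(self, root):
        self._node = root
        self.text = ""

    def consume_char(self, ch):
        """Advance by one character; return False once no word can match."""
        if self._node is None:
            return False
        self._node = self._node.get(ch.lower())
        if self._node is None:
            return False
        self.text += ch
        return True

    def is_word(self):
        return self._node is not None and _END in self._node


def prefix_words(text, root):
    """Feed text to one cursor and collect every word it completes on the way."""
    cursor = TrieCursor(root)
    words = []
    for ch in text:
        if not cursor.consume_char(ch):
            break  # dead end: nothing in the dictionary continues this way
        if cursor.is_word():
            words.append(cursor.text)
    return words


trie = build_trie(["engine", "engineer"])
print(prefix_words("engineering", trie))
```

The cursor never re-reads a character: it reports "engine" after six characters, "engineer" after eight, and gives up at the following "i".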

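clsWordFinder pulls everything together into a simple API that fits your specific problem. It might be worth adding some logic to handle different behaviours, such as "greedy" versus "lazy" matching and overlapping versus non-overlapping word parsing.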

    Option Compare Database
    Option Explicit

    Private m_Trie As clsTrie

    Private Sub Class_Initialize()
        Set m_Trie = New clsTrie
    End Sub

    Public Sub Add(strWord As String)
        m_Trie.Add strWord
    End Sub

    Public Function getWordsFromString(strString As String) As String
        getWordsFromString = JoinCollection(getWordsCollectionFromString(strString))
    End Function

    Public Function getWordsCollectionFromString(strString As String) As Collection
        Dim colIterators As Collection
        Set colIterators = New Collection
        Dim colMatches As Collection
        Set colMatches = New Collection
        Dim oIterator As clsTrieIterator
        Dim strMatch As String
        Dim i As Long
        Dim iter

        For i = 1 To Len(strString)
            Set oIterator = m_Trie.getIterator
            colIterators.Add oIterator, CStr(ObjPtr(oIterator))

            For Each iter In colIterators
                If Not iter.consumeChar(Mid(strString, i, 1)) Then
                    colIterators.Remove CStr(ObjPtr(iter))
                ElseIf iter.isWord() Then
                    strMatch = iter.getCurrentString
                    Mid(strMatch, 1, 1) = UCase(Mid(strMatch, 1, 1))
                    colMatches.Add strMatch
                    colIterators.Remove CStr(ObjPtr(iter))
                End If
            Next
        Next

        Set getWordsCollectionFromString = colMatches
    End Function

    Private Function JoinCollection(colStrings As Collection, Optional strDelimiter As String = " ") As String
        Dim strOut As String
        Dim i As Long

        If colStrings.Count > 0 Then
            strOut = colStrings.Item(1)
            For i = 2 To colStrings.Count
                strOut = strOut & strDelimiter & colStrings.Item(i)
            Next
            JoinCollection = strOut
        End If
    End Function
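As one possible take on the "greedy" behaviour mentioned above, here is a hedged Python sketch of non-overlapping longest-match parsing. It is not part of the VBA classes: `longest_words` and `_END` are hypothetical names, and unlike the iterator approach it does re-read characters after a failed or finished match. With both "build" and "building" in the dictionary, a lazy matcher that stops at the first complete word would report "Build", while this version keeps going and reports "Building".

```python
_END = object()  # hypothetical end-of-word marker


def build_trie(words):
    """Nested-dict trie; _END marks nodes that complete a word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def longest_words(text, words):
    """Greedy, non-overlapping scan: at each start take the longest word."""
    root = build_trie(words)
    lowered = text.lower()
    found = []
    i = 0
    while i < len(lowered):
        node = root
        best_end = None  # end index of the longest word starting at i, if any
        j = i
        while j < len(lowered) and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            if _END in node:
                best_end = j
        if best_end is None:
            i += 1  # noise character: no word starts here
        else:
            found.append(lowered[i:best_end].capitalize())
            i = best_end  # resume after the word, so matches never overlap
    return found


print(longest_words("xyzbuildingcontractor", ["build", "building", "contractor"]))
```

Whether greedy or lazy matching gives better results on the real 20,000 strings depends on the dictionary used, so it is worth trying both on a sample.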