文本相似度分析(Excel)
我有一个项目清单,我想确定它们与这个清单中其他项目的相似性。
我期望的输出将是沿着以下线的东西:
相似性列中显示的百分比纯粹是说明性的。 我在考虑对相似性进行testing的方法是:
并发字母数/按匹配项目中的字母总数
但是会热衷于获得有关这方面的意见。
这是Excel中合理可行的东西吗? 我只有一个只包含字母数字值的小数据集(140kb)。
我也接受替代方法来解决这个问题,因为我之前没有处理过这样的事情!
Ps我已经学了几个月的Python了,所以使用Python的build议也会很好!
这是一个使用VBA UDF的解决scheme:
编辑 :增加了一个新的可选参数名为arg_lMinConsecutive
,用于确定必须匹配的连续字符的最小数量。 请注意以下公式中的额外参数2
,表示至less有2个连续字符必须匹配。
Public Function FuzzyMatch(ByVal arg_sText As String, _ ByVal arg_vList As Variant, _ ByVal arg_lOutput As Long, _ Optional ByVal arg_lMinConsecutive As Long = 1, _ Optional ByVal arg_bMatchCase As Boolean = True, _ Optional ByVal arg_bExactCount As Boolean = True) _ As Variant Dim dExactCounts As Object Dim aResults() As Variant Dim vList As Variant Dim vListItem As Variant Dim sLetter As String Dim dMaxMatch As Double Dim lMaxIndex As Long Dim lResultIndex As Long Dim lLastMatch As Long Dim i As Long Dim bMatch As Boolean If arg_lMinConsecutive <= 0 Then FuzzyMatch = CVErr(xlErrNum) Exit Function End If If arg_bExactCount = True Then Set dExactCounts = CreateObject("Scripting.Dictionary") If TypeName(arg_vList) = "Collection" Or TypeName(arg_vList) = "Range" Then ReDim aResults(1 To arg_vList.Count, 1 To 3) Set vList = arg_vList ElseIf IsArray(arg_vList) Then ReDim aResults(1 To UBound(arg_vList) - LBound(arg_vList) + 1, 1 To 3) vList = arg_vList Else ReDim vList(1 To 1) vList(1) = arg_vList ReDim aResults(1 To 1, 1 To 3) End If dMaxMatch = 0# lMaxIndex = 0 lResultIndex = 0 For Each vListItem In vList If vListItem <> arg_sText Then lLastMatch = -arg_lMinConsecutive lResultIndex = lResultIndex + 1 aResults(lResultIndex, 3) = vListItem If arg_bExactCount Then dExactCounts.RemoveAll For i = 1 To Len(arg_sText) - arg_lMinConsecutive + 1 bMatch = False sLetter = Mid(arg_sText, i, arg_lMinConsecutive) If Not arg_bMatchCase Then sLetter = LCase(sLetter) If arg_bExactCount Then dExactCounts(sLetter) = dExactCounts(sLetter) + 1 Select Case Abs(arg_bMatchCase) + Abs(arg_bExactCount) * 2 Case 0 'MatchCase is false and ExactCount is false If InStr(1, vListItem, sLetter, vbTextCompare) > 0 Then bMatch = True Case 1 'MatchCase is true and ExactCount is false If InStr(1, vListItem, sLetter) > 0 Then bMatch = True Case 2 'MatchCase is false and ExactCount is true If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString, Compare:=vbTextCompare)) >= dExactCounts(sLetter) Then bMatch = True Case 3 'MatchCase is true and ExactCount is true If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString)) >= dExactCounts(sLetter) Then bMatch = True End Select If bMatch Then aResults(lResultIndex, 1) = aResults(lResultIndex, 1) + WorksheetFunction.Min(arg_lMinConsecutive, i - lLastMatch) lLastMatch = i End If Next i If Len(vListItem) > 0 Then aResults(lResultIndex, 2) = aResults(lResultIndex, 1) / Len(vListItem) If aResults(lResultIndex, 2) > dMaxMatch Then dMaxMatch = aResults(lResultIndex, 2) lMaxIndex = lResultIndex End If Else aResults(lResultIndex, 2) = 0 End If End If Next vListItem If dMaxMatch = 0# Then Select Case arg_lOutput Case 1: FuzzyMatch = 0 Case 2: FuzzyMatch = vbNullString Case Else: FuzzyMatch = CVErr(xlErrNum) End Select Else Select Case arg_lOutput Case 1: FuzzyMatch = Application.Min(1, aResults(lMaxIndex, 2)) Case 2: FuzzyMatch = aResults(lMaxIndex, 3) Case Else: FuzzyMatch = CVErr(xlErrNum) End Select End If End Function
只使用列A和B中的原始数据,可以使用此UDF在列C和D中获得所需的结果:
在单元格C2中复制下来的是这个公式:
=FuzzyMatch($B2,$B$2:$B$6,COLUMN(A2),2)
在单元格D2中复制下来的是这个公式:
=IFERROR(INDEX(A:A,MATCH(FuzzyMatch($B2,$B$2:$B$6,COLUMN(B2),2),B:B,0)),"-")
请注意,它们都使用FuzzyMatch
UDF。
在Python中,您可以使用Levenshtein距离来获得结果。 看看这个答案:
Python中的模糊string比较,与哪个库混淆使用
我真的没有得到整个逻辑,但如果你需要100%的逻辑在这里:
Option Explicit Sub TestMe() Dim rngCell As Range Dim rngCell2 As Range Dim lngTotal As Long Dim lngTotal2 As Long Dim lngCount As Long For Each rngCell In Sheets(1).Range("A1:A5") For Each rngCell2 In Sheets(1).Range("A1:A5") If rngCell.Address <> rngCell2.Address Then If InStr(1, rngCell, rngCell2) Then rngCell.Offset(0, 1) = 1 Else If InStr(1, rngCell2, rngCell) Then rngCell.Offset(0, 2) = Round(CDbl(Len(rngCell) / Len(rngCell2)), 2) End If End If End If Next rngCell2 Next rngCell End Sub
在这里,你去图片: