Comparing 2 lists in Excel with VBA regular expressions

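I want to use VBA regular expressions to compare two lists (columns) in Excel and find matches. Because this is a fairly complex operation, I previously used several different (non-VBA) Excel functions, but that proved awkward at best, so I would like to try an all-in-one VBA solution if possible.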

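The first column has irregular names (for example quoted nicknames, suffixes such as "jr" or "sr", and a "preferred" version of the first name in parentheses). Also, when a middle name is present, it may be either a full name or an initial.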

The first column is ordered as:

<first name or initial> <space> <any parenthetical 'preferred' names - if they exist> <space> <middle name or initial - if it exists> <space> <quoted nickname or initial - if it exists> <space> <last name> <comma - if necessary><space - if necessary><suffix - if it exists> 

The second column is ordered as:

  `<lastname><space><suffix>,<firstname><space><middle name, if it exists>` 

with none of the "irregularities" found in the first column.

My main goal is to "clean up" the first column into the following order:

  `lastname-space-suffix,firstname-space-preferred name-space- middle name-space-nickname` 

Although I keep the "irregularities" here, I may use some kind of "flag" in the comparison code to alert me to them case by case.

I have been trying several patterns; this is my most recent:

 ["]?([A-Za-z]?)[.]?["]?[.]?[\s]?[,]?[\s]? 

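However, I want to allow for the last name and the suffix (if it exists). I have tested it with "Global" enabled, but I can't work out how to separate the last name and the suffix using backreferences.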

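Then I want to compare last name, first name, and middle initial between the two lists (since most names in the first list only have a middle initial).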

    An example would be:
    (1st list) John (Johnny) B. "Abe" Smith, Jr.
    turned into: Smith Jr,John (Johnny) B "Abe" or Smith Jr,John B
    and
    (2nd list) Smith Jr,John Bertrand
    turned into: Smith Jr,John B
    Then run a comparison between the two columns.

Would this list comparison be a good starting or continuation point?


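Addendum, April 10, 2012:

As a side note, I will need to eliminate the quotes from the nicknames and the parentheses from the preferred names. Can I break the group references down further into sub-groups (as in the example below)?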

    (?: ([ ] \( [^)]* \)))?        # (2) parenthetical 'preferred' name (optional)
    (?: ([ ] (["'] ) .*?) \6 )?    # (5,6) quoted nickname or initial (optional)

Can I group them like this:

    (?:(([ ])(\()([^)]*)(\))))?    # (2) parenthetical 'preferred' name (optional)
    not sure how to do this one -  # (5,6) quoted nickname or initial (optional)

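I tried these in "Regex Coach" and "RegExr" and they worked fine, but in VBA, when I ask for the backreferences, all that comes back is the first name, the digit 1, and a comma (e.g. "Carl1"). I am going back to check for typos. Thanks for your help.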


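Addendum, April 17, 2012:

I overlooked one name "situation": last names made up of two or more words, e.g. "St Cyr" or "Von Wilhelm".
Would adding the following

  `((St|Von)[ ])?`

to the regex you provided, like this, be the solution?

  `((St|Von)[ ])?([^\,()"']+)`

My tests in Regex Coach and RegExr have not succeeded so far: the replacement returns "St" with a space in front of it.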

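Redo -

Here is a different approach. It may work in your VBA; treat it only as an example. I tested it in Perl and it worked well. I won't show the Perl code, though, just some explanation of the regular expressions.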

This is a two-step process.

  1. Normalize the column text
  2. Do the main parsing

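Normalization process

  • Get the column value
  • Strip out all dots: global search for \. , replace with nothing
  • Collapse whitespace runs to a single space: global search for \s+ , replace with a single space [ ]

(Note that if the text can't be normalized, I don't see much chance of success no matter what is tried.)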
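The normalization steps above can be sketched as a small VBA helper. This is only an illustration: the function name `NormalizeName` and the demo input are assumptions, not part of the original answer. It removes dots first and then collapses whitespace, following the order of the list above.

```vba
Option Explicit

' Hypothetical helper: strips every dot, then collapses each run of
' whitespace to one space, mirroring the normalization steps above.
Function NormalizeName(ByVal rawText As String) As String
    Dim rxDot As Object, rxWSp As Object

    Set rxDot = CreateObject("VBScript.RegExp")
    rxDot.Global = True          ' replace every dot, not just the first
    rxDot.Pattern = "\."

    Set rxWSp = CreateObject("VBScript.RegExp")
    rxWSp.Global = True
    rxWSp.Pattern = "\s+"

    NormalizeName = rxWSp.Replace(rxDot.Replace(rawText, ""), " ")
End Function

Sub DemoNormalize()
    ' Prints: [John (Johnny) B "Abe" Smith, Jr ]
    Debug.Print "[" & NormalizeName("John  (Johnny) B.   ""Abe"" Smith, Jr. ") & "]"
End Sub
```

A leading or trailing space survives normalization as a single space, which is why the parsing patterns allow an optional `[ ]?` at the start and end.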

Main parsing process

After normalizing a column value (do this for both columns), run it through these regular expressions.

Column 1 regex

    ^
    [ ]?
    ([^\ ,()"']+)                  # (1) first name or initial (required)
    (?: ([ ] \( [^)]* \)) )?       # (2) parenthetical 'preferred' name (optional)
    (?: ([ ] [^\ ,()"'] )          # (3,4) middle initial OR name (optional)
        ([^\ ,()"']*)              #       name and initial are both captured
    )?
    (?: ([ ] (["'] ) .*?) \6 )?    # (5,6) quoted nickname or initial (optional)
    [ ]
    ([^\ ,()"']+)                  # (7) last name (required)
    (?: [, ]* ([ ].+?) [ ]?        # (8) suffix (optional)
      | .*?
    )?
    $

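The replacement depends on what you want.
Three types are defined (replace $ with \ as needed):

  1. type 1a, full middle: $7$8,$1$2$3$4$5$6
  2. type 1b, middle initial: $7$8,$1$2$3$5$6
  3. type 2, middle initial: $7$8,$1$3

Conversion example: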

    Input (raw)               = 'John (Johnny) Bertrand "Abe" Smith, Jr. '
    Out type 1 full middle    = 'Smith Jr,John (Johnny) Bertrand "Abe"'
    Out type 1 middle initial = 'Smith Jr,John (Johnny) B "Abe"'
    Out type 2 middle initial = 'Smith Jr,John B'

Column 2 regex

    ^
    [ ]?
    ([^\ ,()"']+)                  # (1) last name (required)
    (?: ([ ] [^\ ,()"']+) )?       # (2) suffix (optional)
    ,
    ([^\ ,()"']+)                  # (3) first name or initial (required)
    (?: ([ ] [^\ ,()"'])           # (4,5) middle initial OR name (optional)
        ([^\ ,()"']*)
    )?
    .*
    $

The replacement depends on what you want.
Two types are defined (replace $ with \ as needed):

  1. type 1a, full middle: $1$2,$3$4$5
  2. type 1b, middle initial: $1$2,$3$4

Conversion example:

    Input                     = 'Smith Jr.,John Bertrand '
    Out type 1 full middle    = 'Smith Jr,John Bertrand'
    Out type 1 middle initial = 'Smith Jr,John B'
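One detail worth knowing when applying these patterns: `Replace` returns its input unchanged when the pattern does not match, so a malformed value would silently pass through. A minimal sketch of guarding the column 2 conversion with `Test` follows; the function name `Col2Key` and the convention of returning an empty string as a "needs review" flag are assumptions made for illustration.

```vba
Option Explicit

' Hypothetical wrapper: returns the column 2 middle-initial form
' ("$1$2,$3$4"), or "" when the value does not fit the expected layout.
' Expects text that has already been normalized (dots removed,
' whitespace collapsed).
Function Col2Key(ByVal normalText As String) As String
    Dim rxC2 As Object

    Set rxC2 = CreateObject("VBScript.RegExp")
    rxC2.Global = False
    rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$"

    If rxC2.Test(normalText) Then
        Col2Key = rxC2.Replace(normalText, "$1$2,$3$4")
    Else
        Col2Key = ""    ' caller can treat "" as a flag for manual review
    End If
End Function

Sub DemoCol2Key()
    Debug.Print Col2Key("Smith Jr,John Bertrand ")   ' Smith Jr,John B
    Debug.Print Col2Key("no comma here") = ""        ' True
End Sub
```

The same guard could be applied to the column 1 pattern, which may help with the "flag" idea raised in the question.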

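VBA Replace help

This works on a very old copy of Excel, in a newly created VBA project.
The two modules were created to show an example.
They both do the same thing.

The first is a verbose example of all the possible replacement types.
The second is a trimmed-down version that uses only the type 2 comparison.

I haven't done VB before, but it should be simple enough for you to glean how the replacement works and how to tie it in with the Excel columns.

If you are just doing a flat comparison, you might take one column 1 value at a time, check it against every value in column 2, then move on to the next column 1 value and repeat.

For the fastest way to do this, create 2 extra columns, convert the respective column values to type 2 (variables strC1_2 and strC2_2, see the example), and copy them into the new columns.
After that you don't need the regex; just compare the columns, find the matching rows, then delete the type 2 columns.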
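The helper-column idea described above might look roughly like the sketch below. The sheet layout is entirely assumed: raw names in columns A and B of the active sheet, headers in row 1, text values only, and free columns C and D for the type 2 keys. The two patterns are copied from the example modules; `MakeRegex` and `FillType2HelperColumns` are hypothetical names.

```vba
Option Explicit

' Small factory so every regex object is configured the same way.
Private Function MakeRegex(ByVal pat As String, ByVal isGlobal As Boolean) As Object
    Dim rx As Object
    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = isGlobal
    rx.Pattern = pat
    Set MakeRegex = rx
End Function

Sub FillType2HelperColumns()
    Dim rxDot As Object, rxWSp As Object, rxC1 As Object, rxC2 As Object
    Dim ws As Worksheet
    Dim r As Long, lastRow As Long, lastRowB As Long
    Dim s1 As String, s2 As String

    Set rxDot = MakeRegex("\.", True)
    Set rxWSp = MakeRegex("\s+", True)
    Set rxC1 = MakeRegex("^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$", False)
    Set rxC2 = MakeRegex("^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$", False)

    Set ws = ActiveSheet                        ' assumption: lists are on the active sheet
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    lastRowB = ws.Cells(ws.Rows.Count, "B").End(xlUp).Row
    If lastRowB > lastRow Then lastRow = lastRowB

    For r = 2 To lastRow                        ' assumption: row 1 holds headers
        s1 = rxDot.Replace(rxWSp.Replace(CStr(ws.Cells(r, "A").Value), " "), "")
        s2 = rxDot.Replace(rxWSp.Replace(CStr(ws.Cells(r, "B").Value), " "), "")
        ' Replace returns its input unchanged when the pattern does not match
        ws.Cells(r, "C").Value = rxC1.Replace(s1, "$7$8,$1$3")
        ws.Cells(r, "D").Value = rxC2.Replace(s2, "$1$2,$3$4")
    Next r
End Sub
```

With the keys in place, an ordinary worksheet formula such as `=COUNTIF($D:$D,C2)>0` in a spare column would flag the column 1 rows that have a counterpart in column 2, and columns C and D can be deleted afterwards.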

Verbose -

    Sub RegexColumnValueComparison()

        ' Column 1 and 2, sample values
        ' These should probably be passed in as values
        ' ============================================
        strC1 = "John (Johnny) Bertrand ""Abe"" Smith, Jr. "
        strC2 = "Smith Jr.,John Bertrand "

        ' Normalization regexes for whitespace and periods
        ' (use for both column values)
        ' =============================================
        Set rxDot = CreateObject("vbscript.regexp")
        rxDot.Global = True
        rxDot.Pattern = "\."
        Set rxWSp = CreateObject("vbscript.regexp")
        rxWSp.Global = True
        rxWSp.Pattern = "\s+"

        ' Column 1 regex
        ' ==================
        Set rxC1 = CreateObject("vbscript.regexp")
        rxC1.Global = False
        rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$"

        ' Column 2 regex
        ' ==================
        Set rxC2 = CreateObject("vbscript.regexp")
        rxC2.Global = False
        rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$"

        ' Normalize column 1 and 2, copy to new var
        ' ============================================
        strC1_Normal = rxDot.Replace(rxWSp.Replace(strC1, " "), "")
        strC2_Normal = rxDot.Replace(rxWSp.Replace(strC2, " "), "")

        ' ------------------------------------------------------
        ' This section is informational
        ' Shows some sample replacements before comparison
        ' Just pick 1 replacement from each column, discard the rest
        ' ------------------------------------------------------

        ' Create some replacement types for column 1
        ' =====================================================
        strC1_1a = rxC1.Replace(strC1_Normal, "$7$8,$1$2$3$4$5$6")
        strC1_1b = rxC1.Replace(strC1_Normal, "$7$8,$1$2$3$5$6")
        strC1_2 = rxC1.Replace(strC1_Normal, "$7$8,$1$3")

        ' Create some replacement types for column 2
        ' (strC2_1a is the full-middle form; strC2_2 is the middle-initial
        '  form, the one that lines up with column 1 type 2)
        ' =====================================================
        strC2_1a = rxC2.Replace(strC2_Normal, "$1$2,$3$4$5")
        strC2_2 = rxC2.Replace(strC2_Normal, "$1$2,$3$4")

        ' Show types in message box
        ' =====================================================
        c1_t1a = "Column1 Types:" & Chr(13) & "type 1a full middle - " & strC1_1a
        c1_t1b = "type 1b middle initial - " & strC1_1b
        c1_t2 = "type 2 middle initial - " & strC1_2
        c2_t1a = "Column2 Types:" & Chr(13) & "type 1a full middle - " & strC2_1a
        c2_t2 = "type 2 middle initial - " & strC2_2

        MsgBox (c1_t1a & Chr(13) & c1_t1b & Chr(13) & c1_t2 & Chr(13) & Chr(13) & c2_t1a & Chr(13) & c2_t2)

        ' ------------------------------------------------------
        ' Compare a value from column 1 vs column 2
        ' For this we will compare type 2 values
        ' ------------------------------------------------------
        If strC1_2 = strC2_2 Then
            MsgBox ("Type 2 values are EQUAL: " & Chr(13) & strC1_2)
        Else
            ' Show both sides of the failed comparison
            MsgBox ("Type 2 values are NOT Equal:" & Chr(13) & strC1_2 & " != " & strC2_2)
        End If

        ' ------------------------------------------------------
        ' Same comparison (type 2) of normalized column 1,2 values
        ' In essence, this is all you need
        ' ------------------------------------------------------
        If rxC1.Replace(strC1_Normal, "$7$8,$1$3") = rxC2.Replace(strC2_Normal, "$1$2,$3$4") Then
            MsgBox ("Type 2 values are EQUAL")
        Else
            MsgBox ("Type 2 values are NOT Equal")
        End If

    End Sub

Type 2 only -

    Sub RegexColumnValueComparison()

        ' Column 1 and 2, sample values
        ' These should probably be passed in as values
        ' ============================================
        strC1 = "John (Johnny) Bertrand ""Abe"" Smith, Jr. "
        strC2 = "Smith Jr.,John Bertrand "

        ' Normalization regexes for whitespace and periods
        ' (use for both column values)
        ' =============================================
        Set rxDot = CreateObject("vbscript.regexp")
        rxDot.Global = True
        rxDot.Pattern = "\."
        Set rxWSp = CreateObject("vbscript.regexp")
        rxWSp.Global = True
        rxWSp.Pattern = "\s+"

        ' Column 1 regex
        ' ==================
        Set rxC1 = CreateObject("vbscript.regexp")
        rxC1.Global = False
        rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$"

        ' Column 2 regex
        ' ==================
        Set rxC2 = CreateObject("vbscript.regexp")
        rxC2.Global = False
        rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$"

        ' Normalize column 1 and 2, copy to new var
        ' ============================================
        strC1_Normal = rxDot.Replace(rxWSp.Replace(strC1, " "), "")
        strC2_Normal = rxDot.Replace(rxWSp.Replace(strC2, " "), "")

        ' Comparison (type 2) of normalized column 1,2 values
        ' ============================================
        strC1_2 = rxC1.Replace(strC1_Normal, "$7$8,$1$3")
        strC2_2 = rxC2.Replace(strC2_Normal, "$1$2,$3$4")

        If strC1_2 = strC2_2 Then
            MsgBox ("Type 2 values are EQUAL")
        Else
            MsgBox ("Type 2 values are NOT Equal")
        End If

    End Sub

Paren/Quote response

As a side note, I will need to eliminate the quotes from the nicknames and the parentheses from the preferred names.

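If I understand correctly:

Yes, you can capture the content inside the quotes and the parentheses separately.
It just takes some modification. The regex below makes it possible to formulate replacements with or without the quotes and/or parentheses, or in other forms.

The samples below show ways to formulate the replacements.

Very important note here

If you are talking about removing the quotes "" and parentheses () from the matching regex itself, that can be done too. It requires a new regex.

The only problem is that all distinction between preferred/middle/nick gets thrown out the window, because these are positional as well
(i.e.: (preferred) middle "nick").

Dropping that consideration would require a regex like this: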

    (?:[ ]([^ ,]+))?    # optional preferred
    (?:[ ]([^ ,]+))?    # optional middle
    (?:[ ]([^ ,]+))?    # optional nick

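And because they are all optional, every positional reference is lost, which makes the middle-initial expression invalid.

End note

Regex template (for formulating replacement strings)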

    ^
    [ ]?

    # (required)
    # First
    #   $1  name
    # -----------------------------------------
    ([^\ ,()"']+)                  # (1) name

    # (optional)
    # Parenthetical 'preferred'
    #   $2    all
    #   $3$4  name
    # -----------------------------------------
    (?:
        (                          # (2) all
          ([ ]) \( ([^)]*) \)      # (3,4) space and name
        )
    )?

    # (optional)
    # Middle
    #   $5    initial
    #   $5$6  name
    # -----------------------------------------
    (?:
        ([ ] [^\ ,()"'] )          # (5) first character
        ([^\ ,()"']*)              # (6) remaining characters
    )?

    # (optional)
    # Quoted nick
    #   $7$8$9$8  all
    #   $7$9      name
    # -----------------------------------------
    (?:
        ([ ])                      # (7) space
        (["'])                     # (8) quote
        (.*?)                      # (9) name
        \8
    )?

    # (required)
    # Last
    #   $10  name
    # -----------------------------------------
    [ ]
    ([^\ ,()"']+)                  # (10) name

    # (optional)
    # Suffix
    #   $11  suffix
    # -----------------------------------------
    (?:
        [, ]* ([ ].+?) [ ]?        # (11) suffix
      | .*?
    )?
    $

VBA regex (second version, tested in the VBA project above)

    rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:(([ ])\(([^)]*)\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ])([""'])(.*?)\8)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$"

    strC1_1a = rxC1.Replace(strC1_Normal, "$10$11,$1$2$5$6$7$8$9$8")
    strC1_1aa = rxC1.Replace(strC1_Normal, "$10$11,$1$3$4$5$6$7$9")
    strC1_1b = rxC1.Replace(strC1_Normal, "$10$11,$1$2$5$7$8$9$8")
    strC1_1bb = rxC1.Replace(strC1_Normal, "$10$11,$1$3$4$5$7$9")
    strC1_2 = rxC1.Replace(strC1_Normal, "$10$11,$1$5")

Sample input/output possibilities

    Input (raw)                 = 'John (Johnny) Bertrand "Abe" Smith, Jr. '
    Out type 1a full middle     = 'Smith Jr,John (Johnny) Bertrand "Abe"'
    Out type 1aa full middle    = 'Smith Jr,John Johnny Bertrand Abe'
    Out type 1b middle initial  = 'Smith Jr,John (Johnny) B "Abe"'
    Out type 1bb middle initial = 'Smith Jr,John Johnny B Abe'
    Out type 2 middle initial   = 'Smith Jr,John B'

    Input (raw)                 = 'John (Johnny) Smith, Jr.'
    Out type 1a full middle     = 'Smith Jr,John (Johnny)'
    Out type 1aa full middle    = 'Smith Jr,John Johnny'
    Out type 1b middle initial  = 'Smith Jr,John (Johnny)'
    Out type 1bb middle initial = 'Smith Jr,John Johnny'
    Out type 2 middle initial   = 'Smith Jr,John'

    Input (raw)                 = 'John (Johnny) "Abe" Smith, Jr.'
    Out type 1a full middle     = 'Smith Jr,John (Johnny) "Abe"'
    Out type 1aa full middle    = 'Smith Jr,John Johnny Abe'
    Out type 1b middle initial  = 'Smith Jr,John (Johnny) "Abe"'
    Out type 1bb middle initial = 'Smith Jr,John Johnny Abe'
    Out type 2 middle initial   = 'Smith Jr,John'

    Input (raw)                 = 'John "Abe" Smith, Jr.'
    Out type 1a full middle     = 'Smith Jr,John "Abe"'
    Out type 1aa full middle    = 'Smith Jr,John Abe'
    Out type 1b middle initial  = 'Smith Jr,John "Abe"'
    Out type 1bb middle initial = 'Smith Jr,John Abe'
    Out type 2 middle initial   = 'Smith Jr,John'

Re: 4/17 concern

last names that have 2 or more words. Would the allowance for certain literal names, rather than generic word patterns, be the solution?

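Actually, no, it wouldn't. In this case, for your form, allowing multi-word last names injects the space field delimiter into the last-name field.

However, for your particular form it can be done, because the only obstacle is the case where the "nick" field is missing. When it is missing, and given that the middle name is only one word, there are 2 possible permutations.

Hopefully you can derive a solution from the 3 regexes and the test case output below. The space delimiters have been removed from the captures in these regexes, so you can either compose replacements with the Replace method, or just store the capture buffers and compare them against the results of the other column's capture scheme.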

    Nick_rx.Pattern (template)

    * This pattern is multi-word last name, NICK is required

    ^
    [ ]?
    # First (req'd)
    ([^\ ,()"']+)                  # (1) first name
    # Preferred first
    (?: [ ]
        (                          # (2) (preferred), -or-
          \( ([^)]*?) \)           # (3) preferred
        )
    )?
    # Middle
    (?: [ ]
        (                          # (4) full middle, -or-
          ([^\ ,()"'])             # (5) initial
          [^\ ,()"']*
        )
    )?
    # Quoted nick (req'd)
    [ ]
    (                              # (6) "nick",
      (["'])                       # (7) n/a      -or-
      (.*?)                        # (8) nick
      \7
    )
    # Single/Multi Last (req'd)
    [ ]
    (                              # (9) multi/single word last name
      [^\ ,()"']+
      (?:[ ][^\ ,()"']+)*
    )
    # Suffix
    (?: [ ]? , [ ]? (.*?) )?       # (10) suffix
    [ ]?
    $

    -----------------------------------
    FLs_rx.Pattern (template)

    * This pattern has no MIDDLE/NICK, is single-word last name,
    * and has no permutations.

    ^
    [ ]?
    # First (req'd)
    ([^\ ,()"']+)                  # (1) first name
    # Preferred first
    (?: [ ]
        (                          # (2) (preferred), -or-
          \( ([^)]*?) \)           # (3) preferred
        )
    )?
    # Single Last (req'd)
    [ ]
    ([^\ ,()"']+)                  # (4) single word last name
    # Suffix
    (?: [ ]? , [ ]? (.*?) )?       # (5) suffix
    [ ]?
    $

    -----------------------------------
    FLm_rx.Pattern (template)

    * This pattern has no NICK, is multi-word last name,
    * and has 2 permutations.
    * 1. Middle as part of Last name.
    * 2. Middle is separate from Last name.

    ^
    [ ]?
    # First (req'd)
    ([^\ ,()"']+)                  # (1) first name
    # Preferred first
    (?: [ ]
        (                          # (2) (preferred), -or-
          \( ([^)]*?) \)           # (3) preferred
        )
    )?
    # Multi Last (req'd)
    [ ]
    (                              # (4) Multi, as Middle + Last,
                                   #     -or-
      (?:                          # Middle
        (                          # (5) full middle, -or-
          ([^\ ,()"'])             # (6) initial
          [^\ ,()"']*
        )
        [ ]
      )
      # Last (req'd)
      (                            # (7) multi/single word last name
        [^\ ,()"']+
        (?:[ ][^\ ,()"']+)*
      )
    )
    # Suffix
    (?: [ ]? , [ ]? (.*?) )?       # (8) suffix
    [ ]?
    $

    -----------------------------------
    Each of these regexes is mutually exclusive and should be checked
    in an if-then-else like this (pseudo code):

    str_Normal = rxDot.Replace(rxWSp.Replace(str, " "), "")

    If Nick_rx.Test(str_Normal) Then
        N_1a = rxWSp.Replace(Nick_rx.Replace(str_Normal, "$9 $10 , $1 $2 $4 $6 "), " ")
        N_1aa = rxWSp.Replace(Nick_rx.Replace(str_Normal, "$9 $10 , $1 $3 $4 $8 "), " ")
        N_1b = rxWSp.Replace(Nick_rx.Replace(str_Normal, "$9 $10 , $1 $2 $5 $6 "), " ")
        N_1bb = rxWSp.Replace(Nick_rx.Replace(str_Normal, "$9 $10 , $1 $3 $5 $8 "), " ")
        N_2 = rxWSp.Replace(Nick_rx.Replace(str_Normal, "$9 $10 , $1 $5 "), " ")
        ' see test case results in output below

    ElseIf FLs_rx.Test(str_Normal) Then
        FLs_1a = rxWSp.Replace(FLs_rx.Replace(str_Normal, "$4 $5 , $1 $2 "), " ")
        FLs_1aa = rxWSp.Replace(FLs_rx.Replace(str_Normal, "$4 $5 , $1 $3 "), " ")
        FLs_2 = rxWSp.Replace(FLs_rx.Replace(str_Normal, "$4 $5 , $1 "), " ")

    ElseIf FLm_rx.Test(str_Normal) Then
        ' Permutation 1:
        FLm1_1a = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$4 $8 , $1 $2 "), " ")
        FLm1_1aa = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$4 $8 , $1 $3 "), " ")
        FLm1_2 = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$4 $8 , $1 "), " ")
        ' Permutation 2:
        FLm2_1a = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$7 $8 , $1 $2 $5 "), " ")
        FLm2_1aa = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$7 $8 , $1 $3 $5 "), " ")
        FLm2_1b = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$7 $8 , $1 $2 $6 "), " ")
        FLm2_1bb = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$7 $8 , $1 $3 $6 "), " ")
        FLm2_2 = rxWSp.Replace(FLm_rx.Replace(str_Normal, "$7 $8 , $1 $6 "), " ")
        ' At this point, the odds are that only one of these permutations
        ' will match a different column.

    Else
        ' The data could not be matched against a valid form
    End If

    -----------------------------
    Test Cases

    Found form 'Nick'
    Input (raw)                 = 'John1 (JJ) Bert "nick" St Van Helsing ,Jr '
    Normal                      = 'John1 (JJ) Bert "nick" St Van Helsing ,Jr '
    Out type 1a full middle     = 'St Van Helsing Jr , John1 (JJ) Bert "nick" '
    Out type 1aa full middle    = 'St Van Helsing Jr , John1 JJ Bert nick '
    Out type 1b middle initial  = 'St Van Helsing Jr , John1 (JJ) B "nick" '
    Out type 1bb middle initial = 'St Van Helsing Jr , John1 JJ B nick '
    Out type 2 middle initial   = 'St Van Helsing Jr , John1 B '
    =======================================================
    Found form 'Nick'
    Input (raw)                 = 'John2 Bert "nick" Helsing ,Jr '
    Normal                      = 'John2 Bert "nick" Helsing ,Jr '
    Out type 1a full middle     = 'Helsing Jr , John2 Bert "nick" '
    Out type 1aa full middle    = 'Helsing Jr , John2 Bert nick '
    Out type 1b middle initial  = 'Helsing Jr , John2 B "nick" '
    Out type 1bb middle initial = 'Helsing Jr , John2 B nick '
    Out type 2 middle initial   = 'Helsing Jr , John2 B '
    =======================================================
    Found form 'Nick'
    Input (raw)                 = 'John3 Bert "nick" St Van Helsing ,Jr '
    Normal                      = 'John3 Bert "nick" St Van Helsing ,Jr '
    Out type 1a full middle     = 'St Van Helsing Jr , John3 Bert "nick" '
    Out type 1aa full middle    = 'St Van Helsing Jr , John3 Bert nick '
    Out type 1b middle initial  = 'St Van Helsing Jr , John3 B "nick" '
    Out type 1bb middle initial = 'St Van Helsing Jr , John3 B nick '
    Out type 2 middle initial   = 'St Van Helsing Jr , John3 B '
    =======================================================
    Found form 'First-Last (single)'
    Input (raw)                 = 'John4 Helsing '
    Normal                      = 'John4 Helsing '
    Out type 1a no middle       = 'Helsing , John4 '
    Out type 1aa no middle      = 'Helsing , John4 '
    Out type 2                  = 'Helsing , John4 '
    =======================================================
    Found form 'First-Last (single)'
    Input (raw)                 = 'John5 (JJ) Helsing '
    Normal                      = 'John5 (JJ) Helsing '
    Out type 1a no middle       = 'Helsing , John5 (JJ) '
    Out type 1aa no middle      = 'Helsing , John5 JJ '
    Out type 2                  = 'Helsing , John5 '
    =======================================================
    Found form 'First-Last (multi)'
    Input (raw)                 = 'John6 (JJ) Bert St Van Helsing ,Jr '
    Normal                      = 'John6 (JJ) Bert St Van Helsing ,Jr '
    Permutation 1:
    Out type 1a no middle       = 'Bert St Van Helsing Jr , John6 (JJ) '
    Out type 1aa no middle      = 'Bert St Van Helsing Jr , John6 JJ '
    Out type 2                  = 'Bert St Van Helsing Jr , John6 '
    Permutation 2:
    Out type 1a full middle     = 'St Van Helsing Jr , John6 (JJ) Bert '
    Out type 1aa full middle    = 'St Van Helsing Jr , John6 JJ Bert '
    Out type 1b middle initial  = 'St Van Helsing Jr , John6 (JJ) B '
    Out type 1bb middle initial = 'St Van Helsing Jr , John6 JJ B '
    Out type 2 middle initial   = 'St Van Helsing Jr , John6 B '
    =======================================================
    Found form 'First-Last (multi)'
    Input (raw)                 = 'John7 Bert St Van Helsing ,Jr '
    Normal                      = 'John7 Bert St Van Helsing ,Jr '
    Permutation 1:
    Out type 1a no middle       = 'Bert St Van Helsing Jr , John7 '
    Out type 1aa no middle      = 'Bert St Van Helsing Jr , John7 '
    Out type 2                  = 'Bert St Van Helsing Jr , John7 '
    Permutation 2:
    Out type 1a full middle     = 'St Van Helsing Jr , John7 Bert '
    Out type 1aa full middle    = 'St Van Helsing Jr , John7 Bert '
    Out type 1b middle initial  = 'St Van Helsing Jr , John7 B '
    Out type 1bb middle initial = 'St Van Helsing Jr , John7 B '
    Out type 2 middle initial   = 'St Van Helsing Jr , John7 B '
    =======================================================
    Form *** (unknown)
    Input (raw)                 = ' do(e)s not. match ,'
    Normal                      = ' do(e)s not match ,'
    =======================================================
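The other option mentioned earlier, storing the capture buffers instead of composing a replacement string, could be sketched with `Execute` and `SubMatches`. The one-line pattern below is a hand-compacted version of the FLs_rx template and is an assumption on my part (it was not part of the tested output above); the sub name and sample input are likewise illustrative.

```vba
Option Explicit

Sub DemoStoreCaptures()
    Dim rxFLs As Object, found As Object, m As Object
    Dim firstName As String, preferred As String
    Dim lastName As String, suffix As String

    Set rxFLs = CreateObject("VBScript.RegExp")
    rxFLs.Global = False
    ' Assumption: compacted form of the FLs_rx template (whitespace and
    ' comments removed, quotes doubled for the VBA string literal)
    rxFLs.Pattern = "^[ ]?([^ ,()""']+)(?:[ ](\(([^)]*?)\)))?[ ]([^ ,()""']+)(?:[ ]?,[ ]?(.*?))?[ ]?$"

    Set found = rxFLs.Execute("John5 (JJ) Helsing ,Jr ")
    If found.Count = 0 Then
        Debug.Print "no match"
        Exit Sub
    End If

    Set m = found.Item(0)
    ' SubMatches is zero-based: group (1) is SubMatches(0), and so on.
    ' Appending "" keeps the assignment safe when an optional group
    ' did not take part in the match.
    firstName = m.SubMatches(0) & ""
    preferred = m.SubMatches(2) & ""    ' group (3): name without parentheses
    lastName = m.SubMatches(3) & ""
    suffix = m.SubMatches(4) & ""

    Debug.Print lastName & "|" & suffix & "|" & firstName & "|" & preferred
End Sub
```

For the sample input this prints `Helsing|Jr|John5|JJ`. Holding the fields in separate variables lets the comparison against column 2 be done field by field rather than on one composed string.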

Here is a regex that may be useful. It gives you 6 capture groups in the following order: first name, preferred name, middle name, nickname, last name, suffix.

    ([a-z]+)\.?\s(?:(\([a-z]+\))\s)?(?:([a-z]+)\.?\s)?(?:("[a-z]+")\s)?([a-z]+)(?:,\s([a-z]+))?

Here is an explanation:

    ([a-z]+)\.?\s           # First name, followed by optional '.' (required)
    (?:(\([a-z]+\))\s)?     # Preferred name, optional
    (?:([a-z]+)\.?\s)?      # Middle name, optional
    (?:("[a-z]+")\s)?       # Nickname, optional
    ([a-z]+)                # Last name, required
    (?:,\s([a-z]+))?        # Suffix, optional

For example, you can turn John (Johnny) B. "Abe" Smith, Jr. into Smith Jr,John (Johnny) B "Abe" with \5 \6,\1 \2 \3 \4, or you can use \5 \6,\1 \3 to turn it into Smith Jr,John B
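A minimal sketch of trying this six-group pattern from VBA follows. Two points are assumptions on my part rather than something the answer states: `IgnoreCase` is switched on because the character classes are lower-case only, and the result is assembled from `SubMatches` rather than with `Replace`, since the pattern is unanchored and `Replace` would leave the unmatched trailing period in place. Note also that VBScript replacement strings use `$1` rather than `\1`.

```vba
Option Explicit

Sub DemoSixGroupRegex()
    Dim rx As Object, found As Object, m As Object

    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = False
    rx.IgnoreCase = True    ' assumption: lets [a-z] match capital letters too
    rx.Pattern = "([a-z]+)\.?\s(?:(\([a-z]+\))\s)?(?:([a-z]+)\.?\s)?(?:(""[a-z]+"")\s)?([a-z]+)(?:,\s([a-z]+))?"

    Set found = rx.Execute("John (Johnny) B. ""Abe"" Smith, Jr.")
    If found.Count > 0 Then
        Set m = found.Item(0)
        ' Groups 5, 6, 1, 3 are SubMatches 4, 5, 0, 2 (zero-based)
        Debug.Print m.SubMatches(4) & " " & m.SubMatches(5) & "," & _
                    m.SubMatches(0) & " " & m.SubMatches(2)
    End If
End Sub
```

For the sample name this prints `Smith Jr,John B`. When an optional group is absent, the fixed spaces in the concatenation remain, so a final trim or whitespace collapse may be needed.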