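Yesterday I commented on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or the digit specifier than an explicit character set.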
[0123456789]
[0-9]
\d
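I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which do not seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters each, of which 5077 actually contain a digit: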
Regex \d           took 00:00:00.2141226 result: 5077/10000
Regex [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first
This surprises me for two reasons, and I would be interested if anyone can shed some light on them:
Here is the test code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();

            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }

                //in roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //replace 1 char with a digit 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();
            int successes = 0;
            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);
            return sw.Elapsed;
        }
    }
}
\d checks against all Unicode digits, while [0-9] is limited to those 10 characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that \d matches but [0-9] does not.
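The difference can be checked directly. The following is a minimal sketch, not part of the original test; the class name DigitClassDemo and the helper Matches are made up for illustration. It also shows RegexOptions.ECMAScript, under which .NET treats \d as equivalent to [0-9].

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClassDemo
{
    // Hypothetical helper: true when the pattern matches somewhere in the input.
    public static bool Matches(string input, string pattern, RegexOptions options = RegexOptions.None)
    {
        return Regex.IsMatch(input, pattern, options);
    }

    public static void Main()
    {
        string persianFour = "\u06F4"; // EXTENDED ARABIC-INDIC DIGIT FOUR

        Console.WriteLine(Matches(persianFour, @"\d"));                          // True
        Console.WriteLine(Matches(persianFour, "[0-9]"));                        // False
        // With ECMAScript behavior, \d only matches ASCII 0-9.
        Console.WriteLine(Matches(persianFour, @"\d", RegexOptions.ECMAScript)); // False
        Console.WriteLine(Matches("4", @"\d", RegexOptions.ECMAScript));         // True
    }
}
```

If only ASCII digits should ever match, writing [0-9] or passing RegexOptions.ECMAScript makes that intent explicit. Whether either one closes the timing gap measured above would need to be benchmarked separately.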
You can generate a list of all such characters using the following code:
var sb = new StringBuilder();
for (UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());
Which generates:
012345678901234567890123456789߀߁߂߃߄߅߆߇߈߉012345678901২345678901234567890123456789୦୧୨୩୪୫୬୭୮୯0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789