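I'm looking for a good .NET regular expression that I can use to parse individual sentences out of a body of text.
It should be able to parse the following block of text into exactly six sentences: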
Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.
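This is proving much harder than I first thought.
Any help would be greatly appreciated. I'm going to use it to train a system on known bodies of text.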
Try this: @"(\S.+?[.!?])(?=\s+|$)"
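    using System;
    using System.Text.RegularExpressions;

    string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

    // A sentence starts at a non-whitespace character and ends at the first
    // '.', '!' or '?' that is followed by whitespace or the end of the input.
    Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
    foreach (Match match in rx.Matches(str))
    {
        int i = match.Index; // start offset of the sentence (not used below)
        Console.WriteLine(match.Value);
    }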
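Result:
    Hello world!
    How are you?
    I am fine.
    This is a difficult sentence because I use I.D.
    Newlines should also be accepted.
    Numbers should not cause sentence breaks, like 1.23.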
Of course, for anything more complex you will need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.
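To make the "quick and dirty" caveat concrete, here is a minimal sketch that wraps the same pattern in a small helper so it can be tried on other inputs. The class name SentenceSplitter and the method name Split are hypothetical, chosen only for illustration; they do not come from SharpNLP or any other library.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the class and method names are illustrative only.
public static class SentenceSplitter
{
    // Same pattern as in the answer: start at a non-whitespace character,
    // stop at the first '.', '!' or '?' followed by whitespace or end of input.
    private static readonly Regex SentencePattern =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    // Returns every substring the pattern treats as a sentence.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match match in SentencePattern.Matches(text))
        {
            sentences.Add(match.Value);
        }
        return sentences;
    }
}
```

On the sample text from the question this returns six sentences. An input such as "Dr. Smith arrived. He was late." comes back as three pieces, because the period after "Dr" is followed by whitespace and the pattern cannot tell an abbreviation from a sentence end. Note also that '.' does not match a line feed by default in .NET, so a sentence that wraps across lines would probably need RegexOptions.Singleline. Cases like these are where a trained sentence detector becomes worth the extra dependency.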
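Here is some information about SharpNLP and its features:
SharpNLP is a collection of natural language processing tools written in C#. Currently, it provides the following NLP tools: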