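I would like to know whether there is any Python library that can do fuzzy text search. For example:
I have tried fuzzywuzzy, but it does not solve my problem. Another library, Whoosh, looks powerful, but I could not find the right function…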
{1} You can do this in Whoosh 2.7. It supports fuzzy search through the plugin whoosh.qparser.FuzzyTermPlugin:
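whoosh.qparser.FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that do not have to match exactly. A fuzzy term matches any similar term within a certain number of "edits" (character insertions, deletions, and/or transpositions; this is called the "Damerau-Levenshtein edit distance").
To add the fuzzy plugin: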
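To make the notion of edit distance concrete, here is a minimal pure-Python sketch of the "optimal string alignment" variant of Damerau-Levenshtein distance. It is only an illustration: the function name is hypothetical, it also counts substitutions (as the usual definition does), and it is not how Whoosh computes matches internally.

```python
def damerau_levenshtein(a, b):
    """Edit distance counting insertions, deletions, substitutions
    and transpositions of adjacent characters (OSA variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(damerau_levenshtein("mail", "mial"))       # 1 (one transposition)
print(damerau_levenshtein("letter", "letters"))  # 1 (one insertion)
print(damerau_levenshtein("stamp", "stomps"))    # 2
```

Under this measure, a fuzzy term with a maximum distance of 1 would accept "mial" for "mail", while "stomps" for "stamp" would need a maximum distance of 2.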
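    from whoosh import qparser

    # my_index is an already opened Whoosh index
    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())

Once the fuzzy plugin is added to the parser, you can specify a fuzzy term by appending a ~ followed by an optional maximum edit distance. If you do not specify an edit distance, the default is 1.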
For example, the following are "fuzzy" term queries:

    letter~
    letter~2
    letter~2/3
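As a rough illustration of how these three forms could be read, the sketch below splits a token into its text, maximum edit distance, and prefix length. It rests on assumptions: my understanding of the Whoosh syntax is that the number after `/` is the count of leading characters that must match exactly, and that it defaults to 0 when omitted. The regular expression and the helper `parse_fuzzy_term` are hypothetical and are not part of Whoosh.

```python
import re

# Hypothetical pattern: text, "~", optional one-digit distance, optional "/prefix"
FUZZY_RE = re.compile(
    r"^(?P<text>[^~\s]+)~(?P<maxdist>[0-9])?(?:/(?P<prefix>[1-9][0-9]*))?$"
)


def parse_fuzzy_term(token):
    """Return (text, maxdist, prefixlength), or None if token is not fuzzy."""
    m = FUZZY_RE.match(token)
    if m is None:
        return None
    # A missing distance falls back to 1, as stated above
    maxdist = int(m.group("maxdist")) if m.group("maxdist") else 1
    # Assumed default: no required exact prefix
    prefix = int(m.group("prefix")) if m.group("prefix") else 0
    return m.group("text"), maxdist, prefix


for token in ("letter~", "letter~2", "letter~2/3"):
    print(token, "->", parse_fuzzy_term(token))
```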
{2} To keep the words in order, use the query whoosh.query.Phrase, but you should replace the Phrase plugin with whoosh.qparser.SequencePlugin, which lets you use fuzzy terms inside a phrase:
"letter~ stamp~ mail~"
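To replace the default phrase plugin with the sequence plugin:

    from whoosh import qparser

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.remove_plugin_class(qparser.PhrasePlugin)
    parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words in between, initialize the slop argument of your phrase query to a larger number: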
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
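slop: the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.
You can also define the slop in the query itself, like this: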
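The effect of slop can be pictured with a simplified model: the phrase words must occur in order, and each matched word may be at most `slop` positions after the previous one. The sketch below implements that model in plain Python with exact word matching. The helper `phrase_matches` is hypothetical, and Whoosh's own matcher may differ in detail.

```python
def phrase_matches(tokens, words, slop=1):
    """True if `words` occur in `tokens` in order, with consecutive
    matches at most `slop` positions apart (simplified model)."""
    if not words:
        return False
    # Positions at which each phrase word occurs in the document
    positions = [[i for i, t in enumerate(tokens) if t == w] for w in words]

    def extend(word_idx, prev_pos):
        if word_idx == len(words):
            return True
        for pos in positions[word_idx]:
            # The next word must come later, within `slop` positions
            if 0 < pos - prev_pos <= slop and extend(word_idx + 1, pos):
                return True
        return False

    return any(extend(1, start) for start in positions[0])


doc = "letter first stamp second mail third".split()
phrase = ["letter", "stamp", "mail"]
print(phrase_matches(doc, phrase, slop=1))   # False: words are not adjacent
print(phrase_matches(doc, phrase, slop=10))  # True: gaps are allowed
```

With slop=1 the words have to be adjacent, which is why the phrase only matches this document once slop is raised.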
"letter~ stamp~ mail~"~10
{4} Overall solution:
{4.a} The indexer would look like this:

    import os

    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT

    schema = Schema(title=TEXT(stored=True), content=TEXT)

    # create_in() expects the index directory to exist already
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(title=u"First document", content=u"This is the first document we've added!")
    writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
    writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
    writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
    writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
    writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
    writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
    writer.commit()

{4.b} The searcher would look like this:

    from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

    with ix.searcher() as searcher:
        parser = QueryParser(u"content", ix.schema)
        # Enable the word~n syntax
        parser.add_plugin(FuzzyTermPlugin())
        # Swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
        parser.remove_plugin_class(PhrasePlugin)
        parser.add_plugin(SequencePlugin())
        query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
        results = searcher.search(query)
        print("nb of results = %d" % len(results))
        for r in results:
            print(r)

This gives the result:

    nb of results = 2
    <Hit {'title': u'Sixth document'}>
    <Hit {'title': u'Third document'}>

{5} If you want fuzzy search to be the default, without using the word~n syntax on every word of the query, you can initialize QueryParser like this:
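    from whoosh.qparser import QueryParser
    from whoosh.query import FuzzyTerm

    parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist = 1. If you want a larger edit distance, customize the class: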
    class MyFuzzyTerm(FuzzyTerm):
        def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
            # Same signature as FuzzyTerm, but with maxdist defaulting to 2
            super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
            # In Python 3, super().__init__(...) with the same arguments should also work
References: