小编典典

Elasticsearch: counting occurrences of a term in each document

elasticsearch

I am fairly new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:

Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.

I have been able to run a simple query (using Python) that returns all documents whose content contains the word "cars":

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}}, 
    "from": 0, "size": 100})

The result looks like this:

{'took': 2521, 
'timed_out': False, 
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index': 
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571, 
'_source': {'content': '....'}}]}}

The "_id" refers to a domain, so what I basically get back is:

But now I would like to know how often the search term ("cars") occurs in each document, for example:

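I found several solutions for getting the number of documents that contain the search term, but none that explain how to get the number of occurrences of the term within a document. I also could not find anything in the official documentation, although I am fairly sure it is there and I simply did not recognize it as the solution to my problem.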

Update:

As suggested by @Curious_MInd, I tried a terms aggregation:

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content" 
}}}})

Result:

{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful': 
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0, 
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252', 
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations': 
{'skala_count': {'doc_count_error_upper_bound': 0, 
'sum_other_doc_count': 0, 'buckets': []}}}

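I cannot see where this would show the count for each document, but I assume that is because "buckets" is empty? Another thing to note: the results found through the terms aggregation are noticeably worse than those from the multi_match query. Is there any way to combine the two?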


2020-06-22

1 answer

小编典典

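What you are trying to achieve cannot be done in a single query. The first query filters the documents and returns the IDs of those whose terms need to be counted. Suppose you have the following mapping: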

{
  "test": {
    "mappings": {
      "_doc": {
        "properties": {
          "details": {
            "type": "text",
            "store": true,
            "term_vector": "with_positions_offsets_payloads"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
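
For reference, a mapping like this could be set up from Python when the index is created. This is only a sketch: it assumes `es` is an elasticsearch-py 6.x client (such as `current_app.elasticsearch` in the question), and the helper names `build_index_body` and `create_index` are made up for illustration.

```python
def build_index_body():
    """Index-creation body matching the mapping above (6.x style, `_doc` type)."""
    return {
        "mappings": {
            "_doc": {
                "properties": {
                    "details": {
                        "type": "text",
                        "store": True,
                        # Keep term vectors in the index for this field.
                        "term_vector": "with_positions_offsets_payloads",
                    },
                    "name": {"type": "keyword"},
                }
            }
        }
    }


def create_index(es, index="test"):
    """Create the index; `es` is assumed to be an elasticsearch-py 6.x client."""
    return es.indices.create(index=index, body=build_index_body())
```

Note that the body sent on creation starts at "mappings", whereas the output above (as returned when reading the mapping back) is wrapped in the index name.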

Suppose your query returns the following two documents:

{
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "details": "There is some content about cars here. Lots of cars!",
          "name": "n1"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "details": "This page is all about cars",
          "name": "n2"
        }
      }
    ]
  }
}

From the response above you can collect the IDs of all documents that matched the query. Here we have "_id": "1" and "_id": "2".
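
In Python, collecting the IDs is a short comprehension over the search response. A minimal sketch, assuming the response is the dict returned by the client's search call; the helper name `matching_ids` is hypothetical:

```python
def matching_ids(search_response):
    """Return the _id of every hit in an Elasticsearch search response."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


# Trimmed-down copy of the search response shown above.
response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}

ids = matching_ids(response)
print(ids)  # ['1', '2']
```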

Now we use the _mtermvectors API to get the frequency (count) of each term in the given field:

POST test/_doc/_mtermvectors
{
  "docs": [
    {
      "_id": "1",
      "fields": [
        "details"
      ]
    },
    {
      "_id": "2",
      "fields": [
        "details"
      ]
    }
  ]
}
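
From Python, the same request body can be built from the list of IDs and sent through the client's `mtermvectors` method. This is a sketch under assumptions: `es` stands for an elasticsearch-py 6.x client (where `doc_type` is still accepted), and the helper names are hypothetical.

```python
def build_mtermvectors_body(doc_ids, field):
    """Build an _mtermvectors request body asking for one field per document."""
    return {"docs": [{"_id": doc_id, "fields": [field]} for doc_id in doc_ids]}


def fetch_term_vectors(es, index, doc_type, doc_ids, field):
    """Send the request; `es` is assumed to be an elasticsearch-py 6.x client."""
    body = build_mtermvectors_body(doc_ids, field)
    return es.mtermvectors(index=index, doc_type=doc_type, body=body)


body = build_mtermvectors_body(["1", "2"], "details")
print(body)
```

With a live client, the call would look like `fetch_term_vectors(es, "test", "_doc", ids, "details")`.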

The request above returns the following result:

{
  "docs": [
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 8,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 2,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 28,
                  "end_offset": 32
                },
                {
                  "position": 9,
                  "start_offset": 47,
                  "end_offset": 51
                }
              ]
            },
            ....
          }
        }
      }
    },
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 2,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 23,
                  "end_offset": 27
                }
              ]
            },
            ....
          }
        }
      }
    }
  ]
}

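Note that I used .... to stand for the data of the other terms in the field, because the term vectors API returns term-level details for every term. You can extract the information for the term you need from the response above, which here is cars; the field you are interested in is term_freq.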
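
Pulling term_freq out of that response could look like the sketch below. The function name `term_frequencies` is hypothetical, and the sample dict is the response above reduced to the keys the function reads. The term passed in has to be the analyzed token as it appears under "terms" (for example lowercased by the standard analyzer).

```python
def term_frequencies(mtermvectors_response, field, term):
    """Map each document _id to how often `term` occurs in `field`."""
    counts = {}
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        # A document that lacks the term has no entry for it, so count 0.
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# The response above, reduced to the parts this function reads.
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

print(term_frequencies(response, "details", "cars"))  # {'1': 2, '2': 1}
```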

2020-06-22