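I have a pandas DataFrame in which one column holds strings with travel details. My goal is to parse each string and extract the origin city and the destination city (ultimately I would like two new columns named "Origin" and "Destination").
The data: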
    df_col = [
        'new york to venice, italy for usd271',
        'return flights from brussels to bangkok with etihad from €407',
        'from los angeles to guadalajara, mexico for usd191',
        'fly to australia new zealand from paris from €422 return including 2 checked bags'
    ]
This should result in:

    Origin: New York, USA; Destination: Venice, Italy
    Origin: Brussels, BEL; Destination: Bangkok, Thailand
    Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
    Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

So far I have tried various NLTK methods, but what got me closest was using nltk.pos_tag to tag each word in the string. The result is a list of tuples, each holding a word and its associated tag. Here is an example:
[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]
I am stuck at this stage and unsure how best to implement this. Can anyone point me in the right direction? Thanks.
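At first glance this is pretty much impossible, unless you have access to some API that contains fairly sophisticated components.
At first look, it seems like you are asking for a natural language problem to be solved by magic. But let's break it down and scope it to a point where something is buildable.
First, to identify countries and cities you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json
At the top of the search results we find https://datahub.io/core/world-cities, which leads to the world-cities.json file. Now we load it into a set of countries and a set of cities.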
    import requests
    import json

    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

    countries = set([city['country'] for city in cities_json])
    cities = set([city['name'] for city in cities_json])
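Let's put them together with flashtext, a keyword extraction library, and look for any city or country name in the first text.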
    import requests
    import json
    from flashtext import KeywordProcessor

    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

    countries = set([city['country'] for city in cities_json])
    cities = set([city['name'] for city in cities_json])

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))

    texts = ['new york to venice, italy for usd271',
             'return flights from brussels to bangkok with etihad from €407',
             'from los angeles to guadalajara, mexico for usd191',
             'fly to australia new zealand from paris from €422 return including 2 checked bags']

    keyword_processor.extract_keywords(texts[0])
[out]:

    ['York', 'Venice', 'Italy']
Doing some due diligence, the first hunch is that "New York" is not in the data:
    >>> "New York" in cities
    False
What the #$%^&*?! For sanity's sake, we check these:
    >>> len(countries)
    244
    >>> len(cities)
    21940
Yes, you cannot just trust a single data source, so let's try to fetch more of them.
From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json you can find another link, https://github.com/dr5hn/countries-states-cities-database. Let's munge this one too:
    import requests
    import json

    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))

    countries1 = set([city['country'] for city in cities1_json])
    cities1 = set([city['name'] for city in cities1_json])

    dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
    dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"

    cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
    countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))

    countries2 = set([c['name'] for c in countries2_json])
    cities2 = set([c['name'] for c in cities2_json])

    # Merge both sources.
    countries = countries2.union(countries1)
    cities = cities2.union(cities1)
    >>> len(countries)
    282
    >>> len(cities)
    127793
Wow, that is a lot more cities than before.
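Before trusting the bigger number, it can help to look at what the second source actually contributes. Here is a minimal sketch using tiny hand-made stand-ins for the two downloaded sets; the names `source_a` and `source_b` and their contents are invented for illustration and are not taken from either dataset.

```python
# Hypothetical miniature versions of the two city sets (not real data).
source_a = {"Venice", "Bangkok", "York", "Paris"}
source_b = {"Venice", "Bangkok", "York", "New York City", "Los Angeles"}

merged = source_a | source_b       # same as source_a.union(source_b)
only_in_b = source_b - source_a    # names that only the second source adds
only_in_a = source_a - source_b    # names that only the first source has

print(len(merged))
print(sorted(only_in_b))
print(sorted(only_in_a))

# Set membership is an exact match: having a longer name in the set
# does not make the shorter form a member.
print("New York" in merged)
```

The last line is the point to keep in mind: a bigger set still only answers exact-name lookups.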
Let's try the flashtext code again.
    from flashtext import KeywordProcessor

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))

    texts = ['new york to venice, italy for usd271',
             'return flights from brussels to bangkok with etihad from €407',
             'from los angeles to guadalajara, mexico for usd191',
             'fly to australia new zealand from paris from €422 return including 2 checked bags']

    keyword_processor.extract_keywords(texts[0])
OK, for one more sanity check, let's just look for "york" in the list of cities.
    >>> [c for c in cities if 'york' in c.lower()]
    ['Yorklyn', 'West York', 'West New York', 'Yorktown Heights', 'East Riding of Yorkshire', 'Yorke Peninsula', 'Yorke Hill', 'Yorktown', 'Jefferson Valley-Yorktown', 'New York Mills', 'City of York', 'Yorkville', 'Yorkton', 'New York County', 'East York', 'East New York', 'York Castle', 'York County', 'Yorketown', 'New York City', 'York Beach', 'Yorkshire', 'North Yorkshire', 'Yorkeys Knob', 'York', 'York Town', 'York Harbor', 'North York']
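You: What kind of prank is this?! The list has "New York City" but not "New York".
Linguist: Welcome to the world of natural language processing, where natural language is a social construct shaped by communal and idiolectal variation.
You: Cut the crap, tell me how to solve this.
NLP Practitioner (a real one, who works on noisy user-generated text): You just have to add to the list. But before that, check your metric against the list you already have.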
    from itertools import zip_longest
    from flashtext import KeywordProcessor

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))

    texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
                    ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
                    ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
                    ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]

    # No. of correctly extracted terms.
    true_positives = 0
    false_positives = 0
    total_truth = 0

    for text, label in texts_labels:
        extracted = keyword_processor.extract_keywords(text)

        # We're making some assumptions here that the order of
        # extracted and the truth must be the same.
        true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
        false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
        total_truth += len(label)

        # Just visualization candies.
        print(text)
        print(extracted)
        print(label)
        print()
Actually, it doesn't look that bad. We get an accuracy of 90%:
    >>> true_positives / total_truth
    0.9
OK, OK, so looking at the "only" error the approach above makes, it is simply that "New York" is not in the list of cities.
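The loop above counts false positives but never reports them, and it depends on the extracted terms arriving in the same order as the labels. As a hedged alternative, not the metric used in this answer, here is a sketch of order-insensitive precision, recall and F1. The `pairs` data is illustrative only; the first pair mirrors the 'York' miss seen earlier.

```python
def score(extracted, expected):
    """Return (true_pos, false_pos, false_neg) for one text, ignoring order."""
    extracted_set, expected_set = set(extracted), set(expected)
    tp = len(extracted_set & expected_set)
    fp = len(extracted_set - expected_set)   # extracted but not labelled
    fn = len(expected_set - extracted_set)   # labelled but not extracted
    return tp, fp, fn


def precision_recall_f1(pairs):
    """Micro-averaged precision, recall and F1 over (extracted, expected) pairs."""
    tp = fp = fn = 0
    for extracted, expected in pairs:
        a, b, c = score(extracted, expected)
        tp, fp, fn = tp + a, fp + b, fn + c
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative (extracted, expected) pairs, not real extractor output.
pairs = [
    (["York", "Venice", "Italy"], ("New York", "Venice", "Italy")),
    (["Brussels", "Bangkok"], ("Brussels", "Bangkok")),
]
precision, recall, f1 = precision_recall_f1(pairs)
print(precision, recall, f1)
```

Reporting precision and recall separately shows whether the extractor is over-matching (extra terms such as a stray country) or under-matching (missing names), which a single accuracy figure hides.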
You: Why don't we just add "New York" to the list of cities, i.e.
    keyword_processor.add_keyword('New York')

    print(texts[0])
    print(keyword_processor.extract_keywords(texts[0]))

[out]:

    ['New York', 'Venice', 'Italy']
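To see why adding one keyword turns 'York' into 'New York', here is a stdlib-only sketch of greedy longest-match lookup over whitespace tokens. It is an illustration of the general idea, not flashtext's actual implementation, and `extract_longest` is a made-up helper name.

```python
def extract_longest(text, gazetteer):
    """Greedy longest-match lookup over whitespace tokens, case-insensitive."""
    lookup = {name.lower(): name for name in gazetteer}
    max_len = max((len(name.split()) for name in gazetteer), default=0)
    # Crude tokenization: lowercase, split on spaces, drop edge punctuation.
    tokens = [t.strip(",.!?") for t in text.lower().split()]
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in lookup:
                found.append(lookup[candidate])
                i += size
                break
        else:
            i += 1  # no name starts at this token
    return found


names = {"York", "Venice", "Italy"}
text = "new york to venice, italy for usd271"

before = extract_longest(text, names)
after = extract_longest(text, names | {"New York"})
print(before)
print(after)
```

With only 'York' in the list the matcher can do no better than the single token; once 'New York' is present, the two-token span is tried first and wins.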
You: See, I did it!!! Now I deserve a beer.
Linguist: How about 'I live in Marawi'?
    >>> keyword_processor.extract_keywords('I live in Marawi')
    []
NLP Practitioner (chiming in): How about 'I live in Jeju'?
    >>> keyword_processor.extract_keywords('I live in Jeju')
    []
A Raymond Hettinger fan (from far away): "There must be a better way!"
Yes, there is. What if we try something silly: for every city whose name ends in "City", also add the name without that suffix to our keyword_processor?
    for c in cities:
        # c[:-5] drops the trailing " City", e.g. 'New York City' -> 'New York'.
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
                print(c[:-5])
Now let's retry our regression test sentences:
    from itertools import zip_longest
    from flashtext import KeywordProcessor

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))

    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])

    # Single labels need a trailing comma: ('Florida') is just a string,
    # and zip_longest would then compare against its characters.
    texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
                    ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
                    ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
                    ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
                    ('I live in Florida', ('Florida',)),
                    ('I live in Marawi', ('Marawi',)),
                    ('I live in jeju', ('Jeju',))]

    # No. of correctly extracted terms.
    true_positives = 0
    false_positives = 0
    total_truth = 0

    for text, label in texts_labels:
        extracted = keyword_processor.extract_keywords(text)

        # We're making some assumptions here that the order of
        # extracted and the truth must be the same.
        true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
        false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
        total_truth += len(label)

        # Just visualization candies.
        print(text)
        print(extracted)
        print(label)
        print()

[out]:

    new york to venice, italy for usd271
    ['New York', 'Venice', 'Italy']
    ('New York', 'Venice', 'Italy')

    return flights from brussels to bangkok with etihad from €407
    ['Brussels', 'Bangkok']
    ('Brussels', 'Bangkok')

    from los angeles to guadalajara, mexico for usd191
    ['Los Angeles', 'Guadalajara', 'Mexico']
    ('Los Angeles', 'Guadalajara')

    fly to australia new zealand from paris from €422 return including 2 checked bags
    ['Australia', 'New Zealand', 'Paris']
    ('Australia', 'New Zealand', 'Paris')

    I live in Florida
    ['Florida']
    ('Florida',)

    I live in Marawi
    ['Marawi']
    ('Marawi',)

    I live in jeju
    ['Jeju']
    ('Jeju',)
But seriously, this is only the tip of the problem. What happens if you have a sentence like this:
    >>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
    ['Adam', 'Bangkok', 'Singapore', 'China']
Why is Adam extracted as a city?!
Then you do some more neurotic checks:
    >>> 'Adam' in cities
    True
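Congratulations, you have jumped into another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, Adam most probably refers to a person in the sentence, but it also happens to be the name of a city (according to the data you pulled).
And even if we ignore polysemy, a list of place names is still not the desired output for the original input:
[in]:

    ['new york to venice, italy for usd271',
     'return flights from brussels to bangkok with etihad from €407',
     'from los angeles to guadalajara, mexico for usd191',
     'fly to australia new zealand from paris from €422 return including 2 checked bags']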
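Linguist: Even assuming that the preposition (e.g. "from", "to") preceding a city gives you the "origin" / "destination" tag, how are you going to handle "multi-leg" flights, e.g.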
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
What is the desired output for this sentence:
> Adam flew to Bangkok from Singapore and then to China
Perhaps like this? What is the specification? How (un-)structured is your input text?
> Origin: Singapore
> Departure: Bangkok
> Departure: China
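Let's take that assumption about prepositions and try some hacks with the same flashtext approach.
What if we add "to" and "from" to the keyword list?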
    from itertools import zip_longest
    from flashtext import KeywordProcessor

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))

    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])

    # Also treat the two prepositions as keywords.
    keyword_processor.add_keyword('to')
    keyword_processor.add_keyword('from')

    texts = ['new york to venice, italy for usd271',
             'return flights from brussels to bangkok with etihad from €407',
             'from los angeles to guadalajara, mexico for usd191',
             'fly to australia new zealand from paris from €422 return including 2 checked bags']

    for text in texts:
        extracted = keyword_processor.extract_keywords(text)
        print(text)
        print(extracted)
        print()
[out]:

    new york to venice, italy for usd271
    ['New York', 'to', 'Venice', 'Italy']

    return flights from brussels to bangkok with etihad from €407
    ['from', 'Brussels', 'to', 'Bangkok', 'from']

    from los angeles to guadalajara, mexico for usd191
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

    fly to australia new zealand from paris from €422 return including 2 checked bags
    ['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']
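OK, let's work with the output above and see what we can do about the first problem: a "from" that introduces the ticket price rather than a place. Maybe check whether the term after "from" is a city or country and, if it is not, drop that "from"?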
    from itertools import zip_longest
    from flashtext import KeywordProcessor

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))

    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])

    keyword_processor.add_keyword('to')
    keyword_processor.add_keyword('from')

    texts = ['new york to venice, italy for usd271',
             'return flights from brussels to bangkok with etihad from €407',
             'from los angeles to guadalajara, mexico for usd191',
             'fly to australia new zealand from paris from €422 return including 2 checked bags']

    for text in texts:
        extracted = keyword_processor.extract_keywords(text)
        print(text)

        new_extracted = []
        extracted_next = extracted[1:]
        # Pair every term with the term that follows it (None for the last one).
        for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
            if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
                # 'from' is not followed by a place (this includes a trailing 'from').
                print(e_i, e_iplus1)
                continue
            elif e_i == 'from' and e_iplus1 is None:  # last word in the list.
                continue
            else:
                new_extracted.append(e_i)

        print(new_extracted)
        print()
That seems to do the trick: it removes every "from" that is not followed by a city or country.
[out]:

    new york to venice, italy for usd271
    ['New York', 'to', 'Venice', 'Italy']

    return flights from brussels to bangkok with etihad from €407
    from None
    ['from', 'Brussels', 'to', 'Bangkok']

    from los angeles to guadalajara, mexico for usd191
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

    fly to australia new zealand from paris from €422 return including 2 checked bags
    from None
    ['to', 'Australia', 'New Zealand', 'from', 'Paris']
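Given filtered lists like the ones above, one possible next step is to turn them into origin and destination fields. The sketch below is only one interpretation of the (still unspecified) requirements: `assign_roles` is a hypothetical helper, and it assumes that places before any preposition belong to the origin, that "from" switches to the origin, and that "to" switches to the destination.

```python
def assign_roles(extracted):
    """Split e.g. ['from', 'Brussels', 'to', 'Bangkok'] into origin and
    destination lists, using 'from'/'to' as switches."""
    origin, destination = [], []
    # Assumption: places seen before any preposition are the origin,
    # which covers inputs like 'new york to venice'.
    current = origin
    for item in extracted:
        if item == 'from':
            current = origin
        elif item == 'to':
            current = destination
        else:
            current.append(item)
    return {'origin': origin, 'destination': destination}


samples = [
    ['New York', 'to', 'Venice', 'Italy'],
    ['from', 'Brussels', 'to', 'Bangkok'],
    ['to', 'Australia', 'New Zealand', 'from', 'Paris'],
    # Multi-leg case: every 'to' place lands in the same destination list.
    ['to', 'Bangkok', 'from', 'Singapore', 'to', 'China'],
]
for sample in samples:
    print(assign_roles(sample))
```

Note what this does not solve: it cannot tell a city from its country ('Venice' and 'Italy' end up side by side), it flattens multi-leg trips into one destination list, and it inherits every false match from the keyword step. To fill the two columns from the question, a function like this could be applied to each row's extracted list, for example through pandas' Series.apply.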
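Linguist: Think carefully: should ambiguity be resolved by making an informed decision that makes the ambiguous phrase obvious? If so, what is the "information" in that informed decision? Should it first follow a certain template to detect the information before resolving the ambiguity?
You: I'm losing my patience with you... You keep leading me round in circles. Where is that AI that can understand human language, the one I keep hearing about from the news and Google and Facebook and all?!
You: Everything you gave me is rule-based. Where is the AI in all of this?
NLP Practitioner: Didn't you want 100%? Writing "business logic" or a rule-based system is the only way to really reach that "100%" on a specific data set when there is no pre-labelled data set to "train an AI" on.
You: What do you mean by training an AI? Why can't I just use Google's or Facebook's or Amazon's or Microsoft's or even IBM's AI?
NLP Practitioner: Let me introduce you to
Welcome to the world of Computational Linguistics and NLP!
Yes, there is no real ready-made magical solution, and if you want to use an "AI" or machine learning algorithm, you will most probably need a lot more training data like the texts_labels pairs shown in the example above.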
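To make "training data" a little more concrete, here is a hedged sketch of how (text, labels) pairs in the shape of texts_labels could be turned into token-level BIO tags, a format commonly used to train sequence taggers. The helper name `bio_tags` and the tag name `LOC` are assumptions for illustration, and the whitespace tokenization is deliberately crude.

```python
def bio_tags(text, labels):
    """Tag each token as B-LOC (begins a place), I-LOC (inside one) or O."""
    tokens = [t.strip(",.!?") for t in text.split()]
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for label in labels:
        parts = label.lower().split()
        n = len(parts)
        # Find the first untagged span that matches the label's tokens.
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == parts and span_free:
                tags[i] = "B-LOC"
                for j in range(i + 1, i + n):
                    tags[j] = "I-LOC"
                break
    return list(zip(tokens, tags))


example = bio_tags('new york to venice, italy for usd271',
                   ('New York', 'Venice', 'Italy'))
for token, tag in example:
    print(token, tag)
```

Four labelled sentences are nowhere near enough to train anything; the point is only the shape of the data that a learning-based approach would need in bulk.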