Improving the extraction of human names with nltk

 Posted by 杜杜狼2602891895 on 2023-02-13 18:44

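I am trying to extract human names from text.

Does anyone have a method they would recommend?

This is what I tried (code below): I use nltk to find everything marked as a person, and then build a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname.

I am getting decent results, but I wonder whether there is a better way to solve this problem.

Code: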

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    person_list = []
    # NLTK 3 exposes the chunk type via label(); NLTK 2 used the .node attribute
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # avoid grabbing lone surnames
            name = ' '.join(person)
            if name not in person_list:  # keep each name only once
                person_list.append(name)

    return person_list

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)  # parse once, then read both parts
    print(parsed.last + ', ' + parsed.first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

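Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic is not a person's name in the context of this article is the hard (maybe impossible) part.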

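3 Answers
  • You could try to resolve the names you find and check whether they exist in a database such as freebase.com. Either get the data locally and query it (it is in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographic locations, and so on (which your snippet would otherwise catch) could then be discarded based on the Freebase data.

    Answered 2023-02-13 18:46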
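
The lookup idea in the answer above can be sketched without any external service. This is only a minimal illustration: `KNOWN_NON_PERSONS` and `filter_person_names` are hypothetical names, and the small hard-coded set stands in for whatever a real knowledge-base query would return. The candidate list is written by hand to mirror the question's output.

```python
# Hypothetical stand-in for a knowledge-base lookup: a small local set of
# names known to be organizations or places (not real Freebase data).
KNOWN_NON_PERSONS = {"virgin galactic", "federal reserve", "convergex group"}


def filter_person_names(candidates, non_persons=KNOWN_NON_PERSONS):
    """Return the candidates whose normalized form is not a known non-person."""
    kept = []
    for name in candidates:
        key = " ".join(name.lower().split())  # normalize case and whitespace
        if key not in non_persons:
            kept.append(name)
    return kept


# Hand-written candidates mirroring the names found in the question.
candidates = ["Francois R. Velde", "Richard Branson", "Virgin Galactic",
              "Paul Krugman", "Larry Summers", "Nick Colas"]
print(filter_person_names(candidates))
# ['Francois R. Velde', 'Richard Branson', 'Paul Krugman', 'Larry Summers', 'Nick Colas']
```

A denylist like this only removes entities it already knows about, so it complements the chunker rather than replacing it.
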
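  • I have to agree that "make my code better" requests are not well suited to this site, but I can give you some directions you can try digging into.

    Take a look at the Stanford Named Entity Recognizer (NER). Its bindings are included in NLTK v 2.0, but you have to download some core files. A script is available that can do all of that for you.

    I wrote this script: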

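    import nltk
    from nltk.tag.stanford import StanfordNERTagger  # named NERTagger in older NLTK releases
    # arguments: path to the downloaded classifier model, path to the Stanford NER jar
    st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
    text = """YOUR TEXT GOES HERE"""
    
    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)  # list of (token, tag) pairs
        for tag in tags:
            if tag[1] == 'PERSON':
                print(tag)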
    

    and got output that is not too bad:

    ('Francois', 'PERSON') ('R.', 'PERSON') ('Velde', 'PERSON') ('Richard', 'PERSON') ('Branson', 'PERSON') ('Virgin', 'PERSON') ('Galactic', 'PERSON') ('Bitcoin', 'PERSON') ('Bitcoin', 'PERSON') ('Paul', 'PERSON') ('Krugman', 'PERSON') ('Larry', 'PERSON') ('Summers', 'PERSON') ('Bitcoin', 'PERSON') ('Nick', 'PERSON') ('Colas', 'PERSON')

    Hope this was helpful.

    Answered 2023-02-13 18:46
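
The tagger in the answer above labels individual tokens, so full names still have to be reassembled. One possible way, sketched here, is to join runs of adjacent PERSON tokens. `group_person_tokens` and its `min_tokens` parameter are hypothetical, and the sample input is written by hand in the (token, tag) shape shown above rather than taken from real tagger output. The function needs the complete tag sequence for a sentence, not only the PERSON pairs, otherwise neighbouring names would run together.

```python
from itertools import groupby


def group_person_tokens(tagged_tokens, min_tokens=1):
    """Join runs of adjacent PERSON-tagged tokens into full-name strings."""
    names = []
    # groupby splits the sequence into consecutive runs of PERSON / non-PERSON
    for is_person, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "PERSON"):
        tokens = [token for token, _ in run]
        if is_person and len(tokens) >= min_tokens:
            names.append(" ".join(tokens))
    return names


# Hand-written sample of (token, tag) pairs; not real tagger output.
tagged = [("Economist", "O"), ("Paul", "PERSON"), ("Krugman", "PERSON"),
          ("and", "O"), ("Nick", "PERSON"), ("Colas", "PERSON"),
          ("discussed", "O"), ("Bitcoin", "PERSON")]
print(group_person_tokens(tagged))
# ['Paul Krugman', 'Nick Colas', 'Bitcoin']
print(group_person_tokens(tagged, min_tokens=2))
# ['Paul Krugman', 'Nick Colas']
```

Setting `min_tokens=2` applies the same heuristic as the question (skip single-token matches), which here also drops the mislabeled "Bitcoin".
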
  • For anyone else looking, I found this article useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

    >>> import nltk
    >>> def extract_entities(text):
    ...     for sent in nltk.sent_tokenize(text):
    ...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
    ...             # only named-entity subtrees have label(); NLTK 2 used .node
    ...             if hasattr(chunk, 'label'):
    ...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
    ...
    

    Answered 2023-02-13 18:47