Text processing - How can I extract text from a PDF with a Python library?

 nuabolalalala5_760 posted on 2022-10-26 19:10

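I have used pypdf. It handles English PDF documents fairly easily, but its support for Chinese does not seem very good.

I have also tried textract. According to its documentation it supports many formats and the API is simple, but it keeps failing for me.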

    # -*- coding: utf-8 -*-

    import textract
    import pyPdf
    import pdf2text
    import pdfminer
    import chardet

    text = textract.process("F:ll.pdf",method = 'pdfminer')
    print text

The snippet above fails with an encoding error.

    # -*- coding: utf-8 -*-

    import textract
    import pyPdf
    import pdfminer
    import chardet

    text = textract.process("F:ll.pdf",method = 'pdfminer')
    print text

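For this one I can't tell what kind of error it is.

The only difference is that the second snippet drops the pdf2text import, yet the two seem to fail in different ways.

I haven't looked into using pdfminer directly yet; it looks more involved. How can I parse and extract text from a Chinese PDF? Thanks.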
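
One plausible cause of the encoding error, assuming Python 2 on a console that does not use UTF-8: textract.process returns encoded bytes rather than text, and printing UTF-8 bytes containing Chinese to such a console can fail or show garbage. A minimal sketch that sidesteps the console by decoding the result and writing it to a UTF-8 file follows. The helper name save_extracted_text and the output path are hypothetical and not part of textract.

```python
import io


def save_extracted_text(raw, path, encoding="utf-8"):
    """Decode extractor output if it is bytes, then write it to a UTF-8 file."""
    # Extractors may hand back encoded bytes; decode them to text first.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # io.open takes an explicit encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

Under these assumptions, the result of textract.process(...) would be passed as raw, and the file could then be opened in any editor that understands UTF-8.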

1 answer
  • I have used pdfminer for this before. Install it with: pip install pdfminer

    # -*- coding: utf-8 -*-
    # Python 2 code. On Python 3 (e.g. with pdfminer.six) the downloaded PDF
    # bytes need io.BytesIO; io.StringIO can serve as the output buffer.
    import requests
    from cStringIO import StringIO
    from io import open
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage

    def pdf_txt(url):
        """Download the PDF at url and return its text as UTF-8 bytes."""
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # collects the extracted text
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Wrap the downloaded bytes in a file-like object for the parser.
        fp = StringIO(requests.get(url).content)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0     # 0 means no page limit
        caching = True
        pagenos = set()  # an empty set means all pages
        for page in PDFPage.get_pages(fp,
                                      pagenos,
                                      maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
        retstr.close()
        return text

    txt = pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
    print(txt)
    # If the PDF contains Chinese, write the output to a file instead:
    # open('pdf.txt', 'wb').write(txt)
    
    python readpdf.py
    '''
    CHAPTER I
    "Well, Prince, so Genoa and Lucca are now just family estates of
    theBuonapartes. But I warn you, if you don't tell me that this
    means war,if you still try to defend the infamies and horrors
    perpetrated bythat Antichrist- I really believe he is Antichrist- I will
    havenothing more to do with you and you are no longer my friend,
    no longermy 'faithful slave,' as you call yourself! But how do you
    do? I seeI have frightened you- sit down and tell me all the news."
    It was in July, 1805, and the speaker was the well-known
    AnnaPavlovna Scherer, maid of honor and favorite of the
    Empress MaryaFedorovna. With these words she greeted Prince
    Vasili Kuragin, a manof high rank and importance, who was the
    first to arrive at herreception. Anna Pavlovna had had a cough for
    some days. She was, asshe said, suffering from la grippe; grippe
    being then a new word inSt. Petersburg, used only by the elite.
    All her invitations without exception, written in French,
    anddelivered by a scarlet-liveried footman that morning, ran as
    '''   
    Answered 2022-10-27 01:12
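
For Python 3, the same extraction can probably be written more compactly with pdfminer.six (pip install pdfminer.six), the maintained fork of pdfminer, whose high-level extract_text function accepts a file path or a binary file-like object. This is a minimal sketch, not the answer's original code: looks_like_pdf and pdf_bytes_to_text are hypothetical helper names, and the %PDF- check is an added guard against a download that returned something other than a PDF.

```python
import io


def looks_like_pdf(data):
    """Return True if the %PDF- marker appears near the start of the bytes."""
    return b"%PDF-" in data[:1024]


def pdf_bytes_to_text(data):
    """Extract text from in-memory PDF bytes using pdfminer.six."""
    if not looks_like_pdf(data):
        raise ValueError("input does not look like a PDF")
    # Imported here so the check above works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    # PDF data is binary, so it is wrapped in BytesIO rather than StringIO.
    return extract_text(io.BytesIO(data))
```

With the URL from the answer, this would be called as pdf_bytes_to_text(requests.get(url).content). The return value is a str, so Chinese text can be written out with open('pdf.txt', 'w', encoding='utf-8') instead of being printed.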