美化汤在版权标志上失败-BeautifulSoupPrettifyfailsoncopyrightsymbol

作者：朱美萍客户 | 来源：互联网 | 2023-10-16 13:39

IamgettingaUnicodeerror:UnicodeEncodeError:charmapcodeccantencodecharacteru\xa9in

I am getting a Unicode error: UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 822: character maps to

我得到一个Unicode错误:UnicodeEncodeError:“charmap”编解码器不能在位置822对字符u'\xa9'进行编码:字符映射到

This appears to be a standard copyright symbol, and in the HTML is ©. I have not been able to find a way past this. I even tried a custom function to replace copy with a space but that also failed with the same error.

这似乎是一个标准的版权符号，在HTML是©。我一直没能找到一个办法过去。我甚至尝试了一个自定义函数来替换一个空格，但同样的错误也让我失败了。

import sys
import pprint
import mechanize
import COOKIElib
from bs4 import BeautifulSoup
import html2text
import lxml

def MakePretty():

def ChangeCopy(S):
    return S.replace(chr(169)," ")
br = mechanize.Browser()

# COOKIE Jar
cj = COOKIElib.LWPCOOKIEJar()
br.set_COOKIEjar(cj)

# Browser options
br.set_handle_equiv(True)
#br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# The site we will navigate into, handling its session
# Open the site
br.open('http://www.thesitewizard.com/faqs/copyright-symbol.shtml')
html = br.response().read()
soup = BeautifulSoup(html)
print soup.prettify()

if __name__ == '__main__':
    MakePretty()

How do I get prettify past the copyright symbol? I have searched all over the web for a solution to no avail (or I might not understand as I am fairly new to Python and scraping).

我怎样才能在经过版权标志后变得更漂亮呢?我已经在web上搜索了所有的解决方案，但是没有任何用处(或者我可能不理解，因为我对Python和抓取非常陌生)。

Thanks for your help.

谢谢你的帮助。

4 个解决方案

#1

I had the same problem. This may work for you:

我也有同样的问题。这可能对你有用:

print soup.prettify().encode('UTF-8')

打印soup.prettify().encode(utf - 8)

#2

The page http://www.thesitewizard.com/faqs/copyright-symbol.shtml is sent without specifying character encoding. The page itself specifies the encoding as ISO-8859-1 in a meta tag, but only after the occurrence of the “©” character. So clients have to make a guess, and the guess may be wrong. If the client guesses UTF-8, then it will see the bit A9, which is a data error in UTF-8 data.

页面http://www.thesitewizard.com/faqs/copyright-symbol.shtml在发送时没有指定字符编码。页面本身将编码指定为iso - 8859 - 1在meta标签,但只有发生后的“©”性格。因此，客户必须进行猜测，而猜测可能是错误的。如果客户机猜测UTF-8，那么它将看到位A9，这是UTF-8数据中的一个数据错误。

So it seems that you need to set the encoding (to ISO-8859-1, or more safely to windows-1252) when reading the data. This is of course an ad hoc solution only; it makes no sense to fix the encoding in general.

因此，在读取数据时，似乎需要将编码设置为ISO-8859-1，或者更安全地设置为windows-1252。这当然是一个特别的解决方案;一般来说，修正编码是没有意义的。

#3

You're using chr(), which is wrong here because it expects ASCII and that goes only as far as 127/0x7F only (despite to popular folklore, ASCII is 7-bit only). 0xA9 / © is Unicode, so you should use unichr(169) instead.

#4

Just changing to unichr in the formatter function didn't work. Ended up using decode(formatter=blah) which did return unformatted html without the copyright symbol. Saved that html and fed that into prettify which did the trick.

仅仅在格式化程序函数中更改为unichr是不行的。最终使用decode(formatter=blah)返回没有版权符号的未格式化的html。保存了html，并将其添加到美化中，这就成功了。

推荐阅读

io
GetWindowLong函数

今天在看一个代码里头写了GetWindowLong(hwnd,0)，我当时就有点费解，靠，上网搜索函数原型说明，死活找不到第 ... [详细]

蜡笔小新 2023-12-14 17:58:15
web
南邮ctf-web的writeup

本文介绍了南邮ctf-web的writeup，包括签到题和md5 collision。在CTF比赛和渗透测试中，可以通过查看源代码、代码注释、页面隐藏元素、超链接和HTTP响应头部来寻找flag或提示信息。利用PHP弱类型，可以发现md5('QNKCDZO')='0e830400451993494058024219903391'和md5('240610708')='0e462097431906509019562988736854'。 ... [详细]

蜡笔小新 2023-12-13 10:58:55
web
sqoop自定义分隔符的实现方法及步骤详解

本文介绍了在sqoop1.4.*版本中，如何实现自定义分隔符的方法及步骤。通过修改sqoop生成的java文件，并重新编译，可以满足实际开发中对分隔符的需求。具体步骤包括修改java文件中的一行代码，重新编译所需的hadoop包等。详细步骤和编译方法在本文中都有详细说明。 ... [详细]

蜡笔小新 2023-12-10 11:29:22
web
获取当前模块所在路径的GetModuleFileName函数用法详解

本文详细介绍了GetModuleFileName函数的用法，该函数可以用于获取当前模块所在的路径，方便进行文件操作和读取配置信息。文章通过示例代码和详细的解释，帮助读者理解和使用该函数。同时，还提供了相关的API函数声明和说明。 ... [详细]

蜡笔小新 2023-12-14 19:29:57
io
android listview OnItemClickListener失效原因

最近在做listview时发现OnItemClickListener失效的问题，经过查找发现是因为button的原因。不仅listitem中存在button会影响OnItemClickListener事件的失效，还会导致单击后listview每个item的背景改变，使得item中的所有有关焦点的事件都失效。本文给出了一个范例来说明这种情况，并提供了解决方法。 ... [详细]

蜡笔小新 2023-12-14 14:25:50
io
eclipse学习（第三章：ssh中的Hibernate）——11.Hibernate的缓存（2级缓存，get和load）

本文介绍了eclipse学习中的第三章内容，主要讲解了ssh中的Hibernate的缓存，包括2级缓存和get方法、load方法的区别。文章还涉及了项目实践和相关知识点的讲解。 ... [详细]

蜡笔小新 2023-12-14 00:31:35
regex
Python正则表达式学习记录及常用方法

本文记录了学习Python正则表达式的过程，介绍了re模块的常用方法re.search，并解释了rawstring的作用。正则表达式是一种方便检查字符串匹配模式的工具，通过本文的学习可以掌握Python中使用正则表达式的基本方法。 ... [详细]

蜡笔小新 2023-12-13 16:37:19
io
Golang如何使用Cookie跟踪位置

关键词：Golang, Cookie, 跟踪位置, net/http/cookiejar, package main, golang.org/x/net/publicsuffix, io/ioutil, log, net/http, net/http/cookiejar ... [详细]

蜡笔小新 2023-12-13 15:47:22
io
PE总结9PE文件结构之解析导出表

本文介绍了PE文件结构中的导出表的解析方法，包括获取区段头表、遍历查找所在的区段等步骤。通过该方法可以准确地解析PE文件中的导出表信息。 ... [详细]

蜡笔小新 2023-12-13 11:47:24
bash
成功安装Sabayon Linux在thinkpad X60上的经验分享

本文分享了作者在国庆期间在thinkpad X60上成功安装Sabayon Linux的经验。通过修改CHOST和执行emerge命令，作者顺利完成了安装过程。Sabayon Linux是一个基于Gentoo Linux的发行版，可以将电脑快速转变为一个功能强大的系统。除了作为一个live DVD使用外，Sabayon Linux还可以被安装在硬盘上，方便用户使用。 ... [详细]

蜡笔小新 2023-12-13 11:35:40
io
[大整数乘法] java代码实现

本文介绍了使用java代码实现大整数乘法的过程，同时也涉及到大整数加法和大整数减法的计算方法。通过分治算法来提高计算效率，并对算法的时间复杂度进行了研究。详细代码实现请参考文章链接。 ... [详细]

蜡笔小新 2023-12-13 11:21:32
io
php 主动断掉http,怎么在PHP项目中实现一个HTTP断点续传功能

怎么在PHP项目中实现一个HTTP断点续传功能发布时间：2021-01-1916:26:06来源：亿速云阅读：96作者：Le ... [详细]

蜡笔小新 2023-12-12 17:17:29
web
MooTools和JQuery并排 - MooTools and JQuery Side by Side

IjustinheritedsomewebpageswhichusesMooTools.IneverusedMooTools.NowIneedtoaddsomef ... [详细]

蜡笔小新 2023-12-12 13:43:58
io
iOS实现UITextField+Limit的字符限制方法

本文介绍了在iOS开发中使用UITextField实现字符限制的方法，包括利用代理方法和使用BNTextField-Limit库的实现策略。通过这些方法，开发者可以方便地限制UITextField的字符个数和输入规则。 ... [详细]

蜡笔小新 2023-12-12 09:50:30
io
Oracle存储过程写法小例子及已命名的异常

本文介绍了Oracle存储过程的基本语法和写法示例，同时还介绍了已命名的系统异常的产生原因。 ... [详细]

蜡笔小新 2023-12-11 15:10:15

朱美萍客户

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章