Author: 如痴如醉as_961 | Source: Internet | 2023-05-18 00:56
I'm new to XPath - please go easy on me.
Having trouble extracting XPath on my target pages for elements that don't have a lot of structure.
The data set is NJ school report cards. Individual report cards look like this
I've figured out how to pull out tables that have a summary attribute:
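library(XML)
# Build the report card URL from the three ID columns in row i of all_sch
url <- paste0("http://education.state.nj.us/rc/rc11/rcreport.php?c=",
              all_sch[i, 1], ";d=", all_sch[i, 2], ";s=", all_sch[i, 3])
doc <- htmlParse(url)
# Select the table identified by its summary attribute
admin_salaries <- getNodeSet(doc, '//table[@summary="Administrative Salaries and Benefits"]')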
but am having trouble where there isn't a lot of extra identifying information to work off of.
For instance, the table that has school name and district looks like this:
SCHOOL: | New Jersey Ave
COUNTY: | Atlantic
DISTRICT: | Atlantic City
My strategy here was 'find nodes that are tables and have the text COUNTY'.
Reading as much as I can about XPath, I'm trying this:
names = getNodeSet(doc,'//table and //*[contains(text(),"COUNTY")]')
But instead of returning the table node, it gives me a boolean TRUE value.
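The boolean result comes from how XPath reads that expression: a top-level `and` between two paths asks "do both node sets exist?", which is a yes/no answer. To filter tables, the condition has to sit inside a predicate (square brackets) on `//table`. Below is a minimal sketch of that difference. The inline HTML is a hypothetical stand-in for one report card page, and it assumes the labels live in `td` cells; the real pages may differ.

```r
library(XML)

# Hypothetical sample standing in for one report card page (an assumption,
# not the real page markup).
html <- '<html><body>
<table>
  <tr><td>SCHOOL:</td><td>New Jersey Ave</td></tr>
  <tr><td>COUNTY:</td><td>Atlantic</td></tr>
  <tr><td>DISTRICT:</td><td>Atlantic City</td></tr>
</table>
<table summary="Administrative Salaries and Benefits">
  <tr><td>Salary</td></tr>
</table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Top-level "and": a boolean expression, so no nodes come back.
both_exist <- getNodeSet(doc, '//table and //*[contains(text(),"COUNTY")]')

# Predicate form: keep only tables that contain a cell mentioning COUNTY
# and a cell mentioning SCHOOL.
header_tables <- getNodeSet(
  doc,
  '//table[.//td[contains(., "COUNTY")] and .//td[contains(., "SCHOOL")]]'
)
length(header_tables)
```

One caveat with this sketch: if a page wraps everything in an outer layout table, that outer table also contains both labels and would match as well, so the predicate may need tightening on real pages.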
So, the question is: How can I use XPath to find tables that have the text COUNTY and SCHOOL?
I've tried a lot of other strategies to little avail. One approach suggested by others was simply to pull out every table data cell using something like this:
xpathApply(htmlTreeParse(url, useInternalNodes = TRUE), "//td", xmlValue)
But the templates aren't consistent for missing data - incomplete reports have pretty different structure, and elements aren't in the same position across the 2,000+ pages.
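Since position can't be relied on, one option is to look values up by their label instead of by index. The sketch below assumes each label (such as `COUNTY:`) sits in a `td` whose next sibling `td` holds the value; that layout is a guess based on the table shown above. `get_field` is a hypothetical helper, not part of the XML package, and the inline HTML is made-up sample data.

```r
library(XML)

# Hypothetical helper: return the text of the cell right after a label cell,
# or NA when the label is missing from the page.
get_field <- function(doc, label) {
  xp <- sprintf(
    '//td[contains(normalize-space(.), "%s")]/following-sibling::td[1]',
    label
  )
  vals <- xpathSApply(doc, xp, xmlValue)
  if (length(vals) == 0) NA_character_ else trimws(vals[[1]])
}

# Made-up sample page (an assumption about the markup).
html <- '<html><body><table>
  <tr><td>SCHOOL:</td><td>New Jersey Ave</td></tr>
  <tr><td>COUNTY:</td><td>Atlantic</td></tr>
  <tr><td>DISTRICT:</td><td>Atlantic City</td></tr>
</table></body></html>'
doc <- htmlParse(html, asText = TRUE)

county    <- get_field(doc, "COUNTY")
principal <- get_field(doc, "PRINCIPAL")  # label absent in this sample
```

Returning `NA` for a missing label means incomplete reports produce a gap in the result instead of shifting every later value by one position.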
Any help is greatly appreciated!
1 solution