I am trying to split a multi-line string into its elements:
lines = '''c0 c1 c2 c3 c4 c5
0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]] [[a , bg], [5.5, ddd, edd]] 100.5
1 20 200.5 [2.5, 2] [[20, 20.4], [d, 20, eee]] [[a , bg], [7.5, udd, edd]] 200.5'''
My goal is to get a list lst such that:
# first value is index
lst[0] = ['c0', 'c1', 'c2', 'c3', 'c4','c5']
lst[1] = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a' , 'bg'], [5.5, 'ddd', 'edd']], 100.5 ]
lst[2] = [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a' , 'bg'], [7.5, 'udd', 'edd']], 200.5 ]
My attempt so far is this:
import re

lines = '''c0 c1 c2 c3 c4 c5
0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]] [[a , bg], [5.5, ddd, edd]] 100.5
1 20 200.5 [2.5, 2] [[20, 20.4], [d, 20, eee]] [[a , bg], [7.5, udd, edd]] 200.5'''

# get n elements for n lines and remove empty lines
lines = lines.split('\n')
lines = list(filter(None,lines))

lst = []
lst.append(lines[0].split())
for i in range(1,len(lines)):
    change = re.sub('([a-zA-Z]+)', r"'\1'", lines[i])
    lst.append(change)

for i in lst[1]:
    print(i)
How do I fix the regular expression?
Update
Test datasets:
data = """
orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]
"""
# Gives: ValueError: malformed node or string:

data = """
Name Result Value
0 Name1 5 2
1 Name1 5 3
2 Name2 11 1
"""
# gives same error

data = """
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
"""
# gives same error

data = '''
c0 c1
0 10 100.5
1 20 200.5
'''
# works perfect
Tomalak.. 8
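As pointed out in the comments, this is not a task for regular expressions. Regex is fundamentally unable to handle nested structures. What you need is a parser.
One way of creating a parser is PEG, which lets you define a list of tokens and their relationships to each other in a declarative language. This parser definition is then turned into an actual parser that can handle the described input. When parsing succeeds, you get back a tree structure in which all items are properly nested.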
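To make the point about nesting concrete, here is a minimal hand-written recursive-descent sketch in plain Python. It is not PEG and not the grammar used in this answer; the names TOKEN, to_value, parse_item and parse_line are hypothetical, and it assumes well-formed input (no error handling for unbalanced brackets). The recursion in parse_item is what a single regular expression cannot express.

```python
import re

# Tokens: a bracket, a comma, or a run of characters that is none of those.
TOKEN = re.compile(r"[\[\],]|[^\s\[\],]+")

def to_value(token):
    """Convert a bare token to int or float when possible, else keep the string."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token

def parse_item(tokens, pos):
    """Parse one item starting at tokens[pos]; return (value, next_pos)."""
    if tokens[pos] != "[":
        return to_value(tokens[pos]), pos + 1
    pos += 1  # skip the opening "["
    items = []
    while tokens[pos] != "]":
        if tokens[pos] == ",":
            pos += 1
            continue
        value, pos = parse_item(tokens, pos)  # recursion handles nested groups
        items.append(value)
    return items, pos + 1  # skip the closing "]"

def parse_line(line):
    """Parse one whitespace-separated line into a list of (possibly nested) values."""
    tokens = TOKEN.findall(line)
    pos, values = 0, []
    while pos < len(tokens):
        value, pos = parse_item(tokens, pos)
        values.append(value)
    return values

print(parse_line("0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]]"))
```

A grammar-based parser gives you the same kind of result, but with the structure declared up front and with proper error reporting for malformed input.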
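For demonstration purposes I have used the JavaScript implementation peg.js, which has an online demo page where you can live-test a parser against some input. This parser definition: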
{
  // [value, [[delimiter, value], ...]] => [value, value, ...]
  const list = values => [values[0]].concat(values[1].map(i => i[1]));
}

document
  = line*

line "line"
  = value:(item (whitespace item)*) whitespace? eol { return list(value) }

item "item"
  = number / string / group

group "group"
  = "[" value:(item (comma item)*) whitespace? "]" { return list(value) }

comma "comma"
  = whitespace? "," whitespace?

number "number"
  = value:$[0-9.]+ { return +value }

string "string"
  = $([^ 0-9\[\]\r\n,] [^ \[\]\r\n,]*)

whitespace "whitespace"
  = $" "+

eol "eol"
  = [\r]? [\n] / eof

eof "eof"
  = !.
can understand this kind of input:
c0 c1 c2 c3 c4 c5
0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]] [[a , bg], [5.5, ddd, edd]]
1 20 200.5 [2.5, 2] [[20, 20.4], [d, 20, eee]] [[a , bg], [7.5, udd, edd1]]
and produces this object tree (in JSON notation):
[
    ["c0", "c1", "c2", "c3", "c4", "c5"],
    [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]],
    [1, 20, 200.5, [2.5, 2], [[20, 20.4], ["d", 20, "eee"]], [["a", "bg"], [7.5, "udd", "edd1"]]]
]
That is:
an array of lines,
each of which is an array of values,
each of which can be a number, a string, or another array of values.
Your program can then work with this tree structure.
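As a small illustration of what working with the tree can look like on the Python side, the sketch below walks one parsed row recursively and collects its leaf values. The helper name flatten is hypothetical, and the row is copied from the object tree shown above.

```python
def flatten(value):
    """Yield every leaf (number or string) of a nested list, depth-first."""
    if isinstance(value, list):
        for child in value:
            yield from flatten(child)
    else:
        yield value

# One data row from the object tree above (first value is the index).
row = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]],
       [["a", "bg"], [5.5, "ddd", "edd"]]]

leaves = list(flatten(row))
strings = [leaf for leaf in leaves if isinstance(leaf, str)]

print(leaves)
print(strings)
```

Because the nesting is already correct in the tree, this kind of post-processing needs no further string handling.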
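The example above can be used with node.js to convert your input into JSON. The following minimal JS program accepts data from STDIN and writes the parsed result to STDOUT: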
// reference the parser.js file, e.g. downloaded from https://pegjs.org/online
const parser = require('./parser');

var chunks = [];

// handle STDIN events to slurp up all the input into one big string
process.stdin.on('data', buffer => chunks.push(buffer.toString()));
process.stdin.on('end', function () {
    var text = chunks.join('');
    var data = parser.parse(text);
    var json = JSON.stringify(data, null, 4);
    process.stdout.write(json);
});

// start reading from STDIN
process.stdin.resume();
Save it as text2json.js or something similar, and redirect (or pipe) some text into it:
# input redirection (this works on Windows, too)
node text2json.js < input.txt > output.json

# common alternative, but I'd recommend input redirection over this
cat input.txt | node text2json.js > output.json
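If the rest of your program is in Python, reading the resulting JSON back is straightforward with the standard json module. The following is only a sketch: load_rows is a hypothetical helper, and an in-memory sample stands in for the output.json file so that the snippet runs on its own.

```python
import io
import json

def load_rows(fp):
    """Read the parser's JSON output and split it into header and data rows."""
    header, *rows = json.load(fp)
    return header, rows

# Stand-in for open("output.json", encoding="utf8"), assumed here for illustration.
sample = io.StringIO('[["c0", "c1"], [0, 10, 100.5], [1, 20, 200.5]]')

header, rows = load_rows(sample)
print(header)
for index, *values in rows:
    # the first value of every data row is the index
    print(index, values)
```

With a real file you would pass the object returned by open("output.json", encoding="utf8") instead of the sample.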
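There are also PEG parser generators for Python, for example https://github.com/erikrose/parsimonious. The parser definition language differs between implementations, so the grammar above only works with peg.js, but the principle is exactly the same.
Edit: I have dug into Parsimonious and recreated the above solution in Python. The approach is the same and the parser grammar is the same, apart from a few minor syntax changes.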
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

grammar = Grammar(
    r"""
    document   = line*
    line       = whitespace? item (whitespace item)* whitespace? eol
    item       = group / number / boolean / string
    group      = "[" item (comma item)* whitespace? "]"
    comma      = whitespace? "," whitespace?
    number     = "NaN" / ~"[0-9.]+"
    boolean    = "True" / "False"
    string     = ~"[^ 0-9\[\]\r\n,][^ \[\]\r\n,]*"
    whitespace = ~" +"
    eol        = ~"\r?\n" / eof
    eof        = ~"$"
    """)

class DataExtractor(NodeVisitor):
    @staticmethod
    def concat_items(first_item, remaining_items):
        """ helper to concat the values of delimited items (lines or groups) """
        return first_item + list(map(lambda i: i[1][0], remaining_items))

    def generic_visit(self, node, processed_children):
        """ in general we just want to see the processed children of any node """
        return processed_children

    def visit_line(self, node, processed_children):
        """ line nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_group(self, node, processed_children):
        """ group nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_number(self, node, processed_children):
        """ number nodes return floats (nan is a special value of floats) """
        return float(node.text)

    def visit_boolean(self, node, processed_children):
        """ boolean nodes return True or False """
        return node.text == "True"

    def visit_string(self, node, processed_children):
        """ string nodes just return their own text """
        return node.text
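The DataExtractor is responsible for traversing the tree and pulling the data out of the nodes, returning lists of strings, numbers, booleans, or NaN.
The concat_items() function performs the same task as the list() function in the JavaScript code above. The other functions also have equivalents in the peg.js approach, except that peg.js integrates them directly into the parser definition, whereas Parsimonious expects them in a separate class. That makes it a bit clunkier in comparison, but not too bad.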
Usage, assuming an input file named "data.txt", also mirrors the JS code:
de = DataExtractor()

with open("data.txt", encoding="utf8") as f:
    text = f.read()

tree = grammar.parse(text)
data = de.visit(tree)
print(data)
Input:
orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]
Output:
[
    ['orig', 'shifted', 'not_equal', 'cumsum', 'lst'],
    [0.0, 10.0, nan, True, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [1.0, 10.0, 10.0, False, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [2.0, 23.0, 10.0, True, 2.0, [[10.0, 10.4], ['c', 10.0, 'eee']]]
]
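In the long run, I would expect this approach to be more maintainable and flexible than regex hackery. Adding explicit support for NaN and booleans, for example, was easy (the peg.js solution above does not have it; there they are parsed as strings).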
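One example of that flexibility: the Python output above turns every number into a float (0.0, 10.0), whereas the question asked for integers where possible. A possible tweak, sketched here with a hypothetical helper named to_number, is to try int first and fall back to float. This is an assumption about what you might want, not part of the solution above.

```python
import math

def to_number(text):
    """Return an int for integer-looking text, otherwise a float (including NaN)."""
    try:
        return int(text)
    except ValueError:
        return float(text)

for text in ("10", "10.0", "100.5", "NaN"):
    print(repr(text), "->", repr(to_number(text)))

# NaN never compares equal to itself, so test for it with math.isnan().
print(math.isnan(to_number("NaN")))
```

In the DataExtractor class, visit_number could then return to_number(node.text) instead of float(node.text).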