I have a list of words like the one below.

    wordlist = ['p1','p2','p3','p4','p5','p6','p7']

And a dataframe like this.

    df = pd.DataFrame({'id' : [1,2,3,4], 'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3"]})

Output:

    id  path
    1   p1,p2,p3,p4
    2   p1,p2,p1
    3   p1,p5,p5,p7
    4   p1,p2,p3,p3

I want to count the occurrences of each word in the path data to get the following output. Is it possible to achieve this kind of transformation?

    id  p1  p2  p3  p4  p5  p6  p7
    1   1   1   1   1   0   0   0
    2   2   1   0   0   0   0   0
    3   1   0   0   0   2   0   1
    4   1   1   2   0   0   0   0
joris.. 5
You can use the vectorized string method str.count() (see the docs and the reference), and for each element in wordlist feed the result into a new dataframe:

    In [4]: pd.DataFrame({name : df["path"].str.count(name) for name in wordlist})
    Out[4]:
        p1  p2  p3  p4  p5  p6  p7
    id
    1    1   1   1   1   0   0   0
    2    2   1   0   0   0   0   0
    3    1   0   0   0   2   0   1
    4    1   1   2   0   0   0   0
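UPDATE: some answers to the comments. Indeed, this will not work if the strings can be substrings of each other (but the OP should clarify that). If that is the case, this would work (and is also faster):

    splitted = df["path"].str.split(",")
    pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})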
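To make the substring caveat concrete, here is a minimal sketch. The data is hypothetical (a word `p11` that contains `p1`) and is not from the question; it only illustrates why counting whole tokens after splitting differs from counting pattern matches with str.count().

```python
import pandas as pd

# Hypothetical data: 'p1' is a substring of 'p11'
paths = pd.Series(["p1,p11,p11", "p11"])
words = ["p1", "p11"]

# str.count() counts pattern matches, so 'p1' also matches inside 'p11'
substring_counts = pd.DataFrame({w: paths.str.count(w) for w in words})

# Splitting first and counting list elements only matches whole tokens
tokens = paths.str.split(",")
exact_counts = pd.DataFrame({w: tokens.apply(lambda x: x.count(w)) for w in words})

print(substring_counts)
print(exact_counts)
```

Note that str.count() treats its argument as a regular expression, so words containing regex metacharacters would presumably need escaping (for example with re.escape) before being passed in.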
And some tests to back up my claim of being faster :-)
Of course, I don't know what the realistic use case would be, but I made the dataframe a bit larger (just repeated it 1000 times, and the difference becomes bigger):

    In [37]: %%timeit
       ....: splitted = df["path"].str.split(",")
       ....: pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
       ....:
    100 loops, best of 3: 17.9 ms per loop

    In [38]: %%timeit
       ....: pd.DataFrame({name:df["path"].str.count(name) for name in wordlist})
       ....:
    10 loops, best of 3: 23.6 ms per loop

    In [39]: %%timeit
       ....: c = df["path"].str.split(',').apply(Counter)
       ....: pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
       ....:
    10 loops, best of 3: 42.3 ms per loop

    In [40]: %%timeit
       ....: dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
       ....: pd.DataFrame(dfN, columns=wordlist).fillna(0)
       ....:
    1 loops, best of 3: 715 ms per loop
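I also ran the test with more elements in wordlist, and the conclusion is: if you have a larger dataframe with a relatively small number of elements in wordlist, my approach is faster; if you have a large wordlist, the approach with Counter from @RomanPekar can be faster (but only the last one).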
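For anyone who wants to rerun the comparison outside IPython, the following is a rough sketch using the standard timeit module. The function names and the repeat count are my own choices, and the absolute numbers will depend on the machine and the pandas version, so treat the output as indicative only.

```python
import timeit
import pandas as pd

wordlist = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# Repeat the small frame 1000 times, as in the timings above
big = pd.concat([df] * 1000, ignore_index=True)

def count_with_str_count(frame):
    # Pattern-match counting on the raw strings
    return pd.DataFrame({name: frame["path"].str.count(name) for name in wordlist})

def count_with_split(frame):
    # Split once, then count whole tokens in each list
    splitted = frame["path"].str.split(",")
    return pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})

# Sanity check: on this data (no word is a substring of another) both agree
same_result = bool((count_with_str_count(big).to_numpy()
                    == count_with_split(big).to_numpy()).all())

t_str_count = timeit.timeit(lambda: count_with_str_count(big), number=5)
t_split = timeit.timeit(lambda: count_with_split(big), number=5)
print("str.count: %.4f s, split + list.count: %.4f s" % (t_str_count, t_split))
```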
I think this would be efficient:

    # create Series with dictionaries
    >>> from collections import Counter
    >>> c = df["path"].str.split(',').apply(Counter)
    >>> c
    0    {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
    1    {u'p2': 1, u'p1': 2}
    2    {u'p1': 1, u'p7': 1, u'p5': 2}
    3    {u'p2': 1, u'p3': 2, u'p1': 1}

    # create DataFrame
    >>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0
Another way to do this:

    >>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
    >>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0
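As a side note, building one pd.Series per row is likely what makes this variant slow in the timings. Below is a sketch of a related variant, not the code from this answer: it collects the Counter objects in a plain list, builds the frame once, and casts the result to integers. The names `rows` and `counts` are mine.

```python
from collections import Counter
import pandas as pd

wordlist = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# One Counter (a dict subclass) per row
rows = [Counter(p.split(",")) for p in df["path"]]

# Build the frame in one go; words missing from a row become NaN,
# and words missing everywhere (here 'p6') appear after reindexing
counts = (pd.DataFrame(rows, index=df.index)
            .reindex(columns=wordlist)
            .fillna(0)
            .astype(int))
print(counts)
```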
Some rough performance tests:

    >>> from timeit import timeit
    >>> from string import lowercase, uppercase  # Python 2 names; ascii_lowercase / ascii_uppercase on Python 3
    >>> dfL = pd.concat([df]*100)
    >>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
    0.7363274283027295
    >>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
    0.5305424618886718

    # now let's make wordlist larger
    >>> wordlist = wordlist + list(lowercase) + list(uppercase)
    >>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
    1.765344003293876
    >>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
    2.33328927599905
After reading this topic I found that Counter is really slow. You can optimize it a bit by using defaultdict:

    >>> from collections import defaultdict
    >>> def create_dict(x):
    ...     d = defaultdict(int)
    ...     for c in x:
    ...         d[c] += 1
    ...     return d
    >>> c = df["path"].str.split(",").apply(create_dict)
    >>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0
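One detail worth knowing when swapping Counter for defaultdict: both give the same counts, but looking up a missing key with `x[n]` behaves differently. The sketch below uses only the standard library; the variable names are illustrative.

```python
from collections import Counter, defaultdict

def create_dict(tokens):
    # Same helper as above: count tokens with a defaultdict
    d = defaultdict(int)
    for t in tokens:
        d[t] += 1
    return d

tokens = "p1,p5,p5,p7".split(",")
by_counter = Counter(tokens)
by_default = create_dict(tokens)

# Both containers hold identical counts
same_counts = dict(by_counter) == dict(by_default)

# Looking up a word that never occurred returns 0 in both cases ...
missing_counter = by_counter["p6"]
missing_default = by_default["p6"]

# ... but only the defaultdict stores the missing key as a side effect
counter_grew = "p6" in by_counter
default_grew = "p6" in by_default
```

With a large wordlist this means each per-row defaultdict grows by one entry for every word looked up; using `x.get(n, 0)` instead of `x[n]` would avoid that.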
And some tests:

    >>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
    0.45942801555111146

    # now let's make wordlist larger
    >>> wordlist = wordlist + list(lowercase) + list(uppercase)
    >>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
    1.5798653213942089