I have a pandas dataframe
in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ','). For example, a
should become b
:
我有一个熊猫dataframe,其中一列文本字符串包含逗号分隔值。我想分割每个CSV字段,并为每个条目创建一个新的行(假设CSV是干净的,只需要在','上分割)。例如,a应该变成b:
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [8]: b
Out[8]:
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
So far, I have tried various simple functions, but the .apply
method seems to only accept one row as return value when it is used on an axis, and I can't get .transform
to work. Any suggestions would be much appreciated!
到目前为止,我已经尝试了各种简单的函数,但是.apply方法在轴上使用时似乎只接受一行作为返回值,我无法让.transform起作用。如有任何建议,我们将不胜感激!
Example data:
示例数据:
from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
{'var1': 'b', 'var2': 1},
{'var1': 'c', 'var2': 1},
{'var1': 'd', 'var2': 2},
{'var1': 'e', 'var2': 2},
{'var1': 'f', 'var2': 2}])
I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:
我知道这行不通,因为我们通过numpy失去了DataFrame元数据,但它应该能让你了解我想做什么:
def fun(row):
letters = row['var1']
letters = letters.split(',')
out = np.array([row] * len(letters))
out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)
36
How about something like this:
像这样的东西怎么样:
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Then you just have to rename the columns
然后只需重命名列
63
After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.
在痛苦的实验之后,我找到了比公认的答案更快的答案,我使这个工作。在我试用的数据集中,它的运行速度快了100倍。
If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.
如果有人知道如何使它更优雅,请务必修改我的代码。如果不将其他列设置为索引,然后重新设置索引并重新命名列,我就找不到一种有效的方法,但我可以想象还有其他的方法。
b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1
54
UPDATE2: more generic vectorized function, which will work for multiple normal
and multiple list
columns
UPDATE2:更通用的矢量化函数,它将用于多个正常和多个列表列。
def explode(df, lst_cols, fill_value=''):
# make sure `lst_cols` is a list
if lst_cols and not isinstance(lst_cols, list):
lst_cols = [lst_cols]
# all columns except `lst_cols`
idx_cols = df.columns.difference(lst_cols)
# calculate lengths of lists
lens = df[lst_cols[0]].str.len()
if (lens > 0).all():
# ALL lists in cells aren't empty
return pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols
}).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
.loc[:, df.columns]
else:
# at least one list in cells is empty
return pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols
}).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
.append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
.loc[:, df.columns]
Demo:
演示:
Multiple list
columns - all list
columns must have the same # of elements in each row:
多个列表列-所有列表列必须在每行中有相同的元素#:
In [36]: df
Out[36]:
aaa myid num text
0 10 1 [1, 2, 3] [aa, bb, cc]
1 11 2 [1, 2] [cc, dd]
2 12 3 [] []
3 13 4 [] []
In [37]: explode(df, ['num','text'], fill_value='')
Out[37]:
aaa myid num text
0 10 1 1 aa
1 10 1 2 bb
2 10 1 3 cc
3 11 2 1 cc
4 11 2 2 dd
2 12 3
3 13 4
Setup:
设置:
df = pd.DataFrame({
'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
'myid': {0: 1, 1: 2, 2: 3, 3: 4},
'num': {0: [1, 2, 3], 1: [1, 2], 2: [], 3: []},
'text': {0: ['aa', 'bb', 'cc'], 1: ['cc', 'dd'], 2: [], 3: []}
})
CSV column:
CSV专栏:
In [46]: df
Out[46]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
Out[47]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
using this little trick we can convert CSV-like column to list
column:
使用这个小技巧,我们可以将类csv的列转换为列表列:
In [48]: df.assign(var1=df.var1.str.split(','))
Out[48]:
var1 var2 var3
0 [a, b, c] 1 XX
1 [d, e, f, x, y] 2 ZZ
UPDATE: generic vectorized approach (will work also for multiple columns):
更新:通用矢量化方法(也适用于多列):
Original DF:
原始DF:
In [177]: df
Out[177]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
Solution:
解决方案:
first let's convert CSV strings to lists:
首先让我们将CSV字符串转换为列表:
In [178]: lst_col = 'var1'
In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})
In [180]: x
Out[180]:
var1 var2 var3
0 [a, b, c] 1 XX
1 [d, e, f, x, y] 2 ZZ
Now we can do this:
现在我们可以这样做:
In [181]: pd.DataFrame({
...: col:np.repeat(x[col].values, x[lst_col].str.len())
...: for col in x.columns.difference([lst_col])
...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
...:
Out[181]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
OLD answer:
旧的回答:
Inspired by @AFinkelstein solution, i wanted to make it bit more generalized which could be applied to DF with more than two columns and as fast, well almost, as fast as AFinkelstein's solution):
受到@AFinkelstein解决方案的启发,我想使它更一般化一些,可以应用到超过两列的DF上,而且速度几乎和AFinkelstein的解决方案一样快):
In [2]: df = pd.DataFrame(
...: [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
...: {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
...: )
In [3]: df
Out[3]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
In [4]: (df.set_index(df.columns.drop('var1',1).tolist())
...: .var1.str.split(',', expand=True)
...: .stack()
...: .reset_index()
...: .rename(columns={0:'var1'})
...: .loc[:, df.columns]
...: )
Out[4]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
18
Here's a function I wrote for this common task. It's more efficient than the Series
/stack
methods. Column order and names are retained.
这是我为这个公共任务编写的函数。它比系列/堆栈方法更有效。保留列顺序和名称。
def tidy_split(df, column, sep='|', keep=False):
"""
Split the values of a column and expand so the new DataFrame has one split
value per row. Filters rows where the column is missing.
Params
------
df : pandas.DataFrame
dataframe with the column to split and expand
column : str
the column to split and expand
sep : str
the string used to split the column's values
keep : bool
whether to retain the presplit value as it's own row
Returns
-------
pandas.DataFrame
Returns a dataframe with the same columns as `df`.
"""
indexes = list()
new_values = list()
df = df.dropna(subset=[column])
for i, presplit in enumerate(df[column].astype(str)):
values = presplit.split(sep)
if keep and len(values) > 1:
indexes.append(i)
new_values.append(presplit)
for value in values:
indexes.append(i)
new_values.append(value)
new_df = df.iloc[indexes, :].copy()
new_df[column] = new_values
return new_df
With this function, the original question is as simple as:
有了这个函数,原来的问题就简单到:
tidy_split(a, 'var1', sep=',')
10
Similar question as: pandas: How do I split text in a column into multiple rows?
类似的问题如:熊猫:如何将列中的文本分割成多行?
You could do:
你能做的:
>> a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]})
>> s = a.var1.str.split(",").apply(pd.Series, 1).stack()
>> s.index = s.index.droplevel(-1)
>> del a['var1']
>> a.join(s)
var2 var1
0 1 a
0 1 b
0 1 c
1 2 d
1 2 e
1 2 f
5
I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column's entries at a time).
我为具有任意列数的dataframes提供了一个解决方案(但每次仍然只分离一个列的条目)。
def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split
returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
The values in the other columns are duplicated across the newly divided rows.
'''
def splitListToRows(row,row_accumulator,target_column,separator):
split_row = row[target_column].split(separator)
for s in split_row:
new_row = row.to_dict()
new_row[target_column] = s
row_accumulator.append(new_row)
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pandas.DataFrame(new_rows)
return new_df
1
Just used jiln's excellent answer from above, but needed to expand to split multiple columns. Thought I would share.
只是使用了上面jiln的优秀答案,但是需要扩展到多个列。想我应该分享。
def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split
returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
The values in the other columns are duplicated across the newly divided rows.
'''
def splitListToRows(row, row_accumulator, target_columns, separator):
split_rows = []
for target_column in target_columns:
split_rows.append(row[target_column].split(separator))
# Seperate for multiple columns
for i in range(len(split_rows[0])):
new_row = row.to_dict()
for j in range(len(split_rows)):
new_row[target_columns[j]] = split_rows[j][i]
row_accumulator.append(new_row)
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pd.DataFrame(new_rows)
return new_df
1
Here is a fairly straightforward message that uses the split
method from pandas str
accessor and then uses NumPy to flatten each row into a single array.
这里有一个相当简单的消息,它使用熊猫str accessor的split方法,然后使用NumPy将每一行压平成一个数组。
The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat
.
通过使用np.repeat重复非拆分列正确的次数来检索相应的值。
var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) / len(df))
pd.DataFrame({'var1': var1,
'var2': var2})
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
1
Based on the excellent @DMulligan's solution, here is a generic vectorized (no loops) function which splits a column of a dataframe into multiple rows, and merges it back to the original dataframe. It also uses a great generic change_column_order
function from this answer.
基于优秀的@DMulligan的解决方案,这里是一个通用的矢量化(无循环)函数,它将一个dataframe的列分割成多个行,并将其合并回原始的dataframe。它还使用了一个很好的通用的change_column_order函数。
def change_column_order(df, col_name, index):
cols = df.columns.tolist()
cols.remove(col_name)
cols.insert(index, col_name)
return df[cols]
def split_df(dataframe, col_name, sep):
orig_col_index = dataframe.columns.tolist().index(col_name)
orig_index_name = dataframe.index.name
orig_columns = dataframe.columns
dataframe = dataframe.reset_index() # we need a natural 0-based index for proper merge
index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
df_split = pd.DataFrame(
pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
.stack().reset_index(level=1, drop=1), columns=[col_name])
df = dataframe.drop(col_name, axis=1)
df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
df = df.set_index(index_col_name)
df.index.name = orig_index_name
# merge adds the column to the last place, so we need to move it back
return change_column_order(df, col_name, orig_col_index)
Example:
例子:
df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]],
columns=['Name', 'A', 'B'], index=[10, 12, 13])
df
Name A B
10 a:b 1 4
12 c:d 2 5
13 e:f:g:h 3 6
split_df(df, 'Name', ':')
Name A B
10 a 1 4
10 b 1 4
12 c 2 5
12 d 2 5
13 e 3 6
13 f 3 6
13 g 3 6
13 h 3 6
Note that it preserves the original index and order of the columns. It also works with dataframes which have non-sequential index.
注意,它保留了列的原始索引和顺序。它也适用于具有非顺序索引的数据aframes。
0
I have come up with the following solution to this problem:
对于这个问题,我有以下的解决办法:
def iter_var1(d):
for _, row in d.iterrows():
for v in row["var1"].split(","):
yield (v, row["var2"])
new_a = DataFrame.from_records([i for i in iter_var1(a)],
columns=["var1", "var2"])
0
Another solution that uses python copy package
另一个使用python复制包的解决方案
import copy
new_observatiOns= list()
def pandas_explode(df, column_to_explode):
new_observatiOns= list()
for row in df.to_dict(orient='records'):
explode_values = row[column_to_explode]
del row[column_to_explode]
if type(explode_values) is list or type(explode_values) is tuple:
for explode_value in explode_values:
new_observation = copy.deepcopy(row)
new_observation[column_to_explode] = explode_value
new_observations.append(new_observation)
else:
new_observation = copy.deepcopy(row)
new_observation[column_to_explode] = explode_values
new_observations.append(new_observation)
return_df = pd.DataFrame(new_observations)
return return_df
df = pandas_explode(df, column_name)
0
The string function split can take an option boolean argument 'expand'.
字符串函数分割可以使用布尔参数“展开”。
Here is a solution using this argument:
这里有一个使用这个论点的解决方案:
a.var1.str.split(",",expand=True).set_index(a.var2).stack().reset_index(level=1, drop=True).reset_index().rename(columns={0:"var1"})