pandas的stack和unstack

近几日,处理数据时,需要对A列的值进行转换,将A列的所有值,转换成列名。

如上,一开始的思路是通过透视表或者逆透视表进行操作,但是melt之后,再次melt仍然达不到目的。因为,根本上,是需要将A列中的value转换成列名进行转换。Pandas中的stack和unstack很好的解决这一问题。

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
data=DataFrame(np.arange(12).reshape((3,4)),index=pd.Index(['street1','street2','street3']),
               columns=pd.Index(['store1','store2','store3','store4']))
print(data)
print('-----------------------------------------\n')
data2=data.stack()
data3=data2.unstack()
print(data2)
print('-----------------------------------------\n')
print(data3)
'''
         store1  store2  store3  store4
street1       0       1       2       3
street2       4       5       6       7
street3       8       9      10      11
-----------------------------------------
street1  store1     0
         store2     1
         store3     2
         store4     3
street2  store1     4
         store2     5
         store3     6
         store4     7
street3  store1     8
         store2     9
         store3    10
         store4    11
dtype: int32
-----------------------------------------
         store1  store2  store3  store4
street1       0       1       2       3
street2       4       5       6       7
street3       8       9      10      11
'''

可以看到:使用stack函数,将data的列索引['store1','store2','store3’,'store4']转变成行索引(第二层),便得到了一个层次化的Series(data2),使用unstack函数,将data2的第二层行索引转变成列索引(默认内层索引,level=-1),便又得到了DataFrame(data3)。

下面的例子我们利用level选择具体哪层索引。

data4=data2.unstack(level=0)
print(data4)
'''
        street1  street2  street3
store1        0        4        8
store2        1        5        9
store3        2        6       10
store4        3        7       11
'''

我们可以清晰看到,当我们取level=0时,即最外层索引时,unstack把行索引['street1','street2','street3’]变为了列索引。

上述操作的大前提,是进行了索引,所以操作前,记得进行索引相应列。

实战代码如下:

# 导入pandas
import pandas as pd

# 设置读取哪些列名
cols = ['V2','V2A','V9','V186','V187','V192']

# 设置header为0,即第一行为列名
df_wv = pd.read_excel('WV5_Data_Spss_v20180912.xlsx',header=0, usecols=cols)
df_wv.head()

# 逆透视,保持V2和V2A不变,将V9、V186、V187和V192合并为一列
df_melt = df_wv.melt(id_vars=['V2','V2A'],value_vars=['V9','V186','V187','V192'])
df_melt.head()

# 对V2、V2A、variable和value进行分组,随后逆透视,统计size,并重新生成序列,命名为count
df_t1 = df_melt.groupby(['V2','V2A','variable','value']).agg('size').reset_index(name='count')
df_t1.head()

# 对V2、V2A和variable进行分组,对count列进行求和
df_t2 = df_t1.groupby(['V2','V2A','variable']).agg({'count': 'sum'})
df_t2.head()

# 以V2、V2A和variable,合并df_t1和df_t2,左连接
df_tt = pd.merge(df_t1,df_t2,on=['V2','V2A','variable'],how='left')
df_tt.head()

# 计算percent,并生成新的列percent
df_tt['percent']= df_tt['count_x']/df_tt['count_y']
df_tt.head()

# 设置query条件
# tt_query = "variable == 'V186' and value == 1 "
# df_tt.query(tt_query)
tt_query = "variable == 'V186' and value == 1 | variable == 'V9' and value == [1,2]"
df_tt.query(tt_query)
df_t_final = df_tt.query(tt_query)
df_t_final.head()

# 对V2、V2A和variable进行分组,对percent列进行求和
# df_t_final.groupby(['V2','V2A','variable']).agg({'percent': 'sum'})
df_final_new = df_t_final.groupby(['V2','V2A','variable']).agg({'percent': 'sum'})
df_final_new.head()

# melt前进行重新生成索引,否则报错KeyError: "The following 'id_vars' are not present in the DataFrame
df_final_melt = df_final_new.reset_index()
df_final_melt.head()

# 生成(多级)索引列,利用unstack函数列变行,转换variable到列名
df_final_melt.set_index(["V2", "V2A","variable"])["percent"].unstack()

参考链接:
https://blog.csdn.net/anshuai_aw1/article/details/82830916
https://www.cnblogs.com/traditional/p/11967360.html
https://blog.csdn.net/maymay_/article/details/80039677
https://blog.csdn.net/maymay_/article/details/105349956?spm=1001.2014.3001.5502
https://blog.csdn.net/maymay_/article/details/80039677?spm=1001.2014.3001.5502

pandas.ipynb
WV5_Data_Spss_v20180912.zip

本文链接:

https://ma.ge/archives/531.html