pandas的stack和unstack
近几日,处理数据时,需要对A列的值进行转换,将A列的所有值,转换成列名。
如上,一开始的思路是通过透视表或者逆透视表进行操作,但是melt之后,再次melt仍然达不到目的。因为,根本上,是需要将A列中的value转换成列名进行转换。Pandas中的stack和unstack很好的解决这一问题。
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
data=DataFrame(np.arange(12).reshape((3,4)),index=pd.Index(['street1','street2','street3']),
columns=pd.Index(['store1','store2','store3','store4']))
print(data)
print('-----------------------------------------\n')
data2=data.stack()
data3=data2.unstack()
print(data2)
print('-----------------------------------------\n')
print(data3)
'''
store1 store2 store3 store4
street1 0 1 2 3
street2 4 5 6 7
street3 8 9 10 11
-----------------------------------------
street1 store1 0
store2 1
store3 2
store4 3
street2 store1 4
store2 5
store3 6
store4 7
street3 store1 8
store2 9
store3 10
store4 11
dtype: int32
-----------------------------------------
store1 store2 store3 store4
street1 0 1 2 3
street2 4 5 6 7
street3 8 9 10 11
'''
可以看到:使用stack函数,将data的列索引['store1','store2','store3’,'store4']转变成行索引(第二层),便得到了一个层次化的Series(data2),使用unstack函数,将data2的第二层行索引转变成列索引(默认内层索引,level=-1),便又得到了DataFrame(data3)。
下面的例子我们利用level选择具体哪层索引。
data4=data2.unstack(level=0)
print(data4)
'''
street1 street2 street3
store1 0 4 8
store2 1 5 9
store3 2 6 10
store4 3 7 11
'''
我们可以清晰看到,当我们取level=0时,即最外层索引时,unstack把行索引['street1','street2','street3’]变为了列索引。
上述操作的大前提,是进行了索引,所以操作前,记得进行索引相应列。
实战代码如下:
# 导入pandas
import pandas as pd
# 设置读取哪些列名
cols = ['V2','V2A','V9','V186','V187','V192']
# 设置header为0,即第一行为列名
df_wv = pd.read_excel('WV5_Data_Spss_v20180912.xlsx',header=0, usecols=cols)
df_wv.head()
# 逆透视,保持V2和V2A不变,将V9、V186、V187和V192合并为一列
df_melt = df_wv.melt(id_vars=['V2','V2A'],value_vars=['V9','V186','V187','V192'])
df_melt.head()
# 对V2、V2A、variable和value进行分组,随后逆透视,统计size,并重新生成序列,命名为count
df_t1 = df_melt.groupby(['V2','V2A','variable','value']).agg('size').reset_index(name='count')
df_t1.head()
# 对V2、V2A和variable进行分组,对count列进行求和
df_t2 = df_t1.groupby(['V2','V2A','variable']).agg({'count': 'sum'})
df_t2.head()
# 以V2、V2A和variable,合并df_t1和df_t2,左连接
df_tt = pd.merge(df_t1,df_t2,on=['V2','V2A','variable'],how='left')
df_tt.head()
# 计算percent,并生成新的列percent
df_tt['percent']= df_tt['count_x']/df_tt['count_y']
df_tt.head()
# 设置query条件
# tt_query = "variable == 'V186' and value == 1 "
# df_tt.query(tt_query)
tt_query = "variable == 'V186' and value == 1 | variable == 'V9' and value == [1,2]"
df_tt.query(tt_query)
df_t_final = df_tt.query(tt_query)
df_t_final.head()
# 对V2、V2A和variable进行分组,对percent列进行求和
# df_t_final.groupby(['V2','V2A','variable']).agg({'percent': 'sum'})
df_final_new = df_t_final.groupby(['V2','V2A','variable']).agg({'percent': 'sum'})
df_final_new.head()
# melt前进行重新生成索引,否则报错KeyError: "The following 'id_vars' are not present in the DataFrame
df_final_melt = df_final_new.reset_index()
df_final_melt.head()
# 生成(多级)索引列,利用unstack函数列变行,转换variable到列名
df_final_melt.set_index(["V2", "V2A","variable"])["percent"].unstack()
参考链接:
https://blog.csdn.net/anshuai_aw1/article/details/82830916
https://www.cnblogs.com/traditional/p/11967360.html
https://blog.csdn.net/maymay_/article/details/80039677
https://blog.csdn.net/maymay_/article/details/105349956?spm=1001.2014.3001.5502
https://blog.csdn.net/maymay_/article/details/80039677?spm=1001.2014.3001.5502
pandas.ipynb
WV5_Data_Spss_v20180912.zip