现在有很多类似的问题,但是大多数回答了如何删除重复的列。但是,我想知道如何制作一个元组列表,其中每个元组都包含重复列的列名。我假设每一列都有一个唯一的名称。为了进一步说明我的问题:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [2, 4, 2, 1, 9], 'C': [1, 2, 3, 4, 5],'D': [2, 4, 2, 1, 9], 'E': [3, 4, 2, 1, 2],'F': [1, 1, 1, 1, 1]}, index = ['a1', 'a2', 'a3', 'a4', 'a5'])
然后我想要输出:
[('A', 'C'), ('B', 'D')]
如果今天您感觉很好,则将相同的问题扩展到行。如何获取元组列表,其中每个元组都包含重复的行。
这是NumPy的一种方法-
def group_duplicate_cols(df): a = df.values sidx = np.lexsort(a) b = a[:,sidx] m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] )) idx = np.flatnonzero(m[1:] != m[:-1]) C = df.columns[sidx].tolist() return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
样品运行-
In [100]: df Out[100]: A B C D E F a1 1 2 1 2 3 1 a2 2 4 2 4 4 1 a3 3 2 3 2 2 1 a4 4 1 4 1 1 1 a5 5 9 5 9 2 1 In [101]: group_duplicate_cols(df) Out[101]: [['A', 'C'], ['B', 'D']] # Let's add one more duplicate into group containing 'A' In [102]: df.F = df.A In [103]: group_duplicate_cols(df) Out[103]: [['A', 'C', 'F'], ['B', 'D']]
进行转换即可,但是对于rows(index),我们只需要沿另一条轴切换操作,就像这样-
def group_duplicate_rows(df): a = df.values sidx = np.lexsort(a.T) b = a[sidx] m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] )) idx = np.flatnonzero(m[1:] != m[:-1]) C = df.index[sidx].tolist() return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
In [260]: df2 Out[260]: a1 a2 a3 a4 a5 A 3 5 3 4 5 B 1 1 1 1 1 C 3 5 3 4 5 D 2 9 2 1 9 E 2 2 2 1 2 F 1 1 1 1 1 In [261]: group_duplicate_rows(df2) Out[261]: [['B', 'F'], ['A', 'C']]
方法-
# @John Galt's soln-1 from itertools import combinations def combinations_app(df): return[x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()] # @Abdou's soln def pandas_groupby_app(df): return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1] # @COLDSPEED's soln def triu_app(df): c = df.columns.tolist() i, j = np.triu_indices(len(c), 1) x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()] return x # @cmaher's soln def lambda_set_app(df): return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))
注意:@John Galt's soln-2未包括在内,因为输入的大小(8000,500)会与针对该输入的提议不符broadcasting。
@John Galt's soln-2
(8000,500)
broadcasting
时间-
In [179]: # Setup inputs with sizes as mentioned in the question ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500))) ...: df.columns = ['C'+str(i) for i in range(df.shape[1])] ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0) ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0) ...: df.iloc[:,idx0] = df.iloc[:,idx1].values ...: # @John Galt's soln-1 In [180]: %timeit combinations_app(df) 1 loops, best of 3: 24.6 s per loop # @Abdou's soln In [181]: %timeit pandas_groupby_app(df) 1 loops, best of 3: 3.81 s per loop # @COLDSPEED's soln In [182]: %timeit triu_app(df) 1 loops, best of 3: 25.5 s per loop # @cmaher's soln In [183]: %timeit lambda_set_app(df) 1 loops, best of 3: 27.1 s per loop # Proposed in this post In [184]: %timeit group_duplicate_cols(df) 10 loops, best of 3: 188 ms per loop
借助NumPy的视图功能实现超级助推
借助NumPy的视图功能,我们可以将每组元素视为一个dtype,这样可以进一步提高性能,如下所示-
def view1D(a): # a is array a = np.ascontiguousarray(a) void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1])) return a.view(void_dt).ravel() def group_duplicate_cols_v2(df): a = df.values sidx = view1D(a.T).argsort() b = a[:,sidx] m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] )) idx = np.flatnonzero(m[1:] != m[:-1]) C = df.columns[sidx].tolist() return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
In [322]: %timeit group_duplicate_cols(df) 10 loops, best of 3: 185 ms per loop In [323]: %timeit group_duplicate_cols_v2(df) 10 loops, best of 3: 69.3 ms per loop
只是疯狂的加速!