Python: Combining low-frequency factors/category counts

小编典典


python

There is a nice solution for this in R.

My df.column looks like:

Windows
Windows
Mac
Mac
Mac
Linux
Windows
...

I want to replace the low-frequency categories in this df.column vector with "Other". For example, I need my df.column to look like

Windows
Windows
Mac
Mac
Mac
Linux -> Other
Windows
...

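I want to rename these rare categories to reduce the number of factors in my regression, which is why I need the original vector. In Python, after running the command to get the frequency table, I get: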

pd.value_counts(df.column)


Windows          26083
iOS              19711
Android          13077
Macintosh         5799
Chrome OS          347
Linux              285
Windows Phone      167
(not set)           22
BlackBerry          11

I would like to know whether there is an efficient way to rename "Chrome OS", "Linux" (the low-frequency data) to another category (for example, an "Other" category).


2021-01-16

1 answer

小编典典

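Build a mask from each category's percentage share of the total, i.e.:

import numpy as np
import pandas as pd

series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(1)  # True for categories below 1% of all rows
# To replace the values in df['column'], use np.where:
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])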
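To see the replacement end to end, here is a minimal sketch on a small made-up column. The toy data and the 20% threshold are assumptions chosen for illustration; the answer above uses 1% on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the question's real column is much larger.
df = pd.DataFrame({"column": ["Windows"] * 5 + ["Mac"] * 4 + ["Linux"]})

counts = df["column"].value_counts()
# Percentage share per category; 20 is an illustrative threshold, not the answer's 1.
rare = (counts / counts.sum() * 100).lt(20)

# Rows whose category is rare become "Other"; all other rows keep their value.
df["column"] = np.where(df["column"].isin(counts[rare].index), "Other", df["column"])

result = df["column"].value_counts().to_dict()
print(result)
```

Only Linux (10% of the rows) falls under the threshold here, so it is the only label rewritten.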

To collapse the counts table itself, summing the low-frequency counts into a single entry:

new = series[~mask]
new['Other'] = series[mask].sum()

Windows      26083
iOS          19711
Android      13077
Macintosh     5799
Other          832
Name: 1, dtype: int64

If you want to replace the index labels instead:

series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)

Windows      26083
iOS          19711
Android      13077
Macintosh     5799
Other          347
Other          285
Other          167
Other           22
Other           11
Name: 1, dtype: int64
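Relabeling the index leaves several separate "Other" rows. If a single row is wanted afterwards, one option (not part of the original answer) is to group on the index and sum. The sketch below rebuilds the counts from the question's frequency table so it runs on its own.

```python
import numpy as np
import pandas as pd

# Counts copied from the frequency table in the question.
series = pd.Series(
    [26083, 19711, 13077, 5799, 347, 285, 167, 22, 11],
    index=["Windows", "iOS", "Android", "Macintosh", "Chrome OS",
           "Linux", "Windows Phone", "(not set)", "BlackBerry"],
)
mask = (series / series.sum() * 100).lt(1)
series.index = np.where(series.index.isin(series[mask].index), "Other", series.index)

# Merge the repeated "Other" rows; sort=False keeps first-appearance order.
merged = series.groupby(level=0, sort=False).sum()
print(merged)
```

This arrives at the same five-row table as the `new['Other'] = series[mask].sum()` approach.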

Explanation

(series/series.sum() * 100) # This gives the percentage share of each category, i.e.

Windows          39.820158
iOS              30.092211
Android          19.964276
Macintosh         8.853165
Chrome OS         0.529755
Linux             0.435101
Windows Phone     0.254954
(not set)         0.033587
BlackBerry        0.016793
Name: 1, dtype: float64

.lt(1) is equivalent to "less than 1". It returns a boolean mask; index with that mask and assign the data.
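A small sketch of how the mask behaves, using three rounded percentages taken from the table above (the variable names are illustrative):

```python
import pandas as pd

# Rounded shares from the percentage table above.
pct = pd.Series({"Windows": 39.8, "Macintosh": 8.9, "Linux": 0.4})

mask = pct.lt(1)             # method form of the comparison
same = mask.equals(pct < 1)  # operator form yields an identical mask

low = pct[mask]    # categories under 1%
high = pct[~mask]  # ~ inverts the mask: categories at or above 1%
print(same, list(low.index), list(high.index))
```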

2021-01-16