I have implemented the EM algorithm for GMM using this post, [GMMs and Maximum Likelihood Optimization Using NumPy](https://towardsdatascience.com/how-to-code-gaussian-mixture-models-from-scratch-in-python-9e7975df5252), unsuccessfully, as follows:
```python
import numpy as np

eps = 1e-8  # small constant guarding against division by zero

def PDF(data, means, variances):
    return 1 / (np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k, iterations):
    weights = np.ones((k, 1)) / k                               # shape=(k, 1)
    means = np.random.choice(data, k)[:, np.newaxis]            # shape=(k, 1)
    variances = np.random.random_sample(size=k)[:, np.newaxis]  # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0)                 # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances))       # shape=(k, n)

        # Maximization step
        b = likelihood * weights                                # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # update means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]

    return means, variances
```
When I run the algorithm on a 1-D time-series dataset with k equal to 3, it returns output like the following:
```
array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.05053810e-003, 2.36989898e-025, 2.36989898e-025,
        1.32797395e-136, 6.91134950e-031, 5.47347807e-001, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 2.25849208e-064, 0.00000000e+000,
        1.61228562e-303, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
        1.13078186e+000, 2.53108878e-001, 5.33548114e-001, 9.14920432e-001, 2.07015697e-013, 4.45250680e-038,
        1.43000602e+000, 1.28781615e+000, 1.44821615e+000, 1.18186109e+000, 3.21610659e-002, 3.21610659e-002,
        3.21610659e-002, 3.21610659e-002, 3.21610659e-002, 2.47382844e-039, 0.00000000e+000, 2.09150855e-200,
        0.00000000e+000, 0.00000000e+000],
       [5.93203066e-002, 1.01647068e+000, 5.99299162e-001, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 2.14690238e-010,
        2.49337135e-191, 5.10499986e-001, 9.32658804e-001, 1.21148135e+000, 1.13315278e+000, 2.50324069e-237,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 1.73966953e-125, 2.53559290e-275,
        1.42960975e-065, 7.57552338e-001],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.05053810e-003, 2.36989898e-025, 2.36989898e-025,
        1.32797395e-136, 6.91134950e-031, 5.47347807e-001, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 2.25849208e-064, 0.00000000e+000,
        1.61228562e-303, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
        1.13078186e+000, 2.53108878e-001, 5.33548114e-001, 9.14920432e-001, 2.07015697e-013, 4.45250680e-038,
        1.43000602e+000, 1.28781615e+000, 1.44821615e+000, 1.18186109e+000, 3.21610659e-002, 3.21610659e-002,
        3.21610659e-002, 3.21610659e-002, 3.21610659e-002, 2.47382844e-039, 0.00000000e+000, 2.09150855e-200,
        0.00000000e+000, 0.00000000e+000]])
```
I think this is wrong, because the output should be two vectors, one of which represents the `means` values and the other the `variances` values. The vague point that made me doubt the implementation is that it returns `0.00000000e+000` for most of the outputs, as can be seen, and one does not really need to visualize these outputs to notice it. By the way, the input data are time-series data. I have checked everything and traced back multiple times, but no bug shows up.
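For instance, a quick check along these lines (a small diagnostic sketch I put together; `likelihood` stands for the `(k, n)` matrix computed inside `EM_GMM`) would show how widespread the underflow is:

```python
import numpy as np

def count_underflows(likelihood):
    # A Gaussian pdf is strictly positive for any finite input, so exact
    # zeros in the (k, n) likelihood matrix can only come from
    # floating-point underflow.
    zeros = np.count_nonzero(likelihood == 0.0)
    print(f'{zeros}/{likelihood.size} likelihood entries underflowed to 0')
```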
Here are my input data:
```
[25.31      , 24.31      , 24.12      , 43.46      , 41.48666667, 41.48666667,
 37.54      , 41.175     , 44.81      , 44.44571429, 44.44571429, 44.44571429,
 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
 44.44571429, 44.44571429, 44.44571429, 44.44571429, 39.71      , 26.69      ,
 34.15      , 24.94      , 24.75      , 24.56      , 24.38      , 35.25      ,
 44.62      , 44.94      , 44.815     , 44.69      , 42.31      , 40.81      ,
 44.38      , 44.56      , 44.44      , 44.25      , 43.66666667, 43.66666667,
 43.66666667, 43.66666667, 43.66666667, 40.75      , 32.31      , 36.08      ,
 30.135     , 24.19      ]
```
I wonder whether there is an elegant way to implement this via `numpy` or scikit-learn. Any help will be appreciated.
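For reference, the kind of short solution I have in mind would look roughly like this (a sketch using scikit-learn's `GaussianMixture`; `data` is assumed to hold the series listed above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.asarray(data)[:, np.newaxis]           # GaussianMixture expects 2-D input of shape (n_samples, n_features)
gmm = GaussianMixture(n_components=3).fit(X)  # k = 3 components; KMeans-based initialization by default
print(gmm.means_.ravel())                     # fitted means, one per component
print(gmm.covariances_.ravel())               # fitted variances (1-D data, so each covariance is a scalar)
```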
As I mentioned in the comment, the critical point that I see is the `means` initialization. Following the default implementation of sklearn's GaussianMixture, instead of random initialization I switched to KMeans.
```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn')

eps = 1e-8

def PDF(data, means, variances):
    return 1 / (np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k=3, iterations=100, init_strategy='kmeans'):
    weights = np.ones((k, 1)) / k                               # shape=(k, 1)

    if init_strategy == 'kmeans':
        from sklearn.cluster import KMeans
        km = KMeans(k).fit(data[:, None])
        means = km.cluster_centers_                             # shape=(k, 1)
    else:  # init_strategy == 'random'
        means = np.random.choice(data, k)[:, np.newaxis]        # shape=(k, 1)

    variances = np.random.random_sample(size=k)[:, np.newaxis]  # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0)                 # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances))       # shape=(k, n)

        # Maximization step
        b = likelihood * weights                                # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # update means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]

    return means, variances
```
This seems to yield the desired output much more consistently:
```python
s = np.array([25.31, 24.31, 24.12, 43.46, 41.48666667, 41.48666667, 37.54, 41.175,
              44.81, 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
              44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
              44.44571429, 44.44571429, 44.44571429, 39.71, 26.69, 34.15, 24.94,
              24.75, 24.56, 24.38, 35.25, 44.62, 44.94, 44.815, 44.69, 42.31,
              40.81, 44.38, 44.56, 44.44, 44.25, 43.66666667, 43.66666667,
              43.66666667, 43.66666667, 43.66666667, 40.75, 32.31, 36.08,
              30.135, 24.19])
k = 3
n_iter = 100

means, variances = EM_GMM(s, k, n_iter)
print(means, variances)
```

```
[[44.42596231]
 [24.509301  ]
 [35.4137508 ]]
[[0.07568723]
 [0.10583743]
 [0.52125856]]
```

```python
# Plotting the results
colors = ['green', 'red', 'blue', 'yellow']
bins = np.linspace(np.min(s)-2, np.max(s)+2, 100)

plt.figure(figsize=(10, 7))
plt.xlabel('$x$')
plt.ylabel('pdf')
sns.scatterplot(s, [0.05] * len(s), color='navy', s=40, marker=2, label='Series data')

for i, (m, v) in enumerate(zip(means, variances)):
    sns.lineplot(bins, PDF(bins, m, v), color=colors[i], label=f'Cluster {i+1}')

plt.legend()
plt.plot()
```
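As a sanity check (not part of the snippet above), the total log-likelihood of the data under the fitted mixture can also be compared across runs; for a correct EM implementation it should increase monotonically over iterations. A minimal sketch reusing `PDF`, `s`, `k`, and the fitted parameters, assuming uniform weights since `EM_GMM` does not return them:

```python
# Mixture density with uniform component weights (an assumption, since
# EM_GMM returns only means and variances).
mix_pdf = sum(PDF(s, m, v) for m, v in zip(means, variances)) / k
log_likelihood = np.sum(np.log(mix_pdf + eps))
print(f'total log-likelihood: {log_likelihood:.3f}')
```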
Finally, we can see that the purely random initialization generates different results; let's look at the resulting `means`:
```python
for _ in range(5):
    print(EM_GMM(s, k, n_iter, init_strategy='random')[0], '\n')
```

```
[[44.42596231]
 [44.42596231]
 [44.42596231]]

[[44.42596231]
 [24.509301  ]
 [30.1349997 ]]

[[44.42596231]
 [35.4137508 ]
 [44.42596231]]

[[44.42596231]
 [30.1349997 ]
 [44.42596231]]

[[44.42596231]
 [44.42596231]
 [44.42596231]]
```
One can see how different these results are; in some cases the resulting means are constant, meaning that the initialization chose 3 similar values which did not change much while iterating. Adding some print statements inside `EM_GMM` will clarify that.
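For example, something along these lines at the end of each iteration (a sketch; the exact placement and formatting are up to you):

```python
# Inside the `for step in range(iterations):` loop of EM_GMM: stagnant
# means across iterations indicate that the random initialization picked
# near-identical starting points.
print(f'step {step:3d} | means: {means.ravel()} | variances: {variances.ravel()}')
```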