VERY IMPORTANT EDIT: all A_i are unique.
I have a list A of n unique objects. Each object A_i has a variable percentage P_i.
I want an algorithm that produces a new list B of k objects (k < n/2, and in most cases k is significantly smaller than n/2; for example, n = 231, k = 21). List B should contain no duplicates and is populated with objects originating from list A, under the following restriction:
The probability that object A_i appears in B is P_i.
Here is list A (these code snippets are in PHP, purely for testing purposes):
$list = [ "A" => 2.5, "B" => 2.5, "C" => 2.5, "D" => 2.5, "E" => 2.5, "F" => 2.5, "G" => 2.5, "H" => 2.5, "I" => 5, "J" => 5, "K" => 2.5, "L" => 2.5, "M" => 2.5, "N" => 2.5, "O" => 2.5, "P" => 2.5, "Q" => 2.5, "R" => 2.5, "S" => 2.5, "T" => 2.5, "U" => 5, "V" => 5, "W" => 5, "X" => 5, "Y" => 5, "Z" => 20 ];
Initially, I tried the following two algorithms (again, only in PHP for testing purposes):
$result = [];
while (count($result) < 10) {
    // Draw a random number in [0, 100] and walk the cumulative percentages.
    $rnd = rand(0, 10000000) / 100000;
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            if (in_array($key, $result)) {
                break; // duplicate: retry the whole draw
            } else {
                $result[] = $key;
                break;
            }
        }
    }
}
and
$result = [];
while (count($result) < 10) {
    // Recompute the total weight of the remaining objects.
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
    }
    // Draw a random number in [0, $sum] and walk the cumulative weights.
    $rnd = rand(0, $sum * 100000) / 100000;
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            $result[] = $key;
            unset($list[$key]); // remove the chosen object so it cannot repeat
            break;
        }
    }
}
The only difference between the two algorithms is that one retries whenever it hits a duplicate, while the other removes an object from list A once it has been selected. As it turns out, both algorithms produce the same probabilistic output.
I ran the second algorithm 100,000 times, tracking how many times each letter was selected. The array below gives, for each letter, the percentage of the 100,000 tests in which it was selected into list B.
[A] => 30.213  [B] => 29.865  [C] => 30.357  [D] => 30.198  [E] => 30.152  [F] => 30.472
[G] => 30.343  [H] => 30.011  [I] => 51.367  [J] => 51.683  [K] => 30.271  [L] => 30.197
[M] => 30.341  [N] => 30.15   [O] => 30.225  [P] => 30.135  [Q] => 30.406  [R] => 30.083
[S] => 30.251  [T] => 30.369  [U] => 51.671  [V] => 52.098  [W] => 51.772  [X] => 51.739
[Y] => 51.891  [Z] => 93.74
Looking back at the algorithm, this makes sense. The algorithm wrongly interprets the original percentages as the chance of an object being chosen for any given position, rather than for the list B as a whole. So, for example, the actual chance of Z being chosen into list B is about 93%, while the chance of Z being chosen at any one index B_n is 20%. That is not what I want: I want the chance of Z being chosen into list B to be 20%.
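(A rough sanity check, treating the 10 draws as if they were independent, which they are not quite: 1 - 0.8^10 ≈ 0.89, in the same ballpark as the 93.74% observed for Z above.)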
Is this even possible? How can it be done?
I tried simply making the P_i sum to k. That works if all the P_i are equal, but as soon as I modify their values it becomes more and more wrong.
Initial probabilities:
$list = [
    "A" => 8.4615,  "B" => 68.4615, "C" => 13.4615, "D" => 63.4615, "E" => 18.4615,
    "F" => 58.4615, "G" => 23.4615, "H" => 53.4615, "I" => 28.4615, "J" => 48.4615,
    "K" => 33.4615, "L" => 43.4615, "M" => 38.4615, "N" => 38.4615, "O" => 38.4615,
    "P" => 38.4615, "Q" => 38.4615, "R" => 38.4615, "S" => 38.4615, "T" => 38.4615,
    "U" => 38.4615, "V" => 38.4615, "W" => 38.4615, "X" => 38.4615, "Y" => 38.4615,
    "Z" => 38.4615
];
Results after 10,000 runs:
Array (
    [A] => 10.324  [B] => 59.298  [C] => 15.902  [D] => 56.299  [E] => 21.16   [F] => 53.621
    [G] => 25.907  [H] => 50.163  [I] => 30.932  [J] => 47.114  [K] => 35.344  [L] => 43.175
    [M] => 39.141  [N] => 39.127  [O] => 39.346  [P] => 39.364  [Q] => 39.501  [R] => 39.05
    [S] => 39.555  [T] => 39.239  [U] => 39.283  [V] => 39.408  [W] => 39.317  [X] => 39.339
    [Y] => 39.569  [Z] => 39.522
)
We must have sum_i P_i = k, or else we cannot succeed: list B always contains exactly k objects, and by linearity of expectation its expected size is sum_i P_i, so the two must be equal.
As mentioned before, this part of the problem is somewhat easy, but you may not like this answer, on the grounds that it is "not random enough".
Sample a uniform random permutation Perm on the integers [0, n)
Sample X uniformly at random from [0, 1)
For i in Perm
    If X < P_i, then append A_i to B and update X := X + (1 - P_i)
    Else, update X := X - P_i
End
You will need to use fixed-point arithmetic, rather than floating point, to approximate the computations involving real numbers.
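For concreteness, here is a minimal Python sketch of the procedure above. It is not part of the original answer: the function name systematic_sample is mine, it uses plain floats for brevity (rather than the fixed-point arithmetic recommended just above), and the example probabilities are borrowed from the test case in the Python script further down.

import random

def systematic_sample(items, ps):
    # ps[i] is the target probability that items[i] ends up in the result;
    # the ps must sum to an integer k, and the result then has exactly k items.
    order = list(range(len(items)))
    random.shuffle(order)              # uniform random permutation Perm
    x = random.random()                # X uniform in [0, 1)
    chosen = []
    for i in order:
        if x < ps[i]:
            chosen.append(items[i])    # include A_i
            x += 1 - ps[i]
        else:
            x -= ps[i]
    return chosen

# Example: probabilities summing to k = 2, so exactly 2 of the 4 items are chosen.
print(systematic_sample(["A_1", "A_2", "A_3", "A_4"], [2/3, 1/2, 1/2, 1/3]))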
The missing condition is that the distribution should have a technical property called "maximum entropy". Like Amit, I cannot think of a good way to achieve this. Here is a clumsy way.
My first (and wrong) instinct for solving this problem was to include each A_i in B independently with probability P_i, and then retry until B has the right length (there will not be too many retries, for reasons you can ask math.SE about). The problem is that the conditioning messes up the probabilities. If P_1 = 1/3, P_2 = 2/3, and k = 1, then the outcomes are
{}: probability 2/9
{A_1}: probability 1/9
{A_2}: probability 4/9
{A_1, A_2}: probability 2/9,
so the conditional probabilities, given that exactly one element was chosen, are actually 1/5 for A_1 (namely (1/9) / (1/9 + 4/9)) and 4/5 for A_2, rather than the desired 1/3 and 2/3.
Instead, we should substitute new probabilities Q_i that yield the proper conditional distribution. I do not know a closed form for the Q_i, so I propose finding them with a numerical optimization algorithm such as gradient descent. Initialize Q_i = P_i (why not?). Using dynamic programming, it is possible, for the current setting of the Q_i, to find the probability that, given an outcome with l elements, A_i is one of those elements. (We only care about the l = k entries, but we need the others to make the recurrences work.) With a little more work we can get the whole gradient. Sorry this is so sketchy.
In Python 3, using a nonlinear solution method that seems to always converge (simultaneously updating each q_i to its marginally correct value and then normalizing):
#!/usr/bin/env python3
import collections
import operator
import random


def constrained_sample(qs):
    # Rejection sampling: include each index i independently with probability
    # qs[i]; retry until exactly k = round(sum(qs)) indices are included.
    k = round(sum(qs))
    while True:
        sample = [i for i, q in enumerate(qs) if random.random() < q]
        if len(sample) == k:
            return sample


def size_distribution(qs):
    # Dynamic program: size_dist[j] = probability that exactly j elements are
    # included when element i is included independently with probability qs[i].
    size_dist = [1]
    for q in qs:
        size_dist.append(0)
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] += size_dist[j - 1] * q
            size_dist[j - 1] *= 1 - q
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def size_distribution_without(size_dist, q):
    # Undo the contribution of one element with inclusion probability q,
    # using whichever back-substitution direction is numerically stabler.
    size_dist = size_dist[:]
    if q >= 0.5:
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] /= q
            size_dist[j - 1] -= size_dist[j] * (1 - q)
        del size_dist[0]
    else:
        for j in range(1, len(size_dist)):
            size_dist[j - 1] /= 1 - q
            size_dist[j] -= size_dist[j - 1] * q
        del size_dist[-1]
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def test_size_distribution(qs):
    d = size_distribution(qs)
    for i, q in enumerate(qs):
        d1a = size_distribution_without(d, q)
        d1b = size_distribution(qs[:i] + qs[i + 1:])
        assert len(d1a) == len(d1b)
        assert max(map(abs, map(operator.sub, d1a, d1b))) <= 1e-10


def normalized(qs, k):
    # Rescale so that the qs sum to exactly k.
    sum_qs = sum(qs)
    qs = [q * k / sum_qs for q in qs]
    assert abs(sum(qs) / k - 1) <= 1e-10
    return qs


def approximate_qs(ps, reps=100):
    # Fixed-point iteration: repeatedly set each qs[i] to the value whose
    # conditional inclusion probability, given exactly k elements, equals ps[i]
    # (with the size distribution of the other elements held fixed), then normalize.
    k = round(sum(ps))
    qs = ps[:]
    for j in range(reps):
        size_dist = size_distribution(qs)
        for i, p in enumerate(ps):
            d = size_distribution_without(size_dist, qs[i])
            d.append(0)
            qs[i] = p * d[k] / ((1 - p) * d[k - 1] + p * d[k])
        qs = normalized(qs, k)
    return qs


def test(ps, reps=100000):
    # Fit the qs, then compare empirical inclusion frequencies against ps.
    print(ps)
    qs = approximate_qs(ps)
    print(qs)
    counter = collections.Counter()
    for j in range(reps):
        counter.update(constrained_sample(qs))
    test_size_distribution(qs)
    print("p", "Actual", sep="\t")
    for i, p in enumerate(ps):
        print(p, counter[i] / reps, sep="\t")


if __name__ == "__main__":
    test([2 / 3, 1 / 2, 1 / 2, 1 / 3])
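As a sketch of how this might be wired up to the asker's second list (an assumption on my part: the percentages there sum to roughly 1000, so dividing each by 100 gives probabilities summing to k = 10), appended after the definitions above:

# Assumed usage of the functions defined above with the question's second list.
percentages = {
    "A": 8.4615, "B": 68.4615, "C": 13.4615, "D": 63.4615, "E": 18.4615,
    "F": 58.4615, "G": 23.4615, "H": 53.4615, "I": 28.4615, "J": 48.4615,
    "K": 33.4615, "L": 43.4615, "M": 38.4615, "N": 38.4615, "O": 38.4615,
    "P": 38.4615, "Q": 38.4615, "R": 38.4615, "S": 38.4615, "T": 38.4615,
    "U": 38.4615, "V": 38.4615, "W": 38.4615, "X": 38.4615, "Y": 38.4615,
    "Z": 38.4615,
}
letters = list(percentages)
ps = [v / 100 for v in percentages.values()]      # probabilities summing to k = 10
qs = approximate_qs(ps)                           # adjusted per-trial probabilities
b = [letters[i] for i in constrained_sample(qs)]  # one list B of 10 distinct letters
print(b)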