I'm using SQL Server 2008 R2. I need to compute a percentile value for each group, something like:
SELECT id, PCTL(0.9, x) -- for the 90th percentile
FROM my_table
GROUP BY id
ORDER BY id
For example, given this DDL (fiddle):
CREATE TABLE my_table (id INT, x REAL); INSERT INTO my_table VALUES (7, 0.164595), (5, 0.671311), (7, 0.0118385), (6, 0.704592), (3, 0.633521), (3, 0.337268), (0, 0.54739), (6, 0.312282), (0, 0.220618), (7, 0.214973), (6, 0.410768), (7, 0.151572), (7, 0.0639506), (5, 0.339075), (1, 0.284094), (2, 0.126722), (2, 0.870079), (3, 0.369366), (1, 0.6687), (5, 0.199456), (5, 0.0296715), (1, 0.330339), (9, 0.0000459612), (5, 0.391947), (3, 0.753965), (8, 0.334207), (7, 0.583357), (3, 0.326951), (4, 0.207057), (2, 0.258463), (2, 0.0532811), (1, 0.751584), (7, 0.592624), (7, 0.673506), (5, 0.44764), (6, 0.733737), (5, 0.141215), (7, 0.222452), (3, 0.597019), (1, 0.293901), (4, 0.516213), (7, 0.498336), (6, 0.410461), (2, 0.32211), (1, 0.466735), (5, 0.720456), (8, 0.000428383), (3, 0.46085), (0, 0.402963), (7, 0.677002), (0, 0.400122), (1, 0.762357), (9, 0.158455), (7, 0.359723), (4, 0.225914), (7, 0.795345), (6, 0.902261), (2, 0.69533), (8, 0.593605), (6, 0.266233), (0, 0.917188), (9, 0.96353), (2, 0.577035), (8, 0.945236), (3, 0.257776), (4, 0.560569), (0, 0.838326), (2, 0.660338), (2, 0.537372), (8, 0.33806), (0, 0.545107), (1, 0.616673), (5, 0.30411), (0, 0.434737), (2, 0.588249), (9, 0.991362), (8, 0.772253), (6, 0.705396), (5, 0.323255), (8, 0.830319), (3, 0.679546), (4, 0.399748), (4, 0.440115), (6, 0.938154), (8, 0.333143), (9, 0.923541), (7, 0.19552), (4, 0.869822), (7, 0.620006), (4, 0.833529), (4, 0.297515), (4, 0.19906), (5, 0.540905), (9, 0.33313), (5, 0.200515), (5, 0.900481), (6, 0.02665), (3, 0.495421), (0, 0.96582), (9, 0.847218);
I would expect roughly the following (allowing for the variation between common percentile methods):
id  x
--  ------
0   0.9658
1   0.7624
2   0.6953
3   0.6795
4   0.8335
5   0.7205
6   0.9023
7   0.677
8   0.9452
9   0.9914
The real input set has about 2 million rows, and each real id group has anywhere from a few dozen to a few hundred (or more) rows.
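I've looked for solutions on SO and other sites, but the solutions on the dozen or so pages I checked only seem to compute a percentile over the entire rowset, not for each group/partition of it. (I'm relatively inexperienced with SQL, so I may have overlooked something.)
I also went through the documentation for the ranking functions, but couldn't piece together a working query.
I'd like to use PERCENTILE_DISC or PERCENTILE_CONT, but I'm stuck on 2008 R2 for now.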
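I like to do these calculations directly with row_number()/rank() and window functions. The built-in functions are useful, but they don't actually save that much effort: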
SELECT id,
       MIN(CASE WHEN seqnum >= 0.9 * cnt THEN x END) AS percentile_90
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY x) AS seqnum,
             COUNT(*) OVER (PARTITION BY id) AS cnt
      FROM my_table t
     ) t
GROUP BY id
ORDER BY id;
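This takes the first value at or above the 90th percentile position within each id. There are variations on this that give the continuous version: take the largest value at or below the target position and the smallest value above it, then interpolate between them.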
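As a rough illustration of the continuous variant, here is a minimal sketch that reuses the same seqnum/cnt subquery. It assumes the linear-interpolation definition that PERCENTILE_CONT uses, with a 1-based fractional target position of 1 + 0.9 * (cnt - 1). The names pos and percentile_90_cont are made up for this example. I have not run it against 2008 R2, so treat it as a starting point rather than a verified query. It avoids LAG/LEAD, which are not available before SQL Server 2012.

```sql
SELECT id,
       -- value at the row just at or below the target position
       MAX(CASE WHEN seqnum = FLOOR(pos) THEN x END)
       -- plus the fractional part of the position times the gap to the next row
       + (MAX(pos) - FLOOR(MAX(pos)))
         * (MAX(CASE WHEN seqnum = CEILING(pos) THEN x END)
            - MAX(CASE WHEN seqnum = FLOOR(pos) THEN x END)) AS percentile_90_cont
FROM (SELECT s.id, s.x, s.seqnum,
             -- assumed target position: 1-based and fractional, constant within each id
             1 + 0.9 * (s.cnt - 1) AS pos
      FROM (SELECT t.id, t.x,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY x) AS seqnum,
                   COUNT(*) OVER (PARTITION BY id) AS cnt
            FROM my_table t
           ) s
     ) p
GROUP BY id
ORDER BY id;
```

When the target position is a whole number, FLOOR(pos) and CEILING(pos) pick the same row, so the interpolation term is zero and the query returns that row's x unchanged. A group with a single row therefore returns its only value.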