想象一下,我有一个这样的表:
CREATE TABLE time_series ( snapshot_date DATE, sales INTEGER, PRIMARY KEY (snapshot_date));
具有这样的值:
INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales; INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales; INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales; INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales; INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales; INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;
我希望能够做到这一点:
SELECT a.snapshot_date, AVG(b.sales) AS sales_avg, COUNT(*) AS COUNT FROM time_series AS a JOIN time_series AS b ON a.snapshot_date > b.snapshot_date GROUP BY a.snapshot_date
产生如下结果:
*---------------*-----------*-------* | snapshot_date | sales_avg | count | *---------------*-----------*-------* | 2017-01-02 | 10.0 | 1 | | 2017-01-03 | 7.0 | 2 | | 2017-01-04 | 9.0 | 3 | | 2017-01-05 | 8.5 | 4 | | 2017-01-06 | 9.8 | 5 | -------------------------------------
如本例所示,由于行数很少,因此查询运行非常快。问题是我必须对数百万行执行此操作,并且在Redshift(语法类似于Postgres)上,我的查询需要几天才能运行。这太慢了,但这是我最常见的查询模式之一。我怀疑问题是由于数据中O(n ^ 2)的增长与更可取的O(n)的关系所致。
我在python中的O(n)实现将是这样的:
rows = [('2017-01-01',10), ('2017-01-02',4), ('2017-01-03',13), ('2017-01-04',7), ('2017-01-05',15), ('2017-01-06',8)] sales_total_previous = 0 count = 0 for index, row in enumerate(rows): snapshot_date = row[0] sales = row[1] if index == 0: sales_total_previous += sales continue count += 1 sales_avg = sales_total_previous / count print((snapshot_date,sales_avg, count)) sales_total_previous += sales
具有这样的结果(与SQL查询相同):
('2017-01-02', 10.0, 1) ('2017-01-03', 7.0, 2) ('2017-01-04', 9.0, 3) ('2017-01-05', 8.5, 4) ('2017-01-06', 9.8, 5)
我正在考虑切换到Apache Spark,以便可以完全执行该python查询,但是几百万行实际上并不是那么大(最多3-4 GB),并且使用具有100 GB RAM的Spark集群似乎过度杀伤力。有没有一种有效且易于阅读的方法,可以在SQL中获得O(n)效率,最好是在Postgres / Redshift中?
您似乎想要:
SELECT ts.snapshot_date, AVG(ts.sales) OVER (ORDER BY ts.snapshot_date) AS sales_avg, ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) AS COUNT FROM time_series ts;
您会发现使用窗口函数效率更高。