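I need some help writing/optimizing a query that retrieves the latest version of each row by type and performs some calculations depending on the type. I think it is best illustrated with an example.
Given the following dataset: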
+----+---------------+---------------------+------------+---------------------+--------+----------+
| id | event_type    | event_timestamp     | message_id | sent_at             | status | rate     |
+----+---------------+---------------------+------------+---------------------+--------+----------+
|  1 | create        | 2016-11-25 09:17:48 |          1 | 2016-11-25 09:17:48 |      0 | 0.500000 |
|  2 | status_update | 2016-11-25 09:24:38 |          1 | 2016-11-25 09:28:49 |      1 | 0.500000 |
|  3 | create        | 2016-11-25 09:47:48 |          2 | 2016-11-25 09:47:48 |      0 | 0.500000 |
|  4 | status_update | 2016-11-25 09:54:38 |          2 | 2016-11-25 09:48:49 |      1 | 0.500000 |
|  5 | rate_update   | 2016-11-25 09:55:07 |          2 | 2016-11-25 09:50:07 |      0 | 1.000000 |
|  6 | create        | 2016-11-26 09:17:48 |          3 | 2016-11-26 09:17:48 |      0 | 0.500000 |
|  7 | create        | 2016-11-27 09:17:48 |          4 | 2016-11-27 09:17:48 |      0 | 0.500000 |
|  8 | rate_update   | 2016-11-27 09:55:07 |          4 | 2016-11-27 09:50:07 |      0 | 2.000000 |
|  9 | rate_update   | 2016-11-27 09:55:07 |          2 | 2016-11-25 09:55:07 |      0 | 2.000000 |
+----+---------------+---------------------+------------+---------------------+--------+----------+
The expected result should be:
+------------+--------------------+--------------------+-----------------+
| sent_at    | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total) |
+------------+--------------------+--------------------+-----------------+
| 2016-11-25 |                  2 |                  2 |        2.500000 |
| 2016-11-26 |                  1 |                  0 |        0.500000 |
| 2016-11-27 |                  1 |                  0 |        2.000000 |
+------------+--------------------+--------------------+-----------------+
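At the end of the post is the query I use to get this result. I would be willing to bet there is a way to optimize it, because it uses subqueries with joins, and from what I have read about BigQuery, joins are best avoided. But first, some background:
In essence, the dataset represents an append-only table to which multiple events are written. The data is in the hundreds of millions of rows and will grow to billions and beyond. Since updates in BigQuery are impractical and the data is being streamed into BQ, I need a way to retrieve the most recent event of each type, perform some calculations based on certain conditions, and return accurate results. The query is generated dynamically from user input, so it can include more fields/calculations, but those have been omitted for simplicity.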
The event types are create, rate_update and status_update.
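So I guess I have a few questions.
Really, any suggestions on how to query this dataset efficiently and in a BigQuery-friendly way are more than welcome! Thanks! :)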
The monstrosity I came up with is below. The INNER JOINs are used to retrieve the latest version of each row, as per this resource.
SELECT
  sent_at AS sent_at,
  SUM(submitted_msg) AS submitted,
  SUM(delivered_msg) AS delivered,
  SUM(sales_rate_total) AS sales_rate_total
FROM (
  # DELIVERED: latest status_update per message
  SELECT
    d.message_id,
    FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
    0 AS submitted_msg,
    SUM(IF(status = 1, 1, 0)) AS delivered_msg,
    0 AS sales_rate_total
  FROM `events` d
  INNER JOIN (
    SELECT message_id, MAX(event_timestamp) AS ts
    FROM `events`
    WHERE event_type = "status_update"
    GROUP BY 1
  ) g
  ON d.message_id = g.message_id AND d.event_timestamp = g.ts
  GROUP BY 1, 2

  UNION ALL

  # SALES RATE: latest rate_update or create per message
  SELECT
    s.message_id,
    FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
    0 AS submitted_msg,
    0 AS delivered_msg,
    SUM(rate) AS sales_rate_total  # the sample table above names this column rate
  FROM `events` s
  INNER JOIN (
    SELECT message_id, MAX(event_timestamp) AS ts
    FROM `events`
    WHERE event_type IN ("rate_update", "create")
    GROUP BY 1
  ) f
  ON s.message_id = f.message_id AND s.event_timestamp = f.ts
  GROUP BY 1, 2

  UNION ALL

  # SUBMITTED & REST: latest create per message
  SELECT
    r.message_id,
    FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
    SUM(IF(status = 0, 1, 0)) AS submitted_msg,
    0 AS delivered_msg,
    0 AS sales_rate_total
  FROM `events` r
  INNER JOIN (
    SELECT message_id, MAX(event_timestamp) AS ts
    FROM `events`
    WHERE event_type = "create"
    GROUP BY 1
  ) e
  ON r.message_id = e.message_id AND r.event_timestamp = e.ts
  GROUP BY 1, 2
) k
GROUP BY 1
How can this query be optimized?
Try the version below:
#standardSQL
WITH types AS (
  -- For every row, pick the status/rate from the newest event of the relevant type.
  -- Ordering by the boolean DESC puts rows of the wanted type first.
  SELECT
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
    FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
  FROM events
),
latest AS (
  -- Collapse to one row per day and message; the window values are identical within a message.
  SELECT
    sent_at,
    message_id,
    ANY_VALUE(IF(submitted_status = 0, 1, 0)) AS submitted,
    ANY_VALUE(IF(delivered_status = 1, 1, 0)) AS delivered,
    ANY_VALUE(sales_rate) AS sales_rate
  FROM types
  GROUP BY 1, 2
)
-- Aggregate per day.
SELECT
  sent_at,
  SUM(submitted) AS submitted,
  SUM(delivered) AS delivered,
  SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
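To see why the ORDER BY inside FIRST_VALUE picks the right row, here is a minimal sketch on three made-up rows for a single message. The table name toy_events and its values are assumptions for illustration only, not part of the original data.

```sql
#standardSQL
-- Hypothetical toy rows for one message; the values are invented for this sketch.
WITH toy_events AS (
  SELECT 'create' AS event_type, TIMESTAMP '2016-11-25 09:00:00' AS event_timestamp, 0 AS status
  UNION ALL
  SELECT 'status_update', TIMESTAMP '2016-11-25 09:05:00', 1
  UNION ALL
  SELECT 'rate_update', TIMESTAMP '2016-11-25 09:10:00', 0
)
SELECT
  event_type,
  event_timestamp,
  -- TRUE sorts before FALSE under DESC, so status_update rows come first;
  -- among those, the newest event_timestamp wins.
  FIRST_VALUE(status) OVER (
    ORDER BY (event_type = 'status_update') DESC, event_timestamp DESC
  ) AS delivered_status
FROM toy_events
```

Every output row carries delivered_status = 1, the status of the only status_update row, even though rate_update is the newest event overall. If a message has no row of the wanted type, the expression falls back to the newest row of any type, which is why the outer IF checks the status value.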
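This query is compact enough to manage easily, has no redundancy, and uses no joins at all. If your table is partitioned, you can take advantage of that by adjusting the query in just one place.
If you want to check the query above on a small volume of data first, you can use the dummy data below: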
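As a rough sketch of that single adjustment, assuming events is an ingestion-time partitioned table (so the _PARTITIONTIME pseudo-column is available), the filter would go on the one FROM events inside the types CTE. The date range here is made up for illustration.

```sql
#standardSQL
-- Sketch only: assumes `events` is ingestion-time partitioned.
-- The date range is a made-up example.
SELECT
  FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
  message_id,
  event_type,
  event_timestamp,
  status,
  rate
FROM events
WHERE _PARTITIONTIME >= TIMESTAMP '2016-11-25 00:00:00'
  AND _PARTITIONTIME < TIMESTAMP '2016-11-28 00:00:00'
```

Keep in mind that the range has to cover every event of the messages you care about. If a message's create row sits in a partition outside the range, the window functions will not see it and the totals will change.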
WITH events AS (
  SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
  SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
  SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
  SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
  SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate
)