I'm new to BigQuery and SQL, and I want to build a standard SQL query that groups and counts events over a rolling window of X days. My data table looks like this:
    event_id | url    | timestamp
    -----------------------------------------------------------
    xx         a.html   2016-10-18 15:55:16 UTC
    xx         a.html   2016-10-19 16:68:55 UTC
    xx         a.html   2016-10-25 20:55:57 UTC
    yy         b.html   2016-10-18 15:58:09 UTC
    yy         a.html   2016-10-18 08:32:43 UTC
    zz         a.html   2016-10-20 04:44:22 UTC
    zz         c.html   2016-10-21 02:12:34 UTC
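I am tracking events that happen on URLs. I want to know how many times each event occurred on each URL within a rolling window of X days. When I asked this question earlier, I got a great answer: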
    WITH dailyAggregations AS (
      SELECT
        DATE(ts) AS day,
        url,
        event_id,
        UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
        COUNT(1) AS events
      FROM yourTable
      GROUP BY day, url, event_id, sec
    )
    SELECT
      url,
      event_id,
      day,
      events,
      SUM(events) OVER(
        PARTITION BY url, event_id
        ORDER BY sec
        RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
      ) AS rolling4daysEvents
    FROM dailyAggregations
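259200 is 3 days expressed in seconds (3 x 24 x 3600), so the window covers the current day plus the three preceding days, four days in total. As I understand it, the query first creates an intermediate table that groups and counts events by day. It also converts each day to its equivalent in Unix seconds. It then sums the events over a window measured in seconds.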
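The same four-day window can also be expressed in whole days rather than seconds. The sketch below is a variant, not the query from the original answer: it assumes the same placeholder names (`yourTable`, `ts`) and orders the window by `UNIX_DATE(day)`, the number of days since 1970-01-01, so that `3 PRECEDING` means three days.

```sql
-- Sketch only: yourTable and ts are the placeholder names used in the query above.
WITH dailyAggregations AS (
  -- One row per (day, url, event_id) with the number of events on that day.
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
)
SELECT
  url,
  event_id,
  day,
  events,
  -- UNIX_DATE(day) is an INT64 day number, so the RANGE frame
  -- covers the current day and the 3 days before it.
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY UNIX_DATE(day)
    RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM dailyAggregations
```

Ordering by an integer day number avoids the round trip through `TIMESTAMP` and the seconds arithmetic. Since both versions derive the day with `DATE(ts)`, the totals should match those of the seconds-based window.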
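Now, this produces a table with correct running totals, but it does not guarantee a row for every date, URL, and event. In other words, if there is a date on which an event never occurred on a given URL, that date is missing from the result table. Bottom line: can I modify the query above (or construct a different one) so that it correctly produces a rolling4daysEvents value for every date in an interval? For example, an interval defined like this: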
    SELECT *
    FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
    ORDER BY day ASC
Thanks!
    WITH dailyAggregations AS (
      SELECT
        DATE(ts) AS day,
        url,
        event_id,
        UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
        COUNT(1) AS events
      FROM yourTable
      GROUP BY day, url, event_id, sec
    ),
    calendar AS (
      SELECT day
      FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
    )
    SELECT
      c.day,
      url,
      event_id,
      events,
      SUM(events) OVER(
        PARTITION BY url, event_id
        ORDER BY sec
        RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
      ) AS rolling4daysEvents
    FROM calendar AS c
    LEFT JOIN dailyAggregations AS a
      ON a.day = c.day
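One thing to watch with a plain LEFT JOIN on the calendar: a calendar day with no events at all comes back with NULL `url`, `event_id`, and `sec`, and a day on which only some URL/event pairs occurred still has no row for the others. A possible way to get one row per date, URL, and event is to cross join the calendar with the distinct pairs before joining the counts. This is a sketch under assumptions, not a tested solution: `pairs` and `grid` are hypothetical names introduced here, `yourTable` and `ts` are the placeholders from the queries above, and the window is ordered by `UNIX_DATE(day)` with `3 PRECEDING` (three days) as a swapped-in equivalent of the 259200-second frame.

```sql
-- Sketch only: pairs and grid are illustrative names, not part of the original queries.
WITH dailyAggregations AS (
  SELECT DATE(ts) AS day, url, event_id, COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
),
calendar AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
),
pairs AS (
  -- Every (url, event_id) combination that occurs at least once.
  SELECT DISTINCT url, event_id
  FROM dailyAggregations
),
grid AS (
  -- One row per calendar day and pair; days without events count as 0.
  SELECT c.day, p.url, p.event_id, IFNULL(a.events, 0) AS events
  FROM calendar AS c
  CROSS JOIN pairs AS p
  LEFT JOIN dailyAggregations AS a
    ON a.day = c.day AND a.url = p.url AND a.event_id = p.event_id
)
SELECT
  day,
  url,
  event_id,
  events,
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY UNIX_DATE(day)
    RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM grid
ORDER BY url, event_id, day
```

Two consequences of this design: the result has (number of calendar days) x (number of distinct pairs) rows, which can grow quickly, and events dated before the first calendar day are not counted in the first few rolling totals because those days are not in the grid.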