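Any ideas on how to get this query to return results on Google BigQuery? I'm getting a "resources exceeded" error. The dataset has about 2B rows. I'm trying to get the artist ID that appears most often for each user_id.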
SELECT user_id, artist, count(*) AS count FROM [legacy20130831.merged_data] AS d GROUP EACH BY user_id, artist ORDER BY user_id ASC, count DESC
An equivalent query over public data, which throws the same error:
SELECT actor, repository_name, count(*) AS count FROM [githubarchive:github.timeline] AS d GROUP EACH BY actor, repository_name ORDER BY actor, count desc
Compare that with the same query plus a LIMIT on the number of results returned. This works (14 seconds for me):
SELECT actor, repository_name, count(*) as count FROM [githubarchive:github.timeline] as d GROUP EACH BY actor, repository_name ORDER BY actor, count desc LIMIT 100
Besides using LIMIT, you can also process just a fraction of the user_ids. In my case, 1/3 works:
SELECT actor, repository_name, count(*) as count FROM [githubarchive:github.timeline] as d WHERE ABS(HASH(actor) % 3) = 0 GROUP EACH BY actor, repository_name
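To illustrate why filtering on a hash of the key is safe, here is a minimal Python sketch. It is not BigQuery code: `shard_of` is a hypothetical stand-in that uses `zlib.crc32` instead of BigQuery's `HASH`, and the `events` list is made-up sample data. The point it shows is that every actor lands in exactly one shard, so the per-shard counts never split an actor's rows and simply add up to the full result.

```python
import zlib
from collections import Counter


def shard_of(key: str, num_shards: int) -> int:
    """Hypothetical stand-in for ABS(HASH(actor) % num_shards).

    Uses CRC32, not BigQuery's HASH, so shard assignments will differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards


def count_pairs(events, num_shards, shard):
    """Count (actor, repository_name) pairs, keeping only one shard of actors."""
    counts = Counter()
    for actor, repo in events:
        if shard_of(actor, num_shards) == shard:
            counts[(actor, repo)] += 1
    return counts


# Made-up sample rows: (actor, repository_name)
events = [
    ("alice", "repo-a"), ("alice", "repo-a"), ("alice", "repo-b"),
    ("bob", "repo-c"),
    ("carol", "repo-a"), ("carol", "repo-d"), ("carol", "repo-d"),
]

NUM_SHARDS = 3
per_shard = [count_pairs(events, NUM_SHARDS, s) for s in range(NUM_SHARDS)]

# Merging the shard results reproduces the unsharded GROUP BY.
merged = Counter()
for shard_counts in per_shard:
    merged.update(shard_counts)
```

Because the filter depends only on the actor, each shard's query touches roughly 1/N of the groups, which is what keeps it under the resource limit.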
But what you really want is to "get the artist ID that appears most often for each user_id". Let's go one step further and get exactly that:
SELECT actor, repository_name, count
FROM (
  SELECT actor, repository_name, count,
         ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) rank
  FROM (
    SELECT actor, repository_name, count(*) as count
    FROM [githubarchive:github.timeline] as d
    WHERE ABS(HASH(actor) % 10) = 0
    GROUP EACH BY actor, repository_name
  )
)
WHERE rank=1
Note that this time I used %10, because it gets me results faster. But you might be thinking, "I want to get my results with one query, not 10."
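For intuition on why 10 sharded runs give the same answer as one big run, here is a small Python sketch of the same logic. It is only an illustration under assumptions: `shard_of` is a hypothetical CRC32-based stand-in for `ABS(HASH(actor) % 10)`, `top_repo_per_actor` mimics the `ROW_NUMBER() ... WHERE rank=1` step, and the rows are made-up sample data.

```python
import zlib
from collections import Counter


def shard_of(key: str, num_shards: int) -> int:
    """Hypothetical stand-in for ABS(HASH(actor) % num_shards); not BigQuery's HASH."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def top_repo_per_actor(events):
    """Return {actor: (repository_name, count)} for each actor's most frequent repo.

    Mirrors GROUP BY actor, repository_name followed by keeping rank 1 per actor.
    On ties, the pair seen first wins (ROW_NUMBER would pick arbitrarily).
    """
    counts = Counter(events)
    best = {}
    for (actor, repo), n in counts.items():
        if actor not in best or n > best[actor][1]:
            best[actor] = (repo, n)
    return best


# Made-up sample rows: (actor, repository_name)
events = [
    ("alice", "repo-a"), ("alice", "repo-a"), ("alice", "repo-b"),
    ("bob", "repo-c"),
    ("carol", "repo-a"), ("carol", "repo-d"), ("carol", "repo-d"),
]

NUM_SHARDS = 10
combined = {}
for shard in range(NUM_SHARDS):
    # One "query" per shard: only actors whose hash falls in this shard.
    subset = [(a, r) for a, r in events if shard_of(a, NUM_SHARDS) == shard]
    combined.update(top_repo_per_actor(subset))

print(combined == top_repo_per_actor(events))  # prints True
```

Since all of an actor's rows fall into the same shard, the winner picked inside a shard is also the global winner for that actor, so the shard results can simply be concatenated.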
You can do two things:
If you're willing to share your dataset with me, I can offer dataset-specific advice (a lot depends on cardinality).