我正在尝试查询Google BigQuery公共Reddit数据集。我的目标是使用Jaccards’Index来计算子reddit的相似性,该索引的定义如下:
我的计划是在2016年8月按评论数选择前N = 1000个子项,然后计算其笛卡尔积,以得到subreddit1, subreddit2形状中所有子项的组合。
subreddit1, subreddit2
然后使用这些组合的行来查询subreddit1和subreddit 2之间的用户并集以及交集。
我到目前为止的查询是这样的:
SELECT subreddit1, subreddit2, (SELECT COUNT(DISTINCT author) FROM `fh-bigquery.reddit_comments.2016_08` WHERE subreddit = subreddit1 OR subreddit = subreddit2 LIMIT 1 ) as subreddits_union, ( SELECT COUNT(DISTINCT author) FROM `fh-bigquery.reddit_comments.2016_08` WHERE subreddit = subreddit1 AND author IN ( SELECT author FROM `fh-bigquery.reddit_comments.2016_08` WHERE subreddit= subreddit2 GROUP BY author ) as subreddits_intersection FROM (SELECT a.subreddit as subreddit1, b.subreddit as subreddit2 FROM ( SELECT subreddit, count(*) as n_comments FROM `fh-bigquery.reddit_comments.2016_08` GROUP BY subreddit ORDER BY n_comments DESC LIMIT 1000 ) a CROSS JOIN ( SELECT subreddit, count(*) as n_comments FROM `fh-bigquery.reddit_comments.2016_08` GROUP BY subreddit ORDER BY n_comments DESC LIMIT 1000 ) b WHERE a.subreddit < b.subreddit )
理想情况下会给出以下结果:
subreddit1, subreddit2, subreddits_union, subreddits_interception ----------------------------------------------------------------- Art | Politics | 50000 | 21000 Art | Science | 92320 | 15000 ... | ... | ... | ...
但是,此查询给我以下BigQuery错误: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
我了解的。但是,我认为此查询不能转换为有效的联接。鉴于BQ没有apply方法,是否有任何方法可以设置此查询而无需诉诸单个查询?也许有一个PARTITION BY?
PARTITION BY
感谢您的回答。这个在返回subreddit联合中效果很好,但是,您将如何实现交集呢?
也许有些类似的东西
WITH top_most AS ( SELECT subreddit, count(*) as n_comments FROM `fh-bigquery.reddit_comments.2016_08` GROUP BY subreddit ORDER BY n_comments DESC LIMIT 20 ), authors AS ( SELECT DISTINCT author, subreddit FROM `fh-bigquery.reddit_comments.2016_08` ) SELECT count(DISTINCT a1.author), subreddit1, subreddit2 FROM ( SELECT t1.subreddit subreddit1, t2.subreddit subreddit2 FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000 ) INNER JOIN authors a1 on a1.subreddit = subreddit1 INNER JOIN authors a2 on a2.subreddit = subreddit2 WHERE a1.author = a2.author GROUP BY subreddit1, subreddit2 ORDER BY subreddit1, subreddit2