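This question already has an answer for SQL, and I was able to implement that solution in R using sqldf. However, I have been unable to find a way to implement it using data.table.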
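The problem is to count the distinct values of one column over a rolling date range. For example (quoting directly from the linked question), if the data looked like this: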

    Date   | email
    -------+----------------
    1/1/12 | test@test.com
    1/1/12 | test1@test.com
    1/1/12 | test2@test.com
    1/2/12 | test1@test.com
    1/2/12 | test2@test.com
    1/3/12 | test@test.com
    1/4/12 | test@test.com
    1/5/12 | test@test.com
    1/5/12 | test@test.com
    1/6/12 | test@test.com
    1/6/12 | test@test.com
    1/6/12 | test1@test.com

If we used a date range of 3 days, the result set would look something like this:

    date   | count(distinct email)
    -------+------
    1/1/12 | 3
    1/2/12 | 3
    1/3/12 | 3
    1/4/12 | 3
    1/5/12 | 2
    1/6/12 | 2

Here is code to create the same data in R using data.table:

    library(data.table)

    # One row per (date, email) event, matching the table above
    date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                      '2012-01-02','2012-01-02','2012-01-03',
                      '2012-01-04','2012-01-05','2012-01-05',
                      '2012-01-06','2012-01-06','2012-01-06'))
    email <- c('test@test.com', 'test1@test.com','test2@test.com',
               'test1@test.com', 'test2@test.com','test@test.com',
               'test@test.com','test@test.com','test@test.com',
               'test@test.com','test@test.com','test1@test.com')
    dt <- data.table(date, email)

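To pin down the intended semantics, here is a brute-force base R sketch of "distinct emails in the 3-day window ending on each date". It is only a reference definition, not a data.table solution, and it would scale poorly to large data; the names `days` and `rolling_distinct` are made up for illustration.

```r
# Same toy data as above, rebuilt so the block is self-contained
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')

days <- unique(date)

# For each day d, count the distinct emails seen in [d - 2, d]
rolling_distinct <- vapply(seq_along(days), function(i) {
  in_window <- date >= days[i] - 2 & date <= days[i]
  length(unique(email[in_window]))
}, integer(1))

print(data.frame(date = days, count = rolling_distinct))
```

On this sample data the window ending 2012-01-05 contains only test@test.com, so the sketch gives 1 for that date rather than the 2 shown in the table quoted from the linked question.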
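Any help with this would be much appreciated. Thanks!
Edit 1:
This is a toy problem; I want to apply it to a much larger data set, so using a Cartesian product is problematic. Instead, I would like something equivalent to a correlated subquery in SQL. For example, the solution to the question I originally linked was: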

    SELECT day
         ,(SELECT count(DISTINCT email)
           FROM   tbl
           WHERE  day BETWEEN t.day - 2 AND t.day -- period of 3 days
          ) AS dist_emails
    FROM   tbl t
    WHERE  day BETWEEN '2012-01-01' AND '2012-01-06'
    GROUP  BY 1
    ORDER  BY 1;

Edit 2: Here are some timings based on @MichaelChirico's solution, as requested by @jangorecki:

    # The data
    > dim(temp)
    [1] 2627785       4
    > head(temp)
             date category1 category2 itemId
    1: 2013-11-08         0         2   1713
    2: 2013-11-08         0         2  90485
    3: 2013-11-08         0         2  74249
    4: 2013-11-08         0         2   2592
    5: 2013-11-08         0         2   2592
    6: 2013-11-08         0         2    765
    > uniqueN(temp$itemId)
    [1] 13510
    > uniqueN(temp$date)
    [1] 127

    # Timing for data.table
    > system.time(dtTime <- temp[,
    +   .(count = temp[.(seq.Date(.BY$date - 6L, .BY$date, "day"),
    +                    .BY$category1, .BY$category2 ), uniqueN(itemId), nomatch = 0L]),
    +   by = c("date","category1","category2")])
       user  system elapsed
      6.913   0.130   6.940

    > # Time for sqldf
    > system.time(sqlDfTime <-
    +   sqldf(c("create index ldx on temp(date, category1, category2)",
    +           "SELECT date, category1, category2,
    + (SELECT count(DISTINCT itemId)
    + FROM temp
    + WHERE category1 = t.category1 AND category2 = t.category2 AND
    + date BETWEEN t.date - 6 AND t.date
    + ) AS numItems
    + FROM temp t
    + GROUP BY date, category1, category2
    + ORDER BY 1;")))
       user  system elapsed
     87.225   0.098  87.295

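The outputs are equivalent, but using data.table instead of sqldf gives a speedup of roughly 12.5x (about 6.9 s versus 87.3 s elapsed). Quite impressive!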
Here is an approach that works, using the new non-equi join feature of data.table:

    dt[dt[ , .(date3 = date, date2 = date - 2, email)],
       on = .(date >= date2, date <= date3),
       allow.cartesian = TRUE
       ][ , .(count = uniqueN(email)),
          by = .(date = date + 2)]
    #          date count
    # 1: 2012-01-01     3
    # 2: 2012-01-02     3
    # 3: 2012-01-03     3
    # 4: 2012-01-04     3
    # 5: 2012-01-05     1
    # 6: 2012-01-06     2

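Honestly, I'm a bit unclear on exactly how this works, but the idea is to join dt to itself on date, matching any date that falls between two days ago and today. I'm not sure why we have to clean up by setting date = date + 2 afterwards.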
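A possible explanation for the cleanup step, as far as I understand data.table's non-equi join output: in the joined result, the x column named in `on` carries the values from i, so `date` ends up holding `date2` (the window start) rather than the matched date. The sketch below inspects that and then uses the `i.` and `x.` prefixes in `j` to keep the window end directly, which avoids the `+ 2`. The names `win`, `joined`, `res` and `window_end` are illustrative and not part of the original answer.

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(
  date  = as.Date("2012-01-01") + c(0, 0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 5),
  email = c("test@test.com", "test1@test.com", "test2@test.com",
            "test1@test.com", "test2@test.com", "test@test.com",
            "test@test.com", "test@test.com", "test@test.com",
            "test@test.com", "test@test.com", "test1@test.com")
)

# One window [date2, date3] per original row
win <- dt[, .(date3 = date, date2 = date - 2, email)]

# Plain non-equi join: the result's `date` column holds the window start
joined <- dt[win, on = .(date >= date2, date <= date3), allow.cartesian = TRUE]
print(range(joined$date))

# Keep the window end (from i) and the matched email (from x) explicitly
res <- dt[win, on = .(date >= date2, date <= date3),
          .(window_end = i.date3, email = x.email),
          allow.cartesian = TRUE
          ][, .(count = uniqueN(email)), by = window_end]
print(res)
```

This still materialises every (window, matching row) pair before aggregating, so it shares the memory cost of the original version.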
Here is an approach using keys:

    setkey(dt, date)
    # For each date, look up the 3 days ending on it and count distinct emails
    dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
                       uniqueN(email), nomatch = 0L]),
        by = date]

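Since the question worries about Cartesian blow-up on larger data, one further variant (a sketch, not taken from the answer above) is to build one window per distinct date and aggregate during the non-equi join with `by = .EACHI`, so the distinct count is computed per window without first materialising the full joined table. The `windows` table and its column names `win_start`, `win_end` and `count` are assumptions made for illustration.

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(
  date  = as.Date("2012-01-01") + c(0, 0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 5),
  email = c("test@test.com", "test1@test.com", "test2@test.com",
            "test1@test.com", "test2@test.com", "test@test.com",
            "test@test.com", "test@test.com", "test@test.com",
            "test@test.com", "test@test.com", "test1@test.com")
)

# One row per distinct date, with the 3-day window ending on that date
windows <- data.table(win_end = unique(dt$date))
windows[, win_start := win_end - 2]

# by = .EACHI evaluates j once per row of `windows`,
# over the rows of dt that fall inside that window
agg <- dt[windows, on = .(date >= win_start, date <= win_end),
          .(count = uniqueN(email)), by = .EACHI]

# Rows of `agg` come back in the order of `windows`
windows[, count := agg$count]
print(windows)
```

Whether this beats the keyed lookup on real data would need to be timed; no benchmark is implied here.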