我有以下表格:
Order ---- ID (pk) OrderItem ---- OrderID (fk -> Order.ID) ItemID (fk -> Item.ID) Quantity Item ---- ID (pk)
我该如何编写一个查询,以选择Orders与特定对象至少85%相似的所有对象Order?
Orders
Order
我考虑过使用Jaccard Index统计量来计算两个相似度Orders。(通过将每个集合的交集OrderItems除以每个集合的并集OrderItems)
OrderItems
但是,我想不出一种方法来存储两个的每个可能的Jaccard Index Orders。 还有另一种方法吗?
另外,是否有一种方法可以将Quantity每个匹配项的差异包括OrderItem在内?
Quantity
OrderItem
附加信息:
总计Orders:〜79K 总OrderItems:〜1.76米 魅力。OrderItems每位Order:21.5 总计Items:〜13k
Items
注意
85%的相似度数字只是对客户实际需求的最佳猜测,将来可能会改变。适用于任何相似性的解决方案将是更可取的。
您指定
如何编写查询以选择与特定订单至少相似85%的所有订单?
与“至少相似度为85%的所有成对订单”相比,这是一个重要的简化。
我们将使用一些TDQD(测试驱动的查询设计)和一些分析来帮助我们。
要远程相似,两个订单必须至少有一个共同点。此查询可用于确定哪些订单与指定订单至少具有一个共同点:
SELECT DISTINCT I1.OrderID AS ID FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID>
这会大大减少要检查的其他订单的列表,尽管如果指定的订单包含您最受欢迎的商品之一,则很可能很多其他订单也这样做。
除了DISTINCT,您可以使用:
SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID
这将为您提供与指定订单共有的订单中的商品数量。我们还需要每个订单中的商品数量:
SELECT OrderID AS ID, COUNT(*) AS Num_Total FROM OrderItem GROUP BY OrderID;
对于100%的相似性,这两个订单的共同点与每个物品的点数一样多。但是,这可能找不到很多成对的订单。我们可以很容易地找到与指定订单完全相同的商品的订单:
SELECT L1.ID FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total FROM OrderItem GROUP BY OrderID ) AS L1 JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;
编辑: 结果证明不够严格;为了使订单相同,指定订单中的商品数也必须与共同的商品数相同:
SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total FROM OrderItem GROUP BY OrderID ) AS L1 JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS L3 ON L2.Num_Common = L3.Num_Total;
将 Wikipedia上定义的Jaccard相似度应用于具有| A |的两个顺序A和B。是顺序A中项数的计数,Jaccard相似度 J(A,B)= |A∩B| ÷|A∪B| ,其中|A∩B| 是两个订单共有的项目数,| A commonB | 是订购的不同商品的总数。
为了满足“ 85%提卡相似度”标准,如果任一订单中的商品数量少于某个阈值,则订单必须相同。例如,如果订单A和B都有5个商品,但是两者之间有一个商品不同,那么它将为您提供4个共同商品(|A∩B|)和总共6个商品(|A∪B|) ,因此Jaccard相似度J(A,B)仅为66⅔%。
对于85%的相似度,当两个订单中的每一个中有N个项目,而一个项目不同时, (N-1)÷(N + 1)≥0.85 ,这意味着N> 12(准确地说是12⅓)。对于分数F = J(A,B),可以用一项不同的均值 (N-1)÷(N + 1)≥F 来求解N, 得出N≥(1 + F)÷(1- F) 。随着相似性要求的提高,N的值越大,阶数必须相同。
进一步概括一下,让我们假设我们有N个和M个项目的不同大小顺序(不失一般性,N <M)。|A∩B|的最大值 现在是N,且|A∪B|的最小值 是M(表示所有较小顺序的项目都按较大顺序出现)。让我们定义M = N + ∆,并且存在∂个项,这些项以较小的顺序显示,而没有以较大的顺序显示。因此,存在以较大顺序显示的∆ +∂项目,而不是以较小顺序显示的项目。
根据定义,|A∩B| =N-∂,并且|A∪B| =(N-∂)+∂+(N + ∆-(N-∂)),其中三个相加项表示(1)两个订单之间共有的项目数,(2)仅在(3)仅以较大顺序排列的商品数量。简化为:|A∪B| = N + ∆ +∂。
对于相似分数F,我们对J(A,B)≥F的订单对感兴趣,因此:
(N-∂)÷(N + ∆ +∂)≥F F≤(N-∂)÷(N + ∆ +∂)
(N-∂)÷(N + ∆ +∂)≥F
F≤(N-∂)÷(N + ∆ +∂)
我们可以使用电子表格来绘制它们之间的关系。对于给定数量的较小顺序(x轴)的项目,以及给定的相似度,我们可以绘制∂的最大值,以得出与我们相似的F。公式为:
∂=(N(1-F)-F∆)÷(1 + F)
对于常数F,这是一个在N和∆中的线性方程。对于不同的F值,它是非线性的。显然,∂必须是一个非负整数。
给定F = 0.85,对于相同大小的订单(∆ = 0),对于1≤N <13,∂= 0;对于13≤N <25,∂≤1; 对于25≤N <37,∂≤2,对于37≤N <50,∂≤3。
对于相差1(∆ = 1)的阶,对于1≤N <18,∂= 0;对于18≤N <31,∂≤1; 对于31≤N <43,∂≤2; 等等。如果∆ = 6,则您需要N = 47才能使订单仍与∂= 1相似的85%。这意味着小订单有47个项目,其中46个与大订单有53个项目相同。
到目前为止,一切都很好。我们如何应用该理论来选择类似于指定订单的订单?
首先,我们观察到指定的订单可能与相似的订单大小相同,或者更大或更小。这使事情变得有些复杂。
上式的参数为:
在顶部开发的查询中使用细微变化可以使用的值:
对应查询:
SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID; SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID; SELECT I1.OrderID AS ID, COUNT(*) AS NC FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID
为了方便起见,我们希望获得N和N + ∆(因此还有∆)值,因此我们可以使用UNION适当地安排事物,其中:
在UNION查询的第二个版本中,具有:
这两个查询都保留两个订单ID号,以便您以后可以追溯到其余的订单信息。
SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL FROM (SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS v1 JOIN (SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ) AS v2 ON v1.NA <= v2.NB UNION SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL FROM (SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS v1 JOIN (SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ) AS v2 ON v1.NA > v2.NB
这为我们提供了一个表格表达式,其中包含列OrderID_1,NS,OrderID_2,NL,其中NS是“较小顺序”中的项目数,而NL是较大顺序中的项目数。由于v1和v2表表达式生成的订单号没有重叠,因此无需担心OrderID值相同的“反身”条目。也可以在UNION查询中最容易地将NC添加到此:
SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC FROM (SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS v1 JOIN (SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ) AS v2 ON v1.NA <= v2.NB JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) AS v3 ON v3.ID = v2.ID UNION SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC FROM (SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS v1 JOIN (SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ) AS v2 ON v1.NA > v2.NB JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) AS v3 ON v3.ID = v1.ID
这为我们提供了一个表格表达式,其中包含列OrderID_1,NS,OrderID_2,NL,NC,其中NS是“较小顺序”中的项目数,NL是较大顺序中的项目数,而NC是“较小顺序”中的项目数共同。
给定NS,NL,NC,我们正在寻找满足以下条件的订单:
(N-∂)÷(N + ∆ +∂)≥F.
∂—较小顺序中的项目数不匹配较大顺序中的项
NS = N-较小顺序的项目数
NL = N + ∆-较大顺序的项目数
因此,条件必须是:
NC / (NL + (NS - NC)) ≥ F
LHS上的术语必须以浮点数而不是整数表达式的形式求值。将其应用于上面的UNION查询,将产生:
SELECT OrderID_1, NS, OrderID_2, NL, NC, CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC FROM (SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS v1 JOIN (SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ) AS v2 ON v1.NA <= v2.NB JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) AS v3 ON v3.ID = v2.ID UNION SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC FROM (SELECT OrderID AS ID, COUNT(*) AS NA FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ) AS v1 JOIN (SELECT OrderID AS ID, COUNT(*) AS NB FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ) AS v2 ON v1.NA > v2.NB JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) AS v3 ON v3.ID = v1.ID ) AS u WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
您可能会注意到,该查询仅使用OrderItem表。不需要订单和项目表。
警告:部分测试过的SQL(提示符)。现在,上面的SQL似乎对微小的数据集产生了合理的答案。我调整了相似度要求(先是0.25,然后是0.55),然后得出了合理的值和适当的选择性。但是,我的测试数据按最大顺序只有8项,并且肯定没有涵盖所描述数据的全部范围。由于我最常使用的DBMS不支持CTE,因此下面的SQL未经测试。但是,我有一定的信心,除非犯了大错,否则版本1中的CTE代码(具有大量重复的指定订单ID)应该是干净的。我认为版本2也可以,但是…未经测试。
可能有更紧凑的方式来表达查询,可能使用OLAP函数。
如果要对此进行测试,我将创建一个表,其中包含一些具有代表性的订单项集,检查返回的相似性度量是否合理。如图所示,我或多或少地会处理查询,逐渐建立复杂的查询。如果其中一个表达式显示有缺陷,那么我将对该查询进行适当的调整,直到修复缺陷为止。
显然,性能将是一个问题。最内层的查询并不是很复杂,但并不是微不足道的。但是,测量将显示这是一个引人注目的问题还是令人讨厌的问题。研究查询计划可能会有所帮助。似乎很可能在OrderItem.OrderID上有一个索引;如果没有这样的索引,查询就不太可能执行良好。因为它是外键列,所以这不太可能成为问题。
使用“ WITH子句”(通用表表达式)可能会带来一些好处。他们将明确表示在UNION子查询的两半中隐含的重复。
当表达式相同时,使用公用表表达式可以使优化程序清晰明了,并且可以帮助其更好地执行。他们还帮助人们阅读您的查询。上面的查询确实请求使用CTE。
版本1:重复指定的订单号
WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA -- Specified Order (SO) FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ), OO AS (SELECT OrderID AS ID, COUNT(*) AS NB -- Other orders (OO) FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID ), CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC -- Common Items (CI) FROM OrderItem AS I1 JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID ) SELECT OrderID_1, NS, OrderID_2, NL, NC, CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC FROM SO AS v1 JOIN OO AS v2 ON v1.NA <= v2.NB JOIN CI AS v3 ON v3.ID = v2.ID UNION SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC FROM SO AS v1 JOIN OO AS v2 ON v1.NA > v2.NB JOIN CI AS v3 ON v3.ID = v1.ID ) AS u WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
版本2:避免重复指定的订单号
WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA -- Specified Order (SO) FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID ), OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB -- Other orders (OO) FROM OrderItem AS OI JOIN SO ON OI.OrderID != SO.ID GROUP BY OI.OrderID ), CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC -- Common Items (CI) FROM OrderItem AS I1 JOIN SO AS S1 ON I1.OrderID != S1.ID JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID JOIN SO AS S2 ON I2.OrderID = S2.ID GROUP BY I1.OrderID ) SELECT OrderID_1, NS, OrderID_2, NL, NC, CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC FROM SO AS v1 JOIN OO AS v2 ON v1.NA <= v2.NB JOIN CI AS v3 ON v3.ID = v2.ID UNION SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC FROM SO AS v1 JOIN OO AS v2 ON v1.NA > v2.NB JOIN CI AS v3 ON v3.ID = v1.ID ) AS u WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
这些都不容易读;两者都比大写CTE的大型SELECT容易。
这不足以进行良好的测试。它给出了一点点信心(并且确实显示了“相同顺序”查询的问题。
CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY); CREATE TABLE Item (ID SERIAL NOT NULL PRIMARY KEY); CREATE TABLE OrderItem ( OrderID INTEGER NOT NULL REFERENCES Order, ItemID INTEGER NOT NULL REFERENCES Item, Quantity DECIMAL(8,2) NOT NULL ); INSERT INTO Order VALUES(1); INSERT INTO Order VALUES(2); INSERT INTO Order VALUES(3); INSERT INTO Order VALUES(4); INSERT INTO Order VALUES(5); INSERT INTO Order VALUES(6); INSERT INTO Order VALUES(7); INSERT INTO Item VALUES(111); INSERT INTO Item VALUES(222); INSERT INTO Item VALUES(333); INSERT INTO Item VALUES(444); INSERT INTO Item VALUES(555); INSERT INTO Item VALUES(666); INSERT INTO Item VALUES(777); INSERT INTO Item VALUES(888); INSERT INTO Item VALUES(999); INSERT INTO OrderItem VALUES(1, 111, 1); INSERT INTO OrderItem VALUES(1, 222, 1); INSERT INTO OrderItem VALUES(1, 333, 1); INSERT INTO OrderItem VALUES(1, 555, 1); INSERT INTO OrderItem VALUES(2, 111, 1); INSERT INTO OrderItem VALUES(2, 222, 1); INSERT INTO OrderItem VALUES(2, 333, 1); INSERT INTO OrderItem VALUES(2, 555, 1); INSERT INTO OrderItem VALUES(3, 111, 1); INSERT INTO OrderItem VALUES(3, 222, 1); INSERT INTO OrderItem VALUES(3, 333, 1); INSERT INTO OrderItem VALUES(3, 444, 1); INSERT INTO OrderItem VALUES(3, 555, 1); INSERT INTO OrderItem VALUES(3, 666, 1); INSERT INTO OrderItem VALUES(4, 111, 1); INSERT INTO OrderItem VALUES(4, 222, 1); INSERT INTO OrderItem VALUES(4, 333, 1); INSERT INTO OrderItem VALUES(4, 444, 1); INSERT INTO OrderItem VALUES(4, 555, 1); INSERT INTO OrderItem VALUES(4, 777, 1); INSERT INTO OrderItem VALUES(5, 111, 1); INSERT INTO OrderItem VALUES(5, 222, 1); INSERT INTO OrderItem VALUES(5, 333, 1); INSERT INTO OrderItem VALUES(5, 444, 1); INSERT INTO OrderItem VALUES(5, 555, 1); INSERT INTO OrderItem VALUES(5, 777, 1); INSERT INTO OrderItem VALUES(5, 999, 1); INSERT INTO OrderItem VALUES(6, 111, 1); INSERT INTO OrderItem VALUES(6, 222, 1); INSERT INTO OrderItem VALUES(6, 333, 1); INSERT INTO OrderItem VALUES(6, 444, 1); INSERT INTO OrderItem VALUES(6, 555, 1); INSERT INTO OrderItem VALUES(6, 777, 1); INSERT INTO OrderItem VALUES(6, 888, 1); INSERT INTO OrderItem VALUES(6, 999, 1); INSERT INTO OrderItem VALUES(7, 111, 1); INSERT INTO OrderItem VALUES(7, 222, 1); INSERT INTO OrderItem VALUES(7, 333, 1); INSERT INTO OrderItem VALUES(7, 444, 1); INSERT INTO OrderItem VALUES(7, 555, 1); INSERT INTO OrderItem VALUES(7, 777, 1); INSERT INTO OrderItem VALUES(7, 888, 1); INSERT INTO OrderItem VALUES(7, 999, 1); INSERT INTO OrderItem VALUES(7, 666, 1);