我有以下psql表。它总共约有20亿行。
id word lemma pos textid source 1 Stuffing stuff vvg 190568 AN 2 her her appge 190568 AN 3 key key nn1 190568 AN 4 into into ii 190568 AN 5 the the at 190568 AN 6 lock lock nn1 190568 AN 7 she she appge 190568 AN 8 pushed push vvd 190568 AN 9 her her appge 190568 AN 10 way way nn1 190568 AN 11 into into ii 190568 AN 12 the the appge 190568 AN 13 house house nn1 190568 AN 14 . . 190568 AN 15 She she appge 190568 AN 16 had have vhd 190568 AN 17 also also rr 190568 AN 18 cajoled cajole vvd 190568 AN 19 her her appge 190568 AN 20 way way nn1 190568 AN 21 into into ii 190568 AN 22 the the at 190568 AN 23 home home nn1 190568 AN 24 . . 190568 AN .. ... ... .. ... ..
我想创建一个下表,该表并排显示所有“ way”结构以及“ source”,“ lemma”和“ pos”列中的一些数据。
source word word word lemma pos word word word word word lemma pos word word AN lock she pushed push vvd her way into the house house nn1 . she AN had also cajoled cajole vvd her way into the home home nn1 . A AN tried to force force vvi her way into the palace palace nn1 , officials
在这里,您可以看到我使用的代码:
copy( SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11 WHERE c1.word LIKE '%' AND c2.word LIKE '%' AND c3.word LIKE '%' AND c4.pos LIKE 'v%' AND c5.pos = 'appge' AND c6.lemma = 'way' AND c7.pos LIKE 'i%' AND c8.word = 'the' AND c9.pos LIKE 'n%' AND c10.word LIKE '%' AND c11.word LIKE '%' AND c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id ORDER BY c1.id ) TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
该查询需要花费近9个小时来执行20亿行(结果大约有19,000行)。
我该怎么做才能提高性能?
word,pos和lemma列已经具有btree索引。
我应该坚持我的代码,仅使用功能更强大的服务器,使用更多内核/更快的CPU和更多RAM(我的RAM只有8 GB,只有2个内核和2.8 GHz)吗?还是您会建议使用其他更有效的SQL查询?
谢谢!
我建议使用现代的连接语法,它可以很好地解决此问题:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word FROM orderedflatcorpus AS c1 JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id WHERE c4.pos LIKE 'v%' AND c5.pos = 'appge' AND c6.lemma = 'way' AND c7.pos LIKE 'i%' AND c8.word = 'the' AND c9.pos LIKE 'n%'
笔记:
LIKE
ORDER BY