Fastest way to find lines of a file from another larger file in Bash

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion.
file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines of file2.txt where f1 of file1.txt matches f2 of file2.txt (or matches anywhere on the line, if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Expected output:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried, and it seems to take several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder whether there is a better and faster way to do this with common Unix commands or a small script.

2020-06-02

1 answer

A small piece of Perl code solved the problem. This is the approach taken:

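store the lines of file1.txt in a hash
read file2.txt line by line, parse and extract the second field
check whether the extracted field is in the hash; if so, print the line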

Here is the code:

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);
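
For readers more at home in Python, the same three steps can be sketched as below. This is only an illustrative equivalent, not something that was benchmarked here; the function names `load_keys` and `filter_lines` and the `field_index` parameter are made up for the example. Unlike the Perl script, it strips the trailing newline before splitting.

```python
def load_keys(small_path):
    """Step 1: read the small file into a set, one key per line."""
    with open(small_path, encoding="utf-8", errors="replace") as fh:
        return {line.rstrip("\n") for line in fh}


def filter_lines(keys, lines, sep="|", field_index=1):
    """Steps 2 and 3: yield each line whose chosen field is in keys.

    `lines` can be any iterable of strings, including an open file,
    so the big file is streamed rather than loaded into memory.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Lines with too few fields are skipped, like the defined() check in Perl.
        if len(fields) > field_index and fields[field_index] in keys:
            yield line


# Hypothetical usage with the file names from the question:
#   import sys
#   keys = load_keys("file1.txt")
#   with open("file2.txt", encoding="utf-8", errors="replace") as big:
#       sys.stdout.writelines(filter_lines(keys, big))
```

The set plays the role of the Perl hash: membership tests take roughly constant time, so the total work grows with the number of lines in the big file rather than with the product of the two file sizes.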

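I ran the above script with 14K lines in file1.txt and 130M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for that run: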

real    0m11.694s
user    0m11.507s
sys     0m0.174s

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

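It was far slower than the Perl solution, since it loops 14K times for each line in file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took: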

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys     0m5.095s
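
To make the cost difference concrete, here is a small Python sketch that counts how much work each strategy does. The data is synthetic and the function names are invented for the illustration; it does not reproduce the awk run above. As a side note, the error message suggests that match() interpreted one of the keys as a regular expression, which a literal comparison such as the one below avoids.

```python
def match_by_scan(keys, lines):
    """Try every key against every line; returns (matches, comparisons)."""
    matches, comparisons = [], 0
    for line in lines:
        for key in keys:
            comparisons += 1
            if key in line:  # literal substring test, not a regex
                matches.append(line)
                break
    return matches, comparisons


def match_by_lookup(keys, lines, sep="|"):
    """One set lookup per line on the second field; returns (matches, lookups)."""
    key_set = set(keys)
    matches, lookups = [], 0
    for line in lines:
        fields = line.split(sep)
        lookups += 1
        if len(fields) > 1 and fields[1] in key_set:
            matches.append(line)
    return matches, lookups


keys = [f"key{i}" for i in range(1000)]
lines = [f"date|miss{i}|n" for i in range(100)]  # synthetic lines, none match

print(match_by_scan(keys, lines)[1])    # 100000 comparisons (100 lines * 1000 keys)
print(match_by_lookup(keys, lines)[1])  # 100 lookups (one per line)
```

With 14K keys and hundreds of millions of lines, the first strategy multiplies the two sizes together for non-matching lines, while the second stays proportional to the number of lines.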

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys     0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys     0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

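I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looks like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output: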

real    895m14.862s
user    806m59.219s
sys     1m12.147s
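
@oliv's code is not reproduced in this post, so the following is not their implementation. It is a rough sketch, on made-up data, of the general contrast between one large regex alternation and a set lookup on the second field. The sizes and names are arbitrary, and the timings will vary by machine, so none are quoted.

```python
import re
import timeit

keys = [f"foo{i}" for i in range(1000)]
lines = [f"date{i}|foo{i * 7}|number{i}" for i in range(1000)]

# One big alternation anchored to the second field; re.escape keeps keys literal.
pattern = re.compile(r"^[^|]*\|(?:" + "|".join(map(re.escape, keys)) + r")\|")
key_set = set(keys)


def by_regex():
    """Match each line against the large alternation."""
    return [ln for ln in lines if pattern.match(ln)]


def by_set():
    """Split each line and look the second field up in a set."""
    return [ln for ln in lines if ln.split("|")[1] in key_set]


# Both strategies should select exactly the same lines.
assert by_regex() == by_set()

print("regex:", timeit.timeit(by_regex, number=3))
print("set:  ", timeit.timeit(by_set, number=3))
```

The regex engine may have to try many alternatives in turn before it accepts or rejects a line, whereas the set lookup hashes the field once, which is consistent with the gap between the Python and Perl timings reported here.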

I tried to follow the suggestion to use parallel. However, it failed with the error fgrep: memory exhausted, even with very small block sizes.

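What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, as the Perl code does.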

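I didn't check the join approach, because I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.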

Thanks, everyone, for your attention and responses.

2020-06-02