Merging through fuzzy matching of variables in R

小编典典

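I have two data frames (X, Y) in which the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), about 60% of the values do not align, even though I can eyeball the data and see that they should. Is there a way to reduce the level of mismatch somehow, so that manual editing at least becomes feasible? The data frames have about 700K observations.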

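R would be best. I know some Python and some basic Unix tools. P.S. I have read up on agrep(), but I don't understand how to apply it to an actual data set, especially when the match involves more than one variable.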

Update (data for the posted bounty):

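Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename, or c) both.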

You can source the file test_sites.R, which is posted as a gist.

Ideally, the answer would end with

merge(sites_a, sites_b, by = **magic**)

2020-06-07

1 answer

小编典典

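The agrep function (part of base R), which does approximate string matching using the generalized Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution, but here is a suggestion. It records the matches in a separate list (if there are several equally good matches, these are recorded as well). Let's say your data.frame is called df: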

l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for (i in seq_len(nrow(df))) {
  ## first look for an exact match
  father_id <- with(df, which(student_name[i] == father_name))
  if (length(father_id) == 1) {
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## tighten the tolerance step by step until the match is unique
    for (m in 10:1) { ## m is the maximum distance
      father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
      if (length(father_id) == 1 || (m == 1 && length(father_id) > 0)) {
        ## if we find a unique match, or still have candidates in our last round, then stop
        matches[['father']][[i]] <- father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway
        matches[['father']][[i]] <- old_father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop
        break
      }
      ## remember this round's candidates before tightening further
      old_father_id <- father_id
    }
  }
}

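The code for mother_name would be basically the same. You could even put them together in one loop, but this example is only for the purpose of illustration.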
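For matching on several variables across two data frames, one possible approach is to paste the ID columns into a single normalised key and link each row to its nearest key by edit distance before calling merge. The following is only a minimal sketch under stated assumptions: it swaps agrep for adist (also in base R's utils package), the toy data and the helper names normalise, make_key and nearest_row are hypothetical, and the threshold of 3 edits is arbitrary.

```r
# Hypothetical toy data: column names follow the question, values are made up.
x <- data.frame(
  student_name = c("Ann Lee", "Tom  Smith", "Maria Gomez"),
  father_name  = c("Sam Lee", "Jon Smith", "Luis Gomez"),
  score_x      = c(1, 2, 3),
  stringsAsFactors = FALSE
)
y <- data.frame(
  student_name = c("Tom Smith", "Anm Lee", "Maria Gomez"),
  father_name  = c("Jon Smith", "Sam Lee", "Luis Gonez"),
  score_y      = c(20, 10, 30),
  stringsAsFactors = FALSE
)

# Lower-case, trim and collapse whitespace so that trivial differences
# do not count as edits.
normalise <- function(s) tolower(gsub("\\s+", " ", trimws(s)))

# Combine several ID columns into one key per row.
make_key <- function(d, cols) {
  do.call(paste, c(lapply(d[cols], normalise), sep = "|"))
}

# For each key in `a`, return the index of the single nearest key in `b`,
# or NA if the nearest key is tied or more than `max_dist` edits away.
# Working one row at a time avoids building the full distance matrix.
nearest_row <- function(a, b, max_dist = 3) {
  vapply(seq_along(a), function(i) {
    d <- as.vector(adist(a[i], b))
    best <- which(d == min(d))
    if (length(best) == 1 && d[best] <= max_dist) best else NA_integer_
  }, integer(1))
}

cols <- c("student_name", "father_name")
x$y_row <- nearest_row(make_key(x, cols), make_key(y, cols))
y$y_row <- seq_len(nrow(y))

# An ordinary merge on the linking column; unmatched rows of x are kept
# with NA in the columns coming from y, ready for manual editing.
merged <- merge(x, y, by = "y_row", all.x = TRUE, suffixes = c(".x", ".y"))
print(merged)
```

With around 700K rows on each side, comparing every key against every other key is likely to be too slow, so in practice one would probably first restrict the candidates, for example to rows that share the first letter of student_name, and run the distance step only within those groups.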

2020-06-07