我正在努力从我们的数据库中淘汰某个客户。我注意到一种趋势,人们用与他们填写公司名称相同的名字来填写名字。因此,示例如下所示:
business_name first_name ------------- ---------- locksmith taylorsville locksmith locksmith roy locksmi locksmith clinton locks locksmith farmington locksmith
这些人是我不想被查询的人。他们是坏蛋。我正在尝试将查询与WHERE语句(大概)组合在一起,以隔离姓氏至少与公司名称部分匹配的人,但是我很困惑,可以使用一些帮助。
您可以采用基于相似性的方法 在答案底部尝试代码 它会产生如下结果
business_name partial_business_name first_name similarity locksmith taylorsville locksmith locksmith 1.0 locksmith farmington locksmith locksmith 1.0 locksmith roy locksmith locksmi 0.7777777777777778 locksmith clinton locksmith locks 0.5555555555555556
因此,您将能够根据相似度值控制要过滤的内容
代码
SELECT business_name, partial_business_name, first_name, similarity FROM JS( // input table ( SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name), (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name), (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name), (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name), ) , // input columns business_name, partial_business_name, first_name, // output schema "[{name: 'business_name', type:'string'}, {name: 'partial_business_name', type:'string'}, {name: 'first_name', type:'string'}, {name: 'similarity', type:'float'}] ", // function "function(r, emit) { var _extend = function(dst) { var sources = Array.prototype.slice.call(arguments, 1); for (var i=0; i<sources.length; ++i) { var src = sources[i]; for (var p in src) { if (src.hasOwnProperty(p)) dst[p] = src[p]; } } return dst; }; var Levenshtein = { /** * Calculate levenshtein distance of the two strings. * * @param str1 String the first string. * @param str2 String the second string. * @return Integer the levenshtein distance (0 and above). */ get: function(str1, str2) { // base cases if (str1 === str2) return 0; if (str1.length === 0) return str2.length; if (str2.length === 0) return str1.length; // two rows var prevRow = new Array(str2.length + 1), curCol, nextCol, i, j, tmp; // initialise previous row for (i=0; i<prevRow.length; ++i) { prevRow[i] = i; } // calculate current row distance from previous row for (i=0; i<str1.length; ++i) { nextCol = i + 1; for (j=0; j<str2.length; ++j) { curCol = nextCol; // substution nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 ); // insertion tmp = curCol + 1; if (nextCol > tmp) { nextCol = tmp; } // deletion tmp = prevRow[j + 1] + 1; if (nextCol > tmp) { nextCol = tmp; } // copy current col value into previous (in preparation for next iteration) prevRow[j] = curCol; } // copy last col value into previous (in preparation for next iteration) prevRow[j] = nextCol; } return nextCol; } }; var the_partial_business_name; try { the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase(); } catch (ex) { the_partial_business_name = r.partial_business_name.toLowerCase(); } try { the_first_name = decodeURI(r.first_name).toLowerCase(); } catch (ex) { the_first_name = r.first_name.toLowerCase(); } emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name, similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length}); }" ) ORDER BY similarity DESC