I have a CSV file (not normalized; this is just a sample, the real file has up to 100 columns):
    ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
    1, CUST1, CLIENT1, 10, 2018-04-01, 2018-04-02
    2, CUST1, CLIENT1, 10, 2018-04-01, 2018-05-30
    3, CUST1, CLIENT1, 101, 2018-04-02, 2018-04-03
    4, CUST2, CLIENT1, 102, 2018-04-02, 2018-04-03
How can I find every set of columns that could serve as a primary key?
Desired output:
1) ID
2) PAYMENT_NUM, START_DATE, END_DATE
3) CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
I can do this in Java, but perhaps Python / Pandas already offers a quick solution.
pandas and itertools will give you what you need.
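    import pandas
    from itertools import chain, combinations

    def key_options(items):
        # every non-empty combination of the column names, shortest first
        return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

    # skipinitialspace drops the blank after each comma in the sample file,
    # so the column names come out without a leading space
    df = pandas.read_csv('test.csv', skipinitialspace=True)

    # iterate over all combos of headings, excluding ID for brevity
    for candidate in key_options(list(df)[1:]):
        deduped = df.drop_duplicates(subset=list(candidate))
        # if no rows were dropped, the combination is unique in every row
        if len(deduped.index) == len(df.index):
            print(', '.join(candidate))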
This will give you the output:
    PAYMENT_NUM, END_DATE
    CUST_NAME, CLIENT_NAME, END_DATE
    CUST_NAME, PAYMENT_NUM, END_DATE
    CLIENT_NAME, PAYMENT_NUM, END_DATE
    PAYMENT_NUM, START_DATE, END_DATE
    CUST_NAME, CLIENT_NAME, PAYMENT_NUM, END_DATE
    CUST_NAME, CLIENT_NAME, START_DATE, END_DATE
    CUST_NAME, PAYMENT_NUM, START_DATE, END_DATE
    CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
    CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
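One caveat for the real file: with up to 100 columns there are far too many combinations to enumerate in full, and any superset of a key is trivially unique as well. Below is a minimal sketch, assuming you only care about the smallest key sets up to some size cap. The names `minimal_keys` and `max_size` are hypothetical helpers for illustration, not part of pandas, and the sample rows are built inline so it runs without `test.csv`.

```python
from itertools import combinations

import pandas as pd


def minimal_keys(df, max_size=3):
    """Return column sets that uniquely identify rows, skipping supersets of keys already found."""
    found = []
    columns = list(df.columns)
    for size in range(1, max_size + 1):
        for candidate in combinations(columns, size):
            # A superset of a known key is unique too, so it adds no information.
            if any(set(key) <= set(candidate) for key in found):
                continue
            # No duplicated rows on these columns means the set is a candidate key.
            if not df.duplicated(subset=list(candidate)).any():
                found.append(candidate)
    return found


# Sample rows from the question (assumed to match the CSV after stripping spaces).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "CUST_NAME": ["CUST1", "CUST1", "CUST1", "CUST2"],
        "CLIENT_NAME": ["CLIENT1"] * 4,
        "PAYMENT_NUM": [10, 10, 101, 102],
        "START_DATE": ["2018-04-01", "2018-04-01", "2018-04-02", "2018-04-02"],
        "END_DATE": ["2018-04-02", "2018-05-30", "2018-04-03", "2018-04-03"],
    }
)

keys = minimal_keys(df)
for key in keys:
    print(", ".join(key))
```

Raising `max_size` widens the search at a steep cost in run time, so it is probably worth starting small. Keep in mind that uniqueness in one sample only shows a column set could be a key for that data, not that it is a key by design.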