我的文件夹上有多个csv文件。列标题不同,但列数据相同。
括号内的数字是实际的列名。项目(67)67是列名
因此,忽略字符串Item,仅考虑int()中的int并执行操作。
样本文件:https : //drive.google.com/open?id=1q7c1AqCRKRufSVh– 9o0W6rdz28QyBGa
说明:
驱动器上的文件应附加在一起。基于列名称的“启用条件”。如果条件上的整数与列名(列名()内的整数)匹配,则应将其放在该列上。请检查预期的输出。 档案
File1: ID Item(67) Item (89) Item (91) Item (100) 1 56 78 98 101 2 91 100 121 File2: ID Item(96) Item (58) Item (99) Item (105) 3 101 102 103 104 4 112 113 117 119
健康)状况
d ={ 'File':['File1','File2'], 'Price1':[67,67], 'Price2':[89,67], 'Price3':[91,67], 'Price4':[100,91] } Condition=pd.DataFrame(data=d) Condition
预期产量:
File ID Price1 Price2 Price3 Price4 File1 1 56 78 98 101 File1 2 91 100 121 File2 3 101 102 104 103 File2 4 112 113 119 117
采用:
files = glob.glob('shelldemo/*.csv') dfs = [] for fp in files: #if multiple columns with no () #df = pd.read_csv(fp, index_col=['S.no','id','number']) df = pd.read_csv(fp, index_col=['ID']) df['file'] = os.path.basename(fp).split('.')[0] df = df.set_index('file', append=True) df.columns = df.columns.str.extract('\((\d+)\)', expand=False).astype(int) dfs.append(df) df1 = pd.concat(dfs, sort=False).reset_index() print (df1) ID file 58 67 89 91 96 100 0 1 file1 NaN 56 78.0 98.0 NaN 101.0 1 2 file1 NaN 91 100.0 121.0 NaN NaN 2 3 file2 102.0 103 NaN NaN 101.0 104.0 3 4 file2 113.0 117 NaN NaN 112.0 119.0
print (df2) File Price1 Price2 Price3 Price4 0 File1 67 89 91 100 1 File2 96 58 105 99
df2.columns = df2.columns.str.lower() df2['file'] = df2['file'].str.lower() #merge data together by left join df = df1.merge(df2, on='file', how='left') print (df) ID file 58 67 89 91 96 100 price1 price2 price3 \ 0 1 file1 NaN 56 78.0 98.0 NaN 101.0 67 89 91 1 2 file1 NaN 91 100.0 121.0 NaN NaN 67 89 91 2 3 file2 102.0 103 NaN NaN 101.0 104.0 96 58 105 3 4 file2 113.0 117 NaN NaN 112.0 119.0 96 58 105 price4 0 100 1 100 2 99 3 99
#filter integers between () df1 = df.loc[:, df.columns.str.isnumeric().isnull()].copy() #filter all columns with price df2 = df.filter(regex='price').copy() uniq_vals_df2 = df2.stack().dropna().drop_duplicates() not_matched_vals = np.setdiff1d(uniq_vals_df2, df1.columns) df1 = df1.join(pd.DataFrame(columns=not_matched_vals.tolist() + ['a'])) #replace columns by match values from df2 for c in df2.columns: df2[c] = df1.lookup(df1.index, df2[c].fillna('a')) #join to original DataFrame df = df[['file','ID']].join(df2)
print (df) file ID price1 price2 price3 price4 0 file1 1 56.0 78.0 98.0 101.0 1 file1 2 91.0 100.0 121.0 NaN 2 file2 3 101.0 102.0 NaN NaN 3 file2 4 112.0 113.0 NaN NaN