User-item affinity and recommendations

I am building a table for "customers who bought this item also bought" recommendations.

Input dataset:

    productId  userId
    Prod1      a
    Prod1      b
    Prod1      c
    Prod1      d
    prod2      b
    prod2      c
    prod2      a
    prod2      b
    prod3      c
    prod3      a
    prod3      d
    prod3      c
    prod4      a
    prod4      b
    prod4      d
    prod4      a
    prod5      d
    prod5      a

Required output:

    Product1  Product2  score
    Prod1     prod3
    Prod1     prod4
    Prod1     prod5
    prod2     Prod1
    prod2     prod3
    prod2     prod4
    prod2     prod5
    prod3     Prod1
    prod3     prod2

Using this code:

    import pandas as pd

    # Get list of unique items
    itemList = list(set(main["productId"].tolist()))

    # Get count of users
    userCount = len(set(main["userId"].tolist()))

    # Create an empty data frame to store item affinity scores for items.
    itemAffinity = pd.DataFrame(columns=('item1', 'item2', 'score'))
    rowCount = 0

    # For each item in the list, compare with other items.
    for ind1 in range(len(itemList)):

        # Get list of users who bought this item 1.
        item1Users = main[main.productId == itemList[ind1]]["userId"].tolist()
        # print("Item 1 ", item1Users)

        # Get item 2 - items that are not item 1 or those that are not analyzed already.
        for ind2 in range(ind1, len(itemList)):

            if ind1 == ind2:
                continue

            # Get list of users who bought item 2
            item2Users = main[main.productId == itemList[ind2]]["userId"].tolist()
            # print("Item 2", item2Users)

            # Find score. Find the common list of users and divide it by the total users.
            commonUsers = len(set(item1Users).intersection(set(item2Users)))
            score = commonUsers / userCount

            # Add a score for item 1, item 2
            itemAffinity.loc[rowCount] = [itemList[ind1], itemList[ind2], score]
            rowCount += 1
            # Add a score for item2, item 1. The same score would apply irrespective of the sequence.
            itemAffinity.loc[rowCount] = [itemList[ind2], itemList[ind1], score]
            rowCount += 1

    # Check final result
    itemAffinity

The code works fine on the sample dataset, but it takes far too long to run on a dataset with 100,000 rows. Please help me optimize it.
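The key here is to build a Cartesian product of the productId rows, then keep only the pairs where the same user bought two different products. See the code below: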

    result = (main.drop_duplicates(['productId', 'userId'])
              .assign(cartesian_key=1)
              # Self-merge on a constant key: every row paired with every row
              .pipe(lambda x: x.merge(x, on='cartesian_key'))
              .drop('cartesian_key', axis=1)
              # Keep pairs of different products bought by the same user
              .loc[lambda x: (x.productId_x != x.productId_y) & (x.userId_x == x.userId_y)]
              .groupby(['productId_x', 'productId_y']).size()
              # Normalize by the number of distinct users
              .div(main['userId'].nunique()))
    result

Result:

    Prod1  prod2  0.75
    Prod1  prod3  0.75
    Prod1  prod4  0.75
    Prod1  prod5  0.5
    prod2  Prod1  0.75
    prod2  prod3  0.5
    prod2  prod4  0.5
    prod2  prod5  0.25
    prod3  Prod1  0.75
    prod3  prod2  0.5
    prod3  prod4  0.5
    prod3  prod5  0.5
    prod4  Prod1  0.75
    prod4  prod2  0.5
    prod4  prod3  0.5
    prod4  prod5  0.5
    prod5  Prod1  0.5
    prod5  prod2  0.25
    prod5  prod3  0.5
    prod5  prod4  0.5

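Since the filter above keeps only pairs bought by the same user, the self-merge could arguably be keyed on userId directly, so the full row-by-row Cartesian product is never built. What follows is a minimal sketch under that assumption, not the code from the answer: the function name `affinity_by_user_join` is made up for illustration, and `main` is rebuilt from the question's sample data so the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question; `main` mirrors the question's variable name.
main = pd.DataFrame({
    "productId": ["Prod1"] * 4 + ["prod2"] * 4 + ["prod3"] * 4
                 + ["prod4"] * 4 + ["prod5"] * 2,
    "userId": list("abcd") + list("bcab") + list("cadc")
              + list("abda") + list("da"),
})


def affinity_by_user_join(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: users who bought both products / distinct users."""
    # One row per (product, user), so repeat purchases are not double counted.
    pairs = df[["productId", "userId"]].drop_duplicates()
    n_users = pairs["userId"].nunique()

    # Joining on userId only pairs up purchases made by the same user.
    joined = pairs.merge(pairs, on="userId", suffixes=("_x", "_y"))

    # Drop the pairs of a product with itself.
    joined = joined[joined["productId_x"] != joined["productId_y"]]

    # Count the common users per product pair and normalize.
    return joined.groupby(["productId_x", "productId_y"]).size().div(n_users)


scores = affinity_by_user_join(main)
print(scores)
```

The intermediate frame grows with the number of product pairs per user rather than with the square of the row count, which is usually far smaller on purchase data. Whether it is fast enough on the real 100,000-row dataset would still need to be measured.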
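A second variant first pivots the data into a product-by-user matrix and then cross-joins the products rather than the individual purchase rows:

    result = (main.groupby(['productId', 'userId']).size()
              .clip(upper=1)      # 1 if the user bought the product at all
              .unstack()          # product x user matrix, NaN where no purchase
              .assign(key=1)
              .reset_index()
              .pipe(lambda x: x.merge(x, on='key'))   # cross join of products
              .drop('key', axis=1)
              .loc[lambda x: (x.productId_x != x.productId_y)]
              .set_index(['productId_x', 'productId_y'])
              # Split 'a_x', 'a_y', ... into a two-level column index
              .pipe(lambda x: x.set_axis(x.columns.str.split('_', expand=True), axis=1))
              .swaplevel(axis=1)
              .pipe(lambda x: (x['x'] + x['y']))      # 2 where both bought, NaN otherwise
              .fillna(0)
              .div(2)
              .mean(axis=1))      # share of users who bought both products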
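The same co-occurrence counts can also be read off a matrix product: with a 0/1 product-by-user matrix B, entry (i, j) of B·Bᵀ is the number of users who bought both product i and product j. The sketch below is one possible way to write that with pandas, not part of the original answer. The helper name `affinity_by_matrix` is hypothetical, the column names `item1`, `item2` and `score` are borrowed from the question's `itemAffinity` frame, and `main` is rebuilt from the sample data.

```python
import pandas as pd

# Sample data from the question; `main` mirrors the question's variable name.
main = pd.DataFrame({
    "productId": ["Prod1"] * 4 + ["prod2"] * 4 + ["prod3"] * 4
                 + ["prod4"] * 4 + ["prod5"] * 2,
    "userId": list("abcd") + list("bcab") + list("cadc")
              + list("abda") + list("da"),
})


def affinity_by_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: affinity scores via a co-occurrence matrix."""
    # Binary product x user matrix: 1 if the user bought the product at least once.
    bought = pd.crosstab(df["productId"], df["userId"]).clip(upper=1)
    n_users = bought.shape[1]

    # co[i, j] = number of users who bought both product i and product j.
    co = bought.dot(bought.T)

    # Name the two axes differently so they can become separate columns.
    scores = (co / n_users).rename_axis(index="item1", columns="item2")

    # Long format: one row per ordered product pair.
    long = scores.reset_index().melt(
        id_vars="item1", var_name="item2", value_name="score"
    )

    # Drop self-pairs and sort for readability.
    long = long[long["item1"] != long["item2"]]
    return long.sort_values(["item1", "item2"]).reset_index(drop=True)


item_affinity = affinity_by_matrix(main)
print(item_affinity)
```

Unlike the merge-based versions, this one also lists pairs with no common users (score 0), and the dense product-by-product matrix may not fit in memory for a very large catalog. A sparse matrix would be the natural next step in that case.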