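I need to preprocess tweets using Python. What regular expressions would remove all hashtags, @user mentions, and links from a tweet, respectively?
For example,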
original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
processed tweet: I really love that shirt at Macy
original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn
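I only need the meaningful words in each tweet. I don't need usernames, links, or punctuation.
The following is only a close approximation. Unfortunately, there is no right way to do this with regular expressions alone. The regex below strips URLs (any scheme, not just http), usernames, punctuation, and every other non-alphanumeric character, and it leaves the remaining words separated by single spaces. If you want to parse tweets the way you intend, you need more intelligence in the system: some kind of cognitive, self-learning algorithm, considering that there is no standard tweet feed format.
Here is what I propose.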
' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
Here are the results for your examples:
>>> import re
>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
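For repeated use, the one-liner can be wrapped in a small function with a precompiled pattern. This is only a sketch: the names `TWEET_NOISE` and `clean_tweet` are made up for illustration, and the pattern is the same three alternatives written as a raw string, so it does not rely on Python passing escapes such as `\w` through a normal string unchanged.

```python
import re

# Same alternation as the one-liner: @mentions, any single character that is
# not alphanumeric/space/tab, and anything that looks like scheme://...
# (TWEET_NOISE and clean_tweet are hypothetical names, not from any library.)
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")


def clean_tweet(text):
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())


print(clean_tweet("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
# I really love that shirt at Macy
```

Compiling once mainly helps readability when cleaning many tweets in a loop; the behavior is identical to the inline version.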
Here are some examples where it falls short:
>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
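Since the question asks for the patterns separately, one possible sketch applies them one pass at a time instead of in a single alternation. All names here (`URL_RE`, `MENTION_RE`, `HASHTAG_RE`, `NON_ALNUM_RE`, `clean_tweet_stepwise`, `drop_hashtags`) are illustrative, and allowing underscores in handles and tags is an assumption that the regex above does not make. It is still an approximation with the same limitations shown in these examples.

```python
import re

# Hypothetical pattern names; each one handles a single kind of noise.
URL_RE = re.compile(r"\w+://\S+")            # anything with a scheme, e.g. http://...
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")   # @user (underscore allowed: an assumption)
HASHTAG_RE = re.compile(r"#[A-Za-z0-9_]+")   # the whole #tag, word included
NON_ALNUM_RE = re.compile(r"[^0-9A-Za-z \t]")  # leftover punctuation and symbols


def clean_tweet_stepwise(text, drop_hashtags=False):
    """Strip URLs, then mentions, optionally whole hashtags, then punctuation."""
    text = URL_RE.sub(" ", text)        # URLs first, before their punctuation is touched
    text = MENTION_RE.sub(" ", text)
    if drop_hashtags:
        text = HASHTAG_RE.sub(" ", text)  # otherwise only the '#' is removed below
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())         # collapse whitespace to single spaces


print(clean_tweet_stepwise("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
# I really love that shirt at Macy
print(clean_tweet_stepwise("RT @AstrologyForYou: #Gemini recharges through regular contact", drop_hashtags=True))
# RT recharges through regular contact
```

Keeping the passes separate makes it easy to switch one of them off, for example keeping the hashtag word while dropping only the `#` sign, which is the default here.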