请原谅我的熊猫新手问题,但我有一列美国城镇和州,例如下面所示的截短的版本(出于某些奇怪的原因,该列的名称称为“ Alabama [edit]”,该列与第一个列中的0-7城镇值):
0 Auburn (Auburn University)[1] 1 Florence (University of North Alabama) 2 Jacksonville (Jacksonville State University)[2] 3 Livingston (University of West Alabama)[2] 4 Montevallo (University of Montevallo)[2] 5 Troy (Troy University)[2] 6 Tuscaloosa (University of Alabama, Stillman Co... 7 Tuskegee (Tuskegee University)[5] 8 Alaska[edit] 9 Fairbanks (University of Alaska Fairbanks)[2] 10 Arizona[edit] 11 Flagstaff (Northern Arizona University)[6] 12 Tempe (Arizona State University) 13 Tucson (University of Arizona) 14 Arkansas[edit] 15 Arkadelphia (Henderson State University, Ouach... 16 Conway (Central Baptist College, Hendrix Colle... 17 Fayetteville (University of Arkansas)[7] 18 Jonesboro (Arkansas State University)[8] 19 Magnolia (Southern Arkansas University)[2] 20 Monticello (University of Arkansas at Monticel... 21 Russellville (Arkansas Tech University)[2] 22 Searcy (Harding University)[5] 23 California[edit]
每个州的城镇均在每个州名称之下,例如,费尔班克斯(列值9)是阿拉斯加州的一个城镇。
我想要做的是根据州名称拆分城镇名称,这样我就有两列“州”和“地区名称”,其中每个州名称与每个城镇名称相关联,如下所示:
RegionName State 0 Auburn (Auburn University)[1] Alabama 1 Florence (University of North Alabama) Alabama 2 Jacksonville (Jacksonville State University)[2] Alabama 3 Livingston (University of West Alabama)[2] Alabama 4 Montevallo (University of Montevallo)[2] Alabama 5 Troy (Troy University)[2] Alabama 6 Tuscaloosa (University of Alabama, Stillman Co... Alabama 7 Tuskegee (Tuskegee University)[5] Alabama 8 Fairbanks (University of Alaska Fairbanks)[2] Alaska 9 Flagstaff (Northern Arizona University)[6] Arizona 10 Tempe (Arizona State University) Arizona 11 Tucson (University of Arizona) Arizona 12 Arkadelphia (Henderson State University, Ouach... Arkansas
。。。等等。
我知道每个州名后面都有一个字符串“ [edit]”,我认为我可以用它来分割和分配城镇名。但是我不知道该怎么做。
另外,我知道我还需要进行许多其他数据清理工作,例如删除括号内和方括号“ []”中的字符串。以后可以做…重要的部分是将州和镇分开,并将每个镇分配给其适当的美国。我们将不胜感激任何建议。
在没有太多背景信息或无法访问您的数据的情况下,我建议遵循以下原则。首先,修改读取数据的代码:
df = pd.read_csv(..., header=None, names=['RegionName']) # add header=False so as to read the first row as data
现在,使用提取状态名称str.extract,这仅应提取名称,只要该名称后接子字符串“ [edit]”即可。然后,您可以使用来填充所有NaN值ffill。
str.extract
ffill
df['State'] = df['RegionName'].str.extract( r'(?P<State>.*)(?=\s*\[edit\])' ).ffill()