How can I make the rows line up with the table on the Wikipedia page when a row contains a cell with a rowspan attribute?
    from bs4 import BeautifulSoup
    import urllib2
    from lxml.html import fromstring
    import re
    import csv
    import pandas as pd

    wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
    header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    try:
        table = soup.find_all('table')[6]
    except AttributeError as e:
        print 'No tables found, exiting'

    try:
        first = table.find_all('tr')[0]
    except AttributeError as e:
        print 'No table row found, exiting'

    try:
        allRows = table.find_all('tr')[1:-1]
    except AttributeError as e:
        print 'No table row found, exiting'

    headers = [header.get_text() for header in first.find_all(['th', 'td'])]
    results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]

    df = pd.DataFrame(data=results, columns=headers)
    df
I get a table as output, but for tables where a row contains a rowspan, I get the following:
As you know, the problem is caused by the following case.
HTML content:
    <tr>
      <td rowspan="2">2=</td>
      <td>West Indies</td>
      <td>4</td>
      <td>Lord's</td>
      <td>2009</td>
    </tr>
    <tr>
      <td style="text-align:left;">India</td>
      <td>4</td>
      <td>Mumbai</td>
      <td>2012</td>
    </tr>
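So when a td has a rowspan attribute, treat its value as repeated at the same td position in the tr elements that follow. The rowspan value is the number of rows the cell covers, so the value has to be repeated in the next rowspan - 1 tr tags.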
Note: this was only checked against the given test case; more test cases need to be checked.
Code:
    from bs4 import BeautifulSoup
    import urllib2
    from lxml.html import fromstring
    import re
    import csv
    import pandas as pd

    wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
    header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    table = soup.find_all('table')[6]
    tmp = table.find_all('tr')
    first = tmp[0]
    allRows = tmp[1:-1]  # table.find_all('tr')[1:-1]

    headers = [header.get_text() for header in first.find_all('th')]
    results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

    # <td rowspan="2">2=</td>
    # List of tuples (index of tr, index of td, total count, text value)
    # e.g. [(1, 0, 2, u'2=')]
    # (<tr> is 1, td position in tr is 0, repeated 2 times, value is 2=)
    rowspan = []
    for no, tr in enumerate(allRows):
        for td_no, data in enumerate(tr.find_all('td')):
            # has_attr replaces the deprecated has_key
            if data.has_attr("rowspan"):
                rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))

    if rowspan:
        for i in rowspan:
            # The tr that carries the rowspan already holds the value in results,
            # so add the value to each of the following trs that the cell covers.
            for j in xrange(1, i[2]):
                results[i[0] + j].insert(i[1], i[3])

    df = pd.DataFrame(data=results, columns=headers)
    print df
Output:
       Rank      Opponent  No. wins  Most recent venue  Season
    0     1  South Africa         6             Lord's    1951
    1    2=   West Indies         4             Lord's    2009
    2    2=         India         4             Mumbai    2012
    3     4     Australia         3             Sydney    1932
    4     5      Pakistan         2       Trent Bridge    1967
    5     6     Sri Lanka         1       Old Trafford    2002
It also works for table 10:
       Rank  Hundreds            Player  Matches  Innings  Average
    0     1        25     Alastair Cook      107      191    45.61
    1     2        23   Kevin Pietersen      104      181    47.28
    2     3        22     Colin Cowdrey      114      188    44.07
    3     3        22     Wally Hammond       85      140    58.46
    4     3        22  Geoffrey Boycott      108      193    47.72
    5     6        21    Andrew Strauss      100      178    40.91
    6     6        21          Ian Bell      103      178    45.30
    7    8=        20    Ken Barrington       82      131    58.67
    8    8=        20      Graham Gooch      118      215    42.58
    9    10        19        Len Hutton       79      138    56.67
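Since the approach above was only checked against a couple of tables, here is a rough sketch of the same rowspan rule written as a column-by-column fill. It is not the code used above. The function name `expand_rowspans` and the input format (each row as a list of `(text, rowspan)` pairs) are assumptions made for illustration. It keeps track of which columns are still covered by a cell from an earlier row, so it is meant to also cope with several spanned cells in one row.

```python
def expand_rowspans(rows):
    """Return a list of rows in which every spanned value is repeated.

    rows: list of rows, each a list of (text, rowspan) pairs
    (hypothetical input format, chosen for this sketch).
    """
    pending = {}  # column index -> (text, number of rows still to fill)
    grid = []
    for cells in rows:
        cells = list(cells)  # copy, so the caller's data is not modified
        out = []
        col = 0
        # Keep going while this row has cells left or a later column
        # is still covered by a cell from an earlier row.
        while cells or any(c >= col for c in pending):
            if col in pending:
                # Column is covered by a cell from an earlier row.
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
            elif cells:
                # Column is free: take the next cell of this row.
                text, span = cells.pop(0)
                out.append(text)
                if span > 1:
                    pending[col] = (text, span - 1)
            else:
                # Malformed table: nothing available for this column.
                out.append("")
            col += 1
        grid.append(out)
    return grid


# The rows from the HTML snippet above, preceded by the first data row.
sample = [
    [("1", 1), ("South Africa", 1), ("6", 1), ("Lord's", 1), ("1951", 1)],
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]
expanded = expand_rowspans(sample)
```

With BeautifulSoup, the `(text, rowspan)` pairs could presumably be built from each row's cells with `td.get_text()` and `int(td.get("rowspan", 1))`, and `expanded` could then be passed to `pd.DataFrame` in place of `results`. Cells with colspan are not handled in this sketch.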