小编典典

行跨度怎么办

python

如果该行具有rowspan element,那么如何使该行与Wikipedia页面中的表相对应。

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring 
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[6]
except AttributeError as e:
    print 'No tables found, exiting'

try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:-1]
except AttributeError as e:
    print 'No table row found, exiting'


headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]


df = pd.DataFrame(data=results, columns=headers)
df

我得到表作为输出..但是对于其中行包含 rowspan的- 我得到表如下-
在此处输入图片说明


阅读 211

收藏
2021-01-16

共1个答案

小编典典

如您所知,由于以下情况造成的问题,

html内容:

<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
      <td>2012</td>
</tr>

因此,当td具有rowspan属性时,请考虑对相同td级别的下一个重复tr相同的值,并rowspan为下一个tr标签数量重复均值。

  1. 获取所有此类rowspan信息并保存在变量中。的保存序列号tr标签,序列号td标签,价值rowspan即许多如何tr标记有相同td的文本值td
  2. tr按照上述方法全部更新结果。

注意:: 仅检查给定的测试用例。需要检查更多的测试用例。

码:

    from bs4 import BeautifulSoup
    import urllib2
    from lxml.html import fromstring 
    import re
    import csv
    import pandas as pd


    wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
    header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki,headers=header)
    page = urllib2.urlopen(req)

    soup = BeautifulSoup(page)

    table = soup.find_all('table')[6]

    tmp = table.find_all('tr')

    first = tmp[0]
    allRows = tmp[1:-1]
    #table.find_all('tr')[1:-1]


    headers = [header.get_text() for header in first.find_all('th')]

    results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

    #<td rowspan="2">2=</td>
    # list of tuple (Level of tr, Level of td, total Count, Text Value)
    #e.g.
    #[(1, 0, 2, u'2=')]
    # (<tr> is 1 , td sequence in tr is 0, reapted 2 times , value is 2=)
    rowspan = []

    for no, tr in enumerate(allRows):
        tmp = []
        for td_no, data in enumerate(tr.find_all('td')):
            print  data.has_key("rowspan")
            if data.has_key("rowspan"):
                rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))


    if rowspan:
        for i in rowspan:
            # tr value of rowspan in present in 1th place in results
            for j in xrange(1, i[2]):
                #- Add value in next tr.
                results[i[0]+j].insert(i[1], i[3])


    df = pd.DataFrame(data=results, columns=headers)
    print df

输出:

      Rank       Opponent No. wins Most recent venue Season
    0    1   South Africa        6            Lord's   1951
    1   2=    West Indies        4            Lord's   2009
    2   2=          India        4            Mumbai   2012
    3    4      Australia        3            Sydney   1932
    4    5       Pakistan        2      Trent Bridge   1967
    5    6      Sri Lanka        1      Old Trafford   2002

也要工作到表10

      Rank Hundreds            Player Matches Innings Average
    0    1       25     Alastair Cook     107     191   45.61
    1    2       23   Kevin Pietersen     104     181   47.28
    2    3       22     Colin Cowdrey     114     188   44.07
    3    3       22     Wally Hammond      85     140   58.46
    4    3       22  Geoffrey Boycott     108     193   47.72
    5    6       21    Andrew Strauss     100     178   40.91
    6    6       21          Ian Bell     103     178   45.30
    7   8=       20    Ken Barrington      82     131   58.67
    8   8=       20      Graham Gooch     118     215   42.58
    9   10       19        Len Hutton      79     138   56.67
2021-01-16