Python solution to convert HTML tables to readable plain text

Question

I am looking for a way to cleanly convert HTML tables to readable plain text.

For example, given this input:

<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>

I expect the output:

Height: 200
Width: 440

I would prefer not to use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume it is doable with Python alone, with or without extra modules.

I don't need any word-wrapping or adjustable cell separator width. Having tabs as cell separators would be good enough.

Update

This was an old question for an old use case. Given that pandas provides the read_html function, my current answer would definitely be pandas-based.

Community · Accepted Answer · 2017-05-23 12:09:26Z

4

How about using the approach from this question:

Parse HTML table to Python list?

However, use collections.OrderedDict() instead of a plain dictionary to preserve the column order. Once you have the dictionary, it is very easy to extract and format the text.

Based on the solution by @Colt 45:

import xml.etree.ElementTree
import collections

s = """\
<table>
    <tr>
        <th>Height</th>
        <th>Width</th>
        <th>Depth</th>
    </tr>
    <tr>
        <td>10</td>
        <td>12</td>
        <td>5</td>
    </tr>
    <tr>
        <td>0</td>
        <td>3</td>
        <td>678</td>
    </tr>
    <tr>
        <td>5</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
"""

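# The snippet is well-formed XML, so ElementTree can parse it directly.
table = xml.etree.ElementTree.XML(s)
rows = iter(table)
# The first row holds the <th> header cells.
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    # OrderedDict keeps each header/value pair in column order.
    for key, value in collections.OrderedDict(zip(headers, values)).items():
        print(key, value)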

Output:

Height 10
Width 12
Depth 5
Height 0
Width 3
Depth 678
Height 5
Width 3
Depth 4
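For the row-oriented output the question asks for (one line per table row, cells separated by tabs), the same ElementTree approach can be adapted. The following is only a minimal sketch: the function name table_to_text and its sep parameter are made up for illustration, and it assumes the markup is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET


def table_to_text(markup, sep="\t"):
    """Return one line per <tr>, with the cell texts joined by `sep`."""
    table = ET.fromstring(markup)
    lines = []
    # iter("tr") also finds rows nested inside <thead> or <tbody>.
    for row in table.iter("tr"):
        # itertext() gathers text from nested tags such as <b> inside a cell.
        cells = ["".join(cell.itertext()).strip()
                 for cell in row if cell.tag in ("td", "th")]
        lines.append(sep.join(cells))
    return "\n".join(lines)


html = """\
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""

print(table_to_text(html))
```

With the question's input this prints two lines, each with the label and the number separated by a tab. Passing sep=" " gives the single-space form shown in the question.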

edited May 23, 2017 at 12:09

CommunityBot


answered May 25, 2013 at 11:06

Peter Varo


3 Comments

ccpizza Over a year ago

Thank you for the example code, but it only handles one special case. My actual input is a bit more complicated and contains lots of colspans, so it won't display the data the way I want. Here is a sample of the actual data: pastebin.com/yRQvz2Ww At the moment none of the options I tried (ElementTree, lxml, BeautifulSoup) comes close to the output of w3m -dump with the input I have.

Peter Varo Over a year ago

That is a whole different question: the input and expected output you give now are not what you originally asked about. For what you asked first, my answer works.

ccpizza Over a year ago

My original example is generic and the preferred answer would ideally be generic too. The solution you propose does solve the simplest case but is not generic enough.
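One possible way to get closer to the generic case discussed in these comments, using only the standard library, is an html.parser-based collector that tolerates missing closing tags and pads colspan cells with empty columns. This is a rough sketch, not a w3m replacement: the class TableTextParser, the function html_tables_to_text, and the sample markup are all invented for illustration, and rowspan and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect table rows as lists of cell strings (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text chunks of the open cell, or None
        self._span = 1      # colspan of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.close_row()            # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()          # tolerate a missing </td>
            self._cell = []
            value = dict(attrs).get("colspan") or "1"
            self._span = max(int(value), 1) if value.isdigit() else 1

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self.close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is None:
            return
        # Collapse runs of whitespace inside the cell to single spaces.
        self._row.append(" ".join("".join(self._cell).split()))
        # A cell spanning n columns is followed by n - 1 empty cells.
        self._row.extend([""] * (self._span - 1))
        self._cell = None

    def close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_tables_to_text(markup, sep="\t"):
    parser = TableTextParser()
    parser.feed(markup)
    parser.close()
    parser.close_row()  # flush a row left open at end of input
    return "\n".join(sep.join(row) for row in parser.rows)


# Made-up sample: a colspan cell and a row without its closing </tr>.
sample = """
<table>
  <tr><td colspan="2"><b>Dimensions</b></td></tr>
  <tr><td>Height:</td><td>200</td>
  <tr><td>Width:</td><td>440</td></tr>
</table>
"""

print(html_tables_to_text(sample))
```

Padding with empty cells keeps later columns aligned under tab separation, which is the only layout control the question asks for.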

Oin · Accepted Answer · 2013-05-25 11:15:50Z

1

You should look at the standard library modules xml.etree.ElementTree and xml.dom.minidom.
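As a rough illustration of the minidom route, the sketch below walks the rows of the question's table and joins the cell texts with tabs. The helper names cell_text and minidom_table_to_text are made up here, and the code assumes the input is well-formed XML.

```python
from xml.dom import minidom


def cell_text(node):
    """Concatenate all text found below `node`."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        else:
            parts.append(cell_text(child))
    return "".join(parts)


def minidom_table_to_text(markup, sep="\t"):
    """Return one line per <tr>, with the cell texts joined by `sep`."""
    doc = minidom.parseString(markup)
    lines = []
    for row in doc.getElementsByTagName("tr"):
        # Only direct <td>/<th> children count as cells of this row.
        cells = [cell_text(cell).strip() for cell in row.childNodes
                 if cell.nodeType == cell.ELEMENT_NODE
                 and cell.tagName in ("td", "th")]
        lines.append(sep.join(cells))
    return "\n".join(lines)


html = """\
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""

print(minidom_table_to_text(html))
```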

edited May 25, 2013 at 11:15

answered May 25, 2013 at 11:01

Oin


seagulf · Accepted Answer · 2013-05-27 17:05:51Z

You can use the HTQL module from http://htql.net.

Here is the sample code for your page:

import urllib2  # Python 2 only; urllib.request replaces it in Python 3
import htql

url = 'http://pastebin.com/yRQvz2Ww'
page = urllib2.urlopen(url).read()

query="""<div (ID='super_frame')>1.<div (ID='monster_frame')>1.<div (ID='content_frame')>1.<div (ID='content_left')>1.<div (ID='code_frame2')>1.<div (ID='code_frame')>1.<div (ID='selectable')>1.<div (CLASS='html4strict')>1 &tx
<table>.<tr>{
    c1=<td>:colspan;   t1=<td>1 &tx; 
    c2=<td>2:colspan;   t2=<td>2 &tx;
    c3=<td>3:colspan;   t3=<td>3 &tx; 
    c4=<td>4:colspan;   t4=<td>4 &tx;
    c5=<td>5:colspan;   t5=<td>5 &tx;
}
"""

for t in htql.query(page, query):
    print('\t'.join(t))

htql.query() produces 10 columns per row: c1, t1, c2, t2, ..., c5, t5. You can use the colspan values in c1..c5 to work out which cells the texts t1..t5 should go in.
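The placement step is not shown above, so here is one way it might look in plain Python. This is a sketch under assumptions: place_cells is a hypothetical helper, each record is assumed to be a sequence of alternating colspan and text values, and a missing colspan is assumed to arrive as an empty string or None. The exact shape of the values htql returns has not been verified here.

```python
def place_cells(record):
    """Expand alternating (colspan, text) values into a padded row."""
    row = []
    # record is assumed to look like (c1, t1, c2, t2, ...).
    for span, text in zip(record[0::2], record[1::2]):
        span = str(span or "").strip()
        width = max(int(span), 1) if span.isdigit() else 1
        row.append(text or "")
        # A cell spanning n columns is followed by n - 1 empty cells.
        row.extend([""] * (width - 1))
    return row


# Made-up record: the first cell spans two columns, the second has no colspan,
# and the remaining three cells are absent.
record = ("2", "Dimensions", "", "mm", "", "", "", "", "", "")
print("\t".join(place_cells(record)))
```

Absent cells still produce empty columns in this sketch; trailing empties can be dropped afterwards if they are not wanted.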
