I am looking for a way to cleanly convert HTML tables to readable plain text.

For example, given the input:

<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>

I expect the output:

Height: 200
Width: 440

I would prefer not to use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume it is doable with Python alone, with or without extra modules.

I don't need any word-wrapping or adjustable cell separator width. Having tabs as cell separators would be good enough.

Update

This was an old question for an old use case. Given that pandas provides the read_html function, my current answer would definitely be pandas-based.

3 Answers

4

How about the approach from the question "Parse HTML table to Python list?"

Use collections.OrderedDict() instead of a plain dictionary to preserve the column order. Once you have the dictionary, it is very easy to get and format the text from it.

Using the solution of @Colt 45:

import xml.etree.ElementTree
import collections

s = """\
<table>
    <tr>
        <th>Height</th>
        <th>Width</th>
        <th>Depth</th>
    </tr>
    <tr>
        <td>10</td>
        <td>12</td>
        <td>5</td>
    </tr>
    <tr>
        <td>0</td>
        <td>3</td>
        <td>678</td>
    </tr>
    <tr>
        <td>5</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
"""

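# Parse the table; the first row holds the headers, the rest hold values.
table = xml.etree.ElementTree.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    # OrderedDict keeps the header order (plain dicts also do on Python 3.7+).
    for key, value in collections.OrderedDict(zip(headers, values)).items():
        print(key, value)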

Output:

Height 10
Width 12
Depth 5
Height 0
Width 3
Depth 678
Height 5
Width 3
Depth 4

3 Comments

Thank you for the example code, but it only handles one special case. My actual input is a bit more complicated and contains lots of colspans, so it won't display the data the way I want it to. Here is a sample of the actual data: pastebin.com/yRQvz2Ww At the moment none of the options I tried (ElementTree, lxml, BeautifulSoup) come close to the output of w3m -dump with the input I have.
That is a whole different question: the input and expected output you give now are not what you originally asked about. For what you asked first, my answer works.
My original example is generic and the preferred answer would ideally be generic too. The solution you propose does solve the simplest case but is not generic enough.
1

You should look at the standard library modules xml.etree.ElementTree and xml.dom.minidom.
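
As a minimal sketch of the ElementTree route, assuming the table is well-formed XML (real-world HTML often is not, and it would fail to parse), the question's input can be turned into separator-joined lines. The helper name table_to_text and its sep parameter are made up for this illustration:

```python
import xml.etree.ElementTree as ET

HTML = """\
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""


def table_to_text(markup, sep="\t"):
    """Return one line per <tr>, with the cell texts joined by sep.

    Hypothetical helper; it assumes markup is a single well-formed element.
    """
    table = ET.fromstring(markup)
    lines = []
    for row in table.iter("tr"):
        # itertext() also collects text nested inside inline tags such as <b>.
        cells = ["".join(cell.itertext()).strip()
                 for cell in row if cell.tag in ("td", "th")]
        lines.append(sep.join(cells))
    return "\n".join(lines)


print(table_to_text(HTML, sep=" "))
```

With the default sep the cells are tab-separated, which is what the question asks for. This does not handle colspan or rowspan.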

1

You can use the HTQL module from http://htql.net.

Here is the sample code for your page:

import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
import htql

url = 'http://pastebin.com/yRQvz2Ww'
page = urllib2.urlopen(url).read()

query="""<div (ID='super_frame')>1.<div (ID='monster_frame')>1.<div (ID='content_frame')>1.<div (ID='content_left')>1.<div (ID='code_frame2')>1.<div (ID='code_frame')>1.<div (ID='selectable')>1.<div (CLASS='html4strict')>1 &tx
<table>.<tr>{
    c1=<td>:colspan;   t1=<td>1 &tx; 
    c2=<td>2:colspan;   t2=<td>2 &tx;
    c3=<td>3:colspan;   t3=<td>3 &tx; 
    c4=<td>4:colspan;   t4=<td>4 &tx;
    c5=<td>5:colspan;   t5=<td>5 &tx;
}
"""

for t in htql.query(page, query):
    print('\t'.join(t))

htql.query() produces 10 columns per row: c1, t1, c2, t2, ..., c5, t5. You can use the colspan values c1..c5 to work out which columns the cell texts t1..t5 belong in.
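
One way to use the colspan values is sketched below. It assumes each result row is a sequence of strings in the order c1, t1, c2, t2, ... and that a missing colspan or cell comes back as an empty string; neither assumption is confirmed by the HTQL documentation here. The helper place_cells and the sample tuple are made up for illustration:

```python
def place_cells(cells):
    """Expand (colspan, text) pairs into a flat list of grid columns.

    A cell spanning n columns takes its own column plus n - 1 empty ones.
    Hypothetical helper; colspan is assumed to be '' or a decimal string.
    """
    row = []
    for colspan, text in cells:
        span = int(colspan) if colspan else 1
        row.append(text)
        row.extend([""] * (span - 1))
    # Drop trailing empty columns left by cells that were not present.
    while row and row[-1] == "":
        row.pop()
    return row


# Made-up result row in the order c1, t1, c2, t2, ..., c5, t5.
t = ("2", "Name", "", "Value", "", "", "", "", "", "")
pairs = list(zip(t[0::2], t[1::2]))
print("\t".join(place_cells(pairs)))
```

Here "Name" spans two columns, so "Value" lands in the third column. Rowspans are not handled.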
