73

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date" and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".
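
For concreteness, the result I'm after would look something like this (the values are just placeholders):

[{'Event': 'Event A', 'Start Date': '2011-01-10', 'End Date': '2011-01-12'},
 {'Event': 'Event B', 'Start Date': '2011-02-01', 'End Date': '2011-02-03'},
 ...]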

Thanks for the help!

4 Answers

90

You should use an HTML parsing library like lxml:

from lxml import etree
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints

{'End Date': 'c', 'Start Date': 'b', 'Event': 'a'}
{'End Date': 'f', 'Start Date': 'e', 'Event': 'd'}
{'End Date': 'i', 'Start Date': 'h', 'Event': 'g'}
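
If the markup is too broken for etree.HTML() (a couple of comments below run into exactly that), lxml.html uses a more forgiving parser and the same approach carries over; a minimal sketch, assuming s holds the (possibly sloppy) HTML:

from lxml import html

# Collect every <tr>, take the first as the header row, and zip the rest into dicts.
rows = iter(html.fromstring(s).findall(".//tr"))
headers = [cell.text_content().strip() for cell in next(rows)]
for row in rows:
    values = [cell.text_content().strip() for cell in row]
    print(dict(zip(headers, values)))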

8 Comments

My table has a varying number of rows. How can I make it work if this is the case? Thanks for the response, btw.
@Andrew: The above code works for any number of rows and any number of columns, as long as every row has the same number of columns.
I'd suggest HTMLParser/html.parser, but this solution is much better in this case.
This was a useful pointer for additional research. I actually have some broken HTML to parse, so some other answers involving lxml.html also proved useful.
It fails if the HTML contains unquoted attributes like "<table align=center", raising lxml.etree.XMLSyntaxError: AttValue: " or ' expected
79

Hands down the easiest way to parse an HTML table is to use pandas.read_html(), which accepts both URLs and raw HTML.

import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest

As of pandas 1.5.0, read_html() can preserve hyperlinks via the extract_links argument; cells are then returned as (text, link) tuples.
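
Since the question asks for a list of dictionaries rather than a DataFrame, the result converts directly; a short follow-up sketch:

# Each row becomes a dict keyed by the table's column headers.
records = sp500_table.to_dict(orient="records")
print(records[0])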

6 Comments

Not a good way for tables containing rowspan and colspan!
@JohnStrood Looking forward to reading your answer on how to handle rowspan and colspan 👍
@tommy.carstensen Ah! I used bs4 to build an element tree, and traversed through the elements to break row-spanned column-spanned cells into constituent cells.
@tommy.carstensen There are already answers here: stackoverflow.com/a/39336433/5337834 and stackoverflow.com/a/9980393/5337834. If you're still unsatisfied, I'll write my own answer!
@zelusp I just learned that Pandas is extremely slow if your HTML has 100+ tables and you just want a single table with a specific id. BeautifulSoup is much faster in this case.
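
A minimal BeautifulSoup sketch of that approach, assuming html_string holds the page's HTML and "mytable" is a hypothetical table id:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, "html.parser")
table = soup.find("table", id="mytable")  # grab just the one table of interest
rows = table.find_all("tr")
headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
records = [dict(zip(headers, (c.get_text(strip=True) for c in r.find_all("td"))))
           for r in rows[1:]]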
35

Sven Marnach's excellent solution is directly translatable to ElementTree, which is part of the Python standard library:

from xml.etree import ElementTree as ET

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

Same output as in Sven Marnach's answer.
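
One caveat raised in the comments below: if the table wraps its rows in <thead>/<tbody>, iterating over the table element's direct children yields those wrappers instead of the rows. Iterating over all <tr> descendants sidesteps that; a sketch under that assumption:

# Walk every <tr> regardless of <thead>/<tbody> nesting.
rows = iter(table.iter("tr"))
headers = [col.text for col in next(rows)]
for row in rows:
    print(dict(zip(headers, (col.text for col in row))))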

3 Comments

+1 because it allows using cElementTree instead of ElementTree, which is considerably faster than lxml if a large number of tables is involved
I have a web page saved from Wikipedia. How can I specify to ET which table to parse and fetch data from? Is it possible by table name or table id?
Also, <tbody> and <thead> don't work; see stackoverflow.com/q/49286753/8929814
23

If the HTML is not well-formed XML, you can't parse it with etree. But even then, you don't have to use an external library to parse an HTML table. In Python 3 you can reach your goal with HTMLParser from html.parser. I've put the code for a simple HTMLParser subclass in a GitHub repo.

You can use that class (here named HTMLTableParser) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output is a list of 2D lists representing the tables. It might look something like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]
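
To get back to the list-of-dicts format the question asks for, the first row of a parsed table can be treated as the header row; a sketch, assuming (hypothetically) that p.tables[1] is the table of interest:

headers, *data = p.tables[1]
records = [dict(zip(headers, row)) for row in data]
print(records[0])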

3 Comments

Awesome parser !!
Neat indeed. It will break if some td elements have a colspan, though.
@mr.bjerre PR welcome ;-)
