Automatically Determine The Number Of Columns In An HTML Table With Python's Beautiful Soup
TL;DR
This is what I use to find the number of columns in an HTML table.
#!/usr/bin/env python3
from bs4 import BeautifulSoup

max_cols = 0
with open("source/1.html") as _in:
    soup = BeautifulSoup(_in.read(), "html.parser")

table = soup.find("table", "complex")
for row in table.find_all("tr"):
    # Count both header and data cells so rows of either kind are covered.
    tds = row.find_all("td")
    ths = row.find_all("th")
    max_cols = max(max_cols, len(tds) + len(ths))
print(f"Max columns: {max_cols}")
Details
I'm working on an ASCII art tool. It includes a full set of Unicode characters to choose from. I pulled the characters from the W3C site. There are 28 pages with tables that sort everything from the top down, then column by column. Something like this:
a f k
b g l
c h m
d i n
e j o
They're set up that way based on their Unicode ID numbers. That makes sense on the W3C pages, but I want them sorted continuously from left to right:
a b c d e f g h i j k l m n o
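Setting Pandas aside for a moment, the reordering itself is just a transpose: read the displayed rows, walk them column by column, and flatten. A minimal sketch with the tiny example grid above (the data here is the illustration, not the real W3C tables):

```python
# The grid as Beautiful Soup would hand it to you: one list per
# displayed row. Reading down each column gives a, b, c, ...
rows = [
    ["a", "f", "k"],
    ["b", "g", "l"],
    ["c", "h", "m"],
    ["d", "i", "n"],
    ["e", "j", "o"],
]

# zip(*rows) regroups the cells into columns; flattening those
# columns in order yields the continuous left-to-right sequence.
flattened = [cell for column in zip(*rows) for cell in column]
print(" ".join(flattened))
```

The same transpose-and-flatten step works for the real tables once each row's cell text has been extracted.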
I'm parsing the source HTML with Beautiful Soup, then doing the formatting conversion in Pandas. I want to know the number of columns in each table so I can set up the Pandas data frame explicitly. So, I put together the code snippet above to figure that out. It loops through every row of the table, counts the number of `th` (header) and `td` (data) cells in each row, and runs the counts through `max()` to find the longest row.
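With the column count in hand, building the frame with explicit columns looks roughly like this. This is a sketch, not my actual script: the extracted rows and column labels are made up for illustration, and ragged rows are padded out to the full width.

```python
import pandas as pd

max_cols = 3  # what the snippet above computes (18 for the real tables)
rows = [["a", "f", "k"], ["b", "g"]]  # made-up extracted cell text

# Pad short rows with None so every row has exactly max_cols cells,
# then build the frame with explicit column labels.
padded = [row + [None] * (max_cols - len(row)) for row in rows]
df = pd.DataFrame(padded, columns=[f"col{i}" for i in range(max_cols)])
print(df.shape)
```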
- This wasn't really necessary in this case. All the tables had 18 columns, but I wanted code that confirmed that.
- I think you can just throw stuff at Pandas and have it figure this out, but I'm not familiar enough with it yet to know for sure, especially because the tables had inconsistent numbers of header rows and columns.
- There's more automation that could be done to import the data into Pandas. For example, removing any table row that consists of only table headers.
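That last filter can be done during extraction: keep only rows that contain at least one `td`. A rough sketch with a made-up two-row table (not the real W3C markup):

```python
from bs4 import BeautifulSoup

html = """<table class="complex">
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# Skip header-only rows: a row with no td cells has nothing but th.
data_rows = [
    row for row in soup.find("table", "complex").find_all("tr")
    if row.find("td") is not None
]
print(len(data_rows))
```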