Parse HTML file with BeautifulSoup


In the last post, regular expression is used to fetch the specific information. To access the structured information, BeautifulSoapBeautifulSoup is preferred for its simplicity and convenient API:

  • You may override the fromEncoding in the constructor, this is very useful for non-roman, non-standard web pages.
  • Versatile find/findAll on tag, attributes.
  • Developer-friendly syntactic sugar, the Tag implements the interface of string, list, dict and callable function, so there are many ways to access the data as you wish. The drawback of this approach is the typo is only caught in the run time instead of compilation time.
  • Easy to deploy, only one file.

Something I don’t like:

  • No XPath support, more efforts are needed to port from JavaScript.
  • The API does not support stream, or file object. Laziness is always cherished for pipelining.
  • Why BeautifulSoup? I have made typo as soap more than ten times.

Here is the home-brewed script to wage through the Dvbbs thread to find the corresponding messages: