Metadata-Version: 1.1 Name: html5lib Version: 0.999 Summary: HTML parser based on the WHATWG HTML specifcation Home-page: https://github.com/html5lib/html5lib-python Author: James Graham Author-email: james@hoppipolla.co.uk License: MIT License Description: html5lib ======== .. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master :target: https://travis-ci.org/html5lib/html5lib-python html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers. Usage ----- Simple usage follows this pattern: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: document = html5lib.parse(f) or: .. code-block:: python import html5lib document = html5lib.parse("<p>Hello World!") By default, the ``document`` will be an ``xml.etree`` element instance. Whenever possible, html5lib chooses the accelerated ``ElementTree`` implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x). Two other tree types are supported: ``xml.dom.minidom`` and ``lxml.etree``. To use an alternative format, specify the name of a treebuilder: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: lxml_etree_document = html5lib.parse(f, treebuilder="lxml") When using with ``urllib2`` (Python 2), the charset from HTTP should be pass into html5lib as follows: .. code-block:: python from contextlib import closing from urllib2 import urlopen import html5lib with closing(urlopen("http://example.com/")) as f: document = html5lib.parse(f, encoding=f.info().getparam("charset")) When using with ``urllib.request`` (Python 3), the charset from HTTP should be pass into html5lib as follows: .. code-block:: python from urllib.request import urlopen import html5lib with urlopen("http://example.com/") as f: document = html5lib.parse(f, encoding=f.info().get_content_charset()) To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use: .. code-block:: python import html5lib with open("mydocument.html", "rb") as f: parser = html5lib.HTMLParser(strict=True) document = parser.parse(f) When you're instantiating parser objects explicitly, pass a treebuilder class as the ``tree`` keyword argument to use an alternative document format: .. code-block:: python import html5lib parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom")) minidom_document = parser.parse("<p>Hello World!") More documentation is available at http://html5lib.readthedocs.org/. Installation ------------ html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use: .. code-block:: bash $ pip install html5lib Optional Dependencies --------------------- The following third-party libraries may be used for additional functionality: - ``datrie`` can be used to improve parsing performance (though in almost all cases the improvement is marginal); - ``lxml`` is supported as a tree format (for both building and walking) under CPython (but *not* PyPy where it is known to cause segfaults); - ``genshi`` has a treewalker (but not builder); and - ``charade`` can be used as a fallback when character encoding cannot be determined; ``chardet``, from which it was forked, can also be used on Python 2. - ``ordereddict`` can be used under Python 2.6 (``collections.OrderedDict`` is used instead on later versions) to serialize attributes in alphabetical order. Bugs ---- Please report any bugs on the `issue tracker <https://github.com/html5lib/html5lib-python/issues>`_. Tests ----- Unit tests require the ``nose`` library and can be run using the ``nosetests`` command in the root directory; ``ordereddict`` is required under Python 2.6. All should pass. Test data are contained in a separate `html5lib-tests <https://github.com/html5lib/html5lib-tests>`_ repository and included as a submodule, thus for git checkouts they must be initialized:: $ git submodule init $ git submodule update If you have all compatible Python implementations available on your system, you can run tests on all of them using the ``tox`` utility, which can be found on PyPI. Questions? ---------- There's a mailing list available for support on Google Groups, `html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_, though you may get a quicker response asking on IRC in `#whatwg on irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_. Change Log ---------- 0.999 ~~~~~ Released on December 23, 2013 * Fix #127: add work-around for CPython issue #20007: .read(0) on http.client.HTTPResponse drops the rest of the content. * Fix #115: lxml treewalker can now deal with fragments containing, at their root level, text nodes with non-ASCII characters on Python 2. 0.99 ~~~~ Released on September 10, 2013 * No library changes from 1.0b3; released as 0.99 as pip has changed behaviour from 1.4 to avoid installing pre-release versions per PEP 440. 1.0b3 ~~~~~ Released on July 24, 2013 * Removed ``RecursiveTreeWalker`` from ``treewalkers._base``. Any implementation using it should be moved to ``NonRecursiveTreeWalker``, as everything bundled with html5lib has for years. * Fix #67 so that ``BufferedStream`` to correctly returns a bytes object, thereby fixing any case where html5lib is passed a non-seekable RawIOBase-like object. 1.0b2 ~~~~~ Released on June 27, 2013 * Removed reordering of attributes within the serializer. There is now an ``alphabetical_attributes`` option which preserves the previous behaviour through a new filter. This allows attribute order to be preserved through html5lib if the tree builder preserves order. * Removed ``dom2sax`` from DOM treebuilders. It has been replaced by ``treeadapters.sax.to_sax`` which is generic and supports any treewalker; it also resolves all known bugs with ``dom2sax``. * Fix treewalker assertions on hitting bytes strings on Python 2. Previous to 1.0b1, treewalkers coped with mixed bytes/unicode data on Python 2; this reintroduces this prior behaviour on Python 2. Behaviour is unchanged on Python 3. 1.0b1 ~~~~~ Released on May 17, 2013 * Implementation updated to implement the `HTML specification <http://www.whatwg.org/specs/web-apps/current-work/>`_ as of 5th May 2013 (`SVN <http://svn.whatwg.org/webapps/>`_ revision r7867). * Python 3.2+ supported in a single codebase using the ``six`` library. * Removed support for Python 2.5 and older. * Removed the deprecated Beautiful Soup 3 treebuilder. ``beautifulsoup4`` can use ``html5lib`` as a parser instead. Note that since it doesn't support namespaces, foreign content like SVG and MathML is parsed incorrectly. * Removed ``simpletree`` from the package. The default tree builder is now ``etree`` (using the ``xml.etree.cElementTree`` implementation if available, and ``xml.etree.ElementTree`` otherwise). * Removed the ``XHTMLSerializer`` as it never actually guaranteed its output was well-formed XML, and hence provided little of use. * Removed default DOM treebuilder, so ``html5lib.treebuilders.dom`` is no longer supported. ``html5lib.treebuilders.getTreeBuilder("dom")`` will return the default DOM treebuilder, which uses ``xml.dom.minidom``. * Optional heuristic character encoding detection now based on ``charade`` for Python 2.6 - 3.3 compatibility. * Optional ``Genshi`` treewalker support fixed. * Many bugfixes, including: * #33: null in attribute value breaks XML AttValue; * #4: nested, indirect descendant, <button> causes infinite loop; * `Google Code 215 <http://code.google.com/p/html5lib/issues/detail?id=215>`_: Properly detect seekable streams; * `Google Code 206 <http://code.google.com/p/html5lib/issues/detail?id=206>`_: add support for <video preload=...>, <audio preload=...>; * `Google Code 205 <http://code.google.com/p/html5lib/issues/detail?id=205>`_: add support for <video poster=...>; * `Google Code 202 <http://code.google.com/p/html5lib/issues/detail?id=202>`_: Unicode file breaks InputStream. * Source code is now mostly PEP 8 compliant. * Test harness has been improved and now depends on ``nose``. * Documentation updated and moved to http://html5lib.readthedocs.org/. 0.95 ~~~~ Released on February 11, 2012 0.90 ~~~~ Released on January 17, 2010 0.11.1 ~~~~~~ Released on June 12, 2008 0.11 ~~~~ Released on June 10, 2008 0.10 ~~~~ Released on October 7, 2007 0.9 ~~~ Released on March 11, 2007 0.2 ~~~ Released on January 8, 2007 Platform: UNKNOWN Classifier: Development Status :: 5 - Production/Stable Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: MIT License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 2 Classifier: Programming Language :: Python :: 2.6 Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.2 Classifier: Programming Language :: Python :: 3.3 Classifier: Topic :: Software Development :: Libraries :: Python Modules Classifier: Topic :: Text Processing :: Markup :: HTML