Switch to python3

This commit is contained in:
j 2014-09-30 18:15:32 +02:00
commit 9ba4b6a91a
5286 changed files with 677347 additions and 576888 deletions

@ -0,0 +1,344 @@
Metadata-Version: 1.1
Name: html5lib
Version: 0.999
Summary: HTML parser based on the WHATWG HTML specification
Home-page: https://github.com/html5lib/html5lib-python
Author: James Graham
Author-email: james@hoppipolla.co.uk
License: MIT License
Description: html5lib
========
.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
:target: https://travis-ci.org/html5lib/html5lib-python
html5lib is a pure-python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.
Usage
-----
Simple usage follows this pattern:
.. code-block:: python
import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)
or:
.. code-block:: python
import html5lib
document = html5lib.parse("<p>Hello World!")
By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
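Choosing the accelerated implementation when it is available is commonly written as an import fallback. The following is a minimal sketch of that pattern under stated assumptions, not html5lib's actual code; the alias ``ElementTree`` and the helper ``parse_fragment`` are hypothetical names used only for illustration:

```python
# Prefer the C-accelerated module where it exists (e.g. Python 2.x).
# Where it is absent, fall back to the standard module.
try:
    import xml.etree.cElementTree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree


def parse_fragment(markup):
    """Hypothetical helper: parse well-formed XML with whichever module loaded."""
    return ElementTree.fromstring(markup)


root = parse_fragment("<p>Hello World!</p>")
print(root.tag, root.text)
```

Either module exposes the same ``fromstring`` call, so the rest of the code does not need to know which one was imported.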
Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:
.. code-block:: python
import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When using html5lib with ``urllib2`` (Python 2), the charset from the HTTP
headers should be passed to html5lib as follows:
.. code-block:: python
from contextlib import closing
from urllib2 import urlopen
import html5lib
with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))
When using html5lib with ``urllib.request`` (Python 3), the charset from
the HTTP headers should be passed to html5lib as follows:
.. code-block:: python
from urllib.request import urlopen
import html5lib
with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
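On Python 3, the object returned by ``f.info()`` behaves like an ``email.message.Message``, so ``get_content_charset()`` can be tried out without a network connection. This is a minimal sketch assuming a hand-built header; the helper ``charset_from_content_type`` is a hypothetical name, not part of html5lib or ``urllib``:

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Hypothetical helper: return the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lower case, or None when the header names none.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

When the header carries no charset, the value passed as ``encoding`` is ``None``, and the parser is left to determine the encoding by other means.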
To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:
.. code-block:: python
import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:
.. code-block:: python
import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
More documentation is available at http://html5lib.readthedocs.org/.
Installation
------------
html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
use:
.. code-block:: bash
$ pip install html5lib
Optional Dependencies
---------------------
The following third-party libraries may be used for additional
functionality:
- ``datrie`` can be used to improve parsing performance (though in
almost all cases the improvement is marginal);
- ``lxml`` is supported as a tree format (for both building and
walking) under CPython (but *not* PyPy where it is known to cause
segfaults);
- ``genshi`` has a treewalker (but not builder);
- ``charade`` can be used as a fallback when character encoding cannot
be determined; ``chardet``, from which it was forked, can also be used
on Python 2; and
- ``ordereddict`` can be used under Python 2.6
(``collections.OrderedDict`` is used instead on later versions) to
serialize attributes in alphabetical order.
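To see which of these optional packages can be imported in the current environment, something like the following can be used. It is a sketch using only the standard library (Python 3.4+); the function name ``available_optional_modules`` and the module tuple are assumptions drawn from the list above, and it does not reflect how html5lib itself detects them:

```python
import importlib.util

# Module names taken from the optional dependencies listed above.
OPTIONAL_MODULES = ("datrie", "lxml", "genshi", "charade", "chardet", "ordereddict")


def available_optional_modules(names=OPTIONAL_MODULES):
    """Hypothetical helper: map each module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, present in sorted(available_optional_modules().items()):
    print(name, "available" if present else "missing")
```

``find_spec`` only locates a top-level module without importing it, so the check does not load any of the packages.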
Bugs
----
Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.
Tests
-----
Unit tests require the ``nose`` library and can be run using the
``nosetests`` command in the root directory; ``ordereddict`` is
required under Python 2.6. All should pass.
Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, thus for git checkouts they must be initialized::
$ git submodule init
$ git submodule update
If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.
Questions?
----------
There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.
Change Log
----------
0.999
~~~~~
Released on December 23, 2013
* Fix #127: add work-around for CPython issue #20007: .read(0) on
http.client.HTTPResponse drops the rest of the content.
* Fix #115: lxml treewalker can now deal with fragments containing, at
their root level, text nodes with non-ASCII characters on Python 2.
0.99
~~~~
Released on September 10, 2013
* No library changes from 1.0b3; released as 0.99 as pip has changed
behaviour from 1.4 to avoid installing pre-release versions per
PEP 440.
1.0b3
~~~~~
Released on July 24, 2013
* Removed ``RecursiveTreeWalker`` from ``treewalkers._base``. Any
implementation using it should be moved to
``NonRecursiveTreeWalker``, as everything bundled with html5lib has
for years.
* Fix #67 so that ``BufferedStream`` correctly returns a bytes
object, thereby fixing any case where html5lib is passed a
non-seekable RawIOBase-like object.
1.0b2
~~~~~
Released on June 27, 2013
* Removed reordering of attributes within the serializer. There is now
an ``alphabetical_attributes`` option which preserves the previous
behaviour through a new filter. This allows attribute order to be
preserved through html5lib if the tree builder preserves order.
* Removed ``dom2sax`` from DOM treebuilders. It has been replaced by
``treeadapters.sax.to_sax`` which is generic and supports any
treewalker; it also resolves all known bugs with ``dom2sax``.
* Fix treewalker assertions on hitting bytes strings on
Python 2. Previous to 1.0b1, treewalkers coped with mixed
bytes/unicode data on Python 2; this reintroduces this prior
behaviour on Python 2. Behaviour is unchanged on Python 3.
1.0b1
~~~~~
Released on May 17, 2013
* Implementation updated to implement the `HTML specification
<http://www.whatwg.org/specs/web-apps/current-work/>`_ as of 5th May
2013 (`SVN <http://svn.whatwg.org/webapps/>`_ revision r7867).
* Python 3.2+ supported in a single codebase using the ``six`` library.
* Removed support for Python 2.5 and older.
* Removed the deprecated Beautiful Soup 3 treebuilder.
``beautifulsoup4`` can use ``html5lib`` as a parser instead. Note that
since it doesn't support namespaces, foreign content like SVG and
MathML is parsed incorrectly.
* Removed ``simpletree`` from the package. The default tree builder is
now ``etree`` (using the ``xml.etree.cElementTree`` implementation if
available, and ``xml.etree.ElementTree`` otherwise).
* Removed the ``XHTMLSerializer`` as it never actually guaranteed its
output was well-formed XML, and hence provided little of use.
* Removed default DOM treebuilder, so ``html5lib.treebuilders.dom`` is no
longer supported. ``html5lib.treebuilders.getTreeBuilder("dom")`` will
return the default DOM treebuilder, which uses ``xml.dom.minidom``.
* Optional heuristic character encoding detection now based on
``charade`` for Python 2.6 - 3.3 compatibility.
* Optional ``Genshi`` treewalker support fixed.
* Many bugfixes, including:
* #33: null in attribute value breaks XML AttValue;
* #4: nested, indirect descendant, <button> causes infinite loop;
* `Google Code 215
<http://code.google.com/p/html5lib/issues/detail?id=215>`_: Properly
detect seekable streams;
* `Google Code 206
<http://code.google.com/p/html5lib/issues/detail?id=206>`_: add
support for <video preload=...>, <audio preload=...>;
* `Google Code 205
<http://code.google.com/p/html5lib/issues/detail?id=205>`_: add
support for <video poster=...>;
* `Google Code 202
<http://code.google.com/p/html5lib/issues/detail?id=202>`_: Unicode
file breaks InputStream.
* Source code is now mostly PEP 8 compliant.
* Test harness has been improved and now depends on ``nose``.
* Documentation updated and moved to http://html5lib.readthedocs.org/.
0.95
~~~~
Released on February 11, 2012
0.90
~~~~
Released on January 17, 2010
0.11.1
~~~~~~
Released on June 12, 2008
0.11
~~~~
Released on June 10, 2008
0.10
~~~~
Released on October 7, 2007
0.9
~~~
Released on March 11, 2007
0.2
~~~
Released on January 8, 2007
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: HTML

@ -0,0 +1,42 @@
README.rst
html5lib/__init__.py
html5lib/constants.py
html5lib/html5parser.py
html5lib/ihatexml.py
html5lib/inputstream.py
html5lib/sanitizer.py
html5lib/tokenizer.py
html5lib/utils.py
html5lib.egg-info/PKG-INFO
html5lib.egg-info/SOURCES.txt
html5lib.egg-info/dependency_links.txt
html5lib.egg-info/requires.txt
html5lib.egg-info/top_level.txt
html5lib/filters/__init__.py
html5lib/filters/_base.py
html5lib/filters/alphabeticalattributes.py
html5lib/filters/inject_meta_charset.py
html5lib/filters/lint.py
html5lib/filters/optionaltags.py
html5lib/filters/sanitizer.py
html5lib/filters/whitespace.py
html5lib/serializer/__init__.py
html5lib/serializer/htmlserializer.py
html5lib/treeadapters/__init__.py
html5lib/treeadapters/sax.py
html5lib/treebuilders/__init__.py
html5lib/treebuilders/_base.py
html5lib/treebuilders/dom.py
html5lib/treebuilders/etree.py
html5lib/treebuilders/etree_lxml.py
html5lib/treewalkers/__init__.py
html5lib/treewalkers/_base.py
html5lib/treewalkers/dom.py
html5lib/treewalkers/etree.py
html5lib/treewalkers/genshistream.py
html5lib/treewalkers/lxmletree.py
html5lib/treewalkers/pulldom.py
html5lib/trie/__init__.py
html5lib/trie/_base.py
html5lib/trie/datrie.py
html5lib/trie/py.py

@ -0,0 +1,78 @@
../html5lib/utils.py
../html5lib/ihatexml.py
../html5lib/__init__.py
../html5lib/tokenizer.py
../html5lib/html5parser.py
../html5lib/sanitizer.py
../html5lib/inputstream.py
../html5lib/constants.py
../html5lib/serializer/__init__.py
../html5lib/serializer/htmlserializer.py
../html5lib/treebuilders/_base.py
../html5lib/treebuilders/__init__.py
../html5lib/treebuilders/etree_lxml.py
../html5lib/treebuilders/dom.py
../html5lib/treebuilders/etree.py
../html5lib/filters/whitespace.py
../html5lib/filters/_base.py
../html5lib/filters/__init__.py
../html5lib/filters/sanitizer.py
../html5lib/filters/lint.py
../html5lib/filters/optionaltags.py
../html5lib/filters/inject_meta_charset.py
../html5lib/filters/alphabeticalattributes.py
../html5lib/treewalkers/pulldom.py
../html5lib/treewalkers/_base.py
../html5lib/treewalkers/genshistream.py
../html5lib/treewalkers/__init__.py
../html5lib/treewalkers/dom.py
../html5lib/treewalkers/etree.py
../html5lib/treewalkers/lxmletree.py
../html5lib/trie/datrie.py
../html5lib/trie/_base.py
../html5lib/trie/__init__.py
../html5lib/trie/py.py
../html5lib/treeadapters/sax.py
../html5lib/treeadapters/__init__.py
../html5lib/__pycache__/utils.cpython-34.pyc
../html5lib/__pycache__/ihatexml.cpython-34.pyc
../html5lib/__pycache__/__init__.cpython-34.pyc
../html5lib/__pycache__/tokenizer.cpython-34.pyc
../html5lib/__pycache__/html5parser.cpython-34.pyc
../html5lib/__pycache__/sanitizer.cpython-34.pyc
../html5lib/__pycache__/inputstream.cpython-34.pyc
../html5lib/__pycache__/constants.cpython-34.pyc
../html5lib/serializer/__pycache__/__init__.cpython-34.pyc
../html5lib/serializer/__pycache__/htmlserializer.cpython-34.pyc
../html5lib/treebuilders/__pycache__/_base.cpython-34.pyc
../html5lib/treebuilders/__pycache__/__init__.cpython-34.pyc
../html5lib/treebuilders/__pycache__/etree_lxml.cpython-34.pyc
../html5lib/treebuilders/__pycache__/dom.cpython-34.pyc
../html5lib/treebuilders/__pycache__/etree.cpython-34.pyc
../html5lib/filters/__pycache__/whitespace.cpython-34.pyc
../html5lib/filters/__pycache__/_base.cpython-34.pyc
../html5lib/filters/__pycache__/__init__.cpython-34.pyc
../html5lib/filters/__pycache__/sanitizer.cpython-34.pyc
../html5lib/filters/__pycache__/lint.cpython-34.pyc
../html5lib/filters/__pycache__/optionaltags.cpython-34.pyc
../html5lib/filters/__pycache__/inject_meta_charset.cpython-34.pyc
../html5lib/filters/__pycache__/alphabeticalattributes.cpython-34.pyc
../html5lib/treewalkers/__pycache__/pulldom.cpython-34.pyc
../html5lib/treewalkers/__pycache__/_base.cpython-34.pyc
../html5lib/treewalkers/__pycache__/genshistream.cpython-34.pyc
../html5lib/treewalkers/__pycache__/__init__.cpython-34.pyc
../html5lib/treewalkers/__pycache__/dom.cpython-34.pyc
../html5lib/treewalkers/__pycache__/etree.cpython-34.pyc
../html5lib/treewalkers/__pycache__/lxmletree.cpython-34.pyc
../html5lib/trie/__pycache__/datrie.cpython-34.pyc
../html5lib/trie/__pycache__/_base.cpython-34.pyc
../html5lib/trie/__pycache__/__init__.cpython-34.pyc
../html5lib/trie/__pycache__/py.cpython-34.pyc
../html5lib/treeadapters/__pycache__/sax.cpython-34.pyc
../html5lib/treeadapters/__pycache__/__init__.cpython-34.pyc
./
dependency_links.txt
PKG-INFO
SOURCES.txt
top_level.txt
requires.txt

@ -0,0 +1 @@
html5lib