Slicing HTML with CSS selectors

This code extracts HTML that matches CSS3-ish selectors.

I won't say I outright stole Mark Pilgrim's HTML parser code, but how many ways are there to use sgmllib? I certainly used his code as a model though.

It doesn't support combinators (foo + bar, foo > bar, foo ~ bar), though it handles descendant selectors (foo bar) just fine. Actually, it will parse the combinators; it just doesn't record them in any way, so the HTML parser ignores them.
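To make the descendant case concrete, here is a rough sketch of how a match could be checked against the stack of currently open tags. It is not the actual Stapler.py code: the function name matches_descendant and the plain list-of-tag-names representation are assumptions made for illustration, and it only compares tag names (no attributes).

```python
def matches_descendant(selector_tags, open_tags):
    """Check a descendant selector against the open-tag stack.

    selector_tags: tag names from the selector, e.g. ["div", "p"] for "div p".
    open_tags: names of the currently open elements, outermost first,
               ending with the candidate element itself.
    Both shapes are hypothetical, chosen only for this sketch.
    """
    if not selector_tags or not open_tags:
        return False
    # The rightmost part of the selector must match the element itself.
    if selector_tags[-1] != open_tags[-1]:
        return False
    # The remaining parts must appear, in order, somewhere among the
    # ancestors. Testing membership on an iterator consumes it, so each
    # part has to be found after the previous one.
    ancestors = iter(open_tags[:-1])
    return all(tag in ancestors for tag in selector_tags[:-1])
```

With this, "div p" matches a p nested anywhere inside a div, no matter how many elements sit in between, which is all a descendant selector asks for.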

It doesn't do pseudos (e.g., :first-child) either. That was too much engineering up front to accomplish in one thunk, and, well... this is already a step up from the application it's replacing.

The attribute comparators it supports are:

=
Attribute must exactly equal the value.
~=
Attribute must contain the value in a space-separated list (e.g., class attributes).
|=
Attribute must contain the value in a hyphen-separated list (e.g., lang attributes).
^=
Attribute must begin with the value.
$=
Attribute must end with the value.
*=
Attribute must contain, somewhere, the value.

For example, the selector img[src*="foo"] will match all img tags with the text "foo" in their src URLs.
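The comparators boil down to a handful of string tests. The sketch below is one way to write them, following the descriptions in this post rather than the real implementation; the function name attribute_matches and its argument order are made up for the example. Note that |= here is the loose "hyphen-separated list" reading described above, whereas the CSS specification is stricter (the value must equal the attribute or be its prefix followed by a hyphen).

```python
def attribute_matches(actual, comparator, expected):
    """Return True if the attribute value satisfies the comparator.

    actual: the attribute's value on the tag, or None if it is absent.
    comparator: one of "=", "~=", "|=", "^=", "$=", "*=".
    expected: the value written in the selector.
    """
    if actual is None:
        # A missing attribute never matches.
        return False
    if comparator == "=":
        return actual == expected
    if comparator == "~=":
        # Space-separated list, e.g. class="nav main".
        return expected in actual.split()
    if comparator == "|=":
        # Hyphen-separated list, e.g. lang="en-US" (loose reading).
        return expected in actual.split("-")
    if comparator == "^=":
        return actual.startswith(expected)
    if comparator == "$=":
        return actual.endswith(expected)
    if comparator == "*=":
        return expected in actual
    raise ValueError("unknown comparator: %r" % (comparator,))
```

So for img[src*="foo"], a tag with src="http://example.com/foo.png" would pass the "*=" test, while one with src="http://example.com/bar.png" would not.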

Yes, this is for Stapler.py. Thanks for asking.