This code extracts HTML that matches CSS3-ish selectors. It comes in two files:
- selectorp.g, the grammar file for Amit J. Patel's Yapps 2.
- htmlp.py, the SGMLParser that uses the generated selectorp.py (I'm not putting selectorp.py itself up, since you need the yappsrt.py module to run it anyway).
I won't say I outright stole Mark Pilgrim's HTML parser code, but how many ways are there to use sgmllib? I certainly used his code as a model though.
It doesn't support combinators (foo + bar, foo > bar, foo ~ bar), though it handles descendant selectors (foo bar) just fine. Actually, the grammar will parse the combinators; it just doesn't record them in any way, so the HTML parser ignores them.
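To make the descendant case concrete, here is a minimal sketch of how a "foo bar" selector can be matched while parsing: keep a stack of open tags and check whether the selector's parts appear, in order, along the path to the current element. This is not the code in htmlp.py. It uses html.parser from the Python 3 standard library (sgmllib is gone in Python 3), and the names chain_matches, DescendantFinder, and VOID are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def chain_matches(path, chain):
    """path: tag names from the root down to the current element.
    chain: the parts of a descendant selector, e.g. ["div", "a"] for "div a".
    """
    if not chain or not path or path[-1] != chain[-1]:
        return False
    # The remaining parts must appear, in order, among the ancestors.
    # Testing membership on an iterator consumes it, which enforces the order.
    ancestors = iter(path[:-1])
    return all(part in ancestors for part in chain[:-1])


class DescendantFinder(HTMLParser):
    """Hypothetical helper: collects start tags matching a descendant chain."""

    def __init__(self, chain):
        super().__init__()
        self.chain = chain
        self.stack = []   # currently open tags, outermost first
        self.hits = []    # (tag, attribute dict) for each match

    def handle_starttag(self, tag, attrs):
        if chain_matches(self.stack + [tag], self.chain):
            self.hits.append((tag, dict(attrs)))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


finder = DescendantFinder(["div", "a"])
finder.feed('<div><p><a href="x">1</a></p></div><a href="y">2</a>')
print(finder.hits)
```

Only the first link is reported, because the second one has no div ancestor. Supporting a combinator such as foo > bar would mean recording it in the chain and checking the immediate parent instead of any ancestor.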
It doesn't do pseudo-classes (e.g., :first-child) either. That was too much engineering up front to accomplish in one thunk, and, well... this is already a step up from the application it's replacing.
The attribute comparators it supports are:
- =
- Attribute must exactly equal the value.
- ~=
- Attribute must contain the value in a space-separated list (e.g., class attributes).
- |=
- Attribute must contain the value in a hyphen-separated list (i.e., lang attributes).
- ^=
- Attribute must begin with the value.
- $=
- Attribute must end with the value.
- *=
- Attribute must contain, somewhere, the value.
For example, the selector img[src*="foo"] will match all img tags with the text "foo" anywhere in their src URLs.
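The six comparators map onto ordinary string operations. The following is a rough sketch that follows the descriptions in the list above, not the actual code in htmlp.py; the function name attr_matches and its argument order are assumptions made for this example.

```python
def attr_matches(actual, op, expected):
    """Return True if the attribute value `actual` satisfies `op` against `expected`.

    `actual` is None when the tag does not carry the attribute at all.
    """
    if actual is None:
        return False
    if op == "=":
        return actual == expected
    if op == "~=":
        # Space-separated list, e.g. class="nav main".
        return expected in actual.split()
    if op == "|=":
        # Hyphen-separated list, e.g. lang="en-US" (as described above).
        return expected in actual.split("-")
    if op == "^=":
        return actual.startswith(expected)
    if op == "$=":
        return actual.endswith(expected)
    if op == "*=":
        return expected in actual
    raise ValueError("unknown comparator: %r" % (op,))


# The img[src*="foo"] example from the text:
print(attr_matches("/images/foo.png", "*=", "foo"))
```

One caveat: the CSS3 selectors specification defines |= more narrowly, as a value that is exactly the given string or starts with it followed by a hyphen. The sketch sticks to the looser wording used in the list above.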
Yes, this is for Stapler.py. Thanks for asking.