# -*- coding: iso-8859-1 -*-
# Copyright (C) 2000-2009 Bastian Kleineidam
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
"""
Fast HTML parser module written in C with the following features:

- Reentrant
  As soon as any HTML string data is available, we try to feed it
  to the HTML parser. This means that the parser has to scan possibly
  incomplete data, recognizing as much as it can. Incomplete trailing
  data is saved for subsequent calls, or it is just flushed into the
  output buffer with the flush() function.
  A reset() brings the parser back to its initial state, throwing away
  all buffered data.

- Coping with HTML syntax errors
  The parser recognizes as much as it can and passes the rest
  of the data as TEXT tokens.
  The scanner only passes complete recognized HTML syntax elements to
  the parser. Invalid syntax elements are passed as TEXT. This way we do
  not need the bison error recovery.
  Incomplete data is rescanned the next time the parser calls yylex() or
  when it is being flush()ed.

  The following syntax errors will be recognized correctly:

    - Unquoted attribute values.
    - Missing beginning quote of attribute values.
    - Invalid "" end tags in script mode.
    - Missing ">" in tags.
    - Invalid characters in tag or attribute names.

  The following syntax errors will not be recognized:

    - Missing end quote of attribute values. On the TODO list.
    - Unknown HTML tag or attribute names.
    - Invalid nesting of tags.

  Additionally the parser has the following features:

    - NULL bytes are changed into spaces
    - inside a