This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than master.
Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s
The performance fluctuates a little less as well.
Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.
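To illustrate the idea of sink-based extraction, here is a minimal sketch. It does not use the real `TokenSink` trait from the HTML tokenizer; `Token`, `Sink`, `LinkSink` and `is_link_attribute` are hypothetical, simplified stand-ins, and the element/attribute table is only a small illustrative subset of the pairs that can hold URLs.

```rust
/// Simplified stand-in for a tokenizer event (not the real tokenizer's token type).
pub enum Token<'a> {
    StartTag { name: &'a str, attrs: Vec<(&'a str, &'a str)> },
    EndTag { name: &'a str },
    Text(&'a str),
}

/// Hypothetical sink trait: it receives tokens one by one, so no DOM tree is built.
pub trait Sink {
    fn process_token(&mut self, token: Token<'_>);
}

/// Returns true when `attr` on `element` holds a URL.
/// Illustrative subset only; the real table is larger.
fn is_link_attribute(element: &str, attr: &str) -> bool {
    matches!(
        (element, attr),
        ("a", "href")
            | ("area", "href")
            | ("link", "href")
            | ("img", "src")
            | ("script", "src")
            | ("iframe", "src")
            | ("form", "action")
            | ("video", "poster")
            | ("blockquote", "cite")
    )
}

/// Collects link strings as tokens stream past.
#[derive(Default)]
pub struct LinkSink {
    pub links: Vec<String>,
}

impl Sink for LinkSink {
    fn process_token(&mut self, token: Token<'_>) {
        // Only start tags carry attributes; end tags and text are ignored.
        if let Token::StartTag { name, attrs } = token {
            for (attr, value) in attrs {
                if is_link_attribute(name, attr) {
                    self.links.push(value.to_string());
                }
            }
        }
    }
}
```

Feeding the sink a start tag such as `a` with `href` set pushes the attribute value onto `links`; anything that is not a start tag is dropped without allocation.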
Furthermore, this cleans up a lot of papercuts around our types. We now differentiate between a `RawUri` (a stringly-typed, unparsed value) and a `Uri`, which is a properly parsed URI type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
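The split between the two stages could look roughly like the sketch below. The field names, the `Request` struct, and the functions `extract` and `collect` are assumptions made for illustration, and the scheme check stands in for a real URL parser.

```rust
/// Unparsed link text exactly as found in the input (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawUri {
    pub text: String,
}

/// A validated URI. A real implementation would use a proper URL parser;
/// this sketch only checks for a syntactically valid scheme.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Uri {
    pub scheme: String,
    pub rest: String,
}

impl Uri {
    pub fn parse(raw: &RawUri) -> Option<Uri> {
        let (scheme, rest) = raw.text.trim().split_once(':')?;
        let starts_with_letter = scheme
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_alphabetic());
        let valid_chars = scheme
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'));
        if !starts_with_letter || !valid_chars || rest.is_empty() {
            return None;
        }
        Some(Uri {
            scheme: scheme.to_ascii_lowercase(),
            rest: rest.to_string(),
        })
    }
}

/// Hypothetical request object produced by the collector.
#[derive(Debug, PartialEq, Eq)]
pub struct Request {
    pub uri: Uri,
}

/// Extractor stage: finds candidate strings and does no validation.
pub fn extract(plaintext: &str) -> Vec<RawUri> {
    plaintext
        .split_whitespace()
        .filter(|word| word.contains("://"))
        .map(|word| RawUri { text: word.to_string() })
        .collect()
}

/// Collector stage: parses raw strings and builds requests, dropping invalid ones.
pub fn collect(raw: &[RawUri]) -> Vec<Request> {
    raw.iter()
        .filter_map(Uri::parse)
        .map(|uri| Request { uri })
        .collect()
}
```

Keeping `RawUri` stringly-typed means the extractor never has to fail; all parsing errors surface in one place, the collector.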
* Reqwest comes with its own request pool, so there's no need to add
another layer of indirection. This also gets rid of a lot of allocations.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file
Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
* Fix HTML parsing for non-closed elements like <link>
The XML parser we use requires all tags to be closed by default,
and if they aren't (like HTML5 <link> elements), it simply gives up
on further parsing. This change makes it ignore such issues.
This also uncovers a bug with the current parser: it simply won't parse
elements like `<script defer src="..."></script>`, i.e. elements
with attributes that have no value.
The current parser is a strict XML parser and will have to be replaced
with an HTML-aware parser in the future.
* Add check for empty elements
* Update extract.rs
Co-authored-by: Matthias <matthias-endler@gmx.net>
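As background for the non-closed elements fix above: HTML defines a fixed set of void elements, such as `<link>`, that never have a closing tag, which is why a strict XML parser trips over them. The sketch below is not the change made in this commit; `is_void_element` and `unclosed_elements` are hypothetical helpers showing how a tag tracker can skip void elements instead of waiting for a close that never comes.

```rust
/// HTML void elements, which never have a closing tag (per the HTML standard).
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source",
    "track", "wbr",
];

/// Case-insensitive check, since HTML tag names are case-insensitive.
pub fn is_void_element(name: &str) -> bool {
    VOID_ELEMENTS.iter().any(|v| v.eq_ignore_ascii_case(name))
}

/// Walks a stream of `(tag_name, is_closing)` events and returns the names of
/// elements that are still open at the end. Void elements are never pushed.
pub fn unclosed_elements(events: &[(&str, bool)]) -> Vec<String> {
    let mut stack: Vec<String> = Vec::new();
    for &(name, is_closing) in events {
        if is_void_element(name) {
            continue;
        }
        if is_closing {
            // Close the nearest matching open element and everything nested in it.
            if let Some(pos) = stack.iter().rposition(|open| open.eq_ignore_ascii_case(name)) {
                stack.truncate(pos);
            }
        } else {
            stack.push(name.to_ascii_lowercase());
        }
    }
    stack
}
```

With this approach a `<link>` inside `<head>` does not leave the tracker in an unbalanced state, so parsing can continue past it.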
For now we only support JSON.
I honestly don't know if it makes sense to include other formats.
For example, MD and HTML are not really machine-readable. YAML is not
a great standard format for this use case. Open to discussion, though.
* Make GITHUB_TOKEN optional
This also makes it possible to pass the token in via CLI args.
* Add missing test fixture file
* Normalize exit codes and GitHub checking behavior
The exit code is now defined as 1 for unexpected or config errors,
and 2 for link check failures.
GitHub checking behavior has been tweaked to generate errors if
a GitHub-specific check cannot be performed because of a missing
token.
* Remove short flag for github token
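The token handling and exit code rules from the commits above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ExitCode`, `Status`, `resolve_github_token`, `check_github` and `exit_code` are hypothetical, `Success = 0` follows the usual convention rather than anything stated above, and the CLI value taking precedence over the environment variable is an assumption.

```rust
use std::env;

/// Exit codes as described above: 1 for unexpected or config errors,
/// 2 for link check failures. `Success = 0` is the usual convention (assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitCode {
    Success = 0,
    UnexpectedFailure = 1,
    LinkCheckFailure = 2,
}

/// Outcome of checking one link (hypothetical, simplified).
#[derive(Debug, PartialEq, Eq)]
pub enum Status {
    Ok,
    Failed(String),
}

/// Treats blank strings as "not set".
fn non_empty(value: Option<String>) -> Option<String> {
    value
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
}

/// Assumption: a token passed on the command line wins over the environment.
pub fn resolve_github_token(cli: Option<String>, env_value: Option<String>) -> Option<String> {
    non_empty(cli).or_else(|| non_empty(env_value))
}

/// Reads `GITHUB_TOKEN` if no CLI value was given; the token stays optional.
pub fn github_token(cli: Option<String>) -> Option<String> {
    resolve_github_token(cli, env::var("GITHUB_TOKEN").ok())
}

/// A GitHub-specific check without a token yields an error instead of being skipped.
pub fn check_github(url: &str, token: Option<&str>) -> Status {
    match token {
        // Placeholder: real code would query the GitHub API here.
        Some(_) => Status::Ok,
        None => Status::Failed(format!("{url}: GitHub token required for this check")),
    }
}

/// Maps link check results to the process exit code.
pub fn exit_code(results: &[Status]) -> ExitCode {
    if results.iter().all(|s| *s == Status::Ok) {
        ExitCode::Success
    } else {
        ExitCode::LinkCheckFailure
    }
}
```

A caller would convert the result with `exit_code(&results) as i32` and pass it to `std::process::exit`, reserving `UnexpectedFailure` for config or runtime errors that occur before any link is checked.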