Commit graph

32 commits

Author SHA1 Message Date
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids building a DOM tree for link extraction and instead uses a `TokenSink` to extract links on the fly. In hyperfine benchmarks it was about 10-25% faster than master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

This change also cleans up a lot of papercuts around our types. We now differentiate between a `RawUri` (a stringly-typed value) and a `Uri`, which is a properly parsed URI type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00
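The split between `RawUri` and `Uri` described in the commit above can be illustrated with a minimal, std-only sketch. Everything here is an assumption made for illustration: the field names, the `Uri` variants, and the `parse_uri` and `collect` helpers are hypothetical and do not reflect the project's actual types, and the hand-rolled scheme check only stands in for a real URI parser.

```rust
/// Hypothetical stringly-typed value, as an extractor might emit it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawUri {
    /// The link text exactly as found in the document.
    pub text: String,
    /// Element the link was found in, if known (e.g. "a").
    pub element: Option<String>,
    /// Attribute the link was found in, if known (e.g. "href").
    pub attribute: Option<String>,
}

/// Hypothetical parsed type; a real implementation would wrap a proper URL type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Uri {
    Web(String),
    Mail(String),
}

/// Stand-in parser: accepts http(s) URLs and mailto addresses, rejects the rest.
pub fn parse_uri(raw: &RawUri) -> Option<Uri> {
    let text = raw.text.trim();
    if let Some(addr) = text.strip_prefix("mailto:") {
        if addr.contains('@') {
            return Some(Uri::Mail(addr.to_string()));
        }
        return None;
    }
    if text.starts_with("http://") || text.starts_with("https://") {
        return Some(Uri::Web(text.to_string()));
    }
    None
}

/// Collector step: turns raw extractor output into parsed values,
/// silently dropping anything that does not parse.
pub fn collect(raw: &[RawUri]) -> Vec<Uri> {
    raw.iter().filter_map(parse_uri).collect()
}
```

Keeping the extractor's output as plain strings means extraction never fails on malformed links; validation happens once, in the collecting step.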
Matthias
591cbdbebb
Add support for .lycheeignore file #308 (#402)
This is similar to files like .gitignore and .dockerignore
and gets merged into exclude_files
2021-11-23 01:39:53 +01:00
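One way to read "gets merged into exclude_files" in the commit above is that an existing `.lycheeignore` is appended to the list of exclude files before they are loaded. The sketch below assumes exactly that; the function name `merge_ignore_file`, its signature, and the lookup in a given directory are hypothetical and not taken from the project.

```rust
use std::path::{Path, PathBuf};

/// File name taken from the commit message.
const LYCHEE_IGNORE_FILE: &str = ".lycheeignore";

/// Hypothetical helper: appends `<dir>/.lycheeignore` to the exclude files
/// if it exists as a regular file and is not already listed.
pub fn merge_ignore_file(mut exclude_files: Vec<PathBuf>, dir: &Path) -> Vec<PathBuf> {
    let candidate = dir.join(LYCHEE_IGNORE_FILE);
    if candidate.is_file() && !exclude_files.contains(&candidate) {
        exclude_files.push(candidate);
    }
    exclude_files
}
```

With this shape, explicitly configured exclude files keep their order and the ignore file is only picked up when present, so projects without one behave as before.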
MichaIng
961f12e58e
Remove cache from collector and remove custom reqwest client pool
* Reqwest comes with its own request pool, so there's no need to add
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file

Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-10-07 18:07:18 +02:00
Matthias
a75cae54b1 Add failing test 2021-09-09 01:17:56 +02:00
Matthias
5d0b95271d Remove anchor from file links 2021-09-07 00:20:09 +02:00
Matthias
03f5df91cd Add fixtures for offline testing 2021-09-06 15:20:18 +02:00
Matthias
495f856c61 cleanup 2021-09-06 15:19:24 +02:00
Matthias
ee70e13bf7 Check real link to file 2021-09-06 15:19:09 +02:00
Matthias
f5ee472d93 explicit naming 2021-09-06 15:19:09 +02:00
Matthias Endler
701fbc9ada Add support for local files 2021-09-06 15:14:33 +02:00
Lucius Hu
80b8a856ac
Add new flag --require-https (#195) 2021-09-04 03:21:54 +02:00
dblock
dcee4a1058 Added support for --exclude-file. 2021-09-03 16:29:57 +02:00
dblock
739a3d6e41 Fix: remove URL that is currently returning a 503. 2021-09-03 16:29:57 +02:00
Matthias
fe399c0a8c
Simple URI cache (#243) 2021-05-04 13:28:39 +02:00
Matthias
164e1aea7e
Add support for multiple schemes (#237) 2021-04-26 18:24:54 +02:00
Matthias
f8426bafbf
Skip unsupported schemes (#236) 2021-04-26 17:16:58 +02:00
Matthias Endler
2b044a6f5b Fix exclude mail, add tests 2021-03-29 23:28:17 +02:00
Matthias Endler
5baaba3948 Add integration test 2021-02-28 19:09:11 +01:00
Matthias Endler
e00cdbf1ae example.com -> example.org 2021-02-21 16:33:33 +01:00
Matthias
702909c4ab
Mailto support (#138)
* Add mailto support and use try_from for parsing URLs
* Cleanup and document code
2021-02-12 10:25:33 +01:00
Paweł Romanowski
aeab85da16
Use html5ever for HTML link extraction (#98) 2021-01-08 16:41:13 +01:00
Paweł Romanowski
cd00fa643e
Fix HTML parsing for non-closed elements like <link> (#92)
* Fix HTML parsing for non-closed elements like <link>

The XML parser we use requires all tags to be closed by default,
and if they aren't (like HTML5 <link> elements), it simply gives up
on further parsing.  This change makes it ignore such issues.

This also uncovers a bug in the current parser: it simply won't parse
elements like `<script defer src="..."></script>`, i.e. elements that
have attributes with no values.

The XML parser is an XML parser and will have to be replaced with
an HTML-aware parser in the future.

* Add check for empty elements

* Update extract.rs

Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-01-03 17:32:13 +01:00
Matthias
a78e8318cd
Add (machine-readable) output file support (fixes #53)
For now we only support JSON.
I honestly don't know if it makes sense to include other formats.
For example, MD and HTML are not really machine-readable, and YAML is not
a great standard format for this use case. Open for discussion, though.
2020-12-14 01:15:14 +01:00
Paweł Romanowski
1f787613d4
Add support for reading from stdin and make input handling more robust (closes #26)
* Adds a `skip_missing` flag
* Adds an `Input` enum to handle different types of inputs
2020-12-02 23:28:37 +01:00
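The `Input` enum and `skip_missing` flag from the commit above could look roughly like the following. This is a sketch under stated assumptions, not the project's code: the variant names, the `-` convention for stdin, and the `read` method returning `Ok(None)` for a skipped file are all hypothetical.

```rust
use std::fs;
use std::io::{self, Read};
use std::path::PathBuf;

/// Hypothetical input kinds; the real enum may have more variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Input {
    /// Read the document from standard input.
    Stdin,
    /// Read the document from a file on disk.
    FsPath(PathBuf),
}

impl Input {
    /// Assumption: "-" selects stdin, anything else is treated as a path.
    pub fn new(arg: &str) -> Self {
        if arg == "-" {
            Input::Stdin
        } else {
            Input::FsPath(PathBuf::from(arg))
        }
    }

    /// Returns the contents, or `Ok(None)` when the file is missing
    /// and `skip_missing` is set. Other I/O errors are passed through.
    pub fn read(&self, skip_missing: bool) -> io::Result<Option<String>> {
        match self {
            Input::Stdin => {
                let mut buf = String::new();
                io::stdin().read_to_string(&mut buf)?;
                Ok(Some(buf))
            }
            Input::FsPath(path) => match fs::read_to_string(path) {
                Ok(contents) => Ok(Some(contents)),
                Err(e) if skip_missing && e.kind() == io::ErrorKind::NotFound => Ok(None),
                Err(e) => Err(e),
            },
        }
    }
}
```

Modelling inputs as an enum keeps the "where does the text come from" decision in one place, so the rest of the pipeline only sees strings.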
Paweł Romanowski
326683f4eb
Make GITHUB_TOKEN optional (#22)
* Make GITHUB_TOKEN optional

This also makes it possible to pass the token in via CLI args.

* Add missing test fixture file

* Normalize exit codes and GitHub checking behavior

The exit code is now defined as 1 for unexpected or config errors,
and 2 for link check failures.

GitHub checking behavior has been tweaked to generate errors if
a GitHub-specific check cannot be performed because of a missing
token.

* Remove short flag for github token
2020-10-26 23:31:31 +01:00
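The exit-code convention stated in the commit above (1 for unexpected or config errors, 2 for link check failures) can be captured in a small mapping. The type and function names below are hypothetical, and treating 0 as success is an assumption based on the usual process convention rather than on the commit message.

```rust
/// Hypothetical exit codes; 1 and 2 follow the commit message,
/// 0 for success is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitCode {
    Success = 0,
    /// Unexpected or configuration error.
    UnexpectedFailure = 1,
    /// At least one link check failed.
    LinkCheckFailure = 2,
}

/// Hypothetical summary of a run.
pub enum RunOutcome {
    AllLinksOk,
    /// Number of links that failed their check.
    LinksFailed(usize),
    /// Description of a config or other unexpected error.
    ConfigError(String),
}

/// Maps a run outcome to the process exit code.
pub fn exit_code_for(outcome: &RunOutcome) -> ExitCode {
    match outcome {
        RunOutcome::AllLinksOk => ExitCode::Success,
        RunOutcome::LinksFailed(_) => ExitCode::LinkCheckFailure,
        RunOutcome::ConfigError(_) => ExitCode::UnexpectedFailure,
    }
}
```

A binary could end with `std::process::exit(exit_code_for(&outcome) as i32)`, which lets CI scripts tell broken links apart from a misconfigured run.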
WhizSid
6bd7bbf51f
feat: Support relative URLs (#15) 2020-10-21 01:31:06 +02:00
Paweł Romanowski
e175558376 Add --exclude-all-private flag and cli integration test 2020-10-17 10:01:06 +02:00
Matthias Endler
14d098f7cf Add mail 2020-08-23 23:19:21 +02:00
Matthias Endler
608499fdb4 Add more test links 2020-08-14 11:38:29 +02:00
Matthias Endler
391144b2ff Add globbing support 2020-08-14 02:33:04 +02:00
Matthias Endler
4aa2883371 Add more links 2020-08-09 22:43:11 +02:00
Matthias Endler
a58b3e1232 Add logging and proper URL parsing 2020-08-07 19:00:21 +02:00