Commit graph

26 commits

Author SHA1 Message Date
Matthias Endler
9837699b79
Introduce new let...else syntax (#936) 2023-01-30 14:25:30 +01:00
Matthias
d61105edbb
Fix parsing error of email addresses with query params (#809)
Email addresses with query parameters often get used in
contact forms on websites. They can also be found in
other documents like Markdown.

A common use-case is to add a subject line to the email
as a parameter e.g. `mailto:mail@example.com?subject="Hello"`.

Previously we handled such cases incorrectly by recognizing
them as files. The reason was that our email parsing was too strict
to allow for that use-case.
With `email_address` we switched to a more permissive parser.

Note that this does not affect the actual address email checking,
as this is still done `check-if-email-exists`, which has more strict
check functionality.
2022-11-05 23:40:33 +01:00
Matthias
84de43c554
Refactor request types (#637) 2022-06-03 20:13:07 +02:00
Matthias
363b95fe5f
Add support for excluding paths from link checking (#623)
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
2022-05-29 17:27:09 +02:00
Matthias
03d28820bb
Extract more status information from reqwest (#577)
Recently we cleaned up the commandline output to trim away redundant
information like the URL, which occured twice.
Unfortunately we also removed helpful information from reqwest, which
could support the user in troubleshooting unexpected errors.

This commit reverts that.
We now extract the meaningful information from reqwest, without being
too verbose. For that we have to depend on the string output for the
reqwest error, but it's better than hiding that information from the user.
It is fragile as it depends on the reqwest internals, but in the worst case
we simply return the full error text in case our parsing won't work.
2022-04-02 14:37:03 +02:00
Matthias
ceb185e579
Add more comments to path methods (#543) 2022-03-08 13:50:54 +01:00
Matthias
ba276cd51b
Error cleanup (#510)
* Add more fine-grained error types; remove generic IO error
* Update error message for missing file
* Remove missing `Error` suffix
* Rename ErrorKind::Github to ErrorKind::GithubRequest for consistency with NetworkRequest
2022-02-19 01:44:00 +01:00
Matthias
812663d832
Prevent flaky tests (#514)
Move from example.org to example.com, which seems to be more permissive for testing
2022-02-18 10:29:49 +01:00
Markus Unterwaditzer
68d09f7e5b
Add html5gum as alternative link extractor (#480)
html5gum is a HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate anything tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.
2022-02-07 22:54:47 +01:00
Matthias
ac490f9c53
Add caching functionality (v2) (#443)
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:

 * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version.    Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
2022-01-14 15:25:51 +01:00
Matthias
01393b34a2
Upgrade to Rust 2021 (#427) 2021-12-17 01:32:13 +01:00
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00
Matthias
3d5135668b
Improve concurrency with streams (#330)
* Move to from vec to streams

Previously we collected all inputs in one vector
before checking the links, which is not ideal.
Especially when reading many inputs (e.g. by using a glob pattern),
this could cause issues like running out of file handles.

By moving to streams we avoid that scenario. This is also the first
step towards improving performance for many inputs.

To stay as close to the pre-stream behaviour, we want to stop processing
as soon as an Err value appears in the stream. This is easiest when the
stream is consumed in the main thread.
Previously, the stream was consumed in a tokio task and the main thread
waited for responses.
Now, a tokio task waits for responses (and displays them/registers
response stats) and the main thread sends links to the ClientPool.
To ensure that the main thread waits for all responses to have arrived
before finishing the ProgressBar and printing the stats, it waits for
the show_results_task to finish.


* Return collected links as Stream
* Initialize ProgressBar without length because we can't know the amount of links without blocking
* Handle stream results in main thread, not in task
* Add basic directory support using jwalk
* Add test for HTTP protocol file type (http://)
* Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`.
* Refactor main; fix tests
* Move commands into separate submodule
* Simplify input handling
* Simplify collector
* Remove unnecessary unwrap
* Simplify main
* cleanup check
* clean up dump command
* Handle requests in parallel 
* Fix formatting and lints

Co-authored-by: Timo Freiberg <self@timofreiberg.com>
2021-12-01 18:25:11 +01:00
Markus Unterwaditzer
d3ed133f10
Remove srcset attribute from list of "link" attrs (#393)
* Remove srcset attribute from list of "link" attrs

Fix #390

* Add test for srcset

* Add note about srcSet links

* add real support for srcset

Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-11-16 22:58:10 +01:00
Matthias
a7f809612d
Refactor extractor (#354)
This avoids sending URLs back and forth between the different parsers.
Also, it should allow for future optimizations to reduce allocs.
2021-10-07 12:51:02 +02:00
Matthias
251332efe2
Cache absolute_path to decrease allocations (#346)
* Cache `absolute_path` to decrease allocations

While profiling local file handling, I noticed that resolving paths was taking a
significant amount of time. It also caused quite a few allocations.
By caching the path and using a constant value for the current
directory, we can reduce the number of allocs by quite a lot.
For example, when testing on the sentry documentation, we do 50,4%
less allocations in total now. That's just a single test-case of course,
but it's probably also helping in many other cases as well.

* Defer to_string for attr.value to reduce allocs
* Use Tendrils instead of Strings for parsing (another ~1.5% less allocs)
* Move option parsing code into separate module
* Handle base dir more correctly
* Temporarily disable dry run
2021-10-05 01:37:43 +02:00
Matthias
3b41c4c375
Silently ignore absolute paths without base (fixes #320) (#338) 2021-09-20 11:13:30 +02:00
Matthias
ffab0343fc Revert refactor for removing params and fragments
The refactored version was not equivalent. It could not handle
fragments containing a question mark.
See 67268ed598 (r703400238)
2021-09-08 00:29:30 +02:00
Matthias
67268ed598 Clean up params and fragment handling 2021-09-07 13:02:39 +02:00
Matthias
5d0b95271d Remove anchor from file links 2021-09-07 00:20:09 +02:00
Matthias
f47282093a String allocation not needed 2021-09-06 15:23:10 +02:00
Matthias
f143087743 Relative path not needed 2021-09-06 15:23:10 +02:00
Matthias
b3c5d122e7 Fix clippy lints 2021-09-06 15:23:10 +02:00
Matthias
57af648ec9 fix tests after making base dir mandatory 2021-09-06 15:23:10 +02:00
Matthias
b7c129c431 Fix resolving absolute paths
The previous solution didn't resolve to absolute paths
and rather removed things like `.` and `..`.
2021-09-06 15:20:18 +02:00
Matthias
dd3205a87c wip 2021-09-06 15:19:43 +02:00