Adds support for accept ranges discussed in #1157. This allows the user to specify custom HTTP status codes accepted during checking and thus will report as valid (not broken). The accept option only supports specifying status codes as a comma-separated list. With this PR, the option will accept a list of status code ranges formatted like this:
```toml
accept = ["100..=103", "200..=299", "403"]
```
These combinations will be supported: `..<end>`, ` ..=<end>`, `<start>..<end>` and `<start>..=<end>`.
The behavior is copied from the Rust Range like concepts:
```
..<end>, includes 0 to <end> (exclusive)
..=<end>, includes 0 to <end> (inclusive)
<start>..<end>, includes <start> to <end> (exclusive)
<start>..=<end>, includes <start> to <end> (inclusive)
```
- Foundation and enhancements for accept ranges, including support for comma-separated strings and integration into the CLI.
- Implementations and updates for AcceptSelector, including Default, Display, and serde defaults.
- Address and fix various errors: clippy, cargo fmt, and tests.
- Add more tests, address edge cases, and enhance error messaging, especially for TOML config parsing.
- Update dependencies.
* Added html5gum based fragment extractor
* Markdown fragment extractor now extracts fragments from inline html
* Added fragment checks for html file
* Added inline html and html document to fragment checks test
* Improved some comments
* Improved documentation of markdown's fragment extractor.
* Fix rustls-tls feature
Commit 14e74879 (cookie support #1146) re-introduced an unconditional
dependency on the openssl-sys crate. That is, building Lychee with the
Rustls TLS backend now requires OpenSSL. I suppose this change was
unintended, maybe due to automatic conflict resolution. If not, please
let me know.
You can review the re-introduced dependency like so:
```
cargo tree --no-default-features --features rustls-tls -i openssl-sys
```
This commit puts the OpenSSL dependency behind the native-tls feature
flag again.
You can check the TLS features like so:
```
cargo check --workspace --all-targets --features vendored-openssl
cargo check --workspace --all-targets --all-features
cargo check --workspace --all-targets --no-default-features --features rustls-tls
```
Maybe this should be added to CI. But I don't want to waste anybody's
time.
* Check feature flags during CI
Adds a new CI job 'check-feature-flags' to verify the following:
- Lychee with rustls-tls feature only doesn't depend on OpenSSL
- Cargo check passes with default features
- Cargo check passes with all features
- Cargo check passes with rustls-tls feature only
- Implemented enhancements to include fragments in file links
- Checked links to markdown files with fragments, generating unique kebab case and heading attributes.
- Made code more idiomatic and added an integration test.
- Updated documentation.
- Fixed issues with heading attributes fragments and ensured proper handling of file errors.
Our current `srcset` parsing is pretty basic.
We split on comma and then on whitespace and take the first part, which is the image source URL.
However, we don't handle URLs containing unencoded commas like
</cdn-cgi/image/format=webp,width=640/https://img.youtube.com/vi/hVBl8_pgQf0/maxresdefault.jpg>, which leads to false-positives.
According to the spec, commas in strings should be encoded, but in practice, there are some websites which don't do that. To handle these cases, too, I propose to extend the `srcset` parsing to make use of a small "state machine", which detects if a comma is within the image source or outside of it while parsing.
This is part of an effort to reduce false-positives during link checking.
---------
Co-authored-by: Hugo McNally <45573837+HU90m@users.noreply.github.com>
more performance improvements, also i accidentally pushed a breaking
change in 0.5.7 and want to avoid 0.5.5 to circulate around too much
because of that
there's some perf improvements there, but one requires manual
configuration.
custom should_emit_errors, when inlined, eliminates some useless code
that just wastes time when we don't care about errors.
E-Mail checks cause too many false-postives,
so we put them behind a flag.
* `--exclude-mail` is deprecated (to be removed in 1.0)
* `--include-mail` is the new flag
This PR also removes the obsolete tests for `--exclude-file`, which was superseded by `.lycheeignore`.
Fixes#1089
This is a very conservative and limited implementation of cookie support.
The goal is to ship an MVP, which covers 80% of the use-cases.
When you run lychee with --cookie-jar cookies.json, all cookies will be stored in cookies.json, one cookie per line.
This makes cookies easy to edit by hand if needed, although this is an advanced use-case and the API for the format is not guaranteed to be stable.
Fixes: #645, #715
Partially fixes: #1108
* Add support for basic auth per domain
* Move URI matching to link collection phase
* Allow AsRef for BasicAuthExtractor::new to avoid clone
* Add tests
---------
Co-authored-by: Matthias Endler <matthias@endler.dev>
* Add optional Rustls support
This commit adds a non-default feature flag to use Rustls instead of OpenSSL.
My personal motivation is to use Lychee on OpenBSD -current, where the
`openssl` crate frequently fails to link against the unreleased system
LibreSSL. Using the `vendored-openssl` feature helps with compilation, but
segfaults at runtime.
The commit adds three feature flags to the library, binary, benchmark, and all
examples:
- The `native-tls` feature flag toggles the `openssl` crate.
- The `rustls-tls` feature flag toggles the `rustls` crate.
- The `email-check` feature flag toggles the `check-if-email-exists` crate,
which is the only existing functionality currently incompatible with Rustls.
By default, `native-tls` and `email-check` are enabled. Thus, Lychee (bin and
lib) can be used as before unless default features are disabled.
To use the Rustls feature, pass `--no-default-features --features rustls` to
cargo check/build/test/..., e.g.,
$ cargo clippy --workspace --all-targets --no-default-features \ --features
rustls-tls -- --deny warnings
Checking email addresses requires both, `native-tls` and `email-check`, to be
enabled. Otherwise, email addresses are excluded.
The `email-check` feature flag is technically not necessary. I preferred it
over `not(rustls-tls)` because it's clearer and it addresses the AGPL license
issue #594. As far as I understand, a Lychee binary compiled without the
`email-check` feature could be distributed with file-based copyleft for the
MPL-licensed dependencies only. But that's out of scope here.
The benchmark shows a performance regression varying between 2% and 4.4% when
using Rustls instead of OpenSSL on my machine.
PS: The `ring` crate needs to be patched on OpenBSD 7.3 and later until the new
xonly patches have been upstreamed, see the `rust-ring` port.
* Use platform native certificates with Rustls
By default, reqwest uses the webpki-roots crate with Rustls, effectively
bundling Mozilla's root certificates.
This commit uses the rustls-native-certs crate instead to use locally
installed root certificates, to minimize the difference between the
native-tls and rustls-tls features.
* Document feature flags
Unknown status codes should be skipped and not cached by default. The reason is that we don't know if they are valid or not and even if they are invalid, we don't know if they will be valid in the future.
Perform a warm-up request to ensure the lazy regexes in
`lychee-lib/src/quirks/mod.rs` are compiled.
On some platforms, this can take some time(approx. 110ms),
which should not be counted in the test.
Skipping URLs in verbatim elements didn't take nested
elements into consideration, which were not verbatim.
For instance, the following HTML snippet would yield
`https://example.com` in non-verbatim mode, even if
it is nested inside a verbatim `<pre>` element:
```html
<pre><a href="https://example.com">link</a></pre>
```
This commit fixes the behavior for both `html5gum` and
`html5ever`.
Note that nested verbatim elements of the same kind
still are not handled correctly.
For instance, the following HTML snippet would still yield
`https://example.com`:
```html
<pre>
<pre></pre>
<a href="https://example.com">link</a>
</pre>
```
The reason is that we currently only keep track of a single
verbatim element and not a stack of elements, which we
would need to unwind and resolve the situation.
Fixes https://github.com/lycheeverse/lychee/issues/986.
Previously, lychee would blindly retry all requests,
no matter if the request error was transient or fatal.
Taking a lesson from https://github.com/TrueLayer/reqwest-middleware,
we can be more granular about the error behavior.
This PR adds their retry logic to lychee, reducing the number of
unnecessary requests significantly.
I also made some ergonomic changes to the client, which should not
affect its behavior.
* Fix cached 200 status code handling
Assert that code 200 never needs to be explicitly accepted for cached response
to match the behavior of uncached checks
* Bump version to v0.11.1
`lychee_lib::client`:
- Improved documentation.
- Added an log message in `ClientBuilder::client()` when provied user-agent
overrides the one defined in provied custom header.
- Removed unnecessary error handling in `Client::check()` when setting HTTPS
scheme because all failure cases should occur when checking this URL the first
time already.
- Removed unnecessary error handling in `Client::remap()` since
`lychee-lib::remap::Remaps::remap()` doesn't returns a `Result` anymore.
- Fixed potential integer overflow in `Client::check_website()` when the wait
time between retries doubles, by using `std::time::Duration::saturating_mul`
instead.
- Renamed `invalid()` to `validate_url()`.
`lychee_lib::remap`:
- Improved documentation, in particular, clarified (in the comment) that it's
URLs not URIs being remapped.
- Changed `Remaps::remap()` so it takes `&mut Url` instead of `Uri` as its
argument, and doesn't return a `Result` as a result.
- Using `Url` instead of `Uri` because it aligns with the concept of
remapping locations rather than identifiers.
- Mutating the URL directly instead of returning a new one for it's more
straightforward.
- There is no error handling because we don't convert from URL to URI
anymore. Furthermore, this always succeed in the first place so we never
needed error handling.
- Added implementation of `IntoIterator` for `&'a Remaps` and convenience method
of `Remaps::iter`. (Their mutable or moving counterparts are deliberately
avoided because we don't want library users to modify all consume the
remapping rules after its instantiation.)
`lychee_lib::error`:
- Renamed `ErrorKind::InvalidUriRemap` to `InvalidUrlRemap` and improved
its error message.
Changes to other modules are minor and only serves to accompany aforementioned
changes.
Previously those were not correctly rewritten to thumbnail URLs.
This should be fixed now by splitting up the logic
for normal YouTube links and shortlinks.
Fixes#906
Previously remote URLs were incorrectly detected because the
string representation of a path is different than the path itself,
causing the `http` prefix match to be insufficient.
This resulted in unexpected side-effects, such as the
incorrect detection of verbatim mode for remote URLs.
The check now got improved and unit tests were added to avoid
future breakage. On top of that, missing verbatim elements were added