Skipping URLs in verbatim elements didn't take nested
elements into consideration, which were not verbatim.
For instance, the following HTML snippet would yield
`https://example.com` in non-verbatim mode, even if
it is nested inside a verbatim `<pre>` element:
```html
<pre><a href="https://example.com">link</a></pre>
```
This commit fixes the behavior for both `html5gum` and
`html5ever`.
Note that nested verbatim elements of the same kind
still are not handled correctly.
For instance, the following HTML snippet would still yield
`https://example.com`:
```html
<pre>
<pre></pre>
<a href="https://example.com">link</a>
</pre>
```
The reason is that we currently only keep track of a single
verbatim element and not a stack of elements, which we
would need to unwind and resolve the situation.
Fixes https://github.com/lycheeverse/lychee/issues/986.
Previously, lychee would blindly retry all requests,
no matter if the request error was transient or fatal.
Taking a lesson from https://github.com/TrueLayer/reqwest-middleware,
we can be more granular about the error behavior.
This PR adds their retry logic to lychee, reducing the number of
unnecessary requests significantly.
I also made some ergonomic changes to the client, which should not
affect its behavior.
* Fix cached 200 status code handling
Assert that code 200 never needs to be explicitly accepted for cached response
to match the behavior of uncached checks
* Bump version to v0.11.1
`lychee_lib::client`:
- Improved documentation.
- Added an log message in `ClientBuilder::client()` when provied user-agent
overrides the one defined in provied custom header.
- Removed unnecessary error handling in `Client::check()` when setting HTTPS
scheme because all failure cases should occur when checking this URL the first
time already.
- Removed unnecessary error handling in `Client::remap()` since
`lychee-lib::remap::Remaps::remap()` doesn't returns a `Result` anymore.
- Fixed potential integer overflow in `Client::check_website()` when the wait
time between retries doubles, by using `std::time::Duration::saturating_mul`
instead.
- Renamed `invalid()` to `validate_url()`.
`lychee_lib::remap`:
- Improved documentation, in particular, clarified (in the comment) that it's
URLs not URIs being remapped.
- Changed `Remaps::remap()` so it takes `&mut Url` instead of `Uri` as its
argument, and doesn't return a `Result` as a result.
- Using `Url` instead of `Uri` because it aligns with the concept of
remapping locations rather than identifiers.
- Mutating the URL directly instead of returning a new one for it's more
straightforward.
- There is no error handling because we don't convert from URL to URI
anymore. Furthermore, this always succeed in the first place so we never
needed error handling.
- Added implementation of `IntoIterator` for `&'a Remaps` and convenience method
of `Remaps::iter`. (Their mutable or moving counterparts are deliberately
avoided because we don't want library users to modify all consume the
remapping rules after its instantiation.)
`lychee_lib::error`:
- Renamed `ErrorKind::InvalidUriRemap` to `InvalidUrlRemap` and improved
its error message.
Changes to other modules are minor and only serves to accompany aforementioned
changes.
Previously those were not correctly rewritten to thumbnail URLs.
This should be fixed now by splitting up the logic
for normal YouTube links and shortlinks.
Fixes#906
Previously remote URLs were incorrectly detected because the
string representation of a path is different than the path itself,
causing the `http` prefix match to be insufficient.
This resulted in unexpected side-effects, such as the
incorrect detection of verbatim mode for remote URLs.
The check now got improved and unit tests were added to avoid
future breakage. On top of that, missing verbatim elements were added
This is a naive approach to exclude script tags from
getting checked. The reason is that the tag leads to
a lot of false-positives (e.g. `//unpkg.com/docsify-edit-on-github@1`
within a script block gets detected as an e-mail address).
A more thorough approach would be the use of a tree-builder in
html5gum and html5ever, but this could have a negative performance
impact.
I also did not want to add a new flag (e.g. `--include-scripts`) for this
setting because the current set of flags around exclusion/inclusion is
already quite long.
Fixes#821.
This was an oversight so far that became apparent after our
recent fix for email addreses with query params
(e.g. `test@example.com?subject=test`).
The parsing of email addresses has improved and so we detect
more mail addresses, but we didn't check if they belonged
to an example domain, causing false-positive checks.
Email addresses with query parameters often get used in
contact forms on websites. They can also be found in
other documents like Markdown.
A common use-case is to add a subject line to the email
as a parameter e.g. `mailto:mail@example.com?subject="Hello"`.
Previously we handled such cases incorrectly by recognizing
them as files. The reason was that our email parsing was too strict
to allow for that use-case.
With `email_address` we switched to a more permissive parser.
Note that this does not affect the actual address email checking,
as this is still done `check-if-email-exists`, which has more strict
check functionality.
As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported
and excluded URLs.
Unsupported URLs might be supported in the future and caching them
would mean they won't get checked then. Excluded URLs were
excluded for a reason and should not appear in the cache.
Furthermore they might not be excluded
in a consecutive run, leading to a false-positive.
* Add custom deserializer for `CacheStatus` to properly classify status codes
* Add CLI integration tests to check .lycheecache behavior
* Add comment to explain conflict between cache and accept flags
* Replace lazy hashset with matches!
llvm will typically create much faster code than accessing a hashset at
runtime
source: trust me bro
* cargo fix
* cargo fmt
* shorten docstring
Unfortunately it's not possible to automatically enable features
for `cargo test`. See https://github.com/rust-lang/cargo/issues/2911.
As a workaround to allow for using example domains for unit- and integration
tests, we introduce a new feature, `check_example_domains`, which is
disabled by default for normal users. The feature gets activated for the
integration test which checks that the example domain exclusion works as
expected.
Remaps allow mapping from a URI pattern to a different URI.
The syntax is
```
lychee --remap 'https://example.comhttp://127.0.0.1'
```
Some use-cases are
- Testing URIs prior to production deployment
- Testing URIs behind a proxy
Be careful when using this feature because checking every link against a
large set of regular expressions has a performance impact. Also there are no
constraints on the URI mapping, so the rules might contradict with each
other.
Remap rules get applied in order of definition to every input URI.
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
This sets a HTTP connect timeout (for stability)
and a TCP keepalive (for performance).
The connect timeout should help with flaky servers, which
would block the runtime and therefore other requests.
The keepalive helps when making many requests to the same
host. This is a very common pattern for checking internal documentation,
which is an important use-case of lychee.
The settings are currently not configurable by the user
and set to sane defaults. We might make this configurable in the future
if there is demand to do so.
* Add support for raw formatter (no color)
* Introduce ResponseFormatter trait
* Pass the same params to every cli command
* Update dependencies
* Remove pretty_assertions dependency (latest version doesn't build)
Recently we cleaned up the commandline output to trim away redundant
information like the URL, which occured twice.
Unfortunately we also removed helpful information from reqwest, which
could support the user in troubleshooting unexpected errors.
This commit reverts that.
We now extract the meaningful information from reqwest, without being
too verbose. For that we have to depend on the string output for the
reqwest error, but it's better than hiding that information from the user.
It is fragile as it depends on the reqwest internals, but in the worst case
we simply return the full error text in case our parsing won't work.
With the refactoring the URL checking as a workaround for the upstream
reqwest panic on invalid URLs, we introduced a regression, which caused
unsupported URL schemes to show up as errors in the lychee output.
This commit changes the behavior such that invalid schemes get ignored
again by making a differentiation between truly invalid URIs which would make
reqwest panic, and ones which are valid but just not handled by reqwest.
The check was moved to `check_website` such that the invalid URIs would
not be checked three times in a loop before erroring out.
This requires `Input::new` to return a `Result`, because the URL
parsing could fail when prepending `http://`.
We use http instead of https, because curl does as well:
70ac27604a/lib/urlapi.c (L1104-L1124)
Missing files will be interpreted as URLs from the command line
and these can be invalid, but that's not seen as an error anymore.