* Added html5gum based fragment extractor
* Markdown fragment extractor now extracts fragments from inline html
* Added fragment checks for html file
* Added inline html and html document to fragment checks test
* Improved some comments
* Improved documentation of markdown's fragment extractor.
- Implemented enhancements to include fragments in file links
- Checked links to markdown files with fragments, generating unique kebab case and heading attributes.
- Made code more idiomatic and added an integration test.
- Updated documentation.
- Fixed issues with heading attributes fragments and ensured proper handling of file errors.
E-Mail checks cause too many false-postives,
so we put them behind a flag.
* `--exclude-mail` is deprecated (to be removed in 1.0)
* `--include-mail` is the new flag
This PR also removes the obsolete tests for `--exclude-file`, which was superseded by `.lycheeignore`.
Fixes#1089
This is a very conservative and limited implementation of cookie support.
The goal is to ship an MVP, which covers 80% of the use-cases.
When you run lychee with --cookie-jar cookies.json, all cookies will be stored in cookies.json, one cookie per line.
This makes cookies easy to edit by hand if needed, although this is an advanced use-case and the API for the format is not guaranteed to be stable.
Fixes: #645, #715
Partially fixes: #1108
* Add support for basic auth per domain
* Move URI matching to link collection phase
* Allow AsRef for BasicAuthExtractor::new to avoid clone
* Add tests
---------
Co-authored-by: Matthias Endler <matthias@endler.dev>
Unknown status codes should be skipped and not cached by default. The reason is that we don't know if they are valid or not and even if they are invalid, we don't know if they will be valid in the future.
Skipping URLs in verbatim elements didn't take nested
elements into consideration, which were not verbatim.
For instance, the following HTML snippet would yield
`https://example.com` in non-verbatim mode, even if
it is nested inside a verbatim `<pre>` element:
```html
<pre><a href="https://example.com">link</a></pre>
```
This commit fixes the behavior for both `html5gum` and
`html5ever`.
Note that nested verbatim elements of the same kind
still are not handled correctly.
For instance, the following HTML snippet would still yield
`https://example.com`:
```html
<pre>
<pre></pre>
<a href="https://example.com">link</a>
</pre>
```
The reason is that we currently only keep track of a single
verbatim element and not a stack of elements, which we
would need to unwind and resolve the situation.
Fixes https://github.com/lycheeverse/lychee/issues/986.
Previously, lychee would blindly retry all requests,
no matter if the request error was transient or fatal.
Taking a lesson from https://github.com/TrueLayer/reqwest-middleware,
we can be more granular about the error behavior.
This PR adds their retry logic to lychee, reducing the number of
unnecessary requests significantly.
I also made some ergonomic changes to the client, which should not
affect its behavior.
* Fix cached 200 status code handling
Assert that code 200 never needs to be explicitly accepted for cached response
to match the behavior of uncached checks
* Bump version to v0.11.1
Previously remote URLs were incorrectly detected because the
string representation of a path is different than the path itself,
causing the `http` prefix match to be insufficient.
This resulted in unexpected side-effects, such as the
incorrect detection of verbatim mode for remote URLs.
The check now got improved and unit tests were added to avoid
future breakage. On top of that, missing verbatim elements were added
Email addresses with query parameters often get used in
contact forms on websites. They can also be found in
other documents like Markdown.
A common use-case is to add a subject line to the email
as a parameter e.g. `mailto:mail@example.com?subject="Hello"`.
Previously we handled such cases incorrectly by recognizing
them as files. The reason was that our email parsing was too strict
to allow for that use-case.
With `email_address` we switched to a more permissive parser.
Note that this does not affect the actual address email checking,
as this is still done `check-if-email-exists`, which has more strict
check functionality.
As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported
and excluded URLs.
Unsupported URLs might be supported in the future and caching them
would mean they won't get checked then. Excluded URLs were
excluded for a reason and should not appear in the cache.
Furthermore they might not be excluded
in a consecutive run, leading to a false-positive.
* Add custom deserializer for `CacheStatus` to properly classify status codes
* Add CLI integration tests to check .lycheecache behavior
* Add comment to explain conflict between cache and accept flags