Email addresses with query parameters often get used in
contact forms on websites. They can also be found in
other documents like Markdown.
A common use-case is to add a subject line to the email
as a parameter e.g. `mailto:mail@example.com?subject="Hello"`.
Previously we handled such cases incorrectly by recognizing
them as files. The reason was that our email parsing was too strict
to allow for that use-case.
With `email_address` we switched to a more permissive parser.
Note that this does not affect the actual address email checking,
as this is still done `check-if-email-exists`, which has more strict
check functionality.
As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported
and excluded URLs.
Unsupported URLs might be supported in the future and caching them
would mean they won't get checked then. Excluded URLs were
excluded for a reason and should not appear in the cache.
Furthermore they might not be excluded
in a consecutive run, leading to a false-positive.
Remaps allow mapping from a URI pattern to a different URI.
The syntax is
```
lychee --remap 'https://example.comhttp://127.0.0.1'
```
Some use-cases are
- Testing URIs prior to production deployment
- Testing URIs behind a proxy
Be careful when using this feature because checking every link against a
large set of regular expressions has a performance impact. Also there are no
constraints on the URI mapping, so the rules might contradict with each
other.
Remap rules get applied in order of definition to every input URI.
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
This sets a HTTP connect timeout (for stability)
and a TCP keepalive (for performance).
The connect timeout should help with flaky servers, which
would block the runtime and therefore other requests.
The keepalive helps when making many requests to the same
host. This is a very common pattern for checking internal documentation,
which is an important use-case of lychee.
The settings are currently not configurable by the user
and set to sane defaults. We might make this configurable in the future
if there is demand to do so.
With the refactoring the URL checking as a workaround for the upstream
reqwest panic on invalid URLs, we introduced a regression, which caused
unsupported URL schemes to show up as errors in the lychee output.
This commit changes the behavior such that invalid schemes get ignored
again by making a differentiation between truly invalid URIs which would make
reqwest panic, and ones which are valid but just not handled by reqwest.
The check was moved to `check_website` such that the invalid URIs would
not be checked three times in a loop before erroring out.
Print original reqwest error for every Github link.
It contains more information about the underlying error.
Only print a message about the Github token at the
end if it's not set and there were Github errors.
This commit changed the type of `lychee-lib::ClientBuilder::github_token` from
`String` to `secrecy::SecretString` to fortify the secret management within our
program.
Note that this won't affect TOML configuration of `lychee-bin` because
`serde::Deserialize` is still implemented for `SecretString`.
This commit replaced `hubcaps` by `octocrab`, which has more downloads per month
and receives more frequent release updates.
The caveats are:
1. When instantiating the API client, `octocrab` doesn't offer you a way to
specify custom user-agent. But I would argue that, at least presently, this
doesn't seem to cause issues.
2. `octocrab` doesn't export as much details of its error types as `hubcaps`
does. So we will have fewer control on the display of the error message. But I
would also argue that this is not really important. Though we should do more
tests to make sure the error looks good enough.
* hide implementation details in error message
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
This commit mainly added or improved documentation for `lychee-lib::client`
module.
But it also contains a few API changes:
- `ClientBuilder::client()` now consumes itself instead of taking a reference.
This helps to avoid a few unnecessary clones.
- `ClientBuilder::build_filter()` was a private function and is inlined to avoid
unnecessary clones.
- Added a new crate-scoped function `Uri::set_scheme()`.
* added notes on deprecated site-local network
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:
* Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).
This is an optional feature. Caching only gets used if the `--cache` flag is set.
The Github API doesn't handle checking individual files inside repos or
paths like `github.com/org/repo/issues`, so we are more
permissive and only check for repo existence. This is the
only way to get a basic check for private repos. Public repos are not affected and should work
with a normal check.
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master.
Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s
The performance fluctuates a little less as well.
Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.
Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
* Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file
Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
- Major changes in `lychee-lib::filter` module:
- Fields in `Excludes` except the `RegexSet` is now moved to `Filter`.
- `Filter` contains `Option<Excludes>` and `Option<Includes>`, which are
wrapper struct of `RegexSet` instead of `Option<RegexSet>`. As a result
the code now looks cleaner.
- Factored out some filtering logics to dedicated functions.
- It's possible to write tests for those functions in addition to tests
for the `Filter` struct.
- Added docs to `Filter::is_excluded` and reorgnized the code.
- placed `derive_builder` by `typed_builder`:
- The internal interface very ugly, as admitted by the author, but we no
longer have nested `Option`s like before.
- As a result, the `Client` building is much easier to read.
- Main benefit of `typed_builder` is, the arguments feeded to builder is
checked at compile time instead of run-time.
- Fixed a bug in `lychee::tests::usage` and `lychee-lib::stats::test`.
- Now it will clear environment variable which would otherwise cause an
issue if `GITHUB_TOKEN` is set.
- Updated dependencies.
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
- The binary component and library component are separated as two
packages in the same workspace.
- `lychee` is the binary component, in `lychee-bin/*`.
- `lychee-lib` is the library component, in `lychee-lib/*`.
- Users can now install only the `lychee-lib`, instead of both
components, that would require fewer dependencies and faster
compilation.
- Dependencies for each component are adjusted and updated. E.g.,
no CLI dependencies for `lychee-lib`.
- CLI tests are only moved to `lychee`, as it has nothing to do
with the library component.
- `Status::Error` is refactored to contain dedicated error enum,
`ErrorKind`.
- The motivation is to delay the formatting of errors to strings.
Note that `e.to_string()` is not necessarily cheap (though
trivial in many cases). The formatting is no delayed until the
error is needed to be displayed to users. So in some cases, if
the error is never used, it means that it won't be formatted at
all.
- Replaced `regex` based matching with one of the following:
- Simple string equality test in the case of 'false positivie'.
- URL parsing based test, in the case of extracting repository and
user name for GitHub links.
- Either cases would be much more efficient than `regex` based
matching. First, there's no need to construct a state machine for
regex. Second, URL is already verified and parsed on its creation,
and extracting its components is fairly cheap. Also, this removes
the dependency on `lazy-static` in `lychee-lib`.
- `types` module now has a sub-directory, and its components are now
separated into their own modules (in that sub-directory).
- `lychee-lib::test_utils` module is only compiled for tests.
- `wiremock` is moved to `dev-dependency` as it's only needed for
`test` modules.
- Dependencies are listed in alphabetical order.
- Imports are organized in the following fashion:
- Imports from `std`
- Imports from 3rd-party crates, and `lychee-lib`.
- Imports from `crate::*` or `super::*`.
- No glob import.
- I followed suggestion from `cargo clippy`, with `clippy::all` and
`clippy:pedantic`.
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>