As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported
and excluded URLs.
Unsupported URLs might be supported in the future and caching them
would mean they won't get checked then. Excluded URLs were
excluded for a reason and should not appear in the cache.
Furthermore they might not be excluded
in a consecutive run, leading to a false-positive.
* Add custom deserializer for `CacheStatus` to properly classify status codes
* Add CLI integration tests to check .lycheecache behavior
* Add comment to explain conflict between cache and accept flags
Remaps allow mapping from a URI pattern to a different URI.
The syntax is
```
lychee --remap 'https://example.comhttp://127.0.0.1'
```
Some use-cases are
- Testing URIs prior to production deployment
- Testing URIs behind a proxy
Be careful when using this feature because checking every link against a
large set of regular expressions has a performance impact. Also there are no
constraints on the URI mapping, so the rules might contradict with each
other.
Remap rules get applied in order of definition to every input URI.
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
* Add support for raw formatter (no color)
* Introduce ResponseFormatter trait
* Pass the same params to every cli command
* Update dependencies
* Remove pretty_assertions dependency (latest version doesn't build)
Recently we cleaned up the commandline output to trim away redundant
information like the URL, which occured twice.
Unfortunately we also removed helpful information from reqwest, which
could support the user in troubleshooting unexpected errors.
This commit reverts that.
We now extract the meaningful information from reqwest, without being
too verbose. For that we have to depend on the string output for the
reqwest error, but it's better than hiding that information from the user.
It is fragile as it depends on the reqwest internals, but in the worst case
we simply return the full error text in case our parsing won't work.
This requires `Input::new` to return a `Result`, because the URL
parsing could fail when prepending `http://`.
We use http instead of https, because curl does as well:
70ac27604a/lib/urlapi.c (L1104-L1124)
Missing files will be interpreted as URLs from the command line
and these can be invalid, but that's not seen as an error anymore.
Print original reqwest error for every Github link.
It contains more information about the underlying error.
Only print a message about the Github token at the
end if it's not set and there were Github errors.
Make sure that broken pipes (e.g. when a reader of a
pipe prematurely exits during execution) get handled gracefully.
This change also moves some error messages to stderr by using
eprintln.
More info: https://github.com/jez/as-tree/issues/15
This commit uses crate `ip_network` to determine whether an IPv6 address is
link-local or unique local.
Note that this extra dependencies can be removed once rust-lang/rust#27709 is
stabilized.
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
This commit replaced `hubcaps` by `octocrab`, which has more downloads per month
and receives more frequent release updates.
The caveats are:
1. When instantiating the API client, `octocrab` doesn't offer you a way to
specify custom user-agent. But I would argue that, at least presently, this
doesn't seem to cause issues.
2. `octocrab` doesn't export as much details of its error types as `hubcaps`
does. So we will have fewer control on the display of the error message. But I
would also argue that this is not really important. Though we should do more
tests to make sure the error looks good enough.
* hide implementation details in error message
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
This commit mainly added or improved documentation for `lychee-lib::client`
module.
But it also contains a few API changes:
- `ClientBuilder::client()` now consumes itself instead of taking a reference.
This helps to avoid a few unnecessary clones.
- `ClientBuilder::build_filter()` was a private function and is inlined to avoid
unnecessary clones.
- Added a new crate-scoped function `Uri::set_scheme()`.
* added notes on deprecated site-local network
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:
* Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).
This is an optional feature. Caching only gets used if the `--cache` flag is set.
The Github API doesn't handle checking individual files inside repos or
paths like `github.com/org/repo/issues`, so we are more
permissive and only check for repo existence. This is the
only way to get a basic check for private repos. Public repos are not affected and should work
with a normal check.
We recently removed the custom serialization for InputSource.
This causes the JSON formatter to fail
with "key must be a string".
Add it back and add a comment on
why this is needed.
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master.
Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s
The performance fluctuates a little less as well.
Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.
Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
* Move to from vec to streams
Previously we collected all inputs in one vector
before checking the links, which is not ideal.
Especially when reading many inputs (e.g. by using a glob pattern),
this could cause issues like running out of file handles.
By moving to streams we avoid that scenario. This is also the first
step towards improving performance for many inputs.
To stay as close to the pre-stream behaviour, we want to stop processing
as soon as an Err value appears in the stream. This is easiest when the
stream is consumed in the main thread.
Previously, the stream was consumed in a tokio task and the main thread
waited for responses.
Now, a tokio task waits for responses (and displays them/registers
response stats) and the main thread sends links to the ClientPool.
To ensure that the main thread waits for all responses to have arrived
before finishing the ProgressBar and printing the stats, it waits for
the show_results_task to finish.
* Return collected links as Stream
* Initialize ProgressBar without length because we can't know the amount of links without blocking
* Handle stream results in main thread, not in task
* Add basic directory support using jwalk
* Add test for HTTP protocol file type (http://)
* Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`.
* Refactor main; fix tests
* Move commands into separate submodule
* Simplify input handling
* Simplify collector
* Remove unnecessary unwrap
* Simplify main
* cleanup check
* clean up dump command
* Handle requests in parallel
* Fix formatting and lints
Co-authored-by: Timo Freiberg <self@timofreiberg.com>
This removes some boilerplate and is arguably better
than handwriting the error handling code for
maintainability and avoid inconsitent functionality
for the error variants.
thiserror is also the de-facto standard for library
error types as of today.
When running lychee against a remote URL all relative links are ignored
by default because `--base` is normally not set. A good default in this
case is to automatically use the base domain from the source URL.
Setting `--base` overrides the automatic source extraction from the
source URL (same behaviour as we currently have).
* Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file
Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
* Cache `absolute_path` to decrease allocations
While profiling local file handling, I noticed that resolving paths was taking a
significant amount of time. It also caused quite a few allocations.
By caching the path and using a constant value for the current
directory, we can reduce the number of allocs by quite a lot.
For example, when testing on the sentry documentation, we do 50,4%
less allocations in total now. That's just a single test-case of course,
but it's probably also helping in many other cases as well.
* Defer to_string for attr.value to reduce allocs
* Use Tendrils instead of Strings for parsing (another ~1.5% less allocs)
* Move option parsing code into separate module
* Handle base dir more correctly
* Temporarily disable dry run