As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported
and excluded URLs.
Unsupported URLs might be supported in the future and caching them
would mean they won't get checked then. Excluded URLs were
excluded for a reason and should not appear in the cache.
Furthermore, they might not be excluded
in a subsequent run, leading to a false positive.
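A minimal sketch of the rule described above, not the actual implementation: the `Status` enum, its variants, and the `store` helper are simplified, hypothetical stand-ins, and a plain `HashMap` stands in for the real cache.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified status type; the real one carries more detail.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Error,
    Excluded,
    Unsupported,
}

impl Status {
    /// Excluded and unsupported results must be re-evaluated on every run,
    /// so they are never written to the cache.
    fn is_cacheable(self) -> bool {
        !matches!(self, Status::Excluded | Status::Unsupported)
    }
}

/// Store a result only if its status is cacheable.
fn store(cache: &mut HashMap<String, Status>, uri: &str, status: Status) {
    if status.is_cacheable() {
        cache.insert(uri.to_string(), status);
    }
}

fn main() {
    let mut cache = HashMap::new();
    store(&mut cache, "https://example.com", Status::Ok);
    store(&mut cache, "https://example.com/missing", Status::Error);
    store(&mut cache, "mailto:someone@example.com", Status::Unsupported);
    store(&mut cache, "https://excluded.example", Status::Excluded);
    println!("{} cached entries", cache.len());
}
```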
Remaps allow mapping from a URI pattern to a different URI.
The syntax is
```
lychee --remap 'https://example.com http://127.0.0.1'
```
Some use cases are:
- Testing URIs prior to production deployment
- Testing URIs behind a proxy
Be careful when using this feature because checking every link against a
large set of regular expressions has a performance impact. Also, there are no
constraints on the URI mapping, so the rules might contradict each
other.
Remap rules get applied in order of definition to every input URI.
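To illustrate what "in order of definition" can mean, here is a dependency-free sketch. It is an assumption-laden stand-in: the `Rule` type and `remap` function are hypothetical, plain prefixes replace the regular expressions, and it assumes a later rule sees the output of an earlier one.

```rust
/// Hypothetical remap rule. A plain prefix stands in for the regex pattern
/// to keep this sketch free of external crates.
struct Rule {
    pattern: &'static str,
    replacement: &'static str,
}

/// Apply every rule in order of definition.
/// Assumption: a later rule operates on the result of the earlier ones.
fn remap(uri: &str, rules: &[Rule]) -> String {
    let mut current = uri.to_string();
    for rule in rules {
        let next = current
            .strip_prefix(rule.pattern)
            .map(|rest| format!("{}{}", rule.replacement, rest));
        if let Some(next) = next {
            current = next;
        }
    }
    current
}

fn main() {
    let rules = [Rule {
        pattern: "https://example.com",
        replacement: "http://127.0.0.1",
    }];
    println!("{}", remap("https://example.com/docs", &rules));
}
```

With chained rules like these, the order matters: two rules that rewrite each other's output give different results depending on which is defined first.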
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
This is a minimally invasive version that allows grepping for `[excluded]`.
Printing the reason for exclusion would require more work, and it's debatable whether
it adds any value, because it might make grepping harder and the source
of the exclusion is easily deducible from the command-line parameters
or the `.lycheeignore` file.
Fixes #587.
Lines starting with the comment character (`#`) inside the
`.lycheeignore` file will be ignored.
Whitespace at the beginning of each line will be ignored, so
even an indented comment character will work.
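The comment handling can be sketched roughly as follows. `parse_ignore` is a hypothetical helper, not the function used in the code base, and skipping empty lines is an extra assumption made here.

```rust
/// Return the exclusion patterns found in the contents of an ignore file.
/// Leading whitespace is stripped first, so indented comments are skipped too.
/// Assumption: empty lines are skipped as well.
fn parse_ignore(contents: &str) -> Vec<&str> {
    contents
        .lines()
        .map(str::trim_start)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .collect()
}

fn main() {
    let contents = "# a comment\n    # an indented comment\nhttps://example.com\n";
    for pattern in parse_ignore(contents) {
        println!("{}", pattern);
    }
}
```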
* Add support for raw formatter (no color)
* Introduce ResponseFormatter trait
* Pass the same params to every cli command
* Update dependencies
* Remove pretty_assertions dependency (latest version doesn't build)
This requires `Input::new` to return a `Result`, because the URL
parsing could fail when prepending `http://`.
We use http instead of https, because curl does as well:
70ac27604a/lib/urlapi.c (L1104-L1124)
Missing files will be interpreted as URLs from the command line
and these can be invalid, but that's not seen as an error anymore.
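A rough sketch of the idea, under stated assumptions: the `Input` enum, the `InvalidInput` error, and the hand-rolled validity check below are simplified, hypothetical stand-ins for the real types and for proper URL parsing. How a caller reacts to the `Err` case is not shown.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, simplified input type.
#[derive(Debug, PartialEq)]
enum Input {
    File(PathBuf),
    Url(String),
}

/// Hypothetical error carrying the string that failed to parse.
#[derive(Debug, PartialEq)]
struct InvalidInput(String);

impl Input {
    /// Existing paths become files; everything else is treated as a URL,
    /// with `http://` prepended when no scheme is given.
    fn new(value: &str) -> Result<Input, InvalidInput> {
        if Path::new(value).exists() {
            return Ok(Input::File(PathBuf::from(value)));
        }
        let candidate = if value.contains("://") {
            value.to_string()
        } else {
            format!("http://{}", value)
        };
        // Stand-in for real URL parsing: require a non-empty host part
        // that contains no whitespace.
        let host = candidate.split("://").nth(1).unwrap_or("");
        if host.is_empty() || host.contains(char::is_whitespace) {
            return Err(InvalidInput(candidate));
        }
        Ok(Input::Url(candidate))
    }
}

fn main() {
    println!("{:?}", Input::new("example.com"));
    println!("{:?}", Input::new("not a url"));
}
```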
Print the original reqwest error for every GitHub link.
It contains more information about the underlying error.
Only print a message about the GitHub token at the
end if it's not set and there were GitHub errors.
Make sure that broken pipes (e.g. when a reader of a
pipe prematurely exits during execution) get handled gracefully.
This change also moves some error messages to stderr by using
eprintln.
More info: https://github.com/jez/as-tree/issues/15
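One common way to handle this in Rust is to write through `writeln!` and treat `ErrorKind::BrokenPipe` as a normal end of output. The sketch below shows that general pattern; `write_lines` is a hypothetical helper and not necessarily how this commit implements it.

```rust
use std::io::{self, ErrorKind, Write};

/// Write all lines, treating a closed pipe as a normal end of output.
fn write_lines<W: Write>(out: &mut W, lines: &[&str]) -> io::Result<()> {
    for line in lines {
        match writeln!(out, "{}", line) {
            Ok(()) => {}
            // The reader exited early (e.g. `... | head -n 1`); stop quietly.
            Err(e) if e.kind() == ErrorKind::BrokenPipe => return Ok(()),
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

fn main() {
    let stdout = io::stdout();
    if let Err(e) = write_lines(&mut stdout.lock(), &["first line", "second line"]) {
        // Real errors go to stderr so they don't pollute piped output.
        eprintln!("error: {}", e);
        std::process::exit(1);
    }
}
```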
This commit changed the type of `lychee-lib::ClientBuilder::github_token` from
`String` to `secrecy::SecretString` to fortify the secret management within our
program.
Note that this won't affect TOML configuration of `lychee-bin` because
`serde::Deserialize` is still implemented for `SecretString`.
The default configuration was broken since the
introduction of caching and specifically `max_cache_age`.
This fixes deserialization and config merging for
the case where this key is missing from the config.
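The intended behavior for a missing key can be pictured like this. Everything here is hypothetical: the `PartialConfig` struct, the field name, the default value, and the precedence of command-line values over file values are assumptions for illustration only.

```rust
/// Hypothetical default in seconds; not necessarily the real default.
const DEFAULT_MAX_CACHE_AGE_SECS: u64 = 24 * 60 * 60;

/// Values as read from one source; `None` means the key was not set there.
#[derive(Debug, Default, Clone, Copy)]
struct PartialConfig {
    max_cache_age_secs: Option<u64>,
}

/// Assumed precedence: command line, then config file, then the default.
/// A missing key in both sources must fall back to the default.
fn resolve_max_cache_age(cli: PartialConfig, file: PartialConfig) -> u64 {
    cli.max_cache_age_secs
        .or(file.max_cache_age_secs)
        .unwrap_or(DEFAULT_MAX_CACHE_AGE_SECS)
}

fn main() {
    let cli = PartialConfig::default();
    let file = PartialConfig::default();
    println!("max cache age: {}s", resolve_max_cache_age(cli, file));
}
```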
* fix constant updating of progressbar
In other issues I've already lamented how slow lychee is when used
without `-n`. This fixes an issue where without `-n`, lychee would take
1 minute instead of 4 seconds to check sentry-docs.
* fix values
html5gum is an HTML parser that offers lower-level control over which tokens actually get created and tracked. As such, the extractor doesn't allocate any tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever, is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.
This commit replaces the use of `lazy_static` with
`const_format` in `lychee-bin`.
Currently, `lazy_static` is used to build static
strings at runtime. With `const_format`, we can instead
build constant strings at compile time.
Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:
* Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race conditions when running multiple `collect_links` instances, which probably was an issue before.
* Performance: Uses `DashMap` under the hood, which was noticeably faster than `Mutex<HashMap>` in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called `CacheStatus` for serialization, because trying to serialize the error kinds in `Status` turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).
This is an optional feature. Caching only gets used if the `--cache` flag is set.
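As a rough picture of the two-column format, the sketch below round-trips a map through CSV text using only the standard library. It is an illustration under assumptions: `CacheStatus` is shown as an enum with hypothetical variants and labels, the writer does no CSV quoting, and a `BTreeMap` stands in for `DashMap`.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified cache status with assumed variant names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheStatus {
    Success,
    Fail,
}

impl CacheStatus {
    fn as_str(self) -> &'static str {
        match self {
            CacheStatus::Success => "success",
            CacheStatus::Fail => "fail",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "success" => Some(CacheStatus::Success),
            "fail" => Some(CacheStatus::Fail),
            _ => None,
        }
    }
}

/// Serialize the cache as `uri,status` lines (naive: no quoting).
fn to_csv(cache: &BTreeMap<String, CacheStatus>) -> String {
    cache
        .iter()
        .map(|(uri, status)| format!("{},{}\n", uri, status.as_str()))
        .collect()
}

/// Parse `uri,status` lines, splitting at the last comma and
/// silently skipping lines that do not parse.
fn from_csv(data: &str) -> BTreeMap<String, CacheStatus> {
    data.lines()
        .filter_map(|line| {
            let (uri, status) = line.rsplit_once(',')?;
            Some((uri.to_string(), CacheStatus::parse(status)?))
        })
        .collect()
}

fn main() {
    let mut cache = BTreeMap::new();
    cache.insert("https://example.com".to_string(), CacheStatus::Success);
    cache.insert("https://example.com/missing".to_string(), CacheStatus::Fail);
    let csv = to_csv(&cache);
    print!("{}", csv);
    assert_eq!(from_csv(&csv), cache);
}
```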
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks, it was about 10-25% faster than master.
Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s
The performance fluctuates a little less as well.
Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.
Furthermore, this change tries to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (a stringly typed value) and a `Uri`, which is a properly parsed URI type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
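The split between the two types can be sketched as below. This is a simplified illustration: the field layout, the `extract` and `collect` functions, and the scheme check (a stand-in for real URI parsing) are all assumptions, not the actual definitions.

```rust
use std::convert::TryFrom;

/// Unvalidated text found in a document (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct RawUri(String);

/// A parsed URI (hypothetical layout; real parsing is much stricter).
#[derive(Debug, Clone, PartialEq)]
struct Uri {
    scheme: String,
    rest: String,
}

impl TryFrom<&RawUri> for Uri {
    type Error = String;

    /// Stand-in for real parsing: require `scheme:rest` with an
    /// alphanumeric scheme and a non-empty remainder.
    fn try_from(raw: &RawUri) -> Result<Self, Self::Error> {
        match raw.0.split_once(':') {
            Some((scheme, rest))
                if !scheme.is_empty()
                    && !rest.is_empty()
                    && scheme.chars().all(|c| c.is_ascii_alphanumeric()) =>
            {
                Ok(Uri {
                    scheme: scheme.to_ascii_lowercase(),
                    rest: rest.to_string(),
                })
            }
            _ => Err(format!("not an absolute URI: {}", raw.0)),
        }
    }
}

/// Extractor side: gather anything link-like without validating it.
fn extract(text: &str) -> Vec<RawUri> {
    text.split_whitespace()
        .filter(|token| token.contains(':'))
        .map(|token| RawUri(token.to_string()))
        .collect()
}

/// Collector side: turn raw values into parsed ones, dropping failures.
fn collect(raws: &[RawUri]) -> Vec<Uri> {
    raws.iter().filter_map(|raw| Uri::try_from(raw).ok()).collect()
}

fn main() {
    let raws = extract("see https://example.com and :broken");
    println!("{:?}", collect(&raws));
}
```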