Commit graph

32 commits

Author SHA1 Message Date
Matthias
9ed97213a1 Add documentation to chain module
Also make `Chainable` and  `ChainResult` public to support external plugins/handlers.
2024-04-22 11:03:26 +02:00
Thomas Zahner
d92d3ba733 Extract checking functionality & make chain async 2024-04-22 11:03:26 +02:00
Thomas Zahner
0c4bea8074 Create request chain with basic authentication 2024-04-22 11:03:26 +02:00
Techassi
1b1fd0c707
feat: Add support for ranges in the --accept option / config field (#1167)
Adds support for accept ranges discussed in #1157. This allows the user to specify custom HTTP status codes accepted during checking and thus will report as valid (not broken). The accept option only supports specifying status codes as a comma-separated list. With this PR, the option will accept a list of status code ranges formatted like this:

```toml
accept = ["100..=103", "200..=299", "403"]
```

These combinations will be supported: `..<end>`, ` ..=<end>`, `<start>..<end>` and `<start>..=<end>`.
The behavior is copied from the Rust Range like concepts:

```
    ..<end>, includes 0 to <end> (exclusive)
    ..=<end>, includes 0 to <end> (inclusive)
    <start>..<end>, includes <start> to <end> (exclusive)
    <start>..=<end>, includes <start> to <end> (inclusive)
```


- Foundation and enhancements for accept ranges, including support for comma-separated strings and integration into the CLI.
- Implementations and updates for AcceptSelector, including Default, Display, and serde defaults.
- Address and fix various errors: clippy, cargo fmt, and tests.
- Add more tests, address edge cases, and enhance error messaging, especially for TOML config parsing.
- Update dependencies.
2023-09-17 21:39:01 +02:00
Matthias Endler
14e748793e
Cookie Support (#1146)
This is a very conservative and limited implementation of cookie support.

The goal is to ship an MVP, which covers 80% of the use-cases.
When you run lychee with --cookie-jar cookies.json, all cookies will be stored in cookies.json, one cookie per line.
This makes cookies easy to edit by hand if needed, although this is an advanced use-case and the API for the format is not guaranteed to be stable.

Fixes: #645, #715
Partially fixes: #1108
2023-07-13 17:32:41 +02:00
Techassi
67af7ef6d3
feat: add support for basic auth per URI (#1110)
* Add support for basic auth per domain
* Move URI matching to link collection phase
* Allow AsRef for BasicAuthExtractor::new to avoid clone
* Add tests

---------

Co-authored-by: Matthias Endler <matthias@endler.dev>
2023-06-26 12:06:24 +02:00
Stefan Kreutz
7dd84f6b7c
Add optional Rustls support (#1099)
* Add optional Rustls support

This commit adds a non-default feature flag to use Rustls instead of OpenSSL.

My personal motivation is to use Lychee on OpenBSD -current, where the
`openssl` crate frequently fails to link against the unreleased system
LibreSSL. Using the `vendored-openssl` feature helps with compilation, but
segfaults at runtime.

The commit adds three feature flags to the library, binary, benchmark, and all
examples:

- The `native-tls` feature flag toggles the `openssl` crate.
- The `rustls-tls` feature flag toggles the `rustls` crate.
- The `email-check` feature flag toggles the `check-if-email-exists` crate,
  which is the only existing functionality currently incompatible with Rustls.

By default, `native-tls` and `email-check` are enabled. Thus, Lychee (bin and
lib) can be used as before unless default features are disabled.

To use the Rustls feature, pass `--no-default-features --features rustls` to
cargo check/build/test/..., e.g.,

    $ cargo clippy --workspace --all-targets --no-default-features \ --features
    rustls-tls -- --deny warnings

Checking email addresses requires both, `native-tls` and `email-check`, to be
enabled. Otherwise, email addresses are excluded.

The `email-check` feature flag is technically not necessary. I preferred it
over `not(rustls-tls)` because it's clearer and it addresses the AGPL license
issue #594. As far as I understand, a Lychee binary compiled without the
`email-check` feature could be distributed with file-based copyleft for the
MPL-licensed dependencies only. But that's out of scope here.

The benchmark shows a performance regression varying between 2% and 4.4% when
using Rustls instead of OpenSSL on my machine.

PS: The `ring` crate needs to be patched on OpenBSD 7.3 and later until the new
xonly patches have been upstreamed, see the `rust-ring` port.

* Use platform native certificates with Rustls

By default, reqwest uses the webpki-roots crate with Rustls, effectively
bundling Mozilla's root certificates.

This commit uses the rustls-native-certs crate instead to use locally
installed root certificates, to minimize the difference between the
native-tls and rustls-tls features.

* Document feature flags
2023-06-16 02:21:57 +02:00
Matthias
e47577708b formatting 2023-04-11 16:36:32 +02:00
Matthias
e96d4114a9 helpers -> utils 2023-04-11 00:43:57 +02:00
Matthias Endler
2255ad9286
Better retry handling (#981)
Previously, lychee would blindly retry all requests,
no matter if the request error was transient or fatal.

Taking a lesson from https://github.com/TrueLayer/reqwest-middleware,
we can be more granular about the error behavior.
This PR adds their retry logic to lychee, reducing the number of
unnecessary requests significantly.

I also made some ergonomic changes to the client, which should not
affect its behavior.
2023-03-10 22:36:45 +01:00
Lucius Hu
e2406089ad
chore!: improve client and remap modules (#913)
`lychee_lib::client`:

- Improved documentation.
- Added an log message in `ClientBuilder::client()` when provied user-agent
  overrides the one defined in provied custom header.
- Removed unnecessary error handling in `Client::check()` when setting HTTPS
  scheme because all failure cases should occur when checking this URL the first
  time already.
- Removed unnecessary error handling in `Client::remap()` since
  `lychee-lib::remap::Remaps::remap()` doesn't returns a `Result` anymore.
- Fixed potential integer overflow in `Client::check_website()` when the wait
  time between retries doubles, by using `std::time::Duration::saturating_mul`
  instead.
- Renamed `invalid()` to `validate_url()`.

`lychee_lib::remap`:

- Improved documentation, in particular, clarified (in the comment) that it's
  URLs not URIs being remapped.
- Changed `Remaps::remap()` so it takes `&mut Url` instead of `Uri` as its
  argument, and doesn't return a `Result` as a result.
    - Using `Url` instead of `Uri` because it aligns with the concept of
      remapping locations rather than identifiers.
    - Mutating the URL directly instead of returning a new one for it's more
      straightforward.
    - There is no error handling because we don't convert from URL to URI
      anymore. Furthermore, this always succeed in the first place so we never
      needed error handling.
- Added implementation of `IntoIterator` for `&'a Remaps` and convenience method
  of `Remaps::iter`. (Their mutable or moving counterparts are deliberately
  avoided because we don't want library users to modify all consume the
  remapping rules after its instantiation.)

`lychee_lib::error`:

- Renamed `ErrorKind::InvalidUriRemap` to `InvalidUrlRemap` and improved
  its error message.

Changes to other modules are minor and only serves to accompany aforementioned
changes.
2023-01-16 19:14:09 +01:00
Matthias
84de43c554
Refactor request types (#637) 2022-06-03 20:13:07 +02:00
Matthias
22fecfc056
Add support for URI remapping (#620)
Remaps allow mapping from a URI pattern to a different URI.

The syntax is

```
lychee --remap 'https://example.com http://127.0.0.1'
```

Some use-cases are
- Testing URIs prior to production deployment
- Testing URIs behind a proxy

Be careful when using this feature because checking every link against a
large set of regular expressions has a performance impact. Also there are no
constraints on the URI mapping, so the rules might contradict with each
other.
Remap rules get applied in order of definition to every input URI.
2022-05-29 21:41:22 +02:00
Matthias
05bd3817ee
Make retry wait time configurable (#525) 2022-02-24 12:24:57 +01:00
Matthias
47df7780fe
Use captured identifiers in format strings (#507)
Makes for arguably cleaner-looking code.
The downside is that the MSRV is 1.58
https://blog.rust-lang.org/2022/01/13/Rust-1.58.0.html

Given that nobody uses lychee as a library yet
and we have precompiled binaries, it's an acceptable
tradeoff.
My little research revealed that this is a much-liked
feature: https://twitter.com/matthiasendler/status/1483895557621960715
2022-02-12 10:51:52 +01:00
Lucius Hu
476a048350
lychee-lib::client reworked (#500)
This commit mainly added or improved documentation for `lychee-lib::client`
module.

But it also contains a few API changes:

- `ClientBuilder::client()` now consumes itself instead of taking a reference.
  This helps to avoid a few unnecessary clones.
- `ClientBuilder::build_filter()` was a private function and is inlined to avoid
  unnecessary clones.
- Added a new crate-scoped function `Uri::set_scheme()`.

* added notes on deprecated site-local network

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2022-02-10 00:04:48 +01:00
Matthias
ac490f9c53
Add caching functionality (v2) (#443)
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:

 * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version.    Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
2022-01-14 15:25:51 +01:00
Matthias
21f3160b71
Make retries configurable; align constants (#446)
Using the same default values for the library and the
binary now but tweaked the values a bit for slightly faster performance.
2022-01-07 01:03:10 +01:00
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00
Matthias
3d5135668b
Improve concurrency with streams (#330)
* Move to from vec to streams

Previously we collected all inputs in one vector
before checking the links, which is not ideal.
Especially when reading many inputs (e.g. by using a glob pattern),
this could cause issues like running out of file handles.

By moving to streams we avoid that scenario. This is also the first
step towards improving performance for many inputs.

To stay as close to the pre-stream behaviour, we want to stop processing
as soon as an Err value appears in the stream. This is easiest when the
stream is consumed in the main thread.
Previously, the stream was consumed in a tokio task and the main thread
waited for responses.
Now, a tokio task waits for responses (and displays them/registers
response stats) and the main thread sends links to the ClientPool.
To ensure that the main thread waits for all responses to have arrived
before finishing the ProgressBar and printing the stats, it waits for
the show_results_task to finish.


* Return collected links as Stream
* Initialize ProgressBar without length because we can't know the amount of links without blocking
* Handle stream results in main thread, not in task
* Add basic directory support using jwalk
* Add test for HTTP protocol file type (http://)
* Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`.
* Refactor main; fix tests
* Move commands into separate submodule
* Simplify input handling
* Simplify collector
* Remove unnecessary unwrap
* Simplify main
* cleanup check
* clean up dump command
* Handle requests in parallel 
* Fix formatting and lints

Co-authored-by: Timo Freiberg <self@timofreiberg.com>
2021-12-01 18:25:11 +01:00
Matthias
56726f41fc
Add back connection pool (#355) 2021-10-08 13:08:44 +02:00
MichaIng
961f12e58e
Remove cache from collector and remove custom reqwest client pool
* Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file

Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-10-07 18:07:18 +02:00
Matthias
21ea0fd033
Add support for tokio-console (#318)
This allows troubleshooting and improving async Rust code.
It is an optional feature that is still
experimental (but can be quite helpful)
2021-09-12 18:10:23 +02:00
Matthias
dd3205a87c wip 2021-09-06 15:19:43 +02:00
Matthias
887f1b9589 Split up file checking into file discovery and validation of path exists 2021-09-06 15:19:09 +02:00
Matthias
bfa3b1b6a1 Introduce Base type, which can be a path or URL 2021-09-06 15:15:40 +02:00
Matthias Endler
d5bb7ee7d7 Or Patterns (Rust 1.53) 2021-09-06 15:15:05 +02:00
Matthias Endler
701fbc9ada Add support for local files 2021-09-06 15:14:33 +02:00
Matthias
fe399c0a8c
Simple URI cache (#243) 2021-05-04 13:28:39 +02:00
Matthias
1926c73b6b
Add missing docs (#231)
This enables `#![deny(missing_docs)`) and adds all missing doc strings
2021-04-23 00:27:12 +02:00
Lucius Hu
f64213d58c
More refactor (#225)
- Major changes in `lychee-lib::filter` module:
  - Fields in `Excludes` except the `RegexSet` is now moved to `Filter`.
  - `Filter` contains `Option<Excludes>` and `Option<Includes>`, which are
    wrapper struct of `RegexSet` instead of `Option<RegexSet>`. As a result
    the code now looks cleaner.
  - Factored out some filtering logics to dedicated functions.
    - It's possible to write tests for those functions in addition to tests
      for the `Filter` struct.
  - Added docs to `Filter::is_excluded` and reorgnized the code.
- placed `derive_builder` by `typed_builder`:
  - The internal interface very ugly, as admitted by the author, but we no
    longer have nested `Option`s like before.
  - As a result, the `Client` building is much easier to read.
  - Main benefit of `typed_builder` is, the arguments feeded to builder is
    checked at compile time instead of run-time.
- Fixed a bug in `lychee::tests::usage` and `lychee-lib::stats::test`.
  - Now it will clear environment variable which would otherwise cause an
    issue if `GITHUB_TOKEN` is set.
- Updated dependencies.

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2021-04-16 20:25:22 +02:00
Lucius Hu
228e5df6a3
Major refactor of codebase (#208)
- The binary component and library component are separated as two
  packages in the same workspace.
  - `lychee` is the binary component, in `lychee-bin/*`.
  - `lychee-lib` is the library component, in `lychee-lib/*`.
  - Users can now install only the `lychee-lib`, instead of both
    components, that would require fewer dependencies and faster
    compilation.
  - Dependencies for each component are adjusted and updated. E.g.,
    no CLI dependencies for `lychee-lib`.
  - CLI tests are only moved to `lychee`, as it has nothing to do
    with the library component.
- `Status::Error` is refactored to contain dedicated error enum,
  `ErrorKind`.
  - The motivation is to delay the formatting of errors to strings.
    Note that `e.to_string()` is not necessarily cheap (though
    trivial in many cases). The formatting is no delayed until the
    error is needed to be displayed to users. So in some cases, if
    the error is never used, it means that it won't be formatted at
    all.
- Replaced `regex` based matching with one of the following:
  - Simple string equality test in the case of 'false positivie'.
  - URL parsing based test, in the case of extracting repository and
    user name for GitHub links.
  - Either cases would be much more efficient than `regex` based
    matching. First, there's no need to construct a state machine for
    regex. Second, URL is already verified and parsed on its creation,
    and extracting its components is fairly cheap. Also, this removes
    the dependency on `lazy-static` in `lychee-lib`.
- `types` module now has a sub-directory, and its components are now
  separated into their own modules (in that sub-directory).
- `lychee-lib::test_utils` module is only compiled for tests.
- `wiremock` is moved to `dev-dependency` as it's only needed for
  `test` modules.
- Dependencies are listed in alphabetical order.
- Imports are organized in the following fashion:
  - Imports from `std`
  - Imports from 3rd-party crates, and `lychee-lib`.
  - Imports from `crate::*` or `super::*`.
- No glob import.
- I followed suggestion from `cargo clippy`, with `clippy::all` and
  `clippy:pedantic`.

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2021-04-15 01:24:11 +02:00