lychee

mirror of https://github.com/Hopiu/lychee.git synced 2026-04-28 00:44:46 +00:00

Author	SHA1	Message	Date
Matthias Endler	6df1c378ec	Fix Rust 1.66 clippy lints (#879 )	2022-12-19 14:28:10 +01:00
Matthias	d61105edbb	Fix parsing error of email addresses with query params (#809 ) Email addresses with query parameters are often used in contact forms on websites. They can also be found in other documents like Markdown. A common use-case is to add a subject line to the email as a parameter, e.g. `mailto:mail@example.com?subject="Hello"`. Previously we handled such cases incorrectly by recognizing them as files. The reason was that our email parsing was too strict to allow for that use-case. With `email_address` we switched to a more permissive parser. Note that this does not affect the actual email address checking, as this is still done by `check-if-email-exists`, which performs stricter checks.	2022-11-05 23:40:33 +01:00
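The parsing problem described in this commit can be illustrated with a minimal, std-only sketch. It is not lychee's code and does not use the `email_address` crate; the helper name `split_mailto` and its loose validation are assumptions made purely for illustration. The point is that the query part has to be separated from the address before the address is judged.

```rust
/// Hypothetical helper (not part of lychee): splits a `mailto:` URI into the
/// bare address and the optional query string (e.g. `subject=Hello`).
/// Returns `None` if the input is not a `mailto:` URI or lacks a usable `@`.
fn split_mailto(input: &str) -> Option<(&str, Option<&str>)> {
    let rest = input.strip_prefix("mailto:")?;
    // Everything after the first `?` is the query, not part of the address.
    let (address, query) = match rest.split_once('?') {
        Some((addr, q)) => (addr, Some(q)),
        None => (rest, None),
    };
    // Deliberately loose sanity check; real validation belongs to a proper parser.
    let (local, domain) = address.split_once('@')?;
    if local.is_empty() || domain.is_empty() {
        return None;
    }
    Some((address, query))
}
```

With this sketch, `mailto:mail@example.com?subject=Hello` yields the address `mail@example.com` plus the query `subject=Hello`, instead of being rejected as a non-email.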
Matthias	94dda21326	Fix clippy lints	2022-09-27 18:17:37 +02:00
dependabot[bot]	226546091b	Bump check-if-email-exists from 0.8.31 to 0.9.0 (#735 ) * Bump check-if-email-exists from 0.8.31 to 0.9.0 Bumps [check-if-email-exists](https://github.com/reacherhq/check-if-email-exists) from 0.8.31 to 0.9.0. - [Release notes](https://github.com/reacherhq/check-if-email-exists/releases) - [Changelog](https://github.com/reacherhq/check-if-email-exists/blob/master/CHANGELOG.md) - [Commits](https://github.com/reacherhq/check-if-email-exists/compare/v0.8.31...v0.9.0) --- updated-dependencies: - dependency-name: check-if-email-exists dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update usage Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2022-08-16 12:35:34 +02:00
Matthias	6fae93f2da	Skip caching unsupported and excluded URLs (#692 ) As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported and excluded URLs. Unsupported URLs might be supported in the future and caching them would mean they won't get checked then. Excluded URLs were excluded for a reason and should not appear in the cache. Furthermore they might not be excluded in a consecutive run, leading to a false-positive.	2022-07-17 18:40:45 +02:00
Matthias	84de43c554	Refactor request types (#637 )	2022-06-03 20:13:07 +02:00
Matthias	22fecfc056	Add support for URI remapping (#620 ) Remaps allow mapping from a URI pattern to a different URI. The syntax is ``` lychee --remap 'https://example.com http://127.0.0.1' ``` Some use-cases are - Testing URIs prior to production deployment - Testing URIs behind a proxy Be careful when using this feature because checking every link against a large set of regular expressions has a performance impact. Also there are no constraints on the URI mapping, so the rules might contradict with each other. Remap rules get applied in order of definition to every input URI.	2022-05-29 21:41:22 +02:00
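A rough sketch of how ordered remap rules could behave, under two loudly stated assumptions: plain prefix matching stands in for the regular expressions mentioned in the commit, and each rule sees the output of the rules before it. `RemapRule`, `parse_rule`, and `remap` are hypothetical names, not lychee's API.

```rust
/// One remap rule: a URI prefix and its replacement.
/// Assumption: prefix matching replaces the regex matching used by lychee.
struct RemapRule {
    from: String,
    to: String,
}

/// Parses a rule written as `<pattern> <uri>`, the form used by `--remap`.
fn parse_rule(rule: &str) -> Option<RemapRule> {
    let (from, to) = rule.trim().split_once(' ')?;
    let to = to.trim();
    if from.is_empty() || to.is_empty() {
        return None;
    }
    Some(RemapRule { from: from.to_string(), to: to.to_string() })
}

/// Applies all rules in order of definition.
/// Assumption: later rules operate on the result of earlier ones.
fn remap(uri: &str, rules: &[RemapRule]) -> String {
    let mut current = uri.to_string();
    for rule in rules {
        if let Some(rest) = current.strip_prefix(rule.from.as_str()) {
            current = format!("{}{}", rule.to, rest);
        }
    }
    current
}
```

Because every URI is tested against every rule, the cost grows with the number of rules, which is the performance concern the commit message raises. The sketch also shows why contradictory rules are possible: nothing stops a later rule from undoing an earlier one.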
Matthias	363b95fe5f	Add support for excluding paths from link checking (#623 ) This change deprecates `--exclude-file` as it was ambiguous. Instead, `--exclude-path` was introduced to support excluding paths to files and directories that should not be checked. Furthermore, `.lycheeignore` is now the only way to exclude URL patterns.	2022-05-29 17:27:09 +02:00
Matthias	571b49410c	Extend reqwest client settings (#617 ) This sets a HTTP connect timeout (for stability) and a TCP keepalive (for performance). The connect timeout should help with flaky servers, which would block the runtime and therefore other requests. The keepalive helps when making many requests to the same host. This is a very common pattern for checking internal documentation, which is an important use-case of lychee. The settings are currently not configurable by the user and set to sane defaults. We might make this configurable in the future if there is demand to do so.	2022-05-13 18:51:11 +02:00
Matthias	a607b853c9	Move to downstream optimization for short strings (#600 ) Skipping to parse very short strings was merged into linkify so our own workaround is unnecessary https://github.com/robinst/linkify/pull/34	2022-04-25 19:18:50 +02:00
Matthias	da7bbf113d	Remove unnecessary Ok wrapper	2022-04-12 01:39:38 +02:00
Matthias	5ad7b14bdd	Regression: Ignore invalid URLs (#571 ) With the refactoring of the URL checking as a workaround for the upstream reqwest panic on invalid URLs, we introduced a regression that caused unsupported URL schemes to show up as errors in the lychee output. This commit changes the behavior such that invalid schemes get ignored again by differentiating between truly invalid URIs, which would make reqwest panic, and ones which are valid but just not handled by reqwest. The check was moved to `check_website` such that the invalid URIs would not be checked three times in a loop before erroring out.	2022-03-27 23:22:46 +02:00
Matthias	45de5c763e	Avoid reqwest panic on invalid URIs (#557 )	2022-03-22 13:15:11 +01:00
Matthias	8097bfa408	Print Github token error once at the end (#537 ) Print original reqwest error for every Github link. It contains more information about the underlying error. Only print a message about the Github token at the end if it's not set and there were Github errors.	2022-03-03 10:04:55 +01:00
Matthias	05bd3817ee	Make retry wait time configurable (#525 )	2022-02-24 12:24:57 +01:00
Matthias	ba276cd51b	Error cleanup (#510 ) * Add more fine-grained error types; remove generic IO error * Update error message for missing file * Remove missing `Error` suffix * Rename ErrorKind::Github to ErrorKind::GithubRequest for consistency with NetworkRequest	2022-02-19 01:44:00 +01:00
Matthias	812663d832	Prevent flaky tests (#514 ) Move from example.org to example.com, which seems to be more permissive for testing	2022-02-18 10:29:49 +01:00
Lucius Hu	6d56c6b55c	Replace plain String with SecretString for GitHub token (#509 ) This commit changed the type of `lychee-lib::ClientBuilder::github_token` from `String` to `secrecy::SecretString` to fortify the secret management within our program. Note that this won't affect TOML configuration of `lychee-bin` because `serde::Deserialize` is still implemented for `SecretString`.	2022-02-13 13:53:46 +01:00
Lucius Hu	53c41b03d8	replace hubcaps by octocrab (#502 ) This commit replaced `hubcaps` by `octocrab`, which has more downloads per month and receives more frequent release updates. The caveats are: 1. When instantiating the API client, `octocrab` doesn't offer you a way to specify a custom user-agent. But I would argue that, at least presently, this doesn't seem to cause issues. 2. `octocrab` doesn't export as many details of its error types as `hubcaps` does. So we will have less control over the display of the error message. But I would also argue that this is not really important. Though we should do more tests to make sure the error looks good enough. * hide implementation details in error message Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2022-02-11 23:43:47 +01:00
Lucius Hu	476a048350	lychee-lib::client reworked (#500 ) This commit mainly added or improved documentation for `lychee-lib::client` module. But it also contains a few API changes: - `ClientBuilder::client()` now consumes itself instead of taking a reference. This helps to avoid a few unnecessary clones. - `ClientBuilder::build_filter()` was a private function and is inlined to avoid unnecessary clones. - Added a new crate-scoped function `Uri::set_scheme()`. * added notes on deprecated site-local network Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2022-02-10 00:04:48 +01:00
Matthias	6e757fa20e	Add more information about mail errors (#463 )	2022-01-14 22:22:53 +01:00
Matthias	ac490f9c53	Add caching functionality (v2) (#443 ) A while ago, caching was removed due to some issues (see #349). This is a new implementation with the following improvements: * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before. * Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests. * Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either). This is an optional feature. Caching only gets used if the `--cache` flag is set.	2022-01-14 15:25:51 +01:00
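The two-column cache format described in this commit can be sketched as follows. This is an illustration under stated assumptions, not lychee's implementation: a plain `HashMap` stands in for `DashMap` to stay std-only, the `CacheStatus` variants and their string forms are invented, and the CSV handling is naive (no quoting or escaping).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified cache status; variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheStatus {
    Ok,
    Error,
}

impl CacheStatus {
    fn as_str(self) -> &'static str {
        match self {
            CacheStatus::Ok => "ok",
            CacheStatus::Error => "error",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "ok" => Some(CacheStatus::Ok),
            "error" => Some(CacheStatus::Error),
            _ => None,
        }
    }
}

/// Serializes the cache as `uri,status` lines, sorted by URI for stable output.
fn to_csv(cache: &HashMap<String, CacheStatus>) -> String {
    let mut rows: Vec<_> = cache.iter().collect();
    rows.sort_by(|a, b| a.0.cmp(b.0));
    rows.iter()
        .map(|(uri, status)| format!("{},{}\n", uri, status.as_str()))
        .collect()
}

/// Reads the cache back; malformed lines are skipped.
fn from_csv(data: &str) -> HashMap<String, CacheStatus> {
    data.lines()
        .filter_map(|line| {
            // Split at the last comma so commas inside the URI survive.
            let (uri, status) = line.rsplit_once(',')?;
            Some((uri.to_string(), CacheStatus::parse(status)?))
        })
        .collect()
}
```

Keeping a dedicated, flat status type for serialization mirrors the design choice in the commit message: the rich error kinds never have to be written to disk.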
Matthias	1e76e82811	Add test for nonexistent Github file	2022-01-12 09:25:12 +01:00
Matthias	48c8153e11	Refactor Github checking; add docs	2022-01-12 09:25:12 +01:00
Matthias	8d445a3a4b	Be more permissive around private GH repos The Github API doesn't handle checking individual files inside repos or paths like `github.com/org/repo/issues`, so we are more permissive and only check for repo existence. This is the only way to get a basic check for private repos. Public repos are not affected and should work with a normal check.	2022-01-12 09:25:12 +01:00
Matthias	e91c0c60f0	Only accept two path segments (org/repo) for Github API check	2022-01-12 09:25:12 +01:00
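The "two path segments" rule can be sketched with std-only string handling. The function name `github_org_repo` is hypothetical, and the real code works on parsed URLs rather than raw strings, so treat this only as an illustration of the rule.

```rust
/// Hypothetical helper: returns `(org, repo)` only if the GitHub URL path
/// consists of exactly two non-empty segments.
fn github_org_repo(url: &str) -> Option<(&str, &str)> {
    let rest = url
        .strip_prefix("https://github.com/")
        .or_else(|| url.strip_prefix("http://github.com/"))?;
    // Ignore query string and fragment.
    let path = rest.split(|c| c == '?' || c == '#').next()?;
    // Empty segments (e.g. from a trailing slash) are dropped.
    let mut segments = path.split('/').filter(|s| !s.is_empty());
    let org = segments.next()?;
    let repo = segments.next()?;
    if segments.next().is_some() {
        return None; // deeper paths such as `/issues` are not `org/repo`
    }
    Some((org, repo))
}
```

Under this rule, `https://github.com/lycheeverse/lychee` qualifies for the API check, while `https://github.com/lycheeverse/lychee/issues` does not.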
Matthias	21f3160b71	Make retries configurable; align constants (#446 ) Using the same default values for the library and the binary now but tweaked the values a bit for slightly faster performance.	2022-01-07 01:03:10 +01:00
Matthias	01393b34a2	Upgrade to Rust 2021 (#427 )	2021-12-17 01:32:13 +01:00
Matthias	166c86c30e	Use tokenizer for extraction; add benchmark (#424 ) This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master. Old: 4.557 s ± 0.404 s New: 3.832 s ± 0.131 s The performance fluctuates a little less as well. Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake. Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type. The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.	2021-12-16 18:45:52 +01:00
Matthias	b97fda34d0	Add support for different output formats (compact, detailed, markdown) (#375 )	2021-11-18 00:44:48 +01:00
MichaIng	961f12e58e	Remove cache from collector and remove custom reqwest client pool * Reqwest comes with its own request pool, so there's no need to add another layer of indirection. This also gets rid of a lot of allocs. * Remove cache from collector * Improve error handling and documentation * Add back test for request caching in single file Signed-off-by: MichaIng <micha@dietpi.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2021-10-07 18:07:18 +02:00
Matthias	93948d7367	Avoid double-encoding already encoded destination paths E.g. `web%20site` becomes `web site`. That's because Url::from_file_path will encode the full URL in the end. This behavior cannot be configured. See https://github.com/lycheeverse/lychee/pull/262#issuecomment-915245411	2021-09-09 01:44:10 +02:00
Matthias	f3fe46a4d6	Merge branch 'master' of github.com:lycheeverse/lychee into local-files	2021-09-08 00:35:41 +02:00
Matthias	1246fa564c	Don't exclude mail on `exclude-all-private` (#316 )	2021-09-08 00:21:00 +02:00
Matthias	b2ce61357f	Fix build errors; cleanup code	2021-09-06 23:46:31 +02:00
Paweł Romanowski	8fd34a7367	Add no check (dump links only) flag (#99 )	2021-09-06 16:10:48 +02:00
Matthias	daa5be4c3a	Add/change file link tests	2021-09-06 15:19:09 +02:00
Matthias	887f1b9589	Split up file checking into file discovery and validation of path exists	2021-09-06 15:19:09 +02:00
Lucius Hu	80b8a856ac	Add new flag `--require-https` (#195 )	2021-09-04 03:21:54 +02:00
Matthias	164e1aea7e	Add support for multiple schemes (#237 )	2021-04-26 18:24:54 +02:00
Matthias	2a80760f58	Fix crates.io 404 with quirk (#235 )	2021-04-26 14:20:54 +02:00
Lucius Hu	f64213d58c	More refactor (#225 ) - Major changes in `lychee-lib::filter` module: - Fields in `Excludes` except the `RegexSet` are now moved to `Filter`. - `Filter` contains `Option<Excludes>` and `Option<Includes>`, which are wrapper structs of `RegexSet` instead of `Option<RegexSet>`. As a result the code now looks cleaner. - Factored out some filtering logic to dedicated functions. - It's possible to write tests for those functions in addition to tests for the `Filter` struct. - Added docs to `Filter::is_excluded` and reorganized the code. - Replaced `derive_builder` by `typed_builder`: - The internal interface is very ugly, as admitted by the author, but we no longer have nested `Option`s like before. - As a result, the `Client` building is much easier to read. - Main benefit of `typed_builder` is that the arguments fed to the builder are checked at compile time instead of run-time. - Fixed a bug in `lychee::tests::usage` and `lychee-lib::stats::test`. - Now it will clear the environment variable, which would otherwise cause an issue if `GITHUB_TOKEN` is set. - Updated dependencies. Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2021-04-16 20:25:22 +02:00
Lucius Hu	228e5df6a3	Major refactor of codebase (#208 ) - The binary component and library component are separated as two packages in the same workspace. - `lychee` is the binary component, in `lychee-bin/`. - `lychee-lib` is the library component, in `lychee-lib/`. - Users can now install only the `lychee-lib`, instead of both components, which requires fewer dependencies and compiles faster. - Dependencies for each component are adjusted and updated. E.g., no CLI dependencies for `lychee-lib`. - CLI tests are moved to `lychee` only, as they have nothing to do with the library component. - `Status::Error` is refactored to contain a dedicated error enum, `ErrorKind`. - The motivation is to delay the formatting of errors to strings. Note that `e.to_string()` is not necessarily cheap (though trivial in many cases). The formatting is now delayed until the error needs to be displayed to users. So in some cases, if the error is never used, it won't be formatted at all. - Replaced `regex` based matching with one of the following: - Simple string equality test in the case of 'false positive'. - URL parsing based test, in the case of extracting repository and user name for GitHub links. - Either case would be much more efficient than `regex` based matching. First, there's no need to construct a state machine for regex. Second, the URL is already verified and parsed on its creation, and extracting its components is fairly cheap. Also, this removes the dependency on `lazy-static` in `lychee-lib`. - `types` module now has a sub-directory, and its components are now separated into their own modules (in that sub-directory). - `lychee-lib::test_utils` module is only compiled for tests. - `wiremock` is moved to `dev-dependency` as it's only needed for `test` modules. - Dependencies are listed in alphabetical order. - Imports are organized in the following fashion: - Imports from `std` - Imports from 3rd-party crates, and `lychee-lib`. - Imports from `crate::` or `super::`. - No glob import. - I followed suggestions from `cargo clippy`, with `clippy::all` and `clippy::pedantic`. Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2021-04-15 01:24:11 +02:00
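The idea of delaying error formatting, described in the refactor above, can be sketched like this. The type names `Status` and `ErrorKind` come from the commit message, but the variants, fields, and messages shown here are assumptions chosen for illustration and do not reflect the real definitions.

```rust
use std::fmt;

/// Structured error data; variants are hypothetical.
#[derive(Debug)]
enum ErrorKind {
    InvalidUri(String),
    Timeout { secs: u64 },
}

impl fmt::Display for ErrorKind {
    // Formatting happens only here, when someone actually displays the error.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ErrorKind::InvalidUri(uri) => write!(f, "invalid URI: {}", uri),
            ErrorKind::Timeout { secs } => write!(f, "request timed out after {}s", secs),
        }
    }
}

/// The status carries the error value itself, not a pre-rendered string.
#[derive(Debug)]
enum Status {
    Ok(u16),
    Error(ErrorKind),
}

impl Status {
    /// Checking success never touches the formatting code.
    fn is_success(&self) -> bool {
        matches!(self, Status::Ok(_))
    }
}
```

A caller that only counts failures via `is_success()` pays nothing for string construction; the message is built only if `to_string()` or `{}` formatting is invoked on the contained `ErrorKind`.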

43 commits