lychee

mirror of https://github.com/Hopiu/lychee.git synced 2026-04-30 18:04:46 +00:00

Author	SHA1	Message	Date
Matthias	22fecfc056	Add support for URI remapping (#620 ) Remaps allow mapping from a URI pattern to a different URI. The syntax is ``` lychee --remap 'https://example.com http://127.0.0.1' ``` Some use-cases are - Testing URIs prior to production deployment - Testing URIs behind a proxy Be careful when using this feature because checking every link against a large set of regular expressions has a performance impact. Also there are no constraints on the URI mapping, so the rules might contradict each other. Remap rules get applied in order of definition to every input URI.	2022-05-29 21:41:22 +02:00
Matthias	363b95fe5f	Add support for excluding paths from link checking (#623 ) This change deprecates `--exclude-file` as it was ambiguous. Instead, `--exclude-path` was introduced to support excluding paths to files and directories that should not be checked. Furthermore, `.lycheeignore` is now the only way to exclude URL patterns.	2022-05-29 17:27:09 +02:00
Matthias	571b49410c	Extend reqwest client settings (#617 ) This sets a HTTP connect timeout (for stability) and a TCP keepalive (for performance). The connect timeout should help with flaky servers, which would block the runtime and therefore other requests. The keepalive helps when making many requests to the same host. This is a very common pattern for checking internal documentation, which is an important use-case of lychee. The settings are currently not configurable by the user and set to sane defaults. We might make this configurable in the future if there is demand to do so.	2022-05-13 18:51:11 +02:00
Matthias	8c0a32d81d	Refactor response formatting (#599 ) * Add support for raw formatter (no color) * Introduce ResponseFormatter trait * Pass the same params to every cli command * Update dependencies * Remove pretty_assertions dependency (latest version doesn't build)	2022-04-25 19:19:36 +02:00
Matthias	a607b853c9	Move to downstream optimization for short strings (#600 ) Skipping to parse very short strings was merged into linkify so our own workaround is unnecessary https://github.com/robinst/linkify/pull/34	2022-04-25 19:18:50 +02:00
Matthias	da7bbf113d	Remove unnecessary Ok wrapper	2022-04-12 01:39:38 +02:00
Matthias	6ebc9fed4b	Reset nofollow in html5gum start tag (#584 )	2022-04-06 00:49:00 +02:00
Matthias	debe958766	Add support for nofollow (#572 )	2022-04-04 10:32:00 +02:00
Matthias	03d28820bb	Extract more status information from reqwest (#577 ) Recently we cleaned up the commandline output to trim away redundant information like the URL, which occurred twice. Unfortunately we also removed helpful information from reqwest, which could support the user in troubleshooting unexpected errors. This commit reverts that. We now extract the meaningful information from reqwest, without being too verbose. For that we have to depend on the string output for the reqwest error, but it's better than hiding that information from the user. It is fragile as it depends on the reqwest internals, but in the worst case we simply return the full error text in case our parsing won't work.	2022-04-02 14:37:03 +02:00
Matthias	5ad7b14bdd	Regression: Ignore invalid URLs (#571 ) With the refactoring of the URL checking as a workaround for the upstream reqwest panic on invalid URLs, we introduced a regression, which caused unsupported URL schemes to show up as errors in the lychee output. This commit changes the behavior such that invalid schemes get ignored again by making a differentiation between truly invalid URIs which would make reqwest panic, and ones which are valid but just not handled by reqwest. The check was moved to `check_website` such that the invalid URIs would not be checked three times in a loop before erroring out.	2022-03-27 23:22:46 +02:00
Matthias	36d3195c68	Cache verbosity issue (fixes #562 )	2022-03-27 14:48:09 +02:00
Matthias	743d386252	Allow input URLs without scheme (fixes #567 ) This requires `Input::new` to return a `Result`, because the URL parsing could fail when prepending `http://`. We use http instead of https, because curl does as well: `70ac27604a/lib/urlapi.c (L1104-L1124)` Missing files will be interpreted as URLs from the command line and these can be invalid, but that's not seen as an error anymore.	2022-03-27 01:27:27 +01:00
Matthias	d616177a99	Implement excluding code blocks (#523 ) This is done in the extractor to avoid unnecessary allocations.	2022-03-26 10:42:56 +01:00
Matthias	77b1724881	Optimize plaintext extractor for small strings (#565 ) Immediately return for very small strings which cannot be valid URIs. The shortest valid URI without a scheme might be g.cn (Google China) At least I am not aware of a shorter one. We set this as a lower threshold for parsing URIs from plaintext to avoid false-positives and as a slight performance optimization, which could add up for big files. This threshold might be adjusted in the future.	2022-03-23 23:06:49 +01:00
Matthias	e1d112dbab	Remove `missing_panic_doc` (#561 )	2022-03-22 21:02:56 +01:00
Matthias	45de5c763e	Avoid reqwest panic on invalid URIs (#557 )	2022-03-22 13:15:11 +01:00
Matthias	ceb185e579	Add more comments to path methods (#543 )	2022-03-08 13:50:54 +01:00
Matthias	8097bfa408	Print Github token error once at the end (#537 ) Print original reqwest error for every Github link. It contains more information about the underlying error. Only print a message about the Github token at the end if it's not set and there were Github errors.	2022-03-03 10:04:55 +01:00
Matthias	4c51fce22f	Fix broken pipe error on failing writes to stdout (#535 ) Make sure that broken pipes (e.g. when a reader of a pipe prematurely exits during execution) get handled gracefully. This change also moves some error messages to stderr by using eprintln. More info: https://github.com/jez/as-tree/issues/15	2022-03-02 23:39:54 +01:00
Matthias	0fc5fc9ffe	Print errors with a different format for easier clickability (fixes #532 )	2022-03-01 16:58:04 +01:00
Matthias	05bd3817ee	Make retry wait time configurable (#525 )	2022-02-24 12:24:57 +01:00
Matthias	41b291037a	Response output overhaul (#524 ) Clean up the response output. Superfluous information was removed and the formatting was changed to make the output more readable to humans.	2022-02-23 17:28:14 +01:00
Lucius Hu	70ebe45117	Improved IPv6 filtering support (#501 ) This commit uses crate `ip_network` to determine whether an IPv6 address is link-local or unique local. Note that this extra dependency can be removed once rust-lang/rust#27709 is stabilized. Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2022-02-22 10:39:44 +01:00
Matthias	ba276cd51b	Error cleanup (#510 ) * Add more fine-grained error types; remove generic IO error * Update error message for missing file * Remove missing `Error` suffix * Rename ErrorKind::Github to ErrorKind::GithubRequest for consistency with NetworkRequest	2022-02-19 01:44:00 +01:00
Matthias	812663d832	Prevent flaky tests (#514 ) Move from example.org to example.com, which seems to be more permissive for testing	2022-02-18 10:29:49 +01:00
Lucius Hu	6d56c6b55c	Replace plain String with SecretString for GitHub token (#509 ) This commit changed the type of `lychee-lib::ClientBuilder::github_token` from `String` to `secrecy::SecretString` to fortify the secret management within our program. Note that this won't affect TOML configuration of `lychee-bin` because `serde::Deserialize` is still implemented for `SecretString`.	2022-02-13 13:53:46 +01:00
Matthias	47df7780fe	Use captured identifiers in format strings (#507 ) Makes for arguably cleaner-looking code. The downside is that the MSRV is 1.58 https://blog.rust-lang.org/2022/01/13/Rust-1.58.0.html Given that nobody uses lychee as a library yet and we have precompiled binaries, it's an acceptable tradeoff. My little research revealed that this is a much-liked feature: https://twitter.com/matthiasendler/status/1483895557621960715	2022-02-12 10:51:52 +01:00
Lucius Hu	53c41b03d8	replace hubcaps by octocrab (#502 ) This commit replaced `hubcaps` by `octocrab`, which has more downloads per month and receives more frequent release updates. The caveats are: 1. When instantiating the API client, `octocrab` doesn't offer you a way to specify a custom user-agent. But I would argue that, at least presently, this doesn't seem to cause issues. 2. `octocrab` doesn't export as many details of its error types as `hubcaps` does. So we will have less control over the display of the error message. But I would also argue that this is not really important. Though we should do more tests to make sure the error looks good enough. * hide implementation details in error message Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2022-02-11 23:43:47 +01:00
Lucius Hu	476a048350	lychee-lib::client reworked (#500 ) This commit mainly added or improved documentation for `lychee-lib::client` module. But it also contains a few API changes: - `ClientBuilder::client()` now consumes itself instead of taking a reference. This helps to avoid a few unnecessary clones. - `ClientBuilder::build_filter()` was a private function and is inlined to avoid unnecessary clones. - Added a new crate-scoped function `Uri::set_scheme()`. * added notes on deprecated site-local network Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2022-02-10 00:04:48 +01:00
Markus Unterwaditzer	68d09f7e5b	Add html5gum as alternative link extractor (#480 ) html5gum is an HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate any tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever, is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.	2022-02-07 22:54:47 +01:00
Matthias	6635863746	Add Alpine page for benchmark; refactor code (#481 )	2022-01-27 23:42:06 +01:00
Matthias	97b06230fc	Add missing Github exclusions; sort entries (#473 )	2022-01-21 23:54:59 +01:00
Matthias	5802ae912c	Fix bugs in extractor; reduce allocs (#464 ) When URLs couldn't be extracted from a tag, we ran a plaintext search, but never added the newly found urls to the vec of extracted urls. Also tried to make the code a little more idiomatic	2022-01-16 02:13:38 +01:00
Matthias	6e757fa20e	Add more information about mail errors (#463 )	2022-01-14 22:22:53 +01:00
Matthias	994aadf6a1	Simplify error messages (#462 ) Using pattern matching to make the hubcaps and reqwest error messages a little shorter and (subjectively) more readable.	2022-01-14 15:26:13 +01:00
Matthias	ac490f9c53	Add caching functionality (v2) (#443 ) A while ago, caching was removed due to some issues (see #349). This is a new implementation with the following improvements: * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before. * Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests. * Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either). This is an optional feature. Caching only gets used if the `--cache` flag is set.	2022-01-14 15:25:51 +01:00
Matthias	1e76e82811	Add test for nonexistent Github file	2022-01-12 09:25:12 +01:00
Matthias	48c8153e11	Refactor Github checking; add docs	2022-01-12 09:25:12 +01:00
Matthias	50d7b05736	Conditionally compile constructors for GithubUri for tests	2022-01-12 09:25:12 +01:00
Matthias	8d445a3a4b	Be more permissive around private GH repos The Github API doesn't handle checking individual files inside repos or paths like `github.com/org/repo/issues`, so we are more permissive and only check for repo existence. This is the only way to get a basic check for private repos. Public repos are not affected and should work with a normal check.	2022-01-12 09:25:12 +01:00
Matthias	e91c0c60f0	Only accept two path segments (org/repo) for Github API check	2022-01-12 09:25:12 +01:00
Matthias	7667842bb6	Strip `.git` suffix from Github URLs (#384 )	2022-01-12 09:25:12 +01:00
Matthias	21f3160b71	Make retries configurable; align constants (#446 ) Using the same default values for the library and the binary now but tweaked the values a bit for slightly faster performance.	2022-01-07 01:03:10 +01:00
Matthias	388bbbe7b0	Exclude known false-positives from Github API check (#445 ) Fixes https://github.com/lycheeverse/lychee/issues/431	2022-01-06 00:33:53 +01:00
Matthias	dd48466d9a	Add missing test for local links in plaintext files (#444 )	2022-01-05 12:51:14 +01:00
Matthias	01393b34a2	Upgrade to Rust 2021 (#427 )	2021-12-17 01:32:13 +01:00
Matthias	83182c29ca	Fix JSON serialization (#426 ) We recently removed the custom serialization for InputSource. This causes the JSON formatter to fail with "key must be a string". Add it back and add a comment on why this is needed.	2021-12-16 23:55:04 +01:00
Matthias	166c86c30e	Use tokenizer for extraction; add benchmark (#424 ) This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master. Old: 4.557 s ± 0.404 s New: 3.832 s ± 0.131 s The performance fluctuates a little less as well. Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake. Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type. The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.	2021-12-16 18:45:52 +01:00
Matthias	c41ba64a69	Max concurrency moved to check (#419 ) Concurrency is defined by the channel size consuming from the request stream in `check`	2021-12-07 11:52:40 +01:00
Matthias	3d5135668b	Improve concurrency with streams (#330 ) * Move from vec to streams Previously we collected all inputs in one vector before checking the links, which is not ideal. Especially when reading many inputs (e.g. by using a glob pattern), this could cause issues like running out of file handles. By moving to streams we avoid that scenario. This is also the first step towards improving performance for many inputs. To stay as close to the pre-stream behaviour, we want to stop processing as soon as an Err value appears in the stream. This is easiest when the stream is consumed in the main thread. Previously, the stream was consumed in a tokio task and the main thread waited for responses. Now, a tokio task waits for responses (and displays them/registers response stats) and the main thread sends links to the ClientPool. To ensure that the main thread waits for all responses to have arrived before finishing the ProgressBar and printing the stats, it waits for the show_results_task to finish. * Return collected links as Stream * Initialize ProgressBar without length because we can't know the amount of links without blocking * Handle stream results in main thread, not in task * Add basic directory support using jwalk * Add test for HTTP protocol file type (http://) * Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`. * Refactor main; fix tests * Move commands into separate submodule * Simplify input handling * Simplify collector * Remove unnecessary unwrap * Simplify main * cleanup check * clean up dump command * Handle requests in parallel * Fix formatting and lints Co-authored-by: Timo Freiberg <self@timofreiberg.com>	2021-12-01 18:25:11 +01:00
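
The note in 22fecfc056 that remap rules are applied in order of definition, and may contradict each other, can be illustrated with a small sketch. This is not lychee's implementation: `RemapRule`, `parse_rule` and `remap` are hypothetical names, plain prefix matching stands in for the regular expressions lychee uses (to keep the example std-only), and it is an assumption of this sketch that every rule is tried in turn and sees the output of the previous one.

```rust
/// Hypothetical rule: a URI starting with `prefix` has that prefix
/// swapped for `target`. lychee matches regular expressions instead;
/// a plain prefix is used here only to avoid external crates.
struct RemapRule {
    prefix: String,
    target: String,
}

/// Parse one argument of the form "<pattern> <uri>", mirroring the
/// shape of `--remap 'https://example.com http://127.0.0.1'`.
fn parse_rule(arg: &str) -> Option<RemapRule> {
    let mut parts = arg.split_whitespace();
    let prefix = parts.next()?.to_string();
    let target = parts.next()?.to_string();
    // Anything beyond two tokens is treated as malformed.
    if parts.next().is_some() {
        return None;
    }
    Some(RemapRule { prefix, target })
}

/// Apply all rules in order of definition (assumption: no early exit,
/// so a later rule can rewrite the result of an earlier one).
fn remap(rules: &[RemapRule], uri: &str) -> String {
    let mut current = uri.to_string();
    for rule in rules {
        let next = current
            .strip_prefix(rule.prefix.as_str())
            .map(|rest| format!("{}{}", rule.target, rest));
        if let Some(next) = next {
            current = next;
        }
    }
    current
}
```

Under these assumptions, defining `a -> b` before `b -> c` sends a URI under `a` to `c`, while the reverse order stops at `b`, which is why the order of definition matters.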


113 commits