lychee

mirror of https://github.com/Hopiu/lychee.git synced 2026-05-05 04:14:53 +00:00

Author	SHA1	Message	Date
Matthias	41b291037a	Response output overhaul (#524 ) Clean up the response output. Superfluous information was removed and the formatting was changed to make the output more readable to humans.	2022-02-23 17:28:14 +01:00
Lucius Hu	70ebe45117	Improved IPv6 filtering support (#501 ) This commit uses crate `ip_network` to determine whether an IPv6 address is link-local or unique local. Note that this extra dependency can be removed once rust-lang/rust#27709 is stabilized. Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2022-02-22 10:39:44 +01:00
Matthias	ba276cd51b	Error cleanup (#510 ) * Add more fine-grained error types; remove generic IO error * Update error message for missing file * Remove missing `Error` suffix * Rename ErrorKind::Github to ErrorKind::GithubRequest for consistency with NetworkRequest	2022-02-19 01:44:00 +01:00
Matthias	812663d832	Prevent flaky tests (#514 ) Move from example.org to example.com, which seems to be more permissive for testing	2022-02-18 10:29:49 +01:00
Lucius Hu	6d56c6b55c	Replace plain String with SecretString for GitHub token (#509 ) This commit changed the type of `lychee-lib::ClientBuilder::github_token` from `String` to `secrecy::SecretString` to fortify the secret management within our program. Note that this won't affect TOML configuration of `lychee-bin` because `serde::Deserialize` is still implemented for `SecretString`.	2022-02-13 13:53:46 +01:00
Matthias	47df7780fe	Use captured identifiers in format strings (#507 ) Makes for arguably cleaner-looking code. The downside is that the MSRV is 1.58 https://blog.rust-lang.org/2022/01/13/Rust-1.58.0.html Given that nobody uses lychee as a library yet and we have precompiled binaries, it's an acceptable tradeoff. My little research revealed that this is a much-liked feature: https://twitter.com/matthiasendler/status/1483895557621960715	2022-02-12 10:51:52 +01:00
Lucius Hu	53c41b03d8	replace hubcaps by octocrab (#502 ) This commit replaced `hubcaps` by `octocrab`, which has more downloads per month and receives more frequent release updates. The caveats are: 1. When instantiating the API client, `octocrab` doesn't offer you a way to specify a custom user-agent. But I would argue that, at least presently, this doesn't seem to cause issues. 2. `octocrab` doesn't export as many details of its error types as `hubcaps` does. So we will have less control over the display of the error message. But I would also argue that this is not really important. Though we should do more tests to make sure the error looks good enough. * hide implementation details in error message Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2022-02-11 23:43:47 +01:00
Lucius Hu	476a048350	lychee-lib::client reworked (#500 ) This commit mainly added or improved documentation for `lychee-lib::client` module. But it also contains a few API changes: - `ClientBuilder::client()` now consumes itself instead of taking a reference. This helps to avoid a few unnecessary clones. - `ClientBuilder::build_filter()` was a private function and is inlined to avoid unnecessary clones. - Added a new crate-scoped function `Uri::set_scheme()`. * added notes on deprecated site-local network Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2022-02-10 00:04:48 +01:00
Markus Unterwaditzer	68d09f7e5b	Add html5gum as alternative link extractor (#480 ) html5gum is an HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate any tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever, is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.	2022-02-07 22:54:47 +01:00
Matthias	6635863746	Add Alpine page for benchmark; refactor code (#481 )	2022-01-27 23:42:06 +01:00
Matthias	97b06230fc	Add missing Github exclusions; sort entries (#473 )	2022-01-21 23:54:59 +01:00
Matthias	5802ae912c	Fix bugs in extractor; reduce allocs (#464 ) When URLs couldn't be extracted from a tag, we ran a plaintext search, but never added the newly found urls to the vec of extracted urls. Also tried to make the code a little more idiomatic	2022-01-16 02:13:38 +01:00
Matthias	6e757fa20e	Add more information about mail errors (#463 )	2022-01-14 22:22:53 +01:00
Matthias	994aadf6a1	Simplify error messages (#462 ) Using pattern matching to make the hubcaps and reqwest error messages a little shorter and (subjectively) more readable.	2022-01-14 15:26:13 +01:00
Matthias	ac490f9c53	Add caching functionality (v2) (#443 ) A while ago, caching was removed due to some issues (see #349). This is a new implementation with the following improvements: * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before. * Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests. * Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either). This is an optional feature. Caching only gets used if the `--cache` flag is set.	2022-01-14 15:25:51 +01:00
Matthias	1e76e82811	Add test for nonexistent Github file	2022-01-12 09:25:12 +01:00
Matthias	48c8153e11	Refactor Github checking; add docs	2022-01-12 09:25:12 +01:00
Matthias	50d7b05736	Conditionally compile constructors for GithubUri for tests	2022-01-12 09:25:12 +01:00
Matthias	8d445a3a4b	Be more permissive around private GH repos The Github API doesn't handle checking individual files inside repos or paths like `github.com/org/repo/issues`, so we are more permissive and only check for repo existence. This is the only way to get a basic check for private repos. Public repos are not affected and should work with a normal check.	2022-01-12 09:25:12 +01:00
Matthias	e91c0c60f0	Only accept two path segments (org/repo) for Github API check	2022-01-12 09:25:12 +01:00
Matthias	7667842bb6	Strip `.git` suffix from Github URLs (#384 )	2022-01-12 09:25:12 +01:00
Matthias	21f3160b71	Make retries configurable; align constants (#446 ) Using the same default values for the library and the binary now but tweaked the values a bit for slightly faster performance.	2022-01-07 01:03:10 +01:00
Matthias	388bbbe7b0	Exclude known false-positives from Github API check (#445 ) Fixes https://github.com/lycheeverse/lychee/issues/431	2022-01-06 00:33:53 +01:00
Matthias	dd48466d9a	Add missing test for local links in plaintext files (#444 )	2022-01-05 12:51:14 +01:00
Matthias	01393b34a2	Upgrade to Rust 2021 (#427 )	2021-12-17 01:32:13 +01:00
Matthias	83182c29ca	Fix JSON serialization (#426 ) We recently removed the custom serialization for InputSource. This causes the JSON formatter to fail with "key must be a string". Add it back and add a comment on why this is needed.	2021-12-16 23:55:04 +01:00
Matthias	166c86c30e	Use tokenizer for extraction; add benchmark (#424 ) This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master. Old: 4.557 s ± 0.404 s New: 3.832 s ± 0.131 s The performance fluctuates a little less as well. Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake. Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type. The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.	2021-12-16 18:45:52 +01:00
Matthias	c41ba64a69	Max concurrency moved to check (#419 ) Concurrency is defined by the channel size consuming from the request stream in `check`	2021-12-07 11:52:40 +01:00
Matthias	3d5135668b	Improve concurrency with streams (#330 ) * Move from vec to streams Previously we collected all inputs in one vector before checking the links, which is not ideal. Especially when reading many inputs (e.g. by using a glob pattern), this could cause issues like running out of file handles. By moving to streams we avoid that scenario. This is also the first step towards improving performance for many inputs. To stay as close as possible to the pre-stream behaviour, we want to stop processing as soon as an Err value appears in the stream. This is easiest when the stream is consumed in the main thread. Previously, the stream was consumed in a tokio task and the main thread waited for responses. Now, a tokio task waits for responses (and displays them/registers response stats) and the main thread sends links to the ClientPool. To ensure that the main thread waits for all responses to have arrived before finishing the ProgressBar and printing the stats, it waits for the show_results_task to finish. * Return collected links as Stream * Initialize ProgressBar without length because we can't know the amount of links without blocking * Handle stream results in main thread, not in task * Add basic directory support using jwalk * Add test for HTTP protocol file type (http://) * Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`. * Refactor main; fix tests * Move commands into separate submodule * Simplify input handling * Simplify collector * Remove unnecessary unwrap * Simplify main * cleanup check * clean up dump command * Handle requests in parallel * Fix formatting and lints Co-authored-by: Timo Freiberg <self@timofreiberg.com>	2021-12-01 18:25:11 +01:00
Matthias	d96c1269ff	Use thiserror for error handling (#399 ) This removes some boilerplate, is arguably better for maintainability than handwriting the error handling code, and avoids inconsistent functionality for the error variants. thiserror is also the de-facto standard for library error types as of today.	2021-11-20 01:42:50 +01:00
Matthias	b97fda34d0	Add support for different output formats (compact, detailed, markdown) (#375 )	2021-11-18 00:44:48 +01:00
Markus Unterwaditzer	d3ed133f10	Remove srcset attribute from list of "link" attrs (#393 ) * Remove srcset attribute from list of "link" attrs Fix #390 * Add test for srcset * Add note about srcSet links * add real support for srcset Co-authored-by: Matthias <matthias-endler@gmx.net>	2021-11-16 22:58:10 +01:00
Matthias	69e5d56687	Add more known false positive schema domains (#376 ) See https://github.com/lycheeverse/lychee-action/issues/53	2021-10-31 14:53:40 +01:00
dependabot[bot]	d3a72d3816	Bump deadpool from 0.7.0 to 0.9.1 (#371 ) * Bump deadpool from 0.7.0 to 0.9.1 Bumps [deadpool](https://github.com/bikeshedder/deadpool) from 0.7.0 to 0.9.1. - [Release notes](https://github.com/bikeshedder/deadpool/releases) - [Changelog](https://github.com/bikeshedder/deadpool/blob/master/CHANGELOG.md) - [Commits](https://github.com/bikeshedder/deadpool/compare/deadpool-v0.7.0...deadpool-v0.9.1) --- updated-dependencies: - dependency-name: deadpool dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Attempt fix for deadpool v0.8.0+ (#372) Signed-off-by: MichaIng <micha@dietpi.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: MichaIng <micha@dietpi.com>	2021-10-28 02:05:58 +02:00
Matthias	47426c6971	Fix typos, grammar	2021-10-28 02:05:35 +02:00
MichaIng	0870f0bc9e	Add http://www.w3.org/2000/svg to known false positives (#359 ) It has no forced HTTPS rewrite, but sets the HSTS header. Access otherwise works fine, so similar to http://www.w3.org/1999/xhtml it is basically to avoid lychee failures when --require-https was defined. Signed-off-by: MichaIng <micha@dietpi.com>	2021-10-11 00:40:27 +02:00
Jorge Luis Betancourt	174331d983	Extract base from the source URL if `--base` is empty (#358 ) When running lychee against a remote URL all relative links are ignored by default because `--base` is normally not set. A good default in this case is to automatically use the base domain from the source URL. Setting `--base` overrides the automatic source extraction from the source URL (same behaviour as we currently have).	2021-10-10 02:42:01 +02:00
Matthias	dd9e24b7f4	support uppercase filenames; add tests	2021-10-09 22:20:22 +02:00
Matthias	175342baf4	Merge branch 'master' of github.com:lycheeverse/lychee	2021-10-09 21:17:41 +02:00
Matthias	bdcd6f87bf	Make error message for broken file links more understandable	2021-10-09 21:17:37 +02:00
Matthias	56726f41fc	Add back connection pool (#355 )	2021-10-08 13:08:44 +02:00
MichaIng	961f12e58e	Remove cache from collector and remove custom reqwest client pool * Reqwest comes with its own request pool, so there's no need in adding another layer of indirection. This also gets rid of a lot of allocs. * Remove cache from collector * Improve error handling and documentation * Add back test for request caching in single file Signed-off-by: MichaIng <micha@dietpi.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2021-10-07 18:07:18 +02:00
Matthias	a7f809612d	Refactor extractor (#354 ) This avoids sending URLs back and forth between the different parsers. Also, it should allow for future optimizations to reduce allocs.	2021-10-07 12:51:02 +02:00
MichaIng	b648b5e914	Imply "localhost" when loopback IPs are excluded (#351 ) as "localhost" is usually mapped via "hosts" file to a loopback IP address. Resolves: https://github.com/lycheeverse/lychee/issues/319 Signed-off-by: MichaIng <micha@dietpi.com>	2021-10-06 11:33:23 +02:00
Matthias	251332efe2	Cache `absolute_path` to decrease allocations (#346 ) * Cache `absolute_path` to decrease allocations While profiling local file handling, I noticed that resolving paths was taking a significant amount of time. It also caused quite a few allocations. By caching the path and using a constant value for the current directory, we can reduce the number of allocs by quite a lot. For example, when testing on the sentry documentation, we do 50.4% fewer allocations in total now. That's just a single test-case of course, but it probably helps in many other cases as well. * Defer to_string for attr.value to reduce allocs * Use Tendrils instead of Strings for parsing (another ~1.5% fewer allocs) * Move option parsing code into separate module * Handle base dir more correctly * Temporarily disable dry run	2021-10-05 01:37:43 +02:00
Matthias	3b41c4c375	Silently ignore absolute paths without base (fixes #320 ) (#338 )	2021-09-20 11:13:30 +02:00
Matthias	21ea0fd033	Add support for tokio-console (#318 ) This allows troubleshooting and improving async Rust code. It is an optional feature that is still experimental (but can be quite helpful)	2021-09-12 18:10:23 +02:00
Matthias	de55fbd178	Add TODO for fixing URL encoding for paths	2021-09-09 19:31:49 +02:00
Matthias	d7436575eb	formatting	2021-09-09 14:43:40 +02:00
Matthias	2a4170eade	Add test for `+` encoding	2021-09-09 14:42:09 +02:00
Matthias	a1acf7b0d0	Reintegrate master	2021-09-09 01:49:25 +02:00
Matthias	93948d7367	Avoid double-encoding already encoded destination paths E.g. `web%20site` becomes `web site`. That's because Url::from_file_path will encode the full URL in the end. This behavior cannot be configured. See https://github.com/lycheeverse/lychee/pull/262#issuecomment-915245411	2021-09-09 01:44:10 +02:00
Matthias	24ea2482d3	Update docs	2021-09-08 01:08:59 +02:00
Matthias	f3fe46a4d6	Merge branch 'master' of github.com:lycheeverse/lychee into local-files	2021-09-08 00:35:41 +02:00
Matthias	ffab0343fc	Revert refactor for removing params and fragments The refactored version was not equivalent. It could not handle fragments containing a question mark. See `67268ed598 (r703400238)`	2021-09-08 00:29:30 +02:00
Matthias	1246fa564c	Don't exclude mail on `exclude-all-private` (#316 )	2021-09-08 00:21:00 +02:00
Matthias	67268ed598	Clean up params and fragment handling	2021-09-07 13:02:39 +02:00
Matthias	4827ecf6bd	Fix clippy warnings	2021-09-07 00:22:06 +02:00
Matthias	5d0b95271d	Remove anchor from file links	2021-09-07 00:20:09 +02:00
Matthias	b2ce61357f	Fix build errors; cleanup code	2021-09-06 23:46:31 +02:00
Paweł Romanowski	8fd34a7367	Add no check (dump links only) flag (#99 )	2021-09-06 16:10:48 +02:00
Matthias	00ddb6dfc8	Filter out directories with suffixes that look like extensions Directories can still have a suffix which looks like a file extension like `foo.html`. This can lead to unexpected behavior with glob patterns like `*/.html`. Therefore filter these out. https://github.com/lycheeverse/lychee/pull/262#issuecomment-91322681	2021-09-06 15:23:10 +02:00
Matthias	f47282093a	String allocation not needed	2021-09-06 15:23:10 +02:00
Matthias	f143087743	Relative path not needed	2021-09-06 15:23:10 +02:00
Matthias	b3c5d122e7	Fix clippy lints	2021-09-06 15:23:10 +02:00
Matthias	57af648ec9	fix tests after making base dir mandatory	2021-09-06 15:23:10 +02:00
Matthias	b7c129c431	Fix resolving absolute paths The previous solution didn't resolve to absolute paths and rather removed things like `.` and `..`.	2021-09-06 15:20:18 +02:00
Matthias	dd3205a87c	wip	2021-09-06 15:19:43 +02:00
Matthias	b06afb7252	fix test	2021-09-06 15:19:24 +02:00
Matthias	04bf838f98	lint	2021-09-06 15:19:24 +02:00
Matthias	4f9dc67bbd	fix test	2021-09-06 15:19:24 +02:00
Matthias	afdb721612	Fix lints	2021-09-06 15:19:24 +02:00
Matthias	1546d6ee38	Normalize path; fix tests	2021-09-06 15:19:09 +02:00
Matthias	a3fd85d923	Exclude anchor links	2021-09-06 15:19:09 +02:00
Matthias	daa5be4c3a	Add/change file link tests	2021-09-06 15:19:09 +02:00
Matthias	d924c25669	Non-existing directories are fine for URI base for files	2021-09-06 15:19:09 +02:00
Matthias	d51a49db46	Move uri to types	2021-09-06 15:19:09 +02:00
Matthias	887f1b9589	Split up file checking into file discovery and validation of path exists	2021-09-06 15:19:09 +02:00
Matthias	bfa3b1b6a1	Introduce Base type, which can be a path or URL	2021-09-06 15:15:40 +02:00
Matthias	f9bf52ef10	Add support for base_dir	2021-09-06 15:15:05 +02:00
Matthias Endler	d5bb7ee7d7	Or Patterns (Rust 1.53)	2021-09-06 15:15:05 +02:00
Matthias Endler	701fbc9ada	Add support for local files	2021-09-06 15:14:33 +02:00
Lucius Hu	80b8a856ac	Add new flag `--require-https` (#195 )	2021-09-04 03:21:54 +02:00
Matthias	59abd189cf	Fix remaining clippy lints	2021-09-03 16:29:57 +02:00
Matthias	fe399c0a8c	Simple URI cache (#243 )	2021-05-04 13:28:39 +02:00
Matthias	164e1aea7e	Add support for multiple schemes (#237 )	2021-04-26 18:24:54 +02:00
Matthias	f8426bafbf	Skip unsupported schemes (#236 )	2021-04-26 17:16:58 +02:00
Matthias	2a80760f58	Fix crates.io 404 with quirk (#235 )	2021-04-26 14:20:54 +02:00
Matthias	1865f7a309	Use thumbnail endpoint for YouTube links (#232 )	2021-04-23 01:23:15 +02:00
Matthias	1926c73b6b	Add missing docs (#231 ) This enables `#![deny(missing_docs)]` and adds all missing doc strings	2021-04-23 00:27:12 +02:00
Lucius Hu	f64213d58c	More refactor (#225 ) - Major changes in `lychee-lib::filter` module: - Fields in `Excludes` except the `RegexSet` are now moved to `Filter`. - `Filter` contains `Option<Excludes>` and `Option<Includes>`, which are wrapper structs of `RegexSet` instead of `Option<RegexSet>`. As a result the code now looks cleaner. - Factored out some filtering logic to dedicated functions. - It's possible to write tests for those functions in addition to tests for the `Filter` struct. - Added docs to `Filter::is_excluded` and reorganized the code. - Replaced `derive_builder` with `typed_builder`: - The internal interface is very ugly, as admitted by the author, but we no longer have nested `Option`s like before. - As a result, the `Client` building is much easier to read. - The main benefit of `typed_builder` is that the arguments fed to the builder are checked at compile time instead of at run time. - Fixed a bug in `lychee::tests::usage` and `lychee-lib::stats::test`. - Now it will clear environment variable which would otherwise cause an issue if `GITHUB_TOKEN` is set. - Updated dependencies. Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2021-04-16 20:25:22 +02:00
Lucius Hu	228e5df6a3	Major refactor of codebase (#208 ) - The binary component and library component are separated as two packages in the same workspace. - `lychee` is the binary component, in `lychee-bin/`. - `lychee-lib` is the library component, in `lychee-lib/`. - Users can now install only the `lychee-lib`, instead of both components, which requires fewer dependencies and compiles faster. - Dependencies for each component are adjusted and updated. E.g., no CLI dependencies for `lychee-lib`. - CLI tests are only moved to `lychee`, as it has nothing to do with the library component. - `Status::Error` is refactored to contain dedicated error enum, `ErrorKind`. - The motivation is to delay the formatting of errors to strings. Note that `e.to_string()` is not necessarily cheap (though trivial in many cases). The formatting is now delayed until the error needs to be displayed to users. So in some cases, if the error is never used, it means that it won't be formatted at all. - Replaced `regex` based matching with one of the following: - Simple string equality test in the case of 'false positive'. - URL parsing based test, in the case of extracting repository and user name for GitHub links. - Either case would be much more efficient than `regex` based matching. First, there's no need to construct a state machine for regex. Second, URL is already verified and parsed on its creation, and extracting its components is fairly cheap. Also, this removes the dependency on `lazy-static` in `lychee-lib`. - `types` module now has a sub-directory, and its components are now separated into their own modules (in that sub-directory). - `lychee-lib::test_utils` module is only compiled for tests. - `wiremock` is moved to `dev-dependency` as it's only needed for `test` modules. - Dependencies are listed in alphabetical order. - Imports are organized in the following fashion: - Imports from `std` - Imports from 3rd-party crates, and `lychee-lib`. - Imports from `crate::` or `super::`. - No glob import. - I followed suggestion from `cargo clippy`, with `clippy::all` and `clippy::pedantic`. Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>	2021-04-15 01:24:11 +02:00

142 commits