lychee

mirror of https://github.com/Hopiu/lychee.git synced 2026-04-01 20:20:32 +00:00

Author	SHA1	Message	Date
Matthias Endler	55797071b0	Fix nested URL extraction in verbatim elements (#988 ) Skipping URLs in verbatim elements didn't take nested elements into consideration, which were not verbatim. For instance, the following HTML snippet would yield `https://example.com` in non-verbatim mode, even if it is nested inside a verbatim `<pre>` element: ```html <pre><a href="https://example.com">link</a></pre> ``` This commit fixes the behavior for both `html5gum` and `html5ever`. Note that nested verbatim elements of the same kind still are not handled correctly. For instance, the following HTML snippet would still yield `https://example.com`: ```html <pre> <pre></pre> <a href="https://example.com">link</a> </pre> ``` The reason is that we currently only keep track of a single verbatim element and not a stack of elements, which we would need to unwind and resolve the situation. Fixes https://github.com/lycheeverse/lychee/issues/986.	2023-03-11 15:18:25 +01:00
Matthias Endler	2255ad9286	Better retry handling (#981 ) Previously, lychee would blindly retry all requests, no matter if the request error was transient or fatal. Taking a lesson from https://github.com/TrueLayer/reqwest-middleware, we can be more granular about the error behavior. This PR adds their retry logic to lychee, reducing the number of unnecessary requests significantly. I also made some ergonomic changes to the client, which should not affect its behavior.	2023-03-10 22:36:45 +01:00
Matthias Endler	30e2a2b62b	Fix `--max-redirects` (#987 ) Having more than the max number of redirects caused lychee to abort the requests, but did not lead to an error. Related: https://github.com/lycheeverse/lychee-action/issues/164	2023-03-10 15:15:37 +01:00
Matthias	59ddc1e27d	Fix url input handling without scheme	2023-03-03 12:13:09 +01:00
Matthias Endler	7874195bbb	Customize verbosity (#956 )	2023-02-24 23:53:09 +01:00
Matthias Endler	b653a0a1ec	Fix cached 200 status code handling (#958 ) * Fix cached 200 status code handling Assert that code 200 never needs to be explicitly accepted for cached response to match the behavior of uncached checks * Bump version to v0.11.1	2023-02-23 00:25:53 +01:00
Matthias	5558531bab	Fix lint	2023-02-22 21:05:49 +01:00
Kian-Meng Ang	9fa1d732f7	Fix typos (#944 ) Found via `codespell -S fixtures -L crate,reacher,t`	2023-02-09 15:32:16 +01:00
dependabot[bot]	0a2cd324d5	Bump typed-builder from 0.11.0 to 0.12.0 (#934 ) * Bump typed-builder from 0.11.0 to 0.12.0 Bumps [typed-builder](https://github.com/idanarye/rust-typed-builder) from 0.11.0 to 0.12.0. - [Release notes](https://github.com/idanarye/rust-typed-builder/releases) - [Changelog](https://github.com/idanarye/rust-typed-builder/blob/master/CHANGELOG.md) - [Commits](https://github.com/idanarye/rust-typed-builder/commits) --- updated-dependencies: - dependency-name: typed-builder dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Remove custom builder method docs. We use the default again, which offers the same amount of information. * Add `make` target to show docs --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net> Co-authored-by: Matthias Endler <matthias@endler.dev>	2023-01-30 15:12:20 +01:00
Matthias Endler	9837699b79	Introduce new let...else syntax (#936 )	2023-01-30 14:25:30 +01:00
Lucius Hu	e2406089ad	chore!: improve client and remap modules (#913 ) `lychee_lib::client`: - Improved documentation. - Added a log message in `ClientBuilder::client()` when the provided user-agent overrides the one defined in the provided custom header. - Removed unnecessary error handling in `Client::check()` when setting the HTTPS scheme because all failure cases should already occur when checking this URL the first time. - Removed unnecessary error handling in `Client::remap()` since `lychee_lib::remap::Remaps::remap()` doesn't return a `Result` anymore. - Fixed potential integer overflow in `Client::check_website()` when the wait time between retries doubles, by using `std::time::Duration::saturating_mul` instead. - Renamed `invalid()` to `validate_url()`. `lychee_lib::remap`: - Improved documentation, in particular, clarified (in the comment) that it's URLs not URIs being remapped. - Changed `Remaps::remap()` so it takes `&mut Url` instead of `Uri` as its argument, and no longer returns a `Result`. - Using `Url` instead of `Uri` because it aligns with the concept of remapping locations rather than identifiers. - Mutating the URL directly instead of returning a new one because it's more straightforward. - There is no error handling because we don't convert from URL to URI anymore. Furthermore, this always succeeds in the first place so we never needed error handling. - Added implementation of `IntoIterator` for `&'a Remaps` and convenience method of `Remaps::iter`. (Their mutable or moving counterparts are deliberately avoided because we don't want library users to modify or consume the remapping rules after their instantiation.) `lychee_lib::error`: - Renamed `ErrorKind::InvalidUriRemap` to `InvalidUrlRemap` and improved its error message. Changes to other modules are minor and only serve to accompany the aforementioned changes.	2023-01-16 19:14:09 +01:00
Matthias Endler	b620fc99f7	Properly handle youtu.be shortlinks (#908 ) Previously those were not correctly rewritten to thumbnail URLs. This should be fixed now by splitting up the logic for normal YouTube links and shortlinks. Fixes #906	2023-01-06 18:25:09 +01:00
Matthias Endler	4a3bfb99fb	Remove address from verbatim elements (#901 )	2023-01-05 14:55:53 +01:00
Matthias Endler	5654b7c317	Harden URL detection and extend verbatim elements (#899 ) Previously remote URLs were incorrectly detected because the string representation of a path is different than the path itself, causing the `http` prefix match to be insufficient. This resulted in unexpected side-effects, such as the incorrect detection of verbatim mode for remote URLs. The check has been improved and unit tests were added to avoid future breakage. On top of that, missing verbatim elements were added.	2023-01-04 00:38:19 +01:00
Matthias Endler	da46734c54	Extend response stats in verbose mode (#882 )	2022-12-20 10:43:01 +01:00
Matthias Endler	6df1c378ec	Fix Rust 1.66 clippy lints (#879 )	2022-12-19 14:28:10 +01:00
Matthias	7d435f2155	Add more markdown extensions (#866 )	2022-12-12 18:26:42 +01:00
Matthias	ef391cea50	Recursively skip verbatim elements (#847 )	2022-12-12 01:06:45 +01:00
Matthias	9eeea250cd	Exclude <script> tags by default (#848 ) This is a naive approach to exclude script tags from getting checked. The reason is that the tag leads to a lot of false-positives (e.g. `//unpkg.com/docsify-edit-on-github@1` within a script block gets detected as an e-mail address). A more thorough approach would be the use of a tree-builder in html5gum and html5ever, but this could have a negative performance impact. I also did not want to add a new flag (e.g. `--include-scripts`) for this setting because the current set of flags around exclusion/inclusion is already quite long. Fixes #821.	2022-11-29 00:38:43 +01:00
Matthias	982d978e47	Add different verbosity levels (#824 ) More granular verbosity levels have been asked for repeatedly. To enable that we're moving to [env_logger] and [clap-verbosity-flag] to provide more flexible verbosity settings. Also tackles #661, #709 Lays the groundwork for tackling #268 https://github.com/rust-cli/env_logger https://github.com/clap-rs/clap-verbosity-flag	2022-11-28 23:25:33 +01:00
Matthias	b479a5810e	Allow overriding accepted status codes for cached URIs (#843 ) Fixes #840	2022-11-28 12:23:07 +01:00
Matthias	765f7adb12	Don't check example mail addresses by default (#815 ) This was an oversight so far that became apparent after our recent fix for email addresses with query params (e.g. `test@example.com?subject=test`). The parsing of email addresses has improved and so we detect more mail addresses, but we didn't check if they belonged to an example domain, causing false-positive checks.	2022-11-08 23:46:32 +01:00
Matthias	d61105edbb	Fix parsing error of email addresses with query params (#809 ) Email addresses with query parameters often get used in contact forms on websites. They can also be found in other documents like Markdown. A common use-case is to add a subject line to the email as a parameter e.g. `mailto:mail@example.com?subject="Hello"`. Previously we handled such cases incorrectly by recognizing them as files. The reason was that our email parsing was too strict to allow for that use-case. With `email_address` we switched to a more permissive parser. Note that this does not affect the actual email address checking, as this is still done by `check-if-email-exists`, which has stricter check functionality.	2022-11-05 23:40:33 +01:00
Matthias	94dda21326	Fix clippy lints	2022-09-27 18:17:37 +02:00
dependabot[bot]	226546091b	Bump check-if-email-exists from 0.8.31 to 0.9.0 (#735 ) * Bump check-if-email-exists from 0.8.31 to 0.9.0 Bumps [check-if-email-exists](https://github.com/reacherhq/check-if-email-exists) from 0.8.31 to 0.9.0. - [Release notes](https://github.com/reacherhq/check-if-email-exists/releases) - [Changelog](https://github.com/reacherhq/check-if-email-exists/blob/master/CHANGELOG.md) - [Commits](https://github.com/reacherhq/check-if-email-exists/compare/v0.8.31...v0.9.0) --- updated-dependencies: - dependency-name: check-if-email-exists dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update usage Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2022-08-16 12:35:34 +02:00
Matthias	6a49cedc16	Check Twitter URLs using nitter.net (#731 ) Use an alternative Twitter frontend, which works more reliably than using Twitter directly.	2022-08-12 22:46:35 +02:00
Matthias	69f387c1bd	Markdown-status (#729 ) * Fix typos * Add status code description to markdown output	2022-08-11 22:08:05 +02:00
Walter Beller-Morales	6d40a2ab7b	Update to gracefully handle nonexistent relative paths (#691 ) * Update Input::new to gracefully handle nonexistent relative paths * Add test checking Input::new can handle real relative paths * Add better pre-conditions to Input::new tests * Add integration tests for handling relative paths in lychee-bin * Update lychee-lib/src/types/input.rs	2022-07-22 17:15:55 +02:00
Matthias	6fae93f2da	Skip caching unsupported and excluded URLs (#692 ) As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported and excluded URLs. Unsupported URLs might be supported in the future and caching them would mean they won't get checked then. Excluded URLs were excluded for a reason and should not appear in the cache. Furthermore they might not be excluded in a consecutive run, leading to a false-positive.	2022-07-17 18:40:45 +02:00
Walter Beller-Morales	9ad53f97a2	Fix deserialize of lycheecache status codes (#685 ) * Add custom deserializer for `CacheStatus` to properly classify status codes * Add CLI integration tests to check .lycheecache behavior * Add comment to explain conflict between cache and accept flags	2022-07-15 22:45:24 +02:00
Matthias	fb367ef43a	Add http://www.w3.org/1999/xlink to list of false positives (#664 )	2022-07-01 12:11:57 +02:00
Matthias	487d88cefe	Add test for mailto address with query params (#655 )	2022-06-29 10:19:17 +02:00
dependabot[bot]	231939af82	Bump html5gum from 0.4.0 to 0.5.1 (#658 ) * Bump html5gum from 0.4.0 to 0.5.1 Bumps [html5gum](https://github.com/untitaker/html5gum) from 0.4.0 to 0.5.1. - [Release notes](https://github.com/untitaker/html5gum/releases) - [Commits](https://github.com/untitaker/html5gum/compare/0.4.0...0.5.1) --- updated-dependencies: - dependency-name: html5gum dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update html5gum Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2022-06-23 00:07:28 +02:00
Markus Unterwaditzer	f1ae22da09	Replace lazy hashset with matches! (#656 ) * Replace lazy hashset with matches! llvm will typically create much faster code than accessing a hashset at runtime source: trust me bro * cargo fix * cargo fmt * shorten docstring	2022-06-18 19:00:07 +02:00
Matthias	84de43c554	Refactor request types (#637 )	2022-06-03 20:13:07 +02:00
Matthias	9b4dfadffd	Fix parsing errors with config options (#632 )	2022-05-31 19:43:46 +02:00
Matthias	f33b897d5d	Exclude example domains as per RFC 2606 from checking (#627 ) Unfortunately it's not possible to automatically enable features for `cargo test`. See https://github.com/rust-lang/cargo/issues/2911. As a workaround to allow for using example domains for unit- and integration tests, we introduce a new feature, `check_example_domains`, which is disabled by default for normal users. The feature gets activated for the integration test which checks that the example domain exclusion works as expected.	2022-05-29 21:42:00 +02:00
Matthias	22fecfc056	Add support for URI remapping (#620 ) Remaps allow mapping from a URI pattern to a different URI. The syntax is ``` lychee --remap 'https://example.com http://127.0.0.1' ``` Some use-cases are - Testing URIs prior to production deployment - Testing URIs behind a proxy Be careful when using this feature because checking every link against a large set of regular expressions has a performance impact. Also there are no constraints on the URI mapping, so the rules might contradict with each other. Remap rules get applied in order of definition to every input URI.	2022-05-29 21:41:22 +02:00
Matthias	363b95fe5f	Add support for excluding paths from link checking (#623 ) This change deprecates `--exclude-file` as it was ambiguous. Instead, `--exclude-path` was introduced to support excluding paths to files and directories that should not be checked. Furthermore, `.lycheeignore` is now the only way to exclude URL patterns.	2022-05-29 17:27:09 +02:00
Matthias	571b49410c	Extend reqwest client settings (#617 ) This sets a HTTP connect timeout (for stability) and a TCP keepalive (for performance). The connect timeout should help with flaky servers, which would block the runtime and therefore other requests. The keepalive helps when making many requests to the same host. This is a very common pattern for checking internal documentation, which is an important use-case of lychee. The settings are currently not configurable by the user and set to sane defaults. We might make this configurable in the future if there is demand to do so.	2022-05-13 18:51:11 +02:00
Matthias	8c0a32d81d	Refactor response formatting (#599 ) * Add support for raw formatter (no color) * Introduce ResponseFormatter trait * Pass the same params to every cli command * Update dependencies * Remove pretty_assertions dependency (latest version doesn't build)	2022-04-25 19:19:36 +02:00
Matthias	a607b853c9	Move to downstream optimization for short strings (#600 ) Skipping the parsing of very short strings was merged into linkify, so our own workaround is unnecessary: https://github.com/robinst/linkify/pull/34	2022-04-25 19:18:50 +02:00
Matthias	da7bbf113d	Remove unnecessary Ok wrapper	2022-04-12 01:39:38 +02:00
Matthias	6ebc9fed4b	Reset nofollow in html5gum start tag (#584 )	2022-04-06 00:49:00 +02:00
Matthias	debe958766	Add support for nofollow (#572 )	2022-04-04 10:32:00 +02:00
Matthias	03d28820bb	Extract more status information from reqwest (#577 ) Recently we cleaned up the commandline output to trim away redundant information like the URL, which occurred twice. Unfortunately we also removed helpful information from reqwest, which could support the user in troubleshooting unexpected errors. This commit reverts that. We now extract the meaningful information from reqwest, without being too verbose. For that we have to depend on the string output for the reqwest error, but it's better than hiding that information from the user. It is fragile as it depends on the reqwest internals, but in the worst case we simply return the full error text in case our parsing won't work.	2022-04-02 14:37:03 +02:00
Matthias	5ad7b14bdd	Regression: Ignore invalid URLs (#571 ) When refactoring the URL checking as a workaround for the upstream reqwest panic on invalid URLs, we introduced a regression, which caused unsupported URL schemes to show up as errors in the lychee output. This commit changes the behavior such that invalid schemes get ignored again by differentiating between truly invalid URIs, which would make reqwest panic, and ones which are valid but just not handled by reqwest. The check was moved to `check_website` so that invalid URIs are not checked three times in a loop before erroring out.	2022-03-27 23:22:46 +02:00
Matthias	36d3195c68	Cache verbosity issue (fixes #562 )	2022-03-27 14:48:09 +02:00
Matthias	743d386252	Allow input URLs without scheme (fixes #567 ) This requires `Input::new` to return a `Result`, because the URL parsing could fail when prepending `http://`. We use http instead of https, because curl does as well: `70ac27604a/lib/urlapi.c (L1104-L1124)` Missing files will be interpreted as URLs from the command line and these can be invalid, but that's not seen as an error anymore.	2022-03-27 01:27:27 +01:00
Matthias	d616177a99	Implement excluding code blocks (#523 ) This is done in the extractor to avoid unnecessary allocations.	2022-03-26 10:42:56 +01:00
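
Commit 55797071b0 notes that nested verbatim elements of the same kind are still mishandled because only a single verbatim element is tracked rather than a stack. The following is a minimal sketch of what stack-based tracking could look like, not lychee's actual extractor: the `Token` enum, `is_verbatim`, and `extract_links` are hypothetical names invented for illustration, and the token stream stands in for whatever `html5gum` or `html5ever` would emit.

```rust
/// Hypothetical, simplified token stream (a stand-in for real tokenizer output).
enum Token<'a> {
    Start(&'a str),
    End(&'a str),
    Link(&'a str),
}

/// Assumed verbatim element list for this sketch; the real list is longer.
fn is_verbatim(tag: &str) -> bool {
    matches!(tag, "pre" | "code")
}

/// Collect only the links that are not nested inside any verbatim element.
fn extract_links<'a>(tokens: &[Token<'a>]) -> Vec<&'a str> {
    // Every currently open verbatim element, innermost last.
    let mut stack: Vec<&str> = Vec::new();
    let mut links = Vec::new();
    for token in tokens {
        match token {
            Token::Start(tag) if is_verbatim(tag) => stack.push(*tag),
            Token::End(tag) if is_verbatim(tag) => {
                // Unwind to the innermost open element with the same name,
                // dropping any unclosed elements nested inside it.
                if let Some(pos) = stack.iter().rposition(|open| open == tag) {
                    stack.truncate(pos);
                }
            }
            // A link counts only when no verbatim element is open.
            Token::Link(url) if stack.is_empty() => links.push(*url),
            _ => {}
        }
    }
    links
}
```

With this approach, the inner `</pre>` in `<pre> <pre></pre> <a href="https://example.com">link</a> </pre>` only pops the inner element, so the outer `<pre>` is still open when the link is seen and the link is skipped.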


150 commits
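
Commit e2406089ad mentions replacing the doubling of the retry wait time with `std::time::Duration::saturating_mul` to avoid a potential overflow. The sketch below illustrates the idea under stated assumptions: `backoff_schedule` and its parameters are hypothetical and are not taken from lychee's `Client::check_website()`; only `Duration::saturating_mul` is the real standard-library call.

```rust
use std::time::Duration;

/// Hypothetical helper: the wait time before each retry, doubling every attempt.
fn backoff_schedule(initial: Duration, retries: u32) -> Vec<Duration> {
    let mut wait = initial;
    let mut schedule = Vec::with_capacity(retries as usize);
    for _ in 0..retries {
        schedule.push(wait);
        // `wait * 2` panics if the result overflows `Duration`;
        // `saturating_mul` clamps to `Duration::MAX` instead.
        wait = wait.saturating_mul(2);
    }
    schedule
}
```

For example, `backoff_schedule(Duration::from_secs(1), 3)` yields waits of 1, 2, and 4 seconds, while a schedule that starts at `Duration::MAX` simply stays at `Duration::MAX` rather than panicking.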