Commit graph

126 commits

Author SHA1 Message Date
dependabot[bot]
226546091b
Bump check-if-email-exists from 0.8.31 to 0.9.0 (#735)
* Bump check-if-email-exists from 0.8.31 to 0.9.0

Bumps [check-if-email-exists](https://github.com/reacherhq/check-if-email-exists) from 0.8.31 to 0.9.0.
- [Release notes](https://github.com/reacherhq/check-if-email-exists/releases)
- [Changelog](https://github.com/reacherhq/check-if-email-exists/blob/master/CHANGELOG.md)
- [Commits](https://github.com/reacherhq/check-if-email-exists/compare/v0.8.31...v0.9.0)

---
updated-dependencies:
- dependency-name: check-if-email-exists
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update usage

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2022-08-16 12:35:34 +02:00
Matthias
6a49cedc16
Check Twitter URLs using nitter.net (#731)
Use an alternative Twitter frontend, which works more reliably than
using Twitter directly.
2022-08-12 22:46:35 +02:00
Matthias
69f387c1bd
Markdown-status (#729)
* Fix typos

* Add status code description to markdown output
2022-08-11 22:08:05 +02:00
Walter Beller-Morales
6d40a2ab7b
Update to gracefully handle nonexistent relative paths (#691)
* Update Input::new to gracefully handle nonexistent relative paths
* Add test checking Input::new can handle real relative paths
* Add better pre-conditions to Input::new tests
* Add integration tests for handling relative paths in lychee-bin
* Update lychee-lib/src/types/input.rs
2022-07-22 17:15:55 +02:00
Matthias
6fae93f2da
Skip caching unsupported and excluded URLs (#692)
As discussed in https://github.com/lycheeverse/lychee/issues/647#issuecomment-1170773449, it does not make much sense to cache unsupported
and excluded URLs.
Unsupported URLs might be supported in the future and caching them
would mean they won't get checked then. Excluded URLs were
excluded for a reason and should not appear in the cache.
Furthermore they might not be excluded
in a consecutive run, leading to a false-positive.
2022-07-17 18:40:45 +02:00
Walter Beller-Morales
9ad53f97a2
Fix deserialize of lycheecache status codes (#685)
* Add custom deserializer for `CacheStatus` to properly classify status codes
* Add CLI integration tests to check .lycheecache behavior
* Add comment to explain conflict between cache and accept flags
2022-07-15 22:45:24 +02:00
Matthias
fb367ef43a
Add http://www.w3.org/1999/xlink to list of false positives (#664) 2022-07-01 12:11:57 +02:00
Matthias
487d88cefe
Add test for mailto address with query params (#655) 2022-06-29 10:19:17 +02:00
dependabot[bot]
231939af82
Bump html5gum from 0.4.0 to 0.5.1 (#658)
* Bump html5gum from 0.4.0 to 0.5.1

Bumps [html5gum](https://github.com/untitaker/html5gum) from 0.4.0 to 0.5.1.
- [Release notes](https://github.com/untitaker/html5gum/releases)
- [Commits](https://github.com/untitaker/html5gum/compare/0.4.0...0.5.1)

---
updated-dependencies:
- dependency-name: html5gum
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update html5gum

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2022-06-23 00:07:28 +02:00
Markus Unterwaditzer
f1ae22da09
Replace lazy hashset with matches! (#656)
* Replace lazy hashset with matches!

llvm will typically create much faster code than accessing a hashset at
runtime

source: trust me bro

* cargo fix

* cargo fmt

* shorten docstring
2022-06-18 19:00:07 +02:00
Matthias
84de43c554
Refactor request types (#637) 2022-06-03 20:13:07 +02:00
Matthias
9b4dfadffd
Fix parsing errors with config options (#632) 2022-05-31 19:43:46 +02:00
Matthias
f33b897d5d
Exclude example domains as per RFC 2606 from checking (#627)
Unfortunately it's not possible to automatically enable features
for `cargo test`. See https://github.com/rust-lang/cargo/issues/2911.

As a workaround to allow for using example domains for unit- and integration
tests,  we introduce a new feature, `check_example_domains`, which is
disabled by default for normal users. The feature gets activated for the
integration test which checks that the example domain exclusion works as
expected.
2022-05-29 21:42:00 +02:00
Matthias
22fecfc056
Add support for URI remapping (#620)
Remaps allow mapping from a URI pattern to a different URI.

The syntax is

```
lychee --remap 'https://example.com http://127.0.0.1'
```

Some use-cases are
- Testing URIs prior to production deployment
- Testing URIs behind a proxy

Be careful when using this feature because checking every link against a
large set of regular expressions has a performance impact. Also there are no
constraints on the URI mapping, so the rules might contradict with each
other.
Remap rules get applied in order of definition to every input URI.
2022-05-29 21:41:22 +02:00
Matthias
363b95fe5f
Add support for excluding paths from link checking (#623)
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
2022-05-29 17:27:09 +02:00
Matthias
571b49410c
Extend reqwest client settings (#617)
This sets a HTTP connect timeout (for stability)
and a TCP keepalive (for performance).

The connect timeout should help with flaky servers, which
would block the runtime and therefore other requests.

The keepalive helps when making many requests to the same
host. This is a very common pattern for checking internal documentation,
which is an important use-case of lychee.

The settings are currently not configurable by the user
and set to sane defaults. We might make this configurable in the future
if there is demand to do so.
2022-05-13 18:51:11 +02:00
Matthias
8c0a32d81d
Refactor response formatting (#599)
* Add support for raw formatter (no color)
* Introduce ResponseFormatter trait
* Pass the same params to every cli command
* Update dependencies
* Remove pretty_assertions dependency (latest version doesn't build)
2022-04-25 19:19:36 +02:00
Matthias
a607b853c9
Move to downstream optimization for short strings (#600)
Skipping to parse very short strings was merged into linkify
so our own workaround is unnecessary
https://github.com/robinst/linkify/pull/34
2022-04-25 19:18:50 +02:00
Matthias
da7bbf113d
Remove unnecessary Ok wrapper 2022-04-12 01:39:38 +02:00
Matthias
6ebc9fed4b
Reset nofollow in html5gum start tag (#584) 2022-04-06 00:49:00 +02:00
Matthias
debe958766
Add support for nofollow (#572) 2022-04-04 10:32:00 +02:00
Matthias
03d28820bb
Extract more status information from reqwest (#577)
Recently we cleaned up the commandline output to trim away redundant
information like the URL, which occured twice.
Unfortunately we also removed helpful information from reqwest, which
could support the user in troubleshooting unexpected errors.

This commit reverts that.
We now extract the meaningful information from reqwest, without being
too verbose. For that we have to depend on the string output for the
reqwest error, but it's better than hiding that information from the user.
It is fragile as it depends on the reqwest internals, but in the worst case
we simply return the full error text in case our parsing won't work.
2022-04-02 14:37:03 +02:00
Matthias
5ad7b14bdd
Regression: Ignore invalid URLs (#571)
With the refactoring the URL checking as a workaround for the upstream
reqwest panic on invalid URLs, we introduced a regression, which caused
unsupported URL schemes to show up as errors in the lychee output.

This commit changes the behavior such that invalid schemes get ignored
again by making a differentiation between truly invalid URIs which would make
reqwest panic, and ones which are valid but just not handled by reqwest.
The check was moved to `check_website` such that the invalid URIs would
not be checked three times in a loop before erroring out.
2022-03-27 23:22:46 +02:00
Matthias
36d3195c68
Cache verbosity issue (fixes #562) 2022-03-27 14:48:09 +02:00
Matthias
743d386252
Allow input URLs without scheme (fixes #567)
This requires `Input::new` to return a `Result`, because the URL
parsing could fail when prepending `http://`.

We use http instead of https, because curl does as well:
70ac27604a/lib/urlapi.c (L1104-L1124)
Missing files will be interpreted as URLs from the command line
and these can be invalid, but that's not seen as an error anymore.
2022-03-27 01:27:27 +01:00
Matthias
d616177a99
Implement excluding code blocks (#523)
This is done in the extractor to avoid unnecessary
allocations.
2022-03-26 10:42:56 +01:00
Matthias
77b1724881
Optimize plaintext extractor for small strings (#565)
Immediately return for very small strings which cannot be valid URIs.

The shortest valid URI without a scheme might be g.cn (Google China)
At least I am not aware of a shorter one. We set this as a lower threshold
for parsing URIs from plaintext to avoid false-positives and as a slight
performance optimization, which could add up for big files.
This threshold might be adjusted in the future.
2022-03-23 23:06:49 +01:00
Matthias
e1d112dbab
Remove missing_panic_doc (#561) 2022-03-22 21:02:56 +01:00
Matthias
45de5c763e
Avoid reqwest panic on invalid URIs (#557) 2022-03-22 13:15:11 +01:00
Matthias
ceb185e579
Add more comments to path methods (#543) 2022-03-08 13:50:54 +01:00
Matthias
8097bfa408
Print Github token error once at the end (#537)
Print original reqwest error for every Github link.
It contains more information about the underlying error.

Only print a message about the Github token at the
end if it's not set and there were Github errors.
2022-03-03 10:04:55 +01:00
Matthias
4c51fce22f
Fix broken pipe error on failing writes to stdout (#535)
Make sure that broken pipes (e.g. when a reader of a
pipe prematurely exits during execution) get handled gracefully.
This change also moves some error messages to stderr by using
eprintln.

More info: https://github.com/jez/as-tree/issues/15
2022-03-02 23:39:54 +01:00
Matthias
0fc5fc9ffe
Print errors with a different format for easier clickability (fixes #532) 2022-03-01 16:58:04 +01:00
Matthias
05bd3817ee
Make retry wait time configurable (#525) 2022-02-24 12:24:57 +01:00
Matthias
41b291037a
Response output overhaul (#524)
Clean up the response output.
Superfluous information was removed and the formatting was changed to make
the output more readable to humans.
2022-02-23 17:28:14 +01:00
Lucius Hu
70ebe45117
Improved IPv6 filtering support (#501)
This commit uses crate `ip_network` to determine whether an IPv6 address is
link-local or unique local.

Note that this extra dependencies can be removed once rust-lang/rust#27709 is
stabilized.

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2022-02-22 10:39:44 +01:00
Matthias
ba276cd51b
Error cleanup (#510)
* Add more fine-grained error types; remove generic IO error
* Update error message for missing file
* Remove missing `Error` suffix
* Rename ErrorKind::Github to ErrorKind::GithubRequest for consistency with NetworkRequest
2022-02-19 01:44:00 +01:00
Matthias
812663d832
Prevent flaky tests (#514)
Move from example.org to example.com, which seems to be more permissive for testing
2022-02-18 10:29:49 +01:00
Lucius Hu
6d56c6b55c
Replace plain String with SecretString for GitHub token (#509)
This commit changed the type of `lychee-lib::ClientBuilder::github_token` from
`String` to `secrecy::SecretString` to fortify the secret management within our
program.

Note that this won't affect TOML configuration of `lychee-bin` because
`serde::Deserialize` is still implemented for `SecretString`.
2022-02-13 13:53:46 +01:00
Matthias
47df7780fe
Use captured identifiers in format strings (#507)
Makes for arguably cleaner-looking code.
The downside is that the MSRV is 1.58
https://blog.rust-lang.org/2022/01/13/Rust-1.58.0.html

Given that nobody uses lychee as a library yet
and we have precompiled binaries, it's an acceptable
tradeoff.
My little research revealed that this is a much-liked
feature: https://twitter.com/matthiasendler/status/1483895557621960715
2022-02-12 10:51:52 +01:00
Lucius Hu
53c41b03d8
replace hubcaps by octocrab (#502)
This commit replaced `hubcaps` by `octocrab`, which has more downloads per month
and receives more frequent release updates.

The caveats are:

1. When instantiating the API client, `octocrab` doesn't offer you a way to
specify custom user-agent. But I would argue that, at least presently, this
doesn't seem to cause issues.
2. `octocrab` doesn't export as much details of its error types as `hubcaps`
does. So we will have fewer control on the display of the error message. But I
would also argue that this is not really important. Though we should do more
tests to make sure the error looks good enough.

* hide implementation details in error message

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2022-02-11 23:43:47 +01:00
Lucius Hu
476a048350
lychee-lib::client reworked (#500)
This commit mainly added or improved documentation for `lychee-lib::client`
module.

But it also contains a few API changes:

- `ClientBuilder::client()` now consumes itself instead of taking a reference.
  This helps to avoid a few unnecessary clones.
- `ClientBuilder::build_filter()` was a private function and is inlined to avoid
  unnecessary clones.
- Added a new crate-scoped function `Uri::set_scheme()`.

* added notes on deprecated site-local network

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2022-02-10 00:04:48 +01:00
Markus Unterwaditzer
68d09f7e5b
Add html5gum as alternative link extractor (#480)
html5gum is a HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate anything tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.
2022-02-07 22:54:47 +01:00
Matthias
6635863746
Add Alpine page for benchmark; refactor code (#481) 2022-01-27 23:42:06 +01:00
Matthias
97b06230fc
Add missing Github exclusions; sort entries (#473) 2022-01-21 23:54:59 +01:00
Matthias
5802ae912c
Fix bugs in extractor; reduce allocs (#464)
When URLs couldn't be extracted from a tag,
we ran a plaintext search, but never added the
newly found urls to the vec of extracted urls.

Also tried to make the code a little more idiomatic
2022-01-16 02:13:38 +01:00
Matthias
6e757fa20e
Add more information about mail errors (#463) 2022-01-14 22:22:53 +01:00
Matthias
994aadf6a1
Simplify error messages (#462)
Using pattern matching to make the hubcaps and reqwest error messages a little shorter and (subjectively) more readable.
2022-01-14 15:26:13 +01:00
Matthias
ac490f9c53
Add caching functionality (v2) (#443)
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:

 * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version.    Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
2022-01-14 15:25:51 +01:00
Matthias
1e76e82811 Add test for nonexistent Github file 2022-01-12 09:25:12 +01:00