Commit graph

90 commits

Author SHA1 Message Date
Matthias
ba276cd51b
Error cleanup (#510)
* Add more fine-grained error types; remove generic IO error
* Update error message for missing file
* Remove missing `Error` suffix
* Rename ErrorKind::Github to ErrorKind::GithubRequest for consistency with NetworkRequest
2022-02-19 01:44:00 +01:00
Matthias
812663d832
Prevent flaky tests (#514)
Move from example.org to example.com, which seems to be more permissive for testing
2022-02-18 10:29:49 +01:00
Lucius Hu
6d56c6b55c
Replace plain String with SecretString for GitHub token (#509)
This commit changed the type of `lychee-lib::ClientBuilder::github_token` from
`String` to `secrecy::SecretString` to fortify the secret management within our
program.

Note that this won't affect TOML configuration of `lychee-bin` because
`serde::Deserialize` is still implemented for `SecretString`.
2022-02-13 13:53:46 +01:00
Matthias
47df7780fe
Use captured identifiers in format strings (#507)
Makes for arguably cleaner-looking code.
The downside is that the MSRV is 1.58
https://blog.rust-lang.org/2022/01/13/Rust-1.58.0.html

Given that nobody uses lychee as a library yet
and we have precompiled binaries, it's an acceptable
tradeoff.
My little research revealed that this is a much-liked
feature: https://twitter.com/matthiasendler/status/1483895557621960715
2022-02-12 10:51:52 +01:00
Lucius Hu
53c41b03d8
replace hubcaps by octocrab (#502)
This commit replaced `hubcaps` by `octocrab`, which has more downloads per month
and receives more frequent release updates.

The caveats are:

1. When instantiating the API client, `octocrab` doesn't offer you a way to
specify custom user-agent. But I would argue that, at least presently, this
doesn't seem to cause issues.
2. `octocrab` doesn't export as much details of its error types as `hubcaps`
does. So we will have fewer control on the display of the error message. But I
would also argue that this is not really important. Though we should do more
tests to make sure the error looks good enough.

* hide implementation details in error message

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2022-02-11 23:43:47 +01:00
Lucius Hu
476a048350
lychee-lib::client reworked (#500)
This commit mainly added or improved documentation for `lychee-lib::client`
module.

But it also contains a few API changes:

- `ClientBuilder::client()` now consumes itself instead of taking a reference.
  This helps to avoid a few unnecessary clones.
- `ClientBuilder::build_filter()` was a private function and is inlined to avoid
  unnecessary clones.
- Added a new crate-scoped function `Uri::set_scheme()`.

* added notes on deprecated site-local network

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2022-02-10 00:04:48 +01:00
Markus Unterwaditzer
68d09f7e5b
Add html5gum as alternative link extractor (#480)
html5gum is a HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate anything tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.
2022-02-07 22:54:47 +01:00
Matthias
6635863746
Add Alpine page for benchmark; refactor code (#481) 2022-01-27 23:42:06 +01:00
Matthias
97b06230fc
Add missing Github exclusions; sort entries (#473) 2022-01-21 23:54:59 +01:00
Matthias
5802ae912c
Fix bugs in extractor; reduce allocs (#464)
When URLs couldn't be extracted from a tag,
we ran a plaintext search, but never added the
newly found urls to the vec of extracted urls.

Also tried to make the code a little more idiomatic
2022-01-16 02:13:38 +01:00
Matthias
6e757fa20e
Add more information about mail errors (#463) 2022-01-14 22:22:53 +01:00
Matthias
994aadf6a1
Simplify error messages (#462)
Using pattern matching to make the hubcaps and reqwest error messages a little shorter and (subjectively) more readable.
2022-01-14 15:26:13 +01:00
Matthias
ac490f9c53
Add caching functionality (v2) (#443)
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:

 * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version.    Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
2022-01-14 15:25:51 +01:00
Matthias
1e76e82811 Add test for nonexistent Github file 2022-01-12 09:25:12 +01:00
Matthias
48c8153e11 Refactor Github checking; add docs 2022-01-12 09:25:12 +01:00
Matthias
50d7b05736 Conditionally compile constructors for GithubUri for tests 2022-01-12 09:25:12 +01:00
Matthias
8d445a3a4b Be more permissive around private GH repos
The Github API doesn't handle checking individual files inside repos or
paths like `github.com/org/repo/issues`, so we are more
permissive and only check for repo existence. This is the
only way to get a basic check for private repos. Public repos are not affected and should work
with a normal check.
2022-01-12 09:25:12 +01:00
Matthias
e91c0c60f0 Only accept two path segments (org/repo) for Github API check 2022-01-12 09:25:12 +01:00
Matthias
7667842bb6 Strip .git suffix from Github URLs (#384) 2022-01-12 09:25:12 +01:00
Matthias
21f3160b71
Make retries configurable; align constants (#446)
Using the same default values for the library and the
binary now but tweaked the values a bit for slightly faster performance.
2022-01-07 01:03:10 +01:00
Matthias
388bbbe7b0
Exclude known false-positives from Github API check (#445)
Fixes https://github.com/lycheeverse/lychee/issues/431
2022-01-06 00:33:53 +01:00
Matthias
dd48466d9a
Add missing test for local links in plaintext files (#444) 2022-01-05 12:51:14 +01:00
Matthias
01393b34a2
Upgrade to Rust 2021 (#427) 2021-12-17 01:32:13 +01:00
Matthias
83182c29ca
Fix JSON serialization (#426)
We recently removed the custom serialization for InputSource.
This causes the JSON formatter to fail
with "key must be a string".
Add it back and add a comment on
why this is needed.
2021-12-16 23:55:04 +01:00
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00
Matthias
c41ba64a69
Max concurrency moved to check (#419)
Concurrency is defined by the channel size consuming
from the request stream in  `check`
2021-12-07 11:52:40 +01:00
Matthias
3d5135668b
Improve concurrency with streams (#330)
* Move to from vec to streams

Previously we collected all inputs in one vector
before checking the links, which is not ideal.
Especially when reading many inputs (e.g. by using a glob pattern),
this could cause issues like running out of file handles.

By moving to streams we avoid that scenario. This is also the first
step towards improving performance for many inputs.

To stay as close to the pre-stream behaviour, we want to stop processing
as soon as an Err value appears in the stream. This is easiest when the
stream is consumed in the main thread.
Previously, the stream was consumed in a tokio task and the main thread
waited for responses.
Now, a tokio task waits for responses (and displays them/registers
response stats) and the main thread sends links to the ClientPool.
To ensure that the main thread waits for all responses to have arrived
before finishing the ProgressBar and printing the stats, it waits for
the show_results_task to finish.


* Return collected links as Stream
* Initialize ProgressBar without length because we can't know the amount of links without blocking
* Handle stream results in main thread, not in task
* Add basic directory support using jwalk
* Add test for HTTP protocol file type (http://)
* Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`.
* Refactor main; fix tests
* Move commands into separate submodule
* Simplify input handling
* Simplify collector
* Remove unnecessary unwrap
* Simplify main
* cleanup check
* clean up dump command
* Handle requests in parallel 
* Fix formatting and lints

Co-authored-by: Timo Freiberg <self@timofreiberg.com>
2021-12-01 18:25:11 +01:00
Matthias
d96c1269ff
Use thiserror for error handling (#399)
This removes some boilerplate and is arguably better
than handwriting the error handling code for
maintainability and avoid inconsitent functionality
for the error variants.
thiserror is also the de-facto standard for library
error types as of today.
2021-11-20 01:42:50 +01:00
Matthias
b97fda34d0
Add support for different output formats (compact, detailed, markdown) (#375) 2021-11-18 00:44:48 +01:00
Markus Unterwaditzer
d3ed133f10
Remove srcset attribute from list of "link" attrs (#393)
* Remove srcset attribute from list of "link" attrs

Fix #390

* Add test for srcset

* Add note about srcSet links

* add real support for srcset

Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-11-16 22:58:10 +01:00
Matthias
69e5d56687
Add more known false positive schema domains (#376)
See https://github.com/lycheeverse/lychee-action/issues/53
2021-10-31 14:53:40 +01:00
dependabot[bot]
d3a72d3816
Bump deadpool from 0.7.0 to 0.9.1 (#371)
* Bump deadpool from 0.7.0 to 0.9.1

Bumps [deadpool](https://github.com/bikeshedder/deadpool) from 0.7.0 to 0.9.1.
- [Release notes](https://github.com/bikeshedder/deadpool/releases)
- [Changelog](https://github.com/bikeshedder/deadpool/blob/master/CHANGELOG.md)
- [Commits](https://github.com/bikeshedder/deadpool/compare/deadpool-v0.7.0...deadpool-v0.9.1)

---
updated-dependencies:
- dependency-name: deadpool
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Attempt fix for deadpool v0.8.0+ (#372)

Signed-off-by: MichaIng <micha@dietpi.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: MichaIng <micha@dietpi.com>
2021-10-28 02:05:58 +02:00
Matthias
47426c6971
Fix typos, grammar 2021-10-28 02:05:35 +02:00
MichaIng
0870f0bc9e
Add http://www.w3.org/2000/svg to known false positives (#359)
It has no forced HTTPS rewrite, but sets the HSTS header. Access otherwise works fine, so similar to http://www.w3.org/1999/xhtml it is basically to avoid lychee failures when --require-https was defined.

Signed-off-by: MichaIng <micha@dietpi.com>
2021-10-11 00:40:27 +02:00
Jorge Luis Betancourt
174331d983
Extract base from the source URL if --base is empty (#358)
When running lychee against a remote URL all relative links are ignored
by default because `--base` is normally not set. A good default in this
case is to automatically use the base domain from the source URL.
Setting `--base` overrides the automatic source extraction from the
source URL (same behaviour as we currently have).
2021-10-10 02:42:01 +02:00
Matthias
dd9e24b7f4 support uppercase filenames; add tests 2021-10-09 22:20:22 +02:00
Matthias
175342baf4 Merge branch 'master' of github.com:lycheeverse/lychee 2021-10-09 21:17:41 +02:00
Matthias
bdcd6f87bf Make error message for broken file links more understandable 2021-10-09 21:17:37 +02:00
Matthias
56726f41fc
Add back connection pool (#355) 2021-10-08 13:08:44 +02:00
MichaIng
961f12e58e
Remove cache from collector and remove custom reqwest client pool
* Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file

Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-10-07 18:07:18 +02:00
Matthias
a7f809612d
Refactor extractor (#354)
This avoids sending URLs back and forth between the different parsers.
Also, it should allow for future optimizations to reduce allocs.
2021-10-07 12:51:02 +02:00
MichaIng
b648b5e914
Imply "localhost" when loopback IPs are excluded (#351)
as "localhost" is usually mapped via "hosts" file to a loopback IP address.

Resolves: https://github.com/lycheeverse/lychee/issues/319

Signed-off-by: MichaIng <micha@dietpi.com>
2021-10-06 11:33:23 +02:00
Matthias
251332efe2
Cache absolute_path to decrease allocations (#346)
* Cache `absolute_path` to decrease allocations

While profiling local file handling, I noticed that resolving paths was taking a
significant amount of time. It also caused quite a few allocations.
By caching the path and using a constant value for the current
directory, we can reduce the number of allocs by quite a lot.
For example, when testing on the sentry documentation, we do 50,4%
less allocations in total now. That's just a single test-case of course,
but it's probably also helping in many other cases as well.

* Defer to_string for attr.value to reduce allocs
* Use Tendrils instead of Strings for parsing (another ~1.5% less allocs)
* Move option parsing code into separate module
* Handle base dir more correctly
* Temporarily disable dry run
2021-10-05 01:37:43 +02:00
Matthias
3b41c4c375
Silently ignore absolute paths without base (fixes #320) (#338) 2021-09-20 11:13:30 +02:00
Matthias
21ea0fd033
Add support for tokio-console (#318)
This allows troubleshooting and improving async Rust code.
It is an optional feature that is still
experimental (but can be quite helpful)
2021-09-12 18:10:23 +02:00
Matthias
de55fbd178 Add TODO for fixing URL encoding for paths 2021-09-09 19:31:49 +02:00
Matthias
d7436575eb formatting 2021-09-09 14:43:40 +02:00
Matthias
2a4170eade Add test for + encoding 2021-09-09 14:42:09 +02:00
Matthias
a1acf7b0d0 Reintegrate master 2021-09-09 01:49:25 +02:00
Matthias
93948d7367 Avoid double-encoding already encoded destination paths
E.g. `web%20site` becomes `web site`.
That's because Url::from_file_path will encode the full URL in the end.
This behavior cannot be configured.
See https://github.com/lycheeverse/lychee/pull/262#issuecomment-915245411
2021-09-09 01:44:10 +02:00