Commit graph

35 commits

Author SHA1 Message Date
Matthias
01393b34a2
Upgrade to Rust 2021 (#427) 2021-12-17 01:32:13 +01:00
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than the master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00
Matthias
c41ba64a69
Max concurrency moved to check (#419)
Concurrency is defined by the channel size consuming
from the request stream in  `check`
2021-12-07 11:52:40 +01:00
Matthias
3d5135668b
Improve concurrency with streams (#330)
* Move to from vec to streams

Previously we collected all inputs in one vector
before checking the links, which is not ideal.
Especially when reading many inputs (e.g. by using a glob pattern),
this could cause issues like running out of file handles.

By moving to streams we avoid that scenario. This is also the first
step towards improving performance for many inputs.

To stay as close to the pre-stream behaviour, we want to stop processing
as soon as an Err value appears in the stream. This is easiest when the
stream is consumed in the main thread.
Previously, the stream was consumed in a tokio task and the main thread
waited for responses.
Now, a tokio task waits for responses (and displays them/registers
response stats) and the main thread sends links to the ClientPool.
To ensure that the main thread waits for all responses to have arrived
before finishing the ProgressBar and printing the stats, it waits for
the show_results_task to finish.


* Return collected links as Stream
* Initialize ProgressBar without length because we can't know the amount of links without blocking
* Handle stream results in main thread, not in task
* Add basic directory support using jwalk
* Add test for HTTP protocol file type (http://)
* Remove deadpool (once again): Replaced with `futures::StreamExt::for_each_concurrent`.
* Refactor main; fix tests
* Move commands into separate submodule
* Simplify input handling
* Simplify collector
* Remove unnecessary unwrap
* Simplify main
* cleanup check
* clean up dump command
* Handle requests in parallel 
* Fix formatting and lints

Co-authored-by: Timo Freiberg <self@timofreiberg.com>
2021-12-01 18:25:11 +01:00
Matthias
591cbdbebb
Add support for .lycheeignore file #308 (#402)
This is similar to files like .gitignore and .dockerignore
and gets merged into exclude_files
2021-11-23 01:39:53 +01:00
Matthias
1eb4453957
Only print source in verbose mode (#400)
This way the normal link output can be fed into
another tool without data mangling.
2021-11-21 17:22:04 +01:00
Matthias
4008c2ce38 Add missing newline 2021-11-18 00:46:20 +01:00
Matthias
b97fda34d0
Add support for different output formats (compact, detailed, markdown) (#375) 2021-11-18 00:44:48 +01:00
Derek Croote
e8bab82d76
Fix clippy lint (#383) 2021-11-05 10:22:51 +01:00
Matthias
56726f41fc
Add back connection pool (#355) 2021-10-08 13:08:44 +02:00
MichaIng
961f12e58e
Remove cache from collector and remove custom reqwest client pool
* Reqwest comes with its own request pool, so there's no need in adding
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file

Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-10-07 18:07:18 +02:00
MichaIng
b648b5e914
Imply "localhost" when loopback IPs are excluded (#351)
as "localhost" is usually mapped via "hosts" file to a loopback IP address.

Resolves: https://github.com/lycheeverse/lychee/issues/319

Signed-off-by: MichaIng <micha@dietpi.com>
2021-10-06 11:33:23 +02:00
Matthias
251332efe2
Cache absolute_path to decrease allocations (#346)
* Cache `absolute_path` to decrease allocations

While profiling local file handling, I noticed that resolving paths was taking a
significant amount of time. It also caused quite a few allocations.
By caching the path and using a constant value for the current
directory, we can reduce the number of allocs by quite a lot.
For example, when testing on the sentry documentation, we do 50,4%
less allocations in total now. That's just a single test-case of course,
but it's probably also helping in many other cases as well.

* Defer to_string for attr.value to reduce allocs
* Use Tendrils instead of Strings for parsing (another ~1.5% less allocs)
* Move option parsing code into separate module
* Handle base dir more correctly
* Temporarily disable dry run
2021-10-05 01:37:43 +02:00
Matthias
f2d7abbc29
Fix broken pipe when dumping links (#339)
When piping the output of lychee's `--dump` output to another program,
we can run into issues with broken pipes as described in
https://github.com/rust-lang/rust/issues/46016
and https://gabebw.com/blog/2019/10/13/learning-rust-by-candlelight
To avoid this, we use the underlying writeln macro and check the
returned `ErrorKind`.
2021-09-20 12:12:35 +02:00
Matthias
712bdfa8cb
Make inputs required (show help if not provided) (#329) 2021-09-16 16:40:38 +02:00
Matthias
21ea0fd033
Add support for tokio-console (#318)
This allows troubleshooting and improving async Rust code.
It is an optional feature that is still
experimental (but can be quite helpful)
2021-09-12 18:10:23 +02:00
Matthias
a1acf7b0d0 Reintegrate master 2021-09-09 01:49:25 +02:00
Matthias
f3fe46a4d6 Merge branch 'master' of github.com:lycheeverse/lychee into local-files 2021-09-08 00:35:41 +02:00
Paweł Romanowski
8fd34a7367
Add no check (dump links only) flag (#99) 2021-09-06 16:10:48 +02:00
Matthias
87fd90f2fc cargo fmt 2021-09-06 15:20:18 +02:00
Matthias
dd3205a87c wip 2021-09-06 15:19:43 +02:00
Matthias
bfa3b1b6a1 Introduce Base type, which can be a path or URL 2021-09-06 15:15:40 +02:00
Matthias
f9bf52ef10 Add support for base_dir 2021-09-06 15:15:05 +02:00
Matthias Endler
701fbc9ada Add support for local files 2021-09-06 15:14:33 +02:00
Lucius Hu
80b8a856ac
Add new flag --require-https (#195) 2021-09-04 03:21:54 +02:00
Matthias
4b537763a5 Directly connect into Result 2021-09-03 16:29:57 +02:00
Matthias
59abd189cf Fix remaining clippy lints 2021-09-03 16:29:57 +02:00
Matthias
d959b54b56 run cargo fmt 2021-09-03 16:29:57 +02:00
dblock
dcee4a1058 Added support for --exclude-file. 2021-09-03 16:29:57 +02:00
Matthias
fe399c0a8c
Simple URI cache (#243) 2021-05-04 13:28:39 +02:00
Matthias
164e1aea7e
Add support for multiple schemes (#237) 2021-04-26 18:24:54 +02:00
Matthias
f8426bafbf
Skip unsupported schemes (#236) 2021-04-26 17:16:58 +02:00
Matthias
1926c73b6b
Add missing docs (#231)
This enables `#![deny(missing_docs)`) and adds all missing doc strings
2021-04-23 00:27:12 +02:00
Lucius Hu
f64213d58c
More refactor (#225)
- Major changes in `lychee-lib::filter` module:
  - Fields in `Excludes` except the `RegexSet` is now moved to `Filter`.
  - `Filter` contains `Option<Excludes>` and `Option<Includes>`, which are
    wrapper struct of `RegexSet` instead of `Option<RegexSet>`. As a result
    the code now looks cleaner.
  - Factored out some filtering logics to dedicated functions.
    - It's possible to write tests for those functions in addition to tests
      for the `Filter` struct.
  - Added docs to `Filter::is_excluded` and reorgnized the code.
- placed `derive_builder` by `typed_builder`:
  - The internal interface very ugly, as admitted by the author, but we no
    longer have nested `Option`s like before.
  - As a result, the `Client` building is much easier to read.
  - Main benefit of `typed_builder` is, the arguments feeded to builder is
    checked at compile time instead of run-time.
- Fixed a bug in `lychee::tests::usage` and `lychee-lib::stats::test`.
  - Now it will clear environment variable which would otherwise cause an
    issue if `GITHUB_TOKEN` is set.
- Updated dependencies.

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2021-04-16 20:25:22 +02:00
Lucius Hu
228e5df6a3
Major refactor of codebase (#208)
- The binary component and library component are separated as two
  packages in the same workspace.
  - `lychee` is the binary component, in `lychee-bin/*`.
  - `lychee-lib` is the library component, in `lychee-lib/*`.
  - Users can now install only the `lychee-lib`, instead of both
    components, that would require fewer dependencies and faster
    compilation.
  - Dependencies for each component are adjusted and updated. E.g.,
    no CLI dependencies for `lychee-lib`.
  - CLI tests are only moved to `lychee`, as it has nothing to do
    with the library component.
- `Status::Error` is refactored to contain dedicated error enum,
  `ErrorKind`.
  - The motivation is to delay the formatting of errors to strings.
    Note that `e.to_string()` is not necessarily cheap (though
    trivial in many cases). The formatting is no delayed until the
    error is needed to be displayed to users. So in some cases, if
    the error is never used, it means that it won't be formatted at
    all.
- Replaced `regex` based matching with one of the following:
  - Simple string equality test in the case of 'false positivie'.
  - URL parsing based test, in the case of extracting repository and
    user name for GitHub links.
  - Either cases would be much more efficient than `regex` based
    matching. First, there's no need to construct a state machine for
    regex. Second, URL is already verified and parsed on its creation,
    and extracting its components is fairly cheap. Also, this removes
    the dependency on `lazy-static` in `lychee-lib`.
- `types` module now has a sub-directory, and its components are now
  separated into their own modules (in that sub-directory).
- `lychee-lib::test_utils` module is only compiled for tests.
- `wiremock` is moved to `dev-dependency` as it's only needed for
  `test` modules.
- Dependencies are listed in alphabetical order.
- Imports are organized in the following fashion:
  - Imports from `std`
  - Imports from 3rd-party crates, and `lychee-lib`.
  - Imports from `crate::*` or `super::*`.
- No glob import.
- I followed suggestion from `cargo clippy`, with `clippy::all` and
  `clippy:pedantic`.

Co-authored-by: Lucius Hu <lebensterben@users.noreply.github.com>
2021-04-15 01:24:11 +02:00