Commit graph

60 commits

Author SHA1 Message Date
Matthias Endler
97573123ef
Extend remap feature (#1133)
* wip

* Extend support for remapping

This adds support for partial remaps and
capture groups to the remap feature.

Fixes #1129
2023-07-05 15:05:19 +02:00
Techassi
67af7ef6d3
feat: add support for basic auth per URI (#1110)
* Add support for basic auth per domain
* Move URI matching to link collection phase
* Allow AsRef for BasicAuthExtractor::new to avoid clone
* Add tests

---------

Co-authored-by: Matthias Endler <matthias@endler.dev>
2023-06-26 12:06:24 +02:00
Thomas Zahner
130fa21a6a
Concurrent archives (#1027) 2023-05-11 20:20:27 +02:00
Matthias Endler
55797071b0
Fix nested URL extraction in verbatim elements (#988)
Skipping URLs in verbatim elements didn't take nested
elements into consideration, which were not verbatim.

For instance, the following HTML snippet would yield
`https://example.com` in non-verbatim mode, even if
it is nested inside a verbatim `<pre>` element:

```html
<pre><a href="https://example.com">link</a></pre>
```

This commit fixes the behavior for both `html5gum` and
`html5ever`.

Note that nested verbatim elements of the same kind
still are not handled correctly.

For instance, the following HTML snippet would still yield
`https://example.com`:

```html
<pre>
  <pre></pre>
  <a href="https://example.com">link</a>
</pre>
```

The reason is that we currently only keep track of a single
verbatim element and not a stack of elements, which we
would need to unwind and resolve the situation.

Fixes https://github.com/lycheeverse/lychee/issues/986.
2023-03-11 15:18:25 +01:00
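The limitation noted in #988, tracking a single verbatim element instead of a stack, can be illustrated with a minimal sketch. The `Token` enum, the `VERBATIM` list, and `links_outside_verbatim` are hypothetical names introduced for illustration only. They are not taken from lychee, `html5gum`, or `html5ever`, whose tokenizers emit richer events.

```rust
/// Hypothetical set of element names treated as verbatim in this sketch.
const VERBATIM: &[&str] = &["pre", "code"];

/// Simplified token stream (an assumption; real tokenizers are richer).
pub enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
    Link(&'a str),
}

/// Collects links that are not nested inside any verbatim element.
/// A stack of open verbatim elements is kept, so nested elements of the
/// same kind are unwound one at a time.
pub fn links_outside_verbatim<'a>(tokens: &[Token<'a>]) -> Vec<&'a str> {
    let mut stack: Vec<&'a str> = Vec::new();
    let mut links: Vec<&'a str> = Vec::new();
    for token in tokens {
        match token {
            // Entering a verbatim element: remember it.
            Token::Open(name) if VERBATIM.contains(name) => stack.push(*name),
            // Leaving the innermost verbatim element: forget it.
            Token::Close(name) if stack.last() == Some(name) => {
                stack.pop();
            }
            // Only keep links while no verbatim element is open.
            Token::Link(url) if stack.is_empty() => links.push(*url),
            // Non-verbatim elements and mismatched closes are ignored here.
            _ => {}
        }
    }
    links
}
```

With this approach, the link in `<pre><pre></pre><a href="https://example.com">link</a></pre>` is skipped, because the outer `pre` is still on the stack when the link is seen.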
Matthias
c9edb7f809 Split up quirks and skip twitter check
It's flaky on GitHub
2023-03-03 12:13:09 +01:00
Matthias
08466ad59b Ignore config smoketest output report file 2023-03-03 12:13:09 +01:00
Matthias
86f13609e6 Put lycheecache tests into separate subfolders to avoid race 2023-03-03 12:13:09 +01:00
Matthias
388bd20673 Fix tests after address is no longer a verbatim element 2023-03-03 12:13:09 +01:00
Matthias Endler
7874195bbb
Customize verbosity (#956) 2023-02-24 23:53:09 +01:00
Matthias Endler
5654b7c317
Harden URL detection and extend verbatim elements (#899)
Previously remote URLs were incorrectly detected because the
string representation of a path is different than the path itself,
causing the `http` prefix match to be insufficient.

This resulted in unexpected side-effects, such as the
incorrect detection of verbatim mode for remote URLs.

The check has now been improved and unit tests were added to avoid
future breakage. On top of that, missing verbatim elements were added.
2023-01-04 00:38:19 +01:00
Matthias
982d978e47
Add different verbosity levels (#824)
More granular verbosity levels have been asked
for repeatedly.
To enable that we're moving to [env_logger] and [clap-verbosity-flag]
to provide more flexible verbosity settings.

Also tackles #661, #709
Lays the groundwork for tackling #268

https://github.com/rust-cli/env_logger
https://github.com/clap-rs/clap-verbosity-flag
2022-11-28 23:25:33 +01:00
Matthias
765f7adb12
Don't check example mail addresses by default (#815)
This was an oversight so far that became apparent after our
recent fix for email addresses with query params
(e.g. `test@example.com?subject=test`).
The parsing of email addresses has improved and so we detect
more mail addresses, but we didn't check if they belonged
to an example domain, causing false-positive checks.
2022-11-08 23:46:32 +01:00
Matthias
d61105edbb
Fix parsing error of email addresses with query params (#809)
Email addresses with query parameters often get used in
contact forms on websites. They can also be found in
other documents like Markdown.

A common use-case is to add a subject line to the email
as a parameter e.g. `mailto:mail@example.com?subject="Hello"`.

Previously we handled such cases incorrectly by recognizing
them as files. The reason was that our email parsing was too strict
to allow for that use-case.
With `email_address` we switched to a more permissive parser.

Note that this does not affect the actual email address checking,
as this is still done by `check-if-email-exists`, which performs
stricter checks.
2022-11-05 23:40:33 +01:00
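To illustrate the kind of input described in #809, here is a minimal sketch that separates the address from its query parameters. It uses only the standard library, and `split_mailto` is a hypothetical helper. It does not reproduce the `email_address` crate or lychee's actual parsing, and it does no validation of the address itself.

```rust
/// Splits a `mailto:` URI into the bare address and an optional query string.
/// Returns `None` if the `mailto:` prefix is missing or the address is empty.
/// Hypothetical helper for illustration; no address validation is performed.
pub fn split_mailto(uri: &str) -> Option<(&str, Option<&str>)> {
    let rest = uri.strip_prefix("mailto:")?;
    // Everything after the first `?` is treated as the query string.
    let (address, query) = match rest.split_once('?') {
        Some((address, query)) => (address, Some(query)),
        None => (rest, None),
    };
    if address.is_empty() {
        None
    } else {
        Some((address, query))
    }
}
```

For `mailto:mail@example.com?subject=Hello`, this yields the address `mail@example.com` and the query `subject=Hello`, so the query no longer causes the input to be mistaken for something other than an email address.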
Andy Grunwald
a67b513238
Extend description of "--exclude" to also exclude email addresses, not only URLs (#801) 2022-10-23 12:17:20 +02:00
Matthias
601adcefd3
Add new SVG-based screencast (#693)
This is taken from https://github.com/sharkdp/fd, so all credits
go to the original authors.

The demo was a bit dated. We've since added more features and
changed the output. On top of that, the gif was a bit blurry.

The new version is in SVG and the commands can be scripted, so
we can change them with a PR and render them through CI.

Co-authored-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
2022-08-10 17:35:50 +02:00
Walter Beller-Morales
9ad53f97a2
Fix deserialize of lycheecache status codes (#685)
* Add custom deserializer for `CacheStatus` to properly classify status codes
* Add CLI integration tests to check .lycheecache behavior
* Add comment to explain conflict between cache and accept flags
2022-07-15 22:45:24 +02:00
Matthias
a557cba0b4
Add support for parsing list of status codes from config file (#636) 2022-06-02 18:53:04 +02:00
Matthias
9b4dfadffd
Fix parsing errors with config options (#632) 2022-05-31 19:43:46 +02:00
vpereira01
d48a3279a8
Improve configuration example (#631)
* Add missing parameters
* Remove deprecated `--exclude-file` parameter
* Improve TOML comments
* Add config smoketest
2022-05-31 19:05:27 +02:00
Matthias
f33b897d5d
Exclude example domains as per RFC 2606 from checking (#627)
Unfortunately it's not possible to automatically enable features
for `cargo test`. See https://github.com/rust-lang/cargo/issues/2911.

As a workaround to allow for using example domains for unit and integration
tests, we introduce a new feature, `check_example_domains`, which is
disabled by default for normal users. The feature gets activated for the
integration test which checks that the example domain exclusion works as
expected.
2022-05-29 21:42:00 +02:00
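The exclusion described in #627 can be sketched as a host check against the names reserved by RFC 2606. The lists below come from the RFC, not from lychee's source, and `is_example_domain` is a hypothetical function name. How lychee matches subdomains or reserved top-level domains is an assumption here.

```rust
/// Second-level domains reserved by RFC 2606.
const EXAMPLE_DOMAINS: &[&str] = &["example.com", "example.net", "example.org"];
/// Top-level domains reserved by RFC 2606.
const EXAMPLE_TLDS: &[&str] = &["test", "example", "invalid", "localhost"];

/// Returns `true` if `host` is a reserved example domain, a subdomain of one,
/// or ends in a reserved top-level domain. Matching is case-insensitive.
pub fn is_example_domain(host: &str) -> bool {
    // Normalize: drop a trailing root dot and compare in lowercase.
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let is_reserved_second_level = EXAMPLE_DOMAINS.iter().any(|domain| {
        host == *domain || host.ends_with(&format!(".{}", domain))
    });
    let is_reserved_tld = host
        .rsplit('.')
        .next()
        .map_or(false, |tld| EXAMPLE_TLDS.contains(&tld));
    is_reserved_second_level || is_reserved_tld
}
```

Note that a suffix check on `.example.com` rather than `example.com` keeps unrelated hosts such as `notexample.com` from being excluded.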
Matthias
363b95fe5f
Add support for excluding paths from link checking (#623)
This change deprecates `--exclude-file` as it was ambiguous.
Instead, `--exclude-path` was introduced to support excluding paths
to files and directories that should not be checked.
Furthermore, `.lycheeignore` is now the only way
to exclude URL patterns.
2022-05-29 17:27:09 +02:00
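One plausible reading of the `--exclude-path` behavior from #623 is a component-wise prefix match on file system paths. The sketch below shows that reading only. The commit message does not state the matching rules, and `is_excluded_path` is a hypothetical helper, not lychee's implementation.

```rust
use std::path::Path;

/// Returns `true` if `path` is one of the excluded paths or lies beneath one.
/// `Path::starts_with` compares whole components, so `docs/archive` does not
/// match `docs/archive2`.
pub fn is_excluded_path(path: &Path, excluded: &[&Path]) -> bool {
    excluded.iter().any(|prefix| path.starts_with(prefix))
}
```

Under this assumption, excluding `docs/archive` skips `docs/archive/old.md` while `docs/archive2/file.md` is still checked.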
Matthias
b40c785b64
Also dump excluded links (#615)
This is a minimally invasive version, which allows grepping for `[excluded]`.
Showing the reason for exclusion would require more work, and it's debatable
whether it adds any value: it might make grepping harder, and the source
of exclusion is easily deducible from the command-line parameters
or the `.lycheeignore` file.

Fixes #587.
2022-05-13 18:53:16 +02:00
Matthias
b0136683a9
Add support for comments in .lycheeignore (#616)
Lines starting with the comment character (`#`) inside the
.lycheeignore file will be ignored.
Whitespace at the beginning of each line will be ignored, so
even an indented comment character will work.
2022-05-13 18:51:58 +02:00
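The comment handling described in #616 amounts to a small line filter. The sketch below is a hedged illustration: `ignore_patterns` is a hypothetical name, and skipping blank lines is an assumption that the commit message does not mention.

```rust
/// Returns the patterns from the contents of a `.lycheeignore` file.
/// Leading whitespace is ignored, so indented `#` comments are dropped too.
/// Blank lines are skipped as well (an assumption made for this sketch).
pub fn ignore_patterns(contents: &str) -> Vec<&str> {
    contents
        .lines()
        .map(str::trim_start)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .collect()
}
```

Trimming before the `#` check is what makes an indented comment character work the same way as one at the start of the line.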
Matthias
03d28820bb
Extract more status information from reqwest (#577)
Recently we cleaned up the commandline output to trim away redundant
information like the URL, which occurred twice.
Unfortunately we also removed helpful information from reqwest, which
could support the user in troubleshooting unexpected errors.

This commit reverts that.
We now extract the meaningful information from reqwest, without being
too verbose. For that we have to depend on the string output for the
reqwest error, but it's better than hiding that information from the user.
It is fragile as it depends on the reqwest internals, but in the worst case
we simply return the full error text in case our parsing won't work.
2022-04-02 14:37:03 +02:00
Matthias
d616177a99
Implement excluding code blocks (#523)
This is done in the extractor to avoid unnecessary
allocations.
2022-03-26 10:42:56 +01:00
Matthias
812663d832
Prevent flaky tests (#514)
Move from example.org to example.com, which seems to be more permissive for testing
2022-02-18 10:29:49 +01:00
Matthias
9d738fb3f5
Fix default config (#491)
The default configuration was broken since the
introduction of caching and specifically `max_cache_age`.
This fixes deserialization and config merging for
the case where this key is missing from the config.
2022-02-07 23:17:50 +01:00
Matthias
6635863746
Add Alpine page for benchmark; refactor code (#481) 2022-01-27 23:42:06 +01:00
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore, this tries to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (a stringly typed value) and a `Uri`, which is a properly parsed URI type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00
Matthias
591cbdbebb
Add support for .lycheeignore file #308 (#402)
This is similar to files like .gitignore and .dockerignore
and gets merged into exclude_files
2021-11-23 01:39:53 +01:00
MichaIng
961f12e58e
Remove cache from collector and remove custom reqwest client pool
* Reqwest comes with its own request pool, so there's no need to add
another layer of indirection. This also gets rid of a lot of allocs.
* Remove cache from collector
* Improve error handling and documentation
* Add back test for request caching in single file

Signed-off-by: MichaIng <micha@dietpi.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-10-07 18:07:18 +02:00
Matthias
a75cae54b1 Add failing test 2021-09-09 01:17:56 +02:00
Matthias
5d0b95271d Remove anchor from file links 2021-09-07 00:20:09 +02:00
Matthias
03f5df91cd Add fixtures for offline testing 2021-09-06 15:20:18 +02:00
Matthias
495f856c61 cleanup 2021-09-06 15:19:24 +02:00
Matthias
ee70e13bf7 Check real link to file 2021-09-06 15:19:09 +02:00
Matthias
f5ee472d93 explicit naming 2021-09-06 15:19:09 +02:00
Matthias Endler
701fbc9ada Add support for local files 2021-09-06 15:14:33 +02:00
Lucius Hu
80b8a856ac
Add new flag --require-https (#195) 2021-09-04 03:21:54 +02:00
dblock
dcee4a1058 Added support for --exclude-file. 2021-09-03 16:29:57 +02:00
dblock
739a3d6e41 Fix: remove URL that is currently returning a 503. 2021-09-03 16:29:57 +02:00
Matthias
fe399c0a8c
Simple URI cache (#243) 2021-05-04 13:28:39 +02:00
Matthias
164e1aea7e
Add support for multiple schemes (#237) 2021-04-26 18:24:54 +02:00
Matthias
f8426bafbf
Skip unsupported schemes (#236) 2021-04-26 17:16:58 +02:00
Matthias Endler
2b044a6f5b Fix exclude mail, add tests 2021-03-29 23:28:17 +02:00
Matthias Endler
5baaba3948 Add integration test 2021-02-28 19:09:11 +01:00
Matthias Endler
e00cdbf1ae example.com -> example.org 2021-02-21 16:33:33 +01:00
Matthias
702909c4ab
Mailto support (#138)
* Add mailto support and use try_from for parsing URLs
* Cleanup and document code
2021-02-12 10:25:33 +01:00
Paweł Romanowski
aeab85da16
Use html5ever for HTML link extraction (#98) 2021-01-08 16:41:13 +01:00
Paweł Romanowski
cd00fa643e
Fix HTML parsing for non-closed elements like <link> (#92)
* Fix HTML parsing for non-closed elements like <link>

The XML parser we use requires all tags to be closed by default,
and if they aren't (like HTML5 <link> elements), it simply gives up
on further parsing.  This change makes it ignore such issues.

This also uncovers a bug with the current parser: it simply won't parse
elements that have attributes without values, such as
`<script defer src="..."></script>`.

The current parser is a pure XML parser and will have to be replaced with
an HTML-aware parser in the future.

* Add check for empty elements

* Update extract.rs

Co-authored-by: Matthias <matthias-endler@gmx.net>
2021-01-03 17:32:13 +01:00