25 commits

Author SHA1 Message Date
Matthias
961575cdc7 fix typos 2023-07-13 21:48:46 +02:00
Matthias Endler
15e420b8ad
Avoid false positives when checking email addresses in HTML input (#1123)
Skip email addresses outside href attributes in HTML
2023-07-01 00:12:11 +02:00
Matthias
e47577708b formatting 2023-04-11 16:36:32 +02:00
Matthias
e96d4114a9 helpers -> utils 2023-04-11 00:43:57 +02:00
Matthias Endler
55797071b0
Fix nested URL extraction in verbatim elements (#988)
Skipping URLs in verbatim elements didn't take nested
elements into consideration, which were not verbatim.

For instance, the following HTML snippet would yield
`https://example.com` in non-verbatim mode, even if
it is nested inside a verbatim `<pre>` element:

```html
<pre><a href="https://example.com">link</a></pre>
```

This commit fixes the behavior for both `html5gum` and
`html5ever`.

Note that nested verbatim elements of the same kind
are still not handled correctly.

For instance, the following HTML snippet would still yield
`https://example.com`:

```html
<pre>
  <pre></pre>
  <a href="https://example.com">link</a>
</pre>
```

The reason is that we currently only keep track of a single
verbatim element and not a stack of elements, which we
would need to unwind and resolve the situation.

Fixes https://github.com/lycheeverse/lychee/issues/986.
2023-03-11 15:18:25 +01:00
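
The limitation described in this commit message (tracking a single verbatim element instead of a stack) can be illustrated with a small sketch. This is not lychee's actual code: `VerbatimTracker`, `is_verbatim_elem`, and the short list of verbatim elements are hypothetical names and values chosen for illustration only.

```rust
/// Hypothetical helper; the real set of verbatim elements may differ.
fn is_verbatim_elem(name: &str) -> bool {
    matches!(name, "pre" | "code" | "textarea")
}

/// Tracks open verbatim elements as a stack rather than a single slot,
/// so that `<pre><pre></pre> ... </pre>` stays verbatim until the outer
/// element is closed.
#[derive(Default)]
struct VerbatimTracker {
    open: Vec<String>,
}

impl VerbatimTracker {
    /// Push verbatim elements when their start tag is seen.
    fn start_tag(&mut self, name: &str) {
        if is_verbatim_elem(name) {
            self.open.push(name.to_string());
        }
    }

    /// Pop only when the end tag matches the innermost open verbatim element.
    fn end_tag(&mut self, name: &str) {
        if self.open.last().map(String::as_str) == Some(name) {
            self.open.pop();
        }
    }

    /// Links should be skipped while at least one verbatim element is open.
    fn in_verbatim(&self) -> bool {
        !self.open.is_empty()
    }
}
```

With a single slot, the inner `</pre>` would clear the verbatim state too early. With a stack, the state is only cleared once every opened verbatim element has been closed.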
Matthias Endler
4a3bfb99fb
Remove address from verbatim elements (#901) 2023-01-05 14:55:53 +01:00
Matthias Endler
5654b7c317
Harden URL detection and extend verbatim elements (#899)
Previously remote URLs were incorrectly detected because the
string representation of a path is different than the path itself,
causing the `http` prefix match to be insufficient.

This resulted in unexpected side-effects, such as the
incorrect detection of verbatim mode for remote URLs.

The check has been improved and unit tests were added to avoid
future breakage. On top of that, missing verbatim elements were added.
2023-01-04 00:38:19 +01:00
Matthias
ef391cea50
Recursively skip verbatim elements (#847) 2022-12-12 01:06:45 +01:00
Matthias
9eeea250cd
Exclude <script> tags by default (#848)
This is a naive approach to exclude script tags from
getting checked. The reason is that the tag leads to
a lot of false-positives (e.g. `//unpkg.com/docsify-edit-on-github@1`
within a script block gets detected as an e-mail address).

A more thorough approach would be the use of a tree-builder in
html5gum and html5ever, but this could have a negative performance
impact.

I also did not want to add a new flag (e.g. `--include-scripts`) for this
setting because the current set of flags around exclusion/inclusion is
already quite long.

Fixes #821.
2022-11-29 00:38:43 +01:00
Matthias
69f387c1bd
Markdown-status (#729)
* Fix typos

* Add status code description to markdown output
2022-08-11 22:08:05 +02:00
dependabot[bot]
231939af82
Bump html5gum from 0.4.0 to 0.5.1 (#658)
* Bump html5gum from 0.4.0 to 0.5.1

Bumps [html5gum](https://github.com/untitaker/html5gum) from 0.4.0 to 0.5.1.
- [Release notes](https://github.com/untitaker/html5gum/releases)
- [Commits](https://github.com/untitaker/html5gum/compare/0.4.0...0.5.1)

---
updated-dependencies:
- dependency-name: html5gum
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update html5gum

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matthias <matthias-endler@gmx.net>
2022-06-23 00:07:28 +02:00
Markus Unterwaditzer
f1ae22da09
Replace lazy hashset with matches! (#656)
* Replace lazy hashset with matches!

llvm will typically create much faster code than accessing a hashset at
runtime

source: trust me bro

* cargo fix

* cargo fmt

* shorten docstring
2022-06-18 19:00:07 +02:00
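
To make the change in this commit concrete, here is a rough before/after sketch. It is an assumption-laden illustration, not the code from the pull request: the function names and the element list are hypothetical, and `std::sync::OnceLock` stands in for whatever lazy-initialization mechanism was used originally.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

/// Before (sketch): a lazily initialized set that is consulted at runtime.
fn is_verbatim_hashset(name: &str) -> bool {
    static ELEMENTS: OnceLock<HashSet<&'static str>> = OnceLock::new();
    ELEMENTS
        .get_or_init(|| HashSet::from(["pre", "code", "textarea"]))
        .contains(name)
}

/// After (sketch): a pattern match that the compiler can lower to
/// direct string comparisons, with no hashing and no global state.
fn is_verbatim_matches(name: &str) -> bool {
    matches!(name, "pre" | "code" | "textarea")
}
```

Both functions return the same answers. The `matches!` version simply avoids the lazy initialization and the hash lookup.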
Matthias
84de43c554
Refactor request types (#637) 2022-06-03 20:13:07 +02:00
Matthias
8c0a32d81d
Refactor response formatting (#599)
* Add support for raw formatter (no color)
* Introduce ResponseFormatter trait
* Pass the same params to every cli command
* Update dependencies
* Remove pretty_assertions dependency (latest version doesn't build)
2022-04-25 19:19:36 +02:00
Matthias
a607b853c9
Move to downstream optimization for short strings (#600)
Skipping the parsing of very short strings was merged into linkify,
so our own workaround is no longer necessary:
https://github.com/robinst/linkify/pull/34
2022-04-25 19:18:50 +02:00
Matthias
6ebc9fed4b
Reset nofollow in html5gum start tag (#584) 2022-04-06 00:49:00 +02:00
Matthias
debe958766
Add support for nofollow (#572) 2022-04-04 10:32:00 +02:00
Matthias
d616177a99
Implement excluding code blocks (#523)
This is done in the extractor to avoid unnecessary
allocations.
2022-03-26 10:42:56 +01:00
Matthias
77b1724881
Optimize plaintext extractor for small strings (#565)
Immediately return for very small strings which cannot be valid URIs.

The shortest valid URI without a scheme might be g.cn (Google China).
At least I am not aware of a shorter one. We set this as a lower threshold
for parsing URIs from plaintext to avoid false-positives and as a slight
performance optimization, which could add up for big files.
This threshold might be adjusted in the future.
2022-03-23 23:06:49 +01:00
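
The early return described in this commit could look roughly like the sketch below. The constant `MIN_URI_LEN`, the function name, and the naive whitespace-based scan are all assumptions made for illustration. The real extractor relies on a proper link finder rather than this placeholder.

```rust
/// Assumed threshold: the length of "g.cn", the shortest URI named in the commit message.
const MIN_URI_LEN: usize = 4;

/// Hypothetical plaintext extractor. The scan below is a naive placeholder
/// (whitespace-separated tokens containing a dot), not a real URI parser.
fn extract_plaintext(input: &str) -> Vec<&str> {
    // Inputs shorter than the threshold cannot contain a valid URI,
    // so return immediately without scanning.
    if input.len() < MIN_URI_LEN {
        return Vec::new();
    }
    input
        .split_whitespace()
        .filter(|token| token.len() >= MIN_URI_LEN && token.contains('.'))
        .collect()
}
```

For example, `extract_plaintext("a.b")` returns an empty vector without scanning, while `extract_plaintext("visit g.cn now")` yields `["g.cn"]`.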
Matthias
812663d832
Prevent flaky tests (#514)
Move from example.org to example.com, which seems to be more permissive for testing
2022-02-18 10:29:49 +01:00
Markus Unterwaditzer
68d09f7e5b
Add html5gum as alternative link extractor (#480)
html5gum is an HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate any tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever, is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.
2022-02-07 22:54:47 +01:00
Matthias
5802ae912c
Fix bugs in extractor; reduce allocs (#464)
When URLs couldn't be extracted from a tag,
we ran a plaintext search, but never added the
newly found urls to the vec of extracted urls.

Also tried to make the code a little more idiomatic
2022-01-16 02:13:38 +01:00
Matthias
ac490f9c53
Add caching functionality (v2) (#443)
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:

* Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before.
* Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests.
* Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).

This is an optional feature. Caching only gets used if the `--cache` flag is set.
2022-01-14 15:25:51 +01:00
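
The two-column cache format described in this commit can be sketched as follows. This is only an illustration under several assumptions: the `CacheStatus` variants shown here are hypothetical and simplified, a plain `HashMap` replaces `DashMap` to keep the example free of external crates, and no CSV quoting is performed.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified status type; the real `CacheStatus` may carry more detail.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheStatus {
    Ok,
    Error,
}

impl CacheStatus {
    fn as_str(self) -> &'static str {
        match self {
            CacheStatus::Ok => "ok",
            CacheStatus::Error => "error",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "ok" => Some(CacheStatus::Ok),
            "error" => Some(CacheStatus::Error),
            _ => None,
        }
    }
}

/// Write one `uri,status` line per entry, sorted by URI for stable output.
fn to_csv(cache: &HashMap<String, CacheStatus>) -> String {
    let mut rows: Vec<(&String, &CacheStatus)> = cache.iter().collect();
    rows.sort_by(|a, b| a.0.cmp(b.0));
    let mut out = String::new();
    for (uri, status) in rows {
        out.push_str(uri);
        out.push(',');
        out.push_str(status.as_str());
        out.push('\n');
    }
    out
}

/// Read the cache back. Splitting at the last comma tolerates commas in the URI;
/// lines with an unknown status are skipped.
fn from_csv(data: &str) -> HashMap<String, CacheStatus> {
    data.lines()
        .filter_map(|line| {
            let (uri, status) = line.rsplit_once(',')?;
            Some((uri.to_string(), CacheStatus::parse(status)?))
        })
        .collect()
}
```

Keeping a dedicated, flat status type for serialization sidesteps the problem mentioned in the commit message of serializing rich error kinds directly.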
Matthias
dd48466d9a
Add missing test for local links in plaintext files (#444) 2022-01-05 12:51:14 +01:00
Matthias
166c86c30e
Use tokenizer for extraction; add benchmark (#424)
This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than master.

Old: 4.557 s ± 0.404 s
New: 3.832 s ± 0.131 s

The performance fluctuates a little less as well.

Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake.

Furthermore, tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (a stringy type) and a `Uri`, which is a properly parsed URI type.
The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.
2021-12-16 18:45:52 +01:00