lychee

mirror of https://github.com/Hopiu/lychee.git synced 2026-05-23 12:55:50 +00:00

Author	SHA1	Message	Date
Matthias	961575cdc7	fix typos	2023-07-13 21:48:46 +02:00
Matthias Endler	15e420b8ad	Avoid false positives when checking email addresses in HTML input (#1123) Skip email addresses outside href attributes in HTML	2023-07-01 00:12:11 +02:00
Matthias	e47577708b	formatting	2023-04-11 16:36:32 +02:00
Matthias	e96d4114a9	helpers -> utils	2023-04-11 00:43:57 +02:00
Matthias Endler	55797071b0	Fix nested URL extraction in verbatim elements (#988) Skipping URLs in verbatim elements didn't take nested elements into consideration, which were not verbatim. For instance, the following HTML snippet would yield `https://example.com` in non-verbatim mode, even if it is nested inside a verbatim `<pre>` element: ```html <pre><a href="https://example.com">link</a></pre> ``` This commit fixes the behavior for both `html5gum` and `html5ever`. Note that nested verbatim elements of the same kind still are not handled correctly. For instance, the following HTML snippet would still yield `https://example.com`: ```html <pre> <pre></pre> <a href="https://example.com">link</a> </pre> ``` The reason is that we currently only keep track of a single verbatim element and not a stack of elements, which we would need to unwind and resolve the situation. Fixes https://github.com/lycheeverse/lychee/issues/986.	2023-03-11 15:18:25 +01:00
Matthias Endler	4a3bfb99fb	Remove address from verbatim elements (#901)	2023-01-05 14:55:53 +01:00
Matthias Endler	5654b7c317	Harden URL detection and extend verbatim elements (#899) Previously remote URLs were incorrectly detected because the string representation of a path is different than the path itself, causing the `http` prefix match to be insufficient. This resulted in unexpected side-effects, such as the incorrect detection of verbatim mode for remote URLs. The check has now been improved and unit tests were added to avoid future breakage. On top of that, missing verbatim elements were added.	2023-01-04 00:38:19 +01:00
Matthias	ef391cea50	Recursively skip verbatim elements (#847)	2022-12-12 01:06:45 +01:00
Matthias	9eeea250cd	Exclude <script> tags by default (#848) This is a naive approach to exclude script tags from getting checked. The reason is that the tag leads to a lot of false-positives (e.g. `//unpkg.com/docsify-edit-on-github@1` within a script block gets detected as an e-mail address). A more thorough approach would be the use of a tree-builder in html5gum and html5ever, but this could have a negative performance impact. I also did not want to add a new flag (e.g. `--include-scripts`) for this setting because the current set of flags around exclusion/inclusion is already quite long. Fixes #821.	2022-11-29 00:38:43 +01:00
Matthias	69f387c1bd	Markdown-status (#729) * Fix typos * Add status code description to markdown output	2022-08-11 22:08:05 +02:00
dependabot[bot]	231939af82	Bump html5gum from 0.4.0 to 0.5.1 (#658) * Bump html5gum from 0.4.0 to 0.5.1 Bumps [html5gum](https://github.com/untitaker/html5gum) from 0.4.0 to 0.5.1. - [Release notes](https://github.com/untitaker/html5gum/releases) - [Commits](https://github.com/untitaker/html5gum/compare/0.4.0...0.5.1) --- updated-dependencies: - dependency-name: html5gum dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update html5gum Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matthias <matthias-endler@gmx.net>	2022-06-23 00:07:28 +02:00
Markus Unterwaditzer	f1ae22da09	Replace lazy hashset with matches! (#656) * Replace lazy hashset with matches! llvm will typically create much faster code than accessing a hashset at runtime source: trust me bro * cargo fix * cargo fmt * shorten docstring	2022-06-18 19:00:07 +02:00
Matthias	84de43c554	Refactor request types (#637)	2022-06-03 20:13:07 +02:00
Matthias	8c0a32d81d	Refactor response formatting (#599) * Add support for raw formatter (no color) * Introduce ResponseFormatter trait * Pass the same params to every cli command * Update dependencies * Remove pretty_assertions dependency (latest version doesn't build)	2022-04-25 19:19:36 +02:00
Matthias	a607b853c9	Move to downstream optimization for short strings (#600) Skipping the parsing of very short strings was merged into linkify, so our own workaround is unnecessary: https://github.com/robinst/linkify/pull/34	2022-04-25 19:18:50 +02:00
Matthias	6ebc9fed4b	Reset nofollow in html5gum start tag (#584)	2022-04-06 00:49:00 +02:00
Matthias	debe958766	Add support for nofollow (#572)	2022-04-04 10:32:00 +02:00
Matthias	d616177a99	Implement excluding code blocks (#523) This is done in the extractor to avoid unnecessary allocations.	2022-03-26 10:42:56 +01:00
Matthias	77b1724881	Optimize plaintext extractor for small strings (#565) Immediately return for very small strings which cannot be valid URIs. The shortest valid URI without a scheme might be g.cn (Google China); at least I am not aware of a shorter one. We set this as a lower threshold for parsing URIs from plaintext to avoid false-positives and as a slight performance optimization, which could add up for big files. This threshold might be adjusted in the future.	2022-03-23 23:06:49 +01:00
Matthias	812663d832	Prevent flaky tests (#514) Move from example.org to example.com, which seems to be more permissive for testing	2022-02-18 10:29:49 +01:00
Markus Unterwaditzer	68d09f7e5b	Add html5gum as alternative link extractor (#480) html5gum is an HTML parser that offers lower-level control over which tokens actually get created and are tracked. As such, the extractor doesn't allocate any tokens it doesn't care about. On some benchmarks it provides a substantial performance boost. The old parser, html5ever, is still available by setting the `LYCHEE_USE_HTML5EVER=1` env var.	2022-02-07 22:54:47 +01:00
Matthias	5802ae912c	Fix bugs in extractor; reduce allocs (#464) When URLs couldn't be extracted from a tag, we ran a plaintext search, but never added the newly found URLs to the vec of extracted URLs. Also tried to make the code a little more idiomatic.	2022-01-16 02:13:38 +01:00
Matthias	ac490f9c53	Add caching functionality (v2) (#443) A while ago, caching was removed due to some issues (see #349). This is a new implementation with the following improvements: * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before. * Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests. * Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either). This is an optional feature. Caching only gets used if the `--cache` flag is set.	2022-01-14 15:25:51 +01:00
Matthias	dd48466d9a	Add missing test for local links in plaintext files (#444)	2022-01-05 12:51:14 +01:00
Matthias	166c86c30e	Use tokenizer for extraction; add benchmark (#424) This avoids creating a DOM tree for link extraction and instead uses a `TokenSink` for on-the-fly extraction. In hyperfine benchmarks it was about 10-25% faster than master. Old: 4.557 s ± 0.404 s New: 3.832 s ± 0.131 s The performance fluctuates a little less as well. Some missing element/attribute pairs were also added, which contain links according to the HTML spec. These occur very rarely, but it's good to parse them for completeness' sake. Furthermore tried to clean up a lot of papercuts around our types. We now differentiate between a `RawUri` (stringy-types) and a Uri, which is a properly parsed `URI` type. The extractor now only deals with extracting `RawUri`s while the collector creates the request objects.	2021-12-16 18:45:52 +01:00

25 commits
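
Commit 55797071b0 notes that nested verbatim elements of the same kind are still mishandled because only a single verbatim element is tracked rather than a stack. The following is a minimal sketch of what stack-based tracking could look like. It is not lychee's code and does not use the html5gum or html5ever APIs: the `Token` enum, `is_verbatim`, and `extract_links` are hypothetical names invented for this illustration, and the verbatim element list is only an illustrative subset.

```rust
/// Hypothetical, simplified token type for this sketch.
/// It is NOT the token type of html5gum, html5ever, or lychee.
#[derive(Debug, Clone, PartialEq)]
enum Token<'a> {
    /// Opening tag, e.g. `<pre>`.
    Start(&'a str),
    /// Closing tag, e.g. `</pre>`.
    End(&'a str),
    /// An `href` attribute value found on the current element.
    Href(&'a str),
}

/// Illustrative subset of elements treated as verbatim (assumption).
fn is_verbatim(tag: &str) -> bool {
    matches!(tag, "pre" | "code" | "script")
}

/// Collect links that are not inside any verbatim element.
/// A stack of open verbatim elements is kept, so closing an inner
/// `<pre>` does not end the verbatim scope of an outer `<pre>`.
fn extract_links<'a>(tokens: &[Token<'a>]) -> Vec<&'a str> {
    let mut open_verbatim: Vec<&str> = Vec::new();
    let mut links = Vec::new();

    for token in tokens {
        match token {
            // Entering a verbatim element: remember it.
            Token::Start(tag) if is_verbatim(tag) => open_verbatim.push(*tag),
            // Leaving the innermost open verbatim element: unwind one level.
            Token::End(tag) if open_verbatim.last() == Some(tag) => {
                open_verbatim.pop();
            }
            // Only keep links while no verbatim element is open.
            Token::Href(url) if open_verbatim.is_empty() => links.push(*url),
            _ => {}
        }
    }
    links
}
```

With this approach, the second HTML snippet from the commit message (a `<pre>` containing an empty `<pre></pre>` followed by a link) yields no links, because the outer `<pre>` is still on the stack when the link is seen. The sketch assumes well-formed nesting; mismatched closing tags are simply ignored.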