From e6dac1a1bd4cd265939659a064c1268c58ad153c Mon Sep 17 00:00:00 2001
From: Simon Willison
Date: Tue, 12 Sep 2023 11:32:12 -0700
Subject: [PATCH] Documentation for building binary embedding plugins, refs #264

---
 docs/embeddings/python-api.md      |  2 ++
 docs/embeddings/storage.md         |  2 ++
 docs/embeddings/writing-plugins.md | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/docs/embeddings/python-api.md b/docs/embeddings/python-api.md
index f62f6b8..310b548 100644
--- a/docs/embeddings/python-api.md
+++ b/docs/embeddings/python-api.md
@@ -16,6 +16,8 @@ If the embedding model can handle binary input, you can call `.embed()` with a b
 if embedding_model.supports_binary:
     vector = embedding_model.embed(open("my-image.jpg", "rb").read())
 ```
+The `embedding_model.supports_text` property indicates if the model supports text input.
+
 Many embeddings models are more efficient when you embed multiple strings or binary strings at once. To embed multiple strings at once, use the `.embed_multi()` method:
 ```python
 vectors = list(embedding_model.embed_multi(["my happy hound", "my dissatisfied cat"]))
diff --git a/docs/embeddings/storage.md b/docs/embeddings/storage.md
index d9a63be..da99cdc 100644
--- a/docs/embeddings/storage.md
+++ b/docs/embeddings/storage.md
@@ -18,3 +18,5 @@ def encode(values):
 def decode(binary):
     return struct.unpack("<" + "f" * (len(binary) // 4), binary)
 ```
+
+These functions are available as `llm.encode()` and `llm.decode()`.
diff --git a/docs/embeddings/writing-plugins.md b/docs/embeddings/writing-plugins.md
index 0211dae..7095c4d 100644
--- a/docs/embeddings/writing-plugins.md
+++ b/docs/embeddings/writing-plugins.md
@@ -46,3 +46,21 @@ Or via its registered alias like this:
 ```bash
 cat file.txt | llm embed -m all-MiniLM-L6-v2
 ```
+[llm-sentence-transformers](https://github.com/simonw/llm-sentence-transformers) is a complete example of a plugin that provides an embedding model.
+
+## Embedding binary content
+
+If your model can embed binary content, use the `supports_binary` property to indicate that:
+
+```python
+class ClipEmbeddingModel(llm.EmbeddingModel):
+    model_id = "clip"
+    supports_binary = True
+    supports_text = True
+```
+
+`supports_text` defaults to `True` and so is not necessary here. You can set it to `False` if your model only supports binary data.
+
+If your model accepts binary, your `.embed_batch()` method may be called with a list of Python bytestrings. These may be mixed with regular strings if the model accepts both types of input.
+
+[llm-clip](https://github.com/simonw/llm-clip) is an example of a model that can embed both binary and text content.
\ No newline at end of file
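
Note on the mixed-input case described in the last hunk: the following is a minimal sketch, not part of the patch, of how an `.embed_batch()` method might handle a list that mixes strings and bytestrings. `MixedInputEmbedder` is a hypothetical stand-in that deliberately does not subclass `llm.EmbeddingModel`, so it runs without the library installed, and the two `_embed_*` helpers are placeholders for real model calls.

```python
from typing import Iterable, Iterator, List, Union


class MixedInputEmbedder:
    """Hypothetical stand-in for an embedding model plugin class.

    Assumption: a real plugin would subclass llm.EmbeddingModel instead.
    """

    model_id = "toy-mixed"
    supports_binary = True
    supports_text = True  # the default according to the patch; shown for clarity

    def embed_batch(
        self, items: Iterable[Union[str, bytes]]
    ) -> Iterator[List[float]]:
        # Items may be str or bytes in any order, so dispatch per item.
        for item in items:
            if isinstance(item, bytes):
                yield self._embed_bytes(item)
            else:
                yield self._embed_text(item)

    def _embed_bytes(self, data: bytes) -> List[float]:
        # Placeholder: a real model would decode and embed the binary content.
        return [float(len(data)), 1.0]

    def _embed_text(self, text: str) -> List[float]:
        # Placeholder: a real model would tokenize and embed the text.
        return [float(len(text)), 0.0]


# One vector is produced per input, in input order.
vectors = list(MixedInputEmbedder().embed_batch(["my happy hound", b"\x89PNG"]))
```

Dispatching per item keeps the output order aligned with the input order, which matters when callers pair vectors back up with their inputs.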