Documentation for building binary embedding plugins, refs #264

Simon Willison 2023-09-12 11:32:12 -07:00
parent 4952a8d119
commit e6dac1a1bd
3 changed files with 22 additions and 0 deletions

@ -16,6 +16,8 @@ If the embedding model can handle binary input, you can call `.embed()` with a b
if embedding_model.supports_binary:
    vector = embedding_model.embed(open("my-image.jpg", "rb").read())
```
The `embedding_model.supports_text` property indicates if the model supports text input.
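As a hedged illustration, a caller could consult both flags before deciding whether a given input can be embedded. The `StubModel` class and `can_embed()` helper below are hypothetical stand-ins and not part of the `llm` API; only the `supports_text` and `supports_binary` attribute names come from the documentation.

```python
class StubModel:
    # Hypothetical stand-in for an embedding model object (not an llm class)
    supports_text = False
    supports_binary = True


def can_embed(model, value):
    """Return True if the model declares support for this input type."""
    if isinstance(value, bytes):
        return model.supports_binary
    if isinstance(value, str):
        return model.supports_text
    # Anything other than str or bytes is rejected in this sketch
    return False
```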
Many embedding models are more efficient when you embed multiple strings or binary strings at once. To embed multiple strings at once, use the `.embed_multi()` method:
```python
vectors = list(embedding_model.embed_multi(["my happy hound", "my dissatisfied cat"]))

@ -18,3 +18,5 @@ def encode(values):
def decode(binary):
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)
```
These functions are available as `llm.encode()` and `llm.decode()`.
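A round trip can be sketched with standalone copies of the two functions. The `decode()` body matches the one shown above; the `encode()` body is an assumption (packing little-endian 32-bit floats to mirror `decode()`), since this excerpt only shows its signature.

```python
import struct


def encode(values):
    # Assumed implementation: pack each value as a little-endian 32-bit float
    return struct.pack("<" + "f" * len(values), *values)


def decode(binary):
    # Same body as the decode() shown above: 4 bytes per float
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)


vector = [0.5, -1.25, 3.0]  # values chosen to be exactly representable as float32
blob = encode(vector)       # 3 floats * 4 bytes = 12 bytes
restored = decode(blob)     # struct.unpack returns a tuple, not a list
```

Values that are not exactly representable in 32 bits will come back slightly different after the round trip, so compare with a tolerance rather than with `==` in that case.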

@ -46,3 +46,21 @@ Or via its registered alias like this:
```bash
cat file.txt | llm embed -m all-MiniLM-L6-v2
```
[llm-sentence-transformers](https://github.com/simonw/llm-sentence-transformers) is a complete example of a plugin that provides an embedding model.
## Embedding binary content
If your model can embed binary content, use the `supports_binary` property to indicate that:
```python
import llm


class ClipEmbeddingModel(llm.EmbeddingModel):
    model_id = "clip"
    supports_binary = True
    supports_text = True
```
`supports_text` defaults to `True` and so is not necessary here. You can set it to `False` if your model only supports binary data.
If your model accepts binary input, your `.embed_batch()` method may be called with a list of Python bytestrings. These may be mixed with regular strings if the model accepts both types of input.
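One way to handle such a mixed batch is to branch on the type of each item. The sketch below is a minimal illustration, not the `llm-clip` implementation: `SketchModel` is a plain class so the example runs without `llm` installed (a real plugin would subclass `llm.EmbeddingModel`), and the `_embed_bytes()` and `_embed_text()` helpers are hypothetical placeholders that return dummy vectors.

```python
class SketchModel:
    # Hypothetical stand-in; a real plugin would subclass llm.EmbeddingModel
    model_id = "sketch"
    supports_binary = True
    supports_text = True

    def embed_batch(self, items):
        # Items may be a mix of str and bytes when both flags are True
        for item in items:
            if isinstance(item, bytes):
                yield self._embed_bytes(item)
            else:
                yield self._embed_text(item)

    def _embed_bytes(self, data):
        # Placeholder: a real model would decode and embed the binary content
        return [float(len(data)), 1.0]

    def _embed_text(self, text):
        # Placeholder: a real model would tokenize and embed the string
        return [float(len(text)), 0.0]


vectors = list(SketchModel().embed_batch(["abc", b"\x00\x01"]))
```

Yielding one vector per item, in input order, keeps each result aligned with the input that produced it.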
[llm-clip](https://github.com/simonw/llm-clip) is an example of a model that can embed both binary and text content.