If the embedding model can handle binary input, you can call `.embed()` with a byte string instead. Check the model's `supports_binary` property first to confirm that binary input is supported.
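A minimal sketch of that check follows. The helper name `embed_bytes` is hypothetical, and raising `ValueError` for an unsupported model is just one reasonable choice; only `.embed()` and `supports_binary` come from the description above.

```python
def embed_bytes(embedding_model, data: bytes):
    """Embed binary content, but only if the model reports that it accepts it."""
    if not embedding_model.supports_binary:
        # Hypothetical error handling: fail loudly rather than send unsupported input
        raise ValueError("This embedding model does not accept binary input")
    return embedding_model.embed(data)
```

With a real model this might be called as `embed_bytes(embedding_model, open("my-image.jpg", "rb").read())`, where the file name is a placeholder.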
Many embedding models are more efficient when you embed multiple strings or binary strings at once. To embed multiple strings at once, use the `.embed_multi()` method:
```python
vectors = list(embedding_model.embed_multi(["my happy hound", "my dissatisfied cat"]))
```
This returns a generator that yields one embedding vector per string.
(embeddings-python-collections)=
## Working with collections
The `llm.Collection` class can be used to work with **collections** of embeddings from Python code.
A collection is a named group of embedding vectors, each stored along with its ID in a SQLite database table.
To work with embeddings in this way you will need an instance of a [sqlite-utils Database](https://sqlite-utils.datasette.io/en/stable/python-api.html#connecting-to-or-creating-a-database) object. Pass it to the `llm.Collection` constructor along with the unique string name of the collection and the ID of the embedding model you will be using with that collection.
If the collection already exists in the database you can omit the `model` or `model_id` argument - the model ID will be read from the `collections` table.
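The two ways of constructing a collection can be sketched as follows. The helper `open_collection` is hypothetical and takes the collection class as a parameter so that the sketch runs without a database; the positional order `(name, db)` and the `model_id` keyword are assumptions about the constructor signature.

```python
def open_collection(collection_cls, name, db, model_id=None):
    """Construct a collection; pass collection_cls=llm.Collection in real use."""
    if model_id is None:
        # Existing collection: the model ID is expected to be read from the database
        return collection_cls(name, db)
    # New collection: record which embedding model it will use
    return collection_cls(name, db, model_id=model_id)
```

In real use this might look like `open_collection(llm.Collection, "entries", sqlite_utils.Database("my-embeddings.db"), model_id="ada-002")`, where the file name and model ID are placeholders.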
To embed a single string and store it in the collection, use the `embed()` method:
```python
collection.embed("hound", "my happy hound")
```
This stores the embedding for the string "my happy hound" in the `entries` collection under the key `hound`.
The `collection.embed_multi()` method can be used to store embeddings for multiple items at once. This can be more efficient for some embedding models.
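As a rough illustration, the hypothetical helper below stores a dictionary of texts with a single `embed_multi()` call. It assumes each entry is an `(id, text)` tuple, and it accepts any object that exposes `embed_multi()`.

```python
def embed_documents(collection, documents: dict, store: bool = True) -> int:
    """Store embeddings for an {id: text} mapping in one embed_multi() call."""
    # Assumption: each entry is an (id, text) tuple
    entries = [(doc_id, text) for doc_id, text in documents.items()]
    collection.embed_multi(entries, store=store)
    return len(entries)
```

For example, `embed_documents(collection, {"hound": "my happy hound", "cat": "my dissatisfied cat"})` would store two entries and keep their text, because `store` defaults to `True` in this helper.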
A collection instance has the following properties and methods:
- `id` - the integer ID of the collection in the database
- `name` - the string name of the collection (unique in the database)
- `model_id` - the string ID of the embedding model used for this collection
- `model()` - returns the `EmbeddingModel` instance, based on that `model_id`
- `count()` - returns the integer number of items in the collection
- `embed(id: str, text: str, metadata: dict=None, store: bool=False)` - embeds the given string and stores it in the collection under the given ID. Can optionally include metadata (stored as JSON) and store the text content itself in the database table.
- `embed_multi(entries: Iterable, store: bool=False)` - see above
- `embed_multi_with_metadata(entries: Iterable, store: bool=False)` - like `embed_multi()`, but each entry also includes a metadata dictionary to store alongside the embedding
- `similar(query: str, number: int=10)` - returns a list of entries that are most similar to the embedding of the given query string
- `similar_by_id(id: str, number: int=10)` - returns a list of entries that are most similar to the embedding of the item with the given ID
- `similar_by_vector(vector: List[float], number: int=10, skip_id: str=None)` - returns a list of entries that are most similar to the given embedding vector, optionally skipping the entry with the given ID
There is also a `Collection.exists(db, name)` class method, which returns a boolean indicating whether a collection with that name exists in the database.
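One possible use of `exists()` is to guard a lookup so that a collection is only opened when it is already present. The helper `count_if_exists` below is hypothetical, and the `(name, db)` constructor order is an assumption.

```python
def count_if_exists(collection_cls, db, name: str) -> int:
    """Return the number of items in a collection, or 0 if it does not exist."""
    if not collection_cls.exists(db, name):
        return 0
    # Assumption: the constructor takes the name first, then the database
    return collection_cls(name, db).count()
```

In real use `collection_cls` would be `llm.Collection`.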
Once you have populated a collection of embeddings you can retrieve the entries that are most similar to a given string using the `similar()` method.
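A short sketch of reading the results. The helper name is hypothetical, and it assumes each returned entry exposes `id` and `score` attributes.

```python
def describe_matches(collection, query: str, number: int = 3):
    """Return 'id: score' lines for the entries closest to the query string."""
    # Assumption: each returned entry has .id and .score attributes
    return [
        f"{entry.id}: {entry.score:.3f}"
        for entry in collection.similar(query, number=number)
    ]
```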
This method uses a brute force approach, calculating distance scores against every document. This is fine for small collections, but will not scale to large collections. See [issue 216](https://github.com/simonw/llm/issues/216) for plans to add a more scalable approach via vector indexes provided by plugins.
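To make the scaling cost concrete, here is a plain-Python sketch of a brute force search. It assumes cosine similarity as the score and a simple `{id: vector}` dictionary as storage; both are illustrative choices and not a description of the library's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_similar(query_vector, stored, number=10):
    """Score every stored vector against the query and keep the best matches.

    stored is a hypothetical {id: vector} mapping. Every item is visited,
    so the work grows linearly with the size of the collection.
    """
    scored = [
        (item_id, cosine_similarity(query_vector, vector))
        for item_id, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number]
```

Each query touches every stored vector, which is why an index becomes attractive once a collection grows large.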
The `similar_by_id()` method takes the ID of an item already in the collection and returns the items most similar to it, based on the embedding that has already been stored for that item.
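A minimal sketch of collecting related items this way. The helper `related_ids` is hypothetical and assumes each returned entry has an `id` attribute.

```python
def related_ids(collection, item_id: str, number: int = 5):
    """IDs of the stored items closest to an item that is already in the collection."""
    # Assumption: each returned entry has an .id attribute
    return [entry.id for entry in collection.similar_by_id(item_id, number=number)]
```

Because the embedding is already stored, this lookup does not need to call the embedding model again.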