llm/docs/embeddings/python-api.md

(embeddings-python-api)=
# Using embeddings from Python

You can load an embedding model using its model ID or alias like this:
```python
import llm

embedding_model = llm.get_embedding_model("ada-002")
```
To embed a string, returning a Python list of floating point numbers, use the `.embed()` method:
```python
vector = embedding_model.embed("my happy hound")
```
Many embeddings models are more efficient when you embed multiple strings at once. To embed multiple strings at once, use the `.embed_multi()` method:
```python
vectors = list(embedding_model.embed_multi(["my happy hound", "my dissatisfied cat"]))
```
This returns a generator that yields one embedding vector per string.

(embeddings-python-collections)=
## Working with collections

The `llm.Collection` class can be used to work with **collections** of embeddings from Python code.

A collection is a named group of embedding vectors, each stored along with their IDs in a SQLite database table.

To work with embeddings in this way you will need an instance of a [sqlite-utils Database](https://sqlite-utils.datasette.io/en/stable/python-api.html#connecting-to-or-creating-a-database) object. You can then pass that to the `llm.Collection` constructor along with the unique string name of the collection and the ID of the embedding model you will be using with that collection:

```python
import sqlite_utils
import llm

db = sqlite_utils.Database("my-embeddings.db")
# Pass model_id= to specify a model for the collection
collection = llm.Collection(db, "entries", model_id="ada-002")

# Or you can pass a model directly using model=
embedding_model = llm.get_embedding_model("ada-002")
collection = llm.Collection(db, "entries", model=embedding_model)
```
If the collection already exists in the database you can omit the `model` or `model_id` argument - the model ID will be read from the `collections` table.

To embed a single string and store it in the collection, use the `embed()` method:

```python
collection.embed("hound", "my happy hound")
```
This stores the embedding for the string "my happy hound" in the `entries` collection under the key `hound`.

Add `store=True` to store the text content itself in the database table along with the embedding vector.

To attach additional metadata to an item, pass a JSON-compatible dictionary as the `metadata=` argument:

```python
collection.embed("hound", "my happy hound", metadata={"name": "Hound"}, store=True)
```
This additional metadata will be stored as JSON in the `metadata` column of the embeddings database table.

(embeddings-python-similar)=
## Retrieving similar items

Once you have populated a collection of embeddings you can retrieve the IDs of the most similar items to a given string using the `similar()` method:

```python
for id, score in collection.similar("hound"):
    print(id, score)
```
The string will first by embedded using the model for the collection.

This defaults to returning the 10 most similar items. You can change this by passing a different `number=` argument:
```python
for id, score in collection.similar("hound", number=5):
    print(id, score)
```
The `similar_by_id()` method takes the ID of another item in the collection and returns the most similar items to that one, based on the embedding that has already been stored for it:

```python
for id, score in collection.similar_by_id("cat"):
    print(id, score)
```
The item itself is excluded from the results.
Initial Python embeddings API docs, refs #191 2023-09-02 00:25:45 +00:00			`(embeddings-python-api)=`
			`# Using embeddings from Python`

			`You can load an embedding model using its model ID or alias like this:`
			```python
			`import llm`

			`embedding_model = llm.get_embedding_model("ada-002")`
			```
			To embed a string, returning a Python list of floating point numbers, use the `.embed()` method:
			```python
			`vector = embedding_model.embed("my happy hound")`
			```
			Many embeddings models are more efficient when you embed multiple strings at once. To embed multiple strings at once, use the `.embed_multi()` method:
			```python
			`vectors = list(embedding_model.embed_multi(["my happy hound", "my dissatisfied cat"]))`
			```
			`This returns a generator that yields one embedding vector per string.`

			`(embeddings-python-collections)=`
			`## Working with collections`

			The `llm.Collection` class can be used to work with collections of embeddings from Python code.

			`A collection is a named group of embedding vectors, each stored along with their IDs in a SQLite database table.`

			To work with embeddings in this way you will need an instance of a [sqlite-utils Database](https://sqlite-utils.datasette.io/en/stable/python-api.html#connecting-to-or-creating-a-database) object. You can then pass that to the `llm.Collection` constructor along with the unique string name of the collection and the ID of the embedding model you will be using with that collection:

			```python
			`import sqlite_utils`
			`import llm`

			`db = sqlite_utils.Database("my-embeddings.db")`
			`# Pass model_id= to specify a model for the collection`
			`collection = llm.Collection(db, "entries", model_id="ada-002")`

			`# Or you can pass a model directly using model=`
			`embedding_model = llm.get_embedding_model("ada-002")`
			`collection = llm.Collection(db, "entries", model=embedding_model)`
			```
			If the collection already exists in the database you can omit the `model` or `model_id` argument - the model ID will be read from the `collections` table.

			To embed a single string and store it in the collection, use the `embed()` method:

			```python
			`collection.embed("hound", "my happy hound")`
			```
			This stores the embedding for the string "my happy hound" in the `entries` collection under the key `hound`.

Docs and test for store=True, refs #203 2023-09-02 00:52:43 +00:00			Add `store=True` to store the text content itself in the database table along with the embedding vector.

Test and docs for collection.embed(metadata=), refs #203 2023-09-02 00:57:54 +00:00			To attach additional metadata to an item, pass a JSON-compatible dictionary as the `metadata=` argument:
Initial Python embeddings API docs, refs #191 2023-09-02 00:25:45 +00:00
			```python
Test and docs for collection.embed(metadata=), refs #203 2023-09-02 00:57:54 +00:00			`collection.embed("hound", "my happy hound", metadata={"name": "Hound"}, store=True)`
Initial Python embeddings API docs, refs #191 2023-09-02 00:25:45 +00:00			```
Test and docs for collection.embed(metadata=), refs #203 2023-09-02 00:57:54 +00:00			This additional metadata will be stored as JSON in the `metadata` column of the embeddings database table.
Initial Python embeddings API docs, refs #191 2023-09-02 00:25:45 +00:00
			`(embeddings-python-similar)=`
			`## Retrieving similar items`

			Once you have populated a collection of embeddings you can retrieve the IDs of the most similar items to a given string using the `similar()` method:

			```python
			`for id, score in collection.similar("hound"):`
			`print(id, score)`
			```
			`The string will first by embedded using the model for the collection.`

			This defaults to returning the 10 most similar items. You can change this by passing a different `number=` argument:
			```python
			`for id, score in collection.similar("hound", number=5):`
			`print(id, score)`
			```
			The `similar_by_id()` method takes the ID of another item in the collection and returns the most similar items to that one, based on the embedding that has already been stored for it:

			```python
			`for id, score in collection.similar_by_id("cat"):`
			`print(id, score)`
			```
			`The item itself is excluded from the results.`