The `llm embed` command can be used to calculate embedding vectors for a string of content. These can be returned directly to the terminal, stored in a SQLite database, or both.
### Returning embeddings to the terminal
The simplest way to use this command is to pass content to it using the `-c/--content` option, like this:
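```bash
llm embed -c 'This is some content' -m 3-small
```
This outputs the embedding as a JSON array of floating point numbers.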
`-m 3-small` specifies the OpenAI `text-embedding-3-small` model. You will need to have set an OpenAI API key using `llm keys set openai` for this to work.
You can install plugins to access other models. The [llm-sentence-transformers](https://github.com/simonw/llm-sentence-transformers) plugin can be used to run models on your own laptop, such as the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model:
```bash
llm install llm-sentence-transformers
llm embed -c 'This is some content' -m sentence-transformers/all-MiniLM-L6-v2
```
LLM also offers a binary storage format for embeddings, described in {ref}`embeddings storage format <embeddings-storage>`.
You can output embeddings using that format as raw bytes using `--format blob`, or in hexadecimal using `--format hex`, or in Base64 using `--format base64`:
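```bash
llm embed -c 'This is some content' -m 3-small --format base64
```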
Some models such as [llm-clip](https://github.com/simonw/llm-clip) can run against binary data. You can pass in binary data using the `-i` and `--binary` options:
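```bash
# Assumes the llm-clip plugin is installed; image.jpg is any local image file
llm embed --binary -m clip -i image.jpg
```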
LLM includes the concept of a **collection** of embeddings. A collection groups together a set of stored embeddings created using the same model, each with a unique ID within that collection.
The following example stores the embedding for the string "my happy hound" in a collection called `phrases` under the key `hound` and using the model `3-small`:
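```bash
llm embed phrases hound -m 3-small -c 'my happy hound'
```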
You can store embeddings in a different SQLite database by passing a path to it using the `-d/--database` option to `llm embed`. If this file does not exist yet the command will create it:
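```bash
# my-embeddings.db is an illustrative path; it will be created if it does not exist
llm embed phrases hound -d my-embeddings.db -c 'my happy hound'
```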
You can also store a JSON object containing arbitrary metadata in the `metadata` column by passing the `--metadata` option. This example uses both `--metadata` and `--store` (which stores the original content in the database alongside the vector):
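```bash
llm embed phrases hound \
  -m 3-small \
  -c 'my happy hound' \
  --metadata '{"name": "Hound"}' \
  --store
```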
The `llm embed` command embeds a single string at a time.
`llm embed-multi` can be used to embed multiple strings at once, taking advantage of any efficiencies that the embedding model may provide when processing multiple strings.
This command can be called in one of three ways:
1. With a CSV, TSV, JSON or newline-delimited JSON file
2. With a SQLite database and a SQL query
3. With one or more paths to directories, each accompanied by a glob pattern
All three mechanisms support these options:
- `-m model_id` to specify the embedding model to use
- `-d database.db` to specify a different database file to store the embeddings in
- `--store` to store the original content in the embeddings table in addition to the embedding vector
- `--prefix` to prepend a prefix to the stored ID of each item
- `--prepend` to prepend a string to the content before embedding it
The `--prepend` option is useful for embedding models that require you to prepend a special token to the content before embedding it. [nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe), for example, requires documents to be prepended with `'search_document: '` and search queries with `'search_query: '`.
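A sketch of that usage (the collection name `items` and the file `mydata.csv` are placeholders):
```bash
llm embed-multi items mydata.csv --prepend 'search_document: '
```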
When embedding data from a CSV, TSV or JSON file, the file must contain at least two columns. The first is expected to contain the ID of the item, and any subsequent columns will be treated as containing content to be embedded.
An example CSV file might look like this:
```
id,content
one,This is the first item
two,This is the second item
```
TSV would use tabs instead of commas.
JSON files can be structured like this:
```json
[
{"id": "one", "content": "This is the first item"},
{"id": "two", "content": "This is the second item"}
]
```
Or as newline-delimited JSON like this:
```json
{"id": "one", "content": "This is the first item"}
{"id": "two", "content": "This is the second item"}
```
In each of these cases the file can be passed to `llm embed-multi` like this:
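```bash
llm embed-multi items mydata.csv
```
The first argument is the name of the collection; the second is the path to the file (`mydata.csv` here is a placeholder).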
LLM will attempt to detect the format of your data automatically. If this doesn't work you can specify the format using the `--format` option. This is required if you are piping newline-delimited JSON to standard input.
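A sketch of the piped form, assuming `-` is accepted as the file argument for standard input and `nl` is the newline-delimited JSON format name:
```bash
cat mydata.json | llm embed-multi items - --format nl
```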
This example embeds the data from a JSON file into a collection called `items` in a database called `docs.db`, using the `3-small` model. It stores the original content in the `embeddings` table as well, adding a prefix of `my-items/` to each ID:
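```bash
# mydata.json is a placeholder for your own file
llm embed-multi items mydata.json \
  -d docs.db \
  -m 3-small \
  --prefix my-items/ \
  --store
```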
You can also embed data returned by a SQL query using the `--sql` option. The `docs.db` database here contains a `documents` table, and we want to embed the `title` and `content` columns from that table and store the results back in the same database:
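```bash
llm embed-multi docs \
  -d docs.db \
  --sql 'select id, title, content from documents' \
  -m 3-small \
  --store
```
The first column returned by the query is used as the ID; the remaining columns are treated as content to be embedded.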
To load content from a database other than the one you are using to store embeddings, attach it with the `--attach` option and use `alias.table` in your SQLite query:
```bash
llm embed-multi docs \
  -d embeddings.db \
  --attach other other.db \
  --sql 'select id, title, content from other.documents'
```
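The third mechanism embeds the content of files in a directory, specified using the `--files` option. This sketch (the collection name `documentation` is illustrative) embeds every Markdown file in the `docs` directory:
```bash
llm embed-multi documentation \
  -m 3-small \
  --files docs '**/*.md'
```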
Here `--files docs '**/*.md'` specifies that the `docs` directory should be scanned for files matching the `**/*.md` glob pattern - which will match Markdown files in any nested directory.
The result of the above command is an `embeddings` table with the following IDs:
```
aliases.md
contributing.md
embeddings/binary.md
embeddings/cli.md
embeddings/index.md
index.md
logging.md
plugins/directory.md
plugins/index.md
```
Each ID corresponds to the embedded content of the file in question.
Files are assumed to be `utf-8`, but LLM will fall back to `latin-1` if it encounters an encoding error. You can specify a different set of encodings using the `--encoding` option.
This example will try `utf-16` first and then `mac_roman` before falling back to `latin-1`:
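```bash
# --encoding can be passed multiple times to give an ordered list of encodings to try
llm embed-multi documentation \
  -m 3-small \
  --files docs '**/*.md' \
  --encoding utf-16 \
  --encoding mac_roman \
  --encoding latin-1
```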
The `llm similar` command searches a collection of embeddings for the items that are most similar to a given string or item ID, based on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
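For example, to find the stored items in the `phrases` collection most similar to a new string:
```bash
llm similar phrases -c 'hound puppy'
```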
This currently uses a slow brute-force approach which does not scale well to large collections. See [issue 216](https://github.com/simonw/llm/issues/216) for plans to add a more scalable approach via vector indexes provided by plugins.