Embedbase-py - Python client for Embedbase
Embedbase-py is a Python SDK for interacting with Embedbase. It provides a simple and convenient way to access Embedbase's features, such as searching datasets, adding data, and creating contexts.
⚠️ Status: Alpha release ⚠️
This is not an officially launched product and currently lacks documentation. Please use at your own risk. If you are using this library, please let us know by opening an issue or contacting us on Discord.
Getting Started
To install the Embedbase-py library, run the following command:
pip install git+https://github.com/different-ai/embedbase.git@main#subdirectory=sdk/embedbase-py
- Usage
- Initializing the Client
- Searching Datasets
- Adding Data
- Creating a Context
- Splitting and Chunking Large Texts
- Dealing with large datasets
- Contributing
Usage
Initializing the Client
To get started, import the EmbedbaseClient class and create a new instance:
from embedbase_client.client import EmbedbaseClient
embedbase_url = "https://api.embedbase.xyz"
embedbase_key = "<get your key here: https://app.embedbase.xyz>"
client = EmbedbaseClient(embedbase_url, embedbase_key)
In an async context, you can use the EmbedbaseAsyncClient class instead. It provides the same methods as EmbedbaseClient, but they are all asynchronous.
from embedbase_client.client import EmbedbaseAsyncClient
Remember to use await when calling methods on EmbedbaseAsyncClient objects.
Learn more about asynchronous Python here.
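As a quick sketch (assuming only what is stated above: the async client mirrors the synchronous API, with every method awaited):

import asyncio
from embedbase_client.client import EmbedbaseAsyncClient

async def main():
    client = EmbedbaseAsyncClient(embedbase_url, embedbase_key)
    # same call shape as the sync client, but awaited
    results = await client.dataset("your_dataset_name").search("your_query")
    print(results)

asyncio.run(main())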
Searching Datasets
To search a dataset, call the search method on a Dataset object:
dataset = client.dataset("your_dataset_name")
search_results = dataset.search("your_query", limit=5)
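The exact shape of each result may vary between SDK versions; as an illustrative sketch, assuming each result exposes data and similarity attributes:

for result in search_results:
    # `data` and `similarity` are assumed attribute names here;
    # inspect a result object to confirm the shape in your version
    print(result.similarity, result.data)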
Adding Data
To add data to a dataset, call the add method on a Dataset object:
document = "your_document_text"
metadata = {"key": "value"}
result = dataset.add(document, metadata)
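To add several documents at once, you can use batch_add (also used below for chunked texts). A minimal sketch, assuming each document is a dict with a data field and an optional metadata field:

documents = [
    {"data": "first document", "metadata": {"source": "notes"}},
    {"data": "second document"},
]
results = client.dataset("your_dataset_name").batch_add(documents)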
Creating a Context
To create a context, call the create_context method on a Dataset object:
query = "your_query"
context = dataset.create_context(query, limit=5)
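A context is handy for feeding a language model. A minimal sketch, assuming create_context returns a list of text snippets:

# join the snippets into a prompt for your language model
prompt = "Answer the question based on the context below.\n\n"
prompt += "Context:\n" + "\n---\n".join(context)
prompt += f"\n\nQuestion: {query}\nAnswer:"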
Splitting and Chunking Large Texts
AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks.
We highly recommend using this function.
To split and chunk large texts, use the split_text function from the split module:
from embedbase_client.split import split_text
text = "your_long_text"
# ⚠️ The right value of max_tokens depends on the embedding model
# Embedbase uses. With models such as OpenAI's embedding model you
# can use a max_tokens of 500; with other models you may need a
# lower value. (Embedbase Cloud currently uses the OpenAI model.) ⚠️
max_tokens = 500
# chunk_overlap is the number of tokens that will overlap between chunks
# it is useful to have some overlap to ensure that the context is not
# cut off in the middle of a sentence.
chunk_overlap = 200
chunks = split_text(text, max_tokens, chunk_overlap)
# then ...
documents = []
for c in chunks:
    documents.append({
        "data": c.chunk,
    })
result = client.dataset("my-dataset").batch_add(documents)
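If you want to keep track of where each chunk came from, one option is to store the chunk index in the metadata field shown in Adding Data above. A small sketch:

documents = [
    # chunk_index is a hypothetical metadata key, useful for
    # reassembling or citing chunks later
    {"data": c.chunk, "metadata": {"chunk_index": i}}
    for i, c in enumerate(chunks)
]
result = client.dataset("my-dataset").batch_add(documents)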
Dealing with large datasets
If your dataset is large, we recommend running parallel requests like so:
import asyncio

dataset_id = "my-dataset"

async def batch(my_list, fn, batch_size=100):
    async def process_chunk(chunk):
        return await fn(chunk)

    # split the list into batches of batch_size and run them concurrently
    tasks = []
    for i in range(0, len(my_list), batch_size):
        chunk = my_list[i:i + batch_size]
        tasks.append(asyncio.create_task(process_chunk(chunk)))
    results = await asyncio.gather(*tasks)
    return results

async def batch_add_fn(chunk):
    # brief pause to be gentle on rate limits
    await asyncio.sleep(1)
    return client.dataset(dataset_id).batch_add(chunk)

results = await batch(documents, batch_add_fn)
print(f"Results: {results}")
Contributing
We welcome contributions to Embedbase-py.
If you have any feedback or suggestions, please open an issue or join our Discord to discuss your ideas.