The SDK is still in alpha. If you have feedback, join our Discord 🔥
Find it on GitHub.
Embedbase
Open-source API & SDK to connect any data to ChatGPT
Before you start, you need to get an API key at app.embedbase.xyz.
Note: we're working on a fully client-side SDK. In the meantime, you can use the hosted instance of Embedbase.
npm i embedbase-js
Table of contents
- What is it
- Installation
- Searching
- Adding data
- Splitting and chunking large texts
- Creating a "context"
- Adding metadata
- Listing datasets
- Example: Create a recommendation engine
Design philosophy
- Simple
- Open-source
- Composable (integrates well with LLM & various databases)
What is it
This is the official TypeScript client for Embedbase. Embedbase is an open-source API to connect your data to ChatGPT.
Who is it for
People who want to
- plug their own data into ChatGPT.
Installation
You can install embedbase-js with npm:
npm i embedbase-js
Initializing
import { createClient } from 'embedbase-js'
// you can find the api key at https://embedbase.xyz
const apiKey = 'your api key'
// this is using the hosted instance
const url = 'https://api.embedbase.xyz'
const embedbase = createClient(url, apiKey)
Searching datasets
// fetching data
const data = await embedbase
.dataset('amazon-reviews')
.search('best hot dogs accessories', { limit: 3 })
console.log(data)
// [
// {
// "similarity": 0.810843349,
// "data": "The world is going to smell very different once electric vehicles become commonplace"
// },
// {
// "similarity": 0.794602573,
// "data": "200 years ago, people would never have guessed that humans in the future would communicate by silently tapping on glass"
// },
// {
// "similarity": 0.792932034,
// "data": "The average car in space is nicer than the average car on Earth"
// },
// ]
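Each result pairs the stored document with a similarity score, where higher means a closer match. As a minimal sketch, assuming the result shape above, you could pick out the top match like this:
const [topResult] = data
if (topResult) {
  console.log(`best match (${topResult.similarity.toFixed(2)}): ${topResult.data}`)
}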
Adding data
// embeddings are extremely good for retrieving unstructured data
// in this example we store an unparsable html string
const data = await embedbase.dataset('amazon-reviews').add(`
  <div>
    <span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`)
console.log(data)
//
// {
// "id": "eiew823",
// "data": "Lightweight. Telescopic. Easy zipper case for storage.
// Didn't put in dishwasher. Still perfect after many uses."
// }
If you have many documents to add, you should use batchAdd:
embedbase.dataset(datasetId).batchAdd([{
  data: 'some text',
}])
For better performance, you can split your documents into batches and upload them concurrently with Promise.all:
import { BatchAddDocument } from 'embedbase-js'

const batch = async (myList: BatchAddDocument[], fn: (chunk: BatchAddDocument[]) => Promise<any>) => {
  const batchSize = 100
  // slice the list into chunks of batchSize documents and run fn on all of them concurrently
  return Promise.all(
    myList.reduce((acc: BatchAddDocument[][], _, i) => {
      if (i % batchSize === 0) {
        acc.push(myList.slice(i, i + batchSize))
      }
      return acc
    }, []).map(fn)
  )
}
batch(chunks, (chunk) => embedbase.dataset(datasetId).batchAdd(chunk))
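This uploads the documents in concurrent requests of 100 at a time; if you run into rate limits, lower batchSize.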
Splitting and chunking large texts
AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks.
We highly recommend using this function.
To split and chunk large texts, use the splitText function:
import { splitText } from 'embedbase-js/dist/main/split'

const text = 'some very long text...'
// ⚠️ note that the right value for maxTokens depends on the
// embedding model used by your Embedbase instance.
// With models such as OpenAI's embeddings model, you can
// use a maxTokens of 500. With other models, you may need to
// use a lower value.
// (Embedbase Cloud uses the OpenAI model at the moment) ⚠️
const maxTokens = 500
// chunkOverlap is the number of tokens that will overlap between chunks.
// Some overlap is useful to ensure that the context is not
// cut off in the middle of a sentence.
const chunkOverlap = 200
splitText(text, { maxTokens, chunkOverlap }, async ({ chunk, start, end }) =>
  embedbase.dataset('some-data-set').add(chunk)
)
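To ingest a long document end to end, you can combine splitText with the batch helper from above. This is a sketch, not a definitive recipe: it assumes splitText resolves only after the callback has run for every chunk, and 'my-dataset' is a placeholder name:
import { BatchAddDocument } from 'embedbase-js'
import { splitText } from 'embedbase-js/dist/main/split'

const ingest = async (text: string, datasetId: string) => {
  // collect the chunks first, then upload them in concurrent batches
  const documents: BatchAddDocument[] = []
  await splitText(text, { maxTokens: 500, chunkOverlap: 200 }, async ({ chunk }) => {
    documents.push({ data: chunk })
  })
  return batch(documents, (chunk) => embedbase.dataset(datasetId).batchAdd(chunk))
}

await ingest('some very long text...', 'my-dataset')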
Check out how we send our documentation to Embedbase to let you ask it questions through GPT-4.
Creating a "context"
createContext is very similar to .search, but it returns strings instead of objects. This is useful if you want to easily feed it to GPT.
// a context is the most relevant documents for a query, returned as strings
const data = await embedbase
  .dataset('my-documentation')
  .createContext('my-context')
console.log(data)
// [
//   "Embedbase API allows to store unstructured data...",
//   "Embedbase API has 3 main functions a) provides a plug and play solution to store embeddings b) makes it easy to connect to get the right data into llms c)..",
//   "Embedbase API is self-hostable...",
// ]
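Since createContext returns plain strings, a common next step is to join them into an LLM prompt. A minimal sketch (the question and prompt wording are just examples; calling the LLM is up to you):
const question = 'is embedbase self-hostable?'
const context = await embedbase
  .dataset('my-documentation')
  .createContext(question)
// assemble a prompt; the wording here is only an example
const prompt = `Answer the question based only on the context below.

Context:
${context.join('\n---\n')}

Question: ${question}`
// send `prompt` to the LLM of your choice, e.g. via the OpenAI API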
Adding metadata
// metadata can be anything you want; it will appear in the search results later
const data = await embedbase.dataset('amazon-reviews').add(`
  <div>
    <span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`, { category: 'smallItems', user: 'bob' })
console.log(data)
//
// {
// "id": "eiew823",
// "data": "Lightweight. Telescopic. Easy zipper case for storage.
// Didn't put in dishwasher. Still perfect after many uses.",
// "metadata": {"category": "smallItems", "user": "bob"}
// }
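Metadata comes back with search results, so you can use it to filter or attribute hits client-side. A minimal sketch, assuming the dataset above and that your documents were stored with a user field:
const results = await embedbase
  .dataset('amazon-reviews')
  .search('storage case')
// each hit carries the metadata it was stored with
const bobsReviews = results.filter((hit) => hit.metadata?.user === 'bob')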
Listing datasets
const data = await embedbase.datasets()
console.log(data)
// [{"datasetId": "amazon-reviews", "documentsCount": 2}]
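The result is a plain array, so you can, for example, print a one-line summary per dataset:
for (const { datasetId, documentsCount } of data) {
  console.log(`${datasetId}: ${documentsCount} documents`)
}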
Create a recommendation engine
Check out this tutorial.