Embedbase Documentation

sdk is still in alpha, if you if you have some feedback join our discord (opens in a new tab) 🔥

Embedbase

Open-source API & SDK to connect any data to ChatGPT

Before you start, you need get a an API key at app.embedbase.xyz (opens in a new tab).

Note: we're working on a fully client-side SDK. In the meantime, you can use the hosted instance of Embedbase.

npm i embedbase-js

Design philosophy

Simple
Open-source
Composable (integrates well with LLM & various databases)

What is it

This is the official typescript client for Embedbase. Embedbase is an open-source API to connect your data to ChatGPT.

Who is it for

People who want to

plug their own data into ChatGPT.

Installation

You can install embedbase-js via the terminal.

npm i embedbase-js

Initializing

import { createClient } from 'embedbase-js'
 
// you can find the api key at https://embedbase.xyz
const apiKey = 'your api key'
// this is using the hosted instance
const url = 'https://api.embedbase.xyz'
 
const embedbase = createClient(url, apiKey)

Searching datasets

// fetching data
const data = await embedbase
  .dataset('amazon-reviews')
  .search('best hot dogs accessories', { limit: 3 })
 
console.log(data)
// [
//   {
//       "similarity": 0.810843349,
//       "data": "The world is going to smell very different once electric      vehicles become commonplace"
//   },
//   {
//       "similarity": 0.794602573,
//       "data": "200 years ago, people would never have guessed that humans in the future would communicate by silently tapping on glass"
//   },
//   {
//       "similarity": 0.792932034,
//       "data": "The average car in space is nicer than the average car on Earth"
//   },
// ]

Adding Data

const data =
  await // embeddings are extremely good for retrieving unstructured data
  // in this example we store an unparsable html string
  embedbase.dataset('amazon-reviews').add(`
  <div>
    <span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`)
 
console.log(data)
//
// {
//   "id": "eiew823",
//   "data": "Lightweight. Telescopic. Easy zipper case for storage.
//          Didn't put in dishwasher. Still perfect after many uses."
// }

If you have many documents to add, you should use batchAdd:

embedbase.dataset(datasetId).batchAdd([{
  data: 'some text',
}])

For better performance, you can use batches with Promise.all:

const batch = async (myList: any[], fn: (chunk: any[]) => Promise<any>) => {
    const batchSize = 100;
    return Promise.all(
        myList.reduce((acc: BatchAddDocument[][], chunk, i) => {
            if (i % batchSize === 0) {
                acc.push(myList.slice(i, i + batchSize));
            }
            return acc;
        }, []).map(fn)
    )
}
batch(chunks, (chunk) => embedbase.dataset(datasetId).batchAdd(chunk))

Splitting and chunking large texts

AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks. We highly recommend using this function. To split and chunk large texts, use the splitText function:

import { splitText } from 'embedbase-js/dist/main/split';
 
const text = 'some very long text...';
// ⚠️ note here that the value of maxTokens depends
// on the used embedder in embedbase.
// With models such as OpenAI's embeddings model, you can
// use a maxTokens of 500. With other models, you may need to
// use a lower maxTokens value.
// (embedbase cloud use openai model at the moment) ⚠️
const maxTokens = 500
// chunk_overlap is the number of tokens that will overlap between chunks
// it is useful to have some overlap to ensure that the context is not
// cut off in the middle of a sentence.
const chunkOverlap = 200
splitText(text, { maxTokens: maxTokens, chunkOverlap: chunkOverlap }, async ({ chunk, start, end }) =>
    embedbase.dataset('some-data-set').add(chunk)
)

Check how we send our documentation to Embedbase (opens in a new tab) to let you ask it questions through GPT-4.

Creating a "context"

createContext is very similar to .search but it returns strings instead of an object. This is useful if you want to easily feed it to GPT.

// you can create a context to store data
const data = await embedbase
  .dataset('my-documentation')
  .createContext('my-context')
 
console.log(data)
[
 "Embedbase API allows to store unstructured data...",
 "Embedbase API has 3 main functions a) provides a plug and play solution to store embeddings b) makes it easy to connect to get the right data into llms c)..",
 "Embedabase API is self-hostable...",
]

Adding metadata

const data =
  await
  embedbase.dataset('amazon-reviews').add(`
  <div>
    <span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
    // metadata can be anything you want that will appear in the search results later
`, {category: 'smallItems', user: 'bob'})
 
console.log(data)
//
// {
//   "id": "eiew823",
//   "data": "Lightweight. Telescopic. Easy zipper case for storage.
//          Didn't put in dishwasher. Still perfect after many uses.",
//   "metadata": {"category": "smallItems", "user": "bob"}
// }

Listing datasets

const data = await embedbase.datasets()
console.log(data)
// [{"datasetId": "amazon-reviews", "documentsCount": 2}]

Create a recommendation engine

Check out this tutorial (opens in a new tab).

🏡 Start Here 🐍 Python SDK