You want to use Weaviate with the OpenAI module (text2vec-openai) to generate vector embeddings for you.
This notebook takes you through a simple flow to set up a Weaviate instance, connect to it (with OpenAI API key), configure data schema, import data (which will automatically generate vector embeddings for your data), and run question answering.
Weaviate is an open-source vector search engine that stores data objects together with their vectors. This allows for combining vector search with structured filtering.
Weaviate uses ANN algorithms (such as HNSW) to create a vector-optimized index, which allows your queries to run extremely fast. Learn more in the Weaviate documentation.
Weaviate lets you use your favorite ML models, and scales seamlessly to billions of data objects.
The text2vec-openai module handles vectorization at import (and during any other CRUD operation) and when you run a search query. The qna-openai module communicates with the OpenAI completions endpoint.
This is great news for you. With text2vec-openai you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.
All you need to do is the following (a short sketch follows this list):
provide your OpenAI API Key when you connect to the Weaviate client
define which OpenAI vectorizer to use in your Schema
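In code, those two touch points look roughly like this (a minimal sketch; the instance URL is a hypothetical placeholder, and the full schema used in this cookbook comes later):

import os
import weaviate

# (1) Provide your OpenAI API key as a header when creating the client
client = weaviate.Client(
    url="https://your-instance.weaviate.network",  # placeholder URL
    additional_headers={"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")},
)

# (2) Declare the OpenAI vectorizer in the class schema
client.schema.create_class({
    "class": "Article",
    "vectorizer": "text2vec-openai",
})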
(Recommended path) Weaviate Cloud Service – to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.
create a Weaviate Cluster with the following settings:
Sandbox: Sandbox Free
Weaviate Version: Use default (latest)
OIDC Authentication: Disabled
your instance should be ready in a minute or two
make a note of the Cluster Id. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: https://your-project-name.weaviate.network
To load sample data, you need the datasets library and its dependency apache-beam.
# Install the Weaviate client for Python
# (quote the version spec so the shell doesn't treat `>` as a redirection)
!pip install "weaviate-client>3.11.0"

# Install datasets and apache-beam to load the sample datasets
!pip install datasets apache-beam
Once you get your key, please add it to your environment variables as OPENAI_API_KEY.
# Export OpenAI API Key
!export OPENAI_API_KEY="your key"
# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

# Note. alternatively you can set a temporary env variable like this:
# os.environ['OPENAI_API_KEY'] = 'your-key-goes-here'

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")
After this step, the client object will be used to perform all Weaviate-related operations.
import weaviate
from datasets import load_dataset
import os

# Connect to your Weaviate instance
client = weaviate.Client(
    url="https://your-wcs-instance-name.weaviate.network/",
    # url="http://localhost:8080/",
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="<YOUR-WEAVIATE-API-KEY>"),  # comment out this line if you are not using authentication for your Weaviate instance (i.e. for locally deployed instances)
    additional_headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

# Check if your instance is live and ready
# This should return `True`
client.is_ready()
This is the second and final step that requires OpenAI-specific configuration.
After this step, the rest of the instructions will only touch on Weaviate, as the OpenAI tasks will be handled automatically.
In Weaviate you create schemas to capture each of the entities you will be searching.
A schema is how you tell Weaviate:
what embedding model should be used to vectorize the data
what your data is made of (property names and types)
which properties should be vectorized and indexed
In this cookbook we will use a dataset for Articles, which contains:
title
content
url
We want to vectorize title and content, but not the url.
To vectorize and query the data, we will use text-embedding-3-small. For Q&A we will use gpt-3.5-turbo-instruct.
# Clear up the schema, so that we can recreate it
client.schema.delete_all()
client.schema.get()

# Define the Schema object to use `text-embedding-3-small` on `title` and `content`, but skip it for `url`
article_schema = {
    "class": "Article",
    "description": "A collection of articles",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            # `text-embedding-3-small` requires a recent Weaviate version;
            # older setups used "model": "ada", "modelVersion": "002", "type": "text"
            "model": "text-embedding-3-small"
        },
        "qna-openai": {
            "model": "gpt-3.5-turbo-instruct",
            "maxTokens": 16,
            "temperature": 0.0,
            "topP": 1,
            "frequencyPenalty": 0.0,
            "presencePenalty": 0.0
        }
    },
    "properties": [
        {
            "name": "title",
            "description": "Title of the article",
            "dataType": ["string"]
        },
        {
            "name": "content",
            "description": "Contents of the article",
            "dataType": ["text"]
        },
        {
            "name": "url",
            "description": "URL to the article",
            "dataType": ["string"],
            "moduleConfig": {"text2vec-openai": {"skip": True}}
        }
    ]
}

# add the Article schema
client.schema.create_class(article_schema)

# get the schema to make sure it worked
client.schema.get()
In this section we will:
load the Simple Wikipedia dataset
configure Weaviate Batch import (to make the import more efficient)
import the data into Weaviate
Note: as mentioned before, we don't need to manually vectorize the data; the text2vec-openai module will take care of that.
### STEP 1 - load the dataset

from datasets import load_dataset
from typing import List, Iterator

# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding
dataset = list(load_dataset("wikipedia", "20220301.simple")["train"])

# For testing, limited to 2.5k articles for demo purposes
dataset = dataset[:2_500]

# Limited to 25k articles for larger demo purposes
# dataset = dataset[:25_000]

# for free OpenAI accounts, you can use 50 objects
# dataset = dataset[:50]
### Step 2 - configure Weaviate Batch, with
# - starting batch size of 10
# - dynamically increase/decrease based on performance
# - add timeout retries if something goes wrong

client.batch.configure(
    batch_size=10,
    dynamic=True,
    timeout_retries=3,
    # callback=None,
)
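With the batch configured, the import itself is a simple loop. Here is a minimal sketch of that step, assuming the Simple Wikipedia records loaded above (their fields are title, text, and url):

### Step 3 - import the data

print("Importing Articles")

counter = 0

with client.batch as batch:
    for article in dataset:
        if counter % 100 == 0:
            print(f"Import {counter} / {len(dataset)}")

        properties = {
            "title": article["title"],
            "content": article["text"],
            "url": article["url"],
        }

        # Weaviate calls OpenAI to vectorize each object as it is imported
        batch.add_data_object(properties, "Article")
        counter += 1

print("Importing Articles complete")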
# Test that all data has loaded – get object count
result = (
    client.query.aggregate("Article")
    .with_fields("meta { count }")
    .do()
)
print("Object count: ", result["data"]["Aggregate"]["Article"], "\n")
# Test one article has worked by checking one object
test_article = (
    client.query
    .get("Article", ["title", "url", "content"])
    .with_limit(1)
    .do()
)["data"]["Get"]["Article"][0]

print(test_article['title'])
print(test_article['url'])
print(test_article['content'])
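Before asking questions, you can sanity-check the semantic index with a plain vector search. A quick sketch using the client's near_text operator (the concept string is an arbitrary example):

# Semantic search: find the articles closest to an example concept
near_text = {"concepts": ["famous musicians"]}

search_result = (
    client.query
    .get("Article", ["title", "_additional { distance }"])
    .with_near_text(near_text)
    .with_limit(3)
    .do()
)

for article in search_result["data"]["Get"]["Article"]:
    print(article["title"], "-", article["_additional"]["distance"])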
As above, we'll fire some questions at our new index; Weaviate finds the closest matching articles by vector distance and extracts an answer from them.
def qna(query, collection_name):
    properties = [
        "title", "content", "url",
        "_additional { answer { hasAnswer property result startPosition endPosition } distance }"
    ]

    ask = {
        "question": query,
        "properties": ["content"]
    }

    result = (
        client.query
        .get(collection_name, properties)
        .with_ask(ask)
        .with_limit(1)
        .do()
    )

    # Check for errors
    if "errors" in result:
        print("\033[91mYou probably have run out of OpenAI API calls for the current minute – the limit is set at 60 per minute.")
        raise Exception(result["errors"][0]['message'])

    return result["data"]["Get"][collection_name]
query_result = qna("Did Alanis Morissette win a Grammy?", "Article")for i, article inenumerate(query_result):print(f"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })")
query_result = qna("What is the capital of China?", "Article")for i, article inenumerate(query_result):if article['_additional']['answer']['hasAnswer'] ==False:print('No answer found')else:print(f"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })")
Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo.