Aerospike
Aerospike Vector Search (AVS) is an extension to the Aerospike Database that enables searches across very large datasets stored in Aerospike. This new service lives outside of Aerospike and builds an index to perform those searches.
This notebook showcases the functionality of the LangChain Aerospike VectorStore integration.
Install AVS
Before using this notebook, we need to have a running AVS instance. Use one of the available installation methods.
When finished, store your AVS instance's IP address and port to use later in this demo:
PROXIMUS_HOST = "<avs-ip>"
PROXIMUS_PORT = 5000
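Before continuing, it can help to confirm that something is listening at that address. The following is a minimal sketch using only the Python standard library; `avs_reachable` is a hypothetical helper name and not part of the AVS client. A successful TCP connection shows only that the port is open, not that the service behind it is AVS.

```python
import socket


def avs_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

In the notebook this could be called as `avs_reachable(PROXIMUS_HOST, PROXIMUS_PORT)` once the two values above are set.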
Install Dependencies
The sentence-transformers dependency is large. This step could take several minutes to complete.
!pip install --upgrade --quiet aerospike-vector-search==0.6.1 langchain-community sentence-transformers langchain
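Since the install command pins `aerospike-vector-search` to 0.6.1, it may be worth confirming which versions were actually installed. The sketch below uses the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip command above.
for name in (
    "aerospike-vector-search",
    "langchain-community",
    "sentence-transformers",
    "langchain",
):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```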
Download Quotes Dataset
We will download a dataset of approximately 100,000 quotes and use a subset of those quotes for semantic search.
!wget https://github.com/aerospike/aerospike-vector-search-examples/raw/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz
--2024-05-10 17:28:17-- https://github.com/aerospike/aerospike-vector-search-examples/raw/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aerospike/aerospike-vector-search-examples/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz [following]
--2024-05-10 17:28:17-- https://raw.githubusercontent.com/aerospike/aerospike-vector-search-examples/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11597643 (11M) [application/octet-stream]
Saving to: ‘quotes.csv.tgz’
quotes.csv.tgz 100%[===================>] 11.06M 1.94MB/s in 6.1s
2024-05-10 17:28:23 (1.81 MB/s) - ‘quotes.csv.tgz’ saved [11597643/11597643]
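The downloaded file is a gzip-compressed tar archive. As a rough sketch, its contents can be listed before extraction to confirm that it holds the expected CSV and no entries that would be written outside the target directory. Both helper names below are hypothetical, and the check is deliberately simple rather than a complete safety audit.

```python
import os
import tarfile


def list_archive_members(archive_path: str) -> list[str]:
    """Return the member names stored in a .tgz archive without extracting it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def has_unsafe_member(names) -> bool:
    """Flag absolute paths or '..' components that could escape the target directory."""
    return any(
        os.path.isabs(name) or ".." in name.replace("\\", "/").split("/")
        for name in names
    )


# Only inspect the archive if it has already been downloaded.
if os.path.exists("quotes.csv.tgz"):
    members = list_archive_members("quotes.csv.tgz")
    print(members, "unsafe" if has_unsafe_member(members) else "ok")
```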
Load the Quotes Into Documents
We will load our quotes dataset using the CSVLoader document loader. In this case, lazy_load returns an iterator to ingest our quotes more efficiently. In this example, we only load 5,000 quotes.
import itertools
import os
import tarfile
from langchain_community.document_loaders.csv_loader import CSVLoader
filename = "./quotes.csv"
if not os.path.exists(filename) and os.path.exists(filename + ".tgz"):
    # Untar the file
    with tarfile.open(filename + ".tgz", "r:gz") as tar:
        tar.extractall(path=os.path.dirname(filename))
NUM_QUOTES = 5000
documents = CSVLoader(filename, metadata_columns=["author", "category"]).lazy_load()
documents = list(
    itertools.islice(documents, NUM_QUOTES)
)  # Allows us to slice an iterator
print(documents[0])
page_content="quote: I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best." metadata={'source': './quotes.csv', 'row': 0, 'author': 'Marilyn Monroe', 'category': 'attributed-no-source, best, life, love, mistakes, out-of-control, truth, worst'}
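To illustrate why `itertools.islice` pairs well with a lazy loader, here is a small standalone sketch that uses the standard library's `csv` module in place of `CSVLoader`. The `lazy_rows` generator and the sample data are made up for illustration; only the first two rows are parsed, and the third is never turned into a row.

```python
import csv
import io
import itertools


def lazy_rows(handle):
    """Yield one dict per CSV row, reading from the handle only as rows are requested."""
    for row in csv.DictReader(handle):
        yield row


# A tiny in-memory CSV standing in for the quotes file.
sample = io.StringIO("quote,author\nA,Alice\nB,Bob\nC,Carol\n")

# islice stops pulling from the generator after two items.
first_two = list(itertools.islice(lazy_rows(sample), 2))
print(first_two)
# [{'quote': 'A', 'author': 'Alice'}, {'quote': 'B', 'author': 'Bob'}]
```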
Create your Embedder
In this step, we use HuggingFaceEmbeddings and the "all-MiniLM-L6-v2" sentence transformer model to embed our documents so we can perform a vector search.
from aerospike_vector_search.types import VectorDistanceMetric
from langchain_community.embeddings import HuggingFaceEmbeddings
MODEL_DIM = 384
MODEL_DISTANCE_CALC = VectorDistanceMetric.COSINE
embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
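The two constants above describe the vectors the index will hold: each embedding should have 384 components, and vectors are compared by cosine distance. The pure-Python sketch below shows what those two settings mean in practice. Both function names are hypothetical, and this is only an illustration of the math, not how AVS or the embedder computes it internally.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def check_dimension(vector, expected_dim: int):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    return vector


# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

With the notebook's embedder, a check such as `check_dimension(embedder.embed_query("hello"), MODEL_DIM)` could catch a mismatch between the model and the index before any documents are inserted.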
Create an Aerospike Index and Embed Documents
Before we add documents, we need to create an index in the Aerospike Database. The example below first checks whether the expected index already exists and creates it only if it does not.
from aerospike_vector_search import AdminClient, Client, HostPort
from aerospike_vector_search.types import VectorDistanceMetric
from langchain_community.vectorstores import Aerospike
# Here we are using the AVS host and port you configured earlier
seed = HostPort(host=PROXIMUS_HOST, port=PROXIMUS_PORT)
# The namespace in which to store our vectors. This should match the namespace configured in your docstore.conf file.
NAMESPACE = "test"
# The name of our new index.
INDEX_NAME = "quote-miniLM-L6-v2"
# AVS needs to know which metadata key contains our vector when creating the index and inserting documents.
VECTOR_KEY = "vector"
client = Client(seeds=seed)
admin_client = AdminClient(
    seeds=seed,
)
index_exists = False
# Check if the index already exists. If not, create it
for index in admin_client.index_list():
    if index["id"]["namespace"] == NAMESPACE and index["id"]["name"] == INDEX_NAME:
        index_exists = True
        print(f"{INDEX_NAME} already exists. Skipping creation")
        break
if not index_exists:
    print(f"{INDEX_NAME} does not exist. Creating index")
    admin_client.index_create(
        namespace=NAMESPACE,
        name=INDEX_NAME,
        vector_field=VECTOR_KEY,
        vector_distance_metric=MODEL_DISTANCE_CALC,
        dimensions=MODEL_DIM,
        index_meta_data={
            "model": "miniLM-L6-v2",
            "date": "05/04/2024",
            "dim": str(MODEL_DIM),
            "distance": "cosine",
        },
    )
admin_client.close()
docstore = Aerospike.from_documents(
    documents,
    embedder,
    client=client,
    namespace=NAMESPACE,
    vector_key=VECTOR_KEY,
    index_name=INDEX_NAME,
    distance_strategy=MODEL_DISTANCE_CALC,
)
quote-miniLM-L6-v2 does not exist. Creating index
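The existence check in the cell above can also be written as a small reusable function. This sketch assumes that each entry returned by `admin_client.index_list()` is shaped as the loop above treats it, with `["id"]["namespace"]` and `["id"]["name"]` keys. The helper name `find_index` and the sample data are hypothetical.

```python
def find_index(indexes, namespace: str, name: str) -> bool:
    """Return True if any index entry matches the given namespace and name."""
    return any(
        entry["id"]["namespace"] == namespace and entry["id"]["name"] == name
        for entry in indexes
    )


# Made-up entries mimicking the shape used in the loop above.
sample_indexes = [
    {"id": {"namespace": "test", "name": "quote-miniLM-L6-v2"}},
    {"id": {"namespace": "other", "name": "some-index"}},
]

print(find_index(sample_indexes, "test", "quote-miniLM-L6-v2"))  # True
print(find_index(sample_indexes, "test", "missing-index"))  # False
```

In the notebook, the call would be `find_index(admin_client.index_list(), NAMESPACE, INDEX_NAME)`, made before `admin_client.close()`.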