a week ago
I need help writing a filter. I want to pre-filter a vector index before performing a hybrid search, and wrap this in a function. Below is a simple example of searching for products for a given customer. The pre-filter is key: it enforces authorization by restricting the index to the customer's rows before the top k is applied, which also reduces the vector space that has to be searched. In the SQL function, I am not seeing any filter capability like the one available when calling the API.
Example of the search API with prefilter
results = index.similarity_search(
    query_text=question,
    query_type="HYBRID",
    columns=["content", "product", "product_description", "product_id", "purchase_date"],
    # Pre-filter: only rows belonging to this customer are searched
    filters={"customer_email": customer_email},
    num_results=5,
)
Below is the SQL Function I need help on
a week ago
@the_peterlandis, yes, the vector_search SQL function currently doesn't support pre-filtering. However, if you need a UC function for this, you can implement it in Python and pass filters to the client, something like the following.
%sql
CREATE OR REPLACE FUNCTION kaushal.kaushal.vector_similarity_search(
  query_text STRING,
  filter_id INT,
  num_results INT
)
RETURNS STRING
LANGUAGE PYTHON
COMMENT "Vector similarity search using authenticated client"
ENVIRONMENT (
  dependencies = '["databricks-vectorsearch", "databricks-sdk"]',
  environment_version = 'None'
)
AS $$
import json
import os

# Get credentials from Databricks secrets
# You'll need to set these up first
def get_secret(scope, key):
    from databricks.sdk import WorkspaceClient
    w = WorkspaceClient()
    return w.secrets.get_secret(scope=scope, key=key).value

# Alternative: if dbutils is available in UDF context
# token = dbutils.secrets.get(scope="your-scope", key="databricks-token")
# host = dbutils.secrets.get(scope="your-scope", key="databricks-host")

# Set environment variables for authentication
os.environ['DATABRICKS_HOST'] = get_secret("your-scope", "databricks-host")
os.environ['DATABRICKS_TOKEN'] = get_secret("your-scope", "databricks-token")

from databricks.vector_search.client import VectorSearchClient

# Initialize client - should now pick up environment variables
client = VectorSearchClient()
index = client.get_index(
    endpoint_name="vector-search-demo-endpoint-kaushal",
    index_name="kaushal.kaushal.my_text_data_index"
)
results = index.similarity_search(
    query_text=query_text,
    columns=["id", "content"],
    filters={"id": [filter_id]},
    num_results=num_results
)
return json.dumps(results.get('result', {}).get('data_array', []))
$$;
Then run your UC function with SQL, and you should get the expected results.
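Since the function returns the data_array as a JSON string, the caller has to decode it. Here is a rough sketch of how that could look on the Python side. The helper name parse_search_rows and the sample payload are made up for illustration, and I am assuming each row holds the requested columns in order, followed by a similarity score; check the actual output of your index before relying on that layout.

```python
import json


def parse_search_rows(payload: str, columns: list[str]) -> list[dict]:
    """Decode the JSON string returned by the UC function into dicts.

    Assumption: each row lists the requested columns in order, and any
    extra trailing value is the similarity score.
    """
    rows = json.loads(payload)
    parsed = []
    for row in rows:
        record = dict(zip(columns, row))
        # Treat a value beyond the requested columns as the score (assumption)
        if len(row) > len(columns):
            record["score"] = row[len(columns)]
        parsed.append(record)
    return parsed


# Hypothetical payload shaped like the function's return value
sample = json.dumps([[7, "blue widget", 0.83], [7, "red widget", 0.61]])
records = parse_search_rows(sample, ["id", "content"])
```

The column list passed to the helper should match the columns argument used inside the function, here ["id", "content"].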
a week ago
According to this documentation, the VECTOR_SEARCH SQL function cannot apply a pre-filter, even though pre-filtering is a fundamental capability for vector search. I'm very surprised this is not supported.
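To illustrate why the order matters, here is a toy in-memory sketch in plain Python, not the Databricks API. The rows, vectors, and email addresses are invented, and the brute-force cosine ranking only stands in for a real index. Filtering after the top k can return fewer than k authorized rows, while filtering first always ranks within the allowed set.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(rows, query, k):
    """Brute-force top-k by cosine similarity (stand-in for an index)."""
    return sorted(rows, key=lambda r: cosine(r["vec"], query), reverse=True)[:k]


# Invented toy data: two customers, four rows
ROWS = [
    {"id": 1, "customer_email": "a@example.com", "vec": [1.0, 0.0]},
    {"id": 2, "customer_email": "b@example.com", "vec": [0.9, 0.1]},
    {"id": 3, "customer_email": "b@example.com", "vec": [0.8, 0.2]},
    {"id": 4, "customer_email": "a@example.com", "vec": [0.0, 1.0]},
]


def pre_filter_search(rows, query, email, k):
    # Restrict to the customer's rows first, then rank
    allowed = [r for r in rows if r["customer_email"] == email]
    return top_k(allowed, query, k)


def post_filter_search(rows, query, email, k):
    # Rank everything first, then drop rows the customer may not see
    return [r for r in top_k(rows, query, k) if r["customer_email"] == email]


query = [1.0, 0.0]
pre = pre_filter_search(ROWS, query, "a@example.com", 2)
post = post_filter_search(ROWS, query, "a@example.com", 2)
print([r["id"] for r in pre])   # ids returned with pre-filtering
print([r["id"] for r in post])  # ids returned with post-filtering
```

In this toy data the other customer's rows crowd the unfiltered top 2, so the post-filtered search comes back with a single row even though two authorized rows exist.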