Try an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.
Goal and Context
My name is Mani. I'm a search practitioner and an intermediate Python coder, and I'm familiar with lexical search in Apache Solr, but not with Solr's re-ranking and hybrid search features. Help me onboard using Polya's step-by-step approach. I have also provided an implementation of this hybrid search feature in a llamastack-based client. First, explain how that implementation works in plain natural language with a very simple example; then we can learn thoroughly how hybrid search works, with gradually increasing complexity.
https://r.jina.ai/https://sease.io/2023/12/hybrid-search-with-apache-solr.html
Title: Hybrid Search with Apache Solr
URL Source: https://sease.io/2023/12/hybrid-search-with-apache-solr.html
Published Time: 2023-12-12T16:12:44+00:00
Markdown Content: Vector-based search gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective in encoding text to vectors and representing some of the semantics of sentences in a numerical form.
These vectors can be used to run a K-nearest neighbour search and look for documents/paragraphs close to the query in an n-dimensional vector space, effectively mimicking a similarity search in the semantic space.
Limitations
Although exciting, vector-based search nowadays still presents many limitations:
- it's very difficult to explain – e.g. why is document A returned, and why at position K?
- it works better on paragraphs, so text chunking is strongly recommended – most vector embedding models are fine-tuned on sentences of a certain length, so it works better on queries and documents of such length
- it doesn't care about exact keyword matching, and users still rely on keyword searches a lot.
Hybrid Search
It's reasonable to think that these limitations may be solved in the future, but right now an extremely popular trend is to mitigate them with search solutions that combine lexical (traditional keyword-based) search with neural (vector-based) search.
So, what does it mean to combine these two worlds?
It starts with the retrieval of two sets of candidates:
- one set of results coming from lexical matches with the query keywords
- a set of results coming from the K-Nearest Neighbours search with the query vector
Then these results must be combined and presented in a ranking that maximises the relevance for the user query.
This blog post focuses on how you can run Hybrid Search with Apache Solr, so let's dive into the details!
Apache Solr supports Hybrid Search
The heading clarifies the first misconception I've read in many places: Apache Solr supports Hybrid Search, but right now it's not documented (SOLR-17103).
Hybrid approaches are a relatively new trend, and back when we contributed the vector search capabilities they were not a big thing, so we didn't add an explicit section to the reference guide.
But effectively, some form of Hybrid Search has been supported ever since vector-based search was implemented.
**N.B.** This blog focuses on query time; your schema needs to be configured with traditional fields and solr.DenseVectorField (see the Apache Solr neural search material if you want to know more).
Retrieval Stage
To retrieve the two sets of candidates, an old friend comes to the rescue: the Boolean Query Parser.
This query parser can combine queries through boolean clauses; each clause can be a different type of query, using its own query parser.
Using the Boolean Query Parser, we can leverage various boolean conditions to build the hybrid candidate result set with flexibility:
Union
q = {!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery = {!type=edismax qf=text_field}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The hybrid candidate result set is the union of the results coming from the two models:
the top-K results coming from the K-Nearest Neighbours search and the numFound results coming from the lexical (keyword-based) search.
The cardinality of the combined result set is <= (K + numFound).
The result set doesn’t include any duplicates.
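If you want to try this from a client, here is a minimal Python sketch of issuing the union query above; the Solr URL, core name, and field names are placeholders to adjust for your deployment:
import requests

# Minimal sketch: union-style hybrid retrieval via the Boolean Query Parser.
# Placeholders: local Solr URL, "techproducts" core, "text_field" and "vector" fields.
solr_url = "http://localhost:8983/solr/techproducts/select"
query_vector = [0.001, -0.422, -0.284]  # normally produced by an embedding model

params = {
    "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
    "lexicalQuery": "{!type=edismax qf=text_field}term1",
    "vectorQuery": "{!knn f=vector topK=10}[" + ",".join(str(v) for v in query_vector) + "]",
    "fl": "id,score",
    "wt": "json",
}

# POST keeps long query vectors out of the URL.
response = requests.post(solr_url, data=params)
response.raise_for_status()
docs = response.json()["response"]["docs"]
Swapping both should clauses for must in q gives the intersection variant described next.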
Intersection
q = {!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery = {!type=edismax qf=text_field}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The hybrid candidate result set is the intersection of the results coming from the two models:
only the top-K results coming from the K-Nearest Neighbours search that also satisfy the lexical query are returned.
The cardinality of the combined result set is <= K.
This is effectively post-filtering the K-Nearest Neighbours results, but it also affects the score; let's see what that means.
Bonus point: Pre-Filtering and Post-filtering
With Apache Solr **< 9.1**, when adding a filter query (fq) to a K-Nearest Neighbours vector search, a post-filter was applied, so only the subset of K-Nearest Neighbours results matching the filter was returned (with a cardinality <= K).
Since 9.1, filter queries are pre-filters by default (SOLR-16246); this means the filtering condition is applied as Apache Solr (Lucene, actually) retrieves the K-Nearest Neighbours.
So if at least K documents satisfy the filter, you are guaranteed to retrieve K results.
You can still run filter queries as post-filters by specifying the filter cost, in the same fashion you would do post-filtering with any other Solr query (https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter).
e.g.
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq={!frange cache=false l=0.99}$q
N.B. frange has a default cost of 100.
Should I do post-filtering or Boolean Query Parser intersection?
With post-filtering you don't calculate a lexical score for the filter clause, so our recommendation is to pick one approach or the other depending on whether you want the lexical score to impact the vector-based score (and affect the ranking).
Let’s see more about scoring, in the next section.
Ranking Stage
Once we have the hybrid candidate set, we want to calculate a score for each document that reflects the best ordering in terms of relevance for the user query.
Out of the box, we get a score between -1 and 1 from the K-Nearest Neighbours search, which is summed with an unbounded score from the lexical side (one that could be way above that scale).
To be fair, there’s no quick answer to how to best combine the two scores and guarantee the best final ranking, but let’s see the options we have in Apache Solr.
**N.B.** In the future, we would love to see more hybrid retrieval algorithms implemented, such as Reciprocal Rank Fusion (do you want to make this happen? Take a look at our AI roadmap for Apache Solr).
Sum Normalised Scores
q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}sum(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The filter component ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.
The lexical score is min-max normalised to be scaled between 0 and 1 and it is summed to the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
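For intuition, here is a tiny worked example (illustrative numbers, not real Solr scores) of what the scale-then-sum combination does to the ranking:
# Illustrative only: mimics scale(query($lexicalQuery),0,1) summed with the knn score.
lexical = {"doc1": 12.3, "doc2": 11.9, "doc3": 4.1}   # unbounded BM25-style scores
vector  = {"doc1": 0.62, "doc2": 0.91, "doc3": 0.70}  # knn scores, roughly 0..1

lo, hi = min(lexical.values()), max(lexical.values())
normalised = {d: (s - lo) / (hi - lo) for d, s in lexical.items()}  # min-max to [0, 1]

hybrid = {d: normalised[d] + vector[d] for d in lexical}
print(sorted(hybrid, key=hybrid.get, reverse=True))
# ['doc2', 'doc1', 'doc3']: doc2 overtakes doc1 once both scores are on a comparable scale.
In Solr the same arithmetic happens inside the function queries above; this sketch only shows the effect.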
Multiply Normalised Scores
q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}product(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0.1,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The filter component ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.
The lexical score is min-max normalised to be scaled between 0.1 and 1 and it is multiplied by the K-Nearest Neighbours score.
There's no evidence that this brings better results than the simple sum; it's always recommended to build a prototype and test your assumptions on real queries and rated datasets (e.g. with RRE).
Learning To Rank
In hybrid search or any other search scenario where you need to combine multiple factors (features) to build the final ranking, there’s no quick answer to the mathematical function to use to combine such features.
Sum? Normalised sum? Product? A linear or non-linear function?
It's likely you'll want to solve this problem by letting a machine learning model learn the function.
Apache Solr has supported Learning To Rank since 6.4, and in 9.3 we contributed basic support for vector similarity function queries, which can be used as features in a Learning To Rank model.
If you want to know more about Apache Solr Learning To Rank: https://sease.io/category/learning-to-rank
The idea is to build a training set, outside Solr where you define a feature vector that describes your <query, document> pair and a rating that states how relevant the document is for the query.
This feature vector contains as many features as you like, but for the sake of this blog, we’ll focus on two features:
[
{
"name": "lexicalScore",
"class": "org.apache.solr.ltr.feature.SolrFeature",
"params": { "q" : "{!func}scale(query(${lexicalQuery}),0,1)" },
"store": "feature-store-1"
},
{
"name": "vectorSimilarityScore",
"class": "org.apache.solr.ltr.feature.SolrFeature",
"params": { "q" : "{!func}vectorSimilarity(FLOAT32, DOT_PRODUCT, vectorField, ${queryVector})" },
"store": "feature-store-1"
}
]
The first feature is a normalised lexical score, and the second is the vector similarity.
Then you train a model that learns from the training dataset how to best combine such scores:
e.g. this could be the resulting model after training:
{
"class":"org.apache.solr.ltr.model.LinearModel",
"name":"linear",
"features":[
{"name":"lexicalScore"},
{"name":"vectorSimilarityScore"}
],
"params":{
"weights":{
"lexicalScore":1.0,
"vectorSimilarityScore":2.0
}
}
}
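For intuition only, here is a toy offline sketch of how such weights could be fitted, assuming scikit-learn is available; this is a pointwise simplification, and real Learning To Rank training usually relies on dedicated tooling and richer judgment lists:
# Toy pointwise fit: learn weights for (lexicalScore, vectorSimilarityScore) -> rating.
from sklearn.linear_model import LinearRegression

# Each row describes a <query, document> pair: [lexicalScore, vectorSimilarityScore]; y is the human rating.
X = [[0.9, 0.80], [0.2, 0.95], [0.7, 0.30], [0.1, 0.20]]
y = [3, 2, 1, 0]

model = LinearRegression().fit(X, y)
print(dict(zip(["lexicalScore", "vectorSimilarityScore"], model.coef_)))
# The fitted coefficients play the role of the "weights" block in the LinearModel JSON above.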
Finally you can rerank your hybrid candidate set:
q = {!bool should=$normalisedLexicalQuery should=$vectorQuery}&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]&
rq={!ltr model=linear reRankDocs=100 efi.lexicalQuery='{!type=edismax qf=field1 v=term1}' efi.queryVector='[0.001, -0.422, -0.284, ...]'}&
fl=id,score,[features]
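As a client-side sketch, the same request could be sent from Python roughly like this; it assumes the feature store and the linear model above have already been uploaded to Solr, and uses placeholder URLs and field names:
import requests

solr_url = "http://localhost:8983/solr/techproducts/select"  # placeholder
vector = "[0.001, -0.422, -0.284]"  # query embedding, Solr-formatted

params = {
    "q": "{!bool should=$normalisedLexicalQuery should=$vectorQuery}",
    "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
    "lexicalQuery": "{!type=edismax qf=field1}term1",
    "vectorQuery": "{!knn f=vector topK=10}" + vector,
    # Rerank the top 100 hybrid candidates with the uploaded LTR model.
    "rq": "{!ltr model=linear reRankDocs=100 "
          "efi.lexicalQuery='{!type=edismax qf=field1 v=term1}' "
          "efi.queryVector='" + vector + "'}",
    "fl": "id,score,[features]",
    "wt": "json",
}

docs = requests.post(solr_url, data=params).json()["response"]["docs"]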
Final Considerations
There's a lot that can and will be done to improve support for hybrid search in Apache Solr, but it's important to know the basics and what can be achieved with what is available right now. Ideally, in the future we'll have a dedicated set of components to retrieve the two separate candidate sets and combine them with performant, state-of-the-art algorithms. Currently, many of the scores are re-calculated on the fly; it's not ideal and there's a lot of room for improvement!
What is Hybrid Search in Solr?
Hybrid search in Apache Solr combines traditional keyword-based (lexical) search with neural (vector-based) search. This approach leverages the strengths of both methods: keyword search excels at exact matches and recall, while vector search captures semantic meaning and intent. The result is a more robust and relevant search experience, especially for complex or ambiguous queries.[1][2][3]
How Hybrid Search Works in Solr
Step 1: Retrieve Candidates
- Run a traditional keyword search (e.g., BM25) to get a set of results matching the query terms.
- Run a vector search (KNN) using a dense embedding of the query to find semantically similar documents.[2][3][4]
Step 2: Merge and Rerank
- Combine the two result sets.
- Use a fusion algorithm (commonly Reciprocal Rank Fusion, RRF) to merge and rerank the results, balancing keyword relevance and semantic similarity.[5][6][2]
Practical Example: Setting Up Hybrid Search in Solr
1. Prepare the Schema
- Define both traditional text fields and a dense vector field in your Solr schema:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="512" similarityFunction="cosine"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
- Adjust vectorDimension to match your embedding size.[4][7][8]
2. Index Documents
- For each document, index both the text and its vector embedding (produced by a model like BERT or Sentence-BERT).[8][4]
3. Querying
- Run a keyword query:
q=text:("your search terms")
- Run a vector query (using Solr's KNN query parser):
q={!knn f=vector topK=10}[0.12,0.34,0.56,...,0.78]
- Hybrid: Retrieve both sets, then merge and rerank using RRF or another fusion strategy.[2][4][5]
4. Score Fusion (Reciprocal Rank Fusion Example)
- For each document, calculate its RRF score based on its rank in each result list (a minimal client-side sketch follows this list):
$$ RRF(d) = \sum_{i=1}^{n} \frac{1}{k + \mathrm{rank}_i(d)} $$
where $k$ is a constant (e.g., 60) and $\mathrm{rank}_i(d)$ is the rank of document $d$ in result list $i$.[6][5][2]
5. Return the Final Ranked List
- Present the merged, reranked results to the user.
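Note that Solr does not currently ship an RRF fusion component (see SOLR-17103 above), so a common pattern is to run the keyword and vector queries separately and fuse the two ranked lists client-side. Here is a minimal Python sketch of the RRF formula from step 4:
# Client-side Reciprocal Rank Fusion over two ranked lists of document ids.
def rrf_fuse(keyword_ids, vector_ids, k=60):
    """Return doc ids ordered by RRF(d) = sum over lists of 1 / (k + rank_i(d)), ranks starting at 1."""
    scores = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "a" is mid-ranked in both lists but wins after fusion.
print(rrf_fuse(["a", "b", "c"], ["d", "a", "e"]))  # ['a', 'd', 'b', 'c', 'e']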
Summary Table: Hybrid Search Steps in Solr
| Step | Action |
|---|---|
| Schema Setup | Add text and vector fields |
| Indexing | Store both text and vector embeddings |
| Querying | Run both keyword and vector queries |
| Fusion | Merge and rerank using RRF or similar algorithm |
| Output | Return a single ranked list, balancing keyword and semantic relevance |
Hybrid search in Solr is practical, powerful, and can be set up today to deliver more relevant results by combining the best of both keyword and semantic search.[1][4][2]
Implementation of hybrid search combining vector similarity and keyword search using Solr native reranking https://github.com/mwcz/lightspeed-providers/blob/anx/solr-provider-ls-0.2/lightspeed_stack_providers/providers/remote/solr_vector_io/solr_vector_io/src/solr_vector_io/solr.py#L386-L395
async def query_hybrid(
self,
embedding: NDArray,
query_string: str,
k: int,
score_threshold: float,
reranker_type: str,
reranker_params: dict[str, Any] | None = None,
) -> QueryChunksResponse:
"""
Hybrid search combining vector similarity and keyword search using Solr's native reranking.
Args:
embedding: The query embedding vector
query_string: The text query for keyword search
k: Number of results to return
score_threshold: Minimum similarity score threshold
reranker_type: Type of reranker (ignored, uses Solr's native capabilities)
reranker_params: Parameters for reranking (e.g., boost values)
Returns:
QueryChunksResponse with combined results
"""
if reranker_params is None:
reranker_params = {}
# Get boost parameters, defaulting to equal weighting
vector_boost = reranker_params.get("vector_boost", 1.0)
keyword_boost = reranker_params.get("keyword_boost", 1.0)
log.debug(
f"Performing hybrid search: query='{query_string}', k={k}, "
f"score_threshold={score_threshold}, vector_boost={vector_boost}, "
f"keyword_boost={keyword_boost}"
)
async with self._create_http_client() as client:
# Use POST to avoid URI length limits with large embeddings
# Solr expects format: [f1,f2,f3]
vector_str = "[" + ",".join(str(v) for v in embedding.tolist()) + "]"
# Construct the hybrid query using Solr's rerank feature:
# the keyword query retrieves candidates and the KNN query rescores them.
# vector_boost is applied via reRankWeight for the KNN reranking;
# keyword_boost is not applied yet (see the note below).
data_params = {
"q": query_string,
"rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}",
"rqq": f"{{!knn f={self.vector_field} topK={k * 2}}}{vector_str}",
"rows": k,
"fl": "*, score",
"fq": ["product:*openshift*"],
"wt": "json",
"defType": "edismax",
}
# Note: keyword_boost can be incorporated in future by discovering schema fields
# Add filter query for chunk documents if schema is configured
if self.chunk_window_config and self.chunk_window_config.chunk_filter_query:
data_params["fq"].append(self.chunk_window_config.chunk_filter_query)
log.debug(
f"Applying chunk filter: {self.chunk_window_config.chunk_filter_query}"
)
try:
log.debug(
f"Sending hybrid query to Solr with reranking: reRankDocs={k * 2}, "
f"reRankWeight={vector_boost}"
)
response = await client.post(
f"{self.base_url}/select",
data=data_params,
headers={"Content-Type": "application/x-www-form-urlencoded"},
)
response.raise_for_status()
data = response.json()
chunks = []
scores = []
num_docs = data.get("response", {}).get("numFound", 0)
log.debug(f"Solr returned {num_docs} documents for hybrid search")
for doc in data.get("response", {}).get("docs", []):
score = float(doc.get("score", 0))
# Apply score threshold
if score < score_threshold:
log.debug(
f"Filtering out document with score {score} < threshold {score_threshold}"
)
continue
chunk = self._doc_to_chunk(doc)
if chunk:
chunks.append(chunk)
scores.append(score)
log.info(
f"Hybrid search returned {len(chunks)} chunks (filtered from {num_docs} by score threshold)"
)
response = QueryChunksResponse(chunks=chunks, scores=scores)
# Apply chunk window expansion if configured
if self.chunk_window_config is not None:
return await self._apply_chunk_window_expansion(
initial_response=response,
token_budget=self.chunk_window_config.token_budget,
min_chunk_gap=self.chunk_window_config.min_chunk_gap,
min_chunk_window=self.chunk_window_config.min_chunk_window,
)
return response
except httpx.HTTPStatusError as e:
log.error(
f"HTTP error during hybrid search: status={e.response.status_code}"
)
try:
error_data = e.response.json()
log.error(f"Solr error response: {error_data}")
except Exception:
log.error(f"Solr error response (text): {e.response.text[:500]}")
raise
except Exception as e:
log.exception(f"Error querying Solr with hybrid search: {e}")
raise
Hi Mani! Great to meet you. I'm excited to help you understand hybrid search and re-ranking in Solr.
Before we dive in, let me understand where you'd like to start:
- Would you like me to first explain what the code implementation does (the lightspeed one you shared) in simple terms?
- Or would you prefer to start with understanding the general concept of hybrid search first, and then come back to see how the code implements it?
Also, since you mentioned Polya's approach - are you familiar with it, or would you like me to briefly explain how we'll use it to structure our learning?
What feels like the right starting point for you?
Perfect! Let's break down what that query_hybrid function does in plain language.
The Big Picture: Imagine you're searching for something and you want the best of two worlds:
- Finding documents that contain your exact keywords (like traditional Google search)
- Finding documents that mean similar things, even if they use different words (semantic search)
This function combines both approaches to give you better results.
Here's what happens step by step:
The function takes in:
- Your search query as text (e.g., "how to deploy applications")
- A vector representation of that same query (numbers that capture its meaning)
- How many results you want (k)
- Some tuning knobs (boosts) to control whether you trust keywords more or semantic similarity more
Then it does something clever with Solr's reranking feature...
Quick check: Does this general idea make sense so far? Should I continue with explaining how it actually combines the two search methods?
Perfect! Let's break down what this code does in plain language.
The Big Picture:
This query_hybrid function is doing a "two-stage search" - think of it like asking two different librarians to find books for you, then combining their recommendations.
Stage 1 - Getting Candidates (Retrieval):
- Librarian #1 (Keyword Search): Looks for documents that contain your actual search words (query_string)
- Librarian #2 (Vector Search): Looks for documents that are semantically similar to your query, using the embedding (a numerical representation of meaning)
Stage 2 - Re-ranking:
The code uses Solr's rerank feature to combine both results. Here's the clever part:
- It first retrieves k * 2 results (double what you asked for) from both searches
- Then it re-ranks them using the vector similarity scores, controlled by vector_boost
- Finally returns the top k results
Key Parameters:
vector_boost and keyword_boost: Control how much weight each search method gets (though in this code, keyword_boost isn't fully implemented yet - see the comment)
Does this high-level explanation make sense? Would you like me to explain any specific part in more detail, or shall we look at how the query is actually constructed?
Brilliant. Let me restate.
q -> [keyword(q, k*2), semantic(q, k*2)] -> rerank(keyword, semantic, vector_boost) -> top_K
Almost! But there's a subtle difference in how this specific implementation works. Let me clarify:
Actual flow:
keyword(q) -> take top k*2 results -> rerank those k*2 using semantic similarity -> top k
The key difference:
- The keyword search retrieves the initial pool of candidates
- Then Solr's rerank feature rescores only those candidates using vector similarity
- It doesn't independently retrieve two separate lists and merge them
Why does this matter?
This means a document must first be retrieved by the keyword search to even be considered. If a document is semantically similar but has no keyword match, it won't appear in the results.
Does this distinction make sense? This is actually one specific way to do hybrid search (there are other approaches too, which we can explore later).
Then in that case, from your high-level example, Librarian #2 only uses the documents found by Librarian #1. Your example needs correction based on this implementation. Is my understanding correct?
Exactly right! You caught that perfectly. My librarian analogy was misleading.
Better analogy:
- Librarian #1 (Keyword Search): Finds k*2 documents based on keywords and puts them on the table
- Librarian #2 (Vector Search): Takes only those k*2 documents from the table and re-ranks them based on semantic similarity
- You get the top k from Librarian #2's re-ranking
So Librarian #2 doesn't search the entire library - they only rescore what Librarian #1 already found.
This approach is sometimes called "keyword-first hybrid search" or "reranking-based hybrid search" - the keyword search acts as a filter/retrieval stage, and vector search refines the ranking.
Does this corrected understanding feel clearer now?
Yes, it does. Now that I have the high-level understanding, please explain and map how the following params work.
data_params = {
"q": query_string,
"rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}",
"rqq": f"{{!knn f={self.vector_field} topK={k * 2}}}{vector_str}",
"rows": k,
"fl": "*, score",
"fq": ["product:*openshift*"],
"wt": "json",
"defType": "edismax",
}
Great! Let's map each parameter to what it does:
Initial Retrieval (Librarian #1):
"q": query_string- Your keyword search query (e.g., "deploy applications")"defType": "edismax"- Tells Solr to use the Extended DisMax parser for flexible keyword matching"rows": k- How many final results you want back
Re-ranking Stage (Librarian #2):
- "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}" - This is the rerank instruction
  - reRankDocs={k * 2} - "Take the top k*2 documents from keyword search"
  - reRankQuery=$rqq - "Rescore them using the query defined in $rqq"
  - reRankWeight={vector_boost} - Controls how much the new score matters vs. the original keyword score
- "rqq": f"{{!knn f={self.vector_field} topK={k * 2}}}{vector_str}" - This defines what $rqq is
  - {!knn f={self.vector_field} topK={k * 2}} - Use KNN (K-Nearest Neighbor) search on the vector field
  - {vector_str} - Your query embedding
Other params:
"fl": "*, score"- Return all fields plus the score"fq": ["product:*openshift*"]- Filter to only OpenShift products"wt": "json"- Return results as JSON
Question for you: What do you think reRankWeight actually controls? How does it balance the original keyword score vs. the vector similarity score?
I don't fully understand reRankWeight. Please use an example and an analogy.
My understanding is: we have the initial k * 2 candidates along with their keyword relevance scores (A); then we perform a semantic similarity calculation between the query and those k * 2 documents, which gives an additional semantic similarity score (B). Now say reRankWeight is 2: then 2 * B + A is the hybrid score used for the final top_k. Correct??
Excellent! Your understanding is exactly right!
Let me confirm with a concrete example:
Example:
You have 3 documents after keyword search (k*2 = 3 for simplicity):
| Doc | Keyword Score (A) | Vector Similarity Score (B) |
|---|---|---|
| Doc1 | 10.0 | 0.8 |
| Doc2 | 8.0 | 0.95 |
| Doc3 | 6.0 | 0.7 |
If reRankWeight = 2:
Hybrid Score = A + (reRankWeight × B) = A + (2 × B)
| Doc | Calculation | Final Score |
|---|---|---|
| Doc1 | 10.0 + (2 × 0.8) | 11.6 |
| Doc2 | 8.0 + (2 × 0.95) | 9.9 |
| Doc3 | 6.0 + (2 × 0.7) | 7.4 |
Final ranking: Doc1, Doc2, Doc3
The analogy:
Think of reRankWeight as a volume knob for Librarian #2's opinion. Higher weight = Librarian #2's semantic opinion matters more. Lower weight = stick closer to Librarian #1's keyword ranking.
Does this example clarify it? Want to explore what happens with different reRankWeight values?
Perfect! Let's use the same 3 documents and try different weights:
Same documents:
| Doc | Keyword Score (A) | Vector Score (B) |
|---|---|---|
| Doc1 | 10.0 | 0.8 |
| Doc2 | 8.0 | 0.95 |
| Doc3 | 6.0 | 0.7 |
reRankWeight = 0 (ignore semantic):
- Doc1: 10.0 + (0 × 0.8) = 10.0
- Doc2: 8.0 + (0 × 0.95) = 8.0
- Doc3: 6.0 + (0 × 0.7) = 6.0
- Ranking: Doc1 → Doc2 → Doc3 (pure keyword ranking)
reRankWeight = 2:
- Doc1: 10.0 + (2 × 0.8) = 11.6
- Doc2: 8.0 + (2 × 0.95) = 9.9
- Doc3: 6.0 + (2 × 0.7) = 7.4
- Ranking: Doc1 → Doc2 → Doc3 (same order!)
reRankWeight = 5:
- Doc1: 10.0 + (5 × 0.8) = 14.0
- Doc2: 8.0 + (5 × 0.95) = 12.75
- Doc3: 6.0 + (5 × 0.7) = 9.5
- Ranking: Doc1 → Doc2 → Doc3 (still same!)
reRankWeight = 20:
- Doc1: 10.0 + (20 × 0.8) = 26.0
- Doc2: 8.0 + (20 × 0.95) = 27.0
- Doc3: 6.0 + (20 × 0.7) = 20.0
- Ranking: Doc2 → Doc1 → Doc3 (Doc2 wins now!)
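If you want to reproduce this sweep yourself, here's a tiny Python sketch (it uses the illustrative scores above, not real Solr output):
# final_score = keyword_score + reRankWeight * vector_score
docs = {"Doc1": (10.0, 0.80), "Doc2": (8.0, 0.95), "Doc3": (6.0, 0.70)}

for weight in (0, 2, 5, 20):
    ranked = sorted(docs, key=lambda d: docs[d][0] + weight * docs[d][1], reverse=True)
    print(weight, ranked)
# In this sweep the order only changes at weight 20, where Doc2 moves to the top.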
What do you notice? At what point does the semantic similarity start to override the keyword ranking?
Wow, in this scenario the ranking is only changed by semantic similarity at reRankWeight = 20. In the real world, different query types require different strategies. Right?
Absolutely! You've hit on a key insight. Different query types need different balancing strategies.
Think about these query types:
-
Exact technical queries: "error code 404" or "kubectl get pods"
- User wants exact keyword matches
-
Conceptual queries: "how to improve application performance" or "best practices for security"
- Semantic similarity matters more - many ways to express the same concept
-
Mixed queries: "deploy nodejs application on kubernetes"
- Both keywords (nodejs, kubernetes) AND concepts (deploy) matter
Question for you:
For each of those query types, would you use a low reRankWeight (like 1-5) or high reRankWeight (like 10-20)? Why?
And how might you decide this in a production system - manually tune it, or somehow detect the query type automatically?
- Exact technical queries - low reRankWeight (we cannot let semantics override, since exact keyword matches are needed)
- Conceptual queries - high reRankWeight (we definitely want semantics to matter more, because meaning matters most here)
- Mixed queries - this requires careful experimentation with reRankWeight; there is no single answer
Excellent analysis! Your reasoning is spot-on for each type.
You've identified a crucial challenge: In a production system, you don't know ahead of time which type of query a user will submit.
So how do you handle this?
Here are some approaches practitioners use:
- Fixed middle-ground weight - Pick one reRankWeight (like 5) that works "okay" for everything
- Query classification - Detect query type first, then apply appropriate weight
- Dynamic weighting - Use query characteristics (length, presence of technical terms, etc.) to automatically adjust
- User feedback - Learn from clicks/ratings which weight works best for similar queries
- A/B testing - Test different weights and see which gives better user satisfaction
Question for you:
Which approach sounds most practical to you for your use case? Or do you have a different idea?
Also, are you starting to think about how you might want to tune this for your specific search application?
At this point in time, we don't have a rerank strategy at all, so we need to start simple in terms of 1 & 2: query classification identifies the scenarios where exact matches are crucial (CVE, Errata) and applies a low reRankWeight to those technical documentation queries, and for the rest we want a high reRankWeight. Yes. Are there flaws in my thinking process? How do you recommend applying this to my application?
Your thinking is solid for getting started! Starting simple with two buckets is exactly the right approach.
One potential flaw to consider:
Your "everything else gets high reRankWeight" bucket might be too broad. Consider:
- CVE queries → low weight (0-2) ✓
- Errata queries → low weight (0-2) ✓
- Conceptual queries → high weight (15-20) ✓
- But what about mixed queries? → high weight might be too aggressive
For example: "how to patch CVE-2024-1234 in OpenShift" has both CVE (needs exact match) AND concepts (patching, how-to).
My recommendation for starting:
Instead of two buckets, try three:
- Exact match critical (CVE, Errata) → reRankWeight = 1-2
- Balanced (default for most queries) → reRankWeight = 5-8
- Semantic heavy (questions, "how to", "best practices") → reRankWeight = 15-20
Questions for you:
- Does a three-tier approach make sense for your use case?
- How easy is it for you to detect CVE/Errata queries? (They have pretty clear patterns like CVE-YYYY-NNNN)
- Do you have access to query logs to test different weights and see what works?
- Yes, the three-tier approach makes sense.
- We already have intent detection, and I can detect security intents such as CVE/Errata queries with it.
- Yes, we have access to query logs to test different weights and see how they work.
Perfect! You're in a great position to implement this. You already have the key pieces:
- Intent detection system ✓
- Access to query logs for testing ✓
- Understanding of the reranking mechanism ✓
Now you have a clear path forward:
- Extend your intent detector to classify into your three tiers
- Map each tier to a reRankWeight value
- Test on historical queries to validate your choices
- Monitor and iterate
What would you like to explore next?
- How to structure the code to make reRankWeight dynamic based on intent?
- How to design experiments with your query logs to find optimal weight values?
- Understanding other hybrid search strategies (like the union approach from the article)?
- Something else?
What feels most valuable to you right now?
I can help you organize your notebook!
First, let me understand what you'd like:
- Do you want to restructure the existing notes into a better order?
- Would you like to add new sections (like a summary, implementation checklist, or next steps)?
- Are you looking to create a separate clean reference that captures your key learnings?
- Something else?
What would be most helpful for you?
Let's start with restructuring the existing notes, so I can solidify my understanding and communicate better based on the existing dialog.
Perfect! I'll help you create a restructured set of notes that follows our learning journey. Here's a clean organization:
# Hybrid Search in Apache Solr - Learning Notes
## Context & Goal
- **Name:** Mani
- **Background:** Search practitioner, intermediate Python coder, familiar with lexical search in Solr
- **Goal:** Understand hybrid search and re-ranking features in Solr
- **Application:** Lightspeed core implementation for OpenShift documentation
---
## What is Hybrid Search?
Hybrid search combines two search approaches:
1. **Lexical (Keyword) Search** - Traditional BM25, exact term matching
2. **Semantic (Vector) Search** - KNN search using embeddings to capture meaning
**Why combine them?**
- Keyword search: Great for exact matches, technical terms
- Vector search: Captures semantic meaning, handles synonyms and concepts
- Together: More robust and relevant results
---
## How This Implementation Works
### High-Level Flow
keyword_search(query) → retrieve top k*2 candidates → rerank using semantic similarity → return top k
**Key insight:** This is "keyword-first hybrid search"
- Keyword search acts as the retrieval/filtering stage
- Vector search refines the ranking of those candidates
- Documents must match keywords to be considered
### Librarian Analogy
- **Librarian #1 (Keyword):** Finds k*2 documents based on keywords and puts them on the table
- **Librarian #2 (Vector):** Takes *only those k*2 documents* and re-ranks them by semantic similarity
- You get the top k from Librarian #2's re-ranking
---
## Code Implementation Details
### The Function
async def query_hybrid(
embedding: NDArray, # Query vector
query_string: str, # Query text
k: int, # Final results wanted
score_threshold: float, # Minimum score filter
reranker_type: str,
reranker_params: dict # Contains boost values
)
### Solr Query Parameters
data_params = {
# Initial keyword retrieval
"q": query_string, # Your keyword query
"defType": "edismax", # Extended DisMax parser
"rows": k, # Final result count
# Reranking stage
"rq": "{{!rerank reRankQuery=$rqq reRankDocs={k*2} reRankWeight={vector_boost}}}",
"rqq": "{{!knn f={vector_field} topK={k*2}}}{vector_str}",
# Other params
"fl": "*, score", # Return all fields + score
"fq": ["product:*openshift*"], # Filter query
"wt": "json" # Response format
}
### Parameter Breakdown
| Parameter | Purpose |
|-----------|---------|
| `q` | Keyword search query |
| `rq` | Rerank instruction |
| `reRankDocs` | How many top docs to rerank (k*2) |
| `reRankQuery` | What query to use for reranking ($rqq) |
| `reRankWeight` | How much semantic score matters |
| `rqq` | The KNN vector similarity query |
---
## Understanding reRankWeight
### The Scoring Formula
Final Score = Keyword_Score + (reRankWeight × Vector_Score)
Final Score = A + (reRankWeight × B)
### Example with Different Weights
**Starting documents:**
| Doc | Keyword Score (A) | Vector Score (B) |
|-----|-------------------|------------------|
| Doc1 | 10.0 | 0.8 |
| Doc2 | 8.0 | 0.95 |
| Doc3 | 6.0 | 0.7 |
**reRankWeight = 0** (pure keyword)
- Doc1: 10.0 → Doc2: 8.0 → Doc3: 6.0
**reRankWeight = 2** (slight semantic boost)
- Doc1: 11.6 → Doc2: 9.9 → Doc3: 7.4
- Same ranking order
**reRankWeight = 20** (strong semantic)
- Doc2: 27.0 → Doc1: 26.0 → Doc3: 20.0
- **Ranking changed!** Doc2 now wins due to higher semantic similarity
### Key Insight
The higher the reRankWeight, the more semantic similarity can override keyword ranking.
---
## Query Type Strategy
Different queries need different reRankWeight values:
### 1. Exact Technical Queries
- **Examples:** "CVE-2024-1234", "error code 404", "kubectl get pods"
- **Strategy:** Low reRankWeight (1-2)
- **Why:** Need exact keyword matches, can't let semantics override
### 2. Conceptual Queries
- **Examples:** "how to improve performance", "best practices for security"
- **Strategy:** High reRankWeight (15-20)
- **Why:** Meaning matters more than exact words
### 3. Mixed Queries
- **Examples:** "how to patch CVE-2024-1234", "deploy nodejs on kubernetes"
- **Strategy:** Medium reRankWeight (5-8)
- **Why:** Both keywords and concepts matter
---
## Implementation Plan for Our Application
### Three-Tier Approach
| Tier | Detection | reRankWeight | Example Queries |
|------|-----------|--------------|-----------------|
| **Exact Match Critical** | Security intent (CVE, Errata) | 1-2 | "CVE-2024-1234" |
| **Balanced** | Default for most queries | 5-8 | "deploy nodejs app" |
| **Semantic Heavy** | Questions, how-to, best practices | 15-20 | "how to improve security?" |
### Implementation Steps
1. ✓ Extend intent detection to classify into three tiers
2. ✓ Map each tier to reRankWeight value
3. ✓ Test on historical query logs
4. ✓ Monitor and iterate based on results
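A minimal sketch of how the tiers could drive the vector_boost passed to query_hybrid; the detect_intent helper, tier names, and weight values are hypothetical placeholders for the existing intent detection system:
# Hypothetical mapping from detected intent tier to reRankWeight (passed as vector_boost).
TIER_WEIGHTS = {
    "exact_match_critical": 1.5,   # CVE / Errata style queries
    "balanced": 6.0,               # default
    "semantic_heavy": 18.0,        # how-to / best-practice questions
}

def build_reranker_params(query: str, detect_intent) -> dict:
    """detect_intent is assumed to return one of the tier names above."""
    tier = detect_intent(query)
    return {"vector_boost": TIER_WEIGHTS.get(tier, TIER_WEIGHTS["balanced"])}

# Usage sketch (inside the existing async retrieval flow):
# params = build_reranker_params(user_query, detect_intent)
# response = await provider.query_hybrid(embedding, user_query, k=5, score_threshold=0.0,
#                                        reranker_type="solr_native", reranker_params=params)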
### Advantages We Have
- Existing intent detection system (can detect security intents)
- Access to query logs for experimentation
- Clear use cases (OpenShift documentation)
---
## Key Takeaways
1. **Hybrid search = Keyword retrieval + Semantic reranking** (in this implementation)
2. **reRankWeight controls the balance** between keyword and semantic scores
3. **Different query types need different strategies** - no one-size-fits-all
4. **Start simple, iterate based on data** - three tiers is a practical starting point
5. **This is keyword-first hybrid** - only reranks what keywords find first
---
## Next Learning Topics
- How to design experiments with query logs
- Alternative hybrid approaches (union-based, RRF)
- Dynamic reRankWeight based on query features
- Learning to Rank (LTR) for automatic weight optimization
---
## Reference Materials
- Sease.io blog: Hybrid Search with Apache Solr
- Lightspeed implementation: solr_vector_io/solr.py