Try an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.

Note: 187

Goal and Context

My name is Mani. I am a search practitioner and an intermediate Python coder, familiar with lexical search in Apache Solr, but not with Solr's re-ranking and hybrid search features. Help me onboard using Polya's step-by-step approach. I have also provided an implementation of hybrid search in a client based on llamastack. First, explain how that implementation works in plain natural language with a very simple example; then we can learn thoroughly how hybrid search works with gradually increasing complexity.

Note: 10

Hybrid Search with Apache Solr

Note: 15

Implementation of hybrid search in Lightspeed Core

Prompt: 4

Let's start.

Assistant: 211

Hi Mani! Great to meet you. I'm excited to help you understand hybrid search and re-ranking in Solr.

Before we dive in, let me understand where you'd like to start:

  1. Would you like me to first explain what the code implementation does (the lightspeed one you shared) in simple terms?
  2. Or would you prefer to start with understanding the general concept of hybrid search first, and then come back to see how the code implements it?

Also, since you mentioned Polya's approach - are you familiar with it, or would you like me to briefly explain how we'll use it to structure our learning?

What feels like the right starting point for you?

Prompt: 13

  1. Explain what the code implementation does

Assistant: 726

Perfect! Let's break down what that query_hybrid function does in plain language.

The Big Picture: Imagine you're searching for something and you want the best of two worlds:

  • Finding documents that contain your exact keywords (like traditional Google search)
  • Finding documents that mean similar things, even if they use different words (semantic search)

This function combines both approaches to give you better results.

Here's what happens step by step:

The function takes in:

  • Your search query as text (e.g., "how to deploy applications")
  • A vector representation of that same query (numbers that capture its meaning)
  • How many results you want (k)
  • Some tuning knobs (boosts) to control whether you trust keywords more or semantic similarity more

Then it does something clever with Solr's reranking feature...

Quick check: Does this general idea make sense so far? Should I continue with explaining how it actually combines the two search methods?

Perfect! Let's break down what this code does in plain language.

The Big Picture:

This query_hybrid function is doing a "two-stage search" - think of it like asking two different librarians to find books for you, then combining their recommendations.

Stage 1 - Getting Candidates (Retrieval):

  • Librarian #1 (Keyword Search): Looks for documents that contain your actual search words (query_string)
  • Librarian #2 (Vector Search): Looks for documents that are semantically similar to your query, using the embedding (a numerical representation of meaning)

Stage 2 - Re-ranking: The code uses Solr's rerank feature to combine both results. Here's the clever part:

  • It first retrieves k * 2 results (double what you asked for) from both searches
  • Then it re-ranks them using the vector similarity scores, controlled by vector_boost
  • Finally returns the top k results

Key Parameters:

  • vector_boost and keyword_boost: Control how much weight each search method gets (though in this code, keyword_boost isn't fully implemented yet - see the comment)

Does this high-level explanation make sense? Would you like me to explain any specific part in more detail, or shall we look at how the query is actually constructed?

Prompt: 58

Brilliant. Let me restate.

q -> [keyword(q, k*2), semantic(q, k*2)] -> rerank(keyword, semantic, vector_boost) -> top_K

Assistant: 255

Almost! But there's a subtle difference in how this specific implementation works. Let me clarify:

Actual flow:

keyword(q) -> take top k*2 results -> rerank those k*2 using semantic similarity -> top k

The key difference:

  • The keyword search retrieves the initial pool of candidates
  • Then Solr's rerank feature rescores only those candidates using vector similarity
  • It doesn't independently retrieve two separate lists and merge them

Why does this matter?

This means a document must first be retrieved by the keyword search to even be considered. If a document is semantically similar but has no keyword match, it won't appear in the results.

Does this distinction make sense? This is actually one specific way to do hybrid search (there are other approaches too, which we can explore later).
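
To make that flow concrete, here is a minimal plain-Python sketch of the keyword-first pipeline, with made-up documents and scores (it only simulates the logic; the real scoring happens inside Solr):

```python
# Toy simulation of keyword-first hybrid search (illustrative only).
def keyword_search(pool, n):
    # Librarian #1: only documents that actually match the keywords are retrieved.
    matches = [d for d in pool if d["keyword_score"] > 0]
    return sorted(matches, key=lambda d: d["keyword_score"], reverse=True)[:n]

def rerank(candidates, vector_boost, k):
    # Librarian #2: rescores ONLY the retrieved candidates using vector similarity.
    for d in candidates:
        d["hybrid_score"] = d["keyword_score"] + vector_boost * d["vector_score"]
    return sorted(candidates, key=lambda d: d["hybrid_score"], reverse=True)[:k]

collection = [
    {"id": "doc1", "keyword_score": 9.0, "vector_score": 0.60},
    {"id": "doc2", "keyword_score": 7.5, "vector_score": 0.90},
    {"id": "doc3", "keyword_score": 5.0, "vector_score": 0.70},
    {"id": "doc4", "keyword_score": 0.0, "vector_score": 0.95},  # semantically close, but no keyword match
]

k = 2
candidates = keyword_search(collection, n=k * 2)          # top k*2 by keyword score
print([d["id"] for d in rerank(candidates, 2.0, k)])      # ['doc1', 'doc2']; doc4 is never considered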

Prompt: 63

Then in that case from your high level example Librarian #2 only uses the documents found by Librarian #1. It needs correction in your example based on this implementation. Is my understanding correct?

Assistant: 255

Exactly right! You caught that perfectly. My librarian analogy was misleading.

Better analogy:

  • Librarian #1 (Keyword Search): Finds k*2 documents based on keywords and puts them on the table
  • Librarian #2 (Vector Search): Takes only those k*2 documents from the table and re-ranks them based on semantic similarity
  • You get the top k from Librarian #2's re-ranking

So Librarian #2 doesn't search the entire library - they only rescore what Librarian #1 already found.

This approach is sometimes called "keyword-first hybrid search" or "reranking-based hybrid search" - the keyword search acts as a filter/retrieval stage, and vector search refines the ranking.

Does this corrected understanding feel clearer now?

Prompt: 211

Yes it does. Now I have the higher level understanding please explain and map how the following params work.

          data_params = {
                "q": query_string,
                "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}",
                "rqq": f"{{!knn f={self.vector_field} topK={k * 2}}}{vector_str}",
                "rows": k,
                "fl": "*, score",
                "fq": ["product:*openshift*"],
                "wt": "json",
                "defType": "edismax",
            }

Assistant: 577

Great! Let's map each parameter to what it does:

Initial Retrieval (Librarian #1):

  • "q": query_string - Your keyword search query (e.g., "deploy applications")
  • "defType": "edismax" - Tells Solr to use the Extended DisMax parser for flexible keyword matching
  • "rows": k - How many final results you want back

Re-ranking Stage (Librarian #2):

  • "rq": "{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}"

    • This is the rerank instruction
    • reRankDocs={k * 2} - "Take the top k*2 documents from keyword search"
    • reRankQuery=$rqq - "Rescore them using the query defined in $rqq"
    • reRankWeight={vector_boost} - Controls how much the new score matters vs. the original keyword score
  • "rqq": "{{!knn f={self.vector_field} topK={k * 2}}}{vector_str}"

    • This defines what $rqq is
    • {!knn f={self.vector_field} topK={k * 2}} - Use KNN (K-Nearest Neighbor) search on the vector field
    • {vector_str} - Your query embedding

Other params:

  • "fl": "*, score" - Return all fields plus the score
  • "fq": ["product:*openshift*"] - Filter to only OpenShift products
  • "wt": "json" - Return results as JSON

Question for you: What do you think reRankWeight actually controls? How does it balance the original keyword score vs. the vector similarity score?
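
For reference, here is a hedged sketch of how such a request could be sent straight to Solr's /select handler. The Solr URL, collection name (docs), vector field name (embedding), and the tiny 4-dimensional embedding are all placeholders, not values from the Lightspeed implementation:

```python
import requests

SOLR_URL = "http://localhost:8983/solr/docs/select"   # placeholder collection name

query_string = "how to deploy applications"
embedding = [0.12, -0.03, 0.58, 0.44]                 # toy vector; real embeddings are much longer
vector_str = "[" + ", ".join(str(x) for x in embedding) + "]"
vector_field = "embedding"                            # placeholder dense-vector field
k = 5
vector_boost = 3.0

data_params = {
    "q": query_string,
    "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={k * 2} reRankWeight={vector_boost}}}",
    "rqq": f"{{!knn f={vector_field} topK={k * 2}}}{vector_str}",
    "rows": k,
    "fl": "*, score",
    "fq": ["product:*openshift*"],   # list values are sent as repeated fq parameters
    "wt": "json",
    "defType": "edismax",
}

resp = requests.get(SOLR_URL, params=data_params, timeout=30)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("score"))            # assumes documents have an id field
```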

Prompt: 147

I don't understand reRankWeight fully. Please use an example and an analogy.

My understanding is: we have the initial k * 2 candidates along with their keyword relevance scores (A), then we compute a semantic similarity score (B) between the query and each of those k * 2 documents. Now say the reRankWeight is 2; then 2 * B + A will be the hybrid score used for the final top_k. Correct?

Assistant: 460

Excellent! Your understanding is exactly right!

Let me confirm with a concrete example:

Example:

You have 3 documents after the keyword search (a small k*2 pool of just 3 candidates, for simplicity):

| Doc | Keyword Score (A) | Vector Similarity Score (B) |
|------|-------------------|------------------------------|
| Doc1 | 10.0 | 0.8 |
| Doc2 | 8.0 | 0.95 |
| Doc3 | 6.0 | 0.7 |

If reRankWeight = 2:

Hybrid Score = A + (reRankWeight × B) = A + (2 × B)

| Doc | Calculation | Final Score |
|------|------------------|-------------|
| Doc1 | 10.0 + (2 × 0.8) | 11.6 |
| Doc2 | 8.0 + (2 × 0.95) | 9.9 |
| Doc3 | 6.0 + (2 × 0.7) | 7.4 |

Final ranking: Doc1, Doc2, Doc3

The analogy: Think of reRankWeight as a volume knob for Librarian #2's opinion. Higher weight = Librarian #2's semantic opinion matters more. Lower weight = stick closer to Librarian #1's keyword ranking.

Does this example clarify it? Want to explore what happens with different reRankWeight values?
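
A quick sanity check of that formula in plain Python, using the scores from the table above:

```python
docs = {"Doc1": (10.0, 0.80), "Doc2": (8.0, 0.95), "Doc3": (6.0, 0.70)}  # (A, B)
rerank_weight = 2
hybrid = {name: round(a + rerank_weight * b, 2) for name, (a, b) in docs.items()}
print(sorted(hybrid.items(), key=lambda kv: kv[1], reverse=True))
# [('Doc1', 11.6), ('Doc2', 9.9), ('Doc3', 7.4)]
```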

Prompt: 21

Yes. Let's explore what happens with different reRankWeight values.

Assistant: 733

Perfect! Let's use the same 3 documents and try different weights:

Same documents:

| Doc | Keyword Score (A) | Vector Score (B) |
|------|-------------------|------------------|
| Doc1 | 10.0 | 0.8 |
| Doc2 | 8.0 | 0.95 |
| Doc3 | 6.0 | 0.7 |

reRankWeight = 0 (ignore semantic):

  • Doc1: 10.0 + (0 × 0.8) = 10.0
  • Doc2: 8.0 + (0 × 0.95) = 8.0
  • Doc3: 6.0 + (0 × 0.7) = 6.0
  • Ranking: Doc1 → Doc2 → Doc3 (pure keyword ranking)

reRankWeight = 2:

  • Doc1: 10.0 + (2 × 0.8) = 11.6
  • Doc2: 8.0 + (2 × 0.95) = 9.9
  • Doc3: 6.0 + (2 × 0.7) = 7.4
  • Ranking: Doc1 → Doc2 → Doc3 (same order!)

reRankWeight = 5:

  • Doc1: 10.0 + (5 × 0.8) = 14.0
  • Doc2: 8.0 + (5 × 0.95) = 12.75
  • Doc3: 6.0 + (5 × 0.7) = 9.5
  • Ranking: Doc1 → Doc2 → Doc3 (still same!)

reRankWeight = 20:

  • Doc1: 10.0 + (20 × 0.8) = 26.0
  • Doc2: 8.0 + (20 × 0.95) = 27.0
  • Doc3: 6.0 + (20 × 0.7) = 20.0
  • Ranking: Doc2 → Doc1 → Doc3 (Doc2 wins now!)

What do you notice? At what point does the semantic similarity start to override the keyword ranking?
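
To reproduce this sweep yourself, here is a small sketch using the same made-up scores:

```python
docs = {"Doc1": (10.0, 0.80), "Doc2": (8.0, 0.95), "Doc3": (6.0, 0.70)}  # (keyword A, vector B)

for weight in (0, 2, 5, 20):
    ranking = sorted(docs, key=lambda d: docs[d][0] + weight * docs[d][1], reverse=True)
    print(f"reRankWeight={weight:>2}: {' -> '.join(ranking)}")
# reRankWeight= 0: Doc1 -> Doc2 -> Doc3
# reRankWeight= 2: Doc1 -> Doc2 -> Doc3
# reRankWeight= 5: Doc1 -> Doc2 -> Doc3
# reRankWeight=20: Doc2 -> Doc1 -> Doc3
```

In this toy data, Doc2 overtakes Doc1 once the weight exceeds (10.0 − 8.0) / (0.95 − 0.80) ≈ 13.3.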

Prompt: 49

Wow, in this scenario the ranking only changes due to semantic similarity at reRankWeight = 20. In the real world, different query types require different strategies. Right?

Assistant: 283

Absolutely! You've hit on a key insight. Different query types need different balancing strategies.

Think about these query types:

  1. Exact technical queries: "error code 404" or "kubectl get pods"

    • User wants exact keyword matches
  2. Conceptual queries: "how to improve application performance" or "best practices for security"

    • Semantic similarity matters more - many ways to express the same concept
  3. Mixed queries: "deploy nodejs application on kubernetes"

    • Both keywords (nodejs, kubernetes) AND concepts (deploy) matter

Question for you:

For each of those query types, would you use a low reRankWeight (like 1-5) or high reRankWeight (like 10-20)? Why?

And how might you decide this in a production system - manually tune it, or somehow detect the query type automatically?

Prompt: 111

  1. Exact Technical Queries - low reRankWeight (we cannot let semantics override, since exact keyword matches are needed)
  2. Conceptual Queries - high reRankWeight (we definitely want a stronger semantic override, because meaning matters more here)
  3. Mixed Queries - this requires careful experimentation with reRankWeight; there is no single right value

Assistant: 315

Excellent analysis! Your reasoning is spot-on for each type.

You've identified a crucial challenge: In a production system, you don't know ahead of time which type of query a user will submit.

So how do you handle this?

Here are some approaches practitioners use:

  1. Fixed middle-ground weight - Pick one reRankWeight (like 5) that works "okay" for everything
  2. Query classification - Detect query type first, then apply appropriate weight
  3. Dynamic weighting - Use query characteristics (length, presence of technical terms, etc.) to automatically adjust
  4. User feedback - Learn from clicks/ratings which weight works best for similar queries
  5. A/B testing - Test different weights and see which gives better user satisfaction

Question for you:

Which approach sounds most practical to you for your use case? Or do you have a different idea?

Also, are you starting to think about how you might want to tune this for your specific search application?

Prompt: 144

At this point in time, we don't have a rerank strategy at all. So we need to start simple, combining approaches 1 & 2: query classification identifies the scenarios where exact matches are crucial (CVE, Errata, technical documentation queries) and applies a low reRankWeight, and for the rest we use a high reRankWeight. Yes. Are there flaws in my thinking process? How do you recommend applying this to my application?

Assistant: 453

Your thinking is solid for getting started! Starting simple with two buckets is exactly the right approach.

One potential flaw to consider:

Your "everything else gets high reRankWeight" bucket might be too broad. Consider:

  • CVE queries → low weight (0-2) ✓
  • Errata queries → low weight (0-2) ✓
  • Conceptual queries → high weight (15-20) ✓
  • But what about mixed queries? → high weight might be too aggressive

For example: "how to patch CVE-2024-1234 in OpenShift" has both CVE (needs exact match) AND concepts (patching, how-to).

My recommendation for starting:

Instead of two buckets, try three:

  1. Exact match critical (CVE, Errata) → reRankWeight = 1-2
  2. Balanced (default for most queries) → reRankWeight = 5-8
  3. Semantic heavy (questions, "how to", "best practices") → reRankWeight = 15-20

Questions for you:

  1. Does a three-tier approach make sense for your use case?
  2. How easy is it for you to detect CVE/Errata queries? (They have pretty clear patterns like CVE-YYYY-NNNN)
  3. Do you have access to query logs to test different weights and see what works?

Prompt: 82

  1. Yes Three tier approach makes sense.
  2. We already have an intent detection where I could detect security intents such as CVE/Errata queries.
  3. Yes we do have access to query logs to test different weights and see how it works.

Assistant: 244

Perfect! You're in a great position to implement this. You already have the key pieces:

  • Intent detection system ✓
  • Access to query logs for testing ✓
  • Understanding of the reranking mechanism ✓

Now you have a clear path forward:

  1. Extend your intent detector to classify into your three tiers
  2. Map each tier to a reRankWeight value
  3. Test on historical queries to validate your choices
  4. Monitor and iterate

What would you like to explore next?

  • How to structure the code to make reRankWeight dynamic based on intent?
  • How to design experiments with your query logs to find optimal weight values?
  • Understanding other hybrid search strategies (like the union approach from the article)?
  • Something else?

What feels most valuable to you right now?
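
Regarding the first bullet above (making reRankWeight dynamic based on intent), here is a hedged sketch of what a tier-to-weight mapping might look like. The tier names, weight values, and detection rules are illustrative placeholders to tune against your query logs, not part of the existing Lightspeed implementation:

```python
import re

# Hypothetical tier -> reRankWeight mapping (placeholders; tune with your query logs).
TIER_WEIGHTS = {"exact": 2, "balanced": 6, "semantic": 15}

CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
SEMANTIC_HINTS = ("how to", "best practice", "why", "what is")

def pick_rerank_weight(query: str, intent: str = "") -> int:
    """Pick a reRankWeight tier; 'intent' would come from your existing intent detector."""
    text = query.lower()
    if intent == "security" or CVE_PATTERN.search(query) or "errata" in text:
        return TIER_WEIGHTS["exact"]      # exact-match critical: CVE / Errata
    if any(hint in text for hint in SEMANTIC_HINTS):
        return TIER_WEIGHTS["semantic"]   # conceptual / how-to queries
    return TIER_WEIGHTS["balanced"]       # default middle ground

print(pick_rerank_weight("CVE-2024-1234 fix for OpenShift"))   # -> 2
print(pick_rerank_weight("best practices for pod security"))   # -> 15
print(pick_rerank_weight("deploy nodejs application"))         # -> 6
```

The selected weight would then be passed as vector_boost into query_hybrid (or carried in the reranker_params it already receives).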

Prompt: 7

Help me organize this notebook

Assistant: 141

I can help you organize your notebook!

First, let me understand what you'd like:

  1. Do you want to restructure the existing notes into a better order?
  2. Would you like to add new sections (like a summary, implementation checklist, or next steps)?
  3. Are you looking to create a separate clean reference that captures your key learnings?
  4. Something else?

What would be most helpful for you?

Prompt: 33

Let's start with restructuring the existing notes, so I can solidify my understanding and communicate it better based on the existing dialog.

Assistant: 2,290

Perfect! I'll help you create a restructured set of notes that follows our learning journey. Here's a clean organization:

# Hybrid Search in Apache Solr - Learning Notes

## Context & Goal
- **Name:** Mani
- **Background:** Search practitioner, intermediate Python coder, familiar with lexical search in Solr
- **Goal:** Understand hybrid search and re-ranking features in Solr
- **Application:** Lightspeed core implementation for OpenShift documentation

---

## What is Hybrid Search?

Hybrid search combines two search approaches:
1. **Lexical (Keyword) Search** - Traditional BM25, exact term matching
2. **Semantic (Vector) Search** - KNN search using embeddings to capture meaning

**Why combine them?**
- Keyword search: Great for exact matches, technical terms
- Vector search: Captures semantic meaning, handles synonyms and concepts
- Together: More robust and relevant results

---

## How This Implementation Works

### High-Level Flow

keyword_search(query) → retrieve top k*2 candidates → rerank using semantic similarity → return top k

**Key insight:** This is "keyword-first hybrid search"
- Keyword search acts as the retrieval/filtering stage
- Vector search refines the ranking of those candidates
- Documents must match keywords to be considered

### Librarian Analogy
- **Librarian #1 (Keyword):** Finds k*2 documents based on keywords and puts them on the table
- **Librarian #2 (Vector):** Takes *only those k*2 documents* and re-ranks them by semantic similarity
- You get the top k from Librarian #2's re-ranking

---

## Code Implementation Details

### The Function

async def query_hybrid(
    embedding: NDArray,          # Query vector
    query_string: str,           # Query text
    k: int,                      # Final results wanted
    score_threshold: float,      # Minimum score filter
    reranker_type: str,
    reranker_params: dict,       # Contains boost values
)

### Solr Query Parameters

data_params = {
    # Initial keyword retrieval
    "q": query_string,                    # Your keyword query
    "defType": "edismax",                 # Extended DisMax parser
    "rows": k,                            # Final result count

    # Reranking stage
    "rq": "{{!rerank reRankQuery=$rqq reRankDocs={k*2} reRankWeight={vector_boost}}}",
    "rqq": "{{!knn f={vector_field} topK={k*2}}}{vector_str}",

    # Other params
    "fl": "*, score",                     # Return all fields + score
    "fq": ["product:*openshift*"],        # Filter query
    "wt": "json",                         # Response format
}

### Parameter Breakdown

| Parameter | Purpose |
|-----------|---------|
| `q` | Keyword search query |
| `rq` | Rerank instruction |
| `reRankDocs` | How many top docs to rerank (k*2) |
| `reRankQuery` | What query to use for reranking ($rqq) |
| `reRankWeight` | How much semantic score matters |
| `rqq` | The KNN vector similarity query |

---

## Understanding reRankWeight

### The Scoring Formula

Final Score = Keyword_Score + (reRankWeight × Vector_Score)
Final Score = A + (reRankWeight × B)

### Example with Different Weights

**Starting documents:**
| Doc | Keyword Score (A) | Vector Score (B) |
|-----|-------------------|------------------|
| Doc1 | 10.0 | 0.8 |
| Doc2 | 8.0 | 0.95 |
| Doc3 | 6.0 | 0.7 |

**reRankWeight = 0** (pure keyword)
- Doc1: 10.0  Doc2: 8.0  Doc3: 6.0

**reRankWeight = 2** (slight semantic boost)
- Doc1: 11.6  Doc2: 9.9  Doc3: 7.4
- Same ranking order

**reRankWeight = 20** (strong semantic)
- Doc2: 27.0  Doc1: 26.0  Doc3: 20.0
- **Ranking changed!** Doc2 now wins due to higher semantic similarity

### Key Insight
The higher the reRankWeight, the more semantic similarity can override keyword ranking.

---

## Query Type Strategy

Different queries need different reRankWeight values:

### 1. Exact Technical Queries
- **Examples:** "CVE-2024-1234", "error code 404", "kubectl get pods"
- **Strategy:** Low reRankWeight (1-2)
- **Why:** Need exact keyword matches, can't let semantics override

### 2. Conceptual Queries  
- **Examples:** "how to improve performance", "best practices for security"
- **Strategy:** High reRankWeight (15-20)
- **Why:** Meaning matters more than exact words

### 3. Mixed Queries
- **Examples:** "how to patch CVE-2024-1234", "deploy nodejs on kubernetes"
- **Strategy:** Medium reRankWeight (5-8)
- **Why:** Both keywords and concepts matter

---

## Implementation Plan for Our Application

### Three-Tier Approach

| Tier | Detection | reRankWeight | Example Queries |
|------|-----------|--------------|-----------------|
| **Exact Match Critical** | Security intent (CVE, Errata) | 1-2 | "CVE-2024-1234" |
| **Balanced** | Default for most queries | 5-8 | "deploy nodejs app" |
| **Semantic Heavy** | Questions, how-to, best practices | 15-20 | "how to improve security?" |

### Implementation Steps
1.  Extend intent detection to classify into three tiers
2.  Map each tier to reRankWeight value
3.  Test on historical query logs
4.  Monitor and iterate based on results

### Advantages We Have
- Existing intent detection system (can detect security intents)
- Access to query logs for experimentation
- Clear use cases (OpenShift documentation)

---

## Key Takeaways

1. **Hybrid search = Keyword retrieval + Semantic reranking** (in this implementation)
2. **reRankWeight controls the balance** between keyword and semantic scores
3. **Different query types need different strategies** - no one-size-fits-all
4. **Start simple, iterate based on data** - three tiers is a practical starting point
5. **This is keyword-first hybrid** - only reranks what keywords find first

---

## Next Learning Topics
- How to design experiments with query logs
- Alternative hybrid approaches (union-based, RRF)
- Dynamic reRankWeight based on query features
- Learning to Rank (LTR) for automatic weight optimization

Would you like me to create this as a note message in your notebook?

Prompt: 1

Yes

Assistant: 2,280

## Reference Materials
- Sease.io blog: Hybrid Search with Apache Solr
- Lightspeed implementation: solr_vector_io/solr.py