Comment by Elon Musk on "Build a production-grade RAG or similarity search using Qdrant and Langchain"

Dear Kevin Naidoo,

I am Elon, and thank you for the informative article on RAG conversational AI. I'm still curious, however, about the choice of Qdrant over Faiss for production-grade deployments.

As mentioned previously, I'm developing a chatbot for a public academic website. I'm facing a few challenges:

Handling High Call Volume: I need to manage multiple chatbot interactions within a short timeframe. To address this, I'm considering the asynchronous Qdrant client. However, I'm unsure about the optimal call rate per minute for the chatbot when using a large model like GPT-3.5-turbo.
Balancing Updates and Queries: The chatbot's knowledge base needs to stay up-to-date, requiring daily backend updates. I'm unsure how to handle both updates and queries simultaneously. Should chatbot calls be blocked during database updates?

Given my limited experience with chatbot development, I haven't found clear answers to these questions. Additionally, I'd appreciate insights on the potential drawbacks of using Faiss for these concerns.

Thank you for your time and assistance.

Best regards,

Hi Elon Musk, Faiss runs in memory. It's efficient, but for large applications you would want replication so you can split reads from writes and distribute traffic better.

This is why Qdrant is a better fit. Furthermore, when you restart your application instance, Faiss loses its in-memory index. You can write the buffer to disk and reindex on the next startup, but as you can imagine this is not very efficient.

You can learn more about how to distribute Qdrant over many servers here: qdrant.tech/documentation/guides/distributed_depl…

Qdrant can handle multiple concurrent writes just fine, like any other DB, so updating your RAG data in real time should not be a problem. As mentioned above, you can always split the traffic with distributed deployments.

As for multiple calls to GPT-3.5-Turbo: since the heavy lifting is done on the OpenAI side, you should not be using much GPU resource on your end, if any. The only limitation would be your web server or application server and the number of requests it can handle per second.
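If you want a ceiling on how many chatbot requests are in flight at once, one common approach is an asyncio semaphore around the outbound call. This is only a rough sketch under assumptions of mine: call_llm is a hypothetical stub standing in for the real GPT-3.5-Turbo request, and MAX_IN_FLIGHT is a made-up number you would tune against your own rate limits and server capacity.

```python
import asyncio

MAX_IN_FLIGHT = 5  # hypothetical cap; tune against your rate limits and server capacity


async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real GPT-3.5-Turbo request."""
    await asyncio.sleep(0.01)  # simulates network latency; no local GPU work
    return f"answer to: {prompt}"


async def answer(prompt: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    # Waits here whenever MAX_IN_FLIGHT calls are already running.
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            return await call_llm(prompt)
        finally:
            stats["active"] -= 1


async def handle_burst(prompts: list[str]) -> tuple[list[str], int]:
    """Answer a burst of prompts and report the peak concurrency observed."""
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"active": 0, "peak": 0}
    replies = await asyncio.gather(*(answer(p, limiter, stats) for p in prompts))
    return list(replies), stats["peak"]


if __name__ == "__main__":
    replies, peak = asyncio.run(handle_burst([f"question {i}" for i in range(20)]))
    print(len(replies), "replies, peak concurrency", peak)
```

Requests beyond the cap simply wait their turn instead of failing, so a burst of users slows down gracefully rather than overwhelming the application server.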

Try github.com/tsenart/vegeta to simulate high traffic and see how your machines respond.

All the best, hope this helps.
