
BTW, about the average-of-embeddings part: couldn't an LLM do that for a large context? Say the context is book-sized; instead of attending over each word, attend over each paragraph?


#1. The biggest constraint for me is that I do embeddings locally. It's absolutely _bonkers_ that these "vector DBs" are storing a ton of private content. But it's _barely_ possible locally: about 30 ms to get an embedding for 250 words on a two-year-old iPhone. So once there's a ton of downloaded URLs to choose from (say, 100), it's a fantastic way to filter between "oh, these are from your query about quiche, and these are from your query about LA Lakers stats" and still get an answer in under 10 seconds.
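The filtering step described above can be sketched as cosine similarity between a query embedding and cached document embeddings. This is a minimal illustration, not the commenter's actual code; `top_k_docs` is a hypothetical helper, and the embeddings are assumed to already exist as numpy vectors.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_docs(query_vec: np.ndarray, doc_vecs: list, k: int = 5) -> list:
    """Indices of the k cached documents most similar to the query."""
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

With ~100 cached URLs and 30 ms per embedding, the expensive part is embedding new text; scoring pre-computed vectors like this is effectively free.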

#2. Real talk? I hate to assert something that sounds so opinionated, but for brevity's sake:

The long-context stuff is miserable: more hype, and at best a somewhat decent scratchpad, than genuinely useful. I have a feeling the next GPT will be much better. Claude 3 is a qualitative jump from both Gemini 1.5's "1 million" tokens and GPT-4's 128K. It seems to me the long-context stuff was a technical achievement layered on top of models trained on ~4K-token materials.

Whenever I see people asserting long context will kick RAG's ass, I can tell they're either arguing "eventually, on a long enough timeline," or they don't have a practical understanding of how it's working. E.g., one of my final user stories was translating the entirety of Moby Dick into over-the-top Zoomer. You simply can't get it done with chunks larger than 5K tokens, with any model.
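Staying under a per-chunk token budget like the ~5K mentioned above might look like the sketch below. This is my own illustration, not the commenter's pipeline; it uses whitespace-separated words as a cheap proxy for tokens (the ~0.75 words-per-token ratio is a rough English-text assumption, and a real tokenizer would be more accurate).

```python
def chunk_text(text: str, max_tokens: int = 5000,
               words_per_token: float = 0.75) -> list:
    """Split text into chunks that stay under a rough token budget.

    Uses whitespace words as a proxy for tokens; ~0.75 words per token
    is a rule-of-thumb for English prose, not an exact count.
    """
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk is then translated independently, which is exactly why "just throw the whole book in the context" isn't a substitute for this kind of pipeline.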

It's pretty darn expensive to just throw tokens at it in any context outside of using the cheapest model for extraction. Like, the latest round of GPT-4 price cuts finally gets answers my anesthesiologist friend loves under $0.15, and that's with just 4K input tokens, about 12 pages.
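The cost math above is easy to sanity-check. The rates below are illustrative GPT-4 Turbo-era per-1K-token prices I'm assuming for the example, not figures from the comment; substitute current pricing.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = 0.01, out_rate: float = 0.03) -> float:
    """Dollar cost of one request.

    in_rate / out_rate are $ per 1K tokens (illustrative assumptions,
    roughly GPT-4 Turbo-era pricing -- check current rates).
    """
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate
```

At those assumed rates, 4K input tokens plus a 1K-token answer lands well under $0.15, but the bill scales linearly with context, which is the whole point: stuffing a book into the prompt on every question gets expensive fast.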

Random anecdote, re: the irony that Claude's qualitative leap in using long context causes novel issues:

It's kinda funny, because it's actually caused a nasty problem for me. Conversation flow used to look something like:

1. USER: question.

2. (programmatic, invisible) USER: Please generate 3 search queries to help me answer $question.

3. (invisible) AI: $3_queries.

4. (do all the humdrum RAG stuff to make #5).

5. (programmatic, invisible) USER: Here are N documents I looked up, use them to answer $question.

6. AI: $answer.

Claude is the first model that notices I don't inject $3_queries back into the conversation before $answer. So it thinks it still has to answer #2 before it answers #5, and $answer starts with 3 search queries.
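The six-step flow above can be sketched as message-list assembly. This is a hypothetical reconstruction of what the commenter describes, not their code; `build_messages` and the dict-based message shape are my own illustration. The key detail is that step 2's prompt stays in the history while the model's reply ($3_queries, step 3) is never injected back, which is the dangling instruction Claude notices.

```python
def build_messages(question: str, docs: list) -> list:
    """Assemble the RAG conversation described above.

    Step 2's query-generation prompt is kept, but the model's reply
    ($3_queries, step 3) is deliberately dropped -- so a model that
    reads the whole context sees an unanswered instruction.
    """
    return [
        {"role": "user", "content": question},                          # step 1
        {"role": "user", "content":
            f"Please generate 3 search queries to help me answer: "
            f"{question}"},                                             # step 2
        # step 3 ($3_queries) is missing here -- the bug Claude notices
        {"role": "user", "content":
            f"Here are {len(docs)} documents I looked up, use them to "
            f"answer: {question}\n" + "\n---\n".join(docs)},            # step 5
    ]
```

Earlier models skimmed past the gap; a model that genuinely attends to the whole context dutifully answers step 2 first.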


Why do #5 as an extension of the original chat? Just start a clean context. I actually wish this were something lmql added: the ability to start a clean context but still refer to previous completions by name. I have a hacky implementation of this, but it would be great if it were built into guidance or lmql.
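The "clean context with named completions" idea could be sketched like this. Everything here is hypothetical, neither lmql nor guidance API: `save` and `fresh_context` are invented names, and the store is just a dict of completion names to text.

```python
# Hypothetical sketch: store earlier model outputs under names, then
# start a fresh single-message context that interpolates them by name.
completions = {}

def save(name: str, text: str) -> None:
    """Record a completion from an earlier (now discarded) context."""
    completions[name] = text

def fresh_context(template: str) -> list:
    """Build a brand-new chat whose one prompt references saved
    completions via {name} placeholders -- no prior turns carried over."""
    return [{"role": "user", "content": template.format(**completions)}]
```

With this shape, the document-answering step (#5) never sees the query-generation exchange at all, so there's no dangling instruction for the model to "helpfully" answer.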



