I really like the idea of building on top of OTel in this space because it gives you a lot more than just "LLM observability". In particular, it makes it much easier to get observability across your entire agent (rather than just the LLM calls).
I'm working on a tool to track semantic failures (e.g. hallucination, calling the wrong tools, etc.). We purposefully chose to build on top of Vercel's AI SDK because of its OTel integration. It takes literally 10 lines of code to start collecting all of the LLM-related spans and run analyses on them.
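For anyone curious, the wiring is roughly this (a minimal sketch; the model, prompt, and `functionId` are illustrative, and you still need an OTel SDK registered in your app, e.g. via `registerOTel` from `@vercel/otel` in a Next.js instrumentation file):

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Once telemetry is enabled, the AI SDK emits OpenTelemetry spans for this
// call, which flow to whatever OTel exporter your app has registered.
const { text } = await generateText({
  model: openai("gpt-4o"),
  prompt: "Summarize this support ticket...",
  experimental_telemetry: {
    isEnabled: true,
    functionId: "summarize-ticket", // tags the span so you can filter on it later
  },
});
```

From there it's just a matter of pointing the exporter at your analysis pipeline.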
> I'm not clear why you are focusing on hashing user ids. Nor how you landed on a 50/50 split
I landed on hashing and splitting from my research on building A/B tools, but none of that research was aimed at building real, enterprise products (which is why I asked the question here). From your reply, I take it this isn't as important as my reading suggested?
> When someone logs in, write a record of which one they got.
I'm confused about what you mean by "which one they got". How do I know which version to assign them in the first place? This is what I assumed hashing would solve - it gives us a reliable way to "choose" a version for any given user (see the sketch at the end of this comment for what I had in mind).
> why assign 50% of an entire userbase to a feature being tested that only 10% of the users touch?
This makes sense; I'm not sure why I had landed on 50%. So the exact percentage split doesn't matter? I had assumed we'd need a way to enforce a certain split - how do I prevent one variant from reaching only 0.01% of the userbase while the other reaches 99.99%?
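For concreteness, here's roughly the hash-based assignment I had in mind (a minimal sketch; the hash function, bucket count, experiment name, and percentages are all just illustrative choices):

```ts
import { createHash } from "node:crypto";

// Hash (experiment, userId) to a stable bucket in [0, 10000). The same user
// always lands in the same bucket, so assignment is deterministic without
// storing anything up front, and over many users roughly pctB% of them fall
// below the cutoff - which is how the split percentage gets enforced.
function assignVariant(userId: string, experiment: string, pctB: number): "A" | "B" {
  const digest = createHash("sha256").update(`${experiment}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 10_000;
  return bucket < pctB * 100 ? "B" : "A";
}

// e.g. a 10% exposure for variant B:
const variant = assignVariant("user-42", "new-checkout-flow", 10);
```

Writing the result to a record at login (as you suggested) would then make the assignment durable even if the percentages change later.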
Thanks again for your reply, you've been really helpful.
Thanks for responding, I appreciate the insights here! I'm not focused on offering every single feature at the moment. This is a brand-new project, and it isn't in the same market as the products you mentioned (AB Tasty, Optimizely, etc.). Those products may offer a lot more than A/B testing, but for my situation I have a single, clear problem to solve, and it doesn't require much beyond that.
No probs at all mate, feel free to reach out if you need to bounce ideas. As I say I’ve quite a bit of experience from leading out the development offering in an agency and building a bespoke AB Testing IDE.