Hacker News | shahules's comments

My team works on automatic environment generation for RL post-training. One of our projects is using coding agents to build web clones for BUAs/CUAs.

We tested Gemini, Claude Code, GLM, and Codex using our harness on their abilities to recreate a Slack workspace and benchmarked their performance.

Saw a variety of results:

- *Gemini 3 Pro:* Achieved the highest visual score (0.91 SSIM) but lacked interactive functionality.
- *Claude Opus 4.6:* Developed the most complete application, balancing full interactivity with consistent self-correction.
- *GLM-5:* Produced the best code architecture but reached a plateau in visual improvement.
- *GPT-5.3 Codex:* Initialized quickly but entered a five-hour "scaling spiral" that failed to yield further progress.
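For readers unfamiliar with the visual score: SSIM compares two images by their means, variances, and covariance. The sketch below is a simplified single-window (global) variant over flat lists of grayscale pixel values. The standard metric averages this quantity over local windows, and the comment above does not say how the harness computes it, so treat `global_ssim` and its parameters as illustrative assumptions only.

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM for two equal-length lists of grayscale pixels.

    Hypothetical helper: computes the SSIM formula once over the whole
    image instead of averaging over local windows.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")

    # Stabilising constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    cov = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A score near 1 means the two images are structurally very similar, which is why a clone can score highly on this metric while still having no working interactions.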

Next, we’re planning:

- More web apps for cloning and benchmarking across the models
- More functionality (the trajectory didn’t include full Slack features)
- Better scoring for functionality (easier to catch Gemini’s mistake)

Repo: https://github.com/vibrantlabsai/cloning-bench

Blog post: https://vibrantlabs.com/blog/pa-bench


Nice, their training recipe seems unique.


After doing a few experiments, I think that having agents work in the browser for all tasks wouldn't be best, due to factors like token cost and safety. But a browser/computer can be a tool that the agent uses alongside MCPs to complete tasks that require interaction with such modalities.


There are a few agents, like browser-use and skyvern, that may provide this capability.


Most current web agent benchmarks focus on single-tab tasks (e.g., 'go to Gmail and star this email'). We found that frontier models that score highly on those tasks (like in WebArena) often fall apart when they have to coordinate context across 2+ applications. We built a simulated environment with scenarios and deterministic verifiers to see why.
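To make "deterministic verifier" concrete, here is a minimal sketch of one for a two-application scenario. Everything in it (`EmailApp`, `CalendarApp`, `Environment`, `verify_meeting_scheduled`) is a hypothetical name invented for illustration, not the actual environment's API. The idea is just that success is decided by inspecting the final state of both apps, with no model-based judging.

```python
from dataclasses import dataclass, field


@dataclass
class EmailApp:
    # Hypothetical simulated mail app: ids of starred emails.
    starred: set = field(default_factory=set)


@dataclass
class CalendarApp:
    # Hypothetical simulated calendar: event title -> ISO date string.
    events: dict = field(default_factory=dict)


@dataclass
class Environment:
    email: EmailApp
    calendar: CalendarApp


def verify_meeting_scheduled(env, email_id, title, date):
    """Pass only if BOTH apps ended in the expected state.

    The check is a pure function of the final environment state, so the
    same trajectory always receives the same verdict.
    """
    email_ok = email_id in env.email.starred
    calendar_ok = env.calendar.events.get(title) == date
    return email_ok and calendar_ok
```

An agent that completes only the email half of the task fails this check, which is the kind of cross-application failure a single-tab benchmark would not surface.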


It's an interesting article and I agree with some of the points you brought up. But here are a few I don't agree with:

1. Evals are used throughout the article in the sense of LLM benchmarking, but this is not the point. One could effectively evaluate any AI system by building custom evals.

2. The purpose of evals is to help devs systematically improve their AI systems (at least that's how we look at it), not any of the purposes listed in your article. It's not a one-time thing; it's a practice, like the scientific method.


2. I think improvement is the next step. KNOWING whether the system even performs according to set criteria is more important. You can't tell that a system is improving if you don't have any evals to measure it.


Can't agree with you more, my friend. Another point, on a philosophical level, is the pursuit of efficiency or optimization in life, which always focuses on the tangible aspects and ignores the greater intangible aspects of life.


Deepeval also uses Ragas underneath. They initially took a different approach by allowing users to formulate test cases, whereas we were focusing on RAG only and building metrics and features like synthetic test data generation for it. Now that we are doing well in the RAG category, we also want to expand to solve the greater challenge.


I think it's true for any early-stage library/framework. The tradeoff is that you will then have to keep maintaining it, add support for other LLMs if you switch LLMs, etc. In the end OSS will be far ahead, because by that time it will have smoothed out its rough edges.


Or OSS will be going in a different direction than what you need, so if you are using it you'll either be stuck on an old version or have to keep fighting it. ML libraries in particular have this annoying habit of not being very backwards compatible over more than 2-3 years.


Hey, DeepEval is interesting. What do you mean by "evaluating any LLMs"?

