
> NTransformer: a high-efficiency C++/CUDA LLM inference engine. It runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory over PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.

untested:

https://github.com/xaskasdf/ntransformer



