nhirschfeld · 2026-05-20T09:24:37

Hi HN!

I'm the maintainer of Kreuzberg, an open-source document intelligence library (https://github.com/kreuzberg-dev/kreuzberg). Some of you may have used it for RAG ingestion.

We're launching Kreuzberg Cloud, available as a SaaS API and as a self-hosted system. It's in public beta, and I would like to invite you all to give it a try.

What our MVP offers: very fast, CPU-optimized document and code intelligence. You can extract content from more than 90 document file formats and 300 code file formats into Markdown (or plaintext/djot), with additional features (same pricing tier) including chunking, embeddings, and keyword extraction, plus various types of intelligence.

The OSS library is used as the base engine of the cloud system. Our initial offering is $0.008/page, and you get the first 10K pages free, no card required.

We also offer our entire system for self-hosting via Helm charts. We are looking for design partners, so if that's relevant, drop me a line.

v-tan · 2026-05-20T09:38:51

Thanks! I was waiting for this!

Tirtz · 2026-05-20T09:27:51

Amazing work!

nhirschfeld · on Feb 16, 2025

You'll need to use a different OCR engine. Take a look at EasyOCR.

nhirschfeld · on Feb 16, 2025

Yes, there have already been several suggestions here for other backends.

You should try a different PSM (Tesseract's page segmentation mode) to see if you get better results.

If it's scientific texts specifically, look at GROBID.
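
To make the PSM suggestion concrete, here is a minimal sketch that runs one image through several page segmentation modes using the Tesseract command line (`--psm` and `-l` are real Tesseract flags, and `stdout` as the output base prints the text). The helper names `tesseract_command` and `compare_psms` are made up for illustration, and this is not how Kreuzberg calls Tesseract internally.

```python
import shutil
import subprocess


def tesseract_command(image_path: str, psm: int, lang: str = "eng") -> list[str]:
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def compare_psms(image_path: str, psms: tuple[int, ...] = (3, 4, 6, 11)) -> dict[int, str]:
    """Run the same image through several PSMs; needs tesseract on PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results: dict[int, str] = {}
    for psm in psms:
        done = subprocess.run(
            tesseract_command(image_path, psm),
            capture_output=True,
            text=True,
            check=True,
        )
        results[psm] = done.stdout
    return results
```

Mode 3 is the default (fully automatic layout analysis), 4 assumes a single column, 6 a single uniform block of text, and 11 sparse text, so comparing those four on a problem page is usually a reasonable first experiment.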

nhirschfeld · on Feb 15, 2025

That's why Kreuzberg also exposes a sync API for you to consume.

nhirschfeld · on Feb 15, 2025

I'm actually considering another library with an optional API called `Kreuzköln` - probably without the umlaut!

nhirschfeld · on Feb 15, 2025

Retrieval-Augmented Generation. It's a class of techniques in which an LLM generates content grounded in documents retrieved at query time. I'd recommend Googling this.

tomcam · on Feb 16, 2025

Was going to reply indignantly that it's hard to google rag and get that answer when I read your comment. Then I did, and it was the first result.

Apologies!

maxnoe · on Feb 16, 2025

I understood the comment as "Google <the long version I provided> to get more info"

nhirschfeld · on Feb 15, 2025

Thanks for asking!

It's both. The OCR part is ofc CPU-bound, but the entire text extraction also involves reading files, or writing and then reading files.

Without async, these I/O operations simply block the event loop.

As for efficiency - if you're working in an async application context you have to "asyncify" these operations, or every other task stalls while they run.
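
As a rough illustration of what "asyncify" means here, the sketch below pushes a blocking call onto a worker thread with `asyncio.to_thread` so the event loop can keep serving other tasks. `blocking_extract` is a hypothetical stand-in that just sleeps; it is not Kreuzberg's API.

```python
import asyncio
import time


def blocking_extract(path: str) -> str:
    """Hypothetical stand-in for a blocking step such as file I/O or an OCR call."""
    time.sleep(0.2)  # simulates work that would otherwise block the event loop
    return f"text from {path}"


async def extract_async(path: str) -> str:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_extract, path)


async def extract_many(paths: list[str]) -> list[str]:
    # gather() runs the extractions concurrently and preserves input order.
    return await asyncio.gather(*(extract_async(p) for p in paths))
```

Calling `asyncio.run(extract_many(["a.pdf", "b.pdf"]))` overlaps the waits instead of running them back to back. Threads help when the blocking step waits on I/O or on an external process; CPU-heavy pure-Python work would need a process pool instead.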

skavi · on Feb 16, 2025

in that case, what’s the deal with extract_bytes being async? i’m not incredibly familiar with python, but i’d expect a “byte string” to be in memory.

nhirschfeld · on Feb 16, 2025

You still need to write it to file to process it via pandoc/tesseract etc.

There are alternative options to tesseract ofc.
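
For readers wondering why in-memory bytes still end up async, here is a minimal sketch of the general pattern, under the assumption that the external tool only accepts a file path: write the bytes to a temporary file off the event loop, then await the subprocess. All names (`_write_temp`, `run_tool_on_bytes`, `STAND_IN`) are illustrative, and a Python one-liner stands in for pandoc/tesseract so the example runs anywhere. It is not Kreuzberg's actual implementation.

```python
import asyncio
import os
import sys
import tempfile


def _write_temp(data: bytes, suffix: str) -> str:
    # Blocking write; the caller runs this in a worker thread.
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


async def run_tool_on_bytes(data: bytes, command: list[str], suffix: str = ".bin") -> str:
    """Write data to a temp file, run `command + [path]`, and return its stdout."""
    path = await asyncio.to_thread(_write_temp, data, suffix)
    try:
        proc = await asyncio.create_subprocess_exec(
            *command,
            path,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(stderr.decode(errors="replace"))
        return stdout.decode()
    finally:
        os.unlink(path)  # always clean up the temp file


# Stand-in for pandoc/tesseract: prints the size of the file it is given.
STAND_IN = [sys.executable, "-c", "import os, sys; print(os.path.getsize(sys.argv[1]))"]
```

Both the file write and the subprocess wait are points where a synchronous version would block, which is why the bytes variant ends up async as well.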

LoganDark · on Feb 16, 2025

> You still need to write it to file to process it via pandoc/tesseract etc.

This sounds... I guess Pythonic? Sheesh.

nhirschfeld · on Feb 15, 2025

Yup, EasyOCR is good.

My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.

It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.

cdrini · on Feb 15, 2025

Where did you find benchmarks for OCR tools? There have been so many OCR engines coming out lately, I would love to see benchmarks!

nhirschfeld · on Feb 15, 2025

I googled this for a while...

alex_suzuki · on Feb 15, 2025

Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR

Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.

nhirschfeld · on Feb 15, 2025

I haven't; testing it out is on my to-do list for sure.

nhirschfeld · on Feb 15, 2025

interesting!

nhirschfeld · on Feb 15, 2025

lol ;).

But seriously, in 13 years living here, only one guy tried to pickpocket me.

tymm · on Feb 15, 2025

I've lived in 36 for 15 years or so. Wasn't as lucky as you :)

nhirschfeld · on Feb 15, 2025

Sorry to hear...