mandatory · 2025-10-03T01:29:56 1759454996

Good news for curl users: https://github.com/mandatoryprogrammer/thermoptic

benatkin · 2025-10-03T03:47:53 1759463273

> NOTE: Due to many WAFs employing JavaScript-level fingerprinting of web browsers, thermoptic also exposes hooks to utilize the browser for key steps of the scraping process. See this section for more information on this.

This reminds me of how Stripe does user tracking for fraud detection: https://mtlynch.io/stripe-update/ I wonder if thermoptic could handle that.

mips_avatar · 2025-10-03T05:06:59 1759468019

Cool project!

mandatory · 2025-10-03T05:41:49 1759470109

Thanks!

joshmn · 2025-10-03T03:32:08 1759462328

Work like this is incredible. I did not know this existed. Thank you.

mandatory · 2025-10-03T05:41:24 1759470084

Thanks :) if you have any issues with it let me know.

snowe2010 · 2025-10-03T14:50:11 1759503011

People like you are why independent sites can’t afford to run on the internet anymore.

1gn15 · 2025-10-06T05:59:55 1759730395

I block all humans (only robots are allowed) and I'm still able to run independent websites.

mandatory · 2025-10-03T16:47:34 1759510054

They can't? I've run many free independent sites for years, that's news to me.

timbowhite · 2025-10-03T18:21:35 1759515695

I run independent websites and I'm not broke yet.

Symbiote · 2025-10-03T11:26:37 1759490797

Oh great /s

In a month or two, I can be annoyed when I see some vibe-coded AI startup's script making five million requests a day to work's website with this.

They'll have been ignoring the error responses:

  {"All data is public and available for free download": "https://example.edu/very-large-001.zip"}

— a message we also write in the first line of every HTML page source.

Then I will spend more time fighting this shit, and less time improving the public data system.

mandatory · 2025-10-03T18:42:55 1759516975

Feel free to read the README. Startups could already pay for this ability through private premium proxy services before thermoptic existed.

Having an open source version allows regular people to do scraping and not just those rich in capital.

Many of the best data services on the internet started with scraping; the README lists a number of them.

mandatory · on Aug 11, 2024

Yep, that's an example of what the automated scanning looks for. You can see a very similar example in the slides: https://media.defcon.org/DEF%20CON%2032/DEF%20CON%2032%20pre...

mandatory · on Aug 11, 2024

It's just because I did this talk and made FindThatMeme :) so not a popular method, just what I used to do large scale OCR.

krackers · on Aug 11, 2024

Oh I completely missed that you're actually the same guy!

mandatory · on Jan 11, 2023

Yes I definitely want to improve the search to be better. It is currently very text heavy and I (only recently) got image similarity indexing working. Hoping to leverage this to do something like you mentioned!

I'd also like to figure out how to turn an image into a description of what's in it. My ML/TensorFlow knowledge is very weak though, so I still have a lot to learn here.

mandatory · on Jan 11, 2023

The image similarity search is probably a blog post of its own.

Short TL;DR: It runs off my home server running a large vector database (opendistro): https://opendistro.github.io/for-elasticsearch-docs/docs/knn...
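For readers unfamiliar with the idea, here is a minimal sketch of what a nearest-neighbor image similarity lookup does conceptually. It is not the Open Distro k-NN API and not the author's code. It assumes each image has already been reduced to a small embedding vector, and the file names, vectors, and the `nearest_images` helper are all made up for illustration. A real vector database would use an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_images(query, index, k=3):
    """Return the names of the k indexed vectors most similar to the query (brute force)."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy index: image name -> embedding vector.
index = {
    "cat_meme.png": [0.9, 0.1, 0.0],
    "cat_meme_cropped.png": [0.8, 0.2, 0.1],
    "dog_meme.png": [0.1, 0.9, 0.2],
}
```

Calling `nearest_images([1.0, 0.0, 0.0], index, k=2)` ranks the two cat images ahead of the dog image, which is the behavior a similarity search over meme embeddings would be expected to show.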

mandatory · on Jan 11, 2023

Nope, you can use it totally offline. No way of getting banned as far as I'm aware.

mandatory · on Jan 11, 2023

Yep, this is exactly what I'm running on the raspberry pi LB. Nginx makes it super easy!

mandatory · on Jan 11, 2023

Author here: KnowYourMeme is one of many sites that memes are continually ingested from (any site that has memes I try to ingest regularly) :)

aemreunal · on Jan 11, 2023

Amazing work! Also, thank you for making that feed on the main page, been laughing for a while here :D

yojo · on Jan 11, 2023

Also lost 20 minutes doom scrolling that feed. Add an upvote button and some ML and you could destroy some lives.

mandatory · on Jan 11, 2023

Thanks! Comment made my night.

GistNoesis · on Jan 11, 2023

Nice iPhone cluster.

Have you tried something based on deep learning that uses Transformers: https://github.com/roatienza/deep-text-recognition-benchmark (available weights are for tasks that seem similar to OCR, so there is a good chance you can use it out of the box). With a good GPU it should process hundreds to thousands of images per second, so you can likely build your index in less than a day. (Maybe you can even port it to your iPhone stack :) )

https://github.com/microsoft/GenerativeImage2Text (You'll probably have to train it on the custom dataset you have put together)

There are tons of other freely available solutions that you can get with a search for things with keywords like "image to text ocr" "transformers" "visual transformers"...

astrange · on Jan 12, 2023

You can do better than a general image-to-text model reading memes, because they all use the same fonts - so you want something trained off synthetic data made with that font.

generalizations · on Jan 11, 2023

Personally, I've been hunting for something that can extract both the text and the associated image. I've never seen anything that can do both.

taneq · on Jan 11, 2023

All hail the memelord!

spiffytech · on Jan 12, 2023

How do you ingest your social circle's in-group memes? Are they reliably posted to meme generator sites?

counttheforks · on Jan 11, 2023

What about copyright?

mandatory · on Oct 18, 2022

What kind of app? I'm planning on releasing a developer API soon so people can integrate it into their own bots/services.

mandatory · on Oct 16, 2022

Thanks!