Nice list! I'd say SQLite with WAL is the biggest money saver mentioned.
One note: you can absolutely use Python or Node just as well as Go. Hetzner, for example, offers machines with 4GB RAM, 2 vCPUs, and 10TB of traffic (then $1/TB egress) for $5/mo.
Two disclaimers for VPS:
If you're using a dedicated server instead of a cloud server, just don't forget to back up your DB to a Storage Box often ($3/mo for 1TB; use rsync). It's good practice either way, but cloud instances seem more resilient to hardware faults. Also avoid their object store.
You are responsible for security. I've seen good devs skip basic SSH hardening and get infected by bots in under an hour. My go-to move when I spin up servers is a two-stage Terraform setup: first I set up SSH with only my IP allowed, then I set up Tailscale and shut down the public SSH entrypoint completely.
Personally for backups I’d avoid using a product provided by the same company as the VM I’m backing up. You should be defending against the individual VM suffering corruption of some kind, needing to roll back to a previous version because of an error you made, and finally your VM provider taking a dislike to you (rationally or otherwise) and shutting down your account.
If you’re backing up to a third party losing your account isn’t a disaster, bring up a VM somewhere else, restore from backups, redirect DNS and you’re up and running again. If the backups are on a disk you can’t access anymore then a minor issue has just escalated to an existential threat to your company.
Personally I use Backblaze B2 for my offsite backups because they’re ridiculously cheap, but other options exist and Restic will write to all of them near identically.
> You are responsible for security. I've seen good devs skip basic SSH hardening and get infected by bots in under an hour. My go-to move when I spin up servers is a two-stage Terraform setup: first I set up SSH with only my IP allowed, then I set up Tailscale and shut down the public SSH entrypoint completely.
Note that you don't need all of that to keep your SSH server secure. Just having a good password (ideally on a non-root account) is more than enough.
I'd call it unnecessary exposure. Under both modern threat models and classic cybernetic models (see the law of requisite variety), removing as much attack surface as possible is optimal. Disabling passwords in SSH especially is infosec 101 these days: no need to worry about brute-force attacks, credential stuffing, or simple human error, which was the cause of every attack I've seen directly.
It's easier to add a small bit of config to Terraform and make your setup at least key-based.
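For reference, the sshd_config lines that enforce this are short (a minimal sketch, assuming a reasonably recent stock OpenSSH; adjust to taste):

```
# /etc/ssh/sshd_config -- keys only, no passwords
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
```

Then reload sshd, and keep a second session open while testing so you don't lock yourself out.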
Once I had a PostgreSQL DB with a default password on a new VPS, forgot to disable password-based login, and the server had no domain. It got hacked within a day and was being used as a bot server. And that was 10 years ago.
Recently I deployed a server and was getting SSH login attempts within an hour, and it didn't have a domain either. Fortunately, I'd learned my lesson and turned off password-based login as soon as the server was up and running.
And similar attempts once bogged my desktop down to a halt.
Having a machine open to the world is now very scary. Thank God services like Tailscale exist.
I need more info about devs getting infected over SSH in less than an hour. Unless they had a comically weak root password or left VNC exposed, I don't believe it at all.
Yes, <1h was a weak root password. All the attacks I've seen directly were user error. The point is removing attack surfaces entirely rather than hardening needlessly exposed internet-facing protocols.
> Nice list! I'd say SQLite with WAL is the biggest money saver mentioned.
Funny you should say that. I migrated an old Django web site to a slightly more modern architecture (docker compose with uvicorn instead of bare-metal uWSGI) the other day, and while doing that I noticed it doesn't need PostgreSQL at all. The old server already had it installed, so it was the lazy choice.
I just dumped all data and loaded it into an SQLite database with WAL and it's much easier to maintain and back up now.
Does WAL really offer multiple concurrent writers? I know little about DBs, and from a couple of Google searches people say it allows concurrent reads while a write is happening, but not concurrent writers.
Not everybody says so... So, can anyone explain what's the right way to think about WAL?
No, it does not allow concurrent writes (with some exceptions if you get into it [0]). You should generally use it only if write serialisation is acceptable. Reads and writes are concurrent except for the commit stage of writes, which SQLite tries to keep short but is workload- and storage-dependent.
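If it helps, here's a minimal sketch in Python demonstrating both halves of that: a reader seeing the last committed snapshot while a write transaction is open, and a second writer being blocked (the file path and timeout are arbitrary):

```python
import os
import sqlite3
import tempfile

# Throwaway database file just for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

w = sqlite3.connect(path)
w.execute("PRAGMA journal_mode=WAL")
w.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
w.execute("INSERT INTO t (v) VALUES ('committed')")
w.commit()

# Open a write transaction and leave it uncommitted.
w.execute("BEGIN IMMEDIATE")
w.execute("INSERT INTO t (v) VALUES ('uncommitted')")

# A second connection still reads the last committed snapshot.
r = sqlite3.connect(path)
rows = r.execute("SELECT v FROM t").fetchall()  # [('committed',)]

# But a second writer cannot start: WAL allows only one writer at a time.
w2 = sqlite3.connect(path, timeout=0.1)
blocked = False
try:
    w2.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

w.commit()  # releasing the write lock lets the next writer in
```

In an app you'd normally just set a busy timeout and let writers queue, rather than handle the lock error by hand.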
Now this is a more controversial take, and you should always benchmark against your own traffic projections, but: if you don't have a ton of indexes, SQLite's raw throughput is so good that for many access patterns you'd already have to shard a Postgres instance before SQLite's single-writer limitation became the bottleneck.
Thanks! I run SQLite in "production" too (is it production if you have no visitors?) with WAL mode enabled, but I had to work around concurrent writes, so I was really confused. I may have misunderstood the comments.
The first step is to get SSH set up correctly, and the second is to enable a firewall blocking incoming connections on everything except the key ports (SSH, ideally on a different port, plus HTTP/HTTPS). This immediately eliminates a swathe of issues!
Historical reliability and compatibility. They claimed they were S3 compatible, but they required deprecated S3 SDKs, and advanced S3 features are unimplemented (though at least they document it [0]). There were constant timeouts for object creation and updates, very slow speeds, and overall instability. Even now, if you check r/hetzner on Reddit, you'll see it's a reliability nightmare (take that with a grain of salt, though; nobody reports the absence of problems). Less relevant for DB backups, but billing is dumb too: even if you upload a 1KB file, they charge you for 64KB.
At least with a Storage Box you know it's just a dumb storage box, and you can SSH, SFTP, Samba, and rsync to it reliably.
I have a few tricks for handling procrastination that are in this ballpark:
1. When I catch myself wanting to procrastinate, I ask: 'If I follow this feeling, will it increase my power (i.e. capacity/agency/utility) or decrease it?' Then I have a dialogue with myself: either 'Nope, let's refocus; maybe try reading things out loud, drawing a diagram, or some other perspective change', or 'Yeah, I should stop for now and do something else, as long as that increases my power.'
2. I observed that procrastination really is tied to novelty, quite similar to how it's presented in the article, so I did this: instead of going on YouTube or playing games, I started doing typing exercises online. After some time I realised I could get better at typing and get some extra novelty by typing out an existing book! So I have a Tampermonkey script that, whenever I try to go to a random typing website, redirects me to a website where I can type books (I could publish it as a gist if anyone's interested). It stores in Local Storage which page I reached and where I left off. I got to read On the Origin of Species this way, and my typing went from 80 WPM to around 100 WPM.
Quite a lot. And I'd say I process more info by typing than by simply reading. I typed the first edition and got the printed second edition afterwards.
I always searched for videos of what he was describing and found quite amazing material for many of them (enslaver ants, ants tickling aphids, honeycomb construction).
I was super impressed to hear about Darwin's peers, whom he calls out by name every time, and how there were people specialised in breeding races and judging what constitutes a species.
I was kind of stunned to find out that people didn't know dogs were all one species, and how hard it was even for specialised breeders to notice that their pigeons were changing, since they weren't really taking pictures.
And how Darwin published a book approachable to common folks, built on mountains of hand-collected data.
There's so, so, so much more I could talk about (tree of life, organs, descendant resemblance happening at the same age, embryology weirdness) but biggest mind-fuck would be the anti-teleological stance he holds. Basically, out of nowhere (although I saw that he read Hume [0]), Darwin figures out that things don't happen 'for a reason'. Things don't live because they're 'better'. All the creatures we see today are simply the things that survived. There's no final goal, no 'ought to be' in the world. We're simply patterns that survive that resemble patterns that happened to survive.
I also sometimes like to use typing test pages to kinda warm myself up before I start a project. I want to do really well and race someone else or type faster than I usually can, then ride the high of victory to get something done or break procrastination. But this is a much better way to do things; this way I can make that activity help advance my other goals.
The author calls it a 'joke' that Heroes are just unpaid Amazon employees, but reality doesn't become a joke just because it's funny. The asymmetry here is staggering. I find myself holding back private research because I don't want to provide free R&D for a value-extraction machine that is already efficient enough.
The author was at least dependency-driven in their contribution, but outside that kind of dependency, it's hard to justify contributing even 'in the open' when the relationship is this one-sided. Amazon in particular has done enormous damage to the economic assumptions that permissive open source once relied on. More and more projects are adopting 'Business Source Licenses' precisely to prevent open work from becoming a free input into hyperscaler monetization.
These devs know Amazon is grabby, and at some point the only dominant outcome downstream of their community contribution is unpaid labor for a trillion-dollar entity that also diverts support and community engagement away from the original projects by funneling users into managed versions of the same software.
I am saying this is exactly what's happening, just in more robust language. If you disallow Amazon, maybe a third party offers your services to Amazon. So Amazon-the-string is not the bogeyman; the concern is the resale or hosted-service arrangement they can access.
So you see formulations that target infrastructure resale rather than specific entities, such as:
"For the avoidance of doubt, the following scenarios are not permitted under the license:
* A managed service that lets third party developers ... register their own [SERVICE] service endpoints and invoke them through that managed service."
"You may not provide the software to third parties as a hosted or managed service, where the service provides users with access to any substantial set of the features or functionality of the software."
"If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge, under the terms of this License [...] where 'Service Source Code' is defined broadly to include the entire hosting stack (monitoring, backups, etc.) to ensure a level playing field"
> I find myself holding back private research because I don't want to provide free R&D for a value-extraction machine that is already efficient enough.
If someone wants to release technology in a way that makes it publicly viewable but restricts its use, they can do that.
If they don't want to release it, they don't have to.
Additionally, publicly released technology destroys patentability, if that's the objective.
I don't understand what one would want to achieve that can't be achieved here.
> If you disallow Amazon, maybe there is a third party that offers our services to Amazon. So Amazon-the-string is not the bogeyman; the concern is the resale or hosted-service arrangement they can access
That's some acrobatics I suspect Amazon won't engage in, because communicating to the customer that your FooBarDB is managed in AWS but hosted by a third party is awkward.
Amazon will happily reimplement your API with their backend, as they've done before.
AFAICT, large SaaS players can simply implement the software interfaces regardless of Business Source Licenses, like what happened to Redis, no? Or is there some specific protection for API surfaces that I'm not aware of? I vaguely recall Google v. Oracle almost established some protections, but then it got deferred in a later ruling. My memory is hazy on that, though...
Indeed. And with the frontier AI models it's worse than that. You can literally just have them write test cases for the product you want to clone, then set it loose reverse engineering the code base.
That said, all these models are trained on the open source code bases presumably, so it would be interesting to see if AI-blackbox reverse engineering actually holds up in court.
My gut says it would in fact hold up in current US courts, but only because the lion's share of corporations want it to and the courts have been stacked in their favor.
I personally believe it should not and that AI code should NOT be considered a "clean room" method. That said, IANAL.
> There's increasingly more projects adopting 'Business Source Licenses', precisely to prevent open work from becoming a free input into hyperscaler monetization.
They could use AGPL or GPLv3; typically those licenses are verboten at hyperscalers.
The truth is that the sort of company opting for a BSL never really wanted to do OSS, and only did so for the optics, for the goodwill it buys among developers, etc.
I know this is true of AGPL, but GPLv3? I thought the people who objected to GPLv3 were those distributing software to their users (it was e.g. a reason Apple switched from bash to zsh). I cannot think of anything in GPLv3 that would be a problem for hyperscalers.
> They could use AGPL or GPLv3; typically those licenses are verboten at hyperscalers.
Laws are only as good as their enforcement, in business at least. Unfortunately, I have seen first-hand that no one cares about licensing if they won't get caught.
Business licenses are good because you can offer support and other benefits to encourage payment.
The claim is that those licenses are deemed no-touch within those companies; it's the companies themselves that insist on their business and the software not mixing, e.g. Apple continuing to ship old versions of GNU programs like Bash and eventually moving to zsh rather than providing updated GPLv3 versions.
Neither GPLv3 nor AGPLv3 say anything about businesses not being able to use the software.
Hey, nothing wrong with closed source, BSL, etc. I am fine with it. I am the last person that will say someone should give out their work for free.
What I object to is companies releasing software with permissive licenses, and then getting butthurt that others profit from it, or trying to rug pull the permissive licenses after a community adopted and contributed to it.
If you want to play the OSS game, then play it right.
I'm "lucky" to not be smart enough or important enough to think about this. Regardless, i wholeheartedly agree -- at this point, anything i personally could release publicly, will either be fully open source, or completely private. And I'm only choosing open source if I'm relatively sure it's not gonna make some asshole tons of money.
That's in the ballpark of how big corps use open source strategically: they try to kill everyone else's value-extraction moat at every layer other than the ones they dominate.
So they commoditize their complement [0]. They don't care if you make money off their OSS, as long as you race to the bottom against everyone else who also has access to it, turning anything but the corp's profit center into a ubiquitous commodity. So they make the "asshole"'s incentives line up with their own.
That link was a great read and makes a strong point! Another reason corps invest in OSS is to develop something they rely on - special driver, etc - and capitalizing on that in the form of OSS maintainers charging consulting fees has been successful. Exactly in agreement with making the incentives line up with their own.
You could wrap pyobject via a proxy that controls context and have AI have a go at it.
You can customise that interface however you want, have a stable interface that does things like
This way you get a general interface for AI interacting with your data, while still keeping a very fluid interface.
Built a custom kernel for notebooks with PDB and a similar interface, the trick is to also have access to the same API yourself (preferably with some extra views for humans), so you see the same mediated state the AI sees.
By 'wrap' I mean build a capability-based, effect-aware, versioned-object system on top of objects (execs and namespaces too) instead of giving models direct access. Not sure if your specific runtime constraints make this easier or harder. Does this sound like something you'd be moving towards?
Really interesting idea! Part of the ethos here is that models are already really good at writing Python, and we want to bet on that rather than mediate around it. Python has the nice property of failing loudly (e.g., unknown keywords, type errors, missing attributes) so models can autocorrect quickly. And marimo's reactivity adds another layer of guardrails on top when it comes to managing context/state.
Anecdotally working on pair, I've found it really hard to anticipate what a model might find useful to accomplish a task, and being too prescriptive can break them out of loops where they'd otherwise self-correct. We ran into this with our original MCP approach, which framed access to marimo state as discrete tools (list_cells, read_cell, etc.). But there was a long tail of more tools we kept needing, and behind the scenes they were all just Python functions exposing marimo's state. That was the insight: just let the model write Python directly.
So generally my hesitation with a proxy layer is that it risks boxing the agent in. A mediated interface that helps today might become a constraint tomorrow as models get more capable.
Yeah, I'm talking more about a wrapper over the python data model (pyobject) rather than an MCP-style API for kernel interaction. I'm not proposing you abstract interactions under a rigid proxy, but that you can use proxy objects to virtualise access to the runtime. You could still let the model believe it is calling normal python code, but in actuality, it goes via your control plane. Seeing the demo I'd imagine you already have parts of this nailed down tho.
Ah, I think I misread your earlier comment. That's a more interesting version of the idea than what I responded to. We don't do this today, but marimo's reactivity already gives us some control plane benefits without virtualizing object access. That said, I can imagine there are many more things a proxy layer could do. Need to think on it, thanks for the clarification :)
Codex just picks it up. The surface is basically a guarded object model, so pandas/polars-style operations stay close to the APIs the model already knows. There are some extra tricks, but they're probably out of scope for an HN comment.
In practice, Pandas/Polars API would lower to:
proxy -> attr("iloc") -> getitem(slice(1,10,None))
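A minimal sketch of such a recording proxy in plain Python (`Proxy` and `audit_log` are illustrative names, and I'm using a plain list instead of a DataFrame to keep it dependency-free): every attribute and item access passes through the control plane before being forwarded to the real object.

```python
audit_log = []  # the "control plane": here it just records accesses

class Proxy:
    def __init__(self, target, path="obj"):
        self._target = target
        self._path = path

    def __getattr__(self, name):
        # e.g. df.iloc lowers to attr("iloc")
        audit_log.append(("attr", self._path, name))
        return Proxy(getattr(self._target, name), f"{self._path}.{name}")

    def __getitem__(self, key):
        # e.g. df.iloc[1:10] lowers to getitem(slice(1, 10, None))
        audit_log.append(("getitem", self._path, key))
        return Proxy(self._target[key], f"{self._path}[{key!r}]")

    def unwrap(self):
        return self._target

data = Proxy([10, 20, 30, 40])
sliced = data[1:3]
print(sliced.unwrap())  # [20, 30]
print(audit_log[-1])    # ('getitem', 'obj', slice(1, 3, None))
```

A real version would add policy checks (deny, rewrite, version) at those two choke points instead of just logging, but the model still "believes" it's calling normal Python.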
Language is a tool, you have to do what's best for your own goal.
If you read Orwell, his message is not necessarily that complex language is worse at transmitting ideas, as he's actually arguing that complex language can hide the speaker's real motivation and deceive more easily.
For Paul Graham, I'd say 'write like you talk' is very good advice since he's interacting with founders whose first language is not English, people with different backgrounds from his, and young folks who maybe didn't take an academic route, so for him it checks out.
Leslie Lamport always talks about how you should write down what you think: until you write something down, you only think you're thinking. He's also all about writing most things in math rather than English, since math is less ambiguous (and less complex). And I'd say math is quite different from how you talk.
Now, notice how you can have different motivations for the same behaviour (Orwell and Graham), or different behaviours for a similar motivation (Orwell and Lamport). Maybe more interestingly, think about people with the opposite intentions: a contractor who wants to mimic sophistication to win a contract with a bank (whose representatives are also mimicking sophistication), or guilds trying to preserve a high barrier to entry. They'd appreciate the opposite advice, since their goals differ.
The 99.9% is less impressive than you'd think: in the remaining 0.1% of cases they're not even keeping the same program behaviour. They also mention an AST in the pipeline, not a CST, so I wouldn't expect source preservation to be a direct goal.
Also, if you use nonstandard spacing, I'd say it's on you to preserve a mechanical source-AST mapping if you want to use any tool that does dataflow analysis and transforms.
As a side note, comments are much trickier than non-standard spacing if their positioning is semantic.
No. JSIR is primarily for JS -> IR -> JS for analysis and source-to-source transformation. It's not a ready-made bridge for emitting other languages.
You could use it as an intermediate form in a JS->C# pipeline, but you'd still have to define a subset of JavaScript that lowers cleanly to your target C# runtime and implement the IR->C# lowering yourself.
I'd imagine the hard part is not the IR but aligning JavaScript semantics (object model, closures, prototypes, etc.) with C#'s (static type system, different execution model...).
Right on. That makes sense. Thanks for spelling it out!
I do think aligning the semantics will be the easier part, honestly, because I'm only trying to transpile the source supported by the game engine. That's all written in TypeScript, and I'm not guaranteeing full parity for arbitrary TS/JS (only source that parses the same way the game engine parses it), so I'm expecting a near 1-to-1 conversion. I started by writing everything in C# and copied the structure to JS, knowing this was the eventual plan, so the JS can actually be rewritten as C# with a pretty simple regex tokenizer.
My hope here was that by morphing the code into an IR, the IR would be some kind of well-known IR that, for instance, C# could also be morphed into, and would therefore allow automatic conversion back and forth. From what you're saying, though, it sounds like IRs don't share a common structure for describing code (I'm guessing because of the semantic misalignment you mention between a wide variety of paradigms?), so this would only work if I wrote the IR->C# mapping myself, which would be just as complex (or more so) than regexing my JS into C#. If I've got that right, that's a bummer, but understandable. If I'm wrong, though, happy to learn more!
I don't see anything wrong that would disqualify your plan.
But if the alternative is regex and you're already writing TypeScript, take a look at ts-morph [0]. TS has very good compiler APIs, and that gets you something much safer than text-based replacement while staying relatively small for a constrained subset; ts-morph wraps those APIs cleanly.
Btw, JS doesn't even have an official bytecode: the spec is defined at the language-semantics level, so each engine/toolchain invents its own internal representation.
I think the WASM world is a clear example that bridges the gap you're describing.
You usually compile from SSA to WASM bytecode, and then immediately JIT (Cranelift) by reconstructing an SSA-like graph IR. If you look at the flow, it's basically:
Graph IR -> WASM (stack-based bytecode) -> Graph IR
So the stack-based IR is used as a kind of IR serialization layer. Then I realized that this works well because a stack-based IR is just a linearized encoding of a dataflow graph. The data dependencies are implicit in the stack discipline, but they can be recovered mechanically.
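A small Python sketch of that recovery, assuming a toy three-instruction stack language (illustrative, not real Wasm opcodes): popping operands off a simulated stack rebuilds the def-use edges directly.

```python
def to_dataflow(instrs):
    """Linear stack program -> list of (node_id, op, operands) triples.

    The stack discipline encodes the dataflow implicitly: each pop
    tells you exactly which earlier definition an operand uses.
    """
    stack, nodes = [], []
    for instr in instrs:
        op = instr[0]
        if op == "const":
            nodes.append((len(nodes), "const", (instr[1],)))
        else:  # binary op: pop two operands, result goes on the stack
            rhs, lhs = stack.pop(), stack.pop()
            nodes.append((len(nodes), op, (lhs, rhs)))
        stack.append(len(nodes) - 1)
    return nodes

# (2 + 3) * 4 in stack form:
program = [("const", 2), ("const", 3), ("add",), ("const", 4), ("mul",)]
graph = to_dataflow(program)
# graph[2] == (2, "add", (0, 1)) and graph[4] == (4, "mul", (2, 3)):
# the def-use chains are back, ready for graph-IR-style rewrites.
```

Real Wasm adds locals, control flow, and side effects on top of this, but the core round trip for straight-line expression code is exactly this mechanical.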
Once you see that, the blindness mostly disappears: the difference between SSA/graph IRs and expression/stack-based IRs is about how the dataflow (mostly def-use chains) is represented, not about which optimizations are possible.
From there it becomes fairly obvious that graph-IR techniques can be applied to expression-based structures as well, since the underlying information is the same, just represented differently.
I didn't look closely enough at JSIR, but from looking around (and from building a restricted source <-> graph IR on JS for some code transforms), it basically shows you have at least a homomorphic mapping between expression-oriented JS and a graph IR, if not a proper isomorphism (at least in structured, side-effect-constrained subsets).
Only compilers that already had an SSA-based pipeline transform SSA to stack-based for Wasm. And several don't like that they have to comply with Wasm structured control flow (which, granted, is independent from SSA). Compilers that have been using an expression-based IR directly compile to Wasm without using an SSA intermediary.
I was imprecise; I was specifically thinking of already-SSA-based tech.
My broader point is that for SSA-based pipelines targeting Wasm, translation between SSA/graph IR and a stack-based IR is largely mechanical and efficient. Whether a compiler uses SSA as an intermediary or goes straight from an AST to Wasm, the fact remains that you can round-trip between an SSA-like IR and a stack-based IR without losing the underlying dataflow information.
Yeah, the mapping is not canonical and some non-semantic structure is not preserved (evaluation order, materialization points, join encoding, CFG reshaping for structured control, and probably more structure I'm not familiar with), but optimization power is unaffected.
And JSIR seems to be based on an even stronger assumption.
Would appreciate corrections if you see things differently.