I think I can understand why this wasn’t addressed for so long: in the vast majority of cases, if your db is exposed at the network level to untrusted sources, you probably have far bigger problems?
That's the kind of hand-wave that turns into a CVE later. Network exposure is one thing, but weird signal handling in local tooling can still become a cross-session bug or a nasty security footgun on shared infra, terminals, or jump boxes.
If you have shared psql sessions in tmux or on a jump box, one bad cancel can trash someone else's work. 'Just firewall it' is how you end up owned by the intern with shell access.
It's also very tricky to do given the current server-side architecture, where one single-threaded process handles each connection and does (for all intents and purposes) synchronous I/O.
In such a scenario, listening (and acting) on cancellation requests on the same connection becomes very hard, so fixing this goes way beyond "just".
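For context on why cancellation is out of band today: a client cancels a query by opening a brand-new connection and sending a CancelRequest message carrying the backend pid and secret key it received at startup, and the busy backend only notices it at safe interrupt points. A minimal sketch of that 16-byte packet (field layout per the frontend/backend protocol docs; the function name is my own):

```python
import struct

CANCEL_REQUEST_CODE = 80877102  # magic number identifying a CancelRequest

def cancel_request_packet(backend_pid: int, secret_key: int) -> bytes:
    """Build the 16-byte CancelRequest message: length, code, pid, secret.

    It is sent over a *separate* connection, because the backend serving
    the original connection is busy doing synchronous work.
    """
    return struct.pack("!IIII", 16, CANCEL_REQUEST_CODE, backend_pid, secret_key)
```

That separate-connection dance is the workaround for exactly the architecture described above.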
I think it's partly tongue-in-cheek: when "big data" was overhyped, everyone claimed they were working with big data, or tried to sell expensive solutions for working with it, and some reasonable minds spoke up and pointed out that a standard laptop could process more "big data" than people thought.
> For our first experiment, we used ClickBench, an analytical database benchmark. ClickBench has 43 queries that focus on aggregation and filtering operations. The operations run on a single wide table with 100M rows, which uses about 14 GB when serialized to Parquet and 75 GB when stored in CSV format.
Processing data that cannot be processed on a single machine is fundamentally a different problem than processing data that can be processed on a single machine. It's useful to have a term for that.
As you say, single machines can scale up incredibly far. That just means 16 TB datasets no longer demand big data solutions.
I get your point, but I don’t know if big data is the right term anymore.
Many people like to think they have big data, and you kinda have to agree with them if you want their money. At least in consulting.
Also you could go well beyond a 16TB dataset on a single machine. You assume that the whole uncompressed dataset has to fit in memory, but many workloads don’t need that.
How many people in the world have datasets that big to analyse within a reasonable time?
I think the definition of big is smaller than that. Mine was "too big to fit on a maxed-out laptop", effectively >8TB. Our photo collection is bigger than that, yet it's not 'big data'.
Or one could define it as too big to fit on a single SSD/HDD, maybe >30TB. Still within the reach of a hobbyist, but too large to process in memory and needs special tools to work with. It doesn't have to be petabyte scale to need 'big data' tooling.
Short answer: no, not as far as I'm aware / can reason about it.
In more detail: by my understanding there are two techniques for making zip bombs.
Firstly, nested ZIPs, which leverage the fact that some unzip programs recursively extract member files. stream-unzip doesn't do this (although you could probably use stream-unzip as a component in a vulnerable recursive ZIP parser if you really wanted to, but I would argue that is not stream-unzip's responsibility).
The second technique is overlapping member files, but this depends on the overlaps being defined by the central directory at the end of the ZIP, which stream-unzip does not use.
But if you are accepting files from an untrusted source, then you should validate the size of the uncompressed data as you unzip (which you can do alongside validating any other properties of the data).
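Concretely, that check can be a small wrapper around whatever per-file chunk iterator your streaming unzipper yields (stream-unzip yields one per member file). A sketch, with the names and limit being my own choices:

```python
class SizeLimitExceeded(Exception):
    pass

def limited(chunks, max_bytes):
    """Re-yield uncompressed chunks, aborting once the running total
    passes max_bytes -- so a bomb fails fast instead of filling disk."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise SizeLimitExceeded(f"uncompressed data exceeded {max_bytes} bytes")
        yield chunk
```

Pipe each member file's chunks through `limited(...)` before writing them anywhere.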
> without regard for the maintenance burden 1, 2, 5, 10 years down the road.
To me software craftsmanship isn't just about the code; it's about engineering your use of time.
In general you shouldn't knowingly make choices that will cause pain in the future, but if avoiding that pain increases the chance of the project not making it to the future, is that really the better option? Gathering enough information to make the judgement call between long-term/far-future pain and short-term benefit is all part of the craftsmanship.
> I don't blame agile. But I do kind of blame Agile™
(Loving the phrasing here! I think I'm right on board, especially if we're talking Scrum/Scum-ish)
To answer this, I suspect that trying to change what certain words/phrases mean to people en masse is extremely difficult, to the point of impossibility in most cases. However, we each have the power to be clearer in the words we use so they are understood by the people we're communicating with.
> engineering quality matters
But also, this suggests to me that there is some absolute definition of quality, when it's much more nuanced than that. Nothing is inherently "bad quality"; rather, every choice has consequences, which may or may not happen and may or may not be acceptable in a given circumstance, and you might not even know what they are until the future. This, I think, is the point I'm trying to make: there is no absolute definition of engineering quality, and I suspect the term "technical debt" all too often suggests there is.
Have to admit the lazy thing threw me, but I can see how the “doing less” I’m arguing for could be taken that way. The “less” is not about avoiding handling edge cases that are possible now, but about avoiding layers of code that handle cases possible only in some future version of the code (with some limited exceptions that I mention at the bottom of the post).
In fact, it’s crossing my mind that people might not want to be accused of being lazy, and that is a motivation to over-engineer solutions.
It's going to take me some time to come to terms with what you have to say; I'm probably not going to internalize it today. My gut reaction is, "I hate it; I'm over-engineering for a reason! I'm going to be better for it, and my output is going to be better for it!" But this kind of thing has had me hung up on simple things at every turn. A present example: I'm going through Chapter 1 of ANSI K&R as a refresher (I'm not a programmer), and I've been stuck on exercise 1-21, "entab", which is presented as follows:
/* 1-21. Write a program entab that replaces strings of blanks by the
minimum number of tabs and blanks to achieve the same spacing. Use the
same tab stops as for detab. When either a tab or a single blank would suffice
to reach a tab stop, which should be given preference? */
When confronted with a problem like this, I begin to think, "Well what's the most robust way of going about this task? What's a simple, good, and useful rule that will accomplish the stated goal?" And I'm not quite sure what happens next - figuring that out might require some deeper introspection, but I end up with the proposed solution:
"Any time we encounter consecutive whitespace characters, including spaces, tabs and newlines, ignore the literal characters and instead simply add up exactly how many columns of whitespace they are going to take up, and then, find the smallest number of newlines, tabs and spaces we can print to the screen to match that amount of whitespace. This way we accomplish the stated goal and also end up with a nice text sanitizer."
I'm still mulling all of this over, but I'm pretty confident this goes so far beyond the stated problem that it could be considered self-sabotage. There are a lot of moving parts in my solution, and I don't have the cognitive tools to break up a problem like that yet. (I'd like to eventually, of course, but I have to stay focused on what I'm doing!)
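For contrast, the exercise-scoped version really is small: spaces only, one line at a time, fixed tab stops (width 8 assumed here, matching a typical detab). A sketch in Python rather than C, just to show the shape:

```python
TABSTOP = 8  # assumed tab width, same as the companion detab exercise

def entab(line, tabstop=TABSTOP):
    """Replace runs of spaces with the minimum number of tabs and spaces.

    A tab is emitted each time a run of spaces reaches a tab stop, so a
    single space that lands exactly on a stop becomes a tab."""
    out = []
    col = 0    # current column in the output
    start = 0  # column where the current run of spaces began
    for ch in line:
        if ch == " ":
            col += 1
            if col % tabstop == 0:  # the run crossed a tab stop
                out.append("\t")
                start = col
        else:
            out.append(" " * (col - start))  # leftover spaces short of a stop
            out.append(ch)
            col += 1
            start = col
    out.append(" " * (col - start))  # trailing spaces
    return "".join(out)
```

This also answers the book's closing question one way: when either a tab or a single blank would reach the stop, the tab gets preference.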