I think I can understand why this wasn’t addressed for so long: in the vast majority of cases, if your db is exposed at the network level to untrusted sources, you probably have far bigger problems?
That's the kind of hand-wave that turns into a CVE later. Network exposure is one thing, but weird signal handling in local tooling can still become a cross-session bug or a nasty security footgun on shared infra, terminals, or jump boxes.
If you have shared psql sessions in tmux or on a jump box, one bad cancel can trash someone else's work. 'Just firewall it' is how you end up owned by the intern with shell access.
It's also very tricky to do given the current server-side architecture, where one single-threaded process handles each connection and does (for all intents and purposes) synchronous I/O.
In such a scenario, listening (and acting) on cancellation requests on the same connection becomes very hard, so fixing this goes way beyond "just".
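For context on why cancellation is out of band today: a client cancels a query by opening a brand-new connection and sending a CancelRequest message carrying the backend pid and secret key it received at startup, and the busy backend only notices it at safe interrupt points. A minimal sketch of that 16-byte packet (field layout per the frontend/backend protocol docs; the function name is my own):

```python
import struct

CANCEL_REQUEST_CODE = 80877102  # magic number identifying a CancelRequest

def cancel_request_packet(backend_pid: int, secret_key: int) -> bytes:
    """Build the 16-byte CancelRequest message: length, code, pid, secret.

    It is sent over a *separate* connection, because the backend serving
    the original connection is busy doing synchronous work.
    """
    return struct.pack("!IIII", 16, CANCEL_REQUEST_CODE, backend_pid, secret_key)
```

That separate-connection dance is the workaround for exactly the architecture described above.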
I think it's partly tongue-in-cheek: when "big data" was overhyped, everyone claimed they were working with big data, or tried to sell expensive solutions for working with it, and some reasonable minds spoke up and pointed out that a standard laptop could process more "big data" than people thought.
> For our first experiment, we used ClickBench, an analytical database benchmark. ClickBench has 43 queries that focus on aggregation and filtering operations. The operations run on a single wide table with 100M rows, which uses about 14 GB when serialized to Parquet and 75 GB when stored in CSV format.
Processing data that cannot be processed on a single machine is fundamentally a different problem than processing data that can be processed on a single machine. It's useful to have a term for that.
As you say, single machines can scale up incredibly far. That just means 16 TB datasets no longer demand big data solutions.
I get your point, but I don’t know if big data is the right term anymore.
Many people like to think they have big data, and you kinda have to agree with them if you want their money. At least in consulting.
Also you could go well beyond a 16TB dataset on a single machine. You assume that the whole uncompressed dataset has to fit in memory, but many workloads don’t need that.
How many people in the world have datasets that big to analyse within a reasonable time?
I think the definition of big is smaller than that. Mine was "too big to fit on a maxed-out laptop", effectively >8TB. Our photo collection is bigger than that, yet it's not 'big data'.
Or one could define it as too big to fit on a single SSD/HDD, maybe >30TB. Still within the reach of a hobbyist, but too large to process in memory and needs special tools to work with. It doesn't have to be petabyte scale to need 'big data' tooling.
Short answer: no, not as far as I'm aware / can reason about it.
In more detail: by my understanding there are two techniques for making zip bombs.
Firstly, nested ZIPs, which leverage the fact that some unzip programs recursively extract member files. stream-unzip doesn't do this (although you could probably use stream-unzip as a component in a vulnerable recursive ZIP parser if you really wanted to, but I would argue that is not stream-unzip's responsibility).
The second technique is overlapping member files, but this depends on the overlaps being defined by the central directory at the end of the ZIP, which stream-unzip does not use.
But if you are accepting files from an untrusted source, then you should validate the size of the uncompressed data as you unzip (which you can do alongside validating any other properties of the data).
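Concretely, that check can be a small wrapper around whatever per-file chunk iterator your streaming unzipper yields (stream-unzip yields one per member file). A sketch, with the names and limit being my own choices:

```python
class SizeLimitExceeded(Exception):
    pass

def limited(chunks, max_bytes):
    """Re-yield uncompressed chunks, aborting once the running total
    passes max_bytes -- so a bomb fails fast instead of filling disk."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise SizeLimitExceeded(f"uncompressed data exceeded {max_bytes} bytes")
        yield chunk
```

Pipe each member file's chunks through `limited(...)` before writing them anywhere.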
> without regard for the maintenance burden 1, 2, 5, 10 years down the road.
To me software craftsmanship isn't just about the code; it's about engineering your use of time.
In general you shouldn't knowingly make choices that will cause pain in the future, but if avoiding that pain increases the chance of the project not making it to the future, is that really the better option? Gathering enough information to make the judgement call between long-term/far-future pain and short-term benefit is all part of the craftsmanship.
> I don't blame agile. But I do kind of blame Agile™
(Loving the phrasing here! I think I'm right on board, especially if we're talking Scrum/Scum-ish)
To answer this, I suspect that trying to change what certain words/phrases mean to people en masse is extremely difficult, to the point of impossibility in most cases. However, we each have the power to be clearer in the words we use so they are understood by the people we're communicating with.
> engineering quality matters
But also, this suggests to me that there is some absolute definition of quality, when it's much more nuanced than that. Nothing is inherently "bad quality"; rather, every choice has consequences, which may or may not happen and may or may not be acceptable in a given circumstance, and you might not even know what they are until the future. This, I think, is the point I'm trying to make: there is no absolute definition of engineering quality, and I suspect the term "technical debt" all too often suggests there is.
Have to admit the lazy thing threw me, but I can see how the “doing less” I’m arguing for could be taken that way. The “less” is not about avoiding handling edge cases that are possible now, but about avoiding layers of code that handle cases possible only in some future version of the code (with some limited exceptions that I mention at the bottom of the post).
In fact, it’s crossing my mind that people might not want to be accused of being lazy, and that is a motivation to over-engineer solutions.
It's going to take me some time to come to terms with what you have to say; I'm probably not going to internalize it today. My gut reaction is, "I hate it; I'm over-engineering for a reason! I'm going to be better for it, and my output is going to be better for it!" But this kind of thing has had me hung up on simple things at every turn. A present example: I'm going through Chapter 1 of ANSI K&R as a refresher (I'm not a programmer), and I've been stuck on exercise 1-21, "entab", which is presented as follows:
/* 1-21. Write a program entab that replaces strings of blanks by the
minimum number of tabs and blanks to achieve the same spacing. Use the
same tab stops as for detab. When either a tab or a single blank would suffice
to reach a tab stop, which should be given preference? */
When confronted with a problem like this, I begin to think, "Well what's the most robust way of going about this task? What's a simple, good, and useful rule that will accomplish the stated goal?" And I'm not quite sure what happens next - figuring that out might require some deeper introspection, but I end up with the proposed solution:
"Any time we encounter consecutive whitespace characters, including spaces, tabs and newlines, ignore the literal characters and instead simply add up exactly how many columns of whitespace they are going to take up, and then, find the smallest number of newlines, tabs and spaces we can print to the screen to match that amount of whitespace. This way we accomplish the stated goal and also end up with a nice text sanitizer."
I'm still mulling all of this over, but I'm pretty confident this goes so far beyond the stated problem that it could be considered self-sabotage. There are a lot of moving parts in my solution, and I don't have the cognitive tools to break up a problem like that yet. (I'd like to eventually, of course, but I have to stay focused on what I'm doing!)
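For contrast, the exercise-scoped version really is small: spaces only, one line at a time, fixed tab stops (width 8 assumed here, matching a typical detab). A sketch in Python rather than C, just to show the shape:

```python
TABSTOP = 8  # assumed tab width, same as the companion detab exercise

def entab(line, tabstop=TABSTOP):
    """Replace runs of spaces with the minimum number of tabs and spaces.

    A tab is emitted each time a run of spaces reaches a tab stop, so a
    single space that lands exactly on a stop becomes a tab."""
    out = []
    col = 0    # current column in the output
    start = 0  # column where the current run of spaces began
    for ch in line:
        if ch == " ":
            col += 1
            if col % tabstop == 0:  # the run crossed a tab stop
                out.append("\t")
                start = col
        else:
            out.append(" " * (col - start))  # leftover spaces short of a stop
            out.append(ch)
            col += 1
            start = col
    out.append(" " * (col - start))  # trailing spaces
    return "".join(out)
```

This also answers the book's closing question one way: when either a tab or a single blank would reach the stop, the tab gets preference.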