
> Ideological answer: For the same reason HTTP/2.0’s binary protocol didn’t instantly obviate/deprecate HTTP/1.0’s text protocol. Text has advantages: text is debuggable, and prototypable. If the interface between two programs is a text based declarative language, you can audit that text, diff that text, edit that text to see how changes affect the result, mock one side or the other by producing or consuming that text, etc.

I can see the argument for using a textual format (although I think it's weaker than you say; if we're generating this config with code then we don't want to diff or edit the generated config), but YAML seems like a singularly poor choice if you want reliable diffs and editing; it's like picking tag-soup HTML. Straight JSON (ideally with a schema), TOML or even XML seems like a better bet if you're generating it programmatically.

> And, obviously, if you don’t control the other end, you don’t decide how the other end does its config.

Right, in that case it's all moot. I took GP to be talking about what formats these tools should use. IMO if the tool is intended to consume a machine-generated config then it would be better to use a machine-oriented config format. I think the option of something like protobuf (which is language-independent) is underappreciated, but even restricting ourselves to textual options, something stricter than YAML seems like a better bet.



But the third-party tool frequently isn’t intended to (only) consume machine-generated config. It’s usually built to consume a format that could equally be machine-generated or hand-authored. Usually with an emphasis on hand-authoring, where machine-generation is an automation over hand-authoring that will only need to happen as one scales; and so high-complexity machine-generation will only be relevant to the most enterprise-y of integrators.

Other examples of formats like this, that are hand-authored in the small but generated in the large: RSS, SQL, CSV.

Again, Kubernetes is a prime example of this. K8s config YAML is designed with the intention of being hand-authored and hand-edited. It’s only when devs or their tools need to auto-generate entire k8s cluster definitions, that you begin needing to machine-generate this YAML. This generated YAML is expected to still be audited by eye and patched by hand after insertion, though, so it still needs to be in a format amenable to those cases, rather than in a format optimal for machine consumption.

> if we're generating this config with code then we don't want to diff or edit the generated config

Look more into GitOps. The idea behind it is that whatever tooling you’re using to generate config is run and the resulting config is committed to a “deployment” repo as a PR; ops staff (who don’t necessarily trust the tooling that generated the config) can then audit the PR, and the low-level changes it describes, before accepting it as the new converged system state. It puts a human veto in the pipeline between machine-generated config and continuous deployment; and allows for debugging when upstream tweaks aren’t having the low-level side-effects on system state one would expect.
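As a rough illustration of that review step, here is a minimal Python sketch. It assumes the generated config is rendered deterministically (sorted keys, fixed indentation) so the diff in the PR shows only real changes. The names `render` and `review_diff` and the sample config are hypothetical, and JSON stands in for whatever text format the tool actually consumes.

```python
import difflib
import json


def render(config: dict) -> str:
    # Sorted keys and a fixed indent keep the text stable from run to run,
    # so key ordering alone never shows up as a change.
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


def review_diff(old: dict, new: dict) -> str:
    # Produce the unified diff a reviewer would see in the "deployment" PR.
    return "".join(
        difflib.unified_diff(
            render(old).splitlines(keepends=True),
            render(new).splitlines(keepends=True),
            fromfile="deployed",
            tofile="proposed",
        )
    )


# Hypothetical generator output from two runs; key order differs on purpose.
old = {"replicas": 2, "image": "web:1.4"}
new = {"image": "web:1.5", "replicas": 2}

diff = review_diff(old, new)
print(diff)
```

Only the `image` line appears as changed; the reordered `replicas` key produces no noise, which is what makes the human audit practical.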


In most programming languages you can hand-author a value just fine - that part isn't an advantage to something like YAML or JSON. Given the use of variables and a few other similar simple techniques, I dare say many programming languages are more amenable to hand-authoring static config objects than most static config languages.

I think the real issue is reproducibility; and that boils down to purity. Fully fledged languages all come with lots of APIs and features to interact with the rest of the world, and it's quite unclear which APIs have such dependencies and which do not - and it's seductively easy to do something actually useful in a "real" programming language that will make the whole configuration process unwieldy later - like, say, reading parts of the config from disk, getting some service's public key off the internet, embedding a timestamp, or even writing some computed config like a random key to a bit of storage for a later config process to consume. And once you do that, then the whole thing gets flaky, fast.
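A small Python sketch of the difference, under the assumption that "config" is just a dict built by a function. Both function names are hypothetical. The impure version reads the clock, so two runs can disagree; the pure version takes every outside fact as an explicit argument.

```python
import json
import time


def impure_config() -> dict:
    # Reads the wall clock: the output depends on when this runs,
    # so the same source can yield different config on each run.
    return {"name": "web", "built_at": time.time()}


def pure_config(name: str, built_at: float) -> dict:
    # Every external fact is passed in, so equal inputs give equal output.
    return {"name": name, "built_at": built_at}


first = json.dumps(pure_config("web", 0.0), sort_keys=True)
second = json.dumps(pure_config("web", 0.0), sort_keys=True)
print(first == second)
```

Nothing in the language stops someone from writing the first form, which is the hazard being described.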

If you can rigorously avoid that, there's not too much advantage to a static config language.


> In most programming languages you can hand author a value just fine

But keep in mind that we’re not inherently talking about programming languages here — nor are we necessarily talking about people capable of programming as our configurators. We’re talking about third-party components that need to be configured by ops people, who may or may not be DevOps people. Usually they’re not — most ops people are just pure ops, and don’t know any programming languages. As well, most amateur integrators (e.g. a person setting up their own blog) aren’t programmers either.

The goal of these systems, when choosing a configuration solution, is twofold: to give pure-ops and amateur integrators a config language they can author directly, in a text editor, without learning programming; while also making that language formal/structured enough that it’s easy to machine-generate from your programming runtime of choice, if you do have those skills, and a rigorous mindset.

Sure, programming languages don’t necessarily require you to use the full-fledged expression syntax they enable, and so can “reduce” to a configuration-language-like subset of themselves.

But remember, again — ops people and amateur integrators. What do such people tend to do, to create their config? Read the reference config schema? No. They tend to look up tutorials with samples, or StackOverflow “solutions”, from arbitrary places on the Internet.

And what do the creators of those samples have in abundance? Cleverness and a desire for clarity of meaning. Traits that cause them to use the expressive features of whatever the configuration language is, in order to make their answers more “pithy”.

Which means that, to wield these “pithy” samples/solutions, the ops people and amateur integrators now have to understand how to “patch” one arbitrary piece of complex code into another increasingly-arbitrary piece of complex code.

The thing a static data-serialization format gets you, is that the rules for merging any two expression-nodes in it are very simple to learn, because there just aren’t that many types of expressions. There’s no way to be “pithy” with the configuration that requires people to learn entirely-new-to-them syntax.

By choosing to configure your system in YAML, you’re guaranteeing that the samples these ops people and amateur integrators find and attempt to glue together, will also just be pure YAML. And since their existing config file, and each new sample, are pure YAML, they’ll likely succeed at doing this gluing-together.
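To show how small the rule set for gluing two pieces of static config can be, here is a Python sketch over plain dicts and lists, standing in for parsed YAML. The `merge` function and its rule (mappings merge key by key, anything else is replaced by the patch) are an assumption for illustration, not the behaviour of any particular tool.

```python
def merge(base, patch):
    # Two mappings merge key by key; every other combination
    # (scalars, lists, mismatched types) is replaced by the patch value.
    if isinstance(base, dict) and isinstance(patch, dict):
        merged = dict(base)
        for key, value in patch.items():
            merged[key] = merge(base[key], value) if key in base else value
        return merged
    return patch


# Hypothetical existing config plus a sample copied from a tutorial.
existing = {"server": {"port": 80, "tls": False}}
sample = {"server": {"tls": True}, "plugins": ["cache"]}

combined = merge(existing, sample)
print(combined)
```

With only mappings, sequences and scalars to consider, there is no new syntax for the person doing the gluing to learn.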

Meanwhile, DevOps people and enterprise integrators can create their own programs to generate the YAML — but since there’s no first-party framework for doing this, there won’t be much value in sharing these programs around, and so the samples the pure-ops people and amateur integrators find will never be given “in terms of” writing code for such a framework, but rather only in terms of the config YAML itself.

> I think the real issue is reproducibility; and that boils down to purity. [...] If you can rigorously avoid that, there's not too much advantage to a static config language.

Individual users might be able to rigorously avoid that (though expecting a rigorous approach to formal expression from non-programmers is a bit much.) But often it's the system itself that needs purity and reproducibility.

Remember, config formats are usually something executed at every startup — in other words, they're durable state that happens to be human-modifiable. (Think: the Windows Registry.) As the designer of a system, you don't want the same state you serialized today to deserialize to something else tomorrow; and you especially don't want the meaning of your state to depend contextually on the environment. You want to "pin down" your state.

A good example: programming-language package-ecosystem "lock files." In most languages, dependency-constraint specification is done in a programming language, such that the generation of those constraint expressions has access to Turing-complete features. But once you lock those constraints down to a baked set of choices, the lockfile itself (the predetermined set of choices, which should be environment-independent) is not expressed in a Turing-complete language, at least in any runtime I know of. Rather, it is always expressed in its own little static declarative language; or at most in a limited "data-expressions only" subset of the parent language (e.g. Erlang's `file:consult/1` format.)

In this case, dep-constraints are the inputs to a config-generator program; while the lockfile is the config format itself. The config format is a necessary intermediate here; it'd be impossible for the runtime to make the same static guarantees about package management if it wasn't! (In fact, see e.g. Python's setup.py, where exactly that problem stymies any package-manager the Python ecosystem introduces from pre-determining dependency graphs before actually downloading and attempting installation of the dependencies.)
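A toy Python sketch of that split, with everything in it hypothetical: the package names, the version list, and the `resolve` function are made up, and real resolvers are far more involved. The constraints are arbitrary code; the lockfile that comes out is plain data.

```python
import json

# Hypothetical index of published versions, as (major, minor) tuples.
AVAILABLE = {
    "libfoo": [(1, 0), (1, 2), (2, 0)],
    "libbar": [(0, 9), (1, 0)],
}

# Constraints are ordinary code: any predicate over a version will do.
constraints = {
    "libfoo": lambda version: version < (2, 0),
    "libbar": lambda version: version >= (0, 9),
}


def resolve(constraints, available):
    # Pick the highest version that satisfies each constraint.
    lock = {}
    for name, accepts in constraints.items():
        candidates = [v for v in available[name] if accepts(v)]
        lock[name] = ".".join(str(part) for part in max(candidates))
    return lock


# The lockfile is static data: no predicates, no code, nothing to evaluate.
lockfile = json.dumps(resolve(constraints, AVAILABLE), indent=2, sort_keys=True)
print(lockfile)
```

Anything that later reads the lockfile only has to parse it, which is what lets the runtime reason about the dependency set without running user code.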


Yeah, there's something to be said for a format that makes it hard to shoot yourself in the foot, essentially. That point is somewhat orthogonal to the issue of how easy it is to author a config value, however.

By the way, you conflate purity with turing completeness; but the two are not really all that strongly related. It's possible to have a turing incomplete language that is nevertheless impure (public I/O without unconstrained repetition), and conversely a turing complete language that is pure (i.e. keep your tape private).

I'd argue that turing completeness isn't as relevant as people make it out to be here. It's not a good thing, mind you, but it's just not that problematic either; externally imposed termination and storage limitation can render any turing complete system into a turing incomplete system - that's easy - but a system with uncontrolled side effects is almost intrinsically hard to manage. In fact, even technically turing-incomplete systems may well need to impose similar limitations anyhow, because a technically turing incomplete language that allows (say) nested loops or iteration - albeit bounded - may well not practically terminate, or nevertheless cause too much I/O. Some languages are really limited, and perhaps then you can get away without externally imposed resource constraints, but it's not clear to me how realistic that scenario is.
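One way to picture "externally imposed termination" is a host that runs the config program one step at a time and gives up after a fixed budget. The Python sketch below is only that picture: `run_with_budget`, `BudgetExceeded` and the two step functions are hypothetical names, not part of any real config evaluator.

```python
class BudgetExceeded(Exception):
    """Raised when the host's step limit is reached before the program finishes."""


def run_with_budget(step, state, max_steps):
    # The host, not the program, decides how long evaluation may run.
    for _ in range(max_steps):
        state, done = step(state)
        if done:
            return state
    raise BudgetExceeded(f"gave up after {max_steps} steps")


def countdown(n):
    # A well-behaved program: reaches zero and reports completion.
    n -= 1
    return n, n == 0


def runaway(n):
    # A program that never reports completion.
    return n + 1, False


result = run_with_budget(countdown, 5, max_steps=100)

try:
    run_with_budget(runaway, 0, max_steps=100)
    runaway_outcome = "finished"
except BudgetExceeded:
    runaway_outcome = "terminated by host"

print(result, runaway_outcome)
```

The budget handles runaway evaluation, but it does nothing about side effects, which is why those are the harder problem.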

The real problem (to my mind) in general-purpose languages when it comes to using them for config-specification is not turing completeness, it's purity (i.e. reproducibility). And that's not even really a language issue alone, it's because those languages tend to come with large, pervasively used libraries, to the point that it's not trivial to just take some code off stackoverflow (say) and reliably tell whether it's pure or not - because that depends on the internals of all of those library methods too.


> and conversely a turing complete language that is pure (i.e. keep your tape private).

A Turing complete language can't be pure because there is no value that is equivalent to a nonterminating computation.


That's irrelevant right? The point is that it's reproducible. Whether you define purity as to include non-termination or not is beside the point; the point is to avoid side-effects. Lack of side effects matters in the context of configuration, non-termination does not (and see the thread you're replying to for an argument as to why that is). That's kind of the whole point of the argument.


> That's irrelevant right? The point is that it's reproducible.

It's not reproducible if it's not a value. The point of a pure function is that you can replace it with the value that it evaluates to.

If you include nontermination as a value in your language then your language becomes almost impossible to reason about as you break almost every equivalence property you could think of. E.g. you can no longer say x * 0 = 0.
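A Python sketch of the broken equivalence. Python cannot really loop forever inside a test, so this assumes the interpreter's recursion limit as a stand-in for nontermination; the function names are hypothetical.

```python
def diverge() -> int:
    # Never produces a value; Python's recursion limit eventually cuts it off.
    return diverge()


def original() -> int:
    # "x * 0" where x is a computation that never finishes.
    return diverge() * 0


def rewritten() -> int:
    # The algebraic rewrite "x * 0 = 0" applied to the expression above.
    return 0


def outcome(expression):
    try:
        return expression()
    except RecursionError:
        return "did not terminate"


print(outcome(original), outcome(rewritten))
```

The two expressions are interchangeable only if `x` is known to terminate, so the rewrite is not valid for arbitrary `x`.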

> Lack of side effects matters in the context of configuration, non-termination does not (and see the thread you're replying to for an argument as to why that is).

I don't find "but terminating code may still take a long time" to be a convincing argument that nontermination isn't important; rather it's an argument that code taking a long time might also be important (at least to the extent that it actually comes up in practice, which I'm not convinced of).


I think it's pretty reasonable to say that technically you might not be able to equivalently replace x * 0 by 0. Note that if it's replaceable, it's reliably replaceable. This is essentially how pretty much all functional languages work, incidentally - functions may not terminate, and in theory that can cause issues, but in practice not so much. Part of the saving grace here is that:

(A) - you're exceedingly unlikely to run into this issue in the first place. Nontermination limits aren't set to things you're likely to hit without a runaway loop or recursion.

(B) - when you do hit a forced termination issue, such replacements are usually irrelevant, i.e. if your algebraic rewrite doesn't affect the recursion or loop it won't affect bail out either. Depending on how you implement forced termination, you can likely guarantee this, but it's not very valuable to.

(C) - The alternative isn't real if you allow theoretically bounded loops but with high limits. You can decide not to specify forced termination, but that doesn't mean you don't have it; it simply means the OS or user will terminate the process instead, and reasoning about that is much, much harder. A system with small limits is possible, but those are much less practical to start with. And there have been quite a few systems over the years that tried to impose such limits by design and then it turned out that there were escape hatches that could be abused to nevertheless impose huge load (if you can do any kind of doubling in an iteration, you don't need many to cause denial-of-service).

(D) - Although it's possible an algebraic rewrite could affect termination, the scope for this is pretty constrained. Either it works, or the whole system fails to terminate; there's no middle ground. That means if you simply assume termination will occur and deal with the code as if it were pure, you'll either end up with a functioning system, or a clearly non-functioning system, but without corruption or any unacceptable uncertainty. (It's possible to shoot yourself in the foot here, but I don't think it's possible to do so accidentally).
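The arithmetic behind the doubling remark in (C) is easy to check. This Python sketch only computes sizes and counts rather than allocating anything; the function names and the specific numbers are illustrative assumptions.

```python
def size_after_doublings(initial_bytes: int, iterations: int) -> int:
    # Each pass through the bounded loop doubles the value's size.
    size = initial_bytes
    for _ in range(iterations):
        size *= 2
    return size


def nested_iterations(bound: int, depth: int) -> int:
    # Total inner-body executions for `depth` nested loops of `bound` each.
    return bound ** depth


# 40 iterations of doubling turn one byte into 2**40 bytes (one tebibyte).
print(size_after_doublings(1, 40))

# Three nested loops with a bound of one million each: 10**18 iterations.
print(nested_iterations(1_000_000, 3))
```

Both programs are bounded and would count as terminating on paper, yet neither finishes on real hardware in any time a user would wait.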

I mean, if you want to make the argument that all of this is tricky - yes, it sure is; and there are a few risks and some complexity to all this! But simultaneously, I don't think you're going to do a lot better if you want the kind of flexibility that recursion and looping allow. These complexities are pretty manageable in practice; the risks limited. And if you need even tighter guarantees you're going to need to lose most recursion and loops, likely even bounded loops. I've never used it, but earlier in this thread somebody mentioned Starlark - and while they clearly tried to avoid turing completeness (loops are bounded, and no recursion by the looks of it), they do allow nested loops with large bounds; i.e., given whatever time you think your OS or user will be willing to wait before pulling the plug, you won't hit those bounds: in terms of reasoning, you cannot rely on termination.

But I think that's a decent trade-off. The restrictions needed to reliably terminate in some bounded amount of resources are just too onerous to leave room for a language that can come close to one that does not have those restrictions. As such, it's fine to have either a deterministic, side-effect free language with the risk of (practical) non-termination, or a language with very, very limited looping (e.g. no nesting, and perhaps only constructs like map as opposed to iterating over ranges) - but not much room for anything in between.

Again, the context is the kind of languages you might consider for configuration. And in that context I don't think that turing completeness is all that relevant, compared to determinism and no side-effects, assuming termination. It's those latter aspects that really have a huge impact, and termination mostly in theory, not in practice.



