kfool · on Nov 6, 2011

(Disclaimer: I work on ChronicDB)

I second that: schema-less is misunderstood.

There's a difference between flexibility of schema definition and flexibility of schema change[1].

Flexibility of schema change, which NoSQL does not solve, is increasingly important, not just for large data stores but also for the data development and release processes. To avoid playing the suboptimal schema-change game, both the code and the data need to be updated together, or at least be given the illusion that they have[2].

A probably obvious question most developers must have asked by now is: if we've built great tools to version source changes, how come we haven't built great tools to version data changes?

[1] - http://chronicdb.com/blogs/nosql_is_technologically_inferior...

[2] - http://chronicdb.com/blogs/change_is_not_the_enemy

kfool · on Sept 1, 2011

This can be handled by deleting orders from the logical representation, yet preserving them in the historical layer.
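A minimal sketch of that idea, assuming a SQLite database with a hypothetical `orders` table as the logical representation and a hypothetical `orders_history` table as the historical layer. The names and the trigger are illustrative only and are not taken from ChronicDB:

```python
import sqlite3

# Hypothetical schema: `orders` is what applications query,
# `orders_history` is the historical layer (both names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE orders_history (id INTEGER, item TEXT, deleted_at TEXT);
    -- Copy each row into the historical layer just before it is deleted.
    CREATE TRIGGER keep_deleted_orders BEFORE DELETE ON orders
    BEGIN
        INSERT INTO orders_history VALUES (OLD.id, OLD.item, datetime('now'));
    END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'book')")
conn.execute("DELETE FROM orders WHERE id = 1")

# The order is gone from the logical table but still present in the history.
logical = conn.execute("SELECT * FROM orders").fetchall()
historical = conn.execute("SELECT id, item FROM orders_history").fetchall()
```

Applications that only read `orders` never see the deleted row, while anything that needs the past can read `orders_history`.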

kfool · on Sept 1, 2011

That's one of the things ChronicDB does. It makes historical values available in the average application.

kfool · on Sept 1, 2011

Was there something specific in the data model that made the versioning hard to write? Or was it that, for this to work across the board, the entire model had to be versioned?

It sounds like SQL itself wasn't the problem. Were you looking for versioning alternatives in SQL that weren't up to par?

dgreensp · on Sept 1, 2011

SQL was a solution more than a problem; we were just hoping for a simpler or more elegant solution.

For example, it's possible that logging all data model changes to a text file would have given us the persistence we were looking for without bridging all our data to SQL. Cutting out SQL from our production set-up would have been an inherent win -- one less process to manage, one less black-box source of complexity, etc.
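A rough sketch of what such a change log could look like, assuming a simple key/value data model and one JSON object per line. Both are assumptions made for illustration, as are the function names `log_change` and `replay`:

```python
import json
import os
import tempfile

def log_change(path, key, value):
    """Append one change as a JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")

def replay(path):
    """Rebuild the in-memory model by applying every logged change in order."""
    model = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            change = json.loads(line)
            model[change["key"]] = change["value"]
    return model

log_path = os.path.join(tempfile.mkdtemp(), "changes.log")
log_change(log_path, "title", "draft")
log_change(log_path, "title", "final")
restored = replay(log_path)  # later changes win over earlier ones
```

Because the file is append-only, the full history of changes stays available, and persistence needs no separate database process.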

kfool · on Aug 31, 2011

Isolating databases behind APIs and rev-ing API+schema separately is not enough. When the schema changes, data must be transformed to match the new schema version. As you point out this takes too long with a large database, and it doesn't account for data consistency.

We have been working on building what we hope are the procedures for this with ChronicDB (http://chronicdb.com). It has turned out to be harder than it seems, and we are not sure it will quite work out. We'd welcome feedback.

kfool · on Aug 31, 2011

Good point. With databases distributing data in-memory across machines, shared memory becomes the database. Don't be surprised if Arc runs HackerNews on distributed memory some day...

But one has to wonder, what happens when you need to upgrade the app? Shutting down the process and destroying the memory image doesn't seem like the best option:

- First, it disrupts connected applications since the process is killed, introducing downtime.

- Second, when starting up again in, say, version 2, the data that will be loaded in memory still needs to be transformed into the format expected by version 2. This transformation can take time on large data, introducing further downtime.

The challenge would be to eliminate this downtime by combining a solution for both client disruption and state transfer. A data abstraction like a database using SQL can simplify such a solution.
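To make the second point concrete, here is a small sketch of a startup-time transformation, assuming a hypothetical version 1 record with a single `name` field and a hypothetical version 2 record that splits it into `first_name` and `last_name`:

```python
def upgrade_record(record):
    """Transform one hypothetical v1 record into the hypothetical v2 format."""
    if record.get("version", 1) >= 2:
        return record  # already in the new format
    first, _, last = record["name"].partition(" ")
    return {"version": 2, "first_name": first, "last_name": last}

def load_for_v2(stored_records):
    # Every record is transformed before the app can serve requests,
    # so startup time grows with the size of the data.
    return [upgrade_record(r) for r in stored_records]

stored = [
    {"name": "Ada Lovelace"},
    {"version": 2, "first_name": "Alan", "last_name": "Turing"},
]
loaded = load_for_v2(stored)
```

The loop in `load_for_v2` is where the extra downtime comes from: nothing can be served until the whole data set has been walked.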

kfool · on Aug 31, 2011

Very well put. It is not access type (direct vs indirect) that makes schema improvements hard, but access preservation: availability. And indeed copying the data often leads to one or more of the copies being wrong, unless special measures are taken to prevent that.

> How do you guarantee that all of the apps that touch that data use the current version of said code?

An approach that may be worth considering is to not require all apps to use the current version: allow multiple versions. For some cases this would work, say if the semantics of the newer version are backwards compatible with the semantics of the older version. If the data semantics are preservable, transforming a schema could happen while each data access request to the schema is actively transformed.
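One way to picture this is the sketch below, which assumes the data is stored in the newer schema and that requests from old-version apps are translated on every access. The store, the field names, and the functions `read_v1`, `read_v2`, and `write_v1` are all hypothetical:

```python
# Hypothetical stored format (new schema): separate first/last name fields.
store = {1: {"first_name": "Ada", "last_name": "Lovelace"}}

def read_v2(user_id):
    """Apps on the new version see the stored record as-is."""
    return dict(store[user_id])

def read_v1(user_id):
    """Apps on the old version get the record translated back on each read."""
    row = store[user_id]
    return {"name": f"{row['first_name']} {row['last_name']}"}

def write_v1(user_id, record):
    """Old-version writes are translated forward into the new schema."""
    first, _, last = record["name"].partition(" ")
    store[user_id] = {"first_name": first, "last_name": last}

# An old-version app writes; a new-version app can read the result.
write_v1(2, {"name": "Alan Turing"})
```

This only works here because the old `name` field can be reconstructed from the new fields and vice versa, which is the backward-compatible case described above.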

But it clearly wouldn't work in all cases. More work to handle that would be needed.

> Code normalization is as important as data normalization.

True, this is the near show-stopper really. In that case, the best one can hope for is preparing the state of the new version (new data in new schema) and carefully coordinating a quick restart that swaps the old version for the new version.

I would love to hear your thoughts on this. We have been working towards that direction with ChronicDB (http://chronicdb.com) and would welcome feedback.

anamax · on Sept 3, 2011

> I would love to hear your thoughts on this.

An e-mail address in your profile would have made that possible. (Chronicdb looks interesting and complements something that I've been thinking about. I suspect that you've implemented many of the relevant mechanisms.)

kfool · on May 24, 2011

Here is how I see things:

1. Updates should not be applied only in sequence.

It is better to produce a binary diff between any two versions, and apply only that (one) binary diff. The reason for this isn't efficiency, but semantics. Updates not only fix things, but break things. Meaning, updates corrupt application state (data), both in-memory and on-disk. It can be disastrous to apply an intermediate update that removes state, only to realize that a future version reversed the semantics and needs to use that state (which was available, but is now gone).

Preserving backward compatibility is important, which means the ability to skip some version updates is necessary. To the extent possible, reversing updates is important too.
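A toy illustration of the failure mode, modelling application state as a dictionary rather than a binary image. The versions and the `nickname` field are invented for the example: version 2 removes the field, and version 3 needs it again.

```python
def v1_to_v2(state):
    """Intermediate update: hypothetical v2 dropped the 'nickname' field."""
    return {k: v for k, v in state.items() if k != "nickname"}

def v2_to_v3(state):
    """Hypothetical v3 uses 'nickname' again; if it is gone, only a default remains."""
    return {**state, "nickname": state.get("nickname", "")}

def v1_to_v3(state):
    """Direct update computed between v1 and v3: nothing needs to be removed."""
    return dict(state)

v1_state = {"name": "Ada", "nickname": "Countess"}

sequential = v2_to_v3(v1_to_v2(v1_state))  # the original nickname is lost
direct = v1_to_v3(v1_state)                # the original nickname survives
```

Applying the updates in sequence destroys state that the final version wanted, while the single v1-to-v3 update never touches it.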

2. The ideal update system should apply updates live, not offline.

With a model that accounts for updating the entire state of an application, updating live is possible. The reason most updates are not applied live yet is that the model is not descriptive enough to change the entire state of the running application.

Notable state that should be updated, but often isn't, is continuations and the stack. This is why GUI applications need to be shut down to update.

Scheme's call/cc (call-with-current-continuation) solved the problem of changing continuations and stack state decades ago, and better than Erlang does. Erlang cannot force stacks to unroll or continue from arbitrary points.

3. Updates must be produced with source code and programmer input.

Updates should not be produced with binaries as input.

The reason is the need to account for application semantics, which binaries do not expose in the detail source code does. Although automated, sophisticated semantic-diffing based on control-flow can be developed, it is sometimes inconclusive whether an update will break things.

4. It is necessary for programmers to provide live update guidance.

In the cases where producing provably safe dynamic updates is not possible, it is input from the programmer that can clear any conservatism of the safety certification process.

Tools are needed for programmers to reason about the semantic safety of their live updates, integrated in the development process, including tools that help transform application state between versions.

kfool · on April 20, 2011

I would suggest neither the what nor the where, but the whom.

If your supervisor's background does not intimidate you, and you don't desperately want to become like them, then find someone else.

kfool · on April 6, 2011

ChronicDB could help here http://chronicdb.com