I feel like this article stabs at a kind of interesting tradeoff in distributed ...

mslot · on June 20, 2018

Replicating data in consensus groups gives some nice guarantees, in particular it can guarantee monotonic read consistency even if you're not reading from the leader, at the cost of a network round-trip. Switching between master-slave replicas might cause you to go back and forth in time, even if replication is synchronous.

There are also some potential availability benefits to storing data in consensus groups, though most implementations use leader leases to improve performance, which reduces availability in case of leader failure. A master-slave system that does synchronous replication could do failovers that are just as fast as leader changes in a consensus protocol.

The pain of storing data in consensus groups is in performance, especially concurrency. Every write to the log requires some synchronization with a quorum to ensure consensus on the order. That introduces extra round-trips and makes it hard to handle high concurrency.

There are quite a few bleeding edge distributed databases that have trouble matching the throughput of a single node RDBMS unless you're willing to spend an order of magnitude more on hardware or your workload follows a very specific pattern.

gregwebs · on June 20, 2018

NewSQL databases still may use consensus to point to a leader for data, its just that data is transparently partitioned and each partition can have a different leader.

In TiDB for example each region (data partition) has its own leader and only that leader can serve writes (but the leader can change via a raft-based election). By default all servers are both leaders of some partitions and followers of others [1].

Putting all your data on one master stops scaling at some point when data is large enough. However, even before that point, even without high-availability requirements, you are making trade-offs which may be good or bad (e.g. throughput is limited on a single master, but latency is low).

[1]: https://pingcap.github.io/docs/tikv/tikv-overview/

elvinyung · on June 21, 2018

Isn't sharding orthogonal to this? It seems completely possible to have a system where the individual data partitions are asynchronously replicated, but the metadata of leader leases and locations for all the partitions are listed on an separate strongly-consistent fault-tolerant source. I think Vitess does something like that, and certainly Bigtable [1] was an early implementation using Chubby to point to the "root" tablet.

[1] https://static.googleusercontent.com/media/research.google.c...

gregwebs · on June 22, 2018

Sharding is generally used to refer to a particular type of partitioning: physically distributing data on different machines with the purpose of scaling horizontally (to increase write capacity).

What many NewSQL databases use is a logical partitioning scheme where data is partitioned into <= 1 GB chunks. This makes it easy to both scale horizontally and achieve fault tolerance (and change these characteristics without interrupting service). In some systems that component may be coupled and hidden away. In TiDB it is a separately deployed component called the Placement Driver (which embeds etcd, a Chubby equivalent): https://pingcap.com/blog/2016-11-09-Deep-Dive-into-TiKV/#pla...

With sharding you do still need a system to locate your data. A big difference is that most sharding systems are not designed to distribute transactions (or possibly even joins) across shards. So Vitess has a special 2PC for cross-shard transactions with poorer performance than a single shard transaction.

jchanimal · on June 20, 2018

Here is a video demo of FaunaDB maintaining ACID transactional correctness even while datacenters go offline. Relevant because the underlying cluster manager maintains correctness at both the master-tracking and data level.

https://blog.fauna.com/demonstrating-transactional-correctne...

stplsd · on June 20, 2018

From this article I cannot parse if they use simple Mysql replication with GTID or MySQL cluster/Galera/similar solution.