I have a requirement to check group membership, where a group may contain >10K members. Each member is an 8-byte ID. Do you think an Xor filter would be a better fit than Bloom here? Or am I looking at it the wrong way? Thanks.
> Do you think Xor filter would be a better fit compared to Bloom here?
The 8-byte key is the only scenario where you should consider XorPlus (i.e., 8 bytes mapped to a long).
The lookup properties of the Xor filter are better in that case, but the real question is whether you have the entire collection up front to build the bitset from.
The sketch production isn't incremental - there is no add(k) after building it once.
So you can't add data once it is built, while Bloom filters do support adding entries after the fact (in fact, you can merge whole Bloom filters into one, rather than sending all the new keys).
And both of those approaches are missing an unset operation.
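To make the contrast concrete, here is a minimal Bloom filter sketch in Python (illustrative only; the sizing and double-hashing scheme are arbitrary choices, not from any particular library): the bit array keeps absorbing `add()` calls at any time, whereas an Xor filter must be constructed once from the complete key set and is immutable afterwards.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: supports add() at any time, no remove()."""

    def __init__(self, m_bits=8192, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from one hash via double hashing.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add((42).to_bytes(8, "little"))  # members can be added after construction
assert bf.might_contain((42).to_bytes(8, "little"))
# An Xor filter, by contrast, is built in one shot from the full key set
# and offers no add() afterwards.
```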
Yes, insertion/deletion is required, and the frequency may be high. The exact use case is that group membership depends on some conditions. We schedule this check every 15 minutes and, based on it, add/delete members.
And just for completeness, storing everything losslessly in a B-tree, radix tree, or hash table would probably be under 300 kB (80 kB just for the IDs plus, say, an additional ~2x overhead).
You can use a counting Bloom filter, which allows deletion. Depending on your update requirements you might also be OK with two Bloom filters: one that has the group additions and one that has the removals.
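A counting Bloom filter replaces each bit with a small counter, which is what makes removal possible. A minimal sketch (counter width and sizing are arbitrary here):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: each slot is a counter instead of a
    bit, so entries can be removed as well as added."""

    def __init__(self, m=4096, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, key: bytes):
        # k slot indices derived from one hash via double hashing.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.counters[p] += 1

    def remove(self, key: bytes):
        # Only remove keys that were actually added; otherwise counters
        # underflow and the filter is corrupted.
        for p in self._positions(key):
            self.counters[p] -= 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))
```

The cost is memory: with, say, 4-bit counters the filter is about 4x the size of a plain Bloom filter with the same false-positive rate.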
> The sketch production isn't incremental - there is no add(k) after building it once.
Which means they can't be used for distributed set intersections, unions, or cardinality estimation, a not-insignificant side benefit of Bloom filters.
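To illustrate the union/intersection point: two Bloom filters built with identical size and hash functions can be merged bitwise. A minimal sketch (OR yields exactly the filter of the union of the two key sets; AND over-approximates the intersection):

```python
# Assumes both bit arrays come from Bloom filters with the same m and
# the same hash functions; the byte values below are made up.
def union(bits_a: bytes, bits_b: bytes) -> bytes:
    return bytes(a | b for a, b in zip(bits_a, bits_b))

def intersect(bits_a: bytes, bits_b: bytes) -> bytes:
    # May keep bits set by unrelated keys, so this only approximates
    # the filter of the true intersection.
    return bytes(a & b for a, b in zip(bits_a, bits_b))

a = bytes([0b1010, 0b0001])
b = bytes([0b0110, 0b1000])
assert union(a, b) == bytes([0b1110, 0b1001])
assert intersect(a, b) == bytes([0b0010, 0b0000])
```

An Xor filter has no analogous merge operation; changing the key set means rebuilding from scratch.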
Sure, but 10K items is so few that you might as well not bother. If you're worried about memory usage you can also just scan the whole table each time (no such thing as a cache miss when the whole table fits in cache).
This is better than a hash table for membership testing. As compared to a table scan, it depends on your key size and number of keys.
This algorithm means you could easily have more than 10K items, especially for larger keys, and still likely reside in one of the cache levels. Heck, a lot of CPUs can't hold 10K items in L1: for 64-bit numbers that's 80 KB, let alone something like UUIDs or other larger sparse keys.
Also, a main-memory access has a delay of roughly a few hundred cycles, and iterating a table of 10,000 items will run to thousands of cycles even when it's cache-resident. A hit in L1 is about 3 cycles. So a table scan is possibly slower or close to equal, even granting that a superscalar CPU has multiple execution ports.
As soon as the set gets any larger, a table scan is easily going to be slower.
This also gives you O(1) queries, and being smaller means the table is more likely to reside in cache for any given lookup. A smaller memory footprint is a great boon for any hash-based lookup.
All excellent points, but no matter how impressive the savings, it's not going to be interesting if the total time spent checking membership is low to begin with. You'd need to be optimizing a very tight loop to even notice the difference between brute force and something more sophisticated like a hash map. And you'd need to be pretty damn desperate for more performance to even consider implementing an algorithm described in a paper in the hope of beating the standard solutions.
It's pretty good fun though, so don't let pragmatism stop you. Just beware of premature optimization and all that.
A hash map is better than a hash table? I don't understand, genuine question, aren't they the same thing or am I getting confused about what you're saying?
Try HyperLogLog (HLL), another terrific and widely available probabilistic sketch, though note it estimates set cardinality rather than answering membership queries.
All these algorithms are statistical, and positive answers should be checked in a second phase: not-a-member is reliable, but is-a-member should be confirmed against the real data.
Is performance an issue? If not, Occam says go with the simplest, which is Bloom. If performance matters and you are prepared to burn a bit more memory for speed, use blocked Bloom filters, which are likely optimal for speed and still simple.
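The blocked variant can be sketched in a few lines (parameters are arbitrary here): the first hash picks one cache-line-sized block, and all k probe bits land inside that block, so a query touches a single cache line instead of k scattered ones.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line per block

class BlockedBloomFilter:
    """Blocked Bloom filter sketch: all probes for a key stay within
    one cache-line-sized block chosen by the first hash."""

    def __init__(self, n_blocks=64, k=4):
        self.n_blocks, self.k = n_blocks, k
        self.bits = bytearray(n_blocks * BLOCK_BITS // 8)

    def _positions(self, key: bytes):
        d = hashlib.sha256(key).digest()
        base = (int.from_bytes(d[:8], "little") % self.n_blocks) * BLOCK_BITS
        h1 = int.from_bytes(d[8:16], "little")
        h2 = int.from_bytes(d[16:24], "little") | 1
        # Every probe is offset within the single chosen block.
        return [base + (h1 + i * h2) % BLOCK_BITS for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))
```

The trade-off is a slightly higher false-positive rate for the same total size, since keys are confined to their block.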
Common Expression Language (CEL) is a non-Turing complete language designed for simplicity, speed, safety, and portability. CEL is designed to be evaluated without a sandbox, making it ideal for use within latency critical applications.
## Features
- Familiar C-like syntax common to C++, Go, Java, and TypeScript.
- Gradual typing with first-class support for Protobuf and JSON types.
- High performance evaluators in C++ and Go with a rich set of standard operators.
## How Google uses Common Expression Language
Google uses CEL as the expression component of IAM and Firebase security policies, as well as within Istio Mixer configs.
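For a flavor of the syntax, a condition of the kind used in such policies might look like this (the attribute names are illustrative, not from any specific policy schema):

```
request.time < timestamp("2020-01-01T00:00:00Z") &&
resource.name.startsWith("projects/example-project")
```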
The Record Layer is written in Java as it was designed to fit in with an existing stack that was already primarily Java-based. You can read more about how CloudKit uses the Record Layer in the preprint of the Record Layer paper: https://www.foundationdb.org/files/record-layer-paper.pdf
Excellent question regarding the choice to use Protocol Buffers.

Firstly, as mentioned in the paper released last year, CloudKit uses Protocol Buffers for client-server communication. As a result, there was already expertise around protobuf, which is a good tie-breaker when evaluating alternatives. (Here's that paper, by the way: http://www.vldb.org/pvldb/vol11/p540-shraer.pdf)

Secondly, the Record Layer makes heavy use of Protocol Buffer descriptors, which specify the field types and names within protobuf schemata, and of dynamic messages. Descriptors are used internally within the Record Layer to do things like schema validation. (For example, if an index is defined on a specific field, the descriptor can be checked to validate that that field exists in the given record type.) Likewise, dynamic messages make it possible for applications using the Record Layer to load their schema at run time by reading it from storage. The FDBMetaDataStore allows the user to do exactly that (while storing the schema persistently in FoundationDB): https://static.javadoc.io/org.foundationdb/fdb-record-layer-...
The Record Layer's data format is not compatible with the Apache Arrow specification, no.
The Record Layer doesn't currently support foreign key constraints, so foreign keys are more of a “design pattern” than a first-class feature. For example, in a sample schema in the repository, an “Order” message has a field called “item_id” that points to the primary key of an “Item” message: https://github.com/FoundationDB/fdb-record-layer/blob/792c95... There isn't an automatic check to make sure the item exists, though, nor are there cascading deletes. That being said, I don't think the architecture is incompatible with that feature, so it would be a reasonable feature request.
There are some guidelines regarding field type changes in the schema evolution guide: https://foundationdb.github.io/fdb-record-layer/SchemaEvolut... Most data type changes are incompatible with either Protobuf's serialization format or the FDB Tuple layer's serialization format (which the Record Layer uses for storing secondary indexes and primary keys). The general advice for type changes (if there is existing data in your record stores) would instead be to introduce a new field of the new type and deprecate the old one.