I have a requirement to check group membership, where a group may contain >10K members. Each member is an 8-byte ID. Do you think an Xor filter would be a better fit than Bloom here? Or am I looking at it the wrong way? Thanks.
> Do you think Xor filter would be a better fit compared to Bloom here?
The 8-byte key is the only scenario where you should consider XorPlus (i.e., 8 bytes mapped to a long).
The lookup properties of the Xor filter are better in that case, but the real question is whether you have the entire collection up front to build the bitset from.
The sketch production isn't incremental - there is no add(k) after building it once.
So you can't add data once it is built, while Bloom filters do support adding entries after the fact (in fact, you can merge whole Bloom filters into one, rather than sending all the new keys).
And both of those approaches are missing an unset operation.
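To make the contrast concrete, here is a minimal Bloom filter sketch in Python (illustrative only; the sizing and double-hashing scheme are arbitrary choices, not from any particular library): the bit array keeps absorbing `add()` calls at any time, whereas an Xor filter must be constructed once from the complete key set and is immutable afterwards.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: supports add() at any time, no remove()."""

    def __init__(self, m_bits=8192, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from one hash via double hashing.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add((42).to_bytes(8, "little"))  # members can be added after construction
assert bf.might_contain((42).to_bytes(8, "little"))
# An Xor filter, by contrast, is built in one shot from the full key set
# and offers no add() afterwards.
```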
Yes, insertion/deletion is required, and the frequency may be high. The exact use case is that group membership depends on some conditions. We schedule this check every 15 minutes and, based on it, add/delete members.
And just for completeness, storing everything losslessly in a B-tree, radix tree, or hash table would probably be under 300 kB (80 kB just for the IDs plus, say, an additional ~2x overhead).
You can use a counting Bloom filter, which allows deletion. Depending on your update requirements you might also be OK with two Bloom filters: one that has the group additions and one that has the removals.
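A counting Bloom filter replaces each bit with a small counter, which is what makes removal possible. A minimal sketch (counter width and sizing are arbitrary here):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: each slot is a counter instead of a
    bit, so entries can be removed as well as added."""

    def __init__(self, m=4096, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, key: bytes):
        # k slot indices derived from one hash via double hashing.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.counters[p] += 1

    def remove(self, key: bytes):
        # Only remove keys that were actually added; otherwise counters
        # underflow and the filter is corrupted.
        for p in self._positions(key):
            self.counters[p] -= 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))
```

The cost is memory: with, say, 4-bit counters the filter is about 4x the size of a plain Bloom filter with the same false-positive rate.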
> The sketch production isn't incremental - there is no add(k) after building it once.
Which means they can't be used for distributed set intersections, unions, or cardinality estimation, a not-insignificant side benefit of Bloom filters.
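To illustrate the union/intersection point: two Bloom filters built with identical size and hash functions can be merged bitwise. A minimal sketch (OR yields exactly the filter of the union of the two key sets; AND over-approximates the intersection):

```python
# Assumes both bit arrays come from Bloom filters with the same m and
# the same hash functions; the byte values below are made up.
def union(bits_a: bytes, bits_b: bytes) -> bytes:
    return bytes(a | b for a, b in zip(bits_a, bits_b))

def intersect(bits_a: bytes, bits_b: bytes) -> bytes:
    # May keep bits set by unrelated keys, so this only approximates
    # the filter of the true intersection.
    return bytes(a & b for a, b in zip(bits_a, bits_b))

a = bytes([0b1010, 0b0001])
b = bytes([0b0110, 0b1000])
assert union(a, b) == bytes([0b1110, 0b1001])
assert intersect(a, b) == bytes([0b0010, 0b0000])
```

An Xor filter has no analogous merge operation; changing the key set means rebuilding from scratch.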
Sure, but 10K items is so few that you might as well not bother. If you're worried about memory usage you can also just scan the whole table each time (no such thing as a cache miss when the whole table fits in cache).
This is better than a hash table for membership testing. As compared to a table scan, it depends on your key size and number of keys.
This algorithm means you could easily have more than 10K items, especially for larger keys, and still likely reside in one of the cache levels. Heck, a lot of CPUs can't hold 10K items in L1: for 64-bit numbers that's 80 KB, let alone something like UUIDs or other larger sparse keys.
Also, a main-memory access has a delay of roughly a few hundred cycles, and iterating a table of 10,000 items will run to thousands of cycles even when it's cache-resident. A hit in L1 is about 3 cycles. So a table scan is possibly slower or close to equal, even granting that a superscalar CPU has multiple execution ports.
As soon as the set gets any larger, a table scan is easily going to be slower.
This also gives you O(1) queries, and being smaller means the table is more likely to reside in cache for any given lookup. A smaller memory footprint is a great boon for any hash-based lookup.
All excellent points, but no matter how impressive the savings, it's not going to be interesting if the total time spent checking membership is low to begin with. You'd need to be optimizing a very tight loop to even notice the difference between brute force and something more sophisticated like a hash map. And you'd need to be pretty damn desperate for more performance to even consider implementing an algorithm described in a paper in the hope of beating the standard solutions.
It's pretty good fun though, so don't let pragmatism stop you. Just beware of premature optimization and all that.
A hash map is better than a hash table? I don't understand, genuine question, aren't they the same thing or am I getting confused about what you're saying?
Try HyperLogLog (HLL), another terrific and widely available probabilistic sketch, though note it estimates set cardinality rather than answering membership queries.
All these algorithms are statistical, and positive answers should be checked in a second phase: not-a-member is reliable, but is-a-member should be confirmed against the real data.
Is performance an issue? If not, Occam says go with the simplest, which is Bloom. If performance matters and you are prepared to burn a bit more memory for speed, use blocked Bloom filters, which are likely optimal for speed and still simple.
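The blocked variant can be sketched in a few lines (parameters are arbitrary here): the first hash picks one cache-line-sized block, and all k probe bits land inside that block, so a query touches a single cache line instead of k scattered ones.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line per block

class BlockedBloomFilter:
    """Blocked Bloom filter sketch: all probes for a key stay within
    one cache-line-sized block chosen by the first hash."""

    def __init__(self, n_blocks=64, k=4):
        self.n_blocks, self.k = n_blocks, k
        self.bits = bytearray(n_blocks * BLOCK_BITS // 8)

    def _positions(self, key: bytes):
        d = hashlib.sha256(key).digest()
        base = (int.from_bytes(d[:8], "little") % self.n_blocks) * BLOCK_BITS
        h1 = int.from_bytes(d[8:16], "little")
        h2 = int.from_bytes(d[16:24], "little") | 1
        # Every probe is offset within the single chosen block.
        return [base + (h1 + i * h2) % BLOCK_BITS for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))
```

The trade-off is a slightly higher false-positive rate for the same total size, since keys are confined to their block.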
Common Expression Language (CEL) is a non-Turing complete language designed for simplicity, speed, safety, and portability. CEL is designed to be evaluated without a sandbox, making it ideal for use within latency critical applications.
## Features
- Familiar C-like syntax common to C++, Go, Java, and TypeScript.
- Gradual typing with first-class support for Protobuf and JSON types.
- High performance evaluators in C++ and Go with a rich set of standard operators.
## How Google uses Common Expression Language
Google uses CEL as the expression component of IAM and Firebase security policies, as well as within Istio Mixer configs.
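For a flavor of the syntax, a condition of the kind used in such policies might look like this (the attribute names are illustrative, not from any specific policy schema):

```
request.time < timestamp("2020-01-01T00:00:00Z") &&
resource.name.startsWith("projects/example-project")
```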
The Record Layer is written in Java as it was designed to fit in with an existing stack that was already primarily Java-based. You can read more about how CloudKit uses the Record Layer in the preprint of the Record Layer paper: https://www.foundationdb.org/files/record-layer-paper.pdf
Excellent question regarding the choice to use Protocol Buffers.

Firstly, as mentioned in the paper released last year, CloudKit uses Protocol Buffers for client-server communication. As a result, there was already expertise around protobuf, which is a good tie-breaker when evaluating alternatives. (Here's that paper, by the way: http://www.vldb.org/pvldb/vol11/p540-shraer.pdf)

Secondly, the Record Layer makes heavy use of Protocol Buffer descriptors, which specify the field types and names within protobuf schemata, and of dynamic messages. Descriptors are used internally within the Record Layer to do things like schema validation. (For example, if an index is defined on a specific field, the descriptor can be checked to validate that that field exists in the given record type.) Likewise, dynamic messages make it possible for applications using the Record Layer to load their schema at run time by reading it from storage. The FDBMetaDataStore allows the user to do exactly that (while storing the schema persistently in FoundationDB): https://static.javadoc.io/org.foundationdb/fdb-record-layer-...
The Record Layer's data format is not compatible with the Apache Arrow specification, no.
The Record Layer doesn't currently support foreign key constraints, so foreign keys are more of a “design pattern” than a first-class feature. For example, in a sample schema in the repository, an “Order” message has a field called “item_id” that points to the primary key of an “Item” message: https://github.com/FoundationDB/fdb-record-layer/blob/792c95... There isn't an automatic check to make sure the item exists, though, nor are there cascading deletes. That being said, I don't think the architecture is incompatible with that feature, so it would be a reasonable feature request.
There are some guidelines regarding field type changes in the schema evolution guide: https://foundationdb.github.io/fdb-record-layer/SchemaEvolut... Most data type changes are incompatible with either Protobuf's serialization format or the FDB Tuple layer's serialization format (which the Record Layer uses for storing secondary indexes and primary keys). The general advice for type changes (if there is existing data in your record stores) would instead be to introduce a new field of the new type and deprecate the old one.