What are your recommendations for scalable storage of append-only data? What are your favorite memory-mapping frameworks, like Vaex or Polars? What's hot, like DuckDB?
Unpopular tip: 9 times out of 10 you can store your project's data in a simple CSV file and load all of it into memory (with pandas, for example). Don't waste your time building scalable data storage when you don't need it.
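To make the tip concrete, here's a minimal sketch of the "just load it all" approach, assuming the dataset fits comfortably in RAM. The CSV content and column names are made up for illustration:

```python
# Everything in memory with pandas; fine for datasets that fit in RAM.
import io

import pandas as pd

# Stand-in for a project's CSV file on disk (hypothetical data).
csv_data = io.StringIO(
    "id,country,amount\n"
    "1,US,10.5\n"
    "2,DE,7.25\n"
    "3,US,3.0\n"
)

df = pd.read_csv(csv_data)  # load the whole file into memory
total_by_country = df.groupby("country")["amount"].sum()
```

For a file on disk you'd pass the path to `pd.read_csv` instead of a `StringIO`.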
For big data systems, CSV is arguably a worse format:
Row-wise storage doesn't get the compression benefits of columnar storage, which can significantly reduce storage needs.
Data can contain the same delimiters (comma, newline) that the parser uses, introducing errors into computations. Binary formats (Parquet/ORC) eliminate this issue while providing other benefits as well.
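The delimiter problem is easy to reproduce. A sketch using only Python's stdlib (the names and values are made up); a proper CSV writer works around it with quoting, while typed binary formats like Parquet avoid the issue entirely because values are never mixed with textual delimiters:

```python
# A field value that contains the delimiter itself.
import csv
import io

row = ["Doe, Jane", "Berlin"]

# Naive join/split corrupts the record: 3 fields come back instead of 2.
naive = ",".join(row)
assert naive.split(",") == ["Doe", " Jane", "Berlin"]

# A real CSV writer quotes the field, so a real CSV reader recovers it.
buf = io.StringIO()
csv.writer(buf).writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row
```

Quoting only helps if every producer and consumer in the pipeline honors it, which is exactly where hand-rolled CSV handling tends to break.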
A data quality framework like Deequ helps catch errors in the data before you pass them to downstream applications in ELT (not infallible, though).
For ELT processes, always filter first, before joins, aggregations, and other processing.
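The filter-first idea in a tiny pandas sketch (data is made up): shrinking the rows before the expensive steps means every later operation touches less data.

```python
import pandas as pd

# Hypothetical raw table.
df = pd.DataFrame({
    "country": ["US", "DE", "US", "NL", "US"],
    "amount": [10, 20, 30, 40, 50],
})

# Filter first ...
us_only = df[df["country"] == "US"]

# ... then run the heavier processing on the smaller remainder.
total = us_only["amount"].sum()
```

Query engines like DuckDB and Polars' lazy API do this automatically via predicate pushdown, but in hand-written ELT code the ordering is on you.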
Plan your partitions according to expected query patterns. If reports are run per country, then country is a good partition key; if per date, then date is.
Be aware of data skew: you might introduce skew when partitioning (e.g., partitioning by country when a few countries have far more records than others).
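A quick way to sanity-check a candidate partition key for skew is to compare the largest partition against the average. A sketch with made-up record counts:

```python
from collections import Counter

# Hypothetical partition-key value per record: US dominates the table.
country_of_record = ["US"] * 9_000 + ["DE"] * 600 + ["NL"] * 400

counts = Counter(country_of_record)
mean_size = sum(counts.values()) / len(counts)

# Ratio of the biggest partition to the average partition size;
# anything far above 1.0 signals an unbalanced key.
skew_ratio = max(counts.values()) / mean_size
```

Here the US partition holds 90% of the rows, so a job parallelized by country would bottleneck on one worker; a composite key (e.g., country plus date) can spread that out.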
Agreed. I'm 80% sure CSV is best for his use case, maybe even JSON Lines files. I think he's working with regular-sized data (a few hundred MB) because he's considering memory-mapping it.
I have a rule that no one is allowed to use the words big, small, fast, or slow. You must quantify.
I've met too many people who think that 100MB is big data or that 1 Gbps is a fast internet connection.