Ask HN: Big Data 10x tips and tricks
13 points by hoerzu on Nov 7, 2022 | hide | past | favorite | 5 comments
What are your recommendations for scalable storage of append-only data? What are your favorite frameworks for memory mapping, like Vaex or Polars? What is hot, like DuckDB?


Unpopular tip: 9 times out of 10 you can store your project's data in a simple CSV file and load all of it into memory (with pandas, for example). Don't waste your time building scalable data storage when you don't need it.
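A minimal sketch of that workflow using only the stdlib (with pandas it would be `pd.read_csv(...)` instead); the column names are made up for illustration:

```python
import csv
import io
import statistics

# A few hundred MB of CSV fits comfortably in memory on a laptop.
data = "user_id,amount\n1,10.5\n2,3.25\n3,8.0\n"

rows = list(csv.DictReader(io.StringIO(data)))
mean_amount = statistics.mean(float(r["amount"]) for r in rows)
print(mean_amount)  # 7.25
```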


Counterpoint.

CSV is a great format for humans to comprehend.

For big data systems, CSV is arguably a worse format.

Row-wise storage doesn't provide the compression benefits of columnar storage, which can significantly reduce storage needs.
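A toy illustration of why (not Parquet itself, just the layout idea): a repetitive column compresses far better stored contiguously than interleaved row by row.

```python
import zlib

# Hypothetical data: an incrementing id column and a low-cardinality
# country column, the kind of thing columnar formats compress well.
ids = [str(i) for i in range(30_000)]
countries = ["US", "DE", "IN"] * 10_000

# Row-wise (CSV-like) layout: id,country per line.
row_wise = "\n".join(f"{i},{c}" for i, c in zip(ids, countries)).encode()
# Column-wise layout: all ids first, then all countries.
col_wise = ("\n".join(ids) + "\n" + "\n".join(countries)).encode()

row_size = len(zlib.compress(row_wise))
col_size = len(zlib.compress(col_wise))
print(row_size, col_size)  # the columnar layout compresses smaller
```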

Data can contain the same delimiters (comma, newline) the parser uses, introducing errors into computations. Binary formats (Parquet/ORC) eliminate this issue while providing other benefits as well.
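The failure mode in miniature: a free-text field containing the delimiter. A quoting-aware CSV parser copes, but only if every producer quotes correctly; binary formats carry a schema and have no in-band delimiters at all.

```python
import csv
import io

# A quoted field that contains the delimiter itself.
line = 'acme,"New York, NY",2022'

naive = line.split(",")                        # splits the quoted field
parsed = next(csv.reader(io.StringIO(line)))   # quoting honored: 3 fields
print(len(naive), parsed)  # 4 ['acme', 'New York, NY', '2022']
```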

A data quality framework like deequ should help catch errors in data before you introduce them to downstream applications in ELT (not infallible, though).
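deequ itself is a Scala/Spark library; this is only a minimal Python sketch of the same idea, with made-up field names: assert quality invariants before anything flows downstream.

```python
# Hypothetical batch of rows awaiting load.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 5.5},
    {"id": 3, "country": "IN", "amount": 7.25},
]

ids = [r["id"] for r in rows]
checks = {
    "id_unique": len(ids) == len(set(ids)),
    "country_complete": all(r["country"] for r in rows),
    "amount_non_negative": all(r["amount"] >= 0 for r in rows),
}

failed = [name for name, ok in checks.items() if not ok]
print(failed)  # [] when the batch passes every check
```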

For ELT processes, always filter first before other processing.
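Filter-first in miniature (field names are made up): dropping rows before aggregation means every later step touches only the rows you care about.

```python
# Hypothetical event stream: 1000 US rows, 10 DE rows.
events = ([{"country": "US", "amount": i} for i in range(1000)]
          + [{"country": "DE", "amount": i} for i in range(10)])

de_only = [e for e in events if e["country"] == "DE"]  # filter first
total = sum(e["amount"] for e in de_only)              # aggregate 10 rows, not 1010
print(len(de_only), total)  # 10 45
```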

Plan your partitions according to expected query patterns. If reports are run country-wise, then country is a good partition key; if date-wise, then date is a good partition key.
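A sketch of the Hive-style partition layout this implies (one directory per key value, so a country-filtered query reads a single directory instead of everything):

```python
from collections import defaultdict

# Hypothetical (country, payload) rows.
rows = [("US", "a"), ("DE", "b"), ("US", "c")]

# Bucket rows under partition paths like country=US/.
partitions = defaultdict(list)
for country, payload in rows:
    partitions[f"country={country}"].append(payload)

print(sorted(partitions))  # ['country=DE', 'country=US']
```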

Be aware of data skew; you might introduce skew through your choice of partitions (e.g. partitioning by country when a few countries have far more records than the others).
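A quick skew check before committing to a partition key, under assumed toy counts: compare the largest partition to the mean partition size.

```python
from collections import Counter

# Hypothetical partition-key values: one country dominates.
keys = ["US"] * 900 + ["DE"] * 50 + ["IN"] * 50

sizes = Counter(keys)
mean_size = sum(sizes.values()) / len(sizes)
skew_ratio = max(sizes.values()) / mean_size  # >> 1 flags a hot partition
print(skew_ratio)
```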

Random thoughts


Agreed. I'm 80% sure CSV is best for his use case. Maybe even JSON Lines files. I think he's working with regular data (a few hundred MB), because he's considering memory-mapping it.

I have a rule that no one is allowed to use the words big, small, fast, or slow. You must quantify.

I've met too many people who think that 100MB is big data or that 1 Gbps is a fast internet connection.


- How much data will you process in your first year (in terabytes)?

- How big is the average data unit?

- How are you going to analyse and process this data? (What kinds of questions will you ask it?)


Delta Lake on Parquet files works very well. BigQuery works well. Snowflake works well.



