Ask HN: Big Data 10x tips and tricks
13 points by hoerzu on Nov 7, 2022 | hide | past | favorite | 5 comments
What are your recommendations for scalable storage of append-only data? What are your favorite frameworks for memory mapping, like Vaex or Polars? What is hot, like DuckDB?


Unpopular tip: 9 times out of 10 you can store your project's data in a simple CSV file and load all of it into memory (with pandas, for example). Don't waste your time building scalable data storage when you don't need it.
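A minimal sketch of that workflow using only the stdlib (with pandas it would be `pd.read_csv(...)` instead); the column names are made up for illustration:

```python
import csv
import io
import statistics

# A few hundred MB of CSV fits comfortably in memory on a laptop.
data = "user_id,amount\n1,10.5\n2,3.25\n3,8.0\n"

rows = list(csv.DictReader(io.StringIO(data)))
mean_amount = statistics.mean(float(r["amount"]) for r in rows)
print(mean_amount)  # 7.25
```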


Counterpoint.

CSV is a great format for humans to comprehend.

For big data systems, CSV is arguably a worse format.

Row-wise storage doesn't provide the compression benefits of columnar storage, which can significantly reduce storage needs.
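A toy illustration of why (not Parquet itself, just the layout idea): a repetitive column compresses far better stored contiguously than interleaved row by row.

```python
import zlib

# Hypothetical data: an incrementing id column and a low-cardinality
# country column, the kind of thing columnar formats compress well.
ids = [str(i) for i in range(30_000)]
countries = ["US", "DE", "IN"] * 10_000

# Row-wise (CSV-like) layout: id,country per line.
row_wise = "\n".join(f"{i},{c}" for i, c in zip(ids, countries)).encode()
# Column-wise layout: all ids first, then all countries.
col_wise = ("\n".join(ids) + "\n" + "\n".join(countries)).encode()

row_size = len(zlib.compress(row_wise))
col_size = len(zlib.compress(col_wise))
print(row_size, col_size)  # the columnar layout compresses smaller
```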

Data can contain the same delimiters (comma, newline) the parser uses, introducing errors into computations. Binary formats (Parquet/ORC) eliminate this issue while providing other benefits as well.
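The failure mode in miniature: a free-text field containing the delimiter. A quoting-aware CSV parser copes, but only if every producer quotes correctly; binary formats carry a schema and have no in-band delimiters at all.

```python
import csv
import io

# A quoted field that contains the delimiter itself.
line = 'acme,"New York, NY",2022'

naive = line.split(",")                        # splits the quoted field
parsed = next(csv.reader(io.StringIO(line)))   # quoting honored: 3 fields
print(len(naive), parsed)  # 4 ['acme', 'New York, NY', '2022']
```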

A data quality framework like deequ should help catch errors in data before you introduce them to downstream applications in ELT (not infallible, though).
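deequ itself is a Scala/Spark library; this is only a minimal Python sketch of the same idea, with made-up field names: assert quality invariants before anything flows downstream.

```python
# Hypothetical batch of rows awaiting load.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 5.5},
    {"id": 3, "country": "IN", "amount": 7.25},
]

ids = [r["id"] for r in rows]
checks = {
    "id_unique": len(ids) == len(set(ids)),
    "country_complete": all(r["country"] for r in rows),
    "amount_non_negative": all(r["amount"] >= 0 for r in rows),
}

failed = [name for name, ok in checks.items() if not ok]
print(failed)  # [] when the batch passes every check
```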

For ELT processes, always filter first before other processing.
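Filter-first in miniature (field names are made up): dropping rows before aggregation means every later step touches only the rows you care about.

```python
# Hypothetical event stream: 1000 US rows, 10 DE rows.
events = ([{"country": "US", "amount": i} for i in range(1000)]
          + [{"country": "DE", "amount": i} for i in range(10)])

de_only = [e for e in events if e["country"] == "DE"]  # filter first
total = sum(e["amount"] for e in de_only)              # aggregate 10 rows, not 1010
print(len(de_only), total)  # 10 45
```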

Plan your partitions according to expected query patterns. If reports are run country-wise, then country is a good partition key; if date-wise, then date is a good partition key.
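A sketch of the Hive-style partition layout this implies (one directory per key value, so a country-filtered query reads a single directory instead of everything):

```python
from collections import defaultdict

# Hypothetical (country, payload) rows.
rows = [("US", "a"), ("DE", "b"), ("US", "c")]

# Bucket rows under partition paths like country=US/.
partitions = defaultdict(list)
for country, payload in rows:
    partitions[f"country={country}"].append(payload)

print(sorted(partitions))  # ['country=DE', 'country=US']
```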

Be aware of data skew; you might introduce skew through your choice of partitions (e.g. partitioning by country when a few countries have far more records than the others).
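A quick skew check before committing to a partition key, under assumed toy counts: compare the largest partition to the mean partition size.

```python
from collections import Counter

# Hypothetical partition-key values: one country dominates.
keys = ["US"] * 900 + ["DE"] * 50 + ["IN"] * 50

sizes = Counter(keys)
mean_size = sum(sizes.values()) / len(sizes)
skew_ratio = max(sizes.values()) / mean_size  # >> 1 flags a hot partition
print(skew_ratio)
```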

Random thoughts


Agreed. I'm 80% sure CSV is best for his use case. Maybe even JSON Lines files. I think he's working with regular data (a few hundred MB), because he's considering memory-mapping it.

I have a rule that no one is allowed to use the words big, small, fast, or slow. You must quantify.

I've met too many people who think that 100MB is big data or that 1 Gbps is a fast internet connection.


- How much data will you process in your first year (in terabytes)?

- How big is the average data unit?

- How are you going to analyse and process this data? (What kinds of questions will you ask it?)


Delta Lake on Parquet files works very well. BigQuery works well. Snowflake works well.



