At 4.8TB one could add a header section with the full code, instructions on how to compile it, etc. That would certainly help to reproduce it, assuming civilizations in 10k years can still decipher today's languages.
It's correct that the number of reinsurers is smaller than that of primary insurers. But the risk borne by reinsurers is less correlated, not more. Any given primary insurer has risk clusters (domestic market, line of business, etc.). If a large catastrophe happens in their domestic market they might go bust, but what are the chances that it happens simultaneously in all markets globally?
Say you're a primary home insurer in the US. If a hurricane hits, you might not have enough capital to rebuild all the homes. A reinsurer which also covers Europe, Asia, LatAm, etc. is less likely to go bankrupt. The reinsurer can cross-subsidize and use the insurance premiums from other regions to pay out the claims from the US market. All that matters is that on average the loss probabilities and severities are estimated correctly.
And this is just using one line of business as an example; reinsurers cover property, casualty, life and health, which adds extra layers of diversification.
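To make the diversification argument concrete, here's a toy Monte Carlo sketch. All numbers (catastrophe probability, loss severity, premiums, capital) are made up, and markets are assumed independent, so treat it as an illustration rather than an actuarial model:

```python
# Toy simulation: probability of ruin for a single-market insurer vs.
# a reinsurer pooling several independent markets. Numbers are invented.
import random

random.seed(0)
N = 100_000          # simulated years
p_cat = 0.05         # assumed probability of a catastrophe per market per year
loss, premium = 10.0, 1.0   # assumed loss severity and premium per market

def ruin_probability(n_markets: int) -> float:
    """Fraction of years where pooled losses exceed pooled premiums plus capital."""
    capital = 2.0 * n_markets   # assume capital scales with the size of the book
    count = 0
    for _ in range(N):
        losses = sum(loss for _ in range(n_markets) if random.random() < p_cat)
        if losses > n_markets * premium + capital:
            count += 1
    return count / N

print(f" 1 market : {ruin_probability(1):.3%} ruin probability")   # ~5%
print(f"10 markets: {ruin_probability(10):.3%} ruin probability")  # ~0.1%
```

With one market, any catastrophe wipes out the insurer; with ten independent markets it takes four simultaneous catastrophes, which is far rarer even though the expected loss per market is identical.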
From what I understood, the article refers to the point that DuckDB doesn't provide its own dataframe API, i.e. a way to express SQL queries through Python classes/functions.
The link you shared shows how DuckDB can run SQL queries on a pandas dataframe (e.g. `duckdb.query("<SQL query>")`). The SQL query in this case is a string. A dataframe API would allow you to write it entirely in Python. An example of this is polars dataframes (`df.select(pl.col("...").alias("...")).filter(pl.col("...") > x)`).
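To make the difference concrete, a minimal sketch (assumes duckdb, pandas, and polars are installed; the column names are made up for illustration):

```python
import duckdb
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"city": ["Berlin", "Paris"], "temp": [18, 23]})

# DuckDB: the query is a plain SQL string, even though it scans the
# pandas DataFrame `pdf` found in local scope.
duckdb.query("SELECT city, temp FROM pdf WHERE temp > 20").to_df()

# Polars: the same query expressed entirely as Python expressions.
pl.from_pandas(pdf).filter(pl.col("temp") > 20).select("city", "temp")
```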
Dataframe APIs benefit from autocompletion, error handling, syntax highlighting, etc., which SQL strings don't. Please let me know if I missed something from the blog post you linked!
I suspect so; OpenAI is subject to the EU AI Act [0]. When they released Advanced Voice Mode it also took some time before it became available in the EU. Not sure why the UK and Switzerland are delayed as well, since they are not in the European Union.
I'm using both Spark and polars; to me the additional appeal of polars is that it's also much faster and easier to set up.
Spark is great if you have large datasets since you can easily scale, as you said. But if the dataset is small-ish (<50 million rows) you hit a lower bound in Spark in terms of how fast the job can run: even if the job is super simple it takes 1-2 minutes. Polars on the other hand is almost instantaneous (<1 second). Doesn't sound like much, but to me it makes a huge difference when iterating on solutions.
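For illustration, a rough sketch of the kind of comparison I mean (sizes and timings are illustrative; the Spark side is omitted since it needs a JVM/session, which is exactly the overhead in question):

```python
# Time a simple aggregation over ~10M rows in polars.
import time
import numpy as np
import polars as pl

n = 10_000_000
df = pl.DataFrame({"key": np.arange(n) % 100, "val": np.arange(n)})

t0 = time.perf_counter()
df.group_by("key").agg(pl.col("val").sum())
print(f"polars aggregation over {n:,} rows: {time.perf_counter() - t0:.2f}s")
# On a laptop this finishes well under a second; an equivalent PySpark job
# typically spends more time just starting the session than polars needs end to end.
```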
Yes, I only found the announcement [1] that the Polars team and NVIDIA engineers are working on a GPU engine, but other than that no concrete examples. GitHub issues also don't provide any hints on the status, only one open item [2] where most comments predate the announcement.
To add to that, an additional benefit is that you can compile and release it as a Python package (PyO3/maturin) or compile it to WASM so it runs in the browser (with JavaScript bindings). This makes the code portable while benefiting from Rust's performance and memory safety.
In short: compatible with existing Spark jobs but executing them much faster. Benchmarks in the README file and docs [1] show improvements of up to 3x, and not even all operations are implemented yet (if an operation is not available in Comet it falls back to Spark), so there is room for further improvement. Across all TPC-H queries the total speedup is currently 1.5x; the docs state that based on DataFusion's standalone performance 2x-4x is a realistic goal [1].
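For a sense of what "compatible with existing Spark jobs" means in practice, here's a minimal sketch of enabling Comet on an unchanged PySpark job. The jar path is hypothetical and the config keys are how I remember them from the Comet README, so double-check them against the docs [1] for your version:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("comet-demo")
    # Load the Comet plugin jar and enable the native execution engine.
    # Path and config keys are assumptions; verify against the Comet docs.
    .config("spark.jars", "/path/to/comet-spark-shaded.jar")
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)

# The job itself is unchanged: operators Comet supports run natively,
# everything else falls back to regular Spark execution.
df = spark.range(10_000_000).selectExpr("id % 100 AS key", "id AS val")
df.groupBy("key").sum("val").show()
```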
Haven't seen any memory consumption benchmarks, but I suspect it's lower than Spark for the same jobs since DataFusion is designed from the ground up to be columnar-first.
For companies spending hundreds of thousands if not millions on compute, this would mean substantial savings with little effort.