Perhaps an additional moral is that if a calculation can be done in an afternoon, it's probably been done before. (I say this because it's been pointed out in the Comments section of my essay that Jon Lu asked and answered the same question back in 2016.)
I'm not sure I understand your question. Given that it takes an infinite amount of time to write down the decimal expansion of pi, what might it mean to compute pi in finite time?
Yeah, sorry, that was pretty vague, and I'm not sure what the "rules" should precisely be. Roughly, I mean something that involves an infinite number of humans, but where the workload per human is finite (perhaps only in expectation?) and each human only has to accept, and pass on to the next one, a finite amount of information. [This in the context of calculating pi via the "stupid method" of having each person choose a random integer, and then using the fact that the probability of two random integers being coprime is 6/pi^2.]
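For what it's worth, the coprimality fact is easy to play with on one computer, even if that doesn't resolve the infinite-humans question. A minimal Monte Carlo sketch (there's no uniform distribution on all integers, so the 10**6 range here is an arbitrary stand-in):

```python
import math
import random

def estimate_pi(trials: int, max_int: int = 10**6) -> float:
    """Estimate pi from the fact that two 'random' integers
    are coprime with probability 6/pi^2."""
    coprime = sum(
        math.gcd(random.randint(1, max_int), random.randint(1, max_int)) == 1
        for _ in range(trials)
    )
    return math.sqrt(6 * trials / coprime)  # invert p = 6/pi^2

print(estimate_pi(100_000))  # prints an estimate near 3.14
```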
My first thought does not achieve the task: expand pi/4 as a sum of positive rational numbers, then have each human use a couple of fair coins and their own Bernoulli factory to produce a coin that comes up heads with probability given by their assigned rational number. The n-th human is told the partial sum up to that point, flips their Bernoulli-factory coin, and either terminates the protocol on heads or else adds their term to the partial sum they received and passes it on. The problem is that the information content of the partial sums grows with n.
Give the humans an order 1,2,3,…, and let a referee read them in that order.
Person k tosses one fair coin and reports H/T.
The referee stops at the first time the reported heads exceed the reported tails.
If N people were consulted, choose one of those N uniformly at random. Output heads iff that chosen person’s coin was heads.
For a one-pass version: instead of storing the whole consulted prefix, the referee can keep a single "currently marked" consulted person, and when the k-th consulted coin arrives, replace the mark by that new person with probability 1/k. When the process stops, the marked person is uniform among the consulted ones.
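If I'm reading the protocol right, the marked coin comes up heads with probability (N+1)/(2N) given that N people were consulted, and averaging over the first-passage time N appears to give exactly pi/4. A quick simulation sketch (the stopping time has infinite mean, so the code truncates long walks at an arbitrary cap and discards those runs):

```python
import random

def protocol_once(rng, cap=10_000):
    """One run of the referee protocol: flip fair coins until heads
    first exceed tails, keeping one uniformly 'marked' coin via
    reservoir sampling (replace the mark with probability 1/k).
    Returns the marked coin, or None if the walk hasn't stopped
    within `cap` flips (the stopping time has infinite mean, so a
    simulation needs a cutoff)."""
    lead = 0                               # heads minus tails so far
    marked = None
    for k in range(1, cap + 1):
        coin = rng.random() < 0.5          # True = heads
        lead += 1 if coin else -1
        if rng.randrange(k) == 0:          # reservoir sampling step
            marked = coin
        if lead > 0:                       # heads first exceed tails: stop
            return marked
    return None

rng = random.Random(1)
results = [protocol_once(rng) for _ in range(5_000)]
done = [r for r in results if r is not None]
print(sum(done) / len(done))  # empirically close to pi/4 ≈ 0.785
```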
One thing in the article that strikes me as very strange is this:
"I suppose that, on a practical level, a take-home for the practicing mathematician is that if you use ChatGPT, don’t trust it to generate valid proofs, and even when it finds a valid proof, don’t be so sure it’s a good proof. And whatever you do, don’t have ChatGPT create a bibliography for you."
In isolation, that's all very good advice. But how is it a take-home from what goes before? You used ChatGPT, it did generate a valid proof, it was "a solid by-the-book argument that employed a method I've used myself", which may or may not imply "a good proof" but suggests it was at least OK, and nothing in the story you told involves ChatGPT generating bibliographies at all.
Thanks for catching this! What was in my mind when I wrote the first of those two sentences was the entirety of my past experience using ChatGPT as a research tool, and all the times it made mistakes. (For a funny example, see my January 17, 2023 essay “Denominators and Doppelgängers” in which I describe ChatGPT's proof that .999... is less than 1.) But you're 100% right that the version of my essay that I posted last week didn't make this clear; I'll update it appropriately. I'll also add a brief description of ChatGPT's bibliographic blunder.
I think 4x4 matrices for 3D transforms (especially homogeneous coordinates) are very elegant.
I think the intended critique is that the huge n*m matrices used in ML are not elegant - but the point is made poorly by pointing out properties of general matrices.
In ML, matrices are just "data", or "weights". There are no interesting properties to these matrices; in a way, they're a von Neumann elephant (https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant).
Now, this might just be what is needed for ML to work and deal with messy real-world data! But mathematically it is not elegant.
The inelegance to me isn't in the definition of the operation, but that it's doing a huge amount of brute-force work to mix every part of the input with every other part, when the answer really only depends on a tiny fraction of the input. If we somehow "just knew" what parts to look at, we could get the answer much more efficiently.
Of course that doesn't really make any sense at the matrix level. And (from what I understand) techniques like MoE move in that direction. So the criticism doesn't really make sense anymore, except in that brains are still much much more efficient than LLMs so we know that we could do better.
I think you're right that the inelegant part is how AI seems to just consist of endless loops of multiplication. I say this as a graphics programmer who realized years ago that all those beautiful images were just lots of MxNs, and AI takes this to a whole new level. When I was in college they told us most of computing resources were used doing Linear Programming. I wonder when that crossed over to graphics or AI (or some networking operation like SSL)?
> When I was in college they told us most of computing resources were used doing Linear Programming.
I seriously doubt that was ever true, except perhaps for a very brief time in the 1950s or 60s.
Linear programming is an incredibly niche application of computing used so infrequently that I've never seen it utilised anywhere despite being a consultant that has visited hundreds of varied customers including big business.
It's like Wolfram Mathematica. I learned to use it in University, I became proficient at it, and I've used it about once a decade "in industry" because most jobs are targeted at the median worker. The median worker is practically speaking innumerate, unable to read a graph, understand a curve fit, or if they do, their knowledge won't extend to confidence intervals or non-linear fits such as log-log graphs.
Teachers who are exposed to the same curriculum year after year, seeing the same topics over and over, assume that industry must be the same as their lived experience. I've lost count of the number of papers I've seen about Voronoi diagrams or Delaunay triangulations, neither of which I've ever seen applied anywhere outside of a tertiary education setting. I mean, seriously, who uses this stuff!?
In the networking course in my computer science degree I had to use matrix exponentiation to calculate the maximum throughput of an arbitrary network topology. If I were to even suggest something like this at any customer, even those spending millions on their core network infrastructure, I would be either laughed at openly, or their staff would gape at me in wide-eyed horror and back away slowly.
And astronomy tends to throw up technology that becomes widely used (WiFi being the obvious example) or becomes of "interest" to governments. I expect that AMR code will be used/ported to nuclear simulations if it proves to be useful. Do I expect it to be used in a CRUD app? Obviously not, but use by most software shops isn't a measure of importance.
I have not only used linear programming in industry, I have also had to write my own solver because the existing ones (even commercial ones) were too slow. (This was possible only because I cared only about a very approximate solution.)
The triangulations you mention are important in the current group I'm working in.
I'm curious to hear what you specifically use these algorithms for!
PS: My point is not that these things are never used, they clearly are, I'm saying that the majority of CPU cycles globally goes towards "idle", then pushing pixels around with simple bitblt-like algorithms for 2D graphics, then whatever it is that browsers do on the inside, then operating system internals, and then specialised and more interesting algorithms like Linear Programming are a vanishingly small slice of whatever is left of that pie chart.
Part of the reason linear programming doesn't get used as often is that there is no industry-standard software implementation that isn't atrociously priced. Same deal with Mathematica.
Linear transformations are a beautiful thing, but matrices are an ugly representation that nevertheless is a convenient one when we actually want to compute.
Elegant territory. Inelegant, brute-force, number crunching map.
If the O(n^3) schoolbook multiplication were the best that could be done, then I'd totally agree that "it's simply the nature of matrices to have a bulky multiplication process". Yet there's a whole series of algorithms (from the Strassen algorithm onward) that use ever-more-clever ways to recursively batch things up and decrease the asymptotic complexity, most of which aren't remotely practical. And for all I know, it could go on forever down to O(n^(2+ε)). Overall, I hate not being able to get a straight answer for "how hard is it, really".
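For readers who haven't seen it, here's a minimal sketch of the Strassen recursion (the standard seven-product scheme, restricted to power-of-two sizes, with a fallback to ordinary multiplication below a leaf size, since the recursion only pays off asymptotically):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen multiplication for n x n matrices, n a power of two.
    Seven recursive products instead of eight gives O(n^log2(7)),
    roughly O(n^2.807). Falls back to ordinary multiplication below
    `leaf`, where the bookkeeping would outweigh the savings."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products:
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Reassemble the four quadrants of the product:
    C = np.empty((n, n), dtype=M1.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
print(np.allclose(strassen(A, B), A @ B))  # True
```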
Maybe the problem is that matrices are too general.
You can have very beautiful algorithms when you assume the matrices involved have a certain structure. You can even have that A*B == B*A, if A and B have a certain structure.
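A concrete instance, as a sketch: circulant matrices, in which every row is a cyclic shift of the previous one, are all polynomials in the same cyclic-shift matrix (equivalently, all diagonalized by the DFT), so any two of them commute:

```python
import numpy as np

def circulant(first_row):
    """Build a circulant matrix: each row is the previous one
    rotated one position to the right."""
    c = np.asarray(first_row)
    return np.array([np.roll(c, k) for k in range(len(c))])

A = circulant([1, 2, 3, 4])
B = circulant([5, 0, -1, 2])
print(np.allclose(A @ B, B @ A))  # True: circulant matrices commute
```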
I know linear algebra, but this part seems profoundly unclear. What does "send" mean? Following with different examples in 2 by 2 notation only makes it worse. It seems like you're changing referents constantly.
In US schools during K-12, we generally learn functions in two ways:
1. A 2-D line chart with an x-axis and a y-axis, like temperature over time, the history of a stock price, etc. The classic setup puts the independent variable on the horizontal axis and the dependent variable on the vertical axis. Even people who have forgotten almost all math can instantly understand the graphics displayed when they're watching CNBC or a TV weather report.
2. We also think of functions like little machines that do things for us. E.g., y = f(x) means that f() is like a black box. We give the black box input 'x'; then the black box f() returns output 'y'. (Obviously very relevant to the life of programmers.)
But one of 3Blue1Brown's excellent videos finally showed me at least a few more ways of thinking of functions. This is where a function acts as a map from one "thing" to another (technically, from a Domain X to a Co-domain Y).
So if we think of NVIDIA stock price over time (Interpretation 1) as a graph, it's not just a picture that goes up and to the right. It's mapping each point in time on the x-axis to a price on the y-axis, sure! Let's use the example, x=November 21, 2025 maps to y=$178/share. Of course, interpretation 2 might say that the black box of the function takes in "November 21, 2025" as input and returns "$178" as output.
But what I call Interpretation 3 says that it maps from the domain of Time to the output co-domain of NVDA Stock Price.
3. This is a 1D-to-1D mapping, i.e., both x and y are scalar values. In the language that jamespropp used, we "send" the value "November 21, 2025" to the value "$178".
But we need not restrict ourselves to a 1-dimensional input domain (time) and a 1-dimensional output domain (price).
We could map from a 2-d Domain X to another 2-d Co-Domain Y. For example X could be 2-d geographical coordinates. And Y could be 2-d wind vector.
So we would feed in a location, say (5,4), as input, and our 2D-to-2D function would output the wind vector (North by 2mph, East by 7mph).
So we are "sending" input (5,4) in the first 2d plane to output (+2,+7) in the second 2d plane.
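As a toy sketch of that idea in code (the formula below is invented purely so that the point (5, 4) gets sent to the vector (2, 7) as in the example; real wind data would of course come from measurements, not a linear formula):

```python
def wind(x: float, y: float) -> tuple[float, float]:
    """A toy 2D-to-2D map: send a location (x, y) in the plane to a
    wind vector (north_mph, east_mph). The coefficients are made up
    so that (5, 4) maps to (2, 7)."""
    return (x - 0.75 * y, x + 0.5 * y)

print(wind(5, 4))  # (2.0, 7.0): location (5, 4) is "sent" to wind (+2, +7)
```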