The idea of an asymmetric Chess starting position is very interesting, although it does introduce more risk of one side starting with a big advantage (perhaps this has been analyzed).
I also like that in this variant, castling works like normal -- that is one of the most unintuitive aspects of Chess960.
Yes, that appears to be another good solution to the castling trickiness! And probably how you assume castling works in Chess960 if you weren't given the rules.
One option is bidding with points: player A looks at the position and bids X < 0.5 points for the privilege of picking the color. Player B either accepts or raises the bid; the process repeats until one of them accepts or bids 0.5 points. The match is then played for the remaining 1 - X points.
Or, more simply, player A is shown N random positions, picks one of them and lets player B pick the color.
I don't think that bidding system really works. If one side is strongly favored in the opening, the optimal bid would be essentially 0.4999999999... so that you can pick the color and win the game by a slim margin. Players then raise the bid in ever-tinier steps ad infinitum.
The other idea works but is essentially just discarding all of the lopsided starting positions, in which case they might as well not be in the game.
I have never played this variant of chess, but on the surface it seems that having both bishops on the same color would be a sizeable disadvantage.
The other randomized pieces (queens and knights) can get to any square, so having two knights start on dark squares, for instance, doesn't seem to really matter.
Who chooses the "correct" use of words? Is it you? Wikipedia disagrees with you: https://en.wikipedia.org/wiki/Gross_margin. Maybe you should make your own encyclopedia.
The answer will likely match what the reasoning steps bring it to, but that doesn’t mean the computations the LLM performs to get that answer are necessarily approximated by the outputted reasoning steps. E.g. you might have an LLM that is trained on many examples of Shakespearean text. If you ask it who the author of a given text is, it might give some more detailed rationale for why it is Shakespeare, when the real answer is “I have a large prior for Shakespeare”.
Just want to note that this simple “mimicry” of mistakes seen in the training text can be mitigated to some degree by reinforcement learning (e.g. RLHF), such that the LLM is tuned toward giving responses that are “good” (helpful, honest, harmless, etc…) according to some reward function.
This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on-the-fly.
Next step would be to propose hardcoding 99930-3999 as the O(1) result and live with the output just being wrong sometimes. The bug rate is then in the ballpark of most modern software, including LLMs', so I'd say ship it.
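In code, the tongue-in-cheek O(1) version (for the article's 1..100,000 range; the function name is mine) would be something like:

```python
def digit_sum(n):
    return sum(map(int, str(n)))

def spread_hardcoded(_samples=None):
    # Hardcode the extremes: 3999 and 99930 are the smallest and largest
    # numbers in 1..100,000 whose digits sum to 30. Wrong whenever the
    # million draws happen to miss either extreme -- which is rare.
    return 99930 - 3999  # 95931
```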
> There is no record or credible report indicating that Jeff Baena has passed away. As of the most recent information available, he is still alive. My training data includes information up to October 2023. Events or details that emerged after that date may not be reflected in my responses.
I tried it in OpenAI's O1. If I give it minimaxir's original prompt it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".
However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
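For what it's worth, that figure is consistent with a simple calculation, assuming the article's 1..100,000 range: the chance that any one particular value (say, the extreme 3999) never shows up among a million uniform draws is

```python
import math

# Probability that one specific value is absent from 1,000,000 uniform
# draws over 1..100,000 (the range used in the original article):
p_miss = (1 - 1 / 100_000) ** 1_000_000
# ~4.54e-5, essentially exp(-10) -- close to the 4.5e-5 quoted above
```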
The information on the creative step which you provided to o1 was also the key step, and it contained almost all the difficulty. The hope is that 2025 models could eventually come up with solutions like this given enough time, but this is also a toy problem. The question is how much clever answers will cost for real-world complex problems. At present it looks like: very much.
All I need to run the algorithm is the proportion of qualifying numbers in the input array and the number of samples. Then we can sample the min and max indices of the qualifying array and return their difference, without having to sample many times, if we can derive the joint min/max distribution conditional on the Bernoulli count.
In other words, the procedure can take any input array and qualifying criteria.
The joint distribution is relatively simple to derive. (This is related to the fact that the min and max of continuous uniforms on [0, 1] are Beta-distributed.)
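A stdlib-only sketch of that scheme (all names are illustrative; the binomial count is replaced by a normal approximation, and the order statistics are mapped onto the sorted qualifying set):

```python
import math
import random

def digit_sum(n):
    return sum(map(int, str(n)))

def sampled_spread(n_draws=1_000_000, lo=1, hi=100_000, target=30, seed=None):
    # Sample only the statistics we need instead of materializing the array.
    rnd = random.Random(seed)
    qualifying = [n for n in range(lo, hi + 1) if digit_sum(n) == target]
    p = len(qualifying) / (hi - lo + 1)
    # The number of qualifying draws is Binomial(n_draws, p); a normal
    # approximation keeps this stdlib-only.
    k = round(rnd.gauss(n_draws * p, math.sqrt(n_draws * p * (1 - p))))
    if k <= 0:
        return None        # no qualifying number was drawn
    if k == 1:
        return 0           # min and max coincide
    # The max of k iid Uniform(0,1) draws is Beta(k, 1); conditional on
    # that max m, the min of the remaining k - 1 draws is m * Beta(1, k - 1).
    u_max = rnd.betavariate(k, 1)
    u_min = u_max * rnd.betavariate(1, k - 1)
    i_max = min(int(u_max * len(qualifying)), len(qualifying) - 1)
    i_min = min(int(u_min * len(qualifying)), len(qualifying) - 1)
    return qualifying[i_max] - qualifying[i_min]
```

As the replies below point out, this only simulates the distribution of the answer; it never sees an actual input array.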
Sampling doesn't give you the actual answer for an actual array. If the program uses the array for multiple things, such as organizing the numbers after allocating the correct number of buckets, your method will cause logic errors and crashes.
The O(1) method based on statistics only works when the function making this calculation can hide the array (or lack of array) behind a curtain the entire time. If it has to take an array as input, or share its array as output, the facade crumbles.
The prompt is not "generate this many random numbers and then say max qualifying minus min qualifying". If it was, your method would give valid solutions. But the prompt starts with "Given a list".
In the article, we let ChatGPT generate the random numbers as a matter of convenience. But the timing results are only valid as long as it keeps that part intact and isolated. We have to be able to swap it out for any other source of random numbers. If it invents a method that can't do that, it has failed.
It still depends on how you read the problem. In many of the LLM solutions the array is not passed into the solving function but rather constructed inside it (instead of defining the function with an input and then writing a main function that builds an array and calls the solver with it as an argument, as is typical in Python). So either the LLM did not read it that way, or it also failed this aspect of the code (which was never really mentioned). It is not clear whether we are given a specific array of integers, or whether the input is an array of random variables that we need to instantiate ourselves.
This gets to the old saw, "knowing what question to ask is the most important thing". To the extent that LLMs can answer questions better than formulate which ones to ask, they may be inherently limited. We will see.
But it does seem they are good (to the extent that they are good at anything) at identifying the questions first if you ask them. It does mean you need an ok enough meta-question to start the chain of the reasoning, but that is the key insight of the recent wave of "reasoning models." First ask the LLM to reformulate the problem and structure an approach, or multiple approaches on how to address it, then have a second pass do just that.
Google search with fewer steps? Still a huge advancement, of course.
I wonder how much benefit there would be in a meta-language for describing these problems precisely for the LLMs to process into code -- an even-higher-level language. Perhaps we could call it English?
Excellent point. The hope is reasoning LLMs will make a difference for such problems. But it's also a great example of why those who think being able to have the LLM iterate more will be crucial to reasoning are off base. There are many computations that transformers (or humans, for that matter) are not well equipped to represent internally; tool use during the reasoning process is unavoidable for all but artificial or knowledge-heavy problems.
Small examples, throwaway but involved calculations, prototypes, notes of what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.
You guys are picking on the problem statement. Here's a revised prompt, which also skips the silliness of single threading:
Write __fully parallelized__ Python code to solve this problem: __Generate__ 1 million random integers between 1 and 10,000,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
Are you sure you don't mean "The digits of which sum to 30"?
There being one legal way to say something isn't evidence that other ways are illegal. It remains the case that whose bears the same relation to which that it does to who.
But what's interesting about this is that there's a tradeoff in the total computation performed by the "fully parallelized" version of this and a sequential one. Without the user knowing this, it's kind of impossible to get the optimization you want: Do you want a minimum work solution or a minimum wall-clock-time solution?
If you want a better fully parallelized one, you do this:
Repeat a few times in exponential progression on k:
Process, in parallel, the first k entries in the list (let's start with 1000). Find the min and max whose digit sums = 30.
In parallel, filter the remaining list to eliminate entries that would not improve upon the min/max thus found.
k *= 10 and repeat until done.
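The steps above can be sketched as follows (names are mine; threads are used for brevity, though digit sums are CPU-bound in Python, so a real version would want a process pool):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def digit_sum(n):
    return sum(map(int, str(n)))

def scan(chunk, target):
    # Local min/max of the qualifying entries in one chunk, or None.
    hits = [n for n in chunk if digit_sum(n) == target]
    return (min(hits), max(hits)) if hits else None

def spread_pruned(nums, target=30, k0=1000, workers=4):
    best_min = best_max = None
    k, rest = k0, list(nums)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while rest:
            head, rest = rest[:k], rest[k:]
            # Scan the first k entries in parallel chunks.
            step = max(1, len(head) // workers + 1)
            slices = [head[i:i + step] for i in range(0, len(head), step)]
            for res in pool.map(partial(scan, target=target), slices):
                if res is not None:
                    lo, hi = res
                    best_min = lo if best_min is None else min(best_min, lo)
                    best_max = hi if best_max is None else max(best_max, hi)
            if best_min is not None:
                # Drop entries that cannot improve on either extreme.
                rest = [n for n in rest if n < best_min or n > best_max]
            k *= 10
    return None if best_min is None else best_max - best_min
```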
I would wager against the LLM identifying this solution without prompting from the user (or reading this comment).
It would be very surprising if they did not scrape this site. The content is very high-quality in the general case and there are no giant barriers to entry (there is even a clean API!). One might even use us to fine-tune a coding assistant or the like.
Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.
I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.
I suspect that it would be unlikely to come up with it because it requires execution of a fairly lengthy algorithm (or sophisticated mathematical reasoning) to find the smallest/largest valid numbers in the range. You can verify this for yourself with the following ChatGPT prompt: "What is the smallest number in the range (1, 100000) whose digits sum to 30? Do not execute separate code."
Because otherwise we are talking about LLMs augmented with external tools (i.e. Python interpreters). My original comment was pointing to the limitations of LLMs in writing code by themselves.
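For reference, the separate computation in question can be done greedily rather than by scanning the whole range (illustrative helpers, not from the article):

```python
def smallest_with_digit_sum(s):
    # Pack 9s into the least-significant digits so the leading digit,
    # and hence the number, is as small as possible.
    digits = []
    while s > 9:
        digits.append(9)
        s -= 9
    digits.append(s)
    return int("".join(map(str, reversed(digits))))

def largest_with_digit_sum(s, ndigits=5):
    # Pack 9s into the most-significant digits; None if s cannot fit.
    out = []
    for _ in range(ndigits):
        d = min(9, s)
        out.append(d)
        s -= d
    return int("".join(map(str, out))) if s == 0 else None
```

For the 1..100,000 version, `smallest_with_digit_sum(30)` gives 3999 and `largest_with_digit_sum(30, 5)` gives 99930 -- exactly the side computation an LLM would need to emit those constants without a tool.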
You wouldn't ask a programmer to solve a problem and then not let them write down the source or debug the program as they write it, would you?
Are you asking it to not write down an algorithm that is general? They are doing a pretty good job on mathematical proofs.
I still don't understand why you wouldn't let it use its full reasoning abilities by letting it write down code, or even invoke another agent. We should be testing the result, not the methods.
I'm simply pointing out the limitations of LLMs as code writers. Hybrid systems like ChatGPT-o1 that augment LLMs with tools like Python interpreters certainly have the potential to improve their performance. I am in full agreement!
It is worth noting that even ChatGPT-o1 doesn't seem capable of finding this code optimization, despite having access to a Python interpreter.
But programmers are LLMs augmented with the ability to run code. It seems odd to add a restriction when testing if an LLM is "as good as" a programmer, because if the LLM knows what it would need to do with the external code, that's just as good.
I've observed this with ChatGPT. It seems to be trained to minimize changes to code earlier in the conversation history. This is helpful in many cases since it's easier to track what it's changed. The downside is that it tends to never overhaul the approach when necessary.
Yeah, it does demonstrate that you can use compression to measure similarity of two images.
But it also demonstrates that it's a pretty poor similarity measure. Something as simple as counting % of matches between the black and white pixels performs much better.
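Both measures are easy to sketch for black/white bitmaps serialized as bytes (function names and the zlib choice are mine; the compression one is essentially normalized compression distance):

```python
import zlib

def compression_similarity(a: bytes, b: bytes) -> float:
    # Normalized compression distance: near 0 means "compresses like a
    # copy", near 1 means the pair shares little structure.
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

def pixel_match(a: bytes, b: bytes) -> float:
    # The simple baseline: fraction of positions where two equal-length
    # bitmaps agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)
```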
Yeah, a lot of folks don't seem to be aware of the people behind this, and their history. Swift was originally closed source too. He's explained that this sort of "incubation period" is helpful for working out the kinks with a small core team.
Okay, I looked up the team and agree they are brilliant, but I still don't know if I trust them to keep this open in the long term. Why can't they open it on GitHub but just not accept PRs?