
Could you define what averaging and mode projection are to you in this context? I think I can guess but am not entirely sure if I am understanding where you're coming from well enough.

The above responses are not too surprising to me -- first, fine-tuning on a limited dataset drastically trades variance for bias. We're looking for a very particular tone here, so biasing towards the less spurious modes is likely a good default for an 'instruction-following' mode.
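To make that variance-for-bias exchange concrete, here's a tiny toy sketch of the estimator math (nothing to do with LLMs directly, and all the numbers are made up): shrinking an unbiased estimate toward a fixed guess adds bias but cuts variance, and can lower the overall error.

```python
import numpy as np

# Toy bias-variance trade: estimate a mean from n samples, either with
# the raw sample mean (unbiased) or shrunk toward 0 (biased, but with
# lower variance). Numbers are purely illustrative.
rng = np.random.default_rng(0)
true_mean, sigma, n, trials = 2.0, 3.0, 10, 50_000

samples = rng.normal(true_mean, sigma, size=(trials, n))
raw = samples.mean(axis=1)      # unbiased estimator
shrunk = 0.8 * raw              # biased toward 0, lower variance

for name, est in [("raw", raw), ("shrunk", shrunk)]:
    bias_sq = (est.mean() - true_mean) ** 2
    var = est.var()
    print(f"{name:6s} bias^2={bias_sq:.3f}  var={var:.3f}  mse={bias_sq + var:.3f}")
```

The shrunk estimator ends up with lower mean-squared error despite its bias; fine-tuning a model on a narrow dataset makes a qualitatively similar exchange.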

After all, that very strong bias is what lets us give it instructions and stay in a back-and-forth conversation, instead of devolving into something like movie/book dialogue or the middle of a news article when we ask about people's names.

I could be entirely wrong about these assumptions, of course -- I only have my best guesses and (potentially spurious) information to go on.

We do have a tiny dataset for Alpaca -- only about 52k instruction-response pairs, which is great, but that means either a tiny run or something that's not too difficult to overfit to. I'm sure there are good mitigations for that.
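As a toy picture of that overfitting risk (not Alpaca itself -- just a made-up curve-fitting problem): a high-capacity model fit to a handful of points can drive its training error to essentially zero while the held-out error tells a different story.

```python
import numpy as np

# Fit polynomials of increasing capacity to 10 noisy training points.
# The degree-9 fit interpolates the training set almost exactly;
# held-out error is measured on a denser grid of the true function.
rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, size=10)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train_mse={train_mse:.2e}  test_mse={test_mse:.2e}")
```

The degree-9 fit memorizes the 10 points (including their noise), which is exactly the failure mode a 52k-example fine-tune has to guard against.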

I agree that more unexpected responses would be good, but I think it's sort of a mathematical "have your cake and eat it too" that a lot of people want. Maybe this isn't at the Pareto front for compute/data/etc. (almost certainly not, to be honest -- it's all pretty new, after all!), but your examples above do show the bias-variance tradeoff well, and it looks like we got scammed on the exchange rate.

Running the same example on the base model (i.e., LLaMA without the Alpaca fine-tuning) will get you a representative model of names and such from the open internet, one that approaches the underlying distribution as the model size approaches infinity (so far as we know). That's because the base model is, as effectively as possible (I believe -- please correct me if I'm wrong), an unbiased estimator of the underlying distribution. We're just approaching that minimum-variance limit as we minimize the cross-entropy loss.

The cross-entropy loss is necessarily not minimized under biased circumstances: a raw language model on the full corpus of text would be brutally penalized* for simply sampling the main or most likely modes of the distribution, whereas for the sake of instruction following this may not be the case.
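A quick numeric check of that claim, using a hypothetical three-token distribution: the expected cross-entropy E_p[-log q] is minimized exactly when the model q matches the data distribution p, and a model that collapses onto the most likely mode gets hammered.

```python
import numpy as np

# Hypothetical "true" next-token distribution p, and three candidate
# models q. Expected cross-entropy under p is E_p[-log q].
p = np.array([0.6, 0.3, 0.1])

def cross_entropy(p, q):
    return float(-(p * np.log(q)).sum())

candidates = {
    "q = p (matches data)": p,
    "mode-collapsed":       np.array([0.98, 0.01, 0.01]),
    "uniform":              np.array([1, 1, 1]) / 3,
}
for name, q in candidates.items():
    print(f"{name:22s} CE = {cross_entropy(p, q):.3f}")
```

The matched model scores about 0.90 nats (the entropy of p itself); the mode-collapsed one scores about 1.85, despite betting everything on the single most common token -- that's the "brutal penalty" on the tail it assigns near-zero mass to.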

And of course this is just the surface argument about the statistics of the models. Personally, the actually interesting things to me are the concept factorizations that happen inside the (raw next-word-generation) models under cross-entropy -- especially as both the model size and the amount of data grow. Then you start seeing those trends that only occur when the model is able to disentangle the concept data from the raw statistics, which I think is rather straightforward. One can pull out their frequentist slide rule and mark this down to a T if they'd like.

Now, this of course transforms wonderfully in the fine-tuned instruction-following usecase. While indeed hampered by bias (including perhaps the strongest of all: "as an AI model, I cannot blah blah blah blah...."), we see these disentangled concepts take flight as the user asks for varied cross-domain and multi-modal things. To me, it's probably the best test of what's survived the fine-tuning and what hasn't. And we can see in many of these large language models with RLHF and the like -- indeed it has! It can be quite wrong, but lots of people seem to jump on the "it's just statistical hallucinations" take without really asking _why_ it's doing what it's doing. If it were merely a word-chain hallucination, then I don't believe people would be fooled as easily. To me, it's just a standard limits-of-out-of-distribution inference problem, the same one we see with practically every other (non-symbolic) network out there.

That's the cool thing. Neural networks usually don't do this well OOD, and that is special.

One other point on the John Smith / New York example: there's not really a 'good' response to that question other than a distribution of names and places that perfectly matches the real world. I'm not sure what I'd want to see there; I guess it depends on the data it was fine-tuned on. Maybe that is a good test -- that particular dataset is out of my realm of experience, however.
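This is roughly the mode-projection-vs.-distribution-matching distinction in miniature, with a made-up name distribution: greedy (argmax) decoding always returns the single most likely name, while sampling reproduces the spread.

```python
import random

# Hypothetical name frequencies -- not real census data.
freqs = {"John Smith": 0.4, "Maria Garcia": 0.3,
         "Wei Chen": 0.2, "Amara Okafor": 0.1}

def greedy(freqs):
    # "Mode projection": always pick the argmax.
    return max(freqs, key=freqs.get)

def sample(freqs, rng):
    # Match the distribution: sample proportionally to frequency.
    names, weights = zip(*freqs.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print(greedy(freqs))   # always "John Smith"
counts = {name: 0 for name in freqs}
for _ in range(1000):
    counts[sample(freqs, rng)] += 1
print(counts)          # roughly proportional to freqs
```

A heavily biased instruction-tuned model behaves like `greedy` here; the base model, in the limit, behaves like `sample` over the real-world distribution.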

Hopefully this clears up some of the subtleties between the base and RLHF models and the tradeoffs within, at the very least from the perspective I'm coming at it from.

I do really want to emphasize that this is why I'm frustrated that people who are skilled in DL are allowing this messaging to happen without reinforcing the math behind it. There's a second, much more interesting discussion about the actual structural and developmental elements of these networks that is getting displaced by this one, which to me is more surface-level and leads in circles to the same conclusions we've already arrived at for other neural networks, with generally little to show for it in the end. We've really got to get a move on to the core of how the network develops during training, and how information theory specifically shapes these models as they train. That's the most interesting area to me and it's what I've learned the most from focusing on. Let's focus as a field on that instead -- it's cool stuff and has much more of an effective impact in the long run. <3 :'''')

*(In a large-enough batch-size regime, which we do see with the enormous token batch sizes in LLMs.)


