You can hear in multiple dimensions because, as the original article mentions, your brain does some post-processing based on the time delay between a sound entering your left ear and the same sound entering your right. Just because your ears are aligned on a single axis does not mean that you cannot perceive in multiple axes. No head movement is necessary to perceive in more than one dimension.
It also works because a sound registers differently depending on the direction from which it travels through your own head (i.e. your skull itself is an occlusion source): your brain recognizes front vs. back partly by how your own head dulls the sound.
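As a rough illustration of that time-delay cue, here is a minimal back-of-the-envelope sketch; the spherical-head formula and head radius are textbook approximations, not something from the article:

    import math

    def interaural_time_difference(azimuth_deg, head_radius_m=0.0875, c=343.0):
        # Woodworth's spherical-head approximation: ITD = (r / c) * (theta + sin(theta))
        theta = math.radians(azimuth_deg)
        return (head_radius_m / c) * (theta + math.sin(theta))

    # A source 90 degrees to one side reaches the far ear roughly 0.66 ms later.
    print(round(interaural_time_difference(90) * 1000, 2), "ms")

Sub-millisecond differences like that are what the brain is working with.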
I assume the shape of our ears also affects the sound, depending on where it comes from. This converts vertical location into frequency shaping.
And I wonder, why else would we have such oddly shaped ears?
You would be correct. The shaping of the pinna (and the resulting sound reflections) helps us determine the height of sound sources, even if they're right on our midline.
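To make the "elevation becomes frequency shaping" idea concrete, here is a toy sketch of my own (not a real HRTF; the elevation-to-notch mapping is invented purely for illustration): a spectral notch whose position depends on where the source sits vertically.

    import numpy as np
    from scipy.signal import iirnotch, lfilter

    fs = 44100  # sample rate, Hz

    def toy_pinna_filter(signal, elevation_deg):
        # Stand-in for the pinna's direction-dependent filtering: move a
        # spectral notch with elevation (real pinna cues are far richer).
        notch_hz = 6000 + 40 * elevation_deg  # hypothetical mapping
        b, a = iirnotch(notch_hz, Q=10, fs=fs)
        return lfilter(b, a, signal)

    noise = np.random.randn(fs)  # one second of white noise
    at_ear_level = toy_pinna_filter(noise, elevation_deg=0)
    overhead = toy_pinna_filter(noise, elevation_deg=60)
    # The two outputs differ only in where the notch sits; that spectral
    # difference is (very roughly) what the brain reads as height.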
I listened to these so much several years ago (in particular one with somebody shaking a matchbox) that my brain apparently realized, to some extent, that it was being tricked. Ever since then, recordings like this don't work properly. Anything that's supposed to sound like it's coming from in front of me gets flipped around so that it's behind me. It's very strange.
We can hear in true 3D. There are actually a couple of ways in which we can sense the direction of sound. One is loudness, which obviously only works left-to-right: the ear farther from the source simply receives a quieter signal.
Another is the phase difference between the sounds arriving at the two ears, which comes from the fact that sound reaching one ear had a longer path to travel than sound reaching the other. This tells a more complete story about direction than loudness does, and it is very precise.
Thirdly, there is frequency response. Both our head and our ears absorb different frequencies at different angles. Also, there are some reflections from our shoulders and chest.
Taking all of that together, normal people can detect the origin of a sound source to about one degree of precision.
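The phase/time-difference cue is also straightforward to exploit computationally. A minimal sketch (my own illustration, not anyone's production localizer) that estimates the inter-ear delay from a stereo pair by cross-correlation:

    import numpy as np

    def estimate_itd(left, right, fs):
        # Find the lag (in seconds) that best aligns the two channels;
        # a negative value means the left channel leads.
        corr = np.correlate(left, right, mode="full")
        lag_samples = np.argmax(corr) - (len(right) - 1)
        return lag_samples / fs

    # Synthetic check: the right channel is the left channel delayed by 30 samples.
    fs = 44100
    left = np.random.randn(1024)
    right = np.concatenate([np.zeros(30), left])[:1024]
    print(estimate_itd(left, right, fs))  # about -30 / 44100, i.e. -0.00068 s

With only two "microphones" this gives you a left-right angle, which is exactly why the spectral cues mentioned above matter for front/back and up/down.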
But there is more. Close your eyes and have another person talk while turning his head in different directions. You can clearly detect the direction he is talking toward. At the same time, you can roughly sense the size and type of the room you are in. And you can do all of this in really noisy environments (say, a car in traffic with the radio playing and the kids in the back).
Talking about seeing vs. hearing: the response time of the ear is about 10-50 ms, while it usually takes 200-500 ms to make sense of something you see. The human ear has a frequency resolution of about one Hz (below 500 Hz), a dynamic range of about 120 dB, and a frequency range of about 10 octaves. The eye can only detect a very small dynamic range in comparison, and about one octave of wavelengths. But most importantly, the eye can only detect averages over three distinct, fixed frequency ranges (red, green, blue; red and green overlap about 90%). The ear has floating detection windows and uses an arbitrary number of them at any one time. That is, the ear can detect any spectral distribution with great precision, while the eye only detects three distinct spectral windows.
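A quick sanity check on those numbers, assuming the usual 20 Hz-20 kHz hearing range and roughly 380-750 nm of visible light:

    import math

    # Hearing: 20 Hz to 20 kHz is about 10 octaves.
    print(math.log2(20_000 / 20))   # ~9.97

    # Vision: 380-750 nm is roughly one octave of wavelengths.
    print(math.log2(750 / 380))     # ~0.98

    # A 120 dB dynamic range is a factor of 10^12 in power (10^6 in amplitude).
    print(10 ** (120 / 10))         # 1e12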
So in a lot of ways, the ear is physically far more precise than the eye. Of course, signal processing makes all the difference, and the amount of signal processing going on in our auditory and visual brain centers is just staggering. There is nothing in the technological world that even comes close to it.
The eye ends up taking in a lot of information from every point you look at, though, whereas the ear is always listening in every direction at once. Because of the kind of processing in the eye and brain, our three types of cone cells actually end up capturing most of the information encoded in the reflectance spectra of the kinds of objects that naturally occur in the world (most of the variation present in typical objects within the wavelength range we can detect). Today, with synthetic materials and various lighting sources, metamerism is noticeable in special edge cases, but just wandering around it usually doesn't matter too much. After all, a decent percentage of males get along just fine as dichromats, and many don't even realize they're missing anything until adulthood (because even with two sensor types there is a lot of variation to pick up).
I think it’s pretty silly to argue that hearing can pick up more detailed information about the world than vision can: from vision, we can figure out shape, orientation, distance, texture, gloss, material, lighting, direction and speed of extremely tiny motion, etc., and we can do it for fine details of everything we look at, thereby constructing a tremendously detailed model of our environment without needing to directly touch every part of every thing. Anyway, both hearing and vision are extremely sophisticated. The two gather quite dramatically different kinds of information. Neither should be underestimated.
Of course, this is completely true. But note that I was not talking about the amount of information but about the precision of the information. Also note that every cone cell is equivalent to one hair cell in the cochlea, of which there are plenty, too. (Only in the eye they are distributed spatially, while in the cochlea they are distributed by frequency.)
But all of this is really not as important as the signal processing that makes sense of it. There are interesting connections between hearing and seeing. If you are watching TV and someone at your side turns his head toward you in order to speak, you will notice and shift your attention to him. You will think that you saw his head movement out of the corner of your eye. But the truth is, many people wearing hearing aids will not notice in the same situation, for the simple reason that you did not actually see his head movement; you heard it.
There are many more examples where things like this happen. What you perceive is different from what your senses detect. All these intricate combinations of sensory information are the really interesting part.
Another fun thing about hearing: The human ear can detect very low sound pressure levels. Actually, it will detect a displacement of the eardrum of about the diameter of an air molecule. In a way, this is saying that the ear can detect the impact of individual air molecules on the ear drum (not really true, but in the ballpark). That is freaking amazing.
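For a feel of the scale, a back-of-the-envelope estimate (plane-wave assumption at 1 kHz, threshold of hearing at 20 µPa; the exact numbers are my own, so take them as order-of-magnitude only):

    import math

    p = 20e-6      # threshold-of-hearing sound pressure, Pa
    f = 1000.0     # tone frequency, Hz
    rho_c = 413.0  # characteristic acoustic impedance of air, Pa*s/m

    # Plane-wave displacement amplitude: xi = p / (2 * pi * f * rho * c)
    xi = p / (2 * math.pi * f * rho_c)
    print(xi)      # ~8e-12 m, i.e. a few picometers

That lands in picometer territory, small even compared with a single atom, which is what the air-molecule comparison is getting at.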
"a dynamic range of about 120 dB"
A jackhammer at 1 m has a sound pressure level of approximately 100 dB (relative to 20 µPa). Leaves rustling gently in the wind have a sound pressure level of about 10 dB. If our hearing actually had a 120 dB dynamic range, we'd easily be able to hear the rustling of the leaves while a jackhammer is pounding right next to us, not to speak of normal conversation, which is roughly at 50-60 dB.
As an analogy to radio equipment, we might say that human ears have automatic gain control spanning 120 dB (or, as an audio comparison, that the volume can be adjusted over that span), but the range over which we can discern two simultaneous sources (the instantaneous dynamic range) is about 30 dB or less.
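In concrete numbers (standard dB-to-ratio arithmetic, nothing specific to hearing):

    def db_to_pressure_ratio(db):
        # A level difference in dB corresponds to a ratio of sound pressures.
        return 10 ** (db / 20)

    # 100 dB jackhammer vs. 10 dB rustling leaves: a 90 dB gap,
    # i.e. the jackhammer's sound pressure is ~30,000 times larger.
    print(db_to_pressure_ratio(100 - 10))  # ~31623

    # A 30 dB instantaneous window is "only" a ~32x pressure ratio.
    print(db_to_pressure_ratio(30))        # ~31.6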
Also, regarding the frequency resolution of the ear vs. the eye: the bandwidths are on completely different scales. Visible light spans about 380 THz, which means the bandwidth of that signal is roughly 19,250,000,000 times the bandwidth spanned by our ears. You cannot do a simple octave-based comparison, because the amount of information depends not on the number of octaves but only on the bandwidth. You are correct in the sense that the eye uses this information in quite a limited fashion; however, the actual processing is not in the ordinary frequency domain but in the spatial domain (the eye is not just one sensor but craploads of them).
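The arithmetic behind that factor (using a visible span of roughly 385 THz and a 20 kHz audio band; the exact endpoints are approximations):

    visible_span_hz = 385e12  # roughly 400-785 THz of visible light
    audio_span_hz = 20e3      # roughly 20 Hz to 20 kHz of hearing

    print(visible_span_hz / audio_span_hz)  # ~1.9e10, the factor quoted above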
There's something funny going on with front-back perception. Naively it shouldn't be possible, but the external part of the ear acts like a direction-dependent frequency filter. When the brain hears a familiar sound whose spectrum is skewed in just that way, it goes "aha, this comes from the front side". If the spectrum is skewed the other way, it goes "this comes from behind".
There's a lot more to it than that, and it's not just something that occasionally works. The outer-ear notch filtering is also used in vertical localization.
Also: people who depend on behind-the-ear hearing aids do not get the benefit of these reflections, which means that, for example, it's harder for them to distinguish the voice of someone talking to them from background noise or from echoes off the walls.
While our localization abilities vary by axis, we do have them in all three directions due to the combination of several mechanisms, none of which are actually 'left-to-right'.
"We hear in stereo 3-D." This is not true unless we move our heads. We hear in one dimension, left-to-right.
"We hear better than we see." This is an apples-to-monkeys comparison.