Threading sucks. It's a situation where you have some external process (the operating system) deciding when different pieces of work should be woken up, with no way to feed information back (so it just falls back on relatively naive scheduling heuristics).
In practice, most of your threads are in one form of wait loop or another and you've just got polling both inside the threads and with the scheduler.
Have a look at Erlang if you want a better model :) the Erlang "processes" (different from OS processes) can intelligently wake up only when there is work for them to do.
For a language to efficiently use cores, it really needs to include its own scheduling.
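To make the "wake only when there is work" point concrete, here's a minimal Go sketch (Go just because it's compact; Erlang's receive behaves the same way). The goroutine blocked on a channel is parked by the runtime and consumes no CPU; there's no poll loop.

```go
package main

import (
	"fmt"
	"time"
)

// waitForWork blocks until a value arrives. The Go runtime parks the
// goroutine: it burns no CPU and is woken only when the channel has
// data, rather than spinning in a poll loop checking for work.
func waitForWork(work <-chan int) int {
	return <-work
}

func main() {
	work := make(chan int)
	done := make(chan struct{})

	go func() {
		fmt.Println("got", waitForWork(work)) // parked until main sends
		close(done)
	}()

	time.Sleep(10 * time.Millisecond) // the goroutine is asleep, not polling
	work <- 42
	<-done
}
```

The scheduling decision ("this goroutine is runnable again") is made by the language runtime, which knows exactly which channel operation unblocks which goroutine, rather than by a general-purpose OS scheduler guessing.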
To pick some nits and generally elaborate, Erlang's VM also has a scheduler; it's not the presence or lack of a scheduler, it's how efficient it is and what guarantees it allows the programmer to make about their system. For example, OS schedulers are typically pre-emptive, which means your OS thread can get interrupted anywhere. On the other hand, Go's scheduler (I'm using Go because I'm more familiar with it than with Erlang) only allows context switching at well-defined points in your program. Further, operating system threads have more overhead than goroutines (and presumably also Erlang processes) because they have a fixed stack size (yes, I know this isn't true for all OSes).
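A small sketch of what that lightness buys you in Go: spawning far more goroutines than OS threads is routine, because each goroutine starts with a small, growable stack and the runtime multiplexes them onto GOMAXPROCS threads, switching at defined points such as channel operations.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// fanOut spawns n goroutines and sums their results. The Go runtime
// multiplexes all of them onto a handful of OS threads; each goroutine
// needs only a small, growable stack instead of a fixed OS-sized one.
func fanOut(n int) int {
	results := make(chan int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			results <- v // a scheduling point: the runtime may switch here
		}(i)
	}
	wg.Wait()
	close(results)

	sum := 0
	for v := range results {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println("OS threads:", runtime.GOMAXPROCS(0))
	fmt.Println("sum:", fanOut(10000)) // 0+1+...+9999 = 49995000
}
```

(Caveat: since Go 1.14 the runtime can also preempt goroutines asynchronously, so "only at well-defined points" is no longer strictly true for modern Go, but the overhead argument still holds.)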
>only allows context switching at well-defined points in your program.
Erlang works the same way. The VM scheduler will only context switch on a function call. for and while loops don't exist in Erlang, which means there is no risk of blocking the scheduler.
You can have userspace threads (or even hybrid m:n threading). They went out of fashion in the last decade for many reasons but were fairly common in the past.
The idea is that cooperative userspace task switching can be faster than kernel space task switching (a handful of cycles vs thousands). So moving the scheduler into userspace seems a natural evolution. But now you lose the ability to run on multiple CPUs, as from the kernel's point of view the application is a single thread. The next step is to run n userspace schedulers, one for each hardware CPU, each scheduler running a number of userspace threads (thus m:n). Effectively this creates a two-level scheduler (one in the kernel, one in userspace), and some OSs have custom APIs (look for scheduler activations) that allow the two schedulers to cooperate, allowing full preemption of user threads in all circumstances.
The reason that m:n threading went out of style is that, for CPU bound tasks (where you want to run exactly as many threads as there are CPUs), it is just useless overhead, while for the hundreds of thousands of IO bound threads scenarios, the cost of stack switching is dominated by cache misses anyway, and the cost of calling into the kernel is amortized by the fact that IO requires a call into elevated privileges anyway. At the same time, kernel thread scheduling has become very fast, and userspace threads, which require a whole stack of their own, are not significantly more lightweight than kernel threads.
The modern async model is a compromise. On one side, the 'threads' consist of a single stack frame and are very lightweight; on the other side, there is no generic userspace scheduler, but scheduling is fully controlled by the application.
You're right, in theory, if your select() or epoll_wait() has no timeout and the threads aren't doing any active work. But if you're optimising for that state, then you don't care about inefficiencies of one model or another.
There are also costs in setting and checking those locks etc. Sure, you can build solutions to optimise a broken model, which we've done over decades (with quite a bit of success!), but it doesn't make the model less broken.