
That sounds suspiciously like something that could have easily been fixed in NumPy.


What he refers to is that you have to ask for them explicitly:

    import numpy as np
    arr = np.array([1, 3, 7, 5, 4, 3, 1, 0.1])
    maxval = arr.max()  # 1st pass
    minval = arr.min()  # 2nd pass
Whereas with numba you'd have something like this:

    from numba import jit
    @jit
    def maxmin(arr):
        maxval = minval = arr[0]
        for e in arr:
            if e < minval: minval = e
            if e > maxval: maxval = e
        return minval, maxval
And that will get optimized to NumPy-like speeds, but with a single pass over the data. So for large arrays you'll get about a 2x speedup, since memory access is the bottleneck.
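For completeness, here's that loop as a self-contained sketch; the try/except fallback to plain Python is my addition (it just lets the snippet run when numba isn't installed):

```python
import numpy as np

try:
    from numba import jit  # compiled path when numba is available
except ImportError:
    jit = lambda f: f      # fallback: run as plain Python (my addition)

@jit
def maxmin(arr):
    # single pass: track both extremes while walking the array once
    maxval = minval = arr[0]
    for e in arr:
        if e < minval: minval = e
        if e > maxval: maxval = e
    return minval, maxval

arr = np.array([1.0, 3.0, 7.0, 5.0, 4.0, 3.0, 1.0, 0.1])
minval, maxval = maxmin(arr)
print(minval, maxval)  # 0.1 7.0
```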

As for optimizing this use case for NumPy, I'd go for a cythonized maxmin() function, which is pretty much what numba does, except you're moving the compilation overhead from the JIT into the module's build step.


This is what I assumed the message meant. It doesn't change my point: if this is truly a bottleneck, it should be baked into NumPy. In particular, asking the array to calculate min/max/p50/p75/... in one pass is something that would make sense.
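As a sketch of what that could look like (the function name and its return format are my invention, not an actual or proposed NumPy API):

```python
import numpy as np

def one_pass_stats(arr):
    """Hypothetical helper: min, max and mean in a single traversal.

    Exact percentiles (p50, p75, ...) are left out because they need a
    sort or selection step, not just one pass; streaming approximations
    exist but are beyond this sketch.
    """
    it = iter(arr)
    first = next(it)
    minval = maxval = total = first
    n = 1
    for e in it:
        if e < minval: minval = e
        if e > maxval: maxval = e
        total += e
        n += 1
    return {"min": float(minval), "max": float(maxval), "mean": float(total) / n}

print(one_pass_stats(np.array([1.0, 3.0, 7.0, 5.0])))  # {'min': 1.0, 'max': 7.0, 'mean': 4.0}
```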

Regarding moving the compilation, yeah, I get that. But I'd argue the module's compile step happens once per module, no? The JIT is something you force onto every execution. Right?

And none of this actually means this library shouldn't have been made. Just that it is a poor example for why it is better.


For some reason, I can't reply to nerdponx, so here it goes:

I'd add some "intelligence" (if you will) to the class, so that when I calculate max(), I also track the min, average, etc., save those values in an internal cache, and return the max. The next function call (whether it's max, min, avg, or any other cached value) gets its result from the precomputed cache.

Of course, some operations will affect those results, and there are two ways to handle that: either modify the cached results, if we're talking about a change that can be formulated (say, multiplying by 2 multiplies every cached value by 2), or simply invalidate the cache and recompute the next time one of these functions is called.
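A minimal sketch of that cache-and-invalidate idea; the class name, the set of cached stats, and the scale() example are all mine, not from any real library:

```python
import numpy as np

class CachedStatsArray:
    """Wraps an array; computing one stat caches them all, mutation invalidates."""

    def __init__(self, data):
        self._data = np.asarray(data, dtype=float)
        self._cache = {}

    def _get(self, key):
        if not self._cache:
            # one sweep fills the whole cache
            d = self._data
            self._cache = {"min": float(d.min()),
                           "max": float(d.max()),
                           "mean": float(d.mean())}
        return self._cache[key]

    def min(self):  return self._get("min")
    def max(self):  return self._get("max")
    def mean(self): return self._get("mean")

    def scale(self, k):
        # a "formulatable" change: patch the cache instead of dropping it
        # (assumes k > 0; a negative k would also swap min and max)
        self._data *= k
        self._cache = {key: v * k for key, v in self._cache.items()}

    def __setitem__(self, idx, value):
        # arbitrary mutation: invalidate, recompute lazily on next access
        self._data[idx] = value
        self._cache = {}

a = CachedStatsArray([1.0, 3.0, 7.0, 5.0])
print(a.max())   # 7.0 (fills the cache)
a.scale(2.0)
print(a.min())   # 2.0 (from the patched cache, no re-scan)
a[0] = 100.0
print(a.max())   # 100.0 (cache was invalidated and rebuilt)
```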

As was said, this would mean calculating some things beyond what was explicitly asked for (max, min, avg, etc.), accepting a worse worst case in the hope of a speedup in some use cases.

My guess as to why they don't do it? Because it's not a "generalizable" problem, and you have the machinery at hand to implement your custom function that fits your particular use case (say, using Cython and the numpy/cython wrappers).


You can't bake this into NumPy without a compiler or JIT. Python calls don't "know" about each other. The only way to do it would be to have a function that returns both the max and the min.


Like the sibling said, you could bake the stats into member fields to cache the values. I'm guessing it wouldn't cause any real slowdown, but I can understand the concern. It would also help with repeated calls for the same stat.

Though I'd also think a general .stats method would make the most sense. It's quite common nowadays to want a lot of stats.

And again, I am not arguing that the new lib shouldn't exist. I just question that example.


See my comment on the parent; for some reason (I think your comment was too young) I couldn't reply to you directly. As for the compiler, since Microsoft released MSVC for Python (2.7), or if you were using Python 3 to begin with, I haven't had any problems.



