
This looks really good; however, I struggle to find real applications for it.

For almost all practical applications, I use pandas or Keras/TensorFlow. I'm probably biased, as I mostly work with simple data that doesn't require complicated calculations.

Would somebody have some benchmarks against pandas for some standard operations?



> Would somebody have some benchmarks against pandas for some standard operations?

pandas creator here. Numba is a complementary technology to pandas, so you can and should use them together. It is designed for use with NumPy arrays and so does not deal with missing data and other things that pandas does. It does not help as much with non-numeric data types.


Are you saying that if you have missing data you can't use numba, or that if you have missing data and use numba together with pandas, pandas will handle the missing data where numba alone could not?


Heavy numba user here. What Wes is saying is that while pandas handles some of those missing values in an automated way, if you choose to use numba you're working with numpy arrays, so you may have to handle some of those things yourself. I have at times used a separate numpy array to indicate whether values are missing or not. You could also use a value far outside the bounds of anything you might ever see in your real data, then test for it while you're looping over the values (e.g., fill missing values with -3.4E38 if you have a float32).

Depending on what you're doing, you might be able to use numpy.nan as a value. It does work inside of numpy arrays. But some methods that operate on those objects might not work as you expect.

For instance, if you run numpy.mean on a numpy array of [nan, 4, 5], it will return nan. If you run the same thing on a pandas dataframe of the same values, you'll get 4.5.
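The NaN behavior described above, plus the separate-mask approach mentioned earlier in the thread, can be sketched with plain numpy (pandas' default `mean` skips NaN, which is what `np.nanmean` does here):

    import numpy as np

    arr = np.array([np.nan, 4.0, 5.0])

    # Plain numpy propagates the NaN:
    print(np.mean(arr))      # nan

    # The NaN-aware variant skips it, matching pandas' default behavior:
    print(np.nanmean(arr))   # 4.5

    # The separate-mask approach: track validity in its own boolean array.
    mask = ~np.isnan(arr)
    print(arr[mask].mean())  # 4.5

Inside a numba-jitted loop you'd test `np.isnan(x)` (or compare against your sentinel) on each element yourself, since there's no automatic NaN-skipping.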


Jitted code is nearly as fast as NumPy, but never faster, and the first run takes longer because the JIT has to generate LLVM code the first time. If your calculation doesn't use unsupported features like classes (they weren't supported as of a year ago, last time I checked) and needs to be written as a loop rather than vectorized code, numba can be used to speed it up.

I believe Scipyconf 2016 had a talk on numba where the speaker goes into it in great detail. Just search for it on YouTube.

Anything that isn't convenient to write as numpy array operations can be written using numba. It also works with pure Python code, so your prototype can be used at scale with nothing but a decorator.
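A minimal sketch of the "nothing but a decorator" workflow: the loop below is ordinary Python, and numba compiles it on first call. (The `try/except` fallback is just so the sketch still runs where numba isn't installed; `loop_sum` is a made-up example name.)

    import numpy as np

    try:
        from numba import jit
    except ImportError:
        # No-op stand-in so the sketch runs without numba installed:
        def jit(nopython=True):
            def wrap(f):
                return f
            return wrap

    @jit(nopython=True)
    def loop_sum(arr):
        # A plain Python loop; numba compiles it to machine code.
        total = 0.0
        for x in arr:
            total += x
        return total

    print(loop_sum(np.arange(5.0)))  # 10.0

With numba present, the first call pays the compilation cost and subsequent calls run at native speed.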


Numba can be much faster than NumPy for some calculations. For example if you want to compute both the min and the max of an array, NumPy requires two passes but in Numba it is easily done in one. This can give close to 2x speedup for arrays which do not fit in cache.


That sounds suspiciously like something that could have easily been fixed in NumPy.


What he's referring to is that you have to ask for them explicitly:

    import numpy as np
    arr = np.array([1, 3, 7, 5, 4, 3, 1, 0.1])
    maxval = arr.max()  # 1st pass
    minval = arr.min()  # 2nd pass
Whereas with numba you'd have something like this:

    from numba import jit
    @jit
    def maxmin(arr):
        maxval = minval = arr[0]
        for e in arr:
            if e < minval: minval = e
            if e > maxval: maxval = e
        return minval, maxval
And that will get optimized to numpy-like speeds, but with a single pass over the data. So for large arrays you'll get about a 2x speedup, since memory access is the bottleneck.

As for optimizing this use case for NumPy, I'd go for a Cythonized maxmin() function, which is pretty much what numba does, but you're moving the compilation overhead from the JIT into the compile step of the module.


This is what I assumed the message meant. It doesn't change my point: if this is truly a bottleneck, then it should be baked into numpy. In particular, asking the array to calculate min/max/p50/p75/... in one pass is something that would make sense.

Regarding moving the compilation: yeah, I get that. But I'd argue that the compile step of the module happens once per module, no? The JIT is something you force onto every execution. Right?

And none of this actually means this library shouldn't have been made. Just that it is a poor example for why it is better.


For some reason, I can't reply to nerdponx, so here it goes:

I'd add some "intelligence" (if you will) to the class, so that when I calculate max(), I'd also track the min, average, etc., save those values in an internal cache, and return the max. On the next call (whether it's max, min, avg, or any other cached value), it returns the result from the precomputed cache.

Of course, some operations will affect those results, and there are two ways to handle that: either update the cached results, if we're talking about a change that can be expressed as a formula (say, multiplying by 2 multiplies every cached value by 2), or simply invalidate the cache and re-compute the next time one of these functions is called.

As was said, this would mean calculating things beyond what was explicitly asked for (max, min, avg, etc.), accepting a worse worst-case scenario in the hope of a speedup in some use cases.

My guess as to why they don't do it? Because it's not a "generalizable" problem, and you have the machinery at hand to implement your custom function that fits your particular use case (say, using Cython and the numpy/cython wrappers).


You can't bake this into NumPy without a compiler or JIT. Python calls don't "know" about each other. The only way to do it would be to have a function that returns both the max and the min.


Like the sibling said, you could bake the stats into member fields to cache the values. I'm guessing it wouldn't cause any real slowdown, but I can understand the concern. It would also help with repeated calls for the same stat.

Though I'd also think a general .stats method would make the most sense. It's quite common nowadays to want a lot of stats.

And again, I am not arguing that the new lib shouldn't exist. I just question that example.


See my comment on the parent; for some reason (I think your comment was too young) I couldn't reply to you directly. As for the compiler, since Microsoft released the Visual C++ compiler for Python 2.7, or if you were using Python 3 to begin with, I haven't had any problems.



Another pandas developer here.

Numba can give you similar performance to what you'll see for highly optimized operations in pandas (e.g., for groupby-sum or a moving average), but you would have to write the low-level loops yourself. Like Wes writes, it's a complementary technology: in principle, the low-level loops in pandas currently written in Cython could be ported to Numba instead.
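A sketch of what "writing the low-level loops yourself" might look like for one of the operations mentioned, a simple moving average. The function name and the running-sum formulation are this sketch's own choices, not pandas' or numbagg's actual implementation, and the `try/except` fallback just lets it run where numba isn't installed:

    import numpy as np

    try:
        from numba import njit
    except ImportError:
        def njit(f):  # no-op stand-in when numba is unavailable
            return f

    @njit
    def moving_average(values, window):
        # The kind of explicit loop pandas implements in Cython:
        # maintain a running sum and slide the window one step at a time.
        out = np.empty(len(values) - window + 1)
        acc = values[:window].sum()
        out[0] = acc / window
        for i in range(1, len(out)):
            acc += values[i + window - 1] - values[i - 1]
            out[i] = acc / window
        return out

    print(moving_average(np.array([1.0, 2.0, 3.0, 4.0]), 2))  # [1.5 2.5 3.5]
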

I had a little project where I experimented with this a few years ago: https://github.com/shoyer/numbagg. Since then, I expect numba performance has only improved.

Note that this is unlikely to ever happen in pandas itself, for various reasons. The existing routines in Cython already work (and unlike Numba, have a good story for distribution), and the algorithmic core for next version of pandas is being written in C++.


A while ago I had to do a complex ML task.

It involved tons of time series data that followed a state machine, with very little training data.

A useful algorithm to force a series of noisy predictions to follow a state machine is the Viterbi decoder.

Numba let me write a JITted version that got order-of-magnitude improvements, especially when there were over 10^8 time series points.

It's a great piece of software, if a bit finicky sometimes.
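A hedged sketch of the kind of Viterbi decoder described above, working in log space. This is a generic textbook formulation under assumed array shapes (per-step emission log-likelihoods, a state-transition log-matrix, and initial log-probabilities), not the commenter's actual code; the `try/except` lets it run without numba:

    import numpy as np

    try:
        from numba import njit
    except ImportError:
        def njit(f):  # no-op stand-in when numba is unavailable
            return f

    @njit
    def viterbi(log_emit, log_trans, log_init):
        # log_emit: (T, S) per-step emission log-likelihoods,
        # log_trans: (S, S) state-transition log-probabilities,
        # log_init:  (S,)  initial state log-probabilities.
        # Returns the most likely state path of length T.
        T, S = log_emit.shape
        score = np.empty((T, S))
        back = np.zeros((T, S), dtype=np.int64)
        score[0] = log_init + log_emit[0]
        for t in range(1, T):
            for s in range(S):
                best, arg = score[t - 1, 0] + log_trans[0, s], 0
                for p in range(1, S):
                    cand = score[t - 1, p] + log_trans[p, s]
                    if cand > best:
                        best, arg = cand, p
                score[t, s] = best + log_emit[t, s]
                back[t, s] = arg
        path = np.empty(T, dtype=np.int64)
        path[-1] = np.argmax(score[-1])
        for t in range(T - 2, -1, -1):
            path[t] = back[t + 1, path[t + 1]]
        return path

The explicit triple loop is exactly the shape of code that is slow in pure Python but that numba compiles well.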


can you elaborate on the finicky part?


I've noticed two pain points: Installing outside of Anaconda can be a real chore, and error messages were extremely unhelpful (as of about 12-18 months ago, hopefully it's better now).


I work in non-Anaconda environments, and this single pain point has kept me away from it. I do some borderline work where I need both the scientific stack and Django/Flask/"webby libraries", so I could never go "full Anaconda" on the stack.


This example doesn't do the entire numba project justice, but if you've ever written a for-loop in a bit of Python code that does number crunching, you'll have noticed how much it slows everything down. The numba jit decorator is an extremely quick win that often yields 1-2 orders of magnitude of improvement in calculation time. It's less of a win if you're only ever working with already-vectorised data structures and algorithms.

I was confused by your comment though, specifically the idea that you are using tensorflow because you 'mostly work with simple data that doesn't require complicated calculations'. This seems very contradictory. Have I misunderstood?


A bit unclear indeed. By that I meant that most of the stuff I do either fits well in pandas dataframes and requires mostly standard operations already well implemented in pandas (actually, there are so many of those that by 'standard' I mean practically all the operations I need), or it's image data, which Keras handles very handily.


definitely a lot faster than pandas!

question: how do you debug a function which has a super complex decorator slapped on top of it?


> question: how do you debug a function which has a super complex decorator slapped on top of it?

With difficulty. If you find yourself in a situation where you have a function that works without the numba decorator, but fails with it then it's time to break out your llvm reading skills: http://numba.pydata.org/numba-doc/0.10/annotate.html

That being said it is very rare that that happens.


You can just remove the decorator?


No, because certain things that are fine in an interpreted function will fail to compile, and throw totally uninformative errors. E.g., one time I spent about 3 hours trying to figure out why a relatively simple, perfectly functional NumPy function, limited to the subset of Numba primitives, wouldn't compile. A labmate with actual knowledge of C looked at the code and in about 30 seconds suggested the two return statements might be the problem; she was of course correct. Doh.


Ok but generally it is possible to debug by omitting the decorator.



