
> How to write optimized JavaScript

This is all sensible advice if you're interested in writing fast-enough code, but I do find there's a lack of material for people who want to write fast Javascript. Pretty much the only thing I've found is the post by Oz[1], though I really don't want to have to compile Chrome.

For an example, I have a method in Javascript that does a mix of (integer) arithmetic and typed array accesses; no object creation or other baggage. I want it to go faster, and with effort I managed to speed it up a factor of 5. One of the things that helped was inlining the {data, width, height} fields of an ImageData object; just moving them to locals dropped time by ~40%.
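Roughly what that hoisting looked like (a simplified, hypothetical kernel, not my actual code — the point is the shape: one property load each instead of one per iteration):

```javascript
// Hypothetical pixel kernel illustrating the hoisting described above.
// Instead of reading img.data / img.width / img.height on every
// iteration, copy them into locals once before the hot loop.
function invert(img) {
  const data = img.data;              // hoisted: one property load, not millions
  const len = img.width * img.height * 4;
  for (let i = 0; i < len; i += 4) {
    data[i]     = 255 - data[i];      // R
    data[i + 1] = 255 - data[i + 1];  // G
    data[i + 2] = 255 - data[i + 2];  // B
    // alpha channel (i + 3) left untouched
  }
  return img;
}
```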

Yet after all this effort, mostly based on educated guesses since Chrome's tooling doesn't expose the underlying JIT code for analysis, the code is still suboptimal. There's a pair of `if`s that if I swap their order, that part of the code allocates. How do people deal with these issues? A large fraction of this code is still allocating, and I haven't a clue where or why.

Perhaps I'm asking too much from a language without value types (srsly tho, every fast language has value types), but what I want is clearly possible: ASM.js does it! I don't really want to handwrite that, though.

[1]: https://www.html5rocks.com/en/tutorials/performance/mystery/ (E: This link is actually written from a different perspective than the one I read, but the content is the same.)



I have some experience with this stuff, and my go-to advice is:

  - Never construct or mutate objects in a loop or recursive call.
  - Preallocate the max of what you'll need at load time, then index into it.
  - Prefer typed arrays over objects.
  - Don't use strings. Work with numbers.
  - Remove function calls.
  - Do not call a function with differently shaped arguments.
  - If the JS engine can't prove a number is a 32 bit int, you'll take a large perf hit.
  - Break out your algo into a worker. Not only does it keep things responsive, but it might actually be faster due to less GC pressure.
  - The web stack is full of frightening gotchas that you'll discover. Receiving large data on a WebSocket can halt your DOM rendering. Leaking a Promise callback can completely blow a function into interpreter mode. And other fun times.
Finally: If you're hand optimizing and not getting sufficient results on your preferred browser, accept that you're screwed. The next browser will break your assumptions. Rethink your algorithm and data flow, or move it elsewhere, like a worker or server.
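As a sketch of the preallocation point (sizes and names made up — the idea is flat typed-array storage indexed by a counter, no object creation in the hot path):

```javascript
// Sketch of "preallocate the max you'll need, then index into it".
// MAX_POINTS is a hypothetical load-time bound.
const MAX_POINTS = 1 << 16;

// Two flat typed arrays instead of an array of {x, y} objects:
const xs = new Float64Array(MAX_POINTS);
const ys = new Float64Array(MAX_POINTS);
let count = 0;

function addPoint(x, y) {
  // No allocation here; just stores into preallocated storage.
  xs[count] = x;
  ys[count] = y;
  count++;
}
```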

I do agree the tooling for this stuff isn't great, but I found Chrome's line-by-line perf counts to be excellent hints about where you could do better. It's often surprising but obvious in retrospect.


A lot of this exact advice was applicable when I was first learning C and C++ in college. Some of it was enforced by certain compilers' errors even. Interesting how the more things change, the more they stay the same.


A lot of this likely has to do with the fact that the underlying components of v8 are written in C++.


I agree with a lot of these bullets just as general design principles, and emphatic +1 to not hand-optimizing.

Just noticed this comment after posting https://news.ycombinator.com/item?id=15069116, which is specific to V8 but which I'd guess holds for other JS engines too. Relevant excerpt:

""" ...in just about every case, the real answer is "it depends". That's why V8 is ~1,000,000 lines of code, not 1,000.

Please don't try to follow some rule blindly, much less derive new rules of your own. More often than not, when you try to hand-tune for the last bit of performance, you'll actually trigger something that was introduced to streamline the other [95% of] cases/JS devs' code, and then you've spent hours on making your code less readable with no benefit.

There is no "Try This One Crazy Trick To Make Your JavaScript Fast!!!" (or even ten!) """

(And for the specific case of preallocation vs. growing as you go, right now, in most cases, the difference is negligible. In the others, it depends.)


I've searched for advice like this too. At work I maintain a compiler from C# to Java and JS, and while the emitted code will by definition never do some things that are slow (e.g. call functions with differently shaped arguments or change the shape of objects), I'd be in a far better position to implement other performance improvements if I knew what I could do. Of course, avoiding allocations helps in other languages too, and general algorithm changes should be done by a human, but things at the expression level that only change performance, not semantics, could easily be applied across a whole codebase.

A large change we made recently was to extract private methods from classes and have them as normal functions in scope instead of looking them up from an object, but that was also mostly guesswork, since "JS then no longer has to do a lookup for the function but instead can call it directly". In the end I still don't know whether or how much that changed, as it coincided with a major library version change and we're faster than the old version overall.

Catering not only to Chrome but also to Firefox and Edge makes such things generally difficult as well, without a bunch of solid performance tests in place to catch regressions on certain browsers.

Ideas I had, but not yet tested/implemented:

· Inline compile-time constants so instead of having some.namespace.Type.CONSTANT, we'd get 25 in-place.

· Merge multiple boolean fields in a class into a single int and access them via bitwise operations from the getter. Might mostly help in how much fits into the cache.

· Append |0 to all sub-expressions of type int instead of only the outermost one. May help the JIT to not even emit code to check types in between.

· Using typed arrays instead of normal arrays if we have things like int[] or double[]. Needs a polyfill for IE 9 then.
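For the bitfield idea above, I'm imagining something like this (bit positions and names are arbitrary):

```javascript
// Sketch of merging several boolean fields into a single int,
// accessed via bitwise operations from getters/setters.
const VISIBLE = 1 << 0;
const DIRTY   = 1 << 1;
const LOCKED  = 1 << 2;

class Node {
  constructor() {
    this.flags = 0;  // one small int instead of three boolean fields
  }
  get visible() { return (this.flags & VISIBLE) !== 0; }
  set visible(v) {
    this.flags = v ? (this.flags | VISIBLE) : (this.flags & ~VISIBLE);
  }
  get dirty() { return (this.flags & DIRTY) !== 0; }
  set dirty(v) {
    this.flags = v ? (this.flags | DIRTY) : (this.flags & ~DIRTY);
  }
}
```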


What this tells me is: don't bother trying to write performant code in JavaScript, which is scary!


Not what I said. I said to avoid /hand-optimizing/ code.

Trying to write performant code is good. Some examples of advice on that front:

  - Use an appropriate algorithm, and make sure the logic is free of bugs. For example, accidental out-of-bounds array accesses can really hurt performance.
  - Avoid needless computations. Don't do something (complex) twice if doing it once is enough.

If this sounds like "use common sense/generally good design principles", that's because it is (and hence the agreement with many of parent post's points that are simply and straightforwardly good design principles).

Let's take the example of "remove function calls". It's true that each call costs a few instructions, but unless the function you're calling is tiny, that overhead won't be measurable. If it makes sense for readability/maintainability/testability to split your code into functions, then by all means do it!

Another example is the "trick" to write "for (var i = 0, len = array.length; i < len; i++)" instead of the simpler "for (var i = 0; i < array.length; i++)". As http://mrale.ph/blog/2014/12/24/array-length-caching.html (from the original V8 team) explains in great detail, what seems like a no-brainer can actually do the opposite of what you'd expect --- and at the same time, won't make any difference whatsoever in most real code (i.e. aside from microbenchmarks).

Yet another example is https://www.youtube.com/watch?v=UJPdhx5zTaw. You can spend all day on it, but at the end of the day, the best thing to do is code correctly and sanely, not hand-optimize the bejeezus out of the thing (mostly yourself, in many cases).


Only we've seen what happens when everyone follows this: you get to the state that we're in today, where everything sucks but everyone's OK with it.


In V8, objects are actually arrays (sort of, they have an indexed store) - so doing `this[0]` is about as fast as doing it with an array (certain array things like for... of iteration are optimized though - but only if you haven't changed the prototype).


In case it wasn't clear, when I said prefer typed arrays over objects, I literally meant Typed Arrays [1].

These are not like arrays _or_ objects. They're faster than either, since there are fewer corner cases and type ambiguities.

[1] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Type...


Is that definitely true? Some time ago I was tuning code that did lots of math on a big array of floats, and I found that using Float32Arrays was moderately but measurably slower than just using [].

This may have been before TurboFan, though.


You're right. In V8, values from a Float32Array are always converted to doubles upon access, even in cases like adding values from two Float32Arrays and storing them in another Float32Array where the result is provably the same. That costs a handful of cycles for each element.

Mozilla has a blog post about doing efficient float32 math: https://blog.mozilla.org/javascript/2013/11/07/efficient-flo...

V8 bug: https://bugs.chromium.org/p/v8/issues/detail?id=5541

Using a Float64Array is faster at the expense of memory usage. (I've never seen a case where the code generated by V8 bottlenecks on memory bandwidth fwiw.)


Thanks for the details, very useful!

I don't suppose you know if similar things are true for Int arrays? I think I once had a similar case there, where I was doing integer operations on a large set of ints, and regular arrays were faster than Int32. But it's hard to imagine there as well what the slowdown would be.


Good question. I haven't looked at that case. V8 can detect and optimize small ints (smi, fit in uint32, i.e. all types of typedarrays), but it might still take the easy/slow path and convert to doubles. Relatively easy to look at with node.js and irhydra.


In a few different cases (one node, one browser-based) I've had medium-sized read-only datasets of a couple million objects where I've tried to conserve memory by using typed arrays. While it saved a good amount of memory, property access was significantly faster via objects than typed views. I'm curious to re-run the benchmarks with Turbofan though, maybe things have changed.


Indeed, it heavily depends on what your JS engine thinks your data types are, your allocation patterns, and more. If your engine thinks you're using floats to index into a typed array that might be slower than using floats to index into an object. Even if you know your numbers are really only integers. Which is why JS performance is so puzzling sometimes.

One thing you can do with typed arrays that you cannot do with objects, though, is zero-copy into workers. You can use that technique to get pretty much free parallel performance on any number-crunchy task.

These things change all the time though; all the more reason for always measuring before you cut and investing into better profiling tools.


This is incorrect. Arrays are implemented as a type of object, but you shouldn't worry about that. Hashmaps and arrays both have ~O(1) random indexing. If you need an array, don't use an object.


Objects have a length overhead. At Bluebird we used objects with a bitmask on the length so it's fast. (We use fewer slots than an array, actually.)

I was just pointing out that Objects weren't slow for indexed access.


> - Do not call a function with differently shaped arguments.

If you don't mind me asking, what did you mean by this?


This has to do with monomorphic versus polymorphic and megamorphic functions. Basically, the inline cache has finite capacity, and if every call of a given function takes objects of the same shape (they share a hidden class), then you don't need to worry about evicting your cache entries.

Once you start passing in objects of different shapes though, you're going to exceed the inline cache's capacity, and start losing out on the massive speedups the cache gains you.

A function that takes one inline cache entry is monomorphic, more than one is polymorphic, and more than the inline cache capacity is megamorphic. You want as many functions as possible to be monomorphic, polymorphic if you can't help it, and never megamorphic.

This post gets into a lot more detail: http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.h...
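A minimal illustration of call-site shapes (hypothetical objects — each distinct property layout is a different hidden class):

```javascript
// One call site, fed objects of varying shape.
function getX(p) { return p.x; }

// Monomorphic: every call sees the same hidden class {x, y}.
getX({ x: 1, y: 2 });
getX({ x: 3, y: 4 });

// Feeding in different shapes makes the call site polymorphic:
getX({ x: 5 });        // shape {x}
getX({ y: 6, x: 7 });  // {y, x}: key insertion order matters, yet another shape
```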


If you call a function with a different number or type of arguments, or an object that has different keys, or even keys that you added in a different order, bets are off and the optimizer might not be able to reuse previously optimized versions of the function. And if you do this several times, the engine might give up optimizing your function entirely.


Worth noting that this is valid advice for nearly everything with a JIT (and many languages with generics without a JIT, since doing so likely bloats your dispatch logic, whether it's dynamic or built at compile-time).


> - Prefer typed arrays over objects.

Could you explain this (or provide a link to a resource that does)? I feel that this could be substantially useful but I have no idea how I'd leverage typed arrays over objects off the cuff.


README> Turbolizer is a HTML-based tool that visualizes optimized code along the various phases of Turbofan's optimization pipeline, allowing easy navigation between source code, Turbofan IR graphs, scheduled IR nodes and generated assembly code.

https://github.com/v8/v8/tree/master/tools/turbolizer is like IRHydra, but for Turbofan.

    svn export https://github.com/v8/v8/trunk/tools/turbolizer turbolizer
    cd turbolizer
    python -m SimpleHTTPServer 8000

    chromium-browser http://0.0.0.0:8000/

    chromium-browser --no-sandbox --js-flags="--allow-natives-syntax --code-comments --trace-turbo" testfile.html

    mkdir out
    cd out
    node --allow-natives-syntax --code-comments --trace-turbo ../testfile.js

    cat testfile.js
    function add(a,b) { return a + b; }
    add(23, 44);
    add(2, 88);
    %OptimizeFunctionOnNextCall(add);
    add(2, 7);

    ls out
    ... turbo-add-0.json ...


The current usefulness of turbolizer seems... mixed.

Turbofan generated code is highly variable. Changes in argument types and optimization order can result in very different code being emitted.

It's not like "all these cases were well optimized, but this one case wasn't, so let's fix/avoid that one". Instead, it's "well, that's a diverse mess of results, hmm...".

Perhaps that will change. One project objective is apparently optimization with fewer performance cliffs. But another objective is a compiler which better serves low-end mobile devices - thus raising the bar for what optimizations are worth spending time on. There's a tension between those two.

Also, CPU branch prediction makes interpreting the resulting assembly more interesting. Turbofan output has a lot of deopt guards: tests that the assumptions used when writing the optimized code are still true. But say one bit of JavaScript optimizes to code with more guards than another. Do they matter? Maybe not, if the guards end up off the mainline and hit cache.

So turbolizer seems helpful if you have some red-hot performance chokepoint. But at present, perhaps less so for providing more general guidance.


> is like IRHydra

It does the 'IR' part of IRHydra, but it does not give you another important feature: the ability to see how your function evolved over time and figure out the reasons for it. Loading files one by one in search of the right version is pretty cumbersome.

I was always hoping that V8 will eventually get proper Dev Tools integration for compiler / runtime internals - and I think things have finally started happening in that area.


> cumbersome

Yes, turbolizer currently lacks some of IRHydra/2's mature featurefulness. No pretty "your deopt is here". I've no idea whether Google would welcome contributions or not.

> I was always hoping that V8 will eventually get proper Dev Tools integration for compiler / runtime internals

Wistful sigh.


You can ask V8 to dump the assembly or intermediate representation or you can just run it with --trace-deopt and it'll tell you most of the time (generally) what's slowing your code down.

> I'm asking too much from a language without value types

JavaScript has value types, what it doesn't have is user defined value types. BigInts are coming (stage 3 already!) though.


To an extent I feel like there is a dividing line in the type of optimisations that I should aim for.

I'd aim for Typed arrays and avoiding intermediary object creation, because those reduce the intrinsic work load (especially for things like float32 where you are giving a nod to the system to allow slightly worse results)

Anything where the optimiser should be able to figure out what to do, I'm much more hesitant about. Inlining variables is something that I don't feel should be necessary.

Anything that is just massaging the optimiser into making the right decision runs the risk of making it worse on a different engine. That different engine may even be a later version of your current engine.


Vyacheslav Egorov wrote an article[0] on monomorphic function calls and how they impact JavaScript performance. While it's very in-depth, it is over 2 years old at this point, so I'm not sure how well it applies to the current version of V8.

[0] http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.h...


Hello @metrognome, three of your last four posts were [dead] even though none of them appear to break the HN guidelines... I vouched for the last two; the other one is too old for me, as a user, to do anything about.

You may want to contact the mods about this.


I don't think there are any really solid approaches here except to use IRHydra [0], which lets you decompile v8's optimized version of a function into an assembly-like representation. But I'm pretty skeptical of whether anyone but v8 engineers can really make use of it.

I think the only workable approach is what you're doing - have an approximate idea of how the compiler works and make intelligent guesses about what will help. But when you're tuning at the level where changing the order of if statements makes a difference, you're at the level where next month's update may reverse your fix, so in a lot of cases it'll be questionable whether it's worth the effort.

[0] IRHydra - http://mrale.ph/irhydra/2/


Ignition/Turbofan has a new IR - IRHydra 1 and 2 won't help. For Turbofan there's https://github.com/v8/v8/tree/master/tools/turbolizer .


You can still do

    d8 --trace-turbo                        \
       --trace-deopt                        \
       --code-comments                      \
       --redirect-code-traces               \
       --redirect-code-traces-to=code.asm   \
       --print-opt-code                     \
       your-file.js
and load the resulting turbo.cfg and code.asm into IRHydra.

However most of the important features (like source and deopt-to-ir mapping) are broken.

I used to repair things in my free time - but I don't think I want to do it anymore.


> I used to repair things in my free time

Oh! Thank you for all your work.

I noticed my current node wasn't dumping .cfg's, and assumed that data path had died with TF.

> but I don't think I want to do it anymore.

Nod. Programming languages - civilization critical infrastructure funded like kindergarten bake sales.


IRHydra is not working with the latest V8 pipeline (Ignition + Turbofan), AFAIK: https://github.com/mraleph/irhydra/issues/52#issuecomment-25...


I find the Steve Souders' books [1] "High Performance Web Sites" and "Even Faster Web Sites" methods are still fairly practical, not all of it applicable to JS directly, but still.

In the end, you'll need to test your changes in the most common browsers that affect you, because each browser's JS engine can vary a lot. The lookup cost you mention can be deceptively expensive, especially if you have anything resembling a deeper inheritance tree, or worse, proxies, getters/setters etc.

In the end, a lot of it doesn't matter, until it does.

[1] https://www.amazon.com/s/?keywords=steve+souders


The problem, however, is that JS optimization is very sensitive: depending on the browser vendor, version, and execution context, optimization advice can change dramatically.

Back when I worked on a perf-sensitive JS codebase, we had a perf test suite, and depending on the details, oftentimes locals were faster, but sometimes objects, and on a different browser it was arrays. And if you accidentally hit an asm.js opt it could be typed arrays, but mostly typed arrays were significantly slower. And when a new version came out this could all invert.


> There's a pair of `if`s that if I swap their order, that part of the code allocates. How do people deal with these issues? A large fraction of this code is still allocating, and I haven't a clue where or why.

Are you sure that allocation is a problem? Nursery allocation is extremely fast: it's just one overflow check and a pointer bump. Usually it's around 3-6 instructions.

Allocation in modern generational GCs is in a completely different performance category than C-style malloc is.
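To make "overflow check and a pointer bump" concrete, here's a toy bump allocator (not V8's actual layout, just the idea — a real nursery evacuates live objects to the next space instead of simply resetting):

```javascript
// Toy bump allocator illustrating nursery allocation:
// one overflow check, one pointer bump. Sizes are hypothetical.
const NURSERY_SIZE = 1 << 20;
const nursery = new ArrayBuffer(NURSERY_SIZE);
let top = 0;

function allocate(bytes) {
  if (top + bytes > NURSERY_SIZE) {  // the one overflow check
    minorGC();                       // would evacuate survivors, then reset
  }
  const offset = top;
  top += bytes;                      // the pointer bump
  return new Uint8Array(nursery, offset, bytes);
}

function minorGC() {
  // A real collector copies live objects out first; this toy just
  // resets, which is why dead temporaries are nearly free to collect.
  top = 0;
}
```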


Pointer-bump allocations are fast in isolation, but if your code is doing one it means the compiler has gotten itself confused, and every time it gets confused it makes it harder for later code to be optimized. Compilers only really reason well about value types (and even then you should never assume they're smart[1]).

That said, I don't actually know allocation is the problem; it's a suspect, but unlike a native script I can't just run `perf` to get performance counters and assembly listings. All I really know is that the non-allocating order is a lot faster.

[1]: If you can read assembly, this one's hilarious: https://godbolt.org/g/81qWoq


>[1]: If you can read assembly, this one's hilarious: https://godbolt.org/g/81qWoq

I only barely read assembly, so I was just trying to figure this one out.

So, the compiler has figured out that it just needs to test v against 0 (since the array is immutable), but it does this 2 times, and with a couple of apparently redundant 'je' instructions. Is that what's weird about the output?

In case it changed, this is the output I saw:

  example::exists_in_table:
    push rbp
    mov rbp, rsp
    mov al, 1
    test edi, edi
    je .LBB0_5
    je .LBB0_5
    test edi, edi
    je .LBB0_5
    je .LBB0_5
    xor eax, eax
  .LBB0_5:
    pop rbp
    ret


Basically, yes. The first test and je show the compiler has managed the hard part, but the next je obviously can't be hit, the test after that literally doesn't do anything, and the two jumps after that also can't be hit. It's funny because this is all so trivial.

If I write it as some C

    bool should_jump = edi == 0;
    if (should_jump) goto LBB0_5;
    if (should_jump) goto LBB0_5;
    should_jump = edi == 0;
    if (should_jump) goto LBB0_5;
    if (should_jump) goto LBB0_5;
I can guarantee the compiler is going to produce good code from it.


This looks like yet-another-phase-ordering-issue in LLVM. I hit so many of these it's not even funny. Arguably one could find C code that shows the exact problem under clang. Example of optimized LLVM IR for that sample (press LLVM IR on https://play.rust-lang.org/?gist=b60ec85886eff967b21c7abc899...):

    br i1 false, label %bb9, label %bb6
SimplifyCFG would instantly kill that. But it doesn't get to run again after whatever simplified the branch condition. I've heard rumors that LLVM is working on a new pass manager, maybe they will eventually fix this pervasive issue.


Apologies for prodding a sore spot! The phase ordering problem is a well known demon for compilers; makes me wonder why I've not heard anything about unphased compilation.


No worries, I'm just surprised you managed to find a simple testcase that exhibits it, so, really, thank you!

I've opened a (a bit vague for now) issue so at least I don't forget about it: https://github.com/rust-lang/rust/issues/44041

As for phase ordering - someone did show me recently VSDG, which claims to solve at least part of the problem (specifically, https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf and http://sro.sussex.ac.uk/7576/1/Stanier%2C_James.pdf).


Awesome references, seems I have some fun reading :). (Trivia: Alan Mycroft actually supervised me for my undergraduate dissertation.)


You can at least use the browser profiler tools.


Browser profiling tools are great but their granularity stops at the function level - they can't tell you anything about what happens within a given function, just how that function compares to others.


I haven't used them much, especially since I tend to choose native projects when possible.

Thanks for the heads up.


I do! It's missing a lot of detail, though; it's optimized much more for overall performance analysis.


> Allocation in modern generational GCs is in a completely different performance category than C-style malloc is.

Yes and no. If enough allocation happens that a GC pass occurs while you're still running your JS hot loop you're pretty much screwed and all your performance vanishes. If you want to see this in the extreme go open an Android systrace or Chrome chrome://tracing trace result and do a JS profile while you do so. The single largest consumer of JS CPU time is the GC at nearly 13%.

The problem with modern generational GC languages is they tend to be designed assuming that allocation is always cheap/free, and it just isn't. That's a fairytale. It's just not true.

Also just because it only takes 6-13ns for an allocation to be given to you doesn't mean anything when the first access of that allocation is a cache line miss compared to if you did the same thing in C and it was on the stack, which is likely sitting in hot L1 primed and ready to go. Reusing the same memory address range is critical to getting maximum performance, and modern generational GCs are miserably bad at that.


> Yes and no. If enough allocation happens that a GC pass occurs while you're still running your JS hot loop you're pretty much screwed and all your performance vanishes.

GCs are expensive relative to not doing anything, but minor GCs are very fast.

> The problem with modern generation GC langauges is they tend to be designed assuming that allocation is always cheap/free and it just isn't. That's a fairytale. It's just not true.

I'm not making an argument about language design here. Obviously value types are a good thing.

> Reusing the same memory address range is critical to getting maximum performance, and modern generational GCs are miserably bad at that.

Not at all. The nursery is usually in cache, just as the stack is. Nurseries are usually implemented as two space copying collectors with small spaces, which are excellent for cache locality (arguably even better than stacks, because stacks can grow deep).


> Not at all. The nursery is usually in cache, just as the stack is. Nurseries are usually implemented as two space copying collectors with small spaces, which are excellent for cache locality (arguably even better than stacks, because stacks can grow deep).

I think you're still working under the assumption that the GC is keeping up with the allocation rate, which if you have a small allocation in a hot loop will largely not be true. Typically GC'd languages rely on escape analysis to handle this and not the generational GC at all, but if that fails then you're SOL because the generational GC is unable to keep up and unable to keep things in the fast path.


> Typically GC'd languages rely on escape analysis to handle this and not the generational GC at all

No, they don't. Most GC'd languages rely on generational GC, because escape analysis on its own doesn't directly provide a lot of performance benefits once you have a generational GC.

> if that fails then you're SOL because the generational GC is unable to keep up and unable to keep things in the fast path.

No, you aren't. Cleaning up dead objects (for example, temporaries created in a hot loop) in a nursery is extremely fast. The entire nursery semispace is typically in L1.


    The entire nursery semispace is typically in L1.
L1 is only going to be about 64k per core (e.g., Kaby Lake). That'll have your stack, your working set, and a tiny bit of your nursery in it... as long as you don't get associativity problems.

If you read a bunch of data before making an object, I could see you easily evicting your nursery from L1 into L2, but then you only have about 4x as much space.


Not GP, but as someone who has worked with image manipulation and typed arrays in JS (which seems similar to what the GP was doing), I can say that this often boils down to linearly looping through a typed array and doing very simple operations. You know, one of those "best case" scenarios for CPU utilisation where memory access tends to be the bottleneck. If the allocation happens inside a tight loop like that, I can imagine it ruining performance.


Do these same optimisations work well on non-V8 browsers as well, or have we completely given up on avoiding the browser monoculture again?


I have been wondering for a while now if there's a terse / "embed some essentially-C code" style asm.js tool. Being able to drop into an optimizer for math / large data purposes, without going full emscripten, seems like it could be extremely useful.

That said - have you tried using emscripten on a C version of your code? Any better / worse?


Yes, it would be nice to just have some config flags to enable it on the debugger.



