Yttrill's comments (Hacker News)

Really? That's not my understanding; it's much smarter than that. Goroutines steal memory from the current pthread machine stack, that is, the machine stack. The problem calling C from a goroutine is that whilst you have a real C stack right there .. other goroutines expect to be able to steal memory from it, and once you call C you don't know how much stack C is going to use. So whilst C is running, goroutines cannot allocate stack memory, which means they cannot call subroutines. The only way to fix that is to give the C function being called its own stack.

The problem you're going to have is that if 10K goroutines all call PCRE you need 10K stacks, because all the calls are (potentially) concurrent.

What makes Go work is that the compiler calculates how much local memory a goroutine requires, so after a serialised bump of the stack pointer the routine cannot run out of stack. Serialising the bump between competing goroutines is extremely fast (no locks required). Deallocation is trickier: I think Go uses copy collection, i.e. it copies the stack when it runs out of address space on the stack, NOT because it's out of memory (the OS can always add to the end), but because the copying compacts the stack by not copying unused blocks. It's a stock-standard garbage collection algorithm .. used in a novel way.
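The copy-on-growth idea can be sketched as a toy model (Python here, purely illustrative; Go's real stack management is far more involved, and the class below is made up):

```python
# Toy model of a copy-on-overflow stack (illustrative only; not Go's
# actual implementation).
class GrowableStack:
    def __init__(self, capacity=4):
        self.block = [None] * capacity
        self.top = 0
        self.copies = 0  # how many times the stack was relocated

    def push(self, frame):
        if self.top == len(self.block):
            # "Out of address space": relocate to a block twice the size,
            # copying the live frames over, like a moving collector.
            new_block = [None] * (len(self.block) * 2)
            new_block[:self.top] = self.block[:self.top]
            self.block = new_block
            self.copies += 1
        self.block[self.top] = frame
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.block[self.top]

s = GrowableStack()
for i in range(10):
    s.push(i)
# capacity grew 4 -> 8 -> 16: two relocations
```

The point is only that growth is a copy, not an in-place extension, which is why compaction falls out for free.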

The core of Go is very smart. Pity about the rest of the language.


> Really? That's not my understanding; it's much smarter than that. Goroutines steal memory from the current pthread machine stack, that is, the machine stack.

There is no "machine stack", and yes, in the details it tries to set up and memoise C stacks, but it still needs to switch out the stack and copy a bunch of crap onto there, and that's expensive.


C++ and Felix obviously have the lowest overhead .. namely zero in both cases :-)


Modules would be very hard to make work in C++; there's way too much entanglement at all levels. Of course the rot actually started with the ANSI C committee when it introduced typedef and broke context-free parsing. C++ just compounds this kind of problem with template lookup stupidity. It's what you get when languages are designed by people with no understanding of basic computer science.


Claims 3 and 6 lack a true understanding of Ocaml. The compiler's error messages are exceptionally good in most cases.

Syntax errors without detail are a product of the parsing technology: it's a minor irritant at worst.

The most difficult problem (IMHO) comes from type inference, when a type inferred from context is not the one intended, but a conflict is not discovered until code with the intended typing is found: the error reflects the second location, which is confusing because there is no error at that point. Of course this doesn't occur if you actually provide type annotations, so it is merely a cost of being able to omit them.

The excess exception handling is a reasonable comment; however, it is easy to wrap the exception-throwing code in a variant to enforce local error checking, and in any case this style is a property of the library, not the core language.

The type polymorphism in Ocaml is anything but clunky: for basic stuff you don't even need any type annotations. The thing is that Set is NOT type polymorphism. It's module polymorphism, and that requires explicit binding of an instance. There is a simple enough reason: module functors take non-type arguments, particularly function closures, which are values, which do not have unique signatures, so implicit instantiation is not possible.

Ocaml does have some other serious disadvantages. One is that, because the optimiser uses information from the compilation units a module depends on, there are annoying constraints on build order. In addition, recursion cannot span compilation boundaries, which is pretty lame for a functional programming language: even C has no problem with that.

Finally, the political situation is probably the biggest problem. The development team is excellent but limited and their focus is on the type system, rather than libraries. Because the community is modest compared to say C++, many attempts to make extended libraries have failed to gain acceptance, because no one wants to depend on an unresourced third party library and the core INRIA team doesn't have the resources or interest to extend the standard distribution.

Despite these difficulties .. I would never want to go back to writing C++ (OMG .. the pain!). As a language .. Ocaml is light years ahead.


> Syntax errors without detail are a product of the parsing technology: it's a minor irritant at worst.

You may think so, but it infuriates me many times a day. This is the most basic thing which a compiler should be getting right, and yet they fail.

I agree with you on type inference errors, but my solution is to annotate function parameters, which avoids the most confusing problems.

I don't want to wrap my error handling code, or work around it with other libraries; I just want something usable out of the box. Languages need to be evaluated holistically, and the design of the standard library is a critical part of that: look at the C++ STL.

Yes, I meant module polymorphism. I much prefer the way that F# deals with Sets, Maps, etc.

I think Ocaml has had a good run, but it's starting to look dated. F# has really innovated; it's just a shame that it's tied to .NET.


There certainly are languages that can provably ensure all cases are handled.

Have a look at ATS.


This does reflect very sadly on schools, and shows up when trying to explain things to programmers used to conventional languages, especially if you're trying to explain why C is fundamentally broken.

Roughly there are three fundamental control transfer operations: conditional jumps, subroutine calls, and coroutine calls. A subroutine is a slave, the caller is a master. With coroutines, the caller and callee are peers.

Coroutine calling is more fundamental and easier to program, but it isn't available in C.

The theory is about continuations. A continuation is just "the rest of the program". You can think of it as the program counter (PC). Subroutine calling works by passing a continuation to a routine, when the routine is finished it invokes the continuation. This is done by pushing the program counter on the stack, and the subroutine popping it with a return statement.
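The continuation-passing view of a subroutine call can be made concrete by passing the continuation as an ordinary argument (a Python sketch; `add_cps` is a made-up name):

```python
# A subroutine in continuation-passing style: instead of returning,
# it invokes the continuation it was handed ("the rest of the program").
def add_cps(x, y, k):
    k(x + y)

results = []
# The continuation here is everything the caller wanted to do next.
add_cps(1, 2, lambda r: results.append(r * 10))
# results == [30]
```

Pushing the return address on the machine stack is just the hardware's compact encoding of that `k` argument.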

Coroutines work by exchange of control. When you call a coroutine you give it your current continuation to call when it's ready. But also, you do not call the coroutine at the beginning. You call it where it last left off: at its last continuation point.

A set of coroutines is usually called fibres. They represent interleaving of control. You can emulate them with pre-emptive threads and locks, but coroutines are synchronous and non-preemptive.

Felix and Go both make heavy use of coroutines (called fthreads and goroutines respectively). Iterators as in Python are a special case.

In general, on today's badly designed CPUs, you have to think of coroutines as requiring stack swapping. Threads do this, which is why you can emulate coroutines with threads.

By far the most well known coroutine scheduler is .. the operating system. It's basically a coroutine scheduler for applications. This is why you can read and write files: the real world is event driven, but callbacks are impossible to program with. So the operating system inverts the events by stack swapping, so your application can be written as a master.

You think you're calling the OS, and the OS thinks it's calling you. You're both masters. That's coroutines.

Python and Felix both have yield, but that's a special case. In general the model is reading and writing channels: yielding is just writing the "sole" channel, and getting a function result is just reading it. A channel is basically a "place to swap stacks".
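Python's generators can sketch this exchange-of-control model: `send` writes the channel from one side, `yield` reads it from the other, and each resumption continues the consumer at its last continuation point (a toy example, not how Felix or Go actually implement channels):

```python
# A consumer coroutine: it *reads* its input (via yield) rather than
# being called with each value, so its state lives on its own frame.
def consumer(log):
    while True:
        value = yield          # "read the channel"; control swaps out here
        log.append(value * 2)

log = []
c = consumer(log)
next(c)                        # run the consumer to its first continuation point
for v in [1, 2, 3]:
    c.send(v)                  # "write the channel": control swaps to the consumer
c.close()
# log == [2, 4, 6]
```

Neither side is a slave: the producer loop and the consumer each think they are the master reading/writing a channel.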


> Coroutine calling is more fundamental and easier to program

Citation needed.

For coroutines, each coroutine needs its own stack. That means you have to have a dynamic memory system baked into the language. And maybe garbage collection too.

> on today's badly designed CPUs, you have to think of coroutines as requiring stack swapping

I can't think of an implementation of yield (let alone general coroutines) that doesn't require a separate stack for each coroutine. I admit that I haven't learned very many of the stranger forgotten architectures that are out there, so I might be blinded by the limitations of a somewhat conventional experience.

But I'm also thinking it might even be provable that each coroutine needs its own stack: Think of a program that has m generator functions, where f_1 calls f_2, f_2 calls f_3, ..., f_{m-1} calls f_m. Each of these subroutines creates n copies of its next-level generator, and steps those generators and yields to the parent unpredictably (for example, depending on input from a user-supplied file). It seems like if m and n are large enough, you'll have no choice but to resort to swapping stacks.
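Python generators illustrate the point: CPython gives every suspended generator its own frame, which plays exactly the role of the per-coroutine stack (a small sketch with m = 3 levels and n = 2 copies per level; the function name is invented):

```python
# m levels of generators, each creating n = 2 sub-generators.
# Every suspended generator keeps its own frame alive, which is
# the per-coroutine "stack" in this argument.
def gen(depth, n=2):
    if depth == 0:
        yield 1
        return
    for child in [gen(depth - 1, n) for _ in range(n)]:
        for value in child:
            yield value

total = sum(gen(3))
# 2**3 = 8 leaf generators, each yielding 1, so total == 8
```

While the deepest leaf is suspended, a whole chain of frames above it is suspended too, and siblings created earlier may still be live, so no single contiguous stack can hold them all.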


> For coroutines, each coroutine needs its own stack. That means you have to have a dynamic memory system baked into the language. And maybe garbage collection too.

Why? Just statically give every co-routine its own stack.


The number of coroutine invocations, and the order in which they're cleaned up, could depend on user input. The former could be unbounded.

More formally, I'm pretty sure you can write a program that always halts, but for every pair of large numbers N, K > 0, there are at least K different input values that result in (1) at least N coroutine invocations being simultaneously active, and (2) for each of those K different input values, those invocations finish in a different order.


I'm afraid the argument that C+ASM is always faster is flawed in reality. Pure ASM with a bit of C thrown in, maybe, but that is just as impractical for complex codes as C itself is.

It is well known that for numerical codes Fortran beats the pants off C. Why is this? Because the structure of C programs proves difficult to optimise automatically. Indeed, the C committee attempted to address one of the main problems by introducing the restrict keyword (the problem, of course, is aliasing).

For complex codes, ASM isn't an option. For large functions, high levels of optimisation aren't an option for C, because C compilers are incapable of optimisation in a reasonable time frame: I have short code that cannot be compiled in less than 2 minutes on either gcc or clang. Full alias analysis requires data-flow analysis, which is cubic in function size, and C compilers are incapable of partitioning functions to keep the cost down.

Furthermore, C has a weak set of control operations and an object model which is generally inappropriate for modern software. K&R C compilers were hopeless because the ABI required the caller to push and pop arguments to support varargs, preventing tail-call optimisation of C function calls.

Subroutine calling is useful, but it is not the only control structure. Continuation passing, control exchange, and other fundamentals are missing from C. These things can always be emulated by turning your program into one large function, but then it isn't C, and it cannot be compiled, because the best C compilers available cannot optimise large functions.

Similarly, complex data structures which involve general graph shapes require garbage collection for memory management. With C that's not built in, so you have no choice but to roll your own (there is no other way to manage a graph). It's clear that modern copying collectors will beat the pants off C in this case.

C++ pushes the boundaries. It can trash C easily because it has more powerful constructions. It had real inlining before C, and whole-program compilation via templates. It is high enough level for lazy evaluators to perform high-level optimisations (expression templates) that C programmers could never dream of. And C++ virtual dispatch is bound to be more effective than roll-your-own OO in C once the program gets complex, because the C programmer will never get it right: the type system is too weak.

Many other languages generate C and have an FFI; some, like Felix, go much further and allow embedding. Indeed, any C++ program you care to write is a Felix program by embedding, so Felix is necessarily faster than C by the OP's argument: C++ is Felix's assembler.

As the compiler writer I have to tell you that the restriction to the weak C/C++ object model is a serious constraint. I really wish I could generate machine code to get around the C language. It's slow. It's hard to express useful control structures in. It tends to generate bad code. With separate compilation, bad performance is assured.

I am sorry, but the OP is just plain wrong. C is not assured to be faster; on the contrary, it's probably the worst language you could dream up in terms of performance. The evidence is in the C compilers themselves. They're usually written in C, and they're incapable of generating reasonable code in many cases, and impossible to improve, because C is such a poor language that all the brain power of hundreds of contributors cannot fix it.

Compare with the Ocaml compiler, written in Ocaml, which is lightning fast and generates reasonable code all the time: not as fast as C for micro-benchmarks, but don't even think about solving complex graph problems in C; the Ocaml GC (written in C) will easily trash a home-brew collection algorithm.

Compare with the ATS(2) compiler (ATS written in Ocaml, ATS2 in ATS), which by using dependent typing eliminates the need for the run-time checks that plague C programs, given the great difficulty of reasoning about the correctness of C code. ATS generates C, but you would never be able to hand-write that same C and also be confident your code was correct.

Compare with Felix, whose compiler is written in Ocaml and generates C++: it can do very high-level optimisations, can embed C++ more flexibly than a mere FFI, and provides some novel control structures (fibres, generators) which you'd never get right hand-coding in C.

The bottom line is that the OP's claim is valid only in a limited context. C is good for small functions where correctness is relatively easy to verify manually and optimisation is easy to do automatically, and any decent C-generating compiler for a high level language will probably produce C with comparable performance.

So the converse of the argument is true: good high level languages will trash C in all contexts other than micro tests, where they will do roughly the same.


The Rust compiler was also written in OCaml (until it could be self-hosted in Rust itself).


It's not actually the structure of C programs, it's the guarantees the language offers. So only about the first paragraph of your rant is right.

Fortran had almost no memory aliasing, explicit global accesses, and offered almost unbridled implementor freedom. As long as the operations got done, it didn't care what happened behind the scenes.

None of the rest of the things you talk about matter when it comes to optimizing C, to be honest. If you gave me no aliasing by default and explicit globals, I could probably do nearly as well as Fortran (though it would take significantly more analysis) in terms of loop transformations.

Note that "full alias analysis" is statically undecidable. When you say cubic order, you are thinking of Andersen's subtyping-based algorithm. There are unification-based algorithms that are almost linear time (inverse Ackermann).

At this point, we have scaled these algorithms almost as far as you can on a single computer. You can do context-insensitive Andersen's on many millions of LOC without too much trouble.

Context-insensitive unification-based points-to analysis can scale to whatever size you like.

Context-sensitive unification-based algorithms do quite well in practice on 10-million-plus-LOC codebases.

The main reason you don't see unification based algorithms used often in free compilers is because the entire set of algorithms are covered by patents owned by MS Research.

As a final note, note that C++ does not really help optimization in practice, it often hurts it.

It is very hard to teach a pointer analysis algorithm about virtual calling. Most compilers treat them like function pointers that get type-filtered, and do some form of class hierarchy analysis to limit the number of call graph targets, or try to incrementally discover the call graph. It's a bit of a mess.

On the other hand, straight function pointers get resolved either context-sensitively or insensitively.

In fact, C++ makes type based aliasing a lot worse due to placement new being able to legally change the type of a piece of memory, which is very hard to track over a program.

Even outside the realm of alias analysis, C++ involves a lot more structures, which means a lot more time has to be spent trying to get pieces back into scalars, or struct splitting, or something, so that you don't end up having to fuck around with memory every time you touch a piece of the structure.

I could go on and on.

In short: Any of C++'s lower level optimization advantages come from less pointer usage by programmers, not language guarantees.

At the high level, it's from better standard implementations and common usage idioms.

In any case, high level languages, particularly those with memory objects (not real pointers, just memory objects, like Java) usually solve none of the pointer/alias analysis related problems. You are still stuck with the same pointer analysis algorithms.

For example: The only nice thing about java's memory system is that doing structure-field sensitive pointer analysis can only help, whereas in C it can hurt, due to some weirdness.

It's just that nobody usually gets around to doing pointer analysis on the higher level languages (because it's harder and offers no particular benefit); they lower their language to an IR that already has a good algorithm in it.

Just in case you were wondering, I'm not talking out of my ass: I wrote GCC's first set of high-level loop optimizations, and also its pointer analysis.


Lots of interesting information. However, I think you over-reacted about Fortran. After reading the parent comment and your comment, it seems both of you are saying the same thing: Fortran has the advantage of having no pointer aliasing.

Regarding unification-based algorithms, does Microsoft use any of them in their F# compiler? I ask because from time to time they have tried to say they won't sue you over F# technology. Don't know how binding those sweet nothings are. Given your knowledge of compilers and legal systems, I am very curious to hear your opinion.


I don't know off hand if they use it. I know they do in static analysis tools. As nice as MS is, they seem to consider compilers solely a cost center. Their compilers produce "relatively good code", but have never really been state of the art.


Felix is built on an extensible parser which uses EBNF for the productions. The Felix "language" is defined in the library by such productions. The action codes of the grammar are arbitrary Scheme functions returning non-arbitrary s-expressions which map to Felix AST nodes.

The grammar is here:

http://felix-lang.org/lib/grammar

Close examination reveals even the regexps for literals are defined in the grammar. The parser is built on top of the excellent extensible GLR+ parsing tool Dypgen.

In principle the Felix parsing system is independent of Felix. All you need to do is replace the s-expression to Felix AST translator with some kind of pretty printer for s-expressions, even XML, and you can target anything.


Basically this is to support "expression"-like syntax such as people are used to in C; for example, i++ and x=y are expressions with side effects in C, and a lot of C idioms depend on such things. This kind of thing was originally not allowed, but I personally found it clumsy to manually split out the imperative parts from the functional parts, so I taught the compiler how to do it for me :)


But those expressions also return things. Why can't they be gen? Why would I use a proc instead? It seems like I must be missing something, but I can't see it.


As a whole program analyser, Felix does a lot of optimisations. The most important one is inlining. Felix pretty much inlines everything :)

When a function is inlined, there are two things you can do with the arguments: assign them to variables representing the parameters (eager evaluation) or just replace the parameters in the code with the arguments (lazy evaluation).

Substitution doesn't just apply to functions: a sequence of straight line code with assignments can be converted into an expression by replacing occurrences of the variables with the initialising expressions.

For a small number of uses, substitution is usually the most efficient. For many uses, lifting the common expression to a variable is more efficient. If we're dealing with a function (in C++ the model is a class) for which a closure is formed (in C++ the model is an object of the class), lazy evaluation is very expensive, because the argument itself must be wrapped in an object to delay evaluation.

By default, Felix vals and function arguments use indeterminate evaluation semantics, meaning the compiler gets to choose the strategy. This leads to high performance, but it also means we need a way to enforce a particular strategy: for example, vars and var parameters always trigger eager evaluation. This leads to some complication in the language.
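The eager/lazy trade-off can be sketched with thunks (a Python toy model, not Felix's actual mechanism; all names are invented):

```python
calls = []

def expensive():
    # Stand-in for an argument expression with an observable cost.
    calls.append(1)
    return 21

# Eager: evaluate the argument once and bind it to the parameter.
def twice_eager(x):
    return x + x

# Lazy (substitution): the expression re-runs at every use site,
# modelled here by passing a thunk.
def twice_lazy(thunk):
    return thunk() + thunk()

assert twice_eager(expensive()) == 42   # expensive() ran once
assert twice_lazy(expensive) == 42      # expensive() ran twice more
assert len(calls) == 3
```

With one use the thunk is free and saves a binding; with many uses it re-pays the cost each time, which is why the compiler wants the freedom to choose.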

Felix also does other optimisations, for example it does the usual self-tail call optimisation. This one works best if you do inlining at the right point to convert a non-self tail call (which cannot be represented for functions in C) into a self-tail call (which is replaced by a goto).

Felix also does parallel assignment optimisation.

It ensures type-classes have zero cost (unlike Haskell which, by supporting separate compilation, may have to pass dictionaries around).

There is quite a lot more: eliminating useless variables, functions, unused arguments, etc. There are even user specified optimisations based on semantics, such as

  reduce idem[T]  (x:list[T]) : list[T] = x.rev.rev => x;
which says reversing a list twice leaves the original list, so just get rid of these two calls.

Actually, one important aspect of the optimisation process: by default a function is a C++ class with an apply() method. This allows forming a closure (an object of the class). The object is usually allocated on the heap. However, Felix "knows" when it can get away with allocating such an object on the machine stack instead (saving a malloc and garbage collection). Furthermore, Felix "knows" when it can get away with a plain old C function, and generates one of those instead if it can. And all of that occurs only if the function wasn't entirely eliminated by inlining all the calls.

So although you should think of Felix functions and procedures as objects of C++ classes, allocated on the heap and garbage collected, any significant program implemented with this model without optimisations would just drop dead.


These are very good optimisations indeed. Inlining adds a lot more speed than people think. I don't understand why people want closures in their languages. You basically want a function that carries its own scope around... I don't get what the big deal is.


One needs closures, that is, functional values bound to some context, for higher order functions (HOFs). For example, in C++ much of the STL was pretty useless until C++11 added lambdas. Many systems provide for closures in C by requiring you to register callbacks as event handlers, and these callbacks invariably have an associated client-data pointer. A C callback function together with a pointer to an arbitrary client-data object is a hand-constructed closure.
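The (callback, client-data) pair and the language-level closure can be put side by side (a Python sketch; all names are made up):

```python
# C style: a free function plus an explicit client-data object,
# registered together as an event handler.
def on_event(client_data, value):
    client_data["total"] += value

state = {"total": 0}
handlers = [(on_event, state)]      # the (callback, client data) pair
for callback, data in handlers:
    callback(data, 5)

# Language-level closure: the context is captured implicitly.
def make_handler():
    total = [0]
    def handler(value):
        total[0] += value
    return handler, total

handler, captured = make_handler()
handler(5)
# state["total"] == 5 and captured[0] == 5: the same machinery,
# hand-built in the first case, supplied by the language in the second.
```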

The utility of closures derives from being able to split a data context into two pieces and program the two halves separately and independently. For example, for many data structures you can write a visitor function which accepts a closure that is called with each visited value. Let's say we have N data structures.

Independently, you can write different calculations on those values, formed incrementally one value at a time, such as addition or multiplication. Let's say we have M operations.

Now you can perform N * M distinct calculations whilst only writing N + M functions. You have achieved this because both halves of the computation are functions: lambda abstraction reduces a quadratic problem to a linear one.
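A minimal sketch of the N + M factoring (Python, with N = 2 visitors and M = 2 incremental calculations; all names are invented):

```python
# N = 2 visitor HOFs over different data structures...
def visit_list(xs, f):
    for x in xs:
        f(x)

def visit_tree(tree, f):
    value, children = tree
    f(value)
    for child in children:
        visit_tree(child, f)

# ...and M = 2 incremental calculations, each a closure over its
# accumulator. N + M = 4 functions give N * M = 4 combinations.
def summer():
    acc = [0]
    def step(x):
        acc[0] += x
    return acc, step

def multiplier():
    acc = [1]
    def step(x):
        acc[0] *= x
    return acc, step

total, f = summer()
visit_list([1, 2, 3], f)
# total == [6]

product, g = multiplier()
visit_tree((2, [(3, []), (4, [])]), g)
# product == [24]
```

Any visitor composes with any calculation, which is the quadratic-to-linear saving.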

The downside of this is that the abstraction gets in the way of the compiler re-combining the visitor HOF and the calculation function to generate more efficient code.

There's another more serious structural problem though. The client of a HOF is a callback. In C, this is very lame because functions have no state. In more advanced languages they can have state. That's an improvement but it isn't good enough because it's still a callback, and callbacks are unusable for anything complex.

A callback is a slave. The HOF that calls it is a master. Even if the context of the slave is a finite state machine, where there is theory which tells you how to maintain the state, doing so is very hard. What you really want is for the calculation part of the problem to be a master, just like the HOF itself is. You want your calculation to read its data, and maintain state in the first instance on the stack.

Many people do not believe this and answer that they have no problems with callbacks, but these people are very ignorant. Just try to write a simple program that is called with data from a file instead of reading the file. An operating system is just one big HOF that calls back into your program with stream data, but there's an important difference: both your program and the operating system are masters: your program reads the data; it isn't a function that accepts the data as an argument.

Another common example of master/master programming is the client/server paradigm. This is usually implemented with two threads or processes and a communication link.

Felix's special ability is high-performance control inversion. It allows you to write threads which read data, and it translates them into callbacks mechanically.

Some programming languages can do this in a limited context. For example, Python has iterators, and you can write functions that yield without losing their context, but this only works in the special case of a sequential visitor.
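For instance, a Python generator keeps its state on its own suspended frame between reads, instead of being re-entered as a stateless callback (a small sketch; the function name is invented):

```python
# The "thread" reads its data and keeps state on its own frame,
# rather than being a callback handed one value at a time.
def running_total():
    total = 0
    while True:
        total += yield total

g = running_total()
next(g)                      # advance to the first yield
outputs = [g.send(v) for v in [1, 2, 3]]
# outputs == [1, 3, 6]
```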

Felix can do this in general. It was actually designed to support monitoring half a million phone calls with threads that could perform complex calculations such as minimal cost routing. At the time no OS could come close to launching that number of pre-emptive threads yet Felix could do the job on a 1990's desktop PC (with a transaction rate around 500K/sec where a transaction was either a thread creation or sending a null message).


I'm learning a lot thank you. Felix is pretty impressive!

