Benchmarking Optimization Levels for Emacs and Native Compiler

The native compiler in Emacs has just gained a new option, native-comp-compiler-options. With it, we can pass the same optimization and code-generation options to the native compiler that we would normally pass to GCC via command-line switches. The format is a list of strings:

(setq native-comp-compiler-options '("-O2" "-march=skylake" "-mtune=native"))

Other GCC options are of course available, but which options should we use, and which ones make sense?

First, a note about the example above: the -march flag should accept native as a value, which should resolve to the architecture of the CPU at hand. However, there seems to be a bug: libgccjit reports native as an unrecognized option. I don’t know why, but using the explicit name of the CPU architecture works. To find out which architecture you have, you can use this command (on GNU/Linux systems):

cat /sys/devices/cpu/caps/pmu_name
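The same lookup can be done from Lisp. What follows is only a minimal sketch: the my/ helpers are hypothetical names, not part of Emacs, and it assumes that the name reported by the kernel is also a valid GCC -march value, which is not guaranteed on every CPU.

```elisp
(require 'subr-x) ; string-trim and string-empty-p on older Emacs versions

;; Hypothetical helpers; nothing here ships with Emacs.
(defun my/cpu-pmu-name (&optional file)
  "Return the trimmed contents of FILE, or nil if it is not readable.
FILE defaults to the sysfs path used in the command above."
  (let ((file (or file "/sys/devices/cpu/caps/pmu_name")))
    (when (file-readable-p file)
      (with-temp-buffer
        (insert-file-contents file)
        (string-trim (buffer-string))))))

(defun my/native-comp-options (arch)
  "Build an option list for ARCH, falling back to plain -O2 if ARCH is nil."
  (if (and arch (not (string-empty-p arch)))
      (list "-O2" (concat "-march=" arch))
    (list "-O2")))

;; Assumption: the reported name is accepted by GCC as a -march value.
(setq native-comp-compiler-options
      (my/native-comp-options (my/cpu-pmu-name)))
```

On systems without that sysfs file the sketch simply falls back to "-O2" alone.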

So which optimizations should we use, and where does it matter? To be honest, I have no idea, so I have been testing a bit over the last two days. I compiled both Emacs and the Lisp files with different optimization levels and tried to come up with tests that reflect real-life use cases. It is easy to come up with synthetic micro-benchmarks, such as creating big arrays or lists and shuffling data around, but they don’t really reflect how Emacs performs when we use it. Of course, no benchmark can truly replace running an application and profiling it.

To start with, I used Plato’s dialogues for some tests, mostly search and replace in buffers and strings. I will test insertions and deletions another day. It’s all very simple so far; I am not really interested in how the tests perform per se, or how they relate to each other, only in whether there is any difference in performance when Emacs and the Lisp files are compiled with different optimization levels. Four tests work on a buffer, including count-names-freq, count-socrates and count-words-in-dialogues. Two work with strings: replace-republic replaces some words in a string, and reverse-republic tests some insertion into a buffer and a hash table.

I also have a test where I search through the entire Emacs Lisp sources and the sources of all installed packages and collect symbol names for macros, functions, and variables. That tests some file parsing and searching in strings, but it is probably bound mostly by I/O.

It is not a very extensive benchmark suite, but it does offer a glance into how Emacs performs. It seems that -O3 (and higher) produces slightly slower code. I don’t know why, but the measurements show so. (And Eli said that :-)) I did try all kinds of crazy optimizations, loop vectorization, unrolling etc., and in some cases they seem to give a slight boost, but in most cases -O2 seems to be the sweet spot. Tuning for the current CPU also seems to help, but not by much. For measurements, I am using benchmark-run, a macro built into Emacs. I am not sure how accurate its time measurements are. Results vary quite a lot, so I am rather inclined to believe that I am measuring more system noise than real differences in performance.
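For readers who have not used benchmark-run, here is a minimal sketch of a single measurement. The workload my/bench-sample is a made-up placeholder, not one of the tests from this post; benchmark-run itself returns a list of elapsed time, number of garbage collections, and time spent in garbage collection.

```elisp
(require 'benchmark)

;; Hypothetical workload, used only to have something to measure.
(defun my/bench-sample ()
  "Build a list of 100000 integers and reverse it."
  (let ((acc nil))
    (dotimes (i 100000)
      (push i acc))
    (nreverse acc)))

;; Run the workload 10 times; the result has the shape
;; (ELAPSED-SECONDS GC-COUNT GC-SECONDS), totalled over all repetitions.
(defvar my/bench-result (benchmark-run 10 (my/bench-sample)))
```

Repeating the same call a few times and comparing the numbers gives a rough feeling for how much of the variation is noise.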

What is certain is that the garbage collector has quite an impact on performance.

For example, the benchmark with strings at -O2:

    replace-republic with-gc-on  | 2.7602 210 2.233186564
    replace-republic with-gc-off | 0.6242 0 0.0
    replace-republic with-gc-on  | 2.8143 200 2.303998561
    replace-republic with-gc-off | 0.5841 0 0.0

The first number is the running time in seconds, the second is the number of garbage collections, and the third is the time spent in garbage collection.
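A minimal sketch of how such a comparison could be set up is shown below. It is not the code behind the numbers above: the my/ macros and the workload are hypothetical, and the sketch assumes that "GC off" means binding gc-cons-threshold to a very large value for the duration of the run.

```elisp
(require 'benchmark)

;; Hypothetical macros; the real with-gc-on/off may be written differently.
(defmacro my/with-gc-off (name &rest body)
  "Run BODY once with GC effectively disabled; return (NAME . RESULT)."
  (declare (indent 1))
  `(let ((gc-cons-threshold most-positive-fixnum))
     (cons ,name (benchmark-run 1 ,@body))))

(defmacro my/with-gc-on (name &rest body)
  "Run BODY once with the current GC settings; return (NAME . RESULT)."
  (declare (indent 1))
  `(progn
     (garbage-collect) ; start each run from a freshly collected heap
     (cons ,name (benchmark-run 1 ,@body))))

;; Hypothetical string-heavy workload that allocates many short-lived strings.
(defun my/string-churn ()
  (let ((s (make-string 500 ?a)))
    (dotimes (_ 500)
      (replace-regexp-in-string "a" "b" s))))

(defvar my/gc-on-result  (my/with-gc-on  'string-churn (my/string-churn)))
(defvar my/gc-off-result (my/with-gc-off 'string-churn (my/string-churn)))
```

Each result has the shape (NAME ELAPSED GC-COUNT GC-TIME), so the two runs can be compared field by field.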

What I also see is that buffer operations do not trigger as much garbage collection as string operations. Another interesting thing is that my personal setup is slightly slower than emacs -Q with garbage collection off, even though these benchmarks do not use any third-party packages. That is especially visible when collecting symbols from the Lisp sources, at all optimization levels.

In general, I don’t see much of a difference in performance regardless of optimization level. -O3 and -Ofast do seem to perform slightly slower than -O2 in all cases but the one with strings (the replace-republic benchmark). The difference there is also so small that I am not really sure whether it is just noise or a true difference.

The code is available on my GitHub, as well as the report. I am sorry that the report is tedious to look at; it is half machine-generated, half manually prettified. In essence, I am just writing what benchmark-run returns to a file. It would need better formatting, but I don’t care enough to implement it. If you are interested in running the benchmarks on your computer, the entry points are the macros with-gc-on and with-gc-off. They return a list with benchmark results: the first item is the macro name, and the rest are the results as returned by benchmark-run. Results can be appended to a file with the function (repport-benchmark report filename), where report is a list of results returned from the mentioned macros and filename is the file to append to.
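To illustrate the idea of appending results to a file, here is a minimal sketch. The function my/append-report is a hypothetical stand-in, not the function from the repository, and it assumes each entry is a list shaped like (NAME ELAPSED GC-COUNT GC-TIME).

```elisp
;; Hypothetical helper; the repository's own function may format differently.
(defun my/append-report (report filename)
  "Append each entry of REPORT to FILENAME, one printed list per line."
  (with-temp-buffer
    (dolist (entry report)
      (prin1 entry (current-buffer))
      (insert "\n"))
    ;; A fourth argument of t makes write-region append instead of overwrite;
    ;; the fifth argument suppresses the "Wrote file" message.
    (write-region (point-min) (point-max) filename t 'silent)))
```

Calling it repeatedly with the same file name accumulates one line per benchmark entry, which keeps the output easy to post-process later.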

If someone has a nice benchmark that reflects a real use case, I would be happy to include it, or to read about it.