Got myself a few months ago into the optimization rabbit hole as I had a slow quant finance library to take care of, and for now my most successful optimizations are using local memory allocators (see my C++ post, I also played with mimalloc which helped but custom local memory allocators are even better) and rethinking class layouts in a more “data-oriented” way (mostly going from array-of-structs to struct-of-arrays layouts whenever it’s more advantageous to do so, see for example this talk).
What are some of your preferred optimizations that yielded sizeable gains in speed and/or memory usage? I realize that many optimizations aren’t necessarily specific to any given language so I’m asking in !programming@programming.dev.
I was playing bloons td back when it was flash, in firefox. It was sometimes too slow. So i fired up perf and found out what horrors flash player was doing with memcpy. One byte memcpy, completely unaligned memcpy.
So i wrote an ssse3 memcpy that could do one byte unaligned with xmm registers. It was 30% faster then whatever glibc was doing and made the game playable. Was planing to submit it to glibc, but they came up with something different that was just as fast.