This avoids unnecessarily copying the canary when doing a realloc from a
small size to a large size. It also avoids trying to copy a non-existent
canary out of a zero-size allocation, which are memory protected.
This avoids making a huge number of getrandom system calls during
initialization. The init CSPRNG is unmapped before initialization
finishes and these are still reseeded from the OS. The purpose of the
independent CSPRNGs is simply to avoid the massive performance hit of
synchronization and there's no harm in doing it this way.
Keeping around the init CSPRNG and reseeding from it would defeat the
purpose of reseeding, and it isn't a measurable performance issue since
it can just be tuned to reseed less often.
The initial implementation was a temporary hack rather than a serious
implementation of random arena selection. It may still make sense to
offer it but it should be implemented via the CSPRNG instead of this
silly hack. It would also make sense to offer dynamic load balancing,
particularly with sched_getcpu().
This results in a much more predictable spread across arenas. This is
one place where randomization probably isn't a great idea because it
makes the benefits of arenas unpredictable in programs not creating a
massive number of threads. The security benefits of randomization for
this are also quite small. It's not certain that randomization is even a
net win for security since it's not random enough and can result in a
more interesting mix of threads in the same arena for an attacker if
they're able to attempt multiple attacks.
This extends the size class scheme used for slab allocations to large
allocations. This drastically improves performance for many real world
programs using incremental realloc growth instead of using proper growth
factors. There are 4 size classes for every doubling in size, resulting
in a worst case of ~20% extra virtual memory being reserved and a huge
increase in performance for pathological cases. For example, growing
from 4MiB to 8MiB by calling realloc in increments of 32 bytes will only
need to do work beyond looking up the size 4 times instead of 1024 times
with 4096 byte granularity.