Optimize AttributeBuffer to OutputVertex conversion
First I unrolled the inner loop, then I moved semantics validation
out of the hot loop.
I also added overflow slots to avoid conditional branches.
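
A minimal sketch of the idea, not the code from this commit. Every name
and size below (kNumRegs, kNumSlots, Sanitize, Convert, and so on) is
hypothetical and chosen only for illustration. Invalid semantics are
redirected once to an extra "overflow" slot, so the per-vertex loop can
write unconditionally:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes, for illustration only.
constexpr std::size_t kNumRegs = 7;         // attribute registers per vertex
constexpr std::size_t kComps = 4;           // components per register
constexpr std::size_t kNumSlots = 24;       // valid output slots
constexpr std::uint8_t kOverflowSlot = 24;  // dump slot for invalid semantics

using SemanticMap = std::array<std::array<std::uint8_t, kComps>, kNumRegs>;
using Attributes = std::array<std::array<float, kComps>, kNumRegs>;

struct OutputSlots {
    // One extra element absorbs writes whose semantic was invalid.
    std::array<float, kNumSlots + 1> v{};
};

// Validation, run once when the semantic map changes rather than per vertex.
// Any out-of-range semantic is redirected to the overflow slot.
SemanticMap Sanitize(const SemanticMap& raw) {
    SemanticMap clean{};
    for (std::size_t r = 0; r < kNumRegs; ++r) {
        for (std::size_t c = 0; c < kComps; ++c) {
            clean[r][c] = raw[r][c] < kNumSlots
                              ? raw[r][c]
                              : kOverflowSlot;
        }
    }
    return clean;
}

// Hot path: no range checks and no branches on the semantic value.
OutputSlots Convert(const Attributes& in, const SemanticMap& clean) {
    OutputSlots out;
    for (std::size_t r = 0; r < kNumRegs; ++r) {
        // Inner loop over the four components, unrolled by hand.
        out.v[clean[r][0]] = in[r][0];
        out.v[clean[r][1]] = in[r][1];
        out.v[clean[r][2]] = in[r][2];
        out.v[clean[r][3]] = in[r][3];
    }
    return out;
}
```

The cost is one wasted float per output vertex. In exchange, the hot
loop has no data-dependent branch for the compiler or CPU to deal with.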
Super Mario 3D Land's intro runs at almost full speed when compiled with
Clang, and there's a noticeable speed increase with MSVC. GCC hasn't been
tested, but I'm confident in its ability to optimize this code.