I would actually read the original headers and look up the assembly instructions for the target machine, because if we need to optimize these, guessing is not enough.
To me, adding your own implementations only makes sense if the added complexity (maintenance, cognitive load, obfuscation) matters less than the performance gain.
Otherwise I would look into other compilers and at optimizations available from outside tooling, such as inlining the instructions or NFA-to-DFA transformations.