Have you tried Enzyme (https://enzyme.mit.edu/)? It operates on LLVM IR, so it's available in any language that lowers to LLVM (e.g., Julia, where I've used it for surface gradients), and it produces highly optimized AD code. Pretty cool stuff.
Yeah, I've used it (cool project indeed!), though mostly in a project that I and others in the autodiff community maintain, which benchmarks many different autodiff tools against each other: https://github.com/gradbench/gradbench
Randomized numerical linear algebra has proven very useful as well. It allows you to use a black-box function implementing matrix-vector multiplication (MVM) to compute standard decompositions like SVD, QR, etc. Very useful when MVM is O(N log N) or better.
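To make that concrete, here's a minimal sketch of a randomized SVD in the style of Halko, Martinsson, and Tropp, driven entirely by black-box matvec routines. Note the assumptions: the function name `randomized_svd` and the `mvm`/`rmvm` interface are illustrative, and I'm assuming you can also apply the transpose (`rmvm`), which this flavor of the algorithm needs to form the projected matrix.

```python
import numpy as np

def randomized_svd(mvm, rmvm, n, k, p=10, seed=None):
    """Sketch of a randomized truncated SVD of an (m, n) matrix A,
    accessed only through black-box products:
      mvm(x)  -> A @ x     (x has length n)
      rmvm(y) -> A.T @ y   (y has length m)
    k is the target rank; p is oversampling. Names are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Sketch the range of A by hitting it with random test vectors.
    Omega = rng.standard_normal((n, k + p))
    Y = np.column_stack([mvm(Omega[:, j]) for j in range(k + p)])
    # Orthonormal basis Q for the sketched range.
    Q, _ = np.linalg.qr(Y)
    # Project A onto that basis: B = Q.T @ A, built column by column
    # via the transpose matvec (each rmvm call gives one row of B).
    B = np.column_stack([rmvm(Q[:, j]) for j in range(Q.shape[1])]).T
    # Dense SVD of the small (k+p, n) matrix, then lift U back up.
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]
```

The point is that the cost is just O(k) calls to `mvm`/`rmvm` plus work on small matrices, so when your operator is a structured matrix (FFT-like, sparse, Kronecker, etc.) with O(N log N) products, you get near-optimal low-rank factorizations without ever materializing A.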