As the title says: How do you properly test and benchmark different implementations of mutexes in C++?

Essentially, I wrote my own std::mutex-like class for a project running on a 2-core ARMv7 system, with the aim of minimizing the overhead in the uncontended case. Now I'm considering using this mutex in more places and on other architectures, but before I do, I'd like to make sure that

 - it is actually correct
 - there aren't any pathological cases in which it performs much worse than a standard std::mutex.

Obviously, I wrote a few basic unit tests and micro-benchmarks and everything seems to work, but in multi-threaded code "seems to work" doesn't give me great comfort.

 - So, are there any established static or dynamic analysis techniques? 
 - What are common pitfalls when writing unit tests for mutex classes? 
 - What are typical edge cases one should look out for (performance-wise)?

I'm only using standard library types for the implementation, which includes non-sequentially-consistent load and store operations on atomics. However, I'm mainly interested in implementation-agnostic advice, since I'd like to use the same test harness for other implementations, too.