The CNMeM library is a "simple library to help the Deep Learning frameworks manage CUDA memory."
CNMeM has been reported to yield noticeable speed improvements, and it is supported by Theano, Torch, and Caffe. TensorFlow, however, takes a different approach to memory management: unlike Theano, Torch, and Caffe, it preallocates most of the available GPU memory when a session starts.
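For reference, this preallocation behavior can be adjusted through the session configuration in the TensorFlow 1.x API; the sketch below shows the relevant options (whether these interact with CNMeM at all is exactly what is in question):

```python
import tensorflow as tf

# By default, TensorFlow maps nearly all GPU memory at session start.
# These gpu_options change that behavior (TF 1.x API):
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow allocations as needed instead of up front
# Alternatively, cap the fraction of total GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)
```

This is a configuration sketch rather than a benchmark; it requires a TensorFlow 1.x installation with GPU support to run.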
Given that, does using CNMeM alongside a TensorFlow-based program provide any benefit (e.g., a reduction in running time)?