I own and run an NEC VectorEngine 10C, a PCIe accelerator with a proprietary ISA. Among its many fun instructions is VRSQRT, which computes an approximation of the inverse square root 1/sqrt(x) for 16384 bits of floating-point numbers at a time (either 256 double-precision or 512 single-precision values). According to the specification, the exact result of the computation is implementation-defined.
For the 32-bit variant, a dump of the instruction's output for every possible input fits in RAM (<18 gigabytes). For the 64-bit variant, it obviously doesn't.
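To spell out the arithmetic behind that size, here is a minimal sketch. It assumes the dump stores only the raw output word for each input bit pattern, indexed by the input; that layout and the helper name `dump_size_bytes` are my own choices for illustration, not anything mandated by the hardware.

```python
def dump_size_bytes(input_bits: int, output_bytes: int) -> int:
    """Size of a table holding one output word per possible input bit pattern.

    Assumed layout: entry i holds the result for the input whose bits are i.
    """
    return (1 << input_bits) * output_bytes


size32 = dump_size_bytes(32, 4)   # single precision: 2**32 entries of 4 bytes
size64 = dump_size_bytes(64, 8)   # double precision: 2**64 entries of 8 bytes

print(f"32-bit dump: {size32} bytes = {size32 / 2**30:.0f} GiB")
print(f"64-bit dump: {size64} bytes = 2**{size64.bit_length() - 1} bytes")
```

Under that assumption the single-precision table is 16 GiB (about 17.2 GB), while the double-precision table would need 2**67 bytes.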
How can I obtain a computer program that emulates this implementation? Ideally, I would like a computer to find the emulation for me. However, an actionable description of a manual approach is also acceptable, and if required I can provide remote access to the hardware.
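For concreteness, this is roughly how I would expect to check any candidate emulation against the single-precision dump. It is only a sketch under stated assumptions: `candidate_rsqrt_bits` is a naive baseline (1/sqrt computed in double precision, then rounded to binary32) and not a guess at what the hardware does, and `fake_hardware` is a made-up stand-in for a lookup into the real dump. It is restricted to positive, finite, nonzero inputs.

```python
import math
import struct


def f32_to_bits(x: float) -> int:
    """Round a Python float to binary32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def candidate_rsqrt_bits(in_bits: int) -> int:
    """Naive baseline candidate: 1/sqrt(x) in double, rounded to binary32.

    Only valid for positive, finite, nonzero inputs.
    """
    x = bits_to_f32(in_bits)
    return f32_to_bits(1.0 / math.sqrt(x))


def fake_hardware(in_bits: int) -> int:
    """HYPOTHETICAL stand-in for a dump lookup: baseline with low 8 bits cleared."""
    return candidate_rsqrt_bits(in_bits) & 0xFFFFFF00


def compare(candidate, hardware, inputs):
    """Return (mismatches, worst), where worst is the largest bit-pattern distance.

    For positive finite results, the integer distance between bit patterns
    equals the distance in units in the last place (ULPs).
    """
    mismatches = []
    worst = 0
    for in_bits in inputs:
        got = candidate(in_bits)
        want = hardware(in_bits)
        if got != want:
            mismatches.append((in_bits, got, want))
            worst = max(worst, abs(got - want))
    return mismatches, worst


# A few sample inputs; a real run would iterate over the whole dump.
samples = [f32_to_bits(v) for v in (0.25, 1.0, 2.0, 3.0, 4.0, 1e10)]
mismatches, worst = compare(candidate_rsqrt_bits, fake_hardware, samples)
for in_bits, got, want in mismatches:
    print(f"in={in_bits:08x} candidate={got:08x} hardware={want:08x}")
print(f"{len(mismatches)} mismatches, worst distance {worst} ULP")
```

With the real dump in place of `fake_hardware`, an emulation would count as exact once `compare` reports no mismatches over all positive finite inputs; the special cases (zeros, negatives, infinities, NaNs, subnormals) would need to be checked separately.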