NxD Inference#
This section contains the technical documentation specific to the NxD Inference library included with the Neuron SDK.
What is NxD Inference?#
NxD Inference (NeuronX Distributed Inference) is an ML inference library included with the Neuron SDK that simplifies deploying deep learning models on AWS Inferentia and Trainium instances. It offers advanced features like continuous batching and speculative decoding for high-performance inference, and supports popular models like Llama-3.1, DBRX, and Mixtral.
With NxD Inference, developers can:
Deploy production-ready LLMs with minimal configuration (see the sketch after this list)
Leverage optimizations like KV caching, flash attention, and quantization
Distribute large models across multiple NeuronCores using Tensor and Sequence Parallelism
Integrate with vLLM for seamless production deployment
Customize and extend models with a modular design approach
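As a rough illustration of what this workflow looks like in practice, the sketch below follows the shape of the Llama example in the NxD Inference documentation: configure parallelism, compile the model for Neuron, load the compiled artifacts, and generate. The module paths, class names (NeuronConfig, LlamaInferenceConfig, NeuronLlamaForCausalLM, HuggingFaceGenerationAdapter), method signatures, and file paths are assumptions to verify against your installed neuronx-distributed-inference version, not a definitive recipe.

```python
# Minimal NxD Inference deployment sketch, following the Llama example in the
# NxD Inference docs. Module paths, class names, and arguments are assumptions
# to verify against your installed neuronx-distributed-inference version.
from transformers import AutoTokenizer

from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import (
    HuggingFaceGenerationAdapter,
    load_pretrained_config,
)

model_path = "/home/ubuntu/models/Llama-3.1-8B"     # placeholder checkpoint path
compiled_path = "/home/ubuntu/traced_models/llama"  # placeholder artifact path

# Shard the model across 32 NeuronCores with tensor parallelism.
neuron_config = NeuronConfig(tp_degree=32, batch_size=1, seq_len=2048)
config = LlamaInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
)

# Compile the model for Neuron once, then load the compiled artifacts.
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_path)
model.load(compiled_path)

# Generate through the HuggingFace-style adapter.
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, what is deep learning?", return_tensors="pt")
generation_model = HuggingFaceGenerationAdapter(model)
outputs = generation_model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```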
Use vLLM for Inference#
Neuron recommends that you use vLLM when serving models for inference. Read more about Neuron’s integration with vLLM here: vLLM on Neuron
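For orientation, a minimal sketch of serving a model through vLLM's offline API on Neuron is shown below. The device="neuron" argument, the model id, and the parallelism settings are assumptions drawn from the vLLM Neuron examples; the exact supported arguments depend on the vLLM and Neuron versions you install, so treat this as illustrative rather than definitive.

```python
# Sketch of offline inference with vLLM on Neuron. The device="neuron"
# argument and parallelism settings are assumptions to verify against the
# vLLM build you install for Neuron.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # placeholder model id
    device="neuron",                  # route execution to NeuronCores
    tensor_parallel_size=32,          # shard across 32 NeuronCores
    max_num_seqs=4,                   # continuous-batching queue depth
    max_model_len=2048,
)

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
for output in llm.generate(["The key benefit of speculative decoding is"], sampling):
    print(output.outputs[0].text)
```

Running through vLLM gives you continuous batching and an OpenAI-compatible server out of the box, while NxD Inference handles the model execution on NeuronCores underneath.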