NxD Inference#

This section contains the technical documentation specific to the NxD Inference library included with the Neuron SDK.

What is NxD Inference?#

NxD Inference (NeuronX Distributed Inference) is an ML inference library included with the Neuron SDK that simplifies deploying deep learning models on AWS Inferentia and Trainium instances. It offers advanced features such as continuous batching and speculative decoding for high-performance inference, and supports popular models like Llama-3.1, DBRX, and Mixtral.

With NxD Inference, developers can:

  • Deploy production-ready LLMs with minimal configuration (see the sketch after this list)

  • Leverage optimizations like KV Cache, Flash Attention, and Quantization

  • Distribute large models across multiple NeuronCores using Tensor and Sequence Parallelism

  • Integrate with vLLM for seamless production deployment

  • Customize and extend models with a modular design approach
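
To make the deployment flow above concrete, here is a condensed sketch adapted from the Llama example in the NxD Inference documentation. The module paths follow the `neuronx_distributed_inference` package, but the model paths, `tp_degree`, and sequence settings are illustrative assumptions that may differ across Neuron SDK versions; see the API Reference below for authoritative signatures.

```python
from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Illustrative placeholder paths.
model_path = "/home/ubuntu/models/Llama-3.1-8B"
compiled_model_path = "/home/ubuntu/traced_model/Llama-3.1-8B"

# NeuronConfig controls how the model runs on Neuron hardware;
# tp_degree shards the model across NeuronCores via tensor parallelism.
neuron_config = NeuronConfig(
    tp_degree=32,
    batch_size=1,
    seq_len=1024,
)
config = LlamaInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
)

# Compile the model for Neuron and save the compiled artifacts.
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_model_path)

# Load the compiled artifacts onto the Neuron device.
model = NeuronLlamaForCausalLM(compiled_model_path)
model.load(compiled_model_path)
```

Compiling once and reloading the saved artifacts afterwards avoids recompilation on every run; `tp_degree` is the setting that distributes the model across multiple NeuronCores.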

Use vLLM for Inference#

Neuron recommends that you use vLLM when building your inference applications. Read more about Neuron’s integration with vLLM here: vLLM on Neuron
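
As a minimal illustration of that integration, the sketch below runs offline batch inference through vLLM’s standard Python API. It assumes a Neuron-enabled vLLM installation (see the quickstarts below for setup); the model ID and parallelism settings are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Standard vLLM sampling controls.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Model ID and settings are illustrative; tensor_parallel_size should
# match the number of NeuronCores to shard the model across.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=2048,
    max_num_seqs=4,
    tensor_parallel_size=2,
)

# vLLM batches the prompts and returns one completion per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```

For online serving, the same engine is exposed through vLLM’s OpenAI-compatible server; the first quickstart below walks through that path.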

Quickstarts#

Quickstart: Serve models online with vLLM on Neuron

Get started serving models online with vLLM. Time to complete: ~20 minutes.

Quickstart: Run offline inference with vLLM on Neuron

Get started running offline inference with vLLM. Time to complete: ~20 minutes.

NxD Inference documentation#

Overview

Learn about NxD Inference architecture, key features, and how it can help you deploy models efficiently on AWS Neuron hardware.

Setup

Step-by-step instructions for setting up NxD Inference using DLAMI, Docker containers, or manual installation.

Get Started with Models

Deploy production-ready models like Llama 3, DBRX, and Mixtral with optimized configurations for AWS Neuron hardware.

Tutorials

Hands-on tutorials for deploying various models, including Llama 3 variants, multimodal models, and using advanced features like speculative decoding.

Developer Guides

In-depth guides for model onboarding, feature integration, vLLM usage, benchmarking, and customizing inference workflows.

API Reference

Comprehensive API documentation for integrating NxD Inference into your applications and customizing inference behavior.

Application Notes

Detailed application notes on parallelism strategies and other advanced topics for optimizing inference performance.

Misc Resources

Release notes, troubleshooting guides, and other helpful resources for working with NxD Inference.