Work Progress Report: H100 DeepSeek-R1-70B Quantized Model Setup


Project Overview

Model: RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
Hardware: NVIDIA H100 GPU
Environment: Ubuntu 22.04 with CUDA 12.1
Quantization: w4a16 (4-bit weights, 16-bit activations)
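
As a sanity check on why this model fits on a single 80 GB H100, here is a rough back-of-envelope estimate of the weight footprint (a sketch that ignores quantization scales, activations, and the KV cache):

# Back-of-envelope weight memory for a 70B model at 4-bit precision
params = 70e9                # parameter count
bits_per_weight = 4          # w4a16: 4-bit weights, 16-bit activations
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~35 GB before overhead

This lines up with the 35-45 GB VRAM estimate in the specifications below.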

Project Phases Progress

Phase | Component              | Status  | Notes
------|------------------------|---------|------------------------------------------------------
1     | Base Infrastructure    | DONE    | System updates, CUDA installation, Python environment
2     | Model Download & vLLM  | DONE    | Model acquisition and inference engine setup
3     | Core Services          | PENDING | FastAPI server and model serving infrastructure
4     | RAG Pipeline           | PENDING | Retrieval-Augmented Generation implementation
5     | Agent Framework        | PENDING | Intelligent agent capabilities
6     | UI & API               | PENDING | User interface and API endpoints
7     | Integration & Testing  | PENDING | System integration and performance validation

Phase 1: Base Infrastructure Setup

System Prerequisites

System Updates and Base Packages

# Update system packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y git curl wget tmux build-essential software-properties-common
sudo apt install -y python3.11 python3.11-venv python3-pip python3.11-dev
sudo apt install -y htop nvtop tree jq unzip

NVIDIA Driver and CUDA Installation

# Verify current NVIDIA driver
nvidia-smi

# Install CUDA Toolkit 12.1 for H100 optimization
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1
# Verify the toolkit installation
nvcc --version

# Configure CUDA environment paths
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.profile
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.profile
echo 'export CUDA_HOME=/usr/local/cuda-12.1' >> ~/.profile
source ~/.profile  # apply the new paths to the current shell session

H100 GPU Verification

# Check H100 detection and specifications
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
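
On an H100 this should report the device name, roughly 80 GB of total memory, and compute capability 9.0.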

Python Environment Setup

Project Directory and Virtual Environment

# Create dedicated project directory
mkdir -p /workspace/h100-deepseek-70b-setup
cd /workspace/h100-deepseek-70b-setup

# Create and activate a Python 3.11 virtual environment
python3.11 -m venv aladin_venv
source aladin_venv/bin/activate

# Upgrade core Python packages
pip install --upgrade pip setuptools wheel build

PyTorch Installation with H100 Support

# Install PyTorch 2.3.0 with CUDA 12.1 support
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

# Install NVIDIA monitoring libraries
pip install nvidia-ml-py3 pynvml
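
As a quick check that the monitoring stack works, a minimal pynvml sketch (assuming a single GPU at index 0):

import pynvml

# Initialize NVML and query the first GPU
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

name = pynvml.nvmlDeviceGetName(handle)
if isinstance(name, bytes):  # older pynvml releases return bytes
    name = name.decode()

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"{name}: {mem.used / 1024**3:.1f} GB used of {mem.total / 1024**3:.1f} GB")

pynvml.nvmlShutdown()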

vLLM and Quantization Dependencies

Core vLLM Installation

# Install vLLM inference engine
pip install vllm==0.4.3

# Install Ray for distributed processing
pip install ray[default]==2.9.3
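
Once the model is downloaded in Phase 2, a minimal offline smoke test could look like the sketch below. It assumes the installed vLLM version recognizes the checkpoint's w4a16 quantization from the model config; max_model_len and gpu_memory_utilization are illustrative starting values.

from vllm import LLM, SamplingParams

# vLLM normally detects the quantization scheme from the model's config
llm = LLM(
    model="RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16",
    max_model_len=4096,           # keep the KV cache small for a first test
    gpu_memory_utilization=0.90,  # fraction of H100 VRAM vLLM may claim
)

params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Briefly explain what w4a16 quantization means."], params)
print(outputs[0].outputs[0].text)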

Quantization Support Libraries

# GPTQ quantization support
pip install auto-gptq==0.7.1

# Hugging Face optimization tools
pip install optimum==1.17.1

# Efficient CUDA routines for quantized operations
pip install bitsandbytes==0.42.0

Performance Optimization Libraries

# Flash Attention for memory efficiency
pip install flash-attn --no-build-isolation

# Core ML libraries
pip install transformers==4.41.1
pip install accelerate==0.26.1

API Framework Installation

FastAPI and Server Components

# Web framework for model serving
pip install fastapi==0.109.0
pip install uvicorn[standard]==0.27.0
pip install uvloop httptools

# Additional utilities
pip install psutil requests aiofiles
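
Phase 3 will wire the model into this server; as a placeholder, a minimal FastAPI skeleton for the serving endpoint might look like the following (the endpoint name and request shape are assumptions, not the final API design):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DeepSeek-R1-70B Serving API")

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
async def generate(req: GenerateRequest):
    # Placeholder response; Phase 3 connects this to the vLLM engine
    return {"prompt": req.prompt, "completion": "(model not connected yet)"}

@app.get("/health")
async def health():
    return {"status": "ok"}

Run it with uvicorn app:app --host 0.0.0.0 --port 8000 (module name assumed).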

Installation Verification

PyTorch and H100 Integration Test
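
Save the following check as verify_gpu.py (filename assumed) and run it inside the virtual environment: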

import torch

print('PyTorch & H100 Status:')
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'Device count: {torch.cuda.device_count()}')

# Report specifications for each visible GPU
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    memory_gb = props.total_memory / 1024**3
    print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
    print(f'Memory: {memory_gb:.1f} GB')
    print(f'Compute: {props.major}.{props.minor}')

    # Rough 70B-model compatibility check (w4a16 quantized needs ~35-45 GB)
    if memory_gb >= 70:
        print('Can handle 70B quantized model')
    elif memory_gb >= 40:
        print('May need optimization for 70B model')
    else:
        print('Insufficient memory for 70B model')

Model Download Preparation

Model Download Script

#!/usr/bin/env python3
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

def download_model():
    model_name = "RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16"

    print(f"Downloading {model_name}")
    print("This will take 15-30 minutes...")

    try:
        # Download the tokenizer first as a quick connectivity check
        print("Downloading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

        # Download all model files into the local Hugging Face cache
        print("Downloading model files...")
        local_dir = snapshot_download(
            repo_id=model_name,
            cache_dir=None,
            resume_download=True,
        )

        print(f"Download complete! Files saved to: {local_dir}")
        print("Model ready for use with vLLM or other inference engines")

    except Exception as e:
        print(f"Failed: {e}")
        print("Try: huggingface-cli login")

if __name__ == "__main__":
    download_model()
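
Save the script as download_model.py (filename assumed) and run it with python download_model.py inside the virtual environment; gated or rate-limited downloads may require authenticating first with huggingface-cli login.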

Environment Configuration

# Set model path environment variable
export MODEL_PATH=$(python3 -c "from huggingface_hub import snapshot_download; print(snapshot_download('RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16', local_files_only=True))")

# Verify disk space
df -h
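
For reference, the w4a16 checkpoint is on the order of 40 GB on disk, so there should be comfortably more free space than that before the download starts.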

Phase 1 Completion Checklist

  • System packages updated and installed
  • NVIDIA drivers and CUDA 12.1 installed
  • H100 GPU detected and verified
  • Python 3.11 virtual environment created
  • PyTorch with CUDA support installed
  • vLLM and quantization libraries installed
  • FastAPI framework installed
  • GPU compatibility verified
  • Model download script prepared
  • Environment variables configured

Technical Specifications Summary

Hardware Requirements

  • NVIDIA H100 GPU with 80GB+ VRAM
  • Ubuntu 22.04 LTS
  • CUDA 12.1 compatible driver

Software Stack

  • Python 3.11 with virtual environment
  • PyTorch 2.3.0 with CUDA 12.1
  • vLLM 0.4.3 inference engine
  • GPTQ quantization support
  • FastAPI web framework

Model Specifications

  • Model: DeepSeek-R1-Distill-Llama-70B
  • Quantization: w4a16 (4-bit weights, 16-bit activations)
  • Estimated VRAM usage: 35-45GB

Next Steps

Phase 2 will focus on:

  • Model download and verification
  • vLLM configuration for 70B quantized model
  • Initial inference testing
  • Performance benchmarking

Notes and Issues

Document any issues, workarounds, or important observations encountered during setup here.


Document Version: 1.0
Last Updated: [Date]
Prepared By: [Name]
Project: H100 DeepSeek-R1-70B Setup