Work Progress Report: H100 DeepSeek-R1-70B Quantized Model Setup
Project Overview
Model: RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
Hardware: NVIDIA H100 GPU
Environment: Ubuntu 22.04 with CUDA 12.1
Quantization: w4a16 (4-bit weights, 16-bit activations)
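Before installing anything, it helps to sanity-check the memory budget. The arithmetic below is a rough, illustrative estimate of the weight footprint of a 4-bit 70B model, not a measured value:
# Rough weight-memory estimate for a w4a16 70B model (illustrative only)
params = 70e9                     # ~70 billion parameters
weight_bytes = params * 4 / 8     # 4-bit weights -> 0.5 bytes per parameter
weight_gb = weight_bytes / 1024**3
print(f'Quantized weights: ~{weight_gb:.0f} GB')  # ~33 GB
# KV cache, activations, and CUDA context add overhead on top of this,
# which is where the 35-45 GB working estimate later in this report comes from.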
Project Phases Progress
Phase | Component | Status | Notes |
---|---|---|---|
1 | Base Infrastructure | DONE | System updates, CUDA installation, Python environment |
2 | Model Download & vLLM | IN PROGRESS | Model acquisition and inference engine setup |
3 | Core Services | PENDING | FastAPI server and model serving infrastructure |
4 | RAG Pipeline | PENDING | Retrieval-Augmented Generation implementation |
5 | Agent Framework | PENDING | Intelligent agent capabilities |
6 | UI & API | PENDING | User interface and API endpoints |
7 | Integration & Testing | PENDING | System integration and performance validation |
Phase 1: Base Infrastructure Setup
System Prerequisites
System Updates and Base Packages
# Update system packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y git curl wget tmux build-essential software-properties-common
sudo apt install -y python3.11 python3.11-venv python3-pip python3.11-dev
sudo apt install -y htop nvtop tree jq unzip
NVIDIA Driver and CUDA Installation
# Verify current NVIDIA driver
nvidia-smi
# Install CUDA Toolkit 12.1 for H100 optimization
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1
# Configure CUDA environment paths
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.profile
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.profile
echo 'export CUDA_HOME=/usr/local/cuda-12.1' >> ~/.profile
source ~/.profile
# Verify the toolkit installation
nvcc --version
H100 GPU Verification
# Check H100 detection and specifications
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
Python Environment Setup
Project Directory and Virtual Environment
# Create dedicated project directory
mkdir h100-deepseek-70b-setup
cd h100-deepseek-70b-setup
# Create Python 3.11 virtual environment
python3.11 -m venv aladin_venv
source aladin_venv/bin/activate
# Upgrade core Python packages
pip install --upgrade pip setuptools wheel build
PyTorch Installation with H100 Support
# Install PyTorch 2.3.0 with CUDA 12.1 support
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
# Install NVIDIA monitoring libraries
pip install nvidia-ml-py3 pynvml
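With the bindings installed, a short pynvml check confirms they can talk to the driver. This is a minimal sketch; the filename is an assumption:
# gpu_check.py (filename assumed) - quick NVML sanity check
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)  # returns bytes on older pynvml versions
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f'GPU 0: {name}')
print(f'Total VRAM: {mem.total / 1024**3:.1f} GB')
pynvml.nvmlShutdown()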
vLLM and Quantization Dependencies
Core vLLM Installation
# Install vLLM inference engine
pip install vllm==0.4.3
# Install Ray for distributed processing
pip install "ray[default]==2.9.3"
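With vLLM installed, a minimal offline smoke test for Phase 2 could look like the sketch below. It is illustrative only: it assumes the model has already been downloaded, and whether vLLM 0.4.3 auto-detects this checkpoint's quantization format should be confirmed against the model card (later vLLM releases broadened support for compressed-tensors checkpoints):
# Phase 2 smoke-test sketch (assumes the model is already downloaded)
from vllm import LLM, SamplingParams

llm = LLM(
    model='RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16',
    max_model_len=4096,           # a small context keeps the KV cache modest for a first test
    gpu_memory_utilization=0.90,  # fraction of H100 VRAM that vLLM may claim
)
params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(['Briefly explain what w4a16 quantization means.'], params)
print(outputs[0].outputs[0].text)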
Quantization Support Libraries
# GPTQ quantization support
pip install auto-gptq==0.7.1
# Hugging Face optimization tools
pip install optimum==1.17.1
# Efficient CUDA routines for quantized operations
pip install bitsandbytes==0.42.0
Performance Optimization Libraries
# Flash Attention for memory efficiency
pip install flash-attn --no-build-isolation --extra-index-url https://pypi.nvidia.com
# Core ML libraries
pip install transformers==4.41.1
pip install accelerate==0.26.1
API Framework Installation
FastAPI and Server Components
# Web framework for model serving
pip install fastapi==0.109.0
pip install "uvicorn[standard]==0.27.0"
pip install uvloop httptools
# Additional utilities
pip install psutil requests aiofiles
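These packages are consumed in Phase 3; as a preview, the serving layer will build on a skeleton along these lines. Endpoint names and structure are assumptions, not the final design:
# minimal_api.py (filename assumed) - Phase 3 expands on this skeleton
import torch
from fastapi import FastAPI

app = FastAPI(title='H100 DeepSeek Serving (skeleton)')

@app.get('/health')
def health():
    # Report basic GPU availability so orchestrators can probe the node
    return {
        'status': 'ok',
        'cuda_available': torch.cuda.is_available(),
        'gpu_count': torch.cuda.device_count(),
    }

# Run with: uvicorn minimal_api:app --host 0.0.0.0 --port 8000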
Installation Verification
PyTorch and H100 Integration Test
import torch

print('PyTorch & H100 Status:')
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'Device count: {torch.cuda.device_count()}')

# Check each GPU
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    memory_gb = props.total_memory / 1024**3
    print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
    print(f'Memory: {memory_gb:.1f} GB')
    print(f'Compute: {props.major}.{props.minor}')
    # Check 70B quantized model compatibility
    if memory_gb >= 70:
        print('Can handle the 70B quantized model')
    elif memory_gb >= 40:
        print('May need optimization for the 70B model')
    else:
        print('Insufficient memory for the 70B model')
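To run the check, save the script above as, e.g., verify_gpu.py (filename assumed) and execute it inside the activated virtual environment with python verify_gpu.py.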
Model Download Preparation
Model Download Script
#!/usr/bin/env python3
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer


def download_model():
    model_name = "RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16"
    print(f"Downloading {model_name}")
    print("This will take 15-30 minutes...")
    try:
        # Download the tokenizer first as a quick connectivity check
        print("Downloading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        # Download all model files into the local Hugging Face cache
        print("Downloading model files...")
        local_dir = snapshot_download(
            repo_id=model_name,
            resume_download=True,
        )
        print(f"Download complete! Files saved to: {local_dir}")
        print("Model ready for use with vLLM or other inference engines")
    except Exception as e:
        print(f"Failed: {e}")
        print("Try: huggingface-cli login")


if __name__ == "__main__":
    download_model()
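Save the script as, e.g., download_model.py (filename assumed) and run it inside the activated environment. If the download fails with an authentication error, huggingface-cli login resolves it for gated or rate-limited repositories.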
Environment Configuration
# Set model path environment variable (requires the model to be downloaded first)
export MODEL_PATH=$(python3 -c "from huggingface_hub import snapshot_download; print(snapshot_download('RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16', local_files_only=True))")
# Verify available disk space
df -h
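Note that the w4a16 checkpoint occupies roughly 40 GB on disk (4-bit weights plus tokenizer and config files; approximate figure), so the Hugging Face cache volume should have comfortably more than that free before Phase 2 begins.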
Phase 1 Completion Checklist
- System packages updated and installed
- NVIDIA drivers and CUDA 12.1 installed
- H100 GPU detected and verified
- Python 3.11 virtual environment created
- PyTorch with CUDA support installed
- vLLM and quantization libraries installed
- FastAPI framework installed
- GPU compatibility verified
- Model download script prepared
- Environment variables configured
Technical Specifications Summary
Hardware Requirements
- NVIDIA H100 GPU with 80GB+ VRAM
- Ubuntu 22.04 LTS
- CUDA 12.1 compatible driver
Software Stack
- Python 3.11 with virtual environment
- PyTorch 2.3.0 with CUDA 12.1
- vLLM 0.4.3 inference engine
- GPTQ quantization support
- FastAPI web framework
Model Specifications
- Model: DeepSeek-R1-Distill-Llama-70B
- Quantization: w4a16 (4-bit weights, 16-bit activations)
- Estimated VRAM usage: 35-45GB
Next Steps
Phase 2 will focus on:
- Model download and verification
- vLLM configuration for the 70B quantized model (see the client sketch below)
- Initial inference testing
- Performance benchmarking
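For reference, once the vLLM OpenAI-compatible server is running in Phase 2, a client-side check could look like the following sketch. Host, port, and served model name are assumptions about the future deployment:
# Hypothetical client check against a future vLLM OpenAI-compatible server
import requests

resp = requests.post(
    'http://localhost:8000/v1/completions',  # assumed host and port
    json={
        'model': 'RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16',
        'prompt': 'Say hello in one sentence.',
        'max_tokens': 32,
    },
    timeout=120,
)
print(resp.json()['choices'][0]['text'])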
Notes and Issues
Document any issues, workarounds, or important observations during setup
Document Version: 1.0
Last Updated: [Date]
Prepared By: [Name]
Project: H100 DeepSeek-R1-70B Setup