Work Progress Report: H100 DeepSeek-R1-70B Quantized Model Setup


Project Overview

Model: RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
Hardware: NVIDIA H100 GPU
Environment: Ubuntu 22.04 with CUDA 12.1
Quantization: w4a16 (4-bit weights, 16-bit activations)
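
As a sanity check on why this model fits on a single 80 GB H100, here is a rough back-of-envelope estimate of the weight footprint (a sketch that ignores quantization scales, activations, and the KV cache):

# Back-of-envelope weight memory for a 70B model at 4-bit precision
params = 70e9                # parameter count
bits_per_weight = 4          # w4a16: 4-bit weights, 16-bit activations
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~35 GB before overhead

This lines up with the 35-45 GB VRAM estimate in the specifications below.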

Project Phases Progress

Phase | Component              | Status  | Notes
------|------------------------|---------|------------------------------------------------------
1     | Base Infrastructure    | DONE    | System updates, CUDA installation, Python environment
2     | Model Download & vLLM  | DONE    | Model acquisition and inference engine setup
3     | Core Services          | PENDING | FastAPI server and model serving infrastructure
4     | RAG Pipeline           | PENDING | Retrieval-Augmented Generation implementation
5     | Agent Framework        | PENDING | Intelligent agent capabilities
6     | UI & API               | PENDING | User interface and API endpoints
7     | Integration & Testing  | PENDING | System integration and performance validation

Phase 1: Base Infrastructure Setup

System Prerequisites

System Updates and Base Packages

# Update system packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y git curl wget tmux build-essential software-properties-common
sudo apt install -y python3.11 python3.11-venv python3-pip python3.11-dev
sudo apt install -y htop nvtop tree jq unzip

NVIDIA Driver and CUDA Installation

# Verify current NVIDIA driver
nvidia-smi

# Install CUDA Toolkit 12.1 for H100 optimization
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1
# Verify the toolkit installation
nvcc --version

# Configure CUDA environment paths
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.profile
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.profile
echo 'export CUDA_HOME=/usr/local/cuda-12.1' >> ~/.profile
source ~/.profile  # apply the new paths to the current shell session

H100 GPU Verification

# Check H100 detection and specifications
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
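
On an H100 this should report the device name, roughly 80 GB of total memory, and compute capability 9.0.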

Python Environment Setup

Project Directory and Virtual Environment

# Create dedicated project directory
mkdir -p /workspace/h100-deepseek-70b-setup
cd /workspace/h100-deepseek-70b-setup

# Create and activate a Python 3.11 virtual environment
python3.11 -m venv aladin_venv
source aladin_venv/bin/activate

# Upgrade core Python packages
pip install --upgrade pip setuptools wheel build

PyTorch Installation with H100 Support

# Install PyTorch 2.3.0 with CUDA 12.1 support
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

# Install NVIDIA monitoring libraries
pip install nvidia-ml-py3 pynvml
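
As a quick check that the monitoring stack works, a minimal pynvml sketch (assuming a single GPU at index 0):

import pynvml

# Initialize NVML and query the first GPU
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

name = pynvml.nvmlDeviceGetName(handle)
if isinstance(name, bytes):  # older pynvml releases return bytes
    name = name.decode()

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"{name}: {mem.used / 1024**3:.1f} GB used of {mem.total / 1024**3:.1f} GB")

pynvml.nvmlShutdown()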

vLLM and Quantization Dependencies

Core vLLM Installation

# Install vLLM inference engine
pip install vllm==0.4.3

# Install Ray for distributed processing
pip install ray[default]==2.9.3
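
Once the model is downloaded in Phase 2, a minimal offline smoke test could look like the sketch below. It assumes the installed vLLM version recognizes the checkpoint's w4a16 quantization from the model config; max_model_len and gpu_memory_utilization are illustrative starting values.

from vllm import LLM, SamplingParams

# vLLM normally detects the quantization scheme from the model's config
llm = LLM(
    model="RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16",
    max_model_len=4096,           # keep the KV cache small for a first test
    gpu_memory_utilization=0.90,  # fraction of H100 VRAM vLLM may claim
)

params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Briefly explain what w4a16 quantization means."], params)
print(outputs[0].outputs[0].text)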

Quantization Support Libraries

# GPTQ quantization support
pip install auto-gptq==0.7.1

# Hugging Face optimization tools
pip install optimum==1.17.1

# Efficient CUDA routines for quantized operations
pip install bitsandbytes==0.42.0

Performance Optimization Libraries

# Flash Attention for memory efficiency
pip install flash-attn --no-build-isolation

# Core ML libraries
pip install transformers==4.41.1
pip install accelerate==0.26.1

API Framework Installation

FastAPI and Server Components

# Web framework for model serving
pip install fastapi==0.109.0
pip install uvicorn[standard]==0.27.0
pip install uvloop httptools

# Additional utilities
pip install psutil requests aiofiles
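
Phase 3 will wire the model into this server; as a placeholder, a minimal FastAPI skeleton for the serving endpoint might look like the following (the endpoint name and request shape are assumptions, not the final API design):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DeepSeek-R1-70B Serving API")

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
async def generate(req: GenerateRequest):
    # Placeholder response; Phase 3 connects this to the vLLM engine
    return {"prompt": req.prompt, "completion": "(model not connected yet)"}

@app.get("/health")
async def health():
    return {"status": "ok"}

Run it with uvicorn app:app --host 0.0.0.0 --port 8000 (module name assumed).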

Installation Verification

PyTorch and H100 Integration Test
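
Save the following check as verify_gpu.py (filename assumed) and run it inside the virtual environment: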

import torch

print('PyTorch & H100 Status:')
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA version: {torch.version.cuda}')
print(f'Device count: {torch.cuda.device_count()}')

# Report specifications for each visible GPU
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    memory_gb = props.total_memory / 1024**3
    print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
    print(f'Memory: {memory_gb:.1f} GB')
    print(f'Compute: {props.major}.{props.minor}')

    # Rough 70B-model compatibility check (w4a16 quantized needs ~35-45 GB)
    if memory_gb >= 70:
        print('Can handle 70B quantized model')
    elif memory_gb >= 40:
        print('May need optimization for 70B model')
    else:
        print('Insufficient memory for 70B model')

Model Download Preparation

Model Download Script

#!/usr/bin/env python3
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

def download_model():
    model_name = "RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16"

    print(f"Downloading {model_name}")
    print("This will take 15-30 minutes...")

    try:
        # Download the tokenizer first as a quick connectivity check
        print("Downloading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

        # Download all model files into the local Hugging Face cache
        print("Downloading model files...")
        local_dir = snapshot_download(
            repo_id=model_name,
            cache_dir=None,
            resume_download=True,
        )

        print(f"Download complete! Files saved to: {local_dir}")
        print("Model ready for use with vLLM or other inference engines")

    except Exception as e:
        print(f"Failed: {e}")
        print("Try: huggingface-cli login")

if __name__ == "__main__":
    download_model()
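
Save the script as download_model.py (filename assumed) and run it with python download_model.py inside the virtual environment; gated or rate-limited downloads may require authenticating first with huggingface-cli login.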

Environment Configuration

# Set model path environment variable
export MODEL_PATH=$(python3 -c "from huggingface_hub import snapshot_download; print(snapshot_download('RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16', local_files_only=True))")

# Verify disk space
df -h
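
For reference, the w4a16 checkpoint is on the order of 40 GB on disk, so there should be comfortably more free space than that before the download starts.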

Phase 1 Completion Checklist

  • System packages updated and installed
  • NVIDIA drivers and CUDA 12.1 installed
  • H100 GPU detected and verified
  • Python 3.11 virtual environment created
  • PyTorch with CUDA support installed
  • vLLM and quantization libraries installed
  • FastAPI framework installed
  • GPU compatibility verified
  • Model download script prepared
  • Environment variables configured

Technical Specifications Summary

Hardware Requirements

  • NVIDIA H100 GPU with 80GB+ VRAM
  • Ubuntu 22.04 LTS
  • CUDA 12.1 compatible driver

Software Stack

  • Python 3.11 with virtual environment
  • PyTorch 2.3.0 with CUDA 12.1
  • vLLM 0.4.3 inference engine
  • GPTQ quantization support
  • FastAPI web framework

Model Specifications

  • Model: DeepSeek-R1-Distill-Llama-70B
  • Quantization: w4a16 (4-bit weights, 16-bit activations)
  • Estimated VRAM usage: 35-45GB

Next Steps

Phase 2 will focus on:

  • Model download and verification
  • vLLM configuration for 70B quantized model
  • Initial inference testing
  • Performance benchmarking

Notes and Issues

Document any issues, workarounds, or important observations encountered during setup here.


Document Version: 1.0
Last Updated: [Date]
Prepared By: [Name]
Project: H100 DeepSeek-R1-70B Setup