
Template Report

· 2 min read

1. Executive Summary

  • Objective: Briefly explain why the research was conducted (e.g., solve a problem, improve efficiency).
  • Key Findings: Summarize the major conclusions and recommendations.
  • Impact: Highlight the potential benefits of the new technology.

2. Introduction

  • Background: Describe the current situation, challenges, or limitations that prompted the research.
  • Purpose: State the purpose of the research (e.g., evaluate feasibility, compare solutions).
  • Scope: Define the boundaries of the study (e.g., specific use cases, teams, or systems).

3. Research Methodology

  • Approach: Describe the methods used to gather and evaluate information (e.g., testing, surveys, benchmarking).
  • Criteria: List the evaluation criteria (e.g., performance, scalability, ease of use, cost).
  • Tools/Resources: Mention any tools, datasets, or environments used.

4. Technology Overview

  • Description: Provide a high-level overview of the technology.
  • Features: Highlight key features or innovations.
  • Market Adoption: Briefly mention how widely the technology is adopted or supported.

5. Analysis and Findings

  • Testing and Experiments: Document the tests or prototypes you built and the results.
  • Comparison with Alternatives: Compare the new technology with current solutions or competitors.
  • Strengths and Weaknesses: List the pros and cons of the technology based on your findings.

6. Use Case Applicability

  • Potential Use Cases: Identify specific scenarios where the technology could be applied.
  • Limitations: Highlight areas where the technology may not be suitable.
  • Integration Feasibility: Discuss how easily the technology can be integrated into existing systems.

7. Cost-Benefit Analysis

  • Implementation Costs: Estimate costs (e.g., licensing, training, migration).
  • Return on Investment (ROI): Discuss potential savings or revenue increases.
  • Risk Assessment: Highlight risks associated with adoption.

8. Recommendations

  • Adoption Plan: Recommend whether or not to proceed and provide an implementation timeline.
  • Training and Support Needs: Outline any training or support required.
  • Further Research: Suggest additional areas for exploration if needed.

9. Conclusion

  • Summary of Findings: Recap the most important insights.
  • Final Recommendation: Reiterate your stance on adopting the technology.

10. Appendices

  • References: List all sources used during the research.
  • Data and Charts: Include any additional data, graphs, or tables supporting your analysis.
  • Technical Details: Provide detailed technical notes or configurations.

Technical Report: NVIDIA H100 GPU Floating Point Capabilities and INT4/8 Precision Impact

· 5 min read

1. Executive Summary

  • Objective: To evaluate the performance implications and trade-offs of switching from floating point to INT4/8 bit precision modes on NVIDIA H100 GPUs for AI workloads.
  • Key Findings: Switching to INT4/8 precision can deliver 2-4x improvements in computational throughput and significant memory efficiency gains with manageable accuracy trade-offs when implemented correctly.
  • Impact: Switching to INT4/8 precision can enable larger model deployments, lower inference latency, and higher throughput, while maintaining acceptable accuracy for many AI applications.

2. Introduction

  • Background: An H100 node with 8 GPUs, each with 80 GB of VRAM, provides 640 GB in total. This is not enough to run the full DeepSeek-R1: loading the weights in FP8 alone takes about 671 GB, with 400 GB+ more set aside for the KV cache. We can conclude that a full DeepSeek-R1 runs in production on 16 H100 GPUs (2 nodes). To run it on a single H100 node, alternative strategies must be considered. As AI models grow increasingly large, GPU memory and computational demands present significant challenges for efficient deployment and operation, and traditional floating-point precision may be unnecessarily resource-intensive for many inference workloads. (A quick sizing sketch follows at the end of this section.)

  • Purpose: To assess the benefits of adopting lower precision INT4/8 computation on H100 GPUs compared to traditional floating point formats.
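
To make the memory arithmetic above concrete, the following is a minimal back-of-the-envelope sketch, not a measurement. The bytes-per-parameter values are the standard ones for each format; the script estimates weight storage only and deliberately ignores KV cache and activation overhead.

```python
# Rough VRAM estimate for holding model weights at a given precision.
# Weight storage only; KV cache, activations and framework overhead are excluded.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """GB needed just to hold the weights of a model with the given parameter count."""
    return params_billion * BYTES_PER_PARAM[precision]

if __name__ == "__main__":
    node_vram_gb = 8 * 80  # one H100 node: 8 GPUs x 80 GB = 640 GB
    for precision in ("fp16", "fp8", "int4"):
        weights = weight_memory_gb(671, precision)  # DeepSeek-R1: 671B parameters
        verdict = "fits" if weights < node_vram_gb else "does not fit"
        print(f"R1 weights in {precision}: {weights:.0f} GB "
              f"({verdict} in a 640 GB node, before KV cache)")
```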


3. Research Methodology

  • Approach: Technical specification analysis, performance benchmarking, and review of published research on quantization techniques for the H100 architecture.

  • Tools/Resources: NVIDIA H100 technical documentation, PyTorch/TensorFlow quantization frameworks, benchmark datasets for accuracy validation, and TensorRT optimization suite.


4. Technology Overview

  • Description: The NVIDIA H100 (Hopper architecture) GPU features fourth-generation Tensor Cores and a new Transformer Engine that support multiple precision formats including FP64, FP32, TF32, FP16, FP8, INT8, and INT4.
  • Features:
    • Transformer Engine for adaptive precision management
    • Hardware-accelerated quantization/dequantization operations
  • Market Adoption: The H100 represents NVIDIA's flagship data center GPU with growing adoption across major cloud providers and AI research institutions for large-scale AI training and deployment.

5. Analysis and Findings

  • Comparison with Alternatives:

    | Precision Format | Throughput | Memory Usage | Accuracy Impact | Use Case Suitability |
    | --- | --- | --- | --- | --- |
    | FP32/TF32 | Baseline | High | None | Training, Scientific |
    | FP16 | Medium | Medium | Minimal | Training, Inference |
    | FP8 | High | Low | Low-Medium | Training, Inference |
    | INT8 | Very High | Very Low | Medium | Primarily Inference |
    | INT4 | Ultra High | Ultra Low | High | Inference Only |
  • Strengths and Weaknesses:

    • Strengths:
      • Significantly increased computational throughput (2-4x); INT4 precision can bring an additional ~59% speedup compared to INT8
      • Reduced memory footprint and memory bandwidth usage, enabling larger models
      • Lower latency for inference workloads
      • Hardware-accelerated quantization support (a minimal post-training INT8 quantization sketch is included at the end of this section)
    • Weaknesses:
      • Reduced numerical precision and range
      • Potential accuracy degradation
      • Implementation complexity requiring specialized expertise
      • Not suitable for all model architectures or operations
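
As a concrete illustration of the INT8 path, the sketch below applies PyTorch post-training dynamic quantization to a toy model. This is a generic, CPU-side weight-quantization example rather than an H100/TensorRT deployment recipe; the model and layer choice are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy stand-in for a real network (illustrative only).
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 256),
)
model.eval()

# Post-training dynamic quantization: Linear weights are stored in INT8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print("fp32 output shape:", model(x).shape)
    print("int8 output shape:", quantized(x).shape)
```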

6. Use Case Applicability

  • Limitations:

    • May not be suitable for models with high numerical sensitivity
  • Integration Feasibility:

    • Well-supported through NVIDIA's software stack (TensorRT, CUDA)
    • PyTorch/TensorFlow integration available through quantization libraries
    • Requires model-specific calibration and validation
    • May require architecture-specific modifications for optimal results

7. Cost-Benefit Analysis

  • Implementation Costs:

    • Engineering time for model quantization and validation
    • Potential accuracy recovery work through quantization-aware training
    • Testing and qualification across various inputs
  • Return on Investment (ROI):

    • 2-4x increase in inference throughput per GPU
    • Proportional reduction in infrastructure costs
    • Ability to deploy models 2-4x larger within the same memory constraints (a worked cost example follows at the end of this section)
  • Risk Assessment:

    • Accuracy degradation may impact user experience
    • Maintenance complexity with mixed precision workflows
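
To make the ROI framing concrete, here is a small worked example. The hourly rate and baseline throughput are hypothetical placeholders, not benchmark results; only the relative effect of the speedup matters.

```python
# Hypothetical inputs: an H100 rented at $2.50/hr serving 1,000 output tokens/s at baseline.
gpu_cost_per_hour = 2.50
baseline_tokens_per_s = 1_000

def cost_per_million_tokens(speedup: float) -> float:
    """Serving cost per 1M generated tokens at a given throughput multiplier."""
    tokens_per_hour = baseline_tokens_per_s * speedup * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

for speedup in (1.0, 2.0, 4.0):
    print(f"{speedup:.0f}x throughput -> ${cost_per_million_tokens(speedup):.3f} per 1M tokens")
```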

8. Recommendations

  • Adoption Plan:

    1. Initial Phase: Identify candidate models for INT8/4 conversion based on throughput needs and accuracy tolerance
    2. Testing Phase: Implement post-training quantization and validate accuracy against benchmarks
    3. Optimization Phase: Apply quantization-aware training where necessary
    4. Deployment Phase: Gradual rollout with monitoring
  • Further Research:

    • Exploration of hybrid precision approaches for model-specific optimization
    • Evaluation of emerging quantization techniques
    • Comparing TensorRT-LLM and Ollama
  • Possible Workflow (a minimal 4-bit loading sketch follows after this list):

    • Download model from Hugging Face → Prefer models with GPTQ or QLoRA support (4-bit).
    • Load with transformers + AutoModelForCausalLM → Set quantization configs.
    • (Optional) Quantize with GPTQ → Skip if model is already quantized; else run GPTQ tooling.
    • Fine-tune with QLoRA → Use peft, bitsandbytes, transformers, accelerate (fit LLaMA-2 13B easily on H100 in 4-bit).
    • Evaluate and deploy locally → Wrap with FastAPI, vLLM, or Hugging Face's text-generation-inference
  • Untrained Model:

    (Figure: possible setup)
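
To illustrate the workflow above, here is a minimal sketch of loading a causal LM in 4-bit with transformers and bitsandbytes and attaching a LoRA adapter via peft. The model name and the LoRA hyperparameters are placeholders rather than a validated recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Llama-2-13b-hf"  # placeholder; any causal LM on the Hub works

# 4-bit NF4 quantization config (QLoRA-style loading via bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Attach small trainable LoRA adapters on top of the frozen 4-bit base model.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```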

9. Conclusion

  • Summary of Findings: INT4/8 precision modes on H100 GPUs offer substantial performance and efficiency benefits with manageable accuracy trade-offs for suitable workloads.
  • Final Recommendation: Adopt INT8 precision broadly for inference workloads with selective INT4 usage for less sensitive components. Maintain higher precision for training and numerically sensitive operations.

10. Appendices

  • References:
    • NVIDIA H100 Technical Documentation
    • "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"
    • "ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers" (Yao et al.)
    • "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (Frantar et al.)

Evaluating the Effectiveness of Open-Source LLM Models, Required Sizing, and R&D Direction

· 5 min read
Software Engineer @ Aladintech

1. Executive Summary

Objective

  • Evaluate the effectiveness of DeepSeek models, including:
    • Reliability
    • Response time
    • Model suitability per agent type
  • Optimize cost and resource sizing
  • Assess usage feasibility

Key Findings

  • DeepSeek models provide strong performance when used with proper configuration and alignment to task type.
  • High-end models (70B+) require expensive infrastructure and should be carefully evaluated for ROI.
  • Smaller distilled models can achieve practical efficiency and reliability when tuned correctly.

Impact

  • Improved agent reliability and response time
  • Flexible deployment options (local/private inference)
  • Cost-effective sizing strategies and modular flow architecture

2. Introduction

Background

Current Needs

  • Need for private deployable models
  • Suboptimal performance on CPU
  • Lack of full testing across models and tasks

Challenges

  • Too many model options to benchmark exhaustively
  • Risks include:
    • Slow response times
    • Inaccurate outputs (especially with 7B models)
  • Difficulty in measuring efficiency

Understanding Threats

Proposed Solution

  • Benchmark and test candidate models
  • Use model classification and evaluation metrics

Purpose

  • Create a measurable framework for evaluating models
  • Classify model capabilities, usage limits, and costs
  • Select 3 high-reliability agents for further testing
  • Explore optimization strategies for agents
  • Propose reliability enhancement techniques
    Reference:
    • Secure data access and privacy
    • Improve user experience

Scope

  • Conduct a controlled evaluation (no production deployment yet)
  • Provide a test suite and benchmarking guidance

3. Research Methodology

Approach

Usage & Environment

  • Environment: Ollama, SGLang (ref)
  • Tools: Ollama, Langchain, n8n (prebuilt SDK available)

Tested Model Variants

  • DeepSeek-V3 (General-purpose)
  • DeepSeek-R1 (Reasoning tasks)
  • DeepSeek-VL2 (Image+Text)
  • Janus (Multi-modal)
  • DeepSeek-Coder (Code-focused)

Evaluation Metrics

  • Total duration
  • Load duration
  • Prompt evaluation (count, duration, rate)
  • Overall evaluation (count, duration, rate)
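
These metrics map directly onto the fields returned by Ollama's generate endpoint. The sketch below assumes a local Ollama server on the default port and that the named model has already been pulled; it collects the timing fields for a single prompt.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "deepseek-r1:32b"  # assumption: model already pulled via `ollama pull`

resp = requests.post(
    OLLAMA_URL,
    json={"model": MODEL, "prompt": "Explain the KV cache in one paragraph.", "stream": False},
    timeout=600,
)
data = resp.json()

ns = 1e9  # durations are reported in nanoseconds
print("total_duration (s): ", data["total_duration"] / ns)
print("load_duration (s):  ", data["load_duration"] / ns)
print("prompt_eval_count:  ", data["prompt_eval_count"])
print("eval_count:         ", data["eval_count"])
print("eval rate (tok/s):  ", data["eval_count"] / (data["eval_duration"] / ns))
```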

Hardware Sizing Reference

Assumed usage: 8 hours per day, on demand:

  • On-demand rental for research and application-oriented work.
  • Monthly rental when training models.

| Model | Params (B) | VRAM (GB) | Recommended GPU | CPU Recommendation | RAM (GB) | Price |
| --- | --- | --- | --- | --- | --- | --- |
| 700B | 671 | 1342 | 16x NVIDIA A100 80GB | AMD EPYC 9654 / Intel Xeon Platinum 8490H | 2048+ | 2500$ |
| 14B | 14 | 6.5 | RTX 3080 10GB | Ryzen 9 7900X / i9-13900K | 64+ | N/A |
| 32B | 32 | 14.9 | 1 x A6000 | Threadripper 7980X / Xeon W9-3495X | 128+ | N/A |
| 70B | 70 | 32.7 | 1 x H100 | EPYC 9654 / Xeon Platinum 8490H | 256+ | 1200$ |

4. Analysis and Findings

Benchmark Environment

Configs Used

  • Dual RTX 5070Ti: 32GB VRAM, 64GB RAM, i5-14600KF (~$0.5/hr)
  • H100 NVL (single): 94GB VRAM, 100GB RAM, EPYC 9354 (~$2.5/hr)
  • Dual H100 NVL (~$5/hr)

Model Benchmarks

  • llama3.1 (70B): General-purpose
  • DeepSeek-R1 (32B, 70B): For reasoning, solution generation
  • DeepSeek-Coder (33B): For code explanation/suggestions
  • DeepSeek-LLM (67B): General-purpose

Observations

  • RTX 5070Ti can handle up to 32B models with tuning; ideal for narrow-scope agents (e.g., coding assistants).
  • For 70B models, 5070Ti setup is slow (up to 4 mins response).
  • H100 NVL is optimal for real-time inference with 70B+ models.
  • Larger models (700B) are currently impractical for cost and infra reasons.

Model Comparison Insights

  • 1x A6000 is sufficient for 32B models with proper prompt tuning.
  • 1x H100 can support 70B models for testing/research.
  • Models with vision capabilities require pre-processing: PDF → Text → Embedding → Vector DB → Retriever → LLM.
  • Multi-model workflows (e.g., classification + reasoning) improve accuracy and performance:
    User → Model 1 (classifier/tuner) → Model 2 (responder) → Output
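
A minimal sketch of that two-stage flow against a local Ollama server is shown below; the model names, the routing categories, and the prompt wording are illustrative assumptions rather than a tested configuration.

```python
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def ask(model: str, prompt: str) -> str:
    """Single non-streaming chat call to a locally served model."""
    resp = requests.post(
        OLLAMA_CHAT,
        json={"model": model, "messages": [{"role": "user", "content": prompt}], "stream": False},
        timeout=600,
    )
    return resp.json()["message"]["content"]

user_request = "Refactor this recursive Python function into an iterative one."

# Stage 1: a small, fast model classifies the request (categories are illustrative).
category = ask(
    "deepseek-r1:14b",
    "Classify the following request as exactly one word, CODE or GENERAL:\n" + user_request,
).strip().upper()

# Stage 2: route to a specialised responder model based on the classification.
responder = "deepseek-coder:33b" if "CODE" in category else "deepseek-llm:67b"
print(ask(responder, user_request))
```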

5. Use Case Applicability

Suitable Use Cases

  • Domain-specific AI agents (e.g., code generation, Q&A bots)
  • Parallel model inference to boost reliability
  • On-prem inference for sensitive data handling
  • Coding support agents

Limitations

  • GPU-dependent
  • Lacks native image/PDF input unless extended with external modules
  • Large model cost constraints

Integration Feasibility

  • High if using LangChain/n8n SDKs
  • Moderate effort for on-prem setup (requires infrastructure and monitoring)

6. Cost-Benefit Analysis

Implementation Costs

  • GPU rental via Vast.ai for prototyping (~$0.5 to $4/hr)
  • Setup and tuning time
  • DevOps and monitoring for production use

ROI & Savings

  • Local inference = no token cost (vs. OpenAI)
  • Smaller tuned models yield significant savings
  • Modular flows reduce infrastructure duplication

Risks

  • Over-investing in large models without maximizing smaller ones
  • Long response times = poor UX
  • Lack of model support for some input types (e.g., images)

7. Recommendations

Adoption Plan

  • Build and tune agents first
  • Start with 1x H100 or A6000
  • Adopt a flow-based architecture
  • Use multiple models for task specialization
  • Consider unsupervised learning and prompt chaining

Training & Support

  • Document setup and tuning best practices
  • Evaluate in-house vs. external support

Further Research

  • Model tuning for cost-efficiency
  • Explore hybrid models (reasoning + coding)
  • Improve model interaction reliability

8. Conclusion

Summary

  • DeepSeek offers strong performance when deployed with appropriate infrastructure.
  • Smaller models (14B–32B) provide good results with tuning.
  • 700B+ models are not currently cost-effective; we cannot yet make full use of their capabilities.

Final Recommendation

  • Use 1x A6000 or H100 for research and mid-sized deployments.
  • Optimize agent design and build modular flows.
  • Focus on maximizing potential of 32B models before scaling up.

Integrating n8n into Aladintech.co V2

· 3 min read
Software Engineer @ Aladintech

1. Overview

  • Problem: Integrate n8n into Aladintech version 2.
  • Use case: Automate workflows using n8n.
  • Impact: With a workflow management tool, we can design and manage work flows and integrate Aladintech's currently fragmented tools.

2. Tool Overview

  • Description: n8n is a workflow automation platform that lets you build automated processes.
  • Features: Drag-and-drop nodes, custom nodes, and built-in integrations with the APIs of many tools such as email, Jira, Google Sheets, etc.

3. Use Cases

  • Potential:

    • Automating work processes:
      • The team is currently building out processes, starting with the HRMs module. Khương is advising and will send back the three work items of the HRMs module, including:

        • Recruitment process: input screening, interview question bank, skill assessment matrix
        • Training process: training paths for different positions such as BE & FE
        • Evaluation process: performance reviews and promotion roadmap for employees.

        => Build automation for the processes that can be automated. Since every process is a sequence of steps A => B, we can automate the tasks that do not require a human. For example, when interviewing a candidate, automatically generate the question list and email it to the candidate.

      • Addressing current outstanding issues:

        • Management systems are fragmented, often for external reasons such as a customer requiring a particular tool.
        • Examples:
          • A customer requires JIRA => our JIRA is self-hosted => out of sync.
          • Management via Google Sheets => statistics and evaluation are difficult and scattered.

        (Figure 1: Build flow.)

  • Limitations & technical issues:

    • Jira integration has not been wired up yet => more time is needed to integrate / check the APIs of our self-hosted apps.
    • Complex flows will need someone to design them (for example, a custom API of another system, or a node not yet supported by n8n)

4. Testing self-hosted n8n


5. Integration & Deployment Plan

  • Self-hosted deployment:
    • Straightforward.
    • Prebuilt flows need to be set up for the team.
    • Basic usage is drag-and-drop; a little training is needed once the flows are built.
  • Deployment plan:
    • Finish defining the processes and model them as flow diagrams.
    • Build the flows in n8n (a minimal trigger sketch follows below).
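
As a small illustration of how other Aladintech services could hand work off to an n8n flow, the sketch below posts a candidate record to an n8n Webhook-trigger node. The host, webhook path, and payload fields are assumptions for illustration, not an existing endpoint.

```python
import requests

# Assumed self-hosted n8n instance and an assumed Webhook trigger path.
N8N_WEBHOOK_URL = "http://n8n.internal.example:5678/webhook/hr-interview-intake"

# Hypothetical payload for the recruitment flow (e.g. to generate interview
# questions and email them to the candidate automatically).
candidate = {
    "name": "Nguyen Van A",
    "email": "candidate@example.com",
    "position": "Backend Engineer",
}

resp = requests.post(N8N_WEBHOOK_URL, json=candidate, timeout=30)
resp.raise_for_status()
print("n8n workflow triggered, status:", resp.status_code)
```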