
Traffic Shaping and Rate Limiting in LLM Development with Docker Trickle on Ubuntu

Chandan Kumar
Founder, beCloudReady
February 11, 2025 · 5 min read

How to implement traffic shaping and rate limiting for LLM inference services using Docker Trickle on Ubuntu.

Introduction

Large Language Models (LLMs) have revolutionized AI-driven applications, but they come with a significant challenge: network bandwidth consumption. With models like LLaMA, DeepSeek, and Falcon requiring tens to hundreds of gigabytes for downloading and fine-tuning, unregulated network usage can lead to serious issues, including:

  • Network Congestion : Excessive bandwidth usage can degrade performance for other critical workloads.

  • Cloud Provider Penalties : Many cloud providers charge based on bandwidth usage or enforce rate limits.

  • On-Premise Infrastructure Bottlenecks : Corporate networks can become overwhelmed, impacting services like databases, monitoring, and internal applications.

To address these challenges, traffic shaping and rate limiting are essential strategies. Let’s explore various approaches and best practices to efficiently manage LLM-related bandwidth consumption.

Why Traffic Shaping & Rate Limiting Matter in LLM Development

  1. Avoiding Bandwidth Overuse & Cloud Penalties

    • Cloud providers often have egress bandwidth limits and charge extra for exceeding quotas. For example, AWS charges per GB for outbound traffic, making uncontrolled downloads expensive.

    • Solution: Set up bandwidth caps per instance or per user using tools like tc, Docker’s rate-limiting options, or cloud networking policies.

  2. Preventing On-Premise Network Saturation

    • When multiple engineers or training pipelines fetch multi-GB models, it can disrupt other operations like CI/CD, logging, and storage replication.

    • Solution: Use traffic shaping to ensure AI workloads don’t consume all available network resources.

  3. Ensuring Fair Resource Allocation

    • If multiple workloads are running on a shared network, LLM-related downloads should not block other critical applications.

    • Solution: Rate limit AI workloads to a controlled share of bandwidth, ensuring fair usage.

  4. Better CI/CD and Model Deployment Pipelines

    • Excessive network usage during model deployment can slow down inference servers or prevent real-time applications from functioning efficiently.

    • Solution: Use scheduled model syncs with bandwidth control to avoid high-traffic periods.
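All of the solutions above come down to the same underlying mechanism: a rate limiter that lets traffic through at a controlled pace while tolerating short bursts. A minimal token-bucket sketch in Python illustrates the idea (this is our own illustration of the concept, not code from any of the tools discussed; the `TokenBucket` name and injectable clock are ours):

```python
import time

class TokenBucket:
    """Minimal token bucket: permits `rate` bytes/sec with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # refill rate, bytes per second
        self.capacity = capacity    # maximum burst size, bytes
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable for testing
        self.last = clock()

    def consume(self, n):
        """Return True if n bytes may be sent now, else False (caller should wait)."""
        now = self.clock()
        # Refill tokens for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Kernel-level shapers such as tc's tbf implement essentially this logic in-kernel, plus a queue for packets that arrive while the bucket is empty.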

Traffic Shaping & Rate Limiting Techniques

1. Limiting Download Speeds via wget or curl

A quick way to limit bandwidth when downloading LLM models is to use wget or curl with the --limit-rate option:

# Limit to 500 KB/s while downloading DeepSeek 67B model
wget --limit-rate=500k https://huggingface.co/deepseek-ai/deepseek-67b

For curl:

curl --limit-rate 1m -O https://huggingface.co/deepseek-ai/deepseek-67b
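Before picking a cap, it is worth estimating how long the download will actually take at that rate. A quick back-of-the-envelope helper (our own, purely illustrative):

```python
def download_hours(size_gb: float, limit_kbps: float) -> float:
    """Estimated wall-clock hours to fetch `size_gb` gigabytes at `limit_kbps` kilobytes/sec."""
    size_kb = size_gb * 1024 * 1024     # GB -> KB (binary units)
    return size_kb / limit_kbps / 3600  # seconds -> hours

# A ~130 GB checkpoint at the 500 KB/s wget cap above:
print(round(download_hours(130, 500)))  # roughly 76 hours
```

At 500 KB/s a 130 GB model takes about three days, so aggressive caps are best reserved for background syncs, with higher limits for interactive work.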

2. Using tc (Linux Traffic Control) for Advanced Rate Limiting

For on-premise or self-hosted deployments, Linux’s tc command provides precise bandwidth control:

# Cap all egress on eth0 to 5 Mbit/s (affects every process on the host, not just Docker)
sudo tc qdisc add dev eth0 root tbf rate 5mbit burst 32kbit latency 400ms

To remove limits:

sudo tc qdisc del dev eth0 root
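The tbf parameters interact: at a given rate, the latency setting bounds how much data can sit queued waiting to be sent. A small helper for sanity-checking settings before applying them (our own sketch; tc performs an equivalent conversion internally, with minor adjustments for burst size):

```python
def tbf_queue_bytes(rate_bits_per_s: int, latency_s: float) -> int:
    """Approximate max bytes tbf will queue: traffic arriving faster than `rate`
    waits at most `latency` seconds, so the queue holds about rate * latency of data."""
    return int(rate_bits_per_s * latency_s / 8)  # bits -> bytes

# The 5 Mbit/s, 400 ms example above queues at most about:
print(tbf_queue_bytes(5_000_000, 0.4))  # 250000 bytes (~244 KiB)
```

If the queue bound is too small for your packet sizes, tc will drop traffic aggressively; if it is very large, downloads feel smooth but latency-sensitive traffic behind the same qdisc suffers.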

3. Using trickle to Limit Bandwidth for Specific Applications

For user-level control, trickle is useful for limiting bandwidth to specific processes:

trickle -s -d 200 wget https://huggingface.co/meta-llama/Meta-Llama-3-70B

Here -s runs trickle in standalone mode and -d 200 caps the download at 200 KB/s for that one wget process. One caveat: trickle works by intercepting socket calls via LD_PRELOAD, so it only affects dynamically linked programs that it launches directly. Wrapping the Docker CLI, for example, does not throttle an image pull, because the actual transfer is performed by the Docker daemon rather than the docker client:

# Ineffective: dockerd, not the wrapped client process, does the transfer
trickle -s -d 500 docker pull ghcr.io/huggingface/text-generation-inference

4. Docker Bandwidth Limiting for AI Workloads

If you are running LLM models in Docker, note that Docker has no built-in per-container bandwidth cap. You can, however, put AI containers on a dedicated network and shape that network's traffic on the host. Lowering the MTU, as below, only reduces packet size (which can smooth bursts); it does not cap throughput:

docker network create \
  --driver=bridge \
  --opt com.docker.network.driver.mtu=1200 \
  limited_network

docker run --network=limited_network ai-container

For more control, use Docker traffic shaping with tc as shown earlier.
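A common pattern is to find the host-side veth interface belonging to the container and apply tc there. A dry-run sketch that only builds the commands, so you can review them before running with root privileges (the interface name is illustrative; discovering it varies by Docker setup):

```python
def shaping_commands(iface: str, rate_mbit: int) -> list:
    """Build (but do not run) tc commands that cap egress on a host-side veth
    interface, mirroring the tbf example used earlier for eth0."""
    return [
        f"tc qdisc add dev {iface} root tbf rate {rate_mbit}mbit burst 32kbit latency 400ms",
        f"tc qdisc del dev {iface} root",  # cleanup command, for when the container exits
    ]

# Example: commands to cap a container's veth at 5 Mbit/s
for cmd in shaping_commands("veth1a2b3c", 5):
    print(cmd)
```

Keeping the commands as data rather than executing them immediately makes it easy to log what was applied per container and to undo it cleanly later.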

5. Configuring Cloud Provider Network Limits

Cloud platforms offer some network traffic controls, though none is a direct "download speed" knob:

  • AWS: Per-instance bandwidth is determined by the instance type's network performance tier; there is no VPC-level speed cap, so shape traffic with tc on the instance itself.

  • Azure: VM bandwidth is likewise tied to VM size; Network Security Groups (NSGs) can allow or block AI-related outbound traffic, but they do not throttle it.

  • GCP: Cloud Armor can rate limit requests at the load balancer, which protects against traffic spikes but is not a bandwidth shaper.

Note that the AWS CLI does not expose a per-interface bandwidth flag; modify-network-interface-attribute changes attributes such as security groups and source/dest checks, not speed. To cap an EC2 host's egress, run the tc command shown earlier on the instance:

sudo tc qdisc add dev eth0 root tbf rate 10mbit burst 32kbit latency 400ms

Real-World Example: Managing DeepSeek & LLaMA Downloads

Let’s say you’re running a multi-node LLM deployment and need to fetch models without saturating your cloud provider’s bandwidth limits. Here’s a structured approach:

1. Use a Local Model Cache

Instead of repeatedly downloading models from Hugging Face, use a shared storage solution like:

  • MinIO (Self-hosted S3 storage)

  • NFS (Network File System)

  • Hugging Face Model Hub Local Mirror

Example: seed the MinIO cache once with a rate-limited download, then let every node pull from the cache instead of the internet (mc cannot mirror directly from an arbitrary HTTPS site such as huggingface.co):

mc alias set localminio http://minio.example.com accessKey secretKey
wget --limit-rate=5m https://huggingface.co/meta-llama/Meta-Llama-3-70B -O llama-3-70b
mc cp llama-3-70b localminio/models/
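The cache-first logic itself is simple: check local storage before touching the network. A minimal sketch (the function name, paths, and the downloader callback are our own placeholders, not part of any library):

```python
from pathlib import Path

def fetch_model(name: str, cache_dir: str, download) -> Path:
    """Return the cached path for `name`, invoking `download(name, dest)` only on a miss."""
    dest = Path(cache_dir) / name
    if dest.exists():
        return dest              # cache hit: no network traffic at all
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(name, dest)         # cache miss: the caller's (rate-limited) downloader runs once
    return dest
```

With a shared cache directory (MinIO mount, NFS export), the rate-limited download happens once per model instead of once per engineer or pipeline run.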

2. Apply Traffic Shaping on Model Syncing Nodes

If you’re syncing models between training clusters, limit sync speeds:

rsync --bwlimit=10000 -avz model_files/ remote_server:/models/  # --bwlimit is in KB/s (~10 MB/s here)

This ensures that training data and logs still flow smoothly.

3. Use Cloud-Specific Optimizations

On AWS, store frequently used models in S3 with CloudFront caching to reduce redundant downloads.

aws s3 cp llama-70b.tar.gz s3://my-llm-storage/ --storage-class INTELLIGENT_TIERING

Conclusion

Traffic shaping and rate limiting are critical for LLM development to:

  • Prevent network congestion in cloud and on-premise environments.

  • Avoid excessive cloud provider charges for bandwidth overuse.

  • Ensure fair resource allocation across workloads.

  • Improve overall AI infrastructure performance.

By leveraging tools like wget, tc, trickle, Docker network settings, and cloud-native traffic controls, developers can optimize LLM model downloads and deployments without disrupting network operations.

What’s Next?

  • Experiment with different traffic shaping techniques.

  • Deploy a local caching solution to reduce model download dependencies.

  • Implement CI/CD pipelines that respect network limits during AI model deployment.

By proactively managing bandwidth, AI teams can develop and deploy LLMs efficiently while keeping network resources balanced and cost-effective.

Tags: LLM, Docker, Rate Limiting, Traffic Shaping, Ubuntu