Building a Scalable LLM Inference Service with Ollama, Stress Testing, and Autoscaling


Introduction

In today's era of AI-powered solutions, deploying large language models (LLMs) at scale requires meticulous planning, robust infrastructure, and dynamic scaling to ensure reliability and performance. In this blog, I'll walk you through a comprehensive DevOps/MLOps project where I implemented a scalable LLM inference service using Ollama, AWS EKS, and K6 load testing. Let’s dive into the details of the implementation, challenges faced, and the insights gained along the way.

The Challenge: Why Scaling Matters for LLMs

LLMs are resource-intensive by nature, demanding significant computational power and memory. When deploying these models, factors like latency, response time, and throughput directly impact user experience. A poorly optimized deployment can lead to service disruptions, long wait times, or even crashes under heavy load. This project aimed to address these challenges by:

  1. Creating a scalable API layer around the Ollama moondream model.

  2. Deploying on Kubernetes (AWS EKS) for high availability and fault tolerance.

  3. Implementing load testing and autoscaling to adapt to dynamic workloads.

  4. Automating CI/CD for seamless updates and deployments.

Step 1: Laying the Foundation with Docker and Flask

Building the API Layer

The first step involved creating a lightweight API wrapper using Flask to interact with the Ollama moondream model. The API exposed endpoints for sending prompts and retrieving responses.

from flask import Flask, request, jsonify
import ollama
import logging

# Configure logging
logging.basicConfig(level=logging.DEBUG)

app = Flask(__name__)

@app.route('/')
def home():
    return jsonify({'message': 'Welcome to the Ollama API! Use /generate to post prompts.'}), 200

@app.route('/generate', methods=['POST'])
def generate():
    # Log the incoming request
    logging.debug(f"Incoming request: {request.method} {request.url}")

    data = request.json
    prompt = data.get('prompt')

    if not prompt:
        logging.error("Prompt is required.")
        return jsonify({'error': 'Prompt is required'}), 400

    try:
        # Prepare the message structure for chat
        messages = [{'role': 'user', 'content': prompt}]
        response = ollama.chat(model='moondream', messages=messages)
        logging.debug("Response generated successfully.")
        return jsonify({'response': response['message']['content']}), 200
    except Exception as e:
        logging.error(f"Error generating response: {str(e)}")
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
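
The Flask app above only needs the Flask framework and the Ollama Python client, so a minimal requirements.txt for this project might look like the following (version pins are left out here and should be added for a real deployment):

# Minimal dependencies for app.py (pin versions in a real deployment)
flask
ollama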

Containerizing the Application

This simple Flask wrapper uses the ollama.chat method to interact with the moondream model. Next, I created a Dockerfile using the ollama image as the base. The first version of the Dockerfile pulled the moondream model during the build and launched the app with CMD "python app.py".


Next, I created an EC2 instance on AWS with the instance type t2.micro to test the application and model deployment, copied the entire code onto the instance, and installed Docker. Commands to install Docker on Ubuntu:

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
#Verify Docker Installation
sudo docker run hello-world

Next, I had to grant the ubuntu user the privilege to execute Docker commands (the group change takes effect after logging out and back in):

sudo usermod -aG docker ubuntu

After copying over the Dockerfile and app.py along with requirements.txt and start.sh, I used this command to build the image:

docker build -t ollama-app .

Here I encountered 2 issues:

❌ The command "ollama pull moondream" could not be executed because the ollama service had not yet started.

❌ The command specified through CMD, i.e. "python app.py", failed because the ollama base image does not ship with Python.

Inference: Beyond these two issues, I realized that pulling the moondream model while building the Dockerfile would also make the image very large: the model is roughly 1 GB, and the ollama base image is already around 900 MB.

A large Docker image increases the time for building, transferring, and starting containers, leading to inefficiencies in deployment and resource usage. Additionally, it consumes more disk space and bandwidth, impacting performance and scalability.

✅ So, I decided to execute these commands specified below using a Bash script (start.sh).

ollama serve &           # start the Ollama server in the background
sleep 5                  # give the server a moment to start before pulling the model
ollama pull moondream
python3 app.py

Updated Dockerfile:

# Use the Ollama image as the base
FROM ollama/ollama:latest

# Install Python and pip
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip

# Set the working directory
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# Expose port 5000 for the Flask app
EXPOSE 5000

# Copy the start script into the container
COPY start.sh /app/start.sh
RUN chmod +x /app/start.sh

# Use the shell directly to run the start script
ENTRYPOINT ["/bin/sh", "/app/start.sh"]

Deploying Locally

After updating the Dockerfile, I ran the build command again, and this time the image was built successfully. To list the images, execute the command:

docker images

The Docker image 'ollama-app' is built and has a size of 1.01 GB. The next step is to create a container from this image with the command:

docker run -d -p 5000:5000 ollama-app

#To see the active containers
docker ps

The '-d' flag runs the container in detached (background) mode, and '-p 5000:5000' maps port 5000 so the service is reachable on that port.

Now we will send a POST request to the model with the prompt "Why is the sky blue?". With the container running, the service can be reached over HTTP; for example, use curl (or any HTTP client) to send a POST request:

curl -X POST http://<ec2-instance-ip-address>:5000/generate -H "Content-Type: application/json" -d '{"prompt": "Why is the sky blue?"}'

❌ Error: moondream runner process has terminated: exit status 0xc0000409 error loading model: unable to allocate backend buffer

The exit status 0xc0000409 typically points to a memory allocation problem. In this Ollama setup, the message "unable to allocate backend buffer" means the backend process serving the model could not allocate the memory buffers it needs.

✅ To resolve this error, we need to allocate more memory to the EC2 instance for the model to function properly. Stop the instance and under Actions, go to Instance Settings -> Change Instance Type -> Change the instance type to t2.medium -> Start the Instance again.

After starting the instance, make sure the container is up and running, then execute the command again:

curl -X POST http://<ec2-instance-ip-address>:5000/generate -H "Content-Type: application/json" -d '{"prompt": "Why is the sky blue?"}'

❌ Error: curl: (28) Failed to connect to 54.167.116.223 port 5000 after 132622 ms: Couldn't connect to server (Here, 54.167.116.223 is the EC2 Instance IP Address)

✅ This error is a result of port 5000 not being publicly accessible. To make it accessible, go to the EC2 Dashboard -> Instances -> Security -> Click on Security Groups -> Add port 5000 in the Inbound Rules from Anywhere IPv4 and click on Update.

Execute the curl request again and you'll see the output of the prompt "Why is the sky blue?"
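
Based on the Flask route shown earlier, the reply comes back as a JSON object with a single response field. The wording of the model's answer varies from run to run, so the response below is only illustrative of the shape:

{
  "response": "The sky appears blue because molecules in the atmosphere scatter shorter (blue) wavelengths of sunlight more strongly than longer (red) wavelengths."
}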


Inference: The model has been initialized successfully and is running on the server. The API server is up and running; requests are being processed without errors. Upon sending a prompt, the server responds with the expected output, confirming successful operation!

Step 2: Deploying on AWS Elastic Kubernetes Service (EKS)

The next step is to deploy the model using AWS Elastic Kubernetes Service. First of all, let's understand the benefits of using AWS EKS over other container orchestration tools:

Benefits of AWS EKS

  • Managed Service: AWS handles Kubernetes control plane management, including scaling, patching, and updating.
  • High Availability: The EKS control plane is distributed across multiple Availability Zones (AZs) for fault tolerance.
  • Seamless AWS Integration: Integrates with AWS services like IAM, ELB, CloudWatch, and ECR.
  • Auto-Scaling: Supports the Kubernetes Cluster Autoscaler and Horizontal Pod Autoscaler for dynamic scaling.
  • IAM Integration: Provides fine-grained access control for Kubernetes resources.
  • Pay-as-You-Go: Charges based on worker nodes and resources consumed, with AWS Savings Plans available for cost optimization.

Prerequisites

Before creating the cluster, I installed the following tools:

  • AWS CLI for managing AWS resources.

  • kubectl for interacting with the Kubernetes cluster.

  • eksctl for automating EKS cluster creation.

Before creating the EKS cluster, we first need to tag and push the image we created to Docker Hub. Commands:

docker login
# Enter your Docker Hub username and password when prompted

# Tag the local image with your Docker Hub repository, then push it
docker tag ollama-app <username>/ollama:latest
docker push <username>/ollama:latest

Go to your DockerHub account to make sure the image is pushed successfully!

Write the YAML manifests, deployment.yml and service.yml, for the application. Reference the Docker image you just pushed inside deployment.yml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-app
spec:
  replicas: 8
  selector:
    matchLabels:
      app: ollama-app
  template:
    metadata:
      labels:
        app: ollama-app
    spec:
      containers:
      - name: ollama-app
        image: nishankkoul/ollama:latest
        ports:
        - containerPort: 5000
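
The manifest above works as-is, but it is worth adding resource requests and limits to the container spec so the scheduler can place pods sensibly and so kubectl top and the HPA used later have meaningful numbers to work with. The values below are assumptions to be tuned against observed usage, not measurements from this deployment:

      containers:
      - name: ollama-app
        image: nishankkoul/ollama:latest
        ports:
        - containerPort: 5000
        resources:
          requests:
            cpu: 500m        # assumed baseline per pod; tune after watching kubectl top pods
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 3Gi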

For service.yml, make sure to set the service type to LoadBalancer: the cluster will provision a Load Balancer to distribute incoming network traffic across the running pods, ensuring high availability and reliability by preventing any single instance from being overloaded. It also helps absorb traffic spikes and provides failover support.

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  type: LoadBalancer
  selector:
    app: ollama-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000

With the prerequisites fulfilled, the next task is to create the EKS cluster.

Command to create EKS Cluster:

eksctl create cluster \
  --name ollama-cluster \
  --region us-east-1 \
  --nodegroup-name linux-nodes \
  --node-type t2.medium \
  --nodes 4 \
  --managed

This command creates a new Amazon EKS cluster named ollama-cluster in the us-east-1 region with a managed node group named linux-nodes consisting of four t2.medium EC2 instances. It sets up the necessary infrastructure to run Kubernetes workloads on AWS.

The cluster has been successfully created and is up and running! (Cluster creation takes time depending on the resources allocated, so expect to wait at least 5 minutes.)

Next, we want kubectl to interact with the cluster; for that, execute the command:

aws eks update-kubeconfig --region us-east-1 --name ollama-cluster

To see the nodes running, execute the command:

kubectl get nodes

Now, deploy the YAML manifests using these commands:

kubectl apply -f deployment.yml  #This command will deploy the number of replicas specified inside the deployment.yml onto the EC2 instances.

kubectl get all  #To view all the resources created

The pods have been created along with the deployment and the replica set. If a container doesn't come up, start troubleshooting with kubectl logs <pod-name>: the logs from a container can provide insights into what is happening inside it. Common issues with Kubernetes pods are listed below:

  1. CrashLoopBackOff:
  • Description: A pod repeatedly crashes and restarts.
  • Troubleshooting:
    • Check pod logs: kubectl logs <pod-name>.
    • Describe the pod for more details: kubectl describe pod <pod-name>.
    • Investigate the application's start-up and initialization code.

  2. ImagePullBackOff:
  • Description: Kubernetes cannot pull the container image from the registry.
  • Troubleshooting:
    • Verify the image name and tag.
    • Check the image registry credentials.
    • Ensure the image exists in the specified registry.

  3. Pending Pods:
  • Description: Pods remain in the "Pending" state and are not scheduled.
  • Troubleshooting:
    • Check node resources (CPU, memory) to ensure there is enough capacity.
    • Ensure the nodes are labeled correctly if using node selectors or affinities.
    • Verify there are no taints on nodes that would prevent scheduling.

If everything looks good, proceed with deploying the LoadBalancer service. Command:

kubectl apply -f service.yml
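
Once the service is created, the Load Balancer's DNS name appears in the EXTERNAL-IP column of the service output:

kubectl get svc ollama-service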

The EXTERNAL-IP value is the Load Balancer's DNS name, which is the endpoint URL we have to hit with curl.

Copy the DNS name and paste it into the browser.

Opening it in the browser returns the welcome message from the root route, confirming that the API server is up and ready to accept requests. Make sure to send the POST request with curl to the '/generate' endpoint. Command:

curl -X POST http://<Replace-with-Load-Balancer-DNS-Name>/generate -H "Content-Type: application/json" -d '{"prompt": "Who is Albert Einstein?"}'

The curl command sent a POST request to the specified endpoint with the prompt "Who is Albert Einstein?" and received a JSON response explaining who Albert Einstein was, highlighting his contributions to physics and mathematics, particularly his theory of relativity.

Step 3: Load Testing with K6

For this task, we will be using K6.io, a powerful open-source load testing tool designed for modern infrastructure. It provides a modern scripting environment, using JavaScript, to create and execute test scripts that simulate a wide range of traffic conditions. K6 is known for its ease of use, high performance, and rich features, making it a popular choice for load testing APIs, websites, and other web services. It also offers seamless integration with CI/CD pipelines, enabling automated performance testing as part of the development workflow.

Install K6 on your local system

The first step is to install K6 on your local system; for that, refer to https://github.com/grafana/k6/releases and download the latest zip archive.

Extract K6

Extract the contents of the zip archive to a directory of your choice.

Add K6 to Path

Add the directory containing k6.exe to your system's PATH environment variable.

Verify Installation

Open a Command Prompt and run k6 version to ensure K6 is installed correctly.
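
The steps above are for Windows. On macOS or Linux, K6 can also be installed from a package manager; for example (on Debian/Ubuntu this additionally requires adding the k6 GPG key and apt repository described in the K6 docs):

# macOS (Homebrew)
brew install k6

# Debian/Ubuntu (after adding the k6 apt repository)
sudo apt-get update && sudo apt-get install k6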

Create a Test Script

The next step is to create a test script; K6 test scripts are written in JavaScript.

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

export let options = {
    stages: [
        { duration: '1m', target: 5 },   // Ramp up to 5 users over 1 minute
        { duration: '2m', target: 10 },  // Stay at 10 users for 2 minutes
        { duration: '1m', target: 15 },  // Ramp up to 15 users over 1 minute
        { duration: '2m', target: 20 },  // Stay at 20 users for 2 minutes
        { duration: '1m', target: 0 },   // Ramp down to 0 users over 1 minute
    ],
    thresholds: {
        http_req_failed: ['rate<0.1'],      // Allow up to 10% of requests to fail
        http_req_duration: ['p(95)<60000'], // 95% of requests should be below 60s
    },
};

export default function () {
    const url = 'http://<Replace-with-Load-Balancer-DNS-Name>/generate';
    const payload = JSON.stringify({ prompt: 'Why is the Earth round in shape?' });
    const params = {
        headers: { 'Content-Type': 'application/json' },
        timeout: '90s',  // Increase timeout to 90 seconds
    };
    let res;

    // Retry logic with increased timeout
    for (let retries = 0; retries < 3; retries++) {
        res = http.post(url, payload, params);
        if (res.status === 200) {
            break;
        }
        sleep(3);  // Wait for 3 seconds before retrying
    }

    const isSuccess = check(res, {
        'status is 200': (r) => r.status === 200,
    });

    console.log(`Response time: ${res.timings.duration} ms, Status: ${res.status}`);

    sleep(3);  // Wait for 3 seconds before making the next request
}

Make sure to replace the url with your Load Balancer's DNS name followed by the '/generate' endpoint. The script drives the load in stages: it ramps up to 5 virtual users over the first minute, climbs to 10 users over the next 2 minutes, then to 15 users over 1 minute and to a peak of 20 users over 2 minutes, before ramping back down to 0 over the final minute. The thresholds require that no more than 10% of requests fail (http_req_failed) and that 95% of requests complete within 60 seconds (http_req_duration). Each iteration retries a failed request up to 3 times with a 90-second timeout, logs the response time, and checks for a 200 status code. The staged profile can easily be adapted to other scenarios (e.g. spike tests or endurance tests) by adjusting the stages.

Execute the script

Navigate to the directory containing your test script. Execute the following command:

k6 run load-test.js

Results:

1. Total Requests: Simulated 82 HTTP requests across various scenarios.

2. Success Rate: 44 requests returned a successful status code (200), achieving a success rate of approximately 53.66%.

3. Variability in Response Time: Average response time for successful requests ranged from 20000 ms to 50000 ms based on operation complexity.

4. Load Testing: Assessed system performance under varying virtual user loads, peaking at 20 virtual users.

❌ Unfortunately, only 44 out of 82 requests returned a successful status code (200), resulting in a success rate of approximately 53.66%.

✅ To increase the success rate of the requests, we'll be scaling up our infrastructure by upgrading the EC2 instance types from t2.medium to t2.xlarge. This change aims to improve performance and ensure smoother handling of the workload.

NOTE: My AWS account's EC2 vCPU quota allowed a cumulative capacity of only 16 vCPUs for the entire EC2 service. Since a t2.xlarge instance has 4 vCPUs, we can spin up a maximum of 4 nodes. So, switch the node instance type to t2.xlarge now; one way to do this with eksctl is sketched below.
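
A managed node group cannot change its instance type in place, so the practical route is to add a new node group with the larger instances and then remove the original one. A sketch with eksctl (the new node group name here is my own choice):

# Create a new managed node group with t2.xlarge instances
eksctl create nodegroup --cluster ollama-cluster --region us-east-1 \
  --name xlarge-nodes --node-type t2.xlarge --nodes 4 --managed

# Drain and remove the original t2.medium node group
eksctl delete nodegroup --cluster ollama-cluster --region us-east-1 --name linux-nodes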

Deploy the pods again as well as the service.

kubectl apply -f deployment.yml
kubectl apply -f service.yml
kubectl get all

Perform the load test again:

k6 run load-test.js

Results:

1. Total Requests: Simulated 120 HTTP requests across various scenarios.

2. Success Rate: 91 requests returned a successful status code (200), achieving a success rate of approximately 75.83%.

3. Variability in Response Times: Average response time for successful requests ranged from 10000 ms to 40000 ms based on operation complexity.

4. Load Testing: Assessed system performance under varying virtual user loads, peaking at 20 virtual users.

Inference: Scaling the infrastructure from t2.medium to t2.xlarge significantly improved the system's ability to handle a higher volume of requests more efficiently. With the increased CPU and memory resources provided by the t2.xlarge instance, the server managed to process a larger number of concurrent requests successfully (from 53.66% to 75.83%), reducing the likelihood of bottlenecks. This upgrade not only increased throughput but also resulted in a noticeable decrease in response times (from 20,000ms-50,000ms to 10,000ms-40,000ms), ensuring faster and more reliable service for end-users. The enhanced capacity of the t2.xlarge instance allowed the application to maintain performance under load, demonstrating the critical role of appropriate infrastructure scaling in optimizing application responsiveness and user experience.

Step 4: Autoscaling with Horizontal Pod Autoscaler (HPA)

Implementing Horizontal Pod Autoscaler (HPA) for the deployment is a strategic approach to dynamically scale application pods based on CPU and memory usage, as well as custom metrics. By monitoring key performance indicators like CPU utilization, HPA ensures that the number of running pods adjusts automatically to meet the current demand. When CPU or memory usage exceeds predefined thresholds, HPA increases the number of pods to handle the load, thereby maintaining optimal performance and avoiding potential downtime. Conversely, during periods of low utilization, HPA scales down the number of pods to conserve resources and reduce costs. This automated scaling mechanism not only enhances the application's resilience and responsiveness but also optimizes resource utilization, ensuring efficient and cost-effective operations.

Create hpa.yml specifying the Deployment name along with minimum and maximum replicas of pods.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-app
  minReplicas: 8
  maxReplicas: 16
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 100m

The manifest above defines a Horizontal Pod Autoscaler (HPA) for the ollama-app deployment. It automatically adjusts the number of pod replicas between 8 and 16 based on resource usage: it monitors CPU consumption and aims to keep the average usage at 100 millicores (100m) per pod. When average CPU usage exceeds this target, the HPA adds replicas to handle the load; when usage falls below it, the HPA scales back down, ensuring efficient resource utilization while maintaining the performance of the ollama-app deployment.

Before applying the manifest, install the Kubernetes Metrics Server, which provides the resource usage metrics needed to monitor pod consumption:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify the installation:

kubectl get pods -n kube-system | grep metrics-server

You should see a metrics-server pod running.

Now, apply the HorizontalPodAutoscaler YAML manifest using this command:

kubectl apply -f hpa.yml

Verify the HPA:

kubectl get hpa

This command will show you the current status of the HPA, including the current number of replicas and target metrics.

Execute this command:

kubectl top pods

The command 'kubectl top pods' is used to display resource usage (CPU and memory) of the pods in a Kubernetes cluster. This command helps you monitor and understand the resource consumption of your pods, which is essential for performance tuning and resource management. It provides real-time metrics, allowing you to see how much CPU and memory each pod is using.

As we have not run the load testing script yet, the output shows the currently running pods (4 at this point) along with their CPU consumption.

Execute the load testing script.

k6 run load-test.js

After executing the script, monitor the pods using the command 'kubectl top pods'.

The kubectl top pods output shows that as soon as the load increased and average CPU usage surpassed the 100m target, 6 more pods came up to handle the load and distribute it evenly.

Results:

1. Total Requests: Simulated 131 HTTP requests across various scenarios.

2. Success Rate: 112 requests returned a successful status code (200), achieving a success rate of approximately 85.49%.

3. Variability in Response Times: Average response time for successful requests ranged from 5000 ms to 20000 ms based on operation complexity.

4. Load Testing: Assessed system performance under varying virtual user loads, peaking at 20 virtual users.

Step 5: Automating CI/CD with GitHub Actions

Creating a GitHub Actions CI/CD pipeline for this project involves automating the build, test, and deployment processes directly from the GitHub repository. By leveraging GitHub Actions, we will create workflows that trigger automatically on code changes, ensuring the application is always built, tested, and deployed consistently.

The workflow is triggered by pushes and pull requests to the "Main" branch. The job runs on the latest Ubuntu environment and consists of several steps. First, it checks out the code from the repository. Next, it sets up Docker Buildx, logs into DockerHub using stored secrets, and builds and pushes a Docker image to DockerHub. It then installs kubectl for Kubernetes management and configures AWS CLI with the necessary credentials to interact with AWS services. The kubeconfig is updated to connect to the specified EKS cluster. Finally, the workflow deploys the application to the EKS cluster by applying the Kubernetes deployment, service, and HPA configuration files. This streamlined process ensures the application is continuously integrated and deployed, facilitating efficient and reliable updates.

Store the secrets in the repository:

Navigate to Settings -> Secrets -> New repository secret.

# Add these secrets

- AWS_ACCESS_KEY_ID
- AWS_REGION
- AWS_SECRET_ACCESS_KEY
- DOCKER_PASSWORD
- DOCKER_USERNAME
- EKS_CLUSTER_NAME
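
With these secrets in place, the workflow described above can be expressed roughly as follows. This is a sketch of .github/workflows/deploy.yml under my own assumptions about action versions and file layout, not the exact pipeline from the repository; kubectl and the AWS CLI come preinstalled on GitHub-hosted Ubuntu runners, so no separate install steps are shown.

name: Build and Deploy to EKS

on:
  push:
    branches: [ "Main" ]
  pull_request:
    branches: [ "Main" ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Build and push the Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ secrets.DOCKER_USERNAME }}/ollama:latest

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Update kubeconfig for the EKS cluster
        run: aws eks update-kubeconfig --region ${{ secrets.AWS_REGION }} --name ${{ secrets.EKS_CLUSTER_NAME }}

      - name: Deploy manifests to EKS
        run: |
          kubectl apply -f deployment.yml
          kubectl apply -f service.yml
          kubectl apply -f hpa.yml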

To test the GitHub Actions CI/CD Pipeline, append a comment to any of the files and commit it to the repository.

The pipeline is executed successfully and an updated Docker image is also uploaded to the Docker Hub account.

Best Practices Learned

1. Regular Performance Testing: Load testing the application at different infrastructure levels gives a clear understanding of application performance and user experience. Reviewing and optimizing the infrastructure based on the performance test results ensured robustness and readiness for anticipated operational conditions.

2. Containerization: Using Docker to containerize applications ensures consistency across different environments and simplifies deployment.

3. Infrastructure as Code (IaC): Automating the creation of cloud infrastructure using tools like 'eksctl' ensures the reproducibility and scalability of the deployment process.

4. Load Balancing: Implementing a load balancer distributes traffic across multiple instances, enhancing the availability and reliability of the application.

5. Use GitHub Actions for CI/CD: Automating the build, test, and deployment process using GitHub Actions helps maintain consistency and efficiency.

6. Securely Manage Secrets: Use GitHub Secrets to store sensitive information such as DockerHub credentials and AWS keys.

7. Scalability: Implement HPA to automatically scale your application based on CPU utilization, ensuring optimal performance and resource utilization.