Quantcast
Viewing all articles
Browse latest Browse all 268

2 Ways AI Assistants Are Changing Kubernetes Troubleshooting

Image may be NSFW.
Clik here to view.
Person playing a guitar

Of all the hoopla around AI, the most misguided part is the insistence on fine-tuned large language models (LLMs). Too many believe specializing a model based on a massive collection of domain-specific data is the only way to build useful AI assistants.

This demand for fine-tuning is even more common in highly specialized or technical fields, such as software development and cloud services. A prime example is the ongoing maintenance and troubleshooting of Kubernetes clusters designed to deliver apps. This situation underscores the critical challenge faced by DevOps and app development leaders: There needs to be a better solution for managing the complexity of cloud native infrastructure. These environments often present intractable challenges that defy experience, wisdom or intuition around troubleshooting.

In response, startups and open source projects claim to have fine-tuned existing models to include specialized knowledge about Kubernetes that generic models, even GPT-4 Turbo, wouldn’t normally ingest or have access to.

However, the challenge is not a problem with the fine-tuning itself, but its inability to mimic the human approach to troubleshooting. No matter how intelligent the model, you’ll get no real value unless it can replicate how you perform troubleshooting: gathering disparate resources, jugging all the critical details you’ve found in logs and kubectl output in your head, leaning on your experience, and distilling it all into a logical next step.

There are only two key areas that make an AI assistant useful in the Kubernetes world. Assistants must be:

  • Embedded within Kubernetes clusters with access to artifacts that describe its state, like the output of kubectl get/logs/describe.
  • Able to understand your questions in natural language and translate complex operational data into simple, actionable next steps.

Fine-tuning just prioritizes hype over what matters to you most: acting on what’s actually happening with your pods, nodes and apps.

Getting Kubernetes AI Assistance Halfway Right

The AI and cloud native spaces are growing simultaneously, so new tools are overlapping in these two domains.

New open source command-line interface (CLI) tools like K8sgpt and KoPylot wrap their operations around kubectl to gain access to your cluster’s state. By running that command on your behalf using the context available in your .kube/config file, these tools can read and process the output directly, rather than forcing you to switch context. They then proxy data to OpenAI’s API to deliver AI-generated responses in your terminal.

It’s a clever workaround, but these CLI tools still require a high level of Kubernetes knowledge, or another CLI tool. You need to know the right commands, not just a question about your cluster’s status, to initiate interaction.

Another open source tool, mico, advances this concept by converting your natural language queries into kubectl commands. You can ask mico to, for example, print the number of times each pod in xyz namespace has restarted, and it will use the jsonpath argument in kubectl to filter output down to just the relevant line.

We love to see how the open source community is leveraging AI, but these tools are limited: They either understand the cluster’s state but can’t handle natural language queries, or they help you write queries but only return kubectl output without the next troubleshooting steps. You could replace the default OpenAI models powering these tools with a specialized alternative, but that won’t help you reduce your troubleshooting time or help your less-experienced peers monitor their apps.

What Makes AI Assistance Valuable for Kubernetes Troubleshooting?

The answer is an AI assistant that excels in understanding cluster state and interpreting natural language — fine-tuning be damned.

Access to Your Cluster’s State

Without access to the cluster state, the only way to get help from your AI assistant is to play a game of telephone on your path to resolving issues. Even with a fine-tuned AI, you can expect the conversation to go a little like this:

  1. You know enough about Kubernetes to run kubectl get pods when your deployment doesn’t come up immediately.
  2. You ask your AI assistant why a pod would crash due to a CrashLoopBackOff error.
  3. The AI responds by telling you that the most common possible causes of a CrashLoopBackOff error include insufficient memory, missing dependencies and container failure due to port conflict. Perhaps it’s smart enough to ask you to run kubectl describe pod POD_NAME for clues about its resource usage and limits … perhaps.
  4. You tell your AI assistant about that output, including the Terminated state and last emitted event: Back-off restarting failed container.
  5. The AI suggests you run kubectl get events --field-selector involvedObject.name=POD_NAME to search for other possible causes.
  6. You find events around failed readiness and liveliness probes, along with the backoff process, but nothing new, and you let your AI assistant know.
  7. The AI assistant suggests you run kubectl logs POD_NAME --all-containers to search for specific errors with your containerized app or its dependencies from your manifest, like a database or messaging queue.
  8. Amidst the lengthy logs, you find a warning from docker-entrypoint.sh saying it couldn’t execute because of a not found argument.
  9. You ask your AI assistant about that warning, and it (finally) tells you to check your Kubernetes manifest for a typo or misconfiguration in the arguments you’ve attached to that container, which is the root cause of your issue.

You certainly received assistance from your AI tool, but the assistance was not particularly efficient. It might have saved you from Googling each error or running kubectl ... help commands to find the right syntax. But because you were responsible for accurately sharing information about your cluster’s state and understanding each step from your AI assistant, you still carried almost all the cognitive load and did not save very much time.

Access to a cluster’s state is essential. A valuable AI assistant must automatically respond to your original question about CrashLoopBackOff by running kubectl commands itself, parsing the output for clues, bringing in context from the collective Kubernetes troubleshooting knowledge available online and delivering a precise path to remediation — no runbooks or deep dives into documentation required.

Understanding Your Natural Language Questions

A Kubernetes AI assistant that can read outputs or logs and deliver an executive summary of what to think about next is great, but it assumes you have enough Kubernetes knowledge to know what question or specific kubectl command to run. The real added value, especially for application developers with limited knowledge of Kubernetes operations, comes from the ability to ask questions in your natural language:

  • Some folks might need to ask beginner-level prompts: “What is a pod?”
  • Others can ask specific questions about the cluster based on a basic understanding of Kubernetes: “Are there any failing pods in my xyz namespace?”
  • The most advanced DevOps engineers might take it one step further: “What should I do about this notification, which says one of my nodes is suddenly NotReady?”

When the AI can translate a question into the relevant command to gather state context (kubectl get pods -n xyz), it can effectively reduce the cognitive load on your team. DevOps engineers can reduce their mean time to resolution (MTTR) by using the AI assistant as a resource to reflect their specialized knowledge, and developers can troubleshoot their apps in a self-service fashion.

When the AI assistant runs on platforms where your team operates, like Slack or Microsoft Teams, this knowledge is more accessible and collaborative. When the next significant incident strikes your app, DevOps engineers and developers can engage your AI assistant in the same channel for more targeted root cause analysis and a remediation plan that goes beyond a temporary fix.

A New Type of Kubernetes AI Assistant

To address these issues, Botkube recently launched AI Assistant, which is designed to operate in both areas of Kubernetes troubleshooting and directly in collaboration platforms.

The assistant works by listening to your natural language questions about your Kubernetes cluster and its apps, converting your queries into the appropriate kubectl get/logs/describe commands and interfacing with an LLM to explore root causes and opportunities. From this, the assistant can deliver insights and recommend next steps on your troubleshooting journey.

Image may be NSFW.
Clik here to view.

This assistant enhances Botkube’s notification, investigation and troubleshooting tools by operating on the most valuable bounds of both areas. Using AI Assistant helps you research why an issue is happening, learn kubectl to perform basic operations, or tap into Kubernetes expertise to seek out root causes and find a workable solution.

Cluster State && Natural Language >>> Fine-Tuned LLM

Under the hood, Botkube’s AI Assistant uses ChatGPT-4.

We’re not ashamed to admit that we’re using the same model as every open source tool and most new paid platforms. We can’t fine-tune what ChatGPT knows, but we can add nuance to queries and tweak the nature of its responses to provide a better troubleshooting experience.

For example, we layer additional instructions on top of common natural language queries and data about a cluster’s state to “force” ChatGPT to provide more comprehensive answers. We also enrich ChatGPT’s default output with better formatting and organizational structure to help you focus on troubleshooting, not deciphering instructions.

Adding value just before and after interfacing with an LLM can do much more than fine-tuning. We designed AI Assistant to be context-aware and compatible with the questions you genuinely want to ask of your cluster — not the complex kubectl commands you may be used to.

Ways to Use an AI Assistant

The opportunities are generally bounded only by how much detail kubectl emits and the Kubernetes knowledge built into OpenAI’s latest models … which is quite a lot. You can ask:

  • Basic questions about the Kubernetes ecosystem, such as details on the differences among containers, pods and nodes.
  • Specific questions related to a cluster’s state, like confirming that all pods in the xyz namespace are healthy.
  • For specific troubleshooting help around a new error notification, without having to reference a runbook or read documentation.

Botkube’s executor features then let you turn the AI Assistant’s insights into immediate remediation by helping you craft the right kubectl through a drop-down interface (rather than a dozen runs of kubectl ... help).

DevOps engineers can speed up workflows by spending less time in the terminal and more time where collaboration happens. And app developers can fix Kubernetes issues on their own instead of filling out a ticket and waiting for someone to help.

No matter your title or role, you can start using Botkube’s embedded AI assistant today with a new or existing Botkube account. Sign up now for free to enable our Kubernetes AI assistant with a single click, with no configuration required.

Toss it a few questions and you’ll quickly see why cluster awareness and natural language — not fine-tuned LLMs — are the best path forward to manage the complexity of your cloud native infrastructure.

The post 2 Ways AI Assistants Are Changing Kubernetes Troubleshooting appeared first on The New Stack.

AI that mimics how humans approach troubleshooting can democratize and improve how people identify and fix Kubernetes issues.

Viewing all articles
Browse latest Browse all 268

Trending Articles