July 21st, 2025 • Comments Off on Preference-driven systems: a practical bridge to intent-driven systems
Intent-driven orchestration has emerged as an effective model for managing increasingly complex, distributed systems. Rather than prescribing how infrastructure should behave, intent allows users to declare what outcomes or objectives they want systems to fulfill. This paradigm shifts control from static configurations to adaptive, context-aware orchestration. Examples of such approaches already exist, such as Intel’s Intent-Driven Orchestration planner for Kubernetes or TM Forum’s intent ontology, which provide a structured way to express high-level business and service goals that can be interpreted by autonomous systems.
Intent-driven systems offer several advantages and can be effectively combined with declarative orchestration and management approaches. These models are not mutually exclusive — in fact, when used together, they can deliver the best of both worlds: clear goal expression and programmable control. Here’s why intents are particularly powerful (adapted from this blog post):
- Semantic Portability: Intents define what needs to be achieved (e.g., “ensure 10ms tail latency”), not how to achieve it (e.g., “allocate 10 vCPUs”). This abstraction enables portability across heterogeneous platforms. While infrastructure may vary, the meaning of the intent — such as latency targets — remains consistent.
- Intent as Invariant: Because intents focus on outcomes rather than implementation, resource providers have the freedom to optimize — change hardware, reconfigure topology, shift workloads — as long as they uphold the declared expectations. This flexibility supports resilience and continuous optimization.
- Context-Aware Optimization: Intents (e.g., “favor performance over power consumption”) give orchestrators crucial context. This enables systems to make informed trade-offs and take decisions that better align with user objectives.
It’s a powerful vision — and arguably the north star for upcoming orchestration and management systems. However, realizing it at scale remains a challenge.
The Reality of Interface Inertia
Intent requires shared semantics between users and resource providers. These semantics must be embedded in APIs, orchestration layers, and policy engines — all of which take time to evolve and gain adoption. In many systems today, APIs are still tightly coupled to implementation details. Even where abstraction exists, it often lacks the richness to fully express intent in a way that systems can reliably act on.
As a result, adoption of intent-based orchestration has been slower than hoped. Infrastructure providers need time to align on standards, support new interfaces, and re-architect parts of their platforms. Application owners, meanwhile, still operate in environments where procedural configurations are the norm.
A Middle Ground: Preference-Driven Orchestration
As an interim step, a preference-driven orchestration model offers a practical path forward.
Rather than fully declaring objectives and invariants, users can express preferences — high-level hints about how orchestration decisions should be made. These preferences are not strict requirements, but they provide meaningful guidance to the orchestration layer.
Examples of such preferences might include:
- “Prefer performance over energy efficiency”
- “Minimize cost where possible”
- “Favor locality to user region”
- “No need to boost this best effort app”
- “This is a <important|besteffort|background, …> task”
These could be defined as simple profiles, annotations, or lightweight policy templates. Unlike full intent, preferences don’t require formal semantics or guaranteed enforcement — they simply steer the system toward user-aligned decisions in environments where exact guarantees are not yet feasible.
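To make this a bit more concrete, here is a minimal sketch of what such preferences could look like when attached to a Kubernetes Pod as plain annotations – the annotation keys and values are hypothetical, any orchestrator-specific convention would do equally well:

import json

# Hypothetical preference annotations on a Kubernetes Pod; the keys and
# values are made up for illustration, not an existing convention.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "checkout-service",
        "annotations": {
            "preferences.example.org/optimize-for": "performance",
            "preferences.example.org/cost": "minimize-where-possible",
            "preferences.example.org/locality": "user-region",
            "preferences.example.org/class": "besteffort",
        },
    },
    "spec": {"containers": [{"name": "app", "image": "example/app:1.0"}]},
}

print(json.dumps(pod_manifest, indent=2))

An orchestrator that doesn’t understand these keys can simply ignore them – which is exactly what makes preferences low-risk to adopt.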
Why Preferences Matter
- Feasible Today: Preferences can be layered onto existing APIs and orchestrators with minimal disruption, making them more readily adoptable.
- Low Risk, High Value: Because preferences are advisory rather than declarative, they don’t require full support from every system component. Partial adherence still improves system alignment with user goals.
- Prepares the Ground for Intent: By introducing users to higher-level abstractions and aligning systems with those abstractions incrementally, preferences create a pathway toward more formal intent expression over time.
- Improves Portability and Contextualization: Even when resource providers cannot guarantee intent satisfaction, understanding user preferences provides valuable context for optimizing placement, configuration, and trade-off decisions.
From Preferences to Intent
Over time, preference-driven orchestration can evolve naturally into full intent-driven models. As infrastructure layers mature and standard semantics emerge, preferences can be promoted to soft intents, and eventually to hard objectives with measurable enforcement.
This progression mirrors other transitions in computing — from imperative to declarative configuration, from manual scaling to autoscaling, from static placement to adaptive scheduling. Each step builds on the last, gradually shifting the responsibility from the user to the system.
Intent-driven orchestration remains a powerful vision for the future of cloud, edge, and distributed computing. But recognizing the complexity of reaching that goal, a preference-driven approach offers a practical, incremental step. By enabling systems to make smarter decisions today — with minimal disruption — preferences pave the way for a more intelligent, adaptive, and user-aligned orchestration model tomorrow.
Categories: Personal • Tags: Intent-Driven Orchestration, Orchestration • Permalink for this article
December 17th, 2024 • Comments Off on Power efficiency: throttling for Sustainability
The rise of AI applications has brought us soaring power demands and strains on the environment. Google reported a 13% emission increase thanks to AI, their footprint is huge, some power grids are struggling, new data centers are about to use enormous amounts of energy, and the reporting on all of this isn’t always great either (btw, try a traditional search before using a chat bot if you care about the environment).
So the challenge is: how can we make AI, and HPC/HTC applications more energy-efficient without sacrificing too much performance?
The good news is that AI, HPC, and HTC workloads are very flexible – they can be shifted in time and space, for example to align with the availability of green energy. HPC systems with job schedulers like Grid Engine, LSF, and SLURM can already be used for this. But what if we could add another dimension to this energy-saving toolkit? Enter CPU and GPU throttling.
Throttling CPU and GPU frequencies to lower power draw levels is not a new idea. Research from Microsoft and others has shown that modest performance reductions can result in substantial energy savings. The underlying principle is dynamic voltage and frequency scaling (DVFS). To illustrate this point, let’s consider a specific example: here I’m running a Phi 3.5 LLM on an Arc GPU for inference. Performance is measured through token creation time, of which we observe the P-99. Just a small adjustment in performance can lead to significant power savings: by capping the GPU frequency and sacrificing only about 10% in performance, we observed up to 30% in power savings.
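For the curious, the frequency sweep behind these numbers can be approximated with a few lines of Python. This is only a sketch under some assumptions: the sysfs path is the one exposed by Intel’s i915 driver (it differs per card and kernel, and writing it needs root), and run_inference() stands in for the actual Phi 3.5 benchmark loop:

import statistics
from pathlib import Path

# assumption: i915 sysfs layout; adjust card index/path for your setup
GT_MAX_FREQ = Path("/sys/class/drm/card0/gt_max_freq_mhz")

def set_gpu_max_freq(mhz: int) -> None:
    # cap the maximum GPU frequency (usually requires root)
    GT_MAX_FREQ.write_text(str(mhz))

def p99(latencies_ms):
    # 99th percentile of the per-token creation times
    return statistics.quantiles(latencies_ms, n=100)[98]

def sweep(freqs_mhz, run_inference):
    results = {}
    for f in freqs_mhz:
        set_gpu_max_freq(f)
        token_latencies = run_inference()  # placeholder: returns per-token latencies in ms
        results[f] = p99(token_latencies)
    return results

Sweeping a range of frequency caps and comparing the resulting P-99 values against the measured power draw yields exactly the kind of trade-off described above.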
Btw, as we’re testing an AI workload, performance in this example depends heavily on memory bandwidth. With lower GPU frequencies, memory bandwidth usage drops as well – showing that frequency scaling is also an effective way to control memory bandwidth. This explains the P99 compute latency improving as the frequency is increased: the application was, in general, bottlenecked on memory bandwidth. From around 1500MHz onward, memory bandwidth stabilizes and performance becomes more consistent.
While the results shown here use a low-range GPU, the principles hold true for larger systems. For example, a server with 8 GPUs, each with a TDP of ~1000W, could save hundreds of watts per server. Imagine the compounded impact across racks of servers in a data center. By incorporating techniques like throttling, we can make meaningful strides toward reducing the carbon footprint of AI.
Leveraging the flexibility of workloads—such as the AI examples shown here—can greatly improve system effectiveness: run faster when green energy is available and slower when sustainability goals take priority. Together, these strategies pave the way for a more sustainable compute continuum.
Categories: Personal • Tags: Artificial Intelligence, Cloud, data center, Edge, system effectiveness, system performance • Permalink for this article
September 9th, 2024 • Comments Off on System Effectiveness
In the world of distributed systems, performance has traditionally been the primary measure of success. Engineers have focused on optimizing for latency, throughput, reliability and scalability, pushing systems to be faster and handle more work. While these performance metrics are critical, they tell only part of the story. A more holistic and sustainable concept is system effectiveness, which balances performance with broader considerations like cost efficiency, resource utilization, and environmental impact. Here is a first shot at defining system effectiveness:
System effectiveness measures how efficiently a system performs relative to its resource consumption, financial cost, and environmental impact, while adapting dynamically to context over time to balance performance with sustainability.
In an era of increasing environmental awareness, this broader focus on system effectiveness—not just raw performance—should take center stage! Hence my thought that we should start focusing on this as soon as possible!
System effectiveness incorporates the Total Cost of Ownership (TCO), Return on Investment (ROI), and the balance between Capital Expenditures (CAPEX) and Operating Expenses (OPEX). It considers not just how fast and scalable a system is, but also the economic and environmental costs of running it. For instance, somebody might purchase a server and generate thousands of euros in revenue from it. While the initial CAPEX seems justified, the ongoing OPEX — such as power and cooling — may appear negligible at first. However, as the system scales, energy usage adds up. These “small” costs accumulate quickly and can have a significant impact on profitability, especially if systems are not optimized for efficiency.
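A quick back-of-the-envelope calculation illustrates the point – all numbers below are made up, but the mechanics are what matter:

# Hypothetical figures: yearly energy OPEX of a single server vs. its purchase price.
power_draw_w = 500            # average server draw, hypothetical
price_per_kwh = 0.30          # EUR per kWh, hypothetical tariff
hours_per_year = 24 * 365

energy_kwh_per_year = power_draw_w / 1000 * hours_per_year
opex_energy_per_year = energy_kwh_per_year * price_per_kwh

capex = 10_000                # EUR, hypothetical server price
print(f"Energy per year: {energy_kwh_per_year:.0f} kWh")
print(f"Energy OPEX per year: {opex_energy_per_year:.0f} EUR")
print(f"Years until energy alone matches CAPEX: {capex / opex_energy_per_year:.1f}")

With these made-up numbers, energy alone adds up to the purchase price within roughly eight years – per server, and before cooling overheads or rising energy prices are factored in.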
In the context of system effectiveness, it’s important to note that simply turning off servers isn’t always the best solution. The revenue generated by a server running efficiently can often outweigh the costs, even including its energy consumption. However, a well-managed server can be optimized in other ways, such as throttling its performance to ensure it operates using green energy whenever possible. By aligning energy consumption with renewable sources, businesses can reduce their environmental impact while maintaining operational efficiency and profitability.
But this raises a question: why do we still prioritize system performance over effectiveness, when sustainability and efficiency are becoming just as critical? Just looking at it from an OPEX perspective might make it look like a bad business decision — energy is still just so cheap. But as governance shifts and policies tighten, businesses will need to account for more than just technical performance. Companies must start asking themselves whether they want to wait until regulations force them to change — or whether they want to take the lead in defining what responsible, effective computing looks like for the future. The choice to focus on system effectiveness is not only a reflection of good business sense but an opportunity to align with a more sustainable and resilient future.
System effectiveness encourages optimizing resource usage, which keeps ongoing operational costs low while maximizing revenue. Companies that embrace this aren’t just reducing their environmental footprint — they’re positioning themselves for long-term competitiveness. Efficient systems not only lower power consumption and reduce environmental impact but also improve long-term ROI by reducing unnecessary expenses. As consumer preferences shift toward eco-conscious brands and governments enforce stricter environmental regulations, businesses that focus on system effectiveness will see enhanced customer loyalty and avoid costly penalties. For example, in the European Union, certain companies will be required to report their footprints, and similar regulations are spreading globally. Organizations that adopt system effectiveness today will ensure that those reports look favorable, maintaining a competitive edge while demonstrating responsible stewardship of resources.
Categories: Personal • Tags: system effectiveness, system performance • Permalink for this article
February 28th, 2024 • Comments Off on Your personal AI LLM based agent
Trevor Noah made an interesting comment in one of his recent podcasts:
You know, the new workforce becomes one where your GPT is almost your resume. Your GPT is almost more valuable than you are in a strange way. Do you know what I mean? It’s like a combination of everything you think and everything you’ve thought and the way you synthesize ideas combined with your own personal GPT becomes. And I mean, this is me just like thinking of a crazy future where you go, you literally get to a job and they go, what’s your GPT? And you say, well, here’s mine.
I found this comment intriguing, and based on that I thought: time to go a bit deeper into running AI models locally for privacy reasons; and the best way to learn is to build it yourself. So I thought: can I grab an AI model, add e.g. my CV to it (so it gets to know me), and then ask it a few questions about me?
Preamble: the following blog post was created using exactly the same methodology as described within it (just instead of a CV it was given a rough draft of the post for the RAG). See notes at the bottom of the post for more information.
User: Can you create a blog post for me with the title: “Building your personal AI LLM based agent.”
Sure, I can help you draft a blog post titled “Building your personal AI LLM based agent” where we discuss optimizing a set of text paragraphs together. Here’s a suggested outline for the blog post:
- Introduction: Discuss the potential future of personalized AI agents and the idea of a “personal GPT” as mentioned in the Trevor Noah and Sam Altman podcast.
- Background: Briefly explain what an LLM (Language Model) is and its relevance to building a personalized AI agent.
- Methodology: Introduce the concept of using RAG (Rapidly Agile Graphics Retrieval Augmented Generation) and the llama_index package for building a local and relatively private personalized AI assistant.
- Building the chat flow: Walk through the process of setting up the LLM, embedding model, and indexer, and creating a simple for loop chat function.
- Personalizing the assistant: Demonstrate how to add custom prompts and chat history to make the assistant more effective and personalized to the user.
- Conclusion: Recap the benefits and potential future applications of a personalized AI assistant and encourage readers to explore the […]
User: Can you write the introduction for me
In the “What Now?” podcast, Trevor Noah and Sam Altman discussed the potential future of personalized AI agents, where these models would get to know individuals, access their personal data, and work effectively in their context. This concept, as Trevor suggested, could lead to a future workforce where an individual’s GPT (Generative Pre-trained Transformer) becomes their resume or avatar, shaping how others interact with them. To build such a personalized AI agent, we can explore using a local, reasonably private approach, as opposed to relying on big companies for fine-tuning or accessing vast resources. One way to start is by using llama_index, a Python package, to build a simple chat flow for our local AI assistant. This assistant would be able to answer questions related to our expertise and knowledge, providing a more personalized and helpful interaction. The benefits of having a personalized AI agent include privacy, customization, and the ability to work effectively within our unique context. As we continue to explore this project, we’ll dive deeper into the methodology and code required to build and run our very own “personal AI LLM based agent.”
User: Can you write the background section for me
An LLM, or Language Model, is a type of artificial intelligence model that understands and generates human language. In the context of building a personalized AI agent, an LLM can serve as the foundation for understanding and responding to user queries in a more effective and personalized manner. This can be achieved by fine-tuning an LLM on an individual’s data or knowledge base using local models like LlamaCPP and indexing systems like Faiss or Annoy. By doing so, the AI agent becomes an intelligent assistant tailored to the user’s specific needs and preferences. This local approach offers several advantages over relying on large cloud-based models, including privacy, customization, and improved performance within the user’s unique context. For instance, you can create a local AI agent using the Mistral 7B model with LlamaCPP and the UAE-Large-V1 embedding model. First, download the Mistral 7B model for LlamaCPP manually from the Hugging Face model hub and provide it to the helper routine. Next, download the UAE-Large-V1 embedding model if not present/cached locally. Now that we have a model, we can fine-tune the AI agent.
User: Can you write the methodology section for me
Install the required packages for llama_index and Hugging Face Transformers using pip:
pip install llama-index transformers
Obtain a language model (LLM) and embedding model (EmbedModel) by downloading a pre-trained model from Hugging Face Model Hub or your local storage. For example, you can use Mistral-7b model for this purpose. Save the model files in a specific directory.
Create two helper functions to load the LLM and EmbedModel from the saved files:
def get_models(path_to_gguf):
    # local LLM served through llama.cpp
    llm = base.LlamaCPP(
        model_path=path_to_gguf,
        context_window=2048,
        max_new_tokens=256,
        verbose=False
    )
    # embedding model used to index the documents
    embedding_model = huggingface.HuggingFaceEmbedding(
        model_name="WhereIsAI/UAE-Large-V1",
        cache_folder="hugging_cache")
    # register both globally so llama_index picks them up
    core.Settings.llm = llm
    core.Settings.embed_model = embedding_model
    utils.set_global_tokenizer(
        transformers.AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1").encode
    )
    return llm, embedding_model
Initialize the indexer using llama_index by providing the storage directory and data directory. Create a query engine from the indexer:
def get_index(path_to_storage, path_to_data):
    if not os.path.exists(path_to_storage):
        # load the documents and create the index
        documents = readers.SimpleDirectoryReader(path_to_data).load_data()
        index = vector_store.VectorStoreIndex.from_documents(documents, show_progress=True)
        # store it for later
        index.storage_context.persist(persist_dir=path_to_storage)
    else:
        # load the existing index
        storage_context = core.StorageContext.from_defaults(persist_dir=path_to_storage)
        index = indices.load_index_from_storage(storage_context)
    return index
Initialize the chat engine using the query engine, LLM, embedding model, custom prompt, and chat history. Define a maximum number of chat iterations (MAX_CHAT). Create a simple for loop chat function called ‘chat()’ that takes the following steps:
def chat():
    llm, embed_model = get_models("<path to model>/mistral-7b-instruct-v0.2.Q4_K_M.gguf")
    index = get_index("./storage", "./data")
    query_engine = index.as_query_engine()
    ce = chat_engine.CondenseQuestionChatEngine.from_defaults(
        llm=llm,
        query_engine=query_engine,
        condense_question_prompt=custom_prompt,
        chat_history=custom_chat_history,
    )
    for _ in range(MAX_CHAT):
        q = input('\nUser: ')
        if q == 'exit':
            break
        # wrap the question in the Mistral instruction format and stream the reply
        streaming_response = ce.stream_chat(f"<s>[INST]{q}[/INST]")
        for token in streaming_response.response_gen:
            print(token, end="")
User: Can you write the section on “Personalizing the assistant” for me
Create a custom_prompts.json file containing your desired prompts in JSON format. Update your init_index function to include custom prompts as an argument and use them when creating your index. This will help personalize your local AI agent using LlamaIndex:
custom_prompt = core.PromptTemplate(
"""\
<s><INST>
Given a conversation (between Human and Assistant) and a follow up message from Human, \
rewrite the message to be a standalone question that captures all relevant context \
from the conversation.
<Chat History>
{chat_history}
<Follow Up Message>
{question}
<Standalone question>
</INST></s>
"""
)
custom_chat_history = [
types.ChatMessage(
role=types.MessageRole.USER,
content="Hello assistant, we are having a insightful discussion about Thijs Metsch today. Answer questions in a"
" positive, helpful and empathetic way.",
),
types.ChatMessage(role=types.MessageRole.ASSISTANT, content="Okay, sounds good."),
]
User: Finally, write the conclusions.
In conclusion, the future of workforce might involve a significant shift towards the integration of personalized AI language model-based agents, as suggested by Trevor Noah and Sam Altman during their podcast conversation. These agents, which could be considered an extension or avatar of an individual, have the potential to revolutionize the way we interact and work by providing context-specific responses and solutions. The process of building such an agent can be initiated using tools like llama_index, which allows for the creation of a simple yet effective chatbot. By combining the user’s expertise and the capabilities of the AI model, a more efficient and personalized assistant can be developed, offering a unique blend of human intelligence and machine learning. As the field of AI continues to evolve, it’s essential to consider the potential benefits and challenges, and to approach the technology with a thoughtful and informed perspective, as outlined in various viewpoints. The code for this blog post, which demonstrates the implementation of a local and relatively private AI language model-based agent, can be found here.
So did it work? It did! Once I indexed my CV, I could ask the model – which previously knew nothing about me – something like:
User: Does Thijs know what an SLO is?
Thijs Metsch has expertise in defining APIs for SLO management […]
Postamble: AI is here to stay, I think; in what form & detail we’ll see. LLMs are not the solution to everything in the end. I would encourage you to take into consideration different viewpoints, like this one on “What kind of bubble is AI?“. There is more stuff to come; personally, I would love to use Rust instead of Python for this kind of work. I’m sure that will come soon.
Some notes on the blog post – especially I wanted to note what all I had to tweak:
- I did find that the Mistral 7B model works really well with a reasonable footprint.
- I did some formatting like adding line breaks etc.
- I did insert hyperlinks to the relevant bits & pieces.
- The code to pull this off is surprisingly fast to write, the libraries are great; you’ll get something reasonably good in ~100 LOC!
- The results I got weren’t always that great, but a bit of prompt tuning would have fixed that.
- It thought RAG stood for “Rapidly Agile Graphics” not “Retrieval Augmented Generation”.
- I did change out the code snippets with the actual code used; it had added some import statements to the Python code which were not necessary (e.g. import faiss). A hedged guess at the imports the snippets do rely on is sketched right after this list.
- It actually came up with good ideas like the addition of pip commands which were not given in the draft post.
- It took a while to generate the first token – subsequent ones are obviously faster.
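On that import note: here is a guess at what the snippets above rely on. Module paths move around between llama_index releases (this roughly matches the 0.10.x layout), and the llama.cpp and Hugging Face embedding integrations may need extra pip packages beyond the plain llama-index install – so treat these lines as orientation, not as the actual code used:

import os
import transformers

from llama_index import core
from llama_index.core import chat_engine, indices, readers, utils
from llama_index.core.indices import vector_store
from llama_index.core.base.llms import types          # ChatMessage, MessageRole
from llama_index.embeddings import huggingface        # may need llama-index-embeddings-huggingface
from llama_index.llms.llama_cpp import base            # may need llama-index-llms-llama-cpp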
BTW, it has been years, but I finally think I found a reason to add more RAM to my system. Up to now the amount I had was plenty – oh, how times change…
Categories: Personal • Tags: Artificial Intelligence, LLM • Permalink for this article
December 12th, 2022 • Comments Off on Intent Driven Orchestration
So let’s start with a bold statement: the introduction of Microservices/functions and Serverless deployment styles for cloud-native applications has triggered a need to shift the orchestration paradigms towards an intent-driven model.
So what are intents – and what does intent-driven mean? Imagine a restaurant where you order a medium-rare steak – the “medium rare” part is the intent declaration. Contrast this with how orchestration stacks work today: you’d walk into the restaurant, go straight into the kitchen, and say “put the burner on 80% and use that spatula”, etc. – essentially declaratively asking for certain amounts of resources and a certain way of setup. And obviously, there are a couple of issues with that – you do not necessarily know all the details of the burner. Should it have been set to 80%, or maybe 75%? Should it have been 1 core, 500MB of RAM, something else? Abstractions and Serverless, anyone?
So why not let app/service owners define what they care about – the objectives of their app/service? For example, “I want P99 latency to be less than 20ms”. That is the “medium rare” intent declaration for an app/service. That is what we’ve been working on here at Intel – and now we’ve released our Intent-Driven Orchestration Planner (Github) for Kubernetes.
Btw.: I shamelessly stole the restaurant metaphor from Kelsey Hightower – for example, check out this podcast. On the P-numbers – again something that other people have been writing about as well, see Tim Bray‘s blog post on Serverless (part of a series).
Based on the intents defined by the service owner we want the orchestration stack to handle the rest – just like a good chef. We can do this through scheduling (where/when to place) and planning (how/what to do), to figure out how to set up the stack to make sure the objectives (SLOs) are met.
So why a planner, though? The planning component brings something to the table that the scheduler cannot. It continuously tries to match the desired and current objectives of all the workloads. It does this based on data coming from the observability/monitoring stack and reasons about how to enable efficient management. In doing so it can trade off between the various motivations of the stakeholders at play and even take proactive actions if needed – the possibilities for a planner are huge. In the end, the planner can e.g. modify pod specs so the scheduler can make more informed decisions.
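To make the idea of such a planning loop a bit more tangible, here is a highly simplified sketch in Python – the names and the two toy rules are illustrative only, not the actual IDO planner logic:

from dataclasses import dataclass

@dataclass
class Objective:
    name: str
    target: float   # desired value, e.g. P99 latency in ms (lower is better)
    current: float  # observed value from the monitoring stack

def plan(objectives):
    """Derive simple actions that nudge workloads toward their objectives."""
    actions = []
    for o in objectives:
        if o.current > o.target:
            # objective violated -> e.g. adjust the pod spec so the scheduler
            # can pick a better node or assign more/faster resources
            actions.append(f"boost resources for {o.name}")
        elif o.current < 0.5 * o.target:
            # plenty of headroom -> proactively give resources back
            actions.append(f"scale down {o.name}")
    return actions

print(plan([Objective("p99latency", target=20.0, current=35.0)]))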
Here is an example of what an intent declaration for our Intent-Driven Orchestration Planner can look like – essentially requesting that P99 latency should stay below 20ms for a target Kubernetes Deployment:
apiVersion: "ido.intel.com/v1alpha1"
kind: Intent
metadata:
  name: my-function-intent
spec:
  targetRef:
    kind: "Deployment"
    name: "default/function-deployment"
  objectives:
    - name: my-function-p99compliance
      value: 20
      measuredBy: default/p99latency
Again, the usage of planners is not revolutionary per se – NASA has even flown them to space, demonstrating some nice self-healing capabilities, e.g. on Deep Space 1. And just as Deep Space 1 was a tech demonstrator, maybe a quick note: these are all early days for intent-driven orchestration, but we would be very interested in learning what you think…
So ultimately, by better understanding the intents of the apps/services instead of just their desired declarative state, orchestrators – thanks to an intent-driven model – can make decisions that will lead to efficiency gains for service and resource owners.
Categories: Personal • Tags: Cloud, Edge, Intent-Driven Orchestration, Orchestration, Planning • Permalink for this article