Other Language Version: [Korean] [Japanese]
Big O notation is a notation used to describe the performance of an algorithm mathematically. In particular, it is used to analyze the asymptotic behavior of an algorithm’s time complexity and space complexity as a function of the input size (\(n\)), indicating how the algorithm’s execution time or memory usage increases as the input size grows.
In simpler terms, it is a method for predicting how efficient an algorithm is, particularly in terms of its speed, when the amount of input data becomes extremely large.
1. What is Big O notation?
1) Why use Big O notation?
- Algorithm comparison: Allows you to objectively compare which of several algorithms is more efficient for a specific problem.
- Performance prediction: You can predict how the performance of the algorithm will change as the size of the input data increases.
- Identify bottlenecks: Helps identify and improve areas that are causing performance degradation.
- Hardware independence: It is possible to evaluate the efficiency of the algorithm itself without depending on hardware factors such as the CPU speed or memory capacity of a specific computer.
2) Main features of Big O notation
- Worst-Case Analysis: Generally, Big O notation analyses performance based on the “worst case” in which an algorithm can operate most inefficiently. This means that the performance of the algorithm is guaranteed to be at least at a certain level.
- Ignoring constant terms: Constant time (e.g., \(100n\) in \(100\) that affects the execution time of an algorithm is ignored because its importance becomes negligible when the input size becomes very large. That is, \(O(2n+5)\) is written as \(O(n)\).
- Consider the highest-order port: When there are multiple ports, consider the highest-order port that has the greatest impact as the input size \(n\) increases. For example, \(O(n^2 + n)\) is written as \(O(n^2)\). This is because \(n^2\) increases much faster than \(n\).
3) Main types of Big O complexity (low complexity -> high complexity)
- \(O(1)\) – Constant time complexity (Constant Time):
- It always takes a certain amount of time regardless of the input size.
- Example: Reading an element at a specific index in an array, pushing/popping from a stack.
- \(O(\log n)\) – Logarithmic Time Complexity:
- The larger the input size, the slower the execution time increases. This problem mainly occurs in algorithms that reduce the problem size by half.
- Example: Binary Search.
- \(O(n)\) – Linear time complexity (Linear Time):
- The execution time increases in proportion to the input size.
- Examples: Iterating through all elements of an array once, finding a specific element in an unsorted array.
- \(O(n \log n)\) – Linear logarithmic time complexity (Linearithmic Time):
- It is slower than \(O(n)\) but faster than \(O(n^2)\). It is often seen in efficient sorting algorithms.
- Examples: Merge Sort, Quick Sort.
- \(O(n^2)\) – Quadratic Time Complexity:
- The execution time increases in proportion to the square of the input size. This often occurs in nested loops.
- Examples: Bubble Sort, Selection Sort, Insertion Sort.
- \(O(2^n)\) – Exponential Time Complexity:
- Even a slight increase in input size causes execution time to increase exponentially. This is a highly inefficient algorithm.
- Example: A naive method of calculating the Fibonacci sequence recursively, using a brute force approach to generate all subsets.
- \(O(n!)\) – Factorial time complexity:
- The execution time increases dramatically with each increment of input size. This is one of the most inefficient complexities.
- Example: Solving the travelling salesman problem using the brute force algorithm.
Code example (Python)
- Example 1: \(O(1)\)
- This function always returns the first element regardless of the size of the array, and the time required to do so is always the same. Therefore, it is \(O(1)\).
def get_first_element(arr):
return arr[0]
- Example 2: \(O(n)\)
- This function iterates through all elements of the array once. The execution time increases proportionally to the size of the array \((n)\), so it is \(O(n)\).
def sum_elements(arr):
total = 0
for element in arr:
total += element
return total
- Example 3: \(O(n^2)\)
- This function has two nested loops. Each loop is repeated as many times as the size of the array \((n)\), so a total of \(n \times n=n^2\) operations are performed. Therefore, it is \(O(n^2)\).
def print_pairs(arr):
for i in range(len(arr)):
for j in range(len(arr)):
print(arr[i], arr[j])
Summary
Big O notation is an essential tool for developers to understand the efficiency of algorithms and determine which algorithms are more suitable for large datasets. Understanding Big O notation is crucial for designing and implementing systems with optimal performance.
2. Understanding Big O in the Age of LLM
Applying Big O notation to explain cutting-edge technological concepts such as LLM, MCP (Model Context Protocol), and AI agents is both fascinating and highly significant. Beyond traditional algorithm analysis, it can provide valuable insights into understanding and optimising the performance of complex artificial intelligence systems.
1) New horizons in Big O notation: Complexity analysis in the age of AI
While traditional Big O notation primarily evaluated algorithm efficiency based on CPU computation counts and memory usage, newer technologies such as LLM, MCP, and AI agents require a new perspective on complexity analysis due to the expanded definition of “computation” and the diversification of “resource” types.
The key is to define what the ‘input’ of the system is, what ‘resources’ are consumed during the ‘processing’ process, and how the consumption of those ‘resources’ increases depending on the size of the input.
2) Big O(Token): Language processing complexity of LLM
The core input unit of LLM (Large Language Models) is ‘token’. Tokens can be words, punctuation marks, or even parts of letters. The number of input and output tokens mainly determines the inference time and cost of LLM.
Explanation:
Definition: Big O(Token) is a concept that indicates how the computation time and cost change depending on the number of input tokens \((N_{input})\) and output tokens \((N_{output})\) required for an LLM to perform a specific task.
Key considerations:
- Number of input tokens (Context Window): LLM has a limited ‘Context Window’. Inputs that exceed the window size (maximum number of tokens) cannot be processed or require additional processing such as internal summarisation.
- Self-Attention Mechanism: The Self-Attention Mechanism in the Transformer architecture, which is at the core of LLMs, has a time complexity of \(O(N_{input}^2)\) concerning the input sequence length (number of tokens). This means that the computational load increases exponentially as the number of input tokens increases. This is the primary reason for the limitation on the context window size.
- Self-Attention Mechanism: The Self-Attention Mechanism in the Transformer architecture, which is at the core of LLMs, has a time complexity of \(O(N_{input}^2)\) concerning the input sequence length (number of tokens). This means that the computational load increases exponentially as the number of input tokens increases. This is the primary reason for the limitation on the context window size.
- The importance of prompt engineering: Unnecessarily long prompts increase \(N_{input}\), resulting in a cost of \(O(N_{input}^2)\), while unnecessary verbosity in the output increases \(N_{output}\), resulting in a cost of \(O(N_{output})\). Therefore, efficient prompts are key to optimising Big O(Token).
Example:
- Simple question-answer: Short questions (low \(N_{input}\)) and short answers (low \(N_{output}\)) have a low complexity of \(O(1)\) or close to \(O(N_{output})\) (at a level where self-attention has a negligible effect).
- Long document summarisation: Summarising very long documents (\(N_{input}\) is large) is dominated by \(O(N_{input}^2)\) self-attention costs, with an additional \(O(N_{output})\) generation cost depending on the length of the summarised output (\(N_{output}\)).
- Code generation: As the requirements specification (low \(N_{input}\)) becomes longer and the length of the code to be generated (\(N_{output}\)) increases, the complexity of \(O(N_{output})\) becomes more prominent.
3) Big O (Agent): AI agent collaboration and decision-making complexity
In AI agent systems, especially those where multiple agents collaborate to solve complex problems such as MCP (Model Context Protocol), ‘input’ can refer not only to the amount of data but also to ‘goal complexity’, ‘number of agents’, ‘number of interactions’, and so on.
Explanation:
- Definition: Big O(Agent) is a concept that indicates how the total execution time and resource usage of an AI agent system change depending on the number of agents (\(N_{agent}\)), the number of interactions (\(N_{interaction}\)), or the size of the problem space (\(S_{problem})\).
Key considerations:
- Communication/collaboration between agents: As the number of agents increases, the overhead required for them to exchange information (communication) and coordinate (collaboration) may increase. For example, if all agents need to communicate with all other agents, the communication complexity may be \(O(N_{agent}^2)\).
- Decision-making complexity: The inference process required for each agent to decide on its following action may include internal LLM calls (affected by Big O(Token)) or search algorithms. The complexity of this decision-making process can increase depending on the size of the problem space (\(S_{problem}\)), such as \(O(\log S_{problem})\) or \(O(S_{problem})\).
- Task decomposition and assignment: The process of efficiently decomposing and assigning complex goals to agents is itself complex. This can be expressed as \(O(N_{tasks})\) or \(O(N_{tasks} \log N_{agents})\).
- Feedback loops and retries: The total execution time may increase depending on the number of times agents fail to achieve their goals and retry or go through feedback loops (\(N_{retry}\)). This can be expressed as \(O(N_{retry})\) or in more complex forms.
Example:
- Simple information gathering: When multiple agents independently explore web pages to gather information, each agent’s task is \(O(1)\) or \(O(N_{page})\), but the entire system is processed in parallel in proportion to the number of agents, so Big O(Agent) is close to \(O(N_{page_per_agent})\). (dependent on the longest task).
Collaborative problem solving (e.g., software development agents):
- Suppose Agent A (planning), Agent B (design), Agent C (coding), and Agent D (testing) work sequentially and provide feedback to each other. In that case, the sum of the LLM calls and internal logic costs at each stage becomes the main complexity.
- If, during the design phase, agents A and B must reach agreement through N discussions, this interaction could have a cost of \(O(N_{discussion} \times \text{Big O(Token)}).\)
- As the complexity of overall development goals (Sproblem) increases, the total time required for each agent’s decision-making and collaboration may increase in the form of \(O(S_{problem})\) or \(O(S_{problem}^2)\).
- Search-based AI agents: When agents use search algorithms in problems with large state spaces, such as maze navigation or game play, the complexity can be \(O(S_{state_space})\) or \(O(S_{state_space} \log S_{state_space})\), depending on the size of the search space (\(S_{state_space})\).
4) Practical significance and limitations of Big O analysis
These new Big O concepts provide essential insights for the design and optimisation of AI systems.
- Suggestions for optimisation: For example, if LLM costs are too high, it may be necessary to optimise the input token length to reduce Self-Attention, which causes \(O(N_{input}^2)\), or improve prompt engineering to encourage concise responses to reduce \(O(N_{output})\). If the communication complexity in an AI Agent system is \(O(N_{agent}^2)\), consider optimising the communication method between agents (e.g., using routing instead of broadcast) or introducing a hierarchical agent structure to reduce communication volume.
- Scalability prediction: Helps predict the scalability of the current architecture when a specific AI system needs to handle larger problems or more agents in the future.
- Resource management: This is an essential metric for predicting and efficiently managing LLM API call costs and computing resource (GPU) usage in cloud environments.
Limitations:
- Difficulty of abstraction: The internal logic of LLM or AI agents is highly complex, and it may not be easy to abstract all detailed operations into Big O.
- Importance of constant factors: Big O is an asymptotic analysis, so it ignores constant factors, but in actual services, these constant factors can make significant performance differences (e.g., the default inference speed of a specific LLM).
- Indeterminate characteristics: AI agents’ behaviour can be probabilistic or indeterminate, making it difficult always clearly to define the worst-case scenario.
Summary
Big O notation is a classic concept in computer science. Still, it remains a powerful and essential tool for understanding and managing the complexity of modern artificial intelligence systems such as LLM, MCP, and AI agents. Big O(Token) and Big O(Agent) will extend the existing complexity analysis framework to align with these new computing paradigms, providing a crucial conceptual foundation for designing and optimising the efficiency of AI systems. This will enable us to build more intelligent and scalable AI systems.
3. Considerations for establishing SLM within an organisation
When evaluating the performance of SLM (Small Language Model) models installed on personal PCs or servers within an organisation, it is very important to apply the Big O concept. Unlike cloud-based LLM, in on-premise environments, it is necessary to utilise limited hardware resources as efficiently as possible, which makes algorithm complexity analysis even more important.
The following is a specific idea for applying the Big O concept to the performance evaluation of SLM models.
1) Big O(Token) for SLM Inference
Like LLM, SLM also operates based on tokens. However, SLM has fewer parameters than LLM and is often trained to specialise in specific domains, so even with the same number of tokens, the actual operation and optimisation strategies may differ.
Analysis of inference time complexity according to input token length (\(N_{input}\))
- \(O(N_{input})\) (linear): This is the ideal scenario. If the inference time increases linearly as the number of input tokens increases, the SLM is considered to be very efficiently designed, or that specific optimisations (e.g., sparse attention, compression techniques) have been effectively applied. In practice, due to the self-attention characteristics of the Transformer architecture, achieving pure \(O(N_{input})\) is challenging. Still, it is possible to make it behave close to linear through specific optimisations.
- \(O(N_{input} \log N_{input})\) (linear log): This complexity can be seen in some efficient transformer variants (e.g., linear attention variants) or specialised SLM architectures. It is much more efficient than the \(O(N_{input}^2)\) complexity of LLM.
- \(O(N_{input}^2)\) (quadratic): This is the inherent complexity of self-attention in a typical transformer architecture. Even with SLM, if this part is not optimised, performance will slow down rapidly as the input length increases. In on-premises environments, this can lead to GPU memory shortages or CPU bottlenecks.
- Evaluation method: Give SLM inputs of various lengths (\(N_{input})\) and measure the inference time (TTFT: Time To First Token, TTPT: Time To Per Token) for each input length and visualise it in a graph. You can estimate the approximate Big O from the slope or curve of the graph.
Analysis of generation time complexity according to output token length (\(N_{output}\)):
- \(O(N_{output})\) (linear): Most LLM/SLM generate tokens sequentially, so the generation time increases in proportion to the number of output tokens. This is a common phenomenon, and rather than reducing this complexity itself, it is essential to increase the token generation speed (Tokens per second).
- Evaluation method: Request the generation of responses of various lengths (\(N_{output})\) for a fixed input token length, and measure the total generation time for each response length.
Analysis of factors affecting token complexity (on-premises specialisation):
- Model size (number of parameters): Even SLMs can have billions of parameters (≈ 7B, 13B). This directly affects GPU memory usage. Big O does not directly correlate with the number of parameters. Still, as the number of parameters increases, the time required for a single operation increases, and the likelihood of bottlenecks due to memory constraints increases.
- Quantisation level: Depending on the quantisation level, such as INT8 or INT4, the processing speed and memory usage will vary. Quantisation reduces the constant time of \(O(1)\) operations, but quantising too low may result in accuracy loss.
- Batch Size: When batch processing multiple requests at once, the actual Big O can be \(O(N_{input}^2 \times \text{Batch Size})\). However, since this improves overall throughput by increasing hardware utilisation, a larger batch size is generally advantageous. It is essential to find the optimal batch size under the GPU memory constraints of a personal computer or server.
- Hardware constraints: Depending on hardware specifications such as CPU, GPU type, RAM, and VRAM, the actual execution time may vary significantly even if the Big O complexity is the same. VRAM is significant for model loading and context maintenance.
2) Big O(Concurrent Requests) for SLM Deployment
In an organisational server environment, multiple users can send requests to SLM simultaneously. In this case, changes in system throughput and latency due to the number of concurrent requests can be analysed using the Big O concept.
Throughput (Queries Per Second, QPS) complexity according to the number of simultaneous requests (\(N_{requests}\))
- \(O(1)\) (constant throughput): If hardware resources are infinite or the system is highly optimised for parallel processing, the processing time for each request remains steady even as the number of concurrent requests increases, and the total throughput remains nearly constant (an ideal situation that is almost impossible to achieve).
- \(O(Batch Size)\) or (O(N_{requests}/Batch\space Size)): In reality, when the batch size is fixed, the throughput is proportional to (number of concurrent requests / batch size). That is, efficiency increases whenever the number of simultaneous requests is a multiple of the batch size, and the same resources may be consumed in between.
- \(O(N_{requests})\) (linear decrease): As the number of concurrent requests increases, the processing time per request increases linearly, and the total throughput tends to reach saturation or decrease. This indicates the limitations of parallel processing or resource (CPU cores, GPU memory, IO) bottlenecks.
- Evaluation Method: Using load testing tools such as JMeter and Locust, we gradually increase the number of concurrent requests to measure the system’s QPS and the average latency of each request. This allows us to determine how many simultaneous requests the system can efficiently handle and at what point performance begins to degrade due to increased complexity.
Latency complexity according to the number of simultaneous requests (\(N_{requests}\))
- \(O(1)\) (constant latency): An ideal situation that occurs when the number of concurrent requests is very low.
- \(O(N_{requests})\) (linear increase): This is the most common scenario in which the latency of each request increases linearly due to queuing or resource contention as the number of concurrent requests increases.
- \(O(N_{requests}^2)\) (quadratic increase): This can occur when there is a severe bottleneck or inefficient resource management (e.g., intensified lock competition).
Resource usage (CPU, GPU, RAM):
- Monitor how CPU usage, GPU usage, and memory usage change depending on the number of simultaneous requests ((N_{requests})).
- For example, if CPU usage increases linearly as (O(N_{requests})) and QPS no longer increases or decreases after exceeding a certain threshold, you can determine that the CPU is the bottleneck.
- If GPU memory usage increases to (O(N_{requests})) and exceeds a specific VRAM capacity, model loading may fail or swapping may occur, causing a sharp decline in performance.
3) SLM optimization and deployment strategy proposal through Big O analysis
Based on the Big O analysis results, you can establish SLM optimization and deployment strategies tailored to individual PC and organizational server environments.
- Model selection: Select the SLM model with the lowest Big O complexity within a specific \(N_{input}\) range. For example, if the main purpose is short question answering, a lightweight model close to \(O(N_{input})\) is advantageous.
- Hardware upgrade priorities: Based on the analysis results, we determine the optimal hardware investment direction, such as increasing the number of CPU cores if the CPU is a bottleneck, or upgrading to a GPU with more VRAM if the GPU memory is insufficient.
- Batch processing optimization: Understand the complexity of \(O(Batch \space Size)\) or \(O(N_{requests}/Batch \space Size)\), and find and set the optimal batch size to achieve maximum throughput on the given hardware.
- Service queue and scheduling: \(O(N_{requests})\) To mitigate increased latency, consider using a request queue to regulate load or introducing priority-based scheduling.
Application of model lightweighting techniques:
- Quantization: Reduces model size and inference speed, resulting in a constant time reduction in \(O(1)\) operations.
- Pruning: Reduces the amount of computation by removing unnecessary connections in the model.
- Knowledge Distillation: “Distills” knowledge from larger LLMs into SLMs so that smaller models can achieve similar performance.
- These techniques may not change the intrinsic Big O, but they significantly reduce the “constant factor” within the same Big O, thereby improving actual performance.
- Caching strategy: By caching responses to frequently asked questions or patterns, you can improve the overall Big O (Request) of the system by responding in \(O(1)\) without actual model inference.
- Introduction of RAG (Retrieval Augmented Generation): Instead of using the vast knowledge of LLM to answer complex questions, RAG searches for relevant information in local databases or documents \((O(\log N_{docs})\) or \(O(N_{docs}))\) and provides it to SLM, thereby reducing the \(N_{input}\) complexity of SLM. This allows the SLM to focus on domain-specific information processing without needing to learn vast amounts of knowledge, thereby optimizing overall performance.
The application of this Big O concept will provide a critical benchmark for efficiently deploying and managing SLM in limited environments such as personal PCs and servers within organizations.
4. Big O Concept for Open SLM Model Memory Requirements
1) Time Complexity
Time complexity indicates how the computation time of an algorithm changes depending on the input size (N). It is based on the worst-case scenario and is an indicator of the efficiency of the algorithm itself, regardless of hardware performance.
Key considerations:
- Number of operations: The total number of core operations (addition, multiplication, comparison, data access, etc.) performed by the algorithm is expressed as a function of N.
- GPU Performance (Floating Point Operations Per Second, FLOPS): The more floating point operations a GPU can perform per second, the faster it can complete algorithms with a specific time complexity. For example, if there is an \(O(N^2)\) algorithm, a GPU with high FLOPS will provide greater time savings as N increases. However, Big O itself does not change proportionally to FLOPS; it indicates the trend of the increase in the number of operations as N increases.
- Parallel processing: GPUs are inherently optimized for parallel computation. If a specific algorithm is well suited for parallelization, even algorithms with the same \(Big \space O(N)\) can run much faster than sequential processing.
- Example: Matrix multiplication (\(O(N^3)\) in general) can be significantly accelerated using GPUs when N is huge, thanks to parallel processing. Self-Attention in LLM/SLM (\(O(N_{token}^2)\)) is also possible in practice thanks to parallel processing on GPUs.
Inference time complexity of SLM models (Big O(Token) perspective):
- \(O(N_{token}^2)\) (Self-Attention): This is the most significant time complexity factor in Transformer-based models. As the input token length (\(N_{token}\)) increases, the computational load increases quadratically. This is the part that places the most significant burden on GPU FLOPS and VRAM. When using SLMs with short context windows on personal computers or servers, the \(N_{token}\) value decreases, reducing the absolute computational load of \(O(N_{token}^2)\) and resulting in significantly shorter actual execution times.
- \(O(n_{token})\) (FFN, LayerNorm, etc.): Feed-Forward Network, Layer Normalization, etc., have computational complexity that is linearly proportional to the input token length.
- Summary: The total inference time complexity of SLM is mainly determined by Self-Attention (\(O(N_{token}^2)\)) and token generation (\(O(N_{output})\)). In on-premise environments, if GPU FLOPS are sufficient, \(N_{token}\) and \(N_{output}\) mainly determine the actual inference time.
2) Space Complexity
Space complexity indicates how the amount of memory required for an algorithm to run changes depending on the input size (N).
Key considerations:
- GPU memory (VRAM): This is the main space where SLM model weights are loaded. It also stores activation values and K/V caches during inference. If VRAM is insufficient, the model cannot be loaded, or inference speed will slow down significantly (due to data transfer between CPU and GPU and swapping).
- PC (server) memory (RAM): Used to load model weights from disk to VRAM or for CPU inference. It is essential when loading multiple models simultaneously or processing massive datasets. Techniques such as offloading part of the model to RAM (CPU offloading, Quantization with Offloading) can be used when VRAM is insufficient, but this results in speed degradation due to data transfer between GPU and RAM.
Memory requirements of SLM models (Big O(Memory) perspective):
- Model weight size: Determined by the number of parameters (P) in the SLM model.
- Model size = P * (number of bytes for each parameter)
- Example: A 7B (7 billion) parameter model stored in FP16 (2 bytes) requires \(7 \times 10^9 \times 2 \approx 14 \text{GB}\) of VRAM.
- This is \(O(P)\) (or \(O(1)\) because \(P\) is the “size of the model itself,” which is independent of the input size N. However, when comparing multiple SLMs, \(P\) becomes an important factor for comparison).
- Activations: A space where the output values of each layer are temporarily stored during the inference process. In the case of transformer models, memory is consumed in proportion to the input token length (\(N_{token}\)) and the hidden dimension (\(D_{hidden}\)) of the model in the self-attention mechanism.
- \(O(N_{token}×D_{hidden})\): Store basic activation values.
- \(O(N_{token}^2×Batch \space Size)\): Storage of intermediate results that may occur when calculating the query (Q), key (K), and value (V) matrices of Self-Attention (especially Softmax attention scores).
- The activation value is usually released when inference is complete, but VRAM usage increases rapidly as \(N_{token}\) becomes longer.
- KV Cache: When generating tokens sequentially, stores the Key and Value matrix of previously generated tokens to avoid unnecessary recalculation. Proportional to the input token length (\(N_{input}\)) and output token length (\(N_{output}\)).
- \(O((N_{input}+N_{output})×Batch \space Size×D_{head}×N_{layers})\):
- \(N_{input}+N_{output}\): Context length
- \(Batch \space Size\): Number of requests processed simultaneously
- \(D_{head}\): Dimension of Attention Head
- \(N_{layers}\): Number of layers in the model
- \(O((N_{input}+N_{output})×Batch \space Size×D_{head}×N_{layers})\):
- This cache consumes more VRAM as the batch size increases or the input/output length becomes longer, and is the main cause of VRAM shortage, especially in Long Context inference of LLM/SLM.
3) Analysis of memory requirements for recently released Open SLM models (applying the Big O concept)
Recently released Open SLM models often attempt to reduce memory requirements through various sizes (number of parameters) and architectural optimizations. This can be analyzed using the Big O concept.
Reduction in the number of parameters (model size): \(O(P)\) improvement
- Example: Llama 3 8B, Phi-3-mini (3.8B), Gemma 2B/7B
- Analysis: Simply reducing the number of parameters reduces VRAM requirements to \(O(P)\).
- Big O: \(P\) is not a direct function of input (N), but it is the most direct way to reduce the “constant factor” of the space consumed by the model itself.
- Meaning: Enables smaller models to be loaded on personal PCs or servers with limited VRAM. This also has a positive effect on inference speed (as smaller models require less computation).
Quantization: Reduction of constant factors in \(O(P)\)
- Example: Various quantization level models such as Q4_K_M and Q8_0 in GGUF (llama.cpp) format.
- Analysis: By lowering the precision of weights from FP16 to INT8, INT4, etc., the model size can be reduced by more than 2 or 4 times, even with the same number of parameters.
- Big O: Significantly reduces the VRAM/RAM size required for model loading within the \(O(P)\) scale. 4-bit quantization requires only 1/4 of the memory compared to 16-bit quantization.
- Meaning: Enables loading larger models (originally FP16) even with limited VRAM, processing larger batch sizes with the same model, or maintaining longer contexts.
Efficient architecture (Long Context Optimization): Constant factor or degree reduction of \(O(N_{token}^2))\
- Example: Mistral, Mixtral (MoE), RWKV, Long-RoPE, FlashAttention, etc.
- Analysis:
- Mixtral (MoE): Although the model itself is large, the number of parameters activated for specific tokens is small (Sparse Activation), reducing the amount of computation required for actual inference.
- \(Big \space O(Time)\):\(O(N_{token}^2)\) still exists, but it reduces the computational constant factor in proportion to the number of activated parameters, such as \(O(P_{active})\).
- \(Big \space O (Space)\): The model as a whole is large, but the number of activation values loaded into memory at a given point in time can be reduced. However, the total weight load remains \(O(P_{total})\), so the VRAM burden remains the same.
- Improvements to Long-RoPE, ALiBi, and other positional encoding: These are improved positional embedding techniques for efficiently processing long contexts.
- \(Big \space O (Time/Space)\): When \(N_{token}\) becomes very long, it helps to manage the time complexity of \(O(N_{token}^2)\) and the space complexity of the K/V cache of \(O(N_{token})\) relatively better. Rather than changing the Big O itself, increasing the upper limit of \(N_{token}\) enhances practical usability.
- FlashAttention (GPU optimization): By optimizing the Self-Attention operation for GPUs, we improve VRAM access patterns and reduce VRAM usage by not storing intermediate results in DRAM.
- \(Big \space O(Time)\): The intrinsic complexity of \(O(N_{token}^2)\) remains the same, but the ‘actual execution constant time’ is significantly reduced, enabling faster inference.
- \(Big \space O (Space)\): Reduces space complexity to \(O(N_{token})\), enabling processing of longer contexts.
- Mixtral (MoE): Although the model itself is large, the number of parameters activated for specific tokens is small (Sparse Activation), reducing the amount of computation required for actual inference.
Streaming/offloading techniques (memory utilization): Flexible management of \(O(P)\)
- Example: llama.cpp (CPU offloading, GGML/GGUF), vLLM (PagedAttention)
- Analysis:
- CPU offloading: When GPU VRAM is insufficient, some of the model weights are moved to PC (server) RAM and loaded.
- \(Big \space O (Space)\): Distributes the total (O(P)) weight of the model across GPU VRAM and PC RAM to overcome the limitations of a single VRAM capacity.
- \(Big \space O (Time)\): Data transfer between GPU and RAM occurs, increasing the ‘constant factor’ of inference time and potentially causing bottlenecks in \(O(N_{token})\) or \(O(N_{token}^2)\) operations.
- PagedAttention (vLLM): Manages K/V cache memory in pages to reduce unnecessary memory allocation and prevent memory fragmentation.
- \(Big \space O (Space)\): By more efficiently managing the space complexity of the K/V cache, which is \(O((N_{input}+N_{output})×Batch \space Size)\), it is possible to process more concurrent requests or longer contexts in the same VRAM.
- CPU offloading: When GPU VRAM is insufficient, some of the model weights are moved to PC (server) RAM and loaded.
Summary
When deploying SLM on personal PCs or servers within an organization, it is important to analyze not only the parameter size (P) of the model, but also the time complexity with respect to \(N_{token}\) (the presence or absence of \(O(N_{token}^2)\) and its constant factor) and the space complexity with respect to VRAM/RAM (model weight \(O(P)\) and activation value \(O(N_{token}^2\) K/V cache, \(O(N_{token}^2\) activation value) comprehensively. \(O(N_{token})\) K/V cache, \(O(N_{token}^2)\) activation values).
Recent Open SLM models have attempted to reduce the number of parameters or apply quantization to reduce the constant factor \(O(P)\), and improve efficiency in terms of time complexity \(O(N_{token}^2)\) and space complexity \(O(N_{token})\) through Long Context optimization and FlashAttention.
Therefore, when evaluating SLM models:
- Understand the basic \(O(P)\) space requirements through the number of model parameters.
- Check how much more \(O(P)\) space can be reduced by enabling quantization support.
- Through architecture optimization (Long-RoPE, FlashAttention, etc.), we must measure how efficiently the \(O(N_{token}^2)\) time complexity and \(O(N_{token})\) K/V cache space complexity for long \(N_{token}\) are managed, and how well actual GPU performance (FLOPS) is utilized.
- Consider whether \(O(P)\) space shortage issues can be circumvented by utilizing PC/server RAM through distribution techniques such as CPU offloading.
This Big O-based analysis provides critical insights for maximizing SLM performance with limited on-premises resources and making informed hardware investment decisions.
.End.
1 thought on “Improve Agentic AI Performance Through Understanding Big O”