CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices
Problem Statement
LLMs deployed on edge devices for real-time function calling incur significant computational overhead, driving high power consumption and carbon emissions that existing frameworks ignore entirely. Current function-calling optimization methods focus solely on accuracy and latency, leaving energy-constrained edge environments underserved. This creates a growing sustainability problem as edge AI proliferates without any carbon-aware scheduling or model adaptation.
Key Novelty
- Carbon-aware execution loop that integrates real-time carbon intensity forecasts to dynamically adjust power thresholds during LLM function calling
- Dynamic switching between quantized LLM variants (e.g., different quantization levels) to maintain high tokens-per-second throughput under power constraints
- End-to-end sustainability-aware function-calling framework specifically designed and evaluated for edge hardware (NVIDIA Jetson AGX Orin)
Evaluation Highlights
- CarbonCall reduces carbon emissions by up to 52% and power consumption by up to 30% on NVIDIA Jetson AGX Orin compared to non-sustainability-aware baselines
- Execution time is reduced by up to 30% while maintaining high tokens-per-second throughput, demonstrating that sustainability and performance are not mutually exclusive
Methodology
- Monitor real-time carbon intensity signals and translate them into dynamic power budgets/thresholds for the edge device during inference
- Select and switch between pre-quantized LLM variants (differing in quantization level/size) at runtime to stay within the current power budget while maximizing throughput
- Execute function-calling tasks using the selected model variant, measuring tokens-per-second, power draw, and carbon footprint, and feeding results back into the adaptive scheduling loop
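The three steps above can be sketched as a small policy loop. This is a minimal illustration, not the paper's implementation: the linear mapping from carbon intensity to a power cap, the variant names, and the power/throughput figures are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelVariant:
    """A pre-quantized LLM variant with measured runtime characteristics."""
    name: str
    power_draw_w: float    # average power draw during inference (watts)
    tokens_per_sec: float  # measured decoding throughput


def power_budget(carbon_intensity: float,
                 low: float = 100.0, high: float = 400.0,
                 min_watts: float = 15.0, max_watts: float = 50.0) -> float:
    """Map a carbon-intensity forecast (gCO2/kWh) to a device power cap.

    Linear interpolation between two caps is an illustrative policy:
    dirtier grid power -> tighter power budget.
    """
    frac = (carbon_intensity - low) / (high - low)
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return max_watts - frac * (max_watts - min_watts)


def select_variant(variants: list[ModelVariant],
                   budget_w: float) -> ModelVariant:
    """Pick the highest-throughput variant that fits the power budget,
    falling back to the lowest-power variant if none fits."""
    feasible = [v for v in variants if v.power_draw_w <= budget_w]
    if not feasible:
        return min(variants, key=lambda v: v.power_draw_w)
    return max(feasible, key=lambda v: v.tokens_per_sec)


# Hypothetical variant pool for a Jetson-class device.
variants = [
    ModelVariant("fp16", 45.0, 38.0),
    ModelVariant("int8", 30.0, 30.0),
    ModelVariant("int4", 18.0, 22.0),
]
```

With this pool, a clean-grid forecast (low intensity) yields the full 50 W budget and selects the fp16 variant, while a dirty-grid forecast shrinks the budget so only the int4 variant fits; the measured power and throughput of each run would then feed back into the variant table.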
System Components
- Ingests real-time carbon intensity forecasts from external or local sources to determine the current 'greenness' of available compute power
- Translates carbon intensity signals into actionable power caps that constrain the LLM execution environment on the edge device
- Maintains a pool of quantized model variants and dynamically switches between them to match the current power budget while sustaining high tokens-per-second
- Orchestrates tool/function selection and execution within the carbon- and power-constrained LLM inference pipeline on edge hardware
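For the bookkeeping behind these components, per-call energy use can be converted into an emissions estimate by multiplying energy (kWh) by grid carbon intensity (gCO2/kWh). The function below is a sketch of that standard conversion; the function name and numbers are illustrative, not taken from the paper.

```python
def carbon_emissions_g(avg_power_w: float, duration_s: float,
                       carbon_intensity_gco2_per_kwh: float) -> float:
    """Estimate grams of CO2 for one inference window.

    Energy consumed (kWh) times grid carbon intensity (gCO2/kWh).
    """
    energy_kwh = (avg_power_w * duration_s) / 3_600_000.0  # watt-seconds -> kWh
    return energy_kwh * carbon_intensity_gco2_per_kwh
```

For example, a 100 s call drawing 36 W on average at 300 gCO2/kWh consumes 0.001 kWh and emits about 0.3 g of CO2.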
Results
| Metric | Baseline (Standard Function Calling) | CarbonCall | Delta |
|---|---|---|---|
| Carbon Emissions | Baseline level | Up to 52% lower | Up to -52% |
| Power Consumption | Baseline level | Up to 30% lower | Up to -30% |
| Execution Time | Baseline level | Up to 30% lower | Up to -30% |
| Tokens-per-Second Throughput | Baseline level | Maintained | ~0% degradation |
Key Takeaways
- Edge AI deployments can significantly reduce their carbon footprint (up to 52%) by integrating real-time carbon intensity forecasting into LLM inference scheduling without a major performance penalty
- Maintaining a suite of quantized model variants and switching between them dynamically is a practical and effective strategy for balancing sustainability and throughput on resource-constrained hardware like NVIDIA Jetson
- Sustainability should be treated as a first-class constraint alongside latency and accuracy in LLM system design, especially as edge AI scales; CarbonCall provides a concrete framework architecture for doing so
Abstract
Large Language Models (LLMs) enable real-time function calling in edge AI systems but introduce significant computational overhead, leading to high power consumption and carbon emissions. Existing methods optimize for performance while neglecting sustainability, making them inefficient for energy-constrained environments. We introduce CarbonCall, a sustainability-aware function-calling framework that integrates dynamic tool selection, carbon-aware execution, and quantized LLM adaptation. CarbonCall adjusts power thresholds based on real-time carbon intensity forecasts and switches between model variants to sustain high tokens-per-second throughput under power constraints. Experiments on an NVIDIA Jetson AGX Orin show that CarbonCall reduces carbon emissions by up to 52%, power consumption by up to 30%, and execution time by up to 30%, while maintaining high efficiency.