CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices
Problem Statement
LLMs deployed on edge devices for real-time function calling incur significant computational overhead, driving high power consumption and carbon emissions that existing frameworks ignore entirely. Current function-calling optimization methods focus solely on accuracy and latency, leaving energy-constrained edge environments underserved. This creates a growing sustainability problem as edge AI proliferates without any carbon-aware scheduling or model adaptation.
Key Novelty
- Carbon-aware execution loop that integrates real-time carbon intensity forecasts to dynamically adjust power thresholds during LLM function calling
- Dynamic switching between quantized LLM variants (e.g., different quantization levels) to maintain high tokens-per-second throughput under power constraints
- End-to-end sustainability-aware function-calling framework specifically designed and evaluated for edge hardware (NVIDIA Jetson AGX Orin)
Evaluation Highlights
- CarbonCall reduces carbon emissions by up to 52% and power consumption by up to 30% on NVIDIA Jetson AGX Orin compared to non-sustainability-aware baselines
- Execution time is reduced by up to 30% while maintaining high tokens-per-second throughput, demonstrating that sustainability and performance are not mutually exclusive
Methodology
- Monitor real-time carbon intensity signals and translate them into dynamic power budgets/thresholds for the edge device during inference
- Select and switch between pre-quantized LLM variants (differing in quantization level/size) at runtime to stay within the current power budget while maximizing throughput
- Execute function-calling tasks using the selected model variant, measuring tokens-per-second, power draw, and carbon footprint, and feeding results back into the adaptive scheduling loop
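The three steps above can be sketched as a small policy loop. This is a minimal illustration, not the paper's implementation: the linear mapping from carbon intensity to a power cap, the variant names, and the power/throughput figures are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelVariant:
    """A pre-quantized LLM variant with measured runtime characteristics."""
    name: str
    power_draw_w: float    # average power draw during inference (watts)
    tokens_per_sec: float  # measured decoding throughput


def power_budget(carbon_intensity: float,
                 low: float = 100.0, high: float = 400.0,
                 min_watts: float = 15.0, max_watts: float = 50.0) -> float:
    """Map a carbon-intensity forecast (gCO2/kWh) to a device power cap.

    Linear interpolation between two caps is an illustrative policy:
    dirtier grid power -> tighter power budget.
    """
    frac = (carbon_intensity - low) / (high - low)
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return max_watts - frac * (max_watts - min_watts)


def select_variant(variants: list[ModelVariant],
                   budget_w: float) -> ModelVariant:
    """Pick the highest-throughput variant that fits the power budget,
    falling back to the lowest-power variant if none fits."""
    feasible = [v for v in variants if v.power_draw_w <= budget_w]
    if not feasible:
        return min(variants, key=lambda v: v.power_draw_w)
    return max(feasible, key=lambda v: v.tokens_per_sec)


# Hypothetical variant pool for a Jetson-class device.
variants = [
    ModelVariant("fp16", 45.0, 38.0),
    ModelVariant("int8", 30.0, 30.0),
    ModelVariant("int4", 18.0, 22.0),
]
```

With this pool, a clean-grid forecast (low intensity) yields the full 50 W budget and selects the fp16 variant, while a dirty-grid forecast shrinks the budget so only the int4 variant fits; the measured power and throughput of each run would then feed back into the variant table.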
System Components
- Ingests real-time carbon intensity forecasts from external or local sources to determine the current 'greenness' of available compute power
- Translates carbon intensity signals into actionable power caps that constrain the LLM execution environment on the edge device
- Maintains a pool of quantized model variants and dynamically switches between them to match the current power budget while sustaining high tokens-per-second
- Orchestrates tool/function selection and execution within the carbon- and power-constrained LLM inference pipeline on edge hardware
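For the bookkeeping behind these components, per-call energy use can be converted into an emissions estimate by multiplying energy (kWh) by grid carbon intensity (gCO2/kWh). The function below is a sketch of that standard conversion; the function name and numbers are illustrative, not taken from the paper.

```python
def carbon_emissions_g(avg_power_w: float, duration_s: float,
                       carbon_intensity_gco2_per_kwh: float) -> float:
    """Estimate grams of CO2 for one inference window.

    Energy consumed (kWh) times grid carbon intensity (gCO2/kWh).
    """
    energy_kwh = (avg_power_w * duration_s) / 3_600_000.0  # watt-seconds -> kWh
    return energy_kwh * carbon_intensity_gco2_per_kwh
```

For example, a 100 s call drawing 36 W on average at 300 gCO2/kWh consumes 0.001 kWh and emits about 0.3 g of CO2.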
Results
| Metric | Baseline (Standard Function Calling) | CarbonCall | Delta |
|---|---|---|---|
| Carbon Emissions | Baseline level | Up to 52% lower | Up to -52% |
| Power Consumption | Baseline level | Up to 30% lower | Up to -30% |
| Execution Time | Baseline level | Up to 30% lower | Up to -30% |
| Tokens-per-Second Throughput | Baseline level | Maintained | ~0% degradation |
Key Takeaways
- Edge AI deployments can significantly reduce their carbon footprint (up to 52%) by integrating real-time carbon intensity forecasting into LLM inference scheduling without a major performance penalty
- Maintaining a suite of quantized model variants and switching between them dynamically is a practical and effective strategy for balancing sustainability and throughput on resource-constrained hardware like NVIDIA Jetson
- Sustainability should be treated as a first-class constraint alongside latency and accuracy in LLM system design, especially as edge AI scales; CarbonCall provides a concrete framework architecture for doing so
Abstract
Large Language Models (LLMs) enable real-time function calling in edge AI systems but introduce significant computational overhead, leading to high power consumption and carbon emissions. Existing methods optimize for performance while neglecting sustainability, making them inefficient for energy-constrained environments. We introduce CarbonCall, a sustainability-aware function-calling framework that integrates dynamic tool selection, carbon-aware execution, and quantized LLM adaptation. CarbonCall adjusts power thresholds based on real-time carbon intensity forecasts and switches between model variants to sustain high tokens-per-second throughput under power constraints. Experiments on an NVIDIA Jetson AGX Orin show that CarbonCall reduces carbon emissions by up to 52%, power consumption by up to 30%, and execution time by up to 30%, while maintaining high efficiency.