Battery-Aware AI Scheduling in iOS Apps: Architecture Patterns for On-Device Inference

How to build an inference scheduler that routes requests by battery level, thermal state, and priority tier — keeping on-device AI useful across the full battery range without draining the device.

By Ehsan Azish · 3NSOFTS · May 2026 · 10 min read

On-device AI inference is not free. Every Core ML call draws from the battery, heats the SoC, and competes with the OS scheduler. Battery-aware scheduling is not a performance optimisation you add later — it is a design constraint that shapes how inference requests enter the system.

The constraint that shapes everything

On-device inference runs on the Neural Engine, the GPU, or the CPU — depending on the model, the runtime, and device state. The Neural Engine is the most efficient path, but the OS controls access to it. When the device enters low-power mode, background processing budgets shrink. When the SoC heats up, the OS throttles clock speeds. When battery drops below a threshold, BGProcessingTask requests are deferred or denied outright.

The constraint that shaped everything in offgrid:AI: inference must remain useful across the full battery range, not just when conditions are ideal. Every architectural decision flows from that.

Why naive inference scheduling fails

The naive approach is to call inference directly from the view model — user taps a button, a Task fires, the model runs. This works on a bench device. In production, it fails in three ways:

  • It gives the OS no signal about the relative importance of the work. A low-priority background classification runs at the same priority as a foreground response the user is actively waiting on.
  • There is no mechanism to defer non-urgent inference when battery or thermal conditions make running it expensive.
  • Inference calls accumulate without coalescing. A user scrolling through a feed can trigger dozens of classification requests in seconds. Without a queue, each runs independently, preventing any batching optimisation the Neural Engine could otherwise apply.

Reading battery state before scheduling

UIDevice.current.batteryState and UIDevice.current.batteryLevel are the entry points. Battery monitoring must be enabled explicitly; until it is, batteryState reports .unknown and batteryLevel reports -1.0:

UIDevice.current.isBatteryMonitoringEnabled = true

The relevant states map to four operating conditions:

  • .charging or .full — full inference budget available
  • .unplugged at level > 0.20 — standard budget, no deferral
  • .unplugged at level ≤ 0.20 — reduced budget; non-critical inference defers
  • ProcessInfo.processInfo.isLowPowerModeEnabled — hard signal to suspend all non-foreground inference

Low-power mode is the clearest signal. When it is active, the user has explicitly told the OS to conserve energy. Inference that is not directly serving a foreground interaction should not run.
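The four conditions above can be expressed as a small pure function. The following is a minimal sketch rather than the article's implementation: PowerSnapshot, PowerSource, OperatingCondition, and classify are hypothetical names, and the snapshot is assumed to be filled in elsewhere from UIDevice.current.batteryState, UIDevice.current.batteryLevel, and ProcessInfo.processInfo.isLowPowerModeEnabled. Treating an unknown battery level as the standard budget is also an assumption.

```swift
import Foundation

// Hypothetical mirror of UIDevice.BatteryState, kept free of UIKit so the
// decision logic can be unit-tested off-device.
enum PowerSource {
    case charging, full, unplugged, unknown
}

// Hypothetical value type; populate it from UIDevice and ProcessInfo at the call site.
struct PowerSnapshot {
    var source: PowerSource
    var level: Float        // 0.0...1.0, or -1.0 when monitoring is disabled
    var lowPowerMode: Bool
}

enum OperatingCondition: Equatable {
    case fullBudget       // charging or full
    case standardBudget   // unplugged, level above 20%
    case reducedBudget    // unplugged, level at or below 20%
    case foregroundOnly   // low-power mode is on
}

func classify(_ snapshot: PowerSnapshot) -> OperatingCondition {
    // Low-power mode is checked first because it overrides the other signals.
    if snapshot.lowPowerMode { return .foregroundOnly }

    switch snapshot.source {
    case .charging, .full:
        return .fullBudget
    case .unplugged, .unknown:
        // Assumption: a negative (unknown) level falls back to the standard budget.
        if snapshot.level >= 0 && snapshot.level <= 0.20 {
            return .reducedBudget
        }
        return .standardBudget
    }
}
```

Keeping the classification separate from the UIKit reads means the thresholds can be tested with plain values, and the 20% cut-off lives in one place.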

Scheduling architecture

The inference queue

The inference scheduler sits between the call site and the Core ML model. Nothing calls the model directly. Every request transits through the scheduler, which evaluates battery state, thermal state, and request priority before deciding whether to run immediately, defer, or drop.

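actor InferenceScheduler {
    // SomeMLModel stands in for the app's own Core ML wrapper type.
    private let model: SomeMLModel
    // Requests held back by the budget check, re-evaluated when conditions change.
    private var pendingTasks: [InferenceRequest] = []

    init(model: SomeMLModel) {
        self.model = model
    }

    func enqueue(_ request: InferenceRequest) async throws -> InferenceResult {
        // Evaluate battery, low-power, and thermal state at the moment of the call.
        let budget = BatteryBudget.current()
        guard budget.allows(request.priority) else {
            // This minimal version reports the deferral to the caller.
            throw InferenceError.deferred(reason: budget.deferralReason)
        }
        return try await model.perform(request)
    }
}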

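The actor isolation here is not cosmetic. Inference requests from multiple call sites (a view model, a background sync handler, a widget timeline provider) all transit through a single actor-isolated queue. Access to the scheduler's state is serialised by the Swift concurrency runtime, not by manual locking. Actors are reentrant, however: while one call is suspended at an await, another can begin, so the actor protects the queue itself rather than guaranteeing that only one inference runs at a time.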

Priority tiers

Not all inference is equal. A three-tier model covers most production cases:

  • .critical — foreground, user-initiated, blocking UI. Runs regardless of battery state. Example: a user waiting on a response in a chat interface.
  • .standard — foreground but not blocking. Runs unless low-power mode is active. Example: pre-classifying content as the user scrolls.
  • .background — non-user-visible. Defers when battery is below 20% or low-power mode is active. Example: indexing, pre-computation, cache warming.
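The three tiers above can be encoded as an ordered enum plus one gating function. This is a sketch under stated assumptions: InferencePriority and mayRun are hypothetical names, the inputs are passed in as plain values rather than read from UIDevice, and exempting a charging device from the 20% rule is an assumption not stated in the tier list.

```swift
import Foundation

// Hypothetical priority type; raw values order the tiers from least to most urgent.
enum InferencePriority: Int, Comparable {
    case background = 0
    case standard = 1
    case critical = 2

    static func < (lhs: InferencePriority, rhs: InferencePriority) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

// Returns true when a request of the given tier may run now.
// batteryLevel is expected in 0.0...1.0.
func mayRun(_ priority: InferencePriority,
            batteryLevel: Float,
            isCharging: Bool,
            lowPowerMode: Bool) -> Bool {
    switch priority {
    case .critical:
        // Foreground, blocking work runs regardless of battery state.
        return true
    case .standard:
        // Blocked only by low-power mode.
        return !lowPowerMode
    case .background:
        // Assumption: a charging device is exempt from the 20% rule.
        let batteryLow = !isCharging && batteryLevel <= 0.20
        return !lowPowerMode && !batteryLow
    }
}
```

Making the enum Comparable also lets a scheduler express its current limit as a single "minimum tier" value and compare requests against it.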

Deferral and coalescing

Deferred requests do not disappear. They accumulate in pendingTasks and re-evaluate when battery state changes. The scheduler observes UIDevice.batteryLevelDidChangeNotification and the low-power-mode notification, which Swift exposes as Notification.Name.NSProcessInfoPowerStateDidChange (NSProcessInfoPowerStateDidChangeNotification in Objective-C), to trigger re-evaluation:

import Combine
import Foundation

// Posted whenever the user or the OS toggles low-power mode.
NotificationCenter.default.publisher(
    for: .NSProcessInfoPowerStateDidChange
)
.sink { [weak self] _ in
    Task { await self?.drainDeferredQueue() }
}
.store(in: &cancellables)
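What a drain step might do with the held requests can be sketched as follows. This is an illustration, not the article's drainDeferredQueue(): Tier, DeferredQueue, hold, and drain are hypothetical names, and each request is reduced to a closure for brevity.

```swift
import Foundation

// Hypothetical tiers, ordered from least to most urgent.
enum Tier: Int, Comparable, Sendable {
    case background, standard, critical

    static func < (lhs: Tier, rhs: Tier) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

actor DeferredQueue {
    private struct Entry {
        let tier: Tier
        let work: @Sendable () async -> Void
    }

    private var pending: [Entry] = []

    var count: Int { pending.count }

    // Holds a request that the current budget does not allow.
    func hold(tier: Tier, work: @escaping @Sendable () async -> Void) {
        pending.append(Entry(tier: tier, work: work))
    }

    // Runs every held request whose tier is now allowed; returns how many ran.
    @discardableResult
    func drain(minimumTier: Tier) async -> Int {
        let ready = pending.filter { $0.tier >= minimumTier }
        // Remove before awaiting: the actor is reentrant, so a second drain
        // that starts mid-loop must not see the same entries.
        pending.removeAll { $0.tier >= minimumTier }
        for entry in ready {
            await entry.work()
        }
        return ready.count
    }
}
```

The notification handler would call drain with whatever minimum tier the current battery and thermal state permit; requests below that tier simply stay queued for the next change.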

Coalescing applies to background tasks with identical input signatures. If five requests to classify the same content type arrive within a 500ms window, the scheduler runs one and fans the result out to all five callers.
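One way to get this fan-out behaviour is to key in-flight work by input signature and let later callers await the first caller's task. The sketch below assumes a hypothetical Coalescer actor and a caller-supplied key; the 500ms window is modelled as a delay before the shared work starts, which is one possible reading of the window and not necessarily how the article's scheduler does it.

```swift
import Foundation

actor Coalescer<Key: Hashable & Sendable, Value: Sendable> {
    // One shared task per input signature currently inside its window.
    private var inFlight: [Key: Task<Value, Error>] = [:]
    private let windowNanoseconds: UInt64

    init(windowNanoseconds: UInt64 = 500_000_000) {
        self.windowNanoseconds = windowNanoseconds
    }

    // Callers that arrive with the same key while a task is pending share its result.
    func value(for key: Key,
               run work: @escaping @Sendable () async throws -> Value) async throws -> Value {
        if let existing = inFlight[key] {
            return try await existing.value
        }
        let delay = windowNanoseconds
        let task = Task<Value, Error> {
            // Wait out the window so late arrivals can join before the work starts.
            try await Task.sleep(nanoseconds: delay)
            return try await work()
        }
        inFlight[key] = task
        // Only the caller that created the task clears the slot.
        defer { inFlight[key] = nil }
        return try await task.value
    }
}
```

The trade-off is latency: every coalesced request waits for the window to close, which is why the text limits coalescing to background work.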

Model selection as a runtime decision

Many production apps ship more than one model variant — a full-precision model for high-accuracy tasks and a quantized INT4 or INT8 variant for constrained conditions. With a scheduler in place, model selection at runtime becomes a direct consequence of battery state.

The scheduler holds references to both variants. When battery drops below the threshold, it routes requests to the quantized model. Inference accuracy may decrease marginally. Inference speed and energy cost decrease substantially.
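Routing between the two variants can be sketched as a small generic holder. ModelRouter and its members are hypothetical names, the 20% default threshold is borrowed from the battery conditions earlier in the article, and always preferring the full-precision model while charging is an assumption.

```swift
import Foundation

// Hypothetical holder for both variants; Model would be the app's Core ML wrapper type.
struct ModelRouter<Model> {
    let fullPrecision: Model
    let quantized: Model
    var threshold: Float = 0.20

    // batteryLevel is 0.0...1.0, or negative when the level is unknown.
    func model(batteryLevel: Float, isCharging: Bool) -> Model {
        // Assumption: on external power the energy cost is not a concern.
        if isCharging { return fullPrecision }
        // Known level at or below the threshold: switch to the cheaper variant.
        if batteryLevel >= 0 && batteryLevel <= threshold {
            return quantized
        }
        return fullPrecision
    }
}
```

For example, ModelRouter(fullPrecision: a, quantized: b).model(batteryLevel: 0.15, isCharging: false) would return b, while the same call with isCharging: true would return a.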

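Core ML Tools supports post-training quantization. As a rule of thumb, a model quantized to INT8 can run roughly 2–4x faster on the Neural Engine than its FP32 equivalent, with accuracy loss that is often below the threshold of user perception for classification tasks. The actual gain depends on the model architecture and the device, so it is worth measuring on target hardware before committing to a variant.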

Thermal state as a secondary signal

Battery level is the primary signal. Thermal state is the secondary one. ProcessInfo.processInfo.thermalState surfaces four levels: .nominal, .fair, .serious, .critical.

At .serious, the OS has already begun throttling. Running full-precision inference at that point actively worsens the situation — the model runs slower, generates more heat, and extends the time the device stays throttled.

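import UIKit

struct BatteryBudget {
    // Snapshots the three signals once, then derives the lowest priority tier
    // that may still run. Self.floor(lowPower:thermal:level:) is not shown here.
    static func current() -> BatteryBudget {
        let lowPower = ProcessInfo.processInfo.isLowPowerModeEnabled
        let thermal = ProcessInfo.processInfo.thermalState
        // batteryLevel is -1.0 unless isBatteryMonitoringEnabled is true.
        // UIDevice is main-actor isolated in recent SDKs, so this read may
        // need to happen on the main actor.
        let level = UIDevice.current.batteryLevel

        return BatteryBudget(
            minimumPriority: Self.floor(
                lowPower: lowPower,
                thermal: thermal,
                level: level
            )
        )
    }
}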
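The floor(lowPower:thermal:level:) helper is not shown in the snippet above. One plausible body, written here as a free function named minimumTier, might combine the three signals as follows. Tier and ThermalLevel are hypothetical stand-ins (ThermalLevel mirrors ProcessInfo.ThermalState), and the mapping from thermal level to tier is an assumption: the article only says that heavy inference at .serious makes throttling worse.

```swift
import Foundation

// Hypothetical priority tiers, ordered from least to most urgent.
enum Tier: Int, Comparable {
    case background, standard, critical

    static func < (lhs: Tier, rhs: Tier) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

// Hypothetical mirror of ProcessInfo.ThermalState.
enum ThermalLevel {
    case nominal, fair, serious, critical
}

// Returns the lowest tier that is still allowed to run.
// level is 0.0...1.0, or negative when the battery level is unknown.
func minimumTier(lowPower: Bool, thermal: ThermalLevel, level: Float) -> Tier {
    // Low-power mode or critical heat: only work the user is blocked on.
    if lowPower || thermal == .critical { return .critical }

    // Low battery or serious heat: drop background work, keep foreground work.
    let batteryLow = level >= 0 && level <= 0.20
    if batteryLow || thermal == .serious { return .standard }

    // Otherwise every tier may run.
    return .background
}

// A request runs when its tier is at or above the current floor.
func allows(_ requested: Tier, floor: Tier) -> Bool {
    requested >= floor
}
```

Because the signature takes only three inputs, this sketch cannot tell a charging device from an unplugged one; a real implementation would probably pass the battery state as well.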

Background inference with BGProcessingTask

Some inference workloads are genuinely background — model fine-tuning steps, large-batch classification runs, index updates. These belong in BGProcessingTask, not in foreground task groups.

A BGProcessingTaskRequest can set requiresExternalPower = true and requiresNetworkConnectivity = false. For inference tasks, requiring external power is the correct default. The OS schedules the task when the device is plugged in and idle.

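import BackgroundTasks

// Register before the app finishes launching.
BGTaskScheduler.shared.register(
    forTaskWithIdentifier: "com.yourapp.inference.batch",
    using: nil
) { task in
    guard let processingTask = task as? BGProcessingTask else {
        task.setTaskCompleted(success: false)
        return
    }
    let work = Task {
        await InferenceScheduler.shared.runDeferredBatch()
        // Report failure if the OS cut the run short.
        processingTask.setTaskCompleted(success: !Task.isCancelled)
    }
    // Called when the time budget runs out; assumes the batch checks for cancellation.
    processingTask.expirationHandler = { work.cancel() }
}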

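The task identifier must be declared in Info.plist under BGTaskSchedulerPermittedIdentifiers. Omitting it is easy to miss: register(forTaskWithIdentifier:using:launchHandler:) returns false and submit(_:) throws BGTaskScheduler.Error.notPermitted, so if neither result is checked the failure is silent and the task never executes.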
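Registering a handler does not schedule anything by itself; a request still has to be submitted. A minimal sketch of that step, with the external-power requirement described above, might look like this. The function names makeBatchInferenceRequest and scheduleBatchInference are hypothetical, and the identifier is the placeholder used in the registration snippet.

```swift
#if os(iOS)
import BackgroundTasks

// Builds a processing request that only runs on external power.
func makeBatchInferenceRequest(
    identifier: String = "com.yourapp.inference.batch"
) -> BGProcessingTaskRequest {
    let request = BGProcessingTaskRequest(identifier: identifier)
    request.requiresExternalPower = true         // wait until the device is plugged in
    request.requiresNetworkConnectivity = false  // inference is local, no network needed
    return request
}

// Submits the request and surfaces failures instead of discarding them.
func scheduleBatchInference() -> Bool {
    do {
        try BGTaskScheduler.shared.submit(makeBatchInferenceRequest())
        return true
    } catch {
        // A missing Info.plist identifier shows up here as a thrown error.
        print("Could not schedule batch inference: \(error)")
        return false
    }
}
#endif
```

Calling scheduleBatchInference() when the app moves to the background is a common place for this, since a processing request has to be resubmitted after each run.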
