
From 31% to 8.3% False Positives: Tuning Our Battery Failure Classifier

[Figure: feature importance chart from the battery failure classifier]

When Alerts Become Noise

The first version of Stima's degradation alert was a simple threshold rule: if a pack's estimated state-of-health dropped below 82% or if the 7-day trend in peak power delivery showed a slope below -0.15 Wh/discharge/day, fire an alert. The rule was easy to implement and easy to explain. It was also wrong about 31% of the time.
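The rule fits in a few lines. A minimal sketch (function and constant names are ours, not Stima's actual code):

```python
# Sketch of the original threshold rule described above.
# Names and structure are illustrative, not Stima's implementation.

def should_alert(soh_estimate: float, soh_trend_7d: float) -> bool:
    """Fire a degradation alert if either threshold trips.

    soh_estimate: estimated state-of-health, 0.0-1.0
    soh_trend_7d: 7-day slope of peak power delivery, Wh/discharge/day
    """
    SOH_FLOOR = 0.82      # alert below 82% state-of-health
    TREND_FLOOR = -0.15   # alert on a steep downward power trend
    return soh_estimate < SOH_FLOOR or soh_trend_7d < TREND_FLOOR

# A healthy pack on a very hot day can still trip the SOH floor:
print(should_alert(0.81, -0.02))  # True
```

The rule has no notion of context, which is exactly why it misfires, as the next section shows.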

The 31% false positive rate sounds abstractly bad, but the consequences were concrete. When Samuel received an alert for vehicle 23, he'd assign someone to inspect the pack. The inspection found nothing wrong. Samuel spent 20 minutes he didn't have on a non-problem. When the same thing happened three more times in a week, he stopped responding to alerts. By month three of the pilot, he was ignoring most of what the system sent him.

An alert system with a 31% false positive rate is worse than no alert system. It creates alert fatigue, wastes manager time, and — most importantly — erodes trust in the real alerts. When vehicle 7 genuinely needed attention and the alert fired, Samuel treated it the same as the previous false positives. He was right to be skeptical; we'd trained him to be. The ML model improvement wasn't a technical nicety — it was required for the product to have any value at all.

What Was Causing the False Positives

Diagnosing the false positives required looking at each one and asking what the telemetry showed at alert time versus what ground truth turned out to be. Three patterns emerged:

Temperature-driven SOH noise. On very hot days (ambient above 36°C), packs showed temporary SOH drops that recovered when temperature normalized. The threshold rule couldn't distinguish between a real degradation trend and a heat-stress response. About 40% of false positives fell into this category.

Generator charging voltage spikes. When a pack was charged on a generator with voltage instability, the discharge curve in the subsequent cycle looked anomalous compared to the pack's baseline. The threshold rule interpreted this as degradation. The pack wasn't actually degrading — it was recovering from a single bad charging event. About 35% of false positives fell here.

New pack baseline calibration period. When a pack was first installed in a vehicle, we had no baseline to compare against. The first 5–10 discharge cycles showed high variance in estimated SOH as the model built its pack-specific baseline. Alerts fired during this calibration window based on the general model, which frequently didn't match the specific pack's chemistry and condition. About 25% of false positives came from new pack installations.

Feature Engineering: What the Model Needed to Know

The threshold rule used three features: estimated SOH, 7-day SOH trend, and peak power delivered in the last discharge. The gradient boosting model we built to replace it uses 34 features. The additional features are mostly about context — giving the model the information it needs to distinguish real degradation from noise.

Key features added for the temperature dimension: ambient temperature at discharge start, pack temperature delta above ambient (how much heat the pack generated versus its environment), max pack temperature reached during the discharge event, and a rolling 7-day average of the temperature differential. These features allow the model to calibrate its degradation assessment based on thermal conditions during each measurement event.

Key features added for the charging quality dimension: charging voltage standard deviation (calculated from the sampled voltage readings during the charging event), number of voltage samples outside ±10% of nominal (a direct count of instability events), and a binary flag for generator charging (inferred from the voltage signature pattern of a generator versus grid power). These features allow the model to down-weight an anomalous discharge event that follows a known-bad charging session.

Key features added for the calibration dimension: days since pack was first registered, total discharge cycles observed for this pack, and a confidence score representing how many standard deviations the current pack's behavior is from the general population baseline. Packs with fewer than 15 observed cycles are held to a stricter alert threshold: the model requires stronger evidence before flagging a new pack.
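To make the three feature groups concrete, here is an illustrative computation of a few of them. The field names and structures are assumptions for the sketch, not Stima's schema:

```python
import statistics

# Illustrative computation of a handful of the context features
# described above. Names are assumptions, not Stima's schema.

def context_features(ambient_c, pack_temps_c, charge_voltages,
                     nominal_v, cycles_observed):
    max_pack_temp = max(pack_temps_c)
    return {
        # temperature dimension
        "ambient_at_start": ambient_c,
        "pack_delta_above_ambient": max_pack_temp - ambient_c,
        "max_pack_temp": max_pack_temp,
        # charging-quality dimension
        "charge_voltage_std": statistics.stdev(charge_voltages),
        "samples_outside_10pct": sum(
            1 for v in charge_voltages
            if abs(v - nominal_v) > 0.10 * nominal_v
        ),
        # calibration dimension
        "cycles_observed": cycles_observed,
        "in_calibration_window": cycles_observed < 15,
    }

feats = context_features(
    ambient_c=37.0,
    pack_temps_c=[38.0, 44.5, 47.2],
    charge_voltages=[51.8, 52.0, 58.9, 51.7],  # one generator spike
    nominal_v=52.0,
    cycles_observed=8,
)
```

With these features present, a hot-day SOH dip or a post-generator anomaly arrives at the model alongside the context that explains it.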

Model Architecture: Why Gradient Boosting

We evaluated three model types: gradient boosting (LightGBM 4.1), a random forest, and an LSTM-based sequence model. The LSTM had the best raw accuracy on the test set — 94.2% recall at a 5% false positive rate — but it required significantly more compute to run inference and was difficult to debug. When Samuel asked "why did you alert on vehicle 23?", a gradient boosting model gives you a feature importance explanation. An LSTM gives you an embedding.

For a product where operator trust is the primary challenge, explainability is not optional. If we can tell Samuel "vehicle 23 was flagged because the discharge peak power has dropped 18% over 14 days while temperature has been normal and charging has been stable," he can verify that claim by looking at the bike's recent history. If we tell him "the neural network assigned a 0.73 failure probability," he has no way to evaluate whether that's meaningful.
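LightGBM can emit per-prediction feature contributions (via `predict(..., pred_contrib=True)`), and turning those into a verifiable sentence is a small formatting step. The helper below is a hypothetical sketch operating on already-computed contributions; the feature names and values are illustrative:

```python
# Hypothetical formatter: given (feature, contribution) pairs for one
# alert, surface the top positive drivers as a human-readable reason.

def explain_alert(contributions: dict, top_n: int = 2) -> str:
    drivers = sorted(contributions.items(),
                     key=lambda kv: kv[1], reverse=True)
    top = [name for name, score in drivers[:top_n] if score > 0]
    return "Flagged mainly due to: " + ", ".join(top)

reason = explain_alert({
    "peak_power_14d_drop": 0.41,
    "soh_trend_7d": 0.22,
    "charge_voltage_std": -0.05,  # stable charging argues against alerting
    "ambient_at_start": -0.02,    # normal temperature argues against it too
})
print(reason)  # Flagged mainly due to: peak_power_14d_drop, soh_trend_7d
```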

We trained LightGBM with 500 trees, a maximum depth of 8, and a learning rate of 0.05 on our dataset of 18,347 discharge cycles. Training time on a MacBook Pro M2 was 4 minutes. Inference time per pack per evaluation run is under 2 ms. The model is small enough to run on a $50/month VPS instance evaluating 1,000 packs simultaneously with headroom to spare.
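As a rough sketch, those hyperparameters map onto LightGBM's parameter names like so (the rest of the training pipeline is omitted):

```python
# Hyperparameters from the text, expressed as a LightGBM params dict.
lgbm_params = {
    "objective": "binary",   # failure within 14 days: yes/no
    "n_estimators": 500,     # 500 trees
    "max_depth": 8,
    "learning_rate": 0.05,
}
# e.g. model = lightgbm.LGBMClassifier(**lgbm_params).fit(X_train, y_train)
```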

Training Data and Label Construction

Labeling the training data was more difficult than building the model. "Battery failure" needed a precise definition for supervised learning: we defined it as a pack dropping below 70% of its original rated capacity, confirmed by load test, within 14 days of the labeled event. This definition excluded packs that showed telemetry anomalies but weren't actually degraded, and packs that failed outside the 14-day window (which the model shouldn't claim to predict at that horizon).
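Under assumptions about the record format (field names here are hypothetical), the label rule for a single discharge cycle might look like:

```python
# Hypothetical label construction for one discharge cycle. A cycle is
# a positive ("pre-failure") example only if a load test confirmed
# capacity below 70% of rated, within 14 days of the cycle.

from datetime import date, timedelta

FAILURE_CAPACITY_FRACTION = 0.70
HORIZON = timedelta(days=14)

def label_cycle(cycle_date, confirmed_failure_date,
                load_test_capacity_fraction):
    if confirmed_failure_date is None or load_test_capacity_fraction is None:
        return 0  # no confirmed failure: negative example
    failed = load_test_capacity_fraction < FAILURE_CAPACITY_FRACTION
    in_horizon = (cycle_date <= confirmed_failure_date
                  <= cycle_date + HORIZON)
    return int(failed and in_horizon)

# Failure confirmed 10 days later at 64% capacity: positive label
print(label_cycle(date(2024, 3, 1), date(2024, 3, 11), 0.64))  # 1
# Failure 30 days out is beyond the claimed horizon: negative label
print(label_cycle(date(2024, 3, 1), date(2024, 3, 31), 0.64))  # 0
```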

Of the 18,347 labeled discharge cycles in the training set, 1,204 were labeled as pre-failure events (the pack failed within 14 days of that discharge cycle). The class imbalance — roughly 6.5% positive rate — required careful handling. We used SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic positive samples during training and weighted the positive class 3x in the loss function. Without these adjustments, the model optimized for accuracy by predicting "no failure" for almost everything — technically 93.5% accurate but useless for our purpose.
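The imbalance arithmetic is easy to verify from the numbers above (SMOTE itself, typically applied via a library such as imbalanced-learn, is omitted from this sketch):

```python
# Class imbalance numbers from the text.
total_cycles = 18_347
positives = 1_204          # pre-failure discharge cycles

positive_rate = positives / total_cycles
print(f"positive rate: {positive_rate:.1%}")

# A degenerate model that always predicts "no failure" scores:
always_negative_accuracy = 1 - positive_rate
print(f"always-negative accuracy: {always_negative_accuracy:.1%}")

# As described in the text, positives were also weighted 3x in the loss.
positive_class_weight = 3.0
```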

The final model achieves 91.4% recall (correctly identifies 91.4% of packs that will fail within 14 days) at 8.3% false positive rate. That's a 22.7 percentage point improvement in false positives versus the threshold rule, with 91.4% versus 78% recall on the same test set. Samuel now trusts the alerts. That's the metric that matters.

Deployment: Running Inference at the Edge vs. Cloud

We evaluated whether to run inference on the SEM-1 hardware module (edge inference) or on the cloud backend after telemetry data arrives. Edge inference would allow the module to generate alerts even during connectivity gaps, which aligns with our offline-first architecture philosophy. Cloud inference has access to the full historical feature set, which the edge module can't always reconstruct from local cache alone.

The ARM Cortex-M4F in the SEM-1 can run a simplified version of the LightGBM model — we compiled a 12-feature subset using ONNX Runtime for embedded targets — in approximately 35ms. The simplified model achieves 84% recall at 12% false positive rate. That's worse than the full model but still better than the original threshold rule.

The current deployment uses both: the edge model runs continuously and can push a local alert via the driver app over Bluetooth even without connectivity. The cloud model runs every 15 minutes against newly synced telemetry and can update or override the edge alert with higher-confidence assessments. When connectivity is available, the cloud model's alert takes precedence. When the vehicle is offline, the edge model holds down the fort. This hybrid approach was not in the original design specification — it emerged from field deployments where drivers needed in-cab notification before they lost connectivity on rural routes.
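The precedence logic itself is simple; a minimal sketch (names are ours, not the shipped implementation):

```python
# Hedged sketch of the edge/cloud alert precedence described above.

def effective_alert(edge_alert, cloud_alert, connected):
    """Cloud assessment wins when available; otherwise the edge
    model's local decision stands.

    cloud_alert may be None when no cloud assessment has synced yet.
    """
    if connected and cloud_alert is not None:
        return cloud_alert   # higher-confidence cloud model overrides
    return edge_alert        # offline: edge model's decision stands

# Cloud clears a local edge alert once the vehicle reconnects:
print(effective_alert(edge_alert=True, cloud_alert=False, connected=True))
# Offline on a rural route, the edge alert is what the driver sees:
print(effective_alert(edge_alert=True, cloud_alert=None, connected=False))
```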

Filed under: Machine Learning, Battery Technology