
Technical Report • 2026

VAJRA: A Multi-Sensor On-Device Counter-UAS System with Custom-Trained Visual and Acoustic Deep Learning Models

360Labs

Counter-UAS • Drone Detection • On-Device ML • YOLOv8 • Acoustic Classification • Multi-Sensor Fusion • Edge AI • TensorFlow Lite

Abstract

The proliferation of small unmanned aerial systems (sUAS) poses significant security challenges across military, critical infrastructure, and civilian domains. Current counter-UAS (C-UAS) solutions rely on expensive, centralized, network-dependent hardware that is impractical for forward-deployed or resource-constrained environments. We present VAJRA, a fully on-device, multi-sensor drone detection and neutralization system that runs entirely on a commercial Android smartphone. VAJRA integrates three detection modalities: visual (camera + custom-trained YOLOv8n), acoustic (microphone + FFT + custom-trained CNN classifier), and RF spectrum analysis, fused into a unified threat display with countermeasure control. All machine learning inference runs locally with zero network dependency, enabling operation in denied or degraded communications environments. We describe our approach to custom-training both the visual object detection model on drone-specific imagery and the acoustic classification model on synthetically generated propeller audio profiles. The system achieves real-time performance (>25 FPS visual, <100 ms acoustic classification) on mid-range Android hardware with a total application size under 40 MB. We discuss the current limitations of synthetic training data and outline a roadmap for improving detection accuracy through expanded real-world datasets.

1. Introduction

1.1 The Growing UAS Threat

The rapid democratization of drone technology has created an asymmetric threat landscape. Consumer quadcopters costing under $1,000 can carry payloads, conduct surveillance, and penetrate restricted airspace with minimal operator skill. In military contexts, the Ukraine-Russia conflict has demonstrated the devastating effectiveness of first-person-view (FPV) attack drones, fiber-optic guided munitions, and loitering munitions such as the Shahed-136 [1]. The Indian subcontinent faces similar challenges along its borders, where adversary drones conduct reconnaissance and cross-border smuggling operations [2].

1.2 Limitations of Existing C-UAS Systems

Current counter-drone systems, such as DroneShield's DroneSentry, Rafael's Drone Dome, and Dedrone's DedroneTracker, suffer from several limitations:

  1. Cost: Military-grade systems cost $500K-$5M per installation
  2. Infrastructure dependency: Require dedicated radar arrays, RF sensors, and command stations
  3. Network dependency: Cloud-based ML processing requires persistent connectivity
  4. Portability: Fixed installations cannot support mobile patrols or forward positions
  5. Latency: Cloud round-trips add 200-2000ms to detection-to-response time

1.3 Our Contribution

VAJRA addresses these limitations by implementing a complete C-UAS pipeline on a single Android device:

  • Visual detection using a custom-trained YOLOv8n model (12MB) for real-time drone recognition via the device camera
  • Acoustic detection using a dual-layer approach: real-time FFT peak analysis for propeller frequency identification, combined with a custom-trained CNN classifier (129KB) for drone type classification from mel spectrograms
  • RF analysis for control link identification and protocol fingerprinting
  • Multi-sensor fusion combining all modalities into a unified tactical threat display
  • Countermeasure control interface for RF jamming and GPS spoofing operations

All inference runs on-device using TensorFlow Lite, requiring zero network connectivity. The complete application is under 40MB and runs on any Android 8.0+ device.


2. System Architecture

2.1 Overview

VAJRA follows a modular architecture where each sensor modality operates as an independent detection pipeline, with results fused at the threat display layer.

VAJRA System Architecture

CAMERA          MICROPHONE        SDR/RF
(CameraX)       (AudioRec)        (RTL-SDR)
    |               |                |
    v               v                v
YOLOv8n         FFT Engine       Spectrum
TFLite          1024-pt FFT      Analyzer
(12MB)          @ 44.1 kHz       2.048 MSPS
    |               |                |
    |               v                |
    |           CNN Mel              |
    |           Classifier           |
    |           (129KB)              |
    |               |                |
    v               v                v
         DRONE DATABASE
    8 profiles x 22 parameters
                |
                v
      THREAT DISPLAY (Fusion Layer)
  Radar sweep + bearing/range/speed/IFF
                |
                v
      COUNTERMEASURE CONTROL
  Barrage jam / Protocol exploit / GPS spoof

2.2 Drone Database

At the core of VAJRA's identification capability is a structured database of 8 drone profiles spanning consumer, commercial, military, FPV attack, and loitering munition categories. Each profile contains 22 parameters:

Parameter Category | Fields
Identity | ID, name, manufacturer, country of origin
Physical | Category, type (quadcopter/fixed-wing/hybrid), control link type
Performance | Max range (7-150 km), max speed (72-220 km/h), endurance (8 min-36 hrs)
RF Signature | Frequency bands (2.4/5.8 GHz, 900 MHz, C/Ku-band SAT), RF protocol
Acoustic Signature | Propeller fundamental frequency (28-400 Hz), acoustic description
Threat Assessment | Threat level (LOW/MEDIUM/HIGH/CRITICAL), payload capability, countermeasure

Table 1: Drone profiles in VAJRA database

Drone | Country | Category | Prop Freq | Control | Threat | Jammable
DJI Mavic 3 | CN | Consumer | 240 Hz | OcuSync 3.0 | MEDIUM | Yes
DJI Phantom 4 | CN | Consumer | 215 Hz | Lightbridge 2 | MEDIUM | Yes
Bayraktar TB2 | TR | Military | 95 Hz | Satellite | HIGH | No
Shahed-136 | IR | Loitering | 65 Hz | Autonomous | CRITICAL | No*
FPV Attack | UA/RU | FPV Attack | 400 Hz | ExpressLRS | CRITICAL | Yes
Fiber-Optic FPV | UA/RU | FPV Attack | 375 Hz | Fiber Optic | CRITICAL | No
Orlan-10 | RU | Military | 110 Hz | MIL-STD | HIGH | Partial
Heron TP | IL | Military | 55 Hz | Satellite | HIGH | No

*GPS spoofing may be effective against GPS-guided autonomous drones.

2.3 Platform and Dependencies

  • Target platform: Android 8.0+ (API 26), optimized for landscape tablet/phone
  • ML runtime: TensorFlow Lite 2.14.0 with 4-thread CPU inference
  • Camera framework: AndroidX CameraX 1.4.1
  • Audio capture: Android AudioRecord at 44,100 Hz, 16-bit mono PCM
  • Total APK size: ~38 MB (including all models and assets)

3. Visual Detection Module

3.1 Model Architecture

For real-time visual drone detection, we employ YOLOv8n (nano variant), the smallest model in the Ultralytics YOLOv8 family. YOLOv8n uses a CSPDarknet backbone with a Path Aggregation Network (PAN) feature pyramid and a decoupled detection head [3].

Model specifications:

Parameter | Value
Architecture | YOLOv8n (nano)
Input resolution | 320 x 320 x 3 (RGB, float32, normalized 0-1)
Output format | [1, 8, 2100]: 2100 predictions x 8 values
Output structure | [cx, cy, w, h, class0, class1, class2, class3] per prediction
Prediction cells | 2100 (40x40 + 20x20 + 10x10 multi-scale grid; YOLOv8 is anchor-free)
Number of classes | 4 (all mapped to DRONE)
Model size | 12 MB (TFLite, float16 quantized)
NMS IoU threshold | 0.45
Confidence threshold | 0.35
Max detections | 20 per frame

3.2 Custom Training

The YOLOv8n model was custom-trained on a drone detection dataset rather than using the general-purpose COCO pre-trained weights. The training process:

  1. Dataset: Drone imagery from multiple angles, distances, and backgrounds, including both consumer and military UAS types
  2. Augmentation: Standard YOLOv8 augmentation pipeline (mosaic, mixup, random perspective, HSV shifts)
  3. Training: Transfer learning from COCO pre-trained weights with custom drone classes
  4. Export: Converted to TFLite with float16 quantization for mobile deployment

3.3 Inference Pipeline

The visual detection pipeline processes camera frames in real-time:

Camera Frame (YUV_420_888, 30 FPS)
    |
    v
YUV -> Bitmap Conversion (with rotation correction)
    |
    v
Resize to 320x320 (bilinear interpolation)
    |
    v
Normalize to [0, 1] float32 RGB
    |
    v
YOLOv8n TFLite Inference (4 threads)
    |
    v
Parse [1, 8, 2100] output tensor
    |
    v
Filter by confidence (>0.35)
    |
    v
Non-Maximum Suppression (IoU >0.45)
    |
    v
Tactical overlay rendering (corner brackets + label + confidence)

The inference runs on a dedicated single-thread executor to avoid blocking the UI thread. Detection results are marshaled to the main thread via Handler.post() for overlay rendering with an 800ms per-class cooldown to prevent log spam.
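The decode stages above (tensor parsing, confidence filtering, and NMS) can be sketched as follows. This is an illustrative NumPy re-implementation, not the on-device Kotlin code; function and variable names are ours.

```python
import numpy as np

def postprocess(output, conf_thr=0.35, iou_thr=0.45, max_det=20):
    """Decode a [1, 8, 2100] YOLOv8n output tensor into boxes and scores."""
    preds = output[0].T                       # (2100, 8): cx, cy, w, h, c0..c3
    scores = preds[:, 4:].max(axis=1)         # best class score per prediction
    keep = scores > conf_thr
    boxes, scores = preds[keep, :4], scores[keep]
    # Convert center/size to corner coordinates for IoU computation
    xyxy = np.empty_like(boxes)
    xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
    # Greedy NMS: highest score first, suppress heavy overlaps
    order = scores.argsort()[::-1]
    selected = []
    while len(order) and len(selected) < max_det:
        i = order[0]
        selected.append(i)
        rest = order[1:]
        xx1 = np.maximum(xyxy[i, 0], xyxy[rest, 0])
        yy1 = np.maximum(xyxy[i, 1], xyxy[rest, 1])
        xx2 = np.minimum(xyxy[i, 2], xyxy[rest, 2])
        yy2 = np.minimum(xyxy[i, 3], xyxy[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (xyxy[i, 2] - xyxy[i, 0]) * (xyxy[i, 3] - xyxy[i, 1])
        area_r = (xyxy[rest, 2] - xyxy[rest, 0]) * (xyxy[rest, 3] - xyxy[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thr]
    return xyxy[selected], scores[selected]
```

Sorting by confidence before greedy suppression ensures the highest-scoring box in each overlapping cluster survives.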

3.4 Tactical Display

Detections are rendered as tactical corner-bracket overlays (rather than simple rectangles) with:

  • Drone class label and confidence percentage
  • FPS counter for performance monitoring
  • Running detection statistics (contacts, drones, vehicles)
  • Timestamped detection log (50-entry circular buffer)

4. Acoustic Detection Module

4.1 Dual-Layer Architecture

VAJRA's acoustic detection employs two complementary layers that run simultaneously on live microphone audio:

Layer 1: Real-time FFT peak detection (every ~23ms)

  • Detects propeller fundamental frequency in the 50-500 Hz range
  • Validates harmonic structure (2nd harmonic presence)
  • Tracks frequency stability across consecutive frames
  • Matches against drone database profiles

Layer 2: ML CNN classifier (every 1 second)

  • Computes mel spectrogram from 1-second audio buffer
  • Classifies drone type using a trained CNN
  • Provides probabilistic output across 5 classes

4.2 FFT Engine

The FFT engine implements a Cooley-Tukey radix-2 decimation-in-time algorithm in pure Kotlin, operating on 1024-point windows at 44,100 Hz sampling rate.

FFT specifications:

Parameter | Value
FFT size | 1024 points
Sample rate | 44,100 Hz
Frequency resolution | 43.07 Hz per bin
Frame rate | ~43 frames/second
Window function | Hanning (periodic)
Output bins | 256 (mapped to 0-1000 Hz)
Bit-reversal | Pre-computed lookup table
Twiddle factors | Pre-computed sin/cos table
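A minimal Python sketch of the radix-2 decimation-in-time algorithm with pre-computed tables (the production engine is pure Kotlin; these function names are illustrative):

```python
import math

def make_tables(n):
    """Pre-compute the bit-reversal permutation and twiddle factors for size n."""
    bits = n.bit_length() - 1
    rev = [0] * n
    for i in range(n):
        r = 0
        for b in range(bits):
            r = (r << 1) | ((i >> b) & 1)   # reverse the bit order of index i
        rev[i] = r
    tw = [complex(math.cos(-2 * math.pi * k / n),
                  math.sin(-2 * math.pi * k / n)) for k in range(n // 2)]
    return rev, tw

def fft(x, rev, tw):
    """Iterative radix-2 DIT Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    a = [complex(x[rev[i]]) for i in range(n)]   # bit-reversed reordering
    size = 2
    while size <= n:
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):                # butterfly for this stage
                t = tw[k * step] * a[start + k + half]
                a[start + k + half] = a[start + k] - t
                a[start + k] = a[start + k] + t
        size *= 2
    return a
```

Pre-computing both tables moves all trigonometry out of the per-frame hot path, which matters at ~43 frames per second on a mobile CPU.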

Detection criteria (all must be satisfied):

Criterion | Threshold | Purpose
SNR | >= 12 dB | Peak must be significantly above noise floor
Amplitude | >= -55 dBFS | Reject quiet ambient fluctuations
Harmonic | 2nd harmonic >= 4 dB above local noise | Confirm propeller signature vs. random peak
Stability | +/- 15 Hz across 3+ consecutive frames | Reject transient peaks
Confirmation | 6 consecutive valid frames (~150 ms) | Reject brief coincidences

Harmonic validation algorithm:

Drone propellers produce strong harmonics due to blade geometry. A peak at fundamental frequency f0 should have a corresponding peak at 2f0 (second harmonic). VAJRA validates this by:

  1. Computing the expected harmonic bin: bin_h2 = 2 * bin_f0
  2. Finding the maximum magnitude within +/-1 bin of the expected position
  3. Computing the local noise floor in a +/-15 bin window (excluding +/-3 bins around harmonic)
  4. Requiring the harmonic SNR >= 4 dB

This criterion effectively rejects non-propeller sources (speech, music, traffic) that may have energy in the 50-500 Hz range but lack the harmonic structure of rotating blades.
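The four steps can be sketched on a power spectrum as follows; parameter names are ours, and the exact windowing constants mirror the values listed above.

```python
import math

def harmonic_snr_db(power, bin_f0, search=1, noise_win=15, guard=3):
    """SNR (dB) of the 2nd harmonic relative to its local noise floor.

    power:  power spectrum (one value per FFT bin)
    bin_f0: bin index of the detected fundamental
    """
    bin_h2 = 2 * bin_f0                              # expected harmonic bin
    lo = max(0, bin_h2 - search)
    hi = min(len(power) - 1, bin_h2 + search)
    peak = max(power[lo:hi + 1])                     # max within +/-1 bin
    # Local noise floor: +/-15 bins, excluding a +/-3 bin guard band
    noise = [power[i]
             for i in range(max(0, bin_h2 - noise_win),
                            min(len(power), bin_h2 + noise_win + 1))
             if abs(i - bin_h2) > guard]
    floor = sum(noise) / len(noise)
    return 10.0 * math.log10(peak / max(floor, 1e-12))

def has_propeller_harmonic(power, bin_f0, min_snr_db=4.0):
    return harmonic_snr_db(power, bin_f0) >= min_snr_db
```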

Distance estimation:

Signal amplitude provides a rough distance estimate by linear interpolation between two calibrated amplitude anchors (motivated by the fall-off of acoustic intensity with distance):

distance = DIST_CLOSE + (amplitude_dB - AMP_CLOSE) / (AMP_FAR - AMP_CLOSE)
         * (DIST_FAR - DIST_CLOSE)

Where AMP_CLOSE = -20 dBFS -> 50m, AMP_FAR = -60 dBFS -> 1000m.
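As a sketch, with the anchors above and clamping added so out-of-range amplitudes saturate at the endpoints (the clamping is our assumption, not stated in the report):

```python
def estimate_distance(amp_db, amp_close=-20.0, amp_far=-60.0,
                      dist_close=50.0, dist_far=1000.0):
    """Map a peak amplitude in dBFS to an approximate distance in meters."""
    t = (amp_db - amp_close) / (amp_far - amp_close)   # 0 at close, 1 at far
    t = min(max(t, 0.0), 1.0)                          # clamp to anchor range
    return dist_close + t * (dist_far - dist_close)
```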

4.3 CNN Acoustic Classifier

4.3.1 Model Architecture

The acoustic classifier is a compact CNN operating on log-mel spectrograms:

Input: (64, 44, 1) -64 mel bands x 44 time frames x 1 channel

Conv2D(16, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2)    -> (32, 22, 16)
Conv2D(32, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2)    -> (16, 11, 32)
Conv2D(64, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2)    -> (8, 5, 64)
Conv2D(64, 3x3, same) -> BatchNorm -> ReLU -> GlobalAvgPool   -> (64,)
Dropout(0.3) -> Dense(32, ReLU) -> Dropout(0.2) -> Dense(5, softmax)

Output: [ambient, quadcopter_small, quadcopter_large, fixed_wing, helicopter_uav]

Model specifications:

Parameter | Value
Total parameters | ~52K
Model size (TFLite float16) | 129 KB
Mel filterbank size | 128 KB (64 x 513 float32)
Input shape | [1, 64, 44, 1]
Output classes | 5
Confidence threshold | 0.35
Inference time | < 50 ms on mobile CPU

4.3.2 Mel Spectrogram Computation

The on-device mel spectrogram pipeline matches the training pipeline (Python librosa):

  1. Windowing: 1024-sample Hanning window, 1024-sample hop (non-overlapping)
  2. FFT: 1024-point radix-2 Cooley-Tukey (implemented in pure Kotlin)
  3. Power spectrum: |X[k]|^2 for k = 0..512
  4. Mel filterbank: 64 triangular filters spanning 20-2000 Hz, applied as matrix multiply
  5. Log compression: 10 x log10(power + 10^-10), clipped to 80 dB dynamic range (matching librosa.power_to_db(ref=np.max, top_db=80))
  6. Normalization: Min-max scaling to [0, 1]

The mel filterbank weights are pre-computed using librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64, fmin=20, fmax=2000) and shipped as a binary asset (128 KB).
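The six steps above can be sketched in NumPy. The on-device version is pure Kotlin; here the pre-computed filterbank is passed in as a matrix (a random stand-in works for shape checks), and a symmetric Hann window is used for simplicity.

```python
import numpy as np

def log_mel(audio, mel_fb, n_fft=1024, top_db=80.0):
    """Log-mel spectrogram: (64 mel bands, n_frames), min-max scaled to [0, 1].

    audio:  mono float samples at 44.1 kHz
    mel_fb: (64, 513) triangular filterbank (precomputed asset in the report)
    """
    win = np.hanning(n_fft)
    n_frames = len(audio) // n_fft                    # non-overlapping hops
    frames = audio[:n_frames * n_fft].reshape(n_frames, n_fft) * win
    spec = np.fft.rfft(frames, n_fft, axis=1)         # (n_frames, 513)
    power = np.abs(spec) ** 2                         # power spectrum
    mel = power @ mel_fb.T                            # (n_frames, 64)
    db = 10.0 * np.log10(mel + 1e-10)                 # log compression
    db = np.maximum(db, db.max() - top_db)            # top_db-style clipping
    db = (db - db.min()) / (db.max() - db.min() + 1e-9)  # min-max to [0, 1]
    return db.T                                       # (64, n_frames)
```

With 44 x 1024 input samples this yields the (64, 44) input shape the classifier expects.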

4.3.3 Training Data: Synthetic Drone Audio Generation

A key contribution of this work is the synthetic training data generation pipeline. Due to the difficulty of collecting real drone audio across multiple types and conditions, we generate physically accurate propeller audio from known frequency profiles.

Generation parameters per drone profile:

Parameter | Description | Value range
Fundamental frequency | Blade-pass frequency | 28-400 Hz
Harmonics | 2nd, 3rd, 4th overtones | Amplitude: 0.5, 0.25, 0.12 x fundamental
Number of propellers | Multi-rotor detuning | 1 (fixed-wing) to 6 (hexacopter)
Propeller detuning | Inter-motor frequency offset | +/- 3-8 Hz
RPM modulation | Throttle variation | +/- 5% at 0.2-0.5 Hz
Amplitude modulation | Doppler/distance simulation | +/- 20% at 0.05-0.15 Hz
Background noise | Pink noise (1/f spectrum) | SNR: 15, 25, 40 dB

Drone profiles used for training:

Class | Profiles | Frequencies
quadcopter_small | DJI Mavic 3, DJI Mini, DJI Phantom 4, Generic | 215, 230, 240, 260 Hz
quadcopter_large | FPV Racing, FPV Attack, Matrice 600, Heavy Quad | 160, 180, 375, 400 Hz
fixed_wing | Bayraktar TB2, Shahed-136, Orlan-10, RC Plane | 65, 95, 110, 130 Hz
helicopter_uav | Heron TP, Coaxial Heli, Heli UAV, Medium Heli | 28, 35, 45, 55 Hz
ambient | ESC-50 dataset (21 categories) | N/A

Signal synthesis formula:

For a drone with N propellers, fundamental frequency f0, and detuning delta:

signal(t) = Sum_p=1^N  Sum_h=1^4  (a_h / N) x sin(2pi x integral_0^t f_ph(tau) dtau)

where:
  f_ph(t) = h x [f0 + delta_p + delta_f x sin(2pi x fmod x t + phi_p)]
  delta_p = (p - N/2) x delta / N        (per-propeller detuning)
  delta_f = 0.05 x f0                     (RPM modulation depth)
  fmod    ~ U(0.2, 0.5) Hz               (modulation rate)
  a_h     = [1.0, 0.5, 0.25, 0.12]       (harmonic amplitudes)

This produces audio with the characteristic "buzzing" quality of multi-rotor drones, where slightly detuned motors create beating patterns in the frequency domain.
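The synthesis formula can be sketched as follows; for brevity this omits the pink-noise background and slow amplitude modulation from the parameter table, and names are ours.

```python
import numpy as np

def synth_drone(f0, n_props=4, detune=5.0, duration=1.0, sr=44100, seed=0):
    """Additive propeller synthesis: detuned rotors, RPM-modulated fundamentals,
    four harmonics at amplitudes [1.0, 0.5, 0.25, 0.12]."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration * sr)) / sr
    amps = [1.0, 0.5, 0.25, 0.12]
    fmod = rng.uniform(0.2, 0.5)                    # RPM modulation rate (Hz)
    sig = np.zeros_like(t)
    for p in range(1, n_props + 1):
        delta_p = (p - n_props / 2) * detune / n_props   # per-propeller detune
        phi = rng.uniform(0, 2 * np.pi)
        # Instantaneous frequency with +/-5% sinusoidal RPM modulation
        f_inst = f0 + delta_p + 0.05 * f0 * np.sin(2 * np.pi * fmod * t + phi)
        # Integrate frequency to phase (cumulative sum approximates the integral)
        phase = 2 * np.pi * np.cumsum(f_inst) / sr
        for h, a in enumerate(amps, start=1):
            sig += (a / n_props) * np.sin(h * phase)
    return sig / np.max(np.abs(sig))                # normalize to [-1, 1]
```

Integrating the modulated frequency rather than modulating the phase directly keeps the instantaneous frequency physically meaningful, which is what produces the detuned-rotor beating.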

Training dataset composition:

Source | Samples | Duration
Synthetic drone audio (16 profiles x 3 SNR x 60 s) | ~5,600 clips | 48 minutes
ESC-50 ambient (21 categories) | ~6,300 clips | 35 minutes
Total (before augmentation) | ~11,900 | 83 minutes

Data augmentation (SpecAugment-style [4]):

  • Time masking: random 1-5 frame zeroing
  • Frequency masking: random 1-8 mel band zeroing
  • Gaussian noise injection: sigma in [0.01, 0.05]
  • Time shifting: +/-3 frames circular shift

After augmentation and class balancing, the final training set contains ~3,000+ samples per class.
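The four augmentations can be sketched on a (64, 44) spectrogram; mask positions and widths are drawn per sample, using the ranges listed above.

```python
import numpy as np

def augment(spec, rng, max_t=5, max_f=8, sigma_range=(0.01, 0.05), max_shift=3):
    """SpecAugment-style masking plus noise injection and circular time shift."""
    s = spec.copy()                                   # (64 mel, 44 frames)
    t0 = rng.integers(0, s.shape[1])                  # time mask: 1-5 frames
    s[:, t0:t0 + rng.integers(1, max_t + 1)] = 0.0
    f0 = rng.integers(0, s.shape[0])                  # freq mask: 1-8 mel bands
    s[f0:f0 + rng.integers(1, max_f + 1), :] = 0.0
    s += rng.normal(0.0, rng.uniform(*sigma_range), s.shape)  # Gaussian noise
    s = np.roll(s, rng.integers(-max_shift, max_shift + 1), axis=1)  # +/-3 shift
    return s
```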

Training configuration:

Parameter | Value
Optimizer | Adam (lr=0.001)
Loss | Sparse categorical cross-entropy
Batch size | 32
Epochs | 100 (early stopping, patience=15)
LR schedule | ReduceLROnPlateau (factor=0.5, patience=5)
Train/val split | 80/20 (stratified)
Best val accuracy | 100% (epoch 42)

4.4 Spectrogram Visualization

The acoustic module includes a real-time waterfall spectrogram display:

  • Resolution: 256 frequency bins x 200 time rows (~6 seconds visible history)
  • Frequency range: 0-1000 Hz
  • Color mapping: Dark green -> bright green -> yellow -> orange/red -> white (magnitude-proportional)
  • Harmonic markers: Vertical dashed lines at detected fundamental and harmonics when drone is detected

5. RF Analysis Module

5.1 Architecture

The RF analysis module performs spectrum monitoring to detect drone control link transmissions. In production deployment, this requires an external SDR (software-defined radio) connected via USB-OTG:

Hardware | Cost | Bandwidth | Frequency Range
RTL-SDR v3 | $25 | 2.048 MSPS | 25 MHz - 1.7 GHz
HackRF One | $300 | 20 MSPS | 1 MHz - 6 GHz

5.2 Detection Pipeline

IQ Samples (SDR) -> FFT -> Peak Detection -> Bandwidth Extraction
    -> Modulation Classification -> Protocol Fingerprinting -> Database Match

Target frequency bands:

Band | Usage | Typical drones
2.4 GHz | Control link (primary) | DJI, consumer quads
5.8 GHz | HD video downlink | DJI, racing drones
900 MHz | Long-range control | FPV (Crossfire), military
1575.42 MHz | GPS L1 navigation | All GPS-dependent drones

5.3 Protocol Fingerprinting

Each drone control protocol has a distinctive RF fingerprint:

Protocol | Bandwidth | Modulation | Characteristic
DJI OcuSync 3.0 | 10-20 MHz | OFDM | Dual-band simultaneous
DJI Lightbridge 2 | 10 MHz | OFDM | Single-band
ExpressLRS | 500 kHz | LoRa | Narrow-band, frequency-hopping
Crossfire | 1 MHz | LoRa | 900 MHz long-range
MIL-STD encrypted | Variable | FHSS | Frequency-hopping spread spectrum

6. Multi-Sensor Fusion and Threat Display

6.1 Fusion Architecture

The threat display module aggregates detections from all sensor modalities into a unified tactical picture. Each detection source provides complementary information:

Sensor | Provides | Limitations
Visual (camera) | Bearing, visual classification, size estimation | Limited range (~500 m), LoS only, weather-dependent
Acoustic (microphone) | Bearing (with arrays), type classification, presence detection | Limited range (~1 km), affected by ambient noise
RF (SDR) | Control protocol, operator direction, jammability assessment | Cannot detect autonomous drones

6.2 Radar Display

The threat display renders a rotating radar sweep with contact blips:

data class ThreatBlip(
    val id: String,                // Database profile ID
    val label: String,             // Display name
    val bearingDeg: Float,         // 0-360 from north
    val rangeMeter: Float,         // Distance from sensor
    val speedMps: Float,           // Closing speed
    val headingDeg: Float,         // Direction of travel
    val threatLevel: ThreatLevel,  // LOW/MEDIUM/HIGH/CRITICAL
    val isFriendly: Boolean        // IFF classification
)
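Rendering a blip requires projecting bearing/range onto the circular display. A hypothetical helper (names and the 2 km display scale are our assumptions, not from the report):

```python
import math

def blip_to_screen(bearing_deg, range_m, cx, cy, radius_px, max_range_m=2000.0):
    """Project a bearing/range contact onto a radar display centered at (cx, cy).

    Convention: 0 deg = north = straight up on screen, bearings increase
    clockwise, and screen y grows downward.
    """
    r = min(range_m / max_range_m, 1.0) * radius_px   # clamp to display edge
    theta = math.radians(bearing_deg)
    return cx + r * math.sin(theta), cy - r * math.cos(theta)
```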

6.3 Threat Assessment

Each detected contact receives a composite threat score based on:

  1. Drone type: Military/FPV attack -> CRITICAL, consumer -> LOW/MEDIUM
  2. Control link: Autonomous/fiber-optic -> higher threat (unjammable)
  3. Payload capability: Munition-capable -> CRITICAL
  4. Closing speed and range: Fast-approaching -> elevated priority
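One way to combine these factors is an additive score; the weights and thresholds below are purely illustrative assumptions, not the report's actual scoring rules.

```python
# Illustrative composite threat scoring (weights are our assumptions)
LEVELS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}

def threat_score(base_level, jammable, munition_capable, closing_mps, range_m):
    """Composite score combining type, link, payload, and kinematics."""
    score = LEVELS[base_level]
    if not jammable:
        score += 1                   # unjammable link raises the stakes
    if munition_capable:
        score += 2                   # munition-capable drives toward CRITICAL
    if closing_mps > 20 and range_m < 1000:
        score += 1                   # fast inbound contact gets priority
    return min(score, 6)             # cap the scale
```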

7. Countermeasure Control

7.1 Jammer Specifications

VAJRA includes an interface for directing RF countermeasures (requires external jammer hardware in production):

Band | Power | Target
2.4 GHz | 10 W | Wi-Fi, DJI OcuSync, ExpressLRS
5.8 GHz | 10 W | Video downlinks
900 MHz | 10 W | Crossfire, military FH protocols
GPS L1 | 5 W | Navigation denial

7.2 Countermeasure Modes

  1. Barrage jamming: Simultaneous wideband transmission across all bands, forcing drone failsafe (hover -> land)
  2. DJI RTH spoof: Protocol-level command injection exploiting OcuSync vulnerability to trigger return-to-home
  3. GPS spoofing: Transmission of false GPS L1 signals at +20 dBm over real signal, redirecting GPS-dependent drones to a designated safe zone

7.3 Jammability Matrix

Control Link | Jammable | Alternative
Wireless RF (2.4/5.8 GHz) | Yes | Barrage jam + protocol exploit
Fiber-optic | No | Kinetic intercept required
Satellite (C/Ku-band) | No | GPS spoofing (if GPS-dependent)
Autonomous (GPS-guided) | No (RF) | GPS spoofing may redirect

8. Results and Discussion

8.1 Visual Detection Performance

The custom-trained YOLOv8n achieves real-time inference on mobile hardware:

Metric | Value
Inference speed | > 25 FPS on mid-range Android
Model size | 12 MB (float16 TFLite)
Input resolution | 320 x 320
NMS processing | < 2 ms
Memory footprint | ~45 MB (model + buffers)

8.2 Acoustic Detection Performance

FFT layer:

Metric | Value
Processing rate | ~43 frames/second
Detection latency | ~150 ms (6 confirmation frames)
Frequency resolution | 43.07 Hz
Effective range | 50-1000 m (estimated)

CNN classifier:

Metric | Value
Inference speed | < 50 ms per 1-second clip
Model size | 129 KB (float16 TFLite) + 128 KB filterbank
Validation accuracy | 100% on synthetic test set
Classes | 5 (ambient + 4 drone types)

8.3 System-Level Performance

Metric | Value
Total APK size | 38 MB
Cold start time | < 3 seconds
Battery impact | ~15% per hour (active scanning)
Network requirement | None (fully offline)
Minimum hardware | Android 8.0, ARM64, 2 GB RAM

8.4 Limitations

  1. Synthetic training data gap: The acoustic classifier achieves 100% accuracy on synthetic test data but shows reduced performance on real-world audio played through speakers. This is expected: synthetic audio has an idealized harmonic structure that differs from audio captured through the speaker -> air -> microphone chain, where spectral characteristics are transformed by speaker frequency response, room acoustics, and microphone response curves.

  2. Visual detection range: At 320x320 input resolution, small drones at distances beyond ~200m occupy very few pixels, reducing detection confidence. Higher input resolution (640x640) would improve range at the cost of inference speed.

  3. Low-frequency drones: Fixed-wing drones (65-130 Hz) and helicopter UAVs (28-55 Hz) produce fundamental frequencies below the reproduction capability of most phone speakers (typically > 150 Hz), making speaker-based testing impossible for these categories. Real-world testing with actual drones is required to validate these classes.

  4. RF module: Currently operates with simulated data; production deployment requires external SDR hardware.


9. Improving with More Data

9.1 Visual Model Improvements

The current YOLOv8n model can be significantly improved through expanded training data:

Priority datasets:

Source | Type | Expected benefit
Real drone flight recordings | Video frames at various distances, angles, backgrounds | +10-15% accuracy at range
Thermal/IR imagery | FLIR sensor captures | Night-time detection capability
Adverse weather footage | Rain, fog, dusk/dawn | Robustness in operational conditions
Drone-specific negative samples | Birds, aircraft, kites, balloons | Reduced false-positive rate
Synthetic data (AirSim/Gazebo) | Rendered drone models on diverse backgrounds | Scale training data 10-100x

Recommended training pipeline:

  1. Start with current custom-trained model as baseline
  2. Collect 10,000+ real drone images across 5+ drone types
  3. Include 20,000+ negative samples (birds at various sizes, aircraft)
  4. Train YOLOv8s (small) or YOLOv8m (medium) for higher accuracy
  5. Apply INT8 quantization for mobile deployment
  6. Target: > 90% mAP@0.5 at 50-500m range

9.2 Acoustic Model Improvements

Path 1: Real drone recordings

The most impactful improvement is supplementing synthetic data with real recordings:

Recording Protocol:
1. Record each drone type at 3 distances (50m, 200m, 500m)
2. Record in 3 environments (open field, urban, forested)
3. Use phone microphone (matches deployment hardware)
4. Minimum 2 minutes per recording
5. Include hover, forward flight, and approach maneuvers

Even 30 seconds of real drone audio per type would significantly improve generalization from synthetic-only training.

Path 2: Transfer learning from large audio models

Pre-trained audio models (AudioSet, VGGish, YAMNet) can provide learned feature representations that generalize better than training from scratch:

  1. Use YAMNet (Google's AudioSet model) as feature extractor
  2. Replace final classification head with 5-class drone classifier
  3. Fine-tune on synthetic + real drone data
  4. Expected benefit: better ambient rejection, more robust features

Path 3: Environmental calibration

Record 1-2 minutes of ambient sound at the deployment location. Use this as additional "ambient" training data specific to the operating environment. This trains the model to reject site-specific background noise (nearby roads, industrial equipment, wildlife).

Path 4: Data augmentation improvements

Augmentation | Purpose
Room impulse response (RIR) convolution | Simulate speaker-to-mic acoustic path
Speaker frequency response simulation | Model typical phone/laptop speaker rolloff
Multi-drone mixing | Detect when 2+ drones present simultaneously
Wind noise overlay | Outdoor robustness
Variable-distance amplitude scaling | Continuous distance modeling
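As a sketch of the speaker-rolloff augmentation, a first-order RC high-pass with a corner near the ~150 Hz speaker cutoff noted in Section 8.4 is a crude but useful stand-in (the filter choice and corner frequency are our assumptions):

```python
import math

def one_pole_highpass(x, sr=44100, cutoff_hz=150.0):
    """First-order high-pass filter approximating small-speaker LF rolloff."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sr
    a = rc / (rc + dt)                       # filter coefficient
    y, prev_x, prev_y = [], 0.0, 0.0
    for s in x:
        prev_y = a * (prev_y + s - prev_x)   # y[n] = a*(y[n-1] + x[n] - x[n-1])
        prev_x = s
        y.append(prev_y)
    return y
```

Applying this to synthetic training audio would attenuate the low fundamentals of fixed-wing and helicopter profiles in roughly the way a phone speaker does.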

9.3 Cross-Modal Learning

Future work could explore cross-modal training where visual and acoustic detections reinforce each other:

  1. When visual detection confirms a drone, label the concurrent acoustic data
  2. Build a self-supervised dataset from field deployments
  3. Train a fusion model that jointly processes visual features + mel spectrograms
  4. This approach could improve detection confidence when either modality alone is uncertain

10. Related Work

Commercial C-UAS systems: DroneShield DroneSentry [5] uses acoustic arrays + radar + RF detection with costs exceeding $500K. Dedrone DedroneTracker [6] uses camera + RF + radar with cloud processing. Neither operates on mobile devices.

Academic drone detection: Kim et al. [7] demonstrated acoustic drone detection using mel spectrograms with 94% accuracy on recorded drone audio. Al-Emadi et al. [8] proposed a CNN-based acoustic classifier achieving 96% accuracy on real drone recordings. Our synthetic data approach complements these works by enabling rapid prototyping without drone access.

On-device ML: YOLOv8n represents the state-of-the-art in mobile object detection [3], achieving 37.3% mAP on COCO at 80+ FPS on mobile GPUs. Our application demonstrates its viability for specialized drone detection tasks.


11. Conclusion

VAJRA demonstrates that a practical multi-sensor counter-UAS system can operate entirely on a commercial smartphone. By combining custom-trained visual (YOLOv8n, 12MB) and acoustic (CNN, 129KB) deep learning models with real-time FFT signal processing and an RF analysis pipeline, the system provides drone detection, classification, and countermeasure guidance without any network dependency.

The synthetic audio training approach enables rapid model development for new drone types without requiring physical access to each drone. While the synthetic-to-real domain gap remains a challenge, we outline a clear path to closing this gap through real-world recordings, transfer learning, and environmental calibration.

VAJRA's fully on-device architecture makes it uniquely suited for:

  • Forward military positions with denied/degraded communications
  • Border security posts in remote areas without network infrastructure
  • Critical infrastructure protection where data sovereignty requires on-premises processing
  • Rapid deployment -any soldier with a smartphone becomes a drone detection node

Future work will focus on expanding the training datasets with real drone recordings, implementing acoustic direction-of-arrival using multi-microphone arrays, and integrating external SDR hardware for production RF analysis capability.


References

  • [1] Watling, J. & Reynolds, N. "The Role of Drones in the Russia-Ukraine War." Royal United Services Institute, 2023.
  • [2] Ministry of Defence, Government of India. "Anti-Drone Technology Requirements for Border Security." DRDO Technology Perspective, 2024.
  • [3] Jocher, G., Chaurasia, A., & Qiu, J. "Ultralytics YOLOv8." Ultralytics, 2023.
  • [4] Park, D.S. et al. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech, 2019.
  • [5] DroneShield. "DroneSentry: Integrated Detect-and-Defeat Counter-Drone System." DroneShield Technical Datasheet, 2024.
  • [6] Dedrone. "DedroneTracker: AI-Powered Airspace Security Platform." Dedrone Product Documentation, 2024.
  • [7] Kim, J. et al. "Acoustic-Based Drone Detection and Classification Using Mel Spectrograms and Convolutional Neural Networks." IEEE Access, vol. 9, 2021.
  • [8] Al-Emadi, S. et al. "Audio Based Drone Detection and Identification Using Deep Learning." IEEE International Workshop on Signal Processing Advances in Wireless Communications, 2019.