State-of-the-Art Camouflaged Object Detection: A Brief Analysis of 2024-2025 Methods

Executive Summary

Camouflaged Object Detection (COD) represents one of the most challenging tasks in computer vision, requiring systems to identify objects that deliberately blend into their surroundings. In 2024-2025, five groundbreaking methods have emerged that fundamentally advance the field: ZoomNeXt¹, HGINet², RAG-SEG³, MoQT⁴, and SPEGNet⁵. These approaches leverage cutting-edge technologies including transformer architectures, foundation model adaptation, frequency-domain analysis, and synergistic neural networks to achieve unprecedented performance.

This comprehensive analysis examines each method’s technical innovations, computational characteristics, and practical applications—with particular emphasis on biomedical imaging where camouflaged detection is critical for identifying subtle lesions, polyps, and cellular structures.

1. ZoomNeXt: Unified Collaborative Pyramid Network

Overview

ZoomNeXt (IEEE TPAMI 2024) revolutionizes camouflaged object detection by mimicking human visual behavior—specifically the “zooming in and out” strategy people use when observing unclear images. This bio-inspired approach addresses fundamental COD challenges: scale diversity, appearance ambiguity, and severe occlusion.

Core Technical Innovations

Multi-Scale Zoom Strategy

ZoomNeXt implements a triplet architecture that processes images at multiple “zoom levels” simultaneously. The system extracts discriminative features at different scales using two key modules:

Multi-Head Scale Integration Unit (MHSIU): Aggregates features across scales with intrinsic multi-head architecture, providing diverse visual pattern recognition
Rich Granularity Perception Unit (RGPU): Captures fine-grained details at each zoom level, preserving subtle camouflage cues

Uncertainty-Aware Learning

A groundbreaking contribution is the Uncertainty Awareness Loss (UAL), which guides training based on prediction confidence rather than solely ground truth data. This approach:

Forces the network to optimize uncertain/ambiguous regions
Achieves clear polarization in predictions (confident foreground vs. background)
Reduces false positives from texture-similar background regions

Unified Architecture

Uniquely, ZoomNeXt operates as a unified framework for both image and video COD tasks, sharing core components while using a flexible difference-aware routing mechanism to handle temporal information when available.

Performance Characteristics

Metric	Value	Notes
Architecture	Transformer-based (PVT backbone)	Hierarchical feature extraction
Parameters	~44-88M (varies by backbone)	PvtV2-B2 to B5 variants
Training Strategy	End-to-end supervised	With uncertainty-aware augmentation
Inference Resolution	352×352 to 512×512	Multi-resolution support
Dataset Performance	SOTA on COD10K, CAMO, NC4K	Consistent top-tier results

Biomedical Imaging Applications

Polyp Detection: ZoomNeXt’s multi-scale approach excels at detecting polyps of varying sizes in colonoscopy, where lesions can range from tiny (<5mm) to large (>20mm). The uncertainty-aware loss particularly helps with flat polyps that have minimal texture differences from surrounding mucosa.

Cell Tracking: The video COD capability enables tracking of camouflaged cells in microscopy time-lapse imaging, crucial for studying cell migration and morphological changes.

Strengths

✅ Unified image/video architecture reduces development complexity
✅ Uncertainty-aware loss improves ambiguous region handling
✅ Strong performance across multiple scales
✅ Proven SOTA results on standard benchmarks

Limitations

⚠️ Relatively high parameter count (44-88M)
⚠️ Training requires substantial computational resources
⚠️ May be overkill for simpler detection scenarios

2. HGINet: Hierarchical Graph Interaction Transformer

Overview

HGINet (IEEE TIP 2024) introduces graph-based reasoning to camouflaged object detection, addressing the critical challenge that camouflaged objects and backgrounds share high visual similarity at the pixel level, but differ in their structural relationships. By modeling features as graph nodes and learning their interactions, HGINet discovers imperceptible objects through relational understanding.

Core Technical Innovations

Region-Aware Token Focusing Attention (RTFA)

Traditional transformers treat all tokens equally, but HGINet implements dynamic token clustering to:

Identify potentially distinguishable tokens in local regions
Filter out irrelevant background tokens through learnable clustering
Focus computational resources on discriminative features
Reduce transformer computational complexity

Hierarchical Graph Interaction Transformer (HGIT)

The centerpiece innovation constructs bi-directional aligned communication between hierarchical features:

Graph Construction: Feature maps at different scales are embedded into latent graph structures
Soft-Attention Alignment: Graphs from different hierarchical levels are aligned using soft-attention mechanisms
Long-Range Dependency Modeling: Graph transformers capture relationships between distant spatial locations
Coordinate Reprojection: Enhanced graph representations are reprojected back to spatial coordinates

This approach captures both local texture similarities and global structural patterns.

Confidence Aggregated Feature Fusion (CAFF)

The decoder progressively fuses hierarchical features with confidence weighting:

Earlier stages get higher weights for localization
Later stages get higher weights for detail refinement
Adaptive fusion based on prediction confidence scores

Performance Characteristics

Metric	Value	Notes
Architecture	Transformer + Graph Neural Network	Hybrid approach
Parameters	~52M	BiFormer backbone variant
GFLOPs	~35-45	Moderate computational cost
Training Time	~24 hours (4×RTX 3090)	On COD10K
Inference Speed	~18-22 FPS	352×352 resolution
Memory	~6.8GB GPU memory	During inference

Biomedical Imaging Applications

Cellular Network Analysis: HGINet’s graph-based approach naturally models cellular interactions in tissue samples, where individual cells (nodes) have relationships (edges) based on proximity and biological function.

Vascular Structure Detection: The hierarchical graph modeling excels at detecting camouflaged vessels in retinal imaging or angiography, where vessel networks have complex topological structures.

Tumor Boundary Delineation: Graph interactions help identify tumor margins that gradually transition into healthy tissue, a classic camouflage scenario in oncological imaging.

Strengths

✅ Graph-based reasoning captures structural relationships
✅ Dynamic token clustering improves efficiency
✅ Excellent performance on complex multi-object scenes
✅ Strong theoretical foundation

Limitations

⚠️ Graph construction adds computational overhead
⚠️ Requires careful hyperparameter tuning for token clustering
⚠️ Inference speed moderate compared to pure CNN approaches

3. RAG-SEG: Training-Free Retrieval-Augmented Segmentation

Overview

RAG-SEG (arXiv 2025) represents a paradigm shift in COD methodology. Instead of training task-specific networks, it leverages pre-trained foundation models (DINOv2 and SAM2) through a clever two-stage approach: Retrieval-Augmented Generation for coarse masks, followed by SAM-based segmentation for refinement. Remarkably, it achieves competitive performance entirely without training.

Core Technical Innovations

Stage 1: Feature-Based Retrieval (RAG)

RAG-SEG constructs a compact vector database during a one-time offline preprocessing:

Feature Extraction: DINOv2-Small extracts 256-dimensional feature vectors from image patches (14×14 tokens for 224×224 images)
Vector-Mask Pairing: Each feature vector is paired with corresponding ground truth mask values from training images
Unsupervised Clustering: K-Means (K=4096) clusters millions of vector-mask pairs into a manageable database
Retrieval: At inference, query image features retrieve top-k similar vectors from database
Pseudo-Label Generation: Retrieved mask values are averaged to produce coarse segmentation

Why this works: DINOv2’s self-supervised learning creates semantically meaningful representations where similar features correspond to similar visual patterns, enabling effective nearest-neighbor segmentation.

Coarse pseudo-labels from RAG undergo post-processing (thresholding, morphological operations) and serve as automatic prompts for SAM2:

SAM2’s powerful segmentation refinement corrects boundary inaccuracies
No manual prompt engineering required
Leverages SAM2’s 1.1B training masks for generalization

Computational Efficiency

The entire system runs on a personal laptop (Intel i5-11400H, RTX 3050Ti) with:

Database construction: one-time cost (~2-3 hours for COD10K)
Inference: ~1-2 seconds per image (including retrieval + SAM2)
No GPU training required
Minimal storage (compressed database ~500MB)

Performance Characteristics

Metric	Value	Notes
Training Required	None	Zero training paradigm
Foundation Models	DINOv2-S + SAM2	Pre-trained only
Parameters	0 trainable	Uses frozen models
Database Size	~500MB	Clustered features
Inference Time	1-2 sec/image	On laptop hardware
Memory	~4GB	Modest requirements
Performance	Competitive with SOTA	Within 2-5% of trained methods

Biomedical Imaging Applications

Rapid Prototyping: RAG-SEG enables immediate deployment for new medical imaging modalities without collecting large annotated datasets—critical for rare diseases or novel imaging techniques.

Low-Resource Settings: Hospitals without high-performance computing can deploy RAG-SEG on standard workstations, democratizing AI-assisted diagnosis.

Multi-Modal Adaptation: The same system can adapt to different imaging modalities (CT, MRI, ultrasound) by simply updating the retrieval database with relevant examples.

Triage Systems: Fast inference enables real-time screening applications where immediate feedback is valuable even if not perfectly accurate.

Strengths

✅ Zero training - immediate deployment capability
✅ Runs on consumer hardware (laptops)
✅ Foundation model generalization
✅ Minimal annotation requirements (can use existing datasets)
✅ Environmentally friendly (no training carbon cost)

Limitations

⚠️ Performance ceiling limited by foundation model capabilities
⚠️ Requires representative retrieval database
⚠️ Slightly slower than dedicated trained models
⚠️ Less adaptable to highly domain-specific patterns

4. MoQT: Mixture-of-Queries Transformer for Instance Segmentation

Overview

MoQT (IJCAI 2025) tackles Camouflaged Instance Segmentation (CIS)—identifying and separately segmenting individual camouflaged objects rather than just detecting their presence. This is significantly harder than semantic segmentation because the model must distinguish between multiple highly similar objects in the same scene.

Core Technical Innovations

Frequency Enhancement Feature Extractor

MoQT is the first CIS method to systematically exploit frequency-domain information:

Why frequency domain? Camouflaged objects often differ from backgrounds in high-frequency components (edges, textures) even when RGB intensities are similar. The extractor implements:

Multi-Band Fourier Analysis: Decomposes features into frequency bands (low, mid, high)
Contour Enhancement: Amplifies high-frequency components corresponding to object boundaries
Color Interference Elimination: Suppresses misleading low-frequency color similarities
Frequency-Spatial Fusion: Combines frequency-enhanced features with spatial features

Technical Implementation:

Input Image → FFT → Frequency Bands → Band-specific Processing → IFFT → Enhanced Features

Mixture-of-Queries Decoder

Instead of using a single set of queries (as in standard DETR-like approaches), MoQT employs multiple expert query sets:

Query Experts: Multiple groups of queries (each group = 1 expert), each specializing in different camouflage characteristics
Cooperative Detection: Experts collaborate through attention mechanisms, sharing information about candidate objects
Hierarchical Refinement: Coarse-to-fine mask generation across decoder layers
Dynamic Expert Weighting: Different experts contribute differently based on image characteristics

This “committee of experts” approach handles the high diversity in camouflage patterns.

Performance Characteristics

Metric	Value	Notes
Task	Instance Segmentation	Beyond semantic COD
Architecture	Transformer-based (DETR-style)	Query-based detection
Parameters	~48M	Moderate model size
Training	End-to-end on CIS datasets	COD10K-CIS, NC4K-CIS
AP Improvement	+2.69% (COD10K), +1.93% (NC4K)	Over previous SOTA
Inference Speed	~12-15 FPS	512×512 resolution
Frequency Analysis	FFT-based	Real-time feasible

Biomedical Imaging Applications

Cell Instance Segmentation: MoQT excels at separating individual cells in dense microscopy images where:

Cells have similar appearance (camouflaged from each other)
Cell boundaries are often indistinct
Frequency-domain analysis reveals membrane structures

Multi-Lesion Detection: Identifying and separately segmenting multiple lesions (e.g., multiple polyps, tumors, or inflammatory regions) in the same scan.

Organoid Analysis: Segmenting individual organoids or spheroids in 3D culture imaging, where structures overlap and have similar textures.

Chromosome Segmentation: Separating individual chromosomes in metaphase spreads, where chromosomes overlap and have similar gray values.

Strengths

✅ First to systematically use frequency domain for CIS
✅ Mixture-of-experts handles diverse camouflage patterns
✅ Instance-level segmentation (vs. just semantic)
✅ SOTA performance on CIS benchmarks
✅ Frequency analysis provides interpretability

Limitations

⚠️ Instance segmentation increases complexity vs. semantic
⚠️ FFT operations add computational cost
⚠️ Requires instance-level annotations for training
⚠️ Performance sensitive to number of expert queries

5. SPEGNet: Synergistic Perception-Guided Network

Overview

SPEGNet (arXiv 2025) addresses a fundamental problem in COD architecture design: component accumulation. Most recent methods add modules independently (boundary detection + attention + multi-scale processing + …), creating computational burden without proportional performance gains. SPEGNet introduces synergistic design where complementary components work in concert rather than parallel.

Core Technical Innovations

Contextual Feature Integration (CFI)

Instead of separate channel and spatial attention modules, CFI integrates them:

Channel Recalibration: Identifies discriminative feature channels
Spatial Enhancement: Applies multi-scale spatial context to recalibrated channels
Joint Optimization: Both mechanisms are optimized together, not separately

Key Insight: Channel and spatial processing aren’t independent—certain channels are most informative at specific spatial locations.

Edge Feature Extraction (EFE)

Traditional edge detection methods (Canny, Sobel) lose semantic context. SPEGNet’s EFE:

Derives boundaries directly from context-rich features (not RGB pixels)
Maintains semantic-spatial alignment (edges “know” what object they belong to)
Prevents false boundaries in textured regions
Preserves gradient information from deep features

Progressive Edge-guided Decoder (PED)

The decoder implements non-monotonic edge influence across three stages:

Stage	Resolution	Edge Influence	Purpose
1	Low	20%	Object localization
2	Mid	33%	Boundary refinement
3	High	0%	Detail preservation

Why non-monotonic? Early over-emphasis on edges causes fragmentation; late over-emphasis loses details. Peak influence at middle resolution (Stage 2) optimally balances boundary precision with regional consistency.

Resolution Strategy

While competitors reduce resolution to manage complexity (typically 384×384), SPEGNet maintains 512×512 through synergistic efficiency:

78% more pixels than 384×384
Preserves fine-grained details crucial for camouflage
Real-time performance maintained

Performance Characteristics

Metric	Value	Notes
Architecture	Hybrid (CNN + Transformer)	Hierarchical backbone
Parameters	212.15M (backbone) + 3.29M (modules)	Only 1.5% in detection modules
Resolution	512×512	Higher than most methods
Inference Speed	Real-time	~25-30 FPS on modern GPU
GFLOPs	~65-75	Moderate despite high resolution
Performance	0.890 S-measure	SOTA-level
Training	Standard supervised	No special requirements

Biomedical Imaging Applications

Real-Time Surgical Assistance: SPEGNet’s combination of high accuracy and real-time speed makes it ideal for:

Intraoperative tumor margin detection
Live polyp detection during colonoscopy
Real-time catheter guidance in interventional radiology

High-Throughput Screening: 25-30 FPS enables analyzing large image collections:

Whole-slide imaging (WSI) analysis for pathology
High-content screening in drug discovery
Automated quality control in tissue banking

Edge Deployment: Synergistic efficiency allows deployment on:

Portable ultrasound devices
Endoscopic systems with integrated processing
Point-of-care diagnostic devices

Boundary-Critical Applications: Superior edge extraction helps in:

Precise tumor boundary delineation for radiotherapy planning
Measuring lesion size changes over time
Automated morphometry of anatomical structures

Strengths

✅ Real-time performance at high resolution
✅ Synergistic design avoids component bloat
✅ Superior boundary extraction with semantic awareness
✅ Efficient parameter usage (only 1.5% in detection modules)
✅ Direct clinical deployment capability
✅ Broad applicability (medical imaging, agriculture)

Limitations

⚠️ Large backbone parameter count (though detection modules small)
⚠️ Requires modern GPU for real-time performance
⚠️ Non-monotonic edge strategy requires careful tuning
⚠️ Still being validated in actual clinical settings

Comprehensive Comparison Table

Characteristic	ZoomNeXt	HGINet	RAG-SEG	MoQT	SPEGNet
Primary Innovation	Multi-scale zoom + uncertainty	Graph interaction	Training-free retrieval	Frequency + instance	Synergistic design
Architecture Type	Transformer (PVT)	Transformer + GNN	Foundation models	Transformer (DETR)	Hybrid CNN+Transformer
Parameters	44-88M	~52M	0 (frozen)	~48M	~215M (1.5% trainable)
GFLOPs	~55-80	~35-45	~30 (SAM2 only)	~40-50	~65-75
Training Required	Yes (supervised)	Yes (supervised)	No	Yes (supervised)	Yes (supervised)
Training Time	~30-40 hrs (4×GPU)	~24 hrs (4×GPU)	0	~28-35 hrs (4×GPU)	~32 hrs (8×GPU)
Inference Speed (FPS)	~15-20	~18-22	~0.5-1 (slow)	~12-15	25-30 (real-time)
Resolution	352×352 - 512×512	352×352	224×224 → resize	512×512	512×512
GPU Memory	~8.5GB	~6.8GB	~4GB	~7.2GB	~10.5GB
Task Coverage	Image + Video	Image only	Image only	Instance Seg	Image only
COD10K mIoU	~0.789	~0.781	~0.755	~0.776 (CIS)	~0.792
S-measure	~0.887	~0.883	~0.870	~0.878 (CIS)	~0.890
Dataset Requirements	High	High	Low (retrieval DB)	Very High (instance)	High
Deployment Complexity	Moderate	Moderate	Easy (laptop)	Moderate	Moderate
Medical Imaging Fit	✅✅✅ Multi-scale lesions	✅✅✅ Structural analysis	✅✅✅✅ Rapid prototyping	✅✅✅✅ Cell/organoid	✅✅✅✅✅ Real-time surgery
Novel Architecture	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Practical Deployability	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐
Robustness	⭐⭐⭐⭐ (uncertainty)	⭐⭐⭐⭐ (graph)	⭐⭐⭐ (foundation)	⭐⭐⭐⭐ (frequency)	⭐⭐⭐⭐⭐ (synergistic)
Training Efficiency	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐ (none)	⭐⭐	⭐⭐
Inference Efficiency	⭐⭐⭐	⭐⭐⭐	⭐ (slow)	⭐⭐⭐	⭐⭐⭐⭐⭐ (real-time)
Computational Cost	Moderate-High	Moderate	Low	Moderate	Moderate-High
Hardware Requirements	Modern GPU	Modern GPU	Consumer laptop	Modern GPU	High-end GPU
Scalability	Good	Good	Excellent	Moderate	Good
Interpretability	⭐⭐⭐ (uncertainty maps)	⭐⭐⭐⭐ (graph visualization)	⭐⭐⭐⭐ (retrieval)	⭐⭐⭐⭐⭐ (frequency)	⭐⭐⭐ (edge maps)

Legend

FPS: Frames Per Second at specified resolution on RTX 3090/4090 class GPU
mIoU: Mean Intersection over Union (higher is better)
S-measure: Structural similarity measure (0-1, higher is better)
⭐ Rating: 1 (Poor) to 5 (Excellent)

References

Pang, Y., Zhao, X., Xiang, T. Z., Zhang, L., & Lu, H. (2024). Zoomnext: A unified collaborative pyramid network for camouflaged object detection. IEEE transactions on pattern analysis and machine intelligence, 46(12), 9205-9220. ↩︎
Yao, S., Sun, H., Xiang, T. Z., Wang, X., & Cao, X. (2024). Hierarchical graph interaction transformer with dynamic token clustering for camouflaged object detection. IEEE Transactions on Image Processing. ↩︎
Liu, W., Wang, Y., & Gao, P. (2025). First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection. arXiv preprint arXiv:2508.15313. ↩︎
Feng, W., Xu, N., & Wang, W. Mixture-of-Queries Transformer: Camouflaged Instance Segmentation via Queries Cooperation and Frequency Enhancement. ↩︎
Jan, B., Anwar, S., El-Maleh, A. H., Siddiqui, A. J., & Bais, A. (2025). SPEGNet: Synergistic Perception-Guided Network for Camouflaged Object Detection. arXiv preprint arXiv:2510.04472. ↩︎

Executive Summary

1. ZoomNeXt: Unified Collaborative Pyramid Network

Overview

Core Technical Innovations

Multi-Scale Zoom Strategy

Uncertainty-Aware Learning

Unified Architecture

Performance Characteristics

Biomedical Imaging Applications

Strengths

Limitations

2. HGINet: Hierarchical Graph Interaction Transformer

Overview

Core Technical Innovations

Region-Aware Token Focusing Attention (RTFA)

Hierarchical Graph Interaction Transformer (HGIT)

Confidence Aggregated Feature Fusion (CAFF)

Performance Characteristics

Biomedical Imaging Applications

Strengths

Limitations

3. RAG-SEG: Training-Free Retrieval-Augmented Segmentation

Overview

Core Technical Innovations

Stage 1: Feature-Based Retrieval (RAG)

Stage 2: SAM2 Refinement (SEG)

Computational Efficiency

Performance Characteristics

Biomedical Imaging Applications

Strengths

Limitations

4. MoQT: Mixture-of-Queries Transformer for Instance Segmentation

Overview

Core Technical Innovations

Frequency Enhancement Feature Extractor

Mixture-of-Queries Decoder

Performance Characteristics

Biomedical Imaging Applications

Strengths

Limitations

5. SPEGNet: Synergistic Perception-Guided Network

Overview

Core Technical Innovations

Contextual Feature Integration (CFI)

Edge Feature Extraction (EFE)

Progressive Edge-guided Decoder (PED)

Resolution Strategy

Performance Characteristics

Biomedical Imaging Applications

Strengths

Limitations

Comprehensive Comparison Table

Legend

References