A state-of-the-art neural network designed to detect deepfake videos and highlight the altered areas in each frame using explainable AI.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1a1a2e', 'primaryTextColor': '#fff', 'lineColor': '#ffffff', 'background': '#0d1117', 'mainBkg': '#0d1117'}}}%%
flowchart TB
    subgraph Preprocess[" Preprocess "]
        A[MP4 Video] --> B[Frames extraction]
        B --> C[Faces detection]
        C --> D[Faces extraction]
    end
    D --> E
    D --> G
    D --> H

    subgraph Processing[" "]
        direction LR
        subgraph Spatial_Module[" Spatial Module "]
            E["Features Extraction<br/>EfficientNet"] -->|"1792, 7, 7"| F((Pooled))
        end
        subgraph Frequency_Module[" Frequency Module "]
            G["DCT Extractor<br/>8×8, freq_bands"] -->|"512"| I((Concat))
            H["FFT Extractor<br/>radial=8, hann"] -->|"512"| I
            I -->|"1024"| J["Fusion MLP<br/>1024→512→1024"]
        end
    end
    F -->|"1792"| K
    J -->|"1024"| K

    subgraph Classifier_Module[" Classifier Module "]
        K((Concat)) -->|"2816"| L[MLP]
        L --> M[Frames Aggregation]
        M --> N[Final Score]
    end
    E --> O
    D --> Q

    subgraph Gradcam_Module[" Gradcam Module "]
        O[Gradcam Computation] --> P[Gradcam Visualization]
        P --> Q((Remap))
        Q --> R[Final Manipulated Video]
    end

    %% Styling
    classDef preprocess fill:#1a1a2e,stroke:#f59e0b,color:#f59e0b
    classDef spatial fill:#1a1a2e,stroke:#10b981,color:#10b981
    classDef frequency fill:#1a1a2e,stroke:#3b82f6,color:#3b82f6
    classDef classifier fill:#1a1a2e,stroke:#ec4899,color:#ec4899
    classDef gradcam fill:#1a1a2e,stroke:#a855f7,color:#a855f7
    classDef default fill:#1a1a2e,stroke:#6b7280,color:#fff
    class B,C,D preprocess
    class E,F spatial
    class G,H,I,J frequency
    class K,L,M,N classifier
    class O,P,Q,R gradcam
    style Preprocess fill:#0d1117,stroke:#f59e0b,color:#f59e0b
    style Spatial_Module fill:#0d1117,stroke:#10b981,color:#10b981
    style Frequency_Module fill:#0d1117,stroke:#3b82f6,color:#3b82f6
    style Classifier_Module fill:#0d1117,stroke:#ec4899,color:#ec4899
    style Gradcam_Module fill:#0d1117,stroke:#a855f7,color:#a855f7
    style Processing fill:transparent,stroke:none
```

```mermaid
graph LR
    Input["Input Image<br/>224×224×3"] --> Spatial["<b>Spatial Stream</b><br/>EfficientNet-B4<br/>(ImageNet)"]
    Input --> Frequency["<b>Frequency Stream</b>"]
    Frequency --> FFT["FFT Extractor<br/>8 radial bands<br/>Hann window"]
    Frequency --> DCT["DCT Extractor<br/>8×8 blocks<br/>frequency bands"]
    FFT --> FFT_Out["512-dim"]
    DCT --> DCT_Out["512-dim"]
    FFT_Out --> Fusion["Fusion MLP"]
    DCT_Out --> Fusion
    Spatial --> Spatial_Out["1792-dim"]
    Fusion --> Fusion_Out["1024-dim"]
    Spatial_Out --> Concat["Concatenate<br/>2816-dim"]
    Fusion_Out --> Concat
    Concat --> Classifier["<b>Classifier MLP</b><br/>1024 → 512 → 256"]
    Classifier --> Output["Output<br/>FAKE/REAL<br/>+ confidence"]

    style Input fill:#1e293b,stroke:#3b82f6,stroke-width:2px,color:#fff
    style Spatial fill:#0f172a,stroke:#10b981,stroke-width:2px,color:#10b981
    style Frequency fill:#0f172a,stroke:#3b82f6,stroke-width:2px,color:#3b82f6
    style FFT fill:#0d1117,stroke:#06b6d4,color:#06b6d4
    style DCT fill:#0d1117,stroke:#06b6d4,color:#06b6d4
    style Fusion fill:#0d1117,stroke:#3b82f6,color:#3b82f6
    style Classifier fill:#0f172a,stroke:#ec4899,stroke-width:2px,color:#ec4899
    style Output fill:#1e293b,stroke:#a855f7,stroke-width:2px,color:#fff
    style Concat fill:#0d1117,stroke:#8b5cf6,color:#8b5cf6
    style FFT_Out fill:#0d1117,stroke:#64748b,color:#64748b
    style DCT_Out fill:#0d1117,stroke:#64748b,color:#64748b
    style Spatial_Out fill:#0d1117,stroke:#64748b,color:#64748b
    style Fusion_Out fill:#0d1117,stroke:#64748b,color:#64748b
```
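
The diagrams above condense into a short PyTorch sketch of the per-frame forward pass. This is a minimal illustration, not the repository's actual code: `HybridDeepfakeDetector`, `block_dct_energies`, and `radial_band_energies` are hypothetical names, the backbone is assumed to be loaded through `timm`, and both frequency extractors are simplified, but the dimensions (1792 spatial + 1024 frequency → 2816) follow the figures.

```python
import math

import torch
import torch.nn as nn
import timm  # assumption: EfficientNet-B4 backbone loaded via timm


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n, dtype=torch.float32)
    mat = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None])
    mat[0] /= math.sqrt(2)
    return mat * math.sqrt(2 / n)


def block_dct_energies(x: torch.Tensor, block: int = 8) -> torch.Tensor:
    """Mean |DCT coefficient| per (u, v) position over all 8x8 blocks -> (B, 64)."""
    gray = x.mean(dim=1)  # (B, H, W) grayscale
    b = gray.shape[0]
    blocks = gray.unfold(1, block, block).unfold(2, block, block)
    blocks = blocks.reshape(b, -1, block, block)
    d = dct_matrix(block).to(x.device)
    coeffs = d @ blocks @ d.T  # 2D DCT-II of every block
    return coeffs.abs().mean(dim=1).reshape(b, -1)


def radial_band_energies(x: torch.Tensor, bands: int = 8) -> torch.Tensor:
    """Mean log FFT magnitude in concentric radial bands (Hann-windowed) -> (B, bands)."""
    gray = x.mean(dim=1)
    win = torch.hann_window(gray.shape[-1], device=x.device)
    gray = gray * win[None, :, None] * win[None, None, :]  # 2D Hann window
    mag = torch.fft.fftshift(torch.fft.fft2(gray).abs(), dim=(-2, -1))
    h, w = mag.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.arange(h, device=x.device) - h // 2,
        torch.arange(w, device=x.device) - w // 2,
        indexing="ij",
    )
    radius = (yy.float() ** 2 + xx.float() ** 2).sqrt()
    band_idx = (radius / radius.max() * (bands - 1)).long()
    return torch.stack(
        [torch.log1p(mag[:, band_idx == i].mean(dim=1)) for i in range(bands)],
        dim=1,
    )


class HybridDeepfakeDetector(nn.Module):
    """Illustrative two-stream model: spatial (1792) + frequency (1024) -> 2816."""

    def __init__(self):
        super().__init__()
        # Spatial stream: EfficientNet-B4, globally pooled to a 1792-dim vector
        self.backbone = timm.create_model(
            "efficientnet_b4", pretrained=True, num_classes=0, global_pool="avg"
        )
        # Frequency stream: project DCT (64) and FFT (8) features to 512 each
        self.dct_proj = nn.Linear(64, 512)
        self.fft_proj = nn.Linear(8, 512)
        self.fusion = nn.Sequential(  # Fusion MLP: 1024 -> 512 -> 1024
            nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1024)
        )
        self.classifier = nn.Sequential(  # Classifier MLP: 2816 -> ... -> 1
            nn.Linear(2816, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spatial = self.backbone(x)  # (B, 1792)
        freq = torch.cat(
            [self.dct_proj(block_dct_energies(x)),
             self.fft_proj(radial_band_energies(x))], dim=1
        )  # (B, 1024)
        fused = torch.cat([spatial, self.fusion(freq)], dim=1)  # (B, 2816)
        return torch.sigmoid(self.classifier(fused)).squeeze(1)  # per-frame fake score
```

Per-frame scores are then aggregated across the video (e.g. by majority vote, the API's default `aggregation_method`) to produce the final verdict.
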
| Specification | Value |
|---|---|
| Total Parameters | 25.05M |
| Input Size | 224×224 RGB |
| Output | Binary (FAKE/REAL) + confidence |
| Backbone | EfficientNet-B4 (19.34M params) |
| Frequency Module | 2.16M params |
| Classifier | 3.54M params |
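
The Gradcam Module in the first diagram can be sketched with standard hook-based Grad-CAM on the spatial backbone's last convolutional block. A generic sketch, with `target_layer` as a placeholder; the repository's implementation may differ:

```python
import torch

def gradcam(model, x, target_layer):
    """Grad-CAM heatmap for the fake score; generic sketch, not the repo's code."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(x).sum()  # per-frame fake score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)   # channel importance weights
    cam = torch.relu((w * feats[0]).sum(dim=1))   # (B, 7, 7) activation map
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```

The resulting 7×7 heatmap is upsampled to the 224×224 face crop and remapped to the original frame coordinates, which is what the Remap step in the first diagram produces.
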
We trained the model on an RTX 3090 (with CUDA) for approximately 4 hours, using the GPU provider vast.ai. The training script is included in the repository.

We started from an existing dataset found on Kaggle containing 7,000 videos covering numerous deepfake techniques. We extracted the frames and faces from these videos to build our own dataset, the VeridisQuo Preprocessed Dataset, using the following pipeline: frame extraction, face detection, and face cropping (a sketch follows below).

Total: 716,438 images
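
A minimal sketch of such a preprocessing pipeline, assuming OpenCV; the Haar-cascade face detector and the `sample_fps` parameter are illustrative stand-ins, not necessarily what was used to build the dataset:

```python
import cv2
from pathlib import Path

def extract_faces(video_path: str, out_dir: str, sample_fps: float = 1.0) -> int:
    """Extract frames at sample_fps, detect faces, and save 224x224 crops."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(video_fps / sample_fps)), 1)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
                crop = cv2.resize(frame[y:y + h, x:x + w], (224, 224))
                cv2.imwrite(f"{out_dir}/face_{saved:06d}.jpg", crop)
                saved += 1
        idx += 1
    cap.release()
    return saved
```
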
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/health | Health check and model status |
| POST | /api/v1/analyze | Analyze video for deepfakes |
| GET | /api/v1/outputs/{filename} | Download GradCAM visualization |
| DELETE | /api/v1/outputs/{filename} | Delete output file |
```bash
curl -X POST http://localhost:8000/api/v1/analyze \
  -F "file=@video.mp4" \
  -F "fps=1" \
  -F "aggregation_method=majority" \
  -F "generate_gradcam=true"
```

Parameters:
- `file`: Video file (MP4, AVI, MOV, MKV, WEBM)
- `fps`: Frames per second to extract (default: 1)
- `aggregation_method`: Score aggregation (default: majority)
- `generate_gradcam`: Generate visualization video (default: false)
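
The same call can be made programmatically; a minimal Python sketch using the `requests` library, assuming the server from the curl example above and the JSON response format shown below:

```python
import requests

API = "http://localhost:8000/api/v1"

with open("video.mp4", "rb") as f:
    resp = requests.post(
        f"{API}/analyze",
        files={"file": ("video.mp4", f, "video/mp4")},
        data={
            "fps": 1,
            "aggregation_method": "majority",
            "generate_gradcam": "true",
        },
        timeout=600,  # GradCAM generation can take a while
    )
resp.raise_for_status()
result = resp.json()
print(result["prediction"], result["confidence"])

# If a GradCAM video was generated, download it via the outputs endpoint.
if result.get("gradcam_video_path"):
    video = requests.get(f"http://localhost:8000{result['gradcam_video_path']}")
    with open("gradcam.mp4", "wb") as out:
        out.write(video.content)
```
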
{ "prediction": "FAKE", "confidence": 0.8734, "aggregation_method": "majority", "total_frames": 120, "gradcam_video_path": "/api/v1/outputs/gradcam_video_20250102.mp4" }- Python 3.12 or 3.13
- uv package manager
- Node.js 18+ and npm (optional, for frontend)
- CUDA 11.8+ (optional, for GPU acceleration)
```bash
# Clone repository
git clone https://github.com/VeridisQuo-orga/VeridisQuo.git
cd VeridisQuo
```

```bash
chmod +x ./scripts/launch_api.sh
./scripts/launch_api.sh
```

Server runs on http://localhost:8000 | Docs at /docs
```bash
chmod +x ./scripts/launch_frontend.sh
./scripts/launch_frontend.sh
```

Development server on http://localhost:3000
If you use VeridisQuo in your research, please cite:
```bibtex
@software{veridisquo2025,
  title  = {VeridisQuo: Hybrid Deepfake Detection with Explainable AI},
  author = {Castillo, Theo and Barriere, Clement},
  year   = {2025},
  url    = {https://github.com/VeridisQuo-orga/VeridisQuo},
  note   = {Model: \url{https://huggingface.co/Gazeux33/VeridisQuo}}
}
```



