
Speaker Diarization

Introduction

This recipe provides speaker diarization methods that address the problem of "who spoke when", using both an audio-only pipeline and a multimodal (audio-visual) pipeline. The audio diarization comprises overlap detection, voice activity detection, speech segmentation, and speaker embedding extraction. The video diarization comprises face detection, active speaker detection, face recognition, and lip recognition. The audio and visual embeddings are then clustered jointly to produce the multimodal diarization result.

The table below reports the diarization error rate (DER) of the two diarization pipelines on a multi-person conversation video dataset.

Pipeline	DER
Audio-only diarization	5.3%
Multimodal diarization	3.7%
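
As a reading aid for the table above, here is a minimal sketch of the standard DER definition: missed speech, false alarm, and speaker confusion time divided by total reference speech time. The function names are illustrative and are not part of this recipe, and the scoring options actually used for the reported numbers (for example collar or overlap handling) are not stated here.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard DER: summed error durations over total reference speech (seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Illustrative durations (not measured data): 6 s of errors over 100 s of speech.
example_der = diarization_error_rate(2.0, 1.0, 3.0, 100.0)

# Relative DER reduction implied by the table: 5.3% -> 3.7%.
print(f"{relative_reduction(5.3, 3.7):.1%}")  # prints 30.2%
```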

Usage

Quick Start

Ensure that ffmpeg is available in your environment.

sudo apt-get update
sudo apt-get install ffmpeg

pyannote/segmentation-3.0 is used as the overlapping speech detection module. Make sure to accept the pyannote/segmentation-3.0 user conditions and create an access token at hf.co/settings/tokens.

Stage 1: Generate video.list and wav.list.
Stage 2: Process the audio and use the CAM++ speaker recognition model (Tongyi) to extract speaker embeddings (auditory modality) for each sub-segment of the audio.
- First, perform speaker overlap detection to obtain overlap.list.
- Remove the samples with speaker overlap to obtain clean_wav.list and clean_video.list.
- Use the FSMN-Monophone VAD model to run VAD on the audio and perform fine-grained segmentation to obtain json/vad.json.
- Prepare the sub-segment information to obtain json/subseg.json.
- Use the CAM++ model to extract the speaker embeddings from the wav audio and save them to embs_wav.
Stage 3: Process the video and extract the speaker's facial data (visual modality) through a face detection model, an active speaker detection model, a face recognition model, and a facial landmark detection model.
- For 25 fps video, sample one frame every 5 frames (every 0.2 seconds).
- Detect all faces in the sampled frames using a lightweight, fast face detection model.
- Score all faces using the TalkNet-ASD model, and take the face with the highest score as the active speaker's face.
- (Optional, but not recommended) Use a face quality assessment model to filter out low-quality faces.
- Use the CurricularFace model to extract the face embedding of the speaker in the active frame.
- Use the FAN model to perform 2D facial keypoint detection on the speaker's face, obtain the mouth coordinates (relative coordinates) for each face frame, and extract the raw face and mouth data.
Stage 4: Jointly cluster the audio and visual embeddings to obtain the multimodal active speaker detection results, and save them in an RTTM file.
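
The frame sampling and active-speaker selection described in Stage 3 can be sketched as follows. This is only an illustration of the two rules stated above (one frame every 5 frames at 25 fps, and the highest-scoring face wins); the function names and the face IDs are hypothetical and do not come from the recipe's source code.

```python
from typing import Dict, List, Optional


def sampled_frame_times(num_frames: int, fps: float = 25.0, step: int = 5) -> List[float]:
    """Timestamps (in seconds) of every `step`-th frame of a video."""
    return [index / fps for index in range(0, num_frames, step)]


def pick_active_face(asd_scores: Dict[str, float]) -> Optional[str]:
    """Return the face ID with the highest ASD score, or None if no face was detected."""
    if not asd_scores:
        return None
    return max(asd_scores, key=asd_scores.get)


# 26 frames at 25 fps: frames 0, 5, 10, 15, 20, 25 are kept, 0.2 s apart.
times = sampled_frame_times(num_frames=26)

# Hypothetical ASD scores for two faces detected in one sampled frame.
active = pick_active_face({"face_0": 0.12, "face_1": 0.87})
print(times, active)
```

With the default arguments the spacing between sampled frames is 5 / 25 = 0.2 seconds, which matches the interval given in the Stage 3 description.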

Run all four stages with the following command, where hf_access_token is your Hugging Face access token:

bash run.sh --stage 1 --stop_stage 4 --hf_access_token hf_xxx --root datasets/clean/zh --gpus "0 1 2 3"
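
Stage 4 saves its results in RTTM format. The sketch below parses the standard RTTM `SPEAKER` line layout (file ID, channel, onset, duration, and speaker label in fixed positions) into speaker turns. The file ID `sample` and the labels `spk0` and `spk1` are made up for illustration, and the location where run.sh writes its RTTM files is not covered here.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse RTTM text into speaker turns, skipping non-SPEAKER or malformed lines."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(fields[1], onset, onset + duration, fields[7]))
    return turns


# Illustrative RTTM content (not real output of this recipe).
example = """SPEAKER sample 1 0.00 2.50 <NA> <NA> spk0 <NA> <NA>
SPEAKER sample 1 2.50 1.25 <NA> <NA> spk1 <NA> <NA>"""

turns = parse_rttm(example)
for turn in turns:
    print(f"{turn.speaker}: {turn.start:.2f}s to {turn.end:.2f}s")
```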

To better understand the source code, you can run single-sample inference with the sample.mp4 and run.sh files in the speaker_diarization_sample subfolder.

Limitations

The pipeline may not perform well when the audio is too short or when the number of speakers is too large.
The final accuracy depends heavily on the performance of each module; the ASD model in particular affects the quality of the results.
