Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues ...