Variable attention masking for configurable transformer transducer speech recognition


This work studies the use of attention masking in transformer transducer based speech recognition to build a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, in which the same attention mask is applied at every frame, with chunked masking, in which the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can operate in different configurations. Finally, we investigate how a single configurable model can be used to perform both first-pass streaming recognition and second-pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs. latency trade-off than fixed masking, both with and without FastEmit. We also show that variable masking improves accuracy by up to 8% relative in the acoustic rescoring scenario.
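To make the three masking schemes concrete, the sketch below shows one plausible way to construct the corresponding boolean attention masks in NumPy. All names and parameters (`fixed_mask`, `chunked_mask`, `sample_variable_mask`, the chunk/context choices, and the uniform sampling) are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def fixed_mask(T, left, right):
    # Fixed masking: every frame t attends to the same relative
    # window [t - left, t + right] of frames.
    q = np.arange(T)[:, None]   # query frame indices (rows)
    k = np.arange(T)[None, :]   # key frame indices (columns)
    return (k >= q - left) & (k <= q + right)

def chunked_mask(T, chunk_size, left_chunks):
    # Chunked masking: the mask is determined by chunk boundaries;
    # each frame attends to its whole chunk (including frames later
    # in the same chunk) plus `left_chunks` preceding chunks.
    cid = np.arange(T) // chunk_size
    q, k = cid[:, None], cid[None, :]
    return (k <= q) & (k >= q - left_chunks)

def sample_variable_mask(T, chunk_choices, left_choices, rng):
    # Variable masking: sample a masking configuration from a target
    # distribution at training time so a single model learns to run
    # under several deployment configurations.
    return chunked_mask(T, rng.choice(chunk_choices), rng.choice(left_choices))

rng = np.random.default_rng(0)
mask = sample_variable_mask(T=16, chunk_choices=[2, 4, 8], left_choices=[1, 2], rng=rng)
# In self-attention, disallowed positions (mask == False) would have their
# logits set to -inf before the softmax.
```

Under this reading, a streaming first pass could use a chunked mask with limited left context, while the second-pass acoustic rescoring configuration could use an unrestricted (all-True) mask over the full utterance.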
