Variable attention masking for configurable transformer transducer speech recognition


This work studies the use of attention masking in transformer transducer based speech recognition to build a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, in which the same attention mask is applied at every frame, with chunked masking, in which the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can operate in different configurations. Finally, we investigate how a single configurable model can be used to perform both first-pass streaming recognition and second-pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs. latency trade-off than fixed masking, both with and without FastEmit. We also show that variable masking improves accuracy by up to 8% relative in the acoustic rescoring scenario.
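To make the three masking schemes concrete, the sketch below shows one plausible way to construct the corresponding boolean attention masks in NumPy. All names and parameters (`fixed_mask`, `chunked_mask`, `sample_variable_mask`, the chunk/context choices, and the uniform sampling) are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def fixed_mask(T, left, right):
    # Fixed masking: every frame t attends to the same relative
    # window [t - left, t + right] of frames.
    q = np.arange(T)[:, None]   # query frame indices (rows)
    k = np.arange(T)[None, :]   # key frame indices (columns)
    return (k >= q - left) & (k <= q + right)

def chunked_mask(T, chunk_size, left_chunks):
    # Chunked masking: the mask is determined by chunk boundaries;
    # each frame attends to its whole chunk (including frames later
    # in the same chunk) plus `left_chunks` preceding chunks.
    cid = np.arange(T) // chunk_size
    q, k = cid[:, None], cid[None, :]
    return (k <= q) & (k >= q - left_chunks)

def sample_variable_mask(T, chunk_choices, left_choices, rng):
    # Variable masking: sample a masking configuration from a target
    # distribution at training time so a single model learns to run
    # under several deployment configurations.
    return chunked_mask(T, rng.choice(chunk_choices), rng.choice(left_choices))

rng = np.random.default_rng(0)
mask = sample_variable_mask(T=16, chunk_choices=[2, 4, 8], left_choices=[1, 2], rng=rng)
# In self-attention, disallowed positions (mask == False) would have their
# logits set to -inf before the softmax.
```

Under this reading, a streaming first pass could use a chunked mask with limited left context, while the second-pass acoustic rescoring configuration could use an unrestricted (all-True) mask over the full utterance.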
