Speech Inversion Service — User Guide
Interpretable and Available Acoustic-to-Articulatory Speech Inversion Funded by the ASHFoundation New Investigators Research Grant PI: Nina R. Benway, PhD CCC-SLP · University of Maryland, College Park
What Is This Service?
The Speech Inversion Service estimates how the vocal tract is moving during speech.
These estimates are called tract variables, a framework from Articulatory Phonology (Browman & Goldstein, 1992). Tract variables describe articulator movements at 100 frames per second (every 10 milliseconds).
The models currently available are based on the Siriwardena speech inversion system (Siriwardena & Espy-Wilson, ICASSP 2023; Siriwardena, Sivaraman, & Espy-Wilson, 2022), trained on adult speaker articulography data from the Wisconsin Xray Microbeam Dataset. Additional models are in development and will be added to the service over time, including models trained on child speech and models with phonetically normalized tract variables.
Who Is This For?
This service is designed for speech researchers who want acoustic-to-articulatory speech inversion estimates for research purposes.
If you use data from clinical populations (e.g., children with speech sound disorders), please ensure your IRB protocol covers secondary data use.
What You Will Receive
For each .wav file you submit, you will receive one CSV file (openable in Excel, MATLAB, R, SPSS, or Python) containing tract variable estimates over time.
Example output (XRMB model, first 3 rows):
| frame | time_s | LA | LP | TTCL | TTCD | TBCL | TBCD |
|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 0.2342 | −0.1180 | 0.4453 | −0.3321 | 0.1121 | −0.5679 |
| 1 | 0.010 | 0.2410 | −0.1124 | 0.4390 | −0.3279 | 0.1183 | −0.5609 |
| 2 | 0.020 | 0.2489 | −0.1063 | 0.4321 | −0.3231 | 0.1248 | −0.5534 |
Column definitions:
| Column | Full Name | Description |
|---|---|---|
frame | Frame index | Time step (0-indexed, 100 Hz) |
time_s | Time in seconds | Frame 0 = 0.000 s, Frame 1 = 0.010 s, etc. |
LA | Lip Aperture | Vertical distance between upper and lower lip |
LP | Lip Protrusion | Forward extension of the lips |
TTCL | Tongue Tip Constriction Location | How far forward/back the tongue tip constriction is |
TTCD | Tongue Tip Constriction Degree | How close the tongue tip is to the palate |
TBCL | Tongue Body Constriction Location | How far forward/back the tongue body constriction is |
TBCD | Tongue Body Constriction Degree | How close the tongue body is to the palate |
The HPRC model additionally estimates:
| Column | Full Name | Description |
|---|---|---|
PER | Periodicity | Degree of periodic (voiced) voicing |
APER | Aperiodicity | Degree of aperiodic (noisy) voicing |
F0 | Fundamental Frequency | Estimated pitch in Hz |
phone_label | Phoneme label | Predicted phoneme symbol for each frame |
You will also receive a run_metadata.json file documenting the model used, processing results, and full citation information for your records.
Before You Begin
Audio file requirements
- Format: WAV files only (
.wav) - Sample rate: Any — files are automatically converted to 16 kHz
- Duration: At least 100 milliseconds per file; no maximum
- Content: Any speech is acceptable. The service does not reject recordings based on speaker characteristics, dialect, disorder type, or recording quality.
- Channels: Mono preferred; stereo files are automatically converted to mono
What to avoid
- Non-speech audio (music, noise-only recordings) will produce output but the estimates will not be meaningful
- Files shorter than 100 ms will be rejected with an error message
- Files in formats other than
.wav(e.g.,.mp3,.m4a) will be rejected — please convert them first
How to Submit
Step 1 — Prepare your audio files
Collect all .wav files for a single batch into one folder, then compress it into a ZIP file.
On Windows: Right-click the folder → Send to → Compressed (zipped) folder On Mac: Right-click the folder → Compress
The ZIP file can contain up to several hundred files. Each is processed separately.
Step 2 — Open the submission form
Go to: https://redcap.link/si-service
Step 3 — Complete the form
The form will ask for:
- Your email address — results will be sent here
- Your ZIP file — the compressed folder of
.wavfiles - Model selection — which speech inversion model to use (see below)
- Password - a password generated for you by Nina R Benway.
Once submitted, you will receive a confirmation with your Job ID. Keep this for your records.
Step 4 — Wait for your results
You will receive an email when results are ready. The email contains a download link valid for 7 days, after which files are automatically deleted per our data retention policy.
Choosing a Model
| Model | Best for | Output columns |
|---|---|---|
| XRMB (recommended for most users) | Adult speech; general articulation research | frame, time_s, LA, LP, TTCL, TTCD, TBCL, TBCD |
| HPRC | Research requiring voicing/pitch estimates or phoneme labels | All XRMB columns + PER, APER, F0, phone_label |
Not sure which to choose? Select XRMB. It was trained on a larger dataset (46 adult speakers from the Wisconsin X-Ray Microbeam Database) and provides the six core Articulatory Phonology tract variables.
New models are coming soon.
Data Privacy and HIPAA
- Audio files are encrypted in transit (HTTPS) and at rest (AES-256)
- Audio files are automatically deleted after 7 days. CSV output files (numbers only, no audio) may be retained longer
- Only the submitting researcher receives the download link
- Server logs do not record filenames or email addresses — only an anonymous Job ID
- This service is hosted on University of Maryland infrastructure
If your audio contains recordings from human participants: Please ensure your IRB protocol covers submission of de-identified audio to a third-party analysis service.
Troubleshooting
I received an error for one of my files. The results email will specify which file(s) failed and what to fix. Other files in your batch are still processed. Common causes:
- The file is not a valid WAV — try re-exporting from your recording software
- The file is shorter than 100 ms
- The file contains no audio signal (silence only)
My download link has expired. Links are valid for 7 days. Please resubmit your files.
The results don’t look right. Tract variable estimates are most reliable for clearly recorded speech at a comfortable speaking rate. Very fast speech, overlapping speakers, or noisy recordings may produce less reliable estimates. This is a model limitation, not an error.
I need help interpreting the output. See the references below.
Citation
If you use this service in published research, please cite:
Benway, N. R. (in preparation). Speech Inversion Service [Software]. University of Maryland.
And the underlying model:
Siriwardena, L., Sivaraman, G., & Espy-Wilson, C. (2022). Multitask learning for acoustic-to-articulatory speech inversion. arXiv:2205.13755. Siriwardena, L., & Espy-Wilson, C. (2023). Acoustic-to-articulatory speech inversion using self-supervised speech representations. ICASSP 2023.
References
- Browman, C. P., & Goldstein, L. (1992). Articulatory Phonology: An overview. Phonetica, 49(3–4), 155–180.
- Chen, S., Wang, C., Chen, Z., et al. (2022). WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505–1518.
Contact
Nina R. Benway, PhD CCC-SLP Postdoctoral Fellow, Electrical and Computer Engineering University of Maryland, College Park
This service is available for research use, with funding for ~ 4 years following initial deployment.
