Speech Inversion Service — User Guide

Interpretable and Available Acoustic-to-Articulatory Speech Inversion
Funded by the ASHFoundation New Investigators Research Grant
PI: Nina R. Benway, PhD CCC-SLP · University of Maryland, College Park


What Is This Service?

The Speech Inversion Service estimates how the vocal tract is moving during speech.

These estimates are called tract variables, a framework from Articulatory Phonology (Browman & Goldstein, 1992). Tract variables describe articulator movements at 100 frames per second (every 10 milliseconds).

The models currently available are based on the Siriwardena speech inversion system (Siriwardena & Espy-Wilson, ICASSP 2023; Siriwardena, Sivaraman, & Espy-Wilson, 2022), trained on adult speaker articulography data from the Wisconsin X-ray Microbeam Database. Additional models are in development and will be added to the service over time, including models trained on child speech and models with phonetically normalized tract variables.


Who Is This For?

This service is designed for speech researchers who want acoustic-to-articulatory speech inversion estimates for research purposes.

If you use data from clinical populations (e.g., children with speech sound disorders), please ensure your IRB protocol covers secondary data use.


What You Will Receive

For each .wav file you submit, you will receive one CSV file (openable in Excel, MATLAB, R, SPSS, or Python) containing tract variable estimates over time.

Example output (XRMB model, first 3 rows):

| frame | time_s | LA | LP | TTCL | TTCD | TBCL | TBCD |
|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 0.2342 | −0.1180 | 0.4453 | −0.3321 | 0.1121 | −0.5679 |
| 1 | 0.010 | 0.2410 | −0.1124 | 0.4390 | −0.3279 | 0.1183 | −0.5609 |
| 2 | 0.020 | 0.2489 | −0.1063 | 0.4321 | −0.3231 | 0.1248 | −0.5534 |

Column definitions:

| Column | Full Name | Description |
|---|---|---|
| frame | Frame index | Time step (0-indexed, 100 Hz) |
| time_s | Time in seconds | Frame 0 = 0.000 s, Frame 1 = 0.010 s, etc. |
| LA | Lip Aperture | Vertical distance between upper and lower lip |
| LP | Lip Protrusion | Forward extension of the lips |
| TTCL | Tongue Tip Constriction Location | How far forward/back the tongue tip constriction is |
| TTCD | Tongue Tip Constriction Degree | How close the tongue tip is to the palate |
| TBCL | Tongue Body Constriction Location | How far forward/back the tongue body constriction is |
| TBCD | Tongue Body Constriction Degree | How close the tongue body is to the palate |

The HPRC model additionally estimates:

| Column | Full Name | Description |
|---|---|---|
| PER | Periodicity | Degree of periodic (voiced) energy in the signal |
| APER | Aperiodicity | Degree of aperiodic (noise-like) energy in the signal |
| F0 | Fundamental Frequency | Estimated pitch in Hz |
| phone_label | Phoneme label | Predicted phoneme symbol for each frame |

You will also receive a run_metadata.json file documenting the model used, processing results, and full citation information for your records.
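The output CSV can be parsed with Python's standard library alone; the sketch below is illustrative (the helper name `load_tract_variables` is not part of the service), using the three example frames shown above:

```python
import csv
import io

def load_tract_variables(csv_text):
    """Parse a Speech Inversion Service CSV into a list of row dicts,
    converting frame to int and every other column to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        parsed = {"frame": int(row["frame"])}
        for key, value in row.items():
            if key != "frame":
                parsed[key] = float(value)
        rows.append(parsed)
    return rows

# The three XRMB example frames from the table above.
sample = """frame,time_s,LA,LP,TTCL,TTCD,TBCL,TBCD
0,0.000,0.2342,-0.1180,0.4453,-0.3321,0.1121,-0.5679
1,0.010,0.2410,-0.1124,0.4390,-0.3279,0.1183,-0.5609
2,0.020,0.2489,-0.1063,0.4321,-0.3231,0.1248,-0.5534
"""
frames = load_tract_variables(sample)
mean_la = sum(f["LA"] for f in frames) / len(frames)
```

The same pattern works for HPRC output; the extra `phone_label` column would simply stay a string.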


Before You Begin

Audio file requirements

  • Format: WAV files only (.wav)
  • Sample rate: Any — files are automatically converted to 16 kHz
  • Duration: At least 100 milliseconds per file; no maximum
  • Content: Any speech is acceptable. The service does not reject recordings based on speaker characteristics, dialect, disorder type, or recording quality.
  • Channels: Mono preferred; stereo files are automatically converted to mono

What to avoid

  • Non-speech audio (music, noise-only recordings) will produce output, but the estimates will not be meaningful
  • Files shorter than 100 ms will be rejected with an error message
  • Files in formats other than .wav (e.g., .mp3, .m4a) will be rejected — please convert them first
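If you want to catch problem files before submitting, the requirements above can be pre-checked with Python's standard library. This is an illustrative sketch (the function `check_wav` is not part of the service) that mirrors the stated rules: a valid .wav container at least 100 ms long.

```python
import os
import tempfile
import wave

MIN_DURATION_S = 0.1  # the service rejects files shorter than 100 ms

def check_wav(path):
    """Return (ok, message) for one candidate file."""
    if not path.lower().endswith(".wav"):
        return False, "not a .wav file (convert it first)"
    try:
        with wave.open(path, "rb") as f:
            rate = f.getframerate()
            duration = f.getnframes() / rate
    except (wave.Error, EOFError):
        return False, "not a valid WAV (re-export from your recording software)"
    if duration < MIN_DURATION_S:
        return False, f"too short: {duration * 1000:.0f} ms (minimum 100 ms)"
    return True, f"ok: {duration:.2f} s at {rate} Hz"

# Demo: write 200 ms of 16 kHz mono silence and check it.
demo = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 3200)  # 3200 frames / 16000 Hz = 0.2 s
ok, message = check_wav(demo)
```

Note this only checks the container and duration; it cannot detect silence-only recordings.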

How to Submit

Step 1 — Prepare your audio files

Collect all .wav files for a single batch into one folder, then compress it into a ZIP file.

On Windows: Right-click the folder → Send to → Compressed (zipped) folder
On Mac: Right-click the folder → Compress

The ZIP file can contain up to several hundred files. Each is processed separately.
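If you prefer to script this step, a batch folder can be bundled with Python's standard library. A minimal sketch (the helper `zip_wav_folder` is illustrative, not part of the service) that includes only .wav files so the archive contains nothing the service would reject on format grounds:

```python
import os
import tempfile
import zipfile
from pathlib import Path

def zip_wav_folder(folder, zip_path):
    """Bundle every .wav file in `folder` (flat, no subfolders) into a
    ZIP archive ready for the submission form; returns the file count."""
    wavs = sorted(Path(folder).glob("*.wav"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for wav in wavs:
            zf.write(wav, arcname=wav.name)  # store by filename only
    return len(wavs)

# Demo: two .wav files and one stray .txt; only the .wav files are zipped.
folder = tempfile.mkdtemp()
for name in ("a.wav", "b.wav", "notes.txt"):
    Path(folder, name).write_bytes(b"")
archive = os.path.join(folder, "batch.zip")
count = zip_wav_folder(folder, archive)
```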

Step 2 — Open the submission form

Go to: https://redcap.link/si-service

Step 3 — Complete the form

The form will ask for:

  1. Your email address — results will be sent here
  2. Your ZIP file — the compressed folder of .wav files
  3. Model selection — which speech inversion model to use (see below)
  4. Password — a password generated for you by Nina R. Benway.

Once submitted, you will receive a confirmation with your Job ID. Keep this for your records.

Step 4 — Wait for your results

You will receive an email when results are ready. The email contains a download link valid for 7 days, after which files are automatically deleted per our data retention policy.


Choosing a Model

| Model | Best for | Output columns |
|---|---|---|
| XRMB (recommended for most users) | Adult speech; general articulation research | frame, time_s, LA, LP, TTCL, TTCD, TBCL, TBCD |
| HPRC | Research requiring voicing/pitch estimates or phoneme labels | All XRMB columns + PER, APER, F0, phone_label |

Not sure which to choose? Select XRMB. It was trained on a larger dataset (46 adult speakers from the Wisconsin X-ray Microbeam Database) and provides the six core Articulatory Phonology tract variables.

New models are coming soon.


Data Privacy and HIPAA

  • Audio files are encrypted in transit (HTTPS) and at rest (AES-256)
  • Audio files are automatically deleted after 7 days. CSV output files (numbers only, no audio) may be retained longer
  • Only the submitting researcher receives the download link
  • Server logs do not record filenames or email addresses — only an anonymous Job ID
  • This service is hosted on University of Maryland infrastructure

If your audio contains recordings from human participants: Please ensure your IRB protocol covers submission of de-identified audio to a third-party analysis service.


Troubleshooting

I received an error for one of my files. The results email will specify which file(s) failed and what to fix. Other files in your batch are still processed. Common causes:

  • The file is not a valid WAV — try re-exporting from your recording software
  • The file is shorter than 100 ms
  • The file contains no audio signal (silence only)

My download link has expired. Links are valid for 7 days. Please resubmit your files.

The results don’t look right. Tract variable estimates are most reliable for clearly recorded speech at a comfortable speaking rate. Very fast speech, overlapping speakers, or noisy recordings may produce less reliable estimates. This is a model limitation, not an error.

I need help interpreting the output. See the references below.


Citation

If you use this service in published research, please cite:

Benway, N. R. (in preparation). Speech Inversion Service [Software]. University of Maryland.

And the underlying model:

Siriwardena, L., Sivaraman, G., & Espy-Wilson, C. (2022). Multitask learning for acoustic-to-articulatory speech inversion. arXiv:2205.13755.

Siriwardena, L., & Espy-Wilson, C. (2023). Acoustic-to-articulatory speech inversion using self-supervised speech representations. ICASSP 2023.


References

  • Browman, C. P., & Goldstein, L. (1992). Articulatory Phonology: An overview. Phonetica, 49(3–4), 155–180.
  • Chen, S., Wang, C., Chen, Z., et al. (2022). WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505–1518.

Contact

Nina R. Benway, PhD CCC-SLP
Postdoctoral Fellow, Electrical and Computer Engineering
University of Maryland, College Park

This service is available for research use, with funding for approximately 4 years following initial deployment.