Project Details
Description
The Iroquoian languages of the Six Nations Confederacy, or the Haudenosaunee people, were first encountered when the explorer Jacques Cartier sailed up the St. Lawrence River in the 1530s. Dictionaries and grammars exist, but the basic elements of documentation for every language should also include annotated texts of diverse genres. As recognized in the Native American Languages Act, passed by the U.S. Congress in 1990, languages spoken by the indigenous peoples of North America have a unique status and importance. This project will bring together members of the Seneca Nation of Indians with a team of linguistics and computing researchers to record elders speaking Seneca, an Iroquoian language that is particularly endangered. The team will develop software to accurately and efficiently transcribe these recordings using automatic speech recognition (ASR), the technology behind digital personal assistants like Siri or Alexa. Seneca has an exceedingly complex word structure, known as polysynthesis, in which a word is equivalent to a clause or sentence. Such languages challenge ASR systems, which are generally designed to recognize words over a constrained vocabulary. This project will advance scientific knowledge by developing novel methods for generating synthetic text data to augment the existing written resources required to model this complexity. Broader impacts include the availability of the newly documented materials for language revitalization and scientific investigation. The project will provide undergraduates, graduate students, and young adults from the Seneca Nation with valuable STEM experience and will broaden the participation of Native Americans in the language and computing sciences, including by supporting a Seneca doctoral student in computer science.
The computational tools and methodologies developed will be accessible to others who are working to document and analyze low-resource languages, many spoken in regions of critical importance for national security.
Spontaneous speech in Seneca contains long, complex words but also many short particles that are essential to understanding the discourse. Crucial for segmenting and annotating spoken Seneca are the prosodic patterns that occur in longer utterances, involving both metrical and tonal components. Most ASR frameworks would be challenged by the large vocabulary size that a polysynthetic morphological system tends to yield. In addition, ASR systems do not typically model high-level prosodic information. Seneca has little available text data derived from spontaneous speech, which is needed to build the predictive language models used in ASR and is invaluable to Seneca learners. Augmenting the available text data will require novel techniques for generating synthetic but plausible text, with a particular focus on neural sequence-to-sequence models. The ability of neural nets to model long-distance and hierarchical relationships will also be exploited to capture utterance-level prosodic patterns required for accurate segmentation of spontaneous speech in Seneca. By bringing together a range of expertise and by involving Seneca community members, key stakeholders in the language, the project bridges traditional linguistic methodology and computational approaches. Each new Seneca recording that is transcribed and annotated through this collaboration across disciplines will support the revitalization of the Seneca language and help to further the state of the art in low-resource language technology.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
| Status | Finished |
|---|---|
| Effective start/end date | 06/01/18 → 05/31/22 |
Funding
- National Science Foundation: $90,367.00