Enabling Information Retrieval from Conversational Speech Archives via Crowdsourcing

This is an ongoing project.

We are now reviewing applications received for the Spring 2015 position; no further applications are requested at this time.

You will have the opportunity to contribute to active research in our lab. The lab typically includes 4-5 graduate students to interact with, and a shared lab space to optionally work in.

This project was featured in WIRED: http://www.wired.com/dangerroom/2013/03/darpa-speech

Advances in capture and storage technology now let us archive massive amounts of spontaneous (conversational) speech data. However, effective use of this data requires accurate information retrieval technology designed for and evaluated on spontaneous speech data. Unfortunately, traditional practice for benchmarking search engine accuracy cannot scale to “big data”, especially for conversational speech. This restricts our ability to even measure the effectiveness of existing search engines, much less further advance them.

Evaluating search over spontaneous speech archives is particularly challenging compared with more traditional text collections. Unlike text, speech must first be transcribed. While transcripts of prepared speech (e.g. broadcast news) are very readable, spontaneous speech transcripts are often very difficult to read, even with perfect transcription, due to "disfluency" (self-corrections, trailing off, interruptions, etc.) and the lack of commas and sentence boundaries. Human editing to correct this would require even greater manual effort. As a result, few spontaneous speech IR test collections exist today.

We are investigating the use of nascent crowdsourcing (crowd computing) techniques in concert with "rich transcription" technology. While crowdsourcing offers tremendous potential for time and cost savings, how to achieve these savings without compromising quality remains an open research problem. As a case study, we are investigating search of spontaneous speech interviews with Holocaust eyewitnesses collected by the Shoah Foundation.

For example interviews, see: https://sfi.usc.edu/clipviewer


Programming knowledge is required; please let me know your background in computer science and math/statistics.


You will work with graduate students and/or the professor to write algorithms, set up and run experiments, and analyze results. You are expected to maintain weekly contact. Hours can vary depending on your available time, but you must commit to at least one semester of work to justify our time teaching and training you. The degree of supervision and the sophistication of tasks will be tailored to your prior skill and experience. You will be expected to document your research activities in writing, with the opportunity to publish a technical report or research at the end.


The Office of Undergraduate Research recommends that you attend an info session or advising before contacting faculty members or project contacts about research opportunities. We'll cover the steps to get involved, tips for contacting faculty, funding possibilities, and options for course credit. Once you have attended an Office of Undergraduate Research info session or spoken to an advisor, you can use the "Who to contact" details for this project to get in touch with the project leader and express your interest in getting involved.