Deadline: 23 August 2024
The Lacuna Fund is inviting applications to its Natural Language Processing (NLP) Program, which supports the development of open and accessible datasets for machine learning applications in NLP for low-resource languages and cultures in Africa and Latin America.
The ability to communicate and be understood in one’s own language variety and cultural context is fundamental to digital and societal inclusion. Natural language processing techniques have the potential to enable AI applications that facilitate digital inclusion and improvements in education, finance, healthcare, agriculture, communication, and responses to natural hazards, among others. Many advances in both fundamental and applied NLP have stemmed from openly licensed and publicly available datasets.
However, such datasets are scarce to non-existent for many African and Latin American languages, excluding these populations from the benefits of NLP. Many current machine learning (ML) models are trained on Anglo-centric and/or translated datasets that lack culturally relevant nuance, producing biased or unusable models for communities in Africa and Latin America. Where relevant datasets do exist, they are often based on religious or judicial texts of the past, leading to outdated language and bias. Openly accessible datasets are therefore needed to enable NLP technologies for low-resource languages in Africa and Latin America and to support the development of robust, culturally appropriate language datasets that cater to the specific needs of underrepresented communities.
Funding Information
- The total pool available is approximately $1 million USD. They would like to fund projects in each of the target regions (Africa and Latin America) and anticipate supporting 6-8 smaller projects with budgets of up to $100k USD and 2-3 larger, more complex projects with budgets ranging from $100k to $250k USD.
Need
- Lacuna Fund seeks proposals from qualified, multidisciplinary teams to develop open and accessible training and evaluation datasets for machine learning applications for NLP in low-resource languages and underrepresented cultures in Africa and Latin America.
- Proposals may include, but are not limited to:
- Collecting and/or annotating new data;
- Annotating or releasing existing data;
- Augmenting existing datasets from diverse sources to fill gaps in local ground truth data, decrease bias (such as geographic bias, gender gaps or other types of bias or discrimination), or increase the usability of data and technology related to NLP in low- and middle-income contexts;
- Linking and harmonizing existing datasets (e.g., across regions, time periods, and linguistic varieties, as well as domain-specific datasets such as historical, health, and education data).
- The Technical Advisory Panel (TAP) sees a need for training and evaluation datasets that account for the linguistic diversity and cultural nuances of Africa and Latin America. This includes datasets on regional slang, idiomatic expressions, local linguistic varieties or dialects, and culturally relevant data. Such datasets are crucial for developing more inclusive and effective natural language processing tools that can serve the unique needs of culturally diverse linguistic communities.
- They seek datasets identified by local experts and designed to address locally identified needs. The following are illustrative examples only.
- Datasets may include, but are not limited to the following:
- Labeled and unlabeled datasets for low-resource NLP tasks, supporting the development of accurate and effective machine learning models. Downstream tasks for labeled datasets might include, but are not limited to: question answering and conversational AI, sentiment analysis, social bias detection, hate speech detection and counter-speech, misinformation and disinformation detection, automatic text summarization, other natural language understanding and generation tasks, or resources to support NLP education in collaboration with communities. Unlabeled datasets include text corpora that can be used to support the training and evaluation of speech models.
- Speech corpora, including datasets that enable automatic speech recognition (ASR) and allow illiterate or otherwise underprivileged groups to access information and/or services in low-resource languages.
- Datasets for text-generation tasks, particularly tasks other than machine translation.
- Multimodal and other innovative datasets, such as video or audio captioning, visual question-answering or other image-text interactions.
- Datasets supporting knowledge-intensive tasks, such as question answering (QA) and retrieval-augmented generation (RAG).
- Dialectal variation corpora and code-switched text and speech datasets, including those that capture linguistic variation (regional slang, idiomatic expressions, culturally relevant data) in dialect-rich low-resource languages and in linguistic communities where code-switching is common.
- Creation or augmentation of domain-specific text and speech datasets, in domains such as healthcare, place names, agriculture, or education, that enable applications with significant social impact. This may include exploring generative data augmentation frameworks to incorporate domain-specialized vocabulary, semantics, morphology, and syntax.
- Datasets supporting machine learning for linguistics, for the preservation and revitalization of marginalized cultures and of aspects of underrepresented languages that these cultures consider important for their health, dignity, environment, and well-being. These datasets may include phonetic, morphological, and syntactic annotations, as well as automated tools to perform these tasks if sought by the communities involved.
- Across all datasets: gender-responsiveness and inclusion of key vulnerable groups, including bias mitigation for those living in humanitarian and conflict settings, as well as for those at the intersection of more than one socio-economic category (e.g., disability, gender, age, minority status). Please refer to the ‘Risks, including Ethics and Privacy’ paragraph in the Proposal Narrative section of this document and carefully consider the ethics of data collection.
Eligibility Criteria
- Lacuna Fund aims to make its funding accessible to as many organizations as possible in the AI-for-social-good space and to cultivate capacity among emerging organizations in the field.
- To be eligible for funding, organizations must:
- Be either a non-profit entity, research institution, for-profit social enterprise, or a team of such organizations. Individuals must apply through an institutional sponsor. Partnerships are strongly encouraged as a way to strengthen collaboration and maximize the benefits derived from the use of the datasets, but only the lead applicant will receive funds.
- Have a mission supporting societal good, broadly defined.
- Be headquartered in the country or region where data will be collected. The geographic focus of this call is Africa and Latin America. Institutions based in other countries or regions can apply as partners of the lead institution. As stated above, only the lead applicant will receive funds.
- Have all necessary national or other approvals to conduct the proposed research. The approval process may be conducted in parallel with the grant application, if necessary. Approval costs, if any, are the responsibility of the applicant.
- Have the technical capacity – or the ability to build this capacity through a partnership described in the proposal – to conduct dataset labeling, creation, aggregation, expansion, and/or maintenance, including the ability to apply best practices and established standards in the relevant domain (e.g., natural language processing) so that high-quality AI/ML analytics can be performed by multiple entities.
For more information, visit Lacuna Fund.