REUTERS FILE PIC, FOR ILLUSTRATION PURPOSE ONLY.

Like millions worldwide, Southeast Asians have been trying out large language models such as Meta's Llama 2 and Mistral AI — but in their native Bahasa Indonesia or Thai. The result has usually been gibberish in English.

This leaves them at a disadvantage, tech experts warn, as generative artificial intelligence (AI) transforms education, work and governance worldwide.

A Singapore government-led initiative aims to correct the imbalance with a Southeast Asian LLM, the first in a family of models named SEA-LION — Southeast Asian Languages in One Network — trained in the region's languages and cultural norms.

Trained on data in 11 Southeast Asian languages including Vietnamese, Thai and Bahasa Indonesia, the open-sourced model is a cheaper and more efficient option for the region's businesses, governments and academia, said Leslie Teo at AI Singapore.

"We are not trying to compete with the big LLMs; we are trying to complement them, so there can be better representation of us," said Teo, senior director for AI products.

There are over 7,000 languages spoken worldwide. Yet LLMs, including OpenAI's GPT-4 and Meta's Llama 2, which are used to build AI systems such as chatbots and other tools, have largely been developed for, and trained on, the English language.

Governments and tech firms are trying to bridge this gap, with India creating datasets in local languages, an LLM in the United Arab Emirates powering generative AI tools in Arabic, and AI models in China, Japan and Vietnam in local languages.

These models can help local populations participate more equitably in the global AI economy that is largely dominated by big tech firms, said Nuurrianti Jalli, an assistant professor at Oklahoma State University's school of communications.

"Less reliance on Western LLMs could provide better privacy for local populations, and also align better with national or regional interest," she said.

Multilingual language models, which are trained on text from several languages, can infer semantic and grammatical connections between high-resource languages, which have abundant data, and low-resource ones, researchers say.

These models can be used in a variety of applications, from translation and customer-service chatbots, to content moderation on social media platforms that have struggled to identify hate speech in low-resource languages, such as Burmese or Amharic.

About 13 per cent of SEA-LION's data is sourced from Southeast Asian languages — more than any other major LLM, said Teo. More than nine per cent of its data is from Chinese text, and about 63 per cent from English.

Multilingual language models often train on translated text and other poor-quality data that may contain errors, so AI Singapore is "careful" about the data used to train SEA-LION, Teo said.

At Indonesian e-commerce company Tokopedia, the majority of customer interactions are in Bahasa Indonesia, so models "with that local fluency will enhance our ability to connect with customers and improve their experiences," said Paul Condylis, Tokopedia's associate vice-president of data science.

As more countries and regions build their own LLMs, digital and human rights experts fret that they will reproduce only the dominant views expressed online, which can be particularly problematic in nations with authoritarian governments or strict media censorship, or those lacking a strong civil society.

Chinese social media platforms, for example, censor references to the Tiananmen Square uprising and criticism of the government, while several Southeast Asian nations have enacted laws to curb content that authorities deem misleading.

"Training models on such data risks perpetuating biased, prejudiced, incomplete and even misleading narratives," said Jalli.

"The models may fail to surface important socio-political issues like human rights abuse, corruption, or valid criticism of political powers," she said.

In response to a query on Indonesian former president Suharto, for example, Llama 2 and GPT-4 mentioned his spotty human rights record, while SEA-LION's response focused largely on his achievements.

But the alternative — relying entirely on Western LLMs — means perpetuating different biases related to cultural values, political beliefs and social norms, according to AI Singapore.

"We are not saying ours is the only perspective — we are just trying to rebalance it," said Teo.


* The writer is from Reuters