Most experts project that AI will be an important tool for Arab countries seeking to expand their tech economies. But developing AI systems that work well across the Arabic-speaking world’s many dialects is easier said than done. Where there’s a will, there’s a way, though, and major corporations around the globe are investing to prepare themselves.
While Modern Standard Arabic, or Fus’ha, is a mainstay in Arab countries and a useful baseline for AI systems, the many dialects spread across Arabic-speaking countries pose significant challenges. Arabic is spoken by 400 million people and is the official language of 20 countries, with 30 distinct dialects spread across them. According to electronics giant Samsung, that variety complicates the development of AI systems that can both properly understand commands through Automatic Speech Recognition (ASR) and reply coherently within the language family.
“Unlike other languages, the pronunciation of the object in Arabic varies depending on the subject and verb in the sentence,” said Mohammad Hamdan, Samsung project leader of the Arabic language development team, in a company blog. “Our goal is to develop a model that understands all these dialects and can answer in standard Arabic.”
Given the importance of maintaining robust datasets for the large language models that power AI systems, Samsung’s Arabic language team faced a unique challenge related to diacritics. These marks, which guide pronunciation in books, poetry, and religious texts, are ever-present in Arabic life but difficult to translate into the digital language an AI system requires.
“There is a shortage of high-quality and reliable datasets that accurately represent how diacritics are correctly used,” said Haweeleh in the Samsung blog. “We had to design a neural model that can predict and restore those missing diacritics with high accuracy.”
Understanding the various dialects was another challenge, one the team tackled by collecting and transcribing a host of audio sources with an emphasis on clarifying unique sounds, words, and phrases.
“Building an ASR system that supports multiple dialects in a single model is a complex undertaking,” said Mohammad Hamdan, Samsung ASR lead for the project, in the blog. “It demands a thorough understanding of the language’s intricacies, careful data selection and advanced modeling techniques.”
According to Arab News, another central concern in developing Arabic AI systems is ensuring that they comport with regional cultural, social, and ethical values. That’s especially true in countries within the Arab world where traditions and community values can differ from those of the Western countries leading the charge on AI system development.
“Systems developed in the West or in East Asia may not fully understand or respect the cultural norms and values of the Arab world. This can result in AI behaviors or culturally insensitive or inappropriate decisions,” wrote Mohammed A. Al-Qarni, an academic and consultant on AI for business, for Arab News. “Moreover, AI applications require a deep understanding of local contexts to function effectively and ethically. Without this, systems might make decisions that overlook important regional nuances.”