Original title:
Polyglot machines
Why AI needs to learn new languages
Efforts are under way to make AI fluent in more than just English
[Paragraph 1]
ChatGPT, a chatbot developed by OpenAI, an American firm, can give passable answers to questions on everything from nuclear engineering to Stoic philosophy.
Or at least, it can in English. The latest version, ChatGPT-4, scored 85% on a common question-and-answer test.
In other languages it is less impressive. When taking the test in Telugu, an Indian language spoken by nearly 100m people, for instance, it scored just 62%.
[Paragraph 2]
OpenAI has not revealed much about how ChatGPT-4 was built. But a look at its predecessor, ChatGPT-3, is suggestive.
Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca. Around 93% of ChatGPT-3’s training data was in English.
In Common Crawl, just one of the datasets on which the model was trained, English makes up 47% of the corpus, with other (mostly related) European languages accounting for 38% more.
Chinese and Japanese combined, by contrast, made up just 9%. Telugu was not even a rounding error.
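The Common Crawl figures quoted above can be added up to see how little room is left for every other language. The following is only a rough sketch: it assumes the three groups named in the text do not overlap and that the percentages are rounded, and the variable names are illustrative.

```python
# Shares of the Common Crawl corpus as quoted in the text, in percent.
# Assumption: the three groups are disjoint and the figures are rounded.
shares = {
    "English": 47,
    "other European languages": 38,
    "Chinese and Japanese": 9,
}

# Share of the corpus accounted for by the groups named in the text.
covered = sum(shares.values())

# Whatever is left must be split among all remaining languages, Telugu included.
remainder = 100 - covered

print(f"Named groups: {covered}%  All other languages together: {remainder}%")
```

Under these assumptions, all the languages not named in the text would share only a few percent of the corpus between them.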
[Paragraph 3]
An evaluation by Nathaniel Robinson, a researcher at Johns Hopkins University, and his colleagues finds that is not a problem limited to ChatGPT.
All LLMs fare better with “high-resource” languages, for which training data are plentiful, than for “low-resource” ones for which they are scarce.
That is a problem for those hoping to export AI to poor countries, in the hope it might improve everything from schools to health care.
Researchers around the world are therefore working to make AI more multilingual.
[Paragraph 4]
India’s government is particularly keen. Many of its public services are already digitised, and it is keen to fortify them with AI.
In September, for instance, it launched a chatbot to help farmers get information about state benefits.
[Paragraph 5]
The bot works by welding two sorts of language model together, says Shankar Maruwada of the EkStep Foundation, a non-profit that helped build it.
Users can submit queries in their native tongues. (Eight are supported so far; five more are coming soon.)
These are passed to a piece of machine-translation software developed at IIT Madras, an Indian academic institution, which translates them into English.
The English version of the question is then fed to the LLM, and its response translated back into the user’s mother tongue.
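The translate, ask, translate-back flow described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual system built in India: the function names, the `translate(text, source_lang, target_lang)` and `llm(prompt)` signatures, and the toy stand-ins are all hypothetical.

```python
from typing import Callable

# Hypothetical signature: translate(text, source_lang, target_lang) -> str
Translator = Callable[[str, str, str], str]
# Hypothetical signature: llm(prompt) -> str
LanguageModel = Callable[[str], str]


def answer_in_native_language(
    query: str,
    user_lang: str,
    translate: Translator,
    llm: LanguageModel,
    pivot_lang: str = "en",
) -> str:
    """Translate the query into the pivot language, ask the LLM, translate back."""
    if user_lang == pivot_lang:
        # No translation needed when the user already writes in the pivot language.
        return llm(query)
    english_query = translate(query, user_lang, pivot_lang)
    english_answer = llm(english_query)
    return translate(english_answer, pivot_lang, user_lang)


# Toy stand-ins so the sketch runs on its own. A real deployment would call
# a machine-translation model and an LLM here instead.
def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


def toy_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


# "te" is the ISO 639-1 code for Telugu; the query text is a placeholder.
print(answer_in_native_language("query text", "te", toy_translate, toy_llm))
```

Passing the translator and the model in as plain callables keeps the two components separate, which mirrors the "welding two sorts of language model together" idea: either part could be swapped out without touching the other.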
[Paragraph 6]
The system seems to work. But translating queries into an LLM’s preferred language is a rather clumsy workaround.
After all, language is a vehicle for worldviews and culture as well as just meaning, notes the boss of one Indian AI firm.
A paper by Rebecca Johnson, a researcher at the University of Sydney, published in 2022, found that ChatGPT-3 gave replies on topics such as gun control and refugee policy that aligned most with the values displayed by Americans in the World Values Survey, a global questionnaire of public opinion.
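Source: TE, Science & technology section, 27 January 2024
Close-reading notes from: 自由英语之路
Translated and compiled by: Irene
Edited and proofread by: Irene
For personal English study and exchange only.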
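[Supplementary material] (from the internet)
Stoic philosophy is a school of Greek philosophy. Founded in Athens around 300 BC, it flourished in Greece and Rome until about the third century AD. It is chiefly a philosophy of personal ethics, supported by its own system of logic and a set of views on the natural world. The Stoics held that human beings, as social animals, find happiness by accepting life's ups and downs and by not letting desire or fear rule them. They urged people to use reason to understand the world, to cooperate with others, and to treat others fairly and justly.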
[Key sentences] (3)
Large language models (LLMs) are trained on text scraped from the internet, on which English is the lingua franca.
All LLMs fare better with “high-resource” languages, for which training data are plentiful, than for “low-resource” ones for which they are scarce.
After all, language is a vehicle for worldviews and culture as well as just meaning, notes the boss of one Indian AI firm.