We are introducing OpenAI Data Partnerships, where we’ll work together with organizations to produce public and private datasets for training AI models.
Modern AI technology learns skills and aspects of our world — of people, our motivations, interactions, and the way we communicate — by making sense of the data on which it’s trained. To ultimately make AGI that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures, and languages, which requires as broad a training dataset as possible.
Including your content can make AI models more helpful to you by increasing their understanding of your domain. We’re already working with many partners who are eager to represent data from their country or industry. For example, we recently partnered with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic by integrating their curated datasets. We also partnered with the non-profit organization Free Law Project, which aims to democratize access to legal understanding, to include their large collection of legal documents in AI training. We know there may be many more who also want to contribute to the future of AI research while discovering the potential of their unique data.
Data Partnerships are intended to enable more organizations to help steer the future of AI and benefit from models that are more useful to them, by including content they care about.
The kind of data we’re seeking
We’re interested in large-scale datasets that reflect human society and that are not already easily accessible online to the public today. We can work with any modality, including text, images, audio, or video. We’re particularly looking for data that expresses human intention (e.g. long-form writing or conversations rather than disconnected snippets), across any language, topic, and format.
We can work with data in almost any form and can use our next-generation in-house AI technology to help you digitize and structure your data. For example, we have world-class optical character recognition (OCR) technology to digitize files like PDFs, and automatic speech recognition (ASR) to transcribe spoken words. If the data needs cleaning (e.g. has lots of auto-generated artifacts or transcription errors), we can work with your team to process it into the most useful form. We are not seeking datasets with sensitive or personal information, or information that belongs to a third party; we can work with you to remove this information if you need help.
Ways to partner with us
We currently have two ways to partner, and may expand in the future:
Open-Source Archive: We’re seeking partners to help us create an open-source dataset for training language models. This dataset would be public for anyone to use in AI model training. We would also explore using it to safely train additional open-source models ourselves. We believe open-source plays an important role in the ecosystem.
Private Datasets: We are also preparing private datasets for training proprietary AI models, including our foundation models and fine-tuned and custom models. If you have data you wish to keep private, but you would like our AI models to have a better understanding of your domain (or you’d even just like to gauge the potential of your data to do so), this is the optimal way to partner. We’ll treat your data with the level of sensitivity and access controls that you prefer.
Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone. Together, we can move towards AGI that benefits all of humanity.