CRI TSING'S TECH Pre-Sales Director Zhao Zhongshan (肇中山): High-Quality, Large-Scale, Diverse Data Powers AI Large-Model Training

Author: CRI TSING'S TECH. Published: 2023-07-18

An AI large model is the product of combining big data, large-scale computing power, and strong algorithms; it acts as an implicit knowledge base that distills the essence of that data. The term carries two meanings, "pre-training" and "large model": once a model has been pre-trained on a large-scale dataset, it can directly support a wide range of applications with no fine-tuning, or with fine-tuning on only a small amount of data. AI large models have very high computational and storage requirements, and training and deploying them takes extremely powerful computing hardware and efficient algorithms. Their parameter counts typically reach billions or even hundreds of billions.
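To give a rough sense of the storage requirement, the memory needed just to hold a model's weights can be estimated as the parameter count multiplied by the bytes used per parameter. The sketch below assumes 16-bit weights (2 bytes each) and counts only the weights; the function name `weights_memory_gb` is hypothetical, and real training needs considerably more memory for gradients, optimizer state, and activations.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 10^9 bytes).

    Assumption: every parameter is stored at the same precision
    (2 bytes for 16-bit, 4 bytes for 32-bit).
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Parameter counts spanning the "billions to hundreds of billions" range.
    for n in (1e9, 1e10, 1e11):
        print(f"{n:.0e} parameters: about {weights_memory_gb(n):,.0f} GB at 16-bit precision")
```

Even under these simplified assumptions, a model with hundreds of billions of parameters does not fit in the memory of a single commodity accelerator, which is one reason such models call for large computing clusters.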

AI large models need high-quality, large-scale, and diverse datasets. High-quality datasets improve a model's accuracy and interpretability and shorten the time it takes to converge to an optimal solution, which means shorter training runs. In "Scaling Laws for Neural Language Models", OpenAI describes the scaling laws that large language models (LLMs) follow: increasing the amount of training data, the number of model parameters, or the training time, each on its own, steadily improves the pre-trained model, as long as the other factors do not become the bottleneck. Data diversity, meanwhile, improves the model's ability to generalize, whereas overly homogeneous data makes it very easy for the model to overfit the training data.
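The scaling-law idea can be illustrated with a simple power law of the form L(X) = (X_c / X)^alpha, where X is the quantity being scaled (for example, the number of training tokens or parameters) and L is the test loss. The sketch below is only an illustration of that functional form: the constants `X_C` and `ALPHA` are placeholder values chosen for the example, not figures taken from the paper, and the function name `power_law_loss` is hypothetical.

```python
def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Loss predicted by a power law L(x) = (x_c / x) ** alpha.

    x     -- the resource being scaled (e.g. dataset size or parameter count)
    x_c   -- a fitted scale constant (placeholder value in this sketch)
    alpha -- a fitted exponent (placeholder value in this sketch)
    """
    return (x_c / x) ** alpha


# Placeholder constants for illustration only (assumptions, not measured values).
X_C = 1e13
ALPHA = 0.08

if __name__ == "__main__":
    previous = None
    for x in (1e9, 2e9, 4e9, 8e9):
        loss = power_law_loss(x, X_C, ALPHA)
        note = "" if previous is None else f" (ratio to previous: {loss / previous:.4f})"
        print(f"x = {x:.0e}: predicted loss {loss:.4f}{note}")
        previous = loss
```

Under a power law, each doubling of the scaled resource multiplies the predicted loss by the same factor, 2 ** (-alpha). The loss keeps falling, but every further improvement costs proportionally more data or parameters, which is why large, high-quality datasets matter at scale.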

As a data technology services company focused on the research and application of privacy computing, CRI TSING'S TECH has built up extensive data service experience across many fields. Drawing on its full-stack privacy computing technology services, digital resource management solutions, data systems, and data services, it links the upstream and downstream of the data-element industry chain, including the Internet, social media, and public databases, to build high-quality datasets that meet the needs of a wide range of training scenarios.

CRI TSING'S TECH is also a data service provider for the Fujian Big Data Exchange, the East China Jiangsu Big Data Exchange Center, the Guiyang Big Data Exchange, the Western Data Exchange Center, and Shenzhen Data Exchange Co., Ltd. (Open Islands). In addition, it is a member of the Beijing International Data Exchange Alliance, the Big Data Technology Standards Promotion Committee, the Privacy Computing Alliance of the China Academy of Information and Communications Technology (CAICT), and CAICT's "Data Security Promotion Plan".

Looking ahead, as regions across China actively build data exchanges, data is expected to circulate freely among industries and enterprises, easing the domestic shortage of high-quality datasets. CRI TSING'S TECH will continue to help energize the data trading and circulation market, offer a more diverse range of data products, and promote the development of datasets for AI large models in China.

If you need data for training AI models, please email hz@cri-tsing.com; we will reply within 24 hours.