Ed Ansett, Founder and Chairman, i3 Solutions Group
So, what's going on to make AI so power hungry?
Is it the data set, i.e. the volume of data? The number of parameters used? The transformer model? The encoding, decoding and fine-tuning? The processing time? The answer is, of course, a combination of all of the above.
Data
It is often said that GenAI Large Language Models (LLMs) and Natural Language Processing (NLP) require large amounts of training data. However, measured in terms of traditional data storage, this is not actually the case.
For example, ChatGPT used www.commoncrawl.com data. Common Crawl says of itself that it is the primary training corpus in every LLM and that it supplied 82% of the raw tokens used to train GPT-3: “We make wholesale extraction, transformation and analysis of open web data accessible to researchers... Over 250 billion pages spanning 16 years. 3 – 5 billion new pages added each month.”
It is thought that ChatGPT-3 was trained on 45 Terabytes of Common Crawl plaintext, filtered down to 570GB of text data. The corpus is hosted on AWS for free as Common Crawl's contribution to open source AI data.
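To make that filtering step concrete, the short Python sketch below shows the kind of deduplication and crude quality screening that shrinks tens of terabytes of raw crawled plaintext down to a far smaller training corpus. It is an illustration written for this article, not Common Crawl's or OpenAI's actual pipeline; the file layout, names and thresholds are assumptions.

```python
import hashlib
from pathlib import Path

# Illustrative quality thresholds - real LLM pipelines use far more
# sophisticated classifiers, language identification and fuzzy deduplication.
MIN_CHARS = 500          # drop very short pages
MAX_SYMBOL_RATIO = 0.3   # drop pages that are mostly markup or symbols

def looks_like_prose(text: str) -> bool:
    """Crude quality check: long enough and mostly alphanumeric."""
    if len(text) < MIN_CHARS:
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / len(text) <= MAX_SYMBOL_RATIO

def filter_crawl(raw_dir: str, out_path: str) -> None:
    """Keep one copy of each acceptable page from a directory of extracted
    plaintext files (hypothetical layout: one page per .txt file)."""
    seen_hashes = set()
    kept = dropped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for page in Path(raw_dir).glob("*.txt"):
            text = page.read_text(encoding="utf-8", errors="ignore")
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes or not looks_like_prose(text):
                dropped += 1
                continue
            seen_hashes.add(digest)
            out.write(text + "\n")
            kept += 1
    print(f"kept {kept} pages, dropped {dropped}")

if __name__ == "__main__":
    filter_crawl("raw_crawl_text", "filtered_corpus.txt")
```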
But storage volumes, the billions of web pages or data tokens that are scraped from the Web, Wikipedia and elsewhere, then encoded, decoded and fine-tuned to train ChatGPT and other models, should have no major impact on a data centre.
Similarly, the terabytes or petabytes of data needed to train a text-to-speech, text-to-image or text-to-video model should put no extraordinary strain on the power and cooling systems of a data centre built for hosting IT equipment that stores and processes hundreds or thousands of petabytes of data.
An example on the text-to-image side is LAION (Large-scale AI Open Network), a German project that publishes open data sets containing billions of images. One of these, known as LAION-400M, is a 10TB web data set. Another, LAION-5B, has 5.85 billion CLIP-filtered text-image pairs.
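“CLIP-filtered” means a pair is only kept if a CLIP model judges the image and its caption to be semantically similar. The sketch below illustrates that thresholding step, assuming the image and text embeddings have already been produced by a CLIP encoder; the threshold value and array names are illustrative, not LAION's published settings.

```python
import numpy as np

SIM_THRESHOLD = 0.3  # illustrative cut-off, not LAION's exact published value

def clip_filter(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Return indices of pairs whose cosine similarity clears the threshold."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = np.sum(img * txt, axis=1)  # cosine similarity per image-caption pair
    return np.nonzero(sims >= SIM_THRESHOLD)[0]

# Example with random stand-in embeddings. Real embeddings come from a CLIP
# model; random ones mostly score near zero, so most pairs get dropped.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 512))
text_embs = rng.normal(size=(1000, 512))
kept = clip_filter(image_embs, text_embs)
print(f"kept {len(kept)} of 1000 candidate pairs")
```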
One reason that training data volumes remain a manageable size is that it's been the fashion amongst the majority of AI model builders to use Pre-Trained Models (PTMs), instead of models trained from scratch. Two examples of PTMs that are becoming familiar are Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) series – as in ChatGPT.
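As an illustration of what starting from a PTM looks like in practice, the sketch below loads a pre-trained BERT checkpoint with the open source Hugging Face transformers library and prepares it for fine-tuning rather than training the network from scratch. The library, model name and example text are choices made for this sketch, not a statement of how any particular model builder works.

```python
# A minimal sketch of reusing a pre-trained model (PTM) instead of training
# from scratch, using the Hugging Face transformers library.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # a publicly available pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The pre-trained weights arrive ready-made; fine-tuning only has to adapt
# them with a comparatively small, task-specific data set.
inputs = tokenizer("Data centres power the AI boom.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2): one score per label

# Parameter count - the measurement discussed in the next section.
print(sum(p.numel() for p in model.parameters()))  # roughly 110 million for BERT-base
```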
Parameters
Another measurement of AI training that is of interest to data centre operators is the number of parameters. AI parameters are used by Generative AI models during training – the greater the number of parameters,