It has been more than a year since Andrej Karpathy posted his introductory LLM video.
Because the video from a year ago drew more than 2.5 million views and helped many people, this new one arrived with high expectations, and even non-experts say it is organized in an easy-to-understand way.
============================================
Deep Analysis of Large Language Models (LLMs): Understanding How ChatGPT Works
============================================
Large language models (LLMs), and ChatGPT in particular, have penetrated deep into our daily lives and are used in a wide variety of ways. To move beyond being a mere user and get the most out of LLM technology, an in-depth understanding of how this powerful tool works internally and how it is trained is essential.
1️⃣ Pre-training
The first step in building an LLM is pre-training. The key to this step is downloading and processing large amounts of text data from the Internet.
- Data Collection: Collect large quantities of diverse, high-quality documents, such as the FineWeb dataset on Hugging Face. Major LLM providers maintain similar datasets internally.
- Data Filtering: Refines raw data from sources such as Common Crawl through several steps, including URL filtering, text extraction, and language filtering.
- URL filtering: Blocks unwanted domains such as malicious websites, spam websites, marketing websites, racist websites, adult sites, etc.
- Text extraction: Removes markup, CSS, and other non-content data from the raw HTML of a web page, keeping only the page's core text.
- Language filtering: Removes data in other languages, leaving only text in the target language(s).
- Tokenization: The collected text is converted into sequences of numbers called tokens so that the neural network can process it.
- UTF-8 encoding transforms text into a computer-understandable binary form.
- A binary representation uses only two symbols, 0 and 1, so the resulting sequences are very long.
- The Byte Pair Encoding (BPE) algorithm merges frequently occurring byte sequences into new tokens, shortening the token sequences while increasing the vocabulary size.
- As an example of tokenization, "hello world" is split into two tokens, and the result can change depending on spacing or capitalization (see the tokenization sketch after this list). This tokenized data is used as the input for training the neural network.
- For example, the FineWeb dataset takes up roughly 44 terabytes of disk space and contains about 15 trillion tokens.
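As a rough illustration of how text becomes token IDs, here is a minimal sketch using the open-source tiktoken package and its GPT-2 encoding; the specific library and encoding are stand-ins for illustration, since each LLM ships its own tokenizer.

```python
# Minimal tokenization sketch (assumes the open-source `tiktoken` package is
# installed: `pip install tiktoken`). The GPT-2 encoding is used here purely
# as an illustrative BPE tokenizer; production LLMs each use their own.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "hello world"
tokens = enc.encode(text)          # text -> list of integer token IDs
print(tokens)                      # e.g. two IDs, one for "hello", one for " world"
print(enc.decode(tokens))          # IDs -> text, round-trips back to the original

# Small changes in spacing or capitalization produce different token IDs.
print(enc.encode("Hello World"))
print(enc.encode("hello  world"))  # a double space tokenizes differently
```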
Neural Network Training
The preprocessed data is now used to train the neural network. At this stage, the model learns statistical relationships about how tokens follow one another within a token sequence.
- Context Window: The model is trained on randomly selected windows of tokens from the token sequence. The window size can range from zero tokens up to a maximum (e.g., 8,000 tokens); processing long sequences is computationally expensive, which is what sets the maximum window size in practice.
- Next token prediction: The neural network is trained to predict which token is likely to come after the tokens given in the context window (the input). The model outputs a probability for every token in the vocabulary, from which the most likely next token can be read off.
- Adjusting Parameters:
- The parameters (weights) of the neural network are adjusted so that the predicted probability of the actual next token increases. This process is repeated over and over, and the model's predictions gradually improve (see the training-step sketch after this list).
- Neural networks can have billions of parameters, which are initially set entirely at random. The training process adjusts these parameters, allowing the model to generate outputs consistent with the patterns observed in the training data.
- Transformer: The Transformer is the neural network architecture used in modern LLMs; the small example network shown for illustration has only about 85,000 parameters, whereas production models have billions.
- Transformers embed the input tokens to create distributed representations, and generate outputs (next token prediction) through multiple layers.
- Transformers involve complex-looking mathematics, but the individual operations are simple. The network is a fixed mathematical expression from input to output and is stateless: it has no memory that persists between calls.
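To make the training loop concrete, here is a toy next-token-prediction step in PyTorch. A single embedding plus linear layer stands in for a real multi-layer transformer, and random token IDs stand in for real text; only the shape of the loop (predict the next token, measure the error, nudge the weights) mirrors the actual procedure.

```python
# Toy next-token-prediction training step (illustrative only: an embedding +
# linear layer stands in for a real transformer, and the data is random).
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, context = 1000, 64, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # token IDs -> vectors
        self.head = nn.Linear(d_model, vocab_size)      # vectors -> logits over vocab

    def forward(self, idx):
        return self.head(self.embed(idx))               # (batch, time, vocab_size)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# A random "document" of token IDs; each position is trained to predict
# the token that follows it, so targets are the inputs shifted by one.
seq = torch.randint(0, vocab_size, (1, context + 1))
inputs, targets = seq[:, :-1], seq[:, 1:]

logits = model(inputs)                                  # predicted scores per position
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                                         # gradients w.r.t. all parameters
optimizer.step()                                        # nudge weights toward the data
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```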
Inference
Once the training is complete, the model can be used to generate new data; this process is called inference.
- Token generation: The model takes a given token sequence as input and predicts the next token.
- The model outputs a probability distribution over the next token, and the next token is sampled from this distribution (a sampling sketch follows this list).
- Tokens with high probability are likely to be sampled.
- The model is stochastic, so it can produce different outputs each time for the same input.
- The token streams generated by the model have statistical properties similar to those in the training data, but are not completely identical to the training data.
- Base Model: The model produced by pre-training; it acts as a token simulator for Internet text.
- The base model cannot be used directly to answer questions; it simply mimics the token sequences of Internet documents.
- Base models are trained on anywhere from roughly 100 billion to 15 trillion tokens.
- OpenAI's GPT-2 is a 1.5-billion-parameter model trained on about 100 billion tokens, and Meta's Llama 3 is a 405-billion-parameter model trained on about 15 trillion tokens.
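The sampling step itself is easy to illustrate. The sketch below uses made-up logits in place of a real model's output and shows how sampling (optionally with a temperature) makes generation stochastic.

```python
# Sampling the next token from an output distribution (illustrative: the
# logits here are random numbers rather than the output of a real model).
import torch

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(vocab_size)            # stand-in for the model's raw scores

probs = torch.softmax(logits, dim=-1)       # one probability per vocabulary token

# Sampling instead of always taking the argmax is what makes generation
# stochastic: the same input can yield different continuations.
for _ in range(3):
    next_token = torch.multinomial(probs, num_samples=1).item()
    print(f"sampled token id: {next_token} (p={probs[next_token].item():.2f})")

# A temperature below 1 sharpens the distribution, above 1 flattens it.
temperature = 0.7
probs_t = torch.softmax(logits / temperature, dim=-1)
print(f"max prob before/after temperature: "
      f"{probs.max().item():.2f} / {probs_t.max().item():.2f}")
```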
Exploring and leveraging base models
- Base models such as the Llama 3 405B base model can be tried out on inference platforms such as the Hyperbolic website, or explored locally with smaller open models, as in the sketch below.
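As a hands-on alternative, a small open base model can be loaded locally. The sketch below uses GPT-2 via the Hugging Face transformers library purely as a stand-in (the Llama 3 405B base is far too large to run this way) to show that a base model merely continues text rather than answering questions.

```python
# Exploring a base model locally (a sketch; GPT-2 is used only because it is a
# small, openly downloadable base model). Requires `pip install transformers torch`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A base model just continues the text; it is not tuned to behave like an assistant.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,                        # sample from the distribution, not argmax
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,   # silence the missing-pad-token warning
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```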