Build a Large Language Model (From Scratch)
I’m a lifelong learner / AI hobbyist. Technology has always fascinated me, and I can’t learn enough about it. Sebastian Raschka’s book, Build a Large Language Model (From Scratch), caught my eye. I don’t remember how I stumbled across it, but I found it while it was in early access from Manning. I bought it and started working through it while the last chapters were still being written and released. I have since finished the book and all the associated work, and loved every minute of it.
My approach
Many years ago, I read some advice about learning programming from books and tutorials. The advice was to never copy and paste the code from the samples, but to type out all the code by hand. I took that approach with this book. I typed every line of code (except for a couple of blocks that were highly repetitive). You can see all my work here: https://github.com/controbersyy187/build-rad-lenguage-model
I work best in section-sized chunks. I didn’t want to start a section unless I had time set aside to complete it. Some sections are relatively short; some are fairly long and time-consuming.
I built everything in Jupyter notebooks on my laptop, which is relatively underpowered for this type of work. The book’s premise is that you can build on consumer hardware, and that held up well. As I write this, I’m fine-tuning my model locally. It’s about 50 steps into a 230-step fine-tuning run, and I just crossed the 20-minute mark. The earlier code samples run quickly, but the last few sections use much larger models, which slows things down.
I didn’t do most of the supplemental exercises. I have an “I want to do all the things!” personality. The downside is that when I spend time doing all the things, eventually I burn out and never finish what I started. So I stuck to the main path of the book. I still took a few weeks off around Christmas and New Year, but I returned to it and pushed through the last few chapters.
So, more or less, I read all the chapters and typed out all of the required code.
Learnings
What can I tell you about large language models? More than I could before I started this book, but certainly not everything the author covers. I’ll summarize my understanding, but I’m probably wrong about some of these things, and I’ve definitely forgotten or misunderstood others.
Tokenization & Vocabulary
A large language model begins life by building a vocabulary from text. A large amount of text is split into a list of unique words. Each word is then translated into an integer, because computers work with numbers more readily than with words. This process is called “tokenization”: each word is replaced by a numeric token. So now we have a list of unique tokens, which is the vocabulary of the large language model.
# Build a more advanced tokenizer
import re

text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
# Outputs: ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
all_words = sorted(set(result))
vocab_size = len(all_words)
print(vocab_size)
# Outputs 10
# Build the vocabulary: map each token to an integer id, and display it.
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
# Outputs:
(',', 0)
('--', 1)
('.', 2)
('?', 3)
('Hello', 4)
('Is', 5)
('a', 6)
('test', 7)
('this', 8)
('world', 9)
# In this example, the id 9 represents the word "world". 5 represents "Is". etc.
Here’s where my understanding starts to get fuzzy. We’re not very far along before that happens, eh? Now, we take a large amount of text (the text we used to build the vocabulary, or a subset of it, or different text entirely) and tokenize the whole thing. We do this by using the vocabulary we established earlier and replacing each word in the training text with its corresponding token value. This becomes our training text.
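That replacement step can be sketched with a small helper. The `encode` function below is my own illustration (the book builds a fuller tokenizer class), reusing the toy vocabulary from the example above:

```python
import re

# Toy vocabulary built as in the earlier example: token -> integer id.
vocab = {',': 0, '--': 1, '.': 2, '?': 3, 'Hello': 4,
         'Is': 5, 'a': 6, 'test': 7, 'this': 8, 'world': 9}

def encode(text, vocab):
    """Split text the same way we did when building the vocabulary,
    then replace each token with its integer id."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [vocab[t] for t in tokens]

print(encode("Hello, world. Is this-- a test?", vocab))
# [4, 0, 9, 2, 5, 8, 1, 6, 7, 3]
```

Note that this simple version raises a `KeyError` for any word not in the vocabulary, which is one reason real tokenizers are more sophisticated.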
Model Training
With that complete, we can “train” the model. This process involves taking each token in the vocabulary and building relationships with every other token, based on how the tokens relate to each other in the training text. So if the word “cat” is followed by the word “jumped”, the model records that relationship. But it also records how “cat” relates to the other words in the text. So “jumped” follows “cat”, but perhaps more often when they are near the word “mouse”, and maybe less often when they are near the word “sleeping”. Recording all these relationships exactly would require an enormous amount of data, so the math reduces them to approximations. There are certainly more technical terms for all of this, and the book covers them. Naturally, I’ve forgotten them.
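As a toy illustration of “recording relationships” (my own, and far simpler than anything the book actually builds), you could literally count which token follows which in the training text. A real model compresses these relationships into learned weights instead of exact counts:

```python
from collections import defaultdict

training_tokens = "my cat saw a mouse and my cat jumped".split()

# Count how often each token follows another -- a crude stand-in for the
# learned relationships a real model approximates with its weights.
follow_counts = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(training_tokens, training_tokens[1:]):
    follow_counts[current][nxt] += 1

print(dict(follow_counts['cat']))
# {'saw': 1, 'jumped': 1}
```

This literal table grows impossibly large for real vocabularies and ignores longer-range context, which is exactly why the approximating math exists.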
Text generation
Now, if you give the model some starter text, it tries to complete the text for you. Keeping with our example, if I gave the model the text “My cat saw a mouse and it”, it might predict the word “jumped”, append that word to the text I submitted, and then take the whole passage and feed it back into itself. So now the input text is “My cat saw a mouse and it jumped”. The next predicted word might be “on”, so it adds that word and feeds the extended output back into its input.
Each time through a loop like this, it tokenizes the entire input (or up to a limit, known as the context limit), computes the best next token, then converts everything back into text for you to read. (See the update below.)
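That loop might look something like this sketch, where `model_predict` is a hypothetical stand-in for the entire model (it takes token ids and returns the most likely next token id):

```python
def generate(model_predict, token_ids, max_new_tokens, context_limit=1024):
    """Repeatedly predict the next token and append it to the input.

    model_predict: a function taking a list of token ids and returning
    the single most likely next token id (stands in for the real model).
    """
    for _ in range(max_new_tokens):
        # Only the most recent `context_limit` tokens fit in the model.
        context = token_ids[-context_limit:]
        next_id = model_predict(context)
        token_ids = token_ids + [next_id]
    return token_ids

# Tiny fake "model" that always predicts token 0, just to show the loop runs.
print(generate(lambda ids: 0, [5, 6, 7], max_new_tokens=3))
# [5, 6, 7, 0, 0, 0]
```

The real thing decodes the final id list back into text for display; this sketch stops at the token ids.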
Model weights and distribution
Storing all the relationships between tokens is what’s known as the “weights” of the model. (See the update below.) Those weights can be distributed, so if you train a model on a given set of training text, you can give the weights to your friends, and they can use them to generate text in the style of that training text.
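In PyTorch, which the book uses, sharing weights amounts to saving the model’s state dict to a file. Here’s a minimal sketch; the tiny `nn.Linear` model and the file name are my own stand-ins, not the book’s code:

```python
import torch
import torch.nn as nn

# Stand-in for a trained model; a real LLM has vastly more parameters.
model = nn.Linear(4, 2)

# Save only the learned weights, not the code that defines the model...
torch.save(model.state_dict(), "model_weights.pth")

# ...so whoever receives the file must build the same architecture,
# then load the shared weights into it.
model2 = nn.Linear(4, 2)
model2.load_state_dict(torch.load("model_weights.pth"))
```

This is why published models come as a weights file plus a description of the architecture they belong to.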
Fine-tuning
Fine-tuning is the process of training a model to do specific… things. My understanding is fuzzy here, so I won’t go deep. Suffice it to say, you start with a trained large language model and continue training it using specific input and output pairs. In the book, we build a spam classifier that determines whether a given message is spam or not, as well as a model that follows instructions. That last one is actually training right now as I write this post, so I’m not sure how well it works. Based on the fact that it was published in a book, I assume it turns out fine.
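To make “input and output pairs” concrete, instruction fine-tuning data is often laid out in an Alpaca-style format like the sketch below. The example messages and the `format_prompt` helper are my own made-up illustrations, not the book’s dataset:

```python
# Hypothetical instruction fine-tuning examples. During fine-tuning, the
# model sees the instruction/input as the prompt and is trained to
# produce the response.
examples = [
    {
        "instruction": "Classify the message as spam or not spam.",
        "input": "You have WON a FREE cruise! Reply now!",
        "output": "spam",
    },
    {
        "instruction": "Classify the message as spam or not spam.",
        "input": "Are we still meeting for lunch on Tuesday?",
        "output": "not spam",
    },
]

def format_prompt(example):
    """Flatten one example into the text the model is trained on."""
    return (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")

print(format_prompt(examples[0]))
```

The training loop is the same as pretraining; only the data changes, which is what nudges the general model toward the specific behavior.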
So while I haven’t fully finished the book, I’m just about there. I learned a lot of great concepts, even though they’re obviously still fuzzy and some didn’t stick. It may be worth going back through the book again quickly, to refresh my memory and cement my learnings.
In addition to the technical aspects of large language models, what else did I learn from this experience?
Despite my experience typing all the code samples by hand, I think my time could have been better spent another way. If I did it again, I might not type out all the code snippets, but rather walk through them in my mind, making sure I know what each line is doing. The times I learned the most were when I made a typo and had to go back through my code to debug it. That forced me to understand what was happening in order to figure out what was wrong.
I learn better on paper than from a digital book. I don’t know why. I have both, and I read the first couple of chapters in the paper book. That information seems to have stuck better. Perhaps because it comes early in the book and is simpler to understand, or perhaps it’s the format. Either way, I enjoyed it.
I didn’t force myself to figure anything out, and I think that hampered my learning. The book has supplemental exercises, where the author gives you a problem and you need to figure out how to solve it. Answers are given in his GitHub repository. Doing them would have slowed me down a lot, but I’m confident I would know the material better.
What’s next?
I’m not sure yet. I want to keep building on this material, and I think working at this low level of specific detail can help me understand AI and machine learning. Chances are I’ll copy and paste this post into Claude and ask it to suggest a path forward for me.
Update: 2025-02-17
Sebastian Raschka sent me a gracious message in response to this post and clarified some of what I wrote. To quote him:
- “Each time through a loop like this, it tokenizes the entire input (or up to a limit, known as the context limit)…” You do that initially, when you’re encoding the input text. But after that, technically, it doesn’t have to re-tokenize anything. You can keep the previous output in token form to generate the next token.
What I mean is, if the text is
“My cat saw a mouse”
the tokens might be “123 1 5 6 99” (the numbers are made-up examples). Then the LLM generates token 801 for “jumped”. Then you use “123 1 5 6 99 801” as the input for the next word.
Only if you show the output to the user do you convert it back into text.
- “Storing all the relationships between tokens is what’s known as the ‘weights’ of the model.”
I’d say that the relationships between tokens are the attention scores. The model weights are more like the values involved in computing things such as the attention scores (and other things).
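Raschka’s point about not re-tokenizing can be sketched like this (the helper names and token ids are my own hypothetical stand-ins): the generation loop works entirely on token ids, and text conversion happens only at the edges.

```python
def generate_ids(predict_next, prompt_ids, n_tokens):
    """Generate n_tokens new token ids, never converting back to text.

    predict_next: stand-in for the model; takes token ids, returns the
    next token id.
    """
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        ids.append(predict_next(ids))   # no re-tokenizing each round
    return ids

prompt_ids = [123, 1, 5, 6, 99]        # "My cat saw a mouse" (made-up ids)
ids = generate_ids(lambda ids: 801, prompt_ids, 1)
print(ids)
# [123, 1, 5, 6, 99, 801]
```

Only when displaying output to the user would you decode the id list back into text.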
Now that you have finished the book, if you’re interested, I also have a lot of bonus material in the GitHub repository.
I’d say the GPT-to-Llama conversion guides (https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/07_gpt_to_llama) and the DPO fine-tuning (https://github.com/rasbt/LLMs-from-scratch/tree/main/ch07/04_preference-tuning-with-dpo) are perhaps the most interesting ones.
I also uploaded some PyTorch tips for speeding up model training: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/10_llm-training-speed
These materials are less polished than the book itself, but maybe you’ll find them useful!