By Next Solution Lab on 2024-07-14 23:30:57
With the rise of large language models (LLMs), there is growing interest in developing models for languages that are under-represented in existing AI technologies. Because English dominates available training data, most LLMs are trained primarily on English text. Even multilingual LLMs perform worse on other languages, since English makes up the bulk of their training data and each remaining language contributes only a small share. This creates a persistent demand for LLMs specialized in a particular language. However, building an LLM from scratch is extremely time-consuming and resource-intensive.
This blog will guide you through the process of creating an LLM for a new language without building a model from scratch, covering everything from selecting a base model to fine-tuning with labeled data and optimizing the model's performance. It does not provide a code base; rather, it is a general guideline with links to external code bases that can be used for this task.
The first step is to select a base model with robust performance that can serve as a foundation for your new language model. Rather than training from scratch, we will do continual (domain-adaptive) pretraining, so we start with a model that already has a basic grasp of language and comprehension. It is an added advantage if the base model already understands your target language.
Popular choices include Llama 3, Gemma, Mistral, or other state-of-the-art transformer models. The key is to choose a model with enough parameters to ensure good performance. We do not want a model that is too big, as training it would be very expensive, nor one that is too small, as performance may degrade. We should first decide how much compute we can commit throughout training and then choose a model size accordingly; resources on compute-optimal scaling laws can help you estimate the largest model you can realistically train with your budget.
For our purposes, we will choose the 8-billion-parameter Llama 3 model, and our target language is Bangla.
Next, we need to collect unlabeled data to train our model with. The unlabeled data must be in our target language. There are many resources for finding unlabeled datasets; some are listed below:
• OSCAR
• Web crawls (e.g., Common Crawl)
• Digital libraries
• Social media text
• News articles
• Wikipedia
• Public domain books
Make sure to include data that is factual, diverse, and covers various domains, and clean the data to remove redundancy. For a model in the 7–8B class, roughly 30–50 GB of unlabeled text is usually required. Alongside data collection, data processing is also a crucial step in this process.
Now we need a tokenizer that can understand and correctly tokenize our target language. If the base model's tokenizer already handles the characters of our target language, this step can be skipped. We can test this by using the chosen model's tokenizer on various texts in our target language. If it fails to tokenize the target language, we need to follow these steps:
• Use SentencePiece to create a custom tokenizer that covers the characters of our target language.
• Merge that tokenizer with the chosen model's tokenizer.
• Make the necessary architectural changes to the chosen model so it fits the new tokenizer.
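The coverage test mentioned above can be sketched as a small helper that measures how often a tokenizer falls back to its unknown token on target-language text. The helper is generic over any `encode` callable; the commented usage with a Hugging Face tokenizer is an illustration (the model name is a placeholder, and `transformers` is assumed to be installed):

```python
def unknown_token_ratio(encode, unk_id, text):
    """Fraction of token ids that map to the unknown token.

    `encode` is any callable returning a list of token ids (for example,
    tokenizer.encode from Hugging Face or SentencePiece). A high ratio
    means the tokenizer cannot represent the text.
    """
    ids = encode(text)
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)

# Illustrative usage with a Hugging Face tokenizer:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# ratio = unknown_token_ratio(tok.encode, tok.unk_token_id, "আমি বাংলায় কথা বলি")
```

If the ratio is high across a sample of target-language sentences, the base tokenizer needs to be extended.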
We can create a tokenizer for our target language using SentencePiece and the unlabeled dataset we have collected.
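Training the SentencePiece model might look like the following one-call sketch; the corpus file name, vocabulary size, and coverage value are illustrative choices, not requirements:

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on the raw target-language corpus.
# `corpus.txt` holds one sentence per line; vocab_size is a tunable choice.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="new_tokenizer",   # produces new_tokenizer.model / .vocab
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,      # keep rare target-language characters
)
```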
After creating the tokenizer, we should have a file named ‘new_tokenizer.model’. To merge this tokenizer with the chosen model’s tokenizer, we can follow this script:
https://github.com/ymcui/Chinese-LLaMA-Alpaca/
They provide a script that merges the tokenizers and makes the required architectural changes to the chosen model to accommodate the increased vocabulary size. You can read more about it in their repository. Using their script, we obtain a base model that can tokenize our target language.
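In outline, the merge adds every piece from the new SentencePiece model that the base vocabulary lacks, then grows the model's embedding matrix to match. The helper below captures the vocabulary-merging step; the commented usage with real models is a simplified sketch of what the linked script does (model names are placeholders, and `transformers` plus `sentencepiece` are assumed):

```python
def pieces_to_add(base_vocab, new_pieces):
    """Return the pieces from the new SentencePiece model that are
    missing from the base tokenizer's vocabulary, preserving order."""
    return [p for p in new_pieces if p not in base_vocab]

# Simplified sketch with real models (see the linked repo for the full version):
# from transformers import AutoTokenizer, AutoModelForCausalLM
# import sentencepiece as spm
# tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# sp = spm.SentencePieceProcessor(model_file="new_tokenizer.model")
# new = pieces_to_add(tok.get_vocab(),
#                     [sp.id_to_piece(i) for i in range(sp.get_piece_size())])
# tok.add_tokens(new)
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# model.resize_token_embeddings(len(tok))  # grow embeddings for the larger vocab
```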
In this step, we use our unlabeled dataset to train the model. The purpose of this training is to teach the model our target language. Since Llama-style models are decoder-only, this training uses causal language modeling (next-token prediction) rather than masked language modeling. After training, the model will be able to understand and comprehend our target language properly. The resource below offers guidance on domain-adaptive pretraining:
https://towardsdatascience.com/domain-adaptation-of-a-large-language-model-2692ed59f180
In a limited-resource environment, PEFT (parameter-efficient fine-tuning) is needed to reduce memory requirements. We need to train for at least 10 epochs. For hyper-parameter tuning, it is better to start with a subset of the original dataset and use it to find a good hyper-parameter configuration. It is also important to monitor the model during pretraining, as LLM training differs considerably from training traditional ML models. You can read about the various training arguments at the link below:
https://huggingface.co/docs/transformers/en/main_classes/trainer
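As an illustration of the PEFT approach mentioned above, a LoRA configuration for continual pretraining with the `peft` library might look like this; all hyper-parameter values are illustrative starting points, not tuned recommendations:

```python
# Illustrative LoRA configuration for memory-efficient continual pretraining
# (assumes the `peft` library; values are starting points, not recommendations).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # low-rank dimension of the adapter matrices
    lora_alpha=32,             # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
# The config is then passed to peft.get_peft_model(model, lora_config)
# before handing the model to the Trainer.
```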
The next step is to fine-tune the model using a labeled dataset. Labeled data is harder to acquire than unlabeled data. We can search open-source communities such as Hugging Face and Kaggle, or use ChatGPT to generate a labeled dataset. We have selected the dataset below:
https://huggingface.co/datasets/OdiaGenAI/all_combined_bengali_252k?row=97
We can select any dataset. The dataset requires three fields: Instruction (what to do), Input (what to act on), and Output (what to respond with). The purpose of fine-tuning is not to teach the model new knowledge but to show it how to converse, so the labeled dataset should cover every way a question can be asked rather than every possible question. Prompt engineering can also be used when creating the dataset to get better results. We can learn about it here:
https://www.datacamp.com/blog/what-is-prompt-engineering-the-future-of-ai-communication
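The three fields are typically rendered into a single training string with a fixed template. The Alpaca-style template below is one common choice, not a requirement; any consistent template works as long as the same one is reused at inference time:

```python
def format_example(instruction, output, input_text=""):
    """Render one labeled example (Instruction/Input/Output) into a single
    training string using an Alpaca-style template. The Input section is
    omitted when the example has no input."""
    if input_text:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{input_text}\n\n"
                f"### Response:\n{output}")
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
```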
After getting the labeled dataset, we will use QLoRA to fine-tune our model. Fine-tuning is less resource-intensive than pretraining but still demands a lot, so great emphasis should be placed on the quality and diversity of the dataset. Hyper-parameter tuning is also required for this step. To fine-tune the model, we can follow the scripts below:
https://huggingface.co/blog/mlabonne/orpo-llama-3
https://www.datacamp.com/tutorial/fine-tuning-large-language-models
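As a rough sketch of the QLoRA setup the scripts above implement: the base model is loaded in 4-bit precision via `bitsandbytes`, and LoRA adapters are trained on top of the frozen quantized weights. The model name and all values here are placeholders:

```python
# Illustrative QLoRA loading configuration (assumes `transformers`,
# `bitsandbytes`, and `peft`; model name and values are placeholders).
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
)
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config)
# LoRA adapters are then attached with peft.get_peft_model before training.
```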
After fine-tuning, we will have a model that can follow instructions and answer questions in the target language. If done correctly, we can also apply prompt engineering and RAG to this model.
To align the model more closely with user preferences, you can use Direct Preference Optimization (DPO). This step used to be done with Reinforcement Learning from Human Feedback (RLHF), but RLHF requires substantial computational and human resources, so DPO can be used instead. We can learn about DPO from its paper:
https://arxiv.org/abs/2305.18290
To perform DPO on our instruct model, we first gather multiple outputs for the same input using the model, then have human labelers rate the different responses. Once we have that data, we format it with ‘chosen’ and ‘rejected’ fields for the preferred and rejected responses. We can look at a Hugging Face DPO dataset to see how our dataset should be formatted:
https://huggingface.co/datasets/efederici/alpaca-vs-alpaca-orpo-dpo
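The conversion from rated generations to the chosen/rejected format can be sketched as a small helper; the field names `prompt`, `chosen`, and `rejected` follow the common DPO dataset convention, while the rating scheme here is an illustrative assumption:

```python
def to_dpo_record(prompt, responses, ratings):
    """Convert human-rated generations into one DPO training record.

    `responses` and `ratings` are parallel lists; the highest-rated
    response becomes `chosen` and the lowest-rated becomes `rejected`.
    """
    ranked = sorted(zip(ratings, responses), key=lambda pair: pair[0])
    return {
        "prompt": prompt,
        "chosen": ranked[-1][1],   # best-rated response
        "rejected": ranked[0][1],  # worst-rated response
    }
```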
Finally, we perform DPO training using this dataset. Here is a blog that describes this training method:
https://huggingface.co/blog/dpo-trl
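Putting the pieces together, a minimal DPO training loop with the TRL library might look like the sketch below; `model`, `tokenizer`, and `dpo_dataset` are assumed to already exist, the hyper-parameters are illustrative, and exact API details vary across TRL versions:

```python
# Minimal DPO sketch using TRL (treat this as a shape, not a drop-in script;
# argument names differ slightly between trl versions).
from trl import DPOConfig, DPOTrainer

args = DPOConfig(
    output_dir="llama3-bangla-dpo",
    beta=0.1,   # strength of the implicit KL penalty toward the reference model
)
# trainer = DPOTrainer(
#     model=model,
#     args=args,
#     train_dataset=dpo_dataset,   # columns: prompt, chosen, rejected
#     tokenizer=tokenizer,
# )
# trainer.train()
```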
Creating an LLM for a new language involves multiple steps, from selecting a base model and gathering data to training tokenizers and fine-tuning the model. By following these steps, you can develop a powerful language model that understands and generates text in your target language, tailored to your specific needs and preferences. This process, while complex, opens the door to bringing advanced AI capabilities to more languages and cultures around the world.
At Next Solution Lab, we are dedicated to transforming experiences through innovative solutions. If you are interested in learning more about how our projects can benefit your organization, please get in touch.
Contact Us