
Apple's DCLM-7B: Setup, Example Usage, Fine-Tuning

Get started with Apple's DCLM-7B large language model and learn how to set it up, use it, and fine-tune it for specific tasks.
Aug 12, 2024 · 9 min read

Apple has long been a pioneer in technology, consistently setting the bar for innovation. Recently, Apple released a new open-source DCLM-7B large language model (LLM) for the community to use.

It’s encouraging to see a major player like Apple release its new model as open source, as this move aligns with the growing trend of democratizing AI and making powerful tools accessible to a broader audience.


DCLM-7B: Key Features and Capabilities

Apple's latest contribution, the DCLM-7B (DataComp for Language Models) base model, stands out as a noteworthy addition to the LLM field. Let’s explore its key features.

Decoder-only transformer

The DCLM-7B model utilizes a decoder-only Transformer architecture, which is a design where the model predicts one token at a time, and each generated token is fed back into the model to generate the next one.

This architecture is optimized for generating coherent and contextually relevant text, making it ideal for various natural language processing tasks. It is the same architecture used in state-of-the-art models like ChatGPT and GPT-4o, demonstrating its effectiveness in understanding and generating human-like text.
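To make the token-by-token idea concrete, here is a minimal sketch of a greedy decoding loop, assuming a Hugging Face causal language model and tokenizer like the ones we load later in this tutorial (the built-in generate() method does this far more efficiently; this is purely illustrative):

import torch

def greedy_generate(model, tokenizer, prompt, max_new_tokens=20):
    # Start from the tokenized prompt
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # scores for every position and vocabulary token
        next_token = logits[:, -1, :].argmax(dim=-1)  # pick the most likely next token
        # Feed the new token back in so it conditions the next prediction
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)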

Open-source

The DCLM-7B model is available for research and development under the Apple Sample Code License. This open-source approach encourages widespread use and collaboration within the AI community.

By making this model accessible, Apple supports the democratization of AI, allowing researchers and developers from all over the world to experiment with and build upon the base model.

7 billion parameters

With 7 billion parameters, the DCLM-7B model strikes a balance between performance and computational efficiency.

This size makes it possible to run the model on most machines or cloud platforms with ample RAM or VRAM, making it versatile and accessible for various applications. At the same time, the substantial number of parameters enables the model to capture complex language patterns, enhancing its capability to perform a wide range of tasks.
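As a rough back-of-the-envelope check, 7 billion parameters stored as 32-bit floats take about 28 GB, which lines up with the download size we see later, and half precision roughly halves that (weights only, excluding activations):

params = 7e9  # 7 billion parameters
print(f"fp32: ~{params * 4 / 1e9:.0f} GB, fp16/bf16: ~{params * 2 / 1e9:.0f} GB")  # weights only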

Trained on a massive dataset

The model has been trained on an extensive dataset of 2.5 trillion tokens, providing a solid foundation for tackling a wide range of language tasks. This allows the DCLM-7B model to understand and generate text with a high degree of accuracy and relevance. Additionally, this makes the model a good choice for task-specific fine-tuning, as it has a robust base understanding of the English language.

Context window

The base DCLM-7B model has a context window of 2,048 tokens. While this is modest by today’s standards, Apple has also released a variant with an 8K-token context window.

This extended context window provides even greater flexibility for handling longer inputs, making the model suitable for applications that require the processing of extended texts or documents, like Retrieval Augmented Generation (RAG).
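As a quick hedged sketch, you can check whether a given prompt fits into the 2,048-token window by counting tokens with the model’s tokenizer (loaded in the next section); prompt_text here is just a placeholder for whatever string you plan to send to the model:

context_window = 2048
n_tokens = len(tokenizer(prompt_text)["input_ids"])  # prompt_text is any string you plan to send to the model
print(f"{n_tokens} tokens, fits in the context window: {n_tokens <= context_window}")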


Getting Started With DCLM-7B

Apple has made the DCLM-7B model compatible with Hugging Face’s transformers library, making it easy to access and use.

You can find the model’s webpage on Hugging Face, and check out the GitHub repository for more details. To access and use the model, we first need to install the transformers library:

pip install transformers

Additionally, we will need to install the open_lm framework:

pip install git+https://github.com/mlfoundations/open_lm.git

At full precision, the DCLM-7B model weights take up approximately 27.5 GB, so running the model requires a significant amount of RAM or VRAM. You will need a high-end computer or some kind of cloud environment. I will be using Google Colab’s premium subscription notebook with 50 GB of RAM and an L4 GPU.

With all the necessary libraries installed, we are ready to start using the model!

DCLM-7B: Example Usage

For this example, I will run the basic usage snippet provided on the model’s Hugging Face webpage. First, we import all the necessary libraries:

from open_lm.hf import *
from transformers import AutoTokenizer, AutoModelForCausalLM

Then, we need to download and initialize both the tokenizer and the model (notice that we are running the model with full precision floats on a CPU for this example):

tokenizer = AutoTokenizer.from_pretrained("apple/DCLM-Baseline-7B")
model = AutoModelForCausalLM.from_pretrained("apple/DCLM-Baseline-7B")
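If memory is tight or you want to use the GPU instead, a hedged alternative is to load the weights in half precision via the standard torch_dtype option of from_pretrained; nothing about this is specific to DCLM-7B:

import torch

# Load the weights in bfloat16 (~half the memory of fp32) and move the model to the GPU if one is available
model = AutoModelForCausalLM.from_pretrained("apple/DCLM-Baseline-7B", torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

If you do move the model to the GPU, remember to move the inputs there as well, for example with inputs['input_ids'].to(model.device).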

And lastly, we run the example prompt:

inputs = tokenizer(["Machine learning is"], return_tensors="pt")  # tokenize the prompt into PyTorch tensors
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}  # sampling settings: output length, nucleus sampling, temperature, repetition penalty
output = model.generate(inputs['input_ids'], **gen_kwargs)  # generate a continuation of the prompt
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)  # convert token IDs back into text

I got the following output:

[Machine learning is not the solution to everything, it just enables you to solve a problem that otherwise would have been impossible. The biggest challenge for me as a manager of an AI team was to identify those problems where machine learning can really add value and be successful.]

Advanced Usage And Fine-Tuning

Fine-tuning the DCLM-7B model can help tailor it to specific tasks, enhancing its performance in your applications. Unfortunately, the DCLM-7B model is not supported by Hugging Face’s peft library, so we need to fine-tune it directly with the transformers library.

Without parameter-efficient techniques like LoRA, full fine-tuning of a model this large requires immense resources, since every parameter is updated, much as it would be when training from scratch. Therefore, I will only outline the fine-tuning process here without actually running it.
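To get a feel for the scale, a common rule of thumb for full fine-tuning with the Adam optimizer in fp32 is roughly 16 bytes per parameter (weights, gradients, and two optimizer moments), before even counting activations; the figure below is an estimate, not an official number:

params = 7e9
bytes_per_param = 4 + 4 + 4 + 4  # fp32 weights + gradients + Adam first and second moments
print(f"~{params * bytes_per_param / 1e9:.0f} GB of memory, excluding activations")  # ~112 GB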

Preparing the dataset

To download and use an openly available dataset, we will be using Hugging Face’s datasets library. We install it with the following command:

pip install datasets

Once installed, we import and use the load_dataset function. For this example, I will be using the wikitext dataset:

from datasets import load_dataset
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')

Now, we need to tokenize the dataset. Causal LM tokenizers often ship without a padding token, so if that is the case, we reuse the end-of-sequence token as padding:

# If the tokenizer has no padding token (common for causal LMs), reuse the end-of-sequence token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Now, we are ready to start fine-tuning!

Fine-tuning

For fine-tuning, we need to import and initialize the TrainingArguments and Trainer objects, define a data collator for causal language modeling, and then run the train() function.

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Collator that builds labels from the inputs for causal (non-masked) language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    report_to="none",
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,  # Controls how much to change the model weights during training
    per_device_train_batch_size=2,  # Number of samples per batch per device during training
    per_device_eval_batch_size=2,  # Number of samples per batch per device during evaluation
    num_train_epochs=3,  # Number of times the entire training dataset will be passed through the model
    weight_decay=0.01,  # Regularization technique to prevent overfitting
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
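Once training finishes, you would typically save the result; here is a minimal sketch using standard transformers calls (the output directory name is just an example):

trainer.save_model("./dclm-7b-finetuned")  # saves the fine-tuned weights and config
tokenizer.save_pretrained("./dclm-7b-finetuned")  # keep the tokenizer files alongside the model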

Conclusion

Overall, Apple's DCLM-7B is a significant addition to the open-source language model landscape, offering researchers and developers a powerful tool for various NLP tasks.

As a decoder-only Transformer model, it is optimized for text generation, providing coherent and contextually relevant outputs. The model's availability under the Apple Sample Code License further encourages collaboration and innovation in the AI community.

Author
Dimitri Didmanidze
I'm Dimitri Didmanidze, a data scientist currently pursuing a Master's degree in Mathematics with a focus on Machine Learning. My academic journey has also included research about the capabilities of transformer-based models and teaching at the university level, enriching my understanding of complex theoretical concepts. I have also worked in the banking industry, where I've applied these principles to tackle real-world data challenges.