首页 Google Cloud

课程

Google DeepMind: Represent Your Language Data

中级技能水平

更新时间 2026年4月

In this Google DeepMind course you will learn how to prepare text data for language models to process.

免费开始课程

Google CloudCloud

4小时

35 练习

1,750 经验值

成就证明

深受数千家公司学习者的喜爱

在培训团队？

企业版试用

课程描述

In this Google DeepMind course you will learn how to prepare text data for language models to process. You will investigate the tools and techniques used to prepare, structure, and represent text data for language models, with a focus on tokenization and embeddings. You will be encouraged to think critically about the decisions behind data preparation, and what biases within the data may be introduced into models. You will analyze trade-offs, learn how to work with vectors and matrices, how meaning is represented in language models. Finally, you will practice designing a dataset ethically using the Data Cards process, ensuring transparency, accountability, and respect for community values in AI development.

先决条件

本课程无先修要求

1

Introduction to text data

In this module, you will learn about the challenges that come with preparing text data so that it is in a format that machines can process. You will consider the course learning objectives and how to most effectively study them. Furthermore, you will learn how the meaning of text depends on social and cultural contexts and why this makes issues like ownership, consent, privacy, and exclusion central to building responsible datasets for LLMs.

Teaching a machine the soul of your language

50 经验值

A world of text: Types and sources

50 经验值

Exploring raw data

50 经验值

Learning objectives

50 经验值

How to get the most out of this course

50 经验值

2

Preprocessing

In this module, you will practice common automatic techniques for cleaning texts and think about where text data comes from. You will hear from Professor David Adelani about community efforts to create datasets that work well for African languages. Next, you will explore why reflecting on data sourcing, consent and ownership in the African context is crucial in preventing digital data from becoming another form of extraction. You will investigate how issues of transparency, benefit-sharing, and community control shape ethical questions about who owns data, who profits from it, and how it can be used responsibly.

Lab: Preprocess Data

50 经验值

Harnessing the potential of low-resource languages

50 经验值

Data resources

50 经验值

Who owns the data?

50 经验值

Quiz 1 - Question 1

50 经验值

Quiz 1 - Question 2

50 经验值

3

Tokenization

In this module, you will learn about different levels of granularity when splitting texts into tokens. You will first experiment with character-level and word-level tokenizers to understand their different approaches. Then, you will learn about byte pair encoding (BPE), which is a subword tokenizer. This advanced method combines the benefits of both character and word-level approaches, offering a more balanced solution. You will then move on to consider how gaps and biases in LLM training datasets can marginalize African languages and cultures, reinforcing digital exclusion. By reflecting on these disparities, you will see how inclusive data practices and community-driven initiatives are essential for building fairer, more responsible AI systems.

What is tokenization?

50 经验值

Lab: Tokenize Texts into Characters and Words

50 经验值

Lab: Tokenize Texts into Subword Tokens

50 经验值

Subword tokenization

50 经验值

Lab: Implement a BPE Tokenizer

50 经验值

Whose voice is missing?

50 经验值

Quiz 2 - Question 1

50 经验值

Quiz 2 - Question 2

50 经验值

4

Embeddings

In this module, you will investigate how language models represent the meaning of tokens in the form of embeddings. You will design your own “map of meaning”, experiment with Gemma’s embeddings, and learn how to visualize the token meaning representations. Finally, you will use the BPE tokenizer that you implemented in the previous module to prepare a dataset for training a small language model.

What are embeddings?

50 经验值

Design your own embeddings

50 经验值

Desired properties of embeddings

50 经验值

Lab: Experiment with Embeddings

50 经验值

Lab: Train an SLM with Your BPE Tokenizer

50 经验值

Quiz 3 - Question 1

50 经验值

Quiz 3 - Question 2

50 经验值

5

Challenge

In this module, you will build on your values-led problem statement from 01 Build Your Own Small Language Model by learning how to design an ethical dataset that supports your solution. You will see how dataset choices shape fairness, representation, and accountability in AI, and why responsible innovation in Africa means creating systems that respect privacy, community ownership, and cultural heritage

Why document data?

50 经验值

Build a dataset ethically with a Data Card

50 经验值

Quiz 4 - Question 1

50 经验值

Quiz 4 - Question 2

50 经验值

6

Continue your journey

In this module, you will have the opportunity to consult additional resources and further reading to investigate the topics you have covered in more detail. Finally, you will consider your next steps and how you can build on what you have learned in the course.

50 经验值

Looking forward

50 经验值

Additional resources and further reading

50 经验值

50 经验值

50 经验值

Google DeepMind: Represent Your Language Data

课程完成

获得成就证明

将此证书添加到你的 LinkedIn 档案、简历或履历中
在社交媒体和绩效评估中分享立即注册

加入超过19百万学习者，今天就开始Google DeepMind: Represent Your Language Data！

通过 DataCamp for Mobile 提升您的数据技能

随时随地通过我们的移动课程和每日 5 分钟编程挑战提升技能。