Course

Multi-Modal Models with Hugging Face

Skill level: Intermediate

Updated January 2026

Combine text, images, audio, and video with the latest AI models from Hugging Face, and generate new images and videos!

Python, Artificial Intelligence

4 hours

14 videos

45 exercises

3,800 XP

Statement of Accomplishment

Course Description

Dive into the cutting-edge world of multi-modal AI models, where text, images, and speech combine to create powerful applications. Learn how to leverage Hugging Face's vast repository of models that can see, hear, and understand like never before. Whether you're analyzing social media content, building voice assistants, or creating next-generation AI applications, multi-modal models are your gateway to handling diverse data types seamlessly.

Explore state-of-the-art models like CLIP for image-text understanding, SpeechT5 for voice synthesis, and the Qwen2 Vision Language model for multi-modal sentiment analysis. Through hands-on exercises, you'll master the techniques used by leading AI companies to build sophisticated multi-modal systems.

Future-Proof Your AI Skills

This course will give you a robust toolkit for handling multi-modal AI tasks. You'll learn to process and combine different data modalities effectively, fine-tune pre-trained models for custom applications, and evaluate and improve model performance across modalities.

Prerequisites

Introduction to LLMs in Python

1

Accessing Hugging Face Models and Datasets

Navigate the Hugging Face model hub and transform raw text, audio, and visual data into AI-friendly formats. Learn how to find the latest and most popular models for tasks such as text generation, and harness the power of pre-built pipelines.

Hugging Face model navigation

How many models!?

Finding the most popular text-to-image model

Preprocessing different modalities

Text tokenizing

Image preprocessing

Audio preprocessing

Pipeline tasks and evaluations

Pipeline caption generation

Passing keyword arguments

Model evaluation on a custom dataset
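
To make "AI-friendly formats" concrete for the text modality, here is a minimal sketch of what tokenizing does: it maps text to integer IDs and pads to a fixed length. This is a toy whitespace tokenizer written for illustration, not the course's code and not a Hugging Face tokenizer (those use learned subword vocabularies). The helper names `build_vocab` and `encode` are hypothetical; only the output keys `input_ids` and `attention_mask` mirror what Hugging Face tokenizers return.

```python
# Toy illustration only: real Hugging Face tokenizers use learned subword
# vocabularies. build_vocab and encode are hypothetical helper names.
PAD, UNK = "[PAD]", "[UNK]"


def build_vocab(texts):
    """Map each distinct lowercase word to an integer ID (0 and 1 are reserved)."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Convert text to a fixed-length list of IDs plus an attention mask."""
    ids = [vocab.get(word, vocab[UNK]) for word in text.lower().split()]
    ids = ids[:max_length]            # truncate inputs that are too long
    padding = max_length - len(ids)   # number of pad positions to append
    return {
        "input_ids": ids + [vocab[PAD]] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


vocab = build_vocab(["a photo of a cat", "a photo of a dog"])
encoded = encode("a photo of a bird", vocab, max_length=6)
print(encoded)
```

The unseen word "bird" falls back to the unknown-token ID, and the attention mask marks the final padded position with 0 so a model can ignore it.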

2

Unimodal Vision, Audio, and Text Models

Learn to master individual modalities with state-of-the-art models. Dive into computer vision for image classification and segmentation, explore speech recognition and text-to-speech synthesis, and learn effective fine-tuning techniques. Build practical skills with pre-trained models from Hugging Face's transformers library.

Computer vision

Image classification

Object detection

Image background removal

Fine-tuning computer vision models

CV fine-tuning: dataset prep

CV fine-tuning: model classes

CV fine-tuning: trainer configuration

Speech recognition and audio generation

Automatic speech recognition

Creating speech embeddings

Audio denoising

Fine-tuning text-to-speech models

Fine-tuning a text-to-speech model

Generating new speech

3

Multi-Modal Models for Classification

Learn to fuse visual, textual, and audio information for richer AI applications. Master techniques like CLIP for zero-shot classification, build sentiment analyzers that see and read, and create emotion detectors that combine facial expressions with voice. Take your AI models beyond single-modality thinking.

Zero-shot image classification

Zero-shot learning with CLIP

Automated caption quality assessment

Multi-modal sentiment analysis

Prompting Vision Language Models (VLMs)

Multi-modal sentiment classification with Qwen

Zero-shot video classification

Video audio splitting

Video sentiment analysis with CLIP and CLAP
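
The zero-shot idea behind this chapter can be sketched without any model at all: embed the image and each candidate label prompt in a shared space, compare them by cosine similarity, and turn the scaled similarities into probabilities with a softmax. The sketch below is illustrative rather than the course's code. The 3-dimensional embeddings are made up (a real CLIP model produces much larger vectors), `zero_shot_scores` is a hypothetical helper name, and the scale of 100 is an assumed temperature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(image_emb, label_embs, scale=100.0):
    """Softmax over scaled cosine similarities (scale is an assumed temperature)."""
    logits = {label: scale * cosine(image_emb, emb)
              for label, emb in label_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Made-up embeddings for illustration; a real encoder would produce these.
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probs = zero_shot_scores(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best)
```

No label-specific training is involved: changing the candidate classes only means changing the prompt strings and their embeddings.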

4

Multi-Modal Generation

Transform ideas into reality! Master cutting-edge AI techniques to generate and manipulate visual content using text prompts. Create stunning images, edit photos intelligently, and build powerful question-answering systems for images and documents. Turn your creative vision into digital reality with multi-modal AI.

Visual question-answering (VQA)

VQA with Vision-and-Language Transformers (ViLT)

Document VQA with LayoutLM

Image editing with diffusion models

Custom image editing

Image inpainting

Video generation

Build a video!

Assessing video generation performance

Congratulations!
