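Multi-Modal Models with Hugging Face

Intermediate skill level

Updated 01/2026

Combine text, images, audio, and video with the latest AI models from Hugging Face to generate new images and videos.

Course description

Prerequisites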

Introduction to LLMs in Python

Accessing Hugging Face Models and Datasets

Navigate the Hugging Face model hub and transform raw text, audio, and visual data into AI-friendly formats. Learn how to find the latest and most popular models for tasks such as text generation, and harness the power of pre-built pipelines.
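
As a rough illustration of what "AI-friendly formats" means for text, the sketch below turns sentences into padded integer IDs with an attention mask. It is a toy under stated assumptions: the whitespace splitting, the `[PAD]` and `[UNK]` entries, and the helper names `build_vocab` and `encode` are hypothetical and are not the Hugging Face tokenizer API, which uses learned subword vocabularies.

```python
# Toy sketch only: real tokenizers use learned subword vocabularies,
# not whitespace splitting. All names here are hypothetical.

def build_vocab(texts):
    """Assign an integer ID to every lowercase word seen in `texts`."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(texts, vocab, max_length):
    """Map texts to fixed-length ID lists plus a mask marking real tokens."""
    input_ids, attention_mask = [], []
    for text in texts:
        words = text.lower().split()[:max_length]  # truncate long inputs
        ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
        pad = max_length - len(ids)
        input_ids.append(ids + [vocab["[PAD]"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


corpus = ["a photo of a cat", "a dog"]
vocab = build_vocab(corpus)

# "bird" was never seen while building the vocabulary, so it maps to [UNK].
batch = encode(corpus + ["a photo of a bird"], vocab, max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets a model ignore padding positions when sequences of different lengths share one batch.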

Hugging Face model navigation

50 XP

How many models!?

100 XP

Finding the most popular text-to-image model

100 XP

Preprocessing different modalities

50 XP

Text tokenizing

100 XP

Image preprocessing

100 XP

Audio preprocessing

100 XP

Pipeline tasks and evaluations

50 XP

Pipeline caption generation

100 XP

Passing keyword arguments

100 XP

Model evaluation on a custom dataset

100 XP

Unimodal Vision, Audio, and Text Models

Learn to master individual modalities with state-of-the-art models. Dive into computer vision for image classification and segmentation, explore speech recognition and text-to-speech synthesis, and learn effective fine-tuning techniques. Build practical skills with pre-trained models from Hugging Face's transformers library.

Computer vision

50 XP

Image classification

100 XP

Object detection

100 XP

Image background removal

100 XP

Fine-tuning computer vision models

50 XP

CV fine-tuning: dataset prep

100 XP

CV fine-tuning: model classes

100 XP

CV fine-tuning: trainer configuration

100 XP

Speech recognition and audio generation

50 XP

Automatic speech recognition

100 XP

Creating speech embeddings

100 XP

Audio denoising

100 XP

Fine-tuning text-to-speech models

50 XP

Fine-tuning a text-to-speech model

100 XP

Generating new speech

100 XP

Multi-Modal Models for Classification

Learn to fuse visual, textual, and audio information for richer AI applications. Master techniques like CLIP for zero-shot classification, build sentiment analyzers that see and read, and create emotion detectors that combine facial expressions with voice. Take your AI models beyond single-modality thinking.
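
The idea behind CLIP-style zero-shot classification can be sketched in a few lines: embed the image and each candidate label in the same vector space, compare them with cosine similarity, and turn the scaled similarities into probabilities with a softmax. The example below is a minimal sketch, assuming hand-written three-dimensional embeddings and a hypothetical `scale` value of 100; a real model would produce the embeddings and supply its own learned scale.

```python
import math

# Sketch of CLIP-style zero-shot scoring. The embeddings below are made up
# for illustration; a real model would compute them from pixels and text.


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(image_embedding, label_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities, one probability per label.

    `scale` is an assumed value chosen for this sketch.
    """
    logits = {
        label: scale * cosine(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probs = zero_shot_scores(image_embedding, label_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Because the labels are just text, new classes can be added by writing new prompts, without retraining anything.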

Zero-shot image classification

50 XP

Zero-shot learning with CLIP

100 XP

Automated caption quality assessment

100 XP

Multi-modal sentiment analysis

50 XP

Prompting Vision Language Models (VLMs)

100 XP

Multi-modal sentiment classification with Qwen

100 XP

Zero-shot video classification

50 XP

Video audio splitting

100 XP

Video sentiment analysis with CLIP and CLAP

100 XP

Multi-Modal Generation

Transform ideas into reality! Master cutting-edge AI techniques to generate and manipulate visual content using text prompts. Create stunning images, edit photos intelligently, and build powerful question-answering systems for images and documents. Turn your creative vision into digital reality with multi-modal AI.

Visual question-answering (VQA)

50 XP

VQA with the Vision-and-Language Transformer (ViLT)

100 XP

Document VQA with LayoutLM

100 XP

Image editing with diffusion models

50 XP

Custom image editing

100 XP

Image inpainting

100 XP