
Course Description
Harness the Power of Multi-Modal AI
Dive into the cutting-edge world of multi-modal AI models, where text, images, and speech combine to create powerful applications. Learn how to leverage Hugging Face's vast repository of models that can see, hear, and understand like never before. Whether you're analyzing social media content, building voice assistants, or creating next-generation AI applications, multi-modal models are your gateway to handling diverse data types seamlessly.

Master Essential Multi-Modal Techniques

Explore state-of-the-art models like CLIP for image-text understanding, SpeechT5 for voice synthesis, and the Qwen2 Vision Language model for multi-modal sentiment analysis. Through hands-on exercises, you'll master the techniques used by leading AI companies to build sophisticated multi-modal systems.

Future-Proof Your AI Skills

This course will give you a robust toolkit for handling multi-modal AI tasks. You'll learn to process and combine different data modalities effectively, fine-tune pre-trained models for custom applications, and evaluate and improve model performance across modalities.
1. Accessing Hugging Face Models and Datasets
Free

Navigate the Hugging Face model hub and transform raw text, audio, and visual data into AI-friendly formats. Learn how to find the latest and most popular models for tasks such as text generation, and harness the power of pre-built pipelines. A minimal pipeline sketch follows the exercise list below.
- Hugging Face model navigation (50 xp)
- How many models!? (100 xp)
- Finding the most popular text-to-image model (100 xp)
- Preprocessing different modalities (50 xp)
- Text tokenizing (100 xp)
- Image preprocessing (100 xp)
- Audio preprocessing (100 xp)
- Pipeline tasks and evaluations (50 xp)
- Pipeline caption generation (100 xp)
- Passing keyword arguments (100 xp)
- Model evaluation on a custom dataset (100 xp)
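As a rough preview of the pipeline exercises, here is a minimal sketch of caption generation with a pre-built pipeline and a forwarded keyword argument. The checkpoint and image file name are illustrative assumptions, not the course's exact materials.

```python
from transformers import pipeline

# Build a pre-trained image-to-text pipeline; the checkpoint is an
# illustrative choice, not necessarily the one used in the course.
captioner = pipeline(
    task="image-to-text",
    model="Salesforce/blip-image-captioning-base",
)

# Keyword arguments such as max_new_tokens are forwarded to text generation.
result = captioner("parrots.png", max_new_tokens=20)  # hypothetical local image
print(result[0]["generated_text"])
```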
2. Unimodal Vision, Audio, and Text Models
Master individual modalities with state-of-the-art models. Dive into computer vision for image classification and segmentation, explore speech recognition and text-to-speech synthesis, and pick up effective fine-tuning techniques. Build practical skills with pre-trained models from Hugging Face's transformers library; a short sketch follows the exercise list below.
- Computer vision (50 xp)
- Image classification (100 xp)
- Object detection (100 xp)
- Image background removal (100 xp)
- Fine-tuning computer vision models (50 xp)
- CV fine-tuning: dataset prep (100 xp)
- CV fine-tuning: model classes (100 xp)
- CV fine-tuning: trainer configuration (100 xp)
- Speech recognition and audio generation (50 xp)
- Automatic speech recognition (100 xp)
- Creating speech embeddings (100 xp)
- Audio denoising (100 xp)
- Fine-tuning text-to-speech models (50 xp)
- Fine-tuning a text-to-speech model (100 xp)
- Generating new speech (100 xp)
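To make the unimodal tasks concrete, here is a minimal sketch that runs one vision pipeline and one audio pipeline. The checkpoints and input file names are illustrative assumptions; any compatible Hub models can be swapped in.

```python
from transformers import pipeline

# Two unimodal pipelines: image classification and speech recognition.
classifier = pipeline(
    task="image-classification",
    model="google/vit-base-patch16-224",
)
asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-tiny.en",
)

print(classifier("photo.jpg")[0])         # hypothetical local image file
print(asr("speech_sample.wav")["text"])   # hypothetical local audio file
```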
3. Multi-Modal Models for Classification
Learn to fuse visual, textual, and audio information for richer AI applications. Master techniques like CLIP for zero-shot classification, build sentiment analyzers that see and read, and create emotion detectors that combine facial expressions with voice. Take your AI models beyond single-modality thinking; a zero-shot classification sketch follows the exercise list below.
- Zero-shot image classification (50 xp)
- Zero-shot learning with CLIP (100 xp)
- Automated caption quality assessment (100 xp)
- Multi-modal sentiment analysis (50 xp)
- Prompting Vision Language Models (VLMs) (100 xp)
- Multi-modal sentiment classification with Qwen (100 xp)
- Zero-shot video classification (50 xp)
- Video audio splitting (100 xp)
- Video sentiment analysis with CLIP and CLAP (100 xp)
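As a rough sketch of zero-shot classification with CLIP: the model scores an image against arbitrary candidate labels without any task-specific training. The checkpoint, file name, and labels below are illustrative assumptions.

```python
from transformers import pipeline

# Zero-shot image classification with a CLIP checkpoint.
clip = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Candidate labels are free-form text; CLIP ranks them by image-text similarity.
labels = ["a positive post", "a negative post", "a neutral post"]
scores = clip("social_media_image.jpg", candidate_labels=labels)  # hypothetical file
print(scores[0])  # best-matching label and its score
```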
4. Multi-Modal Generation
Transform ideas into reality! Master cutting-edge AI techniques to generate and manipulate visual content using text prompts. Create stunning images, edit photos intelligently, and build powerful question-answering systems for images and documents. Turn your creative vision into digital reality with multi-modal AI; a VQA sketch follows the exercise list below.
- Visual question-answering (VQA) (50 xp)
- VQA with Vision Language Transformers (ViLTs) (100 xp)
- Document VQA with LayoutLM (100 xp)
- Image editing with diffusion models (50 xp)
- Custom image editing (100 xp)
- Image inpainting (100 xp)
- Video generation (50 xp)
- Build a video! (100 xp)
- Assessing video generation performance (100 xp)
- Congratulations! (50 xp)
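Finally, a minimal sketch of visual question answering with a ViLT checkpoint, in the spirit of the VQA exercises; the model name, image file, and question are illustrative assumptions.

```python
from transformers import pipeline

# Visual question answering: the model answers a free-text question about an image.
vqa = pipeline(
    task="visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",
)

answer = vqa(
    image="photo.jpg",  # hypothetical local image file
    question="How many people are in the picture?",
)
print(answer[0]["answer"], answer[0]["score"])
```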

Prerequisites
Introduction to LLMs in Python
Instructor

Sean, Applied AI Research Scientist, Amsterdam University Medical Centers
Sean is an Applied AI Research Scientist and Assistant Professor at Amsterdam University Medical Centers. He spends his days researching new applications for generative AI models to enhance personalized healthcare and building pipelines with frontends to increase his knowledge of full stack implementations. When not behind a laptop, he can be found spending time with his two young children, in the gym, or on-call as a qualified firefighter. He has a joint master’s in mathematics and physics, and a PhD in particle physics.
Join over 17 million learners and start Multi-Modal Models with Hugging Face today!