
Course Description
Harness the Power of Multi-Modal AI
Dive into the cutting-edge world of multi-modal AI models, where text, images, and speech combine to create powerful applications. Learn how to leverage Hugging Face's vast repository of models that can see, hear, and understand like never before. Whether you're analyzing social media content, building voice assistants, or creating next-generation AI applications, multi-modal models are your gateway to handling diverse data types seamlessly.

Master Essential Multi-Modal Techniques

Explore state-of-the-art models like CLIP for image-text understanding, SpeechT5 for voice synthesis, and the Qwen2 Vision Language model for multi-modal sentiment analysis. Through hands-on exercises, you'll master the techniques used by leading AI companies to build sophisticated multi-modal systems.

Future-Proof Your AI Skills

This course will give you a robust toolkit for handling multi-modal AI tasks. You'll learn to process and combine different data modalities effectively, fine-tune pre-trained models for custom applications, and evaluate and improve model performance across modalities.
1. Accessing Hugging Face Models and Datasets
Free

Navigate the Hugging Face model hub and transform raw text, audio, and visual data into AI-friendly formats. Learn how to find the latest and most popular models for tasks such as text generation, and harness the power of pre-built pipelines. A minimal pipeline sketch follows the exercise list below.
- Hugging Face model navigation (50 xp)
- How many models!? (100 xp)
- Finding the most popular text-to-image model (100 xp)
- Preprocessing different modalities (50 xp)
- Text tokenizing (100 xp)
- Image preprocessing (100 xp)
- Audio preprocessing (100 xp)
- Pipeline tasks and evaluations (50 xp)
- Pipeline caption generation (100 xp)
- Passing keyword arguments (100 xp)
- Model evaluation on a custom dataset (100 xp)
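As a rough preview of the pipeline exercises, here is a minimal sketch of caption generation with a pre-built pipeline and a forwarded keyword argument. The checkpoint and image file name are illustrative assumptions, not the course's exact materials.

```python
from transformers import pipeline

# Build a pre-trained image-to-text pipeline; the checkpoint is an
# illustrative choice, not necessarily the one used in the course.
captioner = pipeline(
    task="image-to-text",
    model="Salesforce/blip-image-captioning-base",
)

# Keyword arguments such as max_new_tokens are forwarded to text generation.
result = captioner("parrots.png", max_new_tokens=20)  # hypothetical local image
print(result[0]["generated_text"])
```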
2. Unimodal Vision, Audio, and Text Models
Master individual modalities with state-of-the-art models. Dive into computer vision for image classification and segmentation, explore speech recognition and text-to-speech synthesis, and pick up effective fine-tuning techniques. Build practical skills with pre-trained models from Hugging Face's transformers library; a short sketch follows the exercise list below.
- Computer vision (50 xp)
- Image classification (100 xp)
- Object detection (100 xp)
- Image background removal (100 xp)
- Fine-tuning computer vision models (50 xp)
- CV fine-tuning: dataset prep (100 xp)
- CV fine-tuning: model classes (100 xp)
- CV fine-tuning: trainer configuration (100 xp)
- Speech recognition and audio generation (50 xp)
- Automatic speech recognition (100 xp)
- Creating speech embeddings (100 xp)
- Audio denoising (100 xp)
- Fine-tuning text-to-speech models (50 xp)
- Fine-tuning a text-to-speech model (100 xp)
- Generating new speech (100 xp)
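To make the unimodal tasks concrete, here is a minimal sketch that runs one vision pipeline and one audio pipeline. The checkpoints and input file names are illustrative assumptions; any compatible Hub models can be swapped in.

```python
from transformers import pipeline

# Two unimodal pipelines: image classification and speech recognition.
classifier = pipeline(
    task="image-classification",
    model="google/vit-base-patch16-224",
)
asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-tiny.en",
)

print(classifier("photo.jpg")[0])         # hypothetical local image file
print(asr("speech_sample.wav")["text"])   # hypothetical local audio file
```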
3. Multi-Modal Models for Classification
Learn to fuse visual, textual, and audio information for richer AI applications. Master techniques like CLIP for zero-shot classification, build sentiment analyzers that see and read, and create emotion detectors that combine facial expressions with voice. Take your AI models beyond single-modality thinking; a zero-shot classification sketch follows the exercise list below.
- Zero-shot image classification (50 xp)
- Zero-shot learning with CLIP (100 xp)
- Automated caption quality assessment (100 xp)
- Multi-modal sentiment analysis (50 xp)
- Prompting Vision Language Models (VLMs) (100 xp)
- Multi-modal sentiment classification with Qwen (100 xp)
- Zero-shot video classification (50 xp)
- Video audio splitting (100 xp)
- Video sentiment analysis with CLIP and CLAP (100 xp)
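As a rough sketch of zero-shot classification with CLIP: the model scores an image against arbitrary candidate labels without any task-specific training. The checkpoint, file name, and labels below are illustrative assumptions.

```python
from transformers import pipeline

# Zero-shot image classification with a CLIP checkpoint.
clip = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Candidate labels are free-form text; CLIP ranks them by image-text similarity.
labels = ["a positive post", "a negative post", "a neutral post"]
scores = clip("social_media_image.jpg", candidate_labels=labels)  # hypothetical file
print(scores[0])  # best-matching label and its score
```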
4. Multi-Modal Generation
Transform ideas into reality! Master cutting-edge AI techniques to generate and manipulate visual content using text prompts. Create stunning images, edit photos intelligently, and build powerful question-answering systems for images and documents. Turn your creative vision into digital reality with multi-modal AI; a VQA sketch follows the exercise list below.
- Visual question-answering (VQA) (50 xp)
- VQA with Vision Language Transformers (ViLTs) (100 xp)
- Document VQA with LayoutLM (100 xp)
- Image editing with diffusion models (50 xp)
- Custom image editing (100 xp)
- Image inpainting (100 xp)
- Video generation (50 xp)
- Build a video! (100 xp)
- Assessing video generation performance (100 xp)
- Congratulations! (50 xp)
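Finally, a minimal sketch of visual question answering with a ViLT checkpoint, in the spirit of the VQA exercises; the model name, image file, and question are illustrative assumptions.

```python
from transformers import pipeline

# Visual question answering: the model answers a free-text question about an image.
vqa = pipeline(
    task="visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",
)

answer = vqa(
    image="photo.jpg",  # hypothetical local image file
    question="How many people are in the picture?",
)
print(answer[0]["answer"], answer[0]["score"])
```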

Prerequisites
Introduction to LLMs in Python
Instructor

Sean, Applied AI Research Scientist, Amsterdam University Medical Centers
Sean is an Applied AI Research Scientist and Assistant Professor at Amsterdam University Medical Centers. He spends his days researching new applications for generative AI models to enhance personalized healthcare and building pipelines with frontends to increase his knowledge of full stack implementations. When not behind a laptop, he can be found spending time with his two young children, in the gym, or on-call as a qualified firefighter. He has a joint master’s in mathematics and physics, and a PhD in particle physics.
Join over 17 million learners and start Multi-Modal Models with Hugging Face today!