Niagara Falls, Canada
July 15-19, 2024
Jointly with ICME 2024
The integration of Computer Vision (CV) and Natural Language Processing (NLP) is significantly transforming the field of AI. The combination of LLMs and CV is paving the way for AI systems that understand and generate multi-modal content. Text-guided multi-modal generation is one of the areas that has advanced significantly thanks to the evolution of both LLMs and CV, and the core challenge of text-guided generation is visual-language (VL) alignment.
This is the 1st TMMG workshop, held in conjunction with ICME 2024 in Niagara Falls, Canada. We aim to bring together researchers from the fields of image, video, and audio generation as well as NLP to facilitate discussion and progress at the intersection of LLMs/LMMs and multi-modal content generation.
We aim to invite a diverse set of experts to discuss their recent research results and future directions for text-guided multi-modal generation, with a particular focus on improving visual-language alignment in text-guided generation.
We warmly welcome contributions concerning text-guided generation, LLMs/LMMs for multi-modal generation, and visual-language alignment analysis. The topics of interest include (but are not limited to):
Paper Submission
Authors should prepare their manuscripts according to the ICME Guide for Authors, available under Author Information and Submission Instructions.
Submission address: https://cmt3.research.microsoft.com/ICME2024W
Workshop Track: TMMG
Submissions due: April 5, 2024
Acceptance Notification: April 25, 2024
Camera-ready: May 24, 2024
Workshop date: July 19, 2024
Keynote 1
Speaker: Dr. Zhengyuan Yang
Title: Multi-Modal Agents
Time: 8:10 – 8:40, July 19, 2024
Biography:
Zhengyuan Yang is a Senior Researcher at Microsoft. He received his Ph.D. in Computer Science from the University of Rochester, advised by Prof. Jiebo Luo, and his bachelor's degree from the University of Science and Technology of China. He has received the ACM SIGMM Award for Outstanding Ph.D. Thesis, a Twitch Research Fellowship, and the ICPR 2018 Best Industry Related Paper Award. His research interests lie at the intersection of computer vision and natural language processing, including multi-modal vision-language understanding and generation.
Keynote 2
Speaker: Prof. Siyu Huang
Title: Navigating the Latent Space of Image Synthesis
Time: 9:25 – 9:55, July 19, 2024
Biography:
Siyu Huang is an Assistant Professor at Clemson University. He received his B.E. and Ph.D. degrees in information and communication engineering from Zhejiang University, Hangzhou, China, in 2014 and 2019, respectively. He was a Postdoctoral Fellow in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Before that, he was a Visiting Scholar at the Language Technologies Institute in the School of Computer Science, Carnegie Mellon University in 2018, a Research Scientist at the Big Data Laboratory, Baidu Research from 2019 to 2021, and a Research Fellow in the School of Electrical and Electronic Engineering at Nanyang Technological University in 2021. He has published more than 20 papers in top-tier computer science journals and conferences. His research interests are primarily in computer vision, multimedia analysis, and generative models.
Keynote 3
Speaker: Amber (Yijia) Zheng
Title: Immunizing Text-to-Image Models Against Malicious Adaptation
Time: 10:55 – 11:25, July 19, 2024
Biography:
Amber (Yijia) Zheng is a Ph.D. student in Computer Science at Purdue University, advised by Prof. Raymond A. Yeh. She received her B.Sc. in Data Science from the School of Statistics and Management, Shanghai University of Finance and Economics, where she worked with Prof. Yixuan Qiu. Her research interests lie in developing algorithms and models for reliable and interpretable AI, with a particular focus on language and vision, generative AI, and attributing model behaviors across the ML pipeline.