Personalized Daily Arxiv Papers 01/15/2025

This project is adapted from tatsu-lab/gpt_paper_assistant.

Topics

Paper selection prompt and criteria (jump to the section by clicking the link):

1. Shows new applications of vision language models (VLMs).

2. Works on video segmentation that rely on unsupervised and self-supervised learning methods.

3. Works on transfer learning between modalities such as audio to video, optical flow to video, language to video, and image to video; these papers should consider how to use domain-agnostic data to improve performance on videos.

4. Shows a significant advance in the performance of text-to-image, text-to-video, or image-to-video diffusion models, among others.

5. Marks significant advancements in 3D generation using generative models, including applications in converting images to 3D models and generating 3D models from textual descriptions.

6. Shows recent progress on 3D reconstruction and generation with Gaussian Splatting, NeRF, and mesh generation.

Go beyond


Topic 1

Back to [top]


Topic 2

Back to [top]


Topic 3

Back to [top]


Topic 4

Back to [top]


Topic 5

Back to [top]


Topic 6

Back to [top]


Go beyond

0. Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks [more]
Authors: Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma

1. 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [more]
Authors: Haomiao Xiong, Yunzhi Zhuge, Jiawen Zhu, Lu Zhang, Huchuan Lu

2. Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [more]
Authors: Yunzhi Zhuge, Hongyu Gu, Lu Zhang, Jinqing Qi, Huchuan Lu

3. LayerAnimate: Layer-specific Control for Animation [more]
Authors: Yuxue Yang, Lue Fan, Zuzen Lin, Feng Wang, Zhaoxiang Zhang

4. FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [more]
Authors: Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo

5. Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers [more]
Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

6. Diffusion Adversarial Post-Training for One-Step Video Generation [more]
Authors: Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang

7. AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [more]
Authors: Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, Huchuan Lu

8. Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [more]
Authors: Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin

9. GameFactory: Creating New Games with Generative Interactive Videos [more]
Authors: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

10. LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding [more]
Authors: Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu

11. DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models [more]
Authors: Hyeonwoo Kim, Sangwon Beak, Hanbyul Joo

12. Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise [more]
Authors: Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu

13. Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens [more]
Authors: Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen

14. Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach [more]
Authors: Yuduo Wang, Weikang Yu, Pedram Ghamisi

15. Predicting 4D Hand Trajectory from Monocular Videos [more]
Authors: Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black

16. VENOM: Text-driven Unrestricted Adversarial Example Generation with Diffusion Models [more]
Authors: Hui Kuurila-Zhang, Haoyu Chen, Guoying Zhao

17. MangaNinja: Line Art Colorization with Precise Reference Following [more]
Authors: Zhiheng Liu, Ka Leong Cheng, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qifeng Chen, Ping Luo

18. LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking [more]
Authors: Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, Yong Liu

19. V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation [more]
Authors: Pooja Guhan, Tsung-Wei Huang, Guan-Ming Su, Subhadra Gopalakrishnan, Dinesh Manocha

20. BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos [more]
Authors: Farnoosh Koleini, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, Abbey Fenwick

21. Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2 [more]
Authors: Seamie Hayes, Ganesh Sistu, Ciarán Eising

22. C2PD: Continuity-Constrained Pixelwise Deformation for Guided Depth Super-Resolution [more]
Authors: Jiahui Kang, Qing Cai, Runqing Tan, Yimei Liu, Zhi Liu

23. PSReg: Prior-guided Sparse Mixture of Experts for Point Cloud Registration [more]
Authors: Xiaoshui Huang, Zhou Huang, Yifan Zuo, Yongshun Gong, Chengdong Zhang, Deyang Liu, Yuming Fang

24. Visual Language Models as Operator Agents in the Space Domain [more]
Authors: Alejandro Carrasco, Marco Nedungadi, Enrico M. Zucchelli, Amit Jain, Victor Rodriguez-Fernandez, Richard Linares

25. Continual Deep Active Learning for Medical Imaging: Replay-Base Architecture for Context Adaptation [more]
Authors: Rui Daniel, M. Rita Verdelho, Catarina Barata, Carlos Santiago

26. RoHan: Robust Hand Detection in Operation Room [more]
Authors: Roi Papo, Sapir Gershov, Tom Friedman, Itay Or, Gil Bolotin, Shlomi Laufer

27. Self-supervised Deep Hyperspectral Inpainting with the Plug and Play and Deep Image Prior Models [more]
Authors: Shuo Li, Mehrdad Yaghoobi

28. AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation [more]
Authors: Feng Zhang, Jinwei Liu, Xiatian Zhu, Lei Chen

29. Make-A-Character 2: Animatable 3D Character Generation From a Single Image [more]
Authors: Lin Liu, Yutong Wang, Jiahao Chen, Jianfang Li, Tangli Xue, Longlong Li, Jianqiang Ren, Liefeng Bo

30. EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [more]
Authors: Diego Velazquez, Pau Rodriguez López, Sergio Alonso, Josep M. Gonfaus, Jordi Gonzalez, Gerardo Richarte, Javier Marin, Yoshua Bengio, Alexandre Lacoste

31. D$^2$-DPM: Dual Denoising for Quantized Diffusion Probabilistic Models [more]
Authors: Qian Zeng, Jie Song, Han Zheng, Hao Jiang, Mingli Song

32. Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models [more]
Authors: Marcel Rogge, Didier Stricker

33. SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts [more]
Authors: Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart

34. EmoNeXt: an Adapted ConvNeXt for Facial Emotion Recognition [more]
Authors: Yassine El Boudouri, Amine Bohi

35. Towards an End-to-End (E2E) Adversarial Learning and Application in the Physical World [more]
Authors: Dudi Biton, Jacob Shams, Koda Satoru, Asaf Shabtai, Yuval Elovici, Ben Nassi

36. Logarithmic Memory Networks (LMNs): Efficient Long-Range Sequence Modeling for Resource-Constrained Environments [more]
Authors: Mohamed A. Taha

37. In-situ graph reasoning and knowledge expansion using Graph-PReFLexOR [more]
Authors: Markus J. Buehler

38. Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving [more]
Authors: Nert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

39. GDiffRetro: Retrosynthesis Prediction with Dual Graph Enhanced Molecular Representation and Diffusion Generation [more]
Authors: Shengyin Sun, Wenhao Yu, Yuxiang Ren, Weitao Du, Liwei Liu, Xuecang Zhang, Ying Hu, Chen Ma

40. Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features [more]
Authors: Evgenii Evstafev

41. Cloud Removal With PolSAR-Optical Data Fusion Using A Two-Flow Residual Network [more]
Authors: Yuxi Wang, Wenjuan Zhang, Bing Zhang

42. Skeleton and Font Generation Network for Zero-shot Chinese Character Generation [more]
Authors: Mobai Xue, Jun Du, Zhenrong Zhang, Jiefeng Ma, Qikai Chang, Pengfei Hu, Jianshu Zhang, Yu Hu

43. Demographic Variability in Face Image Quality Measures [more]
Authors: Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, Christoph Busch

44. Audio-visual Deepfake Detection With Local Temporal Inconsistencies [more]
Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada

45. Exploring visual language models as a powerful tool in the diagnosis of Ewing Sarcoma [more]
Authors: Alvaro Pastor-Naranjo, Pablo Meseguer, Rocío del Amor, Jose Antonio Lopez-Guerrero, Samuel Navarro, Katia Scotlandi, Antonio Llombart-Bosch, Isidro Machado, Valery Naranjo

46. Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models [more]
Authors: Dhruv Dhamani, Mary Lou Maher

47. Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights [more]
Authors: Jiajia Xie, Sheng Zhang, Beihao Xia, Zhu Xiao, Hongbo Jiang, Siwang Zhou, Zheng Qin, Hongyang Chen

Back to [top]


Full paper list

ArXiv: 2501.08326 [page] [pdf]

Authors: Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma

Abstract: arXiv:2501.08326v1 Announce Type: new Abstract: We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
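
The Token Mark mechanism summarized above lends itself to a short illustration. The sketch below is a hypothetical simplification (not the authors' code): a shared learnable token is added to the visual features inside the region mask, and the same token is appended to the text-side embeddings, so both modalities refer to the region through one embedding. All names, shapes, and the addition-based injection are assumptions.

```python
import torch

def apply_token_mark(visual_feats, region_mask, text_embeds, token_mark):
    """visual_feats: (B, H, W, C); region_mask: (B, H, W) bool;
    text_embeds: (B, T, C); token_mark: (C,) learnable parameter."""
    # Inject the mark into the masked spatial positions of the visual features.
    marked_feats = visual_feats + region_mask.unsqueeze(-1).float() * token_mark
    # Append the same mark to the text sequence so the prompt can refer to it.
    mark_as_text = token_mark.expand(text_embeds.size(0), 1, -1)
    return marked_feats, torch.cat([text_embeds, mark_as_text], dim=1)

B, H, W, C, T = 2, 16, 16, 64, 8
mask = torch.zeros(B, H, W, dtype=torch.bool)
mask[:, 4:8, 4:8] = True  # hypothetical box prompt converted to a mask
feats, text = apply_token_mark(torch.randn(B, H, W, C), mask,
                               torch.randn(B, T, C),
                               torch.nn.Parameter(torch.randn(C)))
```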

Comment: 1, 2

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07819 [page] [pdf]

Authors: Haomiao Xiong, Yunzhi Zhuge, Jiawen Zhu, Lu Zhang, Huchuan Lu

Abstract: arXiv:2501.07819v1 Announce Type: new Abstract: Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information, which results in prolonged training durations and complicates the streamlined framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability in navigating the complexities of the physical world. 3UR-LLM directly receives 3D point clouds as input and projects 3D features fused with text instructions into a manageable set of tokens. Considering the computation burden derived from these hybrid tokens, we design a 3D compressor module to cohesively compress the 3D spatial cues and textual narrative. 3UR-LLM achieves promising performance with respect to the previous SOTAs; for instance, 3UR-LLM exceeds its counterparts by 7.1% CIDEr on ScanQA while utilizing fewer training resources. The code and model weights for 3UR-LLM and the 3DS-160K benchmark are available at 3UR-LLM.

Comment: 5

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07806 [page] [pdf]

Authors: Yunzhi Zhuge, Hongyu Gu, Lu Zhang, Jinqing Qi, Huchuan Lu

Abstract: arXiv:2501.07806v1 Announce Type: new Abstract: In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to robustly and efficiently localize and track the primary object in various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available at https://github.com/hy0523/MTNet.

Comment: 2

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08295 [page] [pdf]

Authors: Yuxue Yang, Lue Fan, Zuzen Lin, Feng Wang, Zhaoxiang Zhang

Abstract: arXiv:2501.08295v1 Announce Type: new Abstract: Animated video separates foreground and background elements into layers, with distinct processes for sketching, refining, coloring, and in-betweening. Existing video generation methods typically treat animation as a monolithic data domain, lacking fine-grained control over individual layers. In this paper, we introduce LayerAnimate, a novel architectural approach that enhances fine-grained control over individual animation layers within a video diffusion model, allowing users to independently manipulate foreground and background elements in distinct layers. To address the challenge of limited layer-specific data, we propose a data curation pipeline that features automated element segmentation, motion-state hierarchical merging, and motion coherence refinement. Through quantitative and qualitative comparisons, and user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an ideal tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-specific animation applications and creative flexibility. Our code is available at https://layeranimate.github.io.

Comment: 2, 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08225 [page] [pdf]

Authors: Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo

Abstract: arXiv:2501.08225v1 Announce Type: new Abstract: Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so they necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so that it inherits powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across a variety of editing signals: it dominantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, e.g., automatically adjusting the reflection of the cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, e.g., transforming the clownfish into a shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.
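
As a rough illustration of the "matching attention" idea described in the abstract, the snippet below (an assumed simplification, not FramePainter's implementation) lets every edited-frame token attend over all source-frame tokens, which yields a full-frame receptive field and attention weights that can be read as soft dense correspondences.

```python
import torch

# Queries come from the edited frame; keys/values come from the source frame.
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
edited = torch.randn(2, 256, 64)  # (B, H*W, C) edited-frame tokens
source = torch.randn(2, 256, 64)  # (B, H*W, C) source-frame tokens
fused, weights = attn(edited, source, source, need_weights=True,
                      average_attn_weights=True)
# `weights` has shape (B, 256, 256): one row of soft correspondences per token.
```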

Comment: 4

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08303 [page] [pdf]

Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

Abstract: arXiv:2501.08303v1 Announce Type: new Abstract: Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at https://github.com/Sta8is/FUTURIST .

Comment: 1, 2

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08316 [page] [pdf]

Authors: Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang

Abstract: arXiv:2501.08316v1 Announce Type: new Abstract: Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
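
The abstract mentions an "approximated R1 regularization objective". For background, the sketch below shows the standard R1 gradient penalty on real samples that such approximations start from; it is generic reference code for the usual formulation, not the paper's variant, and `disc` is a placeholder discriminator.

```python
import torch

def r1_penalty(disc, real, gamma=10.0):
    """Penalize the squared gradient norm of the discriminator on real data."""
    real = real.detach().requires_grad_(True)
    logits = disc(real)
    grads = torch.autograd.grad(outputs=logits.sum(), inputs=real,
                                create_graph=True)[0]
    return 0.5 * gamma * grads.flatten(1).pow(2).sum(dim=1).mean()

disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
loss = r1_penalty(disc, torch.randn(4, 3, 8, 8))
loss.backward()
```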

Comment: 4

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07810 [page] [pdf]

Authors: Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, Huchuan Lu

Abstract: arXiv:2501.07810v1 Announce Type: new Abstract: The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, leveraging the Vision-to-Audio Fusion Block to integrate visual features into audio features across both frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these innovative contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.

Comment: 1, 2

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07888 [page] [pdf]

Authors: Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin

Abstract: arXiv:2501.07888v1 Announce Type: new Abstract: We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.

Comment: 1, 2

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08325 [page] [pdf]

Authors: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

Abstract: arXiv:2501.08325v1 Announce Type: new Abstract: Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality, diverse, action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.

Comment: 1, 4

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08282 [page] [pdf]

Authors: Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu

Abstract: arXiv:2501.08282v1 Announce Type: new Abstract: Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data, and benchmark will be released at https://github.com/appletea233/LLaVA-ST.

Comment: 2, 3

Relevance: 7 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08333 [page] [pdf]

Authors: Hyeonwoo Kim, Sangwon Beak, Hanbyul Joo

Abstract: arXiv:2501.08333v1 Announce Type: new Abstract: Understanding the ability of humans to use objects is crucial for AI to improve daily life. Existing studies for learning such ability focus on human-object patterns (e.g., contact, spatial relation, orientation) in static situations, and learning Human-Object Interaction (HOI) patterns over time (i.e., movement of human and object) is relatively less explored. In this paper, we introduce a novel type of affordance named Dynamic Affordance. For a given input 3D object mesh, we learn dynamic affordance which models the distribution of both (1) human motion and (2) human-guided object pose during interactions. As a core idea, we present a method to learn the 3D dynamic affordance from synthetically generated 2D videos, leveraging a pre-trained video diffusion model. Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 3D to generate 4D HOI samples. Once we generate diverse 4D HOI samples on various target objects, we train our DAViD, where we present a method based on the Low-Rank Adaptation (LoRA) module for pre-trained human motion diffusion model (MDM) and an object pose diffusion model with human pose guidance. Our motion diffusion model is extended for multi-object interactions, demonstrating the advantage of our pipeline with LoRA for combining the concepts of object usage. Through extensive experiments, we demonstrate our DAViD outperforms the baselines in generating human motion with HOIs.

Comment: 2, 3

Relevance: 7 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08331 [page] [pdf]

Authors: Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu

Abstract: arXiv:2501.08331v1 Announce Type: new Abstract: Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: https://vgenai-netflix-eyeline-research.github.io/Go-with-the-Flow/; source code and model checkpoints are available on GitHub: https://github.com/VGenAI-Netflix-Eyeline-Research/Go-with-the-Flow.
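
The core trick above, replacing i.i.d. temporal noise with flow-warped noise, can be sketched very roughly as follows. This is an assumed toy version, far simpler than the paper's real-time warping algorithm: the previous frame's noise is backward-warped along optical flow and blended with fresh noise in a variance-preserving way so per-pixel marginals stay approximately unit Gaussian.

```python
import torch
import torch.nn.functional as F

def warp_noise(prev_noise, flow, alpha=0.9):
    """prev_noise: (B, C, H, W); flow: (B, 2, H, W) in pixels (x, y)."""
    B, C, H, W = prev_noise.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().expand(B, H, W, 2)
    src = grid + flow.permute(0, 2, 3, 1)               # backward-warp sampling
    src = 2 * src / torch.tensor([W - 1, H - 1]) - 1    # normalize to [-1, 1]
    warped = F.grid_sample(prev_noise, src, align_corners=True)
    fresh = torch.randn_like(prev_noise)
    # Variance-preserving blend: alpha^2 + (1 - alpha^2) = 1.
    return alpha * warped + (1 - alpha ** 2) ** 0.5 * fresh

noise_t = warp_noise(torch.randn(1, 4, 32, 32), torch.zeros(1, 2, 32, 32))
```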

Comment: 2

Relevance: 5 Novelty: 8 Back to [topic] [top]

ArXiv: 2501.07730 [page] [pdf]

Authors: Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen

Abstract: arXiv:2501.07730v1 Announce Type: new Abstract: Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.

Comment: 4

Relevance: 5 Novelty: 8 Back to [topic] [top]

ArXiv: 2501.08114 [page] [pdf]

Authors: Yuduo Wang, Weikang Yu, Pedram Ghamisi

Abstract: arXiv:2501.08114v1 Announce Type: new Abstract: Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multistage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To solve these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in the transformer encoder and fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, achieving CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.

Comment: 1

Relevance: 5 Novelty: 8 Back to [topic] [top]

ArXiv: 2501.08329 [page] [pdf]

Authors: Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black

Abstract: arXiv:2501.08329v1 Announce Type: new Abstract: We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation. Project website: https://judyye.github.io/haptic-www

Comment: 2

Relevance: 7 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.07922 [page] [pdf]

Authors: Hui Kuurila-Zhang, Haoyu Chen, Guoying Zhao

Abstract: arXiv:2501.07922v1 Announce Type: new Abstract: Adversarial attacks have proven effective in deceiving machine learning models by subtly altering input images, motivating extensive research in recent years. Traditional methods constrain perturbations within $l_p$-norm bounds, but advancements in Unrestricted Adversarial Examples (UAEs) allow for more complex, generative-model-based manipulations. Diffusion models now lead UAE generation due to superior stability and image quality over GANs. However, existing diffusion-based UAE methods are limited to using reference images and face challenges in generating Natural Adversarial Examples (NAEs) directly from random noise, often producing uncontrolled or distorted outputs. In this work, we introduce VENOM, the first text-driven framework for high-quality unrestricted adversarial examples generation through diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enabling high-fidelity adversarial examples without sacrificing attack success rate (ASR). To stabilize this process, we incorporate an adaptive adversarial guidance strategy with momentum, ensuring that the generated adversarial examples $x^*$ align with the distribution $p(x)$ of natural images. Extensive experiments demonstrate that VENOM achieves superior ASR and image quality compared to prior methods, marking a significant advancement in adversarial example generation and providing insights into model vulnerabilities for improved defense development.

Comment: 4

Relevance: 7 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08332 [page] [pdf]

Authors: Zhiheng Liu, Ka Leong Cheng, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qifeng Chen, Ping Luo

Abstract: arXiv:2501.08332v1 Announce Type: new Abstract: Derived from diffusion models, MangaNinja specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases, such as cross-character colorization and multi-reference harmonization, which are beyond the reach of existing algorithms.

Comment: 4

Relevance: 7 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08168 [page] [pdf]

Authors: Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, Yong Liu

Abstract: arXiv:2501.08168v1 Announce Type: new Abstract: While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes - including appearance, motion patterns, and associated risks - LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human driving learning process. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/.

Comment: 1

Relevance: 5 Novelty: 8 Back to [topic] [top]

ArXiv: 2501.07983 [page] [pdf]

Authors: Pooja Guhan, Tsung-Wei Huang, Guan-Ming Su, Subhadra Gopalakrishnan, Dinesh Manocha

Abstract: arXiv:2501.07983v1 Announce Type: new Abstract: We introduce V-Trans4Style, an innovative algorithm tailored for dynamic video content editing needs. It is designed to adapt videos to different production styles like documentaries, dramas, feature films, or a specific YouTube channel's video-making technique. Our algorithm recommends optimal visual transitions to help achieve this flexibility using a more bottom-up approach. We first employ a transformer-based encoder-decoder network to learn to recommend temporally consistent and visually seamless sequences of visual transitions using only the input videos. We then introduce a style conditioning module that leverages this model to iteratively adjust the visual transitions obtained from the decoder through activation maximization. We demonstrate the efficacy of our method through experiments conducted on our newly introduced AutoTransition++ dataset. It is a 6k-video version of the AutoTransition dataset that additionally categorizes its videos into different production style categories. Our encoder-decoder model outperforms the state-of-the-art transition recommendation method, achieving improvements of 10% to 80% in Recall@K and mean rank values over the baseline. Our style conditioning module results in visual transitions that improve the capture of the desired video production style characteristics by an average of around 12% in comparison to other methods when measured with similarity metrics. We hope that our work serves as a foundation for exploring and understanding video production styles further.

Comment: 1, 3

Relevance: 7 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.07800 [page] [pdf]

Authors: Farnoosh Koleini, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, Abbey Fenwick

Abstract: arXiv:2501.07800v1 Announce Type: new Abstract: Recent advancements in 3D human pose estimation from single-camera images and videos have relied on parametric models, like SMPL. However, these models oversimplify anatomical structures, limiting their accuracy in capturing true joint locations and movements, which reduces their applicability in biomechanics, healthcare, and robotics. Biomechanically accurate pose estimation, on the other hand, typically requires costly marker-based motion capture systems and optimization techniques in specialized labs. To bridge this gap, we propose BioPose, a novel learning-based framework for predicting biomechanically accurate 3D human pose directly from monocular videos. BioPose includes three key components: a Multi-Query Human Mesh Recovery model (MQ-HMR), a Neural Inverse Kinematics (NeurIK) model, and a 2D-informed pose refinement technique. MQ-HMR leverages a multi-query deformable transformer to extract multi-scale fine-grained image features, enabling precise human mesh recovery. NeurIK treats the mesh vertices as virtual markers, applying a spatial-temporal network to regress biomechanically accurate 3D poses under anatomical constraints. To further improve 3D pose estimations, a 2D-informed refinement step optimizes the query tokens during inference by aligning the 3D structure with 2D pose observations. Experiments on benchmark datasets demonstrate that BioPose significantly outperforms state-of-the-art methods. Project website: \url{https://m-usamasaleem.github.io/publication/BioPose/BioPose.html}.

Comment: 5

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08118 [page] [pdf]

Authors: Seamie Hayes, Ganesh Sistu, Ciarán Eising

Abstract: arXiv:2501.08118v1 Announce Type: new Abstract: Birds Eye View perception models require extensive data to perform and generalize effectively. While traditional datasets often provide abundant driving scenes from diverse locations, this is not always the case. It is crucial to maximize the utility of the available training data. With the advent of large foundation models such as DINOv2 and Metric3Dv2, a pertinent question arises: can these models be integrated into existing model architectures to not only reduce the required training data but also surpass the performance of current models? We choose two model architectures in the vehicle segmentation domain to alter: Lift-Splat-Shoot, and Simple-BEV. For Lift-Splat-Shoot, we explore the implementation of frozen DINOv2 for feature extraction and Metric3Dv2 for depth estimation, where we greatly exceed the baseline results by 7.4 IoU while utilizing only half the training data and iterations. Furthermore, we introduce an innovative application of Metric3Dv2's depth information as a PseudoLiDAR point cloud incorporated into the Simple-BEV architecture, replacing traditional LiDAR. This integration results in a +3 IoU improvement compared to the Camera-only model.
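
The "PseudoLiDAR" use of Metric3Dv2 depth mentioned above boils down to the standard pinhole back-projection of a metric depth map into a 3D point cloud, sketched below; the intrinsics are made-up values, and any alignment to the ego/LiDAR frame used by Simple-BEV is omitted.

```python
import numpy as np

def depth_to_pseudolidar(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth in meters -> (H*W, 3) camera-frame points."""
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    X = (us - cx) * depth / fx
    Y = (vs - cy) * depth / fy
    return np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

points = depth_to_pseudolidar(np.full((480, 640), 10.0),
                              fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```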

Comment: 2

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07688 [page] [pdf]

Authors: Jiahui Kang, Qing Cai, Runqing Tan, Yimei Liu, Zhi Liu

Abstract: arXiv:2501.07688v1 Announce Type: new Abstract: Guided depth super-resolution (GDSR) has demonstrated impressive performance across a wide range of domains, with numerous methods being proposed. However, existing methods often treat depth maps as images, where shading values are computed discretely, making them struggle to effectively restore the continuity inherent in the depth map. In this paper, we propose a novel approach that maximizes the utilization of spatial characteristics in depth, coupled with human abstract perception of real-world substance, by transforming the GDSR issue into deformation of a roughcast with ideal plasticity, which can be deformed by force like a continuous object. Specifically, we firstly designed a cross-modal operation, Continuity-constrained Asymmetrical Pixelwise Operation (CAPO), which can mimic the process of deforming an isovolumetrically flexible object through external forces. Utilizing CAPO as the fundamental component, we develop the Pixelwise Cross Gradient Deformation (PCGD), which is capable of emulating operations on ideal plastic objects (without volume constraint). Notably, our approach demonstrates state-of-the-art performance across four widely adopted benchmarks for GDSR, with significant advantages in large-scale tasks and generalizability.

Comment: 2

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07762 [page] [pdf]

Authors: Xiaoshui Huang, Zhou Huang, Yifan Zuo, Yongshun Gong, Chengdong Zhang, Deyang Liu, Yuming Fang

Abstract: arXiv:2501.07762v1 Announce Type: new Abstract: The discriminative feature is crucial for point cloud registration. Recent methods improve feature discriminability by distinguishing between non-overlapping and overlapping region points. However, they still face challenges in distinguishing the ambiguous structures in the overlapping regions. Therefore, the ambiguous features they extract result in a significant number of outlier matches from overlapping regions. To solve this problem, we propose a prior-guided SMoE-based registration method to improve the feature distinctiveness by dispatching the potential correspondences to the same experts. Specifically, we propose a prior-guided SMoE module by fusing prior overlap and potential correspondence embeddings for routing, assigning tokens to the most suitable experts for processing. In addition, we propose a registration framework based on a specific combination of Transformer layers and prior-guided SMoE modules. The proposed method not only pays attention to the importance of locating the overlapping areas of point clouds, but also commits to finding more accurate correspondences in overlapping areas. Our extensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art registration recall (95.7%/79.3%) on the 3DMatch/3DLoMatch benchmark. Moreover, we also test the performance on ModelNet40 and demonstrate excellent performance.
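
To make the prior-guided routing idea above concrete, here is a hedged, minimal sketch of a sparse MoE layer whose router logits are biased by a prior term (e.g., an overlap or correspondence embedding projected to expert scores), so tokens with similar priors tend to be dispatched to the same expert. The shapes, top-1 dispatch, and all names are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn.functional as F

def prior_guided_route(tokens, prior, router, experts):
    """tokens: (N, C); prior: (N, E) prior bias over E experts; top-1 dispatch."""
    logits = router(tokens) + prior          # fuse content scores with prior cues
    gates = F.softmax(logits, dim=-1)
    weight, idx = gates.max(dim=-1)          # chosen expert and its gate weight
    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        sel = idx == e
        if sel.any():
            out[sel] = expert(tokens[sel]) * weight[sel, None]
    return out

C, E = 64, 4
experts = torch.nn.ModuleList([torch.nn.Linear(C, C) for _ in range(E)])
routed = prior_guided_route(torch.randn(128, C), torch.randn(128, E),
                            torch.nn.Linear(C, E), experts)
```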

Comment: 2

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.07802 [page] [pdf]

Authors: Alejandro Carrasco, Marco Nedungadi, Enrico M. Zucchelli, Amit Jain, Victor Rodriguez-Fernandez, Richard Linares

Abstract: arXiv:2501.07802v1 Announce Type: new Abstract: This paper explores the application of Vision-Language Models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in Large Language Models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers. In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites. Our results demonstrate that VLMs can effectively process visual and textual data to generate contextually appropriate actions, competing with traditional methods and non-multimodal LLMs in simulation tasks, and showing promise in real-world applications.

Comment: 1

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08245 [page] [pdf]

Authors: Rui Daniel, M. Rita Verdelho, Catarina Barata, Carlos Santiago

Abstract: arXiv:2501.08245v1 Announce Type: new Abstract: Deep Learning for medical imaging faces challenges in adapting and generalizing to new contexts. Additionally, it often lacks sufficient labeled data for specific tasks requiring significant annotation effort. Continual Learning (CL) tackles adaptability and generalizability by enabling lifelong learning from a data stream while mitigating forgetting of previously learned knowledge. Active Learning (AL) reduces the number of required annotations for effective training. This work explores both approaches (CAL) to develop a novel framework for robust medical image analysis. Based on the automatic recognition of shifts in image characteristics, Replay-Base Architecture for Context Adaptation (RBACA) employs a CL rehearsal method to continually learn from diverse contexts, and an AL component to select the most informative instances for annotation. A novel approach to evaluate CAL methods is established using a defined metric denominated IL-Score, which allows for the simultaneous assessment of transfer learning, forgetting, and final model performance. We show that RBACA works in domain and class-incremental learning scenarios, by assessing its IL-Score on the segmentation and diagnosis of cardiac images. The results show that RBACA outperforms a baseline framework without CAL, and a state-of-the-art CAL method across various memory sizes and annotation budgets. Our code is available at https://github.com/RuiDaniel/RBACA.

Comment: 2

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08115 [page] [pdf]

Authors: Roi Papo, Sapir Gershov, Tom Friedman, Itay Or, Gil Bolotin, Shlomi Laufer

Abstract: arXiv:2501.08115v1 Announce Type: new Abstract: Hand-specific localization has garnered significant interest within the computer vision community. Although there are numerous datasets with hand annotations from various angles and settings, domain transfer techniques frequently struggle in surgical environments. This is mainly due to the limited availability of gloved hand instances and the unique challenges of operating rooms (ORs). Thus, hand-detection models tailored to OR settings require extensive training and expensive annotation processes. To overcome these challenges, we present "RoHan" - a novel approach for robust hand detection in the OR, leveraging advanced semi-supervised domain adaptation techniques to tackle the challenges of varying recording conditions, diverse glove colors, and occlusions common in surgical settings. Our methodology encompasses two main stages: (1) data augmentation strategy that utilizes "Artificial Gloves," a method for augmenting publicly available hand datasets with synthetic images of hands wearing gloves; (2) semi-supervised domain adaptation pipeline that improves detection performance in real-world OR settings through iterative prediction refinement and efficient frame filtering. We evaluate our method using two datasets: simulated enterotomy repair and saphenous vein graft harvesting. "RoHan" substantially reduces the need for extensive labeling and model training, paving the way for the practical implementation of hand detection technologies in medical settings.

Comment: 2

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2501.08195 [page] [pdf]

Authors: Shuo Li, Mehrdad Yaghoobi

Abstract: arXiv:2501.08195v1 Announce Type: new Abstract: Hyperspectral images are typically composed of hundreds of narrow and contiguous spectral bands, each containing information regarding the material composition of the imaged scene. However, these images can be affected by various sources of noise, distortions, or data loss, which can significantly degrade their quality and usefulness. This paper introduces a convergence-guaranteed algorithm, LRS-PnP-DIP(1-Lip), which successfully addresses the instability issue of DHP that has been reported before. The proposed algorithm extends the successful joint low-rank and sparse model to further exploit the underlying data structures beyond the conventional and sometimes restrictive unions of subspace models. A stability analysis guarantees the convergence of the proposed algorithm under mild assumptions, which is crucial for its application in real-world scenarios. Extensive experiments demonstrate that the proposed solution consistently delivers visually and quantitatively superior inpainting results, establishing state-of-the-art performance.

Comment: 6

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08088 [page] [pdf]

Authors: Feng Zhang, Jinwei Liu, Xiatian Zhu, Lei Chen

Abstract: arXiv:2501.08088v1 Announce Type: new Abstract: Pose distillation is widely adopted to reduce model size in human pose estimation. However, existing methods primarily emphasize the transfer of teacher knowledge while often neglecting the performance degradation resulting from the capacity gap between teacher and student. To address this issue, we propose AgentPose, a novel pose distillation method that integrates a feature agent to model the distribution of teacher features and progressively aligns the distribution of student features with that of the teacher features, effectively overcoming the capacity gap and enhancing the ability of knowledge transfer. Our comprehensive experiments conducted on the COCO dataset substantiate the effectiveness of our method in knowledge transfer, particularly in scenarios with a high capacity gap.

Comment: 3

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.07870 [page] [pdf]

Authors: Lin Liu, Yutong Wang, Jiahao Chen, Jianfang Li, Tangli Xue, Longlong Li, Jianqiang Ren, Liefeng Bo

Abstract: arXiv:2501.07870v1 Announce Type: new Abstract: This report introduces Make-A-Character 2, an advanced system for generating high-quality 3D characters from single portrait photographs, ideal for game development and digital human applications. Make-A-Character 2 builds upon its predecessor by incorporating several significant improvements for image-based head generation. We utilize the IC-Light method to correct non-ideal illumination in input photos and apply neural network-based color correction to harmonize skin tones between the photos and game engine renders. We also employ the Hierarchical Representation Network to capture high-frequency facial structures and conduct adaptive skeleton calibration for accurate and expressive facial animations. The entire image-to-3D-character generation process takes less than 2 minutes. Furthermore, we leverage transformer architecture to generate co-speech facial and gesture actions, enabling real-time conversation with the generated character. These technologies have been integrated into our conversational AI avatar products.

Comment: 5

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08111 [page] [pdf]

Authors: Diego Velazquez, Pau Rodriguez López, Sergio Alonso, Josep M. Gonfaus, Jordi Gonzalez, Gerardo Richarte, Javier Marin, Yoshua Bengio, Alexandre Lacoste

Abstract: arXiv:2501.08111v1 Announce Type: new Abstract: This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.

Comment: 2

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08180 [page] [pdf]

Authors: Qian Zeng, Jie Song, Han Zheng, Hao Jiang, Mingli Song

Abstract: arXiv:2501.08180v1 Announce Type: new Abstract: Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
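
To make the mean/variance decomposition concrete, here is an illustrative way to write it for a generic score-based reverse-time SDE; the notation (f, g, s_theta, mu_t, sigma_t) is assumed for exposition and is not the paper's.

```latex
% Generic reverse-time sampling SDE:
\[
  \mathrm{d}x \;=\; \bigl[f(x,t) - g(t)^2\, s_\theta(x,t)\bigr]\,\mathrm{d}t \;+\; g(t)\,\mathrm{d}\bar{w}
\]
% Suppose post-training quantization perturbs the score estimate as
\[
  \hat{s}(x,t) \;=\; s_\theta(x,t) + \mu_t + \epsilon_t,
  \qquad \mathbb{E}[\epsilon_t] = 0,\quad \operatorname{Var}[\epsilon_t] = \sigma_t^2 .
\]
% The bias \mu_t shifts the drift by -g(t)^2 \mu_t (changing the trajectory trend),
% while \epsilon_t adds extra per-step variance on top of the Brownian term's g(t)^2 dt
% (affecting convergence) -- the mean and variance deviations the abstract decomposes.
```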

Comment: 4

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08174 [page] [pdf]

Authors: Marcel Rogge, Didier Stricker

Abstract: arXiv:2501.08174v1 Announce Type: new Abstract: Current Gaussian Splatting approaches are effective for reconstructing entire scenes but lack the option to target specific objects, making them computationally expensive and unsuitable for object-specific applications. We propose a novel approach that leverages object masks to enable targeted reconstruction, resulting in object-centric models. Additionally, we introduce an occlusion-aware pruning strategy to minimize the number of Gaussians without compromising quality. Our method reconstructs compact object models, yielding object-centric Gaussian and mesh representations that are up to 96% smaller and up to 71% faster to train compared to the baseline while retaining competitive quality. These representations are immediately usable for downstream applications such as appearance editing and physics simulation without additional processing.
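
The abstract does not spell out the pruning criterion; the toy sketch below shows one way mask-guided, occlusion-aware pruning could look. The per-Gaussian contribution statistics (which a rasterizer would normally accumulate) are random placeholders here, and all thresholds are assumptions.

```python
# Toy sketch of mask-driven, occlusion-aware pruning for a Gaussian-splat model:
# keep Gaussians whose rendered contribution falls mostly inside the object mask and
# drop those that are effectively occluded (negligible accumulated weight).
import torch

def prune_object_gaussians(contrib_in_mask, contrib_total, occlusion_weight,
                           inside_thresh=0.5, occ_thresh=1e-3):
    """
    contrib_in_mask:  (N,) summed alpha-blending weight each Gaussian receives inside the mask
    contrib_total:    (N,) its total weight over all training views
    occlusion_weight: (N,) max transmittance-weighted contribution (near 0 if always occluded)
    Returns a boolean keep-mask over the N Gaussians.
    """
    inside_ratio = contrib_in_mask / contrib_total.clamp_min(1e-8)
    return (inside_ratio > inside_thresh) & (occlusion_weight > occ_thresh)

N = 100_000
total = torch.rand(N)
keep = prune_object_gaussians(torch.rand(N) * total, total, torch.rand(N))
print(f"kept {keep.float().mean().item():.1%} of Gaussians")
```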

Comment: 6

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.07960 [page] [pdf]

Authors: Robin Sch"on, Julian Lorenz, Daniel Kienzle, Rainer Lienhart

Abstract: arXiv:2501.07960v1 Announce Type: new Abstract: In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the object's position with the help of user guidance. In our case the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.
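
Since the results are reported in NoC@85/NoC@90/NoC@95, here is a small sketch of how the "number of clicks" metric is commonly computed: the number of simulated clicks needed to first reach the IoU threshold, capped at a click budget and averaged over the dataset. The 20-click default budget and the toy IoU values are assumptions.

```python
# Minimal sketch of the NoC@q metric for interactive segmentation.
import numpy as np

def noc_at(iou_per_click: np.ndarray, q: float = 0.85, max_clicks: int = 20) -> float:
    """
    iou_per_click: (num_samples, max_clicks) IoU after the k-th click for each sample.
    Returns the mean number of clicks to first reach IoU >= q (max_clicks if never reached).
    """
    reached = iou_per_click >= q
    # 1-based index of the first click that reaches the threshold, else the click budget.
    first = np.where(reached.any(axis=1), reached.argmax(axis=1) + 1, max_clicks)
    return float(first.mean())

# Toy example: 3 samples whose IoU grows over 5 clicks.
ious = np.array([[0.5, 0.7, 0.86, 0.9, 0.92],
                 [0.6, 0.88, 0.9, 0.91, 0.93],
                 [0.4, 0.5, 0.6, 0.7, 0.8]])
print(noc_at(ious, q=0.85, max_clicks=5))   # (3 + 2 + 5) / 3 = 3.33...
```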

Comment: 2

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08199 [page] [pdf]

Authors: Yassine El Boudouri, Amine Bohi

Abstract: arXiv:2501.08199v1 Announce Type: new Abstract: Facial expressions play a crucial role in human communication, serving as a powerful and impactful means to express a wide range of emotions. With advancements in artificial intelligence and computer vision, deep neural networks have emerged as effective tools for facial emotion recognition. In this paper, we propose EmoNeXt, a novel deep learning framework for facial expression recognition based on an adapted ConvNeXt architecture. We integrate a Spatial Transformer Network (STN) to focus on feature-rich regions of the face and Squeeze-and-Excitation blocks to capture channel-wise dependencies. Moreover, we introduce a self-attention regularization term, encouraging the model to generate compact feature vectors. We demonstrate the superiority of our model over existing state-of-the-art deep learning models on the FER2013 dataset regarding emotion classification accuracy.
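
For readers unfamiliar with the building blocks mentioned above, this is a standard Squeeze-and-Excitation block in PyTorch (a generic sketch, not EmoNeXt's code); the reduction ratio of 16 and the toy input size are assumptions.

```python
# Standard Squeeze-and-Excitation block: global-average-pool ("squeeze"), then a small
# bottleneck MLP with sigmoid gating that rescales each channel ("excite").
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                    # squeeze: per-channel statistics
        return x * w[:, :, None, None]                     # excite: channel-wise rescaling

print(SEBlock(64)(torch.randn(2, 64, 48, 48)).shape)       # torch.Size([2, 64, 48, 48])
```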

Comment: 1

Relevance: 3 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08258 [page] [pdf]

Authors: Dudi Biton, Jacob Shams, Koda Satoru, Asaf Shabtai, Yuval Elovici, Ben Nassi

Abstract: arXiv:2501.08258v1 Announce Type: new Abstract: The traditional learning process of patch-based adversarial attacks, conducted in the digital domain and then applied in the physical domain (e.g., via printed stickers), may suffer from reduced performance due to adversarial patches' limited transferability from the digital domain to the physical domain. Given that previous studies have considered using projectors to apply adversarial attacks, we raise the following question: can adversarial learning (i.e., patch generation) be performed entirely in the physical domain with a projector? In this work, we propose the Physical-domain Adversarial Patch Learning Augmentation (PAPLA) framework, a novel end-to-end (E2E) framework that converts adversarial learning from the digital domain to the physical domain using a projector. We evaluate PAPLA across multiple scenarios, including controlled laboratory settings and realistic outdoor environments, demonstrating its ability to ensure attack success compared to conventional digital learning-physical application (DL-PA) methods. We also analyze the impact of environmental factors, such as projection surface color, projector strength, ambient light, distance, and angle of the target object relative to the camera, on the effectiveness of projected patches. We further demonstrate the feasibility of the attack against a parked car and a stop sign in a real-world outdoor environment. Our results show that under specific conditions, E2E adversarial learning in the physical domain eliminates the transferability issue and ensures evasion by object detectors. Finally, we provide insights into the challenges and opportunities of applying adversarial learning in the physical domain and explain where such an approach is more effective than using a sticker.

Comment: 1

Relevance: 3 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.07905 [page] [pdf]

Authors: Mohamed A. Taha

Abstract: arXiv:2501.07905v1 Announce Type: new Abstract: Long-range sequence modeling is a crucial aspect of natural language processing and time series analysis. However, traditional models like Recurrent Neural Networks (RNNs) and Transformers suffer from computational and memory inefficiencies, especially when dealing with long sequences. This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that leverages a hierarchical logarithmic tree structure to efficiently store and retrieve past information. LMNs dynamically summarize historical context, significantly reducing the memory footprint and computational complexity of attention mechanisms from O(n^2) to O(log n). The model employs a single-vector, targeted attention mechanism to access stored information, and the memory block construction worker (summarizer) layer operates in two modes: a parallel execution mode during training for efficient processing of hierarchical tree structures and a sequential execution mode during inference, which acts as a memory management system. It also implicitly encodes positional information, eliminating the need for explicit positional encodings. These features make LMNs a robust and scalable solution for processing long-range sequences in resource-constrained environments, offering practical improvements in efficiency and scalability. The code is publicly available under the MIT License on GitHub: https://github.com/AhmedBoin/LogarithmicMemory.
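
The abstract does not detail the tree structure, so the following is one hypothetical reading of a logarithmic memory: one summary slot per power-of-two span of past steps, merged like carries in a binary counter, which keeps O(log n) slots for n steps. The mean-based merge is a placeholder for the learned summarizer layer; see the actual implementation at the GitHub link above.

```python
# A plausible (hypothetical) logarithmic memory: level i holds a summary of 2**i past steps;
# whenever two summaries of the same level exist, they are merged into the next level.
import torch

class LogMemory:
    def __init__(self):
        self.levels: dict[int, torch.Tensor] = {}   # level -> summary vector

    def append(self, x: torch.Tensor) -> None:
        level, carry = 0, x
        while level in self.levels:                 # merge equal-sized summaries upward
            carry = 0.5 * (self.levels.pop(level) + carry)
            level += 1
        self.levels[level] = carry

    def read(self) -> torch.Tensor:
        return torch.stack(list(self.levels.values()))   # O(log n) memory slots

mem = LogMemory()
for _ in range(1000):
    mem.append(torch.randn(8))
print(mem.read().shape)   # at most ceil(log2(1000)) slots; here torch.Size([6, 8])
```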

Comment: 1

Relevance: 3 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08120 [page] [pdf]

Authors: Markus J. Buehler

Abstract: arXiv:2501.08120v1 Announce Type: new Abstract: The pursuit of automated scientific discovery has fueled progress from symbolic logic to modern AI, forging new frontiers in reasoning and pattern recognition. Transformers function as potential systems, where every possible relationship remains latent potentiality until tasks impose constraints, akin to measurement. Yet, refining their sampling requires more than probabilistic selection: solutions must conform to specific structures or rules, ensuring consistency and the invocation of general principles. We present Graph-PReFLexOR (Graph-based Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that combines graph reasoning with symbolic abstraction to dynamically expand domain knowledge. Inspired by reinforcement learning, Graph-PReFLexOR defines reasoning as a structured mapping, where tasks yield knowledge graphs, abstract patterns, and ultimately, final answers. Inspired by category theory, it encodes concepts as nodes and their relationships as edges, supporting hierarchical inference and adaptive learning through isomorphic representations. Demonstrations include hypothesis generation, materials design, and creative reasoning, such as discovering relationships between mythological concepts like 'thin places' and materials science. We propose a 'knowledge garden growth' strategy that integrates insights across domains, promoting interdisciplinary connections. Results with a 3-billion-parameter Graph-PReFLexOR model show superior reasoning depth and adaptability, underscoring the potential for transparent, multidisciplinary AI-driven discovery. It lays the groundwork for general autonomous reasoning solutions.

Comment: 1

Relevance: 3 Novelty: 6 Back to [topic] [top]

ArXiv: 2501.08083 [page] [pdf]

Authors: Nert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

Abstract: arXiv:2501.08083v1 Announce Type: new Abstract: Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Absolute robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies detection of all kinds of shifts: Find a full model of the training data's feature distribution, to then use its density at new points as in-distribution (ID) score. To implement this, we propose to combine the newly available Vision Foundation Models (VFM) as feature extractors with one of four alternative density modeling techniques. In an extensive benchmark of 4 VFMs against 20 baselines, we show the superior performance of VFM feature encodings compared to shift-specific OOD monitors. Additionally, we find that sophisticated architectures outperform larger latent space dimensionality; and our method identifies samples with higher risk of errors on downstream tasks, despite being model-agnostic. This suggests that VFMs are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
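
A minimal sketch of the density-based OOD monitor described above: fit a density model on frozen foundation-model features of the training data and use the log-density of a new sample's feature as its in-distribution (ID) score. A Gaussian mixture stands in for "one of four alternative density modeling techniques", the VFM feature extractor is replaced by random placeholders, and the 5% quantile threshold is an assumption.

```python
# Unsupervised, model-agnostic OOD scoring: density of frozen VFM features as ID score.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 64))          # stand-in for frozen VFM features of training images
test_feats = rng.normal(loc=3.0, size=(10, 64))    # shifted features, simulating OOD inputs

density = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
density.fit(train_feats)

id_scores = density.score_samples(test_feats)      # higher log-density = more in-distribution
threshold = np.quantile(density.score_samples(train_feats), 0.05)
print((id_scores < threshold).mean())              # fraction flagged as OOD (1.0 for this toy shift)
```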

Comment: 1

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08001 [page] [pdf]

Authors: Shengyin Sun, Wenhao Yu, Yuxiang Ren, Weitao Du, Liwei Liu, Xuecang Zhang, Ying Hu, Chen Ma

Abstract: arXiv:2501.08001v1 Announce Type: new Abstract: Retrosynthesis prediction focuses on identifying reactants capable of synthesizing a target product. Typically, the retrosynthesis prediction involves two phases: Reaction Center Identification and Reactant Generation. However, we argue that most existing methods suffer from two limitations in the two phases: (i) Existing models do not adequately capture the "face" information in molecular graphs for the reaction center identification. (ii) Current approaches for the reactant generation predominantly use sequence generation in a 2D space, which lacks versatility in generating reasonable distributions for completed reactive groups and overlooks molecules' inherent 3D properties. To overcome the above limitations, we propose GDiffRetro. For the reaction center identification, GDiffRetro uniquely integrates the original graph with its corresponding dual graph to represent molecular structures, which helps guide the model to focus more on the faces in the graph. For the reactant generation, GDiffRetro employs a conditional diffusion model in 3D to further transform the obtained synthon into a complete reactant. Our experimental findings reveal that GDiffRetro outperforms state-of-the-art semi-template models across various evaluative metrics.

Comment: 5

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08170 [page] [pdf]

Authors: Evgenii Evstafev

Abstract: arXiv:2501.08170v1 Announce Type: new Abstract: This article introduces a benchmark designed to evaluate the capabilities of multimodal models in analyzing and interpreting images. The benchmark focuses on seven key visual aspects: main object, additional objects, background, detail, dominant colors, style, and viewpoint. A dataset of 14,580 images, generated from diverse text prompts, was used to assess the performance of seven leading multimodal models. These models were evaluated on their ability to accurately identify and describe each visual aspect, providing insights into their strengths and weaknesses for comprehensive image understanding. The findings of this benchmark have significant implications for the development and selection of multimodal models for various image analysis tasks.

Comment: 1

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.07901 [page] [pdf]

Authors: Yuxi Wang, Wenjuan Zhang, Bing Zhang

Abstract: arXiv:2501.07901v1 Announce Type: new Abstract: Optical remote sensing images play a crucial role in the observation of the Earth's surface. However, obtaining complete optical remote sensing images is challenging due to cloud cover. Reconstructing cloud-free optical images has become a major task in recent years. This paper presents a two-flow Polarimetric Synthetic Aperture Radar (PolSAR)-Optical data fusion cloud removal algorithm (PODF-CR), which achieves the reconstruction of missing optical images. PODF-CR consists of an encoding module and a decoding module. The encoding module includes two parallel branches that extract PolSAR image features and optical image features. To address speckle noise in PolSAR images, we introduce dynamic filters in the PolSAR branch for image denoising. To better facilitate the fusion between multimodal optical images and PolSAR images, we propose fusion blocks based on cross-skip connections to enable interaction of multimodal data information. The obtained fusion features are refined through an attention mechanism to provide better conditions for the subsequent decoding of the fused images. In the decoding module, multi-scale convolution is introduced to obtain multi-scale information. Additionally, to better utilize comprehensive scattering information and polarization characteristics to assist in the restoration of optical images, we use a dataset for cloud restoration called OPT-BCFSAR-PFSAR, which includes backscatter coefficient feature images and polarization feature images obtained from PolSAR data and optical images. Experimental results demonstrate that this method outperforms existing methods in both qualitative and quantitative evaluations.

Comment: 2

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08062 [page] [pdf]

Authors: Mobai Xue, Jun Du, Zhenrong Zhang, Jiefeng Ma, Qikai Chang, Pengfei Hu, Jianshu Zhang, Yu Hu

Abstract: arXiv:2501.08062v1 Announce Type: new Abstract: Automatic font generation remains a challenging research issue, primarily due to the vast number of Chinese characters, each with unique and intricate structures. Our investigation of previous studies reveals inherent bias capable of causing structural changes in characters. Specifically, when generating a Chinese character similar to, but different from, those in the training samples, the bias is prone to either correcting or ignoring these subtle variations. To address this concern, we propose a novel Skeleton and Font Generation Network (SFGN) to achieve a more robust Chinese character font generation. Our approach includes a skeleton builder and font generator. The skeleton builder synthesizes content features using low-resource text input, enabling our technique to realize font generation independently of content image inputs. Unlike previous font generation methods that treat font style as a global embedding, we introduce a font generator to align content and style features on the radical level, which is a brand-new perspective for font generation. Beyond common characters, we also conduct experiments on misspelled characters, a substantial portion of which slightly differ from the common ones. Our approach visually demonstrates the efficacy of generated images and outperforms current state-of-the-art font generation methods. Moreover, we believe that misspelled character generation has significant pedagogical implications and verify this supposition through experiments. We used generated misspelled characters as data augmentation in Chinese character error correction tasks, simulating the scenario where students learn handwritten Chinese characters with the help of misspelled characters. The significantly improved performance of error correction tasks demonstrates the effectiveness of our proposed approach and the value of misspelled character generation.

Comment: 1

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.07898 [page] [pdf]

Authors: Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, Christoph Busch

Abstract: arXiv:2501.07898v1 Announce Type: new Abstract: Face image quality assessment (FIQA) algorithms are being integrated into online identity management applications. These applications allow users to upload a face image as part of their document issuance process, where the image is then run through a quality assessment process to make sure it meets the quality and compliance requirements. Concerns about demographic bias have been raised about biometric systems, given the societal implications this may cause. It is therefore important that demographic variability in FIQA algorithms is assessed such that mitigation measures can be created. In this work, we study the demographic variability of all face image quality measures included in the ISO/IEC 29794-5 international standard across three demographic variables: age, gender, and skin tone. The results are rather promising and show no clear bias toward any specific demographic group for most measures. Only two quality measures are found to have considerable variations in their outcomes for different groups on the skin tone variable.

Comment: 1

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08137 [page] [pdf]

Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada

Abstract: arXiv:2501.08137v1 Announce Type: new Abstract: This paper proposes an audio-visual deepfake detection approach that aims to capture fine-grained temporal inconsistencies between audio and visual modalities. To achieve this, both architectural and data synthesis strategies are introduced. From an architectural perspective, a temporal distance map, coupled with an attention mechanism, is designed to capture these inconsistencies while minimizing the impact of irrelevant temporal subsequences. Moreover, we explore novel pseudo-fake generation techniques to synthesize local inconsistencies. Our approach is evaluated against state-of-the-art methods using the DFDC and FakeAVCeleb datasets, demonstrating its effectiveness in detecting audio-visual deepfakes.
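
A hypothetical sketch of a temporal distance map between per-frame audio and visual embeddings: a (T_v, T_a) matrix of pairwise distances that a downstream attention module could weigh to localize fine-grained inconsistencies. The function name, the cosine-distance choice, and the toy shapes are assumptions, not the paper's implementation.

```python
# Pairwise cosine-distance map between visual and audio frame embeddings.
import torch
import torch.nn.functional as F

def temporal_distance_map(vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
    """vis: (T_v, D), aud: (T_a, D) embeddings -> (T_v, T_a) cosine distances."""
    vis = F.normalize(vis, dim=-1)
    aud = F.normalize(aud, dim=-1)
    return 1.0 - vis @ aud.T

vis, aud = torch.randn(32, 256), torch.randn(32, 256)
dmap = temporal_distance_map(vis, aud)
# A large distance on the diagonal (same time step) is a cue for a local audio-visual mismatch.
print(dmap.shape, dmap.diagonal().mean().item())
```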

Comment: 1

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2501.08042 [page] [pdf]

Authors: Alvaro Pastor-Naranjo, Pablo Meseguer, Rocío del Amor, Jose Antonio Lopez-Guerrero, Samuel Navarro, Katia Scotlandi, Antonio Llombart-Bosch, Isidro Machado, Valery Naranjo

Abstract: arXiv:2501.08042v1 Announce Type: new Abstract: Ewing's sarcoma (ES), characterized by a high density of small round blue cells without structural organization, presents a significant health concern, particularly among adolescents aged 10 to 19. Artificial intelligence-based systems for automated analysis of histopathological images are promising to contribute to an accurate diagnosis of ES. In this context, this study explores the feature extraction ability of different pre-training strategies for distinguishing ES from other soft tissue or bone sarcomas with similar morphology in digitized tissue microarrays for the first time, as far as we know. Vision-language supervision (VLS) is compared to fully-supervised ImageNet pre-training within a multiple instance learning paradigm. Our findings indicate a substantial improvement in diagnostic accuracy with the adaptation of VLS using an in-domain dataset. Notably, these models not only enhance the accuracy of predicted classes but also drastically reduce the number of trainable parameters and computational costs.

Comment: 1

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2501.07815 [page] [pdf]

Authors: Dhruv Dhamani, Mary Lou Maher

Abstract: arXiv:2501.07815v1 Announce Type: new Abstract: Recent advances in prompting techniques and multi-agent systems for Large Language Models (LLMs) have produced increasingly complex approaches. However, we lack a framework for characterizing and comparing prompting techniques or understanding their relationship to multi-agent LLM systems. This position paper introduces and explains the concepts of linear contexts (a single, continuous sequence of interactions) and non-linear contexts (branching or multi-path) in LLM systems. These concepts enable the development of an agent-centric projection of prompting techniques, a framework that can reveal deep connections between prompting strategies and multi-agent systems. We propose three conjectures based on this framework: (1) results from non-linear prompting techniques can predict outcomes in equivalent multi-agent systems, (2) multi-agent system architectures can be replicated through single-LLM prompting techniques that simulate equivalent interaction patterns, and (3) these equivalences suggest novel approaches for generating synthetic training data. We argue that this perspective enables systematic cross-pollination of research findings between prompting and multi-agent domains, while providing new directions for improving both the design and training of future LLM systems.

Comment: 4

Relevance: 3 Novelty: 3 Back to [topic] [top]

ArXiv: 2501.07711 [page] [pdf]

Authors: Jiajia Xie, Sheng Zhang, Beihao Xia, Zhu Xiao, Hongbo Jiang, Siwang Zhou, Zheng Qin, Hongyang Chen

Abstract: arXiv:2501.07711v1 Announce Type: new Abstract: Pedestrian trajectory prediction is a critical technology in the evolution of self-driving cars toward complete artificial intelligence. In recent years, interest in modeling pedestrians' social interactions from their trajectories has surged in pursuit of more accurate trajectory predictions. However, existing methods for modeling pedestrian social interactions rely on pre-defined rules, struggling to capture non-explicit social interactions. In this work, we propose a novel framework named DTGAN, which extends the application of Generative Adversarial Networks (GANs) to graph sequence data, with the primary objective of automatically capturing implicit social interactions and achieving precise predictions of pedestrian trajectory. DTGAN innovatively incorporates random weights within each graph to eliminate the need for pre-defined interaction rules. We further enhance the performance of DTGAN by exploring diverse task loss functions during adversarial training, which yields improvements of 16.7% and 39.3% on metrics ADE and FDE, respectively. The effectiveness and accuracy of our framework are verified on two public datasets. The experimental results show that our proposed DTGAN achieves superior performance and is well able to understand pedestrians' intentions.

Comment: 2

Relevance: 3 Novelty: 3 Back to [topic] [top]