GenAI Cracow #28 - Local LLMs, Multimodality
- Opening Ceremony
- Native Multimodality: Beyond Language-Centric Multimodal Models by Jakub
- TBD - CFP is active
- Q&A
- Networking with pizza and beer
Abstract
Native Multimodality: Beyond Language-Centric Multimodal Models
Traditional multimodal pipelines compromise performance by forcing audio and visual data into a compressed text-token space, permanently losing spatial structure, temporal flow, and fine-grained detail. To resolve this bottleneck, cutting-edge systems—including Kimi K2.5, SenseTime’s NEO/NEO-unify architecture, and the Gemini 1.5+ series—have converged on native multimodality, retaining raw structural context or eliminating the encoder-projector pipeline entirely. While native architectures drastically reduce data requirements and make cross-modal reasoning less brittle, they also introduce complex training dynamics and shift failure modes rather than eliminating them. Drawing on six months of empirical research, this talk evaluates where native multimodality fundamentally alters performance, outlines its persistent failure modes, and analyzes the emerging scaling behaviors defining the next generation of AI.
Speakers
Jakub Strawa
I am an AI Researcher and Research Engineer specializing in LLM training, post-training, and multimodal models, bridging the gap between cutting-edge research and scalable engineering. Currently at Stonly, I focus on developing, training, and rigorously evaluating AI agents. My background includes building enterprise-grade applications for Fortune 500 companies, conducting R&D at Roche and Raiffeisen Bank, and working on multimodal reasoning at TCL, where I collaborated directly with top-tier researchers in China and the Qwen team.