
Chinese artificial intelligence pioneer SenseTime is betting that its roots in computer vision will help it lead the next phase of AI, as the industry shifts towards multimodal systems and embodied intelligence in the physical world, according to co-founder and chief scientist Lin Dahua.
In an interview with the Post on Wednesday, Lin said the company’s long-standing expertise in vision-based AI puts it in a strong position to become a leader in embodied intelligence, robotics and AI agents operating in real-world environments, at a time of growing debate about the limits of large language models (LLMs).
“Our strategic approach is somewhat similar to Google’s in the United States, which primarily focuses on multimodal AI including the latest Nano Banana Pro. They also start with vision capabilities as the core, then add language abilities to create real multimodal systems,” said Lin, who is also an associate professor of information engineering at the Chinese University of Hong Kong.
The Hong Kong-listed company, long regarded as one of the world’s leading facial recognition providers, is trying to carve out a new role in the generative AI era that followed the launch of ChatGPT three years ago.
Extending his comparison with Google – which has deep capabilities across the AI stack, including its own TPU chips for training models – Lin said SenseTime’s decision as early as 2018 to build out large-scale data centres had laid a solid foundation for its ambitions.
As of August, the company’s total computing power stood at about 25,000 petaflops, up 8.7 per cent since the start of the year, after surging 92 per cent over the course of 2024.