AI

‘Tiny’ AI, big world: New models show smaller can be smarter

  • IBM Research has developed a compact time-series forecasting model with fewer than 1 million parameters
  • This small model enables fast predictions and requires less computational power
  • Smaller, more efficient models are trending in the AI arena

When it comes to artificial intelligence (AI) models, maybe bigger isn’t always better.

IBM Research is the latest to create a "tiny" AI in the face of growing demand for resource-efficient models. The new model is specifically made for time-series forecasting, a technique that can predict trends across domains including telecom and data centers. For example, such models are capable of forecasting network traffic or GPU loads based on past data.
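To make that concrete, the sketch below frames a forecasting problem the way such models typically see it: a long history of measurements is sliced into past-window/future-window pairs. The traffic data, window sizes and variable names here are invented for illustration and do not represent IBM's actual pipeline.

```python
# A minimal, hypothetical sketch of the forecasting setup described above:
# slice a history of hourly network-traffic measurements into (context, horizon)
# pairs that a forecasting model could learn from.
import numpy as np

rng = np.random.default_rng(0)
traffic = rng.normal(loc=100.0, scale=15.0, size=24 * 30)  # 30 days of synthetic hourly readings

CONTEXT, HORIZON = 48, 12  # look back 48 hours, predict the next 12

def make_windows(series, context, horizon):
    """Turn one long series into supervised (past, future) training pairs."""
    X, y = [], []
    for start in range(len(series) - context - horizon + 1):
        X.append(series[start:start + context])
        y.append(series[start + context:start + context + horizon])
    return np.stack(X), np.stack(y)

X, y = make_windows(traffic, CONTEXT, HORIZON)
print(X.shape, y.shape)  # (661, 48) past windows, (661, 12) future targets
```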

While most other models require several hundred million or even billions of parameters, IBM’s TinyTimeMixer is trained with fewer than 1 million parameters, Jayant Kalagnanam, director of AI Applications at IBM Research, told Fierce Network.

There's a growing trend in AI toward reducing the size of models without sacrificing accuracy. In natural language processing (NLP) and computer vision, tiny models have gained traction because they deliver strong performance while minimizing computational requirements, often in settings where resources are limited, such as mobile devices, embedded systems or edge computing environments.

Keeping data within these devices can “minimize latency and maximize privacy,” Luis Vargas, VP of AI at Microsoft, said in a company blog post.

“Some customers may only need small models, some will need big models and many are going to want to combine both in a variety of ways,” Vargas added.

Going tiny for time-series

Alpaca, H2O-Danube-1.8B, Koala, TinyLlama and Vicuna are all scaled-down takes on large language models (LLMs) that require comparatively modest computational resources for advanced generative AI. Google and Microsoft have also introduced their own smaller versions of earlier AI models.

But time-series data differs from language data because it lacks inherent meaning and often comes from diverse, multi-channel sources such as telecom and manufacturing. Unlike language models, which benefit from abundant public data, time-series models struggle with limited data and noisy measurements.

To overcome these challenges, IBM adapted transformer architectures for time-series forecasting, introducing techniques like "patching" for context and "mixing" for improved correlation analysis.
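A rough sense of those two ideas is sketched below. This is not TinyTimeMixer's actual code; it is a simplified, hypothetical mixer block showing how a series can be cut into patches so each token carries local context, and how small linear layers can mix information across patches (time) and across channels in place of full attention.

```python
# A simplified illustration of "patching" and "mixing" (hypothetical, not IBM's code).
import torch
import torch.nn as nn

class TinyMixerBlock(nn.Module):
    def __init__(self, n_channels=4, context=96, patch_len=16, d_model=32):
        super().__init__()
        self.patch_len = patch_len
        n_patches = context // patch_len
        self.embed = nn.Linear(patch_len, d_model)               # patch -> embedding
        self.patch_mixer = nn.Linear(n_patches, n_patches)       # mix along time (patches)
        self.channel_mixer = nn.Linear(n_channels, n_channels)   # mix across channels

    def forward(self, x):                      # x: (batch, channels, context)
        b, c, t = x.shape
        patches = x.reshape(b, c, t // self.patch_len, self.patch_len)
        h = self.embed(patches)                # (b, c, n_patches, d_model)
        h = h + self.patch_mixer(h.transpose(2, 3)).transpose(2, 3)      # patch mixing
        h = h + self.channel_mixer(h.transpose(1, 3)).transpose(1, 3)    # channel mixing
        return h

block = TinyMixerBlock()
out = block(torch.randn(8, 4, 96))             # 8 series, 4 channels, 96 time steps
print(out.shape)                               # torch.Size([8, 4, 6, 32])
```

Because the mixing layers are plain linear maps over short patch and channel dimensions, a block like this keeps the parameter count far below that of a full attention-based transformer.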

The model's small size allows for very fast predictions and lower computational demands, enabling it to run even on standard devices like a Mac laptop. That reduced demand also translates to lower costs: while a high-end GPU could cost $10,000 to $12,000 a month, Kalagnanam said, the tiny model can run on less expensive hardware, and fine-tuning it is cheaper.

Honey, I shrunk the AI

There have already been about 750,000 downloads of IBM’s tiny time-series model, according to Kalagnanam, who said the advantages of having the smaller-sized AI models are "getting a lot of attention” across different use cases.

This doesn't necessarily signal a wholesale shift from large to small models, but rather a shift from a single category of models to an entire portfolio of models where "customers get the ability to make a decision on what is the best model for their scenario,” said Sonali Yadav, principal product manager for Generative AI at Microsoft.

In the case of time-series forecasting, a small company in the financial sector could be running 50 to 100,000 models for different equities, Kalagnanam noted. When running at that scale, “having smaller models makes a huge difference.”

Typically, AI developers have been focused on building “very large models for deployment purposes,” he added. Now, these workhorse models with billions of parameters can be made smaller through an exercise called knowledge distillation, a process in machine learning where a smaller, more efficient model (the "student") is trained to replicate the behavior of a larger, more complex model (the "teacher").

Knowledge distillation tends to be “somewhat compute intensive,” Kalagnanam said. “But you do it once, and then you can use it forever. Definitely, there is an effort toward smaller models with minimum loss of accuracy.”
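The sketch below shows the basic mechanics of that student-teacher setup on dummy data. The networks, batch and hyperparameters are placeholders far smaller than any production model: the frozen teacher provides softened output distributions, and the student is penalized for diverging from them.

```python
# A minimal, hypothetical sketch of knowledge distillation: a small "student"
# network is trained to match the softened outputs of a larger, frozen "teacher".
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature: softens the teacher's distribution

x = torch.randn(64, 32)                      # a dummy batch of inputs
with torch.no_grad():
    teacher_logits = teacher(x)              # teacher is only used to produce targets

student_logits = student(x)
loss = F.kl_div(                             # penalize divergence from teacher's soft labels
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```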