Should AI initiatives change network planning?

Everyone is into AI these days, even network planners. But for network professionals, the primary focus has been on the use of AI in network operations. What about the impact of AI on network traffic?

When I asked almost 100 network planners about that, only eight told me they'd thought about the impact AI might have on network traffic and network plans. Are they missing something? Maybe, because there are two questions on the table here. One is whether AI could have an impact on enterprise networks and traffic, and the other is whether it could influence the choice of network technology.

AI's impact on traffic and infrastructure depends largely on enterprise plans to self-host their AI. The great majority of AI models are run on specialized chips like GPUs, which means specialized servers in the data center. As of today, I've gotten comments on "self-hosted" AI from 91 enterprises. I put the term in quotes, because the truth is that only 16 of them actually had been planning in 2023 for specific AI hosting, and only eight said they'd done any self-hosting this year. Not surprisingly, those eight planners are the same eight who've thought about network impact. For 2024, the number jumps to 77, and I think that growth is stimulating interest among both AI and network equipment vendors. Both Cisco and Juniper have been singing about their AI network credentials, for example.

The question, of course, is just what kind of AI is getting self-hosted and networked, and we shouldn't assume we can answer that based on what we're reading about AI.

Generative AI from players like OpenAI, Google, and Microsoft is getting a lot of ink, but there is a fundamental problem with the classic generative open-Internet approach as far as businesses are concerned. They're worried about the hallucinations all too common in public-trained chatbots. They're worried about copyright issues biting content that's AI-created. They're worried about the security of their own data if AI is trained in a specialized way. Some are worried about the energy and environmental impact of all those GPUs churning out human-like results. A lot of recent AI initiatives, including Google's Gemini, were advanced in part to push for a new form of generative AI, one that applies the basic large-language-model technology that created popular generative AI services to enterprise data, within enterprise data centers, or as a part of enterprise cloud services.

If enterprises are looking for a kind of lightweight large-language-model approach to AI, that would mean that the number of specialized AI servers in their data center would be limited. Think in terms of a single AI cluster of GPU servers, and you have what enterprises are seeing. The dominant strategy for AI networking inside that cluster is InfiniBand, a superfast, low-latency technology that's strongly supported by NVIDIA but not particularly popular (or even known) at the enterprise level. NVIDIA's DGX InfiniBand approach is what connects that mass of GPUs in most big AI data centers, which is why there's almost a presumption that InfiniBand will be the technology used for self-hosted AI.

That's probably unnecessary, and possibly downright wrong. Enterprises don't need to crawl the Internet for training data for their model. Enterprises don't need to support mass-market use of their AI, and if they did for applications like chatbots in customer support, they'd likely use cloud hosting, not in-house deployment. That means that AI to the enterprise is really a form of enhanced analytics. Widespread use of analytics has influenced data center network planning for access to the databases, and AI would likely increase database access if it's widely used. But even given all of that, there's no reason to think that Ethernet, the dominant data center network technology, wouldn't be fine for AI. So forget the notion of an InfiniBand technology shift. But that doesn't mean that AI won't need to be planned for in the network.

Think of an AI cluster as an enormous virtual user community. It has to collect data from the enterprise repository, all of it, to train and get the latest information to answer user questions. That means it needs a high-performance data path to this data, and that path can't be allowed to congest other traditional workflows within the network. The issue is acute for enterprises with multiple data centers, multiple complexes of users, because it's likely that they won't want to host AI in every location. If the AI cluster is separated from some applications, databases, and users, then data center interconnect (DCI) paths might have to be augmented to carry the traffic without congestion risk.

According to those eight AI-hosting enterprises, the primary rule for AI traffic is that you want the workflows to be as short as possible, over the fastest connections you have. Pulling or pushing masses of AI data over widespread connections could make it almost impossible to prevent random massive movements of data from interfering with other traffic. It's particularly important to ensure that AI flows don't collide with other high-volume data flows, like conventional analytics and reporting. One approach is to map AI workflows and augment capacity along the path, and the other is to shorten and guide AI workflows by properly placing the AI cluster.
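To make the first approach a little more concrete, here is a minimal sketch of what "map AI workflows and augment capacity along the path" could look like as a quick check. Everything in it is an assumption for illustration: the link names, the capacities, the baseline loads, the 30 Gbps AI flow, and the 70% utilization threshold are hypothetical, not figures from the enterprises I talked with.

```python
# Hypothetical sketch: flag links on an AI workflow path that would run hot
# once the AI flow is added to existing traffic. All names and numbers are
# invented for illustration.

def congested_links(path, capacity_gbps, baseline_gbps, ai_flow_gbps, threshold=0.7):
    """Return (link, utilization) pairs for links that exceed the threshold."""
    flagged = []
    for link in path:
        # Utilization after adding the AI flow to the traffic already on the link.
        utilization = (baseline_gbps[link] + ai_flow_gbps) / capacity_gbps[link]
        if utilization > threshold:
            flagged.append((link, round(utilization, 2)))
    return flagged


# Assumed path from the AI cluster to a database in another data center.
PATH = ["ai-leaf", "spine", "dci-east-west"]
CAPACITY_GBPS = {"ai-leaf": 100, "spine": 400, "dci-east-west": 100}
BASELINE_GBPS = {"ai-leaf": 20, "spine": 150, "dci-east-west": 45}

hot_links = congested_links(PATH, CAPACITY_GBPS, BASELINE_GBPS, ai_flow_gbps=30)
print(hot_links)  # [('dci-east-west', 0.75)]
```

In this made-up case, the links inside the data center have headroom and only the DCI link crosses the threshold, which is the kind of result that would point to augmenting that one path rather than the whole network.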

Planning for the AI cluster starts with the association between enterprise AI and business analytics. Analytics uses the same databases that AI would likely use, which means that placing AI where the major analytics applications are hosted would be smart. Remember that this means placing AI where the actual analytics applications are run, not where the results are formatted for use. Since analytics applications are often run proximate to the location of the major databases, this will put AI in the location most likely to generate the shortest network connections. Run fat Ethernet pipes within the AI cluster and to the database hosts, and you're probably in good shape. But watch AI usage and traffic carefully, particularly if there aren't many controls on who uses it and how much. Rampant, and largely unjustified, use of self-hosted AI was reported by six of the eight enterprises, and that could drive costly network upgrades.
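The placement logic above can be sketched as a simple comparison: for each candidate site, add up the expected data volume from each database location multiplied by the network distance to it, and pick the site with the lowest total. This is only a rough illustration under stated assumptions. The site names, hop counts, and daily volumes below are hypothetical, and a real plan would weigh link speed and cost as well as hop count.

```python
# Hypothetical sketch: choose the AI cluster site that keeps data paths
# shortest, weighting each path by how much data it has to carry.
# Site names, hop counts, and volumes are invented for illustration.

def best_cluster_site(hops, demand_gb):
    """Return (site, costs) where site minimizes the sum of volume * hops."""
    costs = {}
    for candidate in hops:
        costs[candidate] = sum(
            volume * hops[candidate][source] for source, volume in demand_gb.items()
        )
    best = min(costs, key=costs.get)
    return best, costs


# Assumed hop counts between sites (symmetric).
HOPS = {
    "dc_east": {"dc_east": 0, "dc_west": 3, "hq": 2},
    "dc_west": {"dc_east": 3, "dc_west": 0, "hq": 4},
    "hq": {"dc_east": 2, "dc_west": 4, "hq": 0},
}
# Assumed daily data the AI cluster pulls from each site, in GB.
# dc_east hosts the major analytics databases in this example.
DEMAND_GB = {"dc_east": 800, "dc_west": 150, "hq": 50}

site, costs = best_cluster_site(HOPS, DEMAND_GB)
print(site, costs)
```

With these numbers, the site that already hosts the big analytics databases wins by a wide margin, which is the point of the rule: put AI where the data is.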

The future of AI networking for enterprises isn't about how AI is run; it's about how it's used. While AI usage will surely drive additional traffic, it's not going to require swapping out the entire data center network for hundreds of gigabits of Ethernet capacity. What it will require is a better understanding of how AI usage connects with AI data center clusters and cloud resources, with some generative AI thrown in. If Cisco, Juniper, or another vendor can provide that, they can expect a happy bonus in 2024.

Source: Network World
Published: Thu, 04 Jan 2024 15:35:50 +0000