AI Compute Pool provides a data-center-scale pool of AI computing resources, letting user applications transparently share and use AI accelerators on any server in the data center without modification.
With elastic scalability and flexible billing, it makes AI model training efficient and affordable.
Elastic scalability: Fully distributed deployment connects every node over an RDMA (IB/RoCE) or TCP/IP network, so resource pools scale elastically.
High performance: Top-tier hardware configurations and carefully tuned scheduling deliver the most effective compute supply for AI model training.
Developer-friendliness: Solves in one click the GPU/CPU-ratio and multi-node multi-GPU model-splitting problems AI developers face when training models, saving algorithm engineers a great deal of valuable time (see the sketch after this list).
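To make the multi-node multi-GPU point concrete, here is a minimal sketch of the kind of PyTorch distributed job such a pool would schedule. This is not the product's own API; it assumes a standard `torchrun` launcher, which sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, and uses stock `torch.distributed` with the NCCL backend (which itself can run over RDMA/RoCE or TCP/IP).

```python
# Minimal multi-GPU training sketch (illustrative, not the pool's API).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # A launcher such as `torchrun` sets RANK/LOCAL_RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # replicate and sync gradients
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()  # gradients all-reduced across every participating GPU
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, for example, `torchrun --nnodes=2 --nproc-per-node=8 train.py`, the same script spans 16 GPUs across two nodes; the pool's value proposition is that the user does not have to manage where those GPUs physically live.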
Supports TensorFlow and PyTorch training frameworks and MPI single-node and distributed training tasks; supports configuring various roles, creating training tasks on demand, and managing actions on tasks in every state; provides multiple scheduling strategies to meet tasks' requirements for high efficiency and low cost;
Connects every node over an RDMA (IB/RoCE) or TCP/IP network so resource pools scale elastically;
Provides monitoring and logging of task resources and business metrics, meeting algorithm engineers' operations and maintenance needs during debugging (a metrics-logging sketch follows this list).
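The pool's own monitoring interface is not documented here, so the following is only a hedged sketch of the user-side half of such a setup: a training job emitting per-step business metrics (loss) and resource metrics (GPU memory) as structured JSON log lines, on the common assumption that the platform's log service collects stdout. The `log_step_metrics` helper is hypothetical.

```python
# Hypothetical sketch: emit per-step metrics as JSON log lines for a
# platform log/monitoring service to collect (assumes stdout collection).
import json
import logging
import time

import torch

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("train-metrics")

def log_step_metrics(step: int, loss: float) -> None:
    """Emit one JSON record per step: loss plus GPU resource usage."""
    record = {
        "ts": time.time(),
        "step": step,
        "loss": loss,
        # Guarded so the sketch also runs on CPU-only nodes.
        "gpu_mem_alloc_mb": (
            torch.cuda.memory_allocated() / 2**20
            if torch.cuda.is_available()
            else 0.0
        ),
    }
    log.info(json.dumps(record))

# Inside a training loop, call e.g.: log_step_metrics(step, loss.item())
```

One-record-per-line JSON keeps the logs machine-parseable, so the same stream can feed both dashboards and ad hoc debugging.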
Custom storage, networking, compute, and task schedulers for deep learning, combined with rich debugging and visualization tools, deliver an efficient, developer-friendly deep learning training experience.
Professional AI solutions and advanced AI products help you achieve new business breakthroughs.