Google open-sourced their TensorFlow Runtime (TFRT). The new runtime improves inference latency by almost 30%, and eager execution is also supported, so your models shouldn't be affected by switching to the new runtime! How it works (simplified): the TF graph is transformed into TFRT's binary executable format (see picture below), which directly invokes the low-level, device-aware TFRT kernels. This means faster graph execution. In eager mode, the new runtime takes the calls directly from your Python eager API. TFRT will gradually replace the old TF runtime, which (even in TF 2.0) is still centered around the old Graph execution model.
Source: https://blog.tensorflow.org
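To make the "trace a graph, then hand it to low-level kernels" flow concrete, here is a minimal sketch of the user-visible half in stock TensorFlow 2.x: tracing a Python function into a TF graph with tf.function. The model function is purely illustrative, and the lowering of this graph to TFRT's binary format happens inside the runtime, not through any public Python API:

```python
import tensorflow as tf

@tf.function
def model(x):
    # A tiny stand-in for a real model: one matmul followed by a ReLU.
    return tf.nn.relu(tf.matmul(x, x))

# Tracing produces the TF graph. Under TFRT, this graph is what gets
# lowered to the binary executable format and then executed by the
# low-level, device-aware TFRT kernels.
concrete = model.get_concrete_function(tf.TensorSpec([2, 2], tf.float32))
print([op.type for op in concrete.graph.get_operations()])
```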
A high-performance low-level runtime is a key to enable the trends of today and empower the innovations of tomorrow.
Eric Johnson, TensorFlow Product Manager
Key modifications
The current TensorFlow 2.0 runtime supports eager execution, but its design is still centered around the Graph execution model and around training rather than inference. As a result, eager execution is not as fast as it could be: a @tf.function-wrapped graph runs much faster than plain eager execution once the workload is big enough.
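You can measure this gap yourself. The following is a minimal sketch using stock TensorFlow 2.x (the step function and matrix sizes are arbitrary choices for illustration; the exact speedup depends on your hardware and workload):

```python
import timeit
import tensorflow as tf

def step(x, w):
    # Ten chained matmuls: enough ops for the per-op Python dispatch
    # overhead to show up in eager mode.
    for _ in range(10):
        x = tf.matmul(x, w)
    return x

# Wrapping the same function in tf.function traces it into a graph,
# which the runtime executes as a whole instead of op by op.
graph_step = tf.function(step)

x = tf.random.uniform((256, 256))
w = tf.random.uniform((256, 256))

graph_step(x, w)  # first call pays the one-time tracing cost

print("eager:", timeit.timeit(lambda: step(x, w), number=100))
print("graph:", timeit.timeit(lambda: graph_step(x, w), number=100))
```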
The new runtime's architecture is centered around making eager execution fast and graph execution faster than in the old runtime. It does so by introducing a new binary format for graph execution and a unified set of TFRT kernels that distribute the workload to highly efficient, device-aware kernels. Graph execution is no longer the underlying execution model, and the lower layers of TFRT are specialized for TPUs, GPUs, and mobile and embedded devices: your Python eager code will directly call the TFRT GPU or TFRT TPU kernels. This is no small feat!
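You can already watch device-aware dispatch at work from today's eager API. The sketch below uses stock TensorFlow 2.x placement logging (TFRT swaps in its own kernels underneath this same call path; the Python code does not change):

```python
import tensorflow as tf

# Ask TensorFlow to log which device kernel executes each op.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))

with tf.device("/CPU:0"):
    c_cpu = tf.matmul(a, b)  # explicitly pinned to the CPU kernel

# Without a device scope, the runtime picks the best available device
# (the GPU kernel if one is present, the CPU kernel otherwise).
c_auto = tf.matmul(a, b)
```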
Click "Edit Contents" and start uploading your learning material.