Google Colab: Why is the CPU faster than the TPU?

Problem Description

I'm using a Google Colab TPU to train a simple Keras model. Removing the distributed strategy and running the same program on the CPU is much faster than on the TPU. How is that possible?

import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Load Iris dataset
x = load_iris().data
y = load_iris().target

# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)

# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')

# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

# Use the strategy to create and compile a Keras model
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')

start = timeit.default_timer()

# Fit the Keras model on the dataset
model.fit(x_train, y_train, batch_size=20, epochs=20, validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)

print('\nTime: ', timeit.default_timer() - start)


Recommended Answer

Thanks for your question.

I think what's happening here is a matter of overhead -- since the TPU runs on a separate VM (accessible at grpc://$COLAB_TPU_ADDR), each call to run a model on the TPU incurs some amount of overhead as the client (the Colab notebook in this case) sends a graph to the TPU, which is then compiled and run. This overhead is small compared to the time it takes to run, say, ResNet50 for one epoch, but large compared to the time it takes to run a simple model like the one in your example.
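
One way to see this overhead directly is to time each epoch separately: the first epoch includes sending and compiling the graph, while later epochs mostly measure the steady-state step time. Below is a minimal sketch (not part of the original answer) using a Keras callback; the EpochTimer name is introduced here purely for illustration, and it can be used with the model and train_dataset defined in the updated example further down:

import timeit
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
  """Records the wall-clock duration of every epoch."""

  def on_train_begin(self, logs=None):
    self.durations = []

  def on_epoch_begin(self, epoch, logs=None):
    self._start = timeit.default_timer()

  def on_epoch_end(self, epoch, logs=None):
    self.durations.append(timeit.default_timer() - self._start)

# Example usage (after building `model` and `train_dataset` as shown below):
# timer = EpochTimer()
# model.fit(train_dataset, epochs=20, callbacks=[timer], verbose=0)
# print('first epoch:', timer.durations[0])
# print('mean later epoch:', sum(timer.durations[1:]) / len(timer.durations[1:]))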

For best results on TPU we recommend using tf.data.Dataset. I updated your example for TensorFlow 2.2:

%tensorflow_version 2.x
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Load Iris dataset
x = load_iris().data
y = load_iris().target

# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)

# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')

# Specify a distributed strategy to use the TPU
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Wrap the numpy arrays in batched tf.data Datasets
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)

# Use the strategy to create and compile a Keras model
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')

start = timeit.default_timer()

# Fit the Keras model on the dataset
model.fit(train_dataset, epochs=20, validation_data=val_dataset)

print('\nTime: ', timeit.default_timer() - start)

This takes about 30 seconds to run, compared to roughly 1.3 seconds on the CPU. We can substantially reduce the overhead here by repeating the dataset and running one long epoch rather than several small ones. I replaced the dataset setup with this:

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat(20).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)

And replaced the fit call with this:

model.fit(train_dataset, validation_data=val_dataset)

This brings the runtime down to about 6 seconds for me. This is still slower than CPU, but that's not surprising for such a small model that can easily be run locally. In general, you'll see more benefit from using TPUs with larger models. I recommend looking through TensorFlow's official TPU guide, which presents a larger image classification model for the MNIST dataset.
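
As a side note beyond the original answer: newer TensorFlow releases (roughly 2.4 onward) add a steps_per_execution argument to model.compile, which runs several training steps per call into the TPU and so amortizes the same per-call dispatch overhead without changing the dataset. A minimal sketch, assuming TF 2.4+ and reusing strategy, Sequential, Dense, and Adam from the listing above:

# Sketch only: requires TensorFlow 2.4+ and the `strategy` created earlier.
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1),
                loss='logcosh',
                steps_per_execution=16)  # run 16 training steps per TPU dispatch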
