ai-platform: No eval folder or export folder in outputs when running a TensorFlow 2.1 training job with an Estimator

Problem description

Problem

My code works locally, but I am not able to get any evaluation data or exports from my TensorFlow estimator when submitting online training jobs after having upgraded to TensorFlow 2.1. Here's the bulk of my code:

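import tensorflow as tf

def build_estimator(model_dir, config):
    # feature_columns and args are defined elsewhere in the trainer module (not shown)
    return tf.estimator.LinearClassifier(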
        feature_columns=feature_columns,
        n_classes=2,
        optimizer=tf.keras.optimizers.Ftrl(
            learning_rate=args.learning_rate,
            l1_regularization_strength=args.l1_strength
        ),
        model_dir=model_dir,
        config=config
    )

run_config = tf.estimator.RunConfig(save_checkpoints_steps=100,
                                    save_summary_steps=100)  
...

estimator = build_estimator(model_dir=args.job_dir, config=run_config)

...

def serving_input_fn():
    inputs = {
        'feature1': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        'feature2': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        'feature3': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        ...
    }

    split_features = {}

    for feature in inputs:
        split_features[feature] = tf.strings.split(inputs[feature], "||").to_sparse()

    return tf.estimator.export.ServingInputReceiver(features=split_features, receiver_tensors=inputs)

exporter_cls = tf.estimator.LatestExporter('predict', serving_input_fn)

eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: input_eval_fn(args.test_dir),
    exporters=[exporter_cls],
    start_delay_secs=10,
    throttle_secs=0)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

If I run this with the local gcloud command, it works fine and I get my /eval and /export folders:

gcloud ai-platform local train \
--package-path trainer \
--module-name trainer.task \
-- \
--train-dir $TRAIN_DATA \
--test-dir $TEST_DATA \
--training-steps $TRAINING_STEPS \
--job-dir $OUTPUT

But when I try to run it in the cloud, I do not get my /eval and /export folders. This only started happening after upgrading to 2.1. Previously everything worked fine in 1.14.

    gcloud ai-platform jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --staging-bucket gs://$STAGING_BUCKET_NAME \
    --runtime-version 2.1 \
    --python-version 3.7 \
    --package-path trainer/ \
    --module-name trainer.task \
    --region $REGION \
    --config config.yaml \
    -- \
    --train-dir $TRAIN_DATA \
    --test-dir $TEST_DATA \

What I've tried

Instead of relying on the EvalSpec to export my model, I also tried calling tf.estimator.Estimator.export_saved_model directly. While this works both locally and online, I'd like to continue using the EvalSpec with train_and_evaluate if possible, because I can pass in different exporters such as BestExporter, LatestExporter, etc.

My main question is...

Am I incorrectly exporting my model in TensorFlow 2.1, or is this a bug that is happening on the platform with the new version?

Recommended answer

Found the answer...

Based on documentation about the TF_CONFIG environment variable...

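master is a deprecated task type in TensorFlow. master represents a task that performs a similar role as chief but also acts as an evaluator in some configurations. TensorFlow 2 does not support TF_CONFIG environment variables that contain a master task.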

So previously we were using TF 1.X, which used a master worker. However, master is deprecated for TF 2.X training jobs. The default is now chief, and chief does not act as an evaluator by default. To get evaluation data, we needed to update our config YAML to explicitly allocate an evaluator. Since the exporters passed to EvalSpec run as part of evaluation, this would also explain the missing export folder.

https://cloud.google.com/ai-platform/training/docs/distributed-training-details#tf-config-format

We used evaluatorType and evaluatorCount:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 1
  evaluatorType: standard_gpu
  evaluatorCount: 1

It worked!!!
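
To check which role a given replica has been assigned, one option is to parse the TF_CONFIG environment variable at the start of the trainer and log the result. The following is a minimal sketch rather than part of the original solution: the helper names summarize_tf_config and has_evaluator are hypothetical, the sample value is made up for illustration, and whether the evaluator is listed in the cluster spec or only shows up as the task type of its own replica is an assumption, so the helper checks both.

```python
import json
import os


def summarize_tf_config(raw=None):
    """Return (task_type, task_index, cluster_task_types) parsed from TF_CONFIG."""
    if raw is None:
        # TF_CONFIG is unset in single-machine runs, so fall back to an empty object.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw or "{}")
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    return task.get("type"), task.get("index"), sorted(cluster)


def has_evaluator(raw=None):
    """True if this replica is an evaluator or the cluster spec lists one."""
    task_type, _, cluster_types = summarize_tf_config(raw)
    return task_type == "evaluator" or "evaluator" in cluster_types


# Hypothetical TF_CONFIG value, for illustration only.
example = json.dumps({
    "cluster": {"chief": ["host0:2222"], "worker": ["host1:2222"]},
    "task": {"type": "evaluator", "index": 0},
})
print(summarize_tf_config(example))
print(has_evaluator(example))
```

Calling summarize_tf_config() with no argument inside trainer.task would read the real environment variable, so the job logs show whether any replica reports the evaluator task type.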
