This article covers the question "Tensorflow 2.0.0 alpha: converting a CSV to tfrecord, creating a Keras model that uses pipelined data from a large source, storing weights to a CSV file?" and the answer that was recommended for it.

Problem description

I am learning machine learning from Andrew NG's lectures on Coursera. The course uses Matlab, which is great for understanding and prototyping machine learning models, but it is rather slow. I am currently looking into Tensorflow, since it supports GPU utilization and data pipelining, which should speed up my models.

However, I am completely lost on this one. The documentation does not go into detail, the sample code is not commented, and to top it all off, Tensorflow just released a 2.0 alpha which changes the API significantly (so many old StackOverflow threads are of no help).

My goals are to:

  1. Convert a large (10GB+) CSV file to tfrecords (I read somewhere that this has benefits?)
  2. Create a tf.data.Dataset that reads the data on multiple threads and pipes it to the model
  3. Create a model that learns from the above dataset using my GPU
  4. Export the learned parameters to a file

Right now, I've only been able to build the Keras model:

from tensorflow import keras

lambd = 0.001  # L2 regularization strength (undefined in the original snippet)

# x_train/y_train and x_test/y_test are assumed to already be loaded as
# 28x28x1 images with integer class labels
model = keras.Sequential([
    keras.layers.Conv2D(filters=3, activation='relu',
                        kernel_regularizer=keras.regularizers.l2(0.001),
                        kernel_size=28,
                        padding="same",
                        input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.09),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dropout(0.09)])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=47, batch_size=256)
test_loss, test_acc = model.evaluate(x_test, y_test)
print('\nTest accuracy:', test_acc)

Anything would be helpful at this point! Which functions should I look into that would be crucial for any of my goals?

Recommended answer

After 24 hours of nonstop research, I finally glued all the pieces of the puzzle together. The API is amazing, but the documentation is lacking.

For converting a CSV to tfrecord:

import tensorflow as tf
import numpy as np
import pandas as pd  # For reading the .csv in chunks
from datetime import datetime  # For timing each read/write

def _bytes_feature(value):
    # Returns a bytes_list from a string / byte.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _float_feature(value):
    # Returns a float_list from a float / double.
    # If a list of values was passed, a float list feature with the entire list will be returned
    if isinstance(value, list):
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))

    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def _int64_feature(value):
    # Returns an int64_list from a bool / enum / int / uint.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def serialize_example(pandabase):
    # Serializes inputs from a pandas dataset (read in chunks)

    # Creates a mapping of the features from the header row of the file
    base_chunk = pandabase.get_chunk(0)
    num_features = len(base_chunk.columns)
    features_map = {}
    for i in range(num_features):
        features_map.update({'feature' + str(i): _float_feature(0)})

    # Set writing options with ZLIB compression
    # (in TF 2.x the compression type is passed as a string)
    options = tf.io.TFRecordOptions(compression_type='ZLIB',
                                    compression_level=9)
    with tf.io.TFRecordWriter('test2.tfrecord.zip', options=options) as writer:
        # Convert each chunk to a numpy array and write every row as one Example
        for chunk in pandabase:
            nump = chunk.to_numpy()
            for row in nump:
                for ii, elem in enumerate(row):
                    features_map['feature' + str(ii)] = _float_feature(float(elem))
                myProto = tf.train.Example(features=tf.train.Features(feature=features_map))
                writer.write(myProto.SerializeToString())


start = datetime.now()
bk1 = pd.read_csv("Book2.csv", chunksize=2048, engine='c', iterator=True)    
serialize_example(bk1)
end = datetime.now()
print("- consumed time: %ds" % (end-start).seconds)

For machine learning from the tfrecords using the GPU: follow the TensorFlow GPU setup guide to get the correct setup, then use this code:

import tensorflow as tf
from tensorflow import keras

# Recreate the feature mapping (must match the one used to write the tfrecords)
_NUMCOL = 5
feature_description = {}
for i in range(_NUMCOL):
    feature_description.update({'feature' + str(i): tf.io.FixedLenFeature([], tf.float32)})

# Parse the tfrecords into the form (x, y) or (x, y, weights) to be used with keras
def _parse_function(example_proto):
    dic = tf.io.parse_single_example(example_proto, feature_description)
    y = dic['feature0']
    x = tf.stack([dic['feature1'],
                  dic['feature2'],
                  dic['feature3'],
                  dic['feature4']], axis=0)
    return x, y

# Let tensorflow autotune the pipeline performance
AUTOTUNE = tf.data.experimental.AUTOTUNE
# Create a tf.data dataset from the recorded file; set parallel reads to the number of cores for best speed
# (the file name must match the one written by serialize_example above)
myData = tf.data.TFRecordDataset('test2.tfrecord.zip', compression_type='ZLIB',
                                 num_parallel_reads=2)
# Map the data to a form usable by keras (using _parse_function), cache it, shuffle, and read it in batches
myData = myData.map(_parse_function, num_parallel_calls=AUTOTUNE)
myData = myData.cache()
myData = myData.shuffle(buffer_size=8192)
batch_size = 16385
myData = myData.batch(batch_size).prefetch(buffer_size=AUTOTUNE)

lambd = 0.001  # L2 regularization strength (undefined in the original snippet)

model = keras.Sequential([
    keras.layers.Dense(100, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(1, activation='linear', kernel_regularizer=keras.regularizers.l2(lambd))])

model.compile(optimizer='adam',
              loss='mean_squared_error')
# Train from the pipelined dataset (the epoch count is arbitrary), then save the model
model.fit(myData, epochs=10)
model.save('keras.h5')
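
The original question also asks about exporting the learned parameters (goal 4). The answer above only saves the whole model; one way to get the weights into CSV files (a minimal sketch, not from the original answer, with illustrative file names) is to pull them out with model.get_weights() and dump each array with numpy:

import numpy as np

# Minimal sketch: write each weight/bias array of the trained model to its own CSV.
# The 'weights_<i>.csv' names are illustrative, not part of the original answer.
for i, w in enumerate(model.get_weights()):
    # np.savetxt expects a 2-D array, so lift bias vectors to shape (1, n)
    np.savetxt('weights_%d.csv' % i, np.atleast_2d(w), delimiter=',')

If CSV is not a hard requirement, model.save_weights('weights.h5') keeps everything in a single HDF5 file instead.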
