
Problem Description

I have 8,000 images which I am loading with sklearn.datasets.load_files and passing through ResNet from Keras to get bottleneck features. However, this task is taking hours on a GPU, so I'd like to find out if there is a way to tell load_files to load only a percentage of the data, such as 20%.

I'm doing this to train my own top layer (the last dense layer) and attach it to ResNet.

import numpy as np
from sklearn.datasets import load_files
from keras.utils import np_utils

def load_dataset(path):
    data = load_files(path)
    files = np.array(data['filenames'])
    # one-hot encode the integer class labels (100 classes)
    targets = np_utils.to_categorical(np.array(data['target']), 100)
    return files, targets

train_files, train_targets = load_dataset('images/train')
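
A minimal sketch, assuming the train_files and train_targets arrays from the snippet above: since the expensive step is the ResNet forward pass rather than load_files itself, one way to use only about 20% of the data is to subsample the filename and target arrays before extracting features. The fraction value and the *_subset names below are illustrative, not part of the original question.

import numpy as np

fraction = 0.2  # hypothetical: keep roughly 20% of the dataset
n_keep = int(len(train_files) * fraction)
idx = np.random.choice(len(train_files), size=n_keep, replace=False)
train_files_subset = train_files[idx]      # filenames to pass through ResNet
train_targets_subset = train_targets[idx]  # matching one-hot targets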

Recommended Answer

This sounds like it would be better suited to the Keras ImageDataGenerator class, using the ImageDataGenerator.flow_from_directory method. You don't have to use data augmentation with it (which would slow things down further), but you can choose the batch size to pull from the directory instead of loading all of the images.

Copied from https://keras.io/preprocessing/image/ and slightly modified with notes.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(  # <- customize your transformations
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=32,  # <- control how many images are loaded each batch
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

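# Two notes (assumptions, not part of the quoted snippet):
# - the Keras docs example uses class_mode='binary'; for the 100-class setup
#   in the question, class_mode='categorical' would be the matching choice.
# - 'model' below is assumed to be an already-built, compiled Keras model;
#   its definition is not shown in the snippet.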
model.fit_generator(
        train_generator,
        steps_per_epoch=2000,  # <- reduce here to lower the overall images used
        epochs=50,
        validation_data=validation_generator,
        validation_steps=800)
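
Since the original goal was bottleneck features from ResNet, here is a minimal sketch, assuming ResNet50 from keras.applications, of how the same generator idea could feed a headless network and stop after a fixed number of batches; the target size, batch size, and step count are assumptions chosen to match the 20% example below.

from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# headless ResNet50; global average pooling gives one 2048-d vector per image
resnet = ResNet50(weights='imagenet', include_top=False, pooling='avg')

feature_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
feature_generator = feature_datagen.flow_from_directory(
        'images/train',          # directory used in the question
        target_size=(224, 224),  # ResNet50's usual input size
        batch_size=32,
        class_mode=None,         # images only, no labels, for prediction
        shuffle=False)

# 50 steps * 32 images = 1,600 images, i.e. roughly 20% of the 8,000 images
bottleneck_features = resnet.predict_generator(feature_generator, steps=50)

With shuffle=False the features stay aligned with feature_generator.filenames and feature_generator.classes, but the first 50 batches are taken in directory order, so they are not a random 20% of the images.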

Edit

Per your question below: steps_per_epoch determines how many batches are loaded for each epoch.

For example:

  • steps_per_epoch = 50
  • batch_size = 32
  • epochs = 1

That would give you 1,600 images in total for that epoch, which is exactly 20% of your 8,000 images. Note that if you run into memory problems with a batch size of 32, you may want to decrease the batch size and increase steps_per_epoch. It will take some tinkering to get it right.
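
A minimal sketch of that arithmetic, with the dataset size and fraction taken from the question:

total_images = 8000
batch_size = 32
fraction = 0.2  # use roughly 20% of the images per epoch

steps_per_epoch = int(total_images * fraction / batch_size)  # -> 50
images_per_epoch = steps_per_epoch * batch_size              # -> 1600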
