CNN-Based Model Construction / Training / Inference


Preface

This is the final part of the series: building and training a neural network model for urban sound audio classification. For the course outline and the dataset preparation, see my earlier posts:
1. PyTorch for Audio + Music Processing (1): Course Overview
2. PyTorch for Audio + Music Processing (2/3/4/5/6/7): building the dataset and extracting audio features
This post covers:

08 Implementing a CNN network

Building a CNN model with a VGG-like architecture

09 Training urban sound classifier

Training the urban sound audio classification model

10 Predictions with sound classifier

Implementing the inference step


1. Building the CNN Model

The model is built as follows:

1. Four convolutional blocks (conv1, conv2, conv3, conv4), each consisting of Conv2d, ReLU, and MaxPool2d
2. A Flatten layer
3. A fully connected (Linear) layer
4. A Softmax layer
The code is as follows:

class CNNNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        # 4 conv blocks / flatten / linear / softmax
        self.conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(
                in_channels=32,
                out_channels=64,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(
                in_channels=64,
                out_channels=128,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(128 * 5 * 4, 10)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_data):
        x = self.conv1(input_data)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.flatten(x)
        logits = self.linear(x)
        predictions = self.softmax(logits)
        return predictions

The network structure can be printed with torchsummary.
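
A minimal sketch of the call that produces the summary below (the input_size of (1, 64, 44) is an assumption based on the 1-channel, 64-mel-band, 44-frame features from the earlier posts):

from torchsummary import summary

cnn = CNNNetwork()
summary(cnn, input_size=(1, 64, 44), device="cpu")

This prints: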

        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 16, 66, 46]             160
              ReLU-2           [-1, 16, 66, 46]               0
         MaxPool2d-3           [-1, 16, 33, 23]               0
            Conv2d-4           [-1, 32, 35, 25]           4,640
              ReLU-5           [-1, 32, 35, 25]               0
         MaxPool2d-6           [-1, 32, 17, 12]               0
            Conv2d-7           [-1, 64, 19, 14]          18,496
              ReLU-8           [-1, 64, 19, 14]               0
         MaxPool2d-9             [-1, 64, 9, 7]               0
           Conv2d-10           [-1, 128, 11, 9]          73,856
             ReLU-11           [-1, 128, 11, 9]               0
        MaxPool2d-12            [-1, 128, 5, 4]               0
          Flatten-13                 [-1, 2560]               0
           Linear-14                   [-1, 10]          25,610
          Softmax-15                   [-1, 10]               0
================================================================
Total params: 122,762
Trainable params: 122,762
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 1.83
Params size (MB): 0.47
Estimated Total Size (MB): 2.31
----------------------------------------------------------------

Notes on input/output shapes and parameter counts

Take the output of the first convolutional block as an example. The mel-spectrogram features extracted from the audio earlier form a tensor of shape 64x44.
The block is defined as:

self.conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )

Computing the output shape

With kernel_size=3, stride=1, and padding=2, each spatial dimension grows by 2: output size = (input + 2*padding - kernel_size) / stride + 1 = input + 2, so the 64x44 input becomes 66x46.
out_channels=16 means there are 16 filters (kernels), each convolved with the input, so the output also has 16 channels.
The output tensor therefore has shape 16x66x46.
MaxPool2d with kernel_size=2 halves each spatial dimension (rounding down), giving a final shape of 16x33x23.

Computing the parameter count

Each kernel is 3x3 and its weights are shared across all spatial positions, so each kernel holds 3x3=9 weights (the input has a single channel).
With 16 kernels, that is 16x9=144 weights; adding one bias per kernel gives 16x9+16=160 parameters in total.
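
Both calculations can be verified with a few lines (a quick sanity check I have added, not part of the course code):

def conv2d_out(size, kernel_size=3, stride=1, padding=2):
    # standard Conv2d output-size formula
    return (size + 2 * padding - kernel_size) // stride + 1

h, w = conv2d_out(64), conv2d_out(44)
print(h, w)            # 66 46 (matches Conv2d-1 in the summary)
print(h // 2, w // 2)  # 33 23 (after MaxPool2d(kernel_size=2))

weights = 16 * 1 * 3 * 3  # out_channels * in_channels * kernel_h * kernel_w
biases = 16               # one bias per output channel
print(weights + biases)   # 160 (matches the Param # of Conv2d-1)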

2. Training the Model

Creating the DataLoader

from torch.utils.data import DataLoader
# import PyTorch's DataLoader

def create_data_loader(train_data, batch_size):
    # train_data is the UrbanSoundDataset defined in the earlier posts;
    # batch_size is the number of samples per training batch
    train_dataloader = DataLoader(train_data, batch_size=batch_size)
    return train_dataloader
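
Note that DataLoader does not shuffle by default; for training, passing shuffle=True is usually worth adding. A quick, hypothetical way to sanity-check the loader (it assumes the usd dataset and BATCH_SIZE from the training script further below):

loader = create_data_loader(usd, BATCH_SIZE)
features, labels = next(iter(loader))
print(features.shape)  # expected: torch.Size([BATCH_SIZE, 1, 64, 44])
print(labels.shape)    # expected: torch.Size([BATCH_SIZE])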

Training a single epoch

def train_single_epoch(model, data_loader, loss_fn, optimiser, device):
    # model is the CNN defined in part 1; loss_fn is the loss function;
    # optimiser is the optimisation method; device is the training device
    for input, target in data_loader:
        # fetch a batch of training data and labels from the iterator
        input, target = input.to(device), target.to(device)

        # calculate loss
        prediction = model(input)
        # forward pass to get the model's predictions
        loss = loss_fn(prediction, target)
        # compute the loss from the model output and the labels

        # backpropagate error and update weights
        optimiser.zero_grad()
        # zero the gradients: with mini-batch training, gradients would
        # otherwise accumulate across batches
        loss.backward()
        # backpropagation computes the gradients
        optimiser.step()
        # update the weights using the optimiser and the gradients

    print(f"loss: {loss.item()}")

Training multiple epochs

def train(model, data_loader, loss_fn, optimiser, device, epochs):
    for i in range(epochs):
        print(f"Epoch {i+1}")
        train_single_epoch(model, data_loader, loss_fn, optimiser, device)
        print("---------------------------")
    print("Finished training")

The full training script

import torch
import torchaudio
from torch import nn

# BATCH_SIZE, EPOCHS, LEARNING_RATE, SAMPLE_RATE, NUM_SAMPLES,
# ANNOTATIONS_FILE, AUDIO_DIR and UrbanSoundDataset are defined
# in the earlier posts of this series

if __name__ == "__main__":
    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    print(f"Using {device}")

    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE,
        n_fft=1024,
        hop_length=512,
        n_mels=64
    )
    # define the mel-spectrogram transform with torchaudio,
    # to be passed to the UrbanSoundDataset below

    usd = UrbanSoundDataset(ANNOTATIONS_FILE,
                            AUDIO_DIR,
                            mel_spectrogram,
                            SAMPLE_RATE,
                            NUM_SAMPLES,
                            device)
    # instantiate the dataset

    train_dataloader = create_data_loader(usd, BATCH_SIZE)
    # build the training DataLoader with create_data_loader defined above

    # construct model and assign it to device
    cnn = CNNNetwork().to(device)
    print(cnn)
    # instantiate the CNNNetwork model

    # initialise loss function + optimiser
    loss_fn = nn.CrossEntropyLoss()
    # use cross-entropy loss
    optimiser = torch.optim.Adam(cnn.parameters(),
                                 lr=LEARNING_RATE)
    # use the Adam optimiser

    # train model
    train(cnn, train_dataloader, loss_fn, optimiser, device, EPOCHS)
    # run the training loop

    # save model
    torch.save(cnn.state_dict(), "feedforwardnet.pth")
    print("Trained feed forward net saved at feedforwardnet.pth")

Final training output

Epoch 1
loss: 2.241577625274658
---------------------------
Epoch 2
loss: 2.2747385501861572
---------------------------
Epoch 3
loss: 2.3089897632598877
---------------------------
Epoch 4
loss: 2.348045587539673
---------------------------
Epoch 5
loss: 2.315420150756836
---------------------------
Epoch 6
loss: 2.3148367404937744
---------------------------
Epoch 7
loss: 2.31473708152771
---------------------------
Epoch 8
loss: 2.3141160011291504
---------------------------
Epoch 9
loss: 2.3157730102539062
---------------------------
Epoch 10
loss: 2.3171067237854004
---------------------------
Finished training
Trained feed forward net saved at feedforwardnet.pth

Process finished with exit code 0
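
One thing worth noting about this output: the loss plateaus around 2.31 ≈ ln(10), which is exactly the loss of a uniform guess over 10 classes. A likely contributor is that nn.CrossEntropyLoss already applies log-softmax internally and expects raw logits, while this model's forward() passes the output through nn.Softmax first, so softmax is effectively applied twice and the gradients become very small. A possible fix (my suggestion, not from the course) is to return the logits and apply softmax only when probabilities are actually needed:

    def forward(self, input_data):
        x = self.conv1(input_data)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.flatten(x)
        return self.linear(x)  # raw logits; nn.CrossEntropyLoss applies log-softmax itself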

3. Model Inference

Defining class_mapping

The model outputs a class index, so here we define an ordered mapping from index to class name. The order follows the class labels used by the UrbanSoundDataset built earlier.

class_mapping = [
    "air_conditioner",
    "car_horn",
    "children_playing",
    "dog_bark",
    "drilling",
    "engine_idling",
    "gun_shot",
    "jackhammer",
    "siren",
    "street_music"
]

The predict function

def predict(model, input, target, class_mapping):
    model.eval()
    # eval() is required: it switches layers such as BatchNorm and Dropout
    # to inference mode, so they use the values learned during training
    # instead of batch statistics
    with torch.no_grad():
        predictions = model(input)
        # Tensor (1, 10) -> [ [0.1, 0.01, ..., 0.6] ]
        predicted_index = predictions[0].argmax(0)
        predicted = class_mapping[predicted_index]
        # the class predicted by the model
        expected = class_mapping[target]
        # the ground-truth class
    return predicted, expected
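
A hypothetical driver for this function (a sketch; it assumes CNNNetwork, the usd dataset from the training script, and class_mapping are all in scope):

if __name__ == "__main__":
    # load the trained weights back into a fresh model
    cnn = CNNNetwork()
    state_dict = torch.load("feedforwardnet.pth")
    cnn.load_state_dict(state_dict)

    # take the first sample from the dataset and add a batch dimension
    input, target = usd[0][0], usd[0][1]
    input.unsqueeze_(0)  # (1, 64, 44) -> (1, 1, 64, 44)

    predicted, expected = predict(cnn, input, target, class_mapping)
    print(f"Predicted: '{predicted}', expected: '{expected}'")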

Summary

This PyTorch for Audio + Music Processing series has covered, end to end:
1. Processing and loading an audio dataset with torchaudio, including mel-spectrogram feature extraction
2. Building a basic CNN classification model
3. Training a PyTorch model and running predictions
The course is clearly structured and explained in detail, making it a good entry point. As the author himself notes, however, it only introduces the basic framework and the general approach to this class of problem, and the network used is a very basic VGG-like structure; interested readers can try more SOTA models and richer features to improve performance.
