Speech Recognition with PyTorch for beginners

5 min read · Jun 13, 2021


Set Up the Environment

You can use your favorite IDE, or mine — PyCharm.

If you need help setting up PyCharm, have a look here.

You will also need to install PyTorch by running a simple command in the terminal; see https://pytorch.org/ for more information. As an example, if you are using Windows and pip, then for the stable version at the time of writing this article (1.8.1) you need to run:

pip3 install torch==1.8.1+cpu torchvision==0.9.1+cpu torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

The Project

Using the SPEECHCOMMANDS dataset from PyTorch's torchaudio package, which includes 35 voice commands (down, follow, forward, etc.), we will build a command recognizer.

The Code

Let’s have a quick look at the code and then perform a forensic analysis.

Due to the large volume of training data, the code will take around 7 minutes before training starts, and another 10 minutes to train.
For a quick demo, set QUICK = True in line#9.

SimpleNet Class

Here, our input is one audio clip of 16000 frames; simply put, a vector of size 16000.

Therefore, our first layer takes 16000 inputs and produces 128 outputs (Line#44).

We then pass these 128 values through ReLU to the next layer, which produces 64 outputs.

Again, we pass these 64 values through ReLU to the last layer, which produces 35 outputs, one per command.

A few things to notice:

  1. We do not apply ReLU to the last layer.
  2. The network configuration is kind of arbitrary; many other configurations would have worked as well. What matters is that the first layer takes 16000 inputs, the last layer produces 35 outputs, and there is a ReLU (or some other activation function) between the layers to ensure non-linearity.
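As a rough sketch, a network with the layer sizes described above could look like the following. The attribute names (fc1, fc2, fc3) and the random input batch are assumptions made for illustration; they are not taken from the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleNet(nn.Module):
    """Fully connected classifier: 16000 audio frames in, 35 scores out."""

    def __init__(self, n_inputs=16000, n_classes=35):
        super().__init__()
        # Layer names are illustrative assumptions.
        self.fc1 = nn.Linear(n_inputs, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, n_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))         # 16000 -> 128, then ReLU
        x = F.relu(self.fc2(x))         # 128 -> 64, then ReLU
        x = self.fc3(x)                 # 64 -> 35, no ReLU on the last layer
        return F.log_softmax(x, dim=1)  # log-probabilities per label


model = SimpleNet()
batch = torch.randn(4, 16000)  # 4 fake clips standing in for real audio
out = model(batch)
print(out.shape)
```

Each row of the output holds 35 log-probabilities, one per command.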


This function is invoked during every training pass. When line#52 is executed, it feeds the audio (a vector of size 16000) to the network and gets the value of the output layer in return (x), which is a vector of size 35 in this case.

x contains 35 values, and the index of the highest value is the index of the most likely label.

For example, consider we have 3 possible outputs:

And for a sample audio, say we get the following value for x after executing Line#52:

Since the value at index#1 is the maximum here, the audio is most likely saying “down”.
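To make the idea concrete, here is a tiny sketch with three made-up labels and made-up scores. Both are hypothetical and chosen only so that index 1 wins; they are not the values from the original example.

```python
import torch

# Hypothetical labels and scores, for illustration only.
labels = ["backward", "down", "follow"]
x = torch.tensor([0.2, 3.1, -1.4])

predicted_index = int(torch.argmax(x))    # index of the highest score
predicted_label = labels[predicted_index]
print(predicted_index, predicted_label)   # 1 down
```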

Line#53 simply calculates log_softmax, which makes the loss convenient to calculate. The details are outside the scope of this article. For a quick understanding, log_softmax(x) is log(softmax(x)). Because both softmax and log preserve ordering, the index of the maximum value in x is also the index of the maximum value in softmax(x) and in log_softmax(x).
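If you want to convince yourself, the short check below compares log_softmax with log(softmax) on made-up scores and confirms that the winning index does not change. The input values are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[0.2, 3.1, -1.4]])  # made-up scores for one clip

probs = F.softmax(x, dim=1)           # positive values that sum to 1
log_probs = F.log_softmax(x, dim=1)   # same thing, in log space

matches_log_of_softmax = torch.allclose(log_probs, torch.log(probs), atol=1e-6)
same_winner = (
    x.argmax(dim=1).item()
    == probs.argmax(dim=1).item()
    == log_probs.argmax(dim=1).item()
)
print(matches_log_of_softmax, same_winner)
```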

If you are new to PyTorch or any other ML library, you might find it surprising that you don’t need to write a backward method. PyTorch does it for you!
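Here is a minimal sketch of that automatic differentiation on a one-parameter toy function (the function itself is made up and has nothing to do with audio). We only write the forward computation, and PyTorch derives the gradient.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # a single trainable parameter

loss = (2.0 * w - 1.0) ** 2  # forward computation only
loss.backward()              # PyTorch works out d(loss)/dw for us

# By hand: d/dw (2w - 1)^2 = 4 * (2w - 1) = 4 * 5 = 20 at w = 3
print(w.grad)
```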


Notice that, to feed the audio, we need the audio vector to be of size 16000. There are some samples in the dataset whose size is slightly different. Feeding those samples will cause an error. There are smarter ways to handle those cases, for example padding (filling the rest of the frames with empty values) for audio with fewer frames, and trimming for audio with more frames. However, we took a lazy approach here and simply omitted the samples whose size is not 16000. 🙄
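For reference, both approaches can be sketched in a few lines. The helper name fix_length and the fake clips are assumptions for illustration; the original code only does the filtering.

```python
import torch
import torch.nn.functional as F

TARGET_LEN = 16000


def fix_length(waveform, target_len=TARGET_LEN):
    """Zero-pad or trim the last dimension of `waveform` to `target_len`."""
    n_frames = waveform.shape[-1]
    if n_frames < target_len:
        # Pad zeros on the right of the last dimension.
        return F.pad(waveform, (0, target_len - n_frames))
    return waveform[..., :target_len]


# Fake clips standing in for dataset samples.
clips = [torch.ones(1, 15000), torch.ones(1, 16000), torch.ones(1, 17000)]

# Smarter approach: make every clip exactly 16000 frames long.
fixed = [fix_length(clip) for clip in clips]

# Lazy approach: keep only clips that already have 16000 frames.
kept = [clip for clip in clips if clip.shape[-1] == TARGET_LEN]

print([tuple(c.shape) for c in fixed], len(kept))
```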

SubsetSC class

I kind of copied that part from the PyTorch documentation, so there is nothing much to look at here. It simply splits the dataset into different parts (training, validation, etc.).
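The idea behind the split can be sketched without torchaudio at all. The Speech Commands download ships two text files, validation_list.txt and testing_list.txt; every clip not listed in either one belongs to the training set. The function name and file paths below are hypothetical, and this is only the selection logic, not the actual SubsetSC class.

```python
def split_file_lists(all_files, validation_files, testing_files):
    """Partition clip paths into training, validation and testing subsets."""
    validation = set(validation_files)
    testing = set(testing_files)
    excluded = validation | testing
    return {
        # Training is everything that is not reserved for validation/testing.
        "training": [f for f in all_files if f not in excluded],
        "validation": [f for f in all_files if f in validation],
        "testing": [f for f in all_files if f in testing],
    }


# Hypothetical file names, for illustration only.
all_files = ["down/a.wav", "down/b.wav", "follow/c.wav", "forward/d.wav"]
subsets = split_file_lists(
    all_files,
    validation_files=["down/b.wav"],
    testing_files=["forward/d.wav"],
)
print(subsets["training"])  # ['down/a.wav', 'follow/c.wav']
```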


As you can see, it is pretty self-explanatory. I will highlight a few things though:

  1. Since we are now training this model, at line#70 we call the train() method, so that PyTorch behaves accordingly.
  2. In Line#74, during every iteration, the dataloader gives us 5 pieces of information (features, size, train_labels, serial, train_labels_indices), but we only need features and train_labels_indices.
  3. In Line#75, we make sure all gradients are set to 0. Otherwise, newly calculated gradients would be added to the previous ones.
  4. The features variable is a 3D tensor of shape NUMBER_OF_ITEMS X 1 X 16000. Meaning, if there are 100 items in this iteration, its shape is 100 X 1 X 16000. We simply reshape it to 100 X 16000 in that case.
  5. Line#79 calculates gradients through backpropagation.
  6. Line#80 updates all weight and bias parameters.
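Put together, one training iteration following the steps above might look like this sketch. The stand-in model, the SGD optimizer with its learning rate, the nll_loss loss function, and the random batch are all assumptions for illustration; the original code may use different choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the network described earlier (same layer sizes).
model = nn.Sequential(
    nn.Linear(16000, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 35), nn.LogSoftmax(dim=1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed optimizer

# Fake batch: 100 clips of shape 1 x 16000, plus 100 label indices.
features = torch.randn(100, 1, 16000)
train_labels_indices = torch.randint(0, 35, (100,))

model.train()                                   # switch to training mode
optimizer.zero_grad()                           # reset old gradients to 0
flat = features.reshape(features.shape[0], -1)  # 100 x 1 x 16000 -> 100 x 16000
output = model(flat)                            # forward pass: 100 x 35
loss = F.nll_loss(output, train_labels_indices) # assumed loss for log_softmax output
loss.backward()                                 # gradients via backpropagation
optimizer.step()                                # update weights and biases

print(flat.shape, output.shape, loss.item())
```

nll_loss is the usual partner of log_softmax, which is why it is used in this sketch.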


As you can see, it is quite similar to training. A few things to notice:

  1. In line#91, we call the eval() method to switch the model to evaluation (testing) mode.
  2. In line#93, by calling no_grad(), we make the execution faster, since PyTorch skips the bookkeeping needed to calculate gradients.
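A minimal sketch of such a test pass is shown below. The one-layer stand-in model and the random batch are assumptions for illustration, so the accuracy it prints is meaningless; only the structure matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Untrained stand-in model, for illustration only.
model = nn.Sequential(nn.Linear(16000, 35), nn.LogSoftmax(dim=1))

# Fake test batch: 10 clips and 10 label indices.
features = torch.randn(10, 1, 16000)
test_labels_indices = torch.randint(0, 35, (10,))

model.eval()                      # evaluation mode
with torch.no_grad():             # no gradient tracking inside this block
    flat = features.reshape(features.shape[0], -1)
    output = model(flat)
    predictions = output.argmax(dim=1)  # most likely label per clip
    correct = (predictions == test_labels_indices).sum().item()

accuracy = correct / len(test_labels_indices)
print(f"accuracy: {accuracy:.2f}")
```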

Where to go from here

  1. Improve performance. You can generate a mel spectrogram using the librosa library. You can then treat the mel spectrogram as an image and apply a CNN to it.
  2. Isn’t it painful to run training every time you want to test? After training finishes, you can save the trained model with:
    torch.save(model.state_dict(), 'models/linear.h5')
  3. During testing, you can load the saved model with:
    model = simple_net.SimpleNet()
    model.load_state_dict(torch.load('models/linear.h5'))
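Here is a small round-trip sketch of saving and loading. It uses a stand-in model and a temporary directory (both assumptions for illustration) so that it runs anywhere; in the real project you would use your own network class and path.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Stand-in architecture; the loading side must build the same one.
    return nn.Sequential(nn.Linear(16000, 128), nn.ReLU(), nn.Linear(128, 35))


trained = make_model()  # pretend this one has been trained

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear.h5")
    torch.save(trained.state_dict(), path)      # save only the parameters

    restored = make_model()                     # fresh, randomly initialised
    restored.load_state_dict(torch.load(path))  # copy saved parameters in

restored.eval()

x = torch.randn(2, 16000)
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
print(same_output)
```

Note that state_dict() stores only the parameters, so the model class has to be created first and then filled with the saved values.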


The problem demonstrated here is taken from the PyTorch site, but simplified a lot for beginners.