What is the time complexity for training a neural network using back-propagation?



Suppose that a NN contains $n$ hidden layers, $m$ training examples, $x$ features, and $n_i$ nodes in each layer. What is the time complexity to train this NN using back-propagation?

I have a basic idea of how the time complexity of algorithms is determined, but here there are four different factors to consider, i.e. iterations, layers, nodes in each layer, and training examples, and maybe more. I found an answer here but it was not clear enough.

Are there other factors, apart from those I mentioned above, that influence the time complexity of the training algorithm of a NN?


Posted 2018-03-18T11:26:55.320

Reputation: 4 881

See also https://qr.ae/TWttzq.

– nbro – 2019-06-27T23:07:30.470



I haven't seen an answer from a trusted source, but I'll try to answer this myself, with a simple example (with my current knowledge).

In general, note that training an MLP using back-propagation is usually implemented with matrices.

Time complexity of matrix multiplication

The time complexity of matrix multiplication for $M_{ij} * M_{jk}$ is simply $\mathcal{O}(i*j*k)$.

Notice that we are assuming the simplest multiplication algorithm here; there exist other algorithms with somewhat better time complexity.
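As a concrete illustration, here is the naive algorithm as a sketch (my own code, not tied to any library): one multiply-add per (row, inner, column) triple, hence $\mathcal{O}(i*j*k)$ operations.

```python
# Naive multiplication of an (i x j) matrix by a (j x k) matrix.
# Three nested loops -> O(i*j*k) multiply-adds.
def matmul(A, B):
    i, j = len(A), len(A[0])
    assert len(B) == j, "inner dimensions must match"
    k = len(B[0])
    C = [[0.0] * k for _ in range(i)]
    for r in range(i):
        for c in range(k):
            for s in range(j):
                C[r][c] += A[r][s] * B[s][c]
    return C
```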

Feedforward pass algorithm

The feedforward propagation algorithm is as follows.

First, to go from layer $i$ to $j$, you do

$$S_j = W_{ji}*Z_i$$

Then you apply the activation function

$$Z_j = f(S_j)$$

If we have $N$ layers (including input and output layer), this will run $N-1$ times.
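The two steps above, repeated over the layers, can be sketched as follows (the tanh activation and pure-Python lists are my own illustrative choices):

```python
import math

# Forward pass over N-1 weight matrices: S_j = W_ji * Z_i, then Z_j = f(S_j).
def forward(weights, z):
    # weights[p] is a matrix with one row per node of layer p+1
    for W in weights:
        s = [sum(w * x for w, x in zip(row, z)) for row in W]  # S_j = W_ji * Z_i
        z = [math.tanh(v) for v in s]                          # Z_j = f(S_j)
    return z
```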


As an example, let's compute the time complexity for the forward pass algorithm for an MLP with $4$ layers, where $i$ denotes the number of nodes of the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer and $l$ the number of nodes in the output layer.

Since there are $4$ layers, you need $3$ matrices to represent weights between these layers. Let's denote them by $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. For propagating from layer $i$ to $j$, we have first

$$S_{jt} = W_{ji} * Z_{it}$$

and this operation (i.e. matrix multiplication) has $\mathcal{O}(j*i*t)$ time complexity. Then we apply the activation function

$$ Z_{jt} = f(S_{jt}) $$

and this has $\mathcal{O}(j*t)$ time complexity, because it is an element-wise operation.

So, in total, we have

$$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i + 1)) = \mathcal{O}(j*i*t)$$

Using the same logic, for going $j \to k$, we have $\mathcal{O}(k*j*t)$, and, for $k \to l$, we have $\mathcal{O}(l*k*t)$.

In total, the time complexity for feedforward propagation will be

$$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$$

Note that this cannot be simplified to a single product such as $\mathcal{O}(t*i*j*k*l)$: that product is an upper bound, but not a tight one, so the sum $\mathcal{O}(t*(ij + jk + kl))$ is the right form.

Back-propagation algorithm

The back-propagation algorithm proceeds as follows. Starting from the output layer $l \to k$, we compute the error signal, $E_{lt}$, a matrix containing the error signals for nodes at layer $l$

$$ E_{lt} = f'(S_{lt}) \odot {(Z_{lt} - O_{lt})} $$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: each column is the error signal for one training example.

We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$)

$$ D_{lk} = E_{lt} * Z_{tk} $$

where $Z_{tk}$ is the transpose of $Z_{kt}$.

We then adjust the weights

$$ W_{lk} = W_{lk} - D_{lk} $$

For $l \to k$, we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$.

Now, going back from $k \to j$. We first have

$$ E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt}) $$


$$ D_{kj} = E_{kt} * Z_{tj} $$

And then

$$W_{kj} = W_{kj} - D_{kj}$$

where $W_{kl}$ is the transpose of $W_{lk}$. For $k \to j$, we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t(l+j))$.

And finally, for $j \to i$, we have $\mathcal{O}(j*t(k+i))$. In total, we have

$$\mathcal{O}(ltk + tk(l + j) + tj (k + i)) = \mathcal{O}(t*(lk + kj + ji))$$

which is the same as for the feedforward pass. Since they are the same, the total time complexity for one epoch will be $$O(t*(ij + jk + kl)).$$

This time complexity is then multiplied by the number of iterations (epochs). So, we have $$O(n*t*(ij + jk + kl)),$$ where $n$ is the number of iterations.
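The final count can be wrapped in a small helper (the function name is mine) that makes the dependence on each factor explicit:

```python
# Approximate multiply count for training the 4-layer MLP:
# n epochs, t examples, layer sizes i -> j -> k -> l.
def training_cost(i, j, k, l, t, n):
    per_epoch = t * (i * j + j * k + k * l)  # forward + backward are both this order
    return n * per_epoch
```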


Note that these matrix operations can be greatly parallelized by GPUs.


We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$.

We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise; note that batch gradient descent is the general form: with little modification, it becomes stochastic or mini-batch.)

Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise; hence they do not affect the time complexity of the algorithm.

I'm not sure what the results would be using other optimizers such as RMSprop.


The following article http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5 describes an implementation using matrices. Although that implementation uses "row major" ordering, the time complexity is not affected by this.

If you're not familiar with back-propagation, check this article:


M.kazem Akhgary

Posted 2018-03-18T11:26:55.320

Reputation: 256


For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore the neurons and get $\mathcal{O}(w)$, where $w$ is the number of weights, i.e., $n * n_i$, assuming full connectivity between your layers.

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity for learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$.
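A minimal sketch of this count for a fully connected network (the function names are my own):

```python
# w = sum of products of adjacent layer sizes (full connectivity);
# training m examples for e epochs costs on the order of w * m * e.
def num_weights(layer_sizes):
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def training_ops(layer_sizes, m, e):
    return num_weights(layer_sizes) * m * e
```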

The bad news is that there's no formula telling you what number of epochs $e$ you need.


Posted 2018-03-18T11:26:55.320

Reputation: 459

From the above answer, don't you think it depends on more factors? – DuttaA – 2018-03-20T11:00:50.420

@DuttaA No. There's a constant amount of work per weight, which gets repeated e times for each of m examples. I didn't bother to compute the number of weights, I guess, that's the difference. – maaartinus – 2018-03-20T13:50:39.843

I think the answers are the same. In my answer I can assume the number of weights w = ij + jk + kl, basically the sum of n * n_i between layers as you noted. – M.kazem Akhgary – 2018-03-24T10:01:38.063


A potential disadvantage of gradient-based methods is that they head for the nearest minimum, which is usually not the global minimum.

This means that the only difference between these search methods is the speed with which solutions are obtained, and not the nature of those solutions.

An important consideration is time complexity, which is the rate at which the time required to find a solution increases with the number of parameters (weights). In short, the time complexities of a range of different gradient-based methods (including second-order methods) seem to be similar.

Six different error functions exhibit a median run-time order of approximately $O(N^4)$ on the N-2-N encoder in this paper:

Lister, R and Stone J "An Empirical Study of the Time Complexity of Various Error Functions with Conjugate Gradient Back Propagation" , IEEE International Conference on Artificial Neural Networks (ICNN95), Perth, Australia, Nov 27-Dec 1, 1995.

Summarised from my book: Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning.

James V Stone

Posted 2018-03-18T11:26:55.320

Reputation: 49

Hi J. Stone. Thanks for trying to contribute to the site. However, please, note that this is not a place for advertising yourself. Anyway, you can surely provide a link to your own books if they are useful for answering the questions and provided you're not just trying to advertise yourself. – nbro – 2020-01-24T12:27:04.453

@nbro If James Stone can provide an insightful answer - and it seems so - then I'm fine with him also mentioning some of his work. Having experts on this network is a solid contribution to the quality and level. – javadba – 2020-01-28T02:45:45.453

Dear nbro, That is a fair comment. I dislike adverts too. But it is possible for a book and/or paper to be relevant to a question, as I believe it is in this case. regards, Jim Stone – James V Stone – 2020-01-29T08:37:22.200


I found a paper that gives a table of time complexities for different architectures using linear programming-based training: https://arxiv.org/abs/1810.03218

Giorgio Luigi Morales Luna

Posted 2018-03-18T11:26:55.320

Reputation: 11

Hi and welcome to AI SE! Thanks for contributing! Maybe you can elaborate a little more about the table, and include a screenshot of it if you don't want to summarise the results! – nbro – 2020-03-27T19:07:12.557


The first thing to remember is that time complexity is calculated for an algorithm. An algorithm takes an input and produces an output. In the case of neural networks, the time complexity depends on what you take as the input.

Case 1: Input is just the dataset. Architecture and hyperparameters are fixed in the algorithm.

Whenever we say the time complexity of an algorithm, we generally mean "the number of iterations or recursions the algorithm makes with respect to the input, before coming to a halt" (my definition, not a standard one).

Training process can be divided into three steps:

  1. single-pass-n-update the model for a given batch of input
  2. Repeat the above step for different batches of input
  3. Repeat the above steps for the given number of epochs

So if we break the single-pass-n-update process into different parts for a better understanding, it will be as follows:

  1. Forward pass
  2. Loss calculation
  3. Backward pass and parameter update

Loss calculation is obviously $O(1)$. But I think the forward pass will also have a time complexity of $O(1)$, and so does the backward pass. Why?

To see my viewpoint, please recall the definition of time complexity and take an example, say sorting. Bubble sort has a time complexity of $O(n^2)$. Why? Because if your input size is $n$ (i.e., $n$ numbers), then you need to do $n^2$ iterations to complete the task.

In other words, if looping and recursion are not available in your programming language, or you don't want to use them, then you will need to write around $n^2$ lines of code (LOC) to do the sorting. This is called loop unrolling. Mind you, your code needs to be rewritten for every new input. If Mary wants to sort 5 elements, your code will have 25 LOC. If Bob wants to sort 500 elements, 250,000 LOC, and so on.

But if I tell you to write a program that sorts exactly 10 elements, then you don't have the LOC curse: you can simply write 10*10 LOC, and it will work for every input that contains 10 elements! So the time complexity becomes O(100) = O(1). That is the key point.
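To make this concrete, here is a fixed sequence of compare-and-swaps that sorts exactly 3 elements (my own illustration): no loops, no recursion, a constant number of operations regardless of the values.

```python
# Sorting exactly 3 elements with 3 hard-coded compare-and-swaps:
# O(1) once the input size is fixed.
def sort3(a, b, c):
    if a > b: a, b = b, a
    if b > c: b, c = c, b
    if a > b: a, b = b, a
    return a, b, c
```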

In our given case we fix the architecture and hyperparameters of the network beforehand.

When we fix the architecture of a neural network (i.e. the number of hidden layers, the input size, and the output size), we fix the maximum number of iterations to a constant number.

Consider a neural network that takes as input a 32x32 (=1024) grayscale image, has a hidden layer of size 2048, and outputs 10 nodes representing 10 classes (yes, the classic MNIST digit recognition task). The batch size is 16.

In one single forward pass, first, there will be a matrix multiplication. The two matrices that will be multiplied will be the input matrix and the weight matrix. The (fixed) dimensions will be 16x1024 and 1024x2048, respectively. One can write 16x1024x2048 LOC to do the multiplication. Consequently, the matrix values at the hidden layer (16x2048) will be multiplied with another weight matrix (2048x10) to get the final output. This matrix multiplication is also fixed. Therefore, in a single forward pass, a constant number of operations is performed, so the LOC needed are fixed. Hence $O(1)$ time complexity. Similarly for backpropagation.
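A quick sanity check of that count (my own arithmetic, using the batch size of 16 stated above): every factor is a constant, so the total is a constant.

```python
# Multiply count for one forward pass of the fixed network:
# batch 16, input 1024, hidden 2048, output 10.
batch, n_in, n_hid, n_out = 16, 1024, 2048, 10
ops = batch * n_in * n_hid + batch * n_hid * n_out  # a fixed constant
```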

The key here is to understand that while training a plain vanilla multi-layer neural network, the architecture is fixed and hence the number of operations is fixed. So one single-pass-n-update happens in $O(1)$. However, this doesn't make the training process itself run in $O(1)$ time. Why? Let us see how training happens:

// Algorithm 1
// assuming the model architecture is already defined
// hyperparameters are fixed
// N is the number of training examples in the dataset
// ... denotes other hyperparameters

model train_function(model, train_dataset, batch_size, epochs, ...) {
    batches = train_dataset / batch_size
    for (int i = 0; i < epochs; i++) {                       // O(1)
        for each batch in batches {                          // O(N)
            model = single-pass-n-update(model, batch, ...)  // O(1)
        }
    }
    return model
}

As epochs are also fixed, the epoch loop also runs in $O(1)$. The batch loop technically runs in $O(N/batch\_size)$, but as $batch\_size$ is also fixed, we can write it as $O(N)$. So the final time complexity will be $O(N)$.

Case 2: Input is dataset+hyperparameters. But the architecture is fixed.

In this case, the running time complexity of the epoch loop will become exponential. So the entire time complexity of training becomes exponential.

model train_function(model, train_dataset, batch_size, epochs, ...) {
    batches = train_dataset / batch_size
    for (int i = 0; i < epochs; i++) {                       // O(exp)
        for each batch in batches {                          // O(N)
            model = single-pass-n-update(model, batch, ...)  // O(...)
        }
    }
    return model
}

If you are wondering why the epoch loop runs in exponential time rather than in $O(epochs)$, the answer is that some algorithms look polynomial but are not: they have a pseudo-polynomial running time. Basically, the time complexity depends not only on the input but on the size (encoding length) of the input as well.

Pseudo-Polynomial Running Time

When you use bubble sort, you give the program $n$ numbers as input. The program takes each number one by one. Each number is, say, an 8-bit unsigned integer. We know the time complexity is quadratic, $O(n^2)$, so changing $n$ changes the number of iterations quadratically. Fine so far.
But here comes the catch: if you change the input size, i.e., use 9-bit or 7-bit integers instead of 8-bit, will the number of iterations change? The answer is no. Why? DIY

If you want to create a loop that prints "Hello" $n$ times, you might think the time complexity is $O(n)$, where $n$ is an 8-bit integer supplied by the user. Now ask yourself: if you change the input size, i.e., use a 9-bit or 7-bit integer instead of 8-bit, will the number of iterations change? The answer is yes; it changes exponentially. Why? DIY

In other words, when you calculate the time complexity and the expression contains a pure number provided by the user, the complexity automatically becomes exponential in the number of bits used to encode that number.

If you know a little about cryptography, you might have come across the Knapsack problem being used in cryptography. Cryptography relies on functions that are easy to compute but whose inverses are believed to be computable only in exponential time; that's what makes it secure. But you might also know that dynamic programming solves the Knapsack problem in $O(nW)$ time. Then what is the catch? The catch is that dynamic programming solves it in pseudo-polynomial time. Why? Look at the time complexity expression $O(nW)$, where $n$ is the number of items and $W$ is the capacity of the knapsack, which is a pure number provided by the user. Hence the actual time complexity is exponential in the bit length of $W$. To read more about it, this might help.
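A minimal sketch of that dynamic program (my own implementation): the table has $n \cdot W$ cells, hence the $O(nW)$ pseudo-polynomial running time.

```python
# 0/1 knapsack via dynamic programming in O(n*W) time:
# polynomial in the value W, exponential in W's bit length.
def knapsack(values, weights, W):
    best = [0] * (W + 1)
    for v, wt in zip(values, weights):
        # iterate capacities downwards so each item is used at most once
        for cap in range(W, wt - 1, -1):
            best[cap] = max(best[cap], best[cap - wt] + v)
    return best[W]
```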

Case 3: Input is dataset+architecture. But hyperparameters are fixed

Architecture = [ number of input nodes ($n$), number of hidden layers ($h$), [number of hidden nodes in each hidden layer $h_1,h_2,h_3,...,h_h$], number of output nodes ($t$)]

Batch size = constant = $K_b$

Then the epoch loop will run in $O(1)$ and the batch loop will run in $O(N)$.

Coming to single-pass-n-update: the forward pass happens in $O(K_bnh_1+K_bh_1h_2+K_bh_2h_3+ \dots + K_bh_{h-1}h_h)$. Why? Matrix multiplication.

Ignoring the constant term, the expression can be written as $O(nh_1+h_1h_2+h_2h_3+ \dots + h_{h-1}h_h)$. As you can see, the expression contains $h_1,h_2,h_3,\dots,h_h$, which are pure numbers given by the user. Hence the time complexity of the forward pass, and thus of single-pass-n-update itself, becomes exponential.

model train_function(model, train_dataset, batch_size, epochs, ...) {
    batches = train_dataset / batch_size
    for (int i = 0; i < epochs; i++) {                       // O(1)
        for each batch in batches {                          // O(N)
            model = single-pass-n-update(model, batch, ...)  // O(exp)
        }
    }
    return model
}

I hope the fourth case, where the input is dataset+architecture+hyperparameters and nothing is fixed, is now also clear.


Posted 2018-03-18T11:26:55.320

Reputation: 111