The first thing to remember is that time complexity is defined for an algorithm. An algorithm takes an input and produces an output. In the case of neural networks, the time complexity depends on what you treat as the input.

**Case 1: Input is just the dataset. Architecture and hyperparameters are fixed in the algorithm.**

Whenever we talk about the time complexity of an algorithm, we generally mean "the number of iterations or recursions the algorithm makes, with respect to the input, before coming to a halt" (my definition, not a standard one).

The training process can be divided into three steps:

- single-pass-n-update the model for a given batch of input
- Repeat the above step for the different batches of input
- Repeat the above steps for the given number of epochs

So if we break the single-pass-n-update process into different parts for a better understanding, it looks as follows:

- Forward pass
- Loss calculation
- Backward pass and parameter update

Loss calculation is obviously $O(1)$ (the batch size and output size are fixed). But I claim the forward pass also has a time complexity of $O(1)$, and so does the backward pass. Why?

To see my viewpoint, recall the definition of time complexity above and take sorting as an example. Bubble sort has a time complexity of $O(n^2)$. Why? Because if your input size is $n$ (i.e. $n$ numbers), then you need about $n^2$ iterations to complete the task.

In other words, if looping and recursion are not available in your programming language, or you don't want to use them, then you will need to write around $n^2$ lines of code (LOC) to do the sorting. This is called *loop unrolling*. Mind you, your code needs to be rewritten for every new input size. If Mary wants to sort 5 elements, your code will have 25 LOC. If Bob wants to sort 500 elements, 250000 LOC, and so on.

But if I tell you to write a program that sorts exactly 10 elements, then you don't have the LOC curse. You can simply write 10*10 = 100 LOC, and it will work for every input that contains 10 elements! So your time complexity becomes O(100) = O(1). That is the key point in my mind.
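To illustrate, here is a loop-free sort of exactly 3 numbers in Python (my own example, a tiny sorting network): a fixed sequence of compare-and-swap lines that works for any 3-element input, so the number of operations is a constant:

```python
def sort3(a, b, c):
    # Fixed sequence of compare-and-swap steps -- no loops,
    # no recursion, same number of operations for every input.
    if a > b:
        a, b = b, a
    if b > c:
        b, c = c, b
    if a > b:
        a, b = b, a
    return a, b, c

print(sort3(3, 1, 2))  # (1, 2, 3)
```

No matter which 3 numbers you pass in, exactly three comparisons happen, which is the $O(1)$ situation described above.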

In our given case we fix the architecture and hyperparameters of the network beforehand.

When we fix the architecture of a neural network (i.e. the input size, the number and sizes of the hidden layers, and the output size), we fix the maximum number of iterations to a constant.

Consider a neural network that takes a 32x32 (=1024-pixel) grayscale image as input, has a hidden layer of size 2048, and 10 output nodes representing 10 classes (yes, the classic MNIST digit-recognition task). The batch size is 32.

In one single forward pass, first there is a matrix multiplication between the input matrix and the weight matrix, with (fixed) dimensions 32x1024 and 1024x2048, respectively. One could write 32x1024x2048 LOC to do this multiplication. The resulting matrix at the hidden layer (32x2048) is then multiplied by another weight matrix (2048x10) to get the final output; this matrix multiplication is also fixed. Therefore, a single forward pass performs a constant number of operations, so the LOC needed are fixed. Hence $O(1)$ time complexity. Similarly for backpropagation.
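To make the counting argument concrete, here is a small Python sketch (my illustration, using the fixed matrix shapes above with a batch dimension of 32) showing that the number of multiply-accumulate operations in one forward pass is a constant, independent of the actual input values:

```python
# Fixed architecture: a 32x1024 input batch times a 1024x2048 weight
# matrix, then the 32x2048 hidden activations times a 2048x10 weight
# matrix. Every dimension is a constant, so the op count is too.
BATCH, IN, HID, OUT = 32, 1024, 2048, 10

ops_layer1 = BATCH * IN * HID   # (32x1024) @ (1024x2048)
ops_layer2 = BATCH * HID * OUT  # (32x2048) @ (2048x10)
total_ops = ops_layer1 + ops_layer2

# The count is the same for every possible batch of inputs -> O(1).
print(total_ops)
```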

The key here is to understand that while training a plain vanilla multi-layer neural network, the architecture is fixed and hence the number of operations is fixed. So one single-pass-n-update happens in $O(1)$. However, this doesn't mean the training process itself runs in $O(1)$ time. Why? Let us see how training happens:

```
// Algorithm 1
// assuming the model architecture is already defined
// hyperparameters are fixed
// N is the number of training examples in the dataset
// ... denotes other hyperparameters
model train_function(model, train_dataset, batch_size, epochs, ...) {
    batches = split(train_dataset, batch_size)  // N / batch_size batches
    for (int i = 0; i < epochs; i++) {          // O(1), epochs is fixed
        for each batch in batches {             // O(N)
            model = single-pass-n-update(model, batch, ...)  // O(1)
        }
    }
    return model
}
```

As the number of epochs is also fixed, the epoch loop runs in $O(1)$. The batch loop technically runs in $O(N/batch\_size)$, but as $batch\_size$ is also fixed, we can write it as $O(N)$. So the final time complexity is $O(N)$.
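As a sanity check, here is a runnable Python sketch of Algorithm 1 with a dummy single-pass-n-update that just counts calls (the constants 10 and 32 are illustrative); doubling $N$ doubles the number of updates, confirming $O(N)$:

```python
def count_updates(N, epochs=10, batch_size=32):
    # Dummy trainer: counts single-pass-n-update calls for a
    # dataset of N examples with fixed epochs and batch size.
    updates = 0
    for _ in range(epochs):               # fixed -> O(1)
        for _ in range(N // batch_size):  # O(N / batch_size) = O(N)
            updates += 1                  # stands in for single-pass-n-update
    return updates

print(count_updates(3200))  # 10 * 100 = 1000
print(count_updates(6400))  # doubling N doubles the work -> 2000
```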

**Case 2: Input is dataset+hyperparameters. But the architecture is fixed.**

In this case, the running time of the epoch loop becomes exponential in the size of the input, as explained below. *So the time complexity of the entire training becomes exponential*.

```
model train_function(model, train_dataset, batch_size, epochs, ...) {
    batches = split(train_dataset, batch_size)
    for (int i = 0; i < epochs; i++) {          // O(exp), epochs comes from the input
        for each batch in batches {             // O(N)
            model = single-pass-n-update(model, batch, ...)  // O(...)
        }
    }
    return model
}
```

If you are wondering why the epoch loop runs in exponential time rather than $O(epochs)$, the answer is that some algorithms look polynomial but are not; they have a *pseudo-polynomial running time*. Roughly, the time complexity depends not only on the input but on the *size* (bit-length) of the input as well.

*Pseudo-Polynomial Running Time*

When you use bubble sort, you give the program $n$ numbers as input. The program takes each number one by one. Each number is, say, an 8-bit unsigned integer. We know the time complexity is quadratic, $O(n^2)$, so changing $n$ changes the number of iterations quadratically. Fine till now.

But now comes the catch: if you change the input size, i.e. use 9-bit or 7-bit integers instead of 8-bit, will the number of iterations change? The answer is no. Why? DIY.

However

If you want to create a loop that prints "Hello" $n$ times, you might think the time complexity is $O(n)$, where $n$ is an 8-bit integer supplied by the user. Now ask yourself: if you change the input size, i.e. make $n$ a 9-bit or 7-bit integer instead of 8-bit, will the number of iterations change? The answer is yes, and it will change exponentially. Why? DIY.
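Here is a small Python demonstration (my own illustration) of the worst case: with $n$ set to the largest value representable in a given number of bits, the iteration count roughly doubles with every extra bit:

```python
def hello_iterations(bits):
    # Worst case: n is the largest unsigned integer of `bits` bits,
    # i.e. n = 2**bits - 1, and the loop runs n times.
    n = 2**bits - 1
    count = 0
    for _ in range(n):
        count += 1  # stands in for print("Hello")
    return count

print(hello_iterations(7))  # 127
print(hello_iterations(8))  # 255 -- one extra input bit, about twice the work
```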

In other words, when the expression for the time complexity contains a plain numeric value supplied by the user (rather than the length of the input), the complexity automatically becomes exponential in the bit-length of that number.

If you know a little about cryptography, you might have come across the fact that the Knapsack problem has been used in cryptography. Cryptography relies on problems for which the best known solving algorithms take exponential time; that is what makes it secure. But you might also know that dynamic programming solves the Knapsack problem in $O(nW)$ time. So what is the catch? The catch is that dynamic programming solves it in *pseudo-polynomial time*. Why? Look at the time-complexity expression $O(nW)$: $n$ is the number of items, but $W$ is the capacity of the knapsack, a plain numeric value supplied by the user. Hence the actual time complexity is exponential in the number of bits of $W$. To read more about it, this might help.
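As a concrete illustration, a standard dynamic-programming solution for the 0/1 Knapsack problem (my own sketch) performs on the order of $n \times W$ table updates; since $W$ doubles with every extra bit in its representation, the running time is pseudo-polynomial:

```python
def knapsack(values, weights, W):
    # 1-D DP over capacities: roughly n * W inner updates in total.
    # n = len(values) is the input length, but W is a numeric value
    # from the input -> pseudo-polynomial running time.
    dp = [0] * (W + 1)
    for v, wt in zip(values, weights):
        for w in range(W, wt - 1, -1):  # iterate capacities downwards
            dp[w] = max(dp[w], dp[w - wt] + v)
    return dp[W]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220
```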

**Case 3: Input is dataset+architecture. But hyperparameters are fixed**

Architecture = [ number of input nodes ($n$), number of hidden layers ($h$), [number of hidden nodes in each hidden layer $h_1,h_2,h_3,...,h_h$], number of output nodes ($t$)]

Batch size = constant = $K_b$

Then the epoch loop will run in $O(1)$ and the batch loop will run in $O(N)$.

Coming to single-pass-n-update: the forward pass takes $O(K_b n h_1 + K_b h_1 h_2 + K_b h_2 h_3 + \dots + K_b h_{h-1} h_h + K_b h_h t)$ time. Why? Matrix multiplication.

Ignoring the constant factor $K_b$, the expression can be written as $O(n h_1 + h_1 h_2 + h_2 h_3 + \dots + h_{h-1} h_h + h_h t)$. As you can see, the expression contains $h_1, h_2, h_3, \dots, h_h$, which are plain numbers given by the user. Hence the time complexity of the forward pass becomes exponential, and so the time complexity of single-pass-n-update itself becomes exponential.

```
model train_function(model, train_dataset, batch_size, epochs, ...) {
    batches = split(train_dataset, batch_size)
    for (int i = 0; i < epochs; i++) {          // O(1)
        for each batch in batches {             // O(N)
            model = single-pass-n-update(model, batch, ...)  // O(exp)
        }
    }
    return model
}
```
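To see where the exponential term enters, here is a small Python sketch (a hypothetical helper of my own, with the batch size $K_b$ passed as a constant) that counts the multiply-accumulates of one forward pass following the expression above:

```python
def forward_pass_ops(n, layer_sizes, K_b):
    # Multiply-accumulate count for one forward pass:
    # K_b*n*h1 + K_b*h1*h2 + ... over consecutive layer pairs.
    # n is the input size; layer_sizes lists h1, h2, ..., ending
    # with the output size. These are *values* supplied by the
    # user, so the count is exponential in their bit-length.
    sizes = [n] + layer_sizes
    return sum(K_b * a * b for a, b in zip(sizes, sizes[1:]))

print(forward_pass_ops(1024, [2048, 10], 16))  # 16*1024*2048 + 16*2048*10
```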

**I hope the fourth case is now also clear: the input is dataset+architecture+hyperparameters, and nothing is fixed.**

See also https://qr.ae/TWttzq.

– nbro – 2019-06-27T23:07:30.470