
The minimization problem for SVM can be written as: $$\min_{\theta}\; C\sum_{i = 1}^{m}\left[y^{i}\,\mathrm{cost}_1(\theta^T x^{i}) + (1-y^{i})\,\mathrm{cost}_0(\theta^T x^{i})\right] + \frac{1}{2}\sum_{j = 1}^{n}\theta_j^{\,2}$$

Now, how can the choice of $C$ lead to underfitting or overfitting?

As I understand it, when $C$ is very large the parameters are chosen so that the first term, $C\sum_{i = 1}^{m}{[y^i\,\mathrm{cost}_1(\theta^Tx^i) + (1-y^i)\,\mathrm{cost}_0(\theta^Tx^i)]}$, is driven to $0$, and we then concern ourselves only with the second (regularization) term.
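To check my understanding, here is a small numpy sketch of the objective above. The hinge-style costs $\mathrm{cost}_1(z)=\max(0,1-z)$ and $\mathrm{cost}_0(z)=\max(0,1+z)$, and the toy data, are my own assumptions; the point is that once the first term is exactly $0$, the value of $C$ no longer matters and only the regularizer is left:

```python
import numpy as np

# Assumed hinge-style costs from Ng's formulation:
# cost_1(z) = max(0, 1 - z) for positive examples,
# cost_0(z) = max(0, 1 + z) for negative examples.
def cost_1(z):
    return np.maximum(0, 1 - z)

def cost_0(z):
    return np.maximum(0, 1 + z)

def svm_objective(theta, X, y, C):
    """C * (sum of per-example hinge costs) + 0.5 * ||theta||^2."""
    z = X @ theta
    hinge = np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg = 0.5 * np.sum(theta ** 2)
    return C * hinge + reg

# Toy data: two 1-D points plus a bias feature.
X = np.array([[1.0,  2.0],   # positive example
              [1.0, -2.0]])  # negative example
y = np.array([1, 0])

theta = np.array([0.0, 0.5])  # gives z = +/- 1.0, so every hinge cost is 0

# With zero hinge loss the objective equals the regularizer alone,
# independent of C:
print(svm_objective(theta, X, y, C=1))    # 0.125
print(svm_objective(theta, X, y, C=100))  # 0.125
```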

And Andrew Ng says that a large $C$ leads to lower bias and higher variance.

How does this happen? What is the intuition behind this?
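To make the question concrete, here is a toy sketch (the data, the outlier, and both candidate $\theta$ vectors are made up for illustration). It compares a "simple" $\theta$ that ignores one mislabeled-looking point against a "complex" $\theta$ that classifies everything perfectly; the winner flips as $C$ grows, which seems to match the low-bias/high-variance claim:

```python
import numpy as np

# Assumed hinge-style costs from Ng's formulation.
def cost_1(z):
    return np.maximum(0, 1 - z)

def cost_0(z):
    return np.maximum(0, 1 + z)

def svm_objective(theta, X, y, C):
    z = X @ theta
    hinge = np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    return C * hinge + 0.5 * np.sum(theta ** 2)

# 1-D toy data with a bias feature; x = 1.5 is a negative "outlier"
# sitting on the positive side.
X = np.array([[1.0,  3.0],
              [1.0,  2.0],
              [1.0,  1.5],   # outlier
              [1.0, -2.0],
              [1.0, -3.0]])
y = np.array([1, 1, 0, 0, 0])

theta_simple  = np.array([0.0, 1.0])   # small norm, pays hinge cost on the outlier
theta_complex = np.array([-7.0, 4.0])  # large norm, zero hinge cost on all points

for C in (1.0, 100.0):
    print(C, svm_objective(theta_simple, X, y, C),
             svm_objective(theta_complex, X, y, C))
# With C = 1 the simple theta has the lower objective (ignore the
# outlier: more bias, less variance); with C = 100 the complex theta
# wins (bend the boundary to fit the outlier: less bias, more variance).
```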