To perform a white box adversarial attack, would the use of a numerical gradient suffice?


I am trying to perform a white box attack on a model. Would it be possible to simply use the numerical gradient of the output with respect to the input directly, rather than computing each subgradient of the network analytically? Would this (1) work and (2) actually be a white box attack?

As I would not be using a different model to 'mimic' the results, but would instead be using the same model to get the outputs, am I right in thinking that this would still be a white box attack?


Posted 2020-04-11T11:31:04.957

Where did you read the term "white box adversarial attack"? – nbro – 2020-04-11T19:18:29.140

Here's one example:

– FeedMeInformation – 2020-04-11T19:25:52.040

What exactly do you mean by numerical gradient of the output? If you mean using finite differences, this is generally considered a black-box attack. Usually the 'white box attack' setting assumes that the adversary has full access to the model. The 'black box attack' setting generally assumes that you only have access to outputs, either in the form of logits or (even more restrictive) just the class prediction (i.e. argmax of logits). – Chris Cundy – 2020-04-12T03:27:07.720
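
To make the finite-difference point concrete, here is a minimal sketch of the two ways of getting the gradient. Everything in it is an assumption for illustration: the toy logistic "model", its weights `W` and `B`, the helper names, and the sign-of-gradient step (in the style of FGSM) are hypothetical and do not come from the question. The numerical gradient only ever calls `model(x)`, so it needs query access to the outputs and nothing else; the analytic gradient reads the weights directly, which is what full (white box) access would allow.

```python
import math

# Hypothetical toy model: logistic regression with fixed weights (illustration only).
W = [0.5, -1.2, 0.8]
B = 0.1

def model(x):
    """Return the class-1 probability for input x. This is the only call a query-only attacker needs."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def numerical_gradient(f, x, eps=1e-5):
    """Central finite differences: 2 * len(x) queries to f, no access to model internals."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        x_minus = list(x)
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

def analytic_gradient(x):
    """Gradient computed from the weights: d sigmoid(z)/dx = p * (1 - p) * W."""
    p = model(x)
    return [p * (1.0 - p) * w for w in W]

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

x = [1.0, 2.0, -0.5]
g_num = numerical_gradient(model, x)
g_ana = analytic_gradient(x)

# One sign-of-gradient step that lowers the class-1 probability,
# using only the gradient estimated from output queries.
step = 0.1
x_adv = [xi - step * sign(g) for xi, g in zip(x, g_num)]
```

On a model this small the two gradients agree closely, so the attack step itself works either way. The practical difference is cost and access: the finite-difference estimate takes two model queries per input dimension (so it scales poorly to image-sized inputs), while the analytic gradient takes one backward pass but requires the model's internals.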

No answers