Can Grad-CAM feature maps be used for training?


I am trying to recreate the architecture of the following paper:

Can someone help explain how the feature maps output by Grad-CAM are used in the subsequent conv layers?


