Optimizing the distribution of an unknown input / Data Audit Pattern


I want to develop a new autoencoder. Once the network itself is developed, the customer wants to train and deploy it entirely on his own. For data-protection reasons he cannot give me any information beyond the fact that it will be used on transactional ERP data and that the number of inputs/outputs will be 10.

Usually I would start a project by analysing the data and checking its distribution with histograms. This initial check usually gives me insight into skewness, maximum and minimum values, means and medians.
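To make clear what kind of audit I mean, here is a minimal sketch in plain Python. The function names `skewness` and `audit_column` and the sample values are made up for illustration; they are not taken from the customer's data or from any particular library. Skewness is computed here as the mean of the cubed z-scores (the population form), which is only one of several common definitions.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0  # a constant column has no asymmetry
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def audit_column(values):
    """Collect the summary statistics I would normally look at first."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "skewness": skewness(values),
    }


# Hypothetical column with one large outlier on the right.
sample = [1.0, 2.0, 2.0, 3.0, 50.0]
report = audit_column(sample)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A mean far above the median together with a clearly positive skewness is the pattern I would expect from a right-skewed input.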

The skewness worries me in particular: in my experience, neural networks such as autoencoders perform extremely badly on strongly skewed inputs. Usually I would offset a strongly skewed distribution with a logarithm, a root function or a power transformation.
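As an illustration of the transformations I have in mind, here is a small sketch in plain Python. It assumes non-negative inputs (`log1p` is used so that zeros are allowed), and the exponent 0.25 for the power transformation is an arbitrary choice for the example. The helper names and the sample values are hypothetical.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def log_transform(values):
    # log(1 + x) keeps zeros valid; assumes x >= 0.
    return [math.log1p(v) for v in values]


def root_transform(values):
    # Square root; assumes x >= 0.
    return [math.sqrt(v) for v in values]


def power_transform(values, exponent=0.25):
    # Simple power with exponent < 1; assumes x >= 0.
    return [v ** exponent for v in values]


# Hypothetical right-skewed column.
skewed = [1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 50.0, 400.0]

results = {
    "raw": skewness(skewed),
    "sqrt": skewness(root_transform(skewed)),
    "power": skewness(power_transform(skewed)),
    "log1p": skewness(log_transform(skewed)),
}
for name, value in results.items():
    print(f"{name}: skewness = {value:.3f}")
```

On this sample each transformation lowers the skewness, with the logarithm acting more strongly than the square root. Whether the same holds for data I never get to see is exactly what I cannot check.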

I wonder whether I can apply any of these "anti-skewness" methods blindly to offset "bad data", without having to worry about destroying the performance of my network.


Posted 2016-06-15T19:32:42.153

No answers