Are there any machine learning algorithms for comparing files?


In situations where some chunks of the output data can change from time to time even though the input data has not changed, automating the testing process can be challenging.

My question is: is there any way, using AI in general, to analyze that output data (after training) and return, as a result, the actual and mutual data (the parts that could be tested)?


Posted 2018-12-02T22:25:16.043

Reputation: 43



Yes. Many general-purpose machine learning algorithms can be helpful, and even the ideal choice, as part of a system that compares files. This is because a file can contain anything that can be represented in a finite number of bits for which storage resources exist, and comparisons can be made on any basis imaginable.

Before selecting an algorithm, a set of requirements for the system and an approach toward meeting those requirements should be developed. (The question author may have already completed those phases of development, but nothing about the requirements or approach was expressed.) After the requirements and approach are clarified, a design is necessary before one can intelligently choose an algorithm from the several that may exist for a machine learning implementation of the artificial network portion of the system design.

For instance, if the objective is to determine whether two documents originate from the same sources, then a simple MLP (multilayer perceptron) connected to a word count program can be trained to compare the distribution of words and determine fairly reliably if the two documents are edits of a common upstream document.
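As a rough illustration of the word-count front end only (not the MLP itself), the sketch below builds word-count vectors for two texts and compares them with cosine similarity. The function names `word_counts` and `cosine_similarity` and the tokenizing regular expression are assumptions made for this example; a trained MLP would take counts like these as its input rather than rely on a fixed similarity formula.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in a text (hypothetical tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count vectors, from 0.0 to 1.0."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty document has no word distribution to compare.
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1.0 suggests the two texts share a word distribution, which is the kind of signal the MLP described above would be trained on; the threshold for "edits of a common upstream document" would have to come from labeled examples.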

If the objective is to determine whether two documents occupy particular temporal positions, that is, whether they are particular versions in a series of versions of the same document, then an LSTM-based artificial network may work well to determine their relative order. LSTMs are often used successfully as a component in systems that deal with temporal patterns.

If the goal is to present a color-coded indication of textual differences among stretches of common text, then studying meld and other open source programs that do this, and perhaps joining those projects, would be excellent preparation for producing an AI-based improvement to those existing applications.
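For a sense of the non-AI baseline such tools start from, here is a minimal sketch using Python's standard `difflib` module. It is not how meld is implemented; the helper name `line_diff` and the `old`/`new` labels are assumptions for illustration.

```python
import difflib


def line_diff(old_text, new_text):
    """Return unified-diff lines describing how old_text became new_text."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",   # hypothetical label for the first document
        tofile="new",     # hypothetical label for the second document
        lineterm="",
    ))


if __name__ == "__main__":
    for line in line_diff("a\nb\nc", "a\nB\nc"):
        print(line)
```

Lines starting with `-` were removed and lines starting with `+` were added, so a viewer could color them accordingly. Identical inputs produce an empty list.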

In all these cases, first the design and then the algorithm would flow from the objective to be achieved by the machine learning or other AI component. Knowledge of algorithms may assist in system design, but the algorithm is rarely, by itself, a solution. It is normally an element in an overall design of a solution. An artificial network is more than an algorithm because of initialization, data normalization, sample procurement, the setting of hyper-parameters, the variations in the way the algorithm is implemented in different libraries, how the data is represented, and several other variants. The artificial network is also less than the entire system. In all of the above the network is a component of a larger system design and a portion of a larger process that can include training, testing, and alteration of the design after results are analyzed.

At the extremes, machine learning is not a well-suited solution for document comparison.

At one extreme, if the goal is to determine whether the two documents say similar things about some topic, then no algorithm available to the public at the time of this writing can develop a cognitive model of what each document means in the context of the arbitrary set of knowledge domains discussed within the two documents, and then rationally and reliably determine the strength and density of agreements and dissents between them.

At the other extreme, if the goal is to determine whether the two documents are identical, a shell script that first checks the file sizes and, if they are equal, compares the checksums may suffice.
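The same size-then-checksum idea can be sketched in Python instead of shell. The function names, the choice of SHA-256, and the chunk size are assumptions for this example, not something specified above.

```python
import hashlib
import os


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Cheap size check first; compare checksums only when sizes match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_sha256(path_a) == file_sha256(path_b)
```

Checking the size first avoids reading either file when they obviously differ. Matching checksums are treated here as identity, which is the usual practical assumption with a strong hash.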

Douglas Daseeco

Posted 2018-12-02T22:25:16.043

Reputation: 7 174

1Thank you, sir, for the elaborative information, and the proposed ideas. This will help me a lot in the upcoming projects. If you have any other useful resources, I would be even more thankful. – james – 2018-12-03T06:13:06.520

has links in the middle. presents RBMs in this context. Please give them an up vote if they help. From the terms in these 3 answers, a simple search of scholarly articles will provide theoretical basis, some articles providing algorithm names, but performing requirements analysis & design with diligence 1st will likely save time and produce a much more satisfactory result. – Douglas Daseeco – 2018-12-03T20:44:58.017

1Thank you again for the sufficient information. – james – 2018-12-04T20:56:33.223