Correlation Between Entries
The first recommendation is to ensure that appropriate warning and informational entries in the log file are presented to the machine learning components of the solution along with errors. All log entries are potentially useful input data if it is possible that there are correlations between informational messages, warnings, and errors. Sometimes the correlation is strong and therefore critical to maximizing the learning rate.
System administrators often experience this as a series of warnings followed by an error caused by the condition indicated in the warnings. The information in the warnings is more indicative of the root cause of failure than the error entry created as the system or a subsystem critically fails.
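As a rough illustration of the kind of correlation worth preserving, a first pass can simply count how often each warning type appears shortly before an error. A minimal sketch, assuming entries have already been parsed into time, level, and kind fields; the field names and the ten-minute window are assumptions for illustration:

```python
from collections import Counter
from datetime import timedelta

def warning_error_correlation(entries, window=timedelta(minutes=10)):
    """entries: list of dicts with 'time' (datetime), 'level', and 'kind' keys,
    sorted by time. Returns a Counter of warning kinds seen within `window`
    before each error."""
    counts = Counter()
    for i, entry in enumerate(entries):
        if entry["level"] != "ERROR":
            continue
        # Walk backwards, collecting warnings that fall inside the window.
        for prior in reversed(entries[:i]):
            if entry["time"] - prior["time"] > window:
                break
            if prior["level"] == "WARNING":
                counts[prior["kind"]] += 1
    return counts
```

Warning kinds that dominate this count for a given error type are candidates for the root-cause and early-warning displays discussed below.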
If one is building a system health dashboard for a piece of equipment or an array of machines that inter-operate, which appears to be the case in this question, then the root cause of problems and some early-warning capability are key information to display.
Furthermore, not all poor system health conditions end in failure.
The only log entries that should be eliminated by filtration prior to presentation to the learning mechanism are ones that are surely irrelevant and uncorrelated. This may be the case when the log file is an aggregation of logging from several systems. In such a case, entries for the system being analyzed should be extracted and isolated from entries that could not possibly correlate with the phenomena being analyzed.
It is important to note that limiting analysis to one entry at a time vastly limits the usefulness of the dashboard. The health of a system is not equal to the health indications of the most recent log entry. It is not even the linear sum of the health indications of the most recent N entries.
System health has a very nonlinear and very temporally dependent relationship with many entries. Patterns can emerge gradually over the course of days on many types of systems. The base (or a base) neural net in the system must be trained to identify these nonlinear indications of health, impending dangers, and risk conditions if a highly useful dashboard is desired. To display the likelihood of an impending failure or quality control issue, an entire time window of log entries of considerable length must enter this neural net.
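A minimal sketch of that last point, assuming each entry has already been reduced to a fixed-length feature vector: a sliding window of consecutive entries is fed to a small recurrent net that emits a health/risk score for the whole window. The layer sizes and the choice of an LSTM are illustrative assumptions, not a prescription:

```python
import torch
import torch.nn as nn

class WindowHealthNet(nn.Module):
    def __init__(self, features_per_entry=32, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(features_per_entry, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # single health / risk score

    def forward(self, window):
        # window: (batch, entries_in_window, features_per_entry)
        _, (h_n, _) = self.rnn(window)
        return torch.sigmoid(self.head(h_n[-1]))  # (batch, 1), in [0, 1]

# Example: score a window of the 500 most recent entries.
net = WindowHealthNet()
window = torch.randn(1, 500, 32)   # stand-in for real per-entry features
risk = net(window)
```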
Distinction Between Known and Unknown Patterns
Notice that the identification of known patterns differs in one important respect from the identification of new patterns. The idiosyncrasies of the entry syntax of known errors have already been identified, considerably reducing the learning burden in the input normalization stages of processing for those entries. The syntactic idiosyncrasies of new error types must be discovered first.
The entries of a known type can also be separated from those that are unknown, enabling the use of known entry types as training data to help in the learning of new syntactic patterns. The goal is to present syntactically normalized information to semantic analysis.
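A hedged sketch of that known/unknown split: entries matching an already learned template are routed one way, and everything else is held out as raw material for discovering new syntactic patterns. The template names and regexes below are invented examples, not patterns from any particular system:

```python
import re

KNOWN_TEMPLATES = {
    "disk_full":  re.compile(r"^ERROR .* disk (?:full|usage at \d+%)"),
    "conn_reset": re.compile(r"^WARNING .* connection reset by peer"),
}

def split_known_unknown(lines):
    """Route lines matching a known template to `known`; hold the rest out
    as candidates for new-pattern discovery."""
    known, unknown = [], []
    for line in lines:
        for name, pattern in KNOWN_TEMPLATES.items():
            if pattern.search(line):
                known.append((name, line))
                break
        else:
            unknown.append(line)
    return known, unknown
```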
First Stage of Normalization Specific to Log Files
If the time stamp is always in the same place in entries, converting it to relative milliseconds and perhaps removing any 0x0d characters before 0x0a characters can be done before anything else as a first step in normalization. Stack traces can also be folded up into tab delimited arrays of trace levels so that there is a one-to-one correspondence between log entries and log lines.
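A minimal sketch of that first normalization stage, assuming an ISO-8601 time stamp at the start of every entry and Java-style tab-indented stack-trace continuation lines; both assumptions are for illustration only:

```python
from datetime import datetime

def normalize(raw_text):
    lines = raw_text.replace("\r\n", "\n").split("\n")  # drop 0x0d before 0x0a
    entries, t0 = [], None
    for line in lines:
        if line.startswith("\tat ") or line.startswith("    at "):
            # Fold stack-trace lines into the previous entry, tab delimited,
            # so entries and lines correspond one-to-one.
            if entries:
                entries[-1] += "\t" + line.strip()
            continue
        if not line.strip():
            continue
        stamp, _, rest = line.partition(" ")
        t = datetime.fromisoformat(stamp)
        if t0 is None:
            t0 = t
        rel_ms = int((t - t0).total_seconds() * 1000)    # relative milliseconds
        entries.append(f"{rel_ms}\t{rest}")
    return entries
```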
The syntactically normalized information arising out of both known and unknown entries, of error and non-error types, can then be presented to unsupervised nets for the naive identification of categories of semantic structure. We do not want to categorize numbers or text variables such as user names or part serial numbers.
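As a stand-in for such an unsupervised net, the following sketch uses ordinary TF-IDF vectors and k-means to discover naive categories, assuming variable fields have already been masked as described in the next paragraph; the cluster count is a guess that would have to be tuned or discovered:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def discover_categories(masked_entries, n_categories=20):
    """masked_entries: list of normalized entry strings with variables
    replaced by placeholders. Returns one naive category id per entry."""
    vectors = TfidfVectorizer(token_pattern=r"\S+").fit_transform(masked_entries)
    return KMeans(n_clusters=n_categories, n_init=10).fit_predict(vectors)
```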
If the syntactically normalized information is appropriately marked to indicate highly variable symbols such as counts, capacities, metrics, and time stamps, feature extraction may be applied to learn the expression patterns in a way that maintains the distinction between semantic structure and variables. Maintaining that distinction permits the tracking of more continuous (less discrete) trends in system metrics. Each entry may have zero or more such variables, whether known a priori or recently acquired through feature extraction.
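A hedged sketch of that marking step: highly variable symbols are replaced by typed placeholders while their values are kept on the side, so semantic structure and variables stay distinct. The placeholder names and regexes are illustrative assumptions:

```python
import re

VARIABLE_PATTERNS = [
    ("<TIMESTAMP>", re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*")),
    ("<HEX>",       re.compile(r"0x[0-9a-fA-F]+")),
    ("<NUM>",       re.compile(r"\b\d+(?:\.\d+)?\b")),
]

def mask_variables(entry):
    """Return (masked_entry, variables) where variables is a list of
    (placeholder, original_value) pairs extracted from the entry."""
    variables = []
    for placeholder, pattern in VARIABLE_PATTERNS:
        for match in pattern.findall(entry):
            variables.append((placeholder, match))
        entry = pattern.sub(placeholder, entry)
    return entry, variables
    # e.g. "disk usage at 93%" -> ("disk usage at <NUM>%", [("<NUM>", "93")])
```

The extracted values are what allow the more continuous trends mentioned next to be tracked separately from the discrete category of the entry.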
Trends can be graphed against time or against the number of instances of a particular kind. Such graphics can assist in the identification of mechanical fatigue, the approach of over-capacity conditions, or other risks that escalate to a failure point. Further neural nets can be trained to produce warning indicators when the trends indicate such conditions are impending.
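One such trend check, sketched under the assumption that a single capacity-like metric has already been extracted per entry: fit an ordinary least-squares slope over recent values and warn when the trend would cross a limit within a chosen horizon. The thresholds below are invented for illustration:

```python
import numpy as np

def capacity_warning(times_ms, values, limit=100.0, horizon_ms=86_400_000):
    """Fit value = slope*t + intercept over the recent window and warn if the
    trend crosses `limit` within `horizon_ms` (one day by default)."""
    t = np.asarray(times_ms, dtype=float)
    v = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(t, v, 1)
    if slope <= 0:
        return False                      # flat or improving trend
    t_cross = (limit - intercept) / slope  # when the fitted line hits the limit
    return t_cross - t[-1] <= horizon_ms

# e.g. capacity_warning([0, 3_600_000, 7_200_000], [70.0, 80.0, 91.0]) -> True
```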
All of this log analysis would be moot if software architects and technology officers stopped leaving the storage format of important system information to the varying convenient whims of software developers. Log files are generally a mess, and the extraction of statistical information about patterns in them is one of the most common challenges in software quality control. The likelihood that rigor will ever be universally applied to logging is small since none of the popular logging frameworks encourage rigor. That is most likely why this question has been viewed frequently.
Requirements Section of This Specific Question
In the specific case presented in this question, requirement #1 indicates a preference to run the analysis in the browser, which is possible but not recommended. Even though ECMA is a wonderful scripting language, and its built-in regular expression machinery can be a help in learning parsers (which complies with the other part of requirement #1, requiring no additional installations), un-compiled languages are not nearly as efficient as Java. And even Java is not as efficient as C, because of garbage collection and the inefficiencies that arise from delegating the mapping of byte code to machine code to run time.
Much experimentation in machine learning employs Python, another wonderful language, but most of the work I've done in Python was later ported to computationally efficient C++, for gains of nearly 1,000 to one in speed in many cases. Even C++ method lookup was a bottleneck, so the ports use very little inheritance, ECMA style, but are much faster. In typical kernel code, traditional C structures and function pointers eliminate vtable overhead.
The second requirement, modular handlers, is reasonable and implies a triggered-rule environment that many may be tempted to think is incompatible with NN architectures, but it is not. Once pattern categories have been identified, looking for the most common ones first in further input data is already implied by the known/unknown distinction embedded in the process above. There is a challenge with this modular approach, however.
Because system health is often indicated by trends and not single entries (as discussed above) and because system health is not a linear sum of the health value of individual entries, the modular approach to handling entries should not just be piped to the display without further analysis. This is in fact where neural nets will provide the greatest functional gains in health monitoring. The outputs of the modules must enter a neural net that can be trained to identify these nonlinear indications of health, impending dangers, and risk conditions.
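A sketch of how modular handlers can coexist with such a net: each handler turns entries it recognizes into a small numeric contribution, and the per-window results are collected and handed to the trained net rather than routed straight to the display. The handler names and features are illustrative assumptions:

```python
HANDLERS = {}

def handler(name):
    """Decorator that registers a modular entry handler under a name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("disk")
def disk_handler(entry):
    return [1.0] if "disk full" in entry else [0.0]

@handler("net")
def net_handler(entry):
    return [1.0] if "connection reset" in entry else [0.0]

def window_features(entries):
    # One row per entry, one column per handler; this matrix (or a tensor built
    # from it) is what feeds the window-level neural net, not the display.
    return [[HANDLERS[name](e)[0] for name in sorted(HANDLERS)] for e in entries]
```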
Furthermore, the temporal aspect of pre-failure behavior implies that an entire time window of log entries of considerable length must enter this net. This further implies the inappropriateness of ECMA or Python as a choice for the computationally intensive portion of the solution. (Note that the trend in Python is to do what I do with C++: use object-oriented design, encapsulation, and easy-to-follow design patterns for supervisory code, and very computationally efficient kernel-like code for the actual learning and other computationally or data intensive functions.)
It is not advisable to pick algorithms in the initial stages of architecture (as was implied at the end of the question). Architect the process first. Determine the learning components, the types of them needed, their goal state after training, where reinforcement can be used, and how the wellness/error signal will be generated to reinforce or correct desired network behavior. Base these determinations not only on the desired display content but on expected throughput, computing resource requirements, and the minimal effective learning rate. Algorithms, language, and capacity planning for the system can only be meaningfully selected after all of those things are at least roughly defined.
Similar Work in Production
Simple adaptive parsing is running in the lab here as a part of social networking automation, but only for limited sets of symbols and sequential patterns. It does scale without reconfiguration to an arbitrarily large set of base linguistic units, prefixes, endings, and suffixes, limited only by our hardware capacities and throughput. The existence of regular expression libraries was helpful in keeping the design simple. We use the PCRE version 8 series library fed by an anisotropic form of DCNN for feature extraction from a window moving through the input text, with configurable window size and move increment. Heuristics applied to input text statistics gathered in a first pass produce a set of hypothetical PCREs arranged in two layers.
Optimization occurs to apply higher probabilistic weights to the best PCREs in a chaotically perturbed text search. It uses the same gradient descent convergence strategies used in NN back propagation in training. It is a naive approach that does not make assumptions like the existence of back-traces, files, or errors. It would adapt equally to Arabic messages and Spanish ones.
The output is an arbitrary directed graph in memory, which is similar to a dump of an object oriented database.
قنبلة -> dangereux -> 4anlyss
bomba -> dangereux
ambiguïté -> 4anlyss -> préemption -> قنبلة
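A minimal sketch of holding such a directed graph in memory as an adjacency map, using the node labels from the example output above:

```python
from collections import defaultdict

graph = defaultdict(set)
for src, dst in [("قنبلة", "dangereux"), ("dangereux", "4anlyss"),
                 ("bomba", "dangereux"), ("ambiguïté", "4anlyss"),
                 ("4anlyss", "préemption"), ("préemption", "قنبلة")]:
    graph[src].add(dst)   # edge src -> dst
```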
Although a re-entrant algorithm for a reinforcement version is stubbed out and the wellness signal is already available, other work preempted further development of the adaptive parser and the next step of applying the work to natural language: matching the directed graphs to persisted directed-graph filters representing ideas, which would mimic the idea-recollection aspect of language comprehension.
The system has components and a process architecture similar to those needed for the log analysis problem, and it proves the concepts listed above. Of course, the more disorganization there is in the way logging is done between the developers of the system doing the logging, the more difficult it is for a human or artificial agent to disambiguate the entries. Some system logging has been so poorly quality controlled for so long that the log is nearly useless.