Using feature learning for a medical text classification problem


I'm currently working with the CHILDES corpus, trying to build a classifier that distinguishes children who suffer from specific language impairment (SLI) from those who are typically developing (TD).

In my reading I noticed that no convincing set of features for distinguishing the two groups has been discovered yet, so I came up with the idea of creating a feature learning algorithm that could potentially find better ones.

Is this possible? If so, how do you suggest I approach it? From the reading I have done, most feature learning work has been applied to image processing. Another problem is that my dataset is potentially too small to make this work (in the hundreds) unless I find a way to get more transcripts from children.


Posted 2016-08-30T09:20:03.783

Reputation: 39

Question was closed 2016-08-30T13:39:56.780

Hello, welcome to Ai.SE! I recommend that you take a look at the [tour], and maybe have a look at [meta] as well. Hope to see you around! – Mithical – 2016-08-30T09:35:14.640


This site isn't meant for programming/implementation questions in AI. We have many other programming, statistics, cs, and data science sites for this. This site is intended for conceptual questions dealing more in the social and scientific aspects of this subject. See How do we quickly describe our site?

– Robert Cartaino – 2016-08-30T13:43:06.143

That's being disingenuous. There's clearly still debate in regards to the issue of whether or not some programming/implementation questions should be allowed. Personally, I think this question should be considered on-topic. – mindcrime – 2016-09-01T20:33:41.533



Having just looked through a few entries from the corpus, I'd personally be skeptical of the applicability of any naive approaches.

  1. Particularly in light of your small training set, I'd recommend that whatever method you use be able to produce human-readable explanations of the classifier it builds (e.g. decision trees, learning classifier systems, genetic programming). This allows 'common sense' tuning of the classifier and avoids the danger of overfitting to the training set through black-box parameter optimization.

  2. Rather than throwing the entire 'bag of words' at a classifier and hoping that an appropriate set of features will be extracted, you should first consider what kind of criteria you as a human being might use to make that decision, and how you might pre-process the transcripts to produce features (e.g. syllable length, metrics from ConceptNet, etc.) that are as close to these criteria as reasonably possible.

  3. Having used some human intuition to obtain a reasonable set of feature primitives, you can then build your classifier and obtain higher-level expressions that discriminate between the two groups.
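
To make the three points above concrete, here is a minimal sketch in plain Python, not a recommended pipeline. It computes a few hand-picked, interpretable features per transcript and fits a one-level decision tree (a "decision stump") whose rule can be read directly. Everything named here is an assumption for illustration: the features (`mean_utterance_length`, `type_token_ratio`, `mean_word_length`), the helper functions, and especially the toy transcripts, which are invented so the code runs and are neither taken from CHILDES nor clinically meaningful.

```python
from statistics import mean


def extract_features(utterances):
    """Map a list of utterance strings to a dict of interpretable features."""
    tokens = [u.lower().split() for u in utterances]
    all_words = [w for t in tokens for w in t]
    return {
        "mean_utterance_length": mean(len(t) for t in tokens),
        "type_token_ratio": len(set(all_words)) / len(all_words),
        "mean_word_length": mean(len(w) for w in all_words),
    }


def fit_stump(rows, labels):
    """Return the single feature/threshold rule with the fewest training errors."""
    best = None
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            # Try both label orientations for the two sides of the split.
            for below, above in (("SLI", "TD"), ("TD", "SLI")):
                errors = sum(
                    (below if r[name] <= threshold else above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, name, threshold, below, above)
    return best


def describe(stump):
    """Render the learned rule as a human-readable sentence."""
    errors, name, threshold, below, above = stump
    return (f"if {name} <= {threshold:.2f} then {below} else {above} "
            f"({errors} training errors)")


# Invented toy transcripts (hypothetical, NOT real CHILDES data).
toy = [
    (["want ball", "me go", "dog run"], "SLI"),
    (["he go there", "want that", "no ball"], "SLI"),
    (["i want the red ball", "the dog is running fast",
      "can we go outside now"], "TD"),
    (["she gave me a cookie", "where did the cat go",
      "i like this one"], "TD"),
]

rows = [extract_features(utterances) for utterances, _ in toy]
labels = [label for _, label in toy]
stump = fit_stump(rows, labels)
print(describe(stump))
```

Because the learned rule is a single readable comparison, it can be checked against domain knowledge before being trusted, which matters with only a few hundred examples. With real data you would evaluate on held-out transcripts (e.g. cross-validation) rather than on training error, and could grow the stump into a shallow tree.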


Posted 2016-08-30T09:20:03.783

Reputation: 6 685