Content analysis tools


We want to figure out the connections between people based on their speech. Assume that a conversation is like a poem whose lines belong to different characters. There are many such poems, and the lines are mixed together. We now want to determine the conversation to which each line belongs. We assume that people in a conversation use similar words (their vocabularies should be similar). This means there is a correlation between the words of person A and the words of person B, so we could detect a connection between people who had a conversation. What are the next steps for content understanding after NLP? Can anyone advise us on the field of study and the tools/libraries that deal with content processing? Perhaps someone knows good articles or online resources that could help us dive into this field.


Posted 2018-11-29T08:18:19.443

Reputation: 11

The bottleneck in content analysis is called grounding: the shared knowledge of two persons during a dialogue. In classical NLP, grounding is hard to solve; an easier task is to switch to characters in virtual reality. In a fairy-tale game, the shared knowledge is modeled within the game itself: places, feelings, and experiences. If two characters are in a conversation, it is possible to parse, interpret, and predict the speech acts. – Manuel Rodriguez – 2018-11-29T09:24:26.673



Assembling groups of messages with the objective of assembling conversations from a mix of them has process features in common with reconstituting a drive with overwritten indexing information, reassembling broken items that have been intermixed, or transcribing music with appropriate assignments of notes to instruments in the staff based on frequency spectra. The goal in all cases is to develop associations that are no longer directly observable.

The question explicitly states content analysis as the intent, does not mention voice recognition as part of the problem, and says, "after NLP." From a system analysis perspective, we can assume that the input is text rather than voice, and that the system is based on the use of language, not of speech.

The use of vocabulary to drive the reassembly is reasonable but does not draw on important additional hints. A hint is a probabilistic association, and vocabulary has too high an entropy to be a primary determinant of message association. Phrases encapsulated in messages that are adjacent to one another in a single conversation share a topic. Within the chronology of messages in a conversation there are unidirectional causal references, such that the message sequence in reverse order would lack coherence as a conversation. Although people in the same group share vocabulary, in a large pool of messages vocabulary will be shared across many conversations, so it alone will not be a sufficient set of hints in many of the cases where vocabulary, topic, and reference-based hints together are.

We want to define [the] conversation [into] which each [message] belongs.

Consider a set of $C$ conversations, $\mathbb{C}: \{c_1, c_2, ..., c_C\}$ and a set of $M$ messages, $\mathbb{M}: \{m_1, m_2, ..., m_M\}$. Each message has text encoded in UTF-8 or some other encoding. The intention is to develop sequences such that the following has a certain probability of being mostly correct; complete certainty would be unusual if it occurs at all.

The distribution can be formalized by defining $P(i, j, \ell)$ as the probability that message $m_j$ is at location index $\ell$ in conversation $c_i$. The set of probabilities that a message belongs at particular locations in a conversation is not easy to determine directly from the message content.

$$\Big\{P(i, j, 1), \, P(i, j, 2), \; \dots \Big\} = f(\mathbb{C}, \mathbb{M}, i, j)$$

However, if we define a conversation as a sequence of at least one message, then for any given proposed association mapping of all messages, each message belongs to exactly one location in one conversation. If the bounds of the data are clean, in that no conversation is incompletely represented in the messages, and the length of each conversation $c_i$ is $s_i$, we can determine more.

$$\forall \; j \in [1, M] \, , \; \sum_{i=1}^C \sum_{\ell=1}^{s_i} P(i, j, \ell) = 1$$
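As a concrete sketch of this constraint (the conversation sizes, score values, and the function name `normalize_per_message` here are invented for illustration), raw association scores can be normalized so that each message's probabilities over all conversations and locations sum to one:

```python
from collections import defaultdict

def normalize_per_message(scores):
    """Normalize raw scores so that, for each message j, the
    probabilities P(i, j, l) over all conversations i and
    locations l sum to 1."""
    totals = defaultdict(float)
    for (i, j, l), s in scores.items():
        totals[j] += s
    return {(i, j, l): s / totals[j] for (i, j, l), s in scores.items()}

# Hypothetical raw scores: two messages, two conversations,
# keyed as (conversation i, message j, location l).
raw = {
    (1, 1, 1): 2.0, (1, 1, 2): 1.0, (2, 1, 1): 1.0,
    (1, 2, 1): 0.5, (2, 2, 1): 1.5,
}
probs = normalize_per_message(raw)
# Each message's probability mass now totals 1 across all
# (conversation, location) candidates.
assert abs(sum(p for (i, j, l), p in probs.items() if j == 1) - 1.0) < 1e-9
```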

Training using this model involves finding an algorithm that converges to a full set of probabilities.

If some conversations have already been assembled from the data set, then supervised learning is possible, probably using an LSTM network type, but convergence must be on some basis, represented by a loss function. In this case the objective is for the array of probabilities for a given message to diverge such that one entry is markedly more probable than the others, so the loss is the converse of that confidence.
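The answer does not name a specific loss; one common choice consistent with "one probability markedly more probable than the others" is the Shannon entropy of the per-message distribution, which is zero for a one-hot assignment and maximal when the model is undecided. A minimal sketch under that assumption:

```python
import math

def assignment_loss(probs):
    """Entropy of a message's probability distribution over
    (conversation, location) candidates. Low entropy means one
    candidate dominates (confident assignment); high entropy
    means the distribution is close to uniform (undecided)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]
undecided = [0.25, 0.25, 0.25, 0.25]
# Training should drive distributions toward the confident shape.
assert assignment_loss(confident) < assignment_loss(undecided)
```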

A direct naive Bayes approach may also be applied. Without more information and some experimentation, it is not clear which approach would be best.

The primary drivers for the probabilities were listed above (vocabulary, topic, references). Unless cognition is one of the AI capabilities contained in the system (which would require a time machine), probabilities related to vocabulary, topic, and causal references must be based on three more directly available quantities.

  • Word matches
  • Frequency of the word in the overall data set or the language used
  • Adjacency of words
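The three quantities above can be combined into a single pairwise score; the sketch below (the function name `association_score` and the weighting scheme are illustrative assumptions, not the answer's method) down-weights shared words by their overall frequency and adds a bonus for shared adjacent word pairs:

```python
from collections import Counter

def association_score(msg_a, msg_b, corpus_freq):
    """Combine the three hints: shared word matches, down-weighted
    by each word's frequency in the overall data set, plus a bonus
    for shared adjacent word pairs (bigrams)."""
    a, b = msg_a.lower().split(), msg_b.lower().split()
    shared = set(a) & set(b)
    # Rare shared words count for more than common ones.
    word_score = sum(1.0 / corpus_freq.get(w, 1) for w in shared)
    bigrams = lambda ws: set(zip(ws, ws[1:]))
    adjacency_score = 2.0 * len(bigrams(a) & bigrams(b))
    return word_score + adjacency_score

# Hypothetical corpus frequencies: "the" is common, "mail"/"box" rare.
freq = Counter({"the": 1000, "he": 800, "mail": 5, "box": 7, "roll": 12})
s13 = association_score("he rolled over the mail box", "the mail box was fragile", freq)
s12 = association_score("my son made the honor roll", "the mail box was fragile", freq)
# The adjacent rare pair "mail box" links the first and third messages.
assert s13 > s12
```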

These cases may help illuminate the above.

Yea, he would have been a definite if the mail box wasn't so fragile.

My son just made the honor roll.

When Joshua rolled over the mail box, it kinda killed his chances of going.

The AI system does not need to know that those speaking in the first and third message are youth planning an event and Joshua's use of the car is restricted to begin to assemble a conversation $c_i$ that includes high probabilities for $m_3$ at $\ell$ and $m_1$ at $\ell + 1$. They are probable messages in the same conversation and likely in reverse chronological order.

Such likelihood results would not be on the basis of the definite article the, which appears too frequently in the language to be valuable as a probabilistic determinant. The second message is clearly not in the same conversation as the other two, but contains the and roll, the latter as part of the two-word noun honor roll (whose meaning cannot easily be derived by applying honor as an adjective to the word roll, now that the noun roll is rarely used by itself). The past-tense verb rolled comprises the linguistic elements roll and the past-tense ending -ed. The element roll could therefore be a counterproductive determinant of association in this case. The pronoun he is also too common to strongly indicate association.

The word pair mail box is a combination of two words, each with much lower frequency in the language than the or he. In combination, they are a strong determinant of topic. This is where adjacency demonstrates its probabilistic importance. Whether NLP returns mail box, mail-box, or mailbox cannot always be known. Colloquial abstractions like smell a rat may never contract into new words like smellarat. Even if one did, it would be a stronger indicator than suspect him, but weaker than a reference involving an unusual proper name like suspect Jalisia.
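One simple way to cope with the fact that the tokenizer's output form cannot always be known is to collapse spacing and hyphenation variants before comparison; this is an illustrative sketch (the helper name `normalize_compound` is invented), not a complete solution:

```python
def normalize_compound(tokens):
    """Collapse spacing/hyphenation variants so that 'mail box',
    'mail-box', and 'mailbox' all compare equal."""
    return "".join(t.replace("-", "").lower() for t in tokens)

# All three surface forms reduce to the same canonical key.
assert (normalize_compound(["mail", "box"])
        == normalize_compound(["mail-box"])
        == normalize_compound(["mailbox"]))
```

Naively joining every adjacent pair would over-merge ordinary phrases, so in practice such normalization would be restricted to candidate compounds identified by frequency statistics.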

As sophistication is developed and the volume of training data increases, it may be possible for the AI system, without full cognition, to recognize the word triplet would-have-been as a causal hint that increases the likelihood of the reference being in reverse from the order depicted above.

There are tools and libraries that perform reconstitution of conversations, corrupted drives, and sheets of paper from bags of shreds, but they are company confidential, not open source. Searches for semantic reconstitution, conversation reconstruction, and various other synonymous and terminological permutations were not fruitful in either academic or general searches.

Legal constraints require consulting practitioners to reassemble such systems from scratch. Eventually, intellectual property bleeds out into general knowledge in a field, but at the time of this writing each project is a pioneering one.

Nonetheless, it is sometimes surprising how different each implementation becomes, even when developed by the same AI engineers. Initial attempts fail, further requirements are gathered, adaptations are made to data availability challenges, and integration with corporate databases, applications, processes, and conventions places constraints on the system. Quality and acceptability expectations often drive development far from past solutions that worked, especially where the statistical profile of the pool of constituents differs.

Douglas Daseeco

Posted 2018-11-29T08:18:19.443

Reputation: 7 174


These are really two questions -
1. How to assign lines in a dialogue to particular speakers
2. How to analyse the content of a speaker's utterances

I will attempt to answer the first one.

In discourse analysis (the field that studies conversations), there is the concept of adjacency pairs, linked utterances which typically follow on to each other. A well-formed conversation can be segmented into a series of adjacency pairs, such as greeting/greeting, question/answer/feedback (that's more of an 'adjacency triple'), statement/comment, farewell/farewell. Sometimes it's a bit more tricky, such as question/counter-question/answer/answer:

young person: What beers do you have? (Q)
bartender: How old are you? (Counter-Q)
young person: 21 (A)
bartender: We have a lager or a nice IPA. (A)

I don't know if this has been tried before, but it should be comparatively easy to identify the types of speech acts in a conversation and join them up to form relevant adjacency pairs. This would be a lot more reliable than looking at vocabulary, especially since conversations mostly use higher-frequency words which are part of everyone's active vocabulary.

You would need to take some conversations, classify the individual lines as speech acts (building up an inventory as you do that), and then identify adjacency pairs. This will give you the structure of the conversation, and should also allow you to assign speakers to utterances (assuming a dialogue, ie only two participants). Note that it is not always purely form that determines the speech act: How are you? is a question on the surface, but its function is part of a greeting sequence.
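As a first step toward such an inventory, crude surface-form rules can label a few speech act types; the sketch below (the function name `classify_speech_act` and the rule set are illustrative assumptions, and as the answer notes, a real classifier would need training on annotated dialogue because form alone does not determine function):

```python
def classify_speech_act(utterance):
    """Crude surface-form rules for a few speech act types.
    Greeting formulas are checked before the question-mark rule
    because 'How are you?' functions as a greeting despite its
    interrogative form."""
    u = utterance.strip().lower()
    if any(u.startswith(g) for g in ("hi", "hello", "how are you")):
        return "greeting"
    if u.endswith("?"):
        return "question"
    if any(u.startswith(f) for f in ("bye", "goodbye", "see you")):
        return "farewell"
    return "statement"

# The bar dialogue from the answer above, in order.
dialogue = [
    "What beers do you have?",
    "How old are you?",
    "21",
    "We have a lager or a nice IPA.",
]
acts = [classify_speech_act(u) for u in dialogue]
assert acts == ["question", "question", "statement", "statement"]
```

Pairing the labeled acts into adjacency pairs (question/counter-question/answer/answer here) would then follow as a separate matching step over the sequence.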

This conversational structure could also assist you in looking at the second part, the content. If you know someone asks questions, and the other participant answers them, then you know exactly where to look for salient utterances that might give you clues as to the topic of the conversation. At least it should enable you to ignore the 'housekeeping' utterances like greetings and farewells etc.

Your second question is hard to answer without knowing more about the purpose of your project. I would rephrase this and ask it as a separate question. People would need to know what kind of conversations they are, and what exactly you want to get out of it.

Oh, and I wouldn't compare conversations with poetry. Completely different things! :)

Oliver Mason

Posted 2018-11-29T08:18:19.443

Reputation: 3 755