What classification techniques could be used to classify a tree of webpages, given the category of each webpage?


I want to perform a website classification task where I have modeled a website as a tree of webpages. I already have a model which can assign categories to the nodes in the tree (webpages). I need guidance on how I can combine those node categories to get the overall category of the tree. What classification techniques could be used? So far I only know about the use of Markov chains for this task, from the research paper Web Site Mining. I would be grateful for some more ideas about how to perform the task.

Samyak Jain

Posted 2018-05-11T07:22:28.283

Reputation: 41

Do you have any websites which you or some other people have already assigned a label for the overall tree? – JahKnows – 2018-05-11T07:50:58.350

Yes I have labelled data for domains. – Samyak Jain – 2018-05-11T07:54:07.460

And you can already automatically label the nodes (webpages) of your tree? – JahKnows – 2018-05-11T07:57:24.740

Yes, I have a model that assigns categories to a node from a predefined set of categories, and I also have labeled data for the domains, where each label is the domain category. For now I need some suggestions on how to combine the node categories to generate an overall category. – Samyak Jain – 2018-05-11T08:00:27.543



There are of course a number of ways this can be done, such as majority voting or some other rule-based algorithm. However, since you have labels for some of the trees, it can also be done through supervised learning.
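As a baseline, the majority-voting rule could look something like the sketch below. The function name `majority_vote` and the page labels are made up for illustration; they are not from your data.

```python
from collections import Counter

def majority_vote(node_categories):
    """Return the most frequent node category of one tree (website)."""
    if not node_categories:
        raise ValueError("a tree needs at least one labeled node")
    # most_common(1) returns [(category, count)]; on a tie, the category
    # that was encountered first wins.
    return Counter(node_categories).most_common(1)[0][0]

# Hypothetical node labels for a single website.
pages = ["Blog", "News", "News", "About", "News", "Blog"]
site_label = majority_vote(pages)
print(site_label)  # News
```

This ignores the labeled trees entirely, so it is mostly useful as a reference point for the learned model.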

I would make the input space of my model the normalized frequency of the categories for a tree. This means you will need a dictionary of possible categories for the nodes, usually obtained from your training set. Then you can tabulate the frequency of instances.

For example if we have a website with the following node classes:

  • News: 5
  • Opinions: 9
  • About: 1

Then we can formulate our input vector as $[0.33, 0.6, 0.067]$.
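A minimal sketch of that vectorization, assuming the category dictionary is a fixed ordered list (here `vocabulary`, a name I made up) so that every tree maps to a vector of the same length:

```python
from collections import Counter

def category_frequencies(node_categories, vocabulary):
    """Map the node categories of one tree to a normalized frequency vector."""
    counts = Counter(node_categories)
    # Only categories in the vocabulary are counted; unseen ones are ignored.
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[c] / total for c in vocabulary]

vocabulary = ["News", "Opinions", "About"]
pages = ["News"] * 5 + ["Opinions"] * 9 + ["About"] * 1
vector = category_frequencies(pages, vocabulary)
print([round(v, 3) for v in vector])  # [0.333, 0.6, 0.067]
```

Because the vector length depends only on the vocabulary, trees with different numbers of nodes become comparable inputs.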

You can then train this model using your already labeled trees. The model will then be capable of classifying future trees in this same way.
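To make the training step concrete, here is a rough sketch with a hand-rolled k-nearest-neighbours classifier over such frequency vectors. The training vectors, the site labels and the function name `knn_predict` are all invented for illustration; in practice you would probably use a library implementation.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict a tree label by majority vote among the k nearest labeled trees."""
    # Euclidean distance from the query tree to every labeled tree.
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled trees: frequency vectors over [News, Opinions, About].
train_vectors = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
train_labels = ["news portal", "news portal", "blog", "blog", "corporate"]

predicted = knn_predict(train_vectors, train_labels, [0.33, 0.6, 0.067], k=3)
print(predicted)  # blog
```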

To determine the top $K$ classes for a tree you will need a model which can rank classes (most can). If you use K-NN (where the number of neighbours is a different K from the number of classes you want), you can rank the classes by their votes among the nearest neighbours and keep the top $K$. With Random Forests or Naive Bayes you can pick the $K$ classes with the highest predicted probabilities.
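Picking the top $K$ classes from per-class probabilities could be sketched as follows. The class names and probabilities are made-up placeholders; with scikit-learn classifiers they would typically come from `classes_` and `predict_proba`, which is an assumption about your setup.

```python
def top_k_classes(class_names, probabilities, k):
    """Return the k (class, probability) pairs with the highest probability."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

# Hypothetical model output for one tree.
class_names = ["news portal", "blog", "corporate", "shop"]
probabilities = [0.55, 0.30, 0.10, 0.05]

top2 = top_k_classes(class_names, probabilities, 2)
print(top2)  # [('news portal', 0.55), ('blog', 0.3)]
```

The returned probabilities can double as relevance scores for the tree.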

To account for the fact that you have a ranked list of $K$ categories for each node, you can add a weighting when calculating the normalized frequencies. For example, let's say each node has its top 3 classifications and we have the following webpages (nodes).

  • Page 1: News, Opinion, Commentary
  • Page 2: News, Advertisement, Opinion
  • Page 3: Commentary, News, Advertisement
  • Page 4: News, Opinion, Advertisement

Then the input vector can be calculated by awarding 3 points to the first category, 2 to the next and 1 to the last. With the categories ordered as News, Opinion, Commentary, Advertisement, the totals are 11, 5, 4 and 4 out of 24 points, which results in $[0.46, 0.21, 0.17, 0.17]$. Alternatively, if you have a probability for these classifications you can use that as the weighting factor.
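A small sketch of this rank weighting, assuming the vector is ordered as News, Opinion, Commentary, Advertisement (the function name `rank_weighted_vector` is just for illustration):

```python
from collections import defaultdict

def rank_weighted_vector(pages, vocabulary):
    """Score categories by rank (first gets K points, last gets 1), then normalize."""
    scores = defaultdict(float)
    for ranked_categories in pages:
        k = len(ranked_categories)
        for position, category in enumerate(ranked_categories):
            scores[category] += k - position  # 3, 2, 1 for a top-3 list
    total = sum(scores[c] for c in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [scores[c] / total for c in vocabulary]

pages = [
    ["News", "Opinion", "Commentary"],
    ["News", "Advertisement", "Opinion"],
    ["Commentary", "News", "Advertisement"],
    ["News", "Opinion", "Advertisement"],
]
vocabulary = ["News", "Opinion", "Commentary", "Advertisement"]

vector = rank_weighted_vector(pages, vocabulary)
print([round(v, 2) for v in vector])  # [0.46, 0.21, 0.17, 0.17]
```

If you have relevance scores per category, you could add those scores in place of `k - position` and normalize the same way.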


Posted 2018-05-11T07:22:28.283

Reputation: 7 863

Thanks a lot for the suggestion. Suppose I also have scores for each category, i.e. for each node in the tree I have a list of the top K categories along with their relevance scores. How can I now predict the top K categories for the tree along with relevance scores, or just the top K categories, or just one category? – Samyak Jain – 2018-05-11T08:38:05.450

Check the edit on the answer, that is how I would address those two points. – JahKnows – 2018-05-11T08:49:46.170

@SamyakJain, I think you are confusing the weights used for feature extraction with model weights. These are two very distinct things. The number of nodes per tree will vary, so you cannot use the nodes directly as an input to a neural network. That is why we need a means to vectorize them. How many labeled instances do you have? – JahKnows – 2018-05-11T09:08:34.097