There are, of course, a number of ways this can be done, such as majority voting or some other rule-based algorithm. However, it can also be framed as supervised learning, since you already have labels for some of the trees.
I would use the normalized frequencies of the node categories in a tree as the input space of my model. This means you will need a dictionary of the possible node categories, usually built from your training set. Then you can tabulate the frequency of each category.
For example if we have a website with the following node classes:
- News: 5
- Opinions: 9
- About: 1
Then we can formulate our input vector as $[0.33, 0.6, 0.067]$.
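A minimal sketch of this feature construction (function and variable names are my own, not from any particular library):

```python
from collections import Counter

def frequency_vector(node_categories, vocabulary):
    """Normalized frequency of each category across a tree's nodes,
    in the fixed order given by the category vocabulary."""
    counts = Counter(node_categories)
    total = len(node_categories)
    return [counts[c] / total for c in vocabulary]

# The example tree: 5 News nodes, 9 Opinions nodes, 1 About node.
vocab = ["News", "Opinions", "About"]
nodes = ["News"] * 5 + ["Opinions"] * 9 + ["About"]
print([round(x, 2) for x in frequency_vector(nodes, vocab)])
# [0.33, 0.6, 0.07]
```

Keeping the vocabulary order fixed matters: every tree must be mapped to a vector with the same category in the same position.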
You can then train this model using your already labeled trees. The model will then be capable of classifying future trees in this same way.
To determine the top $K$ classes for a tree you will need a model which supports this (most do). With $k$-NN (a different $k$) you can look at the classes of the nearest neighbours. With Random Forests or Naive Bayes you can pick the $K$ classes with the highest predicted probabilities.
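As one concrete (and entirely illustrative) way to do this, here is a sketch using scikit-learn's Random Forest; the training data, labels, and $K$ are made up for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training set: each row is a tree's normalized category frequencies,
# each label a (hypothetical) tree class.
X = np.array([[0.33, 0.60, 0.07],
              [0.80, 0.10, 0.10],
              [0.10, 0.10, 0.80],
              [0.30, 0.55, 0.15]])
y = np.array(["news_site", "news_site", "corporate", "blog"])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Probabilities for a new tree, then take the K highest-probability classes.
proba = model.predict_proba([[0.35, 0.50, 0.15]])[0]
K = 2
top_k = model.classes_[np.argsort(proba)[::-1][:K]]
print(top_k)
```

The same `predict_proba` / sort pattern works with Naive Bayes or any other scikit-learn classifier that exposes class probabilities.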
To account for the fact that each node comes with a ranked list of $K$ categories, you can add a weighting when calculating the normalized frequencies. For example, suppose each node has 3 ranked classifications and we have the following webpages (nodes):
- Page 1: News, Opinion, Commentary
- Page 2: News, Advertisement, Opinion
- Page 3: Commentary, News, Advertisement
- Page 4: News, Opinion, Advertisement
Then the input vector can be calculated by awarding 3 points to the first category, 2 to the second and 1 to the last. Over the four categories (News, Opinion, Commentary, Advertisement) this results in $[0.46, 0.21, 0.17, 0.17]$. Alternatively, if you have a probability for each classification you can use that as the weighting factor.
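The weighting scheme above can be sketched as follows (names are my own, and the 3/2/1 weights are just the scheme described here):

```python
def weighted_frequency_vector(pages, vocabulary, weights=(3, 2, 1)):
    """Rank-weighted category frequencies: the first predicted class of
    each page gets the largest weight; the result is normalized to sum to 1."""
    scores = {c: 0 for c in vocabulary}
    for ranked_classes in pages:
        for cls, w in zip(ranked_classes, weights):
            scores[cls] += w
    total = sum(scores.values())
    return [scores[c] / total for c in vocabulary]

vocab = ["News", "Opinion", "Commentary", "Advertisement"]
pages = [["News", "Opinion", "Commentary"],
         ["News", "Advertisement", "Opinion"],
         ["Commentary", "News", "Advertisement"],
         ["News", "Opinion", "Advertisement"]]
print([round(x, 2) for x in weighted_frequency_vector(pages, vocab)])
# [0.46, 0.21, 0.17, 0.17]
```

Using per-class probabilities instead of fixed rank weights only changes the `weights` used inside the loop; the normalization step stays the same.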