Large database operation: checking for relatedness between entities

I have one small list of entities, such as:

Russia
Vladimir
Moscow

I also have a massive JSON index. Each entry maps an entity name to one or more alphanumeric identifiers.

For instance, Russia might have only one identifier, while Vladimir might have 100.

The index is stored as JSON, and I read it into my Java program like this:

        // GET JSON DATA
        // (needs java.lang.reflect.Type and com.google.gson.reflect.TypeToken in addition to Gson and IOUtils)
        File f = new File("/home/matthias/Workbench/SUTD/nytimes_corpus/wdtk-parent/wdtk-examples/JSON_Output/user666.json");
        String jsonTxt;

        // read the whole file; try-with-resources closes the stream afterwards
        try (InputStream is = new FileInputStream(f))
        {
            jsonTxt = IOUtils.toString(is, "UTF-8");
        }
        //reformat: drop the first and last character and remove the backslash escapes
        jsonTxt = ( jsonTxt.substring(1, jsonTxt.length()-1) ).replace("\\","");

        // a TypeToken keeps the generic type, so the values are really parsed as HashSet<String>
        Gson json = new Gson();
        Type mapType = new TypeToken<Map<String, HashSet<String>>>(){}.getType();
        Map<String, HashSet<String>> mast_Q_storage_map = json.fromJson(jsonTxt, mapType);

I want to get all of the identifiers associated with the entities from the small list, so I need to search the big index for each entity and retrieve its values.
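To make the lookup step concrete, here is a minimal sketch of what I mean, assuming the index is already loaded as a map from entity name to a set of identifiers. The class name `EntityLookup`, the method `lookup`, and the sample identifiers are made up for illustration; matching is by substring, the same way my own loop further down uses `contains`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class EntityLookup {

    // For each entity, collect the identifiers of every index key that
    // contains the entity name as a substring.
    static Map<String, Set<String>> lookup(List<String> entities,
                                           Map<String, Set<String>> index) {
        Map<String, Set<String>> found = new LinkedHashMap<>();
        for (String entity : entities) {
            Set<String> ids = new TreeSet<>(); // sorted and free of duplicates
            for (Map.Entry<String, Set<String>> e : index.entrySet()) {
                if (e.getKey().contains(entity)) {
                    ids.addAll(e.getValue());
                }
            }
            found.put(entity, ids);
        }
        return found;
    }

    public static void main(String[] args) {
        // Placeholder index; the identifiers are invented, not real data.
        Map<String, Set<String>> index = new HashMap<>();
        index.put("Russia", new HashSet<>(Arrays.asList("Q100")));
        index.put("Vladimir Putin", new HashSet<>(Arrays.asList("Q200", "Q201")));
        index.put("Moscow", new HashSet<>(Arrays.asList("Q300")));

        List<String> entities = Arrays.asList("Russia", "Vladimir", "Moscow");
        System.out.println(lookup(entities, index));
    }
}
```

This scans the whole index once per entity, which is probably acceptable for a three-item list but may be slow if the small list grows.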

Then I want to determine whether any of the entities in a sentence are related to each other. That is, I want to check whether any of these ordered pairs

Russia X Vladimir
Russia X Moscow
Moscow X Vladimir
Moscow X Russia
Vladimir X Russia
Vladimir X Moscow

results in a relationship. I've been trying to do it like this, but I'm running into big problems:

        // Read in all the sentences, that are in files, in this folder
        final File folder = new File("/home/matthias/Workbench/SUTD/nytimes_corpus/wdtk-parent/wdtk-examples/JSON_Output/");

        for (final File fileEntry : folder.listFiles()) 
        {
            BufferedReader br = new BufferedReader(new FileReader(fileEntry));
            try 
            {
                //Store the filename
                //System.out.println(fileEntry.getName());
                StringBuilder sb = new StringBuilder();
                String line = br.readLine();

                while (line != null) 
                {   
                    sb.append(line);
                    sb.append(System.lineSeparator());
                    line = br.readLine();
                }
                String everything = sb.toString();
                //System.out.println(everything);

                Document doc = Jsoup.parse(everything);

                Elements contents = doc.getElementsByTag("sentence");
                for (Element content : contents) 
                {
                    //store the sentence number
                    String number = content.select("sentence").text();
                    number = number.substring(0, number.indexOf(" ")); 
                    System.out.println(number);

                    //get all the entities in this sentence
                    Elements pers = content.select("PERSON");
                    Elements locs = content.select("LOCATION");
                    Elements orgs = content.select("ORGANIZATION");

                    //collect all the elements to a list, all the elements of one sentence
                    List<String> list = new ArrayList<String>();

                    for (Element per : pers) 
                    {
                        list.add(per.text().trim());
                    }
                    for (Element loc : locs) 
                    {
                        list.add(loc.text().trim());
                    }
                    for (Element org : orgs) 
                    {
                        list.add(org.text().trim());
                    }


                    System.out.println();
                    System.out.println();
                    System.out.println();
                    System.out.println("This is list of sentence elements:");
                    for (String s : list)
                        System.out.println(s);
                    System.out.println();
                    System.out.println();
                    System.out.println();


                    List<String> Q_value_list = new ArrayList<String>();

                    // collect the Q values of every key that matches an entity of this sentence
                    for (Entry<String, HashSet<String>> e : mast_Q_storage_map.entrySet()) 
                    {
                        for (String s : list)
                        {
                            if (e.getKey().contains(s)) 
                            {
                                Q_value_list.addAll(e.getValue());

                                System.out.println(e.getKey() + " :: " + e.getValue());
                            }
                        }
                    }



                    //czeher
                    for (String home : Q_value_list) 
                    {
                        for (String away : Q_value_list) 
                        {
                            // an entity is not checked against itself
                            if (home.equals(away))
                            {
                                continue;
                            }

                            String URL_czech = "http://milenio.dcc.uchile.cl/sparql?default-graph-uri=&query=PREFIX+%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0ASELECT+*+WHERE+%7B%0D%0A+++%3A" 
                                               + home + "+%3FsimpleProperty+%3A" 
                                               + away + "%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&debug=on";

                            URL wikidata_page = new URL(URL_czech);
                            HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
                            InputStream wikiInputStream = null;

                            try 
                            {
                                // try to connect and use the input stream
                                wiki_connection.connect();
                                wikiInputStream = wiki_connection.getInputStream();
                            } 
                            catch(IOException error) 
                            {
                                // failed, try using the error stream
                                wikiInputStream = wiki_connection.getErrorStream();
                            }
                            // parse the input stream using Jsoup
                            Document docx = Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol()+"://"+wikidata_page.getHost()+"/");

                            // print the text of each link in the second row of the result table
                            Elements link_text = docx.select("table.sparql > tbody > tr:nth-child(2) > td > a");
                            for (Element l : link_text) 
                            {
                                System.out.println( l.text() );
                            }
                        }
                    }

                }
            }
            finally 
            {
                br.close();
            }

        }
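Stripped of the file handling and HTTP calls, the pairing step I'm after can be sketched as follows. This is only an illustration under my own assumptions: the class `PairQueries` and its methods are hypothetical names, the identifiers are placeholders, and the URL carries just the `query` parameter, leaving out the other parameters from my hand-built URL above. It uses `URLEncoder` instead of manual percent-encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PairQueries {

    // All ordered pairs (home, away) with home != away.
    static List<String[]> orderedPairs(List<String> ids) {
        List<String[]> pairs = new ArrayList<>();
        for (String home : ids) {
            for (String away : ids) {
                if (!home.equals(away)) {
                    pairs.add(new String[] {home, away});
                }
            }
        }
        return pairs;
    }

    // SPARQL text asking for any property that links home to away.
    static String sparqlFor(String home, String away) {
        return "PREFIX : <http://www.wikidata.org/entity/>\n"
             + "SELECT * WHERE {\n"
             + "   :" + home + " ?simpleProperty :" + away + "\n"
             + "}";
    }

    // Endpoint URL with the query percent-encoded as a single parameter.
    static String queryUrl(String endpoint, String home, String away) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(sparqlFor(home, away), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real data.
        List<String> ids = Arrays.asList("Q100", "Q200", "Q300");
        for (String[] pair : orderedPairs(ids)) {
            System.out.println(queryUrl("http://milenio.dcc.uchile.cl/sparql", pair[0], pair[1]));
        }
    }
}
```

For n identifiers this produces n * (n - 1) queries, so an entity with 100 identifiers already means thousands of HTTP requests per sentence.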

smatthewenglish

Posted 2015-04-28T12:58:33.920

Reputation: 191

What are these big problems? – Wojciech Walczak – 2015-04-28T18:47:30.503

No answers