2017.07.09 / 01:45

How to parse HTML table using jsoup?


I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it as well. Below is the HTML table I am trying to parse.

The table below has three tr elements (I have shortened it to three rows to keep the example readable; in general there will be more). I would like to extract the Cluster Name from the table along with its corresponding host names. For example, I would extract Titan as the cluster name and all of its host names whose status is down.

As you can see below, the Titan cluster has two host names, machineA.abc.com and machineB.abc.com. The status of machineA is up, but the status of machineB is down.

So I would print Titan as the cluster name and machineB.abc.com as the host name, since it is down. Is this possible using jsoup?

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
</table>

So far, I am able to fetch the whole HTML table using jsoup, but I am not sure how to extract the cluster name and the host names that are down:

URL url = new URL("url_name");
Document doc = Jsoup.parse(url, 3000);

Update:

I might have two cluster names in the table, as shown below:

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Goldy</td>
      <td>10.100.111.77</td>
      <td>machineH.pqr.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>       
</table>

As you can see above, there are now two cluster names, Titan and Goldy. I want to find all the machines that are down for the Titan cluster only.



Yes, it is possible with jsoup. First, select the table. Then select the <tr> tags for the rows. You can start from index 1 (the second row), since the first row contains only the column names. For each row, select its <td> cells and read the ones you need by index. In your case, indexes 7 and 5 are the important ones (index 7: Status, index 5: Host Name). Check whether the status equals down and, if it does, add the Host Name to a list. That's all.

ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");

for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
    Element row = rows.get(i);
    Elements cols = row.select("td");

    if (cols.get(7).text().equals("down")) {
        downServers.add(cols.get(5).text());
    }
}
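If hard-coding 7 and 5 feels brittle, the indexes can also be looked up from the header row. The following is only a sketch under a few assumptions: the header cells are <td> elements as in the question, the labels are exactly "Host Name" and "Status", and the class name, method names and trimmed-down sample table are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DownHostsByHeader {

    // Returns the position of the header cell whose text equals name, or -1 if there is none.
    static int columnIndex(Element headerRow, String name) {
        Elements cells = headerRow.select("td");
        for (int i = 0; i < cells.size(); i++) {
            if (cells.get(i).text().trim().equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Collects the host name of every data row whose Status cell reads "down".
    static List<String> downHosts(String html) {
        Document doc = Jsoup.parse(html);
        Elements rows = doc.select("table").get(0).select("tr");
        int hostCol = columnIndex(rows.get(0), "Host Name");
        int statusCol = columnIndex(rows.get(0), "Status");

        List<String> result = new ArrayList<>();
        if (hostCol < 0 || statusCol < 0) {
            return result; // header labels not found, nothing to report
        }
        for (int i = 1; i < rows.size(); i++) { // row 0 is the header
            Elements cols = rows.get(i).select("td");
            if (cols.size() <= Math.max(hostCol, statusCol)) {
                continue; // skip rows that are too short
            }
            if (cols.get(statusCol).text().trim().equals("down")) {
                result.add(cols.get(hostCol).text().trim());
            }
        }
        return result;
    }

    // Trimmed-down stand-in for the table in the question (three columns only).
    static String sampleHtml() {
        return "<table>"
            + "<tr><td>Cluster Name</td><td>Host Name</td><td>Status</td></tr>"
            + "<tr><td>Titan</td><td>machineA.abc.com</td><td>up</td></tr>"
            + "<tr><td></td><td>machineB.abc.com</td><td>down</td></tr>"
            + "</table>";
    }

    public static void main(String[] args) {
        System.out.println(downHosts(sampleHtml()));
    }
}
```

The benefit is that the code keeps working if a column is added or moved, as long as the header labels stay the same.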

Update: Once you find a row whose cluster name is Titan, you can run another loop over the rows that follow and keep going for as long as the cluster name cell is empty.

Edit: I changed the while loop to a do-while loop.

    ArrayList<String> downServers = new ArrayList<>();
    Element table = doc.select("table").get(0); //select the first table.
    Elements rows = table.select("tr");

    for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
        Element row = rows.get(i);
        Elements cols = row.select("td");

        if (cols.get(3).text().equals("Titan")) {
            if (cols.get(7).text().equals("down"))
                downServers.add(cols.get(5).text());

            //an empty cluster name cell means the row still belongs to Titan.
            boolean sameCluster = false;
            do {
                if (i == rows.size() - 1)
                    break; //no rows left to look at.
                cols = rows.get(i + 1).select("td"); //peek at the next row without consuming it.
                sameCluster = cols.get(3).text().equals("");
                if (sameCluster) {
                    i++; //the next row is a Titan row, so consume it.
                    if (cols.get(7).text().equals("down"))
                        downServers.add(cols.get(5).text());
                }
            }
            while (sameCluster);
            //i is now the last Titan row; the for loop's i++ moves on to the next cluster,
            //which also covers two Titan names appearing consecutively.
        }
    }

The downServers list will contain the host names of the servers that are down.
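The same result can also be reached in a single pass by remembering the last non-empty cluster name and treating rows with an empty cluster cell as belonging to it. This is a sketch, not the approach from the answer above: it assumes the column positions from the question (3, 5 and 7), and the class name, helper methods and the extra host machineK.pqr.com are hypothetical, added only so the sample has a down host in a second cluster.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ClusterDownHosts {

    // Column positions taken from the table in the question.
    static final int CLUSTER_COL = 3;
    static final int HOST_COL = 5;
    static final int STATUS_COL = 7;

    // Returns the host names that are down in the given cluster.
    static List<String> downHosts(Document doc, String clusterName) {
        List<String> result = new ArrayList<>();
        Elements rows = doc.select("table").get(0).select("tr");
        String currentCluster = ""; // last non-empty cluster name seen so far

        for (int i = 1; i < rows.size(); i++) { // row 0 is the header
            Elements cols = rows.get(i).select("td");
            if (cols.size() <= STATUS_COL) {
                continue; // skip rows that are too short
            }
            String cluster = cols.get(CLUSTER_COL).text().trim();
            if (!cluster.isEmpty()) {
                currentCluster = cluster; // a new cluster starts on this row
            }
            if (currentCluster.equals(clusterName)
                    && cols.get(STATUS_COL).text().trim().equals("down")) {
                result.add(cols.get(HOST_COL).text().trim());
            }
        }
        return result;
    }

    // Builds one eight-cell row shaped like the rows in the question.
    static String row(String cluster, String host, String status) {
        return "<tr><td>Hist</td><td>VI</td><td></td><td>" + cluster + "</td>"
            + "<td>10.0.0.1</td><td>" + host + "</td><td></td><td>" + status + "</td></tr>";
    }

    // Sample table: a header row, two Titan rows and two Goldy rows.
    static String sampleHtml() {
        return "<table>"
            + row("Cluster Name", "Host Name", "Status")
            + row("Titan", "machineA.abc.com", "up")
            + row("", "machineB.abc.com", "down")
            + row("Goldy", "machineH.pqr.com", "up")
            + row("", "machineK.pqr.com", "down") // hypothetical host
            + "</table>";
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(sampleHtml());
        System.out.println(downHosts(doc, "Titan"));
    }
}
```

Since the loop index is never adjusted by hand, there is no special case for the last row or for two named clusters in a row.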




What I would do in your case is first create a class for your machine with all the appropriate attributes. Then, using jsoup, I would extract the data into an ArrayList of those objects and use ordinary Java logic to get what I need out of the list.

I am skipping the class definition (since it is not the issue here); I will call the class Machine.
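For anyone who wants to run the snippets that follow, a minimal Machine could look something like this. It is an assumption on my part rather than part of the answer: it covers only the accessors the snippets call (cluster name, IP, host name, status), and the remaining columns would be added the same way.

```java
// Plain data holder for one table row; the field set is a guess based on the snippets.
class Machine {
    private String clusterName;
    private String ip;
    private String hostName;
    private String status;

    public String getClusterName() { return clusterName; }
    public void setClusterName(String clusterName) { this.clusterName = clusterName; }

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }

    public String getHostName() { return hostName; }
    public void setHostName(String hostName) { this.hostName = hostName; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }
}
```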

Then, using jsoup, I would get the row data like this:

ArrayList<Machine> list = new ArrayList<>();
Document doc = Jsoup.parse(url, 3000);
for (Element table : doc.select("table")) { //this will work if your doc contains only one table element
  Elements rows = table.select("tr");
  for (int i = 1; i < rows.size(); i++) { //skip the first row, it only holds the column names
    Elements tds = rows.get(i).select("td");
    Machine tmp = new Machine();
    tmp.setClusterName(tds.get(3).text());
    tmp.setIp(tds.get(4).text());
    tmp.setHostName(tds.get(5).text());
    tmp.setStatus(tds.get(7).text());
    //.... and so on for the rest of attributes
    list.add(tmp);
  }
}

Then use a loop to get the values you need from the list:

for(Machine x:list){
  if(x.getStatus().equalsIgnoreCase("up")){
    //machine with UP status found
    System.out.println("The Machine with up status is:"+x.getHostName());
  }
}

That's all. Please also note that this code is untested and may contain some syntax errors, as it was written directly in this editor and not in an IDE.