2017.07.09 / 01:45

Jsoup: Use selector-syntax to find elements

클래식로얄

Problem

You want to find or manipulate elements using a CSS or jQuery-like selector syntax.

Solution

Use the Element.select(String selector) and Elements.select(String selector) methods:

 File input = new File("/tmp/input.html");  
 Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");  
 Elements links = doc.select("a[href]"); // a with href  
 Elements pngs = doc.select("img[src$=.png]");  
  // img with src ending .png  
 Element masthead = doc.select("div.masthead").first();  
  // div with class=masthead  
 Elements resultLinks = doc.select("h3.r > a"); // direct a after h3  

Description

jsoup elements support a CSS-like (or jQuery-like) selector syntax for finding matching elements, which allows powerful and robust queries.
The select method is available on a Document, an Element, or an Elements collection. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
select returns a list of elements (as an Elements object), which provides a range of methods to extract and manipulate the results.
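
To make "contextual" concrete, here is a minimal sketch. It assumes jsoup is on the classpath; the markup and the class name ContextualSelectDemo are made up for illustration. The same selector returns different results depending on where select is called.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ContextualSelectDemo {
    // Hypothetical markup, invented for this example.
    static final String HTML =
        "<div class='masthead'><a href='/home'>Home</a></div>"
      + "<div class='content'>"
      + "<a href='/a'>A</a><a href='/b'>B</a><a name='x'>no href</a>"
      + "</div>";

    /** Counts a[href] matches in the whole document. */
    static int countAllLinks() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a[href]").size();
    }

    /** Counts a[href] matches below the first div.content only. */
    static int countContentLinks() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.select("div.content").first();
        if (content == null) {
            return 0; // first() returns null when nothing matched
        }
        return content.select("a[href]").size();
    }

    /** Same filter, written as chained select calls on Elements. */
    static int countChained() {
        Document doc = Jsoup.parse(HTML);
        Elements chained = doc.select("div.content").select("a[href]");
        return chained.size();
    }

    public static void main(String[] args) {
        System.out.println("whole document: " + countAllLinks());     // 3
        System.out.println("inside content: " + countContentLinks()); // 2
        System.out.println("chained:        " + countChained());      // 2
    }
}
```

Note the null check after first(): it is the usual guard when a selector may match nothing.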

Selector overview

  • tagname: find elements by tag, e.g. a
  • ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
  • #id: find elements by ID, e.g. #logo
  • .class: find elements by class name, e.g. .masthead
  • [attribute]: elements with attribute, e.g. [href]
  • [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: elements with attribute value, e.g. [width=500]
  • [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
  • [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
  • *: all elements, e.g. *
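
The attribute selectors above can be tried against a small document. The sketch below assumes jsoup is on the classpath; the markup and the class name AttributeSelectorDemo are invented for illustration. Note that the backslash in the regex has to be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AttributeSelectorDemo {
    // Hypothetical markup, invented for this example.
    static final String HTML =
        "<img src='logo.png'><img src='photo.jpeg'><img src='anim.gif'>"
      + "<a href='/path/to/page'>relative</a>"
      + "<a href='http://example.com/'>absolute</a>"
      + "<div data-id='7' width='500'>data</div>";

    /** Returns how many elements of the sample document match the selector. */
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "img[src$=.png]",               // src ends with .png
            "img[src~=(?i)\\.(png|jpe?g)]", // src matches the regex
            "[^data-]",                     // any attribute name starting with data-
            "[width=500]",                  // exact attribute value
            "a[href*=/path/]",              // href contains /path/
            "a[href^=http]"                 // href starts with http
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```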

Selector combinations

  • el#id: elements with ID, e.g. div#logo
  • el.class: elements with class, e.g. div.masthead
  • el[attr]: elements with attribute, e.g. a[href]
  • Any combination, e.g. a[href].highlight
  • ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
  • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
  • siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
  • siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
  • el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
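
The difference between the descendant, child, and sibling combinators is easiest to see side by side. This is a minimal sketch, assuming jsoup is on the classpath; the markup and the class name CombinatorDemo are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CombinatorDemo {
    // Hypothetical markup: one p directly inside div.content, one nested deeper,
    // and two p elements that follow an h1 as siblings.
    static final String HTML =
        "<div class='content'><p>direct</p><div class='box'><p>nested</p></div></div>"
      + "<h1>Title</h1><p>first</p><p>second</p>";

    /** Returns how many elements of the sample document match the selector. */
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    /** Returns the text of the first match, or null if nothing matched. */
    static String firstText(String selector) {
        Document doc = Jsoup.parse(HTML);
        Element first = doc.select(selector).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        System.out.println(count("div.content p"));   // 2: any depth below div.content
        System.out.println(count("div.content > p")); // 1: direct children only
        System.out.println(firstText("h1 + p"));      // first: the p right after the h1
        System.out.println(count("h1 ~ p"));          // 2: every later sibling p
        System.out.println(count("h1, p"));           // 5: union of both selectors
    }
}
```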

Pseudo selectors

  • :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
  • :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
  • :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
  • :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
  • :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
  • :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
  • :containsOwn(text): find elements that directly contain the given text
  • :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
  • :matchesOwn(regex): find elements whose own text matches the specified regular expression
  • Note that the indexed pseudo-selectors above (:lt, :gt, :eq) are 0-based: the first element is at index 0, the second at 1, and so on.
See the Selector API reference for the full list of supported selectors and details.
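
A short sketch of the pseudo selectors, with the 0-based indexing in mind. It assumes jsoup is on the classpath; the markup and the class name PseudoSelectorDemo are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PseudoSelectorDemo {
    // Hypothetical markup, invented for this example.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div class='logo'><p>Logo jsoup</p></div>"
      + "<div><p>plain JSOUP text</p></div>"
      + "<div><span>no paragraph</span></div>";

    /** Returns how many elements of the sample document match the selector. */
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    /** Returns the text of the first match, or null if nothing matched. */
    static String firstText(String selector) {
        Document doc = Jsoup.parse(HTML);
        Element first = doc.select(selector).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));          // 3: cells at index 0, 1, 2
        System.out.println(firstText("td:eq(1)"));      // b: the second cell
        System.out.println(count("td:gt(2)"));          // 1: only the cell at index 3
        System.out.println(count("div:has(p)"));        // 2: divs that contain a p
        System.out.println(count("div:not(.logo)"));    // 2: divs without class logo
        System.out.println(count("p:contains(jsoup)")); // 2: the match ignores case
    }
}
```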

Sample code 


      public String parseOptHtmlPage() throws IOException, JSONException  
      {  
           Log.d("DemoPlugin", "Optimize parsing html");  
           AssetManager am = cordova.getContext().getAssets();  
           InputStream is = am.open("index.html");  
           StringWriter writer = new StringWriter();  
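           IOUtils.copy(is, writer, "UTF-8");
           String html = writer.toString();
           // Keep only the <body>...</body> part so jsoup has less to parse.
           // If either tag is missing (e.g. <body> carries attributes), parse the whole page.
           int startIndex = html.indexOf("<body>");
           int endIndex = html.indexOf("</body>");
           if (startIndex >= 0 && endIndex > startIndex) {
                html = html.substring(startIndex, endIndex + "</body>".length());
           }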
           StringBuilder retHtml = new StringBuilder();  
           Log.d("DemoPlugin", "Start parsing html " + html.length());  
           org.jsoup.nodes.Document doc = Jsoup.parse(html);  
           Log.d("DemoPlugin", "Finishes parsing html!");  
           // Get all categories
           Log.d("DemoPlugin","Get td.tcat node ...");  
           //org.jsoup.select.Elements categoriesNodes = doc.select("body > div > div.page > div > table > tbody > tr > td > table:eq(2) > tbody > tr > td.cat");  
           org.jsoup.select.Elements categoriesNodes = doc.select("table table.tborder tbody tr td.tcat");  
           Log.d("DemoPlugin","Get td.tcat node done");  
           HashMap<JSONObject, List<JSONObject>> categories = new HashMap<JSONObject, List<JSONObject>>();  
           for (org.jsoup.nodes.Element category : categoriesNodes) {  
                org.jsoup.nodes.Element hrefNode = category.select("a").last();  
                if (hrefNode == null) {  
                     continue;  
                }  
                JSONObject jsonCategory = new JSONObject();  
                org.jsoup.nodes.Element tableNode = category.parent().parent().parent();  
                String href = hrefNode.attr("href");  
                String catName = hrefNode.text();  
                Log.d("DemoPlugin","BOX: " + catName + "(" + href + ")");  
                retHtml.append("<li boxid='" + href + "' data-role='list-divider'>" + catName + "</li>");  
                jsonCategory.put("name", catName);  
                jsonCategory.put("href", href);  
                String boxId = href.substring(href.indexOf("=") + 1);  
                String boxPrefix = "collapseobj_forumbit_";  
                String searchId = boxPrefix + boxId;  
                org.jsoup.select.Elements boxTrNodes = tableNode.select("tbody#" + searchId + " > tr");  
                Log.d("DemoPlugin","SUBFORUMS: " + boxTrNodes.size());  
                List<JSONObject> forums = new ArrayList<JSONObject>();  
                for (org.jsoup.nodes.Element boxTrNode : boxTrNodes) {  
                     JSONObject forum = new JSONObject();  
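                     org.jsoup.nodes.Element subForum = boxTrNode.select("td[id^=f]").first();
                     if (subForum == null) {
                          continue; // row without a forum cell
                     }
                     org.jsoup.nodes.Element subForumNode = subForum.select("a").last();
                     if (subForumNode == null) {
                          continue; // forum cell without a link
                     }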
                     String subForumHref = subForumNode.attr("href");  
                     String subForumName = subForumNode.text();  
                     org.jsoup.nodes.Element postsNumber = boxTrNode.select("td").last();  
                     org.jsoup.nodes.Element threadsNumber = postsNumber.previousElementSibling();  
                     int index = subForumHref.indexOf("?f=");  
                     String forumId = subForumHref.substring(index + 3);  
                     Log.d("DemoPlugin", "Write <a onClick='loadForum(\"" + forumId + "\", \"0\")'>");  
                     retHtml.append("<li subboxid='" + subForumHref + "'><a onClick='loadForum(\"" + forumId + "\", \"0\")'>" +   
                          subForumName + "<span class='ui-li-count'>" + threadsNumber.text() + "</span></a></li>");  
                     forum.put("name", subForumName);  
                     forum.put("href", subForumHref);  
                     forum.put("threads", threadsNumber.text());  
                     forum.put("posts", postsNumber.text());  
                     forums.add(forum);  
                }  
                categories.put(jsonCategory, forums);  
           }  
           //String jsonText = JSONValue.toJSONString(categories);  
           //Log.d("DemoPlugin",jsonText);  
           Log.d("DemoPlugin","Optimize parsing html finishes ...");  
           return retHtml.toString();  
      }  

      public String parseOptForum(final String boxId_, final String page_) throws IOException  
      {  
           Log.d("DemoPlugin", "Optimize parsing html forum " + boxId_ + " page " + page_);  
           AssetManager am = cordova.getContext().getAssets();  
           InputStream is = am.open("f"+ boxId_ + ".html");  
           StringWriter writer = new StringWriter();  
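           IOUtils.copy(is, writer, "UTF-8");
           String html = writer.toString();
           // Keep only the <body>...</body> part so jsoup has less to parse.
           // If either tag is missing (e.g. <body> carries attributes), parse the whole page.
           int startIndex = html.indexOf("<body>");
           int endIndex = html.indexOf("</body>");
           if (startIndex >= 0 && endIndex > startIndex) {
                html = html.substring(startIndex, endIndex + "</body>".length());
           }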
           StringBuilder retHtml = new StringBuilder();  
           Log.d("DemoPlugin", "Start parsing html " + html.length());  
           org.jsoup.nodes.Document doc = Jsoup.parse(html);  
           Log.d("DemoPlugin", "Finishes parsing html!");  
           // Get all sub-forum cells (td elements whose id starts with "f")
           org.jsoup.select.Elements forumNodes = doc.select("table > tbody > tr > td[id^=f]");            
           if (!forumNodes.isEmpty()) {  
                retHtml.append("<li data-role='list-divider'>Forum</li>");  
                for (org.jsoup.nodes.Element forumNode : forumNodes) {  
                     org.jsoup.nodes.Element tdForumNode = forumNode.select("table > tbody > tr > td").last();  
                     if (tdForumNode == null) {  
                          continue;  
                     }  
                     org.jsoup.nodes.Element hrefNode = tdForumNode.select("div > a").first();  
                     if (hrefNode == null) {  
                          continue;  
                     }  
                     String forumHref = hrefNode.attr("href");  
                     String forumName = hrefNode.select("strong").first().text();  
                     org.jsoup.nodes.Element tdNodeThreads = forumNode.nextElementSibling().nextElementSibling();  
                     //org.jsoup.nodes.Element tdNodePosts = tdNodeThreads.nextElementSibling();   
                     int index = forumHref.indexOf("?f=");  
                     String forumId = forumHref.substring(index + 3);  
                     Log.d("DemoPlugin", "Write <a onClick='loadForum(\"" + forumId + "\", \"0\")'>");  
                     retHtml.append("<li subboxid='" + forumHref + "'><a onClick='loadForum(\"" + forumId + "\", \"0\")'>" + forumName +   
                          "<span class='ui-li-count'>" + tdNodeThreads.text() + "</span></a></li>");                      
                }  
           }  
           org.jsoup.select.Elements tdThreadNodes = doc.select("table#threadslist > tbody#threadbits_forum_" + boxId_ +   
                " > tr > td[id^=td_threadtitle_]");  
           if (!tdThreadNodes.isEmpty())  
           {  
                retHtml.append("<li data-role='list-divider'>Threads</li>");  
                for (org.jsoup.nodes.Element tdThreadNode : tdThreadNodes) {  
                     org.jsoup.nodes.Element threadNode = tdThreadNode.select("div > a[id^=thread_gotonew_]").first();  
                     if (threadNode == null) {  
                          continue;  
                     }  
                     String forumHref = threadNode.attr("href");  
                     int index = forumHref.indexOf("?t=");  
                     String threadId = forumHref.substring(index + 3);  
                     org.jsoup.nodes.Element threadDetailNode = threadNode.nextElementSibling();   
                     String threadTitle = threadDetailNode.text();  
                     org.jsoup.nodes.Element tdNodeViewReply = tdThreadNode.nextElementSibling();  
                     String viewReply = tdNodeViewReply.attr("title");  
                     retHtml.append("<li threadid='" + threadId + "'><a href='index.html'>" + threadTitle +   
                          "<span class='ui-li-count'>" + viewReply + "</span></a></li>");                      
                }  
           }  
           Log.d("DemoPlugin","Optimize parsing html finishes ...");  
           return retHtml.toString();  
      }
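
The sample code pulls forum and thread ids out of an href with indexOf("?f=") and substring. That works only when the parameter comes first and is the last thing in the URL. As a hedged alternative, here is a small helper that uses only the standard library. The class name QueryParam and the example URLs are hypothetical and not part of the original code.

```java
class QueryParam {
    /**
     * Returns the value of the named query parameter in href,
     * or null if the URL has no query string or no such parameter.
     */
    static String get(String href, String name) {
        int q = href.indexOf('?');
        if (q < 0) {
            return null; // no query string at all
        }
        String query = href.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop a trailing fragment
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical URLs in the style the sample code appears to expect.
        System.out.println(get("forumdisplay.php?f=17", "f"));              // 17
        System.out.println(get("forumdisplay.php?s=abc&f=17&page=2", "f")); // 17
        System.out.println(get("index.php", "f"));                          // null
    }
}
```

With such a helper, a call like QueryParam.get(subForumHref, "f") could replace the indexOf and substring pair, and a null result can be skipped in the same way as a missing node.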