2017.07.09 / 01:45
Jsoup: Use selector-syntax to find elements
클래식로얄
Problem
You want to find or manipulate elements using a CSS or jQuery-like selector syntax.
Solution
Use the Element.select(String selector) and Elements.select(String selector) methods:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // a that is a direct child of an h3 with class r
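The snippet above only calls select on the Document. As a minimal sketch of the other two entry points, Element.select and Elements.select, the following uses a made-up inline HTML string (an assumption for illustration, not part of the original post) so it can run without /tmp/input.html. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ContextualSelect {
    // Made-up HTML, used only for illustration.
    static final String HTML =
        "<div class=masthead><a href='/home'>Home</a></div>"
      + "<div class=content><a href='/one'>One</a><a>No href</a></div>";

    // Element.select: searches only inside the masthead div.
    public static int countMastheadLinks() {
        Document doc = Jsoup.parse(HTML, "http://example.com/");
        Element masthead = doc.select("div.masthead").first();
        return masthead.select("a[href]").size();
    }

    // Elements.select: searches inside every element of the list.
    public static int countLinksInDivs() {
        Document doc = Jsoup.parse(HTML, "http://example.com/");
        Elements divs = doc.select("div");
        return divs.select("a[href]").size();
    }

    public static void main(String[] args) {
        System.out.println("links in masthead: " + countMastheadLinks());
        System.out.println("links in all divs: " + countLinksInDivs());
    }
}
```

Selecting from masthead narrows the search to that one element, while selecting from divs searches all matched div elements at once.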
Description
jsoup elements support a CSS (or jQuery) like selector syntax for finding matching elements, which allows very powerful and robust queries.
The select method is available on a Document, an Element, or an Elements list. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
select returns a list of elements (as Elements), which provides a range of methods to extract and manipulate the results.
Selector overview
- tagname: find elements by tag, e.g. a
- ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
- #id: find elements by ID, e.g. #logo
- .class: find elements by class name, e.g. .masthead
- [attribute]: elements with attribute, e.g. [href]
- [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
- [attr=value]: elements with attribute value, e.g. [width=500]
- [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
- [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
- *: all elements, e.g. *
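Several of the selectors listed above can be tried against a small document. This is only a sketch: the HTML string, the class name BasicSelectors and the helper count are assumptions made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectors {
    // Made-up HTML, used only for illustration.
    static final String HTML =
        "<div id=logo class=masthead><img src='logo.PNG' width=500></div>"
      + "<a href='/path/page' data-id='7'>Link</a>"
      + "<a name='anchor'>Anchor</a>"
      + "<img src='photo.jpeg'><img src='icon.gif'>";

    static final Document DOC = Jsoup.parse(HTML);

    // Hypothetical helper: number of elements matching the selector.
    public static int count(String selector) {
        return DOC.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a",                              // by tag
            "#logo",                          // by ID
            ".masthead",                      // by class name
            "[href]",                         // has attribute
            "[^data-]",                       // attribute name prefix
            "[width=500]",                    // attribute value
            "[href*=/path/]",                 // attribute value contains
            "img[src~=(?i)\\.(png|jpe?g)]"    // attribute value matches regex
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.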
Selector combinations
- el#id: elements with ID, e.g. div#logo
- el.class: elements with class, e.g. div.masthead
- el[attr]: elements with attribute, e.g. a[href]
- Any combination, e.g. a[href].highlight
- ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
- parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
- siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
- siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
- el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
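The difference between the descendant, child and sibling combinators above is easiest to see side by side. The following is a sketch under assumptions: the HTML string, the class name Combinators and the helper count are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Combinators {
    // Made-up HTML, used only for illustration.
    static final String HTML =
        "<div class=content><p>direct</p><blockquote><p>nested</p></blockquote></div>"
      + "<div class=head>Head</div><div>After head</div>"
      + "<h1>Title</h1><p>first</p><div>x</div><p>second</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Hypothetical helper: number of elements matching the selector.
    public static int count(String selector) {
        return DOC.select(selector).size();
    }

    public static void main(String[] args) {
        // Descendant: any p under div.content, at any depth.
        System.out.println("div.content p   -> " + count("div.content p"));
        // Child: only p elements directly under div.content.
        System.out.println("div.content > p -> " + count("div.content > p"));
        // Adjacent sibling: the div right after div.head.
        System.out.println("div.head + div  -> " + count("div.head + div"));
        // General sibling: every later p sharing h1's parent.
        System.out.println("h1 ~ p          -> " + count("h1 ~ p"));
        // Grouping: elements matching either selector.
        System.out.println("div.head, h1    -> " + count("div.head, h1"));
    }
}
```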
Pseudo selectors
- :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
- :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
- :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
- :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
- :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
- :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
- :containsOwn(text): find elements that directly contain the given text
- :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
- :matchesOwn(regex): find elements whose own text matches the specified regular expression
- Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
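To illustrate the 0-based indexing and the difference between the plain and "Own" text selectors described above, here is a small sketch. The HTML string, the class name PseudoSelectors and the helper methods are assumptions made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PseudoSelectors {
    // Made-up HTML, used only for illustration.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div><p>Uses jsoup</p></div>"
      + "<div class=logo>Logo</div>"
      + "<div>Please <b>Login</b></div>";

    static final Document DOC = Jsoup.parse(HTML);

    // Hypothetical helper: number of elements matching the selector.
    public static int count(String selector) {
        return DOC.select(selector).size();
    }

    // Hypothetical helper: combined text of the matching elements.
    public static String text(String selector) {
        return DOC.select(selector).text();
    }

    public static void main(String[] args) {
        // Indexed selectors are 0-based: td:eq(1) is the second cell.
        System.out.println("td:lt(3) -> " + count("td:lt(3)"));
        System.out.println("td:gt(2) -> " + count("td:gt(2)"));
        System.out.println("td:eq(1) -> " + text("td:eq(1)"));
        System.out.println("div:has(p) -> " + count("div:has(p)"));
        System.out.println("div:not(.logo) -> " + count("div:not(.logo)"));
        // :contains ignores case and looks at text of descendants too.
        System.out.println("p:contains(JSOUP) -> " + count("p:contains(JSOUP)"));
        System.out.println("div:contains(login) -> " + count("div:contains(login)"));
        // :containsOwn looks only at the element's own text, not at the <b> child.
        System.out.println("div:containsOwn(login) -> " + count("div:containsOwn(login)"));
        System.out.println("div:matches((?i)login) -> " + count("div:matches((?i)login)"));
    }
}
```

In this document the word "Login" sits inside a b element, so the div matches :contains and :matches but not :containsOwn.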
See the Selector API reference for the full supported list and details.
Sample code
// Builds the <li> markup for the forum index page (categories and their sub-forums).
public String parseOptHtmlPage() throws IOException, JSONException {
    Log.d("DemoPlugin", "Optimize parsing html");
    AssetManager am = cordova.getContext().getAssets();
    InputStream is = am.open("index.html");
    StringWriter writer = new StringWriter();
    IOUtils.copy(is, writer, "UTF-8");
    String html = writer.toString();

    // Keep only the <body>...</body> part ("</body>" is 7 characters long)
    int startIndex = html.indexOf("<body>");
    int endIndex = html.indexOf("</body>");
    html = html.substring(startIndex, endIndex + 7);

    StringBuilder retHtml = new StringBuilder();
    Log.d("DemoPlugin", "Start parsing html " + html.length());
    org.jsoup.nodes.Document doc = Jsoup.parse(html);
    Log.d("DemoPlugin", "Finishes parsing html!");

    // Get all category
    Log.d("DemoPlugin", "Get td.tcat node ...");
    //org.jsoup.select.Elements categoriesNodes = doc.select("body > div > div.page > div > table > tbody > tr > td > table:eq(2) > tbody > tr > td.cat");
    org.jsoup.select.Elements categoriesNodes = doc.select("table table.tborder tbody tr td.tcat");
    Log.d("DemoPlugin", "Get td.tcat node done");

    HashMap<JSONObject, List<JSONObject>> categories = new HashMap<JSONObject, List<JSONObject>>();
    for (org.jsoup.nodes.Element category : categoriesNodes) {
        org.jsoup.nodes.Element hrefNode = category.select("a").last();
        if (hrefNode == null) {
            continue;
        }
        JSONObject jsonCategory = new JSONObject();
        // td -> tr -> tbody -> table
        org.jsoup.nodes.Element tableNode = category.parent().parent().parent();
        String href = hrefNode.attr("href");
        String catName = hrefNode.text();
        Log.d("DemoPlugin", "BOX: " + catName + "(" + href + ")");
        retHtml.append("<li boxid='" + href + "' data-role='list-divider'>" + catName + "</li>");
        jsonCategory.put("name", catName);
        jsonCategory.put("href", href);

        // The sub-forum rows live in a tbody whose id ends with the category id
        String boxId = href.substring(href.indexOf("=") + 1);
        String boxPrefix = "collapseobj_forumbit_";
        String searchId = boxPrefix + boxId;
        org.jsoup.select.Elements boxTrNodes = tableNode.select("tbody#" + searchId + " > tr");
        Log.d("DemoPlugin", "SUBFORUMS: " + boxTrNodes.size());

        List<JSONObject> forums = new ArrayList<JSONObject>();
        for (org.jsoup.nodes.Element boxTrNode : boxTrNodes) {
            JSONObject forum = new JSONObject();
            org.jsoup.nodes.Element subForum = boxTrNode.select("td[id^=f]").first();
            if (subForum == null) {
                continue; // row without a forum cell
            }
            org.jsoup.nodes.Element subForumNode = subForum.select("a").last();
            if (subForumNode == null) {
                continue; // forum cell without a link
            }
            String subForumHref = subForumNode.attr("href");
            String subForumName = subForumNode.text();
            // The last cell holds the post count, the one before it the thread count
            org.jsoup.nodes.Element postsNumber = boxTrNode.select("td").last();
            org.jsoup.nodes.Element threadsNumber = postsNumber.previousElementSibling();
            int index = subForumHref.indexOf("?f=");
            String forumId = subForumHref.substring(index + 3);
            Log.d("DemoPlugin", "Write <a onClick='loadForum(\"" + forumId + "\", \"0\")'>");
            retHtml.append("<li subboxid='" + subForumHref + "'><a onClick='loadForum(\"" + forumId + "\", \"0\")'>"
                    + subForumName + "<span class='ui-li-count'>" + threadsNumber.text() + "</span></a></li>");
            forum.put("name", subForumName);
            forum.put("href", subForumHref);
            forum.put("threads", threadsNumber.text());
            forum.put("posts", postsNumber.text());
            forums.add(forum);
        }
        categories.put(jsonCategory, forums);
    }
    //String jsonText = JSONValue.toJSONString(categories);
    //Log.d("DemoPlugin",jsonText);
    Log.d("DemoPlugin", "Optimize parsing html finishes ...");
    return retHtml.toString();
}

// Builds the <li> markup for one forum page (its sub-forums and threads).
public String parseOptForum(final String boxId_, final String page_) throws IOException {
    Log.d("DemoPlugin", "Optimize parsing html forum " + boxId_ + " page " + page_);
    AssetManager am = cordova.getContext().getAssets();
    InputStream is = am.open("f" + boxId_ + ".html");
    StringWriter writer = new StringWriter();
    IOUtils.copy(is, writer, "UTF-8");
    String html = writer.toString();

    // Keep only the <body>...</body> part
    int startIndex = html.indexOf("<body>");
    int endIndex = html.indexOf("</body>");
    html = html.substring(startIndex, endIndex + 7);

    StringBuilder retHtml = new StringBuilder();
    Log.d("DemoPlugin", "Start parsing html " + html.length());
    org.jsoup.nodes.Document doc = Jsoup.parse(html);
    Log.d("DemoPlugin", "Finishes parsing html!");

    // Get all sub-forums
    org.jsoup.select.Elements forumNodes = doc.select("table > tbody > tr > td[id^=f]");
    if (!forumNodes.isEmpty()) {
        retHtml.append("<li data-role='list-divider'>Forum</li>");
        for (org.jsoup.nodes.Element forumNode : forumNodes) {
            org.jsoup.nodes.Element tdForumNode = forumNode.select("table > tbody > tr > td").last();
            if (tdForumNode == null) {
                continue;
            }
            org.jsoup.nodes.Element hrefNode = tdForumNode.select("div > a").first();
            if (hrefNode == null) {
                continue;
            }
            String forumHref = hrefNode.attr("href");
            String forumName = hrefNode.select("strong").first().text();
            // The thread count is two cells to the right of the forum cell
            org.jsoup.nodes.Element tdNodeThreads = forumNode.nextElementSibling().nextElementSibling();
            //org.jsoup.nodes.Element tdNodePosts = tdNodeThreads.nextElementSibling();
            int index = forumHref.indexOf("?f=");
            String forumId = forumHref.substring(index + 3);
            Log.d("DemoPlugin", "Write <a onClick='loadForum(\"" + forumId + "\", \"0\")'>");
            retHtml.append("<li subboxid='" + forumHref + "'><a onClick='loadForum(\"" + forumId + "\", \"0\")'>"
                    + forumName + "<span class='ui-li-count'>" + tdNodeThreads.text() + "</span></a></li>");
        }
    }

    // Get all threads of this forum
    org.jsoup.select.Elements tdThreadNodes = doc.select("table#threadslist > tbody#threadbits_forum_" + boxId_
            + " > tr > td[id^=td_threadtitle_]");
    if (!tdThreadNodes.isEmpty()) {
        retHtml.append("<li data-role='list-divider'>Threads</li>");
        for (org.jsoup.nodes.Element tdThreadNode : tdThreadNodes) {
            org.jsoup.nodes.Element threadNode = tdThreadNode.select("div > a[id^=thread_gotonew_]").first();
            if (threadNode == null) {
                continue;
            }
            String forumHref = threadNode.attr("href");
            int index = forumHref.indexOf("?t=");
            String threadId = forumHref.substring(index + 3);
            // The link after the "go to new" link carries the thread title
            org.jsoup.nodes.Element threadDetailNode = threadNode.nextElementSibling();
            String threadTitle = threadDetailNode.text();
            // The next cell's title attribute holds the view/reply summary
            org.jsoup.nodes.Element tdNodeViewReply = tdThreadNode.nextElementSibling();
            String viewReply = tdNodeViewReply.attr("title");
            retHtml.append("<li threadid='" + threadId + "'><a href='index.html'>" + threadTitle
                    + "<span class='ui-li-count'>" + viewReply + "</span></a></li>");
        }
    }
    Log.d("DemoPlugin", "Optimize parsing html finishes ...");
    return retHtml.toString();
}
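The sample code reads the forum and thread ids with indexOf("?f=") and indexOf("?t="), which assumes the parameter is always present, comes first, and is the last thing in the href. A more defensive variant could look like the sketch below. It is not part of the original code: the class HrefParams, the method queryParam and the example URLs are all hypothetical.

```java
public class HrefParams {
    /**
     * Hypothetical helper: returns the value of the named query parameter
     * in href, or null if the parameter is not present.
     */
    public static String queryParam(String href, String name) {
        int q = href.indexOf('?');
        if (q < 0) {
            return null; // no query string at all
        }
        String query = href.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop the fragment
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up hrefs, for illustration only.
        System.out.println(queryParam("forumdisplay.php?f=17", "f"));
        System.out.println(queryParam("showthread.php?t=42&page=2", "t"));
        System.out.println(queryParam("index.php", "f"));
    }
}
```

Unlike substring(index + 3), this stops at the next & and returns null rather than a misleading substring when the parameter is missing, so the caller can skip the row.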