2017.07.10 / 17:14

jsoup: a Java library with a jQuery-like interface for traversing HTML documents

XMaLL Administrator

Overview <url: jsoup.org>: jsoup is a Java library that provides a jQuery-like traversal interface, used for traversing HTML documents and extracting data from them.

 

Download


jsoup is available as a downloadable .jar java library. The current release version is 1.7.2.

Maven dependency

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.7.2</version>
</dependency>
 

Reference

Jsoup API reference : http://jsoup.org/apidocs/

Getting started: use the connect method to fetch the page.

Load Document from a URL

Load Document
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

Like jQuery, the Connection methods support method chaining.

Method Chaining
Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();
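
To make the chaining idea concrete, here is a minimal sketch that builds the same chain but stops before get() or post(), so nothing is sent over the network. The class name ConnectionChainSketch and the helper configure are made up for illustration; the jsoup calls themselves (Connection.request(), timeout(), cookie(), header()) are part of the Connection API.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical class name; shows that each chained call only configures the request.
final class ConnectionChainSketch {

    // Builds the chain shown above but stops before get()/post(),
    // so no network traffic happens in this method.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000); // milliseconds
    }

    public static void main(String[] args) {
        // request() exposes the settings accumulated by the chain.
        Connection.Request request = configure("http://example.com").request();
        System.out.println(request.timeout());
        System.out.println(request.cookie("auth"));
        System.out.println(request.header("User-Agent"));
    }
}
```

Each method returns the Connection itself, which is what makes the chain possible; the request is only executed when get() or post() is finally called.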

This method only supports web URLs (http and https protocols); if you need to load from a file, use the parse(File in, String charsetName) method instead.

So the documentation says.

Load Document from a File

Load Document
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
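
The following is a small self-contained sketch of the file-loading path, assuming a temporary file instead of /tmp/input.html so it can run anywhere. The class name LoadFromFileSketch and the helper parseTempFile are hypothetical; the third argument to Jsoup.parse is the base URI used to resolve relative links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class for illustration only.
final class LoadFromFileSketch {

    // Writes the HTML to a temporary file, then parses that file with an explicit
    // charset and base URI, mirroring Jsoup.parse(File, String, String) above.
    static Document parseTempFile(String html, String baseUri) {
        try {
            File input = File.createTempFile("jsoup-sketch", ".html");
            input.deleteOnExit();
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(input, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseTempFile(
                "<html><head><title>Sample</title></head>"
                        + "<body><a href='/about'>About</a></body></html>",
                "http://example.com/");
        System.out.println(doc.title());
        // The relative link is resolved against the base URI given at parse time.
        System.out.println(doc.select("a").first().absUrl("href"));
    }
}
```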

Now for data extraction.

1. Extracting : Use DOM methods to navigate a document 

Use Dom for Extracting
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}
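
As a runnable variant of the DOM approach, this sketch parses an inline HTML string rather than a file and guards against a missing element, since getElementById returns null when nothing matches. The class name DomNavigationSketch, the helper linksInContent, and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only.
final class DomNavigationSketch {

    // Collects "text -> href" for every <a> under the element with id="content".
    static List<String> linksInContent(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        Element content = doc.getElementById("content");
        if (content == null) {
            return result; // no such element: nothing to report
        }
        for (Element link : content.getElementsByTag("a")) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div id='content'><a href='/a'>A</a> <a href='/b'>B</a></div>"
                + "<a href='/c'>C</a>";
        // Only the two links inside #content are listed; the third is outside it.
        for (String line : linksInContent(html)) {
            System.out.println(line);
        }
    }
}
```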

finding elements

element data

Manipulating HTML and text 

2. Extracting : Use selector-syntax to find elements  

Use selector
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
// img with src ending .png

Element masthead = doc.select("div.masthead").first();
// div with class="masthead"

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

jsoup supports a CSS-like (or jQuery-like) selector syntax for finding matching elements.

The select method is available on a Document, an Element, or an Elements collection. It is contextual, so you can narrow a search by selecting from a specific element or by chaining select calls.

select returns a list of matching elements (as Elements), which provides methods for extracting and manipulating the results.
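
To illustrate what "contextual" means here, the sketch below runs select on the whole document, on a single Element, and on an Elements result. The class name ContextualSelectSketch, the helper counts, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class for illustration only.
final class ContextualSelectSketch {

    // Returns {all <a> elements, a[href] inside div.masthead, a[href] inside div.body}.
    static int[] counts(String html) {
        Document doc = Jsoup.parse(html);

        // Document context: searches the whole tree.
        int all = doc.select("a").size();

        // Element context: searches only below one element.
        Element masthead = doc.select("div.masthead").first();
        int inMasthead = masthead == null ? 0 : masthead.select("a[href]").size();

        // Elements context: chaining a second select onto the first result.
        int chained = doc.select("div.body").select("a[href]").size();

        return new int[] {all, inMasthead, chained};
    }

    public static void main(String[] args) {
        int[] c = counts("<div class='masthead'><a href='/home'>Home</a></div>"
                + "<div class='body'><a href='/x'>X</a><a>plain</a></div>");
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```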

Selector overview 

  • tagname: find elements by tag, e.g. a
  • ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
  • #id: find elements by ID, e.g. #logo
  • .class: find elements by class name, e.g. .masthead
  • [attribute]: elements with attribute, e.g. [href]
  • [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: elements with attribute value, e.g. [width=500]
  • [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
  • [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
  • *: all elements, e.g. *
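
A few of the attribute selectors listed above can be tried on a small inline document. This is only a sketch: the class name AttributeSelectorSketch, the helper count, and the sample markup are made up for illustration. Note that the backslash in the regular expression has to be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class for illustration only.
final class AttributeSelectorSketch {

    static final String HTML =
            "<img src='a.png'><img src='b.JPG'><img src='c.gif'>"
                    + "<a href='/path/x' data-id='1'>x</a><a href='/other'>y</a>";

    // Counts the elements in the sample document that match a selector.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("[href]"));           // elements with an href attribute
        System.out.println(count("[^data-]"));         // attribute name starts with data-
        System.out.println(count("[href*=/path/]"));   // href contains /path/
        System.out.println(count("img[src$=.png]"));   // src ends with .png
        // Case-insensitive regex on the attribute value.
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]"));
    }
}
```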

Selector combinations 

  • el#id: elements with ID, e.g. div#logo
  • el.class: elements with class, e.g. div.masthead
  • el[attr]: elements with attribute, e.g. a[href]
  • Any combination, e.g. a[href].highlight
  • ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
  • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
  • siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
  • siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
  • el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
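
The difference between the descendant, child, and sibling combinators is easiest to see side by side. The sketch below counts matches in a small sample document; the class name CombinatorSketch, the helper count, and the markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class for illustration only.
final class CombinatorSketch {

    static final String HTML =
            "<div class='content'><p>direct</p><section><p>nested</p></section></div>"
                    + "<h1>Title</h1><p>after1</p><p>after2</p>";

    // Counts the elements in the sample document that match a selector.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.content p"));   // any depth below div.content
        System.out.println(count("div.content > p")); // direct children only
        System.out.println(count("h1 + p"));          // the p immediately after h1
        System.out.println(count("h1 ~ p"));          // every later sibling p of h1
        System.out.println(count("h1, p"));           // union of both selectors
    }
}
```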

 Pseudo selectors 

  • :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
  • :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
  • :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
  • :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
  • :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
  • :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
  • :containsOwn(text): find elements that directly contain the given text
  • :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
  • :matchesOwn(regex): find elements whose own text matches the specified regular expression
  • Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
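
The pseudo selectors can be checked the same way. This sketch assumes a four-cell table row and two div elements; the class name PseudoSelectorSketch, the helper count, and the markup are hypothetical. It also shows the 0-based indexing and the difference between :contains and :containsOwn.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class for illustration only.
final class PseudoSelectorSketch {

    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
                    + "<div><p>Learn jsoup</p></div><div class='logo'>Logo</div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // Counts the elements in the sample document that match a selector.
    static int count(String cssQuery) {
        return doc().select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(2)"));                  // sibling index 0 and 1
        System.out.println(count("td:gt(2)"));                  // sibling index 3 only
        System.out.println(doc().select("td:eq(1)").text());    // second cell (index 1)
        System.out.println(count("div:has(p)"));                // div containing a p
        System.out.println(count("div:not(.logo)"));            // div without class logo
        System.out.println(count("p:contains(JSOUP)"));         // case-insensitive match
        System.out.println(count("div:contains(jsoup)"));       // text anywhere below div
        System.out.println(count("div:containsOwn(jsoup)"));    // text directly in div
    }
}
```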

 See the Selector API reference for the full supported list and details. Nothing surprising there.

3. Extracting : Extract attributes, text, and HTML from elements 

Extract attributes....
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

These methods are the core of accessing element data, though there are others as well.

Each of these accessor methods has a corresponding setter method for changing the data. The rest below is just for reference.
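
As a brief sketch of the setter side, the example below changes a link's href and text using the two-argument attr and the one-argument text setter on Element. The class name SetterSketch and the helper rewriteLink are made up for illustration, and the replacement values are arbitrary.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class for illustration only.
final class SetterSketch {

    // Rewrites the first link and returns {new href, new link text, new body text}.
    static String[] rewriteLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        if (link == null) {
            return new String[0]; // no link in the input
        }
        link.attr("href", "http://example.org/"); // setter counterpart of attr(key)
        link.text("sample");                      // setter counterpart of text()
        return new String[] {link.attr("href"), link.text(), doc.body().text()};
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        for (String value : rewriteLink(html)) {
            System.out.println(value);
        }
    }
}
```

Setting the text replaces the link's child nodes, so the nested b element is gone after the call.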

4. Extracting : Working with URLs

Working with URLs
Document doc = Jsoup.connect("http://jsoup.org").get();
 
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"

URLs in an HTML document are often written relative to the document's location.
When you read the href attribute with Node.attr(String key), it returns the value exactly as specified in the source HTML.

If you want an absolute URL instead, use the abs: attribute prefix, which resolves the attribute value against the document's base URI:
attr("abs:href"). For this to work, it is important to specify a base URI when parsing the document.

If you would rather not use the abs: prefix, there is also Node.absUrl(String key).
It does the same thing, but takes the natural attribute key.
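
To show why the base URI matters, this sketch parses the same markup with and without one and compares the three ways of reading href. The class name AbsoluteUrlSketch, the helper hrefForms, and the sample base URI are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class for illustration only.
final class AbsoluteUrlSketch {

    // Returns {raw href, attr("abs:href"), absUrl("href")} for the first link.
    // Assumes the markup contains at least one <a> element.
    static String[] hrefForms(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return new String[] {
                link.attr("href"),      // exactly as written in the source
                link.attr("abs:href"),  // resolved against the base URI
                link.absUrl("href")     // same resolution, natural attribute key
        };
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/'>Docs</a>";

        // With a base URI, the relative link can be resolved.
        for (String value : hrefForms(html, "http://example.com/base/")) {
            System.out.println(value);
        }

        // Without a base URI, the absolute forms come back as empty strings.
        for (String value : hrefForms(html, "")) {
            System.out.println("[" + value + "]");
        }
    }
}
```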

Example Program : List Links

Example
package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    // Formats and prints one line of output.
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens s to at most width characters, ending with "." when truncated.
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
 
org/jsoup/examples/ListLinks.java
Output
Fetching http://news.ycombinator.com/...
Media: (38)
* img: <http://ycombinator.com/images/y18.gif> 18x18 ()
* img: <http://ycombinator.com/images/s.gif> 10x1 ()
* img: <http://ycombinator.com/images/grayarrow.gif> x ()
* img: <http://ycombinator.com/images/s.gif> 0x10 ()
* img: <http://ycombinator.com/images/s.gif> 15x1 ()
* img: <http://ycombinator.com/images/hnsearch.png> x ()
* img: <http://ycombinator.com/images/s.gif> 25x1 ()
* img: <http://mixpanel.com/site_media/images/mixpanel_partner_logo_borderless.gif> x (Analytics by Mixpan.)
Imports: (2)
* link <http://ycombinator.com/news.css> (stylesheet)
* link <http://ycombinator.com/favicon.ico> (shortcut icon)
Links: (141)
* a: <http://ycombinator.com> ()
* a: <http://news.ycombinator.com/news> (Hacker News)
* a: <http://news.ycombinator.com/newest> (new)
* a: <http://news.ycombinator.com/newcomments> (comments)
* a: <http://news.ycombinator.com/leaders> (leaders)
* a: <http://news.ycombinator.com/jobs> (jobs)
* a: <http://news.ycombinator.com/submit> (submit)
* a: <http://news.ycombinator.com/x?fnid=JKhQjfU7gW> (login)
* a: <http://news.ycombinator.com/vote?for=1094578&dir=up&whence=%6e%65%77%73> ()
                                            &utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29&utm_content=Twitter> (Facebook speeds up PHP)
* a: <http://news.ycombinator.com/user?id=mcxx> (mcxx)
* a: <http://news.ycombinator.com/item?id=1094578> (9 comments)
* a: <http://news.ycombinator.com/vote?for=1094649&dir=up&whence=%6e%65%77%73> ()
* a: <http://groups.google.com/group/django-developers/msg/a65fbbc8effcd914> ("Tough. Django produces XHTML.")
* a: <http://news.ycombinator.com/user?id=andybak> (andybak)
* a: <http://news.ycombinator.com/item?id=1094649> (3 comments)
* a: <http://news.ycombinator.com/vote?for=1093927&dir=up&whence=%6e%65%77%73> ()
* a: <http://news.ycombinator.com/x?fnid=p2sdPLE7Ce> (More)
* a: <http://news.ycombinator.com/lists> (Lists)
* a: <http://news.ycombinator.com/rss> (RSS)
* a: <http://ycombinator.com/bookmarklet.html> (Bookmarklet)
* a: <http://ycombinator.com/newsguidelines.html> (Guidelines)
* a: <http://ycombinator.com/newsfaq.html> (FAQ)
* a: <http://ycombinator.com/newsnews.html> (News News)
* a: <http://news.ycombinator.com/item?id=363> (Feature Requests)
* a: <http://ycombinator.com> (Y Combinator)
* a: <http://ycombinator.com/w2010.html> (Apply)
* a: <http://ycombinator.com/lib.html> (Library)
* a: <http://mixpanel.com/?from=yc> ()