2017.07.10 / 17:14

A Java library that uses a jQuery-like traversal interface for traversing HTML documents

XMaLL Admin

Overview (URL: jsoup.org): jsoup is a Java library with a jQuery-like traversal interface, used for traversing HTML documents and then extracting data from them.

Download

jsoup is available as a downloadable .jar java library. The current release version is 1.7.2.

jsoup-1.7.2.jar core library
jsoup-1.7.2-sources.jar optional sources jar
jsoup-1.7.2-javadoc.jar optional javadoc jar

Maven dependency

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.7.2</version>
</dependency>

Reference

Jsoup API reference : http://jsoup.org/apidocs/

Let's start! Use the connect method to fetch the document over the network.

Load Document from a URL

Load Document

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

The Connection methods support method chaining, just like jQuery.

Method Chaining

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

This method only supports web URLs (http and https protocols); if you need to load from a file, use the parse(File in, String charsetName) method instead.

So that's how it is...
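To see what the chained calls actually configure, here is a minimal sketch that builds a request without sending it and then inspects it through Connection.request(). The class name ConnectionChainingSketch and the helper buildSearchRequest are made up for this illustration; nothing goes over the network until execute(), get() or post() is called.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionChainingSketch {
    // Hypothetical helper: builds (but does not send) a POST request,
    // so the chained settings can be inspected offline.
    static Connection buildSearchRequest(String url, String query) {
        return Jsoup.connect(url)
                .data("query", query)            // request parameter
                .userAgent("Mozilla")            // User-Agent header
                .timeout(3000)                   // timeout in milliseconds
                .method(Connection.Method.POST); // same HTTP method that post() would use
    }

    public static void main(String[] args) {
        Connection conn = buildSearchRequest("http://example.com/", "Java");
        Connection.Request req = conn.request();
        System.out.println(req.method() + " " + req.url() + " timeout=" + req.timeout());
        // conn.execute().parse() would send the request and return the Document.
    }
}
```

Each chained call returns the same Connection, which is why the settings can be stacked in a single expression.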

Load Document from a File

Load Document

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
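For trying this out without an existing /tmp/input.html, the sketch below writes a small snippet to a temporary file and parses it back. The class ParseFileSketch, the helper parseTempFile and the sample HTML are assumptions for illustration only; the third argument is the base URI used to resolve relative links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ParseFileSketch {
    // Hypothetical helper: writes the HTML to a temp file, then parses that file.
    static Document parseTempFile(String html, String baseUri) {
        try {
            File input = File.createTempFile("jsoup-sketch", ".html");
            input.deleteOnExit();
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(input, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseTempFile(
                "<html><head><title>Sample</title></head>"
                        + "<body><a href='/about'>About</a></body></html>",
                "http://example.com/");
        System.out.println(doc.title());
        // The relative href is resolved against the base URI given above.
        System.out.println(doc.select("a").first().absUrl("href"));
    }
}
```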

Now, on to extracting data

1. Extracting : Use DOM methods to navigate a document

Use Dom for Extracting

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

finding elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

element data

attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()
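A few of the finder and data methods listed above, tried on an inline snippet. This is only a sketch: the class DomNavigationSketch and the sample HTML are invented for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomNavigationSketch {
    // Sample markup made up for this example.
    static final String HTML =
            "<div id='content'>"
          + "<p class='intro'>First</p>"
          + "<p>Second <a href='/more'>more</a></p>"
          + "<p>Third</p>"
          + "</div>";

    static Document load() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Element content = load().getElementById("content");
        Element intro = content.getElementsByClass("intro").first();

        System.out.println(intro.tagName());                   // tag name of the element
        System.out.println(intro.nextElementSibling().text()); // text of the following sibling
        System.out.println(intro.parent().id());               // id of the parent element
        System.out.println(content.children().size());         // number of direct child elements
        System.out.println(content.child(2).text());           // child(int) is 0-based
        // attr on Elements reads the attribute from the first match that has it
        System.out.println(content.getElementsByAttribute("href").attr("href"));
    }
}
```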

Manipulating HTML and text

2. Extracting : Use selector-syntax to find elements

Use selector

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
  // img with src ending .png

Element masthead = doc.select("div.masthead").first();
  // div with class="masthead"

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

jsoup supports a CSS (or jQuery) like selector syntax for finding matching elements.

The select method is available on a Document, an Element, or an Elements collection, so you can narrow the search down to a specific element or chain select calls.

select returns a list of elements (as Elements), which provides a range of methods for extracting and manipulating the results.
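To make the three contexts concrete, here is a rough sketch that calls select on a Document, then on the resulting Elements, then on a single Element. The class SelectContextSketch and the menu markup are assumptions made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectContextSketch {
    // Sample markup made up for this example.
    static final String HTML =
            "<ul class='menu'>"
          + "<li><a href='/a'>A</a></li>"
          + "<li><a href='/b'>B</a></li>"
          + "<li>No link</li>"
          + "</ul>";

    // select on a Document
    static Elements menuItems() {
        return Jsoup.parse(HTML).select("ul.menu > li");
    }

    public static void main(String[] args) {
        Elements items = menuItems();
        Elements links = items.select("a[href]");   // select on Elements: searches inside every item
        Element firstItem = items.first();
        Elements firstLinks = firstItem.select("a"); // select on a single Element

        System.out.println(items.size());
        System.out.println(links.text());            // combined text of all matches
        System.out.println(firstLinks.attr("href"));
    }
}
```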

Selector overview

tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with attribute, e.g. [href]
[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
[attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements, e.g. *
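The attribute selectors from the overview above can be checked quickly against a small document. This is a sketch under assumptions: the class AttributeSelectorSketch, the count helper and the markup are all invented here. Note that the backslash in the regex has to be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AttributeSelectorSketch {
    // Sample markup made up for this example.
    static final String HTML =
            "<img src='logo.PNG' width='500'>"
          + "<img src='photo.jpeg'>"
          + "<img src='anim.gif'>"
          + "<a href='/path/page' data-id='7'>in</a>"
          + "<a href='http://example.com/'>out</a>";

    // Hypothetical helper: number of elements matching the CSS query.
    static int count(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("[^data-]"));        // attribute name prefix
        System.out.println(count("[width=500]"));     // exact attribute value
        System.out.println(count("a[href^=http]"));   // value starts with
        System.out.println(count("img[src$=.gif]"));  // value ends with
        System.out.println(count("a[href*=/path/]")); // value contains
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]")); // regex match
    }
}
```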

Selector combinations

el#id: elements with ID, e.g. div#logo
el.class: elements with class, e.g. div.masthead
el[attr]: elements with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
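The difference between the descendant, child and sibling combinators is easiest to see side by side. A minimal sketch, assuming a made-up class CombinatorSketch, a count helper and sample markup that are not part of the original post:

```java
import org.jsoup.Jsoup;

public class CombinatorSketch {
    // Sample markup made up for this example.
    static final String HTML =
            "<div class='content'><p>direct</p><div class='inner'><p>nested</p></div></div>"
          + "<h1>Title</h1><p>after one</p><p>after two</p>";

    // Hypothetical helper: number of elements matching the CSS query.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count(".content p"));      // any p below .content
        System.out.println(count("div.content > p")); // only direct p children
        System.out.println(count("h1 + p"));          // p immediately after h1
        System.out.println(count("h1 ~ p"));          // every later sibling p of h1
        System.out.println(count("h1, div.inner"));   // union of both selectors
    }
}
```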

Pseudo selectors

:lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
:gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
:eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
:has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
:not(selector): find elements that do not match the selector; e.g. div:not(.logo)
:contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the specified regular expression
Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
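The 0-based indexing and the difference between :contains and :containsOwn are easy to get wrong, so here is a small sketch to check them. The class PseudoSelectorSketch, its count helper and the markup are hypothetical, written only for this illustration.

```java
import org.jsoup.Jsoup;

public class PseudoSelectorSketch {
    // Sample markup made up for this example.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div class='logo'><p>Welcome to jsoup</p></div>"
          + "<div><span>Login here</span></div>"
          + "<div>plain</div>";

    // Hypothetical helper: number of elements matching the CSS query.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(2)"));               // sibling index 0 and 1
        System.out.println(count("td:gt(2)"));               // sibling index 3 only
        System.out.println(count("div:has(p)"));             // div containing a p
        System.out.println(count("div:not(.logo)"));         // div without class logo
        System.out.println(count("p:contains(JSOUP)"));      // case-insensitive text search
        System.out.println(count("div:matches((?i)login)")); // regex on the element text
        System.out.println(count("div:containsOwn(login)")); // text sits in the span, not directly in the div
    }
}
```

The last query finds nothing because :containsOwn only looks at text that belongs directly to the element, not to its descendants.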

See the Selector API reference for the full supported list and details. Nothing special here...

3. Extracting : Extract attributes, text, and HTML from elements

Extract attributes....

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

These methods are the core ways to access element data, and there are a few others as well:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)

All of these accessors have corresponding setter methods for changing the data. / The items below are just for reference.

The reference documentation for Element and the collection Elements class
Working with URLs
finding elements with the CSS selector syntax
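As a quick illustration of the setter side, the sketch below changes an attribute, the text and the class list of a link. The class SetterSketch, the helper editedLink and the sample markup are assumptions for this example only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SetterSketch {
    // Hypothetical helper: parses a tiny snippet and edits its only link.
    static Element editedLink() {
        Document doc = Jsoup.parse("<p><a href='/old'>old text</a></p>");
        Element link = doc.select("a").first();
        link.attr("href", "/new"); // attr(key, value) sets an attribute
        link.text("new text");     // text(value) replaces the text content
        link.addClass("external"); // adds a class name
        return link;
    }

    public static void main(String[] args) {
        Element link = editedLink();
        System.out.println(link.attr("href"));
        System.out.println(link.text());
        System.out.println(link.hasClass("external"));
        System.out.println(link.outerHtml()); // the serialized element after the edits
    }
}
```

The setters return the element itself, so they can also be chained in one expression.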

4. Extracting : Working with URLs

Working with URLs

Document doc = Jsoup.connect("http://jsoup.org").get();
 
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"

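URLs in an HTML document are often written relative to the document's location.
When you read the href attribute with Node.attr(String key), you get back the value exactly as it is written in the source HTML.

So if you want an absolute URL, there is the abs: prefix, which resolves the attribute value against the document's base URI and returns the full address:
attr("abs:href"). For this to work, it is important to specify a base URI when parsing the document.

If you don't like the abs: prefix, there is also Node.absUrl(String key), which does the same thing but is accessed through the natural attribute key.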
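Since the base URI matters so much here, a minimal offline sketch may help: the same snippet is parsed once with a base URI and once without. The class AbsUrlSketch, the helper firstLink and the URLs are made up for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AbsUrlSketch {
    static final String HTML = "<a href='/docs/'>Docs</a>";

    // Hypothetical helper: parses the snippet with the given base URI and returns its link.
    static Element firstLink(String baseUri) {
        return Jsoup.parse(HTML, baseUri).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink("http://example.com/base/page.html");
        System.out.println(link.attr("href"));     // value as written in the source
        System.out.println(link.attr("abs:href")); // resolved against the base URI
        System.out.println(link.absUrl("href"));   // same result via absUrl

        // Without a base URI there is nothing to resolve against,
        // so absUrl returns an empty string.
        Element noBase = Jsoup.parse(HTML).select("a").first();
        System.out.println("[" + noBase.absUrl("href") + "]");
    }
}
```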

Example Program : List Links

Example

package org.jsoup.examples;
 
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
import java.io.IOException;
 
/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}
 
org/jsoup/examples/ListLinks.java

Example output (trimmed)

Out result

Fetching http://news.ycombinator.com/...
Media: (38)
* img: <http://ycombinator.com/images/y18.gif> 18x18 ()
* img: <http://ycombinator.com/images/s.gif> 10x1 ()
* img: <http://ycombinator.com/images/grayarrow.gif> x ()
* img: <http://ycombinator.com/images/s.gif> 0x10 ()
* script: <http://www.co2stats.com/propres.php?s=1138>
* img: <http://ycombinator.com/images/s.gif> 15x1 ()
* img: <http://ycombinator.com/images/hnsearch.png> x ()
* img: <http://ycombinator.com/images/s.gif> 25x1 ()
* img: <http://mixpanel.com/site_media/images/mixpanel_partner_logo_borderless.gif> x (Analytics by Mixpan.)
Imports: (2)
* link <http://ycombinator.com/news.css> (stylesheet)
* link <http://ycombinator.com/favicon.ico> (shortcut icon)
Links: (141)
* a: <http://ycombinator.com> ()
* a: <http://news.ycombinator.com/news> (Hacker News)
* a: <http://news.ycombinator.com/newest> (new)
* a: <http://news.ycombinator.com/newcomments> (comments)
* a: <http://news.ycombinator.com/leaders> (leaders)
* a: <http://news.ycombinator.com/jobs> (jobs)
* a: <http://news.ycombinator.com/submit> (submit)
* a: <http://news.ycombinator.com/x?fnid=JKhQjfU7gW> (login)
* a: <http://news.ycombinator.com/vote?for=1094578&dir=up&whence=%6e%65%77%73> ()
* a: <http://www.readwriteweb.com/archives/facebook_gets_faster_debuts_homegrown_php_compiler.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29&utm_content=Twitter> (Facebook speeds up PHP)
* a: <http://news.ycombinator.com/user?id=mcxx> (mcxx)
* a: <http://news.ycombinator.com/item?id=1094578> (9 comments)
* a: <http://news.ycombinator.com/vote?for=1094649&dir=up&whence=%6e%65%77%73> ()
* a: <http://groups.google.com/group/django-developers/msg/a65fbbc8effcd914> ("Tough. Django produces XHTML.")
* a: <http://news.ycombinator.com/user?id=andybak> (andybak)
* a: <http://news.ycombinator.com/item?id=1094649> (3 comments)
* a: <http://news.ycombinator.com/vote?for=1093927&dir=up&whence=%6e%65%77%73> ()
* a: <http://news.ycombinator.com/x?fnid=p2sdPLE7Ce> (More)
* a: <http://news.ycombinator.com/lists> (Lists)
* a: <http://news.ycombinator.com/rss> (RSS)
* a: <http://ycombinator.com/bookmarklet.html> (Bookmarklet)
* a: <http://ycombinator.com/newsguidelines.html> (Guidelines)
* a: <http://ycombinator.com/newsfaq.html> (FAQ)
* a: <http://ycombinator.com/newsnews.html> (News News)
* a: <http://news.ycombinator.com/item?id=363> (Feature Requests)
* a: <http://ycombinator.com> (Y Combinator)
* a: <http://ycombinator.com/w2010.html> (Apply)
* a: <http://ycombinator.com/lib.html> (Library)
* a: <http://www.webmynd.com/html/hackernews.html> ()
* a: <http://mixpanel.com/?from=yc> ()