2019.01.13 / 22:23

[Mixi] htmlcleaner, a powerful XML parser.

hanulbit
Following the earlier post on web scraping with Apache's httpclient, this time the topic is how to pull just the parts you need out of a web page.
Among the parsers used for that, the recently popular htmlcleaner converts a page into XML that is easy to parse, and this post shows how.

Where to get htmlcleaner: http://htmlcleaner.sourceforge.net/

Cutting up the streamed string with regular expressions through the Matcher and Pattern classes works too, but in terms of development effort
it is more efficient to convert to XML first and then pull the data out. (Of course, the former really is faster at run time.)

Below is some code I put together as a simple example.

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleXmlSerializer;
import org.htmlcleaner.TagNode;

public class HtmlToXml {
    // 'string' is the HTML text obtained by scraping.
    public String htx(String string) {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        Writer str = new StringWriter();
        TagNode node = null;
        try {
            node = cleaner.clean(string);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (node == null) {
            return "";    // Cleaning failed, so there is nothing to serialize.
        }
        SimpleXmlSerializer se = new SimpleXmlSerializer(props);
        try {
            se.writeXml(node, str, "EUC-JP");                 // Encoding of the web page.
            str.close();
            se.writeXmlToFile(node, "test.xml", "EUC-JP");    // Name of the XML file to create, and its encoding.
        } catch (IOException e) {
            e.printStackTrace();
        }
        return str.toString();
    }
}


That is all it takes to produce an XML document. Of course, you do not have to write it to a file; keeping it as a string works just as well for parsing.
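
As a rough illustration of parsing the string form directly, the sketch below uses only the JDK's built-in DOM and XPath classes. The class name XmlStringQuery, the method select, and the sample markup are hypothetical stand-ins for whatever htx() returns; they are not part of htmlcleaner.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: queries an in-memory XML string with XPath.
class XmlStringQuery {
    // Returns the text content of every node matched by the XPath expression.
    static List<String> select(String xml, String xpathExpr) {
        try {
            // Parse from a character stream, so no file is needed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpathExpr, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("could not query XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample; in practice this would be the string returned by htx().
        String xml = "<html><body><a href=\"/a\">first</a><a href=\"/b\">second</a></body></html>";
        System.out.println(select(xml, "//a"));        // link texts
        System.out.println(select(xml, "//a/@href"));  // link targets
    }
}
```

Because the input is read from a StringReader, the encoding named in the XML declaration plays no role here; it only matters once the document is written out as bytes.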

There is one problem, though: JavaScript, CSS styles, and other content that has nothing to do with XML occasionally cause errors.
When that happens, those parts have to be removed first.
(This is also the step where processing speed becomes an issue.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CleanDocument {
    private static interface Patterns {
        // <script> and <noscript> blocks; non-greedy so each block is matched separately.
        public static final Pattern SCRIPTS = Pattern.compile(
                "<(no)?script[^>]*>.*?</(no)?script>",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        // <style> blocks; ".*?" instead of ".*" so text between two style blocks survives.
        public static final Pattern STYLE = Pattern.compile(
                "<style[^>]*>.*?</style>",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        // Pairs of consecutive newlines, i.e. blank lines.
        public static final Pattern Blank = Pattern.compile("\n\n");
    }

    public String clean(String str) {
        if (str == null) {
            return null;
        }
        Matcher mat;
        mat = Patterns.SCRIPTS.matcher(str);
        str = mat.replaceAll("");
        mat = Patterns.STYLE.matcher(str);
        str = mat.replaceAll("");
        mat = Patterns.Blank.matcher(str);
        str = mat.replaceAll("");
        return str;
    }
}


With the approach above, regular expressions strip out scripts, styles, and blank lines, leaving a clean HTML document.
(That said, the handling of comments admittedly remains a gray area.)
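
One way to approach the comment problem is a further non-greedy pattern in the same style. This is only a sketch under assumptions: CommentStripper and strip are hypothetical names, and the pattern assumes "-->" never appears inside a script or attribute value, which real pages do not guarantee.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes <!-- ... --> comments from an HTML string.
class CommentStripper {
    // Non-greedy with DOTALL: each comment is matched on its own, even across lines.
    private static final Pattern COMMENTS = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String strip(String html) {
        if (html == null) {
            return null;
        }
        return COMMENTS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p><!-- drop\nthis --><p>keep too</p>";
        System.out.println(strip(html));
    }
}
```

Running it after the script and style removal keeps a "-->" inside inline JavaScript from cutting a match short.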

One question that may come up here is why anyone would go to the trouble of converting HTML into XML at all.
The reason is simply that HTML is a language that gets interpreted far too leniently -_-.

Existing parsers assume a perfectly well-formed document, so extracting values from documents with comments
or malformed markup is close to impossible.
(After cutting, cutting, and cutting again with regexes, this route sometimes even turns out faster...;;;)

Processing string data from the web is hardly a rare task, so if you ever find yourself tearing your hair out over regular expressions
while trying to wrangle that data, htmlcleaner is like rain after a long drought.

- All the more so for the brave souls on the development front lines fighting garbled text (文字化け) in double-byte character regions.
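
For readers who have not met this kind of garbling, here is a minimal sketch of how it arises: the same bytes decoded with the wrong charset. CharsetDemo and decode are hypothetical names. UTF-8 and ISO-8859-1 stand in for the page encodings because every JVM is guaranteed to ship them; with EUC-JP pages the mechanism is the same.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: decoding bytes with the wrong charset garbles the text.
class CharsetDemo {
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The word "mojibake" in Japanese, written with Unicode escapes so that
        // the source file's own encoding cannot interfere.
        String original = "\u6587\u5b57\u5316\u3051";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Right charset: the text comes back intact.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original));
        // Wrong charset: every byte becomes its own Latin-1 character.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

This is why the encoding passed to the serializer has to match the encoding the page was actually served in.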

PS: ...this job needs to wrap up soon so I can move on to iPhone development... but the project shows no sign of keeling over...;;;