
[Java] - Crawling with Jsoup (feat. Inflearn)

by 주발2 2021. 5. 11.



 

📎 Crawling with Jsoup

Hello! This post covers how to crawl a website from Java using Jsoup.

A side project I'm currently working on(?) needs course data from the Inflearn site, so I had to crawl it.

I went back and forth between Python, which I had used for crawling before, and Java, which I'm more comfortable with, and ended up going with Java.

The overall flow (connect to the site, then pull data out by tag) is much the same as in Python, so it went without much trouble.

Below I'll walk through how I crawled and processed the data I needed from Inflearn!

📎 Data needed

The list isn't final, but the project needs roughly the following data.

  • Thumbnail link
  • Course title
  • Price (+ discounted price)
  • Rating
  • Instructor
  • Course link
  • Number of students
  • Number of course sessions
  • Course description
  • Course skills & stack

That's quite a lot of data... 😂 and it gave me a fair bit of trouble... 😂😂

These fields can't all be fetched the same way, so each one has to be checked individually.

Some fields can be read directly from the course listing page, while others are only available after following the link into each course page.

Let's look at the code to see how each one is crawled.

📎 Code

First, crawling from Java requires the Jsoup library.

I developed with Maven, so I added the jsoup library to pom.xml.

final String inflearnUrl = "https://www.inflearn.com/courses/it-programming";
Connection conn = Jsoup.connect(inflearnUrl);
Document document = conn.get();

This declares the URL to crawl and the Jsoup objects needed for it.
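A slightly more defensive way to open the connection might look like the sketch below. The helper name, the user-agent string, and the 10-second timeout are illustrative assumptions, not values from the original project. Nothing is sent to the site until get() is called on the returned Connection.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionSketch {

    // Hypothetical helper: configures a connection but does not send a request yet.
    static Connection open(final String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; course-crawler-example)") // assumed value
                .timeout(10_000); // milliseconds; assumed value
    }

    public static void main(String[] args) {
        Connection conn = open("https://www.inflearn.com/courses/it-programming");
        // Only inspects the prepared request; call conn.get() to actually fetch the page.
        System.out.println(conn.request().url());
        System.out.println(conn.request().timeout());
    }
}
```

Setting an explicit timeout keeps a slow page from stalling the whole crawl on one request.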

 

 

 

 

🎯 Thumbnail link

Now let's use the browser's developer tools on the Inflearn site to find the tag for the data we need, then fetch it.

Press F12, click the element-picker (mouse pointer) icon at the top, then click a course image, and the matching elements show up in the panel.

Among them, the tag with class="swiper-lazy" gives us the thumbnail link.

  ※ Fetching by class="swiper-lazy" like this returns every thumbnail link on the page.

So let's use that tag to fetch all the thumbnail links on the current page (www.inflearn.com/courses/it-programming).

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Crawling {

    public static void main(String[] args) {
        final String inflearnUrl = "https://www.inflearn.com/courses/it-programming";
        Connection conn = Jsoup.connect(inflearnUrl);

        try {
            Document document = conn.get();
            Elements imageUrlElements = document.getElementsByClass("swiper-lazy");

            for (Element element : imageUrlElements) {
                System.out.println(element);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code uses the document's getElementsByClass method to fetch every element whose class is "swiper-lazy".

Since there are multiple matches, they come back as an Elements (plural) collection, and a for loop visits and prints each element.

Hmm, it does seem to have fetched everything, but there's a lot of unnecessary markup.

Opening the src link at the top of the output does show the thumbnail image, but what we need is only the thumbnail link, not the entire HTML tag.

So the tag needs a bit of processing.

            for (Element element : imageUrlElements) {
                System.out.println(element.attr("abs:src"));
            }

As shown above, the element's attr method returns just the part we need (the src attribute). The abs: prefix makes Jsoup resolve the value to an absolute URL.

With the for loop changed like this, printing again gives only the part we need.
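To see what the abs: prefix changes, here is a small offline sketch that parses a made-up HTML fragment instead of the live site. The markup, the example.com base URI, and the class and method names are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsSrcSketch {

    // Returns the absolute URL of the first .swiper-lazy element in the HTML,
    // resolving relative paths against baseUri. Returns "" when nothing matches.
    static String firstThumbnail(final String html, final String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Element img = document.selectFirst(".swiper-lazy");
        return (img == null) ? "" : img.attr("abs:src");
    }

    public static void main(String[] args) {
        // Made-up markup; the real page structure may differ.
        String html = "<div><img class=\"swiper-lazy\" src=\"/images/thumb.png\"></div>";
        Document document = Jsoup.parse(html, "https://example.com/courses");
        Element img = document.selectFirst(".swiper-lazy");
        System.out.println(img.attr("src"));     // raw attribute value as written in the HTML
        System.out.println(img.attr("abs:src")); // same value resolved against the base URI
    }
}
```

When the document comes from Jsoup.connect(...).get(), the base URI is the fetched URL, which is why attr("abs:src") works in the crawler without passing one explicitly.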

OK, that's one of the fields we need.. haha

There's a long way to go, so let's get through the other fields quickly!

 

 

🎯 Course title

The course title works much like the thumbnail. Inspecting a course shows that the title is wrapped in two classes, as follows.

<div class="card-content">

    <div class="course_title">

These tags are nested, "card-content" > "course_title", so this time I'll fetch the data a little differently, using the select method.

(I'm honestly not sure why I did it this way, haha. Probably the other approach didn't work, so I ended up fetching it like this.)

The top and bottom of the code are the same as in the thumbnail example, so I'll omit them :)

            Elements titleElements = document.select("div.card-content > div.course_title");
            for (int j = 0; j < titleElements.size(); j++) {
                final String title = titleElements.get(j).text();
                System.out.println("Course title: " + title);
            }

As shown above, the document's select method fetches the div.course_title elements that are direct children (>) of div.card-content.
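The effect of the child combinator (>) can be checked offline with a made-up fragment. In this sketch the second course_title sits outside card-content, so the selector skips it. The markup and the class and method names are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class TitleSelectSketch {

    // Collects the text of every div.course_title that is a direct child of div.card-content.
    static List<String> titles(final String html) {
        Document document = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element title : document.select("div.card-content > div.course_title")) {
            result.add(title.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup: only the first course_title is inside card-content.
        String html = "<div class=\"card-content\"><div class=\"course_title\">Java Basics</div></div>"
                + "<div class=\"banner\"><div class=\"course_title\">Ad Title</div></div>";
        System.out.println(titles(html));
    }
}
```

A plain getElementsByClass("course_title") would return both elements here, which may be one reason the narrower selector was needed.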

 

 

😃 Good. We've fetched all the course titles on the current page. 😃

 

 

 

 

🎯 Price + discounted price

Prices on Inflearn come in two kinds, the regular price and the discounted price, and there's a catch here too.

First, some courses have no discounted price.

Some courses have no price at all and just show the text "무료" (free).

In other words, there are three cases for the price:

  • Only a regular price
  • Both a regular price and a discounted price
  • No price (free)

So the crawler has to take these cases into account.

(I decided to represent a missing price as 0.)

The class of the tag that holds the price is "price".

First, here is the full code that fetches the price and discounted price.

    public static void main(String[] args) {
        final String inflearnUrl = "https://www.inflearn.com/courses/it-programming";
        Connection conn = Jsoup.connect(inflearnUrl);

        try {
            Document document = conn.get();
            Elements priceElements = document.getElementsByClass("price");

            for (int j = 0; j < priceElements.size(); j++) {
                final String price = priceElements.get(j).text();
                final String realPrice = getRealPrice(price);
                final String salePrice = getSalePrice(price);

                final int realIntPrice = toInt(removeNotNumeric(realPrice));
                final int saleIntPrice = toInt(removeNotNumeric(salePrice));

                System.out.println("Price: " + realIntPrice);
                System.out.println("Discounted price: " + saleIntPrice);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static String getRealPrice(final String price) {
        final String[] pricesArray = price.split(" ");
        return pricesArray[0];
    }

    private static String getSalePrice(final String price) {
        final String[] pricesArray = price.split(" ");
        return (pricesArray.length == 1) ? price : pricesArray[1];
    }

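    // Strip every non-digit character (currency symbol, commas, text such as "무료")
    private static String removeNotNumeric(final String str) {
        return str.replaceAll("\\D", "");
    }

    // A price with no digits (a free course) is treated as 0
    private static int toInt(final String str) {
        return str.isEmpty() ? 0 : Integer.parseInt(str);
    }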

 

The line final String price = priceElements.get(j).text(); reads the price text.

If a discounted price exists, the text holds two values separated by a space; otherwise it holds only one.

So getRealPrice() and getSalePrice() pick out the regular price and the discounted price,

and removeNotNumeric() strips everything that is not a number.
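The three price cases can be exercised without touching the site. The sketch below is a standalone variant, not the original code: the Prices holder, the method names, and the sample strings (a "₩" prefix, comma separators, "무료" for free) are assumptions about what the "price" element's text looks like.

```java
class PriceSketch {

    // Holds both prices as integers; equal values mean there is no discount.
    static final class Prices {
        final int realPrice;
        final int salePrice;

        Prices(final int realPrice, final int salePrice) {
            this.realPrice = realPrice;
            this.salePrice = salePrice;
        }
    }

    // Strips everything except digits; text with no digits (e.g. "무료") becomes 0.
    static int toPrice(final String text) {
        final String digits = text.replaceAll("\\D", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    // Splits on whitespace: one token means no discount, two tokens mean regular + discounted.
    static Prices parse(final String priceText) {
        final String[] parts = priceText.trim().split("\\s+");
        final int real = toPrice(parts[0]);
        final int sale = (parts.length == 1) ? real : toPrice(parts[1]);
        return new Prices(real, sale);
    }

    public static void main(String[] args) {
        // Sample strings are assumptions, one per case described above.
        for (String sample : new String[] {"₩55,000", "₩55,000 ₩38,500", "무료"}) {
            Prices p = parse(sample);
            System.out.println(sample + " -> " + p.realPrice + " / " + p.salePrice);
        }
    }
}
```

Returning 0 when no digits remain matches the decision above to record free courses as 0, and avoids calling Integer.parseInt on an empty string.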

 

Running the code above prints the price and discounted price of each course.

 

 

 

 

 

🎯 Course link

The remaining fields live on each course's own page, so the course link has to be fetched before they can be read.

So let's crawl the course link first.

The course link comes from the a tag whose class is course_card_front.

 

 

            Elements linkElements = document.select("a.course_card_front");

            for (int j = 0; j < linkElements.size(); j++) {
                final String url = linkElements.get(j).attr("abs:href");
            }

Like the thumbnail link, the course link element comes with its whole HTML tag.

So attr("abs:href") is used to extract just the data we need.

Printing it confirms that the link data is extracted correctly.

 

 

 

 

🎯 Rating

Next is the rating.

The rating can only be seen by opening the course itself.

So while crawling Inflearn, we have to go one level deeper.

We use the course link extracted above to open the course page.

/* Inside the course link */
Connection innerConn = Jsoup.connect(url);
Document innerDocument = innerConn.get();

 

 

 

 

Once inside the course page, the rating can be read from the <div class="dashboard-star__num"> tag.

The code that fetches the rating is as follows.

/* Rating */
Element ratingElement = innerDocument.selectFirst("div.dashboard-star__num");
final float rating = Objects.isNull(ratingElement)
    ? toFloat("0")
    : toFloat(ratingElement.text());
System.out.println("Rating: " + rating);



private static float toFloat(final String str) {
    return Float.parseFloat(str);
}

The rating has a small catch too: not every course has one.

For a course with no rating, selectFirst returns null and calling text() on it throws a NullPointerException, so the code checks for that with Objects.isNull().

When printed, courses without a rating show 0.0, and all the others show their rating.
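The null check can be tried offline against two made-up course pages, one with a rating and one without. This is a sketch under assumptions: the markup and the class and method names are invented, and the extra NumberFormatException guard is an addition that the original code does not have.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RatingSketch {

    // Returns the rating, or 0.0f when the element is missing or its text is not a number.
    static float rating(final Document coursePage) {
        Element ratingElement = coursePage.selectFirst("div.dashboard-star__num");
        if (ratingElement == null) {
            return 0.0f; // selectFirst returns null when nothing matches
        }
        try {
            return Float.parseFloat(ratingElement.text().trim());
        } catch (NumberFormatException e) {
            return 0.0f; // extra guard, not in the original code
        }
    }

    public static void main(String[] args) {
        // Made-up course pages for illustration.
        Document rated = Jsoup.parse("<div class=\"dashboard-star__num\">4.8</div>");
        Document unrated = Jsoup.parse("<div class=\"cd-header\">no reviews yet</div>");
        System.out.println(rating(rated));
        System.out.println(rating(unrated));
    }
}
```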

 

 

 

 

🎯 Instructor, course description, tech stack

The instructor, description, and tech stack can all be extracted from the Inflearn listing page, so I'll cover them together.

The instructor and description are crawled by tag just like before, so I'll skip them and only look briefly at the tag for the tech stack.

The tech stack lives in the span tags under the div with class course_skills (div.course_skills > span).

The code looks like this.

    public static void main(String[] args) {
        final String inflearnUrl = "https://www.inflearn.com/courses/it-programming";
        Connection conn = Jsoup.connect(inflearnUrl);

        try {
            Document document = conn.get();
            Elements instructorElements = document.getElementsByClass("instructor");
            Elements descriptionElements = document.select("p.course_description");
            Elements skillElements = document.select("div.course_skills > span");

            for (int j = 0; j < instructorElements.size(); j++) {
                final String instructor = instructorElements.get(j).text();
                final String description = descriptionElements.get(j).text();
                final String skills = removeWhiteSpace(skillElements.get(j).text());

                System.out.println("Instructor: " + instructor);
                System.out.println("Course description: " + description);
                System.out.println("Tech stack: " + skills);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static String removeWhiteSpace(final String str) {
        return str.replaceAll("\\s", "");
    }

 

 

 

 

 

🎯 Full code

The number of students, the number of course sessions, and similar fields are fetched the same way as above, so I'll skip the walkthrough and just show them in the full code!

 

package com.github.oneline.onelinecourse.util;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Objects;

public class InflearnCrawling {

    private static final Logger log = LoggerFactory.getLogger(InflearnCrawling.class);
    private static final int FIRST_PAGE_INDEX = 1;
    private static final int LAST_PAGE_INDEX = 32;
    private static final String PLATFORM = "Inflearn";

    public static void main(String[] args) {
        try {
            // Iterate over every page of the development course listing
            for (int i = FIRST_PAGE_INDEX; i <= LAST_PAGE_INDEX; i++) {
                final String inflearnUrl = "https://www.inflearn.com/courses/it-programming?order=seq&page=" + i;
                Connection conn = Jsoup.connect(inflearnUrl);
                Document document = conn.get();

                // Fields to crawl
                //   - thumbnail link, course title, price (discounted price), rating, instructor, course link, number of students, platform, number of course sessions + duration
                Elements imageUrlElements = document.getElementsByClass("swiper-lazy");
                Elements titleElements = document.select("div.card-content > div.course_title");
                Elements priceElements = document.getElementsByClass("price");
                Elements instructorElements = document.getElementsByClass("instructor");
                Elements linkElements = document.select("a.course_card_front");
                Elements descriptionElements = document.select("p.course_description");
                Elements skillElements = document.select("div.course_skills > span");
                String[] imageUrls = new String[imageUrlElements.size()];

                int setIndex = 0;
                int getIndex = 0;

                for (Element e : imageUrlElements) {
                    imageUrls[setIndex++] = e.attr("abs:src");
                }

                for (int j = 0; j < titleElements.size(); j++) {
                    final String title = titleElements.get(j).text();
                    final String price = priceElements.get(j).text();
                    final String realPrice = getRealPrice(price);
                    final String salePrice = getSalePrice(price);
                    final int realIntPrice = toInt(removeNotNumeric(realPrice));
                    final int saleIntPrice = toInt(removeNotNumeric(salePrice));
                    final String currency = String.valueOf(price.charAt(0));
                    final String instructor = instructorElements.get(j).text();
                    final String url = linkElements.get(j).attr("abs:href");
                    final String description = descriptionElements.get(j).text();
                    final String skills = removeWhiteSpace(skillElements.get(j).text());

                    System.out.println("Thumbnail: " + imageUrls[j]);
                    System.out.println("Course title: " + title);
                    System.out.println("Price: " + realIntPrice);
                    System.out.println("Discounted price: " + saleIntPrice);
                    System.out.println("Currency: " + currency);
                    System.out.println("Instructor: " + instructor);
                    System.out.println("Course link: " + url);
                    System.out.println("Course description: " + description);
                    System.out.println("Tech stack: " + skills);

                    /* Inside the course link */
                    Connection innerConn = Jsoup.connect(url);
                    Document innerDocument = innerConn.get();

                    /* Rating */
                    Element ratingElement = innerDocument.selectFirst("div.dashboard-star__num");
                    final float rating = Objects.isNull(ratingElement)
                            ? toFloat("0")
                            : toFloat(ratingElement.text());
                    System.out.println("Rating: " + rating);

                    /* Number of students */
                    Element listenerElement = innerDocument.selectFirst("div.cd-header__info-cover");
                    final String listener = Objects.isNull(listenerElement)
                            ? innerDocument.selectFirst("span > strong").text()
                            : innerDocument.select("div.cd-header__info-cover > span > strong").get(1).text();
                    System.out.println("Number of students: " + removeNotNumeric(listener));
                    final int viewCount = Integer.parseInt(removeNotNumeric(listener));

                    /* Number of course sessions */
                    final String course = innerDocument.selectFirst("span.cd-curriculum__sub-title").text();
                    System.out.println("Number of course sessions: " + getSessionCount(course));
                    final int sessionCount = Integer.parseInt(getSessionCount(course));
                    System.out.println();

                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static String getRealPrice(final String price) {
        final String[] pricesArray = price.split(" ");
        return pricesArray[0];
    }

    private static String getSalePrice(final String price) {
        final String[] pricesArray = price.split(" ");
        return (pricesArray.length == 1) ? price : pricesArray[1];
    }

    // Strip HTML tags
    private static String stripHtml(final String html) {
        return Jsoup.clean(html, Whitelist.none());
    }

    // Remove the leading and trailing parentheses
    private static String removeBracket(final String str) {
        return str.replaceAll("^[(]|[)]$", "");
    }

    // Keep only the digits that precede the Korean counter "개" in the curriculum subtitle
    private static String getSessionCount(final String course) {
        return removeNotNumeric(course.substring(0, course.indexOf("개")));
    }

    // Strip every non-digit character (currency symbol, commas, text such as "무료")
    private static String removeNotNumeric(final String str) {
        return str.replaceAll("\\D", "");
    }

    private static String removeWhiteSpace(final String str) {
        return str.replaceAll("\\s", "");
    }

    // A price with no digits (a free course) is treated as 0
    private static int toInt(final String str) {
        return str.isEmpty() ? 0 : Integer.parseInt(str);
    }

    private static float toFloat(final String str) {
        return Float.parseFloat(str);
    }

}

 

 

final String inflearnUrl = "https://www.inflearn.com/courses/it-programming?order=seq&page=" + i;

In the code above the URL looks like this because every course on Inflearn has to be crawled, so the loop walks through all of the listing pages.
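As a small illustration of the paging, the listing URLs can be built up front and then visited one by one. The class and method names here are hypothetical; the URL pattern and the 1 to 32 range come from the full code above.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrlSketch {

    private static final String BASE_URL =
            "https://www.inflearn.com/courses/it-programming?order=seq&page=";

    // Builds the listing URL for every page from first to last, inclusive.
    static List<String> pageUrls(final int first, final int last) {
        final List<String> urls = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            urls.add(BASE_URL + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // 1 and 32 mirror FIRST_PAGE_INDEX and LAST_PAGE_INDEX in the full code.
        List<String> urls = pageUrls(1, 32);
        System.out.println(urls.size());
        System.out.println(urls.get(0));
        System.out.println(urls.get(urls.size() - 1));
    }
}
```

The last page index is hard-coded, so it has to be updated by hand whenever the number of listing pages changes.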

 

 

/* Number of students */
Element listenerElement = innerDocument.selectFirst("div.cd-header__info-cover");
final String listener = Objects.isNull(listenerElement)
        ? innerDocument.selectFirst("span > strong").text()
        : innerDocument.select("div.cd-header__info-cover > span > strong").get(1).text();
System.out.println("Number of students: " + removeNotNumeric(listener));
final int viewCount = Integer.parseInt(removeNotNumeric(listener));

The number of students has its own catches, such as the data being missing or the same tag appearing more than once, so it needed conditional handling.

 

 

for (Element e : imageUrlElements) {
    imageUrls[setIndex++] = e.attr("abs:src");
}

Also, crawling the thumbnail links ran into issues that the other fields didn't have, so I collected them separately.
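The full code reads each field into its own Elements list and pairs them up by index, which only lines up when every course card yields exactly one match per selector. One possible alternative, sketched here, is to select each card once and query inside it. The card selector div.course_card_item is a hypothetical name that does not appear in the original post, and the sample markup is made up; the inner class names are the ones used above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class PerCardSketch {

    // Text of the first match inside the card, or "" when the card has no such element.
    static String textOrEmpty(final Element card, final String cssQuery) {
        final Element found = card.selectFirst(cssQuery);
        return (found == null) ? "" : found.text();
    }

    // Returns one "title | instructor | thumbnail" line per card.
    static List<String> summarize(final Document document) {
        final List<String> rows = new ArrayList<>();
        // "div.course_card_item" is a hypothetical card container selector.
        for (Element card : document.select("div.course_card_item")) {
            final Element img = card.selectFirst(".swiper-lazy");
            final String thumbnail = (img == null) ? "" : img.attr("abs:src");
            rows.add(textOrEmpty(card, "div.course_title")
                    + " | " + textOrEmpty(card, ".instructor")
                    + " | " + thumbnail);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up listing markup; the second card has no thumbnail on purpose.
        String html = "<div class=\"course_card_item\">"
                + "<img class=\"swiper-lazy\" src=\"/a.png\">"
                + "<div class=\"course_title\">Java Basics</div>"
                + "<div class=\"instructor\">Kim</div></div>"
                + "<div class=\"course_card_item\">"
                + "<div class=\"course_title\">Spring Intro</div>"
                + "<div class=\"instructor\">Lee</div></div>";
        for (String row : summarize(Jsoup.parse(html, "https://example.com/"))) {
            System.out.println(row);
        }
    }
}
```

With this shape a missing field leaves one value empty for that card instead of shifting every later index.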

 
