Instagram Hashtag Crawling and Analysis - 5

hanshow113 2020. 7. 29. 18:50

In the previous posts, I first fetched every Instagram post, converted the results to a DataFrame, and only then filtered them by date range.

This time I changed the code so that the date range is set first, and the function crawls only the posts that fall within that range.

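Previously, I scrolled the page with selenium to collect the links of all posts and only then read each link. The new code instead collects links one scroll at a time and reads each post right away to compare its date against the range.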
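
As a rough illustration of the per-post date comparison, the sketch below converts the two boundary dates to Unix timestamps and checks whether a post's timestamp lies between them. The series uses its own to_timestamp() from an earlier post; the version here is only a stand-in, and its "%Y-%m-%d" input format and the in_period() helper are assumptions made for this example.

```python
from datetime import datetime


def to_timestamp(date_str, fmt="%Y-%m-%d"):
    """Stand-in helper (assumed format): parse a local date string into a Unix timestamp."""
    return datetime.strptime(date_str, fmt).timestamp()


def in_period(post_ts, date1, date2):
    """Return True when post_ts falls in [date1, date2), start inclusive, end exclusive."""
    return to_timestamp(date1) <= post_ts < to_timestamp(date2)


# Example: a post taken on 2020-07-15 checked against 2020-07-01 .. 2020-07-29
post_ts = to_timestamp("2020-07-15")
print(in_period(post_ts, "2020-07-01", "2020-07-29"))
```

The half-open interval mirrors range(int(timestamp_1), int(timestamp_2)) in the function below, which also excludes the end value.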

The first scroll loads 33 posts, and each subsequent scroll loads 12 more.

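At each step, the newly loaded posts are added to a temporary list, and if none of those 12 posts falls within the configured period, the function terminates. (Posts with out-of-range dates are mixed in here and there, which is why the whole batch is checked instead of stopping at the first mismatch.)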
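
The stopping rule described above can be sketched without a browser. In this hypothetical example each inner list stands for the post timestamps found after one scroll; collect_in_range() and the sample numbers are made up for illustration and are not part of the original code.

```python
def collect_in_range(batches, start_ts, end_ts):
    """Keep in-range timestamps and stop at the first batch with no in-range post.

    batches: iterable of lists, one list of post timestamps per scroll (assumed shape).
    """
    kept = []
    for batch in batches:
        in_range = [ts for ts in batch if start_ts <= ts < end_ts]
        kept.extend(in_range)
        # A single stray post is tolerated; only a fully out-of-range batch ends the crawl
        if batch and not in_range:
            break
    return kept


# Toy data: the period is [100, 110); the third batch is entirely outside it
batches = [[105, 99, 104], [103, 50, 101], [40, 30], [102]]
print(collect_in_range(batches, 100, 110))
```

Because the loop stops at the third batch, the in-range value in the fourth batch is never visited, which is the trade-off this rule accepts in exchange for not scrolling through the entire feed.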

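import time
from datetime import datetime
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup as bs

# driver, to_timestamp(), SCROLL_PAUSE_TIME, and the result lists (post_link, date_list, ...)
# are assumed to be defined as in the earlier posts of this series.

def scroll_crawling(date1, date2):
    post_link.clear()
    popularPost_len.clear()    # clear any records left over from a previous run
    timestamp_1 = to_timestamp(date1)
    timestamp_2 = to_timestamp(date2)   # reference timestamps for comparing against each crawled post's date
    periods = range(int(timestamp_1), int(timestamp_2))  # crawl period (start inclusive, end exclusive)
    start = time.time()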
    while True:
        pageString = driver.page_source
        bsObj = bs(pageString, 'lxml')

        temp_postlink = []
        for postline in bsObj.find_all(name='div', attrs={"class":"Nnq7C weEfm"}):
            a_len = len(postline.select('a'))
            popularPost_len.append(a_len)
            # Each row shows up to 3 posts, but the most recent or the last row may hold only 1 or 2, so the length is taken per row
            for post in range(a_len):
                item = postline.select('a')[post]
                link = item.attrs['href']
                if link not in post_link:   # rows accumulate as the page scrolls and duplicates are not removed, so add only unseen links
                    post_link.append(link)
                    temp_postlink.append(link)
        count = len(temp_postlink)
        for i in range(len(temp_postlink)):
            req = Request("https://www.instagram.com" + temp_postlink[i], headers={'User-Agent': 'Mozilla/5.0'})
            postpage = urlopen(req).read()
            post_body = bs(postpage, 'lxml', from_encoding='utf-8')
            post_core = post_body.find('meta', attrs={'property': "og:description"})
            contents = post_core['content']

            posttxt = str(postpage)
            timestamp = int(posttxt[posttxt.find('taken_at_timestamp')+20 : posttxt.find('taken_at_timestamp')+30])

            if timestamp in periods:
                # date and time
                date_list.append(datetime.fromtimestamp(timestamp).strftime('%Y.%m.%d %H:%M'))
                month_list.append(datetime.fromtimestamp(timestamp).strftime("%m"))
                day_list.append(datetime.fromtimestamp(timestamp).strftime("%d"))

                # link to the individual post
                link_list.append("https://www.instagram.com" + temp_postlink[i])

                # likes
                try:
                    # number in front of the " Likes, " string; drop any thousands separators
                    likes = int(contents[: contents.find(' Likes, ')].replace(',', ''))
                except ValueError:
                    likes = 0  # some posts show a view count instead of likes; record 0 for those
                like_list.append(likes)

                # account id
                # The meta description comes in several shapes, such as "(@personal_id) on Instagram",
                # "@personal_id posted on Instagram", and "personal_id on Instagram",
                # so each shape gets its own branch.
                if "@" in contents and ")" in contents:
                    personal_id = contents[contents.find("@") + 1: contents.find(")")]
                elif "shared a post on Instagram" in contents:
                    personal_id = contents[contents.find("@") + 1: contents.find('shared a post on Instagram')]
                elif "shared a photo on Instagram" in contents:
                    personal_id = contents[contents.find("@") + 1: contents.find('shared a photo on Instagram')]
                elif "@" not in contents and ")" not in contents and "on Instagram" in contents:
                    # no "@" here, so find("@") is -1 and the slice starts at index 0
                    personal_id = contents[contents.find("@") + 1: contents.find('on Instagram')]
                else:
                    personal_id = contents[1: contents.find(' posted on')]
                id_list.append(personal_id)

                # hashtags
                tag_list.append([])
                for tag_content in post_body.find_all('meta', attrs={'property': "instapp:hashtags"}):
                    hashtags = tag_content['content'].rstrip(',')
                    tag_list[-1].append(hashtags)
            else:
                print('This post is outside the configured period.')
                count -= 1  # count reaches 0 only when every post in this batch is out of range

        # If every newly loaded post was outside the period, stop crawling
        if temp_postlink and count == 0:
            print(time.time() - start)  # elapsed seconds
            break
        last_height = driver.execute_script('return document.body.scrollHeight')  # read the current scroll height via JavaScript
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom through selenium
        time.sleep(SCROLL_PAUSE_TIME)
        # pauses the whole process for the given time (unconditional delay)
        # driver.implicitly_wait(SCROLL_PAUSE_TIME)
        # waits while the browser engine parses the page (continues without delay once the element exists)
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(SCROLL_PAUSE_TIME)
            # driver.implicitly_wait(SCROLL_PAUSE_TIME)
            new_height = driver.execute_script("return document.body.scrollHeight")

            if new_height == last_height:
                break
            else:
                last_height = new_height
                continue
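
The function above reads the timestamp by slicing a fixed number of characters after 'taken_at_timestamp'. A regular expression is one possible alternative that does not depend on the exact offsets. This is only a sketch: extract_timestamp() is a hypothetical helper, and the sample string is a made-up fragment rather than a real Instagram response.

```python
import re

# Matches "taken_at_timestamp":1596016200 with optional whitespace after the colon
TIMESTAMP_RE = re.compile(r'"taken_at_timestamp":\s*(\d+)')


def extract_timestamp(page_text):
    """Return the first taken_at_timestamp value as an int, or None if it is absent."""
    match = TIMESTAMP_RE.search(page_text)
    return int(match.group(1)) if match else None


# Made-up fragment that only imitates the shape of the embedded JSON
sample = '{"id":"1","taken_at_timestamp":1596016200,"is_video":false}'
print(extract_timestamp(sample))
print(extract_timestamp("no timestamp here"))
```

Returning None for a missing key lets the caller skip the post explicitly, whereas the slicing approach would raise a ValueError inside int().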

The rest of the code is the same as in the previous posts; the page_scroll() and date_based_crawling() functions have been merged into a single scroll_crawling() function.
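
The og:description parsing inside scroll_crawling() can also be tried in isolation. The sketch below assumes a description shaped like "1,234 Likes, 56 Comments - name (@some_id) on Instagram"; that sample string and the helper names are assumptions for illustration. It also shows a point worth checking when testing for two substrings at once: Python parses `"@" and ")" in text` as `")" in text`, because a non-empty string literal is always truthy.

```python
def parse_likes(description):
    """Return the like count in front of ' Likes, ', or 0 when it cannot be read."""
    head = description[: description.find(" Likes, ")]
    try:
        return int(head.replace(",", ""))  # drop thousands separators before converting
    except ValueError:
        return 0


def extract_parenthesized_id(description):
    """Return the id from a '(@some_id)' fragment, or None if that shape is absent."""
    if "@" in description and ")" in description:
        return description[description.find("@") + 1 : description.find(")")]
    return None


sample = "1,234 Likes, 56 Comments - someone (@some_id) on Instagram"
print(parse_likes(sample), extract_parenthesized_id(sample))

# Precedence pitfall: the first expression ignores "@" entirely
text = "no at sign)"
print("@" and ")" in text)          # True, only ")" is tested
print("@" in text and ")" in text)  # False, both substrings are tested
```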
