Data Analysis [R] - Session 14

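Odds ratio
- A value showing how many times more likely an event is under condition A than under condition B. In this post it is computed as a word's relative frequency in one text divided by its relative frequency in the other.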
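
As a rough illustration of the definition, here is a minimal sketch in base R. The words and counts are made up for the example, and the column names `a` and `b` are hypothetical stand-ins for two texts. Adding 1 to every count is an assumption borrowed from the analysis later in this post, so that a word missing from one text does not cause division by zero.

```r
# Hypothetical word counts for two texts, A and B (made-up numbers)
counts <- data.frame(
  word = c("economy", "peace", "people"),
  a    = c(3, 9, 5),
  b    = c(11, 0, 6)
)

# Add 1 to every count, then convert the counts to relative
# frequencies within each text (each column is divided by its own total)
counts$ratio_a <- (counts$a + 1) / sum(counts$a + 1)
counts$ratio_b <- (counts$b + 1) / sum(counts$b + 1)

# Odds ratio: above 1 means the word is relatively more frequent in A,
# below 1 means it is relatively more frequent in B
counts$odds_ratio <- counts$ratio_a / counts$ratio_b
counts
```

With these numbers "peace" ends up far above 1 (typical of A), while "economy" falls below 1 (typical of B).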

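Log odds ratio
- The logarithm of the odds ratio.
- Taking the logarithm turns values greater than 1 into positive numbers and values between 0 and 1 into negative numbers, and maps exactly 1 to 0.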
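
The sign behavior described above can be checked with a few made-up odds ratios. This is only a sketch using base R's `log()`, which is the natural logarithm. The values are illustrative, not taken from the speech data.

```r
# Hypothetical odds ratios: one below 1, one equal to 1, one above 1
odds_ratio <- c(0.25, 1, 4)

# Natural log of each odds ratio
log_odds_ratio <- log(odds_ratio)
log_odds_ratio

# 0.25 and 4 are reciprocals, so their logs have the same magnitude
# and opposite signs, which puts both texts on a symmetric scale
sign(log_odds_ratio)
```

One practical consequence is that a word four times as frequent in one text and a word four times as frequent in the other sit at the same distance from 0.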

Analyzing presidential speeches with the odds ratio

# Load the required packages
library(KoNLP)
library(dplyr)
library(stringr)
library(tidyr) # provides pivot_wider(), used below
library(tidytext)
library(ggplot2)
library(wordcloud2)

# Set the working directory
setwd("path/to/working/directory") # placeholder: replace with your own path

# Load President Moon Jae-in's speech
raw_moon <- readLines("speech_moon.txt", encoding = "UTF-8") # read the file line by line

moon <- raw_moon %>%
  as_tibble() %>%
  mutate(president = "moon")
moon

# Load President Park Geun-hye's speech
raw_park <- readLines("speech_park.txt", encoding = "UTF-8")

park <- raw_park %>%
  as_tibble() %>%
  mutate(president = "park")
park

# Combine the two data sets by row
bind_speeches <- bind_rows(moon, park)
bind_speeches

# Replace every character that is not a Hangul syllable with a space,
# then collapse repeated whitespace
speeches <- bind_speeches %>%
  mutate(value = str_replace_all(value, "[^가-힣]", " ")) %>%
  mutate(value = str_squish(value))
speeches

# Tokenize into nouns (keep nouns only)
speeches <- speeches %>%
  unnest_tokens(input = value, output = word, token = extractNoun) # input and output are column names
speeches

# Count word frequency per president and drop one-character words
count_word <- speeches %>%
  count(president, word) %>%
  filter(str_count(word) > 1)
count_word

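# Compute each word's relative frequency (probability) per president
# Reshape to one row per word with a count column for each president
frequency_wide <- count_word %>%
  pivot_wider(names_from = president, values_from = n, values_fill = 0)
frequency_wide

# Add 1 to every count so that a word missing from one speech does not
# produce a zero ratio; each ratio is divided by its own president's total
frequency_wide <- frequency_wide %>%
  mutate(ratio_moon = (moon + 1) / sum(moon + 1),
         ratio_park = (park + 1) / sum(park + 1),
         odds_ratio = ratio_moon / ratio_park)
frequency_wide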

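# Extract the 10 words with the lowest and the 10 with the highest odds ratio
# (tied ranks can change the exact number of rows returned)
top10 <- frequency_wide %>%
  filter(rank(odds_ratio) <= 10 | rank(-odds_ratio) <= 10)
top10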

# Sort by odds_ratio in descending order
top10 <- top10 %>% arrange(-odds_ratio)
top10 %>% print(n = Inf)

# Label which president each word is characteristic of
top10 <- top10 %>%
  mutate(president = ifelse(odds_ratio > 1, "moon", "park"), # "moon" if odds_ratio is greater than 1, otherwise "park"
         n = ifelse(odds_ratio > 1, moon, park))             # that president's count for the word
top10

# Visualize as a bar chart
ggplot(top10, aes(x = reorder_within(word, n, president), y = n)) +
  geom_col() + # bar chart
  coord_flip() + # swap the x and y axes
  geom_text(aes(label = n), hjust = -0.3) + # print the exact count next to each bar
  facet_wrap(~president, scales = "free") + # one panel per president, to show whose words they are
  scale_x_reordered() +
  labs(x = NULL)

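TF-IDF
- Term Frequency - Inverse Document Frequency
- TF: term frequency, how often a word appears in a document
- DF: document frequency, the number of documents that contain the word; the larger it is, the more common and generic the word
- IDF: log(N/DF), where N is the total number of documents
- TF-IDF = TF * IDF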
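
To make the formulas above concrete, here is a minimal sketch in base R on a made-up corpus of three documents. The data frame `toy` and its values are hypothetical. TF is taken here as a word's share of all counted words in its document, which, as far as I know, matches how `bind_tf_idf()` in tidytext defines it; using raw counts for TF would only rescale the result within each document.

```r
# Hypothetical counts: one row per (document, word) pair
toy <- data.frame(
  doc  = c("d1", "d1", "d2", "d2", "d3"),
  word = c("nation", "economy", "nation", "peace", "nation"),
  n    = c(2, 2, 1, 3, 4)
)

# N: total number of documents
n_docs <- length(unique(toy$doc))

# TF: the word's count divided by the total count in its document
toy$tf <- toy$n / ave(toy$n, toy$doc, FUN = sum)

# DF: number of documents containing the word
# (each row is a distinct document for that word, so counting rows works)
toy$df <- ave(toy$n, toy$word, FUN = length)

# IDF and TF-IDF, using the natural logarithm
toy$idf    <- log(n_docs / toy$df)
toy$tf_idf <- toy$tf * toy$idf
toy
```

"nation" appears in all three documents, so its IDF is log(3/3) = 0 and its TF-IDF is 0 everywhere, however often it is used. "peace" and "economy" each appear in only one document and keep a positive score.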

Analyzing presidential speeches with TF-IDF

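# Install readr (only needed once) and load it
install.packages("readr")
library(readr)

# Set the working directory
setwd("path/to/working/directory") # placeholder: replace with your own path

# Load the speeches of all presidents from one CSV file
raw_speeches <- read_csv("speeches_presidents.csv")
raw_speeches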

# Keep Hangul only, collapse whitespace, then tokenize into nouns
speeches <- raw_speeches %>%
  mutate(value = str_replace_all(value, "[^가-힣]", " "),
         value = str_squish(value)) %>%
  unnest_tokens(input = value,
                output = word,
                token = extractNoun)
speeches

# Count word frequency per president and drop one-character words
count_word <- speeches %>%
  count(president, word) %>%
  filter(str_count(word) > 1)
count_word

count_word <- count_word %>%
  bind_tf_idf(term = word, document = president, n = n) %>% # compute tf, idf, and tf_idf
  arrange(-tf_idf)
count_word

# Use TF-IDF to see what each president emphasized in their speech
count_word %>% filter(president == "이명박")
count_word %>% filter(president == "노무현")
count_word %>% filter(president == "박근혜")
count_word %>% filter(president == "문재인")

# Keep the 10 words with the highest TF-IDF for each president
top10 <- count_word %>%
  group_by(president) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE)
top10 %>% print(n = Inf)

ggplot(top10, aes(x = reorder_within(word, tf_idf, president), y = tf_idf)) +
  geom_col() + # bar chart
  coord_flip() + # swap the x and y axes
  geom_text(aes(label = n), hjust = -0.3) + # label each bar with the word's raw count (n)
  facet_wrap(~president, scales = "free") + # one panel per president, to show whose words they are
  scale_x_reordered() +
  labs(x = NULL)
