[Study] DataCamp Machine Learning - Supervised Learning: Classification
- Premise: making more accurate predictions
3. Logistic Regression
Automatic Feature Selection
- Stepwise Regression
- Logistic regression requires the predictors to be chosen in advance
- Types
- Forward Stepwise: starts with no predictors and adds them one at a time
- Backward Stepwise: starts with all predictors and removes them one at a time
- Not used very often because of several drawbacks
- Drawbacks
- No guarantee of finding the best possible model
- The procedure violates some statistical assumptions
- Can produce a model that makes little sense in the real world
- Conclusion: despite these problems, it is a useful starting point when you don't know where to begin
# Null model with no predictors and full model with every candidate predictor
null_model <- glm(donated ~ 1, data = donors, family = "binomial")
full_model <- glm(donated ~ veteran + bad_address + has_children + wealth_rating + interest_veterans + interest_religion + pet_owner + catalog_shopper + recency + frequency + money + missing_age + imputed_age, data = donors, family = "binomial")
# Forward stepwise selection, searching between the null and full models
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
# Estimate the probability
step_prob <- predict(step_model, type = "response")
# Plot the ROC
library(pROC)
ROC <- roc(donors$donated, step_prob)
plot(ROC, col = "red")
auc(ROC)
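As a quick comparison (a sketch reusing the objects built above, not part of the original exercise), the same steps give the AUC of the full model, so you can check whether the leaner stepwise model performs comparably:
# Sketch: AUC of the full model, for comparison with the stepwise model
full_prob <- predict(full_model, type = "response")
ROC_full <- roc(donors$donated, full_prob)
auc(ROC_full)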
4. Classification Trees
- Useful for business strategy because interpreting them requires no statistics
- Especially in fields where transparency matters, e.g. reviewing/approving loan applications
- root node -> decision nodes
- leaf node = final decision
- R package: rpart()
library(rpart)
# Grow a classification tree for loan outcome; cp = 0 grows the tree fully
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))
# Predict the outcome for an applicant with good credit and one with bad credit
predict(loan_model, good_credit, type = "class")
predict(loan_model, bad_credit, type = "class")
# Visualizing the classification tree
library(rpart.plot)
rpart.plot(loan_model)
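rpart.plot() also accepts styling arguments; the call below is only an illustrative sketch (the argument values are my own, not from the notes):
# type = 3 puts split labels on the branches; box.palette colors nodes by predicted class
rpart.plot(loan_model, type = 3, box.palette = "RdBu", fallen.leaves = TRUE)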
4-1) Growing Larger Classification Trees
- Choosing where to split
- Divide-and-conquer
- Splits on the feature and value that create the greatest improvement in subgroup homogeneity (see the impurity sketch after this list)
- Divide-and-conquer always looks to create the split resulting in the greatest improvement to purity.
- Always produces axis-parallel splits
- Can become needlessly complex as a result
- Risk of overfitting by modeling noise
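A minimal sketch of what "improvement in subgroup homogeneity" means, using Gini impurity on made-up labels (the data and the gini() helper below are hypothetical, not course code):
# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
# Hypothetical parent node and one candidate axis-parallel split into two children
parent <- c(rep("repay", 10), rep("default", 10))
left   <- c(rep("repay", 8),  rep("default", 2))
right  <- c(rep("repay", 2),  rep("default", 8))
# The tree prefers the split with the largest drop in weighted impurity
n <- length(parent)
gini(parent) - (length(left) / n * gini(left) + length(right) / n * gini(right))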
- Measuring tree performance
- Split the data into a training set and a test set
- Create a random sample with sample():
nrow(loans)          # 11312 rows in total
nrow(loans) * 0.75   # 75% of the rows -> 8484 for training
# Randomly pick 75% of the row indices for the training set
sample_rows <- sample(11312, 8484)
loans_train <- loans[sample_rows, ]
loans_test <- loans[-sample_rows, ]
# Grow a tree on the training data using all available predictors
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Predict classes for the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")
# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
# Compute the accuracy
mean(loans_test$outcome == loans_test$pred)
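Because the tree was grown fully (cp = 0), it can overfit as noted above; a quick follow-up check (not in the original notes) is to compare training accuracy against the test accuracy just computed, since a large gap suggests overfitting:
# Training accuracy of the same tree, for comparison with the test accuracy
train_pred <- predict(loan_model, loans_train, type = "class")
mean(loans_train$outcome == train_pred)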