Source Code Guide

File Structure

src/
├── predictor.py              # Prediction class
├── feature_engineering.py    # Feature generation
├── train.py                  # Training script
└── README.md                 # This file

File Descriptions

1. `predictor.py` - Prediction class

Purpose: the main class that loads the trained models and runs predictions.

Main class: EarlyWarningPredictor

Key methods:

# Load the model (Hugging Face style)
model = EarlyWarningPredictor.from_pretrained("models/")

# Single prediction
result = model.predict(store_data)

# Batch prediction
results = model.predict_batch(stores_df)

# Explain a prediction
explanation = model.explain(store_data)

# Model information
info = model.get_model_info()

Return value:

{
    'risk_score': 78.5,           # risk score, 0-100
    'risk_level': '높음',          # one of 낮음 (low) / 보통 (medium) / 높음 (high)
    'closure_probability': 0.785, # probability of closure
    'risk_factors': {...},        # score per risk factor
    'action_items': [...]         # recommended actions
}

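How to modify:

# 1. Change the risk threshold
def predict(self, store_data, threshold=0.5):  # change the default value here
    ...

# 2. Adjust the ensemble weights
# In models/config.json (leave out the trailing annotation; JSON does not allow comments):
{
    "ensemble_weights": [0.6, 0.4]  # XGBoost 60%, LightGBM 40%
}

# 3. Change the risk level cutoffs
if risk_score < 40:  # raised from the previous 30 to 40
    risk_level = '낮음'  # low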
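
The cutoff change in item 3 can be sketched as a small standalone function. This is only an illustration, not the code in `predictor.py`: it assumes `risk_score` is `closure_probability` times 100 (consistent with the 0.785 / 78.5 pair in the return value above), takes the low cutoff of 30 from the comment above, and uses a hypothetical high cutoff of 70 that the README does not state. The function name and parameters are made up for this example.

```python
def risk_level_from_probability(closure_probability, low_cutoff=30.0, high_cutoff=70.0):
    """Map a closure probability (0-1) to a 0-100 score and a risk label.

    Assumptions (not confirmed by the source): score = probability * 100,
    and high_cutoff=70 is a placeholder value.
    """
    if not 0.0 <= closure_probability <= 1.0:
        raise ValueError("closure_probability must be between 0 and 1")

    risk_score = round(closure_probability * 100, 1)

    if risk_score < low_cutoff:
        risk_level = "낮음"  # low
    elif risk_score < high_cutoff:
        risk_level = "보통"  # medium
    else:
        risk_level = "높음"  # high
    return risk_score, risk_level


# Raising the low cutoff from 30 to 40 moves a 35-point store from medium to low.
default_result = risk_level_from_probability(0.35)
relaxed_result = risk_level_from_probability(0.35, low_cutoff=40.0)
```

Keeping the cutoffs as parameters makes it possible to compare settings without editing the class itself.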

2. `feature_engineering.py` - Feature generation

Purpose: automatically generates 47 features from the raw data.

Main class: FeatureEngineer

Generated features:

Sales (15)

sales_avg_1m, sales_avg_3m, sales_avg_6m, sales_avg_12m
sales_recent_vs_previous, sales_mom_change, sales_yoy_change
sales_max, sales_min, sales_range

Customers (12)

customer_reuse_rate, customer_reuse_trend
customer_new_rate
Customer ratios by age group and gender (10)

Operations (8)

operation_months, operation_avg_amount
operation_cancel_rate, operation_delivery_rate

Trend (5)

trend_slope, trend_r2, trend_direction
trend_consecutive_down, trend_consecutive_up

Volatility (4)

volatility_cv, volatility_std, volatility_mad, volatility_recent_std

Seasonality (2)

seasonality_detected, seasonality_strength

Context (1)

context_industry
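
To make the trend and volatility groups more concrete, here is a rough sketch of how a few of them could be computed from a monthly sales series. The README does not define these features, so the formulas are assumptions: a least-squares line for `trend_slope` and `trend_r2`, the sample standard deviation for `volatility_std`, std divided by mean for `volatility_cv`, and a count of month-over-month declines ending at the latest month for `trend_consecutive_down`. The function name is hypothetical.

```python
import numpy as np


def trend_and_volatility(monthly_sales):
    """Compute a few illustrative trend/volatility features (assumed definitions)."""
    sales = np.asarray(monthly_sales, dtype=float)
    if sales.size < 3:
        raise ValueError("need at least 3 months of sales")

    # Fit sales = slope * month_index + intercept by least squares.
    months = np.arange(sales.size)
    slope, intercept = np.polyfit(months, sales, deg=1)
    fitted = slope * months + intercept

    # R^2 of the linear fit; defined as 0 when the series is constant.
    ss_res = float(np.sum((sales - fitted) ** 2))
    ss_tot = float(np.sum((sales - sales.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

    # Sample standard deviation and coefficient of variation.
    std = float(sales.std(ddof=1))
    mean = float(sales.mean())
    cv = std / mean if mean != 0 else float("nan")

    # Count declines walking backwards from the most recent month.
    consecutive_down = 0
    for change in np.diff(sales)[::-1]:
        if change < 0:
            consecutive_down += 1
        else:
            break

    return {
        "trend_slope": float(slope),
        "trend_r2": r2,
        "trend_consecutive_down": consecutive_down,
        "volatility_std": std,
        "volatility_cv": cv,
    }


example = trend_and_volatility([100, 90, 80, 70])
```

A steadily falling series like the one above yields a negative slope, an R^2 close to 1, and three consecutive declines.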

Usage example:

from feature_engineering import FeatureEngineer

engineer = FeatureEngineer()

# '카페' means cafe and '서울' means Seoul; the values are kept as they appear in the data
features = engineer.create_features(
    store_data={'industry': '카페', 'location': '서울'},
    monthly_usage=usage_df,
    monthly_customers=customer_df
)

How to add a new feature:

import pandas as pd

class FeatureEngineer:
    def _create_custom_features(self, df):
        """Add custom features."""
        features = {}

        # Example: growth rate (mean of the last 3 months vs. the first 3 months)
        if 'RC_M1_SAA' in df.columns and len(df) >= 6:
            recent_3m = df['RC_M1_SAA'].tail(3).mean()
            past_3m = df['RC_M1_SAA'].head(3).mean()
            # Skip the feature when the baseline is zero to avoid dividing by zero
            if past_3m != 0:
                features['growth_rate'] = (recent_3m / past_3m - 1) * 100

        return features

    def create_features(self, store_data, monthly_usage, monthly_customers):
        features = {}

        # Existing features...
        features.update(self._create_sales_features(monthly_usage))
        features.update(self._create_customer_features(monthly_customers))

        # Add the new custom features
        features.update(self._create_custom_features(monthly_usage))

        return pd.DataFrame([features])

3. `train.py` - Training script

Purpose: a script for training the models from the command line.

Usage:

# Basic usage
python src/train.py

# With options
python src/train.py --data data/raw --output models/ --max-stores 1000

# Help
python src/train.py --help

Parameters:

--data: path to the data directory (default: data/raw)
--output: path where the models are saved (default: models)
--max-stores: maximum number of stores, for test runs (optional)
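
For readers who want to see how these three options might be declared, the following is a minimal `argparse` sketch. The flag names and defaults come from the list above; the help strings, the `build_parser` name, and the choice of `int` for `--max-stores` are assumptions and may differ from the real `train.py`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented train.py options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Train the early warning models")
    parser.add_argument("--data", default="data/raw",
                        help="path to the data directory")
    parser.add_argument("--output", default="models",
                        help="path where the models are saved")
    parser.add_argument("--max-stores", type=int, default=None,
                        help="maximum number of stores, for test runs")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--max-stores", "1000"])
print(args.data, args.output, args.max_stores)
```

Note that `argparse` exposes `--max-stores` as the attribute `args.max_stores`.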

Key functions:

def load_data(data_dir):
    """Load the data."""

def create_features(df_store, df_usage, df_customer):
    """Generate features."""

def preprocess_data(X, y):
    """Preprocess and split the data."""

def apply_smote(X_train, y_train):
    """Apply SMOTE."""

def train_models(X_train, y_train):
    """Train the models."""

def evaluate_models(xgb_model, lgb_model, X_test, y_test):
    """Evaluate the models."""

def save_models(...):
    """Save the models."""

How to modify:

# 1. Change the model hyperparameters
def train_models(X_train, y_train):
    xgb_model = xgb.XGBClassifier(
        max_depth=8,           # increased from 6 to 8
        learning_rate=0.05,    # decreased from 0.1 to 0.05
        n_estimators=300,      # increased from 200 to 300
        # ...
    )

# 2. Change the ensemble weights
def evaluate_models(...):
    ensemble_pred = 0.6 * xgb_pred + 0.4 * lgb_pred  # previously 0.5, 0.5

# 3. Change the train/test split ratio
def preprocess_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, ...  # changed from 0.25 to 0.2
    )
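
The weighted combination in item 2 can be tried out on its own with a few made-up probabilities. The sketch below assumes `xgb_pred` and `lgb_pred` are arrays of closure probabilities from the two models; the function name and the normalization by the weight sum are additions for this example and are not taken from `train.py`.

```python
import numpy as np


def ensemble_probability(xgb_pred, lgb_pred, weights=(0.5, 0.5)):
    """Weighted average of two models' predicted probabilities (illustrative)."""
    w_xgb, w_lgb = weights
    if w_xgb < 0 or w_lgb < 0 or (w_xgb + w_lgb) == 0:
        raise ValueError("weights must be non-negative and not both zero")

    xgb_arr = np.asarray(xgb_pred, dtype=float)
    lgb_arr = np.asarray(lgb_pred, dtype=float)
    if xgb_arr.shape != lgb_arr.shape:
        raise ValueError("both prediction arrays must have the same shape")

    # Dividing by the weight sum keeps the result a valid probability
    # even if the weights do not add up to exactly 1.
    return (w_xgb * xgb_arr + w_lgb * lgb_arr) / (w_xgb + w_lgb)


# Hypothetical probabilities for two stores.
xgb_pred = [0.9, 0.2]
lgb_pred = [0.7, 0.4]
equal = ensemble_probability(xgb_pred, lgb_pred)                # 0.5 / 0.5
tilted = ensemble_probability(xgb_pred, lgb_pred, (0.6, 0.4))   # 0.6 / 0.4
```

Shifting weight toward one model pulls the ensemble output toward that model's prediction, which matters most for stores where the two models disagree.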

Common Modification Scenarios

Scenario 1: Training on new data

Step 1: Prepare the data

# Place the three CSV files in data/raw/
data/raw/
├── big_data_set1_f.csv
├── ds2_monthly_usage.csv
└── ds3_monthly_customers.csv

Step 2: Run training

python src/train.py

Step 3: Use the model for prediction

from src.predictor import EarlyWarningPredictor
model = EarlyWarningPredictor.from_pretrained("models/")

Scenario 2: Improving model performance

Option 1: Add features

# Add a new feature in feature_engineering.py
def _create_custom_features(self, df):
    # Compute the new indicator here
    pass

Option 2: Tune the hyperparameters

# Adjust the parameters in train.py
xgb_model = xgb.XGBClassifier(
    max_depth=8,
    learning_rate=0.05,
    ...
)

Option 3: Adjust the ensemble weights

# Edit models/config.json
{
    "ensemble_weights": [0.6, 0.4]
}
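
A hand-edited config file is easy to get wrong, so it can help to check the weights when loading them. The sketch below reads `ensemble_weights` with the standard `json` module and rejects values that are negative or do not sum to 1. The `load_ensemble_weights` helper, the fallback to `[0.5, 0.5]`, and the sum-to-1 rule are assumptions made for this example; the README does not describe how `predictor.py` actually reads the file.

```python
import json
import tempfile
from pathlib import Path


def load_ensemble_weights(config_path, default=(0.5, 0.5)):
    """Read and validate [xgboost_weight, lightgbm_weight] from a JSON config."""
    path = Path(config_path)
    if not path.exists():
        return [float(w) for w in default]

    config = json.loads(path.read_text(encoding="utf-8"))
    weights = config.get("ensemble_weights", list(default))

    if len(weights) != 2 or any(w < 0 for w in weights):
        raise ValueError("ensemble_weights must be two non-negative numbers")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("ensemble_weights should sum to 1")
    return [float(w) for w in weights]


# Demonstrate with a temporary file so no real models/config.json is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = Path(tmp_dir) / "config.json"
    config_file.write_text(json.dumps({"ensemble_weights": [0.6, 0.4]}), encoding="utf-8")
    weights = load_ensemble_weights(config_file)
    missing = load_ensemble_weights(Path(tmp_dir) / "does_not_exist.json")
```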

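Scenario 3: Adjusting the prediction threshold

More sensitive (stronger early warning):

result = model.predict(store_data, threshold=0.3)
# Flags a store as at risk when the closure probability is 30% or higher

More conservative:

result = model.predict(store_data, threshold=0.7)
# Flags a store as at risk only when the closure probability is 70% or higher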
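
The effect of the threshold is easiest to see on a handful of probabilities. The following sketch uses invented closure probabilities and a hypothetical `flag_at_risk` helper; it only mirrors the "at or above the threshold" rule described in the comments above and does not call the real predictor.

```python
def flag_at_risk(closure_probabilities, threshold=0.5):
    """Return True for each store whose closure probability meets the threshold."""
    return [p >= threshold for p in closure_probabilities]


# Invented probabilities for four stores.
probabilities = [0.25, 0.35, 0.55, 0.75]

flag_counts = {}
for threshold in (0.3, 0.5, 0.7):
    flag_counts[threshold] = sum(flag_at_risk(probabilities, threshold))
    print(f"threshold={threshold}: {flag_counts[threshold]} of {len(probabilities)} stores flagged")
```

A lower threshold flags more stores and catches trouble earlier at the cost of more false alarms, while a higher threshold flags fewer stores and risks missing some that later close.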

References

XGBoost documentation: https://xgboost.readthedocs.io/
LightGBM documentation: https://lightgbm.readthedocs.io/
SMOTE reference: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html