microsoftml.categorical_hash: Hashes and converts a text column to categories

Article
12/18/2024

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

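Description

A categorical hash transform that can be performed on data before training a model.

Details

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support handling factor data.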
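The idea described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names `hash_slot` and `indicator_bag` are hypothetical, and MD5 is used as a stand-in because this article does not say which hash function the transform uses internally.

```python
import hashlib

def hash_slot(value: str, hash_bits: int = 16, seed: int = 314489979) -> int:
    """Map a category string to a slot index in [0, 2**hash_bits)."""
    # Assumption: MD5 stands in for the (undocumented here) internal hash.
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).digest()
    # Keep only the lowest hash_bits bits so the result is a valid slot index.
    return int.from_bytes(digest[:8], "little") & ((1 << hash_bits) - 1)

def indicator_bag(values, hash_bits: int = 16):
    """Return a sparse bag {slot: count} for one or more category values."""
    bag = {}
    for v in values:
        slot = hash_slot(v, hash_bits)
        bag[slot] = bag.get(slot, 0) + 1
    return bag

# A single category sets exactly one slot.
single = indicator_bag(["I hate it"])
# A vector of categories is folded into one bag for the whole vector.
vector = indicator_bag(["red", "green", "red"])
print(single)
print(vector)
```

No dictionary of categories is built: the slot index comes directly from the hash, so two different values can land in the same slot.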

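Arguments

cols

A character string or list of variable names to transform. If a dict, the keys represent the names of the new variables to be created.

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.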
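As a rough guide to choosing hash_bits, the number of available slots is 2 raised to hash_bits. The sketch below uses hypothetical helper names (`slot_count`, `collision_probability`) and the standard birthday approximation, which assumes a uniform hash. The sample output at the end of this article reports `num vars: 65537`, which is consistent with 2^16 slots plus one bias term, although that reading is an inference rather than something the article states.

```python
import math

def slot_count(hash_bits: int) -> int:
    """Number of hash slots for a given hash_bits (valid range 1 to 30)."""
    if not 1 <= hash_bits <= 30:
        raise ValueError("hash_bits must be between 1 and 30")
    return 1 << hash_bits

def collision_probability(n_categories: int, hash_bits: int) -> float:
    """Birthday approximation of P(at least one collision), uniform hash assumed."""
    m = slot_count(hash_bits)
    return 1.0 - math.exp(-n_categories * (n_categories - 1) / (2.0 * m))

default_slots = slot_count(16)
print(default_slots)        # 65536
print(default_slots + 1)    # 65537

# Fewer bits means fewer slots and more collisions for the same data.
for bits in (8, 16, 24):
    print(bits, collision_probability(1000, bits))
```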

seed

An integer specifying the hashing seed. The default value is 314489979.

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.
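One way to picture the effect of ordered on a vector input is to mix each term's position into the hashed key. The following is a minimal sketch under that assumption; `hash_terms` and `_h` are hypothetical helpers, MD5 is a stand-in hash, and the way the library actually combines position and value is not documented here.

```python
import hashlib

def _h(text: str, hash_bits: int = 30) -> int:
    # Stand-in hash (assumption): MD5 truncated to hash_bits bits.
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << hash_bits) - 1)

def hash_terms(terms, ordered: bool = True, hash_bits: int = 30):
    """Hash each term; when ordered, the term's position is part of the key."""
    if ordered:
        return [_h(f"{pos}|{term}", hash_bits) for pos, term in enumerate(terms)]
    return [_h(term, hash_bits) for term in terms]

# Without position, reordering the terms yields the same set of slots.
unordered_a = hash_terms(["love", "it"], ordered=False)
unordered_b = hash_terms(["it", "love"], ordered=False)
print(sorted(unordered_a) == sorted(unordered_b))

# With position, the same terms in a different order hash differently
# (barring an unlikely collision).
ordered_a = hash_terms(["love", "it"], ordered=True)
ordered_b = hash_terms(["it", "love"], ordered=True)
print(sorted(ordered_a) == sorted(ordered_b))
```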

invert_hash

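An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.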
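Hashing is one-way, so recovering readable names requires remembering which original values landed in each slot. The sketch below illustrates that bookkeeping with a hypothetical `build_slot_names` helper. It assumes the limit applies per slot and uses MD5 as a stand-in hash; neither detail is confirmed by this article. With the default of 0, no names are kept, which matches the generic coefficient names such as `f1783` in the sample output at the end of this article.

```python
import hashlib

def hash_slot(value: str, hash_bits: int = 16) -> int:
    # Stand-in hash (assumption): MD5 truncated to hash_bits bits.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << hash_bits) - 1)

def build_slot_names(values, hash_bits: int = 16, invert_hash: int = 0):
    """Return {slot: [original values]} subject to the invert_hash limit."""
    names = {}
    if invert_hash == 0:
        return names              # no invert hashing: nothing is remembered
    for v in values:
        kept = names.setdefault(hash_slot(v, hash_bits), [])
        if v in kept:
            continue
        # -1 means no limit; otherwise keep at most invert_hash keys per slot.
        if invert_hash == -1 or len(kept) < invert_hash:
            kept.append(v)
    return names

reviews = ["I hate it", "Love it", "I like it a lot", "I hate it"]
print(build_slot_names(reviews, invert_hash=0))
print(build_slot_names(reviews, invert_hash=-1))
```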

output_kind

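A character string that specifies the kind of output.

"Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.
"Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.
"Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.
"Bin": Outputs a vector which is the binary representation of the category.

The default value is "Bag".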
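The four output kinds can be compared on a toy input. The sketch below follows the descriptions above but is not the library's code: `encode` is a hypothetical helper, the input is assumed to be already hashed to 0-based slot indices, and the bit width and bit order used for "Bin" are assumptions.

```python
def encode(keys, num_slots, output_kind="Bag"):
    """Encode one row of 0-based slot indices in the requested output kind."""
    if output_kind == "Bag":
        out = [0] * num_slots
        for k in keys:
            out[k] += 1                      # count occurrences per slot
        return out
    if output_kind == "Ind":
        # One indicator vector per element of the input vector.
        return [[1 if i == k else 0 for i in range(num_slots)] for k in keys]
    if output_kind == "Key":
        return [k + 1 for k in keys]         # 1-based IDs, per the description
    if output_kind == "Bin":
        width = max(1, (num_slots - 1).bit_length())
        # Assumption: most significant bit first.
        return [[(k >> b) & 1 for b in reversed(range(width))] for k in keys]
    raise ValueError(f"unknown output_kind: {output_kind}")

row = [2, 0, 2]                              # a vector of three categories
print(encode(row, 4, "Bag"))                 # [1, 0, 2, 0]
print(encode(row, 4, "Ind"))                 # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
print(encode(row, 4, "Key"))                 # [3, 1, 3]
print(encode(row, 4, "Bin"))                 # [[1, 0], [0, 0], [1, 0]]
```

For a single category, `encode([2], 4, "Bag")` equals the only vector in `encode([2], 4, "Ind")`, which is the equivalence noted under "Bag".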

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

Example

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# Weights are similar to categorical.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761
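The Score and Probability columns above can be cross-checked against the printed coefficients. The sketch below assumes that Score is the bias plus the weights of the active hash slots and that Probability is the logistic sigmoid of Score. It also assumes that "I hate it" activates slot `f1783`; that mapping is inferred from the numbers, since invert_hash was left at 0 and the output does not name the slots.

```python
import math

# Coefficients copied from the printed OrderedDict above.
coef = {
    "(Bias)": 0.2132447361946106,
    "f1783": -0.7939924597740173,
    "f38537": 0.1968022584915161,
}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Rows whose hashed review matches no selected weight get the bias only.
score_bias_only = coef["(Bias)"]
# Assumption: "I hate it" hashes to slot f1783.
score_hate = coef["(Bias)"] + coef["f1783"]

print(round(score_bias_only, 6), round(sigmoid(score_bias_only), 6))
print(round(score_hate, 6), round(sigmoid(score_hate), 6))
```

Under these assumptions the two printed pairs agree, to within rounding, with the rows for "This is great" and "I hate it" in the table.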
