microsoftml.categorical_hash: Hashes and converts a text column into categories

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

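Description

A categorical hash transform that can be applied to data before training a model.

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support factor data.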
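The idea of hashing a value and using the hash as an index into a bag can be sketched in plain Python. This is only an illustration: zlib.crc32 stands in for the hash function (the transform's actual hash is not described here), and hash_slot and indicator_bag are hypothetical helper names, not part of microsoftml.

```python
import zlib


def hash_slot(value, hash_bits=16, seed=314489979):
    """Map a string to a slot index in the range [0, 2**hash_bits)."""
    # Assumption: crc32 is a stand-in; the real transform uses its own hash.
    digest = zlib.crc32(value.encode("utf-8"), seed & 0xFFFFFFFF)
    return digest & ((1 << hash_bits) - 1)


def indicator_bag(values, hash_bits=16, seed=314489979):
    """Return a sparse bag {slot: count} for a vector of categorical values."""
    bag = {}
    for value in values:
        slot = hash_slot(value, hash_bits, seed)
        bag[slot] = bag.get(slot, 0) + 1
    return bag


# A vector input yields a single bag that covers all of its values.
bag = indicator_bag(["red", "blue", "red"])
```

Because no dictionary of categories is built, values never seen during training still map to a valid slot. The cost is that two different values can collide in the same slot.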

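Arguments

cols

A character string or a list of names of the variables to transform. If dict, the keys represent the names of the new variables to be created.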
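The three accepted forms of cols can be pictured as one mapping from output name to source column. The helper below is a hypothetical sketch, not part of microsoftml, and it assumes that the str and list forms keep the original column names.

```python
def normalize_cols(cols):
    """Return a {new_name: source} mapping for the str, list, or dict form."""
    if isinstance(cols, str):
        # Assumption: a single name is transformed under the same name.
        return {cols: cols}
    if isinstance(cols, dict):
        # Keys are the new variable names; values name the source columns.
        return dict(cols)
    # Assumption: each listed name is transformed under the same name.
    return {name: name for name in cols}


as_str = normalize_cols("review")
as_list = normalize_cols(["review", "title"])
as_dict = normalize_cols(dict(reviewCat="review"))
```

The dict form is the one used in the example script on this page, where the new variable reviewCat is created from the review column.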

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
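As a rough guide, hashing into n bits gives 2**n possible slots. The sketch below uses a hypothetical helper, slot_count, to make the range check and the arithmetic explicit.

```python
def slot_count(hash_bits):
    """Number of distinct slots available for a given hash_bits value."""
    if not 1 <= hash_bits <= 30:
        raise ValueError("hash_bits must be between 1 and 30, inclusive")
    return 1 << hash_bits


default_slots = slot_count(16)  # 2**16 slots with the default setting
```

With the default of 16 bits this gives 65,536 slots. The sample output on this page reports "num vars: 65537", which would be consistent with 65,536 hashed slots plus one bias term, although that reading is an inference rather than something the log states.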

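seed

An integer specifying the hashing seed. The default value is 314489979.

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.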
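One way to picture the effect of ordered is to fold the term's position into the hashed key. The sketch below assumes this position-prefix scheme and uses zlib.crc32 as a stand-in hash. Both are illustrative choices, and hash_terms is a hypothetical helper.

```python
import zlib


def hash_terms(terms, ordered=True, hash_bits=16):
    """Return one slot index per term, optionally mixing in its position."""
    mask = (1 << hash_bits) - 1
    slots = []
    for position, term in enumerate(terms):
        # Assumption: the position is mixed in by prefixing it to the term.
        key = f"{position}:{term}" if ordered else term
        slots.append(zlib.crc32(key.encode("utf-8")) & mask)
    return slots


unordered = hash_terms(["hate", "it", "hate"], ordered=False)
with_position = hash_terms(["hate", "it", "hate"], ordered=True)
```

With ordered=False, the two occurrences of "hate" always land in the same slot. With ordered=True they are hashed from different keys, so they usually land in different slots.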

invert_hash

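An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.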
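Invert hashing can be thought of as keeping a reverse map from each slot back to the original values that hashed into it. The sketch below assumes the limit applies per slot and uses zlib.crc32 as a stand-in hash. build_slot_names is a hypothetical helper, not a microsoftml function.

```python
import zlib


def build_slot_names(values, invert_hash=0, hash_bits=16):
    """Return {slot: [original values]} subject to the invert_hash limit."""
    if invert_hash == 0:
        # No invert hashing: nothing is recorded, which is the cheapest option.
        return {}
    mask = (1 << hash_bits) - 1
    names = {}
    for value in values:
        slot = zlib.crc32(value.encode("utf-8")) & mask
        kept = names.setdefault(slot, [])
        # -1 means no limit; otherwise keep at most invert_hash keys per slot.
        if value not in kept and (invert_hash == -1 or len(kept) < invert_hash):
            kept.append(value)
    return names


disabled = build_slot_names(["I hate it", "I love it"], invert_hash=0)
unlimited = build_slot_names(["I hate it", "I love it", "I hate it"], invert_hash=-1)
```

Without such a map, coefficients can only be reported under generic slot names such as f1783, as in the sample output on this page.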

output_kind

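A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector that is the binary representation of the category.

The default value is "Bag".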
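The four output kinds can be compared on a tiny input. The sketch below uses an explicit three-entry dictionary instead of hashing so that the results stay readable. encode is a hypothetical helper, and the bit width and bit order chosen for "Bin" are assumptions.

```python
def encode(categories, dictionary, output_kind="Bag"):
    """Encode a vector of categories in one of the four output kinds."""
    index = {name: i for i, name in enumerate(dictionary)}
    ids = [index[c] for c in categories]
    n = len(dictionary)
    if output_kind == "Key":
        # 1-based integer ID of each category.
        return [i + 1 for i in ids]
    if output_kind == "Ind":
        # One indicator (one-hot) vector per input position.
        return [[1 if j == i else 0 for j in range(n)] for i in ids]
    if output_kind == "Bag":
        # One vector of occurrence counts for the whole input.
        bag = [0] * n
        for i in ids:
            bag[i] += 1
        return bag
    if output_kind == "Bin":
        # Assumption: binary digits of the zero-based index, high bit first.
        width = max(1, (n - 1).bit_length())
        return [[(i >> b) & 1 for b in reversed(range(width))] for i in ids]
    raise ValueError("unknown output_kind: " + repr(output_kind))


dictionary = ["cat", "dog", "fish"]
sample = ["dog", "cat", "dog"]
results = {kind: encode(sample, dictionary, kind) for kind in ("Bag", "Ind", "Key", "Bin")}
```

For this input, "Bag" gives [1, 2, 0], "Key" gives [2, 1, 2], "Ind" gives three one-hot vectors, and "Bin" gives one two-bit vector per position.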

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# The learned weights are similar to those obtained with the categorical transform.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761
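The Score and Probability columns can be reproduced from the printed coefficients. The sketch below assumes that "I hate it" activates only slot f1783 and that "This is great" activates no weighted slot. Both are inferences from the numbers, not something the output states. Probability is taken to be the logistic function of the score.

```python
import math

# Coefficients copied from the OrderedDict printed in the output above.
coef = {
    "(Bias)": 0.2132447361946106,
    "f1783": -0.7939924597740173,
    "f38537": 0.1968022584915161,
}


def score(active_slots):
    """Linear score: the bias plus the weights of the active slots."""
    return coef["(Bias)"] + sum(coef.get(slot, 0.0) for slot in active_slots)


def probability(s):
    """Logistic function applied to a linear score."""
    return 1.0 / (1.0 + math.exp(-s))


hate_score = score(["f1783"])  # assumed slot for "I hate it"
great_score = score([])        # assumed: only the bias contributes
hate_prob = probability(hate_score)
great_prob = probability(great_score)
```

Under these assumptions the values agree with the table to about six decimal places: roughly -0.580748 and 0.358761 for "I hate it", and 0.213245 and 0.553110 for "This is great".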