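We previously covered integrating Pandas with ChatGPT, which lets you operate on a DataFrame without knowing Pandas. Now Scikit-LLM has been open-sourced; it combines powerful language models such as ChatGPT with scikit-learn. The goal is not to automate scikit-learn, but to plug language models into it, so that text tasks can be handled through the familiar scikit-learn interface.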
Installation
pip install scikit-llm
Since the library works with OpenAI's models, you need an OpenAI API key. Import SKLLMConfig from the Scikit-LLM library and add your OpenAI key:
# Import SKLLMConfig to configure the OpenAI API (key and organization)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")
# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
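Hard-coding the key in a script makes it easy to leak. The following is a minimal sketch of reading it from environment variables instead. The variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` and both helper functions are assumptions made for illustration; they are not part of Scikit-LLM. Only `SKLLMConfig.set_openai_key` and `SKLLMConfig.set_openai_org` come from the library.

```python
import os
from typing import Mapping, Optional, Tuple


def load_openai_credentials(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return (api_key, organization) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The organization is optional, so an empty value becomes None.
    organization = env.get("OPENAI_ORG_ID", "").strip() or None
    return api_key, organization


def configure_skllm(env: Mapping[str, str] = os.environ) -> None:
    """Pass the credentials to Scikit-LLM (requires scikit-llm installed)."""
    # Imported lazily so the helper above works without scikit-llm.
    from skllm.config import SKLLMConfig

    api_key, organization = load_openai_credentials(env)
    SKLLMConfig.set_openai_key(api_key)
    if organization is not None:
        SKLLMConfig.set_openai_org(organization)
```

Calling `configure_skllm()` once at the start of a script would then replace the two hard-coded calls.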
ZeroShotGPTClassifier
With ChatGPT integrated, text can be classified without any dedicated training. ZeroShotGPTClassifier is as simple to use as any other scikit-learn classifier.
# Import ZeroShotGPTClassifier and a sample classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Load the demo classification dataset that ships with Scikit-LLM
X, y = get_classification_dataset()

# Define the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# Fit the model
clf.fit(X, y)

# Predict the labels
labels = clf.predict(X)
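Scikit-LLM post-processes the response to make sure it contains only a valid label. If the response lacks a label, it fills one in for you, choosing a label according to how often it appears in the training data.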
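To make that post-processing step concrete, here is a small sketch of one way such a fallback could work. It is not Scikit-LLM's actual implementation: the function name `coerce_to_valid_label`, the case-insensitive matching, and the reading of "according to frequency" as a weighted random draw are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Optional, Sequence


def coerce_to_valid_label(
    response: str,
    training_labels: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Return a label that is guaranteed to be one of the training labels."""
    if not training_labels:
        raise ValueError("training_labels must not be empty")
    rng = rng or random.Random()
    counts = Counter(training_labels)

    # Normalise the raw model response: trim whitespace and quotes.
    cleaned = response.strip().strip("'\"").lower()

    # Accept the response if it matches a known label (case-insensitive).
    for label in counts:
        if label.lower() == cleaned:
            return label

    # Otherwise fall back to a label drawn in proportion to its frequency.
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]
```

With training labels `["positive", "positive", "negative", "neutral"]`, a response such as `" Positive "` is mapped to `"positive"`, while an unusable response is replaced by a random label, with `"positive"` twice as likely as either of the others.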
If you have no labeled data of your own, you only need to provide the list of candidate labels. The code looks like this:
# Import ZeroShotGPTClassifier and a sample classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Load the demo dataset; only the texts are used, for prediction
X, _ = get_classification_dataset()

# Define the model
clf = ZeroShotGPTClassifier()

# There is no training data, so pass only the candidate labels
clf.fit(None, ['positive', 'negative', 'neutral'])

# Predict the labels
labels = clf.predict(X)
MultiLabelZeroShotGPTClassifier
Multi-label classification works similarly:
# Import the multi-label zero-shot classifier and a sample dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# Load the demo multi-label dataset that ships with Scikit-LLM
X, y = get_multilabel_classification_dataset()

# Define the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Fit the model
clf.fit(X, y)

# Make predictions
labels = clf.predict(X)
When creating an instance of MultiLabelZeroShotGPTClassifier, specify the maximum number of labels to assign to each sample (here, max_labels=3).
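The effect of a cap like max_labels can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the library's code: `cap_labels` is a hypothetical helper, and the choice to drop unknown labels and duplicates before truncating is an assumption.

```python
from typing import Iterable, List, Sequence


def cap_labels(
    predicted: Iterable[str],
    candidate_labels: Sequence[str],
    max_labels: int = 3,
) -> List[str]:
    """Keep at most `max_labels` distinct, valid labels in their original order."""
    if max_labels < 1:
        raise ValueError("max_labels must be at least 1")
    allowed = set(candidate_labels)
    kept: List[str] = []
    for label in predicted:
        # Skip labels outside the candidate set and repeated labels.
        if label in allowed and label not in kept:
            kept.append(label)
        # Stop as soon as the cap is reached.
        if len(kept) == max_labels:
            break
    return kept
```

For instance, a raw prediction containing a repeated label and a label outside the candidate set is reduced to at most three distinct valid labels, kept in the order the model returned them.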
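What if the data has no labels? You can set up the classifier without labeled data by providing a list of candidate labels. The type of y should be List[List[str]]. Here is an example without labeled data: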
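# Import the multi-label zero-shot classifier and a sample dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# Load the demo dataset; only the texts are used, for prediction
X, _ = get_multilabel_classification_dataset()

# Define all the labels that can be predicted
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety",
]

# Create the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Fit on the labels only; y is a List[List[str]]
clf.fit(None, [candidate_labels])

# Predict the labels
labels = clf.predict(X)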
Text vectorization
Text vectorization is the process of converting text into numbers. The GPTVectorizer module in Scikit-LLM converts a piece of text, however long, into a fixed-size vector.
# Import the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skllm.preprocessing import GPTVectorizer
from xgboost import XGBClassifier

# X_train, X_test, y_train and y_test are assumed to exist already,
# for example from a train/test split of your own text data

# Create an instance of LabelEncoder
le = LabelEncoder()

# Encode the training labels 'y_train'
y_train_encoded = le.fit_transform(y_train)

# Encode the test labels 'y_test' with the same mapping
y_test_encoded = le.transform(y_test)

# Define the pipeline steps as a list of (name, estimator) tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Create a pipeline with the defined steps
clf = Pipeline(steps)

# Fit the pipeline on 'X_train' and the encoded training labels
clf.fit(X_train, y_train_encoded)

# Predict the labels for 'X_test' with the trained pipeline
yh = clf.predict(X_test)
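The key property a downstream classifier relies on is that every text, short or long, becomes a vector of the same length. The sketch below shows that contract with a toy hashing vectorizer. It is a stand-in only: GPTVectorizer obtains its vectors from an OpenAI model, whereas `toy_vectorize` and its `dim` parameter are hypothetical and use nothing but token hashing.

```python
import math
import re
import zlib
from typing import List


def toy_vectorize(text: str, dim: int = 8) -> List[float]:
    """Map text of any length to an L2-normalised vector of length `dim`."""
    vector = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % dim
        vector[index] += 1.0
    # Normalise so that long and short texts are on the same scale.
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector
```

A two-word review and a review of a thousand words both come out as eight numbers, which is what lets a step such as XGBClassifier sit directly after the vectorizer in a pipeline.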
Text summarization
GPT is very good at summarizing text. Scikit-LLM provides a module called GPTSummarizer for this.
# Import the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer
# Import the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Load the demo summarization dataset
X = get_summarization_dataset()

# Create an instance of GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# fit_transform fits the summarizer to the input data 'X' and
# generates the summaries, which are assigned to 'summaries'
summaries = s.fit_transform(X)
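Note that the max_words hyperparameter is a flexible limit on the number of words in the generated summary. It sets a rough target for the summary length, but the summarizer may occasionally produce a slightly longer summary depending on the context and content of the input text.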
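If a downstream consumer needs a hard limit, the summaries can be trimmed afterwards. The following is a minimal sketch of such a post-processing step; `count_words` and `truncate_to_max_words` are hypothetical helpers rather than part of Scikit-LLM, and splitting on whitespace is an assumption about what counts as a word.

```python
from typing import List, Sequence


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def truncate_to_max_words(summaries: Sequence[str], max_words: int) -> List[str]:
    """Cut each summary to at most `max_words` whitespace-separated words."""
    truncated = []
    for summary in summaries:
        words = summary.split()
        if len(words) > max_words:
            # Mark the cut with an ellipsis attached to the last kept word.
            summary = " ".join(words[:max_words]) + "..."
        truncated.append(summary)
    return truncated
```

Summaries that already fit are returned unchanged, so the step can safely be applied to the output of `fit_transform` above.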
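Summary
ChatGPT's popularity has driven further progress in general-purpose models, and that progress is changing how we work day to day. Scikit-LLM brings LLMs into the scikit-learn workflow. If you are interested, here is the source: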
https://avoid.overfit.cn/post/9ba131a01d374926b6b7efff97f61c45
Author: Fareed Khan