COUNT_VECTORIZER
Receive a collection of text documents (as a matrix, vector, or dataframe) and convert it into a matrix of token counts.
Params:
default : DataFrame|Matrix|Vector
The corpus to vectorize.
Returns:
tokens : DataFrame
Holds all the unique tokens observed from the input.
word_count_vector : Vector
Contains the occurrences of these tokens in each sentence.
Python Code
from typing import TypedDict
from sklearn.feature_extraction.text import CountVectorizer
from flojoy import flojoy, DataFrame, Matrix, Vector
import pandas as pd


class CountVectorizerOutput(TypedDict):
    tokens: DataFrame
    word_count_vector: Vector


@flojoy(deps={"scikit-learn": "1.2.2"})
def COUNT_VECTORIZER(default: DataFrame | Matrix | Vector) -> CountVectorizerOutput:
    """Receive a collection of text documents (as a matrix, vector, or dataframe) and convert it into a matrix of token counts.

    Parameters
    ----------
    default : DataFrame|Matrix|Vector
        The corpus to vectorize.

    Returns
    -------
    tokens: DataFrame
        Holds all the unique tokens observed from the input.
    word_count_vector: Vector
        Contains the occurrences of these tokens in each sentence.
    """

    # Pull the raw array of documents out of whichever container was received.
    if isinstance(default, DataFrame):
        data = default.m.values
    elif isinstance(default, Vector):
        data = default.v
    else:
        data = default.m

    vectorizer = CountVectorizer()
    # Flatten so that every cell of the input is treated as one document.
    X = vectorizer.fit_transform(data.flatten())

    x = pd.DataFrame({"tokens": vectorizer.get_feature_names_out()})
    y = X.toarray()  # type: ignore  # dense array: one row per document, one column per token

    return CountVectorizerOutput(tokens=DataFrame(df=x), word_count_vector=Vector(v=y))
Example
In this example, the READ_CSV node loads a local file. The COUNT_VECTORIZER node then transforms the received dataframe of text into a matrix of token (word) counts. It returns a DataFrame that contains the unique words and the token counts for each sentence (the word_count_vector output).