COUNT_VECTORIZER

Receive a collection of text documents (as a matrix, vector, or dataframe) and convert it into a matrix of token counts.

Params:
default : DataFrame|Matrix|Vector
    The corpus to vectorize.

Returns:
tokens : DataFrame
    Holds all the unique tokens observed in the input.
word_count_vector : Vector
    Contains the occurrences of these tokens in each sentence.

Python Code

from typing import TypedDict
from sklearn.feature_extraction.text import CountVectorizer
from flojoy import flojoy, DataFrame, Matrix, Vector
import pandas as pd


class CountVectorizerOutput(TypedDict):
    tokens: DataFrame
    word_count_vector: Vector


@flojoy(deps={"scikit-learn": "1.2.2"})
def COUNT_VECTORIZER(default: DataFrame | Matrix | Vector) -> CountVectorizerOutput:
    """Receive a collection of text documents (as a matrix, vector, or dataframe) and convert it into a matrix of token counts.

    Parameters
    ----------
    default : DataFrame|Matrix|Vector
        The corpus to vectorize.

    Returns
    -------
    tokens: DataFrame
        Holds all the unique tokens observed from the input.
    word_count_vector: Vector
        Contains the occurrences of these tokens in each sentence.
    """

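    # Pull the raw array out of whichever container was received.
    if isinstance(default, DataFrame):
        data = default.m.values
    elif isinstance(default, Vector):
        data = default.v
    else:
        data = default.m

    # Flatten so that every cell is treated as one document, then count tokens.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data.flatten())

    # Unique tokens, in the same order as the columns of the count array.
    x = pd.DataFrame({"tokens": vectorizer.get_feature_names_out()})
    # Dense counts: one row per document, one column per token.
    y = X.toarray()  # type: ignore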

    return CountVectorizerOutput(tokens=DataFrame(df=x), word_count_vector=Vector(v=y))


Example


In this example, the READ_CSV node loads a local file. The COUNT_VECTORIZER node then transforms the received dataframe of text into a matrix of token counts. It returns two outputs: tokens, a DataFrame containing the unique tokens, and word_count_vector, which contains the count of each token in each sentence.
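
To make the two outputs concrete, here is a minimal standard-library sketch of the counting step. It is not the scikit-learn implementation. It only approximates what are assumed to be CountVectorizer's defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The helper name `count_tokens` and the sample corpus are hypothetical.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default behaviour of keeping only
# tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_tokens(corpus):
    """Return (tokens, counts) for a list of text documents.

    tokens is the sorted list of unique tokens; counts has one row per
    document and one column per token, in the same order as tokens.
    """
    # Lowercase each document and split it into tokens.
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]

    # Build the vocabulary: every unique token, sorted alphabetically.
    tokens = sorted({tok for doc in tokenized for tok in doc})
    column = {tok: i for i, tok in enumerate(tokens)}

    # Count how often each vocabulary token occurs in each document.
    counts = []
    for doc in tokenized:
        row = [0] * len(tokens)
        for tok, n in Counter(doc).items():
            row[column[tok]] = n
        counts.append(row)
    return tokens, counts


# Hypothetical two-sentence corpus.
corpus = ["The cat sat on the mat", "The dog chased the cat"]
tokens, counts = count_tokens(corpus)

print(tokens)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(counts)  # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
```

Here `tokens` plays the role of the tokens output and `counts` the role of word_count_vector: each row corresponds to one input sentence, and each column to one entry of `tokens`.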