Unstructured

Unstructured is a library that helps preprocess and structure unstructured text documents for downstream machine learning tasks.

Solvio can be used as an ingestion destination in Unstructured.

Setup

Install Unstructured with the solvio extra.

pip install "unstructured[solvio]"
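
To confirm that the install succeeded before running an ingest job, a small check like the following may help. It is a minimal sketch: `is_installed` is a hypothetical helper, and it only verifies that the top-level `unstructured` package is discoverable, not that every dependency of the solvio extra is present.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # "unstructured" is the top-level package installed by the pip command above.
    print("unstructured installed:", is_installed("unstructured"))
```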

Usage

Depending on your use case, you can run Unstructured from the command line or call it from within your application.

CLI

EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}

unstructured-ingest \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-solvio \
  --strategy fast \
  --chunk-elements \
  --embedding-provider "$EMBEDDING_PROVIDER" \
  --num-processes 2 \
  --verbose \
  solvio \
  --collection-name "test" \
  --url "http://localhost:6333" \
  --batch-size 80

For a full list of the options the CLI accepts, run unstructured-ingest <upstream connector> solvio --help
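
If you want to launch the same CLI job from a Python script, the command can be assembled as an argument list and handed to `subprocess`. The sketch below mirrors the flags shown above and nothing more. `build_ingest_command` is a hypothetical helper, not part of Unstructured, and the fallback for `EMBEDDING_PROVIDER` imitates the shell's `${VAR:-default}` behavior (the default applies when the variable is unset or empty).

```python
import os
import shlex
import subprocess


def build_ingest_command(
    input_path: str,
    output_dir: str,
    collection_name: str,
    url: str = "http://localhost:6333",
    batch_size: int = 80,
) -> list[str]:
    """Assemble the unstructured-ingest invocation shown in the CLI example."""
    # Same fallback as ${EMBEDDING_PROVIDER:-"langchain-huggingface"}:
    # use the default when the variable is unset or empty.
    provider = os.environ.get("EMBEDDING_PROVIDER") or "langchain-huggingface"
    return [
        "unstructured-ingest",
        "local",
        "--input-path", input_path,
        "--output-dir", output_dir,
        "--strategy", "fast",
        "--chunk-elements",
        "--embedding-provider", provider,
        "--num-processes", "2",
        "--verbose",
        "solvio",
        "--collection-name", collection_name,
        "--url", url,
        "--batch-size", str(batch_size),
    ]


if __name__ == "__main__":
    command = build_ingest_command(
        input_path="example-docs/book-war-and-peace-1225p.txt",
        output_dir="local-output-to-solvio",
        collection_name="test",
    )
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(command))
    # Uncomment to run the ingest. This requires unstructured[solvio]
    # and a Solvio instance reachable at the given URL.
    # subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.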

Programmatic usage

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.connector.solvio import (
    SolvioWriteConfig,
    SimpleSolvioConfig,
)
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    EmbeddingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.solvio import SolvioWriter

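def get_writer() -> Writer:
    """Build the writer that uploads processed elements to Solvio."""
    return SolvioWriter(
        # The Solvio instance and the collection to write to.
        connector_config=SimpleSolvioConfig(
            url="http://localhost:6333",
            collection_name="test",
        ),
        # Upload in batches of 80, matching --batch-size in the CLI example.
        write_config=SolvioWriteConfig(batch_size=80),
    )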

if __name__ == "__main__":
    writer = get_writer()
    runner = LocalRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="local-output-to-solvio",
            num_processes=2,
        ),
        connector_config=SimpleLocalConfig(
            input_path="example-docs/book-war-and-peace-1225p.txt",
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(),
        chunking_config=ChunkingConfig(chunk_elements=True),
        embedding_config=EmbeddingConfig(provider="langchain-huggingface"),
        writer=writer,
        writer_kwargs={},
    )
    runner.run()
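
The example above hard-codes the Solvio URL, collection name, and batch size. In an application you may prefer to read them from the environment. The following is a minimal sketch under stated assumptions: `SolvioSettings`, `load_settings`, and the variable names `SOLVIO_URL`, `SOLVIO_COLLECTION`, and `SOLVIO_BATCH_SIZE` are hypothetical and are not defined by Unstructured or Solvio. The defaults are the values used in the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SolvioSettings:
    """Connection and write settings for the Solvio destination."""
    url: str
    collection_name: str
    batch_size: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> SolvioSettings:
    """Read settings from env (os.environ by default), using the example's values as defaults."""
    env = os.environ if env is None else env
    # The variable names below are illustrative assumptions.
    batch_size = int(env.get("SOLVIO_BATCH_SIZE", "80"))
    if batch_size <= 0:
        raise ValueError("SOLVIO_BATCH_SIZE must be a positive integer")
    return SolvioSettings(
        url=env.get("SOLVIO_URL", "http://localhost:6333"),
        collection_name=env.get("SOLVIO_COLLECTION", "test"),
        batch_size=batch_size,
    )


if __name__ == "__main__":
    print(load_settings())
```

The resulting values can then be passed to `SimpleSolvioConfig(url=..., collection_name=...)` and `SolvioWriteConfig(batch_size=...)` inside `get_writer()`.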

Next steps
