Encamina.Enmarcha.SemanticKernel.Connectors.Document
8.1.3-preview-05
A newer version of this package is available. See the version list below for details.
dotnet add package Encamina.Enmarcha.SemanticKernel.Connectors.Document --version 8.1.3-preview-05
NuGet\Install-Package Encamina.Enmarcha.SemanticKernel.Connectors.Document -Version 8.1.3-preview-05
<PackageReference Include="Encamina.Enmarcha.SemanticKernel.Connectors.Document" Version="8.1.3-preview-05" />
paket add Encamina.Enmarcha.SemanticKernel.Connectors.Document --version 8.1.3-preview-05
#r "nuget: Encamina.Enmarcha.SemanticKernel.Connectors.Document, 8.1.3-preview-05"
// Install Encamina.Enmarcha.SemanticKernel.Connectors.Document as a Cake Addin
#addin nuget:?package=Encamina.Enmarcha.SemanticKernel.Connectors.Document&version=8.1.3-preview-05&prerelease
// Install Encamina.Enmarcha.SemanticKernel.Connectors.Document as a Cake Tool
#tool nuget:?package=Encamina.Enmarcha.SemanticKernel.Connectors.Document&version=8.1.3-preview-05&prerelease
Semantic Kernel - Document Connectors
Document Connectors specializes in reading information from files in various formats and subsequently chunking it. The most typical use case, within the context of generating document embeddings, is reading information from a variety of file formats (pdf, docx, pptx, etc.) and chunking their content into smaller parts.
Setup
NuGet package
First, install NuGet. Then, install Encamina.Enmarcha.SemanticKernel.Connectors.Document from the package manager console:
PM> Install-Package Encamina.Enmarcha.SemanticKernel.Connectors.Document
.NET CLI:
First, install .NET CLI. Then, install Encamina.Enmarcha.SemanticKernel.Connectors.Document from the .NET CLI:
dotnet add package Encamina.Enmarcha.SemanticKernel.Connectors.Document
How to use
Starting from a Program.cs or a similar entry point file in your project, add the following code:
// Entry point
var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
    // ...
});

// The service collection exposed by the builder.
var services = builder.Services;

// ...

services.AddDefaultDocumentContentExtractor();
This extension method will add the default implementation of the IDocumentContentExtractor interface as a singleton. The default implementation is DefaultDocumentContentExtractor. With this, we can resolve the IDocumentContentExtractor interface and obtain the chunks of a file:
Constructor injection
public class MyClass
{
    private readonly IDocumentContentExtractor documentContentExtractor;

    public MyClass(IDocumentContentExtractor documentContentExtractor)
    {
        this.documentContentExtractor = documentContentExtractor;
    }

    public IEnumerable<string> GetPdfChunks()
    {
        using var file = File.OpenRead("example.pdf");
        var pdfChunks = documentContentExtractor.GetDocumentContent(file, ".pdf");
        return pdfChunks;
    }
}
Service Provider
var serviceProvider = services.BuildServiceProvider();
var documentContentExtractor = serviceProvider.GetRequiredService<IDocumentContentExtractor>();
using var file = File.OpenRead("example.pdf");
var fileChunks = documentContentExtractor.GetDocumentContent(file, ".pdf");
For the above code to be fully functional, it is necessary to configure some additional services, specifically the ITextSplitter interface and a function to calculate the length of each chunk.
Based on the file extension, the previous code looks for a suitable IDocumentConnector for the file type, processes the file to extract its text, and finally uses an ITextSplitter to split the text into chunks.
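The three steps described above (pick a reader by extension, extract the text, split it into chunks) can be sketched without the package. This is only an illustration under stated assumptions: ReadPlainText, GetTextExtractor and SplitIntoChunks are hypothetical stand-ins, not the library's IDocumentConnector or ITextSplitter implementations, and the word-based splitting is a deliberately naive choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a connector: reads the whole stream as UTF-8 text.
string ReadPlainText(Stream stream)
{
    using var reader = new StreamReader(stream, Encoding.UTF8);
    return reader.ReadToEnd();
}

// Step 1: choose a text extractor from the (case-insensitive) file extension.
Func<Stream, string> GetTextExtractor(string fileExtension)
{
    switch (fileExtension.ToUpperInvariant())
    {
        case ".TXT":
        case ".MD":
            return ReadPlainText;
        default:
            throw new NotSupportedException(fileExtension);
    }
}

// Step 3: naive splitter (an assumption, not the library's algorithm).
// Words are accumulated until adding one more would exceed chunkSize,
// where the size is measured by the supplied length function.
IEnumerable<string> SplitIntoChunks(string text, int chunkSize, Func<string, int> lengthFunction)
{
    var current = new StringBuilder();
    var words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

    foreach (var word in words)
    {
        var candidate = current.Length == 0 ? word : $"{current} {word}";

        if (current.Length > 0 && lengthFunction(candidate) > chunkSize)
        {
            yield return current.ToString();
            current.Clear();
            current.Append(word);
        }
        else
        {
            current.Clear();
            current.Append(candidate);
        }
    }

    if (current.Length > 0)
    {
        yield return current.ToString();
    }
}

using var input = new MemoryStream(Encoding.UTF8.GetBytes("one two three four five"));

// Step 2: extract the text with the extractor chosen for the extension.
var extractedText = GetTextExtractor(".txt")(input);
var chunks = SplitIntoChunks(extractedText, 9, s => s.Length).ToList();

Console.WriteLine(string.Join(" | ", chunks)); // prints "one two | three | four five"
```

Passing the length function as a parameter mirrors the idea in the text that chunk size is measured by a configurable function; here it simply counts characters.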
Details about the IDocumentConnector
The default implementation, DefaultDocumentContentExtractor, uses the following IDocumentConnector implementations:
- WordDocumentConnector: For .docx files, it extracts the text from the file by adding each paragraph on a new line.
- CleanPdfDocumentConnector: For .pdf files, it extracts the raw text from the file (with all words separated by spaces) and removes common words, typically headers or footers that appear in at least 25% of the document.
- ParagraphPptxDocumentConnector: For .pptx files, it extracts the text from the file, with one line per paragraph found in each slide.
- TxtDocumentConnector: For .txt files, it extracts the raw text from the file using UTF-8 as the character encoding.
- TxtDocumentConnector: For .md files, it extracts the raw text from the file using UTF-8 as the character encoding.
- VttDocumentConnector: For .vtt files, it extracts the text from the subtitles while removing the timestamp marks. It uses UTF-8 as the character encoding.
For other formats, it throws a NotSupportedException.
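The header and footer clean-up mentioned for CleanPdfDocumentConnector can be pictured with a small sketch. It is not the connector's actual algorithm: it assumes the text is already available as lines grouped by page, treats "appears in at least 25% of the document" as "appears on at least 25% of the pages", and uses a hypothetical helper named RemoveRepeatedLines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: drops every line that occurs on more than one page
// and on at least `threshold` (a fraction) of all pages.
List<string> RemoveRepeatedLines(IReadOnlyList<string[]> pages, double threshold = 0.25)
{
    // Count on how many pages each distinct, non-empty line occurs.
    var pagesPerLine = new Dictionary<string, int>();
    foreach (var page in pages)
    {
        foreach (var line in page.Select(l => l.Trim()).Where(l => l.Length > 0).Distinct())
        {
            pagesPerLine.TryGetValue(line, out var count);
            pagesPerLine[line] = count + 1;
        }
    }

    var repeated = pagesPerLine
        .Where(entry => entry.Value > 1 && entry.Value >= threshold * pages.Count)
        .Select(entry => entry.Key)
        .ToHashSet();

    // Rebuild each page without the repeated lines.
    return pages
        .Select(p => string.Join("\n", p.Where(l => !repeated.Contains(l.Trim()))))
        .ToList();
}

var documentPages = new List<string[]>
{
    new[] { "ACME Corp - Confidential", "Introduction" },
    new[] { "ACME Corp - Confidential", "Installation" },
    new[] { "ACME Corp - Confidential", "Usage" },
    new[] { "ACME Corp - Confidential", "License" },
};

var cleaned = RemoveRepeatedLines(documentPages);

Console.WriteLine(string.Join(" | ", cleaned)); // prints "Introduction | Installation | Usage | License"
```

The extra "more than one page" guard is a design choice of this sketch: without it, every line of a short document would meet the 25% threshold and be removed.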
Other available IDocumentConnector implementations
- SlidePptxDocumentConnector: For .pptx files, it extracts the text from the file with just one line for each slide found.
- PdfDocumentConnector: For .pdf files, it extracts the raw text from the file for each page (all words separated by spaces) and adds a line break between the text of each page.
- PdfWithTocDocumentConnector: For .pdf files, it retrieves the Table of Contents and generates, for each Table of Contents item, a text with the section title, a colon mark (:), and the content text of the section (e.g. Title1: Content of the Title1 section). It adds a line break between each section. The output format of the text is configurable with the TocItemFormat property. Additionally, it removes common words, typically headers or footers that appear in at least 25% of the document.
- StrictFormatCleanPdfDocumentConnector: For .pdf files, it extracts the text from the file and attempts to preserve the document's formatting, including paragraphs, titles, and other structural elements. Additionally, it removes common words, typically headers or footers that appear in at least 25% of the document, and it excludes non-horizontal text. An effort is made to retain the document's format during extraction; however, this process relies on OCR recognition, which is not perfect, so the results may vary depending on the quality of the PDF.
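To make the "Title1: Content of the Title1 section" output of PdfWithTocDocumentConnector concrete, here is a minimal sketch of how such text could be assembled. The composite format string "{0}: {1}" and the tuple list are assumptions for illustration only; whether TocItemFormat uses the same placeholder style is not confirmed by this page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical format: {0} is the section title, {1} is the section content.
const string itemFormat = "{0}: {1}";

// Stand-in for the sections a Table of Contents would yield.
var sections = new List<(string Title, string Content)>
{
    ("Title1", "Content of the Title1 section"),
    ("Title2", "Content of the Title2 section"),
};

// One formatted item per section, separated by a line break.
var tocText = string.Join(
    Environment.NewLine,
    sections.Select(s => string.Format(itemFormat, s.Title, s.Content)));

Console.WriteLine(tocText);
```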
Use your own IDocumentConnector
To use your own IDocumentConnector implementations, you can use the base class DocumentContentExtractorBase and override the GetDocumentConnector method. This way, you can return your own IDocumentConnector to handle a specific file format based on the file extension.
public class MyCustomDocumentContentExtractor : DocumentContentExtractorBase
{
    public MyCustomDocumentContentExtractor(ITextSplitter textSplitter, Func<string, int> lengthFunction)
        : base(textSplitter, lengthFunction)
    {
    }

    protected override IDocumentConnector GetDocumentConnector(string fileExtension)
    {
        // The extension is normalized to upper case, so the patterns must be upper case too.
        return fileExtension.ToUpperInvariant() switch
        {
            @".RTF" => new MyCustomRtfDocumentConnector(),
            @".PDF" => new PdfWithTocDocumentConnector(),
            @".TXT" => new TxtDocumentConnector(Encoding.UTF8),
            _ => throw new NotSupportedException(fileExtension),
        };
    }
}
Don't forget to register it.
// Entry point
var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
    // ...
});

// The service collection exposed by the builder.
var services = builder.Services;

// ...

// Now we use our own implementation
// services.AddDefaultDocumentContentExtractor();
services.AddSingleton<IDocumentContentExtractor, MyCustomDocumentContentExtractor>();
With this, you will be able to use the extractor you need for each type of file.
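Because the connector is chosen from the file extension, the casing of that extension matters when matching. The sketch below isolates the dispatch step: ResolveConnectorName is a hypothetical helper that returns the connector's name as a string instead of an instance, so it runs without the package. It normalizes the extension to upper case and therefore compares against upper-case patterns.

```csharp
using System;
using System.IO;

// Hypothetical helper mirroring the dispatch in GetDocumentConnector,
// returning names instead of connector instances.
string ResolveConnectorName(string fileExtension)
{
    switch (fileExtension.ToUpperInvariant())
    {
        case ".RTF": return "MyCustomRtfDocumentConnector";
        case ".PDF": return "PdfWithTocDocumentConnector";
        case ".TXT": return "TxtDocumentConnector";
        default: throw new NotSupportedException(fileExtension);
    }
}

// Path.GetExtension keeps the original casing and the leading dot: ".Pdf".
var extension = Path.GetExtension("Report.Pdf");

Console.WriteLine(ResolveConnectorName(extension)); // prints "PdfWithTocDocumentConnector"
```

Lower-case patterns such as ".pdf" would never match a value produced by ToUpperInvariant, and every call would fall through to the NotSupportedException branch.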
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
- net8.0
  - CommunityToolkit.Diagnostics (>= 8.2.2)
  - Encamina.Enmarcha.AI.Abstractions (>= 8.1.3-preview-05)
  - Encamina.Enmarcha.Core (>= 8.1.3-preview-05)
  - Microsoft.SemanticKernel.Plugins.Document (>= 1.3.0-alpha)
  - PdfPig (>= 0.1.8)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated |
---|---|---|
8.2.0 | 336 | 10/22/2024 |
8.2.0-preview-01-m01 | 94 | 9/17/2024 |
8.1.9-preview-03 | 222 | 11/19/2024 |
8.1.9-preview-02 | 69 | 10/22/2024 |
8.1.9-preview-01 | 206 | 10/4/2024 |
8.1.8 | 166 | 9/23/2024 |
8.1.8-preview-07 | 286 | 9/12/2024 |
8.1.8-preview-06 | 144 | 9/11/2024 |
8.1.8-preview-05 | 143 | 9/10/2024 |
8.1.8-preview-04 | 215 | 8/16/2024 |
8.1.8-preview-03 | 135 | 8/13/2024 |
8.1.8-preview-02 | 92 | 8/13/2024 |
8.1.8-preview-01 | 102 | 8/12/2024 |
8.1.7 | 106 | 8/7/2024 |
8.1.7-preview-09 | 130 | 7/3/2024 |
8.1.7-preview-08 | 109 | 7/2/2024 |
8.1.7-preview-07 | 88 | 6/10/2024 |
8.1.7-preview-06 | 91 | 6/10/2024 |
8.1.7-preview-05 | 112 | 6/6/2024 |
8.1.7-preview-04 | 92 | 6/6/2024 |
8.1.7-preview-03 | 100 | 5/24/2024 |
8.1.7-preview-02 | 109 | 5/10/2024 |
8.1.7-preview-01 | 100 | 5/8/2024 |
8.1.6 | 135 | 5/7/2024 |
8.1.6-preview-08 | 58 | 5/2/2024 |
8.1.6-preview-07 | 101 | 4/29/2024 |
8.1.6-preview-06 | 279 | 4/26/2024 |
8.1.6-preview-05 | 101 | 4/24/2024 |
8.1.6-preview-04 | 108 | 4/22/2024 |
8.1.6-preview-03 | 98 | 4/22/2024 |
8.1.6-preview-02 | 129 | 4/17/2024 |
8.1.6-preview-01 | 107 | 4/15/2024 |
8.1.5 | 111 | 4/15/2024 |
8.1.5-preview-15 | 90 | 4/10/2024 |
8.1.5-preview-14 | 145 | 3/20/2024 |
8.1.5-preview-13 | 87 | 3/18/2024 |
8.1.5-preview-12 | 120 | 3/13/2024 |
8.1.5-preview-11 | 93 | 3/13/2024 |
8.1.5-preview-10 | 125 | 3/13/2024 |
8.1.5-preview-09 | 99 | 3/12/2024 |
8.1.5-preview-08 | 82 | 3/12/2024 |
8.1.5-preview-07 | 100 | 3/8/2024 |
8.1.5-preview-06 | 205 | 3/8/2024 |
8.1.5-preview-05 | 94 | 3/7/2024 |
8.1.5-preview-04 | 94 | 3/7/2024 |
8.1.5-preview-03 | 86 | 3/7/2024 |
8.1.5-preview-02 | 148 | 2/28/2024 |
8.1.5-preview-01 | 146 | 2/19/2024 |
8.1.4 | 193 | 2/15/2024 |
8.1.3 | 143 | 2/13/2024 |
8.1.3-preview-07 | 79 | 2/13/2024 |
8.1.3-preview-06 | 108 | 2/12/2024 |
8.1.3-preview-05 | 95 | 2/9/2024 |
8.1.3-preview-04 | 102 | 2/8/2024 |
8.1.3-preview-03 | 125 | 2/7/2024 |
8.1.3-preview-02 | 82 | 2/2/2024 |
8.1.3-preview-01 | 90 | 2/2/2024 |
8.1.2 | 164 | 2/1/2024 |
8.1.2-preview-9 | 98 | 1/22/2024 |
8.1.2-preview-8 | 93 | 1/19/2024 |
8.1.2-preview-7 | 89 | 1/19/2024 |
8.1.2-preview-6 | 92 | 1/19/2024 |
8.1.2-preview-5 | 96 | 1/19/2024 |
8.1.2-preview-4 | 109 | 1/19/2024 |
8.1.2-preview-3 | 92 | 1/18/2024 |
8.1.2-preview-2 | 82 | 1/18/2024 |
8.1.2-preview-16 | 75 | 1/31/2024 |
8.1.2-preview-15 | 87 | 1/31/2024 |
8.1.2-preview-14 | 192 | 1/25/2024 |
8.1.2-preview-13 | 93 | 1/25/2024 |
8.1.2-preview-12 | 99 | 1/23/2024 |
8.1.2-preview-11 | 86 | 1/23/2024 |
8.1.2-preview-10 | 78 | 1/22/2024 |
8.1.2-preview-1 | 100 | 1/18/2024 |
8.1.1 | 131 | 1/18/2024 |
8.1.0 | 115 | 1/18/2024 |
8.0.3 | 177 | 12/29/2023 |
8.0.1 | 159 | 12/14/2023 |
8.0.0 | 158 | 12/7/2023 |
6.0.4.3 | 162 | 12/29/2023 |
6.0.4.2 | 167 | 12/20/2023 |
6.0.4.1 | 106 | 12/19/2023 |
6.0.4 | 172 | 12/4/2023 |
6.0.3.20 | 124 | 11/27/2023 |
6.0.3.19 | 135 | 11/22/2023 |