As part of integration testing I needed to extract text from PDFs, and all existing solutions were either too cumbersome or had a weird API.
PDF Extract works by running an external executable (Win64 only!), but it is fully self-contained and exposes only streams to the outside world.
Internally it uses Xpdf.
To extract text, use the provided Extractor class (here from a file):
using (var pdfStream = File.OpenRead("my.pdf"))
using (var extractor = new Extractor())
{
    var extractedText = extractor.ExtractToString(pdfStream);
}
Or extract from a stream to a stream:
// pdfStream is any readable Stream containing the PDF
using (var extractor = new Extractor())
{
    using (var rawTextStream = extractor.ExtractText(pdfStream))
    {
        // ...
    }
}
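Once you have the raw text stream, you still need to decode it. The following is a minimal sketch of one way to do that, not part of PDF Extract itself: `ExtractedTextReader` and `ReadLines` are hypothetical names, and UTF-8 is an assumed default, so pass the encoding the extractor actually emits if it differs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the pdf-extract package.
public static class ExtractedTextReader
{
    // Reads all lines from a text stream.
    // Assumption: the stream is UTF-8 unless another encoding is supplied.
    public static string[] ReadLines(Stream textStream, Encoding encoding = null)
    {
        if (textStream == null) throw new ArgumentNullException("textStream");

        var lines = new List<string>();

        // leaveOpen: true keeps the stream alive so the caller's own
        // using block stays responsible for disposing it.
        using (var reader = new StreamReader(
            textStream, encoding ?? Encoding.UTF8, true, 4096, leaveOpen: true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }

        return lines.ToArray();
    }
}
```

Inside the inner using block you would call `ExtractedTextReader.ReadLines(rawTextStream)` and assert on the returned lines in your integration test. Because the reader leaves the stream open, disposal order stays exactly as in the example above.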
To install, add the NuGet package:
PM> Install-Package pdf-extract
You'll need .NET Framework 4.5.1 or later on 64-bit Windows to use the precompiled binaries.
PDF Extract is licensed under the GNU General Public License (GPL), version 2 or 3, the same as Xpdf.