Text.analyze_file([MY_PATH]).strings returns array of characters #11

sciprog · 2013-06-18T17:57:14Z

I'm testing the contents of a PDF generated by PDFKit. When I run Text.analyze_file([MY_PATH]).strings on the file, I get an array that holds each character of the PDF content in its own index. Spaces are stored as '' (empty string). I've been able to move forward by replacing all empty strings with a space character. However, I'm now up against content that contains newline characters. The newlines are not stored in the array, so the separation between the words on either side of a newline is lost. Ever see this sort of behaviour? I realize that there are a number of factors which could be screwing things up, including my own ignorance, and I'd love to find the root of the problem, but I have no time. Right now, I'd be happy with a hack to get my tests working.
Cheers!

EDIT: So I came up with a hack that'll get me through. I remove all the whitespace characters from the array (they weren't actually empty strings, as I had believed), then join the remaining characters with exactly one space and downcase the whole thing.

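# Drop whitespace entries, join what's left with single spaces, and downcase.
# Note: delete_if modifies the array that is passed in.
def char_array_to_normalized_string(arr)
  arr.delete_if { |s| s =~ /\s/ }.join(' ').downcase
end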

After I put my test strings through the same process, by calling char_array_to_normalized_string("Test String".scan(/./)), I'm able to match them against the output of PDF::Inspector. It's not pretty, but it gets me where I need to go.
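
For illustration, here is roughly what that comparison looks like end to end. This is only a sketch: `pdf_chars` is a made-up stand-in for what `.strings` returned (one character per element, with nothing stored where the newline was), not real output from a PDF, and the helper is repeated so the snippet runs on its own.

```ruby
# Same helper as above, repeated so this snippet is self-contained.
def char_array_to_normalized_string(arr)
  arr.delete_if { |s| s =~ /\s/ }.join(' ').downcase
end

# Hypothetical stand-in for Text.analyze_file(path).strings on a PDF whose
# content is "Hello\nWorld": one character per element, newline missing.
pdf_chars = ["H", "e", "l", "l", "o", "W", "o", "r", "l", "d"]

# Run the expected text through the same normalization before comparing.
expected = char_array_to_normalized_string("Hello World".scan(/./))
actual   = char_array_to_normalized_string(pdf_chars)

puts actual == expected
```

Because whitespace is stripped from both sides before joining, the comparison no longer depends on where the spaces or newlines fell in the PDF. The trade-off is that it also can't tell "Hello World" apart from "HelloWorld".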
Cheers!
