Real Book Extractor
#11
I've knocked this up in too much of a hurry. It works with everything I have here, but there's obviously more variety in formats than I realised. Always writing the files out with a jpg extension is an obvious mistake. I need to check the images I'm retrieving from the pdf and see what format they are. That doesn't explain why renaming them to png doesn't work.
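For anyone wanting to do the format check mentioned above, here is a minimal sketch in Python (not the Extractor's own code) that guesses the file type from the first few bytes instead of trusting the extension. The function name `sniff_extension` and the signature table are my own illustration. Data that matches none of the signatures, such as raw pixel data, returns None, and no choice of extension will make a viewer open it.

```python
from typing import Optional

# Leading "magic" bytes of some common image file formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),                      # JPEG
    (b"\x89PNG\r\n\x1a\n", "png"),                 # PNG
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),    # JPEG 2000 file (JP2 box)
    (b"\xff\x4f\xff\x51", "j2k"),                  # raw JPEG 2000 codestream
    (b"BM", "bmp"),                                # Windows bitmap
    (b"II*\x00", "tif"),                           # TIFF, little-endian
    (b"MM\x00*", "tif"),                           # TIFF, big-endian
]


def sniff_extension(data: bytes) -> Optional[str]:
    """Return a file extension guessed from the leading bytes, or None."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None  # unknown: possibly raw pixel data rather than an image file
```

A caller could use the returned extension when naming the output file, and fall back to re-encoding the pixels when the result is None.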

The only thing I'm using iTextSharp for is to convert pdfs with iref streams into the older format, so they can be read by PdfSharp, which has a much easier to use API. I used the old version of iTextSharp because it has a less restrictive licence, although given that I've open sourced my program I suppose I might as well use a more recent version. The PdfSharp team have been saying they will try to support iref streams for about 5 years, but they aren't making enough from donations for anyone to be willing to commit to doing it.

I have come across a couple of pdfs on my disk that it can't get the sheets out of at all. They aren't stored as images but are actually rendered by Acrobat. I think I probably downloaded them from Wikifonia. I don't think I want to put the work in to deal with those.

GraemeJ, is there any chance you could send me a copy of the pdf? I promise I'll delete it once I get it working.
#12
(11-12-2014, 10:52 PM)trevorprinn Wrote: GraemeJ, is there any chance you could send me a copy of the pdf? I promise I'll delete it once I get it working.

Yes - no problem, but you need to PM me your email address, as it's not possible to attach files to emails through the forum.

Don't worry about deleting it, it's available all over the net anyway :)
Graeme

1: Samsung 12.2" SM-P900: Android 5.0.2 
2: eSTAR GRAND HD Quad-Core 4G 10.2": Android 5.1 
3: Home-built BT pedal

Some of my music here
#13
I've just been having a look at it, and it seems the 1.2.0 version has decided to start extracting all the 1bpp images as negatives. I can't see why. The code appears to be identical to before.

I also can't (easily) determine the image format that I've extracted from the pdf (the image is always a memory bmp), but I think I may be able to write it out converted to a specific format, once I have worked out the negative problem.
#14
I understand the problem with that file. I've been learning more about the innards of pdfs than I really intended to. The images in it are encoded using JPX (JPEG 2000), which is the one standard format not supported by the PdfSharp libraries. iTextSharp may support it, but I can't easily see how to get it to extract them.
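As a rough illustration of where the encoding shows up: each image stream in a pdf names its encoding in a /Filter entry, and JPEG 2000 appears there as /JPXDecode. The sketch below is Python, not the Extractor's code. The filter names come from the PDF specification, but the helper names and the default set of unsupported filters are assumptions made for the example.

```python
# Standard PDF stream filter names and what the stream data contains.
FILTER_FORMATS = {
    "/DCTDecode": "JPEG",
    "/JPXDecode": "JPEG 2000",
    "/CCITTFaxDecode": "CCITT Group 3/4 fax",
    "/JBIG2Decode": "JBIG2",
    "/FlateDecode": "zlib-compressed raw pixels",
    "/LZWDecode": "LZW-compressed raw pixels",
    "/RunLengthDecode": "run-length-compressed raw pixels",
}

# Assumption for this example: only JPX cannot be handled by the extractor.
DEFAULT_UNSUPPORTED = frozenset({"/JPXDecode"})


def describe_filter(filter_name: str) -> str:
    """Return a human-readable description of a /Filter value."""
    return FILTER_FORMATS.get(filter_name, "unknown filter")


def can_extract(filter_name: str, unsupported=DEFAULT_UNSUPPORTED) -> bool:
    """True if the filter is known and not in the unsupported set."""
    return filter_name in FILTER_FORMATS and filter_name not in unsupported
```

Checking the filter before decoding would let a program skip an image it can't handle and carry on with the rest of the file.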

I'm going to have to leave it at that for now. I've got too much work on to spend much more time on it. The pdf renderer built into the MobileSheets Companion (File/Convert PDF To Images) does support it, so all I can suggest is that you use that for any that the Extractor can't cope with.

I'm going to put out a release later this evening with a few small changes. It will at least not collapse if it tries to read a JPX-encoded image (or any other format it can't handle), will always force the output to be png, won't (I hope) ever write the music pages out as negatives, and will have a Browse button for selecting a folder.
#15
OK - I understand that you have more important things to work on. Thanks for the effort so far, and I'm sorry you had to learn more about pdfs than you would have liked :)

I'll have a play around with using the MS renderer.
Graeme
#16
Not to worry. Extra knowledge tends to come in useful later.

I've released another version. It still won't manage that file, but I've made several other fixes and improvements.
http://tprinn.co.uk/RealBookExtractor/Re...-1.3.0.msi
#17
I think the main problem is going to be that "that" file is probably quite representative of a lot of the books that are out there. Many of the real books were digitised at around the same time and presumably with much the same versions of Acrobat.

Still, never mind - as you say, the knowledge should come in useful at some time in the future.
Graeme
#18
Sorry to resurrect this thread haha... but I am using this app, and for the most part it seems good. However, from time to time it displays some pages as negative images, even within the same songbook.

Is there a way to stop this?

Thanks
#19
I know what you mean about the pages coming out negative. The problem is that a lot of the images that are extracted don't have any palette information in them. They should have a palette saying whether 1 is a black or a white pixel. Because of that, when I first wrote it the pages from some pdfs came out completely reversed, so I wrote in a little kludge to cope with it. I assume that the corners of the page should be white, so I look at the 50x50 square in the top left-hand corner and count how many white pixels there are. If there's more white than black, I assume it's OK. If not, I check the top right-hand 50x50 square as well (because some of the scanned pages may be a bit torn). If I still find more black than white, I reverse the palette. This worked on everything I checked it on, but there's no guarantee that it will always work. If the pdf was made from scans that have a black border, this could produce a negative image, for example.
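The corner check described above can be sketched roughly like this. It is a Python illustration, not the Extractor's actual code. It assumes the page is a non-empty list of rows of 0/1 pixel values and that 1 is provisionally treated as white. The function names are made up for the example.

```python
def mostly_white(pixels, left, size=50, white=1):
    """True if the size x size square at the top of the page, starting at
    column `left`, has more white pixels than black ones."""
    block = [row[left:left + size] for row in pixels[:size]]
    total = sum(len(row) for row in block)
    whites = sum(row.count(white) for row in block)
    return whites * 2 > total


def needs_inverting(pixels, size=50, white=1):
    """Apply the two-corner heuristic: top left first, then top right."""
    width = len(pixels[0])
    if mostly_white(pixels, 0, size, white):
        return False
    # The top left may be torn or dirty, so give the top right a chance.
    if mostly_white(pixels, max(width - size, 0), size, white):
        return False
    return True


def fix_negative(pixels, size=50, white=1):
    """Return the page with 0 and 1 swapped if it looks like a negative."""
    if needs_inverting(pixels, size, white):
        return [[1 - value for value in row] for row in pixels]
    return pixels
```

A correctly rendered scan with a thick black border fills both top corners with black, so this check would flip it into a negative, which is the failure case mentioned above.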

I suppose the nearest I can get to fixing it easily would be to add a button to let you switch the palette around.

While I'm at it, I'll fix the extract so that it runs on a background thread and can be cancelled. My laziness in not bothering to do that originally annoys me.
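The general shape of a cancellable background extract is something like the following Python sketch. It only illustrates the pattern and is not the Extractor's code. `extract_pages`, `start_extraction` and the `save_page` callback are hypothetical names.

```python
import threading


def extract_pages(pages, save_page, cancel):
    """Save pages one by one, stopping early if `cancel` has been set.
    Returns the number of pages saved."""
    done = 0
    for page in pages:
        if cancel.is_set():  # checked between pages, so a cancel is prompt
            break
        save_page(page)
        done += 1
    return done


def start_extraction(pages, save_page):
    """Run the extraction on a background thread.
    Returns the thread and the event a Cancel button would set."""
    cancel = threading.Event()
    worker = threading.Thread(
        target=extract_pages, args=(pages, save_page, cancel), daemon=True
    )
    worker.start()
    return worker, cancel
```

A Cancel button handler would call `cancel.set()`, and the worker stops after finishing the page it is currently on.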

I don't know when I'll get it done. Probably some time this week.
#20
(11-30-2016, 04:36 AM)trevorprinn Wrote: I know what you mean about the pages coming out negative. The problem is that a lot of the images that are extracted don't have any palette information in them. They should have a palette saying whether 1 is a black or a white pixel. Because of that, when I first wrote it the pages from some pdfs came out completely reversed, so I wrote in a little kludge to cope with it. I assume that the corners of the page should be white, so I look at the 50x50 square in the top left-hand corner and count how many white pixels there are. If there's more white than black, I assume it's OK. If not, I check the top right-hand 50x50 square as well (because some of the scanned pages may be a bit torn). If I still find more black than white, I reverse the palette. This worked on everything I checked it on, but there's no guarantee that it will always work. If the pdf was made from scans that have a black border, this could produce a negative image, for example.

I suppose the nearest I can get to fixing it easily would be to add a button to let you switch the palette around.

While I'm at it, I'll fix the extract so that it runs on a background thread and can be cancelled. My laziness in not bothering to do that originally annoys me.

I don't know when I'll get it done. Probably some time this week.

Thanks Trevor - whenever you feel like it/can do it is fine by me. It is an EXCELLENT app many thx.