Post #429,151
5/26/19 9:02:51 PM
5/26/19 9:02:51 PM
|
Just splitting pages, or pulling content out?
|
Post #429,153
5/27/19 10:55:24 AM
5/27/19 10:55:24 AM
|
This
If all you need is to split the files, then any print-to-PDF driver will do. Win 10 has it built-in. For Win 7 and before, CutePDF works well but the installer was weaponized to drop adware and finding a clean older version is getting harder by the day. Possible alternatives with virtual printers: Foxit Reader, Nitro PDF reader.
Beyond splitting files, only Adobe Acrobat is able to deal with the mess. Unless the house is on fire, the only sane method is to fix the source and regenerate the PDF files as desired. (The reason is that PDF files do not have the internal document structure an editor needs, such as paragraphs or text flow. The contents are just a bunch of objects at particular positions on the page. It is up to you to repair e.g. word wrap if a change pushes a line beyond the margin.)
|
Post #429,154
5/27/19 1:37:03 PM
5/27/19 1:38:50 PM
|
What he said
Although if you can't get the source, copy-and-paste out of Reader will probably get you blocks of text more easily than trying to pull them from the file.
Oh, and PS: There are some GNU tools that have Windows versions that split and join pages.
Edited by drook
May 27, 2019, 01:38:50 PM EDT
|
Post #429,155
5/27/19 3:20:14 PM
5/27/19 3:20:14 PM
|
Re: Just splitting pages, or pulling content out?
Files are PDFs originally created by scans. However, some of the original papers can't be found, so scanning again is only a partial solution. I want to pull pages out of the existing files and break them up into smaller PDFs, grouped by vendor.
Satan (impatiently) to Newcomer: The trouble with you Chicago people is, that you think you are the best people down here; whereas you are merely the most numerous. - - - Mark Twain, "Pudd'nhead Wilson's New Calendar" 1897
|
Post #429,156
5/27/19 3:27:29 PM
5/27/19 3:27:29 PM
|
Re: Just splitting pages, or pulling content out?
|
Post #429,158
5/27/19 5:08:07 PM
5/27/19 5:08:07 PM
|
Virtual printer will probably be the fastest to extract
Load the large file in the PDF reader of choice, then use the Windows printer dialog to print the desired page range(s) to a new PDF file.
If that doesn't cut it, there are tools (pdfimages, part of the Xpdf command-line tools from XpdfReader) that can dump all the embedded images. PITA, but then the scans can be reassembled to suit in MS Word/OOo Writer/... and the PDFs regenerated.
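If clicking through the print dialog for every range gets tedious, the same page-range extraction can be scripted. A minimal sketch, assuming Python 3.9+ and the third-party pypdf package; the function names are made up for illustration. It copies whole pages as-is, so scanned page images are not re-rendered the way a virtual printer might do it.

```python
import re
from pathlib import Path


def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a print-dialog style range such as "1-3, 7" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        match = re.fullmatch(r"(\d+)(?:\s*-\s*(\d+))?", part)
        if match is None:
            raise ValueError(f"unreadable page range: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if first < 1 or last < first or last > page_count:
            raise ValueError(f"range {part!r} is outside 1..{page_count}")
        indices.extend(range(first - 1, last))
    return indices


def extract_pages(source: Path, spec: str, target: Path) -> None:
    """Copy the selected pages of `source` into a new PDF at `target`."""
    # pypdf is a third-party package (pip install pypdf). It is imported here
    # so the range parser above can be used and tested without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(target, "wb") as handle:
        writer.write(handle)
```

Usage would look something like extract_pages(Path("big_scan.pdf"), "4-9, 15", Path("vendor_a.pdf")), where both file names are placeholders.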
|
Post #429,163
5/27/19 6:50:26 PM
5/27/19 6:50:26 PM
|
Is the vendor data embedded in a readable format?
And if not directly, you can usually convert PDF files to PostScript and pull the data out of that. Then it is a matter of scripting the logic: break the pages out individually, extract the information you need to rename them, and concatenate them back together into your final target output.
I used to do this kind of stuff all the time for large-scale print runs in the print shop, while populating the web server.
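For the break-out-and-concatenate part, here is a rough sketch of how the batching could look when the vendor for each page range is already known, e.g. typed into a list by hand. It assumes Python with the third-party pypdf package and skips the PostScript step entirely. The manifest layout and the function names are invented for this example, and vendor names are assumed to be safe to use as file names.

```python
from collections import defaultdict
from pathlib import Path


def plan_vendor_files(manifest):
    """Group manifest rows into {vendor: [(source_pdf, zero_based_page), ...]}.

    Each row is (vendor, source_pdf, first_page, last_page) with 1-based,
    inclusive page numbers. Row order is kept, so pages land in each output
    file in the order they are listed.
    """
    plan = defaultdict(list)
    for vendor, source_pdf, first_page, last_page in manifest:
        if first_page < 1 or last_page < first_page:
            raise ValueError(f"bad page range for {vendor}: {first_page}-{last_page}")
        for page in range(first_page - 1, last_page):
            plan[vendor].append((source_pdf, page))
    return dict(plan)


def write_vendor_files(plan, out_dir: Path) -> None:
    """Write one PDF per vendor into `out_dir`, named after the vendor."""
    # pypdf is a third-party package (pip install pypdf); imported lazily so
    # the planning step above works without it.
    from pypdf import PdfReader, PdfWriter

    readers = {}  # open each source PDF only once
    out_dir.mkdir(parents=True, exist_ok=True)
    for vendor, pages in plan.items():
        writer = PdfWriter()
        for source_pdf, page in pages:
            if source_pdf not in readers:
                readers[source_pdf] = PdfReader(source_pdf)
            writer.add_page(readers[source_pdf].pages[page])
        with open(out_dir / f"{vendor}.pdf", "wb") as handle:
            writer.write(handle)
```

Keeping the plan separate from the writing step means the grouping can be checked by eye before any files are produced.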
|
Post #429,222
5/31/19 11:06:21 AM
5/31/19 11:06:21 AM
|
don't need the vendor data, just the images of the documents
The smaller files need the PDF pages from the big ones, with each small file covering a specific vendor. And sadly a lot of the original documents can't be found.
|
Post #429,175
5/28/19 8:30:46 AM
5/28/19 8:30:46 AM
|
I don't know if this is appropriate or not.
But I seem to recall you're a .Net developer. I've used this in a few apps I've written and can highly recommend it. I'm not sure this fits your use case, but thought I'd mention it. Good luck!
bcnu, Mikem
It's mourning in America again.
|