how to read a particular page from a DOC file (I/O and Streams forum at Coderanch)

Ranch Hand

Posts: 94

Eclipse IDE Java

posted 11 years ago

Hello all,
I have a .doc file, but I am not supposed to read the entire file; instead I am given a page number.
So I have to read only that particular page from the .doc file.
I am using the Apache POI API.

Thank you.

Bartender

Posts: 3323

Gajendra Kangokar

Ranch Hand

Posts: 94

posted 11 years ago

OK, so the .doc file does not store page numbers.
But is there any way to know that we have reached the end of a page,
or any way to know that the page is changing?

posted 11 years ago

I don't think it is possible to know page numbers before the entire file has been read, for the reasons Tony mentioned.

I am not supposed to read the entire file; instead I am given a page number.

This sounds like a really strange requirement; what is the point of it?

Gajendra Kangokar

Ranch Hand

Posts: 94

posted 11 years ago

I just want to count the number of pages in a .doc file.
We use while((in.read())!=-1) to read until the end of the file,
but is there any logic to check whether control has reached the end of a page?

Ulf Dittmer

Rancher

Posts: 43081

posted 11 years ago

OK, so that requirement doesn't actually exist; that's good. You could use a library like JODConverter (which relies on running OpenOffice in server mode) to convert the document to PDF - PDFs are fixed in layout, and libraries like PDFBox can tell you the number of pages.

posted 11 years ago

Basically a Word document doesn't have pages at all. When you see it displayed in Word it may appear to have pages, but that's because it's using the default page layout information to paginate the document. If you click on the Page Layout tab you'll see all the things you can change -- margins, page orientation, page size, columns, and more -- and which will affect the pagination. And as already pointed out, there are many other things which affect the pagination.

But if, as you say, you're just reading the raw bytes from the .doc file, you don't have any hope of finding out any of those things. You're just reading the document text and the document formatting and other control information as uninterpreted bytes. You can't find out anything at all about the document that way except how many bytes it took Word to store it on disk.
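To make that point concrete, here is a minimal sketch of the read loop quoted earlier in the thread. The class name `RawByteCounter` and the sample bytes are made up for illustration; an in-memory stream stands in for the real file. All the loop can report is how many bytes went by.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of any library.
class RawByteCounter {

    // Same read-until-minus-one loop as in the thread; it only counts bytes.
    static long countBytes(InputStream in) {
        long count = 0;
        try {
            while (in.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Arbitrary stand-in bytes; a stream over a real .doc file would be
        // consumed the same way, one uninterpreted byte at a time.
        byte[] sample = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 0x0C, 0x41};
        System.out.println(countBytes(new ByteArrayInputStream(sample)));
    }
}
```

Nothing in that loop knows about text, formatting, or layout, so it has no way to notice where a page would end.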

Gajendra Kangokar

Ranch Hand

Posts: 94

posted 11 years ago

I am not supposed to convert it to PDF.

@Paul, you mean there is no way to know where a page break happened in a .doc file? Is there any way to use a form feed or something?

Ulf Dittmer

Rancher

Posts: 43081

posted 11 years ago

Yes, that's what Tony, Paul and I have been saying.

I am not supposed to convert it to PDF.

Where are all these strange requirements coming from? It sounds like the requirements contain details of the technical implementation, where that kind of thing has no place.

Paul Clapham

Sheriff

Posts: 28408

Eclipse IDE Firefox Browser MySQL Database

posted 11 years ago

Gajendra Kangokar wrote: @Paul, you mean there is no way to know where a page break happened in a .doc file? Is there any way to use a form feed or something?

Form feed? No, it's not nearly that simple. In fact Word is probably a thousand times as complicated as just throwing in a form-feed character. I'm guessing you haven't actually used Word yourself much?
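As a rough illustration of why the form-feed idea falls short, here is a sketch that counts form-feed characters in text that has already been extracted from the document. It assumes that explicit, hand-inserted page breaks show up as `'\f'` in the extracted text; the class name `FormFeedCounter` and the sample string are hypothetical.

```java
// Hypothetical helper, not part of any library.
class FormFeedCounter {

    // Counts '\f' (U+000C) characters in already-extracted text.
    // Assumption: only explicit breaks appear this way; breaks that the
    // word processor computes while laying out the page leave no character.
    static int countFormFeeds(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\f') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // One explicit break; the second part could still span several pages.
        String extracted = "Chapter one\fChapter two, which may run on for many pages";
        System.out.println(countFormFeeds(extracted)); // prints 1
    }
}
```

Even under that assumption the count is only a lower bound: text that simply flows from one page to the next produces no marker at all, which is why the pagination has to be computed from the layout information.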

If you really have to address the requirement of extracting a page from a Word document, you at least have to start by accessing it via Apache POI's Word components, or else Aspose's software which allows you to access Word documents. And then prepare yourself for a long stretch where you learn how to use those things. Last time I looked at accessing Word (from Visual Basic over a decade ago) there were about 500 different types in its data model. I'm sure that the number is closer to 1,000 by now. It isn't simple and you shouldn't expect a simple solution.

Gajendra Kangokar

Ranch Hand

Posts: 94

posted 11 years ago

Yes, I am using Apache POI. Thank you, I will try the Aspose software as well.