Package org.apache.poi.hwpf.extractor
Class WordExtractor
- java.lang.Object
  - org.apache.poi.extractor.POITextExtractor
    - org.apache.poi.extractor.POIOLE2TextExtractor
      - org.apache.poi.hwpf.extractor.WordExtractor
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public final class WordExtractor extends POIOLE2TextExtractor
Class to extract the text from a Word Document. You should use either getParagraphText() or getText() unless you have a strong reason otherwise.
- Author:
Nick Burch
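A minimal usage sketch of the recommendation above, assuming Apache POI with its HWPF module is on the classpath and the input is a binary .doc file. The class name WordTextDump and the helper extractText are illustrative names, not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical example class; only WordExtractor comes from POI.
class WordTextDump {

    /** Opens the .doc file at the given path and returns its text via getText(). */
    static String extractText(String path) throws IOException {
        // Both resources are closed automatically; WordExtractor is listed
        // above as implementing java.io.Closeable.
        try (InputStream in = new FileInputStream(path);
             WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java WordTextDump <file.doc>");
            return;
        }
        System.out.println(extractText(args[0]));
    }
}
```

The try-with-resources form keeps the stream and the extractor scoped to the single call, which is usually enough when only the text is needed.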
Field Summary
-
Fields inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
document
-
-
Constructor Summary
Constructors (Constructor: Description)
- WordExtractor(java.io.InputStream is): Create a new Word Extractor
- WordExtractor(HWPFDocument doc): Create a new Word Extractor
- WordExtractor(DirectoryNode dir)
- WordExtractor(POIFSFileSystem fs): Create a new Word Extractor
-
Method Summary
Methods (Modifier and Type, Method: Description)
- java.lang.String[] getCommentsText()
- java.lang.String[] getEndnoteText()
- java.lang.String getFooterText(): Deprecated. 3.8 beta 4
- java.lang.String[] getFootnoteText()
- java.lang.String getHeaderText(): Deprecated. 3.8 beta 4
- java.lang.String[] getMainTextboxText()
- java.lang.String[] getParagraphText(): Get the text from the word file, as an array with one String per paragraph
- protected static java.lang.String[] getParagraphText(Range r)
- java.lang.String getText(): Grab the text, based on the WordToTextConverter.
- java.lang.String getTextFromPieces(): Grab the text out of the text pieces.
- static void main(java.lang.String[] args): Command line extractor, so people will stop moaning that they can't just run this.
- static java.lang.String stripFields(java.lang.String text): Removes any fields (eg macros, page markers etc) from the string.
Methods inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getDocument, getMetadataTextExtractor, getRoot, getSummaryInformation
-
Methods inherited from class org.apache.poi.extractor.POITextExtractor
close, setFilesystem
-
-
-
-
Constructor Detail
-
WordExtractor
public WordExtractor(java.io.InputStream is) throws java.io.IOException
Create a new Word Extractor
- Parameters:
is - InputStream containing the word file
- Throws:
java.io.IOException
-
WordExtractor
public WordExtractor(POIFSFileSystem fs) throws java.io.IOException
Create a new Word Extractor
- Parameters:
fs - POIFSFileSystem containing the word file
- Throws:
java.io.IOException
-
WordExtractor
public WordExtractor(DirectoryNode dir) throws java.io.IOException
- Throws:
java.io.IOException
-
WordExtractor
public WordExtractor(HWPFDocument doc)
Create a new Word Extractor
- Parameters:
doc - The HWPFDocument to extract from
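The constructors above differ only in what the caller already holds. The following sketch, assuming POI is on the classpath, builds an extractor from a POIFSFileSystem and from an existing HWPFDocument. The class and method names (ExtractorConstruction, textViaFileSystem, textViaDocument) are hypothetical; opening the file system from a java.io.File is one option among several.

```java
import java.io.File;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical example class showing two of the constructors listed above.
class ExtractorConstruction {

    /** Builds the extractor from a POIFSFileSystem opened on the file. */
    static String textViaFileSystem(File file) throws IOException {
        try (POIFSFileSystem fs = new POIFSFileSystem(file);
             WordExtractor extractor = new WordExtractor(fs)) {
            return extractor.getText();
        }
    }

    /** Builds the extractor from an HWPFDocument the caller already holds. */
    static String textViaDocument(File file) throws IOException {
        try (POIFSFileSystem fs = new POIFSFileSystem(file);
             HWPFDocument doc = new HWPFDocument(fs);
             WordExtractor extractor = new WordExtractor(doc)) {
            return extractor.getText();
        }
    }
}
```

The HWPFDocument variant is mainly useful when the document object is needed for other work as well, so the file is parsed only once.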
-
-
Method Detail
-
main
public static void main(java.lang.String[] args) throws java.io.IOException
Command line extractor, so people will stop moaning that they can't just run this.
- Throws:
java.io.IOException
-
getParagraphText
public java.lang.String[] getParagraphText()
Get the text from the Word file as an array, with one String per paragraph.
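A short sketch of working with the per-paragraph array, assuming POI is on the classpath. The returned strings may carry trailing line terminators, so the hypothetical helper nonEmptyParagraphs trims each entry and drops blank ones; both that helper and the ParagraphLister class are illustrative, not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical example class; only WordExtractor comes from POI.
class ParagraphLister {

    /** Trims each paragraph and drops those that are empty after trimming. */
    static List<String> nonEmptyParagraphs(String[] rawParagraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : rawParagraphs) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Reads the .doc file at the given path and returns its cleaned paragraphs. */
    static List<String> readParagraphs(String path) throws IOException {
        try (InputStream in = new FileInputStream(path);
             WordExtractor extractor = new WordExtractor(in)) {
            return nonEmptyParagraphs(extractor.getParagraphText());
        }
    }
}
```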
-
getFootnoteText
public java.lang.String[] getFootnoteText()
-
getMainTextboxText
public java.lang.String[] getMainTextboxText()
-
getEndnoteText
public java.lang.String[] getEndnoteText()
-
getCommentsText
public java.lang.String[] getCommentsText()
-
getParagraphText
protected static java.lang.String[] getParagraphText(Range r)
-
getHeaderText
@Deprecated public java.lang.String getHeaderText()
Deprecated. 3.8 beta 4
Grab the text from the headers
-
getFooterText
@Deprecated public java.lang.String getFooterText()
Deprecated. 3.8 beta 4
Grab the text from the footers
-
getTextFromPieces
public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. The result might include various bits of non-text crud, but this method still works when the text piece -> paragraph mapping is broken. It is also fast.
-
getText
public java.lang.String getText()
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but slower than getTextFromPieces().
- Specified by:
getText in class POITextExtractor
- Returns:
All the text from the document
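The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: try the clean path first and use the raw text pieces only if it fails. The sketch below assumes POI is on the classpath. Since getText() declares no checked exceptions, any failure it reports must be unchecked; whether a broken text piece -> paragraph mapping actually surfaces as an exception is an assumption here. RobustText and firstSuccessful are hypothetical names.

```java
import java.util.function.Supplier;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical example class; only WordExtractor comes from POI.
class RobustText {

    /** Returns primary.get(), or fallback.get() if the primary throws a RuntimeException. */
    static String firstSuccessful(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Assumption: a failure on the clean path shows up as an unchecked exception.
            return fallback.get();
        }
    }

    /** Prefers getText() and falls back to the faster but noisier getTextFromPieces(). */
    static String extract(WordExtractor extractor) {
        return firstSuccessful(extractor::getText, extractor::getTextFromPieces);
    }
}
```

Keeping the fallback logic in a helper that takes two suppliers makes it easy to exercise without a real document.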
-
stripFields
public static java.lang.String stripFields(java.lang.String text)
Removes any fields (e.g. macros, page markers, etc.) from the string.
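Because stripFields is static and works on a plain String, it can be tried without opening a document. The sketch below assumes POI is on the classpath and assumes that fields appear in the raw text delimited by the control characters 0x13 (field begin), 0x14 (separator) and 0x15 (field end), with the field instruction between the first two. The sample string, the FieldStripping class and the cleanedSample method are made up for illustration.

```java
import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical example class; only WordExtractor.stripFields comes from POI.
class FieldStripping {

    // Assumed field delimiters in raw Word text (not stated in the text above).
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Builds a made-up string containing one PAGE field and strips the field markup. */
    static String cleanedSample() {
        String raw = "Page " + FIELD_BEGIN + " PAGE " + FIELD_SEPARATOR + "3" + FIELD_END
                + " of the report";
        return WordExtractor.stripFields(raw);
    }

    public static void main(String[] args) {
        // Text without any field markers is expected to pass through unchanged.
        System.out.println(WordExtractor.stripFields("No fields here"));
        System.out.println(cleanedSample());
    }
}
```

This is mostly relevant for output from getTextFromPieces() or getParagraphText(), where field markup may still be present.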
-
-