Online File Preview: Converting doc and docx to PDF (Part 1)

1. Introduction

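Document conversion is a tough nut to crack, but it is also indispensable. The knowledge-base product we are building faces exactly this problem: document conversion, accurate full-text search, and the knowledge conversion rate are the basic building blocks of such a product. When I first started reading up on it, I racked my brains over whether to build the conversion ourselves or integrate a third-party product. This is a hard choice for any small or medium-sized company.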

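Searching online, I found many examples built on the open-source POI library. At first I used POI to convert every document to HTML. These are the options I came across: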

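1) On GitHub I found a complete demo, https://github.com/litter-fish/transform, which covers almost every conversion you might want. Beginners can use it as a reference to get a basic conversion working, but bringing the output up to general-purpose quality takes a lot of effort of your own. This open-source code converts with POI and iText (for PDF).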

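2) https://gitee.com/kekingcn/file-online-preview is an open-source project hosted on Gitee (OSChina's code hosting service). It is based on jodconverter, which works by driving a locally installed OpenOffice or LibreOffice to do the equivalent of "Save As" in the target format.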

3) Commercial products, for example Yozo Office (永中office), Office 365, idocv, and Aspose.Words for Java (https://downloads.aspose.com/words/java).

2. Conversion approach

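After trying many of these, and after integrating Yozo's document conversion as well, I found that good preview quality requires a second rendering pass. Yozo has been doing document conversion for more than ten years, and we cannot match that. After some thought I arrived at an approach that is barely adequate but workable: convert both doc and docx to PDF and preview the PDF. Everything is built on top of POI.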

2.1. Converting doc to PDF

1) Converting doc to XML

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToFoConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.w3c.dom.Document;

	/**
	 * Converts a .doc file to XSL-FO XML and writes it to OUTFILEFO.
	 * Embedded pictures are extracted to <web root>/temp/images.
	 */
	public String toXML(String filePath) {
		// Directory that receives the pictures embedded in the document
		final File imageDir = new File(PathMaster.getWebRootPath(),
				"temp" + File.separator + "images");
		try (POIFSFileSystem fs = new POIFSFileSystem(new File(filePath));
				HWPFDocument doc = new HWPFDocument(fs)) {
			WordToFoConverter foConverter = new WordToFoConverter(
					DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
			// Point each picture reference in the FO output at the extracted file,
			// e.g. file:/F:/20.vscode/iWorkP/temp/images/0.jpg
			foConverter.setPicturesManager(new PicturesManager() {
				public String savePicture(byte[] content, PictureType pictureType,
						String suggestedName, float widthInches, float heightInches) {
					return new File(imageDir, suggestedName).toURI().toString();
				}
			});
			foConverter.processDocument(doc);

			// Write the embedded pictures next to where the FO output expects them
			if (!imageDir.exists()) {
				imageDir.mkdirs();
			}
			for (Picture picture : doc.getPicturesTable().getAllPictures()) {
				try (OutputStream imageOut = new FileOutputStream(
						new File(imageDir, picture.suggestFullFileName()))) {
					picture.writeImageContent(imageOut);
				}
			}

			// Serialize the FO DOM tree to OUTFILEFO
			Document foDocument = foConverter.getDocument();
			try (OutputStream foOut = new FileOutputStream(OUTFILEFO)) {
				Transformer transformer = TransformerFactory.newInstance().newTransformer();
				transformer.setOutputProperty(OutputKeys.ENCODING, "GBK");
				transformer.setOutputProperty(OutputKeys.INDENT, "yes"); // must be lowercase "yes" or "no"
				transformer.setOutputProperty(OutputKeys.METHOD, "xml");
				transformer.transform(new DOMSource(foDocument), new StreamResult(foOut));
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
		return "";
	}
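
The last part of the method above, serializing a DOM tree to an XML file with a JAXP `Transformer`, can be tried in isolation with the JDK alone. The sketch below is only an illustration under a few assumptions: the class name `FoSerializationSketch` and its methods are hypothetical, the tiny `fo:root` tree is a hand-built stand-in for what the converter would return, and UTF-8 is used here instead of the GBK encoding from the post.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of the original project.
final class FoSerializationSketch {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Builds a tiny stand-in for the DOM tree a Word-to-FO converter would return. */
    static Document buildSampleFo() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(FO_NS, "fo:root");
            Element block = doc.createElementNS(FO_NS, "fo:block");
            block.setTextContent("Hello");
            root.appendChild(block);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes a DOM tree to XML text in the given encoding. */
    static String serialize(Document doc, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            t.setOutputProperty(OutputKeys.INDENT, "yes"); // lowercase "yes" or "no"
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), encoding);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildSampleFo(), "UTF-8"));
    }
}
```

Whichever encoding is chosen, it is written into the XML declaration of the output, so the FO file and the parser that later reads it stay in agreement.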

2) Converting XML to PDF

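Here I use FOP to convert the XML to PDF. This is a recent and welcome discovery: it is the approach recommended on the POI website, but I had never looked at it closely. Many of the jars it ships are identical to the ones in Yozo's high-definition package, so this now looks like the right track. Those who are interested can dig further. My source code is on GitHub at https://github.com/liuxufeijidian/file.convert.master/tree/master; the environment is already configured, so you only need to prepare doc and docx documents.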

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.xml.sax.SAXException;

	/**
	 * Renders the XSL-FO file OUTFILEFO to the PDF file OUTFILEPDF.
	 */
	public void xmlToPDF() throws SAXException, TransformerException {
		try {
			// Step 1: Construct a FopFactory by specifying a reference to the configuration file
			// (reuse it if you plan to render multiple documents)
			FopFactory fopFactory = FopFactory.newInstance(new File(CONFIG));
			// Step 2: Set up the output stream. BufferedOutputStream helps performance
			// with FileOutputStream; try-with-resources closes it even on failure.
			try (OutputStream out = new BufferedOutputStream(new FileOutputStream(OUTFILEPDF))) {
				// Step 3: Construct fop with the desired output format
				Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
				// Step 4: Set up JAXP using an identity transformer
				TransformerFactory factory = TransformerFactory.newInstance();
				Transformer transformer = factory.newTransformer(); // identity transformer
				// Step 5: Set up input and output for the transformation
				Source src = new StreamSource(OUTFILEFO);
				// The resulting SAX events (the FO) must be piped through to FOP
				Result res = new SAXResult(fop.getDefaultHandler());
				// Step 6: Start the transformation and FOP processing
				transformer.transform(src, res);
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
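
Steps 4 to 6 above rely on one JAXP mechanism: an identity `Transformer` reads the FO file and replays it as SAX events into whatever handler sits behind the `SAXResult`. The sketch below shows only that mechanism with the JDK. The class `SaxPipingSketch` and its `RecordingHandler` are hypothetical stand-ins for the handler returned by `fop.getDefaultHandler()`: instead of laying out pages, the handler just records the element names it receives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original project.
final class SaxPipingSketch {

    /** Stand-in for FOP's handler: records element names as they stream past. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }
    }

    /** Pipes XML text through an identity transformer into a SAX handler. */
    static List<String> pipe(String xml) {
        RecordingHandler handler = new RecordingHandler();
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(xml)), new SAXResult(handler));
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
        return handler.elements;
    }

    public static void main(String[] args) {
        String fo = "<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
                + "<fo:block>Hello</fo:block></fo:root>";
        System.out.println(pipe(fo));
    }
}
```

Because the events are streamed, the FO document does not have to be held in memory as a second DOM tree before the PDF renderer sees it.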

3) Summary

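Often we convert Word directly to HTML, but that requires writing your own second-pass rendering code, which is fairly complex. I take a detour instead: convert doc to XML (XSL-FO), then convert that XML to PDF. The resulting PDF can be rendered with pdfjs, which gives the same preview as opening it in the browser. For details on previewing with pdfjs, see https://blog.csdn.net/liuxufeijidian/article/details/82260199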
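
The detour can be pictured as a small plan of files: the Word source, an intermediate FO file, and the final PDF handed to pdfjs. The sketch below only derives those paths and rejects unsupported extensions; it does no conversion itself. Everything in it is hypothetical (the class `ConversionPlanSketch`, the `plan` method, the `work` directory), and the two conversion steps would be methods like `toXML` and `xmlToPDF` from this post.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical planning helper, not part of the original project.
final class ConversionPlanSketch {
    final Path source;   // the uploaded doc or docx file
    final Path foFile;   // intermediate XSL-FO file
    final Path pdfFile;  // final PDF to be rendered by pdfjs

    private ConversionPlanSketch(Path source, Path foFile, Path pdfFile) {
        this.source = source;
        this.foFile = foFile;
        this.pdfFile = pdfFile;
    }

    /** Derives the intermediate and output paths for a doc or docx file. */
    static ConversionPlanSketch plan(String sourceFile, String workDir) {
        Path source = Paths.get(sourceFile);
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot <= 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!ext.equals("doc") && !ext.equals("docx")) {
            throw new IllegalArgumentException("Unsupported file type: " + name);
        }
        String base = name.substring(0, dot);
        Path dir = Paths.get(workDir);
        return new ConversionPlanSketch(source, dir.resolve(base + ".fo"), dir.resolve(base + ".pdf"));
    }

    public static void main(String[] args) {
        ConversionPlanSketch p = plan("report.doc", "work");
        System.out.println(p.source + " -> " + p.foFile + " -> " + p.pdfFile);
    }
}
```

Keeping the FO and PDF files under one working directory with a shared base name makes it easy to clean up the intermediate file once the PDF exists.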

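Ending: everyone wants to see how well this works. Get the source code from GitHub at https://github.com/litter-fish/transform, set up your doc and docx documents, and you can run the conversion. I will keep optimizing and updating the document conversion.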
