我知道有关的帖子,我已经尝试了多次尝试达到我的目标,我将在下面详细说明:
我有一个.zip
/ .rar
,包含多个xls
&xlsx
文件.
每个excel文件包含多达数千行的数据,大约90列提供或获取(每个excel文件可以包含更多或更少的列).
我已经创建了一个java windowbuilder应用程序,我在其中选择.zip
/ .rar
文件并选择将这些文件解压缩到哪里并使用它们创建它们FileOutputStream
.保存每个文件后,我正在读取文件的内容.
到现在为止还挺好.在多次尝试避免OOM(OutOfMemory)并加快速度之后,我已经达到了"最终版本"(这非常糟糕,但直到我弄清楚如何正确阅读)我将解释:
File file = new File('certainFile.xlsx'); //or xls, For example purposes Workbook wb; Sheet sheet; /* There is a ton of other things up to this point that I don't consider relevant, as it's related to unzipping and renaming, etc. This is within a cycle / In every zip file, there is at least 1 or 2 files that somehow, when it goes to WorkbookFactory.create(), it still gives an OOM because it recognizes is has a bit over a million rows, meaning it's an 2007 format file (according to our friend Google.com), or so I believe so. When I open the xlsx file, it indeed has like 10-20mb size and thousands of empty rows. When I save it again it has 1mb and a couple thousand. After many attempts to read as InputStream, File or trying to save it in an automatic way, I've worked with converting it to a CSV and read it differently, ence, this 'solution'. if parseAsXLS is true, it applies my regular logic per row per cell, otherwise I parse the CSV. */ if (file.getName().contains("xlsx")) { this.parseAsXLS = false; OPCPackage pkg = OPCPackage.open(file); //This is just to output the content into a csv file, that I will read later on and it gets overwritten everytime it comes by FileOutputStream fo = new FileOutputStream(this.filePath + File.separator + "excel.csv"); PrintStream ps = new PrintStream(fo); XLSX2CSV xlsxCsvCOnverter= new XLSX2CSV(pkg, ps, 90); try { xlsxCsvConverter.process(); } catch (Exception e) { //I've added a count at the XLSX2CSV class in order to limit the ammount of rows I want to fetch and throw an Exception on purpose System.out.println("Limited the file at 60k rows"); } } else { this.parseAsXLS = true; this.wb = WorkbookFactory.create(file); this.sheet = wb.getSheetAt(0); }
现在会发生什么是一个.xlsx
(从.zip
与其他几个文件.xls
和.xlsx
)有几分某个字符的行和XLSX2CSV视其为endRow,这会导致不正确的输出.
这是一个例子:imagelink
注意:目标是仅从每个excel文件中获取它们在commum中(或可能有,没有义务)的一组列,并将它们放在一个新的Excel中.电子邮件列(包含多个以逗号分隔的电子邮件)具有我认为在电子邮件之前"输入"的内容,因为如果我手动删除它,则可以解决问题.但是,目标是不要手动打开每个excel并修复它,否则我只需打开每个excel并复制粘贴我需要的列.在该示例中,我需要列:fieldAA,fieldAG,fieldAL和fieldAN.
XLSX2CSV.java(我不是这个文件的创建者,我只是将我的需求应用到它)
import java.awt.List; import java.io.File; import java.io.IOException; import java.io.InputStream; import java.io.PrintStream; import javax.xml.parsers.ParserConfigurationException; import org.apache.poi.openxml4j.exceptions.OpenXML4JException; import org.apache.poi.openxml4j.opc.OPCPackage; import org.apache.poi.openxml4j.opc.PackageAccess; import org.apache.poi.ss.usermodel.DataFormatter; import org.apache.poi.ss.util.CellAddress; import org.apache.poi.ss.util.CellReference; import org.apache.poi.util.SAXHelper; import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable; import org.apache.poi.xssf.eventusermodel.XSSFReader; import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler; import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler; import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor; import org.apache.poi.xssf.model.StylesTable; import org.apache.poi.xssf.usermodel.XSSFComment; import org.xml.sax.ContentHandler; import org.xml.sax.InputSource; import org.xml.sax.SAXException; import org.xml.sax.XMLReader; /** * A rudimentary XLSX -> CSV processor modeled on the * POI sample program XLS2CSVmra from the package * org.apache.poi.hssf.eventusermodel.examples. * As with the HSSF version, this tries to spot missing * rows and cells, and output empty entries for them. ** Data sheets are read using a SAX parser to keep the * memory footprint relatively small, so this should be * able to read enormous workbooks. The styles table and * the shared-string table must be kept in memory. The * standard POI styles table class is used, but a custom * (read-only) class is used for the shared string table * because the standard POI SharedStringsTable grows very * quickly with the number of unique strings. *
* For a more advanced implementation of SAX event parsing * of XLSX files, see {@link XSSFEventBasedExcelExtractor} * and {@link XSSFSheetXMLHandler}. Note that for many cases, * it may be possible to simply use those with a custom * {@link SheetContentsHandler} and no SAX code needed of * your own! */ public class XLSX2CSV { /** * Uses the XSSF Event SAX helpers to do most of the work * of parsing the Sheet XML, and outputs the contents * as a (basic) CSV. */ private class SheetToCSV implements SheetContentsHandler { private boolean firstCellOfRow; private int currentRow = -1; private int currentCol = -1; private int maxrows = 60000; private void outputMissingRows(int number) { for (int i=0; i
CSV converter * * @param pkg The XLSX package to process * @param output The PrintStream to output the CSV to * @param minColumns The minimum number of columns to output, or -1 for no minimum */ public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) { this.xlsxPackage = pkg; this.output = output; this.minColumns = minColumns; } /** * Parses and shows the content of one sheet * using the specified styles and shared-strings tables. * * @param styles The table of styles that may be referenced by cells in the sheet * @param strings The table of strings that may be referenced by cells in the sheet * @param sheetInputStream The stream to read the sheet-data from. * @exception java.io.IOException An IO exception from the parser, * possibly from a byte stream or character stream * supplied by the application. * @throws SAXException if parsing the XML data fails. */ public void processSheet( StylesTable styles, ReadOnlySharedStringsTable strings, SheetContentsHandler sheetHandler, InputStream sheetInputStream) throws IOException, SAXException { DataFormatter formatter = new DataFormatter(); InputSource sheetSource = new InputSource(sheetInputStream); try { XMLReader sheetParser = SAXHelper.newXMLReader(); ContentHandler handler = new XSSFSheetXMLHandler( styles, null, strings, sheetHandler, formatter, false); sheetParser.setContentHandler(handler); sheetParser.parse(sheetSource); } catch(ParserConfigurationException e) { throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage()); } } /** * Initiates the processing of the XLS workbook file to CSV. * * @throws IOException If reading the data from the package fails. * @throws SAXException if parsing the XML data fails. */ public void process() throws IOException, OpenXML4JException, SAXException { ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage); XSSFReader xssfReader = new XSSFReader(this.xlsxPackage); StylesTable styles = xssfReader.getStylesTable(); XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData(); int index = 0; while (iter.hasNext()) { try (InputStream stream = iter.next()) { processSheet(styles, strings, new SheetToCSV(), stream); } ++index; } } }
我正在寻找不同(和工作)方法来实现我的目标.
感谢您的时间