本文介绍了Apache MetaModel - 糟糕的性能查询电子表格的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要用Java查询电子表格文件。我正在使用。

I need to query a spreadsheet file in Java. I'm using Apache MetaModel.

我已导入它与maven一起使用

I imported it with maven using

<dependency>
    <groupId>org.apache.metamodel</groupId>
    <artifactId>MetaModel-excel</artifactId>
    <version>4.5.2</version>
</dependency>

一切正常,但 next()当它应该返回false时的指令需要几秒钟,为什么?

Everything works fine but the next() instruction when it's supposed to return false takes a few seconds, why?

import org.apache.metamodel.DataContext;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Column; 
import org.apache.metamodel.schema.Table; 
import org.apache.metamodel.query.Query;
import org.apache.metamodel.query.OperatorType; 
import org.apache.metamodel.data.DataSet; 
import org.apache.metamodel.data.Row; 
import org.apache.metamodel.MetaModelException;


public class SpreadsheetReader {

    private File spreadsheet;


    public SpreadsheetReader(String spreadsheetLocation, String spreadsheetName){

        this.spreadsheet = new File( spreadsheetLocation + spreadsheetName );

        if( !"OK".equals( checkSpreadSheet() ) ){
            throw new IllegalStateException("Error in spreadsheet. Cause: "+spreadsheetStatus);
        }

    }


    /** query the excel spreadsheet for the given  ID
    */
    public List<String> query( String givenProgId ){

        List<String> linksArray = new ArrayList<String>();
        int rowCount = 0;

        ExcelConfiguration conf = new ExcelConfiguration( 1, true, true ); // columnNameLineNumber, skipEmptyLines, sEColumns
        DataContext dataContext = new ExcelDataContext( this.spreadsheet, conf );

        System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-1" ); // ## 

        Schema schema = dataContext.getDefaultSchema();

        System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-1" ); // ## 
       // Takes 2 seconds. Will be moved into constructor.

        Table table = schema.getTables()[0];

        Column idsColumn = table.getColumnByName("ProgID");
        Column titlesColumn = table.getColumnByName("Titles");

        Query query = new Query().select(titlesColumn)
                                 .from(table)
                                 .where(idsColumn, OperatorType.EQUALS_TO, givenProgId);

        try( DataSet dataSet = dataContext.executeQuery(query) ){ // try-with-resource, no need to close dataset

            while (dataSet.next()) {

                // the rows are read quite quickly, problem will be when next() is false

                ++rowCount;

                Row currentRow = dataSet.getRow();
                String currentTitle = (String)currentRow.getValue(0);

                linksArray.add( "my-service/titles/"+currentTitle );

                System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-2" ); // @@@@@@@
            }
            System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-2" ); // @@@@@@@ 
            // TAKES ABOUT 6 SECONDS - (Excel file has just 14.779 rows and 114 columns)

        }catch(MetaModelException xx){
            //logger
            throw xx;
        }

        return linksArray;
    }
}}

更新
使用仅包含3个条目的电子表格文档进行更多分析

现在代码为:

try( DataSet dataSet = this.dataContext.executeQuery(query) ){


    // FIRST NEXT() with result => quite fast

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START a\n" );
    System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" ); 
    Row currentRow = dataSet.getRow();
    String currentTitle = (String)currentRow.getValue(0);
    System.out.println( "\n READ: "+(new org.joda.time.DateTime())+" >>> "+currentTitle+" \n" );

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP a\n" );


    // SECOND AND LAST NEXT() => very SLOW

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START b\n" );
    System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP b\n" );

}

电子表格预先加载在类构造函数中。

And the spreadsheet is pre-loaded in the class constructor.

一系列后续相同查询中的最后一个日志(有时间)

Jun 30, 2016 10:59:38 AM log my-project.logging.ITVLogger
INFO: CODE00012 - Query on spreadsheet started for ID 123456



PROFILING >>> 2016-06-30T10:59:38.651+01:00 START a

10:59:38.652 [main] INFO  o.a.m.d.RowPublisherDataSet - 
Starting separate thread for publishing action: org.apache.metamodel.excel.XlsxRowPublisherAction@4977e527

88888 NEXT >>> true <<<< 

READ: 2016-06-30T10:59:39.756+01:00 >>> A_TITLE

PROFILING >>> 2016-06-30T10:59:39.756+01:00 STOP a



PROFILING >>> 2016-06-30T10:59:39.756+01:00 START b

88888 NEXT >>> false <<<< 

PROFILING >>> 2016-06-30T10:59:44.735+01:00 STOP b

所以回顾一下一秒钟检索结果,最后一次执行 next() 4~6秒。

So to recap it takes about one second to retrieve the result and 4~6 seconds for the last execution of next().

推荐答案

excel DataContext实现它在后台展开和解析压缩的.xlsx文件。这意味着 DataContext.executeQuery(...)方法快速返回,但 DataSet.next()调用在必须等待数据在内存中可用之后才会出现。我没有看到任何可以避免这种情况的方法,这只是Excel文件处理相当复杂的事情的结果。

The excel DataContext implementation does it's unwrapping and parsing of the zipped .xlsx file in the background. This means that the DataContext.executeQuery(...) method returns quickly but the DataSet.next() call that comes right after it has to wait for the data to be available in memory. I don't see any way that you can avoid that, it's just a consequence of Excel files being rather complex things to deal with.

这篇关于Apache MetaModel - 糟糕的性能查询电子表格的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-27 16:27