This article explains how to convert JSON objects to the Parquet format using only Java, without first converting them to Avro files and without using Spark, Hive, Pig, or Impala. It should be a useful reference for anyone facing the same problem.

Problem description

I have a scenario where I need to convert messages, present as JSON objects, to the Apache Parquet format using Java. Any sample code or examples would be helpful. As far as I have found, converting messages to Parquet involves Hive, Pig, or Spark. I need to convert to Parquet without involving these, using only Java.

Solution


To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.
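To make that concrete, here is a minimal standalone sketch of handing an ordinary Avro in-memory record straight to a Parquet writer, with no Avro file in between. The class name, output file, and schema are illustrative; it assumes the avro, parquet-avro, and hadoop-client libraries are on the classpath:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroObjectsToParquet {
  public static void main(String[] args) throws Exception {
    // An Avro schema describing the records to store (illustrative).
    Schema schema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredInt("age")
        .endRecord();

    // An ordinary Avro in-memory object; no Avro *file* is ever written.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Parquet accepts the Avro object as-is.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("users.parquet"))
        .withSchema(schema)
        .build()) {
      writer.write(user);
    }
  }
}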

To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.

The conversion to Avro objects is already done for you by Kite's JsonUtil, which is ready to use as a file reader. The conversion method needs an Avro schema, but the same library can infer an Avro schema from your JSON data.
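If you want to see the inference step in isolation, here is a rough sketch. It assumes the Kite SDK's kite-data-core artifact is on the classpath, that JsonUtil lives at org.kitesdk.data.spi.JsonUtil (check your Kite version), and that inferSchema accepts a stream of newline-separated JSON records:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.kitesdk.data.spi.JsonUtil;

public class InferSchemaSketch {
  public static void main(String[] args) throws Exception {
    // Two sample records; inferSchema inspects up to 20 of them here.
    String json = "{\"name\": \"alice\", \"age\": 30}\n"
                + "{\"name\": \"bob\", \"age\": 25}";
    Schema schema = JsonUtil.inferSchema(
        new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8)),
        "User", 20);
    System.out.println(schema.toString(true));
  }
}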

To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:

// Assumed imports (package names per Kite SDK 1.x and Parquet 1.8+; adjust to your versions):
// import org.apache.avro.Schema;
// import org.apache.avro.generic.GenericData.Record;
// import org.apache.hadoop.conf.Configuration;
// import org.apache.parquet.avro.AvroParquetWriter;
// import org.apache.parquet.hadoop.ParquetWriter;
// import org.apache.parquet.hadoop.metadata.CompressionCodecName;
// import org.kitesdk.data.spi.JsonUtil;
// import org.kitesdk.data.spi.filesystem.JSONFileReader;

// fs is a Hadoop FileSystem; source and outputPath are Hadoop Paths.
// Infer an Avro schema by inspecting up to 20 records of the JSON input.
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
    fs.open(source), jsonSchema, Record.class)) {

  reader.initialize();

  // Write the Avro records out as Snappy-compressed Parquet.
  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}
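To sanity-check the result, the same parquet-avro library can read the file back as Avro records. A short sketch, reusing outputPath from above:

// Additional imports assumed:
// import org.apache.avro.generic.GenericRecord;
// import org.apache.parquet.avro.AvroParquetReader;
// import org.apache.parquet.hadoop.ParquetReader;
try (ParquetReader<GenericRecord> parquetReader =
        AvroParquetReader.<GenericRecord>builder(outputPath).build()) {
  GenericRecord record;
  while ((record = parquetReader.read()) != null) {
    System.out.println(record);  // each row comes back as an Avro record
  }
}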

That concludes this article on converting JSON objects to the Parquet format with Java alone, without converting to Avro and without Spark, Hive, Pig, or Impala. We hope the answer above is helpful, and thank you for your support!
