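Colleagues in the analytics department recently reported that Hive jobs processing the raw data kept failing with the error below. The error indicates that an Avro file is incomplete: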
Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:273)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:183)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:184)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:271)
        ... 11 more
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
        at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
        at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
        at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
        ... 15 more
Caused by: java.io.IOException: Invalid sync!
        at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
        at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
        ... 18 more

The logs of the recently failed jobs show that the server ran out of memory:
Log Type: syslog
Log Length: 18946

2015-12-27 13:30:44,516 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-12-27 13:30:44,540 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink ganglia started
2015-12-27 13:30:44,601 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-12-27 13:30:44,601 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2015-12-27 13:30:44,609 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2015-12-27 13:30:44,609 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1451036614992_0057, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@afb3f4c)
2015-12-27 13:30:44,670 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2015-12-27 13:30:44,907 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /diskb/hadoop/yarn/local/usercache/hdfs/appcache/application_1451036614992_0057,/diskc/hadoop/yarn/local/usercache/hdfs/appcache/application_1451036614992_0057
2015-12-27 13:30:45,345 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-12-27 13:30:45,669 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2015-12-27 13:30:46,003 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://BeiJing/data/raw/click/2015122710/http-topic.avro.192.168.2.12.avro:1342177280+47143758
2015-12-27 13:30:46,223 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007efc80000, 272105472, 0) failed; error='无法分配内存' (errno=12)

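Analysis: each Hadoop node runs both a NodeManager and a Storm supervisor, and the supervisor had a large number of workers. While tasks were running they used a lot of memory, so the node ran short of memory (the os::commit_memory failure above, errno=12, means "cannot allocate memory").
The fix was to reduce the number of workers from 16 to 10 and restart the Storm service. Sometimes the worker count stays at the old value after a restart, so I removed all the worker processes directly on the node and then started the supervisor again, which solved it.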

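After observing for a while, the Hive jobs no longer failed. My guess is that the job that writes the raw data ran short of memory on the node while handling Avro data, so some of the Avro files written to HDFS were incomplete, and Hive then reported errors when reading them.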
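
To find out which raw files are actually damaged, the block framing of each Avro file can be checked directly. The following is only a minimal sketch, not the check that was used here: it assumes the file has first been copied to local disk (for example with hdfs dfs -get), and the function names (read_long, skip_metadata, check_avro_sync) are made up for illustration. It follows the Avro object container layout: the magic bytes, a metadata map, a 16-byte sync marker, then blocks that each end with the same sync marker. A mismatch is the condition that DataFileStream reports as "Invalid sync!".

```python
import io

MAGIC = b"Obj\x01"   # magic bytes at the start of an Avro object container file
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(stream, eof_ok=False):
    """Decode one zigzag varint long; return None on clean EOF if eof_ok."""
    acc, shift, first = 0, 0, True
    while True:
        b = stream.read(1)
        if not b:
            if first and eof_ok:
                return None
            raise EOFError("truncated varint")
        first = False
        acc |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return (acc >> 1) ^ -(acc & 1)
        shift += 7


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated data")
    return data


def skip_metadata(stream):
    """Skip the header metadata, encoded as an Avro map of string to bytes."""
    while True:
        count = read_long(stream)
        if count == 0:
            return
        if count < 0:
            read_long(stream)  # byte size of this map block, not needed here
            count = -count
        for _ in range(count):
            read_exact(stream, read_long(stream))  # key
            read_exact(stream, read_long(stream))  # value


def check_avro_sync(stream):
    """Return (ok, good_blocks, message) for a seekable binary stream."""
    if stream.read(len(MAGIC)) != MAGIC:
        return False, 0, "bad magic"
    try:
        skip_metadata(stream)
        sync = read_exact(stream, SYNC_SIZE)
    except EOFError:
        return False, 0, "truncated header"
    blocks = 0
    while True:
        try:
            count = read_long(stream, eof_ok=True)
            if count is None:
                return True, blocks, "ok"          # clean end of file
            size = read_long(stream)
            if count < 0 or size < 0:
                return False, blocks, "corrupt block header"
            stream.seek(size, io.SEEK_CUR)         # skip the serialized records
            marker = read_exact(stream, SYNC_SIZE)
        except EOFError:
            return False, blocks, "truncated block"
        if marker != sync:
            return False, blocks, "invalid sync"
        blocks += 1
```

Usage would be along the lines of opening the local copy in binary mode, e.g. with open(path, "rb") as f, and printing check_avro_sync(f). The sketch only validates the framing (block sizes and sync markers); it does not decode records, so a file that passes can still contain bad data inside a block.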

12-18 07:14