I often find that Spark fails large jobs without ever surfacing a meaningful exception. The job logs look normal, with no errors, yet the jobs end up with status "KILLED". This is extremely common for large shuffles, so for operations like .distinct.

The question is: how do I diagnose what is going wrong, and, ideally, how do I fix it?
Given that many of these operations are monoidal, I have been working around the problem by splitting the data into, say, 10 chunks, running the application on each chunk, and then running the application on all of the resulting outputs. In other words: meta-map-reduce.
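To make that workaround concrete, here is a minimal sketch of the two-pass meta-map-reduce in Spark's Scala API, using .distinct as the per-chunk job. The chunk count, HDFS paths and object name are hypothetical, not taken from the question.

import org.apache.spark.{SparkConf, SparkContext}

object ChunkedDistinct {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chunked-distinct"))
    val numChunks = 10

    // Pass 1: run the job independently on each chunk of the input, so no
    // single shuffle has to touch the whole dataset.
    for (i <- 0 until numChunks) {
      sc.textFile(s"hdfs:///input/chunk-$i")
        .distinct()
        .saveAsTextFile(s"hdfs:///intermediate/chunk-$i")
    }

    // Pass 2: run the same job once more over all the intermediate outputs.
    // This only works because distinct is monoidal: deduplicating the
    // combined per-chunk results gives the same answer as one big job.
    sc.textFile("hdfs:///intermediate/chunk-*")
      .distinct()
      .saveAsTextFile("hdfs:///output/final")

    sc.stop()
  }
}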
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
    at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
    at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:13)
    at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:13)
    at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
    at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
As of September 1st, 2014, this is an "open improvement" in Spark; see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in that ticket, when an executor fails the shutdown hooks are likely executed in the wrong order, which is what produces this message. You will have to investigate a bit further to find the primary cause of the problem (i.e. why your executor failed in the first place). If it is a large shuffle, it might be an out-of-memory error that causes the executor to fail, which then causes the Hadoop filesystem to be closed in its shutdown hook. As a result, the RecordReaders in the still-running tasks of that executor throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release, and then you will get a more helpful error message :)
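Since the answer points to an out-of-memory executor failure during a large shuffle as the likely root cause, one hedged mitigation sketch (illustrative values, not something stated in the answer) is to give executors more memory and to shuffle into more, smaller partitions:

import org.apache.spark.{SparkConf, SparkContext}

object LargeShuffleJob {
  def main(args: Array[String]): Unit = {
    // Both keys are standard Spark configuration properties; the values are
    // purely illustrative and should be tuned to the cluster.
    val conf = new SparkConf()
      .setAppName("large-shuffle-job")
      .set("spark.executor.memory", "8g")         // more heap per executor
      .set("spark.default.parallelism", "2000")   // more, smaller shuffle partitions
    val sc = new SparkContext(conf)

    // distinct also accepts an explicit partition count, which keeps each
    // shuffle partition small enough to fit comfortably in memory.
    sc.textFile("hdfs:///input/*")
      .distinct(2000)
      .saveAsTextFile("hdfs:///output/deduped")

    sc.stop()
  }
}

If the executors still die silently, the executor logs on the worker nodes (rather than the driver log) are usually where the underlying OutOfMemoryError actually shows up.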