A Flink job submission failure: symptoms and root-cause investigation

2023-08-03

Exception encountered


2019-12-24 16:49:59,019 INFO org.apache.flink.yarn.YarnClusterClient - Starting client actor system.
2019-12-24 16:49:59,033 INFO org.apache.flink.yarn.YarnClusterClient - Trying to start actor system at 10-30-63-28_uf.cluster.ds.mosaic.com:0
2019-12-24 16:49:59,686 INFO org.apache.flink.yarn.YarnClusterClient - Actor system started at akka.tcp://flink@10-30-63-28_uf.cluster.ds.mosaic.com:33557
------------------------------------------------------------
The program finished with the following exception:
java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised
at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:129)
at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:154)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:816)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:290)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:216)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:956)
at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:124)
... 14 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:83)
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:951)
... 15 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at scala.concurrent.Await.result(package.scala)
at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:81)
... 16 more

Anomaly found:
The hostname does not resolve completely in the shell

[d1_mosaic_bigdata_pa@10-30-63-28_uf ol-mpr]$ hostname
10-30-63-28_uf.cluster.ds.mosaic.com
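
One thing worth checking here, as an assumption rather than a confirmed diagnosis: the first label of this hostname contains an underscore, which RFC 1123 does not allow in hostnames (only letters, digits and hyphens are permitted), and some libraries reject such names. A minimal Python sketch of that check follows; the function name `invalid_labels` is hypothetical and not part of Flink.

```python
import re
from typing import List

# One hostname label per RFC 1123: letters, digits and hyphens,
# 1 to 63 characters, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def invalid_labels(hostname: str) -> List[str]:
    """Return the dot-separated labels of `hostname` that break the rule above."""
    labels = hostname.rstrip(".").split(".")
    return [label for label in labels if not _LABEL.fullmatch(label)]

if __name__ == "__main__":
    for name in ("10-30-63-28_uf.cluster.ds.mosaic.com", "host10306328"):
        print(name, "->", invalid_labels(name) or "ok")
```

An empty result only means the name is syntactically acceptable; whether it resolves is a separate question.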

After changing the hostname


2019-12-24 17:13:10,044 INFO org.apache.flink.yarn.YarnClusterClient - Starting client actor system.
2019-12-24 17:13:10,060 INFO org.apache.flink.yarn.YarnClusterClient - Trying to start actor system at host10306328:0
2019-12-24 17:13:10,706 INFO org.apache.flink.yarn.YarnClusterClient - Actor system started at akka.tcp://flink@host10306328:41187
java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised
at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:129)
at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:154)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:816)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:290)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:216)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: org.apache.flink.util.FlinkException: Could not find out our own hostname by connecting to the leading JobManager. Please make sure that the Flink cluster has been started.
at org.apache.flink.client.program.ClusterClient$LazyActorSystemLoader.get(ClusterClient.java:276)
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:953)
at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:124)
... 14 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not find the connecting address by connecting to the current leader.
at org.apache.flink.runtime.util.LeaderRetrievalUtils.findConnectingAddress(LeaderRetrievalUtils.java:182)
at org.apache.flink.runtime.util.LeaderRetrievalUtils.findConnectingAddress(LeaderRetrievalUtils.java:163)
at org.apache.flink.client.program.ClusterClient$LazyActorSystemLoader.get(ClusterClient.java:272)
... 16 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the connecting address to the current leader with the akka URL akka.tcp://flink@emr-worker-190.cluster-40699:45716/user/jobmanager.
at org.apache.flink.runtime.net.ConnectionUtils$LeaderConnectingAddressListener.findConnectingAddress(ConnectionUtils.java:472)
at org.apache.flink.runtime.net.ConnectionUtils$LeaderConnectingAddressListener.findConnectingAddress(ConnectionUtils.java:361)
at org.apache.flink.runtime.util.LeaderRetrievalUtils.findConnectingAddress(LeaderRetrievalUtils.java:180)
... 18 more
Caused by: java.net.UnknownHostException: host10306328: host10306328: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
at org.apache.flink.runtime.net.ConnectionUtils.tryLocalHostBeforeReturning(ConnectionUtils.java:190)
at org.apache.flink.runtime.net.ConnectionUtils.findAddressUsingStrategy(ConnectionUtils.java:276)
at org.apache.flink.runtime.net.ConnectionUtils.access$100(ConnectionUtils.java:51)
at org.apache.flink.runtime.net.ConnectionUtils$LeaderConnectingAddressListener.findConnectingAddress(ConnectionUtils.java:413)
... 20 more
Caused by: java.net.UnknownHostException: host10306328: Name or service not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
... 24 more
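
The innermost `UnknownHostException` above is thrown from `InetAddress.getLocalHost()`: the machine cannot resolve its own new hostname. A rough equivalent of that lookup can be run outside Flink before resubmitting. The sketch below uses Python's standard `socket` module; the helper name `resolve` is made up for illustration, and the lookup is only an approximation of what the JVM does.

```python
import socket
from typing import Optional

def resolve(name: str) -> Optional[str]:
    """Return the IPv4 address `name` resolves to, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None

if __name__ == "__main__":
    local = socket.gethostname()  # the name the OS reports for this machine
    print(local, "->", resolve(local) or "does NOT resolve")
```

If the local name does not resolve, adding it to the hosts file or to DNS is the usual remedy.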

Adding a hosts entry

10.30.63.28 host10306328
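
A hosts entry maps one address to one or more names, and anything after `#` is ignored, so a commented-out line no longer takes part in resolution. The sketch below parses hosts-file text to show which names are actually mapped. The function `parse_hosts` and the `SAMPLE` text are illustrative only; the sample reuses the two entries quoted in this article.

```python
from typing import Dict

def parse_hosts(text: str) -> Dict[str, str]:
    """Map each hostname or alias found in hosts-file text to its address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2:
            address, names = fields[0], fields[1:]
            for name in names:
                mapping.setdefault(name, address)  # keep the first entry seen for a name
    return mapping

SAMPLE = """\
10.30.63.28 host10306328
# 10.30.63.28 10-30-63-28_uf.cluster.ds.mosaic.com
"""

if __name__ == "__main__":
    hosts = parse_hosts(SAMPLE)
    print(hosts.get("host10306328"))
    print(hosts.get("10-30-63-28_uf.cluster.ds.mosaic.com"))
```

On a real machine the text would come from `/etc/hosts`.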
2019-12-24 17:11:30,151 INFO org.apache.flink.yarn.YarnClusterClient - Starting client actor system.
2019-12-24 17:11:30,168 INFO org.apache.flink.yarn.YarnClusterClient - Trying to start actor system at 10-30-63-28_uf.cluster.ds.mosaic.com:0
2019-12-24 17:11:30,811 INFO org.apache.flink.yarn.YarnClusterClient - Actor system started at akka.tcp://flink@10-30-63-28_uf.cluster.ds.mosaic.com:13182
------------------------------------------------------------
The program finished with the following exception:
java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised
at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:129)
at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:154)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:816)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:290)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:216)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:956)
at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:124)
... 14 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:83)
at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:951)
... 15 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at scala.concurrent.Await.result(package.scala)
at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:81)
... 16 more

Commenting out the incorrect hosts entry
The submission succeeds

# 10.30.63.28 10-30-63-28_uf.cluster.ds.mosaic.com
+++ dirname yarn_offltrain.sh
++ cd .
++ pwd
+ BASE_PATH=/data0/d1_mosaic_bigdata_test/mosaic/mosaicx/ol-mpr/offline_train_mainpage_mosaic_formosaic
+ HADOOP_USER_NAME=feed_mosaic
+ FLINK_RUN_MODE=yarn-cluster
+ mosaic_JOB_NAME=offlinetrain-beta_mainpage_mosaic6_base_prepage_v1-auc1
+ FLINK_TASK_MANAGER_NUMBER=15
+ FLINK_TASK_MANAGER_SLOT=5
+ FLINK_TASK_MANAGER_MEMORY=20000
+ FLINK_JOB_MANAGER_MEMORY=20000
+ JAR=mosaic-runtime-2.0.0.jar
+ XML=mosaic_offlinetrain_weiflow.xml
+ NODE=offline_training
+ FEATURE_CONF=feature_prepage.conf
+ export FLINK_LOG_DIR=/tmp
+ FLINK_LOG_DIR=/tmp
+ export FLINK_LOG_DIR=/tmp
+ FLINK_LOG_DIR=/tmp
++ hadoop classpath
+ export 'HADOOP_CLASSPATH=/data0/rsync_data/mosaic/ccConfs/yarn-setting/EMR-118-conf:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/common/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/common/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/hdfs:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/hdfs/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/yarn/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/yarn/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/mapreduce/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/contrib/capacity-scheduler/*.jar'
+ HADOOP_CLASSPATH='/data0/rsync_data/mosaic/ccConfs/yarn-setting/EMR-118-conf:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/common/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/common/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/hdfs:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/hdfs/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/yarn/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/yarn/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/mapreduce/*:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/contrib/capacity-scheduler/*.jar'
+ /usr/lib/flink-current/bin/flink run -d -m yarn-cluster -yD env.java.opts=-Djava.util.Arrays.useLegacyMergeSort=true -yD web.timeout=1000000 -yD 'env.java.opts=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=128M' -yD metrics.reporter.monitor._FLINK_CLUSTER_NAME=offlinetrain-beta_mainpage_mosaic6_base_prepage_v1-auc1 -yD metrics.reporters=monitor -yD metrics.reporter.monitor.class=com.mosaic.datasys.mosaic.metrics.WeiboKafkaReporter -yD metrics.reporter.monitor.kafka.bootstrap.servers=10.85.184.204:9092,10.85.184.205:9092 -yD metrics.reporter.monitor.topicName=metrics-topic -yjm 20000 -yn 15 -ytm 20000 -ys 5 -ynm offlinetrain-beta_mainpage_mosaic6_base_prepage_v1-auc1 -c com.mosaic.datasys.mosaic.framework.common.parser.FlowBuilder mosaic-runtime-2.0.0.jar mosaic_offlinetrain_weiflow.xml offline_training
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data0/rsync_data/mosaic/ccConfs/yarn-setting/mosaic/flink-1.6.2-1.0.0-bin-weiclient/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data0/rsync_data/mosaic/ccConfs/yarn-setting/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-12-24 17:12:20,400 INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://emr-header-1.cluster-40699:8188/ws/v1/timeline/
2019-12-24 17:12:20,729 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.LegacyYarnClusterDescriptor to locate the jar
2019-12-24 17:12:20,729 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.LegacyYarnClusterDescriptor to locate the jar
2019-12-24 17:12:20,839 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
2019-12-24 17:12:21,011 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=20000, taskManagerMemoryMB=20000, numberTaskManagers=15, slotsPerTaskManager=5}
2019-12-24 17:12:21,376 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/data0/rsync_data/mosaic/ccConfs/yarn-setting/mosaic/flink-1.6.2-1.0.0-bin-weiclient/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
2019-12-24 17:13:00,539 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1547542611290_3184536
2019-12-24 17:13:00,589 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1547542611290_3184536
2019-12-24 17:13:00,590 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
2019-12-24 17:13:00,598 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
2019-12-24 17:13:07,826 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
2019-12-24 17:13:07,828 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - The Flink YARN client has been started in detached mode. In order to stop Flink on YARN, use the following command or a YARN web interface to stop it:
yarn application -kill application_1547542611290_3184536
Please also note that the temporary files of the YARN session in the home directory will not be removed.
Using the parallelism provided by the remote cluster (75). To use another parallelism, set it at the ./bin/flink client.
Starting execution of program
2019-12-24 17:13:07,863 INFO org.apache.flink.yarn.YarnClusterClient - Starting program in interactive mode (detached: true)
============================= Flink Job Name is :offlinetrain-mosaic6-mosaic-base-prepage-auc ============================= task info =============================
== task name: hiveInput1
2019-12-24 17:13:08,300 WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.local does not exist
2019-12-24 17:13:08,453 WARN org.apache.hadoop.security.UserGroupInformation - No groups available for user feed_mosaic
database:default, table:model_frequence_30_days, partitionFilter:dt=test, fieldSize:49
Schema:[is_act, is_click, is_video_play, ma_id, index_type, m_midtextveccos, fm_1082, fm_10169, fm_10170, fm_10171, fm_1033, fm_1086, fm_1089, fu_210, fu_207, fu_211, fu_2117, fu_2135, fu_215, fu_216, fuu_403, fuu_407, fuu_400, fuu_401, fuu_4019, fuu_402, fuu_405, fuu_409, m_1082, m_1011, m_10148, m_10169, m_10170, m_10171, m_1030, m_1032, m_1040, m_1041, m_1042, m_1063, m_1086, fu_uid, fm_uid, fm_mid, m_mid, m_uid, expo_time, pre_page, dt]
== task name: featureProcess
Read Feature File:
== task name: libsvmProcess
Read Feature File: feature_prepage.conf
== task name: trainProcess
============================= task info =============================
2019-12-24 17:13:10,044 INFO org.apache.flink.yarn.YarnClusterClient - Starting client actor system.
2019-12-24 17:13:10,060 INFO org.apache.flink.yarn.YarnClusterClient - Trying to start actor system at host10306328:0
2019-12-24 17:13:10,706 INFO org.apache.flink.yarn.YarnClusterClient - Actor system started at akka.tcp://flink@host10306328:41187
2019-12-24 17:13:11,050 INFO org.apache.flink.yarn.YarnClusterClient - Waiting until all TaskManagers have connected
Waiting until all TaskManagers have connected
2019-12-24 17:13:11,111 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (0/15)
TaskManager status (0/15)
2019-12-24 17:13:13,222 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (2/15)
TaskManager status (2/15)
2019-12-24 17:13:13,504 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (4/15)
TaskManager status (4/15)
2019-12-24 17:13:13,807 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (9/15)
TaskManager status (9/15)
2019-12-24 17:13:14,084 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (13/15)
TaskManager status (13/15)
2019-12-24 17:13:14,427 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (15/15)
TaskManager status (15/15)
2019-12-24 17:13:14,427 INFO org.apache.flink.yarn.YarnClusterClient - All TaskManagers are connected
All TaskManagers are connected
2019-12-24 17:13:14,448 INFO org.apache.flink.yarn.YarnClusterClient - Submitting Job with JobID: a05e47246348d02d5c4fe5e322c8544d. Returning after job submission.
Submitting Job with JobID: a05e47246348d02d5c4fe5e322c8544d. Returning after job submission.
Job has been submitted with JobID a05e47246348d02d5c4fe5e322c8544d
