[Big Data] Sqoop Installation and Usage

2023-05-22

1 Background

Project use case: importing data from an OLTP Oracle database into an OLAP Hive warehouse.

2 Sqoop Overview

Apache Sqoop(TM) is an open-source ETL tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as traditional relational databases. It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into Hadoop's HDFS, and export data from HDFS back into a relational database.
Project history: the Sqoop project started in 2009, originally as a third-party module for Hadoop. To let users deploy it quickly and developers iterate faster, it graduated from the Apache Incubator in March 2012 and became an independent Apache top-level project. In June 2021, however, Apache Sqoop was retired from the Apache community into the Attic.
Community status: Apache Sqoop was retired to the Apache Attic in June 2021.

Apache Sqoop moved into the Attic in June 2021. Apache Sqoop's mission was to create and maintain software related to bulk data transfer between Apache Hadoop and structured data stores.
The website, downloads and issue tracker all remain open, though the issue tracker is read-only. For more information about Sqoop, see the website: http://sqoop.apache.org .
As with any project in the Attic: if you choose to fork Sqoop outside of Apache, let the Attic know so they can link to your project.

Latest releases

The latest stable release of Sqoop1 is 1.4.7 (download, documentation).
The latest release of Sqoop2 is 1.99.7 (download, documentation). Note that 1.99.7 is not compatible with 1.4.7, is not feature-complete, and is not intended for production deployment.
Git

https://gitbox.apache.org/repos/asf?p=sqoop.git

https://github.com/apache/sqoop

This information was last updated at 2022/05/11 12:29 PM.

3 Installing Sqoop2 (1.99.7)

step0 Deployment environment

jdk 1.8
hadoop 3.3.2
hive 3.1.3
sqoop 1.99.7

step1 Download the Sqoop2 package

[sqoop 1.99.7] The latest release as of May 10, 2022:
http://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz

http://archive.apache.org/dist/sqoop/


step2 Extract the archive and create a symlink directory named sqoop

tar -xvf /mnt/share_data/Apache\ Sqoop/sqoop-1.99.7-bin-hadoop200.tar.gz -C /home/hadoop/

cd /home/hadoop
ln -s sqoop-1.99.7-bin-hadoop200 sqoop
ls -l /home/hadoop
ls -l /home/hadoop/sqoop/

step3 Configure environment variables: HADOOP_HOME / SQOOP_HOME

vim /etc/profile

# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# SQOOP_HOME
export SQOOP_HOME=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

source /etc/profile

[Supplement: locating the Hadoop directory]

[hadoop@node-d ~]$ whereis hadoop
hadoop: /usr/bin/hadoop /home/hadoop/hadoop-3.3.2/bin/hadoop /home/hadoop/hadoop-3.3.2/bin/hadoop.cmd
[hadoop@node-d ~]$ whereis sqoop
sqoop: /usr/bin/sqoop /home/hadoop/sqoop-1.99.7-bin-hadoop200/bin/sqoop.sh
# ll /usr/lib/hadoop

[Supplement: if HADOOP_HOME is not configured, you will see:]

# sqoop help
Error: /opt/software/sqoop-1.4.7/bin/../../hadoop-mapreduce does not exist!
Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.

step4 Install the MySQL JDBC driver

Place mysql-connector-java-5.1.*.jar into the $SQOOP_HOME/server/lib directory.

Download: MySQL Product Archives

ls -l $SQOOP_HOME/server/lib/
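
A minimal sketch of placing the driver, assuming the Connector/J archive was downloaded to /mnt/share_data (both the path and the 5.1.49 version are assumptions, not from this setup):

# Extract the Connector/J archive and copy the jar into Sqoop2's server lib directory.
tar -xvf /mnt/share_data/mysql-connector-java-5.1.49.tar.gz -C /tmp/
cp /tmp/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar $SQOOP_HOME/server/lib/
# Confirm the jar is in place.
ls -l $SQOOP_HOME/server/lib/ | grep -i mysql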

step5 Replace the Derby driver

mv $SQOOP_HOME/server/lib/derby-10.8.2.2.jar $SQOOP_HOME/server/lib/derby-10.8.2.2.jar.bak

cp $HIVE_HOME/lib/derby-10.14.1.0.jar $SQOOP_HOME/server/lib/

This step avoids a Derby error when running sqoop2-tool verify (most likely caused by the bundled Derby version being too old), such as:

Exception in thread "main" java.lang.ExceptionInInitializerError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.sqoop.utils.ClassUtils.loadClassWithClassLoader(ClassUtils.java:106)
at org.apache.sqoop.utils.ClassUtils.loadClass(ClassUtils.java:70)
at org.apache.sqoop.repository.JdbcRepositoryProvider.initializeRepositoryHandler(JdbcRepositoryProvider.java:122)
at org.apache.sqoop.repository.JdbcRepositoryProvider.initialize(JdbcRepositoryProvider.java:65)
at org.apache.sqoop.repository.RepositoryManager.initialize(RepositoryManager.java:124)
at org.apache.sqoop.tools.tool.UpgradeTool.runToolWithConfiguration(UpgradeTool.java:39)
at org.apache.sqoop.tools.ConfiguredTool.runTool(ConfiguredTool.java:35)
at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.jdbc.authentication is sealed

step6 Set the Hadoop configuration directory in $SQOOP_HOME/conf/sqoop.properties

# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/home/hadoop/hadoop/etc/hadoop/
# Default value: /etc/hadoop/conf/
# Do not write it as $HADOOP_HOME/etc/hadoop/ -- the variable is not expanded in this properties file, so the error below would still occur
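
A quick sanity check that the configured directory actually contains the Hadoop client configs (the path is the one set above):

# The directory must be readable and hold the *-site.xml client configuration files.
ls -l /home/hadoop/hadoop/etc/hadoop/ | grep -E "core-site|hdfs-site|mapred-site|yarn-site"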

This setting avoids the following error:

...
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
...

step7 Verify the installation

(sqoop 1.99.7)

Use the sqoop2-tool utility to verify that the installation succeeded:

# cd $SQOOP_HOME/bin

# sqoop2-tool verify
Verification was Successful.
Tool class ..... has finished correctly.

If you see "Verification has failed, please check Server logs for further details.", inspect the Sqoop log at the following path:

vi $SQOOP_HOME/bin/@LOGDIR@/sqoop.log

Supplement: Sqoop's default logging configuration ↓

# /home/hadoop/sqoop-1.99.7-bin-hadoop200/conf/sqoop.properties
org.apache.sqoop.log4j.debug=false
org.apache.sqoop.log4j.rootLogger=INFO, file
org.apache.sqoop.log4j.category.org.apache.sqoop=INFO
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%l] %m%n
# Audit logger for default configuration of FileAuditLogger
org.apache.sqoop.log4j.logger.audit=INFO, audit
org.apache.sqoop.log4j.appender.audit=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.audit.File=@LOGDIR@/audit.log
org.apache.sqoop.log4j.appender.audit.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.audit.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.audit.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.audit.layout.ConversionPattern=%d{ISO8601} %-5p [%l] %m%n

step8 Start the Sqoop2 server and client

Start the server (sqoop2-server):

sqoop2-server [ start | stop ]
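
After starting, one way to confirm the server is up, as a sketch: SqoopJettyServer is the process name I'd expect the 1.99.7 Jetty-based server to register in jps, and 12000 is its default port in sqoop.properties; treat both as assumptions to verify on your install.

jps | grep SqoopJettyServer     # the server process should be listed
ss -lntp | grep 12000           # the default Sqoop2 server port should be listening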

Log in with the client (sqoop2-shell)

Note: the shell client can be launched even when the server is not running.

# sqoop2-shell
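
A minimal smoke-test session might look like the following (localhost is an assumption; the commands are standard Sqoop2 shell commands):

sqoop:000> set server --host localhost --port 12000 --webapp sqoop
sqoop:000> show version --all     # prints client and server versions if the server is reachable
sqoop:000> show connector         # lists the connectors registered on the server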

At this point, the Sqoop2 installation is complete.

4 Installing Sqoop1 (1.4.7)

step0 Deployment environment

hadoop 3.3.2
(recommended: hadoop 2.6.x / 2.7.x)

step1 Download the Sqoop package

[sqoop 1.4.7]
http://sqoop.apache.org/
http://www.apache.org/dyn/closer.lua/sqoop/1.4.7
https://mirror.bit.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

step2 Extract the archive and create a symlink directory named sqoop

tar -xvf /mnt/share_data/Apache\ Sqoop/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /home/hadoop/

cd /home/hadoop
ln -s sqoop-1.4.7.bin__hadoop-2.6.0 sqoop
ls -l /home/hadoop
ls -l /home/hadoop/sqoop/

step3 Configure environment variables: HADOOP_HOME / SQOOP_HOME

vim /etc/profile

# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# SQOOP_HOME
export SQOOP_HOME=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

source /etc/profile

step4 Install the MySQL JDBC driver

step4.1 Place mysql-connector-java-5.1.*.jar into the $SQOOP_HOME/lib directory

ls -l $SQOOP_HOME/lib/

step4.2 Place the external jar that the MySQL driver depends on (commons-lang-2.6.jar) into the $SQOOP_HOME/lib directory

ls -la $SQOOP_HOME/lib | grep -i "commons-lang"
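
If the jar is missing, one way to fetch it, as a sketch (the Maven Central path below is assumed from the commons-lang 2.6 coordinates):

# Download commons-lang 2.6 and drop it into Sqoop1's lib directory.
wget https://repo1.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar -P $SQOOP_HOME/lib/
ls -la $SQOOP_HOME/lib | grep -i "commons-lang"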

This avoids the following error when later running Sqoop against a MySQL database:

...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
at org.apache.sqoop.manager.MySQLManager.initOptionDefaults(MySQLManager.java:73)
at org.apache.sqoop.manager.SqlManager.<init>(SqlManager.java:89)
at com.cloudera.sqoop.manager.SqlManager.<init>(SqlManager.java:33)
at org.apache.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:51)
...

step5 Configure the Hadoop and ecosystem component paths in sqoop-env.sh

cd $SQOOP_HOME/conf/
cp sqoop-env-template.sh sqoop-env.sh
vi $SQOOP_HOME/conf/sqoop-env.sh

# Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=$HADOOP_HOME
# or: HADOOP_COMMON_HOME=/home/hadoop/hadoop

# Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=$HADOOP_HOME
# or: HADOOP_MAPRED_HOME=/home/hadoop/hadoop

# Set the path to where bin/hbase is available
#export HBASE_HOME=

# Set the path to where bin/hive is available
export HIVE_HOME=$HIVE_HOME
# or: HIVE_HOME=/home/hadoop/hive/

# Set the path for where zookeeper config dir is
#export ZOOCFGDIR=
# or: ZOOCFGDIR=/home/hadoop/zookeeper/

step6 Verify the installation

(The following commands apply to Sqoop 1.4.7; they are not supported by Sqoop 1.99.7.)

sqoop version

sqoop help

# sqoop help
Warning: /opt/software/sqoop-1.4.7/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/software/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/software/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/tez-0.9.0/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/hadoop-2.7.3/lib/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/01/18 21:32:31 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]

Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information

See 'sqoop help COMMAND' for information on a specific command.

Connect to a MySQL database and list its databases and tables:

sqoop list-databases --connect jdbc:mysql://127.0.0.1:3306/ --username root --password 123456

sqoop list-tables --connect jdbc:mysql://127.0.0.1:3306/exam_system --username root --password 123456
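
Besides listing metadata, sqoop eval runs an arbitrary SQL statement against the source database and prints the result, which makes a handy end-to-end connectivity check. A minimal sketch reusing the connection details above:

# Run a trivial query through the JDBC connection to confirm driver and credentials work.
sqoop eval --connect jdbc:mysql://127.0.0.1:3306/exam_system --username root --password 123456 --query "select 1"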

5 Use Cases (based on Sqoop1)

5.1 Data migration: Oracle to Hive

[Template]

sqoop import \
--connect jdbc:oracle:thin:eas/eas@192.168.100.199:1521:eas \
--table TMP_NEEDCLEAN_PROVIDER2 \
--hive-import \
--hive-database xxx \
--hive-table RPT_PROVIDER_NEEDCLEAN2 \
--hive-overwrite \
--delete-target-dir \
--hive-drop-import-delims \
--null-string '' \
--null-non-string '' \
-m 1

[Example]

sqoop import \
--connect jdbc:oracle:thin:@10.0.3.3:1521:sid \
--username abc \
--password r123 \
--table CMX.SALES \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-database ra \
--hive-table "cmx_sales" \
--null-non-string '' \
--null-string '' \
--delete-target-dir \
--hive-drop-import-delims

5.2 Data migration: Hive to MySQL

sqoop export \
--connect jdbc:mysql://172.16.100.19:3306/test \
--username queryuser \
--password abcde12345 \
--table t_sitenumber_distance \
--export-dir /usr/local/warehouse/sitelight.db/t_sitenumber_distance \
--fields-terminated-by '\001'

Note: when creating the Hive table, make sure you fully understand or explicitly specify the field delimiter; if none is specified, Hive's default delimiter is \001. A quick way to confirm the delimiter is shown below.
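
A sketch for checking the delimiter before exporting (the table name is taken from the --export-dir above; look for field.delim in the output):

# Print the table DDL; the SERDEPROPERTIES show the configured field delimiter.
hive -e "show create table sitelight.t_sitenumber_distance;"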

5.3 Data migration: MySQL/PostgreSQL to Hive

Table mode (import a source table directly)

sqoop import \
--connect jdbc:mysql://172.16.100.19:3306/test_db \
--username queryuser \
--password abcde12345 \
--table t_sitenumber_distance \
--null-string '@@@' \
--null-non-string '###' \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--hive-database ods_db \
--hive-table t_sitenumber_distance
# (the options from --null-string onward are optional)

Note: the Hive target table is created automatically by default; there is no need to create it manually.

--null-string: for string-typed columns, replace NULL values with the specified string (here '@@@').
--null-non-string: for non-string-typed columns, replace NULL values with the specified string (here '###').
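
Since Hive's default text format represents NULL as \N, a common pairing when the target is a Hive table is the following (a sketch reusing the connection details above; the doubled backslash keeps \N intact through the shell):

sqoop import \
--connect jdbc:mysql://172.16.100.19:3306/test_db \
--username queryuser \
--password abcde12345 \
--table t_sitenumber_distance \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-import \
--hive-database ods_db \
--hive-table t_sitenumber_distance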

(If the Sqoop job fails at this point, the cause is usually insufficient memory in the YARN cluster rather than the command itself, and is nothing to worry about.)

Query mode (free-form SQL)

sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode,primecardcode,primecode,usercode,primemembertypeid,primeoffertypeid,primeofferactivityid,reusablecount,usedseqid,totalavailablecount,validstarttime,validendtime,saving,activatedchannelid,activatedtime,channelremark,version,lastupdatetime,updateuser,lastusedtime,lastusedcount,offerstatus,statusupdatetime from t_prime_offer_info_4 where lastupdatetime >='2021-06-17' and lastupdatetime < '2021-06-18' and \$CONDITIONS" \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_4 \
--hive-import \
--hive-database tmp \
--delete-target-dir \
--hive-table coupon_t_prime_offer_info_4 \
--m 1

# Note 1: -m/--num-mappers sets the number of map tasks
# Note 2: the --query and --table options cannot be used together
# Note 3: the WHERE clause of --query must contain "$CONDITIONS", otherwise the job fails with:
## ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [SELECT * FROM t_passenger ] must contain '$CONDITIONS' in WHERE clause.
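
$CONDITIONS matters most with multiple mappers: Sqoop substitutes it with a per-mapper range predicate over the --split-by column. A hedged sketch of a parallel free-form import (the connection details reuse the MySQL example from section 4; the target dir is hypothetical):

sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/exam_system \
--username root \
--password 123456 \
--query "select id, name from tb_examinee where \$CONDITIONS" \
--split-by id \
--target-dir /tmp/tb_examinee_parallel \
--m 4
# Each mapper runs the query with $CONDITIONS replaced by a range predicate such as: id >= lo AND id < hi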

Query mode (with limit)

sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode,primecardcode from t_prime_offer_info_0 where \$CONDITIONS limit 10" \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_0_qhh \
--hive-import \
--create-hive-table \
--hive-database tmp \
--hive-table coupon_t_prime_offer_info_0_qhh \
--delete-target-dir \
--m 1

Worked example (query mode) [recommended reading]

step1 Decide which MySQL data to import into Hive

select
id,name,birthdate,idcard,sex,height,school_name,now() as etl_time
from exam_system.tb_examinee
-- where \$CONDITIONS

step2 Prepare a temporary HDFS directory (Sqoop needs --target-dir to stage data while the job runs)

hdfs dfs -mkdir /test/test_ods/
hdfs dfs -ls /test/test_ods/
hdfs dfs -ls /test/

step3 Run the import

sqoop import  \
--connect jdbc:mysql://172.16.20.102:3306/exam_system \
--username root \
--password 123456 \
--query "select id,name,birthdate,idcard,sex,height,school_name,now() as etl_time from exam_system.tb_examinee where \$CONDITIONS" \
--split-by 'id' \
--target-dir /test/test_ods/ \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--hive-database default \
--hive-table tb_examinee

step4 Check the new table and data in Hive

# hive
hive> select * from tb_examinee;

step5 In the YARN Cluster Web UI, check the MapReduce job that the Sqoop task submitted while it ran

X References

Home Page - Apache Sqoop [highly recommended]
Sqoop User Guide (v1.4.7) - Apache Sqoop [highly recommended (Sqoop1)]
Apache Sqoop documentation (v1.99.7) - Apache Sqoop [highly recommended (Sqoop2)]
Sqoop installation and usage - cnblogs
Sqoop installation tutorial - Jianshu
How to download mysql-connector-java-5.*-bin.jar - cnblogs
Fixing "Error: /root/training/sqoop-1.4.5.bin__hadoop-0.23/../hadoop-mapreduce does not exist!" during Sqoop installation - CSDN
Oracle-to-Hive data type conversion - CSDN
Sqoop 1.99.7 installation and configuration (detailed, illustrated) - CSDN [highly recommended (Sqoop2)]
[sqoop] Sqoop2 verification fails due to an outdated Derby jar - CSDN
Sqoop2 installation - CSDN
Sqoop installation and basic operations - Jianshu [highly recommended (Sqoop1)]
sqoop --split-by explained - Jianshu
