Apache Sqoop
Sqoop is a command-line tool for efficiently transferring bulk data, in both directions, between Hadoop and structured datastores such as relational databases. The name combines "SQL" and "Hadoop". Sqoop became a top-level Apache project in March 2012.
Sqoop helps offload certain tasks (such as ETL processing) from the enterprise data warehouse (EDW) to Hadoop, where they can run efficiently at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Oracle, Microsoft SQL Server, Teradata, MySQL, PostgreSQL, and HSQLDB.
It supports incremental loads of a single table or of a free-form SQL query, as well as saved jobs that can be run repeatedly to import the changes made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.
Behaim has several years of experience with Apache Sqoop, including:
- Load script preparation and execution (including incremental loads)
- Export script preparation and execution
- Data storage optimization (Parquet file format, etc.)