This article is by Clive Bearman and originally appeared on the Qlik Blog here: https://blog.qlik.com/to-sqoop-or-not-to-sqoop-that-is-the-question
Hadoop-based data lakes are one of the hottest trends in big data today because they promise to provide dynamic insights into your enterprise data for improved analytics. However, that promise looks progressively unobtainable when you consider that keeping the data fresh and relevant often relies upon manual scripting and using a wildly disparate collection of open-source tools. This article describes the many problems associated with using one just of those tools (Apache Sqoop) for data ingestion, and offers a practical approach for easily augmenting data lakes with information from enterprise systems.
Data Lakes and Swamp Creatures
Data lakes built with technologies such as the Azure Data Lake have emerged as the leading design principle on which data architects, analysts and scientists address modern use cases such as fraud detection, real-time customer marketing and trend analysis.
It’s easy to see why. Data lakes offer several advantages for business analytics when compared to traditional solutions like data warehousing. Firstly, data lakes are modular to allow different processing engines like MapReduce or Spark to efficiently process data in either batch or real-time. Secondly, Hadoop’s architecture is built to scale both vertically and horizontally. You can easily scale by adding more computing power to existing servers, or by adding more servers to a pool of resources. Thirdly, cloud-based Hadoop has forever altered the economics and possibilities of business analytics. You can instantiate systems, scale up and down as business requirements dictate, and only pay for the resources you consumed. There’s so much flexibility.
However, enterprises must find a way to capture, load, store and manage the information in the data lake to realize the full potential and promise of the technology. We’ve all heard the horror stories where the data lake deteriorates and provides little value to its intended users especially as the lifespan of the data shortens. You don’t want your pristine data lake becoming a data swamp.
Everyone Loves a Free Puppy. The Appeal of Apache Sqoop
The Apache ecosystem recognized the need for fresh data and released a project called Apache Scoop. Apache Sqoop successfully graduated from the Incubator in March of 2012 and is now a top-level Apache project. Apache Sqoop provides a simple and economical way for organizations to transfer bulk data from relational databases into Hadoop. It’s an ideal solution for companies who are just beginning to explore data lake initiatives and there’s so much to like. Apache Sqoop is free, can perform full and incremental data loads, supports multiple database formats, can be integrated Apache Oozie for scheduling and can load data directly into Apache Hive. So, what’s not to love?
One Sqoop or Two? The Challenges of a Free Solution
Apache Sqoop has no graphical user interface which you might consider as a minor annoyance depending on your opinion. That was remedied in Apache Sqoop 2 which introduced a web application, a REST API and security some changes. However, Sqoop 1 and Sqoop 2 are incompatible and Sqoop 2 is not yet recommended for production environments. Therefore, whatever Sqoop you decide to use the interaction is largely going to be via the command line. A small price to pay for high speed data loading. But, you knew there was a but coming, didn’t you? There comes a time in everyone’s Hadoop initiative where Apache Sqoop encounters practical limitations as deployments expand. Performance bottlenecks begin to impact the freshness of the data and Apache Sqoop slows.
To increase throughput Apache Sqoop uses MapReduce jobs called “mappers”. Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion. However, it will also increase the load on the database as Sqoop will execute more concurrent queries. Increasing the number of mappers won’t always lead to faster job completion. While increasing the number of mappers, there is a point at which you will fully saturate your database. Increasing the number of mappers beyond this point won’t lead to faster job completion; in fact, it will have the opposite effect as your database server spends more time context switching than serving data. It’s worth noting that when this situation arises then other queries running on your server might be impacted, adversely affecting your production environment.
One Sqoop or Two? The Challenges of a Free Solution
The next question is “where can you to turn for configuration and tuning advice?” If you’re using a Hadoop distro from one of the leading vendors like Cloudera, Hortonworks or MapR AND your subscription is current then you might be lucky that they’ll provide support. But, if you’re using the free Apache.org project then you’re on your own and you’re likely going to spend hours of research Googling the web for answers. You might think I’m alarmist, but a recent TDWI paper reported that close to one third of respondents were concerned about the lack of Hadoop skills and data integration tools (source: Data Lakes: Purposes, Practices, Patterns and Platforms. TDWI, 2017).
Finally, the last major drawback for Apache Sqoop is “what about loading data from other enterprise sources?”. For example, maybe you want to load data that originates from a mainframe application, or SAP system or telemetry data from IoT devices. Is that possible with Sqoop? Not easily. An eventual solution would require work arounds, like lots of file dumps or data loading into and out of intermediate databases. Ugly, slow and error prone for sure. That’s hardly a winning recipe for your modern analytics environment.
Attunity Replicate – We’ve Done the Hard Work So You Don’t Have to.
Attunity Replicate – We’ve Done the Hard Work So You Don’t Have to. I’m not going to over emphasize the benefits of Attunity Replicate because there’s a whole website dedicated to doing that, but suffice to say you can realize more value from your data lake without need for all that manual scripting and tuning. The combined solution puts enterprise data right at the heart of your modern analytics environment and you’ll be analyzing data in no time.
One Sqoop or Two? The Challenges of a Free Solution
With Attunity Replicate, you can easily ingest, replicate and synchronize data across your data lake and all major databases and data warehouses, whether on premises or in the cloud.
If I’ve not yet convinced you that you’ll outgrow Sqoop, then let me also highlight a few technical “gotcha’s” to consider when trying to scale a “production ready” Apache Sqoop. Check out the table below:
|Apache Sqoop||Attunity Replicate|
|Architectural Style||Generic Query-Based
Requires triggers or tables with timestamps. Apaches Sqoop only captures inserts/updates. It cannot capture deletes.
|Native Transaction Log Based
Uses native transaction log to identify changes and takes full advantage of the services andsecurity provided by the database.
|Application and DB Modification Required||Highly Intrusive
Requires timestamps. Either you retrofit applications and databases if none exist OR you need consistent representation across your data sources. This introduces significant development time and risk into your deployment.
Does not require software to be installed on source database, Hadoop node or cluster.
|Overhead on Operational Systems||Resource Intensive
Requires significant I/O and CPU utilization as queries against are continuously run against source tables.
Identifies transactional changes from native transaction log with minimal overhead.
|DDL / Schema Changes||No CDC for DDL/Schema Change
Although Apaches Sqoop can load metadata into Apache Hive it does not capture any DDL changes. There is significant risk that the application may break when schema changes occur, requiring more development effort, resources and time.
Automatically detects source schema changes and automatically implements changes to the target.
Waits for specified query intervals; time to execute queries; delays cause data sync issues. Sqoop cannot be paused and resumed. It is an atomic step. If failed you need to clear things up and start again.
|Low/No Latency. Real-Time
Immediate data delivery with no intermediary storage requirements.
As previously mentioned Sqoop can be slow to load data and is resource hungry because it uses MapReduce under the hood. Incremental pull is also difficult because different tables require incremental pull queries to be written.
Multi-server and multi-threaded architecture supports high-volume, rapidly changing environments. Multiple data centers, servers and tasks can be managed and optimized centrally with Attunity Enterprise Manager.
Expanding Your Data Lake – To Apache Sqoop and Beyond!
Ingesting data into a Hadoop-based data lake is just the beginning and IT often struggles to create usable analytics data stores. Traditional methods require Hadoop savvy ETL programmers to manually code various steps – including data transformation, Hive SQL structure creation, and data reconciliation, and the administrative burden can often delay or even kill analytics projects.
Attunity Compose for Hive automates the creation and loading Hadoop Hive structures, as well as the transformation of enterprise data. Our solution fully automates the BI data-ready pipeline into Hive, enabling you to automatically create both Operational Data Stores (ODS) and Historical Data Stores (HDS). Additionally, we leverage the latest Hadoop innovations such as the new ACID Merge SQL capabilities, to automatically and efficiently process data insertions, updates and deletions.
Attunity Replicate integrates with Attunity Compose for Hive to simplify and accelerate data ingestion, data landing, SQL schema creation, data transformation and ODS and HDS creation/updates. And, you definitely can’t do that with Sqoop!
Data lakes have emerged as a primary platform on which to cost-effectively store and process a wide variety of data types, however keeping the data fresh and relevant can be extremely complex, especially if a data loading strategy has not been considered. Apache Sqoop is a great place to start your data ingestion experiments you’ll soon realize that it’s not really an “enterprise-grade” solution, especially as your deployment expands to more data sources, nodes and geographic locations. The best advice is to augment your Apache Sqoop usage with a proven solution like Attunity Replicate to accelerate and automate your data loading pipeline.
Register to experience a free trial version of award-winning Attunity Replicate, a unified platform that helps you load and ingest data across all major databases, data warehouses and Hadoop, on-premises or in the cloud.