Behaim’s multi-year experience in Hadoop covers the full breadth of Hadoop projects, including architectural and technical design, installation, configuration, security, integration, and maintenance & support tasks. The team has successfully delivered multiple Big Data implementations at customers using a variety of Hadoop projects and supplementary vendors.
For a description of the projects and Behaim’s know-how, click on the links below:
Apache Hadoop is an open-source software framework used for distributed storage and processing of very large data sets. It consists of computer clusters built from commodity hardware. It supports various security standards and functionalities, such as SSL, Kerberos authentication, encryption at rest, and role-based authorization, to ensure enterprise data are stored securely and accessed only by permitted users.
The base framework consists of the following modules:
– Hadoop Common – contains libraries and utilities needed by other Hadoop modules
– Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing high aggregate bandwidth across the cluster
– Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications
– Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.
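The MapReduce model above can be illustrated with a minimal, single-machine sketch: a map phase emits key–value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count example and helper names below are illustrative only, not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's map function to every input record
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key, as the framework
    # does between the map and reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user's reduce function once per key group
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word count expressed in the MapReduce model
def word_map(line):
    for word in line.split():
        yield (word, 1)

def word_reduce(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, word_map)), word_reduce)
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster, with HDFS holding the input and output.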
HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
HBase features compression, in-memory operation, and Bloom filters on a per-column basis. HBase is a column-oriented key-value data store and has been widely adopted because of its lineage with Hadoop and HDFS. It runs on top of HDFS and is well-suited for fast read and write operations on large datasets with high throughput and low input/output latency.
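The sparse-data point can be made concrete with a toy sketch (not the HBase API): a column-oriented store only spends space on cells that actually hold data, so a table with billions of mostly empty rows stays small. The class and column names below are hypothetical.

```python
class SparseTable:
    """Toy column-oriented key-value store in the spirit of HBase:
    absent cells cost nothing, so sparse data stays compact."""

    def __init__(self):
        # cells[(row_key, column)] -> value; only populated cells exist
        self.cells = {}

    def put(self, row_key, column, value):
        self.cells[(row_key, column)] = value

    def get(self, row_key, column, default=None):
        return self.cells.get((row_key, column), default)

    def scan_column(self, column):
        # Yield (row_key, value) for every row that has this column set
        for (row, col), value in self.cells.items():
            if col == column:
                yield row, value

table = SparseTable()
table.put("user42", "info:name", "Ada")
table.put("user42", "info:email", "ada@example.com")
table.put("user99", "info:name", "Grace")
```

Missing cells (such as `info:email` for `user99`) are simply not stored, which is the property that lets HBase hold very sparse datasets efficiently.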
Sqoop is a command-line tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases in both directions. Sqoop got the name combining SQL and Hadoop. Sqoop became a top-level Apache project in March 2012.
Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Oracle, Microsoft SQL Server, Teradata, MySQL, Postgres, and HSQLDB.
It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.
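The incremental-load idea is simple to sketch: track the last value seen in a check column and fetch only newer rows on the next run, which is what Sqoop's append-mode incremental import does. The sketch below uses an in-memory SQLite database standing in for the source RDBMS; the table and column names are hypothetical.

```python
import sqlite3

def incremental_import(conn, table, check_column, last_value):
    """Fetch only rows added since the previous import, mimicking an
    incremental append on a monotonically increasing check column."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ?", (last_value,)
    )
    rows = cur.fetchall()
    # Assumes the check column is the first column in the result rows
    new_last = max((r[0] for r in rows), default=last_value)
    return rows, new_last

# Demo source database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "disk"), (2, "cpu"), (3, "ram")])

rows, last = incremental_import(conn, "orders", "id", 0)       # first full load
new_rows, last = incremental_import(conn, "orders", "id", last)  # nothing new yet
```

A Sqoop saved job persists the last value between runs, so repeated executions pick up only the rows added since the previous import.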
Behaim has several years of experience in Apache Sqoop:
- Load script preparation and execution (including incremental loads)
- Export script preparation and execution
- Data storage optimization (parquet file format etc.)
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for access by online analytic applications.
Flume lets Hadoop users ingest high-volume streaming data into HDFS:
– Ingest streaming data from multiple sources into Hadoop for storage and analysis – typical examples of such data are application logs, sensor and machine data, geo-location data etc.
– Buffer storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination.
– Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed delivery semantics.
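The channel-transaction guarantee described above can be sketched in miniature (this is not the Flume API): events taken from a channel are only removed for good when the transaction commits, and a failed delivery rolls them back onto the channel for redelivery.

```python
import queue

class Channel:
    """Toy Flume-style channel: events taken inside a transaction are
    removed only on commit; a rollback puts them back, so a failed
    hand-off between agents cannot lose data."""

    def __init__(self):
        self._events = queue.Queue()

    def put(self, event):
        self._events.put(event)

    def take(self):
        return _Transaction(self)

class _Transaction:
    def __init__(self, channel):
        self.channel = channel
        self.taken = []

    def __enter__(self):
        return self

    def next_event(self):
        event = self.channel._events.get_nowait()
        self.taken.append(event)
        return event

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Delivery failed: roll the taken events back onto the channel
            for event in self.taken:
                self.channel._events.put(event)
            return True  # suppress the error for this demo
        # Success: transaction commits, events stay removed

ch = Channel()
ch.put("log line 1")

with ch.take() as tx:           # delivery fails, event is rolled back
    tx.next_event()
    raise RuntimeError("sink unavailable")

with ch.take() as tx:           # retry succeeds and commits
    redelivered = tx.next_event()
```

The event survives the failed first delivery and is handed over on the retry, which is the at-least-once delivery semantics Flume's paired transactions provide.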
Behaim’s 2 years of experience include Flume installation, setup, configuration, and production deployment, as well as the implementation of Flume components (sources, channels, sinks, agents, etc.) and integration with other applications.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Without Hive, SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) with the underlying Java API, without the need to implement queries in the low-level API directly.
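The value of that abstraction is easiest to see side by side. Below, SQLite stands in for Hive purely to show the declarative, HiveQL-like style against the hand-rolled aggregation a MapReduce job would otherwise implement; the table and data are hypothetical.

```python
import sqlite3

# Hypothetical page-view data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("alice", "/home"), ("alice", "/docs"), ("bob", "/home"),
])

# Declarative: one SQL statement (HiveQL reads much like this)
declarative = dict(conn.execute(
    "SELECT user, COUNT(*) FROM page_views GROUP BY user"
))

# Hand-rolled equivalent of what a custom MapReduce job would compute
manual = {}
for user, _url in conn.execute("SELECT user, url FROM page_views"):
    manual[user] = manual.get(user, 0) + 1

assert declarative == manual  # same answer, far less code with SQL
```

Hive compiles such queries into execution plans over the cluster, so analysts write the first form and never touch the second.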
Since most data warehousing applications work with SQL-based querying languages, Hive supports easy portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).
Behaim’s multi-year experience includes: Installation, setup, configuration, shell scripts, data access using JDBC from various clients, BI tools (Spotfire, Tableau), etc.
Apache Spark is an open-source cluster-computing framework. It provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.
It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark’s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to Apache Hadoop, a popular MapReduce implementation) may be reduced by several orders of magnitude. Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.
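The RDD idea described above can be sketched in a few lines (this is not the Spark API): transformations are lazy, the dataset is read-only, and caching keeps a working set in memory so iterative passes avoid recomputation. The class and method names mirror Spark's but the implementation is a toy.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: read-only, lazily transformed,
    and cacheable so iterative algorithms reuse the working set."""

    def __init__(self, compute):
        self._compute = compute   # deferred function producing the data
        self._cache = None

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):
        # Returns a new RDD; nothing is computed yet (lazy evaluation)
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Materialize once and keep in memory, like Spark's persist
        self._cache = self._compute()
        return self

    def collect(self):
        return self._cache if self._cache is not None else self._compute()

# An iterative loop reusing the same cached working set each pass
points = MiniRDD.parallelize(range(10)).filter(lambda x: x % 2 == 0).cache()
total = 0
for _ in range(3):
    total += sum(points.collect())
```

In real Spark the partitions of the cached RDD live distributed across executors' memory, which is what makes repeated passes orders of magnitude faster than rereading from disk as MapReduce would.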
The Behaim team has 2 years of experience with Apache Spark:
- Installation, setup, configuration, and production deployment
- Application implementation (Java, Scala, and others)
- MLlib usage (including Java, R scripts, etc.)
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems.
It runs in Java and has bindings for both Java and C.
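ZooKeeper's file-system-like data model can be sketched as a tree of znodes addressed by slash-separated paths, each holding a small blob of data. The class below is a toy illustration, not the ZooKeeper client API, and the paths and values are hypothetical.

```python
class ZNodeTree:
    """Toy version of ZooKeeper's data model: a tree of znodes
    addressed by paths, each holding a small payload."""

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        # Like ZooKeeper, a znode's parent must already exist
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

    def children(self, path):
        # Direct children only, mirroring ZooKeeper's getChildren
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

tree = ZNodeTree()
tree.create("/config", b"")
tree.create("/config/db_url", b"jdbc:mysql://db:3306/app")
tree.create("/config/timeout", b"30")
```

Distributed applications build configuration maintenance, naming, and group membership on exactly this kind of small, hierarchical namespace, with ZooKeeper adding replication, ordering guarantees, and watches on top.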
Behaim’s know-how covers ZooKeeper’s setup, configuration, production deployment, and the implementation of client applications which make use of ZooKeeper’s API.
Cloudera Impala is a query engine that runs on Apache Hadoop. The project was announced in October 2012 with a public beta test distribution and became generally available in May 2013. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive and other Hadoop software.
Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.
Behaim has several years of experience with Impala’s installation, setup, configuration, shell scripts, and data access using various JDBC clients, such as BI tools (Spotfire, Tableau), etc.
H2O is open-source software for big-data analysis. It is produced by the start-up H2O.ai (formerly 0xdata), which launched in 2011 in Silicon Valley. Currently H2O is a leading open-source machine learning platform, used by over 70,000 data scientists and more than 8,000 organizations around the world. H2O allows users to fit hundreds or thousands of candidate models as part of discovering usable patterns in their data.
The H2O software can be called from the statistical package R and other environments. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache HDFS (Hadoop Distributed File System), as well as on the conventional operating systems Linux, macOS, and Microsoft Windows. The H2O software is written in Java, Python, and R.
Behaim’s delivered project experience includes: Cluster installation, setup, configuration, deployment to production, H2O models creation, and usage from R scripts.