At QCon Beijing 2016, Druid open-source project lead and Imply co-founder Fangjin Yang gave a keynote titled "The Evolution of Open Source Data Infrastructure," exploring the development and direction of the open-source big data world. As an extension of that talk, we use Google Trends to search for, compare, and rank a number of popular open-source tools.
Because big data is developing astonishingly fast and our knowledge has limits, this article does not (and cannot) enumerate every branch of the open-source ecosystem; machine learning and data mining, for example, are not covered, nor is every corner of the data world, such as related hot topics like OpenStack and Docker. What follows is a selection of the more popular open-source products, in the hope of sparking interest and attention. Since most of them are already familiar, each product receives only a brief description, drawn largely from its official website and from introductions on various technical sites.
Scheduling and management services
- Azkaban: a Java-based batch workflow and task scheduler for Hadoop, open-sourced by LinkedIn, which uses it to manage its batch workflows. Azkaban orders jobs according to their dependencies and provides a friendly web UI for maintaining and tracking workflows.
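The dependency ordering a scheduler like Azkaban performs can be sketched as a topological sort. Below is a minimal plain-Python version (the `schedule` helper and task names are illustrative, not Azkaban's API):

```python
from collections import deque

def schedule(deps):
    """Order tasks so every dependency runs before its dependents (Kahn's algorithm)."""
    # deps: task -> list of tasks it depends on
    indegree = {t: 0 for t in deps}
    children = {t: [] for t in deps}
    for task, parents in deps.items():
        for p in parents:
            indegree[task] += 1
            children[p].append(task)
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(deps):
        raise ValueError("cycle in workflow")
    return order

# A hypothetical four-step batch workflow: extract -> clean -> aggregate -> report.
workflow = {
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["clean"],
    "report": ["clean", "aggregate"],
}
print(schedule(workflow))
```

A real scheduler adds retries, triggers, and a UI on top, but the ordering core is this small.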
- YARN: Hadoop's new resource manager, a general-purpose resource management system that provides unified resource management and scheduling for the applications running on top of it. Its basic idea is to split the functions of resource management and job scheduling/monitoring into separate daemons.
- Mesos: an open-source cluster manager developed at AMPLab, University of California, Berkeley, used to run frameworks such as Hadoop, ElasticSearch, Spark, Storm, and Kafka. It abstracts CPU, memory, storage, and other compute resources away from physical or virtual machines, making the data center look like a single resource pool, so fault-tolerant, elastic distributed systems are easy to build and run effectively.
- Ambari: part of the Hadoop ecosystem; provides an intuitive web-based interface for configuring, managing, and monitoring Hadoop clusters. Most Hadoop components are supported, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog.
- ZooKeeper: a coordination service for distributed applications and an important component of Hadoop and HBase. It provides consistency services that let the nodes of a Hadoop cluster coordinate with one another. ZooKeeper is now a top-level Apache project, offering efficient, reliable, and easy-to-use coordination services for distributed systems.
- Thrift: created at Facebook to handle high-volume data transfer between its many subsystems and cross-language communication between services written in different languages; Facebook submitted Thrift to the Apache Foundation as an open-source project in 2007.
- Chukwa: an open-source data collection system for monitoring large distributed systems. Built on HDFS and the MapReduce framework, it inherits Hadoop's scalability and reliability and can collect monitoring data from large-scale distributed systems. It also includes a flexible, powerful toolkit for displaying, monitoring, and analyzing results.
File systems
- Lustre: a large-scale, secure, and reliable cluster file system developed and maintained by Sun. The project's main goal is to build a next-generation cluster file system; it currently supports more than 10,000 nodes and petabytes of data storage.
- HDFS: the Hadoop Distributed File System. HDFS is a highly fault-tolerant system designed to be deployed on inexpensive machines; it provides high-throughput data access and is well suited to applications with very large data sets.
- GlusterFS: a cluster file system that supports petabyte-scale data. GlusterFS aggregates storage distributed across different servers, over RDMA and TCP/IP, into one large parallel network file system.
- Alluxio: formerly known as Tachyon, a memory-centric distributed file system with high performance and fault tolerance that provides cluster frameworks (such as Spark and MapReduce) with reliable, memory-speed file sharing.
- Ceph: a new generation of open-source distributed file system. Its main goals are a POSIX-based design with no single point of failure, improved data fault tolerance, and seamless replication.
- PVFS: a high-performance open-source parallel file system, used mainly in parallel computing environments. PVFS is designed for large numbers of clients and servers, and its modular design makes it easy to add support for new hardware and algorithms.
- QFS: the Quantcast File System, a high-performance, fault-tolerant distributed file system developed for MapReduce processing and other applications that need to read and write large files.
Log collection
- Logstash: a platform for transporting, processing, managing, and searching application logs and events. It can be used to centralize log collection and management, and provides a web interface for queries and statistics.
- Scribe: Facebook's open-source log collection system, which gathers logs from various sources and stores them in a central storage system (NFS or a distributed file system) for centralized statistical analysis.
- Flume: Cloudera's highly available, highly reliable distributed system for massive-scale log collection, aggregation, and transport. Flume supports custom data senders in the logging system for collecting data, and can perform simple processing on the data before writing it to a variety of (customizable) data receivers.
Data processing
- Spark: a fast, general-purpose big data processing engine. It has the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job output can be kept in memory, eliminating the need to read and write HDFS, so Spark is better suited to iterative algorithms such as data mining and machine learning. It can be used with Hadoop and Apache Mesos, or run independently.
- Kinesis: lets you build custom applications that process or analyze streaming data for specialized needs. Amazon Kinesis Streams can capture and store terabytes of data per hour from hundreds of thousands of sources, such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
- Hadoop: an open-source framework that runs on commodity hardware and supports distributed processing of large data sets across clusters using a simple programming model, scaling from a single server up to thousands of machines. Apache's Hadoop project has become almost synonymous with big data: it has grown into a complete ecosystem with many open-source tools for highly scalable distributed computing. Efficient, reliable, and scalable, it provides YARN, HDFS, and the infrastructure your data storage projects need, and runs the major big data services and applications.
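The programming model Hadoop popularized can be sketched in a few lines of plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase combines each group. This is a toy word count illustrating the model, not Hadoop's actual API:

```python
from itertools import groupby

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: combine every value emitted for one key.
    return (word, sum(counts))

def mapreduce(lines):
    pairs = [kv for line in lines for kv in map_phase(line)]
    # Shuffle: sort and group the pairs by key, as the framework does
    # between the map and reduce phases.
    pairs.sort(key=lambda kv: kv[0])
    return dict(reduce_phase(key, (v for _, v in group))
                for key, group in groupby(pairs, key=lambda kv: kv[0]))

print(mapreduce(["big data big clusters", "data pipelines"]))
```

The framework's real value is running the map and reduce functions in parallel across a cluster, with the shuffle handled over the network.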
- Spark Streaming: implements micro-batching, aiming to make it easy to build scalable, fault-tolerant streaming applications; it supports Java, Scala, and Python and integrates seamlessly with Spark. Spark Streaming can read data from HDFS, Flume, Kafka, Twitter, and ZeroMQ, as well as from custom sources.
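Micro-batching itself is a simple idea: instead of handling one event at a time, the stream is cut into small batches that are processed like tiny jobs. A minimal plain-Python sketch (batching by count here, whereas Spark Streaming batches by time interval):

```python
def micro_batches(stream, batch_size):
    """Split an unbounded stream of events into small batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:           # flush whatever is left at the end of the stream
        yield batch

# Process events per batch instead of one at a time.
events = ["click", "view", "click", "view", "buy"]
for i, batch in enumerate(micro_batches(events, 2)):
    print(i, len(batch), batch)
```

Each batch can then be handed to the same engine that runs ordinary batch jobs, which is why the approach integrates so cleanly with Spark.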
- Trident: a higher-level abstraction on top of Storm. Besides providing an easy-to-use set of stream-processing APIs, it processes data in batches of grouped tuples rather than tuple by tuple, which makes some operations simpler and more efficient.
- Flink: became a top-level Apache open-source project this year and is fully compatible with HDFS. Flink provides Java- and Scala-based APIs and is an efficient, distributed, general-purpose big data analysis engine. More importantly, Flink supports incremental iterative computation, allowing it to process data-intensive, iterative tasks quickly.
- Samza: a distributed stream computing framework from LinkedIn, built on Kafka, and a top-level Apache open-source project. It uses Kafka directly and relies on Hadoop's YARN for fault tolerance, process isolation, security, and resource management.
- Storm: Twitter's open-source real-time data processing framework, roughly what Hadoop is for batch. Its programming model is simple, which significantly lowers the difficulty of real-time processing, and it is one of the most popular stream computing frameworks today. Compared with other computing frameworks, Storm's biggest advantage is millisecond-level latency.
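Storm's model of spouts (sources) and bolts (processing steps) wired into a topology can be illustrated with a toy in-process word counter; the class and function names below are made up for illustration and are not Storm's API:

```python
from collections import Counter

class SplitBolt:
    """Splits each incoming sentence tuple into word tuples."""
    def process(self, sentence):
        for word in sentence.split():
            yield word

class CountBolt:
    """Keeps a running count per word, updated one tuple at a time."""
    def __init__(self):
        self.counts = Counter()
    def process(self, word):
        self.counts[word] += 1

def run_topology(spout, split_bolt, count_bolt):
    # The spout emits source tuples; each bolt's output feeds the next
    # bolt as tuples arrive, rather than waiting for a batch to fill up.
    for sentence in spout:
        for word in split_bolt.process(sentence):
            count_bolt.process(word)
    return dict(count_bolt.counts)

spout = ["storm counts words", "storm streams tuples"]
print(run_topology(spout, SplitBolt(), CountBolt()))
```

Per-tuple processing is what gives Storm its millisecond latency; Trident, above, trades some of that latency for batched efficiency.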
- Yahoo S4: (Simple Scalable Streaming System) a distributed stream computing platform that is general-purpose, distributed, scalable, fault-tolerant, and pluggable, letting programmers easily develop applications over continuous unbounded streams of data. Its goal is to fill the gap between complex proprietary systems and batch-oriented open-source products, providing a high-performance computing platform that hides the complexity of parallel processing from the application programmer.
- HaLoop: a modified version of the Hadoop MapReduce framework whose goal is to efficiently support iterative, recursive data analysis tasks such as PageRank, HITS, SSSP, and k-means.
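PageRank is a good example of why iterative support matters: each pass consumes the ranks produced by the previous pass, exactly the kind of loop a plain MapReduce job would re-read from disk on every round. A minimal plain-Python version of the iteration:

```python
def pagerank(links, damping=0.85, iterations=20):
    """One PageRank pass per loop; each pass feeds the previous
    iteration's ranks back in as input."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every page starts with the teleport share, then receives
        # damped contributions from the pages linking to it.
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            share = rank[n] / len(outs)
            for m in outs:
                new_rank[m] += damping * share
        rank = new_rank
    return rank

# Tiny three-page web: a -> b, b -> c, c -> a (a symmetric cycle).
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
print({k: round(v, 3) for k, v in ranks.items()})
```

On this symmetric cycle every page settles at a rank of 1/3; on real graphs the iteration is run until the ranks stop changing.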
- Presto: an open-source distributed SQL query engine for interactive analytic queries, capable of fast interactive analysis over more than 250PB of data. Presto was designed and written to solve the speed problems of interactive analysis in commercial data warehouses at the scale of Facebook. Facebook reports that Presto performs 10 times better than Hive and MapReduce.
- Drill: launched by Apache in 2012; it lets users run SQL-based queries against Hadoop, NoSQL databases, and cloud storage services. It can run on server clusters of thousands of nodes and process petabytes of data, or trillions of records, in seconds. It can be used for data mining and ad hoc queries, and supports a wide range of backends, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, and Swift.
- Phoenix: a Java middle layer that lets developers run SQL queries on Apache HBase. Phoenix is written entirely in Java and provides an embeddable JDBC driver for clients. The Phoenix query engine converts a SQL query into one or more HBase scans and orchestrates them to produce a standard JDBC result set.
- Pig: a programming language that simplifies common Hadoop tasks. Pig can load data, transform it, and store the final results. Pig's greatest value is providing a scripting layer over the MapReduce framework, similar in spirit to the SQL statements we are all familiar with.
- Hive: a data warehouse tool built on Hadoop that can map structured data files to database tables, provides simple SQL-style querying, and converts SQL statements into MapReduce jobs. Its advantage is its low learning cost: simple MapReduce statistics can be produced quickly with SQL statements, without developing dedicated MapReduce applications.
- SparkSQL: the successor to Shark. SparkSQL abandoned the original Shark code while keeping some of its strengths, such as in-memory columnar storage and Hive compatibility. Freed from its dependence on Hive, SparkSQL has improved greatly in data compatibility, performance optimization, and component extensibility.
- Stinger: originally called Tez; the next generation of Hive, with development led by Hortonworks, running on YARN's DAG computing framework. In some tests, Stinger improves performance roughly tenfold while allowing Hive to support more SQL.
- Tajo: aims to build a reliable distributed data warehouse system on HDFS that supports relational data. Its focus is on providing low-latency, scalable ad-hoc queries and online data aggregation, as well as the roles served by more traditional ETL tools.
- Elasticsearch: a search server based on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Developed in Java and released as open source under the Apache license, Elasticsearch is a popular enterprise search engine. Designed for the cloud, it achieves real-time search and is stable, reliable, fast, and easy to install and use.
- Shark: Hive on Spark. It essentially uses Hive to parse HQL, translates the HQL into Spark RDD operations, obtains table metadata from the Hive metastore and the actual data files from HDFS, and runs everything on Spark. Shark's hallmarks are speed and full compatibility with Hive. It can be used in shell mode through APIs such as (rdd2sql), so an HQL result set can be processed further in the Scala environment, for example by writing simple machine learning or processing functions to analyze HQL results further.
- Lucene: Java-based; Lucene can perform full-text search very quickly. According to its official site, it can handle more than 150GB of data per hour on modern hardware, and it has powerful, efficient search algorithms.
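At Lucene's core is an inverted index that maps each term to the documents containing it; a query then becomes a set operation over those postings lists. A toy sketch of the idea (not Lucene's API):

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """AND query: documents that contain every requested term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "open source search engine",
    2: "distributed search cluster",
    3: "open data platform",
}
index = build_index(docs)
print(sorted(search(index, "open")))            # docs 1 and 3
print(sorted(search(index, "open", "search")))  # only doc 1
```

Real engines add tokenization, stemming, relevance scoring, and compressed on-disk postings, but the index-then-intersect core is the same.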
- Ignite: a high-performance, integrated, distributed in-memory platform for real-time computation and processing of large data sets, orders of magnitude faster than traditional technologies based on disk or flash. The platform includes a data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, a file system, messaging, events, and data structures.
- GridGain: built on Apache Ignite, which originated at GridGain; it provides in-memory data structures for fast processing of big data, along with a Hadoop accelerator based on the same technology.
Data storage
- Redis: a high-performance key-value storage system. It is similar to Memcached but supports a richer set of value types, including strings, lists, sets, and zsets (sorted sets). Redis largely compensates for the limitations of key/value stores like Memcached, and in some scenarios it complements relational databases very well.
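A few lines of Python can sketch what "richer value types" means in practice: instead of a cache of opaque strings, the store understands lists and sets and offers operations on them. The class below is a toy in-process stand-in, not the Redis protocol or client API:

```python
class MiniStore:
    """In-memory key-value store sketching Redis-style typed values:
    plain strings, lists, and sets, each with their own operations."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):           # like SET key value
        self.data[key] = value

    def get(self, key):                  # like GET key
        return self.data.get(key)

    def rpush(self, key, *values):       # append items to a list value
        self.data.setdefault(key, []).extend(values)

    def lrange(self, key, start, stop):  # inclusive range over a list value
        return self.data.get(key, [])[start:stop + 1]

    def sadd(self, key, *members):       # add members to a set value
        self.data.setdefault(key, set()).update(members)

    def smembers(self, key):
        return self.data.get(key, set())

db = MiniStore()
db.set("page:title", "home")
db.rpush("recent", "a", "b", "c")
db.sadd("tags", "cache", "kv", "cache")   # duplicate is absorbed by the set
print(db.get("page:title"), db.lrange("recent", 0, 1), sorted(db.smembers("tags")))
```

Because the server understands the structure, operations like "push onto this list" or "add to this set" run server-side in one round trip, which is the practical advantage over a plain string cache.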
- HDFS: the Hadoop Distributed File System, designed as a distributed file system suitable for running on commodity hardware. It has much in common with existing distributed file systems, is highly fault-tolerant, is designed to be deployed on inexpensive machines, and provides the high-throughput data access that applications with large data sets need.
- HBase: the Hadoop database, a distributed, scalable big data store. Designed for very large tables of billions of rows by millions of columns, it is a distributed database offering random read/write access to big data. It provides Bigtable-like storage capabilities, built on Hadoop and the Hadoop Distributed File System (HDFS).
- Vertica: a column-store database designed for high performance and high availability. Thanks to massively parallel processing (MPP), it offers fine-grained, scalable, highly available querying. Each node operates completely independently in a shared-nothing architecture, which reduces contention for shared resources.
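The column-store idea is easy to demonstrate: store each column contiguously, so an analytic aggregate reads only the columns it needs rather than whole rows. A minimal illustration with toy data (nothing here is Vertica's actual format):

```python
# Row layout: one record per dict; the row store must touch every field.
rows = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 200},
    {"region": "east", "sales": 80},
]

# Column layout: one contiguous list per column.
columns = {
    "region": [r["region"] for r in rows],
    "sales":  [r["sales"] for r in rows],
}

# An aggregate over one column scans only that column's values,
# which is why column stores shine on analytic queries.
total = sum(columns["sales"])
east_total = sum(s for region, s in zip(columns["region"], columns["sales"])
                 if region == "east")
print(total, east_total)
```

Per-column storage also compresses far better (similar values sit together), compounding the scan advantage on large tables.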
- Cassandra: a hybrid non-relational database, similar to Google's Bigtable, with richer functionality than Dynamo (the distributed key-value storage system). This NoSQL database was originally developed by Facebook and is now used by more than 1,500 organizations, including Apple, CERN, Comcast, GitHub, GoDaddy, eBay, Hulu, Instagram, Intuit, Netflix, and Reddit.
- Dynamo: a classic distributed key-value storage system characterized by decentralization, high availability, and high scalability. Dynamo has been applied successfully inside Amazon, where it can be deployed on thousands of nodes to provide service across data centers.
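A core technique behind Dynamo-style decentralization is consistent hashing: nodes and keys are placed on a hash ring, and each key belongs to the first node clockwise from its hash, so adding or removing a node remaps only nearby keys. A small sketch (the `Ring` class is illustrative, not Dynamo's implementation):

```python
import hashlib
from bisect import bisect

def ring_hash(key):
    """Stable hash of a string onto a large integer ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring: each key is owned by the first node
    clockwise from its hash position."""
    def __init__(self, nodes, vnodes=8):
        # Each node gets several virtual points to spread load evenly.
        self.points = sorted((ring_hash(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [h for h, _ in self.points]

    def node_for(self, key):
        idx = bisect(self.hashes, ring_hash(key)) % len(self.points)
        return self.points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print({k: ring.node_for(k) for k in ["user:1", "user:2", "user:3"]})
```

The full Dynamo design layers replication to the next N nodes on the ring, plus gossip and hinted handoff, on top of this placement scheme.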
- Amazon SimpleDB: a highly available NoSQL data store written in Erlang that offloads the work of database administration: developers simply store and query data items via web service requests, and Amazon SimpleDB takes care of the rest. As a web service, it is part of Amazon Web Services, alongside Amazon EC2 and S3.
- Hypertable: an open-source, high-performance, scalable database that uses a model similar to Google's Bigtable. It is compatible with Hadoop and performs extremely well; its users include eBay, Baidu, Gaopeng, Yelp, and many other Internet companies.
Analysis and reporting tools
- Kettle: an ETL toolkit that lets you manage data from different databases by describing, in a graphical user environment, what you want to do rather than how to do it. As an important part of Pentaho, it is being used in a growing number of projects in China.
- Kylin: an open-source distributed analysis engine that provides a SQL interface and multidimensional OLAP over extremely large (TB/PB-scale) Hadoop data sets. Originally developed at eBay and contributed to the open-source community, it can query huge Hive tables in seconds.
- Kibana: an Apache-licensed open-source dashboard for Elasticsearch search and analytics. Its web interface works with Logstash and ElasticSearch to search, visualize, and analyze logs efficiently.
- Druid: a fault-tolerant, high-performance, distributed open-source system for real-time querying and analysis of big data. It is designed to ingest large-scale data quickly and to support fast queries and analysis.
- Zeppelin: a web-based notebook for interactive data analytics. It makes it convenient to produce polished data-driven, interactive, collaborative documents, and it supports many languages and backends, including Scala (Apache Spark), Python (Apache Spark), SparkSQL, Hive, Markdown, and Shell.
- Talend Open Studio: from the first open-source software vendor in the data integration (ETL: Extract, Transform, Load) market. Talend has been downloaded more than 2 million times, and its open-source software provides data integration capabilities. Its users include business organizations such as AIG, Comcast, eBay, GE, Samsung, Ticketmaster, and Verizon.
- Splunk: an engine for machine data. Use Splunk to collect, index, and harness the fast-moving machine data generated by all your applications, servers, and devices (physical, virtual, and cloud), and to search and analyze all real-time and historical data from one place.
- Pentaho: the world's most popular open-source business intelligence software, a Java-based BI suite that is workflow-centric and focused on solutions rather than tool components. It includes a web server platform and several software tools, for reporting, analysis, charting, data integration, data mining, and more, covering essentially every aspect of business intelligence.
- Jaspersoft: provides flexible business intelligence tools that can be embedded into applications. Its users include many business organizations: Gaopeng.com, the US Department of Agriculture, CA Technologies, Ericsson, Time Warner Cable, Olympic Steel, the University of Nebraska, and General Dynamics.
- Lumify: owned by Altamira Technologies (known for its national security work), an open-source platform for data integration, analytics, and visualization.
- Lingual: an advanced extension of Cascading that provides an ANSI SQL interface for Hadoop, greatly simplifying application development and integration. Lingual connects existing business intelligence (BI) tools, optimizes computation cost, and accelerates Hadoop-based application development.
- Beam: provides a unified, Java-based data pipeline development process and supports running on Spark and Flink well. By providing one model over many execution frameworks, it spares developers from having to learn each framework separately.
- Cascading: a Hadoop-based API for creating complex, fault-tolerant data processing workflows. It abstracts away the cluster's topology and configuration, allowing rapid development of complex distributed applications without worrying about the underlying MapReduce.
- HPCC: like Hadoop, a big data analysis system that runs on a cluster of servers. HPCC has been used at LexisNexis for many years; it is a mature, reliable system that includes a suite of tools, a high-level programming language called ECL, a companion data warehouse, and excellent scalability.
- Hivemall: a collection of machine learning algorithms implemented on Hive. It includes many highly scalable algorithms usable for data analysis: classification, regression, recommendation, k-nearest neighbor, anomaly detection, feature hashing, and more.
- RapidMiner: offers a rich set of analysis and data mining algorithms used to solve business-critical problems, with solutions covering many sectors, including automotive, banking, insurance, life sciences, manufacturing, oil and gas, retail and FMCG, telecommunications, and utilities.