Yahoo today open-sourced Data Sketches, a Java library of fast and efficient algorithms for computing on data. Data Sketches is released under the Apache open source license, and the code is available on GitHub: https://github.com/datasketches/sketches-core/.
Techniques of this kind appear more and more often in academic papers, always under different names, but they share a few key technical points. First, they can handle streaming data, because they touch each data item only once. Second, they are additive: the results of these computations can be added together or merged. Even more interesting, they are all approximate.
Yahoo said in a statement that the whole science rests on a very basic premise: as long as you can tolerate a small deviation in the results, you can greatly increase the speed of the computation.
Imagine that you want to count something, such as the number of people who visit Yahoo Finance and Yahoo Sports in a day. If you try to work out how many people visited, you can get the answer.
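To make the visitor-counting example concrete, here is a minimal sketch in Python. It is not the Data Sketches Java API, and the source does not say which algorithms the library uses. The "k minimum values" technique shown here is swapped in purely for illustration, and the class name KMVSketch, the parameter k, and the user IDs are all hypothetical. It reads each item once, merges two sketches into one, and returns an approximate count of distinct visitors.

```python
import hashlib


class KMVSketch:
    """Toy 'k minimum values' distinct-count sketch (illustrative only)."""

    def __init__(self, k=1024):
        self.k = k          # how many hash values to keep; larger k, smaller error
        self.mins = set()   # the k smallest hash values seen so far
        self.theta = 1.0    # largest value currently stored in self.mins

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float between 0 and 1.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        """Process one stream item; each item is looked at exactly once."""
        h = self._hash(item)
        if h in self.mins:
            return                        # repeat visitor, already counted
        if len(self.mins) < self.k:
            self.mins.add(h)              # still filling up
        elif h < self.theta:
            self.mins.remove(self.theta)  # evict the current largest value
            self.mins.add(h)
        else:
            return                        # not among the k smallest
        self.theta = max(self.mins)

    def merge(self, other):
        """Return a sketch of the union of both streams (assumes equal k)."""
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        if merged.mins:
            merged.theta = max(merged.mins)
        return merged

    def estimate(self):
        """Approximate number of distinct items seen."""
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct items: exact
        return (self.k - 1) / self.theta


# Hypothetical traffic: 30,000 Finance visitors, 30,000 Sports visitors,
# with 10,000 people visiting both (50,000 distinct people in total).
finance = KMVSketch()
sports = KMVSketch()
for uid in range(0, 30000):
    finance.update(f"user-{uid}")
for uid in range(20000, 50000):
    sports.update(f"user-{uid}")

either = finance.merge(sports)
print(round(either.estimate()))  # close to 50000, but not exact
```

Each sketch stores at most k hash values no matter how many visits it sees, which is where the speed and memory savings come from. The price is a small error in the answer.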
In addition to high-speed counting, Data Sketches performs some types of computation much faster than an exact calculation. In a typical case, an exact computation over 100 million values takes 2.5 minutes, while Data Sketches takes only 2.7 seconds.
Data Sketches is already used in a large number of Yahoo products: Yahoo's own Flurry uses it for real-time counting, and Yahoo's mail service and search engine use it as well.
Data Sketches integrates with Pig and Hive, as well as with the open source data store Druid, and it is easy to use with the Maven build management tool.