I’ve heard a lot about Hadoop and had this podcast in my queue: .NET Rocks episode 898, “Big Data with Hadoop” with Jeremiah Peschka.
It’s possible that I’ll never work with Big Data, and here we’re talking BIG. A couple of Terabytes is at the small end of the spectrum; we’re talking Petabytes. But checking out the technology behind it is very interesting. This is awesome stuff.
With Hadoop you can process large data sets across clusters of computers. Processing a couple of Terabytes on one high-performance server goes pretty fast, so instead of keeping all the data on one server you spread it over multiple servers with a couple of Terabytes each. Hadoop then queries all these servers in parallel, and processing goes very fast.
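To make the idea a bit more concrete, here is a rough sketch of the split-then-combine pattern, in the spirit of the map/reduce model Hadoop is known for. It is not Hadoop's API, just plain Python on one machine: the `SHARDS` data and the function names are made up for illustration, and threads stand in for the separate servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the records held by one server.
SHARDS = [
    ["hadoop hive hadoop", "big data"],
    ["hive query", "big big data"],
    ["hadoop cluster"],
]


def map_shard(lines):
    """Map step: count words locally, using only this shard's data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-shard counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(shards):
    # Threads stand in for separate machines working at the same time.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(word_count(SHARDS))
```

Each "server" only ever touches its own slice of the data, and only the small per-shard summaries are brought together at the end. As I understand it, that is why adding more servers keeps things fast.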
The guys mention Yahoo as an example, which has one Hadoop cluster with 20,000 servers! And that’s only one of their clusters.
The biggest thing with Hadoop is Hive, a data warehouse system that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. It came out of Facebook, which has to handle a lot of data.
As I mentioned, I’d heard about Hadoop but didn’t really know what it was. It was great to finally get some information about it.
Also, Carl mentions a video of a BAD version of Apache. You have to check it out.