Transitioning to Hadoop isn’t easy and not all use cases are suitable candidates. However, if you’ve determined that using Hadoop will help you achieve your business goals and you’ve identified a solid business use case, it may be time to take the plunge. Before you get started, there are three areas you should consider to get your IT team ready for Hadoop: people, processes, and technology. By taking a hard look at each of these areas you can better determine if you have what you need for a successful Hadoop implementation.
The right people
According to Forbes, an IDC and Computerworld survey reported that a third of companies feel they don’t have sufficiently skilled big data and analytics IT staff.
One reason may be that implementing and administering Hadoop requires thinking about data management from a new perspective. For your team to make the transition to Hadoop successfully, they need training. If you tell your relational database team to work on Hadoop without training them on what’s different and how to think differently, you’ll be able to use Hadoop, but you won’t use it well or get the full benefit of your Hadoop investment. That’s why at Zaloni, in addition to our platform, we offer consulting services, including training.
The right processes
It’s also critical to think about the structure of your team and its processes. Your current technology teams (IT, developers, and so on) may be structured separately. To implement and maintain a Hadoop data lake, those teams need to collaborate much more closely, and their roles need to overlap more. For example, with a traditional database, you have a database administrator (DBA) and, separately, developers who build the database applications.
With Hadoop, while you still need many of these roles – including administrators, developers, data architects, data engineers and data scientists – each person on the team needs to have the mentality of a developer and also be able to do operations and manage the cluster efficiently. This is because of the distributed nature of Hadoop processing, which involves multiple nodes/servers and requires the team to think about implications of this type of architecture on data storage and data processing requirements.
One classic mistake is for a company to take all of its relational database processes and copy them over into Hadoop. You get absolutely no incremental benefit from this – you’re simply doing things the old way in Hadoop.
The right technology
Lastly, you need the right tools for a successful Hadoop implementation. It’s relatively easy to ingest data into Hadoop using Apache tools like Flume and Sqoop. However, Hadoop isn’t great at managing data so that business users can derive value from it, which is the whole point of implementing Hadoop in the first place. That’s why it’s critical to have the right tools in place to help with ingestion and metadata capture, data preparation, data quality, ongoing data management and data analytics.
If you’re considering using familiar ETL tools with Hadoop, know this: old ETL tools don’t work well in Hadoop, due to their point-to-point architecture. Here’s why. With ETL and relational databases, to manage data you pre-define the schema once and then only capture data that meets those rules. In a distributed Hadoop data lake, the on-boarded data may be kept in its raw form (which may be semi- or unstructured) and can be transformed on an as-needed basis. Therefore, your data management platform for Hadoop needs to be able to operationalize and automate the maintenance of data on a more fluid, ongoing basis. It’s never “one and done.”
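To make the contrast concrete, here is a minimal sketch in plain Python, not Hadoop code. It assumes a few made-up JSON records and two hypothetical helper functions, load_schema_on_write and read_with_schema, that exist only for this illustration. The first mimics the traditional approach of rejecting anything that doesn’t fit a predefined schema at load time. The second keeps every raw record and applies a schema only when the data is read.

```python
import json

# Hypothetical raw records, one JSON document per line, as they might land in a data lake.
RAW_LINES = [
    '{"id": 1, "name": "Ada", "email": "ada@example.com"}',
    '{"id": 2, "name": "Grace", "phone": "555-0100"}',
    '{"id": "3", "name": "Linus", "tags": ["vip"]}',
]

# Illustrative fixed schema, defined once up front (the traditional approach).
WRITE_SCHEMA = {"id": int, "name": str, "email": str}


def load_schema_on_write(lines):
    """Keep only records whose fields and types match the predefined schema."""
    accepted = []
    for line in lines:
        record = json.loads(line)
        same_fields = set(record) == set(WRITE_SCHEMA)
        if same_fields and all(
            isinstance(record[key], expected) for key, expected in WRITE_SCHEMA.items()
        ):
            accepted.append(record)
    return accepted


def read_with_schema(lines, fields):
    """Keep every raw record; project the requested fields only at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Missing fields become None instead of causing the record to be dropped.
        rows.append({field: record.get(field) for field in fields})
    return rows


on_write = load_schema_on_write(RAW_LINES)
on_read = read_with_schema(RAW_LINES, ["id", "name", "phone"])

print(len(on_write), "record(s) survive schema-on-write")
print(len(on_read), "record(s) available with schema-on-read")
```

In this toy data set, the fixed schema silently discards two of the three records, while the read-time projection keeps all of them and lets a later consumer ask for a field (phone) that nobody planned for at load time. That flexibility is also why the data needs ongoing management rather than a one-time load.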
By taking the time to consider the right people, processes, and tools you’ll need for your Hadoop project, you help ensure that your IT team will be capable of tying together your entire managed data pipeline, from raw to analytics-ready data. Do you think you might be ready? Let's talk.
About the Author
Ben Sharma is CEO and co-founder of Zaloni. He is a passionate technologist with experience in business development, solutions architecture, and service delivery of big data, analytics and enterprise infrastructure solutions. Having previously worked in management positions for NetApp, Fujitsu and others, Ben’s expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization and storage. Ben is the co-author of Java in Telecommunications and holds two patents. He received his MS in Computer Science from the University of Texas at Dallas.