I was recently in a good discussion about sizing a Hadoop HDFS cluster for long-term archiving of data. Hadoop seems like a great fit for this, right? It offers easy expansion of storage as your data footprint grows, it is fault-tolerant and somewhat self-recovering, and it generally just works. From a high-level view, this should be easy to implement, right?
Sort of. I’ve done this before and years of experience led me to understand that it’s not simply about putting a cluster together, slapping some data on it, and leaving it alone in a dark datacenter.
A little background on our HDFS archiving
At a previous job, we needed to store lots of historical data: every action ever recorded in the environment, from the mid-2000s to the present. The goal was to make it an online, warm archive that any developer in the company could use at a moment's notice, without requiring us to pull tapes or archives back from the vault.
Hadoop (and by extension, HDFS) was a logical choice for us. We needed a storage mechanism that could grow. We needed a computation mechanism for data on that storage.
Prior to 2010, the data sat on a single multi-disk RAID box. It was difficult to manage. The hardware was finicky, limited in expansion, and once the data size outgrew it, the box was considered a write-off. Imagine having to forklift a new system in every time you outgrew it instead of being able to sanely expand it.
Then came HDFS. We started putting data into HDFS in 2010. At the time, we had about 10 terabytes of compressed data. By 2016, the collection had grown to 1 petabyte. The data itself was highly compressible, giving us a 70-80% savings using gzip (that petabyte was roughly 5PB uncompressed in 2016), and it doubled roughly every year.
The expansion rate alone would have quickly put most data storage platforms out of commission, priced them out of the running, or required a significant re-architecture to shard the data across storage instances. This is where HDFS shined.
Live and Learn: HDFS Archiving the Easy Way
We learned a few things in 6 years of working with HDFS and this archiving use case. Some of these were natural evolutions of our understanding. Some of these evolved out of late-night calls on weekends to fix something broken (or worse: unrecoverable!).
- While you can treat the HDFS archive as a master copy, you should absolutely not treat it as the sole source or as a backup. Keep a separate, verifiable copy to compare against, preferably in a different physical location.
- Sometimes you encounter a software bug. Sometimes you encounter a human interaction (that “oops” moment). In the end, the data gets corrupted or deleted and you need a way to restore it. We’ve used offline disk drives, tapes, remote online replicas in direct-attached storage (really big RAID), S3, Glacier, and so on. They all have their issues, but any of them is better than having no replica at all.
- Develop a remediation plan for the case where all (or part) of the data gets deleted. Trust me, this is important. Building a recovery plan during a data outage is much more stressful than building it before one.
- Determine whether you really need it all online, or whether there’s a subset that would cover 95-99% of your use cases. In our case, we kept everything online, but we likely could have gotten away with 15 months of data to handle the year-over-year calculations we were sometimes doing. Knowing your data and how it’s used will help you evaluate this with your users.
- If keeping all data online, develop a method to periodically revalidate that it’s all there. Automate this, and alert loudly if parts go missing so they can be remedied. Noticing that data is corrupt or missing months after the fact can have a big impact on the business. No one likes rerunning 3 months of calculations because it was discovered that 10% of the data went missing.
- If possible, validation should go both ways: from your master copy to your replicas and from your replicas back to your master copy. You don’t want to find out 12 months in, while you’re trying to restore data onto your master copy, that your replica is completely corrupt. (A minimal sketch of this kind of two-way audit appears after this list.)
- Develop a data growth model and refine it at least quarterly, if not monthly. Set business triggers based on it for buying additional hardware. Your Hadoop operations team doesn’t want to be caught flat-footed, making an emergency funding request for expansion hardware at the last minute. Plus, talking about this early and often will help your management team set appropriate budgets ahead of time.
- In our case, at 70% utilization we started alerting management that a hardware purchase was needed. By 80%, we should have issued a purchase order for new equipment. By 90%, we should have had the purchase completed and all hardware delivered, online, and in production. If we hit 90% HDFS utilization with no hardware on hand, we took drastic measures (e.g., temporarily reducing the replication factor on older data).
- How did we do this? We periodically snapshotted the data sizes and graphed them over time; the more snapshots we had, the better we could predict our growth. It’s a simple approach (a minimal forecasting sketch appears after this list), and you could get much more in-depth.
- Pay attention to drastic increases in archive usage. About once a quarter we would see a spike caused by a developer mistake, a new feature being rolled out from 1% to 100% of users, or an external traffic event (e.g., lots of new users after being featured on a nationwide news show).
- In one case, a design flaw in event creation caused exponential growth in archiving. A user triggered an event that was logged. That event triggered 4 events in another system, which were logged and in turn triggered another 20 events. We discovered it within a few weeks and were able to roll out a fix that reduced the chattiness of the data while still delivering what the developer needed.
- Understand the archive growth curve. Try to get developers to provide you estimates for Day 1 and Day 180 of a data set being archived. This should give you a feel for their understanding of the data and how it will grow. You should account for this in your growth model and periodically re-verify it with the developer to make sure it’s expected.
- Archive compression algorithm matters. We started with gzip (great compression, but slow). We moved to Snappy (fast, but poor compression that ballooned our hardware costs). Ultimately, we built our own compression that landed roughly between the two in performance and compressibility (although, don’t do this unless your developers have the necessary skill to implement it correctly).
- Changing an entire archive’s compression algorithm is a monumental affair. Imagine recompressing hundreds of terabytes of data without significantly impacting the existing workflows using it.
- HDFS Erasure Coding would have helped reduce our hardware costs a lot. We wanted it, but it wasn’t ready. The future is near, though!
- If ingesting files from external sources, pay attention to how many you’re uploading. You may need a way to periodically compact them into larger files to avoid the HDFS many-small-files problem. In our case, we were uploading between 200,000 and 400,000 files per day. Alert when your file counts hit a threshold, because that can mean your compaction mechanism isn’t working, and alert when the compaction mechanism itself detects that it is failing (a minimal monitoring sketch follows this list).
- Why is this important? Our compaction mechanism broke and it wasn’t noticed for quite some time. This destroyed a cluster to the point that it was unrecoverable because the NameNode and JournalNodes could never complete startup after the first OutOfMemory crash. We had to rebuild the cluster from scratch and reload the data despite extensive attempts at getting it running again.
- If you can’t roll your own mechanism, you might consider filecrush. Its purpose is to take many small files and turn them into a SequenceFile, with options for compression.
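
To make the “validate both ways” point concrete, here is a minimal sketch of the kind of two-way audit I mean. It builds a manifest of (path, size) for the master and the replica using the standard `hdfs dfs -ls -R` shell command and reports anything missing or mismatched in either direction. The NameNode URIs and the `/archive` root are hypothetical; in practice you would also compare checksums (e.g., via `hdfs dfs -checksum`) and feed the results into your alerting rather than stdout.

```python
#!/usr/bin/env python3
"""Two-way archive audit sketch: compare a master manifest against a
replica manifest and flag anything missing or mismatched either way."""

import subprocess

def hdfs_manifest(namenode_uri, root):
    """Return {relative_path: size} for every file under root,
    parsed from `hdfs dfs -ls -R` output."""
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", namenode_uri + root],
        capture_output=True, text=True, check=True).stdout
    manifest = {}
    for line in out.splitlines():
        parts = line.split(None, 7)
        if len(parts) == 8 and not parts[0].startswith("d"):  # skip directories
            size, path = int(parts[4]), parts[7]
            manifest[path.split(root, 1)[-1]] = size          # key by path relative to root
    return manifest

def audit(master, replica):
    """Compare in both directions and return human-readable problems."""
    problems = []
    for path, size in master.items():
        if path not in replica:
            problems.append(f"MISSING ON REPLICA: {path}")
        elif replica[path] != size:
            problems.append(f"SIZE MISMATCH: {path} ({size} vs {replica[path]})")
    problems += [f"MISSING ON MASTER: {path}" for path in replica if path not in master]
    return problems

if __name__ == "__main__":
    # Hypothetical NameNode URIs and archive root.
    master = hdfs_manifest("hdfs://master-nn:8020", "/archive")
    replica = hdfs_manifest("hdfs://replica-nn:8020", "/archive")
    for problem in audit(master, replica):
        print(problem)  # in practice, page someone instead of printing
```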
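Here is a similarly minimal sketch of the growth-model side, under a few stated assumptions: a cron job already appends one `date,bytes_used` line per day to a `snapshots.csv` (collected however you like, e.g., from `hdfs dfs -du -s` or `hdfs dfsadmin -report`), the cluster capacity figure is made up for the example, and an exponential fit is used because our data roughly doubled each year.

```python
#!/usr/bin/env python3
"""Growth-forecast sketch: fit an exponential trend to size snapshots
and estimate when utilization crosses the 70/80/90% business triggers."""

import csv
from datetime import date

import numpy as np

CLUSTER_CAPACITY_BYTES = 2 * 10**15  # hypothetical 2 PB of usable HDFS capacity
TRIGGERS = {
    0.70: "start alerting management that a purchase is needed",
    0.80: "purchase order should be issued",
    0.90: "hardware should be delivered and in production",
}

def load_snapshots(path):
    """Rows look like: 2016-03-01,731000000000000 (date, bytes used)."""
    days, sizes = [], []
    with open(path) as fh:
        for row in csv.reader(fh):
            days.append(date.fromisoformat(row[0]).toordinal())
            sizes.append(float(row[1]))
    return np.array(days), np.array(sizes)

def forecast(days, sizes):
    """Fit log(size) = a*day + b (exponential growth), then solve for
    the day each utilization trigger is crossed."""
    a, b = np.polyfit(days, np.log(sizes), 1)
    for fraction, action in sorted(TRIGGERS.items()):
        crossing_day = (np.log(fraction * CLUSTER_CAPACITY_BYTES) - b) / a
        print(f"~{date.fromordinal(int(crossing_day))}: {int(fraction * 100)}% full -> {action}")

if __name__ == "__main__":
    forecast(*load_snapshots("snapshots.csv"))
```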
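Finally, a sketch of the small-file alerting: `hdfs dfs -count` (a standard HDFS shell command) reports directory count, file count, and content size for a path, so a cron job can simply compare the file count under the ingest directory against a threshold. The directory name and threshold below are assumptions for the example.

```python
#!/usr/bin/env python3
"""Small-file watchdog sketch: alert when the file count under the
ingest directory suggests compaction has stopped running."""

import subprocess
import sys

INGEST_DIR = "/archive/incoming"  # hypothetical landing directory
MAX_FILES = 2_000_000             # several days of backlog at 200k-400k files/day

def file_count(path):
    # `hdfs dfs -count` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    out = subprocess.run(["hdfs", "dfs", "-count", path],
                         capture_output=True, text=True, check=True).stdout
    return int(out.split()[1])

if __name__ == "__main__":
    count = file_count(INGEST_DIR)
    if count > MAX_FILES:
        # Wire this into real alerting (email, pager) rather than stderr.
        print(f"ALERT: {count} files under {INGEST_DIR}; is compaction running?",
              file=sys.stderr)
        sys.exit(1)
    print(f"OK: {count} files under {INGEST_DIR}")
```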
Additionally, we experimented with keeping some of the secondary verification archives in Amazon S3/Glacier, syncing from our on-premises clusters. In doing so, we learned:
- Copying data up into S3 is expensive time-wise. If you’re producing terabytes of data per day, you need a sufficiently large and dedicated uplink from your on-premises environment to AWS. You can copy approximately 10.8 terabytes of data across a 1Gbit uplink in a 24-hour period (the arithmetic is sketched after this list).
- Pay attention to upload failures. If you don’t have sufficient bandwidth to catch up, you’ll have to make some hard decisions about whether you really need that time period of data.
- Getting data into Glacier is cheap; getting it back out may not be. As I recall, a full data restore request can be expensive, so pay attention to how much you pull out of Glacier at one time. If you can only download 10TB a day, don’t pull 100TB out of Glacier at once; pull it in chunks you can reasonably manage.
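For reference, the arithmetic behind that 10.8TB/day figure is just link speed times seconds in a day, assuming the uplink runs flat out with no protocol overhead (real-world throughput will be lower):

```python
# 1 Gbit/s uplink, fully saturated, for 24 hours (decimal units, no overhead).
link_bits_per_s = 1e9
bytes_per_day = link_bits_per_s / 8 * 86_400
print(f"{bytes_per_day / 1e12:.1f} TB/day")  # -> 10.8 TB/day
```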
Some missing pieces?
There are a few things I’m glossing over here, for sure. For example, do you have data retention requirements? Some industries must keep data around for a minimum amount of time. Others may keep it around for only a maximum amount of time before it must be aged out or deleted. If you’re dealing with personally identifiable data, do you have to take precautions to conform to your local legal statutes, such as GDPR in Europe? How do you handle removal requests from legal entities, such as DMCA claims? When in doubt, ask a lawyer. Heck, when not in doubt, you should still ask a lawyer.
Ultimately, my experience here isn’t the be-all and end-all answer to this problem, but I do feel it’s a good starting point for people to work from and tune for their own needs and requirements.
Do you have a different opinion or additional insight? Let me know! I would love to hear it.