While developing Scruddle (http://www.scruddle.com), a digital media aggregation and publishing platform, I have run into several gotchas and practices that seem worth sharing. Scruddle collects quite a bit of data in varying sizes and formats, from short Facebook and Twitter status messages to large news articles and emails. This information is stored using Azure table storage, so several rules were devised or discovered for content storage.
As you read this, it may be useful to review the MSDN article “Understanding the Table Service Data Model”, at http://msdn.microsoft.com/en-us/library/windowsazure/dd179338.aspx.
When aggregating information from the Internet, you cannot know in advance what kind of content you will receive. Early on, I encountered issues saving content directly to table storage, in the form of obscure HTTP exceptions from the Windows Azure Storage DLL (in one case, the cause turned out to be HTML character codes in meta tags). For Scruddle, the best option was to URL encode text content prior to saving, using the HttpUtility.UrlEncodeUnicode() method in the System.Web assembly. This held true both for the storage emulator and actual cloud storage.
Before URL encoding, the raw data can be indexed for searching, trending, or other activities. Prior to displaying the content, simply pass the URL encoded content through the HttpUtility.UrlDecode() method, also in the System.Web assembly.
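To make the round trip concrete, here is a minimal sketch in Python rather than .NET, purely for illustration. The function names encode_for_table and decode_from_table are hypothetical, and Python's quote() emits UTF-8 percent-escapes, so its output will not match HttpUtility.UrlEncodeUnicode() byte for byte. The idea is the same: encode before saving, decode before displaying.

```python
from urllib.parse import quote, unquote


def encode_for_table(text: str) -> str:
    """Percent-encode everything except letters, digits and '_.-~' (hypothetical helper)."""
    return quote(text, safe="")


def decode_from_table(stored: str) -> str:
    """Reverse encode_for_table before the content is displayed (hypothetical helper)."""
    return unquote(stored)


# Sample content with the kind of markup that caused trouble: meta tags and HTML character codes.
raw = '<meta name="description" content="Caf&eacute; &amp; bar">'
stored = encode_for_table(raw)       # this is what would be written to the table
restored = decode_from_table(stored)  # this is what would be shown to the user
```

Keep in mind that the encoded text is longer than the raw text, which matters for the size limits discussed next.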
Content Length (Strings)
A single table storage column with the String data type has a limit of 64KB, but there are two points to consider. First, this does not translate to 64K characters: table storage encodes strings as UTF-16, so the practical limit is roughly 32K characters or fewer. Second, you need to be prepared to handle string content that exceeds 64KB.
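The gap between bytes and characters is easy to check. The following is a small Python sketch, not part of Scruddle; the names MAX_STRING_BYTES, utf16_size and fits_in_string_column are made up for this example. It measures a string the way a UTF-16 store would.

```python
MAX_STRING_BYTES = 64 * 1024  # the 64KB String column limit


def utf16_size(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-16 (no byte order mark)."""
    return len(text.encode("utf-16-le"))


def fits_in_string_column(text: str) -> bool:
    """True if the UTF-16 encoding of the text is within the 64KB limit."""
    return utf16_size(text) <= MAX_STRING_BYTES


ascii_text = "a" * 32768        # 32K plain characters, 2 bytes each
too_long = "a" * 32769          # one character over
emoji = "\U0001F600"            # a single character outside the Basic Multilingual Plane
```

Here ascii_text fits exactly, too_long does not, and the emoji takes 4 bytes because UTF-16 represents it as a surrogate pair. Characters like that are why the limit is "32K or fewer" rather than exactly 32K.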
For Scruddle, I modified the business logic that inserts table entities to examine the string content length. If the length reaches a certain threshold (in Scruddle’s case, 32K or higher), the table entity’s content is stored in Blob storage instead of table storage. Simply set a boolean flag indicating where the content is stored, create and store a blob byte array, and null out the table entity’s content prior to storing the entity.
When retrieving the table entity, the boolean flag indicates whether the content is stored with that entity or must be retrieved from Blob storage. Once the content is in place, the entity is returned to the consuming processes.
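The insert and retrieve logic described above can be sketched as follows. This is a simplified Python illustration, not Scruddle's actual code: plain dictionaries stand in for table and Blob storage, and the names ContentInBlob, insert_entity and get_entity are assumptions made for the example.

```python
BLOB_THRESHOLD_CHARS = 32 * 1024  # Scruddle's threshold: 32K length or higher

# In-memory stand-ins for the two stores (assumption: real code would call the storage client).
table_store = {}
blob_store = {}


def insert_entity(key: str, content: str) -> None:
    """Store content in the entity, or in the blob stand-in when it is too long."""
    entity = {"Content": content, "ContentInBlob": False}
    if len(content) >= BLOB_THRESHOLD_CHARS:
        blob_store[key] = content.encode("utf-8")  # create and store the blob byte array
        entity["Content"] = None                   # null out the entity's content
        entity["ContentInBlob"] = True             # flag where the content lives
    table_store[key] = entity


def get_entity(key: str) -> dict:
    """Return the entity with its content filled in, wherever it was stored."""
    entity = dict(table_store[key])  # copy so the stored entity stays untouched
    if entity["ContentInBlob"]:
        entity["Content"] = blob_store[key].decode("utf-8")
    return entity


insert_entity("tweet", "short status message")
insert_entity("article", "x" * 40000)
```

Consumers only ever call get_entity, so they never need to know which store held the content.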
Content Length (Byte Arrays)
Table storage also has a 64KB limit for byte arrays. If binary content might exceed 64KB, the same approach used for Scruddle’s string storage applies. Use a boolean flag to indicate whether the content is stored in the table entity or in Blob storage, and insert and retrieve the content accordingly.
It is very important to realize that table storage has no decimal property type, so decimal data must be stored in the double data type, not the decimal data type. That holds true for any kind of decimal data, from currency to geospatial coordinates.
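Because a double is a binary floating point value, some care is needed when converting back. The Python sketch below is only an illustration under my own assumptions: the helper names are hypothetical, and rounding back to a known number of decimal places is one possible way to recover the original value, not something table storage does for you.

```python
from decimal import Decimal


def to_table_double(value: Decimal) -> float:
    """Convert a decimal value to a double before storing it (hypothetical helper)."""
    return float(value)


def from_table_double(stored: float, places: int) -> Decimal:
    """Convert a stored double back to a decimal, rounded to a known scale."""
    scale = Decimal(1).scaleb(-places)  # e.g. places=2 gives Decimal('0.01')
    return Decimal(repr(stored)).quantize(scale)


price = Decimal("19.99")
latitude = Decimal("37.774929")

price_back = from_table_double(to_table_double(price), 2)
latitude_back = from_table_double(to_table_double(latitude), 6)

# Arithmetic done directly on doubles can drift, which is why totals are
# better computed after converting back to decimal.
drifts = (0.1 + 0.2) != 0.3
```

Both values survive the round trip here because the scale is known in advance. Sums and other arithmetic performed on the raw doubles may not be exact.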
I have been extremely pleased with Azure table storage. In fact, I have been more impressed by the cost and flexibility of table storage than by SQL Azure. While reporting on table storage entities presents a different challenge than traditional database reporting, keeping large datasets in cloud storage is worth considering.
I’ll probably write another “lessons learned” post as Scruddle transitions to production. Until then, hopefully this information can keep you from banging your head against the wall.