Friday, November 14, 2014

Sitecore Item Serialisation for Memcache caching

I previously mentioned that serialising Items for caching in a mechanism such as memcache, which can only store serialised text or binary data, was not possible. I have recently realised this is not the case.

After a fair amount of trial and error, and no small amount of help from Sitecore support, I finally have a working serialisation/deserialisation mechanism for Sitecore items. This mechanism was put into production for EWN a few weeks ago, and I'm happy to report that it's performing exactly as I hoped it would. CPU time for Lucene calls was already down due to caching of the Hits data (referred to in my previous post), but database access was still an issue, as the hits still needed to be converted into Sitecore Items. Without a way to cache these directly in shared cache, I was forced to rely on the Sitecore Item cache, which meant that each individual Content Delivery frontend had to perform its own DB queries to get this data and cache it, multiplying DB calls and RAM usage by the number of frontends involved.

Serialising an item comes down to a very simple concept: Sitecore has existing serialisation logic, which you merely need to hijack to produce a text version of an item. You convert the Item object into a SyncItem, and then serialise it to a StringWriter that wraps a StringBuilder. You can then output the raw string produced, and cache it :

// Namespaces this helper relies on
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using Sitecore.Data.Fields;
using Sitecore.Data.Items;
using Sitecore.Data.Serialization;
using Sitecore.Data.Serialization.ObjectModel;

public static string SerializeItem(Item item)
{
    StringBuilder sb = new StringBuilder();

    try
    {
        // The using block disposes the writer even if serialisation throws
        using (StringWriter sw = new StringWriter(sb))
        {
            SyncItem syncItem = ItemSynchronization.BuildSyncItem(item);

            // If no versions are reported, fall back to the item that was passed in
            List<Item> tempVersions = item.Versions.GetVersions().ToList();
            if (tempVersions.Count == 0)
            {
                tempVersions.Add(item);
            }

            foreach (Item version in tempVersions)
            {
                // Copy every field of this version into the SyncItem
                SyncVersion sv = new SyncVersion();
                foreach (Field field in version.Fields)
                {
                    SyncField sf = new SyncField();
                    sf.FieldID = field.ID.Guid.ToString();
                    sf.FieldKey = field.Key;
                    sf.FieldName = field.Name;
                    sf.FieldValue = field.Value;

                    sv.Fields.Add(sf);
                }

                syncItem.Versions.Add(sv);
            }

            syncItem.Serialize(sw);
        }
    }
    catch (Exception ex)
    {
        Sitecore.Diagnostics.Log.Warn(String.Format("Error in ItemHelper.SerializeItem Message '{0}' stacktrace '{1}'", ex.Message, ex.StackTrace),
            "ItemHelper");

        // Don't hand back (and cache) a partially written item
        return string.Empty;
    }

    return sb.ToString();
}

Note that you can either cache a specific version of the item, or multiple versions. If I can't detect multiple versions, I merely use whatever version the item I passed in represents.

At this point it's important to note that your Sitecore settings in the web.config file affect this process. By default (via a hidden config value), standard values are not included in the serialised data. While this makes sense in terms of serialising as a backup of item data on a hard drive (for re-importing into the DB later), this is not what we want. We need the full structure of the item, without any need for a DB call. If the deserialised item needed a DB call to obtain the fields in the template's standard values, this won't work for us at all. This mechanism needs to be completely independent of the DB in order to work. Without this critical detail, when an Item object is rebuilt when our data is deserialised, these fields are missing, and later operations on the Item object result in a number of errors.

Add the following settings to the <settings> section of your web.config file to ensure that default and standard values are also serialised (I'm not sure what the distinction between the two is) :

      <setting name="ItemSerialization.AllowStandardValues" value="true" />
      <setting name="ItemSerialization.AllowDefaultValue" value="true" />

So now you have your serialised data, which you can store in memcached, Redis, or whatever text-based caching mechanism you want to use. When you store it into cache, make sure to also cache the item's ItemUri. This is important for reasons that will be explained later. I wrap all of this in a class that I use specifically for caching items :

[Serializable]
public class ItemCacheFormat
{
    public ItemCacheFormat(Item item)
    {
        ItemUri = item.Uri;
        SerialisedData = ItemHelper.SerializeItem(item);
    }

    public ItemUri ItemUri { get; set; }
    public string SerialisedData { get; set; }

    [XmlIgnore]
    [NonSerialized()]
    private Item _item = null;
    public Item GetItem()
    {
        if (_item == null)
        {
            _item = ItemHelper.DeserializeItem(SerialisedData, ItemUri);
        }
        return _item;
    }
}

Note that I use the GetItem() call to obtain the actual Item data on the other side, and I have specifically told my caching code to ignore this data, and only cache the SerialisedData and ItemUri Properties.
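
As a rough illustration of how the wrapper might travel through memcached, here is a minimal sketch using the Enyim client mentioned in my previous post. The ItemCache class name and the key you pass in are hypothetical, and I'm assuming the client is configured through the usual enyim.com/memcached config section and that ItemCacheFormat (including its ItemUri property) goes through the client's default binary serialisation. Treat it as a starting point rather than the exact code running in production :

```csharp
using System;
using Enyim.Caching;
using Enyim.Caching.Memcached;
using Sitecore.Data.Items;

// Hypothetical helper: stores and retrieves items wrapped in ItemCacheFormat.
public static class ItemCache
{
    // Assumption: servers are read from the enyim.com/memcached config section.
    private static readonly MemcachedClient Client = new MemcachedClient();

    public static bool Store(string key, Item item, TimeSpan validFor)
    {
        if (item == null)
        {
            return false;
        }

        // The wrapper serialises the item in its constructor,
        // so only the string data and the ItemUri end up in the cache.
        return Client.Store(StoreMode.Set, key, new ItemCacheFormat(item), validFor);
    }

    public static Item Get(string key)
    {
        // Get<T> returns null on a cache miss
        ItemCacheFormat cached = Client.Get<ItemCacheFormat>(key);
        return cached == null ? null : cached.GetItem();
    }
}
```

The validFor value puts an upper bound on how long a cached copy can outlive changes to the real item, so it's worth choosing deliberately.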

On the other side, you perform the same process in reverse. Build a SyncItem from the serialised data, and then use this to create an Item object. When doing so, you need the item's version, language, and source database name, and these values are not contained in your serialised data. This is why you need to cache the item's Uri alongside this data. An ItemUri object contains all of this data :

public static Item DeserializeItem(string serialized, string versionString, string languageString, string databaseString)
{
    Version version = new Version(versionString);
    Language language = LanguageManager.GetLanguage(languageString);
    Database database = Database.GetDatabase(databaseString);

    return DeserializeItem(serialized, version, language, database);
}

public static Item DeserializeItem(string serialized, ItemUri itemUri)
{
    Database db = Database.GetDatabase(itemUri.DatabaseName);
    return DeserializeItem(serialized, itemUri.Version, itemUri.Language, db);
}

public static Item DeserializeItem(string serialized, Version version, Language language, Database database = null)
{
    if (string.IsNullOrEmpty(serialized))
    {
        Sitecore.Diagnostics.Log.Warn("Nothing to deserialize", "ItemHelper.DeserializeItem");
        return null;
    }
    try
    {
        // Dispose the reader once the SyncItem has been parsed and converted
        using (var reader = new StringReader(serialized))
        {
            var token = new Tokenizer(reader);

            var syncItem = SyncItem.ReadItem(token);

            return DeserializeItem(syncItem, version, language, database);
        }
    }
    catch (Exception ex)
    {
        Sitecore.Diagnostics.Log.Warn(String.Format("Error in ItemHelper.DeserializeItem Message '{0}' stacktrace '{1}'", ex.Message, ex.StackTrace),
            "ItemHelper.DeserializeItem");
        return null;
    }
}

public static Item DeserializeItem(SyncItem syncItem, Version version, Language language, Database database = null)
{
    try
    {
        var itemID = new ID(syncItem.ID);

        var templateID = new ID(syncItem.TemplateID);

        var branchId = new ID(syncItem.MasterID);

        if (database == null)
        {
            database = Database.GetDatabase(syncItem.DatabaseName);
        }

        var itemName = syncItem.Name;

        FieldList fieldList = new FieldList();

        ID fieldID;

        // Shared fields apply to every version and language
        foreach (SyncField sharedField in syncItem.SharedFields)
        {
            fieldID = new ID(sharedField.FieldID);
            fieldList.Add(fieldID, sharedField.FieldValue);
        }

        var versAsString = version.ToString();

        // Find the version matching the requested language and version number
        SyncVersion syncVersion =
            syncItem.Versions.FirstOrDefault(
            syncVers => (syncVers.Language == language.Name) && (syncVers.Version.Equals(versAsString)));

        // Sometimes there are no matches, because the deserialised result contains only one version, and its version information is blank.
        // In this case, just use that single version, as it's obviously what we're after
        if ((syncVersion == null)
            && (syncItem.Versions.Count == 1)
            && (syncItem.Versions[0].Version.Equals("")))
        {
            syncVersion = syncItem.Versions[0];
        }

        if (syncVersion != null)
        {
            // Only versioned fields that actually have a value are copied across
            foreach (SyncField syncField in syncVersion.Fields)
            {
                fieldID = new ID(syncField.FieldID);
                if (!string.IsNullOrEmpty(syncField.FieldValue))
                {
                    fieldList.Add(fieldID, syncField.FieldValue);
                }
            }
        }

        var definition = new ItemDefinition(itemID, itemName, templateID, branchId);

        var itemData = new ItemData(definition, language, version, fieldList);

        var res = new Item(itemID, itemData, database);

        return res;
    }
    catch (Exception ex)
    {
        Sitecore.Diagnostics.Log.Warn(String.Format("Error in ItemHelper.DeserializeItem Message '{0}' stacktrace '{1}'", ex.Message, ex.StackTrace),
            "ItemHelper.DeserializeItem");
    }

    return null;
}

I've written multiple overloads for the deserialise method, so there are a few different ways to call it. The most obvious is to just call it with the serialised data and the ItemUri object, and let the overloaded methods do the work for you. You'll note that this is how the GetItem() method in the ItemCacheFormat class calls it.

...and that's pretty much it. You can cache lists of items, or build this logic into whatever makes sense in terms of the caching requirements for your site. All you need to do is wrap the items in something similar to my ItemCacheFormat class before caching, and use something similar to the GetItem() call to rebuild the Item data on the other side.

The last thing I need to mention is that your solution might behave a little differently after you implement this logic. For example, if you unpublish an item, or completely delete it from the master/web databases, it could still be visible on your site until the cache expires. The same issue applies to changes made to items. So be sure that content is in the state you want it to be before storing it in cache.

If you have any issues or questions, comment below. Good luck :)

Saturday, July 5, 2014

Memcached shared cache for Lucene hits - saving CPU across frontends

Anyone who has spent even a small amount of time on any form of website development knows that scalability is always a massive concern. Anything you can do to bring down CPU usage and I/O as your number of concurrent users increases, for example, is of great value.

A website running on the Sitecore CMS is no different. CPU load from work such as Lucene searches can quickly add up. Unless you make a concerted effort to keep this CPU load down, you quickly hit a hard cap of how many concurrent users you can handle, and your site crashes.

In the event that you have multiple content delivery servers serving your website, via some sort of load balancing strategy, you’ll find that your frontends end up duplicating work. For example, if a new item is published, which causes Lucene indexes to update, and this changes the potential result of a Lucene search that populates some part of your site, each frontend would have to clear their HTML cache, and perform the Lucene search to obtain this new result.

In cases like this, it would be preferable to have only one of the content delivery servers do this work, and place it in a cache where the other servers can look for it before duplicating the work.
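
The idea of "look in the shared cache first, and only do the work on a miss" can be sketched roughly as follows. Everything here is hypothetical and purely for illustration: ISharedCache is not part of any library, and InMemorySharedCache is a stand-in (it ignores expiry) so that the sketch runs on its own. In a real setup the interface would be backed by the shared cache service instead :

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over a shared cache; not part of any library.
public interface ISharedCache
{
    bool TryGet<T>(string key, out T value);
    void Set<T>(string key, T value, TimeSpan validFor);
}

// In-memory stand-in so the sketch is runnable. It ignores validFor.
public class InMemorySharedCache : ISharedCache
{
    private readonly Dictionary<string, object> _store = new Dictionary<string, object>();
    private readonly object _sync = new object();

    public bool TryGet<T>(string key, out T value)
    {
        lock (_sync)
        {
            object raw;
            if (_store.TryGetValue(key, out raw) && raw is T)
            {
                value = (T)raw;
                return true;
            }
        }

        value = default(T);
        return false;
    }

    public void Set<T>(string key, T value, TimeSpan validFor)
    {
        lock (_sync)
        {
            _store[key] = value;
        }
    }
}

public static class SharedCacheExtensions
{
    // Returns the cached value if present; otherwise runs the expensive
    // work once, stores the result for the other servers, and returns it.
    public static T GetOrAdd<T>(this ISharedCache cache, string key, TimeSpan validFor, Func<T> compute)
    {
        T cached;
        if (cache.TryGet(key, out cached))
        {
            return cached;
        }

        T fresh = compute();
        cache.Set(key, fresh, validFor);
        return fresh;
    }
}
```

Note that two servers that miss at the same moment will both do the work in this sketch. That is usually acceptable, since the duplication only lasts until the first result lands in the cache.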

The memcached shared cache framework works amazingly for this sort of sharing. The memcached service (https://code.google.com/p/memcached/) handles the caching of serialised data, and the Enyim Memcached library (https://github.com/enyim/EnyimMemcached) allows inserting into and retrieving from this cache within .Net code.

It’s worth noting that the memcached server is happiest running in a Linux environment. The stable .Net implementations I’ve found thus far have all been a few versions behind their Linux counterparts, and take some effort to get working. So while I have been able to get it to work as a Windows service, it’s easier (and perhaps even recommended) to run it on a cluster of Linux servers instead.

Memcached is essentially a very simple key-value store, similar to the ASP.Net cache, with a few small (but important) differences :
1. Memcached can only store data that can be serialised, either as binary or text. Do not be fooled into thinking that any object can be serialised this way: many cannot, and Sitecore Item objects are among them.
2. Memcached is independent of ASP.Net and IIS, and as such is not cleared when an app pool recycles, or IIS is reset.
3. By having a cluster of memcache servers, shared between a number of application webservers, the cache is shared between those frontends, rather than each having their own isolated (and usually duplicated) cache. Also, the RAM requirement is split between all of these servers, rather than being multiplied by the number of servers. So for example, in a cluster of 5 Sitecore Content Delivery frontends, with 5 memcache servers, with a requirement for 10GB of cached data, each server will store 2GB of cache, rather than the full 10GB. This is where memcache is dramatically superior to the built-in Sitecore caches, and the ASP.Net cache.

The primary issue is the serialisation of data. The point here is to figure out exactly what you’re trying to cache. What exactly are you trying to save on? Do you really need the exact Sitecore Item object, or is the intention more to save on the CPU time required to do Lucene searches? Sitecore’s Item cache works extremely well to cache the underlying DB Item data, so generally DB I/O isn’t an issue. I’ve spent a great amount of time over the last year or so trying to solve the issue of Sitecore Item serialisation, and I eventually came to the conclusion that while it should be technically possible, the performance gain from it would not be enough to make the stability risk worthwhile.

So, you don’t need to cache the Sitecore Item objects. What you need instead is a representation of them, stored in text, that you can use to retrieve them later. Essentially, you need to know what Sitecore DB they came from, and an Item ID. The Sitecore Item URI contains both of these, so you just need to store the value in the Uri property of the Sitecore Item class. In fact, it’s this value that is stored in Lucene indexes, and is exactly how Sitecore retrieves items from the DB, as Lucene indexes do not store the raw item data :

StoreRawObjectInCache(key, item.Uri);

To retrieve the item from this value on the other side, you just need to take the URI object from memcache, get the database, using the Database property, and then call GetItem, passing in the DataUri variant of the ItemUri :

Sitecore.Data.ItemUri uri = (Sitecore.Data.ItemUri)GetRawObjectFromCache(key);
if (uri != null)
{
    Sitecore.Data.Database db = Sitecore.Data.Database.GetDatabase(uri.DatabaseName);
    if (db != null)
    {
        Sitecore.Data.Item item = db.GetItem(uri.ToDataUri());
        if (item != null)
        {
            return item;
        }
    }
}
return null;

You can also use this for your Lucene hits, which are also mostly just a representation of Sitecore Items.

You can also expand this to store collections of items, which is how you can cache the results of Lucene queries.
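
A collection version might look something like the sketch below. I'm calling the Enyim client directly here in place of the StoreRawObjectInCache/GetRawObjectFromCache helpers used above, and the ItemListCache class name is hypothetical. It also assumes the client is configured through the enyim.com/memcached config section, and that a List of ItemUri objects survives the client's default serialisation in the same way a single ItemUri does :

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Enyim.Caching;
using Enyim.Caching.Memcached;
using Sitecore.Data;
using Sitecore.Data.Items;

// Hypothetical helper: caches a result set as a list of ItemUri values.
public static class ItemListCache
{
    // Assumption: servers are read from the enyim.com/memcached config section.
    private static readonly MemcachedClient Client = new MemcachedClient();

    public static bool StoreItems(string key, IEnumerable<Item> items, TimeSpan validFor)
    {
        // Only the URIs are cached, never the Item objects themselves
        List<ItemUri> uris = items.Where(i => i != null).Select(i => i.Uri).ToList();
        return Client.Store(StoreMode.Set, key, uris, validFor);
    }

    public static List<Item> GetItems(string key)
    {
        List<ItemUri> uris = Client.Get<List<ItemUri>>(key);
        if (uris == null)
        {
            return null; // cache miss: the caller should run the real query
        }

        var items = new List<Item>();
        foreach (ItemUri uri in uris)
        {
            Database db = Database.GetDatabase(uri.DatabaseName);
            Item item = db != null ? db.GetItem(uri.ToDataUri()) : null;

            // Items that have since been removed are silently skipped
            if (item != null)
            {
                items.Add(item);
            }
        }

        return items;
    }
}
```

Returning null on a miss, rather than an empty list, lets the caller tell "nothing cached yet" apart from "the query genuinely had no results".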

Once we set this up in our Content Delivery environment, we immediately noticed a drop in CPU usage across all CD servers, as well as a drop in RAM usage. This allowed us to become a lot more aggressive in terms of what we allow our code to store in memcache (due to all the extra available RAM), which in turn led to further performance gains.