Introduction
I wanted to explore whether a database such as CouchDB or MongoDB was the solution to a data integration problem. (If you are unfamiliar with these databases, I highly recommend the book "NoSQL Distilled" by Martin Fowler and Pramod Sadalage.)
I wanted to get to the exploration phase quickly, so I didn't want to spend a lot of time (if any) developing domain models and the like for a skunkworks project like this one. The overarching goal was to expose the integrated data as easily consumable endpoints in a heterogeneous environment, but any deep discussion of service-oriented architecture was premature at this early analytic phase.
Implementation
Accessing the initial seed data was easy. The source systems were Microsoft SQL Server, so Entity Framework was used to extract the data. ExpandoObjects came into play at this point.
Commonality-variability analysis revealed a core set of fields shared between the data sources. My idea was that elements common across data sources, such as names and birth dates, would be persisted using the same set of identifiers. Elements specific to a particular source, such as maiden name or favorite color, could be picked up if needed based on a value indicating the source of the data. That's the beauty of a schemaless database: even though the items don't all have the same 'shape', they can still be stored together and queried.
At this point an example would probably help. The GetPeople method below shows the use of an ExpandoObject to capture and reconcile common data elements. In other words, whether the original data was stored as 'firstname', 'first_name', or 'fname', it is saved to the integration database using the 'FirstName' property. For the initial test phase I took the approach of testing whether the source data existed before creating a property, as in the checks around MiddleName, DOB, and Sex below. It is quite possible that the proper approach is to create the properties regardless of whether the source values exist, so that there is a degree of uniformity to the data (a sketch of that alternative follows the listing).
public IEnumerable<ExpandoObject> GetPeople()
{
    var peopleList = new List<ExpandoObject>();
    using (var data = new sourceData.data_Entities())
    {
        var people = from p in data.clients
                     select p;
        if (people != null)
        {
            foreach (var person in people)
            {
                dynamic expando = new ExpandoObject();
                expando.FirstName = person.firstname;
                expando.LastName = person.lastname;
                if (!string.IsNullOrWhiteSpace(person.mi))
                    expando.MiddleName = person.mi;
                if (person.birthdate.HasValue)
                    expando.DOB = person.birthdate.Value;
                if (!string.IsNullOrWhiteSpace(person.gender))
                    expando.Sex = person.gender;
                expando.SourceSystemId = person.clientid;
                expando.SourceSystem = "Source1";
                peopleList.Add(expando);
            }
        }
    }
    return peopleList.AsEnumerable();
}
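As a sketch of that alternative (an assumption on my part, not something I have load tested), the conditional section of the foreach loop would simply assign every property, storing nulls where the source has no value; whether nulls round-trip to the database the way you want is worth verifying against the driver.

// Alternative sketch: always create the properties so every document has the
// same shape, with nulls where the source value is missing.
dynamic expando = new ExpandoObject();
expando.FirstName = person.firstname;
expando.LastName = person.lastname;
expando.MiddleName = string.IsNullOrWhiteSpace(person.mi) ? null : person.mi;
expando.DOB = person.birthdate;      // Nullable<DateTime>, left null when absent
expando.Sex = string.IsNullOrWhiteSpace(person.gender) ? null : person.gender;
expando.SourceSystemId = person.clientid;
expando.SourceSystem = "Source1";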
Once the data was captured, the returned collection was passed to the following method for upload to the database. In the example below, MongoDB is the destination. One nice side effect of using a list of ExpandoObjects is that this method can be reused, because the input argument isn't strongly typed (a quick usage sketch follows the listing).
private static void UploadRecords(IEnumerable<ExpandoObject> records, string collectionName = "people")
{
    if (records != null)
    {
        var stopwatch = GetAndStartStopwatch();
        MongoServer server = MongoServer.Create(connectionString);
        MongoDatabase db = server.GetDatabase("testdb");
        MongoCollection<BsonDocument> people = db.GetCollection<BsonDocument>(collectionName);
        List<BsonDocument> uploadDocs = new List<BsonDocument>();
        IEnumerable<SafeModeResult> results = null;
        foreach (var record in records)
        {
            BsonDocument doc = new BsonDocument();
            doc.Add(record as IDictionary<string, object>);
            uploadDocs.Add(doc);
        }
        if (uploadDocs.Count > 0)
            results = people.InsertBatch(uploadDocs);
        stopwatch.Stop();
        log.InfoFormat("Records: {0}. Elapsed time: {1}", uploadDocs.Count, stopwatch.Elapsed);
    }
}
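Here's a quick usage sketch; the repository class names below are hypothetical stand-ins for wherever the extraction methods end up living, but the point is that both calls compile against the same UploadRecords signature.

// Hypothetical usage: both extractors return IEnumerable<ExpandoObject>,
// so the same upload method serves either source system.
UploadRecords(new Source1Repository().GetPeople());           // GetPeople from the listing above
UploadRecords(new Source2Repository().GetPeople(), "people"); // a second, yet-to-be-written extractor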
In order to use the efficient InsertBatch method, I needed to build up an IEnumerable<BsonDocument>. A BsonDocument is "a collection of name/value pairs", and ExpandoObject implements the IDictionary<string, object> interface, so I could loop through the source data and add each item to a BsonDocument (see the foreach loop in UploadRecords above). Each new BsonDocument was then added to a List<BsonDocument>, and that list was passed to InsertBatch.
After inserting all the data, I was ready to start querying the database. Here's an example of a couple of different items, or "documents" as they are called in MongoDB; both are from the same collection. (A small C# query sketch follows the shell output.)
> db.people.findOne({"SourceSystem":"Source1"})
{
"_id" : ObjectId("40842293c1199d1t7k6f0add"),
"FirstName" : "JANE",
"LastName" : "DOE",
"MiddleName" : "X",
"DOB" : ISODate("9999-01-01T06:00:00Z"),
"Sex" : "FEMALE",
"SourceSystemId" : 123,
"SourceSystem" : "Source1"
}
> db.people.findOne({"SourceSystem":"Source2"})
{
"_id" : ObjectId("309423b4c1299d1x7c6f155f"),
"FirstName" : "John",
"LastName" : "Doe",
"SourceSystemId" : 321,
"SourceSystem" : "Source2",
"Sex" : "U"
}
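The same kind of lookup can be done from C#. Here is a minimal sketch using the 1.x-era driver API already shown in UploadRecords (Query.EQ comes from MongoDB.Driver.Builders); the method name is mine, and the database and collection names mirror the earlier code.

private static void PrintFirstMatch(string sourceSystem)
{
    MongoServer server = MongoServer.Create(connectionString);
    MongoDatabase db = server.GetDatabase("testdb");
    MongoCollection<BsonDocument> people = db.GetCollection<BsonDocument>("people");

    // Equivalent of db.people.findOne({"SourceSystem": sourceSystem}) in the shell.
    BsonDocument doc = people.FindOne(Query.EQ("SourceSystem", sourceSystem));
    if (doc == null)
        return;

    Console.WriteLine(doc["FirstName"].AsString);

    // Source-specific fields are read defensively, since not every document has them.
    if (doc.Contains("MiddleName"))
        Console.WriteLine(doc["MiddleName"].AsString);
}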
Summary
So far my testing has been going well. Both CouchDB and MongoDB perform well for what I am doing. It's also really easy to make sweeping changes when I want to try something else: if I need to change the shape of my expandos, I just delete the database and start from scratch. (Almost TOO easy, in fact; Futon really needs an optional confirmation dialog.) The bottom line is that I saved a lot of time using ExpandoObjects. I haven't spent days crafting tailor-made objects just to test some theories. While working with CouchDB I was also surprised (though I guess I shouldn't have been) at how easily RestSharp knew what to do with ExpandoObjects.
Next steps are to expose the data to a variety of clients and test performance. I have been using ASP.NET Web API for some time, so I can quickly get XML and JSON output rolling now that I have a populated database. I also need to focus on the variability side of things by loading the 'uncommon' fields into my test collections.
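To sketch where that's headed, a Web API controller can hand the expandos straight back and let content negotiation do the rest. PeopleRepository and GetAllPeople below are placeholders for code I haven't written yet; JSON output (via Json.NET) handles ExpandoObject gracefully out of the box, while the default XML formatter may need extra work for dynamic types, so that side still needs testing.

using System.Collections.Generic;
using System.Dynamic;
using System.Web.Http;

public class PeopleController : ApiController
{
    // GET /api/people under the default {controller} route.
    public IEnumerable<ExpandoObject> Get()
    {
        // Placeholder repository wrapping the MongoDB collection populated earlier.
        return new PeopleRepository().GetAllPeople();
    }
}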
Would I recommend using this approach for a production system? That really depends on the intent of the system. By using ExpandoObjects you aren't able to take advantage of IntelliSense in Visual Studio, and while that might not be a deal breaker, it might seriously impact code maintainability.