After writing my last post on MongoDB, I attended a meet-up at the Mozilla office in San Francisco to hear the tale of a real company in the process of migrating from Microsoft SQL Server to MongoDB.

The company, Harmonic, sells enterprise software for managing workflows around video. Videos come in, go through checks, conversions, and other processing, and get distributed over multiple channels.  (Ok, vast simplification, but that’s the gist of it.) The architecture consists of a GUI and other management tools on the front-end, a set of services for processing videos on the back-end, and a workflow engine that orchestrates the process. The workflow engine stores its state in a database, and that database had been Microsoft SQL Server.

The marketing staff demanded that engineering reduce complexity for customers, increase scalability, and keep costs low. Nick Vicars-Harris, who manages the Harmonic engineering team, experimented with MongoDB. It took just a few days to tweak the data layer, written in C# and utilizing LINQ, to work with MongoDB rather than SQL Server. According to Vicars-Harris, Harmonic removed code that had been needed for object relational mapping, refactored, and produced more intuitive code. Rather than normalizing workflow state across over twenty tables, Harmonic could now store each job and its related tasks in a single document. In addition to removing complexity, the solution passed the test for scalability.

Harmonics also took advantage of the MongoDB simplified deployment model to create what Vicars-Harris calls smart nodes, nodes that communicate with each other and self-configure, a solution that met the requirements for simplified deployment and maintenance.

After listening to the presentation, I was impressed with the ease of transition from SQL to NoSQL. Clearly, the workflow use case fits in well with document-oriented databases.

MongoDB is one of the most popular of open source NoSQL databases. Supported by 10gen and boasting a long list of deployments including Disney, Craigslist, and SAP, MongoDB has a remarkably simple application programming interface (API) and all the tools necessary for massive scalability.

MongoDB is a document-oriented NoSQL data store, but the word document might mislead. Really, MongoDB stores objects much like an object database. More technically, MongoDB stores BSON documents, which stands for Binary JSON. JSON stands for JavaScript Object Notation, which is a text-based format for storing structured data, much like XML but less verbose, with more colons and curly braces than angle brackets.

MongoDB stores each object as a document, which might be thought akin to a record in the relational world. While there is no schema in MongoDB, that is no defined set of columns, developers would normally store like objects (objects created from the same class) together in a collection, which might be thought of as a table. A database would typically consist of several such collections. And while there are no relationships between collections, a JSON object can itself contain a hierarchy of other JSON objects or fields that link to objects in other collections, so it is possible to model a variety of relationships.

Developers access MongoDB through a driver, which maps a MongoDB document to a familiar construct in the developer’s language of choice. JSON objects, as the name implies, map easily to JavaScript objects. In Python, a document maps to a dictionary. In C#, a document maps to a specially defined class called a BsonDocument. Regardless of the language, the API is quite straight forward.

Object databases remove the grunt work of mapping classes to relational data models. But object databases never caught on, perhaps because of the difficulty of ad hoc queries. MongoDB does provide a variety of query mechanisms. It allows for queries by object properties, queries based on regular expressions, and complex queries using JavaScript methods. Whether any of these approaches satisfies requirements for ad hoc queries depends on the specific application scenario.

MongoDB achieves scalability through sharding, which divides objects in a collection between different servers based on a key. If the developer defines zip code as the shard key, for example, a customer object with a New York City zip code might be stored on a different server than a customer object with a San Francisco zip code. MongoDB handles the work of distributing the data. Note that sharding is quite distinct from replication. Each shard in a MongoDB cluster could be configured to replicate to one or more slave instances, a process that uses logs in much the same way as relational databases. And while applications could read from these slaves to improve performance, the primary purpose of replication is reliability.

Reliability brings to mind transactions. Relational databases support transactions across multiple records and tables; MongoDB restricts transactions to single documents. While this might appear overly restrictive, it could be made to work in many scenarios. Recall that a document can include a hierarchy of objects. If an application’s data is modeled such that all of the data requiring all-or-nothing modification resides within one document, the MongoDB approach would suffice.

However, this restriction on transactions calls attention to the design objectives of MongoDB: speed, simplicity, and scalability. Expanding transactions to encompass multiple objects stored across shards would significantly impact performance leading to complex dead-lock problems. Indeed, by default, a save function call to the MongoDB returns immediately to the application without waiting for confirmation that the save was successfully persisted. If a networking or disk failure prevents the write, the application continues without awareness of the error. However, the MongoDB API does provide options for safe writes that wait for a success response. There are even options to specify how many replication slaves must get updated before the save is considered a success. So while speed is the default, reliability is a possibility.

While I mentioned earlier that MongoDB provides drivers for a variety of languages, it holds particular appeal to JavaScript devotees. JSON documents were designed to store JavaScript objects. JavaScript is the MongoDB language for complex queries. And the command line tool for managing MongoDB is built on top of the JavaScript shell. So if you have mastered JavaScript for the coding of dynamic web pages, MongoDB provides an opportunity to expand its use.

I recommend visiting MongoDB.org and trying out the online shell. In a few minutes, you’ll get a sense of the API. Then take a look at the tutorial. And to experiment further, download and install MongoDB for yourself. (I managed to install it on Windows 7 in a few minutes, but somehow got stuck installing the package on Ubuntu.)

(This is the first in a planned series on NoSQL databases. See NoSQL: The Joy is in the Details.)

Whenever my wife returns excitedly from the mall having bought something new, I respond on reflex: Why do we need that? To which my wife retorts that if it were up to me, humans would still live in caves. Maybe not caves, but we’d still program in C and all applications would run on relational databases. Fortunately, there are geeks out there with greater imagination.

When I first began reading about NoSQL, I ran into the CAP Theorem, according to which a database system can provide only two of three key characteristics: consistency, availability, or partition tolerance. Relational databases offer consistency and availability, but not partition tolerance, namely, the capability of a database system to survive network partitions. This notion of partition tolerance ties into the ability of a system to scale horizontally across many servers, achieving on commodity hardware the massive scalability necessary for Internet giants. In certain scenarios, the gain in scalability makes worthwhile the abandonment of consistency. (For a simplified explanation, see this visual guide. For a heavy computer science treatment, see this proof.)

This initially led me to assume that NoSQL makes sense only for the likes of Facebook and Twitter. The rest of us who seek something less than world domination and who associate consistency with job security may as well stay within the safe and comfortable realm of relational database, which have certainly passed the test of time.

However, I’m starting to question that assumption. Clearly, relational databases still make sense for many applications, especially those requiring strict transactions and complex ad hoc queries. Relational databases will certainly remain the backbone of financial and ERP systems. But I’m now wondering whether NoSQL might fit quite well for many other applications.

When I say NoSQL, however, I’m not really saying anything. Once computer scientists freed themselves from the principles of relational databases, an astounding creativity burst forth. The only thing that NoSQL databases have in common is that they are not relational. So it is not a choice between SQL and NoSQL, but rather a choice between SQL and a wide diversity of other options.

Wikipedia does a good job categorizing the many NoSQL databases now available. But that should just be taken as a starting point. The only way to appreciate the range of choices is to explore each one, looking over its documentation, playing with code, and experimenting. The value of NoSQL is not in the theory, but in the specific character of each NoSQL database.

And so I plan to spend time this year exploring and posting about some of the many NoSQL options out there. I’ve already started a post on MongoDB. Stay tuned for more. And if you have any suggestions for which database I should look into next, please make a comment.

Cloud Foundry, the open-source Platform-as-a-Service (PaaS) solution from VMware, continues to gain community support and evolve toward a more diverse, enterprise-ready platform. Last night at the Silicon Valley Cloud Computing Group meet up in Palo Alto, VMware engineers and representatives from several community partners spoke of recent progress and future plans.

Building on Cloud Foundry’s extension framework for languages and services, Uhuru Software has added .Net support to Cloud Foundry, enabling .Net developers to create applications in Visual Studio and deploy them directly to a Cloud Foundry-based private or public cloud. AppFog has added PHP support, and ActiveState has added Perl and Python support. Jaspersoft has extended Cloud Foundry with BI support, including user-friendly wizards for building reports and dashboards and direct support for the document-oriented MongoDB as a data source.

Scalr, whose founder, Sebastian Stadil, organizes the cloud computing group, demonstrated tooling for deploying a Cloud Foundry cluster. The graphical, web-based tool builds up a configuration for each server and calls web services to spin up the instances.

And VMware itself continues to make major contributions to Cloud Foundry. Patrick Chanezon and Ramnivas Laddad from VMware demonstrated Micro Cloud Foundry, a Cloud Foundry cluster with all components running on one virtual machine. This capability makes it possible for developers to spin up a PaaS instance on their laptop, deploy an application, and debug. Using an Eclipse plugin, Laddad gave a demo of debugging a cloud application that mirrored the experience of debugging traditional applications.

During a closing panel, VMware and partner representatives clarified the distinction between CloudFoundry.org and CloudFoundry.com. CloudFoundry.org hosts the open-source software development project, which enables organizations to run their own private or public PaaS cloud. This codebase will grow to support multiple languages and services. CloudFoundry.com is a public instance of Cloud Foundry run by VMware. Like other hosted instances of Cloud Foundry, CloudFoundry.com supports only a subset of the languages and services provided for by the open-source Cloud Foundry code. Despite the expansion of the code base, hosting providers must limit their offerings to services that they have the operational expertise to support.

In light of the rapid growth and expanding ecosystem, Jeremy Voorhis, a senior engineer at AppFog, suggested that VMware create an independent governance body to direct the future development of Cloud Foundry and to mediate potential conflicts between contributors. A few meet-up participants supported the suggestion. While all agreed that VMware has done a fabulous job of starting the project and building an ecosystem, those who raised the suggestion were concerned that conflicts were inevitable and that it would be better to build up a governance system in preparation. Representatives from VMware responded that they did not oppose the idea but did not consider governance a priority given the platform’s early stage of development.

The meet up made one thing clear, that extensiblity (see my post from last May) has made Cloud Foundry into a dynamic platform that has caught the attention of the open-source community.

On the LinkedIn Mountain View campus, Anirudh Pandit, a senior architect at Oracle, spoke last night at the SVForum’s Software Architecture and Platform SIG on the topic of Extreme Transaction Processing in SOA. Pandit proposed an architectural pattern for improving SOA performance and reviewed several case studies in which the pattern dramatically improved performance in enterprise SOA implementations.

While no longer generating the buzz that it once enjoyed, SOA (Services Oriented Architecture) remains an important integration strategy at large enterprises, perhaps now enjoying more real success than it did at the peak of it is hype cycle. Though implementations often prove costly and difficult, ultimately SOA provides the most conceptually compelling means to cut through the complexity and diversity of enterprise systems—systems encumbered by a mix of legacy and web-based applications, of behind-the-firewall systems with the external systems of business partners, and of applications accumulated through long histories of mergers and acquisitions.

Despite its conceptual advantages, Pandit pointed out that many SOA implementations suffer from poor performance and a lack of scalability. As messages move from service to service through an orchestration, the work of serializing and deserializing XML messages creates processing bottlenecks and maintaining transactions via relational database and data persistence creates disk IO bottlenecks.

Pandit proposed a solution based on two changes to the traditional SOA architecture. First, rather than passing XML messages with repeated serializations and deserializations, pass a token that each service may use to retrieve the message. Second, instead of persisting the message as it moves from service to service, cache the message to memory and assure message integrity by synchronizing the cache across multiple machines, preferably across machines in different data centers.

The solution, Pandit explained, does not work in all circumstances. By avoiding persistence to disk, the solution might fail to meet compliance requirements. By synchronizing the cache across many machines and data centers, the solution might introduce network bottlenecks and latency issues. But where it does fit, Pandit concluded, the solution overcomes performance problems in a plug-and-play manner without a need for the costly redesign of services.

While most recent conversations on scalability revolve around NoSQL, Pandit’s presentation was a reminder that the intelligent use of caching remains a viable option.

Stay tuned to the Software Architecture and Platform SIG for future events and the slide deck of Pandit’s presentation (not yet available at time of writing). The next meeting will be on the use of Hadoop at LinkedIn.

While the advantages of agile appear obvious, I observe that many IT departments stick stubbornly to waterfall. Despite failure after failure, IT managers blame the people—staff members, vendors, users—rather than the process.

Waterfall organizes software projects into distinct sequential stages: requirements, design, coding, integration, testing. Despite its common sense appeal, waterfall projects almost always lead to unproductive conflicts. Most users cannot visualize a system specified in a requirements document. Rather, users discover the true requirements only late in the project, usually during user acceptance testing, at which point accommodating the request means revising the design, reworking the implementation, and re-testing.

In fact, users should change their minds. We all do as we learn. As users begin using a system, they begin to see more clearly how it could improve their jobs. We should welcome this learning. But under the waterfall approach, such discovery gets labeled as scope creep, it undermines the process, causes cost overruns, and leads to a frenzy of finger pointing.

Agile methods throw out the assumption that a complex system can be understood and fully designed and planned for up front. While there are a great diversity of agile methods, they almost all tackle complexity by breaking projects into small—one week to a month—iterations, each one of which delivers working software of value to a customer. Users get to work with and provide feedback on the system sooner rather than later.

So why do so many IT organizations insist on crashing over waterfalls again and again? Here are the most frequent refrains I hear from waterfall proponents:

“We need a fixed-fee contract to avoid risk.”

Fixed-fee contracts force projects into a waterfall approach that makes change cost prohibitive. While fixed-fee contracts in theory transfer risk from client to vendor, these contracts in reality increase the risk of a final product that meets specification but fails to achieve the business objectives. A far simpler alternative would be a time and material contract that requires renewal at the start of each iteration. The customer reduces risk by agreeing to less cost at each signing and by receiving working software sooner rather than later, which both achieves a quicker return on investment and assures the project remains in line with expectations.

“What’s this going to cost?”

To make the business case for a project, it is essential to nail down costs, which the waterfall method promises for an entire project as early as the requirements stage. And while I’m all for business cases, making decisions based on bogus data makes little sense. What we need is an agile business case, one that covers a single iteration and rests on meaningful cost and revenue data.

“I’m a PM and I need a schedule to manage to.”

Wedded to the illusion of predictability, many PMs identify their discipline with large, complex schedules, full of interdependent sequences of tasks stretching out over months, schedules so easy to create in MS Project but that so rarely work out in reality. These schedules, together with requirement documents the size of a New York City phone book, end up as just more road kill of the waterfall process.

 “Should we buy or build? We need a decision.”

Agile eschews big designs up front in favor of designs that evolve organically over time. But IT departments must often make certain big design decisions up front, such as whether to buy or build a system or component.

And, unfortunately, these sorts of decisions fall outside of the well-known agile frameworks such as Scrum or XP, which focus exclusively on software development. So it might be assumed that once a big up front decision is required, that agile no longer applies. Indeed, it does make sense that more requirements and more design is needed up front to avoid committing dollars to the wrong vendor.

Regardless, I’d argue that there are ways to keep a project agile even if certain decisions must be made up front.

1)      Keep in mind that there is no need to collect every requirement, just those needed for the decision.

2)      Take advantage of demo versions and commit a few early iterations to build out rapid prototypes on different products.

3)      Consider open source. Low cost means low commitment.

And once a decision to buy or build has been made, implementation can proceed according to agile principles. Most enterprise software requires substantial configuration and customization that can benefit as much as pure software development from agile approaches.

“Let’s just get on with the next project.”

Perhaps because of the pain and strained relationships that accompany failed waterfall projects, few organizations analyze the process that led them to failure. Everybody flees the sinking ship and moves onto the next project so fast that nobody reflects on the root causes of the disaster. I would think that if enough time were spent studying these root causes, it would be determined that the problem was the process, not the people.

Thoughts?

If you know of other objections to agile, I’d like to hear them. And if you have had success in introducing agile into IT organizations, please share your experiences.

Erik Bansleben, Program Development Director at the University of Washington, invited me and several other bloggers to participate in a discussion last Friday on UW’s new certificate program in cloud computing. Aimed at students with programming experience and a basic knowledge of networking, the program covers a range of topics from the economics of cloud computing to big data in the cloud.

I had a mixed impression of the curriculum. It looked a bit heavy on big data, an important cloud use case that would perhaps be better suited to a certificate program of its own. And the curriculum looked a bit light on platform-as-a-service and open source platforms for managing clouds such as OpenStack. Since the program is aimed at programmers, perhaps I’d organize it around the challenges of architecting applications to benefit from cloud computing.

But that is nitpicking. It is certainly important for universities to update course offerings based on current trends. And UW has put together an impressive panel of industry experts to guide the program.

Clearly, it’s difficult to put together a program around cloud computing, since there is so much debate around what constitutes cloud computing. And given that just about every tech firm out there has repositioned its products as cloud offerings, it would seem very difficult to achieve sufficient breath without giving up the opportunity for rigorous depth.

But any interesting new direction in the tech industry would pose such challenges. Indeed, perhaps it is these difficulties that make it worth tackling the topic.

For more information, see the UW website at http://www.pce.uw.edu/certificates/cloud-computing.html.

Let me say up front that I’m a biased reviewer. I attend the SF Bay Area Agile meet-up organized by Agile Learning Labs, where both authors work. In fact, I won a copy of this book at the last meet-up event. And I’m a big fan of Chris Sims. Chris has a knack for engaging a group in interactive exercises that not only make clear the practicality of agile principles but are also just plain fun.

The Elements of Scrum is a short and pleasant read, a wonderful metaphor for what it conveys about Scrum, that Scrum makes software development a joy. Chris and his co-author, Hillary Louise Johnson, introduce Scrum in a way that is clear and practical. The book introduces the principles of agile, the Scrum framework, and a host of supporting practices.

Chris and Hillary make a great point of rejecting the common myth that agile means no planning: “On an agile software project, the plan is all around you, in the form of user stories, backlogs, acceptance tests, and big, visible charts: these are all part of your communication-rich environment.”

I particularly liked the metaphor comparing software development to sailing: the only way to reach a destination is to adjust frequently for wind and current. While I don’t want to spoil the experience by quoting this section, suffice it to say that it vividly describes the need for agile.

Any organization embarking on Scrum should hand this book out to all stakeholders. Given its brevity and practicality, it might actually get read. And it is bound to generate enthusiasm.

And by the way, if you live in Silicon Valley, check out the agile meet-up at http://www.meetup.com/AgileManagers/.

The Wikipedia article on DevOps states well the problem that DevOps seeks to address:

Many organizations divide Development and System Administration into different departments. While Development departments are usually driven by user needs for frequent delivery of new features, Operations departments focus more on availability, stability of IT services and IT cost efficiency. These two contradicting goals create a “gap” between Development and Operations, which slows down IT’s delivery of business value.

But is this gap a real problem?

In some cases, friction between developers and system administrators, arising out of their different training, work routines, and priorities, can result in losses. I can think of projects in which the failures of developers and system administrators to communicate caused project delays. In my case, these projects involved exposing corporate data to partners over the internet.

Many of the approaches advocated by DevOps make a lot of sense: training developers to better understand production environments, more effective release management, scripting, and automation. I can think of IT departments that have become too dependent upon vendor-supplied GUIs for system administration, causing much wasted time on repetitive tasks. So there certainly is value to a renewed focus on scripting.

And many DevOps ideas, especially those described as agile infrastructure, tie in well with the concept of private cloud, which is all about elasticity and agility.

However, the notion that Dev and Ops should merge into one harmonious DevOps is neither practical nor desirable; the two groups stand for different values: developers desire rapid change and operations staff demand stability. In heterogeneous, complex IT environments, organizations need advocates for both change and stability. IT needs people at the table with diverse perspectives, even if that means the occasional heated debate. The natural animosity between Dev and Ops, with each wielding influence, creates a healthy system of checks and balances, assuring a better airing of concerns.

The gap between development and operations, the problem that DevOps sets out to solve, represents as much a benefit as a hindrance to IT.

Tuesday night I drove to Santa Clara for the Big Data Camp, learned more about Hadoop, and even ran into a few Dell colleagues. Thanks to Dave Nielson for organizing the camp and to Ken Krugler for his great overview of Hadoop.

While the phrase big data lacks precision, it is of growing importance to ever more enterprises. With data flooding in from web sites, mobile devices, and those ever present sensors, processing and deriving business value out of it all becomes ever more difficult. Data becomes big when it exceeds the storage and processing capabilities of a single computer. While an enterprise could spend lots of money on high-end, specialized hardware finely tuned for a specific database, at some point that hardware will become either inadequate or too expensive.

Once data exceeds the capabilities of a single computer, we face the problem of distributing its storage and processing. To solve this problem, big data innovators are using Hadoop, an open source platform capable of scaling across thousands of nodes of commodity hardware. Hadoop includes both a distributed file system (HDFS) and a mechanism for distributing batch processes. It is scalable, reliable, fault tolerant, and simple in its design.

Many enterprises, however, would only need to process big data periodically, perhaps daily or weekly. When running a big data batch job, an organization might want to distribute the load over hundreds or thousands of nodes to assure completion within a time window. But when the batch job completes, those nodes may not be needed for Hadoop until the time comes for the next batch job.

This use case—the dynamic expansion and contraction in the need for computing resources—fits well with the capabilities of cloud computing, whether private, public, or hybrid. (In this case, I mean cloud as in Infrastructure-as-a-Service.) A cloud infrastructure could spin up the nodes when needed and free the resources when that need goes away.  In public cloud computing, that freeing of resources would directly impact cost.

During his presentation, Ken Krugler suggested that we think of Hadoop as a distributed operating system in that Hadoop enables many computers to store and process data as if they formed a single machine. I’d add that cloud computing—virtualization, automation, and processes that enable an agile infrastructure—may be needed to complete this operating system analogy so that this distributed operating system not only manages distributed resources but does so for maximum efficiency across a multitude of use cases.

Extra Credit Reading (Dell Resources on Hadoop):

Philippe Julio, Hadoop Architecture, http://www.slideshare.net/PhilippeJulio/hadoop-architecture

Joey Jablonski, Hadoop in the Enterprise, http://www.slideshare.net/jrjablo/hadoop-in-the-enterprise

Aurelian Dumitru, Hadoop Chief Architect, Dell’s Big Data Solutions including Hadoop Overview, http://www.youtube.com/watch?v=OTjX4FZ8u2s

Together with a few colleagues from Dell. I'm on the left. That's Aurelian Dumitru (Hadoop Chief Architect) in the center. And Barton George (Dell Cloud Evangelist) is on the right.

That’s me on the left, Aurelian Dumitru, Dell’s Hadoop Chief Architect, in the center, Barton George, Director of Marketing for Dell’s Web & Tech vertical, on the right. Thanks to DJ Cline for the photo. See http://bit.ly/iHJJIl for more photos of the Big Data Camp.

Follow

Get every new post delivered to your Inbox.