Jan 6 2012
My first post on web technology talks about what we are trying to accomplish when building for the web. There are four ways we can break down the standard flow of client action/server action/result: delivering, serving, rendering and developing. This post focuses on delivering content by understanding the network. Why use a cdn? What’s all the fuss about connections and compressed static assets? The network is often overlooked, but understanding how it operates is essential for building high-performing websites. A 50ms rendering time with a 50ms db query is meaningless if it takes three seconds to download the page.
TCP: Know It.
Going from client to server and back again rests on the network and how well you use it. TCP dominates communication on the web and is worth knowing well. In order to send data from one point to another, a connection is established between the two points via a back-and-forth handshake. Once established, data flows between the two in a series of packets. TCP offers reliability by acknowledging receipt of every packet: for each packet received, an acknowledgement packet is sent back to the sender. The time it takes to go from one end to the other and back is called latency, or round-trip time. At any given time there are packets in flight awaiting acknowledgement of receipt. TCP only allows a certain number of unacknowledged packets in flight; this limit is called the window size. Connections start with small window sizes, but as more successful transfers occur the window size increases (known as slow start). This effectively increases bandwidth because more data is sent at once. The longer the latency, the slower the connection: if the window is full, the sender must wait for acknowledgements before sending more data. This is on top of the time it actually takes to send the packets. The best case scenario is low latency with large, full windows.
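A rough back-of-the-envelope sketch makes the relationship concrete: the throughput ceiling of a single connection is roughly the window size divided by the round-trip time (the numbers below are purely illustrative):

# Throughput ceiling of one TCP connection: window size / round-trip time.
# Numbers are illustrative, not measurements.
def max_throughput_kb_per_sec(window_kb, rtt_ms)
  window_kb / (rtt_ms / 1000.0)
end

puts max_throughput_kb_per_sec(64, 20)   # ~3200 KB/s on a 20ms round trip
puts max_throughput_kb_per_sec(64, 200)  # ~320 KB/s on a 200ms round trip
puts max_throughput_kb_per_sec(16, 200)  # ~80 KB/s before slow start widens the window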
Reliability
Reliable connections are also important; if packets are lost they must be resent. Retransmissions slow down transfers because tcp guarantees in-order delivery of data to the application layer. If a packet is dropped and needs to be resent nothing will be delivered to the application until that one packet is received. UDP, an alternative to TCP, doesn’t offer the same guarantees and assumes issues are dealt with at the application layer.
There is a good article by Phillip Tellis on understanding the network with JS which talks about data transfer and TCP. Wireshark is another great tool for analyzing packets across a network. You can actually view individual packets as they come and go and see how window size is scaling, view retransmissions, measure bandwidth, and examine latency.
The Importance of Connections
Establishing a connection takes time because of the handshake involved and latency considerations. A 100ms latency could mean more than 300ms before any data is even received, on top of any dns lookups and os overhead. Keeping a connection alive avoids this creation overhead. Connection pooling, for example, is a popular technique to manage database connections. A web server will talk to a database frequently; if the connection is already established there is no setup overhead when executing a new query, which can trim valuable time off of serving a request. Ping and traceroute are two worthwhile tools that examine latency and the “hops” packets take from one network to another as they travel from end to end.
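To sketch the idea (this isn’t any particular library’s API, and the host and port are hypothetical), a pool can be as simple as a thread-safe queue of already-established sockets:

require "socket"

# Minimal connection pool sketch: sockets are created once up front and
# reused, so each request skips the handshake and teardown overhead.
class TinyPool
  def initialize(size, host, port)
    @connections = Queue.new
    size.times { @connections << TCPSocket.new(host, port) }
  end

  def with
    conn = @connections.pop      # waits until a connection is free
    yield conn
  ensure
    @connections << conn         # hand it back for the next caller
  end
end

pool = TinyPool.new(5, "db.example.internal", 5432)  # hypothetical database host
pool.with { |conn| conn.write("ping\n") }            # placeholder for a real query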
Connections take time to create, have relatively limited availability and require overhead to manage. It may seem like a great idea to keep connections open for the long haul, but there is a limited number of connections a server can sustain. Concurrent connections is a popular benchmark which examines how many simultaneous connections can occur at any given time. If you hit that mark, new requests will have to wait until something frees up. The ideal situation is to pump as much as possible through a few connections so there are more available to others. If you can serve multiple files at once, great; but why keep six idle connections open between requests if you don’t need to?
On a side note, this is where server architecture comes into play. A server usually processes a request by building a web page from a framework. If this can be done asynchronously by the web server, or offloaded somewhere else, the web server can handle a higher number of sustained connections. The server can pick up a new connection while it waits for data to send on an existing one. We’ll talk more about this in another post. Fast server times also keep high concurrent connection counts from arising in the first place.
Optimizing the Network
Lowering latency and optimizing data throughput are what dominate delivery optimization. It is important to keep the data which flows between client and server to a minimum. Downloading a 100kb page is a lot faster than downloading a 1000kb page. Compressing static content like css and javascript greatly reduces payload, which is why tools like Jammit and Google Closure are so ubiquitous. These tools can also merge files; because of http chatter it is faster to download one larger file than several small ones. Remember the importance of knowing http? Each http request requires reusing or establishing a connection, sending request headers, having the server handle the request, and returning the response. Doing this once is better than twice. Most web servers can also compress http responses dynamically, and this should be enabled whenever possible.
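If you want to see the payload difference for yourself, a minimal sketch using Ruby’s built-in Zlib does the trick (the asset filename is hypothetical):

require "zlib"

css = File.read("application.css")   # hypothetical merged stylesheet
gzipped = Zlib::Deflate.deflate(css, Zlib::BEST_COMPRESSION)

puts "original:   #{css.bytesize} bytes"
puts "compressed: #{gzipped.bytesize} bytes"
puts "savings:    #{(100 - 100.0 * gzipped.bytesize / css.bytesize).round(1)}%"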
Latency can be lowered by using a content delivery network like Amazon Cloudfront or Akamai. They shorten the distance between request and response by taking content from your server and spreading it across their infrastructure all over the world. When a user requests a resource the cdn routes the request to the server with the lowest latency. A user in Japan can download a file from Japan a lot faster than from Europe. Shorter distance, fewer hops, fewer retransmissions. A good cdn strategy rests on how easy it is to push content to the cdn and how easy it is to refresh it. Both concerns should be well researched when leveraging a cdn. You don’t want stale css files in the wild when you release a new version of your app.
A WAN accelerator is also a cool technique. Let’s say you want to deliver a dynamic web page from the US to Tokyo. You could have that travel over the open internet on a high-latency connection. Or you could route the request through a data center in Tokyo with an optimized connection to the US. The user gets a low-latency connection to the Tokyo datacenter, which in turn has an optimized, high-bandwidth connection back to the US. This can greatly simplify the issues of running multiple data centers and keeping them in sync.
The Bottom Line
There’s a lot of effort underway in making the web faster by changing how tcp connections are leveraged on the web. Http 1.0 requires a new connection for every request/response, and browsers limit the number of parallel connections between client and server to between two and six. Http keep-alive and http pipelining offer mechanisms to push more content through existing connections. Rails 3.1 introduced http streaming via chunked responses: browsers can fetch assets in parallel with the main html response as soon as their tags appear in the response stream. Spdy, an effort by Google, is worth checking out: it proposes a multi-pronged attack to push as much as possible through a single connection. The docs also illustrate interesting pain points with the network on the web. The bottom line is simple: reduce the amount of data that needs to go from one place to another and make the travel time as fast as possible. Small amounts of data over existing parallel connections make a fast web.
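As a small illustration of connection reuse (the host and paths are made up), Ruby’s Net::HTTP keeps a single connection open for the duration of a start block, so several requests skip repeated handshakes:

require "net/http"

uri = URI("http://www.example.com/")

# One TCP connection, three requests: the connection stays open for the
# whole block instead of being re-established per request.
Net::HTTP.start(uri.host, uri.port) do |http|
  ["/", "/assets/app.css", "/assets/app.js"].each do |path|
    response = http.get(path)
    puts "#{path}: #{response.code} (#{response.body.to_s.bytesize} bytes)"
  end
end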
- This approach shouldn’t be limited to users and servers; optimizing network communication within your datacenter is extremely important. You have total control over your infrastructure and can tune your network accordingly. You can also choose your communication protocols: using something like thrift or protocol buffers can save a tremendous amount of bandwidth over xml-based web services on http.
Jan 4 2012
The web technology landscape is huge and growing every day. There are hundreds of options from servers to languages to frameworks for building the next big thing. Is it nginx + unicorn + rubinius or a node.js restful service on cassandra running with ember.js and html5 on the front end? Should I learn Scala or Python? What’s the best nosql database for a socially powered group buying predictive analysis real-time boutique mobile aggregator that scales to 100 million users and never fails?
It is true there are many choices out there, but web technology boils down to a very simple premise. You want to respond to a user action as quickly as possible, under any circumstance, while easily changing functionality. Everything from the technology behind this blog to what goes into facebook operates on that simple idea. The problem comes down to doing it at the scale and speed of the modern web. You have hundreds of thousands of users; you have hundreds of thousands of things you want them to see; you want them to buy, share, create, and/or change those things; you want to deliver a beautiful, customized experience; everything that happens needs to be instantaneous; it can never stop working; and the cherry on the cake is that everything is constantly changing: different features, different experiences, different content, different users. The simple “hello world” app is easy. But how do you automatically translate that into every language with a personal message and show a real-time graph with historical data of every user accessing the page at scale? What if we wanted to let users leave messages on the same page? Hopefully by the end of this series you will get a sense of how all the pieces fit together and what’s involved in going from 10 users to 10 thousand to 10 million.
The Web at 50,000 Feet
The web breaks down into three distinct areas: what happens with the client, what happens on the server, and what happens in between. Usually this is rendering a web page: a user clicks a link, a page is delivered, the browser displays the content. It could also involve handling an ajax request, calling an API, or posting a search form. Either way it boils down to client action, server action, result. Doing this within a user’s attention span, given any amount of meaningful content, in a fail-safe way, with even the smallest amount of variation, is why we have all this technology. It all works together to combine flexibility and speed with power and simplicity. The trick is using the right tool in the swiss-army knife of tech to get the job done.
Let’s break all this down a bit. Handling a user action well comes down to:
- Having the server-side handle the request quickly (Serving)
- A quick travel time between the client and server (Delivering)
- Quickly displaying the result (Rendering)
And developing and managing all this successfully comes down to:
- Changing any aspect of what is going on quickly and easily (Developing)
As sites grow and come under heavier load doing any one of these things becomes increasingly difficult. More features, more users, more servers, more code. There is only so much one server can do. There is also only so much a server needs to do. Why hit the database if you can cache the result? Why render a page if you don’t need to? Why download a one meg html page when it’s only 100kb compressed? Why download a page if you don’t have to? How do you do all this and keep your code simple? How do you ensure everything still works even if your data center goes down?
Http plays an important role in all of this. I didn’t truly appreciate http until I read Restful Web Services by Sam Ruby and Leonard Richardson. Http as an application protocol offers an elegant, scalable mechanism for transferring data and defining intent. Understanding http verbs, the various http headers and how http sits on top of tcp/ip can go a long way in mastering the web.
Making it all work together
So how do you choose and use all the tools out there to serve, deliver, render and develop for the web? What does client action/server action/result have to do with rack and wsgi? I naïvely thought I could write everything I wanted to in a single post: from using sass for compressed, minimized css to sharding databases for horizontal scalability. It will be easier to spread it out a bit so stay tuned. But remember: any language, framework or tool out there is really about improving client action/server action/result. Even something like websockets. Websockets eliminates the client request completely. Why wait for a user to tell you something when you can push them content? Knowing your problem domain, your bottlenecks, and your available options will help you choose the right tool and make the right time/cost/benefit decision.
I’ll dig into the constraints and various techniques to effectively deliver, serve, develop and render for the web in upcoming posts.
Dec 15 2011
One of my favorite achievements in the agile/lean world has been the progression from standard Scrum practices to a Kanban approach of software development. In fact, Kanban, in my opinion, is such an ideal approach to software development I cannot imagine approaching team-based development any other way.
What’s Wrong With Scrum?
Before answering this, I want to mention Kanban came only after altering, tweaking, and refining the Scrum process as much as possible. If anything, Kanban represents a graduation from Scrum. Scrum worked, and worked well, but it was time to take the approach to the next level. Why? The Scrum process was failing. It became too constrained, too limiting. As I mentioned in my three-year-old (as of this writing) post on Scrum, one needs to constantly iterate in refining the practice. Pick one thing that isn’t working, fix it, and move on. Quite simply, there was nothing left to refine with Scrum except Scrum itself.
Why Scrum Was Failing
The main issue was simply that it was time to break free from the time-boxed approach of sprints. Too much effort went into putting stories into iterations. Too much effort went into managing the process. This process took away from releasing new functionality, and nothing can be more important than releasing new functionality. Tweaking iteration length did not help: one week caused too many meetings to happen too frequently; at two weeks the early sprint planning effort was lost on stories which would not occur until the second week. Too much time went into making stories “the right size”. Some were too small, not worth discussing in a group. Some were too big, but did not make sense to break down just to fit into the iteration. Worse, valuable contributions in meetings only came from a few people. This had nothing to do with the quality of dev talent; some really good developers did not jive with the story time/sprint review/retrospective/group think model. Why would they? Who really likes meetings?
Rethinking Constraints
Scrum has a specific approach to constraints: limit by time. Focus on what can be accomplished in X timeframe (sprints). Add those sprints into releases. Wash, rinse, repeat. Kanban, however, rethinks constraints. Time is irrelevant; the constraint is how much work can occur at any one time. This is, essentially, your work in progress. Limit your work in progress (WIP) to work you can be actively doing at any one time. In order to do new work, old work must be done.
Always Be Releasing
The beauty of this approach is that it lends itself well to continuous deployment. If you work on something until it is done, it can be released as soon as it is done. So release it. Why wait for an arbitrary date? The development pipeline in Kanban is similar to Scrum. Stories are prioritized, they are sized, they are ready for work, they are developed, they are tested, they are released. The main difference is that instead of doing these at set times, they are done just-in-time. In order to move a story from one stage of the process (analysis, development, testing, etc.) there must be an open “slot” in the next stage. This is your WIP limit. If there isn’t an open slot, the story cannot move, and stays as is. People can be focused on moving stories through the pipeline rather than meeting arbitrary deadlines, no matter how those deadlines came to be. Even blocking items can have WIP limits. The idea is simple: you have X resources. Map those resources directly to work items as soon as they are available, and see them through to the end. Then start again.
Everything is Just In Time
All of the benefits of Scrum are apparent in Kanban. Transparency into what is being worked on and the state of stories. Velocity can still be measured; stories are sized and can be timed through the pipeline. Averages can be calculated over time for approximate release dates. The business can prioritize what is next right up to the point of development. Bugs can be woven into the pipeline as necessary, without detracting from sprints. With the right build and deploy setup, releases can occur as soon as code is merged into the master branch. Standup meetings are still important.
The Goal
The theory of constraints is nothing new. My first encounter was with The Goal by Eliyahu Goldratt. The goal, in this case, is to release new functionality as efficiently (not quickly, not regularly; efficiently) as possible. There is a process to this: an idea happens, a request comes in. It is evaluated, it is fleshed out, given a cost. It is planned, implemented, and tested. It is released. Some are small, some are big. Some can be broken down. But in teams large and small, they go from inception to implementation to release. Value must be delivered efficiently. It can happen quickly, but it does not need to be arbitrarily time-boxed.
Scrum is a great and effective approach to software development. It helps focus the business and dev teams on thinking about what is next. It is a great way to get teams on board with a goal and working, in sync, together. It follows a predictable pattern to what will happen when. It offers the constraint of time. Kanban offers the constraint of capacity. For software development this is a far more effective constraint to managing work. You still need solid, manageable stories. You just don’t have to fit a square peg in a round hole. Kanban streamlines the development process so resources, which always have a fixed limit, are the real limit you are dealing with. They are matched directly to the current state of work so a continuous stream of value can be delivered without the stop-and-go Scrum approach.
Dec 8 2011
I wrote an article covering how we move images to our customers on the new Getty Images blog.
The system is a suite of .NET applications which handle various steps in our workflow. It features:
- A custom C# module which sits with IIS FTP to alert when a new image arrives
- Services built with the Topshelf framework
- WCF services which wrap a rules engine featuring dynamic code generation
The Getty Images blog will be covering more insight into the system as well as other technology developed at Getty Images. So check out the new blog and subscribe to the RSS feed!
Nov 29 2011
We use Solr as our search engine for one of our internal systems. It has been awesome; before, we had to deal with very messy sql statements to support many search criteria. Solr allows us to stick our denormalized data into an index and search on an arbitrary number of fields via an elegant, RESTful interface. It’s extremely fast, easy to use, and easy to scale. I wanted to share some lessons learned from our experience with Solr.
Know Your Use Cases
There are two worlds of Solr: writing data (committing) and reading data (querying). Solr should not be treated like a database or some nosql solution; it is a search indexer built on top of Lucene. Treat it like a search indexer and not a permanent data store; it doesn’t behave like a database. There are plenty of tools to keep data in your database in sync with Solr; the worst case scenario is you have to sync it yourself. You should know how heavily you will query it, how much you’ll write to it, and have a rough idea what your schema will be (but it doesn’t have to be 100%). Knowing your use cases will allow you to configure your instance and define your schema appropriately.
Solr offers a variety of ways to index and parse data; when you’re starting out, you don’t need to pick one. Solr has a great copyField feature that allows you to index the same data in multiple ways. This can be great for trying out new things or doing A/B comparisons. Once your patterns are well defined, you can tune your index and configuration as needed.
Our use cases are pretty straightforward; we simply need to search many different fields and aggregate results. We don’t need to deal with lexical analysis or sorting on score. Our biggest issue was actually commits, because we didn’t thoroughly vet our update patterns. Remember, Solr is about commits as much as it is about querying. There will be some lag between when you update Solr and when you see the results. A large number of factors go into how long that delay will be (it could be very quick), but it will be there, and you should design your system knowing there will be a delay rather than trying to avoid it. The commit section covers why you shouldn’t try to commit on every update, even under moderate load.
Know Configuration Options
Go through the solrconfig.xml and schema.xml files. They are well documented and there are lots of good bits in there (solrconfig.xml is often missed!). The caches matter most and are explained in later sections. If you know your usage patterns you can get a good sense of how to tune your caches for optimal results. Autowarming is also important; it allows Solr to reuse caches from previous indexes when things change.
Don’t forget that Solr sits on top of Java, so you should also tune the JVM as appropriate. This will mostly revolve around how much memory to allocate to the JVM. Be sure to give it as much as possible, especially in production.
Understand Commits
You should control the number of commits being made to Solr. Load testing is important; you need to know how often Solr will rebuild an index and what happens when it does. You shouldn’t commit on every update; you will surely hit memory and performance issues. When a commit occurs, an index and a search warmer need to be built. A search warmer is a view onto an index. Caches may need to be pre-populated. Locking occurs. You don’t want that overhead if you don’t need it. If you have any post-commit listeners those will also run. Finally, updating without forcing a commit is a lot faster than forcing a commit on update. The downside is simply that data will not be immediately available.
This is where autocommit comes into play. We use an autocommit of every 5 seconds or 5000 docs. We never hit 5000 docs in less than 5 seconds; we just don’t want data to be too stale. The 5000-doc threshold allows us to re-index in production if we need to without killing the system. This ratio provides a good enough index time for searches to work appropriately without too many commits choking the system. Again, know your usage patterns and you can get this number right.
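For illustration, here is roughly what that looks like from the client side: post documents to the update handler without forcing a commit and let autocommit fold them in (the Solr URL and field names below are assumptions, not our actual schema):

require "net/http"
require "uri"

solr = URI("http://localhost:8983/solr/update")  # assumed local Solr instance

# Add a document *without* commit=true; autocommit (e.g. 5 seconds / 5000 docs)
# will make it searchable shortly, without per-update commit overhead.
doc = <<-XML
<add>
  <doc>
    <field name="id">42</field>
    <field name="title">lessons learned from solr</field>
  </doc>
</add>
XML

Net::HTTP.start(solr.host, solr.port) do |http|
  http.post(solr.path, doc, "Content-Type" => "text/xml")
end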
Search Warmers and Cold Searches
Solr caching works by creating a view on an index called a searcher. A commit will create a warming search to prep the index and the cache. How long this takes is tricky to say, but the more rows, the more indexed fields, and the more parsing involved, the longer it takes. The default is to only allow two warming searches at once, and depending on how you’re doing commits, you can easily surpass that limit. If you read the solrconfig.xml file you’ll see that 1-2 is useful for read-only slaves, so you’re going to want to increase this number on your main instance. But be aware: you can exhaust your available memory if you’re committing so often that you have a high number of warmers.
By default Solr will block if a search warmer isn’t available. Depending on how and when you’re committing, you may not want this. For instance, if the first search is warming an index, it could be a while before it returns. Be sure to reuse old warmers and see if you can live with a semi-built index. This is all handled in the solrconfig.xml file. Read it!
Increase Cache Sizes
Don’t forget out-of-the-box mode is not production mode. We’ve touched on committing and search warmers. Cache sizes are another important aspect and should be as big as possible. This allows more warmers to be reused and offers a greater opportunity to search against cached search results (fq parameters) versus new query results (q parameters). The more we can cache the better; it also allows Solr to carry over search warmers when rebuilding indices, which is very helpful.
Lock Types
Luckily the default lock type is now “Native” which means Solr uses OS level locking. Previously it was single and this killed the system in concurrent update scenarios. Go native.
Understand and Leverage Q and FQ parameters
Q is the original query, fq is a filter query. For larger sets this is important. If the original query is cached an fq query will just search the cached original query, rather than the entire index. So if you have an index with one million records, and a query returns 100k results, a q/fq combination will only search the 100k cached records. This is a big performance win. Ensure your cache settings are big enough for your usage patterns to create more cache hits.
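As a hedged example (the field names are made up): the expensive q result set gets cached, and the fq filter runs against that cached set:

require "cgi"

q  = "text:championship"   # main query - expensive, result set cached
fq = "category:sports"     # filter query - applied against the cached q results

puts "/solr/select?q=#{CGI.escape(q)}&fq=#{CGI.escape(fq)}&rows=20"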
Minimize Use of Facets
Calculating facets is time consuming and can easily make a search 2-5x slower than normal. This is the slowest bottleneck we have with Solr (but still, it’s minimal compared to sql). If you can avoid facets then do so. If you can’t, only calculate them once on initial load, and design a UI that doesn’t need to refresh them (i.e. paging via ajax, etc). When searching from a facet, use the fq parameter to minimize the set you’re searching on from your q query. This also reduces the number of entries that must be calculated for a facet and greatly increases performance.
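For example (again, hypothetical field names): compute facets once for the initial view, then page with q/fq-only requests so the counts aren’t recalculated every time:

# First request: calculate facets once for the initial view.
initial = "/solr/select?q=text:championship&facet=true&facet.field=category&rows=20"

# Subsequent paging: reuse the facet the user clicked as an fq filter and
# skip facet calculation entirely.
next_page = "/solr/select?q=text:championship&fq=category:sports&rows=20&start=20"

puts initial
puts next_page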
Avoid Dynamic Fields in Solr
This is more of an application architecture decision than anything else, and probably somewhat controversial. I feel you should avoid the use of dynamic fields and focus on defining your schema. You can easily lose control over your schema if your model changes often, as you have no base to work from. That can have unintended consequences depending on how you wrap your Solr instance and how you serialize and deserialize Solr data. Defining your schema during development is not much up-front work, so there is rarely a reason to rely on dynamic fields in production, unless of course your app genuinely necessitates them for one reason or another.
The other, more valid argument is that on a per-field level you can specify multi-valued, required, and indexed fields. Solr handles multi-valued and indexed fields differently on commits. If you are using dynamic fields and are indexing each one, but are not actually searching on or returning these fields, you have a really high and unnecessary commit cost. At the very least, consider turning off indexing for dynamic fields if you don’t need it.
Use Field Lists
You should always specify what data you want returned from the query with fl (field list). This is extremely important! Depending on how you’ve set up your schema, you probably have a ton of fields you don’t actually need returned to the UI. This is common when you are indexing the same field with different parsers via the copyField functionality. Use fl to get back only the data you need; this will greatly reduce the amount of data (and network traffic) returned, and speed up the query because Solr will not have to fetch unnecessary fields from its internal storage. In a high-read environment, you can greatly reduce both memory and network load by trimming the fat from your dataset.
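A quick sketch of the idea (field names are hypothetical): the index may hold several copyField variants of a title, but the UI only needs two fields back:

require "cgi"

params = {
  "q"  => "title_ngram:cats",   # search against an analyzed copyField
  "fl" => "id,title_display"    # but only return what the UI renders
}

query_string = params.map { |k, v| "#{k}=#{CGI.escape(v)}" }.join("&")
puts "/solr/select?#{query_string}"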
Have a Reindex strategy
There will come a time when you need to reindex your Solr instance. Most likely this will be when you’re releasing a new feature. It’s important to have a reindexing strategy ready to go. Let’s say you add a new field to your UI which you want to search on. You release your code, but that field is not in Solr yet, so you get no results. Or you get a doc back from Solr, deserialize it to your object model, and get an error because you expect the field to be there and it’s not. You must prepare for that. You could change your schema file, reindex in a background process, and then release code when ready. In this scenario make sure you can reindex without killing the system. It’s also important to know how long it will take; having to reindex like this may not be practical if it takes a couple of days. You could also reindex to a second, unused Solr instance, and when you deploy you cut over to the new instance. By looking at your db update timestamps you can sync any missed data. (Remember how I said Solr is not a data store? This is a reason.)
Final Thoughts
Remember that data in Solr needs to be stored, indexed and returned. If you are only using dynamic fields, indexing all of them, defining copyField settings left and right and returning all that data because you are not using field lists (and potentially calculating facets on everything), you are generating a lot of unnecessary overhead. Keep it small and keep it slim. You’ll lower your storage needs, your memory requirements, and your result set. You’ll speed up commits as well.
Aug 12 2011
I found myself confronted with a MongoDb data modeling problem. I have your vanilla User model which has many Items. The exact nature of an Item is irrelevant, but let us say a User can have lots of Items. I struggled with trying to figure out how to model this data in a flexible way while still leveraging the document-oriented nature of MongoDb. The answer may seem obvious to some, but it is interesting to weigh the options available.
To Embed or Not to Embed
The main choice was to embed Items in a User or have Items as a separate collection. I do not think it makes sense to go the other way, as Users are unique and clearly a top-level entity; it would not make sense to have thousands of copies of the same User in an Items collection. So the choice was between having Items in its own collection or embedding it in Users. A couple of factors came into play: How can I access, sort, or page through Item results if they are embedded in a User? What happens if I have so many Items in a User document that I hit the MongoDb 4mb document size limit? (Unlikely: 4mb is a lot of data, but I would certainly not want to have to refactor that logic later on!) What would sharding look like with a large number of very large User documents? Most importantly, at what point would the number of Items become problematic with this approach? A hundred? A thousand? A hundred thousand?
When to Embed
I think embedded documents are an awesome feature of MongoDb, and the general approach, as recommended on the docs, is to say “Why wouldn’t I put this in an embedded document?”. I would say if the number of Items a User would have is relatively small (say, enough that you would not need to page them on a UI, or if it would not create large network io by just accessing that field) then it can be an embedded document. The decision is a lot simpler if it is a 1..1 relationship as the potential size is clearly defined. 1..N relationships break down with embedded relations when N becomes so large that accessing it as a whole is impractical. As far as I know there does not seem to be a way to page or sort through an embedded array directly within MongoDb: you need to pull the entire field out of the database with field selection and then page on the client. Note MongoDb offers numerous ways to find data within a document no matter how it is stored within the document (see the docs on dot notation for more). You can even query on the position of elements in an array, which is helpful with sorted embedded lists (find me all Users who have Item Z as the first element). But sadly you cannot say “give me the first to the Nth element in an embedded array”. It is all or nothing.
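To illustrate (the class and field names are hypothetical, and this assumes a configured Mongoid connection): dot notation reaches into an embedded array for matching, but retrieving the array is still all-or-nothing per parent document:

require "mongoid"

class User
  include Mongoid::Document
  embeds_many :items
end

class Item
  include Mongoid::Document
  embedded_in :user
  field :name
end

# Matching against embedded fields works fine...
User.where("items.name" => "Z")     # Users having an Item named Z
User.where("items.0.name" => "Z")   # Users whose first Item is named Z

# ...but retrieval is all-or-nothing: this pulls every embedded Item for the
# matched User; any paging over that array happens on the client.
User.only(:items).where("items.name" => "Z").first.items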
Now Mongoid does offer the ability to page through an embedded association using a gem (it seems people use Kaminari, as will_paginate was removed from Mongoid some time ago). However, this paging is done within the ruby object for embedded relations. More importantly, it is only done on a per-document basis. Under the hood you need to grab the entire embedded relation within its root document (think an array of Users containing an array of Items, not a plain array of Items). This means you cannot grab a collection of embedded documents which span multiple root documents. You cannot say “give me all Items of type ‘X’”. You need to say “give me all Users and their Items containing Items of type ‘X’”. If you ever ran into the “Access to the collection for XXX is not allowed since it is an embedded document, please access a collection from the root document” error, you are probably trying to issue an unsupported Mongoid query by bypassing a root document. You think you can treat embedded relations like normal collections, but you can’t.
When to Have Separate Collections
So where does that leave us? If the relation is small enough, then an embedded relation is fine; we just need to realize that we can never really work with elements of that collection across top-level documents, and that getting those elements is an all-or-nothing operation for each parent document. For the sake of argument, let us say a User can have thousands of Items, and we wanted the ability to list Items across Users in a single view. That would be too much to manage as an embedded field, and we could not aggregate Items across Users easily. So it needs to be in its own collection. This now gives us numerous sorting options and paging features like skip and limit to reduce network traffic. If we have Items as its own collection then we can create a DBRef between the two. This is a classic relational breakdown. The thing that smells with this approach, specifically when using MongoDb, is that if I were viewing a list of Items and wanted to show the username associated with each, I would either have to use a DBRef command to pull user information or make two queries. Less than ideal. A JOIN would certainly be easier (albeit impractical at scale, though the same is probably true of the DBRef approach).
The Solution
So what I’m really looking for is the ability to show the username with a list of Items when each has its own collection. The trick is I do not need to aggregate this data when I am pulling it out of the database. Instead I can assemble it before I put in the database and it will all be there when I take it out. Classic denormalization. With Mongoid and Callbacks this becomes extremely easy.
On my Items class I add a belongs_to :user association along with a :username field. I want to ensure that a :user always exists, so I add a validates_presence_of :user validation. I do not need to add :username to this validation, as we will see below. Then I leverage callbacks like so:
before_save :add_username

protected

def add_username
  if user_id_changed?
    self.username = user.username
  end
end
What happens is that if the user association changed, Mongoid will set the current Item’s username value to the user.username property value before saving. The username field is now stored within the Item document, and I can query on this field as easily as any other Item property (including the user_id relation on the Item document). More importantly, it is already available in a query result, so there is no need to make an additional query on User.username for display. Any time the user changes (if Items can switch Users) the username will be updated automatically before the save to maintain consistency. Because the :user object is required, there is no need to also make :username required; username will be read from the required User property before each save. There is a slight catch with this approach: callbacks will only run on the document which received the save call, so be careful with cascading updates. As always, a good test suite will ensure the behavior you want is enforced.
Sharding
The other point about the user relation, whether it is via the username field or user_id, is that it makes a good shard key. If we shard off of this field (probably in conjunction with another key) we can control things like write scaling while keeping relevant data close together for querying. For instance, sharding only on username will put all of a user’s data on the same server, making querying a user’s items extremely efficient. Sharding on username and something else will distribute writes across servers at the expense of having to gather elements across servers when returning results. The bottom line is know your use case: are faster writes more important than faster reads? Which one are you doing more of?
In Conclusion
I think there are two important things to realize when it comes to modeling with not just Mongoid but with any type of data store, sql or nosql. First, when you are dealing with scale, you want to store your data the same way you want to get it out. Know your data access patterns. Sql allows a tremendous amount of flexibility, but joining numerous tables across millions of rows is extremely inefficient. More importantly, if you model your data in NoSql incorrectly, you could end up with similar performance problems. In the denormalization exercise above, adding a username field to the Items collection saves us from a DBRef later. Plus, with the use of callbacks, getting our data into Mongoid in a denormalized way is easy. We could easily apply the same principle to a sql-based solution: add a username column to an Item table or create a materialized view/indexed view on the Users/Items data. If you are debating a no-sql solution over a sql one, look at the cost/benefit of one approach over another in terms of how easy it is to model your data around data access. I think MongoDb gives a good amount of flexibility, especially with querying and indices, while still promoting some of the NoSql goodies like easy sharding for scalability and easy replication for reliability and read scaling.
Secondly, it is extremely important to know your toolset. With MongoDb, you get a tremendous amount of querying power: filtering on any field, no matter the nesting, even if it’s an array; creating indices on said fields; map/reduce views; retrieving only specific fields from a document; the list is nearly endless. ORM features are important too: How does Mongoid map its API to MongoDB commands? How does it deal with dirty tracking? What callbacks are available? The coolest thing on the Mongoid website is the statement “This is why the documentation provides the exact queries that Mongoid is executing against the database when you call a persistence operation.” If we took the time to tell you, you should listen. VERY TRUE! I like that. The point being, there should be a purpose behind choosing a NoSql solution: know what it is and leverage it. It will mean the difference between succeeding at scale and failing at launch.
Cassandra
As an interesting footnote, I think Cassandra exemplifies the query-first approach to data modeling (I mean, it states so on its wiki!). Cassandra’s uniqueness is in its masterless approach as a key/value store. It comes with some interesting features: the choice of using a secondary index vs. a columnfamily as an index, numerous comparison operators on columnfamily names, super columns vs. columns for storing data, and replication and write consistency options across multiple data centers. This leads to plenty of benefits but with a certain cost. As for the know-your-tools/know-your-data philosophy, an example is the typical choice of “Do you create a row and use its respective columns as an index, choosing an appropriate column comparison type, or do you treat your data as a key/value store and use a secondary index for queries?” On the one hand, you have a pre-sorted list that queries from one machine with one call, with slices for paging; on the other, you may need to farm out to a lot of machines to get the data you want. Knowing your options is important, and knowing what you have to do to implement your choice is nearly as important. Even with the best Cassandra ORMs you still need to do a lot of prep to get your data into and out of Cassandra in a meaningful way.
Final Thought
In a bit of contradictory advice, I’d say don’t sweat it too much. Do some preliminary research, go with your hunch and trust your ability to refactor when needed. If you wait to figure out the perfect solution, you won’t build anything!
May 10 2011
One of the things I’m excited to see is the huge increase in Open Source projects in the .NET world. NuGet has certainly helped the recent explosion, but even before that there have been numerous projects gaining legs in the .NET community. Even better, the movement has been learning from other programming ecosystems to bring some great functionality into all kinds of .NET based systems.
One of my favorite projects on the scene is Nina, a Web Micro Framework by jondot. What exactly is a web micro framework? Quite simply, it allows you to go from an HTTP request to a server-side method call with little friction. The project is inspired by Sinatra, a very popular Ruby framework for server-side interaction which doesn’t involve all the overhead of a convention-based framework like Ruby on Rails.
Wait, Isn’t This .MVC?
Sort of- but the two frameworks take very different approaches in how they map an HTTP request to a function call. .MVC is a huge improvement over “that which must not be named” but still abstracts the underlying HTTP request/response: controllers and actions to handle logic, models to represent data, views to render results, and routing to figure out what to do. This is usually a good thing as you can easily get fully formed objects into and out of the server in an organized way and has incredible benefits over WebForms. But sometimes that is too much for what you want or need. In our ajax driven world we simply want to do something–GET or POST some data–as quickly and easily as possible. We don’t want to set up a routing for new controller, create a model or view model, invoke an action, return a view, and all that other stuff; we just want to look at the request and do something. That’s where Nina comes in- it elegantly lets you “think” in HTTP by providing an API to do something based on a given HTTP request. It’s extremely lightweight and extremely fast. It’s the bare essentials of MVC by providing a minimalist view of functionality in a well defined DSL. On the plus side, the MVC framework and Nina can complement each other quite well (Nina can also stand on its own, too!). Let’s take a look.
How It Works
Nina is essentially functionality added to a web project in the same way the MVC bits are added to a web project. It’s not an entirely new HTTP server implementation; it’s powered by the standard .NET HttpApplication class, and unlike the various OWIN toolkits Nina doesn’t try to rewrite the underlying HttpContext or IIS server stack. To start things off, Nina works by creating a class that handles all requests to a given url, referred to as an endpoint. This class inherits from Nina.Application and handles all requests to that endpoint, no matter what the rest of the url is. This is done by “mounting” the class to an endpoint in your Global.asax file. It’s not too different from setting up routing for MVC. However, unlike MVC, you’re not routing directly to specific actions or a pattern of actions; you’re gobbling up all requests to that url endpoint. Below is an example of a global.asax file from the Nina demo project. There are two Nina applications: the Contacts class gets mounted to the contacts endpoint and Posts gets mounted to the blog endpoint.
private static void RegisterRoutes()
{
    RouteTable.Routes.Add(new MountingPoint("contacts"));
    RouteTable.Routes.Add(new MountingPoint("blog"));
}
When you’re mounting an endpoint any request to that endpoint will go to that class- and that class will handle everything else. So anything with a url of /contacts, /contacts/123, /contacts/some/long/path/with/file.html?x=1&y=1 will go to the Contacts class. There’s no automatic mapping of url parts to action names, or auto filling of parameters. That’s all handled by the class you specify which inherits from Nina.Application. Routing to individual methods is handled within these classes by leveraging the Nina DSL. I like this approach, as it keeps routing logic tied to specific endpoints rather than requiring you to centrally locate everything or to dictate globally how routing should work via conventions. Of course, there are pros and cons in either case. In very complex systems the Global.asax can get quite large; you can certainly refactor routing logic into helper functions as necessary, but moving routing definitions closer to the logic has its benefits. I’m also not too big of a fan when it comes to attribute based programming so not having to pepper your action methods with specific filters- whether for a Uri template in the case of WCF or Http Verbs for .MVC- is a big plus.
Handling Requests
This is where the beauty of Nina comes in. Once we’ve mounted an application to an endpoint we can decide what to do based on two variables: the HTTP method and the path of the request. This is done via four function calls which are part of the Nina.Application class and map to the four HTTP verbs: Get(), Put(), Post() and Delete(). Each function takes two parameters. The first is a Uri template which determines when the method gets invoked. The second is a lambda (a Func) which is invoked when the current request matches the Uri template: its first parameter holds the matched template parts (explained later), its second parameter is the underlying HttpContext object, and it returns a Nina.ResourceResult. For all intents and purposes a ResourceResult is similar to an ActionResult in .MVC. Nina provides quite a number of ResourceResults, from Html views to various serialization objects to binary data.
This setup is powered by an extremely nice DSL for handling function invocation from HTTP requests and yields a very nice description of your endpoint. You specify the HTTP verb required to invoke the function. You specify the Uri template for when that match should occur–very similar to setting up routes–and your handler is actually a parameter, which you can specify inline or elsewhere if needed. The Uri templating is pretty slick, as it allows any level of fuzzy matching. Because the template is automatically parsed and passed as a variable to your handler, you can easily pull out elements of the Uri using the template tokens. Take a look at the simple example application below.
public class Contacts : Nina.Application
{
    public Contacts()
    {
        Get("", (m, c) =>
        {
            // Returns anything at the root endpoint, i.e. /contacts
            var data = SomeRepository.GetAll();
            return Json(data);
        });

        Get("Detail/{id}", (m, c) =>
        {
            // Returns /contacts/detail/XYZ
            // m is the bound parameters in the template;
            // this will be a collection with m["ID"] returning XYZ
            var id = m["ID"]; // returns XYZ
            var data = SomeRepository.GetDetail(id);
            return View("viewname", data); // Nina has configurable ViewEngines!
        });

        Post("", (m, c) =>
        {
            // A post request to the root endpoint.
            return Nothing();
        });
    }
}
We’re exposing three operations: two GET calls and one POST. We’re handling a GET and POST operation at the endpoint root. In our global.asax we’ve mounted this application at /contacts, so everything here is relative to /contacts. A template of “” will simply match a Uri of /contacts. If we had instead written RouteTable.Routes.Add(new MountingPoint("")); in our Global.asax then this class would be at the root of our application, i.e. “http://localhost/”. Finally, we have another GET call at /detail/{id}. This is actually a URI template, similar to a Route, so anything which matches that template will be handled by that function. In this case /detail/123 or /detail/xyz would match. The template variables are passed as a key/value array in the “m” parameter of the lambda and can easily be pulled out. These are your template parts that are automatically parsed for you.
Using this DSL we can create any number of handlers for any GET, POST, PUT or even DELETE request. We can easily access HTTP Headers, Form variables, or the Request/Response objects from the HttpContext class. Most importantly we can easily view how a request will get handled by our system. The abstraction that MVC brings via Routes, Controllers and Actions is helpful; but not always necessary. Nina provides a different way of describing what you want done that serves a variety of purposes.
Returning Results
So far we’ve focused on the request side of Nina and haven’t delved too much into the response side. Nina’s response system is very similar to .MVC’s ActionResult infrastructure. Nina has a suite of classes which inherit from ResourceResult and allow you to output a response in a variety of ways. You can serialize an object into Json or Xml, render straight text, return a file, return only a status code, or even return a view. Nina supports numerous view engines–including Razor but also NHaml, NDjango and Spark–which is beyond the scope of this post but worth checking out. I’m a big fan of Haml. Results are returned using one of the method calls provided through the Nina.Application class and should serve all your needs. The best thing to do is explore the Nina.Application class itself and find out which methods return ResourceResult objects.
This is cool, but why use it?
The great part about Nina is that even though it can stand alone as an application, it can just as easily augment an existing WebForms (Blah!) or MVC application by mounting endpoints using the Routing engine. There are times when you want speed and simplicity for your web app rather than a fully-fledged framework. MVC is great, but requires quite a few moving parts and abstracts away the underlying HTTP. The new restful web APIs Microsoft is rolling out for WCF are also nice, but I’ve never been a fan of attribute-based programming and the WCF endpoints are service specific. Nina offers much more flexibility; it strikes the right balance by honoring existing HTTP conventions while providing flexibility of output. Sinatra, Nina’s inspiration, came about from those who didn’t want to follow the Rails bandwagon and the MVC convention it implemented. They wanted an easier, lightweight way of parsing and handling HTTP requests, and that’s exactly what Nina does.
Here are some use cases where Nina works well:
- Json powered services. Even though MVC has JsonResult, Nina provides a low friction way of issuing a get request to return Json data, useful for Autosuggest lists or other Json powered services. JQuery thinks in terms of get/post commands so mapping these directly to mounted endpoints becomes much more fluid. One of my more popular articles is the New Web App Architecture. Nina provides a nice alternative to Json powered services that can augment one of the newer javascript frameworks like Knockout or Backbone.
- Better file delivery. HttpHandlers work well, but exist entirely outside the domain of your app. Powering file delivery through Nina–either because the info is in a data store or requires specific authentication–works well.
- Conventions aren’t required. Setting up routes, organizing views, and implementing action methods all require work and coding. Most of the time, you just want to render something or save something. Posting a search form, saving a record via ajax, or polling for alerts are all things that could be done with the conventions of MVC but aren’t necessarily needed. Try the lightweight approach of Nina and you’ll be glad you did. With support for view engines you may even want to come up with your own conventions for organizing content.
When the time it takes to do something simple becomes too great, you’re using the wrong tool. I strongly encourage you to play around with Nina–you’ll soon learn to love the raw power of HTTP and the simplicity of the API. It will augment your existing tool belt quite well and you’ll find how much you can do when you can express yourself in different ways.
Mar 21 2011
I pushed a major update to the MVC3/Html5 Boilerplate Template found on the github page. The new update includes the latest boilerplate code and uses the DotNetOpenAuth CTP for logging in via Twitter and Facebook. Thanks to @jacob4u2 for making some necessary web.config changes (he has an alternate template on his bitbucket site you should also check out).
Your best option is to git clone the template and use it with your own app:
git clone git@github.com:mhamrah/Html5OpenIdTemplate.git
That way you’ll get the latest nu-get packages with the bundle. You can also use the template directly, but you’ll need to manually pull the latest CTP for DotNetOpenAuth to get the latest dlls.
Mar 2 2011
I’ve been playing around with CSS Transforms and had an annoying issue: when rotating divs at an angle, the edge of the div also rotated leaving a gap where I didn’t want one. See the pic:
In this case, the div is actually a header tag I wanted to span the length of the page, like a ribbon stretching the width of the browser. But I did not want that gap. So what to do? I thought about using transforms to skew the header by the angle required to maintain a vertical edge, but that’s annoying.
Instead, I simply added a negative margin to the width to stretch the header enough to hide the gap. Here’s the css and final result:
header
{
    background-color: #191919;
    margin: 75px -20px;
    -webkit-transform: rotate(-10deg);
    -moz-transform: rotate(-10deg);
    transform: rotate(-10deg);
}
You may find rotate and skew is a better combination to achieve the desired result- however, if you have text in your containing element, that text will also be skewed if you use skew.
To achieve the same effect within a page (where you don’t have the benefit of viewport clipping) you can always put the rotated element inside another element, use negative margins, and set overflow:hidden on the wrapper.
Feb 6 2011
Vim has quickly become my go-to editor of choice for Windows, Mac and Linux. So far I’ve had about three months of serious Vim usage and I’m just starting to hit that vim-as-second-nature experience where the power really starts to shine. I’m shocked I’ve waited this long to put in the time to seriously learn it. Now that I’m past the beginner hump I wish I learned Vim long ago- when I tried Vim in the past, I just never got over that WTF-is-going-on-here frustration! Better late than never I suppose!
Why I Like Vim – Mac
Coming from Visual Studio, I longed for a VS-like experience for programming on my Mac, whether it’s html/css/js or my recent focus on Ruby and Rails. I checked out both Aptana and Eclipse but quickly became frustrated: it was kind of like VS, but not really, and it was just too weird going back and forth. Plus, my biggest pet peeve with development started to emerge: I wasn’t learning a language, I was learning a tool that abstracted the language away. There’s nothing that could be worse; once your tool hides the benefits of the underlying infrastructure, you’re missing the point, and you’ll usually be behind the curve because the language always moves faster than the support.
TextMate then became the go-to: it’s widely used, powerful, and there are a lot of resources for learning it. The simplified environment mixed with the command line really created a higher degree of fluidity, and I realized how nice it can be to develop outside an integrated environment. TextMate has its features: Command-T is slick, the project drawer is helpful, and the Rails support is great along with the other available bundles. But it lacked split windows, which drove me crazy. There’s nothing more essential than split windows: I want to see my specs and code side-by-side. I want to see my html and js side-by-side. And you can’t do that with TextMate. So I turned to MacVim and haven’t looked back.
Why I Like Vim – Windows
Don’t get me wrong: I love me some Visual Studio. Visual Studio was my first IDE when programming professionally, and my thought was “wow, I can really focus on building stuff rather than pulling my hair out with every build, every exception and every bug”. It was so much better than the emacs days of college. It’s my go-to for anything .NET, as it should be. But there are some text-editing needs that aren’t related to coding or .NET, and VS is too much of a beast to deal with for those things. First, for html/js/css editing that’s not part of a Visual Studio project, VS is not great to work with. It’s annoying to be forced to create a Visual Studio project to house related content, especially when it’s already grouped together in the file system. Quickly checking out an html template or a js code sample becomes tedious when you just want to look around. The VS File Explorer is a step in the right direction, but it’s not there yet; I know there are shell plugins for a “VS Project Here” shortcut, but really? Is that necessary?
Then there’s the notepad issue. Notepad is barely an acceptable editor for checking out the occasional config file or random text file. Everyone knows how incredibly limited it is, not to mention how badly it handles large files- pretty much everything about it sucks, actually. Notepad++ is a nice alternative, but it’s no Vim. So after I got modestly comfortable with MacVim, I thought: why not do this on Windows too? So I started using gVim and haven’t looked back.
Why I Like Vim – Linux
This blog runs WordPress on a Linux box hosted by Rackspace. Occasionally I need to pop in via ssh, edit a config file, push some stuff, tweak some settings, etc. nano was my lightweight go-to editor, and it works well. There are many differences between nano and Vim, the biggest being that nano is a “modeless” editor while Vim’s power comes from its Normal, Insert, and Visual modes. Like always, stick with what works. But once you’re hooked on Vim I’d be surprised if you went back to nano very often, especially once your configuration files are synced up with source control- Vim is ubiquitous on Linux, so your setup follows you everywhere.
Why I Like Vim
There are plenty of resources out there about Vim and what makes it great- a quick Google search will turn up more than you can read. But I really started to appreciate Vim when I became comfortable with the following features:
Splits and Buffers
Splits are one of my favorite features- they let you view more than one file at once on the same screen. You can have multiple vertically and horizontally split windows, so you can see anything you want side-by-side. There’s also tab support, but I find using splits and managing buffers a better way to cycle through files. Buffers keep files open and active in the background so you can quickly swap them into the current window in a variety of ways. It’s like having a stack of papers on a table while being able to flip to any page instantaneously. This is different from having files within a project, which would be like keeping those papers in a folder- buffers provide another level of abstraction that lets you manipulate a set of content together. In summary: you have a set of papers in a folder (the current directory), you put some of them on the desk to work on (buffers), and you arrange them in front of you (splits).
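A quick cheat sheet of the commands I mean (all built in; the file name is just an example):
:sp                  split the current window horizontally
:vsp spec/user_spec.rb   open a file in a vertical split
Ctrl-w w             jump between split windows
:ls                  list open buffers
:b name              switch the current window to a buffer by (partial) name
:bn and :bp          move to the next / previous buffer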
Modes
A major feature of Vim, which really sets it apart from other editors, is how it separates behavior into different modes- specifically, normal and insert modes. Insert mode is simple: it lets you type text. Normal mode is all about navigation and manipulation: finding text, cutting lines, moving stuff around, substituting words, running commands, etc. It offers a whole new level of functionality, including shell interaction for doing anything you would on the command line. With the plethora of plugins around you can do pretty much anything you can imagine within Vim- from simple editing, to testing, to source control management, to deployments. Modes free you from the slew of Ctrl + whatever chords required in other editors, allowing precision movement with a minimal set of keystrokes. The best analogy is that you can “program your editing” in a way unmatched by any other editor.
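A few standard normal mode commands to show what I mean by “programming your editing” (no plugins involved):
dd         delete the current line
ciw        change the word under the cursor
3j         move down three lines
.          repeat the last change
:%!sort    filter the whole file through the external sort command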
Search and Replace (aka Substitute)
Vim’s search and substitute features make any old find-and-replace dialog box seem stupid. I’ve barely started unlocking their power, but already I wish every application behaved this way. In normal mode you can go from a simple string search to a complex regex in a couple of keystrokes to find what you’re looking for. On top of that, it’s only a few more keystrokes to replace text. Because it’s all driven by key commands, you can change what you’re doing without starting an entirely new search, and you never have that “context switch” of filling out a form in a dialog box- it’s all right in front of you. Highlighting lets you see matches, and you can lock in a search to easily jump between results. Viewing and editing configuration files is the real win for me over other editors: whatever the size of the file, I can open it and type a few keystrokes to get exactly where I need to be, even if I’m not sure where that is. This is so much better than using notepad or even Visual Studio.
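A typical flow looks something like this (standard commands; the pattern and replacement text are just placeholders):
/connection_string    search forward for a pattern (regex allowed)
n and N               jump to the next / previous match
*                     search for the word under the cursor
:%s/old/new/gc        substitute across the whole file, confirming each match
:nohlsearch           clear the highlighted matches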
Source controlled configuration
I put my vimfiles on github so I can synchronize them across platforms. This offers an unparalleled level of uniformity across environments with minimal effort. A lot of people do this, and it’s helpful to see how others have configured their environment. You’ll pick up a lot of neat tidbits by reading people’s Vimfiles!
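The setup is nothing fancy- on Mac or Linux it’s roughly a clone plus a couple of symlinks (the repo name and paths below are placeholders, and this assumes your repo contains a vimrc and your plugin directories):
git clone git@github.com:yourname/vimfiles.git ~/vimfiles
ln -s ~/vimfiles/vimrc ~/.vimrc    # Vim reads ~/.vimrc, which now lives in the repo
ln -s ~/vimfiles ~/.vim            # plugins, colorschemes, etc.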
Vim can run either in a gui window (like MacVim and gVim) or from the command line. These are two different executables with slightly different feature sets. Usually you get a little more with a gui Vim, especially around OS integration (like cutting and pasting text), while running shell commands is easier from command line Vim. Gui Vim also offers better color support for syntax highlighting.
MacVim is the Vim app for the Mac. It has a really nice full screen mode and native Mac commands alongside the Vim ones. I love the PeepOpen search plugin, which is only available on the Mac (and can also be used with TextMate)- it’s a slick approach that beats TextMate’s Command-T. I like running MacVim in full screen mode with the toolbar off to get the most screen real estate.
gVim is the Windows gui version of Vim, and I find it preferable to command line Vim via cygwin. gVim has a shell extension that lets you open any file in gVim- set it as the default to avoid notepad. Note that Vim on Windows reads its configuration from _vimrc, _gvimrc, and the vimfiles directory, which is different from the usual .vimrc, .gvimrc, and .vim locations on other platforms. That hung me up when I was trying to sync my configuration via git.
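One common workaround is to make _vimrc a one-liner that sources the vimrc from wherever you cloned your repo- a sketch, with the path as a placeholder:
" %USERPROFILE%\_vimrc: hand everything off to the source-controlled config
source C:\code\vimfiles\vimrc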
As for command line Vim, that’s probably what you’ll be using on Linux. In Gary Bernhardt’s Play-by-Play Peepcode video I learned a cool trick: run Vim from the command line, suspend it with Ctrl-Z to drop back to the shell, then return to Vim by bringing the job back to the foreground with fg.
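In practice it looks something like this in any shell with job control (the file name is just an example):
vim config/deploy.rb    # open a file and edit away
# press Ctrl-Z to suspend Vim and drop back to the shell prompt
jobs                    # lists the suspended vim job
fg                      # brings Vim back to the foreground, right where you left off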
Learning Vim
There are plenty of resources on the web for getting started with Vim. Steve Losh’s Coming Home To Vim is a great overview that points you to a lot of other helpful posts. I bought a subscription to Peepcode and picked up the Smash Into Vim Part II episode; I skipped Part I because it felt too basic, but Part II has a lot of substantial content. Peepcode offers a high quality product, so you probably can’t go wrong getting both if you’d rather be safe than sorry. I also watched the Gary Bernhardt Play by Play, which focuses a lot on using Vim with RSpec and Ruby, and I use the PeepOpen plugin for file search with Vim on my Mac.
Here are some tips to avoid the beginner frustration:
- Take it slow. There’s a learning curve, but it’s worth it.
- Don’t sweat plugins when you’re starting out. Yes, everyone says “use Pathogen, use Rails.vim, use xyz” and it’s absolutely correct. But it’s not essential when you’re starting out.
- Take it one step at a time. Learn about Vim via tutorials and blogs so you know what’s out there, but don’t try to do everything at once. Keep that knowledge in the back of your head, get comfortable with one thing, then move on to the next. You’ll just end up confusing yourself otherwise.
Focusing on items in the following order will allow you to build on your knowledge:
- editing text and searching, as that’s what you’ll be doing most
- file management, like opening, saving, and navigating to files
- other navigation, like jumping to lines or words and the power of hjkl (don’t use arrow keys!)
- manipulation like replacing text, cutting and pasting
- window and buffer management, including splits
- start using and learning plugins to see where you can eliminate friction
- start customizing your vimrc file to make the vim experience more comfortable now that you know the basics.
Good luck!