High Scalability
Tue, 07/27/2010 - 8:17pm
The NoSQL movement faults the SQL query language as the source of many of the scalability issues that we face today with the traditional database approach.
I think that the main reason so many people have come to see SQL as the source of all evil is the fact that, traditionally, the query language was burned into the database implementation. So by saying NoSQL you basically say "No" to the traditional non-scalable RDBMS implementations.
This view has brought on a flood of alternative query languages, each aiming to solve a different aspect that is missing in the traditional SQL query approach, such as a document model, or that provides a simpler approach, such as Key/Value query.
Most of the people I speak with seem fairly confused on this subject, and tend to use query semantics and architecture interchangeably. In Part I of this post I tried to provide a quick overview of what each query term stands for in the context of the NoSQL world. Part II illustrates those ideas using code examples from GigaSpaces and Datanucleus/Hbase.
See Part I and Part II for more information.
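To make the difference between query semantics and architecture a little more concrete, here is a minimal sketch in Python. The `TinyStore` class and its methods are hypothetical names invented for this illustration; they are not the GigaSpaces or Datanucleus/Hbase APIs used in Part II. The sketch shows one storage implementation answering both a key/value lookup and a document-style template query, so the query style and the underlying store can be chosen separately.

```python
class TinyStore:
    """Hypothetical in-memory store: one implementation, two query styles."""

    def __init__(self):
        self._docs = {}  # key -> document (a plain dict)

    def put(self, key, doc):
        self._docs[key] = dict(doc)

    def get(self, key):
        """Key/value semantics: fetch exactly one document by its key."""
        return self._docs.get(key)

    def find(self, **template):
        """Document/template semantics: return every document whose fields
        match all of the given field=value pairs (a full scan, no index)."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in template.items())
        ]


store = TinyStore()
store.put("u1", {"name": "ann", "city": "Paris"})
store.put("u2", {"name": "bob", "city": "Oslo"})
store.put("u3", {"name": "cy", "city": "Paris"})

by_key = store.get("u2")             # key/value style
in_paris = store.find(city="Paris")  # document/template style
```

Nothing here says how the store scales; that is an architectural question that sits underneath whichever query style is exposed.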
Tue, 07/27/2010 - 8:22am
Should you pay more in the cloud or pay less for bare metal in the datacenter? This is a crucial decision point facing startups today. Which way should you go? In this Webpulp.tv interview, Joe Stump, always a go-to guy when you need a metric ass-ton (a favorite expression of Joe’s) of good advice on cutting edge practices for the modern startup, laughs at conventional wisdom by saying the cloud is really not more expensive than bare metal.
The argument for a cheaper cloud has three main points:
Tue, 07/27/2010 - 7:55am
Who's Hiring?
Cool Products and Services
- Cloud Sigma. Instantly scalable European cloud servers.
- ManageEngine Applications Manager. ManageEngine provides Enterprise IT Management suite of products.
- Site24x7. Easy, fast and effective web server monitoring, server monitoring and website monitoring service.
Sat, 07/24/2010 - 9:08am
It's trendy today to say "I don't read blogs anymore, I just let the random chance of my social network guide me to new and interesting content." #fail. When someone says this I imagine them flicking their hair back in an "I can't be bothered with true understanding" disdain. And where does random chance get its content? From people like these. So: support your local blog!
If you would like to be a part of random chance, here are a few new podcasts/blogs/vidcasts that you may not know about and that I've found interesting:
- DevOps Cafe. In this new video series, John and Damon visit high-performing companies and record an insider's tour of the tools and processes those companies are using to solve their DevOps problems. DevOps is a profession that finally seems to be realizing its own value. In the first episode John Paul Ramirez takes the crew on a tour of Shopzilla's application lifecycle metrics and dashboard. The second episode features John Allspaw, VP of Technical Operations at Etsy, talking about the new role of DevOps in companies. Only more good stuff from there.
- Packet Pushers. A great podcast by real experts on seriously technical networking issues. They describe their podcast as: a podcast where we talk about routing, switching, security, firewalls, study and market changes. Some topics covered: “Defense in Depth” and what it really means; Deep Diving on Data Centre Switching; Chewing on DDOS; Enterprise MPLS; Career Progression.
Thu, 07/22/2010 - 7:22am
Over the years I've read a lot of research papers looking for better ways of doing things. Sometimes I find ideas I can use, but more often than not I come up empty. The problem is there are very few good papers. And by good I mean: can a reasonably intelligent person read a paper and turn it into something useful?
Now, clearly I'm not an academic and clearly I'm no genius, I'm just an everyday programmer searching for leverage, and as a common specimen of the species I've often thought how much better our industry would be if we could simply move research from academia into production with some sort of self-conscious professionalism. Currently the process is horribly hit or miss. And this problem extends equally to companies with research divisions that often do very little to help front-line developers succeed.
How many ideas break out of academia into industry in computer science? We have many brilliant examples: encryption, microprocessors, compression, transactions, distributed file systems, vector clocks, gossip protocols, MapReduce, search, algorithms, networking, communication, and on ad infinitum. For every Google that breaks out there must be thousands of other potential ideas that go nowhere, even in this hyper-VC aware age.
What we need to do is a better job of using the research. There's a lot out there in the literature that we could be making use of right now, but it's closed off from the people, i.e., developers, who can turn this research into gold. And it's largely closed off because researchers don't consider developers as an audience and they don't write their papers with the intention of being applied. Change the publication process and we can save the cheerleader and save the world.
I'm bringing this up now because:
Tue, 07/20/2010 - 7:16am
At Monday's Cloud Computing Meetup, Paco Nathan gave an excellent Getting Started on Hadoop talk (slides). I found one of Paco's strategies particularly interesting: consider when a service starts charging in cost calculations. Depending on your use case it may be cheaper to go with a more expensive service that charges only for work accomplished rather than charging for both work + startup time.
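The billing point can be checked with a little arithmetic. The sketch below is only an illustration: the rates, the 15-minute startup time, and the function names are assumed numbers and names, not figures from Paco's talk or from any real provider. Prices are in cents per minute so the math stays in whole numbers.

```python
def job_cost_cents(work_min, startup_min, rate_cents_per_min, bills_startup):
    """Cost of one job, given whether the service bills for startup time."""
    billed_min = work_min + (startup_min if bills_startup else 0)
    return billed_min * rate_cents_per_min


def break_even_work_min(startup_min, cheap_rate, pricey_rate):
    """Work duration at which both services cost the same.
    Below this, the pricier service that bills work only is cheaper."""
    return startup_min * cheap_rate / (pricey_rate - cheap_rate)


STARTUP_MIN = 15   # assumed cluster spin-up time
CHEAP_RATE = 10    # assumed: cents/min, bills work + startup
PRICEY_RATE = 15   # assumed: cents/min, bills work only

short_cheap = job_cost_cents(20, STARTUP_MIN, CHEAP_RATE, bills_startup=True)
short_pricey = job_cost_cents(20, STARTUP_MIN, PRICEY_RATE, bills_startup=False)
long_cheap = job_cost_cents(200, STARTUP_MIN, CHEAP_RATE, bills_startup=True)
long_pricey = job_cost_cents(200, STARTUP_MIN, PRICEY_RATE, bills_startup=False)
break_even = break_even_work_min(STARTUP_MIN, CHEAP_RATE, PRICEY_RATE)
```

With these assumed numbers, the nominally more expensive service wins for jobs shorter than the break-even point and loses for longer ones, which is why the use case matters.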
Sat, 07/17/2010 - 7:30am
And by hot I also mean temperature. Summer has arrived. It's sizzling here in Silicon Valley. Thank you air conditioning!
- Scale the web by appointing a Crawler Czar? Tom Foremski has the idea that Google should open up their index so sites wouldn't have to endure the constant pounding by ravenous crawler bots. Don MacAskill of SmugMug estimates 50% of our web server CPU resources are spent serving crawlers. What a waste. How this would all work with real-time feeds, paid feeds (Twitter, movies, ...), etc. is unknown, but does it make sense for all that money to be spent on extracting the same data over and over again?
- Tweets of Gold:
- : Key to applications is architecture. Key for infrastructure supporting archs is configurability. Configurability==features.
- tjake: People who choose their datastore based on hearsay and not their own evaluation are doomed.
- b6n: No global lock ever goes unpunished
- : scalability, systems & process feed each other right?
- : Statements like: "NoSQL database systems are designed for scalability." make me sad.
- : Focus on stability and features first, scalability and manageability second, per-unit performance last of all. This is a quote from Jeff Darcy
Wed, 07/14/2010 - 7:33am
DynaTrace, in Top 10 Performance Problems taken from Zappos, Monster, Thomson and Co, has provided a useful compilation of performance problems, with potential solutions, that they've found while working with their clients.
- Too Many Database Calls - too many database queries per request/transaction.
- Synchronized to Death - in a high-load or production environment over-synchronization results in severe performance and scalability problems.
- Too chatty on the remoting channels - too many calls across remoting boundaries end up causing performance and scalability problems.
- Wrong usage of O/R-Mappers - incorrect usage of the framework itself too often results in unexpected performance and scalability problems within these frameworks.
- Memory Leaks - GC does not prevent memory leaks; it is important to release object references as soon as they are no longer needed.
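The first item in the list, too many database calls, often shows up as the "N+1 queries" pattern. Here is a minimal sketch using Python's built-in `sqlite3` module; the `orders` and `items` tables and both function names are made up for illustration and do not come from the DynaTrace article. It counts the queries issued by a chatty loop and by a single JOIN that returns the same data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO items  VALUES (1, 'a'), (1, 'b'), (2, 'c'), (3, 'd');
""")


def items_chatty(conn):
    """N+1 pattern: one query for the orders, then one more per order."""
    queries = 0
    result = {}
    orders = conn.execute("SELECT id FROM orders ORDER BY id").fetchall()
    queries += 1
    for (order_id,) in orders:
        rows = conn.execute(
            "SELECT sku FROM items WHERE order_id = ? ORDER BY sku", (order_id,)
        ).fetchall()
        queries += 1
        result[order_id] = [sku for (sku,) in rows]
    return result, queries


def items_joined(conn):
    """Same data in a single query using a JOIN.
    Note: an inner join skips orders that have no items."""
    result = {}
    rows = conn.execute(
        "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id "
        "ORDER BY o.id, i.sku"
    ).fetchall()
    for order_id, sku in rows:
        result.setdefault(order_id, []).append(sku)
    return result, 1


chatty, chatty_queries = items_chatty(conn)
joined, joined_queries = items_joined(conn)
```

With three orders the chatty version issues four queries against one for the join, and the gap grows with every additional order. Against a remote database each of those queries is also a network round trip.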
Tue, 07/13/2010 - 8:31am
Who's Hiring?
VoltDB Field/Community Engineer
VoltDB is attracting more and more users every day. If you have a strong technical background in SQL and Linux, are experienced with production database deployments, and have a passion for customers and community, you could be just the person we are looking for. Are you excited about the prospect of working with users to develop and deploy VoltDB applications, and about helping users participate in the thriving VoltDB community? If so, read on at their job page.
Get Your High Scalability Fix at Digg
Interested in working on cutting-edge high-scale infrastructure at Digg? We're making a big investment in scaling and have committed to the NoSQL (Not only SQL) path with Cassandra. We're using other open-source infrastructure to help us scale including Hadoop, RabbitMQ, Zookeeper, Thrift, HDFS and Lucene. We're rewriting Digg from the ground up and we need amazing developers to join our world-class team. If you think you are up for the challenge, or you know someone who might be, take a look at our jobs page for more information.
Tue, 07/13/2010 - 7:45am
This is a follow-up article by Cory Isaacson to the first article on dbShards, Product: dbShards - Share Nothing. Shard Everything, describing some of the details of how dbShards works on the inside.
The dbShards architecture is a true “shared nothing” implementation of Database Sharding. The high-level view of dbShards is shown here:
The above diagram shows how dbShards works for achieving massive database scalability across multiple database servers, using native DBMS engines and our dbShards components. The important components are:
Mon, 07/12/2010 - 8:32am
Like many other media content providers, libraries and museums are increasingly moving their content onto the Web. While the move itself is no easy process (with digitization, web development, and training costs), being able to successfully deliver content to a wide audience is an ongoing concern, particularly for large libraries.
Much of the concern is financial, as most libraries do not have the internal budget or outside investors that for-profit businesses enjoy. Even large university libraries face serious budget constraints that other university departments, such as science and technology, would not face.
Creating a scalable infrastructure that can handle multiple requests while distributing a large digital collection requires planning that many librarians have not even imagined. They must stop thinking in terms of "one-item-per-customer" and start thinking in terms of numerous users accessing the same information simultaneously.
Sun, 07/11/2010 - 10:22am
A firestorm of accusations circled around recently saying that Cassandra, the elected-by-major-adopters emperor of the NoSQL movement, has no clothes. It was said Twitter was dumping Cassandra; Reddit outages were linked to Cassandra; and even Facebook, Cassandra's cradle of birth, was said to have abandoned Cassandra. Shouts of NoSQL Fail! were heard in the streets. Much gloating followed. Is the emperor really naked? Casually dressed maybe, but not naked.
(Note: after this point the article contains a flow chart that is NSFW. Some people are very sensitive about cussing, so if that's you, please go back, don't read on. Danger! There are no nude pictures or anything, just some strong language. But this is my most favorite flow chart of all time, so it's worth it :-)
Is Twitter really abandoning Cassandra?
Fri, 07/09/2010 - 8:55am
- Facebook serves 3 billion Like buttons a day says VentureBeat.
- CloudScaling reports: Rumor Mill: Google EC2 Competitor Coming in 2010? It looks like GAE for PaaS and an EC2 clone for IaaS.
- Tweets of gold:
- alandipert: scalability is a drug
- seldo: Scalability lesson #23: if any part of your system involves a list that gets bigger over time, eventually that list will become too big.
- obfuscurity: Her: "Go look at the pictures on the database." Me: "You mean our fileserver?" Her: "Whatever."
- luiscab: Ouch, I just read on an Info Mgmt rag that Hadoop could easily be an acronym for "Heck, Another Darn Obscure Open-source Project."
- sanity: Depressed about how much time I've had to spend searching for the right database solution for a new project. Each has its flaws
- ioshints: You cannot take a car, grow it 10 times and expect to get a mining truck.
Thu, 07/08/2010 - 7:56am
This is a guest post by Frédéric Faure (architect at Ysance) on the differences between using a cloud infrastructure and building your own. Frédéric was kind enough to translate the original French version of this article into English.
I’ve been noticing many questions about the differences inherent in choosing between a Cloud infrastructure such as AWS (Amazon Web Services) and a traditional physical infrastructure. Firstly, there are a certain number of preconceived notions on this subject that I will attempt to decode for you. Then, it must be understood that each infrastructure has its advantages and disadvantages: a Cloud-type infrastructure does not necessarily fulfill your requirements in every case; however, it can satisfy some of them by optimizing or facilitating the features offered by a traditional physical infrastructure. I will therefore demonstrate the differences between the two that I have noticed, in order to help you make up your own mind.
Wed, 07/07/2010 - 10:16am
Professor Lance Fortnow, in his blog post Drowning in Data, says complexity has taught him this lesson: When storage is expensive, it is cheaper to recompute what you've already computed. And that's the world we now live in: Storage is pretty cheap but data acquisition and computation are even cheaper.
Jouni, one of the commenters, thinks the opposite is true: storage is cheap, but computation is expensive. When you are dealing with massive data, the size of the data set is very often determined by the amount of computing power available for a certain price. With such data, a linear-time algorithm takes O(1) seconds to finish, while a quadratic-time algorithm requires O(n) seconds. But as computing power increases exponentially over time, the quadratic algorithm gets exponentially slower.
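Jouni's argument can be worked through with a few numbers. The sketch below is only an illustration under stated assumptions: computing power doubles every generation, and the data set is sized so that it grows in proportion to that power. The function name and the constants are made up.

```python
def run_times(generations, base_power=1.0, data_per_power=1.0):
    """Running times per hardware generation, assuming power doubles each
    generation and the data set size n grows in proportion to power."""
    rows = []
    for g in range(generations):
        power = base_power * 2 ** g       # operations per second
        n = data_per_power * power        # data set size tracks power
        linear_secs = n / power           # n operations: stays constant
        quadratic_secs = n * n / power    # n^2 operations: grows like n
        rows.append((g, linear_secs, quadratic_secs))
    return rows


table = run_times(4)
linear_times = [row[1] for row in table]
quadratic_times = [row[2] for row in table]
```

Under these assumptions the linear algorithm always finishes in the same time, while the quadratic one takes time proportional to n. Because n itself doubles every generation, the quadratic running time doubles too, which is the sense in which it "gets exponentially slower".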
For me it's not a matter of which is true; both positions can be true. What's interesting is to think that storage and computation are in some cases fungible. Your architecture can decide which tradeoffs to make based on the cost of resources and the nature of your data. I'm not sure, but this seems like a new degree of freedom in the design space.
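One everyday form of that trade is memoization: spend memory to avoid repeating work, or spend CPU to avoid keeping results around. The following is a minimal sketch using Python's standard `functools.lru_cache`; the `expensive` function is a stand-in invented for illustration, and the call counters exist only to make the trade visible.

```python
from functools import lru_cache

calls = {"recompute": 0, "stored": 0}


def expensive(n):
    """Stand-in for a costly computation."""
    return sum(i * i for i in range(n))


def recompute(n):
    """Pay CPU every time, keep nothing in memory."""
    calls["recompute"] += 1
    return expensive(n)


@lru_cache(maxsize=None)
def stored(n):
    """Pay CPU once per distinct argument, then pay memory for the result."""
    calls["stored"] += 1
    return expensive(n)


for _ in range(3):
    a = recompute(1000)
    b = stored(1000)

cached_entries = stored.cache_info().currsize
```

Which side of the trade to take depends on exactly the costs discussed above: how expensive the result is to recompute versus how expensive it is to hold on to.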
Fri, 07/02/2010 - 7:30am
- What says 4th of July like Nathan's ultimate scalable hot dog eating contest? This totally requires a scale-up strategy.
- Facebook at 60,000 servers and counting.
- Deepak Singh has collected some impressive massive data stats on extreme Hadoop usage: Facebook: 36 PB of uncompressed data, 2250 machines, 23,000 cores, 32 GB of RAM per machine, processing 80-90TB/day; Yahoo: 70 PB of data in HDFS, 170 PB spread across the globe, 34000 servers, Processing 3 PB per day, 120 TB flow through Hadoop every day; Twitter: 7 TB/day into HDFS; LinkedIn: 120 Billion relationships; 82 Hadoop jobs daily (IIRC); 16 TB of intermediate data.
- Who knew DevOps could be so funny? Adam Jacob, CTO of Opscode, gave a hilarious talk at the Velocity conference on the true nature of DevOps. Warning: your neck may get sore from nodding in agreement so much and your belly may ache from laughing so much.
Wed, 06/30/2010 - 8:41am
In the never ending quest to figure out how to do something useful with never ending streams of data, GraphLab: A New Framework For Parallel Machine Learning wants to go beyond low-level programming, MapReduce, and dataflow languages with a new parallel framework for ML (machine learning) which exploits the sparse structure and common computational patterns of ML algorithms. GraphLab enables ML experts to easily design and implement efficient scalable parallel algorithms by composing problem specific computation, data-dependencies, and scheduling. Our main contributions include:
- A graph-based data model which simultaneously represents data and computational dependencies.
- A set of concurrent access models which provide a range of sequential-consistency guarantees.
- A sophisticated modular scheduling mechanism.
- An aggregation framework to manage global state.
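The ideas in this list can be pictured with a toy example. The sketch below is not the GraphLab API, and it runs sequentially rather than in parallel; every name in it is hypothetical. It only shows the general shape: data attached to graph vertices, an update function that works on a vertex and its neighbors, a scheduler that decides what runs next, and an aggregation step that folds vertex data into global state. The update rule used here (adopt the smallest label nearby) finds connected components.

```python
from collections import deque


def min_label_update(vertex, data, neighbors):
    """Update function: adopt the smallest label in the vertex's scope.
    Returns the neighbors to reschedule (only if this vertex changed)."""
    best = min([data[vertex]] + [data[n] for n in neighbors[vertex]])
    if best < data[vertex]:
        data[vertex] = best
        return neighbors[vertex]
    return []


def run(data, neighbors, update):
    """FIFO scheduler: run updates until no vertex asks for more work."""
    queue = deque(sorted(data))
    queued = set(queue)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        for n in update(v, data, neighbors):
            if n not in queued:
                queue.append(n)
                queued.add(n)
    return data


def aggregate(data, fold, initial):
    """Aggregation step: fold all vertex data into one global value."""
    acc = initial
    for v in sorted(data):
        acc = fold(acc, data[v])
    return acc


# Two components: 0-1-2 and 3-4. Each vertex starts labeled with its own id.
neighbors = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = run({v: v for v in neighbors}, neighbors, min_label_update)
components = aggregate(labels, lambda acc, label: acc | {label}, set())
```

Only vertices whose neighborhood changed get rescheduled, which is the kind of sparse, data-dependent work pattern the abstract says the framework is designed to exploit.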
Mon, 06/28/2010 - 7:36am
What do you get when you take a SQL database and start a new implementation from scratch, taking advantage of the latest research and modern hardware? Mike Stonebraker, the sword wielding Johnny Appleseed of the database world, hopes you get something like his new database, VoltDB: a pure SQL, pure ACID, pure OLTP, shared nothing, sharded, scalable, lockless, open source, in-memory DBMS, purpose-built for running hundreds of thousands of transactions a second. VoltDB claims to be 100 times faster than MySQL, up to 13 times faster than Cassandra, and 45 times faster than Oracle, with near-linear scaling.
Will VoltDB kill off the new NoSQL upstarts? Will VoltDB cause a mass extinction of ancient databases? Probably no to both questions, but it's a product with a definite point-of-view and is worth a look as the transaction component in your system. But will it be right for you? Let's see...
Fri, 06/25/2010 - 7:44am
- Royans Tharakan is blogging like a mad man at the Velocity Conference. Read a summary of many of the presentations on his blog.
- Zuckerberg almost guarantees 1 billion Facebook users. And I almost believe him.
- Northscale introduces Membase, a new distributed key-value NoSQL competitor that features a memcache-compatible interface yet is persistent like a database. Hopefully we'll have more on their internals later.
- Notable Tweets:
- Aaron Cordova - scalability means "can change size" and also "works at large sizes" - this conflates two orthogonal features of cloud computing.
- Jaime Garcia Reinoso - It's the scalability, stupid!
- Alex Averbuch - when I read/hear "unlimited/infinite scalability" I stop reading/listening and start thinking about cake.
- Dennis Clark - I used to smirk at developers whose main DB experience was in MUMPS or Pick, until I realized those are old-school engines.