vespaengine

Open Sourcing Vespa, Yahoo’s Big Data Processing and Serving Engine

By Jon Bratseth, Distinguished Architect, Vespa

Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo’s big data processing and serving engine, available as open source on GitHub.

Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, in real time and at internet scale – capabilities that, up until now, have been within reach of only a few large companies.

Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user’s query or interests, it won’t do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly packaged, easy-to-use engine.
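To make the serving-time steps above concrete, here is a minimal sketch in plain Python. It is not Vespa's API: the `Article` type, the `serve` function, the toy scoring formula, and the sample corpus are all hypothetical, chosen only to show matching, per-request ranking, and de-duplication happening while the user waits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical document type, for illustration only.
    doc_id: str
    title: str
    topics: frozenset


def serve(query_terms, user_interests, corpus, limit=3):
    """Match, rank, de-duplicate, and return document IDs for one request."""
    query = {t.lower() for t in query_terms}

    # 1. Find all items matching the query.
    matches = [a for a in corpus if query & set(a.title.lower().split())]

    # 2. Score each match for this particular request (toy relevance model:
    #    query-term hits plus a bonus for overlapping user interests).
    def score(article):
        term_hits = len(query & set(article.title.lower().split()))
        interest_hits = len(user_interests & article.topics)
        return term_hits + 2 * interest_hits

    ranked = sorted(matches, key=score, reverse=True)

    # 3. Remove duplicates, keeping the best-ranked copy of each title.
    seen, results = set(), []
    for article in ranked:
        key = article.title.lower()
        if key not in seen:
            seen.add(key)
            results.append(article)

    # 4. Return the response.
    return [a.doc_id for a in results[:limit]]


CORPUS = [
    Article("a1", "Vespa open source release", frozenset({"tech"})),
    Article("a2", "Vespa scooter review", frozenset({"travel"})),
    Article("a3", "Vespa open source release", frozenset({"tech"})),  # duplicate
    Article("a4", "Hadoop batch processing", frozenset({"tech"})),
]

print(serve(["vespa"], frozenset({"tech"}), CORPUS))
print(serve(["vespa"], frozenset({"travel"}), CORPUS))
```

The same query returns a different ordering for a reader interested in travel than for one interested in tech, which is why the result cannot simply be computed upfront.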

With over 1 billion users, we currently use Vespa across many different Oath brands – including Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company’s revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

With Vespa, our teams build applications that:

  • Select content items using SQL-like queries and text search
  • Organize all matches to generate data-driven pages
  • Rank matches by handwritten or machine-learned relevance models
  • Serve results with response times in the low milliseconds
  • Write data in real-time, thousands of times per second per node
  • Grow, shrink, and re-configure clusters while serving and writing data

To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low latency query and processing algorithms, handling distributed data consistency, and a lot more. It’s a ton of hard work!
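The idea of pushing computation to the data can be sketched as a scatter-gather query. The following is a toy single-process illustration under stated assumptions, not how Vespa is implemented: the `partition`, `local_top_k`, and `query` helpers, the hash-by-ID placement, and the term-count score are all hypothetical. Each "node" scores only its own documents and sends back at most `k` hits, so the merging tier never has to pull the full dataset.

```python
import heapq


def partition(docs, num_nodes):
    """Assign each (doc_id, text) pair to a node by a deterministic hash of its ID."""
    nodes = [[] for _ in range(num_nodes)]
    for doc_id, text in docs:
        nodes[sum(doc_id.encode()) % num_nodes].append((doc_id, text))
    return nodes


def local_top_k(node_docs, term, k):
    """Runs where the data lives: score locally, return only the k best hits."""
    scored = ((text.lower().split().count(term), doc_id) for doc_id, text in node_docs)
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def query(nodes, term, k=2):
    """Scatter the query to every node, then merge the small partial results."""
    partials = [local_top_k(node, term.lower(), k) for node in nodes]
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


DOCS = [
    ("d1", "vespa vespa vespa"),
    ("d2", "hadoop storm"),
    ("d3", "vespa serving"),
    ("d4", "vespa vespa engine"),
    ("d5", "storm"),
]

print(query(partition(DOCS, 3), "vespa"))
```

Because every document in the global top `k` is also in its own node's top `k`, merging the partial lists gives the same answer as scoring everything in one place, however the documents are spread across nodes.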

As the team behind Vespa, we have been working on developing search and serving capabilities ever since building alltheweb.com, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience onto a modern technology stack. Vespa is larger in scope and lines of code than any open source project we’ve ever released. Now that this has been battle-proven on Yahoo’s largest and most critical systems, we are pleased to release it to the world.

Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore, it allows developers to work in a more interactive way where they navigate and interact with complex calculations in real time, rather than having to start offline jobs and check the results later.

Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them either on your own laptop or as an AWS cluster.

We’ll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

We can’t wait to see what you’ll build with it!

staff
mooserattler

Reblog this picture of me holding a Family Size box of Honey Nut Cheerios? I’d really appreciate it.

moonblossom

How can I say no to such a great photo and such a polite request?

allthelittlebeagles

i will always support this post

freshrosemary

@mooserattler back on my dash!

jjflow

Why isn’t this at a million notes, yet, Dante???

mooserattler

I’m not sure. Hey lovely people who have taken me over half way to a cool million! If you’d like to reblog again, I’d love that, if not, I still love you, and hope you’re having a great day. I’m gonna go do some stand up tonight.

bidoof

god come on we’re so close. this is like the only meaningful thing that this website could ever achieve

Source: mooserattler
staff
politedoge

you know what really fucking gets my cookies frosted sometimes??? i’ll be on the goddamn blue website scrolling along and suddenly come across a picture like this and i actually stop scrolling and go out of my way to share a picture of a man with a sly grin holding a fucking pineapple with a bunch of people who choose to look at what i put on my blog. people expect this from me. i hold the power to grace a plethora of people’s eyes with this picture. almost 20 thousand other people have looked at this and subconsciously decided that this represents the type of image that they want to share with others with no context. look at this man

Source: ejacutastic
kranglefant

Tear down this wall!

“We can’t work as efficiently as smaller companies. We’re just too big. We’ve got too many systems, too many customers and too many employees.”

If I had a dollar every time I heard something along those lines.  Too big. Bullshit. Your problem is not your size, whether measured by customers, systems or employees.  Your problem is how you manage them. Centralized control does not scale well.  That’s your problem. You’re trying to control all of your systems, all of your employees and all of your customers the same way.  That’s what’s slowing you down.  5 year plans and centralized control: this is software development communism.

Instead of letting go and delegating control to the various application teams, you obsess over creating a constellation of applications that looks nice on a diagram. A diagram that lets you understand the entirety of everything and how it’s all connected. One integration platform. One process engine.  Instead of every app having its own domain model, you create one common one for all to use.  Much easier for the person at the top to keep track of.  Worse for everyone else.

Simple example. Say you have several applications that all need to generate, archive and distribute documents. Today they all have their own separate code to do this. Duplication! The cries go out. We need to create a common document handling service they can all use. Ok, NOW you have a problem. Because now you’ve got coupling. And coupling is a much worse enemy than duplication. The number of systems doesn’t matter nearly as much as the amount of coupling you have between them. Instead of 3, say, separate, standalone apps that can be updated at will and tested in complete isolation, you now have 3 apps that are all connected to the same service. What if app 1 needs that service to behave a little bit differently in certain cases? To make this change you now need to retest your entire portfolio to ensure your new change didn’t break anything for the other connected apps. Alternatively, you can make the change in the client app itself, thus allowing document handling logic to seep out of the document service and exist in both the common document service AND each individual application. Over time, typically both things happen. The common service grows and grows and gets more and more complex and configurable. Each client app only uses a small fraction of the functionality provided. It also gets harder and harder to maintain, as the intent of the code is hidden in configurability. At the same time a substantial portion of document handling logic is added to each of the client apps to compensate for the functionality that could not be added to the common module. A complete mess.
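The retest problem in the example above can be put in a few lines of Python. This is only a sketch under assumed names: `retest_scope`, the `app1`..`app3` labels and `doc_service` are hypothetical, and only direct dependencies are considered.

```python
def retest_scope(changed, depends_on):
    """Return every component that must be retested when `changed` is modified.

    `depends_on` maps each app to the set of shared components it calls.
    Only direct dependents are considered in this sketch.
    """
    affected = {changed}
    affected |= {app for app, deps in depends_on.items() if changed in deps}
    return sorted(affected)


# Three standalone apps, each with its own (duplicated) document code.
STANDALONE = {"app1": set(), "app2": set(), "app3": set()}

# The same three apps after extracting a common document service.
SHARED = {
    "app1": {"doc_service"},
    "app2": {"doc_service"},
    "app3": {"doc_service"},
}

# Changing document handling for app 1 only:
print(retest_scope("app1", STANDALONE))
print(retest_scope("doc_service", SHARED))
```

With duplicated code, a change for app 1 touches app 1 alone; with the common service, the same change drags every connected app into the test run.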

If you don’t create a common service, you will most likely have a bit of duplication. This means that in the case where something needs to be updated – say the archiving system is upgraded – the required change will have to be implemented several times over. Every system using the archive must be updated. This is annoying manual labor. But it is not difficult. Writing and maintaining a common module that has to work for every possible client requires less annoying manual grunt work, but it is very difficult. Know what tradeoffs you’re making.

Sometimes common services pay off of course. If they’re small, and only do one thing and do that same thing for all clients, they can be very valuable. Which brings me to my second point. Don’t worry about the connections looking messy when you draw every app and service in one diagram. If there are 400 applications, and each one is connected to 3 other ones, it’s going to look appalling. Don’t draw that diagram! The only one interested in that diagram is the enterprise architect. Screw him (or her). The architect is far less important than your end user. The purpose of your system is to provide meaning and value to the end user, not your enterprise architect. Any one user will never be using all your systems. They will use a couple of them. Those are the diagrams you need to focus on. That’s where your energy should be spent. Focus on what the users need, and how the data they are providing or receiving is passed through the part of the system they are in contact with. If that diagram is full of lots of systems, then you might have a problem. If, for a user to enter some data, you need to go from a user interface app, to a process engine, to a document handling service, to a queue, to an archive and then back again, that might be something you’d want to simplify. This is the only context in which you should worry about the number of systems and the amount of coupling. But reducing the number of systems for each user will increase the number of systems at the enterprise level (and vice versa). This is why the enterprise architect doesn’t like it. We need to remind ourselves who we are really working to please. Our users or our architects?

Another argument I hear a lot in favor of introducing common systems, and reducing the number of applications is that “users dislike having so many systems to work with”. No they don’t! Ask anyone – what do you prefer: SAP (one system that does it all) or the iPhone with a multitude of apps downloaded from the App Store? They prefer the iPhone.  There are hundreds if not thousands of apps for every kind of need. But this is much better than the alternative: One app that serves everyone badly.  This idea that users dislike having many systems is bogus.  You know what they don’t like? They don’t like having to remember lots of usernames and passwords.  With a common authentication system (yes, I’m advocating reuse!) that problem is solved. What users don’t like is bad user experiences. It’s much easier to create good user experiences if you’re allowed to create several tailored small apps that serve very specific purposes.  If you only have one app, and it has to do everything imaginable for everyone, it’s not going to be very user friendly.  

So, no, I don’t accept your excuse.  Large companies are not inefficient, slow and bureaucratic because of their size. They’re inefficient, slow and bureaucratic because of their communism.
Mr. Enterprise Architect, tear down this wall!