Service Discovery: 6 questions to 4 experts

The successful format of ‘Immutable Infrastructure: 6 questions to 6 experts’ is back.

The topic of Service Discovery is not new (just think DNS!) but it’s been moving fast over the last few years thanks in part to the explosion of services and microservices architectures and, of course, Docker.

As it often happens in these cases, we see a lot of different interpretations of its meaning, benefits, challenges, and adoption paths. We asked 4 experts who have been thinking, writing and implementing Service Discovery to share their experience by answering the following questions on the topic:

1) What does Service Discovery mean to you?
2) What are the key aspects Service Discovery should have and why?
3) What are the main benefits you see/care about?
4) What are the most solid options out there today?
5) Biggest adoption challenges that are not there yet in your opinion?
6) Starting from scratch is (relatively) easy. What about those with existing systems? any hints on how people could get started introducing Service Discovery in existing systems?

The experts

Sam Newman, a techie at ThoughtWorks. Aside from other things I’ve committed sporadically to open source projects, spoke at more than a few conferences, and wrote some things including the book Building Microservices for O’Reilly.
Nitesh Kant is an engineer in the Cloud & Platform engineering team at Netflix, where he has been leading the evolution of the Inter Process Communication (IPC) stack to follow the reactive programming paradigm. He is the core contributor of the reactive IPC library RxNetty which forms the heart of this new stack. He’s also a contributor to eureka: Netflix’s Service Discovery system and has ideated and designed the 2.0 version. Eureka is the backbone of Netflix’s microservices architecture handling thousands of application instances in the Netflix ecosystem.
Jeff Lindsay is the author of Dokku and many Docker related open source projects. He is the co-founder and principal of Glider Labs, a DevOps consultancy specializing in modern, Docker-oriented system architectures. Jeff was involved in the early stages of Docker development, and has worked with many other organizations such as Twilio, Digital Ocean, and NASA Ames on distributed systems and developer platforms.
Eberhard Wolff works as a Fellow for innoQ. His technological focus is on modern architectures – often involving Cloud, Continuous Delivery, DevOps, Microservices and NoSQL Eberhard is a regular speaker at international conferences and author of over 100 articles and several books – most in German.

Highlights

The full transcript of all answers is at the end and we recommend you read them all since they cover a lot of ground. Here are just some highlights:

1) What does Service Discovery mean to you?

It keeps track of all the services in a (large scale) distributed system so that they can be found by both people and other services. Think of DNS as a simple example but on steroids: complex systems need features like storing metadata about a service, health monitoring, varying query capabilities, realtime updates, etc.

It differs depending on the context, e.g. network device discovery, rendezvous discovery, SOA discovery but in all cases it’s a coordination mechanism for services to announce themselves and find others without configuration.

2) What are the key aspects Service Discovery should have and why?

As the backbone of any large-scale service oriented architecture Service Discovery needs the be highly available and cover 3 main aspects: Registration, Directory and Lookup. Having just the Directory is not enough.

As mentioned the ability to store metadata is key since complex services provide multiple service interfaces and ports and often require non-trivial deployment environments. And once you have lots of metadata you need powerful querying capabilities, including health/status.

3) What are the main benefits you see/care about?

The main benefit is what in network discovery referred to as “zero configuration”: rather than hardcode addresses we specify a service name (and sometimes not even that!). In modern architectures nodes come and go and you need to decouple individual service instances from the knowledge of the deployment topology of your architecture.

Having a way to get insights into the deployment status of all the services and to control the available instances from a centralized place becomes key, especially in complex system s which warrant something more than just DNS.

4) What are the most solid options out there today?

Service Discovery solutions are a plenty today in the industry.

As mentioned DNS has been used for a long time and is probably the largest Service Discovery system out there. For small-scale setups start with DNS but once you start provisioning nodes more dynamically, DNS starts becoming problematic due to the propagation time.

Arguably, Zookeeper is the most mature of the config stores used for discovery since it has been around for quite some time and is a comprehensive solution including configuration management, leader election, distributed locking etc. This makes it a very compelling general-purpose solution although it’s often more complex than it could be.

etcd & doozerd are the new age cousins of Zookeeper, built with similar architectures and features sets and hence can be used interchangeably in place of Zookeeper

Consul is a newer solution in this space that provides configuration management and a generic key-value store apart from Service Discovery. It also has killer features of health checking of nodes and supporting DNS SRV for improved interop with other systems. A big differentiator from Zookeeper is the HTTP & DNS APIs that can be used to interact with consul vis-à-vis a Zookeeper client.

If you lean more towards AP systems Eureka is a great choice and is battle tested in Netflix and it prefers Availability over Consistency in the wake of network partitions.

5) Biggest adoption challenges that are not there yet in your opinion?

It’s more complicated than you realize: it’s an extension of the distributed systems problem.

You might roll out configuration files with service names, IPs and ports but when the system becomes very dynamic you need to migrate to a “real” Service Discovery solution and that migration is usually not as easy as you think. One of the biggest challenges is the inability to understand how intrusive the choice of a Service Discovery system is: once chosen it is very difficult to change it and hence it is critical to do it right.

Most systems implement some form of distributed consensus algorithms, designed to be resilient in the face of node outages, but these algorithms are notoriously hard to get right and understanding failure modes is both key and difficult and failing to analyse them correctly usually takes you to make the wrong choices.

6) Starting from scratch is (relatively) easy. What about those with existing systems? Any hints on how people could get started introducing Service Discovery in existing systems?

The first step of starting to use a Service Discovery system is for clients to stop baking knowledge about the dependent service deployment environments. Once the clients stop baking that knowledge in code they are forced to store that knowledge externally. Starting as simple as storing this knowledge in properties, they can slowly graduate to looking up this knowledge dynamically from an external Service Discovery system.

Once you pick a mechanism (network or service) and understand what you need to optimize for (and that’s maybe the hardest part), it’s just a matter of introducing integration points for registration and lookup.

Full transcript PDF

You can download the free, 12-page PDF that includes both the highlights above and the full transcript of all the answers grouped by questions here: