
Project 3 Part 2


Final Report

Final Problem Statement

Up until assignment 2, our weather application lacked a service mesh layer on top of the Kubernetes architecture. While GCP provides a wealth of benefits for the organizations and teams (us included) that use it, there's no denying that adopting the cloud can put a strain on DevOps teams. We developers must use microservices to architect for portability, while operators manage extremely large hybrid and multi-cloud deployments. We figured Istio would let us connect, secure, control, and observe these services.

At an abstract level, Istio would help us reduce the complexity of these deployments and ease the strain on our development team. It is a completely open-source service mesh that layers transparently onto existing distributed applications. It is also a platform, with APIs that let it integrate into any logging, telemetry, or policy system. Istio's diverse feature set would let us run a distributed microservice architecture successfully and efficiently, and it provides a uniform way to secure, connect, and monitor our microservices. And it delivered what was promised and more!

As described in our Project 3 Part 1, we planned on integrating an Istio service mesh into our current system architecture and adding features such as service authorization, security, canary deployment, better monitoring, and improved metrics observability. We also aimed to compare what we could complete in the timeframe with GCP's own managed service mesh (e.g., Anthos), which, for instance, provisions TLS certificates for us and provides better logging/monitoring on Stackdriver. As a cherry on top, we planned to monitor error logs in each service by using Sentry to send a notification with the error trace log to a dedicated Slack channel.

Changes to Initial Problem Statement

As it turns out, we were able to add most of the features we targeted for our architecture, except for a few that were blocked by Istio 1.5 still being in beta for GKE nodes.

Canary deployments, for instance, sounded much better and more pragmatic on paper. But due to some limitations and/or lack of support, we went with traditional rolling deployments and added horizontal pod autoscaling instead. This seemed like the right way to handle the problem, since canary deployments would require a significant user base for our weather prediction system to test against.
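
For reference, below is a minimal sketch of the kind of HorizontalPodAutoscaler we attach to a deployment; the deployment name and thresholds are illustrative, not our exact values.

```yaml
# Minimal HPA sketch (illustrative names/thresholds): scale a deployment
# between 1 and 5 replicas based on average CPU utilization.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: data-retrieval-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-retrieval          # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70
```

Compared with a canary rollout, this keeps a single stable version and simply scales replicas up and down with load.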

Although we did inject each deployment with its own Istio sidecar, the services still lacked a certain level of standard security. So we manually added mTLS certificates to take care of secure handshakes on the NodePorts. On the other hand, we used cert-manager with the Let's Encrypt staging (and production) servers to provision HTTPS certificates to handle the same issue. We then compared the results below.
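
As a rough illustration, here is a minimal sketch of the two approaches, assuming Istio 1.5's PeerAuthentication API and a 2020-era cert-manager API version; the names, namespace scope, and email are placeholders rather than our exact manifests.

```yaml
# Mesh-wide strict mTLS between sidecars (Istio 1.5 PeerAuthentication).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system        # root namespace, so it applies mesh-wide
spec:
  mtls:
    mode: STRICT
---
# Tell in-mesh clients to originate mTLS to every service.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: default
  namespace: istio-system
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
# cert-manager issuer pointing at the Let's Encrypt staging server, used to
# provision HTTPS certificates at the edge (placeholder email address).
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
    - http01:
        ingress:
          class: istio
```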

We also configured Sentry in all the services and tested the Slack-channel notification feature. However, since its 21-day trial expired, we decided to keep that feature in the code but not in use, for monetary reasons.

Problem Statement Development

Links to GitHub issues:

Methodology

As part of our previous milestone, we carried out fairly extensive load testing using JMeter to understand the breaking point of our system (results are in the Project 2 wiki). However, we lacked the traffic management and monitoring tools to understand which service was the bottleneck. Not knowing the breaking point of each specific service can be a real problem in distributed system architectures, since every service has its own capacity to handle users. With proper monitoring tools, we can find these per-service breaking points and then work on improving them.

Furthermore, we lacked an observability tool to visualize our system architecture. In a distributed system, since we usually use a lot of microservices, our own as well as third-party ones, it is important to be able to see the entire graph of the system in a simple visualization. This enables teams to think about the architecture visually. It is also important to see this visualization in real time as traffic flows through the system. This was missing from our architecture, and we wanted to implement it.

The lack of such tools led us to implement a Service Mesh on top of our system since a service mesh provides more control and monitoring of individual services and pods.

Implementation

Initial Setup

  1. Install Istio. There are two main ways to install Istio. We first install using Helm charts and work with that. However, this leads to some issues with Grafana and Kiali due to version incompatibilities, so we remove it and use the istioctl command-line tool to install Istio into our project instead (see the sketch below).
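
As an illustration, the istioctl-based install can be driven by an IstioOperator manifest roughly like the one below; the profile and addon flags are assumptions, not our exact configuration.

```yaml
# Applied with:  istioctl manifest apply -f istio-operator.yaml   (Istio 1.5)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: demo          # the demo profile bundles the telemetry addons
  values:
    grafana:
      enabled: true      # Grafana addon
    kiali:
      enabled: true      # Kiali addon
```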

Service Mesh Integration

  1. Enable sidecar injection on our system and inject a sidecar (Envoy proxy) into every microservice pod. There are two main ways to do it: manual and automatic. We choose automatic sidecar injection because we don't need any custom sidecar configuration and the default configuration is good enough (see the sketch after this list).
  2. Create a Gateway for the system using the istio-gateway.yaml file. We open up the gateway to all hosts. This Gateway attaches to the istio-ingressgateway of our system, which is of type LoadBalancer.
  3. Create virtual services for all our microservices using istio-virtualservice.yaml and connect every virtual service to the gateway.
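
Below is a minimal sketch of how these pieces fit together; the namespace, hosts, names, route prefix, and port are illustrative rather than copied from our istio-gateway.yaml / istio-virtualservice.yaml.

```yaml
# Automatic sidecar injection is enabled per namespace with:
#   kubectl label namespace default istio-injection=enabled
#
# Gateway open to all hosts, attached to the default istio-ingressgateway
# (a LoadBalancer service).
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: weather-gateway            # hypothetical name
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
# One VirtualService per microservice, bound to the gateway above.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: data-retrieval             # hypothetical service name
spec:
  hosts:
  - "*"
  gateways:
  - weather-gateway
  http:
  - match:
    - uri:
        prefix: /data-retrieval    # hypothetical route prefix
    route:
    - destination:
        host: data-retrieval       # Kubernetes service name
        port:
          number: 8080             # hypothetical port
```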

Integrating Observability Tools

  1. Enable Grafana on the system.
  2. The main issue we struggle with is that the Grafana dashboard works on localhost but is not accessible via the public IP.
  3. So we try changing the Grafana service from ClusterIP to NodePort.
  4. We create a custom gateway, virtual service, and destination rule for Grafana and telemetry. The telemetry gateway and virtual service don't help, so we remove them. (A sketch of this approach, applied to Grafana, follows this list.)
  5. Enable Kiali on the system.
  6. Kiali faces the same issue as Grafana and only works on localhost.
  7. We try changing Kiali from ClusterIP to NodePort and then to LoadBalancer. We also create a custom gateway, virtual service, and destination rule for Kiali.
  8. We remove the Kiali custom gateway, virtual service, and destination rule before changing it to LoadBalancer.
  9. Grafana and Kiali both work on public IPs, although Kiali gets its own public IP.
  10. While checking our system architecture graph, we notice that Kiali does not show the interactions between all services properly. It lacks the istio-ingressgateway and shows every service sending data to telemetry.
  11. We suspect this might be because we created the Kiali service as type LoadBalancer. Hence, we change it back to ClusterIP and experiment with its YAML file to make it work.
  12. We notice that our Helm installation of Istio has become corrupt and the Kiali installation shows incompatibilities with the Pilot and Citadel versions.
  13. Hence, we remove the Helm installation of Istio completely, do a fresh installation using istioctl, and repeat most of the above steps to get Grafana and Kiali working.
  14. We add additional firewall rules to our GCP cluster for the Istio ingress ports.
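
For illustration, exposing the Grafana addon through the Istio ingress gateway looks roughly like the sketch below, modeled on the standard pattern for remotely accessing the telemetry addons; the port and names are assumptions, and the chosen ingress port also has to be allowed by the GCP firewall rules mentioned above.

```yaml
# Expose Grafana on a dedicated port of the istio-ingressgateway. The port is
# hypothetical; it must be exposed on the istio-ingressgateway service and
# opened in the GCP firewall.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: grafana-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 15031                # hypothetical dedicated port for Grafana
      name: http-grafana
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: grafana-vs
  namespace: istio-system
spec:
  hosts:
  - "*"
  gateways:
  - grafana-gateway
  http:
  - route:
    - destination:
        host: grafana              # Grafana service installed by Istio
        port:
          number: 3000             # Grafana's default port
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: grafana-dr
  namespace: istio-system
spec:
  host: grafana
  trafficPolicy:
    tls:
      mode: DISABLE                # the Grafana addon itself does not speak mTLS
```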

Evaluation

Our evaluation consists of investigating our system under real-time traffic using Grafana and Kiali. We use JMeter to send consistent traffic to the system and monitor the changes on Grafana and Kiali.

We monitor per-service latency on Grafana and the real-time traffic flow on the Kiali graph.

To evaluate the system, we use JMeter (or any other load-testing tool) to send traffic to it, then go to the URLs above to access the dashboards. We evaluate the system in two scenarios:

  • Consistent traffic: We send 50 requests per minute for 5 minutes to monitor all services under light but consistent traffic.
  • High load: We send 250 requests in one second to try to break the system. The data retrieval service breaks after a certain point, mainly due to its DarkSky API calls.

We also evaluate our system on Kiali by sending consistent traffic and visualizing the flow of traffic. We observe an unknown PassthroughCluster and an unknown node in the system.

(Kiali graph screenshots; open these images in a new tab for better visibility.)

Conclusions and Outcomes

  • Microservice load testing: We conclude that the data retrieval and model execution services are the slowest services in our architecture. The reason could be that both of these services connect to the DarkSky API to fetch the data the system needs. Meanwhile, the fastest service in our system is the postprocessing service.

  • Architecture observations: Visualizing the architecture in Kiali gives us a clearer picture of how the services connect with each other. We make some peculiar observations in the graph created by Kiali. Every service seems to be sending data to a common node (the PassthroughCluster). Looking at the Kiali documentation, we suspect it could be our PubSub message handler. We also see traffic going out of an unknown node; according to the Kiali documentation, this could be our Kubernetes liveness probe.

In conclusion, we can see the clear benefits of adding a Service Mesh layer on top of an existing distributed system architecture on Kubernetes. Not only does it allow better real-time monitoring of traffic and the system architecture, it also allows more control over traffic routing using virtual services and destination/route rules.

Istio has made us question our understanding of our own system's architecture. While monitoring traffic, we see that the flow between services is not as sequential as we thought it to be. Our previous understanding was that the data retrieval service sends data to the model execution service, which in turn sends data to the postprocessor service. After visualizing it via Kiali, however, we can see that all the services send requests to a central point (the PassthroughCluster), which we suspect to be either PubSub or telemetry. We also did not realize that the Kubernetes liveness probe would maintain a constant connection with all the services, but the unknown node, which we suspect to be the liveness probe, does exactly that.

While we were able to investigate many such things about our system, there are several we did not manage to get to yet. We would have liked to visualize the PubSub service in our system as well as the external API calls. We would also have liked to investigate the unknown nodes in our system to confirm what they really are. As we delve deeper into the features that the Istio service mesh provides, we are left with more questions about our architecture. But we can say with certainty that any distributed system, such as Apache Airavata, would greatly benefit from using a Service Mesh.

Team Member Contributions

Bobby Rathore

Dhruv Yadwadkar

Contributions

  • Integrated Istio into the existing system.
  • Set up the Grafana monitoring tool.
  • Set up the Kiali observability tool.
  • Set up the Istio gateway and virtual services.
  • Debugged istio-proxy and Grafana/Kiali failures on the system.
  • Tested load on individual services using JMeter and Grafana.
  • Experimented with open-source Istio vs. GKE's Istio offering.
  • Integrated Istio mTLS for the service mesh.

Commits

Issues

Yashvardhan Jain

Contributions

  • Integrated Istio into the existing system.
  • Set up the Grafana monitoring tool.
  • Set up the Kiali observability tool.
  • Set up the Istio gateway and virtual services.
  • Experimented with making Kiali and Grafana available over a public IP.
  • Debugged istio-proxy and Grafana/Kiali failures on the system.
  • Tested load on individual services using JMeter and Grafana.
  • Experimented with open-source Istio vs. GKE's Istio offering.

Commits

Issues