How Microsoft migrated an on-prem monolith to a microservice SaaS-based solution



>>Hello, everyone. Welcome to today's "IT Showcase" webinar on the topic of how Microsoft migrated an on-prem monolith to a microservice SaaS-based solution. My name is Thambu Zaemenock Kamalabai. I'll be your host for today's session. I'm a Principal Software Engineering Manager at Microsoft. I've been with Microsoft for the past 11 years. During my career at Microsoft, I have led and worked on various data projects. During the past year, I have been working on the monolith to microservice SaaS-based solution conversion. I am pleased to be here with Seifu and Sakthi, from the Core Services Engineering and Operations, or CSEO, organization, on this subject. Why don't you folks take a minute and introduce yourselves.>>Hello, everyone. My name is Seifu Feyssa. I'm a Senior Software Engineer. I have been with Microsoft for the last three years, and I have over 15 years of experience developing applications. Over the last year, I was involved in migrating the monolith application to microservices.>>Hello, everyone. I am Sakthivel Dhandapani. I'm a Software Engineer. I have been with Microsoft for the past 1.5 years now, and I also have 10-plus years of experience in software development. In the last year, I have been working with Seifu and Thambu on migrating the monolith application to microservices. And I am excited to talk to you all.>>Thank you both. Before we get started with the presentation, I want to let the audience know that if there are any questions, you can submit them through the Q&A window. I'll be on the lookout and read them out loud for us to answer. All right. Let's get started with our presentation.>>Let me walk you through today's agenda. We will start by explaining our current application — what is MSRA? — with a demo of the application, the current monolithic architecture, and the current pain points. By doing this, we will set the context for the problem we have at hand. Once we do that, we'll proceed with the approach to split the monolith, the high-level microservice design we did, the learnings and challenges the team overcame, monitoring and alerting, and the deployment model we have in place. Sakthi, can you please walk us through: what is MSRA? What is the current application?>>Sure. So, MSRA stands for Microsoft Reporting Analytics tool. This is an internal Microsoft tool used by Microsoft business users to generate business-intelligence reports. And this is a metadata-driven tool, meaning the data owners who want to generate reports on their data define that metadata. Metadata is nothing but: what is the business logic to generate the report, and what are the dimensions and facts? And they can also set the security for the data. The security is at different levels — the row and column level. Just to add a little more detail about row- and column-level security: row-level security means — let's say I am reporting data at the geography level, and I want to restrict certain geography data to certain users. That can be done with row-level security. And column-level means — imagine I have geography and time dimensions, with revenue, forecast, and budget as facts. I want to restrict the revenue report to be generated only by the revenue-reporting team, that set of users.
Then I would go for the column-level security definition.>>So, on a high level, if a team has a database, and if they want to report on that, and if they want row-level and column-level security, your team is able to give them that.>>Yes.>>Okay.>>And, moving to the capabilities, this is mainly ad hoc reporting with different options — run, queue, and schedule. Why do we need different options? Because with run, you keep the session open while the entire execution happens. So, the run option is good for reports we expect to finish in under five minutes.>>Okay.>>I would go for the queue option whenever the report takes more than 5 or 10 minutes. I don't want to keep the session open, so I queue the report, and that gives me the ability to queue it and close the session. Later I can go and retrieve the report, render the data in Excel, and see the data.>>Sounds good.>>Another option is schedule. The schedule option is for when I want to automatically generate a report at the start of the month or end of the month — let's say I'm reporting revenue data for the fiscal-month end or fiscal-year end. I would go for the schedule option and create a schedule: at fiscal-year end, generate this report. And there is also another option that is event-based. Let's say the data is being refreshed, and I want to generate a new report whenever a refresh has happened. Then I create an event-based schedule. And then there is the support of fabric types — we are currently supporting two types of fabrics. What that means is: if it is SQL data, we generate a SQL query; if it is cube data, we generate an MDX query.>>Cool.>>Yeah.>>Can you please walk us through the customers you have?>>Yeah. What you are seeing in the presentation is different LOBs — LOB meaning "line of business." The lines of business are nothing but the data owners who define the metadata in the metadata server.>>These are internal customers?>>Yes, exactly. These are Microsoft internal customers who are generating business-intelligence reports, like I said earlier. Approximately 60k users are currently using this tool. And this is the split-up by line of business, here.>>Sounds good. So, it's an internal, Excel-based add-in tool which will be able to report on the data if you have data in a database.>>Correct.
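The row- and column-level security model Sakthi describes can be pictured as predicates applied while the report query is built. A minimal C# sketch, where all the type, table, and rule names are hypothetical rather than MSRA's actual metadata model:

```csharp
// Hypothetical illustration of row- and column-level security applied
// during query generation. Types, tables, and rules are invented.
using System;
using System.Collections.Generic;
using System.Linq;

record SecurityGrant(string User, string[] AllowedGeographies, string[] AllowedFacts);

static class SecureQueryBuilder
{
    public static string Build(SecurityGrant grant, IEnumerable<string> requestedFacts)
    {
        // Column-level security: drop any fact (e.g. Revenue) the user is not granted.
        var facts = requestedFacts.Where(grant.AllowedFacts.Contains).ToArray();
        if (facts.Length == 0)
            throw new UnauthorizedAccessException("No permitted facts requested.");

        // Row-level security: restrict rows to the user's geographies via a predicate.
        var geos = string.Join(", ", grant.AllowedGeographies.Select(g => $"'{g}'"));
        return $"SELECT Geography, {string.Join(", ", facts)} " +
               $"FROM FactRevenue WHERE Geography IN ({geos})";
    }
}
```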
>>Cool. Let me give a brief demo of this application. So, what you are seeing right now on the screen is our Excel-based add-in. Once the user installs the tool and clicks, he will see an MSRA tab. The user can go and click on the "New" button. Once he clicks on the "New" button, he will be shown a black ribbon. He clicks on "Subject." The user will be able to see all the applications he has access to. Then he can select a perspective. Perspectives can be thought of as a set of fact tables and the dimensions associated with them. I am going to select the "End Customer" perspective and click "Okay." Once you do that, the default report for that subject comes up on the pivot canvas. I want to select one of the dimensions. I am going to select, under "Product," "SAP account," and click "Okay." So, what I did right now is say I want to see the end-customer purchase history by SAP account. Behind the scenes, when I click on the "Run" button, the selected attributes for this report are sent to the middle tier of our existing monolithic application. It is going to generate the query, the report is going to get run on the corresponding database, and the data is going to get reported. I'm going to click on "Run." What you're seeing over here is that the query is generated, it's executed, and the data is being retrieved. You will see the result down in Excel any moment now. So, what you see is the high-level workflow of our existing monolithic application. I also want to show you that there is a feature for the user to see the actual query. Under "Options," you can click on "View Query," and then you will be able to see the actual query which the tool generates and sends to the reporting server. This is the SQL query the tool generated. Now that you have seen our existing monolithic application, Sakthi, can you please walk us through the current architecture of our monolith application?>>Yeah. This is the current monolith architecture, what you are seeing in the presentation. And now that Thambu showed how users are using this tool, let's see what is happening behind the scenes. The user goes to Excel first, and it connects to the application server, where security is defined for the user. It passes the user and brings in all the applications which are applicable for that user.>>So, the app-info server stores the —>>Security.>>…and securities.>>The security of the applications.>>Okay.>>When the user requests a report, the request takes all the field-related information requested by the user and goes to the metadata server. Here in this monolith architecture, the metadata server is the core engine, where we have all the objects necessary for, you know, solving that request.>>So, it's a bunch of SQL stored procedures there which is going to take that in.>>Exactly. In addition to SQL stored procedures, we have SQL jobs to drive the stored procedures, and they also handle communication between the metadata server and the reporting server.>>Sounds good.>>Okay. So, once we get the field-related information to the metadata server, it generates the query according to the fabric type, as we discussed in the previous slides. It generates a SQL query for SQL data marts and an MDX query for cube data marts. And once the query is generated, it looks for the best available server. This is for load-balancing purposes.>>So, some of the customers have multiple reporting-server configurations.>>Correct. This is to give better performance for the users generating reports.>>Okay.>>So, it looks for the best available server, and once an available server is ready, it submits the query through the service broker to the reporting servers.>>So, the service broker is used to communicate from your middle tier to the reporting servers.>>Exactly.>>Okay.>>And while the query is getting executed on the reporting server, the metadata server keeps polling for the status of the query through the service broker. Once the query execution is completed, the metadata server marks the telemetry record in the metadata server — which is a single table. For that record, it marks "Query is completed" and provides to Excel all the information needed to retrieve the data. Excel frames the connection string for the reporting server and the results table, retrieves the data into Excel, and renders it.>>Sounds good. So, if it gets an MDX query, you have the same workflow, but the query goes to the MDX database.>>Exactly.>>That sounds good. Thank you, Sakthi.
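To make the monolith's request lifecycle concrete: the client submits a query, the metadata server records its status in the single telemetry table, and the client polls until it can connect to the results table. A simplified C# sketch of that poll-then-retrieve flow, with assumed table and column names:

```csharp
// Simplified sketch of the monolith's poll-then-retrieve flow.
// The QueryTelemetry table and its columns are assumptions for illustration.
using System;
using System.Data.SqlClient;
using System.Threading;

static class MonolithClient
{
    public static string WaitForResultsTable(string metadataConnString, Guid requestId)
    {
        using var conn = new SqlConnection(metadataConnString);
        conn.Open();
        using var cmd = new SqlCommand(
            "SELECT Status, ResultsTable FROM QueryTelemetry WHERE RequestId = @id", conn);
        cmd.Parameters.AddWithValue("@id", requestId);

        while (true)
        {
            using (var reader = cmd.ExecuteReader())
            {
                if (reader.Read() && reader.GetString(0) == "Completed")
                    return reader.GetString(1); // Excel frames a connection string to this table
            }
            Thread.Sleep(2000); // poll the status again
        }
    }
}
```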
Now that you have seen our current application and a small demo, as well as the high-level architecture of our current monolith, I would like to walk you through the current pain points we have. The onboarding process for our current application takes a month or more because it is an on-prem application. The customer has to start with acquiring the server; then we have to ship our MSL metadata database to them; then they have to do the configuration. Everything takes a long time — the onboarding process takes a month. Another pain point is our release process. Whenever we have a release of our application, MSRA, to the customer environment, the customer may have another release lined up, and they may be a bit hesitant to take our release. So, aligning our release with the LOB's or customer's release is a challenge for us. On-prem server patching: this being an on-prem server, we have to get the patching done, and for security, we cannot skimp on that. That is a challenge. Then there are troubleshooting and monitoring challenges as well: if there is an issue, we have to get access to the customer's environment to do the troubleshooting. And there is cost and infrastructure maintenance, and scale-up with no scale-out: this being a monolithic application, we have to scale up; we cannot scale out. And when we scale up, the whole monolith gets scaled up. These are the pain points we are facing at this point in time. Now that we know the problem statement — we have a monolith, a database-based application with heavy processing logic in SQL stored procedures — our approach to splitting this monolith at Microsoft started with two steps. First, we migrated our on-prem servers to IaaS servers in Azure. That, we did a couple of years before, and it saved us a lot of time: in Azure, with a couple of clicks, we get a VM in hand, so hardware acquisition became much easier once we moved to an IaaS model. That was a lift-and-shift approach to begin with. Once we did that, we wanted to create a pure PaaS-based solution, a pure cloud solution. For that, we started by splitting the current business logic in our database into a bunch of feature sets. Doing this split really helped us understand what distinct features our application has, and it led us to evolve the microservices in the right manner. Once we did the split — we never wanted to do a big-bang approach on this; we wanted to grow step by step. So, we wanted to identify the minimum viable product we had, and we came up with query generation, security, telemetry, query execution, and load balancing as the MVP. Though we had this as the MVP, we still didn't want to take all of it together to production. We wanted to figure out the smallest unit we could carve out of it, and we came up with query generation, security, and telemetry. Once we identified that, within three or four months we could ship the first microservice — query generation with security and telemetry — into production, and we could decouple our monolith application to use this microservice for query generation, to prove that our approach was working well. Now that we understood this would be a microservice-based application, we had to do a technology assessment. Seifu, can you please walk us through how you did the technology assessment?>>Sure.
So, in the technology assessment, we evaluated microservices-management platforms with the goal of choosing the best platform for our project. Some of the factors we considered in choosing the platform include the ease of development, packaging, deployment, and management of microservices. We finally selected Service Fabric as our platform of choice. Service Fabric is a microservices-management platform from Azure. It is a platform-as-a-service, or PaaS, offering. Service Fabric solves, out of the box, some of the challenging problems that we would otherwise have to solve ourselves. I'll walk you through some of the features of Service Fabric and how we leveraged them in our project.>>Before going there, Seifu, can you please explain what a stateful versus a stateless service is in Service Fabric?>>Yes. A stateful service is a service that maintains state within the service itself, whereas a stateless service is a service that doesn't store the data within itself — it uses external data stores to store its state.>>So, in the case of a stateful service, if I have a small amount of data which I want to store, then instead of storing it in a SQL server, I will be able to store it in Service Fabric itself?>>Exactly. Service Fabric provides a very good stateful-service management model. States are reliable in Service Fabric, and they're also highly available. I will come to how those specific features are made available. So, one of the features available in Service Fabric is high availability of services. High availability is achieved in two different ways. For stateless services, Service Fabric requires you to define a number of instances for each service — the number of instances that run in the cluster. If one of the instances goes down, Service Fabric will automatically create that instance on an eligible node, and your service will be available all the time.>>So, if you have a stateless service and you deployed it with, say, three instances, if one of the instances goes down, Service Fabric manages getting that instance onto a different node. But at the same time, if a request comes, it routes to an active instance at that point.>>Yes. If the service is moved to another node, Service Fabric will automatically retry the request against another instance that's available.>>Sounds good.>>The other scenario is for stateful services. Service Fabric follows a partitioning model: the state is partitioned across the available nodes, and each partition has a primary replica and secondary replicas. All transactions happen on the primary replica, and they are replicated to the secondary replicas for reliability. So if the primary replica goes down, one of the secondary replicas is promoted to primary, it assumes the role of the primary replica, and your transactions go through it.>>That is for stateful services?>>For stateful services.>>Sounds good.
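As a concrete sketch of the stateful model Seifu describes: a Service Fabric stateful service keeps its state in reliable collections, and every write is replicated to the secondary replicas before the transaction commits. This is a minimal example, not code from MSRA:

```csharp
// Minimal stateful Service Fabric service: a counter kept in a reliable
// dictionary instead of an external SQL database.
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class CounterService : StatefulService
{
    public CounterService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        var counts = await StateManager
            .GetOrAddAsync<IReliableDictionary<string, long>>("counts");

        while (!cancellationToken.IsCancellationRequested)
        {
            using (var tx = StateManager.CreateTransaction())
            {
                // The write happens on the primary replica and is replicated
                // to the secondaries before CommitAsync completes.
                await counts.AddOrUpdateAsync(tx, "requests", 1, (key, old) => old + 1);
                await tx.CommitAsync();
            }
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}
```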
>>So, the second feature is scalability. Scalability is achieved in different ways. For stateless services, we can add or remove instances of the service; we can add a node into the cluster, and Service Fabric will automatically distribute the load across the available nodes; and we can also add more application instances into the cluster and scale that way, as well. We can also scale stateful services by adding more resources into the cluster. When you add an additional node into the cluster, Service Fabric will move some of the partitions to appropriate nodes where they can be served better.>>Sounds good.>>Patching is another service that is available out of the box in Service Fabric. Once we enable patching, it will automatically apply operating-system patches to the underlying virtual machines when they become available.>>So, behind the scenes, Service Fabric has nodes, and this patching service takes the responsibility of patching each of the nodes — not manually.>>Exactly. That happens out of the box.>>Okay.>>Removal of external dependencies is another feature that we used from Service Fabric. By this we mean: when we migrated the monolithic application to microservices, there were a lot of moving parts. We had to connect to external data sources to retrieve data and to write data back. We can remove those dependencies by containing some of the transactions and data inside the Service Fabric cluster itself. In our case, we implemented a number of stateful services to keep the state of the application within the cluster itself.>>So, traditionally, if you wanted to communicate across services, you would send a message to a service-bus queue, and the other service would keep monitoring that. In the case of these microservices, you need not go to a service bus — you can use a reliable queue within the Service Fabric architecture.>>Exactly. We implemented such scenarios in our application. We used reliable queues and reliable dictionaries to store data within the Service Fabric cluster. Another feature is fast deployment time. We create the cluster ahead of time, and then we keep deploying applications into the existing cluster, and the deployment time is very fast. Service Fabric also provides zero-downtime deployment. It uses upgrade domains: the nodes are grouped into domains, and the deployment goes domain by domain. It checks the health — if the deployment is successful on a given domain, it moves to the next domain; otherwise, it rolls back, and your old application stays intact.>>That's good.>>And the distributed application model is another feature that's available, and this is one of the distinguishing features of Service Fabric. It provides you a way to write reliable stateful and stateless services. It also integrates with ASP.NET Core, enabling you to write Web applications and APIs.>>Sounds good.>>Service Fabric is also used in core services in Azure itself. It's well-tested, and it's also used in many mission-critical enterprise applications.>>That's good to know. So it's a tested and proven platform from Microsoft, and application developers can now create applications on the same platform Microsoft is using for Cosmos DB and other services.>>Exactly.>>Cool. Thank you, Seifu. Now that we had decided on our platform, Service Fabric, we started designing our microservices. What you're seeing right now is the architecture, the microservices design we have in place. This particular design didn't come on the first day or in the first sprint — it evolved over a period of time. It was around the end of the second or third month that we had this architecture in place.
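A sketch of the reliable-queue pattern Seifu mentions — replacing an external service-bus queue with a queue held inside a stateful service. The service and queue names here are illustrative; the query-manager service described next uses this idea:

```csharp
// Illustrative stateful service that accepts work via a reliable queue,
// in place of an external service-bus queue.
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class WorkQueueService : StatefulService
{
    public WorkQueueService(StatefulServiceContext context) : base(context) { }

    public async Task EnqueueAsync(string queryPacketId)
    {
        var queue = await StateManager
            .GetOrAddAsync<IReliableQueue<string>>("pendingQueries");
        using var tx = StateManager.CreateTransaction();
        await queue.EnqueueAsync(tx, queryPacketId);
        await tx.CommitAsync(); // replicated, so the item survives a failover
    }

    protected override async Task RunAsync(CancellationToken token)
    {
        var queue = await StateManager
            .GetOrAddAsync<IReliableQueue<string>>("pendingQueries");
        while (!token.IsCancellationRequested)
        {
            using (var tx = StateManager.CreateTransaction())
            {
                var item = await queue.TryDequeueAsync(tx);
                if (item.HasValue)
                {
                    // Hand the query off for execution here.
                    await tx.CommitAsync();
                }
            }
            await Task.Delay(200, token);
        }
    }
}
```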
So I will just walk you through how we ended up splitting our monolith into the different microservices. As we have seen, when a request comes from the Excel-based thick client, it starts with our MSRA API microservice, which is our gateway, or routing, or orchestrator service. This is the service through which any request goes to any of the microservices. So the request comes here, and then we route it to the application service to validate whether that particular user or request has sufficient privileges for that tenant. Once we do that, we resolve the security by calling the security microservice. This microservice is responsible for resolving the security for that particular user. Once the security is resolved, we move on to the metadata service. The metadata service generates the query. As we have seen, the user selects an ad hoc report with a set of attributes, and those fields are sent to the metadata service; depending on the fabric this client resolves to, we generate the query for that. Once the query is generated, it gets routed to the query-manager service. The query-manager service has a reliable queue in it, into which the request is put. The query manager has a couple of roles to play. The query manager wants to find out: which is the best server I can route this query to? For that, the report-stats-collector service keeps monitoring all the available reporting servers and collecting their stats in the database here. The load balancer just looks at that data at that snapshot point in time and gives the best available server to the query-manager service. Once the query-manager service gets that information, it calls the execution service to actually execute the query on the reporting server. All the state of this request is captured in the query-telemetry service, and once the request is done, the cleanup service moves that request from the query-telemetry service on to Cosmos DB for analysis and telemetry purposes. We also have a user-profile service which stores the profile, or the preferences, of the end user. So this is the architecture we came up with by decomposing what was a SQL stored-procedure-based application into a scalable microservices architecture. Now that you have seen the high-level microservice architecture, I would like to walk you through the learnings and the challenges we had during this time. So, Seifu, can you please walk us through the query-generation latency problem?>>Sure. One of the challenges we faced was latency in query generation. I'll first walk you through the process of query generation, the challenge we faced, and how we solved it. When a user clicks on the "Run" button in the Excel client, a request is generated and submitted to our API. This request contains important information that the query generator uses to generate a query: "selected attributes" are the fields that the user has selected, and attribute filters are filters that are applied to some of the selected attributes. Based on this information, the query generator will generate a query statement that answers the user's query. It depends on the fabric selected; if the fabric selected is SQL, a SQL statement is generated.>>So, this is the input request generated from Excel and sent to the API.>>Exactly. This comes through the API, and the query generator uses that information to generate the query statement.
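A toy version of that fabric-dependent generation step: the same request producing SQL for a SQL data mart or (much simplified) MDX for a cube. All names and shapes here are invented for illustration; the real generator is metadata-driven:

```csharp
// Toy fabric-dependent query generator. It only illustrates the branching
// by fabric type described above.
using System;

enum FabricType { Sql, Cube }

record ReportRequest(string[] SelectedAttributes, string Fact, FabricType Fabric);

static class QueryGenerator
{
    public static string Generate(ReportRequest r) => r.Fabric switch
    {
        FabricType.Sql =>
            $"SELECT {string.Join(", ", r.SelectedAttributes)}, SUM({r.Fact}) AS {r.Fact} " +
            $"FROM FactTable GROUP BY {string.Join(", ", r.SelectedAttributes)}",
        FabricType.Cube =>
            // Heavily simplified MDX; a real statement would cross-join the attributes.
            $"SELECT NON EMPTY {{ [Measures].[{r.Fact}] }} ON COLUMNS FROM [Model]",
        _ => throw new ArgumentOutOfRangeException(nameof(r.Fabric))
    };
}
```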
This is for the SQL fabric, and it can be different depending on the fabric that is configured.>>Okay. So, this is a sample query we generated based on that.>>Exactly.>>Okay. So, what's the problem here?>>So, once the request is submitted from Excel, it comes through our MSRA API, and then we call the application service to get the application information, then we call the security service to apply security to the request, and then we call the metadata service to generate the final query. So, if you see, this was one monolithic application, and now it's broken into three services, and each of these services is attached to an external data source, which is an S0 (10 DTU) SQL Azure database.>>So, what changed here? In the old system, we used a high-end SQL server configuration, and data and compute were located on the same server. Now we moved to SQL Azure with a lower configuration, and each service has to go to the external data source to retrieve the data, so there is query execution and data movement involved. This was our initial implementation, and with this, it took around 10 seconds to generate a query. In the old system, it was around 800 milliseconds. So, what did we do to solve this? We implemented a caching layer: we moved the data from the external data store into an in-memory cache for each of the three services, so the process and the data are in the same process. After implementing this, we were able to reduce the time from around 10 seconds to 200 milliseconds, which is a great achievement.>>Good. So, I've seen that you have moved the data from the external store into the caching layer. But the application service, being a stateless service, may have multiple instances. How do you actually refresh the cache on each of the instances?>>Yes. Once we have the cache in memory, refreshing the cache is a challenge. With Service Fabric, for each stateless service, Service Fabric maintains an instance ID, so each service instance is uniquely identified, and Service Fabric provides a way to get the list of instances that are running. So, we were able to enumerate each of the instances and submit a refresh command to refresh the data in each of them. This is not the normal scenario — in the normal scenario, you just submit your request, and Service Fabric routes it to the best available instance. But in this case, we have to go through each of the instances and refresh the cache.>>Okay. So what other approaches did you consider before solving it this way?>>We thought of upgrading the SQL database to a higher configuration. That would save some of the time, but the data movement would still be there. So, we chose to go with caching instead of upgrading —>>The SQL.>>Yeah, the SQL.>>Thank you.
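A sketch of that refresh pattern: FabricClient can enumerate every partition and instance of a stateless service so each one can be told to reload its cache. The service name and the refresh command below are assumptions:

```csharp
// Sketch: enumerate all instances of a stateless service and send each one
// a cache-refresh command. Service name and endpoint are hypothetical.
using System;
using System.Fabric;
using System.Fabric.Query;
using System.Threading.Tasks;

static class CacheRefresher
{
    public static async Task RefreshAllInstancesAsync()
    {
        var fabric = new FabricClient();
        var serviceUri = new Uri("fabric:/MSRA/ApplicationService"); // hypothetical

        ServicePartitionList partitions =
            await fabric.QueryManager.GetPartitionListAsync(serviceUri);
        foreach (Partition partition in partitions)
        {
            // For a stateless service, each "replica" is one running instance.
            ServiceReplicaList instances = await fabric.QueryManager
                .GetReplicaListAsync(partition.PartitionInformation.Id);
            foreach (Replica instance in instances)
            {
                // ReplicaAddress holds the instance's endpoint JSON; a real
                // implementation would parse it and POST a refresh command.
                Console.WriteLine($"Refreshing cache at {instance.ReplicaAddress}");
            }
        }
    }
}
```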
So, next we had the query-state-management issue. Can you please walk us through that?>>This is one of the problems we wanted to solve: providing the request status at any point in time. As you see here, the request starts from Excel, and it goes to the MSRA API. Like you mentioned, Thambu, and Seifu also mentioned, the MSRA API is the orchestrator and also the entry point for any request. Once it reaches the MSRA API, it goes to the application service and security service to validate the security, then it moves to the metadata service for query generation, and then to the query-manager service — I mean, it is put in the queue for the query to be executed and for the best reporting server to be found, the same as in the monolith architecture, except now we have independent services catering to that. And it eventually moves to the execution service, and the execution service executes it. But as you see, in this process I don't have the ability to see what the status of this request is. So that is the problem we wanted to solve, and for that, we need to store the requests and keep the telemetry of requests in a data store. So, we introduced Cosmos DB, and to create the telemetry document, we introduced the query-telemetry service. Now, whenever a request enters the MSRA API, the MSRA API creates a telemetry document for each request.>>Okay.>>Which means it calls the query-telemetry service, and the query-telemetry service makes a call to Cosmos DB and creates a telemetry document for the request. And once the document is created in Cosmos DB, whenever the request moves through the application service, security service, and metadata service, the data gets updated in the telemetry document in Cosmos DB.>>Okay.>>Now, let's see what the different states are that we maintain. If you see, first there is a query packet ID. That is nothing but the…>>Correlation ID.>>…the key for getting the request. By passing the key, I can get the status of each request. The status is a collection of states the request moves through — I'll talk about the major statuses: 5, 10, 15, 30, 35, and 40. The 10 and 15 are for the metadata service — how much time we took for generating that query. And for 30 and 35: 30 is the start of the execution, 35 is the end of the execution — how much time it took to execute the query and create the results table.>>So, basically, each of the microservices logs this request, so at any point in time you know where that request is.>>Exactly. I pass the key, I get the request, and I can say, "Okay, this is where the request is." And if you see 40, the 40 status is retrieval complete.>>What happens if one of the services fails? Will it stop at that point and show the corresponding error?>>It stops at the failure point, and if I pass the same key, it brings in the request's state, and I can see what the error message is and where exactly it failed. That helps us debug the problem by checking the request state. And, again, 40 is the retrieval. Like we talked about for Excel, this is mainly submitting the request, getting it executed, and retrieving it later. Once the results table is completed, that is where the data is. The document also has the execution progress and the execution info — which reporting server it got executed on and what the results table is. That is what Excel uses to frame the connection string to the data. And the execution progress is the end-to-end execution time for the request.>>So, I also see that a status for each of the microservices is captured. That helps in debugging whether any of the services is taking more time.
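Put together, the telemetry document described above might look roughly like this; the property names are inferred from the discussion, not the actual MSRA schema:

```csharp
// Rough shape of the per-request telemetry document kept in Cosmos DB.
// Property names are assumptions based on the discussion.
using System;
using System.Collections.Generic;

// e.g. 10/15 = query generation start/end, 30/35 = execution start/end,
// 40 = retrieval complete
record StatusEntry(int Code, DateTime TimestampUtc);

record QueryTelemetryDocument(
    Guid QueryPacketId,            // correlation ID across all microservices
    List<StatusEntry> StatusHistory,
    string ExecutingServer,        // which reporting server ran the query
    string ResultsTable,           // table Excel connects to for the data
    TimeSpan ExecutionProgress,    // end-to-end execution time
    string Error);                 // populated when a service fails
```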
>>Yes. So, there are multiple benefits to introducing this document. One is, like I said, we can check the status of the request at any point in time. The next one is debugging, which we discussed just now: when there is an issue, we can pull the request and see where exactly it failed, and it also provides a detailed error message — why it failed. The other benefit is performance tuning. We keep storing this telemetry as historical data, so we can go and ask: since the services were deployed, how much time did requests take? Over a period, is performance going down? If it is going down, we address the problem — which service is taking a longer time? Those are the benefits I see.>>It seems like it is very good to store the state of the query throughout the lifecycle of the request. But it also seems like it came with some problems for you.>>Exactly. It also created another problem for us. Executing the request and retrieving the data quickly is the main problem we want to solve. At the same time, by introducing this query-telemetry service, which makes an external call to Cosmos DB to create and update the telemetry document — and with the application service, security service, and metadata service each making a call to the query-telemetry service, which in turn calls Cosmos DB, an external call — you know that is costly, and it increased the turnaround time for requests by a cumulative 500 milliseconds.>>Basically, the logging of the request took half a second for you.>>That's right.>>How did you end up solving this?>>So, now you know the query-telemetry service was making an external call. We wanted to eliminate that external call and keep the data within the query-telemetry service itself. Service Fabric gives the option of putting the data inside the service itself, if it is a stateful service. There are two options: one is the reliable queue that we discussed in the query-manager piece; the other one is the reliable dictionary. So, here, to avoid the external call, whenever a request comes to the MSRA API, the MSRA API creates an entry in the reliable dictionary — the same as what we were putting in Cosmos DB. Now the application service, security service, and metadata service each call the query-telemetry service, but it won't make an external call; it updates the data in the reliable dictionary. By that, we reduced the 500 milliseconds to 10 milliseconds. The turnaround time is now good.>>That's good. What does the cleanup service do here?>>Yes, exactly. Another problem is that the reliable dictionary keeps growing, so we need to clean it up to make sure it does not exceed the memory of the query-telemetry service. We introduced a cleanup service which runs independently, on a schedule, and periodically goes through the reliable dictionary of the query-telemetry service for whatever queries are fully completed — fully completed meaning, like I talked about, the 40 status: the data retrieval is completed, the request is done. Those records we move to Cosmos DB, which serves as history, and then we can also go and check the performance.>>Thank you.
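A sketch of that cleanup sweep: completed entries (status 40) are archived to Cosmos DB and removed from the reliable dictionary so it does not grow unbounded. The names are assumptions, the dictionary value is simplified to the latest status code, and the Cosmos DB write is elided:

```csharp
// Sketch of the cleanup service's periodic sweep over the query-telemetry
// service's reliable dictionary. Names assumed; Cosmos DB write elided.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

static class TelemetryCleanup
{
    public static async Task SweepAsync(
        IReliableStateManager stateManager, CancellationToken token)
    {
        // Value simplified here to just the latest status code per request.
        var telemetry = await stateManager
            .GetOrAddAsync<IReliableDictionary<Guid, int>>("queryTelemetry");

        var completed = new List<Guid>();
        using (var tx = stateManager.CreateTransaction())
        {
            var enumerable = await telemetry.CreateEnumerableAsync(tx);
            using var e = enumerable.GetAsyncEnumerator();
            while (await e.MoveNextAsync(token))
            {
                if (e.Current.Value == 40)      // 40 = retrieval complete
                    completed.Add(e.Current.Key);
            }
        }

        foreach (var key in completed)
        {
            using var tx = stateManager.CreateTransaction();
            // await cosmosContainer.UpsertItemAsync(...): archive to history first.
            await telemetry.TryRemoveAsync(tx, key);
            await tx.CommitAsync();
        }
    }
}
```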
So, now that we have seen the major learnings and challenges, I would like to move on to the cost of the design we have in place. In our project, we incur cost in three ways: the cost for the Service Fabric cluster itself, the cost for the SQL Azure DB, and the cost for Cosmos DB. In the case of the Service Fabric cluster, we make sure that we do high-density hosting of services, which means we monitor the CPU and the memory of the Service Fabric cluster, and we make sure that we enable auto-scaling for the cluster to scale up or scale down on condition. In Microsoft Azure, we have an auto-scaling feature wherein, if your CPU or memory exceeds a certain threshold, we can add a node; likewise, if utilization goes below a low level, we can take a node out. Auto-scaling helped here — Service Fabric behind the scenes does a good job of distributing the services based on the load at that point in time. This helped us get a very cost-efficient design on the Service Fabric side. Going to SQL Azure: as we have seen, whenever we have a problem getting better performance from SQL Azure, we try to see whether we can use stateful services in Service Fabric to cache, or keep, the data the service needs most in the service layer itself. This allowed us to keep a very standard S0 SQL Azure database. Coming to Cosmos DB — like the last learning challenge which Sakthivel explained — whenever we needed higher throughput, we could have provisioned more request units for Cosmos DB; instead, we tried to solve the problem using a stateful service, and only once a request is fully completed do we move that data to Cosmos DB. These are the three or four things we did in our application design to get the cost optimized. Now that we have seen this, Seifu, can you please walk us through the challenges we had in our monolith tenant onboarding — it was taking long at times — and how tenant onboarding works in this software-as-a-service model?>>Sure. Tenant onboarding was one of the most challenging tasks in the old system. We had to coordinate with external teams to deploy our application, we had to align our deployment with the customers', and it was a time-consuming process. With microservices, we automated everything possible to make it as simple as possible. Customer onboarding involves three steps. The first step is, we create an entry in the application database using the application service. Then we populate the reporting servers in our load-balancer database. And then we create the instance of the service in the Service Fabric cluster. When we create the instance of the service in the Service Fabric cluster, we create the resources that are needed by the service. We use ARM templates — Azure Resource Manager templates — to automate the creation of the databases, Cosmos DB, and the external resources that are required by the service.>>So, tenant onboarding is nothing but — everything is automated; you do not have any manual steps involved in it.>>Exactly. Service Fabric also simplified the onboarding process because, once we have the application type deployed in the cluster, the only thing we have to do is create an instance of the application from the already-deployed application type.>>Okay. Sounds good.
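That last step — creating the per-tenant application instance from the already-deployed application type — can be done programmatically through FabricClient. A sketch with hypothetical names and version:

```csharp
// Sketch: create a named, per-tenant application instance from an
// application type already registered in the cluster. Names hypothetical.
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

static class TenantOnboarding
{
    public static async Task CreateTenantAppAsync(string tenantName)
    {
        var fabric = new FabricClient();
        var app = new ApplicationDescription(
            new Uri($"fabric:/MSRA-{tenantName}"), // per-tenant application name
            "MSRAAppType",                         // application type in the cluster
            "1.0.0");                              // application type version
        await fabric.ApplicationManager.CreateApplicationAsync(app);
    }
}
```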
>>So, we have a question. I just want to read it out: "What are the underlying servers that execute the final query? Are they hosted in a PaaS resource?" So, the question here is: in our microservice model, we have the execution service actually executing the query.>>Yes.>>The final reporting server actually comes from the customer. They can have it in SQL Azure, which is a PaaS service, or, imagine they have an on-prem server — they can keep it there. Since our execution service is hosted in a VNet, we can open ACLs and execute the query in ADW, an on-prem SQL server, or a PaaS server. It does not matter to our infrastructure where that reporting server resides. So, I hope that answers the question. Let me move on to the deployment strategy. Having this microservices architecture helped us do ring-based deployment. We have our CI/CD pipeline configured, wherein, whenever a code check-in happens, our CI/CD pipeline runs, and a Ring0 deployment gets run. And what is the Ring0 app? It is an app we have for the team itself, the MSRA app, and the code gets deployed to the Ring0 app. Once that is done, we run a full test pass against that app, and if there are failures, we do the same process again. On successful completion, we go through an approval process and move to the Ring1 deployment. The Ring1 deployment is nothing but the apps which are configured as Ring1, which get constant releases as and when MSRA has a release. Once that is deployed, and based on the telemetry, when the system is stable, we push that to the Ring2 deployment. The best thing here is that the deployment no longer has to be aligned with our customers' releases. Since we have it as a Service Fabric SaaS-based application, we have better control, and we can deliver with good quality. Coming to monitoring — Seifu, can you please walk us through the monitoring and alerting you have in place?>>So, we use Application Insights for monitoring and alerting of our applications. Application Insights provides good integration with Service Fabric, and it provides some of the telemetry features out of the box, like how many requests are coming in and what the performance of each request coming into our microservices is. Some of the events thrown by the microservices are automatically pushed to Application Insights, and we can visualize them from the Application Insights dashboard. We have also extensively pushed our trace logs into Application Insights for debugging. Our trace logs include severity levels — severity levels 1, 2, and 3, 3 being the highest. And on top of those severity levels, we have alerts configured to alert us when something goes wrong, or when the number of exceptions thrown is more than a given threshold.>>Mm-hmm.>>When an alert fires, an IcM ticket is automatically created — IcM is our incident-management system — and the engineers are notified. Most of the time, they can look at the logs, understand what is going on, involve the stakeholders, and resolve the issue.>>Sounds good. So, we have the traces being captured in App Insights. Based on the severity, you have alerting configured, and the alerting fires if there are more than "X" number of errors happening repeatedly.>>Exactly.>>Cool.>>We also have a dashboard where we can monitor resource utilization in our Service Fabric cluster — the CPU utilization and the memory utilization are all displayed on the dashboard. We also have a dashboard that shows us, historically, how many exceptions were thrown from our applications. This helps us intervene in case the resource utilization is very high in the Service Fabric cluster.
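The trace-plus-severity pattern described here maps directly onto the Application Insights SDK. A minimal sketch (connection details omitted; the exact severity mapping is our assumption):

```csharp
// Minimal sketch of pushing traces with severity levels to Application
// Insights: the signals the alert rules described above are built on.
using System;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

static class ServiceTelemetry
{
    static readonly TelemetryClient Client =
        new TelemetryClient(TelemetryConfiguration.CreateDefault());

    public static void TraceFailure(string service, Exception ex)
    {
        // Severity feeds the alert rules (e.g. "more than X errors in Y minutes").
        Client.TrackTrace($"{service} failed: {ex.Message}", SeverityLevel.Error);
        Client.TrackException(ex);
    }
}
```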
We have auto-scaling configured, but in case we want to reduce the number of nodes — for example, to scale down — we can go do that through resource management in Service Fabric.>>Cool. So, I have a couple of questions which came through, so I would like to put those out. "How does Service Fabric compare with Azure Kubernetes Service?" Seifu, do you want to take this?>>Sure, yeah, I'll take it. So, Azure Kubernetes Service and Service Fabric are both microservices-management platforms, and both of them do orchestration of services. Kubernetes is mostly container-based orchestration of microservices. Service Fabric goes beyond that to provide a development platform — it is both a development platform as well as a management platform for microservices. It has very good integration with Visual Studio.>>That goes to the next question, I think: "What does the Service Fabric development experience with Visual Studio look like?">>So, you can debug your services in Visual Studio like you do with any application, but you need to install the local Service Fabric cluster. Then you will get a five-node or one-node Service Fabric cluster on your local machine. The code gets packaged into packages, the package is pushed into your cluster, and you can do debugging from your local machine. It also has tooling to deploy from PowerShell, and very good integration with Visual Studio debugging.>>That's good. Without actually deploying to Azure, we can do all the development and testing and debug everything on your laptop itself.>>Exactly.>>That is good.>>"How do we visualize data inside the stateful service?" Yeah, I think this is — you mentioned the stateful service. How do you actually visualize the data there?>>Yeah. Currently, Service Fabric does not provide the ability to visualize the data. The stateful service, like I mentioned, has an in-memory data store — either a reliable collection or a reliable dictionary. To mitigate that, we introduced an endpoint which makes a call to the service, reads the reliable dictionary, and provides the data back. That's how we visualize the data.>>Okay. "How does Service Fabric handle high availability and disaster recovery?" I think we talked about high availability, but we didn't cover disaster recovery. If you can brief us on both, it'll be great.>>Sure. So, we discussed how Service Fabric maintains high availability of services by creating another instance of a stateless service, and also by promoting secondary replicas to primary. But say, for example, a disaster happens and the region is down…>>The whole US West.>>The whole region is down. In that case, we need another mechanism for disaster recovery. Service Fabric is region-aware by itself, so you can target multiple regions in a single Service Fabric cluster. Some of the nodes can be in the West, some of them can be in the East.>>And Central. Okay.>>Based on that, when one region is down, the nodes in the other region will take care of the requests.>>Okay.>>Another scenario is, we can also create two Service Fabric clusters in two different regions and then use Traffic Manager to route traffic to one or the other. You can use Active/Active or Active/Passive. If it's Active/Passive, then one is always serving the requests, and when it is down, it will fail over to the other one.>>I think Active/Active is possible only if it's stateless.>>Stateless services, exactly.>>Cool.
I think I have another question: "How did you manage the transition of customers from the old model to the new microservices model?">>I think that's a good question. We introduced the concept of a feature flag, enabled at the customer level. Whenever we onboard a new feature, we enable the feature flag and route a certain set of customers to that feature. Once we look at the telemetry and we feel the feature is working well, we increase the user base. That is the transition model we are using. So, we have flags in place which control the routing, and that really helped us: in the case of failure, we can always fall back to the old model we have in place. Thank you. We are almost at the end of the hour. Before we go, we would like each of you to share your takeaway with our audience. Let's start with Seifu.>>Yes. With a microservices implementation, most of the time you are moving your monolith application into multiple services, and each of the services interacts with external data sources. There are a lot of moving parts, and this adds to your response time. Caching can help you contain the data within the process itself and improve your performance.>>Sakthivel?>>My takeaway is: the main problem we faced in the monolith architecture was scaling up and giving better performance to the customers. By moving to the Service Fabric architecture, we are able to add nodes, keep up the performance for the customers, and also ease the onboarding. So the takeaway I would give is: look for an architecture which helps you scale.>>My biggest takeaway is: when you're migrating a monolith to a microservice architecture, please don't do a big-bang approach. Always do incremental releases. Find the small components you can decompose into microservices, and hook your monolith onto the microservice you developed. My second takeaway is: once you have developed the microservices, you will see that teams have faster delivery with good quality. Great points. Thank you all. The on-demand version of this webinar will be posted soon at Microsoft.com/ITShowcase, where you can also find related content like case studies, blogs, and upcoming webinars. Thank you for joining us, and we hope to see you at future webinars.
