Agile Development Metrics
One of the coolest things I have learned in the last year is how to constantly deliver value into production without causing too much chaos.
In this post, I’ll explain the metrics-driven development approach and how it helped me to achieve that. By the end of the post, you’ll be able to answer the following questions:
Metrics give you the ability to collect information on an actively running system without changing its code.
It allows you to gain valuable data on the behavior of your application while it runs, so you can make decisions based on real customer feedback and usage in production.
These are the most common metrics used today:
In this example, a counter metric is used to calculate the rate of events over time, by counting events per second
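To make the counter example concrete, here is a minimal sketch of how a per-second rate could be derived from two readings of a counter. The `CounterSample` type, the `ratePerSecond` helper, and the sample values are assumptions made for illustration; real monitoring systems compute this for you in their query language.

```typescript
// One reading of a monotonically increasing counter (hypothetical shape).
interface CounterSample {
  timestampSec: number; // when the value was read
  value: number;        // total events counted so far
}

// Per-second rate between two readings. If the counter went backwards
// (for example after a process restart), the newer value is treated as
// the increase since the restart.
function ratePerSecond(older: CounterSample, newer: CounterSample): number {
  const elapsedSec = newer.timestampSec - older.timestampSec;
  if (elapsedSec <= 0) {
    throw new Error("samples must be in increasing time order");
  }
  const increase =
    newer.value >= older.value ? newer.value - older.value : newer.value;
  return increase / elapsedSec;
}

// Made-up readings taken every 10 seconds.
const counterSamples: CounterSample[] = [
  { timestampSec: 0, value: 0 },
  { timestampSec: 10, value: 1500 },
  { timestampSec: 20, value: 3500 },
];

// Rate for each pair of consecutive readings.
const eventRates = counterSamples
  .slice(1)
  .map((sample, i) => ratePerSecond(counterSamples[i], sample));

console.log(eventRates); // [ 150, 200 ]
```

The counter itself only ever goes up; the "events per second" graph is a derived view over it.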
In this example, a gauge metric is used to monitor the in percentages
In this example, a histogram metric is used to calculate the 75th and 90th percentiles of an HTTP request duration.
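As a rough illustration of what a percentile over request durations means, the sketch below uses the nearest-rank method on raw observations. This is a simplification: monitoring systems usually estimate percentiles from histogram buckets rather than from every raw value. The `percentile` helper and the duration values are assumptions for the example.

```typescript
// Nearest-rank percentile: the smallest observation such that at least
// p percent of all observations are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("no observations");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in (0, 100]");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up HTTP request durations in milliseconds.
const requestDurationsMs = [120, 80, 95, 300, 110, 105, 90, 450, 100, 130];

const p75 = percentile(requestDurationsMs, 75);
const p90 = percentile(requestDurationsMs, 90);
console.log(p75, p90); // 130 300
```

Note how two slow requests barely move the 75th percentile but dominate the 90th, which is why higher percentiles are the ones usually watched for latency.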
The finer details of the three metric types (counter, histogram, and gauge) can be quite confusing. Try reading about it further.
Most monitoring systems consist of a few parts:
Time-series database: database software optimized for storing and serving time-series data. Two examples of this kind of database are and .
Querying engine (with a querying language): Two examples of common query engines are: and
Alerting system: the mechanism that lets you configure alerts based on graphs created by the querying language. The system can send these alerts to mail, Slack, or PagerDuty. Two examples of common alerting systems are: and .
UI: lets you view the graphs generated by the incoming data and configure queries and alerts. Two examples of common UI systems are: and
The setup we are using today in is
— used as a StatsD server.
— used as our scraping engine, time-series database, and querying engine.
— used for Alerting, and UI
And the constraints we had in mind while choosing this stack were:
Let’s develop a new pipeline service that calculates sentiments based on textual inputs and does it in a Metrics Driven Development way!
Let’s say I need to develop this pipeline service:
And this is my usual development process:
So I write the following implementation:
    // Consume tweets from Kafka, score their sentiment, then store and publish the result.
    let senService: SentimentAnalysisService = new SentimentAnalysisService();
    while (true) {
        let tweetInformation = kafkaConsumer.consume()
        let deserializedTweet: { msg: string } = deSerialize(tweetInformation)
        let sentimentResult = senService.calculateSentiment(deserializedTweet.msg)
        let serializedSentimentResult = serialize(sentimentResult)
        sentimentStore.store(sentimentResult);
        kafkaProducer.produce(serializedSentimentResult, 'sentiment_topic', 0);
    }
The full gist can be found .
And this method works perfectly fine.
But what happens when it doesn’t?
The reality is that while working (in an agile development process) we make mistakes. That’s a fact of life.
I believe that the real challenge with making mistakes is not to avoid them, but rather to optimize how fast we detect and repair them. So, we need to gain the ability to quickly discover our mistakes.
It's time for the MDD-way.
The MDD approach is heavily inspired by the Three Commandments of Production (which I learned the hard way).
The Three Commandments of Production are:
The data flowing in production is unpredictable and unique!
Perfect your code from real customer feedback and usage in production.
Now that we know the Commandments, it's time to go over the four-step plan of the metrics-driven development process.
I write the code, and whenever possible, wrap it with a feature flag that allows me to gradually open it for users.
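One possible way to "gradually open" a feature is a percentage rollout keyed on a stable user identifier. The sketch below is an assumption about how such a flag could work, not the mechanism used in the original service; `rolloutBucket`, `isFeatureEnabled`, and the user ids are hypothetical names.

```typescript
// Deterministically map a user id to a bucket in the range 0..99, so the
// same user always lands in the same bucket across calls and restarts.
function rolloutBucket(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) % 1000003;
  }
  return hash % 100;
}

// The feature is on for users whose bucket falls below the rollout
// percentage. Raising the percentage only ever adds users; nobody who
// already has the feature loses it.
function isFeatureEnabled(userId: string, rolloutPercent: number): boolean {
  return rolloutBucket(userId) < rolloutPercent;
}

// Route each (made-up) user to the new or the old code path at 25% rollout.
const sampleUserIds = ["user-1", "user-2", "user-3", "user-4"];
for (const userId of sampleUserIds) {
  const chosenPath = isFeatureEnabled(userId, 25) ? "new pipeline" : "old pipeline";
  console.log(userId, "->", chosenPath);
}
```

Starting at a low percentage and raising it while watching the metrics keeps the blast radius of a mistake small.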
This consists of two parts:
Add metrics on relevant parts
In this part, I ask myself: what success or failure metrics can I define to make sure my feature works? In this case, does my new pipeline application perform its logic correctly?
Add alerts on top of them so that I’ll be alerted when a bug occurs
In this part, I ask myself: what metric could alert me if I forgot something or did not implement it correctly?
I deploy the code and immediately monitor it to verify that it’s behaving as I have anticipated.
And that's it! Now that we have learned the process, let's tackle an important task inside it.
One of the toughest questions for me when I’m doing MDD is: “What should I monitor?”
To answer the question, let’s zoom out and look at the big picture. All the information available to monitor can be divided into two parts:
Applicative information: information that has an applicative context and meaning. An example would be: “How many tweets did we classify as positive in the last hour?”
Operational information: information related to the infrastructure that surrounds our application, such as cloud data, CPU and disk utilization, and network usage.
Now, since we cannot monitor everything, we need to choose what applicative and operational information we want to monitor.
After we do that, we can ask ourselves the question: what alerts do we want to set up on top of the metrics we just defined?
The diagram (of information, metrics, alerts) can be drawn like this:
I usually add applicative metrics to serve two needs:
A question is something like, “When my service misbehaves, what information would be helpful to know about?”
Some answers to that question could be: latencies of all IO calls, processing rate, throughput, and so on.
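Measuring the latency of a call can be as simple as wrapping it in a helper that reports the elapsed time, whether the call succeeds or throws. The sketch below is synchronous for brevity (an async variant would await the call inside the `try` block). The `timed` helper, the `ReportTiming` callback, and the metric names are assumptions; in a real service the callback would forward to your metrics client.

```typescript
// Callback that receives a metric name and a duration in milliseconds.
type ReportTiming = (metricName: string, durationMs: number) => void;

// Run `call`, report how long it took, and pass its result (or its
// exception) through unchanged. The clock is injectable for testing.
function timed<T>(
  metricName: string,
  report: ReportTiming,
  call: () => T,
  nowMs: () => number = Date.now
): T {
  const startedAt = nowMs();
  try {
    return call();
  } finally {
    // Runs on success and on failure, so slow failures are measured too.
    report(metricName, nowMs() - startedAt);
  }
}

// Collect reported timings in memory instead of sending them anywhere.
const reportedTimings: Array<{ metricName: string; durationMs: number }> = [];
const recordTiming: ReportTiming = (metricName, durationMs) => {
  reportedTimings.push({ metricName, durationMs });
};

// Stand-in for an IO call such as storing a sentiment result.
const storedOk = timed("sentiment_store_latency_ms", recordTiming, () => true);
console.log(storedOk, reportedTimings.length); // true 1
```

Wrapping every IO call this way gives you a latency series per dependency, which is usually the first thing you want when a service misbehaves.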
Most of these metrics are helpful while you are searching for an answer. But once you have found it, chances are you will not look at them again (since you already know the answer).
These questions are usually driven by R&D and are (usually) used to gather information internally.
This may sound backward, but I usually add applicative metrics in order to define alerts on top of them. In other words, we first define the list of alerts and then deduce from them which applicative metrics to report.
These alerts are derived from the SLA of the product and are usually treated with mission-critical importance.
Alerts can be broken down into three parts:
SLA alerts surround the places in our system where an SLA is specified to meet explicit customer or internal requirements (e.g., availability, throughput, latency). SLA breaches involve paging R&D and waking people up, so try to keep the alerts in this list to a minimum.
Also, we can define Degradation Alerts in addition to SLA Alerts. Degradation alerts are defined with lower thresholds than SLA alerts, so they help reduce the number of SLA breaches by giving you a proper heads-up before they happen.
An example of an SLA alert would be, “All sentiment requests must finish in under 500ms.”
An example of a Degradation Alert would be, “All sentiment requests must finish in under 400ms.”
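The relationship between the two thresholds can be sketched as a small classifier over the slowest observed request. The 400ms and 500ms values come from the examples above; the `AlertLevel` type and `classifyLatency` function are hypothetical names, and a real alerting system would express this as rules in its own configuration rather than in application code.

```typescript
type AlertLevel = "ok" | "degradation" | "sla_breach";

const DEGRADATION_THRESHOLD_MS = 400; // early warning
const SLA_THRESHOLD_MS = 500;         // pages someone

// "All requests must finish in under X ms" is violated as soon as the
// slowest request takes X ms or more.
function classifyLatency(slowestRequestMs: number): AlertLevel {
  if (slowestRequestMs >= SLA_THRESHOLD_MS) {
    return "sla_breach";
  }
  if (slowestRequestMs >= DEGRADATION_THRESHOLD_MS) {
    return "degradation";
  }
  return "ok";
}

console.log(classifyLatency(350)); // ok
console.log(classifyLatency(450)); // degradation
console.log(classifyLatency(520)); // sla_breach
```

The band between the two thresholds is the window in which you can react before the SLA is actually breached.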
These are the alerts I defined:
200 ops/sec * 60 bytes (size of a sentiment result) * 86,400 sec in a day ≈ 1GB < 2GB
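The back-of-the-envelope calculation above can be checked in a few lines. The numbers are the ones given in the text; the variable names are made up for the example, and decimal gigabytes (10^9 bytes) are assumed.

```typescript
const opsPerSecond = 200;
const sentimentResultBytes = 60;
const secondsPerDay = 86400;
const dailyLimitGb = 2;

// Total bytes written per day at a constant rate.
const bytesPerDay = opsPerSecond * sentimentResultBytes * secondsPerDay;
const gigabytesPerDay = bytesPerDay / 1e9;
const withinDailyLimit = gigabytesPerDay < dailyLimitGb;

console.log(bytesPerDay, gigabytesPerDay.toFixed(2), withinDailyLimit);
// 1036800000 1.04 true
```

At this rate the daily volume uses roughly half of the 2GB budget, so the throughput could about double before the limit is reached.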
These alerts usually involve measuring and defining a baseline, and then alerting when it changes dramatically over time.
For example, the 99th-percentile processing latency for an event must stay roughly the same over time unless we have made dramatic changes to the logic.
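A baseline check of this kind can be sketched as a relative-deviation test. The 20% tolerance and the latency values below are arbitrary assumptions, and `deviatesFromBaseline` is a hypothetical helper; what counts as a "dramatic" change depends on your system.

```typescript
// True when the current value differs from the baseline by more than the
// given fraction (0.2 means 20%), in either direction.
function deviatesFromBaseline(
  baselineMs: number,
  currentMs: number,
  tolerance: number
): boolean {
  if (baselineMs <= 0) {
    throw new Error("baseline must be positive");
  }
  return Math.abs(currentMs - baselineMs) / baselineMs > tolerance;
}

const baselineP99Ms = 200; // measured while the system was healthy

console.log(deviatesFromBaseline(baselineP99Ms, 210, 0.2)); // false
console.log(deviatesFromBaseline(baselineP99Ms, 300, 0.2)); // true
console.log(deviatesFromBaseline(baselineP99Ms, 100, 0.2)); // true
```

A sudden drop is flagged as well as a rise, since a latency that halves overnight often means work is being skipped rather than done faster.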
These are the alerts I defined:
I’ve given a talk about property-based tests and their insane strength. As it turns out, collecting metrics allows us to run property-based tests on our system in production!
Some properties of our system:
These alerts helped me validate that:
In order to define these alerts, you need to submit metrics from your application. Go for the complete metrics list.
Using these metrics, I can create alerts that will “page” me whenever one of these properties no longer holds in production.
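As an illustration of checking a property from metrics, consider one plausible property for a pipeline like this: every incoming request should eventually produce either an outgoing event or a counted failure. This property, the counter names, and the in-flight allowance are assumptions made for the sketch, not the list from the original post.

```typescript
// Snapshot of three counters (hypothetical names).
interface PipelineCounters {
  incomingRequests: number;
  outgoingEvents: number;
  failedRequests: number;
}

// Returns a description of each violated property. Requests still being
// processed are tolerated up to `maxInFlight`.
function violatedProperties(
  counters: PipelineCounters,
  maxInFlight: number
): string[] {
  const violations: string[] = [];
  const unaccounted =
    counters.incomingRequests - counters.outgoingEvents - counters.failedRequests;
  if (unaccounted < 0) {
    violations.push("more results than requests");
  }
  if (unaccounted > maxInFlight) {
    violations.push("requests are being lost");
  }
  return violations;
}

const healthySnapshot: PipelineCounters = {
  incomingRequests: 1000,
  outgoingEvents: 990,
  failedRequests: 5,
};
const lossySnapshot: PipelineCounters = {
  incomingRequests: 1000,
  outgoingEvents: 900,
  failedRequests: 5,
};

console.log(violatedProperties(healthySnapshot, 10)); // []
console.log(violatedProperties(lossySnapshot, 10));   // [ 'requests are being lost' ]
```

Unlike a unit test, this check runs against real production traffic, so it exercises inputs nobody thought to write a test for.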
Let’s take a look at a possible implementation of all these metrics:
    import SDC = require("statsd-client");

    // StatsD client used to report all metrics below.
    let sdc = new SDC({ host: 'localhost' });
    let senService: SentimentAnalysisService; //...

    while (true) {
        let tweetInformation = kafkaConsumer.consume()
        // Count every consumed request so throughput can be graphed.
        sdc.increment('incoming_requests_count')
        let deserializedTweet: { msg: string } = deSerialize(tweetInformation)
        // Track the size distribution of incoming messages.
        sdc.histogram('request_size_chars', deserializedTweet.msg.length);
        let sentimentResult = senService.calculateSentiment(deserializedTweet.msg)
        if (sentimentResult !== undefined) {
            let serializedSentimentResult = serialize(sentimentResult)
            // Track the size distribution of outgoing events.
            sdc.histogram('outgoing_event_size_chars', serializedSentimentResult.length);
            sentimentStore.store(sentimentResult)
            kafkaProducer.produce(serializedSentimentResult, 'sentiment_topic', 0);
        }
    }
The full code can be found
A few thoughts on the code example above:
Choosing correct metric names is hard. Take your time selecting proper names. an excellent post about this.
We can now make sure the application latency and throughput do not degrade over time. Also, adding alerts on these metrics allows for a much faster issue discovery and resolution.
Metrics-driven development goes hand in hand with CI/CD, DevOps, and agile development processes. If you practice any of the above, you are in the right place.
When done right, metrics make you feel more confident in your deployment, in the same way that seeing passing unit tests in your build makes you feel confident in the code you write.
Adding metrics allows you to deploy code and feel confident that your production environment is stable and that your application is behaving as expected over time. So I encourage you to try it out!
Here is a to the code shown in this post, and is the full metrics list described.
If you are eager to try writing some metrics and to connect them to a monitoring system, check out , and possibly this
This guy wrote a delightful about metrics-driven development. Go read it.