Amazon ECS: A Nightmarish Experience

Original author: Bilal Aslam

Translated by: Li Zhe (李哲)

Summary: Today, Bilal Aslam, co-founder and Chief Product Officer of Appuri, shares the painful experience he and his team had with Amazon ECS.

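At Appuri, our ETL pipeline, API, and UI are made up of a large number of small, single-purpose services. We started with large, monolithic repos and gradually migrated to a microservices pattern. This was not because of any philosophical bias but because it fit the way we work. Despite the pros and cons, microservices have worked well for us on the whole. But we are not here to debate microservices. We are here to tell you about our painful experience with Amazon EC2 Elastic Container Service (ECS) and how we pulled back from the brink by moving to Kubernetes.

Note: in general, we like AWS products. Experiences with ECS also vary from company to company. Segment, for example, had a very pleasant experience with ECS and none of our complaints.

I am very fond of managed services. For example, we don't run our own Postgres server; we use Amazon RDS. We don't run our own hypervisor or bare-metal servers either; we use Amazon EC2. Ideally, you buy a managed service from a provider so that you can focus on building differentiated value, and everyone wins. In fact, this has been our experience with many managed service providers.

In June 2015, we started looking for a PaaS on which to deploy our services. I wanted a Docker-based managed service that still left us a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the brand-new Amazon EC2 ECS.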

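Amazon ECS had these advantages:

  • Docker containers can be launched quickly and easily.
  • ECS is aware of multiple availability zones (AZs).
  • Rolling deploys are supported, giving true zero-downtime deployments.
  • API clients. AWS provides API clients for all of its services in every language we use.
  • ECS works with ordinary EC2 instances. We don't need to learn a new PaaS; we just install the ECS agent on any EC2 instance running Amazon Linux and have it join an ECS cluster.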

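First impression

Our first impression on seeing an ECS demo was that it lacked many key features:

No service discovery. In ECS, the substitute for service discovery is to use internal load balancers. This is the only way to run a network-accessible service in ECS; even with a single instance you have to run an ELB. For a microservice architecture, this adds cost with every service you deploy.

No central config. ECS has no way to pass configuration to services (that is, Docker containers) other than environment variables. So how do I pass the same environment variable to every service? Only by copy and paste.

Mediocre CLI. Compared with competitors such as Kubernetes, the ECS CLI is mediocre. You can scale from the command line (aws ecs update-service --desired-count N), but the ECS CLI is not very powerful.

Despite all these missing core features, we decided to go ahead with ECS.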

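The moment we regretted it

Our moment of regret came when we discovered that environment variables were being leaked to CloudTrail, and on to other third-party services that consume CloudTrail events and logs.

We posted on the forum, and the ECS team's reply missed the point. Apparently they do not consider environment variables to be sensitive information.

We could have built more infrastructure to encrypt secrets with Amazon Key Management Service (KMS) and decrypt them at service start. In fact, this is exactly what Convox does. But with so much interesting work to do in our own domain, why would we build that infrastructure?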

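The moment that broke us

In nearly a year of running ECS, we followed every feature release, took an active part in opening GitHub issues, and so on. In the end, we gave up on ECS for the following reasons:

The ECS agent disconnects frequently, which makes it impossible to launch new containers. ECS installs an agent on every EC2 instance, and this agent interacts with the Amazon API and with Docker. The agent disconnects often, deployments fail when it does, and that is fatal for our services. The GitHub issue tracking this problem has been closed, yet the problem keeps happening. On our clusters it occurs at least twice a day. Despite our best efforts, we have not found the root cause. As far as I know, the ECS team has still not resolved it.

A search of our Slack history turns up only a small part of the times this problem was reported. It became so frequent that we resorted to restarting the agents periodically to work around it.

When you have to restart a service every hour to fix its bugs, you know you are going crazy.

  • Lack of attention to GitHub issues. Many feature and customer requests filed as GitHub issues have received no attention from Amazon ECS.
  • Bad architecture. ECS lacks many of the fundamentals that modern deployment and operations infrastructure needs.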

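Goodbye ECS, hello Kubernetes

After much grumbling about ECS, we decided to try Kubernetes (k8s). Two weeks in, we are very satisfied. This open source project is well suited to deployments and operations at scale. The CLI, service discovery, and configuration management are all a pleasure to use. We did hit one odd issue where kube-proxy was not routing traffic correctly, but a restart fixed it and it has not come back. So far, we have not regretted the choice.

English original: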

Here at Appuri, we have a large number of small, single-purpose services that make up our ETL pipeline, API and UI. We started from large, monolithic repos and gradually migrated to this microservices pattern, not because of any philosophical bias but because it fit our work style. By and large, this has worked well with all the known pros and cons of microservices. But I’m not here to debate microservices. I’m here to tell you about our nightmare on Amazon EC2 Elastic Container Service (ECS) and how we saved ourselves by moving to Kubernetes.

NOTE: In general, we love AWS. Also, your mileage with ECS may vary. For example, Segment had a great experience with ECS and apparently none of our complaints.

There’s also the wonderful Convox project which contains a lot of great workflows on top of ECS. When we started using ECS, Convox wasn’t far enough along to meet our needs.

And so, it begins, with a love of managed services

I love managed services. For example, we don’t run our own Postgres server – we use Amazon RDS. We also don’t run our own hypervisor or bare metal servers, we use Amazon EC2. With managed services, you trade control for peace of mind and, in an ideal world, you can focus on building differentiated value add. Everyone wins. In fact, we have had exactly this experience with most managed services.

In June 2015, we started looking into a PaaS where we could deploy our services. I wanted to stay close to Docker, but maintain a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the shiny new Amazon EC2 Elastic Container Service (ECS).

Amazon ECS fit the bill because of several promises:

  • With ECS, you simply launch Docker containers.
  • ECS is aware of multiple availability zones (AZs). As long as EC2 instances are set up in multiple AZs, ECS will try to distribute containers to maintain high availability.
  • You can do rolling deploys. Neato, deployments with zero downtime!
  • API clients. All AWS services have (sadly auto-generated) API clients for all languages we use.
  • ECS works with vanilla EC2 instances. This is a nice plus, as we don’t have to learn a new PaaS – just install the ECS agent on any plain old EC2 instance running Amazon Linux and have it join an ECS cluster.
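
The availability-zone promise above can be made concrete with a small sketch. This is only an illustration of "spread containers across AZs", not ECS's actual scheduler; the function name, the instance ids, and the least-loaded-zone rule are all assumptions made for the example.

```python
from collections import Counter


def spread_across_zones(instance_zones, task_count):
    """Place tasks one at a time on the least-loaded zone.

    instance_zones maps a (hypothetical) instance id to its availability zone.
    Returns a list of instance ids, one entry per placed task.
    """
    if not instance_zones:
        raise ValueError("no instances available")
    zone_load = Counter({zone: 0 for zone in instance_zones.values()})
    instance_load = Counter({inst: 0 for inst in instance_zones})
    placements = []
    for _ in range(task_count):
        # Pick the zone with the fewest tasks; ties break alphabetically.
        zone = min(sorted(zone_load), key=lambda z: zone_load[z])
        # Within that zone, pick the instance running the fewest tasks.
        candidates = sorted(i for i, z in instance_zones.items() if z == zone)
        instance = min(candidates, key=lambda i: instance_load[i])
        placements.append(instance)
        zone_load[zone] += 1
        instance_load[instance] += 1
    return placements
```

With two instances in one zone and one in another, four tasks end up split two and two between the zones, so losing a single zone still leaves half the tasks running.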

First impression: wow, it’s missing a LOT of stuff.

My first impression on seeing an ECS demo was how much it was missing. We use a lot of AWS services and are well aware of how Amazon releases incremental updates. That's all good; we do that, too. However, it was sad to see that these key features were missing:

  • No service discovery. In ECS, the recommended way to do service discovery is to use internal load balancers. This is actually a bigger deal because using an internal ELB is the only way you can run a service in ECS that is network-accessible; even with a single instance you HAVE to run an ELB for the service to be discoverable — for a microservice architecture this adds cost with every service you deploy despite having no additional hardware.
  • No central config. ECS doesn’t have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service? Copy and paste it. We considered setting up Consul, but instead decided to stick with native ECS environment variables to start using the service.
  • Mediocre CLI. Compared to competitors like Kubernetes, ECS has a mediocre CLI at best. You can scale from the command line (aws ecs update-service --desired-count N) but the ECS CLI is just not very powerful.
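
To illustrate the "copy and paste" problem in the central config item, here is a minimal sketch of one way to avoid it: keep the shared variables in one place and merge them into each service's container definition before registering it. The helper name and the sample values are hypothetical; the `environment` list of `name`/`value` pairs follows the shape ECS container definitions use.

```python
def build_container_definition(name, image, shared_env, service_env=None):
    """Build a container definition with shared and per-service variables.

    Per-service values override shared ones when the same key appears in both.
    """
    merged = dict(shared_env)
    merged.update(service_env or {})
    return {
        "name": name,
        "image": image,
        # ECS expects environment variables as a list of name/value pairs.
        "environment": [
            {"name": key, "value": str(value)}
            for key, value in sorted(merged.items())
        ],
    }


# Hypothetical shared settings, defined once for every service.
SHARED_ENV = {"LOG_LEVEL": "info", "STATSD_HOST": "statsd.internal"}

api_definition = build_container_definition(
    "api", "example/api:1.0", SHARED_ENV, {"LOG_LEVEL": "debug", "PORT": 8080}
)
```

This only moves the duplication into a deploy script. It does not give services a live, central source of configuration the way a tool such as Consul would.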

Despite these missing features, we decided to move ahead.

I have made a huge mistake

Our first “oh crap” moment with ECS in production was when we noticed that it was leaking environment variables to CloudTrail, and on to DataDog and other third party services that consume CloudTrail events and logs. ECS, like a good AWS citizen, logs events to CloudTrail. When you start a new service, it logs the service definition including environment variables to CloudTrail!
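
One stopgap for the leak described above is to mask environment values before events are forwarded to third parties. The sketch below assumes an event shaped like `requestParameters.containerDefinitions[].environment`, mirroring the task definition structure; that layout is an assumption and should be checked against real CloudTrail events. The function name is hypothetical.

```python
import copy

REDACTED = "***REDACTED***"


def redact_environment(event):
    """Return a copy of the event with container environment values masked.

    The input event is left untouched so the caller can still inspect it.
    """
    cleaned = copy.deepcopy(event)
    params = cleaned.get("requestParameters") or {}
    for container in params.get("containerDefinitions", []):
        for variable in container.get("environment", []):
            if "value" in variable:
                variable["value"] = REDACTED
    return cleaned
```

Variable names are kept so the event stays useful for auditing; only the values are replaced. This does nothing about what CloudTrail itself has already recorded.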

We opened a forum post and the response from the team wasn’t on target. Apparently they don’t believe in treating environment variables as sensitive quantities.

Now, we could have built yet more infrastructure to encrypt secrets using Amazon Key Management Service (KMS) and decrypt them at service start – in fact, this is exactly what Convox does. But why would we build this infrastructure when there was so much more interesting work in our domain to do?
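
The "encrypt with KMS, decrypt at service start" idea can be sketched as follows. This is not how Convox implements it; it is a minimal illustration in which encrypted values carry a made-up `enc:` prefix followed by base64 ciphertext, and the actual decryption is passed in as a callable so the sketch stays independent of any client library.

```python
import base64

SECRET_PREFIX = "enc:"  # hypothetical marker for encrypted values


def resolve_secrets(environ, decrypt):
    """Return a copy of environ with prefixed values decrypted.

    decrypt takes ciphertext bytes and returns plaintext bytes.
    Values without the prefix are passed through unchanged.
    """
    resolved = {}
    for key, value in environ.items():
        if value.startswith(SECRET_PREFIX):
            ciphertext = base64.b64decode(value[len(SECRET_PREFIX):])
            resolved[key] = decrypt(ciphertext).decode("utf-8")
        else:
            resolved[key] = value
    return resolved
```

Assuming boto3 is available (the article does not name a client library), `decrypt` could be something like `lambda blob: kms.decrypt(CiphertextBlob=blob)["Plaintext"]` with `kms = boto3.client("kms")`. Only ciphertext would then appear in the task definition and in any logs derived from it.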

What killed ECS for us

We ran ECS in production for nearly a year. In that time, we watched every single feature announcement, participated in opening GitHub issues and so on. Finally, we gave up on ECS when two issues remained unaddressed:

  • ECS agent disconnects periodically, making it impossible to launch new containers. Recall that ECS works by installing an agent on every EC2 instance that’s part of an ECS cluster. This agent interacts with the Amazon API as well as Docker. This agent has a horrible tendency to disconnect, and when this happens your deployments will fail – this kills your services. This problem is tracked in this GitHub issue and despite it being a closed issue, we have seen it happen repeatedly. It happens at least twice a day on our clusters and despite our best efforts, we haven’t been able to nail the root cause. To my knowledge, it remains unaddressed by the ECS team.

A search of our Slack history shows just some of the times we’ve seen this problem happen. This problem became so pervasive that we started restarting agents periodically to get around the failure.

You know you’re going crazy when you restart a service every hour to fix its bugs.

  • Lack of traction on GitHub issues. This issue is an example of how many features and customer requests remain unaddressed. This issue is the most commented feature for a year and remains unaddressed. Incidentally, we hit this issue as well.
  • Bad architecture. I expect modern deployment and operations infrastructure to support 12 factor apps in a meaningful, robust way. ECS simply lacks the fundamentals.
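
The periodic agent restart described above could be narrowed to the instances that actually need it. The sketch below parses the JSON printed by `aws ecs describe-container-instances` and lists the instances whose `agentConnected` flag is false. The function name is hypothetical, and how the agent is then restarted on each instance is left to the operator.

```python
import json


def disconnected_instances(describe_output):
    """Return EC2 instance ids whose ECS agent reports as disconnected.

    describe_output is the JSON text from describe-container-instances.
    An instance with no agentConnected field is treated as disconnected.
    """
    payload = json.loads(describe_output)
    return [
        instance["ec2InstanceId"]
        for instance in payload.get("containerInstances", [])
        if not instance.get("agentConnected", False)
    ]
```

A scheduled job could run this check and restart only the listed agents, which also leaves a record of how often the disconnect really happens.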

Adios ECS, hello Kubernetes

After much grumbling at ECS, we decided to try out Kubernetes (k8s). Having flipped the switch in production two weeks ago, we are delighted. It seems that the contributors to this open source project really thought through deployments and operations at scale. From the CLI to service discovery and configuration management, it has been a pleasure to use. We ran into an odd issue with kube-proxy not routing traffic correctly, but a restart fixed the issue and it hasn’t cropped up since. We haven’t looked back!


2016-11-22