[00:59:40] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 62.87, 29.27, 18.63
[01:03:20] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:03:20] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:41] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:05:10] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 18.55, 30.51, 23.38
[02:06:11] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:01] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:01] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:28:50] (PS6) Zhuyifei1999: Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893)
[02:39:04] (CR) BryanDavis: [C: 1] "Guess this could be tagged as resolving T192244 as well, but I really hope people don't rely on /etc/wmcs-instancename widely." [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[02:40:10] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:40:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:53:20] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:53:21] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:54:01] (CR) Zhuyifei1999: "pykube.exceptions.HTTPError: /etc/wmcs-project is not in allowed host paths nor allowed host path prefixes" [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[02:59:00] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={list_podsandbox,podsandbox_status,remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:00:00] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:01:10] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:02:10] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:07:40] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:07:41] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={list_podsandbox,podsandbox_status,remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:08:50] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:09:50] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:15:21] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:15:30] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:17:40] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:18:40] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:23:10] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:24:11] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:26:30] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:27:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 807.07 seconds
[03:27:31] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:28:10] (CR) Krinkle: Raise Scribunto maxLangCacheSize to 200 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[03:28:58] Operations, Commons, Wikimedia-SVG-rendering, media-storage, Patch-For-Review: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3891777 (Peachey88) >>! In T184664#4184441, @Verdy_p wrote: > How do you plan to update these fonts when the Noto...
[03:33:10] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:34:10] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:34:11] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={podsandbox_status,remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:34:20] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:34:21] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[03:35:21] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:41:00] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={create_container,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:42:00] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:43:10] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:44:11] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:49:51] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:51:00] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:53:31] (PS1) Zhuyifei1999: toolforge k8s: allow /etc/wmcs-project to be mounted [puppet] - https://gerrit.wikimedia.org/r/431285 (https://phabricator.wikimedia.org/T192244)
[03:54:25] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4184607 (alex-mashin) > but I think we should also have Scribunto for other frameworks, notably JavaScript/ECMAScript like...
[03:54:46] (PS2) Zhuyifei1999: toolforge k8s: allow /etc/wmcs-project to be mounted [puppet] - https://gerrit.wikimedia.org/r/431285 (https://phabricator.wikimedia.org/T190893)
[03:55:14] (PS7) Zhuyifei1999: Mount & load project name dynamically from /etc/wmcs-project [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893)
[03:59:35] (PS8) Zhuyifei1999: Mount & load project name dynamically from /etc/wmcs-project [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893)
[04:00:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.94 seconds
[04:04:50] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:05:00] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:16:30] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:17:31] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:38:10] PROBLEM - Device not healthy -SMART- on mw1230 is CRITICAL: cluster=api_appserver device=sda instance=mw1230:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1230&var-datasource=eqiad%2520prometheus%252Fops
[06:28:00] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:28:10] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:29:20] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:58:30] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:58:40] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:50] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:09:50] (PS1) Addshore: WIP DNM WikibaseLexeme Config [mediawiki-config] - https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745)
[11:11:00] (CR) jerkins-bot: [V: -1] WIP DNM WikibaseLexeme Config [mediawiki-config] - https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: Addshore)
[11:12:00] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:14:10] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:45:30] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4185033 (Verdy_p) "proposals as an attack" ? Strange attitude. It's easy to see that Lua is in fact very slow compared to m...
[11:57:45] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4185035 (Reedy) This is off topic for this task. Please use/create an appropriate one
[12:11:04] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4185036 (Verdy_p) What was off-topic was the sentence I commented: "proposals seen as an attack". Which is completely wron...
[12:14:02] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4185037 (Reedy) This task isn’t about Lua. Or Scribunto. And where is your reference for it “draining resources”?
[13:18:31] (PS1) ArielGlenn: wikidata weekly dumps: set all default vars before parsing args [puppet] - https://gerrit.wikimedia.org/r/431312
[13:51:34] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4185070 (Joe) The decision to migrate the WMF production back to PHP 7.x is long taken and is not something we'd have done...
[14:12:54] (CR) BryanDavis: [C: 1] Mount & load project name dynamically from /etc/wmcs-project [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[14:35:02] (CR) Hoo man: [C: 1] "Very good catch. Please note that is only relevant in case the last run aborted unsuccessfully/ was killed, otherwise the files will be de" [puppet] - https://gerrit.wikimedia.org/r/431312 (owner: ArielGlenn)
[15:06:58] (PS1) Hoo man: Fix dumpwikidatardf size sanity check [puppet] - https://gerrit.wikimedia.org/r/431315
[15:13:38] (CR) ArielGlenn: [C: 2] wikidata weekly dumps: set all default vars before parsing args [puppet] - https://gerrit.wikimedia.org/r/431312 (owner: ArielGlenn)
[15:14:10] (CR) ArielGlenn: [C: 2] Fix dumpwikidatardf size sanity check [puppet] - https://gerrit.wikimedia.org/r/431315 (owner: Hoo man)
[15:14:16] (PS2) ArielGlenn: Fix dumpwikidatardf size sanity check [puppet] - https://gerrit.wikimedia.org/r/431315 (owner: Hoo man)
[18:12:24] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3591799 (TerraCodes) Can this task be closed? (since everything in the task description is checked off)
[18:13:27] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4185208 (TerraCodes)