[00:48:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:51:45] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 79.62 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:08:01] <wikibugs>	 Operations, Traffic, observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (CDanis)
[01:09:22] <wikibugs>	 Operations, Traffic, observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (CDanis)
[01:49:06] <wikibugs>	 Operations, Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (Vgutierrez) Even though moving (mv) the file manually works as expected without a reload, the rotation triggered by logrotate isn't forcing ATS to open a new file: ` -rw-r--r--  1 trafficserver trafficserver...
[01:52:09] <wikibugs>	 (PS1) Herron: install_server: add logstash 7 vms [puppet] - https://gerrit.wikimedia.org/r/552676
[02:07:08] <wikibugs>	 (PS1) Herron: icinga: disable notifications on logstash 7 hosts [puppet] - https://gerrit.wikimedia.org/r/552677
[02:09:35] <wikibugs>	 (CR) Herron: [C: +2] icinga: disable notifications on logstash 7 hosts [puppet] - https://gerrit.wikimedia.org/r/552677 (owner: Herron)
[02:10:06] <wikibugs>	 (PS2) Herron: icinga: disable notifications on logstash 7 hosts [puppet] - https://gerrit.wikimedia.org/r/552677
[02:14:22] <wikibugs>	 (CR) Herron: [C: +2] install_server: add logstash 7 vms [puppet] - https://gerrit.wikimedia.org/r/552676 (owner: Herron)
[02:18:30] <wikibugs>	 Operations, Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (Vgutierrez) upon manual removal of the empty error.log file, trafficserver creates a new one (without issuing a reload): ` -rw-r--r--  1 trafficserver trafficserver 2.8K Nov 25 02:17 error.log -rw-r--r--  1 tr...
[02:23:36] <wikibugs>	 (PS1) Vgutierrez: ATS: Prevent logrotate from creating empty log files [puppet] - https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724)
[02:35:15] <wikibugs>	 (PS2) Vgutierrez: ATS: Prevent logrotate from creating empty log files [puppet] - https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724)
[02:37:55] <wikibugs>	 (CR) Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/19573/" [puppet] - https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) (owner: Vgutierrez)
[02:47:42] <wikibugs>	 Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (Vgutierrez)
[02:59:24] <vgutierrez>	 !log depooling & power-cycling cp3053 - T239041
[02:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:29] <stashbot>	 T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041
[03:00:06] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3053.esams.wmnet
[03:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:26] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s8 on db2083 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:03:24] <wikibugs>	 Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (Vgutierrez) p:Triage→Normal
[03:05:59] <icinga-wm>	 RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 83.43 ms
[03:09:09] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[03:09:30] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[03:09:33] <wikibugs>	 Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (Vgutierrez)
[03:13:29] <vgutierrez>	 !log repooling cp3053 - T239041 
[03:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:34] <stashbot>	 T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041
[03:14:30] <wikibugs>	 Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (Vgutierrez) Open→Resolved a:Vgutierrez Nothing on the logs or on SEL
[03:14:32] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[03:21:03] <wikibugs>	 (PS1) Vgutierrez: acme_chief: Revoke access from netmon boxes to netbox certificate [puppet] - https://gerrit.wikimedia.org/r/552680 (https://phabricator.wikimedia.org/T238919)
[03:38:01] <wikibugs>	 (CR) Vgutierrez: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: Vgutierrez)
[04:11:52] <icinga-wm>	 PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:21:56] <icinga-wm>	 RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:38:38] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[04:41:44] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[04:51:54] <AntiComposite>	 https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slowdown_on_WP
[04:52:07] <AntiComposite>	 Complaints of general slowness
[04:53:32] <icinga-wm>	 PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:04:14] <icinga-wm>	 RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:20:32] <wikibugs>	 Operations, DNS, Internet-Archive, Traffic, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (Aklapper) > jcrespo changed the task status from Open to Stalled.  What exactly is this task [stalled](https://w...
[05:53:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9725 and previous config saved to /var/cache/conftool/dbconfig/20191125-055305-marostegui.json
[05:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:05] <wikibugs>	 Operations, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) p:Triage→Normal
[05:58:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2125 - crashed T239042', diff saved to https://phabricator.wikimedia.org/P9726 and previous config saved to /var/cache/conftool/dbconfig/20191125-055813-marostegui.json
[05:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:19] <stashbot>	 T239042: db2125 crashed  - https://phabricator.wikimedia.org/T239042
[05:59:40] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on db2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui T239042 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:00:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9727 and previous config saved to /var/cache/conftool/dbconfig/20191125-060011-marostegui.json
[06:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:24] <wikibugs>	 Operations, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) There are no hardware logs: ` /admin1-> racadm getsel Record:      1 Date/Time:   07/12/2019 21:38:11 Source:      system Severity:    Ok Description: Log cleared. ----------------------------------------------------...
[06:07:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9728 and previous config saved to /var/cache/conftool/dbconfig/20191125-060728-marostegui.json
[06:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:24] <icinga-wm>	 PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:13:45] <wikibugs>	 Operations, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) The first traces of crash are: ` Nov 23 08:25:35 db2125 mysqld[13682]: InnoDB: Warning: a long semaphore wait: Nov 23 08:25:35 db2125 mysqld[13682]: --Thread 139387736135424 has waited at row0purge.cc line 772 for 24...
[06:15:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9729 and previous config saved to /var/cache/conftool/dbconfig/20191125-061542-marostegui.json
[06:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104 - schema change', diff saved to https://phabricator.wikimedia.org/P9730 and previous config saved to /var/cache/conftool/dbconfig/20191125-061629-marostegui.json
[06:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:16] <wikibugs>	 Operations, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) More logs from the console: ` [10086760.709402] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u480:0:6] [10086764.636175] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [sshd:107965] [1008676...
[06:18:35] <marostegui>	 !log racadm serveraction hardreset on db2125 T239042
[06:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:39] <stashbot>	 T239042: db2125 crashed  - https://phabricator.wikimedia.org/T239042
[06:21:42] <icinga-wm>	 RECOVERY - SSH on db2125 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:22:15] <marostegui>	 !log Compress db2094:3318
[06:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:14] <marostegui>	 !log Compress db2082
[06:23:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:09] <marostegui>	 !log Compress db2080
[06:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:55] <wikibugs>	 Operations, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Nothing apart from this on OS logs: ` Nov 23 08:17:09 db2125 systemd[1]: Started Time & Date Service. Nov 23 08:18:01 db2125 CRON[107127]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var...
[06:31:51] <wikibugs>	 Operations, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) I have extracted the controller logs....nothing showing up there.
[06:34:32] <wikibugs>	 Operations, ops-codfw, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) a:Papaul @Papaul can we upgrade firmware and BIOS on this host? It is a very new host, and if this crash happens again we might need to contact Dell.
[06:37:16] <wikibugs>	 (PS1) Marostegui: db2134: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/552690 (https://phabricator.wikimedia.org/T238183)
[06:37:40] <vgutierrez>	 marostegui: may I ask how db2125 crashed?
[06:37:59] <marostegui>	 vgutierrez: I don't know, I am guessing storage crashed, but there is nothing on logs
[06:38:01] <vgutierrez>	 marostegui: maybe no net, nothing on the KVM console and nothing on the logs?
[06:38:10] <marostegui>	 vgutierrez: nothing
[06:38:21] <marostegui>	 vgutierrez: from the MySQL logs, I think the storage crashed somehow
[06:38:25] <vgutierrez>	 hmmm R440?
[06:38:38] <marostegui>	 vgutierrez: yes
[06:38:53] <vgutierrez>	 marostegui: https://phabricator.wikimedia.org/T238305
[06:39:01] <vgutierrez>	 dunno if it's related
[06:39:10] <vgutierrez>	 but we are seeing something similar in esams brand new cp servers
[06:39:47] <marostegui>	 vgutierrez: This host is relatively new too (from july)
[06:39:56] <marostegui>	 vgutierrez: Going to add the task as a subtask
[06:40:13] <vgutierrez>	 yeah, cp1077 there is also not brand new but relatively new as well
[06:40:17] <vgutierrez>	 also a R440
[06:40:28] <icinga-wm>	 RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:40:31] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) db2125 crashed too and it is a new R440: {T239042}
[06:40:50] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[06:41:07] <wikibugs>	 Operations, ops-codfw, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui)
[06:41:09] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[06:41:34] <wikibugs>	 Operations, ops-codfw, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) This can be related: T238305
[06:42:00] <wikibugs>	 (CR) Marostegui: [C: +2] db2134: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/552690 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[06:43:36] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 53.92 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:52:08] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.29 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:57:57] <wikibugs>	 (PS1) Marostegui: db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/552695 (https://phabricator.wikimedia.org/T239042)
[07:00:06] <wikibugs>	 (CR) Marostegui: [C: +2] db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/552695 (https://phabricator.wikimedia.org/T239042) (owner: Marostegui)
[07:04:02] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:04:53] <marostegui>	 !log Upgrade db2134
[07:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:37] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db2134 to m3 codfw master [puppet] - https://gerrit.wikimedia.org/r/552699 (https://phabricator.wikimedia.org/T238183)
[07:07:25] <wikibugs>	 (PS8) DannyS712: abusefilter.php: Remove settings that duplicate defaults, and clean up [mediawiki-config] - https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[07:07:28] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:11:01] <wikibugs>	 Operations, ops-eqiad, Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (elukey)
[07:13:31] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db2134 to m3 codfw master [puppet] - https://gerrit.wikimedia.org/r/552699 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[07:22:15] <marostegui>	 !log Compress db2090
[07:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:31] <wikibugs>	 Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:28:43] <wikibugs>	 Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:50:04] <wikibugs>	 (CR) Marostegui: [C: +1] racktables: use codfw database when in codfw [puppet] - https://gerrit.wikimedia.org/r/552553 (https://phabricator.wikimedia.org/T224247) (owner: Dzahn)
[07:51:13] <wikibugs>	 (CR) Marostegui: [C: +1] iegreview app: use codfw database when in codfw [puppet] - https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247) (owner: Dzahn)
[07:51:37] <wikibugs>	 (CR) Marostegui: [C: +1] wikimania_scholarships app: use codfw database when in codfw [puppet] - https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247) (owner: Dzahn)
[07:57:30] <wikibugs>	 (PS1) ArielGlenn: properly handle mtime lookup for dumps log exception checker [puppet] - https://gerrit.wikimedia.org/r/552707
[07:59:03] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db1067 [puppet] - https://gerrit.wikimedia.org/r/552708 (https://phabricator.wikimedia.org/T238297)
[08:00:04] <wikibugs>	 (PS1) Marostegui: wmnet: Remove production DNS for db1067 [dns] - https://gerrit.wikimedia.org/r/552709 (https://phabricator.wikimedia.org/T238297)
[08:04:06] <wikibugs>	 (PS6) Muehlenhoff: Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/539150
[08:10:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[08:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:11:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:06] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove db1067 [puppet] - https://gerrit.wikimedia.org/r/552708 (https://phabricator.wikimedia.org/T238297) (owner: Marostegui)
[08:11:46] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Remove production DNS for db1067 [dns] - https://gerrit.wikimedia.org/r/552709 (https://phabricator.wikimedia.org/T238297) (owner: Marostegui)
[08:12:47] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (Marostegui) a:Marostegui→Jclark-ctr
[08:13:00] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (Marostegui) Host ready for #dc-ops  steps
[08:13:15] <wikibugs>	 Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[08:14:42] <wikibugs>	 (CR) Effie Mouzeli: "> We could create a generic profile which installs the perftools by" [puppet] - https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: Effie Mouzeli)
[08:17:51] <wikibugs>	 Operations, DNS, Internet-Archive, Traffic, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (jcrespo) @Aklapper an answer to T99216#2057570
[08:19:35] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[08:25:45] <icinga-wm>	 PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:27:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add server_name, override settings [docker-images/production-images] - https://gerrit.wikimedia.org/r/552763
[08:27:57] <wikibugs>	 (PS1) Marostegui: db2065: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/552764 (https://phabricator.wikimedia.org/T239046)
[08:29:57] <wikibugs>	 (CR) Marostegui: [C: +2] db2065: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/552764 (https://phabricator.wikimedia.org/T239046) (owner: Marostegui)
[08:31:24] <wikibugs>	 (CR) Vgutierrez: [C: +1] Add server_name, override settings [docker-images/production-images] - https://gerrit.wikimedia.org/r/552763 (owner: Giuseppe Lavagetto)
[08:32:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add server_name, override settings [docker-images/production-images] - https://gerrit.wikimedia.org/r/552763 (owner: Giuseppe Lavagetto)
[08:33:29] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:39:00] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki)
[08:39:18] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) For what it's worth, this is the kernel this host is running at the moment (I have not upgraded it since the crash): ` root@db2125:~# uname -a Linux db2125 4.9.0-11-amd64 #1 SMP Deb...
[08:39:39] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[08:40:07] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki) p:Triage→Normal
[08:42:17] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 81.75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:43:49] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: support configcluster and configcluster_stretch [puppet] - https://gerrit.wikimedia.org/r/552765 (https://phabricator.wikimedia.org/T238791)
[08:47:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "a bit ugly, but should work for now. I'll fix it back when we've transitioned." [puppet] - https://gerrit.wikimedia.org/r/552765 (https://phabricator.wikimedia.org/T238791) (owner: Filippo Giunchedi)
[08:48:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: support configcluster and configcluster_stretch [puppet] - https://gerrit.wikimedia.org/r/552765 (https://phabricator.wikimedia.org/T238791) (owner: Filippo Giunchedi)
[08:53:58] <_joe_>	 !log rebuilding base docker images docker-registry.wikimedia.org/wikimedia-{jessie,stretch,buster}
[08:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:09] <wikibugs>	 Operations, serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (fgiunchedi)
[08:54:12] <wikibugs>	 Operations, Traffic, serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (fgiunchedi)
[08:54:35] <wikibugs>	 Operations, serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (fgiunchedi) The cause was indeed appservers latency, resolving in favor of T238939
[08:55:20] <wikibugs>	 Operations, serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (Joe) I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latency should not matter.
[08:56:59] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[09:03:53] <wikibugs>	 (PS1) Filippo Giunchedi: Fix invalid metric name for pdns_tcp4_queries [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/552767
[09:10:14] <wikibugs>	 (CR) Filippo Giunchedi: "The current version is causing this in Prometheus logs:" [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/552767 (owner: Filippo Giunchedi)
[09:11:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/552767 (owner: Filippo Giunchedi)
[09:13:17] <moritzm>	 !log installing python2.7 updates on buster
[09:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:34] <wikibugs>	 (CR) Ema: [C: +2] Revert "vcl: move XWD pass logic to wm_common" [puppet] - https://gerrit.wikimedia.org/r/552507 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[09:16:37] <wikibugs>	 (CR) Ema: [C: +2] cache: do not cache noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/552508 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[09:17:02] <logmsgbot>	 !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@db43901]: T238822
[09:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:46] <wikibugs>	 Operations, Performance-Team, Traffic, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles)
[09:20:51] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: remove the ban of Guzzle user agent. [puppet] - https://gerrit.wikimedia.org/r/552540 (owner: Gehel)
[09:23:26] <icinga-wm>	 RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:25:25] <wikibugs>	 (PS2) Gehel: wdqs: remove the ban of Guzzle user agent. [puppet] - https://gerrit.wikimedia.org/r/552540
[09:26:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Fix invalid metric name for pdns_tcp4_queries [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/552767 (owner: Filippo Giunchedi)
[09:26:19] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] Fix invalid metric name for pdns_tcp4_queries [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/552767 (owner: Filippo Giunchedi)
[09:26:24] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/552680 (https://phabricator.wikimedia.org/T238919) (owner: Vgutierrez)
[09:28:34] <_joe_>	 !log building and publishing updated images for envoy
[09:28:37] <wikibugs>	 (CR) Gehel: [C: +2] wdqs: remove the ban of Guzzle user agent. [puppet] - https://gerrit.wikimedia.org/r/552540 (owner: Gehel)
[09:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:10] <logmsgbot>	 !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@db43901]: T238822 (duration: 13m 08s)
[09:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1104 after schema change', diff saved to https://phabricator.wikimedia.org/P9731 and previous config saved to /var/cache/conftool/dbconfig/20191125-093038-marostegui.json
[09:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: Fix badly formatted changelog entry, typo [docker-images/production-images] - https://gerrit.wikimedia.org/r/552769
[09:31:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 - schema change', diff saved to https://phabricator.wikimedia.org/P9732 and previous config saved to /var/cache/conftool/dbconfig/20191125-093157-marostegui.json
[09:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:39] <wikibugs>	 (CR) Volans: [C: +1] "Thanks for the fix" [docker-images/production-images] - https://gerrit.wikimedia.org/r/552769 (owner: Giuseppe Lavagetto)
[09:32:43] <moritzm>	 !log installing systemd security/bugfix updates on buster
[09:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Fix badly formatted changelog entry, typo [docker-images/production-images] - https://gerrit.wikimedia.org/r/552769 (owner: Giuseppe Lavagetto)
[09:41:46] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] labmon: add compatibility in buster [puppet] - https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[09:45:02] <moritzm>	 !log installing cron updates from buster point release
[09:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:37] <wikibugs>	 Operations, serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (fgiunchedi) duplicate→Open >>! In T238973#5688257, @Joe wrote: > I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latenc...
[09:45:55] <wikibugs>	 (PS2) Giuseppe Lavagetto: blubberoid: add telemetry collection support for envoy [deployment-charts] - https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234)
[09:45:57] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add private stub to eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/552771
[09:52:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add private stub to eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/552771 (owner: Giuseppe Lavagetto)
[09:53:16] <wikibugs>	 Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff)
[09:54:42] <wikibugs>	 Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff)
[09:54:45] <wikibugs>	 Operations, Puppet, User-jbond: Add method to admin module ci to detect removed users - https://phabricator.wikimedia.org/T239070 (jbond)
[09:55:30] <icinga-wm>	 PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:56:27] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] admins: add Max Semenik as ldap_only_admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn)
[09:56:53] <elukey>	 downtimed for a week -^
[09:59:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19574/deploy1001.eqiad.wmnet/ it compiles, and gives the correct result. I'm merging the " [puppet] - 10https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto)
[09:59:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus: add scraping of k8s envoy sidecars [puppet] - 10https://gerrit.wikimedia.org/r/549871 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto)
[10:00:09] <wikibugs>	 (03PS5) 10Ayounsi: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367
[10:08:01] <wikibugs>	 (03CR) 10Jbond: "mostly looks fine, however as it's a new role it will need approval in Monday's meeting. It would also be nice to have a pointer to the init  sc" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[10:09:28] <wikibugs>	 (03CR) 10Muehlenhoff: admins: add Max Semenik as ldap_only_admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn)
[10:09:52] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Remove now obsolete openstack-jessie-bpo filter [puppet] - 10https://gerrit.wikimedia.org/r/549814 (owner: 10Muehlenhoff)
[10:21:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi)
[10:22:05] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) Given T238901#5687813 it seems it's fixed.
[10:26:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: prometheus::snmp_exporter: rationalize hiera calls [puppet] - 10https://gerrit.wikimedia.org/r/552774
[10:28:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/19575/ shows this is a noop." [puppet] - 10https://gerrit.wikimedia.org/r/552774 (owner: 10Giuseppe Lavagetto)
[10:30:04] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1030).
[10:30:44] <wikibugs>	 (03CR) 10Nikerabbit: [C: 04-1] "Do we need the cxnonbeta list at all? I think we could just set" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: 10KartikMistry)
[10:31:32] <wikibugs>	 (03CR) 10Jbond: "Hi All, any objections to moving forward with this? can i get some +1's?" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond)
[10:33:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::snmp_exporter: rationalize hiera calls [puppet] - 10https://gerrit.wikimedia.org/r/552774 (owner: 10Giuseppe Lavagetto)
[10:38:14] <icinga-wm>	 RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:41:17] <wikibugs>	 10Operations, 10observability: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (10fgiunchedi) FWIW I'm ok with doing whichever is easiest, IIRC we can ship to kafka first and then add rules to log to a separate file.
[10:54:35] <wikibugs>	 (03PS7) 10KartikMistry: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318)
[10:57:00] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:58:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus::snmp_exporter: rationalize hiera calls [puppet] - 10https://gerrit.wikimedia.org/r/552774 (owner: 10Giuseppe Lavagetto)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1100).
[11:00:04] <jouncebot>	 daimona and Tpt: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:13] <Daimona>	 o/
[11:00:20] <Tpt[m]>	 o/
[11:00:22] <Urbanecm>	 I can SWAT today!
[11:00:29] <Daimona>	 Noice
[11:00:39] <Tpt[m]>	 Hi!
[11:01:08] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Properly configures the Wikisource extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T239034) (owner: 10Tpt)
[11:01:23] <Urbanecm>	 Tpt[m]: +2'ed your patch, as soon as it is merged, it will be automatically deployed
[11:01:38] <Tpt[m]>	 Urbanecm: Thank you!
[11:01:52] <wikibugs>	 (03Merged) 10jenkins-bot: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T239034) (owner: 10Tpt)
[11:02:10] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 76.76 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:03:24] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Allow enwikiversity interface admins to remove their own interface administratorship [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552615 (https://phabricator.wikimedia.org/T238967) (owner: 10DannyS712)
[11:03:44] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Create OpenGLAM mailing list - https://phabricator.wikimedia.org/T238759 (10SandraF_WMF) Thank you! Much appreciated 😀
[11:04:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "No blocker for me, I'd like to see more buy-in by other stakeholders too." [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond)
[11:04:15] <wikibugs>	 (03Merged) 10jenkins-bot: Allow enwikiversity interface admins to remove their own interface administratorship [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552615 (https://phabricator.wikimedia.org/T238967) (owner: 10DannyS712)
[11:05:17] <wikibugs>	 (03PS1) 10Elukey: profile::mariadb::misc::eventlogging::database: set db to read only [puppet] - 10https://gerrit.wikimedia.org/r/552776 (https://phabricator.wikimedia.org/T234826)
[11:06:05] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9394f1f: Allow enwikiversity interface admins to remove their own interface administratorship (T238967) (duration: 00m 57s)
[11:06:06] <Daimona>	 Urbanecm: please ping me when my patch is ready, I'm dealing with like 4 bugs at the same time
[11:06:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:11] <stashbot>	 T238967: Allow enwikiversity interface admins to remove their own interface administratorship - https://phabricator.wikimedia.org/T238967
[11:06:22] <Urbanecm>	 Daimona: ack, I'm waiting for CI now
[11:06:29] <Daimona>	 ty
[11:09:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986) (owner: 10Zoranzoki21)
[11:10:13] <wikibugs>	 (03Merged) 10jenkins-bot: Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986) (owner: 10Zoranzoki21)
[11:11:16] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: use the correct prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/552777
[11:11:57] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 4670d1d: Add throttle rule for WMCL Editathon 2019-12-07 (T238986) (duration: 00m 53s)
[11:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:02] <stashbot>	 T238986: Lift IP limit - WMCL Editathon 2019-12-07 - https://phabricator.wikimedia.org/T238986
[11:12:08] <Urbanecm>	 Daimona: can you test it within a minute?
[11:12:37] <Daimona>	 Urbanecm: sure
[11:12:42] <Daimona>	 In like 10 seconds actually ahah
[11:13:04] <Urbanecm>	 Daimona: scap pull just finished, mwdebug1001
[11:13:40] <Daimona>	 Yay, works
[11:13:42] <Daimona>	 Thanks
[11:14:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] deployment_server::helmfile: use the correct prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/552777 (owner: 10Giuseppe Lavagetto)
[11:14:31] <Urbanecm>	 Daimona: thx!
[11:15:15] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: deployment_server::helmfile: use the correct prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/552777
[11:15:23] <Urbanecm>	 Daimona: syncing
[11:16:24] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/AbuseFilter/extension.json: SWAT: 29a16bd: Restrict viewing Special:Log/AbuseFilter, and remove from recent changes (T34959) (duration: 01m 04s)
[11:16:31] <Urbanecm>	 Daimona: synced!
[11:16:33] <Urbanecm>	 !log EU SWAT done
[11:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:39] <stashbot>	 T34959: Private filters should not be visible in recent changes - https://phabricator.wikimedia.org/T34959
[11:16:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:01] <Daimona>	 Confirmed working, thanks again
[11:17:12] <Urbanecm>	 happy to help!
[11:19:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:19:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:19:38] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:20:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:20:16] <effie>	 ^ checking 
[11:20:34] <icinga-wm>	 PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[11:20:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:20:44] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[11:20:50] <effie>	 lovely 
[11:20:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:20:58] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:21:00] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:21:00] <wikibugs>	 (03PS7) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150
[11:21:06] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:21:10] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-
[11:21:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:21:44] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 79036 bytes in 7.813 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:22:12] <icinga-wm>	 RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:22:16] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:22:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:22:44] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:22:49] <volans>	 effie: need help?
[11:22:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 79034 bytes in 0.972 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:22:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:23:24] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:23:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:23:54] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:24:04] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:24:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Reminder: you have to either restart mysql or enable read-only directly on mysql with: set global read_only=1;" [puppet] - 10https://gerrit.wikimedia.org/r/552776 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey)
[11:24:20] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:24:36] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-
[11:24:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:24:48] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[11:24:48] <icinga-wm>	 received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:24:52] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[11:25:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:25:28] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:25:36] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:25:44] <icinga-wm>	 RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:25:44] <icinga-wm>	 PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[11:25:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/19579/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/552512 (owner: 10Jbond)
[11:25:58] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:26:28] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:26:34] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[11:27:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:27:26] <icinga-wm>	 RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:27:36] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.381 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:27:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:28:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:28:18] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:29:08] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:29:36] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:29:47] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: deployment_server::helmfile: use the correct prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/552777
[11:29:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.603 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:29:56] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:30:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:30:18] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[11:30:30] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:30:50] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:31:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:31:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:31:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:31:44] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[11:31:51] <effie>	 !log restart php-fpm on mw1314
[11:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:56] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:31:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:32:14] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1312 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1312 bytes in 2.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:32:16] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:32:17] <wikibugs>	 (03CR) 10Elukey: "hey John, so the original patch is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552304/." [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[11:32:26] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.256 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:32:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:33:02] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[11:33:08] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:33:24] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[11:33:36] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:33:56] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 79034 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:34:44] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:34:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:34:58] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:35:20] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:36:10] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:36:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:36:32] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:36:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:14] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[11:37:30] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 79035 bytes in 8.708 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:37:48] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:37:50] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:38:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:38:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:38:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:38:42] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:38:54] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[11:39:00] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[11:39:08] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:39:28] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:39:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:39:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:40:02] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:40:06] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:40:16] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:40:40] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:40:44] <effie>	 !log cumin -b 2 -s 10 restart php on API servers
[11:40:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:48] <icinga-wm>	 PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:41:37] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Noteworthy graph found by @ema:  https://grafana.wikimedia.org/d/w4TRwaxZz/local-backend-hitrate-varnish-vs-ats?panelId=4&...
[11:41:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:41:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:41:42] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:41:52] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.02917 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:41:58] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:42:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.612 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:42:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:42:46] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:43:22] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:43:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:43:32] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:43:48] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:43:48] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:44:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:44:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:44:18] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:44:38] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:45:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 79034 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:45:14] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[11:45:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 79036 bytes in 3.230 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:45:24] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: protmeheus: haproxy: add support for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643)
[11:46:21] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: protmeheus: haproxy: add support for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643)
[11:47:00] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.675 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:47:12] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:47:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1126 after schema change', diff saved to https://phabricator.wikimedia.org/P9734 and previous config saved to /var/cache/conftool/dbconfig/20191125-114733-marostegui.json
[11:47:36] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase main traffic weight for db1126', diff saved to https://phabricator.wikimedia.org/P9735 and previous config saved to /var/cache/conftool/dbconfig/20191125-114821-marostegui.json
[11:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:33] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:48:36] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:48:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:48:58] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:49:32] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:49:56] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:49:59] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris)
[11:50:07] <wikibugs>	 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi)
[11:50:53] <wikibugs>	 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) The scope of this request has been extended to OS rename as well. New OS hostname will be det...
[11:50:56] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[11:55:30] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.004167 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:58:20] <wikibugs>	 (03PS2) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978)
[12:00:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, minor thing inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez)
[12:10:38] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:10:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] protmeheus: haproxy: add support for Debian Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez)
[12:11:33] <wikibugs>	 (03PS1) 10Ayounsi: Depool esams for esams/knams work [dns] - 10https://gerrit.wikimedia.org/r/552792
[12:14:06] <wikibugs>	 (03CR) 10Muehlenhoff: "That would work, but given that the Buster package provides a systemd unit in the package it seems better to only use the ERB version for " [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez)
[12:14:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19582/" [puppet] - 10https://gerrit.wikimedia.org/r/552777 (owner: 10Giuseppe Lavagetto)
[12:15:22] <wikibugs>	 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10Daimona)
[12:17:26] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 48.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:25:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Revoke access from netmon boxes to netbox certificate [puppet] - 10https://gerrit.wikimedia.org/r/552680 (https://phabricator.wikimedia.org/T238919) (owner: 10Vgutierrez)
[12:26:00] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 99.06 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:27:29] <XioNoX>	 !log disable BGP to knams transits - T237031
[12:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:34] <stashbot>	 T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031
[12:28:18] <wikibugs>	 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10AMuigai) Makes sense to me on both fronts @Neil_P._Quinn_WMF
[12:29:35] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: haproxy: include prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/552794 (https://phabricator.wikimedia.org/T237643)
[12:30:23] <wikibugs>	 (03PS1) 10Jbond: cas-icinga: add redirect back / => /icinga/ [puppet] - 10https://gerrit.wikimedia.org/r/552795
[12:31:05] <wikibugs>	 (03CR) 10Gilles: "Has this been deployed? If not, le me know when you would like to do it." [puppet] - 10https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[12:31:14] <wikibugs>	 (03CR) 10Gilles: "Has this been deployed? If not, le me know when you would like to do it." [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/531204 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles)
[12:37:20] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234)
[12:40:52] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[12:41:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cas-icinga: add redirect back / => /icinga/ [puppet] - 10https://gerrit.wikimedia.org/r/552795 (owner: 10Jbond)
[12:41:23] <wikibugs>	 (03PS2) 10Jbond: cas-icinga: add redirect back / => /icinga/ [puppet] - 10https://gerrit.wikimedia.org/r/552795
[12:42:13] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: new k8s: haproxy: enable prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/552794 (https://phabricator.wikimedia.org/T237643)
[12:42:31] <XioNoX>	 !log bundle esams-knams links on esams side - T237031
[12:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:37] <stashbot>	 T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031
[12:44:35] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234)
[12:45:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10faidon) p:05Triage→03High
[12:48:02] <XioNoX>	 !log bundle esams-knams links on knams side - T237031
[12:48:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:07] <stashbot>	 T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031
[12:48:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto)
[12:49:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: haproxy: enable prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/552794 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez)
[12:49:51] <wikibugs>	 10Operations, 10Traffic, 10netops, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Vgutierrez) @volans @ayounsi IMHO it doesn't make any sense to include smokeping.wm.o SNI on the librenms certificate, that would set a dependency between otherw...
[12:54:27] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] "Having fewer certs (with more SNIs in them, for shared purposes) doesn't seem worth trying to optimize for when cert management is so auto" [puppet] - 10https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez)
[12:57:25] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: blubberoid: revert to correct selector for service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/552799
[12:57:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: revert to correct selector for service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/552799 (owner: 10Giuseppe Lavagetto)
[12:59:24] <logmsgbot>	 !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[12:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:58] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[13:02:56] <logmsgbot>	 !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[13:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:36] <XioNoX>	 !log cleanup config on cr2-esams - T237031
[13:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:41] <stashbot>	 T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031
[13:07:06] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 84.01 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[13:09:32] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: blubberoid: use string for port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/552802
[13:10:41] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: use string for port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/552802 (owner: 10Giuseppe Lavagetto)
[13:11:24] <logmsgbot>	 !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[13:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:00] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: blubberoid: all annotations are supposed to be strings. [deployment-charts] - 10https://gerrit.wikimedia.org/r/552803
[13:14:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: all annotations are supposed to be strings. [deployment-charts] - 10https://gerrit.wikimedia.org/r/552803 (owner: 10Giuseppe Lavagetto)
[13:15:34] <logmsgbot>	 !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[13:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:55] <XioNoX>	 !log cleanup config on cr3-esams - T237031
[13:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:00] <stashbot>	 T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031
[13:20:34] <wikibugs>	 (03PS7) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924)
[13:20:36] <wikibugs>	 (03PS2) 10Jbond: cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924)
[13:23:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::mariadb::misc::eventlogging::database: set db to read only [puppet] - 10https://gerrit.wikimedia.org/r/552776 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey)
[13:24:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) mw1239 is well out of warranty and is over 5 years old.  Historically we decom these host at this stage in their life.  We also have a several new MW servers waiting to be racke...
[13:25:57] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Cmjohnson) 05Open→03Resolved Resolving this task for the failed raid, @gehel you may want to create a new one for the re-image.
[13:26:56] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10MoritzMuehlenhoff) @Cmjohnson mw1239 will be decommed soon via https://phabricator.wikimedia.org/T239054, we can close this task.
[13:27:23] <elukey>	 !log set global read_only=1 on db1108's log database - T159170
[13:27:25] <wikibugs>	 10Operations, 10SRE-tools, 10netbox: Netbox reports Icinga checks timeout - https://phabricator.wikimedia.org/T237803 (10faidon) What's the status of this task?
[13:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:27] <stashbot>	 T159170: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170
[13:28:29] <wikibugs>	 (03CR) 10Vgutierrez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris)
[13:31:40] <wikibugs>	 (03CR) 10Andrew Bogott: profile::url_downloader: Add missing labs neutron subnet, also link-local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552631 (owner: 10Alex Monk)
[13:34:43] <wikibugs>	 10Operations, 10Traffic, 10netops, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Volans) No problem for me for 1 cert, it seems a reasonable approach.
[13:36:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond)
[13:37:24] <wikibugs>	 (03PS5) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[13:37:29] <wikibugs>	 (03PS3) 10Jbond: cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924)
[13:38:36] <wikibugs>	 (03PS6) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[13:39:41] <wikibugs>	 (03CR) 10Elukey: "Daniel/John: reduced the scope of the change and added in the commit description https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/54" [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[13:40:00] <elukey>	 jbond42: --^ let me know if you think it is ok or not :)
[13:41:05] <jbond42>	 looking
[13:41:27] <wikibugs>	 10Operations, 10observability: Tune HTTP availability alerts - https://phabricator.wikimedia.org/T236367 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thresholds adjusted for global availability and I've updated "frontend traffic" dashboard
[13:41:44] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: envoy-tls-local-proxy: fix configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552807
[13:47:21] <wikibugs>	 10Operations, 10observability, 10serviceops, 10Patch-For-Review: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Fixed!
[13:47:53] <wikibugs>	 10Operations, 10observability: Logstash doesn't parse ulogd source and destination ports - https://phabricator.wikimedia.org/T238416 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like this is all done, resolving
[13:49:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10faidon)
[13:51:31] <wikibugs>	 (03CR) 10Jbond: "Thanks luca for the update.  I think this is all fine as the service runs as the airflow user (we could lock down the systemd script furth" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[13:52:15] <wikibugs>	 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Volans) If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)
[13:55:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] envoy-tls-local-proxy: fix configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552807 (owner: 10Giuseppe Lavagetto)
[13:59:04] <icinga-wm>	 RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-11-25 12:44:21 from db1095.eqiad.wmnet:3313 (860 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[13:59:11] <wikibugs>	 (03PS8) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924)
[13:59:57] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: use max5m for node_ipvs gauges [puppet] - 10https://gerrit.wikimedia.org/r/552810 (https://phabricator.wikimedia.org/T236700)
[14:04:07] <logmsgbot>	 !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[14:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:37] <wikibugs>	 (03PS7) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[14:06:55] <wikibugs>	 (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[14:09:12] <wikibugs>	 (03CR) 10BBlack: [C: 04-1] "For the traffic caches, we've standardized on X-Client-IP (XCIP) as a way for the traffic layer to single the original client IP address t" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris)
[14:09:43] <wikibugs>	 (03CR) 10Jbond: "lgtm assuming authorised in monday meeting" [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[14:11:43] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] prometheus: use max5m for node_ipvs gauges [puppet] - 10https://gerrit.wikimedia.org/r/552810 (https://phabricator.wikimedia.org/T236700) (owner: 10Filippo Giunchedi)
[14:12:41] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond)
[14:13:21] <wikibugs>	 (03PS1) 10Cmjohnson: Adding dns entries new ganeti hosts [dns] - 10https://gerrit.wikimedia.org/r/552812 (https://phabricator.wikimedia.org/T228924)
[14:15:34] <wikibugs>	 (03PS6) 10BBlack: authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006)
[14:15:36] <wikibugs>	 (03PS3) 10BBlack: Unify and simplify DNS server ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006)
[14:15:38] <wikibugs>	 (03PS1) 10BBlack: Move DNS server profiles under profile::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552814 (https://phabricator.wikimedia.org/T98006)
[14:15:40] <wikibugs>	 (03PS1) 10BBlack: Move DNS roles together under role::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552815 (https://phabricator.wikimedia.org/T98006)
[14:16:38] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 86, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:16:50] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:18:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 51, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:20:57] <godog>	 that's expected I think, cc XioNoX 
[14:21:24] <elukey>	 godog: curious, is there maintenance or else?
[14:21:48] <paravoid>	 there is maintenance @ esams (and soon knams) indeed, mark is on-site
[14:22:00] <elukey>	 ahhh
[14:22:02] <XioNoX>	 ah yes, the downtime expired I'll extend it
[14:22:03] <paravoid>	 our maintenance, not by our vendor(s)
[14:22:14] <elukey>	 super
[14:22:16] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:22:21] <wikibugs>	 (03CR) 10Ema: [C: 03+1] cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond)
[14:23:22] <moritzm>	 !log upgrading OpenJDK 11 on an-conf*
[14:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:22] <wikibugs>	 (03PS1) 10Effie Mouzeli: admin: add jiji to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/552818
[14:29:06] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:29:07] <wikibugs>	 10Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) Ah, thanks. But who exactly is supposed to answer that question?
[14:30:24] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:30:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:31:23] <wikibugs>	 (03PS9) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924)
[14:31:24] <wikibugs>	 (03PS1) 10Jbond: puppetboard: update puppet board public cert to add cas sni [puppet] - 10https://gerrit.wikimedia.org/r/552819 (https://phabricator.wikimedia.org/T238924)
[14:31:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "LGTM, let's wait for Nuria's formal +1 to follow rules :)" [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli)
[14:35:22] <wikibugs>	 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10ayounsi)
[14:37:26] <marostegui>	 !log Deploy schema change on s1 codfw (this will generate lag on codfw) - T234066 T233135
[14:37:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:32] <stashbot>	 T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[14:37:32] <stashbot>	 T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[14:37:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[14:37:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond)
[14:38:10] <marostegui>	 !log Remove triggers from archive table on s1 codfw sanitarium T234704
[14:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:14] <stashbot>	 T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[14:39:10] <wikibugs>	 10Operations, 10Traffic, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10fgiunchedi) Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert
[14:39:45] <wikibugs>	 (03PS1) 10Ema: Revert "cache: reimage cp3064 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552825 (https://phabricator.wikimedia.org/T238494)
[14:39:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[14:42:43] <wikibugs>	 (03PS2) 10Ema: Revert "cache: reimage cp3064 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552825 (https://phabricator.wikimedia.org/T238494)
[14:44:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper
[14:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:45] <elukey>	 zookeeper analytics cluster --^
[14:45:58] <ema>	 !log depool cp3064 and reimage with varnish-be T227432
[14:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:03] <stashbot>	 T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[14:47:18] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Revert "cache: reimage cp3064 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552825 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema)
[14:48:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard: update puppet board public cert to add cas sni [puppet] - 10https://gerrit.wikimedia.org/r/552819 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond)
[14:48:41] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The...
[14:49:25] <wikibugs>	 (03PS2) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[14:50:12] <XioNoX>	 !log enable cr3-esams:et-1/0/0 - T236767
[14:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:17] <stashbot>	 T236767: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767
[14:50:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
[14:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:27] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack)
[14:51:29] <elukey>	 this spicerack thing seems to be working
[14:51:32] <elukey>	 :P
[14:51:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond)
[14:52:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[14:54:37] <volans>	 elukey: lol :D
[14:55:01] <wikibugs>	 (03PS3) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[14:55:05] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema)
[14:56:09] <wikibugs>	 (03PS1) 10DCausse: [wdqs] add async-import option [puppet] - 10https://gerrit.wikimedia.org/r/552835 (https://phabricator.wikimedia.org/T238045)
[14:56:11] <wikibugs>	 (03PS1) 10DCausse: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045)
[14:58:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[14:58:53] <wikibugs>	 (03PS1) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854)
[14:59:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cas-puppetboard.wikimedia.org: add record [dns] - 10https://gerrit.wikimedia.org/r/552503 (owner: 10Jbond)
[14:59:34] <wikibugs>	 (03PS2) 10Jbond: cas-puppetboard.wikimedia.org: add record [dns] - 10https://gerrit.wikimedia.org/r/552503
[15:00:42] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1083:9536,cp1085:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:00:46] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/weight=100; selector: name=cp3056.esams.wmnet,service=ats-be
[15:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[15:02:01] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3056.esams.wmnet
[15:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:18] <wikibugs>	 (03CR) 10DCausse: "ge" [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045) (owner: 10DCausse)
[15:02:29] <wikibugs>	 (03PS2) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854)
[15:03:03] <ema>	 !log cp1075: ats-backend-restart to enable lua reload T233274
[15:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:08] <stashbot>	 T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274
[15:03:18] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp3064_v4,cp3064_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:04:34] <icinga-wm>	 ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp3064_v4,cp3064_v6} Ema reimaging 3064 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:04:34] <icinga-wm>	 ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1083:9536,cp1085:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} Ema reimaging 3064 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:05:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[15:05:26] <wikibugs>	 (03PS1) 10Mholloway: Update wikifeeds to 2019-11-25-144622-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552838 (https://phabricator.wikimedia.org/T238942)
[15:05:41] <wikibugs>	 (03PS4) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[15:07:11] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2019-11-25-144622-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552838 (https://phabricator.wikimedia.org/T238942) (owner: 10Mholloway)
[15:07:29] <wikibugs>	 (03Merged) 10jenkins-bot: Update wikifeeds to 2019-11-25-144622-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552838 (https://phabricator.wikimedia.org/T238942) (owner: 10Mholloway)
[15:07:41] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Unify and simplify DNS server ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack)
[15:07:49] <wikibugs>	 (03PS4) 10BBlack: Unify and simplify DNS server ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006)
[15:08:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[15:09:33] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[15:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:13] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[15:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:56] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:11:09] <ema>	 !log cp1075: ats-tls-restart to enable lua reload T233274
[15:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:14] <stashbot>	 T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274
[15:11:21] <wikibugs>	 (03PS5) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[15:11:39] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:11:40] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[15:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:29] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[15:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:22] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:14:22] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.hosts.downtime
[15:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:28] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:18:05] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] `  Of which those **FAILED**: ` ['cp3064.esams.wmnet...
[15:18:35] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:33] <ema>	 !log cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274
[15:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:39] <stashbot>	 T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274
[15:21:52] <wikibugs>	 (03PS2) 10DCausse: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045)
[15:22:08] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Promote db1086 to s7 primary master [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044)
[15:22:17] <wikibugs>	 (03PS3) 10Marostegui: wmnet: Update s7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044)
[15:22:58] <ema>	 !log cp3064 manual reboot after wmf-auto-reimage error: 'Unable to run wmf-auto-reimage-host: Failed to reboot_host' T238494
[15:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:03] <stashbot>	 T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[15:27:38] <wikibugs>	 (03PS6) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[15:28:26] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:29:06] <icinga-wm>	 PROBLEM - traffic_server backend process restarted on cp5010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5010&var-layer=backend
[15:30:40] <wikibugs>	 (03PS1) 10Mobrovac: Parsoid: Switch private wiki consumers (Flow, VE) to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552845 (https://phabricator.wikimedia.org/T229015)
[15:33:08] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 48.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:34:58] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:36:09] <ema>	 !log cp3064 create filesystem on /dev/nvme0n1p1 (see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552547/) and reboot T238494 
[15:36:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:14] <stashbot>	 T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[15:36:22] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:37:04] <wikibugs>	 (03PS2) 10BBlack: Move DNS server profiles under profile::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552814 (https://phabricator.wikimedia.org/T98006)
[15:37:06] <wikibugs>	 (03PS2) 10BBlack: Move DNS roles together under role::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552815 (https://phabricator.wikimedia.org/T98006)
[15:39:14] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:39:28] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:40:05] <wikibugs>	 (03PS7) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[15:41:20] <wikibugs>	 (03PS1) 10Jbond: idp:  add puppetboard service [puppet] - 10https://gerrit.wikimedia.org/r/552850 (https://phabricator.wikimedia.org/T238924)
[15:44:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552850 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond)
[15:44:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp:  add puppetboard service [puppet] - 10https://gerrit.wikimedia.org/r/552850 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond)
[15:45:12] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 79.18 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:46:04] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/19590/" [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[15:46:24] <wikibugs>	 (03PS1) 10Ayounsi: Remove old cr2-knams <-> cr2/3-esams links [dns] - 10https://gerrit.wikimedia.org/r/552851 (https://phabricator.wikimedia.org/T237031)
[15:46:53] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) Among the things I've checked to rule out obvious mistakes porting VCL to Lua:  - Cookie responses without "session" or "toke...
[15:47:58] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:49:00] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:49:20] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] "looks good in compiler, bunch of rename-y things but no functional bits: https://puppet-compiler.wmflabs.org/compiler1003/19589/" [puppet] - 10https://gerrit.wikimedia.org/r/552815 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack)
[15:49:25] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Move DNS server profiles under profile::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552814 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack)
[15:49:47] <ema>	 !log pool cp3064 with varnish-be T227432
[15:49:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:51] <stashbot>	 T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[15:54:30] <wikibugs>	 (03PS1) 10Ema: ATS: log Cache-Control as received from the origin [puppet] - 10https://gerrit.wikimedia.org/r/552853 (https://phabricator.wikimedia.org/T238494)
[15:56:49] <wikibugs>	 (03PS4) 10CRusnov: netbox: Expose automated DNS repository for web access [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183)
[15:57:55] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." (7 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov)
[15:58:28] <wikibugs>	 (03CR) 10Jbond: "looks good but see inline" (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[15:59:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netbox: Expose automated DNS repository for web access [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov)
[16:05:36] <wikibugs>	 10Operations, 10SRE-tools, 10netbox: Netbox reports Icinga checks timeout - https://phabricator.wikimedia.org/T237803 (10crusnov) Open→Resolved I executed the plan that Riccardo outlined, removed the running ability in the check and switched to running from the management script, which has simplified...
[16:06:06] <wikibugs>	 (03CR) 10Muehlenhoff: Setup rsync config for U2F device storage (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[16:06:24] <wikibugs>	 (03PS8) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[16:07:13] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10WDoranWMF) Thanks @jijiki!
[16:08:26] <icinga-wm>	 RECOVERY - traffic_server backend process restarted on cp5010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5010&var-layer=backend
[16:08:33] <wikibugs>	 10Operations, 10observability: Monitor mailman outbound mail queue - https://phabricator.wikimedia.org/T236505 (10colewhite)
[16:09:43] <wikibugs>	 (03PS1) 10Volans: netbox: remove limit from API query [puppet] - 10https://gerrit.wikimedia.org/r/552854
[16:11:05] <wikibugs>	 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon)
[16:11:39] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "LGTM as discussed" [puppet] - 10https://gerrit.wikimedia.org/r/552854 (owner: 10Volans)
[16:13:38] <wikibugs>	 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The Anchor is now installed, connected to the SCS, and we see a getty on serial with the right hostname. It's also now responsive to IPv4 pings but not IPv6 (which matches our previous experie...
[16:14:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:15:48] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:17:26] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[16:22:46] <wikibugs>	 (03PS9) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821
[16:22:48] <wikibugs>	 (03CR) 10Muehlenhoff: Setup rsync config for U2F device storage (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[16:24:28] <wikibugs>	 (03CR) 10Nuria: "Looks good, please take some time to read https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines" [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli)
[16:32:16] <wikibugs>	 10Operations: ganeti netbox sync alerts are noisy - https://phabricator.wikimedia.org/T233624 (10crusnov) Open→Resolved This should be resolved.
[16:33:49] <wikibugs>	 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10crusnov) Open→Resolved This should be resolved, I've spot checked hosts in the af project and they have been running puppet normally.
[16:34:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857
[16:36:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff)
[16:37:36] <wikibugs>	 10Operations, 10ops-esams: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767 (10faidon) @mark swapped the optic with a new one and the link is now reenabled. This is being monitored for another 24-36h and will be resolved then.
[16:37:37] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove old cr2-knams <-> cr2/3-esams links [dns] - 10https://gerrit.wikimedia.org/r/552851 (https://phabricator.wikimedia.org/T237031) (owner: 10Ayounsi)
[16:38:36] <wikibugs>	 (03PS1) 10Jbond: puppetboard: add proxied_as parameter [puppet] - 10https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924)
[16:39:24] <wikibugs>	 10Operations, 10ops-esams, 10Patch-For-Review: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 (10ayounsi) 05Open→03Resolved a:03ayounsi All done. Not re-enabling knams transits as we're setting up the new MX204 right now.
[16:39:54] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10crusnov) a:05crusnov→03None
[16:40:18] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10crusnov) Passing to next clinic duty person.
[16:40:38] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[16:41:08] <wikibugs>	 (03PS2) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857
[16:44:12] <wikibugs>	 (03PS2) 10Jbond: puppetboard: add proxied_as parameter [puppet] - 10https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924)
[16:44:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff)
[16:47:40] <wikibugs>	 (03PS3) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854)
[16:47:52] <wikibugs>	 (03CR) 10Volans: [C: 03+2] netbox: remove limit from API query [puppet] - 10https://gerrit.wikimedia.org/r/552854 (owner: 10Volans)
[16:47:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff)
[16:48:26] <jynus>	 !log upgrading and restarting dbprov* hosts
[16:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:10] <wikibugs>	 (03PS1) 10Jcrespo: check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860
[16:50:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[16:51:08] <wikibugs>	 (03PS4) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854)
[16:52:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860 (owner: 10Jcrespo)
[16:52:15] <wikibugs>	 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) I've a proposal for doing this:    - Add some special tag like `#NRPE` or `#page` to the names of any [[ https://librenms.wikimedia.org/alert-rules | L...
[16:54:23] <wikibugs>	 (03PS3) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857
[16:56:17] <wikibugs>	 (03CR) 10Ema: [C: 03+1] ATS: Prevent logrotate from creating empty log files [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez)
[16:56:50] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: Prevent logrotate from creating empty log files [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez)
[16:57:30] <wikibugs>	 (03PS5) 10CRusnov: netbox: Expose automated DNS repository for web access [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183)
[16:57:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff)
[16:58:27] <wikibugs>	 (03PS4) 10Jbond: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff)
[17:00:04] <jouncebot>	 gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1700).
[17:00:49] <onimisionipe>	 ack
[17:01:10] <wikibugs>	 10Operations, 10ops-eqsin: duplicate cable IDs in eqsin - https://phabricator.wikimedia.org/T239125 (10RobH) p:05Triage→03Normal
[17:01:22] <wikibugs>	 10Operations, 10ops-eqsin: duplicate cable IDs in eqsin - https://phabricator.wikimedia.org/T239125 (10RobH)
[17:02:56] <wikibugs>	 (03PS3) 10Jbond: puppetboard: add proxied_as parameter [puppet] - 10https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924)
[17:05:00] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10mforns) p:05Triage→03High
[17:05:20] <logmsgbot>	 !log arlolra@deploy1001 Started deploy [parsoid/deploy@e7faa19]: Updating Parsoid to a6bfdfa
[17:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:24] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10WDoranWMF) @jijiki Do you know when the rollout will be complete to all prod?
[17:12:59] <logmsgbot>	 !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@4c5f503]: New Blazegraph Build and WDQS Updates
[17:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:18] <logmsgbot>	 !log arlolra@deploy1001 Finished deploy [parsoid/deploy@e7faa19]: Updating Parsoid to a6bfdfa (duration: 08m 58s)
[17:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:55] <wikibugs>	 (03PS1) 10BBlack: Fix common/monitoring dnsbox cluster defs [puppet] - 10https://gerrit.wikimedia.org/r/552861 (https://phabricator.wikimedia.org/T98006)
[17:15:24] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10jijiki) @WDoranWMF Today after the SRE meeting, I will roll out to production. We had some minor issues with our api servers this morn...
[17:16:59] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Fix common/monitoring dnsbox cluster defs [puppet] - 10https://gerrit.wikimedia.org/r/552861 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack)
[17:17:42] <wikibugs>	 (03PS2) 10Eevans: kask-echoseen: Do not report dupes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: 10Mobrovac)
[17:19:42] <XioNoX>	 !log power down cr2-knams - T237030
[17:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:47] <stashbot>	 T237030: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030
[17:20:02] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Papaul) a:05Papaul→03Marostegui  complete Before  BIOS Version 2.2.11 iDRAC Firmware Version 3.34.34.34  After BIOS Version 2.4.7 iDRAC Firmware Version 3.36.36.36
[17:21:13] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[17:21:27] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) Thank you Papaul. I will start MySQL and do a data consistency check.
[17:23:42] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM (for what it's worth)" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond)
[17:24:52] <wikibugs>	 (03PS3) 10Ema: ATS: explicitly skip the cache instead of hiding CC [puppet] - 10https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494)
[17:24:54] <wikibugs>	 (03PS1) 10Ema: ATS: do not coalesce uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/552862 (https://phabricator.wikimedia.org/T238494)
[17:25:21] <logmsgbot>	 !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@4c5f503]: New Blazegraph Build and WDQS Updates (duration: 12m 23s)
[17:25:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:06] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: do not coalesce uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/552862 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema)
[17:29:49] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10elukey) In hiera we have 4 codfw mw hosts acting as proxy for mcrouter:  `     codfw:       A:         host: 10.192.0.61 # mw2235, A3         port: 11214         ssl: true       B:         host: 10.192.16.5...
[17:31:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:57] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:09] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:11] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:14] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10elukey)
[17:32:19] <gehel>	 onimisionipe: ^^known issue?
[17:32:21] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:29] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:59] <onimisionipe>	 hmmm
[17:33:20] <onimisionipe>	 I'm checking
[17:33:39] <gehel>	 onimisionipe: looks like an error in the updater
[17:34:10] <gehel>	 onimisionipe: probably needs a rollback
[17:34:19] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:22] <bblack>	 ● wdqs-updater.service loaded failed failed Query Service Updater 
[17:35:08] <bblack>	 yeah a java exception on startup
[17:35:11] <gehel>	 null pointer exception when syncing dates
[17:35:57] <gehel>	 onimisionipe: I'll open a phab task, scream if you need help!
[17:36:21] <marostegui>	 !log Upgrade kernel on db2125 T239042
[17:36:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:29] <stashbot>	 T239042: db2125 crashed  - https://phabricator.wikimedia.org/T239042
[17:36:36] <onimisionipe>	 rolling back!
[17:36:50] <wikibugs>	 (03PS1) 10Mholloway: Update wikifeeds to 2019-11-25-173023-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552864 (https://phabricator.wikimedia.org/T235652)
[17:37:49] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:23] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:51] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:40:19] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[17:40:35] <wikibugs>	 (03PS5) 10Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708)
[17:41:45] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[17:42:09] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:42:10] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[17:42:15] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:08] <wikibugs>	 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark)
[17:43:13] <wikibugs>	 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10BBlack) It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon, in...
[17:43:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:32] <wikibugs>	 (03PS2) 10Cmjohnson: Adding dns entries new ganeti hosts [dns] - 10https://gerrit.wikimedia.org/r/552812 (https://phabricator.wikimedia.org/T228924)
[17:43:41] <wikibugs>	 (03PS6) 10Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708)
[17:44:00] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10RobH) p:05Triage→03Normal
[17:44:08] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10RobH)
[17:44:46] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) Kernel upgraded and host rebooted: `  root@db2125:~# uname -a Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux `
[17:45:00] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10RobH) @jgreen: Can you provide the hostname and network info for these?  I think we want to have the network ports, one each, plugged into the fasw (so the single server...
[17:45:12] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[17:45:14] <logmsgbot>	 !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@4c5f503]: Revert New Blazegraph Build and WDQS Updates
[17:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott)
[17:46:34] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding dns entries new ganeti hosts [dns] - 10https://gerrit.wikimedia.org/r/552812 (https://phabricator.wikimedia.org/T228924) (owner: 10Cmjohnson)
[17:46:55] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:55] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:19] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) production dns was added here https://gerrit.wikimedia.org/r/#/c/operations/dns/+/552812/
[17:47:22] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+2] Parsoid: Switch private wiki consumers (Flow, VE) to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552845 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac)
[17:47:31] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson)
[17:48:01] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:09] <wikibugs>	 (03Merged) 10jenkins-bot: Parsoid: Switch private wiki consumers (Flow, VE) to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552845 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac)
[17:48:24] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: explicitly skip the cache instead of hiding CC [puppet] - 10https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema)
[17:48:33] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:49:06] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson)
[17:50:12] <logmsgbot>	 !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Parsoid: Switch private wiki clients (Flow, VE) to Parsoid/PHP -- T229015 (duration: 00m 53s)
[17:50:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:17] <stashbot>	 T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015
[17:51:06] <wikibugs>	 10Operations, 10Traffic, 10fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (10Jdforrester-WMF)
[17:51:26] <wikibugs>	 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) 05Open→03Stalled
[17:51:55] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:29] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:03] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:37] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:54:31] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:01] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:38] <logmsgbot>	 !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@4c5f503]: Revert New Blazegraph Build and WDQS Updates (duration: 10m 24s)
[17:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:49] <gehel>	 !log restart wdqs-updater on all wdqs servers
[17:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:55] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:29] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:31] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:50] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen)
[17:56:59] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:05] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:05] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:05] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:25] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen)
[17:57:28] <librenms-wmf>	 04Critical Alert for device cr2-knams.wikimedia.org - Juniper alarm active
[17:57:39] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:19] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1800).
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:00:11] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen)
[18:02:39] <wikibugs>	 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[18:02:57] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[18:03:00] <wikibugs>	 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[18:03:35] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen) >>! In T239139#5690558, @RobH wrote: > @jgreen: Can you provide the hostname and network info for these?  I think we want to have the network ports, one each, plu...
[18:05:26] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[18:05:37] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need By: IMMEDIATE) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH)
[18:05:55] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need By: IMMEDIATE) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) p:05Normal→03High
[18:06:49] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH)
[18:07:19] <wikibugs>	 (03CR) 10Rush: "testing adding a group as a reviewer" [puppet] - 10https://gerrit.wikimedia.org/r/456690 (owner: 10Rush)
[18:07:22] <effie>	 !log Upgrade php-wikidiff2 to 1.10.0 on all servers - T236963
[18:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:27] <stashbot>	 T236963: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963
[18:10:16] <wikibugs>	 (03PS1) 10Andrew Bogott: wmf_sink: forward newton changes to ocata [puppet] - 10https://gerrit.wikimedia.org/r/552867 (https://phabricator.wikimedia.org/T238708)
[18:11:59] <icinga-wm>	 PROBLEM - Disk space on cp4028 is CRITICAL: DISK CRITICAL - free space: / 107 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops
[18:12:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: forward newton changes to ocata [puppet] - 10https://gerrit.wikimedia.org/r/552867 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott)
[18:13:24] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH)
[18:13:44] <subbu>	 is it the swat window now?
[18:13:54] <wikibugs>	 (03PS3) 10Elukey: admin: add analytics-privatedata system user [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306)
[18:14:03] <Lucas_WMDE>	 jouncebot: now
[18:14:03] <jouncebot>	 For the next 0 hour(s) and 45 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1800)
[18:14:11] <wikibugs>	 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark)
[18:14:21] <Lucas_WMDE>	 looks like it :)
[18:14:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Approved by today's SRE meeting. Just rebased." [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) (owner: 10Elukey)
[18:15:29] <subbu>	 i would like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/552620 backported once that merges.
[18:16:08] <effie>	 !log Restart php-fpm on mw* and wtp* servers in eqiad and codfw - T236963
[18:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:14] <stashbot>	 T236963: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963
[18:17:48] <bblack>	 !log cp4028: disk space exhausted, rm /var/log/daemon.log + restart rsyslog
[18:17:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:27] <icinga-wm>	 PROBLEM - Disk space on cp4031 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops
[18:18:49] <icinga-wm>	 RECOVERY - Disk space on cp4028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops
[18:19:29] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "Erik I got the +1 to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552613/, can you please change this accordingly?" [puppet] - 10https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson)
[18:20:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[18:20:45] <wikibugs>	 (03PS8) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[18:20:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Approved by today's SRE meeting." [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[18:24:25] <icinga-wm>	 PROBLEM - Disk space on cp4032 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops
[18:24:35] <icinga-wm>	 PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 12 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[18:25:22] <cdanis>	 jouncebot: refresh
[18:25:22] <jouncebot>	 I refreshed my knowledge about deployments.
[18:25:24] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10RobH) So, I advise against any guesswork on this.  If you want to know if these 5 servers will hold a GPU, each purchase group needs to be...
[18:25:25] <cdanis>	 jouncebot: next
[18:25:25] <jouncebot>	 In 0 hour(s) and 34 minute(s): Grafana upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1900)
[18:26:07] <icinga-wm>	 PROBLEM - Disk space on cp4029 is CRITICAL: DISK CRITICAL - free space: / 112 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops
[18:26:26] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10RobH) Please note we likely need to schedule downtime for each of those 4 hosts to shutdown and check them.  @elukey: Can you advise how m...
[18:26:34] <bblack>	 !log cp[245]*: disk space exhausted, rm /var/log/daemon.log + restart rsyslog
[18:26:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:57] <icinga-wm>	 RECOVERY - Disk space on cp4031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops
[18:27:49] <icinga-wm>	 RECOVERY - Disk space on cp4032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops
[18:27:59] <icinga-wm>	 RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[18:29:13] <wikibugs>	 (03PS1) 10RobH: frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139)
[18:29:33] <icinga-wm>	 RECOVERY - Disk space on cp4029 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops
[18:29:35] <robh>	 doh
[18:29:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139) (owner: 10RobH)
[18:29:37] <robh>	 immediate typo
[18:29:45] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "ATS: enable reload for global Lua script" [puppet] - 10https://gerrit.wikimedia.org/r/552869
[18:29:52] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10elukey) a:05Cmjohnson→03RobH @RobH if we could go one at a time I think that a day before the maintenance is sufficient, I'll take c...
[18:30:02] <wikibugs>	 (03PS2) 10RobH: frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139)
[18:30:57] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10RobH) a:05RobH→03Cmjohnson Please note that Chris will still be performing this, it needs to stay assigned to him.  He will be coordin...
[18:31:13] <icinga-wm>	 PROBLEM - Check systemd state on cp4031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:20] <wikibugs>	 (03CR) 10RobH: [C: 03+2] frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139) (owner: 10RobH)
[18:31:51] <subbu>	 mdholloway, is it normal for machinevision tests to take 20+ mins? ( https://integration.wikimedia.org/zuul/ )
[18:31:58] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): fatalmonitor: exec watch [puppet] - 10https://gerrit.wikimedia.org/r/499761 (owner: 10Lucas Werkmeister (WMDE))
[18:32:54] <wikibugs>	 (03PS5) 10Elukey: airflow: Add upstream configuration [puppet] - 10https://gerrit.wikimedia.org/r/544996 (owner: 10EBernhardson)
[18:34:10] <mdholloway>	 subbu: unfortunately yes. we have to wait for a lot of unrelated Wikibase tests to run due to our dependency on WikibaseMediaInfo.
[18:34:12] <wikibugs>	 (03PS10) 10Elukey: airflow: Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson)
[18:34:21] <subbu>	 mdholloway, ok.
[18:34:37] <wikibugs>	 (03PS1) 10Ammarpad: Enable Translate extension on sewikiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091)
[18:35:43] <wikibugs>	 (03PS2) 10Ammarpad: Enable Translate extension on sewikiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091)
[18:36:02] <subbu>	 alrighty .. looks like the templatedata patch merged.
[18:36:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] airflow: Add upstream configuration [puppet] - 10https://gerrit.wikimedia.org/r/544996 (owner: 10EBernhardson)
[18:37:27] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Juniper alarm active
[18:38:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] airflow: Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson)
[18:38:13] <subbu>	 is there some swatter available to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/552620
[18:38:19] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, recentchanges, revesions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Matanel11111)
[18:39:15] <icinga-wm>	 PROBLEM - Disk space on cp4028 is CRITICAL: DISK CRITICAL - free space: / 242 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops
[18:40:17] <subbu>	 MaxSem, Niharika RoanKattouw ?
[18:41:25] <icinga-wm>	 RECOVERY - Check systemd state on cp4031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:10] <Lucas_WMDE>	 I can do it
[18:43:23] <Lucas_WMDE>	 (I’m a swatter, just not usually for this slot :) )
[18:43:44] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Revert "ATS: enable reload for global Lua script" [puppet] - 10https://gerrit.wikimedia.org/r/552869 (owner: 10Vgutierrez)
[18:44:55] <wikibugs>	 (03PS1) 10Elukey: airflow: move hiera config under role and add missing params [puppet] - 10https://gerrit.wikimedia.org/r/552872 (https://phabricator.wikimedia.org/T236180)
[18:45:18] <subbu>	 Lucas_WMDE, ok ty ... I see Reedy cherry-picked it already @ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/552871
[18:45:24] * Urbanecm is around, if needed
[18:45:26] <Lucas_WMDE>	 yup, waiting on CI at the moment
[18:45:34] <Lucas_WMDE>	 Urbanecm: you do enough SWATs as it is, let me have this one :P
[18:45:47] <Urbanecm>	 Lucas_WMDE: feel free to :D
[18:47:27] <icinga-wm>	 PROBLEM - Disk space on cp4031 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops
[18:47:49] <subbu>	 the deploy should be a no-op since there isn't any deployed parsoid code that uses that new hook yet. but, that will change tomorrow. :)
[18:48:06] <wikibugs>	 (03PS2) 10Elukey: airflow: move hiera config under role and add missing params [puppet] - 10https://gerrit.wikimedia.org/r/552872 (https://phabricator.wikimedia.org/T236180)
[18:48:17] <icinga-wm>	 PROBLEM - Disk space on cp4032 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops
[18:48:31] <icinga-wm>	 PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[18:48:47] <mutante>	 !log mw1298 - pooling
[18:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:41] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, recentchanges, revesions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Masumrezarock100) a:05Matanel11111→03None
[18:50:13] <icinga-wm>	 RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[18:50:15] <bblack>	 !log cp[245]*: wipe daemon.log and restart syslog, again
[18:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:39] <icinga-wm>	 PROBLEM - Disk space on cp5010 is CRITICAL: DISK CRITICAL - free space: / 24 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5010&var-datasource=eqsin+prometheus/ops
[18:50:40] <Lucas_WMDE>	 it’s merged!
[18:50:51] <icinga-wm>	 RECOVERY - Disk space on cp4031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops
[18:51:18] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Masumrezarock100)
[18:51:20] <wikibugs>	 (03PS1) 10CRusnov: coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469)
[18:51:22] <ema>	 !log cumin -b1 'A:cp-ats and A:eqiad' 'run-puppet-agent; ats-backend-restart & ats-tls-restart'
[18:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:43] <icinga-wm>	 RECOVERY - Disk space on cp4032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops
[18:52:09] <wikibugs>	 (03PS2) 10CRusnov: coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469)
[18:52:09] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Urbanecm) isn't that what toolforge allows you to have? See https://wikitech.wikimedia.org/wiki/Help:Toolforge and http...
[18:52:23] <icinga-wm>	 RECOVERY - Disk space on cp5010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5010&var-datasource=eqsin+prometheus/ops
[18:52:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] airflow: move hiera config under role and add missing params [puppet] - 10https://gerrit.wikimedia.org/r/552872 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey)
[18:52:28] <ema>	 !log cumin -b1 'A:cp-ats and A:codfw' 'run-puppet-agent; ats-backend-restart & ats-tls-restart'
[18:52:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:55] <icinga-wm>	 PROBLEM - Disk space on cp4028 is CRITICAL: DISK CRITICAL - free space: / 236 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops
[18:53:21] <ema>	 !log cumin -b1 'A:cp-ats and A:ulsfo' 'run-puppet-agent; ats-backend-restart & ats-tls-restart'
[18:53:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:35] <Lucas_WMDE>	 subbu: the change is on mwdebug1001, is there any way to test it at all?
[18:53:41] <subbu>	 nope. it is a no-op.
[18:53:44] <Lucas_WMDE>	 ok
[18:53:47] <Lucas_WMDE>	 then I’ll just sync
[18:53:49] <Lucas_WMDE>	 thanks!
[18:53:53] <subbu>	 yup. ty.
[18:53:54] <ema>	 !log cumin -b1 'A:cp-ats and A:eqsin' 'run-puppet-agent; ats-backend-restart & ats-tls-restart'
[18:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:20] <ema>	 !log cumin -b1 'A:cp-ats and A:esams' 'run-puppet-agent; ats-backend-restart & ats-tls-restart'
[18:54:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:24] <bblack>	 !log cp[245]*: wipe daemon.log and syslog and restart syslog, again
[18:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:29] <wikibugs>	 (03PS5) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854)
[18:55:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/TemplateData/: SWAT: [[gerrit:552871|Implement ParsoidFetchTemplateData hook for Parsoid/PHP (T238954)]] (duration: 00m 53s)
[18:55:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:45] <stashbot>	 T238954: html2wt: Missing implementation of 'ParsoidFetchTemplateData' to fetch templatedata - https://phabricator.wikimedia.org/T238954
[18:56:51] <Lucas_WMDE>	 !log Morning SWAT done
[18:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:35] <wikibugs>	 (03PS6) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854)
[18:59:44] <wikibugs>	 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani)
[19:00:04] <jouncebot>	 cdanis: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Grafana upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1900).
[19:01:07] <icinga-wm>	 PROBLEM - Disk space on cp4031 is CRITICAL: DISK CRITICAL - free space: / 186 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops
[19:01:17] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Matanel11111) I have a tool, but how can i login to the tool with SSH?
[19:01:39] <wikibugs>	 (03PS13) 10EBernhardson: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180)
[19:01:59] <icinga-wm>	 PROBLEM - Disk space on cp4032 is CRITICAL: DISK CRITICAL - free space: / 114 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops
[19:02:11] <icinga-wm>	 PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 133 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[19:02:37] <wikibugs>	 (03PS1) 10RLazarus: poolcounter: Install and run poolcounter-prometheus-exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407)
[19:03:06] <wikibugs>	 (03PS1) 10CDanis: grafana1002: is just grafana.wm.o now [puppet] - 10https://gerrit.wikimedia.org/r/552876 (https://phabricator.wikimedia.org/T220838)
[19:03:12] <wikibugs>	 (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1002/19599/" [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[19:04:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] poolcounter: Install and run poolcounter-prometheus-exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[19:04:30] <wikibugs>	 10Operations: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson)
[19:04:32] <wikibugs>	 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Paladox) a:03Dzahn
[19:05:17] <rlazarus>	 thank you jerkins-bot
[19:05:22] <wikibugs>	 10Operations: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) @fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. if you have an issue please assign to  me and add the ops-eqiad tag back
[19:05:51] <wikibugs>	 (03PS2) 10RLazarus: poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407)
[19:06:03] <cdanis>	 !log making grafana.wikimedia.org read-only (on grafana1001) ✔️ cdanis@grafana1001.eqiad.wmnet ~ 🕑☕ sudo chmod -w /var/lib/grafana/grafana.db
[19:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:04] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] grafana1002: is just grafana.wm.o now [puppet] - 10https://gerrit.wikimedia.org/r/552876 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis)
[19:07:07] <icinga-wm>	 PROBLEM - Disk space on cp4029 is CRITICAL: DISK CRITICAL - free space: / 246 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops
[19:07:22] <cdanis>	 !log stopping grafana-next.wikimedia.org (on grafana1002)
[19:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[19:09:00] <wikibugs>	 (03PS2) 10EBernhardson: Allow analytics-search-users to manage search/airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180)
[19:09:32] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::search::airflow: fix file resource and add deps [puppet] - 10https://gerrit.wikimedia.org/r/552878 (https://phabricator.wikimedia.org/T236180)
[19:10:44] <wikibugs>	 10Operations, 10ops-esams, 10netops: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon)
[19:11:20] <wikibugs>	 (03PS1) 10CDanis: grafana1002: is now the server for grafana.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/552879 (https://phabricator.wikimedia.org/T220838)
[19:11:32] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10Cmjohnson) 05Open→03Resolved asset tags have been added to all and netbox updated
[19:11:36] <cdanis>	 !log copied snapshot of database from grafana1001 to grafana1002 T220838
[19:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:41] <stashbot>	 T220838: Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838
[19:11:43] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:11:49] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5007 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:59] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is CRITICAL: connect to address 10.132.0.107 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:12:05] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5007 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:12:15] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:12:33] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:12:41] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:12:41] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:13:03] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:35] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-knams is CRITICAL: JNX_ALARMS CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:13:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix file resource and add deps [puppet] - 10https://gerrit.wikimedia.org/r/552878 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey)
[19:13:41] <cdanis>	 !log restarted grafana-server on grafana1002 T220838
[19:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:21] <icinga-wm>	 PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:22] <mutante>	 cp5007 is a reinstall?
[19:14:22] <bblack>	 !log cp[245]*: wipe daemon.log and syslog and restart syslog, again
[19:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:37] <mutante>	 i see the host key changed
[19:14:43] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10wiki_willy) @Jgreen - I gave @Jclark-ctr a heads up on this, so he'll start working on it when he gets in a bit lat...
[19:14:45] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:11] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345572 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:15:15] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5007 is OK: HTTP OK: HTTP/1.0 200 OK - 19886 bytes in 0.705 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:27] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is OK: HTTP OK: HTTP/1.1 200 Ok - 30130 bytes in 1.166 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:31] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5007 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:41] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:15:41] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp5007 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:47] <icinga-wm>	 RECOVERY - Disk space on cp4029 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops
[19:15:53] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:59] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345523 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:16:03] <wikibugs>	 (03PS3) 10RLazarus: poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407)
[19:16:07] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345516 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:16:07] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345515 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:16:19] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:16:25] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp4028 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:16:27] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:16:29] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4028 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:17:38] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp4032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:17:42] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp4032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:17:45] <wikibugs>	 (03PS1) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854)
[19:20:33] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Urbanecm) You need to first ssh to your personal account and then do `become <your tool name>`
[19:20:35] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::search::airflow: fix directory ensure [puppet] - 10https://gerrit.wikimedia.org/r/552882 (https://phabricator.wikimedia.org/T236180)
[19:20:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[19:21:36] <wikibugs>	 (03PS2) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854)
[19:21:44] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp4028 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345596 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:21:50] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp4028 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:21:50] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp4028 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345590 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:21:52] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.0 200 OK - 19875 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:22:12] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.1 200 Ok - 30114 bytes in 0.379 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:22:24] <icinga-wm>	 RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[19:22:50] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp4032 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345550 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:22:52] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp4032 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345548 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:23:08] <icinga-wm>	 RECOVERY - Disk space on cp4031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops
[19:23:10] <icinga-wm>	 RECOVERY - Disk space on cp4032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops
[19:23:26] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp4028 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345494 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS
[19:23:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix directory ensure [puppet] - 10https://gerrit.wikimedia.org/r/552882 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey)
[19:24:38] <wikibugs>	 (03PS1) 10Dzahn: conftool: un-comment mw1298, add back to pool [puppet] - 10https://gerrit.wikimedia.org/r/552884 (https://phabricator.wikimedia.org/T215332)
[19:25:00] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) >>! In T239139#5690948, @wiki_willy wrote: > @Jgreen - I gave @Jclark-ctr a heads up on this, so he'll starting...
[19:25:53] <wikibugs>	 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) Ok, this is an odd one and doesn't meet any of our current configuration requirements.  The only reason this isn't a VM is due to storage requirements...
[19:27:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "As Effie pointed out this server was not getting any traffic but was expected to be repooled again." [puppet] - 10https://gerrit.wikimedia.org/r/552884 (https://phabricator.wikimedia.org/T215332) (owner: 10Dzahn)
[19:29:12] <wikibugs>	 (03PS1) 10EBernhardson: Add dsh group for search platform airflow [puppet] - 10https://gerrit.wikimedia.org/r/552885
[19:29:50] <wikibugs>	 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH)
[19:30:06] <logmsgbot>	 !log ema@cumin1001 conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet,service=nginx
[19:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:48] <icinga-wm>	 RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add dsh group for search platform airflow [puppet] - 10https://gerrit.wikimedia.org/r/552885 (owner: 10EBernhardson)
[19:31:19] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632)
[19:31:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632) (owner: 10Faidon Liambotis)
[19:32:16] <icinga-wm>	 RECOVERY - Disk space on cp4028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops
[19:32:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet
[19:32:29] <wikibugs>	 (03PS2) 10Faidon Liambotis: Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632)
[19:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:07] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 03+2] Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632) (owner: 10Faidon Liambotis)
[19:34:45] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] grafana1002: is now the server for grafana.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/552879 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis)
[19:35:14] <mutante>	 !log mw1298 - scap pull
[19:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:33] <wikibugs>	 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) Alternatively, we use existing spare pool system wmf5174 or wmf5175 which have dual SSDs, and then swap in dual 2TB SFF disks if they have any spare i...
[19:36:41] <wikibugs>	 10Operations, 10ops-codfw, 10ops-eqiad, 10netbox, 10Patch-For-Review: Document PDU models - https://phabricator.wikimedia.org/T227632 (10faidon) 05Open→03Resolved I went digging in RT and fixed it for all of them except the old/unracked/offline sdtpa PDUs.
[19:40:20] <apergos>	 nice ghost recovery page there
[19:42:06] <icinga-wm>	 PROBLEM - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:42:24] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1298 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:42:39] <mutante>	 apergos: which one did you get?
[19:43:02] <apergos>	 icinga
[19:43:22] <mutante>	 the only recoveries i see are not paging?
[19:43:35] <apergos>	 recovery: icinga1001... see email for details. from 4 minutes ago
[19:43:38] <elukey>	 does anybody know what needs to be done on deploy1001 when the repo is added for the first time? Does it need to be cloned manually?
[19:43:59] <apergos>	 hm yes iirc, plus the scap config 
[19:44:22] <apergos>	 I mean you will always pull manually anyways
[19:44:25] <mutante>	 and "keyholder arm" if a new deployment key is involved
[19:45:06] <thcipriani>	 shouldn't need to be cloned manually. You can add it to the hiera data for the deploy host (if you're talking about scap3 deploys)
[19:45:14] <apergos>	 ah right
[19:45:30] <apergos>	 oh has that been updated since... 
[19:45:35] <apergos>	 well years ago, right. heh
[19:45:40] <apergos>	 nice!
[19:45:52] <thcipriani>	 > hieradata/role/common/deployment_server.yaml
[19:46:05] <elukey>	 thcipriani: o/ it is there, but I currently see
[19:46:15] <elukey>	 OSError: [Errno 2] No such file or directory: '/srv/deployment/search/airflow/.git/config-files'
[19:46:18] <elukey>	 19:33:42 ERROR    - deploy failed: <OSError> [Errno 2] No such file or directory: '/srv/deployment/search/airflow/.git/config-files'
[19:46:53] <elukey>	 ok super weird
[19:47:06] <cdanis>	 apergos: likely same issue as mentioned during the meeting today
[19:47:08] <elukey>	 if I rm -rf the directory of the repo (empty) and run puppet it works
[19:47:10] <cdanis>	 and that volans emailed about :)
[19:47:37] <elukey>	 yep worked
[19:48:11] <apergos>	 cdanis: yeah I figured as much
[19:48:42] <icinga-wm>	 PROBLEM - Check systemd state on logstash2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:48:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1298.eqiad.wmnet
[19:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:01] <rlazarus>	 so in the last month, we've had six icinga recoveries and two alerts -- that means we can ignore the next four alerts, right?
[19:49:03] <elukey>	 thcipriani: seems that there might be some puppet race condition with https://github.com/wikimedia/puppet/blob/production/modules/scap/lib/puppet/provider/scap_source/default.rb#L163
[19:49:32] <icinga-wm>	 PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:50:37] <cdanis>	 rlazarus: eyyyy
[19:52:35] <apergos>	 the problem with forgiving close bracket matching is that new open brackets always count
[19:55:21] <wikibugs>	 (03PS1) 10CDanis: logstash collector: use proper ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/552887
[19:56:04] <icinga-wm>	 PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:39] <thcipriani>	 elukey: hrm, that's possible. Not sure if I understand the full OOO that would lead to the race-condition, though
[19:57:58] <wikibugs>	 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH)
[19:58:25] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] logstash collector: use proper ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/552887 (owner: 10CDanis)
[19:58:33] <elukey>	 thcipriani: so the repo dir was created (I believe by puppet) but then the clone didn't happen since the dir was already created.. I can try to create a task tomorrow if you want
[19:59:50] <icinga-wm>	 RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:05] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2000).
[20:00:40] <icinga-wm>	 RECOVERY - Check systemd state on logstash2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:43] <wikibugs>	 (03PS1) 10Dzahn: conftool: move mw1298 to the jobrunner section [puppet] - 10https://gerrit.wikimedia.org/r/552888 (https://phabricator.wikimedia.org/T215332)
[20:01:01] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2019-11-25-173023-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552864 (https://phabricator.wikimedia.org/T235652) (owner: 10Mholloway)
[20:01:10] <icinga-wm>	 RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:14] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) 05Open→03Resolved Grafana 6.4.4 is now in use at https://grafana.wikimedia.org.
[20:01:21] <wikibugs>	 (03Merged) 10jenkins-bot: Update wikifeeds to 2019-11-25-173023-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552864 (https://phabricator.wikimedia.org/T235652) (owner: 10Mholloway)
[20:02:24] <icinga-wm>	 RECOVERY - Check systemd state on logstash2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:49] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10faidon) 05Resolved→03Open Some of these were not done - I suspect partially because my ranges were misparsed as individual items (should had made that clearer, apologies!). The follo...
[20:04:12] <thcipriani>	 elukey: yeah, if you could file a task that'd be perfect
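Editor's note: the failure elukey describes above (an empty repo directory already exists, so the clone is skipped) can be sketched roughly as follows. This is a minimal illustration, not the actual logic of the scap_source Puppet provider, which is written in Ruby. The function names and the temporary path are hypothetical. The only assumption is that a check for "does the directory exist" is weaker than a check for "does the directory contain a .git checkout".

```python
import os
import tempfile


def needs_clone_naive(repo_dir: str) -> bool:
    """Hypothetical check: clone only if the directory is missing."""
    return not os.path.isdir(repo_dir)


def needs_clone_checked(repo_dir: str) -> bool:
    """Hypothetical check: clone unless a .git directory is present."""
    return not os.path.isdir(os.path.join(repo_dir, ".git"))


def demo():
    # Simulate the state from the log: the repo directory was created
    # (empty) before any clone happened.
    with tempfile.TemporaryDirectory() as base:
        repo_dir = os.path.join(base, "search", "airflow")
        os.makedirs(repo_dir)
        return needs_clone_naive(repo_dir), needs_clone_checked(repo_dir)


if __name__ == "__main__":
    naive, checked = demo()
    print("naive check says clone needed:", naive)
    print(".git check says clone needed:", checked)
```

Under these assumptions the naive check reports that no clone is needed for the empty directory, which matches the observed workaround of removing the directory and re-running puppet.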
[20:04:50] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
[20:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:46] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[20:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:00] <wikibugs>	 10Operations, 10ops-esams: Update spare QFX labels - https://phabricator.wikimedia.org/T237014 (10faidon)
[20:06:05] <wikibugs>	 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) a:05RobH→03faidon Please note this task is now pending the approval of @faidon in conjunction with associated HDD purchase task T238652.  @faidon:...
[20:07:04] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
[20:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "per site.pp mw1298 is a jobrunner, so move it to the correct section in conftool" [puppet] - 10https://gerrit.wikimedia.org/r/552888 (https://phabricator.wikimedia.org/T215332) (owner: 10Dzahn)
[20:15:18] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Aklapper) 05Open→03Declined This is what Toolforge is for, hence I'm boldly declining this request as it's about pr...
[20:15:28] <wikibugs>	 (03PS2) 10Dzahn: conftool: move mw1298 to the jobrunner section [puppet] - 10https://gerrit.wikimedia.org/r/552888 (https://phabricator.wikimedia.org/T215332)
[20:18:55] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10RobH)
[20:20:50] <wikibugs>	 (03PS3) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854)
[20:22:03] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet
[20:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:11] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1298.eqiad.wmnet
[20:31:19] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet
[20:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:51] <mutante>	 ^ don't know why this server has "weight: 0" 
[20:32:06] <mutante>	 and not 10
[20:36:26] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 55.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:37:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1298.eqiad.wmnet
[20:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:54] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 75.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
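Editor's note: the threshold strings in the two Varnish alerts above ("55.2 le 60" for CRITICAL, "(C)60 le (W)70 le 75.46" for OK) suggest a lower-is-worse comparison against a critical level of 60 and a warning level of 70. The sketch below only reproduces that reading. The function name and default thresholds are assumptions taken from the alert text, and this is not the actual Icinga check implementation.

```python
def classify(value: float, crit: float = 60.0, warn: float = 70.0) -> str:
    """Classify a traffic level where lower values are worse.

    Thresholds are read off the alert text: value <= crit is CRITICAL,
    value <= warn is WARNING, anything above warn is OK.
    """
    if value <= crit:
        return "CRITICAL"
    if value <= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Values taken from the ulsfo alert and recovery lines in the log.
    for v in (55.2, 75.46):
        print(v, classify(v))
```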
[20:40:12] <wikibugs>	 (03PS4) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854)
[20:44:05] <wikibugs>	 (03CR) 10Dzahn: admins: add Max Semenik as ldap_only_admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn)
[20:44:19] <wikibugs>	 (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1003/19607/" [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[20:46:19] <wikibugs>	 (03CR) 10Dzahn: "After the comments, not sure what is the right solution here. The user is the same user in LDAP (UID 1220) before and after. It needs to m" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn)
[20:50:14] <wikibugs>	 (03CR) 10Dzahn: "If the reason to keep it "absented" is to make sure keys are really removed.. we can manually check with cumin before we do.  If the reaso" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn)
[20:52:27] <wikibugs>	 (03PS3) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960)
[20:52:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn)
[20:53:52] <wikibugs>	 (03PS4) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960)
[20:54:55] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[20:55:11] <wikibugs>	 (03PS2) 10Dzahn: wikimania_scholarships app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247)
[20:55:16] <wikibugs>	 (03PS1) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161)
[20:56:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikimania_scholarships app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn)
[20:58:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott)
[20:59:06] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10jijiki) Version 1.10.0  is live, please mark this as resolved if everything works as expected.
[20:59:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10Cmjohnson) Yes, I misread the task....I will update and resolve once completed
[21:00:04] <jouncebot>	 Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2100).
[21:00:08] <wikibugs>	 (03PS2) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161)
[21:01:49] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[21:05:36] <wikibugs>	 (03PS3) 10Zoranzoki21: Equalization of wgPopupsReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552643
[21:07:14] <wikibugs>	 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10Iflorez) Thank you @crusnov,   Below is the full list. I request access for the following sites and their mobile sites.  ID.wikipedia, SU.wikipedia, JV.wikipedia,...
[21:08:00] <wikibugs>	 (03PS3) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161)
[21:12:09] <wikibugs>	 (03PS2) 10Dzahn: iegreview app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247)
[21:14:30] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) mw1298 is now back here, with weight 10 as a jobrunner  https://config-master.wikimedia.org/pybal/eqiad/jobrunner
[21:14:39] <wikibugs>	 (03PS4) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161)
[21:14:40] <wikibugs>	 (03PS1) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161)
[21:16:18] <wikibugs>	 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Cmjohnson)
[21:17:32] <wikibugs>	 (03PS5) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161)
[21:17:34] <wikibugs>	 (03PS2) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161)
[21:18:16] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jclark-ctr)
[21:18:56] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[21:21:22] <wikibugs>	 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) Currently: | Rack |A 5| A6|A 7|B 6|B 7| C 6| D 4 |  D 5 |mw servers|6|6|17|21|6| 30|6 (decom)|30 (decom)  We will decommission 36 servers from...
[21:24:06] <icinga-wm>	 PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100%
[21:26:12] <icinga-wm>	 RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[21:27:23] <wikibugs>	 (03PS1) 10RLazarus: poolcounter: Specify port 9106 for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/552896 (https://phabricator.wikimedia.org/T237407)
[21:28:15] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] poolcounter: Specify port 9106 for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/552896 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[21:29:39] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] poolcounter: Specify port 9106 for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/552896 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[21:32:38] <icinga-wm>	 RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (191590 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:32:38] <icinga-wm>	 RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (191590 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:32:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] iegreview app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn)
[21:37:44] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jclark-ctr) Racked Server   Connected to port 26   on fasw-c1a & fasw-c2a  Entered production idrac password @Jgreen...
[21:39:14] <wikibugs>	 (03PS1) 10RLazarus: poolcounter: Set restart => true for the exporter service. [puppet] - 10https://gerrit.wikimedia.org/r/552898 (https://phabricator.wikimedia.org/T237407)
[21:40:36] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jclark-ctr)
[21:41:37] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] poolcounter: Set restart => true for the exporter service. [puppet] - 10https://gerrit.wikimedia.org/r/552898 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[21:43:11] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] poolcounter: Set restart => true for the exporter service. [puppet] - 10https://gerrit.wikimedia.org/r/552898 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[21:48:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10wiki_willy) a:03Jclark-ctr
[21:49:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10wiki_willy) a:03Jclark-ctr
[21:52:30] <wikibugs>	 (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott)
[21:57:43] <wikibugs>	 (03CR) 10Jhedden: [C: 03+1] "LGTM overall, non-blocking comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott)
[22:01:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] racktables: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552553 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn)
[22:05:53] <Urbanecm>	 jouncebot: next
[22:05:53] <jouncebot>	 In 0 hour(s) and 54 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2300)
[22:07:03] <wikibugs>	 (03PS1) 10Jgreen: add frdb1003.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/552899 (https://phabricator.wikimedia.org/T239139)
[22:08:07] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10Jclark-ctr) @elukey  No spare bbu around
[22:08:39] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) >>! In T239139#5691375, @Jclark-ctr wrote: > Racked Server  >  > Connected to port 26   on fasw-c1a & fasw-c2a >...
[22:10:22] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) forgot to remove from disabled group:   ` robh@fasw-c-eqiad# show | compare  [edit interfaces interface-range dis...
[22:12:12] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] add frdb1003.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/552899 (https://phabricator.wikimedia.org/T239139) (owner: 10Jgreen)
[22:13:07] <Jeff_Green>	 !log authdns update to deploy I21ddc1a3e
[22:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:19] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10RStallman-legalteam) Hi @MaxSem,   Happy to prepare an NDA for you. I will need your mailing address as well as personal email to send you t...
[22:13:40] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen)
[22:15:57] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH)
[22:20:51] <wikibugs>	 (03PS1) 10Urbanecm: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173)
[22:21:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm)
[22:23:40] <wikibugs>	 (03PS2) 10Urbanecm: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173)
[22:24:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm)
[22:25:39] <wikibugs>	 (03PS3) 10Urbanecm: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173)
[22:28:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10Jclark-ctr)  cr2-eqiad:xe-3/3/3 cable incorrect in netbox . correct is  2649
[22:29:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10Jclark-ctr) a:05Jclark-ctr→03ayounsi @ayounsi  Can you update routers to reflect . Thanks!!
[22:31:51] <wikibugs>	 (03PS3) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161)
[22:32:02] <wikibugs>	 (03PS1) 10RLazarus: poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407)
[22:34:53] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[22:35:18] <wikibugs>	 (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott)
[22:38:23] <wikibugs>	 (03PS4) 10DannyS712: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842)
[22:39:31] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) a:05Jclark-ctr→03Jgreen
[22:45:44] <wikibugs>	 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) 05Open→03Resolved
[22:45:46] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn)
[22:46:26] <wikibugs>	 (03PS1) 10RLazarus: poolcounter: Refactor: In role::poolcounter::server, use profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552910
[22:47:03] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "LGTM assuming PCC is happy" [puppet] - 10https://gerrit.wikimedia.org/r/552910 (owner: 10RLazarus)
[22:48:56] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "No changes: https://puppet-compiler.wmflabs.org/compiler1001/19614/poolcounter1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552910 (owner: 10RLazarus)
[22:53:29] <wikibugs>	 (03PS2) 10RLazarus: poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407)
[22:59:23] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Prod compare endpoint missing offset object (with from & to keys) on diff items - https://phabricator.wikimedia.org/T238846 (10Tsevener) 05Open→03Resolved
[22:59:27] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Tsevener)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2300).
[23:00:05] <jouncebot>	 urandom: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:10] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Tsevener) 05Open→03Resolved
[23:00:14] <wikibugs>	 (03PS3) 10RLazarus: poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407)
[23:00:20] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Tsevener) Looking good in Prod, thanks everyone!
[23:00:42] <urandom>	 o/
[23:00:54] <Urbanecm>	 urandom: I can SWAT today!
[23:01:07] <urandom>	 Urbanecm: great; thanks!
[23:02:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] kask-echoseen: Do not report dupes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: 10Mobrovac)
[23:02:44] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm)
[23:02:56] <wikibugs>	 (03Merged) 10jenkins-bot: kask-echoseen: Do not report dupes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: 10Mobrovac)
[23:03:23] <Urbanecm>	 urandom: can you test at mwdebug1001, please?
[23:03:24] <wikibugs>	 (03Merged) 10jenkins-bot: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm)
[23:03:32] <urandom>	 will do!
[23:04:15] <Urbanecm>	 thank you urandom 
[23:06:08] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:12] <urandom>	 Urbanecm: LGTM
[23:06:26] <Urbanecm>	 thanks urandom , syncing
[23:06:56] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[23:07:09] <wikibugs>	 (03PS1) 10Faidon Liambotis: Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919
[23:07:34] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus)
[23:07:44] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: d71b0ab: kask-echoseen: Do not report dupes (T237143) (duration: 00m 53s)
[23:07:47] <Urbanecm>	 urandom: done!
[23:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:49] <stashbot>	 T237143: Log warning: Duplicate get(): "officewiki:echo:seen:message:time:{n}" fetched 2 times - https://phabricator.wikimedia.org/T237143
[23:07:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919 (owner: 10Faidon Liambotis)
[23:07:54] <urandom>	 Urbanecm: thanks!
[23:07:59] <Urbanecm>	 yw!
[23:09:18] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/: SWAT: aed2369: Add gewikimedia to special.dblist (T239173) (duration: 00m 52s)
[23:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:23] <stashbot>	 T239173: gewikimedia's w interwiki links to (nonexistent) gewiki - https://phabricator.wikimedia.org/T239173
[23:09:34] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:03] <logmsgbot>	 !log urbanecm@deploy1001 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 01s)
[23:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:30] <wikibugs>	 (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922
[23:10:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 (owner: 10Urbanecm)
[23:10:53] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 (owner: 10Urbanecm)
[23:11:17] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 (owner: 10Urbanecm)
[23:12:32] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 14s)
[23:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:14] <Urbanecm>	 !log Evening SWAT done
[23:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:43] <wikibugs>	 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 4 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10holger.knust) a:03holger.knust
[23:21:34] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:17] <wikibugs>	 (03PS1) 10RobH: updating to remove 709-BBFK from required skus [software] - 10https://gerrit.wikimedia.org/r/552926
[23:24:18] <wikibugs>	 (03CR) 10RobH: [C: 03+2] updating to remove 709-BBFK from required skus [software] - 10https://gerrit.wikimedia.org/r/552926 (owner: 10RobH)
[23:24:45] <wikibugs>	 (03Merged) 10jenkins-bot: updating to remove 709-BBFK from required skus [software] - 10https://gerrit.wikimedia.org/r/552926 (owner: 10RobH)
[23:28:07] <wikibugs>	 (03CR) 10Krinkle: "Remove the HHVMRequestInit.php.txt symlink itself as well. LGTM otherwise, good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester)
[23:31:05] <wikibugs>	 10Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Dzahn) >>! In T99216#5689785, @Aklapper wrote: > Ah, thanks. But who exactly is supposed to answer that question...
[23:31:47] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Mathew.onipe This server auto deploys wdqs and the last release has some issues with updater. I will revert wdqs release version later today. Server is currently not serving traffic so we should be fine - The acknowledgement expires at: 2019-11-26 10:28:35. https://wikitech.wikimedia.org/wiki
[23:31:47] <icinga-wm>	 _systemd_state
[23:36:44] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200978s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[23:36:44] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200978s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[23:39:24] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) Hi @RStallman-legalteam The email address is maxsem.wiki@gmail.com (no worries he already made it public all this time on the wiki us...
[23:40:26] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Dzahn) 05Open→03Declined
[23:42:47] <wikibugs>	 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) a:03jbond
[23:47:18] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:19] <wikibugs>	 (03CR) 10Volans: "A comment inline" (2 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov)
[23:54:38] <wikibugs>	 (03CR) 10Volans: "A comment inline" (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov)