[01:51:32] (PS5) Ladsgroup: mediawiki:errorpage: Make content default undef [puppet] - https://gerrit.wikimedia.org/r/530712 (https://phabricator.wikimedia.org/T113114) (owner: Alexandros Kosiaris)
[01:51:34] (PS10) Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114)
[01:52:36] (CR) Ladsgroup: "> Patch Set 9:" [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: Ladsgroup)
[01:53:38] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17933/mw1234.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: Ladsgroup)
[05:10:59] Operations, ops-eqiad, DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (Marostegui) a:Cmjohnson @Cmjohnson can we get this disk replaced? This host is old and will be replaced "soon", but this is m1 primary master, so better to have it replaced. We are in process of switc...
[05:13:00] Operations, ops-eqiad, DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (Marostegui) p:Triage→Normal
[05:24:12] o/ - checking cp2004
[05:26:27] bnx2x panic in the dmesg
[05:26:44] I guess that we can only reboot right?
[05:28:29] I think so, yes
[05:29:14] we did have some issues with bnx2x a while back, which got solved via firmware updates on the NIC
[05:29:42] this was done for the new cps in eqiad, maybe this server needs this update as well
[05:29:47] !log reboot cp2004 due to bnx2x crash (kern.log saved into my home on the host if needed)
[05:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:27] RECOVERY - Host cp2004 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[05:32:29] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:29] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:29] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:31] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:43] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:51] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:53] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:32:53] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:05] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:07] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:11] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:23] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:23] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:23] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:23] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:31] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:31] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:31] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:35] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:35] RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:35] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:39] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:41] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:33:53] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 58 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:34:03] RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:34:03] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:34:03] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[05:34:32] Operations, DBA: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui)
[05:35:37] Operations, DBA: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui) p:Triage→Normal
[05:37:11] good morning cp2004 :)
[05:38:22] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (MoritzMuehlenhoff) db2044 now has a second disk in predictive failure: ` # hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I...
[05:39:32] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) >>! In T208323#5420746, @MoritzMuehlenhoff wrote: > db2044 now has a second disk in predictive failure: > > ` > > # hpssacli controller all show config > > Smart Array P420i in Slot...
[05:40:49] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:43:51] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/530578 (owner: Ema)
[05:44:42] (CR) Elukey: Add sre.hadoop.reboot-workers.py (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[05:45:54] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2067, will be moved to m1 T230705', diff saved to https://phabricator.wikimedia.org/P8930 and previous config saved to /var/cache/conftool/dbconfig/20190819-054606-marostegui.json
[05:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:15] T230705: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705
[05:46:22] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2067 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530789 (https://phabricator.wikimedia.org/T230705)
[05:47:42] Operations, ops-eqiad, DBA, DC-Ops: hw troubleshooting: power supply for db1129 - https://phabricator.wikimedia.org/T230458 (Marostegui) The alert cleared up - thanks!
[05:48:09] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2067 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530789 (https://phabricator.wikimedia.org/T230705) (owner: Marostegui)
[05:49:06] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2067 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530789 (https://phabricator.wikimedia.org/T230705) (owner: Marostegui)
[05:49:22] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2067 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530789 (https://phabricator.wikimedia.org/T230705) (owner: Marostegui)
[05:50:17] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2067 from config T230705 (duration: 00m 50s)
[05:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2067 from config T230705 (duration: 00m 47s)
[05:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:17] T230705: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705
[05:54:32] (CR) Muehlenhoff: Add sre.hadoop.reboot-workers.py (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[05:55:50] (PS1) Marostegui: mariadb: Move db2067 to m2, decomm db2063 [puppet] - https://gerrit.wikimedia.org/r/530790 (https://phabricator.wikimedia.org/T230704)
[05:57:54] (CR) Elukey: Add sre.hadoop.reboot-workers.py (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[06:00:18] (CR) Muehlenhoff: Add sre.hadoop.reboot-workers.py (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[06:01:47] (PS1) Giuseppe Lavagetto: aptrepo: fix getenvoy dist/component [puppet] - https://gerrit.wikimedia.org/r/530791
[06:03:48] !log installing php5 security updates
[06:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:18] Operations, Acme-chief, Traffic: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096 (Vgutierrez) Open→Resolved a:Vgutierrez Yeah, thanks for the reminder! :)
[06:04:21] Operations, Traffic: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez)
[06:06:27] (CR) Muehlenhoff: [C: +1] "Looks good looking at https://dl.bintray.com/tetrate/getenvoy-deb/dists/" [puppet] - https://gerrit.wikimedia.org/r/530791 (owner: Giuseppe Lavagetto)
[06:12:05] Operations, Acme-chief, Traffic: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945 (Vgutierrez) Open→Resolved Yes :) it's working as expected.. latest renewal of unified and non-canonical-redirect certs set has been done with the proper staging time,...
[06:14:01] (CR) Elukey: [C: +1] confserver: enable ipv6 mapped address on the conf200* servers. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:15:11] Operations, ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (elukey) @Cmjohnson I can help with OS install/partman/etc.. if you want, so I'll free you from the last annoying steps :)
[06:16:44] (CR) Elukey: Add sre.hadoop.reboot-workers.py (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[06:21:27] !log rolling upgrade of nginx in ncredir hosts
[06:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:48] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet
[06:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:33] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet
[06:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:00] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir2001.codfw.wmnet
[06:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:55] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir2001.codfw.wmnet
[06:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:51] !log installing ghostscript security updates on scb/proton/notebook* hosts
[06:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:01] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir1002.eqiad.wmnet
[06:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:35] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir1002.eqiad.wmnet
[06:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:09] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir1001.eqiad.wmnet
[06:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:12] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir1001.eqiad.wmnet
[06:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:04] (CR) Muehlenhoff: [C: +1] confserver: enable ipv6 mapped address on the conf200* servers. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:37:17] !log upgrading acme-chief to version 0.20 on production servers - T229096
[06:37:24] (CR) Elukey: [C: +1] confserver: enable ipv6 mapped address on the conf200* servers. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:25] T229096: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096
[07:01:25] (CR) Muehlenhoff: [C: +1] "Looks good" [dns] - https://gerrit.wikimedia.org/r/528479 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[07:08:30] !log installing ffmpeg security updates on buster
[07:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:05] (CR) Marostegui: [C: +2] mariadb: Move db2067 to m2, decomm db2063 [puppet] - https://gerrit.wikimedia.org/r/530790 (https://phabricator.wikimedia.org/T230704) (owner: Marostegui)
[07:12:17] Operations, WMF-Legal, serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (Joe) Sorry, the indications you give here are in contrast with each other: you seem to want a full redirect of transparency.wikimedia.org to the ne...
[07:12:37] Operations, DBA: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui)
[07:14:41] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 28121 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[07:16:15] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[07:17:30] Operations, ops-codfw, DC-Ops, decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui) a:Marostegui→RobH
[07:17:45] Operations, ops-codfw, DC-Ops, decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (Marostegui) This host is ready for #dc-ops to decommission
[07:19:01] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:19:35] !log installing golang-1.11 security updates on buster
[07:19:38] (PS1) Elukey: statistics::mysql_credentials: add file only if group is defined [puppet] - https://gerrit.wikimedia.org/r/530801
[07:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:14] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17934/stat1005.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/530801 (owner: Elukey)
[07:28:38] Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Marostegui) >>! In T229452#5416446, @Cmjohnson wrote: > @Marostegui I see a potential issue with B3 as well. I will need to do a DIMM swap A -> B side and see if the e...
[07:40:32] !log Redact napwikisource on db1124 and db2094 - T210762
[07:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:40] T210762: Prepare and check storage layer for nap.wikisource - https://phabricator.wikimedia.org/T210762
[07:43:03] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:45:50] (PS2) Vgutierrez: ocsp: Provide basic test coverage [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765)
[07:45:52] (PS5) Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765)
[07:45:54] (PS1) Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765)
[07:48:58] (CR) jerkins-bot: [V: -1] api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[07:49:01] (CR) jerkins-bot: [V: -1] acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[07:49:28] (CR) Vgutierrez: "This change is ready for review." [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[07:52:05] (CR) Giuseppe Lavagetto: [C: +2] aptrepo: fix getenvoy dist/component [puppet] - https://gerrit.wikimedia.org/r/530791 (owner: Giuseppe Lavagetto)
[07:52:18] (PS2) Giuseppe Lavagetto: aptrepo: fix getenvoy dist/component [puppet] - https://gerrit.wikimedia.org/r/530791
[07:52:38] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui) @wiki_willy any advice on T227142#5356294?
[07:53:55] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (Marostegui) >>! In T227539#5395478, @Marostegui wrote: > db1104 is s8 primary master, we'd probably need to failover this host if we are not confident this host can be swapped ove...
[07:55:05] Operations, DBA, Data-Services: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (Marostegui) Once this wiki is created, please let us know so we can sanitize it on labs and sanitarium before creating the views on the wikireplicas.
[07:58:16] (PS1) Muehlenhoff: Add DNS entries for failoid1001/2001 [dns] - https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903)
[07:59:46] !log shutdown elastic2050 to prepare for mgmt reset - T230597
[07:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:55] T230597: can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597
[08:02:18] (CR) Filippo Giunchedi: [C: +1] Remove apt-setup/multiarch from d-i config [puppet] - https://gerrit.wikimedia.org/r/530559 (owner: Muehlenhoff)
[08:03:15] (PS2) Filippo Giunchedi: swift: stop monitoring individual daemons [puppet] - https://gerrit.wikimedia.org/r/530080 (https://phabricator.wikimedia.org/T228878)
[08:04:05] (CR) Filippo Giunchedi: [C: +2] swift: stop monitoring individual daemons [puppet] - https://gerrit.wikimedia.org/r/530080 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:07:31] (PS2) Filippo Giunchedi: icinga: add acknowledge details to emails [puppet] - https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413)
[08:08:20] (CR) Filippo Giunchedi: [C: +2] icinga: add acknowledge details to emails [puppet] - https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413) (owner: Filippo Giunchedi)
[08:33:44] (CR) Hashar: [C: -1] Whitelist jenkins for edit rate limits on beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481) (owner: Jakob)
[08:45:59] (PS1) Volans: wmf-auto-reimage: fix delayed downtime [puppet] - https://gerrit.wikimedia.org/r/530810
[08:51:44] (PS2) Volans: wmf-auto-reimage: fix delayed downtime [puppet] - https://gerrit.wikimedia.org/r/530810
[08:53:16] (CR) Filippo Giunchedi: "LGTM overall" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[08:57:09] !log add 100G to graphite1004 / graphite2003 /srv LVs
[08:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:30] (PS6) Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765)
[08:57:32] (PS2) Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765)
[08:59:29] (CR) Vgutierrez: "This change is ready for review." (3 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[09:01:55] (CR) Volans: [C: +2] wmf-auto-reimage: fix delayed downtime [puppet] - https://gerrit.wikimedia.org/r/530810 (owner: Volans)
[09:02:32] (PS5) Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[09:04:02] (CR) Filippo Giunchedi: mediawiki: add cluster latency alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:17:49] (CR) Filippo Giunchedi: "LGTM overall" (2 comments) [debs/prometheus-ipsec-exporter] - https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[09:19:48] (CR) Filippo Giunchedi: VCL: workaround for images delivered with CT:x-www-form-urlencoded (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530338 (https://phabricator.wikimedia.org/T162035) (owner: Ema)
[09:19:58] Operations, Analytics, vm-requests: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (elukey)
[09:20:06] Operations, Analytics, vm-requests: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (elukey) p:Triage→Normal
[09:20:42] Operations, Analytics, vm-requests: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (elukey)
[09:24:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[09:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:53] marostegui: ^^ :D
[09:25:01] \o\ |o| /o/
[09:26:36] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[09:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:13] you called victory too early :-P
[09:28:20] but it's unrelated, downtime was applied
[09:28:21] :_(
[09:28:27] I'm digging
[09:33:23] Operations, Analytics, vm-requests: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (elukey) ` elukey@ganeti1001:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 38 preferred ovs=False, ssh_port=22,...
[09:36:15] (PS1) Elukey: Add an-tool1007.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/530815 (https://phabricator.wikimedia.org/T230711)
[09:36:36] (PS2) Elukey: Add an-tool1007.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/530815 (https://phabricator.wikimedia.org/T230711)
[09:37:34] anybody have time for a quick sanity check? --^
[09:39:27] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:40:25] (PS1) Volans: wmf-auto-reimage: do not use task ID in downtime [puppet] - https://gerrit.wikimedia.org/r/530816
[09:40:34] marostegui: ^^^
[09:41:35] (CR) Marostegui: [C: +1] wmf-auto-reimage: do not use task ID in downtime [puppet] - https://gerrit.wikimedia.org/r/530816 (owner: Volans)
[09:43:29] (PS2) Alaa Sarhan: Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/529807
[09:45:33] (CR) Volans: [C: +2] wmf-auto-reimage: do not use task ID in downtime [puppet] - https://gerrit.wikimedia.org/r/530816 (owner: Volans)
[09:49:50] alaa_wmde: Are you going to enable that today?
[09:50:11] yes
[09:50:21] (PS5) Jbond: confserver: enable ipv6 mapped address on the conf200* servers. [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099)
[09:50:25] alaa_wmde: ok, let's monitor the DBs
[09:50:30] alaa_wmde: please ping me once it is live
[09:50:53] yeap was planning on that .. oh nice if you could help too :)
[09:50:55] will ping you
[09:51:00] cheers
[09:53:17] (PS3) Jakob: Whitelist jenkins for edit rate limits on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481)
[09:53:45] (PS6) Jbond: confserver: enable ipv6 mapped address on the conf200* servers. [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099)
[09:54:24] (CR) Jbond: [C: +2] confserver: enable ipv6 mapped address on the conf200* servers. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/528475 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:57:37] !log add mapped ipv6 to conf200* servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/528475
[09:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:56] (CR) Volans: [C: -1] "One thing to fix, I'm checking why CI didn't catch it" (1 comment) [dns] - https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: Muehlenhoff)
[10:09:10] (CR) Volans: [C: -1] "> Patch Set 1: Code-Review-1" (1 comment) [dns] - https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: Muehlenhoff)
[10:16:34] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/530815 (https://phabricator.wikimedia.org/T230711) (owner: Elukey)
[10:17:05] (CR) Elukey: [C: +2] Add an-tool1007.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/530815 (https://phabricator.wikimedia.org/T230711) (owner: Elukey)
[10:18:25] (CR) Volans: [C: +2] Move splitting of a RemoteHosts to a method [software/spicerack] - https://gerrit.wikimedia.org/r/529976 (owner: Giuseppe Lavagetto)
[10:18:36] \o/
[10:18:42] thanks for both volans
[10:18:46] :)
[10:18:48] anytime
[10:22:36] (PS1) Ema: ATS: enable compress.so for upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432)
[10:22:48] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm
[10:22:51] (Merged) jenkins-bot: Move splitting of a RemoteHosts to a method [software/spicerack] - https://gerrit.wikimedia.org/r/529976 (owner: Giuseppe Lavagetto)
[10:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:25] (PS2) Jbond: conf servers: remove old IPv6 SLAAC addresses [puppet] - https://gerrit.wikimedia.org/r/528476 (https://phabricator.wikimedia.org/T102099)
[10:24:01] (CR) jenkins-bot: Move splitting of a RemoteHosts to a method [software/spicerack] - https://gerrit.wikimedia.org/r/529976 (owner: Giuseppe Lavagetto)
[10:24:07] (CR) Elukey: [C: +1] conf servers: remove old IPv6 SLAAC addresses [puppet] - https://gerrit.wikimedia.org/r/528476 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:25:25] (CR) Jbond: [C: +2] conf servers: remove old IPv6 SLAAC addresses [puppet] - https://gerrit.wikimedia.org/r/528476 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:25:50] (CR) Ema: "pcc looks correct: https://puppet-compiler.wmflabs.org/compiler1001/17936/" [puppet] - https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[10:26:51] (PS1) Jbond: Revert "conf servers: remove old IPv6 SLAAC addresses" [puppet] - https://gerrit.wikimedia.org/r/530824
[10:27:23] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (Marostegui) We have two masters on this rack db1075 (s3) and db1104 (s4). @wiki_willy how confident are you guys that this won't have an unexpected downtime? (cc @jcrespo)
[10:27:26] (CR) Jbond: [C: +2] Revert "conf servers: remove old IPv6 SLAAC addresses" [puppet] - https://gerrit.wikimedia.org/r/530824 (owner: Jbond)
[10:29:38] (PS1) Jbond: conf servers: remove SLAAC address from config [puppet] - https://gerrit.wikimedia.org/r/530825
[10:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T1030).
[10:30:27] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (Marostegui)
[10:30:27] (CR) Jbond: [C: +2] conf servers: remove SLAAC address from config [puppet] - https://gerrit.wikimedia.org/r/530825 (owner: Jbond)
[10:30:47] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/530826 (https://phabricator.wikimedia.org/T128546)
[10:31:21] (CR) Muehlenhoff: Add DNS entries for failoid1001/2001 (1 comment) [dns] - https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: Muehlenhoff)
[10:32:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[10:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:48] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (Marostegui)
[10:34:26] Operations, Analytics, vm-requests, Patch-For-Review: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A an-tool1007.eqiad.wmnet --vcpus 2 --memory 4 --disk 20 --lin...
[10:34:41] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530826 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:58] (03PS3) 10Jbond: conf servers: add AAAA and ipv6 PTR records for conf200* servers [dns] - 10https://gerrit.wikimedia.org/r/528479 (https://phabricator.wikimedia.org/T102099) [10:35:40] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530826 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:35:50] (03CR) 10Jbond: [C: 03+2] conf servers: add AAAA and ipv6 PTR records for conf200* servers [dns] - 10https://gerrit.wikimedia.org/r/528479 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:35:58] (03PS1) 10Elukey: Add an-tool1007 as replacement for analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/530827 (https://phabricator.wikimedia.org/T230711) [10:36:07] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Marostegui) [10:36:45] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530826 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:37:26] (03PS2) 10Muehlenhoff: Add DNS entries for failoid1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) [10:37:28] (03PS2) 10Elukey: Add an-tool1007 as replacement for analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/530827 (https://phabricator.wikimedia.org/T230711) [10:37:48] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:530826| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:56] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - 
https://phabricator.wikimedia.org/T128546 [10:38:04] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Marostegui) [10:38:22] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Marostegui) [10:38:23] (03CR) 10Elukey: [C: 03+2] Add an-tool1007 as replacement for analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/530827 (https://phabricator.wikimedia.org/T230711) (owner: 10Elukey) [10:38:38] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:530826| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:58] (03CR) 10Volans: Add DNS entries for failoid1001/2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [10:39:26] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Marostegui) @jcrespo - I will be on holidays this day, hence I added you on the list as primary contact for db1099 :) [10:41:27] (03CR) 10Muehlenhoff: Add DNS entries for failoid1001/2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [10:42:03] (03PS3) 10Muehlenhoff: Add DNS entries for failoid1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) [10:44:03] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10fgiunchedi) [10:47:01] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [10:47:36] (03CR) 10Subramanya Sastry: Update 
parsoid-rt-client.config.js.erb to fetch test ids from a function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529391 (https://phabricator.wikimedia.org/T230166) (owner: 10Subramanya Sastry) [10:49:14] (03CR) 10Jakob: Whitelist jenkins for edit rate limits on beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481) (owner: 10Jakob) [10:50:57] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) a:05Dzahn→03None Unassign for now. The actual ask here is unclear in terms of technical details. [10:51:43] (03PS4) 10Muehlenhoff: Add DNS entries for failoid1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) [10:52:14] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entries for failoid1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/530808 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [10:52:43] (03PS2) 10Muehlenhoff: Remove apt-setup/multiarch from d-i config [puppet] - 10https://gerrit.wikimedia.org/r/530559 [10:52:57] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [10:52:57] !log elukey@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [10:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:08] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [10:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove apt-setup/multiarch from d-i config [puppet] - 10https://gerrit.wikimedia.org/r/530559 (owner: 10Muehlenhoff) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T1100). [11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:27] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [11:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [11:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:51] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (10elukey) Of course I used the wrong row, so I had to gnt-instance remove an-tool1007 and then recreate: ` elukey@cumin1001:~$ sudo... [11:08:51] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad/codfw: One VM for Failoid - https://phabricator.wikimedia.org/T229903 (10MoritzMuehlenhoff) The sre.ganeti.makevm cook book manages the RAM size as an int, so I went with 1 G instead of 0.5. I created https://phabricator.wikimedia.org/T230712 in case any... [11:09:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:10:31] (03PS1) 10Muehlenhoff: Extend netboot.cfg for failoid* [puppet] - 10https://gerrit.wikimedia.org/r/530830 (https://phabricator.wikimedia.org/T229903) [11:10:57] anyone around to do SWAT deployment today? 
[11:12:16] (03PS1) 10Elukey: Change mac address to an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/530831 (https://phabricator.wikimedia.org/T230711) [11:12:41] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [11:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:52] (03CR) 10Elukey: [C: 03+2] Change mac address to an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/530831 (https://phabricator.wikimedia.org/T230711) (owner: 10Elukey) [11:14:06] (03CR) 10Mobrovac: [C: 03+1] "LGTM. @Arlolra are you ok with this going out?" [puppet] - 10https://gerrit.wikimedia.org/r/529391 (https://phabricator.wikimedia.org/T230166) (owner: 10Subramanya Sastry) [11:15:40] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [11:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:18] (03PS1) 10Muehlenhoff: Extend MOU for Florian Lemmerich [puppet] - 10https://gerrit.wikimedia.org/r/530832 [11:16:45] (03PS22) 10Vgutierrez: ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [11:17:41] alaa_wmde: I can deploy your patch in 15 mins [11:17:45] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for Florian Lemmerich [puppet] - 10https://gerrit.wikimedia.org/r/530832 (owner: 10Muehlenhoff) [11:17:46] At airport security atm [11:19:28] thanks @Urbanecm .. no urgency though, nor stress. That patch might as well need a revert in case we see some undesirable increase in DB load .. so if you won't be able to revert it in about 15-20 minutes after deployment then it's better to deploy another day ;) [11:19:42] +1 :) [11:20:32] I should have time to do so if necessary [11:23:26] is there anyone else as a fallback to do a revert just in case Urbanecm loses internet connectivity at the airport? I'd be more comfortable then going ahead, @marostegui is that smth you can do, comfortably? 
Here at wikidata offices there aren't anyone available (Wikimania) atm [11:23:55] alaa_wmde: No, I wouldn't be comfortable reverting changes, sorry :) [11:24:36] (03CR) 10Volans: [C: 03+1] "LGTM, nit in commit message" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530830 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [11:24:56] alaa_wmde: We (DBAs) only touch db-eqiad,db-codfw.php files because it used to be the only way to pool/depool hosts, but other than that we never touch any other MW files [11:25:36] :) okay let's wait another 10 minutes.. if no one else show up then better to re-schedule I think [11:25:45] alaa_wmde: I passed through airport security [11:25:48] and I'm happy to deploy [11:25:50] jouncebot: now [11:25:50] For the next 0 hour(s) and 34 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T1100) [11:26:03] alaa_wmde: around? [11:26:24] (btw, I have backup connectivity, just in case) [11:26:26] airport wifi+mobile internet [11:26:28] I'm around yeah .. you feel confident with your internet connection? if yes, pls go ahead [11:26:34] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529807 (owner: 10Alaa Sarhan) [11:26:36] awesome let's go then :) [11:26:49] +2'ed then! [11:26:57] alaa_wmde: to confirm, what is the expected behaviour we should see on the graphs with this change? [11:27:06] @marostegui I'm wathcing grafana and logstash for DB erros and fatals and wb_terms related metrics [11:27:13] alaa_wmde: cool, thanks [11:27:35] alaa_wmde: what patterns do you expect to see changes in? 
[11:27:37] (03PS1) 10Muehlenhoff: Add DNS entries for Buster puppetdb instances [dns] - 10https://gerrit.wikimedia.org/r/530835 (https://phabricator.wikimedia.org/T230609) [11:27:39] (03Merged) 10jenkins-bot: Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529807 (owner: 10Alaa Sarhan) [11:28:04] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entries for Buster puppetdb instances [dns] - 10https://gerrit.wikimedia.org/r/530835 (https://phabricator.wikimedia.org/T230609) (owner: 10Muehlenhoff) [11:28:07] alaa_wmde: can you test on mwdebug1002, or should I just sync it? [11:28:11] (03CR) 10jenkins-bot: Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529807 (owner: 10Alaa Sarhan) [11:28:44] @marostegui we might see an increase in records per second .. not quite sure what is the acceptable range though .. After reading from new store in Wikidata itself, we didn't see a really significant increase (apart from the mysterious spikes). So I'm not really expecting a significant increases anywhere really including records per seconds nor db traffic [11:28:58] @Urbanecm I should be able to test on mwdebug1002 yes [11:29:12] alaa_wmde: go ahead [11:29:40] alaa_wmde: were you guys able to figure out what those spikes were? I believe A.mir1 was thinking on some user/bot activity [11:30:11] Urbanecm: on it [11:30:16] thanks [11:30:53] marostegui: yeah probably something related to bot activity .. we are suspecting edit activity now, but haven't yet continued digging in this week [11:31:03] roger [11:33:00] alaa_wmde: btw, is there a phab task related to this change? 
[11:34:17] yeap https://phabricator.wikimedia.org/T225053 [11:34:46] thanks alaa_wmde [11:34:47] it is being re-enabled after an incident we had the first time we did this switch https://wikitech.wikimedia.org/wiki/Incident_documentation/20190807-s8-cawiki-errors [11:35:07] so I'm watching ca-wiki and ru-wiki for errors [11:35:41] thank you [11:35:59] (03PS1) 10Muehlenhoff: Add PXE config for failoid1001/failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/530838 [11:37:06] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:37:12] let me know once it's safe to deploy [11:37:38] Urbanecm: no erros nor increases in any metrics so far .. I'd sync now to get more traffic [11:37:44] syncing [11:37:48] thanks! [11:38:23] no problem [11:39:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 483691c: Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"" (T225053) (duration: 00m 48s) [11:39:13] alaa_wmde: synced! 
[11:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:14] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [11:39:17] (03PS2) 10Muehlenhoff: Add DNS entries for Buster puppetdb instances [dns] - 10https://gerrit.wikimedia.org/r/530835 (https://phabricator.wikimedia.org/T230609) [11:41:11] (03PS2) 10Muehlenhoff: Extend netboot.cfg for failoid* [puppet] - 10https://gerrit.wikimedia.org/r/530830 (https://phabricator.wikimedia.org/T229903) [11:41:47] (03CR) 10Muehlenhoff: Extend netboot.cfg for failoid* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530830 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [11:42:14] (03PS1) 10Marostegui: install_server: Do not reimage db2067 [puppet] - 10https://gerrit.wikimedia.org/r/530839 (https://phabricator.wikimedia.org/T230705) [11:42:54] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2067 [puppet] - 10https://gerrit.wikimedia.org/r/530839 (https://phabricator.wikimedia.org/T230705) (owner: 10Marostegui) [11:45:03] I see increase in fatals https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@44136fa&_a=h@222f6b1 [11:45:12] alaa_wmde: there was a spike on errors [11:45:18] alaa_wmde: should I revert? [11:45:23] Urbanecm: the fixes are not deployed [11:45:29] yes please .. it seems the fix is not back ported [11:45:33] reverting [11:45:46] thanks [11:46:45] alaa_wmde: Amir1: Is there anything I can backport to make it work? [11:46:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert 483691c (T225053) (duration: 00m 48s) [11:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:06] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [11:47:28] Two patches that are in parent/subtasks [11:47:36] Urbanecm: that'd be great .. 
I'll get you the changes [11:48:05] thanks alaa_wmde [11:48:45] Amir1: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/529113 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/529742 .. in that order [11:49:04] alaa_wmde: should I backport those two? [11:49:27] * Amir1 hides again [11:49:36] yes please .. both are the fix .. the first one missed one place to apply the fix, the second follows up on that [11:49:43] I'm sick. Can't do much. Sorry [11:50:27] Amir1: yes please go get some rest .. thanks for your quick help with this now [11:51:00] alaa_wmde: cherry-picked to wmf.17 and +2'ed [11:51:27] (03PS1) 10Urbanecm: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530843 [11:51:38] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530843 (owner: 10Urbanecm) [11:52:50] let's wait for the CI now :/ [11:54:48] sorry about that.. 
I should've double checked that they were backported (I forgot we didn't have a train last week) [11:55:01] no problem alaa_wmde :) [11:55:17] and thanks for your quick response&help :) [11:55:32] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530843 (owner: 10Urbanecm) [11:55:43] happy to help [11:55:58] alaa_wmde: job failed for https://gerrit.wikimedia.org/r/#/c/530841/ [11:56:04] mwselenium-quibble-docker [11:56:11] https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/16814/ [11:56:34] (03CR) 10jenkins-bot: Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530843 (owner: 10Urbanecm) [11:56:47] *looking* [11:57:22] thank you alaa_wmde [11:57:43] oh no seems like another flaky test :D gonna recheck to see if it passes then [11:57:58] I've no idea if it's a flaky test or an issue :) [11:58:05] reporting it just to be sure [11:58:46] we will be done with swat in a minute .. so we'll probably reschedule the switch for tomorrow [11:58:59] jouncebot: next [11:58:59] In 5 hour(s) and 1 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T1700) [11:59:03] (03PS1) 10Elukey: Set an-tool1007 as buster node [puppet] - 10https://gerrit.wikimedia.org/r/530844 [11:59:10] re the backport .. what should happen once the two are merged? 
[11:59:13] we probably can go outside of swat, there's nothing scheduled [11:59:17] I can simply remove the +2 [11:59:22] in that case, it won't be merged [11:59:25] if you want to reschedule [12:00:37] alaa_wmde: ^^ [12:01:27] alaa_wmde: wikibase-client-docker also failed [12:01:35] for https://gerrit.wikimedia.org/r/530842 [12:01:36] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) [12:01:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC pretty happy at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/260/console, merging" [puppet] - 10https://gerrit.wikimedia.org/r/530712 (https://phabricator.wikimedia.org/T113114) (owner: 10Alexandros Kosiaris) [12:01:50] (03PS6) 10Alexandros Kosiaris: mediawiki:errorpage: Make content default undef [puppet] - 10https://gerrit.wikimedia.org/r/530712 (https://phabricator.wikimedia.org/T113114) [12:01:58] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] mediawiki:errorpage: Make content default undef [puppet] - 10https://gerrit.wikimedia.org/r/530712 (https://phabricator.wikimedia.org/T113114) (owner: 10Alexandros Kosiaris) [12:02:04] removed +2s [12:02:06] okay so it might not just a flaky test .. or that we need both fixes to be present to make tests pass [12:02:25] let's reschedule for a diff window [12:02:26] yeap thanks .. 
I'll check again and reschedule for later [12:02:28] mmm ci seems really slow [12:02:35] !log EU SWAT done [12:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:46] (03CR) 10Elukey: [C: 03+2] Set an-tool1007 as buster node [puppet] - 10https://gerrit.wikimedia.org/r/530844 (owner: 10Elukey) [12:02:56] happy to help alaa_wmde [12:03:06] (03PS2) 10Elukey: Set an-tool1007 as buster node [puppet] - 10https://gerrit.wikimedia.org/r/530844 [12:03:06] (03CR) 10Elukey: [V: 03+2 C: 03+2] Set an-tool1007 as buster node [puppet] - 10https://gerrit.wikimedia.org/r/530844 (owner: 10Elukey) [12:04:12] Urbanecm: marostegui thanks .. until next try (hopefully last one) [12:04:25] happy to help [12:04:32] going to board my flight [12:04:38] see you all later! [12:04:46] have a nice flight! [12:05:58] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [12:07:02] Thanks [12:22:32] 10Operations, 10Analytics, 10vm-requests: VM request to swap analytics-tool1002 with its equivalent on buster - https://phabricator.wikimedia.org/T230711 (10elukey) 05Open→03Resolved [12:37:40] (03CR) 10Vgutierrez: [C: 03+2] ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [12:37:51] (03PS23) 10Vgutierrez: ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [12:38:54] !log depooling cp5001 prior to ats-tls deployment [12:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:48] 10Operations, 10DBA: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) [12:42:38] 10Operations, 10DBA: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) 
p:05Triage→03Normal [12:42:53] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [12:43:57] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2049 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530846 (https://phabricator.wikimedia.org/T228258) [12:47:35] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2049 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530846 (https://phabricator.wikimedia.org/T228258) (owner: 10Marostegui) [12:48:19] (03PS1) 10Vgutierrez: ATS: include /var/cache/ocsp in the list of ReadWritePaths [puppet] - 10https://gerrit.wikimedia.org/r/530848 (https://phabricator.wikimedia.org/T221594) [12:48:33] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2049 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530846 (https://phabricator.wikimedia.org/T228258) (owner: 10Marostegui) [12:48:56] PROBLEM - Freshness of OCSP Stapling files on cp5001 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [12:48:58] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2049 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530846 (https://phabricator.wikimedia.org/T228258) (owner: 10Marostegui) [12:49:12] ^^ that's me, kinda expected :) [12:50:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2049 from config T230721 (duration: 00m 48s) [12:50:04] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] T230721: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 [12:50:44] PROBLEM - Ensure traffic_manager binds on 8443 and 
responds to HTTP requests on cp5001 is CRITICAL: connect to address 10.132.0.101 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:51:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2049 from config T230721 (duration: 00m 48s) [12:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:34] (03CR) 10Vgutierrez: [C: 03+2] ATS: include /var/cache/ocsp in the list of ReadWritePaths [puppet] - 10https://gerrit.wikimedia.org/r/530848 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [12:54:05] (03CR) 10Ema: [C: 03+1] ATS: include /var/cache/ocsp in the list of ReadWritePaths [puppet] - 10https://gerrit.wikimedia.org/r/530848 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [12:54:42] PROBLEM - Ensure traffic_manager is running for instance tls on cp5001 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:55:08] RECOVERY - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp5001 is OK: HTTP OK: HTTP/1.1 200 Ok - 29648 bytes in 3.867 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:57:36] PROBLEM - Check systemd state on cp5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:46] PROBLEM - Ensure traffic_server is running for instance tls on cp5001 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:59:48] (03PS1) 10Vgutierrez: ATS: Allow ATS unit to write on sysconfdir if OCSP is enabled [puppet] - 10https://gerrit.wikimedia.org/r/530849 (https://phabricator.wikimedia.org/T221594) [13:00:27] (03CR) 10Alex Monk: [C: 04-1] "see comment" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [13:00:48] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp5001 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:00:57] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: upgrade to 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) [13:01:03] 10Operations, 10DBA: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) [13:01:47] (03CR) 10Gehel: [C: 04-1] "The change in ferm rules should be split from the LVS configuration." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [13:01:51] (03CR) 10Vgutierrez: "> Patch Set 2:" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [13:02:17] ACKNOWLEDGEMENT - Check systemd state on cp5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Vgutierrez T221594 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:17] ACKNOWLEDGEMENT - Ensure traffic_manager is running for instance tls on cp5001 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined Vgutierrez T221594 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:17] ACKNOWLEDGEMENT - Ensure traffic_server is running for instance tls on cp5001 is CRITICAL: NRPE: Command check_traffic_server_tls not defined Vgutierrez T221594 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:17] ACKNOWLEDGEMENT - Ensure trafficserver_exporter is running for instance tls on cp5001 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined Vgutierrez T221594 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:17] ACKNOWLEDGEMENT - Freshness of OCSP Stapling files on cp5001 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined Vgutierrez T221594 https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [13:02:28] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:23] (03CR) 10Alex Monk: [C: 03+2] ocsp: Provide basic test coverage [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [13:03:37] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) - https://phabricator.wikimedia.org/T222102 (10fgiunchedi) I've imported a Prometheus dashboard with 2.x stats and replaced the previous one: https://grafana.wikime... 
[13:07:25] (CR) Jbond: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/530835 (https://phabricator.wikimedia.org/T230609) (owner: Muehlenhoff)
[13:11:10] Operations, media-storage, observability: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (fgiunchedi) FWIW this should be slightly less noisy since individual swift daemons won't produce alerts anymore as per https://gerrit.wikimedia.org/r/c/operations/puppet...
[13:12:50] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5001 is CRITICAL: NRPE: Command check_check_trafficserver_log_fifo_tls_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:14:04] RECOVERY - Check systemd state on cp5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:19] (CR) Vgutierrez: [C: +2] ATS: Allow ATS unit to write on sysconfdir if OCSP is enabled [puppet] - https://gerrit.wikimedia.org/r/530849 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[13:14:48] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5001 is CRITICAL: connect to address 10.132.0.101 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:14:50] PROBLEM - check_trafficserver_tls_config_status on cp5001 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:15:54] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5001 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:16:20] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5001 is OK: HTTP OK: HTTP/1.0 200 OK - 10933 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:16:22] RECOVERY - Ensure traffic_manager is running for instance tls on cp5001 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:16:24] RECOVERY - check_trafficserver_tls_config_status on cp5001 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:16:30] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp5001 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:17:16] RECOVERY - Freshness of OCSP Stapling files on cp5001 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates
[13:18:48] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:28:31] (PS1) Vgutierrez: ATS: Fix traffic_server --run-root parameter value [puppet] - https://gerrit.wikimedia.org/r/530853 (https://phabricator.wikimedia.org/T221594)
[13:31:40] (PS2) Vgutierrez: ATS: Fix traffic_server --run-root parameter value in check_procs check [puppet] - https://gerrit.wikimedia.org/r/530853 (https://phabricator.wikimedia.org/T221594)
[13:33:30] (CR) Vgutierrez: [C: +2] ATS: Fix traffic_server --run-root parameter value in check_procs check [puppet] - https://gerrit.wikimedia.org/r/530853 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[13:43:13] (PS1) Vgutierrez: ATS: Fix non-default instance traffic_server path in check_procs check [puppet] - https://gerrit.wikimedia.org/r/530855 (https://phabricator.wikimedia.org/T221594)
[13:46:26] (PS1) Mathew.onipe: wdqs: restrict port 8888 to analytics networks [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875)
[13:46:47] (PS2) Vgutierrez: ATS: Fix non-default instance traffic_server path in check_procs check [puppet] - https://gerrit.wikimedia.org/r/530855 (https://phabricator.wikimedia.org/T221594)
[13:47:44] (CR) Elukey: wdqs: restrict port 8888 to analytics networks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[13:48:44] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:51:16] ^^ wdqs_80 looks like it is timing out on pybal checks: Fetch failed (http://localhost/readiness-probe), 5.001 s
[13:52:32] (CR) Vgutierrez: [C: +2] ATS: Fix non-default instance traffic_server path in check_procs check [puppet] - https://gerrit.wikimedia.org/r/530855 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[13:53:26] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[13:56:20] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:56:32] RECOVERY - Ensure traffic_server is running for instance tls on cp5001 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:57:03] !log repooling cp5001
[13:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:56] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:59:54] * gehel is looking at wdqs
[14:01:09] yep, looks like wdqs eqiad (public endpoint) is struggling
[14:01:30] (PS2) Mathew.onipe: wdqs: restrict port 8888 to analytics networks [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875)
[14:01:32] (PS6) Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875)
[14:02:22] Operations, Puppet, User-herron: Improve puppet alerting - https://phabricator.wikimedia.org/T178628 (fgiunchedi)
[14:03:55] Operations, DBA, observability: Generate instance list of active database hosts to be monitored from prometheus - https://phabricator.wikimedia.org/T145072 (fgiunchedi) Open→Resolved a:fgiunchedi This happened in parent task!
[14:03:59] Operations, DBA, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (fgiunchedi)
[14:05:26] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 25552 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[14:05:53] wdqs overload does not seem to be write related, so most probably a bot, or some kind of expensive queries
[14:06:28] Operations, Puppet: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (fgiunchedi) cc @EBernhardson this is likely one of the problems you have been experiencing with prometheus' web interface (i.e. dropdown/autocomplete is slow because of many metric names)
[14:07:01] Operations, Puppet, User-fgiunchedi: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (fgiunchedi)
[14:07:44] (CR) Mathew.onipe: wdqs: restrict port 8888 to analytics networks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[14:12:00] (PS1) Mathew.onipe: admin: add Erik and David to wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/530857
[14:16:05] (PS1) Volans: Initial structure of the project [software/homer] - https://gerrit.wikimedia.org/r/530860 (https://phabricator.wikimedia.org/T228388)
[14:16:07] (PS1) Volans: Initial draft of the CLI [software/homer] - https://gerrit.wikimedia.org/r/530861 (https://phabricator.wikimedia.org/T228388)
[14:16:09] (PS1) Volans: Initial draft of devices configuration parsing [software/homer] - https://gerrit.wikimedia.org/r/530862 (https://phabricator.wikimedia.org/T228388)
[14:20:07] (CR) Gehel: [C: +1] "LGTM, and we do want Erik and David to help with WDQS." [puppet] - https://gerrit.wikimedia.org/r/530857 (owner: Mathew.onipe)
[14:22:10] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[14:22:22] (CR) Gehel: [C: +1] wdqs: restrict port 8888 to analytics networks [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[14:25:20] (PS2) Gehel: admin: add Erik and David to wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/530857 (owner: Mathew.onipe)
[14:25:51] going to move a few wikimedia-logstash tasks to observability, there shouldn't be notifications but mentioning it here just in case
[14:26:26] (CR) Gehel: [C: +2] admin: add Erik and David to wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/530857 (owner: Mathew.onipe)
[14:26:36] Operations, Recommendation-API, Wikimedia-Logstash, observability, and 3 others: Move recommendation-api logging to new logging pipeline - https://phabricator.wikimedia.org/T219926 (fgiunchedi)
[14:26:46] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work), Patch-For-Review: Migrate mjolnir to stdout/syslog/cee logging output - https://phabricator.wikimedia.org/T218833 (fgiunchedi)
[14:27:15] ggrrr I did mute the batch job
[14:27:21] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work), Patch-For-Review: Kibana fails to load when using short URLs to share dashboard - https://phabricator.wikimedia.org/T192279 (fgiunchedi)
[14:27:27] Operations, Wikimedia-Logstash, observability: rsyslog on mw1180 seems to not use the logstash LVS endpoint - https://phabricator.wikimedia.org/T177833 (fgiunchedi)
[14:27:29] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work), Patch-For-Review: Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304 (fgiunchedi)
[14:27:34] Operations, Traffic, Wikimedia-Logstash, observability, and 2 others: RESTBase logs disappeared from logstash - https://phabricator.wikimedia.org/T178078 (fgiunchedi)
[14:27:48] clearly doesn't work as expected
[14:27:51] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137 (fgiunchedi)
[14:28:05] Operations, Elasticsearch, Wikimedia-Logstash, observability, Wikimedia-production-error: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388 (fgiunchedi)
[14:28:14] Operations, Wikimedia-Logstash, observability: Log event rate types from Logstash to graphite - https://phabricator.wikimedia.org/T141784 (fgiunchedi)
[14:29:00] Operations, Discovery, Elasticsearch, Wikimedia-Logstash, and 3 others: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 (fgiunchedi)
[14:29:17] Operations, MediaWiki-Debug-Logger, Wikimedia-Logstash, observability, Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692 (fgiunchedi)
[14:29:19] Operations, Discovery-Search, Elasticsearch, Icinga, and 2 others: Remove elasticsearch icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691 (fgiunchedi)
[14:29:22] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work), Patch-For-Review: Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (fgiunchedi)
[14:29:24] Operations, Wikimedia-Logstash, observability: logstash stuck on its persistent queue - https://phabricator.wikimedia.org/T212640 (fgiunchedi)
[14:29:26] Operations, Wikimedia-Logstash, observability: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (fgiunchedi)
[14:29:28] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work), Patch-For-Review: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (fgiunchedi)
[14:29:33] Operations, Wikimedia-Logstash, observability, Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219 (fgiunchedi)
[14:29:37] Operations, MediaWiki-Debug-Logger, Wikimedia-Logstash, observability, and 9 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (fgiunchedi)
[14:29:39] Operations, Discovery, Elasticsearch, Wikimedia-Logstash, and 3 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933 (fgiunchedi)
[14:29:41] Operations, ChangeProp, RESTBase, Wikimedia-Logstash, and 3 others: RB and CP logs disappeared from Logstash - https://phabricator.wikimedia.org/T179058 (fgiunchedi)
[14:29:44] Operations, Discovery, Elasticsearch, Wikimedia-Logstash, and 4 others: api feature logs should be sent to both eqiad and codfw clusters - https://phabricator.wikimedia.org/T176430 (fgiunchedi)
[14:29:48] Operations, ops-eqiad, Wikimedia-Logstash, observability, Discovery-Search (Current work): Failed disk on logstash1006 - https://phabricator.wikimedia.org/T173689 (fgiunchedi)
[14:29:50] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266 (fgiunchedi)
[14:30:02] well, sorry for the spam
[14:30:02] Operations, Wikimedia-Logstash, observability, Discovery-Search (Current work): Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357 (fgiunchedi)
[14:30:05] Operations, Parsoid, Services, Wikimedia-Logstash, and 3 others: Parsoid's service-runner events not showing up in Kibana since 2016-07-28 21:00 UTC - https://phabricator.wikimedia.org/T141776 (fgiunchedi)
[14:30:10] Operations, Services, Wikimedia-Logstash, observability: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384 (fgiunchedi)
[14:30:14] Operations, Wikimedia-Logstash, observability: Systemd unit for logstash - https://phabricator.wikimedia.org/T127677 (fgiunchedi)
[14:30:16] Operations, Wikimedia-Logstash, observability: Update Elasticsearch on logstash* to elasticsearch-1.7.0.deb - https://phabricator.wikimedia.org/T106126 (fgiunchedi)
[14:31:01] godog: my phone had an anti-gravity field for a few seconds :-P
[14:31:45] volans: lol
[14:33:50] 80% done btw
[14:34:16] my inbox is at 319 and counting :D
[14:34:50] Operations, Wikimedia-Logstash, observability: Set up a service IP for logstash - https://phabricator.wikimedia.org/T113104 (fgiunchedi)
[14:34:58] Operations, Wikimedia-Logstash, observability: Import logstash 1.5.3 into apt.wm.o - https://phabricator.wikimedia.org/T107916 (fgiunchedi)
[14:35:00] Operations, Wikimedia-Logstash, observability: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121 (fgiunchedi)
[14:35:05] hehehe done now
[14:35:17] Operations, Discovery, Discovery-Search, Elasticsearch, and 2 others: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042 (fgiunchedi)
[14:35:19] Operations, Wikimedia-Logstash, observability: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine - https://phabricator.wikimedia.org/T97297 (fgiunchedi)
[14:35:21] Operations, Wikimedia-Logstash, observability, Patch-For-Review: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545 (fgiunchedi)
[14:35:21] godog: tell the truth, you want to win in the Phab stats email :-P
[14:35:23] Operations, Discovery, Discovery-Search, Elasticsearch, and 3 others: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889 (fgiunchedi)
[14:35:26] Operations, MediaWiki-Debug-Logger, Wikimedia-Logstash, observability: Upgrade RAM for logstash100[123] to 64G - https://phabricator.wikimedia.org/T87078 (fgiunchedi)
[14:35:29] Operations, MediaWiki-Debug-Logger, Wikimedia-Logstash, hardware-requests, observability: purchase 3 additional logstash nodes - https://phabricator.wikimedia.org/T89402 (fgiunchedi)
[14:35:31] Operations, Wikimedia-Logstash, observability, Epic: Improve logstash - https://phabricator.wikimedia.org/T84895 (fgiunchedi)
[14:35:34] Operations, Wikimedia-Logstash, observability: Convert Hadoop-Logstash logging to use Redis to address failures - https://phabricator.wikimedia.org/T85015 (fgiunchedi)
[14:35:46] RECOVERY - Host elastic2050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms
[14:35:48] volans: indeed! top phabricator spammer
[14:35:51] Operations, Wikimedia-Logstash, observability, Patch-For-Review, WorkType-NewFunctionality: gdash reports for php/apache errors - https://phabricator.wikimedia.org/T81030 (fgiunchedi)
[14:36:18] 343 is my final counter
[14:36:21] Operations, Traffic, Wikimedia-Logstash, observability, Patch-For-Review: Add varnish logs to logstash - https://phabricator.wikimedia.org/T63782 (fgiunchedi)
[14:36:42] Operations, Wikimedia-Logstash, observability: Add puppet logs to logstash - https://phabricator.wikimedia.org/T62690 (fgiunchedi)
[14:36:51] Operations, Thumbor, Wikimedia-Logstash, observability, and 2 others: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (fgiunchedi)
[14:36:56] Operations, Wikimedia-Logstash, observability, Core Platform Team Legacy (Watching / External), Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (fgiunchedi)
[14:37:12] Operations, Discovery, Elasticsearch, MediaWiki-Vendor, and 4 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831 (fgiunchedi)
[14:37:20] (PS7) Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875)
[14:37:22] (PS3) Mathew.onipe: wdqs: restrict port 8888 to analytics networks [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875)
[14:39:21] sounds about right
[14:40:53] (PS3) Ema: profile::tlsproxy::instance: do not autostart nginx [puppet] - https://gerrit.wikimedia.org/r/530578
[14:42:08] (CR) Ema: [C: +2] profile::tlsproxy::instance: do not autostart nginx [puppet] - https://gerrit.wikimedia.org/r/530578 (owner: Ema)
[14:51:59] (PS1) Umherirrender: Add run mode to $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711)
[14:53:48] (CR) Umherirrender: [C: -1] "Waiting for core change merge + one train to avoid reverts/rollback issues" [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[14:57:35] Operations, SRE-tools, netbox, netops, User-crusnov: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (crusnov) Open→Resolved
[15:05:03] Operations, ops-codfw, DC-Ops, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (Papaul) Open→Resolved Upgrade firmware as well Before BIOS Version 1.5.6 iDRAC Firmware Version 3.21.21.21 After BIOS Version 2.2.11 iDRAC Firmware...
[15:09:05] papaul: thanks!
[15:14:08] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:31] ^^ that's me
[15:15:40] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:20] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=ulsfo https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:19:38] PROBLEM - Freshness of OCSP Stapling files on cp4021 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates
[15:20:50] PROBLEM - Ensure traffic_server is running for instance tls on cp4021 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:24:48] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp4021 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:27:47] (CR) Arlolra: [C: +1] Update parsoid-rt-client.config.js.erb to fetch test ids from a function [puppet] - https://gerrit.wikimedia.org/r/529391 (https://phabricator.wikimedia.org/T230166) (owner: Subramanya Sastry)
[15:29:38] (PS1) Marostegui: db2067: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/530880 (https://phabricator.wikimedia.org/T230705)
[15:30:56] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:56] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4021 is CRITICAL: connect to address 10.128.0.121 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:37:46] (PS1) Thcipriani: blubberoid: update base chart for "helm test" [deployment-charts] - https://gerrit.wikimedia.org/r/530881
[15:38:56] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp4021 is CRITICAL: NRPE: Command check_check_trafficserver_log_fifo_tls_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:39:23] (CR) Giuseppe Lavagetto: [C: +1] "See my inline comment: the rewriterules seem correct, but I'm a bit worried about unintended consequences of the reduction in performance " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) (owner: Jbond)
[15:40:58] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp4021 is CRITICAL: connect to address 10.128.0.121 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:41:00] PROBLEM - check_trafficserver_tls_config_status on cp4021 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:44:56] PROBLEM - Ensure traffic_manager is running for instance tls on cp4021 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:46:23] Operations, MediaWiki-extensions-CentralAuth, TimedMediaHandler, Traffic, and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (BBlack) >>! In T226840#5372248, @Tgr wrote: >>>! In T226840#5366460,...
[15:47:19] (CR) Marostegui: [C: +2] db2067: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/530880 (https://phabricator.wikimedia.org/T230705) (owner: Marostegui)
[15:47:41] Operations, MediaWiki-extensions-CentralAuth, TimedMediaHandler, Traffic, and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (BBlack) (Also, is the specific TMH fix actually deployed to all group...
[15:48:41] Operations, Analytics: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (Neil_P._Quinn_WMF) Let me just support Maya's request here. I work primarily in JupyterLab, but I still use Hue frequently for various things: * Running quick queries or exploring the Data Lake (since Hue has...
[15:55:48] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:10] RECOVERY - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp4021 is OK: HTTP OK: HTTP/1.1 200 Ok - 29742 bytes in 0.381 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:59:27] Operations, observability: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733 (fgiunchedi)
[16:02:15] Operations, observability: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733 (BBlack) I'd start with the conftool stuff before moving on to anything that tracks gdnsd's `admin_state` -driven things. That whole mechanism is likely to be rep...
[16:02:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:04:24] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[16:06:59] (PS1) Vgutierrez: ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594)
[16:07:05] (PS1) Vgutierrez: ATS: Only monitor OCSP Stapling freshness for acme_chief if it's being used [puppet] - https://gerrit.wikimedia.org/r/530887 (https://phabricator.wikimedia.org/T221594)
[16:11:34] Operations, ops-eqiad, DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (Gehel)
[16:12:16] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) - https://phabricator.wikimedia.org/T226782 (Gehel)
[16:13:35] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (Gehel)
[16:15:11] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (Gehel)
[16:15:48] Operations, Performance-Team, Traffic, Patch-For-Review, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (Gilles) Nothing left to do here, I believe.
[16:16:04] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10Gehel) [16:16:57] (03PS2) 10Vgutierrez: ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) [16:17:25] (03CR) 10jerkins-bot: [V: 04-1] ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [16:17:28] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Gehel) [16:18:14] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Gehel) [16:19:53] (03PS1) 10EBernhardson: Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 [16:20:11] (03PS3) 10Vgutierrez: ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) [16:21:35] (03CR) 10jerkins-bot: [V: 04-1] Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 (owner: 10EBernhardson) [16:21:39] (03CR) 10jerkins-bot: [V: 04-1] ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [16:22:58] (03PS4) 10Vgutierrez: ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) [16:23:28] (03CR) 10jerkins-bot: [V: 04-1] ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) 
(owner: 10Vgutierrez) [16:24:29] (03PS2) 10EBernhardson: Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 [16:25:44] (03PS5) 10Vgutierrez: ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) [16:26:36] (03CR) 10Gehel: Add -XX:NewRatio=3 for cloudelastic-chi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530889 (owner: 10EBernhardson) [16:27:35] (03PS2) 10Vgutierrez: ATS: Only monitor OCSP Stapling freshness for acme_chief if it's being used [puppet] - 10https://gerrit.wikimedia.org/r/530887 (https://phabricator.wikimedia.org/T221594) [16:30:38] (03CR) 10Ema: [C: 03+1] ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [16:35:49] (03CR) 10Vgutierrez: [C: 03+2] ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [16:35:58] (03PS6) 10Vgutierrez: ATS: Only allow writing on /etc/acmecerts if acme_chief is being used [puppet] - 10https://gerrit.wikimedia.org/r/530886 (https://phabricator.wikimedia.org/T221594) [16:39:04] (03PS3) 10EBernhardson: Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 [16:39:45] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4021 is OK: HTTP OK: HTTP/1.0 200 OK - 10936 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:39:51] RECOVERY - Ensure traffic_manager is running for instance tls on cp4021 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:39:53] RECOVERY - 
check_trafficserver_log_fifo_tls_tls on cp4021 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:39:59] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp4021 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:40:29] RECOVERY - Freshness of OCSP Stapling files on cp4021 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [16:40:41] RECOVERY - Ensure traffic_server is running for instance tls on cp4021 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:40:43] RECOVERY - check_trafficserver_tls_config_status on cp4021 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:41:18] (03CR) 10EBernhardson: "PCC looks as expected: https://puppet-compiler.wmflabs.org/compiler1002/17950/" [puppet] - 10https://gerrit.wikimedia.org/r/530889 (owner: 10EBernhardson) [16:42:23] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:44:34] (03PS3) 10Vgutierrez: ATS: Only monitor OCSP Stapling freshness for acme_chief if it's being used [puppet] - 10https://gerrit.wikimedia.org/r/530887 (https://phabricator.wikimedia.org/T221594) [16:45:22] !log pool elastic2050. 
mgmt issue has been resolved - T230597 [16:45:28] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, 10Traffic: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (10Gilles) [16:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:32] T230597: can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 [16:51:09] (03PS4) 10Vgutierrez: ATS: Only monitor OCSP Stapling freshness for acme_chief if it's being used [puppet] - 10https://gerrit.wikimedia.org/r/530887 (https://phabricator.wikimedia.org/T221594) [16:51:46] 10Operations, 10Jade, 10TechCom, 10Core Platform Team Legacy (Watching / External), and 4 others: Deploy Jade extension MVP to production - https://phabricator.wikimedia.org/T183381 (10Halfak) [16:52:00] PROBLEM - Check systemd state on cp3034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:00] (03PS1) 10Elukey: profile::analytics::refinery::job::druid_load: add more dim to netflow [puppet] - 10https://gerrit.wikimedia.org/r/530893 (https://phabricator.wikimedia.org/T229682) [16:54:12] RECOVERY - Check systemd state on cp3034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:25] (03CR) 10Vgutierrez: [C: 03+2] ATS: Only monitor OCSP Stapling freshness for acme_chief if it's being used [puppet] - 10https://gerrit.wikimedia.org/r/530887 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [16:55:36] (03PS5) 10Vgutierrez: ATS: Only monitor OCSP Stapling freshness for acme_chief if it's being used [puppet] - 10https://gerrit.wikimedia.org/r/530887 (https://phabricator.wikimedia.org/T221594) [17:00:04] gehel and onimisionipe: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T1700). [17:00:21] jouncebot: yep! [17:03:10] PROBLEM - Freshness of OCSP Stapling files on cp5003 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:03:35] that's expected... puppet needs to run on icinga1001 [17:04:00] (03CR) 10Cwhite: [C: 03+2] icinga: disable autocomplete.js in icinga search text input [puppet] - 10https://gerrit.wikimedia.org/r/528586 (owner: 10Cwhite) [17:04:08] (03PS5) 10Cwhite: icinga: disable autocomplete.js in icinga search text input [puppet] - 10https://gerrit.wikimedia.org/r/528586 [17:04:43] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@2d36896]: Fix Blazegraph dictionary mixup [17:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:09:53] RECOVERY - Freshness of OCSP Stapling files on cp5003 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:12:45] PROBLEM - Freshness of OCSP Stapling files on cp4026 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:16:18] PROBLEM - Freshness of OCSP Stapling files on cp5004 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:17:29] RECOVERY - Freshness of OCSP Stapling files on cp5004 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:17:36] !log restarting icinga to disable UI autocomplete [17:17:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:43] PROBLEM - Freshness of OCSP Stapling files on cp4021 is CRITICAL: NRPE: Command check_trafficserver_tls_ocsp_freshness_acme_chief not defined https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:20:13] RECOVERY - Freshness of OCSP Stapling files on cp4021 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:20:41] RECOVERY - Freshness of OCSP Stapling files on cp4026 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:22:30] shdubsh: did you downtime the external check? :) [17:23:01] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@2d36896]: Fix Blazegraph dictionary mixup (duration: 18m 18s) [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:23:23] (03CR) 10Ayounsi: [C: 03+1] "Fields list looks correct." [puppet] - 10https://gerrit.wikimedia.org/r/530893 (https://phabricator.wikimedia.org/T229682) (owner: 10Elukey) [17:24:14] volans: I did not. I received no message/page though. [17:24:46] if the restart is quick enough it's ok [17:26:19] I can't find any mention of external downtiming on the icinga wikitech page. Maybe I missed it? Sorry :( [17:27:00] https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring [17:27:19] and [17:27:20] https://wikitech.wikimedia.org/wiki/Service_restarts#Icinga [17:27:53] but it did restart in time [17:27:53] [ERROR] Check for host icinga1001.wikimedia.org (1/3) [17:27:59] [ERROR] Check for host icinga1001.wikimedia.org (2/3) failed [17:28:03] and the third was ok [17:28:10] did you do 2001 too? [17:30:22] Not yet. Will do soon. 
[17:30:34] ssh wikitech-static.wikimedia.org does not let me in [17:31:39] 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10BBlack) 05Open→03Stalled a:03Vgutierrez There's perhaps a faulty implicit assumption here that we desire to use one cert f... [17:32:33] password is in pwstore [17:34:06] Works. Thanks :) [17:34:14] yw :) [17:53:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.797e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [17:54:57] ^^ looks like consumer lag is on its way down now [17:56:39] !log freeze cloudelastic writes to let prod clear 30 min backlog [17:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:09:43] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/530893 (https://phabricator.wikimedia.org/T229682) (owner: 10Elukey) [18:19:35] * Urbanecm has stuff to deploy [18:20:01] 10Operations, 10Core Platform Team, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Anomie) [18:21:07] (03CR) 10Urbanecm: [C: 03+2] Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:21:12] (03PS4) 10Urbanecm: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) [18:21:18] (03CR) 10Urbanecm: [C: 03+2] Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:22:56] (03Merged) 10jenkins-bot: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:23:14] (03CR) 10jenkins-bot: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:26:57] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 0a87e3c: Assign all rights assigned to suppress group to oversight group (T230601) (duration: 00m 48s) [18:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:08] T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601 [18:28:43] (03CR) 10Urbanecm: [C: 03+2] Add `WS` and `CAT` as aliases for 
zhwikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530413 (https://phabricator.wikimedia.org/T230548) (owner: 10DannyS712) [18:29:44] (03Merged) 10jenkins-bot: Add `WS` and `CAT` as aliases for zhwikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530413 (https://phabricator.wikimedia.org/T230548) (owner: 10DannyS712) [18:30:02] (03CR) 10jenkins-bot: Add `WS` and `CAT` as aliases for zhwikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530413 (https://phabricator.wikimedia.org/T230548) (owner: 10DannyS712) [18:31:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: b21bbc0: Add `WS` and `CAT` as aliases for zhwikisource namespaces (T230548) (duration: 00m 47s) [18:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:43] T230548: Shortcut namespace redirect on zhwikisource - https://phabricator.wikimedia.org/T230548 [18:32:22] (03PS2) 10Urbanecm: Fix zhwikisource wgExtraNamespaces entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530398 (https://phabricator.wikimedia.org/T230294) [18:32:30] 10Operations, 10Core Platform Team, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Anomie) >>! In T226840#5422259, @BBlack wrote: > Is there... 
[18:32:31] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530398 (https://phabricator.wikimedia.org/T230294) (owner: 10Urbanecm) [18:35:00] (03Merged) 10jenkins-bot: Fix zhwikisource wgExtraNamespaces entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530398 (https://phabricator.wikimedia.org/T230294) (owner: 10Urbanecm) [18:35:16] (03CR) 10jenkins-bot: Fix zhwikisource wgExtraNamespaces entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530398 (https://phabricator.wikimedia.org/T230294) (owner: 10Urbanecm) [18:36:26] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 26317c7: Fix zhwikisource wgExtraNamespaces entry (T230294) (duration: 00m 48s) [18:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:34] T230294: Add Portal namespace on Chinese Wikisource - https://phabricator.wikimedia.org/T230294 [18:37:47] (03PS2) 10Urbanecm: Add some HIDPI Wikivoyage logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529464 (https://phabricator.wikimedia.org/T230114) (owner: 10Jc86035) [18:37:54] (03CR) 10Urbanecm: [C: 03+2] Add some HIDPI Wikivoyage logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529464 (https://phabricator.wikimedia.org/T230114) (owner: 10Jc86035) [18:39:59] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 3 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:40:37] (03CR) 10jerkins-bot: [V: 04-1] Add some HIDPI Wikivoyage logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529464 (https://phabricator.wikimedia.org/T230114) (owner: 10Jc86035) [18:40:38] (03CR) 10jerkins-bot: [V: 04-1] Add some HIDPI Wikivoyage logos [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/529464 (https://phabricator.wikimedia.org/T230114) (owner: 10Jc86035) [18:41:33] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10wiki_willy) @Marostegui - I'll defer to Faidon or Mark for their opinion, but my suggestion is to go ahead and fail out in advance if it's not too much of a hassle. The success r... [18:44:16] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10wiki_willy) @Marostegui - I would say just go for it and fail out in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to just take the extra... [18:46:11] (03PS1) 10Urbanecm: Raise rollback limit for all users to 100/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530945 (https://phabricator.wikimedia.org/T228708) [18:46:22] (03CR) 10Urbanecm: [C: 03+2] Raise rollback limit for all users to 100/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530945 (https://phabricator.wikimedia.org/T228708) (owner: 10Urbanecm) [18:48:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Raise rollback limit for all groups (T228708) (duration: 00m 48s) [18:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:49] T228708: Raise rollback limit for all groups from 10/60 to at least 50/60 - https://phabricator.wikimedia.org/T228708 [18:48:57] (03Merged) 10jenkins-bot: Raise rollback limit for all users to 100/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530945 (https://phabricator.wikimedia.org/T228708) (owner: 10Urbanecm) [18:49:13] (03CR) 10jenkins-bot: Raise rollback limit for all users to 100/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530945 (https://phabricator.wikimedia.org/T228708) (owner: 10Urbanecm) [18:57:32] !log Morning SWaT done [18:57:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:38] 10Operations, 10ops-eqiad: rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10RobH) p:05Triage→03Normal [18:58:45] 10Operations, 10ops-eqiad: rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10RobH) [19:00:36] 10Operations, 10hardware-requests, 10Discovery-Search (Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10RobH) 05Open→03Resolved a:05RobH→03None Please note that this hardware was ordered on T226843 and will be installed via T230746. As such, this request is reso... [19:34:05] (03PS2) 10Herron: prometheus: add prometheus ipsec exporter service & config [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [19:36:55] (03CR) 10Herron: prometheus: add prometheus ipsec exporter service & config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [19:37:25] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add prometheus ipsec exporter service & config [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [19:39:42] (03PS3) 10Herron: prometheus: add prometheus ipsec exporter service & config [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [19:45:20] (03PS1) 10DCausse: [elastic] log slow index ops to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/530950 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T2000). 
[20:09:10] (03PS2) 10Herron: prometheus-ipsec-exporter: initial commit of version 0.3.1 [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) [20:09:35] (03CR) 10Herron: prometheus-ipsec-exporter: initial commit of version 0.3.1 (032 comments) [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [20:25:30] 10Operations, 10ops-eqiad: rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10EBernhardson) > Try to evenly space out elastic nodes in the row evenly in 1G racks. All new elastic servers are coming in with 10G cards and should go into 10G racks. [20:27:57] 10Operations, 10ops-eqiad: rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10RobH) [20:28:17] 10Operations, 10ops-eqiad: rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10RobH) >>! In T230746#5423035, @EBernhardson wrote: >> Try to evenly space out elastic nodes in the row evenly in 1G racks. > > All new elastic servers are coming in with 10G cards and shou... 
[20:32:17] 10Operations, 10observability: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733 (10CDanis) p:05Triage→03Normal [20:32:18] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10CDanis) p:05Triage→03Normal [20:32:40] 10Operations, 10observability, 10User-CDanis: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733 (10CDanis) [20:58:39] (03PS1) 10EBernhardson: Mjolnir bulk daemon to read from jumbo-eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/530966 [20:59:06] (03PS2) 10EBernhardson: Mjolnir bulk daemon to read from jumbo-eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/530966 [21:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T2100). 
[21:15:09] (03PS1) 10EBernhardson: Remove PrimateTmp=true from elasticsearch_6@ systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530969 [21:25:19] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.791e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:27:56] (03PS1) 10Gehel: elasticsearch: remove PrivateTmp from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530973 [21:29:05] (03Abandoned) 10Gehel: elasticsearch: remove PrivateTmp from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530973 (owner: 10Gehel) [22:00:55] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:02:23] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:16:45] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 130.9 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [22:44:37] (03CR) 10Ayounsi: [C: 03+1] "I don't know much of what those files do. But they work for other projects and I don't see anything obviously wrong." 
[software/homer] - 10https://gerrit.wikimedia.org/r/530860 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190819T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:07:28] (03CR) 10Ayounsi: [C: 03+1] "Discussed it extensively with Riccardo, looks sane to me!" [software/homer] - 10https://gerrit.wikimedia.org/r/530861 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [23:08:32] (03CR) 10Ayounsi: [C: 03+1] "Same as previous CR, stared a long time at the code and asked all my questions to Riccardo, can't find anything wrong. Will try at the nex" [software/homer] - 10https://gerrit.wikimedia.org/r/530862 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans)