[00:04:14] Operations, ops-codfw, DBA, Patch-For-Review: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Papaul) [00:06:55] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:51] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:15:28] chaomodus: what's going on with those ^ ? [00:15:40] an increase in 500s [00:15:51] I'm adding retries to the sync script [00:15:55] (PS1) Ayounsi: MR: add policy-options and routing-options [homer/public] - https://gerrit.wikimedia.org/r/550576 [00:15:57] that should quiet those out [00:16:37] (CR) Ayounsi: "Full diff on all the mr routers:" [homer/public] - https://gerrit.wikimedia.org/r/550576 (owner: Ayounsi) [00:22:33] ok [00:22:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:37:31] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Papaul) [00:38:51] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Papaul) @Marostegui @jcrespo please feel free to take over the task. Thanks. 
[01:03:04] Operations, netops: mr1-esams flowd logs flood - https://phabricator.wikimedia.org/T238174 (ayounsi) p:Triage→High [01:09:25] Operations, ops-codfw: Degraded RAID on db2134 - https://phabricator.wikimedia.org/T238175 (ops-monitoring-bot) [01:13:12] Operations, netops: mr1-esams flowd logs flood - https://phabricator.wikimedia.org/T238174 (ayounsi) Opened case 2019-1112-0803 [01:19:19] Operations, ops-codfw: Degraded RAID on db2135 - https://phabricator.wikimedia.org/T238176 (ops-monitoring-bot) [01:27:38] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [01:44:26] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [02:06:54] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:03:02] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:14:18] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [04:10:24] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [04:27:14] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:12:06] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:18:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:04] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:22:34] There's a complaint in OTRS about an " Invalid OCSP signing certificate in OCSP response. " error from 8h ago, anyone know why that happened and if it's been fixed? [05:22:49] Ticket#2019111210010321 [05:30:57] hmm that could have been triggered by the certificate change that we performed last night [05:35:22] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:36:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:57:04] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:13:56] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:25:02] !log volker-e@deploy1001 Started deploy [design/style-guide@edce4cc]: Deploy design/style-guide: [06:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:10] !log volker-e@deploy1001 Finished deploy [design/style-guide@edce4cc]: Deploy design/style-guide: (duration: 00m 08s) [06:25:10] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:00] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:47:22] Operations, ops-codfw: Degraded RAID on db2134 - https://phabricator.wikimedia.org/T238175 (Marostegui) Open→Invalid I think the RAID was still being built as this is a new host. As the RAID looks good: ` root@db2134:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive... [06:48:24] Operations, ops-codfw: Degraded RAID on db2135 - https://phabricator.wikimedia.org/T238176 (Marostegui) Open→Invalid Same as: T238175#5659160 ` root@db2135:~# megacli -LDPDInfo -aAll | grep -i firm Firmware state: Online, Spun Up Device Firmware Level: DL65 Firmware state: Online, Spun Up Devi... 
[06:53:28] 10Operations, 10ops-codfw, 10DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (10Marostegui) 05Open→03Resolved [06:58:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P9604 and previous config saved to /var/cache/conftool/dbconfig/20191113-065823-marostegui.json [06:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2087:3317 after compression', diff saved to https://phabricator.wikimedia.org/P9605 and previous config saved to /var/cache/conftool/dbconfig/20191113-065952-marostegui.json [06:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3317 for compression', diff saved to https://phabricator.wikimedia.org/P9606 and previous config saved to /var/cache/conftool/dbconfig/20191113-070055-marostegui.json [07:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P9607 and previous config saved to /var/cache/conftool/dbconfig/20191113-070339-marostegui.json [07:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:51] !log Fix replication on labsdb1010 - T233986 [07:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:51] (03PS1) 10Marostegui: mariadb: Promote db1083 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) [07:19:31] (03PS1) 10Marostegui: wmnet: Point s1-master to db1083 [dns] - 10https://gerrit.wikimedia.org/r/550588 (https://phabricator.wikimedia.org/T234800) [07:19:46] (03CR) 10Marostegui: [C: 04-2] "Wait for 
the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [07:20:23] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/550588 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [07:21:22] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [07:26:09] (03PS1) 10Marostegui: mariadb: Pool db2132 into m1 [puppet] - 10https://gerrit.wikimedia.org/r/550589 (https://phabricator.wikimedia.org/T238183) [07:28:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Pool db2132 into m1 [puppet] - 10https://gerrit.wikimedia.org/r/550589 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:30:36] (03CR) 10Marostegui: "That's a known issue that will be tackled once we refactor the misc puppet classes and roles" [puppet] - 10https://gerrit.wikimedia.org/r/550589 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:30:48] (03CR) 10Marostegui: [V: 03+2 C: 03+2] mariadb: Pool db2132 into m1 [puppet] - 10https://gerrit.wikimedia.org/r/550589 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:56:58] (03PS1) 10Marostegui: install_server: Reimage db2132 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/550593 (https://phabricator.wikimedia.org/T238183) [07:59:08] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2132 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/550593 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:59:09] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:03:30] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC still a noop, merging https://puppet-compiler.wmflabs.org/compiler1001/19354/" [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [08:15:45] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:15:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [08:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! pcc noop https://puppet-compiler.wmflabs.org/compiler1002/19355/" [puppet] - 10https://gerrit.wikimedia.org/r/550506 (https://phabricator.wikimedia.org/T238096) (owner: 10Arturo Borrero Gonzalez) [08:18:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:58] (03CR) 10Jcrespo: "New candidate?" [puppet] - 10https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [08:21:35] (03CR) 10Marostegui: [C: 04-2] "> New candidate?" 
[puppet] - 10https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [08:25:32] !log Stop MySQL on db2062 to copy its data to db2132 T238183 [08:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:37] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 [08:29:26] (03PS1) 10Muehlenhoff: Extend MOU for Bob West [puppet] - 10https://gerrit.wikimedia.org/r/550635 [08:29:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM to move this forward, I'm assuming after this is merged what's left is change the tox invocation to run in a python3 environment as n" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [08:31:09] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:31:44] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1083 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [08:32:28] (03CR) 10Marostegui: [C: 04-2] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [08:34:03] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for Bob West [puppet] - 10https://gerrit.wikimedia.org/r/550635 (owner: 10Muehlenhoff) [08:34:31] 10Operations, 10Traffic, 10Wikimedia-Logstash: varnishlog consumers http request/response logging field explosion - https://phabricator.wikimedia.org/T238089 (10fgiunchedi) 05Open→03Invalid Confirmed only a subset of headers are whitelisted for sending, closing [08:40:31] 10Operations, 10Puppet, 10observability: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (10fgiunchedi) Interesting! Did the host show up in icinga at all? What's the hostname? 
[08:40:35] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:43:33] ^expected [08:48:53] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add gcrwiki and shywiktionary [deployment-charts] - 10https://gerrit.wikimedia.org/r/550533 (https://phabricator.wikimedia.org/T238117) (owner: 10Jon Harald Søby) [08:52:31] (03CR) 10Jcrespo: [C: 03+1] wmnet: Point s1-master to db1083 [dns] - 10https://gerrit.wikimedia.org/r/550588 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [08:52:51] (03CR) 10Jcrespo: [C: 03+1] "Was it r" [puppet] - 10https://gerrit.wikimedia.org/r/550514 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [09:06:34] !log mobrovac@deploy1001 Started deploy [restbase/deploy@1f2c7d8] (dev-cluster): Start storing Parsoid/PHP results; add gcrwiki, shywiktionary, szywiki [09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:37] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:09:08] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@1f2c7d8] (dev-cluster): Start storing Parsoid/PHP results; add gcrwiki, shywiktionary, szywiki (duration: 02m 35s) [09:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:35] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:09:54] (03PS10) 10Gehel: [beta] configure sparql/query logging to deployment-eventgate-3 [puppet] - 10https://gerrit.wikimedia.org/r/549056 (owner: 10DCausse) [09:10:32] !log mobrovac@deploy1001 Started deploy [restbase/deploy@1f2c7d8]: Start storing Parsoid/PHP results; add gcrwiki, shywiktionary, szywiki - T229015 T238117 T238116 T237374 [09:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:41] T237374: Add szywiki to restbase - https://phabricator.wikimedia.org/T237374 [09:10:41] T238116: Add shywiktionary to restbase - https://phabricator.wikimedia.org/T238116 [09:10:41] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [09:10:42] T238117: Add gcrwiki to restbase - https://phabricator.wikimedia.org/T238117 [09:17:32] (03PS1) 10Filippo Giunchedi: mtail: export logstash ES index failure details [puppet] - 10https://gerrit.wikimedia.org/r/550640 (https://phabricator.wikimedia.org/T236343) [09:21:51] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@1f2c7d8]: Start storing Parsoid/PHP results; add gcrwiki, shywiktionary, szywiki - T229015 T238117 T238116 T237374 (duration: 11m 19s) [09:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:01] T237374: Add szywiki to restbase - https://phabricator.wikimedia.org/T237374 [09:22:03] T238116: Add shywiktionary to restbase - https://phabricator.wikimedia.org/T238116 
[09:22:03] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [09:22:03] T238117: Add gcrwiki to restbase - https://phabricator.wikimedia.org/T238117 [09:23:58] (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: export logstash ES index failure details [puppet] - 10https://gerrit.wikimedia.org/r/550640 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [09:24:09] (03PS2) 10Filippo Giunchedi: mtail: export logstash ES index failure details [puppet] - 10https://gerrit.wikimedia.org/r/550640 (https://phabricator.wikimedia.org/T236343) [09:37:10] (03PS1) 10Mobrovac: Mathoid: Use image 2019-11-13-084818-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/550641 [09:38:18] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] Mathoid: Use image 2019-11-13-084818-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/550641 (owner: 10Mobrovac) [09:40:34] (03PS11) 10Gehel: [beta] configure sparql/query logging to deployment-eventgate-3 [puppet] - 10https://gerrit.wikimedia.org/r/549056 (owner: 10DCausse) [09:41:01] !log mobrovac@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' . [09:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:22] !log mobrovac@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'production' . [09:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:50] (03CR) 10Gehel: [C: 03+2] [beta] configure sparql/query logging to deployment-eventgate-3 [puppet] - 10https://gerrit.wikimedia.org/r/549056 (owner: 10DCausse) [09:50:45] !log mobrovac@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' . 
[09:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:17] !log upgraded wmf-mariadb101-client on cumin hosts [09:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:20] (03PS1) 10Odder: Update localized logos for the Fula Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550643 (https://phabricator.wikimedia.org/T238191) [09:52:36] (03PS1) 10Jcrespo: mariadb-package: Create 10.4 control files [software] - 10https://gerrit.wikimedia.org/r/550644 [09:53:01] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Update stretch mariadb wmf package to 10.1.41 [software] - 10https://gerrit.wikimedia.org/r/535810 (owner: 10Jcrespo) [09:53:08] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add percona support, and standarize xtrabackup reference [software] - 10https://gerrit.wikimedia.org/r/546455 (owner: 10Jcrespo) [09:53:15] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb package: Add 10.1.42 packages [software] - 10https://gerrit.wikimedia.org/r/548967 (owner: 10Jcrespo) [09:53:21] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:53:27] (03Merged) 10jenkins-bot: Update stretch mariadb wmf package to 10.1.41 [software] - 10https://gerrit.wikimedia.org/r/535810 (owner: 10Jcrespo) [09:53:31] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb-package: Updates for mariadb 10.1.43 and percona-server 8.0.17 [software] - 10https://gerrit.wikimedia.org/r/550102 (owner: 10Jcrespo) [09:53:33] (03Merged) 10jenkins-bot: Add percona support, and standarize xtrabackup reference [software] - 10https://gerrit.wikimedia.org/r/546455 (owner: 10Jcrespo) [09:53:38] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb-package: Create 10.4 control files [software] - 10https://gerrit.wikimedia.org/r/550644 (owner: 10Jcrespo) [09:53:41] (03Merged) 10jenkins-bot: mariadb package: Add 10.1.42 packages [software] - 10https://gerrit.wikimedia.org/r/548967 (owner: 10Jcrespo) [09:57:44] (03PS6) 10KartikMistry: Enable CX out of beta for newly created WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) [10:03:57] 10Operations, 10DBA, 10MediaWiki-General, 10TechCom: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10jcrespo) db1114 is now running percona-server 8.0, if anyone wants to test it. 
[10:04:37] (03PS1) 10Jcrespo: mariadb-client: Install 10.4 on buster, unblock os upgrade [puppet] - 10https://gerrit.wikimedia.org/r/550647 (https://phabricator.wikimedia.org/T193224) [10:07:48] (03PS2) 10Jcrespo: mariadb-client: Install 10.4 on buster, unblock os upgrade [puppet] - 10https://gerrit.wikimedia.org/r/550647 (https://phabricator.wikimedia.org/T193224) [10:08:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: refactor prometheus role [puppet] - 10https://gerrit.wikimedia.org/r/550506 (https://phabricator.wikimedia.org/T238096) (owner: 10Arturo Borrero Gonzalez) [10:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1083 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P9609 and previous config saved to /var/cache/conftool/dbconfig/20191113-101127-marostegui.json [10:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:00] (03PS1) 10Ema: debmonitor: update certificate [puppet] - 10https://gerrit.wikimedia.org/r/550649 (https://phabricator.wikimedia.org/T210411) [10:12:09] (03PS1) 10Vgutierrez: ATS: Avoid reloading trafficserver-tls if the service is being restarted [puppet] - 10https://gerrit.wikimedia.org/r/550650 (https://phabricator.wikimedia.org/T237425) [10:13:22] (03PS2) 10Vgutierrez: ATS: Avoid reloading trafficserver-tls if the service is being restarted [puppet] - 10https://gerrit.wikimedia.org/r/550650 (https://phabricator.wikimedia.org/T237425) [10:14:18] (03PS1) 10Gehel: elasticsearch: initial configuration for new elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/550651 (https://phabricator.wikimedia.org/T230746) [10:16:54] (03CR) 10Ema: [C: 03+2] debmonitor: update certificate [puppet] - 10https://gerrit.wikimedia.org/r/550649 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:18:55] (03PS3) 10Vgutierrez: ATS: Avoid reloading trafficserver-tls if the service is being restarted [puppet] - 10https://gerrit.wikimedia.org/r/550650 
(https://phabricator.wikimedia.org/T237425) [10:19:14] (03PS1) 10Arturo Borrero Gonzalez: wmcs: prometheus: delete datatype spec for kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/550652 [10:20:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1083 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P9610 and previous config saved to /var/cache/conftool/dbconfig/20191113-102054-marostegui.json [10:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:09] (03PS4) 10Vgutierrez: ATS: Avoid reloading trafficserver-tls if the service is being restarted [puppet] - 10https://gerrit.wikimedia.org/r/550650 (https://phabricator.wikimedia.org/T237425) [10:22:16] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: initial configuration for new elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/550651 (https://phabricator.wikimedia.org/T230746) (owner: 10Gehel) [10:23:42] (03PS2) 10Gehel: elasticsearch: initial configuration for new elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/550651 (https://phabricator.wikimedia.org/T230746) [10:23:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: prometheus: delete datatype spec for kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/550652 (owner: 10Arturo Borrero Gonzalez) [10:24:27] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata, 10Structured-Data-Backlog (Current Work): Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10matthiasmullie) >>! In T226093#5657414, @Ramsey-WMF wrote: > Matthias will look into discrepancies between nu... 
[10:25:13] (03CR) 10Ema: [C: 03+1] "looks great" [puppet] - 10https://gerrit.wikimedia.org/r/550650 (https://phabricator.wikimedia.org/T237425) (owner: 10Vgutierrez) [10:25:43] (03CR) 10Gehel: [C: 03+2] elasticsearch: initial configuration for new elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/550651 (https://phabricator.wikimedia.org/T230746) (owner: 10Gehel) [10:26:12] hi Urbanecm, would you be able to decipher what the failure here means? https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/550594/ [10:26:26] can try Jhs :-) [10:26:45] Jhs: could we continue this conversation in #wikimedia-dev? [10:26:55] Urbanecm, sure [10:27:19] !log start configuration of new elasticsearch servers - T230746 [10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:25] T230746: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 [10:28:19] (03CR) 10Vgutierrez: [C: 03+2] ATS: Avoid reloading trafficserver-tls if the service is being restarted [puppet] - 10https://gerrit.wikimedia.org/r/550650 (https://phabricator.wikimedia.org/T237425) (owner: 10Vgutierrez) [10:31:02] (03PS1) 10Mathew.onipe: Add log dir to wdqs labs config [puppet] - 10https://gerrit.wikimedia.org/r/550656 [10:31:46] Amir1: we've had a query running on the vslow host from wikidata for 9h, which looks like a cronjob: https://phabricator.wikimedia.org/P9611 [10:31:50] (03PS3) 10Filippo Giunchedi: logstash: alert on indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) [10:31:54] Amir1: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1087&var-port=9104 [10:32:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1083', diff saved to https://phabricator.wikimedia.org/P9612 and previous config saved to 
/var/cache/conftool/dbconfig/20191113-103225-marostegui.json [10:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:35] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:32:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:33:12] ^checking [10:34:27] it is wikidata againa [10:34:29] again* [10:34:48] (03PS1) 10Vgutierrez: ATS: get rid of non-ascii whitespace on update-ocsp-trafficserver-hook.erb [puppet] - 10https://gerrit.wikimedia.org/r/550657 [10:34:57] yep [10:35:04] (03PS2) 10Gehel: Add log dir to wdqs labs config [puppet] - 10https://gerrit.wikimedia.org/r/550656 (owner: 10Mathew.onipe) [10:36:27] (03CR) 10Vgutierrez: [C: 03+2] ATS: get rid of non-ascii whitespace on update-ocsp-trafficserver-hook.erb [puppet] - 10https://gerrit.wikimedia.org/r/550657 (owner: 10Vgutierrez) [10:37:29] (03CR) 10Gehel: [C: 03+2] Add log dir to wdqs labs config [puppet] - 10https://gerrit.wikimedia.org/r/550656 (owner: 10Mathew.onipe) [10:38:07] onimisionipe: ^ [10:38:19] (03PS2) 10Vgutierrez: ATS: get rid of non-ascii whitespace on update-ocsp-trafficserver-hook.erb [puppet] - 10https://gerrit.wikimedia.org/r/550657 [10:39:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:43:16] (03PS4) 10Filippo Giunchedi: logstash: alert on indexing failures [puppet] - 
10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) [10:43:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1083', diff saved to https://phabricator.wikimedia.org/P9613 and previous config saved to /var/cache/conftool/dbconfig/20191113-104326-marostegui.json [10:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:44:00] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: alert on indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [10:46:09] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] logstash: alert on indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [10:47:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:51:11] 10Operations, 10Puppet, 10Traffic, 10User-jbond: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] - https://phabricator.wikimedia.org/T238198 (10jbond) p:05Triage→03Normal [10:52:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:53:21] !log Testing ats-tls-restart on cp5007 - T237425 [10:53:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:26] T237425: ats-tls-restart failed on cp4027 - https://phabricator.wikimedia.org/T237425 [10:53:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:56:32] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls-restart failed on cp4027 - https://phabricator.wikimedia.org/T237425 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez This has been fixed with the conditional reload: `vgutierrez@cp5007:~$ sudo -i ats-tls-restart eqsin/cache_text/nginx/cp5007.eqsin.wm... [10:59:02] marostegui: it's one of the query cache thingy, maybe we need to turn that thing off as well? Can you make a phabricator ticket? [10:59:08] (03PS1) 10Gehel: elasticsearch: add node specific configuration for new servers [puppet] - 10https://gerrit.wikimedia.org/r/550658 (https://phabricator.wikimedia.org/T230746) [10:59:10] (Sorry I just woke up) [10:59:15] Amir1: yep, will do [10:59:15] thanks [10:59:21] Amir1: any preferred tags? [10:59:35] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: vms: refresh comments and messages [puppet] - 10https://gerrit.wikimedia.org/r/550659 (https://phabricator.wikimedia.org/T212302) [10:59:40] QueryPage [10:59:45] Or something similar [10:59:53] ok thanks! [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191113T1100). [11:00:05] odder: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:14] (PS1) Jbond: profile:trafficserver: remove bad character [puppet] - https://gerrit.wikimedia.org/r/550660
[11:00:17] * odder confirms presence
[11:00:34] I can SWAT today!
[11:00:47] (PS2) Jbond: profile:trafficserver: remove bad character [puppet] - https://gerrit.wikimedia.org/r/550660 (https://phabricator.wikimedia.org/T238198)
[11:01:04] (CR) Urbanecm: [C: +2] Update localized logos for the Fula Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/550643 (https://phabricator.wikimedia.org/T238191) (owner: Odder)
[11:01:50] (Merged) jenkins-bot: Update localized logos for the Fula Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/550643 (https://phabricator.wikimedia.org/T238191) (owner: Odder)
[11:02:43] (CR) Jbond: [C: +2] profile:trafficserver: remove bad character [puppet] - https://gerrit.wikimedia.org/r/550660 (https://phabricator.wikimedia.org/T238198) (owner: Jbond)
[11:03:38] odder: syncing, seems to work
[11:04:12] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 0a90ef9: Update localized logos for the Fula Wikipedia (T238191) (duration: 00m 54s)
[11:04:15] Urbanecm: Yes, I can see it on mwdebug1001
[11:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:17] T238191: New localized logo for the Fula Wikipedia - https://phabricator.wikimedia.org/T238191
[11:04:54] odder: great. It should be visible everywhere now. Let me know if it works on your side as well :-)
[11:05:16] !log Purge https://en.wikipedia.org/static/images/project-logos/ffwiki* (T238191)
[11:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:32] Urbanecm: Yes, it does!
[11:05:35] * odder peaces out
[11:05:44] wonderful, odder !
[11:05:47] (PS1) Mathew.onipe: wdqs: add deploy name config for labs [puppet] - https://gerrit.wikimedia.org/r/550661
[11:05:49] !log EU SWAT done
[11:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:16] Urbanecm: That was nice n' easy ;)
[11:06:25] Operations, Puppet, Traffic, Patch-For-Review, User-jbond: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] - https://phabricator.wikimedia.org/T238198 (jbond) This was actually fixed by the following change https://gerrit.wikimedia.org/r/c/operations/puppet/+/5506...
[11:06:40] odder: yeah 🙂
[11:07:58] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: clientpackages: vms: refresh comments and messages [puppet] - https://gerrit.wikimedia.org/r/550659 (https://phabricator.wikimedia.org/T212302) (owner: Arturo Borrero Gonzalez)
[11:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1083', diff saved to https://phabricator.wikimedia.org/P9614 and previous config saved to /var/cache/conftool/dbconfig/20191113-110802-marostegui.json
[11:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:37] Operations, Traffic: debmonitor TLS termination - https://phabricator.wikimedia.org/T238200 (ema)
[11:08:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:53] Operations, Traffic: debmonitor TLS termination - https://phabricator.wikimedia.org/T238200 (ema) p:Triage→Normal
[11:10:24] Operations, SRE-tools, netbox, observability, User-crusnov: Netbox Alert Cleanups - https://phabricator.wikimedia.org/T224946 (jijiki)
[11:11:01] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:11:51] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[11:13:19] (CR) Mathew.onipe: elasticsearch: add node specific configuration for new servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550658 (https://phabricator.wikimedia.org/T230746) (owner: Gehel)
[11:14:37] (CR) DCausse: [C: +1] "change seems coherent" [puppet] - https://gerrit.wikimedia.org/r/550658 (https://phabricator.wikimedia.org/T230746) (owner: Gehel)
[11:16:19] (PS2) Gehel: elasticsearch: add node specific configuration for new servers [puppet] - https://gerrit.wikimedia.org/r/550658 (https://phabricator.wikimedia.org/T230746)
[11:20:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:05] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:21:52] (CR) Mathew.onipe: [C: +1] elasticsearch: add node specific configuration for new servers [puppet] - https://gerrit.wikimedia.org/r/550658 (https://phabricator.wikimedia.org/T230746) (owner: Gehel)
[11:24:04]
!log cp4022: trafficserver (8.0.5-1wm10) and fifo-log-demux (0.6) upgrade and restart
[11:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:09] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:24:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:17] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[11:27:26] oh I missed that one
[11:27:27] !log rebooting cloudcontrol2003-dev for some microcode debugging
[11:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:58] !log cp-ats-ulsfo: rolling trafficserver (8.0.5-1wm10) and fifo-log-demux (0.6) upgrade and restart
[11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:15] (CR) Marostegui: [C: +1] mariadb-client: Install 10.4 on buster, unblock os upgrade [puppet] - https://gerrit.wikimedia.org/r/550647 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[11:29:54] Operations, Wikimedia-Logstash, Patch-For-Review, User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (fgiunchedi) re: bridging the gap with non-kafka inputs, my current thinking is to output all logs with `deprecated-input` tag back into kafka-logging on...
[11:37:11] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:37:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:06] !log rebooting labtestpuppetmaster2001 for microcode debugging
[11:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:02] (PS1) KartikMistry: Update cxserver to 2019-11-13-111130-production [deployment-charts] - https://gerrit.wikimedia.org/r/550663 (https://phabricator.wikimedia.org/T237379)
[11:45:45] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:45:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:27] !log rebooting cloudcontrol2001-dev for microcode debugging
[11:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:25] !log cp-ats: rolling trafficserver (8.0.5-1wm10) and fifo-log-demux (0.6) upgrade and restart
[11:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:30] !log Upgrade to php 7.2.24-1 mediawiki eqiad hosts and restart php-fpm - T237239
[11:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:34] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239
[11:58:18] Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (jijiki) p:Triage→Normal
[11:58:33] Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (jijiki)
[11:58:35] Operations, serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (jijiki)
[12:04:03] (CR)
ArielGlenn: "Let's add a comment in site.pp in the labstore1006,7 stanza mentioning the need to get a new primary generated for any new host before the" [puppet] - https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: Elukey)
[12:16:15] Operations, Traffic: debmonitor TLS termination - https://phabricator.wikimedia.org/T238200 (Volans) As we discussed a while ago about this, the easiest solution is to pick another port for the public TLS server on the debmonitor servers as the 443 is already taken for the internal clients to report the...
[12:20:44] (PS1) Tchanders: Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981)
[12:32:07] (PS1) Ema: debmonitor: terminate TLS on port 7443 [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200)
[12:34:18] (CR) Ema: "https://puppet-compiler.wmflabs.org/compiler1001/19358/" [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[12:37:40] PROBLEM - Logstash Elasticsearch indexing errors on logstash1007 is CRITICAL: cluster=logstash instance=logstash1007:3903 job=mtail prog=logstash.mtail site=eqiad https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[12:38:02] PROBLEM - Logstash Elasticsearch indexing errors on logstash1008 is CRITICAL: cluster=logstash instance=logstash1008:3903 job=mtail prog=logstash.mtail site=eqiad https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[12:40:54] RECOVERY - Logstash Elasticsearch indexing errors on logstash1007 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[12:41:18] RECOVERY - Logstash Elasticsearch indexing errors on logstash1008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[12:45:27] (PS3) Jcrespo: mariadb-client: Install 10.4 on buster, unblock os upgrade [puppet] - https://gerrit.wikimedia.org/r/550647 (https://phabricator.wikimedia.org/T193224)
[12:45:29] (PS1) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[12:50:56] (PS14) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180)
[12:53:55] (CR) Gehel: [C: +2] wdqs: add deploy name config for labs [puppet] - https://gerrit.wikimedia.org/r/550661 (owner: Mathew.onipe)
[12:53:59] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn)
[12:54:03] (PS2) Gehel: wdqs: add deploy name config for labs [puppet] - https://gerrit.wikimedia.org/r/550661 (owner: Mathew.onipe)
[12:55:14] (CR) Gehel: [C: +2] elasticsearch: add node specific configuration for new servers [puppet] - https://gerrit.wikimedia.org/r/550658 (https://phabricator.wikimedia.org/T230746) (owner: Gehel)
[12:55:58] (PS3) Gehel: wdqs: add deploy name config for labs [puppet] - https://gerrit.wikimedia.org/r/550661 (owner: Mathew.onipe)
[12:56:42] (PS15) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180)
[12:59:23] (PS3) Muehlenhoff: Add Icinga check for correct
application of microcode mitigations [puppet] - https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250)
[13:04:36] PROBLEM - Logstash Elasticsearch indexing errors on logstash2006 is CRITICAL: cluster=logstash instance=logstash2006:3903 job=mtail prog=logstash.mtail site=codfw https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:05:02] PROBLEM - Logstash Elasticsearch indexing errors on logstash2005 is CRITICAL: cluster=logstash instance=logstash2005:3903 job=mtail prog=logstash.mtail site=codfw https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:05:10] PROBLEM - Logstash Elasticsearch indexing errors on logstash1007 is CRITICAL: cluster=logstash instance=logstash1007:3903 job=mtail prog=logstash.mtail site=eqiad https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:05:12] PROBLEM - Logstash Elasticsearch indexing errors on logstash1009 is CRITICAL: cluster=logstash instance=logstash1009:3903 job=mtail prog=logstash.mtail site=eqiad https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:05:32] PROBLEM - Logstash Elasticsearch indexing errors on logstash2004 is CRITICAL: cluster=logstash instance=logstash2004:3903 job=mtail prog=logstash.mtail site=codfw https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:05:34] PROBLEM - Logstash Elasticsearch indexing errors on logstash1008 is CRITICAL: cluster=logstash instance=logstash1008:3903 job=mtail prog=logstash.mtail site=eqiad https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:08:16] RECOVERY - Logstash Elasticsearch indexing errors on logstash2005 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:08:24] RECOVERY - Logstash Elasticsearch indexing errors on logstash1007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:08:28] RECOVERY - Logstash Elasticsearch indexing errors on logstash1009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:08:46] RECOVERY - Logstash Elasticsearch indexing errors on logstash2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:08:48] RECOVERY - Logstash Elasticsearch indexing errors on logstash1008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:09:28] RECOVERY - Logstash Elasticsearch indexing errors on logstash2006 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:15:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P9615 and previous config saved to /var/cache/conftool/dbconfig/20191113-131530-marostegui.json
[13:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:19] (PS2) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[13:17:14] sorry about the alert spam, looking into it
[13:17:50] (PS1) Arturo Borrero Gonzalez: toolforge: new k8s: add wmcs-k8s-get-cert.sh script [puppet] - https://gerrit.wikimedia.org/r/550673 (https://phabricator.wikimedia.org/T215553)
[13:19:52] (CR) jerkins-bot: [V: -1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[13:20:38] (PS3) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[13:20:40] (PS2) Arturo Borrero Gonzalez: toolforge: new k8s: add wmcs-k8s-get-cert.sh script [puppet] - https://gerrit.wikimedia.org/r/550673 (https://phabricator.wikimedia.org/T215553)
[13:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089 for upgrade', diff saved to https://phabricator.wikimedia.org/P9616 and previous config saved to /var/cache/conftool/dbconfig/20191113-132216-marostegui.json
[13:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:47] (PS4) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671
(https://phabricator.wikimedia.org/T238048)
[13:25:46] (PS5) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[13:25:56] (CR) Arturo Borrero Gonzalez: "My idea for this script is double:" [puppet] - https://gerrit.wikimedia.org/r/550673 (https://phabricator.wikimedia.org/T215553) (owner: Arturo Borrero Gonzalez)
[13:27:47] Operations, Traffic: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (Seb35) A certificate warning here https://social.imirhil.fr/@aeris/103126273693383568 the user still had the cert 2018 at 2019-11-12 17:07 UTC although the sites served the cert 2019 at that...
[13:29:08] (CR) jerkins-bot: [V: -1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[13:30:10] (PS1) Filippo Giunchedi: logstash: move ingestion alerts to be site-local [puppet] - https://gerrit.wikimedia.org/r/550678 (https://phabricator.wikimedia.org/T236343)
[13:33:24] (CR) Filippo Giunchedi: [C: +2] logstash: move ingestion alerts to be site-local [puppet] - https://gerrit.wikimedia.org/r/550678 (https://phabricator.wikimedia.org/T236343) (owner: Filippo Giunchedi)
[13:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089 after upgrade', diff saved to https://phabricator.wikimedia.org/P9617 and previous config saved to /var/cache/conftool/dbconfig/20191113-133410-marostegui.json
[13:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:12] (PS1) Muehlenhoff: Remove smalyshev from wdqs contact group [puppet] - https://gerrit.wikimedia.org/r/550679
[13:36:55] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/550679 (owner: Muehlenhoff)
[13:38:07] (CR) Muehlenhoff: [C: +2] Remove smalyshev from wdqs contact group [puppet] - https://gerrit.wikimedia.org/r/550679 (owner: Muehlenhoff)
[13:40:07] (PS6) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[13:43:21] (CR) jerkins-bot: [V: -1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[13:44:34] (PS1) Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/550682 (https://phabricator.wikimedia.org/T233931)
[13:46:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1089 after upgrade', diff saved to https://phabricator.wikimedia.org/P9618 and previous config saved to /var/cache/conftool/dbconfig/20191113-134625-marostegui.json
[13:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:33] (PS7) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[13:49:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 for schema change', diff saved to https://phabricator.wikimedia.org/P9619 and previous config saved to /var/cache/conftool/dbconfig/20191113-134938-marostegui.json
[13:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:52] (CR) jerkins-bot: [V: -1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[13:52:20] PROBLEM - Logstash Elasticsearch indexing errors on logstash1009 is CRITICAL: cluster=logstash
instance=logstash1009:3903 job=mtail prog=logstash.mtail site=eqiad https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:52:45] (PS8) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[13:53:59] Operations, Patch-For-Review, User-jbond: Cross data center setup for CAS - https://phabricator.wikimedia.org/T233931 (jbond) Looking at the [[ https://apereo.github.io/cas/6.0.x/installation/Ticket-Registry-Replication-Encryption.html | list of supported ticketing registries ]] we have the followin...
[13:55:44] RECOVERY - Logstash Elasticsearch indexing errors on logstash1009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[13:55:57] (CR) jerkins-bot: [V: -1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[13:56:31] (CR) Jcrespo: "I would like your ok to convert all schedules to this format (and make configured_pools an array):" [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[13:59:34] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:00:29] ^ downtiming
[14:03:36] (CR) Elukey: "> Let's add a comment in site.pp in the labstore1006,7 stanza" [puppet] - https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: Elukey)
[14:08:10] (PS2) Elukey: role::dumps::distribution::server: add kerberos [puppet] - https://gerrit.wikimedia.org/r/550466
(https://phabricator.wikimedia.org/T234229)
[14:10:06] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:10:13] (CR) Muehlenhoff: [C: +2] Add Icinga check for correct application of microcode mitigations [puppet] - https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: Muehlenhoff)
[14:12:35] (PS9) Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048)
[14:13:18] Operations, Traffic: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (BBlack) >>! In T237650#5660016, @Seb35 wrote: > A certificate warning here https://social.imirhil.fr/@aeris/103126273693383568 the user still had the cert 2018 at 2019-11-12 17:07 UTC althoug...
[14:15:50] (CR) jerkins-bot: [V: -1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[14:19:43] (PS16) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180)
[14:20:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:22:08] Operations: Xeon CPU microcode vulnerability (CVE-2019-11139) - https://phabricator.wikimedia.org/T238214 (MoritzMuehlenhoff)
[14:22:35] Operations, Wikimedia-Mailing-lists: Reset wikifr-l admin password - https://phabricator.wikimedia.org/T238215 (Kelson)
[14:23:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:24:26] (PS1) Ema: cache: reimage cp3054 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550688 (https://phabricator.wikimedia.org/T227432)
[14:27:21] Operations, observability, Patch-For-Review, User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (fgiunchedi) >>! In T148541#5432212, @RobH wrote: > > So we likely need the following metrics for each PDU tower: > * input v...
[14:27:52] (PS1) Elukey: Remove some sudo permissions from the analytics-admins group [puppet] - https://gerrit.wikimedia.org/r/550689 (https://phabricator.wikimedia.org/T224859)
[14:28:42] (CR) Volans: [C: -1] "Couple of replace private/public and it should be good to go" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[14:32:00] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on dbproxy2001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[14:32:24] ^ expected
[14:34:27] Operations, SRE-tools, netbox, observability, User-crusnov: Netbox Alert Cleanups - https://phabricator.wikimedia.org/T224946 (Volans) >>! In T224946#5659653, @jijiki wrote: > I am not sure this is related, but we get many alerts of > > * PROBLEM - Check the Netbox report puppetdb for fail...
[14:35:53] (CR) Ottomata: [C: +1] Remove some sudo permissions from the analytics-admins group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550689 (https://phabricator.wikimedia.org/T224859) (owner: Elukey)
[14:36:25] (PS2) Ema: debmonitor: terminate TLS on port 7443 [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200)
[14:37:18] (CR) Ema: debmonitor: terminate TLS on port 7443 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[14:37:45] (CR) Elukey: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550689 (https://phabricator.wikimedia.org/T224859) (owner: Elukey)
[14:38:30] Operations, Product-Analytics, SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Dzahn) Hi Fuzzy, it might be problematic to share because it's just a single account not meant for sharing. (i.e.
if multiple people use it they are logging ea...
[14:40:57] (CR) Volans: [C: +1] "LGTM, thanks for taking care of this" [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[14:41:37] Operations, Wikimedia-Mailing-lists: Reset wikifr-l admin password - https://phabricator.wikimedia.org/T238215 (Dzahn) Open→Resolved a:Dzahn Done. A mail has been sent to the wikifr-l-owner@ address which forwards to all admins at once. You should be able to use that to login again now. Kee...
[14:41:48] (CR) Ema: [C: +2] "Thank you Volans!" [puppet] - https://gerrit.wikimedia.org/r/550670 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[14:43:44] (PS1) Jbond: apereo_cas: add ability to configure basic memcached support [puppet] - https://gerrit.wikimedia.org/r/550695 (https://phabricator.wikimedia.org/T233931)
[14:44:15] (CR) Dzahn: [C: +2] Add gcr and shy languages [dns] - https://gerrit.wikimedia.org/r/550511 (https://phabricator.wikimedia.org/T238104) (owner: Jon Harald Søby)
[14:44:28] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-main2005 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[14:46:05] (CR) jerkins-bot: [V: -1] apereo_cas: add ability to configure basic memcached support [puppet] - https://gerrit.wikimedia.org/r/550695 (https://phabricator.wikimedia.org/T233931) (owner: Jbond)
[14:48:07] (PS2) Jbond: apereo_cas: add ability to configure basic memcached support [puppet] - https://gerrit.wikimedia.org/r/550695 (https://phabricator.wikimedia.org/T233931)
[14:48:17] (Abandoned) Dzahn: Raise PHP memory_limit from 660MB to 760MB [mediawiki-config] - https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) (owner: Dzahn)
[14:54:02] (PS1) Ema: debmonitor: expect 302 on successful TLS termination
[puppet] - https://gerrit.wikimedia.org/r/550696 (https://phabricator.wikimedia.org/T238200)
[14:55:42] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db2072', diff saved to https://phabricator.wikimedia.org/P9620 and previous config saved to /var/cache/conftool/dbconfig/20191113-145541-jynus.json
[14:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:55] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/550696 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[14:57:40] (CR) Ema: [C: +2] debmonitor: expect 302 on successful TLS termination [puppet] - https://gerrit.wikimedia.org/r/550696 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[14:59:12] (PS1) Ema: ATS: use port 7443 for debmonitor [puppet] - https://gerrit.wikimedia.org/r/550697 (https://phabricator.wikimedia.org/T238200)
[14:59:59] (CR) Volans: [C: +1] "LGTM! Ship it" [puppet] - https://gerrit.wikimedia.org/r/550697 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[15:02:41] (CR) Ema: [C: +2] ATS: use port 7443 for debmonitor [puppet] - https://gerrit.wikimedia.org/r/550697 (https://phabricator.wikimedia.org/T238200) (owner: Ema)
[15:07:23] (PS2) Elukey: Remove some sudo permissions from the analytics-admins group [puppet] - https://gerrit.wikimedia.org/r/550689 (https://phabricator.wikimedia.org/T224859)
[15:07:58] (CR) Elukey: [C: +2] "Given that we are removing permissions I don't think that we need to wait for SRE approval, so I am merging :)" [puppet] - https://gerrit.wikimedia.org/r/550689 (https://phabricator.wikimedia.org/T224859) (owner: Elukey)
[15:08:42] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on dbproxy2002 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[15:09:36] PROBLEM - MariaDB Slave IO: s1 on db2094 is CRITICAL: CRITICAL
slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2072.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2072.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [15:10:20] ^ expected, db2072 is being worked on by jynus and papaul [15:10:33] (03CR) 10Jforrester: "> Patch Set 4: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [15:11:06] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-main2004 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [15:13:40] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on puppetmaster1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, ssbd, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode [15:19:12] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:19:18] PROBLEM - Check size of conntrack table on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:19:30] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:19:32] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:19:32] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 
5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:19:50] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:20:00] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:40] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:21:20] I sense that notebook1004 is not feeling well :D [15:21:50] must be the winter [15:21:55] 10Operations, 10Wikimedia-Mailing-lists: Reset wikifr-l admin password - https://phabricator.wikimedia.org/T238215 (10Kelson) @Dzahn Thank you. I have been able to connect to the admin UI. [15:22:44] hmm [15:23:02] RECOVERY - DPKG on notebook1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:23:09] restarted the nagios server [15:23:14] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:30] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:23:38] RECOVERY - Check size of conntrack table on notebook1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:23:44] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on oresrdb2001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {ssbd} https://wikitech.wikimedia.org/wiki/Microcode [15:23:52] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1004 is OK: OK ferm input default policy is 
set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:23:54] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:23:54] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:24:16] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1609.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [15:25:08] (03CR) 10Elukey: [C: 03+2] "Reasonably sure that this works, going to merge and test with a new keytab creation :)" [puppet] - 10https://gerrit.wikimedia.org/r/550491 (owner: 10Elukey) [15:25:50] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:27:56] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on rpki1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [15:29:16] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on puppetmaster1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, ssbd, flush_l1d} Muehlenhoff T235250 https://wikitech.wikimedia.org/wiki/Microcode [15:29:32] !log configuration of new elasticsearch servers completed, all working and pooled - T230746 [15:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:38] T230746: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 [15:29:49] !log shutdown db2072 T237905 [15:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] T237905: 
Upgrade db2072 firmware and bios - https://phabricator.wikimedia.org/T237905 [15:30:36] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:30:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:12] !log rebooting cloudbackup2002 [15:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:39] !log depool cp3054 and reimage as text_ats T227432 [15:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:45] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:35:52] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-main2004 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [15:35:52] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-main2005 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [15:36:09] (03CR) 10Ema: [C: 03+2] cache: reimage cp3054 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550688 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [15:39:05] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3054.esams.wmnet'] ` The log can be found in `/var/log/wm... 
[15:39:23] (03PS1) 10Elukey: kerberos: fix generate_keytabs.py script [puppet] - 10https://gerrit.wikimedia.org/r/550704 [15:39:57] !log powercycle cloudbackup2002 [15:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] 10Operations, 10ops-esams, 10netops: mr1-esams flowd logs flood - https://phabricator.wikimedia.org/T238174 (10ayounsi) > Upon internally checking information about the logs you provided, we have found out that this is a hardware issue, please fill out the shipping form to proceed replacing this unit: > [...] [15:42:27] (03CR) 10jerkins-bot: [V: 04-1] kerberos: fix generate_keytabs.py script [puppet] - 10https://gerrit.wikimedia.org/r/550704 (owner: 10Elukey) [15:44:06] PROBLEM - Host db2071 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:35] ^ papaul is that you [15:45:09] papaul: the host to tackle was db2072 if I remember correctly, no? [15:45:12] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [15:45:54] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2016:9536,cp2023:9536} site=codfw tunnel={cp3054_v4,cp3054_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:45:59] marostegui: yes [15:46:12] papaul: did it go down by accident? 
[15:46:14] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1083:9536,cp1085:9536,cp1089:9536} site=eqiad tunnel={cp3054_v4,cp3054_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:47:34] marostegui: no it was me doing upgrade on db2071 it supposed to be db2072 [15:48:03] marostegui: mistake sorry [15:48:05] 10Operations, 10SRE-tools, 10User-jbond: Puppet compiler: abort on git rebase conflict - https://phabricator.wikimedia.org/T157001 (10jbond) 05Open→03Resolved a:03jbond I think this should be fixed now please reopen if there is still an issue [15:48:37] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) The mgmt issue seems to have been resolved, was this done by @Jclark-ctr I do not see an update [15:49:18] papaul: no worries :) [15:49:24] RECOVERY - Host db2071 is UP: PING WARNING - Packet loss = 73%, RTA = 36.28 ms [15:49:29] papaul: so db2071 got upgraded too? :) [15:49:32] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:49:56] marostegui: yes just the BIOS upgrade in progress [15:50:09] papaul: cool! one more host upgraded! :) [15:50:22] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [15:50:36] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:50:37] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) [15:51:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1089', diff saved to https://phabricator.wikimedia.org/P9621 and previous config saved to /var/cache/conftool/dbconfig/20191113-155134-marostegui.json [15:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:06] (03PS2) 10Gehel: mjolnir_bulk_daemon: Add new kafka topics for model upload [puppet] - 10https://gerrit.wikimedia.org/r/549939 (owner: 10EBernhardson) [15:52:44] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) a:05Cmjohnson→03Bstorm @bstorm assigning to you to update netbox once the systems are online. I am removing the ops-eqiad tag. If you have an... [15:54:10] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Jclark-ctr) 05Open→03Resolved alert cleared no errors in icinga [15:54:16] 10Operations, 10ops-codfw, 10DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (10Marostegui) Thanks @Papaul - the hosts look good, I will create another task to productionize them Thanks! 
[15:54:29] (03CR) 10Gehel: [C: 03+2] mjolnir_bulk_daemon: Add new kafka topics for model upload [puppet] - 10https://gerrit.wikimedia.org/r/549939 (owner: 10EBernhardson) [15:54:50] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10herron) >>! In T235891#5659716, @fgiunchedi wrote: > re: bridging the gap with non-kafka inputs, my current thinking is to output all logs with `depreca... [15:55:31] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10Cmjohnson) @mathew.onipe elastic1029 is over 2 years out of warranty. The DIMM can be reseated, does this server need scheduled downtime or can... [15:57:57] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Cmjohnson) @Gehel The disk is here, Can this be done anytime or does this need to be coordinated? [15:58:40] 10Operations, 10Puppet, 10observability: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (10ayounsi) I don't know if the host showed up in Icinga. The host was cloudcephmon1001.wikimedia.org but it happened to other hosts in the past. I *think* it should be a check for... [15:59:13] (03CR) 10Dmaza: "@Jforrester: Can you confirm this? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/550668/1/wmf-config/CommonSettings.php#3" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) (owner: 10Tchanders) [15:59:55] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) I have extended the rsync of xlm/sql dumps to the last three good dumps and have been running a bandwidth-limited pull from labstore1006 to dumpsdata1003 in a screen session on... [16:00:00] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [16:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:06] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:14] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Jclark-ctr) [16:02:44] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10JHedden) 05Open→03Resolved [16:04:02] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:06:06] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:06:46] PROBLEM - Check the Netbox report cables for fail status. 
on netbox1001 is CRITICAL: cables.Cables CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:07:42] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:10:06] (03PS6) 10Herron: logstash: add version param and exclude plugins when non 5.x [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [16:11:33] 10Operations, 10Wikispeech-Text-to-Speech, 10Wikispeech-WMSE, 10Wikispeech-jobrunner: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072 (10Sebastian_Berlin-WMSE) [16:11:43] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Gehel) @Cmjohnson server is already shutdown, do whatever you want, whenever you want! RAID0, so it will require a reimage once the new disk is in place, but I can do that... [16:13:34] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3054.esams.wmnet'] ` and were **ALL** successful. [16:15:22] (03PS2) 10Elukey: kerberos: fix generate_keytabs.py script [puppet] - 10https://gerrit.wikimedia.org/r/550704 [16:16:11] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10Cmjohnson) I created another dispatch ticket with Dell and will try and resolve this through them. If not a phone call will have to be made. 
[16:18:50] (03CR) 10Elukey: [C: 03+2] kerberos: fix generate_keytabs.py script [puppet] - 10https://gerrit.wikimedia.org/r/550704 (owner: 10Elukey) [16:19:14] (03PS2) 10CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) [16:19:37] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:19:52] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10akosiaris) >>! In T238048#5655633, @jcrespo wrote: > @akosiaris Could you give a quick look to see if these seems like a complete archive contents? > {P9597}... [16:19:59] (03CR) 10jerkins-bot: [V: 04-1] cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [16:20:38] papaul: once you are done with db2072, let me know and I will take over [16:21:28] !log draining elastic1017-1031 to prepare for decommission - T230746 [16:21:34] !log pool cp3054 with ATS backend T227432 [16:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:36] T230746: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 [16:21:37] marostegui: ok [16:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:41] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:22:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus: add scraping of k8s envoy sidecars [puppet] - 
10https://gerrit.wikimedia.org/r/549871 (owner: 10Giuseppe Lavagetto) [16:22:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but add a bug line please?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549871 (owner: 10Giuseppe Lavagetto) [16:23:05] 10Operations, 10ops-eqiad, 10DC-Ops: Move kafka100[123] to logstash102[012] - https://phabricator.wikimedia.org/T235124 (10Cmjohnson) 05Open→03Resolved labels updated [16:23:57] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Cmjohnson) These are projected to be in eqiad Dec 5th [16:26:54] (03PS21) 10Jhedden: ceph: add ceph storage cluster profiles and modules [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [16:32:05] (03PS17) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [16:34:04] (03CR) 10Jforrester: Enable CheckUser's Special:Investigate page on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) (owner: 10Tchanders) [16:36:39] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:36:44] !log restart blazegraph on wdqs1005 [16:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:08] (03PS1) 10Elukey: Deploy Kerberos keytabs for analytics-search and presto system users [puppet] - 10https://gerrit.wikimedia.org/r/550713 (https://phabricator.wikimedia.org/T237269) [16:37:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] kubernetes::deployment_server: Add a private/general.yaml file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) 
(owner: 10Giuseppe Lavagetto) [16:38:09] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:38:23] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.976e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:39:39] (03CR) 10Elukey: [C: 03+2] Deploy Kerberos keytabs for analytics-search and presto system users [puppet] - 10https://gerrit.wikimedia.org/r/550713 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [16:39:49] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Jclark-ctr) [16:40:19] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Jclark-ctr) 05Open→03Resolved confirmed link and errors cleared from icinga [16:40:21] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10Jclark-ctr) [16:40:59] !log depool wdqs1005 - T238232 [16:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:04] T238232: blazegraph journal on wdqs1005 is oversized - https://phabricator.wikimedia.org/T238232 [16:43:08] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (10Jgreen) [16:43:17] 10Operations, 10ops-codfw, 10DBA: Upgrade db2072 firmware and bios - https://phabricator.wikimedia.org/T237905 (10Papaul) a:05Papaul→03Marostegui Before BIOS Version 2.4.3 Firmware Version 2.40 After BIOS Version 2.10.5 Firmware Version 2.70 [16:43:57] (03PS1) 10Jgreen: remove alnilam from nsca_frack.cfg.erb [puppet] - 
10https://gerrit.wikimedia.org/r/550715 [16:44:33] (03PS2) 10Jgreen: remove alnilam from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/550715 (https://phabricator.wikimedia.org/T238233) [16:46:32] (03CR) 10ArielGlenn: "This currently fails under the puppet compiler; I guess some wmcs fake secrets need to be added for that?" [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [16:46:37] (03CR) 10jerkins-bot: [V: 04-1] remove alnilam from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/550715 (https://phabricator.wikimedia.org/T238233) (owner: 10Jgreen) [16:46:59] RECOVERY - MariaDB Slave IO: s1 on db2094 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:47:13] 10Operations, 10ops-codfw, 10DBA: Upgrade db2072 firmware and bios - https://phabricator.wikimedia.org/T237905 (10Marostegui) 05Open→03Resolved Thanks - I have started MySQL (and run mysql_upgrade). Thanks Jaime for getting this host down for Papaul too! 
[16:47:15] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [16:47:23] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [16:48:09] (03PS3) 10Jgreen: remove alnilam from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/550715 (https://phabricator.wikimedia.org/T238233) [16:48:34] (03PS22) 10Jhedden: ceph: add ceph storage cluster profiles and modules [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [16:49:19] (03PS1) 10Elukey: Add fake kerberos keytab secrets for labstore nodes [labs/private] - 10https://gerrit.wikimedia.org/r/550716 [16:49:38] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytab secrets for labstore nodes [labs/private] - 10https://gerrit.wikimedia.org/r/550716 (owner: 10Elukey) [16:50:47] (03CR) 10Elukey: "> This currently fails under the puppet compiler; I guess some wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [16:53:48] (03PS2) 10Tchanders: Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) [16:54:24] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Jclark-ctr) @gehel replaced drive [16:55:45] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Hi Dzahn, @RobH says it is possible to grant access to the search console for specific subdomains and user. He wrote guidelines in [[ https://wikitech.w... 
[16:55:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph: add ceph storage cluster profiles and modules [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [16:56:12] 10Operations, 10Puppet, 10User-jbond: Investigate using the rich_data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [16:56:14] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) [16:57:03] (03CR) 10Andrew Bogott: [C: 03+1] "The roles and site.pp changes look right to me." [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [17:00:55] (03PS23) 10Jhedden: ceph: add ceph storage cluster profiles and modules [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [17:01:24] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) [17:01:31] RECOVERY - MariaDB Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [17:02:29] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) This seems to be related to [[ https://tickets.puppetlabs.com/browse/PUP-8187 | PUP-8187 ]] the `puppet master --compile` option in puppet version 5.5 doe... 
[17:04:30] (03CR) 10Jhedden: [C: 03+2] ceph: add ceph storage cluster profiles and modules [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [17:06:06] (03PS1) 10Elukey: presto::server: add the presto user to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/550718 [17:12:41] 10Operations, 10ops-eqiad: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Cmjohnson) 05Open→03Resolved The label has been done [17:12:46] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Cmjohnson) [17:12:48] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002 - https://phabricator.wikimedia.org/T195623 (10Cmjohnson) [17:12:52] 10Operations, 10serviceops, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Cmjohnson) [17:13:25] (03CR) 10Jgreen: [C: 03+2] remove alnilam from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/550715 (https://phabricator.wikimedia.org/T238233) (owner: 10Jgreen) [17:16:06] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10wiki_willy) 05Open→03Resolved a:03Cmjohnson Resolving parent task for PDU upgrades. Much appreciated to @Cmjohnson and @Jclark-ctr for taking care of these. Thank... 
[17:17:40] Operations, DC-Ops, decommission, fundraising-tech-ops, Patch-For-Review: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (Jgreen)
[17:18:50] (CR) Alexandros Kosiaris: [C: -1] "Couple of comments inline, overall I am ok with this but I have to note that this is essentially 3 different patches:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[17:20:48] (PS1) Jhedden: ceph: update mon and osd hieradata in eqiad [puppet] - https://gerrit.wikimedia.org/r/550721 (https://phabricator.wikimedia.org/T236290)
[17:23:18] Operations, ops-eqiad, fundraising-tech-ops: rack/setup/install frnetmon1001 - https://phabricator.wikimedia.org/T232137 (Cmjohnson)
[17:27:07] Operations, ops-eqiad, fundraising-tech-ops: rack/setup/install frban1001.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (Cmjohnson)
[17:34:58] (CR) BBlack: [C: +2] Unified cert: add digicert-2019a files [puppet] - https://gerrit.wikimedia.org/r/550525 (owner: BBlack)
[17:35:27] Operations, serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (jijiki) @Dzahn I have left only phab* and scandium, can you take care of them? :) ` 'export DEBIAN_FRONTEND=noninteractive; apt-get install php7.2-bcmath php7.2-bz2 php7.2-cli php7.2-common php7.2-curl php7.2-dba...
[17:35:57] Operations, serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (jijiki) a:jijiki→Dzahn
[17:37:55] (CR) BBlack: [C: +2] Unified cert: deploy digicert-2019a to infra [puppet] - https://gerrit.wikimedia.org/r/550526 (owner: BBlack)
[17:38:46] PROBLEM - Check the Netbox report cables for fail status. on netbox1001 is CRITICAL: cables.Cables CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[17:41:02] (PS1) Elukey: Rename labstore node dir path for kerberos fake keytabs [labs/private] - https://gerrit.wikimedia.org/r/550725
[17:41:16] (CR) Elukey: [V: +2 C: +2] Rename labstore node dir path for kerberos fake keytabs [labs/private] - https://gerrit.wikimedia.org/r/550725 (owner: Elukey)
[17:44:46] (PS1) Mholloway: MachineVision: Update WD item blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/550726
[17:45:33] (PS1) Jhedden: add fake k8s node_token for ceph storage cluster [labs/private] - https://gerrit.wikimedia.org/r/550727
[17:46:19] (CR) Jhedden: [C: +2] add fake k8s node_token for ceph storage cluster [labs/private] - https://gerrit.wikimedia.org/r/550727 (owner: Jhedden)
[17:46:23] (CR) Mholloway: [C: +2] MachineVision: Update WD item blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/550726 (owner: Mholloway)
[17:46:55] (CR) Jhedden: [C: +2] ceph: update mon and osd hieradata in eqiad [puppet] - https://gerrit.wikimedia.org/r/550721 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[17:47:11] (Merged) jenkins-bot: MachineVision: Update WD item blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/550726 (owner: Mholloway)
[17:48:37] (CR) ArielGlenn: "https://puppet-compiler.wmflabs.org/compiler1002/19368/ pcc output which now completes" [puppet] - https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: Elukey)
[17:49:01] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Update WD item blacklist (duration: 00m 53s)
[17:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:35] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson)
[17:50:03] (CR) Jhedden: [V: +2 C: +2] add fake k8s node_token for ceph storage cluster [labs/private] - https://gerrit.wikimedia.org/r/550727 (owner: Jhedden)
[17:50:20] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson) a:Cmjohnson→Jclark-ctr @Jclark-ctr Can you get the network ports, please add to the task. Thanks
[17:54:45] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (Cmjohnson) a:Cmjohnson→JHedden @JHedden What is the status of these servers, it looks like most everything is finished but the checkboxes are not co...
[17:58:04] (PS1) Jhedden: ceph: update docker profile hiera key name [puppet] - https://gerrit.wikimedia.org/r/550728 (https://phabricator.wikimedia.org/T236290)
[18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191113T1800).
[18:00:04] No GERRIT patches in the queue for this window AFAICS.
[18:01:24] (CR) Ottomata: [C: +1] presto::server: add the presto user to the catalog [puppet] - https://gerrit.wikimedia.org/r/550718 (owner: Elukey)
[18:02:59] (CR) Jhedden: [C: +2] ceph: update docker profile hiera key name [puppet] - https://gerrit.wikimedia.org/r/550728 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[18:04:53] (CR) Elukey: [C: +2] presto::server: add the presto user to the catalog [puppet] - https://gerrit.wikimedia.org/r/550718 (owner: Elukey)
[18:11:59] Operations, SRE-tools, netbox, observability, User-crusnov: Netbox Alert Cleanups - https://phabricator.wikimedia.org/T224946 (jijiki) I have downtimed some of the alerts, but it will expire in a couple of hours from now
[18:19:47] (PS1) Jhedden: ceph: upgrade to latest upstream docker-ce version [puppet] - https://gerrit.wikimedia.org/r/550734 (https://phabricator.wikimedia.org/T236290)
[18:23:16] (CR) Jhedden: [C: +2] ceph: upgrade to latest upstream docker-ce version [puppet] - https://gerrit.wikimedia.org/r/550734 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[18:27:42] (PS7) Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[18:37:59] (PS8) Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[18:47:17] (CR) Dmaza: [C: +1] Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T236981) (owner: Tchanders)
[18:47:52] Operations, ops-esams, netops: mr1-esams flowd logs flood - https://phabricator.wikimedia.org/T238174 (ayounsi) a:ayounsi→RobH Over to Rob for the RMA.
[18:48:22] (PS1) Jgreen: remove DNS entries for alnilam.frack.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/550735 (https://phabricator.wikimedia.org/T238233)
[18:48:52] (CR) Jgreen: [C: +2] remove DNS entries for alnilam.frack.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/550735 (https://phabricator.wikimedia.org/T238233) (owner: Jgreen)
[18:49:42] !log authdns-update to remove host alnilam
[18:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:13] Operations, DC-Ops, decommission, fundraising-tech-ops, Patch-For-Review: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (Jgreen)
[18:50:25] Operations, DC-Ops, decommission, fundraising-tech-ops, Patch-For-Review: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (Jgreen)
[18:50:43] Operations, DC-Ops, decommission, fundraising-tech-ops, Patch-For-Review: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (Jgreen) a:Jgreen→Papaul
[18:57:10] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden)
[19:00:20] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden)
[19:10:42] (PS1) Mholloway: MachineVision: Update Wikidata item blacklist (again) [mediawiki-config] - https://gerrit.wikimedia.org/r/550739
[19:21:22] Operations, ops-esams, netops: mr1-esams flowd logs flood - https://phabricator.wikimedia.org/T238174 (ayounsi)
[19:22:06] Operations, ops-esams: Relabel cables with duplicate IDs - https://phabricator.wikimedia.org/T237006 (ayounsi) See also https://netbox.wikimedia.org/extras/reports/cables.Cables/#test_duplicate_cable_label
[19:22:50] Operations, ops-esams, DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (ayounsi) See also: https://netbox.wikimedia.org/extras/reports/cables.Cables/#test_blank_cable_label
[19:25:55] (PS1) CRusnov: ganeti-sync: Add retries to api calls [software/netbox-deploy] - https://gerrit.wikimedia.org/r/550741 (https://phabricator.wikimedia.org/T224946)
[19:33:01] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:35:05] (CR) Mholloway: [C: +2] MachineVision: Update Wikidata item blacklist (again) [mediawiki-config] - https://gerrit.wikimedia.org/r/550739 (owner: Mholloway)
[19:35:49] (Merged) jenkins-bot: MachineVision: Update Wikidata item blacklist (again) [mediawiki-config] - https://gerrit.wikimedia.org/r/550739 (owner: Mholloway)
[19:37:58] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Update WD item blacklist (again) (duration: 00m 52s)
[19:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:57] Operations, ops-esams, netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (ayounsi)
[20:00:04] cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191113T2000).
[20:01:45] jouncebot, your timezones are broken.
[20:11:27] (CR) RLazarus: [C: +1] Remove apache config for zero.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem)
[20:17:52] !log delete unused asw2-esams:ae1
[20:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:20] (CR) Jcrespo: "Once a month has passed without incidents, and archive is available to be restored, I would like to do a general clean up and remove all t" [puppet] - https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: Jcrespo)
[20:26:56] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (EBernhardson) Open→Resolved These servers (elastic1017-31) no longer ha...
[20:28:56] (PS9) Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[20:30:57] (CR) jerkins-bot: [V: -1] logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: Herron)
[20:33:18] (CR) Dzahn: "@Reuven Here's another candidate for httpbb?" [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem)
[20:34:19] (PS10) Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[20:36:01] (Restored) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[20:39:36] (PS1) RLazarus: Install httpbb on cluster-management hosts. [puppet] - https://gerrit.wikimedia.org/r/550750 (https://phabricator.wikimedia.org/T236699)
[20:43:30] (CR) Alexandros Kosiaris: md: Globally set lower sync limits (2 comments) [puppet] - https://gerrit.wikimedia.org/r/549847 (https://phabricator.wikimedia.org/T237197) (owner: Alexandros Kosiaris)
[20:44:22] (PS11) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[20:44:39] (PS3) Alexandros Kosiaris: md: Globally set lower sync limits [puppet] - https://gerrit.wikimedia.org/r/549847 (https://phabricator.wikimedia.org/T237197)
[20:45:35] Operations, ops-esams: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 (ayounsi)
[20:47:02] (Abandoned) Alexandros Kosiaris: First draft of a graphoid helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/434475 (owner: Alexandros Kosiaris)
[20:50:21] (PS11) Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[20:52:30] (PS12) Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340)
[21:01:34] (PS5) Alexandros Kosiaris: Update scaffold template names to use chart name [deployment-charts] - https://gerrit.wikimedia.org/r/539220 (owner: Jeena Huneidi)
[21:01:51] (CR) Alexandros Kosiaris: [C: +2] Update scaffold template names to use chart name [deployment-charts] - https://gerrit.wikimedia.org/r/539220 (owner: Jeena Huneidi)
[21:04:36] (PS1) RLazarus: [Testing only, don't merge] Install httpbb on appservers. [puppet] - https://gerrit.wikimedia.org/r/550752
[21:06:24] (CR) Herron: logstash: introduce logstash 7 and openjdk-11 support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: Herron)
[21:12:09] PROBLEM - Check the Netbox report cables for fail status. on netbox1001 is CRITICAL: cables.Cables CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:24:00] (PS10) DCausse: [wdqs] configure eventgate endpoint for sparql/query events [puppet] - https://gerrit.wikimedia.org/r/549081 (https://phabricator.wikimedia.org/T101013)
[21:24:02] (PS1) DCausse: [wdqs] Fix event service configuration [puppet] - https://gerrit.wikimedia.org/r/550754 (https://phabricator.wikimedia.org/T101013)
[21:25:28] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1001.wikimedia.org'] ` The log can be fo...
[21:44:11] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1001.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudcephosd1001.wikimedia.org'] `
[21:54:31] (CR) RLazarus: "Puppet-compile failed, on the prod and change sides, with errors that look unrelated to this change (https://puppet-compiler.wmflabs.org/c" [puppet] - https://gerrit.wikimedia.org/r/550750 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus)
[22:00:45] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/GrowthExperiments/: Update to master (b937dce) (duration: 00m 54s)
[22:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:27] !log catrope@deploy1001 scap sync-l10n completed (1.35.0-wmf.5) (duration: 02m 54s)
[22:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:19] !log catrope@deploy1001 Started scap: For some reason that limited i18n sync didn't work, trying a full scap
[22:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:05] RECOVERY - Check the Netbox report cables for fail status. on netbox1001 is OK: cables.Cables OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:11:23] (CR) CDanis: [C: +1] Install httpbb on cluster-management hosts. [puppet] - https://gerrit.wikimedia.org/r/550750 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus)
[22:14:43] (CR) RLazarus: [C: +1] "Yes! Deploying httpbb is still a work in progress (https://gerrit.wikimedia.org/r/c/operations/puppet/+/550750) so if you're in no hurry w" [puppet] - https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: MaxSem)
[22:22:53] (PS1) Jhedden: install_server: add boot partition to cloudcephosd config [puppet] - https://gerrit.wikimedia.org/r/550763 (https://phabricator.wikimedia.org/T224188)
[22:25:52] !log catrope@deploy1001 Finished scap: For some reason that limited i18n sync didn't work, trying a full scap (duration: 18m 33s)
[22:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:27] (CR) Jhedden: [C: +2] install_server: add boot partition to cloudcephosd config [puppet] - https://gerrit.wikimedia.org/r/550763 (https://phabricator.wikimedia.org/T224188) (owner: Jhedden)
[22:35:17] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:45:13] Operations, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1001.wikimedia.org...
[22:58:27] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime
[22:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191113T2300).
[23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:00:33] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:52] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:25] Operations, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1001.wikimedia.org'] ` and were **ALL** successful.
[23:07:58] Operations, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden)
[23:08:13] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:08:54] (CR) Alexandros Kosiaris: [C: -1] "Minor inline comments, but otherwise those could prove useful, so let's populate them and use them" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/547777 (owner: Brennen Bearnes)
[23:34:42] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports