[01:08:22] Warning Alert for device scs-c1-eqiad.mgmt.eqiad.wmnet - Processor usage over 85%
[01:31:25] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:42:15] Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (RobH) Open→Resolved a:RobH agreed
[02:08:19] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:10:41] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:21:55] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:29:20] Operations, ops-esams, Traffic: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (Vgutierrez)
[02:38:45] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:40:38] (CR) Vgutierrez: [C: +1] "Thanks for taking care of this <3" [puppet] - https://gerrit.wikimedia.org/r/550094 (https://phabricator.wikimedia.org/T236482) (owner: Filippo Giunchedi)
[03:06:49] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:17:31] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:19:35] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[03:19:49] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:21] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[03:21:30] here we ago again
[03:21:34] *go
[03:26:09] Operations, Traffic, HTTPS: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Morgankevinj)
[03:38:37] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:39:17] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:40:49] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:41:07] !log restart wdqs-blazegraph on wdqs1004
[03:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:03] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[03:42:48] onimisionipe: ^^ wdqs1004 didn't recover on its own... it needed a restart
[03:43:47] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[04:02:12] Operations, netops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (Vgutierrez)
[04:02:59] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[04:08:21] Warning Alert for device scs-c1-eqiad.mgmt.eqiad.wmnet - Processor usage over 85%
[04:22:18] (PS3) Ayounsi: Add vlan support for asw [homer/public] - https://gerrit.wikimedia.org/r/550376
[04:27:18] Operations, Traffic, HTTPS: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Vgutierrez) We should consider QUIC and HTTP/3 adoption carefully as it implies a switch from TCP to UDP, and that could open new (D)DoS vectors and render unusable some mitigation techni...
[04:33:57] (PS1) Ayounsi: Add partial chassis support for asw and cr [homer/public] - https://gerrit.wikimedia.org/r/550389
[05:19:18] Operations, Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Vgutierrez)
[05:32:40] (PS1) Vgutierrez: varnish: Update sec-warning message [puppet] - https://gerrit.wikimedia.org/r/550391 (https://phabricator.wikimedia.org/T238038)
[05:32:51] Operations, Traffic, Patch-For-Review: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Vgutierrez) p:Triage→Normal
[06:12:07] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[06:12:53] (PS1) Marostegui: packages_wmf: Install 10.1 with buster [puppet] - https://gerrit.wikimedia.org/r/550393
[06:14:29] (CR) Marostegui: [C: +1] Revert "nrpe: Don't set PrivateTmp=True" [puppet] - https://gerrit.wikimedia.org/r/464601 (owner: Jcrespo)
[06:20:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2087:3316, db2087:3317 for compression - T235599', diff saved to https://phabricator.wikimedia.org/P9588 and previous config saved to /var/cache/conftool/dbconfig/20191112-061959-marostegui.json
[06:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:06] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[06:21:17] !log Compress db2087:3316, db2087:3317 T235599
[06:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:51] !log Deploy schema change on s5 codfw with replication, this will generate lag on s5 codfw T233135 T234066
[06:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:57] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:40:57] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[06:44:00] !log Change triggers on s5 db2094 - T234704
[06:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:06] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[06:56:36] (PS1) Marostegui: mariadb: Remove db2048 from puppet [puppet] - https://gerrit.wikimedia.org/r/550394 (https://phabricator.wikimedia.org/T237913)
[06:57:09] (PS1) Marostegui: wmnet: Remove production DNS entries for db2048.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/550395 (https://phabricator.wikimedia.org/T237913)
[06:57:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:07] Operations, DBA, Patch-For-Review: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2048.codfw.wmnet` - db2048.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Downt...
[06:58:22] (CR) Marostegui: "check" [puppet] - https://gerrit.wikimedia.org/r/550394 (https://phabricator.wikimedia.org/T237913) (owner: Marostegui)
[06:58:43] (CR) Marostegui: [C: +2] mariadb: Remove db2048 from puppet [puppet] - https://gerrit.wikimedia.org/r/550394 (https://phabricator.wikimedia.org/T237913) (owner: Marostegui)
[06:59:00] (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries for db2048.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/550395 (https://phabricator.wikimedia.org/T237913) (owner: Marostegui)
[07:00:10] Operations, ops-codfw, decommission: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (Marostegui) a:Marostegui→Papaul
[07:00:17] Operations, ops-codfw, decommission: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (Marostegui) Host ready for @Papaul
[07:00:37] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:04:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1083 for kernel upgrade - T234800', diff saved to https://phabricator.wikimedia.org/P9589 and previous config saved to /var/cache/conftool/dbconfig/20191112-070436-marostegui.json
[07:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:43] T234800: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800
[07:08:22] Warning Alert for device scs-c1-eqiad.mgmt.eqiad.wmnet - Processor usage over 85%
[07:10:37] !log Upgrade kernel on db1083 (s1 candidate master)
[07:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:45] (CR) Jcrespo: [C: +1] packages_wmf: Install 10.1 with buster [puppet] - https://gerrit.wikimedia.org/r/550393 (owner: Marostegui)
[07:13:37] (CR) Marostegui: [C: +2] packages_wmf: Install 10.1 with buster [puppet] - https://gerrit.wikimedia.org/r/550393 (owner: Marostegui)
[07:25:50] Operations, DBA: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui) I just realised that when we created this task and scheduled the maintenance I didn't take into account that we'd have had the change from CEST to...
[07:26:37] Operations, DBA: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui)
[07:28:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1083 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P9590 and previous config saved to /var/cache/conftool/dbconfig/20191112-072823-marostegui.json
[07:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:29] (PS1) Elukey: query_service::common: reduce cronspam due to absent files [puppet] - https://gerrit.wikimedia.org/r/550396
[07:33:46] (PS3) Marostegui: m5 grants: remove grants for 'labtestwiki' database [puppet] - https://gerrit.wikimedia.org/r/543955 (https://phabricator.wikimedia.org/T233236) (owner: Andrew Bogott)
[07:34:40] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:36:12] !log remove /etc/logrotate.d/wdqs_autodeployment_log from wdqs1009 (not in puppet anymore and causing cronspam)
[07:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:20] Cc: gehel, onimisionipe --^
[07:37:17] (CR) Marostegui: [C: +2] m5 grants: remove grants for 'labtestwiki' database [puppet] - https://gerrit.wikimedia.org/r/543955 (https://phabricator.wikimedia.org/T233236) (owner: Andrew Bogott)
[07:38:52] elukey: Thanks!
[07:39:56] (CR) Filippo Giunchedi: [C: +2] monitoring: add alerts for ats availability [puppet] - https://gerrit.wikimedia.org/r/550094 (https://phabricator.wikimedia.org/T236482) (owner: Filippo Giunchedi)
[07:40:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic to db1083', diff saved to https://phabricator.wikimedia.org/P9591 and previous config saved to /var/cache/conftool/dbconfig/20191112-074006-marostegui.json
[07:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:24] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 4.321e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:53:55] * onimisionipe is looking
[07:54:36] onimisionipe: I'm still on the way back from school
[07:54:45] no p
[07:54:59] Can you take a few thread dumps from both blazegraph and updater?
[07:55:44] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 4.324e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:57:54] onimisionipe: dunno if you have seen it on the backlog / SAL, but I had to restart wdqs-blazergraph ~4 hours ago on wdqs1004
[07:59:22] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 4.324e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:09:57] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1004 is CRITICAL: 4.348e+04 ge 4.32e+04 Gehel under investigation https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic to db1083', diff saved to https://phabricator.wikimedia.org/P9592 and previous config saved to /var/cache/conftool/dbconfig/20191112-081322-marostegui.json
[08:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:13] !log installing curl security updates
[08:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:25] !log volker-e@deploy1001 Started deploy [design/style-guide@b926b95]: Deploy design/style-guide:
[08:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:32] !log volker-e@deploy1001 Finished deploy [design/style-guide@b926b95]: Deploy design/style-guide: (duration: 00m 07s)
[08:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:54] Operations, Traffic, Wikimedia-Site-requests, HTTPS: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Masumrezarock100)
[08:30:22] Operations, Traffic, HTTPS: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Peachey88) @Masumrezarock100 This is something that needs to be done on the operations side of thigs, so i've removed Site-Requests which is for local wiki config changes.
[08:35:56] !log installing poppler security updates
[08:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:19] !log depool wdqs1004 to investigate update lag
[08:37:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1083', diff saved to https://phabricator.wikimedia.org/P9593 and previous config saved to /var/cache/conftool/dbconfig/20191112-083720-marostegui.json
[08:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:16] !log volker-e@deploy1001 Started deploy [design/style-guide@3de6820]: Deploy design/style-guide:
[08:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:22] !log volker-e@deploy1001 Finished deploy [design/style-guide@3de6820]: Deploy design/style-guide: (duration: 00m 06s)
[08:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:36] (PS1) Muehlenhoff: Add Cumin aliases for presto cluster [puppet] - https://gerrit.wikimedia.org/r/550422
[08:48:09] (CR) Muehlenhoff: [C: +2] Add Cumin aliases for presto cluster [puppet] - https://gerrit.wikimedia.org/r/550422 (owner: Muehlenhoff)
[08:51:55] Operations, Traffic, HTTPS: Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Masumrezarock100) Oh I see.
[08:56:37] !log restarting archiva to pick up Java security updates
[08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:34] !log Upgrade mariadb to 10.1.39 on db1083 (candidate master for s1)
[09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1083 for mariadb upgrade to 10.1.39 - T234800', diff saved to https://phabricator.wikimedia.org/P9594 and previous config saved to /var/cache/conftool/dbconfig/20191112-091158-marostegui.json
[09:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:04] T234800: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800
[09:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1083', diff saved to https://phabricator.wikimedia.org/P9595 and previous config saved to /var/cache/conftool/dbconfig/20191112-091706-marostegui.json
[09:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:54] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo) Open→Resolved I am going to consider the switchover as done, and create a separate task for followups.
[09:22:58] Operations, DBA, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[09:25:14] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:25:44] Operations, DBA: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui)
[09:26:00] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[09:26:03] Operations, DBA: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui)
[09:27:16] Operations, DBA: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui) p:Triage→Normal
[09:27:30] !log restarting blazegraph on wdqs1004
[09:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:06] Operations, serviceops, Patch-For-Review, Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (elukey) Build and deployed 0.6.0 and deployed on deployment-prep-memc08. Also created the following: https://grafana-labs.wik...
[09:31:11] Operations, DBA: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui)
[09:31:24] Operations, DBA, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui)
[09:34:56] Operations, Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (jcrespo)
[09:35:59] Operations, DBA, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[09:36:39] Operations, observability, Availability, Goal, Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (jcrespo) a:Ottomata→jcrespo
[09:39:50] Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (jcrespo) Open→Stalled a:jcrespo→None
[09:39:52] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (jcrespo)
[09:42:04] !log Remove privileges for labtestwiki on m5 - T236010
[09:42:04] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:08] T236010: Drop labtestwikitech database from m5 - https://phabricator.wikimedia.org/T236010
[09:50:11] Operations, Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Vgutierrez)
[09:50:25] Operations, Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Vgutierrez) p:Triage→Normal
[09:52:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1083', diff saved to https://phabricator.wikimedia.org/P9596 and previous config saved to /var/cache/conftool/dbconfig/20191112-095221-marostegui.json
[09:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:35] (PS1) Vgutierrez: envoyproxy: Avoid overwriting existing server header [puppet] - https://gerrit.wikimedia.org/r/550436 (https://phabricator.wikimedia.org/T238050)
[10:05:52] Operations, ops-esams, Traffic: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (ema) p:Triage→Normal
[10:05:53] (PS4) Jbond: CI - pytohn3: first attempt at adding python3 CI [puppet] - https://gerrit.wikimedia.org/r/510613
[10:05:55] (PS1) Jbond: CI - pytohn3: make python3 the default for tests [puppet] - https://gerrit.wikimedia.org/r/550437
[10:06:59] !log repool cp3065, nothing interesting in kern.log and SEL T238032
[10:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:03] T238032: cp3065 crashed - https://phabricator.wikimedia.org/T238032
[10:07:37] (CR) jerkins-bot: [V: -1] CI - pytohn3: first attempt at adding python3 CI [puppet] - https://gerrit.wikimedia.org/r/510613 (owner: Jbond)
[10:07:52] (CR) jerkins-bot: [V: -1] CI - pytohn3: make python3 the default for tests [puppet] - https://gerrit.wikimedia.org/r/550437 (owner: Jbond)
[10:08:06] (PS5) Jbond: CI - python3: first attempt at adding python3 CI [puppet] - https://gerrit.wikimedia.org/r/510613
[10:08:22] Warning Alert for device scs-c1-eqiad.mgmt.eqiad.wmnet - Processor usage over 85%
[10:09:51] (CR) jerkins-bot: [V: -1] CI - python3: first attempt at adding python3 CI [puppet] - https://gerrit.wikimedia.org/r/510613 (owner: Jbond)
[10:15:52] (CR) Ema: [C: +1] envoyproxy: Avoid overwriting existing server header [puppet] - https://gerrit.wikimedia.org/r/550436 (https://phabricator.wikimedia.org/T238050) (owner: Vgutierrez)
[10:21:08] (Abandoned) Muehlenhoff: ntp: Restrict access [puppet] - https://gerrit.wikimedia.org/r/393247 (owner: Muehlenhoff)
[10:21:14] Operations: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (Vgutierrez)
[10:22:55] Operations, puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (Peachey88)
[10:30:37] !log Deploy schema change on dbstore1003:3315
[10:30:40] (CR) Vgutierrez: [C: +2] envoyproxy: Avoid overwriting existing server header [puppet] - https://gerrit.wikimedia.org/r/550436 (https://phabricator.wikimedia.org/T238050) (owner: Vgutierrez)
[10:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:03] Operations, serviceops, Patch-For-Review, Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (elukey) Not all new slab metrics are rendered, opened an issue upstream: https://github.com/prometheus/memcached_exporter/issu...
[10:31:07] (PS6) Jbond: CI - python3: first attempt at adding python3 CI [puppet] - https://gerrit.wikimedia.org/r/510613
[10:33:22] !log Drop labtestwiki database from m5 master db1133 - T236010
[10:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:26] T236010: Drop labtestwikitech database from m5 - https://phabricator.wikimedia.org/T236010
[10:35:09] (CR) Volans: "Some nits inline, but Arzhel's comment should be addressed." (3 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: CRusnov)
[10:35:11] !log resetting cronfile on wdqs hosts
[10:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1083', diff saved to https://phabricator.wikimedia.org/P9598 and previous config saved to /var/cache/conftool/dbconfig/20191112-103641-marostegui.json
[10:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:46] (CR) Volans: [C: +1] "Sure, why not, seem reasonable. It shouldn't break the current production version that doesn't know about this option IIRC, but please dou" [puppet] - https://gerrit.wikimedia.org/r/550245 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff)
[10:38:38] (PS3) Jon Harald Søby: Initial configuration for szywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369)
[10:39:44] (CR) Jon Harald Søby: "Updated to change the name of the project (Wyiki-payke --> Wikipitiya) and update the logos." [mediawiki-config] - https://gerrit.wikimedia.org/r/548717 (https://phabricator.wikimedia.org/T237369) (owner: Jon Harald Søby)
[10:43:44] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:44:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1083', diff saved to https://phabricator.wikimedia.org/P9599 and previous config saved to /var/cache/conftool/dbconfig/20191112-104400-marostegui.json
[10:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:25] (CR) Ema: [C: +2] ATS: double log_buffer_size and max_line_size [puppet] - https://gerrit.wikimedia.org/r/548258 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[10:48:41] wow, I was able to crash bacula
[10:48:54] with just listing bconsole commands
[10:49:06] :-/
[10:49:34] can you share some details?
[10:49:51] (CR) Mathew.onipe: "Root cause was due to the recent refactoring creating duplicate cron jobs. I reset cron on all servers except wdqs1010 because of ongoing " [puppet] - https://gerrit.wikimedia.org/r/550396 (owner: Elukey)
[10:49:58] I ran dir and bacula director crashed
[10:50:10] (CR) Mathew.onipe: [C: -1] "let's not merge this" [puppet] - https://gerrit.wikimedia.org/r/550396 (owner: Elukey)
[10:50:26] what??
[10:50:27] (Abandoned) Elukey: query_service::common: reduce cronspam due to absent files [puppet] - https://gerrit.wikimedia.org/r/550396 (owner: Elukey)
[10:50:42] elukey: Thanks a lot!
[10:50:43] onimisionipe: nice! just abandoned
[10:51:34] (PS1) Arturo Borrero Gonzalez: toolforge: new k8s: etcd: enable TLS for metrics endpoint [puppet] - https://gerrit.wikimedia.org/r/550442 (https://phabricator.wikimedia.org/T237643)
[10:52:15] marostegui: "Kaboom! bacula-dir, backup1001.eqiad.wmnet got signal 11 - Segmentation vi"
[10:52:24] "Kaboom! exepath=/usr/sbin/"
[10:52:32] "Bacula interrupted by signal 11: Segmentation violation"
[10:52:37] that sounds scary :/
[10:53:12] sounds like someone going to play with gdb this afternoon ;D
[10:53:17] good "morning"
[10:56:44] wow, impressive
[10:57:35] jouncebot now
[10:57:35] No deployments scheduled for the next 0 hour(s) and 2 minute(s)
[10:57:39] jouncebot next
[10:57:39] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T1100)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T1100).
[11:00:04] awight and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:13] o/
[11:00:34] Amir1: you deploying your own patches?
[11:01:06] o/ In ten minutes ish
[11:01:27] k, I'll do my own stuff now then
[11:02:24] I am planning to reimage mwdebug1002, can any of you please ping me when swat is done?
[11:02:53] I would appreciate it
[11:03:19] Urbanecm: what own stuff?
[11:03:25] I don’t see anything on the deployment calendar
[11:03:45] it doesn't have a gerrit number yet, since it's #security config patch (T237192)
[11:05:47] effie: IIRC we aren't supposed to use mwdebug1002 for testing per recent ops@lists.wm.o email?
[11:05:57] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SECURITY: Dont allow Wikimedia sysops to see who had 2FA disabled (duration: 00m 53s)
[11:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:06] (PS1) Urbanecm: SECURITY: Don't allow Wikimedia sysops to see who had 2FA disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/550445 (https://phabricator.wikimedia.org/T237192)
[11:06:12] Urbanecm: yes, but scap might be checking it still
[11:06:18] (PS1) Filippo Giunchedi: prometheus: collect logstash mtail metrics [puppet] - https://gerrit.wikimedia.org/r/550446 (https://phabricator.wikimedia.org/T236343)
[11:06:19] so, just to be sure
[11:06:26] (CR) Urbanecm: [C: +2] SECURITY: Don't allow Wikimedia sysops to see who had 2FA disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/550445 (https://phabricator.wikimedia.org/T237192) (owner: Urbanecm)
[11:06:44] ok, makes sense effie
[11:07:17] (Merged) jenkins-bot: SECURITY: Don't allow Wikimedia sysops to see who had 2FA disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/550445 (https://phabricator.wikimedia.org/T237192) (owner: Urbanecm)
[11:10:29] Operations, Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (Vgutierrez) Open→Resolved a:Vgutierrez Fixed with 2428105: ` (eqsin) $ curl -v https://en.wikipedia.org/api/rest_v1/page/summary/Tremont_Street_Subway 2>&1 |grep server: < server: restbase1017 `
[11:10:53] * Urbanecm is done
[11:11:06] Amir1: the air is clear for you, please let effie know once the SWAT is over
[11:11:18] sure
[11:11:50] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[11:12:21] (CR) Filippo Giunchedi: [C: +2] prometheus: collect logstash mtail metrics [puppet] - https://gerrit.wikimedia.org/r/550446 (https://phabricator.wikimedia.org/T236343) (owner: Filippo Giunchedi)
[11:12:40] (PS3) Ladsgroup: Set all of wikidata for write both for term store [mediawiki-config] - https://gerrit.wikimedia.org/r/548584 (https://phabricator.wikimedia.org/T225055)
[11:12:52] (CR) Ladsgroup: [C: +2] Set all of wikidata for write both for term store [mediawiki-config] - https://gerrit.wikimedia.org/r/548584 (https://phabricator.wikimedia.org/T225055) (owner: Ladsgroup)
[11:13:38] (PS3) Elukey: profile::mariadb::misc::eventlogging::sanitization: ease clean up [puppet] - https://gerrit.wikimedia.org/r/549070 (https://phabricator.wikimedia.org/T236818)
[11:13:41] (Merged) jenkins-bot: Set all of wikidata for write both for term store [mediawiki-config] - https://gerrit.wikimedia.org/r/548584 (https://phabricator.wikimedia.org/T225055) (owner: Ladsgroup)
[11:14:29] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[11:16:26] (CR) Elukey: [C: +2] profile::mariadb::misc::eventlogging::sanitization: ease clean up [puppet] - https://gerrit.wikimedia.org/r/549070 (https://phabricator.wikimedia.org/T236818) (owner: Elukey)
[11:17:59] (PS3) Elukey: Remove Eventloggging sanitization automation from log databases [puppet] - https://gerrit.wikimedia.org/r/549071 (https://phabricator.wikimedia.org/T236818)
[11:19:34] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: new k8s: etcd: enable TLS for metrics endpoint [puppet] - https://gerrit.wikimedia.org/r/550442 (https://phabricator.wikimedia.org/T237643) (owner: Arturo Borrero Gonzalez)
[11:20:52] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:548584|Set all of wikidata for write both for term store (T225055)]] (duration: 00m 52s)
[11:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:56] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055
[11:21:35] (PS1) Elukey: profile::mariadb::misc::eventlogging::sanitization: restore eventlog group [puppet] - https://gerrit.wikimedia.org/r/550447
[11:22:16] (CR) Elukey: [C: +2] profile::mariadb::misc::eventlogging::sanitization: restore eventlog group [puppet] - https://gerrit.wikimedia.org/r/550447 (owner: Elukey)
[11:23:23] Operations, Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (jcrespo) @akosiaris Could you give a quick look to see if these seems like a complete archive contents? {P9597} I can execute the recovery of these files [[ https://www.hungred.co...
[11:24:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::mariadb::misc::eventlogging::sanitization: restore eventlog group [puppet] - 10https://gerrit.wikimedia.org/r/550447 (owner: 10Elukey) [11:25:19] (03PS4) 10DannyS712: Partial cleanup of InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) [11:26:09] Amir1: completed? [11:28:01] (03CR) 10Volans: "As a side note, it's better to put comments inline, easier to reply and follow the thread there than here." [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [11:31:29] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Marostegui) I would like to know if there is some work going on to be able to split those tables from s4 into their own set of servers. My understanding... [11:32:22] (03CR) 10Muehlenhoff: "Ack, I'll check prod when merging; I mostly created this early so that I could continue to use the puppetised debmonitor config in my test" [puppet] - 10https://gerrit.wikimedia.org/r/550245 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [11:33:37] (03PS4) 10Elukey: Remove Eventloggging sanitization automation from log databases [puppet] - 10https://gerrit.wikimedia.org/r/549071 (https://phabricator.wikimedia.org/T236818) [11:34:11] effie: not yet, backports take really long [11:35:27] (03PS5) 10Elukey: Remove Eventloggging sanitization automation from log databases [puppet] - 10https://gerrit.wikimedia.org/r/549071 (https://phabricator.wikimedia.org/T236818) [11:36:20] (03CR) 10Jbond: "thanks rebased and updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [11:37:38] (03CR) 10Jbond: [C: 03+2] puppet_compiler.prepare: Fail fast if the git commands fail (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/549845 
(https://phabricator.wikimedia.org/T157001) (owner: 10Jbond) [11:39:37] (03CR) 10Elukey: [C: 03+2] Remove Eventloggging sanitization automation from log databases [puppet] - 10https://gerrit.wikimedia.org/r/549071 (https://phabricator.wikimedia.org/T236818) (owner: 10Elukey) [11:40:46] elukey: I guess that's ^ related to the alerts we are seeing on icinga now? [11:40:51] for db1107 and db1108 [11:41:01] (03PS1) 10Ema: ATS: move backend::storage_elements settings to role yaml [puppet] - 10https://gerrit.wikimedia.org/r/550448 (https://phabricator.wikimedia.org/T227432) [11:41:43] (03PS1) 10Jbond: 0.5.4: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/550449 [11:43:32] 10Operations, 10serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10jijiki) a:03jijiki [11:43:40] (03CR) 10Jbond: [C: 03+2] 0.5.4: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/550449 (owner: 10Jbond) [11:44:14] !log Upgrade wtp* to 7.2.24-1 with elegance and restart php-fpm - T237239 [11:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:18] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 [11:45:45] marostegui: yes yes after the next puppet run on icinga1001 they'll go away [11:46:14] (03CR) 10Ema: "pcc says it's a noop https://puppet-compiler.wmflabs.org/compiler1003/19344/" [puppet] - 10https://gerrit.wikimedia.org/r/550448 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:46:16] marostegui: should we open a task to move db1107 to you? 
[11:46:38] we only need to take a mysql dump now as precaution, but nothing more [11:47:17] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase term store error reduction, [[gerrit:550441|Do not catch DBError in ReplicaMasterAwareRecordIdsAcquirer.]] (T236466) (duration: 00m 56s) [11:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:22] T236466: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Wikimedia\Rdbms\Database::makeList: empty input for field wbxl_text_id) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 62] - https://phabricator.wikimedia.org/T236466 [11:47:39] !log EU SWAT is done [11:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:46] effie: the floor is yours [11:49:28] elukey: we have this https://phabricator.wikimedia.org/T234826 but it says db1108, we can probably just rephrase it? 
[11:49:40] Amir1: tx :) [11:49:48] elukey: or create a subtask for it, either way is fine [12:03:09] (03PS13) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [12:10:07] (03CR) 10Jbond: [C: 03+2] puppet-export-facts: use the certificate provided by localcacert [puppet] - 10https://gerrit.wikimedia.org/r/549857 (https://phabricator.wikimedia.org/T214472) (owner: 10Jbond) [12:11:27] !log Reimage mwdebug1002 - T214734 [12:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] T214734: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 [12:11:41] 10Operations, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: puppet: compiler-update-facts error and warning - https://phabricator.wikimedia.org/T214472 (10jbond) 05Open→03Resolved a:03jbond [12:12:00] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:15:42] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:16:22] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:18:32] damn [12:18:48] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:19:16] !log restarting blazegraph on wdqs1005 [12:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:02] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:20:32] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:21:10] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:25:38] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:25:58] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:30:00] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:37:48] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size 100 (T237984) [12:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:54] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [12:46:25] !log repool wdqs1004 [12:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:34] Ok [12:48:50] (03CR) 10Filippo Giunchedi: "LGTM overall, haven't tested it though" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [12:57:05] !log refresh kibana field list [12:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:07:47] lots of wikidata errors [13:07:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:08:22] 08Warning Alert for device scs-c1-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% [13:13:40] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:24:52] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:26:09] (03PS1) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [13:30:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:35:20] (03PS1) 10Ladsgroup: Revert "Set all of wikidata for write both for term store" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550461 [13:35:43] Amir1: you think it is related to that? [13:35:52] yup [13:35:55] https://logstash.wikimedia.org/goto/2b3c0bc9ff3633b8e964507a4a3d13e5 [13:36:00] the errors are for the new items [13:36:27] I was about to ping you Amir1, thanks :) [13:36:40] let me change it [13:36:52] (03PS1) 10BBlack: Switch to globalsign-2019 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/550462 (https://phabricator.wikimedia.org/T237650) [13:37:00] (03PS1) 10BBlack: Switch to globalsign-2019 globally [puppet] - 10https://gerrit.wikimedia.org/r/550463 (https://phabricator.wikimedia.org/T237650) [13:38:10] (03PS7) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [13:38:19] (03CR) 10Jbond: CI - python3: first attempt at adding python3 CI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [13:38:50] (03CR) 10BBlack: [C: 03+2] Switch to globalsign-2019 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/550462 (https://phabricator.wikimedia.org/T237650) (owner: 
10BBlack) [13:38:55] (03PS2) 10Ladsgroup: Revert "Set all of wikidata for write both for term store" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550461 [13:39:11] (03CR) 10Ladsgroup: [C: 03+2] Revert "Set all of wikidata for write both for term store" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550461 (owner: 10Ladsgroup) [13:39:48] jbond42: ok to merge? [13:40:02] (03Merged) 10jenkins-bot: Revert "Set all of wikidata for write both for term store" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550461 (owner: 10Ladsgroup) [13:41:42] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:43:44] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:44:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:46:28] (03PS1) 10Elukey: role::dumps::distribution::server: add kerberos [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) [13:53:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:54:20] (03CR) 10BBlack: [C: 03+2] Switch to globalsign-2019 globally [puppet] - 10https://gerrit.wikimedia.org/r/550463 (https://phabricator.wikimedia.org/T237650) (owner: 10BBlack) [13:58:48] RECOVERY - HTTPS Unified RSA on cp1075 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 309436 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 375 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:01:34] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 312691 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 375 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:01:43] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Set all of wikidata for write both for term store" (duration: 00m 52s) [14:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:52] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 312675 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 375 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:03:00] there might be a number of these recovery-spams from the cert lifetime checks, sorry for the noise [14:05:28] (03PS2) 10Ema: ATS: move backend::storage_elements settings to profile yaml [puppet] - 10https://gerrit.wikimedia.org/r/550448 (https://phabricator.wikimedia.org/T227432) [14:11:20] (03PS1) 10Filippo Giunchedi: logstash: alert on indexing failures 
[puppet] - 10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) [14:14:17] (03PS1) 10Jgreen: switch fundraisingdb-read.wmnet back to repaired frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/550472 [14:14:33] (03CR) 10jerkins-bot: [V: 04-1] logstash: alert on indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [14:15:18] (03CR) 10Jgreen: [C: 03+2] switch fundraisingdb-read.wmnet back to repaired frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/550472 (owner: 10Jgreen) [14:15:37] (03PS1) 10Ladsgroup: mediawiki: Make the rebuildItemTerms slightly script slower [puppet] - 10https://gerrit.wikimedia.org/r/550473 [14:16:02] jynus: elukey marostegui ^ Can you merge this please? :D [14:16:10] yep [14:16:17] Thanks [14:16:33] !log authdns-update to deploy fundraising-read.wmnet service cname adjustment [14:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:54] (03PS2) 10Filippo Giunchedi: logstash: alert on indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/550471 (https://phabricator.wikimedia.org/T236343) [14:18:55] (03CR) 10Marostegui: [C: 03+2] mediawiki: Make the rebuildItemTerms slightly script slower [puppet] - 10https://gerrit.wikimedia.org/r/550473 (owner: 10Ladsgroup) [14:19:31] Amir1: merged [14:22:58] marostegui: Thanks! 
[14:24:18] (03PS1) 10BBlack: Remove globalsign-2018 puppetization [puppet] - 10https://gerrit.wikimedia.org/r/550474 (https://phabricator.wikimedia.org/T237650) [14:26:35] (03PS1) 10Ema: ATS: add sysconfdir to ReadWritePaths if ocsp is enabled [puppet] - 10https://gerrit.wikimedia.org/r/550475 [14:27:58] (03CR) 10Ema: "pcc output here https://puppet-compiler.wmflabs.org/compiler1001/19352/cp4021.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/550475 (owner: 10Ema) [14:28:18] (03CR) 10jerkins-bot: [V: 04-1] ATS: add sysconfdir to ReadWritePaths if ocsp is enabled [puppet] - 10https://gerrit.wikimedia.org/r/550475 (owner: 10Ema) [14:29:23] (03PS2) 10Ema: ATS: add sysconfdir to ReadWritePaths if ocsp is enabled [puppet] - 10https://gerrit.wikimedia.org/r/550475 [14:32:07] (03CR) 10Ema: [C: 03+2] ATS: add sysconfdir to ReadWritePaths if ocsp is enabled [puppet] - 10https://gerrit.wikimedia.org/r/550475 (owner: 10Ema) [14:35:23] !log cp4021: ats-tls-restart to see if https://gerrit.wikimedia.org/r/550475 fixed the script [14:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:19] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4021.ulsfo.wmnet,service=nginx [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:38] (03Restored) 10Ottomata: Add schema.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/549105 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [14:38:41] (03Restored) 10Ottomata: Set up schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/549106 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [14:41:14] !log cp4022: trafficserver (8.0.5-1wm10) and fifo-log-demux (0.6) upgrade and restart [14:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:38] (03PS20) 10Jhedden: ceph: add ceph storage cluster profiles and modules [puppet] - 
10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [14:46:11] !log cpNNNN (all caches): remove stale outputs from transient ocsp failures ( /var/cache/ocsp/update-ocsp-*.tmp ) [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:12] (03CR) 10Jhedden: ceph: add ceph storage cluster profiles and modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [15:06:00] (03PS1) 10Andrew Bogott: Cloud puppetmasters: replace ruby-httpclient package [puppet] - 10https://gerrit.wikimedia.org/r/550480 [15:06:24] (03PS2) 10Andrew Bogott: Cloud puppetmasters: replace ruby-httpclient package [puppet] - 10https://gerrit.wikimedia.org/r/550480 (https://phabricator.wikimedia.org/T237994) [15:09:04] (03CR) 10Andrew Bogott: [C: 03+2] Cloud puppetmasters: replace ruby-httpclient package [puppet] - 10https://gerrit.wikimedia.org/r/550480 (https://phabricator.wikimedia.org/T237994) (owner: 10Andrew Bogott) [15:09:50] (03PS1) 10Effie Mouzeli: admin: Remove hhvm related sudo privileges [puppet] - 10https://gerrit.wikimedia.org/r/550483 (https://phabricator.wikimedia.org/T229792) [15:09:54] (03PS2) 10Ottomata: Set up schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/549106 (https://phabricator.wikimedia.org/T233630) [15:12:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/550483 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [15:15:56] (03PS2) 10Ottomata: Add schema.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/549105 (https://phabricator.wikimedia.org/T233630) [15:18:44] (03CR) 10Ema: [C: 03+1] Set up schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/549106 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [15:18:45] (03CR) 10Ema: [C: 03+1] Add schema.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/549105 
(https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [15:18:49] (03CR) 10Ottomata: [C: 03+2] Set up schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/549106 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [15:24:45] !log otto@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=schema [15:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:04] (03CR) 10Ottomata: [C: 03+2] Add schema.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/549105 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [15:28:36] (03PS1) 10Elukey: kerberos: allow the management of keytabs outside the keytab_specs [puppet] - 10https://gerrit.wikimedia.org/r/550491 [15:32:04] (03PS1) 10Ottomata: Fix indentation in discovery.yaml for schema [puppet] - 10https://gerrit.wikimedia.org/r/550493 (https://phabricator.wikimedia.org/T233630) [15:32:19] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix indentation in discovery.yaml for schema [puppet] - 10https://gerrit.wikimedia.org/r/550493 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [15:33:50] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:35:45] (03PS4) 10Ottomata: Set up cache routing for schema.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/549177 (https://phabricator.wikimedia.org/T233630) [15:41:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3315 for a schema change T233135 T234066', diff saved to https://phabricator.wikimedia.org/P9600 and previous config saved to /var/cache/conftool/dbconfig/20191112-154127-marostegui.json [15:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:37] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [15:41:38] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [15:44:57] (03PS3) 10Ema: ATS: move backend::storage_elements settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/550448 (https://phabricator.wikimedia.org/T227432) [15:45:26] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/MachineVision: Fixes and tweaks for initial rollout (duration: 00m 53s) [15:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] (03CR) 10Ema: [C: 03+2] ATS: move backend::storage_elements settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/550448 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [15:49:26] !log Deploy schema change on db1102:3315 T233135 T234066 [15:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:31] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [15:49:32] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [15:55:51] (03PS8) 10Herron: Introduce Elastic 7 support [puppet] - 10https://gerrit.wikimedia.org/r/545867 
(https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [15:57:03] (03CR) 10RLazarus: "Thanks for the heads up! I'll start looking at setting that up -- it'll probably suit us just fine." [puppet] - 10https://gerrit.wikimedia.org/r/549668 (owner: 10RLazarus) [16:00:04] godog and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:04:25] (03PS1) 10Ema: cache: reimage cp3052 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550501 (https://phabricator.wikimedia.org/T227432) [16:08:58] (03PS1) 10Andrew Bogott: Openstack: Add config files for version 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/550502 (https://phabricator.wikimedia.org/T237749) [16:09:00] (03PS1) 10Andrew Bogott: OpenStack Nova: Update config to work with Ocata [puppet] - 10https://gerrit.wikimedia.org/r/550503 (https://phabricator.wikimedia.org/T237749) [16:09:02] (03PS1) 10Andrew Bogott: Openstack Neutron: Remove no-longer-supported min_l3_agents_per_router [puppet] - 10https://gerrit.wikimedia.org/r/550504 [16:09:45] !log depool cp3052 and observe performance impact T238085 before reimaging as text_ats T227432 [16:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:51] T238085: Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 [16:09:52] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:17:09] (03PS1) 10Arturo Borrero Gonzalez: cloud: refactor prometheus role [puppet] - 10https://gerrit.wikimedia.org/r/550506 (https://phabricator.wikimedia.org/T238096) [16:17:18] (03CR) 10Ema: [C: 03+2] cache: reimage cp3052 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550501 
(https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [16:19:32] (03CR) 10Ayounsi: "> Patch Set 2:" [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [16:19:54] (03PS3) 10Ayounsi: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 [16:20:11] (03CR) 10BPirkle: "Following up on Krinkle's comment - is there any reason not to get https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/548923" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [16:21:02] !log reboot scs-c1-eqiad.mgmt.eqiad.wmnet - T238036 [16:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:07] T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 [16:22:13] (03CR) 10jerkins-bot: [V: 04-1] Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [16:27:18] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2006:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536} site=codfw tunnel={cp3052_v4,cp3052_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:27:34] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3052_v4,cp3052_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:27:54] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2006:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536} site=codfw tunnel={cp3052_v4,cp3052_v6} Ema reimaging cp3052 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:27:54] ACKNOWLEDGEMENT - 
Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3052_v4,cp3052_v6} Ema reimaging cp3052 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:28:43] !log setup bgp session from cr2-codfw to multihop RIS collector - T106056 [16:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:47] T106056: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056 [16:29:28] (03PS2) 10BBlack: Remove globalsign-2018 puppetization [puppet] - 10https://gerrit.wikimedia.org/r/550474 (https://phabricator.wikimedia.org/T237650) [16:30:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph: add ceph storage cluster profiles and modules [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [16:34:33] marostegui: errors stopped https://logstash.wikimedia.org/goto/9dec48b7ad242f15008ac2f1f4a91189 [16:35:23] Amir1: Excellent, thank you [16:37:39] (03PS2) 10Ayounsi: msw: ensure no vlans are configured [homer/public] - 10https://gerrit.wikimedia.org/r/549938 [16:38:22] 08̶W̶a̶r̶n̶i̶n̶g Device scs-c1-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% [16:40:35] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [16:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:50] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:42:32] (03CR) 10BBlack: [C: 03+2] Remove globalsign-2018 puppetization [puppet] - 10https://gerrit.wikimedia.org/r/550474 (https://phabricator.wikimedia.org/T237650) (owner: 10BBlack) [16:42:37] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:47:27] (03PS1) 10Jon Harald Søby: Add gcr and shy languages [dns] - 10https://gerrit.wikimedia.org/r/550511 (https://phabricator.wikimedia.org/T238104) [16:53:42] !log cpNNNN (all cache nodes) - cumin manual removal of globalsign-2018 remnants (key, cert, ocsp config, ocsp output) [16:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:07] (03CR) 10Dzahn: [C: 03+2] "approved by langcom (https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Sakizaya)" [dns] - 10https://gerrit.wikimedia.org/r/548718 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [16:54:39] (03PS2) 10Dzahn: Add Sakizaya (szy) language [dns] - 10https://gerrit.wikimedia.org/r/548718 (https://phabricator.wikimedia.org/T237369) (owner: 10Jon Harald Søby) [16:55:15] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:59:05] (03CR) 10BPirkle: "Per Krinkle's comment on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/549219/, should we be doing https://gerrit.wikim" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [16:59:25] (03PS1) 10Marostegui: mariadb: db1134 will be the future candidate master for s1 [puppet] - 10https://gerrit.wikimedia.org/r/550514 (https://phabricator.wikimedia.org/T234800) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T1700). [17:00:36] (03CR) 10Krinkle: [C: 03+1] "@bill: Converstation on that change and at T236833 has since settled on doing it per server instead (this change). So that's good to go fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [17:02:47] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (10Marostegui) Please let us know when the database is created so we can sanitize it on labs, before sending it to the cloud team for the views creation [17:03:06] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (10Marostegui) Please let us know when the database is created so we can sanitize it on labs, before sending it to the cloud team for the views... 
[17:03:17] !log pool cp3052 with ATS backend T227432 [17:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:22] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [17:03:23] 10Operations, 10netops, 10observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (10fgiunchedi) (2) to me seems the way to go as it would integrate best with our existing workflows. With an eye pointed at low hanging fruits though I'm w... [17:03:41] !log pool cp3052 with ATS backend T238085 [17:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:46] T238085: Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 [17:07:45] (03CR) 10Marostegui: [C: 03+2] mariadb: db1134 will be the future candidate master for s1 [puppet] - 10https://gerrit.wikimedia.org/r/550514 (https://phabricator.wikimedia.org/T234800) (owner: 10Marostegui) [17:10:17] (03PS1) 10Alexandros Kosiaris: cp1071: mark as test [puppet] - 10https://gerrit.wikimedia.org/r/550515 [17:12:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] cp1071: mark as test [puppet] - 10https://gerrit.wikimedia.org/r/550515 (owner: 10Alexandros Kosiaris) [17:15:23] (03CR) 10ArielGlenn: "I'm not clear on the problem with a new keytab if another server gets this role. Right now for example both labstore1006 and 1007 have it;" [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [17:17:42] did houncebot ping us an hour early today? 
[17:17:47] *jouncebot [17:19:42] jouncebot: next [17:19:42] In 0 hour(s) and 40 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T1800) [17:22:27] !log update fasw-c-codfw to match current standard (ntp/users/rootpw/lldp) [17:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:27] (03PS4) 10Urbanecm: Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) [17:24:50] usually services deploy is at 10 PT .. but looks like window changed .. anyway, arlo will deploy parsoid code in the next hour [17:25:15] subbu DST changes? :) [17:25:24] i was wondering. [17:25:43] everything is one hour before now, even morning SWAT [17:25:47] my time of course [17:30:54] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for db213[2-5] [dns] - 10https://gerrit.wikimedia.org/r/550517 [17:31:40] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata, 10Structured-Data-Backlog (Current Work): Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Ramsey-WMF) Matthias will look into discrepancies between number of files with mediainfo slots vs. what's ind... [17:33:19] !log update fasw-c-eqiad to match current standard (ntp/users/rootpw/lldp) [17:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:50] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (10Papaul) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T1800). [18:00:04] Tks4Fish: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:16] I can SWAT today! [18:00:26] hooray! [18:00:56] (03PS2) 10Urbanecm: Add right "abusefilter-log-private" to usergroup "rollbacker" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549987 (https://phabricator.wikimedia.org/T237830) (owner: 10Tks4Fish) [18:01:02] (03CR) 10Urbanecm: [C: 03+2] Add right "abusefilter-log-private" to usergroup "rollbacker" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549987 (https://phabricator.wikimedia.org/T237830) (owner: 10Tks4Fish) [18:01:29] (03CR) 10BPirkle: "@krinkle thank you for the clarification." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [18:01:59] (03Merged) 10jenkins-bot: Add right "abusefilter-log-private" to usergroup "rollbacker" at ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549987 (https://phabricator.wikimedia.org/T237830) (owner: 10Tks4Fish) [18:02:16] thanks Urbanecm :) [18:02:41] Tks4Fish: please test at mwdebug1001 and let me know [18:04:16] (03CR) 10BPirkle: "Per Krinkle's comment on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/548944/ memory limits are going to be adjusted o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [18:06:08] Tks4Fish: any problem? :-) [18:06:18] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10JHedden) >>! In T228102#5648171, @JHedden wrote: > @Jclark-ctr Could you help me with the cloudcephmon1002 and cloudcephmon1003 servers? I'm un... [18:06:52] Urbanecm: check your PMs :P [18:07:15] :D [18:07:37] mwdebug1001 is a debug server we use to make sure patches work as expected [18:08:02] ah [18:08:09] so how does that work? 
[18:08:12] you need to install a browser extension to be able to connect to the debug server, and then you just need to make sure the patch works [18:08:18] see https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions for links to the extension [18:08:29] !log push pfw change to add recdns anycast IP [18:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:58] ah, okay, will do that now then [18:10:00] cool [18:12:50] (03PS5) 10Herron: logstash: add version param and exclude plugins when non 5.x [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [18:13:45] yeah, all okay, Urbanecm :) [18:13:51] great, syncing [18:15:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 130ef87: Add right "abusefilter-log-private" to usergroup "rollbacker" at ptwiki (T237830) (duration: 00m 53s) [18:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:20] T237830: Add right "abusefilter-log-private" to rollbackers at ptwiki - https://phabricator.wikimedia.org/T237830 [18:15:23] Tks4Fish: here you are! [18:18:37] !log Deploy security patch for T237887 [18:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:52] great :D [18:18:58] thanks man :) [18:19:03] happy to help! [18:20:28] !log Morning SWAT done [18:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:51] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:22:42] (03CR) 10Catrope: [C: 03+2] [beta] Set wgGERestbaseUrl to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550350 (https://phabricator.wikimedia.org/T238011) (owner: 10Urbanecm) [18:23:26] (03Merged) 10jenkins-bot: [beta] Set wgGERestbaseUrl to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550350 (https://phabricator.wikimedia.org/T238011) (owner: 10Urbanecm) [18:28:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10Fjalapeno) approved [18:29:16] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10Fjalapeno) a:05corey→03Fjalapeno [18:31:43] (03PS1) 10BBlack: Unified cert: add digicert-2019a files [puppet] - 10https://gerrit.wikimedia.org/r/550525 [18:31:45] (03PS1) 10BBlack: Unified cert: deploy digicert-2019a to infra [puppet] - 10https://gerrit.wikimedia.org/r/550526 [18:32:04] (03CR) 10Elukey: "> I'm not clear on the problem with a new keytab if another server" [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:44:21] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:53:48] !log arlolra@deploy1001 Started deploy [parsoid/deploy@f516018]: Updating Parsoid to 6a0a708 [18:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:30] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/MachineVision: Final fixes and tweaks for testing (duration: 00m 53s) [18:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:01:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:03:57] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@f516018]: Updating Parsoid to 6a0a708 (duration: 10m 09s) [19:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) @Andrew want to confirm this box is not in use right now. Need to perform additional test for dell [19:06:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) It's still out of service awaiting a fix. 
[19:18:53] !log Updated Parsoid to 6a0a708 (T215000, T235295, T235656, T235217, T235295, T236846, T237556, T235231) [19:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:09] T236846: TemplateHandler.php: trim() expects parameter 1 to be string, boolean given - https://phabricator.wikimedia.org/T236846 [19:19:10] T235656: Ref fragments remain unexpanded in Image:Frameless, mw:ExpandedAttrs, mw:LanguageVariant nodes - https://phabricator.wikimedia.org/T235656 [19:19:10] T235295: MathML tags are missing xmlns attribute - https://phabricator.wikimedia.org/T235295 [19:19:10] T235217: Parsoid should use protocol-relative URLs for media - https://phabricator.wikimedia.org/T235217 [19:19:11] T237556: Detect html2wt reqs issued to Parsoid/PHP with data-parsoid blobs generated by Parsoid/JS and issue a HTTP 421 - https://phabricator.wikimedia.org/T237556 [19:19:11] T215000: Fill gaps in PHP DOM's functionality - https://phabricator.wikimedia.org/T215000 [19:19:11] T235231: Parsoid/JS video tag has a "seek" parameter in the URL that Parsoid/PHP video tag output doesn't - https://phabricator.wikimedia.org/T235231 [19:33:09] (03PS1) 10Jon Harald Søby: RESTRouter: Add gcrwiki and shywiktionary [deployment-charts] - 10https://gerrit.wikimedia.org/r/550533 (https://phabricator.wikimedia.org/T238117) [19:35:58] (03PS1) 10Anomie: Set MCR migration stage to NEW on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550534 (https://phabricator.wikimedia.org/T198312) [19:36:53] (03CR) 10Anomie: [C: 03+2] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550534 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [19:37:38] (03Merged) 10jenkins-bot: Set MCR migration stage to NEW on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550534 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [19:41:33] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set MCR 
migration stage to NEW on group0 for T198312 (duration: 00m 52s) [19:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:39] T198312: Set the WMF cluster to use the new MCR-only schema - https://phabricator.wikimedia.org/T198312 [19:43:33] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Sync a previously undeployed change to InitialiseSettings-labs.php that someone forgot to deploy (as a no-op) in production (duration: 00m 52s) [19:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:49] (03PS1) 10Bartosz Dziewoński: Clean up VisualEditor settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550535 [19:45:41] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki=wikidatawiki --property-id P4839 --new-data-type external-id (T234221) [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:46] T234221: Change the datatype of P4839 from string to identifier - https://phabricator.wikimedia.org/T234221 [19:46:09] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki=wikidatawiki --property-id P7007 --new-data-type external-id (T234221) [19:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:46] (03CR) 10Anomie: "> If you keep track of moving it forwards, I would be fine with that as well. 
I'll be mostly out for wikitech, and we should not delay thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: 10Daniel Kinzler) [20:07:47] 10Operations, 10Performance-Team, 10Traffic: Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (10Gilles) Impact of that test in Europe: https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?panelId=54&fulls... [20:09:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) Verified performance mode in bios . loaded stress test multiple errors on start up sent errors to dell Req... [20:11:53] !log otto@cumin1001 START - Cookbook sre.hosts.downtime [20:11:55] !log otto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:59] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by otto@cumin1001 on 1 host(s) and their services with reason: analytics1062 lost one of its power sup... [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:31] very stupid question, how can I find out what version of debian a node is without sshing into it? 
I couldn't find anything in puppet for contint1001 [20:17:28] (03PS4) 10BPirkle: Enable REST API on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [20:18:43] !log reprepro copy buster-wikimedia stretch-wikimedia prometheus-elasticsearch-exporter [20:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:49] RECOVERY - IPMI Sensor Status on analytics1062 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:47:13] (03CR) 10Jforrester: [C: 04-2] "You can't change IS and CS like this in a single commit, it can't be deployed (or if you force it, it will break the cluster). Add the new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550535 (owner: 10Bartosz Dziewoński) [20:54:21] 10Operations, 10vm-requests, 10Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (10Gilles) [20:56:14] (03CR) 10Dzahn: [C: 03+2] Add Nikki Nikkhoui to the restricted group [puppet] - 10https://gerrit.wikimedia.org/r/549700 (https://phabricator.wikimedia.org/T237689) (owner: 10Muehlenhoff) [20:56:27] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10Gilles) [20:57:09] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [20:57:10] 10Operations, 10Traffic, 10Performance-Team (Radar): Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (10Gilles) [20:58:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10Dzahn) @Fjalapeno Thanks for the approval @nnikkhoui You have been added to the restricted group. [21:03:21] (03PS1) 10Ayounsi: Add policy-statement BGP_graceful_shutdown_out [homer/public] - 10https://gerrit.wikimedia.org/r/550550 (https://phabricator.wikimedia.org/T211728) [21:03:36] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add policy-statement BGP_graceful_shutdown_out [homer/public] - 10https://gerrit.wikimedia.org/r/550550 (https://phabricator.wikimedia.org/T211728) (owner: 10Ayounsi) [21:12:09] 10Operations, 10netops, 10Patch-For-Review: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10ayounsi) 05Open→03Resolved a:03ayounsi All good! [21:12:52] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10HHVM: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629 (10matmarex) [21:15:41] 10Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275 (10matmarex) [21:15:42] 10Operations, 10Wikimedia-Apache-configuration: URL parameters do not work with pages that have "?" 
in their names - https://phabricator.wikimedia.org/T123276 (10matmarex) [21:15:48] (03CR) 10Cwhite: [C: 03+2] hiera: update ores matching rules and drop undefined metrics [puppet] - 10https://gerrit.wikimedia.org/r/548938 (https://phabricator.wikimedia.org/T233448) (owner: 10Cwhite) [21:15:56] (03PS2) 10Cwhite: hiera: update ores matching rules and drop undefined metrics [puppet] - 10https://gerrit.wikimedia.org/r/548938 (https://phabricator.wikimedia.org/T233448) [21:16:34] 10Operations, 10Wikimedia-Apache-configuration: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276 (10matmarex) I can't reproduce this problem any more. I assume this was fixed by migrating to PHP 7 (T176370). [21:16:53] 10Operations, 10Wikimedia-Apache-configuration: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276 (10matmarex) 05Open→03Resolved [21:25:20] (03PS1) 10Papaul: DHCP: Add MAC address for db213[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/550554 (https://phabricator.wikimedia.org/T237702) [21:27:02] (03PS2) 10Ayounsi: Add partial chassis support for asw and cr [homer/public] - 10https://gerrit.wikimedia.org/r/550389 [21:41:55] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:54:55] !log depooling cp1076 for some local experimentation [21:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:51] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10nnikkhoui) Thanks @Dzahn ! [22:13:24] clinic duty week mutante ? 
:) [22:24:34] (03PS1) 10BBlack: Add GlobalSign R3/R5 cross-signing intermediate [puppet] - 10https://gerrit.wikimedia.org/r/550564 [22:24:36] (03PS1) 10BBlack: x509-bundle: blacklist GlobalSign R5 Root [puppet] - 10https://gerrit.wikimedia.org/r/550565 [22:27:29] (03CR) 10BPirkle: [C: 03+1] Enable REST API on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [22:27:52] (03CR) 10jerkins-bot: [V: 04-1] x509-bundle: blacklist GlobalSign R5 Root [puppet] - 10https://gerrit.wikimedia.org/r/550565 (owner: 10BBlack) [22:29:21] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10Dzahn) 05Open→03Resolved @nnikkhoui You're welcome. I'm calling this resolved. Let us know if any unexpected issues. [22:29:50] (03CR) 10Tim Starling: [C: 03+2] Enable REST API on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [22:30:41] (03Merged) 10jenkins-bot: Enable REST API on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [22:31:09] (03PS2) 10BBlack: Add GlobalSign R3/R5 cross-signing intermediate [puppet] - 10https://gerrit.wikimedia.org/r/550564 [22:31:11] (03PS2) 10BBlack: x509-bundle: blacklist GlobalSign R5 Root [puppet] - 10https://gerrit.wikimedia.org/r/550565 [22:32:03] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] member xe-7/0/9 { ... } + member ge-5/0/1; [edit interface... 
[22:34:39] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: enabling REST API (duration: 00m 52s) [22:34:40] (03CR) 10BBlack: [C: 03+2] Add GlobalSign R3/R5 cross-signing intermediate [puppet] - 10https://gerrit.wikimedia.org/r/550564 (owner: 10BBlack) [22:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:42] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: enabling REST API (duration: 00m 52s) [22:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (10Papaul) [22:37:16] !log repool cp1076 (experiments concluded) [22:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:22] (03CR) 10BBlack: [C: 03+1] "Not ideal to put the blacklist inside the script, but this is a tradeoff on keeping the change minimal and easy to review because it's goi" [puppet] - 10https://gerrit.wikimedia.org/r/550565 (owner: 10BBlack) [22:41:49] (03CR) 10BBlack: [C: 03+2] x509-bundle: blacklist GlobalSign R5 Root [puppet] - 10https://gerrit.wikimedia.org/r/550565 (owner: 10BBlack) [22:43:08] (03CR) 10Dzahn: [C: 03+1] "the ticket comments confirm buster is desired here. ACK" [puppet] - 10https://gerrit.wikimedia.org/r/550554 (https://phabricator.wikimedia.org/T237702) (owner: 10Papaul) [22:47:21] (03CR) 10Dzahn: [C: 03+1] DNS: Add mgmt and production DNS for db213[2-5] [dns] - 10https://gerrit.wikimedia.org/r/550517 (owner: 10Papaul) [22:58:13] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T2300). [23:00:04] Zoranzoki21: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] Hi [23:00:33] :) jouncebot: now [23:00:37] jouncebot: now [23:00:37] For the next 0 hour(s) and 59 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191112T2300) [23:00:42] jouncebot: next [23:00:43] In 11 hour(s) and 59 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191113T1100) [23:01:16] Wait it's an hour early?! [23:02:02] Yes [23:02:03] Looks so [23:02:08] It shows happening now on page [23:02:35] https://snipboard.io/FngJWo.jpg [23:04:50] Even better and for me and for us :) [23:05:34] For me because of sleep because I'm tired.. I spend at least two hours traveling all day from home to school and from school to home [23:06:30] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephmon1002.wikimedia.org'] ` The... 
[23:07:57] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for db213[2-5] [dns] - 10https://gerrit.wikimedia.org/r/550517 (owner: 10Papaul) [23:09:13] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for db213[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/550554 (https://phabricator.wikimedia.org/T237702) (owner: 10Papaul) [23:11:12] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for db213[2-5] [dns] - 10https://gerrit.wikimedia.org/r/550517 [23:11:32] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for db213[2-5] [dns] - 10https://gerrit.wikimedia.org/r/550517 (owner: 10Papaul) [23:15:18] RoanKattuow: What happening? [23:15:28] Oh typo.. RoanKattouw: [23:20:51] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [23:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:57] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:18] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephmon1002.wikimedia.org'] ` and were **ALL** successful. 
[23:28:15] (03PS2) 10Dzahn: Add gcr and shy languages [dns] - 10https://gerrit.wikimedia.org/r/550511 (https://phabricator.wikimedia.org/T238104) (owner: 10Jon Harald Søby) [23:29:45] Sorry, but what happens with SWAT? [23:29:55] Normally it's an hour later [23:29:59] Oh sorry, I didn't see your patch [23:30:01] I can deploy that now [23:30:34] Oh ok [23:30:55] I don't know why it shows that SWAT happens now [23:30:58] on wikitech [23:31:44] I guess they kept it at the same time in UTC [23:31:57] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephmon1003.wikimedia.org'] ` The... [23:32:06] In previous years we shifted the UTC time to account for daylight savings, so it would be the same time for the US and Europe [23:32:47] Instead, it's now moved by an hour for the US and Europe, and stayed at the same time in places that don't observe DST (e.g. India, many countries in East Asia) [23:34:20] I don't understand it, but ok.. I saw you voted patch, tell me when it is available in mwdebug [23:45:15] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [23:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:47] RoanKattouw: Can you tell me on which mwdebug is?
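[editor's note] The DST explanation at 23:32 above can be checked with a short script. This is only a sketch: it assumes Python 3.9+ with the standard `zoneinfo` module and an available tz database. The helper name `local_hhmm`, the three example time zones, and the "before" date of 2019-10-22 are illustrative choices, not anything taken from the log. The 18:00 UTC slot is the Morning SWAT window from the deploycal item 20191112T1800.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def local_hhmm(utc_dt: datetime, tz_name: str) -> str:
    """Return the wall-clock time (HH:MM) in tz_name for an aware UTC datetime."""
    return utc_dt.astimezone(ZoneInfo(tz_name)).strftime("%H:%M")


# Same 18:00 UTC window on two dates: one while the US and Europe were
# still on DST (hypothetical example date), one after both had switched back.
before = datetime(2019, 10, 22, 18, 0, tzinfo=timezone.utc)
after = datetime(2019, 11, 12, 18, 0, tzinfo=timezone.utc)

if __name__ == "__main__":
    for tz in ("America/Los_Angeles", "Europe/Berlin", "Asia/Kolkata"):
        print(f"{tz}: {local_hhmm(before, tz)} -> {local_hhmm(after, tz)}")
```

With the UTC time held fixed, the window lands one hour earlier on the local clock in Los Angeles and Berlin, and at the same local time in Kolkata, which does not observe DST.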
[23:49:28] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:43] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/includes/changes/ChangesList.php: Remove extraneous semicolons (T233649), part 1 (duration: 00m 53s) [23:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:48] T233649: Stray semicolons in RecentChanges, Watchlist, History and Contributions interface - https://phabricator.wikimedia.org/T233649 [23:50:59] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/resources/src/mediawiki.interface.helpers.styles.less: Remove extraneous semicolons (T233649), part 2 (duration: 00m 52s) [23:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:05] Zoranzoki21: Sorry, I went and deployed it right away without testing [23:52:18] No problem [23:52:20] Ok is all [23:52:21] Maybe I shouldn't have, but this seemed like a simple change [23:52:45] Ok is all [23:52:48] No problem :) [23:52:50] Thank you! [23:52:55] With Jon's help I just verified it [23:53:13] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephmon1003.wikimedia.org'] ` and were **ALL** successful. [23:53:46] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:24] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:57:38] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/MachineVision: Fix: Do not return after inserting a single suggestion (duration: 00m 52s) [23:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log