[00:15:47] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:45:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 111 probes of 466 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:46:49] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:36:51] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:45:47] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 214.58 ms [01:53:35] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:05:19] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 214.50 ms [02:13:07] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:19:01] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 80%, RTA = 214.67 ms [02:26:47] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:35:43] 10Operations, 10Wikimedia-Mailing-lists, 10I18n, 10RTL: Make pipermail show RTL emails better by emitting dir=auto - https://phabricator.wikimedia.org/T235458 (10Bawolff) [02:59:57] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18097000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:01:35] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:13:55] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, 
RTA = 217.54 ms [03:21:41] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:27:14] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @papaul you can proceed at will with lvs2009 and lvs2010 because they are not handling production traffic at the moment [04:11:46] (03CR) 10Marostegui: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300) (owner: 10Marostegui) [04:11:54] (03PS5) 10Marostegui: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300) [04:12:04] (03CR) 10Marostegui: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300) (owner: 10Marostegui) [04:12:08] (03PS4) 10Marostegui: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300) [04:14:43] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 86%, RTA = 215.02 ms [04:15:46] !log Start pre-switchover steps T234300 [04:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:51] T234300: Switchover s5 primary database master db1070 -> db1100 - 15th Oct 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234300 [04:17:40] 10Operations, 10Traffic: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) nice catch, I have to backport the SSL EC cache PR... 
so I guess I'll include this one as well [04:22:33] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [04:24:38] 10Operations, 10netops: mr1-eqsin.oob IPv6 connectivity flapping - https://phabricator.wikimedia.org/T227967 (10Marostegui) 05Resolved→03Open This has been flapping overnight (times in UTC+2): ` [02:46:49] <+icinga-wm> PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:45:4... [04:30:08] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300) (owner: 10Marostegui) [04:31:35] (03PS2) 10Marostegui: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142) [04:33:31] (03CR) 10ArielGlenn: "Looks good, can merge and deploy whenever you like." [puppet] - 10https://gerrit.wikimedia.org/r/542278 (owner: 10Hoo man) [04:34:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P9335 and previous config saved to /var/cache/conftool/dbconfig/20191015-043403-marostegui.json [04:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:31] 10Operations, 10Traffic: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) hmmm it looks like the commit is on 8.x but the feature is disabled by default, see https://github.com/apache/trafficserver/pull/3940/commits/cc7b7dd7540fb33d93aa969defb416461c0202e7 [05:00:04] marostegui and jynus: Dear deployers, time to do the s5 database master failover deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T0500). 
[05:00:08] !log Starting s5 failover from db1070 to db1100 - T234300 [05:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:12] T234300: Switchover s5 primary database master db1070 -> db1100 - 15th Oct 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234300 [05:00:17] !log marostegui@cumin2001 dbctl commit (dc=all): 'Set s5 as read-only for maintenance T234300', diff saved to https://phabricator.wikimedia.org/P9336 and previous config saved to /var/cache/conftool/dbconfig/20191015-050016-marostegui.json [05:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:43] !log marostegui@cumin2001 dbctl commit (dc=all): 'Promote db1100 to s5 master and remove read-only from s5 T234300', diff saved to https://phabricator.wikimedia.org/P9337 and previous config saved to /var/cache/conftool/dbconfig/20191015-050042-marostegui.json [05:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:38] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300) (owner: 10Marostegui) [05:10:18] (03PS1) 10Vgutierrez: ATS: Enable reloading global lua script [puppet] - 10https://gerrit.wikimedia.org/r/543022 (https://phabricator.wikimedia.org/T233274) [05:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P9338 and previous config saved to /var/cache/conftool/dbconfig/20191015-051236-marostegui.json [05:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:41] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18879/" [puppet] - 10https://gerrit.wikimedia.org/r/543022 (https://phabricator.wikimedia.org/T233274) (owner: 10Vgutierrez) [05:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3314', diff saved to 
https://phabricator.wikimedia.org/P9339 and previous config saved to /var/cache/conftool/dbconfig/20191015-051400-marostegui.json [05:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2085:3318', diff saved to https://phabricator.wikimedia.org/P9340 and previous config saved to /var/cache/conftool/dbconfig/20191015-051924-marostegui.json [05:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317', diff saved to https://phabricator.wikimedia.org/P9341 and previous config saved to /var/cache/conftool/dbconfig/20191015-052220-marostegui.json [05:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:53] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [05:24:05] 10Operations, 10DBA: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (10Marostegui) [05:24:14] 10Operations, 10DBA: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (10Marostegui) a:03Marostegui [05:24:48] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [05:25:00] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [05:26:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314', diff saved to https://phabricator.wikimedia.org/P9342 and previous config saved to /var/cache/conftool/dbconfig/20191015-052621-marostegui.json [05:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:29] !log Deploy schema change on db1097:3314 T233625 [05:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:37] T233625: Change PK and remove partitions from the logging table - 
https://phabricator.wikimedia.org/T233625 [05:28:49] !log Deploy schema change on db1098:3317 T234066 T233135 [05:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:53] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [05:28:54] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [05:35:12] (03PS1) 10Marostegui: mariadb: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/543023 (https://phabricator.wikimedia.org/T226782) [05:36:05] (03CR) 10Marostegui: [C: 03+2] mariadb: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/543023 (https://phabricator.wikimedia.org/T226782) (owner: 10Marostegui) [05:38:25] !log Depool labsdb1009 for PDU maintenance T226782 [05:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:30] T226782: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 [05:44:53] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (10Marostegui) a:05Marostegui→03None db1124, db2094, labsdb1009, labsdb1010, labsdb1011, labsdb1012 are clean. I have created the database on all... 
[05:45:44] (03PS1) 10Giuseppe Lavagetto: role::parsoid: remove obsolete feature flag [puppet] - 10https://gerrit.wikimedia.org/r/543024 [05:46:15] (03PS3) 10Marostegui: mediawiki: Split cronjob for updatequerypages to multiple modules [puppet] - 10https://gerrit.wikimedia.org/r/542956 (https://phabricator.wikimedia.org/T234948) (owner: 10Ladsgroup) [05:47:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::parsoid: remove obsolete feature flag [puppet] - 10https://gerrit.wikimedia.org/r/543024 (owner: 10Giuseppe Lavagetto) [05:50:21] (03PS4) 10Marostegui: mediawiki: Split cronjob for updatequerypages to multiple modules [puppet] - 10https://gerrit.wikimedia.org/r/542956 (https://phabricator.wikimedia.org/T234948) (owner: 10Ladsgroup) [05:52:27] (03CR) 10Marostegui: [C: 03+2] mediawiki: Split cronjob for updatequerypages to multiple modules [puppet] - 10https://gerrit.wikimedia.org/r/542956 (https://phabricator.wikimedia.org/T234948) (owner: 10Ladsgroup) [05:58:38] (03PS1) 10Vgutierrez: Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) [06:01:33] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) (owner: 10Vgutierrez) [06:01:38] wonderful [06:02:45] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 214.64 ms [06:03:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:04:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:04:13] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 
208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:09] (03CR) 10Elukey: [V: 03+1 C: 03+1] "Built for stretch and buster, tested on two hosts and the segfault doesn't happen." [debs/memkeys] (debian) - 10https://gerrit.wikimedia.org/r/542992 (https://phabricator.wikimedia.org/T223863) (owner: 10Elukey) [06:09:41] (03PS1) 10Marostegui: db1131: Change its binlog format [puppet] - 10https://gerrit.wikimedia.org/r/543026 [06:09:53] (03PS2) 10Marostegui: db1131: Change its binlog format [puppet] - 10https://gerrit.wikimedia.org/r/543026 [06:10:46] (03CR) 10Marostegui: [C: 03+2] db1131: Change its binlog format [puppet] - 10https://gerrit.wikimedia.org/r/543026 (owner: 10Marostegui) [06:12:59] Checking the interfaces down [06:13:23] so cr4-uslfo connects to cr2-eqord via Telia [06:13:37] and cr2-codfw connects to cr2-eqord via Telia [06:14:15] ok maintenance scheduled in the gcal [06:14:18] all good :) [06:18:21] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:18:43] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:46] about --^ - there is also a ripe atlas alert [06:18:47] (03PS1) 10Elukey: role::analytics_cluster::zookeeper: fix prometheus monitors [puppet] - 10https://gerrit.wikimedia.org/r/543027 (https://phabricator.wikimedia.org/T217057) [06:19:31] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::zookeeper: fix prometheus monitors [puppet] - 10https://gerrit.wikimedia.org/r/543027 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [06:21:58] 10Operations, 10netops: mr1-eqsin.oob IPv6 connectivity flapping - https://phabricator.wikimedia.org/T227967 (10elukey) The ripe atlas ipv6 alert is in CRITICAL state as well again. 
[06:24:15] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 215.16 ms [06:25:39] 10Operations, 10netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (10elukey) 05Resolved→03Open This seems re-happening (from my home again): ` [..] 4. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.... [06:31:16] (03PS2) 10Vgutierrez: Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) [06:31:35] (03PS2) 10DCausse: [cirrus] Disable instant indexing on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539117 [06:32:01] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:33:37] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:33:37] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:34:54] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) (owner: 10Vgutierrez) [06:35:01] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:06] 10Operations, 10DBA: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (10Marostegui) p:05Triage→03Normal [06:35:13] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:36:15] (03PS1) 10Marostegui: db1070: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/543028 
(https://phabricator.wikimedia.org/T235464) [06:37:09] 10Operations, 10netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (10elukey) Contacted Hurricane Electric, at this point we could add the HE path to AVOID-PATH? [06:37:20] 10Operations, 10DBA, 10Patch-For-Review: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (10Marostegui) [06:37:33] (03CR) 10Marostegui: [C: 03+2] db1070: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/543028 (https://phabricator.wikimedia.org/T235464) (owner: 10Marostegui) [06:38:01] 10Operations, 10DBA, 10Patch-For-Review: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 (10Marostegui) [06:40:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1070 T235464', diff saved to https://phabricator.wikimedia.org/P9343 and previous config saved to /var/cache/conftool/dbconfig/20191015-064005-marostegui.json [06:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:11] T235464: decommission db1070.eqiad.wmnet - https://phabricator.wikimedia.org/T235464 [06:42:02] (03CR) 10Muehlenhoff: "Please open a Phab task and tag it "SRE-Access-Requests", these changes are processed by a weekly rotatation of SREs and without a Phabric" [puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [06:43:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/memkeys] (debian) - 10https://gerrit.wikimedia.org/r/542992 (https://phabricator.wikimedia.org/T223863) (owner: 10Elukey) [06:43:37] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 214.58 ms [06:44:14] (03PS2) 10Muehlenhoff: Remove unused/unnecessary passwords::postgres include [puppet] - 10https://gerrit.wikimedia.org/r/541224 [06:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3318 T232446', diff saved to https://phabricator.wikimedia.org/P9344 and previous config 
saved to /var/cache/conftool/dbconfig/20191015-064419-marostegui.json [06:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:24] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [06:50:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 22 probes of 466 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [06:54:57] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:17] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:01:01] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add upstream patch to avoid segfaults on Debian Stretch [debs/memkeys] (debian) - 10https://gerrit.wikimedia.org/r/542992 (https://phabricator.wikimedia.org/T223863) (owner: 10Elukey) [07:02:53] (03PS3) 10Vgutierrez: Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) [07:05:02] 10Operations, 10netops: IPv6 packet loss registered by the Ripe Atlas anchor in eqsin - https://phabricator.wikimedia.org/T228015 (10elukey) 05Open→03Resolved HE fixed the issue on their end, all good now. 
[07:05:26] 10Operations, 10netops: mr1-eqsin.oob IPv6 connectivity flapping - https://phabricator.wikimedia.org/T227967 (10elukey) 05Open→03Resolved Related to a HE issue, see T228015 [07:05:31] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) [07:06:39] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) (owner: 10Vgutierrez) [07:10:43] !log failover VRRP from cr1-eqiad to cr2-eqiad in prevision of the PDU work of - T226782 [07:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:47] T226782: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 [07:10:51] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:13:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 for PDU maintenance T226782', diff saved to https://phabricator.wikimedia.org/P9345 and previous config saved to /var/cache/conftool/dbconfig/20191015-071338-marostegui.json [07:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:09] (03PS4) 10Vgutierrez: Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) [07:25:48] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused/unnecessary passwords::postgres include [puppet] - 10https://gerrit.wikimedia.org/r/541224 (owner: 10Muehlenhoff) [07:34:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:35:10] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:11] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2051.codfw.wmnet` - db2051.codfw.wmnet (**PASS**) - Downtimed host on Ic... [07:36:19] (03PS1) 10Marostegui: site.pp: Remove puppet references for db2051 [puppet] - 10https://gerrit.wikimedia.org/r/543032 (https://phabricator.wikimedia.org/T230778) [07:36:40] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for db2051 [dns] - 10https://gerrit.wikimedia.org/r/543033 (https://phabricator.wikimedia.org/T230778) [07:37:15] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove puppet references for db2051 [puppet] - 10https://gerrit.wikimedia.org/r/543032 (https://phabricator.wikimedia.org/T230778) (owner: 10Marostegui) [07:37:38] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for db2051 [dns] - 10https://gerrit.wikimedia.org/r/543033 (https://phabricator.wikimedia.org/T230778) (owner: 10Marostegui) [07:38:19] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) a:05RobH→03Papaul [07:38:34] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) Host ready for onsite steps + switch disablement [07:41:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:51] (03PS1) 10Marostegui: mariadb: Remove dbstore2001 references [puppet] - 10https://gerrit.wikimedia.org/r/543035 
(https://phabricator.wikimedia.org/T220002) [07:43:34] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for dbstore2001 [dns] - 10https://gerrit.wikimedia.org/r/543036 (https://phabricator.wikimedia.org/T220002) [07:44:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove dbstore2001 references [puppet] - 10https://gerrit.wikimedia.org/r/543035 (https://phabricator.wikimedia.org/T220002) (owner: 10Marostegui) [07:44:15] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for dbstore2001 [dns] - 10https://gerrit.wikimedia.org/r/543036 (https://phabricator.wikimedia.org/T220002) (owner: 10Marostegui) [07:44:24] (03CR) 10Ayounsi: [C: 03+2] profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [07:44:41] (03PS7) 10Ayounsi: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [07:46:56] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) a:05RobH→03MoritzMuehlenhoff [07:48:09] !log Password reset for Xaris333 #2 (T235441) [07:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:13] 10Operations, 10DBA: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (10Marostegui) [07:52:32] !log Set email for `Martin Urbanec (test 10)` to test@wikimedia.cz (debug, no ticket) [07:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:26] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10ayounsi) >>! In T186550#5383508, @elukey wrote: > Couple of notes about the anycast-healthchecker: > 2) `python3-docopt` seems to be required by the healthchecker's nagios monitor, and it wa... 
[07:53:33] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10ayounsi) 05Open→03Resolved [07:53:40] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [07:59:27] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) The current state of these servers is inconsistent with the task description; I was wondering why graphite2001/2002 were not in puppetdb despite the checkbo... [08:05:22] 10Operations, 10Patch-For-Review, 10User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (10elukey) ` root@install1002:/srv/wikimedia# reprepro lsbycomponent memkeys memkeys | 20181031-1 | jessie-wikimedia | thirdparty | amd64, source memkeys | 20181031-2+deb... [08:06:18] !log upload new version of memkeys (adding a patch to merged to upstream to avoid segfaults on stretch/buster) to stretch|buster wikimedia apt repos - T223863 [08:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:22] T223863: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 [08:06:45] 10Operations, 10Patch-For-Review, 10User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (10elukey) 05Open→03Resolved [08:06:48] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) [08:07:13] !log Stop MySQL on db1126 and labsdb1009 for PDU maintenance - T226782 [08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:17] T226782: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 [08:07:47] 10Operations, 
10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) Turns out there was a separate task for graphite2002 (https://phabricator.wikimedia.org/T200210), so I'm removing 2002 from the task summary. [08:08:19] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite2001 to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) [08:09:06] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) db1126 and labsdb1009 are ok to proceed. Note: db1069 has its power OFF as it is pending on-site decommissioning steps. **DO NOT power it back on** [08:09:28] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Marostegui) [08:12:46] ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU maintenance https://wikitech.wikimedia.org/wiki/HAProxy [08:19:09] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite2001 to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) For graphite2001 there's also a separate decom task in https://phabricator.wikimedia.org/T200209, there are remaining Puppet references, though. [08:21:27] (03PS1) 10Muehlenhoff: Remove remaining Puppet references to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/543058 (https://phabricator.wikimedia.org/T199321) [08:22:14] (03CR) 10Jcrespo: "No comment on the patch, just a warning that this will not take effect until puppet is enabled on the old or the new backup director." 
[puppet] - 10https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425) (owner: 10Dzahn) [08:22:31] 10Operations, 10decommission, 10Patch-For-Review, 10User-fgiunchedi: Return graphite2001 to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) [08:22:55] (03PS2) 10Muehlenhoff: Remove remaining Puppet references to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/543058 (https://phabricator.wikimedia.org/T199321) [08:28:12] (03CR) 10Muehlenhoff: [C: 03+2] Remove remaining Puppet references to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/543058 (https://phabricator.wikimedia.org/T199321) (owner: 10Muehlenhoff) [08:29:29] 10Operations, 10decommission, 10Patch-For-Review, 10User-fgiunchedi: Return graphite2001 to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) 05Open→03Resolved Merged a patch to remove the remaining Puppet references. [08:37:32] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) >>! In T213089#5023501, @MoritzMuehlenhoff wrote: >>>! In T213089#5023411, @elukey wrote: >> Leaving here also a referen... [08:40:57] 10Operations, 10Arc-Lamp, 10Performance-Team, 10serviceops, 10Patch-For-Review: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Krinkle) Immediate issue is resolved. But this is expected to alert again in 5-10 days unless a few days' worth are delete... 
[08:41:19] 10Operations, 10Arc-Lamp, 10Performance-Team, 10serviceops, 10Patch-For-Review: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Krinkle) a:03Krinkle [08:41:39] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) I have been discussing how to proceed with @jijiki recently, and T208934 will be prioritized otherwise each reimage of a... [08:52:57] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [08:56:16] (03CR) 10Gilles: "Are you sure that you want to apply this to all types? We changed it only for SVG in Varnish" [puppet] - 10https://gerrit.wikimedia.org/r/542996 (https://phabricator.wikimedia.org/T232615) (owner: 10Ema) [09:03:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10jrobell) Thank you @Nuria Duly noted regarding creating separate tickets for individual users as well as the timeline o... [09:06:51] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, 10CPT Initiatives (Core REST API in PHP): Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10mobrovac) p:05Triage→03High [09:15:00] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) >>! In T213089#5574371, @elukey wrote: > I have been discussing how to proceed with @jijiki recently, and T208934 will b... 
[09:15:33] (03PS3) 10Jbond: profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) [09:18:08] (03CR) 10jerkins-bot: [V: 04-1] profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [09:21:30] (03CR) 10Effie Mouzeli: scap: mediawiki logstash_checker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [09:26:55] (03CR) 10Krinkle: [C: 03+1] scap: mediawiki logstash_checker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [09:28:47] (03CR) 10Effie Mouzeli: [C: 03+2] scap: mediawiki logstash_checker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [09:28:56] (03PS6) 10Effie Mouzeli: scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [09:37:35] 10Operations, 10vm-requests: codfw: 1 VM for idp - https://phabricator.wikimedia.org/T235479 (10MoritzMuehlenhoff) [09:37:49] 10Operations, 10vm-requests: codfw: 1 VM for idp - https://phabricator.wikimedia.org/T235479 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [09:46:25] 10Operations, 10vm-requests: codfw: 1 VM for idp - https://phabricator.wikimedia.org/T235479 (10akosiaris) Same as eqiad, LGTM [09:48:27] (03PS2) 10Jbond: adduser: create module to manage /etc/adduser.conf [puppet] - 10https://gerrit.wikimedia.org/r/542983 (https://phabricator.wikimedia.org/T235162) [09:50:42] (03PS3) 10Krinkle: webperf: add backups for arclamp application data [puppet] - 10https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425) (owner: 10Dzahn) [09:51:34] (03PS4) 10Jbond: profile::base: add 
adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) [09:53:47] (03CR) 10jerkins-bot: [V: 04-1] profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [10:02:28] (03PS5) 10Jbond: profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) [10:07:27] !log installing openssl updates for buster (some ciphers we don't use were not enabled due to an upstream change related to the selection of ASM-optimised implementations over generic C) [10:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:20] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) Everything with a correct full backup in the last month, ordered by size: ` root@db1135[bacula]> SELECT Job.JobId AS job_id, REPLACE(Client.Name, '-fd', '') AS clie... [10:23:06] (03CR) 10Phamhi: [C: 03+2] tools-webservice: Disable access.log feature by default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) (owner: 10Phamhi) [10:23:41] (03Merged) 10jenkins-bot: tools-webservice: Disable access.log feature by default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/541609 (https://phabricator.wikimedia.org/T233347) (owner: 10Phamhi) [10:24:45] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) Global retention queries: ` root@db1135[bacula]> select LEFT(VolumeName, 4) as Catalog, min(FirstWritten) FirstUnpurgedVolume FROM Media WHERE VolStatus IN ('Full'... 
[10:25:10] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10akosiaris) There's a number of ready plugins for icinga on https://exchange.nagios.org/directory/Plugins/Backup-and-Recovery/Bacula We could evaluate them, no need to reinve... [10:30:37] (03PS1) 10Jbond: nrpe::check_puppetrun: add a require for 'time' [puppet] - 10https://gerrit.wikimedia.org/r/543109 [10:31:10] 10Operations, 10serviceops: Jobrunners: allow to check that they are in sync with the etcd data - https://phabricator.wikimedia.org/T235488 (10Volans) [10:33:32] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) I will look at them, this was mostly an excuse to get familiar with the current status. Independently of the prepackaged ones, we should come up with a list of thing... [10:34:55] 10Operations, 10serviceops: Jobrunners: allow to check that they are in sync with the etcd data - https://phabricator.wikimedia.org/T235488 (10Joe) I think the best way is probably writing a small endpoint in operations/mediawiki-config that just exposes that. [10:37:22] 10Operations, 10serviceops: Jobrunners: allow to check that they are in sync with the etcd data - https://phabricator.wikimedia.org/T235488 (10Krinkle) [10:40:11] 10Operations, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 12 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Elitre) [10:41:32] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) It also helps having a look at the global status- for example, bugzilla and rt, being in read only mode, don't make sense having monthly backups, but a proper long t... 
[10:43:59] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10jijiki) [10:47:00] (03PS1) 10Muehlenhoff: Extend restart filter with (sd-pam) [puppet] - 10https://gerrit.wikimedia.org/r/543112 [10:53:27] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:54:19] !log mark ruby-safe-yaml as manually installed using apt-mark on jessie/stretch, prevents accidental removal of ruby-safe-yaml after puppet 4->5 migration [10:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:10] (03CR) 10Jbond: [C: 03+1] Extend restart filter with (sd-pam) [puppet] - 10https://gerrit.wikimedia.org/r/543112 (owner: 10Muehlenhoff) [10:55:46] (03PS2) 10Muehlenhoff: Extend restart filter with (sd-pam) [puppet] - 10https://gerrit.wikimedia.org/r/543112 [10:59:06] (03PS1) 10Volans: Initial support for Netbox integration [software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) [10:59:08] (03PS1) 10Volans: Templates: add missing documentation [software/homer] - 10https://gerrit.wikimedia.org/r/543115 [10:59:10] (03PS1) 10Volans: Homer: add missing documentation in a method [software/homer] - 10https://gerrit.wikimedia.org/r/543116 [10:59:12] (03CR) 10Muehlenhoff: [C: 03+2] Extend restart filter with (sd-pam) [puppet] - 10https://gerrit.wikimedia.org/r/543112 (owner: 10Muehlenhoff) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T1100). 
[11:00:05] duesen: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:31] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Jclark-ctr) Starting pdu upgrade [11:00:33] I’m not available, sorry [11:00:37] * Urbanecm claims the SWAT for an emergency patch [11:01:20] starting pdu upgrade eqiad rack a1 https://phabricator.wikimedia.org/T226782 [11:03:57] (03CR) 10jerkins-bot: [V: 04-1] Initial support for Netbox integration [software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [11:04:01] (03CR) 10jerkins-bot: [V: 04-1] Homer: add missing documentation in a method [software/homer] - 10https://gerrit.wikimedia.org/r/543116 (owner: 10Volans) [11:04:16] (03PS1) 10Urbanecm: New throttle rule for Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543117 (https://phabricator.wikimedia.org/T235493) [11:04:39] (03Abandoned) 10Jbond: puppetmaster/pybal_config: move ca and pybal_config to puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/541838 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [11:05:00] (03CR) 10jerkins-bot: [V: 04-1] Templates: add missing documentation [software/homer] - 10https://gerrit.wikimedia.org/r/543115 (owner: 10Volans) [11:05:02] (03PS2) 10Jbond: pybal_backend: enable codfw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/542926 (https://phabricator.wikimedia.org/T234315) [11:05:54] (03CR) 10Jbond: [C: 03+2] pybal_backend: enable codfw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/542926 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [11:05:57] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT:855aca4eb: Throttle rule for Czech course (T235493) (duration: 00m 51s) 
[11:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:01] T235493: Throttle rule for Czech course - https://phabricator.wikimedia.org/T235493 [11:06:15] (03CR) 10Urbanecm: [C: 03+2] New throttle rule for Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543117 (https://phabricator.wikimedia.org/T235493) (owner: 10Urbanecm) [11:07:04] (03Merged) 10jenkins-bot: New throttle rule for Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543117 (https://phabricator.wikimedia.org/T235493) (owner: 10Urbanecm) [11:07:30] (03PS2) 10Jbond: pupetmasters: remove local server config [puppet] - 10https://gerrit.wikimedia.org/r/542929 (https://phabricator.wikimedia.org/T234315) [11:08:04] !log mwscript resetAuthenticationThrottle.php --wiki=cswiki --signup --ip 195.113.145.2 (T235493) [11:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] (03CR) 10Jbond: [C: 03+2] pupetmasters: remove local server config [puppet] - 10https://gerrit.wikimedia.org/r/542929 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [11:09:29] (03PS7) 10Urbanecm: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [11:09:33] (03CR) 10Urbanecm: [C: 03+2] Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [11:10:29] (03Merged) 10jenkins-bot: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [11:10:31] (03PS1) 10Jbond: puppetmaster_ca_server: migrate back to puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/543118 [11:11:35] (03CR) 10Jbond: [C: 03+2] puppetmaster_ca_server: migrate back to puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/543118 (owner: 10Jbond) [11:11:52] (03CR) 10Urbanecm: "No 
objection, merged, and doesn't seem like controversial => merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [11:12:08] !log move puppetmaster_ca_server back to puppetmaster1001 [11:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ac37540: Add `autopatrol` to translation administrators on mediawiki (duration: 00m 51s) [11:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:32] !log EU SWAT done [11:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:18] (03PS1) 10Jbond: puppetmaster::local_servers: ensure we have a default for locale servers [puppet] - 10https://gerrit.wikimedia.org/r/543119 [11:15:49] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) Settings that we might want to use with memcached 1.5.6 (Buster's version): * `-o modern`: alias for the following sett... [11:16:05] (03CR) 10Jbond: [C: 03+2] puppetmaster::local_servers: ensure we have a default for locale servers [puppet] - 10https://gerrit.wikimedia.org/r/543119 (owner: 10Jbond) [11:16:22] Urbanecm: hey. did you just ping me about the swat deploy? [11:16:35] i got a truncated desktop notification and can't find your message anywhere :p [11:16:46] duesen: no, jouncebot did :) [11:16:55] ah, right [11:17:05] I see it's deployed now, so I silently ignored it [11:17:07] afaik, the deploy went through last night [11:17:12] okay, cool [11:17:21] right. me too ;) i was just wondering if i missed something [11:17:38] there are still bad cache entries to clean up. but no clear idea how to do it [11:18:07] if more affected pages turn up, not may become necessary.
Nikerabbit has the details [11:18:18] I'll be on vacation starting tomorrow [11:19:29] (03Abandoned) 10Jbond: pybal_config: remove puppetmaster1001 from pybal_config backend [puppet] - 10https://gerrit.wikimedia.org/r/540393 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [11:22:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/543109 (owner: 10Jbond) [11:22:49] (03CR) 10Jbond: [C: 03+2] nrpe::check_puppetrun: add a require for 'time' [puppet] - 10https://gerrit.wikimedia.org/r/543109 (owner: 10Jbond) [11:23:02] (03PS2) 10Jbond: nrpe::check_puppetrun: add a require for 'time' [puppet] - 10https://gerrit.wikimedia.org/r/543109 [11:24:37] hashar: you around by any chance? I'm wondering why CI is failing on some CRs when is passing locally. See https://integration.wikimedia.org/ci/job/tox-docker/8611/consoleFull [11:25:15] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:35:32] (03PS1) 10Giuseppe Lavagetto: conftool-data: add echostore to the k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543121 [11:35:34] (03PS1) 10Giuseppe Lavagetto: echostore: add stanzas for deployment on k8s [puppet] - 10https://gerrit.wikimedia.org/r/543122 [11:35:36] (03PS1) 10Giuseppe Lavagetto: echostore: add LVS configuration stanzas [puppet] - 10https://gerrit.wikimedia.org/r/543123 (https://phabricator.wikimedia.org/T234464) [11:36:53] PROBLEM - Juniper alarms on cr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:38:29] RECOVERY - Juniper alarms on cr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:43:36] (03PS1) 10Phamhi: Update all images based on buster (T230961) 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 [11:45:27] (03PS10) 10Arturo Borrero Gonzalez: toolforge: introduce new proxy role [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [11:46:07] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10akosiaris) > * Certain configured backup is not active (As far I can see, configurations are not cleaned up on decommission, something to look at) They are. It's taken care... [11:46:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool-data: add echostore to the k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543121 (owner: 10Giuseppe Lavagetto) [11:47:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] echostore: add stanzas for deployment on k8s [puppet] - 10https://gerrit.wikimedia.org/r/543122 (owner: 10Giuseppe Lavagetto) [11:47:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: introduce new proxy role [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [11:48:26] (03CR) 10Alexandros Kosiaris: "LGTM, but don't merge before the service is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/543123 (https://phabricator.wikimedia.org/T234464) (owner: 10Giuseppe Lavagetto) [11:52:31] (03PS1) 10Jbond: nrpe::check_puppetrun: update check to responde correctly with alert_master_fail [puppet] - 10https://gerrit.wikimedia.org/r/543127 [11:53:31] (03PS2) 10Elukey: Move the Analytics Hadoop cluster to the new Analytics ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/542866 (https://phabricator.wikimedia.org/T217057) [11:58:31] (03PS1) 10Muehlenhoff: Add DNS record for idp2001 [dns] - 10https://gerrit.wikimedia.org/r/543128 (https://phabricator.wikimedia.org/T235479) [11:58:57] (03CR) 10jerkins-bot: [V: 04-1] Add DNS record for idp2001 [dns] - 10https://gerrit.wikimedia.org/r/543128 
(https://phabricator.wikimedia.org/T235479) (owner: 10Muehlenhoff) [11:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P9346 and previous config saved to /var/cache/conftool/dbconfig/20191015-115922-marostegui.json [11:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:31] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [12:00:57] (03PS1) 10Effie Mouzeli: hhvm: stop monitoring hhvm [puppet] - 10https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) [12:01:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086 for schema change', diff saved to https://phabricator.wikimedia.org/P9347 and previous config saved to /var/cache/conftool/dbconfig/20191015-120133-marostegui.json [12:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:50] (03CR) 10jerkins-bot: [V: 04-1] hhvm: stop monitoring hhvm [puppet] - 10https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:02:15] (03PS2) 10Muehlenhoff: Add DNS record for idp2001 [dns] - 10https://gerrit.wikimedia.org/r/543128 (https://phabricator.wikimedia.org/T235479) [12:03:58] jouncebot: now [12:03:58] No deployments scheduled for the next 3 hour(s) and 56 minute(s) [12:04:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3314', diff saved to https://phabricator.wikimedia.org/P9348 and previous config saved to /var/cache/conftool/dbconfig/20191015-120359-marostegui.json [12:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:12] !log Restarting CI Jenkins [12:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:27] !log CI Jenkins restarted [12:05:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:40] (03PS2) 10Effie Mouzeli: hhvm: stop monitoring hhvm [puppet] - 10https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) [12:05:42] web UI takes a bit more time though [12:05:47] but the background magic does work [12:06:24] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:35] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:11:01] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [12:11:51] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Jclark-ctr) [12:12:48] (03CR) 10Effie Mouzeli: "As expected" [puppet] - 10https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:13:02] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10mobrovac) [12:13:08] !log add copy of python-pykube and python3-pykube from stretch-wikimedia to buster-wikimedia (T230961) [12:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:12] T230961: Install a version of Python newer than 3.5.3 in Toolforge - https://phabricator.wikimedia.org/T230961 [12:15:44] (03PS1) 10Marostegui: Revert "mariadb: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/543130 [12:16:02] 
(03PS2) 10Marostegui: Revert "mariadb: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/543130 [12:16:58] !log installing sudo security updates on buster/stretch [12:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:05] !log Hadoop maintenance start - migration to the new Zookepeer cluster [12:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:11] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/543130 (owner: 10Marostegui) [12:17:50] !log Repool labsdb1009 after PDU maintenance [12:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9350 and previous config saved to /var/cache/conftool/dbconfig/20191015-121840-marostegui.json [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:02] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Jclark-ctr) a:05Jclark-ctr→03RobH the PDU swap is over. Nothing lost while swapping PDU. Everything is cabled and they're linked together. Netbox is updated. still needs... [12:20:21] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) Just to be clear, the above was not a concrete proposal, more like a brainstorming of everything I could think from the top of my mind. There is probably more things... 
[12:22:44] (03CR) 10Elukey: [C: 03+2] Move the Analytics Hadoop cluster to the new Analytics ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/542866 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [12:22:52] (03PS3) 10Elukey: Move the Analytics Hadoop cluster to the new Analytics ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/542866 (https://phabricator.wikimedia.org/T217057) [12:22:59] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10jijiki) And again :) [12:23:41] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 7.001 ge 4 Effie Mouzeli host is OoW T205712 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops [12:24:44] !log restbase add parsoidphp tables in prod - T230792 [12:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:47] T230792: Create Parsoid/PHP tables in Cassandra - https://phabricator.wikimedia.org/T230792 [12:29:46] (03PS1) 10Effie Mouzeli: hhvm: stop hhvm service from all hosts [puppet] - 10https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) [12:33:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1126 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9351 and previous config saved to /var/cache/conftool/dbconfig/20191015-123356-marostegui.json [12:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:52] (03CR) 10Ayounsi: [C: 03+1] "LGTM + passes tox locally." [software/homer] - 10https://gerrit.wikimedia.org/r/543116 (owner: 10Volans) [12:35:06] (03CR) 10Ayounsi: [C: 03+1] "LGTM + passes tox locally." 
[software/homer] - 10https://gerrit.wikimedia.org/r/543115 (owner: 10Volans) [12:38:49] James_F: hello [12:44:07] (03CR) 10Muehlenhoff: hhvm: stop hhvm service from all hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:45:19] PROBLEM - DPKG on schema1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:46:02] !log Hadoop maintenance over [12:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:55] RECOVERY - DPKG on schema1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:47:16] ^ fallout from sudo rollout [12:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1126 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9352 and previous config saved to /var/cache/conftool/dbconfig/20191015-124942-marostegui.json [12:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:18] (03PS3) 10Elukey: profile::hadoop::master: fix check_hdfs_topology script [puppet] - 10https://gerrit.wikimedia.org/r/540150 [12:53:19] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master: fix check_hdfs_topology script [puppet] - 10https://gerrit.wikimedia.org/r/540150 (owner: 10Elukey) [12:54:00] (03PS2) 10Effie Mouzeli: hhvm: stop hhvm service from all hosts [puppet] - 10https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) [12:54:46] (03CR) 10Effie Mouzeli: hhvm: stop hhvm service from all hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:55:44] (03CR) 10DannyS712: "> No objection, merged, and doesn't seem like controversial =>" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [12:56:08] (03CR) 10Muehlenhoff: hhvm: stop hhvm service from all
hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:58:33] (03PS1) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [12:58:48] (03PS3) 10Effie Mouzeli: hhvm: stop hhvm service from all hosts [puppet] - 10https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) [12:59:35] (03PS1) 10Elukey: Enable notifications for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/543136 [12:59:50] (03CR) 10Elukey: [C: 03+2] Enable notifications for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/543136 (owner: 10Elukey) [13:01:30] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Don't merge this until we are ready to go live!" [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [13:03:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1126 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9353 and previous config saved to /var/cache/conftool/dbconfig/20191015-130356-marostegui.json [13:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:30] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: adjust ports in the ingress setup [puppet] - 10https://gerrit.wikimedia.org/r/543137 [13:05:34] (03PS1) 10Reedy: Remove comment about phab bans being superseded by now non existent WP0 bans [puppet] - 10https://gerrit.wikimedia.org/r/543138 [13:06:47] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: adjust ports in the ingress setup [puppet] - 10https://gerrit.wikimedia.org/r/543137 [13:07:33] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: adjust ports in the ingress setup [puppet] - 10https://gerrit.wikimedia.org/r/543137 (https://phabricator.wikimedia.org/T234037) [13:15:16] (03CR) 10Hashar: 
"recheck now using docker-registry.wikimedia.org/releng/tox-homer:0.1.0 which has libffi-dev" [software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [13:17:52] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10Andrew) @robh should I email Doneva myself and cc you for approval or do you want to reach out y... [13:17:58] (03PS2) 10Elukey: ferm: remove hadoop_masters from puppet config [puppet] - 10https://gerrit.wikimedia.org/r/542867 (https://phabricator.wikimedia.org/T217057) [13:18:43] (03PS1) 10Muehlenhoff: Offline puppetmaster2002 for hardware maintenance [puppet] - 10https://gerrit.wikimedia.org/r/543142 (https://phabricator.wikimedia.org/T235250) [13:20:24] 10Operations, 10DC-Ops, 10decommission: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10MoritzMuehlenhoff) [13:22:11] (03CR) 10Elukey: [C: 03+2] ferm: remove hadoop_masters from puppet config [puppet] - 10https://gerrit.wikimedia.org/r/542867 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [13:24:56] (03PS2) 10Giuseppe Lavagetto: conftool-data: add echostore to the k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543121 [13:27:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool-data: add echostore to the k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543121 (owner: 10Giuseppe Lavagetto) [13:28:47] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (10JHedden) a:03JHedden [13:31:09] (03PS2) 10ArielGlenn: No longer use --no-cache when dumping Wikibase entities [puppet] - 10https://gerrit.wikimedia.org/r/542278 (owner: 10Hoo man) [13:31:11] (03CR) 10Lucas Werkmeister (WMDE): "> Thanks; does this need to be synced on production via 
SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [13:31:53] (03PS2) 10Giuseppe Lavagetto: echostore: add stanzas for deployment on k8s [puppet] - 10https://gerrit.wikimedia.org/r/543122 [13:32:10] (03CR) 10ArielGlenn: [C: 03+2] No longer use --no-cache when dumping Wikibase entities [puppet] - 10https://gerrit.wikimedia.org/r/542278 (owner: 10Hoo man) [13:33:03] <_joe_> GGGGR damn rebase race [13:33:14] (03PS3) 10Giuseppe Lavagetto: echostore: add stanzas for deployment on k8s [puppet] - 10https://gerrit.wikimedia.org/r/543122 [13:34:18] 10Operations, 10Phabricator, 10Traffic: Access Forbidden to Phabricator at WikiArabia 2019 (Morocco) - https://phabricator.wikimedia.org/T234598 (10Reedy) @SandraF_WMF Do you happen to remember if you got that IP address before connecting to your VPN? As it seems to belong to an Indian hosting company - http... [13:34:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] echostore: add stanzas for deployment on k8s [puppet] - 10https://gerrit.wikimedia.org/r/543122 (owner: 10Giuseppe Lavagetto) [13:34:56] hauskater: Hey. [13:34:57] 10Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10CDanis) >>! In T215183#5496348, @jcrespo wrote: > Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change because... [13:35:29] (03PS1) 10Lucas Werkmeister (WMDE): Configure Citoid+Wikibase integration on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543145 (https://phabricator.wikimedia.org/T228412) [13:35:43] James_F: Hello. I've uploaded a secfix for jquery.i18n @ wikimedia/github and added you as reviewer. When you got a spare minute, could you check? Thanks! 
[13:36:24] 10Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10CDanis) An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and waiting for reimages to happen 'naturall... [13:37:23] * James_F hunts. [13:38:47] 10Operations, 10Phabricator, 10Traffic: Access Forbidden to Phabricator at WikiArabia 2019 (Morocco) - https://phabricator.wikimedia.org/T234598 (10Reedy) p:05High→03Normal Event was two weeks ago.... [13:40:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:40:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:40:51] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:41:01] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:41:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:41:27] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:41:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:41:47] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:42:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:42:31] there was a similar peak for rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw%20prometheus%2Fops&refresh=5m&orgId=1 [13:42:37] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:42:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:43:03] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:43:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:43:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:43:47] (03PS1) 10Jbond: puppetmaster: offline rhodium [puppet] - 10https://gerrit.wikimedia.org/r/543148 (https://phabricator.wikimedia.org/T234315) [13:43:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:44:00] api_appserver latency spike [13:44:05] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:44:09] hello cdanis :) [13:44:13] I suspect the same thing _joe_ has been digging into these past weeks [13:44:15] hi elukey :D [13:44:16] <_joe_> yes [13:44:39] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:44:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:44:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:44:57] <_joe_> parsoid with euwiki as eff.ie determined I guess, let me check [13:45:10] (03CR) 10Jbond: [C: 03+2] puppetmaster: offline rhodium [puppet] - 10https://gerrit.wikimedia.org/r/543148 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [13:45:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:45:41] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text 
site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:45:49] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2138 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [13:45:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:45:53] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:46:28] oh good, we're logging so much we're dropping logs [13:46:40] <_joe_> didn't use to be the case [13:46:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:46:53] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:46:55] <_joe_> anyways the problems have gone away already [13:47:02] ack :) [13:47:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:47:05] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:47:13] (bbiab) [13:47:16] <_joe_> I could not find anything happening in current logs [13:47:29] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:48:01] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [13:49:39] okay, this isn't new, so I don't think related, but: https://logstash.wikimedia.org/goto/2e43a67f77a2b3cffbed01688316362c why is there so much of this message? 
[13:49:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/543142 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [13:50:42] <_joe_> cdanis: I will answer when kibana loads for me [13:50:51] yes it's being quite slow for me as well _joe_ [13:50:54] <_joe_> cdanis: sigh [13:51:15] <_joe_> that's the kind of thing that should go to flat files only btw [13:51:30] <_joe_> it gives us no advantage to be able to process it with querying IMHO [13:51:47] (03PS3) 10Jforrester: build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 [13:52:20] (03PS2) 10Muehlenhoff: Offline puppetmaster2002 for hardware maintenance [puppet] - 10https://gerrit.wikimedia.org/r/543142 (https://phabricator.wikimedia.org/T235250) [13:53:10] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (10JHedden) 05Open→03Resolved `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --database banwiki --debug... 
[13:53:25] !log installing 4.9.189 Linux update from last stretch point releases (no reboots, deploying the package only at this point) [13:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:29] (03PS3) 10Andrew Bogott: cloud puppetmasters: allow nova controllers to use the certmanager account [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) [13:55:45] (03CR) 10Muehlenhoff: [C: 03+2] Offline puppetmaster2002 for hardware maintenance [puppet] - 10https://gerrit.wikimedia.org/r/543142 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [13:56:10] (03PS4) 10Andrew Bogott: cloud puppetmasters: allow nova controllers to use the certmanager account [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) [13:58:17] (03CR) 10Andrew Bogott: [C: 03+2] cloud puppetmasters: allow nova controllers to use the certmanager account [puppet] - 10https://gerrit.wikimedia.org/r/542201 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [14:03:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:05:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:06:37] PROBLEM - Check systemd state on ms-be1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:54] (03PS1) 10Jbond: ulogd: filter out etcd broadcast messages [puppet] - 10https://gerrit.wikimedia.org/r/543149 [14:07:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:07:44] (03PS7) 10Andrew Bogott: nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) [14:08:35] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1050 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:09:51] ^ looking [14:10:59] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: add puppet cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/542202 (https://phabricator.wikimedia.org/T235129) (owner: 10Andrew Bogott) [14:11:13] RECOVERY - Check systemd state on ms-be1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:45] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1050 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:23:32] (03PS2) 10Herron: logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) [14:30:15] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10MoritzMuehlenhoff) @Papaul puppetmaster2002 is offline, you can power it down for maintenance at your convenience [14:34:54] !log executed 'rmr' in 
zookeeper on conf1004 for znodes /yarn-leader-election /hadoop-ha /hive_zookeeper_namespace [14:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:43] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [14:38:23] !log installing usbutils update from stretch point release [14:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:47] (03CR) 10SBassett: [C: 03+1] Phabricator: Uninstall Conpherence application also in default settings [puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [14:40:56] !log power down puppetmaster2002 for HW maintenance [14:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:00] 10Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10jcrespo) > Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs. @CDanis I was able to install it in the end, it was a conflict with other drive (hw RAID, in ad... [14:42:58] !log start a root tmux containing a bash script on conf1004 to clean up znodes under /yarn-rmstore/analytics-hadoop/ZKRMStateRoot/RMAppRoot slowly - T217057 [14:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:04] T217057: Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] - https://phabricator.wikimedia.org/T217057 [14:43:27] this --^ is to avoid removing ~30k znodes at the same time, since I am not sure what the effects on zk are [14:43:40] (03CR) 10Ayounsi: [C: 03+1] "LGTM!"
[software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [14:43:47] I have already done this cleanup in the past, it shouldn't be harmful, but ping me just in case [14:44:03] PROBLEM - Host puppetmaster2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:15] RECOVERY - Host puppetmaster2002 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [14:50:23] (03CR) 10Volans: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [14:52:56] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops [14:53:06] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:53:44] PROBLEM - Host puppetmaster2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:01] RECOVERY - Host puppetmaster2002 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [14:54:14] !log installing cups security updates for stretch (client-side libs/tools only) [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:05] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, 10CPT Initiatives (Core REST API in PHP): Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10mobrovac) [14:57:05] (03PS3) 10Ottomata: Set up presto single node on analytics1030 in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/540968 [14:57:11] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Set up presto single node on analytics1030 in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/540968 (owner: 10Ottomata)
[14:57:12] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:57:15] (03PS2) 10Giuseppe Lavagetto: echostore: add LVS service IPs [dns] - 10https://gerrit.wikimedia.org/r/541275 (https://phabricator.wikimedia.org/T234464) [14:57:23] (03CR) 10Volans: [C: 03+2] Initial support for Netbox integration [software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [14:57:30] (03CR) 10Volans: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/543115 (owner: 10Volans) [14:57:36] (03CR) 10Volans: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/543116 (owner: 10Volans) [14:57:48] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 0 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [15:00:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] echostore: add LVS service IPs [dns] - 10https://gerrit.wikimedia.org/r/541275 (https://phabricator.wikimedia.org/T234464) (owner: 10Giuseppe Lavagetto) [15:01:19] (03Merged) 10jenkins-bot: Initial support for Netbox integration [software/homer] - 10https://gerrit.wikimedia.org/r/543114 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:01:31] !log installing fribidi bugfix updates from stretch point release [15:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:38] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [15:01:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Analytics Access for Grant - https://phabricator.wikimedia.org/T235260 (10Nuria) ping #sre-access-requests after verification of employment (grant is our new CTO) his user should be added to the wmf ldap group [15:02:58] 
(03CR) 10Volans: [C: 03+2] Templates: add missing documentation [software/homer] - 10https://gerrit.wikimedia.org/r/543115 (owner: 10Volans) [15:03:04] (03CR) 10Volans: [C: 03+2] Homer: add missing documentation in a method [software/homer] - 10https://gerrit.wikimedia.org/r/543116 (owner: 10Volans) [15:03:06] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Analytics Access for Grant - https://phabricator.wikimedia.org/T235260 (10Reedy) Doesn't this technically need Katherine to sign it off? [15:05:21] (03Merged) 10jenkins-bot: Templates: add missing documentation [software/homer] - 10https://gerrit.wikimedia.org/r/543115 (owner: 10Volans) [15:06:26] (03Merged) 10jenkins-bot: Homer: add missing documentation in a method [software/homer] - 10https://gerrit.wikimedia.org/r/543116 (owner: 10Volans) [15:07:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10Nuria) Sounds great, I understand this might have been frustrated cause it took a long time. Note that the first request... [15:07:38] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:12:39] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10RobH) >>! In T235303#5575089, @Andrew wrote: > @robh should I email Doneva myself and cc you for... 
[15:13:58] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:15:00] 10Operations, 10SRE-Access-Requests: Update prod SSH for nathante - https://phabricator.wikimedia.org/T235526 (10Groceryheist) [15:15:10] (03PS1) 10Andrew Bogott: nova-fullstack: fix up behavior of cert cleanup checking [puppet] - 10https://gerrit.wikimedia.org/r/543159 [15:15:12] (03PS1) 10Ayounsi: Add config.yaml env/ and output/ to gitignore [software/homer] - 10https://gerrit.wikimedia.org/r/543160 [15:15:51] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: fix up behavior of cert cleanup checking [puppet] - 10https://gerrit.wikimedia.org/r/543159 (owner: 10Andrew Bogott) [15:17:22] (03PS2) 10Andrew Bogott: nova-fullstack: fix up behavior of cert cleanup checking [puppet] - 10https://gerrit.wikimedia.org/r/543159 [15:18:05] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 3 others: Echostore service endpoints - https://phabricator.wikimedia.org/T234464 (10Joe) I've done all the puppet/dns prep work. You can now proceed to prepare this new kask deployment in `operations/deployment-charts`. Data you will need:... 
[15:18:29] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: fix up behavior of cert cleanup checking [puppet] - 10https://gerrit.wikimedia.org/r/543159 (owner: 10Andrew Bogott) [15:21:02] (03PS1) 10Ottomata: Fix presto release, it was supposed to be 0.226 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/543161 [15:22:13] (03PS2) 10Ottomata: Fix presto release, it was supposed to be 0.226 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/543161 [15:23:06] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/543160 (owner: 10Ayounsi) [15:23:51] PROBLEM - Host puppetmaster2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:52] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 3 others: Echostore service endpoints - https://phabricator.wikimedia.org/T234464 (10Eevans) >>! In T234464#5575683, @Joe wrote: > I've done all the puppet/dns prep work. You can now proceed to prepare this new kask deployment in `operations/... [15:24:03] 10Operations, 10SRE-Access-Requests: Update prod SSH key for nathante - https://phabricator.wikimedia.org/T235526 (10Reedy) [15:26:49] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [15:29:51] RECOVERY - Host puppetmaster2002 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [15:32:31] (03PS1) 10Muehlenhoff: Add library hint for fribidi [puppet] - 10https://gerrit.wikimedia.org/r/543163 [15:32:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix presto release, it was supposed to be 0.226 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/543161 (owner: 10Ottomata) [15:32:53] (03CR) 10Groceryheist: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [15:33:20] (03PS4) 10Arturo Borrero Gonzalez: toolforge: k8s: adjust ports in the ingress setup [puppet] - 10https://gerrit.wikimedia.org/r/543137 (https://phabricator.wikimedia.org/T234037) [15:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P9354 and previous config saved to /var/cache/conftool/dbconfig/20191015-154325-marostegui.json [15:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:58] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10Papaul) a:05Papaul→03MoritzMuehlenhoff @MoritzMuehlenhoff upgrade complete [15:52:17] !log power down lvs2009 for HW maintenance [15:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P9355 and previous config saved to /var/cache/conftool/dbconfig/20191015-155454-marostegui.json [15:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:31] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:01] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Papaul) [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:21] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [16:00:38] 10Operations, 10serviceops, 10Kubernetes: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 (10Joe) [16:00:46] <_joe_> !log uploading envoy 1.11.2 to stretch-wikimedia, buster-wikimedia T230779 T235412 [16:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:50] T235412: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 [16:06:16] (03PS2) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [16:07:57] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) Hi, Reuven here! LDAP user created, and I just set up 2FA on this Phabricator account. I'll see what else I can do from this checklist on my own. [16:07:59] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/16; [edit interfaces interface-range disabled] me... 
[16:09:01] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (10Papaul) [16:09:08] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) [16:23:05] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:18] !log power down lvs2010 for HW maintenance [16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:12] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10CDanis) [16:29:31] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [16:30:16] (03CR) 10Andrew Bogott: [C: 03+1] keystone: change monitoring some details to email rather than paging [puppet] - 10https://gerrit.wikimedia.org/r/542610 (owner: 10Bstorm) [16:30:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/542610 (owner: 10Bstorm) [16:35:45] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10nnikkhoui) Hello! I have a wikitech account, username is my full name: Nikki Nikkhoui. Thanks ! Nikki [16:38:38] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] - member ge-8/0/17; [edit interfaces interface-range disabled] me... [16:39:03] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Papaul) [16:42:22] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) Very nice. Welcome @RLazarus! 
I'll upload a change to code review to create your shell account. Could you create a SSH key pair and paste the public part here on ticke... [16:48:43] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) Here's the SSH public key: $ ssh-keygen -t ed25519 $ cat /home/rlazarus/.ssh/id_ed25519.pub ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPtpkeKO3QRiK4rGMkCX5u3T55PPWG... [16:48:55] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [16:49:05] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T1700). 
[17:00:38] no parsoid deploy today [17:03:42] 10Operations, 10ops-codfw, 10DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (10Papaul) 05Open→03Resolved Complete [17:04:48] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez firmware upgrade complete on both servers [17:05:19] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 59.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:08:31] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 79.82 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:15:37] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Papaul) This system is still showing no sign of any hardware issue in the log . if there is any hardware issue going on, it is no been logged. Can someone look at the OS level to see if there is... [17:21:18] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541 (10Papaul) @wiki_willy tis task has been open since 2015 and it is not moving and nothing has been decided on this. I mentioned that the AP that we used to have onsite is no longer working. 1- if w... [17:21:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10EYener) Hi @Nuria is there anything I need to do on my end at this point, or should I wait for the mentioned merge to hap... 
[17:23:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10Nuria) @EYener nothing for you to do @herron needs to merge the patch [17:26:05] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541 (10wiki_willy) Hi @Papaul - if there aren't any objections from anyone, I think we can just resolve this. You have your primary connection via MIFI and a backup option via CyrusOne. And since it... [17:30:42] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541 (10Papaul) 05Stalled→03Resolved [17:32:54] (03PS2) 10Herron: admin: add eyener to researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/542599 (https://phabricator.wikimedia.org/T234529) [17:33:28] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 3 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) Ping @ema and/or @BBlack. Should I make some patches? [17:34:27] (03PS1) 10Lucas Werkmeister (WMDE): extension-list: Load FlaggedRevs via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) [17:35:16] (03CR) 10Herron: [C: 03+2] admin: add eyener to researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/542599 (https://phabricator.wikimedia.org/T234529) (owner: 10Herron) [17:36:03] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10MoritzMuehlenhoff) Interesting, that fixed it! So we'll need to apply the same updates to puppetmaster1001/2001 as well, if you have time tomorrow I'll prepare... 
[17:37:31] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 3 others: Public EventGate endpoint for analytics event intake - https://phabricator.wikimedia.org/T233629 (10Ottomata) I'm inclined to just use the existent eventgate-analytics backend endpoint for now. We can consider setting up... [17:38:16] (03CR) 10Jforrester: "Oops. (This is technically T140852 not T87915, but we should have done this nearer the time.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [17:38:19] (03PS8) 10Dzahn: conftool/LVS: add new service parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [17:38:47] (03PS1) 10Elukey: Remove zookeeper terms from the Analytics filters [homer/public] - 10https://gerrit.wikimedia.org/r/543183 (https://phabricator.wikimedia.org/T217057) [17:39:05] (03PS2) 10Lucas Werkmeister (WMDE): extension-list: Load FlaggedRevs via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) [17:39:23] (03CR) 10Lucas Werkmeister (WMDE): "Thanks for the hint, I’ve added the second task ID. (I kept the first one because it still makes sense IMHO.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [17:42:33] (03CR) 10Lucas Werkmeister (WMDE): "Should it also be linked to T139800 “Update wmf-config/extension-list to use extension.json where available”?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [17:43:07] (03CR) 10D3r1ck01: "> On most other wikis (eg meta) they are already autopatrolled; I don't think its intentional, since ta requires more trust than autopatro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [17:43:44] (03PS4) 10Lucas Werkmeister (WMDE): Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) [17:44:44] (03CR) 10Bstorm: "Confirmed this includes the feed. wmcs-team-email is:" [puppet] - 10https://gerrit.wikimedia.org/r/542610 (owner: 10Bstorm) [17:44:58] (03PS2) 10Bstorm: keystone: change monitoring some details to email rather than paging [puppet] - 10https://gerrit.wikimedia.org/r/542610 [17:45:33] 10Operations, 10SRE-Access-Requests: Update prod SSH key for nathante - https://phabricator.wikimedia.org/T235526 (10crusnov) p:05Triage→03Normal a:03crusnov [17:46:17] (03CR) 10Lucas Werkmeister (WMDE): "This has been in the pipeline for far too long already. I’ll deploy it tomorrow, and anyone who wants to load the settings differently is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [17:47:39] (03CR) 10Ema: [C: 03+1] "Tested successfully in labs. Note however that a restart is required for the change to take effect. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/543022 (https://phabricator.wikimedia.org/T233274) (owner: 10Vgutierrez) [17:49:28] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [17:49:45] (03PS2) 10CRusnov: update ssh key for nathante [puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [17:51:12] !log cutting the branch for 1.35.0-wmf.2 T233850 [17:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:16] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [17:52:31] 10Operations, 10Traffic: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10ema) >>! In T234256#5573825, @Andrew wrote: >> Error while evaluating a Function Call, Could not find data item profile::trafficse... [17:55:17] (03CR) 10Jforrester: "Sorry, totally forgot about this. Yes, let's just do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [17:55:26] 10Operations, 10SRE-Access-Requests: Update prod SSH key for nathante - https://phabricator.wikimedia.org/T235526 (10crusnov) 05Open→03Resolved Done and done. 
[17:55:42] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, and 2 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10WDoranWMF) [17:56:33] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [17:57:41] (03CR) 10Lucas Werkmeister (WMDE): "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [17:58:28] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10crusnov) p:05Triage→03Normal [17:59:37] 10Operations, 10Traffic: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10Andrew) @ema, the missing key in my paste is a different key from the one you mentioned in your comment. You were talking about... [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T1800) [18:07:18] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:07:46] (03PS3) 10Bstorm: keystone: change monitoring some details to email rather than paging [puppet] - 10https://gerrit.wikimedia.org/r/542610 [18:08:48] 10Operations, 10Traffic: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10ema) >>! In T234256#5576473, @Andrew wrote: > @ema, the missing key in my paste is a different key from the one you mentioned in y... 
[18:09:00] (03PS1) 10DCausse: Sort debian/sha256sums explicitely [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543187 [18:09:02] (03PS1) 10DCausse: Bump experimental-highlighter to 5.6.4.1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543188 [18:09:04] (03PS3) 10Herron: logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) [18:09:06] (03PS1) 10Ema: ATS: set tls::parent_rules for labs [puppet] - 10https://gerrit.wikimedia.org/r/543186 (https://phabricator.wikimedia.org/T234256) [18:12:25] (03CR) 10Bstorm: [C: 03+2] keystone: change monitoring some details to email rather than paging [puppet] - 10https://gerrit.wikimedia.org/r/542610 (owner: 10Bstorm) [18:14:10] (03PS4) 10Herron: logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) [18:14:56] 10Operations, 10SRE-Access-Requests: Update prod SSH key for nathante - https://phabricator.wikimedia.org/T235526 (10Groceryheist) Thanks! 
[18:15:37] (03PS2) 10Ema: ATS: set tls::parent_rules for labs [puppet] - 10https://gerrit.wikimedia.org/r/543186 (https://phabricator.wikimedia.org/T234256) [18:16:13] 10Operations, 10serviceops: Jobrunners: allow to check that they are in sync with the etcd data - https://phabricator.wikimedia.org/T235488 (10crusnov) p:05Triage→03Normal [18:18:54] (03CR) 10Ema: [C: 03+2] ATS: set tls::parent_rules for labs [puppet] - 10https://gerrit.wikimedia.org/r/543186 (https://phabricator.wikimedia.org/T234256) (owner: 10Ema) [18:20:38] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [18:21:14] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) - added to special groups in Phabricator to see private tickets (acl*SRE and WMF/NDA) [18:23:33] (03PS1) 10Andrew Bogott: cloudbackup2001: move tools backup job to Thurdsay [puppet] - 10https://gerrit.wikimedia.org/r/543190 [18:24:51] (03PS2) 10Andrew Bogott: cloudbackup2001: move tools backup job to Thursday [puppet] - 10https://gerrit.wikimedia.org/r/543190 [18:26:30] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup2001: move tools backup job to Thursday [puppet] - 10https://gerrit.wikimedia.org/r/543190 (owner: 10Andrew Bogott) [18:28:12] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:32:52] is Gerrit struggling? 
it just gave me a weird error about proxies (sorry, i didn't copy it), and now it seems slow [18:33:23] (03PS1) 10Hashar: logstash: raise elasticsearch mapping limit [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) [18:33:25] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [18:33:49] seems maybe a bit slow [18:34:31] gerrit is going to die :\ [18:34:32] thcipriani: [18:34:35] ;-\ [18:34:56] branch cut is happening [18:35:04] should recover after that. Looks like it's on the last step [18:35:10] oic [18:35:29] (03CR) 10Hashar: "Just a guess based on documentation ;] https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping.html#mapping-limit-settings" [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [18:35:39] we're in process of migrating branching scripts to cause less stress, but...not quite there yet [18:36:58] 10Operations, 10Wikimedia-Mailing-lists: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10MarkTraceur) [18:37:32] oh branch cutting [18:38:15] (03PS1) 10RLazarus: Add rlazarus to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/543194 [18:38:17] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/543194 (owner: 10RLazarus) [18:38:41] rlazarus: :) i did not even expect the welcome bot. 
cool [18:38:43] 10Operations, 10Traffic, 10Patch-For-Review: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10ema) 05Open→03Resolved a:03ema Puppet works again on both instances after adding `tls::parent_rules`,... [18:39:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Analytics Access for Grant - https://phabricator.wikimedia.org/T235260 (10Nuria) @Reedy not sure if you are for real but while "technically" that might be a requirement I vote for proceeding w/o it in this case so as not to have to wait... [18:41:43] (03PS2) 10Dzahn: nagios_common: Add rlazarus to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/543194 (owner: 10RLazarus) [18:46:26] (03PS3) 10RLazarus: nagios_common: Add rlazarus to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/543194 (https://phabricator.wikimedia.org/T235215) [18:50:05] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [18:54:55] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [18:57:51] (03PS1) 10Dzahn: admins: add Reuven Lazarus to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/543197 (https://phabricator.wikimedia.org/T235215) [19:00:04] longma: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - American version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T1900). 
[19:01:08] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, and 2 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10ema) In ATS-land, URI path normalization can be configured by passing the right parameters to the script `normalize-path.lua`. See [[... [19:01:47] (03CR) 10RLazarus: [C: 03+1] admins: add Reuven Lazarus to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/543197 (https://phabricator.wikimedia.org/T235215) (owner: 10Dzahn) [19:02:52] (03CR) 10Dzahn: [C: 03+2] admins: add Reuven Lazarus to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/543197 (https://phabricator.wikimedia.org/T235215) (owner: 10Dzahn) [19:07:11] !log LDAP - adding user rzl to groups wmf and ops (T235215) [19:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:16] T235215: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 [19:13:49] (03PS1) 10Dzahn: icinga: give command permissions to RLazarus [puppet] - 10https://gerrit.wikimedia.org/r/543200 (https://phabricator.wikimedia.org/T235215) [19:15:26] (03CR) 10RLazarus: [C: 03+2] icinga: give command permissions to RLazarus [puppet] - 10https://gerrit.wikimedia.org/r/543200 (https://phabricator.wikimedia.org/T235215) (owner: 10Dzahn) [19:18:20] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543202 [19:18:22] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543202 (owner: 10Jeena Huneidi) [19:18:43] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) [19:19:15] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543202 (owner: 10Jeena Huneidi) 
[19:19:28] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [19:20:32] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.2 refs T233850 [19:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:36] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [19:24:45] (03CR) 10EBernhardson: [C: 03+1] "This is fine, and what we have done for wikidata within search to handle an explosion of fields. The primary caveat is that every cluster " [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [19:33:54] (03PS1) 10Dzahn: admins: add shell account for Reuven Lazarus [puppet] - 10https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) [19:37:29] (03CR) 10Dzahn: [C: 03+2] nagios_common: Add rlazarus to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/543194 (https://phabricator.wikimedia.org/T235215) (owner: 10RLazarus) [19:37:38] (03PS4) 10Dzahn: nagios_common: Add rlazarus to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/543194 (https://phabricator.wikimedia.org/T235215) (owner: 10RLazarus) [19:37:44] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1669 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:38:16] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1669 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:42:26] !log 
upgrade restbase1016-a to cassandra 3.11.-4 -- T200803 [19:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:31] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [19:44:06] actual wikitech seems to work fine [19:44:40] but https://wikitech-static.wikimedia.org/wiki/ is broken [19:45:47] andrewbogott: ^ [19:47:30] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static https://wikitech.wikimedia.org/wiki/Wikitech-static [19:47:53] (03PS2) 10Dzahn: admins: add shell account for Reuven Lazarus [puppet] - 10https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) [19:47:54] yeah, the wikitech-static errors are me trying to update [19:48:01] !log upgrade restbase1016-b to cassandra 3.11.-4 -- T200803 [19:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:05] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [19:48:06] andrewbogott: ack! [19:48:08] but apparently we're running custom php packages in prod? 
[19:48:11] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.2 refs T233850 (duration: 27m 39s) [19:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:15] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [19:48:44] ACKNOWLEDGEMENT - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.117 second response time andrew bogott Andrew working on updates https://wikitech.wikimedia.org/wiki/Wikitech-static [19:48:44] ACKNOWLEDGEMENT - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1669 bytes in 0.276 second response time andrew bogott Andrew working on updates https://wikitech.wikimedia.org/wiki/Wikitech-static [19:49:09] andrewbogott: I think it's just backported from upstream mostly.... [19:49:13] What's up? [19:49:34] Reedy: just need to figure out a way to get those packages + all dependencies onto wikitech-static [19:49:49] that, or rebuild wikitech-static on Buster but apparently rackspace isn't providing a buster base image yet :( [19:50:15] So this means that third party users can't run 1.34? [19:50:23] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [19:50:27] deb http://apt.wikimedia.org/wikimedia stretch-wikimedia component/php72 [19:50:27] deb-src http://apt.wikimedia.org/wikimedia stretch-wikimedia component/php72 [19:51:15] can an out-of-prod server use those repos? 
[19:51:42] Yup [19:52:11] !log upgrade restbase1016-c to cassandra 3.11.-4 -- T200803 [19:52:14] * andrewbogott tries it [19:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:04] !log upgrade restbase2011-{a,b,c} to cassandra 3.11.-4 -- T200803 [19:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:07] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [19:55:13] Reedy: care to remind me what other file/config I need to sign this? Or convince apt to use it unsigned? [19:55:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) [19:58:16] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, and 2 others: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10Gilles) [19:58:31] andrewbogott: https://wikitech.wikimedia.org/wiki/APT_repository#Security [19:58:46] thx [20:05:50] Reedy: now I'm hitting [20:05:51] "InvalidArgumentException from line 490 of /srv/mediawiki/w/includes/libs/rdbms/database/Database.php: Wikimedia\Rdbms\Database::getClass no viable database extension found for type 'mysql'" [20:06:39] OOI, why are you upgrading MW? :P [20:06:56] I'm guessing it's missing php-mysql/php7.2-mysql [20:07:48] Reedy: I have a monitor set to alert when MW on wikitech-static falls behind the production version. I've found that if they get out of sync then random content starts to break [20:07:54] (03PS1) 10Jeena Huneidi: group0 wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543208 [20:07:56] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543208 (owner: 10Jeena Huneidi) [20:08:00] Looks like php7.2-mysql was it! 
[20:08:49] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543208 (owner: 10Jeena Huneidi) [20:08:51] (03PS1) 10Andrew Bogott: Settings.php: move more extensions to wfLoadExtension [wikitech-static] - 10https://gerrit.wikimedia.org/r/543209 [20:09:08] (03CR) 10CDanis: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) (owner: 10Dzahn) [20:09:44] well, part of it [20:10:16] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.2 refs T233850 [20:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:20] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [20:11:05] (03CR) 10CDanis: [C: 03+1] swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [20:13:24] jouncebot: now [20:13:25] No deployments scheduled for the next 2 hour(s) and 46 minute(s) [20:13:28] jouncebot: next [20:13:28] In 2 hour(s) and 46 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T2300) [20:13:52] hauskater: Right now is the train. :-) [20:14:23] I see :) [20:14:28] Got a ticket :P [20:15:36] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28527 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:16:04] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28536 bytes in 0.421 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:16:15] Reedy: when are tags like 'remotes/origin/REL1_34' created? I expected to see REL1_35 since we're running 1.35 in prod but that must come later? 
[20:16:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:16:31] That's a branched [20:16:36] *branch [20:16:45] the REL1_35 one won't be created for months [20:16:50] Yup. [20:16:51] ok [20:17:19] REL1_34 is created when the release (beta) is being created, not when the alphas start. [20:17:55] So, that is partly my question… is '1.35.0-wmf.1' an alpha release? [20:18:40] * andrewbogott should probably just read the release schedule docs [20:18:47] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541 (10Dzahn) 05Resolved→03Declined [20:21:31] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Settings.php: move more extensions to wfLoadExtension [wikitech-static] - 10https://gerrit.wikimedia.org/r/543209 (owner: 10Andrew Bogott) [20:24:31] andrewbogott: Yes. No production code should run anything newer than REL1_33. [20:24:54] andrewbogott: Unless someone is prepared to baby-sit it a lot. [20:25:09] looks like translate is working again on group0 Nikerabbit :) [20:27:22] James_F: That makes sense. https://en.wikipedia.org/wiki/Special:Version says '1.35.0-wmf.1 (f8ed243)' which I expected to be on a branch called REL1_35. So does that mean that when you say 'No production code…' you're not referring to /our/ production code? Or do version numbers not work like I think they do? [20:28:50] 1.35.0-wmf.1 is the alpha branch https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/wmf/1.35.0-wmf.1 [20:29:30] "No production code" means "any who isn't Wikimedia". Wikitech-static should definitely not be running alpha code as we don't watch it (and don't have access to fix it). [20:29:48] ok, that makes sense [20:30:07] And it's not worth burning SRE time to get minor weekly fixes. [20:30:13] andrewbogott: all wmf branches are snapshots of 'master'. 
usually N commits forked starting with a commit that sets the version to "wmf.X" in DefaultSettings.php, and then a number of cherry picks throughout the week. [20:30:15] Especially as so many of them wreck prod. [20:30:34] I'm pretty sure that every time I've done this before there's been a REL1_XX branch that corresponded with what was running on enwiki. But maybe that's just a testament to how long I've waited to upgrade, historically. [20:30:58] Anyway — this all makes sense now. I was just confused about seeing a thing you'd just declared 'alpha' running on WMF production. [20:31:01] thanks [20:31:21] Yeah, our release strategy is a lot more conservative than our deploy strategy. [20:31:34] Think of production as unstable. ;-) [20:32:22] "MediaWiki 1.35 is an alpha-quality development branch, and is not recommended for use in production." [20:32:29] lolol [20:32:39] Reedy: Almost like I wrote that line. :-) [20:33:30] * hauskater giggles [20:35:58] I think my first draft was "if you run this in production, you only have yourself to blame". [20:36:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-4/0/28 { - description restbase1012; - } - ge-4/0/29 { - description restbase101... 
[20:36:38] James_F: We need a pic for that [20:37:00] https://commons.wikimedia.org/wiki/File:Fire_inside_an_abandoned_convent_in_Massueville,_Quebec,_Canada.jpg [20:38:37] I was hoping something from Heartbreak Ridge [20:39:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) [20:42:05] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10Papaul) Yes we can do 2001 tomorrow [20:43:09] 10Operations, 10Wikimedia-Mailing-lists, 10I18n, 10RTL: Make pipermail show RTL emails better by emitting dir=auto - https://phabricator.wikimedia.org/T235458 (10crusnov) p:05Triage→03Normal [20:43:37] 10Operations, 10Wikimedia-Mailing-lists, 10I18n, 10RTL: Make pipermail show RTL emails better by emitting dir=auto - https://phabricator.wikimedia.org/T235458 (10crusnov) I did a little digging but I don't immediately see where this is configured. Anyone more experienced with Mailman should look at this. 
[20:44:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:45:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:47:24] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, 10CPT Initiatives (Multi-DC Echo Notification Storage): Dashboards for monitoring of echostore - https://phabricator.wikimedia.org/T235558 (10Eevans) [20:47:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:48:02] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, 10CPT Initiatives (Multi-DC Echo Notification Storage): Dashboards for monitoring of echostore - https://phabricator.wikimedia.org/T235558 (10Eevans) p:05Triage→03Normal [20:49:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:49:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:50:06] James_F: or https://commons.wikimedia.org/wiki/File:Nuclear_Blast_Animation_Blinding_Light.gif [20:50:20] (03PS1) 10Eevans: [WIP] echostore: create staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/543212 (https://phabricator.wikimedia.org/T234376) [20:51:04] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556151793(79.33333333333333gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [20:51:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1065 - 
https://phabricator.wikimedia.org/T227560 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:52:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:53:30] (03CR) 10Eevans: "Looking for a quick sanity test before I proceed..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/543212 (https://phabricator.wikimedia.org/T234376) (owner: 10Eevans) [20:54:12] 10Operations, 10ops-eqiad, 10decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:55:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:56:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:57:30] 10Operations, 10Wikimedia-Mailing-lists: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10crusnov) p:05Triage→03Normal a:03crusnov Question, do you need an alias for the old list name? Thanks [20:58:07] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:58:31] hauskater: That's only if you install Semantic MediaWiki. 
;-) [20:58:38] lol [20:58:46] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission analytics1003 - https://phabricator.wikimedia.org/T206524 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [20:59:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:01:06] 10Operations, 10ops-eqiad, 10decommission: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:02:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:03:15] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:03:34] longma: All looks good with the train. Can I sling out a config patch? [21:04:30] James_F: go ahead [21:04:34] Thanks. 
[21:04:44] (03PS4) 10Jforrester: build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 [21:04:54] (03CR) 10Jforrester: [C: 03+2] build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 (owner: 10Jforrester) [21:05:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:05:46] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:05:55] (03Merged) 10jenkins-bot: build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 (owner: 10Jforrester) [21:06:09] (03CR) 10Jforrester: [C: 03+2] Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [21:06:37] (03CR) 10Jforrester: "Argh, no, this is undeployable. One mo." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [21:06:54] 10Operations, 10ops-eqiad, 10decommission: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:07:49] (03CR) 10Cwhite: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/543127 (owner: 10Jbond) [21:08:04] 10Operations, 10ops-eqiad, 10decommission: Decommission ms-be1027 - https://phabricator.wikimedia.org/T233289 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [21:08:34] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [21:09:50] (03CR) 10RLazarus: [C: 03+1] admins: add shell account for Reuven Lazarus [puppet] - 10https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) (owner: 10Dzahn) [21:10:32] (03PS5) 10Jforrester: CommonSettings: Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [21:10:34] (03PS1) 10Jforrester: Write wgScoreFileBackend and wgScorePath directly, not via CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543217 [21:10:36] (03PS1) 10Jforrester: InitialiseSettings: Stop writing wmgScoreFileBackend and wmgScorePath, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543218 [21:11:03] (03CR) 10Jforrester: [C: 03+2] Write wgScoreFileBackend and wgScorePath directly, not via CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543217 (owner: 10Jforrester) [21:11:53] (03Merged) 10jenkins-bot: Write wgScoreFileBackend and wgScorePath directly, not via CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543217 (owner: 10Jforrester) [21:12:33] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [21:13:27] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Write wgScoreFileBackend and wgScorePath directly, not via CommonSettings (duration: 01m 00s) [21:13:30] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:44] (03Merged) 10jenkins-bot: CommonSettings: Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (https://phabricator.wikimedia.org/T119117) (owner: 10Lucas Werkmeister (WMDE)) [21:14:22] (03CR) 10Jforrester: [C: 03+2] InitialiseSettings: Stop writing wmgScoreFileBackend and wmgScorePath, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543218 (owner: 10Jforrester) [21:15:07] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Stop using wmg variables for Score extension (duration: 01m 01s) [21:15:07] (03Merged) 10jenkins-bot: InitialiseSettings: Stop writing wmgScoreFileBackend and wmgScorePath, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543218 (owner: 10Jforrester) [21:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:21] (03PS9) 10Jforrester: MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 [21:16:38] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: InitialiseSettings: Stop writing wmgScoreFileBackend and wmgScorePath, never read (duration: 00m 59s) [21:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:19] (03CR) 10Jforrester: [C: 03+2] MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 (owner: 10Jforrester) [21:18:02] (03Merged) 10jenkins-bot: MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 (owner: 10Jforrester) [21:18:27] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@cdfa545]: Media: Fix TypeError when processing pages with only Mathoid images (T235408) 
[21:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:31] T235408: Cannot read property filter of undefined - https://phabricator.wikimedia.org/T235408 [21:19:05] /32// [21:19:52] Live in mwdeploy1001. [21:20:15] (03PS4) 10Jforrester: CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 [21:22:16] (03PS5) 10Jforrester: CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 [21:22:31] Err. mwdebug1001, even. [21:22:34] 10Operations, 10Arc-Lamp, 10Performance-Team, 10serviceops, 10Patch-For-Review: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Krinkle) 05Open→03Resolved [21:23:26] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2590 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:23:26] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2591 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:23:50] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2590 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:23:52] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2591 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:24:03] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@cdfa545]: Media: Fix TypeError when processing pages with only Mathoid images (T235408) (duration: 05m 35s) [21:24:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:06] T235408: Cannot read property filter of undefined - https://phabricator.wikimedia.org/T235408 [21:25:04] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:25:04] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76026 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:25:28] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:25:28] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76026 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:25:59] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Provide getCachableMWConfig() which doesn't rely on wgConf (duration: 01m 00s) [21:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:34] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 (owner: 10Jforrester) [21:27:48] (03CR) 10Krinkle: "I don't know what this patch does exactly, but wgConf must remain populated the way it has been thus far as extensions and core depend on " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 (owner: 10Jforrester) [21:27:56] James_F: please ack^ [21:29:29] (03CR) 10Jforrester: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 (owner: 10Jforrester) [21:29:44] Krinkle: I've killed the C+2, but I don't understand your concern? 
[21:29:56] wgConf->get('MyName', 'dewiki') whilst on enwiki needs to still work. [21:30:11] It looks like this breaks that. [21:30:14] (03CR) 10Jforrester: CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 (owner: 10Jforrester) [21:30:18] No? [21:31:16] James_F: OK, that line exists but does that still get called? And does wgConf still get told about tags and db lists and stuff [21:31:26] Yes, and yes, and yes. [21:31:30] Looks like not as it is now only being given to the faux siteconfig. [21:31:48] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/538078/9/multiversion/MWConfigCacheGenerator.php [21:32:15] This gives $lang etc. to the MWConfig only, not to wgConf etc. [21:32:45] $lang is still exposed in CommonSettings. [21:32:59] `list( $site, $lang ) = $wgConf->siteFromDB( $wgDBname );` etc. [21:33:19] that's for current wiki that doesn't matter. [21:33:47] Most of the users in https://codesearch.wmflabs.org/deployed/?q=%5CbwgConf%5Cb&i=nope&files=&repos= look pretty suspect, honestly. [21:34:33] Anyway, I don't have time right now. I'd prefer we hold off, also because it doesn't seem required for the overall end goal, so might be a misunderstanding somewhere. I don't think we can feasibly test all the angles of this without further action I don't have time for right now today. [21:34:49] Fine. [21:35:46] But I really don't want to move forward with the static generation of config using different code from what prod uses. [21:35:52] That's just asking for disasters. 
[21:36:57] (03PS1) 10Papaul: DNS: Remove mgmt DNS for restbase100[7-9] and restbase101[0-5] [dns] - 10https://gerrit.wikimedia.org/r/543222 [21:39:54] (03PS19) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [21:40:45] Krinkle: Can I grab some of your time next week (in person)? [21:44:29] I'm flying on Sunday, but might have some time before that. Pretty packed with fawg/techcom/quarterly stuff though. I don't want to rush this. But yeah, let's meet and see what progress we can make to unblock you at least for the next steps. [21:46:00] Wait, flying where? I'm in London for Monday/Tuesday/Wednesday… [21:46:22] Boston, next Sunday not this one [21:48:04] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (60995 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:51:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10Papaul) [21:53:07] James_F: one thing to think about is how we'd populate wgConf in the yaml world, given we'd want to read neither 100s of json files nor 3x100s yaml files at run-time. Perhaps another expansion is needed, not sure. Also thinking whether we could get away with reading 3 json files for "current wiki" settings, which would reduce merge conflicts and map more closely to yaml (and less git bloat), could get per-wiki diffing in CI by [21:53:07] other means (like JJB does), just some random ideas to think about. [21:54:15] Krinkle: See my patch. :-) [21:55:03] (Spoiler: There's no reading of YAML at run time, and reading of exactly one JSON file at run time. The whole point is reducing run cost.) 
[21:55:39] (03PS9) 10Dzahn: conftool/LVS: add new service parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [21:55:52] Yes, I understand the intent. but if we only read cache/enwiki.json, how do we have wgConf['wgCanonicalServer']['dewiki'] populated ;-) [21:55:58] anyway, stretching time - gotta go! [21:56:03] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, and 2 others: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) [21:56:17] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, and 2 others: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) 05Open→03Resolved Complete [21:57:52] (03PS10) 10Dzahn: conftool/LVS: add new service parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [22:10:08] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [22:12:02] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (10wiki_willy) a:03Jclark-ctr @Jclark-ctr - looks like this one is barely out of warranty. Before we order the part though, can you doublecheck that it's not something si... [22:12:17] (03Abandoned) 10Dzahn: add service records for new parsoid-php service [dns] - 10https://gerrit.wikimedia.org/r/542566 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [22:15:02] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops, and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) 05Open→03Resolved The work encompassed by this task (updating the open_nsfw service) is com... 
[22:15:19] (03Abandoned) 10Dzahn: hhvm: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [22:18:44] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for restbase100[7-9] and restbase101[0-5] [dns] - 10https://gerrit.wikimedia.org/r/543222 (owner: 10Papaul) [22:29:01] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for restbase100[7-9] and restbase101[0-5] [dns] - 10https://gerrit.wikimedia.org/r/543222 (owner: 10Papaul) [22:31:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) [22:31:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Papaul) 05Open→03Resolved complete [22:41:37] !log manually running `extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php` T230245 [22:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:41] T230245: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191015T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:13:33] (03PS11) 10Dzahn: conftool: add parsoid-php servers to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [23:21:06] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 2 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) Tagging #media-storage due to the vague swif... [23:21:34] (03CR) 10CDanis: "Just to make sure I understand: The DNS discovery records are going to be controlled by the existing 'parsoid' service, right? (And same " [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [23:22:41] (03PS1) 10Dzahn: LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) [23:23:05] (03PS12) 10Dzahn: conftool: add parsoid-php service to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [23:24:09] (03CR) 10jerkins-bot: [V: 04-1] LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [23:27:12] (03CR) 10Dzahn: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [23:29:55] (03PS2) 10Dzahn: LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) [23:35:33] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) [23:56:43] 10Operations, 10Wikimedia-Mailing-lists, 10I18n, 10RTL: Make pipermail show RTL emails better by emitting dir=auto - https://phabricator.wikimedia.org/T235458 (10crusnov) After discussing this a bit it 
looks like it's not currently possible with pipermail without upstream changes (willing to be wrong here...